In April, Google experienced a significant cloud outage that barely made the news, even though it was likely the most widespread outage to hit a major public cloud to date. The lack of coverage is strange given how closely industry observers such as Brian Krebs watch events like this. The more recent Salesforce service outage seems to have received more attention. Although Google seems to have gotten a “pass” this time, the glitch is a reminder that tech players large and small continue to struggle with software robustness.
Google Compute Engine was down for a full 18 minutes around the 7 o’clock hour Pacific Time on April 11, disconnecting all users in all regions. The root cause was a network failure. Network outages appear to be an ongoing challenge for Google, and this one was the biggest yet.
Google published a post-mortem analysis stating that it had a number of safeguards in place to prevent this from happening again. But critics argue that these safeguards should have been tested – individually and systemically – to prevent the glitch from happening in the first place. Without high levels of redundancy and careful attention to code quality, it is difficult to prevent any period of network downtime.
A closer look at the details of the outage reinforces the widespread need for improved network management, application portfolio management and software robustness.
The initial cause of the incident was improperly routed inbound traffic to Google Compute Engine, the result of a configuration change to an idle IP block that didn't propagate correctly. VPN and L3 network load balancer services also went down.
The management software attempted to revert to the previous configuration, but a failsafe measure triggered a previously uncaught bug. All of the IP blocks were removed from the configuration, and another incomplete configuration was pushed. A second bug blocked a canary measure that might have corrected the push, so even more IP blocks were dropped. In total, more than 95% of all inbound traffic was dropped for the duration of the outage.
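The failure chain above (an incomplete configuration, a canary step that did not stop the push, and a rollback that made things worse) can be illustrated with a small sketch. This is not Google's tooling. The names Config, validate and rollout, the site list, and the 25% removal threshold are all hypothetical, chosen only to show a push that is checked before it starts and that fails closed when the canary check reports a problem or crashes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class Config:
    """Hypothetical network configuration: a version and the IP blocks it announces."""
    version: int
    ip_blocks: FrozenSet[str]


def validate(candidate: Config, current: Config, max_removed: float = 0.25) -> List[str]:
    """Return a list of problems; an empty list means the candidate looks sane.

    The 25% threshold is an arbitrary assumption for this sketch.
    """
    problems = []
    if not candidate.ip_blocks:
        problems.append("candidate announces no IP blocks")
    removed = current.ip_blocks - candidate.ip_blocks
    if current.ip_blocks and len(removed) / len(current.ip_blocks) > max_removed:
        problems.append(
            "candidate removes %d of %d IP blocks" % (len(removed), len(current.ip_blocks))
        )
    return problems


def rollout(
    candidate: Config,
    current: Config,
    sites: List[str],
    canary_check: Callable[[str, Config], bool],
) -> Dict[str, Config]:
    """Push candidate to one canary site first and continue only if it is healthy.

    Returns the configuration each site ends up running.
    """
    state = {site: current for site in sites}
    # Refuse to start if the candidate fails static validation.
    if not sites or validate(candidate, current):
        return state

    canary = sites[0]
    state[canary] = candidate
    try:
        healthy = canary_check(canary, candidate)
    except Exception:
        # Fail closed: a crashing check counts as an unhealthy canary.
        healthy = False

    if not healthy:
        state[canary] = current  # roll the canary back to the known-good config
        return state

    for site in sites[1:]:
        state[site] = candidate
    return state
```

With three hypothetical sites, calling rollout with a candidate that announces no IP blocks leaves every site on the current configuration, as does a candidate whose canary check returns False or raises an exception. Only a candidate that passes both validation and the canary check reaches the remaining sites. The point of the design is that a bug in the checking path stops the push instead of letting it through.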
At CAST, we hear similar stories far too often, and they exemplify how important it is to effectively measure and monitor software robustness to prevent downtime and breaches.
Why You Should Measure Software Robustness
Without an effective measure or benchmark of application performance, it is difficult to fully assess software robustness. As seen in the Google outage, fact-based knowledge of software performance can enable IT to act proactively instead of reactively – giving them the tools needed to keep applications up and running 24/7.
Robustness is a measure of software strength and resilience. It indicates the probability that an application will incur defects, corrupt data or fail catastrophically in a production environment. Robustness is also known as reliability, and CAST measures it in accordance with industry best practices from CISQ and OMG.
Software robustness has a direct impact on both customer satisfaction and business continuity. Unstable applications can expose a company to significant financial risks that range from revenue loss to litigation. Understanding your software’s probability of failure helps mitigate risk and keep customer information safe. Recent CRASH research has also shown empirical evidence that robustness is highly correlated with application security.
To learn more about how you can improve software security and create more reliable applications, check out our industry CRASH Reports.