When the software fails, first blame the hardware


We’ve made it a point on our blog to highlight the fact that software glitches in important IT systems -- like those at NatWest and Google Drive -- can no longer be written off as “the cost of doing business” in this day and age. Interestingly, we’re starting to see another concerning trend: more and more crashes blamed on faulty hardware or network problems, while the software itself is ignored. It’s funny that incident counts can differ by more than a factor of 10 between applications with similar functional characteristics. Is it possible that the robustness of the software inside the applications has something to do with apparent hardware failures? I think I see a frustrated data center operator reading this and nodding violently.

The business has this perception that the guys running the databases, the server racks, and the private cloud are 100 percent sure they know exactly what’s running on their servers. In reality, they get a woefully incomplete view of how the system works. At every release, the data center gets a fresh load of executables from the software builds. These may be transactional applications, web services, or batch jobs that get dropped on various servers. In any case, it’s impossible to take a look inside an executable. It’s doubly impossible to know how different executables will interact with the CPU, the network, or each other.

The problem is exacerbated with service-oriented architectures wrapped around legacy systems. If the SOAs -- built for distributed systems or cloud applications -- are invoked inefficiently, then the mainframe will end up hogging a lot of extra CPU cycles and network bandwidth. From the vantage point of the data center, it feels like an ‘invisible hand’ is suddenly making the network and the hardware thrash.
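To make "invoked inefficiently" a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `LegacyBackend` is a hypothetical stand-in for a wrapped legacy system, and the per-call and per-record costs are made-up numbers, not measurements. It only shows how the same 1,000 records can cost the back end very different amounts of work depending on whether the service layer calls it once per record or in batches.

```python
from dataclasses import dataclass


@dataclass
class LegacyBackend:
    """Hypothetical wrapped legacy system; it tallies work instead of doing any."""
    per_call_overhead_ms: float = 5.0  # assumed fixed cost of every invocation
    per_record_cost_ms: float = 0.2    # assumed cost of each record handled
    calls: int = 0
    busy_ms: float = 0.0

    def fetch(self, record_ids):
        # Every invocation pays the fixed overhead, however few records it asks for.
        self.calls += 1
        self.busy_ms += self.per_call_overhead_ms + self.per_record_cost_ms * len(record_ids)
        return {rid: f"record-{rid}" for rid in record_ids}


def chatty_client(backend, record_ids):
    """One service call per record: the inefficient invocation pattern."""
    result = {}
    for rid in record_ids:
        result.update(backend.fetch([rid]))
    return result


def batched_client(backend, record_ids, batch_size=100):
    """Same data, requested in batches of batch_size records."""
    result = {}
    for start in range(0, len(record_ids), batch_size):
        result.update(backend.fetch(record_ids[start:start + batch_size]))
    return result


def compare(n_records=1000):
    """Run both clients against fresh back ends and return their tallies."""
    ids = list(range(n_records))
    chatty, batched = LegacyBackend(), LegacyBackend()
    # Both clients must return identical data; only the back-end load differs.
    assert chatty_client(chatty, ids) == batched_client(batched, ids)
    return chatty, batched


if __name__ == "__main__":
    chatty, batched = compare()
    print(f"chatty:  {chatty.calls} calls, {chatty.busy_ms:.0f} ms of back-end time")
    print(f"batched: {batched.calls} calls, {batched.busy_ms:.0f} ms of back-end time")
```

Under these assumed costs, the functional result is identical in both cases, so nothing looks wrong from the application's side. The only place the difference shows up is in call counts and back-end load, which is exactly the part the data center sees without knowing why.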

And that’s exactly what data center admins are missing: a view of how their specific hardware is interacting with the entire software infrastructure. I’ve heard a software architect talk very eloquently about how easy it is for the software guys to throw the network and data center team under the bus. “The Enterprise Architects can put on a nice suit and talk to the business in their language. So when we tell them the incident that just occurred is because the operations team doesn’t know how to manage the network, they believe us. But in reality, we know our software has no horizontal scalability, no matter how much network bandwidth we throw at it.” How about another hand gesture for that? [Image: When software fails, first blame the hardware]

We think it’s time for IT leaders to take a deeper look at their data centers to determine how their legacy software is using available hardware. Ultimately, IT leaders need to recognize that stuffing more CPUs into data center racks won’t solve every problem. The software systems at play are highly complex -- spanning states and continents -- and running at speeds we can barely comprehend. By wrapping legacy applications in architectures designed for the cloud, we’re only masking those applications’ performance deficiencies for a time. But to truly fix performance and stability woes in the data center, we need to do the hard work of understanding how our software uses its underlying hardware.

Lev Lesokhin, EVP, Strategy and Analytics at CAST
Lev spends his time investigating and communicating ways that software analysis and measurement can improve the lives of apps dev professionals. He is always ready to listen to customer feedback and to hear from IT practitioners about their software development and management challenges. Lev helps set market & product strategy for CAST and occasionally writes about his perspective on business technology in this blog and other media.