With Black Friday and Cyber Monday around the corner, flashbacks to last year’s website outages at Lululemon, J.Crew, Ulta, Lowe’s, Best Buy and Walmart haunt retailers and shoppers alike. Even Amazon demonstrated that it is not impervious to peak shopping traffic when it went down for several hours during Prime Day this summer. Millions of dollars are at stake for every minute a website is down (Walmart estimated a whopping $9M loss of revenue during its 150-minute outage in 2018). Having spent several years working at a major online retailer and now deploying CAST Application Intelligence Platform (AIP) at a New York-based online retailer, I can relate to the anxiety and stress that overcome IT teams at this time of year.
Black Friday Outages: The Most Commonly Cited Reason Is “Network Failure”
Mainstream media usually cite increased traffic during Black Friday sale events as the reason for retail website outages, and retailers immediately point to infrastructure or third-party components. In fact, in a 2019 survey on outages by LogicMonitor, the most commonly cited reason for outages was “network failure”. The automatic response is to blame a traffic surge, but network failures and infrastructure outages are usually symptoms of something much deeper.
Is Infrastructure the Real Reason Behind Black Friday Outages?
After all, it must be some IT ops person who forgot to increase the random access memory (RAM) before Thanksgiving, right? Not really. Naming infrastructure as both the source of and the solution to the problem vastly oversimplifies how a website works. That kind of thinking ignores the fact that the same action can be coded in many different ways, some of which consume far fewer resources than others. It’s like saying that regardless of how you approach a problem, the result will always be the same. Well, tell that to the first person who carved a rock into a wheel!
Avoid Black Friday Outages by Focusing on Application Architecture and Quality
According to the survey by LogicMonitor mentioned earlier, IT leaders think more than 50% of incidents are easily avoidable. Let’s take a look at how infrastructure fixes may be hiding underlying software architecture and code issues.
Keep Your Modern Architecture Modern
Transforming your software into a service-oriented architecture to create elasticity for your infrastructure is a great idea. But you’re not done once your new architecture is implemented. This is exactly the approach taken by the online retailer I mentioned earlier in this post. To launch a newly designed e-commerce platform, they rearchitected a monolithic application into a microservices platform. They did see a reduction in infrastructure strain, but they were not satisfied. They are now using CAST AIP to ensure that each web service is as optimized for performance as possible, and that performance does not degrade with incremental releases.
Now, even if your architecture is not service-oriented, there are still many ways to optimize your code to reduce strain on your infrastructure. Read on!
- Free Your Resources, Return Your Connections
The holiday season is about sharing! So, make sure you return database connections to the pool. Unreturned connections can’t be reused, and they will strain your infrastructure. In fact, leaked resources such as memory and connections are one of the most common sources of outages reported by our customers.
This all comes down to how a few lines of code are written. When software uses a resource like a database or a file, it needs to open a connection, and that connection requires RAM (random access memory) to operate. Most of the time there is a pool of connections managed by the platform; other times, the connection is made directly. If the connection is not closed after executing a query or pulling a file into the frontend, it remains open and keeps consuming RAM. Once enough open connections use up all the RAM, that server will crash, guaranteed, whether it runs in the cloud or on-premises.
Many IT ops people will tell you that open connections will be closed periodically by a garbage collector, but during times of peak volume, that garbage collector won’t be able to keep up. Without well-structured code that closes every connection it opens, you are setting yourself up for disaster.
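To make the point concrete, here is a minimal Python sketch of the "always return the connection" pattern. It is an illustration under stated assumptions, not code from any retailer's platform: `TinyPool`, `price_lookup` and the in-memory SQLite database are all hypothetical names chosen for the example, and a real platform would normally supply its own pool.

```python
import queue
import sqlite3
from contextlib import contextmanager


class TinyPool:
    """Hypothetical fixed-size connection pool, for illustration only."""

    def __init__(self, size, database=":memory:"):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def connection(self, timeout=1.0):
        # Raises queue.Empty if every connection is checked out (pool exhausted).
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Runs on success AND on error, so the connection is always returned.
            self._idle.put(conn)

    def idle_count(self):
        return self._idle.qsize()


def price_lookup(pool, fail=False):
    """Borrow a connection, run a query, and give the connection back."""
    with pool.connection() as conn:
        if fail:
            raise RuntimeError("simulated failure mid-request")
        return conn.execute("SELECT 1").fetchone()[0]
```

The design choice that matters is the `finally` clause: even when a request fails halfway through, the connection goes back to the pool instead of staying checked out until the pool, and eventually the server, runs dry.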
- Stars Work Well as Tree Toppers, Not So Much for SQL Statements
We love developers. Developers are the reason why we have a world that is so intuitively and naturally digital. They work all night just so you can have a button that recesses just right when you click on it. However, the one thing that developers are notoriously bad at optimizing is SQL statements. Yes, I’ll say it. Go ahead, tweet at me.
If I had a dime for every time I’ve heard a developer say to a database administrator (DBA), “There is no way this SELECT statement can be written more efficiently”... But there is almost always a better way, and finding it takes talking to the database people. Leaders should require DBAs to make themselves (more) available to development teams so they can review and optimize SQL statements together this holiday season. And, if possible, have DBAs review all existing SQL statements just to be sure! Or, if you don’t have a surplus of DBAs, use an enterprise-grade static analyzer that understands interactions between code and database resources to help accelerate your efforts, and get developers and DBAs on the same page quickly.
By the way, the danger of enterprise IT teams working in silos spreads beyond performance and stability: it can impact security and resiliency as well. Read about how system-level analysis can break down the silos in another one of my blog posts, “States of Chaos: Can a Hacker Steal Your Agency’s Wheels?”
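As one small illustration of why the star in the heading matters, the Python sketch below asks SQLite for its query plan for a `SELECT *` and for a query that names only the columns it needs. The `products` table, its columns and the index name are assumptions made up for this example, and other database engines report plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical catalog table; the wide description column is not in the index.
conn.execute(
    "CREATE TABLE products ("
    "id INTEGER PRIMARY KEY, sku TEXT, price REAL, description TEXT)"
)
conn.execute("CREATE INDEX idx_products_sku_price ON products (sku, price)")


def query_plan(sql):
    """Return SQLite's plan text for a query with one bound parameter."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, ("SKU-1",)).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)


# SELECT * needs every column, so SQLite must also visit the table rows.
star_plan = query_plan("SELECT * FROM products WHERE sku = ?")
# Naming only indexed columns lets SQLite answer from the index alone.
narrow_plan = query_plan("SELECT sku, price FROM products WHERE sku = ?")

print(star_plan)
print(narrow_plan)
```

In this setup the narrow query can be served by a covering index, while the star query cannot. That is exactly the kind of difference a developer and a DBA can spot in a few minutes when they read a plan together.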
- Santa Checks His List Twice, You Should Check Your Code Four Times
Even if you’ve implemented automated rollback procedures, you are better off dealing with flaws before they become incidents. Checking code several times before deployment is just good practice. You don’t need to check everything! Just check the things that you know could cause an outage (such as not closing a database connection!).
Do it when the code is checked in, when it is merged to a branch, when it is merged to trunk, and one more time before it becomes a release candidate. Do it religiously, and do it with an enterprise-grade static analyzer that no one can bypass. We realize it can be inconvenient at times, but no one complains when the shopping season goes off without any issues.
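To give a feel for what one targeted check looks like, here is a toy Python sketch that flags `connect(...)` calls that are not wrapped in a `with` statement. It is a deliberately naive heuristic of my own for illustration, not how CAST AIP or any enterprise-grade analyzer works; the function name `find_unmanaged_connects` is hypothetical, and a real tool would follow data flow across files and layers.

```python
import ast


def find_unmanaged_connects(source):
    """Return line numbers of `x.connect(...)` calls outside a `with` header.

    Toy heuristic: a call counts as managed only if it appears inside the
    context expression of a `with` statement, e.g. `with closing(db.connect())`.
    """
    tree = ast.parse(source)

    # Collect every node that sits inside a `with` header.
    managed = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.With, ast.AsyncWith)):
            for item in node.items:
                for inner in ast.walk(item.context_expr):
                    managed.add(id(inner))

    # Report attribute calls named `connect` that are not in such a header.
    findings = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "connect"
            and id(node) not in managed
        ):
            findings.append(node.lineno)
    return sorted(findings)
```

A check like this could run at each of the four stages above and fail the build whenever the returned list is not empty. Because it looks for only one known outage pattern, it stays fast enough to run every time.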
Hug Your Architects, Like, Right Now
Your architects know everything. Well, almost everything. They should know how an API is meant to be used. They should know that bypassing certain APIs can slow performance down, because APIs (most of them) are optimized for performance. What they may not know is whether a developer “accidentally” bypassed an API to run a very heavy query when deploying code at 3 AM.
In short, your architects know what’s good practice, but not everyone listens to them. Help your architects by giving them the authority and tools to police architectural non-compliance. An architectural analyzer like CAST’s Architecture Checker makes it easier for them to communicate good design and ensure compliance with it.
While rebooting the server or increasing disk space, memory or connections may end your Black Friday website outage, it may not address the underlying issue. I highly recommend using an enterprise-grade static analysis tool to identify the root cause of any issues, and to continuously look for opportunities to optimize your code.
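The idea of policing architectural rules can be sketched in a few lines of Python. This is only an assumption-laden illustration, not a description of how CAST’s Architecture Checker works: the layer names, the `ALLOWED` rule table and the `find_violations` helper are all hypothetical.

```python
# Hypothetical layering rules: each layer lists the layers it may call.
ALLOWED = {
    "storefront": {"catalog_api"},
    "catalog_api": {"database"},
    "database": set(),
}


def find_violations(observed_calls):
    """Return the (caller, callee) pairs that break the layering rules."""
    return [
        (caller, callee)
        for caller, callee in observed_calls
        if callee not in ALLOWED.get(caller, set())
    ]


# Example input: the last call skips the API and hits the database directly.
calls = [
    ("storefront", "catalog_api"),
    ("catalog_api", "database"),
    ("storefront", "database"),
]
print(find_violations(calls))
```

In practice the list of observed calls would come from analyzing the code base rather than being typed by hand, but the rule table is the part your architects own.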
If you’re interested in learning more about how CAST AIP has helped customers avoid those avoidable outages, feel free to contact us.