If you’re like me, you’ve often questioned the accuracy of a report or an analytics dashboard. Perhaps, like me, you often feel uneasy and skeptical about the numbers you see on a screen. What goes into those numbers anyway?
- Business experts carefully specify needed data and calculations
- Reporting specialists artfully design the report and diagrams
- Data warehouse managers meticulously curate and select the most relevant data
- Database administrators standardize and expose the data for easy consumption
- Developers retrieve, aggregate, and calculate data; then, present it to the user interface
However, ignore SQL error handling at your risk and all of this hard work could be wiped out--made meaningless--with one single haphazardly written SQL statement.
The Murder Weapon: A Sloppy SQL Statement
It’s been almost exactly 3 years. A client contacted me with urgency. An application that delivered daily critical financial reports had stopped working. The client’s fairly experienced team found nothing in the error logs, and everything on the infrastructure side appeared to be in good order.
My 20 years of experience as a DBA told me that something was wrong with the way data is being accessed. Lo and behold, I found a SQL statement embedded in a SQL Server Job written like this:
The SQL statement was written for a newly designed report that has been running fine for months. The statement began by checking the submission date. Based on the date, it refreshed a temporary table by first deleting all the data, and reinserting the selected data. By reviewing the runtime logs, I saw that typically, this operation took about 10 minutes to run, because it retrieved around 30 million rows of data per view. However, on the day of the application crash, this operation only took seconds to complete.
Voila! The smoking gun.
The Hidden Evidence: Neglected SQL Error Handling
It was plain to me that the lack of SQL error handling caused the application to crash. While the first step was a DELETE of the entire data set in the table and the second step was an INSERT of new data, the INSERT could still fail for many foreseeable reasons. When the INSERT was interrupted, the number of lines remain at zero, and the following date comparison will fail, causing the table to never fill again. And, since there was no SQL error handling statement, the application neither knew what to do when it failed nor created a log of the error.
To rectify this problem, a simple TRY CATCH statement with a RAISERROR throw would have done the trick. A TRY CATCH would test the execution of the statement, and when it failed, it would create a log of the error so it can be quickly triaged.
I also recommended a second failsafe in case the INSERT appeared to be successful, the number of inserted lines could be easily obtained with a check of the variable @@ROWCOUNT value to pass the status to the job output. We could double reinforce this check with a count of all lines present in the table at the end of the code. In case the table is empty, a new RAISERROR could be proactively thrown instead of waiting for the next job submission.
The Cause of Death: Not Utilizing Transactions
, the DELETE transaction was already committed, and the rollback was unavailable in case of any interruption. This was especially dangerous as the INSERT transaction ran for approximately 10 minutes--a lifetime for a business application. If the lack of SQL error handling was what caused the application to crash, then the lack of leveraging and optimization transactions was what ensured the application cannot recover.
If the VIEW SQL statement had been optimized, the overall insertion process would have been shorter and less exposed. To avoid the risk of deletion without insertion, a global transaction beginning with a BEGIN TRAN and ending with a COMMIT should have been written to control those two steps. You can read more about transaction modes here.
Protecting Innocent Applications: Audits and Automation
While we are making light of a very serious situation, it must be said that this application crash was easily preventable. To improve their application resiliency, enterprises should define and enforce policies that require auditing of new code and automate as much of it as possible. Engaging DBAs and data architects prior to new releases and periodically to audit SQL statements can drastically improve the availability and recoverability of software, and many times, also the performance thereof.
If you’re like me, there is always too much to do in a single day. Therefore, automation is the key to safeguarding our software. Solutions like CAST Application Intelligence Platform (AIP) automatically detects structural software flaws like missing error handling or not properly using transactions can provide a safety net that no human can. Plus, automated solutions can instill a sort of discipline in everyone on the application team continuously improve software structure and code.
As illustrated above, a simple few lines of code can hide a dangerous outage. To find out more about how to set policies and use automation to improve your application resiliency and avoid SQL error handling issues like the above and more, please feel free to reach out to us.
[Suggested reading : What Vanity License Plates can Teach Us About App Security and SQL Injection]