CAST

Silent Application Killers: SQL Statements without Error Handling

by Guillaume Diamant

If you’re like me, you’ve often questioned the accuracy of a report or an analytics dashboard. Perhaps you feel uneasy and skeptical about the numbers you see on a screen. What goes into those numbers, anyway?

  • Business experts carefully specify needed data and calculations
  • Reporting specialists artfully design the report and diagrams
  • Data warehouse managers meticulously curate and select the most relevant data
  • Database administrators standardize and expose the data for easy consumption
  • Developers retrieve, aggregate, and calculate data; then, present it to the user interface

However, ignore SQL error handling at your peril: all of this hard work can be wiped out--made meaningless--by one single haphazardly written SQL statement.

The Murder Weapon: A Sloppy SQL Statement

Almost exactly three years ago, a client contacted me with urgency. An application that delivered critical daily financial reports had stopped working. The client’s fairly experienced team had found nothing in the error logs, and everything on the infrastructure side appeared to be in good order.

My 20 years of experience as a DBA told me that something was wrong with the way the data was being accessed. Lo and behold, I found a SQL statement embedded in a SQL Server job written like this:

[Code sample: SQL statement with no error handling]

The SQL statement had been written for a newly designed report that had been running fine for months. The statement began by checking the submission date. Based on that date, it refreshed a temporary table by first deleting all of its data and then reinserting the selected data. Reviewing the runtime logs, I saw that this operation typically took about 10 minutes to run, because it retrieved around 30 million rows of data from the view. On the day of the application crash, however, the operation completed in mere seconds.
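The original code is not reproduced here, but based on the behavior described above, it presumably followed a pattern along these lines (the table, view, and column names below are hypothetical, chosen only to illustrate the shape of the flaw):

```sql
-- Hypothetical reconstruction of the flawed job step.
-- Object names are illustrative, not the client's actual schema.
IF (SELECT MAX(SubmissionDate) FROM dbo.ReportStaging) < CAST(GETDATE() AS date)
BEGIN
    -- Step 1: wipe the staging table (auto-committed immediately)
    DELETE FROM dbo.ReportStaging;

    -- Step 2: reload ~30 million rows -- no TRY...CATCH, no transaction
    INSERT INTO dbo.ReportStaging (SubmissionDate, AccountId, Amount)
    SELECT SubmissionDate, AccountId, Amount
    FROM dbo.vw_DailyFinancials;
END
```

Note the trap: once the table is empty, `MAX(SubmissionDate)` returns NULL, the date comparison evaluates to UNKNOWN, and the refresh branch is skipped on every subsequent run.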

Voila!  The smoking gun.

The Hidden Evidence: Neglected SQL Error Handling

It was plain to me that the lack of SQL error handling had caused the application to crash. While the first step was a DELETE of the entire data set in the table and the second step was an INSERT of new data, the INSERT could still fail for many foreseeable reasons. When the INSERT was interrupted, the row count remained at zero, the subsequent date comparison failed, and the table was never refilled. And since there was no SQL error handling statement, the application neither knew what to do when it failed nor logged the error.

To rectify this problem, a simple TRY...CATCH block with a RAISERROR call would have done the trick. The TRY...CATCH would test the execution of the statement and, on failure, log the error so it could be quickly triaged.
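A minimal sketch of that fix (the table and view names are hypothetical):

```sql
BEGIN TRY
    DELETE FROM dbo.ReportStaging;

    INSERT INTO dbo.ReportStaging (SubmissionDate, AccountId, Amount)
    SELECT SubmissionDate, AccountId, Amount
    FROM dbo.vw_DailyFinancials;
END TRY
BEGIN CATCH
    -- Capture the error, then re-raise it so the job step is
    -- marked as failed and the message lands in the job history.
    DECLARE @msg nvarchar(2048) = ERROR_MESSAGE();
    RAISERROR (N'Refresh of dbo.ReportStaging failed: %s', 16, 1, @msg);
END CATCH;
```

With this in place, an interrupted INSERT produces a visible, logged failure instead of a silently empty table.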

I also recommended a second failsafe for when the INSERT appears to succeed: the number of inserted rows can easily be obtained from @@ROWCOUNT and passed to the job output. This check can be reinforced with a count of all rows present in the table at the end of the code. If the table is empty, a new RAISERROR can be thrown proactively instead of waiting for the next job submission.
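That failsafe could look like this (again with hypothetical object names); note that @@ROWCOUNT must be read immediately after the INSERT, before any other statement resets it:

```sql
INSERT INTO dbo.ReportStaging (SubmissionDate, AccountId, Amount)
SELECT SubmissionDate, AccountId, Amount
FROM dbo.vw_DailyFinancials;

-- Capture @@ROWCOUNT right away and surface it in the job output.
DECLARE @inserted int = @@ROWCOUNT;
PRINT CONCAT('Rows inserted: ', @inserted);

-- Second check: verify the table actually ended up populated.
IF (SELECT COUNT(*) FROM dbo.ReportStaging) = 0
    RAISERROR (N'dbo.ReportStaging is empty after refresh.', 16, 1);
```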

The Cause of Death: Not Utilizing Transactions

Another major issue with this SQL statement was that by the time the INSERT began, the DELETE had already been committed, so a rollback was unavailable in case of any interruption. This was especially dangerous because the INSERT ran for approximately 10 minutes--a lifetime for a business application. If the lack of SQL error handling was what caused the application to crash, then the failure to leverage transactions was what ensured the application could not recover.

If the view’s SQL statement had been optimized, the overall insertion process would have been shorter and less exposed. To avoid the risk of a deletion without a corresponding insertion, the two steps should have been wrapped in a single transaction, starting with BEGIN TRAN and ending with a COMMIT.
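Combining the transaction with the error handling discussed earlier gives a sketch like this (hypothetical names again):

```sql
BEGIN TRY
    BEGIN TRAN;  -- DELETE and INSERT become one atomic unit of work

    DELETE FROM dbo.ReportStaging;

    INSERT INTO dbo.ReportStaging (SubmissionDate, AccountId, Amount)
    SELECT SubmissionDate, AccountId, Amount
    FROM dbo.vw_DailyFinancials;

    COMMIT TRAN;  -- reached only if both steps succeeded
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRAN;  -- the deleted rows come back; the table is never left empty
    DECLARE @msg nvarchar(2048) = ERROR_MESSAGE();
    RAISERROR (N'Refresh failed and was rolled back: %s', 16, 1, @msg);
END CATCH;
```

If the INSERT is interrupted, the ROLLBACK restores the deleted data, so the report keeps serving yesterday’s figures instead of nothing at all.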

Protecting Innocent Applications: Audits and Automation

While we are making light of a very serious situation, it must be said that this application crash was easily preventable. To improve application resiliency, enterprises should define and enforce policies that require auditing of new code, and automate as much of that auditing as possible. Engaging DBAs and data architects before new releases, and periodically thereafter, to audit SQL statements can drastically improve the availability and recoverability of software--and often its performance as well.

If you’re like me, there is always too much to do in a single day. Therefore, automation is the key to safeguarding our software. Solutions like the CAST Application Intelligence Platform (AIP), which automatically detect structural software flaws such as missing error handling or improper use of transactions, can provide a safety net that no human can. Plus, automated solutions can instill discipline in everyone on the application team to continuously improve software structure and code.

As illustrated above, a few simple lines of code can hide a dangerous outage. To find out more about how to set policies and use automation to improve your application resiliency and avoid SQL error handling issues like the one above, please feel free to reach out to us.

[Suggested reading : What Vanity License Plates can Teach Us About App Security and SQL Injection]

Guillaume Diamant, Database Infrastructure Consultant and IT Architect