IT companies spend millions of dollars trying to recover losses incurred due to poor application performance. I am sure each one of us has complained about a machine or application being slow or even dead, and then spent time at the coffee machine waiting for the results of a long running query. How can we fix that?
Most of the business applications or systems are designed to retrieve and/or write information to a local hard disk or a database system.
Consider a typical multi-tier architecture. It will contain the client tier, web tier, application tier, and data tier as shown below.
The data tier represents the database and mainly acts as the storage/manager for business data. Usually when an end-user/client requests some information or executes a query on the client tier, he/she expects to have a response ASAP. However the client tier has to talk to the data tier in order to get back the appropriate information to the client. This might take a few microseconds or sometimes even a few hours depending on several parameters.
Common parameters responsible for such delays include:
- Architecture of the system
- Code complexity
- Unoptimized SQL queries
- Hardware (CPUs, RAM)
- Number of users
- Network traffic
- Database size
Out of all these parameters, unoptimized SQL queries contribute to the majority (around 60-70%) of database performance issues.
DATABASE OPTIMIZATION APPROACH
To avoid these delays, let’s look at some common database optimization approaches. There are three main approaches to go about optimizing databases:
- Optimize the database server hardware and network usage. This involves changing the hardware to a specific configuration to speed up the read/write onto the database.
For example, use RAID 10 if there are equal read/write activities or RAID 5 if there are more read operations. This task is often performed as part of the deployment planning or infrastructure planning in the requirement analysis phase of the Software Development Lifecycle (SDLC). This exercise is also referred as hardware sizing.
- Optimize the database design. This involves the normalization of database. For example, you can go up to the third normal form of normalization, which will definitely help to make the database quicker. Usually this task is carried out during the design phase of the SDLC.
- Optimize the database queries. This involves studying the query plan, analyzing the queries for use of indexes, and joining and simplifying the queries for better performance. It is the most critical and effective approach for optimizing the database. The activity of query optimization can start in implementation phase and continue during testing, evolution, and maintenance phases.
In this post, I will focus on the database/SQL query optimization techniques/guidelines alone. The idea is to help tackle some of the critical database performance issues.
OPTIMIZE DATABASE QUERIES
Many databases come with a built-in optimizer, which performs optimization and helps improve the performance to a certain extent. However the results are not always promising. There are database monitoring tools which only capture information on the resources consumed by the database servers. This can help address 20% of the performance issues. The best way to go about query optimization is to review the functional areas, which take a long time to respond mainly because of the underlying SQL queries.
Below I have tried to list a few SQL rules with examples, based on experience and best practices, which can help optimize the database to a great extent.
- Not using “WHERE” clause to filter the data will return all records and therefore make the system very slow
Original Query 1: select * from Production.TransactionHistory
Returns 113443 rows
Optimized Query 2: select * from Production.TransactionHistory where ProductID=712
Returns 2348 rows
As the number of records retrieved are less (using “where” clause) the query executes much faster.
- Not using required column names in the “SELECT” part will take more time to return the same number of rows.
Original Query 1: select * from Production.TransactionHistory where ProductID=712
Returns 2348 rows
Optimized Query 2: select TransactionID, Quantity from Production.TransactionHistory where ProductID=712
Returns 2348 rows
Examples 1 & 2 might look quite obvious, but the idea is to think in a filtering mode and fetch the optimal set of data required for your application.
- Using Cartesian joins that lead to Cartesian products kills performance, especially when large data sets are involved. A Cartesian join is a multiple-table query that does not explicitly state a join condition among the tables and results in a Cartesian product.
Query 1: select count(*) from Production.Product
Return 504 rows
Query 2: select count(*) from Production.TransactionHistory
Return 113443 rows
Query 3: select count(*) from Production.Product, Production.TransactionHistory
Return 57175272 rows (= 504 x 113443) -> Cartesian Product
Original Query 4: select P.ProductID,TH.Quantity from Production.Product P, Production.TransactionHistory TH
Return 57175272 rows
Optimized Query 5: select P.ProductID,TH.Quantity from Production.Product P, Production.TransactionHistory TH where P.ProductID = TH.ProductID
Return 113443 rows
- Use Joins on indexed columns as much as possible.
Query 1: select P.ProductID,TH.Quantity from Production.Product P, Production.TransactionHistory TH where P.ProductID = TH.ProductID
Execute the query without any index for the first time. Re-run the same query after adding an index on column ProductID.
- Avoid full table scans when dealing with larger tables. To prevent full table scans, we can add clustered indexes on the key columns with distinct values.
Query 1: select P.ProductID,PIn.LocationID from Production.Product P,Production.ProductInventory PIn where P.ProductID = PIn.ProductID
- A. Execute the query without any indexes for the first time. It will table scan by default.
(Execution plan showing table scans for Product and ProductInventory tables below.)
- B. Re-run the same query after adding clustered indexs on one of columns (LocationID).
(Execution plan showing clustered index scan on the table ProductInventory and table scan on table Product below.)
- C. Re-run the same query after adding indexes on both columns ProductID and LocationID to avoid table scan.
(Execution Plan using index scan for both the tables Product and ProductInventory below.)
In some cases, where there are not many unique values, a table scan can be more efficient as indexes will not be used.
- In general, subqueries tend to degrade database performance. In many cases, the alternate option is to use joins.
Original Query 1: SELECT Name,ProductID FROM Production.Product WHERE ProductID NOT IN (SELECT ProductID FROM Production.TransactionHistory)
Original Query 2: SELECT Name,ProductID FROM Production.Product P WHERE NOT EXISTS (SELECT ProductID FROM Production.TransactionHistory where P.ProductID=ProductID)
Replace sub-query by join
Optimized Query 3: SELECT Name,P.ProductID FROM Production.Product P LEFT OUTER JOIN Production.TransactionHistory TH On (P.ProductID = TH.ProductID) where TH.ProductID is NULL
- Too many indexes on a single table are costly. Every index increases the time it takes to perform INSERTS, UPDATES and DELETES, so the number of indexes should not be too high.
There is an additional disk and memory cost involved with each index. The best solution is to monitor your system performance. If there is need to improve the performance, you can go for more indexes.
- When a SQL query is executed in a loop, it will require several roundtrips between the client and the database server. This consumes network resources/bandwidth and hurts performance. Therefore SQL queries inside loops should be avoided. The recommended workaround is to create one query using a temporary table. In this case, only one network round trip will be required. And further optimize the query.
There are few database analyzers in the market which check the SQL code against such rules and help identifying the weak SQL queries.
I will continue to blog on this subject to cover advanced optimizing guidelines linked to Stored Procedures, Cursors, Views and Dynamic SQL. Hope this post gives you a few tips to identify and resolve database performance issues.
Please feel free to share your feedback/questions on this blog, or experiences with any tools you have tried for database optimization.