Monday, March 19, 2007

Hints And Pitfalls In Database Development (Part 1): Missing Indices

I have been involved in a lot of database tuning lately. This is partly because some of our databases have simply outgrown the originally anticipated data volumes. Furthermore, I sometimes have to deal with application modules from "external contributors".

So I am trying to sum up some of my experiences in database development over the years in a little mini-series. Most of these things should be quite obvious to many developers; others might be news to some. I will fall back on some SqlServer terminology at times, but most principles should apply to any major database.

So let me start with the number one reason for bad database performance:

Missing Indices

It's just amazing how creating indices tends to be forgotten again and again. I don't know, it might have to do with the fact that on 4GL platforms like Microsoft Access, where a lot of folks come from, they did not have to worry too much about indexing. Databases tended to be smaller there, and the database schema wizard most likely made any foreign key an index by default anyway.

I often hear the argument that "indices are going to slow down our updates". Well, I don't know what is worse: slowing down updates by ten percent or so, or slowing down queries by a factor of 100. The numbers vary, of course. And it is true, indices that will not be applied by the query optimizer, and hence won't lead to any performance gain, should not be introduced in the first place. The best way to find out is to have a look at the query execution plans and see whether the indices are actually being used.
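
Just to illustrate what I mean (with a made-up Orders table and a query on its CustomerId column), the textual plan output shows at a glance whether you get an Index Seek or end up with a full scan:

    SET SHOWPLAN_TEXT ON;
    GO

    -- Hypothetical table/column names; look for an Index Seek here
    -- instead of a Table Scan / Clustered Index Scan.
    SELECT OrderId, OrderDate
    FROM dbo.Orders
    WHERE CustomerId = 4711;
    GO

    SET SHOWPLAN_TEXT OFF;
    GO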

So in general, foreign keys and other query criteria fields are candidates for indices, with the exception of attributes with a small value range (like booleans or a tiny set of numbers), large varchar columns, and tables with only a small number of rows.
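
SqlServer, for example, does not index foreign key columns automatically, so a typical candidate (again using the made-up Orders/Customers schema) would simply be:

    -- Assumed schema: Orders.CustomerId references Customers.CustomerId.
    -- The foreign key constraint alone does not create an index.
    CREATE INDEX IX_Orders_CustomerId
        ON dbo.Orders (CustomerId);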

But sometimes even "golden rules" of when to create or not to create indices can turn out to be wrong. One of those rules is to avoid indices on columns with only a few distinct values, like a status attribute with 5 different values. But I experienced a case where the query optimizer made completely wrong assumptions about the row counts for one of those status values. I added an index on this column, which implicitly induced statistics maintenance, and that in turn helped the optimizer to make the right decisions.
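
In that situation, something along these lines did the trick (the names are made up, of course); alternatively the statistics can be refreshed by hand:

    -- Index on a low-cardinality status column; even if it is never used
    -- for seeks, it makes SqlServer maintain statistics on the column.
    CREATE INDEX IX_Orders_Status
        ON dbo.Orders (Status);

    -- Statistics can also be refreshed explicitly:
    UPDATE STATISTICS dbo.Orders IX_Orders_Status;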

That being said, it is undoubtedly important to avoid excessive index creation, as this can lead to performance penalties during inserts, updates and deletes.

In addition, composite indices can lead to major speedups as well when applied to tuples of columns which are typically queried together (note: a query has to filter on a leftmost prefix of the index columns for the index to be applied).
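
A quick sketch with the made-up Orders table again:

    -- Composite index on columns that are usually queried together.
    CREATE INDEX IX_Orders_CustomerId_OrderDate
        ON dbo.Orders (CustomerId, OrderDate);

    -- Can use the index: filters on the leading column plus the second one.
    SELECT OrderId FROM dbo.Orders
    WHERE CustomerId = 4711 AND OrderDate >= '20070101';

    -- Can still use the index: filters on the leading column alone.
    SELECT OrderId FROM dbo.Orders
    WHERE CustomerId = 4711;

    -- Typically cannot seek on the index: the leading column is missing.
    SELECT OrderId FROM dbo.Orders
    WHERE OrderDate >= '20070101';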

The good news is: you are not alone. There are plenty of tools which help with finding a good indexing strategy. I start up SqlServer's Database Engine Tuning Advisor (formerly known as Index Tuning Wizard) quite frequently, and in addition I have several schema check scripts at hand which - among other things - look for missing indices, e.g. on foreign key columns.
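
Such a check boils down to a query against the catalog views; roughly like this (SqlServer 2005, simplified to leading index columns only):

    -- List foreign key columns that are not the leading column
    -- of any index on their table.
    SELECT t.name  AS TableName,
           c.name  AS ColumnName,
           fk.name AS ForeignKeyName
    FROM sys.foreign_key_columns fkc
    JOIN sys.foreign_keys fk ON fk.object_id = fkc.constraint_object_id
    JOIN sys.tables t        ON t.object_id  = fkc.parent_object_id
    JOIN sys.columns c       ON c.object_id  = fkc.parent_object_id
                            AND c.column_id  = fkc.parent_column_id
    WHERE NOT EXISTS (
        SELECT *
        FROM sys.index_columns ic
        WHERE ic.object_id   = fkc.parent_object_id
          AND ic.column_id   = fkc.parent_column_id
          AND ic.key_ordinal = 1   -- leading index column only
    );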

And: indices must be rebuilt regularly. This serves two purposes: (1) rebuilding restores the fill factor, so the index pages keep an optimal balance between read and write performance, and (2) index rebuilding updates the index statistics as well. Outdated statistics can be a major cause of slow query performance.
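
In SqlServer 2005 this is a one-liner per table (older versions used DBCC DBREINDEX); the fill factor value is just an example:

    -- Rebuild all indices on the table, leaving some free space per page.
    ALTER INDEX ALL ON dbo.Orders
        REBUILD WITH (FILLFACTOR = 90);

    -- Statistics not covered by an index can be refreshed separately:
    UPDATE STATISTICS dbo.Orders WITH FULLSCAN;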
