Monday, October 15, 2007

Hints And Pitfalls In Database Development (Part 5): The Importance Of Choosing The Right Clustered Index

In database design, a clustered index defines the physical order in which data rows are stored on disk (Note: the most common data structure for storing rows both in memory and on disk are B-trees, so the term "page" can also be interpretated as "B-tree leaf node" in the following text, although it's not necessarily a 1:1 match - but you get the point). In most cases the default clustered index is the primary key. The trouble starts when people don't spend any further thought and stick with that setting no matter whether the primary key is a good choice for physical ordering or not...

File I/O happens at a page level, so reading a row implies that all other rows stored within the same physical disk page are read as well. Wouldn't it make sense to align those rows together which are most likely to be fetched en bloc too? This limits the number of page reads, and avoids having to switch disk tracks (which would be a costly operation).



So the secret is to choose an attribute for clustering which causes the least overhead for I/O. Those rows that are most likely going to be accessed together should reside within the same page, or at least in pages next to each other.

Usually an auto-increment primary key is a good choice for a clustered index. Rows that have been created consecutively will then be stored consecutively, which fits in case they are likely to be accessed along with each other as well. On the other hand if a row contains a date column, and data is mainly being selected based on these date values, this column might be the right option for clustering. And for child rows it's probably a good idea to choose the foreign key column referencing the parent row for the table's clustered index - a parent row's child rows can then be fetched in one pass.

I work on a project that applies unique identifiers for primary keys. This has several advantages, the client being able to create primary keys in advance among them. But unique identifier primary keys are a bad choice for a clustered index, as their values disperse more or less randomly, hence the physical order on disk will be just as random. We have experienced a many-fold performance speedup by choosing more suitable columns for clustered indexing.

Previous Posts: