Monthly Archives: January 2014
Hey, what is this all about?
This is mainly a blog about management and leadership. But as my boss pointed out, I am a Chief Technical Officer. So for a change, here is a post with a technical focus. Rest assured that they will be few and far between. But for now, if you are a Pointy Haired Boss, please move along.
As it happens, I worked as a database admin / architect a while back, and indexing is an interesting subject that is often ignored – and seldom explained. So with this post I’ll try to do something about that! Kudos if you make it all the way through. 🙂
Introduction / Prerequisites
It is assumed that you have basic database knowledge: You know what a database is, you know what a table is, and you know how to perform operations against that table (preferably using T-SQL).
It is assumed that you already know how to create, modify and delete indexes using T-SQL or SQL Server Management Studio.
This guide gives a basic introduction to indexes, but does not attempt to explain in great detail how things work or why. It is focused on practical, experience-based suggestions on how to perform basic indexing of a database. Even at this brief level, understanding how this works will get a little complicated if you are not familiar with the concepts. This is why a surprisingly large percentage of developers know very little about this topic (shame on them).
If you insist on not understanding how this works, you can cheat and skip the difficult bits. I have marked the somewhat more complex parts with a red star – that means you are allowed to skip them if you want the quick version of this guide.
What is an index, anyway?
The explanation a human can understand
A simple analogy is to think of a database table as a book. This particular book contains one long list (a table is basically a list) that spans across all the pages in the book, and the list has several columns. An index on a database table serves the same purpose as the index in a book (but is built in a very clever way), and it is usually only related to one specific column in the list that our book contains. So if you want to find something in a specific column, the index will tell you (or the database engine) on which page or pages you can find it.
Say you have a column called “LastName”, and you search for “Andersen”. If you have an index on the column “LastName”, the database engine can ask the index for all the pages that contain “Andersen” in the “LastName” column. The index will conduct a very effective search and reply something like “2, 5, 231 and 299”. The DB engine would then load pages 2, 5, 231 and 299, scan through these, and return only the rows in the list where “Andersen” is present in the “LastName” column.
What if you don’t have an index? Then the database engine would have to scan through every single page in the entire book (table), looking at every single line, checking whether “Andersen” is in the “LastName” column or not. Needless to say, this takes quite some time relative to the index approach.
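To make the difference concrete, here is a minimal T-SQL sketch of the “Andersen” search. The table dbo.Person, its columns and the index name are hypothetical and only chosen to match the example above.

```sql
-- Hypothetical table, loosely modelled on the LastName example.
CREATE TABLE dbo.Person
(
    ID        INT IDENTITY(1,1) NOT NULL,
    FirstName VARCHAR(50)       NOT NULL,
    LastName  VARCHAR(50)       NOT NULL
);

-- Without this index, the SELECT below has to scan every page of the table.
CREATE NONCLUSTERED INDEX IX_Person_LastName
    ON dbo.Person (LastName);

-- With the index, the engine can look up 'Andersen' in the index
-- and then fetch only the pages that hold the matching rows.
SELECT ID, FirstName, LastName
FROM dbo.Person
WHERE LastName = 'Andersen';
```

On a table with only a handful of rows the engine may still prefer a scan, so the effect only becomes visible once the table has grown.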
A simplified illustration: The index you create on a column contains the actual data from the column you index, as well as a reference to the page where the entire data row can be found.
The technical explanation
The index is not actually structured as a plain table like in the illustration above. An index on a SQL Server table is a copy of one or more columns of the table, but it is sorted / structured in a specific way: it is arranged in a B-tree. As a result, searching an index is very fast. Look up “B-tree” if you want to know more about the structure (not required to complete this guide).
What about clustered indexes, what is that?
You may only have one (1) clustered index per table. A clustered index is the column (or columns) by which the actual data rows of the table are sorted. Let’s say you have a column “LastName”, and decide to add a clustered index on this column. Then this will not be a copy of the column (as would be the case with a regular non-clustered index), but the actual column in the table. When the clustered index is created, the rows of the actual table are reordered and sorted by the column you selected (“LastName”). In the illustration above, LastName is obviously not the clustered index, as the data in the table is not sorted by that column. Judging only by the data we see in the illustration above, the ID, Created or AddressID column could each be the clustered index – as they are all sorted.
By default, SQL Server selects the primary key as the clustered index. This is often not such a good idea, especially if the primary key is a randomly generated id like a Guid. It is also often the case that the primary key is just an internal ID, not actually used in queries by the system.
The best candidate for clustered indexes is a column that you often include in a filter when you are expecting a ranged result (more than one). Columns containing row creation date are often good candidates in data tables (containing records of some sort like orders or transactions), as you would often ask to return all records for the last hour, day or perhaps even month for reporting purposes. If the table is actually sorted according to creation date, such a filter would be very effective.
If you never (or rarely) perform ranged searches (a user table could be an example, unless you often filter by a linked column like customerID), the column most often used for single selects (like the ID) will be the best choice.
The advantage of the clustered index is that it IS the table. So when you have found a match in a clustered index, you also have immediate access to the entire data row. In a regular index, you only find a reference to the page that contains the data row, and you will also need to fetch that.
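As a rough sketch of how this could look in T-SQL, the following puts the clustered index on a creation date instead of the primary key. The table dbo.PaymentRecord and all column and index names are hypothetical.

```sql
CREATE TABLE dbo.PaymentRecord
(
    ID         INT IDENTITY(1,1) NOT NULL,
    CustomerID INT               NOT NULL,
    Created    DATETIME          NOT NULL,
    Amount     DECIMAL(10,2)     NOT NULL,
    -- Declare the primary key NONCLUSTERED, otherwise it becomes
    -- the clustered index by default.
    CONSTRAINT PK_PaymentRecord PRIMARY KEY NONCLUSTERED (ID)
);

-- The table itself is now stored sorted by Created.
CREATE CLUSTERED INDEX CIX_PaymentRecord_Created
    ON dbo.PaymentRecord (Created);

-- A typical ranged filter: all records from the last 24 hours.
-- The matching rows sit next to each other in the clustered index.
SELECT ID, CustomerID, Created, Amount
FROM dbo.PaymentRecord
WHERE Created >= DATEADD(DAY, -1, GETDATE());
```

Single-row lookups by ID still work through the nonclustered primary key; they just need one extra step to reach the full row.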
So basically, indexes are great! I should just index everything then, in order to get maximum speed?
I’m glad you asked. That reminds me that we need to talk about something else before discussing how to index:
I know my table is slow, but I don’t understand why
What makes SQL operations slow?
The more rows in your table, the slower all operations will get. Indexes (applied correctly) will speed up read operations. This is a good thing. Indexes also make every other operation slower (insert, update and delete). That’s not so good. So why does that happen? Remember I said that an index was basically a copy of the column you index (see illustration above)? That means every time you add an index, you actually increase the size of the table by roughly the size of the column you are indexing. This increases the disk storage required to store your database. Storage is quite cheap, but you also introduce another issue: You increase the number of places that have to be modified when you do an update, insert or delete. Say you add an index on the column “LastName” in the table above. When you do an insert, SQL Server not only has to write the data into the actual table, it will also need to update the index. It may even need to reorganize the index, as the content of LastName in the new row you just inserted probably fits somewhere in the middle of the existing index. Needless to say, this makes the insert operation slower than it would have been without the index.
Why the size of your datatypes matters
All this talk about size reminds me of a related issue: The size of your table rows actually slows down read operations as well! Why? Because every page in our book (table) can only hold a set amount of data (8192 bytes for the geeks). That means that as we increase the number of columns (or the size of each column) in our table, we decrease the number of rows we can fit in each page. That means at least ranged selects (selecting more than one row) will take longer, as they need to retrieve a higher number of pages (blocks of 8192 bytes) to get all the rows you want. This translates into more data reads, which takes longer. So don’t use an int (4 bytes) when all you need is a bit (1 byte) or tinyint (1 byte). Also prefer VARCHAR (variable size) over CHAR (fixed size) whenever the length of the values varies, and don’t even get me started on GUID (16 bytes). Lastly, don’t add columns you don’t strictly need.
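If you want to see what each index costs in storage, a query along these lines against the system views can help. It is only a sketch: dbo.Person is a placeholder for your own table name, and reading sys.dm_db_partition_stats requires the VIEW DATABASE STATE permission.

```sql
-- Space used per index on one table.
-- Pages are 8 KB each, so used_page_count * 8 gives kilobytes.
SELECT i.name                      AS IndexName,
       i.type_desc                 AS IndexType,   -- HEAP, CLUSTERED or NONCLUSTERED
       SUM(ps.used_page_count) * 8 AS UsedKB,
       SUM(ps.row_count)           AS RowCnt
FROM sys.indexes AS i
JOIN sys.dm_db_partition_stats AS ps
    ON  ps.object_id = i.object_id
    AND ps.index_id  = i.index_id
WHERE i.object_id = OBJECT_ID('dbo.Person')        -- placeholder table name
GROUP BY i.name, i.type_desc;
```

Every nonclustered index in the result is extra data that has to be kept up to date on each insert, update and delete.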
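A back-of-the-envelope calculation shows the effect. Of the 8192 bytes in a page, 96 are used by the page header, which leaves 8096 bytes for rows, and each row also needs a 2-byte entry in the page's slot array. The two row sizes below are made-up numbers, and the calculation ignores further per-row overhead, so treat it as an estimate only.

```sql
DECLARE @PageBytes INT = 8096;  -- 8192-byte page minus the 96-byte header
DECLARE @WideRow   INT = 200;   -- assumed row size with generous datatypes
DECLARE @NarrowRow INT = 60;    -- assumed row size with compact datatypes

-- Integer division: whole rows that fit in one page.
SELECT @PageBytes / (@WideRow   + 2) AS RowsPerPageWide,    -- 40
       @PageBytes / (@NarrowRow + 2) AS RowsPerPageNarrow;  -- 130
```

For a million rows that is 25,000 pages with the wide rows versus roughly 7,700 pages with the narrow ones, so a ranged select over the whole table reads about three times as much data.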
Okay, stop talking! Just tell me how to fix my slow tables!
Let’s start with how you DON’T fix it
- Don’t index columns that are never (or rarely) included in where clauses of the queries performed by your system.
- In a few specific cases you may also want to index columns that are rarely used in a where clause, say in the query for a monthly report that would take hours if you didn’t add the index
- Only index columns with high variability in the data content. That means you:
- Do not index bit columns
- Do not index columns containing things like a status (typically a small range of different numbers).
- Do not index columns containing stuff like gender (which you should have put in a bit column in the first place, so I didn’t have to put this in a separate rule!).
- Do not index columns that are only included in where clauses IN COMBINATION with other column(s) that you have already indexed, AND the filter on the other column(s) already narrows down the result significantly. I know this is a long one, so I will include a reverse version in the “how-to” below.
- Do not index very small tables (say, less than 500 rows). They are either used so rarely that it doesn’t matter, or they are used so often that the entire table will always be in memory (RAM), and it will be superfast anyway. A full scan of the table will in practice be just as fast as an indexed search, so even if you add an index, SQL Server may not use it. Also, most small tables contain near-static data, and should probably be cached in the application.
- Some people (and some automatic indexing tools) will tell you that something called covering indexes is a good idea. I generally start a tuning session by locating any covering indexes, making a note of the columns they contain, and then deleting them. Covering indexes are used wrong 90% of the time, and only effective in specific cases (not covered by this guide, but if you insist, read here). Just trust me on this one. Forget about covering indexes – cases where they make a real difference are incredibly rare. Thank you. I will tell you what to do if you find one in an existing database below.
- Do not put a clustered index on a GUID column; it will seldom be the optimal choice.
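One way to check the “high variability” rule before creating an index is to count how many distinct values a column actually holds. The sketch below assumes a hypothetical table dbo.Person with the columns LastName and Status; substitute your own names.

```sql
-- Compare the number of distinct values with the total number of rows.
-- A ratio close to 1 means high variability (a good index candidate);
-- a ratio close to 0 means few distinct values (bit, status, gender...).
SELECT COUNT(*)                 AS TotalRows,
       COUNT(DISTINCT LastName) AS DistinctLastNames,
       COUNT(DISTINCT Status)   AS DistinctStatuses,
       CAST(COUNT(DISTINCT LastName) AS FLOAT) / NULLIF(COUNT(*), 0) AS LastNameRatio,
       CAST(COUNT(DISTINCT Status)   AS FLOAT) / NULLIF(COUNT(*), 0) AS StatusRatio
FROM dbo.Person;
```

The query scans the whole table, so on a large production table it is better run outside peak hours.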
Basic How-to for indexing and / or tuning any database
- Start by figuring out what to use as the clustered index (see above for more information about this). The entire table will need to be restructured (this will take time and will lock the table) if you decide to change this later. In SQL Azure it’s not even possible.
- Index any ID columns that are (often) used in where clauses of the queries performed by your system
- Index any foreign key columns that are (often) used in the where clauses or joins of the queries performed by your system
- Index any data columns that are often used in where clauses of the queries performed by your system, typically in the context of users manually searching for data.
- When several columns are combined in the same where clause, you often only need to index the column(s) that narrow down the search the most (reverse of rule 3 under “Don’ts” above).
- Example: The system only allows you to search for users based on age if you also include first and last name. In this situation, indexing the age column probably won’t speed up the search at all, as the indexes on first and last name will already have narrowed down the possible hits to just a couple of rows.
- If you find an existing covering index, it is usually relatively easy to understand the purpose of the index. It will usually contain a column covered by rule 2 or 3 in this list. If it does, create a new, non-covering, non-clustered index on this column (if it doesn’t already exist), and delete the covering index. Job done.
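The last step can be sketched in T-SQL as follows. The first query lists nonclustered indexes that carry included columns, which is a common (but not the only) form of covering index. The table dbo.Orders and both index names in the second part are hypothetical.

```sql
-- 1) Find nonclustered indexes on user tables that have included columns.
SELECT OBJECT_NAME(i.object_id) AS TableName,
       i.name                   AS IndexName,
       c.name                   AS ColumnName,
       ic.is_included_column    AS IsIncluded   -- 0 = key column, 1 = included column
FROM sys.indexes AS i
JOIN sys.index_columns AS ic
    ON  ic.object_id = i.object_id
    AND ic.index_id  = i.index_id
JOIN sys.columns AS c
    ON  c.object_id = ic.object_id
    AND c.column_id = ic.column_id
WHERE i.type_desc = 'NONCLUSTERED'
  AND OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
  AND EXISTS (SELECT 1
              FROM sys.index_columns AS x
              WHERE x.object_id = i.object_id
                AND x.index_id  = i.index_id
                AND x.is_included_column = 1)
ORDER BY TableName, IndexName, ic.key_ordinal;

-- 2) Replace a (hypothetical) covering index with a plain index
--    on the foreign key column it was built around.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID);

DROP INDEX IX_Orders_CustomerID_Covering
    ON dbo.Orders;
```

Creating the replacement before dropping the old index means queries never run without any index on the column.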
Would it be too much to ask for an example?
Here is a very basic database with a few very basic tables, including an indication of how I would index them:
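Expressed as T-SQL, a schema indexed along these guidelines could look something like the sketch below. The two tables and every column and index name are hypothetical, made up to illustrate the rules in this guide.

```sql
CREATE TABLE dbo.Customers
(
    ID       INT IDENTITY(1,1) NOT NULL,
    LastName VARCHAR(50)       NOT NULL,
    Email    VARCHAR(100)      NOT NULL,
    Created  DATETIME          NOT NULL,
    -- Mostly single-row lookups by ID, so the ID is the clustered index.
    CONSTRAINT PK_Customers PRIMARY KEY CLUSTERED (ID)
);

-- Data column that users search on manually.
CREATE NONCLUSTERED INDEX IX_Customers_LastName
    ON dbo.Customers (LastName);

CREATE TABLE dbo.Orders
(
    ID         INT IDENTITY(1,1) NOT NULL,
    CustomerID INT               NOT NULL,
    StatusID   TINYINT           NOT NULL,  -- low variability: not indexed
    Created    DATETIME          NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED (ID),
    CONSTRAINT FK_Orders_Customers
        FOREIGN KEY (CustomerID) REFERENCES dbo.Customers (ID)
);

-- Orders are usually fetched by date range, so cluster on Created.
CREATE CLUSTERED INDEX CIX_Orders_Created
    ON dbo.Orders (Created);

-- Foreign key used in joins and where clauses.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID);
```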
That’s really all there is to it. If the database and the way you query it are reasonably structured – you now know enough to make even fairly large databases (with several million rows in the main tables) perform reasonably well.
When tuning an existing database, one could also use SQL Server Profiler to find the hotspots and fix specific problems faster, but that is not covered by this guide. If the entire database is indexed using the above guidelines, you probably won’t have any major problems anyway.