MySQL Indexes — to MySQL what green kryptonite is to the Smallville universe's Bizarro

This is the second post in a short series about MySQL, and the first is located here. Today’s post stars indexes, those furry little creatures native to the forest moon of Endor.

Indexes are essential for fast and effective conditional queries, and you should always use them. If you can afford the extra storage space, that is. Selecting the right indexes is both a space vs. speed consideration and the result of analyzing the queries you will be running. It therefore shouldn’t be a set-and-forget operation: indexes should be revised as queries are written, and revised again after live operation has started, when real-world usage and data can be analyzed. Great tools for this are MySQL’s slow query log, the EXPLAIN command and query rate counters, all of which are covered later in this post.

The different engines handle indexes in different ways, so indexes should also be revised when switching engines. One serious issue is always there though: MySQL will only use one index per table in a query! (except when performing UNIONs or index merges) The result is that adding an index per field could be a huge waste of space and processing power if those indexes are never used. Another important note is that the order of the fields in a multi-field index matters. If fields one and three of an index are used in the query, only field one will be index-searched, because using field three requires that field two is constrained as well.

http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
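A sketch of that last point, with a made-up table t and an index across the columns a, b and c:

ALTER TABLE t ADD INDEX abc ( a, b, c );

SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3; -- can use the whole index
SELECT * FROM t WHERE a = 1 AND b = 2;           -- can use the ( a, b ) prefix
SELECT * FROM t WHERE a = 1 AND c = 3;           -- only the a part is index-searched, c is filtered afterwards
SELECT * FROM t WHERE b = 2;                     -- can not use the index at all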

The grey smoke ( how they work )

Warning: The following section is a crude explanation of how indexes work. The different engines each have their own way of implementing indexes, so this section should only be read as one way they could work.

Indexes are ordered duplicates of selected fields from a table. Since these copies are ordered, MySQL can find the rows matching a WHERE clause much faster than in the random order of the original table. An index can also include several fields, in which case the data is stored (in most cases) as a tree, in the order you list the fields: the first field becomes the root node and the following fields become the branches. Here is an example of a table with the fields id, first name and family name:

1,	Alec,		Baldwin
2,	Adam,		Baldwin
3,	William,	Baldwin
4,	Stephen,	Baldwin
5,	Albert,		Einstein
6,	Adam,		West
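(The examples below assume a table roughly like this; the exact column types are my own guess:)

CREATE TABLE people (
	id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
	first VARCHAR(100) NOT NULL,
	family VARCHAR(100) NOT NULL
);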

When executing the query:
SELECT family FROM people WHERE first = 'William';
MySQL would have to read all the first names to find the matches. But if you add the index ( first ), MySQL would create an index looking something like this:

Adam,		2
Adam, 		6
Albert,	 	5
Alec,		1
Stephen, 	4
William,	3

The query would then by different kinds of magic move almost directly to the 'William' section and find the references to the rows in the original table.

If we then add the index ( family ), this query would also go faster, as the "family" index would be used:

SELECT first FROM people WHERE family = 'Einstein';

But when executing a query with WHERE clauses on both the family and first fields, only one of the indexes will be used, probably the one with the highest cardinality (read about it further down). The following query would probably use the "first" index and then exclude "West" from the list after retrieving all the Adams from the original table.

SELECT first, family FROM people WHERE family = 'Baldwin' AND first = 'Adam';

So what we want is an index with both fields, like ( first, family ). MySQL will then create a tree in this shape:

Adam,		Baldwin,	2
		West,		6
Albert,		Einstein,	5
Alec,		Baldwin, 	1
Stephen,	Baldwin,	4
William,	Baldwin,	3

The previous query would then be able to exclude rows based on both conditions and find the matching table references. (You would also have a covering index. More on this later.)
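For concreteness, such an index could be added like this (the index name is my own invention):

ALTER TABLE people ADD INDEX first_family ( first, family );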

A very important thing to know is the difference between ( first, family ) and ( family, first ), as the latter would look like this:

Baldwin,	Adam,		2
		Alec,		1
		Stephen,	4
		William,	3
Einstein,	Albert,		5
West,		Adam,		6

Index prefix compression could also be mentioned here. Again, the exact implementations could be very different from my version, but you should be able to get the gist of it. When the beginnings of multiple index entries match, the engine can do some magic and store that prefix only once. This can save a lot of space, and adding fields with very low cardinality to an index could then increase the index size only minimally, but not all engines implement it. Let’s add a few people and the field "banana" to our table and see how it could look.

First without compression:

Baldwin,	Adam,		Yellow,		2
		Adam,		Yellow,		9
		Adam,		Brown,		10
		Adaminium,	Yellow,		7
		Adamsomething,	Brown,		8
		Alec,		Yellow,		1
		Stephen,	Yellow,		4
		William,	Yellow,		3
Einstein,	Albert,		Yellow,		5
West,		Adam,		Yellow,		6

Then with compression:

Baldwin,	(Adam),		(Yellow),	2
		(),		(),		9
		(),		Brown,		10
		()inium,	Yellow,		7
		()something,	Brown,		8
		Alec,		(Yellow),	1
		Stephen,	(),		4
		William,	(),		3
Einstein,	Albert,		(),		5
West,		Adam,		(),		6

I hope this explanation helps you understand why indexes are so useful, but also why a set-and-forget strategy seldom works. If 90% of your heavy queries can’t use your long and space-consuming indexes, then they’re just slowing you down.

The Slow Query Log

The variable long_query_time holds the number of seconds a query can run before it’s added to the slow log. The default value is 10 seconds. It’s recommended to use the mysqldumpslow tool to summarize the logged queries.

http://dev.mysql.com/doc/refman/5.0/en/slow-query-log.html
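A minimal my.cnf sketch for turning the log on, using the 5.0-era option names (the path and threshold are just examples):

[mysqld]
log-slow-queries = /var/log/mysql/mysql-slow.log
long_query_time = 2

Then summarize with mysqldumpslow /var/log/mysql/mysql-slow.log once enough data has accumulated.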

EXPLAIN

Added in front of any query, EXPLAIN shows a summary of the choices MySQL’s optimizer made for that query. This is an essential and simple way to check that your queries work with indexes as planned. The important fields are type, key, key_len, rows and Extra. Type tells you how the data is read, where ALL is your mortal enemy and const and range your more likely targets; there are several others though, so search them up with your favorite search engine. The key field tells you which key the optimizer chose, if any. On any large data set you would probably want an index here, but even if there are indexes available, MySQL might choose not to use them. If you see that MySQL makes poor index choices for a query, USE INDEX or FORCE INDEX hints can be added, but this is not recommended for anything other than testing purposes, as the data set is likely to change and the dynamic query path is safer in the long run. key_len tells you how many bytes of the key are used. A small number such as 3 could indicate that just the first MEDIUMINT field in the index is used, while 200 could be the result of a good multi-field hit (unless your text indexes are unnecessarily long; read about partial indexes below). Rows is the number of hits the optimizer expects to find based on its table statistics. In the Extra field you should see "Using index" and "Using where" for a fast query, but "Using filesort" and a few others could mean problems. Again, your favorite search engine is your friend.

http://dev.mysql.com/doc/refman/5.0/en/explain.html
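As a sketch, running EXPLAIN against the people table from earlier, with the ( first, family ) index in place, could show something like this (the key name and numbers are hypothetical):

EXPLAIN SELECT first, family FROM people WHERE first = 'Adam' AND family = 'Baldwin';
-- table: people, type: ref, key: first_family, key_len: 204,
-- rows: 1, Extra: Using where; Using index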

Query rate counter

D-I-Y by just incrementing an integer in Memcached or a MEMORY table every time your code does a database query, or by adding a trigger. After some usage you can analyze the data and see which queries you should optimize for. It’s really up to you how you choose to implement it.

http://dev.mysql.com/doc/refman/5.0/en/triggers.html
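A sketch of the MEMORY table variant; the table layout and the label are made up, and your application would run the INSERT alongside each query it wants counted:

CREATE TABLE query_stats (
	query_name VARCHAR(64) NOT NULL PRIMARY KEY,
	hits INT UNSIGNED NOT NULL DEFAULT 0
) ENGINE = MEMORY;

INSERT INTO query_stats ( query_name, hits ) VALUES ( 'people_by_first', 1 )
ON DUPLICATE KEY UPDATE hits = hits + 1;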

Cardinality

When running SHOW INDEXES FROM table, cardinality is one of the more important fields to inspect. Cardinality is the uniqueness of the index so far, and the higher cardinality a field has, the better it will work as an index. Where the primary key has a cardinality equal to COUNT(*), a TINYINT(1) field named deleted (boolean one or zero for true or false) would have a cardinality of at most two, regardless of how many rows there are. A primary key lookup often goes at optimal speed, but in the case of an index on the TINYINT(1) deleted field, MySQL won’t in most cases even use the index. This is because when the optimizer predicts a hit percentage of more than about 30%, it will opt for the often faster sequential read of the entire data set. You should therefore order the fields in multi-field indexes first by field hit frequency (how often that field is part of the WHERE clause of your queries), and then by cardinality. This excludes as many rows as possible early in the query and gives the optimizer an easier and more predictable job. Sometimes, however, even if the cardinality of a field is sub-optimal, it’s better to include it in your index just so you can have a covering index.
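Against the example people table with the ( first, family ) index, the interesting columns of SHOW INDEXES could look something like this (a hypothetical excerpt; the numbers are estimates from the table statistics):

SHOW INDEXES FROM people;
-- Key_name       Seq_in_index  Column_name  Cardinality
-- PRIMARY        1             id           6
-- first_family   1             first        5
-- first_family   2             family       6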

Covering index

When processing an indexed query, MySQL first reads the index file, excludes as many rows as possible and then uses the byte offset (MyISAM) or primary key (InnoDB) to read the rest of the data needed for those rows from the data file. In some cases, this second read can increase latency more than if a larger data set had just been read from the index. An index including all fields in a query, so that just one read is needed, is called a covering index. The decision to have covering indexes is, once again, a space/speed consideration.
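For example, with the ( first, family ) index from earlier, this query can be answered from the index alone and the data file is never touched; EXPLAIN would show "Using index" in the Extra field:

SELECT family FROM people WHERE first = 'Adam';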

Partial indexes

Normal indexes on text fields (TEXT, VARCHAR, etc.) aren’t always a good choice, because each distinct entry in an index on a VARCHAR(100) can occupy a lot of unnecessary space. With a partial index, only a set number of chars are indexed, starting at the beginning of the field. The VARCHAR field for storing a person’s first name should be wide enough for ‘Dominar Rygel XVI’ to store his name, but an index of the first three to six chars is probably enough to get a good enough cardinality on most first names to exclude >95% of the rows. How many chars are needed can easily be checked by adding a new index and comparing the listed cardinality after updating the table statistics. At 10 000 users from Scandinavia you might need five chars for a >95% cardinality, but at 100 000 users from all over the world, four might give the same cardinality, with only a marginally better result at five.
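A sketch of trying out a prefix length (the length of four is just an example; measure on your own data):

ALTER TABLE people ADD INDEX first_prefix ( first(4) );
ANALYZE TABLE people;     -- update the table statistics
SHOW INDEXES FROM people; -- compare the listed cardinality per prefix length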

Email domain queries with index

In some situations you need to query email addresses by domain name (no you won’t, but you might want to learn this trick anyway), and instead of the slow LIKE '%domain.com' with a query type of ALL, the solution could be to reverse the address and add a partial index, since indexes work from left to right. The address could be reversed in PHP with strrev() or by MySQL with REVERSE(), but a better solution could be to use BEFORE INSERT and BEFORE UPDATE triggers; that’s up to you. The query would then end up like this:

SELECT name FROM customers WHERE email LIKE 'on.amotpa%';

or

SELECT name FROM customers WHERE email LIKE REVERSE( '%aptoma.no' );

and an insert could look like this:

INSERT INTO customers ( name, email ) VALUES ( 'Kryten', REVERSE( 'mechanoid4000@aptoma.no' ) );
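A sketch of the trigger variant, assuming the email column of customers stores the reversed address (a BEFORE UPDATE twin would look the same, but then every UPDATE must hand over the plain address or it gets reversed twice):

CREATE TRIGGER reverse_email BEFORE INSERT ON customers
FOR EACH ROW SET NEW.email = REVERSE( NEW.email );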

Realistic data set when testing

One of the reasons why indexes should be revised over time is that MySQL chooses execution paths dynamically, based on statistics that are updated as the data changes. If you only have one row in a table, all the indexes in the world won’t make MySQL use them (except when querying with FORCE INDEX, but let’s forget about that ancient religion), because it knows that a sequential read of the entire data file will be just as fast or faster; with multi-million-row tables, a sequential read (query type ALL) is probably the last thing it wants to do. Cardinality is also important to keep in mind when testing, as indexes again won’t be used if your millions of rows of test data are all identical. The solution is to either make a more advanced random data creator, manually add realistic data, or export realistic data from some other live database.
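A crude sketch of the first option, doubling the people table with slight variation on each run (assuming id is AUTO_INCREMENT):

INSERT INTO people ( first, family )
SELECT CONCAT( first, FLOOR( RAND() * 100 ) ), family FROM people;
ANALYZE TABLE people; -- refresh the statistics the optimizer relies on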

Moral of the story

If you want the step up from slow to reasonable performance and beyond using MySQL, indexes are mandatory, but be wary of their size. You might want to add indexes in all colors and shapes available, but remember that the data is duplicated for each index, and when you INSERT, UPDATE or DELETE, all indexes have to be updated as well as the main table. Make the index as short as you can while keeping the cardinality as high as possible, have multi-field indexes that match as many of your queries as possible (rewrite the queries that can’t use an index with more constraints than needed, if you have to/can) and make sure the data types of your fields are as trim as possible. That last part will also come up in my next post. Stay tuned for more of my lies next week, and please comment on any errors in the post, or if you by some miracle learned anything from it. :)