MySQL – Configuration, query cache and other thingamajigs (part 1 of 2)

This is the third post in a short series about MySQL. Read the first here, and the second here. This one covers this and that and has no set topic. The post is a bit long, so I’ve split it in two. The next part will be released next week. Stay tuned with rss.

Natural keys versus surrogate keys

Don’t always add an auto_increment integer as a key when a naturally unique field ( or a group of fields ) exists of reasonable size. The best example of this a join-table where you normally have two primary keys from some other tables where all SELECT queries will be conditional on them. When creating multi-field primary keys, you should also have in mind the order in which you add them. Have the field you want to do most queries on first so the index can be used. If both fields are equally important, also add an index on the other field.

COUNT(*)

When using MyISAM, COUNT(*) needs almost no processing as the primary key index knows exactly how many primary keys there are, and the query is very fast. But the index in an InnoDB table does not have this feature, (because of the Multi Version Concurrency Control, MVCC) and one of the way it finds out how many rows there are is to sequentially read them all, an operation potentially tens of thousand times slower. There are a few alternatives and the first you should do is to decide how exact you need the the number to be. The best solution when it has to be exact is to add ON INSERT and ON DELETE triggers to the table and increment/decrement an integer in some other table. That table could use the MEMORY engine for optimal speed. Other solutions could be to manually increment/decrement a value in Memcached on INSERT and DELETE queries or have a cron-job execute a COUNT(*) once in a while and cache the value. If one row is INSERTed per day or several rows are INSERTed once a day at a specific time, then there is no point in counting the rows more than than once a day, preferably right after the INSERTs have finished.
Insted of

SELECT COUNT(*) FROM wins WHERE player_id = 123;

to find out how many wins a player has, why not just have ‘wins‘ as a field in the users table and increment when the player is winner?

Don’t use functions on indexed fields

When using indexes on a field its also important to not use SQL-functions on the field but list it unchanged alone on one side of the equality sign. In this example only the last will use the index on the field date:

SELECT date FROM table WHERE UNIX_TIMESTAMP( date ) = 1057941242;
SELECT date FROM table WHERE date = FROM_UNIXTIME( 1057941242 );

UNIQUE and LIMIT 1

When a field will be unique but is not suitable as primary key, UNIQUE could be a good choice for it. This is because UNIQUE is in fact treated the same way as the primary key and the optimizer will know that only one row will match. Both lookup times and memory allocation will probably be affected by this, and some of the same performance boost can be had by applying LIMIT 1 to queries where only one row is updated, such as:

UPDATE users SET kickass = 1 WHERE 
name = 'nameless space marine' LIMIT 1;

INSERT DELAYED

MyISAM supports INSERT queries where index updating is held off until other queries with higher priority has completed. The performance boost is supposed to be significant in high load situations, but the chance of out-of-sync indexes increases where a slow REPAIR TABLE and an index rebuild could be needed after a system crash.
http://dev.mysql.com/doc/refman/5.0/en/insert-delayed.html

Slim is fast

Keep all fields as trim as possible. Use unsigned numbers to double the range when only positive numbers are needed and use TINYINT, SMALLINT, MEDIUMINT and limited VARCHARS as much as possible. You might need INT as AUTO_INCREMENT primary key on the users table, but an unsigned TINYINT ( 0-255 ) could be plenty for the categories table. Its a bit counter-intuitive, but splitting nullable-, text-, blob- and less accessed fields into a one-to-one sister table could give you a performance boost. This way you can faster and with less memory load more rows of the more accessed part of the table.
http://dev.mysql.com/doc/refman/5.0/en/data-types.html

Update internal statistics

There are commands which will update cardinality and indexes in a table. For MyISAM the command is OPTIMIZE TABLE and for InnoDB it’s ANALYZE TABLE. Its worth mentioning that OPTIMIZE TABLE will issue a write lock on the table and block all other queries until its done.
http://dev.mysql.com/doc/refman/5.0/en/optimize-table.html
http://dev.mysql.com/doc/refman/5.0/en/analyze-table.html

mysqlhotcopy

When doing a database dump on a large MyISAM table, consider using mysqlhotcopy instead of mysqldump.
http://dev.mysql.com/doc/refman/5.0/en/mysqlhotcopy.html

Query cache

By default MySQLs query cache is disabled and supports three modes: OFF, ON and DEMAND. OFF prevents caching or retrieval of any results, ON caches/retrieves as much as possible except statements that begin width SELECT SQL_NO_CACHE. DEMAND only caches queries starting with SELECT SQL_CACHE. Query cache is between 200 and (infinity+1)% faster on a cache hit, but any miss is normally 5-25% slower. However, both these numbers can vary a lot. On a table with a high number of reads with a high probability of identical queries, query cache will probably increase performance significantly, but since any INSERT, DELETE or UPDATE will invalidate all cache on that table, performance on a table with frequent writes is likely to drop. What’s also important is that the subsequent queries must be identical to the first for its cache to be used. That includes case, white space and brackets. When actively using query cache on tables with dates it’s also important to write reusable and cacheable queries. Any use of NOW() or equivalent non-deterministic functions will render query cache unused even within the same second, so date and time should be calculated outside of MySQL. When doing so, first analyze the need for an exact query. If you can get away with < 1 minute accuracy, don’t add date( ‘Y-m-d H:i:s’ ) but date( ‘Y-m-d H:i:00′ ) so any identical queries within the same minute will get the cached result.
http://dev.mysql.com/doc/refman/5.0/en/query-cache.html

SELECT * ©

You don’t need everything so don’t ask for it! It will only slow down your query, increase memory usage while sorting and make the query less understandable when running ‘SHOW PROCESSLIST’.

GROUP BY or DISTINCT

Both these can be used with the same result but performance-wise, there is a difference. For simpler queries, DISTINCT is often faster but in more advanced queries GROUP BY could be the best choice. Benchmark your queries to find out which one to choose.

IP addresses

As we normally see IP addresses in the form of 000.000.000.000, VARCHAR( 15 ) might be natural to use, but you should treat it as a number because it will both take up less space and could be faster and easier to query. The data type of choice for an IPv4 address is the four byte unsigned INT. You then use INET_ATON and INET_NTOA to convert that number to the dotted-quad representation we normally use. (Or keep it as an int, it will always be superior to the string representation! )

INSERT INTO blacklist SET ( ip, date ) VALUES 
( INET_ATON( '209.207.224.40' ), NOW() );

SELECT date FROM blacklist WHERE ip = INET_ATON( '209.207.224.40' );
SELECT INET_NTOA( ip ) FROM blacklist WHERE ip >= INET_ATON( '127.0' )
  AND ip < INET_ATON( '128.1' );

The next part will be released next week. Stay tuned with rss.