Level up your MySQL query tuning
A closer look at query optimization with a focus on queries with GROUP BY and ORDER BY, from Percona consultant Alexander Rubin.
In this article we will talk about query optimization with the focus on the queries with GROUP BY and ORDER BY. We will start with the basic concepts (Indexes, Explain plan in MySQL, etc) and then will talk about the advanced query optimization techniques. We will cover “loose index scan” and “tight index scan” optimizations and show the benchmark results.
Indexes
In this article we will focus on the b-tree indexes only (InnoDB and MyISAM only support B-tree indexes). Figure 1 illustrates a basic b-tree index implementation.
B-tree supports both “equality” (where id = 1) and range (where date > ‘2013-01-01’ and date < ‘2013-07-01’) search. Figures 2 and 3 show examples of these, illustrated.
Figure 2 is of an equality search with primary or unique key: select select * from table where id = 12. In this scenario MySQL will be able to scan through the tree and go directly to 1 leaf and then stop. That is usually the fastest index scan operation.
In InnoDB primary key searches is even faster as InnoDB “clusters” records around primary key.
Range: select * from table where id in (6, 12, 18)
In this scenario MySQL will need to scan thru the tree and visit many leafs/nodes:
Table for the tests
I will use a couple of tables for the tests. The first table we will use to illustrate is explained below. This table is the part of MySQL “world” test db and can be downloaded from dev.mysql.com.
CREATE TABLE City (
ID int(11) NOT NULL AUTO_INCREMENT,
Name char(35) NOT NULL DEFAULT '',
CountryCode char(3) NOT NULL DEFAULT '',
District char(20) NOT NULL DEFAULT '',
Population int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (ID),
KEY CountryCode (CountryCode)
) Engine=InnoDB;
Explain plan
The main way to “profile” a query with MySQL is to use “explain”. The output below shows an example. As we can see from the explain, MySQL does not use any indexes (key: NULL) and will have to perform a full table scan.
mysql> EXPLAIN select * from City where Name = 'London'G *************************** 1. row id: 1 select_type: SIMPLE table: City type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 4079 Extra: Using where
In this case we can add an index to restrict the number of rows:
mysql> alter table City add key (Name);
Query OK, 4079 rows affected (0.02 sec)
Records: 4079 Duplicates: 0 Warnings: 0
The new explain shows that MySQL will use an
index:
mysql> explain select * from City where Name = 'London'G *********************** 1. row *************************** id: 1 select_type: SIMPLE table: City type: ref possible_keys: Name key: Name key_len: 35 ref: const rows: 1 Extra: Using where
Index usages
MySQL will choose only 1 index per query (and per table if the query joins multiple tables). In some cases MySQL can also intersect indexes, however we will not cover it in this article. MySQL uses index statistics to make a decision about the best possible index.
Combined Indexes
Combined indexes are very important for MySQL query optimizations. MySQL can use leftmost part of any index. For example, if we have this index:
Comb(CountryCode, District, Population)
CountryCode only part
CountryCode + District
CountryCode + District + Population
mysql> explain select * from City where CountryCode = 'USA'G ********************** 1. row ****************** table: City type: ref possible_keys: comb key: comb key_len: 3 ref: const rows: 273
Similarly, MySQL can use 2 leftmost fields or all 3 fields from the index. In this example, the 2 leftmost fields are used:
mysql> explain select * from City
where CountryCode = 'USA' and District = 'California'G
********************** 1. row ******************
table: City
type: ref
possible_keys: comb
key: comb
key_len: 23
ref: const,const
rows: 68
So MySQL will use 2 first fields from the comb key:
- CountryCode = 3 chars
- District = 20 chars
- Total = 23
Whereas in this one, all 3 fields are used from the index:
mysql> explain select * from City where CountryCode = 'USA' and District = 'California’ and population > 10000G ********************** 1. row ****************** table: City type: range possible_keys: comb key: comb key_len: 27 ref: NULL rows: 68
- CountryCode = 3 chars/bytes
- District = 20 chars/bytes
- Population = 4 bytes (INT)
- Total = 27
However, if the query does not have the first leftmost part of an index, MySQL will not be able to use it:
mysql> explain select * from City where District = 'California' and population > 10000G ********************** 1. row ****************** table: City type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 3868
As we do not have the CountryCode in the where clause, MySQL will not be able to use our “comb” index.
Covered index
The covered index is an index that covers all fields in the query. For example, for this query:
select name from City where CountryCode = 'USA' and District = 'Alaska' and population > 10000
the following index will be a “covered” index:
cov1(CountryCode, District, population, name)
The above index uses all fields in the query in particular order:
- Where part
- Group By/Order (not used now)
- Select part (here: name)
Here is an example:
mysql> explain select name from City where CountryCode = 'USA' and District = 'Alaska' and population > 10000G *************************** 1. row *********** table: City type: range possible_keys: cov1 key: cov1 key_len: 27 ref: NULL rows: 1 Extra: Using where; Using index
“Using index” in the extra field of the explain output means that MySQL will use our covered index. That also means that MySQL will use an index only to satisfy the query: all the information that MySQL needs is in the index. That is usually much faster, especially if we have a large text or blob fields in the table.
Order of the fields in index
The order of the fields in the index is very important. The way b-tree works, it is more beneficial to have a field which will be used for “equality” comparison first and the fields with “range” (more than and less than comparison) second.
For example, take the following query:
select * from City where district = 'California' and population > 30000
The best Index will be on (district, population), in this particular order.
- (district, population) index [Figure 4]: In this case MySQL will be able to go “directly” (via index scan) to the correct district (“CA”) and do a range scan by population. All other nodes for the ”district” field (other US states in this example) will not be scanned.
- (population, district) index [Figure 5]: In this example, MySQL will have to do a “range” scan for the population and for each index record it will have to check for the correct district, which will be slower.
Complex Slow Queries
In this article we will be concentrating on 2 major query types:
- Queries with “group by”
- Queries with “order by”
Those queries are usually the slowest ones. We will show how to optimize those queries and decrease the query response time as well as the application performance in general.
Group by example
Let’s look at this simple example, a query of how many cities there are in each country.
mysql> explain select name from City where CountryCode = 'USA' and District = 'Alaska' and population > 10000G *************************** 1. row *********** table: City type: range possible_keys: cov1 key: cov1 key_len: 27 ref: NULL rows: 1 Extra: Using where; Using index
As we can see the output above, MySQL does not use any indexes (no proper indexes are available), but it also shows “Using temporary; Using filesort”. MySQL will need to create a temporary table to satisfy the “Group by” clause if there is no appropriate index (You can find more information about the temporary tables here). However, MySQL can use a combined index to satisfy the “group by” clause and avoid creating temporary table.
Group by and covered index
To illustrate the “group by” queries I will use the following table:
CREATE TABLE ontime_2012 ( YearD int(11) DEFAULT NULL, MonthD tinyint(4) DEFAULT NULL, DayofMonth tinyint(4) DEFAULT NULL, DayOfWeek tinyint(4) DEFAULT NULL, Carrier char(2) DEFAULT NULL, Origin char(5) DEFAULT NULL, DepDelayMinutes int(11) DEFAULT NULL, ... ) ENGINE=InnoDB DEFAULT CHARSET=latin1
The table contains freely available airline performance statistics. The table is 6 million rows and approximately 2GB in size.
- From this data we want to:
- Find maximum delay for flights on Sunday
- Group by airline
Our example query is:
select max(DepDelayMinutes), carrier, dayofweek from ontime_2012 where dayofweek = 7 group by Carrier
The explain plan is:
... type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 4833086 Extra: Using where; Using temporary; Using filesort
As we can see, MySQL does not use any index and have to scan ~4M rows. In addition, it will have to create a large temporary table.
If we will create an index on “dayofweek” it will only filter out some rows and MySQL will still need to create a temporary table:
mysql> explain select name from City where CountryCode = 'USA' and District = 'Alaska' and population > 10000G *************************** 1. row *********** table: City type: range possible_keys: cov1 key: cov1 key_len: 27 ref: NULL rows: 1 Extra: Using where; Using index
However, we can create a covered index on (dayofweek, Carrier, DepDelayMinutes) in this particular order. In this case MySQL will be able to use this index and avoid creating a temporary table:
mysql> alter table ontime_2012 add key covered(dayofweek, Carrier, DepDelayMinutes); explain select max(DepDelayMinutes), Carrier, dayofweek from ontime_2012 where dayofweek =7 group by CarrierG ... possible_keys: DayOfWeek,covered key: covered key_len: 2 ref: const rows: 905138 Extra: Using where; Using index
As we can see from the explain, MySQL will use our index and will avoid creating a temporary table. This is the fastest possible solution.
Note that MySQL will also be able to use non-covered index on (dayofweek, Carrier) and avoid creating a temporary table. However, covered index will be faster as MySQL will be able to satisfy the whole query by just reading the index.
Group by and a range scan
The covered index works well if we have a “const” (where dayofweek=N). However, MySQL will not be able to use an index and avoid a filesort if we have a “range” scan in the where clause. Here’s an example:
mysql> explain select max(DepDelayMinutes), Carrier, dayofweek from ontime_2012 where dayofweek > 5 group by Carrier, dayofweekG ... type: range possible_keys: covered key: covered key_len: 2 ref: NULL rows: 2441781 Extra: Using where; Using index; Using temporary; Using filesort
MySQL will still have to create a temporary table. To fix that we can use a simple trick and rewrite the query into 2 parts with UNION:
(select max(DepDelayMinutes), Carrier, dayofweek from ontime_2012 where dayofweek = 6 group by Carrier, dayofweek) union (select max(DepDelayMinutes), Carrier, dayofweek from ontime_2012 where dayofweek = 7 group by Carrier, dayofweek)
For each of the 2 queries in the union MySQL will be able to use an index instead of creating a temporary table, as shown in the explain plan below:
*************************** 1. row *************************** table: ontime_2012 key: covered ... Extra: Using where; Using index *************************** 2. row *************************** table: ontime_2012 key: covered ... Extra: Using where; Using index *************************** 3. row *************************** id: NULL select_type: UNION RESULT table: <union1,2> type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: NULL Extra: Using temporary
As we can see, MySQL uses our covered index for each of the 2
queries. It will still have to create a temporary table to merge
the results, however, it will probably be much smaller temporary
table as it will only need to store
the resultsets
of 2 queries.
Group by and a loose index scan
Loose index scan is another MySQL algorithm to optimise Group By queries. Loose index scan considers only a fraction of the keys in an index and is very fast. The following limitations apply:
- The query is over a single table.
- The GROUP BY names only columns that form a leftmost prefix of the index and no other columns.
- The only aggregate functions used in the select list (if any) are MIN() and MAX(), same column
More information can be found in MySQL documentation on Loose Index Scan.
For loose index scan to work we will need to create an additional index and start with columns in the “Group by” clause and then all fields in the “where” clause (order of the fields in the index matters!). For example, for our query:
select max(DepDelayMinutes) as ddm, Carrier, dayofweek from ontime_2012 where dayofweek = 5 group by Carrier, dayofweek
We will need to create this index:
KEY loose_index_scan (Carrier,DayOfWeek,DepDelayMinutes)
Note that looseindexscan is only a name of the index, it can be any name:
mysql> explain select max(DepDelayMinutes) as ddm, Carrier, dayofweek from ontime_2012 where dayofweek = 5 group by Carrier, dayofweek table: ontime_2012 type: range possible_keys: covered key: loose_index_scan key_len: 5 ref: NULL rows: 201 Extra: Using where; Using index for group-by
“Using index for group-by” in the where clause means that MySQL will use the loose index scan. Loose index scan is very fast as it only scans a fraction of the key. It will also work with the range scan:
mysql> explain select max(DepDelayMinutes) as ddm, Carrier, dayofweek from ontime_2012 where dayofweek > 5 group by Carrier, dayofweek; table: ontime_2012 type: range possible_keys: covered key: loose_index_scan key_len: 5 ref: NULL rows: 213 Extra: Using where; Using index for group-by;
Benchmark
We have performed a benchmark to compare query speed with a temporary table, a tight index scan (covered index) and a loose index scan. The table is 6 million rows and approximately 2GB in size.
As we can see, loose index scan index shows the best performance. Unfortunately, loose index scan only works with 2 aggregate functions, min and max for the group by. “AVG” + group by does not work with loose index scan. As we can see below, MySQL will use a covered index (not looseindexscan index) and, because we have a range in the where clause (dayofweek > 5), it will have to create a temporary table.
mysql> explain select avg(DepDelayMinutes) as ddm, Carrier, dayofweek from ontime_2012 where dayofweek >5 group by Carrier, dayofweek G table: ontime_2012 type: range key: covered key_len: 2 ref: NULL rows: 2961617 Extra: Using where; Using index; Using temporary; Using fileso
ORDER BY and filesort
MySQL may have to perform a “filesort” operation when a query uses the “order by” clause. You can find more information about “filesort” here.
The filesort operation below is usually pretty slow operation (even if it does not involve creating a file on disk), especially if MySQL will have to sort a lots of rows:
table: City type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 4079 Extra: Using where; Using filesort
To optimize this query we can use a combined index:
mysql> alter table City add key my_sort2 (CountryCode, population); mysql> explain select district, name, population from City where CountryCode = 'USA' order by population desc limit 10G table: City type: ref key: my_sort2 key_len: 3 ref: const rows: 207 Extra: Using where
MySQL was able to use our combined index to avoid sorting: as the index is sorted, MySQL was able to read the index leafs in the correct order.
Sorting and limit
If we have a “LIMIT” clause in our query (and the limit is relatively small (i.e. LIMIT 10 or LIMIT 100) compared to the amount of rows in the table), MySQL can avoid using a filesort and can utilize index instead.
Here’s an example:
mysql> alter table ontime_2012 add key (DepDelayMinutes);
We can create an index on DepDelayMinutes fields only and run the explain below (note the query with LIMIT 10):
mysql> explain select * from ontime_2012 where dayofweek in (6,7) order by DepDelayMinutes desc limit 10G type: index possible_keys: DayOfWeek,covered key: DepDelayMinutes key_len: 5 ref: NULL rows: 24 Extra: Using where