Ding!

Level up your MySQL query tuning

AlexanderRubin

A closer look at query optimization with a focus on queries with GROUP BY and ORDER BY, from Percona consultant Alexander Rubin.

 In this article we will talk about query optimization with the focus on the queries with GROUP BY and ORDER BY. We will start with the basic concepts (Indexes, Explain plan in MySQL, etc) and then will talk about the advanced query optimization techniques. We will cover “loose index scan” and “tight index scan” optimizations and show the benchmark results.

Indexes

In this article we will focus on the b-tree indexes only (InnoDB and MyISAM only support B-tree indexes). Figure 1 illustrates a basic b-tree index implementation.

 

B-tree supports both “equality” (where id = 1) and range (where date > ‘2013-01-01’ and date < ‘2013-07-01’) search. Figures 2 and 3 show examples of these, illustrated.

Figure 2 is of an equality search with primary or unique key: select select * from table where id = 12. In this scenario MySQL will be able to scan through the tree and go directly to 1 leaf and then stop. That is usually the fastest index scan operation.

In InnoDB primary key searches is even faster as InnoDB “clusters” records around primary key.

Range: select * from table where id in (6, 12, 18)

In this scenario MySQL will need to scan thru the tree and visit many leafs/nodes:

 

Table for the tests

I will use a couple of tables for the tests. The first table we will use to illustrate is explained below. This table is the part of MySQL “world” test db and can be downloaded from dev.mysql.com.

CREATE TABLE City (
    ID int(11) NOT NULL AUTO_INCREMENT,
    Name char(35) NOT NULL DEFAULT '',
    CountryCode char(3) NOT NULL DEFAULT '',
    District char(20) NOT NULL DEFAULT '',
    Population int(11) NOT NULL DEFAULT '0',
    PRIMARY KEY (ID),
    KEY CountryCode (CountryCode)
) Engine=InnoDB;

Explain plan

The main way to “profile” a query with MySQL is to use “explain”. The output below shows an example. As we can see from the explain, MySQL does not use any indexes (key: NULL) and will have to perform a full table scan.

mysql> EXPLAIN select * from City where Name = 'London'G
*************************** 1. row
           id: 1
  select_type: SIMPLE
        table: City
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 4079
        Extra: Using where

In this case we can add an index to restrict the number of rows:

mysql> alter table City add key (Name);
Query OK, 4079 rows affected (0.02 sec)
Records: 4079  Duplicates: 0  Warnings: 0
The new explain shows that MySQL will use an index:
mysql> explain select * from City where Name = 'London'G
*********************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: City
         type: ref
possible_keys: Name
          key: Name
      key_len: 35
          ref: const
         rows: 1
        Extra: Using where

Index usages

MySQL will choose only 1 index per query (and per table if the query joins multiple tables). In some cases MySQL can also intersect indexes, however we will not cover it in this article. MySQL uses index statistics to make a decision about the best possible index.

Combined Indexes

Combined indexes are very important for MySQL query optimizations. MySQL can use leftmost part of any index. For example, if we have this index: 

Comb(CountryCode, District, Population)
Then MySQL can use:
CountryCode only part
CountryCode + District
CountryCode + District + Population
From the explain plan in below we can understand which part(s) will be used:
mysql> explain select * from City 
      where CountryCode = 'USA'G
********************** 1. row ******************
        table: City
         type: ref
possible_keys: comb
          key: comb
      key_len: 3
          ref: const
         rows: 273 
Note the key_len part – it shows: 3. This is the number of bytes used from our index. As the CountryCode field is declared as char(3), that means that MySQL will use the first field from our combined index.

Similarly, MySQL can use 2 leftmost fields or all 3 fields from the index. In this example, the 2 leftmost fields are used:

mysql> explain select * from City 
where CountryCode = 'USA' and District = 'California'G
********************** 1. row ******************
        table: City
         type: ref
possible_keys: comb
          key: comb
      key_len: 23
          ref: const,const
         rows: 68

So MySQL will use 2 first fields from the comb key:

  • CountryCode = 3 chars
  • District = 20 chars
  • Total = 23

Whereas in this one, all 3 fields are used from the index:

mysql> explain select * from City 
where CountryCode = 'USA' and District = 'California’
and population > 10000G
********************** 1. row ******************
        table: City
         type: range
possible_keys: comb
          key: comb
      key_len: 27
          ref: NULL
         rows: 68
  • CountryCode = 3 chars/bytes
  • District = 20 chars/bytes
  • Population = 4 bytes (INT)
  • Total = 27

However, if the query does not have the first leftmost part of an index, MySQL will not be able to use it:

mysql> explain select * from City where  
District = 'California' and population > 10000G
********************** 1. row ******************
 table: City
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 3868

As we do not have the CountryCode in the where clause, MySQL will not be able to use our “comb” index.

Covered index

The covered index is an index that covers all fields in the query. For example, for this query:

select name from City  where CountryCode = 'USA' and District = 'Alaska' 
and population > 10000

the following index will be a “covered” index:

cov1(CountryCode, District, population, name)

The above index uses all fields in the query in particular order:

  1. Where part
  2. Group By/Order (not used now)
  3. Select part (here: name)

Here is an example:

mysql> explain select name from City  where CountryCode = 'USA' and District = 'Alaska' and population > 10000G
*************************** 1. row ***********
        table: City
         type: range
possible_keys: cov1
          key: cov1
      key_len: 27
          ref: NULL
         rows: 1
        Extra: Using where; Using index

“Using index” in the extra field of the explain output means that MySQL will use our covered index. That also means that MySQL will use an index only to satisfy the query: all the information that MySQL needs is in the index. That is usually much faster, especially if we have a large text or blob fields in the table.

Order of the fields in index

The order of the fields in the index is very important. The way b-tree works, it is more beneficial to have a field which will be used for “equality” comparison first and the fields with “range” (more than and less than comparison) second.

For example, take the following query:

select * from City where district = 'California' and population > 30000

The best Index will be on (district, population), in this particular order.

  1. (district, population) index [Figure 4]: In this case MySQL will be able to go “directly” (via index scan) to the correct district (“CA”) and do a range scan by population. All other nodes for the ”district” field (other US states in this example) will not be scanned.

  1. (population, district) index [Figure 5]: In this example, MySQL will have to do a “range” scan for the population and for each index record it will have to check for the correct district, which will be slower.

Complex Slow Queries

In this article we will be concentrating on 2 major query types:

  • Queries with “group by”
  • Queries with “order by”

Those queries are usually the slowest ones. We will show how to optimize those queries and decrease the query response time as well as the application performance in general.

Group by example

Let’s look at this simple example, a query of how many cities there are in each country.

mysql> explain select name from City  where CountryCode = 'USA' and District = 'Alaska' and population > 10000G
*************************** 1. row ***********
        table: City
         type: range
possible_keys: cov1
          key: cov1
      key_len: 27
          ref: NULL
         rows: 1
        Extra: Using where; Using index

As we can see the output above, MySQL does not use any indexes (no proper indexes are available), but it also shows “Using temporary; Using filesort”. MySQL will need to create a temporary table to satisfy the “Group by” clause if there is no appropriate index (You can find more information about the temporary tables here). However, MySQL can use a combined index to satisfy the “group by” clause and avoid creating temporary table.

Group by and covered index

To illustrate the “group by” queries I will use the following table:

CREATE TABLE ontime_2012 (
    YearD int(11) DEFAULT NULL,
    MonthD tinyint(4) DEFAULT NULL,
    DayofMonth tinyint(4) DEFAULT NULL,
    DayOfWeek tinyint(4) DEFAULT NULL,
    Carrier char(2) DEFAULT NULL,
    Origin char(5) DEFAULT NULL,
    DepDelayMinutes int(11) DEFAULT NULL,
    ...
) ENGINE=InnoDB DEFAULT CHARSET=latin1 

The table contains freely available airline performance statistics. The table is 6 million rows and approximately 2GB in size.

  • From this data we want to:
  • Find maximum delay for flights on Sunday
  • Group by airline

Our example query is:

select max(DepDelayMinutes), 
carrier, dayofweek 
from ontime_2012 
where dayofweek = 7 
group by Carrier

The explain plan is:

...
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 4833086
       Extra: Using where; Using temporary; Using filesort

As we can see, MySQL does not use any index and have to scan ~4M rows. In addition, it will have to create a large temporary table.

If we will create an index on “dayofweek” it will only filter out some rows and MySQL will still need to create a temporary table:

mysql> explain select name from City  where CountryCode = 'USA' and District = 'Alaska' and population > 10000G
*************************** 1. row ***********
        table: City
         type: range
possible_keys: cov1
          key: cov1
      key_len: 27
          ref: NULL
         rows: 1
        Extra: Using where; Using index

However, we can create a covered index on (dayofweek, Carrier, DepDelayMinutes) in this particular order. In this case MySQL will be able to use this index and avoid creating a temporary table:

mysql> alter table ontime_2012 
add key covered(dayofweek, Carrier, DepDelayMinutes);

explain select max(DepDelayMinutes), Carrier, dayofweek from ontime_2012 where dayofweek =7 group by CarrierG
...             
possible_keys: DayOfWeek,covered
         key: covered
     key_len: 2
         ref: const
        rows: 905138
       Extra: Using where; Using index

As we can see from the explain, MySQL will use our index and will avoid creating a temporary table. This is the fastest possible solution.

Note that MySQL will also be able to use non-covered index on (dayofweek, Carrier) and avoid creating a temporary table. However, covered index will be faster as MySQL will be able to satisfy the whole query by just reading the index.

Group by and a range scan

The covered index works well if we have a “const” (where dayofweek=N). However, MySQL will not be able to use an index and avoid a filesort if we have a “range” scan in the where clause. Here’s an example:

mysql> explain select max(DepDelayMinutes), Carrier, dayofweek from ontime_2012 
where dayofweek > 5 group by Carrier, dayofweekG
...

        type: range
possible_keys: covered
         key: covered
     key_len: 2
         ref: NULL
        rows: 2441781
       Extra: Using where; Using index; Using temporary; Using filesort

MySQL will still have to create a temporary table. To fix that we can use a simple trick and rewrite the query into 2 parts with UNION:

(select max(DepDelayMinutes), Carrier, dayofweek 
from ontime_2012 
where dayofweek = 6 
group by Carrier, dayofweek)
union
(select max(DepDelayMinutes), Carrier, dayofweek 
from ontime_2012 
where dayofweek = 7 
group by Carrier, dayofweek)

For each of the 2 queries in the union MySQL will be able to use an index instead of creating a temporary table, as shown in the explain plan below:

*************************** 1. row ***************************
       table: ontime_2012
         key: covered
...
       Extra: Using where; Using index
*************************** 2. row ***************************
       table: ontime_2012
         key: covered
...
       Extra: Using where; Using index
*************************** 3. row ***************************
          id: NULL
 select_type: UNION RESULT
       table: <union1,2>
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: NULL
       Extra: Using temporary

As we can see, MySQL uses our covered index for each of the 2 queries. It will still have to create a temporary table to merge the results, however, it will probably be much smaller temporary table as it will only need to store the resultsets of 2 queries.

Group by and a loose index scan

Loose index scan is another MySQL algorithm to optimise Group By queries. Loose index scan considers only a fraction of the keys in an index and is very fast. The following limitations apply:

  • The query is over a single table.
  • The GROUP BY names only columns that form a leftmost prefix of the index and no other columns.
  • The only aggregate functions used in the select list (if any) are MIN() and MAX(), same column

More information can be found in MySQL documentation on Loose Index Scan.

For loose index scan to work we will need to create an additional index and start with columns in the “Group by” clause and then all fields in the “where” clause (order of the fields in the index matters!). For example, for our query:

select max(DepDelayMinutes) as ddm, Carrier, dayofweek from ontime_2012  where dayofweek = 5  group by Carrier, dayofweek

We will need to create this index:

KEY loose_index_scan (Carrier,DayOfWeek,DepDelayMinutes)

Note that looseindexscan is only a name of the index, it can be any name:

mysql> explain select max(DepDelayMinutes) as ddm, Carrier, dayofweek from ontime_2012  where dayofweek = 5  group by Carrier, dayofweek

       table: ontime_2012
        type: range
possible_keys: covered
         key: loose_index_scan
     key_len: 5
         ref: NULL
        rows: 201
       Extra: Using where; Using index for group-by

“Using index for group-by” in the where clause means that MySQL will use the loose index scan. Loose index scan is very fast as it only scans a fraction of the key. It will also work with the range scan:

mysql> explain select max(DepDelayMinutes) as ddm, Carrier, dayofweek from ontime_2012 
where dayofweek > 5 group by Carrier, dayofweek;

table: ontime_2012
        type: range
possible_keys: covered
         key: loose_index_scan
     key_len: 5
         ref: NULL
        rows: 213
       Extra: Using where; Using index for group-by; 

Benchmark

We have performed a benchmark to compare query speed with a temporary table, a tight index scan (covered index) and a loose index scan. The table is 6 million rows and approximately 2GB in size.

As we can see, loose index scan index shows the best performance. Unfortunately, loose index scan only works with 2 aggregate functions, min and max for the group by. “AVG” + group by does not work with loose index scan. As we can see below, MySQL will use a covered index (not looseindexscan index) and, because we have a range in the where clause (dayofweek > 5), it will have to create a temporary table.

mysql> explain select avg(DepDelayMinutes) as ddm, Carrier, dayofweek from ontime_2012 where dayofweek >5  group by Carrier, dayofweek G 

       table: ontime_2012
        type: range
         key: covered
     key_len: 2
         ref: NULL
        rows: 2961617
       Extra: Using where; Using index; Using temporary; Using fileso

ORDER BY and filesort

MySQL may have to perform a “filesort” operation when a query uses the “order by” clause. You can find more information about “filesort” here.

The filesort operation below is usually pretty slow operation (even if it does not involve creating a file on disk), especially if MySQL will have to sort a lots of rows:

table: City
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 4079
        Extra: Using where; Using filesort

To optimize this query we can use a combined index:

mysql> alter table City 
add key my_sort2 (CountryCode, population);

mysql> explain select district, name, population from City where CountryCode = 'USA' order by population desc limit 10G

       table: City
        type: ref
         key: my_sort2
     key_len: 3
         ref: const
        rows: 207
       Extra: Using where

MySQL was able to use our combined index to avoid sorting: as the index is sorted, MySQL was able to read the index leafs in the correct order.

Sorting and limit

If we have a “LIMIT” clause in our query (and the limit is relatively small (i.e. LIMIT 10 or LIMIT 100) compared to the amount of rows in the table), MySQL can avoid using a filesort and can utilize index instead.

Here’s an example:

mysql> alter table ontime_2012 add key (DepDelayMinutes);

We can create an index on DepDelayMinutes fields only and run the explain below (note the query with LIMIT 10):

mysql> explain select * from ontime_2012 
where dayofweek in (6,7) order by DepDelayMinutes desc
limit 10G

        type: index
possible_keys: DayOfWeek,covered
         key: DepDelayMinutes
     key_len: 5
         ref: NULL
        rows: 24
       Extra: Using where
A closer look at query optimization with a focus on queries with GROUP BY and ORDER BY, from Percona consultant Alexander Rubin. This article was originally published on webandphp.com.

Image by John Dronges.
Author
Comments
comments powered by Disqus