Alternative title: The things we do to store U+1F4A9 PILE OF POO (💩
) correctly.
Are you using MySQL’s utf8
charset in your databases? In this write-up I’ll explain why you should switch to utf8mb4
instead, and how to do it.
UTF-8
The UTF-8 encoding can represent every symbol in the Unicode character set, which ranges from U+000000 to U+10FFFF. That’s 1,114,112 possible symbols. (Not all of these Unicode code points have been assigned characters yet, but that doesn’t stop UTF-8 from being able to encode them.)
UTF-8 is a variable-width encoding; it encodes each symbol using one to four 8-bit bytes. Symbols with lower numerical code point values are encoded using fewer bytes. This way, UTF-8 is optimized for the common case where ASCII characters and other BMP symbols (whose code points range from U+000000 to U+00FFFF) are used — while still allowing astral symbols (whose code points range from U+010000 to U+10FFFF) to be stored.
MySQL’s utf8
For a long time, I was using MySQL’s utf8
charset for databases, tables, and columns, assuming it mapped to the UTF-8 encoding described above. By using utf8
, I’d be able to store any symbol I want in my database — or so I thought.
While writing about JavaScript’s internal character encoding, I noticed that there was no way to insert the U+1D306 TETRAGRAM FOR CENTRE (𝌆
) symbol into the MySQL database behind this site. The column I was trying to update had the utf8_unicode_ci
collation, and the connection charset was set to utf8
.
mysql> SET NAMES utf8; # just to emphasize that the connection charset is set to `utf8`
Query OK, 0 rows affected (0.00 sec)
mysql> UPDATE database_name.table_name SET column_name = 'foo𝌆bar' WHERE id = 9001;
Query OK, 1 row affected, 1 warning (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 1
mysql> SELECT column_name FROM database_name.table_name WHERE id = 9001;
+-------------+
| column_name |
+-------------+
| foo |
+-------------+
1 row in set (0.00 sec)
The content got truncated at the first astral Unicode symbol, in this case 𝌆
— so, attempting to insert foo𝌆bar
actually inserted foo
instead, resulting in data loss (and possibly introducing security issues; see below). MySQL returned a warning message, too:
mysql> SHOW WARNINGS;
+---------+------+------------------------------------------------------------------------------+
| Level | Code | Message |
+---------+------+------------------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xF0\x9D\x8C\x86' for column 'column_name' at row 1 |
+---------+------+------------------------------------------------------------------------------+
1 row in set (0.00 sec)
Turns out MySQL’s utf8
charset only partially implements proper UTF-8 encoding. It can only store UTF-8-encoded symbols that consist of one to three bytes; encoded symbols that take up four bytes aren’t supported.
Since astral symbols (whose code points range from U+010000 to U+10FFFF) each consist of four bytes in UTF-8, you cannot store them using MySQL’s utf8
implementation.
This doesn’t just affect the 𝌆
character, but more important symbols like U+01F4A9 PILE OF POO (💩
) as well. In total, that’s 1,048,575 possible code points you can’t use. In fact, MySQL’s utf8
only allows you to store 5.88% ((0x00FFFF + 1) / (0x10FFFF + 1)
) of all possible Unicode code points. Proper UTF-8 can encode 100% of all Unicode code points.
As shown above, this behavior can lead to data loss, but it gets worse — it can result in security vulnerabilities. Here are some examples, all of which were discovered after publishing this write-up:
- PHP object injection vulnerability in WordPress < 3.6.1, leading to remote code execution in combination with certain WordPress plugins
- Email authentication bypass in Phabricator
- Stored XSS in WordPress 4.1.2
- Remote command execution in the Joomla! CMS
TL;DR MySQL’s utf8
encoding is awkwardly named, as it’s different from proper UTF-8 encoding. It doesn’t offer full Unicode support, which can lead to data loss or security vulnerabilities.
MySQL’s utf8mb4
Luckily, MySQL 5.5.3 (released in early 2010) introduced a new encoding called utf8mb4
which maps to proper UTF-8 and thus fully supports Unicode, including astral symbols.
Switching from MySQL’s utf8
to utf8mb4
Step 1: Create a backup
Create a backup of all the databases on the server you want to upgrade. Safety first!
Step 2: Upgrade the MySQL server
Upgrade the MySQL server to v5.5.3+, or ask your server administrator to do it for you.
Step 3: Modify databases, tables, and columns
Change the character set and collation properties of the databases, tables, and columns to use utf8mb4
instead of utf8
.
# For each database:
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
# For each table:
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
# For each column:
ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
# (Don’t blindly copy-paste this! The exact statement depends on the column type, maximum length, and other properties. The above line is just an example for a `VARCHAR` column.)
Since utf8mb4
is fully backwards compatible with utf8
, no mojibake or other forms of data loss should occur. (But you have a backup, right?)
Step 4: Check the maximum length of columns and index keys
This is probably the most tedious part of the whole upgrading process.
When converting from utf8
to utf8mb4
, the maximum length of a column or index key is unchanged in terms of bytes. Therefore, it is smaller in terms of characters, because the maximum length of a character is now four bytes instead of three.
For example, a TINYTEXT
column can hold up to 255 bytes, which correlates to 85 three-byte or 63 four-byte characters. Let’s say you have a TINYTEXT
column that uses utf8
but must be able to contain more than 63 characters. Given this requirement, you can’t convert this column to utf8mb4
unless you also change the data type to a longer type such as TEXT
— because if you’d try to fill it with four-byte characters, you’d only be able to enter 63 characters, but not more.
The same goes for index keys. The InnoDB
storage engine has a maximum index length of 767 bytes, so for utf8
or utf8mb4
columns, you can index a maximum of 255 or 191 characters, respectively. If you currently have utf8
columns with indexes longer than 191 characters, you will need to index a smaller number of characters when using utf8mb4
. (Because of this, I had to change some indexed VARCHAR(255)
columns to VARCHAR(191)
.)
Section 10.1.11 of the MySQL 5.5 Reference Manual has some more information on this.
Step 5: Modify connection, client, and server character sets
In your application code, set the connection character set to utf8mb4
. This can be done by simply replacing any variants of SET NAMES utf8
with SET NAMES utf8mb4
. If your old SET NAMES
statement specified the collation, make sure to change that as well, e.g. SET NAMES utf8 COLLATE utf8_unicode_ci
becomes SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci
.
Make sure to set the client and server character set as well. I have the following in my MySQL configuration file (/etc/my.cnf
):
[client]
default-character-set = utf8mb4
[mysql]
default-character-set = utf8mb4
[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
You can easily confirm these settings work correctly:
mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name | Value |
+--------------------------+--------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| collation_connection | utf8mb4_unicode_ci |
| collation_database | utf8mb4_unicode_ci |
| collation_server | utf8mb4_unicode_ci |
+--------------------------+--------------------+
10 rows in set (0.00 sec)
As you can see, all the relevant options are set to utf8mb4
, except for character_set_filesystem
which should be binary
unless you’re on a file system that supports multi-byte UTF-8-encoded characters in file names, and character_set_system
which is always utf8
and can’t be overridden.
Note: The default character set and collation can be configured at some other levels as well.
Step 6: Repair and optimize all tables
After upgrading the MySQL server and making the necessary changes explained above, make sure to repair and optimize all databases and tables. I didn’t do this right away after upgrading (I didn’t think it was necessary, as everything seemed to work fine at first glance), and ran into some weird bugs where UPDATE
statements didn’t have any effect, even though no errors were thrown.
You could run the following MySQL queries for each table you want to repair and optimize:
# For each table
REPAIR TABLE table_name;
OPTIMIZE TABLE table_name;
Luckily, this can easily be done in one go using the command-line mysqlcheck
utility:
$ mysqlcheck -u root -p --auto-repair --optimize --all-databases
This will prompt for the root user’s password, after which all tables in all databases will be repaired and optimized.
Summary
Never use utf8
in MySQL — always use utf8mb4
instead. Updating your databases and code might take some time, but it’s definitely worth the effort. Why would you arbitrarily limit the set of symbols that can be used in your database? Why would you lose data every time a user enters an astral symbol as part of a comment or message or whatever it is you store in your database? There’s no reason not to strive for full Unicode support everywhere. Do the right thing, and use utf8mb4
. 🍻
Comments
David Higgins wrote on : (reply)
Nice write-up Mathias! You could also escape the Unicode symbols into some sort of sanitized version, before you add them into the database. E.g.
𝌆
→𝌆
.Mathias wrote on : (reply)
David: That goes against the principles of normalization. And what if you want to use the data in another format that doesn’t support HTML character entities? Better to just store the raw data, and make sure your database supports it.
maxw3st wrote on : (reply)
Thank you much for documenting this thoroughly. Looks like updating my existing databases and tables will be a pain, but adding new ones should be a snap once I’ve followed the connection/client/server instructions and updated the config files. In general, great point about increasing the available character set. Thank you.
MeMyselfAndI wrote on : (reply)
What is the purpose of
character-set-client-handshake=FALSE
?Why do you use
init-connect='SET NAMES utf8mb4'
, which is already the default because ofcharacter-set-server=utf8mb4
?Mathias wrote on : (reply)
MeMyselfAndI:
While your questions are probably a bit out of scope for this article, I’ll respond to them to the best of my knowledge:
character-set-client-handshake=FALSE
causes the server to ignore character set information sent by the client (e.g. if the client requests a connection inutf8
, it would still useutf8mb4
) . By using it you can rest assured that the default server character set will be used at all times.SET NAMES
indicates what character set the client will use to send SQL statements to the server, i.e. the connection charset.character-set-server
sets the server charset. To useutf8mb4
correctly, you need to make sure the client, the server, and the connection are all set toutf8mb4
.That is my understanding of these settings. If you think this is wrong, I’d appreciate a clarification. Thanks!
MeMyselfAndI wrote on : (reply)
Did you test the effect of
character-set-client-handshake=FALSE
? In my MySQL 5.1.x my clients could still change the character set, no matter how I wrote this option (with-
, with_
, withskip-…
instead of=FALSE
). And why should a client not be able to change the character set for its own connection?character-set-server
isn’t the only value you can set (withoutinit-connect
) in themy.cnf
(yep,character-set-filesystem
too, but it doesn’t matter for now). This is also the default value for all other character set settings you do not explicitly change (in a client’s session, for a database, table, column, etc.). In other words, (nearly) all othercharacter_set_*
will inherit from this value. So the default value for a client’s connection (character_set_client
,character_set_results
, andcharacter_set_connection
) will always be this value. You don’t need to useinit-connect
unless you want your client’s connection’s default value to be different than what’s defined incharacter-set-server
.Mathias wrote on : (reply)
MeMyselfAndI: Setting
character-set-client-handshake=FALSE
(or usingskip-character-set-client-handshake
) is the only way I could getcollation_connection
to show up asutf8mb4_unicode_ci
instead ofutf8mb4_general_ci
when performing aSHOW VARIABLES LIKE 'collation%'
query. Unless there’s a better way to achieve the same effect, I’m afraid this setting can not be omitted.Thanks for pointing out the
init-connect
setting was unnecessary — I’ve removed it now. Please let me know if the example/etc/my.cnf
can be optimized further.MeMyselfAndI wrote on : (reply)
Mathias: To set the connection’s collation to
_unicode_
instead of_general_
I could do this successfully usingSET NAMES … COLLATE …
. But you only need to use_unicode_
forß=ss
, otherwise you can use_general_
which is faster as the manual says. https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-sets.htmlNo, not really. Of course the client has to speak UTF-8. For MySQL you only need two things: the setting for your client’s connection and the setting for the string fields.
character-set-server
is only a default value for creating databases if you don’t set a character set in theCREATE DATABASE
statement. The database’s value is only a default value for its tables if you don’t set it inCREATE TABLE
. The table’s value is only a default value for the string fields if you don’t say something special for the fields. Only now both the character set and the collation have a real purpose for storing values and comparing strings etc. The second point is the connection’s value. MySQL needs it to decode your input correctly and to encode the results. (A recoding can be skipped if connection and fields use the same character set value.) See the first three sub-chapters of https://dev.mysql.com/doc/refman/5.5/en/charset-syntax.html (the key sentences are at the end of these chapters) and https://dev.mysql.com/doc/refman/5.5/en/charset-connection.html.My conclusion is: you can leave the default values of the server as-is. Just specify the values for your fields (or defaults for your tables or your databases if you don’t want to set it field by field). And use
SET NAMES
ormysql(i)_set_charset()
in PHP or something similar after each connect to a MySQL server. That’s all you need. Don’t rely on the server’s default values.Patricio wrote on : (reply)
Nice article Mathias, this also took me by surprise some time ago. Who decided to name that other charset
utf8
? :sI think it’s much safer not to trust any default charset set in the server configuration, and make yourself sure that your app is setting it correctly, whenever it creates/modifies tables and databases, and opens new connections.
Peter van Dijk wrote on : (reply)
If you really needed to
REPAIR
after these changes, that’s a bug and you should report it.Binyamin wrote on : (reply)
@Mathias, I followed your steps to change MySQL 5.5.25 encoding to
utf8mb4
, but unfortunately SQL returns:…when I run the query:
…and with
character-set-client-handshake = FALSE
the whole website returns Error 500; without it returns Error 500 for some pages. MySQL log does not show any error.With
utf8
the website works fine. Any suggestions? Seems like latest stable MySQL release is 5.5.25a.Mathias wrote on : (reply)
Binyamin: The 500 error most likely has nothing to do with MySQL itself, but rather with the server-side code that connects to it, or elsewhere in your code. Take a closer look at the pages that result in the error, and try to figure out what they’re doing different than the other pages.
I’m using MySQL 5.5.25a as well, with these exact settings, and everything works flawlessly.
Binyamin wrote on : (reply)
Mathias: Got it to work after discovering a Webmin bug with
character-set-client-handshake=FALSE
and fixing it withnodbi=1
.Strange that server returned error 500 also because some MySQL row type was
mediumtext
. After I change the type totext
, the error disappeared and theutf8mb4
encoding works fine now.Mathias wrote on : (reply)
Binyamin: The
MEDIUMTEXT
issue was probably a result of what’s described in step 4 (check the maximum length of columns and index keys), no?Binyamin wrote on : (reply)
Mathias: I am not sure about that, because in some cases a
MEDIUMTEXT
row will cause Error 500 even when it has a small character value, and usingTEXT
instead fixes it. Maybe it is related to a MySQL 5.5.25 bug.ArunMohan wrote on : (reply)
Still I am getting
java.sql.SQLException: Incorrect string value: '\xF0\x9D\x8C\x86\xF0\x9D...' for column 'title' at row 1… character(𝌆𝌆)
.But I can insert those characters through my MySQL client (SqlYog)…
Richard wrote on : (reply)
What about performance and memory usage? Presumably this has an impact on both.
Mathias wrote on : (reply)
Richard: I don’t know about memory usage, but performance of
utf8mb4
is slower than for the 3-byteutf8
. I suspect the difference to be negligible though, although I’d love to see some numbers.Jon T. wrote on : (reply)
Are you using MySQL’s
utf8
charset in your databases? In this write-up I’ll explain why you should switch to PostgreSQL – the Real Man’s database.Neil Martin wrote on : (reply)
I found many resources out there that claimed to provide solutions to using
utf8mb4
with MySQL, but they were all incomplete. I read your write up and applied the suggested configuration changes and everything worked. Great work; thanks!Deron Meranda wrote on : (reply)
If you’re using Python with the popular MySQLpy module for your database connectivity, you will need to apply a small patch to the python source code so it recognizes the new
utf8mb4
character set. Until this patch is officially made in the up-stream code, see the small Pastebin patch.Roland Franssen wrote on : (reply)
ArunMohan:
SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;
made it work for me on MySQL 5.5.28.CodeKingPlusPlus wrote on : (reply)
What about changing the collation settings for JDBC? I just fixed all my databases and tables. I can run queries fine inside my database but I am getting the following exception from my Java code:
I cannot figure out how to change the
utf8_general_ci,COERCIBLE
toutf8mb4_unicode_ci, IMPLICIT
.My connection string is:
ArunMohan: Have you solved your
Java.sql.Exception
? I am having the same problem!Robert wrote on : (reply)
Thank you very very much, I searched for a solution for nearly a year!
Pawel wrote on : (reply)
If I stumble upon a PILE OF POO that needs to be inserted into database, I will remember your post :)
Julia Neystadt wrote on : (reply)
Thanks a lot for your article. Unfortunately, I followed all the instructions and I am still having a problem:
This is the same problem ArunMohan reported earlier.
Any ideas?
Mathias wrote on : (reply)
ArunMohan and Julia: You’re probably getting that exception because the JDBC driver (the connection between Java and your database) doesn’t recognize the
utf8mb4
charset. Try using a JDBC driver withutf8mb4
support — MySQL Connector/J 5.1.13 (or a newer version), for example.Julia Neystadt wrote on : (reply)
Mathias: Thanks a lot for your fast reply. Unfortunately, this does not help me. My MySQL connector is 5.1.18.
But I paid attention that my DB version is 5.5.19-log. On the other hand, all the
utf8mb4
definitions are set correctly:Your help is greatly appreciated.
Julia
Julia Neystadt wrote on : (reply)
I resolved my problem by upgrading the MySQL Connector/J version to 5.1.22.
Michael wrote on : (reply)
Thanks for your article, it did help me a lot.
Here are my thoughts and some additional information that may help ArunMohan and Julia: http://info.michael-simons.eu/2013/01/21/java-mysql-and-multi-byte-utf-8-support/
CodeKingPlusPlus: You must have
character-set-server = utf8mb4
inmy.cnf
anduseUnicode=true
in your JDBC connection. Depending on the version of your MySQL Connector/J, you need to havecharacterEncoding=UTF-8
as well or must not have it, see my article.Itay wrote on : (reply)
Thanks a lot for the great write-up. I did all the steps but I still see question marks where 4-byte chars should display. What am I missing?
Jamie wrote on : (reply)
I found that phpMyAdmin didn’t cope with
utf8mb4
. Existing 4-byte characters display as question marks and trying to insert a 4-byte character seems to insert four actual question marks into the database. Turning on themysqld
log and looking and what phpMyAdmin was doing, I noticed it was setting the character set toutf8
each time. It’s hard-coded in line 1303 oflibraries/database_interface.lib.php
:Changing to it to
utf8mb4
solves the problem, although being a hack you’ll need to remember to make this change each time you upgrade phpMyAdmin.Also, if you want to be sure of correct sorting of results you might want to set the Server connection collation on phpMyAdmin’s front page to
utf8mb4_unicode_ci
.Mathias wrote on : (reply)
Jamie: Thanks! You might want to append that info to the corresponding phpMyAdmin bug ticket: http://sourceforge.net/p/phpmyadmin/bugs/3619/
Paul wrote on : (reply)
At the top of your aticle, you really should mention the version of MySQL this applies to because the version in Debian Squeeze is 5.1 and that’s too old to have
utf8mb4
apparently. Spent ages writing a program to update my MySQL tables etc. and then found that I couldn’t without the update of the package. Your article is good stuff though. :-)cmr wrote on : (reply)
Hi guys, I tried the above. I’ve restarted the MySQL server but the output still looks like this:
This is what my
/usr/local/mysql/my.cnf
looks like:I’ve looked at the default server settings and there is no mention of the above variables for
my.cnf
file.I can’t save
𝌆
but I can save characters like♥
,❤
, and❥
.Mathias wrote on : (reply)
cmr: Are you sure the
my.cnf
file you’re editing is the one that gets loaded? Try the following command:On my machine, this prints:
Make sure the configuration file you’re saving to is listed. Also, check all the other files that get loaded and make sure they’re not overriding your settings.
cmr wrote on : (reply)
Mathias: I moved
my.cnf
to/usr/local/etc/
and it’s working now. Thanks heaps!Update: I can’t seem to add the character to the database:
SQL Statement:
I tried running this command
The output says “0 rows affected”.
I’m guessing that
utf8mb4
is not being applied to my row.But when I ran the following:
What else could be happening?
Update: Found it! I was missing:
Thanks for the great post.
Binyamin wrote on : (reply)
default-character-set
removed in MySQL 5.5.3 andutf8mb4
supported since MySQL 5.5.3.Why would you use
default-character-set
in[client]
and[mysql]
?Mathias wrote on : (reply)
Binyamin: I think only the command-line option was removed, not the variable. A quick test confirms this: removing these variables from the config file impacts the results.
With these settings:
…I get these results:
Without the
[client]
and[mysql]
sections, the results are different:Mihai wrote on : (reply)
Very good writeup. We knew about this concoction called
utf8
, but you provide the mechanics to properly addressing this issue.Jef Vlamings wrote on : (reply)
Thanks! Definitely the clearest explanation on how to configure
utf8mb4
out there!Steve Hibbert wrote on : (reply)
About the only place I could find that had the correct way to set
collation_connection
in 5.6.15! Thanks very much.HS wrote on : (reply)
That’s what I have in
my.cnf
under/etc/
. Am I missing something or doing something wrong? I’m running MySQL 5.6.16-log.HS wrote on : (reply)
After hitting my head around a lot, these are the best settings I could figure out to run everything with
utf8mb4
:When running:
I get this:
My questions are:
utf8mb4
? I’m using CentOS 6.5 and WHM. Or it would be a bad idea?utf8mb4
?my.cnf
?Thanks.
P.S. With the above configuration in
my.cnf
, I don’t get ANY sort of warnings or errors anymore when checking the MySQL error messages. Another note is that I’m running the latest MySQL from WHM: 5.6.16-log MySQL Community Server (GPL).Mathias wrote on : (reply)
HS: All those questions are already answered in the above article. I’d suggest reading it.
RobertG wrote on : (reply)
Instead of reducing column sizes from
255
to191
, have you tried to set MySQL’sinnodb_large_prefix
option? It increases the maximum allowed key size from 767 bytes to 3072, cf. here: https://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_innodb_large_prefixChris wrote on : (reply)
Thanks for writing this! If anyone is doing this migration with Django, don’t forget to add:
…to the
DATABASES
object in your settings file. Hope this saves someone else some time.Serge wrote on : (reply)
Just one question — why do you prefer
utf8mb4_unicode_ci
? Why is it better thanutf8mb4_bin
?Mathias wrote on : (reply)
Serge: See this Stack Overflow answer for the difference between
*_unicode_ci
,*_general_ci
and*_bin
.Tam wrote on : (reply)
Hi guys, I followed the above setup, seems all is ok, but when I insert a
NickName(Primary Key)
to MySQL, it shows error 1062.Table and this column is use
UTF*mb4_unicode_ci
.…after that I find out if change to use
UTF8mb4_general_ci
instead ofUTF*mb4_unicode_ci
then insert the second row without any error, so any suggestion to fix this if keeping useUTF*mb4_unicode_ci
? Thanks.HS wrote on : (reply)
Did you see that there is an
utf8mb4_unicode_520_ci
collation? I’m wondering how it differs fromutf8mb4_unicode_ci
. Any ideas? Would you be able to give us some examples and show us why one would be better than the other one?Update: https://dev.mysql.com/doc/refman/5.6/en/charset-collation-names.html says:
It sounds like the
_520
stands for language standards and not to the MySQL likes? I mean, it looks like that the520_ci
looks to be a better approach to the Unicode standards? Is this true? Please, also see: https://bugs.mysql.com/bug.php?id=9604It seems that MySQL won’t update the supposed Unicode stuff to keep up with the standards as it seems it would imply in their clients to update the contents of their current databases. Thereafter they created the new collation, which keeps up with standard Unicode. And the version they have available is v5.2.0, which is this one: http://www.unicode.org/Public/UCA/5.2.0/allkeys.txt
Then, it sounds like it would be better to stick with the real Unicode standard one, i.e.
utf8mb4_unicode_520_ci
?Is there a way to compile MySQL 5.6 using the latest available UCA weight keys (v7.0.0 at the time of writing)?
Allan Wallace wrote on : (reply)
Nice article — although
utf8mb4_bin
may be more appropriate for certain tasks, as case insensitivity can be a problem in certain contexts, or where the use of accenting in languages alters the meaning of the words. E.g. if you are trying to build a dictionary of terms in Greek and you have aUNIQUE
index defined in order to prevent the input of duplicates, you will get false duplicates for words where there should be no match.As an example, here are two Greek words with considerably different meanings, one of which is rude, the other is not:
μαλακά
μαλάκα
If you try to insert these into a
utf8mb4_unicode_ci
column, you will get “MySQL Error 1062: duplicate entry for key”. So in this particular case,utf8mb4_bin
is the logical option.David Trotz wrote on : (reply)
Thank you for the article — it’s been a lifesaver in better understanding the MySQL documentation.
I had a question that I am still unclear on. If I have a
VARCHAR(255)
column and it is not indexed, can I leave its size at255
?David Trotz wrote on : (reply)
I answered my own question, and discovered a few things that might help readers of your article.
Before I go into that I want to reiterate that your article has been extremely helpful in getting me started on understanding what I need to do in order to get to proper
utf8
support. Thank you for being a pioneer here!Notes on Converting Indexes
You mentioned a little bit why you changed the columns from
VARCHAR(255)
toVARCHAR(191)
and I understand now why you did this, but you didn’t mention (although it is implied if you understand MySQL) that this will lead to truncation if you are using more than 191 characters in that column anywhere in your database table. MySQL will warn you after the dirty deed is done, but by then it is too late.It might help to explain that you only need to do this if you index the column and are not worried about truncation but in reality you don’t need to do it, because you can simply create the index to have only use the valid 191 characters like so:
For more on this see: MySQL 5.5 docs: 13.1.13
CREATE INDEX
syntax.Notes on Converting Tables
It is not necessary to convert all your columns as they will inherit the table’s encoding and be converted when the table is converted, in fact even if the column has a specified encoding the column will get the table’s default encoding. This is most likely what you want but I put it out there as a warning for others.
Also of note when converting a table to
utf8mb4
fromutf8
: if you have a column that isTEXT
, MySQL will automagically promote that column toMEDIUMTEXT
to accomodate for the additional storage space needed for the new encoding. I only tested this withTEXT
and assume it is similar withTINYTEXT
etc.I say all of that to note that you really do not need to run:
…on all of your columns unless you want to override the default behaviour inherited from the table when it was converted. It will re-run and waste your time, as it does not look at your
ALTER TABLE
command and note that no changes will take place.Peter Torelli wrote on : (reply)
Great post. I made all the updates to my server you specify. However, when I enter your test line
foo𝌆bar
there is still an extra space in the select results, like so:What did I miss? Thanks!
Beau Key wrote on : (reply)
Maybe I am stupid but it is not easier to create a backup of all (relevant) databases, drop the databases, update
my.ini
and restore the databases from a backup?Mehdi wrote on : (reply)
Really useful article.
Thanks a lot.
Klaus wrote on : (reply)
I am using
utf8mb4_unicode_ci
in MySQL but when I upload a file with UTF-8 (CP 65001) then I get all the German “Umlaute”ÄäÜüÖöß
wrong what I could see proper in the editor Notepad.hnuecke wrote on : (reply)
Thank you for the excellent suggestions and explanations; they already helped me a lot.
I had the same messy config as cmr showed in his post on 30th May 2013 at 15:54. I made some progress, but I still cannot set the
character-set-client
andcharacter-set-results
toutf8mb4
. My variables actually look like this:I assume that this stops me from converting my WordPress site to a multisite. At least I still get error messages when WordPress tries to create a table for a new site:
WordPress uses
utf8mb4
since March or April 2015, I think. I’m also checking in the WP forum if what I see is a bug, because tables for the first site could be generated. Hopefully somebody here can give me a hint how I can manage to change the character sets for the remaining two variables.Thanks!
Chris Christoff wrote on : (reply)
hnuecke: See https://core.trac.wordpress.org/ticket/32308#comment:13.
Paul wrote on : (reply)
Thank you for this guide — it really helped.
I also had to drop and re-create all the database’s stored procedures (and functions too) in order that they execute within the new character set, otherwise I would receive:
SQL state [HY000]; error code [1366]; Incorrect string value: '\xF0…'
when calling them. Even whencharacter_set_connection
equalledutf8mb4
.Run:
…to see which procedures have not been updated to the server’s new
character_set_client
,collation_connection
and Database Collation values.Justin Lee wrote on : (reply)
Thank you so much for this write-up, Mathias! I don’t know how I would have fixed my UTF-8 issues without this.
I followed some of the advice in some of the other comments, and I spent a day and rewrote my entire project to use Postgres instead. It was not too bad. It mostly involves changing all the non-standard funky MySQL SQL into proper SQL. And guess what, in Postgres the UTF-8 stuff works out of the box. You don’t have to specify anything. It just works by default because it’s the right way to do it.
For anyone else who stumbled on this website here, I encourage you to switch to PostgreSQL. MySQL was bought by Oracle and they haven’t done anything since. I think it was pretty broken even before.
Looks like Michael Stonebreaker will still get the last laugh over Larry Ellison ;) (at least in terms of software quality)
Erez wrote on : (reply)
Using SqlWorkbench I could not change the
character_set_result
andcharacter_set_client
until I used the following line as a part of your provided script:SET CHARACTER SET utf8mb4;
.This setting is erased each time I restart SqlWorkbench. Any idea how to keep it?
Dennis A wrote on : (reply)
I’ve done all these steps and everything looks fine in my tables now, yet the server still hangs when I try to submit a form that contains these annoying emojis that pop up in people’s communications nowadays. I am using the MySQL ODBC connector because the software is very old. I am wondering if the connector is the reason it is failing. My version of MySQL is 5.5.36.
Mighty Chaffinch wrote on : (reply)
Anyone using Connector/ODBC should be aware of the advice in MySQL bug #69021:
Also, if you want to round-trip text back from classic ASP forms back to MySQL, you need to set:
And save your
.asp
files as UTF-8 in your text editor.Hope this saves someone the pain and confusion I experienced arriving at this solution.
Moz wrote on : (reply)
And don’t forget to scatter
--default-character-set=utf8mb4
liberally around any database-handling scripts you might have, as a lot of MySQL utilities default toutf8
and pick up the local environment if you don’t force properutf8mb4
on everything.Also, thank you so much for collecting everything into one page like this. It’s made an awful process somewhat easier.
Hypolite Petovan wrote on : (reply)
Thanks for this comprehensive guide to add pile of poops on websites everywhere!
hoekma wrote on : (reply)
I put the settings for the three sections into
/etc/my.cnf
and restarted the server but not all of them were applied when I checked theSHOW VARIABLES
command. It was the exact same problem as @cmr’s post. However, once I put the settings into/etc/mysql/my.cnf
they were all applied successfully.Hopefully this helps someone!
P.S.: MySQL Server version 5.5.46 and Ubuntu 14.04.2.
Brad Kent wrote on : (reply)
MySQL workbench requires you to
SET names utf8mb4
after connecting — no way to configure it to use a defined default or per-connection setting.Oliver wrote on : (reply)
I think you have an error in “Step 3: Modify databases, tables, and columns” in the
ALTER TABLE
statement.It says twice
column_name column_name
but actually it should be like this:…and you should use
MODIFY
instead ofCHANGE
because it’s less powerful.Cheers Oliver
Mikael wrote on : (reply)
I was wondering why you recommend utf8mb4 instead of utf32? Any particular reason? I'd prefer utf32 to utf8mb4 anytime because I dislike variable length encodings, except for transportation purposes, so I'd like to know if this is a poor choice with MySQL/MariaDB.
Mikael wrote on : (reply)
Mikael: I just changed my server's default to utf32 and did the same with the databases and tables. I also changed the client default to utf32 and that turned out to be a bad idea. The 'mysql' command, on Linux, choked and came with this error (with no further info):
So I had to change the client's default encoding back to "utf8mb4". The output of:
Now makes sense as only the client side uses utf8mb4 and the server uses utf32.
Sam wrote on : (reply)
Chris: Do you know where to add that if I’m using Flask?
Paul: Did you add settings to the stored procedures when you re-created them? I’m running into the same issue (
character-set-server
is showingutf8mb4
but website is still throwing1366
).ExplodingCabbage wrote on : (reply)
I think you have a typo:
character-set-server
andcollation-server
need to be written ascharacter_set_server
andcollation_server
(underscores instead of hyphens) inmy.cnf
, at least on my machine.newPaul wrote on : (reply)
I am using XAMPP. Where do I find the
my.cnf
file?Shah Hussain wrote on : (reply)
newPaul: In XAMPP you have to do all this in your "my.ini" under mysql config tab...
Shashi wrote on : (reply)
@ArunMohan I was facing the same problem… In my case the mysql-connector version was the problem. I was using 5.1.22; I changed it to 6.0.3 and it worked. Hope this helps!
Rick James wrote on : (reply)
The 3
ALTER
s are too much, and probably broken. In particular,CONVERT TO
will probably fail if the191
is not done first.The
191
problem has been eliminated in 5.7.Raoul Duk wrote on : (reply)
Great article!
I have a follow-up question: is the conversion an all or nothing scenario?
Let’s say I have a database with > 70 tables and I know I really only care about
utf8mb4_unicode_ci
in one particular table. I know I can just convert that one particular table. But my question is: in my client connection (from a Python Django app) I need to set the encoding toutf8mb4
to ensure I am inserting/querying data in the correct encoding for the table in question.But is it harmful to have that as a global setting in all DB connections for all the tables I have not migrated to
utf8mb4
?Or is it just better to migrate all the tables? Thanks.
Prateek Gupte wrote on : (reply)
Would this take care of replicas for the database?
dGo wrote on : (reply)
Thanks for this! I wrote a little script to generate the update queries for tables, columns and views. If it’s correct (I’m asking), I figure it might be useful for others as well. So this script generates and list of SQL queries you’ll have to run to update the tables, columns and views charsets.
Björn wrote on : (reply)
Thank you for this article.
I tried a converstion of every field on a table with 1,000,000 rows and it performed very badly. A faster solution is to create a new table with correct charset and insert data directly from the old table. After this rename both tables and everything works fine.
Brendon wrote on : (reply)
Since this ancient article is still the very best reference on the whole web about setting up Mysql and Mariadb correctly, perhaps you can update to use the latest version of collation instead of the old standard?
utf8mb4_unicode_520_ci
gives the most correct collation for non-English languages. It fixes several problems that exist inunicode_ci
, including treating some characters as others and treating some characters differently depending on capitalization.Thanks!