Charsets & Collation in MySQL
Back at the end of 2014, I put a lot of time into ensuring that users uploading CSVs could throw almost anything at us and we’d be able to handle it. Recently an issue surfaced where we couldn’t handle those foreign characters that lead me down the rabbit hole that is collation and character sets in MySQL.
The issue that I was running into was our tasks were erroring out if a foreign character like the Ę in Ęlarmstrong were included. I dug deeper and we were reading it fine in unicode. However, when we tried looking up Ęlarmstrong in the with the Django ORM, MySQL was erroring out with the following message.
"Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='"
I never had any problem querying with unicode characters prior. I did a couple tests and noticed that it worked fine on our main shard. However, it always failed on the other shard. If you’re in a sql shell, you can use:
SELECT SCHEMA_NAME 'database', default_character_set_name 'charset', DEFAULT_COLLATION_NAME 'collation' FROM information_schema.SCHEMATA;
to find out what charset and collation you’re database is using. To dig further at the table/column level you can use:
SHOW FULL COLUMNS FROM some_table;
and see what it’s using. I noticed a discrepancy between our two shards. The main shard with no issues was set to utf8_unicode_ci, while the problematic shard was set to latin1_swedish_ci. It turns out that this is the default collation for historic reasons and that when the new shard had been set up, we never bothered changing it to utf8_unicode_ci. You can set the db to utf8 using
ALTER DATABASE dev_messages CHARACTER SET utf8 COLLATE utf8_unicode_ci;
And the table with:
ALTER TABLE grail_emailunsubscribe CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
After making that change, I could then query for Ęlarmstrong without any issues! However, during code review a further point was brought up. Why don’t we aim to support emojis as well? UTF8 on it’s own does not. We needed something multi-byte.
I did a bit of research and came to the conclusion that utf8mb4 was the right charset and utf8mb4_unicode_ci the collation we wanted. Should be simple enough. So I re-ran the ALTER TABLE and ALTER DATABASE. Nope. It turned out that it didn’t work on some of the tables due to the following error:
Specified key was too long; max key length is 767 bytes
It turns out that the tables that had used some of the string type fields as UNIQUE KEYs would not work due to a size limitation. I was fine leaving these keys as normal utf8_unicode_ci because they were email addresses, and hopefully nobody has emojis in their email addresses.
However, the other possible solution to this was to reduce the size of the column as well. If the max key length is 767, then 767 / 4 bytes is 191. Currently the email column I was keying was set to a 255 character maximum. So I weighed the tradeoff and came to the conclusion that it’s a rare exception for an email address to be more than 191 characters. If it occurs, I’ll liken it to a Gangnam style situation.
Anyways, this just meant that I had to change the field size to 191 before, and then I could happily change the collation to utf8mb4_unicode_ci.
ALTER TABLE some_table MODIFY COLUMN email VARCHAR(191);
After that I was good to go! A fun learning experience.












