Posted on Dec 1, 2011

Battles with MySQL, big data and UTF-8

I was recently asked to restore a 5GB MySQL database to a new environment during a site migration. Easy enough I thought… we’ve got a dump file and an empty database. OK, it’s going to take a while but it should be fine.

Sheer stupidity. With that much data coming from another source there were bound to be problems.

The restore

On the first run, after around half an hour, the import bailed out with its first key constraint error:

Duplicate entry 'content/??-' for key 'dst_language'

Dodgy data. Well, at least to an extent. We decided to see how widespread the problem was, so we ran the import again, this time with the -f option to force through errors:

mysql -u root -f -p database_name < dumpfile

This turned up only a handful of errors like that one, none of them in critical data. The client was happy to let them slip, so we continued.

Performance

The import ultimately took hours, though, and spanked the load on the server all the way to 11. Not good! Something wasn't right, so I delved into the MySQL config to add some performance tweaks. This was a default install on CentOS that hadn't been tuned yet, and MySQL's stock settings still assume it's being installed on a P2 with 512MB of RAM; in this case it was an octo-core box with 8GB. I upped a few settings as below and tried the restore again:

key_buffer_size = 128M
max_allowed_packet = 1M
table_open_cache = 256
sort_buffer_size = 5M
read_buffer_size = 20M
read_rnd_buffer_size = 4M
myisam_sort_buffer_size = 64M
thread_cache_size = 8
query_cache_size= 16M
thread_concurrency = 8
innodb_file_per_table
innodb_buffer_pool_size = 4G
innodb_additional_mem_pool_size = 1G
innodb_log_file_size = 256M
innodb_log_buffer_size = 128M
innodb_flush_log_at_trx_commit = 0

The restore flew in this time (well, in around 30 minutes), so that was decent. We'll be doing this a few times before the site goes live.

Encoding Hell

After a while, testing turned up encoding issues: Chinese characters in the data were appearing as ??. This was the first I'd heard about ANY foreign characters in this data, but I guess I need to add that to my "big list of technical questions to ask the client". In fact, there's a future blog post in that one.

Looking into the database, everything was running with the collation latin1_swedish_ci. The MySQL default, right there. I switched the database to use utf8_unicode_ci and tried the import again. No joy…
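For reference, the database-level switch is a one-liner, and you can audit what each table actually ended up with via information_schema (the database name here is a placeholder):

```sql
-- Switch the database defaults (done before re-running the import):
ALTER DATABASE database_name CHARACTER SET utf8 COLLATE utf8_unicode_ci;

-- Audit which collation each table actually has:
SELECT table_name, table_collation
FROM information_schema.tables
WHERE table_schema = 'database_name';
```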

Looking at the tables, they were still using latin1_swedish_ci. I checked the CREATE statements in the mysqldump file, and there were no character sets or collations specified. Time to look at the server config:

mysql> show variables like '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

Well hey, at least character_set_system is running UTF8!

I then had some fun and games with my.cnf, trying to get the right defaults in there. It spanked me out a couple of times trying to restart with dodgy parameters I'd found online. It's worth remembering to test config with mysqld --help (although that's not foolproof), and that some MySQL variables use underscores, some use hyphens, and some aren't meant for the my.cnf file at all.

What I ended up with was this, which seemed to do the trick:

[mysqld]
collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'
character-set-server = utf8

[client]
default-character-set=utf8

[mysql]
default-character-set=utf8

Testing that:
mysql> show variables like '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

Now, I thought, we must be getting somewhere… run the import again.

No! Somehow, despite the server being set up correctly and the database being created with UTF8 collation, I was still getting tables created with latin1, and fields created with the same. wtf??

Checking the docs: if no collation or charset is specified in the CREATE statements (and here none were), the setting should cascade up from column, to table, to database and finally to the server defaults. Both server and database were right, weren't they? Double-check the database:

mysql> use database_name;
mysql> show variables like "character_set_database";
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| character_set_database | utf8  |
+------------------------+-------+
1 row in set (0.00 sec)
mysql> show variables like "collation_database";
+--------------------+-----------------+
| Variable_name      | Value           |
+--------------------+-----------------+
| collation_database | utf8_unicode_ci |
+--------------------+-----------------+
1 row in set (0.00 sec)

OK, that's definitely fine. I was starting to go a bit spare by now, until I noticed a comment at the foot of one of the pages in the docs. Someone, it seems, had been as frustrated as me, but had figured it out:

“Please add the information about skip-character-set-client-handshake into every page that references server character set and collation.”

This parameter tells the server to ignore the character set the client sends in its connection handshake, so the defaults you've configured finally stick instead of being overridden by the connection itself. Genius, eh? You can do all the config you want, but running a standard query from the command line runs amok over the whole lot, turning your well-crafted defaults into, well, someone else's defaults. Now, finally, we have all our ducks in a row:

skip-character-set-client-handshake
collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'
character-set-server = utf8

Run the import again and voila!  We have a nice set of tables, and columns, all utf8_unicode_ci.
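A quick per-table spot check for this is SHOW CREATE TABLE (the table name here is a placeholder):

```sql
SHOW CREATE TABLE some_table;
-- the output should now end with something like:
--   ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
```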

I was getting pretty hopeful now, but I checked the data only to find we still had ?? where the Chinese characters should be! This wasn't funny in the slightest. Was it the backup itself? I now set about parsing the living daylights out of the 5GB MySQL dump file. With combinations of grep, sed and more, I found the pesky ??s right there in the SQL. Nightmare.
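One trick that helps when grepping through a dump like this: a literal '?' is the single byte 0x3f, while a genuine UTF-8 Chinese character is a three-byte sequence, so looking at the raw bytes tells you immediately whether the data survived the export. A small sketch, with sample.txt standing in for the real 5GB dump and made-up INSERT lines:

```shell
# Build a stand-in file with one intact Chinese string and one mangled one.
printf "INSERT INTO t VALUES ('中文');\nINSERT INTO t VALUES ('??');\n" > sample.txt

# Literal question marks in the SQL mean the data was lost before export:
grep -c "??" sample.txt

# ...whereas an intact UTF-8 character like 中 shows up as bytes e4 b8 ad:
od -An -tx1 sample.txt | grep -c "e4 b8 ad"
```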

Now, all the effort so far wasn't entirely wasted: we still ultimately need a UTF8 database, and a fast one is good, but none of it helps if the characters aren't in the backup in the first place.

I felt I had to sense-check everything in case one of the command-line tools was leading me astray. I headed into the restored data and pasted some Chinese characters in there. They saved fine (phew), and I then ran an export of the table using a standard mysqldump:

mysqldump -u root -p database_name table_name > testbackup.sql

On looking at the data with the same tools again, there they were: the characters exactly as they should be.

Somehow, even though the default settings for mysqldump worked fine at my end, something had screwed up at source. So we're left going back to the client to see if they can access the original data again. Let's hope so!

If I find out what happened during the backup I'll update this, as I hate leaving a story unfinished… at least there have been a number of lessons learned from this little project. One of the biggest: don't sit and watch a 5GB database being restored. It's worse than watching the kettle, or paint dry. Write a two-line shell script that sends you an email when it's done and make yourself some coffee.
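The two-line script in question can be as simple as this sketch (the filenames are placeholders, and mail(1) is assumed to be set up on the box):

```shell
# Wrap any long-running command and report when it finishes.
run_and_report() {
  if "$@"; then status="finished OK"; else status="FAILED"; fi
  # For the email version, pipe this into: mail -s "restore done" you@example.com
  echo "$1 $status at $(date)"
}

# Intended usage (stdin passes straight through to mysql):
#   run_and_report mysql -u root -p database_name < dumpfile
```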

– UPDATE – When it’s working don’t break it!

Second night on this job and I've needed to migrate a badly restored (to latin1) database into a proper UTF8 db. It should be straightforward, but I've ended up with more issues, which are worth mentioning as further gotchas when dealing with this kind of thing.

As far as I can tell, there's no way to take a mysqldump without it bringing its collations along in the CREATE statements, so I now have a 5GB backup file peppered with lots of "DEFAULT CHARSET=latin1" declarations. A post I found on this was very useful, and with a find-and-replace across the backup file I removed all these references completely, grepping for latin1 to make sure they were all gone.
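The find-and-replace itself is the sort of thing sed handles well (on a 5GB file you want sed rather than an editor anyway). A sketch on a stand-in file, with a made-up CREATE TABLE tail line:

```shell
# A stand-in for one of the offending lines in the dump.
printf ") ENGINE=InnoDB AUTO_INCREMENT=42 DEFAULT CHARSET=latin1;\n" > backup.sql

# Strip the per-table charset so the server/database defaults take over
# (GNU sed in-place edit, as on the CentOS box):
sed -i 's/ DEFAULT CHARSET=latin1//g' backup.sql

# ...then grep to make sure nothing latin1 is left (0 means clean):
grep -c latin1 backup.sql || true
```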

It should have been fine then, but on creation all tables were coming out as utf8_general_ci, not unicode. This took me a while to figure out, but I eventually realised that at one point I'd thrown a --default-character-set=utf8 parameter into my restore syntax, i.e.:

mysql -u root -p --default-character-set=utf8 database_name < dumpfile

While I didn't think this would do any harm, it blooming well did, and again this one parameter overruled my server and database default collation settings. It makes the client ask for UTF8, and the server then picks up the default collation associated with utf8, which is utf8_general_ci. Removing it from the restore statement left the server defaults in control, and the restore went in as desired.
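The reason it lands on utf8_general_ci rather than utf8_unicode_ci is that every character set carries its own default collation in the server's metadata, which you can see directly (this is standard MySQL, not specific to this setup):

```sql
SHOW COLLATION WHERE Charset = 'utf8' AND `Default` = 'Yes';
-- lists utf8_general_ci, the collation the parameter drags in
```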

Sheesh… think I’ve earned my beer this week!