Acorns
Marcel's blog
WordPress Race Condition with MySQL Replication
My employer runs WordPress to power the Healthy and Green Living section of our web site. The blog serves spikes of dozens of pageviews per second. We use HyperDB to send read queries to a few slave databases.
One day, I found that replication was falling further and further behind, mostly because of updates to wp_options. These writes were contending with reads for the MyISAM table lock on the slave. Because MySQL prioritizes writes higher than reads, it reduced our read concurrency to roughly one thread:
- a replicated write would release the lock
- one read would run
- then another write would get dibs on the lock before some of the remaining reads
The updates looked like this:
UPDATE wp_options SET option_value = 'O:8:\"stdClass\":4:{s:12:\"last_checked\";i:1282296714;s:15:\" ...' WHERE option_name = 'update_core';
WordPress is designed to check for a new version of itself only every day or so. It decides when to check based on the 'update_core' option. If the timestamp there is old enough, it checks again and updates the timestamp.
But this time, the version check happened to occur while the database replication was a bit behind and we had a traffic spike. An unfortunate downward spiral ensued. Each page request saw that a version check was due, checked, and dutifully updated the timstamp. But because replication was behind, subsequent requests still saw that the check was still overdue. This continued for several minutes until the slave database reached the first update. Meanwhile, reads to the table were locked and so each page held the connection for a long time. Eventually the slave hit its connection limit and the site was mostly down.
We considered a few solutions: one was to selectively read from the master database. Although we do that in a few places on our web site outside of wordpress, all the implementations we considered seem too fragile. We settled on restricting the task to admin pages, which required hack to wp_version_check in wp-includes/update.php. It has worked well in production ever since.
(For folks not interested in hacking core wordpress files, there are plugins available that disables the version check entirely or allows you to check on demand.)
Well, we that problem at least. We ran into very similar problem the following week, this time with the 'doing_cron' option. The queries this time were:
UPDATE wp_options SET option_value = '1273074475' WHERE option_name = 'doing_cron'
Essentially the same thing was happening: the timestamp update was delayed by replication and soon every request was updating it. The slaves ran out of connections and the site was down.
Luckly this issue has a well-documented solution: invoke WordPress's task scheduler from a real cron job.
With both the version check and cron fixes, we've had no more nasty interactions between MySQL replication, HyperDB, and wordpress option updates. Hooray!