Downtime: June 25, 2008
At around 4:30 pm PST, while configuring the database for performance, I accidentally killed the database. It took a record two hours to recover from it. The good news is that I eventually recovered everything, with no loss of data.
See the bottom of this post if you care to know all the technical details.
Needless to say, it stressed me the fuck out!!
Once I got everything back up and running, I took a break. But after an hour or so, I noticed that the performance tweaks I added were actually sucking up a ton of resources and slowing down the site.
I thought, what the hell, the site's been up and down a ton already today, so what's a little more downtime while I undo everything and go back to where I was before?
Suffice to say, everything is now nice and zippy and solid. Lesson learned.
As a little treat, I'll pre-announce here that we are using a brand spanking new search engine for the site. The new search had nothing to do with today's downtime and has actually improved performance dramatically, especially the speed of searches themselves. I haven't officially announced the new search because, so far, it's been pretty much a drop-in replacement for the old one. Same interface and features, just way faster and more efficient. But that's only the tip of the iceberg.
When I get a chance, over the next several days, I will expose more of the advanced features of the new search in an Advanced Search page. Stay tuned.
Brief Technical Details of Today's Downtime
Over the last several weekends, I have been working on increasing the performance of MilkandCookies. This was typically done at night, especially on Sunday nights when traffic is low. I noticed that some SQL INSERTS seemed to take a while, so I thought maybe my main tables had grown too big and table level locking was getting in the way.
Switching some tables from MyISAM to InnoDB to get row-level locking seemed like a probable solution. The problem is that InnoDB doesn't support FULL TEXT indexing, which is what I have been using for our search feature.
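For the curious, the engine switch described above boils down to two kinds of statements: drop any FULLTEXT indexes first (InnoDB in MySQL releases of this era refuses tables that still carry them), then change the engine. Here is a minimal Python sketch that just builds the SQL. It is not the exact set of commands run on the site, and the table and index names are hypothetical.

```python
def engine_switch_statements(table, fulltext_indexes, engine="InnoDB"):
    """Return the SQL statements needed to move `table` to `engine`.

    FULLTEXT indexes are dropped first, because InnoDB (as of 2008)
    cannot hold them and would reject the engine change otherwise.
    """
    statements = [
        f"ALTER TABLE `{table}` DROP INDEX `{name}`;" for name in fulltext_indexes
    ]
    statements.append(f"ALTER TABLE `{table}` ENGINE={engine};")
    return statements


if __name__ == "__main__":
    # Hypothetical table and index names, for illustration only.
    for sql in engine_switch_statements("videos", ["ft_title_description"]):
        print(sql)
```

Passing `engine="MyISAM"` with an empty index list gives the statement for going back the other way.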
I had wanted to use a search application rather than FULL TEXT for a while now, and this seemed the perfect excuse to go for it. I researched and asked around and settled on Lucene. Once up and running I tested it for a few days as an internal beta and early this week quietly rolled it out.
The next night I killed FULL TEXT and switched my largest/slowest tables to InnoDB. Everything seemed fine until today, when I tried to optimize settings for InnoDB and everything went SNAFU. I had followed the advice of several sources, including MySQL Performance Blog. Everything looked in order until I restarted MySQL. It would not come back up, and my InnoDB tables appeared corrupted.
Two hours later I discovered the culprit. Among other directives, I had changed the innodb_buffer_pool_size variable in my my.cnf file. This turned out to cause the InnoDB log files to become unreadable because they were created at a different size. What confused me is that I thought I had reverted all my changes, but it turns out I had reverted all settings except that one, the only one that was a problem. So I reverted it too, and everything was good again, or so I thought.
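A size mismatch like this can be caught before restarting MySQL. The sketch below compares a configured log file size against the `ib_logfile*` files already on disk. It rests on an assumption the post does not state: that the on-disk log size is governed by `innodb_log_file_size`, which is the setting that controls it in MySQL releases of this era. The function names and the data directory path are hypothetical.

```python
import os
import re

# Size suffixes accepted in my.cnf values such as "256M".
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(value):
    """Convert a my.cnf size string such as '256M' into bytes."""
    match = re.fullmatch(r"(\d+)([KMG]?)", value.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def log_size_mismatches(configured, data_dir):
    """Return (filename, actual_bytes) for each ib_logfile* whose
    on-disk size differs from the configured size."""
    expected = parse_size(configured)
    mismatches = []
    for name in sorted(os.listdir(data_dir)):
        if name.startswith("ib_logfile"):
            actual = os.path.getsize(os.path.join(data_dir, name))
            if actual != expected:
                mismatches.append((name, actual))
    return mismatches
```

Something like `log_size_mismatches("256M", "/var/lib/mysql")` (path assumed) returning a non-empty list would be the cue to put the old value back before restarting.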
Even with everything fixed, performance was deteriorating. So I gave up on InnoDB and reverted the tables back to MyISAM. It's not a total loss of effort: everything is back to being fast and stable, and the new search is amazing. And now I know that InnoDB is not the way to go for our needs. Live and learn.


