The Day the Items Disappeared
Sometimes things have a way of working out drastically differently than you had planned. More specifically: sometimes all it takes is one line of code for things to go very wrong. Last Friday was a case in point.
The 7-hour outage on Tranquility we experienced on Friday, May 7, 2010 is the longest unscheduled downtime EVE has had in over two years. My purpose for writing this blog is to explain as best I can what happened, what we have already done to try to set things right and what lessons we have learned.
During EVE's daily runs we execute various maintenance jobs that ensure the health of the database. One of these maintenance tasks is recycling itemIDs. Almost every object in EVE has an ID, from a humble lump of Veldspar to characters, skills, space stations and celestial objects. When items get deleted or destroyed they are moved to a specific location in the database that we refer to as the junkyard. This location grows huge over time as the wheels of the EVE universe keep spinning. To keep things running smoothly there needs to be a way to recycle itemIDs from the junkyard and make them available for in-game use. Which is where the junkyard cleanup scripts come into play. There are a few of those, each responsible for "cleaning" a specific part of the junkyard and reinserting previously used itemIDs into the overall pool of available itemIDs. Usually this happens behind the curtains without anyone noticing except the system admins responsible for watching over the EVE universe.
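To make the recycling idea concrete, here is a minimal sketch of what a junkyard cleanup step does conceptually. The names, data structures and batch size are purely illustrative assumptions on my part; the real thing is a database script against our schema, not this in-memory model.

```python
def clean_junkyard(junkyard, available_pool, batch_size=1000):
    """Move a batch of retired itemIDs out of the junkyard and back
    into the pool of itemIDs available for new in-game objects.

    Illustrative model only: `junkyard` and `available_pool` are sets
    of integer itemIDs standing in for database tables.
    """
    # Take up to batch_size IDs out of the junkyard...
    batch = [junkyard.pop() for _ in range(min(batch_size, len(junkyard)))]
    # ...and hand them back to the pool, so they may be assigned again.
    available_pool.update(batch)
    return batch
```

The important invariant is that each ID leaves the junkyard exactly once before it re-enters the available pool, so no two live objects can ever end up sharing an ID.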
What happened on Friday was that a part of the junkyard was being cleaned that hadn't been cleaned online before, so a new, modified script was required. The script was prepared and went through the customary rounds of code reviews and testing on Singularity before being deployed on the Tranquility database.
What all the scrutiny and testing of this script had failed to take into consideration was the huge difference in transaction volume on Tranquility vs. Singularity. During normal operation the Tranquility database can see anywhere from 3,000 to 5,000 transactions per second. Only part of the transactions between the EVE clients and application servers actually end up going directly to the database because of caching and batching of transactions, so the actual number of transactions in EVE is somewhat higher. But I digress. Long story short, the flaw in the script went unnoticed on Singularity because the error had only a minuscule chance of occurring there.
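A quick back-of-the-envelope calculation shows why volume matters so much here. The per-transaction probability and the transaction counts below are made-up numbers for illustration, not measurements from either server; the point is only the shape of the math.

```python
def chance_of_at_least_one(p, n):
    """Probability that a flaw with per-transaction probability p
    fires at least once across n transactions."""
    return 1 - (1 - p) ** n

# Hypothetical per-transaction chance of tripping the flaw.
p = 1e-7

# A quiet test server might see a few thousand relevant transactions...
low_volume = chance_of_at_least_one(p, 10_000)

# ...while roughly an hour on Tranquility at ~4,000 tps sees millions.
high_volume = chance_of_at_least_one(p, 4_000 * 3600)
```

With these numbers the flaw is a fraction-of-a-percent event on the test server but close to a sure thing over an hour of live traffic, which is exactly how a bug can sail through Singularity and then surface on Tranquility almost immediately.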
What the script actually did was make itemIDs from the junkyard available for use, but then keep treating those same IDs as if they were still available, continually handing them out again so that items could overwrite one another. This created a lot of problems in game. Overall, up to 0.0071% of all itemIDs were affected, which is up to 114,000 itemIDs; for a game that revolves around persistent items, that is a very serious thing.
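The failure mode boils down to an ID being read from the available pool without ever being consumed from it. The following sketch is my own illustration of that class of bug, not the actual script: the buggy allocator hands out the same itemID twice, so the second item silently overwrites the first.

```python
# items stands in for the persistent items table: itemID -> item data.
items = {}

def allocate_buggy(available, item_data):
    """Flawed allocation: reads an ID but never removes it,
    so later calls can hand out the same ID again."""
    item_id = next(iter(available))   # BUG: ID stays in the pool
    items[item_id] = item_data        # may overwrite an existing item
    return item_id

def allocate_fixed(available, item_data):
    """Correct allocation: each ID is consumed exactly once."""
    item_id = available.pop()         # ID leaves the pool for good
    items[item_id] = item_data
    return item_id
```

With the buggy version, two newly created objects can receive the identical itemID, and in a persistent world that means one of them simply ceases to exist.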
Petitions soon started pouring in, reporting a multitude of strange problems, and the logs started filling up with errors and exceptions. This ultimately resulted in a decision to close down the game in order to prevent further data corruption while we were analyzing the problem. About 75 minutes after downtime, TQ was closed down and some of CCP's best and brightest minds started analyzing the problem.
It didn't take us long to figure out what was going on, but in order to fix it we had a dilemma, and the dreaded word "rollback" started creeping into the conversation. We quickly decided that a rollback was not a feasible option: firstly, because rollbacks are, well, very nasty things in general (we have only once done a rollback in the 7 years EVE has been in operation, and that was back in 2003) and secondly because of the time it would take. We were faced with the choice of either having EVE offline for an extended period of time or attempting to retrace and repair the damage and get TQ back up and running as soon as possible.
We opted to repair the damage, bring Tranquility back up and resolve incoming petitions as quickly as possible for those things that could not be undone. Over the past few days we have been working tirelessly on answering petitions and will continue to do so until we have solved every last one of these cases. To help speed up response times on petitions, our software engineers worked all of Saturday to build a suite of specialized tools for customer support to review and deal with petitions related to this unusual error. If you are reading this and are still having problems, I would like to ask you to please submit a petition through the EVE client or through the web interface.
Mistakes can and will inevitably happen. There will always be Murphy. In this case all prudent precautions were taken in terms of code review and testing. The main lesson learned last Friday, however, is that we need to step up our game in terms of speedy restoration of backups and options for recovery.
We need to be able to recover and restore the EVE database much more quickly than we currently can. We have been planning to change that for a while, and those plans are now being set into motion. This will allow us to recover from even a complete database failure within a very short period of time. We will also be making a thorough risk assessment for Tranquility to identify operational risks and categorize them according to priority. We will use it to focus our work on reducing risk and improving long-term stability.
So, what a day. May it be another two years or more before we see a day like this one, and when it comes, we aim to be readier than ever.