TQ Move Outage Details
We wanted to give you some details about what happened on June 23, 2010, when we performed the Tranquility server move, why the move took much longer than we scheduled and what we are doing to prevent the issue that caused the extended downtime from happening again.
First, what did we get done?
Everything. The new Ethernet and Fibre Channel switches were installed, the servers were moved to the new larger and cooler space, redundancies were put in place, etc. We actually got most of the work we had planned done in the timeline we had originally announced. Despite rumors and criticisms to the contrary, our plan included a significant time buffer for the work. We'd been prepping the space for about three weeks prior as well--testing power and cooling, putting in place all the backbone cable systems for servers and switches, and getting external network connectivity verified and tested. To some it seemed that we randomly chose "six hours" as our total time frame. However, at no time would we make up numbers we didn't wholeheartedly expect to meet.
So what happened?
When we attempted to fire up the Tranquility database we experienced some failures on the new storage area network we had just put in place. These issues were not discovered until we started running our normal cleanup jobs on the database (these jobs touch just about every part of it) and started putting actual load on the storage area network. Once under load the problems were discovered, but not before the database and many of the vital tables needed to operate EVE were found to be heavily corrupted.
Why did it take so long to get TQ back online?
In order to recover the database after finding the root cause and fixing it, we had to go through the process of replacing the logical database with a new copy. We deployed a backup of the Tranquility database, then began recovering the corrupted transaction logs and replaying them to fill in any missing data. You can think of this process much like your credit card statement. You can see the current balance, which may not reflect the burrito buying spree you went on last night. In order to get the statement to match what you actually owe, you will also need to add in the transactions for the burritos, soda and antacid you bought last night.
Then come the integrity checks. To verify the database is in good shape, a number of very slow and CPU-intensive programs have to be run. This helps ensure that we are not going to cause further damage and that the database is in fact all there. These can take hours to do, and we started with the most vital tables (such as the Inventory Items Database that makes sure the Raven in your ship hangar is yours and exists in game) and worked our way down in parallel with our QA team, who did a very thorough job of testing EVE in VIP Mode. VIP Mode is when Tranquility is up, but accessible to CCP staff only (many of you noticed and were curious about why 30+ others were on TQ while you couldn't log in).
While rolling back the database and losing transactions was an option, we chose the longer recovery path and testing to make sure no player actions in the game were lost due to the corruption.
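For the curious, the backup-plus-log-replay idea can be sketched in a few lines of Python. This is only a toy illustration of the credit card analogy, not how the Tranquility database actually does it: the `Transaction` type, the `replay_log` function and the sequence-numbering scheme are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """One hypothetical log entry: a charge against an account, in cents."""
    seq: int
    account: str
    amount: int


def replay_log(snapshot, snapshot_seq, log):
    """Rebuild current balances from a backup plus the transactions logged after it.

    snapshot      -- balances as of the backup (account -> cents)
    snapshot_seq  -- sequence number of the last transaction in the backup
    log           -- recovered transactions, in any order
    """
    balances = dict(snapshot)  # never modify the backup itself
    last_seq = snapshot_seq
    for tx in sorted(log, key=lambda t: t.seq):
        if tx.seq <= snapshot_seq:
            continue  # already reflected in the backup
        if tx.seq != last_seq + 1:
            # A missing or repeated entry means the result cannot be trusted.
            raise ValueError(
                f"log is not contiguous: expected {last_seq + 1}, got {tx.seq}"
            )
        balances[tx.account] = balances.get(tx.account, 0) + tx.amount
        last_seq = tx.seq
    return balances, last_seq


# The statement (backup) shows 50.00; last night's spree is only in the log.
statement = {"card": 5000}
spree = [
    Transaction(11, "card", 850),  # burritos
    Transaction(12, "card", 200),  # soda
    Transaction(13, "card", 499),  # antacid
]
balances, last_seq = replay_log(statement, 10, spree)
```

The sketch refuses to continue when it finds a gap in the sequence numbers. That is a design choice for the example: a gap means data would be silently lost, which is the kind of outcome the longer recovery path was chosen to avoid.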
What are we doing to prevent this?
As you may know, EVE's database is a fairly big and powerful thing. In order to maintain it and reduce the recovery time in situations like this, we are putting a project in place to modify nearline recovery and establish faster rebuilds of transactions should gaps exist. The team is currently working on the specifics of this new database architecture. Once we have the new design plan in place and tested, we'll post more details and some drawings of the changes.
We all really appreciate the understanding and kind words... and even the harsh ones we needed to hear.
Let us know if you have more questions and I'll do what I can to keep up and answer them here for a few days.
-- CCP Yokai