Happy Easter ... yeah, right.
I think none of you reading this are likely to forget this Easter weekend any time soon. Neither is anyone here at CCP. The purpose of this dev blog is to shed a little light on what went wrong and what we are doing to make sure it doesn't happen again.
Perhaps a little background is in order. I am a network administrator working for Siminn, an Icelandic Telco/ISP to whom CCP subcontracts all networking and administration of the EVE Online cluster and customer support. I usually quite enjoy my work as a network administrator, with the exception of the one or two days every year when bad things happen in quick succession when you least expect them to. This was one of those days or, more accurately, this weekend was one of those weekends.
Just to refresh everyone's memory we had incidents on Friday night, Saturday and Sunday. By far the longest and most serious outage was the one on Friday. I will try to go through each one in turn and explain.
Friday: 22:50 - 05:40
To fully explain what happened here I will need to reveal a part of the cluster network structure. Now, while 99.99 percent of you guys and girls reading this are honest and trustworthy people, there is always that 0.01 percent that thinks "hmmm, interesting... I wonder if it is possible to ...", if you catch my drift. So I will try to explain as well as I can without revealing too much. Hope you don't mind.
EVE has been running 23/7 for nearly two years now. During this time EVE has grown quite substantially but the network structure we started out with has not changed very much. We have added a lot of capacity and improved redundancy in some areas but in other areas we are still relying on the same basic design principles we started out with and so far it has worked very well. However, last December, when deploying the Exodus release we made changes to the external network (between the EVE servers and the gateway routers) which were not in tune with the initial design.
There are two types of EVE servers: proxy servers and application servers. When connecting to EVE you connect to a virtual server IP (which is actually a hardware loadbalancer). Your connection is then transparently forwarded to a proxy server, which services your connection with the help of however many application servers your requirements call for (station, space, chat etc).
There are two of these loadbalancers, one primary and the other one acting as a backup in case of failure. They are internally linked and communicate status information over this internal link so that the backup loadbalancer will know when something goes wrong on the primary one (no communication for x seconds usually means that something is wrong, for example). Surrounding them is a network of switches, both in front and behind which form a fully redundant hierarchy so every node has an alternate path in case of link failure and every node is redundant in case of node failures. The loadbalancers were (note the emphasis) transparent bridges but the rest of the network used the spanning tree protocol to avoid loops and build a path for network traffic to flow optimally through the network. This was all tried and tested and verified to be working correctly.
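For readers who want to picture the "no communication for x seconds" rule, here is a minimal Python sketch of a standby watching a heartbeat. It is only an illustration and not how the actual loadbalancers are implemented; the class name, its fields and the timeout value are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class StandbyMonitor:
    """Hypothetical standby-side view of the internal status link."""
    timeout_s: float            # the "x seconds" from the text; any value here is an assumption
    last_heartbeat: float = 0.0
    active: bool = False        # True once the standby has taken over

    def on_heartbeat(self, now: float) -> None:
        # Record the time of the latest status message from the primary.
        self.last_heartbeat = now

    def tick(self, now: float) -> bool:
        # Take over if the primary has been silent for longer than the timeout.
        if not self.active and now - self.last_heartbeat > self.timeout_s:
            self.active = True
        return self.active


monitor = StandbyMonitor(timeout_s=3.0)
monitor.on_heartbeat(10.0)
print(monitor.tick(12.0))   # primary heard from 2 seconds ago
print(monitor.tick(13.5))   # 3.5 seconds of silence
```

Note that a rule like this only decides when to take over. It says nothing about whether the standby is configured correctly once it does.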
At the same time as deploying Exodus we also deployed new proxy servers which required a change in the network wiring. To make this possible the link connecting the loadbalancers was moved to a different port on the loadbalancers and reconfigured. This reconfiguration was not done correctly on the backup loadbalancer, however, which allowed normal IP traffic to flow over the link if that loadbalancer ever became the primary one (which was not the case before). As some of you may have guessed, since the loadbalancers did not participate in the spanning tree this would result in a layer 2 loop in the network, with adverse effects on network performance.
And this is just what happened on Friday night. At 22:50 the primary loadbalancer crashed partially due to what appears to be a bug in its firmware. When the backup loadbalancer took over the connections, network performance went down the drain, and in the bizarre and hard to analyze kind of way that is common for layer 2 network loops. After a long night of debugging, which included shutting down every redundant link and manually building a non-loop prone path through the network, we finally found the fault, six hours after the first report.
Not a great success I have to admit, attributable in part to a few key persons being on vacation and thus hard to reach in the middle of the night. The worst part, however, is the lack of testing done after this network change in December. So to make sure there are no more surprises to be found we will be doing thorough testing of every network component during downtimes this week and the week thereafter.
We also made a change to the design during downtime today with the aim of reducing the complexity of the network and making it more transparent. This change was to have the loadbalancers take part in the spanning tree so that every link and every node is accounted for when the spanning tree algorithm builds a map of the network and calculates a loop free path. This way there are no "hidden shortcuts" for network traffic to take towards its destination. So far so good, but we have a lot of testing ahead of us to make sure this works as intended.
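To illustrate why a link that the spanning tree does not know about is dangerous, here is a small Python sketch. It is not the real spanning tree protocol (there is no root bridge election and there are no port costs), just a union-find pass that keeps links until one would close a loop. The four-node topology and the node names are invented for the example and are not the actual EVE network.

```python
def find(parent, node):
    # Follow parent pointers to the representative of the node's group.
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def loop_free_links(links):
    """Keep each link unless it would close a loop; skipped links are 'blocked'."""
    parent = {}
    kept = []
    for a, b in links:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b
            kept.append((a, b))
    return kept


def has_loop(links):
    # A set of links contains a loop if at least one of them had to be blocked.
    return len(loop_free_links(links)) < len(links)


# Hypothetical topology: two switches with a loadbalancer on each path.
known = [("sw_front", "lb_a"), ("lb_a", "sw_back"),
         ("sw_front", "lb_b"), ("lb_b", "sw_back")]
hidden = ("lb_a", "lb_b")   # a link carrying traffic that the calculation never saw

forwarding = loop_free_links(known)
print(has_loop(forwarding))                              # False
print(has_loop(forwarding + [hidden]))                   # True
print(has_loop(loop_free_links(known + [hidden])))       # False
```

The last line is the point of the design change: once every link is part of the calculation, the result is loop free by construction.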
Saturday: 15:30 - 16:40
This outage was a direct result of the problems the night before. A lot of links and equipment had been shut down during the night, including the faulty loadbalancer. The network admin on duty thought it would be a good idea to power the equipment up again but unfortunately did not make sure beforehand that all the links to this equipment were disabled. No need to describe what then happened in detail, you get the picture. Needless to say it was not our finest hour.
Sunday: 8:30 - 10:30
Different matter altogether this time but not entirely unrelated. The database which sits at the heart of EVE is a cluster, running on two pretty massive servers. One of them is active while the other one sits quietly waiting to take over in case something goes wrong. For a very long time nothing had gone wrong and the primary had been the primary for quite a while. During this time there were a number of configuration changes done, including modifications to the way the database is backed up. Every configuration change has to be done in exactly the same way on both servers but unfortunately there was a misconfiguration on the standby server which had gone unnoticed up until this point. During the network problems on Friday night the standby database server had assumed the active role. The net result of the misconfiguration was that database backups were failing without notifications. This in turn caused the database transaction log to expand, which slowed down the database and ultimately crashed EVE.
While this was a human error, and those do happen every now and then, we will now implement regular checks of the DB failover mechanism and a checklist that will be examined every time. This way the likelihood of misconfiguration is reduced and we are more likely to catch any that slip through during maintenance periods, with no impact on gameplay.
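One item on such a checklist could be a plain comparison of the settings on both servers. The Python sketch below shows the idea only; the function name and the setting names are invented for illustration and do not come from the real database configuration.

```python
def config_drift(primary, standby):
    """Return {setting: (primary_value, standby_value)} for every setting that differs."""
    drift = {}
    for key in sorted(primary.keys() | standby.keys()):
        p_value, s_value = primary.get(key), standby.get(key)   # None means "not set"
        if p_value != s_value:
            drift[key] = (p_value, s_value)
    return drift


# Hypothetical settings, purely for the example.
primary = {"backup_schedule": "nightly", "backup_target": "/backups/db"}
standby = {"backup_schedule": "nightly"}

for name, (p_value, s_value) in config_drift(primary, standby).items():
    print(f"{name}: primary={p_value!r} standby={s_value!r}")
```

An empty result means the two servers agree on every setting that was compared; it does not prove the standby actually works, which is why the failover itself still needs to be exercised.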
In addition to this we have added further checks and alarms to the backup process to try to catch any and all types of errors that may occur and report them proactively before they have a chance to affect the database.
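As a rough idea of what such a check might look like, here is a Python sketch that raises an alarm when the last successful backup is too old or the transaction log is getting full. The function, its parameters and both thresholds are assumptions made for the example, not the checks that were actually deployed.

```python
from datetime import datetime, timedelta


def backup_alarms(last_success, now, log_used_fraction,
                  max_age=timedelta(hours=24), max_log_fraction=0.8):
    """Return a list of alarm messages; an empty list means nothing to report.

    The default thresholds are illustrative assumptions.
    """
    alarms = []
    # A missing timestamp means no backup has ever been recorded as successful.
    if last_success is None or now - last_success > max_age:
        alarms.append("backup overdue")
    if log_used_fraction > max_log_fraction:
        alarms.append("transaction log filling up")
    return alarms


now = datetime(2005, 1, 3, 8, 0)   # arbitrary example timestamp
print(backup_alarms(now - timedelta(hours=2), now, 0.30))
print(backup_alarms(now - timedelta(days=3), now, 0.95))
```

The important design choice is that silence counts as a failure: a backup that stops reporting success trips the alarm just like one that reports an error.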
Sunday: 13:00 - 16:00
During this period the market in EVE was behaving very erratically and was even missing in some regions. This was determined to be an indirect result of the transaction log filling up, and a reboot of the EVE servers was required after some administrative tasks were performed by the EVE dev team.
All in all this was a disastrous weekend for EVE and especially for Siminn, because it was our responsibility. We should have done testing after the network changes in December and we should have been doing sanity checks on the DB on a regular basis. I won't try to make excuses but I can promise that this has put us on our toes to try and keep to the quality standards that we have so far been very proud of. Like I said in the beginning of this blog, I like my work a lot but there are days when ol' Murphy likes to give you lessons in his law. Let's hope we have learned ours.