Recent Tranquility Database Issues - An Update | EVE Online

2007-09-15 - By Svarthol

We still haven't been able to identify the exact cause of the recent database issues on Tranquility. However, we are slowly getting closer, and each crash brings us one step further in diagnosing what is wrong.

We know that something causes a random database thread to block other threads' access to tables (application locks). This rapidly prevents more and more threads from accessing data, ultimately resulting in the database initiating a fail-over to its active stand-by server.
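To illustrate how a single blocking thread can end up holding back many others, here is a minimal Python sketch. It assumes a snapshot of "who waits on whom" is available as a simple mapping; the data shape, the session numbers and the function name are hypothetical and not taken from our actual tooling.

```python
from collections import defaultdict

def count_blocked_by_root(waits_on):
    """Map each root blocker to the number of sessions stuck behind it.

    waits_on: {blocked_session: session_it_waits_on} (hypothetical shape).
    A root blocker is a session that others wait on, directly or through
    a chain of other blocked sessions, but that is not itself waiting.
    """
    def find_root(session):
        seen = set()
        # Follow the chain of waits until we reach a session that is not
        # blocked; the 'seen' set stops us from looping forever on a cycle.
        while session in waits_on and session not in seen:
            seen.add(session)
            session = waits_on[session]
        return session

    stuck = defaultdict(int)
    for blocked in waits_on:
        stuck[find_root(blocked)] += 1
    return dict(stuck)

# Made-up snapshot: 52 waits on 51, while 53 and 54 wait on 52.
# Session 60 waits on 59 in a separate, unrelated chain.
snapshot = {52: 51, 53: 52, 54: 52, 60: 59}
print(count_blocked_by_root(snapshot))
```

In this made-up snapshot, session 51 is responsible for three blocked sessions even though only one of them waits on it directly, which is the cascading effect described above.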

Even though the resulting database outage lasts only 5 seconds, the application servers running EVE are not able to survive that long without database connections. Because of the real-time nature of EVE, we require database access speeds measured in milliseconds. Thus, the fail-over causes a crash.

In addition to our server operations and core server team investigating and analyzing the logs from these crashes, we brought in IBM hardware, storage and server technicians, and just yesterday a Microsoft SQL Server expert was flown in to Iceland to further assist us in this effort. The fault could be in anything from our own software to the SQL server, operating system, server hardware or storage. Nothing has been ruled out, so we brought in everyone we could get.

We are able to minimize the effects on Tranquility by having a team constantly monitor the SQL server for blocking threads, which we are usually able to clear up. However, when the blocking thread is working with a frequently accessed table, we are unable to keep up and the number of blocked threads finally overwhelms us, resulting in Tranquility going down. If it were not for these guys, we would be crashing several times a day.
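The "unable to keep up" case comes down to simple arithmetic: if threads pile up behind a blocker faster than they can be cleared, the backlog grows until the database gives up. The toy model below sketches that. All rates and the limit are invented for illustration and are not measurements from Tranquility.

```python
import math

def seconds_until_overwhelmed(new_blocked_per_sec, cleared_per_sec, failover_limit):
    """Estimate how long until the backlog of blocked threads hits a limit.

    All three parameters are hypothetical inputs to a toy model:
    new_blocked_per_sec: threads that become blocked each second
    cleared_per_sec:     blocked threads the monitoring team clears each second
    failover_limit:      backlog size at which the database fails over

    Returns None when clearing keeps pace, so the backlog never grows.
    """
    net_growth = new_blocked_per_sec - cleared_per_sec
    if net_growth <= 0:
        return None
    return math.ceil(failover_limit / net_growth)

# Rarely used table: few new blocked threads, the team keeps up.
print(seconds_until_overwhelmed(2, 5, 450))

# Frequently accessed table: blocked threads arrive far faster than
# they can be cleared, so the limit is reached within seconds.
print(seconds_until_overwhelmed(50, 5, 450))
```

With these made-up numbers, the first call returns None because clearing outpaces blocking, while the second shows the backlog reaching the limit in a matter of seconds, which is far too fast for manual intervention.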

We can't stress enough how seriously we take this situation. We realize that not posting news daily or hourly about this might give the perception that we aren't taking it seriously, but we prefer to post news when we have something new to report.

Unfortunately, this time we have no news, except that we have the crashes minimized and that we're sparing no manpower or expense in getting this fixed. We hope this gives you better insight into what is being done, and we assure you that we take this seriously and are doing everything we can to find a solution to this issue.

Discussion of this news item may be found in this forum thread.