Progress Update: Restoring Tranquility to Tranquility

Since November 25th, 2009, a few days before the deployment of Dominion, we have experienced frequent unscheduled reboots of Tranquility. Almost all of them were caused by a bug in the networking subsystem that makes the SQL Server fail over.

Why, oh, why and what are you doing about it?

As soon as the first failover happened, we opened a support case with the vendor regarding the incident, per our policy, since our logs surprisingly showed nothing. Their response was that the problem was caused by a race condition in the system.

We have worked closely with the vendor's support and development teams to isolate the bug, collected vast amounts of diagnostic data and implemented changes the vendor considered potential solutions. We believe we've found a workaround that makes the bug unlikely to trigger, though it does not prevent it entirely. This has yet to be confirmed, however.

As one can imagine, it is difficult to diagnose a running, high-performance production environment like ours without causing lag or other performance or reliability problems. The vendor has been working diligently to reproduce this issue in their lab, although collecting diagnostic data from similar systems presents a major challenge: doing so without negatively impacting performance for customers.

Our programmers and virtual world system administrators are also putting together a test script to run on the database server we use for Singularity and Multiplicity. If we can reproduce the issue there, we can supply our vendor with code that reproduces the problem in their lab.
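A reproduction script of this kind typically hammers the suspect code path, in this case the teardown of TCP connections, in a tight loop. The sketch below is purely illustrative and not CCP's actual test script: it churns connections against a stand-in local listener, using an abortive close (`SO_LINGER` with a zero timeout) to exercise the RST teardown path alongside the normal FIN handshake. All names and parameters here are assumptions for the example.

```python
import socket
import struct
import threading

def churn_connections(host, port, iterations=200):
    """Open and close TCP connections in a tight loop to stress the
    peer's handling of closing sockets. Returns the failure count."""
    failures = 0
    for _ in range(iterations):
        try:
            s = socket.create_connection((host, port), timeout=2)
            # Abortive close: SO_LINGER with a zero timeout sends RST
            # instead of the normal FIN handshake, exercising a
            # different teardown path in the peer's TCP stack.
            s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                         struct.pack("ii", 1, 0))
            s.close()
        except OSError:
            failures += 1
    return failures

def run_dummy_server():
    """Local stand-in for the database listener, for demonstration only."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(128)
    def accept_loop():
        while True:
            try:
                conn, _ = srv.accept()
                conn.close()
            except OSError:
                break  # listener was shut down
    threading.Thread(target=accept_loop, daemon=True).start()
    return srv, srv.getsockname()[1]

if __name__ == "__main__":
    srv, port = run_dummy_server()
    print("failures:", churn_connections("127.0.0.1", port, iterations=100))
    srv.close()
```

A real reproduction would point this at the test database server and run many such loops concurrently, since race conditions usually need parallel teardown traffic to surface.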

I have personally spent a large part of my working hours over the last three months communicating directly with the vendor, collecting diagnostic data, setting up collection tools and otherwise working on resolving the SQL Server issues.

In short, we are using all the resources at our disposal to resolve this issue. It is a high-priority issue for all parties involved, as it affects not only our system and customers, but can equally affect other massive systems and user bases built on similar network and database solutions.

What have we done already? What do we know?

We know that the problem lies in the TCP stack and likely has something to do with the handling of closed or closing sockets. Our vendor has asked us to implement a few potential fixes or workarounds. We've adjusted various networking features and upgraded our SQL Server engine to a version that has a workaround for issues of this nature. The database handler in the EVE application server uses session pooling, and we've experimented with changing various settings there. Turning off recycling of idle sessions seems promising as a workaround that makes triggering the bug less likely.
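The reason that particular knob matters: with recycling on, a pool periodically closes idle sessions, and every close tears down a TCP socket, which is exactly the code path suspected of racing. With recycling off, idle sessions are reused indefinitely and the pool almost never closes sockets. A minimal sketch of the idea, assuming a hypothetical `make_session` factory (these names are illustrative, not the actual EVE server internals):

```python
import time

class SessionPool:
    """Toy session pool illustrating the idle-recycling trade-off."""

    def __init__(self, make_session, max_idle_seconds=None):
        # max_idle_seconds=None disables recycling: idle sessions are
        # reused forever, so the pool avoids closing sockets that
        # could trigger the teardown race in the TCP stack.
        self.make_session = make_session
        self.max_idle_seconds = max_idle_seconds
        self.idle = []  # list of (session, last_used_timestamp)

    def acquire(self):
        while self.idle:
            session, last_used = self.idle.pop()
            if (self.max_idle_seconds is not None
                    and time.monotonic() - last_used > self.max_idle_seconds):
                session.close()   # recycling path: closes a socket
                continue
            return session        # reuse path: no socket churn
        return self.make_session()

    def release(self, session):
        self.idle.append((session, time.monotonic()))
```

The trade-off is that long-lived sessions hold server resources and never get a fresh start, which is why recycling exists in the first place; disabling it is a mitigation, not a fix.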

We are still working toward a fix, as I said before, and the latest workaround does seem to make the failovers less frequent. Expect to hear more on our progress with this issue in the near future.

  • CCP Valar