Internet spaceship crashes are serious business

Greetings Capsuleer,

This blog might be slightly odd compared to the normal dev blogs you read around here, but in a good way. When the Council of Stellar Management visited some weeks ago and had the opportunity to sit down with senior testers and Quality Assurance leads, we realized that we rarely talk about Quality Assurance as the ongoing process it is; rather, we only mention it once in a while when something goes wrong. This is a terrible shame, so I figured it was time to sit down and share some insights into something we rarely talk about, but which tends to create bad experiences for players: crashes.

One of the most annoying things I can think of, especially in a game like EVE Online, is a crash. You might be moving around in a fleet, doing a mission, fighting an officer spawn, or otherwise doing something where a crash is exactly what you didn't need. It might cause you to lose your shiny new ship, which can ruin both your experience and your wallet. It's almost the worst-case scenario.

Crashes are serious business 

Because a crash is such an immersion-shattering experience, we go to great lengths to avoid them. One event that especially comes to mind was the deployment of Tyrannis 1.1 ("UICore"). During the dry runs, where all of Quality Assurance runs tests to make sure that critical functionality works on Tranquility post-deployment, a tester found a crash that would happen if you closed the Fitting Window before it had finished rendering the scene that displays your ship. While it's a bug that's easily avoidable, it is still an example of one with the potential to cause a lot of grief. Thus, the call was made by the people in charge to keep Tranquility down for an extra 5 hours and 5 minutes to build and test a set of new patches to resolve this issue.

Unfortunately, a scenario like this is a luxury we rarely have. In this case, a clear set of reproduction steps was available, which meant a fix could be tested and verified very quickly. But in most cases, we have little information to go on. We get the occasional bug report (which we greatly appreciate, keep them coming!), and sometimes we can reproduce the crash from it. Despite these difficulties, we have a couple of tricks up our sleeve, which I'll now show you.

Client statistics and Winqual

You know how, when an application crashes, Windows will prompt you to submit the crash report to Microsoft? Yeah, that thing. Most people tell you not to submit it, because it doesn't make a difference and is a potential privacy risk. Yes, I too thought that back in the day when it was introduced with Windows XP. Back when I was an EVE Online player and experienced the occasional crash, I'd go "pfft, this won't make a difference even if I submit it." Boy, was I wrong!

As it turns out, the data you're prompted to submit is invaluable for tracking down crashes. In the absence of clear reproduction steps, it's the second best thing we can hope for when we find a crash issue. We get the data you submit through a Microsoft program called "Winqual," which allows us to see statistics about certain crash "signatures" and to get crash dumps, which allow us to see in which part of the code the crash occurred and subsequently fix it.

For instance, when we deployed Incursion 1.1.0, it included no fewer than seven fixes for different crashes in different parts of EVE Online's subsystems. This is only possible thanks to the men and women who, when prompted by Windows to submit the crash report, actually submit it.

When we deploy any patch, we keep a very close eye not just on the forums and in-game channels, but also on other channels we have. One is Winqual, and the other is called “Client Statistics.” This is data sampled from Tranquility and written to our database once an hour, covering things such as crashes, memory usage, CPU time and ping times. From that, we’re able to see if a patch has had an impact on crashes. Here’s an example of what a normal day on Tranquility looks like:

Crash graph for a normal day on TQ

The percentage of logins to Tranquility that eventually end in a crash is between 0.5 and 0.9%. An interesting feature of this graph is that the percentage of crashes increases significantly around downtime. There is a very good explanation for this.

One of the most common causes of a crash in some of our sub-systems, such as the Carbon graphics engine, called Trinity, is code attempting to access memory which is no longer there. When a client shuts down, it needs to clean up after itself, which means it needs to free memory and shut everything down correctly and, preferably, as fast as possible. This leaves room for code to try to access a memory resource which has already been removed, which can result in a crash. Since a lot of clients shut down around downtime, that is when these crashes pile up.
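To make the crash percentage from the graph a bit more concrete, here is a minimal Python sketch of how it could be computed from hourly samples. Everything in it is an assumption for illustration: the `HourlySample` fields, the helper names and the numbers are made up and are not our actual Client Statistics schema or data. The only figure taken from above is the 0.9% upper end of a normal day, used here as a threshold.

```python
from dataclasses import dataclass


@dataclass
class HourlySample:
    """One hypothetical hourly row; field names are illustrative only."""
    hour: int      # hour of day, 0-23
    logins: int    # sessions that started in this hour
    crashes: int   # how many of those sessions ended in a crash


def crash_percentage(sample: HourlySample) -> float:
    """Percentage of logins that ended in a crash; 0.0 if nobody logged in."""
    if sample.logins == 0:
        return 0.0
    return 100.0 * sample.crashes / sample.logins


def hours_above(samples: list[HourlySample], threshold: float) -> list[int]:
    """Hours whose crash percentage is higher than the threshold."""
    return [s.hour for s in samples if crash_percentage(s) > threshold]


# Made-up numbers: two ordinary hours and one hour with a shutdown spike.
samples = [
    HourlySample(hour=9, logins=20000, crashes=140),
    HourlySample(hour=10, logins=18000, crashes=126),
    HourlySample(hour=11, logins=4000, crashes=60),
]

# Flag every hour that sits above the normal 0.5-0.9% band.
unusual_hours = hours_above(samples, threshold=0.9)
```

With these invented samples, only the last hour is flagged, which is the kind of bump you would expect to see when many clients shut down at once.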

If we see that a patch has had an impact on crash rates, we use Winqual to locate new "Crash Signatures." As I mentioned earlier, these come from the crash reports you submit when your client crashes, and we use them to pinpoint the specific causes which result in a crash. Here’s an example of a typical issue as it appears in Winqual:

Winqual data

Here we get a basic idea of when a crash was first observed by Microsoft, which versions of Windows crash the most and which language edition the operating system uses. There are some other interesting features in this graph. For instance, you can see that it took some days for the crash to really start happening. Notice the small peak before the big spike? That was a mass-test, which is when we take hundreds of people onto Singularity and test things. These often help us track down the patch in which a specific crash was introduced.

In situations like this, when we observe this kind of crash behavior, a developer is put on the case to fix the issue. As we don't always find reproduction steps, we schedule crash fixes into the next possible patch and monitor the situation once again. And as you can see in the above graph, we can also confirm that the specific crash was fixed (although there’s still some noise left).
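The idea of watching signatures appear and disappear between patches can be sketched in a few lines of Python. This is only an illustration under loud assumptions: it treats each submitted report as a plain signature string, the signature names are invented, and `compare_builds` is a hypothetical helper, not something Winqual provides.

```python
from collections import Counter


def compare_builds(before: list[str], after: list[str]) -> dict[str, list[str]]:
    """Compare crash signatures reported before and after a patch.

    Each list holds one signature string per submitted crash report.
    """
    old, new = Counter(before), Counter(after)
    return {
        # Signatures first seen after the patch: candidates for a new crash.
        "new": sorted(sig for sig in new if sig not in old),
        # Signatures no longer reported: candidates for a confirmed fix.
        "gone": sorted(sig for sig in old if sig not in new),
    }


# Invented signature names, for illustration only.
reports_before = ["signature-A", "signature-A", "signature-B"]
reports_after = ["signature-B", "signature-C", "signature-C"]

result = compare_builds(reports_before, reports_after)
```

In this made-up example, `signature-C` would be worth a developer's attention, while `signature-A` looks fixed. In practice there is noise, so a signature rarely drops to exactly zero, and a real comparison would more likely look at how the counts change rather than at presence alone.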

One of the problems, of course, is how much visibility we have into crashes and where they happen. Crashes are hard to deal with, really hard. If you experience a crash and get prompted to submit a crash report, please do send it. It gives us greater visibility into crashes and eventually helps us publish a timely fix. It also helps us separate the noise of crashes caused by bad hardware from the issues we can fix. It’s a win-win situation for everybody.

So do not fear submitting crash reports, and remember to fly safe!