Bugs at the gates

We recently fixed a bug which caused client lock ups for some of our customers. There were forum threads calling for heads to roll and all sorts of angry talk. Which is understandable, we screwed up. This blog is a collaboration between myself and CCP GingerDude in an effort to uncover the history of this bug.

And so it begins...

Sit down me wee ones and let me tell ye a tale. In ye olde ‘n happy days of EVE's youth, cloaking was introduced into the game. This was good and allowed for more ganking. Now, the intentions of The Designer were such that a cloaked ship should de-cloak if it was within 2000 meters of an object. This worked except for a few cases where someone would warp to a station and would not get de-cloaked for some strange reason, allowing the pilot to dock without ever being seen by other players. A bug regarding this was filed in mid 2004, but as it was a corner case which would only happen on a subset of stations. It didn't impact gameplay much and was really tricky to fix, so it got marked as low-priority and left to accumulate dust for years.

Enter CCP GingerDude, who doesn't like it when bugs become features. In an effort to clean up the backlog of really old issues he analyzed the proximity handling as it related to cloaking and changed the mechanism to allow sensors to sense cloaked objects in a limited manner. This was generally considered a good thing and, apart from a few minor issues, it seemed to work well. That is, until someone noticed that if you arrived within proximity range of NPC's and stayed cloaked for a few seconds, the NPC's would ignore you since their sensors had already processed your ship while cloaked and de-cloaking didn't ‘wake them up' again.

"Aha" thought GingerDude, "we need to generate a new event when an object de-cloaks" and set out to implement a state-change from cloaked to de-cloaked. Because of complicated technical issues a ship in space couldn't just generate an event when de-cloaking in this case, it had to mark itself as ‘de-cloaking' so that the simulation would pick up on it, generate the event and then finish the actual de-cloak. For all intents and purposes, being in the process of de-cloaking meant you were still cloaked until fully de-cloaked. This is important.

A few days before Apocrypha 1.2 launch a new issue with this was discovered. When ships jumped through gates there was a cloak state mismatch between different clients in some cases. This caused the game client to be unable to update its cloak state. This was fixed by letting the client treat the de-cloaking state like the ship had fully de-cloaked (as it would be fully de-cloaked within at most 2 seconds thereafter anyways). But with this fix was introduced yet another issue where a ship would de-cloak after jumping and other players would see and be able to target it 1 to 2 seconds before that ship was allowed to do anything itself. This was not so good for obvious reasons.

To address this latest fallout we changed the rule from ‘de-cloaking is cloaked' to ‘de-cloaking is de-cloaked' and updated both server and client code accordingly.

But we missed a spot.

Deep down in the rarely visited dungeons of basic simulation and communication layers there were implicit assumptions about what data the cloak-state variable could hold. De-cloaking player A meant that his ship should be added to the simulation of all other clients within range, but this piece of code would ignore ships in the intermediary state and not add them to other players simulation at this time.

So what was it that caused my client to lock up?

If a ship was de-cloaking at the exact same time as another player was loading the scene it would cause issues for the latter, copied below are the actual step by step instructions as they appear in our internal defect tracking system:

          Two accounts required.
          1\. position two pilots at same gate
          2\. first jump A, then B ~2 seconds later
          3\. as B is loading the scene on the other side have A move so that his ship uncloaks
          4\. note that B's session is now all messed up

So this was not just something that would in the end happen if you jumped often enough. It was a very specific order of events that had to happen within a pretty narrow timeframe to reveal the bug. There are however around 3 million jumps performed every day on TQ so it wasn't very hard to get 'lucky' in the live environment. Here we come to the reason why this specific issue wasn't caught before it got to TQ, it just wasn't very likely to happen on our sparsely populated test servers. Even when we knew exactly how to cause it we had a hard time getting the timing right in our local test environments. In cases like this it can be very helpful to have bug reports that are as detailed as possible, the narrow set of circumstances that seem to be unrelated to the problem at hand can be the lynchpin for a cascading flow of events that result in a troublesome bug like this one. Often the individual bug reports alone are not enough, but once there are enough of them gathered in the same place a pattern may start to emerge.

Deployment?

When an issue is found that has this kind of effect on game play we always attempt to tackle it as soon as possible. To be able to pull this off there is a specific team of experts that are labeled as a Live Team. These individuals are earmarked for being pulled off whatever development they are currently working on if a serious issue that falls under their area of expertise is reported on TQ. This approach has proven much more productive than involving everyone in the development team in the process. It usually only requires a few select individuals to solve each case, and this way everyone else can get on with their normal day to day game development stuff. Fixes by this team are usually deployed in what is referred to as a hotfix, which means there is only a server update over downtime (in some rare cases there have been updates without taking the server down). Without a client update the entire deployment process is much simpler, unfortunately in this case a client fix was required. This brings us back to timing, client updates are usually deployed on Tuesdays because it gives us Monday to prepare and the rest of the week for monitoring and responding to fallout. Unfortunately in this case Thursday the 21st was a day off.

Testing

A client patch requires a lot of work, we need to test all supported platforms, a multitude of client version patches need to be built and tested before being uploaded to the tubes and then downloaded again to be verified. Each of the modifications being made needs to be checked for changes in functionality and similar tests need to be run on all game features to look for unexpected fallout elsewhere. Overall this takes a couple of days to complete. Unfortunately in this case everyone (except CCP GingerDude) in the Iceland office had a day off on Thursday so patch building could not start until Friday, testing was concluded Monday and finally we deployed the fix to Tranquility on Tuesday the 26th. Which is rather late but we didn't want to break protocol and risk another blunder.

To conclude here is a timeline of the events from discovery to deployment:

All times are GMT.

Tuesday,19th of May - Apocrypha 1.2 release

17:47 -> GM Nythanos working out of the Atlanta office sends an e-mail reporting an influx of petitions describing a client lock up after jumping.

18:35 -> No abnormal DB activity found, database load tentatively ruled out as a cause.

19:33 -> An internal ticket is created as an umbrella for bug reports on this issue.

20:20 -> Forum thread created, demands for action start mounting.

22:39 -> Reply to thread asking for more bug reports.

Wednesday, 20th of May - Investigation begins

07:54 -> Issue escalated to a special team, more on this below.

12:19 -> Second reply to forum thread, quite a few helpful reports received during the night.

12:35 -> CCP Tuxford manages to reproduce the error on demand, starts digging into code.

14:13 -> Third reply to forum thread.

Thursday, 21st of May - Resolution (day off)

09:33 -> CCP GingerDude starts working on a fix for the actual issue.

12:43 -> Fix checked in.

17:27 -> Official info thread created.

Friday, 22nd of May - Patch making

Building and testing of the many patch versions required for a client upgrade starts.

Monday 25th of May - Green light on deployment

Final testing concluded on all aspects with no show stoppers reported, green light for deployment given.

Tuesday 26th of May - Tranquility is updated

09:00 -> Server taken down, upgrade applied.

11:57 -> Server comes out of VIP mode i.e. is open for business.

And that, me wee ones, be the end of me tale on the little bug that tried to get away.