Facing Destiny | EVE Online

Facing Destiny

2009-09-11 - By CCP GingerDude

Welcome class

I'm CCP GingerDude and I've been tasked with edumacating y'all on the recent changes I've done to our simulation engine, Destiny, particularly as it relates to the long standing and elusive issue known as desync.

Now, if you'd take your seats, remove your headphones and turn off them darned thingamajics you're constantly fidgeting with, we can get down to business.

So, what is Destiny?

Destiny, as stated above, is the simulation engine in EVE. It's the piece of code that makes your ship move, turn, collide, warp and cloak, to name a few of its tasks. In short, it maintains the state of all ships and in-space objects in the game. It runs both on the server and on each client and the idea is that it runs exactly the same on both the server and each client for any given scenario. Considering the deterministic nature of computers, running the same code with the same data should always yield the same results. The only difference between the server and client calculations is the fact that the server needs to process every scenario with objects in them for a given system, while the client only deals with objects visible in and relevant to the scene it is currently located in. In other words, the client deals with a subset of what the server deals with.

... and this desync you speak of?

A very good question. Over the years the term 'desync' has been applied to various unrelated issues, most of which are out of scope for this class, but let's get those that do not directly concern us out of the way:

  • Lag is not desync
    When you suddenly explode while believing you still had some structure left or warp into a scene and get ganked before you even notice the enemy on your overview, you're probably just lagged or have had your network packets dropped or delivered to you out of order. This is inherent in the nature of the intertubes and not much can be done on our end except to try to compensate, which is exactly what we do.

  • Clock skew is not desync
    When your shield and cap show wrong values, e.g. you start taking armor damage while still showing healthy shield status or the inverse when you don't take armor damage when you think your shield is gone, you're experiencing clock skew. Same applies for analogous cap issues. This is a simple artifact of PC clocks which all run at slightly different speeds. Your client will re-sync its internal clock when you log in, undock and jump but if you stay in the same space long enough, chances are you'll get slightly skewed. Usually not enough to notice, but it happens. On a couple of occasions in the past a server or proxy clock on TQ has gone out of sync, resulting in the same symptoms, but we've fixed our servers so they should never do that again.

  • Time shift is not desync
    The client and server simulations may get out of step, i.e. your client may be running the simulation a tick or two behind or ahead of the server. This is normal and happens mostly because of different load on different machines and is fully expected and dealt with. You'll see this e.g. as 'rubberbanding' when your ship gets 'snapped' back or forth after a warp.

So, what is desync?

It is the situation when the server and your client disagree on the position of an object in space at a given time.

If you were paying attention earlier you'll remember that computers are deterministic and since we're running the same code on the same data on both ends, getting into such a state should be impossible.

How does this happen?

The physics simulation is basically a differential equation: it only knows the current state of objects and only needs this to calculate what their state will be in the next timestep. But if any calculation of direction or acceleration is even slightly different between a client and the server, that difference, however small, will eventually lead to a big difference in position. The clever ones amongst you lot will already have figured out what the result of this is: when you can detect a desync, you've already missed what caused it.
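To make that concrete, here is a minimal sketch of how a one-off error grows. It is not Destiny code: the 1-D frictionless motion, the Euler step and the names `simulate`, `position_gap` and `first_tick_error` are all assumptions made for illustration.

```python
def simulate(ticks, first_tick_error=0.0, dt=1.0):
    """Integrate hypothetical 1-D motion with Euler steps; return the final position."""
    position, velocity = 0.0, 0.0
    for tick in range(ticks):
        acceleration = 1.0  # assumed constant thrust
        if tick == 0:
            # One side computes a slightly different acceleration, once.
            acceleration += first_tick_error
        velocity += acceleration * dt
        position += velocity * dt
    return position


def position_gap(ticks, first_tick_error):
    """Distance between a run with the one-off error and a run without it."""
    return abs(simulate(ticks, first_tick_error) - simulate(ticks))


early_gap = position_gap(10, 1e-3)
late_gap = position_gap(10_000, 1e-3)
```

Both runs execute identical code after the first tick, yet the gap keeps growing with every step, so by the time the positions are visibly apart the tick that caused it is long gone.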

Desync is hard to detect in the first place; a lot of ships are moving at high speeds, coming and going, bumping into each other and whatnot, and all of a sudden maybe one client gets a message saying that he can't target a ship which is, according to his UI, only 5km away because the server says it's out of range. Using devsploits to detect desync helps but as explained above, we can really only detect the issue some time after what caused it has already happened.

Why now?

You mean 'why has this issue persisted for so long'? Let me assure you it's not for lack of trying. On many occasions and for many years we've been trying to reproduce this reliably but without much luck until recently. We knew from the bug reports that desync mostly came from scenarios where a metric crapton of ships were interacting heavily. CCP staff had, of course, confirmed that this issue existed and even experienced it first-hand, but we needed to be able to make it happen on demand so we could debug it. Trying to read through the code, tracing paths of execution and maintaining variable dependencies with or without debuggers is extremely hard. Trying to spot places where paths could diverge or relations differ across processes is even harder. Coming up with ideas and hypotheses as to what might be happening and coding test cases and analyzing logs for those ideas is ridiculously frustrating, borderline cruel and unusual punishment IMHO.

So, what changed?

Well, some of you may remember a time when you could effectively warp scramble a capital ship by simply bumping into it repeatedly with a frigate or a shuttle. I won't get into details about that issue except to state that the fix required that the collision resolution guaranteed that the two objects were moved out of collision within the time step. This had some side-effects, some expected, some not. The expected side-effect was that ships would be 'bumped' a lot more when found to be in deep collision. The unexpected side-effect was that CCP Habakuk (who was a bughunter from our volunteer program at the time) found a repro case for desync. All we had to do was to h4x0r 10 or more motherships into our cargohold and eject them all, watch them scatter and use devtools to compare server and client position. They'd almost always become heavily desynced right away.

Aha, and thus the problem was what exactly?

Now that we knew that desync was somehow related to many ships being in intersection with one another, we started digging. We got in touch with the original Destiny author, CCP LeKjart, who quickly asserted that this repro implied that the order of the balls being tested for collision was different on client vs. server. However, the datastructure we used guaranteed a defined iteration sequence... or did it? Reading over the documentation again, this time with lawyer's lenses on, revealed a gotcha. The guarantee was for identical same-sized collections only, not subsets. Example: A collection with the numbers 5,3,4,6 would be processed in some defined order regardless of in what order the numbers were inserted, but the collection 5,3,4 would NOT be guaranteed to have the same iteration order relative to the superset. Those of you still awake may remember me explaining the difference between the client and server calculation at the very beginning of this lecture.
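The gotcha can be sketched with a toy container. This is not the datastructure Destiny used (the post does not name it); `toy_iteration_order` is a hypothetical stand-in whose bucket count simply equals the number of members, which is enough to reproduce the behaviour described with the 5,3,4,6 example.

```python
def toy_iteration_order(ids):
    """Order in which a toy hash container would visit its members.

    Hypothetical rule: bucket count == number of members, members are
    visited bucket by bucket, ties broken by id.
    """
    buckets = len(ids)
    return sorted(ids, key=lambda i: (i % buckets, i))


# The server holds the superset, the client only the subset it can see.
server_order = toy_iteration_order([5, 3, 4, 6])
client_order = toy_iteration_order([5, 3, 4])

# The server's order restricted to the objects the client knows about.
server_order_as_seen_by_client = [i for i in server_order if i in client_order]

# Sorting by id alone is one way to get an order that subsets share
# with their superset.
stable_server = sorted([5, 3, 4, 6])
stable_client = sorted([5, 3, 4])
```

Here the toy server visits 4, 5, 6, 3 while the toy client visits 3, 4, 5, so the shared objects are processed in a different relative order. With the id-sorted variant the client's order is always the server's order with the unseen objects removed.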

This also explains why we had such a hard time reproducing this issue previously. First, the order doesn't matter if only two objects are involved and would only yield a desync for a single object roughly half of the time if 3 balls were involved in the same collision. Second, the difference prior to the titan bumping fix was tiny and entirely in the velocity of an object being bounced away. Only after exaggerating the bumping did it have an immediately noticeable effect, particularly when many objects were involved since the objects would then start to bounce off one another many times in the same timestep.
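Why the order only starts to matter with three or more balls can be shown with a small sketch. The resolution rule below (push each overlapping pair apart by half the penetration, one pair at a time, along one axis) is an assumption chosen for simplicity and is not Destiny's actual collision response; `resolve_overlaps` and the ball names are hypothetical.

```python
from itertools import combinations


def resolve_overlaps(positions, order, diameter=2.0):
    """Resolve overlaps pair by pair, visiting pairs in the given ball order."""
    pos = dict(positions)
    for a, b in combinations(order, 2):
        gap = pos[b] - pos[a]
        penetration = diameter - abs(gap)
        if penetration > 0:
            direction = 1.0 if gap >= 0 else -1.0
            # Move each ball half the penetration depth away from the other.
            pos[a] -= direction * penetration / 2
            pos[b] += direction * penetration / 2
    return pos


two_balls = {"A": 0.0, "B": 1.0}
three_balls = {"A": 0.0, "B": 1.0, "C": 2.0}

two_ab = resolve_overlaps(two_balls, ["A", "B"])
two_ba = resolve_overlaps(two_balls, ["B", "A"])
three_abc = resolve_overlaps(three_balls, ["A", "B", "C"])
three_cab = resolve_overlaps(three_balls, ["C", "A", "B"])
```

With two balls there is only one pair, so both orders agree. With three, resolving one pair changes how deeply the next pair overlaps, and the two orders end up with different positions for the same starting state.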

Slight modifications to the collection of movable objects enforced a defined order and we tested again. This time the desync was much less, but still not gone. Looking for other spots using the same or similar datastructures where order might matter revealed two more cases. After the same treatment of those we tested yet again. Lo and behold, desync was gone.

Was it really that simple?

Don't be silly, of course not!

Every time a poor programmer-type of person such as myself starts to feel good about fixing stuff, evil, nasty, vicious QA-type people butt in. Testing the fix revealed that it was still sometimes possible to get desynced. All you had to do was to eject about 50 titans from your cargohold into space and then log or warp into that exact spot after a server reboot. Happens all the time on TQ...

At this time Apocrypha 1.5 was just a couple of days away and regardless of how quickly I could find and fix the cause of this, it would be too late for inclusion. We decided to release the partial fix anyways. It would be an improvement but not perfect.

The next few days found me cursing and swearing a lot. Because the repro steps involved rebooting the server, I had to code, compile and reboot everything every time I needed to check anything. I identified one more spot where an object could belong to different space partitions on client vs. server for a short time if the server had objects in that area and the client didn't, e.g. a cloaked ship. This was purely hypothetical but I fixed it anyways. But the issue remained and my hair was getting thinner.

After too much time had passed, I stopped trying to be smart and got clever. Since I now had a repro case involving only one client and only needing one timestep or two of the simulation to get a repro, I just littered the code with extremely detailed logging of every calculation and every variable. I couldn't really log into the game anymore and it took several seconds to advance each step but it didn't matter. A huge logfile was generated even before the first frame was rendered on screen. Then I sat down and wrote analyzing scripts which split the logfile into server vs. client lines, reordered the lines in chronological order adjusted for timeshift, discarded server only and irrelevant stuff and matched the server loglines with appropriate client lines.
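The analysis described above can be sketched roughly as follows. The original scripts and log format are not shown in this post, so the `side tick key value` line layout, the `client_tick_offset` parameter and the function names are all assumptions.

```python
def parse(line):
    """Split a hypothetical log line of the form 'side tick key value'."""
    side, tick, key, value = line.split(maxsplit=3)
    return side, int(tick), key, value


def find_discrepancies(lines, client_tick_offset=0):
    """Match client lines to server lines and keep only the ones that differ."""
    server, client = {}, {}
    for line in lines:
        side, tick, key, value = parse(line)
        if side == "server":
            server[(tick, key)] = value
        else:
            # Adjust for timeshift so both sides use the same tick numbering.
            client[(tick + client_tick_offset, key)] = value
    # Server-only entries are never visited; identical pairs are thrown out.
    return [
        (tick, key, server[(tick, key)], value)
        for (tick, key), value in sorted(client.items())
        if (tick, key) in server and server[(tick, key)] != value
    ]


sample_log = [
    "server 100 ball7.accel 0.25",
    "server 100 ball9.accel 0.50",  # server-only object, irrelevant to the client
    "client 98 ball7.accel 0.75",
    "server 101 ball7.accel 0.30",
    "client 99 ball7.accel 0.30",
]
differences = find_discrepancies(sample_log, client_tick_offset=2)
```

In this made-up sample the only surviving entry is the acceleration of `ball7` on the client's first tick, which is the kind of line you want staring you in the face.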

Once that was done, the lines were compared, and identical ones thrown out. Staring me in the face was a strange discrepancy. The acceleration calculated for objects in collision on the very first tick of the client simulation was radically different from what the server yielded. Examining the formula for bounce velocity revealed the error. Remember when I said that the simulation only uses the current state of objects to calculate the next state? Well, here was the only place in the simulation code where your velocity during the previous timestep was in fact used.

Remember that both client and server run the exact same code. When an object is added to space with the client already in the scene, both client and server start with every variable holding the same value. When logging into a scene which the server has already loaded, however, the server will have calculated a velocity value from the previous tick and stored that as a temporary. The client has no such temporary and incorrectly assumes that your current velocity is also your previous velocity. If a ball was in mid-collision during that very first tick, what happens? Desync.

A quick test where I added this extra information to the packets sent to the clients eliminated all desync cases completely and utterly.
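A minimal sketch of the bug and the fix, under loud assumptions: the bounce formula below is invented purely so that it reads the previous velocity, and the `Ball`, `snapshot` and `client_ball_from` names and the dictionary "packet" are hypothetical, not Destiny's actual wire format.

```python
from dataclasses import dataclass


@dataclass
class Ball:
    velocity: float
    previous_velocity: float


def bounce_velocity(ball):
    # Hypothetical formula; all that matters is that it uses previous_velocity.
    return -0.5 * (ball.velocity + ball.previous_velocity)


def snapshot(ball, include_previous):
    """State the server sends to a client entering the scene."""
    state = {"velocity": ball.velocity}
    if include_previous:
        state["previous_velocity"] = ball.previous_velocity
    return state


def client_ball_from(state):
    # Without the extra field the client assumes previous == current velocity.
    previous = state.get("previous_velocity", state["velocity"])
    return Ball(state["velocity"], previous)


# The server has been simulating this ball for a while.
server_ball = Ball(velocity=10.0, previous_velocity=4.0)

old_client = client_ball_from(snapshot(server_ball, include_previous=False))
new_client = client_ball_from(snapshot(server_ball, include_previous=True))
```

The old client computes a different bounce from the server on its very first tick, while the client that receives the extra field agrees with the server exactly.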

Nice, so problem solved, right?

Well, almost. At this time we wrote the patch notes for Apocrypha 1.5 and claimed to have fully fixed the issue. And we had, really. But I wasn't completely done.

First, I took a long hard look at how we bump ships apart after the titan fix. Although grateful that it had revealed the desync issue, I had never been very satisfied with how the distribution of force to move the ships out of collision worked. It yielded too much speed for heavy objects and too little for lighter ships. After a few tweaks I settled for a much better approach: heavily intersecting ships were now bumping much less and titans could now simply push lighter ships out of their way. Titans were now losing only a small fraction of their speed when colliding with lighter ships, without the lighter ships shooting out like an air-filled balloon submerged in water being released. I also tested new ways to calculate bounce without relying on the previous velocity, which worked well. I was a happy camper that day and life was peachy. The sun shone on my ginger little head and nothing could spoil my triumph. Right?

Except evil, nasty, vicious QA, who were dead set on crushing my ego and abolishing happiness from the world at large. Again. Two days before Apocrypha 1.5 patch day, no less. How these guys manage to sleep at night, I do not know.

They'd figured out that they could get desync by going in a pod really close to a station, creating a bookmark there, and then warping a titan to that bookmark so that it intersected the station in many places. I had only a couple of hours to do something before the final build or else the patch would be delayed for weeks. Quickly employing the process of elimination, we determined that my bouncing tweaks or some other cleanup was somehow to blame and that the fix for the previous velocity issue was sound. Thus, all those extra bumping tweaks and the fix for the hypothetical desync scenario were quickly reverted, leaving only the core velocity issue fix in the patch.

How does this end? What happens next?

Dominion is what happens next. I'm sorry to keep you guys and gals waiting for the rest of the tweaks and for the fact that we may have jumped the gun slightly by calling the desync issue fully fixed in this patch when there may still be at least a hypothetical chance of desync.

But, you're all welcome to log onto Singularity in a couple of weeks where Dominion features will slowly be revealed and enjoy, amongst other things, much improved bumping and syncing. If you experience desync henceforth, be sure to file a bug report. It will probably make me cry, but hey, better you than QA.....

That is all. Class dismissed.