Fixing Lag: Drakes of Destiny (Part 1)
Hi. I'm CCP Masterplan, one of the members of Team Gridlock. Our job in Gridlock is to tackle the issues of degraded performance under excessive load - in particular making EVE's signature massive fleet-fights more fun.
Whilst CCP Veritas has been having lots of fun switching modules on and off, I've been looking at another area of server load: Things in space. Specifically, adding and removing them, moving them around, and how we tell your clients all about them.
You might have seen mention of something called Destiny, and how it is something to do with physics. In this blog, I'm going to talk a bit about what Destiny does, and how I've been profiling it. In the second part, I'll show how one improvement in particular applies to my current nemesis: missile-spamming Drakes.
The first step in improving performance under heavy load is to be able to reproduce a situation where load occurs. Previously this was done during mass tests. However, thanks to the wonderful thin-clients, we now have much simpler ways to recreate different aspects of fleet fights on-demand.
We know that Drakes are currently a popular ship, and we know that they can fire lots of missiles. All the following results were obtained using this same scenario:
- 200 thin clients flying Drakes, each with three fully loaded heavy missile launchers (representing three grouped launchers)
- 100 thin clients flying other ships. These were present to add ‘observer load' - representing the other people involved in a fleet fight. This is important, as the cost of a missile is somewhat proportional to how many observers can see it.
- 1 control tower with a ‘death star' set-up. The tower was configured NOT to aggress attackers.
Each Drake was instructed to lock the control tower. They were then instructed to activate their launchers, firing until their loaded ammo was depleted - this gave around 5-6 minutes of steady-state load (AKA missile-spam).
The number of ships was chosen (by trial-and-error) to push the CPU load of the node to around 95% during the steady-state missile spam. This is an important level, as it represents the edge of performance where lag starts to occur (module activations start to become delayed, movement commands stop being responsive, jumping becomes difficult, and so on). These absolute numbers can't be directly compared to TQ due to hardware differences, but the relative proportions can.
During the steady-state load, we can then use profiling tools to analyse where we are spending CPU time. We can measure any improvements against this baseline of 95% load. Note that this test is entirely aimed at tackling Destiny load in the case of a stationary (non-warping) fleet battle. As such, it doesn't cover other bottleneck areas such as jumping (though other work has focused on this). Turret-based damage was also not considered, since this has been observed to cause lower load relative to missiles. However, this is an on-going effort - as the load of one aspect is reduced, others will become proportionally larger and get their share of Gridlock's Special-Love™.
I'm your density. I mean... your destiny
Before I show you the results of the profiling, it would help if I explained a bit more about Destiny. We often talk about Destiny as being the physics engine of EVE. Whilst that is an important part of what it does, as an overall system it is responsible for much more as well.
First I need to define a couple of terms, so everyone knows what I'm talking about:
- Ballpark: Manages the space occupied by a single solar-system. A ballpark is a collection of bubbles.
- Bubble: A small volume of space within which entities can physically interact. (Players have come to refer to this as a 'grid'.) Not to be confused with Warp Disruption Bubbles, which this report is NOT about. Bubbles shrink and grow dynamically, and can never overlap.
- Ball: An entity (ship, drone, missile, asteroid, NPC, wormhole...) in space. As far as Destiny is concerned, all objects in space are represented by one or more balls. As the name suggests, the simulation treats them as spheres.
- Tick: Destiny operates in discrete ticks. Each ballpark runs at a rate of one tick per second.
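To make the relationship between these terms concrete, here is a minimal sketch of how they might nest. It is purely illustrative and not Destiny's actual code: the class names mirror the terms above, but every field and method (`ball_id`, `add_ball`, `remove_ball`, `tick_count` and so on) is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Ball:
    """An entity in space (ship, missile, ...), treated as a sphere."""
    ball_id: int
    position: Tuple[float, float, float]
    radius: float


@dataclass
class Bubble:
    """A volume of space within which balls can physically interact."""
    bubble_id: int
    balls: Dict[int, Ball] = field(default_factory=dict)


@dataclass
class Ballpark:
    """One solar system: a collection of bubbles advanced in discrete ticks."""
    bubbles: Dict[int, Bubble] = field(default_factory=dict)
    tick_count: int = 0

    def add_ball(self, bubble_id: int, ball: Ball) -> None:
        # Hypothetical helper: create the bubble on demand, then register the ball.
        bubble = self.bubbles.setdefault(bubble_id, Bubble(bubble_id))
        bubble.balls[ball.ball_id] = ball

    def remove_ball(self, bubble_id: int, ball_id: int) -> None:
        # Hypothetical helper: e.g. a missile exploding removes its ball.
        self.bubbles[bubble_id].balls.pop(ball_id, None)

    def tick(self) -> None:
        # The real tick does far more; here we only count ticks.
        self.tick_count += 1


park = Ballpark()
park.add_ball(1, Ball(ball_id=10, position=(0.0, 0.0, 0.0), radius=150.0))  # a ship
park.add_ball(1, Ball(ball_id=11, position=(10.0, 0.0, 0.0), radius=1.0))   # a missile
park.remove_ball(1, 11)                                                     # missile explodes
park.tick()
```

Every missile fired means one `add_ball` and, a little later, one `remove_ball`, which is why the number of balls changing per tick matters so much in the profiling that follows.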
That finely-shaped engine of the universe
Each tick comprises three main steps:
- Pre-tick
  - Update the bubble-ball relationships
  - Apply any client actions (Warp, Orbit...) that came in since the last tick
  - Apply any server events (Explode, spawn missile...) that happened since the last tick
  - Build client updates, so they know what has happened
  - Send updates to clients
- Evolve
  - Move balls according to their current action
  - Resolve collisions between balls
  - Dispatch collision callbacks
- Final step
  - Handle new clients (due to jump-ins, undocks etc)
Once per second, we go through these steps. Whatever time is left over is then used to process everything else that needs to happen. It is important to note that the Destiny tick is entirely non-blocking - we can't do any asynchronous operations such as database queries or anything that might cause execution to yield. Most of the collision callbacks, for example, simply schedule a task to be run later. In this task, we can then perform any database operations we need, such as creating wrecks and moving characters to the clone bay.
This is why, in loaded situations, there is sometimes a delay between a missile hitting a target and that target actually exploding. Sometimes this is even long enough for the target to warp out and think he is safe. In fact, he is already dead - he just doesn't know it yet!
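The "schedule it for later" pattern can be sketched roughly as follows. This is an illustration of the idea only, not Destiny's implementation: `TickSketch`, `on_missile_hit`, `explode` and `run_deferred` are hypothetical names, and a plain queue stands in for whatever task scheduler the server really uses.

```python
from collections import deque


class TickSketch:
    def __init__(self):
        self.deferred = deque()  # tasks to run after the tick, in leftover time
        self.log = []            # records the order in which things happen

    def on_missile_hit(self, target):
        # Collision callback: runs inside the tick, so it must not block.
        # It only records the hit and schedules the expensive follow-up.
        self.log.append(f"hit {target}")
        self.deferred.append(lambda: self.explode(target))

    def explode(self, target):
        # Hypothetical slow work (creating a wreck, database writes) goes here.
        self.log.append(f"explode {target}")

    def tick(self, collisions):
        # Inside the tick we only dispatch callbacks; nothing here may yield.
        for target in collisions:
            self.on_missile_hit(target)

    def run_deferred(self):
        # Outside the tick: drain whatever was scheduled.
        while self.deferred:
            task = self.deferred.popleft()
            task()


sim = TickSketch()
sim.tick(["Drake A"])
log_after_tick = list(sim.log)  # the hit is known, the explosion is still pending
sim.run_deferred()
log_after_deferred = list(sim.log)
```

Between `tick` and `run_deferred` the target has been hit but has not yet exploded. On a heavily loaded node that gap is the window in which a doomed ship can appear to warp away.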
The Evolve step is the most complex part, at least in terms of hardcore maths. This is where all the physics of spaceflight happens (at least EVE's version of spaceflight physics). It has long been assumed that this was also where the majority of CPU time is spent - and so where the biggest savings could be made. Suggestions such as 'GPUs are good at math - why not implement this on a GPU?' have been raised. However, as we'll see shortly, the Evolve step is responsible for such a small portion of the overall load that such efforts would probably be counter-productive once the additional communication overheads are taken into account.
Our favourite new toy
I'd now like to introduce the latest weapon in our anti-lag arsenal: a profiling tool called Telemetry. This is a tool for visualising real-time application performance, developed by RAD Game Tools.
We started evaluating this over the summer, and have been finding more and more uses for it. Essentially what Telemetry does is build a timeline of events as the code execution moves in and out of named regions, and then provide a visual way to analyse this execution data. Combined with the thin-clients, we now have a really good way both to generate load, and to visualise it. This makes us in Gridlock very happy.
Figure 1 shows how Telemetry displays a particular Destiny tick. Time is along the horizontal axis (increasing to the right). The vertical axis (increasing downwards) shows how execution enters and exits different timer sections. This looks much like the evolution of a traditional call-stack over time, just using named timer sections instead of function calls. When a new timer section is entered, a new block is added below the previous one. When a timer section is exited, the corresponding block is removed.
(Programmers are odd creatures. Not only do they start counting from 0 instead of 1, they tend to view structures such as stacks growing down, rather than up. It is just the way we do things...)
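For readers who have not used a tool like this, the idea of named timer sections can be imitated in a few lines. To be clear, this is not Telemetry's API and has nothing to do with how RAD's tool is built: `Timeline`, `section` and `render` are made-up names for a toy that records when each named region is entered and exited, and how deeply it is nested.

```python
import time
from contextlib import contextmanager


class Timeline:
    def __init__(self):
        self.events = []  # (name, depth, start, end) for each finished section
        self._depth = 0   # how many sections are currently open

    @contextmanager
    def section(self, name):
        depth = self._depth
        self._depth += 1
        start = time.perf_counter()
        try:
            yield
        finally:
            end = time.perf_counter()
            self._depth -= 1
            self.events.append((name, depth, start, end))

    def ordered(self):
        # Sort by start time; depth breaks ties so a parent precedes its child.
        return sorted(self.events, key=lambda e: (e[2], e[1]))

    def render(self):
        # Indentation plays the role of the "downwards" axis in the figure.
        lines = []
        for name, depth, start, end in self.ordered():
            lines.append(f"{'  ' * depth}{name}: {(end - start) * 1000:.3f} ms")
        return "\n".join(lines)


timeline = Timeline()
with timeline.section("Tick"):
    with timeline.section("Pre-tick"):
        sum(range(10000))  # stand-in for real work
    with timeline.section("Evolve"):
        sum(range(1000))
print(timeline.render())
```

The printed output lists "Tick" first, with "Pre-tick" and "Evolve" indented beneath it, which is the same shape as the stacked blocks in the figure (minus the pretty colours).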
Figure 1: A typical Destiny tick during missile spam, with the three main steps highlighted
Looking at Figure 1, we can see that less than 10% of the overall time in Destiny is spent in Evolve. In fact, the majority of Evolve is spent in the collision callbacks - in this case scheduling missiles for explosion. That is why I mentioned above there is little to be gained right now in making the actual simulation code run any faster.
The majority of the time is spent in the Pre-tick step. In particular, two operations make up the bulk of the time:
1) For the list of balls which were added/deleted since the last tick, figure out which clients need to be informed. When a missile is fired a ball must be added to space - when a missile explodes a ball is removed from space. Drakes are good at making this happen a lot!
2) Sending an update to each client identified in the step above. Further investigation showed that this is mainly expensive due to the time required to serialise each update message. (Serialising means converting a data structure in memory into a stream of bytes that can be sent over a network.)
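A deliberately naive sketch shows why these two operations scale so badly with missile spam. It is an assumption-laden toy rather than the server's code: the function names, the dictionary layout and the use of `pickle` as a stand-in for the real wire format are all invented for illustration.

```python
import pickle


def clients_to_inform(changed_balls, observers_by_bubble):
    """Operation 1: work out which clients must hear about each ball change."""
    updates = {}
    for change in changed_balls:
        for client in observers_by_bubble.get(change["bubble"], []):
            updates.setdefault(client, []).append(change)
    return updates


def build_messages(updates):
    """Operation 2: serialise one update message per client."""
    # pickle is only a stand-in for whatever format is really sent.
    return {client: pickle.dumps(changes) for client, changes in updates.items()}


# Three observers share bubble 1; one missile appears and another explodes.
observers_by_bubble = {1: ["client_a", "client_b", "client_c"]}
changed_balls = [
    {"ball": 501, "bubble": 1, "event": "added"},
    {"ball": 377, "bubble": 1, "event": "removed"},
]

updates = clients_to_inform(changed_balls, observers_by_bubble)
messages = build_messages(updates)
```

In this toy, the inner loop of the first operation runs once per change per observer, and the second operation serialises once per observer. With 300 clients on grid and hundreds of missiles appearing and exploding every tick, both counts get large quickly, which matches the earlier observation that the cost of a missile is somewhat proportional to how many observers can see it.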
If we can find a way to make each of these steps more efficient, then overall Destiny load will be reduced. Well, as it turns out, there is indeed a way. But to find out how it works, you'll have to wait for the second half of this blog...