HED-GP Technical Retrospective: What a HED-ache
Hello spacefriends, your local technical director and performance explainerer here. As a few of you may know, we had a fight recently in HED-GP. Not the first time that's happened and probably not the last, but it's quite notable as it was moderately gigantic in scale and the performance of the server was, well, not pretty. I'd like to talk today a bit about why that was from the technical standpoint.
As many have pointed out, there's a fight in recent history that was of nearly identical scale to the HED fight: 6VDT on 28 July, 2013. Performance for that fight was considerably better than the HED fight. I want to put a number on the comparison but first I need to explain what that number means.
We've got three metrics that are used in series to talk about how loaded a server is. The vast majority of the time, we need only look at the current CPU utilization of the node, as that's usually nicely below 80% and everything's nice and responsive. Beyond that, the server starts getting overloaded and Time Dilation (TiDi) starts kicking in, so we switch to using the amount of dilation as the metric to track. If you keep on going though, eventually you'll hit the artificial 10% cap on the time dilation factor. That cap is in place to provide some form of resolution to an extremely overloaded fight, even if it’s one side disengaging and warping out. Naturally, folks can induce more load than what would put a server into 10% TiDi, so we need a metric to measure how bad things are past that point.
Handily, we have one from the pre-TiDi days - how far behind the process that handles module stop/repeats is in its processing, called Dogma Lateness. “Dogma” being the system that handles modules and their effects, among other things, and “Lateness” being a common term for how late something is. Clever name really. In any case, this is the number I believe best captures the impact of overloading on players. It looks beyond how unresponsive the server may be to immediate requests and accurately captures how much built up load there is to chug through before normal processing of the game can resume.
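For the code-inclined, here's a rough sketch of how the three metrics hand off to one another. The function name, argument names and the idea of expressing TiDi as a factor between 0.10 and 1.0 are assumptions made for illustration, not how the server actually does it.

```python
TIDI_CAP = 0.10  # from the text: the time dilation factor is capped at 10%


def load_metric(cpu_utilization, tidi_factor, dogma_lateness_s):
    """Pick the metric that best describes server load right now.

    Hypothetical helper. tidi_factor is assumed to be 1.0 when time is
    not dilated and to bottom out at TIDI_CAP.
    """
    if tidi_factor <= TIDI_CAP:
        # TiDi is pinned at its cap, so only lateness still tells us anything.
        return ("dogma_lateness_s", dogma_lateness_s)
    if tidi_factor < 1.0:
        # Time is dilated but not yet capped: track the amount of dilation.
        return ("tidi_factor", tidi_factor)
    # No dilation at all: plain CPU utilization is enough.
    return ("cpu_utilization", cpu_utilization)


print(load_metric(0.55, 1.0, 0.0))
print(load_metric(1.0, 0.40, 0.0))
print(load_metric(1.0, 0.10, 42.0))
```

The three calls walk through a quiet node, a dilated node, and a node stuck at the TiDi cap, in that order.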
So, the gravy. In 6VDT, the Dogma lateness metric peaked at 42 simulation-seconds. Since the system was at 10% TiDi, that translates to 7 minutes in real life time. HED-GP on the other hand peaked at 193 simulation-seconds, or 32 minutes real-time. That, as you can imagine, is a pretty big difference in player experience.
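The conversion behind those figures is just a division by the TiDi factor. A minimal sketch, assuming TiDi stayed pinned at 10% for the whole backlog (the function name is made up for this example):

```python
def real_seconds(sim_seconds, tidi_percent=10):
    """Wall-clock seconds needed to get through sim_seconds of simulation
    time when the simulation runs at tidi_percent of normal speed."""
    return sim_seconds * 100 / tidi_percent


# Peak Dogma lateness figures quoted above, in simulation-seconds.
for name, lateness in (("6VDT", 42), ("HED-GP", 193)):
    minutes = real_seconds(lateness) / 60
    print(f"{name}: {lateness} sim-seconds is about {minutes:.0f} minutes of real time")
```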
What caused the difference? Well, we can't be completely sure of that as we don't run our performance analysis tools during such insanely high load since they add load of their own, but we've got a couple of pretty good guesses. Those guesses are increased drone use and the extended length of the fight. The influence of the fight's length is pretty easy to see: the two fights start out similarly for the first couple of hours, but then 6VDT cooled down while HED-GP carried on with more and more load backing up over time.
Now for a number that motivates the drones reasoning. It's difficult to recreate from logs exactly how many active drones there were in space at any given time in each fight, but as a close approximation we harvested all of the logged events of drones being deployed and counted how many unique drone instances were dropped in each system over the course of each fight. In 6VDT there were 21,123 unique drones deployed into the system, while in HED there were 38,852. For the division-impaired, that's an 84% increase. That number probably doesn't directly translate to an 84% increase in the number of active drones on field at any given point, especially given the extended duration. Even so, it does indicate a substantial increase in drone usage in HED over 6VDT.
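If you'd like to check the division yourself, it's a two-liner using the two counts quoted above (the variable names are just for this example):

```python
drones_6vdt = 21_123  # unique drones deployed in 6VDT
drones_hed = 38_852   # unique drones deployed in HED-GP

increase_pct = (drones_hed - drones_6vdt) / drones_6vdt * 100
print(f"{increase_pct:.0f}% more unique drones deployed in HED-GP")
```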
Now, for why that's a problem. Drones plainly do more than guns do: guns go pewpew, while drones approach, orbit, go pewpew, then come back, doing some decision making of their own along the way. Yes, even sentry drones do these things, they just move very, very slowly. Given that, it’s hopefully no news that, all things being equal, a drone attack event takes more work to process than a single gun hit. The picture gets worse though after you consider the second part of the scaling problem: how many game clients see those events and therefore need to be told about them.
This is one of the bounding scaling factors in large fleet fights: the unavoidable O(n²) situation where n people do things that n people need to see. This is a problem for guns too, of course, but it’s magnified for drones in two ways. The first is that drones simply generate more messages, so as the n² term becomes large, the drones' contribution grows faster. The other is that the decision-making code behind drone behavior does a poor job of scaling, often considering all attackable objects on grid when figuring out who to go after. Again, an n²-like problem, where n drones each consider attacking (n + num_ships) targets.
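To put some shape on that, here's a back-of-the-envelope sketch of both effects. The function names, the pilot and drone counts, and the "three events per drone cycle" figure are all made up for illustration; only the n × n and n × (n + num_ships) shapes come from the description above.

```python
def broadcast_messages(n_pilots, events_per_pilot):
    """Messages sent if every pilot's events are relayed to every pilot on grid."""
    return n_pilots * events_per_pilot * n_pilots


def drone_target_checks(n_drones, num_ships):
    """Targets examined if every drone considers every attackable object on grid."""
    return n_drones * (n_drones + num_ships)


# Hypothetical numbers: 1 event per gun cycle vs. 3 per drone cycle.
for pilots in (1000, 2000, 4000):
    guns = broadcast_messages(pilots, events_per_pilot=1)
    drones = broadcast_messages(pilots, events_per_pilot=3)
    print(f"{pilots} pilots: {guns:,} gun messages, {drones:,} drone messages")

# Hypothetical grid: 5 drones per ship.
for ships in (1000, 2000, 4000):
    checks = drone_target_checks(n_drones=5 * ships, num_ships=ships)
    print(f"{ships} ships: {checks:,} target checks per decision pass")
```

Doubling the number of pilots quadruples both totals, and the extra events per drone cycle multiply a number that is already growing quadratically.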
These are addressable problems. The recently reinvigorated performance-minded gameplay engine team - Team Gridlock - has been having conversations with game design about what their goals are with drones and how we can support them in making drones a viable and happy weapon system for players and the hamsters that run Tranquility alike. It's unclear at this point how work specifically relating to drones will be prioritized against the ongoing Dogma work the team has been taking on. That prioritization depends strongly on the future popularity of drone-centric fleet doctrines, which we're expecting to change due to both general drone balance updates and other design adjustments currently being planned. We'll have a clearer picture of how needed the work is as we see those efforts come to fruition.