Optimizations, optimizations, optimizations

We've been reading a lot on the forums lately, with people bashing this and that decision for various reasons, but mostly because of optimizations, which is quite new.

Now, me being the troll that I am, I was itching to reply to most of them. However, sense got the better of me, since supposedly I'm the one who has to look remotely intelligent and respectful, being the Producer of EVE. Many of you have met me in person and, if anything, I don't ooze intelligence and respect. I ooze more of alcohol and lust for cigarettes when I'm drunk. (See chapter 14 of "Oveur gets drunk at Fanfest".)

Anyway, we felt we had to explain some very fundamental things about how you make a computer game, well, not just a computer game but an MMO, and how all this scaling works: where "lag" is created, perceived and exterminated.

Why? Because reading through the forums, there seems to be this common illusion that fixing things is usually just a matter of doing a single thing. That's wishful thinking at best. There is no "Fix" button any more than there is an "I Win" button.

EVE Architecture 101

As you can see in this nifty animated pic I did, there are many places where "lag" can exist, and most of it only happens in certain places. However, lag is most commonly caused by either insufficient hardware resources (CPU, RAM) or the network (packet loss, latency), which is the Interweb part and is mostly out of our control.

We showed a couple of examples of where we are addressing lag in Red Moon Rising. We did this because we wanted to show you how one thing can severely affect other parts of the game. For example, factory facilities use a lot of hardware resources on our SOL servers and SQL servers, which directly affects Drones, NPC's, Agents and Turrets/Effects (which is essentially combat), since those systems are also using resources on the SOL servers.

The same can therefore be said for Drones: they use the same resources that NPC's and Turrets do, both on the Client and on the SOL servers. Fortunately, our Proxy layer does not suffer from most of what is listed; its load is more general CPU usage, since it's basically just routing and load balancing.
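
To make the "everything shares the same resources" point concrete, here is a toy Python sketch. It is not the actual server code: the function name, the subsystem names as dictionary keys, the millisecond figures and the proportional slowdown rule are all assumptions made up for illustration.

```python
# Toy model: several subsystems share one SOL node's CPU budget per tick.
# All names and numbers are illustrative assumptions, not real EVE figures.

def share_cpu(budget_ms, demand_ms):
    """Scale every subsystem down proportionally when total demand exceeds the budget."""
    total = sum(demand_ms.values())
    if total <= budget_ms:
        # Enough CPU for everyone: each subsystem gets what it asked for.
        return dict(demand_ms)
    scale = budget_ms / total
    return {name: ms * scale for name, ms in demand_ms.items()}

# Same combat demand in both cases; only the factory demand changes.
quiet = share_cpu(100, {"factories": 10, "drones": 30, "turrets": 40})
busy = share_cpu(100, {"factories": 80, "drones": 30, "turrets": 40})

print(quiet["turrets"], busy["turrets"])
```

In this model the turrets ask for exactly the same amount of work in both runs, yet they get less CPU in the second one purely because the factories got greedy.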

The last view you see is the easiest way to address many of these problems: adding more hardware (affectionately known as YARRDWARE from now on). However, this only solves the problem to some extent, since all of this has to be solved in content and software too.

We always want to evolve the gameplay in EVE and in many cases, like Drones, there is no way to do that without optimizing the bejesus out of the system. An easy fix isn't always possible, since the system is already so resource intensive that we are directly prohibited from evolving it.

Optimizing it usually means exterminating the factor which causes the most load - and in the case of Drones, the number of drones in space was #1, using heavy resources on the Client and the SOL servers. Exterminating the factor meant reducing the number of drones, plain and simple.

Therefore, we always attack on all fronts: we work the content, we optimize the software AND we buy YARRDWARE. This enables us to evolve gameplay while at the same time getting better performance on both the client and the SOL server.

BUT MESS WITH SOMETHING ELSE THAN MY STUFF!!!

Everyone gets the nerf bat in situations like this. There is no option to leave resource intensive systems like they are. I was actually reading a bit from Colin Powell the other day and I found this quote quite relevant to this whole discussion:

"'If it ain't broke, don't fix it' is the slogan of the complacent, the arrogant or the scared. It's an excuse for inaction. It's a mindset that assumes (or hopes) that today's realities will continue tomorrow in a tidy, linear and predictable fashion. Pure fantasy. In this sort of culture, you won't find people who proactively take steps to solve problems as they emerge."

It's quite blunt, but I like it and it constantly reminds me to never stop evolving EVE. If something is using so many resources that it affects other gameplay and directly prevents evolution, it's broken.

WHY DON'T YOU JUST MOVE THE AGENTS!!!1

Interesting suggestion, we hadn't thought of that. Oh wait, we had - and it has been stated a number of times by us that we will move agents. We were even going to do it in the last patch, but found some other measures which could have solved the problem. They didn't. Well, agent relocation is in RMR.

You see, the problem with our 10 hotspot systems is the fact that a single solar system, with all its activity - like Agents/NPC's, Dungeons, Stations etc. - can only run on a single CPU. These systems already have their own CPU. (Market, Corp and Chat services are not included in this; they are separate.)

This brings us to solving the problem in the content, simply by moving the agents around. This is going to cost us a lot of customers, but we're still going to do it because at this point in time, it's the only way.

We're already optimizing the individual systems (Effects, Turrets, NPC's, Drones, Manufacturing/Research Facilities) so that a solar system handles far less work than today, resulting in more resources available for the people playing in that solar system, but that will still not be enough.

We have neighbouring systems that are running on CPU's with faaar less average CPU load than, for example, Rens, which is at 100% pretty much all day. Hence the simple suggestion to players to move to another system - before we force you to.
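
Here is a rough Python sketch of why moving agents helps when one solar system is pinned to one CPU. Everything in it is hypothetical: the `neighbour_a` and `neighbour_b` systems, the base loads, the per-agent cost and the 80% target are invented numbers, and the real relocation is a content decision, not an algorithm like this one.

```python
# Hypothetical model: each solar system runs on its own CPU, and its demand
# (percent of one CPU) is a base load plus a fixed cost per agent.

def cpu_load(system, cost_per_agent):
    """Uncapped CPU demand of one solar system, in percent of a single CPU."""
    return system["base"] + system["agents"] * cost_per_agent

def move_agents(systems, hot, neighbours, cost_per_agent, target=80):
    """Move agents out of `hot`, one at a time, to the coolest neighbour."""
    systems = {name: dict(s) for name, s in systems.items()}  # leave input untouched
    while cpu_load(systems[hot], cost_per_agent) > target and systems[hot]["agents"] > 0:
        coolest = min(neighbours, key=lambda n: cpu_load(systems[n], cost_per_agent))
        systems[hot]["agents"] -= 1
        systems[coolest]["agents"] += 1
    return systems

before = {
    "rens": {"base": 60, "agents": 12},        # demand of 120%: the CPU is maxed out
    "neighbour_a": {"base": 20, "agents": 2},  # mostly idle
    "neighbour_b": {"base": 15, "agents": 1},  # mostly idle
}
after = move_agents(before, "rens", ["neighbour_a", "neighbour_b"], cost_per_agent=5)

for name, system in after.items():
    print(name, system["agents"], cpu_load(system, 5))
```

The total amount of work does not change at all in this model. It just stops piling up on the one CPU that cannot take any more.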

However, this only solves Rens. We still need to fix Turrets, Effects, NPC's and Drones for combat everywhere else in space. Larger engagements of only 40 ships encounter problems today. It's not justifiable that Turrets, Effects and Drones alone are taking tens of percent of a whole CPU, is it? I'd say that's broken.

Since we're talking about examples, let's not forget bookmarks. We've seen a lot of speculation that they can't possibly be a server load. Well, they sure are. Copying a couple of hundred bookmarks takes serious CPU time on SOL and SQL servers, so take this as a hint. If you're doing thousands, you might be killing a couple of poor souls in combat somewhere.

Summing it up

Optimizing is far from simple and there is not a single solution which does not affect anyone to some extent. We're certainly not going to start sharding Tranquility because of bugs or bad code which we can fix, especially not when we have 70 of the 110 CPU's with less than 50% average load.

We're also going for new YARRDWARE. We're investigating how much we add, what type (AMD? INTEL? 64 BIT! Dual Core?) or if we should simply replace all the damn hardware. I doubt we can afford that, but hey, we're CCP, we're crazy (you can read a number of certified psychologists' opinions on that on the forums) and we're spending millions of dollars on hardware and optimizations, but we live for EVE, we love EVE, she only deserves the best. So do you.

_Addendum to a lot of questions in the blog

Why isn't EVE a multithreaded "application"?

First off, EVE is a very very multithreaded application and far more so than most you will encounter. We use Stackless Python to get microthreading abilities within the single process running on each CPU.

To better clarify that, right now, the EVE cluster is running 110 threads on 110 CPU's. Within those threads we have 100's to 1000's of microthreads running various services.
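
For anyone who has not met microthreads before, here is a minimal sketch of the idea using plain Python generators and a round-robin loop. It deliberately does not use the Stackless Python API, and the service names are just labels picked for illustration: the point is only that many small cooperative tasks can take turns inside one OS thread.

```python
from collections import deque

def service(name, steps, log):
    """A tiny 'service': do one unit of work, then hand control back."""
    for i in range(steps):
        log.append((name, i))
        yield  # cooperative switch point

def run(tasks):
    """Round-robin scheduler: one OS thread, many microthreads taking turns."""
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)          # let the task run until its next yield
            queue.append(task)  # not finished yet, so back of the line
        except StopIteration:
            pass                # task is done, drop it

log = []
run([service("chat", 2, log), service("market", 3, log), service("npc", 1, log)])
print(log)
# [('chat', 0), ('market', 0), ('npc', 0), ('chat', 1), ('market', 1), ('market', 2)]
```

Note that nothing here runs in parallel. All of these tasks still share the one CPU their thread lives on, which is exactly why a single overloaded thread is a problem.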

Other MMO's usually go for either a monolithic design (one big supercomputer), which does not scale very well, or a distributed design with a physical assignment of threads to CPU's, where each thread runs a geographical part of the universe. They also do instancing, which they can run independently of the geography, because the instanced location has nothing to do with the geography and thus does not matter to the universe.

The problem for us is the granularity of the load balancing. We have lots of very small grains, like a chat service for a solar system, which is geographically abstracted and can therefore be located in another thread, which physically translates to another CPU. The same goes for lots of bigger grains, such as market, corp services, alliance services etc.

However, we have one very big grain which we cannot split between threads today without running the risk of handover problems, synchronization problems or even worse - players stuck in the middle of space.

Stuck usually happens only on session changes (docking, jumping) which is exactly when you are handed over between CPU's (or forced by a service which requires it). The big grain is the Solar System. Imagine being able to get stuck around a gate in combat. Not a pleasant thought :)

As stated below, Stations can be removed and done as a separate grain which is then geographically abstract, but we still have a number of services depending on them being on the same CPU to decrease the need for handovers in the session changes. Moving between solar systems on different threads requires a handoff and a session change.
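
The grain problem can be sketched in a few lines of Python. This is a toy placement routine, not how the cluster actually assigns services: the greedy rule, the grain names and the cost numbers are all assumptions for illustration. What it shows is that a grain which cannot be split sets a floor on the busiest node, no matter how many nodes you add.

```python
def assign_grains(grains, node_count):
    """Greedy placement: biggest grain first, onto the currently least loaded node."""
    loads = [0] * node_count
    placement = {}
    for name, cost in sorted(grains.items(), key=lambda kv: kv[1], reverse=True):
        node = loads.index(min(loads))  # least loaded node so far
        loads[node] += cost             # a grain is placed whole, never split
        placement[name] = node
    return placement, loads

# Hypothetical costs, in percent of one CPU.
grains = {
    "hot solar system": 120,
    "market": 40,
    "corp services": 30,
    "quiet solar system": 20,
    "chat": 10,
}

placement, loads = assign_grains(grains, node_count=3)
_, loads_with_more_nodes = assign_grains(grains, node_count=10)

print(loads)                       # [120, 50, 50]
print(max(loads_with_more_nodes))  # 120
```

The small grains spread out nicely, but the hot solar system needs more than one whole CPU on its own, so the busiest node stays overloaded with 3 nodes or with 10. The only ways out are making that grain cheaper or making it smaller.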

This is the best I could explain this on short notice. I'm the Producer and as a result am not allowed close to code or technical stuff with a 10-foot pole, but I hope it helps.

Our SQL problems were essentially exterminated recently when we got our RAMSAN-400, but we still have services which can wreak havoc simply by requiring some other resource, like RAM or CPU, for a short period of time. This only affects things in the very short term (spiking), but we want to eliminate those services.

The fault doesn't lie with the SQL software or hardware, it lies in our server software. I'm sure you're all used to people blaming the SQL but usually it's our own stuff which kills it the most :D_