Introducing steps to improve network performance of EVE: THE FCP
CCP has always chosen the more challenging path in how we design games; one world-wide single-shard server proves that. As we've grown and the number of logins has risen, we've gone through several iterations of fighting lag. The War on Lag is never-ending, because EVE Online is always growing. But we have some interesting things to show you this week. This is the second in a series of three blogs that will show you how we fight lag through hardware upgrades, software fixes, and even wrestling the internet itself...
Tl;dr: go down to "What can we do about this"
Running an MMO on the internet is not without perils; running a single MMO world in one location for the whole world is even harder. The internet is in a constant state of flux. Many issues plague the internet: politics, DoS attacks, rate limiting based on content, faults, traffic engineering to cut cost and "BGP," to name a few. The only constant in our configuration is the location of the EVE Online servers.
EVE Online is hosted in a very good spot in London, in one of the datacenters that houses LINX. LINX is one of the largest internet peering points in the world, so all of the major network providers are within 10 km of our location. The location is also smack dab in the middle of where most of the trans-Atlantic cables terminate.
We currently peer with 6 major "Tier 1" providers: Telia Sonera, Global Crossing, AboveNet, KPN ("Euroring"), Level3 and Teleglobe (TATA). We also have a special peering platform through which we reach 250 direct DSL service providers.
The BGP problem
The internet is based on a fairly old and simple protocol called BGP (Border Gateway Protocol). The basic function of BGP is very simple: you build a relationship with a neighbor and announce your IP network to that neighbor. This neighbor then propagates the network(s) it has learned to its neighbors. This mesh of interconnected networks makes up the ever-growing global routing table.
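To make the "announce and propagate" idea concrete, here is a minimal sketch in Python. It is a toy model, not how a real router works: it ignores routing policy entirely, and the AS numbers and the `propagate` function are made up for illustration (the numbers come from the private-use range).

```python
from collections import deque


def propagate(origin_as, neighbors):
    """Flood one announcement from origin_as and record the AS path each AS learns.

    neighbors maps an AS number to the AS numbers it has BGP sessions with.
    Returns {as_number: as_path}, where as_path lists the ASes to pass
    through, nearest first, ending at the origin.
    """
    paths = {origin_as: []}
    queue = deque([origin_as])
    while queue:
        current = queue.popleft()
        for peer in neighbors.get(current, []):
            if peer in paths:
                continue  # this AS already learned a path that is at least as short
            # The announcing AS puts itself at the front of the path it passes on
            paths[peer] = [current] + paths[current]
            queue.append(peer)
    return paths


# Hypothetical topology: 65001 originates the network, 65030 sits behind 65010
NEIGHBORS = {
    65001: [65010, 65020],
    65010: [65001, 65030],
    65020: [65001],
    65030: [65010],
}

for asn, path in sorted(propagate(65001, NEIGHBORS).items()):
    print(asn, "reaches 65001 via", path)
```

Every AS ends up with a list of the ASes between it and the origin, which is all the information BGP gives it about the path.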
The AS path
Each network operator (like CCP) is assigned a unique AS (autonomous system) number. Without going too far into definitions and standards, the AS number ties the IP addresses assigned by the internet gods to the operator, and BGP uses it to communicate between operators. End users don't have AS numbers; they get one or more IP addresses that belong to their ISP's AS number. For a small end-user ISP to reach the whole internet it must connect to a larger ISP. This might be a "Tier 1" service provider (https://en.wikipedia.org/wiki/Tier_1_network) or a mid-tier provider that knows how to get to the whole internet. Your ISP makes a "peering" arrangement with that provider and announces the IP networks that belong to its AS number. The mid-tier service provider then passes all the networks it has learned on to its other connected service providers. That is how the BGP AS path is built: it lists the ISPs (ASes) that traffic has to pass through to reach the end network.
BGP does not pass any information about the remote network between providers; the AS path is built purely on a "point of view" basis. The shortest AS path is taken by default. The AS path does not reflect metrics such as latency or the bandwidth available on the links, so the shortest AS path is sometimes the longest path!
Here is an extreme example. A user located in Germany, which is quite close to Tranquility, should see around 10 ms of round-trip latency but is seeing 100 ms. His ISP has two upstream connections: one to a big German wholesale service provider and, to reach the US, one to a large US-based provider. The large German wholesale provider has a connection to Level3 and so does CCP; as it happens, CCP also has a direct connection to the US-based provider. As BGP sees it, CCP is one hop away (AS path) through the US-based provider but two hops away via the German provider and Level3. The traffic therefore travels to the US and back, whereas the obvious best path would be through the two providers locally in Europe. To fix this, both CCP and the local provider would have to manually tinker with the route decision.
We have many other cases to look at. CCP peers with Tier 1 providers. The basic definition of a Tier 1 provider is that it does not buy connections from anyone; other providers buy from it. Connections to Tier 1s are expensive, and CCP connects only to Tier 1 providers or directly to end-user SPs. Small end-user SPs will often buy a cheap connection to a mid-tier provider for the bulk of their traffic and engineer their routing tables so that general traffic uses the cheap connection. Those connections are often heavily congested. In most cases this is out of CCP's reach, as it happens on the other side of the Tier 1. This is just business as usual on the internet.
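The German example can be written down as a small Python sketch. The provider names are placeholders, the latencies are the rough figures from the example, and the model only looks at AS path length; real routers check other things (such as locally configured preferences) before they get to path length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    next_hop: str     # upstream provider the German ISP hands traffic to
    as_path: tuple    # ASes crossed after leaving the ISP, ending at CCP
    rtt_ms: float     # measured separately; BGP itself never sees this


# Hypothetical view from the German ISP, using the figures from the example
ROUTES = [
    Route("us-provider", ("us-provider", "ccp"), 100.0),
    Route("german-wholesale", ("german-wholesale", "level3", "ccp"), 10.0),
]


def bgp_default_choice(routes):
    """Pick the route with the fewest ASes in its path, as BGP does by default."""
    return min(routes, key=lambda r: len(r.as_path))


def lowest_latency_choice(routes):
    """Pick the route a human would pick: the one with the lowest round trip."""
    return min(routes, key=lambda r: r.rtt_ms)


print("BGP picks:     ", bgp_default_choice(ROUTES).next_hop)
print("Latency picks: ", lowest_latency_choice(ROUTES).next_hop)
```

The two functions disagree on this input, which is exactly the problem: the route with the shorter AS path is the one that crosses the Atlantic twice.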
One more thing about BGP
We can only affect where we transmit our data; we don't control the return path. This is just how the internet/BGP works. In many cases we try to fix "asymmetric" routing by manually forcing the outgoing path to match the incoming path, but sometimes the incoming path is just bad.
What can we do about this?
Up to now, we have tried to respond to the petitions we get about slow speeds and connection problems. This means we go into our Edge routers and manually force them to send traffic out through another provider. The problem is that the internet is dynamic: when the end-user provider gets a new connection, or the original problem is fixed, the static configuration is still in place and chooses a sub-optimal path.
Introducing the FCP - perhaps the greatest thing for EVE players since caffeine based drinks
CCP just bought a system from a company called Internap, which has the slogan "The internet performance company". The product is called the FCP - Flow Control Platform.
The FCP ties in closely with our Edge routers: it monitors all traffic to and from the game cluster, has a BGP peering relationship with the Edge routers, and monitors the pipes to our providers for bandwidth and errors.
So what does it really do?
It sends direct probes to IP addresses it learns by examining the stream of data going to and from Tranquility! This is accomplished with special configuration in both the FCP and our Edge routers. It can then select which interface each probe is sent from, testing each provider. It also uses sniffing interfaces to examine the TCP stream between the EVE Online client and the proxy, and detects when TCP resends are unusually high.
With the probe data from each provider and the TCP quality measured by sniffing all the traffic, the FCP determines whether there is a better path through another provider to reach the user!
It can also help us steer around major internet issues. It will detect when large "black holes" appear on the internet due to faulty routers and re-route accordingly.
The best thing about this is that it is done dynamically; we don't have to wait for a network engineer to log in and analyze the problem.
The FCP has now been in monitor mode for a few weeks and we can clearly see that it can improve greatly on the default BGP routing policy, so we have decided to put it live very soon. When the FCP goes live, it will start making changes to the default BGP routing table and improving response times for our clients!
We are in a continuous cycle of improving network performance to the EVE cluster to the best of our abilities, and we will be adding new connections to the London datacenter soon.
The FCP is not a total fix for an imperfect system. We will still be affected by routing decisions made by network operators far beyond our reach, and we do our best to help with those cases. So please don't stop letting us know about issues with your connection; we look at them case by case. Hopefully the FCP will help out.
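As a rough illustration of the idea (and only that: this is not Internap's algorithm, which we are not describing here), a per-destination decision based on probe results could look like the Python sketch below. The thresholds, provider names and function names are all assumptions, each provider is assumed to have at least one probe result, and the TCP resend measurements are left out to keep it short.

```python
from statistics import median

LOSS_LIMIT = 0.2     # assumed: skip providers that lose more than 20% of probes
MIN_GAIN_MS = 10.0   # assumed: only move traffic for a gain of at least 10 ms


def summarize(probes):
    """Return (median_rtt_ms, loss_fraction) for a non-empty list of probe results.

    Each entry is a round-trip time in milliseconds, or None for a lost probe.
    """
    replies = [rtt for rtt in probes if rtt is not None]
    loss = 1 - len(replies) / len(probes)
    rtt = median(replies) if replies else float("inf")
    return rtt, loss


def pick_provider(current, probes_by_provider):
    """Return the provider to use for one destination, given recent probes."""
    stats = {name: summarize(p) for name, p in probes_by_provider.items()}
    usable = {name: rtt for name, (rtt, loss) in stats.items() if loss <= LOSS_LIMIT}
    if not usable:
        return current  # nothing trustworthy measured; leave the BGP choice alone
    best = min(usable, key=usable.get)
    current_rtt = usable.get(current, float("inf"))
    # Require a clear gain so traffic does not flap between similar paths
    if best != current and current_rtt - usable[best] >= MIN_GAIN_MS:
        return best
    return current


# Hypothetical probe results towards one player's network
PROBES = {
    "provider-a": [100.0, 102.0, 98.0, 101.0],
    "provider-b": [12.0, 11.0, 13.0, 12.0],
    "provider-c": [5.0, None, None, None],  # fast when it answers, but mostly lost
}

print(pick_provider("provider-a", PROBES))
```

The minimum-gain threshold is a design choice in this sketch: without it, two providers with nearly identical latency would trade places every time a probe came back a millisecond faster.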
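One simple way to think about black hole detection, sketched in Python under our own assumptions (the window size, names and data layout are invented for illustration, not taken from the FCP): a provider looks black-holed for a destination when its last few probes all went unanswered while another provider still gets replies.

```python
def blackholed_providers(recent_probes, window=5):
    """Return providers whose last `window` probes to a destination all got no reply.

    recent_probes maps a provider name to its probe results, oldest first;
    each result is a round-trip time in milliseconds, or None for no reply.
    If no provider gets any reply, the destination itself is probably down,
    so nothing is flagged.
    """
    dead, alive = set(), set()
    for name, probes in recent_probes.items():
        tail = probes[-window:]
        if len(tail) == window and all(r is None for r in tail):
            dead.add(name)
        elif any(r is not None for r in tail):
            alive.add(name)
    return dead if alive else set()


# Hypothetical history: provider-a stopped answering, provider-b is fine
HISTORY = {
    "provider-a": [20.0, 21.0, None, None, None, None, None],
    "provider-b": [35.0, 34.0, 36.0, 35.0, 35.0],
}

print(blackholed_providers(HISTORY))
```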
CCP Mort / CCP Network operations
Look for the next blog in this series later this week, and be sure to check the stream from Fanfest for even more detail.