The API Dev Blog Trilogy - Volume One
Over the last month or so, CCP PrismX, the operations team and I have been working on improving the overall performance of the API. We still have more work to do to streamline this much-used service and increase its responsiveness. I'll say more about that later, but first I want to explain the fixes we have implemented recently.
Improving the API experience for everyone
As outlined in this post by CCP Zirnitra, we've been working with some of our high-volume API users to make more efficient use of what is on offer. We've seen some examples of individual sites making large numbers of bad calls to the server, taking considerable time away from requests that would actually return useful data. It is important to make sure that your software does not make repeated requests with invalid API keys. It is also very helpful to respect the caching timers, as this makes the API more usable for everyone.
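The two behaviors above can be sketched in a small client wrapper. This is a minimal illustration, not the actual API interface: the `fetch` callback, the endpoint names, and the idea that the server reports a cache-expiry time alongside each response are all assumptions made for the sake of the example.

```python
import time


class InvalidKeyError(Exception):
    """Raised when the server rejects an API key."""


class CachedApiClient:
    """Sketch of a well-behaved API client: it serves repeat requests from
    a local cache until the server-supplied expiry time, and it permanently
    stops retrying any key the server has rejected."""

    def __init__(self, fetch):
        # fetch(endpoint, key) -> (data, cached_until) or raises InvalidKeyError
        self.fetch = fetch
        self.cache = {}        # (endpoint, key) -> (data, cached_until)
        self.bad_keys = set()  # keys the server has already rejected

    def get(self, endpoint, key, now=None):
        now = time.time() if now is None else now
        if key in self.bad_keys:
            # Never re-send a request with a known-invalid key.
            raise InvalidKeyError("key previously rejected; not retrying")
        entry = self.cache.get((endpoint, key))
        if entry is not None and now < entry[1]:
            return entry[0]    # still fresh: no server hit at all
        try:
            data, cached_until = self.fetch(endpoint, key)
        except InvalidKeyError:
            self.bad_keys.add(key)  # remember the bad key
            raise
        self.cache[(endpoint, key)] = (data, cached_until)
        return data
```

With a wrapper like this, a buggy page that polls in a tight loop still generates at most one server request per cache window, and a mistyped key costs the server exactly one rejection instead of thousands.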
As announced in the aforementioned post, we have discussed blacklisting applications that consistently make bad requests. While this is not a step we want to take, it may become necessary to preserve the quality of service for everyone else using the API. Of course, we would much rather work with developers to fix any problems. To achieve this, we will be making every effort to contact those who maintain high-load applications. We ask the API developers amongst you to make it easy for our GMs to get in touch with you, so they can communicate any issues well in advance of our having to take action.
Over recent months we have seen some large spikes in response time from the API. We resolved this issue two weeks ago, after isolating the cause: another application running on one of our six API servers. We removed the offending application and immediately saw average response times drop dramatically.
We also found a database locking issue caused by an internal server-monitoring tool. Locks were being placed on the data needed to serve API pages, resulting in delays of up to 14 seconds. We solved the issue with the help of the server-monitoring tool's vendor, and we have seen vastly improved performance as a result.
We've also moved the API onto a brand spanking new set of disk arrays. This will lower overall response times in addition to reducing the frequency of response-time spikes. The database has been moved from shared disks onto its own dedicated eight-disk RAID 10 array of 15,000 RPM Fibre Channel disks.
That's all great. What's next?
With all this work complete, we are ready to take the next step in optimizing the API backend.
The logging we use is based on log4net, and it allows players to see the requests that are made against their API keys. The API receives 17-18 million requests every day, which translates into 2-3 GB of data being inserted into a single table per day. This is a considerable amount of data, so we have to run a daily purge job that deletes any data over 7 days old. We've found that this purge has been causing spikes in database response times due to insufficient indexing. To solve this, we have to create a new table which has the necessary indexes. This will allow us to clean the table without a considerable performance hit. However, given the massive amount of data already in the old table, it would take us 3-4 hours to move it into the new table.
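To make the indexing point concrete, here is a small sketch of the rolling purge using SQLite. The real system is not SQLite and the table and column names are invented for illustration; the point is only that with an index on the request timestamp, the daily delete becomes an index range scan over the expired rows rather than a scan of the entire multi-gigabyte table.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE request_log (
        id        INTEGER PRIMARY KEY,
        api_key   TEXT NOT NULL,
        endpoint  TEXT NOT NULL,
        requested REAL NOT NULL   -- unix timestamp of the request
    )
""")
# The index that makes the purge cheap: the DELETE below can seek directly
# to rows older than the cutoff instead of examining every row.
conn.execute("CREATE INDEX idx_request_log_requested ON request_log (requested)")

now = time.time()
day = 24 * 3600
# Ten sample rows, one per day going back in time.
rows = [("key%d" % i, "/char/sheet", now - i * day) for i in range(10)]
conn.executemany(
    "INSERT INTO request_log (api_key, endpoint, requested) VALUES (?, ?, ?)",
    rows,
)

# Daily purge job: drop everything older than 7 days.
conn.execute("DELETE FROM request_log WHERE requested < ?", (now - 7 * day,))
conn.commit()

remaining = conn.execute("SELECT COUNT(*) FROM request_log").fetchone()[0]
```

Without the index, the same DELETE has to read every row to test the `WHERE` clause, which is exactly the kind of full-table work that shows up as response-time spikes while the purge runs.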
To limit the length of the downtime needed when we do this, we believe that it will be better to simply not carry over the old data. This means that all request logs that players have access to will effectively be emptied once we take this step. We realize that this is a significant change but we think the benefits it brings will be worth it.
On Monday, September 27th we will delete all existing logs. Then we will immediately resume logging, and from that point on we will keep only the previous 7 days of logs.
So: what, when, and where?
On Monday, September 27th, during normal downtime, we will take down the API in order to move to the new logging table. All old request logs will be permanently deleted. If you need these logs, please make a copy of the latest data before downtime on Monday the 27th.
We've got much more in store for you, both in terms of performance improvements and general API work. PrismX is working on the next release of the API, and he will provide specifics in an upcoming dev blog. So for now I'm going to leave you panting in anticipation.