The DDoS attack on Tuesday
Well, it was a distributed denial of service (DDoS) attack, but perhaps not of the type we would have expected. That is to say, it was not a malicious griefer with a ping script. The week before last we had been troubled by a high stuck rate and numerous user crashes to desktop. Most of them were traced to DB performance issues or code bugs. During downtime on Tuesday we made some alterations to the DB aimed at improving performance. We were therefore very surprised to see the problem only getting worse. After a great deal of debugging, mainly focused on the EVE server code and the DB, we realised that the problem was external rather than anything to do with the cluster. There was a high degree of packet loss into the server cluster and periodic losses of connectivity.
Now, the EVE client/server network protocols are fairly robust, so moderate packet loss will usually only result in lag. However, the packet loss had another, unforeseen side effect. For those of you familiar with routing protocols: the gateway router in front of the cluster uses BGP (Border Gateway Protocol) to advertise the networks behind it to the outside world. The packet loss was causing the BGP session to drop every 5-20 minutes, resulting in connection loss to the cluster and a subsequent disconnect of all EVE users. This situation is also known as route flapping.
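To give a feel for why packet loss can make a BGP session drop every few minutes, here is a rough Python sketch. It is a toy model, not a description of our router: it assumes a keepalive every 60 seconds and a hold time of 180 seconds (assumed values, not taken from the actual configuration), treats every keepalive as lost independently with the same probability, and ignores the fact that BGP runs over TCP, which retransmits lost segments. The function name is made up for illustration.

```python
def expected_seconds_until_drop(loss_prob, keepalive_s=60, hold_s=180):
    """Expected time until the hold timer expires in a toy loss model.

    Assumptions (hypothetical, for illustration only):
    - one keepalive is sent every `keepalive_s` seconds
    - each keepalive is lost independently with probability `loss_prob`
    - TCP retransmission is ignored
    - the session drops once `hold_s` seconds pass with no keepalive seen
    """
    if not 0 < loss_prob < 1:
        raise ValueError("loss_prob must be strictly between 0 and 1")

    # Number of consecutive lost keepalives that lets the hold timer expire.
    needed = hold_s // keepalive_s

    # Expected number of keepalives sent until the first run of `needed`
    # consecutive losses (standard result for runs of independent trials).
    expected_keepalives = (loss_prob ** -needed - 1) / (1 - loss_prob)

    return expected_keepalives * keepalive_s


if __name__ == "__main__":
    for p in (0.3, 0.5, 0.7):
        minutes = expected_seconds_until_drop(p) / 60
        print(f"loss {p:.0%}: session expected to drop after ~{minutes:.1f} min")
```

Under these assumptions a 50% loss rate gives an expected 14 minutes between drops. That number is only an illustration of the mechanism, not a measurement of what happened on the cluster.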
After some additional debugging we discovered that the cause of the packet loss was a multitude of small IP packets being sent to ports 135, 80 and several others. Apparently this was the work of the virus strains running rampant this past week, sending discovery packets from all over into our network. The inbound packet rate (in excess of 20k packets per second) was too high for the router to cope with, hence the dropped packets. To remedy the situation we installed a filter for incoming packets on the inbound interface of the router, which resolved the problem.
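The idea behind the inbound filter can be sketched in a few lines of Python. This is not the actual router configuration, which lives in the router's access lists: the deny list below only contains the two ports named above (135 and 80), the real rule set is not reproduced here, and the `Packet` type, function names and port 12345 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Minimal stand-in for an inbound IP packet (hypothetical type)."""
    protocol: str   # e.g. "tcp" or "udp"
    dst_port: int


# Ports named in the post. The several other ports that were also hit
# are not listed there, so they are not guessed at here.
BLOCKED_DST_PORTS = frozenset({135, 80})


def permit(packet: Packet) -> bool:
    """Return True if the packet may pass the inbound filter."""
    return packet.dst_port not in BLOCKED_DST_PORTS


def filter_inbound(packets):
    """Drop every packet aimed at a blocked port and keep the rest in order."""
    return [p for p in packets if permit(p)]


if __name__ == "__main__":
    inbound = [
        Packet("tcp", 135),    # worm discovery traffic
        Packet("tcp", 80),     # worm discovery traffic
        Packet("tcp", 12345),  # hypothetical port standing in for game traffic
    ]
    for p in filter_inbound(inbound):
        print("passed:", p)
```

A deny list is used here only because it maps directly onto the ports mentioned. An allow list that passes only the traffic the cluster actually serves would be the stricter design.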
The reason there was no filter there in the first place is that before launch we were worried about CPU load on the router as it matched inbound traffic against the access lists. That concern no longer applies, as we were initially expecting the traffic volume per user to be much higher than it currently is (around 16 kbit per sec versus about 3 kbit per sec per client, I think). We are taking steps to prevent this sort of thing from repeating, the most obvious of which is to keep the filter in place. Other things under consideration are an upgrade to the router in question and an additional redundant connection into the cluster. This would be on top of an already fully redundant EVE cluster. Well, hope you found this interesting. Please comment away; I do read your replies and appreciate them ... well, the polite ones anyway.