Friday, January 7, 2011

When the radius server came to a grinding halt - and how we brought it back

One of my customers have a few hundred pieces of hardware spread around the country that are authenticated and get their IP addresses from a radius server (http://freeradius.org/) we run. In between the few hundred pieces of hardware and the radius server there is another server that among other things handles the lookups against the radius server  and then gives the clients their connection information.

This have been running for four years with no problems what so ever, handeling a handful or so authentications a minute.

Yesterday the intermediate server suddenly booted (we learned this later), and that led to a need reinitialize the network connections for all the little pieces of hardware (thus authenticating them and looking up their IPs) - all at once. The load on the radius server went through the roof, and we had large amounts of timeouts on the lookups and the radius server came to a grinding halt. Most of the little pieces of hardware didn't get their IP's and the customer's monitoring system went all red and no-one was enjoying them selves much.
A trivial radius server should be able to handle at least a few thousand requests a second, so we were a bit boggled by this. A restart of the service didn't help.

We saw log lines like these in large amounts

Thu Jan  6 15:30:26 2011 : Error: Discarding duplicate request from client FOOFOO3:49910 - ID: 32 due to unfinished request 895
Thu Jan  6 15:30:26 2011 : Error: WARNING: Unresponsive child (id 1314167728) for request 890

The server was also exhausted CPU-wise - and mostly spent it's time in system CPU.
This led me to suspect that the problem was related to some kind of busy waiting on some kind of resource. A quick check in the trusty old /proc filesystem showed that the radius process had a couple of hundred file descriptors pointing to the same file - /var/log/radius/radutmp - an 89MB file with almost 1M short lines.

So what is this radutmp file for? It is updated whenever a client logs in or out, and is used by the command  radwho to see who is logged in at the moment. On this system we have no need for that so I removed the relevant sections from the config and restarted the service - and service was immediately restored.

According to the default and heavily commented config file the radutmp file is not a log file, and does not need to be rotated. But ours have obviously been growing over the last years, so some kind of housekeeping should probably have been done. Out of curiosity I tried starting with an empty radutmp and the original config - and it seemed that the size of the radutmp file affects how hard it is for the server to do it's work  - which makes sense. (but this testing was not done in any way scietifically, and might just be misleading).

I think the main lesson to learn from this is that /proc is your friend - always. Secondary, remove parts of services you don't need - problems might just as well show up there.