Just when you think you know some code.


I have recently been looking into the glibc resolver code.
It started out like any other troubleshooting effort: just trying to get a good foothold, identify where things could go wrong, and figure out how to ensure they went right. Once I got into the code… it was a real “took the red pill” sort of moment.

I often deal with how the resolver is configured, but I had never needed to consider where it lived. As it turns out, I had been imagining a sort of PFM magic bubble whenever I thought about name resolution on *nix based operating systems. I generally understood that name resolution was not handled by the kernel, but I also never imagined that resolution occurred purely in user space. I’m not sure what I imagined wedged between user space and the kernel; I think I envisioned a sort of stateful shared resolver living under the mystic veil of glibc. As you might guess, that is not the case. Every process for itself.

Sort of…

The nscd process helps out, using its own user-land resolver to provide resolution services over a local unix domain socket. Every other process’s resolver can then forego doing the work itself and just pass requests off to nscd, which may have already done the lookup within the result’s TTL, eliminating the need to resolve the same name for multiple processes. Shazam! A stateful shared resolver! Except I knew nscd was a completely optional service and still imagined a non-caching single resolver living somewhere.
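For context, none of this is visible to the application. A lookup is just a call to getaddrinfo(3), and whether the answer comes from nscd over its local socket (conventionally /var/run/nscd/socket) or from an in-process resolver is entirely glibc’s decision. A minimal sketch, with example.com standing in for whatever you actually want to resolve:

```c
/* Minimal lookup via getaddrinfo(3). Whether glibc consults nscd or
 * resolves in-process is invisible at this level. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
    const char *name = argc > 1 ? argv[1] : "example.com";
    struct addrinfo hints, *res, *rp;
    char addr[INET6_ADDRSTRLEN];

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;      /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;

    int rc = getaddrinfo(name, NULL, &hints, &res);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo(%s): %s\n", name, gai_strerror(rc));
        return EXIT_FAILURE;
    }

    for (rp = res; rp != NULL; rp = rp->ai_next) {
        void *sin_addr = rp->ai_family == AF_INET
            ? (void *)&((struct sockaddr_in *)rp->ai_addr)->sin_addr
            : (void *)&((struct sockaddr_in6 *)rp->ai_addr)->sin6_addr;
        printf("%s\n", inet_ntop(rp->ai_family, sin_addr, addr, sizeof(addr)));
    }

    freeaddrinfo(res);
    return EXIT_SUCCESS;
}
```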

It is indeed every process for itself. The gethostbyname and getaddrinfo functions (along with some group and user related resolvers) create an instance of the resolver entirely within the process. res_init(), or more accurately one of its internal variants (__res_maybe_init() is maybe my favorite), is called, initializing the resolver. The initialization involves reading /etc/resolv.conf to load the search suffixes, nameservers, and other configs. This could very well be the last time that information is ever read by the process. This is the source of the trouble I was trying to… shoot.
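Here’s a small sketch of that behavior, assuming a glibc like the ones described here (newer releases may behave differently). The hostname “apphost” is made up for illustration; it’s a short name that only resolves once the new search suffix is in place. Editing /etc/resolv.conf while this loop runs shows whether your glibc notices; uncommenting the res_init() call (which may need -lresolv when linking) forces a re-read.

```c
/* Sketch: the first lookup initializes this process's resolver from
 * /etc/resolv.conf; on the glibc versions discussed here the file is
 * not re-read for the life of the process unless res_init() is called
 * again. "apphost" is a hypothetical short name that depends on the
 * search suffix list. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>
#include <resolv.h>

int main(void)
{
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof(hints));
    hints.ai_socktype = SOCK_STREAM;

    for (;;) {
        int rc = getaddrinfo("apphost", NULL, &hints, &res);
        if (rc == 0) {
            puts("resolved");
            freeaddrinfo(res);
        } else {
            printf("lookup failed: %s\n", gai_strerror(rc));
        }

        /* Uncomment to force the resolver to re-read /etc/resolv.conf
         * before the next lookup. */
        /* res_init(); */

        sleep(5);
    }
}
```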

… wobbly transition to flashback …
Changes needed to be made to resolv.conf to add a search suffix. After the change, nscd was restarted and the server seemed completely functional. Command line tests worked. Our PHP code running under Apache httpd was now able to resolve the hosts with the new suffix. All was well.

Are you sure I can’t just take the blue pill?

Days later we started seeing periodic “unknown host” errors from PHP applications.

Why would it have worked initially, but start failing a couple days later?

What we found.

First, just a little more background on the configuration of our web servers. We used the prefork MPM in Apache, which means the main Apache process forks a set number of child processes that handle web requests. We did this because our web servers were shared between clients and teams, and we wanted the web processes isolated from each other in case an application was deployed that did something like calling ‘exit()’. Pre-forking guaranteed that a misbehaving application like that could not impact the main server process.
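To make the process layout concrete, here is a highly simplified, hypothetical sketch of the pre-fork pattern in C (not Apache’s actual code). The point is just that every worker is a separate process, so each one ends up with its own private copy of glibc resolver state once it performs its first lookup.

```c
/* Simplified pre-fork sketch: a parent forks N long-lived workers.
 * Each worker is its own process with its own resolver state. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define NUM_WORKERS 4   /* the real servers ran ~200 */

static void worker_loop(int id)
{
    /* A real server would accept() and handle requests here; the first
     * request needing DNS initializes this process's resolver from
     * /etc/resolv.conf. */
    printf("worker %d (pid %d) ready\n", id, getpid());
    pause();            /* handle requests "forever" until killed */
}

int main(void)
{
    for (int i = 0; i < NUM_WORKERS; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            exit(EXIT_FAILURE);
        }
        if (pid == 0) {          /* child */
            worker_loop(i);
            _exit(0);
        }
    }

    /* Parent supervises: reap workers if they ever exit. */
    while (wait(NULL) > 0)
        ;
    return 0;
}
```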

So our webserver starts and creates something like 200 child processes. Each process loads /etc/resolv.conf at the point it does its first DNS resolution, so it picks up whatever set of domain suffixes exists in the file at that moment and uses that configuration until the process exits. It was probably at least a week between the start of the webserver and the time a new application was deployed that required an update to /etc/resolv.conf, so by then all child processes had loaded the original resolv.conf. We updated the domain suffixes, then restarted nscd so that it picked up the new configuration. Since all processes use nscd, we assumed all was well after a few tests of the new configuration.

So how is it that we started seeing the intermittent failures? Why were most requests being resolved using the new configuration, while this small set of failures appeared to be using the old one?

Well, here is the catch. There is a performance protection built into the resolver code in glibc. If a process fails to communicate with nscd over the unix domain socket, it falls back to resolving DNS names itself for subsequent requests, to avoid repeatedly failing to reach nscd. Eventually it will attempt to use nscd again, but what it waits for is not a period of time; it is a non-configurable number of requests. Regardless of when the failure occurred, it takes 100 DNS queries before the process will attempt to use nscd again.
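The logic, as I understood it from reading the source, looks roughly like the sketch below. The helper functions and names here are stand-ins, not glibc’s actual API (the real code keeps a per-database counter and a retry constant of 100; check your version’s nscd client code for the exact details):

```c
/* Paraphrased sketch of glibc's nscd fallback behavior; names and
 * helpers are hypothetical approximations, not the real interfaces. */
#include <stdio.h>

#define NSS_NSCD_RETRY 100   /* retry nscd roughly every 100 lookups */

/* Hypothetical stand-ins; here they only illustrate control flow. */
static int query_nscd(const char *name)         { (void)name; return -1; /* pretend nscd is down */ }
static int resolve_in_process(const char *name) { (void)name; return 0;  /* pretend success */ }

/* 0 = try nscd; >0 = lookups performed since nscd last failed */
static int not_use_nscd_hosts;

static int lookup_host(const char *name)
{
    /* After roughly NSS_NSCD_RETRY lookups, give nscd another chance. */
    if (not_use_nscd_hosts > 0 && ++not_use_nscd_hosts > NSS_NSCD_RETRY)
        not_use_nscd_hosts = 0;

    if (not_use_nscd_hosts == 0) {
        int rc = query_nscd(name);
        if (rc >= 0)
            return rc;                /* nscd answered */
        not_use_nscd_hosts = 1;       /* nscd unreachable: skip it for a while */
    }

    /* Fall back to resolving inside this process, using whatever it
     * read from /etc/resolv.conf when its resolver was initialized. */
    return resolve_in_process(name);
}

int main(void)
{
    for (int i = 0; i < 205; i++)
        lookup_host("apphost");
    printf("after 205 lookups, counter = %d\n", not_use_nscd_hosts);
    return 0;
}
```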

When we updated /etc/resolv.conf and restarted nscd, this was anticipated to be a non-impacting change, so there was still traffic hitting the webservers. During the window that nscd was down for the restart, any child process that received a request and attempted a DNS query would have failed over to using the old, already-loaded resolv.conf. Since the child processes get used in a round-robin fashion, and we have 200 or more of them, the soonest those processes would attempt to use nscd again would be roughly 200 × 100 = 20,000 requests later, and that is only if every web request performs DNS queries. Many requests serve static content or at least don’t perform DNS queries. There are multiple webservers. There is also a configuration on our load balancer that performs HTTP keep-alive on the server-side connection, so multiple transactions reuse the same connection. All of these conditions led to it being a few days before requests started hitting the child processes that were active during the nscd restart.

Once we started seeing the intermittent failures, they continued in small numbers as the affected child processes lined up differently across the servers. Our HTTP logs show the PID of the child process that handled each request, and after seeing that it appeared to be the same set of children experiencing the errors, we had somewhere to start.

Using strace to monitor the system calls performed by a working child process and a failing one, I was able to see the difference in the DNS resolution: the working child process made a connection to the unix domain socket associated with nscd, whereas the failing process did not. It was at this point that I pulled the glibc source for DNS resolution and found the logic that falls back to performing resolutions local to the process. This allowed us to rule out a larger issue with our infrastructure, so we moved forward with reboots of the webservers to make sure all processes were completely updated.

It appears as though newer systems using systemd likely don’t suffer from this same issue, because those systems use a systemd-resolved process to provide a full caching DNS service on the local system rather than making a fallback decision in the glibc resolver code. If systemd-resolved is down, the system is probably non-functional, so it has its own problems, but the nscd problem has effectively been deprecated along with nscd.

In the meantime, here are a couple of links to other interesting DNS resolution information.

RFC1535 – AKA “That’s it, I’m using http://gigapogo.com./ from now on.”

So far I’ve only run into the nss_resinit module to address the fact that changes to resolv.conf aren’t automagically loaded into the resolver.