Just when you think you know some code.

Once again, it has been ages since I have touched this site. And once again I promise to be more active. … what’s up? Oh, yeah! BLOG!

I have recently been looking into the glibc resolver code.
It started out like any other troubleshooting effort, just trying to get a good foothold and identify where things could go wrong and how to ensure they went right. Once I got in the code… it was a real “took the red pill” sort of moment.

I often deal with how the resolver is configured, but had never needed to consider where it lived. As it turns out, I have been imagining a sort of PFM magic bubble whenever I thought about name resolution on *nix based operating systems. I generally understood that name resolution was not handled by the kernel, but I also never imagined that resolution occurred purely in user space. I’m not sure what I imagined wedged between user space and the kernel. I think I envisioned a sort of stateful shared resolver living under the mystic veil of glibc. As you might guess, that is not the case. Every process for itself.

Sort of…

The nscd process helps out, using its own user-land resolver to provide resolution services over a local unix domain socket. Every other process’s resolver can then forgo doing the work itself and just pass requests off to nscd, which may have already done the lookup within the result’s TTL, eliminating the need to resolve the same name for multiple processes. Shazam! A stateful shared resolver! Except I knew nscd was a completely optional service and still imagined a non-caching single resolver living somewhere.

It is, indeed, every process for itself. The gethostbyname and getaddrinfo functions (along with some group and user related resolvers) create an instance of the resolver entirely within the process. res_init(), or more accurately one of its internal calls (__res_maybe_init() is maybe my favorite), is called to initialize the resolver. The initialization involves reading /etc/resolv.conf to load the search suffixes, nameservers, and other configs. This could very well be the last time that information is ever read by the process. This is the source of the trouble I was trying to… shoot.
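To make "reading /etc/resolv.conf" a little more concrete, here is a minimal sketch of the kind of information that initialization pulls out of the file. This is not glibc's parser; it is a simplified Python stand-in, and the function name, the SAMPLE contents, and the addresses in it are made up for illustration. The only behaviors it tries to mirror are the documented ones: at most three nameserver entries are used, and the last search (or domain) line wins.

```python
# Simplified sketch of reading a resolv.conf-style file; NOT glibc's actual parser.
MAXNS = 3  # documented limit on the number of nameserver entries that are used


def parse_resolv_conf(text):
    """Return the nameservers, search list, and options found in `text`."""
    config = {"nameservers": [], "search": [], "options": []}
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines and lines starting with '#' or ';' are ignored.
        if not line or line[0] in "#;":
            continue
        keyword, *values = line.split()
        if keyword == "nameserver" and values:
            # Entries beyond the first MAXNS are silently dropped.
            if len(config["nameservers"]) < MAXNS:
                config["nameservers"].append(values[0])
        elif keyword in ("search", "domain") and values:
            # The last search/domain line replaces any earlier one.
            config["search"] = values if keyword == "search" else values[:1]
        elif keyword == "options":
            config["options"].extend(values)
    return config


# Hypothetical file contents, using documentation-only names and addresses.
SAMPLE = """\
# hypothetical example contents
search example.com corp.example.com
nameserver 192.0.2.53
nameserver 192.0.2.54
options timeout:2 attempts:3
"""

if __name__ == "__main__":
    print(parse_resolv_conf(SAMPLE))
```

The point of the sketch is that the result is just a plain in-memory structure owned by the process. Once it has been built, nothing in the process needs to look at the file again.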

… wobbly transition to flashback …
Changes needed to be made to resolv.conf to add a search suffix. After the change, nscd was restarted and the server seemed completely functional. Command line tests worked. Our PHP code running under Apache httpd was now able to resolve the hosts with the new suffix. All was well.

Are you sure I can’t just take the blue pill?

Days later we started seeing periodic “unknown host” errors from PHP applications.

Why would it have worked initially, but start failing a couple days later?

I will update this post in a couple days with what we found.

In the meantime, here are a couple of links to other interesting DNS resolution information:

RFC 1535 – AKA “That’s it, I’m using http://gigapogo.com./ from now on.”

So far, the only thing I’ve run into that addresses the fact that changes to resolv.conf aren’t automagically loaded into the resolver is the nss_resinit module.
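For what it’s worth, the general idea of "notice that the file changed, then re-initialize" can be sketched in a few lines. To be clear, this is not the nss_resinit code, and I am not claiming it works this way; it is a hypothetical Python illustration in which the ConfigWatcher class and the temporary file standing in for resolv.conf are my own inventions.

```python
import os
import tempfile


class ConfigWatcher:
    """Remember a file's stat stamp and report when it changes (hypothetical helper)."""

    def __init__(self, path):
        self.path = path
        self.stamp = self._stamp()

    def _stamp(self):
        st = os.stat(self.path)
        # mtime plus size and inode, so an atomic replace is noticed too.
        return (st.st_mtime_ns, st.st_size, st.st_ino)

    def changed(self):
        """Return True once per change; the caller would then re-initialize."""
        current = self._stamp()
        if current != self.stamp:
            self.stamp = current
            return True
        return False


if __name__ == "__main__":
    # A temporary file stands in for /etc/resolv.conf in this demo.
    with tempfile.NamedTemporaryFile("w", delete=False) as handle:
        handle.write("search example.com\n")
    watcher = ConfigWatcher(handle.name)
    print(watcher.changed())  # nothing has been modified yet
    with open(handle.name, "a") as handle2:
        handle2.write("search example.com corp.example.com\n")
    print(watcher.changed())  # the file grew, so the stamp differs
    os.unlink(handle.name)
```

A long-lived process would run a check like this before a lookup and, when it reports a change, re-run the resolver initialization (in C, that would mean calling res_init() again) so that the new search suffixes and nameservers are picked up.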
