I've got a notebook and start an OpenVPN daemon at boot time, but lacking a DNS server it cannot connect. When I get some kind of network connection by plugging in an ethernet cable or coming near some WLAN access point, my network is reconfigured and all newly started processes work well enough. But processes started before resolv.conf was changed have to be restarted by root, which is really annoying, especially if they maintain some kind of state while running. This was just one example; I believe there are many more programs around with this kind of problem, e.g. nscd. So I make this a feature request. I'm surprised I could not find this around already.

Possible solutions:

* reread resolv.conf every time some name is resolved
* check the modification time of resolv.conf every time a name is resolved
* reread resolv.conf if a nameserver does not respond
* reread resolv.conf if the cached DNS server IP is more than t seconds old
* include some explicit "reread now" command for such daemons
* ... to be combined and continued
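As a rough illustration of the second idea (checking the modification time), an application can already approximate this on its own today: stat /etc/resolv.conf before each lookup and call res_init() when the mtime changes. This is only a sketch, not an existing or proposed libc interface; the wrapper name is made up and locking is omitted.

/* Sketch only: application-side mtime check of /etc/resolv.conf.
   Not part of glibc; the wrapper name is illustrative. */
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/socket.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/nameser.h>
#include <resolv.h>

static time_t last_mtime;

static void maybe_reload_resolv_conf(void)
{
    struct stat st;

    if (stat("/etc/resolv.conf", &st) == 0 && st.st_mtime != last_mtime) {
        last_mtime = st.st_mtime;
        res_init();          /* re-read the name server list */
    }
}

/* Hypothetical wrapper an application could use instead of calling
   getaddrinfo() directly. */
int resolving_getaddrinfo(const char *node, const char *service,
                          const struct addrinfo *hints,
                          struct addrinfo **res)
{
    maybe_reload_resolv_conf();
    return getaddrinfo(node, service, hints, res);
}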
There is a solution, already implemented: use nscd, and run "nscd -i hosts" in the script that rewrites your resolv.conf (or nsswitch.conf, etc.).
I have run into the same problem. Is there any way to make gethostbyname() reread /etc/resolv.conf?
Use nscd. There will be no support for this in the libc routines themselves.
*** Bug 3675 has been marked as a duplicate of this bug. ***
nscd does not resolve the resolver issue.

I have verified with sendmail. Sendmail caches the resolver information and will not accept updates, even with nscd running. I have verified it will still attempt to communicate with the old name servers after updating the resolv.conf.

Another issue with the nscd solution is that it interferes with FreeIPA/IdM, as they require sssd and all referenced documentation states not to have nscd installed/running with sssd.
(In reply to Karl from comment #5)

> nscd does not resolve the resolver issue.
>
> I have verified with sendmail. Sendmail caches the resolver information and
> will not accept updates, even with nscd running. I have verified it will
> still attempt to communicate with the old name servers after updating the
> resolv.conf.

Indeed, this is a very good point. Thanks.
(In reply to Karl from comment #5)

> nscd does not resolve the resolver issue.
>
> I have verified with sendmail. Sendmail caches the resolver information and
> will not accept updates, even with nscd running. I have verified it will
> still attempt to communicate with the old name servers after updating the
> resolv.conf.

How is that a problem in glibc if sendmail caches the resolver? Or are you saying something else? That libc.so.6 caches the resolvers and fails to call out to nscd? That would be a real bug, and we'd like to see some kind of reproducer for that if possible, so we can fix the issue.

> Another issue with the nscd solution is that it interferes with FreeIPA/IdM
> as they require sssd and all referenced documentation states not to have
> nscd installed/running with sssd.

This documentation is wrong. I have worked closely with the SSSD team at Red Hat, and I can get more concrete evidence to prove this if we need it. You can run SSSD with NSCD without any problem. Can you provide a reference to the documentation so I can talk to the SSSD team about this?
(In reply to Carlos O'Donell from comment #7) > How is that a problem in glibc if sendmail caches the resolver? Or are you > saying something else? That libc.so.6 caches the resolvers and fails to call > out to nscd? That would be a real bug, and we'd like to see some kind of > reproducer for that if possible, so we can fix the issue. sendmail needs to do MX lookups, so it uses res_query (and res_search, depending on the context) and not one of the getaddrinfo/get*by* functions. It's also a forking daemon. It initializes the glibc resolver before forking, and I assume all the child processes inherit the cached list of name servers.
(In reply to Florian Weimer from comment #8)
> (In reply to Carlos O'Donell from comment #7)
>
> > How is that a problem in glibc if sendmail caches the resolver? Or are you
> > saying something else? That libc.so.6 caches the resolvers and fails to call
> > out to nscd? That would be a real bug, and we'd like to see some kind of
> > reproducer for that if possible, so we can fix the issue.
>
> sendmail needs to do MX lookups, so it uses res_query (and res_search,
> depending on the context) and not one of the getaddrinfo/get*by* functions.
> It's also a forking daemon. It initializes the glibc resolver before
> forking, and I assume all the child processes inherit the cached list of
> name servers.

Correct, the res_* functions are designed specifically to talk directly to Internet domain name servers, and as such bypass nscd and sssd.

You are also correct that once you initialize the resolver state, the state is static; this is all well known. All children and threads will have the same resolver state if created after initialization. It is also well known that calling res_init() again will cause any underlying configuration files to be reloaded (the atomic increment of __res_initstamp does this).

Therefore the bug is entirely in sendmail. If you use this API you must have a side channel to notify the application that it should call res_init() again. This is a push process. For example, it might be via systemd integration that you discover the network has changed and call res_init() again.

There have been patches floated that add stat() calls to *all* of the res_* functions, but the performance implications of that change have never been analyzed, and that's why the patch keeps getting rejected. Is it within the noise to do stat() on /etc/resolv.conf to reload the resolvers if they change? It seems a heavy-handed approach for systems which are less dynamic and have more stable configurations. At first blush the stat() has to be less costly than the upcoming network traffic, but it's still a non-zero cost paid in the hot path of all these functions.
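To make the "push" model concrete, here is a rough sketch of what an application using the res_* API could do: let some network-management hook notify it, and call res_init() before the next query. Using SIGHUP as the side channel is purely an assumption for illustration; the function names are made up.

/* Sketch: reload the resolver on an out-of-band notification.
   SIGHUP is an arbitrary choice; D-Bus, netlink or systemd
   integration would work the same way. */
#include <signal.h>
#include <string.h>
#include <netinet/in.h>
#include <arpa/nameser.h>
#include <resolv.h>

static volatile sig_atomic_t resolv_conf_changed;

static void on_sighup(int sig)
{
    (void) sig;
    resolv_conf_changed = 1;
}

void install_reload_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sighup;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGHUP, &sa, NULL);
}

/* Call this before each res_query()/res_search(). */
void reload_resolver_if_notified(void)
{
    if (resolv_conf_changed) {
        resolv_conf_changed = 0;
        res_init();   /* re-reads /etc/resolv.conf */
    }
}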
*** Bug 18279 has been marked as a duplicate of this bug. ***
Please consider for inclusion in glibc the stat() patch used by Debian.

I recently spent a long time tracking down a problem with hostname resolution in a long-running process, which ultimately turned out to be caused by glibc caching an empty /etc/resolv.conf. This can occur if the network configuration is dynamic, e.g. managed by DHCP and NetworkManager. From googling, it's apparent that many different programs, such as web browsers (Firefox, Chromium, etc.), have also run into this problem and have had to add hacks to work around it.

Interestingly, in contrast with /etc/resolv.conf, glibc's resolver immediately recognizes changes in /etc/hosts. Furthermore, /etc/hosts is always read in full. It seems that if there is concern about the performance of stat()ing /etc/resolv.conf, there could be an optimization to skip reading /etc/hosts if it hasn't changed, thereby replacing about 5 system calls with 1. That would likely save more time per getaddrinfo() than is spent by stat()ing /etc/resolv.conf an extra time.

Of course, the "correct" solution would be to use inotify to push changes only when they actually happen. Unfortunately, glibc doesn't have an opportunity to do that.
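For reference, an application-side version of the inotify idea (glibc itself has no natural place to own such a watch, as noted above) might look roughly like this sketch; the function names are made up, and the watch descriptor would be driven from the application's own poll()/epoll loop.

/* Sketch: watch /etc/resolv.conf with inotify and re-read it only when
   it actually changes.  Watching the directory also catches the common
   write-temp-file-then-rename update pattern. */
#include <sys/inotify.h>
#include <unistd.h>
#include <string.h>
#include <netinet/in.h>
#include <arpa/nameser.h>
#include <resolv.h>

int watch_resolv_conf(void)
{
    int fd = inotify_init1(IN_NONBLOCK);
    if (fd >= 0)
        inotify_add_watch(fd, "/etc", IN_CREATE | IN_MODIFY | IN_MOVED_TO);
    return fd;   /* add to the application's poll()/epoll set */
}

/* Call when the watch descriptor becomes readable. */
void handle_resolv_conf_events(int fd)
{
    char buf[4096] __attribute__ ((aligned(__alignof__(struct inotify_event))));
    ssize_t len = read(fd, buf, sizeof(buf));

    for (char *p = buf; len > 0 && p < buf + len; ) {
        const struct inotify_event *ev = (const struct inotify_event *) p;
        if (ev->len > 0 && strcmp(ev->name, "resolv.conf") == 0)
            res_init();                  /* reload the name server list */
        p += sizeof(struct inotify_event) + ev->len;
    }
}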
Created attachment 9078 [details] attachment-37701-0.html

This sounds like a great solution!!!
(In reply to Eric Biggers from comment #11)
> Please consider for inclusion in glibc the stat() patch used by Debian.

The Debian patch is incorrect: it breaks applications which override name servers by direct access to _res. I plan to add some /etc/resolv.conf auto-update functionality, but it will need a different implementation.

This work is blocked by our inability to properly test libresolv and /etc/resolv.conf processing right now. A first step along the path is this patch, which is still awaiting review: <https://sourceware.org/ml/libc-alpha/2016-02/msg00376.html>
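To illustrate the class of application the Debian patch breaks: some programs override the name server list by writing into the global _res state after res_init(), roughly like the sketch below (the address is just an example from the documentation range). An unconditional automatic re-read of /etc/resolv.conf would silently discard such overrides.

/* Sketch: overriding the resolver's name servers via direct access to
   _res, a pattern an automatic resolv.conf re-read would clobber. */
#include <netinet/in.h>
#include <arpa/inet.h>
#include <arpa/nameser.h>
#include <resolv.h>

void use_private_name_server(void)
{
    res_init();

    _res.nsaddr_list[0].sin_family = AF_INET;
    _res.nsaddr_list[0].sin_port = htons(53);
    inet_pton(AF_INET, "192.0.2.53", &_res.nsaddr_list[0].sin_addr);
    _res.nscount = 1;
}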
I am having the same issue with CrashPlan. The backup engine service fails because there is no connection on boot. Here is the text from the support representative. Any thoughts would be appreciated.

==========================================================================

Thanks for your patience on this. Our inquiry wound its way to the Tier 3 Engineers and the Engineering department. What you are experiencing is a bug in Linux... depending on who you ask. RedHat thinks it is working as expected, everyone else thinks it's a bug. Go figure!

Here's what is going on (just a warning, things get pretty jargon-y): When a process (like CrashPlan) makes its first DNS request, glibc reads the list of DNS servers from /etc/resolv.conf. If networks are chosen after boot, or dynamically with DHCP, /etc/resolv.conf may be empty. This means that applications that don't re-initialize their name servers will be stuck with nothing, and will not be able to resolve addresses.

RedHat's stance on this is that it is up to the application to handle this re-initialization logic. For short-lived programs (ping, for example), this isn't a big deal because once they are run again, there is usually a name server for resolution. For long-running daemons, such as the CrashPlan Engine, they never re-initialize the name servers, and cannot connect. Restarting the service with a connection, and therefore a nameserver, gets things rolling again - which is why you notice that a restart resolves the issue. Notably, Debian-based distros use a patched version of glibc that takes care of this problem.

Unfortunately, as only a very small subset of our users experience this problem, we have no intention of changing the logic of our product to account for how RedHat distributions handle initial name resolution. That leaves you with 3 options, all of which are beyond CrashPlan's scope of support:

* Rebuild glibc with Debian's patch.
* Configure NetworkManager to use a local dnsmasq instance.
* Switch distros to a Debian-based solution (Debian, Ubuntu, Mint, etc.)

Likewise, CrashPlan may not be the product that fits your use case - and that is what the trial period is for! We want you to have a backup solution that works for you. You may find the glibc bug page interesting, though to be honest I can't make heads or tails out of it!: https://sourceware.org/bugzilla/show_bug.cgi?id=984

Though not satisfying, within the context of CrashPlan support I must consider this ticket resolved, and will mark it as solved. If you have any additional questions, please let me know!
Any update?

This bug is now 11 years old and injects false notions into POSIX-compliant code.

Caching the resolver should be avoided at all costs. There are methods to cache the name lookups which should be used, but caching the resolver results in bad results with NetworkManager (installed by default by Red Hat) and any modifications to the resolv.conf name servers.

The only way to address this currently is to reboot the server any time the resolver is modified. This is not practical and, again, NetworkManager will modify it after boot. I've already proven that nscd and sssd do not address this break.

There's also a very real exploit here. A hacker could gain the ability to modify resolv.conf, restart apache, sendmail, or another app which is caching the resolver information, place back the original resolv.conf, and now use their name servers to route web or SMTP traffic to their sites.
(In reply to Karl from comment #15)

> Any update?
>
> This bug is now 11 years old and injects false notions into POSIX-compliant
> code.
>
> Caching the resolver should be avoided at all costs. There are methods to
> cache the name lookups which should be used, but caching the resolver
> results in bad results with NetworkManager (installed by default by Red
> Hat) and any modifications to the resolv.conf name servers.
>
> The only way to address this currently is to reboot the server any time the
> resolver is modified. This is not practical and, again, NetworkManager will
> modify it after boot. I've already proven that nscd and sssd do not address
> this break.
>
> There's also a very real exploit here. A hacker could gain the ability to
> modify resolv.conf, restart apache, sendmail, or another app which is
> caching the resolver information, place back the original resolv.conf, and
> now use their name servers to route web or SMTP traffic to their sites.

There is some consensus that glibc should be changed to match the debian-glibc behaviour, which checks for changes in /etc/resolv.conf. The problem, as noted in comment 13 by Florian, is that we need better test infrastructure in glibc to test resolver changes. With that in mind I reviewed Florian's chroot-based test for resolver changes here:

https://sourceware.org/ml/libc-alpha/2016-06/msg00376.html
https://sourceware.org/ml/libc-alpha/2016-06/msg00366.html

Thus I think we're making some progress here.
*** Bug 20900 has been marked as a duplicate of this bug. ***
(In reply to Karl from comment #15) Sorry, but your security claim makes no sense. If a hacker has compromised your system enough to modify resolv.conf, then you've already lost. Claiming that reloading it on the fly fixes things is a bit ridiculous.