
Regression: Resolving unqualified DNS names fails #1307

Open
discordianfish opened this issue Mar 30, 2018 · 15 comments

Comments

@discordianfish
Member

What did you do?
Running AM in Kubernetes as a StatefulSet with a headless service, giving each AM a name like alertmanager-0.alertmanager.default.svc.cluster.local.
The pod gets, among others, default.svc.cluster.local configured as a search domain in /etc/resolv.conf:

# cat /etc/resolv.conf 
nameserver 10.35.240.10
search default.svc.cluster.local svc.cluster.local cluster.local c.latency-at.internal google.internal

This allows the alertmanager-0.alertmanager name to be resolved unqualified, like this:

# nslookup alertmanager-0.alertmanager
Server:    10.35.240.10
Address 1: 10.35.240.10 kube-dns.kube-system.svc.cluster.local

Name:      alertmanager-0.alertmanager
Address 1: 10.32.4.32 alertmanager-0.alertmanager.default.svc.cluster.local

Alertmanager, however, can't resolve this name unqualified (which worked at least in 0.11.0) and logs this error:

level=warn ts=2018-03-30T13:25:15.016042032Z caller=cluster.go:129 component=cluster msg="failed to join cluster" err="2 errors occurred:\n\n* Failed to resolve alertmanager-0.alertmanager:6783: lookup alertmanager-0.alertmanager on 10.35.240.10:53: no such host\n* Failed to join 10.32.6.23: dial tcp 10.32.6.23:6783: connect: connection refused"

Environment

  • Alertmanager version:
alertmanager, version 0.15.0-rc.1 (branch: HEAD, revision: acb111e812530bec1ac6d908bc14725793e07cf3)
@stuartnelson3
Contributor

Memberlist is doing the resolving, and from looking at the code it uses Go's stdlib net.LookupIP for this.

https://github.com/hashicorp/memberlist/blob/9f5b38f1dc837733754bf57f4ea62726a509c0fc/memberlist.go#L344

Are you able to resolve the address by running that net.LookupIP call in a standalone Go binary in your container? Something like the sketch below should do.
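
A minimal sketch of such a test program (added here for illustration; it is an assumption, not the exact binary used later in this thread, and simply calls net.LookupIP on its first argument):

// lookup.go - hypothetical test program; net.LookupIP uses the stdlib
// resolver, which honours the search domains in /etc/resolv.conf.
package main

import (
    "fmt"
    "net"
    "os"
)

func main() {
    if len(os.Args) != 2 {
        fmt.Fprintln(os.Stderr, "usage: lookup <hostname>")
        os.Exit(1)
    }
    ips, err := net.LookupIP(os.Args[1])
    fmt.Printf("err: %v, res: %v\n", err, ips)
}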

@stuartnelson3
Contributor

There's a probably related issue in #1312. Does your status page show any peers?

@stuartnelson3
Contributor

Looking a bit further into the code, it is resolving the addresses but is apparently unable to join them...

@discordianfish
Member Author

@stuartnelson3 It's not showing any peers when using the unqualified name. I'll test with a Go binary using net.LookupIP once I get back to this.

@discordianfish
Member Author

I just built a simple binary running LookupIP, and it seems to work just fine:

/ # nslookup alertmanager-0.alertmanager
Server:    10.35.240.10
Address 1: 10.35.240.10 kube-dns.kube-system.svc.cluster.local

Name:      alertmanager-0.alertmanager
Address 1: 10.32.4.33 alertmanager-0.alertmanager.default.svc.cluster.local
/ # /test alertmanager-0.alertmanager
err: <nil>, res: [10.32.4.33]/ # 
/ # cat /etc/resolv.conf 
nameserver 10.35.240.10
search default.svc.cluster.local svc.cluster.local cluster.local c.latency-at.internal google.internal
options ndots:5

@stuartnelson3
Contributor

Thanks for looking at this.

From the original log line you provided, there are two errors: one is a failure to resolve, and the other is a failure to join.

Both of those bits of code are in the same loop here: https://github.com/hashicorp/memberlist/blob/9f5b38f1dc837733754bf57f4ea62726a509c0fc/memberlist.go#L214-L234
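
Roughly, that loop does something like the following hedged sketch (an illustration of the flow, not memberlist's actual implementation): each configured peer is resolved and then dialed, and every failure is collected, which is why a single warning can mix "Failed to resolve" and "Failed to join" errors.

package main

import (
    "fmt"
    "net"
    "time"
)

// join sketches the peer-join flow: resolve each peer, dial every returned
// address, and collect all failures instead of stopping at the first one.
func join(peers []string, port string) (successes int, errs []error) {
    for _, peer := range peers {
        ips, err := net.LookupIP(peer) // memberlist actually tries its own resolver first (see below)
        if err != nil {
            errs = append(errs, fmt.Errorf("Failed to resolve %s:%s: %v", peer, port, err))
            continue
        }
        for _, ip := range ips {
            conn, err := net.DialTimeout("tcp", net.JoinHostPort(ip.String(), port), 2*time.Second)
            if err != nil {
                errs = append(errs, fmt.Errorf("Failed to join %s: %v", ip, err))
                continue
            }
            conn.Close()
            successes++
        }
    }
    return successes, errs
}

func main() {
    n, errs := join([]string{"alertmanager-0.alertmanager", "alertmanager-1.alertmanager"}, "9094")
    fmt.Printf("joined %d peer(s), errors: %v\n", n, errs)
}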

The initial lookup seems to be happening on a forwarder local to that kubelet:

Failed to resolve alertmanager-0.alertmanager:6783: lookup alertmanager-0.alertmanager on 10.35.240.10:53: no such host

But then the second error is connecting to an IP that isn't (according to your nslookup) an AM IP address:

Failed to join 10.32.6.23: dial tcp 10.32.6.23:6783: connect: connection refused

The connection failure, I think, could be due to stale DNS data or something similar... I'm not sure where that IP came from.

Are you configuring each AM instance to have the full list of peers? So instance1 has --cluster.peer flags for both instance1 (itself) and instance2?

Also, how do you have --cluster.listen-address configured? It looks like it needs to be a routable pod IP: prometheus-operator/prometheus-operator#1193

@discordianfish
Member Author

I don't think it's related to the listen address, since it works when using the fully qualified name.

Yeah, I focused on the DNS error, but you're right, it's confusing that it tries to join some other IP. Not sure what happened there. When I try to reproduce it by deleting my pods and recreating them, I don't see this, just the DNS error:


level=info ts=2018-04-11T17:24:05.603230635Z caller=main.go:140 msg="Starting Alertmanager" version="(version=0.15.0-rc.1, branch=HEAD, revision=acb111e812530bec1ac6d908bc14725793e07cf3)"
level=info ts=2018-04-11T17:24:05.603309763Z caller=main.go:141 build_context="(go=go1.10, user=root@f278953f13ef, date=20180323-13:05:10)"
level=warn ts=2018-04-11T17:24:06.066127059Z caller=cluster.go:129 component=cluster msg="failed to join cluster" err="2 errors occurred:\n\n* Failed to resolve alertmanager-0.alertmanager:9094: lookup alertmanager-0.alertmanager on 10.35.240.10:53: no such host\n* Failed to resolve alertmanager-1.alertmanager:9094: lookup alertmanager-1.alertmanager on 10.35.240.10:53: no such host"
level=info ts=2018-04-11T17:24:06.067671792Z caller=main.go:270 msg="Loading configuration file" file=/etc/alertmanager/alertmanager.yaml
level=info ts=2018-04-11T17:24:06.073860511Z caller=cluster.go:249 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2018-04-11T17:24:06.077161531Z caller=main.go:346 msg=Listening address=:9093
level=info ts=2018-04-11T17:24:08.074302197Z caller=cluster.go:274 component=cluster msg="gossip not settled" polls=0 before=0 now=2 elapsed=2.000182926s

I'm as confused as you about why this fails, given that it's using the stdlib LookupIP, but it's definitely a problem with resolving the name. I just double-checked that my test binary can resolve the unqualified name just fine in this same pod.

@discordianfish
Member Author

WTH... I just read the memberlist code, and it implements its own resolver and only falls back to the stdlib when that fails: https://github.com/hashicorp/memberlist/blob/9f5b38f1dc837733754bf57f4ea62726a509c0fc/memberlist.go#L247

I'm going to file an upstream issue.
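
For illustration, a hedged sketch (not memberlist's code), under the assumption that its own resolver queries the name as given without applying the resolv.conf search list: the stdlib shows the same effect with a rooted name, since a trailing dot bypasses the search domains, so inside the pod the first lookup below succeeds while the second fails with "no such host".

package main

import (
    "fmt"
    "net"
)

func main() {
    name := "alertmanager-0.alertmanager"

    // Plain stdlib lookup: the search domains from /etc/resolv.conf are
    // applied, so the unqualified name resolves (as with the test binary above).
    ips, err := net.LookupIP(name)
    fmt.Printf("LookupIP(%q): err=%v res=%v\n", name, err, ips)

    // Rooted lookup: the trailing dot marks the name as fully qualified,
    // the search list is skipped, and the lookup fails with "no such host".
    ips, err = net.LookupIP(name + ".")
    fmt.Printf("LookupIP(%q): err=%v res=%v\n", name+".", err, ips)
}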

@stuartnelson3
Contributor

silence from them after 7 days :/

Would it be a lot of work for you to package your own AM with https://github.com/hashicorp/memberlist/blob/9f5b38f1dc837733754bf57f4ea62726a509c0fc/memberlist.go#L333-L339 commented out? I'm just starting back at SC and won't have time to try this for probably 2 weeks.

@discordianfish
Member Author

As a workaround, I'm using the FQDN. So it's not urgent, but it should get fixed because others will trip over this too.

@alesnav

alesnav commented Apr 25, 2019

Hi there, any news about this?

@zetaab

zetaab commented May 15, 2019

No idea if this is the same problem, but I have a problem like this in Kubernetes: gliderlabs/docker-alpine#255

In Debian-based Docker images DNS works fine, but in the alertmanager container:

% kubectl exec -it alertmanager-main-0 -c alertmanager /bin/sh
/alertmanager $ nslookup google.com
;; connection timed out; no servers could be reached

/alertmanager $ nslookup kubernetes.default.svc.cluster.local
;; connection timed out; no servers could be reached
/alertmanager $ cat /etc/resolv.conf
nameserver 100.64.0.10
search monitoring.svc.cluster.local svc.cluster.local cluster.local openstacklocal
options ndots:5
level=warn ts=2019-05-14T11:05:13.087102858Z caller=cluster.go:226 component=cluster msg="failed to join cluster" err="3 errors occurred:\n\n* Failed to resolve alertmanager-main-0.alertmanager-operated.monitoring.svc:6783: lookup alertmanager-main-0.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host\n* Failed to resolve alertmanager-main-1.alertmanager-operated.monitoring.svc:6783: lookup alertmanager-main-1.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host\n* Failed to resolve alertmanager-main-2.alertmanager-operated.monitoring.svc:6783: lookup alertmanager-main-2.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host"
level=info ts=2019-05-14T11:05:13.087142251Z caller=cluster.go:228 component=cluster msg="will retry joining cluster every 10s"
level=warn ts=2019-05-14T11:05:13.087157795Z caller=main.go:268 msg="unable to join gossip mesh" err="3 errors occurred:\n\n* Failed to resolve alertmanager-main-0.alertmanager-operated.monitoring.svc:6783: lookup alertmanager-main-0.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host\n* Failed to resolve alertmanager-main-1.alertmanager-operated.monitoring.svc:6783: lookup alertmanager-main-1.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host\n* Failed to resolve alertmanager-main-2.alertmanager-operated.monitoring.svc:6783: lookup alertmanager-main-2.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host"

@discordianfish
Member Author

@zetaab Why do you think your problem might be the same? As you verified, DNS isn't working at all in your container; that doesn't seem to be related to this issue.

@alesnav See the upstream issue (hashicorp/memberlist#147); there's nothing we can do here besides replacing memberlist.

@xkfen

xkfen commented Apr 29, 2020

same problem.
alertmanager version:
/alertmanager $ alertmanager --version
alertmanager, version 0.18.0 (branch: HEAD, revision: 1ace0f7)
build user: root@868685ed3ed0
build date: 20190708-14:31:49
go version: go1.12.6

any help? thanks

@discordianfish
Member Author

@xkfen Someone would have to fix the upstream issue: hashicorp/memberlist#147
Nothing has happened since I filed that issue. I'm still using the workaround I described above.
