
Regression: Resolving unqualified DNS names fails #1307

Open
discordianfish opened this issue Mar 30, 2018 · 15 comments

Comments

@discordianfish
Member

What did you do?
Running AM in Kubernetes as a StatefulSet with a headless service, giving each AM a name like alertmanager-0.alertmanager.default.svc.cluster.local.
The pod gets, among others, default.svc.cluster.local configured as a search domain in /etc/resolv.conf:

# cat /etc/resolv.conf 
nameserver 10.35.240.10
search default.svc.cluster.local svc.cluster.local cluster.local c.latency-at.internal google.internal

This allows the alertmanager-0.alertmanager name to be resolved unqualified, like this:

# nslookup alertmanager-0.alertmanager
Server:    10.35.240.10
Address 1: 10.35.240.10 kube-dns.kube-system.svc.cluster.local

Name:      alertmanager-0.alertmanager
Address 1: 10.32.4.32 alertmanager-0.alertmanager.default.svc.cluster.local

Alertmanager, however, can't resolve this name unqualified (which worked at least in 0.11.0) and logs this error:

level=warn ts=2018-03-30T13:25:15.016042032Z caller=cluster.go:129 component=cluster msg="failed to join cluster" err="2 errors occurred:\n\n* Failed to resolve alertmanager-0.alertmanager:6783: lookup alertmanager-0.alertmanager on 10.35.240.10:53: no such host\n* Failed to join 10.32.6.23: dial tcp 10.32.6.23:6783: connect: connection refused"

Environment

  • Alertmanager version:
alertmanager, version 0.15.0-rc.1 (branch: HEAD, revision: acb111e812530bec1ac6d908bc14725793e07cf3)
@stuartnelson3
Contributor

Memberlist is doing the resolving, and from looking at the code it uses Go's stdlib net.LookupIP for this.

https://github.com/hashicorp/memberlist/blob/9f5b38f1dc837733754bf57f4ea62726a509c0fc/memberlist.go#L344

Are you able to resolve the address by running that net.LookupIP call in a standalone Go binary in your container? Something like the sketch below should do.
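
A minimal sketch of such a test program (added here for illustration; it is an assumption, not the exact binary used later in this thread, and simply calls net.LookupIP on its first argument):

// lookup.go - hypothetical test program; net.LookupIP uses the stdlib
// resolver, which honours the search domains in /etc/resolv.conf.
package main

import (
    "fmt"
    "net"
    "os"
)

func main() {
    if len(os.Args) != 2 {
        fmt.Fprintln(os.Stderr, "usage: lookup <hostname>")
        os.Exit(1)
    }
    ips, err := net.LookupIP(os.Args[1])
    fmt.Printf("err: %v, res: %v\n", err, ips)
}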

@stuartnelson3
Contributor

There's a probably related issue in #1312. Does your status page show any peers?

@stuartnelson3
Contributor

Looking a bit further into the code, it is resolving the addresses but is apparently unable to join them...

@discordianfish
Member Author

@stuartnelson3 It's not showing any peers when using the unqualified name. I'll test with a Go binary using net.LookupIP once I get back to this.

@discordianfish
Member Author

I just built a simple binary running LookupIP, and it seems to work just fine:

/ # nslookup alertmanager-0.alertmanager
Server:    10.35.240.10
Address 1: 10.35.240.10 kube-dns.kube-system.svc.cluster.local

Name:      alertmanager-0.alertmanager
Address 1: 10.32.4.33 alertmanager-0.alertmanager.default.svc.cluster.local
/ # /test alertmanager-0.alertmanager
err: <nil>, res: [10.32.4.33]/ # 
/ # cat /etc/resolv.conf 
nameserver 10.35.240.10
search default.svc.cluster.local svc.cluster.local cluster.local c.latency-at.internal google.internal
options ndots:5

@stuartnelson3
Contributor

Thanks for looking at this.

From the original log line you provided, there are two errors: one is a failure to resolve, and the other is a failure to join.

Both of those bits of code are in the same loop here: https://github.com/hashicorp/memberlist/blob/9f5b38f1dc837733754bf57f4ea62726a509c0fc/memberlist.go#L214-L234
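
Roughly, that loop does something like the following hedged sketch (an illustration of the flow, not memberlist's actual implementation): each configured peer is resolved and then dialed, and every failure is collected, which is why a single warning can mix "Failed to resolve" and "Failed to join" errors.

package main

import (
    "fmt"
    "net"
    "time"
)

// join sketches the peer-join flow: resolve each peer, dial every returned
// address, and collect all failures instead of stopping at the first one.
func join(peers []string, port string) (successes int, errs []error) {
    for _, peer := range peers {
        ips, err := net.LookupIP(peer) // memberlist actually tries its own resolver first (see below)
        if err != nil {
            errs = append(errs, fmt.Errorf("Failed to resolve %s:%s: %v", peer, port, err))
            continue
        }
        for _, ip := range ips {
            conn, err := net.DialTimeout("tcp", net.JoinHostPort(ip.String(), port), 2*time.Second)
            if err != nil {
                errs = append(errs, fmt.Errorf("Failed to join %s: %v", ip, err))
                continue
            }
            conn.Close()
            successes++
        }
    }
    return successes, errs
}

func main() {
    n, errs := join([]string{"alertmanager-0.alertmanager", "alertmanager-1.alertmanager"}, "9094")
    fmt.Printf("joined %d peer(s), errors: %v\n", n, errs)
}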

The initial lookup seems to be happening on a forwarder local to that kubelet:

Failed to resolve alertmanager-0.alertmanager:6783: lookup alertmanager-0.alertmanager on 10.35.240.10:53: no such host

But then the second error is connecting to an IP that isn't (according to your nslookup) an AM IP address:

Failed to join 10.32.6.23: dial tcp 10.32.6.23:6783: connect: connection refused

The connection failure, I think, could be due to stale DNS data or something similar... I'm not sure where that IP came from.

Are you configuring each AM instance to have the full list of peers? So instance1 has --cluster.peer flags for both instance1 (itself) and instance2?

Also, how do you have --cluster.listen-address configured? It looks like it needs to be a routable pod IP: prometheus-operator/prometheus-operator#1193

@discordianfish
Member Author

I don't think it's related to the listen address, since it works when using the fully qualified name.

Yeah, I focused on the DNS error, but you're right, it's confusing that it tries to join some other IP. Not sure what happened there. When I try to reproduce it by deleting my pods and recreating them, I don't see this, just the DNS error:


level=info ts=2018-04-11T17:24:05.603230635Z caller=main.go:140 msg="Starting Alertmanager" version="(version=0.15.0-rc.1, branch=HEAD, revision=acb111e812530bec1ac6d908bc14725793e07cf3)"
level=info ts=2018-04-11T17:24:05.603309763Z caller=main.go:141 build_context="(go=go1.10, user=root@f278953f13ef, date=20180323-13:05:10)"
level=warn ts=2018-04-11T17:24:06.066127059Z caller=cluster.go:129 component=cluster msg="failed to join cluster" err="2 errors occurred:\n\n* Failed to resolve alertmanager-0.alertmanager:9094: lookup alertmanager-0.alertmanager on 10.35.240.10:53: no such host\n* Failed to resolve alertmanager-1.alertmanager:9094: lookup alertmanager-1.alertmanager on 10.35.240.10:53: no such host"
level=info ts=2018-04-11T17:24:06.067671792Z caller=main.go:270 msg="Loading configuration file" file=/etc/alertmanager/alertmanager.yaml
level=info ts=2018-04-11T17:24:06.073860511Z caller=cluster.go:249 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2018-04-11T17:24:06.077161531Z caller=main.go:346 msg=Listening address=:9093
level=info ts=2018-04-11T17:24:08.074302197Z caller=cluster.go:274 component=cluster msg="gossip not settled" polls=0 before=0 now=2 elapsed=2.000182926s

I'm as confused as you about why this fails, given that it's using the stdlib LookupIP, but it's definitely a problem with resolving the name. I just double-checked that my test binary can resolve the unqualified name just fine in this same pod.

@discordianfish
Member Author

WTH... I just read the memberlist code, and it implements its own resolver and only falls back to the stdlib when that fails: https://github.com/hashicorp/memberlist/blob/9f5b38f1dc837733754bf57f4ea62726a509c0fc/memberlist.go#L247

I'm going to file an upstream issue.
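
For illustration, a hedged sketch (not memberlist's code), under the assumption that its own resolver queries the name as given without applying the resolv.conf search list: the stdlib shows the same effect with a rooted name, since a trailing dot bypasses the search domains, so inside the pod the first lookup below succeeds while the second fails with "no such host".

package main

import (
    "fmt"
    "net"
)

func main() {
    name := "alertmanager-0.alertmanager"

    // Plain stdlib lookup: the search domains from /etc/resolv.conf are
    // applied, so the unqualified name resolves (as with the test binary above).
    ips, err := net.LookupIP(name)
    fmt.Printf("LookupIP(%q): err=%v res=%v\n", name, err, ips)

    // Rooted lookup: the trailing dot marks the name as fully qualified,
    // the search list is skipped, and the lookup fails with "no such host".
    ips, err = net.LookupIP(name + ".")
    fmt.Printf("LookupIP(%q): err=%v res=%v\n", name+".", err, ips)
}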

@stuartnelson3
Contributor

silence from them after 7 days :/

Would it be a lot of work for you to package your own AM with https://github.com/hashicorp/memberlist/blob/9f5b38f1dc837733754bf57f4ea62726a509c0fc/memberlist.go#L333-L339 commented out? I'm just starting back at SC and won't have time to try this for probably 2 weeks.

@discordianfish
Member Author

As a workaround, I'm using the FQDN. So it's not urgent, but it should get fixed because others will trip over this too.

@alesnav

alesnav commented Apr 25, 2019

Hi there, any news about this?

@zetaab

zetaab commented May 15, 2019

No idea if this is the same problem, but I have a problem like this in Kubernetes: gliderlabs/docker-alpine#255

In Debian-based Docker images DNS works fine, but in the alertmanager container:

% kubectl exec -it alertmanager-main-0 -c alertmanager /bin/sh
/alertmanager $ nslookup google.com
;; connection timed out; no servers could be reached

/alertmanager $ nslookup kubernetes.default.svc.cluster.local
;; connection timed out; no servers could be reached
/alertmanager $ cat /etc/resolv.conf
nameserver 100.64.0.10
search monitoring.svc.cluster.local svc.cluster.local cluster.local openstacklocal
options ndots:5
level=warn ts=2019-05-14T11:05:13.087102858Z caller=cluster.go:226 component=cluster msg="failed to join cluster" err="3 errors occurred:\n\n* Failed to resolve alertmanager-main-0.alertmanager-operated.monitoring.svc:6783: lookup alertmanager-main-0.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host\n* Failed to resolve alertmanager-main-1.alertmanager-operated.monitoring.svc:6783: lookup alertmanager-main-1.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host\n* Failed to resolve alertmanager-main-2.alertmanager-operated.monitoring.svc:6783: lookup alertmanager-main-2.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host"
level=info ts=2019-05-14T11:05:13.087142251Z caller=cluster.go:228 component=cluster msg="will retry joining cluster every 10s"
level=warn ts=2019-05-14T11:05:13.087157795Z caller=main.go:268 msg="unable to join gossip mesh" err="3 errors occurred:\n\n* Failed to resolve alertmanager-main-0.alertmanager-operated.monitoring.svc:6783: lookup alertmanager-main-0.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host\n* Failed to resolve alertmanager-main-1.alertmanager-operated.monitoring.svc:6783: lookup alertmanager-main-1.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host\n* Failed to resolve alertmanager-main-2.alertmanager-operated.monitoring.svc:6783: lookup alertmanager-main-2.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host"

@discordianfish
Member Author

@zetaab Why do you think your problem might be the same? As you verified, DNS isn't working at all in your container; that doesn't seem to be related to this issue.

@alesnav See the upstream issue (hashicorp/memberlist#147); there's nothing we can do here besides replacing memberlist.

@xkfen

xkfen commented Apr 29, 2020

same problem.
alertmanager version:
/alertmanager $ alertmanager --version
alertmanager, version 0.18.0 (branch: HEAD, revision: 1ace0f7)
build user: root@868685ed3ed0
build date: 20190708-14:31:49
go version: go1.12.6

any help? thanks

@discordianfish
Member Author

@xkfen Someone would have to fix the upstream issue: hashicorp/memberlist#147
Nothing has happened since I filed that issue. I'm still using the workaround I described above.
