
Alertmanager merges peers through IP instead of DNS #2295

Open
devlucasc opened this issue Jun 11, 2020 · 9 comments

devlucasc commented Jun 11, 2020

What did you do?
I configured Alertmanager in an AWS EKS cluster using the prometheus-operator Helm chart with 3 replicas.

What did you expect to see?
I expected alerts to be propagated and synchronized correctly between the pods.

What did you see instead? Under which circumstances?
Alerts are lost between pods when more than one replica is used. The StatefulSet pods come up in parallel (podManagementPolicy: Parallel), but they do not always manage to join each other. For example, if pod-0 starts last, pod-0 can communicate with pod-1 and pod-2, but not the other way around. The same happens when one pod goes down and another comes up. As a result the pods act independently: they are unable to join each other, the sync is lost, and alerts are duplicated when they are sent through the API using the DNS name configured in the Ingress. I checked other issues and tested connectivity over both TCP and UDP. After switching the log level to debug, I found that Alertmanager resolves the peer DNS name to an IP and then keeps using the IP of a pod that no longer exists; because the private IP in EKS is allocated to the pod dynamically, it can no longer reach the peer.
podManagementPolicy is configured as Parallel: here
I looked at issues #1261 and #1312.
I believe this issue is related to the way Alertmanager resolves peers, converting them to direct IP addresses instead of using the Kubernetes DNS names (e.g. *.svc.cluster.local) for lookup; see the sketch below.
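To illustrate the suspicion, here is a minimal Go sketch (my own illustration, not the actual Alertmanager code) of how a --cluster.peer DNS name ends up reduced to a fixed pod IP at join time; the peer name is taken from my configuration shown under Environment below:

// Minimal sketch (assumption, not the real Alertmanager code): the DNS name
// from --cluster.peer is resolved to an IP once when joining, and the gossip
// layer then keeps talking to that IP:port rather than to the DNS name.
package main

import (
	"context"
	"fmt"
	"net"
)

func main() {
	// Peer name taken from the --cluster.peer flags quoted below.
	peer := "alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc"

	ips, err := (&net.Resolver{}).LookupIPAddr(context.Background(), peer)
	if err != nil {
		fmt.Println("resolve failed:", err)
		return
	}
	for _, ip := range ips {
		// The resolved address (the pod IP at this moment) is what gets joined;
		// if the pod is rescheduled with a new IP, the old address becomes
		// unreachable ("no route to host").
		fmt.Println("would join", net.JoinHostPort(ip.IP.String(), "9094"))
	}
}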

Environment
AWS EKS

  • System information:
    EKS - Kubernetes 1.15 using official docker image from quay.io

  • Alertmanager version:

alertmanager, version 0.20.0 (branch: HEAD, revision: f74be0400a6243d10bb53812d6fa408ad71ff32d)
  build user:       root@00c3106655f8
  build date:       20191211-14:13:14
  go version:       go1.13.5

  • Alertmanager command line:

/bin/alertmanager --config.file=/etc/alertmanager/config/alertmanager.yaml --cluster.listen-address=[***.***.***.217]:9094 --storage.path=/alertmanager --data.retention=120h --web.listen-address=:9093 --web.external-url=http://redacted/ --web.route-prefix=/
  --cluster.peer=alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc:9094
  --cluster.peer=alertmanager-prometheus-operator-alertmanager-1.alertmanager-operated.monitoring.svc:9094
  --cluster.peer=alertmanager-prometheus-operator-alertmanager-2.alertmanager-operated.monitoring.svc:9094

devlucasc commented Jun 11, 2020

Adding:

 Failed to join ***.***.***.197: dial tcp ***.***.***.197:9094: connect: no route to host\n"

This IP does not exist anymore, but the DNS name alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc resolves fine (verified over TCP and UDP with nc).


teke97 commented Jun 14, 2020

I'm not sure my issue is related, but I have similar problems.
I have installed Prometheus in Kubernetes 1.17.
I'm using Helm and the stable/prometheus-operator chart.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:40:16Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.5", GitCommit:"e0fccafd69541e3750d460ba0f9743b90336f24f", GitTreeState:"clean", BuildDate:"2020-04-16T11:35:47Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
$ helm3 version
version.BuildInfo{Version:"v3.2.1", GitCommit:"fe51cd1e31e6a202cba7dead9552a6d418ded79a", GitTreeState:"clean", GoVersion:"go1.13.10"}

I got this problem:

kubectl logs alertmanager-prometheus-operator-alertmanager-0 alertmanager
level=warn ts=2020-06-11T15:32:36.386Z caller=main.go:322 msg="unable to join gossip mesh" err="1 error occurred:\n\t* Failed to resolve alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.prometheus-operator.svc:9094: lookup alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.prometheus-operator.svc on 10.245.0.10:53: no such host\n\n"

After investigating, I realised that the problem may be in the busybox image. The problem is similar to kubernetes/kubernetes#66924 (comment). (A Go-based lookup check is sketched after the resolv.conf output below.)

Part of the Kubernetes deployment config:

      - args:
        - -c
        - while true; do nslookup alertmanager-bot; sleep 10; done
        command:
        - /bin/sh
        image: busybox:1.31.1

pod log:

Server:		10.245.0.10
Address:	10.245.0.10:53

** server can't find alertmanager-bot.monitoring.svc.cluster.local: NXDOMAIN

*** Can't find alertmanager-bot.svc.cluster.local: No answer
*** Can't find alertmanager-bot.cluster.local: No answer
*** Can't find alertmanager-bot.monitoring.svc.cluster.local: No answer
*** Can't find alertmanager-bot.svc.cluster.local: No answer
*** Can't find alertmanager-bot.cluster.local: No answer

coredns log:

coredns-84c79f5fb4-vkc7j coredns 2020-06-11T14:29:27.561Z [INFO] 10.244.0.215:43144 - 19456 "AAAA IN alertmanager-bot.cluster.local. udp 48 false 512" NXDOMAIN qr,aa,rd 141 0.000202924s
coredns-84c79f5fb4-vkc7j coredns 2020-06-11T14:29:27.562Z [INFO] 10.244.0.215:43144 - 19456 "A IN alertmanager-bot.monitoring.svc.cluster.local. udp 63 false 512" NOERROR qr,aa,rd 124 0.000145229s
coredns-84c79f5fb4-vkc7j coredns 2020-06-11T14:29:27.562Z [INFO] 10.244.0.215:43144 - 19456 "A IN alertmanager-bot.svc.cluster.local. udp 52 false 512" NXDOMAIN qr,aa,rd 145 0.000084224s
coredns-84c79f5fb4-vkc7j coredns 2020-06-11T14:29:27.562Z [INFO] 10.244.0.215:43144 - 19456 "A IN alertmanager-bot.cluster.local. udp 48 false 512" NXDOMAIN qr,aa,rd 141 0.000056272s
coredns-84c79f5fb4-vkc7j coredns 2020-06-11T14:29:27.562Z [INFO] 10.244.0.215:43144 - 19456 "AAAA IN alertmanager-bot.monitoring.svc.cluster.local. udp 63 false 512" NOERROR qr,aa,rd 156 0.000060009s
coredns-84c79f5fb4-vkc7j coredns 2020-06-11T14:29:27.562Z [INFO] 10.244.0.215:43144 - 19456 "AAAA IN alertmanager-bot.svc.cluster.local. udp 52 false 512" NXDOMAIN qr,aa,rd 145 0.000051978s

pod log with busybox 1.28.4:

Name:      alertmanager-bot
Address 1: 10.245.48.126 alertmanager-bot.monitoring.svc.cluster.local
Server:    10.245.0.10
Address 1: 10.245.0.10 kube-dns.kube-system.svc.cluster.local

coredns log:

coredns-84c79f5fb4-vkc7j coredns 2020-06-11T14:34:42.790Z [INFO] 10.244.0.204:53241 - 3 "AAAA IN alertmanager-bot.monitoring.svc.cluster.local. udp 63 false 512" NOERROR qr,aa,rd 156 0.000207196s
coredns-84c79f5fb4-bspnj coredns 2020-06-11T14:34:42.792Z [INFO] 10.244.0.204:57444 - 4 "A IN alertmanager-bot.monitoring.svc.cluster.local. udp 63 false 512" NOERROR qr,aa,rd 124 0.000175375s

resolv.conf

/ # cat /etc/resolv.conf # the same on both images
nameserver 10.245.0.10
search monitoring.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
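
As a sanity check independent of busybox's nslookup, here is a minimal Go sketch (my own, not part of the chart or of Alertmanager) that resolves the short service name through net.Resolver, which honours the search domains and ndots option from /etc/resolv.conf, much like Alertmanager's own peer resolution:

// Minimal sketch (assumption): resolve the short service name the way the Go
// resolver does, applying the search list from /etc/resolv.conf, so the result
// does not depend on the busybox nslookup implementation.
package main

import (
	"context"
	"fmt"
	"net"
	"os"
)

func main() {
	name := "alertmanager-bot" // short name; the search domains should expand it
	ips, err := (&net.Resolver{}).LookupIPAddr(context.Background(), name)
	if err != nil {
		fmt.Fprintln(os.Stderr, "lookup failed:", err)
		os.Exit(1)
	}
	for _, ip := range ips {
		fmt.Println(name, "->", ip.IP.String())
	}
}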

I have already opened a bug against busybox: https://bugs.busybox.net/show_bug.cgi?id=13006
My cloud provider is DigitalOcean.
Image versions:

$ kubectl get pods -o yaml alertmanager-prometheus-operator-alertmanager-0 | grep image:
    image: quay.io/prometheus/alertmanager:v0.20.0
    image: quay.io/coreos/configmap-reload:v0.0.1

alertmanager version

/alertmanager $ alertmanager --version
alertmanager, version 0.20.0 (branch: HEAD, revision: f74be0400a6243d10bb53812d6fa408ad71ff32d)
  build user:       root@00c3106655f8
  build date:       20191211-14:13:14
  go version:       go1.13.5

simonpasquier (Member) commented:

Alertmanager refreshes the addresses of cluster peers periodically (every 15 seconds):

func (p *Peer) refresh() {
	logger := log.With(p.logger, "msg", "refresh")

	// Re-resolve the configured peer names (--cluster.peer) to addresses.
	resolvedPeers, err := resolvePeers(context.Background(), p.knownPeers, p.advertiseAddr, &net.Resolver{}, false)
	if err != nil {
		level.Debug(logger).Log("peers", p.knownPeers, "err", err)
		return
	}

	// Join any resolved address that is not already a member of the gossip cluster.
	members := p.mlist.Members()
	for _, peer := range resolvedPeers {
		var isPeerFound bool
		for _, member := range members {
			if member.Address() == peer {
				isPeerFound = true
				break
			}
		}
		if !isPeerFound {
			if _, err := p.mlist.Join([]string{peer}); err != nil {
				p.failedRefreshCounter.Inc()
				level.Warn(logger).Log("result", "failure", "addr", peer, "err", err)
			} else {
				p.refreshCounter.Inc()
				level.Debug(logger).Log("result", "success", "addr", peer)
			}
		}
	}
}

Can you try running with --log.level=debug and share the logs?


teke97 commented Jun 19, 2020

@simonpasquier
Thank you for your answer, the issue was on my side.


devlucasc commented Jun 19, 2020

I opened another issue on prometheus-operator (#3289). I will test whether pointing the peers at fully qualified DNS names ending in cluster.local solves the problem, and then report back here (hypothetical flags sketched below).
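
For reference, this is roughly what the peer flags would look like with fully qualified names; a hypothetical sketch based on the flags quoted earlier, not something I have verified yet:

--cluster.peer=alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc.cluster.local:9094
--cluster.peer=alertmanager-prometheus-operator-alertmanager-1.alertmanager-operated.monitoring.svc.cluster.local:9094
--cluster.peer=alertmanager-prometheus-operator-alertmanager-2.alertmanager-operated.monitoring.svc.cluster.local:9094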

hwoarang commented:

This sounds a lot like #2250

rameshpitchaiah commented:

@teke97 : I am facing the same problem. Did you fix it?


teke97 commented Oct 8, 2020

@teke97 : I am facing the same problem. Did you fix it?

To the best of my memory, the problem was in my config file and was not related to the warning quoted above: level=warn ts=2020-06-11T15:32:36.386Z caller=main.go:322 msg="unable to join gossip mesh" err="1 error occurred:\n\t* Failed to resolve alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.prometheus-operator.svc:9094: lookup alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.prometheus-operator.svc on 10.245.0.10:53: no such host\n\n"

rameshpitchaiah commented:

@teke97: What was the issue in the config file?
