
Alertmanager merges peers through IP instead of DNS #2295

Open
devlucasc opened this issue Jun 11, 2020 · 9 comments

devlucasc commented Jun 11, 2020

What did you do?
I configured Alertmanager in an AWS EKS cluster using the prometheus-operator Helm chart with 3 replicas.

What did you expect to see?
I expected alerts to be propagated and synchronized correctly between the pods.

What did you see instead? Under which circumstances?
Alerts are lost between pods when more than one replica is used. The StatefulSet pods come up in parallel (podManagementPolicy: Parallel), but they do not always manage to join each other. For example, if pod-0 starts last, pod-0 can communicate with pod-1 and pod-2, but not the other way around. The same happens when one pod goes down and another comes up. As a result the pods act independently: they are unable to join each other, the sync is lost, and alerts are duplicated when they are sent through the API using the DNS name configured in the Ingress. I checked other issues and tested connectivity over both TCP and UDP. After switching the log level to debug, I found that Alertmanager resolves the peer DNS name to an IP and then keeps using the IP of a pod that no longer exists; because the private IP in EKS is allocated to the pod dynamically, it can no longer reach the peer.
podManagementPolicy is configured as Parallel: here
I looked at issues #1261 and #1312.
I believe this issue is related to the way Alertmanager resolves peers, converting them to direct IP addresses instead of using the Kubernetes DNS names (e.g. *.svc.cluster.local) for lookup; see the sketch below.
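To illustrate the suspicion, here is a minimal Go sketch (my own illustration, not the actual Alertmanager code) of how a --cluster.peer DNS name ends up reduced to a fixed pod IP at join time; the peer name is taken from my configuration shown under Environment below:

// Minimal sketch (assumption, not the real Alertmanager code): the DNS name
// from --cluster.peer is resolved to an IP once when joining, and the gossip
// layer then keeps talking to that IP:port rather than to the DNS name.
package main

import (
	"context"
	"fmt"
	"net"
)

func main() {
	// Peer name taken from the --cluster.peer flags quoted below.
	peer := "alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc"

	ips, err := (&net.Resolver{}).LookupIPAddr(context.Background(), peer)
	if err != nil {
		fmt.Println("resolve failed:", err)
		return
	}
	for _, ip := range ips {
		// The resolved address (the pod IP at this moment) is what gets joined;
		// if the pod is rescheduled with a new IP, the old address becomes
		// unreachable ("no route to host").
		fmt.Println("would join", net.JoinHostPort(ip.IP.String(), "9094"))
	}
}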

Environment
AWS EKS

  • System information:
    EKS - Kubernetes 1.15 using official docker image from quay.io

  • Alertmanager version:

alertmanager, version 0.20.0 (branch: HEAD, revision: f74be0400a6243d10bb53812d6fa408ad71ff32d)
  build user:       root@00c3106655f8
  build date:       20191211-14:13:14
  go version:       go1.13.5

  • Alertmanager command line:

/bin/alertmanager --config.file=/etc/alertmanager/config/alertmanager.yaml --cluster.listen-address=[***.***.***.217]:9094 --storage.path=/alertmanager --data.retention=120h --web.listen-address=:9093 --web.external-url=http://redacted/ --web.route-prefix=/
  --cluster.peer=alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc:9094
  --cluster.peer=alertmanager-prometheus-operator-alertmanager-1.alertmanager-operated.monitoring.svc:9094
  --cluster.peer=alertmanager-prometheus-operator-alertmanager-2.alertmanager-operated.monitoring.svc:9094

devlucasc commented Jun 11, 2020

Adding:

 Failed to join ***.***.***.197: dial tcp ***.***.***.197:9094: connect: no route to host\n"

This IP does not exist anymore, but the DNS name alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc resolves fine (verified over TCP and UDP with nc).


teke97 commented Jun 14, 2020

I'm not sure my issue is related, but I have similar problems.
I have installed Prometheus in Kubernetes 1.17.
I'm using Helm and the stable/prometheus-operator chart.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:40:16Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.5", GitCommit:"e0fccafd69541e3750d460ba0f9743b90336f24f", GitTreeState:"clean", BuildDate:"2020-04-16T11:35:47Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
$ helm3 version
version.BuildInfo{Version:"v3.2.1", GitCommit:"fe51cd1e31e6a202cba7dead9552a6d418ded79a", GitTreeState:"clean", GoVersion:"go1.13.10"}

I got this problem:

kubectl logs alertmanager-prometheus-operator-alertmanager-0 alertmanager
level=warn ts=2020-06-11T15:32:36.386Z caller=main.go:322 msg="unable to join gossip mesh" err="1 error occurred:\n\t* Failed to resolve alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.prometheus-operator.svc:9094: lookup alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.prometheus-operator.svc on 10.245.0.10:53: no such host\n\n"

After investigating, I realised that the problem may be in the busybox image. The problem is similar to kubernetes/kubernetes#66924 (comment). (A Go-based lookup check is sketched after the resolv.conf output below.)

Part of the Kubernetes deployment config:

      - args:
        - -c
        - while true; do nslookup alertmanager-bot; sleep 10; done
        command:
        - /bin/sh
        image: busybox:1.31.1

pod log:

Server:		10.245.0.10
Address:	10.245.0.10:53

** server can't find alertmanager-bot.monitoring.svc.cluster.local: NXDOMAIN

*** Can't find alertmanager-bot.svc.cluster.local: No answer
*** Can't find alertmanager-bot.cluster.local: No answer
*** Can't find alertmanager-bot.monitoring.svc.cluster.local: No answer
*** Can't find alertmanager-bot.svc.cluster.local: No answer
*** Can't find alertmanager-bot.cluster.local: No answer

coredns log:

coredns-84c79f5fb4-vkc7j coredns 2020-06-11T14:29:27.561Z [INFO] 10.244.0.215:43144 - 19456 "AAAA IN alertmanager-bot.cluster.local. udp 48 false 512" NXDOMAIN qr,aa,rd 141 0.000202924s
coredns-84c79f5fb4-vkc7j coredns 2020-06-11T14:29:27.562Z [INFO] 10.244.0.215:43144 - 19456 "A IN alertmanager-bot.monitoring.svc.cluster.local. udp 63 false 512" NOERROR qr,aa,rd 124 0.000145229s
coredns-84c79f5fb4-vkc7j coredns 2020-06-11T14:29:27.562Z [INFO] 10.244.0.215:43144 - 19456 "A IN alertmanager-bot.svc.cluster.local. udp 52 false 512" NXDOMAIN qr,aa,rd 145 0.000084224s
coredns-84c79f5fb4-vkc7j coredns 2020-06-11T14:29:27.562Z [INFO] 10.244.0.215:43144 - 19456 "A IN alertmanager-bot.cluster.local. udp 48 false 512" NXDOMAIN qr,aa,rd 141 0.000056272s
coredns-84c79f5fb4-vkc7j coredns 2020-06-11T14:29:27.562Z [INFO] 10.244.0.215:43144 - 19456 "AAAA IN alertmanager-bot.monitoring.svc.cluster.local. udp 63 false 512" NOERROR qr,aa,rd 156 0.000060009s
coredns-84c79f5fb4-vkc7j coredns 2020-06-11T14:29:27.562Z [INFO] 10.244.0.215:43144 - 19456 "AAAA IN alertmanager-bot.svc.cluster.local. udp 52 false 512" NXDOMAIN qr,aa,rd 145 0.000051978s

pod log with busybox 1.28.4:

Name:      alertmanager-bot
Address 1: 10.245.48.126 alertmanager-bot.monitoring.svc.cluster.local
Server:    10.245.0.10
Address 1: 10.245.0.10 kube-dns.kube-system.svc.cluster.local

coredns log:

coredns-84c79f5fb4-vkc7j coredns 2020-06-11T14:34:42.790Z [INFO] 10.244.0.204:53241 - 3 "AAAA IN alertmanager-bot.monitoring.svc.cluster.local. udp 63 false 512" NOERROR qr,aa,rd 156 0.000207196s
coredns-84c79f5fb4-bspnj coredns 2020-06-11T14:34:42.792Z [INFO] 10.244.0.204:57444 - 4 "A IN alertmanager-bot.monitoring.svc.cluster.local. udp 63 false 512" NOERROR qr,aa,rd 124 0.000175375s

resolv.conf

/ # cat /etc/resolv.conf # the same on both images
nameserver 10.245.0.10
search monitoring.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
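
As a sanity check independent of busybox's nslookup, here is a minimal Go sketch (my own, not part of the chart or of Alertmanager) that resolves the short service name through net.Resolver, which honours the search domains and ndots option from /etc/resolv.conf, much like Alertmanager's own peer resolution:

// Minimal sketch (assumption): resolve the short service name the way the Go
// resolver does, applying the search list from /etc/resolv.conf, so the result
// does not depend on the busybox nslookup implementation.
package main

import (
	"context"
	"fmt"
	"net"
	"os"
)

func main() {
	name := "alertmanager-bot" // short name; the search domains should expand it
	ips, err := (&net.Resolver{}).LookupIPAddr(context.Background(), name)
	if err != nil {
		fmt.Fprintln(os.Stderr, "lookup failed:", err)
		os.Exit(1)
	}
	for _, ip := range ips {
		fmt.Println(name, "->", ip.IP.String())
	}
}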

I have already opened a bug against busybox: https://bugs.busybox.net/show_bug.cgi?id=13006
My cloud provider is DigitalOcean.
Image versions:

$ kubectl get pods -o yaml alertmanager-prometheus-operator-alertmanager-0 | grep image:
    image: quay.io/prometheus/alertmanager:v0.20.0
    image: quay.io/coreos/configmap-reload:v0.0.1

alertmanager version

/alertmanager $ alertmanager --version
alertmanager, version 0.20.0 (branch: HEAD, revision: f74be0400a6243d10bb53812d6fa408ad71ff32d)
  build user:       root@00c3106655f8
  build date:       20191211-14:13:14
  go version:       go1.13.5

simonpasquier (Member) commented:

Alertmanager refreshes the addresses of cluster peers periodically (every 15 seconds):

func (p *Peer) refresh() {
	logger := log.With(p.logger, "msg", "refresh")

	// Re-resolve the configured peer names (--cluster.peer) to addresses.
	resolvedPeers, err := resolvePeers(context.Background(), p.knownPeers, p.advertiseAddr, &net.Resolver{}, false)
	if err != nil {
		level.Debug(logger).Log("peers", p.knownPeers, "err", err)
		return
	}

	// Join any resolved address that is not already a member of the gossip cluster.
	members := p.mlist.Members()
	for _, peer := range resolvedPeers {
		var isPeerFound bool
		for _, member := range members {
			if member.Address() == peer {
				isPeerFound = true
				break
			}
		}
		if !isPeerFound {
			if _, err := p.mlist.Join([]string{peer}); err != nil {
				p.failedRefreshCounter.Inc()
				level.Warn(logger).Log("result", "failure", "addr", peer, "err", err)
			} else {
				p.refreshCounter.Inc()
				level.Debug(logger).Log("result", "success", "addr", peer)
			}
		}
	}
}

Can you try running with --log.level=debug and share the logs?


teke97 commented Jun 19, 2020

@simonpasquier
Thank you for your answer, the issue was on my side.


devlucasc commented Jun 19, 2020

I opened another issue on prometheus-operator (#3289). I will test whether pointing the peers at fully qualified DNS names ending in cluster.local solves the problem, and then report back here (hypothetical flags sketched below).
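
For reference, this is roughly what the peer flags would look like with fully qualified names; a hypothetical sketch based on the flags quoted earlier, not something I have verified yet:

--cluster.peer=alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc.cluster.local:9094
--cluster.peer=alertmanager-prometheus-operator-alertmanager-1.alertmanager-operated.monitoring.svc.cluster.local:9094
--cluster.peer=alertmanager-prometheus-operator-alertmanager-2.alertmanager-operated.monitoring.svc.cluster.local:9094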

hwoarang commented:

This sounds a lot like #2250

rameshpitchaiah commented:

@teke97 : I am facing the same problem. Did you fix it?


teke97 commented Oct 8, 2020

@teke97 : I am facing the same problem. Did you fix it?

To the best of my memory, the problem was in my config file and was not related to the warning quoted above: level=warn ts=2020-06-11T15:32:36.386Z caller=main.go:322 msg="unable to join gossip mesh" err="1 error occurred:\n\t* Failed to resolve alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.prometheus-operator.svc:9094: lookup alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.prometheus-operator.svc on 10.245.0.10:53: no such host\n\n"

rameshpitchaiah commented:

@teke97: What was the issue in the config file?
