Alertmanager merges peers through IP instead of DNS #2295
Comments
Adding: this IP does not exist anymore, but the DNS name does.
I am not sure my issue is related, but I have a similar problem.
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:40:16Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.5", GitCommit:"e0fccafd69541e3750d460ba0f9743b90336f24f", GitTreeState:"clean", BuildDate:"2020-04-16T11:35:47Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
$ helm3 version
version.BuildInfo{Version:"v3.2.1", GitCommit:"fe51cd1e31e6a202cba7dead9552a6d418ded79a", GitTreeState:"clean", GoVersion:"go1.13.10"}
I got this problem:
$ kubectl logs alertmanager-prometheus-operator-alertmanager-0 alertmanager
level=warn ts=2020-06-11T15:32:36.386Z caller=main.go:322 msg="unable to join gossip mesh" err="1 error occurred:\n\t* Failed to resolve alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.prometheus-operator.svc:9094: lookup alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.prometheus-operator.svc on 10.245.0.10:53: no such host\n\n"
After investigating, I realised that the problem may be in the busybox image. Part of the kubernetes deployment config:
pod log:
coredns log:
pod log with busybox 1.28.4:
coredns log:
resolv.conf
I have already opened a bug on busybox: https://bugs.busybox.net/show_bug.cgi?id=13006
$ kubectl get pods -o yaml alertmanager-prometheus-operator-alertmanager-0 | grep image:
image: quay.io/prometheus/alertmanager:v0.20.0
image: quay.io/coreos/configmap-reload:v0.0.1
Alertmanager version:
/alertmanager $ alertmanager --version
alertmanager, version 0.20.0 (branch: HEAD, revision: f74be0400a6243d10bb53812d6fa408ad71ff32d)
build user: root@00c3106655f8
build date: 20191211-14:13:14
go version: go1.13.5
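For what it's worth, the resolution Alertmanager performs is done by the Go resolver inside the alertmanager container, not by busybox's nslookup, so the two can disagree. Below is a minimal sketch of the same lookup, using the service name from the warning above (my own test snippet, not part of the chart):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Peer name copied from the "unable to join gossip mesh" warning above.
	host := "alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.prometheus-operator.svc"
	addrs, err := net.LookupHost(host)
	if err != nil {
		// Same failure mode as the log line: "no such host".
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("resolved to:", addrs)
}
```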
Alertmanager refreshes the addresses of cluster peers periodically (every 15 seconds); see alertmanager/cluster/cluster.go, lines 416 to 445 at commit 2747a02.
Can you try running with …?
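For context, here is a simplified sketch of what I understand the refresh referenced above to do (a paraphrase for illustration, not the actual cluster.go code): on a ticker, the originally configured peer names are re-resolved and the resulting addresses are handed back to the gossip layer so that peers whose IP changed can be re-joined.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// refreshLoop is a simplified paraphrase (not the real cluster.go code) of the
// periodic refresh: every interval, the configured peer names are re-resolved
// and the resolved addresses are passed to a join callback.
func refreshLoop(ctx context.Context, knownPeers []string, interval time.Duration, join func(addrs []string) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			var resolved []string
			for _, peer := range knownPeers {
				host, port, err := net.SplitHostPort(peer)
				if err != nil {
					continue
				}
				ips, err := net.LookupIP(host)
				if err != nil {
					continue // transient DNS failure; retry on the next tick
				}
				for _, ip := range ips {
					resolved = append(resolved, net.JoinHostPort(ip.String(), port))
				}
			}
			if len(resolved) > 0 {
				_ = join(resolved)
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	// Hypothetical peer name in the style used by the operator-managed pods.
	peers := []string{"alertmanager-prometheus-operator-alertmanager-1.alertmanager-operated.prometheus-operator.svc:9094"}
	refreshLoop(ctx, peers, 15*time.Second, func(addrs []string) error {
		fmt.Println("would (re)join:", addrs)
		return nil
	})
}
```

If the refresh works as sketched, a peer that comes back under a new IP should be picked up shortly after its DNS record is updated.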
@simonpasquier
I opened another issue on prometheus-operator (issue 3289). I will test whether using a FQDN pointing to cluster.local solves the problem, then I will report back here.
This sounds a lot like #2250.
@teke97: I am facing the same problem. Did you fix it?
To the best of my memory, the problem was in my config file and was not related to …
@teke97: What was the issue in the config file?
What did you do?
I configured the alertmanager in the AWS EKS cluster using the prometheus-operator helm chart and 3 replicas.
What did you expect to see?
The alarms were expected to be propagated and synchronized correctly between the pods.
What did you see instead? Under which circumstances?
Alarms are lost between pods when more than one replica is used. The statefulset pods come up in parallel (podManagementPolicy is Parallel), but the start order is not guaranteed. For example, if pod-0 starts last, pod-0 can communicate with pod-1 and pod-2, but not the other way around. The same happens when one pod goes down and another comes up. As a result the pods act independently since they are unable to join each other, the sync is lost, and alerts sent through the API (using the DNS name configured in the Ingress) start to be duplicated. I checked other issues and tested connectivity over both TCP and UDP. After changing the log level to debug, I found that Alertmanager resolves the peer DNS name to an IP and keeps using the IP of a pod that no longer exists instead of the DNS name; since the private IP in EKS is allocated to the pod dynamically, it can no longer see the peer.
PodManagementPolicy is configured as parallel: here
I looked at these issues: #1261 and #1312.
I believe this issue is related to the way Alertmanager resolves peers: it converts them to direct IP addresses via an IP lookup instead of keeping the k8s DNS name such as svc.cluster.local.
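To make that suspicion concrete, here is a hedged illustration (my own sketch, not Alertmanager's actual code) of what resolving peers to direct IPs means: if the peer name is turned into IP:port strings once and only those strings are remembered, the DNS name is gone, so a pod that is rescheduled with a new EKS private IP stays unreachable until the name is resolved again.

```go
package main

import (
	"fmt"
	"net"
)

// resolveOnce turns a peer DNS name into IP:port strings. If only the output
// of this one-shot resolution is remembered, the DNS name is lost: when the
// pod is rescheduled and gets a new private IP, the remembered address points
// at nothing until the name is resolved again.
func resolveOnce(peer string) ([]string, error) {
	host, port, err := net.SplitHostPort(peer)
	if err != nil {
		return nil, err
	}
	ips, err := net.LookupIP(host)
	if err != nil {
		return nil, err
	}
	var addrs []string
	for _, ip := range ips {
		addrs = append(addrs, net.JoinHostPort(ip.String(), port))
	}
	return addrs, nil
}

func main() {
	// Hypothetical per-pod name, in the style of the log earlier in this issue.
	peer := "alertmanager-prometheus-operator-alertmanager-1.alertmanager-operated.prometheus-operator.svc:9094"
	addrs, err := resolveOnce(peer)
	fmt.Println(addrs, err) // only IP:port survives; the DNS name is gone
}
```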
Environment
AWS EKS
System information:
EKS - Kubernetes 1.15 using official docker image from quay.io
Alertmanager version: