Silences are not propagated in a ha/mesh configuration (v0.15.0-rc1) #1312

Closed
gmauleon opened this issue Apr 5, 2018 · 13 comments

@gmauleon

gmauleon commented Apr 5, 2018

What did you do?
Create a 2-replica Alertmanager setup.
Create a silence in Alertmanager from the exposed UI (silenced the default "DeadMansSwitch").

What did you expect to see?
All alertmanagers in the mesh should have the silence set

What did you see instead? Under which circumstances?
Only one of the 2 replicas has the silence set.

Environment
prometheus-operator: v0.18.0
alertmanager: v0.15.0-rc.1
prometheus: v2.2.1

Notes
Reverting to Alertmanager v0.14.0 made silence propagation work properly again.
Sorry in advance if this is already on the radar.

Thanks guys.

@stuartnelson3
Contributor

stuartnelson3 commented Apr 5, 2018

Can you provide any log information? Have you checked the status page in the web UI to confirm that the mesh has been formed?

Here is an example of the cluster status when running the example HA setup, which is successfully gossiping silences:
[screenshot: status page showing the connected cluster peers]
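
In your setup you can check the same thing from the command line; a rough sketch (pod names follow the operator defaults you mentioned, and the exact fields returned by the status endpoint may differ between versions):

kubectl port-forward alertmanager-main-0 9093:9093 -n monitoring &
sleep 2

# the status endpoint backs the web UI's status page; look for the
# cluster/peer information and confirm every replica lists the others
curl -s localhost:9093/api/v1/status | python -m json.tool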

@gmauleon
Author

gmauleon commented Apr 5, 2018

I sure can, here are the logs from one of the Alertmanager pods:

level=info ts=2018-04-05T14:30:58.590730272Z caller=main.go:140 msg="Starting Alertmanager" version="(version=0.15.0-rc.1, branch=HEAD, revision=acb111e812530bec1ac6d908bc14725793e07cf3)"
level=info ts=2018-04-05T14:30:58.590839906Z caller=main.go:141 build_context="(go=go1.10, user=root@f278953f13ef, date=20180323-13:05:10)"
level=warn ts=2018-04-05T14:30:58.973752176Z caller=cluster.go:85 component=cluster err="couldn't deduce an advertise address: failed to parse bind addr ''"
level=warn ts=2018-04-05T14:30:58.99164516Z caller=cluster.go:129 component=cluster msg="failed to join cluster" err="2 errors occurred:\n\n* Failed to resolve alertmanager-main-0.alertmanager-operated.monitoring.svc:6783: lookup alertmanager-main-0.alertmanager-operated.monitoring.svc on 192.168.51.2:53: no such host\n* Failed to resolve alertmanager-main-1.alertmanager-operated.monitoring.svc:6783: lookup alertmanager-main-1.alertmanager-operated.monitoring.svc on 192.168.51.2:53: no such host"
level=info ts=2018-04-05T14:30:58.992091208Z caller=main.go:270 msg="Loading configuration file" file=/etc/alertmanager/config/alertmanager.yaml
level=info ts=2018-04-05T14:30:58.992683838Z caller=cluster.go:249 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2018-04-05T14:30:58.996979028Z caller=main.go:346 msg=Listening address=:9093
level=info ts=2018-04-05T14:31:00.993007787Z caller=cluster.go:274 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000159597s
level=info ts=2018-04-05T14:31:04.993966965Z caller=cluster.go:274 component=cluster msg="gossip not settled" polls=2 before=1 now=2 elapsed=6.001219133s
level=info ts=2018-04-05T14:31:12.994686524Z caller=cluster.go:266 component=cluster msg="gossip settled; proceeding" elapsed=14.001941182s

Here is the status page view from Alertmanager:
[screenshot: Alertmanager status page showing the cluster peers]

And here is the result from the api for each pod after creating a silence:

k port-forward alertmanager-main-0 9093:9093  -n monitoring
curl localhost:9093/api/v1/silences

{"status":"success","data":[{"id":"76857ac4-e656-4446-a15e-3726389bf809","matchers":[{"name":"severity","value":"none","isRegex":false},{"name":"alertname","value":"DeadMansSwitch","isRegex":false}],"startsAt":"2018-04-05T14:36:29.821772974Z","endsAt":"2018-04-05T16:36:25.737Z","updatedAt":"2018-04-05T14:36:29.821795651Z","createdBy":"Gael Mauleon","comment":"Test","status":{"state":"active"}}]}

----

k port-forward alertmanager-main-1 9093:9093  -n monitoring
curl localhost:9093/api/v1/silences

{"status":"success","data":[]}

@stuartnelson3
Contributor

level=warn ts=2018-04-05T14:30:58.973752176Z caller=cluster.go:85 component=cluster err="couldn't deduce an advertise address: failed to parse bind addr ''"

What are you setting as your --cluster.listen-address? Is it set to just the port, :6783? A bind address without a host looks like a case the current code isn't accounting for.

There's also a lookup error, so it could be related to #1307

@gmauleon
Author

gmauleon commented Apr 5, 2018

Hmm, it might be related indeed; I don't have that error with 0.14.0.
Although I mentioned it in my environment section, to be extra clear: I'm using the prometheus-operator, so maybe 0.15 has some changes that are not yet supported by the operator?

Looking at the generated args from the prometheus-operator, they are indeed different:

v0.14.0

  - args:
    - --config.file=/etc/alertmanager/config/alertmanager.yaml
    - --mesh.listen-address=:6783
    - --storage.path=/alertmanager
    - --web.listen-address=:9093
    - --web.external-url=http://my-private-external-adress-here/alertmanager
    - --web.route-prefix=/
    - --mesh.peer=alertmanager-main-0.alertmanager-operated.monitoring.svc
    - --mesh.peer=alertmanager-main-1.alertmanager-operated.monitoring.svc

v0.15.0-rc.1

  - args:
    - --config.file=/etc/alertmanager/config/alertmanager.yaml
    - --cluster.listen-address=:6783
    - --storage.path=/alertmanager
    - --web.listen-address=:9093
    - --web.external-url=http://my-private-external-adress-here/alertmanager
    - --web.route-prefix=/
    - --cluster.peer=alertmanager-main-0.alertmanager-operated.monitoring.svc:6783
    - --cluster.peer=alertmanager-main-1.alertmanager-operated.monitoring.svc:6783

@simonpasquier
Member

@gmauleon your status page shows that the cluster is up and running, so it is weird that the silences aren't propagated. I've tested in my local env (without the Prometheus operator, but a very similar setup with StatefulSets) and I can't reproduce it. Maybe you could share the StatefulSet definition generated by the operator?

@stuartnelson3
Contributor

@brancz @fabxc have you encountered anything like this using the prometheus-operator? Maybe one of you has a chance to take a look; I don't have access to a k8s cluster running it.

@gmauleon based on the status page, it does look like it's connected to a peer ... can you try specifying 0.0.0.0:6783 as the --cluster.listen-address? I'm really not sure what effect that will have, but it would eliminate one error on startup.
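
i.e. the full invocation would look roughly like this (a sketch based on the args you posted above; whether the operator can be made to render it this way is a separate question):

# 0.0.0.0 gives the bind address an explicit host, which should avoid the
# "couldn't deduce an advertise address" warning seen in the logs
alertmanager \
  --config.file=/etc/alertmanager/config/alertmanager.yaml \
  --cluster.listen-address=0.0.0.0:6783 \
  --storage.path=/alertmanager \
  --web.listen-address=:9093 \
  --web.route-prefix=/ \
  --cluster.peer=alertmanager-main-0.alertmanager-operated.monitoring.svc:6783 \
  --cluster.peer=alertmanager-main-1.alertmanager-operated.monitoring.svc:6783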

@jolson490

jolson490 commented Apr 6, 2018

I ran into this same issue last night when running v0.18.0 of prometheus-operator/kube-prometheus (on a K8s cluster in AWS), with my own modified copy of manifests/alertmanager.yaml that changes the Alertmanager version to v0.15.0-rc.0.

But then I switched to v0.15.0-rc.1 and everything worked. So perhaps a change was made in v0.15.0-rc.1 that resolves this issue? Though I do see the initial comment on this issue indicates that rc1 was being used.

(Reference info: the "Support Alertmanager v0.15.0" PR was merged to master on 3/22, so it was included when 0.18.0 was cut on 4/4.)

@gmauleon
Author

gmauleon commented Apr 6, 2018

Sorry guys, I couldn't find the time to test further today (stuart's suggestion). Worst case, I will look into it by Monday evening.

And in my case I was indeed testing with rc1.

@mxinden
Member

mxinden commented Apr 7, 2018

I am able to reproduce this issue with:

  • Minikube: v0.24.1
  • K8s: v1.9.0
  • PO: v0.18.0
  • AM: v0.15.0-rc.1

I will look further into this. Eventually we should add a test AddingSilenceCheckIfPropagated to the Prometheus operator e2e test suite.
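
Roughly the check such a test would perform, done by hand against the v1 API (a sketch; the pod names and the silence payload follow the examples earlier in this thread, so adjust as needed):

kubectl port-forward alertmanager-main-0 9093:9093 -n monitoring &
kubectl port-forward alertmanager-main-1 9094:9093 -n monitoring &
sleep 2

# create a silence on replica 0 (adjust startsAt/endsAt to bracket the current time)
curl -s -XPOST localhost:9093/api/v1/silences -d '{
  "matchers": [{"name": "alertname", "value": "DeadMansSwitch", "isRegex": false}],
  "startsAt": "2018-04-08T00:00:00Z",
  "endsAt": "2018-04-08T02:00:00Z",
  "createdBy": "e2e-test",
  "comment": "silence propagation check"
}'

# give gossip a few seconds, then the same silence should be listed by replica 1
sleep 10
curl -s localhost:9094/api/v1/silences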

@gmauleon Thanks a lot for reporting this.

@jolson490

Another thing that was interesting for me: I didn't encounter this issue at all with minikube, not even when I was using rc0.
The only scenario where I ran into it was using rc0 on AWS.

@mxinden
Member

mxinden commented Apr 8, 2018

@gmauleon prometheus-operator/prometheus-operator#1193 should fix the issue.


I didn't encounter this issue at all with minikube - not even when I was using rc0.

@jolson490 That is very surprising to me. This should never have worked, even on minikube.

@gmauleon
Author

gmauleon commented Apr 8, 2018

Thanks!

@brancz
Member

brancz commented Apr 9, 2018

I will look further into this. Eventually we should add a test AddingSilenceCheckIfPropagated to the Prometheus operator e2e test suite.

@mxinden all for this, let's make it happen.
