store: Start metric and status probe HTTP server as earlier as possible #1656

Merged
merged 10 commits into from
Oct 18, 2019

kakkoyun
Member

@kakkoyun kakkoyun commented Oct 16, 2019

We ran into a couple of issues running Thanos Store with liveness probes on Kubernetes. This PR attempts to fix them by moving the probes earlier in the start-up sequence. The metrics and status probe HTTP server now starts as early as possible and returns success from the /-/healthy endpoint; it sets its status to ready once everything else has been properly bootstrapped.
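
A minimal sketch of the healthy/ready split described above, not the code from this PR: the `Prober` type, its methods, and the `/-/ready` path are assumptions made for illustration (only `/-/healthy` is named in the description). It shows a handler that reports healthy as soon as the server is up and ready only after bootstrapping has finished.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Prober is a hypothetical stand-in for the status prober in this PR.
// Each flag is 0 (not yet) or 1 (set) and is accessed atomically.
type Prober struct {
	healthy int32
	ready   int32
}

// SetHealthy marks the process as alive (liveness probe succeeds).
func (p *Prober) SetHealthy() { atomic.StoreInt32(&p.healthy, 1) }

// SetReady marks the process as able to serve traffic (readiness probe succeeds).
func (p *Prober) SetReady() { atomic.StoreInt32(&p.ready, 1) }

// Handler exposes both probes. The /-/ready path is an assumption.
func (p *Prober) Handler() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/-/healthy", check(&p.healthy))
	mux.HandleFunc("/-/ready", check(&p.ready))
	return mux
}

// check returns 200 once the flag is set and 503 before that.
func check(flag *int32) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		if atomic.LoadInt32(flag) == 1 {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	}
}

// statusOf sends an in-memory GET request and returns the status code.
func statusOf(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	p := &Prober{}
	h := p.Handler()

	// The probe server is up: liveness succeeds, readiness does not yet.
	p.SetHealthy()
	fmt.Println(statusOf(h, "/-/healthy"), statusOf(h, "/-/ready")) // 200 503

	// ... slow bootstrap (for example the initial bucket sync) runs here ...
	p.SetReady()
	fmt.Println(statusOf(h, "/-/ready")) // 200
}
```

With this split, a Kubernetes liveness probe can target the healthy endpoint during a long initial sync without the pod being restarted, while traffic is held back until the ready endpoint succeeds.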

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

A good read on the subject: https://srcco.de/posts/kubernetes-liveness-probes-are-dangerous.html

Changes

Verification

  • make local-test
  • MINIO_ENABLED=1 ./scripts/quickstart.sh with debug logging enabled, observing the start sequence.

Thanos Store start sequence logs:

level=info name=store ts=2019-10-17T12:52:19.814229Z caller=main.go:170 msg="Tracing will be disabled"
level=info name=store ts=2019-10-17T12:52:19.814515Z caller=factory.go:39 msg="loading bucket configuration"
level=info name=store ts=2019-10-17T12:52:19.814916Z caller=cache.go:172 msg="created index cache" maxItemSizeBytes=131072000 maxSizeBytes=262144000 maxItems=math.MaxInt64
level=info name=store ts=2019-10-17T12:52:19.815104Z caller=main.go:257 msg="disabled TLS, key and cert must be set to enable"
level=info name=store ts=2019-10-17T12:52:19.815301Z caller=store.go:252 msg="starting store node"
level=info name=store ts=2019-10-17T12:52:19.815336Z caller=store.go:198 msg="initializing bucket store"
level=info name=store ts=2019-10-17T12:52:19.815359Z caller=main.go:353 msg="listening for requests and metrics" component=store address=0.0.0.0:10906
level=info name=store ts=2019-10-17T12:52:19.815421Z caller=prober.go:143 msg="changing probe status" status=healthy
level=info name=store ts=2019-10-17T12:52:19.818189Z caller=store.go:203 msg="bucket store ready" init_duration=3.131623ms
level=info name=store ts=2019-10-17T12:52:19.818247Z caller=store.go:245 msg="listening for StoreAPI gRPC" address=0.0.0.0:10905
level=info name=store ts=2019-10-17T12:52:19.818287Z caller=prober.go:114 msg="changing probe status" status=ready

@kakkoyun kakkoyun force-pushed the liveness_probes branch 2 times, most recently from 2611d22 to 816c2fc Compare October 16, 2019 12:46
@squat
Member

squat commented Oct 16, 2019

Looking at the code changes, it seems to me that the only thing that has changed is the lexical ordering of the function calls, and that the server is still started at exactly the same time. Note that the HTTP servers are assigned to the run group and only actually start serving requests when the run group itself is started.

@kakkoyun
Member Author

@squat Thanks for pointing that out. I somehow missed it. I'll have another hard look at it.

Member

@bwplotka bwplotka left a comment

Suggestions (:

cmd/thanos/downsample.go (outdated, resolved)
cmd/thanos/store.go (resolved)
@kakkoyun
Member Author

@squat @bwplotka I have updated scheduleHTTPServer: it now spawns a goroutine independent of the run group to start the server as soon as it's scheduled, and then syncs it with the run group.

cmd/thanos/downsample.go (outdated, resolved)
cmd/thanos/main.go (outdated, resolved)
@bwplotka
Member

Wonder if we can also add more logs to the startup, since you are touching this part, as per #1655.

@kakkoyun
Member Author

@bwplotka Of course I can. I'll check the related conversation.

Member

@bwplotka bwplotka left a comment

I don't really understand why we move all of these scheduleHTTPServer calls around, but I'm happy with that if there is a reason (:

2 comments, otherwise LGTM! Thanks.

I think we really need to advertise now that users should NOT point a readiness probe at the metrics endpoint. Users do that, and now they will get the false impression that the store is ready while it is actually still loading the blocks...

Wonder how to communicate that ):

I will add more logging to initial sync in a separate PR (:

cmd/thanos/main.go (outdated, resolved)
cmd/thanos/store.go (resolved)
cmd/thanos/downsample.go (outdated, resolved)
cmd/thanos/query.go (outdated, resolved)
cmd/thanos/receive.go (outdated, resolved)
cmd/thanos/rule.go (outdated, resolved)
cmd/thanos/sidecar.go (outdated, resolved)
@kakkoyun kakkoyun changed the title .*: Start metric and status probe HTTP server as earlier as possible store: Start metric and status probe HTTP server as earlier as possible Oct 17, 2019
@kakkoyun
Member Author

@bwplotka @squat I removed unnecessary changes, fixed emphasized issues, updated CHANGELOG and store documentation.

Documentation definitely needs some help, I'm open to suggestions.

I will add more logging to initial sync in a separate PR (:

I couldn't get to this one yet. It would probably be better if you took it anyway, since you have more context on the component. Thanks a lot.

Member

@bwplotka bwplotka left a comment

2 more nits and LGTM!

cmd/thanos/store.go (outdated, resolved)
docs/components/store.md (resolved)
Member

@FUSAKLA FUSAKLA left a comment

👍 great, thanks for doing this. I remember discussing the speedup of HTTP server startup with store but it eventually did not happen. Not sure why exactly.

Agree with @bwplotka on those two comments, otherwise LGTM!

CHANGELOG.md (outdated, resolved)
kakkoyun and others added 5 commits October 18, 2019 15:58
Signed-off-by: Kemal Akkoyun <[email protected]>
Co-Authored-By: Martin Chodur <[email protected]>
Signed-off-by: Kemal Akkoyun <[email protected]>
Member

@squat squat left a comment

🌮

Member

@bwplotka bwplotka left a comment

Nice! Finally we have metrics on Thanos startup!

Let's make sure we have proper logging and metrics on sync - happy to attack this on Monday.

@bwplotka
Member

Good work @kakkoyun thanks! ❤️

@bwplotka bwplotka merged commit 19b9b89 into thanos-io:master Oct 18, 2019
@kakkoyun kakkoyun deleted the liveness_probes branch October 19, 2019 07:59
GiedriusS pushed a commit that referenced this pull request Oct 28, 2019
…le (#1656)

* Start metric and status probe server as soon as possible

Signed-off-by: Kemal Akkoyun <[email protected]>

* Update changelog

Signed-off-by: Kemal Akkoyun <[email protected]>

* Schedule a separate goroutine to start server

Signed-off-by: Kemal Akkoyun <[email protected]>

* Add InitSync to the rungroup

Signed-off-by: Kemal Akkoyun <[email protected]>

* Fix linter pointed issues

Signed-off-by: Kemal Akkoyun <[email protected]>

* Move InitSync to alreay existed run.Group

Signed-off-by: Kemal Akkoyun <[email protected]>

* Remove unnecessary changes and update CHANGELOG

Signed-off-by: Kemal Akkoyun <[email protected]>

* Add simple explanation for probes

Signed-off-by: Kemal Akkoyun <[email protected]>

* Make requested changes

Signed-off-by: Kemal Akkoyun <[email protected]>

* Update CHANGELOG.md

Co-Authored-By: Martin Chodur <[email protected]>
Signed-off-by: Kemal Akkoyun <[email protected]>
Signed-off-by: Giedrius Statkevičius <[email protected]>
Development

Successfully merging this pull request may close these issues.

store gateway: Start metric server from the very start of the proces.
4 participants