
use exec probe for queue proxy readiness check #4148

Merged (6 commits, Jun 25, 2019)

Conversation

@joshrider (Contributor) commented May 22, 2019

Fixes #3308

Proposed Changes

  • Replaces the HTTPGet ReadinessProbe with an ExecAction. The ExecAction recreates the HTTPGet semantics inside the queue-proxy container and hits the same health-check HTTP path as before. This approach was taken because the queue-proxy process maintains state on whether it is active.

Also worth noting that this makes us dependent on how ko builds images. If the binary ever stops showing up at /ko-app/queue, this will break.
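
As a reader's aid, here is a minimal sketch of the kind of readiness probe this change installs on the queue-proxy container, assuming the 2019-era corev1 API (Handler was later renamed ProbeHandler); the binary path and -probe flag mirror the diff quoted later in this thread, but the surrounding code is illustrative rather than the PR's exact implementation.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// execReadinessProbe sketches the ExecAction-based readiness probe: instead
// of the kubelet issuing an HTTP GET, it re-enters the ko-built queue binary,
// which performs the same health check against its own admin endpoint.
func execReadinessProbe() *corev1.Probe {
	return &corev1.Probe{
		Handler: corev1.Handler{
			Exec: &corev1.ExecAction{
				// Depends on ko placing the binary at /ko-app/queue.
				Command: []string{"/ko-app/queue", "-probe", "true"},
			},
		},
	}
}

func main() {
	fmt.Printf("%+v\n", execReadinessProbe())
}
```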

Release Note

NONE

@googlebot added the cla: yes label (indicates the PR's author has signed the CLA) on May 22, 2019
@knative-prow-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) on May 22, 2019
@knative-prow-robot (Contributor) left a comment:

@joshrider: 0 warnings.

In response to this: (quoting the PR description above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow-robot added the area/API and area/networking labels on May 22, 2019
@markusthoemmes (Contributor):

/assign

@markusthoemmes (Contributor) left a comment:

Thanks for doing this! 🎉 🎉

A few nits throughout.

cmd/queue/main.go: 1 resolved and 3 outdated review threads
@markusthoemmes (Contributor) left a comment:

A few more in-depth comments on the retry logic at play.

We currently retry in at least two places:

  1. In the newly added logic to retry on /health
  2. In /health itself until the user-container is up

The timeouts of both need to be aligned for each mechanism to work correctly. With the proposed changes, though, it feels like we could strip the retries from the /health route itself and send a single TCP probe per call to /health (if not already healthy). That way the only retry would be in the proposed logic.

Thoughts?


var lastErr error

timeoutErr := wait.PollImmediate(100*time.Millisecond, time.Second, func() (bool, error) {
Contributor:

Should we give this some more leeway? As it stands, this will fail if the queue-proxy doesn't actually come up within a second. It feels to me as if we could have a much tighter retry interval, because the actual request will block again as soon as the queue-proxy comes up, until the user-container has been successfully probed.

As such, each retry is pretty cheap and we can likely drop this to 50ms or even lower. The upper bound, however, can I think be bumped towards 5 or even 10 seconds to make sure we practically never retry on kube's periodSeconds interval.

Contributor:

+1 - if we long poll here we can make this initial retry loop very tight.

Can we also pull this loop timeout out into a constant?

Contributor (author):

Turned this up to a 10 second timeout with a 25 millisecond interval.

We should totally take advantage of the full 10 second timeout on the ExecProbe. That was an oversight in the initial implementation.

The 25ms retry interval is a totally made up number. 🤷‍♂️
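
A sketch of the retry loop being described, assuming wait.PollImmediate with the 25ms interval and 10s timeout mentioned above; the health URL, port, and error wrapping are illustrative, not copied from the PR.

```go
package main

import (
	"fmt"
	"net/http"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

const (
	retryInterval = 25 * time.Millisecond // best-guess interval discussed above
	probeTimeout  = 10 * time.Second      // matches the ExecProbe's timeout window
)

// probeQueueHealth polls the queue-proxy's health endpoint until it returns
// 200 OK or the overall timeout expires. Connection errors while the
// queue-proxy is still starting are swallowed and retried.
func probeQueueHealth(url string) error {
	var lastErr error
	timeoutErr := wait.PollImmediate(retryInterval, probeTimeout, func() (bool, error) {
		res, err := http.Get(url)
		if err != nil {
			lastErr = err
			return false, nil // not listening yet; retry
		}
		defer res.Body.Close()
		return res.StatusCode == http.StatusOK, nil
	})
	if timeoutErr != nil {
		return fmt.Errorf("probe failed: %v (last error: %v)", timeoutErr, lastErr)
	}
	return nil
}

func main() {
	// Port and path are placeholders for the queue-proxy's admin health endpoint.
	if err := probeQueueHealth("http://127.0.0.1:8022/health"); err != nil {
		fmt.Println(err)
	}
}
```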

Member:

25ms (40 times/sec) seems like it might be too aggressive. Help me convince myself that's not true.

Up to the provided frequency, we will:

  1. Open a TCP socket
  2. Send an HTTP request
  3. Get an HTTP response
  4. Check that the response code is 200

But, wait.PollImmediate will ignore timer ticks while the ConditionFunc is being evaluated, so if it takes (for example) 33ms to do 1-4, you will end up polling every 50ms. Additionally, this will only happen until we get a 200, so in the normal case, we don't care about the retry window.

The worst case for this would be something that actually spends a bunch of time working to get a good health check answer, and these requests end up clogging up the CPU, while blocking actual initialization from taking place (which would end the health check pounding).

Might be worth documenting next to the 25ms that the number was made up as a balance between "don't make the user wait" and "don't pummel the container too hard".

Contributor (author):

Added a comment explaining the "best guess" nature of the retry interval. Seems like something to keep an eye on and adjust in future.

@greghaynes (Contributor) commented May 30, 2019:

Adding some detail for the sake of clarity:

In the queue-proxy probing we do here, once a connection can be established we hold it open for up to 10s. This means there are two cases for the work we will do here, a bit different than mentioned:

Until the queue-proxy is running we will:

  1. open a TCP socket
  2. get a RST (since the socket is not yet listening) and retry

This is the loop we want to optimize to retry quickly, as it is very lightweight. Once the queue-proxy is up we do essentially as you describe (send an HTTP request, get a response), but waiting on the response will block for up to 10s until the user container is ready, so our poll frequency here will be mostly irrelevant.
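
To make the two phases concrete, a small sketch (with an illustrative address and dial timeout) of why the pre-listen phase is so cheap: a bare TCP dial fails immediately with connection refused while nothing is listening, so a tight retry interval costs very little.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// socketUp reports whether anything is accepting connections on addr. While
// the queue-proxy has not yet bound its port, the dial fails fast, which is
// the lightweight loop described above; once it is listening, the subsequent
// HTTP probe long-polls instead.
func socketUp(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, 100*time.Millisecond)
	if err != nil {
		return false // connection refused: retry quickly
	}
	conn.Close()
	return true
}

func main() {
	fmt.Println(socketUp("127.0.0.1:8022")) // placeholder address
}
```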

@@ -26,6 +26,8 @@ import (
"os"
"time"

"github.com/pkg/errors"
Contributor:

Do you mind collapsing the imports here? (we should have a bot for this... @mattmoor another socketpuppet?)

Contributor (author):

done!

// Do not use the cached connection
DisableKeepAlives: true,
},
Timeout: 500 * time.Millisecond,
Contributor:

Is this an end-to-end request timeout? If so, I think this is too short and doesn't make use of the queue-proxy's mechanism of keeping the connection on /health open for 10 seconds currently.

It feels like we might want these retries in only one place. Maybe we should strip them out of the prober itself and only have this health binary retry tightly?

Contributor:

Practically I don't think it matters much either way, so feel free to ignore, but my preference would be to long poll here as well. Better response, less network/CPU spamming, etc.

Regardless, can we pull this timeout (or retry interval, if we go that route) out into a constant?

Contributor (author):

Put this up to 10 seconds. Makes sense to use the full 10 seconds on /health.

I don't understand all of the moving parts, but it seems like we may want to keep the retry logic in the prober. The function with the prober (probeUserContainer) is also used by a handler on the queueServingPort when a request has a K-Network-Probe header. Do we care about retries there?
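
A sketch of the client-side shape being agreed on here, assuming a shared 10-second constant: keep-alives disabled so each probe observes the listener's current state, and an end-to-end timeout wide enough to let /health hold the request open. The constant name mirrors the diff context in this thread; the rest is illustrative.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// queueProbeTimeout is set equal to the queue-proxy's ExecProbe timeout so
// the client-side request can use the full window.
const queueProbeTimeout = 10 * time.Second

// newProbeClient builds the HTTP client used by the probe: no cached
// connections, and a request timeout matching the ExecProbe window.
func newProbeClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			// Do not reuse a cached connection; each probe should open a
			// fresh one so it reflects the current state of the queue-proxy.
			DisableKeepAlives: true,
		},
		Timeout: queueProbeTimeout,
	}
}

func main() {
	fmt.Printf("probe client timeout: %v\n", newProbeClient().Timeout)
}
```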

@greghaynes (Contributor) left a comment:

Thanks a ton for doing this!

cmd/queue/main.go: resolved review thread

@markusthoemmes (Contributor) left a comment:

One more comment from my end.

Have we verified via @greghaynes's tracing that this actually has an effect and is worth the added complexity/is behaving as intended?

@@ -61,6 +60,10 @@ const (
// in the mesh.
quitSleepDuration = 20 * time.Second

// Set equal to the queue-proxy's ExecProbe timeout to take
// advantage of the full window
queueProbeTimeout = 10 * time.Second
Contributor:

Please use this in the actual probeUserContainer as well to make sure all of these values are properly aligned.

Contributor (author):

I wasn't sure whether it was suitable to use it there since probeUserContainer was also used in another handler and I didn't feel confident totally tying the concepts together.

How is the K-Network-Probe header used by the system with respect to the queue-proxy?

If we don't reuse the same probe in the /health check and the K-Network-Probe, we can leave the network probe's functionality as-is and then potentially remove the retry/timeout logic from the /health "server-side" TCP probe. Having all of the readinessProbe retry interval/timeout logic live in a single place on the -probe HTTP "client" side of things would make it easier to read/reason about. There would be a little bit of duplicated code and each retried TCP probe would then also have the additional overhead of an HTTP roundtrip. Thoughts?

If not, should I just make a duplicate probeUserContainer and use the constant there?

And I'd definitely appreciate someone else booting it up and taking it for a spin. 😁

Contributor:

Not sure I'm following. The inner TCP loop currently has exactly this 10s timeout as well and those timeouts are meant to align, I think?

Contributor (author):

Totally agree that they should align, and I guess that's a great reason for me to stop complaining and make the change. My reluctance came from the fact that probeUserContainer is also used by a handler for what appears to be another purpose, though I guess that can be worried about when it becomes an issue. I will make the swap now. 👍
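
For what that swap might look like, a hedged sketch that reuses the same timeout constant inside the queue-proxy's TCP retry loop against the user container; the function name, interval, and address are illustrative, not the PR's exact code.

```go
package main

import (
	"fmt"
	"net"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// queueProbeTimeout is shared with the exec-probe side so both retry layers
// stay aligned on the same 10-second window.
const queueProbeTimeout = 10 * time.Second

// probeUserContainerSketch keeps dialing the user container until it accepts
// a TCP connection or the shared timeout expires.
func probeUserContainerSketch(userAddr string) bool {
	err := wait.PollImmediate(50*time.Millisecond, queueProbeTimeout, func() (bool, error) {
		conn, err := net.Dial("tcp", userAddr)
		if err != nil {
			return false, nil // user container not listening yet; keep trying
		}
		conn.Close()
		return true, nil
	})
	return err == nil
}

func main() {
	fmt.Println(probeUserContainerSketch("127.0.0.1:8080")) // placeholder address
}
```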

Contributor (author):

@markusthoemmes @greghaynes
Is the above-mentioned tracing tool available somewhere that I can run myself?

Contributor:

@joshrider #3491 will give you the most detail if you're able to rebase on it. Otherwise, using just master, if you enable tracing in config/config-tracing.yaml you should see a more consistent / smaller probe span in Zipkin.

Replaces HTTPGet ReadinessProbe with ExecAction. The ExecAction
recreates the HTTPGet semantics within the queue proxy container and hits
the same previous health check HTTP path. This approach was taken
since the queue proxy process maintains state on whether it's active.
@joshrider (Contributor, author):

/assign @tcnghia

@greghaynes (Contributor):

/lgtm

@knative-prow-robot added the lgtm label (indicates that a PR is ready to be merged) on Jun 18, 2019
@tcnghia (Contributor) commented Jun 18, 2019:

/approve
/lgtm

@knative-prow-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: joshrider, tcnghia

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Jun 18, 2019
@tcnghia (Contributor) commented Jun 18, 2019:

/test pull-knative-serving-integration-tests

Port: intstr.FromInt(networking.QueueAdminPort),
Path: queue.RequestQueueHealthPath,
Exec: &corev1.ExecAction{
Command: []string{"/ko-app/queue", "-probe", "true"},
Member:

Can you confirm that it's our ultimate goal to:

  1. Pass the user HTTP/TCP probe through for the queue to evaluate
  2. Default the probe to a TCP probe

This seems like a reasonable first step, but I wanted to confirm we have a shared understanding of the end state.

Contributor (author):

That is the plan for #4014, but I don't know that this is necessary for that.

The hope was for this to address #3308 (with the possibility of maybe using it in #4014).

@@ -64,6 +63,10 @@ const (
// in the mesh.
quitSleepDuration = 20 * time.Second

// Set equal to the queue-proxy's ExecProbe timeout to take
// advantage of the full window
probeTimeout = 10 * time.Second
Member:

Should we make this a flag so it can be kept consistent with TimeoutSeconds in the probe configuration? Maybe we shouldn't do this until we pass the entire probe descriptor through?

Contributor (author):

Yeah, the idea was that this was separate from the issue of passing in the full probe to have something user-defined applied via the queue-proxy. There are a bunch of things in this file that will change once we start passing all of the configuration in.
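
For completeness, a sketch of the flag-based alternative floated above, using the standard flag package; the flag name and default are hypothetical and not part of this PR.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// probeTimeout is a hypothetical flag that would let a future pass-through of
// the probe's TimeoutSeconds keep the queue binary's internal timeout in sync.
var probeTimeout = flag.Duration("probe-timeout", 10*time.Second,
	"How long the readiness probe waits for the queue-proxy and user container to become ready.")

func main() {
	flag.Parse()
	fmt.Printf("probe timeout: %v\n", *probeTimeout)
}
```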

@greghaynes (Contributor) commented Jun 19, 2019:

I just fixed my local setup so I'm getting better tracing results now, and I was wrong earlier to say this looks good. Here's a picture of what I am seeing with this patch + the pod-tracing patch (using https://github.com/joshrider/serving/tree/execprobe-trace):

[trace screenshot: exec-probe-1]

Note the container_startup spans: the bottom one is the user application, which goes ready as soon as it runs (due to no readiness probe), and the top one is the queue-proxy. There's about a 0.6s delay between them (it really shouldn't take that long for the queue-proxy to start up).

Here's one from master where I got lucky and the kubelet didn't send the networking probe until after the queue-proxy had the socket open. Note how quickly after the user app goes ready the queue-proxy does as well. I'd think this is what a fix should look like:

[trace screenshot: master-3]

I don't grok why this is happening though; maybe there's some other delay occurring before the exec probe starts running?

@tcnghia (Contributor) commented Jun 19, 2019:

@greghaynes our queue-proxy also performs a TCP probe on the user-container, so maybe the 0.6 seconds is the time it takes the user-container web server to finish starting up (and listening)?

@greghaynes (Contributor):

@tcnghia It could be coming from between the queue-proxy and the user-container, but I am surprised it would take nearly that long; in this case I'm testing the helloworld-go sample app. The second set of traces is, I think, a good baseline for what startup cost should be (~100ms).

@markusthoemmes (Contributor):

/hold

Until we figure out whether this is behaving as designed. Feel free to remove the hold if you are all fine with this and I misinterpreted the last comments.

@knative-prow-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Jun 19, 2019
@joshrider (Contributor, author):

@greghaynes my tests with the pod-trace branch also looked a lot like your top graph. I'm going to play around a bit more with the tracing (and whatever other debug strategies I can cook up) to try to get a better idea of what is actually going on inside.

@markusthoemmes good call 👍

@mattmoor (Member):

tl;dr Given the sad state of our CI, I agree with holding on this (and probably a lot of the rest of 0.7 items) until 0.8.

A few things come to mind:

  1. 600ms is faster than the 1s floor we get through kubelet (Something something bird in the hand),
  2. I'd like to start baking this early in 0.8,
  3. This is just the first piece of a bigger change to how we probe (I'd like to get all of it baking),
  4. Getting it in enables us to parallelize better.
  5. (bonus!) The traces above continue to be SUPER useful, if there are things we can do to incorporate them into our testing infrastructure, then shut up and take my money. :)

@greghaynes @markusthoemmes @tcnghia WDYT?

@mattmoor (Member):

/retest

@markusthoemmes (Contributor):

@mattmoor I agree, if it works and doesn't cause a regression, ship it!

@mattmoor (Member):

Ok, I plan to /hold cancel after the cut Tuesday.

@mattmoor (Member):

/hold cancel

@knative-prow-robot removed the do-not-merge/hold label on Jun 25, 2019
@knative-prow-robot merged commit 91c4e8e into knative:master on Jun 25, 2019
@joshrider deleted the execprobe branch on June 25, 2019 15:36
hohaichi pushed a commit to hohaichi/serving that referenced this pull request Jun 25, 2019
* use exec probe for queue proxy readiness check

Replaces HTTPGet ReadinessProbe with ExecAction. The ExecAction
recreates the HTTPGet semantics within the queue proxy container and hits
the same previous health check HTTP path. This approach was taken
since the queue proxy process maintains state on whether it's active.

* use descriptive name for error, pass port as int to probeQueueHealthPath()

* pull probe timeout into constant, tighten up retry interval, clean up imports

* add comment explaining queue probe retry interval

* remove out of date comment

* rename queueProbeTimeout, use constant in probeUserContainer
Labels: approved, area/API, area/networking, cla: yes, lgtm, size/L
Projects: None yet

Development

Successfully merging this pull request may close these issues:

  • Good gRPC deployment pods frequently fail at least one health check

8 participants