use exec probe for queue proxy readiness check #4148
Conversation
@joshrider: 0 warnings.
In response to this:
Fixes #3308
Proposed Changes
- Replaces the HTTPGet ReadinessProbe with an ExecAction. The ExecAction recreates the HTTPGet semantics inside the queue-proxy container and hits the same health-check HTTP path as before. This approach was taken because the queue-proxy process maintains state on whether it's active (a rough sketch of the mechanism follows below).
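For illustration, here is a minimal sketch of what that exec-probe mode could look like. The port value, health path, and helper names below are assumptions for the sketch, not the PR's actual code; the point is only that the binary, when run with -probe, GETs its own health endpoint and reports readiness through its exit code.

```go
package main

import (
	"flag"
	"fmt"
	"log"
	"net/http"
	"os"
	"time"
)

const (
	healthPath = "/health" // illustrative; the real path comes from the queue package
	adminPort  = 8022      // illustrative; the real port comes from the networking package
)

var isProbe = flag.Bool("probe", false, "run as a readiness probe and exit")

// probeOnce issues a single GET against the queue-proxy's own health endpoint.
func probeOnce() error {
	client := &http.Client{Timeout: time.Second}
	resp, err := client.Get(fmt.Sprintf("http://127.0.0.1:%d%s", adminPort, healthPath))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("probe returned status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	flag.Parse()
	if *isProbe {
		if err := probeOnce(); err != nil {
			log.Fatal(err) // non-zero exit marks the ExecAction probe as failed
		}
		os.Exit(0)
	}
	// ... normal queue-proxy startup continues here ...
}
```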
Release Note
NONE
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign
Thanks for doing this! 🎉 🎉
A few nits throughout.
A few more in-depth comments on the retry logic at play.
We currently retry in at least two places:
- In the newly added logic to retry on /health
- In /health itself until the user-container is up
The timeouts of both need to be aligned for each respective mechanism to work correctly. With the proposed changes, though, it feels like we could maybe strip the retries from the /health route itself and send a single TCP probe per call to /health (if not already healthy). That way the only retry would be in the proposed logic.
Thoughts?
cmd/queue/main.go (Outdated)

var lastErr error
timeoutErr := wait.PollImmediate(100*time.Millisecond, time.Second, func() (bool, error) {
Should we give this some more leeway? As it stands, this will fail if the queue-proxy doesn't actually come up within a second. It feels to me as if we could have a much tighter retry interval, because the actual request will block again as soon as the queue-proxy comes up, until the user-container has been successfully probed.
As such, each retry is pretty cheap and we can likely drop this to 50ms or even lower. However, I think the upper bound can be bumped towards 5 or even 10 seconds to make sure we practically never retry on kube's periodSeconds interval.
+1 - if we long-poll here we can make this initial retry loop very tight.
Can we also pull this loop timeout out into a constant?
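Something along these lines is presumably what's being asked for, with the interval and timeout pulled into named constants. This is a sketch under assumed names: `probeQueueHealthPath` stands in for whatever single-shot probe helper the PR uses, and the values mirror the 25ms/10s numbers discussed below.

```go
import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

const (
	// Retry very aggressively while the queue-proxy socket is still coming up;
	// each failed attempt is just a refused connection, so it is cheap.
	retryInterval = 25 * time.Millisecond
	// Match the ExecProbe's timeout so we use the whole window before the
	// kubelet falls back to its periodSeconds cadence.
	queueProbeTimeout = 10 * time.Second
)

// waitForQueueHealthy polls the health path until it answers 200 or the
// overall probe window expires.
func waitForQueueHealthy(port int) error {
	var lastErr error
	timeoutErr := wait.PollImmediate(retryInterval, queueProbeTimeout, func() (bool, error) {
		lastErr = probeQueueHealthPath(port) // hypothetical single-shot GET of /health
		return lastErr == nil, nil
	})
	if timeoutErr == nil {
		return nil
	}
	if lastErr != nil {
		return lastErr
	}
	return timeoutErr
}
```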
Turned this up to a 10 second timeout with a 25 millisecond interval.
We should totally take advantage of the full 10 second timeout on the ExecProbe. That was an oversight in the initial implementation.
The 25ms retry interval is a totally made up number. 🤷‍♂️
25ms (40 times/sec) seems like it might be too aggressive. Help me convince myself that's not true.
Up to the provided frequency, we will:
- Open a TCP socket
- Send an HTTP request
- Get an HTTP response
- Check that the response code is 200
But wait.PollImmediate will ignore timer ticks while the ConditionFunc is being evaluated, so if it takes (for example) 33ms to do 1-4, you will end up polling every 50ms. Additionally, this will only happen until we get a 200, so in the normal case, we don't care about the retry window.
The worst case for this would be something that actually spends a bunch of time working to get a good health check answer, and these requests end up clogging up the CPU, while blocking actual initialization from taking place (which would end the health check pounding).
Might be worth documenting next to the 25ms that the number was made up as a balance between "don't make the user wait" and "don't pummel the container too hard".
Added a comment explaining the "best guess" nature of the retry interval. Seems like something to keep an eye on and adjust in future.
Adding some detail for the sake of clarity:
The queue-proxy endpoint we are probing here will, once a connection can be established, hold it open for up to 10s. This means there are two cases for the work we do here, which differ a bit from what was mentioned:
Until the queue-proxy is running we will:
- open a TCP socket
- get a RST (since nothing is listening yet) and retry
This is the loop we want to optimize to retry quickly, as it is very lightweight. Once the queue-proxy is up we do essentially as you describe (send an HTTP request, get a response), but waiting on the response will block for up to 10s until the user container is ready, so our poll frequency there is mostly irrelevant.
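For readers following along, a very rough sketch of that "hold the request open" behavior on the /health side, i.e. how the handler could keep the request pending while it TCP-probes the user container. All names here (`healthHandler`, `userContainerAddr`) are assumptions, and it reuses the `queueProbeTimeout` constant and `wait` import from the earlier sketch; the actual queue-proxy handler may differ.

```go
// healthHandler holds the request open for up to queueProbeTimeout, repeatedly
// trying to open a TCP connection to the user container, and only answers 200
// once a connection succeeds.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	err := wait.PollImmediate(50*time.Millisecond, queueProbeTimeout, func() (bool, error) {
		conn, err := net.DialTimeout("tcp", userContainerAddr, 100*time.Millisecond)
		if err != nil {
			return false, nil // user container not listening yet; keep waiting
		}
		conn.Close()
		return true, nil
	})
	if err != nil {
		http.Error(w, "user container not ready", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}
```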
cmd/queue/main.go (Outdated)

@@ -26,6 +26,8 @@ import (
	"os"
	"time"

	"github.com/pkg/errors"
Do you mind collapsing the imports here? (we should have a bot for this... @mattmoor another socketpuppet?)
done!
cmd/queue/main.go (Outdated)

// Do not use the cached connection
DisableKeepAlives: true,
},
Timeout: 500 * time.Millisecond,
Is this an end-to-end request timeout? If so, I think this is too short and doesn't make use of the queue-proxy's current mechanism of keeping the connection on /health open for 10 seconds.
It feels like we might want these retries in only one place. Maybe we should strip them out of the prober itself and only have this health binary retry tightly?
Practically, I don't think it matters much either way, so feel free to ignore, but my preference would be to long-poll here as well. Better response, less network/CPU spamming, etc.
Regardless, can we pull this timeout (or retry interval, if we go that route) out into a constant?
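Concretely, that could look something like the following sketch, where the constant name is an assumption carried over from the earlier snippet rather than the PR's actual identifier.

```go
// A fresh connection per attempt, with the client timeout sized to the full
// ExecProbe window so a single request can long-poll /health instead of
// giving up after 500ms.
httpClient := &http.Client{
	Transport: &http.Transport{
		// Do not use the cached connection.
		DisableKeepAlives: true,
	},
	Timeout: queueProbeTimeout, // e.g. 10 * time.Second, kept in one named constant
}
```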
Put this up to 10 seconds. Makes sense to use the full 10 seconds on /health.
I don't understand all of the moving parts, but it seems like we may want to keep the retry logic in the prober. The function with the prober (probeUserContainer) is also used by a handler on the queueServingPort when a request has a K-Network-Probe header. Do we care about retries there?
Thanks a ton for doing this!
One more comment from my end.
Have we verified via @greghaynes's tracing that this actually has an effect and is worth the added complexity/is behaving as intended?
cmd/queue/main.go (Outdated)

@@ -61,6 +60,10 @@ const (
	// in the mesh.
	quitSleepDuration = 20 * time.Second

	// Set equal to the queue-proxy's ExecProbe timeout to take
	// advantage of the full window
	queueProbeTimeout = 10 * time.Second
Please use this in the actual probeUserContainer as well to make sure all of these values are properly aligned.
I wasn't sure whether it was suitable to use it there since probeUserContainer was also used in another handler and I didn't feel confident totally tying the concepts together.
How is the K-Network-Probe header used by the system with respect to the queue-proxy?
If we don't reuse the same probe in the /health check and the K-Network-Probe, we can leave the network probe's functionality as-is and then potentially remove the retry/timeout logic from the /health "server-side" TCP probe. Having all of the readinessProbe retry interval/timeout logic live in a single place on the -probe HTTP "client" side of things would make it easier to read/reason about. There would be a little bit of duplicated code, and each retried TCP probe would then also have the additional overhead of an HTTP roundtrip. Thoughts?
If not, should I just make a duplicate probeUserContainer and use the constant there?
And I'd definitely appreciate someone else booting it up and taking it for a spin. 😁
Not sure I'm following. The inner TCP loop currently has exactly this 10s timeout as well and those timeouts are meant to align, I think?
Totally agree that they should align and I guess that's a great reason for me to stop complaining and make the change. My reluctance came from the fact that probeUserContainer is also used by handler for what appears to be another purpose, though I guess that can be worried about when it becomes an issue. I will make the swap now. 👍
@markusthoemmes @greghaynes
Is the above-mentioned tracing tool available somewhere that I can run myself?
@joshrider #3491 will give you the most detail if you're able to rebase on it. Otherwise, on master, if you enable tracing in config/config-tracing.yaml you should see a more consistent / smaller probe span in zipkin.
Replaces HTTPGet ReadinessProbe with ExecAction. The ExecAction recreates the HTTPGet semantics within the queue proxy container and hits the same previous health check HTTP path. This approach was taken since the queue proxy process maintains state on whether it's active.
/assign @tcnghia
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: joshrider, tcnghia. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/test pull-knative-serving-integration-tests
Port: intstr.FromInt(networking.QueueAdminPort),
Path: queue.RequestQueueHealthPath,
Exec: &corev1.ExecAction{
	Command: []string{"/ko-app/queue", "-probe", "true"},
Can you confirm that it's our ultimate goal to:
- Pass the user HTTP/TCP probe through for the queue to evaluate
- Default the probe to a TCP probe
This seems like a reasonable first step, but I wanted to confirm we have a shared understanding of the end state.
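For context, the spec-level change under discussion is roughly the following sketch. Everything other than the command shown in the diff above is an assumption (including the timeout value and the Handler field name, per the core/v1 API of that era), not a verbatim copy of the PR.

```go
import corev1 "k8s.io/api/core/v1"

// Before: an HTTPGet probe the kubelet pointed at the queue admin port.
// After: an ExecAction that runs the queue binary in probe mode, moving the
// retry/long-poll behavior inside the container.
var readinessProbe = &corev1.Probe{
	Handler: corev1.Handler{
		Exec: &corev1.ExecAction{
			Command: []string{"/ko-app/queue", "-probe", "true"},
		},
	},
	TimeoutSeconds: 10, // the window the in-container retry loop is sized against
}
```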
@@ -64,6 +63,10 @@ const (
	// in the mesh.
	quitSleepDuration = 20 * time.Second

	// Set equal to the queue-proxy's ExecProbe timeout to take
	// advantage of the full window
	probeTimeout = 10 * time.Second
Should we make this a flag so it can be kept consistent with TimeoutSeconds in the probe configuration? Maybe we shouldn't do this until we pass the entire probe descriptor through?
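If it ever did become a flag, it would presumably be a one-liner along these lines; this is purely hypothetical, and no such flag exists in this change.

```go
// Hypothetical: expose the probe timeout so it can be kept in sync with the
// readiness probe's TimeoutSeconds instead of being a hard-coded constant.
var probeTimeout = flag.Duration("probe-timeout", 10*time.Second,
	"maximum time a single readiness probe invocation may take")
```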
Yeah, the idea was that this was separate from the issue of passing in the full probe to have something user-defined applied via the queue-proxy. There are a bunch of things in this file that will change once we start passing all of the configuration in.
I just fixed my local setup to where I'm getting better tracing results now, and I was wrong earlier to say this looks good. Here's a picture of what I am seeing with this patch + the pod-tracing patch (using https://github.com/joshrider/serving/tree/execprobe-trace). Note the container_startup spans: the bottom one is the user-application, which goes ready as soon as it runs (due to no readiness probe), and the top one is the queue-proxy. There's about a 0.6s delay between them (it really shouldn't take that long for the queue-proxy to start up). Here's one from master where I got lucky and the kubelet didn't send the networking probe until after the queue-proxy had the socket open. Note how quickly after the user app goes ready the queue-proxy does as well. I'd think this is what a fix should look like. I don't grok why this is happening, though; maybe there's some other delay occurring before the exec probe starts running?
@greghaynes our queue-proxy also performs a TCP probe on the user-container, so maybe the 0.6 seconds is the time for the user-container's web server to finish starting up (and listening)?
@tcnghia It could be coming from between the queue-proxy and the user-container, but I am surprised it would take nearly that long - in this case I'm testing the helloworld-go sample app. The second set of traces I think is a good baseline for what startup cost should be (~100ms).
/hold until we've figured out whether this is behaving as designed. Feel free to remove the hold if you are all fine with this and I misinterpreted the last comments.
@greghaynes my tests with the pod-trace branch also looked a lot like your top graph. I'm going to play around a bit more with the tracing (and whatever other debug strategies I can cook up) to try to get a better idea of what is actually going on inside. @markusthoemmes good call 👍
tl;dr Given the sad state of our CI, I agree with holding on this (and probably a lot of the rest of the 0.7 items) until 0.8. A few things come to mind:
/retest
@mattmoor I agree, if it works and doesn't cause a regression, ship it!
Ok, I plan to
/hold cancel
* use exec probe for queue proxy readiness check

  Replaces HTTPGet ReadinessProbe with ExecAction. The ExecAction recreates the HTTPGet semantics within the queue proxy container and hits the same previous health check HTTP path. This approach was taken since the queue proxy process maintains state on whether it's active.

* use descriptive name for error, pass port as int to probeQueueHealthPath()
* pull probe timeout into constant, tighten up retry interval, clean up imports
* add comment explaining queue probe retry interval
* remove out of date comment
* rename queueProbeTimeout, use constant in probeUserContainer
Fixes #3308
Proposed Changes
Also worth noting that this makes us dependent on how ko builds images. If the binary ever stops showing up at /ko-app/queue, this will break.
Release Note