WIP fix(core): allow requests to be queued in CONNECTING state (#374) #583
Force-pushed from 6f719bc to 3a1d442.
@jeffwidman, @StephenSorriaux: As far as I can tell, just removing the check for |
First, I have not looked deeply at this section of the code, so please take what I say with a grain of salt.
That sounds very reasonable.
I'm not sure. @StephenSorriaux can you also take a look at this? It's a very small code change, but connection blips are common enough that we do want to make sure we get the model right.
How do the Java and C clients handle this? IMO they're the canonical examples so most people will be expecting similar behavior...
Off the top of my head: they just queue requests, and don't have a maximum queue length. (I will double-check when I circle back to this.)
Neither checks the state of the connection before doing so.
Sounds good, we can do the same. That handles one of your outstanding TODOs. For the second part:
Can you explain this a little more? Sorry I'm asking for all these details, I'm trying to be somewhat thoughtful on a limited time budget and I know you've looked at this code deeply so already understand it. Looking forward to landing this soon!
Seems reasonable.
Sure.
First, something I am not planning to change: when the connection is lost, the requests which have been emitted but are not completed fail with With the current version of the patch, the requests which were queued but not emitted are also cancelled with the same exception: Lines 564 to 570 in 3a1d442
The idea would be not to cancel these requests when the new state is (Moreover, this would be more consistent with the rest of the patch; it would not matter if a request was submitted just before or during the "outage", as long as it was not emitted.) I have implemented the feature, but it currently breaks some assumptions in the tests; I will append a second commit to the PR once I get to fix that. (And it is fine with me if we decide to only merge the first half in the end.)
No problem! I have looked somewhat deeply, but am not very familiar with the codebase, so I am happy that you are questioning my assumptions.
Thanks, that is helpful. I definitely agree with you that queued but not emitted should not fail while in The one thing I do want to make sure of is that if the connection never manages to recover, all the queued-but-not-emitted requests are in some way visible as either failed or incomplete to the application. I think you understand why, but if not let me know and I can explain further. Assuming that is true, then I am in agreement with everything you said. Thank you again!
ACK; I think we are on the same page :)
Thank you for this PR and sorry for this late reply. I agree with both of you on the fact that the Do you think it would be possible to spell out this "new" behavior somewhere in the documentation?
…#374) As discussed in python-zk#570 (comment): With this patch, requests issued while the client is in the 'CONNECTING' state get queued instead of raising a (misleading) 'SessionExpiredError'.
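To make the change described in this commit message concrete, here is a minimal sketch in plain Python. It is a toy model, not Kazoo's code: `ToyClient`, its `submit` method, the `queue_while_connecting` flag and the three-state `State` enum are all assumptions made up for illustration; only the `CONNECTING` state and the `SessionExpiredError` name come from the commit message.

```python
import collections
import enum


class State(enum.Enum):
    # Simplified, hypothetical set of states for this sketch.
    CONNECTING = "CONNECTING"
    CONNECTED = "CONNECTED"
    EXPIRED_SESSION = "EXPIRED_SESSION"


class SessionExpiredError(Exception):
    """Stand-in for the exception named in the commit message."""


class ToyClient:
    """Hypothetical model of request submission; not Kazoo's actual code."""

    def __init__(self, queue_while_connecting):
        self.state = State.CONNECTING
        self.queue = collections.deque()
        # False models the behavior before the patch, True the behavior after.
        self.queue_while_connecting = queue_while_connecting

    def submit(self, request):
        rejected = {State.EXPIRED_SESSION}
        if not self.queue_while_connecting:
            rejected.add(State.CONNECTING)
        if self.state in rejected:
            raise SessionExpiredError(request)
        # The request waits here until the connection can emit it.
        self.queue.append(request)


before = ToyClient(queue_while_connecting=False)
try:
    before.submit("get /a")
    before_outcome = "queued"
except SessionExpiredError:
    before_outcome = "SessionExpiredError"

after = ToyClient(queue_while_connecting=True)
after.submit("get /a")
print(before_outcome, list(after.queue))
```

In this model the only difference between the two clients is whether `CONNECTING` counts as a rejecting state; a genuinely expired session still raises in both.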
…thon-zk#374)
If the connection is lost, but the state is 'CONNECTING', the client is trying to revalidate an existing session. When that happens, the requests which have been dequeued but are still pending are interrupted with a 'ConnectionLoss' exception--as we don't know if the packet reached the server, and the ACKs will never come back anyway.
Without this patch, the requests which are still queued, and thus haven't been emitted, are also cancelled with the same exception (see bottom half of '_notify_pending').
It seems that there is no reason to cancel such requests when the new state is 'CONNECTING', as the client is trying to validate an existing session; these "pristine" requests could very well be submitted in-order over the new connection if the session is recovered.
This also seems more consistent with the first patch for issue python-zk#374: it does not matter whether a request was queued just before or during the "outage"--as long as it was not emitted.
This patch implements that feature.
ATTN: The patch is marked WIP because it should most probably *NOT* be merged--as it turns out that both the Java and the C client cancel such requests when the connection is lost.
Java: ClientCnxn socket error -> cleanAndNotifyState -> cleanup -> conLossPacket/remove
C: Socket error -> handle_socket_error_msg -> handle_error -> cleanup -> cleanup_bufs -> free_buffers/free_completions
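The distinction this commit message draws between pending (emitted) and queued (never emitted) requests can be sketched as follows. This is an illustrative model only: `on_connection_lost`, its parameters and the string state names are hypothetical, and the real logic lives in Kazoo's '_notify_pending', which is not reproduced here.

```python
import collections


class ConnectionLoss(Exception):
    """Stand-in for the exception named in the commit message."""


def on_connection_lost(pending, queued, new_state, keep_queued):
    """Fail requests after a connection loss; return [(request, error), ...].

    pending: deque of requests already emitted but not yet acknowledged.
    queued: deque of requests that were never emitted.
    keep_queued: True models the WIP patch, False the Java/C behavior.
    """
    failures = []
    # Emitted requests always fail: the client cannot know whether the
    # packet reached the server, and the replies will never arrive.
    while pending:
        failures.append((pending.popleft(), ConnectionLoss()))
    # Never-emitted requests survive only if the session may be recovered.
    if not (keep_queued and new_state == "CONNECTING"):
        while queued:
            failures.append((queued.popleft(), ConnectionLoss()))
    return failures


pending = collections.deque(["create /a"])
queued = collections.deque(["get /b", "get /c"])
failed = on_connection_lost(pending, queued, "CONNECTING", keep_queued=True)
print([request for request, _ in failed], list(queued))
```

With `keep_queued=False`, or with any new state other than `"CONNECTING"`, the same call drains both deques, so every queued request is reported as failed if the session cannot be recovered.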
Force-pushed from 3a1d442 to b989f10.
@StephenSorriaux wrote:
No problem!
@jeffwidman, @StephenSorriaux: I believe the first commit of this series (the actual fix for #374), should be good to go. It does not try to bound the queue length, but neither do the Java or C clients. I have now implemented the new feature (not emptying the queue on retry-able disconnect), and have fixed/augmented the tests—but am now inclined to think that it should not be merged! (Hence the Indeed, it turns out that both the Java and C clients cancel such requests when the connection is lost:
I have double-checked; this happens even when the errors are not fatal and the session is recovered. (The C client doesn't even have a way to distinguish between queued & pending, and thus naturally cancels everything.) I'm not sure it makes a difference in practice, as there is a "race" between the queuing of requests and the session event notification—particularly in the Java version which does not even lock the Anyway: I have pushed both patches, and will let you have a look. Unless you disagree, I will respin without the second one in a few days.
Sure; I'll add a note for each patch we end up including.
Oh boy, that's a doozy of a research report. Thank you for digging into this. If you ever get tired of consulting and want a full time gig, I know a number of folks who are always looking to hire engineers with such attention to detail. 😁
So I think you're on the right track here... on the surface at least, it seems the Java and C clients are making a mistake. But I also agree we should not deviate from them because most knowledgeable ZK folks tend to be experts in ZK who switch-hit in polyglot environments, so it's easier for the ecosystem if the behavior stays consistent across clients, even if it's arguably slightly incorrect (but clearly workable, otherwise it would have been fixed long ago).
So I think we should do as you suggest:
1. open a second PR with just the first fix
2. keep the second commit in this PR around (ie, don't force-push and obliterate it), but close it for now so it doesn't get merged.
Furthermore, I think we should go one step further and submit an issue to the upstream ZK team to change this behavior in the Java and C clients.
Do you want to submit this or shall I? If you don't have time I'm willing to do it, but you discovered the issue, you understand it far better than I do, and as a consultant it's always a nice trust builder with potential clients to say that you've got a patch accepted to a core project like Zookeeper. Up to you, I just want to make sure in some way this is pushed upstream, which should result in either a fix or further clarification.
As a heads up, my personal experience with the core ZK project is that it tends to be a slower moving project.
Thoughts?
@jeff Damien knows, he already has helped get SASL support in the C client 😉
…On Sun, Feb 16, 2020, 01:01 Jeff Widman ***@***.***> wrote:
…#374) With this patch, requests issued while the client is in the 'CONNECTING' state get queued instead of raising a misleading 'SessionExpiredError'. This fixes python-zk#374, and brings Kazoo more in line with the Java and C clients. See the 'kazoo.client.KazooClient.state' documentation as well as these discussions for more details: python-zk#570 (comment) python-zk#583 (comment)
:) I'm just trying to be a bit careful with concurrent systems programming; I know that one can never be too prudent with those. But thank you for the kind words!
Right. But I don't know if I would say "incorrect"; perhaps just a bit pessimistic and/or overly disruptive. (Which is not out of character; ZK currently is of a pessimistic nature. It would be nice to have something akin to ZOOKEEPER-22, but that proposal has been lingering for a long time now! Raft has that feature IIRC.)
Done: #588. That commit is based on the first fix, but with an expanded set of tests adapted from what I had cooked up for the second part.
Okay; good idea.
I am definitely going to keep this in mind, and on my TODO list. I would like to think a bit more about which "use-cases" it could break, though. (I could imagine an application batching asynchronous requests and hoping to get away with it by taking note of state changes; it could suddenly find itself in a strange place much more often.) I am planning to ask for clarification on the ML a bit later, unless I identify an issue with this "proposal" in the meantime. (I will report here in any case.) But feel free to go ahead if you are curious and/or impatient, I'm not trying to collect points, and will gladly follow that conversation!
Right; as @ceache mentions, I have already gotten a taste of it :) The good news is that they seem to have picked up some steam lately! Cheers, -D
Sounds good. TBH, I switched teams at my day job last fall and am not currently responsible for any production ZK ensembles or applications that talk to ZK, so due to lack of time I unsubscribed from the mailing list... I pitch in here from time to time for fun and to give back to the community so that PRs don't languish. That said, I'll be curious to hear what the eventual outcome is.
With this patch, requests issued while the client is in the 'CONNECTING' state get queued instead of raising a misleading 'SessionExpiredError'. This fixes #374, and brings Kazoo more in line with the Java and C clients. See the 'kazoo.client.KazooClient.state' documentation as well as these discussions for more details: #570 (comment) #583 (comment)
As discussed in #570 (comment):
With this patch, requests issued while the client is in the `CONNECTING` state get queued instead of raising a (misleading) `SessionExpiredError`.
TBD:
- This patch does not prevent requests which have been queued but not emitted from being rejected with `ConnectionLoss`. It should also get rid of the second part of `_notify_pending`, shouldn't it? [Not doing this, as the C/Java clients do not, either.]
- This patch does not try to limit the maximum queue length, which should probably be controlled via a new `KazooClient` parameter. (How about `max_queue_length`? What should it default to? Which exception should we raise when the queue overflows?) [Not doing this, as the C/Java clients do not, either.]
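For reference, the queue bound floated in the second TBD item (and ultimately not implemented) could look roughly like the sketch below. Everything here is hypothetical: `BoundedRequestQueue` and `QueueFullError` are invented names, `max_queue_length` is only the parameter name suggested above, and the default of `None` (unbounded) is an assumption chosen to match what the PR ends up doing.

```python
import collections


class QueueFullError(Exception):
    """Hypothetical exception; the thread leaves this choice open."""


class BoundedRequestQueue:
    """Toy request queue with an optional upper bound on its length."""

    def __init__(self, max_queue_length=None):
        # None means unbounded, which is the behavior the PR settles on.
        self.max_queue_length = max_queue_length
        self._items = collections.deque()

    def put(self, request):
        limit = self.max_queue_length
        if limit is not None and len(self._items) >= limit:
            raise QueueFullError(f"queue already holds {limit} requests")
        self._items.append(request)

    def __len__(self):
        return len(self._items)


bounded = BoundedRequestQueue(max_queue_length=2)
bounded.put("get /a")
bounded.put("get /b")
try:
    bounded.put("get /c")
    overflowed = False
except QueueFullError:
    overflowed = True
print(len(bounded), overflowed)
```

Rejecting the newest request is only one possible overflow policy; blocking the caller or dropping the oldest entry would be alternatives, and the thread does not pick one.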