Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP fix(core): allow requests to be queued in CONNECTING state (#374) #583

Closed

Conversation

ztzg
Copy link
Contributor

@ztzg ztzg commented Feb 9, 2020

As discussed in #570 (comment):

With this patch, requests issued while the client is in the CONNECTING state get queued instead of raising a (misleading) SessionExpiredError.

TBD:

  • This patch does not prevent requests which have been queued but not emitted from being rejected with ConnectionLoss. It should also get rid of the second part of _notify_pending, shouldn't it? [Not doing this, as the C/Java clients do not, either.]

  • This patch does not try to limit the maximum queue length, which should probably be controlled via a new KazooClient parameter. (How about max_queue_length? What should it default to? Which exception should we raise when the queue overflows?) [Not doing this, as the C/Java clients do not, either.]

@ztzg ztzg force-pushed the fix-374-queue-requests-while-connecting branch from 6f719bc to 3a1d442 Compare February 9, 2020 16:53
@ztzg
Copy link
Contributor Author

ztzg commented Feb 9, 2020

@jeffwidman, @StephenSorriaux: As far as I can tell, just removing the check for CONNECTING from KazooClient._call is sufficient to obtain the desired behavior; _queue and _write_sock should remain valid in that state (with the exception of the "race with close()" which is already handled by the code). Do you agree?

@jeffwidman
Copy link
Member

First, I have not looked deeply at this section of the code, so please take what I say with a grain of salt.

removing the check for CONNECTING from KazooClient._call is sufficient to obtain the desired behavior; _queue and _write_sock should remain valid in that state (with the exception of the "race with close()" which is already handled by the code). Do you agree?

That sounds very reasonable.

This patch does not prevent requests which have been queued but not emitted from being rejected with ConnectionLoss. It should also get rid of the second part of _notify_pending, shouldn't it?

I'm not sure. @StephenSorriaux can you also take a look at this? It's a very small code change, but connection blips are common enough that we do want to make sure we get the model right.

This patch does not try to limit the maximum queue length, which should probably be controlled via a new KazooClient parameter. (How about max_queue_length? What should it default to? Which exception should we raise when the queue overflows?)

How do the Java and C clients handle this? IMO they're the canonical examples so most people will be expecting similar behavior...

@ztzg
Copy link
Contributor Author

ztzg commented Feb 10, 2020

This patch does not try to limit the maximum queue length, which should probably be controlled via a new KazooClient parameter. (How about max_queue_length? What should it default to? Which exception should we raise when the queue overflows?)

How do the Java and C clients handle this? IMO they're the canonical examples so most people will be expecting similar behavior...

From the top of my head: they just queue requests, and don't have a maximum queue length. (I will double-check when I circle back to this.)

@ztzg
Copy link
Contributor Author

ztzg commented Feb 12, 2020

This patch does not try to limit the maximum queue length, which should probably be controlled via a new KazooClient parameter. (How about max_queue_length? What should it default to? Which exception should we raise when the queue overflows?)

How do the Java and C clients handle this? IMO they're the canonical examples so most people will be expecting similar behavior...

Neither checks the state of the connection before doing so.

@jeffwidman
Copy link
Member

Sounds good, we can do the same. That handles one of your outstanding TODOs.

For the second part:

This patch does not prevent requests which have been queued but not emitted from being rejected with ConnectionLoss. It should also get rid of the second part of _notify_pending, shouldn't it?

Can you explain this a little more?
On the surface it sounds like a good thing if requests that were queued but then not emitted due to a permanent network failure eventually raise a ConnectionLoss... but we'd want to make sure some timeout had passed to guarantee permanent failure. Because if it was a temp network blip that the library recovers successfully and then sends these messages, then obviously we want to mark as successful.

Sorry I'm asking for all these details, I'm trying to be somewhat thoughtful on a limited time budget and I know you've looked at this code deeply so already understand it.

Looking forward to landing this soon!

@ztzg
Copy link
Contributor Author

ztzg commented Feb 12, 2020

Sounds good, we can do the same. That handles one of your outstanding TODOs.

Seems reasonable.

For the second part:

This patch does not prevent requests which have been queued but not emitted from being rejected with ConnectionLoss. It should also get rid of the second part of _notify_pending, shouldn't it?

Can you explain this a little more?

Sure.

On the surface it sounds like a good thing if requests that were queued but then not emitted due to a permanent network failure eventually raise a ConnectionLoss... but we'd want to make sure some timeout had passed to guarantee permanent failure. Because if it was a temp network blip that the library recovers successfully and then sends these messages, then obviously we want to mark as successful.

First, something I am not planning to change: when the connection is lost, the requests which have been emitted but are not completed fail with ConnectionLoss (as the ACKs will never reach the client anyway).

With the current version of the patch, the requests which were queued but not emitted are also cancelled with the same exception:

kazoo/kazoo/client.py

Lines 564 to 570 in 3a1d442

while True:
try:
request, async_object = self._queue.popleft()
if async_object:
async_object.set_exception(exc)
except IndexError:
break

The idea would be not to cancel these requests when the new state is CONNECTING—as the client is trying to validate an existing session, and these "pristine" requests could very well be submitted in-order over the new connection if the session is recovered.

(Moreover, this would be more consistent with the rest of the patch; it would not matter if a request was submitted just before or during the "outage"—as long as it was not emitted.)

I have implemented the feature, but it currently breaks some assumptions in the tests; I will append a a second commit to the PR once I get to fix that. (And it is fine with me if we decide to only merge the first half in the end.)

Sorry I'm asking for all these details, I'm trying to be somewhat thoughtful on a limited time budget and I know you've looked at this code deeply so already understand it.

No problem! I have looked somewhat deeply, but am not very familiar with the codebase—so I am happy that you are questioning my assumptions.

@jeffwidman
Copy link
Member

jeffwidman commented Feb 12, 2020

Thanks, that is helpful.

I definitely agree with you that queued but not emitted should not fail while in CONNECTING state... essentially we want that reconnection to be invisible to the application (other than error logs or possibly if they can explicitly check the state of each message they can see it's not complete yet).

The one thing I do want to make sure of is that if the connection never manages to recover, all the queued-but-not-emitted requests are in some way are visible as either failed or incomplete to the application. I think you understand why, but if not let me know and I can explain further.

Assuming that is true, then I am in agreement with everything you said.

Thank you again!

@ztzg
Copy link
Contributor Author

ztzg commented Feb 13, 2020

The one thing I do want to make sure of is that if the connection never manages to recover, all the queued-but-not-emitted requests are in some way are visible as either failed or incomplete to the application.

ACK; I think we are on the same page :)

@StephenSorriaux
Copy link
Member

Thank you for this PR and sorry for this late reply.

I agree with both of you on the fact that the _queue should not be emptied if the connection was somehow temporary lost.

Do you think it would be possible to precise this "new" behavior somewhere in the documentation?

ztzg added 2 commits February 14, 2020 17:54
…#374)

As discussed in python-zk#570 (comment):

With this patch, requests issued while the client is in the
'CONNECTING' state get queued instead of raising a (misleading)
'SessionExpiredError'.
…thon-zk#374)

If the connection is lost, but the state is 'CONNECTING', the client
is trying to revalidate an existing session.

When that happens, the requests which have been dequeued but are still
pending are interrupted with a 'ConnectionLoss' exception--as we don't
know if the packet reached the server, and the ACKs will never come
back anyway.

Without this patch, the requests which are still queued, and thus
haven't been emitted are also cancelled with the same exception
(see bottom half of '_notify_pending').

It seems that there is no reason to cancel such requests when the new
state is 'CONNECTING', as the client is trying to validate an existing
session; these "pristine" requests could very well be submitted
in-order over the new connection if the session is recovered.

This also seems more consistent with the first patch for issue python-zk#374:
it does not matter whether a request was queued just before or during
the "outage"--as long as it was not emitted.

This patch implements that feature.

ATTN: The patch is marked WIP because it should most probably *NOT* be
merged--as it turns out that both the Java and the C client cancel
such requests when the connection is lost.

Java: ClientCnxn socket error -> cleanAndNotifyState -> cleanup
    -> conLossPacket/remove

C: Socket error -> handle_socket_error_msg -> handle_error -> cleanup
    -> cleanup_bufs -> free_buffers/free_completions
@ztzg ztzg force-pushed the fix-374-queue-requests-while-connecting branch from 3a1d442 to b989f10 Compare February 14, 2020 18:23
@ztzg
Copy link
Contributor Author

ztzg commented Feb 14, 2020

@StephenSorriaux wrote:

Thank you for this PR and sorry for this late reply.

No problem!

I agree with both of you on the fact that the _queue should not be emptied if the connection was somehow temporary lost.

@jeffwidman, @StephenSorriaux: I believe the first commit of this series (the actual fix for #374), should be good to go. It does not try to bound the queue length, but neither do the Java or C clients.

I have now implemented the new feature (not emptying the queue on retry-able disconnect), and have fixed/augmented the tests—but am now inclined to think that it should not be merged! (Hence the WIP in the commit title.)

Indeed, it turns out that both the Java and C clients cancel such requests when the connection is lost:

  • Java: ClientCnxn socket error → cleanAndNotifyStatecleanupconLossPacket/remove;

  • C: Socket error → handle_socket_error_msghandle_errorcleanupcleanup_bufsfree_buffers/free_completions.

I have double-checked; this happens even when the errors are not fatal and the session is recovered. (The C client doesn't even have a way to distinguish between queued & pending, and thus naturally cancels everything.)

I'm not sure it makes a difference in practice, as there is a "race" between the queuing of requests and the session event notification—particularly in the Java version which does not even lock the outgoingQueue—and one cannot assume that every request submitted before the notification is going to be canceled. But staying aligned with the Java/C clients might just be a better idea.

Anyway: I have pushed both patches, and will let you have a look. Unless you disagree, I will respin without the second one in a few days.

Do you think it would be possible to precise this "new" behavior somewhere in the documentation?

Sure; I'll add a note for each patch we end up including.

@jeffwidman
Copy link
Member

Oh boy, that's a doozy of a research report. Thank you for digging into this. If you ever get tired of consulting and want a full time gig, I know a number of folks who are always looking to hire engineers with such attention to detail. 😁

So I think you're on the right track here... on the surface at least, it seems the Java and C clients are making a mistake. But I also agree we should not deviate from them because most knowledgable ZK folks tend to be experts in ZK who switch-hit in polyglot environments so it's easier for the ecosystem if the behavior stays consistent across clients, even if it's arguably slightly incorrect (but clearly workable otherwise it would have been fixed long ago).

So I think we should do as you suggest:

  1. open a second PR with just the first fix
  2. keep the second commit in this PR around (ie, don't force-push and obliterate it), but close it for now so it doesn't get merged.

Furthermore, I think we should go one step further and submit an issue to the upstream ZK team to change this behavior in the Java and C clients.

Do you want to submit this or shall I? If you don't have time I'm willing to do it if, but you discovered the issue, you understand it far better than I do, and as a consultant it's always a nice trust builder with potential clients to say that you've got a patch accepted to a core project like Zookeeper. Up to you, I just want to make sure in some way this is pushed upstream which which should result in either a fix or further clarification.

As a heads up, my personal experience with the core ZK project is that it tends to be a slower moving project.

Thoughts?

@ceache
Copy link
Contributor

ceache commented Feb 16, 2020 via email

ztzg added a commit to ztzg/kazoo that referenced this pull request Feb 17, 2020
…#374)

With this patch, requests issued while the client is in the
'CONNECTING' state get queued instead of raising a misleading
'SessionExpiredError'.

This fixes python-zk#374, and brings
Kazoo more in line with the Java and C clients.

See the 'kazoo.client.KazooClient.state' documentation as well as
these discussions for more details:

python-zk#570 (comment)
python-zk#583 (comment)
@ztzg
Copy link
Contributor Author

ztzg commented Feb 17, 2020

Oh boy, that's a doozy of a research report. Thank you for digging into this. If you ever get tired of consulting and want a full time gig, I know a number of folks who are always looking to hire engineers with such attention to detail.

:)  I'm just trying to be a bit careful with concurrent systems programming; I know that one never is too prudent with those. But thank you for the kind words !

So I think you're on the right track here... on the surface at least, it seems the Java and C clients are making a mistake. But I also agree we should not deviate from them because most knowledgable ZK folks tend to be experts in ZK who switch-hit in polyglot environments so it's easier for the ecosystem if the behavior stays consistent across clients, even if it's arguably slightly incorrect (but clearly workable otherwise it would have been fixed long ago).

Right. But I don't know if I would say "incorrect"; perhaps just a bit pessimistic and/or overly disruptive.

(Which is not out of character; ZK currently is of a pessimistic nature. It would be nice to have something akin to ZOOKEEPER-22, but that proposal has been lingering for a long time now! Raft has that feature IIRC.)

So I think we should do as you suggest:

  1. open a second PR with just the first fix

Done: #588.

That commit is based on the first fix, but with an expanded set of tests adapted from what I had cooked up for the second part.

  1. keep the second commit in this PR around (ie, don't force-push and obliterate it), but close it for now so it doesn't get merged.

Okay; good idea.

Furthermore, I think we should go one step further and submit an issue to the upstream ZK team to change this behavior in the Java and C clients.

Do you want to submit this or shall I? […] I just want to make sure in some way this is pushed upstream which which should result in either a fix or further clarification.

I am definitely going to keep this in mind, and on my TODO list. I would like to think a bit more about which "use-cases" it could break, though.

(I could imagine an application batching asynchronous requests and hoping to get away with it by taking note of state changes; it could suddenly find itself in a strange place much more often.)

I am planning to ask for clarification on the ML a bit later, unless I identify an issue with this "proposal" in the meantime. (I will report here in any case.)

But feel free to go ahead if you are curious and/or impatient, I'm not trying to collect points, and will gladly follow that conversation!

As a heads up, my personal experience with the core ZK project is that it tends to be a slower moving project.

Right; as @ceache mentions, I have already gotten a taste of it :)  The good news is that they seem to have picked up some steam lately!

Cheers, -D

@ztzg ztzg closed this Feb 17, 2020
@jeffwidman
Copy link
Member

jeffwidman commented Feb 18, 2020

Sounds good.

TBH, I switched teams at my day job last fall and am not currently responsible for any production ZK ensembles or applications that talk to ZK, so due to lack of time I unsubscribed from the mailing list... I pitch in here from time to time for fun and to give back to the community so that PR's don't languish. That said, I'll be curious to hear what the eventual outcome is.

ztzg added a commit to ztzg/kazoo that referenced this pull request Feb 24, 2020
…#374)

With this patch, requests issued while the client is in the
'CONNECTING' state get queued instead of raising a misleading
'SessionExpiredError'.

This fixes python-zk#374, and brings
Kazoo more in line with the Java and C clients.

See the 'kazoo.client.KazooClient.state' documentation as well as
these discussions for more details:

python-zk#570 (comment)
python-zk#583 (comment)
StephenSorriaux pushed a commit that referenced this pull request Mar 9, 2020
With this patch, requests issued while the client is in the
'CONNECTING' state get queued instead of raising a misleading
'SessionExpiredError'.

This fixes #374, and brings
Kazoo more in line with the Java and C clients.

See the 'kazoo.client.KazooClient.state' documentation as well as
these discussions for more details:

#570 (comment)
#583 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants