Fix data race while accessing connection in partitionConsumer #535
Conversation
  pc.log.Info("Connected consumer")
- pc.conn.AddConsumeHandler(pc.consumerID, pc)
+ pc._getConn().AddConsumeHandler(pc.consumerID, pc)
What's the behavior if _getConn() ever returns nil? Is that possible? Same for all the other places.
Good question; _getConn() should never return nil. An invariant in this code is that the partitionConsumer.conn field must be set and is never nil. The grabConn() method sets the partitionConsumer.conn field; grabConn() is called from the newPartitionConsumer factory method, which will fail construction of the partitionConsumer if grabConn() returns an error.
That said, it is probably better for the cast in _getConn() to be unchecked and let the code panic() if the invariant is broken, rather than returning nil and causing a nil pointer dereference further down the line. Thoughts?
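For illustration, a minimal sketch of the accessor being discussed, assuming the conn field is declared as a sync/atomic Value that grabConn() populates before the consumer is handed out (field and method names follow this thread and may differ from the actual code):

package pulsar

import (
	"sync/atomic"

	"github.com/apache/pulsar-client-go/pulsar/internal"
)

type partitionConsumer struct {
	// conn always holds a non-nil internal.Connection once grabConn() has
	// succeeded in newPartitionConsumer; other fields elided in this sketch.
	conn atomic.Value
}

// _getConn returns the current connection. The type assertion is deliberately
// unchecked: if the invariant above is ever broken, we panic here, close to
// the bug, instead of returning nil and failing with a nil pointer
// dereference further down the line.
func (pc *partitionConsumer) _getConn() internal.Connection {
	return pc.conn.Load().(internal.Connection)
}

// _setConn atomically publishes a new connection, e.g. after a reconnect.
func (pc *partitionConsumer) _setConn(conn internal.Connection) {
	pc.conn.Store(conn)
}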
I've made the cast unchecked and added a comment to explain why.
There was a change that moved the broker reconnect out of the events go routine (https://github.com/apache/pulsar-client-go/pull/376/files). This is now causing the data race issue.
The question is: if there is a reconnection, what should happen with the pending events in the event channel? Right now they are processed using a stale/closed connection. Even with this fix it's possible events will try to get processed using a stale connection. Maybe that is ok, but for me it makes the code difficult to follow and reason about. Ideally, we could come up with a cleaner way so we are never unintentionally using a stale/closed connection.
Thoughts?
@cckellogg,
I'm thinking that since PR 376 drains the connection's incomingRequestsCh on close, it should be possible to remove the extra go-routine in the partitionConsumer and select from the connectionClosedCh in partitionConsumer.runEventsLoop. What do you think?
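A rough sketch of what that could look like, assuming the partitionConsumer already has eventsCh, closeCh, and connectionClosedCh channels plus a reconnectToBroker() helper (only runEventsLoop and connectionClosedCh come from this discussion; the other names are assumptions and may not match the actual code):

// Sketch only: a single events goroutine owns connection handling, so the
// conn field is only touched from one place and the race disappears.
func (pc *partitionConsumer) runEventsLoop() {
	for {
		select {
		case <-pc.closeCh: // assumed: closed when the consumer shuts down
			return
		case <-pc.connectionClosedCh:
			// PR 376 drains and fails the connection's pending requests on
			// close, so it is safe to reconnect here before handling more events.
			pc.reconnectToBroker() // assumed helper that re-runs grabConn() with backoff
		case req := <-pc.eventsCh: // assumed: channel carrying ack/unsubscribe/close requests
			pc.dispatch(req) // assumed: existing per-event handling
		}
	}
}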
@cckellogg,
I've attempted the above with the following commit: 17335cd
If this change is accepted I'll clean up this branch and update the PR description.
In 5dfcc13 we also short-circuit broker re-connection attempts on consumer close.
I've reverted the behaviour of this PR back to the first approach, making access to the connection in the partitionConsumer atomic.
The latest CI failure is going to be addressed by #544
Force-pushed from d3c9ab3 to 18b976a.
The partitionConsumer maintains a few internal go-routines, two of which access the underlying internal.Connection. The main runEventsLoop() go-routine reads the connection field, while a separate go-routine is used to detect connection loss, initiate re-connection, and set the connection. The go-routine that initiates re-connection on connection loss was added in the following PR in order to address a deadlock: apache#535
The above PR also includes the following changes:
* connection drains and fails the incomingRequestsCh when the connection is closed.
* partitionConsumer uses a separate channel to communicate connection loss to the re-connection go-routine.
With the above in place it is possible for the partitionConsumer to handle the connection loss in the main runEventsLoop(), allowing us to use a single go-routine to manage the connection and resolve the data race.
Signed-off-by: Daniel Ferstay <[email protected]>
Force-pushed from 5dfcc13 to 18b976a.
Force-pushed from 18b976a to 6339d74.
The partitionConsumer maintains a few internal go-routines, two of which access the underlying internal.Connection. The main runEventsLoop() go-routine reads the connection field, while a separate go-routine is used to detect connection loss, initiate reconnection, and set the connection. Previously, access to the conn field was not synchronized. Now, the conn field is read and written atomically, resolving the data race.
Signed-off-by: Daniel Ferstay <[email protected]>
Force-pushed from 6339d74 to dadae1d.
Signed-off-by: xiaolongran <[email protected]>
### Motivation
In #700, we use a separate go routine to handle the logic of reconnect, so here you may encounter the same data race problem as in #535.
### Modifications
Now, the conn field is read and written atomically, avoiding race conditions.
Motivation
While attempting to submit a separate PR (#534) I found that the pulsar/reader_test consistently failed with the following data race. The change in this PR is an attempt to fix it.
Modifications
Store the internal.Connection managed by the partitionConsumer in an atomic.Value: https://golang.org/pkg/sync/atomic/#Value
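For reference, the pattern can be illustrated with a small self-contained example (the types and names below are purely illustrative and are not the client's actual API); running it with `go run -race` reports no race between the writer and the readers:

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// connection stands in for internal.Connection in this illustration.
type connection interface{ ID() int }

type conn struct{ id int }

func (c *conn) ID() int { return c.id }

type consumer struct {
	// conn always holds a non-nil *conn once setConn has been called.
	conn atomic.Value
}

func (c *consumer) setConn(cnx connection) { c.conn.Store(cnx) }

// getConn uses an unchecked type assertion: the invariant is that setConn
// runs before any reader, so a missing value is a programming error and
// should panic here rather than surface later as a nil pointer dereference.
func (c *consumer) getConn() connection { return c.conn.Load().(connection) }

func main() {
	c := &consumer{}
	c.setConn(&conn{id: 0})

	var wg sync.WaitGroup

	// Writer goroutine: plays the role of the reconnect path swapping in a
	// fresh connection.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 1; i <= 1000; i++ {
			c.setConn(&conn{id: i})
		}
	}()

	// Reader goroutines: play the role of event handlers that need the
	// current connection.
	for r := 0; r < 4; r++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				_ = c.getConn().ID()
			}
		}()
	}

	wg.Wait()
	fmt.Println("final connection id:", c.getConn().ID())
}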
Verifying this change
This change is already covered by existing tests that use partitionConsumer instances, such as pulsar/reader_test.