Fix data race while accessing connection in partitionConsumer #535
Conversation
  pc.log.Info("Connected consumer")
- pc.conn.AddConsumeHandler(pc.consumerID, pc)
+ pc._getConn().AddConsumeHandler(pc.consumerID, pc)
What's the behavior if _getConn() ever returns nil? Is that possible? Same for all the other places.
Good question; _getConn() should never return nil. An invariant in this code is that the partitionConsumer.conn field must be set and is never nil. The grabConn() method sets the partitionConsumer.conn field; grabConn() is called from the newPartitionConsumer factory method, which will fail construction of the partitionConsumer if grabConn() returns an error.
That said, it is probably better for the cast in _getConn() to be unchecked and let the code panic() if the invariant is broken, rather than returning nil and causing a nil pointer dereference further down the line. Thoughts?
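For illustration, a minimal sketch of the accessor being discussed, assuming the conn field is declared as a sync/atomic Value that grabConn() populates before the consumer is handed out (field and method names follow this thread and may differ from the actual code):

package pulsar

import (
	"sync/atomic"

	"github.com/apache/pulsar-client-go/pulsar/internal"
)

type partitionConsumer struct {
	// conn always holds a non-nil internal.Connection once grabConn() has
	// succeeded in newPartitionConsumer; other fields elided in this sketch.
	conn atomic.Value
}

// _getConn returns the current connection. The type assertion is deliberately
// unchecked: if the invariant above is ever broken, we panic here, close to
// the bug, instead of returning nil and failing with a nil pointer
// dereference further down the line.
func (pc *partitionConsumer) _getConn() internal.Connection {
	return pc.conn.Load().(internal.Connection)
}

// _setConn atomically publishes a new connection, e.g. after a reconnect.
func (pc *partitionConsumer) _setConn(conn internal.Connection) {
	pc.conn.Store(conn)
}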
I've made the cast unchecked and added a comment to explain why.
There was a change that moved the broker reconnect out of the events go routine (https://github.com/apache/pulsar-client-go/pull/376/files). This is now causing the data race issue.
The question is: if there is a reconnection, what should happen with the pending events in the event channel? Right now they are processed using a stale/closed connection. Even with this fix it's possible events will try to get processed using a stale connection. Maybe that is ok, but for me it makes the code difficult to follow and reason about. Ideally, we could come up with a cleaner way so we are never unintentionally using a stale/closed connection.
Thoughts?
@cckellogg,
I'm thinking that since PR 376 drains the connection's incomingRequestsCh on close, it should be possible to remove the extra go-routine in the partitionConsumer and select from the connectionClosedCh in partitionConsumer.runEventsLoop. What do you think?
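A rough sketch of what that could look like, assuming the partitionConsumer already has eventsCh, closeCh, and connectionClosedCh channels plus a reconnectToBroker() helper (only runEventsLoop and connectionClosedCh come from this discussion; the other names are assumptions and may not match the actual code):

// Sketch only: a single events goroutine owns connection handling, so the
// conn field is only touched from one place and the race disappears.
func (pc *partitionConsumer) runEventsLoop() {
	for {
		select {
		case <-pc.closeCh: // assumed: closed when the consumer shuts down
			return
		case <-pc.connectionClosedCh:
			// PR 376 drains and fails the connection's pending requests on
			// close, so it is safe to reconnect here before handling more events.
			pc.reconnectToBroker() // assumed helper that re-runs grabConn() with backoff
		case req := <-pc.eventsCh: // assumed: channel carrying ack/unsubscribe/close requests
			pc.dispatch(req) // assumed: existing per-event handling
		}
	}
}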
@cckellogg,
I've attempted the above with the following commit: 17335cd
If this change is accepted I'll clean up this branch and update the PR description.
In 5dfcc13 we also short-circuit broker re-connection attempts on consumer close.
I've reverted the behaviour of this PR back to the first approach, making access to the connection in the partitionConsumer atomic.
The latest CI failure is going to be addressed by #544
Force-pushed from d3c9ab3 to 18b976a.
The partitionConsumer maintains a few internal go-routines, two of which access the underlying internal.Connection. The main runEventsLoop() go-routine reads the connection field, while a separate go-routine is used to detect connection loss, initiate re-connection, and set the connection. The go-routine that initiates re-connection on connection loss was added in the following PR in order to address a deadlock: apache#535
The above PR also includes the following changes:
* connection drains and fails the incomingRequestsCh when the connection is closed.
* partitionConsumer uses a separate channel to communicate connection loss to the re-connection go-routine.
With the above in place it is possible for the partitionConsumer to handle the connection loss in the main runEventsLoop(), allowing us to use a single go-routine to manage the connection and resolve the data race.
Signed-off-by: Daniel Ferstay <[email protected]>
Force-pushed from 5dfcc13 to 18b976a.
Force-pushed from 18b976a to 6339d74.
The partitionConsumer maintains a few internal go-routines, two of which access the underlying internal.Connection. The main runEventsLoop() go-routine reads the connection field, while a separate go-routine is used to detect connection loss, initiate reconnection, and set the connection. Previously, access to the conn field was not synchronized. Now, the conn field is read and written atomically, resolving the data race.
Signed-off-by: Daniel Ferstay <[email protected]>
Force-pushed from 6339d74 to dadae1d.
Signed-off-by: xiaolongran <[email protected]>
### Motivation
In #700, we use a separate go routine to handle the logic of reconnect, so here you may encounter the same data race problem as in #535.
### Modifications
Now, the conn field is read and written atomically, avoiding race conditions.
Motivation
While attempting to submit a separate PR (#534) I found that the pulsar/reader_test consistently failed with the following data race. The change in this PR is an attempt to fix it.
Modifications
Store the internal.Connection managed by the partitionConsumer in an atomic.Value: https://golang.org/pkg/sync/atomic/#Value
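For reference, the pattern can be illustrated with a small self-contained example (the types and names below are purely illustrative and are not the client's actual API); running it with `go run -race` reports no race between the writer and the readers:

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// connection stands in for internal.Connection in this illustration.
type connection interface{ ID() int }

type conn struct{ id int }

func (c *conn) ID() int { return c.id }

type consumer struct {
	// conn always holds a non-nil *conn once setConn has been called.
	conn atomic.Value
}

func (c *consumer) setConn(cnx connection) { c.conn.Store(cnx) }

// getConn uses an unchecked type assertion: the invariant is that setConn
// runs before any reader, so a missing value is a programming error and
// should panic here rather than surface later as a nil pointer dereference.
func (c *consumer) getConn() connection { return c.conn.Load().(connection) }

func main() {
	c := &consumer{}
	c.setConn(&conn{id: 0})

	var wg sync.WaitGroup

	// Writer goroutine: plays the role of the reconnect path swapping in a
	// fresh connection.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 1; i <= 1000; i++ {
			c.setConn(&conn{id: i})
		}
	}()

	// Reader goroutines: play the role of event handlers that need the
	// current connection.
	for r := 0; r < 4; r++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				_ = c.getConn().ID()
			}
		}()
	}

	wg.Wait()
	fmt.Println("final connection id:", c.getConn().ID())
}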
Verifying this change
This change is already covered by existing tests that use partitionConsumer instances, such as pulsar/reader_test.