Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix stuck when reconnect broker #703

Merged
merged 2 commits into from
Jan 12, 2022

Conversation

wolfstudy
Copy link
Member

@wolfstudy wolfstudy commented Jan 11, 2022

Signed-off-by: xiaolongran [email protected]

Fixes #697

Motivation

As #697 said, In Go SDK, when the reconnection logic is triggered under certain conditions, the reconnection will not succeed due to request timeout.

Comparing the implementation of the Java SDK, we can see that each time the reconnection logic is triggered, the original connection will be closed and a new connection will be created.

image

So in this pr, we introduced a new reconnectFlag field in the connection struct to mark the reconnection state. When the broker actively informs the client to close the connection to trigger the reconnection logic, we will store it from the connections cache of the connectionPool. The old connection object is deleted, and a new connection is created to complete the reconnection

Modifications

  • Add reconnectFlag in connection struct

key, conn.logicalAddr, conn.physicalAddr)

// remove stale/failed connection
if conn.closed() {
// When the current connection is in a closed state or the broker actively notifies that the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the broker notify that the connection is closed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The action of the broker's notification to close the connection will be processed by the two commands CommandCloseProducer and CommandCloseConsumer. Generally, it may trigger the automatic load balancing of the topic or the unload operation will trigger the active closing of the broker.

@@ -163,6 +163,9 @@ type connection struct {
consumerHandlersLock sync.RWMutex
consumerHandlers map[uint64]ConsumerHandler

reconnectFlagLock sync.Mutex
reconnectFlag bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable is a little confusing to me. What is this flag suppose to indicate? And if it's set is the connection in an unusable state?

Copy link
Member Author

@wolfstudy wolfstudy Jan 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The calling path of the entire code is:

  1. The broker notifies the client to close the connection and sends CommandCloseProducer or CommandCloseConsumer cmd to the client
  2. The client receives this command and starts to enter the logic of handleCloseConsumer or handleCloseProducer
  3. The client starts to trigger the ConnectionClosed action, and then enters the reconnection logic
  4. In the reconnection logic, the GetConnection method will be called to obtain a connection

Previously in GetConnection(), we obtained the old connection, and now we want to change this behavior, expecting to remove the old connection object from the connections map when the ConnectionClosed action is triggered, and then from the GetConnection() method to get a new connection

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the current wrapper implementation of Connection and ConnectionPool, we have two elegant ways to handle this:

  1. In my current implementation, a flag is introduced into the struct of connection to mark the state of reconnection
  2. When using changeState() to trigger handleCloseProducer, set the state of the current connection to close

@wolfstudy
Copy link
Member Author

@cckellogg Choose to reuse the current code logic for processing. When receiving the CommandCloseProducer or CommandCloseConsumer cmd, set the current connection state to connectionClosed, and then delete the connection in the closed state through the following check-in GetConnection().

		if conn.closed() {
			delete(p.connections, key)
			p.log.Infof("Removed connection from pool key=%s logical_addr=%+v physical_addr=%+v",
				key, conn.logicalAddr, conn.physicalAddr)
			conn = nil // set to nil so we create a new one
		}

@lhotari
Copy link
Member

lhotari commented Apr 26, 2022

Comparing the implementation of the Java SDK, we can see that each time the reconnection logic is triggered, the original connection will be closed and a new connection will be created.

@wolfstudy I think that might be an incorrect observation. In the Java SDK, the concepts (classes) are fairly misleading, and something that might seem to be a connection close isn't one. The ClientHandler is really a reference to a connection in the Java SDK. The ClientCnx isn't closed in the Java SDK in ClientHandler.connectionClosed
https://github.com/apache/pulsar/blob/04aa9e8e51869d1621a7e25402a656084eebfc09/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConnectionHandler.java#L122-L141

lhotari added a commit to lhotari/pulsar-client-go that referenced this pull request Apr 26, 2022
zzzming added a commit to datastax/pulsar-client-go that referenced this pull request Apr 29, 2022
wolfstudy pushed a commit that referenced this pull request May 23, 2022
jayxhj pushed a commit to billowqiu/pulsar-client-go that referenced this pull request Jun 6, 2022
jiaqingchen pushed a commit to billowqiu/pulsar-client-go that referenced this pull request Jul 18, 2022
zwb1234567 pushed a commit to billowqiu/pulsar-client-go that referenced this pull request Jul 19, 2022
jiaqingchen added a commit to billowqiu/pulsar-client-go that referenced this pull request Jul 19, 2022
Revert "Revert "Fix stuck when reconnect broker (apache#703)" (apache#767)"
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reconnection blocked in producer by request timed out
5 participants