-
Notifications
You must be signed in to change notification settings - Fork 344
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fix] [issue 1051]: Fix inaccurate producer mem limit in chunking and schema #1055
[fix] [issue 1051]: Fix inaccurate producer mem limit in chunking and schema #1055
Conversation
@Gleiphir2769 Great job, actaully I am preparing to fix this issue and #1043 and refactor the callback/releaseSemaphore/releaseMemory/metrics together, it will be a BIG PR, I am busy these days, if you have time, you can do it together. My idea is:
|
Hi @gunli.
Before chunking introduction, the semaphore is required before dataChan. I moved it from pulsar-client-go/pulsar/producer_partition.go Lines 569 to 572 in be35740
If we can get it brefore
Sounds great. It's a bit difficult to understand And I think we can fix these bugs firstly. Refactoring work can be done in parallel. What do you think? |
I think it is not a problem, you can check the code of
func (sr *sendRequest) done(id *MessageID, err error) {
if (sr.semaphore){
sr.semaphore.Release()
}
sr.memLimit.Release()
runcallback(sr.callback, id, err)
metrics.IncXXX()
log.Debug/Error
...
} In any other logic where the request is done, we just call request.done(), no need to care about resources/callback/metrics/debug logs, the code will be more clear.
I agree with that. |
Could you give a review? @gunli @RobertIndie @shibd Please feel free to leave your comment, thanks! |
@@ -489,14 +488,13 @@ func (p *partitionProducer) internalSend(request *sendRequest) { | |||
|
|||
// The block chan must be closed when returned with exception | |||
defer request.stopBlock() | |||
if !p.canAddToQueue(request, uncompressedPayloadSize) { | |||
if !p.canAddToQueue(request) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
forget to runCallback, it is better to add a debug log
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
runCallback
is invoked by canReserveMem
if it's failed.
@@ -542,6 +538,11 @@ func (p *partitionProducer) internalSend(request *sendRequest) { | |||
|
|||
uncompressedSize := len(uncompressedPayload) | |||
|
|||
// try to reserve memory for uncompressedPayload | |||
if !p.canReserveMem(request, int64(uncompressedSize)) { | |||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
forget to release semaphore and runCallback, it is better to add a debug log
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Semaphore is released by canReserveMem
if it failed, runCallback
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I think Semaphore released by canReserveMem
may not be a good idea. It makes canReserveMem
must be invoked after canAddToQueue
. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, runCallback/releaseSemaphore/releaseMemory in canAddToQueue and canReserveMem violates the Single Responsibility Principle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The final solution is encapsulating the resource releasing logic in request.done(), anywhere there is an error, just call request.done()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm +1 for moving the semaphore release out of canReserveMem
. It's better that we release it here than in the canReserveMem
before we find a good solution for it.
I think is better to merge #1052 first, and then this PR is better to rebase after that, or it will confilct. @Gleiphir2769 @RobertIndie @shibd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work! Overall looks good to me. Left some comments.
pulsar/producer_test.go
Outdated
assert.Error(t, err) | ||
assert.ErrorContains(t, err, getResultStr(ClientMemoryBufferIsFull)) | ||
|
||
// wait all the chunks have been released |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it better that we add producer.flush
before retryAssert?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of DisableBatching=true
, producer.flush
here is useless and cause panic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The producer.flush
is also useful when disabling the batching and using sendAsync. But I just find that we already send a message synchronously at line 2047. So we don't need flush the producer here.
and cause panic.
Why does it panic? Seems like an unexpected behavior?
When enabling the chunking, we cannot get the number of total chunks before pushing the request to the dataChan. And there may be a deadlock issue similar to apache/pulsar#17446
+1 for this. It's a good practice to manage the resource. |
Please rerun workflow for this pr, thx! cc @RobertIndie |
@RobertIndie I think it is possible to do that, the trade off is we have to call |
@RobertIndie Would you please review #1049 , if it is OK, please merge it, this PR should rebase and update after that PR. |
@gunli Yes. That would be a performance issue. But if we introduce another channel, we still need to wait for the channel and block the user goroutine. And it also introduces more complexity. |
@RobertIndie I checked the code of compress And I checked the code of java client |
Ping @RobertIndie |
Thanks. @gunli I also found a bug related to this: #1057 The initial idea I came up with is to have the operation of pushing a message to the producer queue happen in the user thread. Just like the Java client did. Let's move this discussion into that issue(or a new issue if it's not related). |
e9eb93b
to
cf128d4
Compare
I have rebased this branch to master. Please rerun the workflow. Thx! @RobertIndie |
@RobertIndie would you please merge the latest PRs #1051 #1057 #1059 , we are eager to do the left refactoring work after these PRs :) |
… schema (#1055) ### Motivation The producer memory limit have some problem when `EnableChunking=true` or `Schema` is set. - When `Schema` is set, the actual message payload is `msg.Value`. The `len(msg.Payload)` may be 0 and memory can not be reserved acurate. https://github.com/apache/pulsar-client-go/blob/be3574019383ac0cdc65fec63e422fcfd6c82e4b/pulsar/producer_partition.go#L479-L494 - In chunking, if producer meets the memory limit, it should release the memory for **chunks which has send out**. But the calculate for this release is not accurate, it should be `uncompressedPayloadSize - int64(lhs)` instead of `uncompressedPayloadSize - int64(rhs)` https://github.com/apache/pulsar-client-go/blob/be3574019383ac0cdc65fec63e422fcfd6c82e4b/pulsar/producer_partition.go#L662-L664 - In chunking, if `internalSingleSend` is failed, it should release the memory for **single chunk**. But we release all the chunks memory repeatly now. https://github.com/apache/pulsar-client-go/blob/be3574019383ac0cdc65fec63e422fcfd6c82e4b/pulsar/producer_partition.go#L838-L843 - When producer received the receipt from broker, it should release the memory **it reserved before sending**. But it releases wrong size in `chunking` and `schema`. https://github.com/apache/pulsar-client-go/blob/be3574019383ac0cdc65fec63e422fcfd6c82e4b/pulsar/producer_partition.go#L1221-L1230 ### Modifications - Fix all the memory limit problems relative to `chunking` and `schema` - Add unit tests to cover these scenarios --------- Co-authored-by: shenjiaqi.2769 <[email protected]> (cherry picked from commit 28f61d2)
Fixes #1051
Motivation
The producer memory limit have some problem when
EnableChunking=true
orSchema
is set.When
Schema
is set, the actual message payload ismsg.Value
. Thelen(msg.Payload)
may be 0 and memory can not be reserved acurate.pulsar-client-go/pulsar/producer_partition.go
Lines 479 to 494 in be35740
In chunking, if producer meets the memory limit, it should release the memory for chunks which has send out. But the calculate for this release is not accurate, it should be
uncompressedPayloadSize - int64(lhs)
instead ofuncompressedPayloadSize - int64(rhs)
pulsar-client-go/pulsar/producer_partition.go
Lines 662 to 664 in be35740
In chunking, if
internalSingleSend
is failed, it should release the memory for single chunk. But we release all the chunks memory repeatly now.pulsar-client-go/pulsar/producer_partition.go
Lines 838 to 843 in be35740
When producer received the receipt from broker, it should release the memory it reserved before sending. But it releases wrong size in
chunking
andschema
.pulsar-client-go/pulsar/producer_partition.go
Lines 1221 to 1230 in be35740
Modifications
chunking
andschema
Verifying this change
Does this pull request potentially affect one of the following parts:
If
yes
was chosen, please highlight the changesDocumentation