fix(StargateQueries): use a sync pool when unmarshalling responses of protobuf objects #7346
Conversation
```go
if err != nil {
	return nil, err
}
// No matter what happens after this point, we must return
// the response type to the pool.
defer returnStargateResponseToPool(request.Path, protoResponseType)
```
We need to return the object to the sync.Pool so it does not leak.
wasmbinding/stargate_whitelist.go (outdated)
```go
// The query is multi-threaded so we're using a sync.Pool
// to manage the allocation and de-allocation of newly created
// pb objects.
var stargateResponsePools map[string]*sync.Pool
```
We use a sync.Pool per proto response type so we do not allocate every time; this should provide relief to the GC in moments of high traffic.
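For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the Get/Put lifecycle of a `sync.Pool` (the `response` type is illustrative, not from this PR):

```go
package main

import (
	"fmt"
	"sync"
)

// response stands in for any generated protobuf response type.
type response struct{ Data string }

var pool = sync.Pool{
	// New is called only when the pool has no idle object to hand out.
	New: func() any { return new(response) },
}

func main() {
	resp := pool.Get().(*response) // reuse an idle object or allocate one
	resp.Data = "hello"
	fmt.Println(resp.Data)
	*resp = response{} // reset state before recycling
	pool.Put(resp)     // make it available to the next caller
}
```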
```diff
@@ -184,34 +185,48 @@ func init() {
 	setWhitelistedQuery("/osmosis.concentratedliquidity.v1beta1.Query/CFMMPoolIdLinkFromConcentratedPoolId", &concentratedliquidityquery.CFMMPoolIdLinkFromConcentratedPoolIdResponse{})
 }

-// GetWhitelistedQuery returns the whitelisted query at the provided path.
+// IsWhitelistedQuery returns an error if the query is not whitelisted.
 func IsWhitelistedQuery(queryPath string) error {
```
Exposed this method in place of getWhitelistedQuery to avoid exporting a function that can leak memory if not used properly.
```go
	codec.ProtoMarshaler
}

func setWhitelistedQuery[T any, PT protoTypeG[T]](queryPath string, _ PT) {
```
This creates a sync.Pool for the given protobuf object; we use generics so we can properly instantiate the object that queryPath expects as a response.
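For context, a sketch of how the pieces can fit together, assuming `protoTypeG` constrains `PT` to a pointer type implementing `codec.ProtoMarshaler` (the actual PR code may differ in detail):

```go
package wasmbinding

import (
	"sync"

	"github.com/cosmos/cosmos-sdk/codec"
)

// One pool per whitelisted query path (mirrors the var quoted above).
var stargateResponsePools = make(map[string]*sync.Pool)

// protoTypeG constrains PT to be exactly *T while also implementing
// codec.ProtoMarshaler, so new(T) yields a valid response object.
type protoTypeG[T any] interface {
	*T
	codec.ProtoMarshaler
}

// setWhitelistedQuery registers a pool whose New func allocates a fresh
// response of the concrete type whenever the pool runs empty.
func setWhitelistedQuery[T any, PT protoTypeG[T]](queryPath string, _ PT) {
	stargateResponsePools[queryPath] = &sync.Pool{
		New: func() any {
			return PT(new(T))
		},
	}
}
```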
Could you comment in the code with this context please?
Added comment here 801eae8
```diff
 }

-func setWhitelistedQuery(queryPath string, protoType codec.ProtoMarshaler) {
-	stargateWhitelist.Store(queryPath, protoType)
+func returnStargateResponseToPool(queryPath string, pb codec.ProtoMarshaler) {
```
This returns the protobuf object to its appropriate pool (based on the queryPath).
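For illustration, a minimal sketch of that return path, continuing the sketch above (whether the object is Reset before being pooled is my assumption, not confirmed by the diff):

```go
// returnStargateResponseToPool recycles a response object into the pool
// registered for its query path.
func returnStargateResponseToPool(queryPath string, pb codec.ProtoMarshaler) {
	// Assumed: clear previously unmarshalled fields so the next Get
	// starts from a zero value. Reset comes from proto.Message, which
	// codec.ProtoMarshaler embeds.
	pb.Reset()
	stargateResponsePools[queryPath].Put(pb)
}
```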
Transferring this into a comment would also be helpful IMO 🙏
Added comment here 2192082
```diff
 if !ok {
-	return nil, wasmvmtypes.Unknown{}
+	return nil, fmt.Errorf("failed to assert type to codec.ProtoMarshaler")
```
Do we need to use a wasmvmtypes error here?
Looking at the caller this seems fine, but wanted to flag to ensure I wasn't sneaking this in.
Spoke with Roman offline, looks fine.
This LGTM! I added a type assertion that doesn't return a wasmtype error; I'd like someone to ACK that this is okay.
Nice work!
Requesting additional comments and clarifications
```diff
-// The query can be multi-thread, so we have to use
-// thread safe sync.Map.
-var stargateWhitelist sync.Map
+// The query is multi-threaded so we're using a sync.Pool
+// to manage the allocation and de-allocation of newly created
+// pb objects.
+var stargateResponsePools = make(map[string]*sync.Pool)
```
Trying to understand: is the main reason `sync.Pool` works and `sync.Map` doesn't that the former allocates new objects for concurrent requests?
Basically, the sync.Map was keeping the map itself safe, which was not needed, since after init the map is read-only.

The value of the map was a pointer to a protobuf object, so we had a `Map[K, *V]`, and simulate + DeliverTx (which shared the same string request path) were all editing the same pointer, meaning they were concurrently editing the same variable underneath (not the map, but the value associated with the key in that map).

What were we using that value for? To unmarshal a stargate query response into a protobuf object that we then marshal back as JSON for CosmWasm contracts.

What this map of sync.Pools does is provide a way to create new objects matching a specific gRPC query response type (creation = allocation), and when we're done with them we put them back in the pool so we can use them again without allocating any more. sync.Pool takes care of de-allocating them when they're no longer needed, so we do not have to worry.

Hope this clarifies the issue.
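To make the race concrete, here is an illustrative sketch (not PR code; `QueryBalanceResponse` is just a stand-in example of a generated response type) of what sharing one pointer across goroutines looks like:

```go
package main

import (
	"encoding/json"

	banktypes "github.com/cosmos/cosmos-sdk/x/bank/types"
)

// One pointer shared by every caller: simulate and DeliverTx goroutines
// hitting the same request path all unmarshal into this object.
var shared = &banktypes.QueryBalanceResponse{}

func handleStargateResponse(bz []byte) ([]byte, error) {
	// Concurrent callers race on shared's fields here.
	if err := shared.Unmarshal(bz); err != nil {
		return nil, err
	}
	// This may serialize a half-overwritten value, producing different
	// JSON on different nodes, i.e. nondeterminism.
	return json.Marshal(shared)
}
```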
My understanding is yes: the latter requires a pointer and shares the same struct to unmarshal into, whereas this creates a new object for each request. Utilizing sync.Pool lets us recycle the object once the request completes, so it doesn't have a large impact on performance.
Exactly! We could have used a `map[string]func() codec.ProtoMarshaler`, where the string is the request path and `func() codec.ProtoMarshaler` returns a freshly created protobuf object to be used as the target for response unmarshalling, but this could cause GC overhead in concurrent scenarios, as each object is created and then immediately needs to be GC'd (e.g. during a lot of concurrent sims).

So sync.Pool simply allows us to recycle unused objects instead of forcing the GC to de-allocate them immediately.
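For comparison, the factory-map alternative mentioned here might look like this sketch (illustrative registration only; the query path and response type are examples, not PR code):

```go
package wasmbinding

import (
	"github.com/cosmos/cosmos-sdk/codec"
	banktypes "github.com/cosmos/cosmos-sdk/x/bank/types"
)

// A factory per request path: correct under concurrency, but every call
// allocates a response that becomes garbage as soon as the query ends.
var responseFactories = map[string]func() codec.ProtoMarshaler{
	"/cosmos.bank.v1beta1.Query/Balance": func() codec.ProtoMarshaler {
		return &banktypes.QueryBalanceResponse{}
	},
}
```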
testinginprod's explanation is much more thorough, thanks.
```diff
-protoResponseType, ok := protoResponseAny.(codec.ProtoMarshaler)
+protoMarshaler, ok := protoResponseAny.Get().(codec.ProtoMarshaler)
```
I'm wondering whether this cast is the primary source of the issue?
Unrelated: #7346 (comment)
… protobuf objects (#7346)

* use a sync pool when unmarshalling responses of protobuf objects in StargateQueries
* fix uninitted pool
* type assertion and lints
* changelog
* add comment for returnStargateResponseToPool
* add setWhitelistedQuery comment
* lint

---------

Co-authored-by: unknown unknown <unknown@unknown>
Co-authored-by: Adam Tucker <[email protected]>
(cherry picked from commit 2caa5c6)

# Conflicts:
#	CHANGELOG.md
… protobuf objects (backport #7346) (#7349)

* fix(StargateQueries): use a sync pool when unmarshalling responses of protobuf objects (#7346)

  * use a sync pool when unmarshalling responses of protobuf objects in StargateQueries
  * fix uninitted pool
  * type assertion and lints
  * changelog
  * add comment for returnStargateResponseToPool
  * add setWhitelistedQuery comment
  * lint

  ---------

  Co-authored-by: unknown unknown <unknown@unknown>
  Co-authored-by: Adam Tucker <[email protected]>
  (cherry picked from commit 2caa5c6)

  # Conflicts:
  #	CHANGELOG.md

* changelog

---------

Co-authored-by: testinginprod <[email protected]>
Co-authored-by: Adam Tucker <[email protected]>
This fix looks really good (and better than the naive solution of copying and creating a new allocation on each use). Great find too! GG @czarcas7ic and @testinginprod
Closes: #XXX
What is the purpose of the change
This PR uses a sync pool to unmarshal responses of protobuf objects in stargate queries.
We were previously reusing shared pointers as unmarshalling targets, which under heavy concurrent load can result in nondeterminism.
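For reviewers skimming the description, a condensed sketch of the fixed query path under the assumptions above (helper names mirror the diff hunks quoted in this thread, and the function reuses `stargateResponsePools` and `returnStargateResponseToPool` as sketched earlier; this is not the verbatim PR code):

```go
package wasmbinding

import (
	"fmt"

	"github.com/cosmos/cosmos-sdk/codec"
)

// handleStargateQuery is a condensed, hypothetical view of the fixed flow.
func handleStargateQuery(cdc codec.Codec, path string, respBz []byte) ([]byte, error) {
	pool, ok := stargateResponsePools[path]
	if !ok {
		return nil, fmt.Errorf("query path %s is not whitelisted", path)
	}
	protoResponse, ok := pool.Get().(codec.ProtoMarshaler)
	if !ok {
		return nil, fmt.Errorf("failed to assert type to codec.ProtoMarshaler")
	}
	// No matter what happens after this point, the object must go back
	// to the pool so later queries can reuse it.
	defer returnStargateResponseToPool(path, protoResponse)

	// Each goroutine unmarshals into an object it owns exclusively,
	// then re-encodes it as JSON for the CosmWasm contract.
	if err := cdc.Unmarshal(respBz, protoResponse); err != nil {
		return nil, err
	}
	return cdc.MarshalJSON(protoResponse)
}
```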
Testing and Verifying
This code was backported to v21 and tested against mainnet.
Previously, we were able to cause app hash mismatches on nodes within 10 minutes of spam. With this change, the node has been running for 1 hour with no issues.