fix(StargateQueries): use a sync pool when unmarshalling responses of protobuf objects #7346
Conversation
```go
if err != nil {
	return nil, err
}
// No matter what happens after this point, we must return
// the response type to the pool.
defer returnStargateResponseToPool(request.Path, protoResponseType)
```
We need to return the object to the sync.Pool so it does not leak.
wasmbinding/stargate_whitelist.go (outdated)
```go
// The query is multi-threaded so we're using a sync.Pool
// to manage the allocation and de-allocation of newly created
// pb objects.
var stargateResponsePools map[string]*sync.Pool
```
We use a sync.Pool per proto response type so we do not allocate every time; this should provide relief to the GC in moments of high traffic.
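For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the Get/Put lifecycle of a `sync.Pool` (the `response` type is illustrative, not from this PR):

```go
package main

import (
	"fmt"
	"sync"
)

// response stands in for any generated protobuf response type.
type response struct{ Data string }

var pool = sync.Pool{
	// New is called only when the pool has no idle object to hand out.
	New: func() any { return new(response) },
}

func main() {
	resp := pool.Get().(*response) // reuse an idle object or allocate one
	resp.Data = "hello"
	fmt.Println(resp.Data)
	*resp = response{} // reset state before recycling
	pool.Put(resp)     // make it available to the next caller
}
```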
```diff
@@ -184,34 +185,48 @@ func init() {
 	setWhitelistedQuery("/osmosis.concentratedliquidity.v1beta1.Query/CFMMPoolIdLinkFromConcentratedPoolId", &concentratedliquidityquery.CFMMPoolIdLinkFromConcentratedPoolIdResponse{})
 }

-// GetWhitelistedQuery returns the whitelisted query at the provided path.
+// IsWhitelistedQuery returns an error if the query is not whitelisted.
 func IsWhitelistedQuery(queryPath string) error {
```
Exposed this method in place of getWhitelistedQuery to avoid exporting a function that can leak memory if not used properly.
```go
	codec.ProtoMarshaler
}

func setWhitelistedQuery[T any, PT protoTypeG[T]](queryPath string, _ PT) {
```
This creates a sync.Pool for the given protobuf object; we use generics so we can properly instantiate the object that queryPath expects as a response.
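For context, a sketch of how the pieces can fit together, assuming `protoTypeG` constrains `PT` to a pointer type implementing `codec.ProtoMarshaler` (the actual PR code may differ in detail):

```go
package wasmbinding

import (
	"sync"

	"github.com/cosmos/cosmos-sdk/codec"
)

// One pool per whitelisted query path (mirrors the var quoted above).
var stargateResponsePools = make(map[string]*sync.Pool)

// protoTypeG constrains PT to be exactly *T while also implementing
// codec.ProtoMarshaler, so new(T) yields a valid response object.
type protoTypeG[T any] interface {
	*T
	codec.ProtoMarshaler
}

// setWhitelistedQuery registers a pool whose New func allocates a fresh
// response of the concrete type whenever the pool runs empty.
func setWhitelistedQuery[T any, PT protoTypeG[T]](queryPath string, _ PT) {
	stargateResponsePools[queryPath] = &sync.Pool{
		New: func() any {
			return PT(new(T))
		},
	}
}
```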
Could you comment in the code with this context please?
Added comment here 801eae8
```diff
 }

-func setWhitelistedQuery(queryPath string, protoType codec.ProtoMarshaler) {
-	stargateWhitelist.Store(queryPath, protoType)
+func returnStargateResponseToPool(queryPath string, pb codec.ProtoMarshaler) {
```
This returns the protobuf object to its appropriate pool (based on the queryPath).
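For illustration, a minimal sketch of that return path, continuing the sketch above (whether the object is Reset before being pooled is my assumption, not confirmed by the diff):

```go
// returnStargateResponseToPool recycles a response object into the pool
// registered for its query path.
func returnStargateResponseToPool(queryPath string, pb codec.ProtoMarshaler) {
	// Assumed: clear previously unmarshalled fields so the next Get
	// starts from a zero value. Reset comes from proto.Message, which
	// codec.ProtoMarshaler embeds.
	pb.Reset()
	stargateResponsePools[queryPath].Put(pb)
}
```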
Transferring this into a comment would also be helpful IMO 🙏
Added comment here 2192082
```diff
 if !ok {
-	return nil, wasmvmtypes.Unknown{}
+	return nil, fmt.Errorf("failed to assert type to codec.ProtoMarshaler")
```
Do we need to use a wasmvmtypes error here?
Looking at the caller this seems fine, but wanted to flag to ensure I wasn't sneaking this in.
Spoke with Roman offline, looks fine.
This LGTM! I added a type assertion that doesn't return a wasmtype error; I'd like someone to ACK that this is okay.
Nice work!
Requesting additional comments and clarifications
```diff
-// The query can be multi-thread, so we have to use
-// thread safe sync.Map.
-var stargateWhitelist sync.Map
+// The query is multi-threaded so we're using a sync.Pool
+// to manage the allocation and de-allocation of newly created
+// pb objects.
+var stargateResponsePools = make(map[string]*sync.Pool)
```
Trying to understand: is the main reason `sync.Pool` works and `sync.Map` doesn't that the former allocates new objects for concurrent requests?
Basically, the sync.Map was keeping the map itself safe, which was not needed, since after init the map is read-only.

The value of the map was a pointer to a protobuf object, so we had a `Map[K, *V]`, and simulate + DeliverTx (which shared the same string request path) were all editing the same pointer, meaning they were concurrently editing the same variable underneath (not the map, but the value associated with the key in that map).

What were we using that value for? To unmarshal a stargate query response into a protobuf object that we then marshal back as JSON for CosmWasm contracts.

What this map of sync.Pools does is provide a way to create new objects matching a specific gRPC query response type (creation = allocation), and when we're done with them we put them back in the pool so we can use them again without allocating any more. sync.Pool takes care of de-allocating them when they're no longer needed, so we do not have to worry.

Hope this clarifies the issue.
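To make the race concrete, here is an illustrative sketch (not PR code; `QueryBalanceResponse` is just a stand-in example of a generated response type) of what sharing one pointer across goroutines looks like:

```go
package main

import (
	"encoding/json"

	banktypes "github.com/cosmos/cosmos-sdk/x/bank/types"
)

// One pointer shared by every caller: simulate and DeliverTx goroutines
// hitting the same request path all unmarshal into this object.
var shared = &banktypes.QueryBalanceResponse{}

func handleStargateResponse(bz []byte) ([]byte, error) {
	// Concurrent callers race on shared's fields here.
	if err := shared.Unmarshal(bz); err != nil {
		return nil, err
	}
	// This may serialize a half-overwritten value, producing different
	// JSON on different nodes, i.e. nondeterminism.
	return json.Marshal(shared)
}
```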
My understanding is yes: the latter requires a pointer and shares the same struct to unmarshal into, whereas this creates a new object for each request. Utilizing sync.Pool lets us recycle the object once the request completes, so it doesn't have a large impact on performance.
Exactly! We could have used a `map[string]func() codec.ProtoMarshaler`, where the string is the request path and `func() codec.ProtoMarshaler` returns a freshly created protobuf object to be used as the target for response unmarshalling, but this could cause GC overhead in concurrent scenarios, as each object is created and then immediately needs to be GC'd (e.g. during a lot of concurrent sims).

So sync.Pool simply allows us to recycle unused objects instead of forcing the GC to de-allocate them immediately.
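For comparison, the factory-map alternative mentioned here might look like this sketch (illustrative registration only; the query path and response type are examples, not PR code):

```go
package wasmbinding

import (
	"github.com/cosmos/cosmos-sdk/codec"
	banktypes "github.com/cosmos/cosmos-sdk/x/bank/types"
)

// A factory per request path: correct under concurrency, but every call
// allocates a response that becomes garbage as soon as the query ends.
var responseFactories = map[string]func() codec.ProtoMarshaler{
	"/cosmos.bank.v1beta1.Query/Balance": func() codec.ProtoMarshaler {
		return &banktypes.QueryBalanceResponse{}
	},
}
```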
testinginprod's explanation is much more thorough, thanks.
```diff
-protoResponseType, ok := protoResponseAny.(codec.ProtoMarshaler)
+protoMarshaler, ok := protoResponseAny.Get().(codec.ProtoMarshaler)
```
I'm wondering whether this cast is the primary source of the issue?
Unrelated: #7346 (comment)
… protobuf objects (#7346)

* use a sync pool when unmarshalling responses of protobuf objects in StargateQueries
* fix uninitted pool
* type assertion and lints
* changelog
* add comment for returnStargateResponseToPool
* add setWhitelistedQuery comment
* lint

---------

Co-authored-by: unknown unknown <unknown@unknown>
Co-authored-by: Adam Tucker <[email protected]>
(cherry picked from commit 2caa5c6)

# Conflicts:
#	CHANGELOG.md
… protobuf objects (backport #7346) (#7349)

* fix(StargateQueries): use a sync pool when unmarshalling responses of protobuf objects (#7346)

  * use a sync pool when unmarshalling responses of protobuf objects in StargateQueries
  * fix uninitted pool
  * type assertion and lints
  * changelog
  * add comment for returnStargateResponseToPool
  * add setWhitelistedQuery comment
  * lint

  ---------

  Co-authored-by: unknown unknown <unknown@unknown>
  Co-authored-by: Adam Tucker <[email protected]>
  (cherry picked from commit 2caa5c6)

  # Conflicts:
  #	CHANGELOG.md

* changelog

---------

Co-authored-by: testinginprod <[email protected]>
Co-authored-by: Adam Tucker <[email protected]>
This fix looks really good (and better than the naive solution of copying and creating a new allocation on each use). Great find too! GG @czarcas7ic and @testinginprod
Closes: #XXX
What is the purpose of the change
This PR uses a sync pool to unmarshal responses of protobuf objects in stargate queries.
We were previously reusing shared pointers as unmarshalling targets, which under heavy concurrent load can result in nondeterminism.
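For reviewers skimming the description, a condensed sketch of the fixed query path under the assumptions above (helper names mirror the diff hunks quoted in this thread, and the function reuses `stargateResponsePools` and `returnStargateResponseToPool` as sketched earlier; this is not the verbatim PR code):

```go
package wasmbinding

import (
	"fmt"

	"github.com/cosmos/cosmos-sdk/codec"
)

// handleStargateQuery is a condensed, hypothetical view of the fixed flow.
func handleStargateQuery(cdc codec.Codec, path string, respBz []byte) ([]byte, error) {
	pool, ok := stargateResponsePools[path]
	if !ok {
		return nil, fmt.Errorf("query path %s is not whitelisted", path)
	}
	protoResponse, ok := pool.Get().(codec.ProtoMarshaler)
	if !ok {
		return nil, fmt.Errorf("failed to assert type to codec.ProtoMarshaler")
	}
	// No matter what happens after this point, the object must go back
	// to the pool so later queries can reuse it.
	defer returnStargateResponseToPool(path, protoResponse)

	// Each goroutine unmarshals into an object it owns exclusively,
	// then re-encodes it as JSON for the CosmWasm contract.
	if err := cdc.Unmarshal(respBz, protoResponse); err != nil {
		return nil, err
	}
	return cdc.MarshalJSON(protoResponse)
}
```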
Testing and Verifying
This code was backported to v21 and tested against mainnet.
Previously, we were able to cause app hash mismatches on nodes within 10 minutes of spam. With this change, the node has been running for 1 hour with no issues.