bug: concurrent commit and RPC Simulate call causes a data race in IAVL #1356

p0mvn · 2022-04-26T19:51:23Z

Background

There was a fatal error observed on v7.1.0. On initial investigation, it is being caused by concurrently committing and calling the following RPC.

This happened at least 3 times over the past 3 days

Logs: concurrent_rw_panic.log

Acceptance Criteria

investigate and synchronize and fix the problem
create an e2e test to test the fix by running a chain and calling the Simulate RPC several times

p0mvn · 2022-04-26T20:07:42Z

SDK tag used on v7.1.0: v0.45.1-0.20220327212402-9f71a993afdc
IAVL tag: v0.17.3-osmo-v7

p0mvn · 2022-04-26T20:14:08Z

There was #137 that might be related. IAVL must be concurrent with this change present

p0mvn · 2022-04-26T20:18:01Z

More logs from different occurrence that are stemming from the same source:

The original report is from April 26

p0mvn · 2022-04-26T20:20:31Z

This node is used for wallets

vorotech · 2022-04-26T22:04:06Z

hi @p0mvn
Do you think the following is related?

If I'm "lucky" enough and hit the endpoint https://lcd-osmosis.blockapsis.com/cosmos/tx/v1beta1/txs?events=tx.height%3D3815464&pagination.offset=0 several times, I'm getting:

{
  "code": 8,
  "message": "grpc: received message larger than max (4589341 vs. 4194304)",
  "details": [
  ]
}

vorotech · 2022-04-26T22:08:25Z

Tbh, I think it might be unrelated issue, the gRPC client should have the grpcClient.MaxCallSendMsgSize() method called here (also missed in the original implementation)

p0mvn · 2022-04-26T22:21:20Z

Tbh, I think it might be unrelated issue, the gRPC client should have the grpcClient.MaxCallSendMsgSize() method called here (also missed in the original implementation)

@vorotech You're right. We have already updated this on the latest branch of the SDK fork. However, this has not been backported to v7.x branch yet.

We are planning a minor (v7.2.1) release soon and should backport osmosis-labs/cosmos-sdk#170 to it

CC: @czarcas7ic

catShaark · 2022-04-27T04:54:59Z

So I looked at the log, this is where the error showed up right ? Apr 26 18:43:17 ip-172-31-49-142 osmosisd_osmosis-1[333746]: fatal error: concurrent map iteration and map write

p0mvn · 2022-04-27T18:49:43Z

So I looked at the log, this is where the error showed up right ? Apr 26 18:43:17 ip-172-31-49-142 osmosisd_osmosis-1[333746]: fatal error: concurrent map iteration and map write

That's correct. It's happening because 2 goroutines are trying to iterate over this map and write to it concurrently.

One goroutine is accessing that IAVL map through Commit() in baseapp and another by calling this RPC

alexanderbez · 2022-04-27T22:44:19Z

Doesn't Tendermint place a lock on that ABCI flow though? Or is this lower level @p0mvn? What can be do in the SDK to synchronize these calls?

p0mvn · 2022-04-27T23:17:27Z

@alexanderbez essentially, this PR osmosis-labs/cosmos-sdk#137 removed the synchrony with the ABCI flow. The asynchrony is needed by integrators for handling the load. Otherwise, their queries are blocked at epoch / long commits

I know exactly where the problem is (IAVL - mutable_tree.go) and working on a fix as well as an e2e test to expose this in our suite. Should have an update in the next few days

p0mvn · 2022-04-27T23:21:12Z

SDK seems to be reading from a height in IAVL that hasn't been committed yet. We don't have any mutexes there because the assumption was that this should never happen. So the solution is to avoid reading from the uncommitted state. Instead, read from the latest commit height.

ValarDragon · 2022-04-28T06:53:56Z

Is the situation that queries don't have a height field included, and then the query responds with the currently processing height, whereas it really should be responding using the last committed height?

catShaark · 2022-04-28T12:18:34Z

Is the situation that queries don't have a height field included, and then the query responds with the currently processing height, whereas it really should be responding using the last committed height?

Yeah, The query uses the returned sdk.Context after calling getContextForTx( runTxModeSimulate, txBytes) which is the sdk.Context of current state, not the lastest height. Maybe we should change the function so that it uses createQueryContext to return sdk.Context of the latest height in case of runTxModeSimulate, like this ?

alexanderbez · 2022-04-28T13:57:39Z

I guess simulate should use the latest committed state.

p0mvn · 2022-04-28T16:59:43Z

Is the situation that queries don't have a height field included, and then the query responds with the currently processing height, whereas it really should be responding using the last committed height?

Yeah, The query uses the returned sdk.Context after calling getContextForTx( runTxModeSimulate, txBytes) which is the sdk.Context of current state, not the lastest height. Maybe we should change the function so that it uses createQueryContext to return sdk.Context of the latest height in case of runTxModeSimulate, like this ?

Thanks for the suggestion @catShaark . While I think that this may help mitigate the problem, I don't think it would be removing the root cause.

To illustrate, I would like to point out a change that I did in osmosis-labs/cosmos-sdk#100 when upgrading to fast storage:

By being able to iterate over the mutable_tree, the SDK can access uncommitted state when a query/simulate tx happens to arrive before we commit the latest height. When working on the PR above I wasn't aware of all of the access patterns and concurrency flows so might have caused this bug.

I think the following should be done:

Revert the Iterator and similar ReverseIterator change that allows iterating over the mutable tree with an uncommitted state.
In addition, we might want to synchronize the following 2 resources with a read/write locks:
https://github.com/osmosis-labs/iavl/blob/a324be8dc8bd24756adf2acb72e09e9d0c5ec5da/mutable_tree.go#L34-L35
- Given that we do 1), this might not be as important because there is no other flow I know of that may access these resources from multiple gorotuines
- If we end up doing that, the change should be heavily benchmarked.

So, to sum up, I think we can go ahead with the change suggested in 1). Wait with 2) until later. Please let me know if there is anything I'm missing and if anyone has more ideas

p0mvn · 2022-04-28T17:03:27Z

Is the situation that queries don't have a height field included, and then the query responds with the currently processing height, whereas it really should be responding using the last committed height?

Yes, that's the high-level description of this problem 👍

p0mvn · 2022-04-28T17:05:52Z

Also, I'm currently still working on e2e tests to expose this. If anyone is interested in trying out PRing a potential solution, I'm happy to discuss and guide

alexanderbez · 2022-04-28T18:49:17Z

Yes, @p0mvn your summary sounds like the correct approach. I should've caught that in review :(

ValarDragon · 2022-04-29T00:23:15Z

(1) makes sense as an immediate fix, though this does generally feel like the SDK is asking for something unsafe, and thats where the bug is. (What is the API usage its doing thats leading to an iterator over a mutable tree)

I think (2) being important would only come up once the SDK is writing things in parallel to that tree. But that is not the store architecture that it currently does, instead all writes to the IAVL store are done in sequence. (And all intermediate writes are buffered within the CacheKVStore)

catShaark · 2022-04-29T03:09:53Z

Also, I'm currently still working on e2e tests to expose this. If anyone is interested in trying out PRing a potential solution, I'm happy to discuss and guide

I want to help but I will very much need your guidance

p0mvn · 2022-05-01T20:02:36Z

Hi everyone, thanks for all the help and input on this so far.

this does generally feel like the SDK is asking for something unsafe, and that's where the bug is

I kept thinking about what @ValarDragon said, and I think that was right. The SDK was incorrectly requesting data. It ended up being more complex than just doing the change I suggested earlier in 1).

I arrived at this conclusion by first struggling with reproducing this bug via e2e tests. Then, @alexanderbez pointed out offline that this issue would be easier to reproduce by approaching it from the SDK.

I proceeded by doing so and discovered that #137 introduced several issues. Due to how some integration tests were set up and the lack of others, these issues were never caught. Once I reconfigured the setup to function as expected, I was able to observe many data races and other problems related to requesting specific heights in queries.

I made this PR with the fix, more unit, and integration tests. You can find a more detailed summary in the description.

Please let me know if that change is acceptable, and I can get a tag out for testing

ValarDragon · 2022-05-03T03:40:44Z

Ensuring I understand that PR correctly:

The bug before was the ABCI parallelization allowed someone to query a future height, which could cause a crash (on top of returning an incorrect result)
The behavior is changed so that a query to a future height will return an error. The intended behavior for anyone who runs into that, is to retry sending

If so, that seems good for us to roll into a new minor release to me. Thanks for working through that! I was suspicious of the GRPC desynchronization handling these things correctly when the PR was first made, wish I investigated it more at the time.

p0mvn · 2022-05-03T04:09:48Z

The behavior is changed so that a query to a future height will return an error. The intended behavior for anyone who runs into that, is to retry sending

I would also add that some queries like simulation and tendermint are changed to be always directed to the ABCI flow to avoid going through GRPC.

In the case of Tendermint queries, those are having issues with proto marshaling / unmarshalling when done via GRPC.

In the case of simulation, @alexanderbez and I discussed and thought that these are risky to do concurrently with the commit flow due to mimicking regular transaction logic. They have the potential for more unexpected concurrency problems. If throughput is reported to be problematic, we can experiment and change that in the future.

I have a PR with the fix to v7.x in #1390. Planning to do a v7.2.2 release tomorrow

p0mvn · 2022-05-16T14:56:27Z

This issue was still reported on v7.3.0

p0mvn · 2022-07-19T20:47:48Z

Haven't heard of this being a problem recently. Closing for now

p0mvn added this to Osmosis Chain Development Apr 26, 2022

p0mvn moved this to 🕒 Todo in Osmosis Chain Development Apr 26, 2022

p0mvn added the T:bug 🐛 Something isn't working label Apr 26, 2022

p0mvn self-assigned this Apr 27, 2022

daniel-farina moved this from 🕒 Todo to 🏃 In Progress in Osmosis Chain Development Apr 28, 2022

czarcas7ic mentioned this issue Apr 29, 2022

fix: revert iterator osmosis-labs/cosmos-sdk#212

Closed

p0mvn mentioned this issue Apr 30, 2022

fix: concurrency issues between simulation, grpc queries and ABCI commit flow osmosis-labs/cosmos-sdk#213

Merged

p0mvn moved this from 🏃 In Progress to 🔍 Needs Review in Osmosis Chain Development May 2, 2022

p0mvn mentioned this issue May 3, 2022

chore: upgrade sdk to v0.45.0x-osmo-v7.9 #1390

Merged

p0mvn mentioned this issue May 3, 2022

[E2E] identify more e2e tests specific for Osmosis and write up issues to add them #1202

Closed

2 tasks

p0mvn closed this as completed May 4, 2022

Repository owner moved this from 🔍 Needs Review to ✅ Done in Osmosis Chain Development May 4, 2022

p0mvn reopened this May 16, 2022

Repository owner moved this from Done ✅ to In Progress🏃 in Osmosis Chain Development May 16, 2022

p0mvn closed this as completed Jul 19, 2022

Repository owner moved this from In Progress🏃 to Done ✅ in Osmosis Chain Development Jul 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug: concurrent commit and RPC Simulate call causes a data race in IAVL #1356

bug: concurrent commit and RPC Simulate call causes a data race in IAVL #1356

p0mvn commented Apr 26, 2022 •

edited

Loading

p0mvn commented Apr 26, 2022 •

edited

Loading

p0mvn commented Apr 26, 2022

p0mvn commented Apr 26, 2022 •

edited

Loading

p0mvn commented Apr 26, 2022

vorotech commented Apr 26, 2022

vorotech commented Apr 26, 2022 •

edited

Loading

p0mvn commented Apr 26, 2022

catShaark commented Apr 27, 2022 •

edited

Loading

p0mvn commented Apr 27, 2022

alexanderbez commented Apr 27, 2022 •

edited

Loading

p0mvn commented Apr 27, 2022 •

edited

Loading

p0mvn commented Apr 27, 2022

ValarDragon commented Apr 28, 2022

catShaark commented Apr 28, 2022 •

edited

Loading

alexanderbez commented Apr 28, 2022

p0mvn commented Apr 28, 2022 •

edited

Loading

p0mvn commented Apr 28, 2022

p0mvn commented Apr 28, 2022

alexanderbez commented Apr 28, 2022

ValarDragon commented Apr 29, 2022

catShaark commented Apr 29, 2022 •

edited

Loading

p0mvn commented May 1, 2022

ValarDragon commented May 3, 2022 •

edited

Loading

p0mvn commented May 3, 2022 •

edited

Loading

p0mvn commented May 16, 2022

p0mvn commented Jul 19, 2022

bug: concurrent commit and RPC Simulate call causes a data race in IAVL #1356

bug: concurrent commit and RPC Simulate call causes a data race in IAVL #1356

Comments

p0mvn commented Apr 26, 2022 • edited Loading

p0mvn commented Apr 26, 2022 • edited Loading

p0mvn commented Apr 26, 2022

p0mvn commented Apr 26, 2022 • edited Loading

p0mvn commented Apr 26, 2022

vorotech commented Apr 26, 2022

vorotech commented Apr 26, 2022 • edited Loading

p0mvn commented Apr 26, 2022

catShaark commented Apr 27, 2022 • edited Loading

p0mvn commented Apr 27, 2022

alexanderbez commented Apr 27, 2022 • edited Loading

p0mvn commented Apr 27, 2022 • edited Loading

p0mvn commented Apr 27, 2022

ValarDragon commented Apr 28, 2022

catShaark commented Apr 28, 2022 • edited Loading

alexanderbez commented Apr 28, 2022

p0mvn commented Apr 28, 2022 • edited Loading

p0mvn commented Apr 28, 2022

p0mvn commented Apr 28, 2022

alexanderbez commented Apr 28, 2022

ValarDragon commented Apr 29, 2022

catShaark commented Apr 29, 2022 • edited Loading

p0mvn commented May 1, 2022

ValarDragon commented May 3, 2022 • edited Loading

p0mvn commented May 3, 2022 • edited Loading

p0mvn commented May 16, 2022

p0mvn commented Jul 19, 2022

p0mvn commented Apr 26, 2022 •

edited

Loading

p0mvn commented Apr 26, 2022 •

edited

Loading

p0mvn commented Apr 26, 2022 •

edited

Loading

vorotech commented Apr 26, 2022 •

edited

Loading

catShaark commented Apr 27, 2022 •

edited

Loading

alexanderbez commented Apr 27, 2022 •

edited

Loading

p0mvn commented Apr 27, 2022 •

edited

Loading

catShaark commented Apr 28, 2022 •

edited

Loading

p0mvn commented Apr 28, 2022 •

edited

Loading

catShaark commented Apr 29, 2022 •

edited

Loading

ValarDragon commented May 3, 2022 •

edited

Loading

p0mvn commented May 3, 2022 •

edited

Loading