bug: concurrent commit and RPC Simulate call causes a data race in IAVL #1356
Comments
SDK tag used on
There was #137 that might be related. With that change present, IAVL must be safe for concurrent use.
More logs from a different occurrence stemming from the same source. The original report is from April 26.
This node is used for wallets.
hi @p0mvn If I'm "lucky" enough and hit the endpoint https://lcd-osmosis.blockapsis.com/cosmos/tx/v1beta1/txs?events=tx.height%3D3815464&pagination.offset=0 several times, I'm getting:
{
  "code": 8,
  "message": "grpc: received message larger than max (4589341 vs. 4194304)",
  "details": []
}
Tbh, I think it might be an unrelated issue; the gRPC client should have the grpcClient.MaxCallSendMsgSize() method called here (also missed in the original implementation).
@vorotech You're right. We have already updated this on the latest branch of the SDK fork. However, it has not been backported yet. We are planning a minor release. CC: @czarcas7ic
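For context, a minimal grpc-go sketch of how such call options raise the default 4 MiB limits; the address and the 16 MiB value are placeholders, not the SDK's actual configuration:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Raise the default 4 MiB send/receive limits so large responses, such as
	// the ~4.6 MB tx-by-events page above, are not rejected by the client.
	conn, err := grpc.Dial(
		"localhost:9090", // placeholder node gRPC address
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultCallOptions(
			grpc.MaxCallRecvMsgSize(16*1024*1024),
			grpc.MaxCallSendMsgSize(16*1024*1024),
		),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// conn can now be used with the generated query/service clients.
}
```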
So I looked at the log; this is where the error showed up, right?
That's correct. It's happening because two goroutines are trying to iterate over this map and write to it concurrently. One goroutine is accessing that IAVL map through the query path while the other writes to it during commit.
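To illustrate the failure mode, a standalone toy (not IAVL code): ranging over a Go map while another goroutine writes to it makes the runtime abort with the same class of fatal "concurrent map iteration and map write" error seen in the attached log.

```go
package main

// Toy reproduction: a reader iterating a map races with a writer mutating it,
// and the Go runtime aborts with
// "fatal error: concurrent map iteration and map write".
func main() {
	m := map[int]int{0: 0}

	// Writer: stands in for the commit path adding entries to the shared map.
	go func() {
		for i := 1; ; i++ {
			m[i] = i
		}
	}()

	// Reader: stands in for the query path iterating the same map.
	for {
		for k, v := range m {
			_, _ = k, v
		}
	}
}
```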
Doesn't Tendermint place a lock on that ABCI flow though? Or is this lower level, @p0mvn? What can be done in the SDK to synchronize these calls?
@alexanderbez essentially, this PR osmosis-labs/cosmos-sdk#137 removed the synchrony with the ABCI flow. The asynchrony is needed by integrators for handling load; otherwise, their queries are blocked at epochs / long commits. I know exactly where the problem is in IAVL.
The SDK seems to be reading from a height in IAVL that hasn't been committed yet. We don't have any mutexes there because the assumption was that this should never happen. So the solution is to avoid reading from the uncommitted state and instead read from the latest committed height.
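A rough sketch of that idea (a hypothetical helper, not the actual SDK change): clamp the requested height to the last committed one before touching the store.

```go
// clampQueryHeight is a hypothetical helper: if the query asks for "latest"
// (height 0) or for a height that has not been committed yet, serve it from
// the last committed height instead of the in-progress one.
func clampQueryHeight(requested, lastCommitted int64) int64 {
	if requested == 0 || requested > lastCommitted {
		return lastCommitted
	}
	return requested
}
```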
Is the situation that queries don't have a height field included, and then the query responds with the currently processing height, whereas it really should be responding using the last committed height? |
Yeah, the query uses the returned height (the currently processing one).
I guess simulate should use the latest committed state. |
Thanks for the suggestion @catShaark. While I think this may help mitigate the problem, I don't think it would remove the root cause. To illustrate, I would like to point out a change that I made in osmosis-labs/cosmos-sdk#100 when upgrading to fast storage: being able to iterate over that uncommitted in-memory map is what exposes this race. I think the following should be done:
1) In the SDK, avoid reading from uncommitted state; read from the latest committed height instead.
2) Add synchronization around the mutable tree in IAVL so concurrent reads and writes cannot race.
So, to sum up, I think we can go ahead with the change suggested in 1) and hold off on 2) until later. Please let me know if there is anything I'm missing and if anyone has more ideas.
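For reference, a rough sketch of what 2) could look like (assumed types, not IAVL's actual structures): guard the mutable tree's unsaved-changes map with a sync.RWMutex so iteration and commit-time writes cannot race.

```go
import "sync"

// guardedUnsavedChanges is an assumed stand-in for the mutable tree's
// unsaved-changes map, protected by an RWMutex.
type guardedUnsavedChanges struct {
	mtx     sync.RWMutex
	changes map[string][]byte
}

// Set is called on the write/commit path.
func (g *guardedUnsavedChanges) Set(key string, value []byte) {
	g.mtx.Lock()
	defer g.mtx.Unlock()
	g.changes[key] = value
}

// Range is called on the read/query path; it holds a read lock for the
// duration of the iteration.
func (g *guardedUnsavedChanges) Range(fn func(key string, value []byte)) {
	g.mtx.RLock()
	defer g.mtx.RUnlock()
	for k, v := range g.changes {
		fn(k, v)
	}
}
```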
Yes, that's the high-level description of this problem 👍 |
Also, I'm currently still working on e2e tests to expose this. If anyone is interested in PRing a potential solution, I'm happy to discuss and guide.
Yes, @p0mvn your summary sounds like the correct approach. I should've caught that in review :( |
(1) makes sense as an immediate fix, though this does generally feel like the SDK is asking for something unsafe, and that's where the bug is. (What is the API usage it's doing that's leading to an iterator over a mutable tree?) I think (2) would only become important once the SDK starts writing to that tree in parallel. But that is not the current store architecture; all writes to the IAVL store are done in sequence. (And all intermediate writes are buffered within the CacheKVStore.)
I want to help but I will very much need your guidance |
Hi everyone, thanks for all the help and input on this so far.
I kept thinking about what @ValarDragon said, and I think that assessment was right: the SDK was incorrectly requesting data. It ended up being more complex than just making the change I suggested earlier in 1).

I arrived at this conclusion by first struggling to reproduce this bug via e2e tests. Then @alexanderbez pointed out offline that this issue would be easier to reproduce by approaching it from the SDK. I did so and discovered that #137 introduced several issues. Due to how some integration tests were set up, and the lack of others, these issues were never caught. Once I reconfigured the setup to function as expected, I was able to observe many data races and other problems related to requesting specific heights in queries.

I made this PR with the fix, plus more unit and integration tests. You can find a more detailed summary in the description. Please let me know if that change is acceptable, and I can get a tag out for testing.
Ensuring I understand that PR correctly:
If so, that seems good to me to roll into a new minor release. Thanks for working through that! I was suspicious of the gRPC desynchronization handling these things correctly when the PR was first made; I wish I had investigated it more at the time.
I would also add that some queries, such as simulation and Tendermint queries, are changed to always be directed to the ABCI flow to avoid going through gRPC. Tendermint queries have issues with proto marshaling / unmarshaling when done via gRPC. For simulation, @alexanderbez and I discussed this and thought that it is risky to run concurrently with the commit flow because it mimics regular transaction logic, so it has the potential for more unexpected concurrency problems. If throughput is reported to be problematic, we can experiment and change that in the future. I have a PR with the fix.
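As an illustration of that routing decision (a simplified sketch, not the SDK's actual query router; the method prefixes shown are the standard Cosmos service paths):

```go
import "strings"

// routeQuery sketches the idea: simulation and Tendermint service queries go
// through the synchronized ABCI flow, while everything else may stay on the
// concurrent gRPC query path.
func routeQuery(path string) string {
	switch {
	case strings.HasPrefix(path, "/cosmos.tx.v1beta1.Service/Simulate"),
		strings.HasPrefix(path, "/cosmos.base.tendermint.v1beta1.Service/"):
		return "abci"
	default:
		return "grpc"
	}
}
```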
This issue was still reported on
Haven't heard of this being a problem recently. Closing for now |
Background
There was a fatal error observed on v7.1.0. On initial investigation, it is caused by concurrently committing and calling the Simulate RPC.
This happened at least 3 times over the past 3 days
Logs: concurrent_rw_panic.log
Acceptance Criteria