Investigate and fix panics on fast iavl branch #892

Closed
3 tasks done
p0mvn opened this issue Feb 17, 2022 · 11 comments
Labels: T:bug 🐛 Something isn't working

p0mvn (Member) commented Feb 17, 2022

Background

Validators running the new fast node branch are hitting panics such as:
[screenshot: panic output reported by validators]

This is an issue to investigate and fix the problems reported.

Acceptance Criteria

  • Gather all information from the #validators channel on Discord and document it here
  • Investigate and document the steps taken here
  • If applicable, create a PR with the fix and update the 6.x branch
@p0mvn p0mvn moved this from 🔍 Needs Review to 🏃 In Progress in Osmosis Chain Development Feb 17, 2022
@p0mvn p0mvn added the T:bug 🐛 Something isn't working label Feb 17, 2022
UnityChaos (Member) commented:
Replicated some of this by using a misalignment between pruning and snapshot settings, so that the node would crash when attempting to prune an old height. On restart the node would rerun the previous block, but get a different result in the distribution.keeper.allocation.AllocateTokens function.
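For illustration, here is a minimal sketch of the conflict being described, using hypothetical types rather than the node's actual pruning/snapshot code: an in-progress snapshot still holds a read handle on a height that the pruner then tries to delete.

```go
package main

import (
	"errors"
	"fmt"
)

// stateStore is a stand-in for the node's versioned store (hypothetical type).
type stateStore struct {
	snapshotHeight int64 // height an in-progress snapshot is still reading
}

// prune fails when asked to delete the height the snapshot reader holds,
// which is roughly the crash hit when pruning and snapshot settings disagree.
func (s *stateStore) prune(height int64) error {
	if height == s.snapshotHeight {
		return errors.New("cannot delete a height held by an active snapshot reader")
	}
	return nil
}

func main() {
	s := &stateStore{snapshotHeight: 90}
	if err := s.prune(90); err != nil {
		fmt.Println("the real node panics here; in this sketch:", err)
	}
}
```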

First run, on v6.3.0 (crashes because it tries to delete a block being held for a snapshot):

[screenshot: v6.3.0, start of block 90]
[screenshot: v6.3.0, end of block 90, crash]

Restart after disabling snapshots. Note the different feesCollected amounts. This seems to be connected to the "insufficient funds" error when the new value is higher than the old value, and to the "was already saved to a different hash" error when the result is lower than the original.

[screenshot: v6.3.0, restart at block 90]
[screenshot: v6.3.0, finish of block 90, crash]


I then reran this process on v6.2.0, and things ran as expected.

First run through, crashing because of trying to delete a height held for a snapshot:

[screenshot: v6.2.0, start of block 40]
[screenshot: v6.2.0, middle of block 40]
[screenshot: v6.2.0, end of block 40]

Restart (after disabling snapshots), and it makes it through:

[screenshot: v6.2.0, restart at block 40]
[screenshot: v6.2.0, finish of block 40]


It's not clear to me what the cause is yet, but I would imagine it's some sort of incomplete state write that is not properly reverted during the crash, is still around when the node attempts to rerun the block, and so leads to different results.

It's also worth noting that the above situation was constructed to replicate this error, but my understanding is that node operators are hitting these errors during regular operation, not immediately after recovering from a crash as I am here.

p0mvn (Member, Author) commented Feb 18, 2022

From the most recent discussion, our guess is that the problem is with commit atomicity. Currently, each module has a separate iavl store. The commit is atomic within a tree; however, it is not atomic across trees or between modules. As a result, if an operator kills the node at the wrong time, the modules' stores may end up at different heights. Relevant SDK issue: cosmos/cosmos-sdk#6370

However, although this non-atomic commit between stores exists, our problem might still be something else, since the error "was already saved to a different hash" comes from a check within a single tree. More investigation is needed.
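As a rough sketch of that non-atomic, per-module commit (hypothetical types only, not the SDK's actual rootmulti store): each store commits its own tree in sequence, so killing the process mid-loop leaves the stores at different versions.

```go
package main

import "fmt"

// store is a stand-in for a per-module IAVL store (hypothetical type).
type store struct {
	name    string
	version int64
}

// commit bumps this store's version; within a single store the write is atomic.
func (s *store) commit() int64 {
	s.version++
	return s.version
}

func main() {
	stores := []*store{{name: "bank"}, {name: "distribution"}, {name: "staking"}}

	// Commit each store in turn. No transaction spans the loop: if the
	// process is killed after "bank" commits but before "staking" does,
	// the on-disk versions disagree, and replaying the same block runs
	// against partially written state.
	for _, s := range stores {
		fmt.Printf("committed %s at version %d\n", s.name, s.commit())
	}
}
```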

p0mvn (Member, Author) commented Feb 18, 2022

A suggestion on how to track inconsistencies across stores:
[screenshot: suggested approach for tracking inconsistencies across stores]

p0mvn (Member, Author) commented Feb 18, 2022

@UnityChaos @ValarDragon any other updates?

ValarDragon (Member) commented:
From what Unity was saying, the SDK's logical commit not being atomic is looking more likely to be the cause.

Copying from Unity's message:

okay, so yeah, after the crash on pruning, 6.3 (using the fast cache) is reading the value for the next height:

unity@pop-os:~$ osmosisd q bank balances osmo17xpfvakm2amg962yls6f84z3kell8c5lczssa0 --node=https://osmosis-1.technofractal.com:443 --height=3249089
balances:
- amount: "25939"
  denom: uosmo
pagination:
  next_key: null
  total: "0"
unity@pop-os:~$ osmosisd q bank balances osmo17xpfvakm2amg962yls6f84z3kell8c5lczssa0 --node=https://osmosis-1.technofractal.com:443 --height=3249090
balances:
- amount: "14240"
  denom: uosmo
pagination:
  next_key: null
  total: "0"

(When trying to replay block 3249090, 6.3 reads 14240 from the fast cache, whereas 6.2 reads 25939 from the commit multi store.)
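A toy sketch of why this can happen (hypothetical types, not the real fast-node index): if the fast index keeps only the latest value per key, with no version attached, then state written while executing the next block can survive the crash and be served during the replay, while the versioned store still answers with the older committed value.

```go
package main

import "fmt"

type fastIndex map[string][]byte           // latest value only, no versions
type versioned map[int64]map[string][]byte // committed state per height

func main() {
	key := "balances/osmo17xpfvakm2amg962yls6f84z3kell8c5lczssa0"

	// State committed through height 3249089.
	committed := versioned{3249089: {key: []byte("25939")}}
	fast := fastIndex{}

	// Block 3249090 executes and updates the fast index...
	fast[key] = []byte("14240")
	// ...but the node crashes before height 3249090 is fully committed.

	// On replay of 3249090, a fast-cache read returns the "future" value,
	// while reading through the committed store returns the old one.
	fmt.Printf("fast cache read:            %s\n", fast[key])
	fmt.Printf("committed read (h=3249089): %s\n", committed[3249089][key])
}
```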

ValarDragon (Member) commented:
Can we detect whether we're replaying a block vs. executing a new block? If we're replaying a block, then don't enable the fast cache?
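A minimal sketch of that idea (hypothetical helper, not an existing SDK/iavl API): treat any block at or below the latest committed version as a replay and fall back to the slower, versioned read path.

```go
package main

import "fmt"

// useFastCache reports whether the fast cache may be consulted for the block
// at executingHeight, given the latest committed store version (hypothetical).
func useFastCache(executingHeight, latestCommitted int64) bool {
	// A brand-new block is strictly one past the committed version; anything
	// else means we are re-running a block the store may already reflect.
	return executingHeight == latestCommitted+1
}

func main() {
	fmt.Println(useFastCache(3249090, 3249089)) // new block: fast cache OK
	fmt.Println(useFastCache(3249090, 3249090)) // replay: bypass fast cache
}
```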

p0mvn (Member, Author) commented Feb 18, 2022

I'm trying to see why replaying a block is breaking things. The fast nodes should still stay consistent with the actual nodes even if we are replaying.

p0mvn (Member, Author) commented Feb 19, 2022

PR with the fix: osmosis-labs/cosmos-sdk#115

p0mvn (Member, Author) commented Feb 19, 2022

We still need to address the "active reader" issue related to the misconfiguration between pruning and snapshots. Unity explained how to trigger that bug in the PR above.

p0mvn (Member, Author) commented Feb 19, 2022

This PR currently contains the latest changes related to this issue.

p0mvn (Member, Author) commented Feb 27, 2022

This issue seems to be resolved; closing.

@p0mvn p0mvn closed this as completed Feb 27, 2022
Repository owner moved this from 🏃 In Progress to ✅ Done in Osmosis Chain Development Feb 27, 2022