Investigate and fix panics on fast iavl branch #892

Closed
3 tasks done
p0mvn opened this issue Feb 17, 2022 · 11 comments
Labels: T:bug 🐛 Something isn't working

p0mvn (Member) commented Feb 17, 2022

Background

Validators running the new fast node branch are hitting panics such as:
[screenshot: panic output reported by validators]

This is an issue to investigate and fix the problems reported.

Acceptance Criteria

  • Gather all information from the #validators channel on Discord and document it here
  • Investigate and document the steps taken here
  • If applicable, create a PR with the fix and update the 6.x branch
@p0mvn p0mvn moved this from 🔍 Needs Review to 🏃 In Progress in Osmosis Chain Development Feb 17, 2022
@p0mvn p0mvn added the T:bug 🐛 Something isn't working label Feb 17, 2022
UnityChaos (Member) commented:
Replicated some of this by using a misalignment between pruning and snapshot settings, so that the node would crash when attempting to prune an old height. On restart the node would rerun the previous block, but get a different result in the distribution.keeper.allocation.AllocateTokens function.
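For illustration, here is a minimal sketch of the conflict being described, using hypothetical types rather than the node's actual pruning/snapshot code: an in-progress snapshot still holds a read handle on a height that the pruner then tries to delete.

```go
package main

import (
	"errors"
	"fmt"
)

// stateStore is a stand-in for the node's versioned store (hypothetical type).
type stateStore struct {
	snapshotHeight int64 // height an in-progress snapshot is still reading
}

// prune fails when asked to delete the height the snapshot reader holds,
// which is roughly the crash hit when pruning and snapshot settings disagree.
func (s *stateStore) prune(height int64) error {
	if height == s.snapshotHeight {
		return errors.New("cannot delete a height held by an active snapshot reader")
	}
	return nil
}

func main() {
	s := &stateStore{snapshotHeight: 90}
	if err := s.prune(90); err != nil {
		fmt.Println("the real node panics here; in this sketch:", err)
	}
}
```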

First run, on v6.3.0 (crashes because it tries to delete a block being held for a snapshot):

[screenshot: v6.3.0, start of block 90]
[screenshot: v6.3.0, end of block 90, crash]

Restart after disabling snapshots. Note the different feesCollected amounts. This seems to be connected to the "insufficient funds" error when the new value is higher than the old value, and to the "was already saved to a different hash" error when the result is lower than the original.

[screenshot: v6.3.0, restart at block 90]
[screenshot: v6.3.0, finish of block 90, crash]


I then reran this process on v6.2.0, and things ran as expected.

First run through, crashing because of trying to delete a height held for a snapshot:

[screenshot: v6.2.0, start of block 40]
[screenshot: v6.2.0, middle of block 40]
[screenshot: v6.2.0, end of block 40]

Restart (after disabling snapshots), and it makes it through:

[screenshot: v6.2.0, restart at block 40]
[screenshot: v6.2.0, finish of block 40]


It's not clear to me what the cause is yet, but I would imagine it's some sort of incomplete state write that is not properly reverted during the crash, is still around when the node attempts to rerun the block, and so leads to different results.

It's also worth noting that the above situation was constructed to replicate this error, but my understanding is that node operators are hitting these errors during regular operation, not immediately after recovering from a crash as I am here.

p0mvn (Member, Author) commented Feb 18, 2022

From the most recent discussion, our guess is that the problem is with commit atomicity. Currently, each module has a separate iavl store. The commit is atomic within a tree; however, it is not atomic across trees or between modules. As a result, if an operator kills the node at the wrong time, the modules' stores may end up at different heights. Relevant SDK issue: cosmos/cosmos-sdk#6370

However, although this non-atomic commit between stores exists, our problem might still be something else, since the error "was already saved to a different hash" comes from a check within a single tree. More investigation is needed.
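As a rough sketch of that non-atomic, per-module commit (hypothetical types only, not the SDK's actual rootmulti store): each store commits its own tree in sequence, so killing the process mid-loop leaves the stores at different versions.

```go
package main

import "fmt"

// store is a stand-in for a per-module IAVL store (hypothetical type).
type store struct {
	name    string
	version int64
}

// commit bumps this store's version; within a single store the write is atomic.
func (s *store) commit() int64 {
	s.version++
	return s.version
}

func main() {
	stores := []*store{{name: "bank"}, {name: "distribution"}, {name: "staking"}}

	// Commit each store in turn. No transaction spans the loop: if the
	// process is killed after "bank" commits but before "staking" does,
	// the on-disk versions disagree, and replaying the same block runs
	// against partially written state.
	for _, s := range stores {
		fmt.Printf("committed %s at version %d\n", s.name, s.commit())
	}
}
```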

p0mvn (Member, Author) commented Feb 18, 2022

A suggestion on how to track inconsistencies across stores:
[screenshot: suggested approach for tracking inconsistencies across stores]

p0mvn (Member, Author) commented Feb 18, 2022

@UnityChaos @ValarDragon any other updates?

ValarDragon (Member) commented:
From what Unity was saying, the SDK's logical commit not being atomic is looking more likely to be the cause.

Copying from Unity's message:

okay, so yeah, after the crash on pruning, 6.3 (using the fast cache) is reading the value for the next height:

unity@pop-os:~$ osmosisd q bank balances osmo17xpfvakm2amg962yls6f84z3kell8c5lczssa0 --node=https://osmosis-1.technofractal.com:443 --height=3249089
balances:
- amount: "25939"
  denom: uosmo
pagination:
  next_key: null
  total: "0"
unity@pop-os:~$ osmosisd q bank balances osmo17xpfvakm2amg962yls6f84z3kell8c5lczssa0 --node=https://osmosis-1.technofractal.com:443 --height=3249090
balances:
- amount: "14240"
  denom: uosmo
pagination:
  next_key: null
  total: "0"

(When trying to replay block 3249090, 6.3 reads 14240 from the fast cache, whereas 6.2 reads 25939 from the commit multi store.)
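A toy sketch of why this can happen (hypothetical types, not the real fast-node index): if the fast index keeps only the latest value per key, with no version attached, then state written while executing the next block can survive the crash and be served during the replay, while the versioned store still answers with the older committed value.

```go
package main

import "fmt"

type fastIndex map[string][]byte           // latest value only, no versions
type versioned map[int64]map[string][]byte // committed state per height

func main() {
	key := "balances/osmo17xpfvakm2amg962yls6f84z3kell8c5lczssa0"

	// State committed through height 3249089.
	committed := versioned{3249089: {key: []byte("25939")}}
	fast := fastIndex{}

	// Block 3249090 executes and updates the fast index...
	fast[key] = []byte("14240")
	// ...but the node crashes before height 3249090 is fully committed.

	// On replay of 3249090, a fast-cache read returns the "future" value,
	// while reading through the committed store returns the old one.
	fmt.Printf("fast cache read:            %s\n", fast[key])
	fmt.Printf("committed read (h=3249089): %s\n", committed[3249089][key])
}
```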

ValarDragon (Member) commented:
Can we detect whether we're replaying a block vs. executing a new block? If we're replaying a block, then don't enable the fast cache?
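A minimal sketch of that idea (hypothetical helper, not an existing SDK/iavl API): treat any block at or below the latest committed version as a replay and fall back to the slower, versioned read path.

```go
package main

import "fmt"

// useFastCache reports whether the fast cache may be consulted for the block
// at executingHeight, given the latest committed store version (hypothetical).
func useFastCache(executingHeight, latestCommitted int64) bool {
	// A brand-new block is strictly one past the committed version; anything
	// else means we are re-running a block the store may already reflect.
	return executingHeight == latestCommitted+1
}

func main() {
	fmt.Println(useFastCache(3249090, 3249089)) // new block: fast cache OK
	fmt.Println(useFastCache(3249090, 3249090)) // replay: bypass fast cache
}
```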

p0mvn (Member, Author) commented Feb 18, 2022

I'm trying to see why replaying a block is breaking things. The fast nodes should still stay consistent with the actual nodes even if we are replaying.

p0mvn (Member, Author) commented Feb 19, 2022

PR with the fix: osmosis-labs/cosmos-sdk#115

p0mvn (Member, Author) commented Feb 19, 2022

We still need to address the "active reader" issue related to the misconfiguration between pruning and snapshots. Unity explained how to trigger that bug in the PR above.

p0mvn (Member, Author) commented Feb 19, 2022

This PR currently contains the latest changes related to this issue.

p0mvn (Member, Author) commented Feb 27, 2022

This issue seems to be resolved; closing.

@p0mvn p0mvn closed this as completed Feb 27, 2022
Repository owner moved this from 🏃 In Progress to ✅ Done in Osmosis Chain Development Feb 27, 2022