Borevent snapshot validation #9436

Merged
merged 52 commits into from
Aug 19, 2024

Conversation

mh0lt
Contributor

@mh0lt mh0lt commented Feb 13, 2024

This adds integrity checking for borevents in the following places:

  1. Stage Bor Heimdall - check that the event occurs within the expected time window
  2. Dump Events to snapshots - check that the event ids are continuous in and between blocks (a time window test will be added)
  3. Index event snapshots - check that the event ids are continuous in and between blocks
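
The continuity check in items 2 and 3 can be illustrated with a minimal sketch. It is not the code from this PR: the `blockEvents` type and the `checkContinuity` function are hypothetical names chosen for the example, and it assumes that event ids must increase by exactly one both within a block and from one block to the next.

```go
package main

import "fmt"

// blockEvents is a hypothetical pairing of a block number with the ids of
// the events assigned to it. It is not an Erigon type.
type blockEvents struct {
	BlockNum uint64
	EventIDs []uint64
}

// checkContinuity returns an error at the first event id that does not
// directly follow its predecessor, within a block or across blocks.
// prevID is the last event id seen before the first block in the slice.
func checkContinuity(prevID uint64, blocks []blockEvents) error {
	for _, b := range blocks {
		for _, id := range b.EventIDs {
			if id != prevID+1 {
				return fmt.Errorf("block %d: expected event %d, got %d", b.BlockNum, prevID+1, id)
			}
			prevID = id
		}
	}
	return nil
}

func main() {
	// Continuous ids across two blocks: no error expected.
	ok := []blockEvents{{BlockNum: 16, EventIDs: []uint64{1, 2}}, {BlockNum: 32, EventIDs: []uint64{3}}}
	fmt.Println(checkContinuity(0, ok))

	// Event 3 is missing between the blocks: an error is expected.
	gap := []blockEvents{{BlockNum: 16, EventIDs: []uint64{1, 2}}, {BlockNum: 32, EventIDs: []uint64{4}}}
	fmt.Println(checkContinuity(0, gap))
}
```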

It also adds an external integrity checker which runs on snapshots checking for continuity and timeliness of events. This can be called using the following command:
erigon snapshots integrity --datadir=~/snapshots/bor-mainnet-patch-1 --from=45500000
(--to specifies an end block)

This also fixes the long-running issue with bor events causing unexpected failures during execution.

The problem that event validation uncovered was as follows:

The kv.BorEventNums mapping table currently keeps the mapping first event id->block. The code which produces bor-event snapshots uses this mapping to determine which events to put into the snapshot.

However, if no additional blocks have events by the time the block is stored in the snapshot, the snapshot creation code does not know which events to include, so it drops them.

This causes problems in two places:

  • RPC queries & re-execution from snapshots can't find these dropped events
  • Depending on purge timing these events may erroneously get inserted into future blocks

The code change in this PR fixes that bug.
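
To make the failure mode concrete, here is a small sketch of the situation described above. It is an illustration only and does not reproduce the Erigon code: `firstEventToBlock` is a hypothetical in-memory stand-in for a first event id->block table, and `eventRange` is a made-up helper. It assumes the upper bound of a block's events is taken from the first event id of the next block that has events.

```go
package main

import (
	"fmt"
	"sort"
)

// firstEventToBlock is a hypothetical stand-in for a table that maps the
// first event id of a block to that block's number.
type firstEventToBlock map[uint64]uint64

// eventRange returns the half-open id range [first, end) of the events that
// belong to blockNum. The end is only known when a later entry exists;
// bounded is false for the last entry, which is the case described above.
func eventRange(m firstEventToBlock, blockNum uint64) (first, end uint64, bounded, found bool) {
	ids := make([]uint64, 0, len(m))
	for id := range m {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })

	for i, id := range ids {
		if m[id] != blockNum {
			continue
		}
		if i+1 < len(ids) {
			return id, ids[i+1], true, true
		}
		// No later block with events: the upper bound is unknown.
		return id, 0, false, true
	}
	return 0, 0, false, false
}

func main() {
	m := firstEventToBlock{1: 16, 4: 32}
	fmt.Println(eventRange(m, 16)) // bounded by the first event of block 32
	fmt.Println(eventRange(m, 32)) // last entry: no upper bound available
}
```

In this sketch a caller that only handles the bounded case would skip the events of the last block, which matches the dropped events described in the PR text.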

It has been tested by running:

erigon --datadir=~/chains/e3/amoy --chain=amoy --bor.heimdall=https://heimdall-api-amoy.polygon.technology --db.writemap=false --txpool.disable --no-downloader --bor.milestone=false

with validation in place, and then confirmed by running the following:

erigon snapshots rm-all-state-snapshots --datadir=~/chains/e3/amoy
rm -r ~/chains/e3/amoy/chaindata/
erigon --datadir=~/chains/e3/amoy --chain=amoy --bor.heimdall=https://heimdall-api-amoy.polygon.technology --db.writemap=false --no-downloader --bor.milestone=false

To recreate the chain from snapshots.

It has also been checked with:

erigon snapshots integrity --datadir=~/chains/e3/amoy --check=NoBorEventGaps --failFast=true

@mh0lt mh0lt added the polygon label Feb 13, 2024
polygon/bor/bor.go (outdated)
@@ -220,19 +217,21 @@ func fetchAndWriteHeimdallStateSyncEvents(
continue
}

- if lastStateSyncEventID+1 != eventRecord.ID || eventRecord.ChainID != chainID || !eventRecord.Time.Before(to) {
+ if lastStateSyncEventID+1 != eventRecord.ID || eventRecord.ChainID != chainID ||
+ !(eventRecord.Time.After(from) && eventRecord.Time.Before(to)) {
Member

@taratorio taratorio Feb 13, 2024

I'd suggest to add a new unit test for this error path - it should be fairly easy to add.

You can have a look at TestBorHeimdallForwardPersistsStateSyncEvents and stagedsynctest.Harness.mockHeimdallClient. For this you will need to pass in a new function to h.heimdallClient.EXPECT().FetchStateSyncEvents(...).DoAndReturn().

A clean way to do this may be to add an optional MockFetchStateSyncEventsCallback to the HarnessCfg: if it is present, use it inside the DoAndReturn; otherwise use the current one.

Contributor Author

ok I'll look at this tomorrow.

for i := uint64(0); i < maxBlockNum; i += 10_000 {

if to != 0 && maxBlockNum > to {
maxBlockNum = 2
Member

out of curiosity - why is this necessary? is it worth leaving a comment?

Contributor Author

hmm - should say: maxBlockNum = to

Member

@mh0lt this typo still needs fixing

@AskAlexSharov
Collaborator

did try this "integrity" check, see:

EROR[02-20|04:59:03.985] [integrity] NoGapsInBorEvents: invalid event time block=25196544 event=1716410 time=2022-02-21T14:37:02+0000 expected="2022-02-21 14:38:38 +0000 UTC-2022-02-21 14:38:38 +0000 UTC"

for all events

@mh0lt
Contributor Author

mh0lt commented Feb 21, 2024

> did try this "integrity" check, see:
>
> EROR[02-20|04:59:03.985] [integrity] NoGapsInBorEvents: invalid event time block=25196544 event=1716410 time=2022-02-21T14:37:02+0000 expected="2022-02-21 14:38:38 +0000 UTC-2022-02-21 14:38:38 +0000 UTC"
>
> for all events

I'm only just testing this now. I need to validate that the check is actually correctly configured, which I'm talking through with Polygon. As there is no spec for the start of a window, it's going to take some adjustment to get right.

@mh0lt mh0lt added imp1 High importance and removed imp3 Low importance labels Aug 17, 2024
return err
}

fmt.Println("LAST Event", lastEventId, "BH", borHeimdallProgress, "B", bodyProgress)
Member

looks like leftover print from debugging? should we remove and/or use a logger?

"github.com/erigontech/erigon-lib/log/v3"
)

func TestOver50EventBlockFetch(t *testing.T) {
Member

@taratorio taratorio Aug 18, 2024

this test should maybe be removed since it calls an external service (https://heimdall-api.polygon.technology/) and may fail our CI. alternatively, we can add t.Skip with info on why we skip it and why it is left behind in the code base

@@ -131,3 +131,29 @@ type StateSyncEventResponse struct {
Height string `json:"height"`
Result EventRecordWithTime `json:"result"`
}

var methodId []byte = borabi.StateReceiverContractABI().Methods["commitState"].ID
Member

[]byte type can be omitted in the declaration


fmt.Println("LAST Event", lastEventId, "BH", borHeimdallProgress, "B", bodyProgress)

if bodyProgress > borHeimdallProgress {
Member

this if and for loop don't seem to do anything?
probably the whole if db != nil branch above on line 89 can be removed?

@mh0lt mh0lt merged commit e4eb9fc into main Aug 19, 2024
10 checks passed
@mh0lt mh0lt deleted the borevent_snapshot_validation branch August 19, 2024 08:23
Labels
imp1 High importance polygon
3 participants