Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(dot/state/epoch, lib/babe): enable block production through epochs without rely on finalization #2593

Merged

Conversation

EclesioMeloJunior
Copy link
Member

@EclesioMeloJunior EclesioMeloJunior commented Jun 9, 2022

Changes

  • While moving to the next epoch we were checking for data persisted only if the finalization is working. If the finalization is not working the data is not persisted and the gossamer node cannot produce/import more blocks in the next epoch.
  • Changed the initiateEpoch to use the BestBlockHeader in order to get the epoch data and define the slots
  • Changed the state/epoch/EpochState.GetConfigData method, if there is no config data for the current epoch retrieve the config data for the previous epoch.
  • Changed the functions getEpochDataFromMemory and getConfigDataFromMemory to be a unique generic function called retrieveFromMemory (both codes are similar just changing the map to lookup)

Tests

Spins up a gossamer (ALICE) node and a substrate (BOB) node with a GRANDPA authority set of 4 authorities, they will not be able to finalize however, they should be able to produce blocks through epochs.

Issues

Primary Reviewer

@noot

@EclesioMeloJunior EclesioMeloJunior changed the title Eclesio/fix/block production for epochs fix(dot/state/epoch, lib/babe): enable block production through epochs without rely on finalization Jun 9, 2022
@EclesioMeloJunior EclesioMeloJunior self-assigned this Jun 10, 2022
Comment on lines 394 to 404
// sometimes while moving to the next epoch is possible the header
// is not fully imported by the blocktree, in this case we will use
// its parent header which migth be already imported.
if errors.Is(err, blocktree.ErrEndNodeNotFound) {
parentHeader, err := es.blockState.GetHeader(header.ParentHash)
if err != nil {
return nil, fmt.Errorf("cannot get parent header: %w", err)
}

return retrieveFromMemory(nextEpochMap, es, epoch, parentHeader)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean the header is not fully imported? a block can only be imported or not imported afaik

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem I notice is that at the block where we should change to the next epoch, I got the error blocktree.ErrEndNodeNotFound on function chainProcessor.processBlockData.

This method (processBlockData) calls the method handleHeader which calls the babeVerifier.VerifyBlock(header), and the variable header contains the header value of the first block in the next epoch. The VerifyBlock method (lib/babe/verify.go) calls the method getVerifierInfo which calls the epochState.GetEpochData using the variables header and epoch.

The epochState.GetEpochData did not find anything in the database for epoch 1 so we look up in the memory map using the header and that is when the problem happens, the block state is not aware of the header (first block in the next epoch) and return: end node does not exist.

The fully error is:

2022-06-08T18:30:12+05:30 EROR block data processing for block with hash 0xd671d24ff29cf990010bfc98a54981e06079ad71f900f86881e30b186ea5d158 failed: could not verify block: failed to get verifier info for block 23: failed to get epoch data for epoch 1: failed to get epoch data from memory: cannot verify the ancestry: end node does not exist	chain_processor.go:L84	pkg=sync

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that makes sense, this seems like a decent fix, but maybe we should just pass the parent hash into this function instead, since the parent would always be imported already (i think). what do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@noot yeah, it can work but we should use the parent header only if we got a problem with the current header, passing the parent header directly can cause the same problem leading us to use the parent of the parent. So keeping this implementation we can handle cases where the block is not in the block tree but some ancestor is and retrieves the right config/epoch data.

lib/babe/epoch.go Outdated Show resolved Hide resolved
Copy link
Contributor

@noot noot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good so far, I'm surprised it didn't work previously actually as we were already storing the digests in memory if not finalized. but if this fixes it that's good!

@EclesioMeloJunior EclesioMeloJunior marked this pull request as ready for review June 13, 2022 14:59
dot/state/epoch.go Outdated Show resolved Hide resolved
dot/state/epoch.go Outdated Show resolved Hide resolved
dot/state/epoch.go Show resolved Hide resolved
dot/state/epoch.go Outdated Show resolved Hide resolved
dot/state/epoch.go Outdated Show resolved Hide resolved
dot/state/epoch.go Outdated Show resolved Hide resolved
lib/babe/epoch.go Outdated Show resolved Hide resolved
Copy link
Contributor

@timwu20 timwu20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we changing chain/dev/genesis-spec.json?

Copy link
Contributor

@timwu20 timwu20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we changing chain/dev/genesis-spec.json?

dot/state/epoch.go Outdated Show resolved Hide resolved
Copy link
Contributor

@kishansagathiya kishansagathiya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good Stuff 🎉

pending

  • on fixing tests and
  • reverting the genesis-spec file

Otherwise this looks pretty good!!

dot/state/epoch.go Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Jul 1, 2022

Codecov Report

Merging #2593 (89d544e) into development (7eede9a) will increase coverage by 0.06%.
The diff coverage is 28.04%.

@@               Coverage Diff               @@
##           development    #2593      +/-   ##
===============================================
+ Coverage        62.09%   62.16%   +0.06%     
===============================================
  Files              216      216              
  Lines            28751    28693      -58     
===============================================
- Hits             17853    17837      -16     
+ Misses            9129     9091      -38     
+ Partials          1769     1765       -4     
Flag Coverage Δ
unit-tests 62.16% <28.04%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
dot/state/epoch.go 50.97% <20.00%> (+2.34%) ⬆️
lib/babe/epoch.go 25.66% <42.30%> (-2.22%) ⬇️
lib/babe/verify.go 91.23% <100.00%> (-0.24%) ⬇️
dot/network/message.go 65.80% <0.00%> (-2.60%) ⬇️
lib/runtime/wasmer/imports.go 48.20% <0.00%> (-0.20%) ⬇️
dot/network/service.go 57.00% <0.00%> (+0.46%) ⬆️
lib/trie/database.go 51.55% <0.00%> (+0.93%) ⬆️
dot/network/host.go 65.37% <0.00%> (+1.06%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7eede9a...89d544e. Read the comment docs.

timwu20 and others added 2 commits July 4, 2022 14:21
* create nextEpochMap type

* remove unnecessary cast
Copy link
Contributor

@timwu20 timwu20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like CI is still failing?

@EclesioMeloJunior EclesioMeloJunior merged commit a0a1804 into development Jul 6, 2022
@EclesioMeloJunior EclesioMeloJunior deleted the eclesio/fix/block-production-for-epochs branch July 6, 2022 17:23
github-actions bot pushed a commit that referenced this pull request Nov 23, 2022
# [0.7.0](v0.6.0...v0.7.0) (2022-11-23)

### Bug Fixes

* **chain:** update ed25519 addresses in dev/gssmr genesis files ([#2225](#2225)) ([5f47d8b](5f47d8b))
* **ci:** caching of Go caches ([#2451](#2451)) ([ce3c10c](ce3c10c))
* **ci:** codecov.yml configuration ([#2698](#2698)) ([d4fc383](d4fc383))
* **ci:** comment skip code for required workflows ([#2312](#2312)) ([45dce9b](45dce9b))
* **ci:** copyright workflow to exit if different files ([#2487](#2487)) ([89c32ae](89c32ae))
* **ci:** deepsource toml configuration ([#2744](#2744)) ([86a70de](86a70de))
* **ci:** embed v0.9.20 runtime, update test suite, and ci workflows ([#2543](#2543)) ([0fff418](0fff418)), closes [#2419](#2419) [#2561](#2561) [#2572](#2572) [#2581](#2581) [#2671](#2671)
* **ci:** fix staging Dockerfile ([#2474](#2474)) ([ae04b80](ae04b80))
* **ci:** mocks checking fixes ([#2274](#2274)) ([d1308e0](d1308e0))
* **ci:** run devnet module unit tests ([#2756](#2756)) ([f635c59](f635c59))
* **ci:** run golangci-lint on integration tests ([#2275](#2275)) ([3ae3401](3ae3401))
* **cmd:** allow --genesis flag to be passed to base command ([#2427](#2427)) ([7f5b5aa](7f5b5aa))
* **cmd:** avoid nil pointer dereference ([#2578](#2578)) ([f2cdfea](f2cdfea))
* **config:** temporary fix for pprof enabled setting precedence ([#2786](#2786)) ([d4d6262](d4d6262))
* **core:** fix txn pool for latest runtime ([#2809](#2809)) ([1551e66](1551e66))
* **deps:** upgrade chaindb to remove badger logs ([#2738](#2738)) ([e0c5706](e0c5706))
* **devnet:** Fix build workflow for devnet ([#2125](#2125)) ([0375fc2](0375fc2))
* **Dockerfile:** remove script entrypoint ([#2707](#2707)) ([abd161b](abd161b))
* **dot/core:** `RuntimeInstance` interface `Version` signature ([#2783](#2783)) ([7d66ec0](7d66ec0))
* **dot/core:** fix the race condition in TrieState ([#2499](#2499)) ([804069c](804069c)), closes [#2402](#2402)
* **dot/digest:** BABE NextEpochData and NextConfigData should be set on finalization ([#2339](#2339)) ([e991cc8](e991cc8))
* **dot/digest:** verify if next epoch already contains some definition ([#2472](#2472)) ([a2ac6c2](a2ac6c2))
* **dot/netwok:** check for duplicate message earlier ([#2435](#2435)) ([d62503f](d62503f))
* **dot/network:** change BlockRequestMessage number from uint64 to uint32 ([8105cd4](8105cd4))
* **dot/network:** close notifications streams ([#2093](#2093)) ([de6e7c9](de6e7c9)), closes [#2046](#2046)
* **dot/network:** fixing errMissingHandshakeMutex ([#2303](#2303)) ([eb07a53](eb07a53))
* **dot/network:** memory improvement for network buffers ([#2233](#2233)) ([fd9b70d](fd9b70d))
* **dot/network:** public IP address logging ([#2140](#2140)) ([9e21587](9e21587))
* **dot/network:** re-add nil mutex check for disconnected peer ([#2408](#2408)) ([9b39bd1](9b39bd1))
* **dot/network:** remove `defer cancel()` inside loop ([#2248](#2248)) ([9e360a5](9e360a5))
* **dot/network:** resize bytes slice buffer if needed ([#2291](#2291)) ([8db8b2a](8db8b2a))
* **dot/peerset:** fix sending on closed channel race condition when dropping peer ([#2573](#2573)) ([2fa5d8a](2fa5d8a))
* **dot/peerset:** remove race conditions from `peerset` package ([#2267](#2267)) ([df09d45](df09d45))
* **dot/rpc/modules:** grandpa.proveFinality update parameters, fix bug ([#2576](#2576)) ([e7749cf](e7749cf))
* **dot/rpc/modules:** rpc.state.queryStorage fixed ([#2565](#2565)) ([1ec0d47](1ec0d47))
* **dot/rpc:** include unsafe flags to be considered by RPC layer ([#2483](#2483)) ([3822257](3822257))
* **dot/state/epoch, lib/babe:** enable block production through epochs without rely on finalization ([#2593](#2593)) ([a0a1804](a0a1804))
* **dot/state:** actually prune finalized tries from memory ([#2196](#2196)) ([e4bc375](e4bc375))
* **dot/state:** change map of tries implementation to have working garbage collection ([#2206](#2206)) ([fada46b](fada46b))
* **dot/state:** inject mutex protected tries to states ([#2287](#2287)) ([67a9bbb](67a9bbb))
* **dot/subscription:** check websocket message from untrusted data ([#2527](#2527)) ([1f20d98](1f20d98))
* **dot/subscription:** unsafe type casting from untrusted input ([#2529](#2529)) ([1015733](1015733))
* **dot/sync, dot/rpc:** implement HighestBlock ([#2195](#2195)) ([f8d8657](f8d8657))
* **dot/sync:** cleanup logs; don't log case where we fail to get parent while processing ([#2188](#2188)) ([cb360ab](cb360ab))
* **dot/sync:** fix "block with unknown header is ready" error ([#2191](#2191)) ([483466f](483466f))
* **dot/sync:** fix `Test_lockQueue_threadSafety` ([#2605](#2605)) ([223cfbb](223cfbb))
* **dot/sync:** Fix flaky tests `Test_chainSync_logSyncSpeed` and `Test_chainSync_start` ([#2610](#2610)) ([7e1014b](7e1014b))
* **dot/sync:** Gossip `BlockAnnounceMessage` only after  successfully imported ([#2885](#2885)) ([69031a6](69031a6))
* **dot/sync:** remove block announcement in `bootstrap` sync mode ([#2906](#2906)) ([2b4c257](2b4c257))
* **dot/sync:** remove max size limit from ascending block requests ([#2256](#2256)) ([e287d7e](e287d7e))
* **dot/sync:** sync benchmark ([#2234](#2234)) ([2f3aef8](2f3aef8))
* **dot/telemetry:** telemetry hashes to be in the hexadecimal format ([#2194](#2194)) ([9b48106](9b48106))
* **dot:** database close error checks ([#2948](#2948)) ([bdb0eea](bdb0eea))
* **dot:** no error logged for init check ([#2502](#2502)) ([2971325](2971325))
* ensure we convert the `uint` type ([#2626](#2626)) ([792e53f](792e53f))
* fix logger mutex locking in `.New` method ([#2114](#2114)) ([e7207ed](e7207ed))
* **internal/log:** log level `DoNotChange` ([#2672](#2672)) ([0008b59](0008b59))
* **levels-logged:** Fix log levels logging at start ([#2236](#2236)) ([a90a6e0](a90a6e0))
* **lib/babe:** check if authority index is in the `authorities` range ([#2601](#2601)) ([1072888](1072888))
* **lib/babe:** ensure the slot time is correct before build a block ([#2648](#2648)) ([78c03b6](78c03b6))
* **lib/babe:** epoch context error wrapping ([#2484](#2484)) ([c053dea](c053dea))
* **lib/babe:** Unrestricted Loop When Building Blocks (GSR-19) ([#2632](#2632)) ([139ad89](139ad89))
* **lib/blocktree:** reimplement `BestBlockHash` to take into account primary blocks in fork choice rule ([#2254](#2254)) ([1a368e2](1a368e2))
* **lib/grandpa:** avoid spamming round messages ([#2688](#2688)) ([b0042b8](b0042b8))
* **lib/grandpa:** capped number of tracked commit messages ([#2490](#2490)) ([47c23e6](47c23e6))
* **lib/grandpa:** capped number of tracked vote messages ([#2485](#2485)) ([d2ee47e](d2ee47e)), closes [#1531](#1531)
* **lib/grandpa:** check equivocatory votes count ([#2497](#2497)) ([014629d](014629d)), closes [#2401](#2401)
* **lib/grandpa:** Duplicate votes is GRANDPA are counted as equivocatory votes (GSR-11) ([#2624](#2624)) ([422e7b3](422e7b3))
* **lib/grandpa:** Storing Justification Allows Extra Bytes (GSR-13) ([#2618](#2618)) ([0fcde63](0fcde63))
* **lib/grandpa:** update grandpa protocol ID ([#2678](#2678)) ([3be75b2](3be75b2))
* **lib/grandpa:** various finality fixes, improves cross-client finality ([#2368](#2368)) ([c04d185](c04d185))
* **lib/grandpa:** verify equivocatory votes in grandpa justifications ([#2486](#2486)) ([368f8b6](368f8b6))
* **lib/runtime:** avoid caching version in runtime instance ([#2425](#2425)) ([7ab31f0](7ab31f0))
* **lib/runtime:** stub v0.9.17 host API functions ([#2420](#2420)) ([6a7b223](6a7b223))
* **lib/trie:** `handleDeletion` generation propagation ([24c303d](24c303d))
* **lib/trie:** `PopulateMerkleValues` functionality changes and fixes ([#2871](#2871)) ([7131290](7131290))
* **lib/trie:** Check for root in EncodeAndHash ([#2359](#2359)) ([087db89](087db89))
* **lib/trie:** Make sure writing and reading a trie to disk gives the same trie  and cover more store/load child trie related test cases ([#2302](#2302)) ([7cd4118](7cd4118))
* **lib/trie:** prepare trie nodes for mutation only when needed ([#2834](#2834)) ([26868df](26868df))
* **lib/trie:** remove map deletion at `loadProof` ([#2259](#2259)) ([fbd13d2](fbd13d2))
* **lint:** fix issues found by golangcilint 1.47.3 ([#2715](#2715)) ([5765e67](5765e67))
* **mocks:** add missing `//go:generate` for mocks ([#2273](#2273)) ([f4f7465](f4f7465))
* **pprof:** pprofserver flag changed to boolean ([#2205](#2205)) ([be00a69](be00a69))
* **staging:** revise datadog-agent start process ([#2935](#2935)) ([36ce37d](36ce37d))
* **state/epoch:** assign epoch 1 when block number is 0 ([#2592](#2592)) ([e5c8cf5](e5c8cf5))
* **state/grandpa:** track changes across forks ([#2519](#2519)) ([3ab76bc](3ab76bc))
* **tests:** `TestAuthorModule_HasSessionKeys_Integration` ([#2932](#2932)) ([8d809aa](8d809aa))
* **tests:** fix block body regex in `TestChainRPC` ([#2805](#2805)) ([b0680f8](b0680f8))
* **tests:** Fix RFC3339 regex for log unit tests ([9caea2a](9caea2a))
* **tests:** Fix wasmer flaky sorts ([#2643](#2643)) ([7eede9a](7eede9a))
* **tests:** handle node crash during waiting ([#2691](#2691)) ([843bd50](843bd50))
* **tests:** update block body regex in `TestChainRPC` ([#2674](#2674)) ([055e5c3](055e5c3))
* **trie:** decode inline child nodes ([#2369](#2369)) ([9efde47](9efde47))
* **trie:** descendants count for clear prefix ([#2606](#2606)) ([1826896](1826896))
* **trie:** disallow empty byte slice node values ([#2927](#2927)) ([d769d1c](d769d1c))
* **trie:** equality differentiate nil and empty storage values ([#2969](#2969)) ([72a08ec](72a08ec))
* **trie:** no in-memory caching of node encoding ([#2919](#2919)) ([856780b](856780b))
* **trie:** Panic when deleting nonexistent keys from trie (GSR-10) ([#2609](#2609)) ([7886318](7886318))
* **trie:** remove encoding buffers pool ([#2929](#2929)) ([f4074cc](f4074cc))
* **trie:** use cached Merkle values for root hash ([#2943](#2943)) ([ec2549a](ec2549a))
* **trie:** use direct Merkle value for database keys ([#2725](#2725)) ([1a3c3ae](1a3c3ae))
* upgrade auto-generated mocks ([#2910](#2910)) ([a2975a5](a2975a5))
* **wasmer:** error logs for signature verification ([#2752](#2752)) ([363c080](363c080))
* **wasmer:** fix flaky sort in `Test_ext_crypto_sr25519_public_keys_version_1` ([#2607](#2607)) ([c061b35](c061b35))

### Features

* **build:** add `github.com/breml/rootcerts` ([#2695](#2695)) ([c74a5b0](c74a5b0))
* **build:** binary built-in timezone data ([#2697](#2697)) ([fdd5bda](fdd5bda))
* **chain:** use always the raw genesis file ([#2775](#2775)) ([dd2fbc9](dd2fbc9))
* **ci:** update mockery from `2.10` to `2.14` ([#2642](#2642)) ([d2c42b8](d2c42b8))
* **cross-client:** create docker-compose.yml for local devnet  ([#2282](#2282)) ([8abbd87](8abbd87))
* detect chain directory dynamically ([#2292](#2292)) ([85c466c](85c466c))
* **devnet:** add substrate docker images to dockerfile  ([#2263](#2263)) ([b7b2a66](b7b2a66))
* **devnet:** continuous integration `gssmr` devnet on AWS ECS ([#2096](#2096)) ([d096d44](d096d44))
* **docker:** docker-compose.yml to run Gossamer, Prometheus and Grafana ([#2706](#2706)) ([c5dda51](c5dda51))
* **dot/network:** add mismatched genesis peer reporting ([#2265](#2265)) ([a1d7269](a1d7269))
* **dot/state:** `gossamer_storage_tries_cached_total` gauge metric ([#2272](#2272)) ([625cbcf](625cbcf))
* **e2e:** build Gossamer on any test run ([#2608](#2608)) ([f97e0ef](f97e0ef))
* **go:** upgrade Go from 1.17 to 1.18 ([#2379](#2379)) ([d85a1db](d85a1db))
* include nested varying data type on neighbor messages ([#2722](#2722)) ([426569a](426569a))
* **lib/babe:** implement secondary slot block production ([#2260](#2260)) ([fcb81a3](fcb81a3))
* **lib/runtime:** support Substrate WASM compression ([#2213](#2213)) ([fd60061](fd60061))
* **lib/trie:** atomic tracked merkle values ([#2876](#2876)) ([1c4174c](1c4174c))
* **lib/trie:** clear fields when node is dirty ([#2297](#2297)) ([1162828](1162828))
* **lib/trie:** only copy nodes when mutation is certain ([#2352](#2352)) ([86624cf](86624cf))
* **lib/trie:** opportunistic parallel hashing ([#2081](#2081)) ([790dfb5](790dfb5))
* **metrics:** replace metrics port with address (breaking change) ([#2382](#2382)) ([d2ec68d](d2ec68d))
* **pkg/scale:** add `Encoder` with `Encode` method ([#2741](#2741)) ([af5c63f](af5c63f))
* **pkg/scale:** add use of pkg/error Wrap for error handling ([#2708](#2708)) ([08c4281](08c4281))
* **pkg/scale:** encoding and decoding of maps in scale ([#2894](#2894)) ([405db51](405db51)), closes [#2796](#2796)
* **pkg/scale:** support for custom `VaryingDataType` types ([#2612](#2612)) ([914a747](914a747))
* remove uneeded runtime prefix logs ([#2110](#2110)) ([8bd05d1](8bd05d1))
* remove unused code ([#2677](#2677)) ([b3698d7](b3698d7))
* **scale:** add range checks to decodeUint function ([#2683](#2683)) ([ac700f8](ac700f8))
* **trie:** decode all inlined node variants ([#2611](#2611)) ([b09eb07](b09eb07))
* **trie:** export `LoadFromProof` ([#2455](#2455)) ([0b4f33d](0b4f33d))
* **trie:** faster header decoding ([#2649](#2649)) ([d9460e3](d9460e3))
* **trie:** finer deep copy of nodes ([#2384](#2384)) ([bd6d8e4](bd6d8e4))
* **trie:** tracking of number of descendant nodes for each node ([#2378](#2378)) ([dfcdd3c](dfcdd3c))
* **trie:** use scale encoder ([#2930](#2930)) ([e3dc108](e3dc108))
* **wasmer/crypto:** move sig verifier to crypto pkg ([#2057](#2057)) ([dc8bbef](dc8bbef))
* **wasmer:** Add `SetTestVersion` method to `Config` struct ([#2823](#2823)) ([e5c9336](e5c9336))
* **wasmer:** get and cache state version in instance context ([#2747](#2747)) ([3fd63db](3fd63db))
@github-actions
Copy link

🎉 This PR is included in version 0.7.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Gossamer cannot produce blocks in the next epoch if finalization stales
4 participants