Node recovery from stalled Stateproof chain #4056

id-ms · 2022-05-30T15:52:43Z

Summary

If the state proof chain is delayed, nodes can recover and "catch up" on state proofs that have not yet been confirmed. To make this possible, some resources are kept allocated for that purpose.

Nodes store the following resources and remove them only when a state proof is confirmed on-chain:
1 - Voters array, participation tree, and proven weight
2 - Builder's data for each state proof round
3 - Signatures on state proof messages

If the state proof chain is delayed for too long, these resources will never be released, which might lead to excessive memory consumption.

Test Plan

codecov · 2022-05-30T17:45:45Z

Codecov Report

Merging #4056 (2f912a1) into feature/stateproofs (145c065) will increase coverage by 0.14%.
The diff coverage is 94.11%.

❗ Current head 2f912a1 differs from pull request most recent head 5799646. Consider uploading reports for the commit 5799646 to get more accurate results

@@                   Coverage Diff                   @@
##           feature/stateproofs    #4056      +/-   ##
=======================================================
+ Coverage                54.69%   54.83%   +0.14%     
=======================================================
  Files                      396      396              
  Lines                    48958    48980      +22     
=======================================================
+ Hits                     26779    26860      +81     
+ Misses                   19935    19885      -50     
+ Partials                  2244     2235       -9

Impacted Files	Coverage Δ
crypto/stateproof/builder.go	`89.21% <ø> (ø)`
stateproof/worker.go	`90.00% <ø> (ø)`
stateproof/builder.go	`69.54% <91.42%> (+2.38%)`	⬆️
config/consensus.go	`85.75% <100.00%> (+0.04%)`	⬆️
ledger/internal/eval.go	`67.28% <100.00%> (ø)`
ledger/voters.go	`62.85% <100.00%> (+10.54%)`	⬆️
ledger/tracker.go	`70.56% <0.00%> (-3.90%)`	⬇️
catchup/peerSelector.go	`98.95% <0.00%> (-1.05%)`	⬇️
data/abi/abi_type.go	`87.67% <0.00%> (-0.95%)`	⬇️
network/wsNetwork.go	`65.27% <0.00%> (+0.47%)`	⬆️
... and 12 more

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 145c065...5799646. Read the comment docs.

algonautshant

Can you please update the PR title to be more specific?
It is not "Stateproof recovery", it is "Node recovery from stalled Stateproof chain".
The title is implying that the StateProof is recovered, while it is not.

algonautshant · 2022-06-15T16:00:23Z

config/consensus.go

+	// StateProofRecoveryInterval represents the number of state proof intervals that the network will try to catch-up with.
+	// When the difference between the latest state proof and the current round will be greater than value, Nodes will
+	// release resources allocated for creating state proofs.
+	StateProofRecoveryInterval uint64


Is it necessary for this to be a protocol parameter?
What is the harm when different nodes have different values for this?
Ideally, every node should set this to the highest value its memory can handle.

I think this should be a consensus param, since different values on different nodes might cause them to disagree on stateproof transactions (they might remove data used for the verification).
For example:
Assume node A stores 20 stateproof intervals back and node B stores 2 intervals back.
If both of them receive an old stateproof transaction, node A will be able to verify the transition since it has the provenWeight and the PartCommitment, but node B will not have them.
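
The node A / node B example can be sketched as a small simulation. This is only an illustration under stated assumptions: the `node` and `verificationContext` types, and the rule "keep the last N state proof rounds", are hypothetical and are not taken from go-algorand.

```go
package main

import "fmt"

// verificationContext is a hypothetical stand-in for the per-round data a
// node needs to verify a state proof (e.g. provenWeight and PartCommitment).
type verificationContext struct {
	provenWeight uint64
}

// node is a hypothetical model of a node that retains verification data for
// a fixed number of state proof intervals.
type node struct {
	name              string
	retainedIntervals uint64
	contexts          map[uint64]verificationContext
}

func newNode(name string, retainedIntervals uint64) *node {
	return &node{
		name:              name,
		retainedIntervals: retainedIntervals,
		contexts:          make(map[uint64]verificationContext),
	}
}

// observe records data for a state proof round and drops every entry that
// has fallen out of this node's retention window.
func (n *node) observe(spRound, interval uint64) {
	n.contexts[spRound] = verificationContext{provenWeight: spRound}
	window := n.retainedIntervals * interval
	for r := range n.contexts {
		if r+window <= spRound {
			delete(n.contexts, r)
		}
	}
}

// canVerify reports whether the node still holds the data for spRound.
func (n *node) canVerify(spRound uint64) bool {
	_, ok := n.contexts[spRound]
	return ok
}

func main() {
	const interval = 256
	a := newNode("A", 20) // stores 20 intervals back
	b := newNode("B", 2)  // stores 2 intervals back
	for i := uint64(1); i <= 20; i++ {
		a.observe(i*interval, interval)
		b.observe(i*interval, interval)
	}
	// An old state proof transaction for round 512 arrives at both nodes.
	for _, n := range []*node{a, b} {
		fmt.Printf("node %s can verify state proof for round 512: %v\n", n.name, n.canVerify(512))
	}
}
```

With different retention values the two nodes reach different answers for the same transaction, which is the disagreement the comment above is concerned about.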

@gmalouf @cce, We would like to hear your thoughts on this.

I think if you look at how we are handling the 320 round lookback effort, you will see us using consensus params in a similar situation; I am okay with this.

Here is why I think this parameter should be a configuration parameter, and not a consensus parameter:

This parameter decides how much memory a system can tolerate. This parameter needs to be set to a reasonably large value, to provide some tolerance in case the SP is delayed, and perhaps give enough time for intervention when something wrong happens.

However, a machine with a limited amount of memory will crash when memory usage goes up because of this parameter.

As a more concrete example, let's say there are two machines: one with 95% stake and a lot of memory, and another with 5% stake and limited memory. It is not a good idea to use the same value of this parameter for both machines.

The machines with high stake and a lot of memory can and should set a large value for this parameter, while others, with limited memory, can only tolerate a small value.

Therefore, I think this parameter should not be a consensus parameter.

@gmalouf @brianolson your thoughts.

StateProofInterval = 256 so StateProofRecoveryInterval = 10 means we need to store data stretching back 2560 rounds now? That's a big jump over the 1000 rounds of blocks that we currently store (and we're doing a big chunk of work to thin out that data and only store what we really need out of those 1000 blocks).

It sounds like you are pushing for an option C, Shant: make it configurable per node. I thought we were debating constant vs consensus param, neither of which is intended to be configurable per node.

Idan's comments to me offline imply an intent to keep the value identical on all nodes (and to keep it small). It sounds like this is worth a live discussion.

Yes, we need a live discussion for this, absolutely. And the value of 10, which is too much for Brian :-), is of very limited use if set that low.
Also, once a consensus parameter is added, it will be there forever!

@algoidan says:

A key observation in our discussion is that there are two parts of data needed for stateproof:
1 - data needed to create a stateproof transaction (participation array, bound of signatures, etc.)
2 - data needed to verify a stateproof transaction on-chain (this needs the block header from 256 rounds before the stateproof)

Currently, the new parameter affects both parts.
In my opinion, regarding the second part, all nodes must be synced on what blocks they have in order to verify a stateproof transaction. Having different parameter values might lead to disagreements.

@algonautshant replies:

This is the argument for having it as a consensus parameter. But I would say that, accounts will either vote yes, or no to the transaction/block. If enough "yes" votes are obtained, then all is good. The idea that all nodes must be "synced", is not going to happen.

In the event of a delayed SP, either all will say "No" (parameter too small), or the weak ones will crash, and others will say "Yes" (parameter big enough).

No. All nodes that have online accounts. This parameter will impact their ability to sign blocks with SP transaction in them.

I agree with Idan on (2). There cannot be disagreement over whether a transaction is valid. All nodes (whether relay, participating, non-participating) must be able to validate transactions and block proposals. I disagree with "If enough 'yes' votes are obtained, then all is good. The idea that all nodes must be 'synced', is not going to happen."

My understanding is that consensus means every node validates all transactions, proposals, votes, and blocks for themselves and agrees on the same deterministic outcome. Relaxing this requirement introduces subjectivity into whether block proposals or transactions are valid, and would lead to consensus failing to find agreement (and stalling) if enough nodes disagree on the validity of the same transactions/blocks.

@algoidan goes on:

However, regarding the first part, we can definitely extend this and allow some beefy servers to store more data (this will be implemented using a configuration parameter and not a consensus parameter). This is part of what we plan to implement next quarter.

Would it help to give this parameter a different name perhaps?

Talking this over with Shant just now, StateProofInterval and StateProofRecoveryInterval must be consensus parameters because they govern evaluating new transactions in a block (the recording of a state proof is a new txn type) and everyone must have the same rules for evaluating that txn. Also this commits us to some extra storage on each node (at least (32 byte addr)*(1000 top accounts)*(10 state proof periods) == 320,000 bytes) to store which accounts were valid signers for each state proof we may yet need to check signatures for. I was going to say 'and we should have a discussion about whether to commit that storage', but 320kB ought to be fine.
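
The arithmetic in this thread can be written out as a short sketch. The constants are the figures quoted in the comments above (32-byte addresses, 1000 top accounts, StateProofInterval = 256); the function names are hypothetical, and the result is only the lower bound on voter-address storage mentioned above, not a measurement of actual node memory.

```go
package main

import "fmt"

const (
	addressBytes = 32   // size of one account address, as quoted above
	topAccounts  = 1000 // top accounts tracked per state proof, as quoted above
)

// voterAddressBytes returns the lower bound on bytes needed to remember the
// valid signers for recoveryIntervals outstanding state proofs.
func voterAddressBytes(recoveryIntervals uint64) uint64 {
	return addressBytes * topAccounts * recoveryIntervals
}

// roundsRetained returns how many rounds back the retained data stretches.
func roundsRetained(stateProofInterval, recoveryIntervals uint64) uint64 {
	return stateProofInterval * recoveryIntervals
}

func main() {
	fmt.Println("bytes for 10 intervals:", voterAddressBytes(10))
	fmt.Println("rounds covered by 10 intervals:", roundsRetained(256, 10))
}
```

For StateProofRecoveryInterval = 10 this gives 320,000 bytes and 2560 rounds, matching the numbers in the discussion.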

algonautshant · 2022-06-15T16:01:39Z

config/consensus.go

@@ -1141,6 +1146,7 @@ func initConsensusProtocols() {
 	vFuture.StateProofVotersLookback = 16
 	vFuture.StateProofWeightThreshold = (1 << 32) * 30 / 100
 	vFuture.StateProofStrengthTarget = 256
+	vFuture.StateProofRecoveryInterval = 10


This value is too small to be useful. It gives less than 3 hours to resolve a problem. I think setting it to 10 times this value is a better starting point.
We need to evaluate the memory impact of this value.
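
A rough way to check the "less than 3 hours" estimate is sketched below. The round time is an assumption made only for illustration (the thread does not state it), and `recoveryWindow` is a hypothetical helper; StateProofInterval = 256 is taken from another comment on this PR.

```go
package main

import (
	"fmt"
	"time"
)

// assumedRoundTime is an assumption for illustration only; adjust it to the
// round time actually observed on the network.
const assumedRoundTime = 4 * time.Second

// recoveryWindow estimates how long the network has to resume producing
// state proofs before nodes start releasing the data needed to build them.
func recoveryWindow(stateProofInterval, recoveryIntervals uint64, roundTime time.Duration) time.Duration {
	return time.Duration(stateProofInterval*recoveryIntervals) * roundTime
}

func main() {
	fmt.Println("recovery interval 10: ", recoveryWindow(256, 10, assumedRoundTime))
	fmt.Println("recovery interval 100:", recoveryWindow(256, 100, assumedRoundTime))
}
```

Under the assumed 4-second round time, a value of 10 corresponds to 2560 rounds, or a little under three hours, and a value of 100 to a bit more than a day.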

test/e2e-go/features/stateproofs/stateproofs_test.go

stateproof/builder.go

gmalouf · 2022-06-17T14:25:06Z

Sounds like you have your answer Idan :)


ledger/ledger_test.go

algonautshant · 2022-06-18T02:09:02Z

ledger/voters.go

+	recentRoundOnRecoveryPeriod := basics.Round(uint64(hdr.Round) - uint64(hdr.Round)%proto.StateProofInterval)
+	oldestRoundOnRecoveryPeriod := recentRoundOnRecoveryPeriod.SubSaturate(basics.Round(proto.StateProofInterval * proto.StateProofRecoveryInterval))
+
+	for r, tr := range vt.votersForRoundCache {


This loop does not need to be traversed for every block.
This check can be divided into two functions, one checking:
stateProofRound < hdr.StateProofTracking[protocol.StateProofBasic].StateProofNextRound
the other checking:
stateProofRound <= oldestRoundOnRecoveryPeriod

stateProofRound < hdr.StateProofTracking[protocol.StateProofBasic].StateProofNextRound
This only needs to be checked when StateProofNextRound changes value.

stateProofRound <= oldestRoundOnRecoveryPeriod
This only needs to be checked if hdr.Round-1 and hdr.Round produce different oldestRoundOnRecoveryPeriod.

Basically, you are right, and the logic you've suggested will work fine.
However, I think that the current approach is simpler (we don't need to store state, for example, and the alternative might have more edge cases).
Since the map is quite small (it usually contains 0-3 elements and in the worst case contains 10 elements), I suggest we stick to the simpler code.
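
For readers following this thread, the two removal conditions and the window computation from the diff can be sketched in isolation. This is a simplified model, not the code in ledger/voters.go: the cache is keyed directly by state proof round, rounds are plain `uint64` values, and `subSaturate`, `oldestRoundOnRecoveryPeriod`, and `pruneVoters` are hypothetical helpers.

```go
package main

import "fmt"

// subSaturate returns a-b, or 0 when b is larger than a.
func subSaturate(a, b uint64) uint64 {
	if b > a {
		return 0
	}
	return a - b
}

// oldestRoundOnRecoveryPeriod follows the two lines of the diff: round down to
// the latest interval boundary, then step back by the recovery window.
func oldestRoundOnRecoveryPeriod(round, interval, recoveryIntervals uint64) uint64 {
	if interval == 0 {
		return 0 // avoid a modulo by zero when state proofs are disabled
	}
	recent := round - round%interval
	return subSaturate(recent, interval*recoveryIntervals)
}

// pruneVoters drops entries that are already covered by a confirmed state
// proof, or that are older than the recovery window.
func pruneVoters(cache map[uint64]struct{}, round, nextStateProofRound, interval, recoveryIntervals uint64) {
	oldest := oldestRoundOnRecoveryPeriod(round, interval, recoveryIntervals)
	for spRound := range cache {
		if spRound < nextStateProofRound || spRound <= oldest {
			delete(cache, spRound)
		}
	}
}

func main() {
	// The window only moves when the round crosses an interval boundary.
	fmt.Println(oldestRoundOnRecoveryPeriod(3071, 256, 10))
	fmt.Println(oldestRoundOnRecoveryPeriod(3072, 256, 10))

	// A stalled chain: the next expected state proof round is still 256.
	cache := map[uint64]struct{}{256: {}, 512: {}, 768: {}, 3072: {}}
	pruneVoters(cache, 3072, 256, 256, 10)
	fmt.Println(len(cache), "entries kept")
}
```

The sketch keeps the single loop that the author prefers. The reviewer's optimization would amount to skipping the loop unless StateProofNextRound or the value of `oldestRoundOnRecoveryPeriod` changed since the previous block.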

stateproof/worker_test.go

test/e2e-go/features/stateproofs/stateproofs_test.go

algonautshant · 2022-06-18T04:55:58Z

test/e2e-go/features/stateproofs/stateproofs_test.go

+			if lastStateProofBlock.Round() == 0 {
+				lastStateProofBlock = blk
+			}
+		}


Also check in the else block that StateProofVotersCommitment has length 0.

algonautshant · 2022-06-18T05:04:24Z

test/e2e-go/features/stateproofs/stateproofs_test.go

@@ -173,3 +175,154 @@ func verifyStateProofForRound(r *require.Assertions, libgoal libgoal.Client, res
 	r.NoError(err)
 	return stateProofMessage, nextStateProofBlock
 }
+
+func TestStateProofsRecovery(t *testing.T) {


It is not clear what this test is about. Some comments would be appreciated here.

test/e2e-go/features/stateproofs/stateproofs_test.go

algonautshant · 2022-06-21T23:42:20Z

test/e2e-go/features/stateproofs/stateproofs_test.go

+	configurableConsensus := make(config.ConsensusProtocols)
+	consensusVersion := protocol.ConsensusVersion("test-fast-stateproofs")
+	consensusParams := config.Consensus[protocol.ConsensusCurrentVersion]
+	consensusParams.StateProofInterval = 16


I think 16 is too much. The test is taking too much time. Can this be reduced to 4?

test/e2e-go/features/stateproofs/stateproofs_test.go

algonautshant · 2022-06-22T00:12:47Z

test/e2e-go/features/stateproofs/stateproofs_test.go

+	var lastStateProofBlock bookkeeping.Block
+	var lastStateProofMessage stateproofmsg.Message
+	libgoal := fixture.LibGoalClient
+	for rnd := uint64(2); rnd <= consensusParams.StateProofInterval*(expectedNumberOfStateProofs+1); rnd++ {


This loop condition is problematic. If the SP is delayed, which is normal, the test will fail.
It is better to set some upper bound, and break the loop when the desired number of SPs is observed.

Suggested change

-	for rnd := uint64(2); rnd <= consensusParams.StateProofInterval*(expectedNumberOfStateProofs+1); rnd++ {
+	for rnd := uint64(2); rnd <= consensusParams.StateProofInterval*(expectedNumberOfStateProofs+10); rnd++ {

algonautshant · 2022-06-22T00:16:07Z

test/e2e-go/features/stateproofs/stateproofs_test.go

+			lastStateProofMessage = stateProofMessage
+			lastStateProofBlock = nextStateProofBlock
+		}
+	}


Suggested change

-	}
+		if consensusParams.StateProofInterval*expectedNumberOfStateProofs == uint64(lastStateProofBlock.Round()) {
+			break
+		}
+	}
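
The two suggestions above (a looser upper bound plus an early break) can be illustrated outside the e2e fixture. This is a minimal sketch under stated assumptions: `waitForStateProofs` and `delayedChain` are hypothetical, and the callback stands in for reading blocks through the libgoal client.

```go
package main

import "fmt"

// waitForStateProofs walks rounds up to interval*(expected+slack) and stops
// as soon as the last observed state proof covers round interval*expected.
// It returns the round at which it stopped and whether the target was seen.
func waitForStateProofs(interval, expected, slack uint64, lastProvenRound func(rnd uint64) uint64) (uint64, bool) {
	target := interval * expected
	maxRound := interval * (expected + slack)
	for rnd := uint64(2); rnd <= maxRound; rnd++ {
		if lastProvenRound(rnd) == target {
			return rnd, true // desired number of state proofs observed
		}
	}
	return maxRound, false
}

// delayedChain models a chain in which each state proof is committed delay
// rounds after the round it covers.
func delayedChain(interval, delay uint64) func(rnd uint64) uint64 {
	return func(rnd uint64) uint64 {
		if rnd < delay {
			return 0
		}
		return (rnd - delay) / interval * interval
	}
}

func main() {
	chain := delayedChain(16, 20)

	rnd, ok := waitForStateProofs(16, 4, 1, chain)
	fmt.Println("slack 1: ", rnd, ok)

	rnd, ok = waitForStateProofs(16, 4, 10, chain)
	fmt.Println("slack 10:", rnd, ok)
}
```

With a slack of one interval the loop ends before the delayed state proof lands, which is the failure mode described above. With the larger bound the loop exits early once the target round is observed, so the extra headroom does not lengthen the test in the common case.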

algonautshant · 2022-06-22T00:23:39Z

test/e2e-go/features/stateproofs/stateproofs_test.go

+	}
+	r.Equalf(consensusParams.StateProofInterval*expectedNumberOfStateProofs, uint64(lastStateProofBlock.Round()), "the expected last state proof block wasn't the one that was observed")
+}
+


Suggested change

+// TestUnableToRecoverFromLaggingStateProofChain tests that the network continues after it fails to create SPs before StateProofRecoveryInterval
+// It stops one of the nodes to prevent the SP creation and starts it after StateProofRecoveryInterval deadline

algonautshant · 2022-06-22T00:48:06Z

test/e2e-go/features/stateproofs/stateproofs_test.go

+	var lastStateProofBlock bookkeeping.Block
+
+	libgoal := fixture.LibGoalClient
+	for rnd := uint64(2); rnd <= consensusParams.StateProofInterval*(expectedNumberOfStateProofs+1); rnd++ {


Give the network more time after restarting the node, and make sure the network continues to make progress.

Suggested change

-	for rnd := uint64(2); rnd <= consensusParams.StateProofInterval*(expectedNumberOfStateProofs+1); rnd++ {
+	for rnd := uint64(2); rnd <= consensusParams.StateProofInterval*(consensusParams.StateProofRecoveryInterval+3); rnd++ {

algonautshant

A few more comments to the e2e tests.

Looks great! Thanks for all the updates.

id-ms added the Team Sphinx label May 30, 2022

id-ms self-assigned this May 30, 2022

id-ms force-pushed the stateproof-recovery branch from f07949a to dcf4b40 Compare May 30, 2022 15:57

id-ms force-pushed the stateproof-recovery branch from b34c852 to 64fd05b Compare June 2, 2022 13:38

id-ms force-pushed the feature/stateproofs branch from c9d1a31 to fff5ecc Compare June 7, 2022 09:06

id-ms force-pushed the stateproof-recovery branch from 64fd05b to 98f493f Compare June 12, 2022 07:56

id-ms marked this pull request as ready for review June 14, 2022 08:37

id-ms requested a review from algonautshant June 15, 2022 06:25

algoidan added 11 commits June 15, 2022 11:02

add test to voter tracker + fix off by one

d6c621b

bounding the voterTracker + tests

152ea18

fix off by one

8c9b09e

add tests for cleanup builders

d00c3b1

change recovery interval caclualtion

db4d64e

keep only fixed number of builder if state proof chain stalls

3bc82f6

more fixes to voters + add bound on signatures storage

5a2fc77

code cleanup

fcc6279

fix lint errors

776af29

fix mod by zero

8ecede9

fix stateproof message bug + add recovery e2e tests

c5e08b6

id-ms force-pushed the stateproof-recovery branch from 2652f37 to c5e08b6 Compare June 15, 2022 08:14

algonautshant reviewed Jun 16, 2022

algoidan added 3 commits June 16, 2022 11:34

fix e2e

c6eaa1f

minor refactor

6a6e5cd

add ledger block test

da42ca2

id-ms force-pushed the stateproof-recovery branch from 5cbd23b to da42ca2 Compare June 16, 2022 09:24

id-ms changed the title ~~Stateproof recovery~~ Node recovery from stalled Stateproof chain Jun 16, 2022

algonautshant reviewed Jun 18, 2022

fix CR comments

70d4bbb

algoidan added 3 commits June 19, 2022 11:47

Merge branch 'feature/stateproofs' into stateproof-recovery

d8bf9d3

Merge branch 'feature/stateproofs' into stateproof-recovery

7220f3f

fix race bug

fee89ee

id-ms force-pushed the stateproof-recovery branch from 3ed26b2 to fee89ee Compare June 19, 2022 11:52

minor refactoring

2f912a1

algonautshant reviewed Jun 22, 2022

algonautshant approved these changes Jun 22, 2022

add comments for tests

9e89c33

id-ms force-pushed the stateproof-recovery branch from 5799646 to 9e89c33 Compare June 22, 2022 09:46

id-ms merged commit ba187c2 into algorand:feature/stateproofs Jun 22, 2022
