Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node recovery from stalled Stateproof chain #4056

Merged
merged 20 commits into from
Jun 22, 2022

Conversation

id-ms
Copy link
Contributor

@id-ms id-ms commented May 30, 2022

Summary

In case state proof chain is being delayed nodes have the ability to recover and "catch-up" on not-yet-confirmed state proofs. In order to do so, there are some resources allocated for that target.

Nodes store the following resources and remove them only when a state proof is confirmed on-chain
1 -Voters array, participation tree, and proven weight
2 -builder's data for each state proof round
3-signatures on state proof messages

If state proof chain is delayed too much those resources will never be released and might lead to extensive memory consumption.

Test Plan

@id-ms id-ms self-assigned this May 30, 2022
@id-ms id-ms force-pushed the stateproof-recovery branch from f07949a to dcf4b40 Compare May 30, 2022 15:57
@codecov
Copy link

codecov bot commented May 30, 2022

Codecov Report

Merging #4056 (2f912a1) into feature/stateproofs (145c065) will increase coverage by 0.14%.
The diff coverage is 94.11%.

❗ Current head 2f912a1 differs from pull request most recent head 5799646. Consider uploading reports for the commit 5799646 to get more accurate results

@@                   Coverage Diff                   @@
##           feature/stateproofs    #4056      +/-   ##
=======================================================
+ Coverage                54.69%   54.83%   +0.14%     
=======================================================
  Files                      396      396              
  Lines                    48958    48980      +22     
=======================================================
+ Hits                     26779    26860      +81     
+ Misses                   19935    19885      -50     
+ Partials                  2244     2235       -9     
Impacted Files Coverage Δ
crypto/stateproof/builder.go 89.21% <ø> (ø)
stateproof/worker.go 90.00% <ø> (ø)
stateproof/builder.go 69.54% <91.42%> (+2.38%) ⬆️
config/consensus.go 85.75% <100.00%> (+0.04%) ⬆️
ledger/internal/eval.go 67.28% <100.00%> (ø)
ledger/voters.go 62.85% <100.00%> (+10.54%) ⬆️
ledger/tracker.go 70.56% <0.00%> (-3.90%) ⬇️
catchup/peerSelector.go 98.95% <0.00%> (-1.05%) ⬇️
data/abi/abi_type.go 87.67% <0.00%> (-0.95%) ⬇️
network/wsNetwork.go 65.27% <0.00%> (+0.47%) ⬆️
... and 12 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 145c065...5799646. Read the comment docs.

@id-ms id-ms force-pushed the stateproof-recovery branch from b34c852 to 64fd05b Compare June 2, 2022 13:38
@id-ms id-ms force-pushed the feature/stateproofs branch from c9d1a31 to fff5ecc Compare June 7, 2022 09:06
@id-ms id-ms force-pushed the stateproof-recovery branch from 64fd05b to 98f493f Compare June 12, 2022 07:56
@id-ms id-ms marked this pull request as ready for review June 14, 2022 08:37
@id-ms id-ms requested a review from algonautshant June 15, 2022 06:25
@id-ms id-ms force-pushed the stateproof-recovery branch from 2652f37 to c5e08b6 Compare June 15, 2022 08:14
Copy link
Contributor

@algonautshant algonautshant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please update the PR title to be more specific?
It is not "Stateproof recovery", it is "Node recovery from stalled Stateproof chain".
The title is implying that the StateProof is recovered, while it is not.

// StateProofRecoveryInterval represents the number of state proof intervals that the network will try to catch-up with.
// When the difference between the latest state proof and the current round will be greater than value, Nodes will
// release resources allocated for creating state proofs.
StateProofRecoveryInterval uint64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary for this be a protocol parameter?
What is the harm when different nodes have different values for this?
Ideally, every node should set this to the highest value its memory can handle.

Copy link
Contributor Author

@id-ms id-ms Jun 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be a consensus param since different values on different nodes might cause them to disagree on stateproof transactions (they might remove data used for the verification).
for example:
assume node A stores 20 stateproof intervals back and node B stores 2 intervals back.
If both of them receive an old stateproof transaction node A will be able to verify the transition since it has the provenWeight and the PartCommitment but node B will not have it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gmalouf @cce, We would like to hear your thoughts on this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you look at how we are handling the 320 round lookback effort, you will see us using consensus params in a similar situation; I am okay with this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is why I think this parameter should be a configuration parameter, and not a consensus parameter:

This parameter decides how much memory a system can tolerate. This parameter needs to be set to a reasonably large value, to provide some tolerance in case the SP is delayed, and perhaps give enough time for intervention when something wrong happens.

However, a machine with limited amount of memory will crash when the memory usage goes up because of this parameter.

As a more concrete example, let's say there are two machines, one with 95% stake and a lot of memory, and another machine, with 5% stake, and limited memory. It is not a good idea to have the same value for this parameter to all of these machines.

The machines with high stake and high memory, can and should set a large value for this parameter, while others, with limited memory, can only tolerate a small value.

Therefore, I think this parameter should not be a consensus parameter.

@gmalouf @brianolson your thoughts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StateProofInterval = 256 so StateProofRecoveryInterval = 10 means we need to store data stretching back 2560 rounds now? That's a big jump over the 1000 rounds of blocks that we currently store (and we're doing a big chunk of work to thin out that data and only store what we really need out of those 1000 blocks).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like you are pushing for an option C Shant - make configurable per node. I thought we were debating constant vs consensus param, neither of which is intended to be per node configurable.

Idan's comments to me offline imply an intent to keep the value identical on all nodes (and keeping it small) - sounds like worth a live discussion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we need a live discussion for this, absolutely. And the value of 10, which is too much for Brian :-), is of very limited use if set that low.
Also, once added a consensus parameter, it will be there forever!

Copy link
Contributor

@cce cce Jun 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@algoidan says:

a key observation in our discussion is that there are two parts of data needed for stateproof
1 - data needed to create a stateproof transaction (participation array, bound of signatures etc)
2- data needed to verify a stateproof transaction on-chain. (need to block header from 256 rounds before the stateproof)

and currently, the new parameter affects both parts
In my opinion, regarding the second part - all nodes must be synced on what blocks they have in order to verify a stateproof transaction. Having different parameter values might lead to disagreements.

@algonautshant replies:

This is the argument for having it as a consensus parameter. But I would say that, accounts will either vote yes, or no to the transaction/block. If enough "yes" votes are obtained, then all is good. The idea that all nodes must be "synced", is not going to happen.

In the event of a delayed SP, either all will say "No" (parameter too small), or the weak ones will crash, and others will say "Yes" (parameter big enough).

No. All nodes that have online accounts. This parameter will impact their ability to sign blocks with SP transaction in them.

I agree with Idan on (2). There cannot be disagreement over whether a transaction is valid. All nodes (whether relay, participating, non-participating) must be able to validate transactions and block proposals. I disagree with "If enough 'yes' votes are obtained, then all is good. The idea that all nodes must be 'synced', is not going to happen."

My understanding is that consensus means every node validates all transactions, proposals, votes, and blocks for themselves and agrees on the same deterministic outcome. Relaxing this requirement introduces subjectivity into whether block proposals or transactions are valid, and would lead to consensus failing to find agreement (and stalling) if enough nodes disagree on the validity of the same transactions/blocks.

@algoidan goes on:

However, regarding the first part, we can definitely extend this and allow some beefy servers to store more data (this will be implemented using a configuration parameter and not consensus ). This is part of what we plan to implement next quarter.

Would it help to give this parameter a different name perhaps?

Copy link
Contributor

@brianolson brianolson Jun 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talking this over with Shant just now, StateProofInterval and StateProofRecoveryInterval must be consensus parameters because they govern evaluating new transactions in a block (the recording of a state proof is a new txn type) and everyone must have the same rules for evaluating that txn. Also this commits us to some extra storage on each node (at least (32 byte addr)*(1000 top accounts)*(10 state proof periods) == 320,000 bytes) to store which accounts were valid signers for each state proof we may yet need to check signatures for. I was going to say 'and we should have a discussion about whether to commit that storage', but 320kB ought to be fine.

@@ -1141,6 +1146,7 @@ func initConsensusProtocols() {
vFuture.StateProofVotersLookback = 16
vFuture.StateProofWeightThreshold = (1 << 32) * 30 / 100
vFuture.StateProofStrengthTarget = 256
vFuture.StateProofRecoveryInterval = 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very short value to be useful at all. It gives less than 3 hours to resolve a problem. I think setting it 10 times this value is a better starting point.
Need to evaluate the memory impact of this value.

test/e2e-go/features/stateproofs/stateproofs_test.go Outdated Show resolved Hide resolved
stateproof/builder.go Show resolved Hide resolved
stateproof/builder.go Show resolved Hide resolved
@id-ms id-ms force-pushed the stateproof-recovery branch from 5cbd23b to da42ca2 Compare June 16, 2022 09:24
@id-ms id-ms changed the title Stateproof recovery Node recovery from stalled Stateproof chain Jun 16, 2022
@gmalouf
Copy link
Contributor

gmalouf commented Jun 17, 2022 via email

ledger/ledger_test.go Outdated Show resolved Hide resolved
ledger/ledger_test.go Outdated Show resolved Hide resolved
recentRoundOnRecoveryPeriod := basics.Round(uint64(hdr.Round) - uint64(hdr.Round)%proto.StateProofInterval)
oldestRoundOnRecoveryPeriod := recentRoundOnRecoveryPeriod.SubSaturate(basics.Round(proto.StateProofInterval * proto.StateProofRecoveryInterval))

for r, tr := range vt.votersForRoundCache {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop does not need to be traversed for every block.
This check can be divided into two functions, one checking:
stateProofRound < hdr.StateProofTracking[protocol.StateProofBasic].StateProofNextRound
the other checking:
stateProofRound <= oldestRoundOnRecoveryPeriod

stateProofRound < hdr.StateProofTracking[protocol.StateProofBasic].StateProofNextRound
This only need to be checked when StateProofNextRound changes the value.

stateProofRound <= oldestRoundOnRecoveryPeriod
This only needs to be checked if hdr.Round-1 and hdr.Round produce different oldestRoundOnRecoveryPeriod.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, You are right, and the logic you've suggested will work fine.
However, I think that the current approach is simpler (we don't need a store state for example- and we might have more edge cases).
Since the map is quite small (it usually contains 0-3 elements and in the worst case contains 10 elements). I suggest we stick to a simpler code.

stateproof/worker_test.go Outdated Show resolved Hide resolved
stateproof/worker_test.go Outdated Show resolved Hide resolved
if lastStateProofBlock.Round() == 0 {
lastStateProofBlock = blk
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also check in else block that StateProofVotersCommitment is length 0.

@@ -173,3 +175,154 @@ func verifyStateProofForRound(r *require.Assertions, libgoal libgoal.Client, res
r.NoError(err)
return stateProofMessage, nextStateProofBlock
}

func TestStateProofsRecovery(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear what this test is about. Some comments will be appreciated here.

@id-ms id-ms force-pushed the stateproof-recovery branch from 3ed26b2 to fee89ee Compare June 19, 2022 11:52
test/e2e-go/features/stateproofs/stateproofs_test.go Outdated Show resolved Hide resolved
test/e2e-go/features/stateproofs/stateproofs_test.go Outdated Show resolved Hide resolved
configurableConsensus := make(config.ConsensusProtocols)
consensusVersion := protocol.ConsensusVersion("test-fast-stateproofs")
consensusParams := config.Consensus[protocol.ConsensusCurrentVersion]
consensusParams.StateProofInterval = 16
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 16 is too much. The test is taking too much time. This can be reduced to 4?

test/e2e-go/features/stateproofs/stateproofs_test.go Outdated Show resolved Hide resolved
var lastStateProofBlock bookkeeping.Block
var lastStateProofMessage stateproofmsg.Message
libgoal := fixture.LibGoalClient
for rnd := uint64(2); rnd <= consensusParams.StateProofInterval*(expectedNumberOfStateProofs+1); rnd++ {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop condition is problematic. If the SP is delayed, which is normal, the test will fail.
It is better to set some upper bound, and break the loop when the desired number of SPs are observed.

Suggested change
for rnd := uint64(2); rnd <= consensusParams.StateProofInterval*(expectedNumberOfStateProofs+1); rnd++ {
for rnd := uint64(2); rnd <= consensusParams.StateProofInterval*(expectedNumberOfStateProofs+10); rnd++ {

lastStateProofMessage = stateProofMessage
lastStateProofBlock = nextStateProofBlock
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}
if consensusParams.StateProofInterval*expectedNumberOfStateProofs == uint64(lastStateProofBlock.Round()) {
break
}
}

}
r.Equalf(consensusParams.StateProofInterval*expectedNumberOfStateProofs, uint64(lastStateProofBlock.Round()), "the expected last state proof block wasn't the one that was observed")
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// TestUnableToRecoverFromLaggingStateProofChain tests that the network continues after it fails to create SPs before StateProofRecoveryInterval
// It stops one of the nodes to prevent the SP creation and starts it after StateProofRecoveryInterval deadline

var lastStateProofBlock bookkeeping.Block

libgoal := fixture.LibGoalClient
for rnd := uint64(2); rnd <= consensusParams.StateProofInterval*(expectedNumberOfStateProofs+1); rnd++ {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Give the network more time after restarting the node, and make sure the network continues to make progress.

Suggested change
for rnd := uint64(2); rnd <= consensusParams.StateProofInterval*(expectedNumberOfStateProofs+1); rnd++ {
for rnd := uint64(2); rnd <= consensusParams.StateProofInterval*(consensusParams.StateProofRecoveryInterval+3); rnd++ {

Copy link
Contributor

@algonautshant algonautshant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments to the e2e tests.

Looks great! Thanks for all the updates.

@id-ms id-ms force-pushed the stateproof-recovery branch from 5799646 to 9e89c33 Compare June 22, 2022 09:46
@id-ms id-ms merged commit ba187c2 into algorand:feature/stateproofs Jun 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants