-
Notifications
You must be signed in to change notification settings - Fork 486
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Node recovery from stalled Stateproof chain #4056
Node recovery from stalled Stateproof chain #4056
Conversation
f07949a
to
dcf4b40
Compare
Codecov Report
@@ Coverage Diff @@
## feature/stateproofs #4056 +/- ##
=======================================================
+ Coverage 54.69% 54.83% +0.14%
=======================================================
Files 396 396
Lines 48958 48980 +22
=======================================================
+ Hits 26779 26860 +81
+ Misses 19935 19885 -50
+ Partials 2244 2235 -9
Continue to review full report at Codecov.
|
b34c852
to
64fd05b
Compare
c9d1a31
to
fff5ecc
Compare
64fd05b
to
98f493f
Compare
2652f37
to
c5e08b6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please update the PR title to be more specific?
It is not "Stateproof recovery", it is "Node recovery from stalled Stateproof chain".
The title is implying that the StateProof is recovered, while it is not.
// StateProofRecoveryInterval represents the number of state proof intervals that the network will try to catch-up with. | ||
// When the difference between the latest state proof and the current round will be greater than value, Nodes will | ||
// release resources allocated for creating state proofs. | ||
StateProofRecoveryInterval uint64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary for this be a protocol parameter?
What is the harm when different nodes have different values for this?
Ideally, every node should set this to the highest value its memory can handle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be a consensus param since different values on different nodes might cause them to disagree on stateproof transactions (they might remove data used for the verification).
for example:
assume node A stores 20 stateproof intervals back and node B stores 2 intervals back.
If both of them receive an old stateproof transaction node A will be able to verify the transition since it has the provenWeight and the PartCommitment but node B will not have it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if you look at how we are handling the 320 round lookback effort, you will see us using consensus params in a similar situation; I am okay with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is why I think this parameter should be a configuration parameter, and not a consensus parameter:
This parameter decides how much memory a system can tolerate. This parameter needs to be set to a reasonably large value, to provide some tolerance in case the SP is delayed, and perhaps give enough time for intervention when something wrong happens.
However, a machine with limited amount of memory will crash when the memory usage goes up because of this parameter.
As a more concrete example, let's say there are two machines, one with 95% stake and a lot of memory, and another machine, with 5% stake, and limited memory. It is not a good idea to have the same value for this parameter to all of these machines.
The machines with high stake and high memory, can and should set a large value for this parameter, while others, with limited memory, can only tolerate a small value.
Therefore, I think this parameter should not be a consensus parameter.
@gmalouf @brianolson your thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StateProofInterval = 256
so StateProofRecoveryInterval = 10
means we need to store data stretching back 2560 rounds now? That's a big jump over the 1000 rounds of blocks that we currently store (and we're doing a big chunk of work to thin out that data and only store what we really need out of those 1000 blocks).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds like you are pushing for an option C Shant - make configurable per node. I thought we were debating constant vs consensus param, neither of which is intended to be per node configurable.
Idan's comments to me offline imply an intent to keep the value identical on all nodes (and keeping it small) - sounds like worth a live discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we need a live discussion for this, absolutely. And the value of 10, which is too much for Brian :-), is of very limited use if set that low.
Also, once added a consensus parameter, it will be there forever!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@algoidan says:
a key observation in our discussion is that there are two parts of data needed for stateproof
1 - data needed to create a stateproof transaction (participation array, bound of signatures etc)
2- data needed to verify a stateproof transaction on-chain. (need to block header from 256 rounds before the stateproof)and currently, the new parameter affects both parts
In my opinion, regarding the second part - all nodes must be synced on what blocks they have in order to verify a stateproof transaction. Having different parameter values might lead to disagreements.
@algonautshant replies:
This is the argument for having it as a consensus parameter. But I would say that, accounts will either vote yes, or no to the transaction/block. If enough "yes" votes are obtained, then all is good. The idea that all nodes must be "synced", is not going to happen.
In the event of a delayed SP, either all will say "No" (parameter too small), or the weak ones will crash, and others will say "Yes" (parameter big enough).
No. All nodes that have online accounts. This parameter will impact their ability to sign blocks with SP transaction in them.
I agree with Idan on (2). There cannot be disagreement over whether a transaction is valid. All nodes (whether relay, participating, non-participating) must be able to validate transactions and block proposals. I disagree with "If enough 'yes' votes are obtained, then all is good. The idea that all nodes must be 'synced', is not going to happen."
My understanding is that consensus means every node validates all transactions, proposals, votes, and blocks for themselves and agrees on the same deterministic outcome. Relaxing this requirement introduces subjectivity into whether block proposals or transactions are valid, and would lead to consensus failing to find agreement (and stalling) if enough nodes disagree on the validity of the same transactions/blocks.
@algoidan goes on:
However, regarding the first part, we can definitely extend this and allow some beefy servers to store more data (this will be implemented using a configuration parameter and not consensus ). This is part of what we plan to implement next quarter.
Would it help to give this parameter a different name perhaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talking this over with Shant just now, StateProofInterval
and StateProofRecoveryInterval
must be consensus parameters because they govern evaluating new transactions in a block (the recording of a state proof is a new txn type) and everyone must have the same rules for evaluating that txn. Also this commits us to some extra storage on each node (at least (32 byte addr)*(1000 top accounts)*(10 state proof periods) == 320,000 bytes)
to store which accounts were valid signers for each state proof we may yet need to check signatures for. I was going to say 'and we should have a discussion about whether to commit that storage', but 320kB ought to be fine.
@@ -1141,6 +1146,7 @@ func initConsensusProtocols() { | |||
vFuture.StateProofVotersLookback = 16 | |||
vFuture.StateProofWeightThreshold = (1 << 32) * 30 / 100 | |||
vFuture.StateProofStrengthTarget = 256 | |||
vFuture.StateProofRecoveryInterval = 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very short value to be useful at all. It gives less than 3 hours to resolve a problem. I think setting it 10 times this value is a better starting point.
Need to evaluate the memory impact of this value.
5cbd23b
to
da42ca2
Compare
Sounds like you have your answer Idan :)
…On Fri, Jun 17, 2022 at 10:22 AM Brian Olson ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In config/consensus.go
<#4056 (comment)>:
> @@ -387,6 +387,11 @@ type ConsensusParams struct {
// StateProofStrengthTarget represents either k+q (for pre-quantum security) or k+2q (for post-quantum security)
StateProofStrengthTarget uint64
+ // StateProofRecoveryInterval represents the number of state proof intervals that the network will try to catch-up with.
+ // When the difference between the latest state proof and the current round will be greater than value, Nodes will
+ // release resources allocated for creating state proofs.
+ StateProofRecoveryInterval uint64
Talking this over with Shant just now, StateProofInterval and
StateProofRecoveryInterval *must* be consensus parameters because they
govern evaluating new transactions in a block (the recording of a state
proof is a new txn type) and everyone must have the same rules for
evaluating that txn. *Also* this commits us to some extra storage on each
node (at least (32 byte addr)*(1000 top accounts)*(10 state proof
periods) == 320,000 bytes) to store which accounts were valid signers for
each state proof we may yet need to check signatures for. I was going to
say 'and we should have a discussion about whether to commit that storage',
but 320kB ought to be fine.
—
Reply to this email directly, view it on GitHub
<#4056 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAHP3U4TGAV2XEPSPN3A7GLVPSC33ANCNFSM5XK72BLA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
recentRoundOnRecoveryPeriod := basics.Round(uint64(hdr.Round) - uint64(hdr.Round)%proto.StateProofInterval) | ||
oldestRoundOnRecoveryPeriod := recentRoundOnRecoveryPeriod.SubSaturate(basics.Round(proto.StateProofInterval * proto.StateProofRecoveryInterval)) | ||
|
||
for r, tr := range vt.votersForRoundCache { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This loop does not need to be traversed for every block.
This check can be divided into two functions, one checking:
stateProofRound < hdr.StateProofTracking[protocol.StateProofBasic].StateProofNextRound
the other checking:
stateProofRound <= oldestRoundOnRecoveryPeriod
stateProofRound < hdr.StateProofTracking[protocol.StateProofBasic].StateProofNextRound
This only need to be checked when StateProofNextRound changes the value.
stateProofRound <= oldestRoundOnRecoveryPeriod
This only needs to be checked if hdr.Round-1
and hdr.Round
produce different oldestRoundOnRecoveryPeriod
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, You are right, and the logic you've suggested will work fine.
However, I think that the current approach is simpler (we don't need a store state for example- and we might have more edge cases).
Since the map is quite small (it usually contains 0-3 elements and in the worst case contains 10 elements). I suggest we stick to a simpler code.
if lastStateProofBlock.Round() == 0 { | ||
lastStateProofBlock = blk | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also check in else block that StateProofVotersCommitment
is length 0.
@@ -173,3 +175,154 @@ func verifyStateProofForRound(r *require.Assertions, libgoal libgoal.Client, res | |||
r.NoError(err) | |||
return stateProofMessage, nextStateProofBlock | |||
} | |||
|
|||
func TestStateProofsRecovery(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not clear what this test is about. Some comments will be appreciated here.
3ed26b2
to
fee89ee
Compare
configurableConsensus := make(config.ConsensusProtocols) | ||
consensusVersion := protocol.ConsensusVersion("test-fast-stateproofs") | ||
consensusParams := config.Consensus[protocol.ConsensusCurrentVersion] | ||
consensusParams.StateProofInterval = 16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think 16 is too much. The test is taking too much time. This can be reduced to 4?
var lastStateProofBlock bookkeeping.Block | ||
var lastStateProofMessage stateproofmsg.Message | ||
libgoal := fixture.LibGoalClient | ||
for rnd := uint64(2); rnd <= consensusParams.StateProofInterval*(expectedNumberOfStateProofs+1); rnd++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This loop condition is problematic. If the SP is delayed, which is normal, the test will fail.
It is better to set some upper bound, and break the loop when the desired number of SPs are observed.
for rnd := uint64(2); rnd <= consensusParams.StateProofInterval*(expectedNumberOfStateProofs+1); rnd++ { | |
for rnd := uint64(2); rnd <= consensusParams.StateProofInterval*(expectedNumberOfStateProofs+10); rnd++ { |
lastStateProofMessage = stateProofMessage | ||
lastStateProofBlock = nextStateProofBlock | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | |
if consensusParams.StateProofInterval*expectedNumberOfStateProofs == uint64(lastStateProofBlock.Round()) { | |
break | |
} | |
} |
} | ||
r.Equalf(consensusParams.StateProofInterval*expectedNumberOfStateProofs, uint64(lastStateProofBlock.Round()), "the expected last state proof block wasn't the one that was observed") | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// TestUnableToRecoverFromLaggingStateProofChain tests that the network continues after it fails to create SPs before StateProofRecoveryInterval | |
// It stops one of the nodes to prevent the SP creation and starts it after StateProofRecoveryInterval deadline |
var lastStateProofBlock bookkeeping.Block | ||
|
||
libgoal := fixture.LibGoalClient | ||
for rnd := uint64(2); rnd <= consensusParams.StateProofInterval*(expectedNumberOfStateProofs+1); rnd++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Give the network more time after restarting the node, and make sure the network continues to make progress.
for rnd := uint64(2); rnd <= consensusParams.StateProofInterval*(expectedNumberOfStateProofs+1); rnd++ { | |
for rnd := uint64(2); rnd <= consensusParams.StateProofInterval*(consensusParams.StateProofRecoveryInterval+3); rnd++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more comments to the e2e tests.
Looks great! Thanks for all the updates.
5799646
to
9e89c33
Compare
Summary
In case state proof chain is being delayed nodes have the ability to recover and "catch-up" on not-yet-confirmed state proofs. In order to do so, there are some resources allocated for that target.
Nodes store the following resources and remove them only when a state proof is confirmed on-chain
1 -Voters array, participation tree, and proven weight
2 -builder's data for each state proof round
3-signatures on state proof messages
If state proof chain is delayed too much those resources will never be released and might lead to extensive memory consumption.
Test Plan