agd does not support joining with state sync #3769
Comments
Tend to agree unless there are good arguments for postponing this
@dtribble suggests that as long as catching up is 3x to 5x faster than the running chain, we can postpone this to a later milestone. Possible optimization:
(I think we only replay on-line vats, of which there is a bounded number, so that optimization doesn't seem worthwhile)
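As a rough illustration of the 3x-to-5x threshold above, here is a hypothetical back-of-the-envelope sketch (not Agoric code): a node whose replay runs `s` times faster than real time closes a backlog of `B` chain-days in about `B / (s - 1)` wall-clock days.

```ts
// Hypothetical helper, not part of agoric-sdk: estimate how long a node that
// is `backlogDays` behind takes to catch up, if replay runs `speedup` times
// faster than the chain advances.
function catchUpDays(backlogDays: number, speedup: number): number {
  if (speedup <= 1) return Infinity; // the node never catches up
  // Each wall-clock day, the node replays `speedup` chain-days while the
  // chain itself advances by one, so the gap shrinks by (speedup - 1).
  return backlogDays / (speedup - 1);
}

console.log(catchUpDays(30, 3)); // 15 days at the 3x threshold
console.log(catchUpDays(30, 5)); // 7.5 days at 5x
```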
First we need to measure how long catch-up currently takes, based on the estimated number of blocks, and/or calculate the blocks-per-unit-time rate of recovery. Create a sub-ticket for this initial measurement.
some mainnet0 data shows new nodes should eventually catch up, but it takes a long time. The validator community seems to prefer informal snapshot sharing.
A few quick thoughts:
* My understanding is that the meaningful number to measure is the amount of time spent in Swingset, compared to the time elapsed since genesis. That gives Swingset utilization. If you consider the cosmos-side processing to be comparatively negligible, you can then calculate the time it'd take to rebuild all the JS state through catch-up; it also gives a lower bound.
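A minimal sketch of that measurement, with illustrative names and numbers (none of these are real agoric-sdk APIs): utilization is Swingset execution time divided by wall-clock time since genesis, and its inverse is roughly the best replay speedup achievable if the cosmos-side work is negligible.

```ts
// Hypothetical sketch, assuming we can sum per-delivery execution time from
// slogfile-style metrics; the function name is illustrative.
function swingsetUtilization(
  swingsetSeconds: number,
  secondsSinceGenesis: number,
): number {
  return swingsetSeconds / secondsSinceGenesis;
}

// Illustrative numbers: 1 hour of Swingset work per chain-day gives
// u ≈ 0.042, i.e. replay could run at most ~24x real time, and rebuilding
// all the JS state takes at least u * (time since genesis).
const u = swingsetUtilization(3600, 24 * 3600);
console.log(u, 1 / u);
```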
Can we replay each vat separately and in parallel?
We need "state sync" to jump to a snapshot of the kernel data close to the current block; otherwise we can only replay and then verify a single block at a time since genesis. That's really slow, even if we do more in parallel.
@warner further to the discussion we just had about trade-offs between performance and integrity of snapshots, as I mentioned, our validator community is doing some informal snapshot sharing currently: Agoric/testnet-notes#42. I looked around and found that it seems to take about 3.5min of downtime to do a daily mainnet0 snapshot.
one validator notes:
One data point: I watched a node crash today; it missed about 200s before getting restarted. The restart took 2m10s to replay vat transcripts enough to begin processing blocks again, then took another 33s to replay the 95-ish (empty) missed blocks, after which it was caught up and following properly again. The vat-transcript replay time is roughly bounded by the frequency of our heap snapshots: we take a heap snapshot every 2000 deliveries, so no single vat should ever need to replay more than 2000 deliveries at reboot time, so reboot time will be random but roughly constant (it depends on how far each vat happens to be past its last snapshot). Note that this doesn't tell us anything about how long it takes to start up a whole new validator from scratch.
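A small sketch of the bound described in that comment, with illustrative names (not the actual swingset kernel code): because a heap snapshot is taken every 2000 deliveries, replay-at-reboot cost depends only on a vat's position in its snapshot cycle, not on its age.

```ts
// Hypothetical sketch, not the real swingset kernel code.
const SNAPSHOT_INTERVAL = 2000; // heap snapshot every 2000 deliveries

// A vat with `totalDeliveries` lifetime deliveries only replays those made
// since its most recent heap snapshot.
function deliveriesToReplay(totalDeliveries: number): number {
  return totalDeliveries % SNAPSHOT_INTERVAL;
}

// Example: a vat with 1,234,567 lifetime deliveries replays only 567 of them
// at reboot, so restart cost stays roughly constant as the chain ages.
console.log(deliveriesToReplay(1_234_567)); // 567
```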
After discussing state-sync the other day, @arirubinstein mentioned that validators leverage state sync to work around a cosmos DB pruning issue: they start a new node that state-syncs from their existing node in order to prune their DB. In case for some reason we can't figure out state sync by the time the DB grows too large, we should check whether the following rough hack might work:
For consistency protection, Swingset saves the block height it last committed, and checks that the next block it sees is either the next block N + 1, or the same block N (in which case it doesn't execute anything but simply replays the calls it previously made back to the go/cosmos side).
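A minimal sketch of that consistency check, assuming hypothetical names (this is not the actual cosmic-swingset implementation):

```ts
// Hypothetical sketch of the height check described above; names are
// illustrative, not the real cosmic-swingset API.
type BlockAction = 'execute' | 'replay-saved-calls';

function actionForBlock(
  lastCommittedHeight: number,
  incomingHeight: number,
): BlockAction {
  if (incomingHeight === lastCommittedHeight + 1) {
    // Normal case: execute the new block (and record the calls made back to
    // the go/cosmos side so they can be replayed later if needed).
    return 'execute';
  }
  if (incomingHeight === lastCommittedHeight) {
    // Swingset already committed this block: skip execution and just replay
    // the previously recorded calls back to the go/cosmos side.
    return 'replay-saved-calls';
  }
  // Anything else means the cosmos DB and the swingset DB have diverged
  // (e.g. one side was restored from a different snapshot); refuse to run.
  throw new Error(
    `block height mismatch: swingset committed ${lastCommittedHeight}, chain sent block ${incomingHeight}`,
  );
}
```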
a recent data point: 26 hours to catch up on 26 chain-days, so roughly 24x real time.
Describe the bug
While there is a practice of informal snapshot sharing, the only in-protocol way to join an Agoric chain is currently to replay all transactions from genesis; this may take days or weeks. Contrast this with the norm in the Cosmos community:
Other blockchain systems have similar features. In Bitcoin and Ethereum, software releases include a hash of a known-good state; this way, new nodes can download a state that is not more than a few months old and start verifying from there.
Design Notes
cc @michaelfig @erights