Move DoStateCheckpoint off critical path #15411

Open
wants to merge 33 commits into
base: main
from

Conversation

msmouse
Contributor

@msmouse msmouse commented Nov 27, 2024

Description

  1. Add struct State to represent the speculative state, built on the LayeredMap / MapLayer.
  2. Wrap the SMT in StateSummary to represent the speculative state tree. StateStorageUsage is removed from the SMT and put in State.
  3. The LayeredMap / MapLayer stack can read diffs between versions by following in-memory links, so we were able to remove the hashmaps that tracked the updates between versions.
  4. Proof read is now removed from the execution stage, so we save some CPU by reading proofs only for the writes, not the reads.
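
To illustrate point 3, here is a minimal sketch of reading the diff between two versions by following in-memory parent links. It is not the actual LayeredMap / MapLayer from this PR: `Layer`, `root`, `child`, and `diff_since` are hypothetical names, and keys and values are plain strings for brevity.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// One layer of key-value updates, linked in memory to the layer it was built on.
/// Hypothetical simplification for illustration; not the PR's `MapLayer`.
pub struct Layer {
    version: u64,
    /// `None` marks a deletion at this version.
    updates: HashMap<String, Option<String>>,
    parent: Option<Arc<Layer>>,
}

impl Layer {
    /// An empty base layer at `version`.
    pub fn root(version: u64) -> Arc<Layer> {
        Arc::new(Layer { version, updates: HashMap::new(), parent: None })
    }

    /// A new layer at `version` holding `updates`, stacked on top of `parent`.
    pub fn child(
        parent: &Arc<Layer>,
        version: u64,
        updates: &[(&str, Option<&str>)],
    ) -> Arc<Layer> {
        Arc::new(Layer {
            version,
            updates: updates
                .iter()
                .map(|&(k, v)| (k.to_string(), v.map(|s| s.to_string())))
                .collect(),
            parent: Some(Arc::clone(parent)),
        })
    }
}

/// Net updates in the version range (base_version, top.version], collected by
/// walking parent links from the newest layer down. No side table of
/// per-version updates is needed.
pub fn diff_since(top: &Arc<Layer>, base_version: u64) -> HashMap<String, Option<String>> {
    let mut out = HashMap::new();
    let mut cur = Some(top);
    while let Some(layer) = cur {
        if layer.version <= base_version {
            break;
        }
        for (k, v) in &layer.updates {
            // Newer layers are visited first, so keep the first value seen per key.
            out.entry(k.clone()).or_insert_with(|| v.clone());
        }
        cur = layer.parent.as_ref();
    }
    out
}
```

With layers at versions 0, 1 and 2, `diff_since(&v2, 0)` yields the net effect of versions 1 and 2, while `diff_since(&v2, 1)` yields only what version 2 changed.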

Todo: move it further to a separate stage.

How Has This Been Tested?

existing coverage

Key Areas to Review

Type of Change

  • New feature

Which Components or Systems Does This Change Impact?

[x] Validator


trunk-io bot commented Nov 27, 2024

⏱️ 2h 52m total CI duration on this PR
Slowest 15 Jobs | Cumulative Duration | Recent Runs
rust-cargo-deny | 20m | 🟩🟩🟩🟩 (+7 more)
check-dynamic-deps | 17m | 🟩🟩🟩🟩🟥 (+7 more)
rust-move-tests | 13m | 🟩
rust-move-tests | 13m | 🟩
rust-move-tests | 13m | 🟩
rust-move-tests | 13m | 🟩
rust-move-tests | 13m | 🟩
rust-move-tests | 13m | 🟩
rust-move-tests | 13m | 🟩
rust-move-tests | 12m | 🟩
rust-move-tests | 9m |
rust-move-tests | 7m |
general-lints | 5m | 🟩🟩🟩🟩 (+7 more)
semgrep/ci | 5m | 🟩🟩🟩🟩🟩 (+7 more)
file_change_determinator | 2m | 🟩🟩🟩🟩🟩 (+7 more)


@msmouse msmouse force-pushed the 1125-alden-state-summary branch from 0eb8d0e to ddaade9 Compare November 28, 2024 21:58
@msmouse msmouse marked this pull request as draft November 28, 2024 22:11
@aptos-labs aptos-labs deleted a comment from graphite-app bot Nov 28, 2024
@msmouse msmouse force-pushed the 1125-alden-state-summary branch 8 times, most recently from 6c4828a to 6bc9a36 Compare December 3, 2024 03:31
@msmouse msmouse force-pushed the 1125-alden-state-summary branch 3 times, most recently from ae365d6 to 9e9964b Compare December 8, 2024 21:51
@msmouse msmouse added the CICD:run-execution-performance-test Run execution performance test label Dec 10, 2024
@msmouse msmouse force-pushed the 1125-alden-state-summary branch 4 times, most recently from 550a066 to d59dead Compare December 10, 2024 10:40
@msmouse msmouse added the CICD:run-e2e-tests when this label is present github actions will run all land-blocking e2e tests from the PR label Dec 10, 2024


@msmouse msmouse changed the title from "StateSummary" to "Move DoStateCheckpoint off critical path" Dec 18, 2024
@msmouse msmouse marked this pull request as ready for review December 18, 2024 05:10


@msmouse msmouse mentioned this pull request Dec 18, 2024
22 tasks


Contributor

✅ Forge suite realistic_env_max_load success on f8d4befe5aed2b07953a6a791efdc2f086115494

two traffics test: inner traffic : committed: 14526.70 txn/s, latency: 2732.14 ms, (p50: 2700 ms, p70: 2700, p90: 3000 ms, p99: 3300 ms), latency samples: 5523540
two traffics test : committed: 100.08 txn/s, latency: 1367.53 ms, (p50: 1300 ms, p70: 1400, p90: 1500 ms, p99: 1600 ms), latency samples: 1780
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 1.604, avg: 1.550", "ConsensusProposalToOrdered: max: 0.320, avg: 0.294", "ConsensusOrderedToCommit: max: 0.329, avg: 0.319", "ConsensusProposalToCommit: max: 0.621, avg: 0.613"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.56s no progress at version 43845 (avg 0.20s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.93s no progress at version 6242935 (avg 0.75s) [limit 16].
Test Ok

Contributor

✅ Forge suite compat success on 6593fb81261f25490ffddc2252a861c994234c2a ==> f8d4befe5aed2b07953a6a791efdc2f086115494

Compatibility test results for 6593fb81261f25490ffddc2252a861c994234c2a ==> f8d4befe5aed2b07953a6a791efdc2f086115494 (PR)
1. Check liveness of validators at old version: 6593fb81261f25490ffddc2252a861c994234c2a
compatibility::simple-validator-upgrade::liveness-check : committed: 16034.06 txn/s, latency: 2108.44 ms, (p50: 2100 ms, p70: 2200, p90: 2500 ms, p99: 4200 ms), latency samples: 524480
2. Upgrading first Validator to new version: f8d4befe5aed2b07953a6a791efdc2f086115494
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 1298.04 txn/s, latency: 18917.18 ms, (p50: 19100 ms, p70: 25500, p90: 26400 ms, p99: 27300 ms), latency samples: 48580
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 7070.13 txn/s, latency: 4717.28 ms, (p50: 5100 ms, p70: 5100, p90: 5400 ms, p99: 5500 ms), latency samples: 235580
3. Upgrading rest of first batch to new version: f8d4befe5aed2b07953a6a791efdc2f086115494
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 6148.48 txn/s, latency: 4746.88 ms, (p50: 5500 ms, p70: 5700, p90: 5800 ms, p99: 5900 ms), latency samples: 111920
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6177.15 txn/s, latency: 5326.10 ms, (p50: 5800 ms, p70: 5900, p90: 6100 ms, p99: 6400 ms), latency samples: 215540
4. upgrading second batch to new version: f8d4befe5aed2b07953a6a791efdc2f086115494
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 10006.53 txn/s, latency: 2790.81 ms, (p50: 3100 ms, p70: 3300, p90: 3400 ms, p99: 3600 ms), latency samples: 180480
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 10642.21 txn/s, latency: 2971.40 ms, (p50: 3100 ms, p70: 3300, p90: 3500 ms, p99: 3900 ms), latency samples: 348420
5. check swarm health
Compatibility test for 6593fb81261f25490ffddc2252a861c994234c2a ==> f8d4befe5aed2b07953a6a791efdc2f086115494 passed
Test Ok

/// the current state and the last checkpoint. shared with outside world.
current_state: Arc<Mutex<LedgerStateWithSummary>>,
/// The most recent checkpoint sent for persistence, not guaranteed to have committed already.
last_snapshot: StateWithSummary,
Contributor

Here it is called a snapshot, while in current_state it is called a checkpoint. Maybe rename this to last_persisting_checkpoint to be consistent?

target_items: usize,
current_state: Arc<Mutex<CurrentState>>,
persisted_state: Arc<Mutex<PersistedState>>,
out_current_state: Arc<Mutex<LedgerStateWithSummary>>,
Contributor

nit: what does "out_" mean?

state_commit_sender: SyncSender<CommitMessage<StateWithSummary>>,
/// Estimated number of items in the buffer.
estimated_items: usize,
/// The target number of items in the buffer between commits.
target_items: usize,
Contributor

Could target_items be a const instead of a field of BufferedState?
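
For reference, a minimal sketch of the two options being compared. The struct names, the `should_commit` check, and the value of `TARGET_ITEMS` are all hypothetical; the thread does not show how the target is actually used or what its value is.

```rust
/// Hypothetical value; the thread does not state the real target.
pub const TARGET_ITEMS: usize = 100_000;

/// Option A (as in the diff): the target is a field, so each instance can differ.
pub struct BufferWithField {
    /// Estimated number of items in the buffer.
    pub estimated_items: usize,
    /// The target number of items in the buffer between commits.
    pub target_items: usize,
}

impl BufferWithField {
    /// Assumed policy: commit once the estimate reaches the target.
    pub fn should_commit(&self) -> bool {
        self.estimated_items >= self.target_items
    }
}

/// Option B (the suggestion): the target is a compile-time constant.
pub struct BufferWithConst {
    /// Estimated number of items in the buffer.
    pub estimated_items: usize,
}

impl BufferWithConst {
    /// Same assumed policy, compared against the shared constant.
    pub fn should_commit(&self) -> bool {
        self.estimated_items >= TARGET_ITEMS
    }
}
```

A const keeps the struct smaller and makes the threshold uniform across instances. A field lets tests or callers construct a buffer with a small target, which may be why it was a field in the first place.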

@@ -354,7 +356,7 @@ pub trait DbReader: Send + Sync {
     /// Returns the proof of the given state key and version.
     fn get_state_proof_by_version_ext(
         &self,
-        state_key: &StateKey,
+        key: &HashValue,
Contributor

Name it state_key to be explicit?


pub struct StateUpdateRefs<'kv> {
pub per_version: PerVersionStateUpdateRefs<'kv>,
pub for_last_checkpoint: Option<BatchedStateUpdateRefs<'kv>>,
Contributor

nit: "updates_till_last_checkpoint" could be more intuitive?

pub struct StateUpdateRefs<'kv> {
pub per_version: PerVersionStateUpdateRefs<'kv>,
pub for_last_checkpoint: Option<BatchedStateUpdateRefs<'kv>>,
pub for_latest: Option<BatchedStateUpdateRefs<'kv>>,
Contributor

nit: updates_after_last_checkpoint?
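
A rough sketch of the semantics the two suggested names imply, assuming the two batched fields hold the net updates up to and after the last checkpoint in the chunk. `UpdateRefsSketch`, `Batch`, `batch`, and `split` are hypothetical and use string keys; the real StateUpdateRefs may be built differently.

```rust
use std::collections::HashMap;

/// Net updates over a range of versions; `None` marks a deletion.
pub type Batch<'kv> = HashMap<&'kv str, Option<&'kv str>>;

/// Hypothetical stand-in for the two batched fields, using the suggested names.
pub struct UpdateRefsSketch<'kv> {
    pub updates_till_last_checkpoint: Option<Batch<'kv>>,
    pub updates_after_last_checkpoint: Option<Batch<'kv>>,
}

/// Fold per-version updates into one batch: the last write per key wins.
fn batch<'kv>(versions: &[Vec<(&'kv str, Option<&'kv str>)>]) -> Option<Batch<'kv>> {
    if versions.is_empty() {
        return None;
    }
    let mut out: Batch<'kv> = HashMap::new();
    for updates in versions {
        for &(key, value) in updates {
            out.insert(key, value);
        }
    }
    Some(out)
}

/// Split per-version updates at the last checkpoint (an index into `per_version`,
/// inclusive), or treat everything as "after" when the chunk has no checkpoint.
pub fn split<'kv>(
    per_version: &[Vec<(&'kv str, Option<&'kv str>)>],
    last_checkpoint_index: Option<usize>,
) -> UpdateRefsSketch<'kv> {
    let cut = last_checkpoint_index
        .map_or(0, |i| i + 1)
        .min(per_version.len());
    let (till, after) = per_version.split_at(cut);
    UpdateRefsSketch {
        updates_till_last_checkpoint: batch(till),
        updates_after_last_checkpoint: batch(after),
    }
}
```

Under this reading, both fields are `Option`s because a chunk may contain no checkpoint at all, or may end exactly at one.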

Contributor

@areshand areshand left a comment

Parsed through half of the changes. Mainly naming suggestions; feel free to drop them.

Labels
CICD:run-e2e-tests (when this label is present github actions will run all land-blocking e2e tests from the PR)
CICD:run-execution-performance-full-test (Run execution performance test (full version))

2 participants