
[aptos-workspace-server] indexer support and graceful shutdown #15183

Merged
vgao1996 merged 7 commits into aptos-labs:main from aws-01 on Nov 20, 2024

Conversation

@vgao1996 (Contributor) commented Nov 4, 2024

This is a major enhancement to aptos-workspace-server, implementing the following features:

  • Full indexer support, including
    • Starting a PostgreSQL container
    • Applying DB migrations and starting the indexer processors
    • Starting the indexer API service (container)
  • Graceful shutdown
    • This ensures all created docker networks, volumes and containers get cleaned up (see the sketch after this list)
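
A minimal sketch of the pairing idea behind the graceful shutdown, with hypothetical names and signatures (not the PR's actual code): each resource-creating step also yields a cleanup future, and the server awaits the cleanup even when startup fails partway.

```rust
use std::future::Future;

// Hypothetical helper: creating a docker resource also hands back a
// cleanup future that removes it again.
fn create_docker_network(
    name: String,
) -> (
    impl Future<Output = anyhow::Result<()>>, // creation
    impl Future<Output = anyhow::Result<()>>, // cleanup
) {
    let create = {
        let name = name.clone();
        async move {
            println!("creating network {name}");
            anyhow::Ok(())
        }
    };
    let cleanup = async move {
        println!("removing network {name}");
        anyhow::Ok(())
    };
    (create, cleanup)
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let (create, cleanup) = create_docker_network("workspace-net".to_string());

    // Capture the startup result instead of returning early, so the
    // cleanup future below always runs.
    let result = async {
        create.await?;
        // ... start node, faucet, indexer; wait for a shutdown signal ...
        anyhow::Ok(())
    }
    .await;

    // Runs whether or not startup succeeded.
    cleanup.await?;
    result
}
```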

@vgao1996 vgao1996 requested a review from 0xmaayan November 4, 2024 22:57

trunk-io bot commented Nov 4, 2024

⏱️ 6m total CI duration on this PR

| Job | Cumulative Duration | Recent Runs |
| --- | --- | --- |
| rust-cargo-deny | 2m | 🟩 |
| rust-move-tests | 2m | 🟩 |
| check-dynamic-deps | 39s | 🟩 |
| semgrep/ci | 29s | 🟩 |
| general-lints | 28s | 🟩 |
| permission-check | 10s | 🟩🟩 |
| file_change_determinator | 10s | 🟩 |
| permission-check | 5s | 🟩🟩 |


@vgao1996 vgao1996 changed the title from "[DRAFT][aptos-workspace-server] add postgres support" to "[DRAFT][aptos-workspace-server] add postgres support + graceful shutdown" on Nov 9, 2024
.context("failed to start node api")?;
res_indexer_grpc
.map_err(anyhow::Error::msg)
.context("failed to start node api")?;

The error message indicates "failed to start node api" but this line is checking the indexer_grpc result. The message should be updated to "failed to start indexer grpc" to accurately reflect which service failed to start.
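
Applying that suggestion, the corrected lines would read:

```rust
res_indexer_grpc
    .map_err(anyhow::Error::msg)
    .context("failed to start indexer grpc")?;
```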

Spotted by Graphite Reviewer


@vgao1996 vgao1996 changed the title from "[DRAFT][aptos-workspace-server] add postgres support + graceful shutdown" to "[DRAFT][aptos-workspace-server] indexer support and graceful shutdown" on Nov 14, 2024
@0xmaayan (Contributor) left a comment

  1. Do we show an error when trying to spin up the server but Docker is not available? Same as what we do for the current localnet.
  2. With the existing CLI, we use a JSON format with Result/Error. This helps downstream processes (i.e., the SDK) handle errors and logs as expected (for reference). Could we make sure we use the same format in aptos-workspace-server as well? (A sketch of producing this format follows the examples.)

{"Error": "some error"}
{"Result": "success"}
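
A minimal sketch of emitting that envelope from Rust, assuming serde/serde_json and a hypothetical Output enum (serde's default externally tagged representation yields exactly this shape):

```rust
use serde::Serialize;

// Hypothetical envelope mirroring the CLI's Result/Error JSON convention.
#[derive(Serialize)]
enum Output {
    Result(String),
    Error(String),
}

fn main() {
    // Prints {"Result":"success"}
    println!(
        "{}",
        serde_json::to_string(&Output::Result("success".into())).unwrap()
    );
    // Prints {"Error":"some error"}
    println!(
        "{}",
        serde_json::to_string(&Output::Error("some error".into())).unwrap()
    );
}
```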

@vgao1996 vgao1996 force-pushed the aws-01 branch 8 times, most recently from fa41505 to b00b7cd on November 19, 2024 01:01
@vgao1996 vgao1996 changed the title from "[DRAFT][aptos-workspace-server] indexer support and graceful shutdown" to "[aptos-workspace-server] indexer support and graceful shutdown" on Nov 19, 2024
@vgao1996 vgao1996 marked this pull request as ready for review November 19, 2024 01:05
@vgao1996 vgao1996 requested a review from zekun000 November 19, 2024 22:40
@banool (Contributor) left a comment

These docs are beautiful, super easy to follow :')

tokio::pin!(fut_faucet_finish);
// Phase 2: Wait for all services to be up.
let all_services_up = async move {
tokio::try_join!(
Contributor

I suppose by design you can't configure as much as you can with localnet, e.g. using a host postgres / hasura, choosing to not run the faucet / particular processors, etc?

Contributor Author

My plan is to make these configurable at a later time. Design-wise it will require a bit more work (storing things in Option<impl Future> or some other collection), but otherwise it's perfectly doable.
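
A rough sketch of that design with hypothetical names; since `impl Future` can't be stored in a field or collection directly, a boxed future stands in:

```rust
use std::{future::Future, pin::Pin};

type BoxFut = Pin<Box<dyn Future<Output = anyhow::Result<()>>>>;

// Hypothetical: only build the faucet future when the user asked for it.
fn maybe_start_faucet(enabled: bool) -> Option<BoxFut> {
    enabled.then(|| -> BoxFut {
        Box::pin(async {
            // ... actually start the faucet here ...
            anyhow::Ok(())
        })
    })
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // With the faucet disabled, there is simply no future to await.
    if let Some(fut) = maybe_start_faucet(false) {
        fut.await?;
    }
    Ok(())
}
```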

Contributor Author

> using a host postgres / hasura, choosing to not run the faucet / particular processors, etc?

These are crucial for the localnet unification.


let (options, config) =
create_container_options_and_config(instance_id, docker_network_name);
let (fut_container, fut_container_cleanup) =
Contributor

Correct me if I'm wrong, but I notice you don't try to clear out any leftover containers before creating the new one, e.g. if the cleanup failed or the user sent SIGINT or whatever. Should you be doing so here? Or I suppose it's not necessary because you'll use different names and networks each time?

Contributor Author

> I suppose it's not necessary because you'll use different names and networks each time?

Correct, every aptos-workspace-server instance creates its own network, volumes and containers, so there aren't really leftovers (from self) to clean up.
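
A sketch of that per-instance naming idea; the name patterns here are made up, but the PR similarly threads an instance_id into helpers like create_container_options_and_config:

```rust
use uuid::Uuid;

// Hypothetical naming scheme: deriving docker resource names from a fresh
// per-run id means concurrent instances never collide, and no run mistakes
// another run's resources for its own leftovers.
fn resource_names(instance_id: Uuid) -> (String, String, String) {
    (
        format!("aptos-workspace-{instance_id}"),          // network
        format!("aptos-workspace-{instance_id}-postgres"), // volume
        format!("aptos-workspace-{instance_id}-indexer"),  // container
    )
}

fn main() {
    let (network, volume, container) = resource_names(Uuid::new_v4());
    println!("{network}\n{volume}\n{container}");
}
```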

However I do think we should add a separate pass to clean up leftovers globally, in case some previous runs aborted abnormally. Still need to figure out a few more details:

  • When do we run it? During localnet start up or as a global pass when you hit the workspace test command?
  • When do we consider a docker resource removable? More than 2 hours since it was created?
  • Do we need some sort of global lock?

Contributor

Hard to say when it's safely removable, yeah. Maybe a workspace prune command would be better.

Contributor

IIUC, we create a docker resource for each test suite, and after the test finishes we technically don't need it anymore. So we can remove the resource in the "after" hook in workspace (which runs after each test suite finishes).

Contributor Author

> So we can remove the resource in the "after" hook in workspace (which runs after each test suite finishes)

@0xmaayan Yeah, on the TS side we need to make sure SIGTERM gets sent to the aptos-workspace-server binary once it's no longer in use, so it can clean up the resources. I'll double-check this part later today.
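
For reference, a minimal sketch of catching SIGTERM in a tokio binary so the cleanup futures get a chance to run (illustrative wiring, not the PR's exact code; requires tokio's "signal" feature and a unix target):

```rust
use tokio::signal::unix::{signal, SignalKind};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // SIGTERM is what the TS-side runner would send; also handle ctrl-c.
    let mut sigterm = signal(SignalKind::terminate())?;
    tokio::select! {
        _ = sigterm.recv() => println!("received SIGTERM, shutting down"),
        _ = tokio::signal::ctrl_c() => println!("received ctrl-c, shutting down"),
    }
    // ... await the cleanup futures here so docker resources are removed ...
    Ok(())
}
```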

}
}

/// Returns the URLfor connecting to the indexer grpc service.
Contributor

Nit

Suggested change
/// Returns the URLfor connecting to the indexer grpc service.
/// Returns the URL for connecting to the indexer grpc service.



Contributor

✅ Forge suite realistic_env_max_load success on 5240dfd1481b2abf161457ab510d9b33803e4750

two traffics test: inner traffic : committed: 14564.58 txn/s, latency: 2730.75 ms, (p50: 2700 ms, p70: 2700, p90: 2900 ms, p99: 3200 ms), latency samples: 5537760
two traffics test : committed: 99.92 txn/s, latency: 1574.22 ms, (p50: 1400 ms, p70: 1400, p90: 1500 ms, p99: 9500 ms), latency samples: 1780
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 1.987, avg: 1.529", "ConsensusProposalToOrdered: max: 0.316, avg: 0.290", "ConsensusOrderedToCommit: max: 0.378, avg: 0.366", "ConsensusProposalToCommit: max: 0.667, avg: 0.656"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 1.07s no progress at version 2826089 (avg 0.20s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 8.59s no progress at version 2826087 (avg 8.59s) [limit 15].
Test Ok

Contributor

✅ Forge suite framework_upgrade success on 7f9825b598e86c97526ffe2c796ee17b565f933a ==> 5240dfd1481b2abf161457ab510d9b33803e4750

Compatibility test results for 7f9825b598e86c97526ffe2c796ee17b565f933a ==> 5240dfd1481b2abf161457ab510d9b33803e4750 (PR)
Upgrade the nodes to version: 5240dfd1481b2abf161457ab510d9b33803e4750
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1343.00 txn/s, submitted: 1345.48 txn/s, failed submission: 2.48 txn/s, expired: 2.48 txn/s, latency: 2280.09 ms, (p50: 2100 ms, p70: 2400, p90: 3600 ms, p99: 4800 ms), latency samples: 119080
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1316.48 txn/s, submitted: 1320.04 txn/s, failed submission: 3.56 txn/s, expired: 3.56 txn/s, latency: 2274.76 ms, (p50: 2100 ms, p70: 2400, p90: 3800 ms, p99: 5400 ms), latency samples: 118340
5. check swarm health
Compatibility test for 7f9825b598e86c97526ffe2c796ee17b565f933a ==> 5240dfd1481b2abf161457ab510d9b33803e4750 passed
Upgrade the remaining nodes to version: 5240dfd1481b2abf161457ab510d9b33803e4750
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1545.78 txn/s, submitted: 1548.32 txn/s, failed submission: 2.54 txn/s, expired: 2.54 txn/s, latency: 2194.90 ms, (p50: 2100 ms, p70: 2400, p90: 3000 ms, p99: 4800 ms), latency samples: 121900
Test Ok

Contributor

✅ Forge suite compat success on 7f9825b598e86c97526ffe2c796ee17b565f933a ==> 5240dfd1481b2abf161457ab510d9b33803e4750

Compatibility test results for 7f9825b598e86c97526ffe2c796ee17b565f933a ==> 5240dfd1481b2abf161457ab510d9b33803e4750 (PR)
1. Check liveness of validators at old version: 7f9825b598e86c97526ffe2c796ee17b565f933a
compatibility::simple-validator-upgrade::liveness-check : committed: 17180.12 txn/s, latency: 1977.16 ms, (p50: 2100 ms, p70: 2100, p90: 2200 ms, p99: 2500 ms), latency samples: 551880
2. Upgrading first Validator to new version: 5240dfd1481b2abf161457ab510d9b33803e4750
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 7889.61 txn/s, latency: 3609.26 ms, (p50: 4100 ms, p70: 4200, p90: 4300 ms, p99: 4400 ms), latency samples: 146200
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 7085.75 txn/s, latency: 4464.98 ms, (p50: 4400 ms, p70: 4400, p90: 6800 ms, p99: 7100 ms), latency samples: 236140
3. Upgrading rest of first batch to new version: 5240dfd1481b2abf161457ab510d9b33803e4750
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 6288.95 txn/s, latency: 4509.28 ms, (p50: 5100 ms, p70: 5400, p90: 5500 ms, p99: 5600 ms), latency samples: 114500
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 5705.14 txn/s, latency: 5625.87 ms, (p50: 5500 ms, p70: 6100, p90: 7400 ms, p99: 8400 ms), latency samples: 216800
4. upgrading second batch to new version: 5240dfd1481b2abf161457ab510d9b33803e4750
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 12831.70 txn/s, latency: 2142.89 ms, (p50: 2300 ms, p70: 2500, p90: 2700 ms, p99: 2800 ms), latency samples: 220360
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 8981.49 txn/s, latency: 3707.62 ms, (p50: 2500 ms, p70: 2600, p90: 4200 ms, p99: 15400 ms), latency samples: 408900
5. check swarm health
Compatibility test for 7f9825b598e86c97526ffe2c796ee17b565f933a ==> 5240dfd1481b2abf161457ab510d9b33803e4750 passed
Test Ok

@vgao1996 vgao1996 merged commit 40a7425 into aptos-labs:main Nov 20, 2024
81 of 92 checks passed
@zekun000 (Contributor) left a comment

I'd probably separate the cleanup from the normal path, but it's just a nit. Something like this:

```rust
// inside the service: normal path only
async move {
    all_services_up.await?;
    select! {
        // wait for finish
    }
    Ok(())
}
```

```rust
// caller: shutdown and cleanup handled separately
let (run, cleanup) = run_service();
select! {
    _ = shutdown => (),
    result = run => if let Err(_) = result {},
}
cleanup.await;
```
