
[Demand Scalability] Permissionless demand load testing & validation #742

Open · 21 tasks
Olshansk opened this issue Aug 16, 2024 · 7 comments
Labels: infra (Infra or tooling related improvements, additions or fixes) · scalability · tooling (Tooling - CLI, scripts, helpers, off-chain, etc...)

Comments

@Olshansk (Member)

Objective

Ensure the network can manage permissionless gateways, applications, services and other types of demand.

Origin Document

Goals

  • Build & document a set of tools/processes to determine what the MainNet governance parameters need to be
  • Identify any issues in enabling permissionless demand
  • Load (not just stress) test the network to its maximum today
  • Load testing docs: https://dev.poktroll.com/operate/testing/load_testing

Deliverables

  • A load test that involves:
    • Scale from 1 to 10,000 on-chain services
    • Scale from 1 to 10,000 on-chain applications/gateways (doesn't matter if self-signing or not)
    • 5 Suppliers per service (don't need independent data nodes)
    • Number of requests per second needs to be "high enough" to create on-chain claims & proofs; can be played around with (see the sizing sketch after this list)
  • Data to evaluate (scalability)
    • Growth of on-chain data state / bloat (e.g. # of MBs)
    • OS Metrics (CPU, Memory, Disk) of all actors w/ special focus on:
      • Validators
      • Full Nodes
  • Data to track (visibility)
    • Use pocketdex (indexer) to track # of on-chain claims, proofs and related events
      • Easily query counts of historical on-chain events & types
    • Use grafana dashboards to track relay difficulty
  • Blog post that will be published to the community showing testing process & performance
  • [Optional Bonus] Evaluate the impact (if at all applicable) of session rollovers
    • Question: are there errors during these times?
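
To make the "high enough" request rate above concrete, here is a minimal sizing sketch in Go. This is not the repo's actual load-test harness; all type/field names and the session parameters are assumptions, and the proof threshold/probability values are taken from the node logs later in this thread (20000000upokt, 0.25).

```go
// Hypothetical sizing sketch; parameter values are placeholders, not real network params.
package main

import "fmt"

type LoadTestPlan struct {
	Services            int     // scale from 1 to 10,000
	SuppliersPerService int     // 5, per the deliverables above
	RelaysPerSecond     float64 // the knob to "play around with"
	BlocksPerSession    int     // assumed network parameter
	SecondsPerBlock     float64 // assumed network parameter
	UpoktPerRelay       int64   // compute units per relay x multiplier (assumed)
}

// claimedUpoktPerSupplierSession estimates the size of each supplier's claim
// when traffic is spread evenly across services and suppliers.
func (p LoadTestPlan) claimedUpoktPerSupplierSession() int64 {
	sessionSeconds := float64(p.BlocksPerSession) * p.SecondsPerBlock
	relays := p.RelaysPerSecond * sessionSeconds / float64(p.Services*p.SuppliersPerService)
	return int64(relays) * p.UpoktPerRelay
}

func main() {
	// Threshold/probability as reported by the node logs later in this issue:
	// proofs are mandatory at >= 20000000upokt, otherwise sampled at p=0.25.
	const proofThresholdUpokt = 20_000_000

	p := LoadTestPlan{
		Services: 100, SuppliersPerService: 5,
		RelaysPerSecond: 500, BlocksPerSession: 10, SecondsPerBlock: 6,
		UpoktPerRelay: 42,
	}
	claimed := p.claimedUpoktPerSupplierSession()
	fmt.Printf("claim per supplier-session: %dupokt; proof always required: %v\n",
		claimed, claimed >= proofThresholdUpokt)
}
```

Claims below the threshold only require a proof with probability 0.25, so the relay rate directly controls how many on-chain proofs the test exercises.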

Non-goals / Non-deliverables

  • Involving community members in the load test
  • Fixing the relevant issues identified

General deliverables

  • Comments: Add/update TODOs and comments alongside the source code so it is easier to follow.
  • Testing: Add new tests (unit and/or E2E) to the test suite.
  • Makefile: Add new targets to the Makefile to make the new functionality easier to use.
  • Documentation: Update architectural or development READMEs; use mermaid diagrams where appropriate.

Creator: @Olshansk
Co-Owners: @okdas

Olshansk added this to the Shannon Beta TestNet Launch milestone, applied the infra and tooling labels, and moved it to 🔖 Ready in the Shannon project on Aug 16, 2024.
@Olshansk (Member, Author)

Update from @okdas

Hey, I wanted to share a quick update on the permissionless demand load testing effort.
- I decided to do all testing on our TestNet. We've got gateway and supplier infrastructure deployed there, and it currently handles just hundreds of requests.
- I don't have any interesting visuals yet, but there are some findings:
    - The validator consumes a lot of resources, but that may be a result of the large number of RPC requests hitting the validator endpoint.
        - I'm going to point that endpoint at the full node so the validator will only validate.
        - There might also be some room for improvement in how the gateway/relayminer queries the data. Will check.
    - Gateways crash often. Might be a resource constraint, but since we are going to have a different gateway (PATH), I'll throw more resources at them instead of performing deep troubleshooting/investigation.
    - Some of the blocks were pretty large for the amount of traffic (2.5 MiB). Will investigate and post findings tomorrow. (Recent example: https://shannon.testnet.pokt.network/poktroll/block/10297)
- Currently in the process of deploying an indexer so we can also get more insight.
- I had issues with creating a lot of services from one address: the same `account sequence mismatch, expected *, got *: incorrect account sequence` issue.
    - For some reason our CLI ignores the `--sequence=` argument.
    - Cosmos SDK 0.51 will have unordered transactions, rendering this a non-issue in the future.
    - The current workaround is to create many addresses, fund them with a multi-send, and add services from the many accounts in parallel (a hypothetical sketch follows below).
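
A minimal sketch of that workaround, driving the `poktrolld` CLI from Go. `keys add`, `keys show -a`, and `tx bank multi-send` are standard Cosmos SDK subcommands; the `tx service add-service` arguments and the `faucet` key name are assumptions for illustration, not verified flags.

```go
// Hypothetical workaround sketch: fan service creation out across many fresh
// accounts so every tx uses an independent account sequence.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// run executes a poktrolld subcommand and panics with its output on failure.
func run(args ...string) string {
	out, err := exec.Command("poktrolld", args...).CombinedOutput()
	if err != nil {
		panic(fmt.Sprintf("poktrolld %s: %v\n%s", strings.Join(args, " "), err, out))
	}
	return string(out)
}

func main() {
	const n = 100 // number of throwaway service-creator accounts

	// 1. Create n keys and collect their addresses.
	addrs := make([]string, 0, n)
	for i := 0; i < n; i++ {
		name := fmt.Sprintf("svc-creator-%d", i)
		run("keys", "add", name, "--keyring-backend", "test")
		addr := strings.TrimSpace(run("keys", "show", name, "-a", "--keyring-backend", "test"))
		addrs = append(addrs, addr)
	}

	// 2. Fund them all in one multi-send from a funded key ("faucet" is assumed).
	args := append([]string{"tx", "bank", "multi-send", "faucet"}, addrs...)
	args = append(args, "1000000upokt", "--keyring-backend", "test", "--yes")
	run(args...)

	// 3. Add one service per account; sequences never collide across accounts,
	//    so this loop can also be parallelized. (add-service args are assumed.)
	for i := range addrs {
		run("tx", "service", "add-service",
			fmt.Sprintf("svc-%d", i), fmt.Sprintf("Load Test Service %d", i),
			"--from", fmt.Sprintf("svc-creator-%d", i),
			"--keyring-backend", "test", "--yes")
	}
}
```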

@okdas (Member)

okdas commented Sep 9, 2024

Performed more testing last week and ended up breaking the infrastructure around the validator's RPC.

To mitigate, I deployed and staked two more validators.
Will rerun the largest test yet with relayminers pointed directly at a different node (without the load-balancer and ingress-nginx).

@okdas okdas moved this from 🔖 Ready to 🏗 In progress in Shannon Sep 9, 2024
@okdas (Member)

okdas commented Sep 30, 2024

Last time we synced on this, we made a decision to:

  • Focus on performing a load-test locally on our machines before doing larger tests on TestNet.
  • Bring PATH to LocalNet.

As any somewhat large load test currently breaks the network (#841), I'll be focusing on secondary goals: observability (lots of changes in #832) and deploying PATH on TestNet.

@okdas (Member)

okdas commented Oct 29, 2024

Have been running into this issue during load testing lately; will see if it's low-hanging fruit.


{"level":"info","session_end_height":30,"claim_window_open_height":32,"message":"waiting & blocking until the earliest claim commit height offset seed block height"}
{"level":"info","session_end_height":30,"claim_window_open_height":32,"claim_window_open_block_hash":"e36c39f113e47da8e0b1b3417bd05e823086c63a04f6f296382dd00d3b03877a","message":"observed earliest claim commit height offset seed block height"}
{"level":"info","session_end_height":30,"claim_window_open_height":32,"claim_window_open_block_hash":"e36c39f113e47da8e0b1b3417bd05e823086c63a04f6f296382dd00d3b03877a","earliest_claim_commit_height":32,"message":"waiting & blocking until the earliest claim commit height for this supplier"}
{"level":"info","session_end_height":30,"claim_window_open_height":32,"claim_window_open_block_hash":"e36c39f113e47da8e0b1b3417bd05e823086c63a04f6f296382dd00d3b03877a","earliest_claim_commit_height":32,"message":"observed earliest claim commit height"}
{"level":"info","app_addr":"pokt1mrqt5f7qh8uxs27cjm9t7v9e74a9vvdnq5jva4","service_id":"anvil","session_id":"cb5157c91af08f0d126765b9279f2b0891ef5a56e64d50f396b2273a9464240b","supplier_operator_addr":"pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj","message":"created a new claim"}
{"level":"error","error":"with hash 0d82ff8b8e65935dae1ed423c9f4e8aa29b2036df3de78d0aea43d07b1e8a1f2: failed to execute message; message index: 0: rpc error: code = FailedPrecondition desc = current block height (37) is greater than session claim window close height (36): claim attempted outside of the session's claim window: tx timed out","message":"failed to create claims"}
{"level":"error","error":"with hash 0d82ff8b8e65935dae1ed423c9f4e8aa29b2036df3de78d0aea43d07b1e8a1f2: failed to execute message; message index: 0: rpc error: code = FailedPrecondition desc = current block height (37) is greater than session claim window close height (36): claim attempted outside of the session's claim window: tx timed out"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x3c419a4]

goroutine 276 [running]:
github.com/pokt-network/poktroll/pkg/relayer/session.(*sessionTree).Delete(0x400192a6e0)
	/Users/dk/pocket/poktroll/pkg/relayer/session/sessiontree.go:285 +0x344
github.com/pokt-network/poktroll/pkg/relayer/session.(*relayerSessionsManager).deleteExpiredSessionTreesFn.func1({0x52d5290, 0x400197fd40}, {0x4000e76a00, 0x1, 0x1})
	/Users/dk/pocket/poktroll/pkg/relayer/session/session.go:478 +0x278
github.com/pokt-network/poktroll/pkg/observable/channel.ForEach[...].func1({0x4000e76a00, 0x1, 0x1})
	/Users/dk/pocket/poktroll/pkg/observable/channel/map.go:103 +0x6c
github.com/pokt-network/poktroll/pkg/observable/channel.goMapTransformNotification[...]({0x52d5290, 0x400197fd40}, {0x52ce590, 0x4000b71ec0}, 0x400013f860, 0x400013f8c0, 0x4000b9c9a0)
	/Users/dk/pocket/poktroll/pkg/observable/channel/map.go:125 +0xc4
created by github.com/pokt-network/poktroll/pkg/observable/channel.Map[...] in goroutine 1
	/Users/dk/pocket/poktroll/pkg/observable/channel/map.go:24 +0x318
[event: pod relayminer1-687547c69f-lvc5h] Container image "poktrolld:tilt-c8d80bb2e7daf0e1" already present on machine
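
Reading the log above: the claim window for the session ending at height 30 opened at 32 and (per the error) closed at 36, but the claim tx timed out and finally landed at height 37, i.e. outside the window; the relayminer then panics in `sessionTree.Delete`, presumably dereferencing claim state that was never set. A tiny sketch of the window check implied by the error (names are assumptions, not the repo's actual API):

```go
// Hypothetical sketch of the claim-window check implied by the error above.
package main

import "fmt"

type claimWindow struct {
	openHeight  int64 // earliest height a claim may be committed
	closeHeight int64 // latest height a claim may be committed
}

func (w claimWindow) contains(height int64) bool {
	return height >= w.openHeight && height <= w.closeHeight
}

func main() {
	// Heights from the log: window opens at 32, closes at 36; the tx timed
	// out and was finally executed at height 37 => FailedPrecondition.
	w := claimWindow{openHeight: 32, closeHeight: 36}
	for _, h := range []int64{32, 36, 37} {
		fmt.Printf("height %d inside claim window: %v\n", h, w.contains(h))
	}
}
```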

@okdas (Member)

okdas commented Oct 29, 2024

Okaaay, seems like there's another issue that breaks the network that we'll need to address before the upgrade. Looking into this as well:


12:34AM INF Timed out dur=14979.481981 height=60 module=consensus round=0 step=RoundStepNewHeight
12:34AM INF received proposal module=consensus proposal="Proposal{60/0 (E8DDDC9B7FD3B5622492459BCCF5B768577045437033E911194DD10B015DC918:1:7CE673CD6F5A, -1) 3376100465F5 @ 2024-10-29T00:34:58.805851558Z}" proposer=A6B0BAD7039843C118CFC588D5A6D38C459B9C25
12:34AM INF received complete proposal block hash=E8DDDC9B7FD3B5622492459BCCF5B768577045437033E911194DD10B015DC918 height=60 module=consensus
12:34AM INF finalizing commit of block hash=E8DDDC9B7FD3B5622492459BCCF5B768577045437033E911194DD10B015DC918 height=60 module=consensus num_txs=0 root=8CF58F38B7F1DC22E6E227E7F74885A80B061E11ED20CA106E2E513553BF7113
12:34AM INF Stored block hash at height 60 EndBlock=SessionModuleEndBlock module=x/session
12:34AM INF found 1 expiring claims at block height 60 method=SettlePendingClaims module=x/tokenomics
12:34AM INF claim does not require proof due to claimed amount (1048950upokt) being less than the threshold (20000000upokt) and random sample (0.35) being greater than probability (0.25) method=proofRequirementForClaim module=server
12:34AM INF Claim by supplier pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj IS WITHIN LIMITS of servicing application pokt1mrqt5f7qh8uxs27cjm9t7v9e74a9vvdnq5jva4. Max claimable amount >= Claim amount: 6663868upokt >= 1048950 application=pokt1mrqt5f7qh8uxs27cjm9t7v9e74a9vvdnq5jva4 claim_settlement_upokt=1048950 helper=ensureClaimAmountLimits method=ProcessTokenLogicModules module=x/tokenomics num_claim_compute_units=24975 num_relays=24975 service_id=anvil session_id=77eb6177946bf35f3a00c5932d94b64ff47224ad30971063da73405275bda49c supplier_operator=pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj
12:34AM INF About to start processing TLMs for (24975) compute units, equal to (1048950upokt) claimed actual_settlement_upokt=1048950upokt application=pokt1mrqt5f7qh8uxs27cjm9t7v9e74a9vvdnq5jva4 claim_settlement_upokt=1048950 method=ProcessTokenLogicModules module=x/tokenomics num_claim_compute_units=24975 num_relays=24975 service_id=anvil session_id=77eb6177946bf35f3a00c5932d94b64ff47224ad30971063da73405275bda49c supplier_operator=pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj
12:34AM INF Starting TLM processing: "TLMRelayBurnEqualsMint" actual_settlement_upokt=1048950upokt application=pokt1mrqt5f7qh8uxs27cjm9t7v9e74a9vvdnq5jva4 claim_settlement_upokt=1048950 method=ProcessTokenLogicModules module=x/tokenomics num_claim_compute_units=24975 num_relays=24975 service_id=anvil session_id=77eb6177946bf35f3a00c5932d94b64ff47224ad30971063da73405275bda49c supplier_operator=pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj
12:34AM INF sent 1048950upokt from the supplier module to the supplier shareholder with address "pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj" method=distributeSupplierRewardsToShareHolders module=x/tokenomics
12:34AM INF distributed 1048950 uPOKT to supplier "pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj" shareholders method=distributeSupplierRewardsToShareHolders module=x/tokenomics
12:34AM ERR error processing token logic modules for claim "77eb6177946bf35f3a00c5932d94b64ff47224ad30971063da73405275bda49c": TLM "TLMRelayBurnEqualsMint": burning 1048950upokt from the application module account: spendable balance 958026upokt is smaller than 1048950upokt: insufficient funds [cosmos/[email protected]/x/bank/keeper/send.go:278]: failed to burn uPOKT from application module account [/Users/dk/go/pkg/mod/cosmossdk.io/[email protected]/errors.go:155]: failed to process TLM [/Users/dk/go/pkg/mod/cosmossdk.io/[email protected]/errors.go:155] claimed_upokt=1048950upokt module=server num_claim_compute_units=24975 num_estimated_compute_units=24975 num_relays_in_session_tree=24975 proof_requirement=NOT_REQUIRED session_id=77eb6177946bf35f3a00c5932d94b64ff47224ad30971063da73405275bda49c supplier_operator_address=pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj
12:34AM ERR could not settle pending claims due to error TLM "TLMRelayBurnEqualsMint": burning 1048950upokt from the application module account: spendable balance 958026upokt is smaller than 1048950upokt: insufficient funds [cosmos/[email protected]/x/bank/keeper/send.go:278]: failed to burn uPOKT from application module account [/Users/dk/go/pkg/mod/cosmossdk.io/[email protected]/errors.go:155]: failed to process TLM [/Users/dk/go/pkg/mod/cosmossdk.io/[email protected]/errors.go:155] method=EndBlocker module=x/tokenomics
12:34AM ERR CONSENSUS FAILURE!!! err="runtime error: invalid memory address or nil pointer dereference" module=consensus stack="goroutine 180 [running]:\nruntime/debug.Stack()\n\t/opt/homebrew/Cellar/go/1.23.2/libexec/src/runtime/debug/stack.go:26 +0x64\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine.func2()\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/consensus/state.go:801 +0x4c\npanic({0x3f299c0?, 0x713b210?})\n\t/opt/homebrew/Cellar/go/1.23.2/libexec/src/runtime/panic.go:785 +0xf0\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).FinalizeBlock.func1()\n\t/Users/dk/go/pkg/mod/github.com/cosmos/[email protected]/baseapp/abci.go:860 +0x124\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).FinalizeBlock(0x4000223208, 0x4004f50480)\n\t/Users/dk/go/pkg/mod/github.com/cosmos/[email protected]/baseapp/abci.go:892 +0x374\ngithub.com/cosmos/cosmos-sdk/server.cometABCIWrapper.FinalizeBlock({{0xffff74564168, 0x4001081308}}, {0x52d53a8, 0x7202380}, 0x4004f50480)\n\t/Users/dk/go/pkg/mod/github.com/cosmos/[email protected]/server/cmt_abci.go:44 +0x54\ngithub.com/cometbft/cometbft/abci/client.(*localClient).FinalizeBlock(0x400185df20, {0x52d53a8, 0x7202380}, 0x4004f50480)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/abci/client/local_client.go:185 +0xf8\ngithub.com/cometbft/cometbft/proxy.(*appConnConsensus).FinalizeBlock(0x40015806a8, {0x52d53a8, 0x7202380}, 0x4004f50480)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/proxy/app_conn.go:104 +0x1d0\ngithub.com/cometbft/cometbft/state.(*BlockExecutor).applyBlock(_, {{{0xb, 0x0}, {0x40013a2cb9, 0x7}}, {0x40013a2ce0, 0x8}, 0x1, 0x3b, {{0x400534e5a0, ...}, ...}, ...}, ...)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/state/execution.go:224 +0x3c0\ngithub.com/cometbft/cometbft/state.(*BlockExecutor).ApplyVerifiedBlock(_, {{{0xb, 0x0}, {0x40013a2cb9, 0x7}}, {0x40013a2ce0, 0x8}, 0x1, 0x3b, {{0x400534e5a0, ...}, ...}, ...}, ...)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/state/execution.go:202 +0xd8\ngithub.com/cometbft/cometbft/consensus.(*State).finalizeCommit(0x4001729188, 0x3c)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/consensus/state.go:1772 +0xd50\ngithub.com/cometbft/cometbft/consensus.(*State).tryFinalizeCommit(0x4001729188, 0x3c)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/consensus/state.go:1682 +0x2c0\ngithub.com/cometbft/cometbft/consensus.(*State).enterCommit.func1()\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/consensus/state.go:1617 +0xb8\ngithub.com/cometbft/cometbft/consensus.(*State).enterCommit(0x4001729188, 0x3c, 0x0)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/consensus/state.go:1655 +0xd90\ngithub.com/cometbft/cometbft/consensus.(*State).addVote(0x4001729188, 0x4002e89a00, {0x0, 0x0})\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/consensus/state.go:2335 +0x26c0\ngithub.com/cometbft/cometbft/consensus.(*State).tryAddVote(0x4001729188, 0x4002e89a00, {0x0, 0x0})\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/consensus/state.go:2067 +0x50\ngithub.com/cometbft/cometbft/consensus.(*State).handleMsg(0x4001729188, {{0x529e7c0, 0x40016261d8}, {0x0, 0x0}})\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/consensus/state.go:929 +0x5c0\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine(0x4001729188, 0x0)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/consensus/state.go:856 +0x5fc\ncreated by github.com/cometbft/cometbft/consensus.(*State).OnStart in goroutine 1\n\t/Users/dk/go/pkg/mod/github.com/cometbft/[email protected]/consensus/state.go:398 +0x1e4\n"
12:34AM INF service stop impl=baseWAL module=consensus msg="Stopping baseWAL service" wal=/root/.poktroll/data/cs.wal/wal
12:34AM INF service stop impl=Group module=consensus msg="Stopping Group service" wal=/root/.poktroll/data/cs.wal/wal
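
The failure chain above: the claim settles for 1048950upokt, but the application module account only holds 958026upokt, so the burn fails, the error escapes `EndBlocker`, and the node panics into CONSENSUS FAILURE. One defensive pattern (a sketch only, not the project's actual fix; see the follow-up comments below regarding the real PRs) is to clamp the burn to the spendable balance and surface the shortfall instead of failing the block:

```go
// Hypothetical defensive-settlement sketch; not the repo's actual fix.
package main

import (
	"errors"
	"fmt"
)

// settleClaim burns at most the module account's spendable balance and
// reports a shortfall instead of returning an error that would panic
// consensus when propagated out of EndBlocker.
func settleClaim(claimedUpokt, spendableUpokt int64) (burned, shortfall int64, err error) {
	if claimedUpokt < 0 || spendableUpokt < 0 {
		return 0, 0, errors.New("negative amount")
	}
	if claimedUpokt <= spendableUpokt {
		return claimedUpokt, 0, nil
	}
	return spendableUpokt, claimedUpokt - spendableUpokt, nil
}

func main() {
	// Numbers from the log above: claim 1048950upokt vs spendable 958026upokt.
	burned, shortfall, _ := settleClaim(1048950, 958026)
	fmt.Printf("burned %dupokt, shortfall %dupokt (block keeps finalizing)\n",
		burned, shortfall)
}
```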

@Olshansk (Member, Author)

@red-0ne Can you soft-confirm if the last one should be solved by the PRs you have open right now?

If so:

  1. Which one?
  2. Can you double-check that there's on-chain safety against this?

@okdas (Member)

okdas commented Nov 18, 2024

I can confirm that the last issue has been addressed. Didn't have a chance to run a larger test last week, so I need to do it this week.
