
op-batcher: Assess da-type choice when submitting a channel #11609

Closed
sebastianst opened this issue Aug 27, 2024 · 8 comments · Fixed by #12002

@sebastianst
Member

sebastianst commented Aug 27, 2024

Context

Since #11219 the batcher can dynamically switch between calldata and blobs, depending on the economically best option at the time of starting a new channel. This works reasonably well for high-throughput chains, where the delay between starting and submitting a channel is usually only a few minutes. But it is still not optimal, as the most economical option might have changed while the channel was being built. Furthermore, for low-throughput chains this behavior is useless, because the da-type decision can be multiple hours stale by the time the channel is submitted. And lastly, the decision is currently made under the assumption that the channel will be filled to its size limit. But the channel may not be full, e.g. it may use fewer frames (= blobs) with a half-empty last frame because a short max-channel-duration setting closed it early (especially relevant on testnets).

Change Description

To address these shortcomings, the batcher should perform a second check at the time it would start submitting a channel, i.e., when posting the first batcher transaction for that channel, and possibly rebuild the channel using different da-type settings. (A channel almost always fits into a single batcher transaction, unless the first block added to it doesn't fit into the target-num-frames frames.)

The goal of this improvement is to always select the best da-type for a channel at the time of tx submission, and to make the dynamic da-type selection feature usable for low-throughput chains.

Implementation Hints

It probably makes most sense to move the channel config retrieval via the ChannelConfigProvider from the point at which a new channel is created (the cfg := s.cfgProvider.ChannelConfig() call) to the point at which the first frame of any channel is submitted (the places where we return non-empty txData inside channelManager.TxData()), and then reuse that config for the following channel. This has two advantages:

  1. This adds the check for the correct channel config at the point when submission happens. Most of the time this check will confirm the current configuration.
  2. The channel config of the previous channel is the best estimator for the correct channel config of the following channel, when the exact data content isn't known yet.

Note that a special case is a fresh batcher restart, at which point an estimated configuration has to be retrieved once before the first channel. But this could happen in the batcher's initialization path.
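
A minimal sketch of what this deferred check could look like. All type and field names below are hypothetical stand-ins for the real op-batcher structures, chosen only to illustrate the flow described above:

type ChannelConfig struct{ UseBlobs bool }

type ChannelConfigProvider interface {
	ChannelConfig(data []byte) ChannelConfig // proposed new method, see below
}

type channel struct {
	cfg            ChannelConfig
	pendingData    []byte // frame data not yet submitted
	frameSubmitted bool   // true once the first frame went out
}

type channelManager struct {
	cfgProvider ChannelConfigProvider
	defaultCfg  ChannelConfig // reused as estimate for the following channel
	current     *channel
}

// checkConfigBeforeSubmit runs just before the first frame of a channel is
// submitted: it re-evaluates the best config against the now-known exact
// data and reports whether the channel must be rebuilt.
func (m *channelManager) checkConfigBeforeSubmit() (rebuild bool) {
	if m.current.frameSubmitted {
		return false // config is immutable once submission has started
	}
	newCfg := m.cfgProvider.ChannelConfig(m.current.pendingData)
	m.defaultCfg = newCfg // also the best estimator for the next channel
	return newCfg.UseBlobs != m.current.cfg.UseBlobs
}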

Channel Rebuilding

If the check doesn't confirm the channel config that's currently in use, the channel should be rebuilt using the new config. One good option to accomplish this may be to reinsert the blocks into the L2 blocks queue inside the channelManager, so that the existing code path is used to build the channel with the new channel config. This has the advantage that, e.g., when switching from calldata to blobs, the channel suddenly has more space, so it wouldn't be immediately submitted but would continue to fill up regularly.
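
Continuing the hypothetical types from the sketch above, the requeue step itself could look like the following; the real channelManager holds richer block and channel state, so this shows only the ordering logic:

// Block is a hypothetical stand-in for the queued L2 block payloads.
type Block struct{ /* opaque */ }

// requeueBlocks returns a new pending-blocks queue with the rebuilt
// channel's blocks reinserted at the front, preserving their original
// order ahead of any blocks that arrived after the channel was built.
func requeueBlocks(channelBlocks, pending []Block) []Block {
	out := make([]Block, 0, len(channelBlocks)+len(pending))
	out = append(out, channelBlocks...)
	return append(out, pending...)
}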

ChannelConfigProvider

It's important to note that the current ChannelConfigProvider has a single method to retrieve the next estimated channel config, and the implementation of the dynamic provider always assumes full channels. This is not optimal for the proposed change, as we will know the exact channel data at the point of submission, so we can make an exact fee calculation. To enable this, I suggest we add a second method to the ChannelConfigProvider to make its interface look similar to the following:

type ChannelConfigProvider interface {
	ChannelConfigFull() ChannelConfig // note: renamed old method that estimates full channels, would be used only at startup
	ChannelConfig(data []byte) ChannelConfig // new method, takes exact data
}

The new method can take as a parameter the exact data and use core.IntrinsicGas to exactly calculate the calldata gas in the dynamic provider, instead of this manual calculation:

// It would be nicer to use core.IntrinsicGas, but we don't have the actual data at hand
calldataBytes := dec.calldataConfig.MaxFrameSize + 1 // + 1 version byte
calldataGas := big.NewInt(int64(calldataBytes*randomByteCalldataGas + params.TxGas))

Otherwise, the existing cost comparison code in DynamicEthChannelConfig.ChannelConfig() should be reused between the two methods as much as possible.
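
With the exact data at hand, the dynamic provider could then compute the precise calldata gas roughly like this (a sketch assuming go-ethereum's core.IntrinsicGas signature at the time of writing; the fork flags would really come from the chain config):

import (
	"math/big"

	"github.com/ethereum/go-ethereum/core"
)

// exactCalldataGas computes the intrinsic gas of a calldata batcher tx for
// the exact frame data, replacing the worst-case per-byte estimate above.
func exactCalldataGas(data []byte) (*big.Int, error) {
	// No access list, not a contract creation; Homestead, EIP-2028 and
	// EIP-3860 rules assumed active.
	gas, err := core.IntrinsicGas(data, nil, false, true, true, true)
	if err != nil {
		return nil, err
	}
	return new(big.Int).SetUint64(gas), nil
}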

Channel Config Immutability

One important rule that should be observed is to never change the DA-type and channel configuration once the first frame of a channel has started being submitted. This is important because a) we might otherwise end up trying to replace a transaction with a different tx type, which is not allowed and would get the batcher into a blocked tx recovery mode, and b) in view of the upcoming Holocene rules around Strict Batch Ordering, it would invalidate future batcher transactions that might already be in the tx-pool due to parallel transaction sending.

Hardening the batcher's behavior during restarts should be done as a separate task, possibly introducing some form of persistence of in-flight transactions.

@sebastianst sebastianst added A-op-batcher Area: op-batcher T-protocol Team: Protocol labels Aug 27, 2024
@sebastianst sebastianst changed the title op-batcher: Reassess da-type choice when submitting a channel op-batcher: Assess da-type choice when submitting a channel Aug 27, 2024
@roberto-bayardo
Collaborator

I'm seeing a situation with our batcher that I think/hope this PR might fix:

We're configured to auto-switch, with max 2 parallel sends and 5 blobs per blob tx.

For over 30 minutes, I saw the batcher repeatedly alternate between trying to send a calldata tx and then a blob tx. Each tx kept failing with "AlreadyReserved", since it was being blocked from reaching the mempool by the previous one.

Usually when you get AlreadyReserved, a cancellation transaction is sent to clear out the mempool. But this wasn't happening: from what I gathered, the batcher was stuck doing this flip-flopping in the for {} loop in the batcher's publishStateToL1(). This call blocks the sender loop, so the cancellation traffic couldn't go out.

The situation eventually cleared on its own, but I'm not entirely sure how.

One fix would be to check the txpool state in publishStateToL1() to get it to abort if the transaction pool gets blocked.
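
A sketch of that fix; the state values and fields are modeled on this discussion, not the exact driver.go code:

import (
	"context"
	"sync/atomic"
)

const (
	txpoolGood int32 = iota
	txpoolBlocked // mempool rejected a tx with "AlreadyReserved"
)

// Hypothetical slice of the batcher's driver state.
type BatchSubmitter struct {
	txpoolState   atomic.Int32
	publishTxToL1 func(context.Context) error // existing send step
}

// publishStateToL1 drains pending tx data, but now aborts as soon as the
// txpool is flagged as blocked, so the main loop can send a cancellation
// tx instead of flip-flopping between da-types.
func (l *BatchSubmitter) publishStateToL1(ctx context.Context) {
	for {
		if l.txpoolState.Load() != txpoolGood {
			return // let the cancellation path take over
		}
		if err := l.publishTxToL1(ctx); err != nil {
			return // no more data, or the send failed
		}
	}
}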

@roberto-bayardo
Collaborator

While looking into making publishStateToL1() abort if the txpool gets blocked, I noticed what might be a race condition from this:

https://github.com/ethereum-optimism/optimism/blame/36f093a10da09496c3ef5a706cd494a4e2a9b9bd/op-batcher/batcher/driver.go#L317

Now that there are two vars being updated (txpool state and txpoolblockedblob), we probably want to use a mutex to update them both atomically?
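
Something like the following would make the two updates atomic; the field names come from the discussion, the enclosing struct is hypothetical:

import "sync"

const txpoolBlocked int32 = 1 // placeholder for the blocked-state value

// txPoolStatus guards both fields with one mutex, so readers never observe
// a blocked txpoolState paired with a stale txpoolBlockedBlob flag.
type txPoolStatus struct {
	mu                sync.Mutex
	txpoolState       int32 // e.g. good / blocked
	txpoolBlockedBlob bool  // whether the blocking tx was a blob tx
}

func (s *txPoolStatus) markBlocked(isBlobTx bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.txpoolState = txpoolBlocked
	s.txpoolBlockedBlob = isBlobTx
}

func (s *txPoolStatus) isBlocked() (blocked, byBlob bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.txpoolState == txpoolBlocked, s.txpoolBlockedBlob
}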

@sebastianst
Member Author

sebastianst commented Aug 29, 2024

Now that there are two vars being updated (txpool state and txpoolblockedblob), we probably want to use a mutex to update them both atomically?

Yeah, I was lazy when implementing this: I thought the single atomic state var should be sufficient, and deemed it extremely unlikely that we'd overwrite the blob state from the wrong routine.

I think a more fundamental aspect of the cancellation handling to change is that we need to send unblock transactions for a specific nonce. We currently just send a cancellation tx candidate and then let the txmgr choose the nonce, which doesn't make much sense for unsticking a wrong-type tx at a particular nonce, especially during times when regular switching between calldata and blobs is occurring. The cancellation code also needs to get better at stopping the cancellation if it becomes evident that another tx for that nonce already got confirmed.

For Holocene, I'd like to rework some of the batcher and txmgr behavior, which might actually simplify this part, because we should then attach a specific batch range to a nonce and try to submit that tx at all costs, due to the new strict batch ordering rules.

@roberto-bayardo
Collaborator

I think the txmgr PR I'm wrapping up should fix the issue of cancellation transactions getting stuck, along with this one I just sent, which will make the batcher send loop abort while cancellation is being handled.

Agree though the entire batcher/txmgr flow could use a rethink.

@geoknee geoknee moved this from Todo to In Progress in OP Stack Upgrades Sep 10, 2024
@geoknee geoknee linked a pull request Sep 17, 2024 that will close this issue
@geoknee geoknee moved this from In Progress to In-review in OP Stack Upgrades Sep 25, 2024
@github-project-automation github-project-automation bot moved this from In-review to Done in OP Stack Upgrades Sep 25, 2024
@geoknee
Contributor

geoknee commented Oct 2, 2024

We've seen the new behaviour play out on sepolia:

 t = 0            | Requeue and start using CALLDATA
dt = 11s          | Publish tx
dt = 3m           | Requeue and start using BLOBS
dt = 4m40s        | Requeue and start using CALLDATA
dt = 5s           | Publish tx
dt = 8s           | Publish tx
dt = 3m57s        | Requeue and start using BLOBS
dt = 4m36s        | Publish tx 

It is expected that switching from calldata to blobs results in a delay to transaction publishing, since the channel then has much more capacity and more space to fill with block data before being sent.

When switching from blobs to calldata, we only expect a short delay due to the processing time of requeueing blocks. It would be interesting to benchmark some worst-case numbers for this.

@geoknee
Contributor

geoknee commented Oct 2, 2024

I believe the Requeue method is only ever going to requeue the blocks from a single channel. This is because the batcher now only requeues blocks from channels which haven't started to be sent on chain yet, and we don't make a new channel (and dequeue the blocks for it) until we have started to send the previous one.

So the worst case seems to be when we have a single very large channel to requeue.

@geoknee
Contributor

geoknee commented Oct 2, 2024

Sketch of a benchmark here: https://github.com/ethereum-optimism/optimism/compare/gk/batcher-requeue-bench?expand=1

go test -benchmem -run=^$ -bench ^BenchmarkChannelManager_Requeue$ github.com/ethereum-optimism/optimism/op-batcher/batcher -v

goos: darwin
goarch: arm64
pkg: github.com/ethereum-optimism/optimism/op-batcher/batcher
BenchmarkChannelManager_Requeue
BenchmarkChannelManager_Requeue-14    	       4	 296146114 ns/op	74611654 B/op	   32528 allocs/op
PASS
ok  	github.com/ethereum-optimism/optimism/op-batcher/batcher	3.803s

It took on average 296ms to requeue 500 single-tx blocks and then add some back to a calldata channel and generate the first frame / tx data to send.

@sebastianst
Member Author

sebastianst commented Oct 2, 2024

Thanks for adding the benchmarks! The worst case is a chain with no transactions and a long channel duration. It would be great to parameterize the benchmark to also run it with (see the sketch below):

  • span batches
  • zero transactions per block
  • making sure we bump the L1 origin every 6 L2 blocks (not sure if this is guaranteed in the current random generator)
  • 30 * 60 * 5 = 9,000 blocks, i.e. 5h worth of blocks, which is what a batcher with a 5h channel duration would need to re-batch.
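
A sketch of how the benchmark could be parameterized along those axes; setupManagerForRequeue is a hypothetical stand-in for the setup utilities in the linked branch:

import "testing"

func BenchmarkChannelManager_Requeue(b *testing.B) {
	cases := []struct {
		name                  string
		numBlocks, txPerBlock int
		spanBatches           bool
	}{
		{"500blocks_1tx_singular", 500, 1, false},
		{"9000blocks_0tx_span", 30 * 60 * 5, 0, true}, // 5h of empty blocks
	}
	for _, tc := range cases {
		b.Run(tc.name, func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				b.StopTimer()
				// Hypothetical helper: builds a channelManager whose current
				// channel holds numBlocks blocks with txPerBlock txs each,
				// bumping the L1 origin every 6 L2 blocks.
				m := setupManagerForRequeue(tc.numBlocks, tc.txPerBlock, tc.spanBatches)
				b.StartTimer()
				m.Requeue(m.defaultCfg)
			}
		})
	}
}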
