
op-batcher: Assess da-type choice when submitting a channel #11609

Closed
sebastianst opened this issue Aug 27, 2024 · 8 comments · Fixed by #12002

@sebastianst
Member

sebastianst commented Aug 27, 2024

Context

Since #11219 the batcher can dynamically switch between calldata and blobs, depending on the economically best option at the time of starting a new channel. This works reasonably well for high-throughput chains, where the delay between starting and submitting a channel is usually only a few minutes. But it is still not optimal, as the most economical option might have changed while the channel was being built. Furthermore, for low-throughput chains this behavior is useless, because the da-type decision can be multiple hours stale by the time the channel is submitted. And lastly, the decision is currently made under the assumption that the channel will be filled to its size limit. But the channel may not be full, e.g. it may use fewer frames (= blobs) with a half-empty last frame because a short max-channel-duration setting closed it early (especially relevant on testnets).

Change Description

To address these shortcomings, the batcher should perform a second check at the time it would start submitting a channel, i.e., when posting the first batcher transaction for that channel, and possibly rebuild the channel using different da-type settings. (A channel almost always fits into a single batcher transaction, unless the first block added to it doesn't fit into the target-num-frames frames.)

The goal of this improvement is to always select the best da-type for a channel at the time of tx submission, and to make the dynamic da-type selection feature usable for low-throughput chains.

Implementation Hints

It probably makes most sense to move the channel config retrieval via the ChannelConfigProvider from the point at which a new channel is created (the cfg := s.cfgProvider.ChannelConfig() call) to the point at which the first frame of any channel is submitted (the places where we return non-empty txData inside channelManager.TxData()), and then reuse that config for the following channel. This has two advantages:

  1. This adds the check for the correct channel config at the point when submission happens. Most of the time this check will confirm the current configuration.
  2. The channel config of the previous channel is the best estimator for the correct channel config of the following channel, when the exact data content isn't known yet.

Note that a special case is a fresh batcher restart, at which point an estimated configuration has to be retrieved once before the first channel. But this could happen in the batcher's initialization path.
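
A minimal sketch of what this deferred check could look like. All type and field names below are hypothetical stand-ins for the real op-batcher structures, chosen only to illustrate the flow described above:

type ChannelConfig struct{ UseBlobs bool }

type ChannelConfigProvider interface {
	ChannelConfig(data []byte) ChannelConfig // proposed new method, see below
}

type channel struct {
	cfg            ChannelConfig
	pendingData    []byte // frame data not yet submitted
	frameSubmitted bool   // true once the first frame went out
}

type channelManager struct {
	cfgProvider ChannelConfigProvider
	defaultCfg  ChannelConfig // reused as estimate for the following channel
	current     *channel
}

// checkConfigBeforeSubmit runs just before the first frame of a channel is
// submitted: it re-evaluates the best config against the now-known exact
// data and reports whether the channel must be rebuilt.
func (m *channelManager) checkConfigBeforeSubmit() (rebuild bool) {
	if m.current.frameSubmitted {
		return false // config is immutable once submission has started
	}
	newCfg := m.cfgProvider.ChannelConfig(m.current.pendingData)
	m.defaultCfg = newCfg // also the best estimator for the next channel
	return newCfg.UseBlobs != m.current.cfg.UseBlobs
}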

Channel Rebuilding

If the check doesn't confirm the channel config that's currently in use, the channel should be rebuilt using the new config. One good option to accomplish this may be to reinsert the blocks into the L2 blocks queue inside the channelManager, so that the existing code path is used to build the channel with the new channel config. This has the advantage that, e.g., when switching from calldata to blobs, the channel suddenly has more space, so it wouldn't be immediately submitted but would continue to fill up regularly.
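
Continuing the hypothetical types from the sketch above, the requeue step itself could look like the following; the real channelManager holds richer block and channel state, so this shows only the ordering logic:

// Block is a hypothetical stand-in for the queued L2 block payloads.
type Block struct{ /* opaque */ }

// requeueBlocks returns a new pending-blocks queue with the rebuilt
// channel's blocks reinserted at the front, preserving their original
// order ahead of any blocks that arrived after the channel was built.
func requeueBlocks(channelBlocks, pending []Block) []Block {
	out := make([]Block, 0, len(channelBlocks)+len(pending))
	out = append(out, channelBlocks...)
	return append(out, pending...)
}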

ChannelConfigProvider

It's important to note that the current ChannelConfigProvider has a single method to retrieve the next estimated channel config, and the implementation of the dynamic provider always assumes full channels. This is not optimal for the proposed change, as we will know the exact channel data at the point of submission, so we can make an exact fee calculation. To enable this, I suggest we add a second method to the ChannelConfigProvider to make its interface look similar to the following:

type ChannelConfigProvider interface {
	ChannelConfigFull() ChannelConfig // note: renamed old method that estimates full channels, would be used only at startup
	ChannelConfig(data []byte) ChannelConfig // new method, takes exact data
}

The new method can take as a parameter the exact data and use core.IntrinsicGas to exactly calculate the calldata gas in the dynamic provider, instead of this manual calculation:

// It would be nicer to use core.IntrinsicGas, but we don't have the actual data at hand
calldataBytes := dec.calldataConfig.MaxFrameSize + 1 // + 1 version byte
calldataGas := big.NewInt(int64(calldataBytes*randomByteCalldataGas + params.TxGas))

Otherwise, the existing cost comparison code in DynamicEthChannelConfig.ChannelConfig() should be reused between the two methods as much as possible.
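
With the exact data at hand, the dynamic provider could then compute the precise calldata gas roughly like this (a sketch assuming go-ethereum's core.IntrinsicGas signature at the time of writing; the fork flags would really come from the chain config):

import (
	"math/big"

	"github.com/ethereum/go-ethereum/core"
)

// exactCalldataGas computes the intrinsic gas of a calldata batcher tx for
// the exact frame data, replacing the worst-case per-byte estimate above.
func exactCalldataGas(data []byte) (*big.Int, error) {
	// No access list, not a contract creation; Homestead, EIP-2028 and
	// EIP-3860 rules assumed active.
	gas, err := core.IntrinsicGas(data, nil, false, true, true, true)
	if err != nil {
		return nil, err
	}
	return new(big.Int).SetUint64(gas), nil
}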

Channel Config Immutability

One important rule that should be observed is to never change the DA-type and channel configuration once the first frame of a channel has started being submitted. This is important because a) we might otherwise end up trying to replace a transaction with a different tx type, which is not allowed and would get the batcher into a blocked tx recovery mode, and b) in view of the upcoming Holocene rules around Strict Batch Ordering, it would invalidate future batcher transactions that might already be in the tx-pool due to parallel transaction sending.

Hardening the batcher's behavior during restarts should be done as a separate task, possibly introducing some form of persistence of in-flight transactions.

@sebastianst sebastianst added A-op-batcher Area: op-batcher T-protocol Team: Protocol labels Aug 27, 2024
@sebastianst sebastianst changed the title op-batcher: Reassess da-type choice when submitting a channel op-batcher: Assess da-type choice when submitting a channel Aug 27, 2024
@roberto-bayardo
Collaborator

I'm seeing a situation with our batcher that I think/hope this PR might fix:

We're configured to auto-switch, with max 2 parallel sends and 5 blobs per blob tx.

For over 30 minutes, I saw the batcher repeatedly alternate between trying to send a calldata tx and then a blob tx. Each tx kept failing with "AlreadyReserved", since it was being blocked from reaching the mempool by the previous one.

Usually when you get AlreadyReserved, a cancellation transaction is sent to clear out the mempool. But this wasn't happening: from what I gathered, the batcher was stuck doing this flip-flopping in the for {} loop in the batcher's publishStateToL1(). This call blocks the sender loop, so the cancellation traffic couldn't go out.

The situation eventually cleared on its own, but I'm not entirely sure how.

One fix would be to check the txpool state in publishStateToL1() to get it to abort if the transaction pool gets blocked.
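
A sketch of that fix; the state values and fields are modeled on this discussion, not the exact driver.go code:

import (
	"context"
	"sync/atomic"
)

const (
	txpoolGood int32 = iota
	txpoolBlocked // mempool rejected a tx with "AlreadyReserved"
)

// Hypothetical slice of the batcher's driver state.
type BatchSubmitter struct {
	txpoolState   atomic.Int32
	publishTxToL1 func(context.Context) error // existing send step
}

// publishStateToL1 drains pending tx data, but now aborts as soon as the
// txpool is flagged as blocked, so the main loop can send a cancellation
// tx instead of flip-flopping between da-types.
func (l *BatchSubmitter) publishStateToL1(ctx context.Context) {
	for {
		if l.txpoolState.Load() != txpoolGood {
			return // let the cancellation path take over
		}
		if err := l.publishTxToL1(ctx); err != nil {
			return // no more data, or the send failed
		}
	}
}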

@roberto-bayardo
Collaborator

While looking into making publishStateToL1() abort if the txpool gets blocked, I noticed what might be a race condition from this:

https://github.com/ethereum-optimism/optimism/blame/36f093a10da09496c3ef5a706cd494a4e2a9b9bd/op-batcher/batcher/driver.go#L317

Now that there are two vars being updated (txpool state and txpoolblockedblob), we probably want to use a mutex to update them both atomically?
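
Something like the following would make the two updates atomic; the field names come from the discussion, the enclosing struct is hypothetical:

import "sync"

const txpoolBlocked int32 = 1 // placeholder for the blocked-state value

// txPoolStatus guards both fields with one mutex, so readers never observe
// a blocked txpoolState paired with a stale txpoolBlockedBlob flag.
type txPoolStatus struct {
	mu                sync.Mutex
	txpoolState       int32 // e.g. good / blocked
	txpoolBlockedBlob bool  // whether the blocking tx was a blob tx
}

func (s *txPoolStatus) markBlocked(isBlobTx bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.txpoolState = txpoolBlocked
	s.txpoolBlockedBlob = isBlobTx
}

func (s *txPoolStatus) isBlocked() (blocked, byBlob bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.txpoolState == txpoolBlocked, s.txpoolBlockedBlob
}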

@sebastianst
Member Author

sebastianst commented Aug 29, 2024

Now that there are two vars being updated (txpool state and txpoolblockedblob), we probably want to use a mutex to update them both atomically?

Yeah, I was lazy when implementing this: I thought the single atomic state var should be sufficient, and deemed it extremely unlikely that we'd overwrite the blob state from the wrong routine.

I think a more fundamental aspect of the cancellation handling to change is that we need to send unblock transactions for a specific nonce. We currently just send a cancellation tx candidate and then let the txmgr choose the nonce, which doesn't make much sense for unsticking a wrong-type tx at a particular nonce, especially during times when regular switching between calldata and blobs is occurring. The cancellation code also needs to get better at stopping the cancellation if it becomes evident that another tx for that nonce already got confirmed.

For Holocene, I'd like to rework some of the batcher and txmgr behavior, which might actually simplify this part, because we should then attach a specific batch range to a nonce and try to submit that tx at all costs, due to the new strict batch ordering rules.

@roberto-bayardo
Collaborator

I think the txmgr PR I'm wrapping up should fix the issue of cancellation transactions getting stuck, along with this one I just sent, which will make the batcher send loop abort while cancellation is being handled.

Agree though the entire batcher/txmgr flow could use a rethink.

@geoknee geoknee moved this from Todo to In Progress in OP Stack Upgrades Sep 10, 2024
@geoknee geoknee linked a pull request Sep 17, 2024 that will close this issue
@geoknee geoknee moved this from In Progress to In-review in OP Stack Upgrades Sep 25, 2024
@github-project-automation github-project-automation bot moved this from In-review to Done in OP Stack Upgrades Sep 25, 2024
@geoknee
Contributor

geoknee commented Oct 2, 2024

We've seen the new behaviour play out on sepolia:

 t = 0            | Requeue and start using CALLDATA
dt = 11s          | Publish tx
dt = 3m           | Requeue and start using BLOBS
dt = 4m40s        | Requeue and start using CALLDATA
dt = 5s           | Publish tx
dt = 8s           | Publish tx
dt = 3m57s        | Requeue and start using BLOBS
dt = 4m36s        | Publish tx 

It is expected that switching from calldata to blobs results in a delay to transaction publishing, since the channel then has much more capacity and more space to fill with block data before being sent.

When switching from blobs to calldata, we only expect a short delay due to the processing time of requeueing blocks. It would be interesting to benchmark some worst-case numbers for this.

@geoknee
Contributor

geoknee commented Oct 2, 2024

I believe the Requeue method is only ever going to requeue the blocks from a single channel. This is because the batcher now only requeues blocks from channels which haven't started to be sent on chain yet, and we don't make a new channel (and dequeue the blocks for it) until we have started to send the previous one.

So the worst case seems to be when we have a single very large channel to requeue.

@geoknee
Contributor

geoknee commented Oct 2, 2024

Sketch of a benchmark here: https://github.com/ethereum-optimism/optimism/compare/gk/batcher-requeue-bench?expand=1

go test -benchmem -run=^$ -bench ^BenchmarkChannelManager_Requeue$ github.com/ethereum-optimism/optimism/op-batcher/batcher -v

goos: darwin
goarch: arm64
pkg: github.com/ethereum-optimism/optimism/op-batcher/batcher
BenchmarkChannelManager_Requeue
BenchmarkChannelManager_Requeue-14    	       4	 296146114 ns/op	74611654 B/op	   32528 allocs/op
PASS
ok  	github.com/ethereum-optimism/optimism/op-batcher/batcher	3.803s

It took on average 296ms to requeue 500 single-tx blocks and then add some back to a calldata channel and generate the first frame / tx data to send.

@sebastianst
Member Author

sebastianst commented Oct 2, 2024

Thanks for adding the benchmarks! The worst case is a chain with no transactions and a long channel duration. It would be great to parameterize the benchmark to also run it with (see the sketch below):

  • span batches
  • zero transactions per block
  • making sure we bump the L1 origin every 6 L2 blocks (not sure if this is guaranteed in the current random generator)
  • 30 * 60 * 5 = 9,000 blocks, i.e. 5h worth of blocks, which is what a batcher with a 5h channel duration would need to re-batch.
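
A sketch of how the benchmark could be parameterized along those axes; setupManagerForRequeue is a hypothetical stand-in for the setup utilities in the linked branch:

import "testing"

func BenchmarkChannelManager_Requeue(b *testing.B) {
	cases := []struct {
		name                  string
		numBlocks, txPerBlock int
		spanBatches           bool
	}{
		{"500blocks_1tx_singular", 500, 1, false},
		{"9000blocks_0tx_span", 30 * 60 * 5, 0, true}, // 5h of empty blocks
	}
	for _, tc := range cases {
		b.Run(tc.name, func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				b.StopTimer()
				// Hypothetical helper: builds a channelManager whose current
				// channel holds numBlocks blocks with txPerBlock txs each,
				// bumping the L1 origin every 6 L2 blocks.
				m := setupManagerForRequeue(tc.numBlocks, tc.txPerBlock, tc.spanBatches)
				b.StartTimer()
				m.Requeue(m.defaultCfg)
			}
		})
	}
}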
