op-batcher: Assess da-type choice when submitting a channel #11609
Comments
I'm seeing a situation with our batcher that I think/hope this PR might fix: we're configured to auto-switch, with max 2 parallel sends and 5 blobs per blob tx. For over 30 minutes, I saw the batcher repeatedly alternate between trying to send a calldata tx and then a blob tx. Each tx kept failing with "AlreadyReserved", since it was being blocked from reaching the mempool by the previous one. Usually when you get AlreadyReserved, a cancellation transaction is sent to clear out the mempool, but this wasn't happening. From what I gathered, the batcher was stuck doing this flip-flopping in the `for {}` loop in the batcher's `publishStateToL1()`. This call blocks the sender loop, so the cancellation traffic couldn't go out. The situation eventually cleared on its own, but I'm not entirely sure how. One fix would be to check the txpool state in `publishStateToL1()` to get it to abort if the transaction pool gets blocked; a sketch of this follows below.
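A minimal sketch of that fix, assuming the batcher keeps an atomic txpool state (the type, field, and state names here are illustrative, not necessarily the exact ones in op-batcher):

```go
package batcher

import (
	"context"
	"errors"
	"io"
	"log"
	"sync/atomic"
)

// Illustrative txpool states; the real batcher tracks something similar.
const (
	TxpoolGood int32 = iota
	TxpoolBlocked
)

// BatchSubmitter is a minimal stand-in for the real batcher type.
type BatchSubmitter struct {
	txpoolState atomic.Int32
}

// publishTxToL1 stands in for the real method; it returns io.EOF when there
// is no more tx data to publish.
func (l *BatchSubmitter) publishTxToL1(ctx context.Context) error { return io.EOF }

// publishStateToL1 drains the pending state, but aborts as soon as the
// txpool gets blocked, so the sender loop is freed up to emit the
// cancellation transaction instead of being starved by this loop.
func (l *BatchSubmitter) publishStateToL1(ctx context.Context) {
	for {
		if l.txpoolState.Load() != TxpoolGood {
			log.Println("txpool blocked, aborting state publishing")
			return
		}
		if err := l.publishTxToL1(ctx); err != nil {
			if !errors.Is(err, io.EOF) {
				log.Println("error publishing tx to L1:", err)
			}
			return
		}
	}
}
```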
While looking into making `publishStateToL1()` abort if the txpool gets blocked, I noticed what might be a race condition: now that there are two vars being updated (the txpool state and `txpoolBlockedBlob`), we probably want to use a mutex to update them both atomically? A sketch of what I mean follows below.
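A sketch of the mutex approach, with illustrative names:

```go
package batcher

import "sync"

const (
	TxpoolGood int = iota
	TxpoolBlocked
)

// txpoolStatus groups the two related fields under one mutex so they are
// always read and written together.
type txpoolStatus struct {
	mu          sync.Mutex
	state       int
	blockedBlob bool // whether the blocking tx was a blob tx
}

// setBlocked records the blocked state and the type of the blocking tx in
// one critical section, avoiding the race between two separate writes.
func (s *txpoolStatus) setBlocked(blockedBlob bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.state = TxpoolBlocked
	s.blockedBlob = blockedBlob
}

// snapshot returns a consistent view of both fields.
func (s *txpoolStatus) snapshot() (state int, blockedBlob bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.state, s.blockedBlob
}
```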
Yeah I was lazy when implementing this and thought the single atomic state var management should be sufficient, and deemed it extremely unlikely that we'd overwrite the blob state from the wrong routine. I think a more fundamental aspect to change about the cancellation handling is that we need to send unblock transactions for a specific nonce. We currently just send a cancellation tx candidate and then let the txmgr choose the nonce, which doesn't make much sense for getting stuck txs of the wrong type at a particular nonce unstuck, especially during times where regular switching between calldata and blobs is occurring. The cancellation code also needs to get better at stopping the cancellation if it becomes evident that another tx for that nonce already got confirmed. For Holocene, I'd like to rework some of the batcher and txmgr behavior in a way that might actually simplify this part, because we should then attach a specific batch range to a nonce and try to submit that tx at all costs, because of the new strict batch ordering rules.
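To illustrate the nonce-pinning idea only, a sketch using go-ethereum types (the actual fix would go through the txmgr's candidate flow, which today lets the txmgr choose the nonce rather than the caller):

```go
package batcher

import (
	"math/big"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/types"
)

// cancellationTxForNonce sketches a nonce-pinned unblock tx: a zero-value
// self-send that reuses the stuck nonce (with bumped fee caps) so it
// displaces the wrong-type tx sitting in the pool at that nonce.
func cancellationTxForNonce(from common.Address, nonce uint64, tipCap, feeCap, chainID *big.Int) *types.Transaction {
	return types.NewTx(&types.DynamicFeeTx{
		ChainID:   chainID,
		Nonce:     nonce, // pin the exact nonce of the stuck tx
		GasTipCap: tipCap,
		GasFeeCap: feeCap,
		Gas:       21000, // plain value transfer
		To:        &from, // zero-value self-send
		Value:     big.NewInt(0),
	})
}
```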
We've seen the new behaviour play out on sepolia:
It is expected that switching from calldata to blobs results in a delay to transaction publishing, since the channel then has much more capacity and more space to fill with block data before being sent. When switching from blobs to calldata, we only expect a short delay due to the processing time of requeueing blocks. It would be interesting to benchmark some worst-case numbers for this.
I believe the worst case is when we have a single very large channel to requeue.
Sketch of a benchmark here https://github.com/ethereum-optimism/optimism/compare/gk/batcher-requeue-bench?expand=1:
It took on average 296ms to requeue 500 single-tx blocks and then add some back to a calldata channel and generate the first frame / tx data to send.
Thanks for adding the benchmarks! The worst case is a chain with no transactions and a long channel duration. Would be great to parameterize the benchmark to also run it with empty blocks; one possible shape is sketched below.
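One possible shape, as a table-driven Go benchmark (a sketch; the case values and setup are placeholders, not the code on the `gk/batcher-requeue-bench` branch):

```go
package batcher

import (
	"fmt"
	"testing"
)

// BenchmarkRequeue sketches a parameterized version of the requeue
// benchmark: vary block count and txs per block, covering the
// empty-chain/long-channel-duration worst case.
func BenchmarkRequeue(b *testing.B) {
	cases := []struct{ numBlocks, txsPerBlock int }{
		{500, 1},  // the measured case: 500 single-tx blocks
		{500, 0},  // empty blocks
		{5000, 0}, // long channel duration on an empty chain
	}
	for _, tc := range cases {
		b.Run(fmt.Sprintf("blocks=%d,txs=%d", tc.numBlocks, tc.txsPerBlock), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				// 1. build a channel from tc.numBlocks blocks with
				//    tc.txsPerBlock txs each
				// 2. requeue the blocks and produce the first calldata
				//    frame / tx data
			}
		})
	}
}
```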
Context
Since #11219 the batcher can dynamically switch between calldata and blobs, depending on the economically best option at the time of starting a new channel. This works reasonably well for high-throughput chains, where the delay between starting and submitting a channel is usually only a few minutes. But it is still not optimal, as the most economical option might have changed while the channel was being built up. Furthermore, for low-throughput chains this behavior is of little use, as the da-type decision may precede submission by multiple hours. And lastly, the decision is currently made under the assumption that the channel will be full to its size limit. But the channel may not be full, e.g. it may use fewer frames (= blobs) with a half-empty last frame because a short max-channel-duration setting closed it (especially relevant on testnets).
Change Description
To address these shortcomings, the batcher should perform a second check at the time it would start submitting a channel, i.e., when starting to post the first batcher transaction for that channel, and possibly rebuild the channel using different da-type settings. (A channel almost always fits into a single batcher transaction, unless the first block added to it doesn't fit into the `target-num-frames` frames.) The goal of this improvement is to always select the best da-type for a channel at the time of tx submission, and to make the dynamic da-type selection feature usable for low-throughput chains.
Implementation Hints
It probably makes most sense to move the channel config retrieval via the `ChannelConfigProvider` from the point at which a new channel is created (`optimism/op-batcher/batcher/channel_manager.go`, line 206 in `a36291d`) to the point at which `txData` is requested inside `channelManager.TxData()`, and then reuse that config for the following channel. This has two advantages. A rough sketch of this placement follows below.

Note that a special case is a fresh batcher restart, at which point an estimated configuration has to be retrieved once before the first channel. But this could happen in the batcher's initialization path.
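A rough sketch of that placement, with pared-down stand-in types (the real `channelManager.TxData()` has a different signature; the `cfgAssessed` flag is my assumption):

```go
package batcher

// Minimal stand-ins for the real types.
type ChannelConfig struct{ UseBlobs bool }

type ChannelConfigProvider interface {
	ChannelConfig() ChannelConfig
}

type channelManager struct {
	cfgProvider ChannelConfigProvider
	channelCfg  ChannelConfig
	cfgAssessed bool // was the da-type assessed for the current channel?
}

// TxData sketch: the da-type is (re)assessed when tx data for a channel is
// first requested, i.e. right before submission, rather than at channel
// creation time. Error handling and frame production are elided.
func (s *channelManager) TxData() ([]byte, error) {
	if !s.cfgAssessed {
		if newCfg := s.cfgProvider.ChannelConfig(); newCfg.UseBlobs != s.channelCfg.UseBlobs {
			s.channelCfg = newCfg
			// rebuild the channel with newCfg, see Channel Rebuilding below
		}
		s.cfgAssessed = true
	}
	// ... existing logic: produce the next frame / tx data ...
	return nil, nil
}
```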
Channel Rebuilding
If the check doesn't confirm the channel config that's currently in use, the channel should be rebuilt using the new config. One good option to accomplish this may be to reinsert the blocks into the L2 blocks queue inside the `channelManager`, so that the existing code path is then used to build the channel with the new channel config. This has the advantage that, e.g., if switching from calldata to blobs, the channel would suddenly have more space, so it wouldn't be immediately submitted and would just continue to fill up regularly. A sketch of such a requeue follows below.
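A sketch of such a requeue, with types pared down to the essentials (field names are assumptions):

```go
package batcher

// Minimal stand-ins for the real types.
type block struct{} // an L2 block pending submission

type channel struct{ blocks []block }

type channelManager struct {
	blocks         []block // queue of L2 blocks not yet in a channel
	currentChannel *channel
}

// requeueCurrentChannel pushes the pending channel's blocks back to the
// front of the L2 blocks queue (preserving their order, as required for
// batch ordering) and drops the channel, so the regular build path
// recreates it with the new channel config.
func (s *channelManager) requeueCurrentChannel() {
	if s.currentChannel == nil {
		return
	}
	s.blocks = append(s.currentChannel.blocks, s.blocks...)
	s.currentChannel = nil
}
```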
ChannelConfigProvider
It's important to note that the current `ChannelConfigProvider` has a single method to retrieve the next estimated channel config, and the implementation of the dynamic provider always assumes full channels. This is not optimal for the proposed change, as we will know the exact channel data at the point of submission, so we can make an exact fee calculation. To enable this, I suggest we add a second method to the `ChannelConfigProvider` to make its interface look similar to the sketch below. The new method can take the exact data as a parameter and use `core.IntrinsicGas` to calculate the calldata gas exactly in the dynamic provider, instead of the manual calculation at `optimism/op-batcher/batcher/channel_config_provider.go`, lines 66 to 68 in `a36291d`.
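Since the embedded snippet isn't preserved above, here is a sketch of what the extended interface could look like; the second method's name and exact signature are my assumptions:

```go
package batcher

import (
	"github.com/ethereum/go-ethereum/core"
)

type ChannelConfig struct{ UseBlobs bool } // stand-in for the real struct

// Extended provider interface. ChannelConfig is the existing estimate-based
// method; ChannelConfigForData is the proposed addition.
type ChannelConfigProvider interface {
	// ChannelConfig estimates the best config for a full future channel.
	ChannelConfig() ChannelConfig
	// ChannelConfigForData picks the most economical config for concrete
	// channel data that is about to be submitted.
	ChannelConfigForData(data []byte) ChannelConfig
}

// exactCalldataGas shows the suggested use of core.IntrinsicGas to price
// the concrete data exactly (go-ethereum signature as of mid-2024: data,
// access list, contract creation, Homestead, EIP-2028, EIP-3860).
func exactCalldataGas(data []byte) (uint64, error) {
	return core.IntrinsicGas(data, nil, false, true, true, true)
}
```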
Otherwise, the existing cost comparison code at `DynamicEthChannelConfig.ChannelConfig()` should be reused between the two methods as much as possible.

Channel Config Immutability
One important rule that should be observed is to not change the da-type and channel configuration once the first frame of a channel has started being submitted. This is important a) because otherwise we might end up trying to replace a transaction with a different tx type, which is not allowed and would get the batcher into a blocked-tx recovery mode, and b) because, in view of the upcoming Holocene Strict Batch Ordering features, it would invalidate future batcher transactions that might already be in the tx-pool due to parallel transaction sending. A minimal guard expressing this rule is sketched below.
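A tiny guard expressing this rule (names are illustrative):

```go
package batcher

type ChannelConfig struct{ UseBlobs bool } // stand-in for the real struct

type channel struct {
	cfg             ChannelConfig
	framesSubmitted int
}

// trySwitchConfig only allows a config (and hence da/tx type) change while
// no frame of the channel has been submitted yet.
func (c *channel) trySwitchConfig(newCfg ChannelConfig) bool {
	if c.framesSubmitted > 0 {
		return false // first frame is out: config is frozen
	}
	c.cfg = newCfg
	return true
}
```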
Hardening the batcher's behavior during restarts should be done as a separate task, possibly introducing some form of persistence of in-flight transactions.