add fallback request for req-response protocols #2771

alindima · 2023-12-21T09:18:04Z

Previously, it was only possible to retry the same request on a different protocol name that had the exact same binary payloads.

Introduce a way of trying a different request on a different protocol if the first one fails with Unsupported protocol.

This helps with adding new req-response versions in polkadot while preserving compatibility with unupgraded nodes.

The way req-response protocols were bumped previously was that they were bundled with some other notifications protocol upgrade, like for async backing (but that is more complicated, especially if the feature does not require any changes to a notifications protocol). Will be needed for implementing polkadot-fellows/RFCs#47

TODO:

add tests
add guidance docs in polkadot about req-response protocol versioning

Previously, it was only possible to retry the same request on a different protocol name that had the exact same binary protocol. Introduce a way of trying a different request on a different protocol if the first one fails with Unsupported protocol. This helps with adding new req-response versions in polkadot.

alindima · 2023-12-21T15:54:58Z

I suggest reviewing commit-by-commit. The commit with hash 51d68e7da301f510b0f75e4907aa6e6e240ef2b9 has many uninteresting changes to test code

alexggh

Left you some comments, overall the approach looks sane to me, I'll let the networking people tear it apart :D.

alexggh · 2023-12-22T07:55:59Z

substrate/client/network/src/request_responses.rs

+								}) => {
+									// Try using the fallback request if the protocol was not
+									// supported.
+									if let OutboundFailure::UnsupportedProtocols = error {


I've got a few questions about the flow here:

Does this implies that for every request we have to always go to the other side twice or would it just fail early in lib-p2p before reaching the other side.

Is sending an unsupported protocol request side effect free, would the node risk being disconnected because it is sending unsupported protocols ?

Does this implies that for every request we have to always go to the other side twice or would it just fail early in lib-p2p before reaching the other side.

It means making two requests to the other peer. I see there's a way of requesting the SupportedProtocols in libp2p from a peer, but I don't see it being used anywhere.

I think it's fine because:

we can cache in the interested subsystem the protocol version that a validator uses, to minimise requests.

since it's encouraged that validators upgrade to the latest version, it shouldn't be long before the v2 request will succeed on most cases (hopefully).

this is what is already hapenning for the fallback_names in ProtocolConfig (but at a lower level, in libp2p)

Is sending an unsupported protocol request side effect free, would the node risk being disconnected because it is sending unsupported protocols ?

AFAIK yes, to quote a comment from rust-libp2p:

// The remote merely doesn't support the protocol(s) we requested.
// This is no reason to close the connection, which may
// successfully communicate with other protocols already.
// An event is reported to permit user code to react to the fact that
// the remote peer does not support the requested protocol(s).

this is what is already hapenning for the fallback_names in ProtocolConfig (but at a lower level, in libp2p)

I think in this case no extra round-trip happens: when opening a new substream for a request, all protocol names (we just call them fallback, but they are equivalent to the main protocol name from libp2p perspective) are sent on the wire and compared to the list of supported protocols on the remote.

As for point 2, looking at the code, the substrate code is not made aware of the request attempts on the unsupported protocol, so can't reduce the peer's reputation — it's all handled inside libp2p.

I think in this case no extra round-trip happens: when opening a new substream for a request, all protocol names (we just call them fallback, but they are equivalent to the main protocol name from libp2p perspective) are sent on the wire and compared to the list of supported protocols on the remote.

According to https://github.com/libp2p/specs/blob/master/connections/README.md#protocol-negotiation and the code in rust-libp2p: https://github.com/libp2p/rust-libp2p/blob/b6bb02b9305b56ed2a4e2ff44b510fa84d8d7401/misc/multistream-select/src/dialer_select.rs#L192,
this isn't true. Each protocol in the list is tried one by one

Yeah, you are right. Disregard my comment please 🙈

dmitry-markin

Looks good to me, but I would wait also for review from @altonen as I'm not super familiar with this part of the codebase.

substrate/client/network/src/request_responses.rs

altonen

I think it would be cleaner to store the remote protocols received in an Identify response to Peerstore and then querying those protocols when sending a request to see what they support. That way the request-response code wouldn't require any modifications

sandreim

Nice, this was indeed needed and the approach seems reasonable. The way this was done before was indeed coupling to the notification protocols versioning which is in a way hacky.

I also would want to suggest a better solution for handling protocol versioning in general that relies on capabilities being exchanged between nodes during handshake. The capabilities should represent all the notification and request protocols the node can talk. We'd make this information available in NotificationEvent::NotificationStreamOpened and this makes the higher level code (subsystems) more reasonable wrt to which protocol to use when talking to specific nodes.

@altonen @dmitry-markin WDYT ?

altonen · 2024-01-04T13:26:59Z

These capabilities are exchanged in libp2p's Identify messages and I think we should use that information instead of having a fallback system for the request-response implementation.

I don't like the idea of supplying capabilities in NotificationStreamOpened though because libp2p Identify and our notification protocol are completely independent systems and conceptually the notification protocol should only deal with notifications. I'd rather not see these two protocols getting coupled together but if there's no other sensible way of achieving what we want to do here then I guess we have to do it.

I think what could work quite well would be introducing PeerIdentified event for the event stream system that would report all identified peers and their capabilities to all subsystems that have subscribed. NetworkBridge could then listen to incoming substreams from peers but would not emit NetworkBridgeEvent::PeerConnected until the peer had been identified.

Now that we have custom handshakes and substream validation available in each protocol, we could try and take advantage of that system and advertise the relevant capabilities in the handshake. This is not backwards-compatible though but for any future breaking changes in the networking protocols, it's good to keep in mind that the handshake system is also available now.

alindima · 2024-01-08T10:19:42Z

Thanks for the suggestions, I'll explore them a bit

alindima · 2024-01-08T13:11:23Z

I think it would be cleaner to store the remote protocols received in an Identify response to Peerstore and then querying those protocols when sending a request to see what they support. That way the request-response code wouldn't require any modifications

Agreed that it would be cleaner, this would also spare a second request in case the newer protocol version is not supported.

I do see a problem with this however. The req-response behaviour does not require that the Identify message was exchanged previously AFAICT (they are network protocols that can run concurrently). This means that when making a request we may not know the supported protocols of the peer. We may not even be connected to the peer. Also, the PeerStore is more like a cache. It prunes old records, which would also contribute to inconsistent views of the supported protocols of the peer.

We can be conservative on the polkadot side and always use the older protocol version if we don't have access to the supported protocols, but that defeats the purpose of the feature IMO if it happens often enough.

altonen · 2024-01-08T13:40:58Z

We may not even be connected to the peer

This is a good point, I had forgotten we use this behavior in Polkadot. Manually instructing sc-network to first dial the peer, waiting for Identify exchange to happen and only then sending the request sounds like a lot of hassle. Granted the supported protocols are available soon the request-response code has dialed the peer but at least right now there is no easy way of postponing the request until Identify has completed. Maybe something we could look in the future but I think fallback requests are a cleaner way of achieving the goal of this PR.

alindima · 2024-01-09T09:00:00Z

Maybe something we could look in the future but I think fallback requests are a cleaner way of achieving the goal of this PR.

I think we all agree now that what this PR does is a reasonable approach, right?

However, I did hit a roadblock now when testing it with a real request in polkadot. In sc-network, we override the substream upgrade protocol (multistream-select) to be V1Lazy, for the entirety of the network service. However, V1Lazy only works when there is one protocol. Otherwise, the connection is closed and we don't get a chance to do a fallback request.

The problem is that libp2p does not provide a way of setting the substream upgrade protocol version on a per-request basis (not even on a per-behaviour basis for that matter). It's a single config (which IMO is obviously not fine), so we'd have to use V1 instead for this PR to work (which would mean an extra round-trip for every single protocol negotiation).

I could open an issue and maybe a PR on rust-libp2p for adding support for this, but seeing the super tedious process of updating libp2p in substrate, (#1631 as an example), I'm thinking whether this is at all a good idea. Maybe we can get the maintainers to backport it as a patch to the version we're currently using (although not sure that's possible since it'll add a new API).
I'll see what I can come up with.

@altonen let me know if you have a suggestion

alindima · 2024-01-09T09:05:10Z

The alternative would be to perform the fallback request at a higher level, in one of the polkadot subsystems, instead of in sc-network.
This would complicate the polkadot code a bit (particularly the availability-recovery, which is the first one that needs it), and it would scale as more features need req-response protocol upgrades. Which is why I tried embedding it in substrate in the first place

altonen · 2024-01-09T09:40:28Z

I don't think V1 is a major issue and FWIW, #1631 switches from V1Lazy to V1 because of a bug in the negotiation that halted syncing while I was running a burn-in for that PR a while ago. So whenever that PR is merged, we'd be switching to V1 for a while anyway until the underlying issue is fixed.

If you're against using V1 then maybe the higher-level fallback request might be a good idea but is there no way it could be done in sc-network without having to expose the details to Polkadot? Like if it fails to negotiate the substream with V1Lazy, it would catch that error in sc-network and if the original request had a fallback request attached to it, it would try to negotiate again with the fallback name and if that also fails, only then is the error reported to Polkadot?

alindima · 2024-01-09T13:10:50Z

I don't think V1 is a major issue and FWIW, #1631 switches from V1Lazy to V1 because of a bug in the negotiation that halted syncing while I was running a burn-in for that PR a while ago. So whenever that PR is merged, we'd be switching to V1 for a while anyway until the underlying issue is fixed.

Thanks, didn't know that!

If you're against using V1 then maybe the higher-level fallback request might be a good idea

I'm not against using V1 for now. Considering that it'll be needed for your linked PR, I think it's a reasonable compromise until we get support for handling it in a better way in libp2p. I opened this issue, where I also posted an alternative fix in libp2p.

but is there no way it could be done in sc-network without having to expose the details to Polkadot? Like if it fails to negotiate the substream with V1Lazy, it would catch that error in sc-network and if the original request had a fallback request attached to it, it would try to negotiate again with the fallback name and if that also fails, only then is the error reported to Polkadot?

it's not doable without a change in libp2p AFAICT. The OutboundUpgrade impl of RequestProtocol returns a plain std::io::Error. We can't decide whether it's a fatal error or an unsupported protocol. Now that I think of it some more, I don't think it's doable in polkadot either. For this same reason, we'll get a ConnectionClosed error instead of an UnsupportedProtocols error.

I suggest we move forward with this PR. This PR will not switch to using V1 just yet, because there are no requests that need this fallback yet. It'll be needed in #1644 (until it's merged, maybe we'll have a fix in libp2p or your upgrade PR will switch to using V1 anyway).

…se-fallback

Previously, it was only possible to retry the same request on a different protocol name that had the exact same binary payloads. Introduce a way of trying a different request on a different protocol if the first one fails with Unsupported protocol. This helps with adding new req-response versions in polkadot while preserving compatibility with unupgraded nodes. The way req-response protocols were bumped previously was that they were bundled with some other notifications protocol upgrade, like for async backing (but that is more complicated, especially if the feature does not require any changes to a notifications protocol). Will be needed for implementing polkadot-fellows/RFCs#47 TODO: - [x] add tests - [x] add guidance docs in polkadot about req-response protocol versioning

**Don't look at the commit history, it's confusing, as this branch is based on another branch that was merged** Fixes #598 Also implements [RFC #47](polkadot-fellows/RFCs#47) ## Description - Availability-recovery now first attempts to request the systematic chunks for large POVs (which are the first ~n/3 chunks, which can recover the full data without doing the costly reed-solomon decoding process). This has a fallback of recovering from all chunks, if for some reason the process fails. Additionally, backers are also used as a backup for requesting the systematic chunks if the assigned validator is not offering the chunk (each backer is only used for one systematic chunk, to not overload them). - Quite obviously, recovering from systematic chunks is much faster than recovering from regular chunks (4000% faster as measured on my apple M2 Pro). - Introduces a `ValidatorIndex` -> `ChunkIndex` mapping which is different for every core, in order to avoid only querying the first n/3 validators over and over again in the same session. The mapping is the one described in RFC 47. - The mapping is feature-gated by the [NodeFeatures runtime API](#2177) so that it can only be enabled via a governance call once a sufficient majority of validators have upgraded their client. If the feature is not enabled, the mapping will be the identity mapping and backwards-compatibility will be preserved. - Adds a new chunk request protocol version (v2), which adds the ChunkIndex to the response. This may or may not be checked against the expected chunk index. For av-distribution and systematic recovery, this will be checked, but for regular recovery, no. This is backwards compatible. First, a v2 request is attempted. If that fails during protocol negotiation, v1 is used. - Systematic recovery is only attempted during approval-voting, where we have easy access to the core_index. For disputes and collator pov_recovery, regular chunk requests are used, just as before. ## Performance results Some results from subsystem-bench: with regular chunk recovery: CPU usage per block 39.82s with recovery from backers: CPU usage per block 16.03s with systematic recovery: CPU usage per block 19.07s End-to-end results here: #598 (comment) #### TODO: - [x] [RFC #47](polkadot-fellows/RFCs#47) - [x] merge #2177 and rebase on top of those changes - [x] merge #2771 and rebase - [x] add tests - [x] preliminary performance measure on Versi: see #598 (comment) - [x] Rewrite the implementer's guide documentation - [x] #3065 - [x] paritytech/zombienet#1705 and fix zombienet tests - [x] security audit - [x] final versi test and performance measure --------- Signed-off-by: alindima <[email protected]> Co-authored-by: Javier Viola <[email protected]>

**Don't look at the commit history, it's confusing, as this branch is based on another branch that was merged** Fixes paritytech#598 Also implements [RFC paritytech#47](polkadot-fellows/RFCs#47) ## Description - Availability-recovery now first attempts to request the systematic chunks for large POVs (which are the first ~n/3 chunks, which can recover the full data without doing the costly reed-solomon decoding process). This has a fallback of recovering from all chunks, if for some reason the process fails. Additionally, backers are also used as a backup for requesting the systematic chunks if the assigned validator is not offering the chunk (each backer is only used for one systematic chunk, to not overload them). - Quite obviously, recovering from systematic chunks is much faster than recovering from regular chunks (4000% faster as measured on my apple M2 Pro). - Introduces a `ValidatorIndex` -> `ChunkIndex` mapping which is different for every core, in order to avoid only querying the first n/3 validators over and over again in the same session. The mapping is the one described in RFC 47. - The mapping is feature-gated by the [NodeFeatures runtime API](paritytech#2177) so that it can only be enabled via a governance call once a sufficient majority of validators have upgraded their client. If the feature is not enabled, the mapping will be the identity mapping and backwards-compatibility will be preserved. - Adds a new chunk request protocol version (v2), which adds the ChunkIndex to the response. This may or may not be checked against the expected chunk index. For av-distribution and systematic recovery, this will be checked, but for regular recovery, no. This is backwards compatible. First, a v2 request is attempted. If that fails during protocol negotiation, v1 is used. - Systematic recovery is only attempted during approval-voting, where we have easy access to the core_index. For disputes and collator pov_recovery, regular chunk requests are used, just as before. ## Performance results Some results from subsystem-bench: with regular chunk recovery: CPU usage per block 39.82s with recovery from backers: CPU usage per block 16.03s with systematic recovery: CPU usage per block 19.07s End-to-end results here: paritytech#598 (comment) #### TODO: - [x] [RFC paritytech#47](polkadot-fellows/RFCs#47) - [x] merge paritytech#2177 and rebase on top of those changes - [x] merge paritytech#2771 and rebase - [x] add tests - [x] preliminary performance measure on Versi: see paritytech#598 (comment) - [x] Rewrite the implementer's guide documentation - [x] paritytech#3065 - [x] paritytech/zombienet#1705 and fix zombienet tests - [x] security audit - [x] final versi test and performance measure --------- Signed-off-by: alindima <[email protected]> Co-authored-by: Javier Viola <[email protected]>

alindima added 5 commits December 21, 2023 11:08

fix subsystem-bench

7649e89

fix req-resp tests and add more tests

6465b19

add polkadot req-resp versioning docs

2ed8e36

add prdoc

637f3a2

alindima added T0-node This PR/Issue is related to the topic “node”. I5-enhancement An additional feature request. labels Dec 21, 2023

alindima requested a review from a team December 21, 2023 14:10

fix clippy

7f05dcd

alindima requested a review from alexggh December 21, 2023 14:17

modify tests

51d68e7

alexggh reviewed Dec 22, 2023

View reviewed changes

dmitry-markin approved these changes Dec 22, 2023

View reviewed changes

substrate/client/network/src/request_responses.rs Outdated Show resolved Hide resolved

substrate/client/network/src/request_responses.rs Show resolved Hide resolved

dmitry-markin requested a review from altonen December 22, 2023 14:33

altonen reviewed Jan 2, 2024

View reviewed changes

sandreim approved these changes Jan 4, 2024

View reviewed changes

alindima mentioned this pull request Jan 9, 2024

Finer-grained multistream-select upgrade protocol version selection libp2p/rust-libp2p#5074

Closed

altonen approved these changes Jan 10, 2024

View reviewed changes

alindima added 3 commits January 10, 2024 13:31

address some review comments

e34d874

comment out new_with_fallback

06cb631

Merge remote-tracking branch 'origin/master' into alindima/req-respon…

4bdcc7a

…se-fallback

fix clippy

ad8c4d4

alindima merged commit f2a750e into master Jan 10, 2024
122 checks passed

alindima deleted the alindima/req-response-fallback branch January 10, 2024 13:19

kamilsa mentioned this pull request Jan 17, 2024

Refactor req-response protocols qdrvm/kagome#1935

Open

alindima mentioned this pull request Jan 18, 2024

Add availability-recovery from systematic chunks #1644

Merged

10 tasks

nazar-pc mentioned this pull request Jan 18, 2024

Upgrade Substrate autonomys/subspace#2424

Merged

1 task

github-actions bot mentioned this pull request Feb 19, 2024

Update substrate/polkadot/cumulus from v1.3.0 to v1.6.0 moondance-labs/tanssi#419

Closed

github-actions bot mentioned this pull request Mar 13, 2024

Update polkadot-sdk from v1.3.0 to v1.7.2 moonbeam-foundation/moonbeam#2703

Closed

alindima mentioned this pull request Aug 16, 2024

Compress the State Response Message in State Sync polkadot-fellows/RFCs#112

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add fallback request for req-response protocols #2771

add fallback request for req-response protocols #2771

alindima commented Dec 21, 2023 •

edited

Loading

alindima commented Dec 21, 2023

alexggh left a comment

alexggh Dec 22, 2023

alindima Dec 22, 2023

dmitry-markin Dec 22, 2023

alindima Dec 22, 2023

dmitry-markin Dec 22, 2023

dmitry-markin left a comment

altonen left a comment

sandreim left a comment

altonen commented Jan 4, 2024

alindima commented Jan 8, 2024

alindima commented Jan 8, 2024

altonen commented Jan 8, 2024

alindima commented Jan 9, 2024

alindima commented Jan 9, 2024

altonen commented Jan 9, 2024

alindima commented Jan 9, 2024

add fallback request for req-response protocols #2771

add fallback request for req-response protocols #2771

Conversation

alindima commented Dec 21, 2023 • edited Loading

alindima commented Dec 21, 2023

alexggh left a comment

Choose a reason for hiding this comment

alexggh Dec 22, 2023

Choose a reason for hiding this comment

alindima Dec 22, 2023

Choose a reason for hiding this comment

dmitry-markin Dec 22, 2023

Choose a reason for hiding this comment

alindima Dec 22, 2023

Choose a reason for hiding this comment

dmitry-markin Dec 22, 2023

Choose a reason for hiding this comment

dmitry-markin left a comment

Choose a reason for hiding this comment

altonen left a comment

Choose a reason for hiding this comment

sandreim left a comment

Choose a reason for hiding this comment

altonen commented Jan 4, 2024

alindima commented Jan 8, 2024

alindima commented Jan 8, 2024

altonen commented Jan 8, 2024

alindima commented Jan 9, 2024

alindima commented Jan 9, 2024

altonen commented Jan 9, 2024

alindima commented Jan 9, 2024

alindima commented Dec 21, 2023 •

edited

Loading