Upgrade libp2p from 0.52.4 to 0.54.1 #6248

Merged: 21 commits into paritytech:master on Dec 16, 2024

Contributor

@nazar-pc nazar-pc commented Oct 27, 2024

Description

Fixes #5996

https://github.com/libp2p/rust-libp2p/releases/tag/libp2p-v0.53.0
https://github.com/libp2p/rust-libp2p/blob/master/CHANGELOG.md

Integration

Nothing special is needed. Just note that yamux_window_size no longer applies to libp2p (litep2p still seems to have it, though).

Review Notes

libp2p 0.53 includes a few simplifications and improvements to the swarm interface. I'll list the key ones that apply here.

libp2p/rust-libp2p#4788 removed the write_length_prefixed function, so I inlined its code instead.
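
To illustrate what inlining a length-prefixed write amounts to, here is a minimal sketch: an unsigned varint length followed by the payload. It is a synchronous, std-only stand-in, whereas the real networking code is asynchronous. The names `encode_varint` and `write_length_prefixed_sync` are hypothetical and are not taken from this PR or from libp2p.

```rust
use std::io::{self, Write};

/// Encodes `value` as an unsigned varint (LEB128): 7 bits per byte,
/// least significant group first, high bit set on every byte but the last.
fn encode_varint(mut value: usize, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

/// Hypothetical synchronous stand-in for the removed helper:
/// writes the varint-encoded length of `data`, then `data` itself.
fn write_length_prefixed_sync(writer: &mut impl Write, data: &[u8]) -> io::Result<()> {
    let mut prefix = Vec::new();
    encode_varint(data.len(), &mut prefix);
    writer.write_all(&prefix)?;
    writer.write_all(data)?;
    Ok(())
}
```

Writing the three-byte payload `abc` into a `Vec<u8>` with this sketch yields a one-byte prefix with the value 3, followed by the payload bytes.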

libp2p/rust-libp2p#4120 introduced the new libp2p::SwarmBuilder to replace the now-deprecated libp2p::swarm::SwarmBuilder. The transition is straightforward and quite ergonomic (as can be seen in the tests).

libp2p/rust-libp2p#4581 is the most annoying change I have seen: it makes many enums #[non_exhaustive]. I mapped the variants I could. For those that couldn't be mapped, I print a log message when they are hit (the best solution I could come up with, at least with stable Rust).
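
The pattern of mapping known variants and logging the rest can be sketched as follows. Both enums and the `map_error` function are hypothetical stand-ins, not types from libp2p or Substrate, and `eprintln!` stands in for a real logging macro. Note that `#[non_exhaustive]` only forces the wildcard arm in downstream crates; within a single crate, as here, the arm is merely allowed.

```rust
/// Hypothetical stand-in for an upstream enum that may gain variants
/// in a later release.
#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UpstreamError {
    Timeout,
    ConnectionClosed,
    UnsupportedProtocols,
}

/// Hypothetical local error type owned by the application.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LocalError {
    Timeout,
    ConnectionClosed,
}

/// Maps the variants that have a local equivalent; anything else is
/// logged and reported as `None`.
pub fn map_error(error: UpstreamError) -> Option<LocalError> {
    match error {
        UpstreamError::Timeout => Some(LocalError::Timeout),
        UpstreamError::ConnectionClosed => Some(LocalError::ConnectionClosed),
        // Catch-all arm: mandatory when the enum comes from another crate.
        other => {
            eprintln!("unmapped upstream error variant: {other:?}");
            None
        }
    }
}
```

The trade-off is that a newly added upstream variant no longer causes a compile error, so the log message is the only signal that a mapping is missing.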

libp2p/rust-libp2p#4306 makes a connection close as soon as no handler is using it, so I had to replace KeepAlive::Until with an explicit future that flips an internal boolean after a timeout. This achieves the old behavior, though ideally it should be removed completely at some point.
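
The idea of a timeout that flips a keep-alive boolean can be sketched without any async machinery. This is a simplified, std-only illustration under the assumption that the handler reports a plain boolean. The `KeepAliveTimer` type and its methods are hypothetical, and the explicit `poll(now)` call stands in for polling the timeout future used in the actual change.

```rust
use std::time::{Duration, Instant};

/// Hypothetical handler state: stays alive until `deadline`, after which
/// the internal boolean is flipped and the connection may be closed.
pub struct KeepAliveTimer {
    deadline: Instant,
    keep_alive: bool,
}

impl KeepAliveTimer {
    pub fn new(now: Instant, timeout: Duration) -> Self {
        Self { deadline: now + timeout, keep_alive: true }
    }

    /// Stand-in for polling the timeout future: flips the flag once
    /// `now` has reached the deadline. The flag never flips back.
    pub fn poll(&mut self, now: Instant) {
        if now >= self.deadline {
            self.keep_alive = false;
        }
    }

    /// What a handler would report when asked whether the connection
    /// should be kept open.
    pub fn connection_keep_alive(&self) -> bool {
        self.keep_alive
    }
}
```

Passing the current time in explicitly keeps the sketch deterministic, which also makes it easy to test without sleeping.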

yamux_window_size is no longer used by libp2p thanks to libp2p/rust-libp2p#4970, and Yamux should generally perform better now.

I have resolved and cleaned up all deprecations related to libp2p except BandwidthSinks. libp2p has deprecated it, but it is still present in 0.54.1, which is why I haven't handled it yet. Ideally Substrate would finally switch to the official Prometheus client, in which case we'd get metrics for free. Otherwise a bit of code will need to be copy-pasted to maintain the current behavior once BandwidthSinks is gone; I left a TODO about this.

The biggest change in 0.54.0 is libp2p/rust-libp2p#4568, which changed the transport APIs and unconditionally enabled potential port reuse. This can lead to very confusing errors when running two Substrate nodes on the same machine without explicitly changing the listening port.

Overall nothing scary here, but testing is always appreciated.

Checklist

  • My PR includes a detailed description as outlined in the "Description" and its two subsections above.
  • My PR follows the labeling requirements of this project (at minimum one label for T required)
    • External contributors: ask maintainers to put the right label on your PR.

Polkadot Address: 1vSxzbyz2cJREAuVWjhXUT1ds8vBzoxn2w4asNpusQKwjJd

@nazar-pc
Contributor Author

Looks like something called zombienet-orchestrator still pulls in the old version of libp2p, so whoever maintains that will have work to do after this.

@nazar-pc nazar-pc changed the title from "Upgrade libp2p from 0.52.4 to 0.53.2" to "Upgrade libp2p from 0.52.4 to 0.54.1" Oct 27, 2024
@dmitry-markin dmitry-markin added the T0-node This PR/Issue is related to the topic “node”. label Oct 28, 2024
@lexnv lexnv self-requested a review October 30, 2024 11:07
@dmitry-markin dmitry-markin self-requested a review November 3, 2024 19:07
Contributor

@dmitry-markin dmitry-markin left a comment

Nothing scary indeed. Thanks a lot for upgrading libp2p for us!

Comment on lines -225 to -232
// Populate kad with both the legacy and the new protocol names.
// Remove the legacy protocol:
// https://github.com/paritytech/polkadot-sdk/issues/504
let kademlia_protocols = if let Some(legacy_protocol) = kademlia_legacy_protocol {
    vec![kademlia_protocol.clone(), legacy_protocol]
} else {
    vec![kademlia_protocol.clone()]
};
Contributor

Looks good! We should finally do this)

@rockbmb
Contributor

rockbmb commented Nov 4, 2024

/tip small

Only members of paritytech/tip-bot-approvers have permission to request the creation of the tip referendum from the bot.

However, you can create the tip referendum yourself using Polkassembly or PolkadotJS Apps.

@nazar-pc
Contributor Author

nazar-pc commented Nov 5, 2024

Looks like in your CI request_responses::tests::max_response_size_exceeded runs even slower than I anticipated, would it be okay if I set an even larger connection timeout for this test?

@dmitry-markin
Contributor

> Looks like in your CI request_responses::tests::max_response_size_exceeded runs even slower than I anticipated, would it be okay if I set an even larger connection timeout for this test?

Looking at the test, it doesn't seem to do anything time consuming. And a longer timeout was not needed with old libp2p?

@nazar-pc
Contributor Author

nazar-pc commented Nov 5, 2024

> Looking at the test, it doesn't seem to do anything time consuming. And a longer timeout was not needed with old libp2p?

It runs quickly on its own, but when running all network tests at once it suddenly takes much longer and always seems to finish later than most tests. I didn't debug why very deeply, though, primarily because, as the comment states, it is not really applicable to how it is actually used in Substrate; it is just test-specific behavior. Bumping the timeout to 5 minutes fixed the test, and it shouldn't be particularly flaky.

I'm not sure about the older version. I didn't run the tests in a loop to try to reproduce it, but the keep-alive behavior has certainly changed.

@nazar-pc
Contributor Author

nazar-pc commented Nov 7, 2024

I remember with previous libp2p upgrades there was some kind of burn-in testing done, can it be triggered for this PR, hopefully making it into stable2412 if all goes well 🤞 ?

@dmitry-markin
Contributor

> I remember with previous libp2p upgrades there was some kind of burn-in testing done, can it be triggered for this PR, hopefully making it into stable2412 if all goes well 🤞 ?

Yes, that was the plan: I was going to do a Versi burn-in once it is available after the syncing refactoring testing. Last time, the issues with the libp2p upgrade popped up after a week of running the nodes, so we need to run a burn-in for at least a week or so.

Unfortunately, the branch-off for stable2412 was planned for yesterday and is going to happen today, so I don't think the libp2p upgrade will make it into the December release.

@dmitry-markin
Contributor

/cmd prdoc --help

Command help:
usage: /cmd prdoc [-h] [--pr PR]
                  [--audience [{runtime_dev,runtime_user,node_dev,node_operator} ...]]
                  [--bump {patch,minor,major,silent,ignore,no_change}]
                  [--force]

options:
  -h, --help            show this help message and exit
  --pr PR               The PR number to generate the PrDoc for.
  --audience [{runtime_dev,runtime_user,node_dev,node_operator} ...]
                        The audience of whom the changes may concern. Example:
                        --audience runtime_dev node_dev
  --bump {patch,minor,major,silent,ignore,no_change}
                        A default bump level for all crates. Example: --bump
                        patch
  --force               Whether to overwrite any existing PrDoc.

@nazar-pc
Contributor Author

nazar-pc commented Dec 5, 2024

Any remaining blockers I can help with here?

@jasl
Contributor

jasl commented Dec 9, 2024

Kind ping: can this be merged?

@dmitry-markin
Contributor

We are still running this version in our testnet. If everything is fine by the end of this week, we will merge it.

@dmitry-markin dmitry-markin added this pull request to the merge queue Dec 12, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 12, 2024
@dmitry-markin dmitry-markin added this pull request to the merge queue Dec 12, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 12, 2024
@dmitry-markin
Contributor

request_responses::tests::max_response_size_exceeded fails in the merge queue even with a 5 min timeout, so this has to be addressed before we merge this PR.

@nazar-pc do you have the capacity to look into it?

@nazar-pc
Contributor Author

I'll try to take a closer look at this soon. Timeouts like that eventually bite and need to be dealt with; it looks like it just happened sooner in this case. Anyone else, feel free to beat me to it.

@nazar-pc
Contributor Author

The test was always flaky. I think it just took longer to tear down the connection in the past; now it happens quicker, which makes the test more likely to fail spontaneously. Simply waiting for the connection to close on the sender side, which makes sure the message gets delivered to the other side, should make it much more deterministic (timeouts are still involved, but are extremely unlikely to be triggered).

# Conflicts:
#	substrate/client/network/src/discovery.rs
@nazar-pc
Contributor Author

There was a minor conflict in imports; it should be good now.

@@ -32203,7 +32089,7 @@ dependencies = [
  "futures",
  "glob-match",
  "hex",
- "libp2p",
+ "libp2p 0.52.4",
Contributor

@jasl jasl Dec 14, 2024

Just noticed that zombienet-orchestrator still requires libp2p 0.52.4. However, it won't affect downstream.

Contributor

I've made a PR for zombienet-sdk

@dmitry-markin dmitry-markin added this pull request to the merge queue Dec 16, 2024
Merged via the queue into paritytech:master with commit c881288 Dec 16, 2024
202 checks passed
@dmitry-markin
Contributor

Thanks again @nazar-pc for doing this!

@bkchr
Member

bkchr commented Dec 16, 2024

/tip medium

@bkchr A referendum for a medium (80 DOT) tip was successfully submitted for @nazar-pc (1vSxzbyz2cJREAuVWjhXUT1ds8vBzoxn2w4asNpusQKwjJd on polkadot).

Referendum number: 1354.

The referendum has appeared on Polkassembly.

@nazar-pc nazar-pc deleted the libp2p-0.53.x branch December 16, 2024 09:29
nazar-pc added a commit to autonomys/polkadot-sdk that referenced this pull request Dec 20, 2024
# Description

Fixes paritytech#5996

https://github.com/libp2p/rust-libp2p/releases/tag/libp2p-v0.53.0
https://github.com/libp2p/rust-libp2p/blob/master/CHANGELOG.md

## Integration

Nothing special is needed, just note that `yamux_window_size` is no
longer applicable to libp2p (litep2p seems to still have it though).

## Review Notes

There are a few simplifications and improvements done in libp2p 0.53
regarding swarm interface, I'll list a few key/applicable here.

libp2p/rust-libp2p#4788 removed
`write_length_prefixed` function, so I inlined its code instead.

libp2p/rust-libp2p#4120 introduced new
`libp2p::SwarmBuilder` instead of now deprecated
`libp2p::swarm::SwarmBuilder`, the transition is straightforward and
quite ergonomic (can be seen in tests).

libp2p/rust-libp2p#4581 is the most annoying
change I have seen that basically makes many enums `#[non_exhaustive]`.
I mapped some, but those that couldn't be mapped I dealt with by
printing log messages once they are hit (the best solution I could come
up with, at least with stable Rust).

libp2p/rust-libp2p#4306 makes connection close
as soon as there are no handler using it, so I had to replace
`KeepAlive::Until` with an explicit future that flips internal boolean
after timeout, achieving the old behavior, though it should ideally be
removed completely at some point.

`yamux_window_size` is no longer used by libp2p thanks to
libp2p/rust-libp2p#4970 and generally Yamux
should have a higher performance now.

I have resolved and cleaned up all deprecations related to libp2p except
`BandwidthSinks`. Libp2p deprecated it (though it is still present in
0.54.1, which is why I didn't handle it just yet). Ideally Substrate
would finally [switch to the official Prometheus
client](paritytech/substrate#12699), in which
case we'd get metrics for free. Otherwise a bit of code will need to be
copy-pasted to maintain current behavior with `BandwidthSinks` gone,
which I left a TODO about.

The biggest change in 0.54.0 is
libp2p/rust-libp2p#4568 that changed transport
APIs and enabled unconditional potential port reuse, which can lead to
very confusing errors if running two Substrate nodes on the same machine
without changing listening port explicitly.

Overall nothing scary here, but testing is always appreciated.

# Checklist

* [x] My PR includes a detailed description as outlined in the
"Description" and its two subsections above.
* [x] My PR follows the [labeling requirements](https://github.com/paritytech/polkadot-sdk/blob/master/docs/contributor/CONTRIBUTING.md#Process) of this project (at minimum one label for `T` required)
  * External contributors: ask maintainers to put the right label on your PR.

---

Polkadot Address: 1vSxzbyz2cJREAuVWjhXUT1ds8vBzoxn2w4asNpusQKwjJd

---------

Co-authored-by: Dmitry Markin <[email protected]>

(cherry picked from commit c881288)
Labels
T0-node This PR/Issue is related to the topic “node”.
Projects
Status: Blocked ⛔️
Development

Successfully merging this pull request may close these issues.

network: Update libp2p to 0.54.1
6 participants