Balancing Network Diversity and Performance #6413

Open
LukeWheeldon opened this issue Nov 8, 2024 · 19 comments
Labels
I10-unconfirmed Issue might be valid, but it's not yet known.

Comments

@LukeWheeldon

LukeWheeldon commented Nov 8, 2024

I have observed that nodes running at the core of the European and US internet infrastructure are connected to a substantially higher number of peers and get better network performance. This comes with advantages such as a better reputation and, just as importantly, better rewards.

The wiki says, “A diverse network of nodes in varying different regions helps strengthen decentralized networks”, which I agree with. However, my experience has been that running nodes in less central regions implies lower performance as well as lower rewards for a validator.

The Wiki for Validators should be clearer about which regions would benefit the network without a negative impact on a validator's performance and rewards. Likewise, it may also be worth documenting which regions are, at this point, too far off and would most likely incur performance and reward penalties.

@github-actions github-actions bot added the I10-unconfirmed Issue might be valid, but it's not yet known. label Nov 8, 2024
@burdges

burdges commented Nov 8, 2024

Afaik, reputation should only matter for collators, not validators, right? It might or might not exist for validators too, but in practice validators must talk to all other validators directly in availability, via fixed topologies in approvals, etc. I doubt this impacts rewards right now, but..

Impact upon collators matters too. I'm afraid some collator tech like elastic scaling will always be massively unfair, but we could be more careful for parachains who do not require elastic scaling.

It could massively impact validator rewards once we make validator rewards make sense using #1811

We should explore information about latencies to different ISPs around the world. In theory, these all say under 300 ms but likely the results get worse once you look at say Thailand.

https://wondernetwork.com/pings
https://www.verizon.com/business/terms/latency/
https://kb.leaseweb.com/kb/network/network-ip-performance-measurements/

And maybe other bandwidth measures besides latency, ala https://www.speedtest.net/global-index

Anyways we picked the approval (ELVES) parameters so that 500 ms round trip works fine. We could tweak these in future, like a 1 second tranche time should reduce no-shows.

@eskimor
Member

eskimor commented Nov 8, 2024

This sounds like an issue that should be fixed. Can you elaborate more on what data you are basing this on, please? Also, if you have any logs from affected validators, that would be helpful.

@LukeWheeldon
Author

Afaik, reputation should only matter for collators, not validators, right?

On staking.polkadot.cloud, you have an indicator of “validator performance” and similarly, there is apps.turboflakes.io to get more reporting on validators. This is what I alluded to when I referred to reputation for validators; it's not the only aspect that matters for reputation, but it matters a lot nonetheless.

Anyways we picked the approval (ELVES) parameters so that 500 ms round trip works fine. We could tweak these in future, like a 1 second tranche time should reduce no-shows.

Indeed, it generally works fine. My observation on validators not located right at the core of the Internet infrastructure is that the further away they get, the fewer peers they tend to be connected to, and the more votes they tend to miss. The missed votes numbers are not very high at this point, but I still wanted to report my observations because I am concerned that as the load on the network increases, this effect may increase as well. That would substantially disadvantage those who have invested in the decentralization of the network.

This sounds like an issue that should be fixed. Can you elaborate more on what data you are basing this on, please?

I may be mostly alone with this issue, and somehow it's my infrastructure that has an issue somewhere. I cannot see how at this point, which is why I have opened this ticket.

To be more specific, I always have excellent performances & rewards in the Netherlands, but for example, Bulgarian or Polish validator performances would be slightly less. As a matter of fact, currently, I see 9 peers on the Kusama validator in Bulgaria, 18 in Poland, and 35 in the Netherlands.

@burdges

burdges commented Nov 9, 2024

You're not alone; someone in Bangkok mentioned having issues there. Jonas pointed out this map:

https://www.certhum.com/polkadot-validator-map

It's not that diverse in terms of validator locations really. We do want more validators in non-aligned or BRICS+ nations. I think dv/1kv has some affirmative action that selects such validators more.

We do need limits though, so really poor connections would miss rewards if they run validators, so no Starlink or other craziness, but we need data on where to draw this line.

After #1811 we should probably increase maximum commission in dv/1kv for validators in nations not well represented, so then maximum rewards would mean finding a fast ISP in a different nation. We should discuss the NATO, west, BRICS+, or non-aligned classifications too.

@LukeWheeldon
Author

Is there a specific Github issue to follow on this matter? It's something I care a lot about and would gladly attempt to help where possible.

The incentive for 1kv/dv is great, but I am particularly interested in ensuring that self-reliant validators also have the best possible incentives to run nodes as decentralized as possible.

@burdges

burdges commented Nov 9, 2024

1kv/dv is an internal w3f thing. They do whatever they think benefits the network.

I've suggested but we've never seriously considered location based affirmative action in actual rewards. It'd be done using median computations similar to #1811, which always scares people. Not impossible, but not politically the easiest thing.

Anyways, like @eskimor said, we're happy to have stats from validators who have latency problems, but ideally we'd love some information on the selection of ISP there, because a city might have good internet but a particularly inexpensive ISP might be bad or just non-commercial.

As a related example, we had trouble with too many validators being on Hetzner, who are cheap for Germany, but hate crypto-currencies and cut off nodes without warning.

@LukeWheeldon
Author

Okay. I will provide some details while ensuring they are not getting into the specifics of the exact location of my validator(s). Hopefully this will be useful, if not you can just let me know and tell me what would be useful instead. We can also exchange privately, in which case I could share a bit more details than publicly.

This is from a validator in Bulgaria, connected through ReTN, Cogent, HE, SOX. Does this look like a normal number of peers connected to the node?

2024-11-10 05:23:55 ♻️  Reorg on #25715178,0x5149…e6fe to #25715179,0x3596…7c65, common ancestor #25715177,0x3252…f652
2024-11-10 05:23:55 🏆 Imported #25715179 (0x4c31…a5f7 → 0x3596…7c65)
2024-11-10 05:23:56 💤 Idle (7 peers), best: #25715179 (0x3596…7c65), finalized #25715176 (0x90d3…f279), ⬇ 2.2MiB/s ⬆ 2.9MiB/s
2024-11-10 05:24:00 🏆 Imported #25715180 (0x3596…7c65 → 0x9937…6597)
2024-11-10 05:24:00 ♻️  Reorg on #25715180,0x9937…6597 to #25715180,0xc08f…3039, common ancestor #25715179,0x3596…7c65
2024-11-10 05:24:00 🏆 Imported #25715180 (0x3596…7c65 → 0xc08f…3039)
2024-11-10 05:24:01 💤 Idle (6 peers), best: #25715180 (0xc08f…3039), finalized #25715177 (0x3252…f652), ⬇ 3.5MiB/s ⬆ 3.1MiB/s
2024-11-10 05:24:06 🏆 Imported #25715181 (0xc08f…3039 → 0x8459…6497)
2024-11-10 05:24:06 💤 Idle (8 peers), best: #25715181 (0x8459…6497), finalized #25715177 (0x3252…f652), ⬇ 2.5MiB/s ⬆ 667.5kiB/s
2024-11-10 05:24:06 ♻️  Reorg on #25715181,0x8459…6497 to #25715181,0x2024…4621, common ancestor #25715180,0xc08f…3039
2024-11-10 05:24:06 🏆 Imported #25715181 (0xc08f…3039 → 0x2024…4621)
2024-11-10 05:24:11 💤 Idle (8 peers), best: #25715181 (0x2024…4621), finalized #25715179 (0x3596…7c65), ⬇ 2.6MiB/s ⬆ 1021.7kiB/s
2024-11-10 05:24:12 🏆 Imported #25715182 (0x2024…4621 → 0x48e5…9a89)
2024-11-10 05:24:12 🆕 Imported #25715182 (0x2024…4621 → 0x4e00…43d0)
2024-11-10 05:24:12 🆕 Imported #25715182 (0x2024…4621 → 0x2af3…122d)
2024-11-10 05:24:16 💤 Idle (8 peers), best: #25715182 (0x48e5…9a89), finalized #25715180 (0xc08f…3039), ⬇ 3.2MiB/s ⬆ 1.2MiB/s
2024-11-10 05:24:18 🏆 Imported #25715183 (0x48e5…9a89 → 0xcc6f…d176)
2024-11-10 05:24:21 💤 Idle (8 peers), best: #25715183 (0xcc6f…d176), finalized #25715180 (0xc08f…3039), ⬇ 2.4MiB/s ⬆ 869.7kiB/s
2024-11-10 05:24:24 🏆 Imported #25715184 (0xcc6f…d176 → 0x0af1…e706)
2024-11-10 05:24:26 💤 Idle (8 peers), best: #25715184 (0x0af1…e706), finalized #25715180 (0xc08f…3039), ⬇ 2.3MiB/s ⬆ 549.5kiB/s
2024-11-10 05:24:30 🏆 Imported #25715185 (0x0af1…e706 → 0xed53…1180)
2024-11-10 05:24:30 🆕 Imported #25715185 (0x0af1…e706 → 0xbd6d…d992)

Likewise, Turboflakes says that this validator has about 60% "less than the average of Backing Points collected by all Para-Authorities of the last 192 sessions"; normal?

Finally, I have done a speedtest and even though my server interface is 1Gbps, I seem to be getting substantially less.

$ speedtest
Retrieving speedtest.net configuration...
Testing from XXX (XXX)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by Korabi NET (Peshkopi) [XXX km]: 26.459 ms
Testing download speed................................................................................
Download: 290.13 Mbit/s
Testing upload speed......................................................................................................
Upload: 700.99 Mbit/s

Running an iperf3 test seems to show better performance, even though the testing server is located further away (France):

$ iperf3 -c ping.online.net
Connecting to host ping.online.net, port 5201
[  5] local XXX port 41292 connected to 51.158.1.21 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  61.5 MBytes   516 Mbits/sec    0   5.62 MBytes
[  5]   1.00-2.00   sec  82.5 MBytes   692 Mbits/sec    0   5.62 MBytes
[  5]   2.00-3.00   sec  83.8 MBytes   703 Mbits/sec    0   5.62 MBytes
[  5]   3.00-4.00   sec  82.5 MBytes   692 Mbits/sec    0   5.62 MBytes
[  5]   4.00-5.00   sec  83.8 MBytes   703 Mbits/sec    0   5.62 MBytes
[  5]   5.00-6.00   sec  82.5 MBytes   692 Mbits/sec    0   5.62 MBytes
[  5]   6.00-7.00   sec  83.8 MBytes   703 Mbits/sec    0   5.62 MBytes
[  5]   7.00-8.00   sec  82.5 MBytes   692 Mbits/sec    0   5.62 MBytes
[  5]   8.00-9.00   sec  83.8 MBytes   703 Mbits/sec    0   5.62 MBytes
[  5]   9.00-10.00  sec  82.5 MBytes   692 Mbits/sec    0   5.62 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   809 MBytes   679 Mbits/sec    0             sender
[  5]   0.00-10.03  sec   809 MBytes   676 Mbits/sec                  receiver

iperf Done.

$ iperf3 -c ping.online.net --reverse
Connecting to host ping.online.net, port 5201
Reverse mode, remote host ping.online.net is sending
[  5] local XXX port 39660 connected to 51.158.1.21 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  63.4 MBytes   532 Mbits/sec
[  5]   1.00-2.00   sec  85.0 MBytes   713 Mbits/sec
[  5]   2.00-3.00   sec  85.1 MBytes   714 Mbits/sec
[  5]   3.00-4.00   sec  84.9 MBytes   712 Mbits/sec
[  5]   4.00-5.00   sec  85.0 MBytes   713 Mbits/sec
[  5]   5.00-6.00   sec  84.9 MBytes   712 Mbits/sec
[  5]   6.00-7.00   sec  85.0 MBytes   713 Mbits/sec
[  5]   7.00-8.00   sec  84.9 MBytes   712 Mbits/sec
[  5]   8.00-9.00   sec  84.9 MBytes   713 Mbits/sec
[  5]   9.00-10.00  sec  85.1 MBytes   714 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.03  sec   880 MBytes   736 Mbits/sec    0             sender
[  5]   0.00-10.00  sec   828 MBytes   695 Mbits/sec                  receiver

iperf Done.

@eskimor
Member

eskimor commented Nov 18, 2024

What would be interesting is ping/round trip times. E.g. between the well performing and the not-so-well performing node.
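For anyone wanting to gather such round-trip numbers without ICMP ping (which some hosts filter), here is a minimal Python sketch that times TCP handshakes instead. It is only an approximation of network RTT, and the function names `tcp_connect_rtt_ms` and `summarize` are made up for this example; the target host and port are whatever node you want to compare against.

```python
import socket
import statistics
import time


def tcp_connect_rtt_ms(host, port, samples=5, timeout=3.0):
    """Time `samples` TCP handshakes to host:port; return each duration in ms."""
    results = []
    for _ in range(samples):
        start = time.perf_counter()
        # create_connection returns once the three-way handshake completes,
        # so the elapsed time approximates one network round trip.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.perf_counter() - start
        results.append(elapsed * 1000.0)
    return results


def summarize(rtts_ms):
    """Reduce a list of RTT samples to min / median / max."""
    return {
        "min_ms": min(rtts_ms),
        "median_ms": statistics.median(rtts_ms),
        "max_ms": max(rtts_ms),
    }
```

Usage would be something like `summarize(tcp_connect_rtt_ms(other_node_ip, p2p_port))`, run once from the well performing node and once from the other one, where `other_node_ip` and `p2p_port` are placeholders for an address and open TCP port you control.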

@LukeWheeldon
Author

Sure thing.

From a large DC in the Netherlands:

Screenshot 2024-11-19 at 16 20 06

From a large DC in Germany:

Screenshot 2024-11-19 at 16 20 19

@LukeWheeldon
Author

LukeWheeldon commented Nov 28, 2024

2024-11-28 00:04:47 💤 Idle (1 peers), best: #25966965 (0x4702…021b), finalized #25966934 (0x246d…f680), ⬇ 1.9MiB/s ⬆ 2.4MiB/s
2024-11-28 00:04:48 🏆 Imported #25966966 (0x4702…021b → 0xa8f4…5304)
2024-11-28 00:04:48 ♻️  Reorg on #25966966,0xa8f4…5304 to #25966966,0x8a82…829a, common ancestor #25966965,0x4702…021b
2024-11-28 00:04:48 🏆 Imported #25966966 (0x4702…021b → 0x8a82…829a)
2024-11-28 00:04:52 💤 Idle (1 peers), best: #25966966 (0x8a82…829a), finalized #25966934 (0x246d…f680), ⬇ 2.1MiB/s ⬆ 2.2MiB/s
2024-11-28 00:04:54 🏆 Imported #25966967 (0x8a82…829a → 0xad5f…efd4)
2024-11-28 00:04:57 💤 Idle (1 peers), best: #25966967 (0xad5f…efd4), finalized #25966934 (0x246d…f680), ⬇ 2.0MiB/s ⬆ 2.4MiB/s
2024-11-28 00:05:00 🏆 Imported #25966968 (0xad5f…efd4 → 0x6dce…4341)
2024-11-28 00:05:02 💤 Idle (1 peers), best: #25966968 (0x6dce…4341), finalized #25966934 (0x246d…f680), ⬇ 1.8MiB/s ⬆ 2.4MiB/s
2024-11-28 00:05:06 🏆 Imported #25966969 (0x6dce…4341 → 0xa9ca…48a6)
2024-11-28 00:05:07 💤 Idle (1 peers), best: #25966969 (0xa9ca…48a6), finalized #25966934 (0x246d…f680), ⬇ 1.7MiB/s ⬆ 2.0MiB/s
2024-11-28 00:05:12 💤 Idle (1 peers), best: #25966969 (0xa9ca…48a6), finalized #25966934 (0x246d…f680), ⬇ 1.7MiB/s ⬆ 2.3MiB/s
2024-11-28 00:05:12 🏆 Imported #25966970 (0xa9ca…48a6 → 0xf51e…1626)
2024-11-28 00:05:17 💤 Idle (2 peers), best: #25966970 (0xf51e…1626), finalized #25966934 (0x246d…f680), ⬇ 1.8MiB/s ⬆ 2.1MiB/s
2024-11-28 00:05:19 🏆 Imported #25966971 (0xf51e…1626 → 0xccef…ec93)
2024-11-28 00:05:19 ♻️  Reorg on #25966971,0xccef…ec93 to #25966971,0x0ed1…7377, common ancestor #25966970,0xf51e…1626
2024-11-28 00:05:19 🏆 Imported #25966971 (0xf51e…1626 → 0x0ed1…7377)
2024-11-28 00:05:22 💤 Idle (2 peers), best: #25966971 (0x0ed1…7377), finalized #25966967 (0xad5f…efd4), ⬇ 2.1MiB/s ⬆ 3.6MiB/s
2024-11-28 00:05:24 🏆 Imported #25966972 (0x0ed1…7377 → 0x11fb…0635)
2024-11-28 00:05:27 💤 Idle (2 peers), best: #25966972 (0x11fb…0635), finalized #25966967 (0xad5f…efd4), ⬇ 3.4MiB/s ⬆ 2.6MiB/s
2024-11-28 00:05:32 💤 Idle (2 peers), best: #25966972 (0x11fb…0635), finalized #25966969 (0xa9ca…48a6), ⬇ 1.6MiB/s ⬆ 2.4MiB/s
2024-11-28 00:05:36 🏆 Imported #25966973 (0x11fb…0635 → 0xc228…811d)
2024-11-28 00:05:37 💤 Idle (2 peers), best: #25966973 (0xc228…811d), finalized #25966970 (0xf51e…1626), ⬇ 1.9MiB/s ⬆ 4.5MiB/s
2024-11-28 00:05:42 💤 Idle (2 peers), best: #25966973 (0xc228…811d), finalized #25966970 (0xf51e…1626), ⬇ 1.7MiB/s ⬆ 2.4MiB/s

FYI. Nothing has changed on my end, really no idea why I am getting such a low peer number.

Also, on app.turboflakes, that validator remains A+. I can't say that I understand what is going on, but at least the validation aspect seems to be operating normally.

@eskimor
Member

eskimor commented Dec 23, 2024

Hmm. The data looks rather similar. How big is the difference in rewards?

@LukeWheeldon
Author

I'm starting one in Turkey and I'll try to get you some more useful details.

One thing so far: at the current rate, completing the warp sync will take about a week; very few peers. In AMS, UK, DE, US, etc., it would take a day at most.

@dmitry-markin
Contributor

dmitry-markin commented Jan 7, 2025

I can't say that I understand what is going on, but at least the validation aspect seems to be operating normally.

The number of peers reported is for sync peers; the validation protocol has its own peer set. So we should find out why the syncing is not able to establish enough connections.

In any case, this is weird; I would expect poor networking to affect both syncing and validation at the same time.

Is the peer count low from the very beginning after the node startup, or is it high first and then drops to 1-2 peers?

Can you collect the logs with -l sync=trace,sub-libp2p=trace? This is going to produce a lot of logs, so collecting it for 5-10 minutes is enough, but we want to capture the period where the node loses peers.
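The question of whether the peer count starts low or drops later can also be checked from the regular (non-trace) output. A rough Python sketch, assuming the status lines look like the ones pasted earlier in this thread (`... Idle (7 peers), ...` or `... Block history, #... (3 peers), ...`); the helper names `peer_counts` and `first_drop` are invented for this example:

```python
import re
from datetime import datetime

# Matches a timestamped status line containing "(N peers)", e.g.
# "2024-11-10 05:23:56 ... Idle (7 peers), best: ..."
STATUS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*?\((\d+) peers\)")


def peer_counts(lines):
    """Return (timestamp, peer_count) for every status line in `lines`."""
    series = []
    for line in lines:
        m = STATUS_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            series.append((ts, int(m.group(2))))
    return series


def first_drop(series, threshold):
    """Return the first timestamp where the count falls below `threshold`
    after having been at or above it, or None if that never happens."""
    seen_healthy = False
    for ts, count in series:
        if count >= threshold:
            seen_healthy = True
        elif seen_healthy:
            return ts
    return None
```

Feeding it a log file, e.g. `first_drop(peer_counts(open(path, encoding="utf-8")), threshold=5)`, gives a timestamp around which to look in the trace logs; `path` and the threshold of 5 are arbitrary choices for illustration.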

@burdges

burdges commented Jan 7, 2025

The Tor project puts Turkey in the top-10 countries by possible censorship events. Internet censorship often disrupts unrelated traffic using the same ISP, so polkadot infrastructure could be censored inadvertently, especially if the ISP criteria wind up similar for node operators of both polkadot and tor.

It's good to know if we're hitting censorship, even if not targeted at us. Also, if someone wants to ask questions about censorship, there are many people who study it now, both at Tor and elsewhere, so one could ask questions at https://tor.stackexchange.com/ maybe. Also, the Tor matrix space has rooms for the "global south" and "relay operators" which may be good places.

@LukeWheeldon
Author

LukeWheeldon commented Jan 8, 2025

#1 The server is staying behind on sync after about a week. It has a 1Gbps connection and is located in a top tier datacenter.

2025-01-08 03:19:26 ⏩ Block history, #14075941 (3 peers), best: #26552741 (0x36be…3efa), finalized #26552738 (0x1742…c893), ⬇ 1.2MiB/s ⬆ 39.9kiB/s

#2 @burdges the network I use is confirmed by ooni.org not to have filtering / censorship issues. I have manually tested for this using ooni-cli (0% IM blocking, 0% circumvention tools blocking, 0% experimental tests blocking, 1% controversial websites blocking). The 1% figure is very low: out of 2092 controversial websites tested, 22 didn't work. I suspect most were just offline, but sure, there could be some IPs being blocked somewhere. With those results, I personally cannot imagine this is the problem here, though.

#3 @dmitry-markin

Can you collect the logs with -l sync=trace,sub-libp2p=trace?

Sure I'll do so right away. But I would rather not share the output here as I don't know how to anonymize it.


I'm not going to keep this server beyond January if it cannot even sync up to the tip. That is, unless it can help figure out something useful for the network.

At this point I suggest taking this private with a Parity team member. If you agree please email me from a parity.io email address to [email protected] and from there we can either keep it over email or switch to Signal for faster back and forth.

@burdges

burdges commented Jan 8, 2025

Ahh ooni-cli is a nice idea for this! :)

@bkchr
Member

bkchr commented Jan 9, 2025

Sure I'll do so right away. But I would rather not share the output here as I don't know how to anonymize it.

There is no private information in these logs. Yeah, they will contain IP addresses, but I can also query those via the public network.

@LukeWheeldon
Author

LukeWheeldon commented Jan 9, 2025

Can you link those IPs with my name via the public network? My offer to work on this together stands, but I'm not sharing the private bits here.

@bkchr
Member

bkchr commented Jan 9, 2025

Can you link those IPs with my name via the public network?

If you are running a validator or have any node name that can be associated with you, it also works in a public network. I just wanted to say that there is not that much data in it that you need to be afraid of, especially if it is only about your IP address, which you could search&replace.

But @dmitry-markin will reach out to you via mail and then you can share your logs there :)
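For the search&replace idea, a small Python sketch that swaps every IPv4 address in a log for a stable placeholder, so the same peer still shows up as the same token after redaction. This is only an illustration: `redact_ipv4` is a name made up here, it does not touch IPv6 addresses, peer IDs, or node names, and an address immediately followed by a dot (e.g. at the end of a sentence) is deliberately skipped to avoid mangling longer dotted strings.

```python
import re

# IPv4 dotted quad with each octet limited to 0-255.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
IPV4_RE = re.compile(rf"(?<![\d.]){_OCTET}(?:\.{_OCTET}){{3}}(?![\d.])")


def redact_ipv4(text, keep=()):
    """Replace each distinct IPv4 address with a placeholder (IP_1, IP_2, ...).

    Addresses listed in `keep` are left untouched.
    Returns (redacted_text, mapping) so the mapping can be kept private.
    """
    mapping = {}

    def _sub(match):
        ip = match.group(0)
        if ip in keep:
            return ip
        if ip not in mapping:
            mapping[ip] = f"IP_{len(mapping) + 1}"
        return mapping[ip]

    return IPV4_RE.sub(_sub, text), mapping
```

Keeping the returned mapping locally means the placeholders can be translated back if a specific peer turns out to matter during debugging.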
