Parachain Blocktimes Increasing #6910

Open
wilwade opened this issue Dec 16, 2024 · 23 comments
Comments

@wilwade
Contributor

wilwade commented Dec 16, 2024

What is happening?

The average parachain blocktime appears to be increasing slowly over the last 7-14 days.

I'm working on collecting better data, but here's an internal chart for Frequency (6s block times) showing the number of blocks produced per 30-minute window over the last 7 days; the trend line is clear.

2024-12-16 10 34 52

Is it just Frequency? Nope

While it was having trouble, I used a script to compare the lag of Frequency with the lag of Mythos (another 6s block time parachain), and over 1000 blocks the lag matched. (I'm trying to download a dump of block times from Frequency and Mythos and will share once collected.)
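For anyone who wants to run a similar comparison, here is a minimal sketch of the idea (this assumes the py-substrate-interface library and the standard `Timestamp::Now` storage item; the RPC endpoints below are placeholders, not the exact script I used):

```python
# Minimal sketch: compare per-block lag of two 6s parachains from their
# on-chain timestamps. Assumes py-substrate-interface; endpoints are placeholders.
from substrateinterface import SubstrateInterface

ENDPOINTS = {
    "frequency": "wss://example-frequency-rpc",  # placeholder endpoint
    "mythos": "wss://example-mythos-rpc",        # placeholder endpoint
}

def lag_ms(url: str, last_n: int = 1000, target_ms: int = 6000) -> list[int]:
    """Milliseconds over the 6s target for each of the last `last_n` blocks."""
    chain = SubstrateInterface(url=url)
    head = chain.get_block_number(chain.get_chain_head())
    timestamps = []
    for number in range(head - last_n, head + 1):
        block_hash = chain.get_block_hash(number)
        timestamps.append(chain.query("Timestamp", "Now", block_hash=block_hash).value)
    return [max(0, (b - a) - target_ms) for a, b in zip(timestamps, timestamps[1:])]

for name, url in ENDPOINTS.items():
    lags = lag_ms(url)
    print(f"{name}: total lag over last 1000 blocks = {sum(lags) / 1000:.1f}s")
```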

Additionally, other chains have seen the issue, and I hope they will post here as well.

There is also no clear correlation between volume and the lag or other issues (there could be one, but since other chains are seeing the same thing, it's unlikely to explain the trend).

Ideas on what it might be?

The Turboflakes Validator dashboard doesn't show any issues, but that would have been my first thought.

Additional Notes

  • There has been a consistent dip in block times from 0930-1200 UTC for Frequency, but I've not yet been able to validate whether it is seen on other chains. This has been ongoing for a while and I've been trying to track it down.
  • Frequency is on Polkadot SDK v1.13.0
@wilwade wilwade changed the title Parachain Blocktimes increasing Parachain Blocktimes Increasing Dec 16, 2024
@bkchr
Member

bkchr commented Dec 16, 2024

@sandreim @alexggh can you please take a look.

@alexggh
Contributor

alexggh commented Dec 16, 2024

@sandreim @alexggh can you please take a look.

First of all, there hasn't been a release in that timeframe, so I don't think it is caused by a new release.

Will try to have a look tomorrow. Just clicking through the relay chain blocks, it seems there are blocks where no parachain block is included, so I think that might be the reason you are seeing this.

E.g:
https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fpolkadot.public.curie.radiumblock.co%2Fws#/explorer/query/0xb35995b7d16148c82c0dc0ad31ef6f1f4c2878d51f4a26dec409e689a36ba574

https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fpolkadot.public.curie.radiumblock.co%2Fws#/explorer/query/0xdf674f849703488ec923df212dce27155461441988466589bcc7118b65952cb8

https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fpolkadot.public.curie.radiumblock.co%2Fws#/explorer/query/0x81ee51ce255741c50b9f37adee230f9bdf208a88151c106f67cc8573ccb161cb

https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fpolkadot.public.curie.radiumblock.co%2Fws#/explorer/query/0x6b9639227545b15f3349c4f6d7b05c2bc575c4ba7a381b16565e71a5a868c3c6

So, basically, all parachain blocks sometimes skip a relay chain block between being backed and being included.

The availability bitmaps from those blocks seem to indicate the 2/3 quorum is not reached, so I think they are correctly not included in the first relay chain block after being backed.

So, we need to understand why the availability bitmaps for some of the validators are sometimes 0; one possible reason would be that those validators are simply slow, either on network or on CPU.
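To make the rule concrete, a toy illustration (plain Python, not the runtime code) of the inclusion condition described above: each validator submits one availability bit per core, and a candidate occupying a core is only included once more than 2/3 of validators have its bit set.

```python
# Toy illustration of the availability quorum described above (not the runtime
# implementation): one bit per core per validator, inclusion needs > 2/3 of bits set.

def cores_meeting_quorum(bitfields: list[list[bool]]) -> list[bool]:
    """bitfields[v][c] is True if validator v holds its chunk for core c."""
    n_validators = len(bitfields)
    n_cores = len(bitfields[0])
    threshold = (2 * n_validators) // 3 + 1  # strictly more than 2/3
    return [sum(bf[core] for bf in bitfields) >= threshold for core in range(n_cores)]

# Example with 6 validators and 2 cores: core 0 reaches quorum (5/6 bits set),
# core 1 does not (3/6), so its parachain block waits for another relay chain block.
bitfields = [
    [True, True], [True, True], [True, True],
    [True, False], [True, False], [False, False],
]
print(cores_meeting_quorum(bitfields))  # [True, False]
```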

@sandreim
Contributor

sandreim commented Dec 16, 2024

Yes, there is a clear trend, as can be seen here:

Screenshot 2024-12-16 at 18 21 32

The availability bitmaps from those blocks seem to indicate the 2/3 quorum is not reached, so I think they are correctly not included in the first relay chain block after being backed.

So, we need to understand why the availability bitmaps for some of the validators are sometimes 0; one possible reason would be that those validators are simply slow, either on network or on CPU.

One idea is that this can happen if some networking subsystem channel is full; this would lead to a stall of the bridge, so bitfields are not processed by the block author in the required time window.

I have been seeing bitfield-distribution being pressured at times and its channel getting full on Kusama on our own validators:

Screenshot 2024-12-16 at 18 28 22

However, I don't see why this started suddenly without any change.

@eskimor
Member

eskimor commented Dec 16, 2024

Depending on the root cause, speculative availability as described here might help and would actually be rather quick to implement.

@eskimor
Member

eskimor commented Dec 16, 2024

In the data I can't see any significant correlated increase, except for active accounts:

screenshot-2024-12-16-16-59-18

Block weights and size don't seem to have increased at all in that timespan. 🤔

@sandreim
Contributor

Yeah, I don't think the load has increased. Could it be that some under-spec validators got into the set? Currently I see 3 F-grade validators on Polkadot.

@eskimor
Member

eskimor commented Dec 16, 2024

Frequency seems to be affected the most:

screenshot-2024-12-16-17-05-07

In the data I don't see anything that has changed with Frequency blocks, though. Block size of Phala (best chain) seems similar to Frequency.

@eskimor
Member

eskimor commented Dec 16, 2024

Yeah, I don't think the load has increased. Could it be that some under-spec validators got into the set? Currently I see 3 F-grade validators on Polkadot.

Weird though that it seems to increase continuously over days.

@burdges

burdges commented Dec 17, 2024

Are there more or bigger parablocks? Someone else bought some cores recently, right?

@crystalin

Additional metrics that could help would be the average/sum of the PoV size, as well as execution time (maybe harder to get).

@alexggh
Contributor

alexggh commented Dec 17, 2024

I have been seeing bitfield-distribution being pressured at times and its channel getting full on Kusama on our own validators:

If that's the case then this will help and it should be coming in the next release: https://github.com/paritytech/polkadot-sdk/pull/5787/files.

However, just looking at the metrics we have, it doesn't seem there is any increase in load: finalization seems stable and the number of parachain blocks included is around the same. So I tend to think something is going wrong on some validators, and I'm going to try to build some statistics on the ones missing bitfields to see if we can pinpoint which ones became slow.

@alexggh
Contributor

alexggh commented Dec 17, 2024

From these blocks that have no included parachain blocks:
https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fpolkadot-rpc.publicnode.com#/explorer/query/0x99549b6c4450220d44c6a3d4d2dc62505338cbf7fff70723c5d073437060bfbb
https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fpolkadot-rpc.publicnode.com#/explorer/query/0xd5e08346906a534140b596968fc972176e72466c2cabeb3f9afa94d0acbf9e20
https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fpolkadot-rpc.publicnode.com#/explorer/query/0xbfd891fa8cda1b0c7d1f85549b36db36cd2043dadfd0e3bba613a679d9288bc3
https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fpolkadot-rpc.publicnode.com#/explorer/query/0xf66c2a4601c61ebf58decc99fdf079f2bd9019c340d50cacfd10753eb2359d93

It seems there are a lot of validators whose bitfield looks like this, a big 0:

0b00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000

That tells us the bitfields arrived at the block author, so I don't think there is a problem with bitfield-distribution (bitfields that don't arrive aren't noted on chain); rather, availability is slow on the said validators. I built the list here, and for all the blocks it seems to be roughly the same validators. I'll ask on the validator channels to see if I can get some logs.

bad_validators.txt
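For reference, a rough sketch of how such a list can be pulled from on-chain data (this assumes py-substrate-interface and that the decoded `paraInherent.enter` extrinsic exposes the field names used below, which may differ between library and metadata versions):

```python
# Rough sketch: list validator indices that submitted an all-zero availability
# bitfield in a given relay chain block, by decoding paraInherent.enter.
# Assumption: the decoded call exposes "data" -> "bitfields" -> "payload" /
# "validator_index"; exact names may vary with library/metadata versions.
from substrateinterface import SubstrateInterface

relay = SubstrateInterface(url="wss://polkadot-rpc.publicnode.com")

def zero_bitfield_validators(block_hash: str) -> list[int]:
    block = relay.get_block(block_hash=block_hash)
    for extrinsic in block["extrinsics"]:
        call = extrinsic.value["call"]
        if call["call_module"] != "ParaInherent" or call["call_function"] != "enter":
            continue
        args = {arg["name"]: arg["value"] for arg in call["call_args"]}
        return [
            signed["validator_index"]
            for signed in args["data"]["bitfields"]
            # Crude "all bits zero" check, tolerant of the BitVec decoding to a
            # bit string or a list of ints/bools (an assumption of this sketch).
            if "1" not in str(signed["payload"]) and "True" not in str(signed["payload"])
        ]
    return []

print(zero_bitfield_validators(
    "0x99549b6c4450220d44c6a3d4d2dc62505338cbf7fff70723c5d073437060bfbb"))
```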

@eskimor
Member

eskimor commented Dec 17, 2024

But more than a third of the validators would need to be slow on availability for this to cause issues; just a few should be no problem at all. An all-zero bitfield can also hardly be explained by slowness; there must be some greater malfunction on that node.

@alexggh
Contributor

alexggh commented Dec 17, 2024

But more than a third of the validators would need to be slow on availability for this to cause issues.

Yes, there seem to be more than 1/3 with 0 on those blocks. I asked for logs from the validators in this channel: https://matrix.to/#/!NZrbtteFeqYKCUGQtr:matrix.parity.io/$17344363742DoDav:parity.io?via=parity.io&via=corepaper.org&via=matrix.org. Hopefully that tells us more.

@alexggh
Contributor

alexggh commented Dec 17, 2024

polka-logs.txt

Got some logs from one of the validators. One thing I notice is that we seem to have a lot of re-orgs on the relay chain: in a 6-hour span we have 400 re-orgs, roughly one every ~10 blocks. That also correlates with the blocks I've seen with 0 included candidates, so it might or might not be related.

@eskimor
Member

eskimor commented Dec 17, 2024

Interesting. Would need to check, but one good explanation for those all-zeros would be that backing already failed on the fork that got revealed late. If nothing was occupying the core, then availability obviously will also not run. Another explanation is that the block triggering the reorg was revealed so late that availability had no chance to run (seems less likely, given the perfect zero).

But if that first theory were to hold, then all the validators would have sent all zeros.

@alexggh
Contributor

alexggh commented Dec 17, 2024

I think the re-orgs are a red herring. I looked at the logs going back as far as the 17th of November and it seems they always happened at about the same rate, so they don't correlate with the start of this problem.

@aaron2048

aaron2048 commented Dec 17, 2024

FWIW, we see increasing block times on Moonbeam as well. What's especially concerning is it doesn't seem to have plateaued yet.

increasing_blocktimes

@wilwade
Contributor Author

wilwade commented Dec 17, 2024

I finally pulled the data for the last ~11 days for both Frequency and Mythos.

It is a dump of key information for each block from 2024-12-05T04:27:24.000Z to 2024-12-16T14:33:42.000Z. I hope to use it to better pinpoint when the lags are and are not happening across two chains that have roughly the same tx volume, node count, collator count, etc.

So here it is in case it helps anyone else as this is all public data.

Columns:

  • blockNumber: The Parachain Block Number
  • extrinsicsCount: The number of extrinsics (signed and unsigned) in the block
  • povSize: Block PoV size, same as is shown on the Polkadot Dev UI
  • weight: Block Weight, same as is shown on the Polkadot Dev UI
  • timestamp: The value from the setTimestamp Extrinsic
  • datetime: The ISO version of Timestamp for humans
  • secondsSinceLastBlock: The number of seconds since the prior block rounded to the nearest second (block n timestamp - block n-1 timestamp)

See CSVs in the Gist: https://gist.github.com/wilwade/0fbb9dccc4f8d2f20fe4035aab422f25
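As one possible starting point for slicing these CSVs, a minimal sketch assuming pandas and the column names above (the file names are placeholders for the files in the gist):

```python
# Minimal sketch for exploring the CSVs above: bucket blocks into 30-minute
# windows and count how many took longer than the 6s target, per chain.
# File names are placeholders for the CSVs in the gist.
import pandas as pd

for name in ("frequency.csv", "mythos.csv"):  # placeholder file names
    df = pd.read_csv(name, parse_dates=["datetime"])
    df["late"] = df["secondsSinceLastBlock"] > 6  # slower than the 6s target
    per_window = df.set_index("datetime").resample("30min")["late"].sum()
    print(name, "worst 30-minute windows by number of late blocks:")
    print(per_window.sort_values(ascending=False).head())
```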

@alexggh
Contributor

alexggh commented Dec 18, 2024

A short summary: the good news is that the network seems to have healed overnight.

Average block times:

Screenshot 2024-12-18 at 09 50 20

Mythos

Screenshot 2024-12-18 at 09 49 55

Moonbeam
Screenshot 2024-12-18 at 09 50 07

What we know so far

  1. Starting on the 12th of December, a number of validators had availability bitfields of 0 in some relay chain blocks. That seems to point to availability slowness on those validators, for all parachains, on just some blocks; before and after such a block their availability bitfields look alright.
  2. Validators of all versions seem to be affected; see the list here: https://matrix.to/#/!NZrbtteFeqYKCUGQtr:matrix.parity.io/$17344363742DoDav:parity.io?via=parity.io&via=corepaper.org&via=matrix.org.
  3. Restarting the node seems to help. I tested with 3 validators (2 upgraded from 1.16.2 to 1.17 and 1 restarted on 1.15.2); all seemed to recover after restart.
  4. We just released polkadot v1.17, so I assume that's the reason the network recovered and parachain block times look better now: some validators upgraded.

Next

  • Root-cause what made some validators enter this state.
  • Monitor the network to understand whether this problem is coming back or was just a one-off.
  • Determine if there are still validators that seem affected by this issue.

@aaron2048

Next

  • Root-cause what made some validators enter this state.
  • Monitor the network to understand whether this problem is coming back or was just a one-off.
  • Determine if there are still validators that seem affected by this issue.

Monitoring on an ongoing basis, even once the issue is resolved, would be great.

@wilwade
Contributor Author

wilwade commented Dec 18, 2024

I think there is a group that is still having trouble. In the Polkadot epoch from ~0900-1200 UTC today (2024-12-18) both Mythos and Frequency had the exact same total seconds of block delays.

Here's the last 48 hours, and you can see the recovery noted by @alexggh, but then a dip from ~0900-1200 UTC:

image

That timing lines up really closely with the epoch timing on Polkadot, so I'm thinking this might still be part of the issue, or perhaps closer to the root cause?

@alexggh
Contributor

alexggh commented Dec 19, 2024

Root cause

Alright, I think I've got the full picture now. We've got this long-standing issue #5258 (all known versions except v1.17.0 are affected), where availability-distribution's CPU usage increases slowly but steadily, by around 1% of CPU usage per day. So the longer a validator runs, the more availability-distribution slows down, to the point where it can't fetch its PoV chunks and populate the bitfields in time (1.5s after the block was imported).

This is the last 6 months from a Polkadot validator affected by this issue (kudos to tugy | Amforc):
tuggy

So, the longer a validator runs before a restart, the slower it gets. The validators I've been in contact with ran for 30 days and ~3 months, respectively.

Therefore, once the active set has enough such validators (~1/3), parachain inclusion slows down as well and starts skipping blocks, because fewer than 2/3 of validators informed the block author that they have their PoV chunk.

Solution

The issue was accidentally fixed by #5875, which is included in the v1.17.0 release. Validators that upgraded are no longer missing bitfields and their CPU usage is no longer increasing; this was also confirmed on Versi by running the versions before and after the commit.

Conclusion

I don't expect a repeat of this; there are already 184 validators on Polkadot that have upgraded to the unaffected polkadot v1.17.0, with more probably to follow shortly.

Action items

  • Notify validators on the validator channels that if their node has run for more than 30 days without a restart, they might need to restart it every now and then, or simply upgrade to the latest version, which is unaffected by this issue.
  • Find a home for a dashboard that aggregates on-chain data and shows validators that are not populating their availability bitfields, so that we can pinpoint when things go awry (see the sketch below).
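As a starting point for that dashboard, a small sketch that aggregates the per-block check over a range of relay chain blocks (it reuses the hypothetical `zero_bitfield_validators()` helper and `relay` connection sketched earlier in this thread, with the same caveats):

```python
# Sketch for the dashboard idea above: count, per validator index, how often an
# all-zero availability bitfield appears over a range of relay chain blocks.
# Reuses the zero_bitfield_validators() helper and `relay` connection from the
# earlier sketch in this thread (same assumptions about field names apply).
from collections import Counter

def missing_bitfield_ranking(start: int, end: int) -> Counter:
    counts = Counter()
    for number in range(start, end + 1):
        counts.update(zero_bitfield_validators(relay.get_block_hash(number)))
    return counts

# e.g. rank validators over the last 100 relay chain blocks:
# head = relay.get_block_number(relay.get_chain_head())
# print(missing_bitfield_ranking(head - 100, head).most_common(10))
```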
