Clean sync Goerli broken(?) post merge on 1.10.21 and 1.10.23 #25693

Closed
ulope opened this issue Sep 6, 2022 · 42 comments

Comments

@ulope
Member

ulope commented Sep 6, 2022

System information

Geth version: v1.10.21 / v1.10.23
OS & Version: Linux

Expected behaviour

A fresh sync of Goerli from scratch should work.

Actual behaviour

Does not work.

  • Geth 1.10.21:
    • Snap sync completes, afterwards State heal seems to continue indefinitely (aborted after 4 days)
    • Last state heal log before abort:
      INFO [09-05|14:37:35.138] State heal in progress [email protected] [email protected] [email protected] nodes=16,319,[email protected] pending=4369
    • Many, many Unexpected trienode heal packet messages (~70% of all log lines)
    • Pivot is only changed 12 times. Last one:
      WARN [09-02|16:42:58.589] Pivot seemingly stale, moving old=7,516,913 new=7,516,977
    • Suspiciously, new blocks are no longer imported:
      INFO [09-05|14:37:34.872] Imported new block headers count=0 elapsed=15.422ms number=7,382,818 hash=aa32c4..48c7cc age=3w4d12h ignored=178
      • Probably because of:
        Local chain is post-merge, waiting for beacon client sync switch-over...
      • However, Prysm claims geth is not synced:
        level=error msg="Unable to process past deposit contract logs, perhaps your execution client is not fully synced" error="no contract code at given address" prefix=powchain
  • Geth 1.10.23:
    • Syncing never even starts
      No sync progress is reported, apparently every peer is dropped because of:
      WARN [09-06|09:25:04.945] Snapshot extension registration failed peer=6de3885d err="peer connected on snap without compatible eth support"

Steps to reproduce the behaviour

Compose file used:

version: "3"

services:
  geth:
    image: ethereum/client-go:v1.10.23
    restart: always
    network_mode: host
    stop_grace_period: 1m
    volumes:
      - /data/geth:/data
      - jwt:/jwt
    command: >
      --goerli
      --datadir=/data
      --http
      --http.api eth,net,web3,txpool
      --http.addr 0.0.0.0
      --http.corsdomain '*'
      --ws
      --ws.api eth,web3,net
      --authrpc.jwtsecret /jwt/jwtsecret
      --authrpc.vhosts '*'
      --metrics
      --metrics.addr=0.0.0.0
      --metrics.port=9191

  prysm:
    image: gcr.io/prysmaticlabs/prysm/beacon-chain:stable
    restart: always
    network_mode: host
    volumes:
      - /data/prysm:/data
      - jwt:/jwt
    command: >
      --goerli
      --datadir=/data
      --rpc-host=0.0.0.0
      --http-web3provider=http://localhost:8551
      --jwt-secret=/jwt/jwtsecret
      --accept-terms-of-use

volumes:
  jwt:
@fjl
Contributor

fjl commented Sep 6, 2022

Syncing never even starts
No sync progress is reported

Can you provide more information that leads you to this conclusion? It can take a while to start sometimes, just leave it running for a bit.

@rjl493456442
Member

We have a couple of PRs for fixing/improving snap sync on master recently. Maybe you can try to use master once we merge this PR #25694

@ulope
Member Author

ulope commented Sep 6, 2022

Syncing never even starts
No sync progress is reported

Can you provide more information that leads you to this conclusion? It can take a while to start sometimes, just leave it running for a bit.

Are 20 hours enough? ;)

These are the top non-unique log messages by count:

1776 Snapshot extension registration failed
 236 Beacon client online, but never received consensus updates. Please ensure your beacon client is operational to follow the chain!
  73 Dropping unsynced node during sync
  28 Looking for peers
  20 Regenerated local transaction journal
  19 Writing clean trie cache to disk
  19 Persisted the clean trie cache
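
For reference, counts like the ones above can be reproduced with a small shell helper. This is only a sketch: the function name `top_log_messages` is made up for illustration, and it assumes geth's default terminal log layout (`LEVEL [timestamp] message key=value ...`) as seen in the log lines quoted in this thread.

```shell
# Count geth log lines by message text, most frequent first.
# Assumption: lines look like "WARN [09-06|09:25:04.945] message key=value ...".
# $1: path to a geth log file
top_log_messages() {
  # 1) drop the level and timestamp prefix
  # 2) drop everything from the first " key=value" pair onwards
  # 3) drop trailing padding
  sed -E 's/^[A-Z]+ +\[[^]]*\] +//; s/ +[A-Za-z_.]+=.*$//; s/ +$//' "$1" \
    | sort | uniq -c | sort -rn
}
```

Running `top_log_messages geth.log | head` (with `geth.log` being whatever file the node's output was captured to) gives a quick overview of which messages dominate.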

I also re-tried this locally before submitting the issue (the above setup is running on a droplet).

  • 1.10.21 starts syncing within minutes.
  • 1.10.23 has been running for 15+ minutes and shows the same behaviour as on the droplet (i.e. no sync and lots of Snapshot extension registration failed)

@ligi ligi removed the status:triage label Sep 6, 2022
@ulope
Member Author

ulope commented Sep 7, 2022

Update: It's now been almost 40h and still no sync. Going to stop it now and try with latest master.

@holiman
Contributor

holiman commented Sep 7, 2022

I'm wondering if your node actually manages to find any peers -- is your firewall sufficiently open, so bidirectional communication can occur over the relevant ports ?

@ulope
Member Author

ulope commented Sep 7, 2022

Next update:
Using commit 5ddedd2 (which includes #25694, @rjl493456442) I'm seeing the same behaviour - No sync, Snapshot extension registration failed.
Left it running for 30min.

@holiman As I wrote in the initial report 1.10.21 was able to sync (but then got stuck in the heal stage).
Just to double-check, I switched back to the 1.10.21 docker image and sync started within ~30 seconds (but I'll just assume it will run into the healing issue again, as it did in the last two attempts).
Edit: Also it definitely finds peers. Here's the complete log of the run: https://gist.github.com/ulope/776e87c5bb1e3fcd893d1a512a2f6f48

Also as I wrote yesterday, I can replicate the non-syncing behaviour locally by just running geth --goerli (>=1.10.23) with an empty datadir

@holiman
Contributor

holiman commented Sep 7, 2022

@ulope that log says

Beacon client online, but never received consensus updates. Please ensure your beacon client is operational to follow the chain!

Goerli is post-merge, it needs the beacon client to tell it what the head is, and then it will sync to that.

@ulope
Member Author

ulope commented Sep 7, 2022

@holiman Unless I'm very much mistaken (and the release notes are wrong) 1.10.21 is just as aware of the goerli merge as later versions. However it does start snap syncing immediately as mentioned.

Also with both versions Prysm seems to wait for geth to be synced to some degree because it continually logs:

level=error msg="Unable to process past deposit contract logs, perhaps your execution client is not fully synced" error="no contract code at given address" prefix=powchain

(I also tried other beacon clients before, but didn't record their output unfortunately. Can check again if that's helpful.)

So either I'm missing something else or this looks like a chicken-and-egg problem.

@holiman
Contributor

holiman commented Sep 7, 2022

They are both aware, but not quite "as aware" :) The more recent version will spit out something like

INFO [09-07|11:25:00.972] Merge configured: 
INFO [09-07|11:25:00.972]  - Hard-fork specification:    https://github.com/ethereum/execution-specs/blob/master/network-upgrades/mainnet-upgrades/paris.md 
INFO [09-07|11:25:00.972]  - Network known to be merged: true 

The difference being that this flag is set for goerli:

	// TerminalTotalDifficultyPassed is a flag specifying that the network already
	// passed the terminal total difficulty. Its purpose is to disable legacy sync
	// even without having seen the TTD locally (safer long term).
	TerminalTotalDifficultyPassed bool `json:"terminalTotalDifficultyPassed,omitempty"`

See #24538 for more info:

The rationale is that once a network transitions into PoS mode, sync is directed by the beacon client. If a new Geth instance is started many years down the line however, it will not know of the transition event, so will still attempt to do a PoW based legacy sync. The legacy sync will need to "fail" when TTD is reached and sync swapped from legacy algo to beacon algo.

@Snehapati11

We are closely following this issue as we are also receiving the same message “Beacon client online, but never received consensus updates. Please ensure your beacon client is operational to follow the chain!”. Our current setup is GETH 1.10.23 and prysm alpine image. We are connecting to network goerli prater.

@fjl
Contributor

fjl commented Sep 7, 2022

Maybe something is broken in Prysm <-> Geth in TTD-passed mode? AFAIK geth needs a signal from the CL to start syncing. Could be that this signal doesn't come?

@fjl
Contributor

fjl commented Sep 7, 2022

What should happen is: geth should print something like

INFO [09-07|15:54:25.313] Forkchoice requested sync to new head    number=7,547,668 hash=2218c5..1c5aa2

If it doesn't print that, but a CL is attached, it means the CL is not sending FcU requests, probably because it is waiting for geth to sync.
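
A quick way to check which of the two situations a node is in is to grep its log for the two messages quoted in this thread. The helper below is a rough sketch; the function name `cl_signal_state` and its return strings are invented for illustration, and it assumes geth's output was captured to a file.

```shell
# Report whether geth has received a sync target from the consensus client.
# $1: path to a geth log file (assumption: terminal output redirected to a file)
cl_signal_state() {
  if grep -q 'Forkchoice requested sync to new head' "$1"; then
    # The CL has sent at least one forkchoice update; geth has a sync target.
    echo "sync-target-received"
  elif grep -q 'Beacon client online, but never received consensus updates' "$1"; then
    # A CL is attached over the engine API but has not sent updates yet.
    echo "cl-connected-no-updates"
  else
    echo "unknown"
  fi
}
```

If this prints `cl-connected-no-updates`, the place to look next is the consensus client's own sync status rather than geth.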

@fjl
Contributor

fjl commented Sep 7, 2022

Can confirm that geth from master branch (at commit d30e39b) did start syncing after a couple seconds. I ran geth like this:

./build/bin/geth --goerli --http

And lighthouse like this:

lighthouse beacon_node --network goerli --execution-endpoint http://127.0.0.1:8551 --execution-jwt ~/.ethereum/goerli/geth/jwtsecret --checkpoint-sync-url http://...

@ulope
Member Author

ulope commented Sep 7, 2022

@Snehapati11 It seems that Prysm doesn't even start syncing the beacon chain in our setup. Is that the case for you too?

I've tried now with both lighthouse and nimbus. They both at least start syncing the beacon chain (very very slowly though, current ETA 6d+)

@fjl Hm, I assume that the start-syncing signal will only come once the beacon chain is synced. So this might not be a geth problem after all.

I'll investigate further.

@ulope
Member Author

ulope commented Sep 7, 2022

@fjl Was your lighthouse node already synced?

@fjl
Contributor

fjl commented Sep 7, 2022

I used checkpoint sync and it's a bit faster. See this guide for more info: https://lighthouse-book.sigmaprime.io/checkpoint-sync.html#use-infura-as-a-remote-beacon-node-provider

@ulope
Member Author

ulope commented Sep 7, 2022

@fjl But at the point where you started both clients the beacon chain wasn’t finished syncing yet?

I’ll try replicating that tomorrow.

@fjl
Contributor

fjl commented Sep 7, 2022

I start both clients in quick succession, and they both go into sync kind of quickly. This is with a completely blank DB.

Pretty sure this is an issue with prysm. Maybe it doesn't enable optimistic sync by default?

@begetan

begetan commented Sep 7, 2022

I confirm that Geth fresh sync is broken for Goerli on Geth v1.10.23 with Prysm v3.1.0
I feel like it was working with Prysm v3.0.0

We run the same setup as for Ropsten and Mainnet and they are fine! This issue is repeatable for different nodes!

Here are the Geth logs:
WARN [09-07|22:19:10.142] Beacon client online, but never received consensus updates. Please ensure your beacon client is operational to follow the chain!

Here are the Prysm logs:

time="2022-09-07 22:21:47" level=info msg="Processing block batch of size 63 starting from 0xd0c9baf6... 85760/3840108 - estimated time remaining 166h51m35s" blocksPerSecond=6.2 peers=200 prefix=initial-sync
time="2022-09-07 22:21:57" level=info msg="Ready for The Merge" latestDifficulty=1 network=prater prefix=powchain terminalDifficulty=10790000

Geth sync status:

curl -s -X POST -H "Content-Type:application/json" --data '{"jsonrpc":"2.0","method":"eth_syncing","id":1}' localhost:8545
{"jsonrpc":"2.0","id":1,"result":false}

There is no data in the Geth directory.

We also know that sync status is broken for the latest Geth, because a bunch of our monitoring tools have issues
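
Note that `eth_syncing` returning `false` is ambiguous on its own: it is the same answer for a node that has finished syncing and for one that never started. A minimal sketch for telling the two apart is to also look at `eth_blockNumber`. The helper names `rpc` and `classify_sync` are invented here, and the string matching assumes the compact JSON that geth returns in the response shown above.

```shell
# Send a parameterless JSON-RPC call to the local geth HTTP endpoint.
# $1: method name, e.g. eth_syncing
rpc() {
  curl -s -X POST -H "Content-Type:application/json" \
    --data "{\"jsonrpc\":\"2.0\",\"method\":\"$1\",\"id\":1}" localhost:8545
}

# Classify sync state from two raw JSON-RPC response bodies.
# $1: response of eth_syncing, $2: response of eth_blockNumber
classify_sync() {
  case "$1" in
    *'"result":false'*) ;;            # not syncing: inspect the head block
    *) echo "syncing"; return ;;      # result is a progress object
  esac
  case "$2" in
    *'"result":"0x0"'*) echo "not-started" ;;   # still at genesis
    *) echo "idle-or-synced" ;;
  esac
}
```

Usage would be `classify_sync "$(rpc eth_syncing)" "$(rpc eth_blockNumber)"`; an answer of `not-started` matches the situation described here, where the data directory stays empty.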

@begetan

begetan commented Sep 7, 2022

I've tried a different version of Prysm and it didn't help.

Geth version 1.10.21 started syncing immediately.

@begetan

begetan commented Sep 7, 2022

@ulope I just want to say that an infinite "State heal in progress" may be due to a hardware issue. If you run in the cloud, try spinning up a new machine. If you run on bare metal, you probably need better hardware. This is unrelated to the broken sync issue, but it's quite common. I am seeing it in 10-20% of launches on low-spec machines.

@Snehapati11

@ulope The beacon node sync is still in progress. Are you facing any issues during the geth Goerli syncing process? It shows that the beacon client is online but not passing consensus updates.

@Snehapati11

@ulope We have just successfully tested Goerli with geth version 1.10.23 and Prysm version 3.0.0, and we are no longer seeing the message "Beacon client online, but never received consensus updates. Please ensure your beacon client is operational to follow the chain!".

It would appear prysm version 3.1.0 has an issue as you suggested.

@begetan

begetan commented Sep 9, 2022

Erigon + Prysm v3.1.0 successfully synced from scratch.

@0xDualCube

checkpoint

the checkpoint sync was key for me to get lighthouse to poke geth and get it to start syncing

https://notes.ethereum.org/@launchpad/checkpoint-sync#EF-DevOps-Endpoints

@MariusVanDerWijden
Member

Looks like this issue is resolved, will close. Feel free to open a new issue if geth sync is broken for you

@begetan

begetan commented Sep 13, 2022

Why did you close a critical issue without a fix?

It should either be fixed, or it should be officially announced that the old fresh synch method is deprecated.

@MariusVanDerWijden
Member

We've successfully synced multiple nodes on goerli and never ran into this issue.
What do you mean by "old fresh synch method is deprecated."?

@begetan

begetan commented Sep 13, 2022

I reproduced this issue again today on Goerli and on Ropsten as well. It does not happen every time: on Ropsten it occurred in 2 out of 4 tries, and on Goerli in 3 out of 4 tries.

This issue will probably appear on Mainnet after The Merge, because the sync conditions have changed for all post-merge networks.

@fjl fjl reopened this Sep 13, 2022
@fjl
Contributor

fjl commented Sep 13, 2022

The problem here is with the CL clients (Prysm, Lighthouse, etc.). The CL client needs to start syncing the beacon chain optimistically and start delivering ForkchoiceUpdated requests to geth, otherwise geth will not start syncing.

@fjl
Contributor

fjl commented Sep 13, 2022

I have brought this up in chat with CL devs, let's see how they respond.

@begetan

begetan commented Sep 13, 2022

Replacing geth with version v1.10.21 always solves the problem.
Unfortunately, the logs provided in the first message are not very relevant to this issue.

It would be better to open a new issue with more relevant details.

@fjl
Contributor

fjl commented Sep 13, 2022

geth v1.10.21 'works' because it always starts the legacy non-PoS sync. It's not a good fix long term.

@begetan

begetan commented Sep 13, 2022

@fjl this is a more relevant description of the fresh sync issue: #25753

@ulope
Member Author

ulope commented Sep 13, 2022

Sorry for the late reply (with the merge looming time is a bit scarce). So with a checkpoint synced lighthouse and geth 1.10.22+ I was able to successfully sync.

So I'd say at least for me this was definitely (in part) user error.

However, having said that, I do find this a very drastic change in behaviour, especially for a patch release. Syncing has always started on its own throughout geth's 7+ year history.

IMO this should have been geth 2.0.

@fjl
Contributor

fjl commented Sep 15, 2022

After the merge, Geth requires input from the consensus layer to find the correct chain. There is no way for it to know the sync target without the CL. This is a protocol limitation, and it's why we changed it in the release after the merge on Goerli.

We are working on alternatives to the engine API connection, so Geth may potentially be able to sync on its own again in the future.

@fjl fjl closed this as completed Sep 15, 2022
@daedlock

daedlock commented Sep 18, 2022

I want to run EL without CL! So, I can confirm geth-v1.10.21 fixes the sync stalling. Looking forward to a long term fix in HEAD

@thomaseth2

geth v1.10.25 still seems to have a similar issue: the beacon node takes ages to sync and doesn't show correct time estimates.

I keep seeing this from Prysm:
INFO powchain: Ready for The Merge latestDifficulty=17179869184 network=mainnet terminalDifficulty=58750000000000000000000

And this from geth:
Snapshot extension registration failed peer=a05e766c err="peer connected on snap without compatible eth support"

Does every new sync, even with a light node, now sync from the start by default?

I read that the fix seems to be checkpoint sync.

As someone who was able to spin up a node quickly before, this seems like a downgrade in usability from past geth versions.

@thomaseth2

So for Prysm, is there really a sync checkpoint other than local or testnet nodes or downloading a file? https://notes.ethereum.org/@launchpad/checkpoint-sync

@thomaseth2

This checkpoint is a life changer: --checkpoint-sync-url=https://beaconstate.ethstaker.cc --genesis-beacon-api-url=https://beaconstate.ethstaker.cc

@alperensozer

alperensozer commented Nov 16, 2022

Geth/v1.10.26 + Prysm 3.1.2: still the same problem on goerli-prater from scratch. After 2 days, still no syncing.

But with

--checkpoint-sync-url=https://goerli.checkpoint-sync.ethpandaops.io
--genesis-beacon-api-url=https://goerli.checkpoint-sync.ethpandaops.io

it started syncing immediately. Thanks @thomaseth2 for https://notes.ethereum.org/@launchpad/checkpoint-sync.
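
For anyone using the compose file from the top of this thread, the two flags would simply be appended to the existing Prysm command. The sketch below only prints the resulting command line as a dry run. The function name `prysm_checkpoint_cmd` is made up, the binary name `beacon-chain` is an assumption taken from the image name, and the remaining flags are copied from the compose file rather than verified against any particular Prysm release.

```shell
# Print (do not run) a Prysm beacon-chain command line with checkpoint sync.
# $1: base URL of a checkpoint sync provider; assumed to serve both endpoints,
#     as in the flags quoted above.
prysm_checkpoint_cmd() {
  printf '%s ' beacon-chain \
    --goerli \
    --datadir=/data \
    --http-web3provider=http://localhost:8551 \
    --jwt-secret=/jwt/jwtsecret \
    --accept-terms-of-use \
    "--checkpoint-sync-url=$1" \
    "--genesis-beacon-api-url=$1"
  printf '\n'
}
```

Calling `prysm_checkpoint_cmd https://goerli.checkpoint-sync.ethpandaops.io` prints the full command so it can be compared against the compose `command:` block before changing anything.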

@icemagno

icemagno commented Aug 17, 2023

Any progress?
I can't get my private blockchain to sync. Same error.

My execution client can connect to another execution client, but it doesn't sync.

The beacon node keeps complaining level=error msg="Unable to process past deposit contract logs, perhaps your execution client is not fully synced" error="no contract code at given address" prefix=powchain forever, and it receives a "goodbye" from the other consensus peer when it tries to connect.
