core/rawdb: avoid exiting geth from freezer on fsync failure (ref #22112) #22118

Closed
wants to merge 1 commit into from

holiman
Contributor

@holiman holiman commented Jan 5, 2021

This is a partial fix for #22112.

In that ticket, an fsync operation failed, for reasons that are still unclear. When that happens, the current code hard-exits without any kind of recovery.

The current behaviour is wrong, and is prone to corrupting the database since we don't properly close anything. In the particular case where the error occurs, it's actually possible to just un-write the data that we just wrote (truncate), back off and try again later. That would probably have a higher chance of working.

If the error persists, the move-data-from-leveldb-to-ancient step simply becomes a no-op, but it won't cause corruption.
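The truncate-and-retry idea described above can be illustrated with a minimal sketch. This is not the code from this PR or from go-ethereum: `syncFile`, `appendOrRollback`, `flakySync` and `runDemo` are hypothetical names made up for the example, and the sync failure is simulated. Whether truncating after a failed fsync is actually safe depends on the OS and filesystem, which the sketch does not address.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
)

// syncFile is the subset of *os.File used below. It is defined here only so
// that a failing Sync can be simulated; it is not a go-ethereum type.
type syncFile interface {
	io.WriteSeeker
	Sync() error
	Truncate(size int64) error
}

// appendOrRollback (hypothetical) appends data at offset size and syncs it.
// If the write or the sync fails, it truncates the file back to size so the
// caller can back off and retry later. It returns the new durable size.
func appendOrRollback(f syncFile, size int64, data []byte) (int64, error) {
	if _, err := f.Seek(size, io.SeekStart); err != nil {
		return size, err
	}
	_, err := f.Write(data)
	if err == nil {
		err = f.Sync()
	}
	if err != nil {
		// Un-write whatever was appended, including a partial write.
		if terr := f.Truncate(size); terr != nil {
			return size, fmt.Errorf("append failed (%v), rollback failed: %w", err, terr)
		}
		return size, err
	}
	return size + int64(len(data)), nil
}

// flakySync wraps a real file and fails the first `failures` Sync calls,
// standing in for a transient I/O error.
type flakySync struct {
	*os.File
	failures int
}

func (f *flakySync) Sync() error {
	if f.failures > 0 {
		f.failures--
		return errors.New("simulated input/output error")
	}
	return f.File.Sync()
}

// runDemo returns the file size after the failed attempt and after the retry.
func runDemo() (afterFail, afterRetry int64, err error) {
	tmp, err := os.CreateTemp("", "freezer-sketch-*")
	if err != nil {
		return 0, 0, err
	}
	defer os.Remove(tmp.Name())
	defer tmp.Close()

	f := &flakySync{File: tmp, failures: 1}
	payload := []byte("block-body")

	size, ferr := appendOrRollback(f, 0, payload)
	if ferr == nil {
		return 0, 0, errors.New("expected the first attempt to fail")
	}
	info, err := tmp.Stat()
	if err != nil {
		return 0, 0, err
	}
	afterFail = info.Size()

	// Second attempt: the simulated error has gone away.
	size, err = appendOrRollback(f, size, payload)
	if err != nil {
		return afterFail, size, err
	}
	return afterFail, size, nil
}

func main() {
	afterFail, afterRetry, err := runDemo()
	fmt.Println(afterFail, afterRetry, err)
}
```

The helper returns the unchanged size on failure, so the caller's view of the file stays consistent with what is on disk and a later attempt can start from the same offset.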


@DGKSK8LIFE DGKSK8LIFE left a comment

AppVeyor CI tests are failing:

--- FAIL: TestTransactionPropagation65 (2.45s)
    handler_eth_test.go:459: sink 0: transaction propagation timed out: have 0, want 1024
FAIL
coverage: 21.4% of statements
FAIL	github.com/ethereum/go-ethereum/eth	16.913s

@fjl fjl changed the title from "core/rawdb: don't hard-exit from freezer on fsync failure (ref #22112)" to "core/rawdb: avoid exiting geth from freezer on fsync failure (ref #22112)" on Jan 19, 2021
@holiman
Contributor Author

holiman commented Jan 19, 2021

Closing this after discussion - it might be better to close (hard exit) and force a re-open of all files.

@holiman holiman closed this Jan 19, 2021
@vdamle
Contributor

vdamle commented Jan 22, 2021

Thanks for providing additional details on Discord, @holiman. For context, we are seeing on Azure that writes to Azure FS experience transient failures that seem to go away on subsequent retries (which currently means after a restart of geth), such as:

CRIT [01-18|03:49:02.013] Failed to flush frozen tables            err="[sync /<path>/ethereum/geth/chaindata/ancient/bodies.cidx: input/output error]"

In such environments, we believe there's value in retrying on the next timer, to see if such a transient error has gone away. Please let me know your thoughts.
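The "retry on the next timer" behaviour suggested above could look roughly like the sketch below. It is only an illustration: `retryOnTick` and `runTickDemo` are hypothetical names, the flush function and its error are simulated, and the channel stands in for something like the `C` channel of a `time.Ticker`. It does not reflect how the geth freezer loop is actually structured.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryOnTick (hypothetical) runs flush on every tick. A failed flush is
// reported and retried on the next tick instead of terminating the process.
// It returns the number of failed attempts once ticks is closed.
func retryOnTick(ticks <-chan time.Time, flush func() error) int {
	failures := 0
	for range ticks {
		if err := flush(); err != nil {
			failures++
			fmt.Println("flush failed, will retry on next tick:", err)
		}
	}
	return failures
}

// runTickDemo feeds three ticks to a flush that fails only on its first call.
func runTickDemo() (failures, flushed int) {
	ticks := make(chan time.Time, 3)
	for i := 0; i < 3; i++ {
		ticks <- time.Now()
	}
	close(ticks)

	calls := 0
	flush := func() error {
		calls++
		if calls == 1 {
			return errors.New("simulated input/output error")
		}
		flushed++
		return nil
	}
	failures = retryOnTick(ticks, flush)
	return failures, flushed
}

func main() {
	failures, flushed := runTickDemo()
	fmt.Println("failures:", failures, "successful flushes:", flushed)
}
```

A loop like this trades the forced re-open of all files, which a hard exit gives, for the chance that a transient error clears by itself; it is that trade-off the thread is weighing.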

vdamle pushed a commit to kaleido-io/quorum that referenced this pull request Jan 22, 2021
vdamle pushed a commit to kaleido-io/quorum that referenced this pull request Jan 26, 2021