
Longevity job is getting error "Storage error: Hummock error: ReadCurrentEpoch error Cannot read when cluster is under recovery" #7841

Closed
Tracked by #6640
sushantgupta opened this issue Feb 11, 2023 · 7 comments

@sushantgupta

Describe the bug

The recent longevity run is failing with the error "Storage error: Hummock error: ReadCurrentEpoch error Cannot read when cluster is under recovery" while executing a "select * from LIMIT 1" query.

Job Details:
https://buildkite.com/risingwave-test/longevity-test/builds/359#01863c49-2ff3-4ded-8e38-821bd2136889

Step/timeline:

  1. 17:08 UTC: created materialized views q22, q101, q102 with 'STREAMING_PARALLELISM=3'.

  2. 18:38 UTC: able to fetch records from the materialized views.

     [screenshot]

  3. 19:08 UTC: now unable to fetch data from the materialized views; queries fail with "Storage error: Hummock error: ReadCurrentEpoch error Cannot read when cluster is under recovery".

However, there was not a single pod crash.

[screenshot]

To Reproduce

No response

Expected behavior

No response

Additional context

No response

@hzxa21
Collaborator

hzxa21 commented Feb 12, 2023

This error is expected when there are queries running with VISIBILITY_MODE=all (the default) during recovery, because the compute nodes' in-memory states are cleared on recovery, which would produce incorrect results for VISIBILITY_MODE=all queries in some cases (see #7188 (comment)).

We can retry the batch query after recovery is done.

@sumittal
Contributor

@hzxa21 As this error came with the default setting, what should be done to avoid such errors?

@fuyufjh
Member

fuyufjh commented Feb 13, 2023

As this error came with the default setting, what should be done to avoid such errors?

This is expected after some node crashes and the cluster is recovering itself. But why did a node fail during the longevity test? @sumittal

@zwang28
Contributor

zwang28 commented Feb 13, 2023

This error itself is expected as explained by Patrick.

The problem in this test is exactly the same as in rwc-3-longevity-20230208-170541: the cluster cannot complete recovery because:

  • The meta node was restarted without any error log at 2023-02-10T18:43 UTC.
  • All existing worker nodes cannot reach the new meta node (why?), so they are treated as expired and removed from the cluster by the new meta node 5 minutes later (heartbeat timeout). (The new meta node, however, seems able to reach the worker nodes, according to its log.)
  • The compactor node and frontend node were restarted later and can reach the new meta node.
  • The compute nodes were not restarted, so there are no compute nodes in the cluster.

I suspect there is some meta address resolution issue and am investigating.

BTW, if a worker node's heartbeat requests don't succeed for a long time (10 minutes here), the worker node is expected to exit, which does not seem to have happened in this test (no "worker expired" found in the CN's log). @yezizp2012
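The expected worker-side behavior described above can be sketched as follows. This is a hypothetical illustration, not RisingWave's actual implementation; the only detail taken from the comment is the 10-minute expiry threshold.

```python
EXPIRE_AFTER_SECS = 10 * 60  # worker should exit after 10 minutes without a successful heartbeat

class HeartbeatWatchdog:
    """Tracks the last successful heartbeat and decides when the worker should exit."""

    def __init__(self, now_secs):
        self.last_success = now_secs

    def record_success(self, now_secs):
        # Called whenever a heartbeat request to the meta node succeeds.
        self.last_success = now_secs

    def should_exit(self, now_secs):
        # Exit once the expiry threshold has elapsed with no successful heartbeat.
        return now_secs - self.last_success >= EXPIRE_AFTER_SECS

# Simulate: heartbeats stop succeeding at t=0.
wd = HeartbeatWatchdog(now_secs=0)
print(wd.should_exit(now_secs=5 * 60))   # False: only 5 minutes without a heartbeat
print(wd.should_exit(now_secs=10 * 60))  # True: threshold reached, worker should exit
```

In this test the compute nodes apparently never took this exit path, which is why they stayed up yet remained unknown to the new meta node.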

@yezizp2012
Member

I'm taking a day off today. @shanicky would you please help take a look. 🙏

@fuyufjh
Member

fuyufjh commented Feb 20, 2023

See #7841 (comment)

@fuyufjh closed this as not planned (won't fix, can't repro, duplicate, stale) on Feb 20, 2023.
7 participants