
Longevity job is getting error "Storage error: Hummock error: ReadCurrentEpoch error Cannot read when cluster is under recovery" #7841

Closed
Tracked by #6640
sushantgupta opened this issue Feb 11, 2023 · 7 comments

@sushantgupta

Describe the bug

The recent longevity run is failing with the error "Storage error: Hummock error: ReadCurrentEpoch error Cannot read when cluster is under recovery" while executing a "select * from LIMIT 1" query.

Job Details:
https://buildkite.com/risingwave-test/longevity-test/builds/359#01863c49-2ff3-4ded-8e38-821bd2136889

Step/timeline:

  1. 17:08 UTC: created materialized views q22, q101, q102 with 'STREAMING_PARALLELISM=3'.

  2. 18:38 UTC: able to fetch records from the materialized views.

     [screenshot]

  3. 19:08 UTC: now unable to fetch data from the materialized views; queries fail with "Storage error: Hummock error: ReadCurrentEpoch error Cannot read when cluster is under recovery".

However, there was not a single pod crash.

[screenshot]

To Reproduce

No response

Expected behavior

No response

Additional context

No response

@hzxa21
Collaborator

hzxa21 commented Feb 12, 2023

This error is expected when there are queries running with VISIBILITY_MODE=all (the default) during recovery, because the compute nodes' in-memory states are cleared on recovery, which would produce incorrect results for VISIBILITY_MODE=all queries in some cases (see #7188 (comment)).

We can retry the batch query after recovery is done.

@sumittal
Contributor

@hzxa21 As this error came with the default setting, what should be done to avoid such errors?

@fuyufjh
Member

fuyufjh commented Feb 13, 2023

As this error came with the default setting, what should be done to avoid such errors?

This is expected after some node crashes and the cluster is recovering itself. But why did a node fail during the longevity test? @sumittal

@zwang28
Contributor

zwang28 commented Feb 13, 2023

This error itself is expected as explained by Patrick.

The problem in this test is exactly the same as in rwc-3-longevity-20230208-170541: the cluster cannot complete recovery because:

  • The meta node was restarted without any error log at 2023-02-10T18:43 UTC.
  • All existing worker nodes cannot reach the new meta node (why?), so they are treated as expired and removed from the cluster by the new meta node 5 minutes later (heartbeat timeout). (The new meta node, however, seems able to reach the worker nodes, according to its log.)
  • The compactor node and frontend node were restarted later and can reach the new meta node.
  • The compute nodes were not restarted, so there are no compute nodes in the cluster.

I suspect there is some meta address resolution issue and am investigating.

BTW, if a worker node's heartbeat requests don't succeed for a long time (10 minutes here), the worker node is expected to exit, which does not seem to have happened in this test (no "worker expired" found in the CN's log). @yezizp2012
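The expected worker-side behavior described above can be sketched as follows. This is a hypothetical illustration, not RisingWave's actual implementation; the only detail taken from the comment is the 10-minute expiry threshold.

```python
EXPIRE_AFTER_SECS = 10 * 60  # worker should exit after 10 minutes without a successful heartbeat

class HeartbeatWatchdog:
    """Tracks the last successful heartbeat and decides when the worker should exit."""

    def __init__(self, now_secs):
        self.last_success = now_secs

    def record_success(self, now_secs):
        # Called whenever a heartbeat request to the meta node succeeds.
        self.last_success = now_secs

    def should_exit(self, now_secs):
        # Exit once the expiry threshold has elapsed with no successful heartbeat.
        return now_secs - self.last_success >= EXPIRE_AFTER_SECS

# Simulate: heartbeats stop succeeding at t=0.
wd = HeartbeatWatchdog(now_secs=0)
print(wd.should_exit(now_secs=5 * 60))   # False: only 5 minutes without a heartbeat
print(wd.should_exit(now_secs=10 * 60))  # True: threshold reached, worker should exit
```

In this test the compute nodes apparently never took this exit path, which is why they stayed up yet remained unknown to the new meta node.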

@yezizp2012
Member

I'm taking a day off today. @shanicky would you please help take a look. 🙏

@fuyufjh
Member

fuyufjh commented Feb 20, 2023

See #7841 (comment)

@fuyufjh closed this as not planned (won't fix, can't repro, duplicate, stale) on Feb 20, 2023.
7 participants