bug(compaction): unexpected l0->lbase task including more l0 inputs than level0_max_compact_file_number #11154

Closed
Tracked by #6640
zwang28 opened this issue Jul 24, 2023 · 1 comment · Fixed by #11204

zwang28 commented Jul 24, 2023

Describe the bug

The hard limit level0_max_compact_file_number is 96, but the problematic compaction task contains 500+ L0 files. Each included L0 sub-level contains only 1~3 SSTs.

Logs and metrics are available.

When I deleted a test cluster today, there were too many files in the S3 bucket (the deletion has been running for tens of minutes and is still ongoing).
So I took a look at Grafana. The compactor seems to have stopped working at around 17:34. I suspect a bug caused compaction to get stuck. Please take a look when you have time.
The image is the latest nightly build. I am investigating an OOM bug, so several CN crashes were expected.

Grafana
Logs


Error message/log

No response

To Reproduce

Since #10888 has not been fixed yet, you can easily reproduce it by running longevity-test with the following settings:

BENCH_SKU="medium-3cn"
BUILDKITE="true"
CI="true"
DURATION="10h"
INTERVAL="30m"
NEXMARK_QUERIES="q1,q2,q3,q4,q5,q7,q8,q9,q10,q14,q15,q16,q17,q18,q20"
RW_VERSION="nightly-20230719"
SLACK_CHANNEL="test_notification"
STREAMING_PARALLELISM="3"
max_clusters_allowed="2"
skip_cleanup_on_failure="0"
skip_mvs_from_mvs="1"
skip_view_queries_execution="0"
skipcountmvfromsource="1"
user="eric"

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

nightly

Additional context

See discussions in Slack: #11154

@zwang28 zwang28 added the type/bug Something isn't working label Jul 24, 2023
@zwang28 zwang28 self-assigned this Jul 24, 2023
@github-actions github-actions bot added this to the release-1.1 milestone Jul 24, 2023

zwang28 commented Jul 25, 2023

The cause is a flaw in the sub-level picker. Consider this case:

  1. Sub-levels 1 through 10 are included in overlap_len_and_begins, even though their overlap_files_range is empty for now. total_file_count is still below the hard limit.
  2. For sub-level 11, overlap_files_range is not empty, so it extends the task input. Then, during the reverse overlap check, more files from sub-levels 1 through 10 can be included because they overlap with the new task input. We don't check total_file_count < hard limit here, so arbitrarily many SSTs can end up in the task input.
  3. The result is a task with more level 0 SST inputs than the hard limit.
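
To make the case above concrete, here is a minimal toy model of the flaw. It is not the actual picker code: `Sst`, `overlaps`, `pick_file_count`, `demo_levels` and the `check_reverse_limit` flag are hypothetical names invented for this sketch, and the key ranges are made up. It only assumes that the reverse overlap check pulls in files from earlier sub-levels once the task's key range grows, and shows what happens with and without a file-count check at that point.

```rust
/// Hypothetical toy SST: a closed key range [left, right].
#[derive(Clone, Copy, Debug, PartialEq)]
struct Sst {
    left: u32,
    right: u32,
}

fn overlaps(sst: &Sst, lo: u32, hi: u32) -> bool {
    sst.left <= hi && lo <= sst.right
}

/// Walks the sub-levels in order and returns how many L0 files end up in the task.
/// `check_reverse_limit = false` models the flaw: files added by the reverse
/// overlap check are never counted against `max_files`.
/// Simplification: files added by the reverse check do not extend the range again.
fn pick_file_count(
    sub_levels: &[Vec<Sst>],
    seed: (u32, u32),
    max_files: usize,
    check_reverse_limit: bool,
) -> usize {
    let (mut lo, mut hi) = seed;
    let mut picked: Vec<Vec<bool>> = sub_levels.iter().map(|l| vec![false; l.len()]).collect();
    let mut total = 0usize;

    for cur in 0..sub_levels.len() {
        // Forward step: files of the current sub-level that overlap the task range.
        let fresh: Vec<usize> = (0..sub_levels[cur].len())
            .filter(|&i| !picked[cur][i] && overlaps(&sub_levels[cur][i], lo, hi))
            .collect();
        if total + fresh.len() > max_files {
            break;
        }
        let (mut new_lo, mut new_hi) = (lo, hi);
        for &i in &fresh {
            new_lo = new_lo.min(sub_levels[cur][i].left);
            new_hi = new_hi.max(sub_levels[cur][i].right);
        }

        // Reverse check: earlier sub-levels may now overlap the extended range.
        let mut extra: Vec<(usize, usize)> = Vec::new();
        for prev in 0..cur {
            for (i, sst) in sub_levels[prev].iter().enumerate() {
                if !picked[prev][i] && overlaps(sst, new_lo, new_hi) {
                    extra.push((prev, i));
                }
            }
        }
        // The guard that was missing: count the reverse-added files as well.
        if check_reverse_limit && total + fresh.len() + extra.len() > max_files {
            break;
        }

        for &i in &fresh {
            picked[cur][i] = true;
        }
        for &(p, i) in &extra {
            picked[p][i] = true;
        }
        total += fresh.len() + extra.len();
        lo = new_lo;
        hi = new_hi;
    }
    total
}

/// Made-up layout: three narrow sub-levels, then one wide SST that spans them all.
fn demo_levels() -> Vec<Vec<Sst>> {
    let narrow = vec![
        Sst { left: 0, right: 9 },
        Sst { left: 10, right: 19 },
        Sst { left: 20, right: 29 },
    ];
    let mut first = narrow.clone();
    first.push(Sst { left: 40, right: 45 });
    vec![first, narrow.clone(), narrow, vec![Sst { left: 0, right: 45 }]]
}
```

With a seed range of (40, 45) and a limit of 4 files, `pick_file_count(&demo_levels(), (40, 45), 4, false)` returns 11, because the wide SST in the last sub-level stretches the range and the reverse check then sweeps in every file from the earlier sub-levels. With the flag set to `true` the picker stops before accepting that sub-level and returns 1. Whether the real fix stops, rolls back, or trims the input is not stated here; the sketch only illustrates where the count has to be checked.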
