
MSQ generates tombstones honoring granularity specified in a REPLACE query. #15243

Merged: 14 commits merged into apache:master from the uniform_msq_tombstones branch on Nov 15, 2023

Conversation

@abhishekrb19 (Contributor) commented Oct 24, 2023

Unlike native batch ingestion, MSQ is strict about segment allocation -- it will throw an InsertCannotAllocateSegment fault if the requested segment granularity doesn't align with the allocated segment granularity.

This patch modifies the segment generation logic in MSQ REPLACE to:

  • Honor the segment granularity specified in the query for all finite-interval tombstones, up to 5,000 segments (Limits.MAX_PARTITION_BUCKETS) - e.g., [2000, 2010). If this limit is exceeded, a TooManyBucketsFault is returned, as is done for data segments.
  • For infinite-interval tombstones, retain the current behavior of generating one tombstone each for positive and negative eternity - e.g., [-Inf, 2000) and [2010, Inf).

This behavior is consistent with how MSQ generally works with respect to honoring the segment granularity specified in the query. It's also similar to how native batch ingestion generates tombstones.
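The sketch below illustrates the bucketing behavior described above. It is hypothetical: it uses plain Joda-Time rather than Druid's actual TombstoneHelper, the class/method/constant names are made up for illustration, and it assumes the replace interval is already aligned to the segment granularity (the real logic aligns buckets to granularity boundaries).

```java
import org.joda.time.DateTime;
import org.joda.time.Interval;
import org.joda.time.Period;
import java.util.ArrayList;
import java.util.List;

public class ReplaceTombstoneSketch
{
  // Mirrors the 5,000 bucket cap (Limits.MAX_PARTITION_BUCKETS) mentioned above.
  private static final int MAX_BUCKETS = 5_000;

  /** Splits a finite replace interval into granularity-sized tombstone buckets, failing at the cap. */
  static List<Interval> tombstoneBuckets(Interval finiteInterval, Period segmentGranularity)
  {
    final List<Interval> buckets = new ArrayList<>();
    DateTime cursor = finiteInterval.getStart();
    while (cursor.isBefore(finiteInterval.getEnd())) {
      if (buckets.size() >= MAX_BUCKETS) {
        // In MSQ, exceeding the cap surfaces to the user as a TooManyBucketsFault.
        throw new IllegalStateException("Too many tombstone buckets (max = " + MAX_BUCKETS + ")");
      }
      final DateTime next = cursor.plus(segmentGranularity);
      buckets.add(new Interval(cursor, next));
      cursor = next;
    }
    return buckets;
  }

  public static void main(String[] args)
  {
    // A REPLACE over [2000, 2010) with YEAR segment granularity yields ten yearly tombstones.
    // Half-infinite intervals such as [-Inf, 2000) are not bucketized; each stays a single tombstone.
    final Interval replaceInterval = new Interval(
        DateTime.parse("2000-01-01T00:00:00Z"),
        DateTime.parse("2010-01-01T00:00:00Z")
    );
    System.out.println(tombstoneBuckets(replaceInterval, Period.years(1)).size()); // prints 10
  }
}
```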

Description

Previously, MSQ REPLACE would generate "irregular" tombstones as an optimization so that the controller wouldn't run out of memory. The new code treats only the infinite-interval tombstones specially; those will be cleaned up by the coordinator in a follow-up change.

This change semi-reverts #13893, reuses the tests added in that PR, and adds a new condition to account for the infinite-interval tombstones.

Release note

MSQ REPLACE queries now generate tombstone segments that honor the segment granularity specified in the query. If a query would generate more than 5,000 tombstones, MSQ returns a TooManyBucketsFault to the user, matching the behavior for data segments.


Key changed/added classes in this PR
  • TombstoneHelper.java
  • TombstoneHelperTest.java

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

This change tweaks the logic to only account for the infinite-interval tombstones.
For finite-interval tombstones, the MSQ query granularity will be used,
which is consistent with how MSQ works.
@github-actions github-actions bot added the Area - Batch Ingestion, Area - Ingestion, and Area - MSQ (for multi-stage queries - https://github.com/apache/druid/issues/12262) labels Oct 24, 2023
@abhishekrb19 abhishekrb19 marked this pull request as draft October 24, 2023 17:47
@abhishekrb19 abhishekrb19 marked this pull request as ready for review October 24, 2023 19:00
@abhishekrb19 abhishekrb19 changed the title from "MSQ generates tombstones honoring the query's granularity." to "MSQ generates tombstones honoring granularity specified in a REPLACE query." on Oct 25, 2023
@gianm (Contributor) left a comment

In general I like the idea of having the tombstone behavior be better aligned between native and MSQ. I'm not sure if there's a reason that it needs to be different.

A couple of comments:

  1. Looks like TombstoneHelper has computeTombstones used by native batch in replace mode, and computeTombstoneSegmentsForReplace used by MSQ in replace mode. Is there a practical difference in the logic? If so, is it a difference we need to preserve, or could both systems use the same method? It looks like the method used by MSQ was added in "Generate tombstones when running MSQ's replace" #13706. Perhaps that PR has some discussion that would shed light on these questions.
  2. Ideally MSQ would return a TooManyBuckets fault if there are more than 5000 tombstones (StageDefinition.PARTITION_STATS_MAX_BUCKETS). To support this, I suggest adding a limit parameter to TombstoneHelper, so there's some way for a caller to tell if the limit is exceeded. This prevents OOM in situations where there are too many time chunks. It also aligns with the behavior of an ingest that generates actual data as opposed to tombstones. A rough sketch of this shape follows below.
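A rough, hypothetical sketch of what the limit parameter suggested in (2) could look like. This is not the committed code: BucketLimitExceededException, computeTombstoneIntervals, and computeForMsq are illustrative names, and in the real controller the failure is reported as an MSQException carrying a TooManyBucketsFault.

```java
import org.joda.time.Interval;
import java.util.ArrayList;
import java.util.List;

public class LimitParameterSketch
{
  // Illustrative exception type; the committed change may use a different class.
  static class BucketLimitExceededException extends RuntimeException
  {
    BucketLimitExceededException(int maxBuckets)
    {
      super("Cannot add more tombstone buckets than [" + maxBuckets + "].");
    }
  }

  // Helper side: the caller passes the limit, so each engine can pick its own bound
  // and detect when it is exceeded.
  static List<Interval> computeTombstoneIntervals(Iterable<Interval> candidates, int maxBuckets)
  {
    final List<Interval> result = new ArrayList<>();
    for (Interval candidate : candidates) {
      if (result.size() >= maxBuckets) {
        throw new BucketLimitExceededException(maxBuckets);
      }
      result.add(candidate);
    }
    return result;
  }

  // MSQ side: translates the helper's failure into the user-facing TooManyBuckets error.
  static List<Interval> computeForMsq(Iterable<Interval> candidates)
  {
    try {
      return computeTombstoneIntervals(candidates, 5_000);
    }
    catch (BucketLimitExceededException e) {
      // Stand-in for wrapping into an MSQException with a TooManyBucketsFault.
      throw new RuntimeException("TooManyBuckets: Too many partition buckets (max = 5,000)", e);
    }
  }
}
```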

@abhishekrb19 (Contributor, Author)

Thank you for the review, @gianm!

  1. Yep, I agree. It looks like @LakshSingla did some analysis on reusing computeTombstones in "Generate tombstones when running MSQ's replace" #13706 (comment). I also looked into it, and I think with some refactoring, like moving IntervalUtils into core Druid and passing in specific arguments, this should be possible. I'd prefer to revisit the refactor as a separate PR from the functional change in this patch. For now, I've added javadocs to the two methods the native batch and MSQ engines use.
  2. Good idea! I updated the code per your suggestion and added unit tests in MSQReplaceTest.java and TombstoneHelperTest.java.

@LakshSingla (Contributor) left a comment

Since TooManyBucketsFault now also includes tombstones, does the user-facing message for the error still hold, or do we want it to be more explanatory?
Expressing "tombstones" in the message doesn't seem neat, so I think we can mention the number of buckets generated & the number of empty buckets in the message to help the user & admin debug faster, without saying what tombstones are.

@abhishekrb19 (Contributor, Author) commented Nov 3, 2023

@LakshSingla re:

Since TooManyBucketsFault now also includes tombstones, does the user-facing message for the error still hold, or do we want it to be more explanatory?
Expressing "tombstones" in the message doesn't seem neat, so I think we can mention the number of buckets generated & the number of empty buckets in the message to help the user & admin debug faster, without saying what tombstones are.

The user-facing error message would still be (also in MSQFaultsTest):

TooManyBuckets: Too many partition buckets (max = 5,000); try breaking your query up into smaller queries or using a wider PARTITIONED BY

Did you mean the cause in the stacktrace? If so, I think it's better to be direct and say "tombstone" in the cause since it's an existing concept explained in the Druid docs. The stacktrace for completeness:

org.apache.druid.msq.indexing.error.MSQException: TooManyBuckets: Too many partition buckets (max = 5,000); try breaking your query up into smaller queries or using a wider PARTITIONED BY
	at org.apache.druid.msq.exec.ControllerImpl.publishAllSegments(ControllerImpl.java:1437)
	at org.apache.druid.msq.exec.ControllerImpl.publishSegmentsIfNeeded(ControllerImpl.java:1667)
	at org.apache.druid.msq.exec.ControllerImpl.runTask(ControllerImpl.java:404)
	at org.apache.druid.msq.exec.ControllerImpl.run(ControllerImpl.java:338)
	at org.apache.druid.msq.indexing.MSQControllerTask.runTask(MSQControllerTask.java:248)
	at org.apache.druid.indexing.common.task.AbstractTask.run(AbstractTask.java:169)
	at org.apache.druid.indexing.overlord.ThreadingTaskRunner$1.call(ThreadingTaskRunner.java:229)
	at org.apache.druid.indexing.overlord.ThreadingTaskRunner$1.call(ThreadingTaskRunner.java:154)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.apache.druid.java.util.common.IAE: Cannot add more tombstone buckets than [5000].
	at org.apache.druid.indexing.common.task.batch.parallel.TombstoneHelper.computeTombstoneIntervalsForReplace(TombstoneHelper.java:220)
	at org.apache.druid.indexing.common.task.batch.parallel.TombstoneHelper.computeTombstoneSegmentsForReplace(TombstoneHelper.java:149)
	at org.apache.druid.msq.exec.ControllerImpl.publishAllSegments(ControllerImpl.java:1423)
	... 13 more

@LakshSingla (Contributor)

Did you mean the cause in the stacktrace?

I meant the user-facing message. We can enhance the message to include more information about what the user is doing and how many buckets they are generating.

I think it's better to be direct and say "tombstone" in the cause since it's an existing concept explained in the Druid docs.

I guess so, though tombstones aren't something that we expect a user to be familiar with.

@abhishekrb19 abhishekrb19 force-pushed the uniform_msq_tombstones branch from e8011aa to 46749ab Compare November 3, 2023 16:39
@abhishekrb19 abhishekrb19 force-pushed the uniform_msq_tombstones branch from 46749ab to 8bb2c38 Compare November 3, 2023 16:41
@abhishekrb19 (Contributor, Author)

@LakshSingla, thank you for the review! Regarding mentioning the total number of buckets a query would generate without the max-buckets limit: computing that number could lead to an OOM. The max = 5,000 in the exception message gives the user a safe upper bound, which I think suffices.

I believe I've addressed your comments. Can you please take another look?
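As an aside, here is a hypothetical sketch of that fail-fast idea (again, not the actual TombstoneHelper code; names are illustrative): candidate buckets are produced lazily and counted as the iterator advances, so the limit check trips at the cap instead of first materializing a potentially huge list of buckets in memory.

```java
import org.joda.time.DateTime;
import org.joda.time.Interval;
import org.joda.time.Period;
import java.util.Iterator;
import java.util.NoSuchElementException;

public class LazyBucketCountSketch
{
  // Lazily yields granularity-sized buckets over an interval; nothing is materialized up front.
  static Iterator<Interval> buckets(Interval interval, Period granularity)
  {
    return new Iterator<Interval>()
    {
      private DateTime cursor = interval.getStart();

      @Override
      public boolean hasNext()
      {
        return cursor.isBefore(interval.getEnd());
      }

      @Override
      public Interval next()
      {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        final DateTime end = cursor.plus(granularity);
        final Interval bucket = new Interval(cursor, end);
        cursor = end;
        return bucket;
      }
    };
  }

  public static void main(String[] args)
  {
    // Thirty years of DAY-granularity buckets (~10,957) blows past the 5,000 cap quickly.
    final Interval interval = new Interval(
        DateTime.parse("1990-01-01T00:00:00Z"),
        DateTime.parse("2020-01-01T00:00:00Z")
    );
    final Iterator<Interval> it = buckets(interval, Period.days(1));
    final int maxBuckets = 5_000;
    int count = 0;
    while (it.hasNext()) {
      it.next();
      if (++count > maxBuckets) {
        // Fail fast: the limit is detected long before every bucket is built and held in memory.
        System.out.println("Would report TooManyBuckets after " + maxBuckets + " buckets");
        return;
      }
    }
    System.out.println("Bucket count: " + count);
  }
}
```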

@abhishekrb19 abhishekrb19 merged commit 2e79fd5 into apache:master Nov 15, 2023
83 checks passed
@abhishekrb19 abhishekrb19 deleted the uniform_msq_tombstones branch November 15, 2023 07:35
CaseyPan pushed a commit to CaseyPan/druid that referenced this pull request Nov 17, 2023
…` query. (apache#15243)

* MSQ generates tombstones honoring the query's granularity.

This change tweaks the logic to only account for the infinite-interval tombstones.
For finite-interval tombstones, the MSQ query granularity will be used,
which is consistent with how MSQ works.

* more tests and some cleanup.

* checkstyle

* comment edits

* Throw TooManyBuckets fault based on review; add more tests.

* Add javadocs for both methods on reconciling the methods.

* review: Move testReplaceTombstonesWithTooManyBucketsThrowsException to MsqFaultsTest

* remove unused imports.

* Move TooManyBucketsException to indexing package for shared exception handling.

* lower max bucket for tests and fixup count

* Advance and count the iterator.

* checkstyle
writer-jill pushed a commit to writer-jill/druid that referenced this pull request Nov 20, 2023
yashdeep97 pushed a commit to yashdeep97/druid that referenced this pull request Dec 1, 2023
AmatyaAvadhanula added a commit to AmatyaAvadhanula/druid that referenced this pull request Dec 12, 2023
@LakshSingla LakshSingla added this to the 29.0.0 milestone Jan 29, 2024
Labels: Area - Batch Ingestion, Area - Ingestion, Area - MSQ (for multi-stage queries - https://github.com/apache/druid/issues/12262), Release Notes