Cache DataSegments and PendingSegments on the Overlord #17336

Draft: wants to merge 1 commit into master
Conversation

@AmatyaAvadhanula (Contributor) commented Oct 13, 2024

Description

Polling the druid_segments and druid_pendingSegments tables can be a bottleneck for several operations in Druid and the load on the metadata store increases with the number of segments in the cluster.
This PR aims to maintain a central cache of active segments and relevant pending segments on the Overlord and use this wherever possible.

Advantages

  • Faster segment allocation (40x speedup in SegmentAllocateActionTest#testBenchmark), which lowers lag, especially when data is being backfilled
  • (TODO?) Faster segment syncing on the Coordinator, which speeds up ingestion handoff on clusters where polling millions of segments (with hundreds of columns) takes a few minutes

Changes

  • Cache segment timelines for used segments and pending segments for intervals with active locks on the overlord.
  • Use this info for segment allocation
  • Every segment allocation or segment commit transactionally updates the cache
  • (TODO) Ensure that marking segments as unused or cleaning up pending segments on the Overlord also clears the cache.
  • (TODO) Make all the calls from the Coordinator go through the Overlord before altering druid_segments or druid_pendingSegments in order to have a centralized cache
  • (TODO?) Use this cache and poll segments incrementally on the Coordinator
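As a sketch of the write path described above: allocation consults the cache instead of polling druid_pendingSegments, and the metadata-store write and the cache update happen under the same lock so the two cannot diverge. This is an illustrative Python model, not Druid code (Druid is Java); `SegmentCache`, `InMemoryStore`, and their methods are all hypothetical names.

```python
import threading

class InMemoryStore:
    """Hypothetical stand-in for the druid_pendingSegments table."""
    def __init__(self):
        self.rows = []

    def insert_pending(self, datasource, interval, segment_id):
        self.rows.append((datasource, interval, segment_id))

class SegmentCache:
    """Hypothetical Overlord-side cache of pending segments,
    keyed by (datasource, interval)."""
    def __init__(self, store):
        self._lock = threading.Lock()
        self._store = store
        self._pending = {}  # (datasource, interval) -> set of segment ids

    def allocate_pending_segment(self, datasource, interval, segment_id):
        # Allocation reads the cache instead of polling the DB; the store
        # write and the cache update share one lock ("transactionally").
        with self._lock:
            ids = self._pending.setdefault((datasource, interval), set())
            if segment_id in ids:
                return segment_id  # already allocated, no DB round-trip
            self._store.insert_pending(datasource, interval, segment_id)
            ids.add(segment_id)
            return segment_id

    def pending_for(self, datasource, interval):
        with self._lock:
            return set(self._pending.get((datasource, interval), set()))
```

A repeated allocation for the same id is answered from memory without touching the store, which is where the benchmark speedup would come from.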

Potential issues:

  • There can be an increase in memory usage, but an overhead of similar magnitude was already introduced with the CompactionSupervisor on the Overlord. This cache can be utilized by the compaction supervisor as well.
  • Leader election can take longer. (This may be mitigated by falling back to the DB while the cache is initializing.)
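The fallback mentioned above can be sketched as a read path that serves from the metadata store until the cache finishes its post-election sync, then flips over. Again illustrative Python with hypothetical names (`CacheState`, `SegmentReader`), not actual Druid classes:

```python
from enum import Enum

class CacheState(Enum):
    INITIALIZING = 1
    READY = 2

class SegmentReader:
    """Hypothetical read path: fall back to the DB while the cache is
    still syncing after leader election, so the new leader is usable
    immediately instead of blocking on cache initialization."""
    def __init__(self, cache, db):
        self.cache = cache
        self.db = db
        self.state = CacheState.INITIALIZING

    def mark_ready(self):
        self.state = CacheState.READY

    def used_segments(self, datasource):
        if self.state is CacheState.READY:
            return self.cache.get(datasource, [])
        return self.db.get(datasource, [])  # fallback while initializing
```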

Implementation details

Cache initialization:

  • When a TaskLockbox syncs for the first time upon leader election, obtain all the pending segments for active locks and cache only these. Any other pending segment must be irrelevant for segment allocation.
  • Also obtain all the used datasources and add all their segments to the cache.
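The two bootstrap steps above might look roughly like the following. Illustrative Python: the `fetch_*` callables stand in for metadata-store queries and `active_locks` for TaskLockbox state; all names are hypothetical.

```python
def initialize_cache(active_locks, fetch_pending,
                     fetch_used_datasources, fetch_used_segments):
    """Hypothetical cache bootstrap on leader election.

    Pending segments are loaded only for (datasource, interval) pairs
    holding active task locks, since any other pending segment cannot
    affect segment allocation. Used segments are loaded for every
    datasource that has any."""
    cache = {"pending": {}, "used": {}}
    for datasource, interval in active_locks:
        cache["pending"][(datasource, interval)] = fetch_pending(datasource, interval)
    for datasource in fetch_used_datasources():
        cache["used"][datasource] = fetch_used_segments(datasource)
    return cache
```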

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@kfaraz (Contributor) commented Oct 17, 2024

Thanks for the PR, @AmatyaAvadhanula !
This is a good idea and has been talked about for a while.

There are a few things that might need to be done before we can start caching segments on the Overlord:

  • Move all segment mark used / unused operations from Coordinator to the Overlord
  • Move schema fingerprint update operations from Coordinator to the Overlord

Without these being done first, the cache might serve stale information.

cc: @cryptoe

@AmatyaAvadhanula (Contributor, Author)
@kfaraz, thanks for the feedback.

  1. Segment metadata operations on the Coordinator needing to be routed through the Overlord is a TODO of this PR. I think it should be simple (but tedious) to move the APIs to the Overlord and have an Overlord client execute these on the Coordinator.
  2. Move schema fingerprint update operations from Coordinator to the Overlord.

I do not think this is necessary, as the Coordinator can continue to perform those operations because DataSegment is immutable. The Coordinator currently operates on a snapshot and will continue to do so. If this is incorrect, I think we can begin by having this caching enabled only when the Centralized segment metadata cache feature is disabled.

I hope we can determine the least set of changes to get this working and also add a feature flag, if everything else seems good.

@kfaraz (Contributor) commented Oct 17, 2024

> Segment metadata operations on the coordinator needing to be routed through the Overlord is a TODO of this PR. I think it should be simple (but tedious) to move the APIs to the Overlord and have an overlord client execute these on the coordinator.

It would probably be simpler to write and easier to review if we do this part in a separate PR and merge it before the caching changes.

> can continue to perform those operations as DataSegment is immutable.

While this is true for the payload of the segment itself, the schema fingerprint may change (typically from null to something non-null; I don't think any other change is possible for this column).

> I hope we can determine the least set of changes to get this working and also add a feature flag, if everything else seems good.

The cache should be behind a feature flag.

But to keep things simple, the new Overlord APIs probably don't need to be behind the feature flag. During a rolling upgrade, if the Coordinator gets upgraded first and starts calling new APIs on an old Overlord, we can simply give a nice error message.

@AmatyaAvadhanula (Contributor, Author)

> It would probably be simpler to write and easier to review if we do this part in a separate PR and merge it before the caching changes.

Thanks. I agree that this would be better.

> While this is true for the payload of the segment itself, the schema fingerprint may change (typically from null to something non-null. I don't think any other change is possible for this column.)

Segments would be added to the cache during a commit irrespective of whether the CDS feature is enabled.
The schema fingerprint backfill that happens on the Coordinator doesn't affect the cache; it operates on the snapshot that the Coordinator uses.
Is this understanding incorrect?
