Add support for global sequence processing to the "ordered" extension in Java SDK #32540

Merged
merged 36 commits into from
Oct 9, 2024


@slilichenko slilichenko commented Sep 23, 2024

Global sequence processing ensures that events for a given key are processed only when it is guaranteed that all the elements for that key, up to the current point in the global sequence, have been received.

Consider a PCollection which contains these event tuples (first element of the tuple is the global sequence number):
[1, key1, data], [2, key2, data], [3, key1, data], [4, key1, data], [5, key2, data], [7, key2, data]. Elements for key1 must be processed in the following order: 1, 3, 4. Elements for key2 must be processed in the following order: 2, 5. Event with sequence 7 can't be processed because there is a missing sequence 6.
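
The example above can be reproduced with a minimal plain-Java sketch. It does not use the Beam API or the classes added in this PR; `GlobalSequenceExample`, `Event`, `contiguousEnd` and `processable` are hypothetical names introduced only for illustration, and the initial sequence is assumed to be 1.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class GlobalSequenceExample {
  /** Hypothetical event tuple: global sequence number, key, payload. */
  static final class Event {
    final long sequence;
    final String key;
    final String data;

    Event(long sequence, String key, String data) {
      this.sequence = sequence;
      this.key = key;
      this.data = data;
    }
  }

  /** The event tuples from the description above. */
  static List<Event> sampleEvents() {
    return List.of(
        new Event(1, "key1", "data"),
        new Event(2, "key2", "data"),
        new Event(3, "key1", "data"),
        new Event(4, "key1", "data"),
        new Event(5, "key2", "data"),
        new Event(7, "key2", "data"));
  }

  /** Last sequence of the gap-free run that starts at initialSequence (inclusive). */
  static long contiguousEnd(List<Event> events, long initialSequence) {
    Set<Long> seen = new HashSet<>();
    for (Event e : events) {
      seen.add(e.sequence);
    }
    long end = initialSequence - 1;
    while (seen.contains(end + 1)) {
      end++;
    }
    return end;
  }

  /** Per-key sequences that are safe to process, in processing order. */
  static Map<String, List<Long>> processable(List<Event> events, long initialSequence) {
    long end = contiguousEnd(events, initialSequence);
    Map<String, List<Long>> perKey = new TreeMap<>();
    events.stream()
        .filter(e -> e.sequence >= initialSequence && e.sequence <= end)
        .sorted(Comparator.comparingLong((Event e) -> e.sequence))
        .forEach(e -> perKey.computeIfAbsent(e.key, k -> new ArrayList<>()).add(e.sequence));
    return perKey;
  }

  public static void main(String[] args) {
    // Sequence 6 is missing, so the contiguous range ends at 5 and event 7 is held back.
    System.out.println(processable(sampleEvents(), 1)); // prints {key1=[1, 3, 4], key2=[2, 5]}
  }
}
```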

The approach used to implement ordered processing in the presence of global sequencing:

  • Generate (periodically for streaming pipelines, once for batch) the side input which contains the maximum contiguous range of sequences across all keys and the maximum timestamp of events in that range.
  • The DoFns which process events use this side input to a) store the latest maximum range in the per-key processing state and b) set an event-time timer that fires at the latest timestamp.
  • Save all the events for a particular key into the ordered list state (with some optimization exceptions). Because the events received in this DoFn are per key, they are not guaranteed to be contiguous and can't be processed right away.
  • Once the timer fires off there is a guarantee that all the events for a given key have been received up to the firing timestamp. The latest contiguous range stored in the processing state is used to limit the events in the ordered list state that can be safely processed in a loop.
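
The per-key part of the steps above could look roughly like the following plain-Java sketch. It is a simulation only: a `TreeMap` stands in for the ordered list state, a field stands in for the processing state, and `onTimer` is called by hand instead of by a runner. `PerKeyBufferSketch` and its methods are hypothetical names, and treating the range end as inclusive is an assumption of this sketch, not a statement about the PR's `ContiguousSequenceRange`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java stand-in for the per-key state; not the Beam state or timer API. */
class PerKeyBufferSketch {
  // Stand-in for the ordered list state: buffered events sorted by global sequence.
  private final TreeMap<Long, String> buffered = new TreeMap<>();
  // Stand-in for the processing state: inclusive end of the latest contiguous range (assumed).
  private long contiguousRangeEnd = Long.MIN_VALUE;
  private final List<String> output = new ArrayList<>();

  /** Buffers the event and remembers the latest range end read from the side input. */
  void onEvent(long sequence, String data, long rangeEndFromSideInput) {
    buffered.put(sequence, data);
    contiguousRangeEnd = Math.max(contiguousRangeEnd, rangeEndFromSideInput);
  }

  /** Emits, in sequence order, every buffered event inside the contiguous range. */
  void onTimer() {
    // headMap returns a live view, so clearing it removes the emitted events from the buffer.
    Map<Long, String> ready = buffered.headMap(contiguousRangeEnd, true);
    output.addAll(ready.values());
    ready.clear();
  }

  List<String> output() {
    return output;
  }

  int pending() {
    return buffered.size();
  }

  public static void main(String[] args) {
    PerKeyBufferSketch key2 = new PerKeyBufferSketch();
    key2.onEvent(7, "seq-7", 2); // arrives early; the global range only reaches 2 so far
    key2.onEvent(2, "seq-2", 3);
    key2.onEvent(5, "seq-5", 5); // the global range now reaches 5
    key2.onTimer();
    System.out.println(key2.output());  // prints [seq-2, seq-5]
    System.out.println(key2.pending()); // prints 1 (sequence 7 still waits for 6)
  }
}
```

Events are buffered regardless of arrival order, and nothing is emitted until the timer fires; only the contiguous range bounds what is released.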

This high level diagram illustrates the overall approach.

There are dedicated unit tests to cover both per-key and global sequence processing. Please refer to them to understand the details of use cases.

Note that the batch unit tests for global processing don't automatically run under global sequencing. This is due to the apparent incorrectness of the DirectRunner implementation (it is supposed to block the event-processing DoFns until the side input is calculated once). The batch processing tests were successfully run manually using DataflowRunner. Additional work will be needed to either fix the DirectRunner, switch to PrismRunner (when it supports all the primitives used in this transform), or enable the tests to run on DataflowRunner.


Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @m-trieu for label java.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).


github-actions bot commented Oct 1, 2024

Reminder, please take a look at this pr: @m-trieu


github-actions bot commented Oct 4, 2024

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @damondouglas for label java.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)


@damccorm damccorm left a comment


I'm going to try to take this in 2 chunks, starting with the side input pieces.

The code itself looks good so far, thanks for the thorough tests in particular


@damccorm damccorm left a comment


Thanks - mostly LGTM outside of pending comments (mostly cosmetic)


@damccorm damccorm left a comment


This LGTM, thanks! I noticed that the java test suite failed, but it looks unrelated to your change (FlinkRequiresStableInputTest failed). I'm rerunning to hopefully get a green signal before merging

@damccorm damccorm merged commit 20d0f6e into apache:master Oct 9, 2024
21 checks passed
reeba212 pushed a commit to reeba212/beam that referenced this pull request Dec 4, 2024
… in Java SDK (apache#32540)

* Initial changes to support processing global sequences.

* Refactor the DoFns out of the transform and into a class hierarchy.

* Next round of implementation of Global Sequence handling.

* Added ticker timers in global sequence processing.

* Corrected the emission batch logic.

* Reworked some tests and fixed the batch output logic.

* Pluggable combiner for the global sequence.

* First iteration of the efficient merging accumulator

* Mostly complete implementation of the accumulator and corresponding tests.

* Additional round of test refinements.

* Added logic to DLQ the records below the global sequence range.

* Added providing a global sequence combiner through a handler.

* Added SequenceRangeAccumulatorCoder and tests. Improved logic of creating timers.

* Fixed logging levels (moved them to "trace") on several transforms.

* Round of code improvements and cleanups.

* Tests to verify that the global sequence is correctly produced by the transform.

* Added batch processing verification to the global sequence processing.

* A round of documentation update and minor clean up.

* Fixed the description in CHANGES.md

* Polish by "spotless"

* Polish by "spotless"

* Removed unneeded logging configuration file.

* Made ContiguousSequenceRange open ended.

* Removed details from 2.60.0 section in CHANGES.md.

* Update sdks/java/extensions/ordered/src/main/java/org/apache/beam/sdk/extensions/ordered/combiner/DefaultSequenceCombiner.java

Co-authored-by: Danny McCormick <[email protected]>

* Fixed spotless related errors.

* Added a note about the new functionality to CHANGES.md

* Added clarification around the data structure used in the sequence combiner.

* Added clarification around the data structure used in the sequence combiner.

* Fixed the problem with allowed lateness being set to 0 in the global sequence tracker.

* Parameterized the GlobalSequenceTracker with the max number of events to trigger the re-evaluation. Fixed accidentally disabled unit tests.

* Made the event timer used to wait for the event arrival respect the lateness of the input.

* Created new failure reason code - "before initial sequence"

---------

Co-authored-by: Danny McCormick <[email protected]>