Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python DirectRunner does not emit data at GC time #21260

Open
damccorm opened this issue Jun 4, 2022 · 1 comment
Open

Python DirectRunner does not emit data at GC time #21260

damccorm opened this issue Jun 4, 2022 · 1 comment

Comments

@damccorm
Copy link
Contributor

damccorm commented Jun 4, 2022

The following should succeed but does not:


test_options = PipelineOptions(flags=['--allow_unsafe_triggers'])
 with TestPipeline(options=test_options)
as pipeline:
  pcoll = (
    pipeline
    | beam.Create([(1, 1), (1, 2), (1, 3), (1, 4)])
    |
WindowInto(
      window.GlobalWindows(),
      trigger=trigger.AfterCount(5),
      accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
 
  | beam.GroupByKey())
  assert_that(pcoll, equal_to([(1, [1, 2, 3, 4])]))

However, it  currently fails, because  pcoll will be empty. It appears that the Direct Runner drops data if the trigger never fired.

Imported from Jira BEAM-13078. Original Jira may contain additional context.
Reported by: zhoufek.

@kennknowles
Copy link
Member

@zhoufek are you aware of any follow-up to this? If not, this is certainly an issue. It would be a direct runner issue, not a Python SDK issue, I think. Dataflow has one implementation of triggers while Java-based runners have another, and neither is shared with the local Python runner.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants