[Bug]: Pipeline may stuck after error thrown in transform due to racing condition in CancellableQueue #32502

Closed
2 of 17 tasks
Abacn opened this issue Sep 18, 2024 · 2 comments


@Abacn
Contributor

Abacn commented Sep 18, 2024

What happened?

We've seen reports of pipelines getting stuck after an error is thrown in the @Setup method of a transform. The suspected cause is the following race condition in CancellableQueue in the Java SDK harness. Symptom:

WARNING 2024-09-13T17:15:45.961Z Operation ongoing in bundle process_bundle-3678134576214540161-11096 for at least 06h56m00s without outputting or completing:
  at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
  at [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:341)
  at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(AbstractQueuedSynchronizer.java:506)
  at [email protected]/java.util.concurrent.ForkJoinPool.unmanagedBlock(ForkJoinPool.java:3465)
  at [email protected]/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3436)
  at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1623)
  at app//org.apache.beam.sdk.fn.CancellableQueue.take(CancellableQueue.java:95)
  at app//org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.awaitCompletion(BeamFnDataInboundObserver.java:122)
  at app//org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:550)
  at app//org.apache.beam.fn.harness.FnHarness$$Lambda$203/0x00007f77f52f5ef8.apply(Unknown Source)
  at app//org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:150)
  at app//org.apache.beam.fn.harness.control.BeamFnControlClient$InboundObserver.lambda$onNext$0(BeamFnControlClient.java:115)
  at app//org.apache.beam.fn.harness.control.BeamFnControlClient$InboundObserver$$Lambda$212/0x00007f77f5301138.run(Unknown Source)
  at ...

When some exception happens, cancel() is called on the queue. Any further invocation on the queue is then supposed to raise that exception.

However, if reset() is called between cancel() and the next invocation, the stored exception is set to null. The runner is then unaware of the bad state and just keeps waiting for elements that will never arrive.
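
The suspected interleaving can be sketched with a small, hypothetical queue. This is not Beam's CancellableQueue: the names `CancelResetRaceSketch`, `SimpleCancellableQueue`, and `consume` are assumptions made up for illustration, and a timed `poll` stands in for `take()` so that the demo terminates instead of hanging.

```java
import java.util.ArrayDeque;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, simplified model of a cancellable queue. NOT Beam's CancellableQueue.
final class CancelResetRaceSketch {

  static final class SimpleCancellableQueue<T> {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition notEmpty = lock.newCondition();
    private final ArrayDeque<T> elements = new ArrayDeque<>();
    private Exception cancellation; // null means "not cancelled"

    // Adds an element and wakes one waiting consumer.
    void put(T element) {
      lock.lock();
      try {
        elements.add(element);
        notEmpty.signal();
      } finally {
        lock.unlock();
      }
    }

    // Records the failure and wakes any blocked consumer.
    void cancel(Exception cause) {
      lock.lock();
      try {
        cancellation = cause;
        elements.clear();
        notEmpty.signalAll();
      } finally {
        lock.unlock();
      }
    }

    // Prepares the queue for reuse; this also forgets a previously recorded failure.
    void reset() {
      lock.lock();
      try {
        cancellation = null;
        elements.clear();
      } finally {
        lock.unlock();
      }
    }

    // Stand-in for take(): waits at most the given time so the demo terminates.
    // Returns null on timeout; a real take() would keep waiting.
    T poll(long timeout, TimeUnit unit) throws Exception {
      long nanos = unit.toNanos(timeout);
      lock.lockInterruptibly();
      try {
        while (elements.isEmpty()) {
          if (cancellation != null) {
            throw cancellation;
          }
          if (nanos <= 0) {
            return null;
          }
          nanos = notEmpty.awaitNanos(nanos);
        }
        return elements.poll();
      } finally {
        lock.unlock();
      }
    }
  }

  // Summarizes what a consumer observes on its next invocation.
  static String consume(SimpleCancellableQueue<String> queue) {
    try {
      String value = queue.poll(100, TimeUnit.MILLISECONDS);
      return value == null ? "timed out" : "got " + value;
    } catch (Exception e) {
      return "failed: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    SimpleCancellableQueue<String> queue = new SimpleCancellableQueue<>();

    // Expected path: cancel(), then the next invocation raises the error.
    queue.cancel(new IllegalStateException("error in @Setup"));
    System.out.println(consume(queue)); // prints "failed: error in @Setup"

    // Suspected path: reset() lands between cancel() and the next invocation.
    queue.cancel(new IllegalStateException("error in @Setup"));
    queue.reset();
    System.out.println(consume(queue)); // prints "timed out"
  }
}
```

In the second case the consumer sees an empty, healthy-looking queue, so an untimed `take()` would wait indefinitely, which would match the stack trace parked in `CancellableQueue.take`.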

This affects Java portable runners including Dataflow runner v2.

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@scwhittle
Contributor

reset() isn't called except when the bundle is completed, so I don't think this was the root cause of the stuckness. I believe #32714 addressed this issue.

Abacn closed this as not planned on Dec 12, 2024
@Abacn
Contributor Author

Abacn commented Dec 12, 2024

Thanks for the information, closed the issue for now.
