What happened?
While working on a Dataflow Runner V2 bug, we discovered that the Java SDK does some extra sampling.
SDK-side PCollection sampling is done on a per-bundle basis, because we "technically" don't know whether a given bundle is the only instance of this PCollection.
But that ignores the fact that SDKs reuse execution plans, and the sampling naively just reads the first element of a PCollection in each bundle.
This leads to extra reads for pipelines with only one key, which is then the hot key.
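To make the cost concrete, here is a toy model of the behaviour described above. It is only a sketch: `PerBundleSampler` and `split_into_bundles` are hypothetical names invented for illustration, not Beam SDK APIs, and the bundle size of 10 is an arbitrary assumption. It shows that a sampler which keeps no memory across bundles reads one element per bundle, even when every element is the same.

```python
# Toy model only: these names are hypothetical and are not Beam SDK APIs.
from typing import Iterable, List


class PerBundleSampler:
    """Samples the first element of every bundle, with no memory across bundles."""

    def __init__(self) -> None:
        self.samples_taken = 0

    def process_bundle(self, bundle: Iterable[bytes]) -> None:
        for index, element in enumerate(bundle):
            if index == 0:
                # Naive rule: always measure the first element of the bundle,
                # even if an identical element was measured in an earlier bundle.
                self.samples_taken += 1
                _ = len(element)


def split_into_bundles(elements: List[bytes], bundle_size: int) -> List[List[bytes]]:
    """Chops the input into consecutive bundles of at most bundle_size elements."""
    return [elements[i:i + bundle_size] for i in range(0, len(elements), bundle_size)]


# A single-key pipeline: every element is the same hot-key value.
elements = [b"hot-key-value"] * 1000
bundles = split_into_bundles(elements, bundle_size=10)  # assumed bundle size

sampler = PerBundleSampler()
for bundle in bundles:
    sampler.process_bundle(bundle)

print(sampler.samples_taken)  # 100: one extra read per bundle
```

In this model the number of sampled reads grows with the number of bundles rather than with the number of distinct element sizes, which is the redundancy the issue points at.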
Issue Priority
Priority: 3 (minor)
Issue Components
Component: Python SDK
Component: Java SDK
Component: Go SDK
Component: Typescript SDK
Component: IO connector
Component: Beam YAML
Component: Beam examples
Component: Beam playground
Component: Beam katas
Component: Website
Component: Spark Runner
Component: Flink Runner
Component: Samza Runner
Component: Twister2 Runner
Component: Hazelcast Jet Runner
Component: Google Cloud Dataflow Runner
lostluck changed the title from "[Bug]: Sampling Data from PCollection at the Source is redundant, since the Source already has a perfect estimate of per element size." to "[Bug]: Java SDK Sampling Data from PCollection at the Source is redundant, since the Source already has a perfect estimate of per element size." on Oct 4, 2023