[Bug]: WriteToBigQuery with file_loads and dynamic table destination doesn't load after first File Load #23104
Comments
CC: @ahmedabu98, who is currently working on fixing this.
Thanks @Abacn. Thank you @rizenfrmtheashes for that document, it was really helpful in understanding the underlying issues. I'm working on a solution in #23012 (currently just writing tests and changing other tests that relied on the previous implementation).
Label "awaiting triage" cannot be managed because it does not exist in the repo. Please check your spelling.
.remove-labels 'awaiting triage'
@ahmedabu98 Oh, I absolutely missed this. I'm glad it's getting actioned. I took a look at the PR, and it looks good at addressing the issue with the
Thanks for taking a look! I'm looking to get it merged before the end of the week, though I'm not sure it will make version 2.42.0, as that release branch has already been cut.
What happened?
I was able to reproduce a bug in the WriteToBigQuery transform in the Apache Beam source code when running a streaming pipeline on Dataflow with the file loads write method. The issue is primarily caused by the Impulse nodes that feed the WaitForBQJobs step, which is used at four separate points in the pipeline.

The bug manifests when using file loads to write data to different tables, with the destination table chosen by a lambda that reads each input row. After the first file load, the impulses (which should either be periodic or be replaced by a different triggering mechanism) never fire again, so the side inputs are never refreshed for the nodes that:

- load the data into a temp table,
- check whether that load is complete,
- apply any schema change to the intended table and check whether that change is complete,
- copy from the temp table to the intended table and check whether the copy is complete,
- and then remove the original temp table.

Because a single impulse drives this work, it is executed exactly once; when more data arrives afterwards, none of it is actioned.
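For reference, here is a minimal sketch of the pipeline shape that triggers the behaviour. The Pub/Sub subscription, project, dataset, schema, and table-naming lambda below are placeholders I made up for illustration; they are not taken from the reproduction doc.

```python
# Minimal repro sketch (assumed names): stream JSON messages from Pub/Sub and
# write them to per-type BigQuery tables using the FILE_LOADS method.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    # Streaming mode is required to hit the periodic file-load code path.
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        rows = (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/my-sub")
            | "Parse" >> beam.Map(json.loads)
        )

        # Dynamic destination: the table is chosen per element. Per the bug
        # described above, only the first round of load/copy jobs runs;
        # data arriving after that is never loaded.
        _ = rows | "WriteToBQ" >> beam.io.WriteToBigQuery(
            table=lambda row: "my-project:my_dataset.events_%s" % row["type"],
            schema="type:STRING,payload:STRING",
            method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
            triggering_frequency=60,  # start load jobs every 60 seconds
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )


if __name__ == "__main__":
    run()
```

With a static table name the same pipeline behaves as expected in my understanding; the dynamic destination is what routes the write through the per-destination temp-table and copy steps described above.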
I reported this issue directly to GCP customer support for the Dataflow team, but after not seeing it actioned, I'm reporting it here.
Here is the Google Doc I created for them that walks through reproduction of the bug.
This has been reproduced on both Apache Beam 2.32 and 2.40, running on Dataflow.
Issue Priority
Priority: 2
Issue Component
Component: io-py-gcp