[Bug]: WriteToBigQuery with file_loads and dynamic table destination doesn't load after first File Load #23104

Closed
rizenfrmtheashes opened this issue Sep 8, 2022 · 6 comments · Fixed by #23012

Comments

@rizenfrmtheashes

rizenfrmtheashes commented Sep 8, 2022

What happened?

I was able to reproduce a bug in the WriteToBigQuery transform in the Apache Beam source code when running a Dataflow streaming pipeline with the File Loads method. The issue is primarily caused by the impulse nodes feeding into the WaitForBQJobs task, which is used in four separate nodes in the pipeline.

This bug manifests when using file loads to write data to different tables, with the destination chosen by a lambda that reads each input row. After the first file load, the impulses (which should either be periodic or be replaced by a different mechanism) never fire again. As a result, the side inputs into the downstream nodes are not triggered again. Those nodes, in order, load to a temp table, check that the load is complete, apply a schema change to the intended table, check that the change is complete, copy from the temp table to the intended table, check that the copy is complete, and then remove the original temp table.
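For readers who have not used this setup, here is a minimal sketch of the kind of configuration described above: WriteToBigQuery with file loads and a per-row destination callable. It is not the pipeline from the report. The project, dataset, field names, schema, and triggering frequency are all hypothetical placeholders chosen for illustration.

```python
# Hypothetical placeholders: PROJECT, DATASET, the "event_type" field, and
# the schema below are illustrative and not taken from the original report.
PROJECT = "my-project"
DATASET = "my_dataset"


def route_to_table(row):
    """Pick the destination table from the row itself (dynamic destination)."""
    return f"{PROJECT}:{DATASET}.{row['event_type']}"


def build_write_transform(triggering_frequency_secs=60):
    """Build a WriteToBigQuery transform that uses the file loads method."""
    # Imported lazily so route_to_table can be exercised without Beam installed.
    from apache_beam.io.gcp.bigquery import BigQueryDisposition, WriteToBigQuery

    return WriteToBigQuery(
        table=route_to_table,  # callable evaluated for each element
        method=WriteToBigQuery.Method.FILE_LOADS,
        # Streaming file loads need a triggering frequency, in seconds.
        triggering_frequency=triggering_frequency_secs,
        schema="event_type:STRING,payload:STRING",  # assumed schema
        create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=BigQueryDisposition.WRITE_APPEND,
    )
```

In a streaming pipeline the transform would be applied to a PCollection of dictionaries, for example `rows | build_write_transform()`, with each row routed to a table named after its `event_type` value.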

The single impulse ensures that this set of work is executed once, but when more data arrives after that first execution, none of it gets actioned.
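The effect can be pictured with a toy model in plain Python. This is not Beam code and does not reproduce Beam's internals. It only assumes, as described above, that the load step runs for a batch of files only when the trigger feeding it has fired. The function and variable names are made up for the illustration.

```python
def loaded_batches(batches, trigger_fired):
    """Return the batches whose load step actually ran.

    A batch is loaded only if the trigger feeding the load step fired
    for it. The trigger is a toy stand-in for the impulse side input.
    """
    return [batch for batch, fired in zip(batches, trigger_fired) if fired]


batches = ["files-1", "files-2", "files-3"]

# Single impulse: fires for the first batch only.
single_impulse = [True, False, False]
# Recurring trigger: fires for every batch.
recurring = [True, True, True]

print(loaded_batches(batches, single_impulse))
print(loaded_batches(batches, recurring))
```

With the single impulse only the first batch is loaded, while the recurring trigger loads every batch.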

I reported this issue directly to Dataflow customer support at GCP, but since I have not seen it actioned, I'm reporting it here.

Here is the Google Doc I created for them that walks through reproduction of the bug.

This has been reproduced on both Apache Beam 2.32 and 2.40, running on Dataflow.

Issue Priority

Priority: 2

Issue Component

Component: io-py-gcp

@Abacn
Contributor

Abacn commented Sep 9, 2022

CC: @ahmedabu98 who is currently working on fixing this
.remove-labels "awaiting triage"

@ahmedabu98
Contributor

Thanks @Abacn

Thank you @rizenfrmtheashes for that document, it was really helpful in understanding the underlying issues. I'm working on a solution in #23012 (currently just writing tests and changing other tests that relied on the previous implementation).

@github-actions
Contributor

github-actions bot commented Sep 9, 2022

Label "awaiting cannot be managed because it does not exist in the repo. Please check your spelling.

@Abacn
Contributor

Abacn commented Sep 9, 2022

.remove-labels 'awaiting triage'

@rizenfrmtheashes
Author

@ahmedabu98 Oh, I absolutely missed this. I'm glad it's getting actioned. I took a look at the PR and it looks like it addresses the issue with the WaitForBQJobs single impulse, as well as removing the dependency of the subsequent tasks on that single impulse. Looking forward to seeing its inclusion in the next release!

@ahmedabu98
Contributor

Thanks for taking a look! I'm looking to get it merged before the end of the week, though I'm not sure if it will make version 2.42.0 as that release branch has already been cut.
