[Bug]: WriteToBigQuery with file_loads and dynamic table destination doesn't load after first File Load #23104
Comments
CC: @ahmedabu98, who is currently working on fixing this.
Thanks @Abacn. Thank you @rizenfrmtheashes for that document, it was really helpful in understanding the underlying issues. I'm working on a solution in #23012 (currently just writing tests and changing other tests that relied on the previous implementation).
Label "awaiting triage" cannot be managed because it does not exist in the repo. Please check your spelling.
.remove-labels 'awaiting triage'
@ahmedabu98 Oh, I absolutely missed this. I'm glad it's getting actioned. I took a look at the PR, and it looks good at addressing the issue with the
Thanks for taking a look! I'm looking to get it merged before the end of the week, though I'm not sure it will make version 2.42.0, as that release branch has already been cut.
What happened?
I was able to reproduce a bug in the WriteToBigQuery transform in the Apache Beam source code when running a streaming pipeline on Dataflow with the file loads write method. The issue is primarily caused by the Impulse nodes that feed the WaitForBQJobs step, which is used at four separate points in the pipeline.

The bug manifests when using file loads to write data to different tables, with the destination table chosen by a lambda that reads each input row. After the first file load, the impulses (which should either be periodic or be replaced by a different triggering mechanism) never fire again, so the side inputs are never refreshed for the nodes that:

- load the data into a temp table,
- check whether that load is complete,
- apply any schema change to the intended table and check whether that change is complete,
- copy from the temp table to the intended table and check whether the copy is complete,
- and then remove the original temp table.

Because a single impulse drives this work, it is executed exactly once; when more data arrives afterwards, none of it is actioned.
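For reference, here is a minimal sketch of the pipeline shape that triggers the behaviour. The Pub/Sub subscription, project, dataset, schema, and table-naming lambda below are placeholders I made up for illustration; they are not taken from the reproduction doc.

```python
# Minimal repro sketch (assumed names): stream JSON messages from Pub/Sub and
# write them to per-type BigQuery tables using the FILE_LOADS method.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    # Streaming mode is required to hit the periodic file-load code path.
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        rows = (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/my-sub")
            | "Parse" >> beam.Map(json.loads)
        )

        # Dynamic destination: the table is chosen per element. Per the bug
        # described above, only the first round of load/copy jobs runs;
        # data arriving after that is never loaded.
        _ = rows | "WriteToBQ" >> beam.io.WriteToBigQuery(
            table=lambda row: "my-project:my_dataset.events_%s" % row["type"],
            schema="type:STRING,payload:STRING",
            method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
            triggering_frequency=60,  # start load jobs every 60 seconds
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )


if __name__ == "__main__":
    run()
```

With a static table name the same pipeline behaves as expected in my understanding; the dynamic destination is what routes the write through the per-destination temp-table and copy steps described above.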
I reported this issue directly to GCP customer support for the Dataflow team, but after not seeing it actioned, I'm reporting it here.
Here is the Google Doc I created for them that walks through reproduction of the bug.
This has been reproduced on both Apache Beam 2.32 and 2.40, running on Dataflow.
Issue Priority
Priority: 2
Issue Component
Component: io-py-gcp