[Bug]: Python WriteToBigQuery with File Loads and dynamic table destinations sometimes remove temp tables before copy finishes or is triggered. #23670
Comments
.add-labels gcp,io,python |
It's entirely possible that during an autoscale event, work items get replayed, and with them the third-party side effects, namely the copy jobs and the remove jobs. This matters specifically for work items that trigger a remove job and are then reset because they didn't reach the end of the pipeline and so were never considered "committed". Regardless, there should be a way to handle or track this, or to make sure that load to temp table -> copy to destination -> remove temp table happens as a single commit phase, and to allow duplication. |
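To make the "single commit phase" idea concrete, here is a minimal in-memory sketch. `FakeWarehouse` and `load_copy_delete` are hypothetical names for illustration only; nothing here is a BigQuery or Beam API. It only shows why replaying the whole load -> copy -> delete unit is safe if duplicates are acceptable, while replaying just the copy after the delete already ran is not.

```python
class FakeWarehouse:
    """In-memory stand-in for a table store (hypothetical, not a BigQuery API)."""

    def __init__(self):
        self.tables = {}

    def load(self, table, rows):
        # A load recreates the temp table with the given rows.
        self.tables[table] = list(rows)

    def copy(self, source, destination):
        if source not in self.tables:
            raise LookupError(f"Not found: {source}")
        # Append semantics: a replayed copy duplicates rows.
        self.tables.setdefault(destination, []).extend(self.tables[source])

    def delete(self, table):
        # Deleting a missing table is treated as a no-op.
        self.tables.pop(table, None)


def load_copy_delete(warehouse, temp, dest, rows):
    """Run the three side effects as one replayable unit."""
    warehouse.load(temp, rows)
    warehouse.copy(temp, dest)
    warehouse.delete(temp)


wh = FakeWarehouse()
load_copy_delete(wh, "tmp_1", "dest", [1, 2])
load_copy_delete(wh, "tmp_1", "dest", [1, 2])  # full replay: duplicates, no error

# Partial replay: the delete already ran, then only the copy is retried.
wh.load("tmp_2", [3])
wh.copy("tmp_2", "dest")
wh.delete("tmp_2")
try:
    wh.copy("tmp_2", "dest")
    replay_error = None
except LookupError as exc:
    replay_error = str(exc)
```

The partial replay ends in a "not found" error on the temp table, which is the same shape of failure reported in this issue.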
cc: @ahmedabu98 |
This is most likely not what's happening, because of the GBK here. I'm wondering if that GBK is necessary (I think BQ ignores delete ops for non-existing tables, so sending more than one request for a table should be fine). @pabloem WDYT? The GBK is an old addition, but I thought I'd still ask. |
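For reference, a sketch of what a duplicate-tolerant delete could look like. It assumes `client` behaves like `google.cloud.bigquery.Client`, whose `delete_table` raises on a missing table unless `not_found_ok=True` is passed. Beam's file-loads code goes through its own internal wrapper, so this is an illustration and not what `bigquery_file_loads.py` does. `RecordingClient` is a hypothetical stub so the snippet runs without credentials.

```python
def delete_temp_tables(client, table_ids):
    """Delete each temp table, tolerating tables that are already gone.

    Assumption: `client.delete_table` has the same signature as
    google.cloud.bigquery.Client.delete_table.
    """
    for table_id in table_ids:
        client.delete_table(table_id, not_found_ok=True)


class RecordingClient:
    """Hypothetical stub standing in for a real BigQuery client."""

    def __init__(self, existing):
        self.existing = set(existing)
        self.calls = []

    def delete_table(self, table, not_found_ok=False):
        self.calls.append(table)
        if table not in self.existing:
            if not_found_ok:
                return
            raise LookupError(f"Not found: {table}")
        self.existing.remove(table)


client = RecordingClient({"proj.ds.tmp_a"})
# The same table is requested twice, as it could be without the GBK.
delete_temp_tables(client, ["proj.ds.tmp_a", "proj.ds.tmp_a"])
```

With this shape, sending more than one delete request for the same table is harmless, which is the property the question about dropping the GBK relies on.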
Although, the GBK maintains a boundary for retries here. The load to temp table and copy to destination steps are fusible, and deleting the temp tables is its own separate step. So any retries that happen while deleting temp tables would not trigger a retry of the load and copy steps. |
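A toy model of that retry boundary, as described in the comment above. This is plain Python, not runner code, and `run_with_barrier` is a made-up name: stage 1 stands for the fused load + copy steps, the list it returns stands for what is committed at the GBK, and stage 2 retries only the delete.

```python
def run_with_barrier(temp_tables, delete_failures):
    """Simulate two stages separated by a commit barrier; return the call log."""
    log = []

    def load_and_copy(table):
        log.append(("load", table))
        log.append(("copy", table))
        return table

    def delete_temp(table, fail):
        log.append(("delete", table))
        if fail:
            raise RuntimeError("transient failure")

    # Stage 1: fused load + copy. Its outputs are "committed" at the barrier.
    committed = [load_and_copy(t) for t in temp_tables]

    # Stage 2: a retry re-reads the committed outputs and never re-runs stage 1.
    for table in committed:
        remaining = delete_failures
        while True:
            try:
                delete_temp(table, fail=remaining > 0)
                break
            except RuntimeError:
                remaining -= 1
    return log


log = run_with_barrier(["tmp_1"], delete_failures=2)
```

In this model the delete is attempted three times while load and copy each run once, so a failing delete alone cannot explain a copy being issued after its temp table is gone.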
@rizenfrmtheashes does this only happen with dynamic destinations? Also can you check your log timestamps and see if these errors start showing up near an autoscale event? |
I think it only happens with dynamic destinations because there aren't any remove table steps for static destinations. That path loads the data directly into the destination table instead of going through an intermediate temp table (there is an exception when the data hits certain size thresholds and has to be partitioned, but that's an edge case). I just checked the job I use in production where this occurs, and the last autoscaling event occurred a little less than 2 hours before the first log where this error gets reported. So it's likely not autoscaling. I didn't see a one-off instance reallocation during that time either.
After re-reading the code there, I agree, the GBK isn't in the right place for this replay phenomenon to occur like I'm describing. I'll keep digging on my end to find more smoking guns. |
So you're right that the load to temp --> copy --> delete temp route is always chosen for dynamic destinations. However, this does also get triggered by large loads that don't fit in a single load job. So I'm wondering if this issue is unique to dynamic destinations or if it's for all writes that take this alternative route. |
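The routing described in the last few comments can be summarized in a few lines. This is a sketch of the behavior as stated in this thread, not code read from the Beam source; `uses_temp_tables` and its parameters are hypothetical names.

```python
def uses_temp_tables(dynamic_destinations: bool, num_load_partitions: int) -> bool:
    """Return True when a write takes the load-to-temp -> copy -> delete route.

    Per this thread: always for dynamic destinations, and also for loads
    too large to fit in a single load job (more than one partition).
    """
    return dynamic_destinations or num_load_partitions > 1


routes = {
    "static, single load job": uses_temp_tables(False, 1),
    "static, large load": uses_temp_tables(False, 3),
    "dynamic, single load job": uses_temp_tables(True, 1),
}
```

If the bug lives in the temp-table route itself, the second case would be affected just as much as the third.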
You're right. I haven't reproduced (or had the time to reproduce) the case you're talking about, but I doubt it's isolated to dynamic table destinations. I suspect the error also occurs for large direct loads like you're describing, since the code isn't much different in either situation. |
I want to bump/note that I still experience this issue, even on Beam 2.48. I continue to use a patched version. I have recently needed to launch new kinds of pipelines, also writing to BQ via file loads and dynamic table destinations, and I still encounter this issue.
It would be helpful to know if this is being explored or actioned, because otherwise I have to keep using my patch that never deletes the temp tables, with which I have had no issues. This thread suspected that some retry before a group-by commit barrier was causing work items to get replayed, with third-party side effects deleting the temp tables so that subsequent retries fail. But we couldn't figure out where a retry loop could happen in which the temp tables wouldn't be regenerated or renamed. I wonder if there's a mix-up in the mapping of job names to tables? Also, as a note, I only saw this happen whenever I drained a job. Might be a red herring. |
What happened?
We believe that in the Python SDK, in the GCP IO library, when WriteToBigQuery is used with file_loads and dynamic table destinations, temp tables are in rare scenarios removed by the RemoveTempTables PTransform before the prior copy jobs finish or even get kicked off.
We've only seen this occur under heavy load (many millions of rows) and high parallelism (Beam 2.40, Dataflow v2 runner, autoscaling from 1 to ~40 n1-standard-4 instances).
As a note, although we are using Beam 2.40, we applied the current master branch version of this file here via a copy/paste patch; it contains the latest fixes for BigQuery file loads from PR #23012.
We encounter stack traces like
As a note, bigquery_file_loads_patch_40.py is just a reference to a copy/pasted version of the source bigquery_file_loads.py file in the gcp/io section of the SDK that we used to backport fixes from newer versions of Beam (like #23012). We did dependency checking to make sure the backported fixes were okay.
You can likely reproduce by using a code pipeline like this doc here (used for a prior bug report), and instead send millions of rows through.
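For anyone trying to reproduce without the linked doc, a rough sketch of the kind of pipeline meant here. It is untested against a real project. The project, dataset, table prefix, row shape, and the constants are all placeholders I made up; only `WriteToBigQuery` with `Method.FILE_LOADS` and a callable `table` come from the report. Beam is imported inside `run` so the helper functions can be exercised without it installed.

```python
PROJECT = "my-project"   # placeholder
DATASET = "my_dataset"   # placeholder
NUM_TABLES = 50          # hypothetical number of dynamic destinations


def make_row(i):
    """Build one row; `shard` decides which destination table it lands in."""
    return {"id": i, "shard": i % NUM_TABLES}


def rows_for_seed(seed, rows_per_seed):
    """Fan one seed out into many rows so the input isn't built in memory."""
    start = seed * rows_per_seed
    for i in range(start, start + rows_per_seed):
        yield make_row(i)


def destination_for(row):
    """Dynamic table destination: one table per shard."""
    return f"{PROJECT}:{DATASET}.events_{row['shard']}"


def run(num_seeds=1000, rows_per_seed=10000, argv=None):
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # FILE_LOADS needs a GCS temp location, e.g. --temp_location in argv.
    with beam.Pipeline(options=PipelineOptions(argv)) as p:
        (
            p
            | "Seeds" >> beam.Create(range(num_seeds))
            | "FanOut" >> beam.FlatMap(rows_for_seed, rows_per_seed)
            | "Spread" >> beam.Reshuffle()
            | "Write" >> beam.io.WriteToBigQuery(
                table=destination_for,
                schema="id:INTEGER,shard:INTEGER",
                method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
```

The defaults give ten million rows across 50 tables; the report only says "many millions of rows", so the numbers that actually trigger the failure are unknown.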
Issue Priority
Priority: 2
Issue Component
Component: io-py-gcp