BigQuery FILE_LOADS failed with 400 error in streaming mode in Python #20824
Comments
Is this still an issue?
Is this issue being actioned? We encounter it whenever we move a pipeline into draining. It prevents us from using the draining status as a way to check whether a pipeline has fully cleared its backlog, and forces us to estimate when a pipeline is done before cancelling it directly. This is not ideal. I can provide more info if requested!
@rizenfrmtheashes could you provide a code sample we could use as a repro?
Sure. We ended up dumping a large amount of data with a specified schema through a Reshuffle and then into a BigQuery FILE_LOADS write with dynamic table destinations.
We used an input similar to the one described in this bug report doc here. (The bug in that doc was reported in #23104 and mostly fixed by #23012.) When we set the job to draining after writing tens of thousands of rows, this is the stack trace we get.
As a note (org names are redacted in the stack traces), we were using Beam 2.40 and the Dataflow v2 runner when this happened.
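For reference, a minimal sketch of the kind of setup described above: a Reshuffle feeding a FILE_LOADS write with a per-element table destination. The project, dataset, routing field, and schema below are placeholders, not the reporter's actual configuration.

```python
import apache_beam as beam

def route_to_table(row):
    # Illustrative routing only: derive the destination table from a field in
    # the row. 'event_type' and the project/dataset names are assumptions.
    return 'my-project:my_dataset.events_%s' % row['event_type']

def write_with_dynamic_destinations(rows):
    return (
        rows
        | 'Reshuffle' >> beam.Reshuffle()
        | 'WriteToBQ' >> beam.io.WriteToBigQuery(
            table=route_to_table,                       # dynamic destination per element
            schema='event_type:STRING,payload:STRING',  # assumed schema
            method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
            triggering_frequency=300,                   # seconds between load-job triggers in streaming
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```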
This is very helpful, thank you so much, @rizenfrmtheashes. As a next step, we should identify whether the error is caused by the drain logic or is a gap in the BQ IO implementation (incorrect usage of BQ APIs during the draining phase). I suspect it's the latter. Will try to find an owner to look closer.
If you want the pip-tools-style requirements file we use to build the container that runs in this Dataflow job, I can provide that too. We used pip-tools to find the minimum versions that can safely run with Beam 2.40; maybe a version of a base GCP Python package is causing this issue.
There was a similar issue a few months ago where a pipeline in draining ran into similar errors. This connector used to throw an early error when a source URI (i.e., a file to load to BQ) was not provided. That issue was mitigated with https://github.com/apache/beam/pull/17566/files, where the error was replaced with a warning. In contrast, the error in this issue looks like it comes from BigQuery itself. It appears we are running into a similar problem: there are no files to load since the pipeline is in the draining phase, but load job requests are still being sent.
We could perform a simple check to see whether there are any files to load before sending the load job request.
Could reproduce this with the following pipeline:
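(The original snippet is not preserved in this copy of the thread. Below is a minimal sketch of a pipeline of this shape, assuming a Pub/Sub source and placeholder topic, table, and schema names; it is not the exact reproduction.)

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical placeholders, not the original resources.
TOPIC = 'projects/my-project/topics/my-topic'
TABLE = 'my-project:my_dataset.my_table'

def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> beam.io.ReadFromPubSub(topic=TOPIC)
         | 'Parse' >> beam.Map(json.loads)
         | 'Write' >> beam.io.WriteToBigQuery(
             TABLE,
             schema='event_type:STRING,payload:STRING',
             method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
             triggering_frequency=60))

if __name__ == '__main__':
    run()
```

Running a pipeline like this on Dataflow and draining it after some rows have been written is the scenario in which the "Load configuration must specify at least one source URI" error described above appears.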
I think it would be safe to log a warning and ignore the bundle in that case. Update: although I'm having trouble identifying what exactly causes us to end up with empty file lists in the first place.
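A minimal sketch of the guard being discussed, assuming a DoFn that receives (destination, file_list) pairs the way the load-triggering step in bigquery_file_loads.py does; this is illustrative, not the actual Beam code or the eventual patch.

```python
import logging

import apache_beam as beam

class TriggerLoadJobsWithGuard(beam.DoFn):
    """Illustrative only: skip load-job requests for bundles with no files.

    The element layout (destination, [file_path, ...]) mirrors what the
    load-triggering step receives; job naming, retries, and the actual
    BigQuery call are omitted.
    """

    def process(self, element):
        destination, files = element
        if not files:
            # During drain, empty file lists can reach this step; issuing a
            # load job with no source URIs makes BigQuery return a 400.
            logging.warning(
                'No files to load for destination %s; skipping load job.',
                destination)
            return
        # ... issue the BigQuery load job for `files` here ...
        yield destination
```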
We are using FILE_LOADS to write to BigQuery in streaming mode using Python.
After running for about an hour, the Beam job throws a RuntimeError (apitools.base.py.exceptions.HttpBadRequestError) with the error message "Load configuration must specify at least one source URI". Perhaps this can be fixed by validating that the input value `files` (= `element[1]`, see https://github.com/apache/beam/blob/v2.28.0/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L469) is not empty.
Imported from Jira BEAM-11939. Original Jira may contain additional context.
Reported by: yshimizu.