feat(bigquery): use storage api for query jobs #6822

Merged (52 commits) on Jan 23, 2023

Conversation

alvarowolfx

Initial work on using the Storage API to fetch the results of a query. This is more efficient because data can be downloaded in parallel by splitting the read session, and because results are transferred in Arrow, a more efficient format. The API surface for users stays the same: they can still transform query results into user-defined structs. Under the hood, the library takes care of converting the Arrow data into the user-defined struct.

One thing to note is that this introduces the first external dependency on the Apache Arrow Go library.

Initially, this will be an experimental feature, and users will be explicitly asked to create a bqStorage.BigQueryReadClient.

Proposed in issue #3880 and based on similar work in the Python library: https://medium.com/google-cloud/announcing-google-cloud-bigquery-version-1-17-0-1fc428512171
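
The parallel download mentioned in the description can be pictured with a small, library-free sketch. It assumes a hypothetical `fetchFunc` that returns all rows of one read stream as plain integers; none of these names come from the PR or from the BigQuery Storage API, and the real code decodes Arrow record batches rather than ints.

```go
package main

import "sync"

// fetchFunc is a hypothetical stand-in for reading every row of one read
// stream. It is an assumption for illustration, not a BigQuery API.
type fetchFunc func(stream string) []int

// readAllStreams reads every stream concurrently and merges the rows into
// one slice. Row order across streams is not preserved.
func readAllStreams(streams []string, fetch fetchFunc) []int {
	rows := make(chan int)
	var wg sync.WaitGroup
	for _, s := range streams {
		wg.Add(1)
		go func(stream string) {
			defer wg.Done()
			for _, r := range fetch(stream) {
				rows <- r
			}
		}(s)
	}
	// Close the channel once every stream has been drained so the
	// collecting loop below terminates.
	go func() {
		wg.Wait()
		close(rows)
	}()
	var out []int
	for r := range rows {
		out = append(out, r)
	}
	return out
}
```

Because rows from different streams interleave in this kind of fan-in, the merged output has no guaranteed order, which is worth keeping in mind for queries whose results are expected to be sorted.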

@product-auto-label product-auto-label bot added size: l Pull request size is large. api: bigquery Issues related to the BigQuery API. labels Oct 7, 2022
@product-auto-label product-auto-label bot added size: xl Pull request size is extra large. and removed size: l Pull request size is large. labels Oct 14, 2022
@alvarowolfx alvarowolfx requested a review from shollyman October 25, 2022 17:35

@shollyman shollyman left a comment

Left a few initial comments

bigquery/integration_test.go (outdated)
bigquery/storage/reader/arrow.go (outdated)
bigquery/storage/reader/arrow.go (outdated)
bigquery/storage/reader/client.go (outdated)
@product-auto-label product-auto-label bot added the stale: old Pull request is old and needs attention. label Nov 7, 2022
bigquery/storage_client.go (outdated)
bigquery/storage_iterator.go (outdated)
return nil
}

func (it *arrowIterator) processStream(readStream string) {

The current implementation is appropriate for a unary call, but could be refined into something more stream-oriented (e.g., fail after N consecutive failed attempts, and reset the attempt count after a successful request).
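
One way to read the suggestion above is the following minimal sketch. It assumes a hypothetical `recv` callback that returns `nil` after a successful receive and a hypothetical sentinel `errStreamDone` at end of stream; neither is taken from the PR, and backoff between attempts is left out for brevity.

```go
package main

import "errors"

// errStreamDone is a hypothetical sentinel marking the end of the stream.
var errStreamDone = errors.New("stream done")

// consumeStream calls recv until it reports errStreamDone. It gives up after
// maxConsecutive failures in a row; any successful receive resets the count.
func consumeStream(recv func() error, maxConsecutive int) error {
	failures := 0
	for {
		err := recv()
		switch {
		case err == nil:
			failures = 0 // progress was made, so reset the failure budget
		case errors.Is(err, errStreamDone):
			return nil
		default:
			failures++
			if failures >= maxConsecutive {
				return err
			}
		}
	}
}
```

The difference from a unary retry budget is that a long-lived stream can survive many transient errors over its lifetime, as long as they are not consecutive.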

bigquery/storage_iterator_test.go (outdated)

@shollyman shollyman left a comment

I think this is ready to ship once you address the remaining items. Left one more comment as a holdover from our chat today about selecting the correct statement from a multi-statement execution (script) where there are different types of statements.

bigquery/storage_iterator.go (outdated)
@alvarowolfx alvarowolfx requested a review from shollyman January 11, 2023 22:16
// This function uses a naive approach of checking the root level query
// (ignoring subqueries, function calls, etc.) and checking
// if it contains an ORDER BY clause.
func HasOrderedResults(sql string) bool {

I think this is a reasonable approach, though something like unmatched parens in an inline comment might muck this up. To get more right requires more SQL parsing than we'd want to do locally, and the resolution is fairly simple in these instances.
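
To make the trade-off concrete, here is a rough sketch of a root-level ORDER BY check. It is not the PR's `HasOrderedResults`; the name `hasTopLevelOrderBy` and the paren-depth approach are assumptions chosen for illustration.

```go
package main

import (
	"regexp"
	"strings"
)

// orderByRE matches ORDER BY case-insensitively, with any whitespace between
// the two keywords.
var orderByRE = regexp.MustCompile(`(?i)\bORDER\s+BY\b`)

// hasTopLevelOrderBy is a hypothetical sketch: it keeps only the text outside
// parentheses and looks for ORDER BY there. It does not understand comments
// or string literals.
func hasTopLevelOrderBy(sql string) bool {
	var top strings.Builder
	depth := 0
	for _, r := range sql {
		switch r {
		case '(':
			if depth == 0 {
				top.WriteByte(' ') // keep surrounding tokens separated
			}
			depth++
		case ')':
			if depth > 0 {
				depth--
			}
		default:
			if depth == 0 {
				top.WriteRune(r)
			}
		}
	}
	return orderByRE.MatchString(top.String())
}
```

In this sketch, an ORDER BY inside a subquery or a window function is ignored, as intended. An unmatched `(` inside an inline comment, however, hides everything after it, so a query such as `SELECT a FROM t -- :(` followed by `ORDER BY a` on the next line is wrongly reported as unordered.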

@alvarowolfx alvarowolfx added the automerge Merge the pull request once unit tests and other checks pass. label Jan 23, 2023
@gcf-merge-on-green gcf-merge-on-green bot merged commit 26c04f4 into googleapis:main Jan 23, 2023
@gcf-merge-on-green gcf-merge-on-green bot removed the automerge Merge the pull request once unit tests and other checks pass. label Jan 23, 2023
gcf-merge-on-green bot pushed a commit that referenced this pull request Feb 7, 2023
Labels
api: bigquery Issues related to the BigQuery API. size: xl Pull request size is extra large. stale: extraold Pull request is critically old and needs prioritization.
2 participants