feat(bigquery): use storage api for query jobs #6822
Conversation
Left a few initial comments
	return nil
}

func (it *arrowIterator) processStream(readStream string) {
The current implementation is appropriate for a unary call, but it could be refined to something more stream-oriented (e.g. fail after N consecutive failed attempts, and reset the counter when a request succeeds).
I think this is ready to ship once you address the remaining items. Left one more comment as a holdover from our chat today about selecting the correct statement from a multi-statement execution (script) where there are different types of statements.
// This function uses a naive approach of checking the root-level query
// (ignoring subqueries, function calls, etc.) and checking
// if it contains an ORDER BY clause.
func HasOrderedResults(sql string) bool {
I think this is a reasonable approach, though something like unmatched parens in an inline comment might muck this up. Getting this more right would require more SQL parsing than we'd want to do locally, and the resolution is fairly simple in these instances.
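A minimal sketch of that naive root-level check (a hypothetical helper, not the library's actual implementation): it drops everything inside parentheses, so subqueries and function calls are ignored, and then looks for ORDER BY in the remainder. As noted above, unmatched parens inside a comment would confuse it:

```go
package main

import (
	"fmt"
	"strings"
)

// hasOrderedResults drops all parenthesized content (subqueries,
// function calls) and checks whether the remaining root-level text
// contains an ORDER BY clause.
func hasOrderedResults(sql string) bool {
	var b strings.Builder
	depth := 0
	for _, r := range sql {
		switch r {
		case '(':
			depth++
		case ')':
			if depth > 0 {
				depth--
			}
		default:
			if depth == 0 {
				b.WriteRune(r)
			}
		}
	}
	return strings.Contains(strings.ToUpper(b.String()), "ORDER BY")
}

func main() {
	fmt.Println(hasOrderedResults("SELECT * FROM t ORDER BY x"))            // true
	fmt.Println(hasOrderedResults("SELECT ARRAY_AGG(x ORDER BY x) FROM t")) // false: ORDER BY is inside parens
}
```

Whether results are ordered matters here because an ordered query must be read as a single stream to preserve row order, while an unordered one can be split across parallel streams.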
🤖 I have created a release *beep* *boop*

---

## [1.46.0](https://togithub.com/googleapis/google-cloud-go/compare/bigquery/v1.45.0...bigquery/v1.46.0) (2023-02-06)

### Features

* **bigquery:** Add dataset/table collation ([#7235](https://togithub.com/googleapis/google-cloud-go/issues/7235)) ([9f7bbeb](https://togithub.com/googleapis/google-cloud-go/commit/9f7bbeb466bd7572544c4178a33370a25b5f1496))
* **bigquery:** Use storage api for query jobs ([#6822](https://togithub.com/googleapis/google-cloud-go/issues/6822)) ([26c04f4](https://togithub.com/googleapis/google-cloud-go/commit/26c04f4cd5083b4aa3c219500572d3af2f291645))

### Bug Fixes

* **bigquery:** Create/update an isolated dataset for collation feature ([#7256](https://togithub.com/googleapis/google-cloud-go/issues/7256)) ([b371558](https://togithub.com/googleapis/google-cloud-go/commit/b3715585aa6892fc41a29027694c72f31390441a))
* **bigquery:** Fetch dst table for jobs when reading with Storage API ([#7325](https://togithub.com/googleapis/google-cloud-go/issues/7325)) ([0bf80d7](https://togithub.com/googleapis/google-cloud-go/commit/0bf80d72a893755adefdead900e8990ed53d9627)), refs [#7322](https://togithub.com/googleapis/google-cloud-go/issues/7322)

---

This PR was generated with [Release Please](https://togithub.com/googleapis/release-please). See [documentation](https://togithub.com/googleapis/release-please#release-please).
Initial work on using the Storage API for fetching the results of a query. This is more efficient because it can download data in parallel by splitting the read session, and it uses Arrow as a more efficient format. The API surface for users stays the same: they can still transform query results into user-defined structs, and under the hood the library takes care of converting the Arrow-encoded data into the user-defined struct.
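The parallel-download idea can be illustrated with a generic sketch (names and shapes are illustrative, not the library's internals): each stream of a split read session gets its own consumer goroutine, and rows are merged onto one channel:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// readStreams fans out one goroutine per stream and merges all rows
// onto a single channel, mirroring how a read session split into
// several streams can be consumed in parallel.
func readStreams(streams [][]int) []int {
	out := make(chan int)
	var wg sync.WaitGroup
	for _, s := range streams {
		wg.Add(1)
		go func(rows []int) {
			defer wg.Done()
			for _, r := range rows {
				out <- r // deliver each row to the merged channel
			}
		}(s)
	}
	// Close the merged channel once every stream is drained.
	go func() {
		wg.Wait()
		close(out)
	}()
	var all []int
	for r := range out {
		all = append(all, r)
	}
	sort.Ints(all) // arrival order is nondeterministic; sort for display
	return all
}

func main() {
	fmt.Println(readStreams([][]int{{1, 4}, {2, 5}, {3, 6}})) // [1 2 3 4 5 6]
}
```

Note the sort at the end: with parallel streams, row arrival order is not deterministic, which is why ordered queries need special handling.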
One thing to note is that this introduces the first external dependency on the Apache Arrow Go library.
Initially we are going to ship it as an experimental feature and explicitly ask users to create a bqStorage.BigQueryReadClient.

Proposed in issue #3880; prior work on the Python library is described at https://medium.com/google-cloud/announcing-google-cloud-bigquery-version-1-17-0-1fc428512171