
First step in adding schema update to Storage API sink. Refactor code #21395 #24147

Merged: 4 commits, Nov 29, 2022

Conversation

@reuvenlax (Contributor) commented on Nov 14, 2022:

The BigQuery Storage Write API detects updated schemas and returns the new schema. However, the schema is returned as a proto TableSchema. The current sink creates the proto descriptor directly from the input type (either a Beam schema or a json TableSchema), which is problematic because the new proto descriptor must be compatible with the old one. This PR refactors the sink to always construct the descriptor from the proto TableSchema.

Note: this may slightly increase message size for some schemas. E.g., for a Beam schema with an INT32 field, we currently generate an int32 proto field. However, BigQuery schemas themselves don't have int32 fields, only int64 fields, so round-tripping through TableSchema means we will now generate an int64 proto field instead.
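The widening described above can be illustrated with a tiny, hypothetical mapping (the class and method names here are stand-ins for illustration, not Beam's actual code): a Beam INT32 field round-trips through a BigQuery TableSchema, which only has a 64-bit integer type, so the regenerated proto field is int64 rather than int32.

```java
import java.util.Map;

public class IntegerWidening {
    // Hypothetical simplified tables; the real mappings live in Beam's
    // BigQueryIO sink and the Storage Write API schema translation.
    static final Map<String, String> BEAM_TO_BIGQUERY =
        Map.of("INT32", "INT64", "INT64", "INT64", "STRING", "STRING");
    static final Map<String, String> BIGQUERY_TO_PROTO =
        Map.of("INT64", "int64", "STRING", "string");

    // Round-trip a Beam field type through the BigQuery TableSchema.
    static String protoFieldType(String beamType) {
        return BIGQUERY_TO_PROTO.get(BEAM_TO_BIGQUERY.get(beamType));
    }

    public static void main(String[] args) {
        // Before the refactor an INT32 field became an int32 proto field;
        // going through TableSchema it now becomes int64.
        System.out.println(protoFieldType("INT32")); // int64
    }
}
```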

This PR also removes the old schema-update code in preparation for the new version.

@github-actions (Contributor) commented:

Assigning reviewers. If you would like to opt out of this review, comment `assign to next reviewer`:

R: @robertwb for label java.
R: @Abacn for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@reuvenlax force-pushed the schema_push_notification_stage1 branch from 2ba93fe to 73572b2 on November 15, 2022, 10:40
@reuvenlax changed the title from "Schema push notification stage1" to "First step in adding schema update to Storage API sink. Refactor code #21395" on Nov 15, 2022
@reuvenlax (Contributor, Author):

R: @yirutang

@github-actions (Contributor) commented:

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@reuvenlax (Contributor, Author):

friendly ping

@yirutang (Contributor):

R: @agrawal-siddharth

@yirutang (Contributor):

How is the initial schema determined (before the schema update event)?

@reuvenlax (Contributor, Author):

The initial schema is always what the user provides to the BQ sink (though if the table already exists and the user leaves the schema null, we call getTable to determine it).
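That fallback rule can be sketched as follows, with hypothetical stand-in types (the real sink works with TableSchema objects and the BigQuery services API, not plain strings):

```java
import java.util.function.Supplier;

public class InitialSchemaResolution {
    // Hypothetical sketch of the rule described above: the user-supplied
    // schema wins; if it is null and the table already exists, the schema
    // is fetched with getTable.
    static String resolveInitialSchema(
            String userSchema, boolean tableExists, Supplier<String> getTable) {
        if (userSchema != null) {
            return userSchema;
        }
        if (tableExists) {
            return getTable.get();
        }
        throw new IllegalStateException("No schema provided and table does not exist");
    }

    public static void main(String[] args) {
        // User-supplied schema takes precedence over the existing table's schema.
        System.out.println(resolveInitialSchema("user-schema", true, () -> "table-schema"));
        // With no user schema, fall back to the existing table.
        System.out.println(resolveInitialSchema(null, true, () -> "table-schema"));
    }
}
```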

Review comment on the diff, at:

    TableRowConverter(
        TableSchema tableSchema,

Contributor:

Why are there 3 types of schema here? What do they mean?

@reuvenlax (Contributor, Author):

this.tableSchema - the json TableSchema (the Beam API is written in terms of this schema, and it's usually what users give us)

this.protoTableSchema - the result of translating the json schema into the proto TableSchema

this.schemaInformation - extra information computed from the schema to allow easy conversion of a json row to a proto message
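A minimal sketch of how the three representations relate, using plain maps as hypothetical stand-ins for the real Beam and proto classes (field names, types, and the field-number scheme below are illustrative only):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ConverterSchemas {
    final Map<String, String> tableSchema;        // json TableSchema: what users give the Beam API
    final Map<String, String> protoTableSchema;   // json schema translated into proto TableSchema types
    final Map<String, Integer> schemaInformation; // precomputed: field name -> proto field number

    ConverterSchemas(Map<String, String> jsonSchema) {
        this.tableSchema = jsonSchema;
        this.protoTableSchema = new LinkedHashMap<>();
        this.schemaInformation = new LinkedHashMap<>();
        int fieldNumber = 1;
        for (Map.Entry<String, String> f : jsonSchema.entrySet()) {
            // BigQuery's INTEGER type corresponds to a 64-bit proto field.
            String protoType =
                f.getValue().equals("INTEGER") ? "int64" : f.getValue().toLowerCase();
            protoTableSchema.put(f.getKey(), protoType);
            schemaInformation.put(f.getKey(), fieldNumber++);
        }
    }

    public static void main(String[] args) {
        Map<String, String> json = new LinkedHashMap<>();
        json.put("id", "INTEGER");
        json.put("name", "STRING");
        ConverterSchemas c = new ConverterSchemas(json);
        System.out.println(c.protoTableSchema);   // {id=int64, name=string}
        System.out.println(c.schemaInformation);  // {id=1, name=2}
    }
}
```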


Review comment on the diff, at:

    public StorageApiWritePayload toMessage(TableRow tableRow, boolean respectRequired)
        throws Exception {
      boolean ignore = ignoreUnknownValues || autoSchemaUpdates;

Contributor:

autoSchemaUpdates implies ignoreUnknownValues?

I thought a user may want autoSchemaUpdates enabled but not want unknown fields?

@reuvenlax (Contributor, Author):

It does not imply that - it only ignores unknown values at the prior stage so that it can send them on to the writing stage. However, this actually belongs in the follow-on PR (in this PR there is not yet any handling of new schemas), so I'll remove it for now.
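The distinction being made here can be sketched with hypothetical names (not Beam's actual code): with autoSchemaUpdates, fields unknown to the current schema are not silently dropped at the conversion stage but set aside so the writing stage can handle them after a schema update.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class UnknownFieldRouting {
    // Hypothetical sketch: split a row into fields known to the current
    // schema and fields to carry forward for a later schema update.
    static Map<String, Object> splitRow(
            Map<String, Object> row, Set<String> knownFields,
            boolean autoSchemaUpdates, List<String> carriedForward) {
        Map<String, Object> known = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : row.entrySet()) {
            if (knownFields.contains(e.getKey())) {
                known.put(e.getKey(), e.getValue());
            } else if (autoSchemaUpdates) {
                // Forwarded to the writing stage, not silently dropped.
                carriedForward.add(e.getKey());
            }
        }
        return known;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("a", 1);
        row.put("b", 2);
        List<String> carried = new ArrayList<>();
        Map<String, Object> known = splitRow(row, Set.of("a"), true, carried);
        System.out.println(known);   // {a=1}
        System.out.println(carried); // [b]
    }
}
```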

@yirutang (Contributor) left a review comment:

I just did a rough pass. I don't know the connector well enough to give a thorough review...

@reuvenlax (Contributor, Author):

Run Java PreCommit

(1 similar comment by @reuvenlax)

@reuvenlax reuvenlax merged commit 5bb13fa into apache:master Nov 29, 2022
ruslan-ikhsan pushed a commit to ruslan-ikhsan/beam that referenced this pull request Nov 30, 2022
@nbali (Contributor) commented on Mar 3, 2023:

Some optimizations made in #22942 were mostly reverted here. Was that necessary? @reuvenlax @pabloem
