
First step in adding schema update to Storage API sink. Refactor code #21395 #24147

Merged: 4 commits, Nov 29, 2022

Conversation

@reuvenlax (Contributor) commented on Nov 14, 2022:

The BigQuery Storage Write API detects updated schemas and returns the new schema. However, the schema is returned as a proto TableSchema. The current sink creates the proto descriptor directly from the input type (either a Beam schema or a json TableSchema), which is problematic because the new proto descriptor must be compatible with the old one. This PR refactors the sink to always construct the descriptor from the proto TableSchema.

Note: this may slightly increase message size for some schemas. E.g., for a Beam schema with an INT32 field, we currently generate an int32 proto field. However, BigQuery schemas themselves don't have int32 fields, only int64 fields, so round-tripping through TableSchema means we will now generate an int64 proto field instead.
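The widening described above can be illustrated with a tiny, hypothetical mapping (the class and method names here are stand-ins for illustration, not Beam's actual code): a Beam INT32 field round-trips through a BigQuery TableSchema, which only has a 64-bit integer type, so the regenerated proto field is int64 rather than int32.

```java
import java.util.Map;

public class IntegerWidening {
    // Hypothetical simplified tables; the real mappings live in Beam's
    // BigQueryIO sink and the Storage Write API schema translation.
    static final Map<String, String> BEAM_TO_BIGQUERY =
        Map.of("INT32", "INT64", "INT64", "INT64", "STRING", "STRING");
    static final Map<String, String> BIGQUERY_TO_PROTO =
        Map.of("INT64", "int64", "STRING", "string");

    // Round-trip a Beam field type through the BigQuery TableSchema.
    static String protoFieldType(String beamType) {
        return BIGQUERY_TO_PROTO.get(BEAM_TO_BIGQUERY.get(beamType));
    }

    public static void main(String[] args) {
        // Before the refactor an INT32 field became an int32 proto field;
        // going through TableSchema it now becomes int64.
        System.out.println(protoFieldType("INT32")); // int64
    }
}
```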

This PR also removes the old schema-update code in preparation for the new version.

@github-actions (Contributor) commented:

Assigning reviewers. If you would like to opt out of this review, comment `assign to next reviewer`:

R: @robertwb for label java.
R: @Abacn for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@reuvenlax force-pushed the schema_push_notification_stage1 branch from 2ba93fe to 73572b2 on November 15, 2022, 10:40
@reuvenlax changed the title from "Schema push notification stage1" to "First step in adding schema update to Storage API sink. Refactor code #21395" on Nov 15, 2022
@reuvenlax (Contributor, Author):

R: @yirutang

@github-actions (Contributor) commented:

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@reuvenlax (Contributor, Author):

friendly ping

@yirutang (Contributor):

R: @agrawal-siddharth

@yirutang (Contributor):

How is the initial schema determined (before the schema update event)?

@reuvenlax (Contributor, Author):

The initial schema is always what the user provides to the BQ sink (though if the table already exists and the user leaves the schema null, we call getTable to determine it).
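That fallback rule can be sketched as follows, with hypothetical stand-in types (the real sink works with TableSchema objects and the BigQuery services API, not plain strings):

```java
import java.util.function.Supplier;

public class InitialSchemaResolution {
    // Hypothetical sketch of the rule described above: the user-supplied
    // schema wins; if it is null and the table already exists, the schema
    // is fetched with getTable.
    static String resolveInitialSchema(
            String userSchema, boolean tableExists, Supplier<String> getTable) {
        if (userSchema != null) {
            return userSchema;
        }
        if (tableExists) {
            return getTable.get();
        }
        throw new IllegalStateException("No schema provided and table does not exist");
    }

    public static void main(String[] args) {
        // User-supplied schema takes precedence over the existing table's schema.
        System.out.println(resolveInitialSchema("user-schema", true, () -> "table-schema"));
        // With no user schema, fall back to the existing table.
        System.out.println(resolveInitialSchema(null, true, () -> "table-schema"));
    }
}
```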

Review comment on the diff, at:

    TableRowConverter(
        TableSchema tableSchema,

Contributor:

Why are there 3 types of schema here? What do they mean?

@reuvenlax (Contributor, Author):

this.tableSchema - the json TableSchema (the Beam API is written in terms of this schema, and it's usually what users give us)

this.protoTableSchema - the result of translating the json schema into the proto TableSchema

this.schemaInformation - extra information computed from the schema to allow easy conversion of a json row to a proto message
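A minimal sketch of how the three representations relate, using plain maps as hypothetical stand-ins for the real Beam and proto classes (field names, types, and the field-number scheme below are illustrative only):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ConverterSchemas {
    final Map<String, String> tableSchema;        // json TableSchema: what users give the Beam API
    final Map<String, String> protoTableSchema;   // json schema translated into proto TableSchema types
    final Map<String, Integer> schemaInformation; // precomputed: field name -> proto field number

    ConverterSchemas(Map<String, String> jsonSchema) {
        this.tableSchema = jsonSchema;
        this.protoTableSchema = new LinkedHashMap<>();
        this.schemaInformation = new LinkedHashMap<>();
        int fieldNumber = 1;
        for (Map.Entry<String, String> f : jsonSchema.entrySet()) {
            // BigQuery's INTEGER type corresponds to a 64-bit proto field.
            String protoType =
                f.getValue().equals("INTEGER") ? "int64" : f.getValue().toLowerCase();
            protoTableSchema.put(f.getKey(), protoType);
            schemaInformation.put(f.getKey(), fieldNumber++);
        }
    }

    public static void main(String[] args) {
        Map<String, String> json = new LinkedHashMap<>();
        json.put("id", "INTEGER");
        json.put("name", "STRING");
        ConverterSchemas c = new ConverterSchemas(json);
        System.out.println(c.protoTableSchema);   // {id=int64, name=string}
        System.out.println(c.schemaInformation);  // {id=1, name=2}
    }
}
```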


Review comment on the diff, at:

    public StorageApiWritePayload toMessage(TableRow tableRow, boolean respectRequired)
        throws Exception {
      boolean ignore = ignoreUnknownValues || autoSchemaUpdates;

Contributor:

autoSchemaUpdates implies ignoreUnknownValues?

I thought a user may want autoSchemaUpdates enabled but not want unknown fields?

@reuvenlax (Contributor, Author):

It does not imply that - it only ignores unknown values at the prior stage so that it can send them on to the writing stage. However, this actually belongs in the follow-on PR (in this PR there is not yet any handling of new schemas), so I'll remove it for now.
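The distinction being made here can be sketched with hypothetical names (not Beam's actual code): with autoSchemaUpdates, fields unknown to the current schema are not silently dropped at the conversion stage but set aside so the writing stage can handle them after a schema update.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class UnknownFieldRouting {
    // Hypothetical sketch: split a row into fields known to the current
    // schema and fields to carry forward for a later schema update.
    static Map<String, Object> splitRow(
            Map<String, Object> row, Set<String> knownFields,
            boolean autoSchemaUpdates, List<String> carriedForward) {
        Map<String, Object> known = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : row.entrySet()) {
            if (knownFields.contains(e.getKey())) {
                known.put(e.getKey(), e.getValue());
            } else if (autoSchemaUpdates) {
                // Forwarded to the writing stage, not silently dropped.
                carriedForward.add(e.getKey());
            }
        }
        return known;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("a", 1);
        row.put("b", 2);
        List<String> carried = new ArrayList<>();
        Map<String, Object> known = splitRow(row, Set.of("a"), true, carried);
        System.out.println(known);   // {a=1}
        System.out.println(carried); // [b]
    }
}
```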

@yirutang (Contributor) left a review comment:

I just did a rough pass. I don't know the connector well enough to give a thorough review...

@reuvenlax (Contributor, Author):

Run Java PreCommit

(1 similar comment by @reuvenlax)

@reuvenlax reuvenlax merged commit 5bb13fa into apache:master Nov 29, 2022
ruslan-ikhsan pushed a commit to ruslan-ikhsan/beam that referenced this pull request Nov 30, 2022
@nbali (Contributor) commented on Mar 3, 2023:

Some optimizations made in #22942 were mostly reverted here. Was that necessary? @reuvenlax @pabloem
