First step in adding schema update to Storage API sink. Refactor code #21395 #24147
Conversation
Assigning reviewers. If you would like to opt out of this review, comment `assign to next reviewer`:

R: @robertwb for label java.

The PR bot will only process comments in the main thread (not review comments).
Force-pushed from 90d7539 to d6b8262.

Force-pushed from 2ba93fe to 73572b2.
R: @yirutang

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

friendly ping

How is the initial schema determined (before the schema update event)?
The initial schema is always what the user provides to the BQ sink (though if the table already exists and the user leaves the schema null, we will call getTable to determine it).
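A minimal sketch of that resolution order, assuming a hypothetical `resolveInitialSchema` helper on top of Beam's existing `DatasetService.getTable` call:

```java
import com.google.api.services.bigquery.model.Table;
import com.google.api.services.bigquery.model.TableReference;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryServices.DatasetService;
import org.checkerframework.checker.nullness.qual.Nullable;

class InitialSchema {
  // Hypothetical helper: the user-supplied schema wins; otherwise, if the
  // table already exists, fetch its schema via getTable.
  static TableSchema resolveInitialSchema(
      @Nullable TableSchema userSchema, TableReference ref, DatasetService datasets)
      throws Exception {
    if (userSchema != null) {
      return userSchema;
    }
    @Nullable Table table = datasets.getTable(ref);
    if (table == null) {
      throw new IllegalStateException("No schema given and table does not exist: " + ref);
    }
    return table.getSchema();
  }
}
```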
}
}
TableRowConverter(
    TableSchema tableSchema,
Why are there 3 types of schema here? What do they mean?
this.tableSchema - this is the JSON table schema (the Beam API is written in terms of this schema, and it's usually what users give to us)
this.protoTableSchema - the result of translating the JSON schema into the proto TableSchema
this.schemaInformation - some extra information calculated about the schema to allow for easy conversion of a JSON row -> proto.
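A sketch of how those three fields might sit together in the converter (the `SchemaInformation` stub and the `toProtoSchema` helper are assumptions for illustration, not the sink's actual code):

```java
import com.google.api.services.bigquery.model.TableSchema;

class TableRowConverterSketch {
  // Placeholder for the precomputed lookup data; the real class holds
  // per-field information used to turn a JSON TableRow into a proto message.
  static class SchemaInformation {
    static SchemaInformation of(TableSchema schema) {
      return new SchemaInformation();
    }
  }

  private final TableSchema tableSchema; // JSON schema, as supplied by the user
  private final com.google.cloud.bigquery.storage.v1.TableSchema protoTableSchema; // proto translation
  private final SchemaInformation schemaInformation; // lookup data for row conversion

  TableRowConverterSketch(TableSchema tableSchema) {
    this.tableSchema = tableSchema;
    this.protoTableSchema = toProtoSchema(tableSchema); // assumed translation helper
    this.schemaInformation = SchemaInformation.of(tableSchema);
  }

  // Assumed helper: translate the JSON schema into the Storage API proto schema.
  private static com.google.cloud.bigquery.storage.v1.TableSchema toProtoSchema(
      TableSchema jsonSchema) {
    com.google.cloud.bigquery.storage.v1.TableSchema.Builder builder =
        com.google.cloud.bigquery.storage.v1.TableSchema.newBuilder();
    // ... field-by-field translation elided in this sketch ...
    return builder.build();
  }
}
```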
public StorageApiWritePayload toMessage(TableRow tableRow, boolean respectRequired)
    throws Exception {
  boolean ignore = ignoreUnknownValues || autoSchemaUpdates;
autoSchemaUpdates implies ignoreUnknownValues?
I thought a user may want autoSchemaUpdates enabled but not want unknown fields ignored?
It does not imply that - unknown values are only ignored at the prior stage so that they can be sent on to the writing stage. However, this actually belongs in the follow-on PR (since in this PR there is not yet any handling of new schemas), so I'll remove this for now.
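To make the distinction concrete, a minimal sketch of the intent (flag names taken from the diff above; the staging behavior is paraphrased from this thread, not the sink's actual code):

```java
class UnknownFieldPolicy {
  // At the conversion stage, unknown fields are tolerated in either case:
  //  - ignoreUnknownValues: the user explicitly asked to drop them, or
  //  - autoSchemaUpdates: they may belong to a newer table schema, so the
  //    conversion stage forwards them instead of failing; the writing stage
  //    reconciles them once it sees the updated schema (and can still fail
  //    there if ignoreUnknownValues was not set).
  static boolean tolerateUnknownFieldsAtConversion(
      boolean ignoreUnknownValues, boolean autoSchemaUpdates) {
    return ignoreUnknownValues || autoSchemaUpdates;
  }
}
```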
I just did a rough pass. I don't know the connector well enough to give a thorough review...
Run Java PreCommit

1 similar comment

Run Java PreCommit
…o Storage API sink. Refactor code apache#21395
There were some optimizations done in #22942 that were mostly reverted here. Was that necessary? @reuvenlax @pabloem
The BigQuery Storage Write API detects updated schemas and returns the new schema. However the schema is returned as a proto TableSchema. The current sink creates the proto descriptor directly from the input type (either Beam schema or json TableSchema), which is problematic as the new proto descriptor must be compatible with the old one. This PR refactors the sink to always use the proto TableSchema when constructing a descriptor.
Note: this may increase message size slightly for some schemas. E.g., for a Beam schema with an INT32 field, we currently generate an int32 proto field. However, BigQuery schemas themselves don't have int32 fields, only int64 fields, so roundtripping through TableSchema means we will now generate an int64 proto field instead.
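Shown from the Beam side (illustrative only; the field name is made up):

```java
import org.apache.beam.sdk.schemas.Schema;

class Int64Widening {
  // An INT32 Beam field has no 32-bit counterpart in a BigQuery schema, so
  // roundtripping through TableSchema widens it:
  //   old path: f -> proto int32
  //   new path: f -> BigQuery INT64 -> proto int64
  static Schema beamSchema() {
    return Schema.builder().addInt32Field("f").build();
  }
}
```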
This PR also removes the old schema-update code in preparation for the new version.