feat: add update schema support for multiplexing #1867

Merged
merged 43 commits into from
Nov 12, 2022

Conversation

GaoleMeng
Contributor

@GaoleMeng GaoleMeng commented Nov 5, 2022

To make this happen, we store a mapping from stream name to updated schema inside the connection worker pool. Whenever the JSON writer accepts an append, it checks the cache for an updated schema and compares it with the current one. If the schemas differ, it recreates the stream writer.
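
The check described above could look roughly like the following. This is a minimal sketch, not the code in this PR: `UpdatedSchemaCacheSketch` and its methods are hypothetical names, and a plain `String` stands in for `TableSchema`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the pool-level cache; not the library's actual class. */
class UpdatedSchemaCacheSketch {
  // Stream name -> latest schema reported by the backend.
  // Assumption: a String stands in for TableSchema to keep the sketch self-contained.
  private final Map<String, String> streamNameToUpdatedSchema = new ConcurrentHashMap<>();

  /** Called from the append callback when a response carries an updated schema. */
  void onUpdatedSchema(String streamName, String updatedSchema) {
    streamNameToUpdatedSchema.put(streamName, updatedSchema);
  }

  /** Returns the cached schema for the stream, or null if none was reported. */
  String latestSchema(String streamName) {
    return streamNameToUpdatedSchema.get(streamName);
  }

  /** True when a cached schema exists and differs from the writer's current schema. */
  boolean needsRefresh(String streamName, String currentSchema) {
    String cached = streamNameToUpdatedSchema.get(streamName);
    return cached != null && !cached.equals(currentSchema);
  }
}
```

On each append, the JSON writer would call something like `needsRefresh` and, if it returns true, rebuild its stream writer with `latestSchema`.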

GaoleMeng and others added 30 commits September 13, 2022 01:58
also fixed a tiny bug inside fake bigquery write impl for getting the response from offset
possible the proto schema does not contain this field
@GaoleMeng GaoleMeng requested review from a team and aribray November 5, 2022 02:16
@product-auto-label product-auto-label bot added size: l Pull request size is large. api: bigquerystorage Issues related to the googleapis/java-bigquerystorage API. labels Nov 5, 2022
@GaoleMeng GaoleMeng requested a review from yirutang November 7, 2022 21:17

@Override
public void onSuccess(AppendRowsResponse response) {
streamNameToUpdatedSchema.put(streamName, response.getUpdatedSchema());
Contributor

After you call refreshWriter, you need to set the entry here to null. I think it would be better to keep the schema at the writer level.

Contributor Author

The problem is that there could be multiple stream writers using the same stream name, and all of them need refreshWriter to be triggered whenever there is an updated schema.

So we can't directly nullify the updated schema for a given stream name; otherwise some stream writers might not get the updated schema.

Contributor Author

As discussed offline, changed to use a timestamp pattern.

If the timestamp used by the current stream writer is older than the updated schema version, the writer switches to the updated schema.
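
A minimal sketch of the timestamp pattern follows, under stated assumptions: `SchemaTimestampSketch`, `SchemaAndTimestamp`, and the method names are hypothetical, a `String` stands in for `TableSchema`, and timestamps are passed in explicitly rather than read from a clock.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical illustration of the timestamp pattern; not the PR's actual code. */
class SchemaTimestampSketch {
  /** Pairs a schema with the time at which the update was observed. */
  static final class SchemaAndTimestamp {
    final String schema;
    final long updateTimeNanos;

    SchemaAndTimestamp(String schema, long updateTimeNanos) {
      this.schema = schema;
      this.updateTimeNanos = updateTimeNanos;
    }
  }

  // Shared cache; entries are overwritten by newer updates but never cleared by readers.
  private final Map<String, SchemaAndTimestamp> cache = new ConcurrentHashMap<>();

  /** Records an updated schema together with the time it was seen. */
  void record(String key, String schema, long nowNanos) {
    cache.put(key, new SchemaAndTimestamp(schema, nowNanos));
  }

  /** Returns the cached schema only if it is newer than the writer; otherwise null. */
  String newerSchemaFor(String key, long writerCreationNanos) {
    SchemaAndTimestamp entry = cache.get(key);
    if (entry == null || entry.updateTimeNanos <= writerCreationNanos) {
      return null;
    }
    return entry.schema;
  }
}
```

Because no reader has to null out the entry, every stream writer sharing the same key sees the update independently, and a writer created after the update is not asked to refresh again.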

@@ -398,7 +397,7 @@ public static StreamWriter.Builder newBuilder(String streamName) {

/** Thread-safe getter of updated TableSchema */
public synchronized TableSchema getUpdatedSchema() {
-  return singleConnectionOrConnectionPool.getUpdatedSchema();
+  return singleConnectionOrConnectionPool.getUpdatedSchema(this);
Contributor

The meaning of this field has changed slightly. Previously, it returned an updated schema only when a schema update happened during the lifetime of this StreamWriter. Now it always returns the current schema to the best of our knowledge. It may be worth explaining this a bit, since Dataflow is going to use this field.

Contributor Author

Added a one-line comment.

/*
* Contains the mapping from stream name to updated schema.
*/
private Map<String, TableSchema> streamNameToUpdatedSchema = new ConcurrentHashMap<>();
Contributor

Is there a way to cache only at the table level? The map could become huge if it holds one table schema per stream.

Contributor Author

Changed to cache by table name.
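
One way to derive a table-level cache key is sketched below. It assumes stream names follow the `projects/{project}/datasets/{dataset}/tables/{table}/streams/{stream}` layout; `TableNameKeySketch` and `tableNameOf` are hypothetical names, and the PR may extract the table name differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that maps a stream name to its table-level cache key. */
class TableNameKeySketch {
  // Group 1 captures the table path; anything after it (the stream part) is ignored.
  private static final Pattern STREAM_NAME =
      Pattern.compile("(projects/[^/]+/datasets/[^/]+/tables/[^/]+)(/.*)?");

  /** Returns the table path prefix of a stream name, or throws if the name is malformed. */
  static String tableNameOf(String streamName) {
    Matcher matcher = STREAM_NAME.matcher(streamName);
    if (!matcher.matches()) {
      throw new IllegalArgumentException("Unexpected stream name: " + streamName);
    }
    return matcher.group(1);
  }
}
```

With this key, all streams on the same table share one cache entry, so the map grows with the number of tables rather than the number of streams.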

@@ -720,7 +722,7 @@ private AppendRequestAndResponse pollInflightRequestQueue() {
}

/** Thread-safe getter of updated TableSchema */
- public synchronized TableSchema getUpdatedSchema() {
+ public synchronized TableSchemaAndTimestamp getUpdatedSchema() {
Contributor

This is a breaking change. Let's just add a new method instead of changing the old one.

Contributor Author

This method should not have been public;
let's fall back to package-private.

Contributor

Yeah, I agree that for ConnectionWorker this method shouldn't be public at all. But it seems the method on StreamWriter also changed?

@GaoleMeng GaoleMeng requested a review from a team as a code owner November 10, 2022 23:14
@@ -147,11 +152,11 @@ long getInflightWaitSeconds(StreamWriter streamWriter) {
return connectionWorker().getInflightWaitSeconds();
}

- TableSchema getUpdatedSchema() {
+ TableSchemaAndTimestamp getUpdatedSchema(StreamWriter streamWriter) {
Contributor

Sorry, I mean, isn't this actually a breaking change? Dataflow will use this method.

Contributor Author

Discussed offline; let's use the timestamp on the StreamWriter when returning the schema.
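
The resolution can be pictured as two layers: an internal getter that returns the schema with its timestamp, and a public getter whose return type stays unchanged. The sketch below is an assumption about the shape of that design, not the merged code; `ApiCompatSketch`, `Connection`, and `Writer` are hypothetical stand-ins, and a `String` stands in for `TableSchema`.

```java
/** Hypothetical layering that keeps the public return type stable. */
class ApiCompatSketch {
  /** Internal pair type, playing the role of TableSchemaAndTimestamp. */
  static final class SchemaAndTimestamp {
    final String schema;
    final long updateTimeNanos;

    SchemaAndTimestamp(String schema, long updateTimeNanos) {
      this.schema = schema;
      this.updateTimeNanos = updateTimeNanos;
    }
  }

  /** Stand-in for the internal connection object. */
  static final class Connection {
    private volatile SchemaAndTimestamp updated;

    void onSchemaUpdate(String schema, long nowNanos) {
      updated = new SchemaAndTimestamp(schema, nowNanos);
    }

    // Package-private: callers outside the package never see the pair type.
    SchemaAndTimestamp getUpdatedSchema() {
      return updated;
    }
  }

  /** Stand-in for the public writer. */
  static final class Writer {
    private final Connection connection;
    private final long creationTimeNanos;

    Writer(Connection connection, long creationTimeNanos) {
      this.connection = connection;
      this.creationTimeNanos = creationTimeNanos;
    }

    // Public signature unchanged: returns a schema only if it is newer than this writer.
    public String getUpdatedSchema() {
      SchemaAndTimestamp latest = connection.getUpdatedSchema();
      if (latest == null || latest.updateTimeNanos <= creationTimeNanos) {
        return null;
      }
      return latest.schema;
    }
  }
}
```

Existing callers such as Dataflow would keep compiling against the unchanged public getter, while the timestamp comparison stays an internal detail.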

@GaoleMeng GaoleMeng added the owlbot:run Add this label to trigger the Owlbot post processor. label Nov 11, 2022
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Nov 11, 2022
refreshWriter(this.streamWriter.getUpdatedSchema());
TableSchema updatedSchemaAndTime = this.streamWriter.getUpdatedSchema();
// Create a new stream writer internally if a new updated schema is reported from backend.
if (updatedSchemaAndTime != null && !this.tableSchema.equals(updatedSchemaAndTime)) {
Contributor

Can we directly use streamWriter.getUpdatedSchema() != null?

Contributor Author

Done

@Neenu1995 Neenu1995 added the owlbot:run Add this label to trigger the Owlbot post processor. label Nov 11, 2022
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Nov 11, 2022
@GaoleMeng GaoleMeng added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Nov 12, 2022
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Nov 12, 2022
@GaoleMeng GaoleMeng added the owlbot:run Add this label to trigger the Owlbot post processor. label Nov 12, 2022
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Nov 12, 2022
@GaoleMeng GaoleMeng merged commit 2adf81b into googleapis:main Nov 12, 2022
4 participants