Handle updates to table schema when using Storage API writes. #24145

reuvenlax · 2022-11-13T22:03:18Z

No description provided.

github-actions · 2022-11-13T22:36:41Z

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java.
R: @ahmedabu98 for label io.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

github-actions · 2022-11-21T12:13:50Z

Reminder, please take a look at this pr: @kennknowles @ahmedabu98

github-actions · 2022-11-23T12:13:56Z

Assigning a new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @lukecwik for label java.
R: @pabloem for label io.

github-actions · 2022-11-30T12:14:12Z

Reminder, please take a look at this pr: @lukecwik @pabloem

kennknowles · 2022-11-30T17:24:35Z

There's no description. Is this just a working draft?

reuvenlax · 2022-11-30T18:38:08Z

Draft


reuvenlax · 2022-12-02T04:42:41Z

R: @prodriguezdefino
R: @yirutang

github-actions · 2022-12-02T04:43:45Z

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

yirutang · 2022-12-07T22:48:12Z

R: @GaoleMeng FYI

yirutang · 2022-12-07T22:49:06Z

R: @agrawal-siddharth FYI

reuvenlax · 2022-12-22T17:27:26Z

friendly ping!

reuvenlax · 2023-01-03T19:09:33Z

friendly ping

prodriguezdefino

Left a few nit comments, but overall this looks good to me.

prodriguezdefino · 2023-01-03T23:55:04Z

...tform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StorageApiWriteUnshardedRecords.java

+                          .encodeUnknownFields(unknownFields, ignoreUnknownValues));
+            } catch (TableRowToStorageApiProto.SchemaConversionException e) {
+              TableRow tableRow = appendClientInfo.toTableRow(payloadBytes);
+              // TODO(reuvenlax): We need to merge the unknown fields in!


Maybe just including the unknown-field information in the returned error message would be sufficient for debugging purposes.

For now this codepath is disabled. will revisit later.

prodriguezdefino · 2023-01-04T00:13:34Z

...tform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StorageApiWriteUnshardedRecords.java

@@ -488,6 +542,14 @@ long flush(
              return RetryType.RETRY_ALL_OPERATIONS;
            },
            c -> {
+              AppendRowsResponse response = Preconditions.checkStateNotNull(c.getResult());
+              if (autoUpdateSchema && response.hasUpdatedSchema()) {


Do we get similar information about schema updates when the insert fails? If so, adding that info to the failed insert rows could help with debugging.

I don't believe we do.

prodriguezdefino · 2023-01-04T00:14:18Z

...atform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StorageApiWritesShardedRecords.java

+          appendClientInfo.set(
+              AppendClientInfo.of(
+                  updatedSchema.read(), appendClientInfo.get().getCloseAppendClient()));
+          // TODO: invalidate?


It seems that the unsharded version of the writes does invalidate when the schema has changed.

I think that put() will invalidate the old value

added invalidate

prodriguezdefino · 2023-01-04T00:15:24Z

...atform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StorageApiWritesShardedRecords.java

+            appendClientInfo.set(
+                AppendClientInfo.of(
+                    updatedSchemaReturned.get(), appendClientInfo.get().getCloseAppendClient()));
+            // TODO: invalidate?


same as in line 434

Same - I believe that put() invalidates the old value

added invalidate

prodriguezdefino · 2023-01-04T00:25:47Z

.../src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StorageApiDynamicDestinationsBeamRow.java

+
+    @Override
+    public StorageApiWritePayload toMessage(TableRow tableRow, boolean respectRequired) {
+      throw new RuntimeException("Not supported");


Maybe I misunderstood the logic, but would it be possible to add a validation in BigQueryIO that schema update is not available for Beam Row payloads?
If I'm not mistaken, someone could configure the writes with useBeamSchema() and also enable auto update of schemas, and then their pipeline would fail with this runtime exception, when in fact we could have caught that at validation time.
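To illustrate the suggestion above, here is a minimal sketch of rejecting the unsupported combination at validation time rather than at runtime. `WriteConfigSketch` and its two flags are hypothetical stand-ins, not the real `BigQueryIO.Write` API, and the actual check in Beam may look different.

```java
/** Hypothetical stand-in for a write configuration; not the real BigQueryIO.Write. */
public class WriteConfigSketch {
  private final boolean useBeamSchema;
  private final boolean autoSchemaUpdate;

  public WriteConfigSketch(boolean useBeamSchema, boolean autoSchemaUpdate) {
    this.useBeamSchema = useBeamSchema;
    this.autoSchemaUpdate = autoSchemaUpdate;
  }

  /** Rejects the unsupported combination up front instead of failing mid-pipeline. */
  public void validate() {
    if (useBeamSchema && autoSchemaUpdate) {
      throw new IllegalArgumentException(
          "Auto schema update is not supported together with useBeamSchema()");
    }
  }

  public static void main(String[] args) {
    // Supported combination: validation passes silently.
    new WriteConfigSketch(false, true).validate();
    try {
      new WriteConfigSketch(true, true).validate();
    } catch (IllegalArgumentException e) {
      System.out.println("Rejected at validation time: " + e.getMessage());
    }
  }
}
```

The point of the sketch is only that the failure surfaces when the transform is configured, before any element reaches the `toMessage` override that throws.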

reuvenlax · 2023-01-11T17:56:21Z

@prodriguezdefino comments addressed

prodriguezdefino · 2023-01-11T18:01:08Z

LGTM

reuvenlax · 2023-01-11T20:48:18Z

Run Java_GCP_IO_Direct PreCommit

yirutang · 2023-01-11T18:34:19Z

...oogle-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/AppendClientInfo.java

+    return updatedTableSchema.hashCode() != getTableSchema().hashCode();
+  }
+
+  public ByteString encodeUnknownFields(TableRow unknown, boolean ignoreUnknownValues)


If ignoreUnknownValues is false, will this be a no-op?

No - these are fields that are unknown to the prior step. They may actually end up being known to the current step due to the updated schema.

yirutang · 2023-01-11T19:11:25Z

...tform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StorageApiWriteUnshardedRecords.java

+          if (appendClientInfo == null) {
+            appendClientInfo = getAppendClientInfo(true, null);
+          }
+          @Nullable TableRow unknownFields = payload.getUnknownFields();


If ignoreUnknownValues is false, could we avoid doing all of the following (to save some processing time)?

unknownFields are values that were unknown to the prior conversion step. These fields may be known to the writing step (since it gets schema updates back from Vortex) so we can't ignore them here.

yirutang · 2023-01-11T21:28:46Z

...tform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StorageApiWriteUnshardedRecords.java

+          invalidateWriteStream();
+          appendClientInfo =
+              Preconditions.checkStateNotNull(getAppendClientInfo(false, updatedTableSchema));
+          updatedTableSchema = null;


I am wondering if there will be races regarding this updatedSchema?

what race do you envision?

Some responses may still be coming back and updating updatedTableSchema (L568), which would race with the postFlush here.

postFlush is called only after all futures have completed (in flushAll), so we would not expect any more callbacks. Also note that RetryManager calls these response callbacks in the primary thread (RetryManager.await() calls the callbacks), so the callbacks here are not being called asynchronously.

yirutang · 2023-01-11T21:47:49Z

...atform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StorageApiWritesShardedRecords.java

+                    Preconditions.checkStateNotNull(info.getStreamAppendClient()).pin();
+                    return info;
+                  }));
+      TableSchema updatedSchemaValue = updatedSchema.read();


updatedSchema can be null?

updatedSchema is a state variable. If it has never been set, then it will return null.

yirutang · 2023-01-11T23:00:00Z

...ud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/TableRowToStorageApiProto.java

@@ -479,15 +502,41 @@ public static DynamicMessage messageFromTableRow(
        }
      }

+      if (unknownFields != null) {


Is this file the conversion stage?

If user has input ABC and the original schema is AB but later on the schema is updated to ABC, will L496 fail the check?

no, because the caller (StorageApiDynamicDestinationsTableRow.java:151) passes in true if autoSchemaUpdates==true

yirutang · 2023-01-11T23:15:56Z

...latform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/TableRowToStorageApiProtoTest.java

+  @Test
+  public void testIgnoreUnknownNestedField() throws Exception {
+    TableRow rowNoF = new TableRow();
+    rowNoF.putAll(BASE_TABLE_ROW_NO_F);


Just trying to understand the test: so Beam accepts two formats of TableRow? One is "F" -> list of values and the other is a list of (field_name, field_value) pairs, and they can be mixed as nested field values within the same TableRow struct?

correct. This is the unfortunate history of TableRow (which long predated Beam)

yirutang · 2023-01-11T23:26:51Z

...latform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/TableRowToStorageApiProtoTest.java

+    assertEquals(1, ((TableRow) unknown.get("nestedvaluenof1")).size());
+    assertEquals(
+        "foobar",
+        ((TableRow) unknown.get("nestedvalue1")).getF().get(BASE_TABLE_ROW.getF().size()).getV());


In this case, the unknown value is at the last offset. Are there elements ahead of it? What if the unknown field is in the middle of the array list?

This is an unordered map, so there should be no offsets.

yirutang · 2023-01-11T23:38:39Z

...le-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIOWriteTest.java

+            .to(tableRef)
+            .withMethod(method)
+            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
+            .ignoreUnknownValues()


nit: Add a test for ignoreUnknownValue(false) and withAutoSchemaUpdate(true)

yirutang · 2023-01-11T23:46:04Z

...le-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIOWriteTest.java

+            .map(tr -> filterUnknownValues(tr, tableSchema.getFields()))
+            .collect(Collectors.toList());
+    Iterable<TableRow> expectedFullValues =
+        LongStream.range(6, 10).mapToObj(getRowSet).collect(Collectors.toList());


Should this be 6,9 and above 0,5?

(6,10) is open on the upper bound (i.e. it's [6, 10) )
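As a quick check of the bounds discussed above, the sketch below compares `LongStream.range`, which excludes its upper bound, with `LongStream.rangeClosed`, which includes it. The class and helper names are illustrative only.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

public class RangeBoundsSketch {
  /** Collects LongStream.range(start, end), which excludes the end value. */
  static List<Long> halfOpen(long start, long end) {
    return LongStream.range(start, end).boxed().collect(Collectors.toList());
  }

  /** Collects LongStream.rangeClosed(start, end), which includes the end value. */
  static List<Long> closed(long start, long end) {
    return LongStream.rangeClosed(start, end).boxed().collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(halfOpen(6, 10)); // [6, 7, 8, 9]
    System.out.println(closed(6, 9)); // [6, 7, 8, 9]
  }
}
```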

yirutang · 2023-01-11T23:54:47Z

...ogle-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/SplittingIterable.java


 /**
 * Takes in an iterable and batches the results into multiple ProtoRows objects. The splitSize
 * parameter controls how many rows are batched into a single ProtoRows object before we move on to
 * the next one.
 */
 class SplittingIterable implements Iterable<ProtoRows> {
+  interface ConvertUnknownFields {


I somehow couldn't find the implementation of this?

This is a Java functional interface - any matching lambda will conform. e.g. you can pass in (tableRow, ignore) -> {} and this will conform to the interface
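A minimal sketch of the point above: an interface with a single abstract method needs no named implementing class, because any lambda of matching shape conforms. `ConvertUnknown` here is a hypothetical interface loosely modeled on `ConvertUnknownFields`; its parameter and return types are assumptions chosen to keep the example dependency-free.

```java
public class FunctionalInterfaceSketch {
  /** Hypothetical single-method interface; types are placeholders, not Beam's. */
  interface ConvertUnknown {
    String convert(String row, boolean ignoreUnknownValues);
  }

  /** Accepts any implementation, including a lambda with a matching shape. */
  static String apply(ConvertUnknown converter, String row) {
    return converter.convert(row, true);
  }

  public static void main(String[] args) {
    // No class declares "implements ConvertUnknown"; the lambda is the implementation.
    System.out.println(apply((row, ignore) -> row.toUpperCase(), "abc")); // ABC
  }
}
```

This is why searching for a class that implements the interface turns up nothing: the implementation lives at the call site that passes the lambda.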

yirutang · 2023-01-12T00:01:22Z

...ogle-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/SplittingIterable.java

+                // into a proto and concatenate to the existing proto.
+                try {
+                  byteString =
+                      byteString.concat(


Maybe I missed something: do we concat the unknown fields with the existing byte string every time? Then what's the difference between this and passing down the entire message? Maybe unknownFieldsToMessage will filter something out? But I don't see that it has a schema.

The prior convert message only includes fields known to it in the proto it generates. It can't include fields it doesn't know about as they would have to be in the proto descriptor (and it can't use the proto's unknown field set as that requires field ids, which is not known yet).

Therefore the incoming byteString contains only fields that were known to the convert stage, and all other fields are put into the json unknownFields object. What we are doing here is taking advantage of the fact that the write step has a more up-to-date view on the schema, so we walk over the unknownFields json and extract whatever fields are now known (which might still be only a subset of the remaining fields). We then convert those unknownFields to a proto, and concatenate the two protos.

Not an expert in modern Java... Where is the implementation of unknownFieldsToMessage.convert? How can it convert only the "known" unknown fields?

nvm, found it, should be encodeUnknownFields

Yes, code looks good here.
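The concatenation discussed in this thread relies on a property of the protobuf wire format: parsing the concatenation of two serialized messages yields the merge of their fields. The sketch below demonstrates that with a hand-rolled encoder and decoder limited to varint fields. It uses only the JDK, not the protobuf library or any Beam code, and every name in it is illustrative.

```java
import java.io.ByteArrayOutputStream;
import java.util.LinkedHashMap;
import java.util.Map;

public class ProtoConcatSketch {
  /** Encodes one varint field: a tag byte (fieldNumber << 3, wire type 0), then the value. */
  static byte[] varintField(int fieldNumber, long value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(fieldNumber << 3); // single-byte tag; assumes fieldNumber < 16
    while ((value & ~0x7FL) != 0) {
      out.write((int) ((value & 0x7F) | 0x80)); // low 7 bits plus continuation bit
      value >>>= 7;
    }
    out.write((int) value);
    return out.toByteArray();
  }

  static byte[] concat(byte[] a, byte[] b) {
    byte[] result = new byte[a.length + b.length];
    System.arraycopy(a, 0, result, 0, a.length);
    System.arraycopy(b, 0, result, a.length, b.length);
    return result;
  }

  /** Decodes a message made only of varint fields into fieldNumber -> value. */
  static Map<Integer, Long> decode(byte[] bytes) {
    Map<Integer, Long> fields = new LinkedHashMap<>();
    int pos = 0;
    while (pos < bytes.length) {
      int fieldNumber = (bytes[pos++] & 0xFF) >>> 3;
      long value = 0;
      int shift = 0;
      byte b;
      do {
        b = bytes[pos++];
        value |= (long) (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      fields.put(fieldNumber, value);
    }
    return fields;
  }

  public static void main(String[] args) {
    // Fields the convert stage knew about.
    byte[] known = concat(varintField(1, 10), varintField(2, 20));
    // A field that only became known after a schema update.
    byte[] newlyKnown = varintField(3, 150);
    System.out.println(decode(concat(known, newlyKnown))); // {1=10, 2=20, 3=150}
  }
}
```

Because the merged bytes decode as a single message, the write step can append newly known fields without re-encoding the fields the convert stage already produced.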

yirutang · 2023-01-12T00:02:39Z

...latform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StorageApiDynamicDestinations.java

@@ -35,6 +35,8 @@

    StorageApiWritePayload toMessage(T element) throws Exception;

+    StorageApiWritePayload toMessage(TableRow tableRow, boolean respectRequired) throws Exception;


How is this relevant to this change?

We can't simply take the unknownFields and convert them to a proto, as there may be missing required fields (because those fields are in the original proto). We need a way to do the conversion without enforcing nullability.
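To make the role of the flag concrete, here is a hedged sketch of a conversion that enforces required fields only when asked to. `RespectRequiredSketch`, its `convert` method, and the map-based row are hypothetical simplifications; the real `toMessage(TableRow, boolean)` produces a `StorageApiWritePayload` and is not shown here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class RespectRequiredSketch {
  /** Copies the row; fails on a missing required field only when respectRequired is true. */
  static Map<String, Object> convert(
      Map<String, Object> row, Set<String> requiredFields, boolean respectRequired) {
    if (respectRequired) {
      for (String field : requiredFields) {
        if (!row.containsKey(field)) {
          throw new IllegalArgumentException("Missing required field: " + field);
        }
      }
    }
    return new LinkedHashMap<>(row);
  }

  public static void main(String[] args) {
    Set<String> required = new HashSet<>(Arrays.asList("id"));
    // The unknown-fields row carries only the leftover field; "id" is in the original proto.
    Map<String, Object> unknownOnly = new LinkedHashMap<>();
    unknownOnly.put("new_column", "foobar");

    System.out.println(convert(unknownOnly, required, false)); // {new_column=foobar}
    try {
      convert(unknownOnly, required, true);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage()); // Missing required field: id
    }
  }
}
```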

yirutang · 2023-01-12T00:10:44Z

...tform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StorageApiWriteUnshardedRecords.java

+        // If we got a response indicating an updated schema, recreate the client.
+        if (updatedTableSchema != null
+            && this.appendClientInfo != null
+            && this.appendClientInfo.hasSchemaChanged(updatedTableSchema)) {


Not sure if this is expensive to perform, theoretically it is not needed since when setting updatedTableSchema it is already checked.

We also provided something here:
https://github.com/googleapis/java-bigquerystorage/blob/main/google-cloud-bigquerystorage/src/main/java/com/google/cloud/bigquery/storage/v1/StreamWriter.java#L157

If that field is not null, it means a new schema appeared. We do the comparison based on timestamp.

removed this

yirutang · 2023-01-12T00:13:30Z

...atform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StorageApiWritesShardedRecords.java

@@ -652,6 +692,18 @@ public void process(
        }
        appendSplitDistribution.update(numAppends);

+        if (updatedSchemaReturned.get() != null) {


Wondering if you can directly use the one on StreamWriter:
https://github.com/googleapis/java-bigquerystorage/blob/main/google-cloud-bigquerystorage/src/main/java/com/google/cloud/bigquery/storage/v1/StreamWriter.java#L157

Maybe - is there an advantage to using that over the one in the response?

Yes, the logic can be simplified down to:
https://screenshot.googleplex.com/9o5fvs3UgNEWa9E

No extra comparison needed.

However it can only capture the first schema update event, which is enough since we will refresh the whole writer.

github-actions bot added gcp io java labels Nov 13, 2022

github-actions bot added the Next Action: Reviewers label Nov 13, 2022

github-actions bot added the slow-review label Nov 21, 2022

github-actions bot removed the slow-review label Nov 23, 2022

github-actions bot added the slow-review label Nov 30, 2022

github-actions bot removed the slow-review label Nov 30, 2022

reuvenlax force-pushed the schema_update_push_notification branch from 7d77609 to 235400c Compare December 1, 2022 00:28

github-actions bot added hcatalog and removed hcatalog labels Dec 1, 2022

lukecwik changed the title ~~Schema update push notification~~ [WIP] Schema update push notification Dec 1, 2022

github-actions bot added hcatalog and removed hcatalog labels Dec 2, 2022

reuvenlax force-pushed the schema_update_push_notification branch from 299b96a to 283f6e1 Compare December 2, 2022 04:31

github-actions bot added hcatalog and removed hcatalog labels Dec 2, 2022

reuvenlax changed the title ~~[WIP] Schema update push notification~~ Handle updates to table schema when using Storage API writes. Dec 2, 2022

reuvenlax force-pushed the schema_update_push_notification branch from 283f6e1 to 5b25910 Compare December 7, 2022 22:40

github-actions bot added the hcatalog label Dec 7, 2022

prodriguezdefino reviewed Jan 4, 2023

reuvenlax force-pushed the schema_update_push_notification branch from 5b25910 to 3223f4e Compare January 6, 2023 20:38

github-actions bot removed the hcatalog label Jan 6, 2023

reuvenlax force-pushed the schema_update_push_notification branch 2 times, most recently from e1616c3 to 3f1394c Compare January 11, 2023 02:39

yirutang reviewed Jan 11, 2023

yirutang reviewed Jan 12, 2023

Handle schema updates in Storage API writes.

7ad44c8

reuvenlax force-pushed the schema_update_push_notification branch from e0dd65d to 7ad44c8 Compare January 19, 2023 21:50

reuvenlax merged commit f5020e7 into apache:master Jan 20, 2023
