Implemented SchemaTransforms for SingleStoreIO #24290

Merged · 21 commits · Dec 19, 2022

Conversation

@AdalbertMemSQL (Contributor) commented Nov 21, 2022:

Added default RowMapper and UserDataMapper.
Implemented SchemaTransform for the Read, ReadWithPartitions, and Write PTransforms.
These changes will make it easier to configure SingleStoreIO and to use it from other language SDKs.



Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

See CI.md for more information about GitHub Actions CI.

@AdalbertMemSQL (Contributor Author):

addresses #22617

@github-actions:

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @apilloud for label java.
R: @pabloem for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@AdalbertMemSQL AdalbertMemSQL marked this pull request as draft November 21, 2022 17:08
@AdalbertMemSQL AdalbertMemSQL marked this pull request as ready for review November 25, 2022 11:59
@AdalbertMemSQL (Contributor Author):

R: @Abacn

@AdalbertMemSQL (Contributor Author):

R: @johnjcasey

@github-actions:

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control


@johnjcasey (Contributor):

r: @ahmedabu98

@johnjcasey (Contributor):

@AdalbertMemSQL it is probably worth putting some of these classes into a subdirectory (singlestore/schematransform) to help with organization

@ahmedabu98 (Contributor) left a review comment:

Thank you, I left a few comments. This IO looks great!

@AutoValue.Builder
public abstract static class Builder {

public abstract Builder setDataSourceConfiguration(SingleStoreIO.DataSourceConfiguration value);
Contributor:

When a remote SDK tries to prepare a configuration Row object to use this IO, how would it set the dataSourceConfiguration? The DataSourceConfiguration POJO only exists in the Java SDK

Contributor Author:

Hmm, that's a good question.
Is it correct that if I put @DefaultSchema(AutoValueSchema.class) on the DataSourceConfiguration class, then Beam will infer the schema for it, and an object with the same schema can be created in other SDKs?

Contributor:

I just tested this out locally with a simple configuration that had a POJO field, and generating a schema worked fine with just the @AutoValue annotation:

Field{name=field1, description=, type=STRING NOT NULL, options={{}}}
Field{name=field2, description=, type=INT32 NOT NULL, options={{}}}
Field{name=pojoField, description=, type=ROW<
    pojoField1 STRING NOT NULL, 
    pojoField2 INT32 NOT NULL
  > NOT NULL, options={{}}}

Contributor:

It might still be better to keep @DefaultSchema(AutoValueSchema.class) though, according to the programming guide: https://beam.apache.org/documentation/programming-guide/#inferring-schemas
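
For readers following this thread, a minimal sketch of the annotation being discussed, applied to a hypothetical AutoValue configuration class (the class and field names below are illustrative, not the ones in this PR):

import com.google.auto.value.AutoValue;
import org.apache.beam.sdk.schemas.AutoValueSchema;
import org.apache.beam.sdk.schemas.annotations.DefaultSchema;

// Registering AutoValueSchema as the default schema provider lets Beam infer
// a Schema for this class, so other SDKs can construct an equivalent Row.
@DefaultSchema(AutoValueSchema.class)
@AutoValue
public abstract class ExampleReadConfiguration {
  public abstract String getTable();
  public abstract Integer getBatchSize();

  public static Builder builder() {
    return new AutoValue_ExampleReadConfiguration.Builder();
  }

  @AutoValue.Builder
  public abstract static class Builder {
    public abstract Builder setTable(String table);
    public abstract Builder setBatchSize(Integer batchSize);
    public abstract ExampleReadConfiguration build();
  }
}

Nested POJO fields (such as a DataSourceConfiguration) show up as ROW fields in the inferred schema, as in the output above.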

* An implementation of {@link TypedSchemaTransformProvider} for SingleStoreDB parallel read jobs
* configured using {@link SingleStoreSchemaTransformReadWithPartitionsConfiguration}.
*/
public class SingleStoreSchemaTransformReadWithPartitionsProvider
Contributor:

There's a lot of overlap between this and the SingleStoreSchemaTransformReadProvider and configuration classes. I think the only difference is two configuration parameters (this one uses the initialNumReaders parameter and the other uses the outputParallelization parameter).

Would it make sense to combine these two sets of classes into one that includes both parameters? You can add a new readWithPartitions boolean parameter that would distinguish between the two read modes.

Contributor:

Some thought should go into this decision. Merging the two read modes makes sense as things stand, but if these two modes are likely to diverge significantly down the line, keeping them separate makes more sense.

Contributor Author:

The initialNumReaders parameter is already deleted, so now the only difference is the outputParallelization parameter, which only makes sense for sequential reads. I don't think these read modes will evolve much; I'll try to merge their SchemaTransforms.
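
As a rough illustration of the merged design under discussion, a single configuration schema could carry both parameters, with one boolean selecting the read mode. The schema and field names below are assumptions made for this sketch, not the configuration this PR ended up shipping:

import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.values.Row;

public class MergedReadConfigSketch {
  public static void main(String[] args) {
    // Hypothetical configuration schema: withPartitions picks the read mode,
    // and outputParallelization stays nullable because it only applies to
    // sequential reads.
    Schema configSchema =
        Schema.builder()
            .addStringField("table")
            .addBooleanField("withPartitions")
            .addNullableField("outputParallelization", Schema.FieldType.BOOLEAN)
            .build();

    // A remote SDK (or a test) could then express a sequential read as a Row:
    Row sequentialRead =
        Row.withSchema(configSchema)
            .withFieldValue("table", "my_table")
            .withFieldValue("withPartitions", false)
            .withFieldValue("outputParallelization", true)
            .build();

    System.out.println(sequentialRead);
  }
}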

@AdalbertMemSQL (Contributor Author):

@ahmedabu98 Can you please do one more review of this PR?

Added default RowMapper and UserDataMapper
These changes make it easier to configure SingleStoreIO and to use it with other languages
Added DefaultSchema for DataSourceConfiguration
Changed URNs
Added checks for empty strings
Deleted the ReadWithPartitions schema transform and added a withPartitions option to the Read schema transform
description('Runs the Java SingleStoreIO Integration Test.')

// Set common parameters.
commonJobProperties.setTopLevelMainJobProperties(delegate)
Contributor:

Can you set a timeout here? This is not to set a strict time limit for the job, but more to catch runaway jobs, see example here.

Contributor Author:

Looks like the default timeout is already set to 100

  // Sets common top-level job properties for main repository jobs.
  static void setTopLevelMainJobProperties(def context,
      String defaultBranch = 'master',
      int defaultTimeout = 100,
      boolean allowRemotePoll = true,
      String jenkinsExecutorLabel = 'beam',
      boolean cleanWorkspace = true) {

Comment on lines +36 to +40
Schema.LogicalType<Object, Object> logicalType =
    (Schema.LogicalType<Object, Object>) type.getLogicalType();
if (logicalType == null) {
  throw new UnsupportedOperationException("Failed to extract logical type");
}
Contributor:

You can make use of FieldType::isLogicalType as a check here.
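
A small sketch of that guard, assuming the check goes through getTypeName() (which is where isLogicalType() lives); the class and method names here are made up for illustration:

import org.apache.beam.sdk.schemas.Schema;

final class LogicalTypeGuardSketch {
  // Check the type name up front instead of casting first and testing for null.
  static String logicalTypeIdentifier(Schema.FieldType type) {
    if (!type.getTypeName().isLogicalType()) {
      throw new UnsupportedOperationException(
          "Expected a logical type but got: " + type.getTypeName());
    }
    return type.getLogicalType().getIdentifier();
  }
}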


@ahmedabu98 (Contributor):

Thanks @AdalbertMemSQL, LGTM so far.

I had another design question: is it necessary to create two new nested read classes (ReadRows and ReadWithPartitionsRows) that cater to Row objects? Much of the code seems to duplicate Read and ReadWithPartitions, only adding some parameters to specify Row output. It works as it is now, but it's better to be concise and not duplicate code.

It may help to reduce this by creating a new readRows() function that calls SingleStoreIO.read() and adds the specifications needed to output Rows, and likewise a new readWithPartitionsRows() function for partitions. See this example in BigQueryIO; the same approach could be applied here, passing in the relevant rowMapper and coder. This will probably need a check at the end of the Read/ReadWithPartitions expand() to see if we are reading Rows, so that it can set the row schema on the output PCollection.

WDYT?
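
A rough sketch of the shape this suggestion describes. The mapper name is a placeholder, and the builder calls are assumptions based on the existing SingleStoreIO read API rather than the code that landed in this PR:

// Hypothetically, inside SingleStoreIO:
public static SingleStoreIO.Read<Row> readRows() {
  // defaultRowMapper() stands in for the default RowMapper added in this PR.
  return SingleStoreIO.<Row>read().withRowMapper(defaultRowMapper());
}

public static SingleStoreIO.ReadWithPartitions<Row> readWithPartitionsRows() {
  return SingleStoreIO.<Row>readWithPartitions().withRowMapper(defaultRowMapper());
}

// Then, at the end of Read.expand() / ReadWithPartitions.expand(), if the
// default Row mapper is in use, call output.setRowSchema(...) so downstream
// transforms can see the schema.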

@ahmedabu98 (Contributor):

FYI, these suggestions ^ (if you decide to implement them) can also be done in a later PR. I don't see any blockers for merging this one; let me know what you decide.

@AdalbertMemSQL (Contributor Author):

I like this idea.
I'll try to implement it.

@AdalbertMemSQL (Contributor Author):

@ahmedabu98 Can you please trigger execution of the job_PostCommit_Java_SingleStoreIO_IT.groovy?

@ahmedabu98 (Contributor):

Seed job failed here: https://ci-beam.apache.org/job/beam_SeedJob/10792/console

Seeing ERROR: (job_PostCommit_Java_SingleStoreIO_IT.groovy, line 44) No such property: commonJobProperties for class: javaposse.jobdsl.dsl.jobs.FreeStyleJob

@ahmedabu98 (Contributor) left a review comment:

Looks really good! I left one comment and a typo fix. The typo is preventing us from running a seed job on this PR, which would let us run SingleStoreIO_IT.

@@ -370,8 +386,6 @@ public DataSource getDataSource() {

abstract @Nullable RowMapper<T> getRowMapper();

abstract @Nullable Coder<T> getCoder();
Contributor:

Why not keep the option for users to set their own coders?


DateTimeFormat.forPattern("yyyy-MM-DD' 'HH:mm:ss.SSS");

private String convertLogicalTypeFieldToString(Schema.FieldType type, Object value) {
assert type.getTypeName().isLogicalType();
Contributor:

Suggested change:
- assert type.getTypeName().isLogicalType();
+ checkArgument(
+     type.getTypeName().isLogicalType(),
+     "<appropriate error message>");

make sure you're importing
import static org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument;

@ahmedabu98 (Contributor):

Run Java SingleStoreIO_IT

@ahmedabu98 (Contributor):

Integration test failed due to problems connecting to jdbc: https://ci-beam.apache.org/job/beam_PostCommit_Java_SingleStoreIO_IT_PR/1/console

@AdalbertMemSQL (Contributor Author):

> Integration test failed due to problems connecting to jdbc: https://ci-beam.apache.org/job/beam_PostCommit_Java_SingleStoreIO_IT_PR/1/console

I hope it is fixed now.

@lukecwik (Member):

Run Seed Job

@@ -37,7 +39,7 @@ final class SingleStoreDefaultUserDataMapper implements SingleStoreIO.UserDataMa
DateTimeFormat.forPattern("yyyy-MM-DD' 'HH:mm:ss.SSS");

private String convertLogicalTypeFieldToString(Schema.FieldType type, Object value) {
assert type.getTypeName().isLogicalType();
checkArgument(type.getTypeName().isLogicalType(), "<appropriate error message>");
@ahmedabu98 (Contributor) commented Dec 19, 2022:

Could you replace it with an appropriate error message?

P.S. Doing another read of this, I realize this check isn't necessary, heh; sorry for suggesting it earlier.

@ahmedabu98 (Contributor):

Run Java SingleStoreIO_IT


@ahmedabu98 (Contributor):

@AdalbertMemSQL test is running now

@ahmedabu98 (Contributor) left a review comment:

LGTM and tests are passing :) When you're ready, I can look for a committer to merge this.

@AdalbertMemSQL (Contributor Author):

I'm ready.
I would be grateful if you could find a committer :)

@pabloem (Member) commented Dec 19, 2022:

I'll be happy to merge once tests are green again

@pabloem (Member) commented Dec 19, 2022:

Alright, merging! Thanks everyone. Very happy to get this in!

@pabloem pabloem merged commit 44aee66 into apache:master Dec 19, 2022
@Abacn (Contributor) commented Dec 21, 2022:

Looks like this PR breaks the SingleStoreIO performance test: https://ci-beam.apache.org/view/PerformanceTests/job/beam_PerformanceTests_SingleStoreIO/79/

because the test source file has been renamed to SingleStoreIOPerformanceIT.java instead of SingleStoreIOITPerformance.java

lostluck pushed a commit to lostluck/beam that referenced this pull request Dec 22, 2022
* Implemented SchemaTransforms for SingleStoreIO
Added default RowMapper and UserDataMapper
These changes make it easier to configure SingleStoreIO and to use it with other languages

* Fixed nullable errors

* Changed to not use the .* form of import

* Changed formatter field to be transient

* Nit reformatting

* Fixed bugs in tests

* Moved schema transform classes to the separate folder

* Removed unused imports

* Added package-info file

* check point

* check point

* Resolved comments
Added DefaultSchema for DataSourceConfiguration
Changed URNs
Added checks for empty strings
Deleted ReadWithPartitions schema transform and added withPartitions options to Read schema transform

* Changed indentation

* Fixed build by adding a cast

* Reformatted code

* Added an assertion that convertLogicalTypeFieldToString is called only with logical types

* Refactored code to delete ReadRows and ReadRowsWithPartitions classes

* Update .test-infra/jenkins/job_PostCommit_Java_SingleStoreIO_IT.groovy

Co-authored-by: Ahmed Abualsaud <[email protected]>

* Fixed bug where env variable name was used instead of the value

* Changed to use checkArgument instead of assert

* Added appropriate error message

Co-authored-by: Ahmed Abualsaud <[email protected]>