-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding Beam Schemas capability to parse json-schemas. This is the de-… #24271
Conversation
Assigning reviewers. If you would like to opt out of this review, comment R: @apilloud for label java. Available commands:
The PR bot will only process comments in the main thread (not review comments). |
Codecov Report
@@ Coverage Diff @@
## master #24271 +/- ##
==========================================
- Coverage 73.43% 73.37% -0.07%
==========================================
Files 716 717 +1
Lines 96571 96939 +368
==========================================
+ Hits 70920 71128 +208
- Misses 24328 24488 +160
Partials 1323 1323
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do we do with properties that don't translate to beam schemas like array uniqueness of items?
We should describe to users what we reject and be explicit about what we support and what we ignore for each JSON type from https://json-schema.org/understanding-json-schema/reference/index.html
We'll also want to be careful when supporting nulls and test for the null case explicitly.
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Outdated
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Outdated
Show resolved
Hide resolved
sdks/java/core/src/test/resources/schemas/json/nested_arrays_objects_json_schema.json
Outdated
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Outdated
Show resolved
Hide resolved
buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy
Outdated
Show resolved
Hide resolved
sdks/java/core/build.gradle
Outdated
@@ -91,6 +91,8 @@ dependencies { | |||
shadow library.java.avro | |||
shadow library.java.snappy_java | |||
shadow library.java.joda_time | |||
shadow library.java.json_org | |||
shadow library.java.everit_json_schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make this feature require the user provide the library?
Need to analyze how large the dependency tree is and how stable it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've marked is as a provided
dependency.
Before moving forward - another possibility is to create a new extension module to hold these dependencies and this functionality. The module would be used by some of our schema transform implementations (kafka and pubsub), so it would still come back as a depenency for other Beam modules, but not for core - wdyt?
sdks/java/core/src/test/resources/schemas/json/nested_arrays_objects_json_schema.json
Outdated
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Show resolved
Hide resolved
Should we add a SchemaProvider as well? |
if (propertySchema == null) { | ||
throw new IllegalArgumentException("Unable to parse schema " + jsonSchema.toString()); | ||
} | ||
if (propertySchema instanceof org.everit.json.schema.ObjectSchema) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to add the following support methods? I advocate that removing the embedded checks in this method allows these to be independently tested. First, the following shows how beamSchemaFromJsonSchema
might look with the proposed overloaded support methods. Then, some of the support methods follow. To avoid the instanceof checks, I tried to find in the libraries Javadoc, whether the base Schema had a getType()
getter but it seems not. Perhaps we could have a Map<Class<T extends org.everit.json.schema.Schema>, Function<T extends org.everit.json.schema.Schema, Schema.Field>>
that maps a particular json Schema class to the Java function that converts to the Schema.Field.
Schema.Builder beamSchemaBuilder = Schema.builder()
for (String propertyName : jsonSchema.getPropertySchemas().keySet()) {
org.everit.json.schema.Schema propertySchema = jsonSchema.getPropertySchemas().get(propertyName);
if (propertySchema == null) {
throw new IllegalArgumentException("Unable to parse schema " + jsonSchema.toString());
}
Schema.Field field = beamFieldFromJsonField(propertySchema);
builder = builder.addField(field);
}
return builder.build();
static Schema.Field beamFieldFromJsonField(org.everit.json.schema.Schema propertySchema) {
if (propertySchema instanceof org.everit.json.schema.ObjectSchema) {
return beamFieldFromJsonField((org.everit.json.schema.ObjectSchema) propertySchema);
}
if (propertySchema instanceof org.everit.json.schema.ArraySchema) {
return beamFieldFromJsonField((org.everit.json.schema.ObjectSchema) propertySchema);
}
// etc etc to all the various types.
}
static Schema.Field beamFieldFromJsonField(org.everit.json.schema.ObjectSchema objectSchema) {
// do some magic
}
static Schema.Field beamFieldFromJsonField(org.everit.json.schema.ArraySchema arraySchema) {
// do some more magic
}
static Schema.Field beamFieldFromJsonField(org.everit.json.schema.BooleanField booleanSchema) {
// do even some more magical magic
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The conversions for type to schema are short I think it would be simpler to not add this indirection
…facto standard way to define JSON schemas
35131d3
to
a14c8c4
Compare
953e918
to
342edca
Compare
Can you elaborate @reuvenlax ? Any pointers? : ) Happy to implement whatever... |
Run Java PreCommit |
3 similar comments
Run Java PreCommit |
Run Java PreCommit |
Run Java PreCommit |
I've tried to document in some detail what we do/don't support, as well as the required provided dependency. LMK what you think @lukecwik I've also added support for nullable/non-nullable fields (based on the |
Run Spotless PreCommit |
Run Java_GCP_IO_Direct PreCommit |
1 similar comment
Run Java_GCP_IO_Direct PreCommit |
Run Java_Debezium_IO_Direct PreCommit |
Run Java_PVR_Flink_Docker PreCommit |
@lukecwik please let me know what you think of the javadoc for it, and other changes. |
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Outdated
Show resolved
Hide resolved
* <dependency> | ||
* <groupId>com.github.erosb < /groupId> | ||
* <artifactId>everit-json-schema < /artifactId> | ||
* <version>1.14.1 < /version> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should point the person to use the same version that Beam was tested with and not hard-code the version in the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
annoying but is fine
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Outdated
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Outdated
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Outdated
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Outdated
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java
Outdated
Show resolved
Hide resolved
This only applies if you have a type that you want to convert to a schema and doesn't apply to schemas stored in an arbitrary location (e.g. string, file, URL, ...) |
Nice improvement over the original and good tests. |
…/JsonUtils.java Co-authored-by: Lukasz Cwik <[email protected]>
…/JsonUtils.java Co-authored-by: Lukasz Cwik <[email protected]>
…/JsonUtils.java Co-authored-by: Lukasz Cwik <[email protected]>
…/JsonUtils.java Co-authored-by: Lukasz Cwik <[email protected]>
…/JsonUtils.java Co-authored-by: Lukasz Cwik <[email protected]>
45cfdd1
to
aff5ea4
Compare
Run Java_Examples_Dataflow PreCommit |
3 similar comments
Run Java_Examples_Dataflow PreCommit |
Run Java_Examples_Dataflow PreCommit |
Run Java_Examples_Dataflow PreCommit |
Run Java_GCP_IO_Direct PreCommit |
thanks all! |
apache#24271) * Adding Beam Schemas capability to parse json-schemas. This is the de-facto standard way to define JSON schemas * json sample schema files for tests * addressing comments * fixup * fixup * documenting and fixing nullable cases * fixup * Update sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java Co-authored-by: Lukasz Cwik <[email protected]> * Update sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java Co-authored-by: Lukasz Cwik <[email protected]> * Update sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java Co-authored-by: Lukasz Cwik <[email protected]> * Update sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java Co-authored-by: Lukasz Cwik <[email protected]> * Update sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/JsonUtils.java Co-authored-by: Lukasz Cwik <[email protected]> * improve docs * fixup Co-authored-by: Lukasz Cwik <[email protected]>
…facto standard way to define JSON schemas
r: @damondouglas
fyi: @reuvenlax @lukecwik - I am adding a new dependency to Java Core SDK. It is not exposed in the API. It is used to parse JSON schemas.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username
).addresses #123
), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>
instead.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.