-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
discover nested columns when using nested column indexer for schemaless ingestion #13672
discover nested columns when using nested column indexer for schemaless ingestion #13672
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need the interface change.
{ | ||
JsonProvider getJsonProvider(); | ||
/** | ||
* List all "root" primitive properties and primitive lists (no nested objects, no lists of objects) | ||
* List all "root" fields, optionally filtering to include only fields that contain primitive and lists of primitive values | ||
*/ | ||
Iterable<String> discoverRootFields(T obj); | ||
Iterable<String> discoverRootFields(T obj, boolean discoverNestedFields); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need to change the interface at all? Can't you add the config as a property on the actual FlattenerMaker objects themselves? Doing that will definitely shrink this PR down and avoid interface changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, i guess I was thinking doing it this way would force implementors to be aware that the contract is a bit different (or at least will be), though maybe it is nicer to push into the FlattenerMaker
constructors instead and we just mention in the release notes that implementors should try to honor the mode for their readers.
this.flattener = ObjectFlatteners.create( | ||
flattenSpec, | ||
new JSONFlattenerMaker(keepNullColumns), | ||
false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not certain I know why this is hard-coding to false. Is that covered in a comment or javadoc or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this and other Parser
based implementations are hard coded to false to retain existing behavior for Hadoop ingestion. Parser
are not created with a InputRowSchema
the same way as InputEntityReader
are, so I don't have easy access to a TuningConfig
, and I wasn't very motivated to fix this up for Hadoop.
@@ -115,7 +115,7 @@ public SamplerResponse sample() | |||
); | |||
} | |||
|
|||
return inputSourceSampler.sample(inputSource, inputFormat, dataSchema, samplerConfig); | |||
return inputSourceSampler.sample(inputSource, inputFormat, dataSchema, samplerConfig, tuningConfig != null ? tuningConfig.convertToTaskTuningConfig() : null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you invert the !=
here pls.
@@ -67,7 +67,7 @@ services: | |||
# See https://hub.docker.com/_/mysql | |||
# The image will intialize the user and DB upon first start. | |||
metadata: | |||
# platform: linux/x86_64 - Add when running on M1 Macs | |||
platform: linux/x86_64 #- Add when running on M1 Macs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be checked in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idk, i guess it seemed to not make anything fail, and it was sort of annoying to have to hunt this down to make the new integration tests work on an M1 mac... but I can revert just in case
@@ -92,14 +92,22 @@ private static boolean isFieldPrimitive(Schema.Field field) | |||
private final boolean fromPigAvroStorage; | |||
private final boolean binaryAsString; | |||
|
|||
private final boolean discoverNestedFields; | |||
|
|||
/** | |||
* @param fromPigAvroStorage boolean to specify the data file is stored using AvroStorage | |||
* @param binaryAsString boolean to encode the byte[] as a string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
javadocs need an update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
} | ||
|
||
@Override | ||
public Iterable<String> discoverRootFields(final JsonNode obj) | ||
{ | ||
if (discoverNestedFields) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you drop a one line comment here? I am assuming its like this since each top-level field is a field of its own if we are allowing nested columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you want similar comments for all the other FlattenerMaker implementations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
went ahead and added comments to all
} | ||
|
||
@JsonCreator | ||
private DimensionsSpec( | ||
@JsonProperty("dimensions") List<DimensionSchema> dimensions, | ||
@JsonProperty("dimensionExclusions") List<String> dimensionExclusions, | ||
@Deprecated @JsonProperty("spatialDimensions") List<SpatialDimensionSchema> spatialDimensions, | ||
@JsonProperty("includeAllDimensions") boolean includeAllDimensions | ||
@JsonProperty("includeAllDimensions") boolean includeAllDimensions, | ||
@JsonProperty("useNestedColumnIndexerForSchemaDiscovery") Boolean useNestedColumnIndexerForSchemaDiscovery |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about a shorter name? "discoverNested"
?
OrcStructFlattenerMaker(boolean binaryAsString) | ||
private final boolean discoverNestedFields; | ||
|
||
OrcStructFlattenerMaker(boolean binaryAsString, boolean disocverNestedFields) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: typo in name of parameter
Description
Following up on #13653, this PR improves the flattener machinery to allow discovering nested columns when using druid schemaless ingestion powered by the nested column indexer for discovered columns, and moves the flag from
AppendableIndexSpec
toDimensionsSchema
.Effectively, whenever
is set, this value is pushed down to the
FlattenerMaker
implementations which power the column discovery.InputEntityReader
are fed anInputRowSchema
which has access to theDimensionsSchema
and so this value which can be passed intoFlattenerMaker
implementations, which have been updated to honor this setting.This PR also adds a set of integration tests to test schemaless ingestion using
useNestedColumnIndexerForSchemaDiscovery
set to true with a variety of input formats. This test does not actually exercise the changes in this PR since the batch tests contain no nested data, but does at least cover string and numbers. I plan to add streaming integration tests in the future once the streaming tests are moved over to the new integration framework, and since those datas are generated I should be able to add some nested structure and provide integration test coverage for the full set of functionality.This PR has: