Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

discover nested columns when using nested column indexer for schemaless ingestion #13672

Merged
merged 15 commits into from
Jan 18, 2023

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented Jan 14, 2023

Description

Following up on #13653, this PR improves the flattener machinery to allow discovering nested columns when using druid schemaless ingestion powered by the nested column indexer for discovered columns, and moves the flag from AppendableIndexSpec to DimensionsSchema.

Effectively, whenever

...
    "dimensionsSpec": {
        "dimensions": [],
        "useNestedColumnIndexerForSchemaDiscovery": true
...
      },
...

is set, this value is pushed down to the FlattenerMaker implementations which power the column discovery.

InputEntityReader are fed an InputRowSchema which has access to the DimensionsSchema and so this value which can be passed into FlattenerMaker implementations, which have been updated to honor this setting.

This PR also adds a set of integration tests to test schemaless ingestion using useNestedColumnIndexerForSchemaDiscovery set to true with a variety of input formats. This test does not actually exercise the changes in this PR since the batch tests contain no nested data, but does at least cover string and numbers. I plan to add streaming integration tests in the future once the streaming tests are moved over to the new integration framework, and since those datas are generated I should be able to add some nested structure and provide integration test coverage for the full set of functionality.


This PR has:

  • been self-reviewed.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Copy link
Contributor

@imply-cheddar imply-cheddar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need the interface change.

Comment on lines 219 to 224
{
JsonProvider getJsonProvider();
/**
* List all "root" primitive properties and primitive lists (no nested objects, no lists of objects)
* List all "root" fields, optionally filtering to include only fields that contain primitive and lists of primitive values
*/
Iterable<String> discoverRootFields(T obj);
Iterable<String> discoverRootFields(T obj, boolean discoverNestedFields);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to change the interface at all? Can't you add the config as a property on the actual FlattenerMaker objects themselves? Doing that will definitely shrink this PR down and avoid interface changes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i guess I was thinking doing it this way would force implementors to be aware that the contract is a bit different (or at least will be), though maybe it is nicer to push into the FlattenerMaker constructors instead and we just mention in the release notes that implementors should try to honor the mode for their readers.

this.flattener = ObjectFlatteners.create(
flattenSpec,
new JSONFlattenerMaker(keepNullColumns),
false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not certain I know why this is hard-coding to false. Is that covered in a comment or javadoc or something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this and other Parser based implementations are hard coded to false to retain existing behavior for Hadoop ingestion. Parser are not created with a InputRowSchema the same way as InputEntityReader are, so I don't have easy access to a TuningConfig, and I wasn't very motivated to fix this up for Hadoop.

@@ -115,7 +115,7 @@ public SamplerResponse sample()
);
}

return inputSourceSampler.sample(inputSource, inputFormat, dataSchema, samplerConfig);
return inputSourceSampler.sample(inputSource, inputFormat, dataSchema, samplerConfig, tuningConfig != null ? tuningConfig.convertToTaskTuningConfig() : null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you invert the != here pls.

@@ -67,7 +67,7 @@ services:
# See https://hub.docker.com/_/mysql
# The image will intialize the user and DB upon first start.
metadata:
# platform: linux/x86_64 - Add when running on M1 Macs
platform: linux/x86_64 #- Add when running on M1 Macs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be checked in?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idk, i guess it seemed to not make anything fail, and it was sort of annoying to have to hunt this down to make the new integration tests work on an M1 mac... but I can revert just in case

@@ -92,14 +92,22 @@ private static boolean isFieldPrimitive(Schema.Field field)
private final boolean fromPigAvroStorage;
private final boolean binaryAsString;

private final boolean discoverNestedFields;

/**
* @param fromPigAvroStorage boolean to specify the data file is stored using AvroStorage
* @param binaryAsString boolean to encode the byte[] as a string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

javadocs need an update.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

}

@Override
public Iterable<String> discoverRootFields(final JsonNode obj)
{
if (discoverNestedFields) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you drop a one line comment here? I am assuming its like this since each top-level field is a field of its own if we are allowing nested columns.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you want similar comments for all the other FlattenerMaker implementations?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

went ahead and added comments to all

@clintropolis clintropolis merged commit fb26a10 into apache:master Jan 18, 2023
@clintropolis clintropolis deleted the discover-nested-columns branch January 18, 2023 20:57
}

@JsonCreator
private DimensionsSpec(
@JsonProperty("dimensions") List<DimensionSchema> dimensions,
@JsonProperty("dimensionExclusions") List<String> dimensionExclusions,
@Deprecated @JsonProperty("spatialDimensions") List<SpatialDimensionSchema> spatialDimensions,
@JsonProperty("includeAllDimensions") boolean includeAllDimensions
@JsonProperty("includeAllDimensions") boolean includeAllDimensions,
@JsonProperty("useNestedColumnIndexerForSchemaDiscovery") Boolean useNestedColumnIndexerForSchemaDiscovery
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about a shorter name? "discoverNested"?

OrcStructFlattenerMaker(boolean binaryAsString)
private final boolean discoverNestedFields;

OrcStructFlattenerMaker(boolean binaryAsString, boolean disocverNestedFields)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: typo in name of parameter

@clintropolis clintropolis added this to the 26.0 milestone Apr 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants