discover nested columns when using nested column indexer for schemaless ingestion #13672

clintropolis · 2023-01-14T03:06:58Z

Description

Following up on #13653, this PR improves the flattener machinery so that nested columns can be discovered when Druid schemaless ingestion uses the nested column indexer for discovered columns, and moves the flag from AppendableIndexSpec to DimensionsSpec.

Effectively, whenever

...
    "dimensionsSpec": {
        "dimensions": [],
        "useNestedColumnIndexerForSchemaDiscovery": true
...
      },
...

is set, this value is pushed down to the FlattenerMaker implementations which power the column discovery.

InputEntityReader implementations are fed an InputRowSchema, which has access to the DimensionsSpec, so this value can be passed into the FlattenerMaker implementations, which have been updated to honor this setting.
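The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: every class name ending in `Sketch` is a hypothetical stand-in for the corresponding Druid class, and treating a missing flag as false is an assumption.

```java
import java.util.List;

// Hypothetical, simplified stand-ins; the real Druid classes have different
// constructors and many more members.
final class SchemaDiscoveryPlumbingSketch
{
  /** Stand-in for DimensionsSpec: carries the flag parsed from the ingestion spec. */
  static final class DimensionsSpecSketch
  {
    private final List<String> dimensions;
    private final boolean useNestedColumnIndexerForSchemaDiscovery;

    DimensionsSpecSketch(List<String> dimensions, Boolean useNestedColumnIndexerForSchemaDiscovery)
    {
      this.dimensions = dimensions;
      // Assumption: a missing (null) flag is treated as false.
      this.useNestedColumnIndexerForSchemaDiscovery =
          useNestedColumnIndexerForSchemaDiscovery != null && useNestedColumnIndexerForSchemaDiscovery;
    }

    List<String> dimensions()
    {
      return dimensions;
    }

    boolean useNestedColumnIndexerForSchemaDiscovery()
    {
      return useNestedColumnIndexerForSchemaDiscovery;
    }
  }

  /** Stand-in for InputRowSchema: gives readers access to the dimensions spec. */
  static final class InputRowSchemaSketch
  {
    private final DimensionsSpecSketch dimensionsSpec;

    InputRowSchemaSketch(DimensionsSpecSketch dimensionsSpec)
    {
      this.dimensionsSpec = dimensionsSpec;
    }

    DimensionsSpecSketch dimensionsSpec()
    {
      return dimensionsSpec;
    }
  }

  /** Stand-in for a FlattenerMaker implementation that receives the flag when it is built. */
  static final class FlattenerMakerSketch
  {
    private final boolean discoverNestedFields;

    FlattenerMakerSketch(boolean discoverNestedFields)
    {
      this.discoverNestedFields = discoverNestedFields;
    }

    boolean discoverNestedFields()
    {
      return discoverNestedFields;
    }
  }

  /** Stand-in for an InputEntityReader: reads the flag from the schema and hands it to the maker. */
  static final class ReaderSketch
  {
    private final FlattenerMakerSketch flattenerMaker;

    ReaderSketch(InputRowSchemaSketch schema)
    {
      this.flattenerMaker = new FlattenerMakerSketch(
          schema.dimensionsSpec().useNestedColumnIndexerForSchemaDiscovery()
      );
    }

    FlattenerMakerSketch flattenerMaker()
    {
      return flattenerMaker;
    }
  }

  public static void main(String[] args)
  {
    DimensionsSpecSketch spec = new DimensionsSpecSketch(List.of(), true);
    ReaderSketch reader = new ReaderSketch(new InputRowSchemaSketch(spec));
    System.out.println(reader.flattenerMaker().discoverNestedFields());
  }
}
```

The point of the sketch is only that the reader, not the flattener interface, is where the setting is looked up.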

This PR also adds a set of integration tests for schemaless ingestion with useNestedColumnIndexerForSchemaDiscovery set to true across a variety of input formats. These tests do not actually exercise the changes in this PR, since the batch tests contain no nested data, but they do at least cover strings and numbers. I plan to add streaming integration tests once the streaming tests are moved over to the new integration framework; since that data is generated, I should be able to add some nested structure and provide integration test coverage for the full set of functionality.

This PR has:

been self-reviewed.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
added integration tests.
been tested in a test Druid cluster.

imply-cheddar

I don't think we need the interface change.

imply-cheddar · 2023-01-16T05:02:40Z

core/src/main/java/org/apache/druid/java/util/common/parsers/ObjectFlatteners.java

  {
    JsonProvider getJsonProvider();
    /**
-     * List all "root" primitive properties and primitive lists (no nested objects, no lists of objects)
+     * List all "root" fields, optionally filtering to include only fields that contain primitive and lists of primitive values
     */
-    Iterable<String> discoverRootFields(T obj);
+    Iterable<String> discoverRootFields(T obj, boolean discoverNestedFields);


I don't think you need to change the interface at all? Can't you add the config as a property on the actual FlattenerMaker objects themselves? Doing that will definitely shrink this PR down and avoid interface changes?

yeah, i guess I was thinking doing it this way would force implementors to be aware that the contract is a bit different (or at least will be), though maybe it is nicer to push into the FlattenerMaker constructors instead and we just mention in the release notes that implementors should try to honor the mode for their readers.

imply-cheddar · 2023-01-16T05:04:42Z

core/src/main/java/org/apache/druid/java/util/common/parsers/JSONPathParser.java

+    this.flattener = ObjectFlatteners.create(
+        flattenSpec,
+        new JSONFlattenerMaker(keepNullColumns),
+        false


I'm not certain I know why this is hard-coding to false. Is that covered in a comment or javadoc or something?

this and other Parser-based implementations are hard-coded to false to retain existing behavior for Hadoop ingestion. Parsers are not created with an InputRowSchema the same way InputEntityReaders are, so I don't have easy access to a TuningConfig, and I wasn't very motivated to fix this up for Hadoop.

imply-cheddar · 2023-01-16T05:11:39Z

...ervice/src/main/java/org/apache/druid/indexing/seekablestream/SeekableStreamSamplerSpec.java

@@ -115,7 +115,7 @@ public SamplerResponse sample()
      );
    }

-    return inputSourceSampler.sample(inputSource, inputFormat, dataSchema, samplerConfig);
+    return inputSourceSampler.sample(inputSource, inputFormat, dataSchema, samplerConfig, tuningConfig != null ? tuningConfig.convertToTaskTuningConfig() : null);


can you invert the != here pls.
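One reading of this request is to put the null case first in the ternary, so the condition is a positive `==` check. A minimal sketch under that assumption; `SupervisorTuningConfigSketch` is a hypothetical stand-in that models only the `convertToTaskTuningConfig()` call from the diff, with a `String` in place of the real return type.

```java
final class InvertedNullCheckSketch
{
  /** Hypothetical stand-in for the supervisor tuning config; only one method is modeled. */
  interface SupervisorTuningConfigSketch
  {
    String convertToTaskTuningConfig();
  }

  /** Same result as the "!= null ? convert : null" form, with the null case handled first. */
  static String taskTuningConfigOrNull(SupervisorTuningConfigSketch tuningConfig)
  {
    return tuningConfig == null ? null : tuningConfig.convertToTaskTuningConfig();
  }

  public static void main(String[] args)
  {
    System.out.println(taskTuningConfigOrNull(null));
    System.out.println(taskTuningConfigOrNull(() -> "task-tuning-config"));
  }
}
```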

imply-cheddar · 2023-01-16T05:17:03Z

integration-tests-ex/cases/cluster/Common/dependencies.yaml

@@ -67,7 +67,7 @@ services:
  # See https://hub.docker.com/_/mysql
  # The image will intialize the user and DB upon first start.
  metadata:
-    #    platform: linux/x86_64  - Add when running on M1 Macs
+    platform: linux/x86_64  #- Add when running on M1 Macs


Should this be checked in?

idk, i guess it seemed to not make anything fail, and it was sort of annoying to have to hunt this down to make the new integration tests work on an M1 mac... but I can revert just in case

abhishekagarwal87 · 2023-01-17T05:25:01Z

...-core/avro-extensions/src/main/java/org/apache/druid/data/input/avro/AvroFlattenerMaker.java

@@ -92,14 +92,22 @@ private static boolean isFieldPrimitive(Schema.Field field)
  private final boolean fromPigAvroStorage;
  private final boolean binaryAsString;

+  private final boolean discoverNestedFields;
+
  /**
   * @param fromPigAvroStorage boolean to specify the data file is stored using AvroStorage
   * @param binaryAsString boolean to encode the byte[] as a string.


javadocs need an update.

abhishekagarwal87 · 2023-01-17T05:28:15Z

core/src/main/java/org/apache/druid/java/util/common/parsers/JSONFlattenerMaker.java

  }

  @Override
  public Iterable<String> discoverRootFields(final JsonNode obj)
  {
+    if (discoverNestedFields) {


can you drop a one-line comment here? I am assuming it's like this since each top-level field is a field of its own if we are allowing nested columns.

do you want similar comments for all the other FlattenerMaker implementations?

went ahead and added comments to all
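For reference, the behavior discussed in this thread can be sketched as follows. This is an illustration over plain `java.util.Map` values, not the actual JSONFlattenerMaker code: the class name is hypothetical, and null handling (the `keepNullColumns` option of the real JSON maker) is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of root field discovery over a parsed row.
final class RootFieldDiscoverySketch
{
  private final boolean discoverNestedFields;

  RootFieldDiscoverySketch(boolean discoverNestedFields)
  {
    this.discoverNestedFields = discoverNestedFields;
  }

  List<String> discoverRootFields(Map<String, Object> obj)
  {
    if (discoverNestedFields) {
      // With nested columns allowed, objects and lists of objects can be stored
      // as-is, so every top-level key is a field of its own.
      return new ArrayList<>(obj.keySet());
    }
    // Otherwise keep only primitives and lists of primitives.
    List<String> fields = new ArrayList<>();
    for (Map.Entry<String, Object> entry : obj.entrySet()) {
      Object value = entry.getValue();
      if (isPrimitive(value) || isPrimitiveList(value)) {
        fields.add(entry.getKey());
      }
    }
    return fields;
  }

  private static boolean isPrimitive(Object value)
  {
    return value instanceof String || value instanceof Number || value instanceof Boolean;
  }

  private static boolean isPrimitiveList(Object value)
  {
    if (!(value instanceof List)) {
      return false;
    }
    for (Object element : (List<?>) value) {
      if (!isPrimitive(element)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args)
  {
    Map<String, Object> row = new LinkedHashMap<>();
    row.put("name", "druid");
    row.put("count", 3);
    row.put("tags", List.of("a", "b"));
    row.put("nested", Map.of("x", 1));
    row.put("objs", List.of(Map.of("y", 2)));

    System.out.println(new RootFieldDiscoverySketch(false).discoverRootFields(row));
    System.out.println(new RootFieldDiscoverySketch(true).discoverRootFields(row));
  }
}
```

With the flag off, only `name`, `count`, and `tags` are discovered from the sample row; with it on, `nested` and `objs` are discovered as well.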

imply-cheddar · 2023-01-19T03:28:47Z

core/src/main/java/org/apache/druid/data/input/impl/DimensionsSpec.java

  }

  @JsonCreator
  private DimensionsSpec(
      @JsonProperty("dimensions") List<DimensionSchema> dimensions,
      @JsonProperty("dimensionExclusions") List<String> dimensionExclusions,
      @Deprecated @JsonProperty("spatialDimensions") List<SpatialDimensionSchema> spatialDimensions,
-      @JsonProperty("includeAllDimensions") boolean includeAllDimensions
+      @JsonProperty("includeAllDimensions") boolean includeAllDimensions,
+      @JsonProperty("useNestedColumnIndexerForSchemaDiscovery") Boolean useNestedColumnIndexerForSchemaDiscovery


how about a shorter name? "discoverNested"?

imply-cheddar · 2023-01-19T03:33:23Z

...re/orc-extensions/src/main/java/org/apache/druid/data/input/orc/OrcStructFlattenerMaker.java

-  OrcStructFlattenerMaker(boolean binaryAsString)
+  private final boolean discoverNestedFields;
+
+  OrcStructFlattenerMaker(boolean binaryAsString, boolean disocverNestedFields)


nit: typo in name of parameter

discover nested columns when using nested column indexer for schemaless

02bdd39

clintropolis added the Area - Querying, Release Notes, and Area - Ingestion labels Jan 14, 2023

clintropolis added 3 commits January 13, 2023 21:45

fixes

3f410a9

stable test output

b785c3b

more stable

7def946

imply-cheddar suggested changes Jan 16, 2023

push flag into FlattenerMaker constructors

eec9bd8

abhishekagarwal87 reviewed Jan 17, 2023

clintropolis added 5 commits January 17, 2023 15:41

comments and javadoc

185dd2c

move useNestedColumnIndexerForSchemaDiscovery from AppendableIndexSpe…

4ccd6a6

…c to DimensionsSpec

revert

38a15ca

fix build

55e27f0

actually fix build this time

922a3b1

clintropolis removed the Release Notes label Jan 18, 2023

clintropolis added 5 commits January 17, 2023 21:18

adjust

a830c10

javadoc imports are not real imports i guess

514bd3b

fix javadoc

430c66e

revert

2b32878

add test

04d5d66

abhishekagarwal87 approved these changes Jan 18, 2023

clintropolis merged commit fb26a10 into apache:master Jan 18, 2023

clintropolis deleted the discover-nested-columns branch January 18, 2023 20:57

imply-cheddar reviewed Jan 19, 2023

clintropolis added this to the 26.0 milestone Apr 10, 2023

techdocsmith mentioned this pull request Apr 12, 2023

[DRAFT] 26.0.0 release notes #14064

Closed
