Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remix nested columns #14014

Merged
merged 16 commits into from
Apr 5, 2023
Merged

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented Apr 2, 2023

Description

This PR reverts NestedDataColumnIndexer, NestedDataColumnMerger, NestedDataColumnSerializer to their version pre #13803 behavior (v4) for backwards compatibility in favor of splitting the new functionality into a new "auto" dimension schema type to more accurately reflect the new behavior that this PR adds.

This new 'auto type' indexer, merger, and an associated family of serializers is the next logical iteration of the nested column stuff. Essentially this is an automatic type column indexer that produces the most appropriate column for the given inputs, making either STRING, ARRAY<STRING>, LONG, ARRAY<LONG>, DOUBLE, ARRAY<DOUBLE>, or COMPLEX<json> columns, all sharing a common 'nested' format that allows segment merging to occur smoothly regardless of which physical format the column was written with.

To accompany this, a new ColumnFormat interface has been defined to separate physical storage format from logical type. ColumnFormat is now used instead of ColumnCapabilities to get column handlers for segment creation. This allows both the v4 legacy columns and the new common format nested columns to both be able to map to the COMPLEX<json>

This PR also fixes a bug in RoaringBitmapSerdeFactory, where if anything actually ever wrote out an empty bitmap using toBytes and then later tried to read it it would fail (the nerve!)

Release note

A new "auto" type column schema and indexer has been added to native ingestion as the next logical iteration of the nested column functionality. This automatic type column indexer that produces the most appropriate column for the given inputs, producing either STRING, ARRAY<STRING>, LONG, ARRAY<LONG>, DOUBLE, ARRAY<DOUBLE>, or COMPLEX<json> columns, all sharing a common 'nested' format.

All columns produced by 'auto' have indexes to aid in fast filtering (unlike classic LONG and DOUBLE columns) and use cardinality based thresholds to attempt to only utilize these indexes when it is likely to actually speed up the query (unlike classic STRING columns).

COMPLEX<json> columns produced by this 'auto' indexer store arrays of simple scalar types differently than their 'json' (v4) counterparts, storing them as ARRAY typed columns. This means that the JSON_VALUE function can now extract entire arrays, e.g. JSON_VALUE(nested, '$.array' RETURNING BIGINT ARRAY). There is no change with how arrays of complex objects are stored at this time.

This improvement also adds a completely new functionality to Druid, ARRAY typed columns, which unlike classic multi-value STRING columns behave with ARRAY semantics. These columns can currently only be created via the 'auto' type indexer when all values are an arrays with the same type of elements.


Key changed/added classes in this PR
  • ColumnFormat
  • a bunch of other stuff

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

changes:
* introduce ColumnFormat to separate physical storage format from logical type. ColumnFormat is now used instead of ColumnCapabilities to get column handlers for segment creation
* introduce new 'standard type' indexer, merger, and family of serializers, which is the next logical iteration of the nested column stuff. Essentially this is an automatic type column indexer that produces the most appropriate column for the given inputs, making either STRING, ARRAY<STRING>, LONG, ARRAY<LONG>, DOUBLE, ARRAY<DOUBLE>, or COMPLEX<json>.
* revert NestedDataColumnIndexer, NestedDataColumnMerger, NestedDataColumnSerializer to their version pre apache#13803 behavior (v4) for backwards compatibility
* fix a bug in RoaringBitmapSerdeFactory if anything actually ever wrote out an empty bitmap using toBytes and then later tried to read it (the nerve!)
@clintropolis clintropolis force-pushed the nested-column-remix branch from 6561769 to 2f93beb Compare April 2, 2023 12:10
if (o instanceof Object[]) {
Object[] array = (Object[]) o;
if (elementNumber < array.length) {
return array[elementNumber];

Check failure

Code scanning / CodeQL

Improper validation of user-provided array index

This index depends on a [user-provided value](1) which can cause an ArrayIndexOutOfBoundsException. This index depends on a [user-provided value](2) which can cause an ArrayIndexOutOfBoundsException. This index depends on a [user-provided value](3) which can cause an ArrayIndexOutOfBoundsException. This index depends on a [user-provided value](4) which can cause an ArrayIndexOutOfBoundsException. This index depends on a [user-provided value](5) which can cause an ArrayIndexOutOfBoundsException.
@JsonSubTypes.Type(name = NestedDataComplexTypeSerde.TYPE_NAME, value = NestedDataDimensionSchema.class)
@JsonSubTypes.Type(name = NestedDataComplexTypeSerde.TYPE_NAME, value = NestedDataDimensionSchema.class),
@JsonSubTypes.Type(name = StandardTypeColumnSchema.TYPE, value = StandardTypeColumnSchema.class),
@JsonSubTypes.Type(name = "auto", value = StandardTypeColumnSchema.class)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this added as well? Shouldn't anything that is using this be able to use StandardTypeColumnSchema.TYPE when it persists such that we don't need to add 2 names for the same thing?

Comment on lines 190 to 196
default void mergeNestedFields(SortedMap<String, FieldTypeInfo.MutableTypeSet> mergedFields)
{
mergedFields.put(
NestedPathFinder.JSON_PATH_ROOT,
new FieldTypeInfo.MutableTypeSet().add(getColumnCapabilities().toColumnType())
);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feel very nested-column-specific, but it's a change on the DimensionIndexer interface. Does it really need to be here? Can't DimensionIndexer instances make sure that they are all the same type and then use concrete methods instead of leaking something like this on the interface?

return new CapabilitiesBasedFormat(
ColumnCapabilitiesImpl.snapshot(
getColumnCapabilities(),
CapabilitiesBasedFormat.DIMENSION_CAPABILITY_MERGE_LOGIC
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit of a nit, but it seems really weird to have the CapabilitiesBasedFormat need to be given logic from itself? Maybe create a static CapabilitiesBasedFormat.snapshot(ColumnCapabilities) that does this on its own and keeps the implementation details local and private?

Comment on lines 203 to 209
handler.makeMerger(
indexSpec,
segmentWriteOutMedium,
dimCapabilities.get(i),
dimFormats.get(i).toColumnCapabilities(),
progress,
closer
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The handler came from the dimFormat, right? Why would it need the capabilities from dimFormat? Can we not eliminate the argument entirely?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the snag here is that right now the string dimension merger needs the 'has multiple values' flag set on the capabilities to determine whether it makes a serializer for a single value or mutli value string.

I didn't want to rework the classic columns and indexers to use the ColumnFormat stuff at this time to minimize the number of changes, but the answer is yes, once we make a dedicate format for the String dimension schema i would think it would capture whether the string was multi-valued or not and we can drop this argument.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that because we are using CapabilitiesBasedFormat for all of the old ones that it would provide a nice clean plain to just start ignoring this parameter. Though, punting on it for later is probably also fine as even if we started ignoring it, making the actual interface change is less easy... Okay.

Comment on lines 264 to 268
ColumnDescriptor columnDesc = ColumnDescriptor
.builder()
.setValueType(dimCapabilities.get(i).getType())
.setValueType(dimFormats.get(i).getLogicalType().getType())
.addSerde(new NullColumnPartSerde(indexMergeResult.rowCount, indexSpec.getBitmapSerdeFactory()))
.build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to store the type of the null column? Maybe we need a null type that indicates that it's always null and we can store it like that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is from that explicit null value columns change we did a while back that saves any dimension schema in the segment that didn't have any values

I agree this can probably be better/smarter

Comment on lines 54 to 62
public void deserializeColumn(
@SuppressWarnings("unused") String columnName,
ByteBuffer buffer,
ColumnBuilder builder,
ColumnConfig columnConfig
)
{
deserializeColumn(buffer, builder, columnConfig);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe that we need to add this columnName parameter. So far, the serde code has generally kept the name disassociated from the actual storage of the column as the name of the column is actually unimportant for the ability to serialize and deserialize the bytes.

It looks like this is being done because the column name is being used as a prefix on the other names in the file smoosher, that makes sense, but at this point it's not a "column name" as much as a "unique prefix". Given that it is a prefix that we expect this column to use on all of its things, I think it makes sense to serialize out the unique prefix as part of the bytes of the column itself and then read it back in from there. Let it be coincidence that the unique prefix just so happens to be the same thing as the column name.

This is pretty similar to how it was already working in the "older" versions where it got the information by deserializing a metadata object...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed from this interface since this wasn't needed for the v4 complex column and has been reverted since it can get the filename from its embedded metadata.

I have left the parameter on the ColumnPartSerde.Deserializer interface for now because the new nested column part serde does need it since it doesn't store a separate metadata file embedded in it and IndexIO.V9IndexLoader.deserializeColumn already has the interned column/filename passed in so i took advantage of this and pushed it down (instead of writing the column name again inside of the nested part serde or its data file). This feels like an internal interface so it doesn't seem that disruptive to change, but can do this other ways if really against it

@@ -294,7 +294,7 @@ public Deserializer getDeserializer()
return new Deserializer()
{
@Override
public void read(ByteBuffer buffer, ColumnBuilder builder, ColumnConfig columnConfig)
public void read(String columnName, ByteBuffer buffer, ColumnBuilder builder, ColumnConfig columnConfig)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we remove the argument from ComplexMetricSerde we don't need it here anymore either.

@@ -39,7 +39,8 @@
@JsonSubTypes.Type(name = "floatV2", value = FloatNumericColumnPartSerdeV2.class),
@JsonSubTypes.Type(name = "longV2", value = LongNumericColumnPartSerdeV2.class),
@JsonSubTypes.Type(name = "doubleV2", value = DoubleNumericColumnPartSerdeV2.class),
@JsonSubTypes.Type(name = "null", value = NullColumnPartSerde.class)
@JsonSubTypes.Type(name = "null", value = NullColumnPartSerde.class),
@JsonSubTypes.Type(name = "standard", value = StandardTypeColumnPartSerde.class)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What're the semantics of the standard column part serde trying to do? If it's standard, does it subsume all of the others?

Comment on lines 56 to 57
void read(ByteBuffer buffer, ColumnBuilder builder, ColumnConfig columnConfig);
void read(String columnName, ByteBuffer buffer, ColumnBuilder builder, ColumnConfig columnConfig);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not add the extra parameter here.

Comment on lines -97 to +98
Assert.assertEquals(168, key.getEffectiveSizeBytes());
Assert.assertEquals(6, indexer.getCardinality());
Assert.assertEquals(276, key.getEffectiveSizeBytes());
Assert.assertEquals(5, indexer.getCardinality());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the different in numbers because the tests was validating v5, now it's validating v4?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String name,
IndexSpec indexSpec,
SegmentWriteOutMedium segmentWriteOutMedium,
@SuppressWarnings("unused") ProgressIndicator progressIndicator,

Check notice

Code scanning / CodeQL

Useless parameter

The parameter 'progressIndicator' is never used.
String name,
IndexSpec indexSpec,
SegmentWriteOutMedium segmentWriteOutMedium,
@SuppressWarnings("unused") ProgressIndicator progressIndicator,

Check notice

Code scanning / CodeQL

Useless parameter

The parameter 'progressIndicator' is never used.
String name,
IndexSpec indexSpec,
SegmentWriteOutMedium segmentWriteOutMedium,
@SuppressWarnings("unused") ProgressIndicator progressIndicator,

Check notice

Code scanning / CodeQL

Useless parameter

The parameter 'progressIndicator' is never used.
String name,
IndexSpec indexSpec,
SegmentWriteOutMedium segmentWriteOutMedium,
@SuppressWarnings("unused") ProgressIndicator progressIndicator,

Check notice

Code scanning / CodeQL

Useless parameter

The parameter 'progressIndicator' is never used.
String name,
IndexSpec indexSpec,
SegmentWriteOutMedium segmentWriteOutMedium,
@SuppressWarnings("unused") ProgressIndicator progressIndicator,

Check notice

Code scanning / CodeQL

Useless parameter

The parameter 'progressIndicator' is never used.
Comment on lines 1117 to 1119
// this should use:
// columnHolder.getColumnFormat().getColumnSchema(dimension)
// someday...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it cannot because? (Please add to the comment)

Comment on lines 99 to 100
final SortedValueDictionary dimValues = mergable.getValueDictionary();
mergable.mergeFieldsInto(mergedFields);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unclear if the dimValues dictionary would be mutated or not and if that's expected or not. I think that it's expected that it is not mutated. In which case, it might be nice to push mergable.mergeFieldsInto() down below the if (!allNulls){ } block just to make it clear that those values are being evaluated and grabbed before any merge occurs.

@@ -17,27 +17,31 @@
* under the License.
*/

package org.apache.druid.segment.column;
package org.apache.druid.segment.nested;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Package nit: I wonder if it shouldn't be segment.column.nested? Probably doesn't really make that much of a difference, but seems more technically accurate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't have a segment.column.nested right now, i almost left it in segment.column since that also seems appropriate

Comment on lines +39 to +45
* Nested data column with optimized support for simple arrays. Not actually v5 in the segment since columns are now
* serialized using {@link org.apache.druid.segment.serde.NestedCommonFormatColumnPartSerde} instead of the generic
* complex type system.
*
* Not really stored in a segment as V5 since instead of V5 we migrated to {@link NestedCommonFormatColumn} which
* specializes physical format based on the types of data encountered during processing, and so versions are now
* {@link NestedCommonFormatColumnSerializer#V0} for all associated specializations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the NestedCommon name that you picked for other stuff. It wouldn't bother me if you named this NestedCommon as well, fwiw. I'm also fine with the current naming, so more stream-of-consciousness thought than a request for change.

Comment on lines 46 to 53
/**
* Deserializes a ByteBuffer and adds it to the ColumnBuilder. This method allows for the ComplexMetricSerde
* to implement it's own versioning scheme to allow for changes of binary format in a forward-compatible manner.
*
* @param buffer the buffer to deserialize
* @param builder ColumnBuilder to add the column to
* @param columnConfig ColumnConfiguration used during deserialization
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the javadoc on this method got clobbered, probably by accident?

Comment on lines +114 to +120
// this shouldn't happen, but if it does, try to close to prevent a leak
try {
col.close();
}
catch (IOException e) {
throw new RuntimeException(e);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit of a nit, but if you add this text to the exception message, you get the benefit of the comment explaining that you don't expect it to occur and if the exception ever does get thrown, the message is there too for whoever sees it.


boolean allNulls = dimValues == null || dimValues.allNull();
sortedLookup = dimValues;
if (!allNulls) {
mergable.mergeFieldsInto(mergedFields);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just double checking, I had been thinking this was correct outside of the if statement because, even if it's all nulls, the others might have never seen a null and so you want to merge it so that null is properly registered. Is that not a concern?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i realized that if it is all null this isn't doing anything useful anyway so i put it inside the if

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants