Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix expression column capabilities to not report dictionary encoded unless input is string #16577

Conversation

clintropolis
Copy link
Member

Description

Fixes an issue that occurs after #16366 where expressions with single input dictionary encoded columns that are not strings but have a string output incorrectly use the string dictionary encoded vector selector.

The fix for now is to make setting the capabilities of the virtual column to be dictionary encoded only if the input type is string. In the future if we make dictionary encoded selectors that are not coupled with handling strings then we could remove this constraint, but until then, this is the safest option to avoid incorrectly using string dimension selectors.

While here also noticed that COMPLEX<json> was needlessly reporting itself as dictionary encoded, which isn't quite true. While the nested field columns are dictionary encoded, the json values themselves are not, so functions like TO_JSON_STRING could also run into this problem.


This PR has:

  • been self-reviewed.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@@ -277,6 +277,9 @@ public ColumnCapabilities inferColumnCapabilities(@Nullable ColumnType outputTyp
// until query time
return ColumnCapabilitiesImpl.copyOf(underlyingCapabilities)
.setType(ColumnType.STRING)
// this is sad, but currently dictionary encodedness is tied to string
// selectors and sad stuff happens if the input type isn't string
.setDictionaryEncoded(underlyingCapabilities.is(ValueType.STRING))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be underlyingCapabilities.isTrue() && underlyingCapabilities.is(ValueType.STRING)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, imo copyOf is not the right thing to use here, logically speaking. I don't think we actually want to carry through all caps from the underlying selector, just a few specific ones.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be underlyingCapabilities.isTrue() && underlyingCapabilities.is(ValueType.STRING)?

oops yes.

also, imo copyOf is not the right thing to use here, logically speaking. I don't think we actually want to carry through all caps from the underlying selector, just a few specific ones.

Ah yea that is reasonable i suppose, i think originally more were preserved than not so i went with the copy. I guess hasBitmapIndexes isn't widely used on query side anymore since stuff can just ask ColumnIndexSupplier for an index and it will return if it has it, though i suppose it should still be preserved just in case.

Copy link
Member Author

@clintropolis clintropolis Jun 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, we were explicitly marking hasBitmapIndexes as false, though #15585 allows expressions to use them in many cases, so it probably should be allowed to pass through. hasBitmapIndexes and hasSpatialIndexes should probably be removed from ColumnCapabilities, and the write side stuff that currently uses it be moved to ColumnFormat, though not going to do that on this PR.

@@ -303,10 +303,6 @@ public void read(ByteBuffer buffer, ColumnBuilder builder, ColumnConfig columnCo
bitmapSerdeFactory,
byteOrder
);
ColumnCapabilitiesImpl capabilitiesBuilder = builder.getCapabilitiesBuilder();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just removing unused code?

Copy link
Member Author

@clintropolis clintropolis Jun 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, i had put in the description:

While here also noticed that COMPLEX was needlessly reporting itself as dictionary encoded, which isn't quite true. While the nested field columns are dictionary encoded, the json values themselves are not, so functions like TO_JSON_STRING could also run into this problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, I missed that the capabilitiesBuilder is from builder itself. OK, this change makes sense then.

@clintropolis clintropolis merged commit 3fb6ba2 into apache:master Jun 8, 2024
87 checks passed
@clintropolis clintropolis deleted the fix-fallback-vectorize-dictionary-encoded branch June 8, 2024 20:05
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants