-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enabling aggregateMultipleValues in all StringAnyAggregators #15434
Conversation
...sing/src/test/java/org/apache/druid/query/aggregation/any/StringAnyBufferAggregatorTest.java
Fixed
Show fixed
Hide fixed
...sing/src/test/java/org/apache/druid/query/aggregation/any/StringAnyBufferAggregatorTest.java
Fixed
Show fixed
Hide fixed
public class StringAnyAggregator implements Aggregator | ||
{ | ||
private final BaseObjectColumnValueSelector valueSelector; | ||
private final int maxStringBytes; | ||
private boolean isFound; | ||
private String foundValue; | ||
final boolean aggregateMultipleValues; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: private
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -49,12 +49,14 @@ public class StringAnyAggregatorFactory extends AggregatorFactory | |||
private final String fieldName; | |||
private final String name; | |||
protected final int maxStringBytes; | |||
protected final boolean aggregateMultipleValues; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why protected? (same applies to maxStringBytes
too i think...)
if (row.size() > 1) { | ||
List<String> arrayList = new ArrayList<>(); | ||
row.forEach(rowIndex -> { | ||
String chopped = StringUtils.fastLooseChop(multiValueSelector.lookupName(rowIndex), maxStringBytes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think you need to call fastLooseChop
here because putValue
uses StringUtils.toUtf8WithLimit
which achieves the same thing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed it. Applied to arrays initially to avoid partially chopped result like [first, seco
. applying chop at last leaves the result in weird open ended array. I think it might be expected as user can change it with increasing max bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's fine i think, i don't really like the silent truncation at all personally; i think we should probably change first/last/any to all optionally fail instead of silently truncating results, but that doesn't need to changed now.
columnSelectorFactory.moveSelectorCursorToNext(); | ||
columnSelectorFactory.moveSelectorCursorToNext(); | ||
res.aggregate(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does it increment cursor twice before aggregating?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh yes, I should have probably mention it in comment, Dummy TestColumnSelectorFactory Selector that I wrote at the end of class has MVD present on 3rd place, this moved cursor twice.
case COMPLEX: | ||
return new StringAnyAggregatorFactory(name, fieldName, maxStringBytes, aggregateMultipleValues); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why would complex map to string? I think this means any type of complex would use the string aggregator which doesn't seem correct to me
private Integer getMaxStringBytes( | ||
final List<RexNode> rexNodes, | ||
final AggregateCall aggregateCall, | ||
final PlannerContext plannerContext |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this method seems strange to pass List<RexNode>
, and only uses rexNodes.get(1)
from it, please pass the RexNode
to check directly
Addressed all comments. |
.sql("SELECT ANY_VALUE(dim3, 1000, null) FROM foo") | ||
.queryContext(ImmutableMap.of()); | ||
|
||
DruidException e1 = assertThrows(DruidException.class, () -> qtb1.run()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kgyrtkirk Looks like we are getting unclear error message in decoupled mode than coupled mode. I had to had condition for asserts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from Query could not be planned. A possible reason is
it seem like it went fishing - but the error is [The third argument 'null:NULL' to function 'EXPR$0' is not a boolean]
which should be identified much earlier.
I believe the best way to fix this inconsistency would be to make sure this is checked during validation.
Another way around is to use @NotYetSupported
and leave it for later....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added early validation in operandChecker class and removed the number/boolean checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall lgtm
}); | ||
foundValue = DimensionHandlerUtils.convertObjectToString(arrayList); | ||
} else { | ||
foundValue = multiValueSelector.lookupName(row.get(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't seem to exactly match the behavior of the StringAnyAggregator
and StringAnyBufferAggregator
, which only have special case handling for 0 length lists. I think this is probably 'better' behavior for mvds since it matches the behavior of scan queries which only show mvds as an array if the length is bigger than 1, should we modify the other two implementations to be consistent for handling the case where the mvd row only has a single value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made it consistent across all aggregator to return single value in case of array size 1
processing/src/main/java/org/apache/druid/query/aggregation/any/StringAnyVectorAggregator.java
Outdated
Show resolved
Hide resolved
processing/src/test/java/org/apache/druid/query/aggregation/AggregatorFactoryTest.java
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/query/aggregation/any/StringAnyBufferAggregator.java
Outdated
Show resolved
Hide resolved
...sing/src/test/java/org/apache/druid/query/aggregation/any/StringAnyBufferAggregatorTest.java
Show resolved
Hide resolved
List<Object> objectList = (List) object; | ||
foundValue = objectList.size() > 0 ? DimensionHandlerUtils.convertObjectToString(objectList.get(0)) : null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this go over all elements of the list? the vector implementation seems to be doing that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope, we check the flag aggregateMultipleValues and return all elements stringified
processing/src/main/java/org/apache/druid/query/aggregation/any/StringAnyVectorAggregator.java
Outdated
Show resolved
Hide resolved
}); | ||
foundValue = DimensionHandlerUtils.convertObjectToString(arrayList); | ||
} else { | ||
foundValue = multiValueSelector.lookupName(row.get(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the other impelementation called DimensionHandlerUtils.convertObjectToString
even on a single value - is that ok; shouldn't they be in sync?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DimensionHandlerUtils.convertObjectToString handles null value and here foundValue null is also handled.
@@ -13433,6 +13433,45 @@ public void testStringAggExpressionNonConstantSeparator() | |||
); | |||
} | |||
|
|||
@Test | |||
public void testStringAnyAggArgValidation() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there seems to be multiple testcases in this test; can you separate them?
.sql("SELECT ANY_VALUE(dim3, 1000, null) FROM foo") | ||
.queryContext(ImmutableMap.of()); | ||
|
||
DruidException e1 = assertThrows(DruidException.class, () -> qtb1.run()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from Query could not be planned. A possible reason is
it seem like it went fishing - but the error is [The third argument 'null:NULL' to function 'EXPR$0' is not a boolean]
which should be identified much earlier.
I believe the best way to fix this inconsistency would be to make sure this is checked during validation.
Another way around is to use @NotYetSupported
and leave it for later....
); | ||
break; | ||
case 3: | ||
maxStringBytes = getMaxStringBytes(rexNodes.get(1), aggregateCall, plannerContext); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: copy-paste - as the createAggregatorFactory
will be called anyway with a null
if its not specified; wouldn't it be better to just keep a variable and set its value when available?
plannerContext.setPlanningError( | ||
"The second argument '%s' to function '%s' is not a number", | ||
rexNodes.get(1), | ||
"The third argument '%s' to function '%s' is not a boolean", | ||
rexNodes.get(2), | ||
aggregateCall.getName() | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a bit too late... a check like this should happen during validation and not during aggregator production; a straight DruidException
might also be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the code as early validation is added
); | ||
return null; | ||
} | ||
Integer maxStringBytes = RexLiteral.intValue(rexNodes.get(1)); // added not null check at the function |
Check warning
Code scanning / CodeQL
Boxed variable is never null Warning
} | ||
if (object instanceof List) { | ||
List<Object> objectList = (List) object; | ||
if (objectList.size() == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can the size() == 0
? if yes that could be a problem
note: I wonder if ifs can be flipped; since when aggregateMultipleValues
is true
all branches lead to the same decision; could something like this work?
if( !aggregateMultipleValues && (object instanceof List) && ((List)object).size() > 1 ) {
return DimensionHandlerUtils.convertObjectToString(objectList.get(0));
} else {
DimensionHandlerUtils.convertObjectToString(object);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added size 0 check. It can be possible that we ingest empty arrays i guess.
if (objectList.size() == 1) { | ||
return DimensionHandlerUtils.convertObjectToString(objectList.get(0)); | ||
} | ||
if (aggregateMultipleValues) { | ||
return DimensionHandlerUtils.convertObjectToString(objectList); | ||
} | ||
return DimensionHandlerUtils.convertObjectToString(objectList.get(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now that I think about it, for the non-vectorized engine, string object selectors actually already unwrap the list if it is only a single element, so I don't think we might not actually need special handling for list size 1
https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/segment/DimensionSelector.java#L141 which is called by https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/segment/DimensionSelector.java#L128 which is what is used by the multi-value dimension selector when reading from segments https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/segment/column/StringUtf8DictionaryEncodedColumn.java#L229 (no other dimension selectors for mvds override this behavior)
sorry for the confusion, the value will always be a String
or null
in the case of size 1 or size 0 when using getObject().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@clintropolis Thank you Clint. Is it ok to leave this extra safety check in the code ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this was ... any (heh!) other aggregator i would say change it (and probably re-arrange the if) because aggregate is in a hot path, but since this one basically only calls aggregate once and then short-circuits the rest of the time... so its probably fine
Co-authored-by: Clint Wylie <[email protected]>
Updating the native docs for #15434
…15434) * Enabling aggregateMultipleValues in all StringAnyAggregators * Adding more tests * More validation * fix warning * updating asserts in decoupled mode * fix intellij inspection * Addressing comments * Addressing comments * Adding early validations and make aggregate consistent across all * fixing tests * fixing tests * Update docs/querying/sql-aggregations.md Co-authored-by: Clint Wylie <[email protected]> * fixing static check --------- Co-authored-by: Clint Wylie <[email protected]>
Updating the native docs for apache#15434
…15434) * Enabling aggregateMultipleValues in all StringAnyAggregators * Adding more tests * More validation * fix warning * updating asserts in decoupled mode * fix intellij inspection * Addressing comments * Addressing comments * Adding early validations and make aggregate consistent across all * fixing tests * fixing tests * Update docs/querying/sql-aggregations.md Co-authored-by: Clint Wylie <[email protected]> * fixing static check --------- Co-authored-by: Clint Wylie <[email protected]>
Updating the native docs for apache#15434
Description
In case of MVDs, we need to decide if need to aggregate multiple values or pick up first one from the list. This PR enables aggregateMultipleValues optional flag in all StringAnyAggregators. Also exposed the flag in ANY_VALUE sql function.
Nuked Mockitos from StringAnyAggregator Factory test.
Fixed the bug ...
StringAnyVectorAggregator has some bug, in case of mvds, it always used to pick up first value, and it was inconsistent behavior with other StringAny Aggregator. aggregateMultipleValues is set to true by default. If user need first encountered values from mvd fields then aggregateMultipleValues can be set to false.
Release note
Adding aggregateMultipleValues optional flag in ANY_VALUE and StringAny aggregators.
This PR has: