Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enabling aggregateMultipleValues in all StringAnyAggregators #15434

Merged
merged 13 commits into from
Nov 29, 2023

Conversation

pranavbhole
Copy link
Contributor

@pranavbhole pranavbhole commented Nov 27, 2023

Description

In case of MVDs, we need to decide if need to aggregate multiple values or pick up first one from the list. This PR enables aggregateMultipleValues optional flag in all StringAnyAggregators. Also exposed the flag in ANY_VALUE sql function.
Nuked Mockitos from StringAnyAggregator Factory test.

Fixed the bug ...

StringAnyVectorAggregator has some bug, in case of mvds, it always used to pick up first value, and it was inconsistent behavior with other StringAny Aggregator. aggregateMultipleValues is set to true by default. If user need first encountered values from mvd fields then aggregateMultipleValues can be set to false.

Release note

Adding aggregateMultipleValues optional flag in ANY_VALUE and StringAny aggregators.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

public class StringAnyAggregator implements Aggregator
{
private final BaseObjectColumnValueSelector valueSelector;
private final int maxStringBytes;
private boolean isFound;
private String foundValue;
final boolean aggregateMultipleValues;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: private

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -49,12 +49,14 @@ public class StringAnyAggregatorFactory extends AggregatorFactory
private final String fieldName;
private final String name;
protected final int maxStringBytes;
protected final boolean aggregateMultipleValues;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why protected? (same applies to maxStringBytes too i think...)

if (row.size() > 1) {
List<String> arrayList = new ArrayList<>();
row.forEach(rowIndex -> {
String chopped = StringUtils.fastLooseChop(multiValueSelector.lookupName(rowIndex), maxStringBytes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think you need to call fastLooseChop here because putValue uses StringUtils.toUtf8WithLimit which achieves the same thing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed it. Applied to arrays initially to avoid partially chopped result like [first, seco. applying chop at last leaves the result in weird open ended array. I think it might be expected as user can change it with increasing max bytes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's fine i think, i don't really like the silent truncation at all personally; i think we should probably change first/last/any to all optionally fail instead of silently truncating results, but that doesn't need to changed now.

Comment on lines +124 to +126
columnSelectorFactory.moveSelectorCursorToNext();
columnSelectorFactory.moveSelectorCursorToNext();
res.aggregate();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does it increment cursor twice before aggregating?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yes, I should have probably mention it in comment, Dummy TestColumnSelectorFactory Selector that I wrote at the end of class has MVD present on 3rd place, this moved cursor twice.

Comment on lines 165 to 166
case COMPLEX:
return new StringAnyAggregatorFactory(name, fieldName, maxStringBytes, aggregateMultipleValues);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would complex map to string? I think this means any type of complex would use the string aggregator which doesn't seem correct to me

Comment on lines 308 to 311
private Integer getMaxStringBytes(
final List<RexNode> rexNodes,
final AggregateCall aggregateCall,
final PlannerContext plannerContext
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method seems strange to pass List<RexNode>, and only uses rexNodes.get(1) from it, please pass the RexNode to check directly

@pranavbhole
Copy link
Contributor Author

Addressed all comments.

.sql("SELECT ANY_VALUE(dim3, 1000, null) FROM foo")
.queryContext(ImmutableMap.of());

DruidException e1 = assertThrows(DruidException.class, () -> qtb1.run());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kgyrtkirk Looks like we are getting unclear error message in decoupled mode than coupled mode. I had to had condition for asserts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from Query could not be planned. A possible reason is
it seem like it went fishing - but the error is [The third argument 'null:NULL' to function 'EXPR$0' is not a boolean] which should be identified much earlier.

I believe the best way to fix this inconsistency would be to make sure this is checked during validation.

Another way around is to use @NotYetSupported and leave it for later....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added early validation in operandChecker class and removed the number/boolean checks.

Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm

});
foundValue = DimensionHandlerUtils.convertObjectToString(arrayList);
} else {
foundValue = multiValueSelector.lookupName(row.get(0));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't seem to exactly match the behavior of the StringAnyAggregator and StringAnyBufferAggregator, which only have special case handling for 0 length lists. I think this is probably 'better' behavior for mvds since it matches the behavior of scan queries which only show mvds as an array if the length is bigger than 1, should we modify the other two implementations to be consistent for handling the case where the mvd row only has a single value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made it consistent across all aggregator to return single value in case of array size 1

docs/querying/sql-functions.md Outdated Show resolved Hide resolved
Comment on lines 60 to 61
List<Object> objectList = (List) object;
foundValue = objectList.size() > 0 ? DimensionHandlerUtils.convertObjectToString(objectList.get(0)) : null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this go over all elements of the list? the vector implementation seems to be doing that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, we check the flag aggregateMultipleValues and return all elements stringified

});
foundValue = DimensionHandlerUtils.convertObjectToString(arrayList);
} else {
foundValue = multiValueSelector.lookupName(row.get(0));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the other impelementation called DimensionHandlerUtils.convertObjectToString even on a single value - is that ok; shouldn't they be in sync?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DimensionHandlerUtils.convertObjectToString handles null value and here foundValue null is also handled.

@@ -13433,6 +13433,45 @@ public void testStringAggExpressionNonConstantSeparator()
);
}

@Test
public void testStringAnyAggArgValidation()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there seems to be multiple testcases in this test; can you separate them?

.sql("SELECT ANY_VALUE(dim3, 1000, null) FROM foo")
.queryContext(ImmutableMap.of());

DruidException e1 = assertThrows(DruidException.class, () -> qtb1.run());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from Query could not be planned. A possible reason is
it seem like it went fishing - but the error is [The third argument 'null:NULL' to function 'EXPR$0' is not a boolean] which should be identified much earlier.

I believe the best way to fix this inconsistency would be to make sure this is checked during validation.

Another way around is to use @NotYetSupported and leave it for later....

);
break;
case 3:
maxStringBytes = getMaxStringBytes(rexNodes.get(1), aggregateCall, plannerContext);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: copy-paste - as the createAggregatorFactory will be called anyway with a null if its not specified; wouldn't it be better to just keep a variable and set its value when available?

Comment on lines 278 to 282
plannerContext.setPlanningError(
"The second argument '%s' to function '%s' is not a number",
rexNodes.get(1),
"The third argument '%s' to function '%s' is not a boolean",
rexNodes.get(2),
aggregateCall.getName()
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bit too late... a check like this should happen during validation and not during aggregator production; a straight DruidException might also be better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the code as early validation is added

);
return null;
}
Integer maxStringBytes = RexLiteral.intValue(rexNodes.get(1)); // added not null check at the function

Check warning

Code scanning / CodeQL

Boxed variable is never null Warning

The variable 'maxStringBytes' is only assigned values of primitive type and is never 'null', but it is declared with the boxed type 'Integer'.
}
if (object instanceof List) {
List<Object> objectList = (List) object;
if (objectList.size() == 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can the size() == 0 ? if yes that could be a problem
note: I wonder if ifs can be flipped; since when aggregateMultipleValues is true all branches lead to the same decision; could something like this work?

if( !aggregateMultipleValues &&  (object instanceof List) && ((List)object).size() > 1 ) {
   return DimensionHandlerUtils.convertObjectToString(objectList.get(0));
} else {
  DimensionHandlerUtils.convertObjectToString(object);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added size 0 check. It can be possible that we ingest empty arrays i guess.

docs/querying/sql-aggregations.md Outdated Show resolved Hide resolved
Comment on lines +80 to +86
if (objectList.size() == 1) {
return DimensionHandlerUtils.convertObjectToString(objectList.get(0));
}
if (aggregateMultipleValues) {
return DimensionHandlerUtils.convertObjectToString(objectList);
}
return DimensionHandlerUtils.convertObjectToString(objectList.get(0));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now that I think about it, for the non-vectorized engine, string object selectors actually already unwrap the list if it is only a single element, so I don't think we might not actually need special handling for list size 1
https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/segment/DimensionSelector.java#L141 which is called by https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/segment/DimensionSelector.java#L128 which is what is used by the multi-value dimension selector when reading from segments https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/segment/column/StringUtf8DictionaryEncodedColumn.java#L229 (no other dimension selectors for mvds override this behavior)

sorry for the confusion, the value will always be a String or null in the case of size 1 or size 0 when using getObject().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clintropolis Thank you Clint. Is it ok to leave this extra safety check in the code ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this was ... any (heh!) other aggregator i would say change it (and probably re-arrange the if) because aggregate is in a hot path, but since this one basically only calls aggregate once and then short-circuits the rest of the time... so its probably fine

@soumyava soumyava merged commit 93cd638 into apache:master Nov 29, 2023
83 checks passed
@pranavbhole pranavbhole mentioned this pull request Nov 29, 2023
10 tasks
@pranavbhole pranavbhole deleted the aggregateMultipleValues branch November 29, 2023 22:59
abhishekagarwal87 pushed a commit that referenced this pull request Nov 30, 2023
Updating the native docs for #15434
yashdeep97 pushed a commit to yashdeep97/druid that referenced this pull request Dec 1, 2023
…15434)

* Enabling aggregateMultipleValues in all StringAnyAggregators

* Adding more tests

* More validation

* fix warning

* updating asserts in decoupled mode

* fix intellij inspection

* Addressing comments

* Addressing comments

* Adding early validations and make aggregate consistent across all

* fixing tests

* fixing tests

* Update docs/querying/sql-aggregations.md

Co-authored-by: Clint Wylie <[email protected]>

* fixing static check

---------

Co-authored-by: Clint Wylie <[email protected]>
yashdeep97 pushed a commit to yashdeep97/druid that referenced this pull request Dec 1, 2023
Updating the native docs for apache#15434
yashdeep97 pushed a commit to yashdeep97/druid that referenced this pull request Dec 1, 2023
…15434)

* Enabling aggregateMultipleValues in all StringAnyAggregators

* Adding more tests

* More validation

* fix warning

* updating asserts in decoupled mode

* fix intellij inspection

* Addressing comments

* Addressing comments

* Adding early validations and make aggregate consistent across all

* fixing tests

* fixing tests

* Update docs/querying/sql-aggregations.md

Co-authored-by: Clint Wylie <[email protected]>

* fixing static check

---------

Co-authored-by: Clint Wylie <[email protected]>
yashdeep97 pushed a commit to yashdeep97/druid that referenced this pull request Dec 1, 2023
Updating the native docs for apache#15434
@LakshSingla LakshSingla added this to the 29.0.0 milestone Jan 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants