expression virtual column indexes #15585
The only remaining coverage failures are on 'other tests' in 'sql-compat=false' mode because of some lines that never run in that mode, which I think we can ignore.
One thing I'm still thinking about is how to reuse the strategy of #13977 and #15551, which allows automatically skipping indexes if the cardinality of the underlying column is "too high" compared to the total number of rows to be scanned. The main problem is I haven't decided on the best way to expose the underlying
LGTM, a couple comments about readability. The new SQL plans are nice.
query = Druids.newTimeseriesQueryBuilder()
    .dataSource(CalciteTests.DATASOURCE1)
    .intervals(querySegmentSpec(Filtration.eternity()))
    .virtualColumns(expressionVirtualColumn("v0", "substring(\"dim1\", 0, 1)", ColumnType.STRING))
Could probably do a `NullHandling.sqlCompatible() ? expressionVirtualColumn(...) : new VirtualColumn[0]` to keep this as a single call to the query builder. Or, split the builder calls up so it's not fluent-style. Something like that might help readability. As it is, it takes some studying to tell the difference between the two branches.
@Override
public BitmapColumnIndex forPredicate(DruidPredicateFactory matcherFactory)
{
  final Supplier<NonnullPair<List<ImmutableBitmap>, List<ImmutableBitmap>>> bitmapsSupplier;
What do the two lists of bitmaps represent? A private static class or some comments would help readability here.
Description
This PR adds support for using underlying column indexes for `ExpressionVirtualColumn` which have a single input column, similar to what `ExpressionFilter` can do in the same scenario. It achieves this a bit differently, however. `ExpressionFilter` can make the filter expression itself into a predicate and push it down as a `DruidPredicateFactory` to the underlying `DruidPredicateIndexes`: if the expression evaluates to true then the value matches, and if not then it doesn't. `ExpressionVirtualColumn`, however, produces arbitrary values which can be matched by other filters such as equality or ranges, and so instead needs a mechanism to supply its own indexes. So, we expose an implementation of `DruidPredicateIndexes` that can accept any `DruidPredicateFactory`, and then use the `DictionaryEncodedValueIndex`, which exposes low-level access to the dictionary and value indexes of the underlying column, so that we can scan the values of the dictionary to use as inputs to the expression and then feed the transformed values to the appropriate type of predicate from the `DruidPredicateFactory` based on the output type of the expression.

I decided to push this all the way down to `Expr`, using the predicate index implementation as the default, but also allowing individual expressions to produce more optimized index suppliers. I have currently only wired this up to `IdentifierExpr`, which can simply delegate directly to the underlying `ColumnIndexSupplier` because it is a direct column access expression. This means that an expression virtual column which just has an identifier should now be able to produce equivalent performance during filtering as if the column was used directly.

This measures pretty well. Using a regex expression as an example, where the first run uses no indexes and the second uses the new expression predicate indexes, showed a decent performance boost (the virtual column performance should be approximately the same as using an expression filter or extractionFn directly, which also use predicate indexes).
before:
after:
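Separately from the measurements, the dictionary-scanning approach described in this description can be sketched in plain Java. This is only an illustration under simplifying assumptions, not the code in this PR: `ExpressionIndexSketch`, `forPredicate`, and `rows` are hypothetical names, `java.util.BitSet` stands in for Druid's bitmaps, a `Function` stands in for the expression, and a `Predicate` stands in for whatever the `DruidPredicateFactory` would supply.

```java
import java.util.BitSet;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/**
 * Simplified stand-in for a dictionary-encoded string column.
 * Not Druid code: every name here is hypothetical and for illustration only.
 */
final class ExpressionIndexSketch
{
  /**
   * Scans every dictionary entry, applies the expression to it, tests the
   * transformed value with the predicate, and unions the row bitmaps of the
   * entries that match.
   *
   * @param dictionary   distinct values of the underlying column
   * @param valueBitmaps valueBitmaps.get(i) marks the rows holding dictionary.get(i)
   */
  static BitSet forPredicate(
      List<String> dictionary,
      List<BitSet> valueBitmaps,
      Function<String, String> expression,
      Predicate<String> predicate
  )
  {
    final BitSet matches = new BitSet();
    for (int id = 0; id < dictionary.size(); id++) {
      // The expression runs once per distinct value, not once per row.
      final String transformed = expression.apply(dictionary.get(id));
      if (predicate.test(transformed)) {
        matches.or(valueBitmaps.get(id));
      }
    }
    return matches;
  }

  /** Builds a bitmap with the given row ids set. */
  static BitSet rows(int... rowIds)
  {
    final BitSet bits = new BitSet();
    for (int rowId : rowIds) {
      bits.set(rowId);
    }
    return bits;
  }

  public static void main(String[] args)
  {
    // Rows: 0="apple", 1="avocado", 2="banana", 3="apple", 4="cherry"
    final List<String> dictionary = List.of("apple", "avocado", "banana", "cherry");
    final List<BitSet> bitmaps = List.of(rows(0, 3), rows(1), rows(2), rows(4));

    // Virtual column: first character of the value; filter: v0 = 'a'
    final BitSet result = forPredicate(dictionary, bitmaps, s -> s.substring(0, 1), "a"::equals);
    System.out.println(result); // prints {0, 1, 3}
  }
}
```

The point of the sketch is that the work is proportional to the dictionary size rather than the row count, which is why the approach only pays off when the underlying column has an index to scan.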
This improvement allows us to make some changes to the SQL planner to prefer using `ExpressionVirtualColumn` with other native filters over falling back to `ExpressionFilter` in many cases, since now there is no penalty to using a virtual column. This can result in more efficient plans, such as being able to collapse multiple filters on the same column, and also re-use expression computation.

For example, given a query like:
before this change would produce a rather inefficient set of expression filters that looked something like this, where effectively each separate expression filter would have to individually scan all of the underlying values in the dictionary to find the matches:
but after some adjustments to take advantage of the improvements in this PR, will now plan to:
because of the shared common virtual column expression. The new plan only has to evaluate the indexes and compute the expression values a single time, and the filters can be collapsed into an 'IN' filter because the planner recognizes that an 'OR' of 'equals' filters on the same column can be condensed into an 'IN'.
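To make the difference in work concrete, here is a rough Java sketch comparing the two plan shapes. It is an assumption-laden toy, not planner or index code from this PR: `InCollapseSketch`, `orOfEquals`, `in`, and the `evaluations` counter are hypothetical, `java.util.BitSet` stands in for Druid's bitmaps, and the "expression" is hard-coded as taking the first character of the value.

```java
import java.util.BitSet;
import java.util.List;
import java.util.Set;

/** Toy comparison of OR-of-equals expression filters vs. one IN filter. Hypothetical names throughout. */
final class InCollapseSketch
{
  /** Counts how many times the expression is evaluated. */
  static int evaluations = 0;

  /** Stand-in for the shared expression: first character of the value. */
  static String firstChar(String value)
  {
    evaluations++;
    return value.substring(0, 1);
  }

  /** Models separate expression filters joined by OR: one dictionary scan per target. */
  static BitSet orOfEquals(List<String> dictionary, List<BitSet> bitmaps, List<String> targets)
  {
    final BitSet result = new BitSet();
    for (String target : targets) {
      for (int id = 0; id < dictionary.size(); id++) {
        if (target.equals(firstChar(dictionary.get(id)))) {
          result.or(bitmaps.get(id));
        }
      }
    }
    return result;
  }

  /** Models one shared virtual column matched by an IN filter: a single dictionary scan. */
  static BitSet in(List<String> dictionary, List<BitSet> bitmaps, Set<String> targets)
  {
    final BitSet result = new BitSet();
    for (int id = 0; id < dictionary.size(); id++) {
      if (targets.contains(firstChar(dictionary.get(id)))) {
        result.or(bitmaps.get(id));
      }
    }
    return result;
  }

  /** Builds a bitmap with the given row ids set. */
  static BitSet rows(int... rowIds)
  {
    final BitSet bits = new BitSet();
    for (int rowId : rowIds) {
      bits.set(rowId);
    }
    return bits;
  }

  public static void main(String[] args)
  {
    // Rows: 0="apple", 1="avocado", 2="banana", 3="apple", 4="cherry"
    final List<String> dictionary = List.of("apple", "avocado", "banana", "cherry");
    final List<BitSet> bitmaps = List.of(rows(0, 3), rows(1), rows(2), rows(4));

    evaluations = 0;
    final BitSet viaOr = orOfEquals(dictionary, bitmaps, List.of("a", "b"));
    final int orEvaluations = evaluations;

    evaluations = 0;
    final BitSet viaIn = in(dictionary, bitmaps, Set.of("a", "b"));
    final int inEvaluations = evaluations;

    System.out.println(viaOr.equals(viaIn)); // prints true
    System.out.println(orEvaluations + " vs " + inEvaluations); // prints 8 vs 4
  }
}
```

Both shapes select the same rows, but the number of expression evaluations grows with the number of OR'ed filters in the first shape and stays at one pass over the dictionary in the second.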
This PR has: