expression virtual column indexes #15585

clintropolis · 2023-12-15T23:31:28Z

Description

This PR adds support for using underlying column indexes for ExpressionVirtualColumn which have a single input column, similar to what ExpressionFilter can do in the same scenario. It achieves this a bit differently however. ExpressionFilter can make the filter expression itself into a predicate and push it down as a DruidPredicateFactory to the underlying DruidPredicateIndexes, if the expression evaluates to true then the value matches, if not then it doesn't. ExpressionVirtualColumn however is making arbitrary values which can be matched by other filters such as equality or ranges, and so instead needs a mechanism to supply its own indexes. So, we expose an implementation of DruidPredicateIndexes that can accept any DruidPredicateFactory, and then use the DictionaryEncodedValueIndex which exposes low level access to the dictionary and value indexes of the underlying column so that we can scan the values of the dictionary to use as inputs to the expression and then feed the transformed values to the appropriate type of predicate from the DruidPredicateFactory based on the output type of the expression.

I decided to push this all the way down to Expr, using the predicate index implementation as the default, but also allow individual expressions to produce more optimized index suppliers. I have currently only wired this up to IdentifierExpr, which can simply delegate directly to the underlying ColumnIndexSupplier because it is a direct column access expression. This means that an expression virtual column which just has an identifier should now be able to produce equivalent performance during filtering as if the column was used directly.

This measures pretty well, using a regex expression as an example, which in the first example uses no indexes and the second uses the new expression predicate indexes showed a decent performance boost (the virtual column performance should be approximately the same to using an expression filter or extractionFn directly, which also use predicate indexes).

SELECT string4, COUNT(*) FROM foo WHERE REGEXP_EXTRACT(string1, '^1') IS NOT NULL OR REGEXP_EXTRACT('Z' || string2, '^Z2') IS NOT NULL GROUP BY 1

before:

Benchmark                        (query)  (rowsPerSegment)  (schema)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlExpressionBenchmark.querySql       40           5000000  explicit        false  avgt    5  279.224 ±  3.353  ms/op
SqlExpressionBenchmark.querySql       40           5000000      auto        false  avgt    5  274.911 ±  5.642  ms/op

after:

Benchmark                        (query)  (rowsPerSegment)  (schema)  (vectorize)  Mode  Cnt    Score   Error  Units
SqlExpressionBenchmark.querySql       40           5000000  explicit        false  avgt    5  194.301 ± 4.236  ms/op
SqlExpressionBenchmark.querySql       40           5000000      auto        false  avgt    5  187.447 ± 3.748  ms/op

This improvement allows us to make some changes to the SQL planner to prefer using ExpressionVIrtualColumn with other native filters over falling back to ExpressionFilter in many cases, since now there is no penalty to using a virtual column, which can result in more efficient plans, such as being able to collapse multiple filters on the same column, and also re-use expression computation.

For example, given a query like:

before this change would produce a rather inefficient set of expression filters that looked something like this, where effectively each separate expression filter would have to individually scan all of the underlying values in the dictionary to find the matches

but after some adjustments to take advantage of the improvements in this PR, will now plan to

because of the shared common virtual column expression, and only has to evaluate the indexes and compute the expression values a single time, and can be collapsed into an 'IN' filter because the planner can recognize the 'OR' of 'equals' pattern can be condensed into an 'IN'.

This PR has:

been self-reviewed.
a release note entry in the PR description.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
been tested in a test Druid cluster.

…l-column-indexes

sql/src/main/java/org/apache/druid/sql/calcite/expression/Expressions.java

clintropolis · 2023-12-21T00:08:26Z

The only remaining coverage failures are on 'other tests' in 'sql-compat=false' mode because of some lines that never run in that mode, which i think we can ignore.

clintropolis · 2023-12-21T00:12:52Z

One thing I'm still thinking about is how to use the same strategy of #13977 and #15551 which allow automatically skipping using indexes if the cardinality of the underlying column is "too high" compared to the total number of rows to be scanned. The main problem is I haven't decided on the best way to expose the underlying ColumnConfig that controls these settings to the VirtualColumn index supplier method. I think its probably best to either make it available to ColumnSelectorColumnIndexSelector to pass in to the virtual columns, or available through any ColumnSelector which is already passed into the virtual columns. I don't think this needs to be a blocker for merging this PR though.

…l-column-indexes

gianm

LGTM, a couple comments about readability. The new SQL plans are nice.

gianm · 2024-01-03T03:52:58Z

sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java

+      query = Druids.newTimeseriesQueryBuilder()
+                    .dataSource(CalciteTests.DATASOURCE1)
+                    .intervals(querySegmentSpec(Filtration.eternity()))
+                    .virtualColumns(expressionVirtualColumn("v0", "substring(\"dim1\", 0, 1)", ColumnType.STRING))


Could probably do a NullHandling.sqlCompatible() ? expressionVirtualColumn(...) : new VirtualColumn[0] to keep this as a single call to the query builder. Or, split the builder calls up so it's not fluent-style. Something like that might help readability. As it is, it takes some studying to tell the difference between the two branches.

gianm · 2024-01-03T04:15:03Z

processing/src/main/java/org/apache/druid/math/expr/ExprPredicateIndexSupplier.java

+    @Override
+    public BitmapColumnIndex forPredicate(DruidPredicateFactory matcherFactory)
+    {
+      final Supplier<NonnullPair<List<ImmutableBitmap>, List<ImmutableBitmap>>> bitmapsSupplier;


what do the two lists of bitmaps represent?

a private static class or some comments would help readability here

…l-column-indexes

clintropolis added 2 commits December 14, 2023 22:37

ExpressionVirtualColumn + indexes = bff

02faaa6

fix

19d6a87

clintropolis added Performance Area - Querying WIP labels Dec 15, 2023

github-actions bot added the Area - Segment Format and Ser/De label Dec 15, 2023

clintropolis added 3 commits December 15, 2023 21:08

use predicate based on expr output type not column input type

30c6069

use virtual columns for equality and range

af85f0e

Merge remote-tracking branch 'upstream/master' into expression-virtua…

ed5a78d

…l-column-indexes

github-advanced-security bot found potential problems Dec 20, 2023

View reviewed changes

sql/src/main/java/org/apache/druid/sql/calcite/expression/Expressions.java Fixed Show fixed Hide fixed

clintropolis added 2 commits December 20, 2023 12:18

fixes

1468ba1

fix tests

0607f01

clintropolis marked this pull request as ready for review December 20, 2023 23:33

clintropolis removed the WIP label Dec 21, 2023

Merge remote-tracking branch 'upstream/master' into expression-virtua…

748dfbd

…l-column-indexes

gianm approved these changes Jan 3, 2024

View reviewed changes

clintropolis added 4 commits January 3, 2024 12:43

Merge remote-tracking branch 'upstream/master' into expression-virtua…

0247fcb

…l-column-indexes

cleanup

7661822

Merge remote-tracking branch 'upstream/master' into expression-virtua…

2b9dd01

…l-column-indexes

style bot is wrong and a liar

a36df25

clintropolis merged commit f19ece1 into apache:master Jan 4, 2024
82 of 83 checks passed

clintropolis deleted the expression-virtual-column-indexes branch January 4, 2024 05:00

clintropolis mentioned this pull request Jan 6, 2024

fix bugs with expression virtual column indexes for expression virtual columns which refer to other virtual columns #15633

Merged

10 tasks

LakshSingla added this to the 29.0.0 milestone Jan 29, 2024

LakshSingla mentioned this pull request Feb 13, 2024

[DRAFT] 29.0.0 release notes #15896

Closed

clintropolis mentioned this pull request Feb 23, 2024

fix bug with expression virtual column indexes on missing columns for expressions that turn null values into not null values #15959

Merged

4 tasks

clintropolis mentioned this pull request Jun 8, 2024

fix expression column capabilities to not report dictionary encoded unless input is string #16577

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

expression virtual column indexes #15585

expression virtual column indexes #15585

clintropolis commented Dec 15, 2023 •

edited

Loading

clintropolis commented Dec 21, 2023

clintropolis commented Dec 21, 2023 •

edited

Loading

gianm left a comment

gianm Jan 3, 2024

gianm Jan 3, 2024

expression virtual column indexes #15585

expression virtual column indexes #15585

Conversation

clintropolis commented Dec 15, 2023 • edited Loading

Description

clintropolis commented Dec 21, 2023

clintropolis commented Dec 21, 2023 • edited Loading

gianm left a comment

Choose a reason for hiding this comment

gianm Jan 3, 2024

Choose a reason for hiding this comment

gianm Jan 3, 2024

Choose a reason for hiding this comment

clintropolis commented Dec 15, 2023 •

edited

Loading

clintropolis commented Dec 21, 2023 •

edited

Loading