Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

expression virtual column indexes #15585

Merged

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented Dec 15, 2023

Description

This PR adds support for using underlying column indexes for ExpressionVirtualColumn which have a single input column, similar to what ExpressionFilter can do in the same scenario. It achieves this a bit differently however. ExpressionFilter can make the filter expression itself into a predicate and push it down as a DruidPredicateFactory to the underlying DruidPredicateIndexes, if the expression evaluates to true then the value matches, if not then it doesn't. ExpressionVirtualColumn however is making arbitrary values which can be matched by other filters such as equality or ranges, and so instead needs a mechanism to supply its own indexes. So, we expose an implementation of DruidPredicateIndexes that can accept any DruidPredicateFactory, and then use the DictionaryEncodedValueIndex which exposes low level access to the dictionary and value indexes of the underlying column so that we can scan the values of the dictionary to use as inputs to the expression and then feed the transformed values to the appropriate type of predicate from the DruidPredicateFactory based on the output type of the expression.

I decided to push this all the way down to Expr, using the predicate index implementation as the default, but also allow individual expressions to produce more optimized index suppliers. I have currently only wired this up to IdentifierExpr, which can simply delegate directly to the underlying ColumnIndexSupplier because it is a direct column access expression. This means that an expression virtual column which just has an identifier should now be able to produce equivalent performance during filtering as if the column was used directly.

This measures pretty well, using a regex expression as an example, which in the first example uses no indexes and the second uses the new expression predicate indexes showed a decent performance boost (the virtual column performance should be approximately the same to using an expression filter or extractionFn directly, which also use predicate indexes).

SELECT string4, COUNT(*) FROM foo WHERE REGEXP_EXTRACT(string1, '^1') IS NOT NULL OR REGEXP_EXTRACT('Z' || string2, '^Z2') IS NOT NULL GROUP BY 1

before:

Benchmark                        (query)  (rowsPerSegment)  (schema)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlExpressionBenchmark.querySql       40           5000000  explicit        false  avgt    5  279.224 ±  3.353  ms/op
SqlExpressionBenchmark.querySql       40           5000000      auto        false  avgt    5  274.911 ±  5.642  ms/op

after:

Benchmark                        (query)  (rowsPerSegment)  (schema)  (vectorize)  Mode  Cnt    Score   Error  Units
SqlExpressionBenchmark.querySql       40           5000000  explicit        false  avgt    5  194.301 ± 4.236  ms/op
SqlExpressionBenchmark.querySql       40           5000000      auto        false  avgt    5  187.447 ± 3.748  ms/op

This improvement allows us to make some changes to the SQL planner to prefer using ExpressionVIrtualColumn with other native filters over falling back to ExpressionFilter in many cases, since now there is no penalty to using a virtual column, which can result in more efficient plans, such as being able to collapse multiple filters on the same column, and also re-use expression computation.

For example, given a query like:

Screenshot 2023-12-20 at 10 24 20 AM

before this change would produce a rather inefficient set of expression filters that looked something like this, where effectively each separate expression filter would have to individually scan all of the underlying values in the dictionary to find the matches

Screenshot 2023-12-20 at 3 57 57 PM

but after some adjustments to take advantage of the improvements in this PR, will now plan to

Screenshot 2023-12-20 at 10 23 30 AM

because of the shared common virtual column expression, and only has to evaluate the indexes and compute the expression values a single time, and can be collapsed into an 'IN' filter because the planner can recognize the 'OR' of 'equals' pattern can be condensed into an 'IN'.


This PR has:

  • been self-reviewed.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@clintropolis clintropolis marked this pull request as ready for review December 20, 2023 23:33
@clintropolis
Copy link
Member Author

The only remaining coverage failures are on 'other tests' in 'sql-compat=false' mode because of some lines that never run in that mode, which i think we can ignore.

@clintropolis
Copy link
Member Author

clintropolis commented Dec 21, 2023

One thing I'm still thinking about is how to use the same strategy of #13977 and #15551 which allow automatically skipping using indexes if the cardinality of the underlying column is "too high" compared to the total number of rows to be scanned. The main problem is I haven't decided on the best way to expose the underlying ColumnConfig that controls these settings to the VirtualColumn index supplier method. I think its probably best to either make it available to ColumnSelectorColumnIndexSelector to pass in to the virtual columns, or available through any ColumnSelector which is already passed into the virtual columns. I don't think this needs to be a blocker for merging this PR though.

@clintropolis clintropolis removed the WIP label Dec 21, 2023
Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, a couple comments about readability. The new SQL plans are nice.

query = Druids.newTimeseriesQueryBuilder()
.dataSource(CalciteTests.DATASOURCE1)
.intervals(querySegmentSpec(Filtration.eternity()))
.virtualColumns(expressionVirtualColumn("v0", "substring(\"dim1\", 0, 1)", ColumnType.STRING))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could probably do a NullHandling.sqlCompatible() ? expressionVirtualColumn(...) : new VirtualColumn[0] to keep this as a single call to the query builder. Or, split the builder calls up so it's not fluent-style. Something like that might help readability. As it is, it takes some studying to tell the difference between the two branches.

@Override
public BitmapColumnIndex forPredicate(DruidPredicateFactory matcherFactory)
{
final Supplier<NonnullPair<List<ImmutableBitmap>, List<ImmutableBitmap>>> bitmapsSupplier;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do the two lists of bitmaps represent?

a private static class or some comments would help readability here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants