Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tuple sketch SQL support #13887

Conversation

frankgrimes97
Copy link
Contributor

This PR is a follow-up to #13819 so that the Tuple sketch functionality can be used in SQL for both ingestion using Multi-Stage Queries (MSQ) and also for analytic queries against Tuple sketch columns.

Release note

Add SQL functions for creating and operating on Tuple sketches


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.

docs/querying/sql-functions.md Outdated Show resolved Hide resolved
docs/querying/sql-functions.md Outdated Show resolved Hide resolved
docs/querying/sql-aggregations.md Outdated Show resolved Hide resolved
@anshu-makkar
Copy link
Contributor

@abhishekagarwal87 kindly check this

@frankgrimes97
Copy link
Contributor Author

Hmm... not sure if there's anything to be done about this failing CI check:

> spellcheck
[69](https://github.com/apache/druid/actions/runs/4358587476/jobs/7648241472#step:21:70)
> mdspell --en-us --ignore-numbers --report '../docs/**/*.md' || (./script/notify-spellcheck-issues && false)
[70](https://github.com/apache/druid/actions/runs/4358587476/jobs/7648241472#step:21:71)

[71](https://github.com/apache/druid/actions/runs/4358587476/jobs/7648241472#step:21:72)
    ../docs/querying/sql-functions.md
[72](https://github.com/apache/druid/actions/runs/4358587476/jobs/7648241472#step:21:73)
      139 | ## ARRAY_OF_DOUBLES_SKETCH 
[73](https://github.com/apache/druid/actions/runs/4358587476/jobs/7648241472#step:21:74)
      149 | ## ARRAY_OF_DOUBLES_SKETCH_METRICS_SUM_ESTIMATE

https://github.com/apache/druid/actions/runs/4358587476/jobs/7648241472

@frankgrimes97
Copy link
Contributor Author

@vtlim Hi, are there any further changes required before this PR can be accepted and merged? Is a review by more maintainers required? Thanks!

@vtlim
Copy link
Member

vtlim commented Mar 13, 2023

@frankgrimes97 Can you add the two new sketch aggregator names to the .spelling file? The tests catch the new names as a misspelling so we just need to add them to the dictionary. For example after L735 in https://github.com/apache/druid/blob/master/website/.spelling#L735. Add a line for ARRAY_OF_DOUBLES_SKETCH and one for ARRAY_OF_DOUBLES_SKETCH_METRICS_SUM_ESTIMATE

I only reviewed the docs side, so there will need to be a reviewer on the code side. cc @abhishekagarwal87 or @rohangarg

@frankgrimes97 frankgrimes97 requested a review from gianm March 16, 2023 17:50
@frankgrimes97
Copy link
Contributor Author

Hi, can this now be merged or are more changes required? Thanks!

Copy link
Contributor

@paul-rogers paul-rogers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool functionality! Left a number of comments on how best to integrate the functionality into SQL, given the way that SQL normally works.

Some APIs were recently deprecated as part of:
  apache#13904
  apache#13914

Some CI checks were warning about this so better to address it now instead
of needing to do so in a subsequent PR.
- Add missing INTERSECT/NOT/UNION descriptions
- Re-order new functions alphabetically in documentation
Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes generally look good to me. I'd consider changing the names to start with TUPLE_DOUBLES_ rather than ARRAY_OF_DOUBLES_SKETCH_. (Like TUPLE_DOUBLES_SKETCH, TUPLE_DOUBLES_NOT, etc.) It's a bit less typing and IMO clearer.

Curious what people think.

@frankgrimes97
Copy link
Contributor Author

The changes generally look good to me. I'd consider changing the names to start with TUPLE_DOUBLES_ rather than ARRAY_OF_DOUBLES_SKETCH_. (Like TUPLE_DOUBLES_SKETCH, TUPLE_DOUBLES_NOT, etc.) It's a bit less typing and IMO clearer.

Curious what people think.

One issue with changing the naming only for the SQL functions is that it would be inconsistent with the naming in the native Druid functions and Apache Data Sketches codebase/documentation and so might lead to confusion.
That issue aside, I do think the naming can be more terse without losing too much meaning.

It might be worth noting that the naming of the Data Sketch SQL functions doesn't seem entirely consistent across the board.
e.g. I see other sketch SQL functions use a DS prefix or suffix to designate them as Data Sketches:

  • APPROX_COUNT_DISTINCT_DS_HLL
  • APPROX_COUNT_DISTINCT_DS_THETA
  • APPROX_QUANTILE_DS
  • DS_CDF
  • DS_GET_QUANTILE
  • DS_GET_QUANTILES
  • DS_HISTOGRAM
  • DS_HLL
  • DS_QUANTILE_SUMMARY
  • DS_QUANTILES_SKETCH
  • DS_RANK
  • DS_THETA

We could perhaps consider the following:

  • DS_TUPLE_DOUBLES
  • DS_TUPLE_DOUBLES_INTERSECT
  • DS_TUPLE_DOUBLES_METRICS_SUM_ESTIMATE
  • DS_TUPLE_DOUBLES_NOT
  • DS_TUPLE_DOUBLES_UNION

@gianm
Copy link
Contributor

gianm commented Mar 21, 2023

It might be worth noting that the naming of the Data Sketch SQL functions doesn't seem entirely consistent across the board.
e.g. I see other sketch SQL functions use a DS prefix or suffix to designate them as Data Sketches:

That's a good point about DS being used in most of the others. I think the suggestions at the end of your post (DS_TUPLE_DOUBLES, etc) are a good balance between consistency, clarity, and terseness. They work for me.

ARRAY_OF_DOUBLES_SKETCH -> DS_TUPLE_DOUBLES
ARRAY_OF_DOUBLES_SKETCH_* -> DS_TUPLE_DOUBLES_*
@frankgrimes97 frankgrimes97 requested review from gianm and vtlim and removed request for gianm and vtlim March 21, 2023 21:37
@frankgrimes97
Copy link
Contributor Author

@gianm @vtlim I've renamed the SQL functions and updated the documentation accordingly.

docs/querying/sql-functions.md Show resolved Hide resolved
docs/querying/sql-aggregations.md Show resolved Hide resolved
docs/querying/sql-functions.md Outdated Show resolved Hide resolved
Copy link
Member

@vtlim vtlim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for docs

@frankgrimes97 frankgrimes97 requested a review from gianm March 24, 2023 18:10
Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after CI passes.

@frankgrimes97
Copy link
Contributor Author

frankgrimes97 commented Mar 28, 2023

LGTM after CI passes.

@gianm The failing checks seem unrelated to my changes.

@abhishekagarwal87
Copy link
Contributor

Indeed. I am merging this PR since failures are indeed unrelated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants