Push Dynamic Join Predicates into Scan ("Sideways Information Passing", etc) #7955
Comments
cc @sunchao @viirya and @kazuyukitanimura, to whom I mentioned this technique the other day |
I'm very curious about how this kind of graph is drawn😯 |
@acking-you I draw it by hand using https://monodraw.helftone.com/ (and it unfortunately takes me a long time).
@devinjdangelo has suggested that https://asciiflow.com/#/ works for making something quick in the browser.
@westonpace pointed out that if you're writing on GitHub you can use mermaid syntax in comments / issues / PRs and it will render automatically: https://github.blog/2022-02-14-include-diagrams-markdown-files-mermaid/ |
From: #9963
I wonder if |
If there is a secondary index on |
Interested in this one! |
Thanks @Lordworms -- I am hoping someone else can step up and help you with this. I just don't have time to help with a project to improve join performance at this time. |
I believe DuckDB just announced support for this feature in 1.1: https://duckdb.org/2024/09/09/announcing-duckdb-110.html#dynamic-filter-pushdown-from-joins |
Is your feature request related to a problem or challenge?
If we want to make DataFusion the engine of choice for fast OLAP processing, eventually we will need to make joins faster. In addition to making sure the join order is not disastrous (e.g. #7949), we can consider other advanced OLAP techniques to improve joins (especially in queries with multiple joins).
Describe the solution you'd like
I would like to propose we look into pushing "join predicates" into scans (which I know of as "sideways information passing").
As an example, consider the joins from TPCH Q17
The first join (should) look like this. The observation is that there are no predicates on the lineitem table (the big one), which means all the filtering happens in the join. This is bad because the scan can't do optimizations like "late materialization" and instead must decode all 60M values of the selected columns, even though very few (2044!) are actually used.
The idea is to push the join predicate into the scan by making something that acts like l_partkey IN (...) that can be applied during the scan.
In a query with a single selective join (one that filters out many values) the savings are likely minimal, as they depend on how much work can be saved in materialization (decoding). The only scan that does late materialization in DataFusion at the time of writing is the ParquetExec.
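As a rough illustration of the idea (not DataFusion's actual implementation), the sketch below collects the join keys from the small build side into a set and applies that set as an l_partkey IN (...) style filter while scanning. The column layout, `LineitemColumns`, `build_key_filter`, and `scan_with_dynamic_filter` are hypothetical names invented for this example; only the key column is inspected for every row, and the payload column is read only for rows that pass.

```rust
use std::collections::HashSet;

/// Hypothetical columnar layout for the large probe-side table.
pub struct LineitemColumns {
    pub l_partkey: Vec<i64>,
    pub l_quantity: Vec<f64>,
}

/// Build phase: collect the distinct join keys of the small,
/// already-filtered build side (e.g. the matching part keys).
pub fn build_key_filter(build_keys: &[i64]) -> HashSet<i64> {
    build_keys.iter().copied().collect()
}

/// Scan with the dynamic filter applied: the key column is checked
/// for every row, but the payload column is only read ("materialized")
/// for rows whose key is in the set.
pub fn scan_with_dynamic_filter(
    table: &LineitemColumns,
    keys: &HashSet<i64>,
) -> Vec<(i64, f64)> {
    table
        .l_partkey
        .iter()
        .enumerate()
        .filter(|(_, k)| keys.contains(*k))
        .map(|(i, k)| (*k, table.l_quantity[i]))
        .collect()
}
```

In a real Parquet scan the same set could in principle also be used to skip whole pages or row groups, which is where most of the decoding savings would come from; this sketch does not model that.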
However, in a query with multiple selective joins the savings become much more pronounced, because we can avoid creating intermediate join outputs that are filtered out by joins later in the plan.
For example:
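A toy model of the multi-join case (not the plan from this issue): a fact table is joined to two dimension key sets, and we count how many rows each operator emits with and without pushing both key sets into the scan. `FactRow`, `rows_without_pushdown`, and `rows_with_pushdown` are hypothetical names, and the model assumes each dimension key is unique, so a join never multiplies rows.

```rust
use std::collections::HashSet;

/// Hypothetical fact-table row with two foreign keys.
pub struct FactRow {
    pub a_key: i64,
    pub b_key: i64,
}

/// Rows emitted by (scan, join with A, join with B) when all
/// filtering happens inside the joins.
pub fn rows_without_pushdown(
    fact: &[FactRow],
    a_keys: &HashSet<i64>,
    b_keys: &HashSet<i64>,
) -> (usize, usize, usize) {
    let scanned = fact.len();
    let after_a: Vec<&FactRow> = fact
        .iter()
        .filter(|r| a_keys.contains(&r.a_key))
        .collect();
    let after_b = after_a
        .iter()
        .filter(|r| b_keys.contains(&r.b_key))
        .count();
    (scanned, after_a.len(), after_b)
}

/// Same counts when both key sets are applied during the scan, so
/// rows that a later join would discard never reach the first join.
pub fn rows_with_pushdown(
    fact: &[FactRow],
    a_keys: &HashSet<i64>,
    b_keys: &HashSet<i64>,
) -> (usize, usize, usize) {
    let scanned: Vec<&FactRow> = fact
        .iter()
        .filter(|r| a_keys.contains(&r.a_key) && b_keys.contains(&r.b_key))
        .collect();
    let after_a = scanned
        .iter()
        .filter(|r| a_keys.contains(&r.a_key))
        .count();
    let after_b = scanned
        .iter()
        .filter(|r| b_keys.contains(&r.b_key))
        .count();
    (scanned.len(), after_a, after_b)
}
```

The final row count is the same either way; what changes is how many rows the scan and the first join have to produce along the way.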
Describe alternatives you've considered
Some version of this technique is described in "Bloom Filter Joins" in Spark: https://issues.apache.org/jira/browse/SPARK-32268
Building a separate Bloom filter has the nice property that it can be distributed in a networked cluster; however, the overhead of creating the Bloom filter would likely be non-trivial.
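For reference, a minimal Bloom filter over join keys could look like the sketch below. This is an illustration only, not Spark's or DataFusion's implementation; `BloomFilter`, its sizing parameters, and the double-hashing scheme built on the standard library's `DefaultHasher` are assumptions made for the example. The filter can report false positives (which the join itself then removes) but never false negatives.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical fixed-size Bloom filter over i64 join keys.
pub struct BloomFilter {
    bits: Vec<bool>,
    num_hashes: u64,
}

impl BloomFilter {
    pub fn new(num_bits: usize, num_hashes: u64) -> Self {
        assert!(num_bits > 0 && num_hashes > 0);
        Self { bits: vec![false; num_bits], num_hashes }
    }

    /// Derive `num_hashes` bit positions from two base hashes
    /// (double hashing: position_i = h1 + i * h2 mod num_bits).
    fn positions(&self, key: i64) -> Vec<usize> {
        let mut first = DefaultHasher::new();
        key.hash(&mut first);
        let h1 = first.finish();

        let mut second = DefaultHasher::new();
        (key, 0x9E37_79B9_u32).hash(&mut second);
        let h2 = second.finish() | 1; // force odd so the step is never zero

        let m = self.bits.len() as u64;
        (0..self.num_hashes)
            .map(|i| (h1.wrapping_add(i.wrapping_mul(h2)) % m) as usize)
            .collect()
    }

    /// Build side: record a join key.
    pub fn insert(&mut self, key: i64) {
        for pos in self.positions(key) {
            self.bits[pos] = true;
        }
    }

    /// Probe side: `false` means the key is definitely absent;
    /// `true` means it may be present.
    pub fn might_contain(&self, key: i64) -> bool {
        self.positions(key).into_iter().all(|pos| self.bits[pos])
    }
}
```

Compared with an exact key set, the filter has a fixed size regardless of how many keys are inserted, which is what makes it cheap to ship between nodes; the cost is the extra hashing work on both the build and probe sides.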
Additional context
See a description of how DataFusion HashJoins work here: #7953
Here is an industrial paper that describes experience with using SIPS techniques: https://15721.courses.cs.cmu.edu/spring2020/papers/13-execution/shrinivas-icde2013.pdf