Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad Join Order for TPCH Q18 results in slow performance #7950

Closed
alamb opened this issue Oct 27, 2023 · 1 comment · Fixed by #8126
Closed

Bad Join Order for TPCH Q18 results in slow performance #7950

alamb opened this issue Oct 27, 2023 · 1 comment · Fixed by #8126
Assignees
Labels
bug Something isn't working

Comments

@alamb
Copy link
Contributor

alamb commented Oct 27, 2023

Describe the bug

The join order chosen for TPCH query 17 is bad making datafusion take much longer to execute the query

To Reproduce

Create Data:

cd arrow-datafusion/benchmarks
./bench.sh  data tpch10

Run query with datafusion-cli:

cd arrow-datafusion/benchmarks/data/tpch_sf10
datafusion-cli -c "select	c_name,	c_custkey,	o_orderkey,	o_orderdate,	o_totalprice,	sum(l_quantity)	from	customer,	orders,	lineitem	where	o_orderkey in (	select	l_orderkey	from	lineitem	group by	l_orderkey having	sum(l_quantity) > 300	)	and c_custkey = o_custkey	and o_orderkey = l_orderkey	group by	c_name,	c_custkey,	o_orderkey,	o_orderdate,	o_totalprice	order by	o_totalprice desc,	o_orderdate;"

Here is the query:

select
c_name,
c_custkey,
o_orderkey,
o_orderdate,
o_totalprice,
sum(l_quantity)
from
customer,
orders,
lineitem
where	o_orderkey in (
  select l_orderkey
  from	lineitem
  group by l_orderkey
  having	sum(l_quantity) > 300
  )
and c_custkey = o_custkey
and o_orderkey = l_orderkey
group by c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
order by o_totalprice desc, o_orderdate;

Expected behavior

The input order of HashJoin(1) in the below plan should be swapped

         DataFusion TPCH SF10 Q18 Plan                                                                    
        annotated with output row counts                                                                  
                                                                                                          
                                                         1 Row             This join is building a 60M row
                                                                           hash table and probing 624(!!) 
                                                  ┌────────────────┐        rows, rather than building a  
                                                  │   Aggregate    │       624 row hash table and probing 
                                                  │                │               a 60M row table        
                                                  │                │                                      
                                                  └────────────────┘                                      
                                                           │                              │               
                                                           ▼    2M Rows                                   
                                                  ┌────────────────┐                      │               
                                                  │  HashJoin (1)  │                                      
                                                  │   o_orderkey = │◀ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘               
                                                  │   l_orderkey   │                                      
                                                  └────┬──────┬────┘                                      
                                                       │      │         624 Rows                          
                                        ┌──────────────┘      └────────────────────────┐                  
                          60M Rows      │                                              │                  
                                        │                                              ▼                  
                                        ▼                            2M Rows  ┌────────────────┐          
                               ┌────────────────┐                             │     Filter     │          
                               │  HashJoin (2)  │                             │                │          
                               │   o_orderkey = │                             │                │          
                               │   l_orderkey   │                             └────────────────┘          
                               └───┬────────┬───┘     60M Rows                         │                  
                                   │        │                                          │                  
              15M Rows    ┌────────┘        └───────────────┐                          │  15M Rows        
                          │                                 │                          │                  
                          │                                 │                          │                  
                          ▼                                 ▼                          ▼                  
                 ┌────────────────┐               ┌──────────────────┐        ┌────────────────┐          
                 │  HashJoin (3)  │               │Scan: lineitem    │        │   Aggregate    │          
                 │   c_custkey =  │               │(no filters)      │        │                │          
                 │   o_custkey    │               │                  │        │                │          
                 └───┬────────┬───┘               └──────────────────┘        └────────────────┘          
  1.5M Rows          │        │       15M Rows                                         │                  
           ┌─────────┘        └──────────┐                                             │   60M Rows       
           │                             │                                             │                  
           ▼                             ▼                                             ▼                  
 ┌──────────────────┐          ┌──────────────────┐                          ┌──────────────────┐         
 │Scan: customer    │          │Scan: orders      │                          │Scan: lineitem    │         
 │(no predicates)   │          │(no predicates)   │                          │                  │         
 │                  │          │                  │                          │                  │         
 └──────────────────┘          └──────────────────┘                          └──────────────────┘         

Additional context

No response

@alamb alamb added the bug Something isn't working label Oct 27, 2023
@alamb
Copy link
Contributor Author

alamb commented Oct 27, 2023

This is likely one of the root causes of #5646

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants