
Improve performance/memory usage of HashJoin datastructure (5-15% improvement on selected TPC-H queries) #6679

Merged · 16 commits merged into main from adapt_datastructure · Jun 19, 2023

Conversation

Dandandan
Contributor

@Dandandan Dandandan commented Jun 15, 2023

Which issue does this PR close?

Closes #6700

Benchmark results (TPC-H SF=1 in memory, average of 20 runs): q5, q7, q17, q18, and q21 show some improvement; most queries do not stress the join, or this code path of the join (building the hash map), very much.

query     main     PR
5        58.74    54.98
7       104.14    92.41
17      332.83   320.09
18      244.41   218.97
21      152.79   132.93

Using the bench script (tpch_mem):

┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     main ┃ adapt_datastructure ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 187.78ms │            188.39ms │     no change │
│ QQuery 2     │  70.28ms │             68.41ms │     no change │
│ QQuery 3     │  52.51ms │             53.41ms │     no change │
│ QQuery 4     │  39.13ms │             38.95ms │     no change │
│ QQuery 5     │ 126.46ms │            113.56ms │ +1.11x faster │
│ QQuery 6     │  11.12ms │             11.14ms │     no change │
│ QQuery 7     │ 277.91ms │            232.85ms │ +1.19x faster │
│ QQuery 8     │  80.91ms │             81.30ms │     no change │
│ QQuery 9     │ 174.15ms │            163.58ms │ +1.06x faster │
│ QQuery 10    │ 104.81ms │            104.20ms │     no change │
│ QQuery 11    │  53.20ms │             54.47ms │     no change │
│ QQuery 12    │  70.71ms │             71.59ms │     no change │
│ QQuery 13    │ 214.11ms │            202.77ms │ +1.06x faster │
│ QQuery 14    │  14.09ms │             13.13ms │ +1.07x faster │
│ QQuery 15    │  23.20ms │             23.99ms │     no change │
│ QQuery 16    │  67.71ms │             51.89ms │ +1.30x faster │
│ QQuery 17    │ 718.41ms │            705.77ms │     no change │
│ QQuery 18    │ 732.54ms │            607.96ms │ +1.20x faster │
│ QQuery 19    │  61.01ms │             61.89ms │     no change │
│ QQuery 20    │ 214.62ms │            218.56ms │     no change │
│ QQuery 21    │ 412.88ms │            335.48ms │ +1.23x faster │
│ QQuery 22    │  33.44ms │             34.58ms │     no change │
└──────────────┴──────────┴─────────────────────┴───────────────┘

Rationale for this change

Currently, we're using a SmallVec to keep the list of indices that share the same key value. This hurts performance when a key occurs multiple times, as it slows down insertion as well as retrieval of the indices (once the SmallVec spills to the heap, access is no longer cache-efficient).

This also reduces the memory usage of the in-memory structure: the Vec overhead is paid only once rather than per value, so the memory required for storing the indices drops by roughly 4x (8 bytes per value instead of 32).
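For reference, the per-entry sizes behind that 4x figure on a typical 64-bit target (a quick standalone check; the old layout pays a pointer/capacity/length header per key before any heap data, while the chained layout pays a flat 8 bytes per row index):

```rust
fn main() {
    // On 64-bit targets, a Vec header alone is (ptr, cap, len) = 24 bytes,
    // paid per key in a `(u64, Vec<u64>)`-style map entry, before any heap data.
    assert_eq!(std::mem::size_of::<Vec<u64>>(), 24);
    // In the chained layout, each additional row index costs 8 bytes
    // in the preallocated `next` array.
    assert_eq!(std::mem::size_of::<u64>(), 8);
    println!("ok");
}
```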

Instead, we can use a chained-list datastructure: a preallocated array stores, for each row, the index of the next row with the same key (adapted from Balancing vectorized query execution with bandwidth-optimized storage).
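The chained-list idea can be sketched as a small standalone example. This is illustrative only: the names are hypothetical and a std HashMap stands in for the hashbrown RawTable used in the actual implementation. The map slot holds the 1-based index of the newest build row for a key; next[i] holds the 1-based index of the previous row with the same key, with 0 terminating the chain.

```rust
use std::collections::HashMap;

/// Minimal sketch of the chained-list join map (not DataFusion's actual code).
struct ChainedJoinMap {
    // key -> 1-based index of the *last* build row with that key; 0 means empty.
    map: HashMap<u64, u64>,
    // next[i] is the 1-based index of the previous row in row i's chain, or 0.
    next: Vec<u64>,
}

impl ChainedJoinMap {
    fn build(keys: &[u64]) -> Self {
        let mut map = HashMap::new();
        let mut next = vec![0u64; keys.len()];
        for (row, &key) in keys.iter().enumerate() {
            let slot = map.entry(key).or_insert(0u64);
            next[row] = *slot;            // chain the previous head behind this row
            *slot = (row + 1) as u64;     // this row becomes the new head (1-based)
        }
        ChainedJoinMap { map, next }
    }

    /// Collect all build-side row indices matching `key`, newest first.
    fn probe(&self, key: u64) -> Vec<usize> {
        let mut out = Vec::new();
        let mut i = *self.map.get(&key).unwrap_or(&0);
        while i != 0 {
            out.push((i - 1) as usize);
            i = self.next[(i - 1) as usize];
        }
        out
    }
}

fn main() {
    let m = ChainedJoinMap::build(&[10, 20, 10, 30, 10]);
    // rows with key 10 come back newest-first: 4, 2, 0
    assert_eq!(m.probe(10), vec![4, 2, 0]);
    assert_eq!(m.probe(30), vec![3]);
    assert!(m.probe(99).is_empty());
    println!("ok");
}
```

Note that both insertion and retrieval touch only the preallocated `next` array, which is what makes this layout cache-friendlier than per-key SmallVecs.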

What changes are included in this PR?

Are these changes tested?

Yes, by existing tests.

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label Jun 15, 2023
@Dandandan Dandandan marked this pull request as draft June 15, 2023 14:47
@Dandandan Dandandan changed the title Change HashJoin datastructure Improve performance of HashJoin datastructure\ Jun 15, 2023
@Dandandan Dandandan changed the title Improve performance of HashJoin datastructure\ Improve performance of HashJoin datastructure Jun 15, 2023
@Dandandan
Contributor Author

FYI @berkaysynnada I am trying to make some improvements to the hash join datastructure.

It seems the datastructure changes are not really compatible with the symmetric hash join (as it needs to be mutated during the probing process) - does it make sense to "duplicate" the old structure and code and use that inside the symmetric hash join?

@Dandandan Dandandan changed the title Improve performance of HashJoin datastructure Improve performance of HashJoin datastructure (5-15% improvement on selected TPC-H queries) Jun 16, 2023
@Dandandan Dandandan changed the title Improve performance of HashJoin datastructure (5-15% improvement on selected TPC-H queries) Improve performance/memory usage of HashJoin datastructure (5-15% improvement on selected TPC-H queries) Jun 16, 2023
@Dandandan Dandandan marked this pull request as ready for review June 16, 2023 13:46
/// Gets build and probe indices which satisfy the on condition (including
/// the equality condition and the join filter) in the join.
#[allow(clippy::too_many_arguments)]
pub fn build_join_indices(
Contributor Author

The old implementation moved to the symmetric hash join. Supporting both options in a more generic way seems to add more complexity than just having the two versions around (and further tuning each for its specific purpose / algorithm).


/// SymmetricJoinHashMap is similar to JoinHashMap, except that it stores the indices inline, allowing it to mutate
/// and shrink the indices.
pub struct SymmetricJoinHashMap(pub RawTable<(u64, SmallVec<[u64; 1]>)>);
Contributor Author

@berkaysynnada not sure if SmallVec is optimal. It might be an improvement to use Vec here as the >1 case probably occurs more often here?

Contributor

I might change this since we are not pushing for the same hash table implementation.

@ozankabak
Contributor

@metesynnada, PTAL. Let's collaborate with @Dandandan on this as this is related to your work area.

Contributor

@thinkharderdev thinkharderdev left a comment

Not super familiar with the code before but this seems like a good change.

@Dandandan
Contributor Author

@metesynnada let me know what you think and how we can cooperate. Next up I'm investigating improvements that could be done to speed up / vectorize collision checks.

@ozankabak
Contributor

ozankabak commented Jun 17, 2023

We will study this PR tomorrow and comment on it. Thanks for working on this.

@metesynnada
Contributor

@metesynnada let me know what you think and how we can cooperate. Next up I'm investigating improvements that could be done to speed up / vectorize collision checks.

I'm excited to collaborate on optimizing the collision checks. As @ozankabak said, I'll start my deep dive into this PR (and beyond) tomorrow.

@alamb
Contributor

alamb commented Jun 18, 2023

Here are my measurements which I think are consistent with what is on this PR.
This is very exciting to see

--------------------
Benchmark tpch_mem.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ adapt_datastructure ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  517.86ms │            827.70ms │  1.60x slower │
│ QQuery 2     │  278.64ms │            534.33ms │  1.92x slower │
│ QQuery 3     │  186.82ms │            178.34ms │     no change │
│ QQuery 4     │  114.30ms │            112.48ms │     no change │
│ QQuery 5     │  480.70ms │            436.67ms │ +1.10x faster │
│ QQuery 6     │   40.20ms │             40.14ms │     no change │
│ QQuery 7     │ 1212.07ms │            963.47ms │ +1.26x faster │
│ QQuery 8     │  248.13ms │            251.24ms │     no change │
│ QQuery 9     │  613.36ms │            614.93ms │     no change │
│ QQuery 10    │  344.36ms │            339.00ms │     no change │
│ QQuery 11    │  212.06ms │            211.54ms │     no change │
│ QQuery 12    │  165.50ms │            167.81ms │     no change │
│ QQuery 13    │  673.03ms │            697.19ms │     no change │
│ QQuery 14    │   53.17ms │             50.71ms │     no change │
│ QQuery 15    │   88.08ms │             96.08ms │  1.09x slower │
│ QQuery 16    │  251.78ms │            200.89ms │ +1.25x faster │
│ QQuery 17    │ 2692.16ms │           2694.20ms │     no change │
│ QQuery 18    │ 2814.05ms │           2642.08ms │ +1.07x faster │
│ QQuery 19    │  167.80ms │            167.04ms │     no change │
│ QQuery 20    │  865.79ms │            852.77ms │     no change │
│ QQuery 21    │ 1446.79ms │           1268.84ms │ +1.14x faster │
│ QQuery 22    │  100.54ms │            102.31ms │     no change │
└──────────────┴───────────┴─────────────────────┴───────────────┘

@Dandandan
Contributor Author

Thanks for running the benchmarks @alamb - are you sure of the accuracy of the slower running queries?

@Dandandan
Contributor Author

Especially query 1 is suspicious, as it doesn't have a join @alamb ;)

@ozankabak
Contributor

Especially query 1 is suspicious, as it doesn't have a join @alamb ;)

Wow. So there is a huge noise in the benchmark?

@Dandandan
Contributor Author

Especially query 1 is suspicious, as it doesn't have a join @alamb ;)

Wow. So there is a huge noise in the benchmark?

I ran the benchmark manually, averaged over 20 runs, which has only minimal noise. But running once might add some more noise

@Dandandan
Contributor Author

Dandandan commented Jun 19, 2023

My results using the bench.sh script with default parameters:

┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     main ┃ adapt_datastructure ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 187.78ms │            188.39ms │     no change │
│ QQuery 2     │  70.28ms │             68.41ms │     no change │
│ QQuery 3     │  52.51ms │             53.41ms │     no change │
│ QQuery 4     │  39.13ms │             38.95ms │     no change │
│ QQuery 5     │ 126.46ms │            113.56ms │ +1.11x faster │
│ QQuery 6     │  11.12ms │             11.14ms │     no change │
│ QQuery 7     │ 277.91ms │            232.85ms │ +1.19x faster │
│ QQuery 8     │  80.91ms │             81.30ms │     no change │
│ QQuery 9     │ 174.15ms │            163.58ms │ +1.06x faster │
│ QQuery 10    │ 104.81ms │            104.20ms │     no change │
│ QQuery 11    │  53.20ms │             54.47ms │     no change │
│ QQuery 12    │  70.71ms │             71.59ms │     no change │
│ QQuery 13    │ 214.11ms │            202.77ms │ +1.06x faster │
│ QQuery 14    │  14.09ms │             13.13ms │ +1.07x faster │
│ QQuery 15    │  23.20ms │             23.99ms │     no change │
│ QQuery 16    │  67.71ms │             51.89ms │ +1.30x faster │
│ QQuery 17    │ 718.41ms │            705.77ms │     no change │
│ QQuery 18    │ 732.54ms │            607.96ms │ +1.20x faster │
│ QQuery 19    │  61.01ms │             61.89ms │     no change │
│ QQuery 20    │ 214.62ms │            218.56ms │     no change │
│ QQuery 21    │ 412.88ms │            335.48ms │ +1.23x faster │
│ QQuery 22    │  33.44ms │             34.58ms │     no change │
└──────────────┴──────────┴─────────────────────┴───────────────┘

@metesynnada
Contributor

metesynnada commented Jun 19, 2023

My results are aligned with yours,

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      main ┃ adapt_datastructure ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  382.88ms │            384.07ms │     no change │
│ QQuery 2     │  124.07ms │            116.29ms │ +1.07x faster │
│ QQuery 3     │  174.73ms │            174.30ms │     no change │
│ QQuery 4     │  110.42ms │            110.51ms │     no change │
│ QQuery 5     │  227.84ms │            226.10ms │     no change │
│ QQuery 6     │  108.11ms │            109.13ms │     no change │
│ QQuery 7     │  394.28ms │            352.03ms │ +1.12x faster │
│ QQuery 8     │  256.21ms │            253.69ms │     no change │
│ QQuery 9     │  403.20ms │            385.42ms │     no change │
│ QQuery 10    │  299.90ms │            299.38ms │     no change │
│ QQuery 11    │   88.98ms │             82.60ms │ +1.08x faster │
│ QQuery 12    │  202.39ms │            203.64ms │     no change │
│ QQuery 13    │  439.47ms │            420.72ms │     no change │
│ QQuery 14    │  151.04ms │            150.24ms │     no change │
│ QQuery 15    │  114.99ms │            114.34ms │     no change │
│ QQuery 16    │  126.88ms │             90.89ms │ +1.40x faster │
│ QQuery 17    │  955.93ms │            942.46ms │     no change │
│ QQuery 18    │ 1183.84ms │            979.92ms │ +1.21x faster │
│ QQuery 19    │  311.80ms │            312.30ms │     no change │
│ QQuery 20    │  353.71ms │            338.11ms │     no change │
│ QQuery 21    │  634.34ms │            539.71ms │ +1.18x faster │
│ QQuery 22    │   77.22ms │             76.97ms │     no change │
└──────────────┴───────────┴─────────────────────┴───────────────┘

Currently, I am reviewing the code. The algorithm is neat, however, I want to find a good way to integrate the symmetric hash join into this algorithm. I think it can be possible.

As far as I can see, the memory reservation in the hash join has not been updated for the new layout:

    // Estimation of memory size, required for hashtable, prior to allocation.
    // Final result can be verified using `RawTable.allocation_info()`
    //
    // For majority of cases hashbrown overestimates buckets qty to keep ~1/8 of them empty.
    // This formula leads to overallocation for small tables (< 8 elements) but fine overall.
    let estimated_buckets = (num_rows.checked_mul(8).ok_or_else(|| {
        DataFusionError::Execution(
        "usize overflow while estimating number of hashmap buckets".to_string(),
        )
    })? / 7)
        .next_power_of_two();
    // 32 bytes per `(u64, SmallVec<[u64; 1]>)`
    // + 1 byte for each bucket
    // + 16 bytes fixed
    let estimated_hashtable_size = 32 * estimated_buckets + estimated_buckets + 16;

@Dandandan
Contributor Author

Currently, I am reviewing the code. The algorithm is neat, however, I want to find a good way to integrate the symmetric hash join into this algorithm. I think it can be possible.

@ozankabak I am not too familiar with the symmetric hash join, but one complexity seemed to be in reallocating/updating the next datastructure for the symmetric hash join.

@metesynnada
Contributor

metesynnada commented Jun 19, 2023

I think changing the chain start pointer and a deletion offset might be the solution, I am working on that. We can address this in another PR. Since the performance gain is obvious, we can move on with this algorithm.

If you change the memory reservation calculation, LGTM.

Worst case, I will improve the data structure behind the SHJ hash table and we use separate ones in these joins.

@Dandandan
Contributor Author

Dandandan commented Jun 19, 2023

I think changing the chain start pointer and a deletion offset might be the solution, I am working on that. We can address this in another PR. Since the performance gain is obvious, we can move on with this algorithm.

If you change the memory reservation calculation, LGTM.

Worst case, I will improve the data structure behind the SHJ hash table and we use separate ones in these joins.

Sounds good :). Yes will do!

// Already exists: add index to next array
let prev_index = *index;
// Store new value inside hashmap
*index = (row + offset + 1) as u64;
Contributor

Is it possible to hold the chain start in the hashmap, instead of the end of the chain? Is there a particular reason for this?

Contributor

Additions become O(1) by holding the end of the chain, right?

Contributor Author

I think the reason is that, while iterating over the hashes/indices, we get the latest index (which both contains the value and points to the previous index) as a constant-time operation. I'm not sure how it would work when holding the chain start in the map, as we would have to walk the chain first to get to the last entry.

It would be possible (though seems not beneficial for the normal hash join) to also keep the start of the chain in the hashmap.

Contributor

Yeah, there is no gain for the usual hash join, but pruning becomes much more expensive if I do not have the beginning. I think I will not push for it for now; let's have separate hashmap paradigms.

Contributor Author

Additions become O(1) by holding the end of the chain, right?

Yes, this way next[value - 1] contains the previous value, and the next value / index can be found in the same way again.
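Those two operations can be sketched in isolation (hypothetical helper names, not the PR's actual functions): holding the chain end in the map slot makes insertion O(1), and next[i - 1] walks back through earlier matches during probing.

```rust
// Sketch (illustrative, not DataFusion's actual code). `map_slot` is the
// hashmap entry for one key and holds the 1-based index of the newest row;
// `next` is the preallocated chain array; 0 terminates a chain.
fn insert(map_slot: &mut u64, next: &mut [u64], row: usize) {
    let prev = *map_slot;           // previous chain head, 0 if the chain was empty
    *map_slot = (row + 1) as u64;   // this row becomes the new head (1-based)
    next[row] = prev;               // link back to the previous head: O(1)
}

// Walking the chain: `next[i - 1]` yields the previous 1-based index.
fn collect(map_slot: u64, next: &[u64]) -> Vec<usize> {
    let mut out = Vec::new();
    let mut i = map_slot;
    while i != 0 {
        out.push((i - 1) as usize);
        i = next[(i - 1) as usize];
    }
    out
}

fn main() {
    let mut slot = 0u64;
    let mut next = vec![0u64; 4];
    for row in [0usize, 2, 3] { // three rows sharing one key
        insert(&mut slot, &mut next, row);
    }
    assert_eq!(collect(slot, &next), vec![3, 2, 0]); // newest first
    println!("ok");
}
```

If the map instead held the chain start, appending a row would require walking to the end of the chain first, which is why the end-of-chain encoding was chosen here.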

Contributor

@metesynnada metesynnada left a comment

LGTM.


for &i in indices {
// Check hash collisions
let mut i = *index - 1;
loop {
let offset_build_index = i as usize - offset_value;
Contributor

The offset is not necessary for the usual hash join; you can remove it safely.

Contributor Author

👍



@Dandandan Dandandan merged commit 26c90c2 into main Jun 19, 2023
@alamb
Contributor

alamb commented Jun 19, 2023

Thanks for running the benchmarks @alamb - are you sure of the accuracy of the slower running queries?

No I am not -- I observe significant variation on the queries that take small amounts of time to run. Thank you

@alamb alamb deleted the adapt_datastructure branch June 19, 2023 13:26
@mingmwang
Contributor

I ran the test locally; no performance downgrade. This PR is great!!
