
Improve performance/memory usage of HashJoin datastructure (5-15% improvement on selected TPC-H queries) #6679

Merged · 16 commits merged into main from adapt_datastructure · Jun 19, 2023

Conversation

Dandandan
Contributor

@Dandandan Dandandan commented Jun 15, 2023

Which issue does this PR close?

Closes #6700

Benchmark results (TPC-H SF=1 in memory, average of 20 runs): q5, q7, q17, q18, and q21 show some improvement; most queries do not stress the join, or this code path of the join (building the hash map), very much.

query     main     PR
5        58.74    54.98
7       104.14    92.41
17      332.83   320.09
18      244.41   218.97
21      152.79   132.93

Using the bench script (tpch_mem):

┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     main ┃ adapt_datastructure ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 187.78ms │            188.39ms │     no change │
│ QQuery 2     │  70.28ms │             68.41ms │     no change │
│ QQuery 3     │  52.51ms │             53.41ms │     no change │
│ QQuery 4     │  39.13ms │             38.95ms │     no change │
│ QQuery 5     │ 126.46ms │            113.56ms │ +1.11x faster │
│ QQuery 6     │  11.12ms │             11.14ms │     no change │
│ QQuery 7     │ 277.91ms │            232.85ms │ +1.19x faster │
│ QQuery 8     │  80.91ms │             81.30ms │     no change │
│ QQuery 9     │ 174.15ms │            163.58ms │ +1.06x faster │
│ QQuery 10    │ 104.81ms │            104.20ms │     no change │
│ QQuery 11    │  53.20ms │             54.47ms │     no change │
│ QQuery 12    │  70.71ms │             71.59ms │     no change │
│ QQuery 13    │ 214.11ms │            202.77ms │ +1.06x faster │
│ QQuery 14    │  14.09ms │             13.13ms │ +1.07x faster │
│ QQuery 15    │  23.20ms │             23.99ms │     no change │
│ QQuery 16    │  67.71ms │             51.89ms │ +1.30x faster │
│ QQuery 17    │ 718.41ms │            705.77ms │     no change │
│ QQuery 18    │ 732.54ms │            607.96ms │ +1.20x faster │
│ QQuery 19    │  61.01ms │             61.89ms │     no change │
│ QQuery 20    │ 214.62ms │            218.56ms │     no change │
│ QQuery 21    │ 412.88ms │            335.48ms │ +1.23x faster │
│ QQuery 22    │  33.44ms │             34.58ms │     no change │
└──────────────┴──────────┴─────────────────────┴───────────────┘

Rationale for this change

Currently, we're using a SmallVec to keep the list of indices that share the same key value. This hurts performance when a key occurs multiple times, as it slows down insertion as well as retrieval of the indices (once the SmallVec spills to the heap, access is no longer cache-efficient).

This also reduces the memory usage of the in-memory structure: the Vec overhead is paid only once rather than per value, so the memory required for storing the indices drops by roughly 4x (8 bytes per value instead of 32).
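For reference, the per-entry sizes behind that 4x figure on a typical 64-bit target (a quick standalone check; the old layout pays a pointer/capacity/length header per key before any heap data, while the chained layout pays a flat 8 bytes per row index):

```rust
fn main() {
    // On 64-bit targets, a Vec header alone is (ptr, cap, len) = 24 bytes,
    // paid per key in a `(u64, Vec<u64>)`-style map entry, before any heap data.
    assert_eq!(std::mem::size_of::<Vec<u64>>(), 24);
    // In the chained layout, each additional row index costs 8 bytes
    // in the preallocated `next` array.
    assert_eq!(std::mem::size_of::<u64>(), 8);
    println!("ok");
}
```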

Instead, we can use a chained-list datastructure: a preallocated array stores, for each row, the index of the next row with the same key (adapted from Balancing vectorized query execution with bandwidth-optimized storage).
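The chained-list idea can be sketched as a small standalone example. This is illustrative only: the names are hypothetical and a std HashMap stands in for the hashbrown RawTable used in the actual implementation. The map slot holds the 1-based index of the newest build row for a key; next[i] holds the 1-based index of the previous row with the same key, with 0 terminating the chain.

```rust
use std::collections::HashMap;

/// Minimal sketch of the chained-list join map (not DataFusion's actual code).
struct ChainedJoinMap {
    // key -> 1-based index of the *last* build row with that key; 0 means empty.
    map: HashMap<u64, u64>,
    // next[i] is the 1-based index of the previous row in row i's chain, or 0.
    next: Vec<u64>,
}

impl ChainedJoinMap {
    fn build(keys: &[u64]) -> Self {
        let mut map = HashMap::new();
        let mut next = vec![0u64; keys.len()];
        for (row, &key) in keys.iter().enumerate() {
            let slot = map.entry(key).or_insert(0u64);
            next[row] = *slot;            // chain the previous head behind this row
            *slot = (row + 1) as u64;     // this row becomes the new head (1-based)
        }
        ChainedJoinMap { map, next }
    }

    /// Collect all build-side row indices matching `key`, newest first.
    fn probe(&self, key: u64) -> Vec<usize> {
        let mut out = Vec::new();
        let mut i = *self.map.get(&key).unwrap_or(&0);
        while i != 0 {
            out.push((i - 1) as usize);
            i = self.next[(i - 1) as usize];
        }
        out
    }
}

fn main() {
    let m = ChainedJoinMap::build(&[10, 20, 10, 30, 10]);
    // rows with key 10 come back newest-first: 4, 2, 0
    assert_eq!(m.probe(10), vec![4, 2, 0]);
    assert_eq!(m.probe(30), vec![3]);
    assert!(m.probe(99).is_empty());
    println!("ok");
}
```

Note that both insertion and retrieval touch only the preallocated `next` array, which is what makes this layout cache-friendlier than per-key SmallVecs.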

What changes are included in this PR?

Are these changes tested?

Yes, by existing tests.

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label Jun 15, 2023
@Dandandan Dandandan marked this pull request as draft June 15, 2023 14:47
@Dandandan Dandandan changed the title Change HashJoin datastructure Improve performance of HashJoin datastructure\ Jun 15, 2023
@Dandandan Dandandan changed the title Improve performance of HashJoin datastructure\ Improve performance of HashJoin datastructure Jun 15, 2023
@Dandandan
Contributor Author

FYI @berkaysynnada I am trying to make some improvements to the hash join datastructure.

It seems the datastructure changes are not really compatible with the symmetric hash join (as it needs to be mutated during the probing process) - does it make sense to "duplicate" the old structure and code and use that inside the symmetric hash join?

@Dandandan Dandandan changed the title Improve performance of HashJoin datastructure Improve performance of HashJoin datastructure (5-15% improvement on selected TPC-H queries) Jun 16, 2023
@Dandandan Dandandan changed the title Improve performance of HashJoin datastructure (5-15% improvement on selected TPC-H queries) Improve performance/memory usage of HashJoin datastructure (5-15% improvement on selected TPC-H queries) Jun 16, 2023
@Dandandan Dandandan marked this pull request as ready for review June 16, 2023 13:46
/// Gets build and probe indices which satisfy the on condition (including
/// the equality condition and the join filter) in the join.
#[allow(clippy::too_many_arguments)]
pub fn build_join_indices(
Contributor Author

The old implementation moved to the symmetric hash join. Supporting both options in a more generic way seems to add more complexity than just having the two versions around (and further tuning each for its specific purpose / algorithm).


/// SymmetricJoinHashMap is similar to JoinHashMap, except that it stores the indices inline, allowing it to mutate
/// and shrink the indices.
pub struct SymmetricJoinHashMap(pub RawTable<(u64, SmallVec<[u64; 1]>)>);
Contributor Author

@berkaysynnada not sure if SmallVec is optimal. It might be an improvement to use Vec here as the >1 case probably occurs more often here?

Contributor

I might change this since we are not pushing for the same hash table implementation.

@ozankabak
Contributor

@metesynnada, PTAL. Let's collaborate with @Dandandan on this as this is related to your work area.

Contributor

@thinkharderdev thinkharderdev left a comment

Not super familiar with the code before but this seems like a good change.

@Dandandan
Contributor Author

@metesynnada let me know what you think and how we can cooperate. Next up I'm investigating improvements that could be done to speed up / vectorize collision checks.

@ozankabak
Contributor

ozankabak commented Jun 17, 2023

We will study this PR tomorrow and comment on it. Thanks for working on this.

@metesynnada
Contributor

@metesynnada let me know what you think and how we can cooperate. Next up I'm investigating improvements that could be done to speed up / vectorize collision checks.

I'm excited to collaborate on optimizing the collision checks. As @ozankabak said, I'll start my deep dive into this PR (and beyond) tomorrow.

@alamb
Contributor

alamb commented Jun 18, 2023

Here are my measurements which I think are consistent with what is on this PR.
This is very exciting to see

--------------------
Benchmark tpch_mem.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ adapt_datastructure ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  517.86ms │            827.70ms │  1.60x slower │
│ QQuery 2     │  278.64ms │            534.33ms │  1.92x slower │
│ QQuery 3     │  186.82ms │            178.34ms │     no change │
│ QQuery 4     │  114.30ms │            112.48ms │     no change │
│ QQuery 5     │  480.70ms │            436.67ms │ +1.10x faster │
│ QQuery 6     │   40.20ms │             40.14ms │     no change │
│ QQuery 7     │ 1212.07ms │            963.47ms │ +1.26x faster │
│ QQuery 8     │  248.13ms │            251.24ms │     no change │
│ QQuery 9     │  613.36ms │            614.93ms │     no change │
│ QQuery 10    │  344.36ms │            339.00ms │     no change │
│ QQuery 11    │  212.06ms │            211.54ms │     no change │
│ QQuery 12    │  165.50ms │            167.81ms │     no change │
│ QQuery 13    │  673.03ms │            697.19ms │     no change │
│ QQuery 14    │   53.17ms │             50.71ms │     no change │
│ QQuery 15    │   88.08ms │             96.08ms │  1.09x slower │
│ QQuery 16    │  251.78ms │            200.89ms │ +1.25x faster │
│ QQuery 17    │ 2692.16ms │           2694.20ms │     no change │
│ QQuery 18    │ 2814.05ms │           2642.08ms │ +1.07x faster │
│ QQuery 19    │  167.80ms │            167.04ms │     no change │
│ QQuery 20    │  865.79ms │            852.77ms │     no change │
│ QQuery 21    │ 1446.79ms │           1268.84ms │ +1.14x faster │
│ QQuery 22    │  100.54ms │            102.31ms │     no change │
└──────────────┴───────────┴─────────────────────┴───────────────┘

@Dandandan
Contributor Author

Thanks for running the benchmarks @alamb - are you sure of the accuracy of the slower running queries?

@Dandandan
Contributor Author

Especially query 1 is suspicious, as it doesn't have a join @alamb ;)

@ozankabak
Contributor

Especially query 1 is suspicious, as it doesn't have a join @alamb ;)

Wow. So there is a huge noise in the benchmark?

@Dandandan
Contributor Author

Especially query 1 is suspicious, as it doesn't have a join @alamb ;)

Wow. So there is a huge noise in the benchmark?

I ran the benchmark manually, averaged over 20 runs, which has only minimal noise. But running once might add some more noise

@Dandandan
Contributor Author

Dandandan commented Jun 19, 2023

My results using the bench.sh script with default parameters:

┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     main ┃ adapt_datastructure ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 187.78ms │            188.39ms │     no change │
│ QQuery 2     │  70.28ms │             68.41ms │     no change │
│ QQuery 3     │  52.51ms │             53.41ms │     no change │
│ QQuery 4     │  39.13ms │             38.95ms │     no change │
│ QQuery 5     │ 126.46ms │            113.56ms │ +1.11x faster │
│ QQuery 6     │  11.12ms │             11.14ms │     no change │
│ QQuery 7     │ 277.91ms │            232.85ms │ +1.19x faster │
│ QQuery 8     │  80.91ms │             81.30ms │     no change │
│ QQuery 9     │ 174.15ms │            163.58ms │ +1.06x faster │
│ QQuery 10    │ 104.81ms │            104.20ms │     no change │
│ QQuery 11    │  53.20ms │             54.47ms │     no change │
│ QQuery 12    │  70.71ms │             71.59ms │     no change │
│ QQuery 13    │ 214.11ms │            202.77ms │ +1.06x faster │
│ QQuery 14    │  14.09ms │             13.13ms │ +1.07x faster │
│ QQuery 15    │  23.20ms │             23.99ms │     no change │
│ QQuery 16    │  67.71ms │             51.89ms │ +1.30x faster │
│ QQuery 17    │ 718.41ms │            705.77ms │     no change │
│ QQuery 18    │ 732.54ms │            607.96ms │ +1.20x faster │
│ QQuery 19    │  61.01ms │             61.89ms │     no change │
│ QQuery 20    │ 214.62ms │            218.56ms │     no change │
│ QQuery 21    │ 412.88ms │            335.48ms │ +1.23x faster │
│ QQuery 22    │  33.44ms │             34.58ms │     no change │
└──────────────┴──────────┴─────────────────────┴───────────────┘

@metesynnada
Contributor

metesynnada commented Jun 19, 2023

My results are aligned with yours,

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      main ┃ adapt_datastructure ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  382.88ms │            384.07ms │     no change │
│ QQuery 2     │  124.07ms │            116.29ms │ +1.07x faster │
│ QQuery 3     │  174.73ms │            174.30ms │     no change │
│ QQuery 4     │  110.42ms │            110.51ms │     no change │
│ QQuery 5     │  227.84ms │            226.10ms │     no change │
│ QQuery 6     │  108.11ms │            109.13ms │     no change │
│ QQuery 7     │  394.28ms │            352.03ms │ +1.12x faster │
│ QQuery 8     │  256.21ms │            253.69ms │     no change │
│ QQuery 9     │  403.20ms │            385.42ms │     no change │
│ QQuery 10    │  299.90ms │            299.38ms │     no change │
│ QQuery 11    │   88.98ms │             82.60ms │ +1.08x faster │
│ QQuery 12    │  202.39ms │            203.64ms │     no change │
│ QQuery 13    │  439.47ms │            420.72ms │     no change │
│ QQuery 14    │  151.04ms │            150.24ms │     no change │
│ QQuery 15    │  114.99ms │            114.34ms │     no change │
│ QQuery 16    │  126.88ms │             90.89ms │ +1.40x faster │
│ QQuery 17    │  955.93ms │            942.46ms │     no change │
│ QQuery 18    │ 1183.84ms │            979.92ms │ +1.21x faster │
│ QQuery 19    │  311.80ms │            312.30ms │     no change │
│ QQuery 20    │  353.71ms │            338.11ms │     no change │
│ QQuery 21    │  634.34ms │            539.71ms │ +1.18x faster │
│ QQuery 22    │   77.22ms │             76.97ms │     no change │
└──────────────┴───────────┴─────────────────────┴───────────────┘

Currently, I am reviewing the code. The algorithm is neat, however, I want to find a good way to integrate the symmetric hash join into this algorithm. I think it can be possible.

As far as I can see, the memory reservation in the hash join has not been updated for the new layout:

    // Estimation of memory size, required for hashtable, prior to allocation.
    // Final result can be verified using `RawTable.allocation_info()`
    //
    // For majority of cases hashbrown overestimates buckets qty to keep ~1/8 of them empty.
    // This formula leads to overallocation for small tables (< 8 elements) but fine overall.
    let estimated_buckets = (num_rows.checked_mul(8).ok_or_else(|| {
        DataFusionError::Execution(
        "usize overflow while estimating number of hashmap buckets".to_string(),
        )
    })? / 7)
        .next_power_of_two();
    // 32 bytes per `(u64, SmallVec<[u64; 1]>)`
    // + 1 byte for each bucket
    // + 16 bytes fixed
    let estimated_hashtable_size = 32 * estimated_buckets + estimated_buckets + 16;

@Dandandan
Contributor Author

Currently, I am reviewing the code. The algorithm is neat, however, I want to find a good way to integrate the symmetric hash join into this algorithm. I think it can be possible.

@ozankabak I am not too familiar with the symmetric hash join, but one complexity seemed to be in reallocating/updating the next datastructure for the symmetric hash join.

@metesynnada
Contributor

metesynnada commented Jun 19, 2023

I think changing the chain start pointer and a deletion offset might be the solution, I am working on that. We can address this in another PR. Since the performance gain is obvious, we can move on with this algorithm.

If you change the memory reservation calculation, LGTM.

Worst case, I will improve the data structure behind the SHJ hash table and we use separate ones in these joins.

@Dandandan
Contributor Author

Dandandan commented Jun 19, 2023

I think changing the chain start pointer and a deletion offset might be the solution, I am working on that. We can address this in another PR. Since the performance gain is obvious, we can move on with this algorithm.

If you change the memory reservation calculation, LGTM.

Worst case, I will improve the data structure behind the SHJ hash table and we use separate ones in these joins.

Sounds good :). Yes will do!

// Already exists: add index to next array
let prev_index = *index;
// Store new value inside hashmap
*index = (row + offset + 1) as u64;
Contributor

Is it possible to hold the chain start in the hashmap, instead of the end of the chain? Is there a particular reason for this?

Contributor

Additions become O(1) by holding the end of the chain, right?

Contributor Author

I think the reason is that, while iterating over the hashes/indices, we get the latest index (which both contains the value and points to the previous index) as a constant-time operation. I'm not sure how it would work when holding the chain start in the map, as we would have to walk the chain first to get to the last entry.

It would be possible (though seems not beneficial for the normal hash join) to also keep the start of the chain in the hashmap.

Contributor

Yeah, there is no gain for the usual hash join, but pruning becomes much more expensive if I do not have the beginning. I think I will not push for it for now; let's have separate hashmap paradigms.

Contributor Author

Additions become O(1) by holding the end of the chain, right?

Yes, this way next[value - 1] contains the previous value, and the next value / index can be found in the same way again.
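Those two operations can be sketched in isolation (hypothetical helper names, not the PR's actual functions): holding the chain end in the map slot makes insertion O(1), and next[i - 1] walks back through earlier matches during probing.

```rust
// Sketch (illustrative, not DataFusion's actual code). `map_slot` is the
// hashmap entry for one key and holds the 1-based index of the newest row;
// `next` is the preallocated chain array; 0 terminates a chain.
fn insert(map_slot: &mut u64, next: &mut [u64], row: usize) {
    let prev = *map_slot;           // previous chain head, 0 if the chain was empty
    *map_slot = (row + 1) as u64;   // this row becomes the new head (1-based)
    next[row] = prev;               // link back to the previous head: O(1)
}

// Walking the chain: `next[i - 1]` yields the previous 1-based index.
fn collect(map_slot: u64, next: &[u64]) -> Vec<usize> {
    let mut out = Vec::new();
    let mut i = map_slot;
    while i != 0 {
        out.push((i - 1) as usize);
        i = next[(i - 1) as usize];
    }
    out
}

fn main() {
    let mut slot = 0u64;
    let mut next = vec![0u64; 4];
    for row in [0usize, 2, 3] { // three rows sharing one key
        insert(&mut slot, &mut next, row);
    }
    assert_eq!(collect(slot, &next), vec![3, 2, 0]); // newest first
    println!("ok");
}
```

If the map instead held the chain start, appending a row would require walking to the end of the chain first, which is why the end-of-chain encoding was chosen here.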

Contributor

@metesynnada metesynnada left a comment

LGTM.


for &i in indices {
// Check hash collisions
let mut i = *index - 1;
loop {
let offset_build_index = i as usize - offset_value;
Contributor

The offset is not necessary for the usual hash join; you can remove it safely.

Contributor Author

👍



@Dandandan Dandandan merged commit 26c90c2 into main Jun 19, 2023
@alamb
Contributor

alamb commented Jun 19, 2023

Thanks for running the benchmarks @alamb - are you sure of the accuracy of the slower running queries?

No I am not -- I observe significant variation on the queries that take small amounts of time to run. Thank you

@alamb alamb deleted the adapt_datastructure branch June 19, 2023 13:26
@mingmwang
Contributor

I ran the test locally; no performance downgrade. This PR is great!!
