Improve performance/memory usage of HashJoin datastructure (5-15% improvement on selected TPC-H queries) #6679
Conversation
FYI @berkaysynnada I am trying to make some improvements to the hash join datastructure. It seems the datastructure changes are not really compatible with the symmetric hash join (as it needs to be mutated during the probing process) - does it make sense to "duplicate" the old structure and code and use that inside the symmetric hash join?
```rust
/// Gets build and probe indices which satisfy the on condition (including
/// the equality condition and the join filter) in the join.
#[allow(clippy::too_many_arguments)]
pub fn build_join_indices(
```
The old implementation moved to the symmetric hash join. Supporting both options in a more generic way seems to add more complexity than just having the two versions around (and further tuning them for their specific purpose / algorithm).
```rust
/// SymmetricJoinHashMap is similar to JoinHashMap, except that it stores the indices inline,
/// allowing it to mutate and shrink the indices.
pub struct SymmetricJoinHashMap(pub RawTable<(u64, SmallVec<[u64; 1]>)>);
```
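Because the indices live inline in the table, pruning can mutate each entry in place. A minimal sketch of that idea, using a simplified map (std `HashMap` instead of the `RawTable` wrapper above; `prune_below` and the watermark parameter are illustrative names, not the actual API):

```rust
use smallvec::SmallVec;
use std::collections::HashMap;

/// Drop all build-side indices below `watermark`, removing chains that become empty.
fn prune_below(map: &mut HashMap<u64, SmallVec<[u64; 1]>>, watermark: u64) {
    map.retain(|_hash, indices| {
        // Shrink the inline index list in place.
        indices.retain(|&mut i| i >= watermark);
        // Remove the hash entry entirely once no indices remain.
        !indices.is_empty()
    });
}
```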
@berkaysynnada not sure if `SmallVec` is optimal. It might be an improvement to use `Vec` here, as the >1 case probably occurs more often here?
I might change this since we are not pushing for the same hash table implementation.
@metesynnada, PTAL. Let's collaborate with @Dandandan on this as this is related to your work area.
Not super familiar with the code before, but this seems like a good change.
@metesynnada let me know what you think and how we can cooperate. Next up, I'm investigating improvements that could be done to speed up / vectorize the collision checks.
We will study this PR tomorrow and comment on it. Thanks for working on this.
I'm excited to collaborate on optimizing the collision checks. As @ozankabak says, I'll start my deep dive into this PR and beyond tomorrow.
Here are my measurements, which I think are consistent with what is on this PR.
Thanks for running the benchmarks @alamb - are you sure about the accuracy of the slower-running queries?
Especially query 1 is suspicious, as it doesn't have a join @alamb ;)
Wow. So there is huge noise in the benchmark?
I ran the benchmark manually, averaged over 20 runs, which has only minimal noise. But running once might add some more noise.
My results using the bench.sh script with default parameters:
My results are aligned with yours.

Currently, I am reviewing the code. The algorithm is neat; however, I want to find a good way to integrate the symmetric hash join into this algorithm. I think it can be possible. As far as I can see, the memory reservation in the hash join is not changed:

```rust
// Estimation of memory size, required for hashtable, prior to allocation.
// Final result can be verified using `RawTable.allocation_info()`
//
// For the majority of cases hashbrown overestimates the bucket count to keep ~1/8 of them empty.
// This formula leads to overallocation for small tables (< 8 elements) but is fine overall.
let estimated_buckets = (num_rows.checked_mul(8).ok_or_else(|| {
    DataFusionError::Execution(
        "usize overflow while estimating number of hashmap buckets".to_string(),
    )
})? / 7)
.next_power_of_two();
// 32 bytes per `(u64, SmallVec<[u64; 1]>)`
// + 1 byte for each bucket
// + 16 bytes fixed
let estimated_hashtable_size = 32 * estimated_buckets + estimated_buckets + 16;
```
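For intuition, here is a quick worked instance of that estimate (my own numbers, just plugging a 1000-row build side into the formula above):

```rust
// Worked example of the reservation formula for num_rows = 1000:
let num_rows: usize = 1000;
let estimated_buckets = (num_rows * 8 / 7).next_power_of_two(); // 1142 -> 2048
let estimated_size = 32 * estimated_buckets + estimated_buckets + 16;
assert_eq!(estimated_buckets, 2048);
assert_eq!(estimated_size, 67_600); // ~66 KiB reserved for a 1000-row build side
```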
@ozankabak I am not too familiar with the symmetric hash join, but one complexity seemed to be in reallocating/updating the …
I think changing the chain start pointer and a deletion offset might be the solution; I am working on that. We can address this in another PR. Since the performance gain is obvious, we can move on with this algorithm. If you change the memory reservation calculation, LGTM. Worst case, I will improve the data structure behind the SHJ hash table and we can use separate ones in these joins.
Sounds good :). Yes, will do!
```rust
// Already exists: add index to next array
let prev_index = *index;
// Store new value inside hashmap
*index = (row + offset + 1) as u64;
```
Is it possible to hold the chain start in the hashmap, instead of the end of the chain? Is there any particular reason for this?
Additions become O(1) by holding the end of the chain, right?
I think the reason is that while iterating over the hashes/indices we get the latest index (which contains both the value and points to the previous index each time) as a constant-time operation. Not sure how it would work when holding the chain start in the map, as we would have to iterate the chain first to get to the last?
It would be possible (though it seems not beneficial for the normal hash join) to also keep the start of the chain in the hashmap.
Yeah, there is no gain for the usual hash join, but pruning becomes much more expensive if I do not have the beginning. I think I will not push for it for now; let's have separate hashmap paradigms.
> Additions become O(1) by holding the end of the chain, right?

Yes, this way `next[value - 1]` contains the previous value, and the next value / index can be found in the same way again.
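To make the chain mechanics concrete, here is a minimal self-contained sketch of the idea being discussed (my simplification: std `HashMap` instead of hashbrown's `RawTable`, no `offset` handling, and `ChainedJoinMap` is an illustrative name, not the PR's actual type):

```rust
use std::collections::HashMap;

/// Indices are stored +1 so that 0 can mark "end of chain".
struct ChainedJoinMap {
    map: HashMap<u64, u64>, // hash -> 1-based index of the chain end (last inserted row)
    next: Vec<u64>,         // next[i] = 1-based index of the previous row with the same hash
}

impl ChainedJoinMap {
    fn with_capacity(num_rows: usize) -> Self {
        Self {
            map: HashMap::with_capacity(num_rows),
            next: vec![0; num_rows],
        }
    }

    /// O(1) insertion: the new row becomes the chain end and points at the old end.
    fn insert(&mut self, hash: u64, row: usize) {
        let prev = self.map.insert(hash, (row + 1) as u64).unwrap_or(0);
        self.next[row] = prev;
    }

    /// Walk the chain for a probe hash, yielding candidate build-side rows
    /// newest-first (hash collisions still need checking against the actual keys).
    fn candidates(&self, hash: u64) -> Vec<usize> {
        let mut out = Vec::new();
        let mut i = self.map.get(&hash).copied().unwrap_or(0);
        while i != 0 {
            out.push((i - 1) as usize);
            i = self.next[(i - 1) as usize];
        }
        out
    }
}
```

Note how holding the chain *end* in the map is what keeps `insert` O(1): starting from the chain *start* instead would require walking to the tail (or extra bookkeeping) on every insertion.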
LGTM.
```rust
for &i in indices {
    // Check hash collisions
    let mut i = *index - 1;
    loop {
        let offset_build_index = i as usize - offset_value;
```
The offset is not necessary for the usual hash join; you can remove it safely.
👍
No, I am not -- I observe significant variation on the queries that take small amounts of time to run. Thank you.
I ran the test locally; no performance downgrade. This PR is great!!
Which issue does this PR close?
Closes #6700
Benchmark results (TPC-H SF=1 in memory, average of 20 runs): q5, q7, q17, q18, and q21 show some improvement; most queries do not challenge the join, or this particular code path of the join (building the hashmap), very much.
Using the bench script (tpch_mem):
Rationale for this change
Currently, we're using a `smallvec` to keep the indices with the same value. This hurts performance for cases where we have multiple values, as it slows down insertion as well as retrieval of the indices (if the smallvec allocates, it's not cache-efficient). Also, the memory usage of the in-memory structure is reduced by paying the `Vec` overhead only once rather than per value: the required memory for storing the indices is roughly reduced by 4x (8 bytes per value vs 32 bytes).

We can use a chained-list datastructure in a preallocated array where the next indices are stored (adapted from *Balancing vectorized query execution with bandwidth-optimized storage*).
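As a rough sanity check of that 4x figure (my own back-of-the-envelope math, assuming a 64-bit target where `SmallVec<[u64; 1]>` is 24 bytes, consistent with the "32 bytes per `(u64, SmallVec<[u64; 1]>)`" comment in the reservation code above):

```rust
// Chained-list layout: one u64 per build-side row in the `next` array.
// SmallVec layout: ~32 bytes per entry (8-byte hash + 24-byte SmallVec<[u64; 1]>).
let rows: usize = 1_000_000;
let chained_bytes = 8 * rows;   // 8 MB of index storage
let smallvec_bytes = 32 * rows; // ~32 MB when most hashes are distinct
assert_eq!(smallvec_bytes / chained_bytes, 4);
```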
What changes are included in this PR?
Are these changes tested?
Yes, by existing tests.
Are there any user-facing changes?
No