
Evaluate vectorized hash table for group aggregation #7095

Closed
sunchao opened this issue Jul 26, 2023 · 15 comments
Labels
enhancement New feature or request

Comments

sunchao (Member) commented Jul 26, 2023

Is your feature request related to a problem or challenge?

Currently DF uses a RawTable from hashbrown as the hash table implementation in group aggregations. This requires first converting the input batches into a row format, and then processing the converted rows one by one: hash probing, equality checks, and creating new entries as needed.
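For context, here is a minimal, hypothetical sketch of that row-conversion step using the arrow-rs RowConverter (the same machinery discussed in later comments); it is illustrative only, not DataFusion's actual code:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::datatypes::DataType;
use arrow::row::{RowConverter, SortField};

fn main() {
    // Convert a column-oriented batch into the row format up front...
    let converter = RowConverter::new(vec![SortField::new(DataType::Int64)]).unwrap();
    let cols: Vec<ArrayRef> = vec![Arc::new(Int64Array::from(vec![1, 2, 1]))];
    let rows = converter.convert_columns(&cols).unwrap();
    // ...then the hash table probes/compares the converted rows one by one;
    // equal group keys produce byte-identical rows.
    assert!(rows.row(0) == rows.row(2));
    assert!(rows.row(0) != rows.row(1));
}
```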

A different approach, as discussed in the Photon paper (and also used by DuckDB), is to adopt a vectorized hash table design, so that each of the above steps can be vectorized. In addition, this allows us to skip the row conversion and operate directly on the input batches.
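As a rough illustration, here is a hypothetical sketch of such a batched probe (not taken from the Photon paper, DuckDB, or the draft mentioned below); keys are simplified to u64 and table resizing is omitted, so capacity must exceed the number of groups:

```rust
struct VectorizedTable {
    slots: Vec<Option<(u64, usize)>>, // open-addressed (hash, group_id) slots
    group_keys: Vec<u64>,             // the key stored for each group_id
    mask: usize,
}

impl VectorizedTable {
    fn new(capacity: usize) -> Self {
        let cap = capacity.next_power_of_two();
        Self { slots: vec![None; cap], group_keys: Vec::new(), mask: cap - 1 }
    }

    /// Map every key in the batch to a dense group id, phase by phase.
    fn probe_batch(&mut self, keys: &[u64], hashes: &[u64], group_ids: &mut Vec<usize>) {
        group_ids.clear();
        group_ids.resize(keys.len(), usize::MAX);
        // Phase 1: initial slot for every row, one tight branch-free loop
        // over the whole batch (a candidate for auto-vectorization).
        let mut slots: Vec<usize> = hashes.iter().map(|h| *h as usize & self.mask).collect();
        let mut pending: Vec<usize> = (0..keys.len()).collect();
        while !pending.is_empty() {
            let mut collided = Vec::new();
            // Phases 2-4: occupancy test, equality check, and insertion of
            // misses, each applied to every still-pending row.
            for &row in &pending {
                let slot = slots[row];
                let entry = self.slots[slot]; // Option<(u64, usize)> is Copy
                match entry {
                    None => {
                        let gid = self.group_keys.len();
                        self.group_keys.push(keys[row]);
                        self.slots[slot] = Some((hashes[row], gid));
                        group_ids[row] = gid;
                    }
                    Some((h, gid)) if h == hashes[row] && self.group_keys[gid] == keys[row] => {
                        group_ids[row] = gid;
                    }
                    _ => {
                        // Collision: linear-probe the next slot on the next pass.
                        slots[row] = (slot + 1) & self.mask;
                        collided.push(row);
                    }
                }
            }
            pending = collided;
        }
    }
}

fn main() {
    let keys = [10u64, 20, 10, 30];
    let hashes: Vec<u64> = keys.iter().map(|k| k.wrapping_mul(0x9E37_79B9_7F4A_7C15)).collect();
    let mut table = VectorizedTable::new(16);
    let mut group_ids = Vec::new();
    table.probe_batch(&keys, &hashes, &mut group_ids);
    assert_eq!(group_ids, vec![0, 1, 0, 2]);
}
```

Each phase is a tight loop over contiguous arrays, which avoids row-format conversion and gives the compiler a chance to auto-vectorize.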

Internally we have a draft implementation of this, and it has shown considerable improvements (even without SIMD, although with a lot of unsafes 😂) over the current hash aggregation approach, so we'd like to contribute it to DF and see if it can help improve its aggregation performance even further.

Describe the solution you'd like

Design & implement a separate vectorized hash table. It can either replace the existing RawTable inside GroupValuesRows, or we can have a separate GroupValues implementation.
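As a sketch of the second option, here is a simplified paraphrase of the GroupValues abstraction (hypothetical and trimmed down; the real DataFusion trait returns Results and has additional methods for emitting output and accounting for memory):

```rust
use arrow::array::ArrayRef;

/// Simplified, paraphrased sketch of the extension point; not the exact
/// DataFusion trait definition.
trait GroupValues {
    /// Map each input row to a dense group index, creating groups as needed.
    fn intern(&mut self, cols: &[ArrayRef], groups: &mut Vec<usize>);
}

/// A hypothetical vectorized implementation could live alongside the
/// existing row-based one instead of replacing its internals.
struct GroupValuesVectorized {
    // vectorized hash table state would go here
}

impl GroupValues for GroupValuesVectorized {
    fn intern(&mut self, _cols: &[ArrayRef], _groups: &mut Vec<usize>) {
        // Batched phases: hash all rows, probe all slots, run a vectorized
        // equality check, then insert the misses.
        unimplemented!("vectorized batch probe")
    }
}
```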

Describe alternatives you've considered

Not to implement this.

Additional context

No response

Dandandan (Contributor) commented:

@sunchao that sounds very exciting 🚀

doki23 (Contributor) commented Mar 13, 2024

Hi @sunchao, are you still moving it forward?

sunchao (Member, Author) commented Mar 13, 2024

@doki23 yes I'm still planning to. I have a POC branch for this work: https://github.com/sunchao/arrow-datafusion/commits/vectorized-hash-table/ but I haven't had time to go back to it yet. Will try to do it soon.

Lordworms (Contributor) commented:

> @doki23 yes I'm still planning to. I have a POC branch for this work: https://github.com/sunchao/arrow-datafusion/commits/vectorized-hash-table/ but I haven't had time to go back to it yet. Will try to do it soon.

I'd like to explore this ticket more. Should I start from this branch?

alamb (Contributor) commented Aug 22, 2024

@viirya / @sunchao / @andygrove / @kazuyukitanimura , this idea came up in the context of #11943 -- I wonder if you have any updates or code we might be able to look at

sunchao (Member, Author) commented Aug 22, 2024

@Lordworms @alamb the code is still in my branch: https://github.com/sunchao/arrow-datafusion/commits/vectorized-hash-table/ and hasn't been updated for a while, so I'm not sure how easily it can be rebased.

One missing piece of the work is eliminating the convert_columns and convert_rows cost. Last time I measured, it took ~30% of the total time. For this, I think we need to extend the Array builders to add a get API too, so that while we are building the group values (which are represented via Arrow vectors) we can still look at the elements in various operations such as equality checks.
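A hypothetical sketch of such a get extension (this is not an existing arrow-rs API; values_slice on the primitive builder is used here to read back in-progress values, and a complete version would also consult the validity buffer):

```rust
use arrow::array::Int64Builder;

/// Hypothetical extension trait: read an element back out of an in-progress
/// builder so equality checks can run against group values without first
/// finishing the builder or converting rows.
trait BuilderGet {
    type Value;
    fn get(&self, index: usize) -> Option<Self::Value>;
}

impl BuilderGet for Int64Builder {
    type Value = i64;
    fn get(&self, index: usize) -> Option<i64> {
        // `values_slice` exposes the values appended so far; a complete
        // implementation would also check the null (validity) buffer.
        self.values_slice().get(index).copied()
    }
}

fn main() {
    let mut group_values = Int64Builder::new();
    group_values.append_value(42);
    group_values.append_value(7);
    // Equality check against a probe value, with no row conversion.
    assert_eq!(group_values.get(1), Some(7));
}
```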

jayzhan211 (Contributor) commented Aug 23, 2024

I tried the code in the branch but didn't see a performance improvement (it was slightly slower). I think the RowConverter is what matters.

https://github.com/apache/datafusion/pull/12124/files

sunchao (Member, Author) commented Aug 23, 2024

Yes, that matches my last measurement too. At my last company we had an internal implementation that bypassed the row-column conversion, and it showed good improvements over the current implementation.

Dandandan (Contributor) commented:

I think this has been (mostly) implemented.

I'm still interested to see whether we can get any faster than hashbrown, but so far I haven't seen any evidence to support that.

jayzhan211 (Contributor) commented Dec 12, 2024

> I think this has been (mostly) implemented.
>
> I'm still interested to see whether we can get any faster than hashbrown, but so far I haven't seen any evidence to support that.

Is the current version of the code vectorized? I thought "vectorized" implies that we run with SIMD.

> I'm still interested to see whether we can get any faster than hashbrown, but so far I haven't seen any evidence to support that.

I don't think there is a 'trivial' way to outperform hashbrown. I suspect that any performance improvements they achieved are due to factors other than having a better hashing mechanism than hashbrown.

alamb (Contributor) commented Dec 12, 2024

> I don't think there is a 'trivial' way to outperform hashbrown. I suspect that any performance improvements they achieved are due to factors other than having a better hashing mechanism than hashbrown.

Vectorizing the hash calculation is the big one, and DataFusion already does that.

There are more "exotic" hash table strategies that supposedly are able to vectorize the lookup / collision comparisons too (e.g. look up multiple keys in one instruction). The Umbra papers talk about this, but that codebase is not open source and I am not sure how much it really improves performance in practice.
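For a flavor of what vectorizing the collision comparisons can look like, here is a small SWAR-style sketch, similar in spirit to hashbrown's portable fallback, which compares a group of 1-byte control tags in one word-sized operation; as in hashbrown, rare false positives are acceptable because a full key comparison follows:

```rust
/// Compare eight 1-byte hash tags against `tag` in one 64-bit operation.
/// Returns a word whose high bit is set in each matching byte. The classic
/// zero-byte trick below can produce rare false positives when a borrow
/// propagates across bytes; callers must confirm with a full key comparison.
fn match_tags(group: u64, tag: u8) -> u64 {
    // Broadcast the tag into all 8 bytes.
    let broadcast = (tag as u64) * 0x0101_0101_0101_0101;
    // XOR leaves a zero byte wherever a byte of `group` equals `tag`.
    let diff = group ^ broadcast;
    // Zero-byte detection: set the high bit of each (likely) matching byte.
    diff.wrapping_sub(0x0101_0101_0101_0101) & !diff & 0x8080_8080_8080_8080
}

fn main() {
    let group = u64::from_le_bytes([0x12, 0xAB, 0x00, 0xAB, 0x7F, 0xAB, 0xFF, 0x01]);
    // Bytes 1, 3 and 5 equal 0xAB, so bits 15, 31 and 47 are set.
    assert_eq!(match_tags(group, 0xAB), 0x0000_8000_8000_8000);
}
```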

Dandandan (Contributor) commented:

> I don't think there is a 'trivial' way to outperform hashbrown. I suspect that any performance improvements they achieved are due to factors other than having a better hashing mechanism than hashbrown.
>
> Vectorizing the hash calculation is the big one, and DataFusion already does that.
>
> There are more "exotic" hash table strategies that supposedly are able to vectorize the lookup / collision comparisons too (e.g. look up multiple keys in one instruction). The Umbra papers talk about this, but that codebase is not open source and I am not sure how much it really improves performance in practice.

Yeah, my hope is that we can find a way to integrate this with the hashbrown APIs (so we can still benefit from the well-optimized implementation).

jayzhan211 (Contributor) commented Dec 12, 2024

> Vectorizing the hash calculation is the big one, and DataFusion already does that.

Do you mean hash_utils::create_hashes is vectorized (i.e. uses SIMD)? Actually, I haven't been able to find such SIMD instructions in group aggregation yet. Maybe my compiler configuration is incorrect.

alamb (Contributor) commented Dec 13, 2024

> Do you mean hash_utils::create_hashes is vectorized (i.e. uses SIMD)? Actually, I haven't been able to find such SIMD instructions in group aggregation yet. Maybe my compiler configuration is incorrect.

What I really mean is that https://docs.rs/datafusion/latest/datafusion/common/hash_utils/fn.create_hashes.html creates the hashes a column at a time (vectorized) rather than row by row.

At the very least this is likely faster than calling it on each row as there is one function call per batch rather than per row.

I think it also gives the compiler a better chance to actually use SIMD CPU instructions to compute the hashes. This is what I believe is referred to as "auto vectorization".

However, I have not verified that the Rust compiler actually does use SIMD instructions for create_hashes. Maybe that is something worth looking into.

This is what @XiangpengHao did when optimizing StringView: he looked carefully at the assembly produced by https://godbolt.org/z/1jhc1hae1 and verified/tweaked the code until it was using the expected instructions.
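For reference, a minimal sketch of the per-batch call pattern (assuming the datafusion and ahash crates; module paths may differ between versions):

```rust
use std::sync::Arc;

use ahash::RandomState;
use arrow::array::{ArrayRef, Int64Array, StringArray};
use datafusion::common::hash_utils::create_hashes;

fn main() -> datafusion::common::Result<()> {
    let cols: Vec<ArrayRef> = vec![
        Arc::new(Int64Array::from(vec![1, 2, 1])),
        Arc::new(StringArray::from(vec!["a", "b", "a"])),
    ];
    // One reusable buffer per batch; each row's hash combines all columns.
    let mut hashes_buffer = vec![0u64; 3];
    // A single call hashes every row of every column: one function call per
    // batch rather than per row.
    let hashes = create_hashes(&cols, &RandomState::new(), &mut hashes_buffer)?;
    println!("{hashes:?}");
    Ok(())
}
```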

Dandandan (Contributor) commented Dec 13, 2024

create_hashes is visible, but not often super hot in profiling, so I think there might be lower-hanging fruit to optimize.

For example, for hash joins: speeding up the algorithm for traversing the chain, speeding up the collision checks, etc.
For aggregates: I think there might be some further opportunities for reusing the intermediate hash tables, improving high-cardinality cases, etc.
