Improve aggregate performance with specialized groups accumulator for single string group by #7064

alamb · 2023-07-24T10:42:25Z

Is your feature request related to a problem or challenge?

DataFusion could be made faster for queries that have a GROUP BY <string> column

For example, in ClickBench Q34

Q34: SELECT "URL", COUNT(*) AS c FROM hits GROUP BY "URL" ORDER BY c DESC LIMIT 10;

You can run this query from a datafusion checkout like this (using the code in #7060, which hopefully will be merged shortly):

# get data
./benchmarks/bench.sh data clickbench_1
# run benchmark
cargo run --release  --bin dfbench -- clickbench --query 34

Here is the profile from running on 16 cores:

cargo run --release  --bin dfbench -- clickbench --query 34 --iterations 10 --partitions 16

Describe the solution you'd like

I would like a special-cased GroupValues for this case of a single string (hopefully Utf8, LargeUtf8, Binary, and LargeBinary) column that:

Does no allocations per group (aka stores all strings in some single contiguous location)
Avoids the Row format / copy of values
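As a rough illustration of the "single contiguous location" idea, here is a minimal sketch using only the standard library. The `GroupStrings` type and its methods are hypothetical names invented for this example, not DataFusion or arrow-rs APIs; the layout (one byte buffer plus an offsets vector) simply mirrors how Arrow string arrays are laid out.

```rust
/// Hypothetical store: all group strings live in one contiguous byte buffer.
struct GroupStrings {
    /// offsets[i]..offsets[i + 1] is the byte range of group i in `data`
    offsets: Vec<usize>,
    /// concatenated UTF-8 bytes of every group value
    data: Vec<u8>,
}

impl GroupStrings {
    fn new() -> Self {
        Self { offsets: vec![0], data: Vec::new() }
    }

    /// Appends a value and returns its group index. There is no per-group
    /// allocation, only amortized growth of the two shared vectors.
    fn push(&mut self, value: &str) -> usize {
        self.data.extend_from_slice(value.as_bytes());
        self.offsets.push(self.data.len());
        self.offsets.len() - 2
    }

    /// Returns the string stored for `group`.
    fn get(&self, group: usize) -> &str {
        let bytes = &self.data[self.offsets[group]..self.offsets[group + 1]];
        // Only whole `&str` values were appended, so each range is valid UTF-8.
        std::str::from_utf8(bytes).expect("valid UTF-8")
    }
}
```

Because the values are already stored as offsets plus bytes, producing the output column would mostly be a matter of handing over the two buffers rather than copying each value.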

Other ideas that could make this faster:

Small String optimization
special case ASCII (to avoid UTF-8 checks for data, like TPCH, that contains only ASCII characters)

"Small String optimization" refers to the format described in the Umbra paper.

This would have to be adapted for Rust / safety, but the same general idea applies: inline the first few bytes of the string into the hash table for quick "is it equal" comparisons, and keep an offset to an external area for larger strings.

Describe alternatives you've considered

No response

Additional context

@tustvold's changes in #6969 and #7043 should make it very easy to code this up as a different GroupValues implementation


Dandandan · 2023-08-03T08:24:02Z

I think for single column group, we could "just"

use the arrow string / binary builder and store the string data outside of the hashtable -> this also makes it cheap to produce the final column
make sure to memoize hashes (repeatedly hashing is likely slow)
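The two points above could be sketched roughly as follows, using only the standard library. Everything here (`StringGroups`, `intern`, `grow`, the linear-probing table, the use of `DefaultHasher`) is an assumption made for illustration and is not the arrow builder or the DataFusion implementation. The string bytes live outside the table, and each group's hash is stored once so that growing the table never rehashes the strings.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical single-column string interner (names are illustrative only).
struct StringGroups {
    /// open-addressing slots holding a group index, or None when empty
    slots: Vec<Option<usize>>,
    /// memoized hash of every group, indexed by group index
    hashes: Vec<u64>,
    /// offsets[i]..offsets[i + 1] is the byte range of group i in `data`
    offsets: Vec<usize>,
    data: Vec<u8>,
}

fn hash_bytes(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

impl StringGroups {
    fn new() -> Self {
        Self { slots: vec![None; 8], hashes: vec![], offsets: vec![0], data: vec![] }
    }

    fn value(&self, group: usize) -> &[u8] {
        &self.data[self.offsets[group]..self.offsets[group + 1]]
    }

    /// Returns the group index for `value`, creating a new group if needed.
    fn intern(&mut self, value: &str) -> usize {
        let bytes = value.as_bytes();
        let hash = hash_bytes(bytes);
        let mask = self.slots.len() - 1;
        let mut pos = hash as usize & mask;
        // Linear probing: compare the memoized hash first, bytes only on a match.
        while let Some(group) = self.slots[pos] {
            if self.hashes[group] == hash && self.value(group) == bytes {
                return group;
            }
            pos = (pos + 1) & mask;
        }
        let group = self.hashes.len();
        self.slots[pos] = Some(group);
        self.hashes.push(hash);
        self.data.extend_from_slice(bytes);
        self.offsets.push(self.data.len());
        // Keep the table at most half full so probing always finds an empty slot.
        if self.hashes.len() * 2 > self.slots.len() {
            self.grow();
        }
        group
    }

    /// Doubles the table, re-placing groups from their memoized hashes
    /// without reading the string bytes again.
    fn grow(&mut self) {
        let mask = self.slots.len() * 2 - 1;
        let mut slots: Vec<Option<usize>> = vec![None; mask + 1];
        for (group, &hash) in self.hashes.iter().enumerate() {
            let mut pos = hash as usize & mask;
            while slots[pos].is_some() {
                pos = (pos + 1) & mask;
            }
            slots[pos] = Some(group);
        }
        self.slots = slots;
    }
}
```

Hash collisions are resolved by the probing loop itself, since a slot is only accepted when both the stored hash and the stored bytes match.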

anything I missed here?

alamb · 2023-08-03T13:11:27Z

> anything I missed here?

I think that would get most of the benefit.

Another potential optimization would be to use the 'small string optimization' so the hash table comparison could be done inline and need only one level of indirection

So in the hash table store not only the group_index but also another 12 bytes:

0-3: length of the string group key value (u32)
4-7: first four bytes of the string value itself (u32)
8-11: offset into string buffer (u32)

That way group key comparisons are faster because:

If the first 8 bytes (length and prefix) are different, you know the group value is different
We can check the actual values in the string builder without an extra lookup in the offset buffer
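To make the 12-byte layout concrete, here is a small sketch of how the comparison might work. The `KeyHeader` type, its constructor, and `matches` are hypothetical names for this example only; zero-padding the prefix of strings shorter than four bytes is also an assumption, not something stated above.

```rust
/// Hypothetical 12-byte entry stored next to the group index in the hash table.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct KeyHeader {
    len: u32,    // bytes 0-3: length of the string
    prefix: u32, // bytes 4-7: first four bytes of the string (zero padded)
    offset: u32, // bytes 8-11: start of the string in the shared buffer
}

impl KeyHeader {
    fn new(value: &[u8], offset: u32) -> Self {
        let mut prefix = [0u8; 4];
        let n = value.len().min(4);
        prefix[..n].copy_from_slice(&value[..n]);
        Self { len: value.len() as u32, prefix: u32::from_le_bytes(prefix), offset }
    }

    /// Returns true if the string described by this entry equals `probe`.
    fn matches(&self, probe: &[u8], buffer: &[u8]) -> bool {
        let probe_header = KeyHeader::new(probe, 0);
        // Fast path: if length or prefix differ, the strings differ,
        // and the buffer is never touched.
        if self.len != probe_header.len || self.prefix != probe_header.prefix {
            return false;
        }
        // Strings of at most 4 bytes are fully described by len + prefix.
        if self.len <= 4 {
            return true;
        }
        // Slow path: one direct jump into the buffer using the stored offset.
        let start = self.offset as usize;
        buffer[start..start + self.len as usize] == *probe
    }
}
```

In this sketch most mismatches are rejected by comparing two `u32` values, and only keys that share both length and prefix pay for a memory access into the string buffer.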


Perhaps we could learn from the `View` implementation @tustvold was working on here

https://github.com/apache/arrow-rs/pull/4585/files#diff-694565dedb86d29ae2474ae09d51867a98a534543a45d79fcc3506b2958b73baR26

alamb · 2024-01-04T12:25:20Z

Here is a suggestion for a specific data structure for a similar idea from @Dandandan 👍
https://github.com/apache/arrow-datafusion/pull/8721/files#r1441586235

/// Contains hashes and offsets for given hash (+ potential collisions), use `RawTable` for extra speed
uniques: HashMap<u64, SmallVec<[u64; 1]>>,
/// actual string/byte data, can be emitted cheaply / free
values: BufferBuilder<u8>,

tustvold · 2024-01-04T12:28:31Z

If you use the raw_entry API on HashMap, or RawTable, you should be able to avoid needing the SmallVec to handle hash collisions - as they will be handled for you more efficiently by the hash probing setup.

Dandandan · 2024-01-04T12:33:31Z

> If you use the raw_entry API on HashMap, or RawTable, you should be able to avoid needing the SmallVec to handle hash collisions - as they will be handled for you more efficiently by the hash probing setup.

Yes, good addition 👍

jayzhan211 · 2024-01-08T14:40:49Z

This task seems interesting. If no one takes this on, I would like to give it a try.

Probably work on the follow up on #8721 first. Distinct Accumulator for Bytes

alamb · 2024-01-08T16:02:11Z

> Probably work on the follow up on #8721 first. Distinct Accumulator for Bytes

I think that is a great idea

jayzhan211 · 2024-01-10T01:27:03Z

I drafted the idea, does this make sense?

// Short String Optimized HashSet for String
// Equivalent to HashSet<String> but with better memory usage (Speed unsure)
struct SSOStringHashSet {
    // header: u128
    // short string: length(4bytes) + data(12bytes)
    // long string:  length(4bytes) + prefix(4bytes) + offset(8bytes)
    header_set: HashSet<u128, RandomState>,
    // map<hash of long string w/o 4 bytes prefix, offset in buffer>
    long_string_map: HashMap<u64, u64, RandomState>,
    buffer: BufferBuilder<u8>,
}

impl SSOStringHashSet {
    fn insert(&mut self, value: &str) {
        let bytes = value.as_bytes();
        // widen to u128 before shifting: shifting a usize left by 96 bits overflows
        let len_bits = (bytes.len() as u128) << 96;
        if bytes.len() <= 12 {
            // pack up to 12 data bytes into the low 96 bits
            let data = bytes.iter().fold(0u128, |acc, &x| acc << 8 | x as u128);
            self.header_set.insert(len_bits | data);
        } else {
            // 1) hash the string w/o 4 bytes prefix
            // 2) check if the hash exists in the map
            // 3) if exists, insert the offset into the header
            // 4) if not exists, insert the hash and offset into the map

            // the 4 byte prefix occupies bits 64..96
            let prefix = bytes[..4].iter().fold(0u128, |acc, &x| acc << 8 | x as u128) << 64;
            let mut long_string_header = len_bits | prefix;

            // borrow the suffix as a slice instead of collecting a Vec<&u8>
            let suffix = &bytes[4..];

            // NYI hash_bytes: hash &[u8] to u64, similar to hashbrown `make_hash` for &[u8]
            let hashed_suffix = hash_bytes(suffix);
            if let Some(offset) = self.long_string_map.get(&hashed_suffix) {
                long_string_header |= *offset as u128;
            } else {
                let offset = self.buffer.len();
                self.long_string_map.insert(hashed_suffix, offset as u64);
                long_string_header |= offset as u128;
                self.buffer.append_slice(suffix);
            }
            self.header_set.insert(long_string_header);
        }
    }
}

tustvold · 2024-01-10T06:36:56Z

I think that won't currently handle hash collisions, https://github.com/apache/arrow-rs/blob/master/arrow-array%2Fsrc%2Fbuilder%2Fgeneric_bytes_dictionary_builder.rs#L211 might provide some inspiration here

alamb · 2024-01-10T19:03:13Z

If you want to try the short string optimization, it might make sense to create a struct that encapsulates the key directly, perhaps something like

struct StringKey {
  // data length, in chars
  len: u64,
  // if the data length, in *bytes* is less than 8, stored as a u64 here,
  // otherwise, this stores the offset into buffer
  offset_or_inline: u64,
}

Then the key could have methods like

impl StringKey {
  // returns the data pointed at by this key (either inlined or in buffer, depending on self.len)
  fn val(&self, buffer: &[u8]) -> &str {
   ...
  }
}

As well as various hashing, etc

Then your struct could look something like

// Short String Optimized HashSet for String
// Equivalent to HashSet<String> but with better memory usage (Speed unsure)
struct SSOStringHashSet {
    inner: HashSet<StringKey, RandomState>,
    // map<hash of long string w/o 4 bytes prefix, offset in buffer>
    long_string_map: HashMap<u64, u64, RandomState>,
    buffer: BufferBuilder<u8>,
}
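One way the `StringKey` idea could be fleshed out is sketched below. This is an illustration under several assumptions rather than the code from any PR: the inline slot is a `[u8; 8]` instead of a `u64` so that `val` can hand out a borrowed slice, the length is counted in bytes, values of up to 8 bytes (inclusive) are inlined, and the `new` constructor is invented for the example.

```rust
/// Hypothetical key: short values are stored inline, long ones by offset.
#[derive(Clone, Copy, Debug)]
struct StringKey {
    /// data length in bytes
    len: u64,
    /// the value itself when `len <= 8`, otherwise the little-endian
    /// offset of the value in the external buffer
    offset_or_inline: [u8; 8],
}

impl StringKey {
    /// Builds a key, appending the value to `buffer` only when it is too
    /// long to be inlined.
    fn new(value: &str, buffer: &mut Vec<u8>) -> Self {
        let bytes = value.as_bytes();
        let mut slot = [0u8; 8];
        if bytes.len() <= 8 {
            slot[..bytes.len()].copy_from_slice(bytes);
        } else {
            slot = (buffer.len() as u64).to_le_bytes();
            buffer.extend_from_slice(bytes);
        }
        Self { len: bytes.len() as u64, offset_or_inline: slot }
    }

    /// Returns the data for this key, either inlined or read from `buffer`.
    fn val<'a>(&'a self, buffer: &'a [u8]) -> &'a str {
        let len = self.len as usize;
        let bytes = if len <= 8 {
            &self.offset_or_inline[..len]
        } else {
            let start = u64::from_le_bytes(self.offset_or_inline) as usize;
            &buffer[start..start + len]
        };
        std::str::from_utf8(bytes).expect("keys are built from &str")
    }
}
```

Note that equality and hashing for such a key would need to go through `val` (or compare the underlying bytes), since two equal long strings may be stored at different offsets.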

alamb · 2024-01-10T23:23:17Z

I started hacking on a potential PR for this here: #8827. Maybe we can collaborate, @jayzhan211. I need a little more to get it working in general, and then we'll have to implement the emit/clear functions too

jayzhan211 · 2024-01-13T07:19:15Z

@alamb @tustvold I have done the first draft for distinct count #8849. Slightly different from the suggestions above

alamb · 2024-01-16T17:18:17Z

> @alamb @tustvold I have done the first draft for distinct count #8849. Slightly different from the suggestions above

An update: I plan to help @jayzhan211 with #8849 and then revisit this PR (ideally reusing what we have come up with in #8849).

alamb · 2024-01-29T10:48:17Z

#8849 should be merged shortly, hopefully today. Then I will work on polishing up #8827.

During this exercise we came up with some other ideas on how to improve performance with string grouping keys which I also hope to write up this week

alamb added the enhancement New feature or request label Jul 24, 2023

alamb changed the title ~~Improve aggregate performance by special casing single string columns~~ Improve aggregate performance by special casing single string group by Jul 24, 2023

This was referenced Jul 24, 2023

[EPIC] (Even More) Grouping / Group By / Aggregation Performance #7000

Open

Improve aggregate performance by special casing single group keys #6969

Closed

tustvold self-assigned this Jul 24, 2023

alamb unassigned tustvold Jul 31, 2023

avantgardnerio mentioned this issue Aug 4, 2023

Memory is coupled to group by cardinality, even when the aggregate output is truncated by a limit clause #7191

Closed

alamb mentioned this issue Aug 4, 2023

[EPIC] A collection of Sort + Limit / Top K optimizations #7195

Open

11 tasks

alamb mentioned this issue Jan 4, 2024

feat: native types in DistinctCountAccumulator for primitive types #8721

Merged

alamb mentioned this issue Jan 10, 2024

Implement specialized group values for single Uft8/LargeUtf8/Binary/LargeBinary column #8827

Merged

7 tasks

alamb changed the title ~~Improve aggregate performance by special casing single string group by~~ Improve aggregate performance with specialized groups accumulator for special casing single string group by Jan 12, 2024

alamb assigned alamb and tustvold Jan 12, 2024

alamb changed the title ~~Improve aggregate performance with specialized groups accumulator for special casing single string group by~~ Improve aggregate performance with specialized groups accumulator for single string group by Jan 12, 2024

alamb unassigned tustvold Jan 12, 2024

This was referenced Jan 14, 2024

Add 'clickbench_extended' benchmark #8860

Closed

DataFusion weekly project plan (Andrew Lamb) - Jan 15, 2024 #8864

Closed

Optimize COUNT( DISTINCT ...) for strings (up to 9x faster) #8849

Merged

alamb mentioned this issue Jan 21, 2024

DataFusion weekly project plan (Andrew Lamb) - Jan 22, 2024 #8933

Closed

9 tasks

alamb mentioned this issue Jan 28, 2024

DataFusion weekly project plan (Andrew Lamb) - Jan 29, 2024 #9030

Closed

6 tasks

This was referenced Feb 4, 2024

DataFusion weekly project plan (Andrew Lamb) - Feb 5, 2024 #9121

Closed

Add string aggregate grouping fuzz test, add MemTable::with_sort_exprs #9190

Merged

Improved performance for streaming grouping with single string columns #9195

Open

alamb mentioned this issue Feb 12, 2024

DataFusion weekly project plan (Andrew Lamb) - Feb 12, 2024 #9200

Closed

8 tasks

alamb closed this as completed in #8827 Feb 20, 2024

alamb mentioned this issue Jul 15, 2024

2024 Q3-Q4 Roadmap? #11442

Closed
