Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve aggregate performance with specialized groups accumulator for single string group by #7064

Closed
alamb opened this issue Jul 24, 2023 · 14 comments · Fixed by #8827
Closed
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jul 24, 2023

Is your feature request related to a problem or challenge?

DataFusion could be made faster for queries that have a GROUP BY <string> column

For example, in ClickBench Q34

Q34: SELECT "URL", COUNT(*) AS c FROM hits GROUP BY "URL" ORDER BY c DESC LIMIT 10;

You can run this query from a datafusion checkout like this (using the code in #7060, which hopefully will be merged shortly):

# get data
./benchmarks/bench.sh data clickbench_1
# run benchmark
cargo run --release  --bin dfbench -- clickbench --query 34

Here is the profile from running 16 cores:

cargo run --release  --bin dfbench -- clickbench --query 34 --iterations 10 --partitions 16
Screenshot 2023-07-24 at 7 23 09 AM

Describe the solution you'd like

I would like a special cased GroupsValue for this case of a single string (hopefully Utf8, LargeUTf8, Binary, and LargeBinary) column that:

  1. Does no allocations per group (aka stores all strings in some single contiguous location)
  2. Avoids the Row format / copy of values

Other ideas that could make this faster:

  1. Small String optimization
  2. special case ASCII (to avoid UTF8 checks for data, like TPCH, that does not contain UTF8 data)

"Small String optimization" refers to the format described in the umbra paper,

Screenshot 2023-07-24 at 6 38 01 AM

This would have to be adapted for Rust / safetly but the same general idea applies (inlining the first few bytes of the string into the hash table for quick "is it equal" comparisons, and then having an offset to an external area for larger strings)

Describe alternatives you've considered

No response

Additional context

@tustvold's changes in #6969 and #7043 should make it very easy to code this up as a different GroupValues implementation

@alamb alamb added the enhancement New feature or request label Jul 24, 2023
@alamb alamb changed the title Improve aggregate performance by special casing single string columns Improve aggregate performance by special casing single string group by Jul 24, 2023
@tustvold tustvold self-assigned this Jul 24, 2023
@Dandandan
Copy link
Contributor

Dandandan commented Aug 3, 2023

I think for single column group, we could "just"

  • use the arrow string / binary builder and store the strings data outside of the hashtable -> makes it also cheap to produce the final column
  • make sure to memoize hashes (repeatedly hashing is likely slow)

anything I missed here?

@alamb
Copy link
Contributor Author

alamb commented Aug 3, 2023

anything I missed here?

I think that would get most of the benefit.

Another potential optimization would be to potentially use the 'small string optimization' so the hash table comparison could be done inline and have one level of indirection

So in the hash table store not only the group_index but also another 12 bytes:

0-3: length of the string group key value (u32)
4-7: first four bytes of the string value itself (u32)
8-11: offset into string buffer (u32)

That way group key comparisons are faster because:

  1. If the first 8 bytes are different you know the group value is different
  2. We can check the actual values in the string builder without an extra level of computation on the offset buffer

Perhaps we could learn from the `View` implementation @tustvold was working on here

https://github.com/apache/arrow-rs/pull/4585/files#diff-694565dedb86d29ae2474ae09d51867a98a534543a45d79fcc3506b2958b73baR26


@alamb
Copy link
Contributor Author

alamb commented Jan 4, 2024

Here is a suggestion for a specific data structure for a similar idea from @Dandandan 👍
https://github.com/apache/arrow-datafusion/pull/8721/files#r1441586235

/// Contains hashes and offsets for given hash (+ potential collisions), use `RawTable` for extra speed
uniques: HashMap<u64, SmallVec<u64; 1>>,
/// actual string/byte data, can be emitted cheaply / free
values: BufferBuilder<u8>,

@tustvold
Copy link
Contributor

tustvold commented Jan 4, 2024

If you use the raw_entry API on HashMap, or RawTable, you should be able to avoid needing the SmallVec to handle hash collisions - as they will be handled for you more efficiently by the hash probing setup.

@Dandandan
Copy link
Contributor

If you use the raw_entry API on HashMap, or RawTable, you should be able to avoid needing the SmallVec to handle hash collisions - as they will be handled for you more efficiently by the hash probing setup.

Yes, good addition 👍

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jan 8, 2024

This task seems interesting, if no one take on this, I would like to give it a try.

Probably work on the follow up on #8721 first. Distinct Accumulator for Bytes

@alamb
Copy link
Contributor Author

alamb commented Jan 8, 2024

Probably work on the follow up on #8721 first. Distinct Accumulator for Bytes

I think that is a great idea

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jan 10, 2024

I draft the idea, does this make sense?

// Short String Optimizated HashSet for String
// Equivalent to HashSet<String> but with better memory usage (Speed unsure)
struct SSOStringHashSet {
    // header: u128
    // short string: length(4bytes) + data(12bytes)
    // long string:  length(4bytes) + prefix(4bytes) + offset(8bytes)
    header_set: HashSet<u128, RandomState>,
    // map<hash of long string w/o 4 bytes prefix, offset in buffer>
    long_string_map: HashMap<u64, u64, RandomState>,
    buffer: BufferBuilder<u8>,
}

impl SSOStringHashSet {
    fn insert(&mut self, value: &str) {
        let value_len = value.len();
        if value_len <= 12 {
            let mut short_string_header = 0u128;
            short_string_header |= (value_len << 96) as u128;
            short_string_header |= value
                .as_bytes()
                .iter()
                .fold(0u128, |acc, &x| acc << 8 | x as u128);
            self.header_set.insert(short_string_header);
        } else {
            // 1) hash the string w/o 4 bytes prefix
            // 2) check if the hash exists in the map
            // 3) if exists, insert the offset into the header
            // 4) if not exists, insert the hash and offset into the map

            let mut long_string_header = 0u128;
            long_string_header |= (value_len << 96) as u128;
            long_string_header |= (value
                .as_bytes()
                .iter()
                .take(4)
                .fold(0u128, |acc, &x| acc << 8 | x as u128)
                << 64) as u128;

            let suffix = value
                .as_bytes()
                .iter()
                .skip(4)
                .collect::<Vec<_>>();

            // NYI hash_bytes: hash &[u8] to u64, similar to hashbrown `make_hash` for &[u8]
            let hashed_suffix = hash_bytes(suffix);
            if let Some(offset) = self.long_string_map.get(&hashed_suffix) {
                long_string_header |= *offset as u128;
            } else {
                let offset = self.buffer.len();
                self.long_string_map.insert(hashed_suffix, offset as u64);
                long_string_header |= offset as u128;
                // convert suffix: Vec<&u8> to &[u8]
                self.buffer.append_slice(suffix);
            }
            self.header_set.insert(long_string_header);
        }
    }
}

@tustvold
Copy link
Contributor

I think that won't currently handle hash collisions, https://github.com/apache/arrow-rs/blob/master/arrow-array%2Fsrc%2Fbuilder%2Fgeneric_bytes_dictionary_builder.rs#L211 might provide some inspiration here

@alamb
Copy link
Contributor Author

alamb commented Jan 10, 2024

If you want to try and use the short string optimization, it might make sense to create a struct to encapsulate the struct directly, perhaps something like

enum StringKey {
  // data length, in chars
  u64: len,
  // if the data length, in *bytes* is less than 8, stored as a u64 here, 
  // otherwise, this stores the offset into buffer
  offset_or_inline: u64,
  }
}

Then you maybe your structure can look like

struct StringKey {
  // returns the data pointed at by this key (either inlined or in buffer, depending on self.length)
  fn val(&self, buffer: &[u8]) -> &str { 
   ...
  }
}

As well as various hashing, etc

Then your struct could look something like

// Short String Optimizated HashSet for String
// Equivalent to HashSet<String> but with better memory usage (Speed unsure)
struct SSOStringHashSet {
    inner: HashSet<StringKey, RandomState>,
    // map<hash of long string w/o 4 bytes prefix, offset in buffer>
    long_string_map: HashMap<u64, u64, RandomState>,
    buffer: BufferBuilder<u8>,
}

@alamb
Copy link
Contributor Author

alamb commented Jan 10, 2024

I started hacking on a potential PR for this here: #8827 -- maybe we can collaborate @jayzhan211 . I need a little more to get it working in general, but then we'll have to implement the emit/clear functions too

@alamb alamb changed the title Improve aggregate performance by special casing single string group by Improve aggregate performance with specialized groups accumulator for special casing single string group by Jan 12, 2024
@alamb alamb changed the title Improve aggregate performance with specialized groups accumulator for special casing single string group by Improve aggregate performance with specialized groups accumulator for single string group by Jan 12, 2024
@jayzhan211
Copy link
Contributor

jayzhan211 commented Jan 13, 2024

@alamb @tustvold I have done the first draft for distinct count #8849. Slightly different from the suggestions above

@alamb
Copy link
Contributor Author

alamb commented Jan 16, 2024

@alamb @tustvold I have done the first draft for distinct count #8849. Slightly different from the suggestions above

Update here is I plan to help @jayzhan211 with #8849 and then revisit this PR (ideally reusing what we have come up with in #8849. )

@alamb
Copy link
Contributor Author

alamb commented Jan 29, 2024

#8849 should be merged today shortly. Then I will work on polishing up #8827.

During this exercise we came up with some other ideas on how to improve performance with string grouping keys which I also hope to write up this week

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
4 participants