There is no encodings on float array #1723

Open · shenlei149 opened this issue Dec 19, 2024 · 6 comments

@shenlei149

We have a lot of data for ML training, saved as Parquet files. We want to find an alternative way to store the data that speeds up reading and reduces file size.

There are several columns with few nulls whose data type is struct<value: list<array: float not null>>; there are several hundred rows per row group and 500-1000 floats in each list. For these columns, Vortex does not perform well because no encodings are applied, so the array size is larger than with Parquet.

The pseudocode looks like this:

import pyarrow.parquet as pq
import pyarrow as pa
import vortex as vx

arrow_table = pq.read_table("file.parquet")

some_arrow_array = choose_some_column(arrow_table)  # placeholder for picking a column
for a_array in some_arrow_array:
    vx_array = vx.array(a_array)
    cvx_array = vx.compress(vx_array)
    # for list-of-float columns, cvx_array.nbytes == vx_array.nbytes

I tried flattening the list array, exporting the data, and using it as one array, but still no encoding was applied...

Here is the code and output:

import pyarrow as pa
import vortex as vx
import array

with open("sample.data","rb") as file:
    numbers = array.array('f')
    numbers.fromfile(file, (265216//4))
    arrow_array = pa.array(numbers)
    print(arrow_array.nbytes)
    vcol = vx.array(arrow_array)
    print(vcol.nbytes)
    cvcol = vx.compress(vcol)
    print(cvcol.nbytes)
    print(cvcol.nbytes/vcol.nbytes)

530432
530433
526365
0.9923307938985697

I used btrblocks to encode the same data.

Stats:
- input size 530432
- output size 219769
- compression ratio 2.41359

If I do the same thing with only 500 floats, no encoding is applied and the compression ratio is 1.

My question is whether I am using Vortex the right way; if not, what is the right way to do it?

I'm not sure whether I have made the question clear; please let me know if you need more details.

@a10y (Contributor) commented Dec 19, 2024

Hey @shenlei149, thanks for giving Vortex a try for your usecase!

We do implement two floating-point encodings, ALP and ALP-RD ("ALP for Real Doubles"). The first is meant for low-precision decimals that can be approximated as integers; the latter is for high-precision doubles and is probably more likely to be useful for your use case.

Note that the compression ratio for ALP-RD is not gigantic (it usually saves on the order of ~9 bits per value), but the results do match what DuckDB and the ALP paper indicate.
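
As an illustration of where ALP helps, a minimal sketch (my own example data, not from the issue; it assumes low-precision decimals, which ALP targets):

import pyarrow as pa
import vortex as vx

# Two-decimal-place values are a good fit for ALP, which rescales
# decimals into integers and bit-packs them.
decimals = pa.array([round(i * 0.01, 2) for i in range(65_536)], type=pa.float64())
vx_arr = vx.array(decimals)
compressed = vx.compress(vx_arr)
print(compressed.nbytes / vx_arr.nbytes)  # expect a ratio well below 1 on ALP-friendly data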

I've noticed that we actually had a bug in our compressor selection for floating-point types; I've filed a PR with the fix here: #1725

@a10y (Contributor) commented Dec 19, 2024

BtrBlocks implements two encodings that we don't: Frequency and PDE. I'd be curious to understand which encoding it's choosing for your dataset.

a10y added a commit that referenced this issue Dec 19, 2024
Fixes issue seen in #1723

In #1000, changes were made to remove ALP-RD as a top-level compressor
and instead to only use it for patches. However, it seems that it was
not getting selected anymore, whether that was due to the patching cost
overhead or something else. This was noticed by a user, and confirmed by
me in a Python shell.

<img width="946" alt="image"
src="https://github.com/user-attachments/assets/c42caed5-12e3-448b-aea6-3f33a7c97bfc"
/>

After this change, ALP-RD indeed does get selected again.

<img width="907" alt="image"
src="https://github.com/user-attachments/assets/bbb996fc-5223-43ed-9b4c-4b0262a417dc"
/>
@shenlei149 (Author)

Thank you for your reply and quick fix.

I updated the code and rebuilt, but it does not work. The size is still large.


From the return status of the btrblocks compress API, used_compression_schemes contains only one element, which is 2, so I think it uses dictionary encoding for my dataset.

enum class DoubleSchemeType : uint8_t {
  UNCOMPRESSED = 0,
  ONE_VALUE = 1,
  DICT = 2,
  RLE = 3,
  FREQUENCY = 4,
  PSEUDODECIMAL = 5,
  // legacy schemes
  DOUBLE_BP = 28,
  DICTIONARY_8 = 29,
  DICTIONARY_16 = 31,
  SCHEME_MAX = 32
};

@a10y (Contributor) commented Dec 19, 2024

Thanks for the extra info @shenlei149. I spent a bit of time today tracking down some float compression things in BtrBlocks.

Take for example the following program, which tries to compress a column of 64,000 fully random doubles:

#include "btrblocks.hpp"
#include "common/Log.hpp"
#include <random>

#define MAX_DEPTH 4

static btrblocks::Vector<double> generateData() {
    btrblocks::Vector<double> data(64000);
    std::default_random_engine engine;
    std::uniform_real_distribution<double> unif;
    for (size_t i = 0; i < 64000; i++) {
        data[i] = unif(engine);
    }

    return data;
}

int main(int argc, char *argv[]) {
    using namespace btrblocks;

    
    BtrBlocksConfig::configure([&](BtrBlocksConfig &config){
        std::cout << "setting max cascade depth to " << MAX_DEPTH << std::endl;
        config.integers.max_cascade_depth = MAX_DEPTH;
        config.doubles.max_cascade_depth = MAX_DEPTH;
        config.strings.max_cascade_depth = MAX_DEPTH;
        config.doubles.schemes.enable(DoubleSchemeType::DOUBLE_BP);
    });


    // Logging
    Log::set_level(Log::level::debug);

    // Build a relation with a single column of random doubles.
    Relation floats_rel;
    floats_rel.addColumn({"dbls", generateData()});

    // Compress the entire column.
    Range range(0, floats_rel.tuple_count);
    Chunk input = floats_rel.getChunk({range}, 0);
    Datablock compressor(floats_rel);
    
    
    // Compress, getting stats back.
    std::unique_ptr<uint8_t[]> output(new uint8_t[64000 * sizeof(double) * 2]);
    auto stats = compressor.compress(input, output);


    std::cout << "Stats:" <<  std::endl
        << "- input size " << input.size_bytes() << std::endl
        << "- output size " << stats.total_data_size << std::endl
        << "- compression ratio " << stats.compression_ratio << std::endl
        ;

    return 0;
}

This, somewhat predictably, does not compress at all:

[screenshot: compression stats showing essentially no compression]
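
For comparison, the same experiment from the Vortex side is sketched below (my own snippet, not part of the original report); fully random doubles should be similarly incompressible:

import numpy as np
import pyarrow as pa
import vortex as vx

# 64,000 uniformly random doubles, mirroring the BtrBlocks program above.
rng = np.random.default_rng(0)
randoms = pa.array(rng.random(64_000), type=pa.float64())
compressed = vx.compress(vx.array(randoms))
print(compressed.nbytes / randoms.nbytes)  # likely close to 1.0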

It's really hard to figure out why the compressor is performing a certain way without having a better understanding of your data distribution.

From your comment it sounds like BtrBlocks is selecting Dict encoding for your data, i.e. there are a lot of repeated fp64 values. Is that right? If so, I'd expect Vortex to be able to pick up on that as well, like it does in this example:

[screenshot: Vortex compressing an array with many repeated values]
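
A rough way to reproduce that kind of check from Python (a sketch; the repeated-value data here is made up):

import pyarrow as pa
import vortex as vx

# An array dominated by a handful of repeated fp64 values.
repeated = pa.array([float(i % 16) for i in range(65_536)], type=pa.float64())
compressed = vx.compress(vx.array(repeated))
print(compressed.nbytes / repeated.nbytes)  # should be far below 1 if the repeats are detected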

If you could share a bit more information about these arrays, that might help track this down.

@shenlei149 (Author)

sample.zip

Here is the sample file. It is a binary file, not a zip file... I just changed the extension to make GitHub happy; otherwise the file could not be uploaded.

The file size is 265216 bytes: 66304 floats of 4 bytes each.


btrblocks does not support float, so I have to use double, like this:

Relation to_compress;
// original_data is std::vector<float>; widen it to double for btrblocks
Vector<DOUBLE> doubles(original_data.size());
for (size_t i = 0; i < original_data.size(); ++i) {
  doubles[i] = original_data[i];
}
to_compress.addColumn(Column{"doubles", std::move(doubles)});

So the compression ratio is effectively 1.2 instead of 2.4, since the floats are widened to doubles first. Maybe it found out that there are many zeros in the array and decided to use dictionary encoding :)

I found that you have a function named tree_display; here is its output. So Vortex chooses ALP + zigzag + bitpacking?

root: vortex.alp(0x11)(f64, len=66304) nbytes=526.37 kB (100.00%)
  metadata: ALPMetadata { exponents: Exponents { e: 16, f: 0 }, patches: Some(PatchesMetadata { len: 7452, indices_ptype: U32 }) }
  encoded: vortex.zigzag(0x1d)(i64, len=66304) nbytes=449.73 kB (85.44%)
    metadata: ZigZagMetadata
    encoded: fastlanes.bitpacked(0x15)(u64, len=66304) nbytes=449.73 kB (85.44%)
      metadata: BitPackedMetadata { validity: NonNullable, bit_width: 54, offset: 0, patches: Some(PatchesMetadata { len: 42, indices_ptype: U16 }) }
      buffer: 449.28 kB
      patch_indices: vortex.primitive(0x03)(u16, len=42) nbytes=85 B (0.02%)
        metadata: PrimitiveMetadata { validity: NonNullable }
        buffer: 84 B
      patch_values: vortex.primitive(0x03)(u64, len=42) nbytes=337 B (0.06%)
        metadata: PrimitiveMetadata { validity: NonNullable }
        buffer: 336 B
  patch_indices: fastlanes.bitpacked(0x15)(u32, len=7452) nbytes=17.00 kB (3.23%)
    metadata: BitPackedMetadata { validity: NonNullable, bit_width: 16, offset: 0, patches: Some(PatchesMetadata { len: 98, indices_ptype: U16 }) }
    buffer: 16.38 kB
    patch_indices: vortex.primitive(0x03)(u16, len=98) nbytes=197 B (0.04%)
      metadata: PrimitiveMetadata { validity: NonNullable }
      buffer: 196 B
    patch_values: vortex.primitive(0x03)(u32, len=98) nbytes=393 B (0.07%)
      metadata: PrimitiveMetadata { validity: NonNullable }
      buffer: 392 B
  patch_values: vortex.primitive(0x03)(f64, len=7452) nbytes=59.62 kB (11.33%)
    metadata: PrimitiveMetadata { validity: NonNullable }
    buffer: 59.62 kB
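
For reference, the tree above can be printed from Python with something like the following (assuming the Python binding exposes tree_display() on arrays, as the output format suggests):

cvcol = vx.compress(vcol)
print(cvcol.tree_display())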

In my experiment, arrow_array is pyarrow.lib.DoubleArray, not FloatArray, because I don't know how to make pyarrow treat the data type as float. Sadly... :(

with open("sample.data","rb") as file:
    numbers = array.array('f')
    numbers.fromfile(file, (265216//4))
    arrow_array = pa.array(numbers)
    print(type(arrow_array))

Vortex tries ALP, and btrblocks may find a lot of zeros and then use dictionary encoding here.
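
As an aside, pyarrow can be given an explicit element type so the array stays 32-bit; a minimal sketch:

import array
import pyarrow as pa

with open("sample.data", "rb") as file:
    numbers = array.array('f')
    numbers.fromfile(file, 265216 // 4)

# Passing type=pa.float32() keeps the values as 32-bit floats instead of
# letting pyarrow widen them to float64.
arrow_array = pa.array(numbers, type=pa.float32())
print(type(arrow_array))  # pyarrow.lib.FloatArray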

@a10y (Contributor) commented Dec 20, 2024

Thanks for sharing that! I think I've tracked down the source of the divergent behavior.

Vortex performs compression by sampling. The samples are targeted at finding run-length-encoding and bit-packing opportunities: it samples 16 runs of 64 contiguous elements into a 1024-element buffer and tests the different compressors over that.
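
Conceptually, the sampling looks something like the sketch below (an illustration of the idea only, not the actual Vortex implementation):

import random

def take_sample(values, runs=16, run_len=64):
    # Concatenate `runs` random contiguous windows of `run_len` elements
    # into one ~1024-element sample that the candidate compressors are tried on.
    sample = []
    for _ in range(runs):
        start = random.randrange(0, len(values) - run_len + 1)
        sample.extend(values[start:start + run_len])
    return sample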

However, the duplicates in this dataset are distributed in a way that makes them hard for the sampler to find. It seems that, on average, any individual 64-element window has very few duplicates, and that holds true even up to 1024 elements:

import struct
import pandas as pd

# Count, for each sliding window of the given size, how many duplicate values it contains
def dupes_histo(window_size, xs):
    dupes = []
    for i in range(len(xs) - window_size):
        window = xs[i:i+window_size]
        dupe_count = len(window) - len(set(window))
        dupes.append(dupe_count)
    return dupes


with open("/Users/aduffy/Downloads/sample_floats.bin", "rb") as f:
    xs = []
    for i in range(66304):
        xs.append(struct.unpack("f", f.read(4))[0])

assert len(xs) == 66304


for WINDOW_SIZE in [64, 128, 1024]:
    dupes = pd.Series(dupes_histo(WINDOW_SIZE, xs))

    print(f"Analysis for WINDOW_SIZE {WINDOW_SIZE}:")

    print("- # windows: {}".format(len(dupes)))
    print("- min dupes in any window: {}".format(dupes.min()))
    print("- max dupes in any window: {}".format(dupes.max()))
    print("- # windows w/max dupes: {}".format(len(dupes[dupes == dupes.max()])))
    print("- average dupes per window: {}".format(dupes.mean()))
❯ uv run load_bin.py
Analysis for WINDOW_SIZE 64:
- # windows: 66240
- min dupes in any window: 0
- max dupes in any window: 3
- # windows w/max dupes: 96
- average dupes per window: 0.21541364734299517
Analysis for WINDOW_SIZE 128:
- # windows: 66176
- min dupes in any window: 0
- max dupes in any window: 5
- # windows w/max dupes: 98
- average dupes per window: 0.8615056818181818
Analysis for WINDOW_SIZE 1024:
- # windows: 65280
- min dupes in any window: 41
- max dupes in any window: 529
- # windows w/max dupes: 15
- average dupes per window: 174.91424632352943

Note that in the Rust API you can change the sampling parameters; when I tweak them high enough to effectively grab the entire array as a "sample", we do get the ideal encoding:

https://github.com/spiraldb/vortex/blob/aduffy/repro-gh-1723/bench-vortex/src/bin/gh1723.rs#L37-L44

[screenshot: compression output with the enlarged sample, showing the ideal encoding]

I think we likely need to improve the use of global statistics in our compressor to perform well on these kinds of arrays. We have an open issue to add a cardinality-estimate statistic in #913; I think adding this and plumbing it into the compressor would help greatly here.
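
For this particular file, a quick global-cardinality check (reusing the xs list loaded by the script above) shows how much a whole-array statistic would reveal:

unique_count = len(set(xs))
print(f"{unique_count} unique values out of {len(xs)} ({unique_count / len(xs):.2%})")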
