DataBlock

DataBlock is our own implementation of Apache Arrow, adapting some concepts from Velox. This crate uses the approach described in Type Exercise in Rust to support a subset of the Arrow format. What's more, DataBlock is designed to be used in the query engine of curvature, therefore:

  • it only provides the functions that the query engine needs

  • it is optimized for the query engine's use cases

Differences from the official implementation

  • Array is !Send. If you want to send an Array between threads, copy the inner data of the Array

  • Array is mutable! The query engine can mutate an Array through &mut self without copying, thanks to the SwarPtr

  • The result of a computation on an Array is written into a predefined Array. Arrow's computations generate a new Array each time, which is inefficient and puts a lot of pressure on the memory allocator.

  • Computations on Arrays take a selection array as their context. The common purpose of the selection array is to reduce computation: a function only performs the computation on the selected elements. For computations that generate a boolean result, the result is written into the selection array via the and_inplace function instead of into a brand-new BooleanArray, so unselected elements can never become selected again. For other computations, the selection array remains unchanged. Let us prove this design is correct. Statement: if the selection is not all set, the expression is inside an And context. Proof: in the initial state, the selection is all set. After executing the first boolean expression, the selection stores its result. Boolean expressions can only be conjoined by the And/Or operators, and for the Or operator we provide a brand-new selection array to each of its children. Therefore we must be in an And context, and reusing the selection array to execute the remaining expressions is correct (see the sketch after this list)

  • An Array cannot be a partial slice of another Array

  • Arrow uses portable SIMD to perform vectorization. However, it is static (resolved at compile time): you must compile Arrow with -C target-feature=+feature to produce a program for CPUs that support that feature, and running the program on a CPU that does not support it causes undefined behavior. If you do not compile with these features, the code will not use these SIMD instructions even if the CPU supports them. What's more, portable SIMD is only available on nightly. We instead detect the features supported by the CPU at runtime and use the corresponding SIMD instructions. What's more, we write some SIMD manually with intrinsics, so that comparisons that cannot be auto-vectorized can still be accelerated with SIMD. You can use it on stable Rust now.
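Here is a minimal sketch of the selection-array protocol described above, assuming a plain Vec<bool> stands in for the selection array; the function names and signatures are illustrative, not DataBlock's actual API:

```rust
/// `a > b`, evaluated under a selection: the boolean result is ANDed
/// straight into the selection instead of allocating a BooleanArray.
fn gt_selected(a: &[i64], b: &[i64], selection: &mut [bool]) {
    for ((sel, x), y) in selection.iter_mut().zip(a).zip(b) {
        // A row stays selected only if it was selected before AND the
        // comparison holds, mirroring `and_inplace`.
        *sel = *sel && x > y;
    }
}

/// `a > s` against a scalar, following the same protocol.
fn gt_scalar_selected(a: &[i64], s: i64, selection: &mut [bool]) {
    for (sel, x) in selection.iter_mut().zip(a) {
        *sel = *sel && *x > s;
    }
}

fn main() {
    let a = vec![1_i64, 5, 3, 7];
    let b = vec![2_i64, 2, 3, 9];
    // Initial state: the selection is all set.
    let mut selection = vec![true; a.len()];
    // `a > b AND a > 2`: each predicate folds its result into the
    // selection, so later predicates only matter for still-selected rows.
    gt_selected(&a, &b, &mut selection);
    gt_scalar_selected(&a, 2, &mut selection);
    assert_eq!(selection, vec![false, true, false, false]);
}
```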

Build

```
cargo build
```

Note: DataBlock requires your CPU to support the sse2 instruction set if your CPU is x86/x86-64. You can check it with lscpu on Ubuntu.

You do not need to pass RUSTFLAGS="-C target-feature=+features" when building this crate. The code will try to use the following instruction sets dynamically:

  • On X86/X86-64
    • avx2
  • On ARM
    • Neon

TODO

  • Support the List type.
  • Support more encodings like Constant/Dictionary

Tittle-tattle

Auto-vectorization is all you need 😊

As we all know, the compiler is smarter than you. DataBlock does not use SIMD manually (currently); it lets the compiler generate the best instructions for you. The trick is to build different versions of a function for different target features and call the correct one at runtime. You can find a concrete example in std::arch.
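A sketch of that trick, following the pattern shown in the std::arch documentation (the kernel here is an illustration, not one of DataBlock's functions): compile one copy of a function with a target feature enabled, then pick it at runtime with feature detection.

```rust
/// Public entry point: detects CPU features and dispatches.
fn add(a: &[f32], b: &[f32], out: &mut [f32]) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: we just verified that the running CPU supports avx2.
            return unsafe { add_avx2(a, b, out) };
        }
    }
    add_scalar(a, b, out);
}

/// Compiled as if `-C target-feature=+avx2` were set for this function
/// only, so the auto-vectorizer may emit AVX2 instructions for the
/// inlined loop below, while the rest of the binary stays portable.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn add_avx2(a: &[f32], b: &[f32], out: &mut [f32]) {
    add_scalar(a, b, out);
}

/// Baseline that runs on any CPU.
fn add_scalar(a: &[f32], b: &[f32], out: &mut [f32]) {
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}
```

In real code the detection result would typically be cached (for example in a function pointer) instead of being checked on every call.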

On my first attempt, I tried to use SIMD manually. It was pretty annoying:

  • I used conditional compilation to generate efficient code at compile time. Users had to build the code with RUSTFLAGS="-C target-feature=+avx2" to generate code that can only run on CPUs with the avx2 feature. That is not portable at all! Even if we used the portable_simd crate, we would still need this flag to build efficient code, according to this guide. "Portable" means you can build the same code for different targets at compile time, which is still static... Even when I built the code with the correct flag, I found the instructions generated by the compiler were still slower than the auto-vectorized version on many different machines 😠!!! BTW: performance on some machines decreased 😠

  • I had to use nightly Rust, since SIMD is an unstable feature. Nightly has some strange bugs....

After doing lots of experiments, I decided to use auto-vectorization. I am the prompt engineer of the compiler: I inspect the ASM generated by the compiler to make sure key functions are vectorized. For example, I use unsafe code to enable auto-vectorization (maybe it is a compiler bug?), and I replaced the BitVec with Vec<bool>, because iterating the array, doing some computation, and writing the results into a BitVec cannot be vectorized automatically. It is at least 100 times slower than the vectorized version; you can find the benchmark in benches/selection.
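One way such a hand-checked kernel can look is sketched below (the function name is illustrative): the unsafe indexing removes bounds checks so the loop body stays branch-free and the auto-vectorizer can handle it, with the results written to Vec<bool> rather than packed bits.

```rust
/// Write `a[i] == b[i]` into `out`. The length check is hoisted out of
/// the loop so the unchecked accesses below are sound.
fn eq_into(a: &[i64], b: &[i64], out: &mut [bool]) {
    assert!(a.len() == b.len() && a.len() == out.len());
    for i in 0..a.len() {
        // SAFETY: `i < a.len()`, and all three slices have equal length
        // per the assert above. Skipping the bounds checks keeps the
        // loop branch-free, which lets the compiler auto-vectorize it.
        unsafe {
            *out.get_unchecked_mut(i) = a.get_unchecked(i) == b.get_unchecked(i);
        }
    }
}
```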

Comparison

Comparing one array with another array (or a scalar) is widely used in query engines. We can store the results that match the comparison in two forms: Vec<usize> or Vec<bool>. For Vec<usize>, we store the indices of the array elements that match the comparison. For Vec<bool>, we store the boolean result of each value.

If the expression has multiple comparisons conjoined with And, a later comparison can use the index array to avoid unnecessary computation, which is pretty cool. However, computing the index array is pretty expensive: the compiler cannot auto-vectorize this code! In my benchmark, it is at least 10 times slower than storing the result in Vec<bool> 😭. You can find the benchmark in benches/selection. So, when executing an expression, which form is faster depends entirely on the selectivity of the expression.

Currently, we store the result in Vec<bool>; storing the result in Vec<usize> will be implemented in the future. The expression executor should choose the correct function at runtime!
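A sketch of the two result forms under discussion (the function names are illustrative, not DataBlock's API):

```rust
/// Bool form: one flag per row. This plain loop auto-vectorizes.
fn gt_scalar_bools(a: &[i64], s: i64) -> Vec<bool> {
    a.iter().map(|&x| x > s).collect()
}

/// Index form: only the matching row indices. The data-dependent push
/// defeats auto-vectorization, which is why building this is slower,
/// but a later And-conjoined comparison only has to visit these rows.
fn gt_scalar_indices(a: &[i64], s: i64) -> Vec<usize> {
    let mut out = Vec::new();
    for (i, &x) in a.iter().enumerate() {
        if x > s {
            out.push(i);
        }
    }
    out
}
```

With low selectivity (few matches) the index form wins, because later predicates touch only a handful of rows; with high selectivity the bool form wins, because it vectorizes.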

Vec<bool> VS BitMap

  • If we use Vec<bool> to store the comparison result, the compiler can use auto-vectorization to speed up comparisons between primitive types. Storing the comparison result into a BitMap, however, cannot be auto-vectorized; we have to implement the SIMD manually
  • Comparisons and logical operations on a BitMap are faster than on Vec<bool>, and they are widely used in expressions.
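A sketch of why the second point holds, assuming a word-packed layout (this BitMap type is an illustration, not DataBlock's actual representation): one 64-bit AND combines 64 rows at once, while Vec<bool> spends a whole byte per row.

```rust
/// One bit per row, packed into 64-bit words.
struct BitMap {
    words: Vec<u64>,
}

impl BitMap {
    /// Logical AND with another bitmap, in place: every iteration of
    /// this loop handles 64 rows, and the compiler vectorizes it too.
    fn and_inplace(&mut self, other: &BitMap) {
        for (w, o) in self.words.iter_mut().zip(&other.words) {
            *w &= o;
        }
    }
}
```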

Comparison with BitMap. Awesome!

Insight: a BitMap can return the index of the next one/zero quickly! Instead of using an index array to index the selected rows, we can use the BitMap to index the array 😊. This unifies the BitMap and the index array. See Filter Representation in Vectorized Query Execution for more details

A computation can choose to view the BitMap as an index array when it decides that is more applicable. For example, the StringArray will always choose to view the BitMap as an index array, because it cannot leverage SIMD for acceleration. The PrimitiveArray will choose to view the BitMap as an index array only when almost all of the bits are zero; otherwise, using SIMD and cache locality over the whole array is much faster even though it performs lots of unnecessary computation
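A sketch of how a word-packed bitmap yields indices quickly (the layout and helper are illustrative): trailing_zeros jumps straight to the next set bit, so visiting only the selected rows costs time proportional to the number of set bits, with no Vec<usize> ever materialized.

```rust
/// Call `f` with the index of every set bit, treating the bitmap as an
/// index array.
fn for_each_set_bit(words: &[u64], mut f: impl FnMut(usize)) {
    for (word_idx, &word) in words.iter().enumerate() {
        let mut w = word;
        while w != 0 {
            // Index of the lowest set bit in this word.
            let bit = w.trailing_zeros() as usize;
            f(word_idx * 64 + bit);
            // Clear the lowest set bit and continue.
            w &= w - 1;
        }
    }
}
```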