`DataBlock` is our own implementation of Apache Arrow, adapting some concepts from Velox. This crate uses the approach described in Type Exercise in Rust to support a subset of the Arrow format. What's more, `DataBlock` is designed to be used in the query engine of curvature, therefore:
- It only provides the functions the query engine needs to use.
- It optimizes for speed in the query engine's use cases.
- `Array` is `!Send`. If you want to send an `Array` between threads, copy the inner data of the `Array`.
- `Array` is mutable! The query engine can mutate the `Array` through `&mut self` without copying, thanks to the `SwarPtr`.
- The result of a computation on `Array`s should be written to a predefined `Array`. Arrow's computations generate a new `Array` each time, which is inefficient and puts a lot of pressure on the memory allocator.
- Computations on `Array`s take a `selection` array as their context. The general meaning of the `selection` array is that it reduces computation: a function only performs the computation on the selected elements. For computations that generate a boolean result, the result is written into the `selection` array via the `and_inplace` function instead of into a brand new `BooleanArray`, so that we never re-select the unselected elements (see the sketch after this list). For other computations, the `selection` array remains unchanged. Let us prove this design is correct. Statement: if the selection is not all set, the expression is in an `And` context. In the initial state, the selection is all set. After executing the first boolean expression, the selection stores its result. Boolean expressions can only be conjoined by the `And`/`Or` operators, and for the `Or` operator we provide a brand new selection array to each of its children. Therefore, we must be in an `And` context, and using the selection array to execute the remaining expressions is correct.
- An `Array` can not be a partial slice of another `Array`.
- Arrow uses portable SIMD to perform vectorization. However, it is static (fixed at compile time): you have to compile arrow with `-C target-feature=+feature` to build the program for CPUs that support this feature. Running the program on a CPU that does not support it will cause undefined behavior, and if you do not compile with these features, the code will not use the SIMD instructions even if the CPU supports them. What's more, portable SIMD is only available on nightly. We instead detect the features supported by the CPU at runtime and use the corresponding SIMD instructions. Furthermore, we use intrinsics to write the SIMD manually, so that comparisons that can not be auto-vectorized can still be accelerated with SIMD. You can use it on stable Rust now.
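A minimal sketch of the selection contract described above. The function name `gt_scalar` and the slice-of-`bool` representation are simplifications for illustration, not this crate's real API:

```rust
/// Hypothetical sketch: a comparison kernel that respects the
/// `selection` context. It combines its result into the selection in
/// place (the semantics of `and_inplace`), so an element that was
/// unselected before the call can never become selected.
fn gt_scalar(selection: &mut [bool], values: &[i64], scalar: i64) {
    for (sel, &v) in selection.iter_mut().zip(values) {
        // new_selection = old_selection AND (v > scalar)
        *sel = *sel && v > scalar;
    }
}

fn main() {
    let mut selection = vec![true, true, false, true];
    gt_scalar(&mut selection, &[10, 1, 10, 10], 5);
    // The unselected element stays unselected even though 10 > 5.
    assert_eq!(selection, [true, false, false, true]);
}
```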
```
cargo build
```
Note: `DataBlock` requires your CPU to support the `sse2` instruction set if your CPU is x86/x86-64. You can check it with `lscpu` on Ubuntu.
You do not need to pass `RUSTFLAGS="-C target-feature=+features"` when building this crate. The code will try to use the following instruction sets dynamically:
- On x86/x86-64
  - `avx2`
- On ARM
  - `neon`
- Support the `List` type.
- Support lots of encodings like `Constant`/`Dictionary`.
As we all know, the compiler is smarter than you. `DataBlock` does not use SIMD manually (currently); it lets the compiler generate the best instructions for you. The trick is to build different versions of a function for different target features and call the correct function at runtime. You can find a concrete example in `std::arch`.
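A minimal sketch of that trick, modeled on the example in the `std::arch` documentation; the `sum` functions here are hypothetical and not part of this crate:

```rust
fn sum(values: &[i64]) -> i64 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        // Runtime check: does the CPU we are running on support avx2?
        if is_x86_feature_detected!("avx2") {
            // SAFETY: we just verified that avx2 is available.
            return unsafe { sum_avx2(values) };
        }
    }
    sum_fallback(values)
}

/// Compiled with the avx2 target feature enabled, so the compiler is
/// free to auto-vectorize the loop body with AVX2 instructions.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(values: &[i64]) -> i64 {
    values.iter().sum()
}

/// Portable fallback for every other CPU.
fn sum_fallback(values: &[i64]) -> i64 {
    values.iter().sum()
}
```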
At my first attempt, I tried to use SIMD manually. It was pretty annoying:
- I used conditional compilation to generate efficient code at compile time. Users had to build the code with `RUSTFLAGS="-C target-feature=+avx2"` to generate code that can only run on CPUs that have the `avx2` feature. It is not portable at all! Even if we used the `portable_simd` crate, we would still need this flag to build efficient code, according to this guide. "Portable" means you can build the same code for different targets at compile time; it is still static... Even when I built the code with the correct flag, I found the instructions generated by the compiler were still slower than the auto-vectorized version on many different machines 😠!!! BTW: performance on some machines even decreased 😠
- I had to use nightly Rust, because SIMD is an unstable feature. Nightly has some strange bugs....
After doing lots of experiments, I decided to use auto-vectorization. I am the prompt engineer of the compiler: I inspect the ASM code the compiler generates to make sure some key functions are vectorized. For example, I use unsafe code to enable auto-vectorization (maybe it is a compiler bug?), and I replaced `BitVec` with `Vec<bool>`, because iterating the array, doing some computation, and writing the results to a `BitVec` can not be vectorized automatically. It is at least 100 times slower than the vectorized version; you can find the benchmark in `benches/selection`.
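A sketch of why the two write patterns differ so much. This is simplified, hypothetical code, not the crate's real kernels:

```rust
/// One independent byte store per element: the compiler can
/// auto-vectorize this loop easily.
fn gt_to_bools(values: &[i64], scalar: i64, out: &mut Vec<bool>) {
    out.clear();
    out.extend(values.iter().map(|&v| v > scalar));
}

/// Packing results into bitmap words performs a read-modify-write on
/// a shared `u64` for every element, which defeats auto-vectorization.
/// Assumes `bitmap` has at least `values.len().div_ceil(64)` words.
fn gt_to_bitmap(values: &[i64], scalar: i64, bitmap: &mut [u64]) {
    for (i, &v) in values.iter().enumerate() {
        if v > scalar {
            bitmap[i / 64] |= 1u64 << (i % 64);
        }
    }
}
```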
Comparing one array with another array (or a scalar) is widely used in the query engine. We can store the elements that match the comparison in two forms: `Vec<usize>` and `Vec<bool>`. With `Vec<usize>`, we store the indices of the array elements that match the comparison. With `Vec<bool>`, we store the comparison result of each element in the boolean array.
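For instance, the predicate `v > 2` over `[1, 3, 0, 7]` can be represented either way (illustrative code, not this crate's API):

```rust
fn main() {
    let values = [1_i64, 3, 0, 7];

    // Index form: the positions of the matching elements.
    let indices: Vec<usize> = values
        .iter()
        .enumerate()
        .filter(|&(_, &v)| v > 2)
        .map(|(i, _)| i)
        .collect();
    assert_eq!(indices, [1, 3]);

    // Boolean form: one flag per element.
    let flags: Vec<bool> = values.iter().map(|&v| v > 2).collect();
    assert_eq!(flags, [false, true, false, true]);
}
```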
If the expression has multiple comparisons conjoined with `And`, later comparisons can use the index array to avoid unnecessary computation, which is pretty cool. However, computing the index array is pretty expensive: the compiler can not perform auto-vectorization for this code! In my benchmark, it is at least 10 times slower than storing the result in a `Vec<bool>` 😭. You can find the benchmark in `benches/selection`. So, when executing an expression, which form is faster depends entirely on the `Selectivity` of the expression.
Currently, we store the result in a `Vec<bool>`; storing the result in a `Vec<usize>` will be implemented in the future. The executor of the expression should choose the correct function at runtime!
- If we use `Vec<bool>` to store the comparison result, the compiler can use auto-vectorization to speed up comparisons between primitive types. However, storing the comparison result into a `BitMap` can not be auto-vectorized; we would have to implement the SIMD manually.
- Comparisons and logical operations on a `BitMap` are faster than on a `Vec<bool>`, and they are widely used in expressions.
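The second point is easy to see: one `u64` operation handles 64 boolean results at once, versus one byte per element for `Vec<bool>`. A minimal sketch (not this crate's actual `and_inplace`):

```rust
/// ANDs `rhs` into `lhs`, processing 64 boolean results per word.
fn bitmap_and_inplace(lhs: &mut [u64], rhs: &[u64]) {
    for (l, &r) in lhs.iter_mut().zip(rhs) {
        *l &= r;
    }
}
```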
Insight: a `BitMap` can return the indices of its ones/zeros quickly! Instead of using an index array to index the selected rows, we can use the `BitMap` to index the array 😊. This unifies the `BitMap` and the index array. See Filter Representation in Vectorized Query Execution for more details.
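A sketch of how the set bits can be enumerated quickly; this is a hypothetical helper, and the crate's real iterator may differ:

```rust
/// Yields the index of every set bit in the bitmap words, using
/// `trailing_zeros` to jump straight to the next one.
fn ones(words: &[u64]) -> impl Iterator<Item = usize> + '_ {
    words.iter().enumerate().flat_map(|(w, &word)| {
        let mut bits = word;
        std::iter::from_fn(move || {
            if bits == 0 {
                return None;
            }
            let bit = bits.trailing_zeros() as usize;
            bits &= bits - 1; // clear the lowest set bit
            Some(w * 64 + bit)
        })
    })
}
```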
A computation can choose to use the `BitMap` as an index array when it thinks that is more applicable. For example, the `StringArray` will always view the `BitMap` as an index array, because it can not leverage SIMD for acceleration. The `PrimitiveArray` will view the `BitMap` as an index array only when almost all of the bits are zeros; otherwise, using SIMD and cache locality is much faster even if it performs lots of unnecessary computation.
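A sketch of the kind of dispatch this implies; the threshold and names are made up for illustration:

```rust
/// Hypothetical heuristic: fall back to the index view only when very
/// few bits are set; otherwise evaluate every lane and let SIMD win.
fn prefer_index_view(ones: usize, len: usize) -> bool {
    const THRESHOLD: f64 = 0.05; // made-up cutoff, tune via benchmarks
    (ones as f64) < (len as f64) * THRESHOLD
}
```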