Compressing VarBinArray to ConstantArray loses info about offsets_ptype #1021

Open

XinyuZeng opened this issue Oct 12, 2024 · 6 comments

@XinyuZeng

Hi Vortex, I am not sure whether this is the desired behavior. For example, if we compress a LargeBinary or LargeUtf8 Arrow array into Vortex's ConstantArray and then canonicalize it back, we get a Binary or Utf8 Arrow array. This is because VarBinArray::from_iter always uses the u32 offsets builder:

let mut builder = VarBinBuilder::<u32>::with_capacity(iter.size_hint().0);
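For context on why the offsets width matters, here is an arrow-rs analogue I wrote (not Vortex code): the offset type of the builder fixes the logical Arrow type of the result, so a 32-bit-offsets builder can only ever produce Utf8/Binary, never the Large variants. A minimal sketch, assuming the arrow crate:

```rust
// Arrow-side analogue only, not Vortex code: the builder's offset width
// determines whether the finished array is Utf8 or LargeUtf8.
use arrow::array::{Array, LargeStringBuilder, StringBuilder};
use arrow::datatypes::DataType;

fn main() {
    let mut small = StringBuilder::new(); // i32 offsets, analogous to VarBinBuilder::<u32>
    small.append_value("N");
    assert_eq!(small.finish().data_type(), &DataType::Utf8);

    let mut large = LargeStringBuilder::new(); // i64 offsets
    large.append_value("N");
    assert_eq!(large.finish().data_type(), &DataType::LargeUtf8);
}
```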

This can be reproduced by running the round_trip_arrow_compressed test. It is currently ignored, but Arrow now supports comparing Struct arrays:

// Ignoring since Struct arrays don't currently support equality.
// https://github.com/apache/arrow-rs/issues/5199
#[ignore]
#[test]
fn round_trip_arrow_compressed() {

The taxi dataset has a field store_and_fwd_flag which is mostly N. It is reasonable for a ConstantArray to just use u32 offsets, but if we have a ChunkedArray where the first chunk is Constant and the second chunk is not, we may end up with inconsistent Arrow schemas across the output RecordBatches? (Though this may really be a problem of Arrow lacking a logical type.)

@a10y
Contributor

a10y commented Oct 14, 2024

Hey @XinyuZeng, thanks for taking a look! You are correct: in general, while we do our best to preserve round-tripping between Arrow arrays and the nearest Vortex encoding, after compression we can't guarantee to give you back the exact same Arrow encoding you started with. For example, if you provided a StringViewArray or BinaryViewArray, we would still give you back a StringArray / BinaryArray.

This is because the Vortex type system encodes logical types (e.g. "these bytes are Utf8 encoded") rather than Arrow's physical encoding types. Each of our logical types has a blessed "canonical" encoding which can represent values of that type while remaining zero-copy to Arrow.

To provide an exact round-trip guarantee would require storing information somewhere about the original Arrow encoding, which I don't think is something we're considering right now.
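To make the logical-vs-physical distinction concrete, a hand-written illustration (my own sketch, not the actual Vortex conversion code): several Arrow physical string encodings all map to the single logical notion of Utf8 on the way in, which is exactly why the original encoding cannot be recovered on the way back out without extra metadata.

```rust
// Illustration only: multiple Arrow physical encodings correspond to one
// logical "Utf8" type, so canonicalization has to pick a single output form.
use arrow_schema::DataType;

fn maps_to_logical_utf8(dt: &DataType) -> bool {
    matches!(dt, DataType::Utf8 | DataType::LargeUtf8 | DataType::Utf8View)
}
```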

@XinyuZeng
Author

Got it, thanks! Still wondering whether this could cause an issue when connecting to DataFusion, since it uses the Arrow physical schema directly. For example, we always get Utf8 in the schema:

DType::Utf8(_) => DataType::Utf8,

but the canonical array may be LargeUtf8:

PType::I64 => Arc::new(unsafe {
    LargeStringArray::new_unchecked(

Maybe a corner case though, just something I happened to think of.
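To spell out that failure mode with plain arrow-rs (my own sketch, not DataFusion-specific code): if the advertised schema says Utf8 but a canonicalized column comes back as LargeUtf8, assembling the RecordBatch already fails, which is roughly what a DataFusion plan built against the static schema would run into.

```rust
// Sketch of the mismatch: the schema advertises Utf8, but the column is LargeUtf8.
use std::sync::Arc;
use arrow::array::{ArrayRef, LargeStringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() {
    let schema = Arc::new(Schema::new(vec![Field::new("s", DataType::Utf8, true)]));
    let col: ArrayRef = Arc::new(LargeStringArray::from(vec![Some("N")]));
    // try_new rejects the batch because the column type does not match the schema.
    assert!(RecordBatch::try_new(schema, vec![col]).is_err());
}
```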

@robert3005
Member

This is indeed an issue. This particular case would be solved by #757. I don't recall why we didn't take the schema from the array. We likely need to add a new trait, i.e. ArrowDType, and implement it for all our arrays to solve this issue. Then we could produce the true canonical dtype.
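Purely as a hypothetical sketch of that idea (the trait name comes from the comment above; everything else is an assumption, not the real Vortex API): each encoding would report the Arrow DataType its canonical form decodes to, and the schema handed to DataFusion could then be derived from the arrays themselves.

```rust
// Hypothetical sketch only; the name and signature are assumptions.
use arrow_schema::DataType;

pub trait ArrowDType {
    /// The Arrow DataType this array will canonicalize to, e.g. LargeUtf8
    /// for a VarBin-style encoding whose offsets ptype is u64.
    fn arrow_dtype(&self) -> DataType;
}
```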

@robert3005
Member

robert3005 commented Oct 15, 2024

@a10y pointed out that we might have enough metadata in our files, but that we are still missing that metadata in our in-memory arrays. However, we should make the batches returned to DataFusion have consistent dtypes.

@XinyuZeng
Author

I see. Btw, DataFusion has a plan to separate the logical type out: apache/datafusion#11513. If that proposal is implemented, then maybe there is no need for Vortex to ensure consistent dtypes for output batches.

@a10y
Contributor

a10y commented Oct 15, 2024

Yep, we're tracking that keenly and agree it will likely help here 😄
