Compressing VarBinArray to ConstantArray loses info about offsets_ptype #1021
Hey @XinyuZeng, thanks for taking a look! You are correct that, in general, while we do our best to preserve round-tripping between Arrow arrays and the nearest Vortex encoding, after compression we can't guarantee to give you back the exact same Arrow encoding you started with. For example, if you provided a […]. This is because the Vortex type system encodes logical types (e.g. "these bytes are Utf8 encoded") as opposed to Arrow's physical encoding types. Each of our logical types has a blessed "canonical" encoding which can represent values of that type while being zero-copy to Arrow. Providing an exact round-trip guarantee would require storing information somewhere about the original Arrow encoding, which I don't think is something we're considering right now.
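The logical-vs-physical distinction described above can be sketched in a few lines of Rust. All names here (`ArrowType`, `DType`, `to_vortex`, `to_canonical_arrow`) are hypothetical stand-ins for illustration, not the actual Vortex API: both Arrow physical encodings collapse to one logical type, and canonicalization picks a single blessed physical encoding on the way back out.

```rust
/// Arrow-style physical encodings for string data (hypothetical subset).
#[derive(Debug, PartialEq)]
enum ArrowType {
    Utf8,      // i32 offsets
    LargeUtf8, // i64 offsets
}

/// Vortex-style logical type: both Arrow variants collapse to Utf8.
#[derive(Debug, PartialEq)]
enum DType {
    Utf8,
}

fn to_vortex(arrow: &ArrowType) -> DType {
    match arrow {
        // The offset width is a physical detail, so it is dropped here.
        ArrowType::Utf8 | ArrowType::LargeUtf8 => DType::Utf8,
    }
}

fn to_canonical_arrow(dtype: &DType) -> ArrowType {
    match dtype {
        // One blessed canonical encoding per logical type.
        DType::Utf8 => ArrowType::Utf8,
    }
}

fn main() {
    let original = ArrowType::LargeUtf8;
    let round_tripped = to_canonical_arrow(&to_vortex(&original));
    // The logical type survives the round trip, but the physical
    // encoding does not: LargeUtf8 comes back as Utf8.
    assert_eq!(round_tripped, ArrowType::Utf8);
    assert_ne!(round_tripped, original);
}
```

Under this model, an exact round trip would require carrying the original `ArrowType` as side-channel metadata, which is the extra storage mentioned above.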
Got it, thanks! Still wondering whether this will be a potential issue when connecting to DataFusion, since it uses the Arrow physical schema directly. For example, we always get `Utf8` in the schema (vortex/vortex-datafusion/src/datatype.rs, line 67 at 3bef63b; cf. vortex/vortex-array/src/canonical.rs, lines 270 to 271 at 3bef63b).
This is indeed an issue. This particular case would be solved by #757. I don't recall why we didn't take the schema from the array; it likely needs a new trait, i.e. […].
@a10y pointed out that we might have enough metadata in our files, but that we are still missing metadata in our in-memory arrays. However, we should make the batches returned to DataFusion have consistent dtypes.
I see. By the way, DataFusion has a plan to separate the logical type out: apache/datafusion#11513. If that proposal is implemented, then maybe there is no need for Vortex to ensure consistent dtypes for output batches.
Yep, we're tracking that keenly and agree it is likely to help here 😄
Hi Vortex, I am not sure this is the desired behavior. For example, if we compress a `LargeBinary` or `LargeUtf8` Arrow array into Vortex's `ConstantArray` and then canonicalize it back, we get a `Binary` or `Utf8` Arrow array. This is because `VarBinArray::from_iter` always uses the u32 offsets builder (vortex/vortex-array/src/array/varbin/mod.rs, line 164 at e75606d).

This can be reproduced by running the `round_trip_arrow_compressed` test. It is currently ignored, but Arrow now supports comparing structs (vortex/bench-vortex/src/lib.rs, lines 264 to 268 at e75606d). The taxi dataset has a field `store_and_fwd_flag` which is mostly `N`. It is reasonable for a `ConstantArray` to just use u32 offsets, but if we have a `ChunkedArray` where the first chunk is constant and the second chunk is not, we may end up with inconsistent Arrow schemas between output RecordBatches (although this may be a problem of Arrow missing a logical type).