How does Vortex compare to Lance? #1226
Hi Phil, great question. We've been following along with Lance for a while now, and there's some neat stuff in there. Please take this answer with the caveat that my understanding of Lance isn't as deep as my understanding of Vortex!

At a high level, you can think of Lance as covering the architectural "boxes" of Iceberg + Parquet, whereas Vortex is more of a pure replacement for Parquet or ORC. Lance provides dataset semantics (atomicity, versioning, etc.) as well as additional features to support building vector indices against those datasets.

Internally, the Lance V1 format originally provided fast random access by simply storing uncompressed Arrow. That's no longer the case: their V2 format recently added FSST and our FastLanes BitPacking to compress strings and integers, respectively. That said, their compression implementation is less complete. For example, there is no float compression such as Vortex ALP, and Lance doesn't currently support cascading compression codecs.

Notably, one of the things that makes Vortex special is that it defines its own in-memory format for compressed Arrays (think of it as compressed Arrow), which opens up all sorts of potential compute engine optimizations. And by drawing Vortex's architectural box this way, Vortex (or its component parts) can be reused more easily in other systems (e.g., we'd like to get Vortex support into Iceberg eventually). In the same way that DataFusion is a useful compute library for many Rust projects (e.g. Influx, and I believe yourselves), Vortex should be seen as a generally useful storage library with reusable components.

We need to clean up the benchmarks a bit before formally publishing them, but Vortex currently has ~2x the write throughput of Parquet v2 with zstd, ~2-5x the full scan throughput, and ~200x faster random access reads, while typically being approximately the same size (high variance, +/- 50%, but the median is probably ~10% bigger than Parquet).
I'm not sure about the latest performance numbers for Lance, but at the very least, I would expect their storage size to be much larger (2-10x for most datasets).

In terms of which to use: if you'd be using it as an internal component of another data-oriented project, then definitely give Vortex a go. I'm sure there will be some bumps, missing docs, and other issues common with early-stage projects, but do let us know! If you are an end user, for example working in Python and performing ML/search-oriented tasks, then Lance is your best bet for now.
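To make "cascading compression codecs" a bit more concrete, here is a toy Python sketch of chaining two integer encodings: frame-of-reference first, then bit-packing of the residuals. Everything in it is an assumption for illustration. The function names are hypothetical, it assumes a non-empty list of integers, and the packing layout is a naive one rather than the actual FastLanes layout or anything from the Vortex or Lance APIs.

```python
def frame_of_reference(values):
    """First codec: subtract the minimum so residuals are small and non-negative."""
    base = min(values)
    return base, [v - base for v in values]


def bitpack(values, width):
    """Second codec: pack each value into `width` bits of one big integer.

    Naive toy layout, NOT the FastLanes layout.
    """
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * width)
    return packed


def bitunpack(packed, width, count):
    """Inverse of `bitpack`: extract `count` values of `width` bits each."""
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]


def compress(values):
    """Cascade: the output of frame-of-reference is the input to bit-packing."""
    base, residuals = frame_of_reference(values)
    # Use at least 1 bit so the all-equal case still round-trips.
    width = max(1, max(r.bit_length() for r in residuals))
    return {
        "base": base,
        "width": width,
        "count": len(values),
        "packed": bitpack(residuals, width),
    }


def decompress(block):
    """Undo the cascade in reverse order: unpack, then add the base back."""
    residuals = bitunpack(block["packed"], block["width"], block["count"])
    return [r + block["base"] for r in residuals]


values = [1_000_000, 1_000_003, 1_000_001, 1_000_007]
block = compress(values)
restored = decompress(block)
```

The point of the cascade is that bit-packing the raw values here would need 20 bits each, while bit-packing the frame-of-reference residuals needs only 3.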
Hey everyone! Phil here from @paradedb. We're pretty interested in a Parquet/Arrow successor, and were previously considering Lance for its fast random access reads. Could you please share, in your own words, how Vortex compares? When should one consider one vs. the other?