-
Notifications
You must be signed in to change notification settings - Fork 233
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RUST] python (re)integration v1 #436
Conversation
a020ffb
to
b18bee4
Compare
reader = pa.dataset.Scanner.from_dataset(data_obj).to_reader() | ||
elif isinstance(data_obj, pa.dataset.Scanner): | ||
reader = data_obj.to_reader() | ||
elif isinstance(data_obj, pa.RecordBatchReader): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is nice. Finally have the generator support. Could we add a test for it?
rust/pylance/src/scanner.rs
Outdated
|
||
fn to_reader(self_: PyRef<'_, Self>) -> PyResult<PyObject> { | ||
self_.rt.block_on(async { | ||
let mut scanner: LanceScanner = self_.dataset.scan(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did not see which of the following code needs to be async.
is it ok to not use rt.block_on
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
into_stream calls tokio spawn so requires a Runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rust/pylance/src/scanner.rs
Outdated
#[pymethods] | ||
impl Scanner { | ||
#[getter(dataset_schema)] | ||
fn dataset_schema(self_: PyRef<'_, Self>) -> PyResult<PyObject> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why use self_
as name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Convention from examples - this has a reference to Python whereas &self: Self does not
rust/pylance/src/reader.rs
Outdated
stream.next().await.map(|rs| { | ||
rs.map_err(|err| { | ||
match err { | ||
Error::Arrow(err) => ArrowError::IoError(err), // we lose the error type converting to LanceError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we implement a From<&LanceError> for ArrowError
60bae41
to
de411d7
Compare
Hello world for python (re)integration from Rust package. Enables bare minimum: ```python import lance batch = lance.read_batch("/path/to/dataset.lance") lance.write_batch(batch, "/path/to/write/dataset.lance") ``` 1. Separate the Rust lib build and the python lib (e.g., cargo build doesn't seem to work as is) 2. Read batch succeeds but prints errors about tokio worker thread being cancelled 3. Currently each method call has to create its own tokio runtime, bad bad bad. 4. Function signatures / docstrings Our main point of integration with pyarrow is extending the Dataset and Scanner interface. In current python package we: 1. Expose appropriate classes/methods from C++ in the cython code. 2. Define cython class FileSystemDataset(Dataset) 3. Define methods that match the Dataset interface and call our own BuildScanner. For Rust, we have 2 options: 1. Manually define PyTypeInfo for pyarrow.dataset.Dataset and we can define everything in _lib.pyx using pyo3 2. Expose the bare minimum from pyo3 and then code up FileSystemDataset(Dataset) by hand in python Either way, it'll require some mixing of Rust/pyo3 generated python that can be imported and used by a hand-crafted python module on top. Option 2 seems much easier. What do we want to do if people want to use the native pa.dataset.write_dataset and pa.dataset.dataset interfaces?
ce5a6fd
to
f66a125
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Go go go
} | ||
|
||
/// Find k-nearest neighbour within the vector column. | ||
pub fn nearest(&mut self, column: &str, q: &Float32Array, k: usize) -> &mut Self { | ||
pub fn nearest(&mut self, column: &str, q: &Float32Array, k: usize) -> Result<&mut Self> { | ||
if k <= 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
k is usize
which can not be negative, it only needs to check k == 0
?
@@ -0,0 +1,6 @@ | |||
from .dataset import LanceDataset, write_dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not urgent not. but let's add license headers later?
lance.write_dataset
andlance.dataset
.to_table()
orto_scanner().to_reader()
(like datafusion or polars)