Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RUST] python (re)integration v1 #436

Merged
merged 27 commits into from
Jan 24, 2023
Merged

[RUST] python (re)integration v1 #436

merged 27 commits into from
Jan 24, 2023

Conversation

changhiskhan
Copy link
Contributor

@changhiskhan changhiskhan commented Jan 17, 2023

  1. exposes lance.write_dataset and lance.dataset.
  2. Has APIs for projection pushdown, offset/limit, and nearest neighbors
  3. No predicate pushdown yet
  4. The dataset is not readable by duckdb directly. Requires calling either to_table() or to_scanner().to_reader() (like datafusion or polars)

@changhiskhan changhiskhan requested a review from eddyxu January 17, 2023 18:16
@changhiskhan changhiskhan force-pushed the changhiskhan/pyo3 branch 2 times, most recently from a020ffb to b18bee4 Compare January 21, 2023 02:48
@changhiskhan changhiskhan changed the title [RUST] python integration hello world [RUST] python (re)integration v1 Jan 21, 2023
@changhiskhan changhiskhan marked this pull request as ready for review January 21, 2023 04:55
rust/src/dataset/scanner.rs Outdated Show resolved Hide resolved
reader = pa.dataset.Scanner.from_dataset(data_obj).to_reader()
elif isinstance(data_obj, pa.dataset.Scanner):
reader = data_obj.to_reader()
elif isinstance(data_obj, pa.RecordBatchReader):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice. Finally have the generator support. Could we add a test for it?


fn to_reader(self_: PyRef<'_, Self>) -> PyResult<PyObject> {
self_.rt.block_on(async {
let mut scanner: LanceScanner = self_.dataset.scan();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did not see which of the following code needs to be async.

is it ok to not use rt.block_on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

into_stream calls tokio spawn so requires a Runtime.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#[pymethods]
impl Scanner {
#[getter(dataset_schema)]
fn dataset_schema(self_: PyRef<'_, Self>) -> PyResult<PyObject> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why use self_ as name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Convention from examples - this has a reference to Python whereas &self: Self does not

rust/pylance/src/scanner.rs Outdated Show resolved Hide resolved
rust/pylance/src/scanner.rs Show resolved Hide resolved
stream.next().await.map(|rs| {
rs.map_err(|err| {
match err {
Error::Arrow(err) => ArrowError::IoError(err), // we lose the error type converting to LanceError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we implement a From<&LanceError> for ArrowError

rust/pylance/src/dataset.rs Outdated Show resolved Hide resolved
@changhiskhan changhiskhan force-pushed the changhiskhan/pyo3 branch 4 times, most recently from 60bae41 to de411d7 Compare January 23, 2023 06:52
Hello world for python (re)integration from Rust package.
Enables bare minimum:

```python
import lance
batch = lance.read_batch("/path/to/dataset.lance")
lance.write_batch(batch, "/path/to/write/dataset.lance")
```

1. Separate the Rust lib build and the python lib (e.g., cargo build doesn't seem to work as is)
2. Read batch succeeds but prints errors about tokio worker thread being cancelled
3. Currently each method call has to create its own tokio runtime, bad bad bad.
4. Function signatures / docstrings

Our main point of integration with pyarrow is extending the Dataset and Scanner interface. In current python package we:
1. Expose appropriate classes/methods from C++ in the cython code.
2. Define cython class FileSystemDataset(Dataset)
3. Define methods that match the Dataset interface and call our own BuildScanner.

For Rust, we have 2 options:
1. Manually define PyTypeInfo for pyarrow.dataset.Dataset and we can define everything in _lib.pyx using pyo3
2. Expose the bare minimum from pyo3 and then code up FileSystemDataset(Dataset) by hand in python

Either way, it'll require some mixing of Rust/pyo3 generated python that can be imported and used by a hand-crafted python module on top. Option 2 seems much easier.

What do we want to do if people want to use the native pa.dataset.write_dataset and pa.dataset.dataset interfaces?
Copy link
Contributor

@eddyxu eddyxu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go go go

}

/// Find k-nearest neighbour within the vector column.
pub fn nearest(&mut self, column: &str, q: &Float32Array, k: usize) -> &mut Self {
pub fn nearest(&mut self, column: &str, q: &Float32Array, k: usize) -> Result<&mut Self> {
if k <= 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

k is usize which can not be negative, it only needs to check k == 0?

@@ -0,0 +1,6 @@
from .dataset import LanceDataset, write_dataset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not urgent not. but let's add license headers later?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants