feat: support merge by row_id, row_addr #3254

chenkovsky · 2024-12-16T14:06:21Z

No description provided.

wjones127

Thanks for working on this @chenkovsky. I would like to see a few improvements to the unit tests, and then this is ready to go.

wjones127 · 2024-12-16T18:54:49Z

rust/lance/src/dataset.rs

+        let test_dir = tempdir().unwrap();
+        let test_uri = test_dir.path().to_str().unwrap();


If we aren't testing anything about the files, let's use an in-memory dataset instead.

Suggested change

-        let test_dir = tempdir().unwrap();
-        let test_uri = test_dir.path().to_str().unwrap();

wjones127 · 2024-12-16T18:55:29Z

rust/lance/src/dataset.rs

+        Dataset::write(data, test_uri, Some(write_params.clone()))
+            .await
+            .unwrap();
+
+        let mut dataset = Dataset::open(test_uri).await.unwrap();


If you re-use the dataset instance from write(), you can just use an in-memory dataset:

Suggested change

-        Dataset::write(data, test_uri, Some(write_params.clone()))
-            .await
-            .unwrap();
-
-        let mut dataset = Dataset::open(test_uri).await.unwrap();
+        let dataset = Dataset::write(data, "memory://", Some(write_params.clone()))
+            .await
+            .unwrap();

wjones127 · 2024-12-16T19:05:03Z

rust/lance/src/dataset.rs

+        let new_batch =
+            RecordBatch::try_new(new_schema.clone(), vec![row_ids.clone(), row_ids.clone()])
+                .unwrap();
+        let new_data = RecordBatchIterator::new(vec![Ok(new_batch)], new_schema.clone());
+        dataset.merge(new_data, ROW_ID, "rowid").await.unwrap();
+        dataset.validate().await.unwrap();


I'd like us to assert a few more things in this test:

- The dataset has the expected final schema: key, value, new_value.
- The values are what we expect. For this, you should avoid using the same values in each column. Otherwise, the test could pass even if there is a bug that uses the wrong column's values. Right now, you use row_ids.clone() for both rowid and new_value.
- The merge works even if you shuffle the data. I would recommend using take_record_batch() to reorder the new_batch so the row ids are out of order.
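The checks requested above can be illustrated with a small std-only model of a merge keyed on row id. This is a sketch of the test design, not Lance's implementation or API: `merge_by_row_id`, `take_rows`, and the tuple layout are hypothetical names introduced here, and `take_rows` only mimics the idea of reordering a batch by index.

```rust
use std::collections::HashMap;

/// Hypothetical model of a merge keyed on row id (not Lance's API).
/// `left` holds (row_id, value) in storage order; `right` holds
/// (row_id, new_value) in any order. The output keeps `left`'s order.
fn merge_by_row_id(
    left: &[(u64, i32)],
    right: &[(u64, i32)],
) -> Vec<(u64, i32, Option<i32>)> {
    // Index the right side by row id so its physical order is irrelevant.
    let lookup: HashMap<u64, i32> = right.iter().copied().collect();
    left.iter()
        .map(|&(row_id, value)| (row_id, value, lookup.get(&row_id).copied()))
        .collect()
}

/// Reorder rows by index, similar in spirit to shuffling a batch
/// before handing it to the merge.
fn take_rows<T: Copy>(rows: &[T], indices: &[usize]) -> Vec<T> {
    indices.iter().map(|&i| rows[i]).collect()
}
```

If the test data gives `new_value` different numbers from both `row_id` and `value`, and the right side is passed through `take_rows` first, a merge that paired rows by position or read the wrong column would produce a visibly different result.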

wjones127 · 2024-12-16T19:05:28Z

rust/lance/src/dataset.rs

+        // This test also tests "null filling" when merging (e.g. when keys do not match
+        // we need to insert nulls)


Where is the null filling? It seems like you are providing every row id, unless I am missing something.

chenkovsky · Dec 16, 2024

Sorry, I copied and modified another test.
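For contrast, a test only exercises null filling when the new data omits some row ids. The sketch below models that case with std types only; `merge_with_null_fill` is a hypothetical helper written for illustration and does not reflect how Lance implements merge.

```rust
use std::collections::HashMap;

/// Hypothetical model of null filling in a keyed merge (not Lance's API).
/// Every id in `row_ids` gets an output slot; ids that are absent from
/// `new_values` are filled with `None`, standing in for a null.
fn merge_with_null_fill(
    row_ids: &[u64],
    new_values: &HashMap<u64, i32>,
) -> Vec<Option<i32>> {
    row_ids
        .iter()
        .map(|id| new_values.get(id).copied())
        .collect()
}
```

Supplying values for only a subset of the row ids is what produces the `None` slots; when every row id is provided, as in the test under review, no null is ever inserted.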

wjones127 · 2024-12-16T19:05:45Z

rust/lance/src/dataset.rs

+    #[rstest]
+    #[tokio::test]
+    async fn test_merge_on_row_addr(
+        #[values(LanceFileVersion::Legacy, LanceFileVersion::Stable)]
+        data_storage_version: LanceFileVersion,
+        #[values(false, true)] use_stable_row_id: bool,


Same comments from the row id test apply here.

codecov-commenter · 2024-12-17T00:57:48Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.42%. Comparing base (83b8efd) to head (0ce8ac1).
Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3254      +/-   ##
==========================================
- Coverage   78.47%   78.42%   -0.05%     
==========================================
  Files         245      245              
  Lines       85088    85096       +8     
  Branches    85088    85096       +8     
==========================================
- Hits        66772    66738      -34     
- Misses      15501    15546      +45     
+ Partials     2815     2812       -3

| Flag | Coverage Δ |
|---|---|
| unittests | `78.42% <100.00%> (-0.05%)` ⬇️ |


feat: merge by row_id, row_addr

3f8071f

github-actions bot added the enhancement label Dec 16, 2024

chenkovsky changed the title ~~feat: merge by row_id, row_addr~~ feat: support merge by row_id, row_addr Dec 16, 2024

chenkovsky mentioned this pull request Dec 16, 2024

_rowaddr and _rowid not exposed for merge? #3251

Open

broccoliSpicy requested a review from wjones127 December 16, 2024 15:42

wjones127 requested changes Dec 16, 2024


update test

0ce8ac1
