IPC format support for StringViewArray and BinaryViewArray #5525

XiangpengHao · 2024-03-17T23:05:32Z

Which issue does this PR close?

Part of #5506.

Rationale for this change

This PR adds the changes needed to handle BinaryView and Utf8View in the IPC reader and writer.

What changes are included in this PR?

The changes are slightly larger than expected because BinaryView and Utf8View have variadicBufferCounts, which no other type had before.
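To make the variadic part concrete, here is a minimal std-only sketch of the view layout as described in the Arrow columnar format: each element is a 16-byte view, values of up to 12 bytes are stored inline, and longer values point into one of an arbitrary number of data buffers. `ToyViewArray`, `build`, and `block_size` are hypothetical names for illustration and are not the arrow-rs API.

```rust
/// Encode one 16-byte view: 4-byte length, then either up to 12 inline bytes,
/// or a 4-byte prefix followed by a buffer index and an offset.
fn make_view(value: &[u8], buffer_index: u32, offset: u32) -> [u8; 16] {
    let mut view = [0u8; 16];
    view[0..4].copy_from_slice(&(value.len() as u32).to_le_bytes());
    if value.len() <= 12 {
        view[4..4 + value.len()].copy_from_slice(value);
    } else {
        view[4..8].copy_from_slice(&value[0..4]);
        view[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        view[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    view
}

/// Hypothetical toy array: one views buffer plus any number of data buffers.
struct ToyViewArray {
    views: Vec<[u8; 16]>,
    data_buffers: Vec<Vec<u8>>,
}

/// Hypothetical builder: opens a new data buffer whenever appending to the
/// current one would exceed `block_size` bytes.
fn build(values: &[&str], block_size: usize) -> ToyViewArray {
    let mut array = ToyViewArray { views: Vec::new(), data_buffers: Vec::new() };
    for value in values {
        let bytes = value.as_bytes();
        if bytes.len() <= 12 {
            // Short values never touch a data buffer.
            array.views.push(make_view(bytes, 0, 0));
            continue;
        }
        let needs_new = match array.data_buffers.last() {
            Some(last) => last.len() + bytes.len() > block_size,
            None => true,
        };
        if needs_new {
            array.data_buffers.push(Vec::new());
        }
        let index = array.data_buffers.len() - 1;
        let buffer = &mut array.data_buffers[index];
        let offset = buffer.len() as u32;
        buffer.extend_from_slice(bytes);
        array.views.push(make_view(bytes, index as u32, offset));
    }
    array
}

/// The per-column count covers only the data buffers, not the views buffer.
fn variadic_buffer_count(array: &ToyViewArray) -> i64 {
    array.data_buffers.len() as i64
}

fn main() {
    let short_only = build(&["a", "bc"], 32);
    let mixed = build(
        &["a", "a string longer than twelve bytes", "another fairly long string value"],
        32,
    );
    println!(
        "views: {} and {}, data buffers: {} and {}",
        short_only.views.len(),
        mixed.views.len(),
        variadic_buffer_count(&short_only),
        variadic_buffer_count(&mixed)
    );
}
```

Because the number of data buffers depends on the values rather than on the type, a reader cannot derive it from the schema, which is why the count has to travel with the batch.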

The current implementation ignores the offset of a Binary/Utf8View array, meaning that the entire buffers will be serialized to the IPC buffer. This might write more data than necessary. Slicing a view array and writing it to the IPC buffer is non-trivial and was left as future work.

In #5506, @alamb mentioned (3) the integration tests. I'm not entirely sure how to do this. Should we generate some Arrow data, commit it to the testing repository, and then add more tests to arrow-rs/arrow-integration-testing/tests/ipc_reader.rs?

Are there any user-facing changes?

I only learned the Arrow memory layout today, so I expect there are quite a few corner cases I didn't handle. Please feel free to comment if you see anything non-intuitive :-)

alamb

Thank you @XiangpengHao -- this is looking great

I left a bunch of minor comments -- but the only thing missing before this is mergeable, in my opinion, is a set of round trip tests (to ensure that we can write these arrays to an IPC file/stream and then read them back and get the same result)

I think we can use the same roundtrip pattern here:

arrow-rs/arrow-ipc/src/reader.rs

Lines 1571 to 1604 in 72854c4

    
    #[test]
    fn test_roundtrip_stream_run_array_sliced() {
        let run_array_1: Int32RunArray = vec!["a", "a", "a", "b", "b", "c", "c", "c"]
            .into_iter()
            .collect();
        let run_array_1_sliced = run_array_1.slice(2, 5);

        let run_array_2_inupt = vec![Some(1_i32), None, None, Some(2), Some(2)];
        let mut run_array_2_builder = PrimitiveRunBuilder::<Int16Type, Int32Type>::new();
        run_array_2_builder.extend(run_array_2_inupt);
        let run_array_2 = run_array_2_builder.finish();

        let schema = Arc::new(Schema::new(vec![
            Field::new(
                "run_array_1_sliced",
                run_array_1_sliced.data_type().clone(),
                false,
            ),
            Field::new("run_array_2", run_array_2.data_type().clone(), false),
        ]));
        let input_batch = RecordBatch::try_new(
            schema,
            vec![Arc::new(run_array_1_sliced.clone()), Arc::new(run_array_2)],
        )
        .unwrap();
        let output_batch = roundtrip_ipc_stream(&input_batch);

        // As partial comparison not yet supported for run arrays, the sliced run array
        // has to be unsliced before comparing with the output. the second run array
        // can be compared as such.
        assert_eq!(input_batch.column(1), output_batch.column(1));

        let run_array_1_unsliced = unslice_run_array(run_array_1_sliced.into_data()).unwrap();

The cases to cover are:

Basic BinaryView / Utf8View
Sliced BinaryView / Utf8View
Nested BinaryView/Utf8View in Dictionary/Struct/List (to cover the code in set_variadic_buffer_counts)
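The shape of such a round trip check can be sketched without the arrow crates. The example below is a toy under stated assumptions: `ToyViewColumn`, `encode`, and `decode` are hypothetical, and the length-prefixed framing is invented for illustration and is not the Arrow IPC wire format. What it does mirror is the idea that the writer records how many data buffers follow the views buffer, so the reader knows how many to consume.

```rust
/// Hypothetical stand-in for a view column: a views buffer plus data buffers.
#[derive(Debug, Clone, PartialEq)]
struct ToyViewColumn {
    views: Vec<u8>,
    data_buffers: Vec<Vec<u8>>,
}

/// Append a length-prefixed block (toy framing, not Arrow IPC).
fn put_block(out: &mut Vec<u8>, block: &[u8]) {
    out.extend_from_slice(&(block.len() as u32).to_le_bytes());
    out.extend_from_slice(block);
}

/// Read a little-endian u32 and advance the input slice.
fn take_u32(input: &mut &[u8]) -> Option<u32> {
    if input.len() < 4 {
        return None;
    }
    let (head, rest) = input.split_at(4);
    *input = rest;
    Some(u32::from_le_bytes([head[0], head[1], head[2], head[3]]))
}

/// Read one length-prefixed block and advance the input slice.
fn take_block(input: &mut &[u8]) -> Option<Vec<u8>> {
    let len = take_u32(input)? as usize;
    if input.len() < len {
        return None;
    }
    let (block, rest) = input.split_at(len);
    *input = rest;
    Some(block.to_vec())
}

/// Writer: the data buffer count goes first, then the views, then each buffer.
fn encode(column: &ToyViewColumn) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(column.data_buffers.len() as u32).to_le_bytes());
    put_block(&mut out, &column.views);
    for buffer in &column.data_buffers {
        put_block(&mut out, buffer);
    }
    out
}

/// Reader: uses the recorded count to know how many data buffers to read.
fn decode(mut input: &[u8]) -> Option<ToyViewColumn> {
    let count = take_u32(&mut input)?;
    let views = take_block(&mut input)?;
    let mut data_buffers = Vec::new();
    for _ in 0..count {
        data_buffers.push(take_block(&mut input)?);
    }
    Some(ToyViewColumn { views, data_buffers })
}

fn roundtrip(column: &ToyViewColumn) -> ToyViewColumn {
    decode(&encode(column)).expect("toy encoding should decode")
}

fn main() {
    let column = ToyViewColumn {
        views: vec![0u8; 32],
        data_buffers: vec![b"first data buffer".to_vec(), b"second".to_vec()],
    };
    // Same assertion shape as the arrow-rs test above: output equals input.
    assert_eq!(roundtrip(&column), column);
    println!("round trip preserved {} data buffers", column.data_buffers.len());
}
```

A real test for this PR would instead build a RecordBatch and pass it through a helper such as roundtrip_ipc_stream, as in the snippet quoted above.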

The current implementation ignores the offset of a Binary/Utf8View array, meaning that the entire buffers will be serialized to the IPC buffer. This might write more data than necessary. Slicing a view array and writing it to the IPC buffer is non-trivial and was left as future work.

I think the IPC serializer should just serialize the raw arrays as given and not try to optimize anything. If users want to "compact" the arrays prior to sending them over IPC, I think it should be an explicit choice, and they can do it via the gc API suggested in #5513.
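As an illustration of what an explicit compaction step could do, the sketch below copies only the bytes that the remaining views still reference into a single fresh data buffer. This is a std-only toy: the `View` enum and `compact` function are hypothetical and do not reflect the gc API proposed in #5513.

```rust
/// Hypothetical simplified view: either inline bytes or a reference into a data buffer.
#[derive(Debug, Clone, PartialEq)]
enum View {
    Inline(Vec<u8>),
    Ref { buffer: usize, offset: usize, len: usize },
}

/// Resolve a view to the bytes it denotes.
fn value<'a>(view: &'a View, buffers: &'a [Vec<u8>]) -> &'a [u8] {
    match view {
        View::Inline(bytes) => bytes.as_slice(),
        View::Ref { buffer, offset, len } => &buffers[*buffer][*offset..*offset + *len],
    }
}

/// Rewrite the views so they point into one new buffer that holds only referenced bytes.
fn compact(views: &[View], buffers: &[Vec<u8>]) -> (Vec<View>, Vec<Vec<u8>>) {
    let mut data = Vec::new();
    let mut out = Vec::with_capacity(views.len());
    for view in views {
        match view {
            View::Inline(_) => out.push(view.clone()),
            View::Ref { .. } => {
                let bytes = value(view, buffers);
                out.push(View::Ref { buffer: 0, offset: data.len(), len: bytes.len() });
                data.extend_from_slice(bytes);
            }
        }
    }
    let new_buffers = if data.is_empty() { Vec::new() } else { vec![data] };
    (out, new_buffers)
}

fn main() {
    // One 48-byte data buffer, but the sliced array only references its second half.
    let buffers = vec![b"first long string value!second long string value".to_vec()];
    let sliced_views = vec![
        View::Ref { buffer: 0, offset: 24, len: 24 },
        View::Inline(b"hi".to_vec()),
    ];
    let (new_views, new_buffers) = compact(&sliced_views, &buffers);
    for (old, new) in sliced_views.iter().zip(&new_views) {
        assert_eq!(value(old, &buffers), value(new, &new_buffers));
    }
    println!("{} -> {} data bytes", buffers[0].len(), new_buffers[0].len());
}
```

Keeping this step separate from serialization means the writer stays a plain copy of the buffers it is given, and callers pay for the extra pass only when they ask for it.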

In #5506, @alamb mentioned (3) the integration tests. I'm not entirely sure how to do this. Should we generate some Arrow data, commit it to the testing repository, and then add more tests to arrow-rs/arrow-integration-testing/tests/ipc_reader.rs?

Maybe @bkietz knows if we have added StringViewArrays to the integration test suite already. I did not see any commits in https://github.com/apache/arrow-testing/commits/master that have such files.

If we don't have such files, I think we should add them / work with the other language teams to add them for compatibility as a follow-on task. I can file tickets to track this.

Thank you @ariesdevil and @viirya for the reviews

bkietz · 2024-03-19T21:24:10Z

@alamb
We have added Utf8View to archery integration testing here. C++ <-> Go passes through both IPC and cABI (== arrow-rs::ffi). A PR to remove the skip on rust should add arrow-rs to the party.

tustvold · 2024-03-19T23:36:39Z

arrow-ipc/src/writer.rs

+            // The spec is not clear on whether the view/null buffer should be included in the variadic buffer count.
+            // But from C++ impl https://github.com/apache/arrow/blob/b448b33808f2dd42866195fa4bb44198e2fc26b9/cpp/src/arrow/ipc/writer.cc#L477
+            // we know they are not included.
+            counts.push(array.to_data().buffers().len() as i64 - 1);


Suggested change

-            counts.push(array.to_data().buffers().len() as i64 - 1);
+            counts.push(array.data_buffers().len() as i64);

data_buffers() is only available once the array has been downcast to GenericByteViewArray.

In this pattern, it must be a GenericByteViewArray, so using data_buffers() here is right.

Yes, but it's quite verbose to first cast each type into BinaryView or Utf8View and then call data_buffers().

tustvold · 2024-03-19T23:42:00Z

Had a brief look and I like where this is headed. I left some comments. Beyond what others have already suggested, I wonder if we could integrate the variadicBuffer collection into the existing logic that traverses the nested types. This would be quicker, simpler, and probably easier to maintain.
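A rough sketch of that idea, assuming a simplified model: one depth-first pass over the nested arrays both visits each node (where a real writer would emit field nodes and buffers) and records one count per view-typed leaf, in pre-order. `ToyArray`, `visit`, and `collect` are hypothetical, and dictionary-encoded columns are deliberately left out of this toy.

```rust
/// Hypothetical, simplified stand-in for a nested Arrow array.
enum ToyArray {
    Primitive,
    /// A BinaryView/Utf8View leaf with this many data buffers.
    View { data_buffers: usize },
    List(Box<ToyArray>),
    Struct(Vec<ToyArray>),
}

/// Single pre-order traversal: count the node, and if it is a view leaf,
/// push its data buffer count in the same pass.
fn visit(array: &ToyArray, nodes: &mut usize, counts: &mut Vec<i64>) {
    *nodes += 1;
    match array {
        ToyArray::Primitive => {}
        ToyArray::View { data_buffers } => counts.push(*data_buffers as i64),
        ToyArray::List(child) => visit(child, nodes, counts),
        ToyArray::Struct(children) => {
            for child in children {
                visit(child, nodes, counts);
            }
        }
    }
}

/// Walk every top-level column of a batch.
fn collect(columns: &[ToyArray]) -> (usize, Vec<i64>) {
    let mut nodes = 0;
    let mut counts = Vec::new();
    for column in columns {
        visit(column, &mut nodes, &mut counts);
    }
    (nodes, counts)
}

fn main() {
    let columns = vec![
        ToyArray::View { data_buffers: 2 },
        ToyArray::Struct(vec![
            ToyArray::Primitive,
            ToyArray::List(Box::new(ToyArray::View { data_buffers: 0 })),
        ]),
    ];
    let (nodes, counts) = collect(&columns);
    println!("visited {} nodes, variadic counts = {:?}", nodes, counts);
}
```

Folding the count collection into the traversal that already exists avoids a second walk over the same tree and keeps the ordering of counts tied to the ordering of the written nodes.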

alamb

Hi @XiangpengHao

What is left before we can merge this PR?

It seems like the two remaining items are:

Roundtrip tests (just in this repo, as described in #5525 (review))
Maybe a follow up ticket to track adding archery testing (e.g. #5525 (comment))

Is there anything else?

XiangpengHao · 2024-03-22T17:01:31Z

I think those are the two major todos. Sorry, I've been quite busy these days; I will try to address them in a few days.

alamb · 2024-03-22T17:58:32Z

I think those are the two major todos. Sorry, I've been quite busy these days; I will try to address them in a few days.

No worries!

I think we could merge this PR with just the first (round trip tests) and then do the integration test in a follow on PR

XiangpengHao · 2024-03-27T16:44:24Z

Finally got back to this! I checked in the roundtrip tests and fixed bugs related to dictionary encodings. Can you @tustvold @alamb take another look?

alamb

Thanks @XiangpengHao -- I think this is really close.

I think the tests need a few tweaks and the CI needs fixing, but then this will be good to go.

Thank you again so much 🙏

XiangpengHao · 2024-03-28T20:54:06Z

Updated the tests! I believe the CI failure is not related to this PR.

alamb

I think it looks good to me -- thank you @XiangpengHao 🙏

We can keep iterating in subsequent PRs I think

alamb · 2024-03-28T21:48:28Z

arrow-ipc/src/writer.rs

@@ -1247,6 +1291,22 @@ fn write_array_data(
                compression_codec,
            )?;
        }
+    } else if matches!(data_type, DataType::BinaryView | DataType::Utf8View) {
+        // Slicing the views buffer is safe and easy,
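The diff above is cut off, so the following is only a guess at what "safe and easy" refers to: since every view occupies exactly 16 bytes, the views for a slice of `len` elements starting at `offset` form a contiguous byte range, while the data buffers can stay untouched because views address them by buffer index and absolute offset. `slice_views` is a hypothetical helper, not the code from this PR.

```rust
/// Size of one view in bytes, per the Arrow view layout.
const VIEW_SIZE: usize = 16;

/// Return the bytes of the views covering elements `offset..offset + len`.
fn slice_views(views: &[u8], offset: usize, len: usize) -> &[u8] {
    &views[offset * VIEW_SIZE..(offset + len) * VIEW_SIZE]
}

fn main() {
    // Four inline views whose length fields are 1, 2, 3 and 4.
    let mut views = Vec::new();
    for i in 0..4u8 {
        let mut view = [0u8; VIEW_SIZE];
        view[0] = i + 1; // low byte of the little-endian length field
        views.extend_from_slice(&view);
    }
    let sliced = slice_views(&views, 1, 2);
    println!("{} bytes, first length field = {}", sliced.len(), sliced[0]);
}
```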


alamb · 2024-03-28T21:52:58Z

Updated the tests! I believe the CI failure is not related to this PR.

I agree it doesn't look related. I made #5564 to test this theory

Update: the CI fails on main as well; filed #5565

alamb · 2024-03-28T23:09:15Z

Update: the CI fails on main as well; filed #5565 -- I'll try and look at it in a day or two if no one beats me to it

alamb · 2024-03-31T09:21:07Z

I took the liberty of merging up from master to this branch to hopefully get a clean CI run

XiangpengHao · 2024-03-31T14:51:00Z

Thanks @alamb, the CI passed!

alamb · 2024-04-01T18:31:06Z

Thanks again @XiangpengHao -- this is a very nice step forward

XiangpengHao added 3 commits March 17, 2024 21:30

check in ipc format for view types

06e0610

update tests

4f0c80d

fix variadic counting

4c29007

github-actions bot added the arrow label (Changes to the arrow crate) Mar 17, 2024

viirya reviewed Mar 17, 2024

fix linting, address comments

72854c4

ariesdevil reviewed Mar 18, 2024

alamb mentioned this pull request Mar 18, 2024

DataFusion weekly project plan (Andrew Lamb) - March 18, 2024 apache/datafusion#9675

Closed

7 tasks

alamb reviewed Mar 19, 2024

tustvold reviewed Mar 19, 2024

tustvold reviewed Mar 19, 2024

tustvold reviewed Mar 19, 2024

bkietz reviewed Mar 20, 2024

XiangpengHao and others added 2 commits March 20, 2024 13:27

Apply suggestions from code review

850b01e

Co-authored-by: Benjamin Kietzman and Raphael Taylor-Davies

address some review comments

3b59c08

XiangpengHao marked this pull request as draft March 20, 2024 20:29

update comments

d121ce6

alamb reviewed Mar 22, 2024

tustvold reviewed Mar 25, 2024

XiangpengHao added 2 commits March 27, 2024 16:32

Add tests and fix bugs with dict types

d16cf23

make clippy happy

bb9e42f

XiangpengHao marked this pull request as ready for review March 27, 2024 16:41

alamb reviewed Mar 28, 2024

update test cases

0ffd783

alamb approved these changes Mar 28, 2024

alamb mentioned this pull request Mar 28, 2024

WIP -- Testing CI #5564

Closed

Merge remote-tracking branch 'apache/master' into XiangpengHao/master

448bc9e

alamb merged commit 17058c7 into apache:master Apr 1, 2024
25 checks passed

alamb mentioned this pull request Apr 1, 2024

DataFusion weekly project plan (Andrew Lamb) - April 1, 2024 apache/datafusion#9899

Closed

7 tasks

tustvold mentioned this pull request Apr 17, 2024

parquet / Build wasm32 (pull_request) CI check failing on main #5565

Closed
