Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure IPC stream messages are contiguous #6321

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

bkietz
Copy link
Member

@bkietz bkietz commented Aug 28, 2024

Which issue does this PR close?

Closes #6311

Rationale for this change

The format guarantees that each IPC file embeds a valid IPC stream. However, when writing IPC files arrow-rs currently aligns the encapsulated flatbuffers Messages to 64 byte boundaries instead of 8 bytes. This can leave gaps of padding bytes between the Messages which are not valid in the stream format.

What changes are included in this PR?

  • flatbuffers Messages are written with no leading padding; instead padding is written after the flatbuffers Message to enforce IpcWriteOptions::alignment
  • FileWriter and StreamWriter track total written bytes in order to ensure body buffers are aligned relative to the file/stream start
  • arrow-json-integration-test is amended to assert that IPC files can be read by a stream reader

Are there any user-facing changes?

No

@bkietz bkietz requested review from westonpace and tustvold August 28, 2024 18:00
@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 28, 2024
@bkietz
Copy link
Member Author

bkietz commented Aug 28, 2024

The failures in integration are expected; JS and Go are still producing IPC files without valid embedded streams so they will not be considered valid by the modified arrow-json-integration-test. I wasn't sure how best to expose that added validation; maybe an env var or build option or we just leave it commented until go and JS are fixed?

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this. I'm not an expert on this code but here's some initial thoughts from a scan of the changes.

write_message_at_offset(writer, 0, encoded, write_options)
}

fn write_message_at_offset<W: Write>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: this function name confuses me slightly. If I'm writing to a "file-like object' (e.g. Write) then there is no underlying capability to write at an offset. "Skip offset bytes" doesn't make sense in a write context (as opposed to a read context).

I think...instead...you are maybe using the offset to determine how much padding to add at the end?

Maybe write_remaining_message with offset replaced by already_written?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about write_message_positioned_absolutely(writer, position, encoded, options)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, since write_message doesn't seem widely used I could just add the already_written_len argument there rather than adding a new fn

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@westonpace how about now?

Comment on lines -831 to -875
/// The number of bytes between each block of bytes, as an offset for random access
block_offsets: usize,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand what this was doing so I'm not sure I understand what it means that it is going away.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the former mechanism for tracking absolute position in the output file. It was named according to the only thing it was used for: setting Block::offsets. Since it's now used to track how much padding should be written after flatbuffers Messages and to make its expected value more obvious, I renamed it.

@alamb
Copy link
Contributor

alamb commented Aug 31, 2024

The integration CI test failure looks real to me: https://github.com/apache/arrow-rs/actions/runs/10603567022/job/29388207333?pr=6321

@bkietz
Copy link
Member Author

bkietz commented Sep 2, 2024

@alamb integration failures are expected since this PR adds validation for an aspect of the file format which not all producers guarantee yet (see also the main tracking issue). I wasn't sure how (or if) that added validation should be in this PR, any advice would be welcome.

@alamb
Copy link
Contributor

alamb commented Sep 18, 2024

I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more

Comment on lines 1260 to 1321
pub fn write_message<W: Write>(
mut writer: W,
already_written_len: usize,
encoded: EncodedData,
write_options: &IpcWriteOptions,
) -> Result<(usize, usize), ArrowError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think technically this would be a breaking change as the method is publicly exposed to the API. Could maybe deprecate this and create it as a new method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's preferable I can certainly do that; I think that was the first thing I wrote. Then it seemed that this function is only public for use in arrow-flight, so I prioritized keeping things simpler.

arrow-ipc/src/writer.rs Outdated Show resolved Hide resolved
@bkietz bkietz force-pushed the overalign-only-body-buffers branch from 70c88e8 to 9877966 Compare October 3, 2024 20:02
@tustvold tustvold added api-change Changes to the arrow API next-major-release the PR has API changes and it waiting on the next major version labels Oct 4, 2024
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get some tests please

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate next-major-release the PR has API changes and it waiting on the next major version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

IPC code writes files which do not include a valid stream
5 participants