Allow limited access to `OsStr` bytes #109698
`OsStr` has historically kept its implementation details private out of concern for locking us into a specific encoding on Windows. This is an alternative to rust-lang#95290, which proposed specifying the encoding on Windows. Instead, this only specifies that, for cross-platform code, `OsStr`'s encoding is a superset of UTF-8, and it defines rules for safely interacting with it. At minimum, this can greatly simplify the `os_str_bytes` crate and every arg parser that interacts with `OsStr` directly (which is most of those that support invalid UTF-8).
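As a rough illustration of the arg-parser use case, here is a minimal sketch that splits a `--name=value` argument on `=` without first requiring valid UTF-8. It assumes the method names `as_encoded_bytes` and `from_encoded_bytes_unchecked`, which are the names this API later received in `std` (Rust 1.74 and newer) and are not named in this description; `split_flag` is a hypothetical helper.

```rust
use std::ffi::{OsStr, OsString};

/// Hypothetical helper: splits `--name=value` at the first `=`.
/// Returns `None` when the argument contains no `=`.
fn split_flag(arg: &OsStr) -> Option<(&OsStr, &OsStr)> {
    // Assumption: `as_encoded_bytes` is the accessor name as later stabilized in std.
    let bytes = arg.as_encoded_bytes();
    let eq = bytes.iter().position(|&b| b == b'=')?;
    let (name, rest) = bytes.split_at(eq);
    let value = &rest[1..]; // skip the `=` itself

    // SAFETY: both slices come from `as_encoded_bytes` on the same `OsStr`,
    // and each cut is immediately before or after `=`, which is a valid,
    // non-empty UTF-8 substring.
    unsafe {
        Some((
            OsStr::from_encoded_bytes_unchecked(name),
            OsStr::from_encoded_bytes_unchecked(value),
        ))
    }
}

fn main() {
    let arg = OsString::from("--color=always");
    if let Some((name, value)) = split_flag(&arg) {
        println!("{:?} -> {:?}", name, value);
    }
}
```

The only `unsafe` step is rebuilding `&OsStr` from the two halves; searching the bytes for an ASCII delimiter is ordinary safe code.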
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @cuviper (or someone else) soon. Please see the contribution instructions for more information. Namely, in order to ensure the minimum review times lag, PR authors and assigned reviewers should ensure that the review label (
|
Hey! It looks like you've submitted a new PR for the library teams! If this PR contains changes to any Examples of
|
01f8d93 to 0d87d66
I think this counts as new guarantee for API purposes. @rustbot label +T-libs-api -T-libs |
Out of interest, was there anything in particular that motivated you to propose this now? |
I think this should come with an example since it's a wide pointer. Conversion from C to Rust has to jump through extra hoops. |
I'm personally amenable to adding this guarantee, but I do agree with @the8472 that we need to confirm that this will actually work as intended. |
Sorry for the lack of details and confusion on this. I was under the impression that working towards my final goal would best be done in smaller steps of defining My hope is that this more limited alternative to #95290 will have a chance to move forward. |
If we do this, I wonder if it makes sense to also offer a
    // Similar to `str::split_at` and `[T]::split_at`
    // Panics if `mid` is not on a UTF-8 code point boundary.
    fn split_at(&self, mid: usize) -> (&OsStr, &OsStr);
Splitting and joining are the two hazard areas when dealing with known valid |
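One way such a `split_at` could be approximated outside `std` is sketched below as a free function. It assumes the accessor names `as_encoded_bytes` and `from_encoded_bytes_unchecked` (the names later stabilized in `std`, not taken from this thread). The boundary check in the hypothetical `is_split_point` is deliberately conservative: it accepts an index only at either end of the string or directly next to a valid UTF-8 sequence.

```rust
use std::ffi::OsStr;

/// Hypothetical check: is `mid` at either end, or immediately before or
/// after a valid, non-empty UTF-8 sequence?
fn is_split_point(bytes: &[u8], mid: usize) -> bool {
    if mid == 0 || mid == bytes.len() {
        return true;
    }
    // Does a valid UTF-8 character start at `mid`? Four bytes is the longest
    // possible encoding, so a short window is enough.
    let after = &bytes[mid..bytes.len().min(mid + 4)];
    let starts_char = match std::str::from_utf8(after) {
        Ok(_) => true,
        Err(e) => e.valid_up_to() > 0,
    };
    // Does valid UTF-8 end exactly at `mid`?
    let ends_char = (1..=mid.min(4)).any(|n| std::str::from_utf8(&bytes[mid - n..mid]).is_ok());
    starts_char || ends_char
}

/// Sketch of the proposed `split_at` as a free function.
/// Panics if `mid` is out of range or not an acceptable split point.
fn split_at(s: &OsStr, mid: usize) -> (&OsStr, &OsStr) {
    let bytes = s.as_encoded_bytes();
    assert!(is_split_point(bytes, mid), "mid is not next to valid UTF-8");
    let (a, b) = bytes.split_at(mid);
    // SAFETY: both halves come from `as_encoded_bytes`, and the cut was
    // checked to sit at an end or next to a valid UTF-8 sequence.
    unsafe {
        (
            OsStr::from_encoded_bytes_unchecked(a),
            OsStr::from_encoded_bytes_unchecked(b),
        )
    }
}

fn main() {
    let (head, tail) = split_at(OsStr::new("h\u{e9}llo"), 1);
    println!("{:?} | {:?}", head, tail);
}
```

Splitting in the middle of the two-byte `é` (index 2 in the example string) fails the check and panics, which matches the panic behaviour described in the proposed signature.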
A part of me finds it weird to offer a safe function like EDIT: To make clear, I'm open to adding that function if that is the direction we want to take this. |
Hm, we could offer some safe way to get an index according to these rules. E.g.
    // Returns the byte index where `searcher` first returns true.
    // `searcher` can inspect `char`s decoded from UTF-8.
    // Non UTF-8 encoded characters are skipped.
    fn find_char(&self, searcher: FnMut(char) -> bool) -> Option<usize>;
But at this point I'll stop because I'm practically designing yet another alternative approach. |
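For what it is worth, the `find_char` idea can be prototyped on top of the byte accessor. The sketch below is a hypothetical free function, not an API from this PR. It assumes `as_encoded_bytes` (the name later stabilized in `std`) and takes `impl FnMut(char) -> bool` so that the signature compiles.

```rust
use std::ffi::OsStr;

/// Hypothetical free-function version of the `find_char` idea: returns the
/// byte index of the first decoded `char` for which `searcher` returns true.
/// Bytes that do not decode as UTF-8 are skipped.
fn find_char(s: &OsStr, mut searcher: impl FnMut(char) -> bool) -> Option<usize> {
    let bytes = s.as_encoded_bytes();
    let mut offset = 0;
    while offset < bytes.len() {
        let rest = &bytes[offset..];
        // Take the longest valid UTF-8 prefix, plus the length of the
        // invalid run that follows it (if any).
        let (valid, skip) = match std::str::from_utf8(rest) {
            Ok(valid) => (valid, 0),
            Err(e) => {
                let valid = std::str::from_utf8(&rest[..e.valid_up_to()]).unwrap();
                // `error_len() == None` means a truncated sequence at the end.
                let bad = e.error_len().unwrap_or(rest.len() - e.valid_up_to());
                (valid, bad)
            }
        };
        if let Some((i, _)) = valid.char_indices().find(|&(_, c)| searcher(c)) {
            return Some(offset + i);
        }
        offset += valid.len() + skip;
    }
    None
}

fn main() {
    let arg = OsStr::new("key=value");
    println!("{:?}", find_char(arg, |c| c == '='));
}
```

Because every index it returns is the start of a valid UTF-8 character, the result is always an acceptable place to split under the rules discussed here.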
Does this confirm that |
For now, this proposal restricts what is safe to transmute from
    //! - When [transmuting] from `&[u8]` to `&OsStr`,
    //!   - the slice may only include content from comparable `&OsStr` (see above) or be valid UTF-8
    //!   - any splits of the `&OsStr` must be along char boundaries (the first byte of a UTF-8 code
    //!     point sequence)
|
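To make the first rule concrete, here is a small sketch of the joining case: bytes taken from existing `OsStr` values are combined with a piece of valid UTF-8 and turned back into an OS string. It assumes the `std` names `as_encoded_bytes` and `OsString::from_encoded_bytes_unchecked` (as later stabilized, not quoted from this thread) rather than a raw transmute; `join_with_eq` is a hypothetical helper.

```rust
use std::ffi::{OsStr, OsString};

/// Hypothetical helper: builds `name=value` at the byte level.
fn join_with_eq(name: &OsStr, value: &OsStr) -> OsString {
    let n = name.as_encoded_bytes();
    let v = value.as_encoded_bytes();
    let mut bytes = Vec::with_capacity(n.len() + 1 + v.len());
    bytes.extend_from_slice(n);
    bytes.push(b'='); // valid UTF-8 separating the two pieces
    bytes.extend_from_slice(v);
    // SAFETY: the buffer is a mixture of bytes from `as_encoded_bytes`
    // (same program, same platform) and valid UTF-8, and neither input
    // was cut anywhere.
    unsafe { OsString::from_encoded_bytes_unchecked(bytes) }
}

fn main() {
    let joined = join_with_eq(OsStr::new("--color"), OsStr::new("always"));
    println!("{:?}", joined);
}
```

The same result is available safely through `OsString::push`; the byte-level version is shown only to illustrate which inputs the rule permits.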
I think I buy the rules as written, but I would like to see a doc example using Also, cc @SimonSapin to see what you think about this. (Let me know if you want me to stop pinging you about this, but I always think of you as the champion against a change like this.) Popping up a level, this is kind of an interesting use of |
It actually isn't a bag of bytes on Windows, it must be WTF-8-encoded or you can start reading out of bounds:
    fn main() {
        let b: &[u8] = b"\xC2";
        let s: &std::ffi::OsStr = unsafe { std::mem::transmute(b) };
        dbg!(s);
    }
    $ cargo run -q --target x86_64-pc-windows-gnu
    [src/main.rs:4] s = "\u{9b}:] = \n\0\0\0\0\0\0\01\u{10}L[...]
    thread 'main' panicked at 'failed printing to stderr: Windows stdio in console mode does not support writing non-UTF-8 byte sequences', library\std\src\io\stdio.rs:1008:9
The proposal is phrased to never produce invalid WTF-8. I think that as written now it allows for any |
This wording would provide an additional guarantee that |
e2d912b to e6a35c4
I've addressed what caused the wasm build failure and it should be good to go again. |
@bors r+ |
Allow limited access to `OsStr` bytes. `OsStr` has historically kept its implementation details private out of concern for locking us into a specific encoding on Windows. This is an alternative to rust-lang#95290, which proposed specifying the encoding on Windows. Instead, this only specifies that, for cross-platform code, `OsStr`'s encoding is a superset of UTF-8, and it defines rules for safely interacting with it. At minimum, this can greatly simplify the `os_str_bytes` crate and every arg parser that interacts with `OsStr` directly (which is most of those that support invalid UTF-8). Tracking issue: rust-lang#111544
☀️ Test successful - checks-actions |
Finished benchmarking commit (9610dfe): comparison URL.
Overall result: no relevant changes - no action needed
@rustbot label: -perf-regression
Instruction count
This benchmark run did not return any relevant results for this metric.
Max RSS (memory usage)
This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Cycles
This benchmark run did not return any relevant results for this metric.
Binary size
This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 642.846s -> 643.715s (0.14%) |
Allow limited access to `OsString` bytes This extends rust-lang#109698 to allow no-cost conversion between `Vec<u8>` and `OsString` as suggested in feedback from `os_str_bytes` crate in rust-lang#111544.
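A minimal sketch of what the owned, no-cost conversion makes possible: taking the buffer out of an `OsString`, editing it in place, and putting it back without copying. It assumes the method names `into_encoded_bytes` and `from_encoded_bytes_unchecked` as later stabilized in `std`; they are not named in the text above, and `trim_trailing_newline` is a hypothetical helper.

```rust
use std::ffi::OsString;

/// Hypothetical helper: drops one trailing `\n`, reusing the allocation.
fn trim_trailing_newline(s: OsString) -> OsString {
    // Takes ownership of the underlying buffer without copying it.
    let mut bytes = s.into_encoded_bytes();
    if bytes.last() == Some(&b'\n') {
        bytes.pop();
    }
    // SAFETY: the bytes came from `into_encoded_bytes`, and the only change
    // is a cut immediately before `\n`, which is valid, non-empty UTF-8.
    unsafe { OsString::from_encoded_bytes_unchecked(bytes) }
}

fn main() {
    let line = OsString::from("some/path\n");
    println!("{:?}", trim_trailing_newline(line));
}
```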