-
Notifications
You must be signed in to change notification settings - Fork 12.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stdio: Handle unicode boundaries better on Windows #23344
Comments
The code for all this currently lives in |
The Windows console just deals with an opaque sequence of |
Some notes on Stdin as it is today: There seem to be some inefficiencies, the allocation of a 4 kB vector can possibly be turned into an array on the stack. A new String is then allocated again from that result. The MemReader comment should go, it used to talk about an unwrap that's no longer there. |
As reported in #30399, an easy way to trigger this: let v = vec![255u8; 5000];
let s = String::from_utf8_lossy(&v);
// println!("Len {}", s.len());
println!("{}", s); |
I just ran into this issue while doing
|
This is a real bummer because you can't write non-UTF-8 data to stdout via the std::io APIs, which is not a limitation in C. The simplest example here would be a program that wants to write arbitrary binary data to stdout (expecting it to be piped somewhere), but I hit this today with my program that runs a command, captures its stdout, and then later writes that data to its own stdout. If the stdout from the command was not valid UTF-8 that fails. (The program in question was the Visual C++ installed with the French locale running in a Windows VM with the system locale set to US English.) |
I gather this code is written this way because we want to be able to write Rust strings (which can contain Unicode) to the console, but it's kind of limiting. |
@luser unfortunately (as I'm led to believe) that's actually a somewhat impossible operation on Windows. The captured output is all byte-oriented as that's how the stdout pipes are set up for a spawned program, but when printing that to a console (not a pipe) it's expected to be a u16-oriented stream, so a conversion needs to happen. I'm not sure if there's a system API for writing bytes to a console, however, other than just the u16 units we're doing today. |
Sure you can write bytes to the console by calling
If stdout is piped then Rust won't check that what you're outputting is UTF-8 and will gladly just throw those bytes through the pipe. The only time it matters is when you're interacting with the Windows Console which is a terrible place to read/write arbitrary binary data, and if your program tries to spew binary data to the console then that's just bad design. |
I think it's unfortunate that the issue I ran into (run a subprocess, capture its output, write that to stdout) doesn't work reliably out-of-the box, and I think that's pretty legitimate. |
It is working as reliably as it can. If you pass it UTF-8 data it does the conversion to UTF-16 and everything works fine. If you pass it data that isn't UTF-8, well now what? It certainly can't losslessly convert it to UTF-16, so that's out. As for writing it out with In your case, if the program you're capturing the output of is not outputting UTF-8 but rather some other code page, then the solution is for you to figure out which code page it is and use a crate like https://github.com/lifthrasiir/rust-encoding to convert it to UTF-8 before writing it out. If the output is binary, why the heck are you trying to write binary data to the console? |
Presumably the output is already in the system codepage, but And yes, I understand that system codepages are a mess, and the historical legacy of "everything is ASCII" in C is terrible, but this seems like a fairly straightforward use case, and to do what would be 2 lines in C I'll have to jump through a number of hoops in Rust. |
Rust has to default to using the wide versions so that Rust strings, which are known to always be UTF-8, can be written to the console correctly. Changing this behavior would be a serious regression for all the people that currently rely on unicode in the console. Maybe one day libstd will have a more extensive API for interacting with stdout, including choosing how to deal with code pages, but that will take time and likely an RFC (feel free to create one if you can think of a good design), but until then your best option is to call WriteConsoleA yourself or create a library that does so for you, or even just call the libc functions if you really want. |
@luser yeah it's true that that this is unfortunately harder to do than in C, but it's unfortunate fallout from our decision to be much more principled about UTF-8 (especially in strings). The interface here is "write an arbitrary blob of bytes to stdout" rather than "write this process's output to stdout". The former doesn't have much of an interpretation if it's not unicode, but the latter should almost always be possible to ferry through without incident. It may be worth exploring attempting to interpret the input as unicode and wherever that fails call the ansi functions which take bytes instead. That way you may be interleaving calls to the byte and wide apis, but maybe it'd work out? That'd at least get us to an infallible |
It would most definitely give incorrect results when your input is in a code page that doesn't match the ascii portion of UTF-8, as during wide calls those bytes would be interpreted as UTF-8, while if some invalid UTF-8 is encountered it would do a narrow call and interpret those bytes as whatever the console code page is. I'd really prefer if we had separate APIs for writing unicode data and writing arbitrary blobs in the current code page. |
In the interim, can we make the error message a little bit more Windows- and stdlib-specific? "text was not valid unicode" is really generic and can make it seem like the error is coming from somewhere other than the stdlib. I went so far as to grep my entire local Cargo cache to make sure it wasn't coming from a library somewhere in the below issue. |
…dtolnay Improve error message and docs for non-UTF-8 bytes in stdio on Windows This should make debugging problems like abonander/multipart#106 significantly more straightforward in the future. cc rust-lang#23344, @retep998 @alexcrichton Not sure who do r? so I'll let rust-highfive pick one.
…hton Refactor Windows stdio and remove stdin double buffering I was looking for something nice and small to work on, tried to tackle a few FIXME's in Windows stdio, and things grew from there. This part of the standard library contains some tricky code, and has changed over the years to handle more corner cases. It could use some refactoring and extra comments. Changes/fixes: - Made `StderrRaw` `pub(crate)`, to remove the `Write` implementations on `sys::Stderr` (used unsynchronised for panic output). - Remove the unused `Read` implementation on `sys::windows::stdin` - The `windows::stdio::Output` enum made sense when we cached the handles, but we can use simple functions like `is_console` now that we get the handle on every read/write - `write` can now calculate the number of written bytes as UTF-8 when we can't write all `u16`s. - If `write` could only write one half of a surrogate pair, attempt another write for the other because user code can't reslice in any way that would allow us to write it otherwise. - Removed the double buffering on stdin. Documentation on the unexposed `StdinRaw` says: 'This handle is not synchronized or buffered in any fashion'; which is now true. - `sys::windows::Stdin` now always only partially fills its buffer, so we can guarantee any arbitrary UTF-16 can be re-encoded without losing any data. - `sys::windows::STDIN_BUF_SIZE` is slightly larger to compensate. There should be no real change in the number of syscalls the buffered `Stdin` does. This buffer is a little larger, while the extra buffer on Stdin is gone. - `sys::windows::Stdin` now attempts to handle unpaired surrogates at its buffer boundary. - `sys::windows::Stdin` no langer allocates for its buffer, but the UTF-16 decoding still does. ### Testing I did some manual testing of reading and writing to console. The console does support UTF-16 in some sense, but doesn't supporting displaying characters outside the BMP. - compile stage 1 stdlib with a tiny value for `MAX_BUFFER_SIZE` to make it easier to catch corner cases - run a simple test program that reads on stdin, and echo's to stdout - write some lines with plenty of ASCII and emoji in a text editor - copy and paste in console to stdin - return with `\r\n\` or CTRL-Z - copy and paste in text editor - check it round-trips ----- Fixes rust-lang#23344. All but one of the suggestions in that issue are now implemented. the missing one is: > * When reading data, we require the entire set of input to be valid UTF-16. We should instead attempt to read as much of the input as possible as valid UTF-16, only returning an error for the actual invalid elements. For example if we read 10 elements, 5 of which are valid UTF-16, the 6th is bad, and then the remaining are all valid UTF-16, we should probably return the first 5 on a call to `read`, then return an error, then return the remaining on the next call to `read`. Stdin in Console mode is dealing with text directly input by a user. In my opinion getting an unpaired surrogate is quite unlikely in that case, and a valid reason to error on the entire line of input (which is probably short). Dealing with it is incompatible with an unbuffered stdin, which seems the more interesting guarantee to me.
…hton Refactor Windows stdio and remove stdin double buffering I was looking for something nice and small to work on, tried to tackle a few FIXME's in Windows stdio, and things grew from there. This part of the standard library contains some tricky code, and has changed over the years to handle more corner cases. It could use some refactoring and extra comments. Changes/fixes: - Made `StderrRaw` `pub(crate)`, to remove the `Write` implementations on `sys::Stderr` (used unsynchronised for panic output). - Remove the unused `Read` implementation on `sys::windows::stdin` - The `windows::stdio::Output` enum made sense when we cached the handles, but we can use simple functions like `is_console` now that we get the handle on every read/write - `write` can now calculate the number of written bytes as UTF-8 when we can't write all `u16`s. - If `write` could only write one half of a surrogate pair, attempt another write for the other because user code can't reslice in any way that would allow us to write it otherwise. - Removed the double buffering on stdin. Documentation on the unexposed `StdinRaw` says: 'This handle is not synchronized or buffered in any fashion'; which is now true. - `sys::windows::Stdin` now always only partially fills its buffer, so we can guarantee any arbitrary UTF-16 can be re-encoded without losing any data. - `sys::windows::STDIN_BUF_SIZE` is slightly larger to compensate. There should be no real change in the number of syscalls the buffered `Stdin` does. This buffer is a little larger, while the extra buffer on Stdin is gone. - `sys::windows::Stdin` now attempts to handle unpaired surrogates at its buffer boundary. - `sys::windows::Stdin` no langer allocates for its buffer, but the UTF-16 decoding still does. ### Testing I did some manual testing of reading and writing to console. The console does support UTF-16 in some sense, but doesn't supporting displaying characters outside the BMP. - compile stage 1 stdlib with a tiny value for `MAX_BUFFER_SIZE` to make it easier to catch corner cases - run a simple test program that reads on stdin, and echo's to stdout - write some lines with plenty of ASCII and emoji in a text editor - copy and paste in console to stdin - return with `\r\n\` or CTRL-Z - copy and paste in text editor - check it round-trips ----- Fixes rust-lang#23344. All but one of the suggestions in that issue are now implemented. the missing one is: > * When reading data, we require the entire set of input to be valid UTF-16. We should instead attempt to read as much of the input as possible as valid UTF-16, only returning an error for the actual invalid elements. For example if we read 10 elements, 5 of which are valid UTF-16, the 6th is bad, and then the remaining are all valid UTF-16, we should probably return the first 5 on a call to `read`, then return an error, then return the remaining on the next call to `read`. Stdin in Console mode is dealing with text directly input by a user. In my opinion getting an unpaired surrogate is quite unlikely in that case, and a valid reason to error on the entire line of input (which is probably short). Dealing with it is incompatible with an unbuffered stdin, which seems the more interesting guarantee to me.
…hton Refactor Windows stdio and remove stdin double buffering I was looking for something nice and small to work on, tried to tackle a few FIXME's in Windows stdio, and things grew from there. This part of the standard library contains some tricky code, and has changed over the years to handle more corner cases. It could use some refactoring and extra comments. Changes/fixes: - Made `StderrRaw` `pub(crate)`, to remove the `Write` implementations on `sys::Stderr` (used unsynchronised for panic output). - Remove the unused `Read` implementation on `sys::windows::stdin` - The `windows::stdio::Output` enum made sense when we cached the handles, but we can use simple functions like `is_console` now that we get the handle on every read/write - `write` can now calculate the number of written bytes as UTF-8 when we can't write all `u16`s. - If `write` could only write one half of a surrogate pair, attempt another write for the other because user code can't reslice in any way that would allow us to write it otherwise. - Removed the double buffering on stdin. Documentation on the unexposed `StdinRaw` says: 'This handle is not synchronized or buffered in any fashion'; which is now true. - `sys::windows::Stdin` now always only partially fills its buffer, so we can guarantee any arbitrary UTF-16 can be re-encoded without losing any data. - `sys::windows::STDIN_BUF_SIZE` is slightly larger to compensate. There should be no real change in the number of syscalls the buffered `Stdin` does. This buffer is a little larger, while the extra buffer on Stdin is gone. - `sys::windows::Stdin` now attempts to handle unpaired surrogates at its buffer boundary. - `sys::windows::Stdin` no langer allocates for its buffer, but the UTF-16 decoding still does. ### Testing I did some manual testing of reading and writing to console. The console does support UTF-16 in some sense, but doesn't supporting displaying characters outside the BMP. - compile stage 1 stdlib with a tiny value for `MAX_BUFFER_SIZE` to make it easier to catch corner cases - run a simple test program that reads on stdin, and echo's to stdout - write some lines with plenty of ASCII and emoji in a text editor - copy and paste in console to stdin - return with `\r\n\` or CTRL-Z - copy and paste in text editor - check it round-trips ----- Fixes rust-lang#23344. All but one of the suggestions in that issue are now implemented. the missing one is: > * When reading data, we require the entire set of input to be valid UTF-16. We should instead attempt to read as much of the input as possible as valid UTF-16, only returning an error for the actual invalid elements. For example if we read 10 elements, 5 of which are valid UTF-16, the 6th is bad, and then the remaining are all valid UTF-16, we should probably return the first 5 on a call to `read`, then return an error, then return the remaining on the next call to `read`. Stdin in Console mode is dealing with text directly input by a user. In my opinion getting an unpaired surrogate is quite unlikely in that case, and a valid reason to error on the entire line of input (which is probably short). Dealing with it is incompatible with an unbuffered stdin, which seems the more interesting guarantee to me.
…hton Refactor Windows stdio and remove stdin double buffering I was looking for something nice and small to work on, tried to tackle a few FIXME's in Windows stdio, and things grew from there. This part of the standard library contains some tricky code, and has changed over the years to handle more corner cases. It could use some refactoring and extra comments. Changes/fixes: - Made `StderrRaw` `pub(crate)`, to remove the `Write` implementations on `sys::Stderr` (used unsynchronised for panic output). - Remove the unused `Read` implementation on `sys::windows::stdin` - The `windows::stdio::Output` enum made sense when we cached the handles, but we can use simple functions like `is_console` now that we get the handle on every read/write - `write` can now calculate the number of written bytes as UTF-8 when we can't write all `u16`s. - If `write` could only write one half of a surrogate pair, attempt another write for the other because user code can't reslice in any way that would allow us to write it otherwise. - Removed the double buffering on stdin. Documentation on the unexposed `StdinRaw` says: 'This handle is not synchronized or buffered in any fashion'; which is now true. - `sys::windows::Stdin` now always only partially fills its buffer, so we can guarantee any arbitrary UTF-16 can be re-encoded without losing any data. - `sys::windows::STDIN_BUF_SIZE` is slightly larger to compensate. There should be no real change in the number of syscalls the buffered `Stdin` does. This buffer is a little larger, while the extra buffer on Stdin is gone. - `sys::windows::Stdin` now attempts to handle unpaired surrogates at its buffer boundary. - `sys::windows::Stdin` no langer allocates for its buffer, but the UTF-16 decoding still does. ### Testing I did some manual testing of reading and writing to console. The console does support UTF-16 in some sense, but doesn't supporting displaying characters outside the BMP. - compile stage 1 stdlib with a tiny value for `MAX_BUFFER_SIZE` to make it easier to catch corner cases - run a simple test program that reads on stdin, and echo's to stdout - write some lines with plenty of ASCII and emoji in a text editor - copy and paste in console to stdin - return with `\r\n\` or CTRL-Z - copy and paste in text editor - check it round-trips ----- Fixes rust-lang#23344. All but one of the suggestions in that issue are now implemented. the missing one is: > * When reading data, we require the entire set of input to be valid UTF-16. We should instead attempt to read as much of the input as possible as valid UTF-16, only returning an error for the actual invalid elements. For example if we read 10 elements, 5 of which are valid UTF-16, the 6th is bad, and then the remaining are all valid UTF-16, we should probably return the first 5 on a call to `read`, then return an error, then return the remaining on the next call to `read`. Stdin in Console mode is dealing with text directly input by a user. In my opinion getting an unpaired surrogate is quite unlikely in that case, and a valid reason to error on the entire line of input (which is probably short). Dealing with it is incompatible with an unbuffered stdin, which seems the more interesting guarantee to me.
The stdio support on windows has two separate modes. One is when stdout is not a console and implies byte orientation, the other is when stdout is a console and implies [u16] orientation. In the stdout-is-a-console case we require that all input and output to be valid unicode so we can translate [u8] to [u16].
In this console case, however, our translation may not be quite right for a number of scenarios:
read
, then return an error, then return the remaining on the next call toread
.read
(returning an error if the first byte is invalid UTF-8).write
. Like the reading case, this could be alleviated with a 4-byte buffer for unwritten-but-valid-utf8 characters.At this time I'm not sure what sort of guarantees Windows actually gives us on these APIs. For example does windows never deal with only half of a surrogate pair on a read/write? If Windows were to guarantee various properties like this (or that it always writes the entire buffer, for example), our lives would be a lot easier! For now, though, it is unclear what we can rely on, so this issue is written as if we rely on nothing.
The text was updated successfully, but these errors were encountered: