Jobserver hangs due to full pipe #9739
Oh dear, thanks for the report! This is definitely something we need to fix... Do you know the capacity of the pipe by default on your system? (Basically, is it less than 256?) If the pipe capacity is greater than 256 I'm not sure how else this could be coming about, but if it's less than 256 then I can imagine this happening, because Cargo at least will write tokens back to the pipe and then try to read them off again for later scheduling purposes. This would introduce a deadlock where Cargo, which can read tokens, is also waiting to write tokens. Poking around, it looks like we can use `F_SETPIPE_SZ` to increase the pipe's capacity.
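For context, the protocol in question is the make-style jobserver: a pipe holds one byte per available job slot; reading a byte acquires a slot and writing it back releases it. A minimal sketch, assuming the `libc` crate (this is illustrative, not Cargo's actual code):

```rust
use std::io;

fn main() {
    unsafe {
        // The jobserver is just a pipe holding one byte per free job slot.
        let mut fds = [0; 2];
        if libc::pipe(fds.as_mut_ptr()) != 0 {
            panic!("pipe: {}", io::Error::last_os_error());
        }
        // Seed N tokens, one per allowed parallel job.
        for _ in 0..4 {
            assert_eq!(libc::write(fds[1], b"|".as_ptr() as *const _, 1), 1);
        }
        // Acquire a slot: blocks until a token byte can be read.
        let mut token = [0u8];
        assert_eq!(libc::read(fds[0], token.as_mut_ptr() as *mut _, 1), 1);
        // ... run one job while holding the token ...
        // Release the slot: write the byte back. If the pipe is full, this
        // write blocks -- which is exactly the hang described in this issue.
        assert_eq!(libc::write(fds[1], token.as_ptr() as *const _, 1), 1);
    }
}
```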
Actually you are right, I don't see how the pipe can be full with only 256 elements. On Linux it is allocated page by page, so the minimum would be 4k; the docs say it is supposed to be 16 pages, so 64k. But when I wrote a small C tool to call fcntl on the proc file system, it reported 4096. Might be Debian, might be the fact it runs in a systemd-nspawn container. If you have any idea how to dig further, I can try.
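For reference, the capacity check is a one-liner with fcntl(2). A minimal Rust sketch, assuming the `libc` crate (to inspect another process's pipe instead, open `/proc/<pid>/fd/<n>` and query that descriptor):

```rust
use std::io;

fn main() {
    unsafe {
        // Create a pipe and ask the kernel for its capacity in bytes.
        let mut fds = [0; 2];
        assert_eq!(libc::pipe(fds.as_mut_ptr()), 0);
        let cap = libc::fcntl(fds[0], libc::F_GETPIPE_SZ);
        if cap < 0 {
            panic!("fcntl: {}", io::Error::last_os_error());
        }
        // 65536 (16 pages) is the usual Linux default; 4096 (one page)
        // matches the truncated value observed in the container above.
        println!("pipe capacity: {} bytes", cap);
    }
}
```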
Oh dear, that sounds bad! To confirm, you're not trying to use the unstable parallel rustc, right? Also, do you have any build scripts in play which use …? Reading over the …

When the system hangs, are you sure that the threads are all blocked? That is, they're not busy-spinning, hogging CPU while writing things to pipes? Also, would you be able to share the project that reproduces this (perhaps just privately with me)? I've got a 40/80-core system I can test on, which I may be able to stress enough to reproduce this.
No, this is the Rust 1.50 stable release installed with rustup (quite old, I know; I'm trying to upgrade things to 1.54 but it takes time to reproduce).
A bash script calls make, which just calls cargo build --release. I don't see the pipe in the make process, but I didn't try to share anything, at least if it is not done by default.
strace of rustc
strace of cargo
I will ask around.
Oh, no worries on 1.50 vs 1.54; I'm just trying to rule out some more esoteric possibilities and make sure this is as weird as I think it is. AFAIK nothing really substantive around this changed between 1.50 and 1.54, so upgrading will likely not show a difference. My guess is that the issue here stems from the very large amount of parallelism with 256 HT cores. You may be one of the first people to run Cargo/rustc with such parallelism! Although if you have …
I don't think so; make is not called with -j, just plain make, which builds the Rust code first and then some OCaml code. It's just a wrapper, but I'll try to remove make from the equation.
And the only running make process does not share any file descriptor with its cargo child. I don't really see how to configure cargo to communicate with the make jobserver. And the environment of cargo with respect to make is: …
FYI, …
Hm, OK, so it sounds like Cargo is the one making the jobserver and it's not dealing with the parent make's jobserver at all. I'm not sure that … One thing worth checking: could a sub-make be in play somewhere, e.g. a build script that invokes make itself?
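For background on how a child could pick up a parent make's jobserver at all: GNU make advertises the pipe's fd pair in MAKEFLAGS. A hedged sketch of just the discovery step, not Cargo's actual implementation (newer make versions can also pass a fifo path instead, which this ignores):

```rust
use std::env;

// GNU make puts `--jobserver-auth=R,W` (older versions: `--jobserver-fds=R,W`)
// into MAKEFLAGS, where R and W are inherited file descriptor numbers.
fn jobserver_fds() -> Option<(i32, i32)> {
    let flags = env::var("MAKEFLAGS").ok()?;
    let arg = flags.split_whitespace().find_map(|w| {
        w.strip_prefix("--jobserver-auth=")
            .or_else(|| w.strip_prefix("--jobserver-fds="))
    })?;
    let (r, w) = arg.split_once(',')?;
    Some((r.parse().ok()?, w.parse().ok()?))
}

fn main() {
    match jobserver_fds() {
        Some((r, w)) => println!("inherited jobserver pipe: read fd {}, write fd {}", r, w),
        None => println!("no jobserver in MAKEFLAGS; a fresh one would be created"),
    }
}
```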
Ah yes, a sub-make is definitely possible; there are a lot of -sys crates. I'll see if I can play with bpftrace to try to find where the `+` is coming from.
I now believe it is not an issue with Cargo. As you said, as long as the pipe buffer is larger than the core count, it should never deadlock. Except it seems the Linux pipe buffer has some peculiarity: apparently it is a ring buffer of pages, with two paths: …
But I ran the build on my personal machines, and the pipe buffer is 64k all along the build; I also straced everything, and no call to fcntl to change the pipe size happened. Except on these CI systemd containers it is 4k, and so far I have not been able to find what reduces the pipe buffer; bpftrace shows no call to the kernel pipe_fcntl function that would reduce the size. I suspect it is hitting some unprivileged-user resource limit inside the container, which has the effect of truncating the capacity of existing pipes to the minimum, or something like this: https://elixir.bootlin.com/linux/latest/source/fs/pipe.c#L799.
Ooh, the plot thickens! So I can actually pretty easily reproduce the deadlock you're seeing with this:

```rust
use std::io;

const N: usize = 10_000;

fn main() {
    unsafe {
        let mut a = [0; 2];
        if libc::pipe(a.as_mut_ptr()) != 0 {
            panic!("err: {}", io::Error::last_os_error());
        }
        println!("init buffer: {}", libc::fcntl(a[1], libc::F_GETPIPE_SZ));
        println!("new buffer: {}", libc::fcntl(a[1], libc::F_SETPIPE_SZ, 0));
        for _ in 0..80 {
            assert_eq!(libc::write(a[1], b"a".as_ptr() as *const _, 1), 1);
        }
        let threads = (0..80)
            .map(|_| {
                std::thread::spawn(move || {
                    for _ in 0..N {
                        let mut b = [0u8];
                        assert_eq!(libc::read(a[0], b.as_mut_ptr() as *mut _, 1), 1);
                        assert_eq!(libc::write(a[1], b"a".as_ptr() as *const _, 1), 1);
                    }
                })
            })
            .collect::<Vec<_>>();
        for t in threads {
            t.join().unwrap();
        }
    }
}
```

If the line with `F_SETPIPE_SZ` (which shrinks this pipe to a single page) is removed, the program completes; with it, the program deadlocks.

I think that the fix here is probably to document this ring behavior with pages in the `jobserver` crate.
Indeed, I think it would be a good idea for Linux systems. If it cannot increase the buffer due to the user limit, it will probably fail, but then I guess it is much easier to understand what is going on when getting an error. As for me, I've increased /proc/sys/fs/pipe-user-pages-soft, so that right now all pipes stay at the default of 16 pages, and no build has hung since then.
This commit attempts to address rust-lang/cargo#9739 by optionally increasing the capacity of pipes created on Linux. This may be required if the pipe initially starts out with only one page of buffer which apparently means that normal jobserver usage might cause the pipe to deadlock where all writers are blocked.
Oh, were you unable to get the pipe capacity to increase using `F_SETPIPE_SZ`? FWIW I posted rust-lang/jobserver-rs#34, which I thought might fix things, but if the resize itself fails then that won't help.
I didn't fully reproduce the truncation of the default value, but when the limit is reached, F_SETPIPE_SZ will return EPERM. So it's probably a good idea to have a nice error message here.
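A sketch of what such an error message could look like, assuming the `libc` crate (`grow_pipe` is a hypothetical helper, not the jobserver crate's API):

```rust
use std::io;

// Try to grow a pipe, turning EPERM into an explanation instead of a hang.
fn grow_pipe(fd: i32, bytes: i32) -> io::Result<i32> {
    let got = unsafe { libc::fcntl(fd, libc::F_SETPIPE_SZ, bytes) };
    if got >= 0 {
        return Ok(got); // the kernel may round the size up to whole pages
    }
    let err = io::Error::last_os_error();
    if err.raw_os_error() == Some(libc::EPERM) {
        Err(io::Error::new(
            err.kind(),
            format!(
                "failed to grow jobserver pipe to {} bytes: the per-user pipe \
                 buffer limit (fs.pipe-user-pages-soft) appears to be reached: {}",
                bytes, err
            ),
        ))
    } else {
        Err(err)
    }
}

fn main() -> io::Result<()> {
    let mut fds = [0; 2];
    assert_eq!(unsafe { libc::pipe(fds.as_mut_ptr()) }, 0);
    println!("pipe grown to {} bytes", grow_pipe(fds[1], 16 * 4096)?);
    Ok(())
}
```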
Ah OK, so does that mean you can increase the size of some pipes, just not infinitely? (That makes sense, because presumably the kernel is limited in resources.) To confirm, though: you can successfully increase the size of at least one pipe? If so, my PR should fix the issue for you without you having to reconfigure the system limits.
Yes and yes. Actually, here's a full example of what is going on:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    int ret;
    int fd = open("/proc/sys/fs/pipe-user-pages-soft", O_RDONLY);
    char buf[64];
    ret = read(fd, buf, sizeof(buf) - 1); /* leave room for the terminator */
    close(fd);
    buf[ret] = '\0';
    int limit = atoi(buf);
    int fds[2]; // leak fds, who cares
    /* Each default pipe costs 16 pages, so this loop exhausts the quota. */
    for (int i = 0; i < limit / 16; ++i) {
        ret = pipe(fds);
        if (ret) {
            printf("cannot create pipe %s", strerror(errno));
            exit(1);
        }
        ret = fcntl(fds[0], F_GETPIPE_SZ);
        printf("pipe %d buffer size %d\n", i, ret);
    }
    ret = pipe(fds);
    if (ret) {
        printf("cannot create last pipe %s", strerror(errno));
        exit(1);
    }
    ret = fcntl(fds[0], F_GETPIPE_SZ);
    printf("last pipe buffer %d\n", ret);
    ret = fcntl(fds[0], F_SETPIPE_SZ, 16 * 4096);
    if (ret <= 0) {
        printf("cannot increase last pipe buffer : %s\n", strerror(errno));
        exit(1);
    }
    return 0;
}
```

After some limit you can still create pipes, but the buffer is limited to one page. So creating a pipe works, but increasing its size returns an error; initially it works, until the limit is reached. So your PR would basically just make cargo fail, not fix the issue, but I guess at least it can explain what is going on. It would also help in case the default pipe size is one page, but I'm not sure such a Linux configuration exists; you would probably need to recompile the kernel.
Hm, OK, so to confirm I'm understanding: after enough pipes are created, Linux reduces new pipes' default capacity from 16 pages to 1 page? Is that how you were seeing this issue, i.e. enough pipes were created that new pipes defaulted to 1 page?
Yes, this is the issue. So in this case, unless some pipes were destroyed in the tiny window between pipe creation and now, increasing the pipe buffer will likely fail anyway.
OK, that's good to know, and it means I probably won't land that PR on the `jobserver` crate. Does the fs.pipe-user-pages-soft limit apply per user?
As far as I understand, it applies per user, or maybe to all unprivileged users; I'm not so sure.
It's probably related to the fact this machine has an unusual load: it is a big server running 25 containers running CI builds concurrently, not only Rust. It compiles a lot of OCaml code, and OCaml is a "traditional" compiler in the sense that it pipes output a lot, e.g. to an external assembler, to a preprocessor, etc. So it is not unexpected that it would create more than 1024 pipes concurrently.
@alexcrichton thanks a lot for your help on this. Regarding your comment on your last commit: we also discussed this internally, and although I don't think it is per se a bug, as it seems to be allowed by pipe semantics, it seems to be weird behavior for Linux pipes. Do you think we should raise it upstream?
Personally, I am indeed quite surprised that my Rust program above deadlocks. That sort of feels like a bug in the kernel, but I don't have the capacity myself to report upstream (or to know whether it's even reasonable to report upstream). If y'all are willing, though, it's probably not a bad idea to report. My guess is that if Cargo is hitting this deadlock then make could hit it too.
Checking /proc/x/stack, it is hanging at https://github.com/torvalds/linux/blob/master/fs/pipe.c#L560. Checking the comments, it looks like not only does there need to be space in the pipe, there also need to be free slots. Per https://github.com/torvalds/linux/blob/master/fs/pipe.c#L1306, a one-page pipe has only one slot. This led me to test nthreads=2, which also hangs. Checking the blame of the pipe_write comment, it was added in torvalds/linux@a194dfe, which says: …

So I guess there is some race causing the slot to be abandoned. Normally this is not an issue, because some reader should always be queued; in this case, there is no reader. This is an incorrect use of pipes, per https://man7.org/linux/man-pages/man7/pipe.7.html: …

Nonetheless, considering the strong emphasis placed on kernel backwards compatibility, it could be worth reporting on the Linux kernel mailing list.
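To make that point about pipe usage concrete: a writer that must never block forever can put the write end in nonblocking mode and handle EAGAIN itself. A minimal sketch, assuming the `libc` crate (this is not what cargo or the jobserver crate actually do, and note that O_NONBLOCK applies to the open file description shared with any children):

```rust
use std::io;

// Release a token without risking a permanent block: switch the write end to
// O_NONBLOCK so a full pipe returns EAGAIN instead of sleeping forever.
fn release_token_nonblocking(write_fd: i32) -> io::Result<bool> {
    unsafe {
        let flags = libc::fcntl(write_fd, libc::F_GETFL);
        if flags < 0 || libc::fcntl(write_fd, libc::F_SETFL, flags | libc::O_NONBLOCK) < 0 {
            return Err(io::Error::last_os_error());
        }
        if libc::write(write_fd, b"|".as_ptr() as *const _, 1) == 1 {
            return Ok(true); // token is back in the pipe
        }
        let err = io::Error::last_os_error();
        if err.raw_os_error() == Some(libc::EAGAIN) {
            Ok(false) // pipe full: the caller can retry later instead of deadlocking
        } else {
            Err(err)
        }
    }
}

fn main() -> io::Result<()> {
    let mut fds = [0; 2];
    assert_eq!(unsafe { libc::pipe(fds.as_mut_ptr()) }, 0);
    println!("released: {}", release_token_nonblocking(fds[1])?);
    Ok(())
}
```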
If you need any information from me about how Cargo uses the jobserver and/or pipes, please let me know! Otherwise personally I'm afraid of just being berated for incorrectly using pipes, so I don't think I'll try to post myself.
This should help mitigate rust-lang#9739 at least with a better error message as opposed to deadlocking.
I posted on the Linux kernel mailing list: https://lore.kernel.org/lkml/1628086770.5rn8p04n6j.none@localhost/. Of potential note for Rust is my comment at the end: registering a SIGCHLD handler with SA_RESTART likely works around this issue in GNU make. Using a two-page pipe buffer also has a high probability of working around the issue, but has a problem: the one-page buffer size is only selected in the first place because the user has exceeded their pipe buffer quota. I believe in this circumstance, unless the user happens to close some pipes between the pipe creation and the resizing, the resize will be denied.
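For illustration, here is roughly what installing such a handler looks like from Rust; a hedged sketch assuming the `libc` crate, not something GNU make or cargo actually ship:

```rust
use std::{io, mem, ptr};

// An intentionally empty handler: we only need the signal to be delivered so
// that a blocked pipe syscall is interrupted and, with SA_RESTART, restarted,
// giving it a fresh chance to observe free buffer space.
extern "C" fn on_sigchld(_sig: libc::c_int) {}

fn install_sigchld_handler() -> io::Result<()> {
    unsafe {
        let mut sa: libc::sigaction = mem::zeroed();
        sa.sa_sigaction = on_sigchld as libc::sighandler_t;
        sa.sa_flags = libc::SA_RESTART;
        libc::sigemptyset(&mut sa.sa_mask);
        if libc::sigaction(libc::SIGCHLD, &sa, ptr::null_mut()) != 0 {
            return Err(io::Error::last_os_error());
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    install_sigchld_handler()
}
```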
Thanks for that! That also seems like a reasonable solution to me, yeah, and I agree that the "fix" in the `jobserver` crate is unlikely to help much in practice.
Bump to the latest jobserver dependency. This should help mitigate #9739 at least with a better error message as opposed to deadlocking.
My fix for this issue has been committed upstream: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=46c4c9d1beb7f5b4cec4dd90e7728720583ee348. Assuming there are no issues, it should be in the final 5.14 release, as well as in stable releases back to 4.4.

Since this issue only occurs when the user has a very high number of pipes open simultaneously (at least 1024), and it can be worked around by increasing the value of fs.pipe-user-pages-soft, I don't think any fix is necessary on the Rust side. Since jobserver seems to be applicable to arbitrary uses of tokening, not just starting processes, it would not be appropriate to register a SIGCHLD handler in jobserver; the handler would need to be installed in cargo and all other users of jobserver.

Strictly speaking, to avoid this bug, the buffer size need only be increased to at least two pages (which is what the kernel fix does). However, rust-lang/jobserver-rs@865b7e3 also works. Note that the code is still technically incorrect for non-Linux platforms, as there is no guaranteed pipe buffer size or behavior. In practice, I think a minimum buffer size of 1000 bytes is likely to function on all Unix-like platforms, barring basically-a-bug behavior like Linux's here. A fully correct implementation would always have at least one thread either currently or imminently reading on the pipe. However, as I explained in my (very long) email, GNU make exhibits the same behavior in edge cases, so I don't think it's something to be overly concerned about.
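As a structural illustration of that last point (always having a reader), here is a minimal sketch using std plus the `libc` crate; the names are invented for this example and are not the `jobserver` crate's API:

```rust
use std::fs::File;
use std::io::{Read, Write};
use std::os::unix::io::FromRawFd;
use std::sync::mpsc;
use std::thread;

// Spawn a helper thread that is always blocked in read() on the token pipe,
// so any writer blocked on a full pipe is guaranteed an eventual consumer.
fn spawn_token_reader(mut read_end: impl Read + Send + 'static) -> mpsc::Receiver<u8> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let mut byte = [0u8; 1];
        while read_end.read(&mut byte).map(|n| n == 1).unwrap_or(false) {
            if tx.send(byte[0]).is_err() {
                break; // receiver dropped; stop draining tokens
            }
        }
    });
    rx
}

fn main() {
    // Demo wiring with a raw pipe.
    let (reader, mut writer) = unsafe {
        let mut fds = [0; 2];
        assert_eq!(libc::pipe(fds.as_mut_ptr()), 0);
        (File::from_raw_fd(fds[0]), File::from_raw_fd(fds[1]))
    };
    let tokens = spawn_token_reader(reader);
    writer.write_all(b"|").unwrap();
    println!("acquired token: {}", tokens.recv().unwrap() as char);
}
```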
Oh wow, that's awesome, thanks for the update @Hello71!
I think this is probably sufficiently handled now, so I'm going to close this, but if there are more updates here please feel free to post them!
Final note: https://lore.kernel.org/linux-fsdevel/CAHk-=wgE4XwB2E0=CYB8eqss6WB1+FXgVWZRsUUSWTOeq-kh8w@mail.gmail.com/ points out that even after the patch, you can still write 4097 bytes to an 8192-byte buffer and then fail to write 1 more byte; but after the change, pipes should be able to hold at least 4096 bytes in all cases. So theoretically rust-lang/jobserver-rs@865b7e3 should take the requested size and then add one page (which is not 4096 bytes on some platforms!), but in practice I doubt people will try to run more than 4096 simultaneous jobs, and it might not work on other platforms anyway (as Linus says, POSIX only requires 512 bytes).
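A sketch of that sizing rule, assuming the `libc` crate (`pipe_size_for_tokens` is a hypothetical helper, not anything the jobserver crate exposes):

```rust
// Room for every token plus one extra page of slack, querying the real page
// size via sysconf instead of assuming 4096 bytes.
fn pipe_size_for_tokens(tokens: usize) -> usize {
    let page = unsafe { libc::sysconf(libc::_SC_PAGESIZE) } as usize;
    tokens + page
}

fn main() {
    // e.g. a hypothetical 256-job build
    println!("{} bytes", pipe_size_for_tokens(256));
}
```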
The change to have the jobserver resize the buffer was reverted in #9798 because it was causing some breakage (see rust-lang/rust#88091). Alex outlined some alternatives here: rust-lang/rust#88091 (comment). I'm not sure if it is worth re-opening this or if there are any realistic options.
I am not sure what a proper option would be. As reported by the other person hitting the pipe limit, there are different tradeoffs in different situations; I believe the Linux changes are enough. And from my side of things, I think this issue, which is easy to google, should already be enough if someone hits the same corner case, so I would say you can keep things as is.
Problem
When running cargo build on a machine with 128 cores / 256 hyperthreads for CI, running Debian with Linux 5.8, builds regularly hang.
All rustc processes and the cargo process each have a thread trying to write "|" to what I think is the jobserver pipe, but the pipe is full, so the build hangs.
Manually increasing the pipe buffer size in C via /proc/<pid>/fd/ unblocks everything and the build finishes.
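For reference, a rough Rust equivalent of that manual unblocking trick; the pid and fd below are placeholders for the hung process's jobserver pipe (found e.g. by listing /proc/<pid>/fd), and this assumes the `libc` crate:

```rust
use std::fs::OpenOptions;
use std::io;
use std::os::unix::io::AsRawFd;

fn main() -> io::Result<()> {
    // Placeholder pid/fd: point this at the stuck process's pipe end.
    let path = "/proc/12345/fd/4";
    let pipe = OpenOptions::new().read(true).open(path)?;
    // Grow the buffer back to the 16-page default; blocked writers wake up.
    let got = unsafe { libc::fcntl(pipe.as_raw_fd(), libc::F_SETPIPE_SZ, 16 * 4096) };
    if got < 0 {
        return Err(io::Error::last_os_error());
    }
    println!("pipe buffer now {} bytes", got);
    Ok(())
}
```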
Steps
Not sure, to be honest, besides running a big number of jobs.
Possible Solution(s)
Notes
Currently only reproduced on cargo 1.50, but trying to upgrade everything to 1.54 to check.
Here is one backtrace from cargo: …
And here is an example from rustc: …