-
Notifications
You must be signed in to change notification settings - Fork 12.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regression in rlua crate tests between rustc 1.23.0 and 1.24.0, only on windows. #48251
Comments
One possibility as to why that it only happens on windows is that the implementation of longjmp on that platform uses the same exception framework that Rust uses for panics. |
That is interesting! That WOULD explain the behavior I'm seeing. |
It's possible to temporarily unlock unstable features on stable channel with env var ( |
That's incredibly helpful, thank you! |
Here is a summary of the current situation as I understand it: mlua-rs/rlua#71 (comment) I may have gotten some details wrong, if I have feel free to |
I guess when they say it's a "C API" they are not kidding. :p Would it make sense to make an standalone wrapper, called "Universal API for Lua, with C calling conventions"? |
Lua's API is not my favorite :( |
Hmm. I think there's a good case to be made for disabling #46833 -- or at least making it opt-in -- until there are stable attributes to opt-out. @wycats points out that this likely to affect Helix as well. I would definitely prefer not to have public packages relying on |
Re-reading the comments on that PR, I see that I wrote this:
But this got overlooked. |
I knew that was wrong when I did it, I'll forget this hack exists :P |
Something I wanted to point out, is that even if a crater run was done, it wouldn't have caught my issue at least because as I understand it they're only done on linux? If there was some kind of donation box for running crater on windows I would contribute to it! |
@kyren I believe crater actually tests windows these days, but I could be wrong. Also, cross-posting from the chucklefish/rlua repo:
Can anyone verify if this is correct? I think in my ideal world, we would not require any attributes -- rather, we would have a way to skip the panic in the case where Rust is using the native unwind facility (but you would opt-in to that, probably, similar to |
I talked a bit more with @wycats about the Helix situation, which is somewhat different, though very much related: Ruby also sometimes longjmps. @wycats wanted some form of attribute that could be used to label functions that longjmp so that the compiler could verify that there is nothing on the stack with a dtor. This seems like it would also help @kyren in verifying this claim that they made:
|
At this point, it seems pretty clear to me that:
In the short term, I think we should amend #46833 so that the "panic wrapper" uses an (unstable) opt-in -- e.g., |
Is it possible to adjust the implementation to only catch Rust-generated unwinding? I seem to remember that we did something like this in the past to not conflict with C++ exceptions but I may be misremembering. |
I'm sorry to say that it doesn't :( off the top of my head it's going to require a reasonable amount of work as well. |
Sorry for the breakage y'all are seeing here @kyren! We've long wanted to improve our ecosystem-wide regression testing to include Windows as well to catch issues like this sooner, but we still have yet to implement that :( Your explanation of what's happening was quite helpful to me, and it also reminded me of what I believe is an existing bug in Rust! I'm unfortunately traveling right now and not at a Windows computer, but I can try to verify this information when I get back home in a week. In any case though I believe this is actually a fixable bug in Rust without actually adding any new attributes or stabilizing anything. I agree with @nikomatsakis that we'll probably want to have some form of dealing with exceptions at the FFI boundary eventually but the longer we can put that off the better! (in my opinion at least) The tl;dr; is that I believe we need to implement exception filters in our landing pads. Rust landing pads should, in theory, only run for Rust exceptions. Not for all SEH exceptions. I'm pretty certain this bug has come up before in the past but I was unfortunately unable to dig up the issue report. The crux of this issue is that code like this: struct A;
impl Drop for A {
fn drop(&mut self) {
println!("hello!");
}
}
fn main() {
let _a = A;
cause_a_segfault();
}
fn cause_a_segfault() {
unsafe { *(0x1 as *mut u32) = 3; }
} I think will actually print out "hello!". The bug may be more nuanced than this but I'm pretty certain something like this is a problem, I can try to create an actual reproduction of this sort of issue when I get back home to a Windows computer to try out some various pieces. Basically what this means is that our "landing pads" are executing for any and all SEH exceptions. This seems like it includes setjmp/longjmp, which is a bug! (a very longstanding bug!) This is part of the whole "we don't want you unwinding into Rust" in that making this work is tricky and whatnot, but it turns out that this is our own problem where there's a whole bunch of ways to "unwind into Rust" on MSVC using SEH that we must deal with. For those familiar with LLVM I believe the problem looks like this: fn foo2() {}
#[no_mangle]
pub extern fn bar() {
foo2();
} On MSVC this code currently generates IR that looks like: define void @bar() unnamed_addr #1 personality i32 (...)* @__CxxFrameHandler3 {
start:
; invoke foo::foo2
invoke void @_ZN3foo4foo217h18e6b2931a4bedc6E()
to label %bb2 unwind label %funclet_bb1
bb1: ; preds = %funclet_bb1
call void @llvm.trap()
unreachable
bb2: ; preds = %start
ret void
funclet_bb1: ; preds = %start
%cleanuppad = cleanuppad within none []
br label %bb1
} The problem (as I understand it), is that if %cleanuppad = cleanuppad within none [] (or something about that line I believe) I remember that when implementing unwinding on MSVC I never did figure out how to place a filter here on "only run this for Rust exceptions". Other platforms like MinGW which use a custom personality function have filtering enabled which (IIRC) only runs cleanups for Rust-originating exceptions (what we want). On MSVC I never got a custom personality working (LLVM only recognizes a few I think?) so sticking to one of the standard personality functions meant we got something working quickly at least! Now all is not lost I think! I believe what we may want to do is alter the default codegen for landing pads. We already have implemented a method of "only catch Rust exceptions" as that's why Phew! That's quite a lot of information. To summarize, though:
|
Such a wonderful, detailed explanation, thank you @alexcrichton!
I'm totally fine with this hacky fix until 1.25! If you're okay with it, I'd like to do a release of my crate with this hack for the time being, and then when 1.25 hits remove it and just require rustc >= 1.25. However, let's be realistic, I'd obviously prioritize the entire rust ecosystem over my one little crate, so if you think that's a problem I don't have to do that. This is not such a problem for Chucklefish, since we have sort of a chucklefish_hax branch anyway where this hack could live, so I'm fine if it's just too gross to release publicly with that hack. I'm actually relieved really, because the Lua C API had always been in this "legal gray area" w.r.t. Rust. @nikomatsakis suggested changes regarding having the compiler help about Drop types being on the stack, and inter-operating with other unwinding mechanism obviously sound really great and helpful, but even just knowing that calling into the Lua C API should be supported at all is plenty for right now :D For the record, I think having a public API that relies on something like longjmp or C++ exceptions is really, exceptionally gross (pun intended), but I'm glad that Rust is attempting to deal with the unfortunate reality of it. I'm completely fine with it though if the finer details of it take a while to work out. |
@kyren |
@nikomatsakis said:
For what it's worth, I think this space deserves a more full-fledged RFC. As @nikomatsakis pointed out, Helix has similar but not identical concerns, and it would be great to get a conversation about longjmps and "unwinding past the FFI boundary" into one place so we can work out the requirements and solve them holistically. |
@alexcrichton do you object to reverting this change while we sort it out? |
My take is that we should revert now but possibly bring the feature back with @alexcrichton's filtering suggestion -- although it's not entirely obvious to me that filtering is what end users will want, either. (But it's probably a decent default.) |
@kyren do you think (by any chance) it would be possible to make a standalone regression test? |
This is my interim plan. I have PR to revert on stable (#48378) and will prepare a similar PR for beta. Then I have a more expansive diff that reverts but allows one to opt back in to the wrapper by doing |
FWIW, this requires reverting a bunch of changes to gtk-rs and related crates which are now running into UB with 1.24.1 due to this being reverted. |
Say more? Did they have their own panic guards which were removed? |
@nikomatsakis Yes, exactly that :) But fortunately not in any released version yet (unless there was a release of some crate outside of the gtk-rs organization that I'm not aware of). It was planned to release this really soon now though... |
I don't really know the state of it or whether it would be appropriate for a stable point release, but it would be nice if #48572 could be merged instead of a straight rollback, to avoid missing out on the feature for another rust stable version. I do feel a bit guilty for triggering the reverted behavior, not having to worry about panic unwinds at extern function boundaries is quite nice, even if I'm not currently relying on it. |
We plan to discuss this in the core team meeting tonight. Given #48572, we may forego the revert. @kyren, is it ok w/ you to wait until the next beta to have a fix? (I'm not sure how I feel about backporting #48572 to beta, but I guess it's ok -- backporting to stable seems too risky to me, personally.) |
@kyren Did you try @petrochenkov 's suggestion to change longjmp to __instrinsic_longjmp ? If so, did it work? |
So I finally got around to trying this and I can confirm that this DOES work, thank you very much for the suggestion! Since I basically have a fix for this, I'm not in a huge rush to have this change in the rust stable compiler, but this is only speaking for me. This issue probably causes problems for other people that have to deal with at least any of Lua or Ruby or maybe libpng, depending on how strict they are about handling errors properly or how deep into the APIs they go. I would LIKE it if the fix was in the next stable version of Rust, but even then it's not critical because like I said, I have a fix. The main thing I really needed from all of this was an assurance that longjmp APIs are at least supposed to be compatible with Rust in general, which just practically saves me a lot of heartache. If #48572 lands in nightly and the regression test for longjmp stays long term, but I have to wait two rust stable cycles to remove the now small rlua hack, that's certainly not the end of the world. The only way this would be a problem were if the fix for longjmping APIs ended up requiring an unstable feature that was a very long way off, or the hacky fix I have broke for some other reason in the interim, or if #48572 or similar was not merged, or if there was another decision that similarly changed direction so that I can't reliably call into a longjmp based API. If the fix did make it into beta though that would certainly feel a lot more "certain" to me and allow me to relax about it a lot more :D
I was typing this reply as I saw your comment :D |
This commit is targeted at addressing rust-lang#48251 by specifically fixing a case where a longjmp over Rust frames on MSVC runs cleanups, accidentally running the "abort the program" cleanup as well. Added in rust-lang#46833 `extern` ABI functions in Rust will abort the process if Rust panics, and currently this is modeled as a normal cleanup like all other destructors. Unfortunately it turns out that `longjmp` on MSVC is implemented with SEH, the same mechanism used to implement panics in Rust. This means that `longjmp` over Rust frames will run Rust cleanups (even though we don't necessarily want it to). Notably this means that if you `longjmp` over a Rust stack frame then that probably means you'll abort the program because one of the cleanups will abort the process. After some discussion on IRC it turns out that `longjmp` doesn't run cleanups for *caught* exceptions, it only runs cleanups for cleanup pads. Using this information this commit tweaks the codegen for an `extern` function to a catch-all clause for exceptions instead of a cleanup block. This catch-all is equivalent to the C++ code: try { foo(); } catch (...) { bar(); } and in fact our codegen here is designed to match exactly what clang emits for that C++ code! With this tweak a longjmp over Rust code will no longer abort the process. A longjmp will continue to "accidentally" run Rust cleanups (destructors) on MSVC. Other non-MSVC platforms will not rust destructors with a longjmp, so we'll probably still recommend "don't have destructors on the stack", but in any case this is a more surgical fix than rust-lang#48567 and should help us stick to standard personality functions a bit longer.
triage: leaving assigned to niko but it might be good for him to delegate to alex or someone else if there's someone else taking point on this now. |
rustc: Tweak funclet cleanups of ffi functions This commit is targeted at addressing rust-lang#48251 by specifically fixing a case where a longjmp over Rust frames on MSVC runs cleanups, accidentally running the "abort the program" cleanup as well. Added in rust-lang#46833 `extern` ABI functions in Rust will abort the process if Rust panics, and currently this is modeled as a normal cleanup like all other destructors. Unfortunately it turns out that `longjmp` on MSVC is implemented with SEH, the same mechanism used to implement panics in Rust. This means that `longjmp` over Rust frames will run Rust cleanups (even though we don't necessarily want it to). Notably this means that if you `longjmp` over a Rust stack frame then that probably means you'll abort the program because one of the cleanups will abort the process. After some discussion on IRC it turns out that `longjmp` doesn't run cleanups for *caught* exceptions, it only runs cleanups for cleanup pads. Using this information this commit tweaks the codegen for an `extern` function to a catch-all clause for exceptions instead of a cleanup block. This catch-all is equivalent to the C++ code: try { foo(); } catch (...) { bar(); } and in fact our codegen here is designed to match exactly what clang emits for that C++ code! With this tweak a longjmp over Rust code will no longer abort the process. A longjmp will continue to "accidentally" run Rust cleanups (destructors) on MSVC. Other non-MSVC platforms will not rust destructors with a longjmp, so we'll probably still recommend "don't have destructors on the stack", but in any case this is a more surgical fix than rust-lang#48567 and should help us stick to standard personality functions a bit longer.
1.24.1 was released with this reverted now. |
So, we talked about this in the core team meeting, and -- as you can see -- decided to revert after all. It was a bit of a tough call, but we felt like when in doubt, we ought to bias towards "back off and try again more carefully". The plan is to land @alexcrichton's change in #48572, which should enable us to re-enable the abort-guards by default. We are also keeping the explicit (I still think we ought to address the interop question in a more thorough way, however.) @sdroege I do apologize for whatever churn this caused gtk-rs and anyone else! |
@nikomatsakis No problem, "git revert" is cheap and there was no release yet. All options were bad, let's just hope it sticks the next time :) |
Since the revert made it into 1.24.1, this is now fixed! Thanks for the great work, closing! |
[The following explanation of the problem contains a lot of assumptions about the problem's nature, I could be extremely wrong! Take this explanation with a grain of salt]
Wild guess, it has something to do with #46833.
So, I'm not sure this is a rust bug exactly, but I wanted to make you aware of a crate regression that's causing me quite a lot of problems. Ultimately, I believe this comes from me doing some things that are not.. exactly definitely kosher, but that I don't have a convenient workaround for.
Ultimately, the problem stems from the Lua C API having error handling fundamentally based around setjmp / longjmp. In order to write a usable wrapper in Rust, without having to write inconvenient, slow C shims, Rust must, at times, call into Lua C API functions that in turn trigger a longjmp, and that longjmp must necessarily pass over Rust frames.
I realize how unsafe this is! However, there are some limitations here that I think make it at least somewhat reasonable to be able to do:
This last point is unfortunate, and causes me a lot of headaches, and at least one strange practical problem. We at Chucklefish build rlua for consoles, but for consoles the gcc crate does not work, and we have to build Lua out of tree (also to include Lua console specific patches). Needing to include C shims in rlua means that we would also have to copy paste those shims into the console projects, and this is a less than ideal situation.
The reason I think this issue has to do with #46833 is that I can actually fix the regression if I simply add an
#[unwind]
attribute to each rust function that calls lua_error. However, this is only necessary for some reason on windows, and the attribute is currently unstable!I think it's completely reasonable to need to mark functions as
#[unwind]
if they will trigger a longjmp, but I'm not exactly sure what to do in the meantime untilunwind_attributes
becomes stable. Maybe this behavior on windows IS simply a bug? Maybe it's a bug on linux / macos that this doesn't abort? I would especially not like to have to write C shims if ultimately I can get rid of them whenunwind_attributes
becomes stable.(Side Note: Chucklefish actually uses the stable rust compiler in production now, so this is especially annoying. We were very excited about the 1.24 release, because it includes incremental compilation and rustfmt-preview while also being stable, and we noticed this bug in trying to switch to the stable compiler on windows.)
Edit: If people have to dig into the
rlua
source, I'm sorry, it is a tangled nest of "unsafe, unsafe everywhere". The best example to look at is thetest_error
test, and the methodLua::create_callback_function
. Adding the#[unwind]
attribute to:callback_call_impl
,safe_pcall
, andsafe_xpcall
fixes that test on windows, on unstable.Edit 2: Edited for clarity, but also I need to point out that I need to do something, because this bug currently makes
rlua
just basically broken on windows. It's not "just" a test failure, error handling in general will just abort.The text was updated successfully, but these errors were encountered: