Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spurious (?) error in "compile-fail\rfc-2126-extern-in-paths\single-segment.rs" etc #48116

Closed
kennytm opened this issue Feb 10, 2018 · 16 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-spurious Area: Spurious failures in builds (spuriously == for no apparent reason) C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@kennytm
Copy link
Member

kennytm commented Feb 10, 2018

First introduced by #47828 (comment).

Symptom: The following 3 test cases involving error E0432 will fail.

failures:
    [compile-fail] compile-fail\rfc-2126-extern-in-paths\single-segment.rs
    [compile-fail] compile-fail\use-keyword.rs
    [compile-fail] compile-fail\use-mod-2.rs

Mainly affects Windows and macOS machines (maybe because they are tested first).

Current instances:

PR Image
#47828 (comment) check i686-pc-windows-gnu
#47828 (comment) check x86_64-apple-darwin
#47828 (comment) check i686-pc-windows-gnu
#47657 (comment) check i686-pc-windows-gnu
#48092 (comment) check i686-pc-windows-gnu
#48127 (comment) check x86_64-apple-darwin
#47614 (comment) check x86_64-apple-darwin
#47752 (comment) check x86_64-windows-gnu
#47804 (comment) check x86_64-apple-darwin
#47804 (comment) check x86_64-apple-darwin
#48158 (comment) check x86_64-apple-darwin
#47906 (comment) check x86_64-apple-darwin
@kennytm kennytm added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. A-spurious Area: Spurious failures in builds (spuriously == for no apparent reason) C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC labels Feb 10, 2018
@kennytm
Copy link
Member Author

kennytm commented Feb 10, 2018

The second victim (#47657) contains nothing useful in the log https://ci.appveyor.com/project/rust-lang/rust/build/1.0.6293/job/k1wbokf5c59o4ghj.

cc @alexcrichton @eddyb

@eddyb
Copy link
Member

eddyb commented Feb 10, 2018

Does compiletest really hide ICEs now? How did this happen 😞?

@alexcrichton
Copy link
Member

There's some more information as well at the top of #47828 (comment), notably that this has been seen at least once on x86_64-apple-darwin.

The fact that it's spurious yet deterministic about these three tests is pretty disturbing, and sort of points to maybe this being a nondeterministic miscompilation in librustc_resolve itself. I think it's pretty certain this was introduced via the LLVM 6 upgrade, but given the lack of ability to reproduce or anything else my preferred course of action here is:

  • Leave LLVM 6 in master as-is. We want this for a whole slew of other reasons. That, and it's not like we have a perfect test suite otherwise.
  • We're very close to the beta branching point. If it becomes a problem we should revert the LLVM upgrade on beta immediately (shouldn't be hard to do with just a few submodule updates).
  • Hopefully we'll get some more testing of this via nightly users on the nightly branch. I'm sort of hoping that someone else runs into a segfault/bug which allows us to actually debug this...

@kennytm if this starts bouncing every other PR though we can certainly reconsider!

@kennytm
Copy link
Member Author

kennytm commented Feb 10, 2018

@eddyb I've verified that the ICE is swallowed by compile-test 😢. However, even if it is not swallowed, the ICE is useless without RUST_BACKTRACE=1 😓

Edit: Wow, even worse, the test case for #48000 passed successfully even though it is still ICE-ing. check_no_compiler_crash is not effective anymore?

Edit²: I think this (ICE being ignored) is caused by #47634

Edit³: Nevermind, the ICE detection is functioning properly with #48048. The ICE in #48000 never failed with --error-format json. But it specifically catches the words "error: internal compiler error".

Edit⁴: The "hide ICE" thing should be fixed in #48127.

@kennytm kennytm added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-resolve Area: Name resolution labels Feb 10, 2018
kennytm added a commit to kennytm/rust that referenced this issue Feb 11, 2018
1. When the invalid condition is hit, write out the relevant variables too
2. In compile-fail/parse-fail tests, check for ICE first, so the invalid
   error patterns won't mask our ICE output.
bors added a commit that referenced this issue Feb 11, 2018
[WIP] Debug #48116.

1. When the invalid condition is hit, write out the relevant variables too
2. In compile-fail/parse-fail tests, check for ICE first, so the invalid error patterns won't mask our ICE output.

r? @alexcrichton
cc #48116, cc @eddyb
kennytm added a commit to kennytm/rust that referenced this issue Feb 13, 2018
1. When the invalid condition is hit, write out the relevant variables too
2. In compile-fail/parse-fail tests, check for ICE first, so the invalid
   error patterns won't mask our ICE output.
@alexcrichton
Copy link
Member

Ok I've been debugging this with @eddyb today and we may (?) have made some progress. @eddyb has been able to deterministically reproduce the test failure on his machine and furthermore has a reduced test case which panics on his machine. Unfortunately I have been unable to reproduce this on my machine.

What we have found though is also fascinating. After lots of sharing of IR we've found that this particular snippet of IR will optimize differently on his machine than on mine. (using the same version of LLVM!) This points to me as a memory access violation in LLVM or something like that. @eddyb, however, shared the literal LLVM binaries with me and I was again unable to reproduce on my machine!

I did find out, however, that valgrind reported no violations on the LLVM I compiled myself whereas it did report a violation on the binary that @eddyb shared with me. This, being very suspicious, pointed at maybe a miscompile of LLVM itself. (or maybe undefined behavior in LLVM?)

@eddyb was locally using clang 4 for compiling LLVM and I was using gcc 5.4.0. After switching to clang 4 locally I was able to reproduce the valgrind violation, although I'm still unable to reproduce the miscompile of the Rust code itself.

Given that we're seeing this failure across three builders, the two MinGW ones and 64-bit OSX my current suspicion is that this is basically just undefined behavior in LLVM itself, only exploited on newer version of the compilers we use to compile it than what I was using locally. This I think would explain the various symptoms of:

  • Sporadic - anything dealing with undefined behavior can do whatever whenever
  • A variety of systems and compilers - probably not a miscompile or a bug in a compiler we're using to compile LLVM, but rather a bug in LLVM itself
  • Unable to reproduce locally - I was using different versions of things like glibc and system compilers which threw wrenches into the mix I believe

Unfortunately this still isn't a huge amount to go on. I'm going to try to hone in on the valgrind error here and see where that leads me. I'm sort of just praying at this point that it leads to the cause of these bugs.

@alexcrichton
Copy link
Member

Bisection of the valgrind error locally points to llvm-mirror/llvm@f45aefe.

kennytm added a commit to kennytm/rust that referenced this issue Feb 14, 2018
1. When the invalid condition is hit, write out the relevant variables too
2. In compile-fail/parse-fail tests, check for ICE first, so the invalid
   error patterns won't mask our ICE output.
@alexcrichton
Copy link
Member

Unfortunately that LLVM commit does not cleanly revert, so I've manually reverted it instead. Reverting that commit makes the valgrind error go away locally for me, and we're currently confirming with @eddyb whether it fixes the miscompile locally.

In the meantime though @eddyb also found that removing this assume fixed the miscompile, so we should probably do that in the meantime anyway.

@alexcrichton
Copy link
Member

Ok @kennytm has removed the assume in #48209, and maybe that will fix this issue? If not we can try to pursue the LLVM fix perhaps.

kennytm added a commit to kennytm/rust that referenced this issue Feb 14, 2018
Removed the `assume()` which we assumed is the cause of misoptimization in
issue rust-lang#48116.
bors added a commit that referenced this issue Feb 14, 2018
Try to fix 48116 and 48192

The bug #48116 happens because of a misoptimization of the `import_path_to_string` function, where a `names` slice is empty but the `!names.is_empty()` branch is executed.

https://github.com/rust-lang/rust/blob/4d2d3fc5dadf894a8ad709a5860a549f2c0b1032/src/librustc_resolve/resolve_imports.rs#L1015-L1042

Yesterday, @eddyb had locally reproduced the bug, and [came across the `position` function](https://mozilla.logbot.info/rust-infra/20180214#c14296834) where the `assume()` call is found to be suspicious. We have *not* concluded that this `assume()` causes #48116, but given [the reputation of `assume()`](#45501 (comment)), this seems higher relevant. Here we try to see if commenting it out can fix the errors.

Later @alexcrichton has bisected and found a potential bug [in the LLVM side](#48116 (comment)). We are currently testing if reverting that LLVM commit is enough to stop the bug. If true, this PR can be reverted (keep the `assume()`) and we could backport the LLVM patch instead.

(This PR also includes an earlier commit from #48127 for help debugging ICE happening in compile-fail/parse-fail tests.)

The PR also reverts #48059, which seems to cause #48192.

r? @alexcrichton
cc @eddyb, @arthurprs (#47333)
@dotdash
Copy link
Contributor

dotdash commented Feb 14, 2018

The valgrind violation seems harmless. It's from assert((getNumBuckets() & (getNumBuckets()-1)) == 0 and getNumBuckets() is:

  unsigned getNumBuckets() const {
    return Small ? InlineBuckets : getLargeRep()->NumBuckets;
  }

What happens here is that clang happens to ultimately put the branch on Small after the branch that compares the results from getLargeRep()->NumBuckets. That means that the first branch may depend on uninitialized memory, because if Small is true, the large rep part never gets initialized.

That should be fine though, because InlineBuckets is a template parameter (8 here), and for the case that Small is true, the assertion could statically be proven to hold. So all you need is to check whether Small is true, or else whether the condition holds for getLargeRep()->NumBuckets. So the result for NumBuckets only matters if Small is false, and in that case, the memory is initialized.

See also the remark under "Preparing your program" on http://valgrind.org/docs/manual/quick-start.html which says that such bogus violations are expected at higher optimization levels.

@dotdash
Copy link
Contributor

dotdash commented Feb 14, 2018

@alexcrichton I'm pretty sure that commit just made the jump threading pass find more opportunities to perform optimizations. So without that commit, it just never gets to hit the code path where the (bogus) violation is reported.

@alexcrichton
Copy link
Member

@dotdash gah bummer! I was wondering if it'd be something like that... (I know we have tons of those "errors" in rustc with valgrind)

If that's the case though it's sort of fascinating because it means that a binary on @eddyb's machine optimize IR differently than when it was on my machine... That still implies to me some level of undefined behavior but maybe not the kind flagged by valgrind?

@alexcrichton
Copy link
Member

@dotdash hm so fascinatingly it looks like the patch I gisted above actually does fix the compile on @eddyb's machine...

@alexcrichton
Copy link
Member

Ok some more information on this. I've finally been able to reproduce the error that @eddyb was seeing. Given the exact binaries from @eddyb I was originally unable to reproduce the issue he had on his machine. He realized, though, that our machines differed in glibc versions. Notably @eddyb had glibc 2.26 and I had glibc 2.23. Testing more versions revealed that with the binaries @eddyb gave me glibc 2.25 worked ok and glibc 2.26 was the first bad one.

With this new information I reran bisection and it fascinatingly pointed at the same commit. I truly have no idea what is going on here.

This may be a bug that only "just happens" to show up on glibc 2.26 though. The bots that are reproducing this, OSX and MinGW, are not using glibc. Still digging...

@alexcrichton
Copy link
Member

I've filed https://bugs.llvm.org/show_bug.cgi?id=36386 upstream to hopefully address this issue. I don't think we can be 100% sure that this is the exact same issue that we're seeing on OSX/Windows but I think it's as close as we're gonna get for the time being.

@Diggsey
Copy link
Contributor

Diggsey commented Feb 14, 2018

This comment seems to hint at the same issue: https://bugs.llvm.org/show_bug.cgi?id=32981#c2

Also related to JumpThreading and non-determinism due to pointer comparisons would tally with the glibc changes (some change could result in malloc giving out chunks of memory in a different order) as well as why it affects platforms without glibc.

@alexcrichton
Copy link
Member

I'm going to preemptively close this as I believe symptom has been fixed for us with commenting out the assume and I think further investigation should probably be coordinated with https://bugs.llvm.org/show_bug.cgi?id=36386

bors added a commit that referenced this issue Mar 2, 2018
Backport LLVM fixes for a JumpThreading / assume intrinsic bug

This fixes the original cause of #48116 and restores the assume intrinsic that was removed as a workaround.

r? @alexcrichton
bors added a commit that referenced this issue Mar 3, 2018
Backport LLVM fixes for a JumpThreading / assume intrinsic bug

This fixes the original cause of #48116 and restores the assume intrinsic that was removed as a workaround.

r? @alexcrichton
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-spurious Area: Spurious failures in builds (spuriously == for no apparent reason) C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

5 participants