"dangling symbolic link" flakes after upgrading to Bazel 7 #20886
Comments
Can you share the Bazel flags you're using? We've seen a few symlink-related issues due to the |
|
I lied.
I reverted our builds to use Bazel 6.4.0 before testing this. 🤷 |
This is primarily to work around bazelbuild/bazel#20886, but also helps on Windows if symbolic links aren’t supported.
#20408 (comment) reports that this is still a problem in Bazel 7.0.0. Even though PR #19739 (fixing #19143 in 6.4) mentions that it was working at HEAD at the time, before 7.0 was branched, 7.0.0 is still (transiently) broken. |
This occurs quite frequently in our environment, and even more frequently when enabling Bazel module support. I have tried to recreate this in a minimal sandbox without success, but my guess is that it is not related to remote execution, since it appears on some targets like this: cc_library( Where the strip_include_prefix triggers the symlink actions, but they occasionally fail locally even though the target files must be available locally, since they were found by the glob in the repository. This is a major issue for us and a blocker for our migration to Bazel modules. Bazel version used: 7.0.2 |
Same when using rules_foreign_cc. But using |
@uhlajs Is this with 7.0.0, or 7.0.2? The latter contains some fixes for |
With |
@uhlajs |
Thanks for rerouting this, @fmeum. Feel free to ping or reopen if it turns out to be a Bazel issue and not just a rules_foreign_cc one. |
@tjgq the dangling symbolic link issues we were seeing were not related to rules_foreign_cc and are definitely a Bazel issue. I don't think this issue should be closed since this needs to be fixed. |
@Gormo did not use cmake or |
Sorry, I misread the thread. |
@tjgq no problems. Thank you for reopening it. |
@Gormo Since it appears to be difficult to produce a reproducer for this, could you perhaps try to bisect this down to the breaking Bazel commit using Bazelisk's |
An update from our side:
Current theory is that it could be related to I/O access and some kind of internal race condition between different threads: I/O access is sometimes delayed on disks under heavy load, which triggers this issue. We tried to recreate this locally by using a disk loader to simulate high I/O load, but unfortunately without success. |
@fmeum I have now tried cherry-picking 48ea3d2 on top of 7.0.2, but that didn't really affect anything, since we also get the errors on actions generated by ctx.actions.symlink(). But that didn't remove any flakiness either.
|
@Wyverald FYI if you haven't already seen this thread |
@Gormo can you check what the paths look like about which Bazel complains that they are dangling symlinks? I'm curious if they stay dangling symlinks at the end of the build and if so, how exactly they are dangling, i.e. in what step of the symlink resolution does the "file not found" error occur? (e.g. simply the target doesn't exist even though the directory that contains it does? Does the symlink point to a file under a directory that should exist, but it doesn't? Something more complicated?) It's not immediately obvious how this could happen: AFAIU Does |
Here is an (IP-mangled) output from the symlinks: ERROR: /Top-bazel/478b2cbff2254079381e27d1a245fab2/external/FOO/BUILD.bazel:7:12: output 'external/FOO/lib/libfoo.a' is a dangling symbolic link
I also tried with "--spawn_strategy=standalone" but that didn't affect anything. |
If it still fails with Can you check which directory exists and which one does not in the ancestors of |
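The question of "which ancestor exists and which does not" can be checked with a small script. The following is only a sketch, not something posted in the thread: the function names `trace_symlink` and `first_missing_ancestor` are made up for illustration, and the script assumes a filesystem where `os.readlink` works for the links in question.

```python
import os
from pathlib import Path


def trace_symlink(path, max_hops=40):
    """Follow a symlink chain one hop at a time.

    Returns a list of (path, status) tuples, where status is one of
    "symlink", "exists", "missing" or "too many hops".
    Simplification: only the final path component is followed hop by hop,
    and relative targets are resolved with normpath, which can be wrong
    if ".." crosses a symlinked directory.
    """
    hops = []
    current = os.path.abspath(path)
    for _ in range(max_hops):
        if os.path.islink(current):
            hops.append((current, "symlink"))
            target = os.readlink(current)
            # A relative target is interpreted relative to the link's directory.
            current = os.path.normpath(
                os.path.join(os.path.dirname(current), target)
            )
        elif os.path.lexists(current):
            hops.append((current, "exists"))
            return hops
        else:
            hops.append((current, "missing"))
            return hops
    hops.append((current, "too many hops"))
    return hops


def first_missing_ancestor(path):
    """Return the shortest prefix of `path` that does not exist, or None."""
    p = Path(os.path.abspath(path))
    # Path.parents runs from the nearest parent up to the root, so reverse it.
    for candidate in list(p.parents)[::-1] + [p]:
        if not os.path.lexists(candidate):
            return str(candidate)
    return None
```

Running `trace_symlink` on the output that Bazel reports as dangling, and then `first_missing_ancestor` on the last hop if its status is "missing", shows whether only the file is absent or a whole directory above it.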
This error also occurs on Windows:
Looking at the disk, I can see.
The build event streaming posted the event after the symlink was created.
|
Here is an equivalent output from Linux:
Looking at the disk, I can see.
The build event streaming posted the event after the symlink was created.
So it seems like the symlink targets always exist but are sometimes not populated into the sandbox. |
We were hitting this issue very consistently and |
This reverts commit 578ce77. Copying doesn't appear to help with bazelbuild/bazel#20886, and it's going to be fixed in Bazel 7.1 anyway.
@freeformstu, the issue still occurs frequently on 7.1.0 when running with Bazel modules enabled. |
Has anyone been able to reproduce this issue locally with Bazel 7.1.0 and |
@fmeum I was able to reproduce the same issue under Bazel
/edit In trying to reproduce this, it looks like |
@chrisabbott Could you share an example with which you observe this behavior? |
@fmeum Yep! Here you go. Bear in mind that I didn't actually need both flags as I mentioned above, so it may or may not be useful to you.
|
@chrisabbott Sorry, I didn't see your edit before: If that flag fixes the issue, I'm pretty sure it's the deterministic failure described in #21215. |
@fmeum, we have --noincompatible_sandbox_hermetic_tmp enabled on 7.1.0, but we still see the error frequently in CI when Bazel modules are enabled. (It's actually so frequent that it's stopping us from migrating to Bazel modules.) It's nondeterministic and difficult to understand what triggers it, but if you have proposals for relevant log points I can enable those, or even patch Bazel for a test if needed. |
@Gormo Could you try to bisect this down to a smaller commit range by setting |
One thing that we noticed in the past few months after upgrading to Bazel 7 was files in Bazel's output_base mysteriously missing from CI machines, leading to errors like:
or
The error messages above are just examples. These kinds of errors can happen to any external repo, but only to external repos (files missing under output_base/external). We do have a process on our CI machines that cleans cache directories, including Bazel's output_base, when the disk is close to full. However, the cleaning process rotates the whole cache directory rather than deleting files individually, so it shouldn't cause partial deletion of output_base. When this error happens, all subsequent builds from that CI machine would fail with similar errors until we run I can imagine that the missing files under output_base/external can cause the dangling symbolic links discussed on this ticket. |
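To check whether files missing under output_base/external have actually left dangling links behind on a CI machine, a directory scan is one option. This is a minimal sketch under stated assumptions, not a tool from the thread: `find_dangling_symlinks` is a hypothetical helper, and the directory to scan (for example the `external` directory under the path printed by `bazel info output_base`) has to be supplied by the caller.

```python
import os


def find_dangling_symlinks(root):
    """Yield (link_path, raw_target) for each symlink under `root`
    whose target does not resolve to an existing file or directory."""
    # followlinks=False: report symlinked directories, but do not descend into them.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it to the target.
            if os.path.islink(full) and not os.path.exists(full):
                yield full, os.readlink(full)
```

Comparing the result right after a failing build with the result at the end of the build would show whether the links stay dangling or only dangle transiently.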
The issues described in this thread that aren't fixed by I don't know what could be causing the issue @linzhp described though. |
I faced the issue with
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

def deps():
    ref = "2.14"
    http_archive(
        name = "jansson",
        build_file = "//my:jansson.BUILD",
        sha256 = "5798d010e41cf8d76b66236cfb2f2543c8d082181d16bc3085ab49538d4b9929",
        strip_prefix = "jansson-{}".format(ref),
        url = "https://github.com/akheron/jansson/releases/download/v{}/jansson-{}.tar.gz".format(ref, ref),
    )

load("@rules_foreign_cc//foreign_cc:defs.bzl", "cmake")
filegroup(name = "lib_source", srcs = glob(["**"]))
cmake(name = "libjansson", lib_source = ":lib_source", visibility = ["//visibility:public"])
         build_file = "//my:jansson.BUILD",
+        patches = ["//my:jansson.patch"],  # https://github.com/bazelbuild/bazel/issues/20886
         sha256 = "5798d010e41cf8d76b66236cfb2f2543c8d082181d16bc3085ab49538d4b9929",
--- CMakeLists.txt
+++ CMakeLists.txt
@@ -271,2 +271,3 @@
-file (COPY ${CMAKE_CURRENT_SOURCE_DIR}/src/jansson.h
- DESTINATION ${CMAKE_CURRENT_BINARY_DIR}/include/)
+configure_file (${CMAKE_CURRENT_SOURCE_DIR}/src/jansson.h
+ ${CMAKE_CURRENT_BINARY_DIR}/include/jansson.h
+ COPYONLY)
@@ -298 +299 @@
- ${CMAKE_CURRENT_SOURCE_DIR}/src/jansson.h)
+ ${CMAKE_CURRENT_BINARY_DIR}/include/jansson.h) 💡 |
Since it's pretty likely that this is fixed by 52adf0b, I will close this issue. If you can still reproduce your issue with a version of Bazel including this commit (currently
|
Description of the bug:
googleapis/google-cloud-cpp#13444
After upgrading to Bazel 7, we have started seeing transient failures in our CI. These have all been from io_opentelemetry_cpp. My naive guess is that it has something to do with how that repo uses include_prefix: https://github.com/open-telemetry/opentelemetry-cpp/blob/c4f39f2be8109fd1a3e047677c09cf47954b92db/sdk/src/trace/BUILD#L10
Which category does this issue belong to?
External Dependency
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Not sure, but I can supply more logs and test solutions (within reason).
Which operating system are you running Bazel on?
Linux
What is the output of bazel info release?
release 7.0.0
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
It probably has to do with "build without the bytes"
Have you found anything relevant by searching the web?
#19143 seems like a similar issue.
Any other information, logs, or outputs that you want to share?
No response