Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Action should not be successful and cached if outputs were not created #14543

Closed
keith opened this issue Jan 11, 2022 · 7 comments
Closed

Action should not be successful and cached if outputs were not created #14543

keith opened this issue Jan 11, 2022 · 7 comments
Assignees
Labels
P1 I'll work on this now. (Assignee required) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@keith
Copy link
Member

keith commented Jan 11, 2022

We recently had a cache of cache poisoning where we realized that actions are considered successful based on their exit code, even if the actual outputs of the action do not match the listed outputs, which later fails the build with something like:

ERROR: /path/to/BUILD:3:8: output 'foo.txt' was not created

But this does not stop the same action from being fetched from the cache by subsequent builds. Specifically what happened in our case was:

  1. The compiler of an action crashed during execution due to a hardware failure
  2. The action somehow still exited 0 (this is something that I also need to investigate and fix, but likely on the rules_swift side)
  3. Bazel cached the output of the action, which was a log of the compiler crash, and that no outputs were created
  4. Bazel failed after it identified the missing outputs
  5. All subsequent builds with the same inputs pulled this invalid cache entry and failed showing the same compiler crash log

I think if the outputs being created successfully were part of the requirement for an action to be marked as successful, this wouldn't have happened. (This clearly requires your action fails non-deterministically, which should be rare, but can happen in cases like this.)

Here's the execution log json from the action where the compiler crashed:

{
  "commandArgs": ["bazel-out/darwin-opt-exec-2B5CBBC6-ST-d7817b5f5799/bin/external/build_bazel_rules_swift/tools/worker/worker", "swiftc", "@bazel-out/ios-arm64-min12.0-applebin_ios-ios_arm64-opt-ST-d7817b5f5799/bin/Modules/Foo/Foo.swiftmodule-0.params"],
  snip ...
  "inputs": snip...,
  "listedOutputs": ["bazel-out/ios-arm64-min12.0-applebin_ios-ios_arm64-opt-ST-d7817b5f5799/bin/Modules/Foo/Foo.swiftmodule", snip ...],
  "remotable": true,
  "cacheable": true,
  "timeoutMillis": "0",
  "progressMessage": "Compiling Swift module //Modules/Foo:Foo",
  "mnemonic": "SwiftCompile",
  "actualOutputs": [],
  "runner": "worker",
  "remoteCacheHit": false,
  "status": "",
  "exitCode": 0,
  "remoteCacheable": true,
  "walltime": "7.261184363s"
}

And then the log from all subsequent builds with the same inputs:

{
  "commandArgs": ["bazel-out/darwin-opt-exec-2B5CBBC6-ST-d7817b5f5799/bin/external/build_bazel_rules_swift/tools/worker/worker", "swiftc", "@bazel-out/ios-arm64-min12.0-applebin_ios-ios_arm64-opt-ST-d7817b5f5799/bin/Modules/Foo/Foo.swiftmodule-0.params"],
  snip ...
  "inputs": snip...,
  "listedOutputs": ["bazel-out/ios-arm64-min12.0-applebin_ios-ios_arm64-opt-ST-d7817b5f5799/bin/Modules/Foo/Foo.swiftmodule", snip...],
  "remotable": true,
  "cacheable": true,
  "timeoutMillis": "0",
  "progressMessage": "Compiling Swift module //Modules/Foo:Foo",
  "mnemonic": "SwiftCompile",
  "actualOutputs": [],
  "runner": "remote cache hit",
  "remoteCacheHit": true,
  "status": "",
  "exitCode": 0,
  "remoteCacheable": true,
  "walltime": "0s"
}

Note the second execution log shows the invalid results were pulled from cache.

What operating system are you running Bazel on?

macOS

What's the output of bazel info release?

5.0.0rc3

keith added a commit to bazelbuild/rules_swift that referenced this issue Jan 11, 2022
In the case that swiftc crashed, the worker would return that it exited
0. This was because calling WEXITSTATUS when WIFEXITED is not true is
invalid and happened to return 0 in that case. We now handle the case
where swiftc exits from a signal, and return the signal it exited with
instead. There might be some other cases here that we have to handle in
the future, so for now those are failures so we don't return an invalid
value.

Also see bazelbuild/bazel#14543
@coeuvre coeuvre self-assigned this Jan 11, 2022
@coeuvre coeuvre added team-Remote-Exec Issues and PRs for the Execution (Remote) team P2 We'll consider working on this in future. (Assignee optional) type: bug labels Jan 11, 2022
@Wyverald
Copy link
Member

Is this a 5.0 regression? (i.e. does it show up in 4.2.2?)

@coeuvre
Copy link
Member

coeuvre commented Jan 11, 2022

I don't think it's a regression (this issue also shows up in 4.x).

@keith
Copy link
Member Author

keith commented Jan 11, 2022

Yea I doubt it

keith added a commit to bazelbuild/rules_swift that referenced this issue Jan 11, 2022
In the case that swiftc crashed, the worker would return that it exited
0. This was because calling WEXITSTATUS when WIFEXITED is not true is
invalid and happened to return 0 in that case. We now handle the case
where swiftc exits from a signal, and return the signal it exited with
instead. There might be some other cases here that we have to handle in
the future, so for now those are failures so we don't return an invalid
value.

Also see bazelbuild/bazel#14543
@coeuvre coeuvre added P1 I'll work on this now. (Assignee required) and removed P2 We'll consider working on this in future. (Assignee optional) labels Mar 9, 2022
@coeuvre
Copy link
Member

coeuvre commented Mar 9, 2022

Working on the fix.

@coeuvre
Copy link
Member

coeuvre commented Mar 14, 2022

A check for upload has been added by 5b54588. I will create another PR for the check when downloading.

@coeuvre coeuvre reopened this Mar 14, 2022
@keith
Copy link
Member Author

keith commented Mar 14, 2022

Thanks! I think I found another case of this where the disk cache is polluted when a header inclusion fails:

ERROR: /private/var/tmp/_bazel_ksmiley/4096f01a2eae006dd7eb0d708a9f3935/external/v8/BUILD.bazel:3357:11: Compiling src/snapshot/snapshot-utils.cc failed: undeclared inclusion(s) in rule '@v8//:v8_libshared_noicu':
this rule is missing dependency declarations for the following files included by 'src/snapshot/snapshot-utils.cc':
  'external/com_googlesource_chromium_zlib/zlib.h'
  'external/com_googlesource_chromium_zlib/zconf.h'

I just wanted to mention in case that makes sense to you while you're in this, but I can file another issue after I debug more

@coeuvre
Copy link
Member

coeuvre commented Mar 15, 2022

Yes, please file another issue with more details. Thanks!

brentleyjones pushed a commit to brentleyjones/bazel that referenced this issue Mar 15, 2022
Even if the exit code is 0. Missing declared outputs will be detected by Bazel later and fail the build, so avoid uploading this false positive cache entry.

Wrap buildUploadManifest inside a `Single.fromCallable` since there are many IOs (both the check we add in this PR and stats already there). If `--experimental_remote_cache_async` is set, these IOs will now be executed in a background thread.

Fixes bazelbuild#14543.

Closes bazelbuild#15016.

PiperOrigin-RevId: 434448255
(cherry picked from commit 5b54588)
brentleyjones pushed a commit to brentleyjones/bazel that referenced this issue Mar 17, 2022
An AC entry that misses declared outputs of an action is invalid because, if Bazel accepts this from remote cache, it will detect the missing declared outputs error at a later stage and fail the build.

This PR adds a check for mandatory outputs when downloading outputs from remote cache. If a mandatory output is missing from AC entry, Bazel will ignore the cache entry so build can continue.

Also fixes an issue introduced by bazelbuild#15016 that tests result are not uploaded to remote cache. Test spawns declare all the possible outputs but they usually don't generate them all. This change fixes that by only check for mandatory outputs via `spawn.isMandatoryOutput()`.

Fixes bazelbuild#14543.

Closes bazelbuild#15051.

PiperOrigin-RevId: 435307260
(cherry picked from commit a151116)
coeuvre added a commit to coeuvre/bazel that referenced this issue Mar 18, 2022
Even if the exit code is 0. Missing declared outputs will be detected by Bazel later and fail the build, so avoid uploading this false positive cache entry.

Wrap buildUploadManifest inside a `Single.fromCallable` since there are many IOs (both the check we add in this PR and stats already there). If `--experimental_remote_cache_async` is set, these IOs will now be executed in a background thread.

Fixes bazelbuild#14543.

Closes bazelbuild#15016.

PiperOrigin-RevId: 434448255
coeuvre added a commit to coeuvre/bazel that referenced this issue Mar 18, 2022
An AC entry that misses declared outputs of an action is invalid because, if Bazel accepts this from remote cache, it will detect the missing declared outputs error at a later stage and fail the build.

This PR adds a check for mandatory outputs when downloading outputs from remote cache. If a mandatory output is missing from AC entry, Bazel will ignore the cache entry so build can continue.

Also fixes an issue introduced by bazelbuild#15016 that tests result are not uploaded to remote cache. Test spawns declare all the possible outputs but they usually don't generate them all. This change fixes that by only check for mandatory outputs via `spawn.isMandatoryOutput()`.

Fixes bazelbuild#14543.

Closes bazelbuild#15051.

PiperOrigin-RevId: 435307260
Wyverald pushed a commit that referenced this issue Mar 18, 2022
…ere not created (#15071)

* Add a method, Spawn#isOutputMandatory, to indicate whether a Spawn must create an output by the end of its execution. If not, a Spawn (like a test) may succeed without necessarily creating that output.

The 4 categories of actions that do this are:

1. Tests (tests can create XML and other files, but may not).
2. Java compilations with reduced classpaths (the initial compilation with a reduced classpath may fail, but produce a usable jdeps proto file, so the Spawn will claim that it succeeded so that the action can process the proto file).
3. Extra actions with a dummy output that is produced locally, not by the Spawn. However, by changing the outputs of the Spawn to be empty, this situation is avoided.
4. C++ compilations with coverage (a .gcno coverage file is not produced for an empty translation unit).

In particular, all SpawnActions' Spawns have #isMandatoryOutput always return true, and there should be no new cases of #isMandatoryOutput returning false besides the ones added in this CL.

PiperOrigin-RevId: 425616085

* Remote: Don't upload action result if declared outputs are not created.

Even if the exit code is 0. Missing declared outputs will be detected by Bazel later and fail the build, so avoid uploading this false positive cache entry.

Wrap buildUploadManifest inside a `Single.fromCallable` since there are many IOs (both the check we add in this PR and stats already there). If `--experimental_remote_cache_async` is set, these IOs will now be executed in a background thread.

Fixes #14543.

Closes #15016.

PiperOrigin-RevId: 434448255

* Remote: Check declared outputs when downloading outputs.

An AC entry that misses declared outputs of an action is invalid because, if Bazel accepts this from remote cache, it will detect the missing declared outputs error at a later stage and fail the build.

This PR adds a check for mandatory outputs when downloading outputs from remote cache. If a mandatory output is missing from AC entry, Bazel will ignore the cache entry so build can continue.

Also fixes an issue introduced by #15016 that tests result are not uploaded to remote cache. Test spawns declare all the possible outputs but they usually don't generate them all. This change fixes that by only check for mandatory outputs via `spawn.isMandatoryOutput()`.

Fixes #14543.

Closes #15051.

PiperOrigin-RevId: 435307260

Co-authored-by: janakr <[email protected]>
copybara-service bot pushed a commit that referenced this issue Aug 29, 2022
…re missing.

So that if the unusual case described by #14543 happens, we can still see the error message of the spawn (if any).

Closes #15724.

PiperOrigin-RevId: 470673113
Change-Id: I54a17a555bf1e88bd8ec8bd8a980e5914cfd2261
aiuto pushed a commit to aiuto/bazel that referenced this issue Oct 12, 2022
…re missing.

So that if the unusual case described by bazelbuild#14543 happens, we can still see the error message of the spawn (if any).

Closes bazelbuild#15724.

PiperOrigin-RevId: 470673113
Change-Id: I54a17a555bf1e88bd8ec8bd8a980e5914cfd2261
tymurmustafaiev pushed a commit to tymurmustafaiev/rules_swift that referenced this issue Jul 19, 2023
In the case that swiftc crashed, the worker would return that it exited
0. This was because calling WEXITSTATUS when WIFEXITED is not true is
invalid and happened to return 0 in that case. We now handle the case
where swiftc exits from a signal, and return the signal it exited with
instead. There might be some other cases here that we have to handle in
the future, so for now those are failures so we don't return an invalid
value.

Also see bazelbuild/bazel#14543
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 I'll work on this now. (Assignee required) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
3 participants