Add "Remote Output Service: place bazel-out/ on a FUSE file system"

As requested by @lberki in bazelbuild/bazel#12823, I'm hereby sending out a proposal for adding support for a client-side FUSE daemon to Bazel. Suggestions are welcome!
EdSchouten · Feb 10, 2021 · 36c7774 · 36c7774
1 parent 80a4f2a
commit 36c7774
Showing 1 changed file with 226 additions and 0 deletions.
diff --git a/designs/2021-02-09-remote-output-service.md b/designs/2021-02-09-remote-output-service.md
@@ -0,0 +1,226 @@
+---
+created: 2021-02-09
+last updated: 2021-02-09
+status: To be reviewed
+reviewers:
+  - philwo
+title: "Remote Output Service: place bazel-out/ on a FUSE file system"
+authors:
+  - EdSchouten
+---
+
+# Abstract
+
+This document describes an extension to Bazel, allowing it to host its
+bazel-out/ directory on a FUSE file system. The goal of this change is
+to reduce the amount of network that Bazel generates when remote
+execution is used. The end result is similar to what's offered by
+["Remote Builds without the Bytes"](https://docs.google.com/document/d/11m5AkWjigMgo9wplqB8zTdDcHoMLEFOSH0MdBNCBYOE/),
+with the difference that outputs remain accessible.
+
+# Background
+
+Bazel can use the [Remote Execution protocol](https://github.com/bazelbuild/remote-apis)
+to offload the execution of build actions to a remote build cluster.
+When executing actions remotely, Bazel performs the following three
+tasks successively:
+
+1. Uploading inputs: Bazel uploads individual files and Directory,
+   Command and Action messages into the Content Addressable Storage
+   (CAS).
+2. Execution: Bazel requests that the build cluster executes the Action
+   that was uploaded into the CAS. The build cluster returns an
+   ActionResult.
+3. Downloading outputs: Bazel downloads all files referenced by the
+   ActionResult. In case of directory outputs, Bazel also downloads all
+   files referenced by Tree objects referenced by the ActionResult.
+
+In the common case, the first two tasks consume little bandwidth. By
+using an RPC method named FindMissingBlobs(), Bazel is capable of
+scanning which objects already exist in the CAS, thereby allowing it to
+skip unnecessary uploads of objects. Assuming the retention rate of the
+remote build cluster is adequate, you will see that almost all network
+bandwidth generated by Bazel is caused by the third task. In 2018, Jakob
+Buchgraber [measured](https://docs.google.com/document/d/11m5AkWjigMgo9wplqB8zTdDcHoMLEFOSH0MdBNCBYOE/)
+that when building Bazel itself using remote execution, at least 95.4%
+of network bandwidth was caused by the downloading of outputs.
+
+The reason Bazel downloads output files is as follows:
+
+- To allow the user to inspect and use the results of a build, e.g. to
+  run the software that was built.
+- Bazel supports mixed local and remote execution. Targets may either
+  be annotated explicitly to specify where they are run (using the
+  "no-remote" tag), or features such as
+  [the dynamic spawn scheduler](https://blog.bazel.build/2019/02/01/dynamic-spawn-scheduler.html)
+  may be used to let Bazel automatically decide where actions are run.
+  If a locally executing action depends on files that were built
+  remotely, Bazel needs to download those files to satisfy the
+  execution requirements of the locally executing action.
+- To serve as Bazel's bookkeeping. To a certain extent, Bazel is
+  capable of building projects whose dependency graph does not fit in
+  memory. Bazel keeps a finitely sized cache of file metadata (size,
+  hash, etc.) in memory. Once exhausted, Bazel may discard entries.
+  During a later stage of a build, Bazel may need to recompute these
+  entries by inspecting files on disk.
+- To guarantee forward progress of the build, even if objects were to
+  disappear from the remote CAS. By having the files present locally,
+  Bazel can reupload them if they were to disappear.
+
+Based on the conclusions of his measurement results, Jakob added a
+series of command line flags (`--remote_download_minimal`,
+`--remote_download_toplevel`, etc.) that permit users to skip the
+downloading step when possible. When `--remote_download_minimal` is
+enabled, outputs will only be downloaded if successive actions depend on
+them, or if `bazel run` is invoked and the output is an execution
+dependency. During every invocation, Bazel constructs an additional
+ActionInputMap that stores metadata of all files that are only present
+remotely, thereby allowing input roots to be constructed without having
+files present locally. Various remote CAS implementations have been
+improved to guarantee the retention of recently accessed objects,
+thereby ensuring forward progress.
+
+Though `--remote_download_minimal` has helped many users of Bazel's
+remote execution to scale, some inherent downsides remain:
+
+- Bazel's memory usage has increased significantly. The ActionInputMap
+  that gets created during build may easily consume multiple gigabytes
+  of space for a sufficiently large project.
+- Incremental builds have become significantly slower. Because the
+  ActionInputMap is discarded after every build, Bazel effectively
+  assumes that the outputs have disappeared, thereby causing it to do a
+  full rebuild.
+- Users now need to make a conscious choice whether they want to
+  download output files or not. This makes it harder to do ad hoc
+  exploration.
+
+# Proposal
+
+In a nutshell, the proposal is to optionally let Bazel create a
+bazel-out/ directory that contains the same layout as generated by a
+plain build (without `--remote_download_minimal`), but somehow delay
+downloads of remote files until their contents are actually read (either
+by successive no-remote actions, or when the user accesses bazel-out/
+manually). This prevents the need for the additional ActionInputMap and
+once again grants the user the ability to explore build outputs on
+demand.
+
+One challenge with this approach is that it relies on special operating
+system features to create such lazy-loading files. It can't be built on
+top of the plain POSIX API. Unfortunately, no standards seem to exist in
+this space:
+
+- Linux provides [FUSE](https://en.wikipedia.org/wiki/Filesystem_in_Userspace),
+  which can be used to create a mountpoint for which all operations are
+  forwarded through a character device to a userspace process.
+- For macOS there is [macFUSE](https://osxfuse.github.io) (also known as
+  OSXFUSE). It implements a slightly older version of the FUSE protocol,
+  with minor macOS specific protocol extensions in place. In terms of
+  robustness and performance, it fares worse than Linux's FUSE
+  implementation. Though still hosted on GitHub, this project is no
+  longer Open Source. The last Open Source version no longer runs on
+  modern versions of macOS.
+- Windows provides an API called
+  [Projected File System (ProjFS)](https://docs.microsoft.com/en-us/windows/win32/projfs/projected-file-system)
+  that makes it possible to instantiate files underneath a
+  virtualization root whose contents are backed by a
+  [promise](https://en.wikipedia.org/wiki/Futures_and_promises).
+- On operating systems that don't offer the features above, but do
+  support networked file systems (e.g., NFS, SMB), it may be desirable
+  to let the system mount a fictive volume managed by a userspace process.
+
+Some of these APIs also require administrative privileges to work
+properly. Though FUSE can on most Linux systems be used without any
+special privileges through the use of the `fusermount` utility (setuid),
+it does require elevated privileges to get it to work inside a Docker
+container.
+
+Because of these limitations, it makes sense to let a daemon other than
+Bazel manage these directories. In order to manage the lifecycle of
+these bazel-out/ directories and to instruct the creation of
+lazy-loading files, Bazel may communicate with this daemon over gRPC.
+This approach has a couple of advantages:
+
+- It's a proven strategy. Google already has such a system internally
+  for Blaze. Their daemon is called objfsd and uses FUSE.
+- By having a well-defined and succint protocol, it is possible for
+  people to easily design their own daemons that use custom protocols or
+  have custom storage policies (e.g., snapshotting and preserving the
+  results of builds). These daemons may have a release cadence that is
+  independent from Bazel.
+- A single daemon may be run on a system, caching build results for
+  multiple users and multiple projects. The lifetime of the daemon may
+  be managed separately from the Bazel server, which may be useful for
+  shared CI setups.
+
+The question then becomes what the schema needs to look like that Bazel
+and the daemon will use. Google already has a schema for this
+internally. Unfortunately, this protocol is not a perfect match:
+
+- It doesn't support REv2 specific concepts such as instance names,
+  user-configurable hashing functions, etc.
+- It makes no use of REv2 Protobuf messages, even though there are many
+  that seem useful in literal form (Digest, OutputFile, OutputDirectory,
+  OutputSymlink).
+- The semantics around preserving files seems different. Google's
+  internal protocol has support for explicitly marking files that need
+  to be retained, while REv2 provides no such mechanism. It is assumed
+  that clients call FindMissingBlobs() to instruct a build cluster to
+  keep files present, at least for the remainder of a build.
+
+Because of that, [Bazel PR #12823](https://github.com/bazelbuild/bazel/pull/12823)
+provides an experimental implementation that uses a custom gRPC protocol
+that looks as follows:
+
+```proto
+service RemoteOutputService {
+  // Remove a bazel-out/ directory.
+  rpc Clean(CleanRequest) returns (google.protobuf.Empty);
+
+  // Indicate that a build is starting or has completed. May create a
+  // bazel-out/ if none exists for the current workspace.
+  rpc StartBuild(StartBuildRequest) returns (StartBuildResponse);
+  rpc FinalizeBuild(FinalizeBuildRequest) returns (google.protobuf.Empty);
+
+  // Create one or more lazy-loading files or directories inside a
+  // bazel-out/ directory, backed by objects in the CAS.
+  rpc BatchCreate(BatchCreateRequest) returns (google.protobuf.Empty);
+  // Reobtain the metadata of files stored in the bazel-out/ directory.
+  rpc BatchStat(BatchStatRequest) returns (BatchStatResponse);
+}
+```
+
+This protocol is strongly based on the API of
+[Bazel's OutputService class](https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/vfs/OutputService.java).
+The Protobuf file in the PR contains more complete documentation. An
+example server-side implementation
+[has been published as part of the Buildbarn project](https://github.com/buildbarn/bb-clientd).
+
+# Alternatives considered
+
+In the months before this proposal was written, various experiments were
+performed:
+
+- An attempt was made to add an integrated FUSE file system to Bazel
+  itself. This project was eventually abandoned due to the limitations
+  mentioned previously (i.e., the inability to run Bazel without
+  elevated privileges).
+- Letting Bazel manage the bazel-out/ directory itself, emitting
+  symbolic links to a FUSE file system that provides a flat view of the
+  CAS ([PR #11703](https://github.com/bazelbuild/bazel/pull/11703) and
+  [PR #11622](https://github.com/bazelbuild/bazel/pull/11662)). Though
+  this worked fine from Bazel's point of view, the additional symlink
+  indirection confused many build actions. It completely broke dynamic
+  linkage.
+- Letting Bazel write its contents into a tmpfs-like FUSE daemon, where
+  files may be hardlinked from a directory that provides a flat view of
+  the CAS. This worked, but performance was very bad due to a high
+  amount of context switching. The gRPC-based solution solves this by
+  supporting batched requests.
+
+# Backward-compatibility
+
+The goal is to implement this feature in such a way that local
+execution, plain remote execution, remote execution with
+`--remote_download_minimal`, remote execution with a local disk cache,
+etc. all remain functional.