Skip to content

Commit

Permalink
Add "Remote Output Service: place bazel-out/ on a FUSE file system"
Browse files Browse the repository at this point in the history
As requested by @lberki in bazelbuild/bazel#12823, I'm hereby sending
out a proposal for adding support for a client-side FUSE daemon to
Bazel. Suggestions are welcome!
  • Loading branch information
EdSchouten committed Feb 10, 2021
1 parent 80a4f2a commit 36c7774
Showing 1 changed file with 226 additions and 0 deletions.
226 changes: 226 additions & 0 deletions designs/2021-02-09-remote-output-service.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,226 @@
---
created: 2021-02-09
last updated: 2021-02-09
status: To be reviewed
reviewers:
- philwo
title: "Remote Output Service: place bazel-out/ on a FUSE file system"
authors:
- EdSchouten
---

# Abstract

This document describes an extension to Bazel, allowing it to host its
bazel-out/ directory on a FUSE file system. The goal of this change is
to reduce the amount of network that Bazel generates when remote
execution is used. The end result is similar to what's offered by
["Remote Builds without the Bytes"](https://docs.google.com/document/d/11m5AkWjigMgo9wplqB8zTdDcHoMLEFOSH0MdBNCBYOE/),
with the difference that outputs remain accessible.

# Background

Bazel can use the [Remote Execution protocol](https://github.com/bazelbuild/remote-apis)
to offload the execution of build actions to a remote build cluster.
When executing actions remotely, Bazel performs the following three
tasks successively:

1. Uploading inputs: Bazel uploads individual files and Directory,
Command and Action messages into the Content Addressable Storage
(CAS).
2. Execution: Bazel requests that the build cluster executes the Action
that was uploaded into the CAS. The build cluster returns an
ActionResult.
3. Downloading outputs: Bazel downloads all files referenced by the
ActionResult. In case of directory outputs, Bazel also downloads all
files referenced by Tree objects referenced by the ActionResult.

In the common case, the first two tasks consume little bandwidth. By
using an RPC method named FindMissingBlobs(), Bazel is capable of
scanning which objects already exist in the CAS, thereby allowing it to
skip unnecessary uploads of objects. Assuming the retention rate of the
remote build cluster is adequate, you will see that almost all network
bandwidth generated by Bazel is caused by the third task. In 2018, Jakob
Buchgraber [measured](https://docs.google.com/document/d/11m5AkWjigMgo9wplqB8zTdDcHoMLEFOSH0MdBNCBYOE/)
that when building Bazel itself using remote execution, at least 95.4%
of network bandwidth was caused by the downloading of outputs.

The reason Bazel downloads output files is as follows:

- To allow the user to inspect and use the results of a build, e.g. to
run the software that was built.
- Bazel supports mixed local and remote execution. Targets may either
be annotated explicitly to specify where they are run (using the
"no-remote" tag), or features such as
[the dynamic spawn scheduler](https://blog.bazel.build/2019/02/01/dynamic-spawn-scheduler.html)
may be used to let Bazel automatically decide where actions are run.
If a locally executing action depends on files that were built
remotely, Bazel needs to download those files to satisfy the
execution requirements of the locally executing action.
- To serve as Bazel's bookkeeping. To a certain extent, Bazel is
capable of building projects whose dependency graph does not fit in
memory. Bazel keeps a finitely sized cache of file metadata (size,
hash, etc.) in memory. Once exhausted, Bazel may discard entries.
During a later stage of a build, Bazel may need to recompute these
entries by inspecting files on disk.
- To guarantee forward progress of the build, even if objects were to
disappear from the remote CAS. By having the files present locally,
Bazel can reupload them if they were to disappear.

Based on the conclusions of his measurement results, Jakob added a
series of command line flags (`--remote_download_minimal`,
`--remote_download_toplevel`, etc.) that permit users to skip the
downloading step when possible. When `--remote_download_minimal` is
enabled, outputs will only be downloaded if successive actions depend on
them, or if `bazel run` is invoked and the output is an execution
dependency. During every invocation, Bazel constructs an additional
ActionInputMap that stores metadata of all files that are only present
remotely, thereby allowing input roots to be constructed without having
files present locally. Various remote CAS implementations have been
improved to guarantee the retention of recently accessed objects,
thereby ensuring forward progress.

Though `--remote_download_minimal` has helped many users of Bazel's
remote execution to scale, some inherent downsides remain:

- Bazel's memory usage has increased significantly. The ActionInputMap
that gets created during build may easily consume multiple gigabytes
of space for a sufficiently large project.
- Incremental builds have become significantly slower. Because the
ActionInputMap is discarded after every build, Bazel effectively
assumes that the outputs have disappeared, thereby causing it to do a
full rebuild.
- Users now need to make a conscious choice whether they want to
download output files or not. This makes it harder to do ad hoc
exploration.

# Proposal

In a nutshell, the proposal is to optionally let Bazel create a
bazel-out/ directory that contains the same layout as generated by a
plain build (without `--remote_download_minimal`), but somehow delay
downloads of remote files until their contents are actually read (either
by successive no-remote actions, or when the user accesses bazel-out/
manually). This prevents the need for the additional ActionInputMap and
once again grants the user the ability to explore build outputs on
demand.

One challenge with this approach is that it relies on special operating
system features to create such lazy-loading files. It can't be built on
top of the plain POSIX API. Unfortunately, no standards seem to exist in
this space:

- Linux provides [FUSE](https://en.wikipedia.org/wiki/Filesystem_in_Userspace),
which can be used to create a mountpoint for which all operations are
forwarded through a character device to a userspace process.
- For macOS there is [macFUSE](https://osxfuse.github.io) (also known as
OSXFUSE). It implements a slightly older version of the FUSE protocol,
with minor macOS specific protocol extensions in place. In terms of
robustness and performance, it fares worse than Linux's FUSE
implementation. Though still hosted on GitHub, this project is no
longer Open Source. The last Open Source version no longer runs on
modern versions of macOS.
- Windows provides an API called
[Projected File System (ProjFS)](https://docs.microsoft.com/en-us/windows/win32/projfs/projected-file-system)
that makes it possible to instantiate files underneath a
virtualization root whose contents are backed by a
[promise](https://en.wikipedia.org/wiki/Futures_and_promises).
- On operating systems that don't offer the features above, but do
support networked file systems (e.g., NFS, SMB), it may be desirable
to let the system mount a fictive volume managed by a userspace process.

Some of these APIs also require administrative privileges to work
properly. Though FUSE can on most Linux systems be used without any
special privileges through the use of the `fusermount` utility (setuid),
it does require elevated privileges to get it to work inside a Docker
container.

Because of these limitations, it makes sense to let a daemon other than
Bazel manage these directories. In order to manage the lifecycle of
these bazel-out/ directories and to instruct the creation of
lazy-loading files, Bazel may communicate with this daemon over gRPC.
This approach has a couple of advantages:

- It's a proven strategy. Google already has such a system internally
for Blaze. Their daemon is called objfsd and uses FUSE.
- By having a well-defined and succint protocol, it is possible for
people to easily design their own daemons that use custom protocols or
have custom storage policies (e.g., snapshotting and preserving the
results of builds). These daemons may have a release cadence that is
independent from Bazel.
- A single daemon may be run on a system, caching build results for
multiple users and multiple projects. The lifetime of the daemon may
be managed separately from the Bazel server, which may be useful for
shared CI setups.

The question then becomes what the schema needs to look like that Bazel
and the daemon will use. Google already has a schema for this
internally. Unfortunately, this protocol is not a perfect match:

- It doesn't support REv2 specific concepts such as instance names,
user-configurable hashing functions, etc.
- It makes no use of REv2 Protobuf messages, even though there are many
that seem useful in literal form (Digest, OutputFile, OutputDirectory,
OutputSymlink).
- The semantics around preserving files seems different. Google's
internal protocol has support for explicitly marking files that need
to be retained, while REv2 provides no such mechanism. It is assumed
that clients call FindMissingBlobs() to instruct a build cluster to
keep files present, at least for the remainder of a build.

Because of that, [Bazel PR #12823](https://github.com/bazelbuild/bazel/pull/12823)
provides an experimental implementation that uses a custom gRPC protocol
that looks as follows:

```proto
service RemoteOutputService {
// Remove a bazel-out/ directory.
rpc Clean(CleanRequest) returns (google.protobuf.Empty);
// Indicate that a build is starting or has completed. May create a
// bazel-out/ if none exists for the current workspace.
rpc StartBuild(StartBuildRequest) returns (StartBuildResponse);
rpc FinalizeBuild(FinalizeBuildRequest) returns (google.protobuf.Empty);
// Create one or more lazy-loading files or directories inside a
// bazel-out/ directory, backed by objects in the CAS.
rpc BatchCreate(BatchCreateRequest) returns (google.protobuf.Empty);
// Reobtain the metadata of files stored in the bazel-out/ directory.
rpc BatchStat(BatchStatRequest) returns (BatchStatResponse);
}
```

This protocol is strongly based on the API of
[Bazel's OutputService class](https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/vfs/OutputService.java).
The Protobuf file in the PR contains more complete documentation. An
example server-side implementation
[has been published as part of the Buildbarn project](https://github.com/buildbarn/bb-clientd).

# Alternatives considered

In the months before this proposal was written, various experiments were
performed:

- An attempt was made to add an integrated FUSE file system to Bazel
itself. This project was eventually abandoned due to the limitations
mentioned previously (i.e., the inability to run Bazel without
elevated privileges).
- Letting Bazel manage the bazel-out/ directory itself, emitting
symbolic links to a FUSE file system that provides a flat view of the
CAS ([PR #11703](https://github.com/bazelbuild/bazel/pull/11703) and
[PR #11622](https://github.com/bazelbuild/bazel/pull/11662)). Though
this worked fine from Bazel's point of view, the additional symlink
indirection confused many build actions. It completely broke dynamic
linkage.
- Letting Bazel write its contents into a tmpfs-like FUSE daemon, where
files may be hardlinked from a directory that provides a flat view of
the CAS. This worked, but performance was very bad due to a high
amount of context switching. The gRPC-based solution solves this by
supporting batched requests.

# Backward-compatibility

The goal is to implement this feature in such a way that local
execution, plain remote execution, remote execution with
`--remote_download_minimal`, remote execution with a local disk cache,
etc. all remain functional.

0 comments on commit 36c7774

Please sign in to comment.