Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remote Output Service: place bazel-out/ on a FUSE file system #12823

Conversation

EdSchouten
Copy link
Contributor

Builds that yield large output files can generate lots of network
bandwidth when remote execution is used. To combat this, we have flags
such as --remote_download_minimal to disable downloading of output
files. Unfortunately, this makes it hard to perform ad hoc exploration
of build outputs.

In an attempt to solve this, this change adds an option to Bazel named
--remote_output_service. When enabled, Bazel effectively gives up the
responsibility of managing a bazel-out/ directory. Instead, it calls
into a gRPC service to request a directory and creates a symlink that
points to it.

Smart implementations of this gRPC service may use things like FUSE to
let this replacement bazel-out/ directory be lazy-loading, thereby
reducing network bandwidth significantly. In order to create
lazy-loading files and directories, Bazel can call into a BatchCreate()
RPC that takes a list of Output{File,Directory,Symlink} messages,
similar to REv2's ActionResult. This call is also used to create
runfiles directories by providing fictive instances of OutputSymlink.

To prevent Bazel from reading the contents of files stored in the FUSE
file system (which would cause network I/O), the protocol offers a
BatchStat() call that can return information such as the REv2 digest.
Though this is redundant with --unix_digest_hash_attribute_name, there
are a couple of reasons why I added this feature:

  1. For non-Linux operating systems, it may make more sense to use NFSv4
    instead of FUSE (i.e., running a virtual NFS daemon on localhost).
    Even though RFC 8276 adds support for extended attributes to NFSv4,
    not all operating systems implement it.

  2. It addresses the security/hermeticity concerns that people raised
    when this feature was added. There is no way to add extended
    attributes to files that can't be tampered with (as a non-root user),
    while using gRPC solves that problem inherently.

  3. Callers of Bazel's BatchStat.batchStat() may generate many system
    calls successively. This causes a large number of context switches
    between Bazel and the FUSE daemon. Using gRPC seems to be cheaper.

By requiring that the output path returned by the gRPC service is
writable, no-remote actions can still run as before, both with
sandboxing enabled and disabled. The only difference is that it will use
space on the gRPC service side, as opposed to the user's home directory
(though the gRPC service may continue to store data in the user's home
directory.

I have a server implementation is written in Go on top of Buildbarn's
storage and file system layer. My plan is to release the code for this
service as soon as I've got a 'thumbs up' on the overall approach.

@EdSchouten
Copy link
Contributor Author

EdSchouten commented Jan 13, 2021

Initially my plan was to try to get the protocol that Bazel will speak with the FUSE daemon be standardized as part of the remote-apis project. Mailing list discussion:

https://groups.google.com/g/remote-execution-apis/c/qOSWWwBLPzo

After discussions that took place during the monthly meeting, and after taking another look at some of the (remotely) similar protocols in this space, such as Buildgrid's local_casd protocol, I think it would make more sense to keep a protocol like this in the Bazel tree, so that the protocol remains minimal.

I have not yet invested time in adding unit/integration testing coverage for this code. This is something that I'd rather do after first having some discussion here. For example, what do you think of the overall approach? What about the protocol? I'd love to get your feedback!

Cc @janakdr @bergsieker @ulfjack

@EdSchouten EdSchouten force-pushed the eschouten/20210115-remote-output-service-integration branch from a7feeb1 to b1992ca Compare January 13, 2021 15:34
@janakdr
Copy link
Contributor

janakdr commented Jan 13, 2021

cc @susinmotion @alexjski

@EdSchouten EdSchouten force-pushed the eschouten/20210115-remote-output-service-integration branch from b1992ca to 214a642 Compare January 13, 2021 17:22
@rupertks
Copy link
Contributor

cc @tetromino

Copy link
Contributor

@janakdr janakdr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks cool!

@meisterT
Copy link
Member

cc @coeuvre

@jin jin added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Jan 15, 2021
@jin
Copy link
Member

jin commented Jan 15, 2021

@lberki if needed, please reassign to someone that could shepherd this PR into Google?

Copy link
Contributor

@janakdr janakdr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jin we may need to get a few more eyes to give general approval of the idea. When it comes time for a review, @alexjski should be one of the reviewers.

src/main/protobuf/remote_output_service.proto Show resolved Hide resolved
EdSchouten added a commit to buildbarn/bb-storage that referenced this pull request Jan 25, 2021
In the nearby future, I want to release a client-side FUSE daemon that
people can use to speed up builds that use remote execution. More
details can be found here:

bazelbuild/bazel#12823

This daemon is going to make extensive use of the path resolution
library that we added some time. This commit contains all of the
components that the FUSE daemon is going to need in addition to what's
already there:

- AbsoluteScopeWalker: To enforce that paths are absolute.
- Builder: To construct normalized pathname strings.
- LoopDetectingScopeWalker: To prevent infinite traversal of symlinks.
- VirtualRootScopeWalkerFactory: To simulate resolution of absolute
  paths that start with a prefix that needs to be trimmed.
- Void{Component,Scope}Walker: To perform path resolution without making
  any assumptions about the underlying file system.
EdSchouten added a commit to buildbarn/bb-remote-execution that referenced this pull request Jan 31, 2021
This protocol is also part of this PR against Bazel:

bazelbuild/bazel#12823

Let's add it to our tree, so that we can already start using it, even
though it's not part of Bazel properly. This is the protocol that Bazel
will use to communicate with the client-side FUSE daemon.
EdSchouten added a commit to buildbarn/bb-clientd that referenced this pull request Feb 3, 2021
 This change extends the pathname structure offered by bb_clientd.
 Instead of letting the top-level directory provide immediate access to
 the CAS, we move this feature to a subdirectory named "cas". Two new
 top-level directories are added:

 - "outputs", which provides an implementation of the Bazel Remote Output
   Service. Details: bazelbuild/bazel#12823

 - "scratch", which provides a simple scratch space to test the behaviour
   of the FUSE file system without requiring Bazel to create an output
   directory for you.
@EdSchouten EdSchouten force-pushed the eschouten/20210115-remote-output-service-integration branch 2 times, most recently from dc9e493 to 15d2bb3 Compare February 3, 2021 14:51
@EdSchouten
Copy link
Contributor Author

I've just rebased this change on top of latest master, and made CI happy once again. Earlier today I released my implementation of the FUSE daemon on GitHub as well:

https://github.com/buildbarn/bb-clientd

🎉 🎉 🎉

Be sure to let me know what I can do from my end to get this change polished up, upstreamed, etc. etc. etc.

@janakdr janakdr requested review from alexjski and janakdr and removed request for janakdr February 3, 2021 15:37
@janakdr
Copy link
Contributor

janakdr commented Feb 3, 2021

@alexjski could you take the lead on this? Thanks!

@lberki
Copy link
Contributor

lberki commented Feb 3, 2021

/cc @philwo

@sohaibiftikhar
Copy link

This is amazing 🥳

@lberki
Copy link
Contributor

lberki commented Feb 8, 2021

As a general comment, a design doc would be useful so that everyone can understand more easily what's going on. We even have a separate repository for those:

https://github.com/bazelbuild/proposals

Apart from that procedural note, my main worry is that this adds Yet Another Way for Bazel to handle the metadata pertaining to action outputs, of which we already have too many (objfsd, the internal equivalent of this FUSE service, ActionFS, the output tree under bazel-out/, the local action cache, the disk cache, etc.

@philwo is probably the most competent person to comment on all these things. I personally don't feel confident saying anything about this change other than "pls no I don't need more abstractions", which is not exactly helpful. At the very least, it should be written down somewhere how this proposal interacts with these other metadata handling systems.

EdSchouten added a commit to EdSchouten/proposals that referenced this pull request Feb 10, 2021
As requested by @lberki in bazelbuild/bazel#12823, I'm hereby sending
out a proposal for adding support for a client-side FUSE daemon to
Bazel. Suggestions are welcome!
@stagnation
Copy link
Contributor

Back from vacation now, I posted the proto in this PR: #18775 . Happy to address any comments.

EdSchouten added a commit to EdSchouten/bazel that referenced this pull request Aug 31, 2023
When doing a regular build without passing in --remote_download_*,
ExecutionTool will automatically clean up any traces of an OutputService
that replaces bazel-out/ with a symlink pointing to a remote file
system.

The --remote_download_* option is implemented by installing an
OutputService of its own, meaning that the regular logic in
ExecutionTool is skipped. The rationale being that an OutputService
essentially takes ownership of bazel-out/.

This change extends RemoteOutputService to be a more complete
implementation of OutputService that ensures that bazel-out/ is set up
properly. This improves interoperability with changes such as the gRPC
based Remote Output Service protocol (see bazelbuild#12823).
EdSchouten added a commit to EdSchouten/bazel that referenced this pull request Aug 31, 2023
When doing a regular build without passing in --remote_download_*,
ExecutionTool will automatically clean up any traces of an OutputService
that replaces bazel-out/ with a symlink pointing to a remote file
system.

The --remote_download_* option is implemented by installing an
OutputService of its own, meaning that the regular logic in
ExecutionTool is skipped. The rationale being that an OutputService
essentially takes ownership of bazel-out/.

This change extends RemoteOutputService to be a more complete
implementation of OutputService that ensures that bazel-out/ is set up
properly. This improves interoperability with changes such as the gRPC
based Remote Output Service protocol (see bazelbuild#12823).
EdSchouten added a commit to EdSchouten/bazel that referenced this pull request Aug 31, 2023
In case DirectoryContext.createOrSetWritable() is called on a directory
inside of a Tree, it already does a nice forward sweep along the path to
create missing intermediate directories. When creating a bare output
file in a nested directory, it does not. It implicitly assumes that the
output directory was already created at a previous point in time.

This doesn't cause issues whenever we do a fully clean build, as some
earlier phase (the analysis phase?) creates the missing parent
directories for us. However, if we remove the nested directory structure
from bazel-bin/ manually and rerun the test, this causes failures along
the lines:

    java.io.FileNotFoundException: /REDACTED/execroot/main/bazel-out/darwin_x86_64-fastbuild/bin/some/very/very/very/deeply/nested (No such file or directory)

Solve this issue by properly catching FileNotFoundExceptions that are
generated when we call Path.isWritable() on the parent directory of the
output file. In that case we resort to calling
Path.createDirectoryAndParents().

I think this is likely the most efficient strategy, as it means we don't
need to do a full forward scan of the path in the common case where the
parent directory does exist.

This issue can be noticed in practice when switching from
--remote_output_service (PR bazelbuild#12823) to --remote_download_minimal.
EdSchouten added a commit to EdSchouten/bazel that referenced this pull request Aug 31, 2023
When doing a regular build without passing in --remote_download_*,
ExecutionTool will automatically clean up any traces of an OutputService
that replaces bazel-out/ with a symlink pointing to a remote file
system.

The --remote_download_* option is implemented by installing an
OutputService of its own, meaning that the regular logic in
ExecutionTool is skipped. The rationale being that an OutputService
essentially takes ownership of bazel-out/.

This change extends RemoteOutputService to be a more complete
implementation of OutputService that ensures that bazel-out/ is set up
properly. This improves interoperability with changes such as the gRPC
based Remote Output Service protocol (see bazelbuild#12823).
EdSchouten added a commit to EdSchouten/bazel that referenced this pull request Sep 1, 2023
When doing a regular build without passing in --remote_download_*,
ExecutionTool will automatically clean up any traces of an OutputService
that replaces bazel-out/ with a symlink pointing to a remote file
system.

The --remote_download_* option is implemented by installing an
OutputService of its own, meaning that the regular logic in
ExecutionTool is skipped. The rationale being that an OutputService
essentially takes ownership of bazel-out/.

This change extends RemoteOutputService to be a more complete
implementation of OutputService that ensures that bazel-out/ is set up
properly. This improves interoperability with changes such as the gRPC
based Remote Output Service protocol (see bazelbuild#12823).

Change-Id: I5bedea2895785f3b8836fe8c5f8ae97ca2689233
EdSchouten added a commit to EdSchouten/bazel that referenced this pull request Sep 1, 2023
When doing a regular build without passing in --remote_download_*,
ExecutionTool will automatically clean up any traces of an OutputService
that replaces bazel-out/ with a symlink pointing to a remote file
system.

The --remote_download_* option is implemented by installing an
OutputService of its own, meaning that the regular logic in
ExecutionTool is skipped. The rationale being that an OutputService
essentially takes ownership of bazel-out/.

This change extends RemoteOutputService to be a more complete
implementation of OutputService that ensures that bazel-out/ is set up
properly. This improves interoperability with changes such as the gRPC
based Remote Output Service protocol (see bazelbuild#12823).
@EdSchouten EdSchouten force-pushed the eschouten/20210115-remote-output-service-integration branch from 3925745 to 5e3f6ae Compare September 1, 2023 12:14
Builds that yield large output files can generate lots of network
bandwidth when remote execution is used. To combat this, we have flags
such as --remote_download_minimal to disable downloading of output
files. Unfortunately, this makes it hard to perform ad hoc exploration
of build outputs.

In an attempt to solve this, this change adds an option to Bazel named
--remote_output_service. When enabled, Bazel effectively gives up the
responsibility of managing a bazel-out/ directory. Instead, it calls
into a gRPC service to request a directory and creates a symlink that
points to it.

Smart implementations of this gRPC service may use things like FUSE to
let this replacement bazel-out/ directory be lazy-loading, thereby
reducing network bandwidth significantly. In order to create
lazy-loading files and directories, Bazel can call into a BatchCreate()
RPC that takes a list of Output{File,Directory,Symlink} messages,
similar to REv2's ActionResult. This call is also used to create
runfiles directories by providing fictive instances of OutputSymlink.

To prevent Bazel from reading the contents of files stored in the FUSE
file system (which would cause network I/O), the protocol offers a
BatchStat() call that can return information such as the REv2 digest.
Though this is redundant with --unix_digest_hash_attribute_name, there
are a couple of reasons why I added this feature:

1. For non-Linux operating systems, it may make more sense to use NFSv4
   instead of FUSE (i.e., running a virtual NFS daemon on localhost).
   Even though RFC 8276 adds support for extended attributes to NFSv4,
   not all operating systems implement it.

2. It addresses the security/hermeticity concerns that people raised
   when this feature was added. There is no way to add extended
   attributes to files that can't be tampered with (as a non-root user),
   while using gRPC solves that problem inherently.

3. Callers of Bazel's BatchStat.batchStat() may generate many system
   calls successively. This causes a large number of context switches
   between Bazel and the FUSE daemon. Using gRPC seems to be cheaper.

By requiring that the output path returned by the gRPC service is
writable, no-remote actions can still run as before, both with
sandboxing enabled and disabled. The only difference is that it will use
space on the gRPC service side, as opposed to the user's home directory
(though the gRPC service may continue to store data in the user's home
directory.

I have a server implementation is written in Go on top of Buildbarn's
storage and file system layer. My plan is to release the code for this
service as soon as I've got a 'thumbs up' on the overall approach.
@EdSchouten EdSchouten force-pushed the eschouten/20210115-remote-output-service-integration branch from 5e3f6ae to b9bfcf0 Compare September 1, 2023 12:23
copybara-service bot pushed a commit that referenced this pull request Sep 4, 2023
When doing a regular build without passing in --remote_download_*, ExecutionTool will automatically clean up any traces of an OutputService that replaces bazel-out/ with a symlink pointing to a remote file system.

The --remote_download_* option is implemented by installing an OutputService of its own, meaning that the regular logic in ExecutionTool is skipped. The rationale being that an OutputService essentially takes ownership of bazel-out/.

This change extends RemoteOutputService to be a more complete implementation of OutputService that ensures that bazel-out/ is set up properly. This improves interoperability with changes such as the gRPC based Remote Output Service protocol (see #12823).

Closes #19377.

PiperOrigin-RevId: 562482812
Change-Id: I6352838c24d7fe8b3192ebe064647a140fad5c8c
@coeuvre
Copy link
Member

coeuvre commented Sep 28, 2023

Quick update: @tjgq and I had a in-depth review for the protocol and we want to make some changes. I am working on a proof of concept naive server implementation (which will also be used in the integration tests) to make sure our changes make sense before we share the feedbacks.

stagnation added a commit to stagnation/bazel that referenced this pull request Oct 4, 2023
Introduce a .proto file for the proposed Remote Output Service. This
commit is part of Ed Schouten's ideas and demonstration of feasibility
described in the linked pull request below, this commit is just
clerical. The Remote Output Service can speed up build with large output
files by avoiding the downloads. Combined with an on-demand filesystem
solution clients can download just the files they need. The big picture
is described and tracked in
bazelbuild#12823 .

This is the first step in a rough process:

    1 Review and merge the remote_output_service.proto definition first.
    2 Implement the Bazel side from scratch, so we only have one
      implementation of OutputService.
    3 The community will implement and maintain the server part (local
      daemon) outside of the Bazel source tree.
    4 Eventually, remote_output_service.proto will move out of the Bazel
      source tree and be maintained by the community, similar to the
      REAPI spec today.
stagnation added a commit to stagnation/bazel that referenced this pull request Oct 4, 2023
Introduce a .proto file for the proposed Remote Output Service. This
commit is part of Ed Schouten's ideas and demonstration of feasibility
described in the linked pull request below, this commit is just
clerical. The Remote Output Service can speed up build with large output
files by avoiding the downloads. Combined with an on-demand filesystem
solution clients can download just the files they need. The big picture
is described and tracked in
bazelbuild#12823 .

This is the first step in a rough process:

    1 Review and merge the remote_output_service.proto definition first.
    2 Implement the Bazel side from scratch, so we only have one
      implementation of OutputService.
    3 The community will implement and maintain the server part (local
      daemon) outside of the Bazel source tree.
    4 Eventually, remote_output_service.proto will move out of the Bazel
      source tree and be maintained by the community, similar to the
      REAPI spec today.
upmorpheus pushed a commit to upmorpheus/Golang-dev that referenced this pull request Oct 14, 2023
This protocol is also part of this PR against Bazel:

bazelbuild/bazel#12823

Let's add it to our tree, so that we can already start using it, even
though it's not part of Bazel properly. This is the protocol that Bazel
will use to communicate with the client-side FUSE daemon.
@moroten
Copy link
Contributor

moroten commented Oct 23, 2023

Bazel builds with this PR are available at https://github.com/meroton/bazel/releases

@coeuvre
Copy link
Member

coeuvre commented Dec 15, 2023

Prototype is done. Working on the design doc and will start checking in the code soon.

@EdSchouten
Copy link
Contributor Author

At BazelCon you folks mentioned you wanted to make some protocol changes. Any chance we can already discuss these ahead of things getting merged?

@coeuvre
Copy link
Member

coeuvre commented Jan 2, 2024

Yes, I will include the changes in the design doc and have it reviewed by the community before merging the code.

@coeuvre
Copy link
Member

coeuvre commented Jan 18, 2024

The design doc is out! Please check #20933.

@tetromino tetromino removed their request for review March 6, 2024 14:54
@EdSchouten
Copy link
Contributor Author

EdSchouten commented Mar 6, 2024

Closing this, as #21140 added the new protocol for which @coeuvre is going to add support to Bazel. The latest version of bb_clientd implements the new protocol as well.

@EdSchouten EdSchouten closed this Mar 6, 2024
@github-actions github-actions bot removed the awaiting-review PR is awaiting review from an assigned reviewer label Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes team-Remote-Exec Issues and PRs for the Execution (Remote) team
Projects
None yet
Development

Successfully merging this pull request may close these issues.