[RFC 0062] Content-addressed paths #62

Merged
merged 34 commits into from
Jan 12, 2022
6d001f3
CAP RFC: First draft
thufschmitt Sep 19, 2019
435fc42
typo
thufschmitt Dec 11, 2019
7b26144
Apply @grahamc's suggestions
regnat Dec 11, 2019
81099b2
nix code -> Nix expression
thufschmitt Dec 11, 2019
4277386
Break-up the big introduction paragraph
thufschmitt Dec 11, 2019
7af7d2c
Rename to match the PR number
thufschmitt Dec 12, 2019
5fec861
Rename the drv attribute to __contentAddressed
thufschmitt Dec 12, 2019
9edc11f
Mention the GC issue
thufschmitt Jan 8, 2020
5717351
Remove the ambiguity on what an `output` is
thufschmitt Jan 8, 2020
1a844cc
Replace aliases paths by a pathOf mapping
thufschmitt Jan 15, 2020
26ae77e
Move the example after the design description
thufschmitt Jan 15, 2020
bbdca7e
Rephrase the design
thufschmitt Jan 15, 2020
63f3eca
Add shepherd team
thufschmitt Jan 16, 2020
a6d2f38
Rewrite the RFC to account for the RFC meeting comments
thufschmitt Feb 17, 2020
140e093
Add a section about leaking output paths
thufschmitt Feb 17, 2020
288dcb4
Merge remote-tracking branch 'upstream/master' into cas-rfc
Ericson2314 Mar 14, 2020
60e7da3
Merge pull request #5 from Ericson2314/cas-rfc-new-template
regnat Mar 18, 2020
1115a0d
Refine the design summary
thufschmitt Mar 18, 2020
13938de
Rename dependency-addressed into input-addressed
thufschmitt Mar 18, 2020
3a25f7f
minor fixup after comments
thufschmitt Mar 25, 2020
3a18867
Apply suggestions from code review
regnat Jun 19, 2020
fa16e86
Update rfcs/0062-content-addressed-paths.md
Mic92 Oct 22, 2020
94b65bd
Update the terminology to match the in the implementation
thufschmitt Apr 14, 2021
7ed4481
Reword the detailed design presentation
thufschmitt Apr 14, 2021
fb4c61d
Quote some strings in the yaml frontmatter
thufschmitt Apr 14, 2021
841fe3f
Add a design paragraph about the remote caching
thufschmitt Apr 14, 2021
27bd048
Lift the determinism requirement
thufschmitt Apr 14, 2021
1e8fab7
Typo
edolstra May 31, 2021
9772625
Apply suggestions from code review
edolstra May 31, 2021
02ae2b5
Rewrite the RFC
thufschmitt Jun 2, 2021
2d74fed
Make the python samples a bit more pythonic
regnat Jun 2, 2021
168a149
Explicit that unresolved dependencies are eval-time
thufschmitt Jun 2, 2021
427abed
Prettify
thufschmitt Jun 2, 2021
f275669
Make the end-goal an experiment
regnat Dec 10, 2021
281 changes: 281 additions & 0 deletions rfcs/0060-content-addressed-paths.md
@@ -0,0 +1,281 @@
---
feature: Simple content-addressed store paths
start-date: 2019-08-14
author: Théophane Hufschmitt
co-authors: (find a buddy later to help out with the RFC)
shepherd-team: (names, to be nominated and accepted by RFC steering committee)
shepherd-leader: (name to be appointed by RFC steering committee)
related-issues: (will contain links to implementation PRs)
---

# Summary

[summary]: #summary

Add basic support for content-addressed store paths to Nix.

We plan here to make it possible to mark certain store paths as
content-addressed (ca), while keeping the others dependency-addressed as they are
now (modulo some mandatory drv rewriting before the build, see below).

By making this opt-in, we can impose arbitrary limitations on the paths that
are allowed to be ca, to avoid some tricky issues that can arise with
content-addressability.
In particular, we restrict ourselves to paths without any non-textual
self-reference (_e.g._ a self-reference hidden inside a zip file) and known to
be deterministic (for caching reasons, see [caching]).
That way we don't have to worry about the fact that hash-rewriting is only an
approximation, or about the semantics of distributing non-deterministic
paths, **but** we also leave the option to lift these restrictions later.

This RFC already has a (somewhat working) POC at
<https://github.com/regnat/nix/tree/cas>.

# Motivation

[motivation]: #motivation

Having a content-addressed store with Nix (aka the "Intensional store") is a
long-time dream of the community; a design for it already took up a whole
chapter of [Eelco's PhD thesis][nixphd].

This was never done because it represents quite a big change in Nix's model,
with some implications that are not fully solved (regarding the trust model in
particular).
Even without going all the way down to a fully intensional model (yet), we can
make certain paths content-addressed, which can give some important benefits of
the intensional store at a much lower price. In particular, setting some
critical derivations as content-addressed can lead to some substantial build
cutoffs.

# Detailed design

[design]: #detailed-design

In all that follows, we pretend that each derivation has only one output.
This doesn't change the reasoning but makes things easier to state.

The gist of the design is that

- Some derivations can be marked as content-addressed (ca), in which case their
  output will be moved to a path `ca` determined only by its content after the
  build
- Each (non content-addressed) derivation will have two outputs: A `static` one

Member

This is confusing because above it says that each derivation has one output. It's more accurate to say that there are two derivations (that differ in their output paths).

Member

BTW the names static and dynamic are ambiguous because they're already used for static/dynamic linking (in particular builds might have a static output for .a files...).

Member Author

I'm not sure about the names because it's indeed a very overloaded term, but that also matches the common usage when it comes to language semantics ("static" knowledge that can be known by just looking at the source code vs. dynamic stuff that is only available at runtime)

Member

evaluation-time vs. build-time seems to be less ambiguous to me

Member

I don't believe in 2 names because there isn't just one dynamic hash. As more CA deps are built, we can incrementally "improve" the drv. It's better to speak of normal forms, or an arbitrary expression vs value description.

Member

We can, but why would you want to?

Purely-evaluation-time version can be computed without any builds and is a natural database key. Fully-updated version is used to determine the actual output path for the build. Why would we ever compute a partial substitution?

Member

  1. When you do a build, you want your deps in normal form, but you yourself unchanged: "almost normal form".

  2. I just thought of another: when you ask the remote builder, you ask with the partial substitution which has all your existing substitutions. Then you can be sure the server won't respond with a conflict! This is really good because even if the server doesn't agree with you that your maximally unevaluated derivation has that content hash, maybe another derivation has that content hash?

Member

  1. I think in this case this actually is the normal form for build. This is the final form for non-CA, and a necessary build form for CA, their CA version is a different story.

  2. I think the most useful case of what you describe is that you have a different original derivation but some CA dependencies happen to have the same content. It would be a nice feature for a binary cache to provide, but we either get nontrivial query logic, or combinatorial explosion, and I think neither should be mandated before we have the easy case working.

There can also be something about nondeterminism and a CA derivation output switching between a small number of possible content sets, but I think we should encourage as much caution as possible for the first version, and if something slips through it should be highly visible (and get reported)

computed at evaluation time and a `dynamic` one computed from the dynamic
outputs of its dependencies. These outputs may be identical if the derivation
doesn't (transitively) depend on any ca derivation

Member

Can we fold the existing fixed output logic into this? e.g. even today two different derivations that produce the same output can be substituted within some downstream derivation without changing its output path.

Member Author

I'm not sure what you mean by "fold" here. If you mean "make fixed-output derivations a special-case of content-addressed ones" then yes, we could in theory (just adding a check at the end that the content-hash is the expected one), but I'm not sure that this is desirable in practice (since the mechanism for CA derivations is inherently more complex than the one for fixed-output ones). Otoh they can (and do in the prototype) share a bit of implementation − in particular fixed-output and CA derivations are stored the same way in the db.

Member

The thing about the hashDerivationModulo that @edolstra mentions is content addressable derivations already evaluate to another derivation, just like CA ones. But we don't see that derivation today as it is just used to calculate downstream store paths. I think we should expose that derivation. Then the complexity is about the same, and things are more uniform.

Member @edolstra, Dec 11, 2019

So there are 3 types of derivations:

  1. Non-CA derivations that only depend on non-CA derivations.
  2. Non-CA derivations that depend on at least one CA derivation.
  3. CA derivations.

I'm not sure if 2) is useful to support. It already behaves differently from 1) in that the output paths cannot be known in advance. So then you may as well make the outputs content-addressable. So maybe it's easier to require that only CA derivations may depend on CA derivations.

Member

I don't like this because there are analogous "context" vs "redex" situations for #40. Surely normal derivations would depend on ret-cont ones, so we should have normal derivations that depend on CA ones. Also matches how we have normal derivations that depend on fixed-output ones.

Member

I think in this model CA derivations should be treated the same as we would impure derivations if we had them; we do not know whether they are reproducible and thus we should propagate that uncertainty. In this case that would be done by marking all derivations downstream as CA. With fixed-output ones it works because of the hash. Thus, CA derivations that check against a passed-in hash would result in non-CA downstream, but without hash in CA downstream.

Member

@FRidh that doesn't work because downstream ones can have cycles. The similarity is that "naive" downstream one doesn't have a corresponding build, only the normalized drvs map to builds.

Member Author

I don't think we should drop support for 2) as

  1. I think that could severely impair the usefulness of this
  2. Supporting that doesn't really add any complexity

To expand a bit:

  1. The original use-case for the prototype was a derivation (call it d0) deep into a derivation tree that had its inputs frequently changing, but its output rarely, and that was a dependency of a lot of other (expensive) things. We knew that d0 was totally deterministic, so we could easily make it CA (which saved a lot of building), but we couldn't make many assumptions about its referrers, so we couldn't make them CA.

    I guess that could be transposed to nixpkgs by replacing d0 by e.g. glibc or anything in stdenv (assuming that it's deterministic). Having CA derivations "poison" their referrers would mean that we would have to mark the whole of nixpkgs as CA which will probably not be possible.

  2. The way I see things, we only have 2 types of derivations

    1. CA derivations (that depend on zero or more CA derivations)
    2. Non-CA derivations (that depend on zero or more CA derivations)

    It just happens that the semantics of non-CA derivations that depend on zero CA derivations match the current semantics, but there's absolutely no extra complexity involved (compared to only supporting non-CA derivations that don't depend on CA derivations).

    Another way to look at it is that being CA and depending on CA are two orthogonal features, so if we support CA-depending-on-CA and non-CA-depending-on-non-CA, we also support "for free" non-CA-depending-on-CA (and CA-depending-on-non-CA)

Member @7c6f434c, Dec 12, 2019

> What's the objection to making something CA?

Compressed manpages, JARs or any other such thing present in the output. Note that marking something as CA might require a careful study of either the build system or the output.

Also, of course, hard to debug issues with determinism could be a problem…

> But if any dependency can be a CA derivation, and this propagates upwards to change the output paths of its referrers, then you don't have this property anymore.

Maybe specifically that should be cached, indeed.

Member Author

> 1. What's the objection to making something CA?

The main reason I can think of is non-determinism. I know there are caching models that can handle that just fine, but implementing these is out of scope for this RFC.

> 2. It does add complexity, and a runtime cost. Currently, for a non-CA derivation, you can look at the top-level .drv file to know its output paths. So for instance, to substitute a derivation, you don't have to look at the dependencies (and `SubstitutionGoal` doesn't). But if any dependency can be a CA derivation, and this propagates upwards to change the output paths of its referrers, then you don't have this property anymore. This would also mean that `Store::buildDerivation()` would no longer work because it doesn't have access to the input derivations.

This doesn't have to be true I think (although I'm not 100% sure wrt #62 (comment)):

The current behavior (in the prototype I mean) is that the remote cache knows about alias paths, so if /nix/store/abc-foo is an alias for /nix/store/def-foo, the cache will happily serve a narinfo for abc-foo containing aliasOf: def-foo, and SubstitutionGoal will just add def-foo to its dependent goals as if it were another dependency. The same thing should also hold for Store::buildDerivation() unless I'm missing something.

There's indeed a slight runtime overhead (because we have to fetch an extra narinfo for the alias), but it seems very minor (and marking everything as CA wouldn't make this any faster)

Member

> The main reason I can think of is non-determinism

Right, so that means that the output path of a non-CA derivation can differ between builds. But that's already the case since the CA dependencies of a non-CA derivation can change. So you can't use things like binary caches anyway.

Member @Ericson2314, Dec 12, 2019

I want to be able to make packages content-addressable one by one independently. If zlib doesn't have a self-reference, for example, then I want to make it so without having to also convert all its reverse dependencies. That would be a disaster.

The binary cache interaction doesn't seem too bad? It's basically "You asked for drv0, I map drv0 to drv1 under substitutions (a to a_hash, b to b_hash, ....z to z_hash). I have a build for drv1." The client can see if any of those content-hash assignments clash with any of its own, and take the drv1 build and the hash assignments if it is OK with them. As long as we keep track of all hash assignments, and stop doing whatever we are doing if we notice a conflict, things won't go wrong.

Member

> The main reason I can think of is non-determinism

> Right, so that means that the output path of a non-CA derivation can differ between builds.

I am not sure I follow: if all CA derivations are carefully checked to be deterministic, then the output paths of their reverse-dependencies won't change.

- just prior to being realized, each derivation gets rewritten by replacing
each of its dependencies by its `dynamic` or `ca` path

## Example

Since the design is non-trivial, we start with an example to give an
intuition of what's happening.

In this example, we have the following Nix code:

```nix
rec {
  contentAdressed = mkDerivation {
    name = "contentAdressed";
    contentAdressed = true;
    … # Some extra arguments
  };
  dependent = mkDerivation {
    name = "dependent";
    buildInputs = [ contentAdressed ];
    … # Some extra arguments
  };
  transitivelyDependent = mkDerivation {
    name = "transitivelyDependent";
    buildInputs = [ dependent ];
    … # Some extra arguments
  };
}
```

Suppose that we want to build `transitivelyDependent`.
What will happen is the following:

- We instantiate the Nix code, this gives us three drv files:
  `contentAdressed.drv`, `dependent.drv` and `transitivelyDependent.drv`
- We build `contentAdressed.drv`.
  - We first compute `dynamic(contentAdressed.drv)` to replace its
    inputs by their real output path. Since there are none, we
    have here `dynamic(contentAdressed.drv) == contentAdressed.drv`
  - We realise `dynamic(contentAdressed.drv)`. This gives us an output path
    `out(dynamic(contentAdressed.drv))`
  - We move `out(dynamic(contentAdressed.drv))` to its content-addressed path
    `ca(contentAdressed.drv)` which derives from
    `sha256(out(dynamic(contentAdressed.drv)))`
- We build `dependent.drv`
  - We first compute `dynamic(dependent.drv)` to replace its
    inputs by their real output path.
    In that case, we replace `contentAdressed.drv!out` by
    `ca(contentAdressed.drv)`
  - We realise `dynamic(dependent.drv)`. This gives us an output path
    `out(dynamic(dependent.drv))`
- We build `transitivelyDependent.drv`
  - We first compute `dynamic(transitivelyDependent.drv)` to replace its
    inputs by their real output path.
    In that case, that means replacing `dependent.drv!out` by
    `out(dynamic(dependent.drv))`
  - We realise `dynamic(transitivelyDependent.drv)`. This gives us an output path
    `out(dynamic(transitivelyDependent.drv))`

Now suppose that we slightly change the definition of `contentAdressed` in such
a way that `contentAdressed.drv` will be modified, but its output will be the
same. We try to rebuild the new `transitivelyDependent`. What happens is the
following:

- We instantiate the Nix code, this gives us three new drv files:
  `contentAdressed.drv`, `dependent.drv` and `transitivelyDependent.drv`
- We build `contentAdressed.drv`.
  - We first compute `dynamic(contentAdressed.drv)` to replace its
    inputs by their real output path. Since there are none, we
    have here `dynamic(contentAdressed.drv) == contentAdressed.drv`
  - We realise `dynamic(contentAdressed.drv)`. This gives us an output path
    `out(dynamic(contentAdressed.drv))`
  - We compute `ca(contentAdressed.drv)` and notice that the
    path already exists (since it's the same as the one we built previously),
    so we discard the result.
- We build `dependent.drv`
  - We first compute `dynamic(dependent.drv)` to replace its
    inputs by their real output path.
    In that case, we replace `contentAdressed.drv!out` by
    `ca(contentAdressed.drv)`
  - We notice that `dynamic(dependent.drv)` is the same as before (since
    `ca(contentAdressed.drv)` is the same as before), so we
    just return the already existing path
- We build `transitivelyDependent.drv`
  - We first compute `dynamic(transitivelyDependent.drv)` to replace its
    inputs by their real output path.
    In that case, that means replacing `dependent.drv!out` by
    `out(dynamic(dependent.drv))`
  - Here again, we notice that `dynamic(transitivelyDependent.drv)` is the same as before,
    so we don't build anything
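
The two runs above can be reproduced with a small simulation. This is only a sketch under stated assumptions, not the prototype's code: the `Store` class, the `recipe` string (standing in for the content of a `.drv` file) and the `build` callables are hypothetical, and store paths are shortened SHA-256 digests rather than real Nix hashes.

```python
import hashlib


def short_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:32]


class Store:
    def __init__(self):
        self.contents = {}  # real store path -> content
        self.path_of = {}   # out(dynamic(drv)) -> path where the output really lives
        self.built = []     # names of the derivations whose builder actually ran

    def realise(self, drv) -> str:
        # dynamic(drv): every input is replaced by its real output path.
        inputs = [self.realise(dep) for dep in drv["deps"]]
        dynamic = short_hash("|".join([drv["name"], drv["recipe"], *inputs]))
        out = f"/nix/store/{dynamic}-{drv['name']}"
        if out in self.path_of:
            return self.path_of[out]  # same dynamic derivation as before: no build
        content = drv["build"](inputs)
        self.built.append(drv["name"])
        if drv["ca"]:
            # ca(drv): the final path derives from the content only.
            final = f"/nix/store/{short_hash(content)}-{drv['name']}"
        else:
            final = out
        self.contents.setdefault(final, content)  # an existing ca path is kept as-is
        self.path_of[out] = final
        return final


def graph(recipe: str):
    """The three derivations of the example; only the ca one changes with `recipe`."""
    ca = dict(name="contentAdressed", ca=True, recipe=recipe, deps=[],
              build=lambda inputs: "same output")
    dep = dict(name="dependent", ca=False, recipe="r", deps=[ca],
               build=lambda inputs: "uses " + inputs[0])
    return dict(name="transitivelyDependent", ca=False, recipe="r", deps=[dep],
                build=lambda inputs: "uses " + inputs[0])


store = Store()
first = store.realise(graph("version 1"))
second = store.realise(graph("version 2"))  # contentAdressed.drv changed, its output did not
print(store.built)
```

In the second run only `contentAdressed` is rebuilt: its output hashes to the path that already exists, so `dependent` and `transitivelyDependent` resolve to the same dynamic derivations as before and are not built again.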

## nix-build process

Member @Ericson2314, Dec 11, 2019

Some of this stuff sounds normative? If so, it should be moved outside of "examples" back into "detailed design".


### Alias paths

To allow this, we add a new type of store path: alias paths.
These paths don't actually exist in the store, just in the database, and point to
another path (so they are morally symlinks, but inside the db rather than
on-disk).

Member

Maybe this shouldn't be a StorePath -> StorePath mapping but a DrvOutputId -> StorePath mapping. This eliminates the suggestion that the paths in the domain of this mapping are real paths. Nix already does something like this in hashDerivationModulo() to compute the "static" store path outputs of a derivation, by replacing the inputDrvs with IDs that don't change for fixed-output derivations.

Member

Ah thanks this begins to answer my #62 (comment)

Member @Ericson2314, Dec 11, 2019

I would say DrvOutputId -> DrvOutputId even. hashDerivationModulo and the thing that "upgrades" (possibly nested!) CA derivations to use their output hash should be extremely similar.

We can then have a separate normalized DrvOutputId -> StorePath mapping.

Member Author

That looks better indeed, although I'm not sure how you define DrvOutputId. My understanding is that it should essentially be the "hash" part of the path. Is that it or is there something more?

Member

That's basically it. It's a hash over the derivation graph, which is essentially what the hash part of non-CA output paths is.


### Building a ca derivation

ca derivations are derivations with the `contentAdressed` argument set to
`true`.

The process for building a content-addressed derivation is the following:

- We build it like a normal derivation to get an output path `$out`.
- We compute a cryptographic hash `$chash` of `$out`[^modulo-hashing]
- We move `$out` to `/nix/store/$chash-$name`
- We create an alias path from `$out` to `/nix/store/$chash-$name`
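
The four steps can be sketched in Python as follows. This is an illustration under assumptions, not the prototype's implementation: `content_hash` is a stand-in for whatever serialisation and hash Nix applies to `$out`, the `aliases` dict stands in for the alias table kept in the database, and a temporary directory plays the role of `/nix/store`.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def content_hash(out: Path) -> str:
    """Hash file names and contents under `out` in a stable order (assumed scheme)."""
    digest = hashlib.sha256()
    for file in sorted(p for p in out.rglob("*") if p.is_file()):
        digest.update(file.relative_to(out).as_posix().encode())
        digest.update(b"\0")
        digest.update(file.read_bytes())
    return digest.hexdigest()[:32]


def move_to_ca_path(out: Path, name: str, aliases: dict) -> Path:
    chash = content_hash(out)                 # compute $chash from $out
    target = out.parent / f"{chash}-{name}"   # <store>/$chash-$name
    if target.exists():
        shutil.rmtree(out)                    # same content already stored: discard
    else:
        out.rename(target)                    # move $out to its ca path
    aliases[str(out)] = str(target)           # alias from $out to the ca path
    return target


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    aliases = {}
    results = []
    # Two builds that end up in different $out directories with identical content.
    for build_dir in ("build1-hello", "build2-hello"):
        out = store / build_dir
        out.mkdir()
        (out / "greeting").write_text("hello\n")
        results.append(move_to_ca_path(out, "hello", aliases))
    remaining = sorted(p.name for p in store.iterdir())
```

Both builds end up at the same content-derived path, and the store keeps a single copy while each original `$out` is remembered as an alias.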

[^modulo-hashing]: We can possibly normalize all the self-references before
    computing the hash and rewrite them when moving the path to handle paths with
    self-references, but this isn't strictly required for a first iteration.

### Building a normal derivation

The process for building a normal derivation is the following:

- We look into the drv for all the input paths of the build
- For each input path, we check whether the path is an alias. If so, we replace it
  by its target
- We compute the `dynamic` output of the derivation from the patched version
- We then try to substitute and build the new derivation
- We create an alias path from the `static` output to the `dynamic` one
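
A minimal sketch of the alias resolution, assuming a derivation is reduced to a name, a builder and a tuple of input paths. `Derivation`, `output_path` and `prepare_build` are hypothetical names, `output_path` is only a stand-in for Nix's real output path computation, and the substitution and build step itself is left out.

```python
import hashlib
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Derivation:
    name: str
    builder: str
    input_paths: tuple  # store paths the build refers to


def output_path(drv: Derivation) -> str:
    """Stand-in for the input-addressed output path: a hash over the derivation."""
    payload = "\n".join([drv.name, drv.builder, *drv.input_paths])
    return f"/nix/store/{hashlib.sha256(payload.encode()).hexdigest()[:32]}-{drv.name}"


def resolve(drv: Derivation, aliases: dict) -> Derivation:
    """Replace every input path that is an alias by its target."""
    return replace(drv, input_paths=tuple(aliases.get(p, p) for p in drv.input_paths))


def prepare_build(drv: Derivation, aliases: dict) -> str:
    static = output_path(drv)
    dynamic = output_path(resolve(drv, aliases))
    # (Substituting or building the resolved derivation would happen here.)
    if static != dynamic:
        aliases[static] = dynamic  # alias from the static output to the dynamic one
    return dynamic


# Hypothetical paths: the first one is an alias created when building a ca derivation.
aliases = {"/nix/store/aaaa-contentAdressed": "/nix/store/cccc-contentAdressed"}
dependent = Derivation("dependent", "build.sh", ("/nix/store/aaaa-contentAdressed",))
plain = Derivation("plain", "build.sh", ("/nix/store/bbbb-other",))

dynamic_dependent = prepare_build(dependent, aliases)
dynamic_plain = prepare_build(plain, aliases)
```

A derivation with no alias among its inputs keeps identical `static` and `dynamic` outputs, which matches the current behaviour of Nix for derivations that don't depend on any ca derivation.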

## Wrapping it up

# Drawbacks

[drawbacks]: #drawbacks

- Obviously, this makes the Nix model more complicated than what it is now. In
  particular, the caching model needs some modifications (see [caching]);

- We specify that only a sub-category of derivations can safely be marked as
  `contentAdressed`, but there's no way to enforce these restrictions;

- This will probably be a breaking change for some tooling since the output path
  that's stored in the `.drv` files doesn't correspond to the actual on-disk
  path the output will be stored in (because it might just be an alias for the
  other path)

# Alternatives

[alternatives]: #alternatives

[RFC 0017][] is another proposal with the
same end-goal. The big difference between these two is in the scope they cover:
RFC 0017 is about fundamentally changing the base model of Nix, while this
proposal suggests making only the minimal amount of changes to the current
model to allow the content-addressed model to live in parallel (which would open
the way to a fully content-addressed store as in RFC 0017, but in a much more
incremental way).

Eventually this RFC should be subsumed by RFC 0017.

# Unresolved questions

[unresolved]: #unresolved-questions

Member

One thing that we should ensure is that CA derivations can be built by untrusted users without the input drvs being available. This is currently not the case for non-CA derivations: Store::buildDerivation(BasicDerivation) can only be used by trusted users, which is annoying for Hydra builders.

Member

A major open issue is garbage collection of the aliases table. It's not clear when entries should be deleted. The paths in the domain are fake so we can't use them for expiration. The paths in the codomain could be used (i.e. if a path is GC'ed, we delete the alias entries that map to it) but it's not clear whether that's desirable since you may want to bring back the path via substitution in the future.

Member Author

Unprivileged builds is more a future work than an unresolved issue (since it's just something that we can build on top of that). I very quickly mention it (Investigate the consequences in term of privileges requirements), but if you think it's worth expanding, I can add more on that topic

> it's not clear whether that's desirable since you may want to bring back the path via substitution in the future

What do you mean by that?

Member

I mentioned it because we need to keep it in mind to avoid a design of derivations that makes unprivileged builds impossible (i.e. avoid the mistake of hashDerivationModulo).

> What do you mean by that?

If you garbage-collect the output of a CA derivation, and then build the CA derivation again, you could avoid building it by fetching the output from a binary cache, if you remembered what the CA output paths were.

Member

> you could avoid building it by fetching the output from a binary cache, if you remembered what the CA output paths were.

Just to stress: this knowledge would allow fetching even when the binary cache doesn't know of the derivation I am trying to build but the output happens to be the same

Member

What is this mistake of hashDerivationModulo?


## Caching

[caching]: #caching

The big unresolved question is about the caching of content-addressed paths.
As [Eelco's PhD thesis][nixphd] states, caching ca paths raises a number of
questions when building that path is non-deterministic (because two different
stores can have two different outputs for the same path, which might lead to
some dependencies being duplicated in the closure of a dependency).
There exist some solutions to this problem (including one presented in Eelco's
thesis), but for the sake of simplicity, this RFC simply forbids marking a
derivation as ca if its build is not deterministic (although there's no real
way to check that, so it's up to the author of the derivation to ensure that it
is the case).
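
To make the duplication concrete, here is a small hypothetical scenario in Python. The names `lib`, `x`, `y` and `top`, the timestamps and the `ca_path` helper are all made up for illustration: `lib` is a ca derivation whose build embeds a timestamp, `x` was built locally against the local `lib`, and `y` was substituted from a cache whose builder produced a different `lib`.

```python
import hashlib


def ca_path(name: str, content: bytes) -> str:
    return f"/nix/store/{hashlib.sha256(content).hexdigest()[:32]}-{name}"


# The same derivation built on two machines gives two different contents,
# hence two different content-addressed paths.
local_lib = ca_path("lib", b"lib built at 10:00")
cached_lib = ca_path("lib", b"lib built at 10:05")

# References between store objects (hypothetical).
references = {"x": {local_lib}, "y": {cached_lib}, "top": {"x", "y"}}


def closure(root: str) -> set:
    seen, todo = set(), [root]
    while todo:
        node = todo.pop()
        if node not in seen:
            seen.add(node)
            todo.extend(references.get(node, ()))
    return seen


libs_in_closure = {p for p in closure("top") if p.endswith("-lib")}
print(len(libs_in_closure))
```

The closure of `top` then contains two copies of `lib`, which is the situation the determinism requirement is meant to rule out.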

## Client support

The bulk of the job here is done by the Nix daemon.

Depending on the details of the current Nix implementation, there might or
might not be a need for the client to also support it (which would require the
daemon and the client to be updated synchronously).

## Old Nix versions and caching

What happens (and should happen) if a Nix version that doesn't support the cas
model queries a cache with cas paths in it is not clear yet.

In particular, the content (and the existence) of the physical path of the
static derivation isn't decided. A backwards-compatible choice would be to make
this a symlink to the dynamic path, but this is also very leaky and potentially
unsound.

# Future work

[future]: #future-work

This RFC tries as much as possible to provide a solid foundation for building
ca paths with Nix, leaving as much room as possible for future extensions.
In particular:

- Add some path-rewriting to allow derivations with self-references to be built
as ca
- Consolidate the caching model to allow non-deterministic derivations to be
built as ca
- (hopefully, one day) make the CA model the default one in Nix
- Investigate the consequences in terms of privilege requirements
- Build a trust model on top of the content-addressed model to share store paths

[rfc 0017]: https://github.com/NixOS/rfcs/pull/17
[nixphd]: https://nixos.org/~eelco/pubs/phd-thesis.pdf