[RFC 0062] Content-addressed paths #62

Merged on Jan 12, 2022 (34 commits)
Changes from 20 commits

Commits
6d001f3
CAP RFC: First draft
thufschmitt Sep 19, 2019
435fc42
typo
thufschmitt Dec 11, 2019
7b26144
Apply @grahamc's suggestions
regnat Dec 11, 2019
81099b2
nix code -> Nix expression
thufschmitt Dec 11, 2019
4277386
Break-up the big introduction paragraph
thufschmitt Dec 11, 2019
7af7d2c
Rename to match the PR number
thufschmitt Dec 12, 2019
5fec861
Rename the drv attribute to __contentAddressed
thufschmitt Dec 12, 2019
9edc11f
Mention the GC issue
thufschmitt Jan 8, 2020
5717351
Remove the ambiguity on what an `output` is
thufschmitt Jan 8, 2020
1a844cc
Replace aliases paths by a pathOf mapping
thufschmitt Jan 15, 2020
26ae77e
Move the example after the design description
thufschmitt Jan 15, 2020
bbdca7e
Rephrase the design
thufschmitt Jan 15, 2020
63f3eca
Add shepherd team
thufschmitt Jan 16, 2020
a6d2f38
Rewrite the RFC to account for the RFC meeting comments
thufschmitt Feb 17, 2020
140e093
Add a section about leaking output paths
thufschmitt Feb 17, 2020
288dcb4
Merge remote-tracking branch 'upstream/master' into cas-rfc
Ericson2314 Mar 14, 2020
60e7da3
Merge pull request #5 from Ericson2314/cas-rfc-new-template
regnat Mar 18, 2020
1115a0d
Refine the design summary
thufschmitt Mar 18, 2020
13938de
Rename dependency-addressed into input-addressed
thufschmitt Mar 18, 2020
3a25f7f
minor fixup after comments
thufschmitt Mar 25, 2020
3a18867
Apply suggestions from code review
regnat Jun 19, 2020
fa16e86
Update rfcs/0062-content-addressed-paths.md
Mic92 Oct 22, 2020
94b65bd
Update the terminology to match the in the implementation
thufschmitt Apr 14, 2021
7ed4481
Reword the detailed design presentation
thufschmitt Apr 14, 2021
fb4c61d
Quote some strings in the yaml frontmatter
thufschmitt Apr 14, 2021
841fe3f
Add a design paragraph about the remote caching
thufschmitt Apr 14, 2021
27bd048
Lift the determinism requirement
thufschmitt Apr 14, 2021
1e8fab7
Typo
edolstra May 31, 2021
9772625
Apply suggestions from code review
edolstra May 31, 2021
02ae2b5
Rewrite the RFC
thufschmitt Jun 2, 2021
2d74fed
Make the python samples a bit more pythonic
regnat Jun 2, 2021
168a149
Explicit that unresolved dependencies are eval-time
thufschmitt Jun 2, 2021
427abed
Prettify
thufschmitt Jun 2, 2021
f275669
Make the end-goal an experiment
regnat Dec 10, 2021
310 changes: 310 additions & 0 deletions rfcs/0062-content-addressed-paths.md
@@ -0,0 +1,310 @@
---
feature: Simple content-addressed store paths
start-date: 2019-08-14
author: Théophane Hufschmitt
co-authors: (find a buddy later to help out with the RFC)
shepherd-team: @layus, @edolstra and @Ericson2314
shepherd-leader: (name to be appointed by RFC steering committee)
related-issues: (will contain links to implementation PRs)
---

# Summary

[summary]: #summary

Add basic support for content-addressed store paths to Nix.

This RFC makes it possible to mark certain store paths as
content-addressed (ca), while keeping the others input-addressed as they are
now (modulo some mandatory drv rewriting before the build, see below).

By making this opt-in, we can impose arbitrary limitations on the paths that
are allowed to be ca to avoid some tricky issues that can arise with
content-addressability.

In particular, we restrict ourselves to paths that are:

- without any non-textual self-reference (_e.g._ a self-reference hidden inside a zip file)
- known to be deterministic (for caching reasons, see [caching]).

That way we don't have to worry about the fact that hash-rewriting is only an
approximation, nor about the semantics of distributing non-deterministic
paths.

We also leave the option to lift these restrictions later.

This RFC already has a (somewhat working) POC at
<https://github.com/NixOS/nix/pull/3262>.

# Motivation

[motivation]: #motivation

Having a content-addressed store with Nix (aka the "Intensional store") is a
long-time dream of the community: a design for it already took up a whole
chapter in [Eelco's PhD thesis][nixphd].

This was never done because it represents quite a big change in Nix's model,
with some implications that are not totally solved (regarding the trust model in
particular).
Even without going all the way down to a fully intensional model, we can
make specific paths content-addressed, which can give some important benefits of
the intensional store at a much lower price. In particular, setting some
critical derivations as content-addressed can lead to some substantial build
cutoffs.

# Detailed design

[design]: #detailed-design

The gist of the design is that:

- Derivations can be marked as content-addressed (ca), in which case each
of their outputs will be moved to a content-addressed (`ca`) store path.
This extends the current notion of "fixed-output" derivations.
- We introduce the notion of "resolving" a derivation, which extends to
arbitrary `ca` derivations the current behavior of replacing fixed-output
derivations by their output hash.
- We refine the build process so that every derivation is first resolved (normalized)
before being realized.

## Nix-build process

For the sake of clarity, we will refer to the current model (where the
derivations are indexed by their inputs, also sometimes called "extensional") as
the `input-addressed` model.

### Output mappings

For each output `output` of a derivation `drv`, we define

- its output id **DrvOutputId(drv, output)** as the tuple `(hash(drv), output, truster)`, where `truster` is a reserved field for future use and currently always set to `"world"`.
Member

Suggested change
- its output id **DrvOutputId(drv, output)** as the tuple `(hash(drv), output, truster)`, where `truster` is a reserved field for future use and currently always set to `"world"`.
- its output id **DrvOutputId(drv, output)** as the tuple `(hash(drv), output, truster)`, where `truster` is a reserved field for future use (see trust model in <link to future work section>) and currently always set to `"world"`.

Member

However: is it necessary to be concerned with the trust model at all at this stage? I’d rather just leave this completely out of scope and change the semantics later when we introduce the trust model.

Member Author

The reason for that was to prevent another schema change. That being said we might omit this in the RFC as it's indeed not relevant for the current design

This id uniquely identifies the output.
We textually represent this as `hash(drv)!output[@truster]`.
- its concrete path **PathOf(outputId)** as the path at which the output will be stored on disk.
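
As a rough illustration only (not part of the design), the two notions above can be modelled in Python. The class layout, the `render` helper and the choice to omit `@truster` when it is `"world"` are assumptions made for this sketch; the RFC only fixes the tuple and the textual form `hash(drv)!output[@truster]`. The hash and store path values are placeholders.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DrvOutputId:
    drv_hash: str           # stands in for hash(drv)
    output: str             # output name, e.g. "out"
    truster: str = "world"  # reserved field, currently always "world"

    def render(self) -> str:
        # Textual form hash(drv)!output[@truster]; dropping the suffix for
        # the default truster is an assumption of this sketch.
        suffix = "" if self.truster == "world" else f"@{self.truster}"
        return f"{self.drv_hash}!{self.output}{suffix}"


# PathOf, modelled as an in-memory mapping instead of a database table.
path_of: Dict[DrvOutputId, str] = {}

out_id = DrvOutputId("abc123", "out")          # placeholder hash
path_of[out_id] = "/nix/store/xyz-hello"       # placeholder store path
```

Because the id is a plain value, two ids built from the same derivation hash and output name compare equal and find the same `PathOf` entry.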

> Unresolved: should we already include the `truster` field in `DrvOutputId`
> even if it's not used atm? What would be the cost of adding it later?
Member

Let’s leave it out for now.


In an input-addressed-only world, the concrete path for a derivation output was a pure function of this output's id that could be computed at eval-time. However, this won't be the case anymore once we allow content-addressed derivations, so we now need to store the results of the `PathOf` function in the Nix database as a new table:

```sql
create table if not exists PathOf (
    drv integer not null,
    output text not null,
    truster integer not null,
    path integer not null
);
```
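
To make the role of this table concrete, here is a small sketch using Python's `sqlite3` module on an in-memory database. The `lookup` helper and the integer values are hypothetical; in particular, this sketch assumes that the integer columns are ids pointing into other tables, which the RFC does not spell out.

```python
import sqlite3

SCHEMA = """
create table if not exists PathOf (
    drv integer not null,
    output text not null,
    truster integer not null,
    path integer not null
)
"""


def lookup(conn, drv, output, truster):
    """Return the recorded path id for an output id, or None if unknown."""
    row = conn.execute(
        "select path from PathOf where drv = ? and output = ? and truster = ?",
        (drv, output, truster),
    ).fetchone()
    return None if row is None else row[0]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# Record that output "out" of derivation 1 (truster 0) lives at path id 42.
conn.execute("insert into PathOf values (?, ?, ?, ?)", (1, "out", 0, 42))
```

A `None` result from `lookup` means the output has not been realised (or substituted) yet, so its concrete path is still unknown.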

### Building a non-ca derivation

#### Resolved derivations

We define a **resolved derivation** as a derivation whose only references are either:

- Placeholders for its own outputs (from the `placeholder` builtin)
- References to the outputs of other (non content-addressed) resolved derivations
- Existing store paths
Member

Suggested change
- Existing store paths
Existing store paths (including *built* content-addresed resolved derivations' output paths)


For a derivation `drv` whose input derivations have all been realised, we define its **associated resolved derivation** `resolved(drv)` as `drv` in which we replace every input derivation `inDrv` of `drv` by `pathOf(inDrv)` (and update the output hash accordingly).

> This doesn't have the property that for a derivation that doesn't depend on any CA derivation `resolved(drv) == drv`. I think that this is a rather big issue so we'll have to find a way to get this property back (but feel free to correct me if you think that it isn't a big deal)

`resolved` is (intentionally) not injective: If `drv` and `drv'` only differ because one depends on `dep` and the other on `dep'`, but `dep` and `dep'` are content-addressed and have the same output hash, then `resolved(drv)` and `resolved(drv')` will be equal.
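
The non-injectivity can be sketched in a few lines of Python. The `Drv` record, `drv_hash` and the string ids below are simplifications invented for this illustration; a real derivation also embeds references inside its environment, which this sketch ignores.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Drv:
    name: str
    script: str                                # stands in for builder, args, env
    input_drvs: FrozenSet[str] = frozenset()   # ids of input derivation outputs
    input_paths: FrozenSet[str] = frozenset()  # already existing store paths


def drv_hash(drv: Drv) -> str:
    text = repr((drv.name, drv.script,
                 sorted(drv.input_drvs), sorted(drv.input_paths)))
    return hashlib.sha256(text.encode()).hexdigest()[:16]


def resolved(drv: Drv, path_of: Dict[str, str]) -> Drv:
    """Replace every input derivation by its realised path."""
    # Raises KeyError if an input has not been realised yet.
    paths = {path_of[i] for i in drv.input_drvs}
    return Drv(drv.name, drv.script, frozenset(), drv.input_paths | paths)


# dep and dep' are content-addressed and produced the same output path.
path_of = {"dep!out": "/nix/store/aaaa-dep", "dep'!out": "/nix/store/aaaa-dep"}
a = Drv("app", "make", frozenset({"dep!out"}))
b = Drv("app", "make", frozenset({"dep'!out"}))
```

Here `a` and `b` are different derivations with different hashes, yet `resolved(a, path_of)` and `resolved(b, path_of)` are equal, which is exactly what allows the build of one to be reused for the other.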

#### Build process

When asked to build a derivation `drv`, we instead:

1. Compute `resolved(drv)`
2. Substitute and build `resolved(drv)` like a normal derivation.
Possibly this is a no-op because it may be that `resolved(drv)` has already been built.
3. Add a new mapping `pathOf(drv!${output}) == ${output}(resolved(drv))` for each output `output` of `drv`

### Building a ca derivation

A **ca derivation** is a derivation with the `__contentAddressed` argument set
to `true` and `outputHashAlgo` set to a valid hash name
recognized by Nix (see the description of `outputHashAlgo` at
<https://nixos.org/nix/manual/#sec-advanced-attributes> for the currently allowed
values).

The process for building a content-addressed derivation `drv` is the following:

- We build it like a normal derivation (see above).
For each output `$outputId` of the derivation, this gives us a (temporary) output path `$out`.
- We compute a cryptographic hash `$chash` of `$out`[^modulo-hashing]
- We move `$out` to `/nix/store/$chash-$name`
- We store the mapping `PathOf($outputId) == "/nix/store/$chash-$name"`
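
The hash-and-move steps above can be sketched as follows, using a temporary directory as a stand-in store. This is a simplification under stated assumptions: the output is a single file, the hash is a truncated hex SHA-256 of its bytes, and `move_to_ca_path` is a hypothetical helper. It does not reproduce how Nix actually serialises outputs or derives store path names.

```python
import hashlib
import os
import tempfile


def move_to_ca_path(store: str, tmp_out: str, name: str) -> str:
    """Hash the file at tmp_out and move it to <store>/<chash>-<name>."""
    with open(tmp_out, "rb") as f:
        chash = hashlib.sha256(f.read()).hexdigest()[:32]
    final = os.path.join(store, f"{chash}-{name}")
    if os.path.exists(final):
        os.remove(tmp_out)          # same content already stored: discard
    else:
        os.rename(tmp_out, final)
    return final


store = tempfile.mkdtemp()
path_of = {}
# Two different derivations that happen to produce identical content.
for out_id, content in [("drv1!out", b"hello\n"), ("drv2!out", b"hello\n")]:
    tmp = os.path.join(store, f"tmp-{out_id.replace('!', '-')}")
    with open(tmp, "wb") as f:
        f.write(content)
    path_of[out_id] = move_to_ca_path(store, tmp, "hello")
```

Both output ids end up mapped to the same final path, and the store contains a single copy of the content.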

[^modulo-hashing]:

We could normalize all the self-references before
computing the hash and rewrite them when moving the path, in order to handle paths with
self-references, but this isn't strictly required for a first iteration.

## Example

In this example, we have the following Nix expression:

```nix
rec {
  contentAddressed = mkDerivation {
    name = "contentAddressed";
    __contentAddressed = true;
    … # Some extra arguments
  };
  dependent = mkDerivation {
    name = "dependent";
    buildInputs = [ contentAddressed ];
    … # Some extra arguments
  };
  transitivelyDependent = mkDerivation {
    name = "transitivelyDependent";
    buildInputs = [ dependent ];
    … # Some extra arguments
  };
}
```

Suppose that we want to build `transitivelyDependent`.
What will happen is the following:

1. We instantiate the Nix expression, which gives us three drv files:
   `contentAddressed.drv`, `dependent.drv` and `transitivelyDependent.drv`
2. We build `contentAddressed.drv`.
   - We first compute `resolved(contentAddressed.drv)`.
   - We realise `resolved(contentAddressed.drv)`. This gives us an output path
     `out(resolved(contentAddressed.drv))`
   - We move `out(resolved(contentAddressed.drv))` to its content-addressed path
     `ca(contentAddressed.drv)` which derives from
     `sha256(out(resolved(contentAddressed.drv)))`
   - We register in the db that `pathOf(contentAddressed.drv!out) == ca(contentAddressed.drv)`
3. We build `dependent.drv`
   - We first compute `resolved(dependent.drv)`.
     This gives us a new derivation identical to `dependent.drv`, except that `contentAddressed.drv!out` is replaced by `pathOf(contentAddressed.drv!out) == ca(contentAddressed.drv)`
   - We realise `resolved(dependent.drv)`. This gives us an output path
     `out(resolved(dependent.drv))`
   - We register in the db that `pathOf(dependent.drv!out) == out(resolved(dependent.drv))`
4. We build `transitivelyDependent.drv`
   - We first compute `resolved(transitivelyDependent.drv)`.
     This gives us a new derivation identical to `transitivelyDependent.drv`, except that `dependent.drv!out` is replaced by `pathOf(dependent.drv!out) == out(resolved(dependent.drv))`
   - We realise `resolved(transitivelyDependent.drv)`. This gives us an output path `out(resolved(transitivelyDependent.drv))`
   - We register in the db that `pathOf(transitivelyDependent.drv!out) == out(resolved(transitivelyDependent.drv))`

Now suppose that we replace `contentAddressed` by `contentAddressed'`, which evaluates to a new derivation `contentAddressed'.drv` such that the output of `contentAddressed'.drv` is the same as the output of `contentAddressed.drv` (say we change a comment in a source file of `contentAddressed`).
We try to rebuild the new `transitivelyDependent`. What happens is the following:

1. We instantiate the Nix expression, which gives us three new drv files:
   `contentAddressed'.drv`, `dependent'.drv` and `transitivelyDependent'.drv`
2. We build `contentAddressed'.drv`.
   - We first compute `resolved(contentAddressed'.drv)`
   - We realise `resolved(contentAddressed'.drv)`. This gives us an output path `out(resolved(contentAddressed'.drv))`
   - We compute `ca(contentAddressed'.drv)` and notice that the path already exists (since it's the same as the one we built previously), so we discard the result.
   - We register in the db that `pathOf(contentAddressed'.drv!out) == ca(contentAddressed'.drv)` (which also equals `ca(contentAddressed.drv)`)
3. We build `dependent'.drv`
   - We first compute `resolved(dependent'.drv)`.
     This gives us a new derivation identical to `dependent'.drv`, except that `contentAddressed'.drv!out` is replaced by `pathOf(contentAddressed'.drv!out) == ca(contentAddressed'.drv)`
   - We notice that `resolved(dependent'.drv) == resolved(dependent.drv)` (since `ca(contentAddressed'.drv) == ca(contentAddressed.drv)`), so we just return the already existing path
4. We build `transitivelyDependent'.drv`
   - We first compute `resolved(transitivelyDependent'.drv)`
   - Here again, we notice that `resolved(transitivelyDependent'.drv)` is the same as `resolved(transitivelyDependent.drv)`, so we don't build anything
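
The whole scenario can be replayed with a toy Python model. Everything here is an assumption made for illustration: the `Drv` record, the `run_builder` function (whose "build" simply strips comment lines, so that a comment-only change produces the same output), the truncated hashes and the `build_log` list used to observe which builders really ran. Only the overall flow follows the steps described above.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Tuple


def h(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:8]


@dataclass(frozen=True)
class Drv:
    name: str
    src: str                       # source text; stands in for builder and env
    inputs: Tuple["Drv", ...] = ()
    ca: bool = False               # models __contentAddressed = true


path_of: Dict[Drv, str] = {}       # PathOf mapping (single "out" output)
realised: Dict[tuple, str] = {}    # resolved derivation -> output path
build_log = []                     # derivations whose builder really ran


def run_builder(drv: Drv, input_paths) -> str:
    # Toy builder: drops comment lines, then records the input paths.
    build_log.append(drv.name)
    code = "\n".join(
        line for line in drv.src.splitlines() if not line.startswith("#")
    )
    return code + "|" + ",".join(input_paths)


def build(drv: Drv) -> str:
    if drv in path_of:
        return path_of[drv]
    input_paths = tuple(build(i) for i in drv.inputs)
    key = (drv.name, drv.src, input_paths, drv.ca)   # plays resolved(drv)
    if key not in realised:
        content = run_builder(drv, input_paths)
        # ca: path derived from the content; otherwise from resolved(drv).
        digest = h(content) if drv.ca else h(repr(key))
        realised[key] = f"/nix/store/{digest}-{drv.name}"
    path_of[drv] = realised[key]
    return path_of[drv]


def graph(src: str) -> Drv:
    ca = Drv("contentAddressed", src, ca=True)
    dep = Drv("dependent", "dep", (ca,))
    return Drv("transitivelyDependent", "trans", (dep,))


first = build(graph("x = 1"))
second = build(graph("# a comment\nx = 1"))
```

After both calls, `build_log` lists the three derivations for the first build but only `contentAddressed` for the second one: its output hashes to the same path, so the resolved forms of the two dependents are unchanged and their builders are never run again.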

# Drawbacks

[drawbacks]: #drawbacks

- Obviously, this makes the Nix model more complicated than it is now. In
particular, the caching model needs some modifications (see [caching]);

- We specify that only a sub-category of derivations can safely be marked as
`contentAddressed`, but there's no way to enforce these restrictions;
Member

It's a bigger problem than it might look like, as it means that trivial updates can break the CA marking for reasons not worth mentioning in the upstream changelog.

Member Author

Definitely :)

Maybe that could be clearly stated, but the original scope of this work was to be able to mark very specific derivations that were clearly guaranteed to be deterministic, in which case the problem was less important

Member

Well, the question «why not just propagate CA» shows that writing more is a good idea.

I do think that stressing the limitations in a few key places is also a nice thing to do (people should be able to apply RFC as passed, not what was intended and not what was discussed, after all… we should not treat ourselves worse than we treat computers!)


- This will probably be a breaking change for some tooling since the output path
that's stored in the `.drv` files doesn't correspond to an actual on-disk
path.

# Alternatives

[alternatives]: #alternatives

[RFC 0017][] is another proposal with the
same end-goal. The big difference between these two is in the scope they cover:
RFC 0017 is about fundamentally changing the base model of Nix, while this
proposal suggests making only the minimal amount of changes to the current
model to allow the content-addressed model to live in parallel (which would open
the way to a fully content-addressed store as in RFC 0017, but in a much more
incremental way).

Eventually this RFC should be subsumed by RFC 0017.

# Unresolved questions

[unresolved]: #unresolved-questions
Member

Should we have functionality that allows to build a CA package twice with different apparent output paths, and optionally with different parallelism settings? The build of the package obviously fails if the CA unification doesn't lead to the same result.

Should we mandate that Hydra uses this functionality? Should it be on by default?

Member

Per https://github.com/NixOS/rfcs/pull/62/files#r357243841 I think we can deal with non-deterministic derivation just fine.

Member

Well, for binary cache transparency it is much better if you can build something locally, then regain connectivity and fetch stuff from a cache, then fetch stuff from a different cache, then build some more locally, etc.

Member

So in my mind that's equally risky with and without content addressable derivations. The only difference is one lets you know if something goes wrong, and one doesn't.

Member

No. Build nondeterminism doesn't introduce significant behaviour changes, so as long as the expectations are not broken (yeah, we install you into this output path and your dependencies into those paths, and that is not going to change), it will be mostly usable. There are a few CPU-dependent optimisations from time to time, they are annoying.

With CA things are actually moved around, so even though everything would still work when assembled together, the assembling part will be failing. It is Nix, not the code that is built by Nix, that would fail to do things because of nondeterminism.

Member, @Ericson2314, Dec 13, 2019

I think trying to keep going despite non-determinism incoherence is a misfeature. You can always evict your own CA mappings (can keep the builds themselves for easy "rollback") and align with cache.reflex-frp.org and keep going.

Member

I think if things are marked CA, of course it is a good idea to catch failures. But what you propose will not catch much, because a typical derivation is only built once (ever) by Hydra, later Hydra will use the binary cache. Also my proposal includes feeding different «apparent» output paths to the same build with the same dependencies, which has a better chance of discovering compressed self-references.

Member

I am OK with running --check and trying different temporary output paths. Catching non-determinism I don't think is important, because it's really clashes that we care about. However, catching self-references is important as we have to be able to move the thing.

Member

Well, varying the output paths is something that doesn't follow from anything Nix does, so it has to be spelled explicitly.


## Caching

[caching]: #caching

The big unresolved question is about the caching of content-addressed paths.
As [Eelco's PhD thesis][nixphd] states, caching ca paths raises a number of
questions when building that path is non-deterministic (because two different
stores can have two different outputs for the same path, which might lead to
some dependencies being duplicated in the closure of a dependency).
There exist some solutions to this problem (including one presented in Eelco's
thesis), but for the sake of simplicity, this RFC simply forbids marking a
derivation as ca if its build is not deterministic (although there's no real
way to check that, so it's up to the author of the derivation to ensure that it
is the case).
Member

I don't think we can skip this. If we track all the evaluation steps, we have all the information to ensure a binary cache isn't given anything that clashes with ourself. Maybe the first prototype will discover these errors lazily, but it should discover them.


## Client support

The bulk of the job here is done by the Nix daemon.

Depending on the details of the current Nix implementation, there might or
might not be a need for the client to also support it (which would require the
Contributor

Suggested change
might not be a need for the client to also support it (which would require the
be a need for the client to also support it (which would require the

daemon and the client to be updated synchronously).

## Old Nix versions and caching

What happens (and should happen) if a Nix not supporting the cas model queries
Contributor

cas hasn't been defined yet.

a cache with cas paths in it is not clear yet.

## Garbage collection

Another major open issue is garbage collection of the aliases table (the `PathOf` mappings). It's not
clear when entries should be deleted. The paths in the domain are "fake" so we
can't use them for expiration. The paths in the codomain could be used (i.e. if
a path is GC'ed, we delete the alias entries that map to it) but it's not clear
whether that's desirable since you may want to bring back the path via
substitution in the future.
Member

I recommend we might just store the "generating" mappings from almost-resolved ca input paths (all deps resolved) to output paths, as this will require far less space. OTOH it makes garbage collection trickier as now all mappings in the build closure are needed to recover a maximum-unresolved input path to map.

Member

OTOH we can just do that with tracing GC. We just read the table backwards, saying each derivation in the codomain references everything in the domain that maps to it, and then look those up in turn.


## Ensuring that no temporary output path leaks in the result

One possible issue with the ca model is that the output paths get moved after being built, which breaks self-references. Hash rewriting solves this in most cases, but it is only heuristic and there is no way to truly ensure that we don't leak a self-reference. For example, if a self-reference appears in a compressed file (as is often the case for man pages or Java jars), the hash-rewriting machinery won't detect it.
Having leaking self-references is annoying since:

- These self-references change each time the inputs of the derivation change, making ca useless (because the output will _always_ change when the inputs change)
- More annoyingly, these references become dangling and can cause runtime failures

We do, however, have a way to detect these: if we have leaking self-references, then the output will change if we artificially change its output path. This could be integrated in the `--check` option of `nix-store`.
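
A minimal sketch of such a check, assuming builders can be modelled as functions from an output path to the bytes they produce. The helper `leaks_self_reference`, the two toy builders and the fake store paths are all hypothetical, and `zlib` compression merely stands in for "a self-reference hidden inside a compressed file".

```python
import zlib
from typing import Callable


def leaks_self_reference(builder: Callable[[str], bytes]) -> bool:
    """Build under two different output paths and compare modulo the path."""
    path_a = "/nix/store/" + "a" * 32 + "-pkg"
    path_b = "/nix/store/" + "b" * 32 + "-pkg"
    placeholder = b"/nix/store/" + b"0" * 32 + b"-pkg"
    # Hash rewriting: replace the textual occurrences of the output path.
    out_a = builder(path_a).replace(path_a.encode(), placeholder)
    out_b = builder(path_b).replace(path_b.encode(), placeholder)
    return out_a != out_b


def plain_builder(out: str) -> bytes:
    # Textual self-reference: visible to the rewriting step.
    return b"#!/bin/sh\nexec " + out.encode() + b"/bin/tool\n"


def zipped_builder(out: str) -> bytes:
    # Compressed self-reference: textual rewriting cannot see it.
    return zlib.compress(b"man page mentioning " + out.encode())
```

With these definitions, `plain_builder` passes the check while `zipped_builder` is flagged. Note that such a check would also flag a non-deterministic build, since it cannot tell the two causes apart.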

# Future work

[future]: #future-work

This RFC tries as much as possible to provide a solid foundation for building
ca paths with Nix, leaving as much room as possible for future extensions.
Contributor

Suggested change
ca paths with Nix, leaving as much room as possible for future extensions.
CA paths with Nix, leaving as much room as possible for future extensions.

In particular:

- Add some path-rewriting to allow derivations with self-references to be built
as ca
- Consolidate the caching model to allow non-deterministic derivations to be
built as ca
- (hopefully, one day) make the CA model the default one in Nix
- Investigate the consequences in terms of privilege requirements
- Build a trust model on top of the content-addressed model to share store paths
Member

reference the reserved truster field from here


[rfc 0017]: https://github.com/NixOS/rfcs/pull/17
[nixphd]: https://nixos.org/~eelco/pubs/phd-thesis.pdf