Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC 0062] Content-addressed paths #62
[RFC 0062] Content-addressed paths #62
Changes from 2 commits
6d001f3
435fc42
7b26144
81099b2
4277386
7af7d2c
5fec861
9edc11f
5717351
1a844cc
26ae77e
bbdca7e
63f3eca
a6d2f38
140e093
288dcb4
60e7da3
1115a0d
13938de
3a25f7f
3a18867
fa16e86
94b65bd
7ed4481
fb4c61d
841fe3f
27bd048
1e8fab7
9772625
02ae2b5
2d74fed
168a149
427abed
f275669
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is confusing because above it says that each derivation has one output. It's more accurate to say that there are two derivations (that differ in their output paths).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW the names
static
anddynamic
are ambiguous because they're already used for static/dynamic linking (in particular builds might have astatic
output for.a
files...).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about the names because it's indeed a very overloaded term, but that also matches the common usage when it comes to language semanics ("static" knowledge that can be known by just looking at the source code vs. dynamic stuff that is only available at runtime)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
evaluation-time vs. build-time seems to be less ambiguous to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe in 2 names because there isn't just one dynamic hash. As more CA deps are built, we can incrementally "improve" the drv. It's better to speak of normal forms, or an arbitrary expression vs value description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can, but why would you want to?
Purely-evaluation-time version can be computed without any builds and is a natural database key. Fully-updated version is used to determine the actual output path for the build. Why would we ever compute a partial substitution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you do a build, you want your deps in normal form, but you yourself unchanged: "almost normal form".
I just thought of another: when you ask the remote builder, you ask with the partial substitution which has all your existing substitutions. Then you can be sure the server won't respond with a conflict! This is really good because even if the server doesn't agree with you that your maximally unevaluated derivation has that content hash, maybe another derivation has that content hash?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in this case this actually is the normal form for build. This is the final form for non-CA, and a necessary build form for CA, their CA version is a different story.
I think the most useful case of what you describe is that you have a different original derivation but some CA dependencies happen to have the same content. It would be a nice feature for a binary cache to provide, but we ither get nontrivial query logic, or combinatorial explosion, and I think neither should be mandated before we have the easy case working.
There can also be something about nondeterminisim and a CA derivation output switching between a small number of possible content sets, but I think we should encourage as much caution as possible for the first version, and if somethings slips through it should be highly visible (and get reported)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we fold the existing fixed output logic into this? e.g. even today two different derivations that produce the same output can be substituted within some downstream derivation without changing its output path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you mean by "fold" here. If you mean "make fixed-output derivations a special-case of content-addressed ones" then yes, we could in theory (just adding a check at the end that the content-hash is the expected one), but I'm not sure that this is desirable in practice (since the mechanism for CA derivations is inherently more complex than the one for fixed-output ones). Otoh they can (and do in the prototype) share a bit of implementation − in particular fixed-output and CA derivations are stored the same way in the db.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing about the
hashDerivationModulo
that @edolstra mentions is content addressable derivations already evaluate two another derivation, just like CA ones. But we don't see that derivation today as it is just used to calculate downstream store paths. I think we should expose that derivation. Then the complexity is about the same, and things are more uniform.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So there are 3 types of derivations:
I'm not sure if 2) is useful to support. It already behaves different from 1) in that the output paths cannot be known in advance. So then you may as well make the outputs content-addressable. So maybe be it's easier to require that only CA derivations may depend on CA derivations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this because there is analogous "context" vs "redex" situations for #40. Surely normal derivations would depend on ret-cont ones, so we should have normal derivations that depends on CA ones. Also matches how we have normal derivations that depend on fixed-output ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in this model CA derivations should be treated the same as we would impure derivations if we had them; we do not know whether they are reproducible and thus we should propagate that uncertainty. In this case that would be done by marking all derivations downstream as CA. With fixed-output ones it works because of the hash. Thus, CA derivations that check against a passed-in hash would result in non-CA downstream, but without hash in CA downstream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@FRidh that doesn't work because downstream ones can have cycles. The similarity is that "naive" downstream one doesn't have have a corresponding build, only the normalized drvs map to builds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should drop support for 2) as
To expand a bit:
The original use-case for the prototype was a derivation (call it
d0
) deep into a derivation tree that had its inputs frequently changing, but its output rarely, and that was a dependency of a lot of other (expensive) things. We knew thatd0
was totally deterministic, so we could easily make it CA (which saved a lot of building), but we couldn't make much assumptions about its referrers, so we couldn't make them CA.I guess that could be transposed to nixpkgs by replacing
d0
by e.g.glibc
or anything in stdenv (assuming that it's deterministic). Having CA derivations "poison" their referrers would mean that we would have to mark the whole of nixpkgs as CA which will probably not be possible.The way I see things, we only have 2 types of derivations
It just happens that the semantics of non-CA derivations that depend on zero CA derivations match the current semantics, but there's absolutely no extra-complexity involved compared to only supporting Non-CA derivations that don't depend on CA derivations).
Another way to look at it is that being CA and depending on CA are two orthogonal features, so if we support CA-depending-on-CA and non-CA-depending-on-non-CA, we also support "for free" non-CA-depending-on-CA (and CA-depending-on-non-CA)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compressed manpages, JARs or any other such thing present in the output. Note that marking something as CA might require a careful study of either the build system or the output.
Also, of course, hard to debug issues with determinism could be a problem…
Maybe specifically that should be cached, indeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason I can think of is non-determinism − I know there are caching models that can handle that just fine, but implementing these are out-of-scope for this RFC
This doesn't have to be true I think (although I'm not 100% sure wrt #62 (comment)):
The current behavior (in the prototype I mean) is that the remote cache knows about aliases path, so if
/nix/store/abc-foo
is an alias for/nix/store/def-foo
, the cache will happily serve a narinfo forabc-foo
containingaliasOf: def-foo
, andSubstitutionGoal
will just adddef-foo
to its dependent goals as if it were another dependency. The same thing should also hold forStore::buildDerivation()
unless I miss something.There's indeed a slight runtime overhead (because we have to fetch an extra narinfo for the alias), but it seems very minor (and marking everything as CA wouldn't make this any faster)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, so that means that the output path of a non-CA derivation can differ between builds. But that's already the case since the CA dependencies of a non-CA derivation can change. So you can't use things like binary caches anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to be able to make packages content-addressable one by one independently. If
zlib
doesn't have a self-reference, for example, then I want to make it so without having to also convert all its reverse dependencies. That would be a disaster.The binary cache interaction doesn't seem too bad? It's basically "You asked for drv0, I map drv0 to drv1 under substitutions (a to a_hash, b to b_hash, ....z to z_hash). I have a build for drv1." The client can see if any of those content-hash assignments clash with any of its own, and take the drv1 build and the hash assignments if it is OK with them. As long as we keep track of all hash assignments, and stop doing whatever we are doing if we notice a conflict, things won't go wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I follow: if all CA derivations are carefully checked to be deterministic, then the output paths of their reverse-dependencies won't change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of this stuff sounds normative? If so, it should be moved outside of "examples" back into "detailed design".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this shouldn't be a
StorePath -> StorePath
mapping but aDrvOutputId -> StorePath
mapping. This eliminates the suggestion that the paths in the domain of this mapping are real paths. Nix already does something like this inhashDerivationModulo()
to compute the "static" store path outputs of a derivation, by replacing theinputDrvs
with IDs that don't change for fixed-output derivations.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah thanks this begins to answer my #62 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say
DrvOutputId -> DrvOutputId
even.hashDerivationModulo
and the thing that "upgrades" (possibly nested!) CA derivations to use their output hash should be extremely similar.We can then have a separate
normalized DrvOutputId -> StorePath
mapping.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That looks better indeed, although I'm not sure how you define
DrvOutputId
. My understanding is that it should essentially be the "hash" part of the path. Is that it or is there something more?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's basically it. It's a hash over the derivation graph, which is essentially what the hash part of non-CA output paths is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing that we should ensure is that CA derivations can be built by untrusted users without the input drvs being available. This is currently not the case for non-CA derivations:
Store::buildDerivation(BasicDerivation)
can only be used by trusted users, which is annoying for Hydra builders.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A major open issue is garbage collection of the aliases table. It's not clear when entries should be deleted. The paths in the domain are fake so we can't use them for expiration. The paths in the codomain could be used (i.e. if a path is GC'ed, we delete the alias entries that map to it) but it's not clear whether that's desirable since you may want to bring back the path via substitution in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unprivileged builds is more a future work than an unresolved issue (since it's just something that we can build on top of that). I very quickly mention it (
Investigate the consequences in term of privileges requirements
), but if you think it's worth expanding, I can add more on that topicWhat do you mean by that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mentioned it because we need to keep it in mind to avoid a design of derivations that makes unprivileged builds impossible (i.e. avoid the mistake of
hashDerivationModulo
).If you garbage-collect the output of a CA derivation, and then build the CA derivation again, you could avoid building it by fetching the output from a binary cache, if you remembered what the CA output paths were.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to stress: this knowledge would allow fetching even when the binary cache doesn't know of the derivation I am trying to build but the output happens to be the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this mistake of
hashDerivationModulo
?