Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add --cid-version=1 --raw-leaves=false isn't hash coherent with add --cid-version=0 #8974

Closed
3 tasks done
Jorropo opened this issue May 14, 2022 · 13 comments
Closed
3 tasks done
Labels
topic/docs Documentation topic/UnixFS Topic UnixFS

Comments

@Jorropo
Copy link
Contributor

Jorropo commented May 14, 2022

Checklist

Installation method

built from source

Version

7892cc91f9ed17f5a6e0348334ed09c8bdb3194f

Config

N/A

Description

When using --cid-version=1 ipfs will use CIDv1 CIDs in the root blocks of the file, creating different hashes if the file gets chunked into multiple leaves.

Reproduction

dd if=/dev/urandom count=1 bs=1024 iflag=fullblock of=f
[[ $(ipfs cid base32 $(ipfs add -n --chunker=size-512 -Q --cid-version=0 f)) != $(ipfs add -n --chunker=size-512 -Q --cid-version=1 --raw-leaves=false f) ]] && echo "Failed, did not matched!" || echo "Success matched!"
@Jorropo Jorropo added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels May 14, 2022
@Jorropo Jorropo changed the title add --cid-version=1 --raw-leaves=false isn't hash coherent with --cid-version=0 add --cid-version=1 --raw-leaves=false isn't hash coherent with add --cid-version=0 May 14, 2022
@Jorropo
Copy link
Contributor Author

Jorropo commented May 14, 2022

Personally that a WONTFIX (except maybe some docs), having to use --cid-version=0 and then ipfs cid base32 if you care about backward compat doesn't look too bad.
I belive we just have to be carefull that this behaviour doesn't happen with #4143.

@ribasushi
Copy link
Contributor

@Jorropo this isn't a bug - the version of a cid is part of the payload. A block with links in cidv0 is different byte-content ( thus different CID ) as opposed to a block with the very same links in cidv1.

@Jorropo
Copy link
Contributor Author

Jorropo commented May 14, 2022

@ribasushi I know, that why the:

Personally that a WONTFIX

It's just that go-ipfs attempt to have some hash coherency with future versions. (adding the same data multiple times with the same options in future versions should generally give you the same CID back)
We have to choose whether this qualify or not. And whether doing ipfs cid base32 $(ipfs add -n --cid-version=0) is a valid alternative.
As well as document this alternative.

@ribasushi
Copy link
Contributor

You are probably thinking of --upgrade-cidv0-in-output=true. Note that this was raised ( and rejected ) in #4143 (comment)

@Jorropo
Copy link
Contributor Author

Jorropo commented May 14, 2022

You are probably thinking of --upgrade-cidv0-in-output=true.

I'm essentially proposing the same thing but doing it explicitly with ipfs cid base32 (which already exists and works), I think the fix for this issue is just gonna be some more lines of docs somewhere.

@schomatis
Copy link
Contributor

Always happy to get more docs but is this coherency expected by the user somewhere? (And what exactly is the meaning of it?) It seems reliant on very specific and internal knowledge on how we construct these (raw leaves, chunker, etc.).

@Jorropo
Copy link
Contributor Author

Jorropo commented May 16, 2022

is this coherency expected by the user somewhere?

I have talked with a few people that assume it.
We also have tried to somewhat maintain it.

And what exactly is the meaning of it?

Adding the same file multiple times yields the same CID (or at least a compatible one, so really it's only the same multihash).

It seems reliant on very specific and internal knowledge on how we construct these (raw leaves, chunker, etc.).

Yeah

@schomatis
Copy link
Contributor

Pushing some more on your definition to try to better understand it and see how we can fit it in the docs:

Adding the same file multiple times yields the same CID

You're changing the arguments, is the user that changes these still expect the same CID?

a compatible one

Then what does compatible mean? Same multihash? Is the user aware of it, and if it is, it expects it to go unchanged when it changes the CID or any other argument?

If anything this seems more of an issue with the UnixFS layer and explaining to the user that their file/directory/etc representation in IPFS is opaque and they shouldn't rely on the CID/multihash. We attempt to change this as little as possible but there are (AFAIK) no guarantees. For example, we now automatically change the directory implementation (from basic to HAMT) based (roughly) on its number of entries, when they reach a (variably-adjusted) threshold. There is no single "file" entity in IPFS as a continuous string of bits as (say) in the user's hard drive. I'd push for more documentation in that direction.

@schomatis schomatis added topic/UnixFS Topic UnixFS topic/docs Documentation and removed kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels May 16, 2022
@schomatis
Copy link
Contributor

(Changing to a docs issues unless there is in fact a guarantee of CID/multihash stability in UnixFS.)

@aschmahmann
Copy link
Contributor

I have talked with a few people that assume it.

IMO there is in no way whatsoever a guarantee that ipfs-vX add foobar generates the same DAG as ipfs-vX+1 add foobar.

From ipfs add --help

https://github.com/ipfs/go-ipfs/blob/a72753bade90c4a48c29aba6c0dc81c44785e9d2/core/commands/add.go#L110-L113

If that wording is insufficient to get the point across we can modify that as well.

We also have tried to somewhat maintain it.

Correct, we don't like changing behavior because no matter what you tell people they'll start relying on things anyway unless it evolves too quickly for people to incorrectly assume it'll never change. This is part of why the defaults for ipfs add have not been changed to a superior DAG structure that uses raw leaves (e.g. ipfs add --cid-version=1 --raw-leaves=true (yes, I know --cid-version=1 implies raw leaves).

When the change to CIDv1 by default happens there will likely need to be a set of comms reminding people that defaults can change and how to generate CIDv0s if they really need to.

@Jorropo unless the idea is you want to add more text to ipfs add this should be closed in favor of an issue in https://github.com/ipfs/ipfs-docs/issues

@Jorropo
Copy link
Contributor Author

Jorropo commented May 16, 2022

If anything this seems more of an issue with the UnixFS layer and explaining to the user that their file/directory/etc representation in IPFS is opaque and they shouldn't rely on the CID/multihash. We attempt to change this as little as possible but there are (AFAIK) no guarantees. For example, we now automatically change the directory implementation (from basic to HAMT) based (roughly) on its number of entries, when they reach a (variably-adjusted) threshold. There is no single "file" entity in IPFS as a continuous string of bits as (say) in the user's hard drive. I'd push for more documentation in that direction.

I agree with all of this.

I think the issue is getting side tracked.

The issue here is that it is surprising that the CID version change the hash. It seems like a non trivial amount of people (me including before I implemented some raw CID code) think that CID version is just an encoding trick, but the binary data wouldn't change, just how it gets made into text.

unless the idea is you want to add more text to ipfs add

I want to add a

This also affects CID's binary representation inside blocks and could change your hashes.

similar thing to --cid-version's help text.

@schomatis
Copy link
Contributor

I think the implicit assumption that is causing some confusion here is we're talking about a single block (strings of bits) of data.

In general changing CID version doesn't change the hash of the data (because we keep the same hashing algorithm, at least in v1).

But a file is (usually) not a single block of data but a DAG of them, and a DAG is not just the concatenation of the chunks of the original file but something more opaque that includes other metadata to structure it. (That metadata is in fact leaked in the CID when the user might not be actually interested in it.)

The bold text is not obvious to a new user (I had the same confusion for a lot of time) and some more docs could be added on that.

@Jorropo
Copy link
Contributor Author

Jorropo commented Aug 15, 2023

I had discussions since then, we ideally want to sunset CIDv0, this will never happen but we are still gonna try.

We want to default to encoding CIDv1 inside the binary blocks even when a CIDv0 could be used because it makes it anoying to write various implementations and have weird edge cases in some of them.

@Jorropo Jorropo closed this as not planned Won't fix, can't repro, duplicate, stale Aug 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic/docs Documentation topic/UnixFS Topic UnixFS
Projects
None yet
Development

No branches or pull requests

4 participants