docs: Update ADR-040 to store hash(value) in SMT leaf #9680
For the purposes of IPLD it is more conventional to store `hash(value)` so we can map it to an IPLD block that stores the value. But as I think about this more, perhaps we actually do want to include key information in the IPLD in this manner... `hash(key || value) => key || value` is still a proper content hash mapping since the key is included as part of the IPLD content. This would be a more compact way to store key information in the IPLD DAG, but if we switch to storing `hash(value)` as the leaf value we can still store the raw key in the IPLD DAG with a distinct IPLD that maps `hash(key) => key` (i.e. we IPLDize the inverse index).
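To make the two mappings concrete, here is a minimal sketch of a content-addressed store. It assumes SHA-256 and uses raw digests as block identifiers; real IPLD uses CIDs, and `blocks` and `put_block` are hypothetical names for illustration only.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


# Hypothetical content-addressed store: digest -> the bytes it was computed from.
blocks: dict[bytes, bytes] = {}


def put_block(content: bytes) -> bytes:
    """Store content under its own hash and return that hash."""
    digest = sha256(content)
    blocks[digest] = content
    return digest


key, value = b"balance/alice", b"100"

value_id = put_block(value)  # hash(value) => value
key_id = put_block(key)      # hash(key) => key (the inverse index as its own block)

# Both lookups are proper content-hash mappings: the id is the hash of the content.
assert blocks[value_id] == value
assert blocks[key_id] == key
```

With this layout the leaf only needs `hash(value)`, and the raw key remains recoverable through the separate `hash(key) => key` block.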
Sorry for going back and forth on this @roysc! So I think the most compelling reason to switch to storing `hash(value)` as the leaf value is preventing collisions when `key1 || value1 == key2 || value2` (e.g. `a || bc == ab || c`).
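The collision can be reproduced in a few lines. This is a sketch assuming SHA-256 and plain byte concatenation with no length prefix or separator; `hash_concat` and `hash_value` are illustrative helper names.

```python
import hashlib


def hash_concat(key: bytes, value: bytes) -> bytes:
    """hash(key || value) with plain concatenation."""
    return hashlib.sha256(key + value).digest()


def hash_value(value: bytes) -> bytes:
    """hash(value) alone."""
    return hashlib.sha256(value).digest()


# Two different (key, value) pairs whose concatenations are identical bytes.
assert hash_concat(b"a", b"bc") == hash_concat(b"ab", b"c")

# Hashing the value alone keeps the two leaf values distinct, and the keys
# are still told apart by their separately hashed paths.
assert hash_value(b"bc") != hash_value(b"c")
assert hashlib.sha256(b"a").digest() != hashlib.sha256(b"ab").digest()
```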
In any case I think we should explicitly define the tree nodes, including the node prefixes and hash function used. E.g.
- An inner node is `0x01 || left_hash || right_hash`.
- A leaf node is `0x00 || path || leaf_value`.
  - The `leaf_value` is `SHA_256(value)`.
  - The `path` is `SHA_256(key)`.
- `left_hash` and `right_hash` are `SHA_256(left_node)` and `SHA_256(right_node)`, respectively.
- `||` is byte concatenation.
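The node definitions proposed in the comment above can be sketched as follows. This only shows how node bytes are assembled; placing leaves by path bits, empty-subtree placeholders, and proofs are out of scope, and the function names are hypothetical.

```python
import hashlib

LEAF_PREFIX = b"\x00"
INNER_PREFIX = b"\x01"


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_node(key: bytes, value: bytes) -> bytes:
    """0x00 || path || leaf_value, with path = SHA_256(key), leaf_value = SHA_256(value)."""
    return LEAF_PREFIX + sha256(key) + sha256(value)


def inner_node(left_node: bytes, right_node: bytes) -> bytes:
    """0x01 || left_hash || right_hash, where each hash is SHA_256 of the child node."""
    return INNER_PREFIX + sha256(left_node) + sha256(right_node)


left = leaf_node(b"k1", b"v1")
right = leaf_node(b"k2", b"v2")
parent = inner_node(left, right)

# One prefix byte plus two 32-byte SHA-256 digests in both cases.
assert len(left) == 65
assert len(parent) == 65
# The distinct prefixes keep a leaf from being reinterpreted as an inner node.
assert left[:1] != parent[:1]
```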
SMT is a merkle tree structure: we don't store keys directly. For every `(key, value)` pair, `hash(key)` is stored in a path (we hash a key to evenly distribute keys in the tree) and `hash(key, value)` in a leaf. Since we don't know a structure of a value (in particular if it contains the key) we hash both the key and the value in the `SC` leaf.
SMT is a merkle tree structure: we don't store keys directly. For every `(key, value)` pair, `hash(key)` is stored in a path (we hash a key to evenly distribute keys in the tree) and `hash(value)` in a leaf.
SMT is a merkle tree structure: we don't store keys directly. For every `(key, value)` pair, `hash(key)` is stored in a path (we hash a key to evenly distribute keys in the tree) and `hash(value)` in a leaf.
SMT is a merkle tree structure: we don't store keys directly. For every `(key, value)` pair, `hash(key)` is stored in a path (we hash a key to evenly distribute keys in the tree) and `0x00 || key || hash(value)` in a leaf.
as discussed - in the leaf we need to commit to the key as well (it's not enough that it is in a path).
`hash(key)`/`path` is stored in the leaf node, even if not as part of the "leaf value" (i.e. the value returned from calling `Get` on the SMT):
`leaf node == prefix || path || leaf_value == prefix || hash(key) || hash(value_provided_to_the_SMT)`
When calling `Set(key, value_provided_to_the_SMT)` it hashes the `key` into the `path` and the `value_provided_to_the_SMT` into the `leaf_value`. When calling `Get(key)` it returns `value_provided_to_the_SMT`.
Note that `value_provided_to_the_SMT` currently is `hash(key || value_in_the_StateStore)` so that when we call `Get` we retrieve `hash(key || value_in_the_StateStore)`, which is the key we need for the (current) inverse index.
So the SMT leaf node is, in current practice, `prefix || hash(key) || hash(hash(key || value_in_the_StateStore))`.
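The `Set`/`Get` behavior and the resulting double hash can be illustrated with a toy stand-in. `ToySMT` is hypothetical and is not the SMT library's API: it tracks leaves only, assumes SHA-256 and a `0x00` leaf prefix, and omits the tree structure entirely.

```python
import hashlib

LEAF_PREFIX = b"\x00"


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


class ToySMT:
    """Hypothetical stand-in that models only the Set/Get contract and leaf bytes."""

    def __init__(self) -> None:
        self._values: dict[bytes, bytes] = {}  # path -> value exactly as given to set()

    def set(self, key: bytes, value: bytes) -> None:
        self._values[sha256(key)] = value

    def get(self, key: bytes) -> bytes:
        # Returns what the caller provided, not an internal hash of it.
        return self._values[sha256(key)]

    def leaf_node(self, key: bytes) -> bytes:
        # prefix || hash(key) || hash(value_provided_to_the_SMT)
        path = sha256(key)
        return LEAF_PREFIX + path + sha256(self._values[path])


key, state_value = b"k", b"v"
smt = ToySMT()

# Practice described above: the caller pre-hashes key || value before calling Set.
provided = sha256(key + state_value)
smt.set(key, provided)

assert smt.get(key) == provided
# The leaf therefore ends up holding a hash of a hash.
assert smt.leaf_node(key) == LEAF_PREFIX + sha256(key) + sha256(sha256(key + state_value))
```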
Right, so we can add `prefix +` to my suggestion. We can use the `||` operator instead of `+` if you prefer.
> Note that value_provided_to_the_SMT currently is `hash(key || value_in_the_StateStore)`

Why is that? We should provide `key` and `obj_value` without modifying them. SMT will do all necessary operations.
That was just a misunderstanding of the ADR language, I think. We didn't think it was describing the SMT's internal structure, since the hashed values are not exposed by the SMT interface (so we assumed the hashed value should be passed in). But we can add methods for that and fork the code if necessary.
> Why is that? We should provide `key` and `obj_value` without modifying it. SMT will do all necessary operations.

Like Roy said, it is because the current implementation does not expose the hashed values.
We did it like this so that when we `Get` from the SMT we retrieve the value we needed for the old inverse index, `hash(key || value)`.
- `Set` takes a key and value, and `Get` only returns the value provided to `Set`, not some internal transformation of the key and/or value. That would be really odd behavior for a setter and getter interface.
- If the value we provided to `Set` was the unhashed "value" (`key || obj_value`), then when we `Get` from the SMT we would get that unhashed value (again, not some internal, hashed transformation of the value we provided) and we would have to hash it again at the level above the SMT before we could use it in the inverse index. This would have worked but would mean we were duplicating hashing efforts at the two levels.
I think we are on the same page. In my suggestion I added `hash(key)` to the leaf (I'm using the `+` operator rather than `||`). It seems we need to update it to (as you noted in the comment above):
`prefix + hash(key) + hash(value_provided_to_the_SMT)`
I have updated my suggestion based on the SMT spec (https://github.com/celestiaorg/celestia-specs/blob/ec98170398dfc6394423ee79b00b71038879e211/src/specs/data_structures.md#sparse-merkle-tree) and John's response.
@roysc, let's update the PR and merge it.
We need to add information about the prefix.
I would argue the prefix is an implementation detail and doesn't need to be in the spec, since it's not relevant to the ADR's functionality?
We are defining what's in the SMT leaf, so let's be precise, because this spec will be used by IBC, and they need full information.
Let's be thorough then, we should include the spec for internal nodes and the hash function used as well. I'll link the spec from Celestia and quote the relevant parts here.
@robert-zaremba updated, let me know how that looks.
d65f218 to bf2a783
I realized a couple things need clarification, though maybe just for my own understanding. This seems as good a place as any to ask, as they are relevant questions:
If these hashes won't be needed outside of a context where their preimages are also available, we can simplify the design and implementation somewhat.
Let's note here that the reverse index is needed for IPLD.
looks good - we clarify what's needed in a call.
I picked a few nits, but this is done and should be good to merge.
We are missing one more approval. @i-norden, could you approve as well?
@roysc, could you sync this branch? It should merge automatically once it's synced.
Description
This revises ADR-40 to specify storing `hash(value)` as the SMT mapped value. This should allow the same indexing functionality as using `hash(key + value)`, while working much better with IPLD by storing only the hash of the mapped content.
Some discussion here: #9331 (comment)