docs: Update ADR-040 to store hash(value) in SMT leaf #9680

roysc · 2021-07-12T11:57:17Z

Description

This revises ADR-40 to specify storing hash(value) as the SMT mapped value.
This should allow the same indexing functionality as using hash(key + value), while working much better with IPLD by storing only the hash of the mapped content.
Some discussion here: #9331 (comment)

Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

confirmed the correct type prefix in the PR title
confirmed ! in the type prefix if API or client breaking change
confirmed all author checklist items have been addressed
reviewed state machine logic
reviewed API design and naming
reviewed documentation is accurate
reviewed tests and test coverage
manually tested (if applicable)

i-norden

For the purposes of IPLD it is more conventional to store hash(value) so we can map it to an IPLD block that stores the value. But as I think about this more, perhaps we actually do want to include key information in the IPLD in this manner... hash(key || value) => key || value is still a proper content hash mapping since the key is included as part of the IPLD content. This would be a more compact way to store key information in the IPLD DAG, but if we switch to storing hash(value) as the leaf value we can still store the raw key in the IPLD DAG with a distinct IPLD that maps hash(key) => key (i.e. we IPLDize the inverse index).

Sorry for going back and forth on this @roysc! So I think the most compelling reason to switch to storing hash(value) as the leaf value is preventing collisions when key1 || value1 == key2 || value2 (e.g. a || bc == ab || c).

In any case I think we should explicitly define the tree nodes, including the node prefixes and hash function used. E.g.

An inner node is 0x01 || left_hash || right_hash.
A leaf node is 0x00 || path || leaf_value.
- The leaf_value is the SHA_256(value).
- The path is the SHA_256(key).
- left_hash and right_hash are SHA_256(left_node) and SHA_256(right_node), respectively
- || is byte concatenation

robert-zaremba · 2021-07-12T15:40:06Z

docs/architecture/adr-040-storage-and-smt-state-commitments.md


-SMT is a merkle tree structure: we don't store keys directly. For every `(key, value)` pair, `hash(key)` is stored in a path (we hash a key to evenly distribute keys in the tree) and `hash(key, value)` in a leaf. Since we don't know a structure of a value (in particular if it contains the key) we hash both the key and the value in the `SC` leaf.
+SMT is a merkle tree structure: we don't store keys directly. For every `(key, value)` pair, `hash(key)` is stored in a path (we hash a key to evenly distribute keys in the tree) and `hash(value)` in a leaf.


Suggested change

SMT is a merkle tree structure: we don't store keys directly. For every `(key, value)` pair, `hash(key)` is stored in a path (we hash a key to evenly distribute keys in the tree) and `hash(value)` in a leaf.

SMT is a merkle tree structure: we don't store keys directly. For every `(key, value)` pair, `hash(key)` is stored in a path (we hash a key to evenly distribute keys in the tree) and `0x00 || key || hash(value)` in a leaf.

as discussed - in the leaf we need to commit to the key as well (it's not enough that it is in a path).

hash(key)/path is stored in the leaf node, even if not as part of the "leaf value" (i.e. the value returned from calling Get on the SMT)

leaf node == prefix || path || leaf_value == prefix || hash(key) || hash(value_provided_to_the_SMT)

When calling Set(key, value_provided_to_the_SMT) it hashes the key into the path and the value_provided_to_the_SMT into the leaf_value. When calling Get(key) it returns value_provided_to_the_SMT.

Note that value_provided_to_the_SMT currently is hash(key || value_in_the_StateStore) so that when we call Get we retrieve hash(key || value_in_the_StateStore) which is the key we need for the (current) inverse index.

So the SMT leaf node is, in current practice, prefix || hash(key) || hash(hash(key || value_in_the_StateStore))

Right, so we can add prefix + to my suggestion. We can use || operator instead of + if you prefer.

Note that value_provided_to_the_SMT currently is hash(key || value_in_the_StateStore)

Why is that? We should provide key and obj_value without modifying it. SMT will do all necessary operations.

That was just a misunderstanding of the ADR language, I think. We didn't think it was describing the SMT's internal structure, since the hashed values are not exposed by the SMT interface (so we assumed the hashed value should be passed in). But we can add methods for that and fork the code if necessary.

Why is that? We should provide key and obj_value without modifying it. SMT will do all necessary operations.

Like Roy said it is because the current implementation does not expose the hashed values.

We did it like this so that when we would Get from the SMT we retrieved the value we needed for the old inverse index hash(key || value).

Set takes a key and value, Get only returns the value provided to Set not some internal transformation of key and/or value- this would be really odd behavior for a Setter and Getter interface.

If the value we provided to Set was the unhashed "value" (key || obj_value), then when we would Get from the SMT we would get that unhashed value (again, not some internal- hashed- transformation of the value we provided) and we would have to hash it again at the level above the SMT before we could use it in the inverse index. This would have worked but would mean we were duplicating hashing efforts at the two levels.

I think we are on the same page. In my suggestion I added hash(key) to the leaf (I'm using + operator rather than ||). It seams we need to update it to (as you noted in the comment above):

prefix + hash(key) + hash(value_provided_to_the_SMT)

Ping, @roysc , @i-norden - let's update the paragraph to what's in the SMT leaf and merge this PR.

I have updated my suggestion based on the SMT spec: https://github.com/celestiaorg/celestia-specs/blob/ec98170398dfc6394423ee79b00b71038879e211/src/specs/data_structures.md#sparse-merkle-tree and John response.

robert-zaremba · 2021-07-28T20:28:46Z

@roysc , let's update the PR and merge it.

robert-zaremba · 2021-07-28T20:31:35Z

We need to add an information about the prefix.

roysc · 2021-07-30T02:34:23Z

I would argue the prefix is an implementation detail and doesn't need to be in the spec, since it's not relevant to the ADR's functionality?

robert-zaremba · 2021-07-30T21:54:42Z

I would argue the prefix is an implementation detail and doesn't need to be in the spec, since it's not relevant to the ADR's functionality?

We are defining what's in the SMT leaf, so let's be precise - because this spec will be used by IBC, and they need a full information.

roysc · 2021-07-31T10:50:07Z

Let's be thorough then, we should include the spec for internal nodes and the hash function used as well. I'll link the spec from Celestia and quote the relevant parts here.

roysc · 2021-07-31T11:47:10Z

@robert-zaremba updated, let me know how that looks.

roysc · 2021-08-01T16:46:36Z

I realized a couple things need clarification, though maybe just for my own understanding. This seems as good a place as any to ask, as they are relevant questions:

@robert-zaremba what is the anticipated use case for the reverse index? In what context would only h(k), and not k ,be available to the program? Do we actually need to have h(k) available anywhere outside of proof generation? I may just not be familiar enough with the SDK's usage to see this.
@i-norden in what context will we need h(v) for IPLD? Would there be any case where h(v) is needed that v is not also available?

If these hashes won't be needed outside of a context where their preimages are also available, we can simplify the design and implementation somewhat.

robert-zaremba · 2021-08-04T15:13:59Z

let's update here that reverse index is needed for IPDL

robert-zaremba

looks good - we clarify what's needed in a call.

roysc · 2021-08-10T08:32:23Z

I picked a few nits, but this is done and should be good to merge.

robert-zaremba · 2021-08-11T21:21:06Z

we miss one more approval. @i-norden could you approve a well?

docs/architecture/adr-040-storage-and-smt-state-commitments.md

robert-zaremba · 2021-09-15T10:16:36Z

@roysc , could you sync this branch? It should merge automatically once it's synced.

Update ADR-40 to store hash(value) in SMT leaf

ab7ee3e

roysc requested review from aaronc and alexanderbez as code owners July 12, 2021 11:57

github-actions bot added the T: ADR An issue or PR relating to an architectural decision record label Jul 12, 2021

robert-zaremba self-assigned this Jul 12, 2021

i-norden reviewed Jul 12, 2021

View reviewed changes

robert-zaremba reviewed Jul 12, 2021

View reviewed changes

Improve wording

a805516

Specify SMT structure

f071c61

roysc mentioned this pull request Jul 31, 2021

Expose leaf node with smt.GetLeaf() celestiaorg/smt#54

Closed

"celestia"

bf2a783

roysc force-pushed the roysc/adr-040-update-smt-leaf branch from d65f218 to bf2a783 Compare August 1, 2021 03:53

robert-zaremba approved these changes Aug 4, 2021

View reviewed changes

robert-zaremba changed the title ~~docs: Update ADR-40 to store hash(value) in SMT leaf~~ docs: Update ADR-040 to store hash(value) in SMT leaf Aug 4, 2021

nits

265d0f6

i-norden approved these changes Sep 7, 2021

View reviewed changes

i-norden suggested changes Sep 7, 2021

View reviewed changes

docs/architecture/adr-040-storage-and-smt-state-commitments.md Outdated Show resolved Hide resolved

roysc added 2 commits September 8, 2021 16:12

correct wording

537f1bd

Merge branch 'master' into roysc/adr-040-update-smt-leaf

5901119

alexanderbez approved these changes Sep 9, 2021

View reviewed changes

robert-zaremba added the A:automerge Automatically merge PR once all prerequisites pass. label Sep 15, 2021

Merge branch 'master' into roysc/adr-040-update-smt-leaf

0973b86

mergify bot merged commit 86b0c0a into cosmos:master Sep 15, 2021

roysc deleted the roysc/adr-040-update-smt-leaf branch September 16, 2021 13:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs: Update ADR-040 to store hash(value) in SMT leaf #9680

docs: Update ADR-040 to store hash(value) in SMT leaf #9680

roysc commented Jul 12, 2021 •

edited

Loading

i-norden left a comment

robert-zaremba Jul 12, 2021 •

edited

Loading

robert-zaremba Jul 12, 2021

i-norden Jul 12, 2021 •

edited

Loading

robert-zaremba Jul 12, 2021

roysc Jul 13, 2021 •

edited

Loading

i-norden Jul 13, 2021

robert-zaremba Jul 14, 2021

robert-zaremba Jul 17, 2021

robert-zaremba Jul 17, 2021

robert-zaremba commented Jul 28, 2021

robert-zaremba commented Jul 28, 2021

roysc commented Jul 30, 2021

robert-zaremba commented Jul 30, 2021

roysc commented Jul 31, 2021

roysc commented Jul 31, 2021

roysc commented Aug 1, 2021

robert-zaremba commented Aug 4, 2021

robert-zaremba left a comment

roysc commented Aug 10, 2021

robert-zaremba commented Aug 11, 2021

robert-zaremba commented Sep 15, 2021


		SMT is a merkle tree structure: we don't store keys directly. For every `(key, value)` pair, `hash(key)` is stored in a path (we hash a key to evenly distribute keys in the tree) and `hash(key, value)` in a leaf. Since we don't know a structure of a value (in particular if it contains the key) we hash both the key and the value in the `SC` leaf.
		SMT is a merkle tree structure: we don't store keys directly. For every `(key, value)` pair, `hash(key)` is stored in a path (we hash a key to evenly distribute keys in the tree) and `hash(value)` in a leaf.

docs: Update ADR-040 to store hash(value) in SMT leaf #9680

docs: Update ADR-040 to store hash(value) in SMT leaf #9680

Conversation

roysc commented Jul 12, 2021 • edited Loading

Description

Author Checklist

Reviewers Checklist

i-norden left a comment

Choose a reason for hiding this comment

robert-zaremba Jul 12, 2021 • edited Loading

Choose a reason for hiding this comment

robert-zaremba Jul 12, 2021

Choose a reason for hiding this comment

i-norden Jul 12, 2021 • edited Loading

Choose a reason for hiding this comment

robert-zaremba Jul 12, 2021

Choose a reason for hiding this comment

roysc Jul 13, 2021 • edited Loading

Choose a reason for hiding this comment

i-norden Jul 13, 2021

Choose a reason for hiding this comment

robert-zaremba Jul 14, 2021

Choose a reason for hiding this comment

robert-zaremba Jul 17, 2021

Choose a reason for hiding this comment

robert-zaremba Jul 17, 2021

Choose a reason for hiding this comment

robert-zaremba commented Jul 28, 2021

robert-zaremba commented Jul 28, 2021

roysc commented Jul 30, 2021

robert-zaremba commented Jul 30, 2021

roysc commented Jul 31, 2021

roysc commented Jul 31, 2021

roysc commented Aug 1, 2021

robert-zaremba commented Aug 4, 2021

robert-zaremba left a comment

Choose a reason for hiding this comment

roysc commented Aug 10, 2021

robert-zaremba commented Aug 11, 2021

robert-zaremba commented Sep 15, 2021

roysc commented Jul 12, 2021 •

edited

Loading

robert-zaremba Jul 12, 2021 •

edited

Loading

i-norden Jul 12, 2021 •

edited

Loading

roysc Jul 13, 2021 •

edited

Loading