This repository has been archived by the owner on Jul 14, 2023. It is now read-only.

Improvements to the networking storage read requests #5

Merged
merged 5 commits into w3f:main on Mar 28, 2023

Conversation


@tomaka tomaka commented Feb 21, 2023

No description provided.


@bkchr bkchr left a comment


Generally looking good! Just some questions.

proposals/0000-net-storage-reads-improvements.md (outdated)

Implementers of the replier side should be careful to detect early on when a reply would exceed the maximum reply size, rather than unconditionally generating a reply, as this could consume a very large amount of CPU, disk I/O, and memory. Existing implementations might currently be accidentally protected from such an attack by the fact that requests have a maximum size, which bounds the list of keys in a query. After this proposal, this accidental protection would no longer exist.

It is not possible for the querier to know in advance whether its query will lead to a reply that exceeds the maximum size. If the reply is too large, the querier should send a different, smaller request. While this is not an ideal solution, it is believed that this problem is rare enough that it's not worth making things more complicated in order to solve it. Ideal solutions could include proofs that can be verified in a streaming way, or sending a partial proof whose remainder can be queried afterwards.


Do we already have any way to indicate that a request was too big? Or how would the requester find out about this?


@tomaka tomaka Feb 27, 2023


The policy right now is the same for all request-response protocols:

  • If the request is too big, the remote abruptly closes (i.e. resets) the substream.

  • If you're asking what happens if the response is too big: the answerer is supposed to cap the size of the response so that it doesn't exceed that limit. If it's not possible to send back any response because it would be too big, then the answerer abruptly closes (i.e. resets) the substream.

This isn't very optimal, but as I've mentioned, we just assume that it can't happen. The limit exists in order to avoid very abusive situations where one party maliciously sends an infinite stream of bytes. This limit isn't meant to ensure granularity in the response.



  • If the request is too big, the remote abruptly closes (i.e. resets) the substream.

This is the answer to my question! Ty.

@bkchr

bkchr commented Feb 27, 2023

@cheme could you please also check this out again?

+message Key {
+ required bytes key = 1;
+ optional bool skipValue = 2; // Defaults to `false` if missing
+ optional bool includeDescendants = 3; // Defaults to `false` if missing


Regarding the request being too big, I would find it neat if it was allowed to reply to only part of the query:

  • the n first keys when there are no descendants
  • the n first keys and possibly an incomplete number of descendants for key n,
    as long as it matches the order of the queried content.
    (The problem of stopping a process in the middle when the proof gets too big is something that is also needed in the PVF context.)

This would be up to the full node to decide where to stop.

The end of content is simply the first key (main key or key in the descendant) that is not part of the proof.

Then the light client could send the query again minus these n first keys.
But to handle the case where the prefix is too big to include all descendants in a single response, I would propose to change this to:
includeDescendant: Option<Option<&[u8]>>
None would mean don't include descendants, Some(None) would mean include descendants starting at the key itself, and Some(Some(&[1, 2, 3u8])) would mean include descendants of the prefix, starting with prefix ++ [1, 2, 3], with no need to include the value for this starting key (but a need to prove that the next key indeed follows it).
That way we can query any range.
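To make the proposed shape concrete, here is a minimal Rust sketch of the `Option<Option<&[u8]>>` semantics described above; the enum, function names, and `resume_start` helper are illustrative, not part of the proposal:

```rust
/// Hypothetical encoding of the proposed `includeDescendant:
/// Option<Option<&[u8]>>` field shape.
#[derive(Debug, Clone, PartialEq)]
enum IncludeDescendants {
    /// `None`: don't include descendants.
    No,
    /// `Some(None)`: include descendants, starting at the key itself.
    FromKeyItself,
    /// `Some(Some(suffix))`: include descendants of the prefix, starting
    /// at `prefix ++ suffix`; the value of that start key is not needed.
    FromSuffix(Vec<u8>),
}

/// First key (under `prefix`) from which descendants would be served.
fn resume_start(prefix: &[u8], mode: &IncludeDescendants) -> Option<Vec<u8>> {
    match mode {
        IncludeDescendants::No => None,
        IncludeDescendants::FromKeyItself => Some(prefix.to_vec()),
        IncludeDescendants::FromSuffix(suffix) => {
            let mut start = prefix.to_vec();
            start.extend_from_slice(suffix);
            Some(start)
        }
    }
}

fn main() {
    // Resuming a prefix query for prefix [1, 2] after key [1, 2, 3]:
    let mode = IncludeDescendants::FromSuffix(vec![3]);
    assert_eq!(resume_start(&[1, 2], &mode), Some(vec![1, 2, 3]));
    assert_eq!(resume_start(&[1, 2], &IncludeDescendants::No), None);
}
```

The point of the `FromSuffix` variant is exactly the "restart an incomplete prefix query" case discussed in this comment.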

+ }
+
+ required bytes hash = 1;
+ required ChildTrieNamespace namespace = 2;


Considering the previous (child trie root hash) mandatory field, this field is not strictly necessary. Though I like having it, as it makes things simple to implement (for RocksDB; otherwise we would need to change the storage).


@tomaka tomaka Feb 28, 2023


(child trie root hash)

Are you assuming that hash means the trie root hash of the child trie?
The way I intended it, hash is the "name" of the child trie.
In other words, the child trie is the one at concat(":child_storage:", namespace, ":", hash)


@cheme cheme Feb 28, 2023


Oh, sorry, I misread; I was assuming the namespace was a different field.
namespace is actually the kind of child trie, which is fine.
So here we pass the child trie's location in the top trie (hash is the child trie's unique id), and the proof will be required to include the proof of the child trie root in the top trie.
I was thinking we would pass the child trie root and then only expect to receive a proof from the child trie itself.
That's fine this way (closer to the previous API).


@Noc2 Noc2 left a comment


@cheme: Just to double-check: I'm happy to merge it if you are also okay with it.

@cheme

cheme commented Mar 24, 2023

Works for me, even if I think includeDescendant should be Option<Option<&[u8]>> instead of just a boolean, to allow restarting a prefix query when the result is incomplete. CC @tomaka

@tomaka

tomaka commented Mar 24, 2023

I think it's not as simple as what you suggest. After all, even if you don't ask for descendants, you could be asking for many keys and the proof could be very large. If we add a system where results are incomplete and you have to resume your query, then IMO it should be done on the entire query and not just for descendants.

Basically, my opinion is: this all seems very complicated to do properly, and we can deal with it later.

@tomaka

tomaka commented Mar 24, 2023

Since compact proofs are completely deterministic, IMO a simple way to add resumable queries is to add a field in the request that indicates after which storage key you want the proof to start.

We could also do this for non-compact proofs if we enforce a deterministic order.

@cheme

cheme commented Mar 24, 2023

I think it's not as simple as what you suggest. After all, even if you don't ask for descendants, you could be asking for many keys and the proof could be very large. If we add a system where results are incomplete and you have to resume your query, then IMO it should be done on the entire query and not just for descendants.

Basically, my opinion is: this all seems very complicated to do properly, and we can deal with it later.

In this case (no descendants, but too many keys queried), the query can be resumed later by just removing the keys present in the resulting set from the query set.
With the descendant case, resuming is not doable.

Since compact proofs are completely deterministic, IMO a simple way to add resumable queries is to add a field in the request that indicates after which storage key you want the proof to start.

This is what I suggested (just that instead of a different field, it changes the type of the boolean to Option<Option<_>>, but a different optional field sounds good to me).

@tomaka

tomaka commented Mar 24, 2023

(just that instead of a different field, it changes the type of the boolean to Option<Option<_>>, but a different optional field sounds good to me)

This seems like two vastly different solutions to me, no?

What I suggest is that if you pass a "resume value" in this new field, then the proof would for example not even contain the node value of the root node. You basically take your proof as a Vec<Vec<u8>> and you remove the first items.

@cheme

cheme commented Mar 24, 2023

Note that if you define it at the query level (rather than the key level), you will need to make it a pair (key, Option<key_under_prefix>).
But it is not strictly necessary, as we can still remove keys from the query (and actually should, to save query size).

@cheme

cheme commented Mar 24, 2023

But defining it at the query level (as you propose, IIUC) would make sense, as we should only need to resume a single prefixed-key query at a time.

@tomaka

tomaka commented Mar 24, 2023

But it is not strictly necessary, as we can still remove keys from the query (and actually should, to save query size).

Uhhh... no?

To give a concrete example of what I suggest:

  1. Let's say you do a request for keys 0xaaaa, 0xaaab1, and 0xaaab2.
  2. The remote sends back the value of the root node of the trie, plus the value of 0xaaaa, plus the value of 0xaaab (a branch node), but then adding any other node would be too large, so it doesn't send the rest.
  3. In turn, the requester sends back the exact same request, but passes 0xaaab as the "resume value".
  4. The remote then sends back the rest of the proof (with 0xaaab1 and 0xaaab2).

No matter what the request is, the remote needs to send back N entries. But instead of sending N entries at once, it sends entries 0..M and then later M..N.

There's no need to remove 0xaaaa from the request. If the previous response already contains everything you need to know the value of 0xaaaa, the responder won't send anything more concerning 0xaaaa anyway.
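The exchange described above can be sketched as a toy round-trip, with an in-memory "responder" standing in for the remote peer. Resuming is done by entry index here for brevity, whereas the actual suggestion resumes by key; `LIMIT`, `answer`, and `download` are illustrative names, not protocol definitions:

```rust
// Illustrative response-size cap, in bytes.
const LIMIT: usize = 4;

/// Responder side: send the next window of proof entries that fits
/// under the size limit, starting after what was already sent.
fn answer(full_proof: &[Vec<u8>], already_sent: usize) -> Vec<Vec<u8>> {
    let mut out = Vec::new();
    let mut size = 0;
    for entry in &full_proof[already_sent..] {
        if size + entry.len() > LIMIT {
            break;
        }
        size += entry.len();
        out.push(entry.clone());
    }
    out
}

/// Requester side: repeat the same request, resuming each time,
/// until the whole proof has been received.
fn download(full_proof: &[Vec<u8>]) -> Vec<Vec<u8>> {
    let mut received: Vec<Vec<u8>> = Vec::new();
    while received.len() < full_proof.len() {
        let chunk = answer(full_proof, received.len());
        // A single oversized entry would stall forever; a real
        // implementation must also bound individual entry sizes.
        assert!(!chunk.is_empty());
        received.extend(chunk);
    }
    received
}

fn main() {
    // Four 2-byte entries with LIMIT = 4: served in two rounds of two.
    let proof: Vec<Vec<u8>> = (0u8..4).map(|i| vec![i, i]).collect();
    assert_eq!(answer(&proof, 0).len(), 2);
    assert_eq!(download(&proof), proof);
}
```

This mirrors the "entries 0..M, then M..N" framing of the comment above: the responder always regenerates the same ordered proof and simply serves a different window of it.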

@cheme

cheme commented Mar 24, 2023

Why don't you resume directly from "0xaaab1"? (Trying to manage things at the node level sounds complex.)

In your scenario, if you got a request with prefix "0xaa" and can only reply with "0xaaaa" and "0xaaab1", how do you get "0xaaab2"?

@tomaka

tomaka commented Mar 24, 2023

In your scenario, if you got a request with prefix "0xaa" and can only reply with "0xaaaa" and "0xaaab1", how do you get "0xaaab2"?

You send the same request and pass 0xaaab1 as the resume value
(if the resume value means "strictly superior" like I had in mind, but could also be 0xaaab2 if it means "superior or equal")

What I suggest is literally doing something like this but a bit more optimized:

fn answer_request(requested_keys: ..., resume_value: ...) -> ... {
    // Generate the full proof, then skip the entries that the
    // requester already has.
    let mut proof: Vec<...> = generate_proof(requested_keys);
    while proof.first().map_or(false, |elem| elem.key < resume_value) {
        proof.remove(0);
    }

    // Keep as many of the remaining entries as fit under the limit.
    let mut truncate_pos = 0;
    let mut size = 0;
    for elem in proof.iter() {
        if size + elem.size >= LIMIT { break; }
        size += elem.size;
        truncate_pos += 1;
    }
    proof.truncate(truncate_pos);

    proof.encode()
}

@cheme

cheme commented Mar 24, 2023

Ah ok, I get you: your resume value is a lower-bound filter. Exclusive sounds ok (even if I don't think it was already implemented this way for ranges (the includeDescendant case); I think it just means being inclusive of the value hash of the bound), it sounds good.

On the other hand, your code snippet completely lost me; generating a proof with all keys only to truncate it afterwards is expensive.
We should probably just remove the first keys from requested_keys.

I would rather manage the limit at the key level (only stop recording after a value is accessed, so there are no orphan nodes).
For instance, if you use a kind of recorder over a db:

fn record(accessed_node: Vec<u8>, proof: &mut Vec<Vec<u8>>, last_key_size: &mut usize, has_value: bool) -> bool {
    // `last_key_size` is the proof length as of the last recorded
    // value, so truncating to it never leaves orphan nodes.
    let proof_size: usize = proof.iter().map(|node| node.len()).sum();
    if proof_size + accessed_node.len() > LIMIT {
        proof.truncate(*last_key_size);
        return false;
    }
    proof.push(accessed_node);
    if has_value {
        *last_key_size = proof.len();
    }
    true
}

with has_value being an inline value in the node, or a value node.

@tomaka

tomaka commented Mar 27, 2023

Exclusive sounds ok

I actually think that inclusive is a bit better. This way, an empty resume value and a missing resume value are equivalent.

only stop recording after a value is accessed so there is no orphan nodes

I don't think it's a problem to send orphan nodes, is it?

The problem that this resume value solves is to handle the situation where a proof is larger than the arbitrary limit that we have decided for the response size.

This response size limit exists in order for the sender to be sure that "something is happening", in other words that it is not being sent an infinite stream of randomly-generated data. It's not a limit to the size of a proof, but to the size of a networking message.
Therefore, the sender should be able to (and is expected to) buffer the entire proof.

In fact, we strictly speaking don't even need a limit. Every piece of information in a Merkle proof is either bounded in size or prefixed with its size. And if nodes in the proof are strictly ordered (like I suggest we enforce), then the recipient of the proof can be sure that what it is receiving is a correct proof while still receiving it. In other words, the receiver can be sure that the sender of the proof isn't malicious while the streaming is still going on.
EDIT: actually no, because the sender of the proof could send you an infinitely-sized value, and you can't verify the hash of this value before having downloaded it entirely, but I'm personally ok with having a limit to the size of a storage value
However, we would still need a resume value in case where the sender of the proof abruptly disconnects and we want to download the rest of the proof from a different node.

What I want to express is that to me the resume value isn't really "related" to the content of the proof, but to its form. It's more a way to resume the download of a proof whose download was interrupted for some reason.

@cheme

cheme commented Mar 27, 2023

I actually think that inclusive is a bit better. This way, an empty resume value and a missing resume value are equivalent.

I had made up my mind for exclusive (for the prefix case it is better, as we avoid a value; we need to resume from the previously obtained value anyway). It makes resuming straightforward (last read value).

I don't think it's a problem to send orphan nodes, is it?

No, not really, as long as they are ordered.

In fact, we strictly speaking don't even need a limit.

For the streaming case, it is then not a limit but a bandwidth cap, to insert delays while serving.
If sending a compact proof though, then we need this, as the root is only complete and checkable when all nodes are received (and all omitted hashes rebuilt).
So I kind of like the idea of a limit.

actually no, because the sender of the proof could send you an infinitely-sized value

The issue of being able to check only at the end of the big value looks similar to what I mentioned for compact proofs; the difference is only that in this case, even with a limit, you cannot know whether the value is correct until you get the last byte of the value (the last limited query).
Generally, this kind of big value should be chunked in the trie. (Alternatively, we could use a hashing scheme like blake3, where it is theoretically doable to produce a proof for a single chunk, but chunking big values in the trie sounds a lot more OK to me, at least for now.)

personally ok with having a limit to the size of a storage value

Definitely something that chains should put in place, but I feel like at the Substrate level it could be free (a limit defined in the runtime).

if nodes in the proof are strictly ordered (like I suggest we enforce)

Yes.
Note that ordering the nodes would need a spec (IIRC, I always felt that the current compact proof, which has ordered nodes, does not use the natural ordering, which should just be lexicographic order).

All in all, I believe your idea of passing a lower-bound threshold for resuming is good, and I would be happy to see it in the PSP (rather an exclusive one) and to implement it.
I will be fine with the ordering too (even if I would then prefer something like w3f/PSPs#56; maybe a simpler encoding where all hashes are written with values the first time they are met, even if less good in my opinion), but that can be for later. Just pointing out that sorting is similar to the compact proof.

@tomaka

tomaka commented Mar 27, 2023

I've updated the proposal.


@cheme cheme left a comment


Thanks for the late change (will try to implement that this week).

@@ -84,14 +85,17 @@ If `skipValue` is `true` for a `Key`, then the value associated with this key is

If `includeDescendants` is `true` for a `Key`, then the replier must also include in the proof all keys that are descendant of the given key (in other words, its children, children of children, children of children of children, etc.). It must do so even if `key` itself doesn't have any storage value associated to it. The values of all of these descendants are replaced with their hashes if `skipValue` is `true`, similarly to `key` itself.

The optional `onlyKeysAfter` field can provide a lower bound for the keys contained in the proof. The responder must not include in its proof any node whose key is inferior or equal to the value in `onlyKeysAfter`. If no `onlyKeysAfter` is provided, the response must start with the root node of the trie.
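A minimal sketch of a literal reading of this rule, with proof entries modeled as (key, node) pairs in ascending key order. The function name and data layout are illustrative only; as the review comments below discuss, a real implementation must also decide how the bound interacts with node keys versus value keys:

```rust
/// Drop proof entries whose key is at or below the exclusive lower bound.
fn apply_only_keys_after(
    proof: Vec<(Vec<u8>, Vec<u8>)>,
    only_keys_after: Option<&[u8]>,
) -> Vec<(Vec<u8>, Vec<u8>)> {
    match only_keys_after {
        // No bound: the response starts with the root node.
        None => proof,
        // Bound present: keep only strictly greater keys.
        Some(bound) => proof
            .into_iter()
            .filter(|(key, _)| key.as_slice() > bound)
            .collect(),
    }
}

fn main() {
    let proof = vec![
        (vec![], b"root".to_vec()),
        (vec![0xaa], b"branch".to_vec()),
        (vec![0xaa, 0xbb], b"leaf".to_vec()),
    ];
    // Resume after key [0xaa]: the root and the [0xaa] node are dropped.
    let rest = apply_only_keys_after(proof, Some(&[0xaa]));
    assert_eq!(rest.len(), 1);
    assert_eq!(rest[0].0, vec![0xaa, 0xbb]);
}
```

Note that dropping every entry at or below the bound includes ancestors of later keys, so a resumed response is not a self-contained proof; the requester is expected to combine it with the entries it already received.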

Regarding this sentence, it can be a bit confusing for implementers, in the sense that when we are doing a prefix query, we query from this bound (if the bound is under the prefix), and will include in the proof the nodes containing the key (needed to show that the next key follows this bound), but not its value.
This may be a bit too implementation-oriented.


@tomaka tomaka Mar 27, 2023


I don't understand what you're saying


Just that the exclusive bound, when we do a range proof (the prefix case), will include the restart key even if exclusive (but not the value).

@tomaka (Contributor Author)

I don't understand why


Let's say your range query stopped at key (nibbles, not bytes) [1, 2, 15]; then you ask again starting exclusively at [1, 2, 15], so you will need to descend into the key to check whether [1, 2, 15, X] exists before going into [1, 2, 16], and you will include the [1, 2, 15] key (not the value).

@tomaka (Contributor Author)

but it means that the emitter of the query also keeps in memory that there are no children of the node

Yes, that's intended

I feel like it would be confusing to put nibbles in the query; we only use keys everywhere, and users of the RPC should not need to know anything about the state storage.

That's not the RPC layer, that's the networking protocol.

The concept of nibbles is deeply tied with the trie and with Merkle proofs. You can't do anything with the proof you receive in the response without knowing what nibbles are.


@cheme cheme Mar 27, 2023


Makes sense. I still think it is too complicated (statelessness has value in my mind).
If I want to do what you suggest, I would need:

struct Cursor {
  key: Vec<u8>,
  padded: bool, // indicates whether the last nibble of `key` is padding
  next_child: Option<u8>, // next item to iterate, or the parent if `None`
}


@cheme cheme Mar 27, 2023


Ah, I get where you're going: next_child doesn't need to be an Option, as if it is the parent, we already got the node in a previous query and should just restart at the parent and the next sibling.
And if you define the next child as the last nibble of key, you don't need the field. Then you also don't repeat the parent nodes in the proof.
It would work, and it is stateless from the server point of view, though not from the client point of view. It is not really compatible with compact proofs (or it also cannot be stateless from the server point of view).
Still sounds a bit complex to me.


Actually, a compact proof local to a reply would work.
It would be ok then (we just need to define onlyKeysAfter as nibbles, and add a boolean to indicate whether it is padded).

Actually, when reading "The responder must not include in its proof any node whose key is inferior or equal to the value in onlyKeysAfter.", I was thinking of the value key, not the node key.
So this means all parent keys will not be included; that's a bit fun, as the proof will jump in sequence, but it's implementable.
From the client side, it means keeping a stack of nodes in memory and passing the cursor; finally I understand what you proposed.

We can just add the padding boolean to onlyKeysAfter and we're good.


Just for other readers: in the trie crate used by Substrate, a node key is called a node prefix.

proposals/0000-net-storage-reads-improvements.md (outdated)
@Noc2

Noc2 commented Mar 28, 2023

I'm going ahead and merging it. But if we don't move this repo to https://github.com/polkadot-fellows any time soon, I will give you access to this repo and the rights to approve PRs.

@Noc2 Noc2 merged commit f0383dd into w3f:main Mar 28, 2023
@tomaka

tomaka commented Mar 28, 2023

@Noc2 This wasn't finished, we were still talking about modifications.

@Noc2

Noc2 commented Mar 28, 2023

@tomaka, I thought you implemented the changes, and @cheme approved it (see above). Feel free to create a new PR to update it.

@tomaka

tomaka commented Mar 28, 2023

Please revert the merge

@cheme

cheme commented Mar 28, 2023

Maybe I should have reverted my approval when we saw a slight issue, but it was @tomaka who saw it. I am still ok with continuing the conversation here and having a small patch in a different PR, but reverting would be easier, yes (only a slight change seems needed).
I did look at my implementation options, and restarting at the "node key" level (skipping the parent nodes) will be a bit more work than if we just restarted from a key (it needs some primitives at the trie level), but ok.

@Noc2

Noc2 commented Mar 28, 2023

Please revert the merge

Done.
