@@ -321,7 +321,7 @@ class KadDHT extends EventEmitter {
    waterfall([
      (cb) => utils.convertBuffer(key, cb),
      (id, cb) => {
        const rtp = this.routingTable.closestPeers(id, c.ALPHA)
Seeing as this code is the same for every query, maybe it should be part of the Query itself?
Agreed
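For illustration, a minimal sketch of what moving that lookup into the Query itself could look like, reusing the `utils.convertBuffer` and `routingTable.closestPeers` calls from the diff above; the Query shape and method names here are hypothetical, not code from this PR.

```js
// Hypothetical sketch: Query seeds itself from the routing table so callers
// don't repeat the closestPeers lookup. Not the actual implementation.
const utils = require('../utils') // same helper used in the diff above

class Query {
  constructor (dht, key) {
    this.dht = dht
    this.key = key
  }

  run (callback) {
    utils.convertBuffer(this.key, (err, id) => {
      if (err) return callback(err)
      // seed the query with the closest peers we already know about
      const startPeers = this.dht.routingTable.closestPeers(id, this.dht.kBucketSize)
      this._run(startPeers, callback)
    })
  }

  _run (startPeers, callback) {
    // the existing disjoint-path logic would run from here
    callback(null, startPeers)
  }
}

module.exports = Query
```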
Here are some metrics for the updated queries. While times are still high, this is a significant improvement over starting with `α` (3) peers. Notable metrics:
* I believe the spike in time for 10s was due to a CPU issue on that run.
Very interesting findings! I'm surprised that so many peers would take 30s to respond; do you have any idea why that might be? It shouldn't take long to get the closest peers from the k-table, so I assume that must be network latency, right?
I'm going to run some more tests. I'm thinking this may be the actual connect that's taking so long, as I was previously using a timeout on the entire …
Great work with those metrics @jacobheun! This way we have a more concrete way of defining a proper timeout value for the queries.
I started seeing some queries running for 10+ minutes (I ended up cancelling them). The error rate of the last one was only 24.23%, but it had queried 1077 peers, which is insane. I was also surprised to see that when I went to reprovide the same id, without restarting my node, the query times didn't really improve. I expected shorter query times since we should have the closest peers from the last query in our routing table. I'm going to look at adding a more verbose test suite around this.
In theory it should stop once there are no more closer peers to be queried than the K closest it already knows about, so it's very unlikely it would get to > 1,000 peers if that check is working correctly. I wonder if there's a bug there. Are you able to repro that test? If so it would be great to examine the DHT keys of the peers that it queries, and the peers waiting to be queried in each queue, to ensure that it's examining them in the correct order.
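For reference, a hedged sketch of that termination check (not this repo's exact code): a candidate is only worth querying if it is closer to the key than the furthest of the K closest peers already seen.

```js
// Illustrative only: XOR distance compared as big-endian byte strings.
function xorDistance (a, b) {
  const out = Buffer.alloc(a.length)
  for (let i = 0; i < a.length; i++) out[i] = a[i] ^ b[i]
  return out
}

// kClosestSoFar is assumed sorted by distance to keyId, closest first
function shouldQuery (candidateId, kClosestSoFar, keyId, k) {
  if (kClosestSoFar.length < k) return true
  const furthest = kClosestSoFar[kClosestSoFar.length - 1]
  return Buffer.compare(
    xorDistance(candidateId, keyId),
    xorDistance(furthest, keyId)
  ) < 0
}
```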
I ran into the issue a few times on the network but didn't get a dump of the queries. I'll be working on a test suite today to try to reproduce it before I do any further tweaks. One issue I see is …
          // If we've queried enough peers, stop the queue
          if (!continueQuerying) {
            this.stop()
Adding `this.stop()` here improves speed reliability quite a bit. Total query times are usually around or under 80 seconds. While this isn't great, it's better than it currently is. The side effect is that we might lose closer peers from the other, parallel, responses in this queue, since it runs at a concurrency of 3.
I lowered the dial timeout (to 5s) on Switch for the js-ipfs instance I am running locally when testing against the live network. This also contributed to the improved speed.
One thing the Kademlia papers aren't particularly clear on (I may have missed it) is the behavior of the Disjoint Paths in terms of completing the query. Currently, we wait for all paths to finish before the query is finished. Should we? The more paths we allow to finish, the closer we get to the actual closest peers, but the query takes significantly longer. If we stop after a single Disjoint Path has completed, the query time is dramatically reduced, but it's also less accurate. Long-running peers might mitigate the accuracy problem as their routing tables improve.
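To make that trade-off concrete, here is a rough sketch of the two completion strategies, assuming the async library's `parallel` helper; the path objects and function names are illustrative, not this module's API.

```js
const parallel = require('async/parallel')

// Current behaviour: the query only completes once every disjoint path finishes.
function runAllPaths (paths, callback) {
  parallel(paths.map((path) => (cb) => path.run(cb)), callback)
}

// Faster but less accurate: resolve on the first completed path; slower paths
// may still have been converging on closer peers.
function runFirstPath (paths, callback) {
  let done = false
  paths.forEach((path) => path.run((err, res) => {
    if (done) return
    done = true
    callback(err, res)
  }))
}
```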
The big problem I am seeing now is slow/dead peers. My assumption is that the slow paths are hitting nodes that either aren't pruning their routing tables effectively, or the network is just that much in flux (nodes hitting the network and then going offline regularly). To remediate this we're either going to have to get really aggressive with the time we're allowing for connections and queries, or we'll have to be sloppier with what constitutes "closest". If we want to tighten down timeouts more, I think we'll want to add timeout options to `Switch.dial`. I've been changing this globally for testing, but DHT expectations vs global dial expectations are probably different.
A note: I played around with a concurrency of 6 instead of 3, but I actually observed a negative impact on query times.
@dirkmc @vasco-santos thoughts? I'm tempted to try things with the "sloppy" and fast approach and see how fetching performs on the network with that.
Excellent work @jacobheun, these are solid improvements, taking orders of magnitude off the query time.
I agree that we should add timeout options to `Switch.dial()`. I think the main reason that we see so many slow/dead peers is that nodes behind NATs are advertising their peer ids to the network. I believe this is the main problem in the go DHT and a strong impetus for using hole punching techniques. Ideally nodes would have a mechanism to check whether they're behind a NAT before advertising their peer ids, and only do so if they are reachable through a relay.
I believe it's correct to wait for all disjoint paths to finish before returning. A streaming API should mitigate this problem.
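As a hedged sketch, a streaming interface here might emit closer peers as each disjoint path reports them rather than waiting for the slowest path; this interface does not exist in the module and the names are illustrative.

```js
const { EventEmitter } = require('events')

function streamQuery (paths) {
  const emitter = new EventEmitter()
  let remaining = paths.length
  paths.forEach((path) => {
    path.run((err, peers) => {
      // surface results as soon as any path has them
      if (!err && peers && peers.length) emitter.emit('peers', peers)
      if (--remaining === 0) emitter.emit('end')
    })
  })
  return emitter
}
```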
I would be interested in seeing how well the sloppy approach works, but it sounds like it would need to be run in a simulated network to understand how it behaves in the wild (unless there is already some research out there?).
Something I realized, which is likely contributing to some of the workers getting "blocked", is peer addresses. If we encounter a peer that is offline and/or has a lot of bad addresses associated with it, the Switch timeout applies per dial attempt, not per connection. If we used the default dial timeout of 10 seconds, and the peer had 20 unique addresses, it could take us 200 seconds to go through all of the addresses. We don't currently do parallel dials in the Switch on the same transport, as we can't yet abort dials properly when a connection succeeds. We will probably need to mitigate this with a temporary timeout on the entire connection attempt until the async iterators work is completed for transports, which will include abort logic so we can go back to using parallel dials.
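A minimal sketch of that mitigation, assuming the async library's `timeout` helper: cap the total time spent on one peer so that 20 bad addresses can't turn into 20 × the per-dial timeout. The wrapper itself is the assumption; `sw.dial(peer, protocol, callback)` is the existing Switch call.

```js
const timeout = require('async/timeout')

function dialWithOverallTimeout (sw, peer, protocol, ms, callback) {
  const attempt = (cb) => sw.dial(peer, protocol, cb)
  // fails with err.code === 'ETIMEDOUT' if the whole connection attempt exceeds ms
  timeout(attempt, ms)(callback)
}
```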
> because nodes that are behind NATs are advertising their peer ids to the network
Agreed, the NAT manager is on my list for this quarter, I may look at reprioritizing things a bit as this will help with performance in general.
I will look at adding an option to timeout the entire `Switch.dial`. I think this should help lower the high end of the queries.
If we go with the sloppy worker stop approach, and allow all disjoint paths to finish, we should still see reasonably close results. From https://github.com/libp2p/research-dht/issues/6, we could look at some other techniques to supplement the sloppy approach (a rough sketch of the first follows the list):
> * On a successful get, we re-publish the value to nodes on the lookup path. Subsequent gets to hot keys will find the result without a full lookup. This is part of the original Kademlia algorithm that doesn't appear to be implemented.
> * Sloppy provides. Analogous to re-publishing along the lookup path on get, provide doesn't need to go all the way to the "correct" node. If there are a lot of providers for a given key, new providers should be added to nodes on the lookup path. This is one of the fundamental innovations in Coral [3].
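A rough sketch of the first technique above (re-publishing along the lookup path); `sendPutValue` is a stand-in name for whatever RPC the DHT uses, not this module's API.

```js
const each = require('async/each')

// After a successful get, push the value back to the peers we traversed so
// subsequent gets for hot keys resolve earlier. Best effort: per-peer errors
// are ignored.
function republishAlongPath (dht, key, value, pathPeers, callback) {
  each(pathPeers, (peer, cb) => {
    dht.sendPutValue(peer, key, value, () => cb())
  }, callback)
}
```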
> Total query times are usually around or under 80 seconds
Not ideal, but a great improvement already! 👏
> If we want to tighten down timeouts more, I think we'll want to add timeout options to `Switch.dial`.
I believe this makes perfect sense for the DHT dials. As we will end up connecting with so many peers in the wild, the probability of failure will be super high and we should get to the next peer fast. So, I am totally in favor of this.
Regarding the disjoint paths, I also agree with @dirkmc that ideally we should wait for all of them. However, we can experiment with that change, but I think it should be in a new PR, after this one.
The sloppy approach could also be interesting to experiment with; do you prefer to continue it here?
I added the timeout to the query function. It seemed unnecessary to go about changing the Switch `dial` interface at this point. Once we have parallel dial aborts working, that should allow us to potentially remove the timeout here. During the Switch async refactor it might make sense to add the option to override timeouts per dial.
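A hedged sketch of what that per-query timeout can look like, assuming the async library's `timeout` helper; the 60-second value and the `query.run` signature are assumptions, not necessarily what the PR uses.

```js
const timeout = require('async/timeout')

function runWithQueryTimeout (query, startPeers, ms, callback) {
  const run = (cb) => query.run(startPeers, cb)
  // err.code === 'ETIMEDOUT' when the query runs past ms
  timeout(run, ms)(callback)
}

// usage sketch: cap a query at 60 seconds
// runWithQueryTimeout(query, startPeers, 60 * 1000, (err, res) => { /* ... */ })
```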
> The sloppy approach could also be interesting to experiment with; do you prefer to continue it here?
Right now this is sloppy. We're still running all disjoint paths, but we let workers end earlier than they might normally. One of the new tests validates the possibility of missing closer peers.
I want to look at adding a basic simulation test that we can look at running regularly when we attempt performance updates. But other than that, I think this is in a state worth trying out, and the next improvements will likely come from Switch and NAT improvements.
        }, cb)
      }
    ], (err) => callback(err))
    ], (err) => {
      if (errors.length) {
This should help mitigate #105. While the provide will still technically fail, we'll at least have attempted to provide to all peers.
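A sketch of the pattern in the diff above: collect per-peer errors instead of aborting on the first failure, so every peer is attempted, and only surface an error once all attempts are done. Helper names are illustrative.

```js
const each = require('async/each')

function provideToAll (peers, sendProvide, callback) {
  const errors = []
  each(peers, (peer, cb) => {
    sendProvide(peer, (err) => {
      if (err) errors.push(err)
      cb() // never fail the whole loop for a single peer
    })
  }, () => {
    if (errors.length) {
      return callback(new Error(`Failed to provide to ${errors.length} of ${peers.length} peers`))
    }
    callback()
  })
}
```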
I put together a basic simulation script in the tests folder that can be run via … Here's sample output for a 10k node network assuming 30% offline/super slow peers:
I pulled the sim code into the master branch to see the difference; it's significant and looks in line with query times I was seeing on the live network.
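For context, this is roughly the kind of peer stub such a simulation needs, where a configurable fraction of peers are offline or extremely slow; it is purely illustrative and not the script added in this PR.

```js
function makeSimulatedPeer (id, { deadRate = 0.3, slowMs = 30000 } = {}) {
  const isDead = Math.random() < deadRate
  return {
    id,
    getCloserPeers (key, callback) {
      if (isDead) {
        // behaves like an offline/NATed peer: only errors after a long delay
        return setTimeout(() => callback(new Error('dial timed out')), slowMs)
      }
      setImmediate(() => callback(null, [] /* closer peers would be computed here */))
    }
  }
}
```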
Wow, great performance improvements! It's very useful to have a simulator as well. Something else that may help with a sloppy approach would be to implement libp2p/go-libp2p-kad-dht#323.
I added the ability to override concurrency via DHT options, as this will be nice for people to be able to configure. I also bumped the default up to 6. With a concurrency of 6 and the sloppy worker behavior, the query times were a bit lower and more consistent, and the results of the queries were more consistent as well. I updated the simulation to do a final check of common closest peers from each run.

Concurrency of 6, "sloppy" workers off:
Concurrency of 6, "sloppy" workers on:
While there is randomness to the simulation network between runs (perhaps an enhancement would be to generate networks prior to doing a run so you could leverage the same network to test against), the results across runs are fairly consistent. I want to look at CPU usage with these settings against js-ipfs, but I think these will make a big impact until we can resolve the network latency issues and dead/slow peer timeouts, which are now the major inhibitors to query times.
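For reference, a hedged usage sketch of passing the concurrency option when constructing the DHT; the `concurrency` key follows the description above but should be confirmed against the source, and `sw` is an already-created libp2p Switch.

```js
const KadDHT = require('libp2p-kad-dht')

function createDHT (sw) {
  return new KadDHT(sw, {
    kBucketSize: 20, // K
    concurrency: 6   // parallel workers per disjoint path, the new default discussed here
  })
}
```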
Republish will be really important, even without us being sloppy. Offline nodes that are in peer tables will make it improbable that we'll get the actual 20 closest peers, and the actual peers we get are going to vary a lot as the network changes.
Nice 👍 The timing looks a lot better. Which is the piece of code that implements "sloppiness"?
It's the added stop at js-libp2p-kad-dht/src/query/workerQueue.js, lines 132 to 136 (commit a1c9890).
This is looking pretty good; let me know when you'd like a code review.
A review would be good. The performance tests I'll do this week on js-ipfs shouldn't change much here aside from adjusting config options.
A couple of nits, but otherwise LGTM 👍
Co-Authored-By: jacobheun <[email protected]>
Great work Jacob! 👏
The changes look great and the queue codebase is way better structured now. Also the performance improvements are super good, as well as the simulator!
Just detected an option missing in docs.
src/index.js (outdated)
    /**
     * Number of closest peers to return on kBucket search, default 20
     *
     * @type {number}
     */
    this.ncp = options.ncp || c.K
    this.ncp = options.ncp || this.kBucketSize
Can you add docs to `options.ncp`?
So, looking through the code, we actually only use `dht.ncp` in one spot. I kind of think we should just get rid of it altogether and replace that occurrence with `dht.kBucketSize`. While it could potentially provide some finer-grained control of the number of results we're returning, it's not doing that, and I think it adds more confusion than it's worth right now. Thoughts?
Agree 👍
Sounds good!
`ncp` is now gone in favor of `kBucketSize`.
@jacobheun, with these changes do you think we can remove the shallow getClosestPeers() behaviour on PUT?
We should be able to. It's still going to be relatively slow until we can improve the connection latency. I think we can do some comparative tests against this and a follow-up PR for removing shallow to see what the impacts are. Shallow is definitely going to be much less accurate.
LGTM
🚢
This PR is aimed at improving performance of the DHT.
Things being reviewed:
* `k` (20) peers instead of `α` (3), per S/Kademlia

Refs: