Degrade provider handling quality gracefully under load #730
base: master
Conversation
That way, overloaded nodes can drop provides.
And return when the process is closing. This will help speed up the main loop a bit.
Let's assume there's one provider (or zero) for a record, and everyone is looking for 10 providers.
- Before: peers would wait for all peers along the path to return this one provider. However, because there's only one provider, peers won't be able to short-circuit anyway.
- Now: peers will go to the end of the path before waiting. This may make some queries slower, but it attempts to give "priority" to peers that actually _need_ responses, as opposed to peers that are "optimistically" waiting for responses.
Still need to come back to take a look more carefully, but a few thoughts:
1.b. #729 (comment): and yes, you're right, we need to track this.
In theory? But I still feel like this change is a strict improvement. It'll only kick in when overloaded anyways.
Really, I don't think there's too much to tune here.
```diff
@@ -366,7 +390,10 @@ func (dht *IpfsDHT) handleAddProvider(ctx context.Context, p peer.ID, pmes *pb.M
 		// add the received addresses to our peerstore.
 		dht.peerstore.AddAddrs(pi.ID, pi.Addrs, peerstore.ProviderAddrTTL)
 	}
-	dht.ProviderManager.AddProvider(ctx, key, p)
+	err := dht.ProviderManager.AddProviderNonBlocking(ctx, key, p)
```
Why not do a `mybucket` style check here as well?
As far as I understand, we're always in the bucket here, right?
👍
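For context on the diff above: the handler swaps the blocking `AddProvider` call for `AddProviderNonBlocking`, so an overloaded node can drop a provide instead of stalling the RPC handler. The PR's actual implementation isn't shown here; the following is only a minimal sketch of what such a non-blocking enqueue could look like, with the channel, struct, and error names assumed for illustration.

```go
// Minimal sketch of a non-blocking provider enqueue. The channel, struct
// fields, and error below are illustrative assumptions, not the library's
// actual internals.
package providersketch

import (
	"context"
	"errors"

	"github.com/libp2p/go-libp2p-core/peer"
)

var ErrOverloaded = errors.New("provider queue full; dropping provide")

type addProv struct {
	key  []byte
	prov peer.ID
}

type ProviderManager struct {
	newProvs chan *addProv // bounded queue drained by the manager's event loop
}

// AddProviderNonBlocking enqueues the record if the queue has room and
// otherwise returns immediately, so an overloaded node sheds load instead
// of piling up blocked handler goroutines.
func (pm *ProviderManager) AddProviderNonBlocking(ctx context.Context, key []byte, prov peer.ID) error {
	select {
	case pm.newProvs <- &addProv{key: key, prov: prov}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	default:
		return ErrOverloaded
	}
}
```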
Ok, I think you've reasonably convinced me. Since the datastore is batching, it seems reasonable to expect that puts should be quick here (as long as they're not blocked on the event queue by gets), in which case, if your datastore is slow, there's not much to be done about anything else here (aside from queues for bursts, like you mentioned). However, since gets can block puts, we still need to do some estimation of the queue size (and parallelism) required for what we expect an average node to need. I suspect this won't be so awful, given that the existing networks are pretty functional without this, so we mostly need some conservative estimates and can let power users like infra providers tune more accurately over time.
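To make the blocking concern concrete, here is a toy version of a serial provider event loop in which puts and gets are serviced inline; every put queued behind a slow get has to wait for it. The types and structure are illustrative assumptions, not the provider manager's real code.

```go
// Toy model of a serial provider event loop, to show why a slow datastore Get
// stalls any puts queued behind it. All names here are illustrative, not the
// library's actual internals.
package providersketch

import "context"

// datastore stands in for the real (batching) datastore.
type datastore interface {
	Put(key, value []byte) error
	Get(key []byte) ([]byte, error)
}

type addReq struct{ key, value []byte }

type getReq struct {
	key  []byte
	resp chan []byte
}

func runProviderLoop(ctx context.Context, ds datastore, adds <-chan addReq, gets <-chan getReq) {
	for {
		select {
		case a := <-adds:
			// Puts go through a batching datastore, so they are cheap.
			_ = ds.Put(a.key, a.value)
		case g := <-gets:
			// A slow read here blocks every queued put until it returns;
			// queue depth (and get parallelism) has to budget for that.
			v, _ := ds.Get(g.key)
			g.resp <- v
		case <-ctx.Done():
			return
		}
	}
}
```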
What about having get workers? Basically, we could hand gets off to a small pool of workers instead of servicing them inline in the event loop.
Doing this without blocking the main event loop is going to be a bit tricky, but doable.
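One possible shape for those get workers, reusing the toy `datastore`, `addReq`, and `getReq` types from the sketch above: the event loop keeps servicing puts and only dispatches gets to a bounded worker pool, so a slow read stalls at most one worker. The worker count, channel layout, and load-shedding fallback are assumptions, not the design proposed in this thread.

```go
// Sketch of "get workers": the event loop keeps servicing puts and hands each
// get to a bounded pool, so a slow datastore read stalls at most one worker.
// Reuses the toy datastore, addReq, and getReq types from the previous sketch.
package providersketch

import "context"

const getWorkers = 8 // assumed default; a real value would need tuning

func runLoopWithGetWorkers(ctx context.Context, ds datastore, adds <-chan addReq, gets <-chan getReq) {
	work := make(chan getReq, getWorkers)

	// The workers only read from the datastore, which is assumed to be safe
	// for concurrent use, so they can run alongside the event loop.
	for i := 0; i < getWorkers; i++ {
		go func() {
			for {
				select {
				case g := <-work:
					v, _ := ds.Get(g.key) // a slow read now stalls only this worker
					g.resp <- v
				case <-ctx.Done():
					return
				}
			}
		}()
	}

	for {
		select {
		case a := <-adds:
			_ = ds.Put(a.key, a.value)
		case g := <-gets:
			select {
			case work <- g: // dispatch without doing the read inline
			default:
				// All workers are busy and the buffer is full. Handing the
				// request off without blocking the loop is the tricky part;
				// this toy version simply sheds it.
				close(g.resp)
			}
		case <-ctx.Done():
			return
		}
	}
}
```

Whether to shed, buffer, or block when every worker is busy is exactly the sizing question raised a couple of comments up.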
There's a bunch of servers using this library to subscribe to a topic and publish messages to one another, and from time to time there are huge goroutine spikes that straight up take the process down due to memory usage. I'm fairly certain that I'm experiencing the same issue that Steven is trying to tackle here: all of those goroutines get stuck on the select, waiting for their incoming request to be handled.

It's still unclear why there are suddenly those spikes in requests, but at the very least I want the processes to not be taken down that easily by a spike of 50-100k requests, so I think this pull request would help.

@Stebalien do you need help with reviews or testing? I'm not an expert in go-libp2p or this particular library, but I'm happy to help where I can, or to run this branch for a few days if you think it's ready enough to give it a go.
The one thing I still wanted was "get" workers. At the moment, gets are serial and puts can easily get backed up on a single slow get.
@petar when performance of large DHT nodes (e.g. the hydras) comes up on your radar again, could you take a look at this and decide if it's ready for merge as-is, or what changes need to be made?
This pulls in libp2p/go-libp2p-kad-dht#730, which currently sits on a branch shortly after 0.12.2. Essentially, this handles priority for requests a bit better, and drops unimportant requests if they come in too fast. This should prevent kad-dht from using tons of memory. Updates vocdoni#243.
This PR contains two changes to scale provider handling:
Ideally we'd have some form of parallel provider record retrieval from the datastore (one possible shape is sketched below), but this is still a good first step.
fixes #675
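On the parallel-retrieval point above, here is a sketch of what fanning out per-key datastore reads could look like, again reusing the toy `datastore` interface from the earlier sketches; the bounded errgroup and the limit of 8 are placeholders, not the library's approach.

```go
// Sketch only: fetch several provider records concurrently, bounding the
// number of simultaneous datastore reads. Reuses the toy datastore interface
// defined earlier; the limit of 8 is an arbitrary placeholder.
package providersketch

import "golang.org/x/sync/errgroup"

func getProvidersParallel(ds datastore, keys [][]byte) ([][]byte, error) {
	var g errgroup.Group
	g.SetLimit(8) // cap concurrent datastore reads

	out := make([][]byte, len(keys))
	for i, k := range keys {
		i, k := i, k // capture loop variables (pre-Go 1.22 semantics)
		g.Go(func() error {
			v, err := ds.Get(k)
			if err != nil {
				return err
			}
			out[i] = v // each goroutine writes its own slot, so no data race
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	return out, nil
}
```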