
bug: libp2p-kad-dht goroutine leak crash #243

Closed
vdo opened this issue Jul 26, 2021 · 8 comments

Labels
bug Something isn't working

vdo (Contributor) commented Jul 26, 2021

Describe the bug
One of the production gateways dies from time to time due to a goroutine leak, eventually reaching the memory limit.

To Reproduce (please complete the following information)

DVOTE_DATADIR=/app/run
DVOTE_MODE=gateway
DVOTE_LOGLEVEL=info
DVOTE_LOGERRORFILE=/app/run/dvote.err.log
DVOTE_API_ALLOWPRIVATE=True
DVOTE_API_SSL_DOMAIN=litxi.jumpingcrab.com
DVOTE_IPFS_SYNCKEY=letsgivevoice
DVOTE_METRICS_ENABLED=True
DVOTE_METRICS_REFRESHINTERVAL=10
DVOTE_ETHCONFIG_SIGNINGKEY=<redacted>
DVOTE_W3CONFIG_ENABLED=False
DVOTE_VOCHAINCONFIG_NOWAITSYNC=False
DVOTE_API_ROUTE=/
DVOTE_API_FILE=True
DVOTE_API_CENSUS=True
DVOTE_API_VOTE=True
DVOTE_API_RESULTS=True
DVOTE_API_INDEXER=False
DVOTE_API_TENDERMINT=False
DVOTE_DEV=False
DVOTE_VOCHAIN=main

[Screenshot attached: Screenshot_2021-07-26_14-53-51]

Compressed logs:
log.gz

vdo added the bug label Jul 26, 2021
p4u (Member) commented Jul 26, 2021

Looks like there is an unattended context for https://github.com/libp2p/go-libp2p-kad-dht/blob/master/providers/providers_manager.go#L263

Probably the problem is in the subpub package (multirpc)

p4u assigned p4u and mvdan and unassigned p4u Jul 26, 2021
mvdan added a commit to mvdan/go-dvote that referenced this issue Jul 27, 2021
The library seems to generally have no default limit, and we've
encountered gateways in some environments spiking to as many as 200K
goroutines, so this seems like a likely culprit.

And even if this is not the cause, setting a limit is a good idea
anyway.

As a start, we're using four times MaxDHTpeers as the discovery limit
for both FindPeers and Advertise. It seems high enough to keep things
working, but low enough that we can't run out of memory with hundreds of
thousands of goroutines.

Updates vocdoni#243.
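For reference, a minimal sketch of what such a cap can look like against the go-libp2p discovery API of that era (the function and variable names below are illustrative, not the actual vocdoni-node code):

import (
	"context"
	"fmt"

	coredisc "github.com/libp2p/go-libp2p-core/discovery"
	libp2pdisc "github.com/libp2p/go-libp2p-discovery"
)

// findPeersBounded is a hedged sketch (not the actual vocdoni-node code)
// of capping discovery as described in the commit message above.
func findPeersBounded(ctx context.Context, rd *libp2pdisc.RoutingDiscovery, topic string, maxDHTpeers int) error {
	limit := coredisc.Limit(maxDHTpeers * 4)

	// Keep advertising our namespace in the background, with the same cap.
	libp2pdisc.Advertise(ctx, rd, topic, limit)

	// The limit bounds how many results a FindPeers round can return,
	// which in turn bounds the goroutines spawned to handle them.
	peersCh, err := rd.FindPeers(ctx, topic, limit)
	if err != nil {
		return err
	}
	for p := range peersCh {
		fmt.Println("discovered peer:", p.ID)
	}
	return nil
}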
p4u pushed a commit that referenced this issue Jul 28, 2021
p4u pushed a commit that referenced this issue Jul 28, 2021
p4u pushed a commit that referenced this issue Jul 28, 2021
mvdan added a commit to mvdan/go-dvote that referenced this issue Jul 29, 2021
When the connection to a peer is lost,
broadcastHandler errors in its SendMessage call,
and the entire goroutine stops.

No goroutine will continue receiving on the write channel,
and sooner than later, sends to the write channel will start blocking.

This starts causing deadlocks further up in IPFSsync.
SubPub.Subscribe and SubPub.PeerStreamWrite can now block forever,
and further up the chain in IPFSsync,
that can mean some goroutines hold onto mutexes forever.

On one hand, this chain of events can hang IPFSsync,
stopping it from doing anything useful until a restart.

On the other hand, it causes goroutine leaks.
When more calls to IPFSsync.Handle come through,
using new goroutines via the router,
those try to grab the deadlocked mutexes and hang forever.

First, fix the root cause: peerSub now has a "closed" channel,
which gets closed by peersManager when the peer is dropped.
Its goroutines, both for reading and writing messages,
keep running until that happens.

Second, make the symptom of the deadlock less severe:
prevent blocking on channel sends forever.
Any send on the "write" channel now stops on "closed".
And the send on BroadcastWriter, which could also block forever,
now has a fallback timeout of five minutes.

Updates vocdoni#243. Perhaps not a total fix, as there might be other leaks.
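A minimal sketch of the send pattern the commit describes (the writeCh and closedCh names are illustrative, not the exact subpub types):

import (
	"fmt"
	"time"
)

// sendToPeer stops as soon as the peer's "closed" channel is closed by the
// peers manager, or after a fallback timeout, instead of blocking forever
// on a channel whose reader goroutine has already exited.
func sendToPeer(writeCh chan<- []byte, closedCh <-chan struct{}, msg []byte) error {
	select {
	case writeCh <- msg:
		return nil
	case <-closedCh:
		return fmt.Errorf("peer dropped, message discarded")
	case <-time.After(5 * time.Minute):
		return fmt.Errorf("timed out sending to peer")
	}
}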
mvdan added a commit that referenced this issue Jul 30, 2021
p4u pushed a commit that referenced this issue Aug 2, 2021
p4u pushed a commit that referenced this issue Aug 2, 2021
mvdan (Contributor) commented Aug 7, 2021

My previous commits did fix some deadlocks and minor goroutine leaks, but those are unrelated to the much larger goroutine leak that litxi is running into.

I've done two hours of digging into this, and I've got good news and bad news.

The good news is that I went over the code step by step multiple times, and I could not spot a mistake on our part. It's worth noting that the subpub code is hard to follow though, so I could have missed something.

The bad news is that I don't see an easy fix for what I think is happening here. We essentially get tens of thousands of goroutines stuck handling "GetProviders" libp2p messages in go-libp2p-kad-dht. And it makes sense; its ProviderManager handles all of those requests serially, without even supporting parallel "gets".

So if a node discovers a thousand other nodes at once, a thousand such messages will go in each direction, and the queue will get pretty long for some time. Handling each request involves some input/output as well, which might block. If the messages come in faster than the node can keep up with, they will pile up and use more and more memory. Each pending request uses one goroutine as well.
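To illustrate the shape of the problem, here is a simplified sketch (not the actual go-libp2p-kad-dht code): the provider manager drains a single request channel in one loop, so every incoming stream's handler goroutine blocks on the channel send until its turn comes.

// Simplified, illustrative sketch of a serialized provider manager.
type providerManager struct {
	requests chan func() // all reads and writes are funneled through here
}

// run processes requests one at a time; any slow datastore operation
// stalls the whole queue.
func (pm *providerManager) run() {
	for req := range pm.requests {
		req()
	}
}

// handleAddProvider is called from one goroutine per incoming libp2p
// stream. If run() can't keep up, these goroutines pile up blocked on
// the channel send, which is the pile-up observed in this issue.
func (pm *providerManager) handleAddProvider(record string) {
	done := make(chan struct{})
	pm.requests <- func() {
		// ...write the provider record to the datastore (may block)...
		_ = record
		close(done)
	}
	<-done
}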

I'm not sure why so many messages are coming in for just one of the gateways. Three ideas come to mind:

  1. The node is, for some reason, talking to itself over libp2p as if it were a different node, resulting in an endless source of discovery traffic.

  2. There really are many other vocdoni-node libp2p nodes in the network, and they spam each other continuously with discovery messages for some reason, such as the network dropping connections pretty often.

  3. There are many other machines in the network being used for libp2p, unrelated to vocdoni-node, and those libp2p nodes are spamming us with discovery messages.

I tend to think this has something to do with litxi's network setup, because as far as I can tell no other gateway has this issue with go-libp2p-kad-dht. Ideally, we'd fix the root cause of all those peers, which would presumably be a devops/systems fix.

That devops fix aside, I think our best bet for mitigating the goroutine leak in the source code would be upstream's libp2p/go-libp2p-kad-dht#675, which already has a PR. It could be a few weeks until that's merged, but we could always try using that branch as a temporary measure.

Another potential fix would be to switch to gossipsub, since https://github.com/libp2p/go-libp2p-pubsub does not import the kad-dht library which has this potential memory leak :)
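For context, switching would mean building on the go-libp2p-pubsub API, roughly like this (a sketch assuming an existing libp2p host h and an illustrative topic name, not the subpub code):

import (
	"context"
	"fmt"

	"github.com/libp2p/go-libp2p-core/host"
	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

// subscribeGossip joins a topic over GossipSub and prints incoming messages.
func subscribeGossip(ctx context.Context, h host.Host, topicName string) error {
	ps, err := pubsub.NewGossipSub(ctx, h)
	if err != nil {
		return err
	}
	topic, err := ps.Join(topicName)
	if err != nil {
		return err
	}
	sub, err := topic.Subscribe()
	if err != nil {
		return err
	}
	for {
		msg, err := sub.Next(ctx)
		if err != nil {
			return err
		}
		fmt.Printf("got %d bytes from %s\n", len(msg.Data), msg.ReceivedFrom)
	}
}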

mvdan changed the title from "bug: IPFS goroutine leak crash" to "bug: libp2p-kad-dht goroutine leak crash" Aug 12, 2021
mvdan (Contributor) commented Aug 12, 2021

I was finally able to catch gw5 while it was using 30k goroutines. It, too, was filled with kad-dht goroutines like:

goroutine 473021375 [select, 52 minutes]:
github.com/libp2p/go-libp2p-kad-dht/providers.(*ProviderManager).AddProvider(0xc003091f40, 0x33f7d20, 0xc04fca3dd0, 0xc016a659e0, 0x22, 0x30, 0xc03313ddd0, 0x26)
        github.com/libp2p/[email protected]/providers/providers_manager.go:222 +0x11e
github.com/libp2p/go-libp2p-kad-dht.(*IpfsDHT).handleAddProvider(0xc003557500, 0x33f7d20, 0xc04fca3dd0, 0xc03313ddd0, 0x26, 0xc069c21d80, 0x0, 0x0, 0xc0699a50c0)
        github.com/libp2p/[email protected]/handlers.go:369 +0x49b
github.com/libp2p/go-libp2p-kad-dht.(*IpfsDHT).handleNewMessage(0xc003557500, 0x341f5b0, 0xc0699a50a0, 0xc051772100)
        github.com/libp2p/[email protected]/dht_net.go:121 +0x6b6
github.com/libp2p/go-libp2p-kad-dht.(*IpfsDHT).handleNewStream(0xc003557500, 0x341f5b0, 0xc0699a50a0)
        github.com/libp2p/[email protected]/dht_net.go:26 +0x3f
github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).SetStreamHandler.func1(0xc0208cc130, 0xf, 0x7f24a50afa00, 0xc0699a50a0, 0x1, 0x0)
        github.com/libp2p/[email protected]/p2p/host/basic/basic_host.go:574 +0xa2
created by github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).newStreamHandler
        github.com/libp2p/[email protected]/p2p/host/basic/basic_host.go:421 +0x63a

So the goroutine leak seems to be exactly the same as litxi's. We also can't put the blame on litxi for causing all these sudden requests: gw5 was getting spikes of 10k goroutines a few days ago, during which time litxi was down.

Here are my two current best guesses as to the root causes of the spike in requests:

  1. our vocdoni-node code is buggy, and the gateways sometimes give each other libp2p DoS hugs of death, such as by entering infinite loops.

  2. there's a libp2p client somewhere (maybe a website, or a script, or a test?) and it similarly does a libp2p DoS hug of death.

I'm not sure how likely the second one is, though. Do we have anything else that uses libp2p-kad-dht besides vocdoni-node gateways themselves?

p4u (Member) commented Aug 17, 2021

Nothing else uses libp2p, so it should be option 1.

mvdan (Contributor) commented Aug 17, 2021

It could also be something in between, like a frontend or client sending 100 requests to a gateway, where handling those 100 requests results in 100+ libp2p requests made by the gateway.

p4u (Member) commented Aug 17, 2021

Each file a client tries to get from the gateway (header images, voting process metadata, etc.) goes to the IPFS daemon, handled by the Retrieve method here: https://github.com/vocdoni/vocdoni-node/blob/master/data/ipfs.go#L253

There is an LRU cache (I introduced it recently, about a month ago) that stores up to 128 files; if a file is in the cache, it is served directly without hitting the IPFS daemon: https://github.com/vocdoni/vocdoni-node/blob/master/data/ipfs.go#L44

But apart from that (files fetched by clients), libp2p-kad-dht has no direct interaction with client requests.
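A rough sketch of that caching path (assuming the hashicorp/golang-lru package and a hypothetical retrieveFromIPFS helper; the actual code is in data/ipfs.go, linked above):

import (
	"context"

	lru "github.com/hashicorp/golang-lru"
)

var fileCache, _ = lru.New(128) // at most 128 cached files

// retrieve serves a file from the LRU cache when possible, and only
// falls back to the IPFS daemon on a cache miss.
func retrieve(ctx context.Context, path string) ([]byte, error) {
	if data, ok := fileCache.Get(path); ok {
		return data.([]byte), nil // cache hit: no IPFS round trip
	}
	data, err := retrieveFromIPFS(ctx, path) // hypothetical helper that hits the IPFS daemon
	if err != nil {
		return nil, err
	}
	fileCache.Add(path, data)
	return data, nil
}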

p4u (Member) commented Aug 17, 2021

On the other side, we have the ipfsSync daemon, which uses subpub. It is constantly receiving protocol messages (HELLO and UPDATE) from other cluster peers. Currently gw5 has 7 cluster peers, so it's receiving messages almost every second. However, I don't see why that should end up in a memory leak.

2021-08-17T07:52:05Z    DEBUG   subpubtransport/subpub.go:61    received 137 bytes from QmWZLvLxdtEVzZYRguFMnerzgRunRKFyQAmSttyNtsVsxg
2021-08-17T07:52:05Z    DEBUG   ipfssync/ipfssync.go:406        received message HELLO {Address:QmWZLvLxdtEVzZYRguFMnerzgRunRKFyQAmSttyNtsVsxg Hash: MA:/ip4/162.56.177.209/tcp/4001/p2p/QmYfvwviy9p9o9DaGcRFpvcZDv2NZ8uXF6gtUxar8VU9fz PL:[] Ts:1629186725}
2021-08-17T07:52:05Z    DEBUG   subpubtransport/subpub.go:61    received 90 bytes from QmaHv9XiC5GtF2MYYWEFCXGBMvXBdUYjS4w8Hpe8tWWD5c
2021-08-17T07:52:05Z    DEBUG   ipfssync/ipfssync.go:406        received message UPDATE {Address:QmaHv9XiC5GtF2MYYWEFCXGBMvXBdUYjS4w8Hpe8tWWD5c Hash:dcfc33d6fa6855f7aff3046e663fe04e7797ca801543337e9780061ba4f45590 MA: PL:[] Ts:1629186725}
2021-08-17T07:52:05Z    DEBUG   subpubtransport/subpub.go:61    received 90 bytes from QmUPch8qryfh8PXxHqaFBCHT8TUWKb69Gssi7kZvFNKzXL
2021-08-17T07:52:05Z    DEBUG   ipfssync/ipfssync.go:406        received message UPDATE {Address:QmUPch8qryfh8PXxHqaFBCHT8TUWKb69Gssi7kZvFNKzXL Hash:dcfc33d6fa6855f7aff3046e663fe04e7797ca801543337e9780061ba4f45590 MA: PL:[] Ts:1629186725}
2021-08-17T07:52:05Z    DEBUG   subpubtransport/subpub.go:61    received 90 bytes from QmWZLvLxdtEVzZYRguFMnerzgRunRKFyQAmSttyNtsVsxg
2021-08-17T07:52:05Z    DEBUG   ipfssync/ipfssync.go:406        received message UPDATE {Address:QmWZLvLxdtEVzZYRguFMnerzgRunRKFyQAmSttyNtsVsxg Hash:dcfc33d6fa6855f7aff3046e663fe04e7797ca801543337e9780061ba4f45590 MA: PL:[] Ts:1629186725}
2021-08-17T07:52:06Z    DEBUG   subpubtransport/subpub.go:61    received 138 bytes from QmYbQ7ycqfmvVp4KbHpo476dg61g7bhfR6ZbKouK91yik5
2021-08-17T07:52:06Z    DEBUG   ipfssync/ipfssync.go:406        received message HELLO {Address:QmYbQ7ycqfmvVp4KbHpo476dg61g7bhfR6ZbKouK91yik5 Hash: MA:/ip4/168.119.171.247/tcp/4001/p2p/QmVUxzXZj6H4nCeib1m374NShVBAikRwpYNJAnLtwApzb6 PL:[] Ts:1629186726}
2021-08-17T07:52:06Z    DEBUG   subpubtransport/subpub.go:61    received 90 bytes from QmebzdDHR5sKkaD2bWyis1DViN8zzSYHqyXiJu991HsPe7
2021-08-17T07:52:06Z    DEBUG   ipfssync/ipfssync.go:406        received message UPDATE {Address:QmebzdDHR5sKkaD2bWyis1DViN8zzSYHqyXiJu991HsPe7 Hash:dcfc33d6fa6855f7aff3046e663fe04e7797ca801543337e9780061ba4f45590 MA: PL:[] Ts:1629186726}
2021-08-17T07:52:09Z    INFO    ipfssync/ipfssync.go:304        [ipfsSync info] pins:2901 hash:dcfc33d6fa6855f7aff3046e663fe04e7797ca801543337e9780061ba4f45590
2021-08-17T07:52:09Z    DEBUG   ipfssync/ipfssync.go:220        broadcasting message UPDATE {Address:QmNtPnwSQguKesS1tj8nJvXdzN1tYpj1GpBjKkgAPD8NUa Hash:dcfc33d6fa6855f7aff3046e663fe04e7797ca801543337e9780061ba4f45590 MA: PL:[] Ts:1629186729}
2021-08-17T07:52:09Z    DEBUG   subpubtransport/subpub.go:88    sending 90 bytes to broadcast channel
2021-08-17T07:52:11Z    DEBUG   subpubtransport/subpub.go:61    received 135 bytes from QmUPch8qryfh8PXxHqaFBCHT8TUWKb69Gssi7kZvFNKzXL
2021-08-17T07:52:11Z    DEBUG   ipfssync/ipfssync.go:406        received message HELLO {Address:QmUPch8qryfh8PXxHqaFBCHT8TUWKb69Gssi7kZvFNKzXL Hash: MA:/ip4/176.9.17.253/tcp/4001/p2p/QmTuWawiQV184VAtzfLpTVatnv6V4kk7yYjpSNXMGv4ZyG PL:[] Ts:1629186731}

mvdan (Contributor) commented Aug 31, 2021

> It is constantly receiving protocol messages (HELLO and UPDATE) from other cluster peers. Currently gw5 has 7 cluster peers, so it's receiving messages almost every second. However, I don't see why that should end up in a memory leak.

That could explain it, actually.

Right now, libp2p-kad-dht only handles one request at a time, be it read-only or read-write. If handling one of those requests blocks for a little while, then all other requests pile up, "leaking" their goroutines until they can be processed.

Handling one request could block if an operation on the underlying database blocks, for example. This could happen because the database is doing periodic work like flushing or compressing, or because another goroutine is writing the same keys, or because disk I/O is being used heavily by another process.

mvdan added a commit to mvdan/go-dvote that referenced this issue Sep 21, 2021
This pulls in libp2p/go-libp2p-kad-dht#730,
which lives on a branch shortly after 0.12.2.

Essentially, this handles priority for requests a bit better,
and drops unimportant requests if they come in too fast.
This should prevent kad-dht from using tons of memory.

Updates vocdoni#243.
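One common way to pull in an unmerged upstream change like that is a go.mod replace directive pointing at a fork branch; a sketch with placeholder module path and version (not the actual ones used here):

// go.mod sketch (the fork path and pseudo-version are placeholders):
replace github.com/libp2p/go-libp2p-kad-dht => github.com/example/go-libp2p-kad-dht v0.12.3-0.20210921120000-0123456789ab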
p4u pushed a commit that referenced this issue Sep 21, 2021
jordipainan pushed a commit that referenced this issue Sep 23, 2021