
Problem: possible memory leak #519

Closed · Galadrin opened this issue Apr 30, 2021 · 34 comments

Labels: help wanted, need-investigation


Galadrin commented Apr 30, 2021

Describe the bug
A sentry node running with seed mode and snapshots enabled uses a lot of swap memory over time.
Swap is used when memory pages have not been modified for a long time.

sudo cat /proc/sys/vm/swappiness
60
sudo cat /proc/sys/vm/vfs_cache_pressure
100
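As a side note, lowering swappiness is one cheap way to check whether the swap growth is just the kernel being aggressive about paging out cold anonymous pages rather than real memory pressure; a minimal sketch (the value 60 reported above is the Ubuntu default, and this change reverts on reboot unless persisted):

```sh
# Check the current value (60 is the Ubuntu default).
cat /proc/sys/vm/swappiness

# Temporarily make the kernel prefer keeping anonymous pages in RAM.
sudo sysctl vm.swappiness=10
```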

To Reproduce

  • chain-main version 1.2.1, rebuilt with cleveldb support
  • VPS server running Ubuntu 20.04
  • 4 vCPU
  • 8GiB DDR | 2GiB swap
  • 80 peers on average
  • db_backend = "cleveldb"
  • seed mode enabled
  • snapshots enabled (the relevant config entries are sketched after this list)
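A minimal sketch of how to confirm what the node actually reads for these settings, assuming the default home directory layout (`~/.chain-maind`; adjust to the node's `--home`): `db_backend` and `seed_mode` live in Tendermint's config.toml, while snapshot serving is configured in the SDK's app.toml.

```sh
# Backend and seed mode (Tendermint config).
grep -E '^(db_backend|seed_mode)' ~/.chain-maind/config/config.toml

# Snapshot serving (Cosmos SDK app config).
grep -E '^snapshot-(interval|keep-recent)' ~/.chain-maind/config/app.toml
```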

Expected behavior
no or limited swap usage

Screenshots
(screenshot attached)

Additional context
Issue detected on mainnet.
The attached error log was produced with kill -6 PID after 22h of runtime:
chain-maind-errlog.txt

@tomtau tomtau added help wanted Extra attention is needed need-investigation labels Apr 30, 2021
@tomtau tomtau changed the title Problem: memory leak Problem: possible memory leak Apr 30, 2021

Galadrin commented May 1, 2021

In case it helps, I've generated a pprof profile of the node.

sudo -u chain /usr/local/go/bin/go tool pprof -png /srv/chain/chain-maind 'http://127.0.0.1:6060/debug/pprof/allocs' > /tmp/allocs.png
Fetching profile over HTTP from http://127.0.0.1:6060/debug/pprof/allocs
Saved profile in /srv/chain/pprof/pprof.chain-maind.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz

pprof.chain-maind.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz
(allocs profile graph attached)
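If more data turns out to be useful, the same pprof endpoint can also serve heap and goroutine profiles; a sketch, assuming the node keeps exposing pprof on 127.0.0.1:6060 as in the command above:

```sh
# In-use heap profile (what the process is holding right now).
go tool pprof -png http://127.0.0.1:6060/debug/pprof/heap > /tmp/heap.png

# Full goroutine dump, useful for spotting leaked goroutines.
curl -s 'http://127.0.0.1:6060/debug/pprof/goroutine?debug=2' > /tmp/goroutines.txt
```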


akamsmrhill commented May 1, 2021 via email


JayT106 commented May 2, 2021

@Galadrin thanks!! big help!


JayT106 commented May 2, 2021

@tomtau after going through the code and the chain-main error log, it looks very much like the same goleveldb issues: syndtr/goleveldb#290 and syndtr/goleveldb#353 (less likely this one).

Someone in that discussion may have provided a useful patch: syndtr/goleveldb#290 (comment)

Do you think we should patch goleveldb and test it a bit? The goleveldb repo looks slowly maintained. If we can confirm the patch solves the issue, maybe we should suggest Tendermint Core switch to a better LevelDB implementation, if there is one, or patch it themselves?
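As a sketch, one low-effort way to test such a patch would be a Go module `replace` directive pointing the build at a patched checkout before rebuilding; the local path below is purely hypothetical and only illustrates the mechanism:

```sh
# Run inside the chain-main source tree; ../goleveldb-patched is a hypothetical
# local checkout of goleveldb with the candidate patch applied.
go mod edit -replace github.com/syndtr/goleveldb=../goleveldb-patched
go mod tidy
NETWORK=mainnet make build
```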


Galadrin commented May 2, 2021

I'm using cleveldb (but didn't see any performance difference).


tomtau commented May 3, 2021

@JayT106 it'd be better to confirm the root cause -- @Galadrin mentions they recompiled the binary for cleveldb, so it shouldn't be affected by the goleveldb issues (unless it was compiled such that goleveldb is still included for some parts).
From the profiling data, it seems a lot of the memory is consumed by IAVL, and IAVL has some known issues / behaviour that doesn't play well with LevelDB, e.g. cosmos/iavl#137


JayT106 commented May 3, 2021

@tomtau Sure, the IAVL issue looks more like a performance issue impacting the DB layer, but perhaps it is also related to the memory-leak issue.
@Galadrin is the chain-maind-errlog.txt from the same machine? In the log file, it shows the node was calling goleveldb.
Could you check the config settings again? Or maybe the config has an issue (not able to switch the backend DB):

```
Apr 30 05:16:37 chain-maind[2578067]: goroutine 12 [runnable]:
Apr 30 05:16:37 chain-maind[2578067]: syscall.Syscall6(0x11, 0x1d4, 0xc0177fc500, 0x1022, 0x5c2cd, 0x0, 0x0, 0x1022, 0x1022, 0x0)
Apr 30 05:16:37 chain-maind[2578067]: /usr/lib/go-1.15/go/src/syscall/asm_linux_amd64.s:41 +0x5
Apr 30 05:16:37 chain-maind[2578067]: syscall.Pread(0x1d4, 0xc0177fc500, 0x1022, 0x200a, 0x5c2cd, 0x0, 0x0, 0xc006857078)
Apr 30 05:16:37 chain-maind[2578067]: /usr/lib/go-1.15/go/src/syscall/zsyscall_linux_amd64.go:1211 +0x72
Apr 30 05:16:37 chain-maind[2578067]: internal/poll.(*FD).Pread(0xc010f76960, 0xc0177fc500, 0x1022, 0x200a, 0x5c2cd, 0xc001088a20, 0xc0068570e0, 0x2000)
Apr 30 05:16:37 chain-maind[2578067]: /usr/lib/go-1.15/go/src/internal/poll/fd_unix.go:185 +0xb5
Apr 30 05:16:37 chain-maind[2578067]: os.(*File).pread(...)
Apr 30 05:16:37 chain-maind[2578067]: /usr/lib/go-1.15/go/src/os/file_posix.go:40
Apr 30 05:16:37 chain-maind[2578067]: os.(*File).ReadAt(0xc006251f40, 0xc0177fc500, 0x1022, 0x200a, 0x5c2cd, 0x0, 0x0, 0x0)
Apr 30 05:16:37 chain-maind[2578067]: /usr/lib/go-1.15/go/src/os/file.go:134 +0x10b
Apr 30 05:16:37 chain-maind[2578067]: github.com/syndtr/goleveldb/leveldb.(*iStorageReader).ReadAt(0xc019851580, 0xc0177fc500, 0x1022, 0x200a, 0x5c2cd, 0x1062, 0x1062, 0x0)
Apr 30 05:16:37 chain-maind[2578067]: /home/galadrin/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/storage.go:49 +0x5f
Apr 30 05:16:37 chain-maind[2578067]: github.com/syndtr/goleveldb/leveldb/table.(*Reader).readRawBlock(0xc0021bad00, 0x5c2cd, 0x101d, 0x1fdf301, 0x5c2cd, 0xcda101, 0x2c30001, 0>
Apr 30 05:16:37 chain-maind[2578067]: /home/galadrin/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/table/reader.go:564 +0xb2
Apr 30 05:16:37 chain-maind[2578067]: github.com/syndtr/goleveldb/leveldb/table.(*Reader).readBlock(0xc0021bad00, 0x5c2cd, 0x101d, 0x1, 0x0, 0x0, 0xcd4c7e)
Apr 30 05:16:37 chain-maind[2578067]: /home/galadrin/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/table/reader.go:603 +0x55
Apr 30 05:16:37 chain-maind[2578067]: github.com/syndtr/goleveldb/leveldb/table.(*Reader).readBlockCached(0xc0021bad00, 0x5c2cd, 0x101d, 0x1, 0x25, 0xc006857500, 0xccdc1f, 0xc0>
Apr 30 05:16:37 chain-maind[2578067]: /home/galadrin/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/table/reader.go:648 +0x1ee
Apr 30 05:16:37 chain-maind[2578067]: github.com/syndtr/goleveldb/leveldb/table.(*Reader).getDataIter(0xc0021bad00, 0x5c2cd, 0x101d, 0x0, 0xcd0001, 0xc001b2b19c, 0x2)
Apr 30 05:16:37 chain-maind[2578067]: /home/galadrin/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/table/reader.go:765 +0x5b
Apr 30 05:16:37 chain-maind[2578067]: github.com/syndtr/goleveldb/leveldb/table.(*Reader).getDataIterErr(0xc0021bad00, 0x5c2cd, 0x101d, 0x0, 0x1, 0x0, 0x0)
Apr 30 05:16:37 chain-maind[2578067]: /home/galadrin/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/table/reader.go:780 +0x138
Apr 30 05:16:37 chain-maind[2578067]: github.com/syndtr/goleveldb/leveldb/table.(*indexIter).Get(0xc0198515a0, 0xccefd9, 0xc018635400)
Apr 30 05:16:37 chain-maind[2578067]: /home/galadrin/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/table/reader.go:507 +0x285
Apr 30 05:16:37 chain-maind[2578067]: github.com/syndtr/goleveldb/leveldb/iterator.(*indexedIterator).setData(0xc010f76a20)
Apr 30 05:16:37 chain-maind[2578067]: /home/galadrin/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/iterator/indexed_iter.go:39 +0x>
Apr 30 05:16:37 chain-maind[2578067]: github.com/syndtr/goleveldb/leveldb/iterator.(*indexedIterator).Next(0xc010f76a20, 0xc017f0f3f8)
Apr 30 05:16:37 chain-maind[2578067]: /home/galadrin/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/iterator/indexed_iter.go:160 +0>
Apr 30 05:16:37 chain-maind[2578067]: github.com/syndtr/goleveldb/leveldb/iterator.(*indexedIterator).Next(0xc024c6b200, 0x71)
Apr 30 05:16:37 chain-maind[2578067]: /home/galadrin/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/iterator/indexed_iter.go:149 +0>
Apr 30 05:16:37 chain-maind[2578067]: github.com/syndtr/goleveldb/leveldb/iterator.(*mergedIterator).Next(0xc00c2c0e00, 0xc018cf0b40)
Apr 30 05:16:37 chain-maind[2578067]: /home/galadrin/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/iterator/merged_iter.go:169 +0x>
Apr 30 05:16:37 chain-maind[2578067]: github.com/syndtr/goleveldb/leveldb.(*tableCompactionBuilder).run(0xc018fac500, 0xc023d34608, 0x0, 0x0)
Apr 30 05:16:37 chain-maind[2578067]: /home/galadrin/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/db_compaction.go:442 +0x317
Apr 30 05:16:37 chain-maind[2578067]: github.com/syndtr/goleveldb/leveldb.(*DB).compactionTransact(0xc00037c380, 0x1bce14a, 0xb, 0x2006e00, 0xc018fac500)
Apr 30 05:16:37 chain-maind[2578067]: /home/galadrin/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/db_compaction.go:186 +0x18c
Apr 30 05:16:37 chain-maind[2578067]: github.com/syndtr/goleveldb/leveldb.(*DB).tableCompaction(0xc00037c380, 0xc00c2f6b40, 0x1e27200)
Apr 30 05:16:37 chain-maind[2578067]: /home/galadrin/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/db_compaction.go:580 +0x64b
Apr 30 05:16:37 chain-maind[2578067]: github.com/syndtr/goleveldb/leveldb.(*DB).tableAutoCompaction(0xc00037c380)
Apr 30 05:16:37 chain-maind[2578067]: /home/galadrin/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/db_compaction.go:644 +0x54
Apr 30 05:16:37 chain-maind[2578067]: github.com/syndtr/goleveldb/leveldb.(*DB).tCompaction(0xc00037c380)
Apr 30 05:16:37 chain-maind[2578067]: /home/galadrin/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/db_compaction.go:863 +0x2ee
Apr 30 05:16:37 chain-maind[2578067]: created by github.com/syndtr/goleveldb/leveldb.openDB
Apr 30 05:16:37 chain-maind[2578067]: /home/galadrin/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/db.go:155 +0x5a5
Apr 30 05:16:37 chain-maind[2578067]: goroutine 13 [select]:
Apr 30 05:16:37 chain-maind[2578067]: github.com/syndtr/goleveldb/leveldb.(*DB).mCompaction(0xc00037c380)
Apr 30 05:16:37 chain-maind[2578067]: /home/galadrin/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/db_compaction.go:773 +0x145
Apr 30 05:16:37 chain-maind[2578067]: created by github.com/syndtr/goleveldb/leveldb.openDB
Apr 30 05:16:37 chain-maind[2578067]: /home/galadrin/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/db.go:156 +0x5c7
Apr 30 05:16:37 chain-maind[2578067]: goroutine 14 [select]:
Apr 30 05:16:37 chain-maind[2578067]: github.com/syndtr/goleveldb/leveldb/util.(*BufferPool).drain(0xc0000ce1c0)
Apr 30 05:16:37 chain-maind[2578067]: /home/galadrin/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/util/buffer_pool.go:209 +0x128
Apr 30 05:16:37 chain-maind[2578067]: created by github.com/syndtr/goleveldb/leveldb/util.NewBufferPool
Apr 30 05:16:37 chain-maind[2578067]: /home/galadrin/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/util/buffer_pool.go:240
```


Galadrin commented May 3, 2021

> @Galadrin is the chain-maind-errlog.txt from the same machine? In the log file, it shows the node was calling goleveldb.

Yes, it's from the same machine, but not from the same time.

> Could you check the config settings again? Or maybe the config has an issue (not able to switch the backend DB)

db_backend = "cleveldb" in the config.toml


tomtau commented May 3, 2021

For "cleveldb", afaik one needs to recompile the binary as well:
https://docs.tendermint.com/master/introduction/install.html#compile-with-cleveldb-support
https://github.com/cosmos/cosmos-sdk/blob/71b7fb829ed1d320264b3bc1eb4e995fc402a6a4/Makefile#L65


Galadrin commented May 3, 2021

For "cleveldb", afaik one needs to recompile the binary as well:
https://docs.tendermint.com/master/introduction/install.html#compile-with-cleveldb-support
https://github.com/cosmos/cosmos-sdk/blob/71b7fb829ed1d320264b3bc1eb4e995fc402a6a4/Makefile#L65

sure, it's localy rebuilt with cleveldb added to the BUILD_TAGS in the chain-main Makefile.

@allthatjazzleo

For "cleveldb", afaik one needs to recompile the binary as well:
https://docs.tendermint.com/master/introduction/install.html#compile-with-cleveldb-support
https://github.com/cosmos/cosmos-sdk/blob/71b7fb829ed1d320264b3bc1eb4e995fc402a6a4/Makefile#L65

sure, it's localy rebuilt with cleveldb added to the BUILD_TAGS in the chain-main Makefile.

Aside from adding -X github.com/cosmos/cosmos-sdk/types.DBBackend=cleveldb to ldflags, you also need to add gcc to BUILD_TAGS


JayT106 commented May 3, 2021

@Galadrin would it be okay to confirm that your node actually executes with cleveldb and then see if the issue still happens? Then we can tell whether the issue is related to goleveldb or something else. Thanks!


Galadrin commented May 3, 2021

> @Galadrin would it be okay to confirm that your node actually executes with cleveldb and then see if the issue still happens? Then we can tell whether the issue is related to goleveldb or something else. Thanks!

The library is linked and loaded in the proc maps.

ldd  /srv/chain/chain-maind
	linux-vdso.so.1 (0x00007ffca9bed000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2e8a0da000)
	libleveldb.so.1d => /lib/x86_64-linux-gnu/libleveldb.so.1d (0x00007f2e8a07d000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2e89e8b000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f2e8a106000)
	libsnappy.so.1 => /lib/x86_64-linux-gnu/libsnappy.so.1 (0x00007f2e89e80000)
	libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f2e89c9f000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f2e89c84000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2e89b33000)

How can I check that my node executes with cleveldb?

> Aside from adding -X github.com/cosmos/cosmos-sdk/types.DBBackend=cleveldb to ldflags, you also need to add gcc to BUILD_TAGS

OK, so I need to add cleveldb,gcc to BUILD_TAGS?


JayT106 commented May 3, 2021

> How can I check that my node executes with cleveldb?

If the execution log doesn't show the DB backend info, you can kill the PID to retrieve a stack dump (like you did in the Additional context).
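For reference, a couple of non-destructive checks (paths and PID handling are illustrative). They only confirm the C leveldb library is linked and mapped into the running process; a stack dump showing levigo/tm-db CLevelDB frames is still what confirms the backend is actually being exercised:

```sh
# Is the binary linked against the system leveldb at all?
ldd /srv/chain/chain-maind | grep leveldb

# Is the shared library mapped into the running process?
grep -m1 libleveldb "/proc/$(pidof chain-maind)/maps"
```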


Galadrin commented May 4, 2021

OK, I have rebuilt it with the following changes:

sudo apt install libsnappy-dev
CGO_LDFLAGS="-lsnappy" NETWORK=mainnet make build TENDERMINT_BUILD_OPTIONS=cleveldb

go build -mod=readonly -ldflags '-X github.com/cosmos/cosmos-sdk/version.Name=crypto-org-chain-chain -X github.com/cosmos/cosmos-sdk/version.ServerName=chain-maind -X github.com/cosmos/cosmos-sdk/version.Version=1.2.1 -X github.com/cosmos/cosmos-sdk/version.Commit=58af58cf65f846726b585a27c809755d96df8e13 -X github.com/cosmos/cosmos-sdk/types.DBBackend=cleveldb' -tags cgo,cleveldb,ledger,!test_ledger_mock,!ledger_mock,!ledger_zemu -o $HOME/chain-main/build/chain-maind ./cmd/chain-maind

Then I ran it again on the node.
This time, the cleveldb backend is called:

```
github.com/jmhodges/levigo._Cfunc_leveldb_get(0x396fabc0, 0x3a454e80, 0xc0021ddce0, 0x2a, 0xc001930c28, 0xc00760c600, 0x7f5bd41af610)
        _cgo_gotypes.go:360 +0x4e
github.com/jmhodges/levigo.(*DB).Get.func1(0xc0004d0ab0, 0xc0004d0ac0, 0xc0021ddce0, 0xc0021ddce0, 0x2a, 0x30, 0xc001930c28, 0xc00760c600, 0x1007f5c2ff55108)
        /home/galadrin/go/pkg/mod/github.com/jmhodges/[email protected]/db.go:201 +0x105
github.com/jmhodges/levigo.(*DB).Get(0xc0004d0ab0, 0xc0004d0ac0, 0xc0021ddce0, 0x2a, 0x30, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/galadrin/go/pkg/mod/github.com/jmhodges/[email protected]/db.go:201 +0x113
github.com/tendermint/tm-db.(*CLevelDB).Get(0xc00000f960, 0xc0021ddce0, 0x2a, 0x30, 0x2a, 0xc0021ddce0, 0x9, 0x30, 0x0)
        /home/galadrin/go/pkg/mod/github.com/tendermint/[email protected]/cleveldb.go:58 +0x65
github.com/tendermint/tm-db.(*PrefixDB).Get(0xc000d53770, 0xc0021ddad0, 0x21, 0x21, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/galadrin/go/pkg/mod/github.com/tendermint/[email protected]/prefixdb.go:34 +0x182
github.com/cosmos/iavl.(*nodeDB).GetNode(0xc0003c9380, 0xc0019bab00, 0x20, 0x20, 0x0)
        /home/galadrin/go/pkg/mod/github.com/cosmos/[email protected]/nodedb.go:85 +0x2b4
github.com/cosmos/iavl.(*Node).getLeftNode(...)
```

I will give it some time and then check the memory.


Galadrin commented May 4, 2021

> @Galadrin would it be okay to confirm that your node actually executes with cleveldb and then see if the issue still happens? Then we can tell whether the issue is related to goleveldb or something else. Thanks!

After 7 hours of uptime, the swap usage continues to grow.
(screenshot attached)

Do you want a new pprof or stack log?


JayT106 commented May 4, 2021

> After 7 hours of uptime, the swap usage continues to grow.
>
> Do you want a new pprof or stack log?

@Galadrin Thanks for the report. Is it okay to observe it for longer, and perhaps gather the process's RAM usage? If there is a large memory leak, we can expect the process to eventually be OOM-killed.
Swap is used by the OS to improve system performance, so the OS may try to utilize as much of it as it can and hold onto it even when the data is no longer needed.
I think we can try disabling swap to see whether the process's memory usage stays stable in the long run. Or, if you don't disable swap, does it stay stable, including the RAM usage of the chain-maind process?
ref:
https://help.ubuntu.com/community/SwapFaq/#What_is_swappiness_and_how_do_I_change_it.3F
https://stackoverflow.com/questions/43443368/linux-swap-space-never-release-memory
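A sketch of what that could look like on the host; the commands are standard Linux, while the interval and file paths are arbitrary:

```sh
# Disable swap for the duration of the test (re-enable later with `swapon -a`).
sudo swapoff -a

# Record the resident and swapped-out memory of the process every 5 minutes.
pid=$(pidof chain-maind)
while sleep 300; do
    date +%FT%T
    grep -E 'VmRSS|VmSwap' "/proc/$pid/status"
done >> /tmp/chain-maind-mem.log
```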


Galadrin commented May 4, 2021

> I think we can try disabling swap to see whether the process's memory usage stays stable in the long run. Or, if you don't disable swap, does it stay stable, including the RAM usage of the chain-maind process?

I'm OK with turning swap off.

Swap usage is increasing but RAM usage is stable.
I will keep it running longer; we have 7 days of monitoring.

After 14h of uptime, it's using 4GB (50%) of RAM and 688MB of swap.


Galadrin commented May 5, 2021

@JayT106

  • uptime: 1.7 days
  • swap: disabled
  • memory usage: 5.86GB (77%)

(screenshot attached)

Our monitoring started warning about low memory on the server.

I'm not sure I have enough memory available for a pprof snapshot.
Do I need to turn swap back on, or is a kill -6 $(pidof chain-maind) enough?


JayT106 commented May 5, 2021

@Galadrin Thanks! Could you 1. keep the current data you collected, and 2. restart the process with systemd-run to limit it to using only 4.5GB of the system's memory?
ref:
https://manpages.ubuntu.com/manpages/bionic/man1/systemd-run.html
https://www.baeldung.com/linux/limit-resource-consumption#using-systemd-run
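As a sketch, a transient unit started via systemd-run can carry the memory limits directly; the unit name, user, binary path and start flags below are illustrative and need to match the real deployment:

```sh
# Start the node as a transient service capped at roughly 4.5GB.
sudo systemd-run --unit=chain-maind-capped --uid=chain \
    -p MemoryHigh=4G -p MemoryMax=4.5G \
    /srv/chain/chain-maind start --home /srv/chain
```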


Galadrin commented May 5, 2021

> 1. keep the current data you collected

@JayT106 what do you mean by keeping the data?


Galadrin commented May 5, 2021

> 2. restart the process with systemd-run to limit it to using only 4.5GB of the system's memory

I will add MemoryHigh=4G and MemoryMax=4.5G to my unit file.
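For an existing unit, those limits can also go into a drop-in override instead of editing the unit file in place; a sketch, assuming the service is named chain-maind.service:

```sh
sudo mkdir -p /etc/systemd/system/chain-maind.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/chain-maind.service.d/memory.conf
[Service]
MemoryHigh=4G
MemoryMax=4.5G
EOF
sudo systemctl daemon-reload
sudo systemctl restart chain-maind
```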


JayT106 commented May 5, 2021

> > 1. keep the current data you collected
>
> @JayT106 what do you mean by keeping the data?

Do a pprof snapshot; the snapshot shouldn't use too much RAM to dump the info to a file.
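Fetching the raw profiles with curl keeps the overhead on the node itself minimal, since the analysis can be done with `go tool pprof` on another machine; a sketch assuming the pprof endpoint used earlier is still enabled:

```sh
curl -s http://127.0.0.1:6060/debug/pprof/heap   > /tmp/heap.pb.gz
curl -s http://127.0.0.1:6060/debug/pprof/allocs > /tmp/allocs.pb.gz
curl -s 'http://127.0.0.1:6060/debug/pprof/goroutine?debug=2' > /tmp/goroutines.txt
```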


JayT106 commented May 5, 2021

> 2. restart the process with systemd-run to limit it to using only 4.5GB of the system's memory
>
> I will add MemoryHigh=4G and MemoryMax=4.5G to my unit file.

Thanks.


Galadrin commented May 7, 2021

@JayT106 The memory limitation does its job :)
As you can see on the graph, the binary restarts when we reach 4.64GB of RAM usage on the system.

(screenshot attached)


JayT106 commented May 7, 2021

@Galadrin Do you mean the process did a GC to free the RAM, or was the process OOM-killed? If it was OOM-killed, is there any dump left that we can analyze? Thanks.
I was trying to reproduce this by deploying a node in my cloud (with seed mode and snapshots enabled), but my node doesn't show the memory issue.
Maybe some environment settings are different from yours.


Galadrin commented May 7, 2021

Yes, the OOM killer triggered:

May 07 23:00:27 hostname kernel: chain-maind invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
May 07 23:00:27 hostname kernel: CPU: 3 PID: 167287 Comm: chain-maind Not tainted 5.4.0-72-generic #80-Ubuntu
May 07 23:00:27 hostname kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
May 07 23:00:27 hostname kernel: Call Trace:
May 07 23:00:27 hostname kernel:  dump_stack+0x6d/0x8b
May 07 23:00:27 hostname kernel:  dump_header+0x4f/0x1eb
May 07 23:00:27 hostname kernel:  oom_kill_process.cold+0xb/0x10
May 07 23:00:27 hostname kernel:  out_of_memory.part.0+0x1df/0x3d0
May 07 23:00:27 hostname kernel:  out_of_memory+0x6d/0xd0
May 07 23:00:27 hostname kernel:  mem_cgroup_out_of_memory+0xbd/0xe0
May 07 23:00:27 hostname kernel:  try_charge+0x77c/0x810
May 07 23:00:27 hostname kernel:  mem_cgroup_try_charge+0x71/0x190
May 07 23:00:27 hostname kernel:  __add_to_page_cache_locked+0x265/0x340
May 07 23:00:27 hostname kernel:  ? scan_shadow_nodes+0x30/0x30
May 07 23:00:27 hostname kernel:  add_to_page_cache_lru+0x4d/0xd0
May 07 23:00:27 hostname kernel:  pagecache_get_page+0x101/0x300
May 07 23:00:27 hostname kernel:  filemap_fault+0x6b2/0xa50
May 07 23:00:27 hostname kernel:  ? filemap_map_pages+0x24c/0x380
May 07 23:00:27 hostname kernel:  ext4_filemap_fault+0x32/0x50
May 07 23:00:27 hostname kernel:  __do_fault+0x3c/0x130
May 07 23:00:27 hostname kernel:  do_fault+0x24b/0x640
May 07 23:00:27 hostname kernel:  ? __switch_to_asm+0x40/0x70
May 07 23:00:27 hostname kernel:  __handle_mm_fault+0x4c5/0x7a0
May 07 23:00:27 hostname kernel:  handle_mm_fault+0xca/0x200
May 07 23:00:27 hostname kernel:  do_user_addr_fault+0x1f9/0x450
May 07 23:00:27 hostname kernel:  __do_page_fault+0x58/0x90
May 07 23:00:27 hostname kernel:  ? exit_to_usermode_loop+0x8f/0x160
May 07 23:00:27 hostname kernel:  do_page_fault+0x2c/0xe0
May 07 23:00:27 hostname kernel:  do_async_page_fault+0x39/0x70
May 07 23:00:27 hostname kernel:  async_page_fault+0x34/0x40
May 07 23:00:27 hostname kernel: RIP: 0033:0x43c87c
May 07 23:00:27 hostname kernel: Code: Bad RIP value.
May 07 23:00:27 hostname kernel: RSP: 002b:00007f8bf4b2fbe8 EFLAGS: 00010206
May 07 23:00:27 hostname kernel: RAX: 0000000000000001 RBX: 000000c003241c80 RCX: 000000000047afe3
May 07 23:00:27 hostname kernel: RDX: 0000000000000001 RSI: 0000000000000081 RDI: 0000000002c07e18
May 07 23:00:27 hostname kernel: RBP: 00007f8bf4b2fc30 R08: 0000000000000000 R09: 0000000000000000
May 07 23:00:27 hostname kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000961cde80f30b
May 07 23:00:27 hostname kernel: R13: 0000000000000b68 R14: 0000961cb3d89c07 R15: 0000000000000000
May 07 23:00:27 hostname kernel: memory: usage 4718496kB, limit 4718592kB, failcnt 739064221
May 07 23:00:27 hostname kernel: memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
May 07 23:00:27 hostname kernel: kmem: usage 43740kB, limit 9007199254740988kB, failcnt 0
May 07 23:00:27 hostname kernel: Memory cgroup stats for /system.slice/chain-maind.service:
May 07 23:00:27 hostname kernel: anon 4786192384
                                                    file 495616
                                                    kernel_stack 184320
                                                    slab 30851072
                                                    sock 0
                                                    shmem 0
                                                    file_mapped 135168
                                                    file_dirty 0
                                                    file_writeback 0
                                                    anon_thp 16777216
                                                    inactive_anon 0
                                                    active_anon 4786262016
                                                    inactive_file 0
                                                    active_file 0
                                                    unevictable 0
                                                    slab_reclaimable 28504064
                                                    slab_unreclaimable 2347008
                                                    pgfault 137232381
                                                    pgmajfault 34084677
                                                    workingset_refault 430984686
                                                    workingset_activate 51636717
                                                    workingset_nodereclaim 0
                                                    pgrefill 69074416
                                                    pgscan 1170740775
                                                    pgsteal 1074492466
                                                    pgactivate 20030307
                                                    pgdeactivate 66718849
                                                    pglazyfree 7687218
                                                    pglazyfreed 7563567
                                                    thp_fault_alloc 0
                                                    thp_collapse_alloc 165
May 07 23:00:27 hostname kernel: Tasks state (memory values in pages):
May 07 23:00:27 hostname kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
May 07 23:00:27 hostname kernel: [ 167278]   998 167278  1992895  1176630 13873152        0             0 chain-maind
May 07 23:00:27 hostname kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/system.slice/chain-maind.service,task_memcg=/sys>
May 07 23:00:27 hostname kernel: Memory cgroup out of memory: Killed process 167278 (chain-maind) total-vm:7971580kB, anon-rss:4674288kB, file-rss:32232kB, shmem-rss:0kB>
May 07 23:00:28 hostname kernel: oom_reaper: reaped process 167278 (chain-maind), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
May 07 23:00:28 hostname systemd[1]: chain-maind.service: Main process exited, code=killed, status=9/KILL


JayT106 commented May 10, 2021

> pprof.chain-maind.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz

The profiling result shows the different paths into the iavl objects. The iavl objects might retain a lot of memory when doing tree traversal (so you see the memory keep growing). We will continue to track this issue.


JayT106 commented May 11, 2021

@Galadrin from the RAM monitoring data of our seed node, we can see the RAM usage increased from 4.7GB (Apr 26) to around 7.3GB (now). But if the node is a sentry node, the RAM usage is around 2GB.

@Galadrin

> But if the node is a sentry node, the RAM usage is around 2GB.

It's a sentry node, but we enabled seed mode and snapshots.
All our other sentries use ~2GB.


tomtau commented May 26, 2021

After the investigation by @JayT106 and @allthatjazzleo, they didn't manage to reproduce the issue and didn't notice any memory-leak-like behaviour -- the only thing noticed was the increased memory usage due to IAVL.

If this issue reappears or there's a reliable way to reproduce it, feel free to reopen it.
Note that the version on the latest master is based on SDK 0.42.5, which upgraded the IAVL version to https://github.com/cosmos/iavl/releases/tag/v0.16.0 and may have different memory behaviour.

@tomtau tomtau closed this as completed May 26, 2021
@GrapeBaBa

We also hit this problem when the key/value sizes are big. We found there are two hard-coded cache sizes, in iavl/store.go and cache/cache.go; changing the cache size helps.


JayT106 commented Aug 15, 2022

> We also hit this problem when the key/value sizes are big. We found there are two hard-coded cache sizes, in iavl/store.go and cache/cache.go; changing the cache size helps.

Thanks for sharing. We found that recent Cronos uses tm-db 0.6.7, which changed max_files_open from 1000 to 4096; that also consumes a lot of memory. Wondering which Cronos version you are using?
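A quick, rough way to see whether a larger open-files setting is in play on a given node is to count the database file handles the process holds, since each cached table file can also keep index/filter blocks in memory; an illustrative sketch (root may be needed to read another user's /proc entries):

```sh
pid=$(pidof chain-maind)
sudo ls -l "/proc/$pid/fd" | grep -cE '\.(ldb|sst)'
```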
