Errors related to malloc and free #5728
Please post a reproducible example using the master branch. |
I can reproduce
This error only occurs when training on a GPU, and it appears randomly: sometimes after 3 iterations, sometimes after 20. |
Similar to @sifmelcara, the error I encounter also appears randomly. |
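Since the crash shows up at random, it can help to rerun the same command several times and record the first run that dies from a signal. The following is a minimal Python sketch, not something from MXNet: `first_crash` is a hypothetical helper, and it assumes a POSIX system, where `subprocess` reports a process killed by signal N as return code -N. glibc's heap-corruption checks end in `abort()`, so a crash of this kind would normally show up as `SIGABRT`.

```python
import signal
import subprocess
import sys


def first_crash(cmd, max_runs=20):
    """Run `cmd` (a list of arguments) up to `max_runs` times.

    Returns (run_index, signal) for the first run that was killed by a
    signal, or None if every run exited on its own.
    Assumes POSIX: a negative return code means "killed by signal -rc".
    """
    for i in range(1, max_runs + 1):
        rc = subprocess.run(cmd).returncode
        if rc < 0:
            return i, signal.Signals(-rc)
    return None
```

For example, `first_crash(["./lenet"], max_runs=50)` would return a pair such as `(7, Signals.SIGABRT)` if the seventh run aborted; `./lenet` is only a placeholder for your own binary or script.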
I can also reproduce this on the mxnet 0.9.3a release (which is about two months old). |
To find out whether this problem is related to my hardware, I took my drive to a friend's computer, which has a GTX 1070, booted the same OS, and ran the same binary (lenet). The program ran fine and did not crash. PS. The strange thing is that running TensorFlow with CUDA and cuDNN on my computer does not crash, so there might still be an issue in mxnet. |
My program crashed on 2 machines. One uses GTX TITAN Black and the other uses GTX TITAN X. |
I found this issue is related to I wonder if it is because in |
@piiswrong we definitely need tests for cpp-package to run in CI |
Does setting env var
make any difference in your case? |
@eric-haibin-lin These variables are exclusive and I should set them before running MXNet, right? |
They are read at runtime. Simply prepend them to the command you run. I'm just curious whether this is caused by a recent change in the executor (bulk execution). |
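A minimal sketch of what prepending variables to the command amounts to, written in Python rather than the shell. The variable names asked about above are not preserved in this thread, so `MXNET_ENGINE_TYPE` and `MXNET_EXEC_BULK_EXEC_TRAIN` below are assumptions taken from MXNet's environment-variable documentation, and `env_with` / `run_with_env` are hypothetical helpers.

```python
import os
import subprocess
import sys


def env_with(overrides):
    """Return a copy of the current environment with `overrides` applied."""
    env = os.environ.copy()
    env.update(overrides)
    return env


def run_with_env(cmd, overrides):
    """Run `cmd` (a list of arguments) with extra environment variables.

    Same effect as `VAR=value cmd` in a POSIX shell: the variables are set
    before the child process starts and are visible only to that child.
    """
    return subprocess.run(cmd, env=env_with(overrides))


# Assumed names from MXNet's env-var docs, not confirmed by this thread.
DEBUG_ENV = {
    "MXNET_ENGINE_TYPE": "NaiveEngine",   # assumed: synchronous engine
    "MXNET_EXEC_BULK_EXEC_TRAIN": "0",    # assumed: disable bulk execution
}
```

Usage would look like `run_with_env(["./lenet"], DEBUG_ENV)`, where `./lenet` is a placeholder. Because the variables are read when the process runs, they have to be in its environment before it starts, which is what passing `env=` guarantees.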
@eric-haibin-lin Looks like turning off bulk execution makes no difference for me. I would also like to provide my stack trace (produced by cpp-package/examples/lenet.cpp):
(gdb) bt
#0 0x00007fffe5cea81b in malloc_consolidate () from /nix/store/izxnyg94352qxa4a4783dzgnpy5cwazj-glibc-2.25/lib/libc.so.6
#1 0x00007fffe5ceb400 in _int_free () from /nix/store/izxnyg94352qxa4a4783dzgnpy5cwazj-glibc-2.25/lib/libc.so.6
#2 0x00007fffe7bcee60 in __gnu_cxx::new_allocator<mxnet::TBlob>::deallocate (this=0x7ffed000dee8, __p=<optimized out>)
at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/ext/new_allocator.h:110
#3 std::allocator_traits<std::allocator<mxnet::TBlob> >::deallocate (__a=..., __n=<optimized out>, __p=<optimized out>)
at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/bits/alloc_traits.h:517
#4 std::_Vector_base<mxnet::TBlob, std::allocator<mxnet::TBlob> >::_M_deallocate (this=0x7ffed000dee8, __n=<optimized out>, __p=<optimized out>)
at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/bits/stl_vector.h:178
#5 std::_Vector_base<mxnet::TBlob, std::allocator<mxnet::TBlob> >::~_Vector_base (this=0x7ffed000dee8, __in_chrg=<optimized out>)
at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/bits/stl_vector.h:160
#6 std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> >::~vector (this=0x7ffed000dee8, __in_chrg=<optimized out>)
at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/bits/stl_vector.h:425
#7 mxnet::exec::ForwardOpExecutor::~ForwardOpExecutor (this=0x7ffed000de30, __in_chrg=<optimized out>) at src/executor/attach_op_execs_pass.cc:25
#8 0x0000000000405f56 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7ffed000de20)
at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/bits/shared_ptr_base.h:150
#9 0x00007fffe7baa588 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7ffed0014e58, __in_chrg=<optimized out>)
at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/bits/shared_ptr_base.h:659
#10 std::__shared_ptr<mxnet::exec::OpExecutor, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7ffed0014e50, __in_chrg=<optimized out>)
at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/bits/shared_ptr_base.h:925
#11 std::shared_ptr<mxnet::exec::OpExecutor>::~shared_ptr (this=0x7ffed0014e50, __in_chrg=<optimized out>)
at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/bits/shared_ptr.h:93
#12 mxnet::exec::GraphExecutor::<lambda(mxnet::RunContext, mxnet::Engine::CallbackOnComplete)>::~<lambda> (this=0x7ffed0014e50, __in_chrg=<optimized out>)
at src/executor/graph_executor.cc:662
#13 std::_Function_base::_Base_manager<mxnet::exec::GraphExecutor::InitCachedOps()::<lambda(mxnet::RunContext, mxnet::Engine::CallbackOnComplete)> >::_M_destroy (__victim=...) at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/functional:1726
#14 std::_Function_base::_Base_manager<mxnet::exec::GraphExecutor::InitCachedOps()::<lambda(mxnet::RunContext, mxnet::Engine::CallbackOnComplete)> >::_M_manager(std::_Any_data &, const std::_Any_data &, std::_Manager_operation) (__dest=..., __source=..., __op=<optimized out>)
at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/functional:1750
#15 0x00007fffe7b413ce in std::_Function_base::~_Function_base (this=0x1599dd0, __in_chrg=<optimized out>)
at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/functional:1830
#16 std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>::~function() (this=0x1599dd0, __in_chrg=<optimized out>)
at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/functional:1974
#17 mxnet::engine::ThreadedOpr::~ThreadedOpr (this=0x1599dd0, __in_chrg=<optimized out>) at src/engine/./threaded_engine.h:200
#18 mxnet::common::ObjectPool<mxnet::engine::ThreadedOpr>::Delete (this=0xa23b20, ptr=0x1599dd0) at src/engine/./../common/object_pool.h:139
#19 0x00007fffe77b425d in std::function<void (mxnet::RunContext)>::operator()(mxnet::RunContext) const (__args#0=..., this=<optimized out>)
at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/functional:2267
#20 mxnet::Engine::PushSync(std::function<void (mxnet::RunContext)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const (on_complete=..., ctx=..., __closure=<optimized out>)
at include/mxnet/././engine.h:211
#21 std::_Function_handler<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete), mxnet::Engine::PushSync(std::function<void (mxnet::RunContext)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&, mxnet::engine::CallbackOnComplete&&) (__functor=..., __args#0=<optimized out>, __args#1=<optimized out>)
at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/functional:1871
#22 0x00007fffe7b47e37 in std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const (__args#1=..., __args#0=..., this=0x15908f0) at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/functional:2267
#23 mxnet::engine::ThreadedEngine::ExecuteOprBlock (this=<optimized out>, run_ctx=..., opr_block=0xa17de8, this=<optimized out>)
at src/engine/./threaded_engine.h:321
#24 0x00007fffe7b4e4f6 in mxnet::engine::ThreadedEnginePerDevice::CPUWorker<(dmlc::ConcurrentQueueType)0> (block=0xa14b70, this=0xa106f0)
at src/engine/threaded_engine_perdevice.cc:180
#25 mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const
(__closure=<optimized out>) at src/engine/threaded_engine_perdevice.cc:76
#26 std::_Function_handler<void (), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/functional:1871
#27 0x00007fffe65efd00 in ?? () from /nix/store/zag7bpja0fxm2r45x5xzdv8ff3rvj2nx-gcc-5.4.0-lib/lib/libstdc++.so.6
#28 0x00007fffcaf97234 in start_thread () from /nix/store/izxnyg94352qxa4a4783dzgnpy5cwazj-glibc-2.25/lib/libpthread.so.0
#29 0x00007fffe5d5b75f in clone () from /nix/store/izxnyg94352qxa4a4783dzgnpy5cwazj-glibc-2.25/lib/libc.so.6 |
I can confirm by setting |
I'll investigate if there's any issue for threaded engine in cpp-package over the weekend |
@GaiYu0 @sifmelcara Are the stack traces from the latest mxnet version? Did you try the latest version? |
I am using the latest mxnet from the engine branch. |
The stack trace is from the master branch. However, I also tested the v0.9.3 stable release and it has the same issue. |
@sifmelcara Were you using the MNIST dataset as the input for the lenet example? The example code expects a 'train.csv' to read data from. What did you use as input? |
I downloaded the training set from https://pjreddie.com/projects/mnist-in-csv/ and renamed it to |
@sifmelcara I ran lenet example for 5089 iters and could not reproduce this bug. I am running commit 96eb4f5 from this pr: #5844 |
I booted the exact same hard drive on two computers: one has a 1080 Ti, the other has a 1070.
I guess we need a fast GPU to reproduce this bug. |
@sifmelcara Well, I am able to reproduce @GaiYu0's issue with the same hardware on a previous version of MXNet, but it seems to be fixed with 96eb4f5. |
I just tested it again. Here are the steps I took to reproduce the bug.
|
@eric-haibin-lin I found a way to consistently reproduce the issue on my machine.
I really appreciate your help in this issue. Thank you. |
Since @eric-haibin-lin reported that @GaiYu0's issue has probably been fixed by 96eb4f5, I guess my issue is somewhat different from this one. |
Looks like both issues are fixed. Closing it for now. |
Hi, can anyone please tell me how to delay the MXExecutorFree() call?
Hi! I encounter these errors when training a network:
*** Error in `/usr/bin/python': malloc(): memory corruption (fast): 0x0000000001755880 ***
*** Error in `/usr/bin/python': free(): invalid pointer: 0x000000000171ec30 ***
I am using the latest version of mxnet from the engine branch. Similar errors occur when I use mxnet from the master branch.
Could anyone help? Thank you very much!
Unfortunately I cannot get a Python stack trace, but a C stack trace is available:
#0 0x00007ffff782dc37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007ffff7831028 in __GI_abort () at abort.c:89
#2 0x00007ffff786a2a4 in __libc_message (do_abort=do_abort@entry=1, fmt=fmt@entry=0x7ffff79786b0 "*** Error in `%s': %s: 0x%s
at ../sysdeps/posix/libc_fatal.c:175
#3 0x00007ffff7874ff7 in malloc_printerr (action=<optimized out>, str=0x7ffff7978a50 "malloc(): memory corruption (fast)",
ptr=<optimized out>) at malloc.c:4996
#4 0x00007ffff7877cf4 in _int_malloc (av=0x7fff00000020, bytes=24) at malloc.c:3359
#5 0x00007ffff78796c0 in __GI___libc_malloc (bytes=24) at malloc.c:2891
#6 0x00007fffddad6dad in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007fffebcab89d in std::_Function_base::_Base_manager<mxnet::NDArray::Chunk::~Chunk()::{lambda(mxnet::RunContext)#2}>::_M_manager(std::_Any_data&, std::_Function_base::_Base_manager<mxnet::NDArray::Chunk::~Chunk()::{lambda(mxnet::RunContext)#2}> const&, std::_Manager_operation) () from /home/gaiyu/developping/mxnet_engine/python/mxnet/../../lib/libmxnet.so
#8 0x00007fffebcf715f in std::_Function_base::_Base_manager<mxnet::engine::ThreadedEngine::DeleteVariable(std::function<void (mxnet::RunContext)>, mxnet::Context, mxnet::engine::Var*)::{lambda(mxnet::RunContext)#1}>::_M_manager(std::_Any_data&, std::_Function_base::_Base_manager<mxnet::engine::ThreadedEngine::DeleteVariable(std::function<void (mxnet::RunContext)>, mxnet::Context, mxnet::engine::Var*)::{lambda(mxnet::RunContext)#1}> const&, std::_Manager_operation) ()
from /home/gaiyu/developping/mxnet_engine/python/mxnet/../../lib/libmxnet.so
#9 0x00007fffebcaca74 in std::function<void (mxnet::RunContext)>::function(std::function<void (mxnet::RunContext)> const&) ()
from /home/gaiyu/developping/mxnet_engine/python/mxnet/../../lib/libmxnet.so
#10 0x00007fffebcf73b1 in mxnet::engine::ThreadedEngine::DeleteVariable(std::function<void (mxnet::RunContext)>, mxnet::Context, mxnet::engine::Var*) () from /home/gaiyu/developping/mxnet_engine/python/mxnet/../../lib/libmxnet.so
#11 0x00007fffebcaba6d in std::_Sp_counted_ptr_inplace<mxnet::NDArray::Chunk, std::allocator<mxnet::NDArray::Chunk>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /home/gaiyu/developping/mxnet_engine/python/mxnet/../../lib/libmxnet.so
#12 0x00007fffebcad78e in std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >::~vector() ()
from /home/gaiyu/developping/mxnet_engine/python/mxnet/../../lib/libmxnet.so
#13 0x00007fffeb4c8eca in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() ()
from /home/gaiyu/developping/mxnet_engine/python/mxnet/../../lib/libmxnet.so
#14 0x00007fffebd114b8 in std::_Function_base::_Base_manager<mxnet::exec::GraphExecutor::InitCachedOps()::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#3}>::_M_manager(std::_Any_data&, std::_Function_base::_Base_manager<mxnet::exec::GraphExecutor::InitCachedOps()::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#3}> const&, std::_Manager_operation) ()
from /home/gaiyu/developping/mxnet_engine/python/mxnet/../../lib/libmxnet.so
#15 0x00007fffebcf82d6 in std::_Function_handler<void (mxnet::RunContext), mxnet::engine::ThreadedEngine::DeleteOperator(mxnet::engine::Opr*)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext) ()
from /home/gaiyu/developping/mxnet_engine/python/mxnet/../../lib/libmxnet.so
#16 0x00007fffebcab693 in operator() (__args#0=..., this=<optimized out>) at /usr/include/c++/4.8/functional:2471
#17 operator() (on_complete=..., ctx=..., __closure=<optimized out>) at include/mxnet/././engine.h:213
#18 std::_Function_handler<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete), mxnet::Engine::PushSync(std::function<void (mxnet::RunContext)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext, mxnet::engine::CallbackOnComplete) (__functor=...,
__args#0=..., __args#1=...) at /usr/include/c++/4.8/functional:2071
#19 0x00007fffebcfe06c in mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*) ()
from /home/gaiyu/developping/mxnet_engine/python/mxnet/../../lib/libmxnet.so
#20 0x00007fffebd0097e in std::_Function_handler<void (), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /home/gaiyu/developping/mxnet_engine/python/mxnet/../../lib/libmxnet.so
#21 0x00007fffddb29a60 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#22 0x00007ffff7bc4184 in start_thread (arg=0x7fff433fd700) at pthread_create.c:312
#23 0x00007ffff78f137d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
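On the missing Python stack trace: one option that may help is Python's standard `faulthandler` module (Python 3.3+; Python 2 would need the separately installed backport), which dumps the Python-level traceback when the process receives a fatal signal such as `SIGABRT`. The sketch below is a self-contained demonstration under stated assumptions: `train_step` and `os.abort()` are stand-ins for a training step that crashes in native code, not anything from MXNet. When the crash happens on a native worker thread, as in the trace above, the dump would show what the Python threads were doing at that moment rather than the crashing frame itself.

```python
import subprocess
import sys

# Child script: enable faulthandler, then abort the way glibc does when it
# detects heap corruption (abort() raises SIGABRT).
CHILD = """
import faulthandler, os
faulthandler.enable()   # dump Python tracebacks on SIGSEGV, SIGABRT, etc.

def train_step():
    os.abort()          # stand-in for a crash inside native code

train_step()
"""


def run_child():
    """Run the child script and return its stderr, where the dump is written."""
    proc = subprocess.run([sys.executable, "-c", CHILD],
                          stderr=subprocess.PIPE, text=True)
    return proc.stderr
```

In a real training script, the only change needed would be calling `faulthandler.enable()` near the top, or starting the interpreter with `python -X faulthandler`.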