
tst_huge hangs with memory over 1GB. #1049

Closed
nyh opened this issue Aug 21, 2019 · 5 comments

nyh commented Aug 21, 2019

This issue was discovered by @wkozaczuk and reported on the mailing list. It is likely a recent regression, or an old bug which, for a reason not yet understood, got exposed by recent changes to the memory layout.

Running tst-huge.so with 1GB of memory works fine:

scripts/build image=tests
scripts/run.py -m1G -e tests/tst-huge.so

However, giving it more memory, such as the default 2GB, results in a hang.
With gdb we can see the test thread hung on a deadlock:

(gdb) connect
(gdb) osv syms
(gdb) osv info threads
...
 253 (0xffff80000255b040) tests/tst-huge. cpu1 status::waiting waitqueue::wait(lockfree::mutex&) at core/waitqueue.cc:56 vruntime  2.75268e-19
(gdb) osv thread
(gdb) bt
#0  sched::thread::switch_to (this=0xffff800000a6a040, 
    this@entry=0xffff800000a81040) at arch/x64/arch-switch.hh:108
#1  0x00000000403f31c4 in sched::cpu::reschedule_from_interrupt (
    this=0xffff800000053040, called_from_yield=called_from_yield@entry=false, 
    preempt_after=..., preempt_after@entry=...) at core/sched.cc:339
#2  0x00000000403f36ac in sched::cpu::schedule () at include/osv/sched.hh:1310
#3  0x00000000403f3d72 in sched::thread::wait (
    this=this@entry=0xffff80000255b040) at core/sched.cc:1214
#4  0x000000004041b0e2 in sched::thread::do_wait_for<lockfree::mutex, sched::wait_object<waitqueue> > (mtx=...) at include/osv/mutex.h:41
#5  sched::thread::wait_for<waitqueue&> (mtx=...) at include/osv/sched.hh:1220
#6  waitqueue::wait (this=this@entry=0x409a3550 <mmu::vma_list_mutex+48>, 
    mtx=...) at core/waitqueue.cc:56
#7  0x00000000403d6ea3 in rwlock::reader_wait_lockable (this=<optimized out>)
    at core/rwlock.cc:174
#8  rwlock::rlock (this=this@entry=0x409a3520 <mmu::vma_list_mutex>)
    at core/rwlock.cc:29
#9  0x0000000040341e0f in rwlock_for_read::lock (
    this=0x409a3520 <mmu::vma_list_mutex>) at include/osv/rwlock.h:113
#10 std::lock_guard<rwlock_for_read&>::lock_guard (__m=..., 
    this=<synthetic pointer>) at /usr/include/c++/8/bits/std_mutex.h:162
#11 lock_guard_for_with_lock<rwlock_for_read&>::lock_guard_for_with_lock (
    lock=..., this=<synthetic pointer>) at include/osv/mutex.h:89
#12 mmu::vm_fault (addr=18446603337326391296, addr@entry=18446603337326395384, 
    ef=ef@entry=0xffff800002560068) at core/mmu.cc:1334
#13 0x000000004039cdf6 in page_fault (ef=0xffff800002560068)
    at arch/x64/mmu.cc:38
#14 <signal handler called>
#15 0x00000000403e5b64 in memory::page_range_allocator::insert<true> (
    this=this@entry=0x409bb300 <memory::free_page_ranges>, pr=...)
    at core/mempool.cc:575
#16 0x00000000403e337c in memory::page_range_allocator::<lambda(memory::page_range&)>::operator() (header=..., __closure=<synthetic pointer>)
    at core/mempool.cc:751
#17 memory::page_range_allocator::<lambda(memory::page_range&)>::operator() (
    header=..., __closure=<synthetic pointer>) at core/mempool.cc:736
#18 memory::page_range_allocator::for_each<memory::page_range_allocator::alloc_aligned(size_t, size_t, size_t, bool)::<lambda(memory::page_range&)> > (f=..., 
    min_order=<optimized out>, this=0x409bb300 <memory::free_page_ranges>)
    at core/mempool.cc:809
#19 memory::page_range_allocator::alloc_aligned (
    this=this@entry=0x409bb300 <memory::free_page_ranges>, 
    size=size@entry=2097152, offset=offset@entry=0, 
    alignment=alignment@entry=2097152, fill=fill@entry=true)
    at core/mempool.cc:736
#20 0x00000000403e3c90 in memory::alloc_huge_page (N=N@entry=2097152)
    at core/mempool.cc:1601
#21 0x000000004034718e in mmu::uninitialized_anonymous_page_provider::map (
    this=0x408681b0 <mmu::page_allocator_init>, offset=83886080, ptep=..., 
    pte=..., write=<optimized out>) at include/osv/mmu-defs.hh:219
#22 0x000000004034c79a in mmu::populate<(mmu::account_opt)1>::page<1> (
    offset=83886080, ptep=..., this=0x2000001ffd80)
    at include/osv/mmu-defs.hh:235
#23 mmu::page<mmu::populate<>, 1> (ptep=..., offset=83886080, pops=...)
    at core/mmu.cc:311
#24 mmu::map_level<mmu::populate<(mmu::account_opt)1>, 2>::operator() (
    base_virt=35185397596160, parent=..., this=<synthetic pointer>)
    at core/mmu.cc:437
#25 mmu::map_level<mmu::populate<(mmu::account_opt)1>, 3>::map_range<2> (
    this=<synthetic pointer>, ptep=..., base_virt=35184372088832, slop=4096, 
    page_mapper=..., size=132120576, vcur=<optimized out>) at core/mmu.cc:399
#26 mmu::map_level<mmu::populate<(mmu::account_opt)1>, 3>::operator() (
    base_virt=35184372088832, parent=..., this=<synthetic pointer>)
    at core/mmu.cc:449
#27 mmu::map_level<mmu::populate<(mmu::account_opt)1>, 4>::map_range<3> (
    this=<synthetic pointer>, ptep=..., base_virt=35184372088832, slop=4096, 
    page_mapper=..., size=134217728, vcur=<optimized out>) at core/mmu.cc:399
#28 mmu::map_level<mmu::populate<(mmu::account_opt)1>, 4>::operator() (
    base_virt=35184372088832, parent=..., this=<synthetic pointer>)
    at core/mmu.cc:449
#29 mmu::map_range<mmu::populate<(mmu::account_opt)1> > (
    vma_start=vma_start@entry=35185313710080, 
    vstart=vstart@entry=35185313710080, size=<optimized out>, page_mapper=..., 
    slop=slop@entry=4096) at core/mmu.cc:354
#30 0x000000004034cfc4 in mmu::operate_range<mmu::populate<(mmu::account_opt)1> > (size=<optimized out>, start=0x200038200000, vma_start=<optimized out>, 
    mapper=...) at core/mmu.cc:801
#31 mmu::vma::operate_range<mmu::populate<(mmu::account_opt)1> > (size=3, 
    addr=0x200038200000, mapper=..., this=0xffffa000481ff300)
    at core/mmu.cc:1412
#32 mmu::populate_vma<(mmu::account_opt)1> (vma=vma@entry=0xffffa000481ff300, 
    v=v@entry=0x200038200000, size=size@entry=134217728, 
    write=write@entry=false) at core/mmu.cc:1206
#33 0x0000000040345809 in mmu::map_anon (addr=addr@entry=0x0, 
    size=size@entry=134217728, flags=flags@entry=2, perm=perm@entry=3)
    at core/mmu.cc:1222
#34 0x0000000040456dad in mmap (addr=addr@entry=0x0, 
    length=length@entry=134217728, prot=prot@entry=3, flags=flags@entry=32802, 
    fd=fd@entry=-1, offset=offset@entry=0) at libc/mman.cc:156
#35 0x00001000000685d4 in exhaust_memory (size=size@entry=134217728)
    at /home/nyh/osv.tmp2/osv/tests/tst-huge.cc:31
#36 0x000010000006820b in main (argc=<optimized out>, argv=<optimized out>)

The deadlock happens because map_anon (frame 33) takes vma_list_mutex.for_write(), and then we get a page fault whose handler, page_fault (frame 13), takes the same rwlock for reading. Apparently our rwlock implementation isn't recursive - which maybe we should reconsider - but that isn't the real bug here: the real bug is why we got this page fault in the first place.
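
To make that lock pattern concrete, here is a minimal self-deadlock illustration using std::shared_mutex as a stand-in for OSv's rwlock (a sketch only, not OSv code; the names vma_list_lock, fault_handler_like and map_anon_like are made up):

#include <mutex>
#include <shared_mutex>

std::shared_mutex vma_list_lock;                 // stand-in for mmu::vma_list_mutex

void fault_handler_like() {
    // Frames 12-13: the fault handler takes the same lock for read. The lock
    // is not recursive, and the writer below is this very thread, so we block forever.
    std::shared_lock<std::shared_mutex> r(vma_list_lock);
}

void map_anon_like() {
    // Frame 33: map_anon takes the lock for write...
    std::unique_lock<std::shared_mutex> w(vma_list_lock);
    // ...and while populating the mapping we hit a page fault on an unmapped
    // kernel address, which runs the fault handler on the same thread:
    fault_handler_like();
}

int main() { map_anon_like(); }                  // hangs, mirroring the backtrace above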

The page fault happens on address 0xffff800041dffff8:

(gdb) fr 13
#13 0x000000004039cdf6 in page_fault (ef=0xffff800002560068)
    at arch/x64/mmu.cc:38
38	        mmu::vm_fault(addr, ef);
(gdb) p (void*) addr
$2 = (void *) 0xffff800041dffff8

How is this address calculated?

(gdb) fr 15
#15 0x00000000403e5b64 in memory::page_range_allocator::insert<true> (
    this=this@entry=0x409bb300 <memory::free_page_ranges>, pr=...)
    at core/mempool.cc:575
575	    void insert(page_range& pr) {
(gdb) l
570	        return size;
571	    }
572	
573	private:
574	    template<bool UseBitmap = true>
575	    void insert(page_range& pr) {
576	        auto addr = static_cast<void*>(&pr);
577	        auto pr_end = static_cast<page_range**>(addr + pr.size - sizeof(page_range**));
578	        *pr_end = &pr;
579	        auto order = ilog2(pr.size / page_size);
(gdb) p addr
$3 = (void *) 0xffff800002561000
(gdb) p pr_size
No symbol "pr_size" in current context.
(gdb) p pr.size
$4 = 1066004480
(gdb) p pr_end
$5 = (memory::page_range **) 0xffff800041dffff8

So the problematic address is pr_end. It is calculated by taking pr and adding 1066004480 to it, which is almost one gigabyte. Does this make sense? Do we really have such a huge contiguous (in physical memory...) allocation? We need to debug this further.
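
To double-check where that address comes from, here is a standalone sketch (not OSv code) that redoes the pr_end arithmetic from insert() with the exact values gdb printed above:

#include <cstdint>
#include <cstdio>

int main() {
    uintptr_t addr    = 0xffff800002561000;   // &pr, from "p addr" above
    uint64_t  pr_size = 1066004480;           // pr.size, from "p pr.size" above
    // Same formula as insert(): the back-pointer slot in the last 8 bytes of the range.
    uintptr_t pr_end  = addr + pr_size - sizeof(void**);
    printf("%#lx\n", pr_end);                 // prints 0xffff800041dffff8
}

The result matches the faulting address exactly, so the fault is simply insert() writing the end-of-range back-pointer for this almost-1GB page range.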

Another interesting observation: running tst-huge.so with -m2G or even -m1.1G fails as above, but if I run it with a size only a tiny bit over 1G, it fails in more spectacular ways:

$ scripts/run.py -m1.01G -e tests/tst-huge.so
OSv v0.53.0-88-g5377a50b
Assertion failed: ef->rflags & processor::rflags_if (arch/x64/mmu.cc: page_fault: 34)
Halting.
$ scripts/run.py -m1.02G -e tests/tst-huge.so
OSv v0.53.0-88-g5377a50b
page fault outside application, addr: 0xffff800067a13000
page fault outside application, addr: 0x0000000000000000
exception nested too deeplyHalting.

I'll open a new issue about this, because it happens before the test runs, so it's not at all specific to tst-huge.so - although it may turn out that both issues stem from the same bug.


wkozaczuk commented Aug 21, 2019

I am not sure if this is exactly the same issue, but I am getting this stack trace when running tst-huge.so as of commit 97fe8aa with a manually applied patch to fix the >1GB issue:

./scripts/run.py -m1.01G -e '/tests/tst-huge.so'
OSv v0.53.0-32-g97fe8aa3
page fault outside application, addr: 0xffff80004144d000
[registers]
RIP: 0x00000000003f1154 <void memory::page_range_allocator::insert<true>(memory::page_range&)+20>
RFL: 0x0000000000010282  CS:  0x0000000000000008  SS:  0x0000000000000010
RAX: 0x0000000040aed000  RBX: 0xffff800000961000  RCX: 0x0000000000000001  RDX: 0x0000000000902300
RSI: 0xffff800000961000  RDI: 0x0000000000902300  RBP: 0xffff8000001dc990  R8:  0xffff800000950008
R9:  0x0000000000902300  R10: 0x0000000000902300  R11: 0x00000000ffffefff  R12: 0x0000000000902300
R13: 0x0000000000902300  R14: 0x0000000000902300  R15: 0x00000000ffffefff  RSP: 0xffff8000001dc970
Aborted

[backtrace]
0x00000000003495e2 <???+3446242>
0x000000000034ae39 <mmu::vm_fault(unsigned long, exception_frame*)+393>
0x00000000003a6dcd <page_fault+125>
0x00000000003a5c46 <???+3824710>
0x00000000003f1e2a <memory::page_range* memory::page_range_allocator::alloc<true>(unsigned long)+522>
0x00000000003ef3c9 <???+4125641>
0x00000000003ef5ff <???+4126207>
0x00000000003ef9a6 <malloc+70>
0x00000000003fba17 <sched::thread::init_stack()+103>
0x00000000004011f8 <sched::thread::thread(std::function<void ()>, sched::thread::attr, bool, bool)+1096>
0x00000000003a5129 <smp_launch()+2057>
0x000000000022d271 <main_cont(int, char**)+193>
0x0000000000400146 <thread_main_c+38>
0x00000000003a6bc2 <???+3828674>

The same happens with -m1.02G.

@wkozaczuk

OK, watch these two numbers in this run with 1.01G, with the free_initial_memory_range calls logged:

free_initial_memory_range, addr: 000000000093daf4
free_initial_memory_range, size: 000000003f6c250c
free_initial_memory_range, addr: 0000000000000000
free_initial_memory_range, size: 000000000009fc00
free_initial_memory_range, addr: 0000000000100000
free_initial_memory_range, size: 0000000000100000
free_initial_memory_range, addr: 0000000040000000
free_initial_memory_range, size: 0000000000a0e000
page fault outside application, addr: 0xffff800040a0d000
[registers]
RIP: 0x00000000003f1154 <void memory::page_range_allocator::insert<true>(memory::page_range&)+20>
RFL: 0x0000000000010282  CS:  0x0000000000000008  SS:  0x0000000000000010
RAX: 0x00000000400ad000  RBX: 0xffff800000961000  RCX: 0x0000000000000001  RDX: 0x0000000000902300
RSI: 0xffff800000961000  RDI: 0x0000000000902300  RBP: 0xffff8000001dc990  R8:  0xffff800000950008
R9:  0x0000000000902300  R10: 0x0000000000902300  R11: 0x00000000ffffefff  R12: 0x0000000000902300
R13: 0x0000000000902300  R14: 0x0000000000902300  R15: 0x00000000ffffefff  RSP: 0xffff8000001dc970
Aborted

The fault is at 0xffff800040a0d000, which is one page below the end of memory
(physical 0x0000000040a0e000 = 0x0000000040000000 + 0x0000000000a0e000).
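
As a quick sanity check of those numbers (again a standalone sketch, not OSv code):

#include <cstdint>
#include <cstdio>

int main() {
    uintptr_t phys_base  = 0x0000000040000000;  // addr of the last free_initial_memory_range
    uint64_t  size       = 0x0000000000a0e000;  // its size
    uintptr_t phys_end   = phys_base + size;    // 0x40a0e000: end of physical memory
    uintptr_t fault_phys = 0x0000000040a0d000;  // physical part of 0xffff800040a0d000
    printf("%#lx\n", phys_end - fault_phys);    // prints 0x1000, i.e. exactly one page
}

RIP is again inside page_range_allocator::insert, so presumably this is the same pattern as in the first backtrace: the write of the end-of-range back-pointer lands in the last page of the largest free range, and that page is apparently not mapped.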

Every time I run it I get the same stack trace, the same page fault address, and the same other numbers - it is super repeatable.

@wkozaczuk

Some kind of edge-case bug in memory::page_range_allocator?


nyh commented Aug 21, 2019

@wkozaczuk your last comment is exactly #1050, which I opened too. In #1050 I noticed that the crash happens before the application runs - it doesn't have anything to do with tst_huge. It may have the same cause - or a different one - I don't know. Please check out #1050.

@wkozaczuk

@nyh I did see it, but I was not sure which issue it belonged to. Another piece of evidence: it started happening right after we applied my patch that changed arch_setup_free_memory() to start using the memory below the kernel, which more and more suggests that both #1049 and #1050 started appearing at the same time. I added some detail to the email - https://groups.google.com/d/msg/osv-dev/N5Knl4HE25o/dl27EFM-BAAJ - which relates to both #1049 and #1050. Sorry for creating more mess.

nyh closed this as completed in fb7ef9a on Aug 24, 2019