
tst_huge hangs with memory over 1GB. #1049

Closed
nyh opened this issue Aug 21, 2019 · 5 comments

nyh commented Aug 21, 2019

This issue was discovered by @wkozaczuk and reported on the mailing list. It is likely a recent regression, or an old bug which, for a reason not yet understood, got exposed by recent changes to the memory layout.

Running tst-huge.so with 1GB of memory works fine:

scripts/build image=tests
scripts/run.py -m1G -e tests/tst-huge.so

However, giving it more memory, such as the default 2GB, results in a hang.
With gdb we can see the test thread hung on a deadlock:

(gdb) connect
(gdb) osv syms
(gdb) osv info threads
...
 253 (0xffff80000255b040) tests/tst-huge. cpu1 status::waiting waitqueue::wait(lockfree::mutex&) at core/waitqueue.cc:56 vruntime  2.75268e-19
(gdb) osv thread
(gdb) bt
#0  sched::thread::switch_to (this=0xffff800000a6a040, 
    this@entry=0xffff800000a81040) at arch/x64/arch-switch.hh:108
#1  0x00000000403f31c4 in sched::cpu::reschedule_from_interrupt (
    this=0xffff800000053040, called_from_yield=called_from_yield@entry=false, 
    preempt_after=..., preempt_after@entry=...) at core/sched.cc:339
#2  0x00000000403f36ac in sched::cpu::schedule () at include/osv/sched.hh:1310
#3  0x00000000403f3d72 in sched::thread::wait (
    this=this@entry=0xffff80000255b040) at core/sched.cc:1214
#4  0x000000004041b0e2 in sched::thread::do_wait_for<lockfree::mutex, sched::wait_object<waitqueue> > (mtx=...) at include/osv/mutex.h:41
#5  sched::thread::wait_for<waitqueue&> (mtx=...) at include/osv/sched.hh:1220
#6  waitqueue::wait (this=this@entry=0x409a3550 <mmu::vma_list_mutex+48>, 
    mtx=...) at core/waitqueue.cc:56
#7  0x00000000403d6ea3 in rwlock::reader_wait_lockable (this=<optimized out>)
    at core/rwlock.cc:174
#8  rwlock::rlock (this=this@entry=0x409a3520 <mmu::vma_list_mutex>)
    at core/rwlock.cc:29
#9  0x0000000040341e0f in rwlock_for_read::lock (
    this=0x409a3520 <mmu::vma_list_mutex>) at include/osv/rwlock.h:113
#10 std::lock_guard<rwlock_for_read&>::lock_guard (__m=..., 
    this=<synthetic pointer>) at /usr/include/c++/8/bits/std_mutex.h:162
#11 lock_guard_for_with_lock<rwlock_for_read&>::lock_guard_for_with_lock (
    lock=..., this=<synthetic pointer>) at include/osv/mutex.h:89
#12 mmu::vm_fault (addr=18446603337326391296, addr@entry=18446603337326395384, 
    ef=ef@entry=0xffff800002560068) at core/mmu.cc:1334
#13 0x000000004039cdf6 in page_fault (ef=0xffff800002560068)
    at arch/x64/mmu.cc:38
#14 <signal handler called>
#15 0x00000000403e5b64 in memory::page_range_allocator::insert<true> (
    this=this@entry=0x409bb300 <memory::free_page_ranges>, pr=...)
    at core/mempool.cc:575
#16 0x00000000403e337c in memory::page_range_allocator::<lambda(memory::page_range&)>::operator() (header=..., __closure=<synthetic pointer>)
    at core/mempool.cc:751
#17 memory::page_range_allocator::<lambda(memory::page_range&)>::operator() (
    header=..., __closure=<synthetic pointer>) at core/mempool.cc:736
#18 memory::page_range_allocator::for_each<memory::page_range_allocator::alloc_aligned(size_t, size_t, size_t, bool)::<lambda(memory::page_range&)> > (f=..., 
    min_order=<optimized out>, this=0x409bb300 <memory::free_page_ranges>)
    at core/mempool.cc:809
#19 memory::page_range_allocator::alloc_aligned (
    this=this@entry=0x409bb300 <memory::free_page_ranges>, 
    size=size@entry=2097152, offset=offset@entry=0, 
    alignment=alignment@entry=2097152, fill=fill@entry=true)
    at core/mempool.cc:736
#20 0x00000000403e3c90 in memory::alloc_huge_page (N=N@entry=2097152)
    at core/mempool.cc:1601
#21 0x000000004034718e in mmu::uninitialized_anonymous_page_provider::map (
    this=0x408681b0 <mmu::page_allocator_init>, offset=83886080, ptep=..., 
    pte=..., write=<optimized out>) at include/osv/mmu-defs.hh:219
#22 0x000000004034c79a in mmu::populate<(mmu::account_opt)1>::page<1> (
    offset=83886080, ptep=..., this=0x2000001ffd80)
    at include/osv/mmu-defs.hh:235
#23 mmu::page<mmu::populate<>, 1> (ptep=..., offset=83886080, pops=...)
    at core/mmu.cc:311
#24 mmu::map_level<mmu::populate<(mmu::account_opt)1>, 2>::operator() (
    base_virt=35185397596160, parent=..., this=<synthetic pointer>)
    at core/mmu.cc:437
#25 mmu::map_level<mmu::populate<(mmu::account_opt)1>, 3>::map_range<2> (
    this=<synthetic pointer>, ptep=..., base_virt=35184372088832, slop=4096, 
    page_mapper=..., size=132120576, vcur=<optimized out>) at core/mmu.cc:399
#26 mmu::map_level<mmu::populate<(mmu::account_opt)1>, 3>::operator() (
    base_virt=35184372088832, parent=..., this=<synthetic pointer>)
    at core/mmu.cc:449
#27 mmu::map_level<mmu::populate<(mmu::account_opt)1>, 4>::map_range<3> (
    this=<synthetic pointer>, ptep=..., base_virt=35184372088832, slop=4096, 
    page_mapper=..., size=134217728, vcur=<optimized out>) at core/mmu.cc:399
#28 mmu::map_level<mmu::populate<(mmu::account_opt)1>, 4>::operator() (
    base_virt=35184372088832, parent=..., this=<synthetic pointer>)
    at core/mmu.cc:449
#29 mmu::map_range<mmu::populate<(mmu::account_opt)1> > (
    vma_start=vma_start@entry=35185313710080, 
    vstart=vstart@entry=35185313710080, size=<optimized out>, page_mapper=..., 
    slop=slop@entry=4096) at core/mmu.cc:354
#30 0x000000004034cfc4 in mmu::operate_range<mmu::populate<(mmu::account_opt)1> > (size=<optimized out>, start=0x200038200000, vma_start=<optimized out>, 
    mapper=...) at core/mmu.cc:801
#31 mmu::vma::operate_range<mmu::populate<(mmu::account_opt)1> > (size=3, 
    addr=0x200038200000, mapper=..., this=0xffffa000481ff300)
    at core/mmu.cc:1412
#32 mmu::populate_vma<(mmu::account_opt)1> (vma=vma@entry=0xffffa000481ff300, 
    v=v@entry=0x200038200000, size=size@entry=134217728, 
    write=write@entry=false) at core/mmu.cc:1206
#33 0x0000000040345809 in mmu::map_anon (addr=addr@entry=0x0, 
    size=size@entry=134217728, flags=flags@entry=2, perm=perm@entry=3)
    at core/mmu.cc:1222
#34 0x0000000040456dad in mmap (addr=addr@entry=0x0, 
    length=length@entry=134217728, prot=prot@entry=3, flags=flags@entry=32802, 
    fd=fd@entry=-1, offset=offset@entry=0) at libc/mman.cc:156
#35 0x00001000000685d4 in exhaust_memory (size=size@entry=134217728)
    at /home/nyh/osv.tmp2/osv/tests/tst-huge.cc:31
#36 0x000010000006820b in main (argc=<optimized out>, argv=<optimized out>)

The deadlock happens because map_anon (frame 33) takes vma_list_mutex.for_write(), and then we get a page fault whose handler, page_fault (frame 13), takes the same rwlock for reading. Apparently our rwlock implementation isn't recursive - which maybe we should reconsider - but that isn't the real bug here: the real bug is why we got this page fault in the first place.
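
To make that lock pattern concrete, here is a minimal self-deadlock illustration using std::shared_mutex as a stand-in for OSv's rwlock (a sketch only, not OSv code; the names vma_list_lock, fault_handler_like and map_anon_like are made up):

#include <mutex>
#include <shared_mutex>

std::shared_mutex vma_list_lock;                 // stand-in for mmu::vma_list_mutex

void fault_handler_like() {
    // Frames 12-13: the fault handler takes the same lock for read. The lock
    // is not recursive, and the writer below is this very thread, so we block forever.
    std::shared_lock<std::shared_mutex> r(vma_list_lock);
}

void map_anon_like() {
    // Frame 33: map_anon takes the lock for write...
    std::unique_lock<std::shared_mutex> w(vma_list_lock);
    // ...and while populating the mapping we hit a page fault on an unmapped
    // kernel address, which runs the fault handler on the same thread:
    fault_handler_like();
}

int main() { map_anon_like(); }                  // hangs, mirroring the backtrace above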

The page fault happens on address 0xffff800041dffff8:

(gdb) fr 13
#13 0x000000004039cdf6 in page_fault (ef=0xffff800002560068)
    at arch/x64/mmu.cc:38
38	        mmu::vm_fault(addr, ef);
(gdb) p (void*) addr
$2 = (void *) 0xffff800041dffff8

How is this address calculated?

(gdb) fr 15
#15 0x00000000403e5b64 in memory::page_range_allocator::insert<true> (
    this=this@entry=0x409bb300 <memory::free_page_ranges>, pr=...)
    at core/mempool.cc:575
575	    void insert(page_range& pr) {
(gdb) l
570	        return size;
571	    }
572	
573	private:
574	    template<bool UseBitmap = true>
575	    void insert(page_range& pr) {
576	        auto addr = static_cast<void*>(&pr);
577	        auto pr_end = static_cast<page_range**>(addr + pr.size - sizeof(page_range**));
578	        *pr_end = &pr;
579	        auto order = ilog2(pr.size / page_size);
(gdb) p addr
$3 = (void *) 0xffff800002561000
(gdb) p pr_size
No symbol "pr_size" in current context.
(gdb) p pr.size
$4 = 1066004480
(gdb) p pr_end
$5 = (memory::page_range **) 0xffff800041dffff8

So the problematic address is pr_end. It is calculated by taking pr and adding 1066004480 to it, which is almost one gigabyte. Does this make sense? Do we really have such a huge contiguous (in physical memory...) allocation? We need to debug this further.
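
To double-check where that address comes from, here is a standalone sketch (not OSv code) that redoes the pr_end arithmetic from insert() with the exact values gdb printed above:

#include <cstdint>
#include <cstdio>

int main() {
    uintptr_t addr    = 0xffff800002561000;   // &pr, from "p addr" above
    uint64_t  pr_size = 1066004480;           // pr.size, from "p pr.size" above
    // Same formula as insert(): the back-pointer slot in the last 8 bytes of the range.
    uintptr_t pr_end  = addr + pr_size - sizeof(void**);
    printf("%#lx\n", pr_end);                 // prints 0xffff800041dffff8
}

The result matches the faulting address exactly, so the fault is simply insert() writing the end-of-range back-pointer for this almost-1GB page range.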

Another interesting observation: running tst-huge.so with -m2G or even -m1.1G fails as above, but if I run it with a size only a tiny bit over 1G, it fails in more spectacular ways:

$ scripts/run.py -m1.01G -e tests/tst-huge.so
OSv v0.53.0-88-g5377a50b
Assertion failed: ef->rflags & processor::rflags_if (arch/x64/mmu.cc: page_fault: 34)
Halting.
$ scripts/run.py -m1.02G -e tests/tst-huge.so
OSv v0.53.0-88-g5377a50b
page fault outside application, addr: 0xffff800067a13000
page fault outside application, addr: 0x0000000000000000
exception nested too deeplyHalting.

I'll open a new issue about this, because it happens before the test runs, so it's not at all specific to tst-huge.so - although it may turn out that both issues stem from the same bug.


wkozaczuk commented Aug 21, 2019

I am not sure if this is exactly the same issue, but I am getting this stack trace when running tst-huge.so as of commit 97fe8aa with a manually applied patch to fix the >1GB issue:

./scripts/run.py -m1.01G -e '/tests/tst-huge.so'
OSv v0.53.0-32-g97fe8aa3
page fault outside application, addr: 0xffff80004144d000
[registers]
RIP: 0x00000000003f1154 <void memory::page_range_allocator::insert<true>(memory::page_range&)+20>
RFL: 0x0000000000010282  CS:  0x0000000000000008  SS:  0x0000000000000010
RAX: 0x0000000040aed000  RBX: 0xffff800000961000  RCX: 0x0000000000000001  RDX: 0x0000000000902300
RSI: 0xffff800000961000  RDI: 0x0000000000902300  RBP: 0xffff8000001dc990  R8:  0xffff800000950008
R9:  0x0000000000902300  R10: 0x0000000000902300  R11: 0x00000000ffffefff  R12: 0x0000000000902300
R13: 0x0000000000902300  R14: 0x0000000000902300  R15: 0x00000000ffffefff  RSP: 0xffff8000001dc970
Aborted

[backtrace]
0x00000000003495e2 <???+3446242>
0x000000000034ae39 <mmu::vm_fault(unsigned long, exception_frame*)+393>
0x00000000003a6dcd <page_fault+125>
0x00000000003a5c46 <???+3824710>
0x00000000003f1e2a <memory::page_range* memory::page_range_allocator::alloc<true>(unsigned long)+522>
0x00000000003ef3c9 <???+4125641>
0x00000000003ef5ff <???+4126207>
0x00000000003ef9a6 <malloc+70>
0x00000000003fba17 <sched::thread::init_stack()+103>
0x00000000004011f8 <sched::thread::thread(std::function<void ()>, sched::thread::attr, bool, bool)+1096>
0x00000000003a5129 <smp_launch()+2057>
0x000000000022d271 <main_cont(int, char**)+193>
0x0000000000400146 <thread_main_c+38>
0x00000000003a6bc2 <???+3828674>

The same happens with -m1.02G.

@wkozaczuk

OK, watch these two numbers in this run with 1.01G, with the free_initial_memory_range calls logged:

free_initial_memory_range, addr: 000000000093daf4
free_initial_memory_range, size: 000000003f6c250c
free_initial_memory_range, addr: 0000000000000000
free_initial_memory_range, size: 000000000009fc00
free_initial_memory_range, addr: 0000000000100000
free_initial_memory_range, size: 0000000000100000
free_initial_memory_range, addr: 0000000040000000
free_initial_memory_range, size: 0000000000a0e000
page fault outside application, addr: 0xffff800040a0d000
[registers]
RIP: 0x00000000003f1154 <void memory::page_range_allocator::insert<true>(memory::page_range&)+20>
RFL: 0x0000000000010282  CS:  0x0000000000000008  SS:  0x0000000000000010
RAX: 0x00000000400ad000  RBX: 0xffff800000961000  RCX: 0x0000000000000001  RDX: 0x0000000000902300
RSI: 0xffff800000961000  RDI: 0x0000000000902300  RBP: 0xffff8000001dc990  R8:  0xffff800000950008
R9:  0x0000000000902300  R10: 0x0000000000902300  R11: 0x00000000ffffefff  R12: 0x0000000000902300
R13: 0x0000000000902300  R14: 0x0000000000902300  R15: 0x00000000ffffefff  RSP: 0xffff8000001dc970
Aborted

The fault is at 0xffff800040a0d000, which is one page below the end of memory
(physical 0x0000000040a0e000 = 0x0000000040000000 + 0x0000000000a0e000).
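
As a quick sanity check of those numbers (again a standalone sketch, not OSv code):

#include <cstdint>
#include <cstdio>

int main() {
    uintptr_t phys_base  = 0x0000000040000000;  // addr of the last free_initial_memory_range
    uint64_t  size       = 0x0000000000a0e000;  // its size
    uintptr_t phys_end   = phys_base + size;    // 0x40a0e000: end of physical memory
    uintptr_t fault_phys = 0x0000000040a0d000;  // physical part of 0xffff800040a0d000
    printf("%#lx\n", phys_end - fault_phys);    // prints 0x1000, i.e. exactly one page
}

RIP is again inside page_range_allocator::insert, so presumably this is the same pattern as in the first backtrace: the write of the end-of-range back-pointer lands in the last page of the largest free range, and that page is apparently not mapped.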

Every time I run it I get the same stack trace, the same page fault address, and the same other numbers - it is super repeatable.

@wkozaczuk

Some kind of edge-case bug in memory::page_range_allocator?


nyh commented Aug 21, 2019

@wkozaczuk your last comment is exactly #1050, which I opened too. In #1050 I noticed that the crash happens before the application runs - it doesn't have anything to do with tst_huge. It may have the same cause - or a different one - I don't know. Please check out #1050.

@wkozaczuk

@nyh I did see it, but I was not sure which issue it belonged to. Another piece of evidence: it started happening right after we applied my patch that changed arch_setup_free_memory() to start using the memory below the kernel, which more and more suggests that both #1049 and #1050 started appearing at the same time. I added some detail to the email - https://groups.google.com/d/msg/osv-dev/N5Knl4HE25o/dl27EFM-BAAJ - which relates to both #1049 and #1050. Sorry for creating more mess.

nyh closed this as completed in fb7ef9a on Aug 24, 2019