aarch64: crashes when enabling tracepoint sched_switch #1195

Closed
wkozaczuk opened this issue May 6, 2022 · 2 comments
Comments

@wkozaczuk
Collaborator

The stack traces I am presenting below might be symptoms of a single underlying issue and may be related to #1158.

For example, when running tst-hello.so on a ROFS image with the sched_switch tracepoint enabled, we occasionally get this type of crash:

taskset -c 2-5 ./scripts/run.py -e '/tests/tst-hello.so' --trace sched_switch -H
OSv v0.56.0-86-g6f4334a4
eth0: 192.168.122.15
Booted up in 78.56 ms
Cmdline: /tests/tst-hello.so
Hello World
Assertion failed: p_status != thread::status::queued (core/sched.cc: schedule_next_thread: 273)

[backtrace]
0x0000000fc00ea000 <__assert_fail+32>
0x0000000fc025ae5c <sched::cpu::schedule_next_thread(bool, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)+1436>
0x0000000fc023047c <???+-1071446916>
0x0000000fc0180854 <interrupt+276>
0x0000000fc017f5f4 <???+-1072171532>
0x0000000fc023047c <???+-1071446916>
0x0000000fc02538f0 <sched::cpu::schedule()+44>
0x0000000fc0243e90 <memory::pool::free(void*)+204>
0x0000000fc02551ac <sched::thread::~thread()+652>
0x0000000fc02553e0 <sched::thread::~thread()+16>
0x0000000fc023b324 <osv::rcu_flush()+276>
0x0000000fc0161860 <elf::program::remove_object(elf::object*)+108>
0x0000000fc0161e8c <???+-1072292212>
0x0000000fc0283d78 <osv::application::join()+488>
0x0000000fc00dd590 <do_main_thread(void*)+1648>
0x0000000fc02b9070 <???+-1070886800>
0x0000000fc02574b0 <thread_main_c+32>
0x0000000fc017e784 <???+-1072175228>
qemu failed.

Running the same test on a ZFS image looks slightly different:

taskset -c 2-5 ./scripts/run.py -e '/tests/tst-hello.so' --trace sched_switch -c 1 -H
OSv v0.56.0-86-g6f4334a4
Assertion failed: p_status != thread::status::queued (core/sched.cc: schedule_next_thread: 273)

[backtrace]
0x0000000fc00ea000 <__assert_fail+32>
0x0000000fc025ae5c <sched::cpu::schedule_next_thread(bool, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)+1436>
0x0000000fc023047c <???+-1071446916>
0x0000000fc0180854 <interrupt+276>
0x0000000fc017f5f4 <???+-1072171532>
0x0000000fc023047c <???+-1071446916>
0x0000000fc02538f0 <sched::cpu::schedule()+44>
0x0000000fc0254900 <sched::thread::wake()+112>
0x0000000fc010b904 <kthread_add+436>
0x0000000fc0105a60 <taskqueue_start_threads+268>
0x000010000000a610 <???+42512>
0x000010000004d978 <???+317816>
0x0000100000054b00 <???+346880>
0x000010000003f400 <???+259072>
0x0000100000036f60 <???+225120>
0x0000100000039054 <???+233556>
0x000010000002c460 <???+181344>
0x000010000007e260 <???+516704>
0x000010000007e75c <???+517980>
0x0000000fc0297b08 <sys_mount+648>
0x0000000fc0294530 <mount_zfs_rootfs+80>
0x0000000fc00de1ac <do_main_thread(void*)+4748>
0x0000000fc02b9070 <???+-1070886800>
0x0000000fc02574b0 <thread_main_c+32>
0x0000000fc017e784 <???+-1072175228>

or

taskset -c 2-5 ./scripts/run.py -e '/tests/tst-hello.so' --trace sched_switch -c 1 -H
OSv v0.56.0-86-g6f4334a4
Assertion failed: _detached_state->st.load(std::memory_order_relaxed) == status::terminating (core/sched.cc: destroy: 1161)

[backtrace]
0x0000000fc00ea000 <__assert_fail+32>
0x0000000fc02547ac <sched::thread::destroy()+216>
0x0000000fc02547e4 <destroy_current_cpu_terminating_thread+36>
0x0000000fc0230534 <???+-1071446732>
0x0000000fc02538f0 <sched::cpu::schedule()+44>
0x0000000fc0254e54 <sched::thread::join()+148>
0x0000000fc025653c <sched::thread::reaper::reap()+316>
0x0000000fc0256600 <???+-1071290880>
0x0000000fc02574b0 <thread_main_c+32>
0x0000000fc017e784 <???+-1072175228>

Please note that many other tracepoints work fine when enabled.

One theory was that we might be running out of stack, so I tried doubling the kernel, application, and exception stacks, but it did not help.

Another theory is that there may still be an issue in sched.S where we are not saving or restoring some registers. Or maybe the fact that these tracepoints are called from sched::cpu::schedule_next_thread() has something to do with it.
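To make the last point concrete, here is an illustrative sketch (not the exact declarations from core/sched.cc) of the OSv tracepoint pattern: tracepoints are declared with the TRACEPOINT macro from <osv/trace.hh> and then called like functions, in this case from inside the scheduler while it is picking the next thread, so the trace slow path (IRQ masking, per-cpu buffer writes) runs in that fragile window. The tracepoint name and helper below are hypothetical.

#include <osv/trace.hh>
#include <osv/sched.hh>

// Hypothetical tracepoint, declared the same way the real sched_* ones are.
TRACEPOINT(trace_sched_switch_example, "to %p", sched::thread*);

static void example_schedule_next_thread(sched::thread* next)
{
    // Firing the tracepoint here means the trace machinery executes
    // before the context switch to "next" has completed.
    trace_sched_switch_example(next);
}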

One more thing: the following tracepoints crash:

  • sched_sched
  • sched_switch
  • sched_load
  • sched_wake

These seem to work (at least sometimes):

  • sched_preempt (sometimes on 2 CPUs)
  • sched_idle
  • sched_wait
@nyh
Contributor

nyh commented May 9, 2022

This failure appears (as you noted in the title) to be specific to aarch64. I couldn't reproduce it on x86.

I tried to review this code again (it is fairly complex, unfortunately), but can't find any bugs just by looking at the code :-(

The trace_slow_path() code in trace.hh calls arch::irq_disable_notrace(), which is necessary to avoid switching CPUs while touching the per-cpu trace buffers, and also to avoid reentrancy between different tracepoints. I hope this function works correctly on aarch64?
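For reference, a minimal sketch of what such IRQ save/disable/restore helpers have to do on aarch64 (an assumption for illustration, not necessarily OSv's exact code): they manipulate the I bit of PSTATE via DAIF. If any of this were subtly wrong, the trace slow path could be interrupted or migrated while writing the per-cpu buffer.

// Sketch only: save the current DAIF mask and mask IRQs (the I bit).
inline unsigned long irq_save_and_disable_sketch()
{
    unsigned long daif;
    asm volatile("mrs %0, daif" : "=r"(daif));
    asm volatile("msr daifset, #2" : : : "memory");
    return daif;
}

// Sketch only: restore the previously saved DAIF mask.
inline void irq_restore_sketch(unsigned long daif)
{
    asm volatile("msr daif, %0" : : "r"(daif) : "memory");
}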

@wkozaczuk
Collaborator Author

I have a feeling this might be related to the unimplemented tls_available (it always returns false) in osv/arch/aarch64/arch.hh, lines 98 to 102 at 6f4334a:

inline bool tls_available()
{
    /* XXX */
    return false;
}

Or maybe it is other aarch64-specific code in this file that is used by trace, like irq_flag_notrace.
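If tls_available() is indeed the problem, one possible implementation on aarch64 (an untested assumption, not the actual OSv code) would be to check whether the TLS base register that compiler-generated code uses, tpidr_el0, has been set up yet:

// Sketch of a possible implementation, not the actual OSv code:
// report TLS as available once the thread pointer register is non-zero.
inline bool tls_available_sketch()
{
    unsigned long tpidr;
    asm volatile("mrs %0, tpidr_el0" : "=r"(tpidr));
    return tpidr != 0;
}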
