aarch64: crashes when enabling tracepoint sched_switch #1195

Closed
wkozaczuk opened this issue May 6, 2022 · 2 comments
Comments

@wkozaczuk
Collaborator

The stack traces I am presenting below might be symptoms of a single underlying issue and may be related to #1158.

For example, when running tst-hello.so on a ROFS image with the sched_switch tracepoint enabled, we occasionally get this type of crash:

taskset -c 2-5 ./scripts/run.py -e '/tests/tst-hello.so' --trace sched_switch -H
OSv v0.56.0-86-g6f4334a4
eth0: 192.168.122.15
Booted up in 78.56 ms
Cmdline: /tests/tst-hello.so
Hello World
Assertion failed: p_status != thread::status::queued (core/sched.cc: schedule_next_thread: 273)

[backtrace]
0x0000000fc00ea000 <__assert_fail+32>
0x0000000fc025ae5c <sched::cpu::schedule_next_thread(bool, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)+1436>
0x0000000fc023047c <???+-1071446916>
0x0000000fc0180854 <interrupt+276>
0x0000000fc017f5f4 <???+-1072171532>
0x0000000fc023047c <???+-1071446916>
0x0000000fc02538f0 <sched::cpu::schedule()+44>
0x0000000fc0243e90 <memory::pool::free(void*)+204>
0x0000000fc02551ac <sched::thread::~thread()+652>
0x0000000fc02553e0 <sched::thread::~thread()+16>
0x0000000fc023b324 <osv::rcu_flush()+276>
0x0000000fc0161860 <elf::program::remove_object(elf::object*)+108>
0x0000000fc0161e8c <???+-1072292212>
0x0000000fc0283d78 <osv::application::join()+488>
0x0000000fc00dd590 <do_main_thread(void*)+1648>
0x0000000fc02b9070 <???+-1070886800>
0x0000000fc02574b0 <thread_main_c+32>
0x0000000fc017e784 <???+-1072175228>
qemu failed.

Running the same test on a ZFS image looks slightly different:

taskset -c 2-5 ./scripts/run.py -e '/tests/tst-hello.so' --trace sched_switch -c 1 -H
OSv v0.56.0-86-g6f4334a4
Assertion failed: p_status != thread::status::queued (core/sched.cc: schedule_next_thread: 273)

[backtrace]
0x0000000fc00ea000 <__assert_fail+32>
0x0000000fc025ae5c <sched::cpu::schedule_next_thread(bool, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)+1436>
0x0000000fc023047c <???+-1071446916>
0x0000000fc0180854 <interrupt+276>
0x0000000fc017f5f4 <???+-1072171532>
0x0000000fc023047c <???+-1071446916>
0x0000000fc02538f0 <sched::cpu::schedule()+44>
0x0000000fc0254900 <sched::thread::wake()+112>
0x0000000fc010b904 <kthread_add+436>
0x0000000fc0105a60 <taskqueue_start_threads+268>
0x000010000000a610 <???+42512>
0x000010000004d978 <???+317816>
0x0000100000054b00 <???+346880>
0x000010000003f400 <???+259072>
0x0000100000036f60 <???+225120>
0x0000100000039054 <???+233556>
0x000010000002c460 <???+181344>
0x000010000007e260 <???+516704>
0x000010000007e75c <???+517980>
0x0000000fc0297b08 <sys_mount+648>
0x0000000fc0294530 <mount_zfs_rootfs+80>
0x0000000fc00de1ac <do_main_thread(void*)+4748>
0x0000000fc02b9070 <???+-1070886800>
0x0000000fc02574b0 <thread_main_c+32>
0x0000000fc017e784 <???+-1072175228>

or

taskset -c 2-5 ./scripts/run.py -e '/tests/tst-hello.so' --trace sched_switch -c 1 -H
OSv v0.56.0-86-g6f4334a4
Assertion failed: _detached_state->st.load(std::memory_order_relaxed) == status::terminating (core/sched.cc: destroy: 1161)

[backtrace]
0x0000000fc00ea000 <__assert_fail+32>
0x0000000fc02547ac <sched::thread::destroy()+216>
0x0000000fc02547e4 <destroy_current_cpu_terminating_thread+36>
0x0000000fc0230534 <???+-1071446732>
0x0000000fc02538f0 <sched::cpu::schedule()+44>
0x0000000fc0254e54 <sched::thread::join()+148>
0x0000000fc025653c <sched::thread::reaper::reap()+316>
0x0000000fc0256600 <???+-1071290880>
0x0000000fc02574b0 <thread_main_c+32>
0x0000000fc017e784 <???+-1072175228>

Please note that many other tracepoints work fine when enabled.

One theory was that we might be running out of stack, so I tried doubling the kernel, application, and exception stacks, but it did not help.

Another theory is that there may still be an issue in sched.S where we are not saving or restoring some registers. Or maybe the fact that these tracepoints are called from sched::cpu::schedule_next_thread() has something to do with it.
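To make the last point concrete, here is an illustrative sketch (not the exact declarations from core/sched.cc) of the OSv tracepoint pattern: tracepoints are declared with the TRACEPOINT macro from <osv/trace.hh> and then called like functions, in this case from inside the scheduler while it is picking the next thread, so the trace slow path (IRQ masking, per-cpu buffer writes) runs in that fragile window. The tracepoint name and helper below are hypothetical.

#include <osv/trace.hh>
#include <osv/sched.hh>

// Hypothetical tracepoint, declared the same way the real sched_* ones are.
TRACEPOINT(trace_sched_switch_example, "to %p", sched::thread*);

static void example_schedule_next_thread(sched::thread* next)
{
    // Firing the tracepoint here means the trace machinery executes
    // before the context switch to "next" has completed.
    trace_sched_switch_example(next);
}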

One more thing: the following tracepoints crash:

  • sched_sched
  • sched_switch
  • sched_load
  • sched_wake

These seem to work (at least sometimes):

  • sched_preempt (sometimes on 2 CPUs)
  • sched_idle
  • sched_wait
@nyh
Contributor

nyh commented May 9, 2022

This failure appears (as you noted in the title) to be specific to aarch64. I couldn't reproduce it on x86.

I tried to review this code again (it is fairly complex, unfortunately), but can't find any bugs just by looking at the code :-(

The trace_slow_path() code in trace.hh calls arch::irq_disable_notrace(), which is necessary to avoid switching CPUs while touching the per-cpu trace buffers, and also to avoid reentrancy between different tracepoints. I hope this function works correctly on aarch64?
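For reference, a minimal sketch of what such IRQ save/disable/restore helpers have to do on aarch64 (an assumption for illustration, not necessarily OSv's exact code): they manipulate the I bit of PSTATE via DAIF. If any of this were subtly wrong, the trace slow path could be interrupted or migrated while writing the per-cpu buffer.

// Sketch only: save the current DAIF mask and mask IRQs (the I bit).
inline unsigned long irq_save_and_disable_sketch()
{
    unsigned long daif;
    asm volatile("mrs %0, daif" : "=r"(daif));
    asm volatile("msr daifset, #2" : : : "memory");
    return daif;
}

// Sketch only: restore the previously saved DAIF mask.
inline void irq_restore_sketch(unsigned long daif)
{
    asm volatile("msr daif, %0" : : "r"(daif) : "memory");
}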

@wkozaczuk
Collaborator Author

I have a feeling this might be related to the unimplemented tls_available (it always returns false) in osv/arch/aarch64/arch.hh, lines 98 to 102 at 6f4334a:

inline bool tls_available()
{
    /* XXX */
    return false;
}

Or maybe it is other aarch64-specific code in this file that is used by trace, like irq_flag_notrace.
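If tls_available() is indeed the problem, one possible implementation on aarch64 (an untested assumption, not the actual OSv code) would be to check whether the TLS base register that compiler-generated code uses, tpidr_el0, has been set up yet:

// Sketch of a possible implementation, not the actual OSv code:
// report TLS as available once the thread pointer register is non-zero.
inline bool tls_available_sketch()
{
    unsigned long tpidr;
    asm volatile("mrs %0, tpidr_el0" : "=r"(tpidr));
    return tpidr != 0;
}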
