Option for not using SIGSTOP/SIGCONT because not all apps take it well #13

Open
wants to merge 9 commits into
base: master

podusowski

This STOP/CONT pattern is used to avoid a data race between reading /proc and handling things like mmap events from the kernel, isn't it? Anyway, we are profiling some apps that don't take it very well, since STOP causes syscalls to return abnormally. The apps should be fixed, but you know, that is not always easy. Therefore I'm proposing a switch to disable this behavior.
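
To make the proposal concrete, here is a minimal sketch of such a switch, not the code from this PR. The names `Signal`, `with_target_paused`, `use_stop_cont` and the `send` callback are all hypothetical; `send` stands in for whatever actually delivers SIGSTOP/SIGCONT to the profiled process.

```rust
/// The two job-control signals discussed in this thread.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Signal {
    Stop,
    Cont,
}

/// Runs `body` (for example, reading /proc/<pid>/maps), optionally
/// bracketed by STOP/CONT.
///
/// `use_stop_cont` is the hypothetical switch; `send` is a hypothetical
/// callback that would deliver the signal to the profiled process.
fn with_target_paused<S, F, T>(use_stop_cont: bool, mut send: S, body: F) -> T
where
    S: FnMut(Signal),
    F: FnOnce() -> T,
{
    if use_stop_cont {
        send(Signal::Stop);
    }
    let result = body();
    if use_stop_cont {
        send(Signal::Cont);
    }
    result
}
```

With the switch off, `body` runs while the target keeps executing, so the race described above comes back; that is the trade-off the option would expose. This sketch also does not resume the target if `body` panics, which a real implementation would need to handle.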

nwind/src/unwind_context.rs (outdated, resolved)
nwind/src/unwind_context.rs (outdated, resolved)
src/args.rs (outdated, resolved)
src/perf_group.rs (outdated, resolved)
podusowski and others added 2 commits January 11, 2020 18:49
koute (Owner) commented Jan 17, 2020

Yes, the STOP/CONT signals are sent because the perf_event_open interface is somewhat broken, and AFAIK it's not really possible to use it in a non-racy way with an application that is already running. (It really shows that it was designed mostly with the fork + exec model in mind, where you always start a fresh instance when profiling.)

Comment on lines +99 to +103
// avoid infinite loops
if self.ctx.nth_frame > 1000 {
    warn!("possible infinite loop detected and avoided");
    return false;
}
Owner

Have you actually hit a genuine infinite loop here? AFAIK infinite loops shouldn't really be possible as you're going to overflow the stack sooner or later anyway.

Anyway, this change isn't really correct. Even though stacks more than 1000 frames deep are certainly a sign of a problem, they should still be gathered. I've seen such stack traces in the wild, and gathering as much of the trace as possible helps to fix the problem later if you can manage to get to the top. So what was your motivation for adding this here? If you want to limit stack traces to a certain length, we could add an extra parameter instead.
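
As an illustration of the "extra parameter" idea, here is a minimal sketch under assumed names. `collect_frames` and `max_depth` are hypothetical and not part of nwind; the iterator stands in for the unwinder producing one return address per frame.

```rust
/// Collects frames from a hypothetical unwinder iterator.
///
/// `max_depth` is the hypothetical extra parameter: `None` keeps the
/// current behaviour (gather everything), `Some(n)` stops after `n` frames.
/// The returned flag reports whether the trace was cut short.
fn collect_frames<I>(frames: I, max_depth: Option<usize>) -> (Vec<u64>, bool)
where
    I: IntoIterator<Item = u64>,
{
    let mut out = Vec::new();
    let mut truncated = false;
    for address in frames {
        if let Some(limit) = max_depth {
            if out.len() >= limit {
                // At least one more frame was available beyond the limit.
                truncated = true;
                break;
            }
        }
        out.push(address);
    }
    (out, truncated)
}
```

Making the limit opt-in means deep but legitimate traces are still gathered in full by default, and a truncated trace can be flagged to the user rather than dropped silently.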

podusowski (Author) commented Jan 18, 2020

Yes, I hit it frequently in one of our cortex-15 apps, but I can neither post it here nor dig into it further since I'm leaving the company.

What I did manage to figure out, though, is that it looked like an ARM unwinder bug: the Vec holding the frames kept growing until it failed while trying to reallocate to 3.5 GB.

Author

Hi, do you still want to make this a command-line option? I'm asking because a failed allocation, which is how this bug manifests itself, is just an abort, with no panic or Err. That makes it hard to diagnose if it happens to someone.
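
For reference, one way the abort could become a recoverable error is fallible reservation. This is only a sketch: `push_frame` is a hypothetical helper, not nwind code, and `Vec::try_reserve` has been stable only since Rust 1.57, so it was not available on stable Rust when this thread was written.

```rust
use std::collections::TryReserveError;

/// Appends one frame, reporting allocation failure as an `Err`
/// instead of letting the process abort.
fn push_frame(frames: &mut Vec<u64>, address: u64) -> Result<(), TryReserveError> {
    // Ask for room first; this returns Err on allocation failure or overflow.
    frames.try_reserve(1)?;
    // Capacity is now guaranteed, so this push cannot reallocate.
    frames.push(address);
    Ok(())
}
```

A caller could log the error together with the frames gathered so far, which would make a runaway unwind much easier to diagnose than a bare abort.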
