-
Notifications
You must be signed in to change notification settings - Fork 202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PAL] Do not describe RIP location on syscall instruction #2017
[PAL] Do not describe RIP location on syscall instruction #2017
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, all discussions resolved, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel)
Jenkins, test this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @adarshan-intel)
-- commits
line 2 at r1:
No need to list all the PALs as the change is in all PALs we have.
Suggestion:
[PAL] Do not try to describe RIP location on syscall instruction
-- commits
line 15 at r1:
What do you mean by "control path"?
Code quote:
control path
As the true root cause for this bug is not yet found, this commit introduces a workaround
As I said in the previous PR (#2015) I'm not ok with doing random changes without understanding what's causing the memory corruption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @adarshan-intel and @mkow)
Previously, mkow (Michał Kowalczyk) wrote…
What do you mean by "control path"?
I proposed this "control path" wording. I simply meant that the function pal_describe_location()
calls other functions, and by inspecting all these called functions, it looked fine to me (no data races).
Previously, mkow (Michał Kowalczyk) wrote…
As the true root cause for this bug is not yet found, this commit introduces a workaround
As I said in the previous PR (#2015) I'm not ok with doing random changes without understanding what's causing the memory corruption.
Well, true... Someone needs to debug it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @adarshan-intel and @dimakuv)
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
I proposed this "control path" wording. I simply meant that the function
pal_describe_location()
calls other functions, and by inspecting all these called functions, it looked fine to me (no data races).
ah, you mean "control-flow path"? or something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @adarshan-intel and @mkow)
Previously, mkow (Michał Kowalczyk) wrote…
ah, you mean "control-flow path"? or something else?
Yes, true, I meant "control flow path"
Commit b6a2d79 ("[PAL/{Linux,Linux-SGX}] Add trace log for raw syscalls") added a debug print for every encountered raw syscall instruction. Each print describes what system call number is invoked and at which address in the binary. The address is fed to the helper function `pal_describe_location(uc->rip)` which tries to find the binary + function name and put them in human-readable form into the provided buffer. For some reason, the snippet used in that commit (allocating a 128-byte buffer on signal-handling stack and calling `pal_describe_location()`) leads to a non-deterministic memory corruption on some workloads. That bug is hardly reproducible, and it is not fixed by e.g. increasing the signal stack size inside Gramine. The `pal_describe_location()` control path also seems correct. As the true root cause for this bug is not yet found, this commit introduces a workaround: temporarily removing the stack-allocated buffer and the correspoding `pal_describe_location()` call, and instead printing the raw RIP value. Signed-off-by: Adarsh Anand <[email protected]>
28cda11
to
9de0ec2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @mkow)
Previously, mkow (Michał Kowalczyk) wrote…
No need to list all the PALs as the change is in all PALs we have.
Done.
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Yes, true, I meant "control flow path"
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ) (waiting on @mkow)
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Well, true... Someone needs to debug it.
@mkow Will you post here the details of your debugging session, and the conclusion (most probably, to revert #1991 until a proper fix is done)?
Closing this, as it's not fixing the issue (see #1261 (comment)). I'll create a proper fix and submit a PR. |
Description of the Changes
PR #1991 exposes a segmentation fault with the use of
pal_describe_location
as in below error log:Using the above addresses we could find the functions abort->hlt from objdump of libc.
We tried to debug the issue using GDB to figure out the cause of abort call inside libc but issue did not reproduced with GDB even after multiple tries.
Issue is consistent with Pytorch workload using curated apps (with debug build) without GDB, on a Azure VM, rarely on our lab servers.
Also tested by increasing the ENCLAVE_SIG_STACK_SIZE and ALT_STACK_SIZE size to 256KB but that also did not help.
Likely some race in Gramine when using
pal_describe_location
which we are not hitting with GDB as GDB makes the execution slow.We want to remove this function call in coming release and analyze this issue after the release.
Fixes gramineproject/contrib#87
This change is