Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PAL] Do not describe RIP location on syscall instruction #2017

Conversation

adarshan-intel
Copy link
Contributor

@adarshan-intel adarshan-intel commented Oct 7, 2024

Description of the Changes

PR #1991 exposes a segmentation fault with the use of pal_describe_location as in below error log:

(host_exception.c:141:handle_sync_signal) error: Unexpected segmentation fault (SIGSEGV) occurred inside untrusted PAL (libc-2.31.so+0x22941 (addr = 0x7f62e5f98941))

Using the above addresses we could find the functions abort->hlt from objdump of libc.

We tried to debug the issue using GDB to figure out the cause of abort call inside libc but issue did not reproduced with GDB even after multiple tries.

Issue is consistent with Pytorch workload using curated apps (with debug build) without GDB, on a Azure VM, rarely on our lab servers.

Also tested by increasing the ENCLAVE_SIG_STACK_SIZE and ALT_STACK_SIZE size to 256KB but that also did not help.

Likely some race in Gramine when using pal_describe_location which we are not hitting with GDB as GDB makes the execution slow.

We want to remove this function call in coming release and analyze this issue after the release.

Fixes gramineproject/contrib#87


This change is Reviewable

Copy link

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, all discussions resolved, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel)

@dimakuv
Copy link

dimakuv commented Oct 7, 2024

Jenkins, test this please

@mkow mkow changed the title [PAL/{Linux,Linux-SGX}] Removing buffer and pal_describe_location function [PAL] Do not describe RIP location on syscall instruction Oct 7, 2024
Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @adarshan-intel)


-- commits line 2 at r1:
No need to list all the PALs as the change is in all PALs we have.

Suggestion:

[PAL] Do not try to describe RIP location on syscall instruction

-- commits line 15 at r1:
What do you mean by "control path"?

Code quote:

control path

-- commits line 18 at r1:

As the true root cause for this bug is not yet found, this commit introduces a workaround

As I said in the previous PR (#2015) I'm not ok with doing random changes without understanding what's causing the memory corruption.

Copy link

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @adarshan-intel and @mkow)


-- commits line 15 at r1:

Previously, mkow (Michał Kowalczyk) wrote…

What do you mean by "control path"?

I proposed this "control path" wording. I simply meant that the function pal_describe_location() calls other functions, and by inspecting all these called functions, it looked fine to me (no data races).


-- commits line 18 at r1:

Previously, mkow (Michał Kowalczyk) wrote…

As the true root cause for this bug is not yet found, this commit introduces a workaround

As I said in the previous PR (#2015) I'm not ok with doing random changes without understanding what's causing the memory corruption.

Well, true... Someone needs to debug it.

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @adarshan-intel and @dimakuv)


-- commits line 15 at r1:

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I proposed this "control path" wording. I simply meant that the function pal_describe_location() calls other functions, and by inspecting all these called functions, it looked fine to me (no data races).

ah, you mean "control-flow path"? or something else?

Copy link

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @adarshan-intel and @mkow)


-- commits line 15 at r1:

Previously, mkow (Michał Kowalczyk) wrote…

ah, you mean "control-flow path"? or something else?

Yes, true, I meant "control flow path"

Commit b6a2d79 ("[PAL/{Linux,Linux-SGX}] Add trace log for raw syscalls")
added a debug print for every encountered raw syscall instruction. Each print
describes what system call number is invoked and at which address in the binary.
The address is fed to the helper function `pal_describe_location(uc->rip)` which
tries to find the binary + function name and put them in human-readable form
into the provided buffer.

For some reason, the snippet used in that commit (allocating a 128-byte buffer
on signal-handling stack and calling `pal_describe_location()`) leads to a
non-deterministic memory corruption on some workloads. That bug is hardly
reproducible, and it is not fixed by e.g. increasing the signal stack size inside
Gramine. The `pal_describe_location()` control path also seems correct.

As the true root cause for this bug is not yet found, this commit introduces
a workaround: temporarily removing the stack-allocated buffer and the
correspoding `pal_describe_location()` call, and instead printing the raw
RIP value.

Signed-off-by: Adarsh Anand <[email protected]>
Copy link
Contributor Author

@adarshan-intel adarshan-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @mkow)


-- commits line 2 at r1:

Previously, mkow (Michał Kowalczyk) wrote…

No need to list all the PALs as the change is in all PALs we have.

Done.


-- commits line 15 at r1:

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Yes, true, I meant "control flow path"

Done.

Copy link

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ) (waiting on @mkow)


-- commits line 18 at r1:

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Well, true... Someone needs to debug it.

@mkow Will you post here the details of your debugging session, and the conclusion (most probably, to revert #1991 until a proper fix is done)?

@mkow
Copy link
Member

mkow commented Nov 5, 2024

Closing this, as it's not fixing the issue (see #1261 (comment)). I'll create a proper fix and submit a PR.

@mkow mkow closed this Nov 5, 2024
@adarshan-intel adarshan-intel deleted the adarsh/fix_raw_syscall branch November 25, 2024 04:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pytorch with debug flag is getting failed with latest gramine and gsc master with Segmentation fault
3 participants