Scikit-learn-intelex: receive_checkpoint_and_restore() function call fails with ENOMEM error in gramine-direct #107

Closed
vasanth-intel opened this issue Oct 9, 2024 · 0 comments · Fixed by #108

Comments

@vasanth-intel

Description of the problem
The receive_checkpoint_and_restore() call fails with an ENOMEM error in gramine-direct mode for the examples/scikit-learn-intelex workload in the examples repo, as shown in the output below. On further debugging, we found that this is a regression caused by commit c0a2765, "[LibOS,PAL/Linux-SGX] Add EDMM lazy allocation support".

intel@intel-M50CYP2SBSTD:~/gramine_edmm/examples/scikit-learn-intelex$ gramine-direct ./sklearnex scripts/kmeans_perf_eval.py
error: failed allocating 0x1a06ff6c2000-0x1a06ff6c3000
error: libos_init() failed in receive_checkpoint_and_restore: Cannot allocate memory (ENOMEM)
[P1:T1:python3.10] error: failed sending checkpoint: Permission denied (EACCES)
[P1:T1:python3.10] error: process creation failed
error: failed allocating 0x1a06ff6c2000-0x1a06ff6c3000
error: libos_init() failed in receive_checkpoint_and_restore: Cannot allocate memory (ENOMEM)
[P1:T1:python3.10] error: failed sending checkpoint: Permission denied (EACCES)
[P1:T1:python3.10] error: process creation failed
error: failed allocating 0x1a06ff6c2000-0x1a06ff6c3000
error: libos_init() failed in receive_checkpoint_and_restore: Cannot allocate memory (ENOMEM)
[P1:T1:python3.10] error: failed sending checkpoint: Permission denied (EACCES)
[P1:T1:python3.10] error: process creation failed
error: failed allocating 0x1a06ff6c2000-0x1a06ff6c3000
error: libos_init() failed in receive_checkpoint_and_restore: Cannot allocate memory (ENOMEM)
[P1:T1:python3.10] error: failed sending checkpoint: Permission denied (EACCES)
[P1:T1:python3.10] error: process creation failed
Emulating a raw system/supervisor call. This degrades performance, consider patching your application to use Gramine syscall API.
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
*** Stock Scikit-learn ***
Train time: 22.878 s
Inertia: 2468815.517
Number of iterations: 98
Davies-Bouldin metric on train data: 2.877
Predict time: 0.028 s
Davies-Bouldin metric on test data: 2.896
*** Intel extension for Scikit-learn ***
Train time: 6.979 s
Inertia: 2468787.252
Number of iterations: 157
Davies-Bouldin metric on train data: 2.780
Predict time: 0.006 s
Davies-Bouldin metric on test data: 2.768
Kmeans perf evaluation finished

Note

  1. The errors do not appear in gramine-sgx execution mode.
  2. EDMM is NOT enabled.
  3. Despite the repeated ENOMEM/EACCES messages, the workload runs to completion and prints the final success message, Kmeans perf evaluation finished.
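
For triaging captured output, the check described in the notes (errors present, but the run still finishes) can be sketched in Python. This is only an illustrative helper, not part of Gramine or the examples repo: the function name summarize_log and the regular expressions are assumptions based on the log lines in this report, and the sample text is a shortened excerpt of the output above.

```python
import re
from collections import Counter

# Patterns derived from the log lines in this report (assumed format).
ERRNO_RE = re.compile(r"\((E[A-Z]+)\)\s*$")
RANGE_RE = re.compile(r"failed allocating (0x[0-9a-f]+)-(0x[0-9a-f]+)")
SUCCESS_MARKER = "Kmeans perf evaluation finished"


def summarize_log(text):
    """Count errno names on error lines, collect the sizes of failed
    allocation ranges, and check for the workload's final success line."""
    errnos = Counter()
    alloc_sizes = []
    finished = False
    for line in text.splitlines():
        line = line.strip()
        errno_match = ERRNO_RE.search(line)
        if errno_match and "error:" in line:
            errnos[errno_match.group(1)] += 1
        range_match = RANGE_RE.search(line)
        if range_match:
            start = int(range_match.group(1), 16)
            end = int(range_match.group(2), 16)
            alloc_sizes.append(end - start)
        if line == SUCCESS_MARKER:
            finished = True
    return errnos, alloc_sizes, finished


# Shortened excerpt of the gramine-direct output shown in this issue.
sample = """\
error: failed allocating 0x1a06ff6c2000-0x1a06ff6c3000
error: libos_init() failed in receive_checkpoint_and_restore: Cannot allocate memory (ENOMEM)
[P1:T1:python3.10] error: failed sending checkpoint: Permission denied (EACCES)
[P1:T1:python3.10] error: process creation failed
Kmeans perf evaluation finished
"""

errnos, alloc_sizes, finished = summarize_log(sample)
print(dict(errnos), alloc_sizes, finished)
# prints: {'ENOMEM': 1, 'EACCES': 1} [4096] True
```

On this excerpt the failed range 0x1a06ff6c2000-0x1a06ff6c3000 comes out as 4096 bytes (0x1000), i.e. a single 4 KiB page.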

Steps to reproduce

  1. Clone the examples repo: git clone https://github.com/gramineproject/examples.git
  2. Follow the steps in the README to build the scikit-learn-intelex example, then run it with gramine-direct (gramine-direct ./sklearnex scripts/kmeans_perf_eval.py, as in the output above).

Expected results
The scikit-learn-intelex example runs without any error messages.

Actual results
The scikit-learn-intelex example runs to completion, but the error messages shown above are printed.

Gramine commit hash
c0a2765
