[BUG] Long import times #627
In [1]: %time import cudf
CPU times: user 556 ms, sys: 204 ms, total: 760 ms
Wall time: 3.39 s
Looking at the profile, it looks like we're doing a lot of odd things at import time.
It adds up on real RAPIDS codebases (e.g., …). Some observations:
We've been chasing down why we're seeing 20s-60s startup times for a boring RAPIDS web app whose init just sets up routes and runs some cudf + UDF warmup routines, so this thread may help. It's been frustrating in production because it slows down first starts and autorestarts. More nice-to-have for us, but maybe more important for others, are local dev and testing, and, longer term, it precludes fast cold starts for GPU FaaS. A few more numbers:
And related to the above:
I spent a little time profiling the import with snakeviz; gpu_utils is taking close to ~2.5 seconds, and a big chunk of that is from …
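For context, this kind of profile can be reproduced by dumping cProfile stats for the import and opening them in snakeviz. This is a minimal sketch (the script name is made up; it assumes cudf, a GPU, and snakeviz are available):

```python
# profile_cudf_import.py -- hypothetical helper script, not part of cudf
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
import cudf  # noqa: E402  -- the import we want to profile
profiler.disable()

profiler.dump_stats("cudf_import.prof")  # inspect with: snakeviz cudf_import.prof
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```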
Hmm - nothing sticks out to me in the implementation: https://github.com/NVIDIA/cuda-python/blob/746b773c91e1ede708fe9a584b8cdb1c0f32b51d/cuda/_lib/ccudart/ccudart.pyx#L1458-L1466 I suspect the overhead we're seeing here is from the initialization of the CUDA context.
I can't think of a great way around this. It sounds like the lower bound to our import time is the time for CUDA context initialization, which apparently scales as the square of the number of GPUs.
Note that, regardless, the CUDA context will eventually need to be initialized if we're going to use the GPU for anything useful.
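One rough way to check that lower bound on a given machine is to time bare CUDA runtime initialization (which forces context creation) separately from the cudf import. A sketch, assuming the cuda-python package is installed:

```python
# time_cuda_init.py -- rough sketch to separate CUDA context creation from the cudf import
import time

from cuda import cudart  # provided by the cuda-python package

start = time.perf_counter()
err, = cudart.cudaFree(0)  # no-op free that forces CUDA runtime/context initialization
print(f"cudaFree(0) -> {err}, context init took {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
import cudf  # noqa: E402
print(f"import cudf took an additional {time.perf_counter() - start:.2f}s")
```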
Googling around suggests that C++-level CUDA context creation is expected to be closer to … Maybe it's also something about how CPython links things at runtime? I've been thinking we might have Python import issues around path search or native imports / runtime linking, but I haven't dug deep into that.
Are your workloads being run in the cloud? If so, it is possible that they are being run on multi-tenant nodes, which have more GPUs than are actually given to any one user. If that is the case, something to play with would be restricting the …
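The specific setting is cut off above; restricting which GPUs a process can see is conventionally done with the CUDA_VISIBLE_DEVICES environment variable, so the experiment being suggested might look roughly like this sketch (device index "0" is illustrative):

```python
# restrict_gpus.py -- illustrative sketch; limits how many GPUs the process
# (and hence CUDA context creation) can see before importing cudf.
import os
import time

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # must be set before anything initializes CUDA

start = time.perf_counter()
import cudf  # noqa: E402
print(f"import cudf with one visible GPU: {time.perf_counter() - start:.2f}s")
```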
I've seen this on various local single GPUs, not just in the cloud: local Windows -> WSL2 -> Ubuntu Docker -> RTX 3080, local Ubuntu Docker -> RTX 3080, and local Ubuntu Docker -> some old GeForce. Happy to test any C++/Python...
@vyasr Is there a reasonable way to add cudf import time tracking to the Python benchmarks?
I think that would be valuable. The issue is that a lot of the slow parts of the import have to do with initialization logic that only occurs the first time that you import cudf. For example:
Since cudf gets imported as part of the collection process for pytest and at the top of every module, we'd be obscuring most of that overhead if we tried to incorporate it into the same benchmarking suite. If we want to benchmark this, we should just have a separate, very simple script that times the import and exits. If we really care about accuracy it would be a bit more involved: we would need the script to launch subprocesses (probably serially, to avoid any contention issues around NVIDIA driver context creation) that each run the import command, and then collect the results. I don't think we need to go that far, though. Even very rough timings will probably tell us what we need.
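A minimal sketch of such a standalone script (the file and function names are illustrative, not part of the cudf benchmarks) could serially spawn fresh interpreters so every run pays the full first-import cost:

```python
# bench_import_time.py -- hypothetical standalone script, separate from the pytest benchmarks
import statistics
import subprocess
import sys

SNIPPET = "import time; t = time.perf_counter(); import cudf; print(time.perf_counter() - t)"


def time_import(runs=5):
    timings = []
    for _ in range(runs):
        # A fresh subprocess per run avoids warm module caches; runs are serialized
        # to sidestep contention around NVIDIA driver context creation.
        out = subprocess.run(
            [sys.executable, "-c", SNIPPET],
            check=True,
            capture_output=True,
            text=True,
        )
        timings.append(float(out.stdout.strip()))
    return timings


if __name__ == "__main__":
    results = time_import()
    print(f"import cudf: median {statistics.median(results):.2f}s over {len(results)} runs")
```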
Something we might also consider is using this lazy loader framework.
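The general idea can be sketched with only the standard library's importlib.util.LazyLoader, which defers executing a module until one of its attributes is first accessed; a dedicated lazy-loader framework would wrap the same mechanism in a more convenient API. The cupy example at the end is illustrative:

```python
# lazy_import.py -- sketch of deferred imports using only the standard library
import importlib.util
import sys


def lazy_import(name):
    """Return a module object whose real import runs on first attribute access."""
    spec = importlib.util.find_spec(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # with LazyLoader, execution is postponed, not performed here
    return module


cupy = lazy_import("cupy")  # cheap at import time; the real cost is paid on first use
```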
I suspect the addition of CUDA Lazy Loading should dramatically improve this situation. This was added in 11.7 and has been improved in subsequent releases. This is currently an optional feature and can be enabled by specifying the environment variable …
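For reference, the documented switch for CUDA Lazy Loading is the CUDA_MODULE_LOADING=LAZY environment variable, and it has to be present before the process initializes CUDA. A minimal sketch of enabling it from Python rather than the shell:

```python
# enable_cuda_lazy_loading.py -- CUDA_MODULE_LOADING must be set before anything
# in the process initializes CUDA, so it goes at the very top (or in the shell).
import os

os.environ.setdefault("CUDA_MODULE_LOADING", "LAZY")

import cudf  # noqa: E402  -- GPU modules are now loaded lazily by the driver
```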
Agreed, CUDA lazy loading should help, although I'm not sure how much. Last I checked, import time was distributed across dependencies as well (numba, cupy, rmm), so we'll need to do some careful profiling again with lazy loading enabled to see what else those libraries are doing and how slow other parts of their import are (as well as the validation that cudf does on import, which adds significant overhead).
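For that kind of per-dependency breakdown, a small sketch using pyinstrument's Python API (assuming it is installed) makes it easy to rerun the measurement with and without lazy loading enabled:

```python
# profile_import_tree.py -- sketch for comparing import profiles with/without lazy loading
from pyinstrument import Profiler

profiler = Profiler()
profiler.start()
import cudf  # noqa: E402  -- pulls in numba, cupy, rmm, etc.
profiler.stop()

print(profiler.output_text(unicode=True, color=False))
```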
I had a brief look with pyinstrument:
- Adding …
- Numba forks a subprocess to compile some CUDA code to determine whether it needs to patch the PTX linker. We can turn that off.
- RMM imports PyTorch if it's in the environment.
Let's suppose we fix that with lazy-loading.
So now the largest single cost is importing cupy, which seems unavoidable. With a little bit of trimming one might get down to 1s.
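As one example of the kind of trimming that could help, a library can defer an optional heavy dependency (such as the PyTorch import noted above) to first use via a module-level __getattr__ (PEP 562). This is a hypothetical sketch, not how RMM is actually structured:

```python
# somelib/__init__.py -- hypothetical package layout; defers an optional torch import
def _load_torch_integration():
    import torch  # heavy import happens only when the integration is first requested

    return torch  # placeholder for whatever the integration actually needs


def __getattr__(name):
    if name == "torch_integration":
        return _load_torch_integration()
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```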
It takes an ecosystem, with each library chiseling away at its part, so is it worth it? We're thinking of scenarios like CPU CI, lazy loading, and pay-as-you-go for partial use of the API surface.
Describe the bug
It takes a few seconds to import cudf today.
Steps/Code to reproduce bug