[BUG] Long import times #627

Open
mrocklin opened this issue Jan 4, 2019 · 43 comments
Labels: feature request (New feature or request), Python (Affects Python cuDF API)

Comments

mrocklin (Collaborator) commented Jan 4, 2019

Describe the bug

It takes a few seconds to import cudf today.

Steps/Code to reproduce bug

In [1]: import numpy, pandas

In [2]: %time import cudf
CPU times: user 432 ms, sys: 132 ms, total: 564 ms
Wall time: 2.44 s
***git***
commit ab3f45857f641548f6d64d977908075d63c193bf (HEAD -> repr-html, mrocklin/repr-html)
Author: Matthew Rocklin <[email protected]>
Date:   Thu Jan 3 17:35:10 2019 -0800

    Test repr for both large and small dataframes

***OS Information***
DGX_NAME="DGX Server"
DGX_PRETTY_NAME="NVIDIA DGX Server"
DGX_SWBUILD_DATE="2018-03-20"
DGX_SWBUILD_VERSION="3.1.6"
DGX_COMMIT_ID="1b0f58ecbf989820ce745a9e4836e1de5eea6cfd"
DGX_SERIAL_NUMBER=QTFCOU8310024

DGX_OTA_VERSION="3.1.7"
DGX_OTA_DATE="Thu Sep 27 20:07:53 PDT 2018"
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"
NAME="Ubuntu"
VERSION="16.04.5 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.5 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
Linux dgx16 4.4.0-135-generic #161-Ubuntu SMP Mon Aug 27 10:45:01 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

***GPU Information***
Thu Jan  3 18:42:10 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44                 Driver Version: 396.44                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   42C    P0    59W / 300W |   1085MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   43C    P0    46W / 300W |     11MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   42C    P0    46W / 300W |     11MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   41C    P0    47W / 300W |     11MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   42C    P0    46W / 300W |     11MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   42C    P0    44W / 300W |     11MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   44C    P0    45W / 300W |     11MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   41C    P0    44W / 300W |     11MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     64394      C   ...klin/miniconda/envs/cudf_dev/bin/python   644MiB |
|    0     65545      C   ...klin/miniconda/envs/cudf_dev/bin/python   430MiB |
+-----------------------------------------------------------------------------+

***CPU***
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                80
On-line CPU(s) list:   0-79
Thread(s) per core:    2
Core(s) per socket:    20
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Stepping:              1
CPU MHz:               2030.789
CPU max MHz:           3600.0000
CPU min MHz:           1200.0000
BogoMIPS:              4391.76
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              51200K
NUMA node0 CPU(s):     0-19,40-59
NUMA node1 CPU(s):     20-39,60-79
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt ssbd ibrs ibpb stibp kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts flush_l1d

***CMake***
/home/nfs/mrocklin/miniconda/envs/cudf_dev/bin/cmake
cmake version 3.13.2

CMake suite maintained and supported by Kitware (kitware.com/cmake).

***g++***
/usr/bin/g++
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


***nvcc***

***Python***
/home/nfs/mrocklin/miniconda/envs/cudf_dev/bin/python
Python 3.5.5

***Environment Variables***
PATH                            : /home/nfs/mrocklin/miniconda/envs/cudf_dev/bin:/home/nfs/mrocklin/miniconda/bin:/home/nfs/mrocklin/miniconda/bin:/home/nfs/mrocklin/miniconda/bin:/home/nfs/mrocklin/bin:/home/nfs/mrocklin/.local/bin:/home/nfs/mrocklin/miniconda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
LD_LIBRARY_PATH                 :
NUMBAPRO_NVVM                   :
NUMBAPRO_LIBDEVICE              :
CONDA_PREFIX                    : /home/nfs/mrocklin/miniconda/envs/cudf_dev
PYTHON_PATH                     :

***conda packages***
/home/nfs/mrocklin/miniconda/bin/conda
# packages in environment at /home/nfs/mrocklin/miniconda/envs/cudf_dev:
#
# Name                    Version                   Build  Channel
alabaster                 0.7.12                     py_0    conda-forge
arrow-cpp                 0.10.0           py35h70250a7_0    conda-forge
asn1crypto                0.24.0                   py35_3    conda-forge
atomicwrites              1.2.1                      py_0    conda-forge
attrs                     18.2.0                     py_0    conda-forge
babel                     2.6.0                      py_1    conda-forge
backcall                  0.1.0                      py_0    conda-forge
blas                      1.0                         mkl
bleach                    3.0.2                      py_1    conda-forge
bokeh                     0.13.0                   py35_0
boost                     1.67.0           py35h3e44d54_0    conda-forge
boost-cpp                 1.67.0               h3a22d5f_0    conda-forge
bzip2                     1.0.6                h470a237_2    conda-forge
ca-certificates           2018.03.07                    0
certifi                   2018.8.24                py35_1
cffi                      1.11.5           py35h5e8e0c9_1    conda-forge
chardet                   3.0.4                    py35_3    conda-forge
click                     7.0                        py_0    conda-forge
cloudpickle               0.6.1                      py_0    conda-forge
cmake                     3.13.2               h011004d_0    conda-forge
commonmark                0.5.4                      py_2    conda-forge
cryptography              2.3.1            py35hdffb7b8_0    conda-forge
cryptography-vectors      2.3.1                    py35_0    conda-forge
cudatoolkit               9.2                           0
cudf                      0.4.0+385.g81a187d           <pip>
curl                      7.63.0               h74213dd_0    conda-forge
cython                    0.28.5           py35hfc679d8_0    conda-forge
cytoolz                   0.9.0.1          py35h470a237_0    conda-forge
dask                      1.0.0+51.g2ca205b           <pip>
dask-core                 1.0.0                      py_0    conda-forge
dask-cuda                 0.0.0                     <pip>
dask-cudf                 0.0.1+222.g10a9f90.dirty           <pip>
decorator                 4.3.0                      py_0    conda-forge
distributed               1.23.2                   py35_1    conda-forge
distributed               1.25.1+10.ga0d0ed2           <pip>
docutils                  0.14                     py35_1    conda-forge
entrypoints               0.2.3                    py35_2    conda-forge
expat                     2.2.5                hfc679d8_2    conda-forge
future                    0.16.0                   py35_2    conda-forge
gmp                       6.1.2                hfc679d8_0    conda-forge
heapdict                  1.0.0                 py35_1000    conda-forge
icu                       58.2                 hfc679d8_0    conda-forge
idna                      2.7                      py35_2    conda-forge
imagesize                 1.1.0                      py_0    conda-forge
intel-openmp              2019.1                      144
ipykernel                 5.1.0              pyh24bf2e0_0    conda-forge
ipython                   7.0.1            py35h24bf2e0_0    conda-forge
ipython_genutils          0.2.0                      py_1    conda-forge
jedi                      0.12.1                   py35_0    conda-forge
jinja2                    2.10                       py_1    conda-forge
jsonschema                2.6.0                    py35_2    conda-forge
jupyter_client            5.2.4                      py_0    conda-forge
jupyter_core              4.4.0                      py_0    conda-forge
krb5                      1.16.2               hbb41f41_0    conda-forge
libcurl                   7.63.0               hbdb9355_0    conda-forge
libedit                   3.1.20170329         haf1bffa_1    conda-forge
libffi                    3.2.1                hfc679d8_5    conda-forge
libgcc-ng                 8.2.0                hdf63c60_1
libgdf-cffi               0.4.0                     <pip>
libgfortran-ng            7.2.0                hdf63c60_3    conda-forge
librmm-cffi               0.4.0                     <pip>
libsodium                 1.0.16               h470a237_1    conda-forge
libssh2                   1.8.0                h5b517e9_3    conda-forge
libstdcxx-ng              8.2.0                hdf63c60_1
libuv                     1.24.1               h470a237_0    conda-forge
llvmlite                  0.27.0           py35hf484d3e_0    numba
Markdown                  2.6.11                    <pip>
markupsafe                1.0              py35h470a237_1    conda-forge
mistune                   0.8.3            py35h470a237_2    conda-forge
mkl                       2018.0.3                      1
mkl_fft                   1.0.9                    py35_0    conda-forge
mkl_random                1.0.1                    py35_0    conda-forge
more-itertools            4.3.0                    py35_0    conda-forge
msgpack-python            0.5.6            py35h2d50403_3    conda-forge
nbconvert                 5.3.1                      py_1    conda-forge
nbformat                  4.4.0                      py_1    conda-forge
ncurses                   6.1                  hfc679d8_2    conda-forge
notebook                  5.7.0                    py35_0    conda-forge
numba                     0.42.0          np115py35hf484d3e_0    numba
numpy                     1.15.2           py35h1d66e8a_0
numpy-base                1.15.2           py35h81de0dd_0
numpydoc                  0.8.0                      py_1    conda-forge
nvstrings                 0.1.0            cuda9.2_py35_0    nvidia
openssl                   1.0.2p               h14c3975_0
packaging                 18.0                       py_0    conda-forge
pandas                    0.20.3                   py35_1    conda-forge
pandoc                    2.5                           0    conda-forge
pandocfilters             1.4.2                      py_1    conda-forge
parquet-cpp               1.5.0.pre            h83d4a3d_0    conda-forge
parso                     0.3.1                      py_0    conda-forge
pathlib2                  2.3.2                    py35_0    conda-forge
pexpect                   4.6.0                    py35_0    conda-forge
pickleshare               0.7.5                    py35_0    conda-forge
pip                       18.0                  py35_1001    conda-forge
pluggy                    0.8.0                      py_0    conda-forge
prometheus_client         0.5.0                      py_0    conda-forge
prompt_toolkit            2.0.7                      py_0    conda-forge
psutil                    5.4.7            py35h470a237_1    conda-forge
ptyprocess                0.6.0                 py35_1000    conda-forge
py                        1.7.0                      py_0    conda-forge
pyarrow                   0.10.0           py35hfc679d8_0    conda-forge
pycparser                 2.19                       py_0    conda-forge
pygments                  2.3.1                      py_0    conda-forge
pyopenssl                 18.0.0                   py35_0    conda-forge
pyparsing                 2.3.0                      py_0    conda-forge
pysocks                   1.6.8                    py35_2    conda-forge
pytest                    4.0.2                     <pip>
pytest                    3.8.1                    py35_0
python                    3.5.5                h5001a0f_2    conda-forge
python-dateutil           2.7.5                      py_0    conda-forge
pytz                      2018.7                     py_0    conda-forge
pyyaml                    3.13             py35h470a237_1    conda-forge
pyzmq                     17.1.2           py35hae99301_0    conda-forge
readline                  7.0                  haf1bffa_1    conda-forge
recommonmark              0.4.0                      py_2    conda-forge
requests                  2.19.1                   py35_1    conda-forge
rhash                     1.3.6                h470a237_1    conda-forge
send2trash                1.5.0                      py_0    conda-forge
setuptools                40.4.3                   py35_0    conda-forge
simplegeneric             0.8.1                      py_1    conda-forge
six                       1.11.0                   py35_1    conda-forge
snowballstemmer           1.2.1                      py_1    conda-forge
sortedcontainers          2.1.0                      py_0    conda-forge
sphinx                    1.8.1                    py35_0    conda-forge
sphinx-markdown-tables    0.0.9                     <pip>
sphinx_rtd_theme          0.4.2                      py_0    conda-forge
sphinxcontrib-websupport  1.1.0                      py_1    conda-forge
sqlite                    3.26.0               hb1c47c0_0    conda-forge
tblib                     1.3.2                      py_1    conda-forge
terminado                 0.8.1                    py35_1    conda-forge
testpath                  0.3.1                    py35_1    conda-forge
tk                        8.6.9                ha92aebf_0    conda-forge
toolz                     0.9.0                      py_1    conda-forge
tornado                   5.1.1            py35h470a237_0    conda-forge
traitlets                 4.3.2                    py35_0    conda-forge
urllib3                   1.23                     py35_1    conda-forge
wcwidth                   0.1.7                      py_1    conda-forge
webencodings              0.5.1                      py_1    conda-forge
wheel                     0.32.0                py35_1000    conda-forge
xz                        5.2.4                h470a237_1    conda-forge
yaml                      0.1.7                h470a237_1    conda-forge
zeromq                    4.2.5                hfc679d8_6    conda-forge
zict                      0.1.3                      py_0    conda-forge
zlib                      1.2.11               h470a237_3    conda-forge
mrocklin added labels bug (Something isn't working) and Needs Triage (Need team to review and classify) on Jan 4, 2019
kkraus14 added labels feature request (New feature or request) and Python (Affects Python cuDF API), and removed labels Needs Triage and bug, on Feb 13, 2019
mrocklin (Collaborator, Author) commented:

Import time is increasing:

In [1]: %time import cudf
CPU times: user 556 ms, sys: 204 ms, total: 760 ms
Wall time: 3.39 s

mrocklin (Collaborator, Author) commented:

Looking at the profile, it appears we're doing a lot of odd things at import time:

         367424 function calls (356126 primitive calls) in 3.046 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      572    0.610    0.001    0.621    0.001 <frozen importlib._bootstrap_external>:914(get_data)
     3250    0.605    0.000    0.605    0.000 {built-in method posix.stat}
        3    0.573    0.191    0.573    0.191 {method 'read' of '_io.BufferedReader' objects}
      599    0.296    0.000    0.296    0.000 {built-in method io.open}
      597    0.256    0.000    0.256    0.000 {method 'close' of '_io.BufferedReader' objects}
    74/73    0.079    0.001    0.082    0.001 {built-in method _imp.create_dynamic}
      572    0.076    0.000    0.076    0.000 {built-in method marshal.loads}
        1    0.040    0.040    0.041    0.041 nvvm.py:106(__new__)
       82    0.039    0.000    0.039    0.000 {built-in method posix.listdir}
2000/1992    0.028    0.000    0.075    0.000 {built-in method builtins.__build_class__}
      547    0.019    0.000    0.019    0.000 ffi.py:106(__call__)
        2    0.017    0.008    0.017    0.008 {method 'dlopen' of 'CompiledFFI' objects}
    74/59    0.015    0.000    0.901    0.015 {built-in method _imp.exec_dynamic}
      921    0.015    0.000    0.015    0.000 {method 'sub' of 're.Pattern' objects}
      613    0.013    0.000    0.013    0.000 {method 'findall' of 're.Pattern' objects}
      572    0.011    0.000    0.011    0.000 {method 'read' of '_io.FileIO' objects}
     1093    0.010    0.000    0.400    0.000 <frozen importlib._bootstrap_external>:1356(find_spec)
        7    0.009    0.001    0.009    0.001 {built-in method posix.read}
        4    0.009    0.002    0.009    0.002 {built-in method _posixsubprocess.fork_exec}
      572    0.009    0.000    0.732    0.001 <frozen importlib._bootstrap_external>:793(get_code)
        1    0.008    0.008    0.008    0.008 {method 'dot' of 'numpy.ndarray' objects}
      211    0.008    0.000    0.008    0.000 sre_compile.py:276(_optimize_charset)
      820    0.007    0.000    0.421    0.001 <frozen importlib._bootstrap>:882(_find_spec)
      592    0.006    0.000    0.586    0.001 __init__.py:78(open_resource)
        4    0.006    0.001    0.006    0.001 {built-in method _ctypes.dlopen}
     8994    0.006    0.000    0.006    0.000 {built-in method builtins.hasattr}
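
A profile like the one above can be generated with cProfile; a minimal sketch, assuming import_cudf.py is a one-line script containing only import cudf:

python -m cProfile -s tottime import_cudf.py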

lmeyerov commented Jan 23, 2022

It adds up in real RAPIDS code, e.g., ~4 s for an empty cudf DataFrame.

Some observations:

  • pandas takes 1+ s on its own, and its maintainers seem uninterested: PERF: pandas import is too slow pandas-dev/pandas#7282 (comment)

  • cudf is now a 2.7+ s import, plus another 1-2 s even for an empty DataFrame. So ~1 s is attributable to pandas, but the other ~3 s seems fair game.

  • dask_cudf adds another 0.5+ s on import; I didn't measure importing a client and connecting to an existing cluster, but presumably that adds much more.

We've been chasing down why we see 20-60 s starts for a boring RAPIDS web app init that just sets up routes and runs some cudf + UDF warmup routines, so I thought this might help. It's been frustrating in production because it slows down first starts and autorestarts. Local dev and testing are more nice-to-have for us but maybe more important for others, and longer-term this precludes fast cold starts for GPU FaaS.


A few more numbers:

(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "1"

real	0m0.047s
user	0m0.043s
sys	0m0.004s
(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "import logging"

real	0m0.080s
user	0m0.068s
sys	0m0.012s
(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "import pandas"

real	0m1.011s
user	0m1.059s
sys	0m0.626s
(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "import cudf"

real	0m2.692s
user	0m2.607s
sys	0m0.811s
(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "import cudf; cudf.DataFrame({'x': []})"

real	0m3.558s
user	0m3.211s
sys	0m1.060s
(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "import dask_cudf; import cudf; cudf.DataFrame({'x': []})"

real	0m4.015s
user	0m3.547s
sys	0m0.794s

And related to the above:

(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "from numba import cuda"

real	0m1.233s
user	0m1.218s
sys	0m0.757s
(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "import rmm"

real	0m1.678s
user	0m1.688s
sys	0m0.738s
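
For a per-module breakdown of where this time goes, CPython's -X importtime flag (available since Python 3.7) writes a cumulative import-time tree to stderr; a minimal sketch:

python -X importtime -c "import cudf" 2> cudf_imports.log

Each line of the log reports self and cumulative microseconds per imported module.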

quasiben (Member) commented Feb 7, 2022

I spent a little time profiling the import with snakeviz. gpu_utils takes close to 2.5 seconds, and a big chunk of that is from getDeviceCount. @shwina, do you have any thoughts here?

[Screenshot: snakeviz profile of the cudf import, showing gpu_utils / getDeviceCount dominating]

shwina (Contributor) commented Feb 7, 2022

Hmm - nothing sticks out to me in the implementation: https://github.com/NVIDIA/cuda-python/blob/746b773c91e1ede708fe9a584b8cdb1c0f32b51d/cuda/_lib/ccudart/ccudart.pyx#L1458-L1466

I suspect the overhead we're seeing here is from the initialization of the CUDA context.

shwina (Contributor) commented Feb 7, 2022

I can't think of a great way around this. It sounds like the lower bound on our import time is the time for CUDA context initialization, which apparently scales as the square of the number of GPUs.

validate_setup sounds like something we definitely want to do at import time (not later), but we could potentially expose an environment variable that allows skipping the check altogether, as sketched below. Would that be useful?

Note that regardless, the CUDA context will eventually need to be initialized at some point if we're going to use the GPU for anything useful.
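
A minimal sketch of what such an opt-out could look like; the variable name CUDF_SKIP_SETUP_VALIDATION is hypothetical, and the real validate_setup does more than this:

import os

def validate_setup():
    # Hypothetical escape hatch; not an existing cudf flag.
    if os.environ.get("CUDF_SKIP_SETUP_VALIDATION", "0") == "1":
        return
    # The slow check that shows up in the profiles below:
    from rmm._cuda.gpu import getDeviceCount

    if getDeviceCount() < 1:
        raise RuntimeError("No CUDA GPUs detected")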

lmeyerov commented Feb 7, 2022

Googling around suggests C++-level CUDA context creation is expected to be closer to ~100 ms than 1+ s. We're all Python nowadays, so I don't have any C++ handy to check whether that's still true, or whether there's a different notion of context at play. (My tests are single-GPU, not DGX.)

Maybe it's also something about how CPython links things at runtime? I've suspected Python import issues around path search or native imports / runtime linking, but haven't dug deep on that.

jakirkham (Member) commented:

Are your workloads run in the cloud? If so, they may be running on multi-tenant nodes, which have more GPUs than are actually given to any one user. If that's the case, something to play with would be restricting CUDA_VISIBLE_DEVICES (as done above; see the example below) to see if that has an effect. This would provide some clue as to whether or not this is the issue.
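
For example, to expose only the first GPU while timing the import:

time CUDA_VISIBLE_DEVICES=0 python -c "import cudf"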

lmeyerov commented Feb 8, 2022

I've seen this for various local single GPUs, not just cloud: local Windows -> WSL2 -> Ubuntu Docker -> RTX 3080, local Ubuntu Docker -> RTX 3080, and local Ubuntu Docker -> some old GeForce.

Happy to test any cpp/py...

GregoryKimball (Contributor) commented:

@vyasr Is there a reasonable way to add cudf import time tracking to the python benchmarks?

vyasr (Contributor) commented Jun 30, 2022

I think that would be valuable. The issue is that a lot of the slow parts of the import have to do with initialization logic that only occurs the first time that you import cudf. For example:

In [1]: %timeit import cudf
The slowest run took 22.42 times longer than the fastest. This could mean that an intermediate result is being cached.
670 ns ± 1.19 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [2]: %timeit import cudf
77 ns ± 0.189 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Since cudf gets imported as part of pytest's collection process and at the top of every module, we'd be obscuring most of that overhead if we tried to incorporate it into the same benchmarking suite. If we want to benchmark this, we should have a separate, very simple script that just times the import and exits.

If we really care about accuracy, it would be a bit more involved. We would need the script to launch subprocesses (probably serially, to avoid any contention around NVIDIA driver context creation) that each run the import command, and then collect the results; see the sketch below. I don't think we need to go that far, though. Even very rough timings will probably tell us what we need.
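
A minimal sketch of that more involved version (a hypothetical standalone script, not part of cudf's benchmark suite):

import statistics
import subprocess
import sys
import time

def time_import(module: str) -> float:
    # A fresh interpreter per sample, so first-import initialization
    # is always included (interpreter startup is included too).
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
    return time.perf_counter() - start

# Run serially to avoid contention around driver context creation.
samples = [time_import("cudf") for _ in range(5)]
print(f"mean {statistics.mean(samples):.2f}s, stdev {statistics.stdev(samples):.2f}s")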

jakirkham (Member) commented:

Something we might also consider is using this lazy loader framework

https://github.com/scientific-python/lazy_loader
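
A sketch of the basic usage (lazy.load is the framework's entry point; deferring cupy here is just an illustration):

import lazy_loader as lazy

# Returns a lightweight proxy immediately; the real import
# happens on first attribute access.
cupy = lazy.load("cupy")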

jrhemstad (Contributor) commented Feb 7, 2023

I suspect the addition of CUDA Lazy Loading should dramatically improve this situation. It was added in CUDA 11.7 and has been improved in subsequent releases.

This is currently an optional feature and can be enabled by setting the environment variable CUDA_MODULE_LOADING=LAZY at runtime.
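
For example:

CUDA_MODULE_LOADING=LAZY python -c "import cudf"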

vyasr (Contributor) commented Feb 7, 2023

Agreed, CUDA lazy loading should help, although I'm not sure how much. Last I checked, import time was distributed across dependencies as well (numba, cupy, rmm), so we'll need to do some careful profiling again with lazy loading enabled to see what else those libraries are doing and how slow other parts of their imports are (as well as the validation that cudf does on import, which adds significant overhead).

wence- (Contributor) commented Feb 16, 2023

I had a brief look with pyinstrument:

$ pyinstrument importcudf.py

  _     ._   __/__   _ _  _  _ _/_   Recorded: 16:40:36  Samples:  1201
 /_//_/// /_\ / //_// / //_'/ //     Duration: 2.298     CPU time: 1.758
/   _/                      v4.4.0

Program: importcudf.py

2.297 <module>  importcudf.py:1
└─ 2.297 <module>  cudf/__init__.py:1
   ├─ 1.064 validate_setup  cudf/utils/gpu_utils.py:4
   │  ├─ 0.977 <module>  rmm/__init__.py:1
   │  │  ├─ 0.613 <module>  rmm/rmm.py:1
   │  │  │  ├─ 0.350 <module>  torch/__init__.py:1
   │  │  │  │     [617 frames hidden]  torch, <built-in>, typing, collection...
   │  │  │  └─ 0.261 <module>  cupy/__init__.py:1
   │  │  │        [668 frames hidden]  cupy, pytest, _pytest, attr, <built-i...
   │  │  ├─ 0.330 <module>  rmm/mr.py:1
   │  │  │  └─ 0.330 <module>  rmm/_lib/__init__.py:1
   │  │  │     ├─ 0.280 <module>  numba/__init__.py:1
   │  │  │     │     [739 frames hidden]  numba, numpy, re, sre_compile, sre_pa...
   │  │  │     └─ 0.047 <module>  numba/cuda/__init__.py:1
   │  │  │           [163 frames hidden]  numba, asyncio, ssl, enum, <built-in>...
   │  │  └─ 0.034 get_versions  rmm/_version.py:515
   │  │     └─ 0.034 git_pieces_from_vcs  rmm/_version.py:234
   │  │        └─ 0.034 run_command  rmm/_version.py:71
   │  │           └─ 0.027 Popen.communicate  subprocess.py:1110
   │  │                 [7 frames hidden]  subprocess, <built-in>, selectors
   │  └─ 0.031 getDeviceCount  rmm/_cuda/gpu.py:91
   ├─ 0.390 _setup_numba_linker  cudf/core/udf/utils.py:388
   │  └─ 0.390 safe_get_versions  ptxcompiler/patch.py:212
   │        [12 frames hidden]  ptxcompiler, subprocess, selectors, <...
   ├─ 0.308 <module>  cudf/api/__init__.py:1
   │  └─ 0.301 <module>  cudf/api/extensions/__init__.py:1
   │     └─ 0.301 <module>  cudf/api/extensions/accessor.py:1
   │        └─ 0.300 <module>  pandas/__init__.py:1
   │              [694 frames hidden]  pandas, pyarrow, copy, <built-in>, te...
   ├─ 0.280 <module>  cudf/datasets.py:1
   │  ├─ 0.253 <module>  cudf/_lib/__init__.py:1
   │  │  ├─ 0.124 <module>  cudf/utils/ioutils.py:1
   │  │  │  ├─ 0.096 <module>  pyarrow/fs.py:1
   │  │  │  │     [4 frames hidden]  pyarrow, <built-in>
   │  │  │  └─ 0.026 <module>  fsspec/__init__.py:1
   │  │  │        [79 frames hidden]  fsspec, importlib, pathlib, <built-in...
   │  │  ├─ 0.079 [self]  None
   │  │  └─ 0.043 <module>  cudf/_lib/strings/__init__.py:1
   │  └─ 0.024 <module>  cudf/core/column_accessor.py:1
   │     └─ 0.024 <module>  cudf/core/column/__init__.py:1
   ├─ 0.135 <module>  cudf/core/algorithms.py:1
   │  ├─ 0.097 <module>  cudf/core/indexed_frame.py:1
   │  │  └─ 0.090 <module>  cudf/core/groupby/__init__.py:1
   │  │     └─ 0.090 <module>  cudf/core/groupby/groupby.py:1
   │  │        ├─ 0.053 <module>  cudf/core/udf/groupby_utils.py:1
   │  │        │  └─ 0.053 _get_ptx_file  cudf/core/udf/utils.py:291
   │  │        │     └─ 0.053 get_current_device  numba/cuda/api.py:433
   │  │        │           [10 frames hidden]  numba
   │  │        └─ 0.031 <module>  cudf/core/udf/__init__.py:1
   │  │           └─ 0.027 <module>  cudf/core/udf/row_function.py:1
   │  └─ 0.031 <module>  cudf/core/index.py:1
   │     └─ 0.025 Float32Index.__init_subclass__  cudf/core/mixins/mixin_factory.py:212
   │        └─ 0.023 TimedeltaIndex._should_define_operation  cudf/core/mixins/mixin_factory.py:77
   └─ 0.086 get_versions  cudf/_version.py:515
      └─ 0.085 git_pieces_from_vcs  cudf/_version.py:235
         └─ 0.085 run_command  cudf/_version.py:72
            └─ 0.077 Popen.communicate  subprocess.py:1110
                  [7 frames hidden]  subprocess, <built-in>, selectors

To view this report with different options, run:
    pyinstrument --load-prev 2023-02-16T16-40-36 [options]

CUDA_MODULE_LOADING=LAZY doesn't really seem to help (I have host driver == cuda-12, and cuda-11.8 in a rapids-compose environment).

Adding RAPIDS_NO_INITIALIZE=1 just moves some time around:

2.277 <module>  importcudf.py:1
└─ 2.277 <module>  cudf/__init__.py:1
   ├─ 0.453 <module>  rmm/__init__.py:1
   │  ├─ 0.352 <module>  rmm/rmm.py:1
   │  │  └─ 0.351 <module>  torch/__init__.py:1
   │  │        [663 frames hidden]  torch, <built-in>, typing, enum, pick...
   │  ├─ 0.067 <module>  rmm/mr.py:1
   │  │  └─ 0.067 <module>  rmm/_lib/__init__.py:1
   │  │     └─ 0.045 __new__  enum.py:180
   │  │           [15 frames hidden]  enum, <built-in>
   │  └─ 0.034 get_versions  rmm/_version.py:515
   │     └─ 0.034 git_pieces_from_vcs  rmm/_version.py:234
   │        └─ 0.034 run_command  rmm/_version.py:71
   │           └─ 0.027 Popen.communicate  subprocess.py:1110
   │                 [7 frames hidden]  subprocess, <built-in>, selectors
   ├─ 0.407 _setup_numba_linker  cudf/core/udf/utils.py:388
   │  └─ 0.407 safe_get_versions  ptxcompiler/patch.py:212
   │        [11 frames hidden]  ptxcompiler, subprocess, selectors, <...
   ├─ 0.364 <module>  cupy/__init__.py:1
   │     [1068 frames hidden]  cupy, pytest, _pytest, <built-in>, at...
   ├─ 0.356 <module>  cudf/api/__init__.py:1
   │  └─ 0.348 <module>  cudf/api/extensions/__init__.py:1
   │     └─ 0.348 <module>  cudf/api/extensions/accessor.py:1
   │        └─ 0.346 <module>  pandas/__init__.py:1
   │              [700 frames hidden]  pandas, pyarrow, <built-in>, textwrap...
   ├─ 0.286 <module>  cudf/datasets.py:1
   │  ├─ 0.258 <module>  cudf/_lib/__init__.py:1
   │  │  ├─ 0.126 <module>  cudf/utils/ioutils.py:1
   │  │  │  ├─ 0.097 <module>  pyarrow/fs.py:1
   │  │  │  │     [4 frames hidden]  pyarrow, <built-in>
   │  │  │  └─ 0.027 <module>  fsspec/__init__.py:1
   │  │  │        [86 frames hidden]  fsspec, importlib, pathlib, <built-in...
   │  │  ├─ 0.084 [self]  None
   │  │  └─ 0.044 <module>  cudf/_lib/strings/__init__.py:1
   │  └─ 0.025 <module>  cudf/core/column_accessor.py:1
   │     └─ 0.024 <module>  cudf/core/column/__init__.py:1
   ├─ 0.183 <module>  numba/__init__.py:1
   │     [495 frames hidden]  numba, llvmlite, ctypes, <built-in>, ...
   ├─ 0.085 get_versions  cudf/_version.py:515
   │  └─ 0.084 git_pieces_from_vcs  cudf/_version.py:235
   │     └─ 0.084 run_command  cudf/_version.py:72
   │        └─ 0.076 Popen.communicate  subprocess.py:1110
   │              [7 frames hidden]  subprocess, <built-in>, selectors
   ├─ 0.081 <module>  cudf/core/algorithms.py:1
   │  ├─ 0.042 <module>  cudf/core/indexed_frame.py:1
   │  │  └─ 0.036 <module>  cudf/core/groupby/__init__.py:1
   │  │     └─ 0.036 <module>  cudf/core/groupby/groupby.py:1
   │  │        └─ 0.031 <module>  cudf/core/udf/__init__.py:1
   │  │           └─ 0.027 <module>  cudf/core/udf/row_function.py:1
   │  └─ 0.032 <module>  cudf/core/index.py:1
   │     └─ 0.024 UInt64Index.__init_subclass__  cudf/core/mixins/mixin_factory.py:212
   │        └─ 0.024 UInt64Index._should_define_operation  cudf/core/mixins/mixin_factory.py:77
   │           └─ 0.024 dir  None
   │                 [2 frames hidden]  <built-in>
   └─ 0.026 <module>  numba/cuda/__init__.py:1
         [86 frames hidden]  numba, <built-in>, collections, llvml...

Numba (via ptxcompiler) forks a subprocess to compile some CUDA code to determine whether it needs to patch the PTX linker. We can turn that off:

1.895 <module>  importcudf.py:1
└─ 1.895 <module>  cudf/__init__.py:1
   ├─ 0.468 <module>  rmm/__init__.py:1
   │  ├─ 0.359 <module>  rmm/rmm.py:1
   │  │  └─ 0.358 <module>  torch/__init__.py:1
   │  │        [687 frames hidden]  torch, <built-in>, typing, enum, insp...
   │  ├─ 0.073 <module>  rmm/mr.py:1
   │  │  └─ 0.072 <module>  rmm/_lib/__init__.py:1
   │  │     └─ 0.052 __new__  enum.py:180
   │  │           [23 frames hidden]  enum, <built-in>
   │  └─ 0.036 get_versions  rmm/_version.py:515
   │     └─ 0.036 git_pieces_from_vcs  rmm/_version.py:234
   │        └─ 0.036 run_command  rmm/_version.py:71
   │           └─ 0.029 Popen.communicate  subprocess.py:1110
   │                 [7 frames hidden]  subprocess, <built-in>, selectors
   ├─ 0.363 <module>  cupy/__init__.py:1
   │     [1034 frames hidden]  cupy, pytest, _pytest, <built-in>, at...
   ├─ 0.362 <module>  cudf/api/__init__.py:1
   │  └─ 0.354 <module>  cudf/api/extensions/__init__.py:1
   │     └─ 0.354 <module>  cudf/api/extensions/accessor.py:1
   │        └─ 0.353 <module>  pandas/__init__.py:1
   │              [722 frames hidden]  pandas, pyarrow, textwrap, <built-in>...
   ├─ 0.284 <module>  cudf/datasets.py:1
   │  ├─ 0.257 <module>  cudf/_lib/__init__.py:1
   │  │  ├─ 0.127 <module>  cudf/utils/ioutils.py:1
   │  │  │  ├─ 0.098 <module>  pyarrow/fs.py:1
   │  │  │  │     [4 frames hidden]  pyarrow, <built-in>
   │  │  │  └─ 0.028 <module>  fsspec/__init__.py:1
   │  │  │        [97 frames hidden]  fsspec, importlib, pathlib, <built-in...
   │  │  ├─ 0.082 [self]  None
   │  │  └─ 0.043 <module>  cudf/_lib/strings/__init__.py:1
   │  └─ 0.024 <module>  cudf/core/column_accessor.py:1
   │     └─ 0.024 <module>  cudf/core/column/__init__.py:1
   ├─ 0.194 <module>  numba/__init__.py:1
   │     [490 frames hidden]  numba, llvmlite, ctypes, <built-in>, ...
   ├─ 0.081 <module>  cudf/core/algorithms.py:1
   │  ├─ 0.043 <module>  cudf/core/indexed_frame.py:1
   │  │  └─ 0.036 <module>  cudf/core/groupby/__init__.py:1
   │  │     └─ 0.036 <module>  cudf/core/groupby/groupby.py:1
   │  │        └─ 0.031 <module>  cudf/core/udf/__init__.py:1
   │  │           └─ 0.027 <module>  cudf/core/udf/row_function.py:1
   │  │              └─ 0.021 <module>  cudf/core/udf/utils.py:1
   │  └─ 0.032 <module>  cudf/core/index.py:1
   │     └─ 0.024 GenericIndex.__init_subclass__  cudf/core/mixins/mixin_factory.py:212
   │        └─ 0.022 GenericIndex._should_define_operation  cudf/core/mixins/mixin_factory.py:77
   │           └─ 0.020 dir  None
   │                 [2 frames hidden]  <built-in>
   ├─ 0.078 get_versions  cudf/_version.py:515
   │  └─ 0.078 git_pieces_from_vcs  cudf/_version.py:235
   │     └─ 0.078 run_command  cudf/_version.py:72
   │        └─ 0.071 Popen.communicate  subprocess.py:1110
   │              [7 frames hidden]  subprocess, <built-in>, selectors
   ├─ 0.029 <module>  numba/cuda/__init__.py:1
   │     [110 frames hidden]  numba, collections, re, types, <built...
   └─ 0.022 <module>  cudf/io/__init__.py:1

RMM imports PyTorch if it's present in the environment:

try:
    from torch.cuda.memory import CUDAPluggableAllocator
except ImportError:
    rmm_torch_allocator = None
else:
    import rmm._lib.torch_allocator

    _alloc_free_lib_path = rmm._lib.torch_allocator.__file__
    rmm_torch_allocator = CUDAPluggableAllocator(
        _alloc_free_lib_path,
        alloc_fn_name="allocate",
        free_fn_name="deallocate",
    )

Let's suppose we fix that with lazy-loading.
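
As an illustration, one way to defer it is a module-level __getattr__ (PEP 562) in rmm/__init__.py that postpones the torch import until rmm_torch_allocator is first accessed; this is a sketch, not rmm's actual fix:

def __getattr__(name):
    # Defer the heavy torch import until first use.
    if name == "rmm_torch_allocator":
        try:
            from torch.cuda.memory import CUDAPluggableAllocator
        except ImportError:
            return None
        import rmm._lib.torch_allocator

        return CUDAPluggableAllocator(
            rmm._lib.torch_allocator.__file__,
            alloc_fn_name="allocate",
            free_fn_name="deallocate",
        )
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")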

1.469 <module>  importcudf.py:1
└─ 1.467 <module>  cudf/__init__.py:1
   ├─ 0.360 <module>  cupy/__init__.py:1
   │     [1023 frames hidden]  cupy, pytest, _pytest, <built-in>, at...
   ├─ 0.282 <module>  cudf/api/__init__.py:1
   │  └─ 0.274 <module>  cudf/api/extensions/__init__.py:1
   │     └─ 0.274 <module>  cudf/api/extensions/accessor.py:1
   │        └─ 0.272 <module>  pandas/__init__.py:1
   │              [695 frames hidden]  pandas, pyarrow, <built-in>, textwrap...
   ├─ 0.275 <module>  cudf/datasets.py:1
   │  ├─ 0.249 <module>  cudf/_lib/__init__.py:1
   │  │  ├─ 0.125 <module>  cudf/utils/ioutils.py:1
   │  │  │  ├─ 0.096 <module>  pyarrow/fs.py:1
   │  │  │  │     [4 frames hidden]  pyarrow, <built-in>
   │  │  │  └─ 0.028 <module>  fsspec/__init__.py:1
   │  │  │        [91 frames hidden]  fsspec, importlib, pathlib, <built-in...
   │  │  ├─ 0.077 [self]  None
   │  │  └─ 0.042 <module>  cudf/_lib/strings/__init__.py:1
   │  └─ 0.025 <module>  cudf/core/column_accessor.py:1
   │     └─ 0.024 <module>  cudf/core/column/__init__.py:1
   ├─ 0.182 <module>  numba/__init__.py:1
   │     [446 frames hidden]  numba, llvmlite, ctypes, <built-in>, ...
   ├─ 0.118 <module>  cudf/core/algorithms.py:1
   │  ├─ 0.070 <module>  cudf/core/index.py:1
   │  │  ├─ 0.041 <module>  cudf/core/frame.py:1
   │  │  │  └─ 0.039 Frame  cudf/core/frame.py:51
   │  │  │     └─ 0.039 _cudf_nvtx_annotate  cudf/utils/utils.py:385
   │  │  │        └─ 0.039 annotate.__call__  nvtx/nvtx.py:89
   │  │  │              [2 frames hidden]  nvtx
   │  │  └─ 0.025 Float64Index.__init_subclass__  cudf/core/mixins/mixin_factory.py:212
   │  │     └─ 0.024 Float64Index._should_define_operation  cudf/core/mixins/mixin_factory.py:77
   │  │        └─ 0.023 dir  None
   │  │              [2 frames hidden]  <built-in>
   │  └─ 0.041 <module>  cudf/core/indexed_frame.py:1
   │     └─ 0.035 <module>  cudf/core/groupby/__init__.py:1
   │        └─ 0.035 <module>  cudf/core/groupby/groupby.py:1
   │           └─ 0.030 <module>  cudf/core/udf/__init__.py:1
   │              └─ 0.025 <module>  cudf/core/udf/row_function.py:1
   │                 └─ 0.020 <module>  cudf/core/udf/utils.py:1
   ├─ 0.102 <module>  rmm/__init__.py:1
   │  ├─ 0.066 <module>  rmm/mr.py:1
   │  │  └─ 0.066 <module>  rmm/_lib/__init__.py:1
   │  │     └─ 0.045 __new__  enum.py:180
   │  │           [16 frames hidden]  enum, <built-in>
   │  └─ 0.034 get_versions  rmm/_version.py:515
   │     └─ 0.034 git_pieces_from_vcs  rmm/_version.py:234
   │        └─ 0.034 run_command  rmm/_version.py:71
   │           └─ 0.027 Popen.communicate  subprocess.py:1110
   │                 [7 frames hidden]  subprocess, <built-in>, selectors
   ├─ 0.088 get_versions  cudf/_version.py:515
   │  └─ 0.087 git_pieces_from_vcs  cudf/_version.py:235
   │     └─ 0.087 run_command  cudf/_version.py:72
   │        └─ 0.080 Popen.communicate  subprocess.py:1110
   │              [7 frames hidden]  subprocess, <built-in>, selectors
   ├─ 0.026 <module>  numba/cuda/__init__.py:1
   │     [100 frames hidden]  numba, <built-in>, llvmlite, inspect,...
   └─ 0.021 <module>  cudf/io/__init__.py:1

So now the largest single cost is importing cupy, which seems unavoidable. With a little bit of trimming one might get down to 1s.

lmeyerov commented:
It takes an ecosystem, with each library chiseling away at its part, so it seems worth it.

We're thinking of scenarios like CPU CI, lazy loading, and pay-as-you-go for using only part of the API surface.
