PERF: pandas import is too slow #7282
sounds a bit odd, you might have a path issue. do you have multiple pythons/environments installed? does importing numpy take the same amount of time?
|
numpy doesn't seem to have this issue...
|
no idea; why don't you try in a virtualenv with only pandas deps installed |
are you loading this over a network? try to install locally, print out |
closing as not a bug. |
I have the same problem. Was this closed because you found a solution? I'd be grateful if you could share it. Thanks. |
@steve3141 - have you tried creating a pristine virtualenv and seeing if that helps? |
Afraid I can't; work, lockdown, etc. So I realize this is very likely not the fault of pandas, except insofar as "import pandas" executes an enormous number -- over 500 by my count -- of secondary import statements. Filesystem overhead. Thanks,
|
I know this has been closed for awhile but I'm seeing the same thing and it is not pandas specific. We have our pandas environment in a virtualenv on a drive on a server. That drive is then mounted by each client. This allows us to maintain a sane package environment among all users. However, this is clearly sacrificing startup time to an unreasonable extent. The import times in seconds are as follows:
So clearly this is a setup issue, but how do other companies deal with this problem? I find it hard to believe that packages are installed locally on every user's box and if that isn't the case, that they experience these long startup times. The network itself is working fine...transfer speeds are ~120MB/s. |
@rockg - dunno about every corporation, but certainly all of the installations I've worked with have had everything locally. Conda and tox can make it much easier to have local installs. |
I have the same problem -> 6s import time, local install (anaconda, pandas 0.14.1). This is impossibly slow, especially when trying to import in multiple processes. |
Same problem (pandas 0.18), although mine is not as awful: ~400ms just to import pandas. |
+1. I see anywhere between 400 - 700ms. |
try removing the mpl font caches. Or, if you are in such a locked-down environment that you cannot write the caches, this might be mpl searching your system for fonts every time it is imported.
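For anyone who wants to try that suggestion, a minimal sketch (assuming a standard matplotlib install; the cache location varies by platform and version):
```
# Locate and clear matplotlib's font cache so it is rebuilt on next import.
import shutil
import matplotlib

cache_dir = matplotlib.get_cachedir()  # e.g. ~/.cache/matplotlib on Linux
print("removing", cache_dir)
shutil.rmtree(cache_dir, ignore_errors=True)
```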
(in python3 / pandas 0.16.2 via anaconda)
---- restart ipython ----
381 ms on linux. Importing pandas from ipython (~300ms) is faster than running it from python (~500ms). Importing some sub-dependencies first speeds up importing pandas.
---- restart ipython ----
It looks like pytz is particularly slow. Going through all the modules imported by pandas, I uninstalled matplotlib, xlsxwriter, and cython, and imported pandas' sub-imports before pandas itself.
A workaround may be to stratify these imports before you need pandas. I'm getting similar results with no anaconda / python2 / pandas 0.18. |
Similar issue for me. It makes development in Flask unbearable, since it takes 10s after every file change to reload. I debugged it, and an import time of 3-10 seconds for pandas is the main culprit (2015 MBA, running anaconda on 3.5). There is some caching happening, but not sure what... python -m timeit -n1 -r1 "import pandas" |
One workaround is to isolate all the code that interacts with pandas and lazily import that code only when you need it so that the wait period is only during program execution. (that's what I do) |
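A sketch of that workaround, with the import living inside the function that needs it (the load_data helper is hypothetical):
```
def load_data(path):
    # pandas is imported on first call, not at program startup, so the
    # import cost is deferred until the data is actually needed.
    import pandas as pd
    return pd.read_csv(path)
```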
I don't think that will help - then I'll have the delay every reload
(since all my code works with pandas).
I'd done this in a terminal window:
```
while true; do date && python -m timeit -n1 -r1 "import pandas"; sleep 2;
done;
```
Doing this keeps pandas in the OS cache. Stupid hack, but keeps loading
down to 300-500ms.
|
I'm having a similar issue. Running on OSX and does the same in the virtualenv and out of it. Tried reinstalling everything and that didn't help. Doesn't seem to be matplotlib as that is relatively fast on its own. Very tricky to troubleshoot this- doesn't seem to show anything in the logs. |
Can somebody please profile a simple "import pandas" and we can see if the problem is easily identified? |
So I did a quick profile and found the following:
Seems like the init at line 5 is taking most of the time- is this the main init of pandas? |
just for comparison on osx:
```
# 2.7
bash-3.2$ ~/miniconda3/envs/py2.7/bin/python -m timeit -n1 -r1 "import numpy"
1 loops, best of 1: 287 msec per loop
bash-3.2$ ~/miniconda3/envs/py2.7/bin/python -m timeit -n1 -r1 "import pandas"
1 loops, best of 1: 671 msec per loop
# 3.5
bash-3.2$ ~/miniconda3/envs/pandas/bin/python -m timeit -n1 -r1 "import numpy"
1 loops, best of 1: 168 msec per loop
bash-3.2$ ~/miniconda3/envs/pandas/bin/python -m timeit -n1 -r1 "import pandas"
1 loops, best of 1: 494 msec per loop
```
|
Probably cached?
|
not sure what you think is cached |
@RexFuzzle I'm surprised you don't have any long file names. Did you strip the directories? You should be seeing something like the below. That will make it easier to see what is taking the majority of the time. I think it comes down to pandas importing a lot of dependencies, each of which has its own hit.
|
Hmmm, that is strange - I didn't strip anything - I was using cProfile - don't know if that could have caused it. Will investigate it a bit further tomorrow. From my results, though, it certainly seems like it is just the one init that is taking all the time - will try to get mine in the same format as yours and then we can compare and see if it is the same init file and line number. |
Save the cProfile output to a file and then load it with pstats and print. If it is a specific module, run the line profiler to see if it is anything specific or just a lot of small things.
|
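A minimal sketch of that workflow (the stats file name is arbitrary):
```
import cProfile
import pstats

# Profile the import and save the raw stats to a file
cProfile.run("import pandas", "pandas_import.prof")

# Load the stats, drop directory prefixes, and show the 20 entries
# with the largest cumulative time
stats = pstats.Stats("pandas_import.prof")
stats.strip_dirs().sort_stats("cumulative").print_stats(20)
```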
For me, the first load is 4s. Then the OS caches the library in memory, so it's around 300-500ms. Wait a little while, and try again.
Best,
Jacob
|
All right, let's go one step further and do a line profile of |
Maybe you could also give https://github.com/cournape/import-profiler a try. But looking at the above values: although the pandas import time is much larger, numpy also takes much longer. The ratio of numpy import to full pandas import seems about the same as for the much smaller numbers @jreback posted (and that I also see). So if numpy is already taking more than 4 seconds to import, we are of course not going to get the pandas import time below that. |
Thanks for all the input. I ran dtruss in the meantime and found that nothing happens for a few seconds before anything shows up there, so I'm thinking there is a lag on disk reads rather than it being a Python problem; to me this is reinforced by the fact that the time seems to be grouped with the first line of the init file (an artifact from cProfile?). Will do a bit more digging. I also agree that it seems to be more of a numpy problem and will look through their issues to see if anybody else has something similar. |
Sorry, that is not what I wanted to say. I just meant that both numpy and pandas seem to take longer (compared to my laptop, both 10x to 15x longer), so it is not necessarily possible to pinpoint a certain import as the culprit. It just seems generally slower. Which of course does not mean that we couldn't do some more lazy imports in pandas to improve things, if there are bottlenecks. |
Please do not ignore this issue. It's closed, but I also found problems with a long import duration. Maybe it should be picked up again -- create awareness about this issue and raise the priority? Otherwise it is not good for the popularity of pandas. |
I'm willing and able to do more testing, but I don't know of any other profiling tests I could run to track down the source, so I am open to suggestions. |
Greetings. When using pandas with not-so-big datasets, it would take at least 5 to 10 seconds to parse through all the data and plot, which is quite a long time.
Since that was an abnormal amount of time for so little code execution, I decided to uninstall both Anaconda and Python 3.6.1 and take a few extra steps.
Now code execution is faster (much faster than before). |
I just ran the same profile as rockg suggested. Same with the pandas.plotting module -- I have an application which doesn't do any plotting, so it stinks that it adds significant time to my import with no benefit. It seems like it would make sense to make this lazy, since matplotlib takes a long time anyway and the extra 0.15s isn't noticeable.
which prints (stuff below 0.1 second elided)
|
FYI -- I have an SSD on my PC so if there is a disk seek issue that some people have, I don't see it. numpy 1.12 takes 0.17 seconds to import. |
I'm using pandas 0.20.2 with pytz 2016.4 on a Windows 7 machine running Anaconda Python 2.7 |
I just ran conda uninstall pytz and reinstalled it; it now takes 0.01 seconds with pytz-2017.2. Reinstalled pytz 2016.4 (conda install pytz=2016.4) and it slowed back down to 0.92 seconds again. Installed pytz 2016.7 -- it also is very fast (13 milliseconds to import). There is an item in the profile data called "lazy.py", which sounds like they converted to "lazy" loading in 2016.7.
which prints this for pytz 2016.7:
|
Hmm, unfortunately switching to pytz 2017.2 (or 2016.7) doesn't seem to speed up the pandas import; looks like there are a lot of shared dependencies between the two. Oh, here we go: both are using pkg_resources.py, which takes about 0.9s on my PC to execute whatever it is doing, whether it's loaded from pytz or pandas. I had setuptools 27.2 (which includes pkg_resources); this seems to be related to this issue: pypa/setuptools#926 |
OK, I used ripgrep in my site-packages to look for pkg_resources, and the culprits are pytz (which now uses it lazily) and numexpr. I filed an issue with numexpr. Is numexpr imported lazily in pandas in the upcoming release? That's another area where a feature I never use (at least, I think I never use it) slows down the pandas import significantly. edit: never mind, you already know about this: |
For reference, here's an import profile using Python 3.7's new -X importtime option:
|
Our solution is to set up a web server and send POST requests to the algorithm part; that way the "import pandas" cost is paid only once, when the server starts.
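Something along those lines might look like this hypothetical long-lived service, where pandas is imported once at startup rather than per run (the endpoint and payload shape are made up for illustration):
```
from flask import Flask, jsonify, request
import pandas as pd  # imported once, when the server process starts

app = Flask(__name__)

@app.post("/describe")
def describe():
    # Each request reuses the already-imported pandas, so no import delay
    df = pd.DataFrame(request.get_json())
    return jsonify(df.describe().to_dict())

if __name__ == "__main__":
    app.run(port=5000)
```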
having the same issue here
```
python -m timeit -n1 -r1 "import pandas"
1 loops, best of 1: 8.56 sec per loop
```
|
Feel free to make a PR if you identify easy fixes.
|
So I think I may have found the issue. Over 50% of my time is on a single function call: mkl._py_mkl_service.get_version (at the top of the profile output when ordered by internal time).
|
For future reference (since Google led me here), an easy way to figure out where the import time is being spent is to use -X importtime. Here I'm illustrating 'import matplotlib.pyplot as plt', which is kinda slow for me at the moment, and filtering just those particular imports that take longer than 0.1 seconds:
|
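For example, a small Python wrapper that re-runs the interpreter with -X importtime and keeps only the slow entries (the 0.1s threshold is arbitrary; note the report is written to stderr):
```
import subprocess
import sys

# Re-run the interpreter with -X importtime and capture its report
result = subprocess.run(
    [sys.executable, "-X", "importtime", "-c", "import pandas"],
    capture_output=True, text=True,
)

for line in result.stderr.splitlines():
    # Lines look like: "import time:  self [us] | cumulative | package"
    parts = line.split("|")
    if len(parts) == 3:
        try:
            cumulative_us = int(parts[1].strip())
        except ValueError:
            continue  # skip the header line
        if cumulative_us > 100_000:  # keep imports slower than 0.1 s
            print(line)
```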
solved using conda: It upgraded python from 3.8.10 to 3.10.4 and installed pandas v. 1.4.2 instead of 1.4.3. It is now 10 times faster |
I'm wondering if this has something to do with it and why people aren't seeing the same results. I import a handful of modules in a notebook and the first time is painful - on the order of a minute (sklearn is the worst, pandas second longest). Any subsequent load is fast.
```
❯ . /tmp/venv/bin/activate
❯ python -m timeit -n1 -r1 "import pandas"
1 loop, best of 1: 289 msec per loop <<< ~0.3 seconds
❯ pip uninstall pandas
... Successfully uninstalled pandas-2.2.3
❯ pip install pandas
... Successfully installed pandas-2.2.3
❯ python -m timeit -n1 -r1 "import pandas"
1 loop, best of 1: 18 sec per loop <<< 18 seconds
❯ python -m timeit -n1 -r1 "import pandas"
1 loop, best of 1: 389 msec per loop <<< ~0.4 seconds
❯ python --version
Python 3.13.0
```
I ended up writing a step in my notebook's installer that imports each library (in parallel) so that nobody thinks my notebook is slow or hung the first time it's run. |
Python, like Java, is "compiled" to an intermediate "object" .pyc file that gets run by a runtime interpreter.
Uninstalling pandas removed these files. So when you reinstall and run pandas the first time, it has to recompile the source into the intermediate object files. That's why it takes so long the first time you run it (and in some cases, when parts are run for the first time).
Java is compiled at the source. Python is compiled just in time, when you run it the first time, with a bytecode compiler that is very fast compared to just a few years ago. This enabled the 500% performance improvement between Python 3.8 and 3.13.
I'm old enough to remember compiles of PL/1 and Fortran taking hours. 18 seconds? Not losing sleep over it. Still, some options:
```
import compileall

# Compile all .py files in the current directory and its subdirectories
compileall.compile_dir('.', force=True)

# Compile a specific file
compileall.compile_file('my_module.py')
```
or Precompiling (rules_python documentation): https://rules-python.readthedocs.io/en/0.35.0/precompiling.html
And maybe this old-time solution still works, from "Is it possible to precompile an entire python package?" on Stack Overflow: https://stackoverflow.com/questions/8301130/is-it-possible-to-precompile-an-entire-python-package
But I am more interested in your code to do parallel imports -- that might be useful to know. Where can I find an example?
|
More data. In each of these examples I'm leaving out the destruction and recreation of the virtual environment, to make sure that every run is starting from the same place. I disagree that one shouldn't be worried about 18s: multiply that out by how many times this occurs every day and it adds up quickly - 500k imports/day would be ~104 CPU-days/day. Anyways, this notebook has a couple of imports:
```
import pandas as pd
import numpy as np
import scipy
import networkx as nx
import matplotlib.pyplot as plt
import heapq
import colorsys
from sklearn.preprocessing import minmax_scale
import json
from IPython.display import display, HTML
from operator import add
import graphviz
from copy import deepcopy
from matrepr import mprint
```
If I install the packages and time importing all of them, back to back:
```
❯ awk '/import/{print $2}' graphs.ipynb | sed 's/\\.*//' | paste -s -d , - | xargs -I% python -m timeit -n1 -r1 "import %"
1 loop, best of 1: 43.1 sec per loop
❯ awk '/import/{print $2}' graphs.ipynb | sed 's/\\.*//' | paste -s -d , - | xargs -I% python -m timeit -n1 -r1 "import %"
1 loop, best of 1: 1.02 sec per loop
```
My first instinct was to do exactly what you suggested. My installer ran:
```
python -m compileall -j0
...
Listing '/private/tmp/venv/lib/python3.13/site-packages'...
Compiling '/private/tmp/venv/lib/python3.13/site-packages/decorator.py'...
Compiling '/private/tmp/venv/lib/python3.13/site-packages/ipykernel_launcher.py'...
...
```
Alas, no luck:
```
❯ find /tmp/venv -type f -name \*.py | xargs python -m compileall -j0
...
❯ awk '/import/{print $2}' graphs.ipynb | sed 's/\\.*//' | paste -s -d , - | xargs -I% python -m timeit -n1 -r1 "import %"
1 loop, best of 1: 43 sec per loop
```
To load each module in parallel (after recreating my venv):
```
❯ awk '/import/{print $2}' graphs.ipynb | sed 's/\\.*//' | time parallel python -c '"import {}"'
parallel python -c '"import {}"'  2.75s user 1.12s system 13% cpu 29.543 total
```
Which saves ~13 seconds, the long pole in this particular tent being sklearn. The advantage here is that this is done at install time, not when the user is opening the notebook for the first time and then has to wait ~43 seconds before anything happens.
But now, out of curiosity, I redid this entire process on my old laptop and came up with a different outcome. Using the exact same inputs and destroying/recreating the environment, I cannot reproduce the above results. The load time is the same (fast) whether I go through the above hoops or not. Here, from a fresh venv:
```
❯ awk '/import/{print $2}' graphs.ipynb | sed 's/\\.*//' | paste -s -d , - | xargs -I% python -m timeit -n1 -r1 "import %"
1 loop, best of 1: 1.17 sec per loop
```
The difference? All but the last run was done on a 2023 macbook pro with an m3 max/36G (arm64e) cpu running macos 15.1.1 with python 3.13. That last run was done on an old lenovo laptop running ubuntu 24.04.1 with an Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz (x86_64) and python 3.12.3. Another reason people are seeing inconsistent results. This was a surprise to me. |
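For the earlier question about what such a parallel-import installer step might look like, a sketch (the module list is illustrative) - one subprocess per module, so first-run byte-compilation and disk caching happen at install time:
```
import subprocess
import sys

MODULES = ["pandas", "numpy", "scipy", "sklearn", "matplotlib.pyplot"]

# Warm each module in its own interpreter: the first import triggers
# byte-compilation of .pyc files and pulls the package into the OS cache.
procs = [
    subprocess.Popen([sys.executable, "-c", f"import {m}"])
    for m in MODULES
]
for p in procs:
    p.wait()
```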
The following test demonstrates the problem: the contents of testme.py is literally `import pandas`; however, it takes almost 6 seconds to import pandas on my Lenovo T60.