PERF: pandas import is too slow #7282
sounds a bit odd, you might have a path issue. do you have multiple pythons/environments installed? does importing numpy take the same amount of time?
|
numpy doesn't seem to have this issue...
|
no idea; why don't you try in a virtualenv with only pandas deps installed |
are you loading this over a network? try to install locally, print out |
closing as not a bug. |
I have the same problem. Was this closed because you found a solution? I'd be grateful if you could share it. Thanks. |
@steve3141 - have you tried creating a pristine virtualenv and seeing if that helps? |
Afraid I can't; work, lockdown, etc. So I realize this is very likely not the fault of pandas, except insofar as "import pandas" executes an enormous number -- over 500 by my count -- of secondary import statements. Filesystem overhead. Thanks,
|
I know this has been closed for awhile but I'm seeing the same thing and it is not pandas specific. We have our pandas environment in a virtualenv on a drive on a server. That drive is then mounted by each client. This allows us to maintain a sane package environment among all users. However, this is clearly sacrificing startup time to an unreasonable extent. The import times in seconds are as follows:
So clearly this is a setup issue, but how do other companies deal with this problem? I find it hard to believe that packages are installed locally on every user's box and if that isn't the case, that they experience these long startup times. The network itself is working fine...transfer speeds are ~120MB/s. |
@rockg - dunno about every corporation, but certainly all of the installations I've worked with have had everything locally. Conda and tox can make it much easier to have local installs. |
I have the same problem -> 6s import time, local install (anaconda, pandas 0.14.1). This is impossibly slow, especially when trying to import in multiple processes. |
Same problem (pandas 0.18), although mine is not as awful: ~400ms just to import pandas. |
+1. I see anywhere between 400 - 700ms. |
try removing the mpl font caches. Or, if you are in such a locked-down environment that you cannot write the caches, this might be mpl searching your system for fonts every time it is imported.
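For anyone who wants to try that suggestion, a minimal sketch (assuming a standard matplotlib install; the cache location varies by platform and version):
```
# Locate and clear matplotlib's font cache so it is rebuilt on next import.
import shutil
import matplotlib

cache_dir = matplotlib.get_cachedir()  # e.g. ~/.cache/matplotlib on Linux
print("removing", cache_dir)
shutil.rmtree(cache_dir, ignore_errors=True)
```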
(in python3 / pandas 0.16.2 via anaconda)
---- restart ipython ----
381 ms on linux. Importing pandas from ipython (~300ms) is faster than running it from python (~500ms). Importing some sub-dependencies first speeds up importing pandas.
---- restart ipython ----
It looks like pytz is particularly slow. Going through all the modules imported by pandas, I uninstalled matplotlib, xlsxwriter, and cython, and imported pandas' sub-imports before pandas itself.
A workaround may be to stratify these imports before you need pandas. I'm getting similar results with no anaconda / python2 / pandas 0.18. |
Similar issue for me. It makes development in Flask unbearable, since it takes 10s after every file change to reload. I debugged it, and an import time of 3-10 seconds for pandas is the main culprit (2015 MBA, running anaconda on 3.5). There is some caching happening, but not sure what... python -m timeit -n1 -r1 "import pandas" |
One workaround is to isolate all the code that interacts with pandas and lazily import that code only when you need it so that the wait period is only during program execution. (that's what I do) |
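A sketch of that workaround, with the import living inside the function that needs it (the load_data helper is hypothetical):
```
def load_data(path):
    # pandas is imported on first call, not at program startup, so the
    # import cost is deferred until the data is actually needed.
    import pandas as pd
    return pd.read_csv(path)
```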
I don't think that will help - then I'll have the delay every reload
(since all my code works with pandas).
I'd done this in a terminal window:
```
while true; do date && python -m timeit -n1 -r1 "import pandas"; sleep 2;
done;
```
Doing this keeps pandas in the OS cache. Stupid hack, but keeps loading
down to 300-500ms.
|
I'm having a similar issue. Running on OSX and does the same in the virtualenv and out of it. Tried reinstalling everything and that didn't help. Doesn't seem to be matplotlib as that is relatively fast on its own. Very tricky to troubleshoot this- doesn't seem to show anything in the logs. |
Can somebody please profile a simple "import pandas" and we can see if the problem is easily identified? |
So I did a quick profile and found the following:
Seems like the init at line 5 is taking most of the time- is this the main init of pandas? |
just for comparison on osx:
```
# 2.7
bash-3.2$ ~/miniconda3/envs/py2.7/bin/python -m timeit -n1 -r1 "import numpy"
1 loops, best of 1: 287 msec per loop
bash-3.2$ ~/miniconda3/envs/py2.7/bin/python -m timeit -n1 -r1 "import pandas"
1 loops, best of 1: 671 msec per loop
# 3.5
bash-3.2$ ~/miniconda3/envs/pandas/bin/python -m timeit -n1 -r1 "import numpy"
1 loops, best of 1: 168 msec per loop
bash-3.2$ ~/miniconda3/envs/pandas/bin/python -m timeit -n1 -r1 "import pandas"
1 loops, best of 1: 494 msec per loop
```
|
Probably cached?
|
not sure what you think is cached |
@RexFuzzle I'm surprised you don't have any long file names. Did you strip the directories? You should be seeing something like the below. That will make it easier to see what is taking the majority of the time. I think it comes down to pandas importing a lot of dependencies, each of which has its own hit.
|
Hmmm, that is strange - I didn't strip anything - I was using cProfile - don't know if that could have caused it. Will investigate it a bit further tomorrow. From my results, though, it certainly seems like it is just the one init that is taking all the time - will try to get mine in the same format as yours and then we can compare and see if it is the same init file and line number. |
Save the cProfile output to a file and then load it with pstats and print. If it is a specific module, run the line profiler to see if it is anything specific or just a lot of small things.
|
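A minimal sketch of that workflow (the stats file name is arbitrary):
```
import cProfile
import pstats

# Profile the import and save the raw stats to a file
cProfile.run("import pandas", "pandas_import.prof")

# Load the stats, drop directory prefixes, and show the 20 entries
# with the largest cumulative time
stats = pstats.Stats("pandas_import.prof")
stats.strip_dirs().sort_stats("cumulative").print_stats(20)
```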
For me, the first load is 4s. Then the OS caches the library in memory, so it's around 300-500ms. Wait a little while, and try again.
Best,
Jacob
|
All right, let's go one step further and do a line profile of |
Maybe you could also give https://github.com/cournape/import-profiler a try. But looking at the above values: although the pandas import time is much larger, numpy also takes much longer. The ratio of numpy import to full pandas import seems about the same as for the much smaller numbers @jreback posted (and that I also see). So if numpy is already taking more than 4 seconds to import, we are of course not going to get the pandas import time below that. |
Thanks for all the input. I ran dtruss in the meantime and found that nothing happens for a few seconds before anything shows up there, so I'm thinking there is a lag on disk reads rather than it being a Python problem; to me this is reinforced by the fact that the time seems to be grouped with the first line of the init file (an artifact from cProfile?). Will do a bit more digging. I also agree that it seems to be more of a numpy problem and will look through their issues to see if anybody else has something similar. |
Sorry, that is not what I wanted to say. I just meant that both numpy and pandas seem to take longer (compared to my laptop, both 10x to 15x longer), so it is not necessarily possible to pinpoint a certain import as the culprit. It just seems generally slower. Which of course does not mean that we couldn't do some more lazy imports in pandas to improve things, if there are bottlenecks. |
Please do not ignore this issue. It's closed, but I also found problems with a long import duration. Maybe it should be picked up again -- create awareness about this issue and raise the priority? Otherwise it is not good for the popularity of pandas. |
I'm willing and able to do more testing, but I don't know of any other profiling tests I could run to track down the source, so I am open to suggestions. |
Greetings. When using pandas with not-so-big datasets, it would take at least 5 to 10 seconds to parse through all the data and plot, which is quite a long time.
Since that was an abnormal amount of time for so little code execution, I decided to uninstall both Anaconda and Python 3.6.1 and take a few extra steps.
Now code execution is faster (much faster than before). |
I just ran the same profile as rockg suggested. Same with the pandas.plotting module -- I have an application which doesn't do any plotting, so it stinks that it adds significant time to my import with no benefit. It seems like it would make sense to make this lazy, since matplotlib takes a long time anyway and the extra 0.15s isn't noticeable.
which prints (stuff below 0.1 second elided)
|
FYI -- I have an SSD on my PC so if there is a disk seek issue that some people have, I don't see it. numpy 1.12 takes 0.17 seconds to import. |
I'm using pandas 0.20.2 with pytz 2016.4 on a Windows 7 machine running Anaconda Python 2.7 |
I just ran conda uninstall pytz and reinstalled it; it now takes 0.01 seconds with pytz-2017.2. Reinstalled pytz 2016.4 (conda install pytz=2016.4) and it slowed back down to 0.92 seconds again. Installed pytz 2016.7 -- it also is very fast (13 milliseconds to import). There is an item in the profile data called "lazy.py", which sounds like they converted to "lazy" loading in 2016.7.
which prints this for pytz 2016.7:
|
Hmm, unfortunately switching to pytz 2017.2 (or 2016.7) doesn't seem to speed up the pandas import; looks like there are a lot of shared dependencies between the two. Oh, here we go: both are using pkg_resources.py, which takes about 0.9s on my PC to execute whatever it is doing, whether it's loaded from pytz or pandas. I had setuptools 27.2 (which includes pkg_resources); this seems to be related to this issue: pypa/setuptools#926 |
OK, I used ripgrep in my site-packages to look for pkg_resources, and the culprits are pytz (which now uses it lazily) and numexpr. I filed an issue with numexpr. Is numexpr imported lazily in pandas in the upcoming release? That's another area where a feature I never use (at least, I think I never use it) slows down the pandas import significantly. edit: never mind, you already know about this: |
For reference, here's an import profile using Python 3.7's new -X importtime option:
|
Our solution is to set up a web server and send POST requests to the algorithm part; that way the "import pandas" cost is paid only once, when the server starts.
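Something along those lines might look like this hypothetical long-lived service, where pandas is imported once at startup rather than per run (the endpoint and payload shape are made up for illustration):
```
from flask import Flask, jsonify, request
import pandas as pd  # imported once, when the server process starts

app = Flask(__name__)

@app.post("/describe")
def describe():
    # Each request reuses the already-imported pandas, so no import delay
    df = pd.DataFrame(request.get_json())
    return jsonify(df.describe().to_dict())

if __name__ == "__main__":
    app.run(port=5000)
```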
having the same issue here
```
python -m timeit -n1 -r1 "import pandas"
1 loops, best of 1: 8.56 sec per loop
```
|
Feel free to make a PR if you identify easy fixes.
|
So I think I may have found the issue. Over 50% of my time is on a single function call: mkl._py_mkl_service.get_version (at the top of the profile output when ordered by internal time).
|
For future reference (since Google led me here), an easy way to figure out where the import time is being spent is to use -X importtime. Here I'm illustrating 'import matplotlib.pyplot as plt', which is kinda slow for me at the moment, and filtering just those particular imports that take longer than 0.1 seconds:
|
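For example, a small Python wrapper that re-runs the interpreter with -X importtime and keeps only the slow entries (the 0.1s threshold is arbitrary; note the report is written to stderr):
```
import subprocess
import sys

# Re-run the interpreter with -X importtime and capture its report
result = subprocess.run(
    [sys.executable, "-X", "importtime", "-c", "import pandas"],
    capture_output=True, text=True,
)

for line in result.stderr.splitlines():
    # Lines look like: "import time:  self [us] | cumulative | package"
    parts = line.split("|")
    if len(parts) == 3:
        try:
            cumulative_us = int(parts[1].strip())
        except ValueError:
            continue  # skip the header line
        if cumulative_us > 100_000:  # keep imports slower than 0.1 s
            print(line)
```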
solved using conda: It upgraded python from 3.8.10 to 3.10.4 and installed pandas v. 1.4.2 instead of 1.4.3. It is now 10 times faster |
I'm wondering if this has something to do with it and why people aren't seeing the same results. I import a handful of modules in a notebook and the first time is painful - on the order of a minute (sklearn is the worst, pandas second longest). Any subsequent load is fast.
```
❯ . /tmp/venv/bin/activate
❯ python -m timeit -n1 -r1 "import pandas"
1 loop, best of 1: 289 msec per loop <<< ~0.3 seconds
❯ pip uninstall pandas
... Successfully uninstalled pandas-2.2.3
❯ pip install pandas
... Successfully installed pandas-2.2.3
❯ python -m timeit -n1 -r1 "import pandas"
1 loop, best of 1: 18 sec per loop <<< 18 seconds
❯ python -m timeit -n1 -r1 "import pandas"
1 loop, best of 1: 389 msec per loop <<< ~0.4 seconds
❯ python --version
Python 3.13.0
```
I ended up writing a step in my notebook's installer that imports each library (in parallel) so that nobody thinks my notebook is slow or hung the first time it's run. |
Python, like Java, is "compiled" to an intermediate "object" .pyc file that gets run by a runtime interpreter.
Uninstalling pandas removed these files. So when you reinstall and run pandas the first time, it has to recompile the source into the intermediate object files. That's why it takes so long the first time you run it (and in some cases, when parts are run for the first time).
Java is compiled at the source. Python is compiled just in time, when you run it the first time, with a bytecode compiler that is very fast compared to just a few years ago. This enabled the 500% performance improvement between Python 3.8 and 3.13.
I'm old enough to remember compiles of PL/1 and Fortran taking hours. 18 seconds? Not losing sleep over it. Still, some options:
```
import compileall

# Compile all .py files in the current directory and its subdirectories
compileall.compile_dir('.', force=True)

# Compile a specific file
compileall.compile_file('my_module.py')
```
or Precompiling (rules_python documentation): https://rules-python.readthedocs.io/en/0.35.0/precompiling.html
And maybe this old-time solution still works, from "Is it possible to precompile an entire python package?" on Stack Overflow: https://stackoverflow.com/questions/8301130/is-it-possible-to-precompile-an-entire-python-package
But I am more interested in your code to do parallel imports -- that might be useful to know. Where can I find an example?
|
More data. In each of these examples I'm leaving out the destruction and recreation of the virtual environment, to make sure that every run is starting from the same place. I disagree that one shouldn't be worried about 18s: multiply that out by how many times this occurs every day and it adds up quickly - 500k imports/day would be ~104 CPU-days/day. Anyways, this notebook has a couple of imports:
```
import pandas as pd
import numpy as np
import scipy
import networkx as nx
import matplotlib.pyplot as plt
import heapq
import colorsys
from sklearn.preprocessing import minmax_scale
import json
from IPython.display import display, HTML
from operator import add
import graphviz
from copy import deepcopy
from matrepr import mprint
```
If I install the packages and time importing all of them, back to back:
```
❯ awk '/import/{print $2}' graphs.ipynb | sed 's/\\.*//' | paste -s -d , - | xargs -I% python -m timeit -n1 -r1 "import %"
1 loop, best of 1: 43.1 sec per loop
❯ awk '/import/{print $2}' graphs.ipynb | sed 's/\\.*//' | paste -s -d , - | xargs -I% python -m timeit -n1 -r1 "import %"
1 loop, best of 1: 1.02 sec per loop
```
My first instinct was to do exactly what you suggested. My installer ran:
```
python -m compileall -j0
...
Listing '/private/tmp/venv/lib/python3.13/site-packages'...
Compiling '/private/tmp/venv/lib/python3.13/site-packages/decorator.py'...
Compiling '/private/tmp/venv/lib/python3.13/site-packages/ipykernel_launcher.py'...
...
```
Alas, no luck:
```
❯ find /tmp/venv -type f -name \*.py | xargs python -m compileall -j0
...
❯ awk '/import/{print $2}' graphs.ipynb | sed 's/\\.*//' | paste -s -d , - | xargs -I% python -m timeit -n1 -r1 "import %"
1 loop, best of 1: 43 sec per loop
```
To load each module in parallel (after recreating my venv):
```
❯ awk '/import/{print $2}' graphs.ipynb | sed 's/\\.*//' | time parallel python -c '"import {}"'
parallel python -c '"import {}"'  2.75s user 1.12s system 13% cpu 29.543 total
```
Which saves ~13 seconds, the long pole in this particular tent being sklearn. The advantage here is that this is done at install time, not when the user is opening the notebook for the first time and then has to wait ~43 seconds before anything happens.
But now, out of curiosity, I redid this entire process on my old laptop and came up with a different outcome. Using the exact same inputs and destroying/recreating the environment, I cannot reproduce the above results. The load time is the same (fast) whether I go through the above hoops or not. Here, from a fresh venv:
```
❯ awk '/import/{print $2}' graphs.ipynb | sed 's/\\.*//' | paste -s -d , - | xargs -I% python -m timeit -n1 -r1 "import %"
1 loop, best of 1: 1.17 sec per loop
```
The difference? All but the last run was done on a 2023 macbook pro with an m3 max/36G (arm64e) cpu running macos 15.1.1 with python 3.13. That last run was done on an old lenovo laptop running ubuntu 24.04.1 with an Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz (x86_64) and python 3.12.3. Another reason people are seeing inconsistent results. This was a surprise to me. |
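For the earlier question about what such a parallel-import installer step might look like, a sketch (the module list is illustrative) - one subprocess per module, so first-run byte-compilation and disk caching happen at install time:
```
import subprocess
import sys

MODULES = ["pandas", "numpy", "scipy", "sklearn", "matplotlib.pyplot"]

# Warm each module in its own interpreter: the first import triggers
# byte-compilation of .pyc files and pulls the package into the OS cache.
procs = [
    subprocess.Popen([sys.executable, "-c", f"import {m}"])
    for m in MODULES
]
for p in procs:
    p.wait()
```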
The following test demonstrates the problem: the contents of testme.py is literally `import pandas`; however, it takes almost 6 seconds to import pandas on my Lenovo T60.