PERF: pandas import is too slow #7282

Closed
mpenning opened this issue May 30, 2014 · 53 comments
Labels
Performance Memory or execution speed performance

Comments

@mpenning

The following test demonstrates the problem... the contents of testme.py are literally import pandas; however, it takes almost 6 seconds to import pandas on my Lenovo T60.

[mpenning@Mudslide panex]$ time python testme.py 

real    0m5.759s
user    0m5.612s
sys 0m0.120s
[mpenning@Mudslide panex]$
[mpenning@Mudslide panex]$ uname -a
Linux Mudslide 3.2.0-4-686-pae #1 SMP Debian 3.2.57-3+deb7u1 i686 GNU/Linux
[mpenning@Mudslide panex]$ python -V
Python 2.7.3
[mpenning@Mudslide panex]$
[mpenning@Mudslide panex]$ pip freeze
Babel==1.3
Cython==0.20.1
Flask==0.10.1
Flask-Babel==0.8
Flask-Login==0.2.7
Flask-Mail==0.7.6
Flask-OpenID==1.1.1
Flask-SQLAlchemy==0.16
Flask-WTF==0.8.4
Flask-WhooshAlchemy==0.54a
Jinja2==2.7.1
MarkupSafe==0.18
Pygments==1.6
SQLAlchemy==0.7.9
Sphinx==1.2.2
Tempita==0.5.1
WTForms==1.0.5
Werkzeug==0.9.4
Whoosh==2.5.4
argparse==1.2.1
backports.ssl-match-hostname==3.4.0.2
blinker==1.3
ciscoconfparse==1.1.1
decorator==3.4.0
docutils==0.11
dulwich==0.9.6
## FIXME: could not find svn URL in dependency_links for this package:
flup==1.0.3.dev-20110405
hg-git==0.5.0
ipaddr==2.1.11
itsdangerous==0.23
matplotlib==1.3.1
mercurial==3.0
mock==1.0.1
nose==1.3.3
numexpr==2.4
numpy==1.8.1
numpydoc==0.4
pandas==0.13.1
pyparsing==2.0.2
python-dateutil==2.2
python-openid==2.2.5
pytz==2013b
six==1.6.1
speaklater==1.3
sqlalchemy-migrate==0.7.2
tables==3.1.1
tornado==3.2.1
wsgiref==0.1.2
@jreback
Contributor

jreback commented May 30, 2014

sounds a bit odd, you might have a path issue. do you have multiple pythons/environments installed? does importing numpy take the same amount of time?

import pandas
pandas.show_versions()
time python testme.py
0.252u 0.076s 0:00.33 96.9%     0+0k 0+8io 1pf+0w
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-5-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.0rc1-43-g0dec048
nose: 1.3.0
Cython: 0.20
numpy: 1.8.1
scipy: 0.12.0
statsmodels: 0.5.0
IPython: 2.0.0
sphinx: 1.1.3
patsy: 0.1.0
scikits.timeseries: None
dateutil: 1.5
pytz: 2013b
bottleneck: 0.6.0
tables: 3.0.0
numexpr: 2.4
matplotlib: None
openpyxl: 1.5.7
xlrd: 0.9.0
xlwt: 0.7.4
xlsxwriter: None
lxml: 2.3.4
bs4: 4.1.3
html5lib: None
bq: None
apiclient: None
rpy2: None
sqlalchemy: 0.7.7
pymysql: None
psycopg2: 2.4.5 (dt dec pq3 ext)

@mpenning
Author

numpy doesn't seem to have this issue...

[mpenning@Mudslide pymtr]$ time python -c 'import numpy'

real    0m0.184s
user    0m0.136s
sys 0m0.048s
[mpenning@Mudslide pymtr]$ time python -c 'import pandas'

real    0m5.724s
user    0m5.516s
sys 0m0.188s
[mpenning@Mudslide pymtr]$

@jreback
Contributor

jreback commented May 30, 2014

no idea; why don't you try in a virtualenv with only pandas deps installed

@jreback
Contributor

jreback commented May 30, 2014

are you loading this over a network? try to install locally, print out pd.__file__ to be sure
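To check where pandas would be loaded from without paying the import cost, a stdlib-only sketch: importlib.util.find_spec locates the package on sys.path without executing it.

```python
import importlib.util

spec = importlib.util.find_spec("pandas")
if spec is None:
    print("pandas not found on sys.path")
else:
    # A path on a network mount (e.g. under /mnt/...) suggests the slow
    # import is filesystem latency rather than pandas itself.
    print(spec.origin)
```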

@jreback
Contributor

jreback commented Jul 7, 2014

closing as not a bug.

@jreback jreback closed this as completed Jul 7, 2014
@steve3141

I have the same problem. Was this closed because you found a solution? I'd be grateful if you could share it. Thanks.

@jtratner
Contributor

@steve3141 - have you tried creating a pristine virtualenv and seeing if that helps?

@steve3141

Afraid I can't; work, lockdown, etc. So I realize this is very likely not the fault of pandas, except insofar as "import pandas" executes an enormous number -- over 500 by my count -- of secondary import statements. Filesystem overhead.

Thanks,
Steve
 

@rockg
Contributor

rockg commented Sep 29, 2014

I know this has been closed for awhile but I'm seeing the same thing and it is not pandas specific. We have our pandas environment in a virtualenv on a drive on a server. That drive is then mounted by each client. This allows us to maintain a sane package environment among all users. However, this is clearly sacrificing startup time to an unreasonable extent. The import times in seconds are as follows:

Package   Server   Client
pandas    1.22     6.23
numpy     0.2      1.2

So clearly this is a setup issue, but how do other companies deal with this problem? I find it hard to believe that packages are installed locally on every user's box and if that isn't the case, that they experience these long startup times.

The network itself is working fine...transfer speeds are ~120MB/s.
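For comparing server vs client numbers like the table above, a rough in-process timing helper (a sketch; the function name is mine). Note this only approximates a cold import, so fresh interpreters per module give the real numbers.

```python
import importlib
import sys
import time


def import_seconds(name):
    """Time one import of `name`. Popping the top-level module does not
    evict submodules or the OS file cache, so this is only an approximation
    of a cold import; run a fresh interpreter per module for true numbers."""
    sys.modules.pop(name, None)
    t0 = time.perf_counter()
    importlib.import_module(name)
    return time.perf_counter() - t0


if __name__ == "__main__":
    for mod in ("numpy", "pandas"):
        try:
            print(f"{mod}: {import_seconds(mod):.3f}s")
        except ImportError:
            print(f"{mod}: not installed")
```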

@jtratner
Contributor

@rockg - dunno about every corporation, but certainly all of the installations I've worked with have had everything locally. Conda and tox can make it much easier to have local installs.

@ndc33

ndc33 commented Feb 12, 2015

I have the same problem -> 6 s import time, local install (Anaconda, pandas 0.14.1). This is impossibly slow, especially when trying to import in multiple processes.

@Rufflewind
Contributor

Rufflewind commented Apr 25, 2016

Same problem (pandas 0.18), although mine is not as awful: 400 ms just to import pandas on a local SSD. I can't imagine how bad this would be for someone using, say, a networked filesystem.

@szs8

szs8 commented May 24, 2016

+1. I see anywhere between 400 and 700 ms.

@tacaswell
Contributor

Try removing the mpl font caches. Or, if you are in such a locked-down environment that you cannot write the caches, this might be mpl searching your system for fonts every time it is imported.

@brycepg
Contributor

brycepg commented Jun 7, 2016

(in Python 3 / pandas 1.6.2 via Anaconda)
In IPython, clearing the matplotlib cache:

import shutil; import matplotlib
shutil.rmtree(matplotlib.get_cachedir())

---- restart ipython ----

%timeit -n1 -r1 import pandas

381 ms on linux
748 ms on windows
(it didn't do anything)

Importing pandas from IPython (300 ms) is faster than running it from plain python (500 ms)

Importing some sub-dependencies speeds up importing pandas

%timeit -n1 -r1 import pandas
375ms

--- restart ipython -----

In [1]: %timeit -n1 -r1 import numpy
1 loops, best of 1: 87.8 ms per loop

In [2]: %timeit -n1 -r1 import pytz
1 loops, best of 1: 157 ms per loop

In [3]: %timeit -n1 -r1 import dateutil
1 loops, best of 1: 1.51 ms per loop

In [4]: %timeit -n1 -r1 import matplotlib
1 loops, best of 1: 54 ms per loop

In [5]: %timeit -n1 -r1 import xlsxwriter
1 loops, best of 1: 47.8 ms per loop

In [6]: %timeit -n1 -r1 import pandas
1 loops, best of 1: 177 ms per loop

It looks like pytz is particularly slow

Getting all the modules from pandas

I uninstalled matplotlib, xlsxwriter, and cython and imported pandas' sub-imports before pandas (as seen via sys.modules.keys()). The import time of pandas (running this script via the interpreter) was 100 ms after all the dependent imports instead of 500 ms:

import __future__
import __main__
import _ast
import _bisect
import _bootlocale
import _bz2
import _codecs
import _collections
import _collections_abc
import _compat_pickle
import _csv
import _ctypes
import _datetime
import _decimal
import _frozen_importlib
import _functools
import _hashlib
import _heapq
import _imp
import _io
import _json
import _locale
import _lzma
import _opcode
import _operator
import _pickle
import _posixsubprocess
import _random
import _sitebuiltins
import _socket
import _sre
import _ssl
import _stat
import _string
import _struct
import _sysconfigdata
import _thread
import _warnings
import _weakref
import _weakrefset
import abc
import argparse
import ast
import atexit
import base64
import binascii
import bisect
import builtins
import bz2
import calendar
import codecs
import collections
import contextlib
import copy
import copyreg
import csv
import ctypes
import datetime
import dateutil
import decimal
import difflib
import dis
import distutils
import email
import encodings
import enum
import errno
import fnmatch
import functools
import gc
import genericpath
import gettext
import grp
import hashlib
import heapq
import http
import importlib
import inspect
import io
import itertools
import json
import keyword
import linecache
import locale
import logging
import lzma
import marshal
import math
import numbers
import numexpr
import numpy
import opcode
import operator
import os
import parser
import pickle
import pkg_resources
import pkgutil
import platform
import plistlib
import posix
import posixpath
import pprint
import pwd
import pyexpat
import pytz
import quopri
import random
import re
import reprlib
import select
import selectors
import shutil
import signal
import site
import six
import socket
import sre_compile
import sre_constants
import sre_parse
import ssl
import stat
import string
import struct
import subprocess
import symbol
import sys
import sysconfig
import tarfile
import tempfile
import textwrap
import threading
import time
import timeit
import token
import tokenize
import traceback
import types
import unittest
import urllib
import uu
import uuid
import warnings
import weakref
import xml
import zipfile
import zipimport
import zlib

print(timeit.timeit('import pandas', number=1))

A workaround may be to run these imports ahead of time, before you need pandas

I'm getting similar results with no anaconda / python2 / pandas 1.8
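The pre-importing idea above can be sketched as a background warm-up thread, so interactive startup isn't blocked while dependencies load (the module list here is illustrative, not an exact list of pandas dependencies):

```python
import threading

# Illustrative list: pre-import heavy pandas dependencies in the background
# so a later "import pandas" mostly hits the already-populated sys.modules.
WARM_MODULES = ["numpy", "pytz", "dateutil", "six"]


def _warm():
    for name in WARM_MODULES:
        try:
            __import__(name)
        except ImportError:
            pass  # optional dependency not installed


warmer = threading.Thread(target=_warm, daemon=True)
warmer.start()
# ... do unrelated startup work here ...
warmer.join()  # ensure warming finished before the first real use
```

Since Python 3.3, imports take per-module locks, so importing from a thread like this is safe.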

@jacobSingh

Similar issue for me. It makes development in Flask unbearable, since it takes 10 s to reload after every file change. I debugged it, and a pandas import time of 3-10 seconds is the main culprit (2015 MBA running Anaconda on 3.5).

There is some caching happening, but not sure what...

python -m timeit -n1 -r1 "import pandas"
1 loops, best of 1: 3.71 sec per loop
(abg) jacob@Jacobs-Air:~/stuff/abg% python -m timeit -n1 -r1 "import pandas"
1 loops, best of 1: 652 msec per loop

@brycepg
Contributor

brycepg commented Jan 7, 2017

One workaround is to isolate all the code that interacts with pandas and lazily import that code only when you need it, so that the wait happens during program execution rather than at startup. (That's what I do.)
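That workaround can be sketched as a small lazy-import helper (lazy_import and summarize are illustrative names of mine, not a pandas API):

```python
import importlib

_modules = {}


def lazy_import(name):
    """Import a module on first use and cache it, so program startup stays fast."""
    if name not in _modules:
        _modules[name] = importlib.import_module(name)
    return _modules[name]


def summarize(csv_path):
    # The pandas import cost is paid on the first call, not at startup.
    pd = lazy_import("pandas")
    return pd.read_csv(csv_path).describe()
```

The simplest form of the same idea is a plain function-local `import pandas as pd` inside the few functions that need it.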


@grantstephens

I'm having a similar issue. Running on OS X; it does the same inside and outside a virtualenv. I tried reinstalling everything and that didn't help. It doesn't seem to be matplotlib, as that is relatively fast on its own. Very tricky to troubleshoot; nothing shows up in the logs.

@rockg
Contributor

rockg commented Jan 10, 2017

Can somebody please profile a simple "import pandas" and we can see if the problem is easily identified?

@grantstephens

So I did a quick profile and found the following:

         93778 function calls (91484 primitive calls) in 4.278 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        6    0.426    0.071    0.905    0.151 api.py:3(<module>)
        1    0.306    0.306    4.276    4.276 __init__.py:5(<module>)
        1    0.189    0.189    0.211    0.211 base.py:1(<module>)
        1    0.163    0.163    0.426    0.426 api.py:1(<module>)
        1    0.129    0.129    1.170    1.170 format.py:5(<module>)
        2    0.121    0.061    0.197    0.099 base.py:3(<module>)
       20    0.120    0.006    0.390    0.019 __init__.py:1(<module>)
        3    0.119    0.040    0.178    0.059 common.py:1(<module>)
        1    0.115    0.115    0.572    0.572 __init__.py:26(<module>)
        1    0.112    0.112    0.569    0.569 frame.py:10(<module>)
        1    0.111    0.111    0.214    0.214 httplib.py:67(<module>)
        2    0.103    0.051    0.630    0.315 index.py:2(<module>)
        1    0.089    0.089    0.144    0.144 parser.py:29(<module>)
        1    0.078    0.078    0.084    0.084 excel.py:3(<module>)
        1    0.074    0.074    0.840    0.840 api.py:5(<module>)
        1    0.072    0.072    0.091    0.091 sparse.py:4(<module>)
        1    0.070    0.070    0.149    0.149 gbq.py:1(<module>)
        1    0.070    0.070    0.650    0.650 groupby.py:1(<module>)
        1    0.068    0.068    0.138    0.138 generic.py:2(<module>)
        1    0.063    0.063    1.265    1.265 config_init.py:11(<module>)
        1    0.060    0.060    0.060    0.060 socket.py:45(<module>)
        1    0.055    0.055    0.145    0.145 eval.py:4(<module>)
        1    0.054    0.054    0.075    0.075 expr.py:2(<module>)
        2    0.052    0.026    0.069    0.035 __init__.py:9(<module>)
        1    0.052    0.052    0.054    0.054 pytables.py:4(<module>)
        1    0.052    0.052    0.165    0.165 series.py:3(<module>)

Seems like the init at line 5 is taking most of the time- is this the main init of pandas?

@jreback
Contributor

jreback commented Jan 10, 2017

just for comparison on osx.

# 2.7
bash-3.2$ ~/miniconda3/envs/py2.7/bin/python -m timeit -n1 -r1 "import numpy"
1 loops, best of 1: 287 msec per loop
bash-3.2$ ~/miniconda3/envs/py2.7/bin/python -m timeit -n1 -r1 "import pandas"
1 loops, best of 1: 671 msec per loop

# 3.5
bash-3.2$ ~/miniconda3/envs/pandas/bin/python -m timeit -n1 -r1 "import numpy"
1 loops, best of 1: 168 msec per loop
bash-3.2$ ~/miniconda3/envs/pandas/bin/python -m timeit -n1 -r1 "import pandas"
1 loops, best of 1: 494 msec per loop


@jreback
Contributor

jreback commented Jan 10, 2017

not sure what you think is cached

@rockg
Contributor

rockg commented Jan 10, 2017

@RexFuzzle I'm surprised you don't have any long file names. Did you strip the directories? You should be seeing something like the below, which will make it easier to see what is taking the majority of the time. I think it comes down to pandas importing a lot of dependencies, each of which has its own hit.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.296    0.296    4.990    4.990 /mnt/environment/software/python/lib/python2.7/site-packages/pandas/__init__.py:3(<module>)
        1    0.198    0.198    0.331    0.331 /mnt/environment/software/python/lib/python2.7/site-packages/numpy/core/__init__.py:1(<module>)
        1    0.165    0.165    0.248    0.248 /mnt/environment/software/python/lib/python2.7/site-packages/bottleneck/__init__.py:3(<module>)
        1    0.154    0.154    0.164    0.164 /mnt/environment/software/python/lib/python2.7/site-packages/bs4/dammit.py:8(<module>)
        1    0.134    0.134    0.164    0.164 /mnt/environment/software/python/lib/python2.7/site-packages/pandas/core/common.py:3(<module>)

@grantstephens

Hmmm, that is strange: I didn't strip anything; I was using cProfile, so I don't know if that could have caused it. Will investigate a bit further tomorrow. From my results, though, it certainly seems like it is just the one init that is taking all the time. I will try to get mine in the same format as yours so we can compare and see if it is the same init file and line number.

@rockg
Contributor

rockg commented Jan 10, 2017

Save out the cProfile data to a file, then load it with pstats and print. If it is a specific module, run the line profiler to see whether it is anything specific or just a lot of small things.

import cProfile
import pstats
cProfile.run("import pandas", "pandasImport")
p = pstats.Stats("pandasImport")
p.sort_stats("tottime").print_stats()


@rockg
Contributor

rockg commented Jan 11, 2017

All right, let's go one step further and do a line profile of pandas.__init__. You can do this by using the line_profiler.

@jorisvandenbossche
Member

Maybe you could also give https://github.com/cournape/import-profiler a try

But looking at the above values: although the pandas import time is much larger, numpy also takes much longer. The ratio of numpy import to full pandas import seems about the same as for the much smaller numbers @jreback posted (or that I also see). So if numpy is already taking more than 4 seconds to import, we are of course not going to get the pandas import time below that.

@grantstephens

Thanks for all the input. I ran dtruss in the meantime and found that nothing happens for a few seconds before anything shows up there, so I'm thinking there is a lag on disk reads rather than a Python problem. To me this is reinforced by the fact that the time seems to be grouped with the first line of the init file (an artifact from cProfile?). Will do a bit more digging. I also agree that it seems to be more of a numpy problem and will look through their issues to see if anybody else has something similar.
Thanks again for the input.

@jorisvandenbossche
Member

Also agree that it seems to be more a numpy problem

Sorry, that is not what I wanted to say. I just meant that both numpy and pandas seem to take longer (compared to my laptop, both 10x to 15x longer), so it is not necessarily possible to pinpoint a certain import as the culprit. It just seems generally slower. Which does not mean, of course, that we might not do some more lazy imports in pandas to improve things, if there are bottlenecks.

@melroy89

Please, do not ignore this issue. It's closed, but I also see problems with a long import duration. Maybe it should be picked up again: create awareness about this issue and raise its priority? Otherwise it is not good for the popularity of pandas.

@grantstephens

I'm willing and able to do more testing, but I don't know of any other profiling tests I can run to try to find the source, so I am open to suggestions.

@dafer660

Greetings,

When using pandas with not-so-big datasets, it would take at least 5 to 10 seconds to parse all the data and plot, which is quite a long time.
So, the steps that led me to the slow execution of pandas in PyCharm were:

  1. Anaconda installation for all users
  2. Python 3.6.1 installation

So, since it was an abnormal amount of time for so little code, I decided to uninstall both Anaconda and Python 3.6.1 and take some extra steps:

  1. Install visualcppbuildtoolsfull (which can be found here: http://landinghub.visualstudio.com/visual-cpp-build-tools)
  2. Python 3.6.1 installation
  3. Anaconda installation for all users
  4. PyCharm Default Settings > Project Interpreter > select the correct one (generic Python 3.6.1 or Anaconda) in order to query through all the packages.
  5. (Optional) I suggest doing step 4 for every path that PyCharm detects.

Now code execution is faster (much faster than before).
I hope it helps someone.

@jreback added the Performance label Oct 2, 2017
@jason-s

jason-s commented Oct 23, 2017

I just ran the same as rockg suggested but sorted by cumtime, not tottime, which immediately points out that the pytz module takes half of the total import time (on my PC). Is there any way to make this optional or lazy? I rarely use datetimes, and when I do, they are almost always UTC, so I have very little interest in timezones.

Same with the pandas.plotting module -- I have an application which doesn't do any plotting, so it stinks that it adds a significant time to my importing with no benefit. It seems like it would make sense to make this lazy, since matplotlib takes a long time anyway and 0.15s extra isn't noticeable.


import cProfile
import pstats
cProfile.run("import pandas", "pandasImport")
p = pstats.Stats("pandasImport")
p.sort_stats("cumtime").print_stats()

which prints (stuff below 0.1 second elided)

Mon Oct 23 14:01:19 2017    pandasImport

         204659 function calls (202288 primitive calls) in 1.875 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.042    0.042    1.876    1.876 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\__init__.py:5(<module>)
   321/44    0.041    0.000    1.156    0.026 {__import__}
        1    0.008    0.008    0.925    0.925 c:\app\python\anaconda\1.6.0\lib\site-packages\pytz\__init__.py:9(<module>)
        1    0.002    0.002    0.914    0.914 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:14(<module>)
        1    0.000    0.000    0.651    0.651 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:704(subscribe)
      217    0.000    0.000    0.650    0.003 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:2870(<lambda>)
      217    0.001    0.000    0.650    0.003 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:2299(activate)
      427    0.002    0.000    0.602    0.001 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:1845(_handle_ns)
      217    0.001    0.000    0.586    0.003 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:1898(fixup_namespace_packages)
      411    0.003    0.000    0.581    0.001 c:\app\python\anaconda\1.6.0\lib\pkgutil.py:176(find_module)
      411    0.571    0.001    0.571    0.001 {imp.find_module}
        1    0.011    0.011    0.423    0.423 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\core\api.py:5(<module>)
        1    0.007    0.007    0.352    0.352 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\core\groupby.py:1(<module>)
       40    0.001    0.000    0.248    0.006 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:444(add_entry)
      472    0.005    0.000    0.236    0.001 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:1779(find_on_path)
        1    0.005    0.005    0.231    0.231 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\core\frame.py:10(<module>)
      472    0.188    0.000    0.188    0.000 {nt._isdir}
      476    0.002    0.000    0.187    0.000 {map}
        1    0.023    0.023    0.173    0.173 c:\app\python\anaconda\1.6.0\lib\site-packages\numpy\__init__.py:106(<module>)
        1    0.003    0.003    0.157    0.157 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\core\series.py:3(<module>)
        1    0.005    0.005    0.142    0.142 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\plotting\__init__.py:3(<module>)
        1    0.008    0.008    0.132    0.132 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\plotting\_converter.py:1(<module>)
        1    0.000    0.000    0.127    0.127 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:430(__init__)
        1    0.003    0.003    0.119    0.119 c:\app\python\anaconda\1.6.0\lib\site-packages\numpy\add_newdocs.py:10(<module>)
        1    0.019    0.019    0.115    0.115 c:\app\python\anaconda\1.6.0\lib\site-packages\numpy\lib\__init__.py:1(<module>)
        1    0.002    0.002    0.109    0.109 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\util\_tester.py:3(<module>)
        1    0.015    0.015    0.107    0.107 c:\app\python\anaconda\1.6.0\lib\site-packages\pytest.py:4(<module>)
        1    0.005    0.005    0.102    0.102 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\core\index.py:2(<module>)

@jason-s

jason-s commented Oct 23, 2017

FYI -- I have an SSD on my PC so if there is a disk seek issue that some people have, I don't see it. numpy 1.12 takes 0.17 seconds to import.

@TomAugspurger
Contributor

@jason-s pytz imports in <5 microseconds on my machine, so something is strange there.

FYI #17710 did some work on this, so things should be quicker in the upcoming release (nothing touching pytz though).

@jason-s

jason-s commented Oct 23, 2017

I'm using pandas 0.20.2 with pytz 2016.4 on a Windows 7 machine running Anaconda Python 2.7

@jason-s

jason-s commented Oct 23, 2017

I just ran conda uninstall pytz and reinstalled it, it now takes 0.01 second with pytz-2017.2

Reinstalled pytz 2016.4 (conda install pytz=2016.4) and it slowed back down to 0.92 seconds again

Installed pytz 2016.7 -- it is also very fast (13 milliseconds to import). There is an item in the profile data called "lazy.py", which suggests they converted to "lazy" loading in 2016.7.

import cProfile
import pstats
cProfile.run("import pytz", "profiling_data")
p = pstats.Stats("profiling_data")
p.sort_stats("cumtime").print_stats()

which prints this for pytz 2016.7:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.005    0.005    0.018    0.018 <string>:1(<module>)
        1    0.008    0.008    0.013    0.013 c:\app\python\anaconda\1.6.0\lib\site-packages\pytz\__init__.py:9(<module>)
        2    0.002    0.001    0.002    0.001 c:\app\python\anaconda\1.6.0\lib\site-packages\pytz\lazy.py:135(__new__)
        1    0.002    0.002    0.002    0.002 c:\app\python\anaconda\1.6.0\lib\site-packages\pytz\tzinfo.py:1(<module>)
        1    0.000    0.000    0.001    0.001 c:\app\python\anaconda\1.6.0\lib\site-packages\pytz\lazy.py:1(<module>)
        1    0.000    0.000    0.000    0.000 c:\app\python\anaconda\1.6.0\lib\site-packages\pytz\tzfile.py:4(<module>)
        2    0.000    0.000    0.000    0.000 c:\app\python\anaconda\1.6.0\lib\site-packages\pytz\lazy.py:80(__new__)
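The lazy-initialization idea suggested by pytz's lazy.py can be sketched with a cached property that defers the expensive work until first access (the class and data here are illustrative stand-ins, not pytz's actual code):

```python
from functools import cached_property  # Python 3.8+


class TimezoneDB:
    """Illustrative stand-in: defer building an expensive table until
    something actually asks for it, so a plain import stays cheap."""

    @cached_property
    def all_names(self):
        # Imagine reading hundreds of zoneinfo files from disk here.
        return sorted(["UTC", "US/Eastern", "Europe/London"])


db = TimezoneDB()           # cheap: nothing built yet
names = db.all_names        # expensive work happens on first access
names_again = db.all_names  # cached: no recomputation
```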

@jason-s

jason-s commented Oct 23, 2017

Hmm. Unfortunately, switching to pytz 2017.2 (or 2016.7) doesn't seem to speed up the pandas import; it looks like either there are a lot of shared dependencies between the two, or the pandas __init__ process uses pytz and negates the speed advantage that pytz gains from lazy initialization.

Oh, here we go: both are using pkg_resources.py, which takes about 0.9 s on my PC to execute whatever it is doing, whether it's imported from pytz or pandas.

I had setuptools 27.2 (which includes pkg_resources); this seems to be related to this issue pypa/setuptools#926

@jason-s

jason-s commented Oct 23, 2017

OK, I used ripgrep in my site-packages to look for pkg_resources, and the culprits are pytz (which now uses it lazily) and numexpr.

I filed an issue with numexpr.

Is numexpr imported lazily in pandas in the upcoming release? That's another area where a feature I never use (at least, I think I never use it) slows down the pandas import significantly.

edit: never mind, you already know about this:

#17710 (comment)

@nschloe
Contributor

nschloe commented Jul 1, 2018

For reference, here's an import profile using Python 3.7's importtime and tuna:

python3.7 -X importtime -c "import pandas" 2> pandas.log
tuna pandas.log

[screenshot: tuna visualization of the pandas import profile]

@miazoin

miazoin commented Oct 16, 2018

Our solution is to set up a web server and send POST requests to the algorithm part, so the time to import the pandas package is paid only once.

@hosamn

hosamn commented Nov 28, 2018

having the same issue here

python -m timeit -n1 -r1 "import pandas"
1 loops, best of 1: 8.56 sec per loop


@Gnomic20

So I think I may have found the issue. Over 50% of my time is in a single function call: mkl._py_mkl_service.get_version

pandasImport

         187472 function calls (181157 primitive calls) in 4.406 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.295    2.295    2.295    2.295 {built-in method mkl._py_mkl_service.get_version}

Code

import cProfile
import pstats
cProfile.run("import pandas", "pandasImport")
p = pstats.Stats("pandasImport")
p.sort_stats("tottime").print_stats()

pandas.show_versions()

INSTALLED VERSIONS
------------------
commit           : f2ca0a2665b2d169c97de87b8e778dbed86aea07
python           : 3.7.4.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.18362
machine          : AMD64
processor        : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.None

pandas           : 1.1.1
numpy            : 1.19.1
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.2.2
setuptools       : 49.6.0.post20200814
Cython           : 0.29.21
pytest           : 6.0.2
hypothesis       : 5.35.3
sphinx           : 2.2.0
blosc            : None
feather          : None
xlsxwriter       : 1.3.3
lxml.etree       : 4.5.2
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.18.1
pandas_datareader: None
bs4              : 4.9.1
bottleneck       : 1.3.2
fsspec           : 0.8.0
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.1
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.5
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.5.2
sqlalchemy       : 1.3.19
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : 1.2.0
xlwt             : 1.3.0
numba            : 0.51.2

@cameronkerrnz

For future reference (since Google led me here), an easy way to figure out where the import time is being spent is to use -X importtime. Here I'm illustrating 'import matplotlib.pyplot as plt', which is kinda slow for me at the moment, and filtering just those particular imports that take longer than 0.1 seconds:

(venv) time python -X importtime -c 'import matplotlib.pyplot as plt' 2>&1 | awk '$3 > 100000'
import time: self [us] | cumulative | imported package
import time:    109578 |     112835 |           pyparsing.helpers
import time:    179964 |     229597 |     matplotlib.collections
import time:    182284 |     182284 |           matplotlib.patches

real	0m2.047s
user	0m1.924s
sys	0m0.268s

@Jalagarto

Jalagarto commented Jul 28, 2022

solved using conda:
stackoverflow - Running quicker for Numpy and Pandas (installed via conda) than via pip?

It upgraded python from 3.8.10 to 3.10.4 and installed pandas v. 1.4.2 instead of 1.4.3.

It is now 10 times faster

@keithpjolley

I'm wondering if this has something to do with it and why people aren't seeing the same results. I import a handful of modules in a notebook and the first time is painful - on the order of a minute (sklearn is the worst, pandas second longest). Any subsequent load is fast.

. /tmp/venv/bin/activate

❯ python -m timeit -n1 -r1 "import pandas"
1 loop, best of 1: 289 msec per loop         <<< ~0.3 seconds

❯ pip uninstall pandas
... Successfully uninstalled pandas-2.2.3

❯ pip install pandas
... Successfully installed pandas-2.2.3

❯ python -m timeit -n1 -r1 "import pandas"
1 loop, best of 1: 18 sec per loop           <<< 18 seconds

❯ python -m timeit -n1 -r1 "import pandas"
1 loop, best of 1: 389 msec per loop         <<< ~0.4 seconds

❯ python --version
Python 3.13.0

I ended up writing a step in my notebook's installer that imports each library (in parallel) so that nobody thinks my notebook is slow or hung the first time it's run.


@keithpjolley

keithpjolley commented Nov 28, 2024

More data.

In each of these examples I'm leaving out the destruction and recreation of the virtual environment, to make sure every run starts from the same place. I disagree that one shouldn't be worried about 18 s. Multiply that out by how many times this occurs every day and it adds up quickly: 500k imports/day would be ~104 CPU-days per day.

Anyways, this notebook has a couple of imports:

import pandas as pd
import numpy as np
import scipy
import networkx as nx
import matplotlib.pyplot as plt
import heapq
import colorsys
from sklearn.preprocessing import minmax_scale
import json
from IPython.display import display, HTML
from operator import add
import graphviz
from copy import deepcopy
from matrepr import mprint

If I install the requirements.txt and do nothing else:

❯ awk '/import/{print $2}' graphs.ipynb | sed 's/\\.*//' | paste -s -d , - | xargs -I% python -m timeit -n1 -r1 "import %"
1 loop, best of 1: 43.1 sec per loop

❯ awk '/import/{print $2}' graphs.ipynb | sed 's/\\.*//' | paste -s -d , - | xargs -I% python -m timeit -n1 -r1 "import %"
1 loop, best of 1: 1.02 sec per loop

My first instinct was to do exactly what you suggested. My installer ran compileall (here on a fresh venv).

python -m compileall -j0
...
Listing '/private/tmp/venv/lib/python3.13/site-packages'...
Compiling '/private/tmp/venv/lib/python3.13/site-packages/decorator.py'...
Compiling '/private/tmp/venv/lib/python3.13/site-packages/ipykernel_launcher.py'...
...

Alas, no pandas in that output so I forced the issue and saw that compiling did nothing to improve the load times.

❯ find /tmp/venv -type f -name \*.py | xargs python -m compileall -j0
...
❯ awk '/import/{print $2}' graphs.ipynb | sed 's/\\.*//' | paste -s -d , - | xargs -I% python -m timeit -n1 -r1 "import %"
1 loop, best of 1: 43 sec per loop

To load each module in parallel (after recreating my venv):

❯ awk '/import/{print $2}' graphs.ipynb | sed 's/\\.*//' | time parallel python -c '"import {}"'
parallel python -c '"import {}"'  2.75s user 1.12s system 13% cpu 29.543 total

Which saves ~13 seconds, the long pole in this particular tent being sklearn. The advantage here is that this is done at install time, not when the user is opening the notebook for the first time and then has to wait ~43 seconds before anything happens.
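The same install-time warm-up can be done from Python alone, without GNU parallel, by importing each module in its own subprocess (the module list is illustrative):

```python
import subprocess
import sys

# Illustrative module list: importing each one in a separate process at
# install time warms the OS file cache, so the user's first import is fast.
MODULES = ["pandas", "numpy", "sklearn", "matplotlib"]

procs = [
    subprocess.Popen([sys.executable, "-c", f"import {name}"])
    for name in MODULES
]
for proc in procs:
    proc.wait()  # a nonzero returncode just means that module is missing
```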

But, now, out of curiosity, I redid this entire process on my old laptop and came up with a different outcome. Using the exact same inputs and destroying/recreating the environment I cannot reproduce the above results. The load time is the same (fast) if I go through the above hoops or not. Here from a fresh venv:

❯ awk '/import/{print $2}' graphs.ipynb | sed 's/\\.*//' | paste -s -d , - | xargs -I% python -m timeit -n1 -r1 "import %"
1 loop, best of 1: 1.17 sec per loop

The difference? All but the last run were done on a 2023 MacBook Pro with an M3 Max/36 GB (arm64e) CPU running macOS 15.1.1 with Python 3.13. The last run was done on an old Lenovo laptop running Ubuntu 24.04.1 with an Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz (x86_64) and Python 3.12.3.

Another reason people are seeing inconsistent results. This was a surprise to me.
