Importing pkg_resources is slow. Work is done at import time. #926
Comments
Updating this bug: I traced a very slow `import OpenSSL` on our system down to this issue. We have a pretty large sys.path with lots of user-generated third-party libraries; the import takes more than 0.8 s (800 ms), and we aim to run scripts in under a second, which is currently impossible:
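Not part of the original comment, but a minimal way to reproduce the numbers discussed in this thread is to time the import itself. `python -X importtime -c "import pkg_resources"` (Python 3.7+) gives a per-module breakdown; a rough wall-clock measurement looks like this:

```python
# Rough timing of a cold "import pkg_resources". Run as a script in a fresh
# interpreter, not in a REPL where the module may already be in sys.modules.
import time

start = time.perf_counter()
import pkg_resources  # noqa: E402,F401  -- the import under test
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"import pkg_resources took {elapsed_ms:.0f} ms")
```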
I'm seeing ~200 msec for a click CLI app I'm developing (printing the usage):
Ran into this and poked at it briefly. It looks like a significant portion of the cost is related to the sorting in https://github.com/pypa/setuptools/blob/master/pkg_resources/__init__.py#L1977-L2000. Notice the number of calls to `_by_version` and the total cumulative time in the attached file, pkg_resources_import_profile.txt, which was generated by profiling the import.

The other big cost appears to be https://github.com/pypa/setuptools/blob/master/pkg_resources/__init__.py#L640-L658. I don't know enough about why this state is generated on import, but it seems like the sorting, or even reading through the entire list of available packages, could be deferred until the information is actually needed (also, why is sorting important at all?).

And finally, `pip list` reports
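For anyone wanting to regenerate a profile like the attached one, a minimal sketch (not necessarily the exact command used for that attachment) is to profile the import with the standard-library profiler in a fresh interpreter:

```python
# Profile "import pkg_resources" and sort by cumulative time. In the output,
# look for _by_version (the sorting cost) and _initialize_master_working_set.
# Must run in a process where pkg_resources has not been imported yet.
import cProfile

cProfile.run("import pkg_resources", sort="cumulative")
```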
Duplicate of #510?
Yes, thanks.
This is not beautiful but closes #665 https://github.com/ninjaaron/fast-entry_points pypa/setuptools#510 pypa/setuptools#926
Speed up `import ligo.skymap` by up to a second by replacing uses of `pkg_resources` with the new Python standard library module `importlib.resources` (or, for Python < 3.7, the backport `importlib_resources`). The old `pkg_resources` module is known to be slow because it does a lot of work on startup. See, for example, [pypa/setuptools#926](pypa/setuptools#926) and [pypa/setuptools#510](pypa/setuptools#510).
Speed up imports by up to a second by replacing uses of `pkg_resources` with the new Python standard library module `importlib.resources` (or, for Python < 3.7, the backport `importlib_resources`). The old `pkg_resources` module is known to be slow because it does a lot of work on startup. See, for example, [pypa/setuptools#926](pypa/setuptools#926) and [pypa/setuptools#510](pypa/setuptools#510).
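The commit messages above all describe the same mechanical replacement. A minimal sketch of it follows; the package name `mypackage` and resource name `config.json` are placeholders, not files from ligo.skymap or any project mentioned here:

```python
import sys

if sys.version_info >= (3, 7):
    from importlib import resources
else:
    import importlib_resources as resources  # PyPI backport for Python < 3.7

# Before (pulls in all of pkg_resources on first import):
#   import pkg_resources
#   text = pkg_resources.resource_string("mypackage", "config.json").decode()

# After (no pkg_resources import at all; legacy importlib.resources API):
text = resources.read_text("mypackage", "config.json")
```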
Providing __version__ attribute is a reasonably common convention among packages in the Python ecosystem. Currently the only other reliable alternative is to use pkg_resources.get_distribution method, however, importing pkg_resources is notoriously slow [1,2]. Provide the __version__ attribute to provide an API interface to check the version of tasklib at runtime. Bump the version in order to reflect module API change. [1] pypa/setuptools#510 [2] pypa/setuptools#926
Importing pkg_resources module is notoriously slow, see [1,2]. Tasklib module now provides __version__ attribute for an easy method of version checking. [1] pypa/setuptools#510 [2] pypa/setuptools#926
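A minimal sketch of the contrast those commit messages describe, using `tasklib` only because it is the package they mention; on Python 3.8+, `importlib.metadata` avoids the pkg_resources import entirely:

```python
# Slow: importing pkg_resources scans every distribution on sys.path up front.
# import pkg_resources
# version = pkg_resources.get_distribution("tasklib").version

# Faster on Python 3.8+ (or the importlib_metadata backport on older versions).
# Raises PackageNotFoundError if the distribution is not installed.
from importlib.metadata import version
print(version("tasklib"))

# Cheapest for callers once the package exposes the attribute itself:
# import tasklib
# print(tasklib.__version__)
```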
pkg_resources has known performance issues: pypa/setuptools#926. This PR replaces pkg_resources with importlib.metadata and uses this module to retrieve package names and versions (a minimal sketch of this swap follows the checklists below). A further optimization was made to the importlib implementation which parses package metadata: https://github.com/DataDog/dd-trace-py/compare/munir/benchmark-importlib...munir/tests-importlib-metadata-custom-parsing?expand=1. Benchmarks for this third optimization are also shown in the tables below:

| benchmark | test case | Number of Packages | mean (ms) | std (ms) | baseline (ms) | overhead (ms) | overhead (%) |
|---------------------------|--------------------------------|:---:|:----:|:--:|:----:|:--:|:----:|
| ddtracerun-auto_telemetry | pkg_resources (1.x branch)     | 15  | 326  | 13 | 274  | 52 | 19.0 |
| ddtracerun-auto_telemetry | importlib                      | 15  | 285  | 5  | 270  | 15 | 5.6  |
| ddtracerun-auto_telemetry | importlib with partial parsing | 15  | 285  | 10 | 269  | 16 | 5.9  |
| ddtracerun-auto_telemetry | importlib                      | 30  | 377  | 5  | 350  | 27 | 7.7  |
| ddtracerun-auto_telemetry | importlib with partial parsing | 30  | 362  | 7  | 350  | 12 | 3.4  |
| ddtracerun-auto_telemetry | importlib                      | 45  | 381  | 24 | 348  | 31 | 8.9  |
| ddtracerun-auto_telemetry | importlib with partial parsing | 45  | 363  | 9  | 350  | 23 | 6.3  |
| ddtracerun-auto_telemetry | importlib                      | 313 | 1050 | 79 | 991  | 59 | 5.9  |
| ddtracerun-auto_telemetry | importlib with partial parsing | 313 | 911  | 28 | 905  | 6  | 0.6  |

| benchmark | test case | Number of Packages | mean (ms) | std (ms) | baseline (ms) | overhead (ms) | overhead (%) |
|-----------------------------------|--------------------------------|:---:|:----:|:--:|:----:|:--:|:----:|
| ddtracerun-auto_tracing_telemetry | pkg_resources (1.x)            | 15  | 324  | 8  | 274  | 50 | 18.2 |
| ddtracerun-auto_tracing_telemetry | importlib                      | 15  | 293  | 11 | 272  | 21 | 8.3  |
| ddtracerun-auto_tracing_telemetry | importlib with partial parsing | 15  | 291  | 12 | 272  | 19 | 6.9  |
| ddtracerun-auto_tracing_telemetry | importlib                      | 30  | 373  | 11 | 351  | 22 | 6.28 |
| ddtracerun-auto_tracing_telemetry | importlib with partial parsing | 30  | 367  | 13 | 354  | 13 | 3.6  |
| ddtracerun-auto_tracing_telemetry | importlib                      | 45  | 376  | 8  | 355  | 21 | 5.9  |
| ddtracerun-auto_tracing_telemetry | importlib with partial parsing | 45  | 364  | 9  | 352  | 22 | 6.5  |
| ddtracerun-auto_tracing_telemetry | importlib                      | 313 | 1010 | 80 | 960  | 50 | 5.2  |
| ddtracerun-auto_tracing_telemetry | importlib with partial parsing | 313 | 910  | 20 | 873  | 37 | 4.2  |

Note: redis, requests, and urllib3 were included in the test cases with 30 and 45 packages. These packages were patched by `ddtrace-run`, which increased the baseline by ~74 ms, but the overhead of telemetry remained consistent. The case with 313 packages patched gevent, pylons, SQLAlchemy, requests, flask, grpc, cassandra, botocore, and urllib3; this was to simulate the overhead of telemetry in a real-world application with telemetry enabled.

Findings from benchmarking sending telemetry events with different numbers of packages installed, patching integrations, and/or enabling tracing:

- Using importlib instead of pkg_resources cut the overhead of telemetry roughly in half (~50 ms to ~19 ms).
- The number of packages does not appear to correlate with the overhead of telemetry; the benchmarks might have been too noisy to measure the difference accurately.
- Creating a custom parser to retrieve package names and versions from PKG-INFO and METADATA files led to notable performance gains with a large number of packages.
  - The difference appears to be within a standard deviation, so more testing is required to measure it accurately.
  - Iterating on this approach might lead to better results: https://github.com/DataDog/dd-trace-py/compare/munir/benchmark-importlib...munir/tests-importlib-metadata-custom-parsing?expand=1
  - These performance gains seem to be minor; it might not be worth developing and maintaining a metadata parser.

## Checklist
- [x] Library documentation is updated.
- [x] [Corp site](https://github.com/DataDog/documentation/) documentation is updated (link to the PR).

## Reviewer Checklist
- [ ] Title is accurate.
- [ ] Description motivates each change.
- [ ] No unnecessary changes were introduced in this PR.
- [ ] PR cannot be broken up into smaller PRs.
- [ ] Avoid breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes unless absolutely necessary.
- [ ] Tests provided or description of manual testing performed is included in the code or PR.
- [ ] Release note has been added for fixes and features, or else `changelog/no-changelog` label added.
- [ ] All relevant GitHub issues are correctly linked.
- [ ] Backports are identified and tagged with Mergifyio.
- [ ] Add to milestone.
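A minimal sketch of the kind of pkg_resources-to-importlib.metadata swap that PR describes (not the dd-trace-py code itself): collect installed package names and versions without importing pkg_resources at all.

```python
from importlib.metadata import distributions  # Python 3.8+

installed = {}
for dist in distributions():
    name = dist.metadata["Name"]  # may be missing for damaged metadata
    if name:
        installed[name] = dist.version

print(f"found {len(installed)} installed distributions")
```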
Any "import pkg_resources" by a module is slow, in the 100-150ms range, depending on the system.
This is due to the number of modules imported by pkg_resources itself (email.parser is also slow), and to side-effect work done at import time, notably the `_initialize` and `_initialize_master_working_set` functions at the bottom of `pkg_resources/__init__.py`.

As wall-clock time matters to humans running CLIs, every millisecond counts.
The work done at import time should be deferred until needed, and the imports themselves also deferred if possible.
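One hedged sketch of that deferral on the consumer side, rather than inside setuptools itself: move the import into the function that actually needs it, so a CLI's fast paths (such as printing `--help`) never pay the cost. `entry_points_for` is a made-up helper name, not a setuptools API.

```python
def entry_points_for(group):
    # Deferred import: pkg_resources (and its working-set scan) only runs the
    # first time this function is called, not when the module is imported.
    import pkg_resources
    return list(pkg_resources.iter_entry_points(group))
```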