cache metadata lookups for sdists and lazy wheels #12256

cosmicexplorer · 2023-09-03T05:07:58Z

This PR is on top of #12186; see the +392/-154 diff against it at https://github.com/cosmicexplorer/pip/compare/metadata_only_resolve_no_whl...cosmicexplorer:pip:link-metadata-cache?expand=1.

Fixes Cache prepared metadata #11165.

Background: Pip Bandwidth Usage Improvements

In 2016, @dstufft tweeted:

In the last 24 hours PyPI has served 120 million requests and used 13.7 TB of bandwidth.

Today, @sethmlarson tweeted:

urllib3 recently reached 10 billion total downloads!! 🎉🔟🥳

Since PEP 658 from @uranusjr was implemented via #11111 and pypi/warehouse#13649, pip has been making use of shallow metadata files to resolve against, meaning that instead of pulling down several very large wheels over the course of a single pip install invocation, only to throw away most of them, we now only download and prepare a single version of each requirement at the conclusion of each pip subcommand. With #12186, we can even avoid downloading any dists at all if we only want to generate an install report with pip install --report --dry-run. For find-links or other non-pypi indices which haven't implemented PEP 658, #12208 improves the fast-deps feature to enable metadata-only resolves against those too.

Proposal: Caching Metadata to Reduce Number of Requests

The above is all great, but despite reducing bandwidth usage, we haven't yet reduced the total number of requests. As described in #12184, I believe we can safely cache metadata lookup to both drastically improve pip's runtime as well as significantly reduce the number of requests against pypi.

This caching is made safe by extending @sbidoul's prior work (see e.g. #7333, #7296) to ensure pip._internal.cache.Cache incorporates all the information that may change the compatibility of a downloaded/prepared distribution resolved from a Link.

Result: 6.5x Speedup, Shaving Off 30 Seconds of Runtime

This change produces a 6.5x speedup against the example tested below, reducing the runtime of this pip install --report command from over 36 seconds down to just 5.7 seconds:

# this branch is #12186, which this PR is based off of
> git checkout metadata_only_resolve_no_whl
> python3.8 -m pip install --dry-run --ignore-installed --report test.json --use-feature=fast-deps 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3'
...
real    0m36.410s
user    0m15.706s
sys     0m13.377s
# switch to this PR
> git checkout link-metadata-cache
# enable --use-feature=metadata-cache
> python3.8 -m pip install --use-feature=metadata-cache --dry-run --ignore-installed --report test.json --use-feature=fast-deps 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3'
...
real    0m5.671s
user    0m4.429s
sys     0m0.123s

But possibly more importantly, it also drastically reduces the number of requests made against pypi (this is reflected in the much lower sys time in the second command output above), without introducing significant disk space usage (only ~60KB, because we only cache metadata for each dist, and we compress the serialized cache entries):

> du -b -s ~/.cache/pip/link-metadata/
60238   /home/cosmicexplorer/.cache/pip/link-metadata/
> find ~/.cache/pip/link-metadata/ -type f | parallel bzcat | jq
...
{
  "metadata": "Metadata-Version: 2.1\nName: h5py\nVersion: 3.1.0\nSummary: Read and write HDF5 files from Python\nHome-page: http://www.h5py.org\nAuthor: Andrew Collette\nAuthor-email: [email protected]\nMaintainer: Andrew Collette\nMaintainer-email: [email protected]\nLicense: BSD\nDownload-URL: https://pypi.python.org/pypi/h5py\nPlatform: UNKNOWN\nClassifier: Development Status :: 5 - Production/Stable\nClassifier: Intended Audience :: Developers\nClassifier: Intended Audience :: Information Technology\nClassifier: Intended Audience :: Science/Research\nClassifier: License :: OSI Approved :: BSD License\nClassifier: Programming Language :: Cython\nClassifier: Programming Language :: Python\nClassifier: Programming Language :: Python :: 3\nClassifier: Programming Language :: Python :: 3.6\nClassifier: Programming Language :: Python :: 3.7\nClassifier: Programming Language :: Python :: Implementation :: CPython\nClassifier: Topic :: Scientific/Engineering\nClassifier: Topic :: Database\nClassifier: Topic :: Software Development :: Libraries :: Python Modules\nClassifier: Operating System :: Unix\nClassifier: Operating System :: POSIX :: Linux\nClassifier: Operating System :: MacOS :: MacOS X\nClassifier: Operating System :: Microsoft :: Windows\nRequires-Python: >=3.6\nRequires-Dist: cached-property ; python_version < \"3.8\"\nRequires-Dist: numpy (>=1.12) ; python_version == \"3.6\"\nRequires-Dist: numpy (>=1.14.5) ; python_version == \"3.7\"\nRequires-Dist: numpy (>=1.17.5) ; python_version == \"3.8\"\nRequires-Dist: numpy (>=1.19.3) ; python_version >= \"3.9\"\n\n\nThe h5py package provides both a high- and low-level interface to the HDF5\nlibrary from Python. The low-level interface is intended to be a complete\nwrapping of the HDF5 API, while the high-level component supports  access to\nHDF5 files, datasets and groups using established Python and NumPy concepts.\n\nA strong emphasis on automatic conversion between Python (Numpy) datatypes and\ndata structures and their HDF5 equivalents vastly simplifies the process of\nreading and writing data from Python.\n\nSupports HDF5 versions 1.8.4 and higher.  On Windows, HDF5 is included with\nthe installer.\n\n\n",
  "filename": "h5py-3.1.0-cp38-cp38-manylinux1_x86_64.whl",
  "canonical_name": "h5py"
}

TODO

Merge pull preparer logic out of the resolver to consume metadata-only dists in commands #12186.

cosmicexplorer · 2023-09-03T12:25:37Z

@ewdurbin: regarding #12257 (comment), I don't think I'd tagged you in this change that will actually reduce requests against files.pythonhosted.org as well!

sbidoul · 2023-09-03T13:05:08Z

@cosmicexplorer unfortunately I don't have enough time available at the moment to delve into your PRs which all sound very interesting.

This one caught my attention, though, as it is something I also contemplated to do previously. So I gave it a quick try. And I have questions :)

When running pip install --cache-dir=/tmp/pipcache --no-binary=:all: --use-feature=metadata-cache git+https://github.com/pypa/pip-test-package it creates a link-metadata cache entry. Actually, it should not since this link is not cacheable because the head of the branch can change. I tend to think a metadata caching mechanism should somehow re-use the same _should_cache() logic we have to decide whether locally built wheels should be cached or not.
When running pip install --cache-dir=/tmp/pipcache --use-feature=metadata-cache lxml which downloads and installs a wheel, it creates a link-metadata cache entry. Is that intended? Is not sufficient to rely on the http cache obtain the .metadata from the index, or if not available the wheel ?
When running pip install --cache-dir=/tmp/pipcache --use-feature=metadata-cache --no-binary=:all: lxml which downloads and builds a sdist, it also creates a metadata cache entry. One potential issue is that the cache entry file name seems to be created from a hash of the link only (IIUC). However, the metadata can be different depending on the platform on which it has been generated. My intuition is that the compatibility tags we have in the wheels could play a role here, although I don't know how to obtain them without doing a full wheel build.

cosmicexplorer · 2023-09-03T13:28:18Z

@sbidoul: thank you so much for your thoughtful feedback!! ^_^

When running pip install --cache-dir=/tmp/pipcache --no-binary=:all: --use-feature=metadata-cache git+https://github.com/pypa/pip-test-package it creates a link-metadata cache entry

This is fixed in 55f185a, which avoids creating cache entries for git or file urls.

Is not sufficient to rely on the http cache obtain the .metadata from the index, or if not available the wheel ?

I am looking into this now. I agree that the http cache should retain the .metadata from the index, so I will make sure that is working. I also agree that it should be able to pull from the cached wheel, if available.

One potential issue is that the cache entry file name seems to be created from a hash of the link only (IIUC). However, the metadata can be different depending on the platform on which it has been generated. My intuition is that the compatibility tags we have in the wheels could play a role here, although I don't know how to obtain them without doing a full wheel build.

So you're absolutely right about this, but I was under the impression that I had solved this by directly making use of your work in pip._internal.cache.Cache._get_cache_path_parts() from ab05936 to solve #7296, which incorporates the current interpreter name and version. As you said:

I tend to think a metadata caching mechanism should somehow re-use the same _should_cache() logic we have to decide whether locally built wheels should be cached or not.

Please no rush to respond, I really appreciate your feedback.

cosmicexplorer · 2023-09-03T22:48:53Z

Is not sufficient to rely on the http cache obtain the .metadata from the index, or if not available the wheel ?

After a lengthy investigation, I've discovered that the http cache from CacheControlAdapter will still make GET requests against files.pythonhosted.org, even when the file is already in the http cache. The response from the server for such requests is ignored. So while debug output with -vvv shows e.g. The response is "fresh", returning cached response, we will still make a request against pypi for each .metadata or .whl. It is nontrivial to work around this.

In #12257 we were able to make some GET requests much faster by adding HTTP caching headers to make pypi return an empty 304 Not Modified, but there are two issues with applying that here:

files.pythonhosted.org doesn't return a 0-length 304 Not Modified, it always returns a full-length 200 OK.
unlike send HTTP caching headers for index pages to further reduce bandwidth usage #12257, we are dealing with immutable links. We don't need to check whether the result has changed.

I think this PR is the cleanest way to avoid making those unused GET requests against files.pythonhosted.org by making use of the existing "metadata-only fetch" concept, without having to mess around with our CacheControlAdapter. What do you think @sbidoul?

src/pip/_internal/cache.py

pfmoore · 2023-09-04T07:43:44Z

After a lengthy investigation, I've discovered that the http cache from CacheControlAdapter will still make GET requests against files.pythonhosted.org, even when the file is already in the http cache.

Is this by design, or is it a bug? If the latter, I think we’d be better just getting the bug fixed…

sbidoul · 2023-09-04T11:23:27Z

I think this PR is the cleanest way to avoid making those unused GET requests against files.pythonhosted.org by making use of the existing "metadata-only fetch" concept, without having to mess around with our CacheControlAdapter.

I perceive several aspects to this metadata caching topic:

Caching metadata we computed locally is something I consider useful (Cache prepared metadata #11165), because there are several scenario where pip downloads a sdist or a VCS repo to compute the metadata and then discard it, if a wheel is not built for any reason (already installed, rejected version, ...). The decision to cache the metadata or not and the way to structure the cache should likely be as similar as possible to the cache of locally built wheels, for least-surprise behaviour.
Caching .metadata files from indexes. I can't say I fully understand the subtleties of the interactions between the pip http cache and PyPI, but my intuition says these should be cached at the http level (like wheels), and that any potential improvements should be investigated in details in the CacheControl area first before leaking that caching complexity elsewhere in the code base.
Caching metadata obtained from downloaded or cached wheels. For this I'd say we should investigate if there is a tangible performance benefit, because I'm under the impression getting the metadata from inside the locally cached wheel is fast enough.

What do you think @sbidoul?

Easier said than done, I know :) And caching is hard. But I'd suggest addressing each aspect independently, as these are very different and independent problems.

cosmicexplorer · 2023-09-14T22:06:13Z

Ok. the above has been implemented, which means we only cache metadata for cacheable sdists and lazy wheel dists, and otherwise rely on the CacheControl HTTP cache for PEP 658 metadata. I also created a new exception class so any errors with this new caching are made extremely obvious:

pip/src/pip/_internal/exceptions.py

Lines 256 to 257 in 227d8e8

    
           class CacheMetadataError(PipError): 
        
               """Raised when de/serializing a requirement into the metadata cache."""

cosmicexplorer · 2023-09-18T01:20:32Z

@sbidoul: before I ask you to review this in depth, does @uranusjr's proposal (which I've implemented here) to use this metadata cache for both lazy wheels and cacheable sdists (sdists which return True from should_cache()) seem like an appropriate compromise? Otherwise the PEP 658 metadata is indeed already efficiently retrieved from the CacheControl HTTP cache, just as you suspected.

sbidoul · 2023-09-20T20:28:45Z

@cosmicexplorer I still have some open questions after a bit of further thinking.

My first one is if there is still plan for PyPI to backfill PEP 658 metadata?

Assuming that is the case, will fast-deps be something that we still need? I've not followed all the discussions about it but I kind of remember talks about dropping it. Was it because PEP 658 makes it unncessary?

Assuming that we conclude that fast-deps is going to be abandoned, then there are only 3 2 cases left:

local metadata preparations for source trees and sdists: a metadata cache is useful for that, similar to the cache of locally built wheels
.metadata cache: the HTTP cache is adequate as you concluded
~~metadata from downloaded wheels: I wonder if optimizations to the http cache such as New HTTP cache with lower memory usage #11143 would help~~

[edit] if we have .metadata for all wheels in the index, item 3 does not apply anymore

dstufft · 2023-09-21T02:06:22Z

My first one is if there is still plan for PyPI to backfill PEP 658 metadata?

I believe so yes.

edmorley · 2023-12-17T12:16:02Z

@cosmicexplorer I still have some open questions after a bit of further thinking.

My first one is if there is still plan for PyPI to backfill PEP 658 metadata?

Backfilling is being tracked at pypi/warehouse#8254.

Assuming that is the case, will fast-deps be something that we still need? I've not followed all the discussions about it but I kind of remember talks about dropping it. Was it because PEP 658 makes it unncessary?

@cosmicexplorer I don't suppose you saw this question / have a chance to take a look at it? :-)

cosmicexplorer · 2024-01-11T06:34:06Z

@edmorley @sbidoul:

Assuming that is the case, will fast-deps be something that we still need? I've not followed all the discussions about it but I kind of remember talks about dropping it. Was it because PEP 658 makes it unncessary?

Please see #12208, which lays out the case for the continued use of the fast-deps feature (and also dramatically improves the implementation). Note that @radoering has also recently reviewed that code and even adapted it for poetry already (see his many very helpful comments in that PR), which demonstrates another vote of confidence for the fast-deps feature as perfected in #12208. The fast-deps implementation complexity is strictly limited to lazy_wheel.py with extremely thorough documentation in order to minimize the maintenance cost of this functionality. Also note that pypi itself does not need to do anything at all to support fast-deps, as this implementation is extremely robust and handles a variety of behaviors.

In a followup PR, I plan to simply remove the fast-deps feature and turn it on by default, as one among multiple metadata fetching methods including PEP 658 or cached sdists.

When performing `install --dry-run` and PEP 658 .metadata files are available to guide the resolve, do not download the associated wheels. Rather use the distribution information directly from the .metadata files when reporting the results on the CLI and in the --report file. - describe the new --dry-run behavior - finalize linked requirements immediately after resolve - introduce is_concrete - funnel InstalledDistribution through _get_prepared_distribution() too - add test for new install --dry-run functionality (no downloading)

- catch an exception when parsing metadata which only occurs in CI - handle --no-cache-dir - call os.makedirs() before writing to cache too - catch InvalidSchema when attempting git urls with BatchDownloader - fix other test failures - reuse should_cache(req) logic - gzip compress link metadata for a slight reduction in disk space - only cache built sdists - don't check should_cache() when fetching - cache lazy wheel dists - add news - turn debug logs in fetching from cache into exceptions - use scandir over listdir when searching normal wheel cache - handle metadata email parsing errors - correctly handle mutable cached requirement - use bz2 over gzip for an extremely slight improvement in disk usage

psf-chronographer bot added the bot:chronographer:provided label Sep 3, 2023

cosmicexplorer force-pushed the link-metadata-cache branch 2 times, most recently from f072af8 to 07a0d2f Compare September 3, 2023 05:33

This was referenced Sep 3, 2023

send HTTP caching headers for index pages to further reduce bandwidth usage #12257

Draft

memoizing dependency lookups with a local resolve index #12184

Closed

cosmicexplorer force-pushed the link-metadata-cache branch 4 times, most recently from 1b26b58 to 7a9db70 Compare September 3, 2023 10:07

cosmicexplorer changed the title ~~make LinkMetadataCache to cache metadata lookups~~ introduce --use-feature=metadata-cache to cache metadata lookups Sep 3, 2023

cosmicexplorer force-pushed the link-metadata-cache branch 3 times, most recently from bc2eb29 to 3a25be5 Compare September 3, 2023 11:48

cosmicexplorer mentioned this pull request Sep 3, 2023

persistent cache for link parsing and interpreter compatibility #12258

Draft

cosmicexplorer force-pushed the link-metadata-cache branch from 3a25be5 to 0f20f62 Compare September 3, 2023 12:18

cosmicexplorer force-pushed the link-metadata-cache branch from e6e29f2 to a3256f4 Compare September 3, 2023 13:16

cosmicexplorer force-pushed the link-metadata-cache branch 3 times, most recently from 351aa56 to 635539b Compare September 3, 2023 21:09

cosmicexplorer force-pushed the link-metadata-cache branch 2 times, most recently from 2089c05 to 5a42ae4 Compare September 3, 2023 23:43

uranusjr reviewed Sep 4, 2023

View reviewed changes

src/pip/_internal/cache.py Outdated Show resolved Hide resolved

uranusjr reviewed Sep 4, 2023

View reviewed changes

src/pip/_internal/cache.py Outdated Show resolved Hide resolved

edmorley mentioned this pull request Sep 6, 2023

Populate find_all_candidates cache from threadpool #10480

Open

cosmicexplorer force-pushed the link-metadata-cache branch from b09437f to 227d8e8 Compare September 14, 2023 20:39

cosmicexplorer mentioned this pull request Sep 18, 2023

Cache prepared metadata #11165

Open

1 task

ddelange mentioned this pull request Sep 21, 2023

⚡ Prepare for metadata-only resolves ddelange/pipgrip#113

Merged

pradyunsg removed the bot:chronographer:provided label Dec 20, 2023

psf-chronographer bot added the bot:chronographer:provided label Dec 20, 2023

cosmicexplorer mentioned this pull request Jan 11, 2024

perform 1-3 HTTP requests for each wheel using fast-deps #12208

Open

cosmicexplorer force-pushed the link-metadata-cache branch 3 times, most recently from cf868b7 to 840b346 Compare January 11, 2024 06:19

cosmicexplorer force-pushed the link-metadata-cache branch 4 times, most recently from a337d34 to 36b0fa8 Compare January 17, 2024 04:25

ddelange mentioned this pull request Mar 5, 2024

Expose the METADATA file of wheels in the simple API pypi/warehouse#8254

Closed

cosmicexplorer mentioned this pull request Mar 5, 2024

Consider using uv as an optional alternate resolver. pex-tool/pex#2371

Open

cosmicexplorer mentioned this pull request Aug 13, 2024

Pip install --dry-run shouldn't download full wheels when metadata file available #12603

Open

1 task

cosmicexplorer added 4 commits August 13, 2024 05:34

cache "concrete" dists by Distribution instead of InstallRequirement

ad93d54

refactor much of RequirementPreparer and add lengthy docs

52133ba

cosmicexplorer force-pushed the link-metadata-cache branch from 36b0fa8 to b02915a Compare August 13, 2024 10:36

This was referenced Aug 17, 2024

metadata resolve workstream #12921

Open

execute batch downloads in parallel worker threads #12923

Draft

cosmicexplorer marked this pull request as draft August 20, 2024 16:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cache metadata lookups for sdists and lazy wheels #12256

cache metadata lookups for sdists and lazy wheels #12256

cosmicexplorer commented Sep 3, 2023 •

edited

Loading

cosmicexplorer commented Sep 3, 2023

sbidoul commented Sep 3, 2023

cosmicexplorer commented Sep 3, 2023 •

edited

Loading

cosmicexplorer commented Sep 3, 2023 •

edited

Loading

pfmoore commented Sep 4, 2023

sbidoul commented Sep 4, 2023

cosmicexplorer commented Sep 14, 2023

cosmicexplorer commented Sep 18, 2023 •

edited

Loading

sbidoul commented Sep 20, 2023 •

edited

Loading

dstufft commented Sep 21, 2023

edmorley commented Dec 17, 2023 •

edited

Loading

cosmicexplorer commented Jan 11, 2024 •

edited

Loading

cache metadata lookups for sdists and lazy wheels #12256

Are you sure you want to change the base?

cache metadata lookups for sdists and lazy wheels #12256

Conversation

cosmicexplorer commented Sep 3, 2023 • edited Loading

Background: Pip Bandwidth Usage Improvements

Proposal: Caching Metadata to Reduce Number of Requests

Result: 6.5x Speedup, Shaving Off 30 Seconds of Runtime

TODO

cosmicexplorer commented Sep 3, 2023

sbidoul commented Sep 3, 2023

cosmicexplorer commented Sep 3, 2023 • edited Loading

cosmicexplorer commented Sep 3, 2023 • edited Loading

pfmoore commented Sep 4, 2023

sbidoul commented Sep 4, 2023

cosmicexplorer commented Sep 14, 2023

cosmicexplorer commented Sep 18, 2023 • edited Loading

sbidoul commented Sep 20, 2023 • edited Loading

dstufft commented Sep 21, 2023

edmorley commented Dec 17, 2023 • edited Loading

cosmicexplorer commented Jan 11, 2024 • edited Loading

cosmicexplorer commented Sep 3, 2023 •

edited

Loading

cosmicexplorer commented Sep 3, 2023 •

edited

Loading

cosmicexplorer commented Sep 3, 2023 •

edited

Loading

cosmicexplorer commented Sep 18, 2023 •

edited

Loading

sbidoul commented Sep 20, 2023 •

edited

Loading

edmorley commented Dec 17, 2023 •

edited

Loading

cosmicexplorer commented Jan 11, 2024 •

edited

Loading