
Cache call to path_to_url #12322

Closed · wants to merge 6 commits

Conversation

notatallshaw (Member)

In my testing it reduces the time taken by the scenario described in #12320 from ~2 mins 33 seconds to ~2 mins 5 seconds.

An alternative solution would be to rearchitect Pip to not need to call path_to_url so much, but the architecture here is very complex and I don't have a good grasp on how one would go about that.

@notatallshaw (Member Author) commented Oct 6, 2023

I don't know what "Changelog entry: History fragments missing" means or how to fix it. I did fix a typo in my news entry, but I don't know what this check is looking for exactly.

Other test failures should be fixed now; we can only cache after calculating the absolute path, as relative paths depend on state outside the function.

We still retain the vast majority of the performance gains, though; I now benchmark it at ~2 mins 10 seconds.

Also, it may be possible to claw that 5 seconds back by changing the function _normalized_abs_path_to_url(abs_path: str) to _path_and_cwd_path_to_url(path: str, cwd_path: str), but that seemed a little complicated for a much smaller performance win. Let me know if you would prefer that as a solution.

@notatallshaw (Member Author)

Ah, I seem to have fixed the news entry 🙂

@@ -13,13 +14,22 @@ def get_url_scheme(url: str) -> Optional[str]:
return url.split(":", 1)[0].lower()


@lru_cache(maxsize=None)
@pradyunsg (Member) Oct 6, 2023

This is unbounded and would store information that is not used more than once.

notatallshaw (Member Author)

Yes, but what other solution is there?

We can't know ahead of time how many file paths need caching.

If a maxsize is given it is completely arbitrary. If you think it's required for memory safety I would prefer a very large number that is unlikely to be reached, like 10,000.

Member

Yes, but what other solution is there?

As you said above

An alternative solution would be to rearchitect Pip to not need to call path_to_url so much

Agreed, it's complex, but that may be better than throwing memory at the problem - after all, pip does get used in memory-constrained environments. I don't know what lru_cache does when it's getting close to memory limits, but I doubt it tries to manage that situation particularly well - so you'd probably at some point start to get paging and a significant reduction in performance.

We can't know ahead of time how many file paths need caching.

No, but it's not a matter of needing to cache anything. It's simply a case of only getting some of the performance benefits, not all of them.

Ultimately, we need to balance different use cases here. I'd consider installing 1000 wheels in a single install to be a very extreme case, and honestly I don't think 2.5 minutes is a particularly bad time for that. So while I'm always glad if we can get performance improvements, I think we have to be careful to keep perspective here. The overwhelming majority of the uses of pip "in the wild" are likely to be of the form pip install <one or two packages>.

@pfmoore (Member) Oct 6, 2023

Alternatively, maybe we can just speed up path_to_url? Looking at the following:

❯ pyperf timeit -s "from pathlib import Path; import os" "Path(os.path.normpath(os.path.abspath('.'))).as_uri()"
.....................
Mean +- std dev: 5.67 us +- 0.13 us
❯ pyperf timeit -s "from urllib.parse import urljoin; from urllib.request import pathname2url; import os" "urljoin('file:', pathname2url(os.path.normpath(os.path.abspath('.'))))"
.....................
Mean +- std dev: 9.23 us +- 0.34 us

suggests that using Path.as_uri() is a lot faster. Using Path.absolute() rather than os.path loses a lot of the gain; I'm not sure why, maybe because it calls the Path constructor twice.

The point is, there may well be other options than caching the results. Or there may be improvements that can be achieved as well as a (limited-size) cache. As with any performance exercise, it's all about trade-offs.

Edit: The same tests on Ubuntu (WSL) don't give the same improvements for the pathlib approach. Make of that what you will.
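For concreteness, the pathlib approach being benchmarked above would look roughly like this (a sketch, not pip's code; the function name is made up to avoid clashing with the real path_to_url):

```python
import os
from pathlib import Path


def path_to_url_via_pathlib(path: str) -> str:
    # as_uri() requires an absolute path, so normalize first;
    # it percent-encodes and adds the "file://" prefix itself.
    return Path(os.path.normpath(os.path.abspath(path))).as_uri()
```

Note that as_uri() raises ValueError on relative paths, which is why the abspath/normpath step cannot be dropped.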

@notatallshaw (Member Author) Oct 6, 2023

I'd consider installing 1000 wheels in a single install to be a very extreme case, and honestly I don't think 2.5 minutes is a particularly bad time for that.

Home Assistant is a very popular application in the smart home world, so it's a relatively common use case.

And it's 2.5 minutes on my machine; it's 1-2 hours on other people's: #12314. This was going to be my first in a series of PRs.

And the cache here only grows if there are a lot of wheels, so only a few kilobytes of memory are used in non-"extreme" cases.

Agreed, it's complex, but that may be better than throwing memory at the problem - after all, pip does get used in memory-constrained environments.

If a user needs to install over 1000 wheels using Pip I have to assume they have 1 or 2 MB of spare memory.

I'll take another look at the rearchitecture approach, but it probably means a significant rework of the way pip and resolvelib interact with each other. And my worry is that even if I was able to make a PR, there's a good chance it would never be accepted, as a non-maintainer of Pip adding such a significant knowledge requirement for maintenance. Or that it would be rejected by resolvelib for breaking other downstream consumers.

@notatallshaw (Member Author) Oct 7, 2023

(If you have spent this time, let us know -- I might've missed information around this!)

Yes, I've been profiling this: #12314 (comment) (path_to_url is the leftmost light green-blue box).

There are lots of other hot spots in this profile graph, but I just thought I'd start with the simplest-looking offender.

Member

Yes, I've been profiling this

The two big callers are file_links() and page_candidates in _FlatDirectorySource, and they do the same loop. So maybe put the result of that loop in a cached property?

class _FlatDirectorySource(LinkSource):
    def __init__(
        self,
        candidates_from_page: CandidatesFromPage,
        path: str,
    ) -> None:
        self._candidates_from_page = candidates_from_page
        self._path = pathlib.Path(os.path.realpath(path))
        self._file_urls = None

    @property
    def link(self) -> Optional[Link]:
        return None

    def _scan_dir(self):
        if self._file_urls is None:
            _file_urls = []
            for path in self._path.iterdir():
                url = path_to_url(str(path))
                _file_urls.append((url, _is_html_file(url)))
            self._file_urls = _file_urls

    def page_candidates(self) -> FoundCandidates:
        self._scan_dir()
        for url, html in self._file_urls:
            if html:
                yield from self._candidates_from_page(Link(url))

    def file_links(self) -> FoundLinks:
        self._scan_dir()
        return (Link(url) for url, html in self._file_urls if not html)

(Add type annotations as needed).

Member Author

I will try that locally and if successful create a new PR, unless you want to, given you have already written some code on this?

Member

No, go ahead. I haven't got the patience to work out the right type annotations :-)

Member Author

FYI, this approach doesn't work immediately, because each call to page_candidates and file_links is from a separate instance of _FlatDirectorySource. However, there are a number of possible solutions here; I will come up with one and submit a new PR.

@notatallshaw (Member Author)

Based on the feedback in this PR I have made an alternative PR: #12327

That PR is significantly more aggressive in its optimizations but also significantly more complex, so I would like to leave this PR open and wait to see whether that PR gets accepted or rejected.

@notatallshaw (Member Author)

I'm closing this PR due to the expressed lack of interest from Pip maintainers; I am going to continue to push for #12327, which directly addresses the O(n^2) issue that Pip has here.

Of course, this approach is significantly simpler, so if anyone wants to push for this again you are 100% welcome to reuse my code.

@notatallshaw notatallshaw deleted the path_to_url_cache branch January 29, 2024 23:56
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 14, 2024