Cache call to path_to_url #12322
a8efcac to d1db13f
Other test failures should be fixed now. We can only cache after calculating the absolute path, as relative paths depend on state outside the function. We still retain the vast majority of the performance gains, though: I now benchmark it at ~2 mins 10 seconds. Also, it may be possible to claw that 5 seconds back by changing the function
Ah, I seem to have fixed the news entry 🙂
@@ -13,13 +14,22 @@ def get_url_scheme(url: str) -> Optional[str]:
     return url.split(":", 1)[0].lower()


 @lru_cache(maxsize=None)
This is unbounded and would store information that is not used more than once.
Yes, but what other solution is there?
We can't know ahead of time how many file paths need caching.
If a maxsize is given, it is completely arbitrary. If you think it's required for memory safety, I would prefer a very large number that is not expected to be reached, like 10'000.
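For reference, this is how a bounded `lru_cache` behaves once it is full. The sketch below is illustrative only: `to_url` is a hypothetical stand-in for the real conversion, and `maxsize=2` is deliberately tiny so that eviction is visible; it is not a suggested value.

```python
from functools import lru_cache


@lru_cache(maxsize=2)  # tiny on purpose, to make eviction visible
def to_url(abs_path: str) -> str:
    # Hypothetical stand-in for the real path-to-URL conversion.
    return "file://" + abs_path


# "/a" is requested twice in a row-ish order and hits; "/b" is evicted by "/c"
# before it is requested again, so its second lookup is a miss.
for p in ["/a", "/b", "/a", "/c", "/b"]:
    to_url(p)

info = to_url.cache_info()
print(info)
```

With a bound, memory use is capped at `maxsize` entries, and the cost is that evicted paths are recomputed on their next lookup.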
Yes, but what other solution is there?
As you said above
An alternative solution would be to rearchitect Pip to not need to call path_to_url so much
Agreed, it's complex, but that may be better than throwing memory at the problem - after all, pip does get used in memory-constrained environments. I don't know what lru_cache does when it's getting close to memory limits, but I doubt it tries to manage that situation particularly - so you'd probably at some point start to get paging and a significant reduction in performance.
We can't know ahead of time how many file paths need caching.
No, but it's not a matter of needing to cache anything. It's simply a case of only getting some of the performance benefits, not all of them.
Ultimately, we need to balance different use cases here. I'd consider installing 1000 wheels in a single install to be a very extreme case, and honestly I don't think 2.5 minutes is a particularly bad time for that. So while I'm always glad if we can get performance improvements, I think we have to be careful to keep perspective here. The overwhelming majority of the uses of pip "in the wild" are likely to be of the form pip install <one or two packages>.
Alternatively, maybe we can just speed up path_to_url? Looking at the following:
❯ pyperf timeit -s "from pathlib import Path; import os" "Path(os.path.normpath(os.path.abspath('.'))).as_uri()"
.....................
Mean +- std dev: 5.67 us +- 0.13 us
❯ pyperf timeit -s "from urllib.parse import urljoin; from urllib.request import pathname2url; import os" "urljoin('file:', pathname2url(os.path.normpath(os.path.abspath('.'))))"
.....................
Mean +- std dev: 9.23 us +- 0.34 us
suggests that using Path.as_uri() is a lot faster. Using Path.absolute() rather than os.path loses a lot of the gain, I'm not sure why - maybe because it calls the Path constructor twice.
The point is, there may well be other options than caching the results. Or there may be improvements that can be achieved as well as a (limited-size) cache. As with any performance exercise, it's all about trade offs.
Edit: The same tests on Ubuntu (WSL) don't give the same improvements for the pathlib approach. Make of that what you will.
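The comparison above can also be reproduced without pyperf. The following is a rough sketch using the standard-library `timeit` module; the function names `via_pathlib` and `via_urllib` are made up for illustration, and `timeit` is less rigorous than pyperf, so treat the numbers as indicative only. As noted above, results differ between platforms.

```python
import os
import timeit
from pathlib import Path
from urllib.parse import urljoin
from urllib.request import pathname2url


def via_pathlib(path: str) -> str:
    # Same expression as the first pyperf command above.
    return Path(os.path.normpath(os.path.abspath(path))).as_uri()


def via_urllib(path: str) -> str:
    # Same expression as the second pyperf command above.
    return urljoin("file:", pathname2url(os.path.normpath(os.path.abspath(path))))


if __name__ == "__main__":
    n = 20_000
    for fn in (via_pathlib, via_urllib):
        seconds = timeit.timeit(lambda: fn("."), number=n)
        print(f"{fn.__name__}: {seconds / n * 1e6:.2f} us per call")
```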
I'd consider installing 1000 wheels in a single install to be a very extreme case, and honestly I don't think 2.5 minutes is a particularly bad time for that.
Home Assistant is a very popular application in the smart home world, so it's a relatively common use case.
And it's 2.5 minutes on my machine; it's 1-2 hours on other people's: #12314. This was going to be my first in a series of PRs.
And the cache here only grows if there are a lot of wheels, so there's only a few kilobytes of memory used in non-"extreme" examples.
Agreed, it's complex, but that may be better than throwing memory at the problem - after all, pip does get used in memory-constrained environments.
If a user needs to install over 1000 wheels using Pip I have to assume they have 1 or 2 MBs of spare memory.
I'll take another look at the rearchitecture approach, but it probably means a significant rework of the way pip and resolvelib interact with each other. My worry is that even if I were able to make a PR, there's a good chance it would never be accepted, since it would be a non-Pip maintainer adding a significant amount of knowledge required for maintenance. Or it might be rejected by resolvelib for breaking other downstream consumers.
(If you have spent this time, let us know -- I might've missed information around this!)
Yes, I've been profiling this: #12314 (comment) (path_to_url is the light greenish-blue box at the far left).
There are lots of other hot spots in this profile graph, but I just thought I'd start with the most simple looking offender.
Yes, I've been profiling this
The two big callers are file_links() and page_candidates in _FlatDirectorySource, and they do the same loop. So maybe put the result of that loop in a cached property?
class _FlatDirectorySource(LinkSource):
    def __init__(
        self,
        candidates_from_page: CandidatesFromPage,
        path: str,
    ) -> None:
        self._candidates_from_page = candidates_from_page
        self._path = pathlib.Path(os.path.realpath(path))
        self._file_urls = None

    @property
    def link(self) -> Optional[Link]:
        return None

    def _scan_dir(self):
        # Scan the directory once and remember (url, is_html) for each entry.
        if self._file_urls is None:
            _file_urls = []
            for path in self._path.iterdir():
                url = path_to_url(str(path))
                _file_urls.append((url, _is_html_file(url)))
            self._file_urls = _file_urls

    def page_candidates(self) -> FoundCandidates:
        self._scan_dir()
        for url, html in self._file_urls:
            if html:
                yield from self._candidates_from_page(Link(url))

    def file_links(self) -> FoundLinks:
        self._scan_dir()
        return (Link(url) for url, html in self._file_urls if not html)
(Add type annotations as needed).
I will try that locally and, if successful, create a new PR, unless you want to, given you have already written some code on this?
No, go ahead. I haven't got the patience to work out the right type annotations :-)
FYI, this approach doesn't work immediately because each call to page_candidates and file_links is from a separate instance of _FlatDirectorySource. However, there are a number of possible solutions here; I will come up with one and submit a new PR.
Based on the feedback in this PR I have made an alternative PR: #12327. That PR is significantly more aggressive in its optimizations but also significantly more complex, so I would like to leave this PR open and wait to see if that PR gets accepted or rejected.
I'm closing this PR due to the expressed lack of interest from Pip maintainers. I am going to continue to push for #12327, which directly addresses the O(n^2) issue that Pip has here. Of course, this approach is significantly simpler, so if anyone wants to push for this again, you are 100% welcome to reuse my code.
In my testing it reduces the time taken by the scenario described in #12320 from ~2 mins 33 seconds to ~2 mins 5 seconds.
An alternative solution would be to rearchitect Pip to not need to call path_to_url so much, but the architecture here is very complex and I don't have a good grasp on how one would go about doing that.