
Cache call to path_to_url #12322

Closed · wants to merge 6 commits

Conversation

notatallshaw (Member)

In my testing it reduces the time taken by the scenario described in #12320 from ~2 mins 33 seconds to ~2 mins 5 seconds.

An alternative solution would be to rearchitect Pip to not need to call path_to_url so much, but the architecture here is very complex and I don't have a good grasp on how one would go about that.

@notatallshaw (Member Author) commented Oct 6, 2023

I don't know what "Changelog entry: History fragments missing" means or how to fix it. I did fix a typo in my news entry, but I don't know what this check is looking for exactly.

Other test failures should be fixed now; we can only cache after calculating the absolute path, as relative paths depend on state outside the function.

We still retain the vast majority of the performance gains, though; I now benchmark it at ~2 mins 10 seconds.

Also, it may be possible to claw that 5 seconds back by changing the function _normalized_abs_path_to_url(abs_path: str) to _path_and_cwd_path_to_url(path: str, cwd_path: str), but that seemed a little complicated for a much smaller performance win. Let me know if you would prefer that as a solution.

@notatallshaw (Member Author)

Ah, I seem to have fixed the news entry 🙂

@@ -13,13 +14,22 @@ def get_url_scheme(url: str) -> Optional[str]:
return url.split(":", 1)[0].lower()


@lru_cache(maxsize=None)
@pradyunsg (Member) Oct 6, 2023

This is unbounded and would store information that is not used more than once.

notatallshaw (Member Author)

Yes, but what other solution is there?

We can't know ahead of time how many file paths need caching.

If a maxsize is given it is completely arbitrary. If you think it's required for memory safety I would prefer a very large number that is unlikely to be reached, like 10,000.

Member

Yes, but what other solution is there?

As you said above

An alternative solution would be to rearchitect Pip to not need to call path_to_url so much

Agreed, it's complex, but that may be better than throwing memory at the problem - after all, pip does get used in memory-constrained environments. I don't know what lru_cache does when it's getting close to memory limits, but I doubt it tries to manage that situation particularly well - so you'd probably at some point start to get paging and a significant reduction in performance.

We can't know ahead of time how many file paths need caching.

No, but it's not a matter of needing to cache anything. It's simply a case of only getting some of the performance benefits, not all of them.

Ultimately, we need to balance different use cases here. I'd consider installing 1000 wheels in a single install to be a very extreme case, and honestly I don't think 2.5 minutes is a particularly bad time for that. So while I'm always glad if we can get performance improvements, I think we have to be careful to keep perspective here. The overwhelming majority of the uses of pip "in the wild" are likely to be of the form pip install <one or two packages>.

@pfmoore (Member) Oct 6, 2023

Alternatively, maybe we can just speed up path_to_url? Looking at the following:

❯ pyperf timeit -s "from pathlib import Path; import os" "Path(os.path.normpath(os.path.abspath('.'))).as_uri()"
.....................
Mean +- std dev: 5.67 us +- 0.13 us
❯ pyperf timeit -s "from urllib.parse import urljoin; from urllib.request import pathname2url; import os" "urljoin('file:', pathname2url(os.path.normpath(os.path.abspath('.'))))"
.....................
Mean +- std dev: 9.23 us +- 0.34 us

suggests that using Path.as_uri() is a lot faster. Using Path.absolute() rather than os.path loses a lot of the gain; I'm not sure why, maybe because it calls the Path constructor twice.

The point is, there may well be other options than caching the results. Or there may be improvements that can be achieved as well as a (limited-size) cache. As with any performance exercise, it's all about trade-offs.

Edit: The same tests on Ubuntu (WSL) don't give the same improvements for the pathlib approach. Make of that what you will.
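For concreteness, the pathlib approach being benchmarked above would look roughly like this (a sketch, not pip's code; the function name is made up to avoid clashing with the real path_to_url):

```python
import os
from pathlib import Path


def path_to_url_via_pathlib(path: str) -> str:
    # as_uri() requires an absolute path, so normalize first;
    # it percent-encodes and adds the "file://" prefix itself.
    return Path(os.path.normpath(os.path.abspath(path))).as_uri()
```

Note that as_uri() raises ValueError on relative paths, which is why the abspath/normpath step cannot be dropped.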

@notatallshaw (Member Author) Oct 6, 2023

I'd consider installing 1000 wheels in a single install to be a very extreme case, and honestly I don't think 2.5 minutes is a particularly bad time for that.

Home Assistant is a very popular application in the smart home world, so it's a relatively common use case.

And it's 2.5 minutes on my machine; it's 1-2 hours on other people's: #12314. This was going to be my first in a series of PRs.

And the cache here only grows if there are a lot of wheels, so only a few kilobytes of memory are used in non-"extreme" cases.

Agreed, it's complex, but that may be better than throwing memory at the problem - after all, pip does get used in memory-constrained environments.

If a user needs to install over 1000 wheels using Pip I have to assume they have 1 or 2 MB of spare memory.

I'll take another look at the rearchitecture approach, but it probably means a significant rework of the way pip and resolvelib interact with each other. And my worry is that even if I was able to make a PR, there's a good chance it would never be accepted, as a non-maintainer of Pip adding such a significant knowledge requirement for maintenance. Or that it would be rejected by resolvelib for breaking other downstream consumers.

@notatallshaw (Member Author) Oct 7, 2023

(If you have spent this time, let us know -- I might've missed information around this!)

Yes, I've been profiling this: #12314 (comment) (path_to_url is the leftmost light green-blue box).

There are lots of other hot spots in this profile graph, but I just thought I'd start with the simplest-looking offender.

Member

Yes, I've been profiling this

The two big callers are file_links() and page_candidates in _FlatDirectorySource, and they do the same loop. So maybe put the result of that loop in a cached property?

class _FlatDirectorySource(LinkSource):
    def __init__(
        self,
        candidates_from_page: CandidatesFromPage,
        path: str,
    ) -> None:
        self._candidates_from_page = candidates_from_page
        self._path = pathlib.Path(os.path.realpath(path))
        self._file_urls = None

    @property
    def link(self) -> Optional[Link]:
        return None

    def _scan_dir(self):
        if self._file_urls is None:
            _file_urls = []
            for path in self._path.iterdir():
                url = path_to_url(str(path))
                _file_urls.append((url, _is_html_file(url)))
            self._file_urls = _file_urls

    def page_candidates(self) -> FoundCandidates:
        self._scan_dir()
        for url, html in self._file_urls:
            if html:
                yield from self._candidates_from_page(Link(url))

    def file_links(self) -> FoundLinks:
        self._scan_dir()
        return (Link(url) for url, html in self._file_urls if not html)

(Add type annotations as needed).

Member Author

I will try that locally and if successful create a new PR, unless you want to, given you have already written some code on this?

Member

No, go ahead. I haven't got the patience to work out the right type annotations :-)

Member Author

FYI, this approach doesn't work immediately, because each call to page_candidates and file_links is from a separate instance of _FlatDirectorySource. However, there are a number of possible solutions here; I will come up with one and submit a new PR.

@notatallshaw (Member Author)

Based on the feedback in this PR I have made an alternative PR: #12327

That PR is significantly more aggressive in its optimizations but also significantly more complex, so I would like to leave this PR open and wait to see whether that PR gets accepted or rejected.

@notatallshaw (Member Author)

I'm closing this PR due to the expressed lack of interest from Pip maintainers; I am going to continue to push for #12327, which directly addresses the O(n^2) issue that Pip has here.

Of course, this approach is significantly simpler, so if anyone wants to push for this again you are 100% welcome to reuse my code.

@notatallshaw notatallshaw deleted the path_to_url_cache branch January 29, 2024 23:56
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 14, 2024