path_to_url called millions of times for ~1000 offline wheel installs #12320

Closed
notatallshaw opened this issue Oct 6, 2023 · 15 comments · Fixed by #12327
Labels
type: refactor Refactoring code

Comments

@notatallshaw
Member

notatallshaw commented Oct 6, 2023

Description

The fastest way to install Python packages, and the generally advised best way, is to use wheels.

However, when you actually try to install ~1000 wheels from a single set of requirements, there are significant performance bottlenecks in Pip.

Expected behavior

Installing pre-resolved wheels offline should be very fast.

pip version

23.3.dev0

Python version

Python 3.11

OS

Linux

How to Reproduce

  1. python3.11 -m venv .venv
  2. source .venv/bin/activate
  3. <install latest/dev pip>
  4. git clone https://github.com/home-assistant/core
  5. cd core/
  6. python -m pip download -d {download_directory} --prefer-binary -r requirements_all.txt
  7. cd {download_directory}
  8. for file in *.tar.gz; do pip wheel --no-deps "$file" && mv "$file" "$file.built"; done
  9. for file in *.zip; do pip wheel --no-deps "$file" && mv "$file" "$file.built"; done
  10. cd -
  11. time python -m pip install --dry-run --only-binary=:all: --no-index --ignore-installed --find-links file://{download_directory} -r requirements_all.txt

Output

On my machine:

real    2m33.486s
user    2m31.886s
sys     0m1.568s

Running a profile on this I see pip._internal.utils.url.path_to_url is called over 3,000,000 times and takes ~25% of the total run time.
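To illustrate how call counts like this arise (toy code, not pip internals): when a per-file conversion is repeated on every candidate lookup, total calls grow as lookups times files. A minimal cProfile sketch, with a hypothetical stand-in for `path_to_url`:

```python
# Illustrative sketch (toy code, not pip internals): counting calls to a
# hot helper with cProfile, the same way the path_to_url hotspot shows up.
import cProfile
import io
import pstats


def path_to_url(path):
    # Hypothetical stand-in for pip._internal.utils.url.path_to_url.
    return "file://" + path


def find_candidates(paths):
    # Converting every path on every lookup means total calls grow as
    # (number of lookups) x (number of files) - how millions of calls arise.
    return [path_to_url(p) for p in paths]


paths = [f"/wheels/pkg{i}.whl" for i in range(100)]
profiler = cProfile.Profile()
profiler.enable()
for _ in range(50):  # 50 lookups over 100 files -> 5000 conversions
    find_candidates(paths)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).print_stats("path_to_url")
report = stream.getvalue()
print(report)  # the path_to_url row shows ncalls of 5000
```

With ~1000 files and a resolver step per requirement, the same multiplication reaches millions of calls.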


@notatallshaw notatallshaw added S: needs triage Issues/PRs that need to be triaged type: bug A confirmed bug or unintended behavior labels Oct 6, 2023
@pfmoore
Member

pfmoore commented Oct 7, 2023

Some questions (treating this purely as an optimisation question, rather than specifically focusing on path_to_url).

  1. In step 11 of the reproduction, you do not specify --no-deps. Was that an accident? If not, then you say these are "pre-resolved wheels", but I'm not 100% clear what you mean if that doesn't imply --no-deps. Are these actually a fully frozen set of dependencies?
  2. Why are you using a file: URL for the --find-links option? This is a local directory, so can't you just use the directory name, avoiding the need for at least some of the URL processing.

I'd be interested to know how much benefit you'd get from implementing these two ideas.

And a question for context - reading the Home Assistant website, it seems that the recommended installation approach is to use a prebuilt environment ("Home Assistant Operating System" or "Home Assistant Container"). Installing with pip ("Home Assistant Core") is described as an "advanced installation process" - which seems like a very reasonable way of viewing the options. The Home Assistant Core approach installs just the homeassistant package, letting pip do the dependency resolution. So I'm not 100% clear how realistic the "installing pre-resolved wheels offline" scenario is in practice. Is it a step used when preparing the container or OS forms of HA? If so, then presumably that's a relatively rare exercise?

To reiterate, I'm not at all against performance improvements for pip, and I'm fine with fixing "low hanging fruit" no matter how it gets identified. But if we're going to focus on performance fixes in general, I'd prefer working on improvements that give benefits to the majority of our users, rather than ones that focus on unusual or very specific use cases. For this particular example, I'd expect that the "normal" case of installing thousands of dependencies would involve the resolver and downloading multiple potential candidates (i.e., much like the HA Core approach), and therefore the savings from improving path_to_url would be swamped by network traffic and similar (where we are already doing performance work).

@RonnyPfannschmidt
Contributor

Also, if the intent is to basically sidestep all of pip, why not just use https://installer.pypa.io/en/stable/?

@notatallshaw
Member Author

notatallshaw commented Oct 7, 2023

Some questions (treating this purely as an optimisation question, rather than specifically focusing on path_to_url).

Then please reply on issue #12314, which is the real-world use case; this issue was opened to address a specific performance bottleneck, as I explained here: #12314 (comment)

In step 11 of the reproduction, you do not specify --no-deps. Was that an accident? If not, then you say these are "pre-resolved wheels", but I'm not 100% clear what you mean if that doesn't imply --no-deps. Are these actually a fully frozen set of dependencies?

Because I am most closely trying to reproduce #12314, and it would be good for Pip to still validate that the dependencies are correct.

Why are you using a file: URL for the --find-links option? This is a local directory, so can't you just use the directory name, avoiding the need for at least some of the URL processing.

Well, firstly, because doing so was the first way I could reproduce what the OP said in #12314: that Version construction takes a significant amount of the total run time.

Secondly, because this is just a specific issue about one of the performance bottlenecks I found trying to reproduce the issue described in #12314 and isn't meant to address the larger picture.

Thirdly, because I am unsure what alternative you mean; I cannot pass ~1000 wheels on the command line.

And a question for context - reading the Home Assistant website, it seems that the recommended installation approach is to use a prebuilt environment ("Home Assistant Operating System" or "Home Assistant Container"). Installing with pip ("Home Assistant Core") is described as an "advanced installation process" - which seems like a very reasonable way of viewing the options

My understanding is that #12314 is about the steps to build the container. But I am anecdotally aware of users who roll their own container and therefore need to go through the build process themselves.

Is it a step used when preparing the container or OS forms of HA? If so, then presumably that's a relatively rare exercise?

Well, in an ideal world you presumably want to rebuild the container and run your tests any time new requirements are released, which with over 1000 requirements I assume would be near constantly.

But if we're going to focus on performance fixes in general, I'd prefer working on improvements that give benefits to the majority of our users, rather than ones that focus on unusual or very specific use cases.

There are an increasing number of applications that use a lot of dependencies: WinPython, Airflow, Superset, etc. These are probably all in the 300-to-500 range, but there is no reason users cannot depend on several of these projects at once. And most projects are moving, or being encouraged to move, to wheels rather than sdists. So I would be happy to see performance improvements in Pip in this area.

therefore the savings from improving path_to_url would be swamped by network traffic and similar (where we are already doing performance work).

I was going to raise other issues around this; looking at the call stack when I use the network instead of a local install, I believe I see similar places where similar savings could be made. But in #12314 the user is clearly seeing that most of the time is not being spent on the network.

However I started with what I saw as the simplest and least controversial gain, to test the waters on how receptive Pip maintainers would be to this approach. I was not expecting to provide this level of justification until I got into some of the more complex issues, such as trying to cache the return values of packaging. This issue is clean and reproducible, and the PR is a double-digit percentage win on those reproducible steps, for very little cost or complexity.

@notatallshaw
Member Author

Also if the intent is to basically sidestep all of pip

I am not; Pip is critical for resolving and checking the resolution here.

@pfmoore
Member

pfmoore commented Oct 7, 2023

However I started with what I saw as the simplest and least controversial gain, to test the waters on how receptive Pip maintainers would be to this approach.

Fair enough. My personal view is that I see massive dependency graphs like this as the exception rather than the rule, and while I'm happy enough to improve things for these use cases, I'd rather see evidence that improvements are of benefit (or at least neutral) in the normal case (which I consider to be up to an order of magnitude of maybe 100 dependencies in one install). Performance is a case where "fixes" are notorious for improving one area at the cost of others, and so I'm skeptical of results that don't show the effect across the full range of use cases we support.

I won't comment any further on the Home Assistant use case. I assume they have decided for themselves that the approach they are taking is the best one for them, and I have no experience to allow me to say otherwise. But the one thing I will say is that they are definitely pushing the limits of scale as far as pip is concerned, and that's something they have to take into account, whether we find performance improvements or not.

Having said all of this, you have definitely nerd-sniped me into looking at this. I've made an alternative suggestion in #12322 which you might like to try out.

@notatallshaw
Member Author

notatallshaw commented Oct 7, 2023

But the one thing I will say is that they are definitely pushing the limits of scale as far as pip is concerned, and that's something they have to take into account, whether we find performance improvements or not.

Well, probably they should move to rip once it is released; I have already been helping rip by testing it against lots of real-world complex dependency graphs and reporting various issues from the results.

But personally I would like Pip to be able to handle the ~1000-dependency scenario smoothly in almost all use cases, as I see this as the future for a lot of large Python applications (or at least in the 500-1000 range). And I'm not quite sure what it says about Python if, in its future, all of its supporting infrastructure is written in Rust.

@notatallshaw
Member Author

notatallshaw commented Oct 7, 2023

Performance is a case where "fixes" are notorious for improving one area at the cost of others, and so I'm skeptical of results that don't show the effect across the full range of use cases we support.

That's quite a fair point; unfortunately there's no Pip performance benchmark suite that can be leveraged or contributed to.

In that sense I do understand the carefulness of reviewing performance improvements.
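For what it's worth, even without a shared suite, a rough harness can compare before/after timings per scenario. A minimal sketch (illustrative only; it times a trivial placeholder command where a real benchmark would time a pip invocation):

```python
# Rough benchmarking sketch (illustrative, not an official pip benchmark):
# run a command several times and keep the best wall-clock result, so a
# candidate performance fix can be compared across different scenarios.
import subprocess
import sys
import time


def best_of(args, runs=3):
    """Run `args` `runs` times and return the fastest wall-clock time."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(args, check=True, capture_output=True)
        best = min(best, time.perf_counter() - start)
    return best


# Placeholder command; a real benchmark would time something like
# [sys.executable, "-m", "pip", "install", "--dry-run", ...] per scenario.
elapsed = best_of([sys.executable, "-c", "print('ok')"])
print(f"best of {3}: {elapsed:.3f}s")
```

Taking the best of several runs reduces noise from caches and background load, which matters when comparing small improvements.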

@pfmoore
Member

pfmoore commented Oct 7, 2023

I see this as the future for a lot of large Python applications (or at least in the 500-1000 range)

I'm curious - do you have evidence of this? My experience is that there are a lot of Python applications, but smaller ones that you install via pip or pipx usually have few dependencies, and larger ones either bundle their dependencies, or otherwise avoid the whole "install a bunch of dependencies into a virtual environment" experience for their end user installs - because virtual environment management sucks as an end user experience.

The only really big applications I know of are Apache Airflow, and Home Assistant, and both of these seem to have their own unique challenges.

That's why I like the HA approach of having container and OS builds. Those sound like much better user experiences than "install this 1000-dependency package into a virtual environment".

@notatallshaw
Member Author

notatallshaw commented Oct 7, 2023

I'm curious - do you have evidence of this?

Not right now on a Saturday morning while brewing my first coffee, but I will try and put something together.

The only really big applications I know of are Apache Airflow, and Home Assistant

WinPython and Apache Superset have at least as many dependencies as Apache Airflow, and there are definitely a lot more; I will make a notebook and try to provide more definitive evidence in the future.

I have seen several projects list Apache Airflow as a dependency, but right now I don't have a good way to search for them (I thought there was a website that let you search Python dependencies globally, but I can't seem to find it).

That's why I like the HA approach of having container and OS builds. Those sound like much better user experiences than "install this 1000-dependency package into a virtual environment".

Well, this reduces the problem of installing all these packages from a user problem to a developer problem, which is better but not exactly solved; if the problem becomes too difficult for the developer, then the process breaks anyway.

Further, it only works when a user uses these Python projects as standalone applications; if the user wants to use them as a Python library, modify the container to work in a different environment, or make many other changes, they still have to install everything anyway.

@pfmoore
Member

pfmoore commented Oct 7, 2023

Not right now on a Saturday morning while brewing my first coffee, but I will try and put something together.

No worries, I'm not trying to put pressure on you here. I'm just surprised that our experiences differ so significantly. Not least, we get (as far as I know) few if any issues raised by these larger projects apart from Airflow and (occasionally) HA. I don't honestly know what to make of that information, but I do find it interesting.

As I've said, my personal interest is very much more in the area of making pip fast and convenient for small installs that are run frequently - disposable virtual environments, tools like tox, nox and pip-run, etc.

I'd rather the packaging community found a better way to build and distribute applications than making some sort of "master" package with huge numbers of dependencies that get installed via pip, and that we encouraged shallower, wider dependency trees - but I think I'm a lone voice in the wilderness on that subject, unfortunately.

@notatallshaw
Member Author

notatallshaw commented Oct 7, 2023

I'd rather the packaging community found a better way to build and distribute applications than making some sort of "master" package with huge numbers of dependencies that get installed via pip, and that we encouraged shallower, wider dependency trees - but I think I'm a lone voice in the wilderness on that subject, unfortunately.

But isn't a big problem here that we are using Pip as the main installer and Pip does not guarantee dependencies between installs?

e.g. pip install bar foo is guaranteed to have correct dependencies installed for both foo and bar, but pip install bar followed by pip install foo is not guaranteed to have correct dependencies installed now for bar.

Once we move to a world where Pip is considered low-level install machinery, and we mainly use a tool that manages our environment rather than just our individual installs, won't it be a lot easier to separate things out into smaller packages without worrying that a user will break their environment when they install something?

@pfmoore
Member

pfmoore commented Oct 7, 2023

e.g. pip install bar foo is guaranteed to have correct dependencies installed for both foo and bar, but pip install bar followed by pip install foo is not guaranteed to have correct dependencies installed now for bar.

For applications, that should be solved by pinning, surely? But we're very off-topic here and I don't have any experience managing a big application like this, so let's drop this line of discussion.

@notatallshaw
Member Author

notatallshaw commented Oct 7, 2023

For applications, that should be solved by pinning, surely? But we're very off-topic here and I don't have any experience managing a big application like this, so let's drop this line of discussion.

You've asked me a clarifying question and asked to drop this line of discussion; I will do both, in that order.

It's a problem for applications because, if an application were split into sub-applications, it could not guarantee the user gets the correct dependencies. e.g. pip install monolithic-app is guaranteed to work, but telling users to do pip install sub-app-1, and then when they need sub-app-2 to do pip install sub-app-2, means there is a chance they will break sub-app-1.

So in this sense Pip encourages large monolithic applications with giant dependency trees.

@pradyunsg pradyunsg added type: refactor Refactoring code and removed type: bug A confirmed bug or unintended behavior S: needs triage Issues/PRs that need to be triaged labels Oct 16, 2023
@sbidoul
Member

sbidoul commented Dec 28, 2023

As I was investigating a similar issue, I also threw py-spy at this scenario.

We can indeed see path_to_url standing out (called by find_all_candidates, which is cached). We also see is_satisfied_by standing out (vastly improved by #12453, also with "brute force" caching).

[py-spy flame graph: path_to_url and is_satisfied_by stand out]

@notatallshaw
Member Author

notatallshaw commented Dec 28, 2023

We can indeed see path_to_url standing out (called by find_all_candidates, which is cached). We also see is_satisfied_by standing out (vastly improved by #12453, also with "brute force" caching).

Yes, I believe these are two different O(n^2) issues in Pip: one related to path_to_url, caused by Pip rechecking what it has collected at each step, partially fixed by #12327; and one related to is_satisfied_by, caused by the way resolvelib's algorithm works: sarugaku/resolvelib#147
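The shape of the "brute force" caching idea discussed above can be sketched like this (toy code with a hypothetical call counter, not pip's actual implementation, which lives in pip._internal.utils.url):

```python
# Toy sketch of memoizing a pure path -> URL conversion so that repeated
# rechecks hit a cache instead of recomputing (names are illustrative).
from functools import lru_cache
from urllib.request import pathname2url

underlying_calls = 0


@lru_cache(maxsize=None)
def path_to_url(path):
    """Hypothetical stand-in: convert a filesystem path to a file: URL."""
    global underlying_calls
    underlying_calls += 1
    return "file:" + pathname2url(path)


# Simulate a resolver rechecking the same 1000 files at each of 100 steps.
files = [f"/wheels/pkg{i}.whl" for i in range(1000)]
for _ in range(100):
    for f in files:
        path_to_url(f)

print(underlying_calls)  # 1000: each path is converted once, not 100 times
```

Since the conversion is pure (same path always yields the same URL), a cache keyed on the path turns the quadratic recheck cost into a linear one.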

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 29, 2024