-
-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve speed of Poetry on CI #332
Comments
My observation currently is that the process might be slower with caching than without!?
|
Regarding performance improvements wrt. observations by others, python-poetry/poetry#2184 (comment) by @TheFriendlyCoder might be important to watch. Also, some comments at python-poetry/poetry#2094 might yield more insights. cc @sondrelg |
Don't have much to add here, but I guess whether adding caching for Windows runs is faster or not, should boil down to whether it would be faster to retrieve a cache or download packages from PyPi - everything that happens after that should be the same. Looking at this run, it seems your total cache size is only 50mb, and is retrieved in 1s, so generally I would expect that to beat PyPi, but I guess this isn't really a bottleneck either way. |
It really feels like Poetry is doing some kind of network roundtrips even if all dependencies can be satisfied from the cache. This is kind of reflected by observations from others, see #332 (comment). |
Any correct resolver would need network anyway because if you request |
@webknjaz: I see, thanks for outlining this scenario. So, with respect to eventually sacrificing correctness on this aspect, any outcome from discussions like python-poetry/poetry#1992, python-poetry/poetry#2184 or python-poetry/poetry#3322 might bring some improvements on this? |
I guess... I gave up on poetry years ago. But with pip, you can do something like this by making a wheelhouse directory, and doing something like |
Thanks Sviatoslav. We will look into that and how it might be combined with Poetry. python-poetry/poetry#2873 also discusses that option. I also remembered So, as we will need the synthesized poetry export --dev --extras=http --extras=sql --extras=excel --without-hashes --format=requirements.txt --output=requirements.txt
pip download --dest=~/.cache/wheelhouse --requirement=requirements.txt
pip download --dest=~/.cache/wheelhouse setuptools-scm wheel When restoring the wheelhouse content into a virtualenv using time pip install --no-index --find-links=~/.cache/wheelhouse --requirement=requirements.txt we see it will still take 1m13s on my macOS machine. So, this might also not gain any improvements on Windows. However, I also see that not all packages have been able to be downloaded as wheels: l ~/.cache/wheelhouse/*.tar.gz | wc -l
19 So, a significant amount of time was spent on building wheels from sdist packages beforehand when running So, instead of only caching |
Would you mind clarifying what you mean here? If I have a pegged revision as a dependency (ie: "==1.0.0") I would expect that exact pegged revision to be installed (ie: "v1.0.0") and not some alternative compatible version (ie: "v1.0.0.post0"). Perhaps you meant to use "~=1.0.0" or "^1.0.0" in your example, which would be referring to "compatible" versions and not explicitly pegged revisions. That being said, regardless of the dependency specification that may appear in the pyproject.toml file, the resolver should not need to reach out to the remote repository except when initially generating the poetry.lock file. Once that file is generated it should contain all relevant metadata needed in order to reproduce the dependency tree without additional external resources. So then, assuming that all of the requisite dependencies are stored in the local cache, one should be able to install those dependencies mentioned in the lock file without requiring additional network IO. Maybe I'm missing some nuance here so feel free to correct me if that is the case. |
No, I described exactly how the resolvers in pip and pipenv (IIRC) work. And based on https://www.python.org/dev/peps/pep-0440/#post-releases this is by design. Maybe @pradyunsg could explain better. Another case would be having a universal wheel locally while there's a platform-specific wheel for the same index on PyPI, the resolver has no way of knowing that there's a more precise match unless it talks to the server. Besides, if a version got marked as bad and removed or yanked, the installer should know about that too. |
The section you are referring to simply describes the format and interpretation of the post-release build identifier. That section does not imply a specific dependency resolution behaviour for a given tool. Conversely, there is a section in PEP-440 that explains the expected behaviour for an exact / pegged version matcher that any compliant tool is expected to implement: https://www.python.org/dev/peps/pep-0440/#version-matching. This section does indicate that tools should allow for "strict" vs "non-strict" comparisons but that tools should enable strict enforcement by default. This means that if my pegged revision is "==1.0.0" and there is a 1.0.0.post1 release in the index, that it should (by default) be ignored. However, all this being said, after the initial dependency tree has been generated (ie: while generating the poetry.lock file) the package manager should be able to precisely and unambiguously locate the exact matching package to be deployed on subsequent runs. |
Similar to my previous argument, I can't see how this would lead to any change in behaviour wrt the dependency resolver. At the time the poetry.lock file is generated, the content of the lock file will be based on the current state of the pypi index. If the state of the index changes in any way then I would expect subsequent runs to deploy the exact version that is specified in the lock file, not a comparable / compatible version that may have since been added to the index. I guess the point I'm trying to make is, any potential subtleties wrt cache management and dependency resolution should be avoidable by relying on the package metadata stored in the lock file. |
I am pretty sure that pypi.org no longer allows maintainers to delete / yank / pull bad packages from the public repository. However, even if that were possible there are many other ways for that to break dependency resolution tools (ie: consider the case of repository mirrors) which is probably why it is prevented or strongly discouraged. Conversely, in the case where a system has a cached dependency that has been removed from the remote repository, I would expect subsequent Poetry executions to still resolve and deploy the dependency correctly so long as that exact dependency / version is still referenced in the lock file and it can be discovered in the local cache. In general, whether the tool should purge a package from the local cache if it has been removed from the remote repository when installing dependencies without a lock file could be debatable (I can see rationale for and against it), this need not dictate how the cache should be managed when installing dependencies from a lock file. |
I would suggest basing the mechanics of cache management on comparable tools from other languages like Gradle, NuGet, etc. In those cases the tools mostly allow deploying of packages from local caches, either on demand or when they are unable to connect to the remote repository. |
I simply pointed out how things work right now without speculating about how they should work. I recommend raising issues on respective trackers of the upstream projects if you're looking for a change.
This is 100% incorrect. PyPI has been allowing deleting packages from the index for many years and there are no plans to disallow this (as long as a given user's role is an "owner", not just a "maintainer" — https://pypi.org/help/#collaborator-roles). One remark tho: deleting a version removes it from the available version lists but not from the CDN servers that host actual sdists and wheels. Another remark: "yank" is not the same as "delete", it's basically a special flag that tells clients like pip that is strongly discouraged to use a yanked package version but doesn't prevent them from doing so. Old pip versions ignore this flag, while the new ones try to respect it when the user requests an unpinned package version but if it's pinned, it'll go ahead and consider installing a yanked version. Also, PyPI has red badges for yanked versions, here's an example of how this looks: And here's additional info on yanking: Another reason for removing packages is if they contain malware/spam/namesquatting/etc per https://www.python.org/dev/peps/pep-0541/#invalid-projects. Is using the malware from cache more important than taking into account such security issues? The index must be used for resolution for as long as it's enabled. If you want local cache instead, there just should be a mechanism for disabling the unwanted indexes (FWIW pip already knows how to treat local dirs as indexes). |
Dear all, thank you very much for your feedback on this topic. Very much appreciated.
Exactly. So, I conclude that we should investigate whether Poetry also offers such an option and whether it will be able to deliver respective performance gains. Otherwise, going through With kind regards, |
I can see how malware being introduced into the dependency tree would be one of those extenuating circumstances where you may want to fail a deployment even when a revision is pegged. Thanks for pointing that out. If there was at least an option or flag to disable the index lookup, that would put the user / developer back in control, which would satisfy my use case. I hope that this will be implemented in Poetry. Juggling multiple package management tools, using pip for some things and Poetry for others, would just introduce unnecessary complexity into the tool chain. |
Hi again, it looks like discussions around this topic are going on on the Poetry issue tracker since quite some time already, see python-poetry/poetry#331 and python-poetry/poetry#559, eventually leading to python-poetry/poetry#908. Then, it goes on with python-poetry/poetry#1554 and python-poetry/poetry#1070. Finally, the main issue tracking respective improvements seems to be python-poetry/poetry#1632, eventually leading to python-poetry/poetry#2074 by @Zyava. However, progress on this apparently has stalled and @BrandonLWhite and @mcouthon haven't been able to take over. Now, @ZeroAurora started another attempt with python-poetry/poetry#3624 which also relates to python-poetry/poetry#3479 by @zillionare. I wish you the best of luck to be able to finally tackle this, @ZeroAurora! It would be really nice if this patch can also bring in a feature to not use any index at all by adding a
With kind regards, P.S.: I've just added a comment about our use case at python-poetry/poetry#3624 (comment). |
Hey everyone. Thank you for your encouragement, but this kind of issue is somewhat out of the scope of my PR. My PR focuses on those who can't connect to PyPI (or can but with a slow bandwidth). I've mentioned a probable solution in python-poetry/poetry#3624 (comment) but I'm sorry that I can't work on this since I don't have extra time and energy. I hope someone will work on this (or an offline mode). Good luck everyone! Hope this issue get resolved! |
Hi @ZeroAurora, thanks for your kind response.
I see. Thanks for your explanation.
It actually does, but Poetry will still reach out to the package index and this apparently adds some slowness compared to running with
This! 👍 With kind regards, |
@sinoroc just suggested at python-poetry/poetry#3624 (comment):
I answered:
|
Well you could look into dumb-pypi or bandersnatch then. But it still feels like an overkill. |
Hi there,
@gutzbenj tried to improve speed on GHA CI through #328 and I later chimed in through #331.
We observed different problems, specifically when caching whole Python virtualenvs using GHA's
actions/cache
and are discussing it with @sondrelg at snok/install-poetry#18.However, the recommendation is actually not to do it at all, see actions/cache#175. Otherwise, this will probably bring in more problems than not. We should really trust @webknjaz and @pradyunsg here:
The reason why it takes twice the time to install packages from the download cache into the environment on Windows is that Windows is just slow when handling thousands of small files, as @maphew suggested at actions/cache#175 (comment).
With kind regards,
Andreas.
The text was updated successfully, but these errors were encountered: