Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: remove isdir() check in copy_file(), about 10% of the run time was checking if the file to copy is a directory. #258

Open
wants to merge 22 commits into
base: main
Choose a base branch
from

Conversation

morotti
Copy link

@morotti morotti commented May 31, 2024

( I originally raised the PR to setuptools pypa/setuptools#4406 and they redirected me to this repo. )

Hello,
I was doing a bit of profiling on my CI builds. The step to build package (python setup.py bdist_wheel) is taking more time than I would like.

Going in with a commercial profiler, I am finding that nearly 10% of the time is taken in isdir() calls, mostly in the copy_file() function.
Which is quite surprising IMO because the copy file function should not be given directories anyway :D

Can I offer to remove the isdir() calls to get a speedup?
I've found 2 calls that were passing directories instead of a filepath and corrected them.

By the way, the whole _copy_file_contents() could be replaced by shutil.copyfile() for another little boost. The code goes back to year 2000 or before, which predates optimized shutil functions.

Screenshot of the profiler run. The whole run is about 20+ seconds.
image

All tests are passing in the setuptools repos, including all integration tests.
They don't seem to be able to run in this repo, failing in some macos imports

Regards.

@morotti morotti force-pushed the performance-distutils branch from 6a54b19 to 106561e Compare June 20, 2024 10:58
…was checking if the file to copy is a directory.
@morotti morotti force-pushed the performance-distutils branch from 106561e to 745640e Compare June 20, 2024 12:55
@morotti
Copy link
Author

morotti commented Jun 20, 2024

@jaraco I saw you fixed the tests on master. I've rebased and tests are green now for this PR. Can you have another look?

There is a diffcov workflow complaining about less test coverage? If i understand correctly, it's saying 2 lines are not covered by the tests, inside bdist_rpm code that's copying the icon of the RPM. I hope that's not a blocker.

Comment on lines +319 to +320
dest = os.path.join(source_dir, os.path.basename(source))
self.copy_file(source, dest)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm uneasy with this pattern being now duplicated 5 times across the codebase. I'd rather see something like:

Suggested change
dest = os.path.join(source_dir, os.path.basename(source))
self.copy_file(source, dest)
self.copy_file_to_dir(source, source_dir)

And then do some refactoring to make copy_file_to_dir perform the dest operation.

Comment on lines +73 to +74
"""Copy a file 'src' to 'dst'.
(If the file exists, it will be ruthlessly clobbered.) If 'preserve_mode'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can assume that distutils is the only consumer of this function. Maybe no one else is using it, but we can't simply remove it without risking breaking users. Let's do this instead:

  1. Rename copy_file to _copy_file_direct, implementing this new, limited interface.
  2. Create a new copy_file that implements the backward-compatible behavior (infer dst from dir), raises a DeprecationWarning, and then calls _copy_file_direct.
  3. Retain the logic of emitting only the dirname if the basenames match (I don't think that change is necessary to get the performance gain).
  4. Create a new _copy_file_to_dir that assumes the dst is a dir and performs the join and basename, calling _copy_file_direct.

What do you think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for 3. the logic is broken as far as I can tell. we can't keep the syscall to isdir() to determine whether we're dealing with a directory because that's what makes the function slow.
for context, this function is called for thousands of files, it's used by distutils and setuptools/pip when working with packages. (the profiling run was for python setup.py bdist_wheel)

I think we can't have a replacement function be private with a underscore. It's a public API that is used in other places. it think linters will complain if we import a private _ function in other places (note that setuptools is vendoring this repo and might have more CI rules).

searching on github distutils.file_util AND copy_file AND path:.py, it's finding 2k files using this function. Going over the results, I see a few are passing a directory. That is problematic. :(
I don't think adding a warning would be helpful, distutils is removed in python 3.12 and already raising warnings/errors. If people ain't fixing errors because the module is removed, they're not going to care about yet another warning. ^^

what I can think of?

  • make two functions copy_file_direct() and copy_file_to_dir().
  • fix code inside of distutils to use the appropriate function.
  • keep copy_file() logic to check whether the argument is a directory + call one of the 2 functions.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've started the refactoring... but unfortunately found out copy_file() is wrapped inside of distutils.cmd::Command class and there are a dozen of invocations through self.copy_file across the codebase.
If we don't want to fix copy_file() but would rather ship a copy_file_direct() + copy_file_to_dir(), then we would have to expose both functions though the class and adjust a dozen invocations. it's not hard to do but that's a fair amount of boilerplate and code to refactor.
I would like to have confirmation we want to do that before engaging into that.

realistically, distutils is removed from the interpreter as of python 3.12, it's only available inside of setuptools where it's vendored. I initially made the PR on setuptools where it works and pass all tests, it provides a welcome performance improvement to package operations.

Is there really a risk in fixing copy_file() to break other packages? packages that are not already broken because distutils is removed.

Copy link
Contributor

@abravalheri abravalheri Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hyrum's law usually hits hard on distutils/setuptools...

If you want to define 2 other new functions (one for files or one for directories) it should be fine. But my opinion is that we should keep the old behaviour (maybe behind a deprecated warning) for a couple of years before dropping it (if ever dropping would be an option... we have to remember that some packages in the ecosystem, haven't even moved to setuptools yet...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, that sounds messy. Let's consider some other options.

@jaraco
Copy link
Member

jaraco commented Jun 28, 2024

A couple of alternative ways we might be able to improve performance without changing the API:

  1. Instead of testing for is_dir, simply attempt each copy in a try/except that first attempts to copy to {dest}/{basename(source)} and if that fails, copy to dest.
  2. Wrap is_dir in a cache function so it doesn't have to go to the file system for repeated copies to the same dir.

@abravalheri
Copy link
Contributor

abravalheri commented Jun 28, 2024

I wonder if it makes sense to shortcircuit the condition with a (yet another) new keyword only parameter (e.g. isdir: None | Literal[True] | Literal[False], "tri-state")? Something like:

def copy_file(  # noqa: C901
    src,
    dst,
    preserve_mode=1,
    preserve_times=1,
    update=0,
    link=None,
    verbose=1,
    dry_run=0,
+   *,
+   isdir=None,
):
   ...
-   if os.path.isdir(dst):
+   if isdir is True or (isdir is None and os.path.isdir(dst)):
        dir = dst
        dst = os.path.join(dst, os.path.basename(src))
    else:
        dir = os.path.dirname(dst)

The pro of such approach is that it is backwards compatible... The cons is that it introduces yet another argument in a function that is already full of arguments AND it feels very messy...

@morotti
Copy link
Author

morotti commented Jul 1, 2024

I am trying to think of what to do without getting into horrible hacks 🤔 🤔 🤔
The caching is tempting as a simple solution but the invalidation is going to be a problem. e.g. code to uninstall a wheel then reinstall a different version.
Having two functions copy_file_to_file() copy_file_to_dir() is a bit verbose but it's clear on what they do and won't cause surprises to the next maintainers or users.

@jaraco
Copy link
Member

jaraco commented Aug 27, 2024

The caching is tempting as a simple solution but the invalidation is going to be a problem. e.g. code to uninstall a wheel then reinstall a different version.

That's a good point, although since is_dir() is checking for whether a target is a dir or a file, it seems unlikely that it would be one and then the other, even when installing different packages.

My instinct - unless there's a clean way to implement this performance improvement, we'll probably want to defer this improvement until distutils is no longer an independent code base (importable as distutils) and is fully integrated with Setuptools.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants