repro: DVC is slow with million of files #7681
Comments
Hi @nsorros, thank you for the detailed report. A few questions about your points:
Creating an archive (zip, tar or gzip) and tracking it as an out instead of tracking it as a directory would speed up dvc considerably, since it would not require dealing with 1M+ objects.

from zipfile import ZipFile

zip_file = "/path/to/zip"
with ZipFile(zip_file) as archive:
    for file_name in archive.namelist():
        with archive.open(file_name) as fh:
            data = fh.read()
            # do something with data

Of course, this approach will not always be possible, depending on how you need to use the directory contents.
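For the producing side, a minimal sketch of the same idea, packing the stage's output files into a single archive so DVC only has to hash one object (the directory name, file pattern and archive path below are assumptions, not taken from the original stage):

from pathlib import Path
from zipfile import ZipFile, ZIP_DEFLATED

out_dir = Path("embeddings")   # hypothetical directory the stage currently writes to
zip_file = "embeddings.zip"    # single file to declare as the DVC out instead

with ZipFile(zip_file, mode="w", compression=ZIP_DEFLATED) as archive:
    for path in out_dir.rglob("*.npy"):
        # store each file relative to the directory root inside the archive
        archive.write(path, arcname=str(path.relative_to(out_dir)))

The stage would then list the archive as its out rather than the directory itself.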
Related #7607
It always takes time since, as I understand it, it recalculates the hashes to check if something has changed.
This might be difficult as the actual process that fails takes hours to complete but I will try to reproduce the problem in a different script to give you more information.
I think so yes.
4 on the AWS instance (it's a GPU instance) and 8 locally (Apple M1).
I will try the zip approach to see how it speeds up things and come back. Other than the actual problems
We don't have capacity to work on this in the short to medium term. Also, this item is not very actionable and we have other focused tickets regarding this. Closing for now.
Description
We are experiencing some issues with DVC in a task that produces 3M files as an output. For context, these are embeddings from chunks of documents. In this situation some commands error while others take a lot of time to complete, which makes working with dvc not an option. To be fair, producing 3M files that need to be hashed every time is understandably above the limits DVC expects.
I have not been able to reproduce all problems below but let me mention them briefly:
1. dvc status takes 20+ minutes to calculate hashes
2. dvc repro fails to complete. The command finishes fine but some step after creates an invisible error
3. git commit with the pre commit hook takes minutes since it checks the hashes before switching branch
4. dvc pull throws ERROR: failed to transfer 'md5: xxx' - Could not connect to the endpoint URL: xxx for a lot of files
5. git push with the pre push hook takes minutes, so the connection to GitHub is lost while dvc is pushing files

For 3 I ended up removing the pre commit hook.
For 4 I had to increase the file number limit with ulimit -n 1024.
For 5 I ran dvc push before git push.
For 2 I am not sure what caused the error; it could be related to the number of files opened, but I am still investigating.

To reproduce I wrote a simple script that produces 1M random numpy vectors and saves them. I am including that below.
I noticed that dvc repro takes minutes, sometimes hours to complete even when it does not run the command because the stage is cached. I wonder whether DVC should throw a ⚠️ warning in cases where a user runs a command that makes it work outside some limits, for example 100K files. This warning could be thrown when DVC goes into the process of calculating hashes and it could redirect into a troubleshooting page for working with many files.
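As a rough illustration of this suggestion (the threshold, function name and message below are made up for the sketch and are not DVC internals), the check would only need the number of objects about to be hashed:

import logging

LARGE_DIR_THRESHOLD = 100_000  # example limit taken from the suggestion above

def warn_if_large(num_files: int, threshold: int = LARGE_DIR_THRESHOLD) -> None:
    """Warn before hashing when a tracked directory is unusually large."""
    if num_files >= threshold:
        logging.warning(
            "Hashing %d files is expected to be slow. "
            "See the troubleshooting page on working with many files: LINK",
            num_files,
        )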
I also wonder what the recommended way to work in these situations is. For one, it seems that some or all hooks should be dropped. Then, would it be quicker if the user zips the files so that only the hash for the zip is calculated? Is there another workaround to speed up the hash calculation? The solution I see atm is removing the outs or the stage altogether.
Finally, another suggestion related to 4: the problem seems to be about too many open files, but the pointer to the troubleshooting guide only came at the end. The error itself was confusing in that it seemed like the remote was not working properly. If DVC can detect that too many files are open and change the error accordingly, this would be helpful. This is because if someone stops the operation early (as I was doing at the start) they never get to see the recommendation at the end which points to the right solution.
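To illustrate the detection idea (purely a sketch, not how DVC reports errors today), the current descriptor limit can be read with the standard resource module and compared with the number of files a transfer is about to open:

import resource

def check_open_file_limit(planned_open_files: int) -> None:
    """Fail early with a clear message if the soft open-file limit would be exceeded."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if planned_open_files > soft:
        raise OSError(
            f"About to open ~{planned_open_files} files but the current limit is {soft}. "
            "Raise it (e.g. ulimit -n) or reduce the number of parallel transfers."
        )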
Reproduce
scale.py
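A minimal sketch of what such a script could look like, assuming 1M vectors of an arbitrary size (512 here) written as individual .npy files into a data/ directory:

from pathlib import Path
import numpy as np

N_VECTORS = 1_000_000  # one million vectors, as described above
DIM = 512              # assumed embedding size

out_dir = Path("data")
out_dir.mkdir(exist_ok=True)

for i in range(N_VECTORS):
    vector = np.random.rand(DIM).astype(np.float32)
    np.save(out_dir / f"{i}.npy", vector)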
dvc.yaml
Expected
dvc repro could throw a warning at the point where it would start calculating hashes. Same for dvc status.

WARNING: Calculating 1M hashes is expected to be slow. Here are some tips on how to work with a lot of files LINK
Environment information
Output of dvc doctor: