dvc add * for 21gb of files that are already checked in to dvc takes 7 minutes #10251
Comments
We are aware of this and are working to fix it. Multiple issues are open regarding this, so I will close this in favor of those.
I have a similar issue with a dataset of ~50GB, which takes hours whenever I add a few files. I hope this will be fixed soon.
@RadouaneK, if you are only adding a few files, you can instead pass the filenames within the dataset that you are modifying. E.g.:
data
└── file1
See https://dvc.org/doc/user-guide/data-management/modifying-large-datasets.
This seems to be a long-standing issue. Is there an ETA for a fix or improvement to the
Unfortunately, we don't have any ETA to share at the moment.
Bug Report
Description
I have 21GB of laz files (spread across 2554 files) which I'm storing in DVC. Below are the times for various operations. Since DVC is ultimately computing the md5sum of each file and potentially (but not in this case) updating the .dvc files and .gitignore, I don't see why dvc add should take much longer than 40 seconds on my machine.
dvc add **/*.laz  309.93s user 33.29s system 83% cpu 6:52.26
md5sum **/*.laz  35.41s user 3.62s system 98% cpu 39.675 total
xargs -n 1 -P 128 md5sum  41.96s user 10.64s system 898% cpu 5.857 total
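For anyone who wants to reproduce the hashing baseline without shell tools, here is a minimal Python sketch of the same idea as the xargs -P run above: hash each file in chunks and spread the files over a pool of workers. The function names md5_of_file and md5_many are made up for illustration and are not part of DVC. The sketch assumes a thread pool is adequate because hashlib releases the GIL while hashing large buffers. It says nothing about how dvc add is implemented internally.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of one file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read until read() returns b"", so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_many(paths, workers=8):
    """Hash many files concurrently; returns {path string: hex digest}.

    `workers` is an illustrative default, not a tuned value.
    """
    paths = [str(p) for p in paths]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its digest.
        digests = list(pool.map(md5_of_file, paths))
    return dict(zip(paths, digests))
```

Calling md5_many(pathlib.Path(".").rglob("*.laz")) under a timer gives a number that can be set against the md5sum and xargs timings above. Actual speedup will depend on disk throughput and core count.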
Reproduce
dvc add all the files and commit the change
time dvc add
Expected
time dvc add takes a little more than the time to md5sum the same set of files
Environment information
Output of dvc doctor:
Additional Information (if any):