dvc add * for 21gb of files that are already checked in to dvc takes 7 minutes #10251

Closed
froody opened this issue Jan 23, 2024 · 6 comments

Comments

@froody

froody commented Jan 23, 2024

Bug Report

Description

I have 21GB of laz files (spread across 2554 files) which I'm storing in DVC. Below are the times for various operations. Since DVC is ultimately computing the md5sum of each file and potentially (but not in this case) updating the .dvc files and .gitignore, I don't see why dvc add should take much longer than 40 seconds on my machine.

dvc add **/*.laz 309.93s user 33.29s system 83% cpu 6:52.26
md5sum **/*.laz 35.41s user 3.62s system 98% cpu 39.675 total
xargs -n 1 -P 128 md5sum 41.96s user 10.64s system 898% cpu 5.857 total
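For readers who want to reproduce the md5sum baseline without shell tools, here is a minimal Python sketch. It is an illustration only and says nothing about how DVC hashes files internally; the function names, the default `*.laz` pattern, and the worker count are assumptions made for the example. Threads are used because `hashlib` releases the GIL while hashing large buffers, so several files can be hashed concurrently, roughly in the spirit of the `xargs -P` run above.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of one file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_of_tree(root, pattern="*.laz", workers=8):
    """Hash every file under root that matches pattern; return {Path: digest}."""
    paths = sorted(p for p in Path(root).rglob(pattern) if p.is_file())
    # hashlib releases the GIL for large buffers, so threads overlap usefully.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = list(pool.map(md5_of_file, paths))
    return dict(zip(paths, digests))


def time_hashing(root, pattern="*.laz", workers=8):
    """Return (number of files hashed, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = md5_of_tree(root, pattern=pattern, workers=workers)
    return len(result), time.perf_counter() - start
```

Calling `time_hashing(".", workers=1)` and then again with a larger worker count gives a rough sequential versus parallel comparison on the same files; actual numbers depend on disk and CPU.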

Reproduce

  1. Populate a DVC repo with 21 GB of random data evenly spread across 2554 files
  2. dvc add all the files and commit the change
  3. Run time dvc add on the same files again
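Step 1 could be scripted along the following lines. This is a sketch under assumptions: the helper name `make_dataset`, the `file_00000.laz` naming scheme, and the chunk size are invented for illustration, and the files contain random bytes, not valid LAZ data. The defaults mirror the 2554 files and 21 GB from the report; pass smaller values for a quick trial.

```python
import os
from pathlib import Path


def make_dataset(root, n_files=2554, total_bytes=21 * 1024**3,
                 chunk_size=1024 * 1024):
    """Write n_files files of random bytes under root, total_bytes overall.

    Sizes differ by at most one byte so the data is spread evenly.
    """
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    base, extra = divmod(total_bytes, n_files)
    paths = []
    for index in range(n_files):
        # The first `extra` files absorb the remainder, one byte each.
        remaining = base + (1 if index < extra else 0)
        path = root / f"file_{index:05d}.laz"  # hypothetical naming scheme
        with open(path, "wb") as handle:
            while remaining > 0:
                step = min(chunk_size, remaining)
                handle.write(os.urandom(step))
                remaining -= step
        paths.append(path)
    return paths
```

For example, `make_dataset("data", n_files=100, total_bytes=100 * 1024**2)` writes a 100 MiB dataset that can then be used for steps 2 and 3.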

Expected

time dvc add takes a little more than the time to md5sum the same set of files

Environment information

Output of dvc doctor:

DVC version: 3.35.0
-------------------
Platform: Python 3.10.12 on Linux-6.2.0-36-generic-x86_64-with-glibc2.35
Subprojects:
	dvc_data = 3.1.0
	dvc_objects = 3.0.0
	dvc_render = 1.0.0
	dvc_task = 0.3.0
	scmrepo = 2.0.2
Supports:
	http (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
	s3 (s3fs = 2023.12.2, boto3 = 1.33.13)
Config:
	Global: /home/tbirch/.config/dvc
	System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/vgubuntu-root
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/mapper/vgubuntu-root
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/7441350f0cbbb0aad68bca587b098ee0

Additional Information (if any):

@skshetry
Member

Duplicate of #8008, #7607, and #3177.

@skshetry
Member

We are aware of this and are working to fix it. Multiple issues are open regarding this, so I will close this in favor of those.

@RadouaneK

I have a similar issue with a dataset of ~50GB, where dvc add takes hours whenever I add a few files. I hope this will be fixed soon.

@skshetry
Member

skshetry commented Jan 27, 2024

@RadouaneK, if you are only adding a few files, you can instead pass the filenames of the dataset that you are modifying.

For example:

data
└── file1

Instead of running dvc add data, you can run dvc add data/file1. Only those files will be updated. Similarly, you can pass a subdirectory if you are modifying a subdirectory of the dataset (e.g. dvc add data/dir/subdir_or_file).

See https://dvc.org/doc/user-guide/data-management/modifying-large-datasets.
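As a rough illustration of the workaround above, the following Python sketch collects recently modified files and passes only those to `dvc add`. The helper names and the modification-time heuristic are assumptions made for this example, not part of DVC; the only DVC behavior relied on is that `dvc add` accepts individual paths inside a tracked dataset, as described in the comment above.

```python
import subprocess
from pathlib import Path


def files_modified_since(dataset_dir, since_timestamp):
    """Return files under dataset_dir whose mtime is newer than since_timestamp.

    Assumption: mtime is a good enough signal for "changed"; DVC itself is
    not consulted here.
    """
    return sorted(
        str(path)
        for path in Path(dataset_dir).rglob("*")
        if path.is_file() and path.stat().st_mtime > since_timestamp
    )


def build_dvc_add_command(targets):
    """Build the argument list for `dvc add` limited to the given targets."""
    if not targets:
        raise ValueError("no targets given")
    return ["dvc", "add", *targets]


def add_changed_files(dataset_dir, since_timestamp):
    """Run `dvc add` on changed files only; return None if nothing changed."""
    targets = files_modified_since(dataset_dir, since_timestamp)
    if not targets:
        return None
    return subprocess.run(build_dvc_add_command(targets), check=True)
```

With the example layout, `build_dvc_add_command(["data/file1"])` yields the same command as typing `dvc add data/file1`. For very many changed files, passing the common subdirectory instead may be more practical than listing every path.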

@code-inflation

This seems to be a long standing issue. Is there an ETA for a fix or improvement to the dvc add performance?

@skshetry
Member

skshetry commented Feb 4, 2024

> This seems to be a long standing issue. Is there an ETA for a fix or improvement to the dvc add performance?

Unfortunately, we don't have any ETA to share at the moment.
