
DVC very slow for datasets with many files #7607

Closed
RBMindee opened this issue Apr 20, 2022 · 17 comments
Labels
A: data-management (Related to dvc add/checkout/commit/move/remove)
A: data-sync (Related to dvc get/fetch/import/pull/push)
p1-important (Important, aka current backlog of things to do)
performance (Improvement over resource / time consuming tasks)

Comments

@RBMindee

RBMindee commented Apr 20, 2022

A 1.7 GB dataset containing 600k files takes many hours with dvc add or dvc push/pull.

Description

Hi all, we have been using dvc for a while on medium-sized datasets, but we struggle when trying to use it with big ones. We are unsure if it is due to our poor use of the tool or if it is a real bug.

We have a dataset containing about 600k small files for a total of 1.7 GB. The repo is configured to store data on S3. Uploading the whole dataset with aws s3 cp takes only a few minutes.

We have two problems and are unsure whether they are DVC problems or due to us misusing the tool:

  1. If we run dvc add /path/to/dataset, it takes a few tens of minutes to add, which is okay I think, as all MD5 hashes must be computed.
    Then if we run dvc status, it is fast. However, if we change a single file, all MD5 hashes are recomputed and just checking the status takes ages again.
    We sort of solved that problem: since the dataset is composed of many smaller subfolders, we add files with dvc add /path/to/dataset/*/* instead. Then everything is fine, but this seems quite odd. Is there a better way to do it?

  2. Once we are able to add, the problem is dvc push/pull. This takes between 6 and 8 hours, which seems too much for less than 2 GB. It seems that dvc is uploading every file separately? Are we doing something wrong?

Reproduce

I cannot share my own dataset, but it weighs 1.7 GB and contains 600k files split across subfolders which themselves have subfolders.

dvc remote add -d origin path/to/s3/dir
dvc add /path/to/dataset/*/*
git commit -am 'test'
dvc push

Expected

I think uploading the files to S3 should be considerably faster than this?

Environment information

Output of dvc doctor:

$ dvc doctor

DVC version: 2.9.5 (deb)

Platform: Python 3.8.3 on Linux-5.4.0-107-generic-x86_64-with-glibc2.14
Supports:
azure (adlfs = 2022.2.0, knack = 0.9.0, azure-identity = 1.7.1),
gdrive (pydrive2 = 1.10.0),
gs (gcsfs = 2022.1.0),
hdfs (fsspec = 2022.1.0, pyarrow = 7.0.0),
webhdfs (fsspec = 2022.1.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
s3 (s3fs = 2022.1.0, boto3 = 1.20.24),
ssh (sshfs = 2021.11.2),
oss (ossfs = 2021.8.0),
webdav (webdav4 = 0.9.4),
webdavs (webdav4 = 0.9.4)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p2
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/nvme0n1p2
Additional Information (if any):

@daavoo
Contributor

daavoo commented Apr 20, 2022

Hi @RBMindee !

We are unsure if it is due to our poor use of the tool or if it is a real bug.

From what you describe, there is nothing wrong with your usage and it's most likely an issue with DVC performance.


Regarding point 1, it's discussed/tracked in #7390. We are currently working on improving status performance in #7597


Regarding point 2, it's discussed/tracked in #6222. We are also working on improving the transfer speed, though there is no PR yet.

Could you check, with a small subset of the data, if passing --jobs=1 has an impact on the performance of the transfer?

It would be useful if you could share a profile of the run (even for a small subset, so it doesn't take much), e.g. with viztracer:

pip install viztracer
dvc pull --viztracer --viztracer-depth 6
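
For example, a single invocation covering both checks could look something like this (the --jobs value and the depth are just illustrative, using the viztracer flags shown above):

dvc pull --jobs=1 --viztracer --viztracer-depth 6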

@daavoo daavoo added A: data-sync Related to dvc get/fetch/import/pull/push A: data-management Related to dvc add/checkout/commit/move/remove performance improvement over resource / time consuming tasks labels Apr 20, 2022
@RBMindee
Author

RBMindee commented Apr 20, 2022 via email

@RBMindee
Author

RBMindee commented Apr 21, 2022

I tried to run viztracer. I had to run it on a very small subset (3000 files) in order not to lose any information though:
viztracer.dvc-20220421_161104.zip

I timed three configurations as a benchmark:

  • dvc pull
  • aws s3 cp on the actual files downloaded by dvc (the dvc remote directory)
  • aws s3 cp on the files that I am versioning.

It appears that:

  • dvc pull is 3x slower than aws s3 cp on the same files. (31s vs 11s)
  • aws s3 cp on the original files is still slightly faster (9s)

For the sake of completeness, please note that I had to run it on another machine. Here is the new output of dvc doctor.
DVC version: 2.10.1 (pip)
Platform: Python 3.8.10 on Linux-5.13.0-39-generic-x86_64-with-glibc2.29
Supports:
webhdfs (fsspec = 2022.3.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
s3 (s3fs = 2022.3.0, boto3 = 1.21.21)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p2
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/nvme0n1p2
Repo: dvc, git

@daavoo
Contributor

daavoo commented Apr 21, 2022

Thanks @RBMindee! Could you run the same but with --jobs=1 and --viztracer-depth 8?

@RBMindee
Author

Forgot about the --jobs flag. Here it is:
viztracer.dvc-20220421_172832.zip

Additionally, I also timed aws s3 cp on a zip archive containing all of the cache, and it takes 0.27s, about 100x faster than the original command. Zipping directories and subdirectories was originally proposed in #6222 but was not discussed. Is there a reason?

@daavoo
Contributor

daavoo commented Apr 22, 2022

Additionally, I also timed aws s3 cp on a zip archive containing all of the cache, and it takes 0.27s, about 100x faster than the original command. Zipping directories and subdirectories was originally proposed in #6222 but was not discussed. Is there a reason?

Ideally, we should reduce the performance gap between the cloud API and DVC transfer to a minimum (there is always going to be some). However, whether you use DVC or not, transferring a compressed artifact is going to be faster than a directory of files.

It depends on your use case, and it's totally ok if you don't see any downsides. Off the top of my head, the tradeoffs I can think of are that transferring individual files allows granular access (i.e. downloading a single file) and doesn't require the additional compression step.

In addition to the above, when using DVC there is an additional tradeoff. DVC handles deduplication at the file level and compressing the directory completely removes this functionality (although this could be addressed in #829).

There is nothing stopping you from zipping the directory as part of a DVC stage and tracking and transferring the compressed artifact via DVC.

We use this approach ourselves in our example dataset registry (https://github.com/iterative/dataset-registry/tree/master/tutorials/versioning) but we could improve the visibility of this alternative in the docs.
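
As a rough sketch (the stage name, paths, and tar command below are placeholders for illustration, not a recommended setup), such a stage could look something like:

dvc stage add -n compress-dataset \
    -d data/dataset \
    -o dataset.tar.gz \
    "tar czf dataset.tar.gz -C data dataset"
dvc repro compress-dataset
dvc push

dvc push then only needs to transfer the single archive instead of hundreds of thousands of individual files, at the cost of the per-file deduplication and granular access mentioned above.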

@dberenbaum
Collaborator

To do dvc add for 1 million+ XML annotation files (totaling ~4GB) taken from https://image-net.org/data/bboxes_annotations.tar.gz, here's the viztracer output for the ~90 min runtime:

viztracer_add.zip

(Screenshot: viztracer output of dvc add, 2023-03-06)

Related: #3177

@dberenbaum
Collaborator

I have tested with different sizes and approaches, and even when dvc add is a noop (no files have changed), I end up with something that looks like the above where checkout takes most of the time. Any ideas why this could be the case?

@efiop
Contributor

efiop commented Apr 4, 2023

@dberenbaum Yes, it is because we force a relink on dvc add to make sure that we take up the minimal amount of data storage possible by relinking the files to the cache. We might want to abandon that, but ideally, we would just use link type from metadata to detect that the links are already correct and we don't have to do anything with them. Trying to find another issue that we've discussed this in before...

@dberenbaum
Collaborator

Okay, sounds familiar, although I also can't remember where we discussed it before 😄.

we would just use link type from metadata to detect that the links are already correct and we don't have to do anything with them

Let's create an issue for this if we don't have one specific to it?

@efiop
Contributor

efiop commented Apr 5, 2023

@dberenbaum We definitely have it somewhere, just failing to find it.

E.g. related iterative/dvc-data#274

@mathematiguy

This blog post makes an argument for either zipping or otherwise consolidating large file counts into something more manageable. It makes the fair point that transferring millions of files via S3 or Azure is expensive, and that the filesystem doesn't like it either.

So perhaps having too many files really should just be avoided, past a certain point.

https://fizzylogic.nl/2023/01/13/did-you-know-dvc-doesn-t-handle-large-datasets-neither-did-we-and-here-s-how-we-fixed-it

@dberenbaum
Collaborator

To do dvc add for 1 million+ XML annotation files (totaling ~4GB) taken from https://image-net.org/data/bboxes_annotations.tar.gz, here's the viztracer output for the ~90 min runtime:

Revisited this scenario and wanted to leave notes here.

It took about an hour, compared to 26 minutes to do raw MD5 calculations. This isn't a fair comparison since dvc is also caching and linking the files, and most of the time is spent in checkout (which isn't surprising since we still haven't had a chance to address iterative/dvc-data#274 and stop relinking everything).

Here's the dvc add viztracer output:

(Screenshot: viztracer output of dvc add, 2024-02-05)

viztracer.zip

@dberenbaum
Collaborator

If I set relink=False in this code block, a noop dvc add drops all the way to 6 minutes.

dvc/dvc/output.py, line 1343 in dd2b515:

self, path: Optional[str] = None, no_commit: bool = False, relink: bool = True

@dberenbaum
Collaborator

To do dvc add for 1 million+ XML annotation files (totaling ~4GB) taken from https://image-net.org/data/bboxes_annotations.tar.gz, here's the viztracer output for the ~90 min runtime:

As of dvc 3.50.1, dvc add --no-relink took ~20 minutes for me with the same dataset.

I have tested with different sizes and approaches, and even when dvc add is a noop (no files have changed), I end up with something that looks like the above where checkout takes most of the time.

dvc add --no-relink took ~7 minutes when nothing has changed. Most time was spent querying the cache.
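
For anyone reproducing this, the command form is simply the following (the dataset path is a placeholder):

dvc add --no-relink path/to/dataset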

I don't think there's anything else actionable in this issue that isn't covered in other issues, so closing this as completed.

@alita-moore

dvc pull is still very slow because it seems that the "applying changes" step runs single-threaded?

@alita-moore

@mathematiguy it would make a lot more sense if dvc automatically reduced the number of files under the hood by compressing them, and then extracted them when you pull. That way the overhead would be removed from the data management process.
