How to deal with 20 million+ files #10450
alita-moore asked this question in Help (Unanswered)
Replies: 1 comment · 7 replies
-
I want to use DVC to manage 20 million+ small files, but it seems quite slow when dealing with that many files. Is there a common way of handling cases like this, such as using an intermediate zip file or something to that effect? Is 20 million files beyond the intended scope of the tool, or should I use a different tool instead?
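(As a rough illustration of the "intermediate zip file" idea raised in the question, and not an approach proposed in this thread: one option is to pack the small files into a few large tar shards and let DVC track only the shards. A minimal sketch follows; the directory names `data/raw` and `data/shards` and the shard size are assumptions, and the only DVC command used is `dvc add`.)

```python
# Minimal sketch (assumptions: data/raw holds the small files, data/shards is new,
# and a DVC repo is already initialized). Pack files into tar shards, then track
# only the shards with DVC so it sees a handful of archives instead of ~20M files.
import subprocess
import tarfile
from pathlib import Path

SRC_DIR = Path("data/raw")      # hypothetical directory containing the small files
OUT_DIR = Path("data/shards")   # archives that DVC will actually track
FILES_PER_SHARD = 100_000       # tune so each shard stays a manageable size

OUT_DIR.mkdir(parents=True, exist_ok=True)
files = sorted(p for p in SRC_DIR.rglob("*") if p.is_file())

for start in range(0, len(files), FILES_PER_SHARD):
    shard = OUT_DIR / f"shard_{start // FILES_PER_SHARD:05d}.tar"
    with tarfile.open(shard, "w") as tar:
        for f in files[start : start + FILES_PER_SHARD]:
            # store paths relative to SRC_DIR so shards unpack cleanly anywhere
            tar.add(f, arcname=str(f.relative_to(SRC_DIR)))

# One `dvc add` on the shard directory instead of tracking millions of entries.
subprocess.run(["dvc", "add", str(OUT_DIR)], check=True)
```

The trade-off is that individual files are no longer addressable through DVC; consumers have to unpack or stream from the shards.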
-
Hi @alita-moore! Thanks for reaching out! Yes, DVC can be slow at that dataset scale. We have a new tool that we've been building for exactly this purpose, and we will be releasing it on June 25th. You can learn more in this recent talk our CEO, @dmpetrov, gave at OSS4AI (at the 1:02 mark). The tool will process images, text, video, and audio data at scale for computer vision, LLM, or multimodal applications. More info can be found at https://dvc.ai, and if you'd like to talk about your use case and see a demo, you can book a meeting here: https://calendly.com/dmitry-at-iterative/dmitry-petrov-30-minutes