
repro: DVC is slow with million of files #7681

Closed
nsorros opened this issue May 3, 2022 · 4 comments
Labels
A: data-sync (related to dvc get/fetch/import/pull/push), A: status (related to dvc diff/list/status), performance (improvement over resource / time consuming tasks), ui (user interface / interaction)

Comments

nsorros (Contributor) commented May 3, 2022

Description

We are experiencing some issues with DVC in a task that produces 3M files as an output. For context, these are
embeddings from chunks of documents. In this situation some commands error out while others take so long to
complete that working with DVC is not an option. To be fair, producing 3M files that need to be hashed every time
is understandably beyond the limits DVC expects.

I have not been able to reproduce all of the problems below, but let me mention them briefly:

  1. dvc status takes 20+ minutes to calculate hashes
  2. dvc repro fails to complete: the command itself finishes fine, but some step after it produces an error that is never surfaced
  3. git commit with the pre-commit hook takes minutes, since it checks the hashes before switching branches
  4. dvc pull throws ERROR: failed to transfer 'md5: xxx' - Could not connect to the endpoint URL: xxx for a lot of files
  5. git push with the pre-push hook takes so many minutes that the connection to GitHub is lost while dvc is pushing files

For 3 I ended up removing the pre-commit hook.
For 4 I had to raise the open-file limit with ulimit -n 1024 (see the sketch below for doing the same from inside Python).
For 5 I ran dvc push before git push.
For 2 I am not sure what caused the error; it could be related to the number of open files, but I am still investigating.
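
As a side note on 4, the same open-file limit can also be raised from inside a Python process via the standard resource module; a minimal sketch for Unix-like systems (the target of 4096 is just an example value, not something DVC requires):

import resource

# Current soft/hard limits for open file descriptors (RLIMIT_NOFILE).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

target = 4096  # example value; pick what the workload needs
if hard != resource.RLIM_INFINITY:
    # The soft limit cannot exceed the hard limit without extra privileges.
    target = min(target, hard)

resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
print(f"open-file limit raised from {soft} to {target}")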

To reproduce the slowness, I wrote a simple script that produces 1M random numpy vectors and saves them. I am including it below.

I noticed that dvc repro takes minutes, sometimes hours, to complete even when it does not run the command because
the stage is cached. I wonder whether DVC should throw a ⚠️ warning when a user runs a command that takes it
outside some limit, for example 100K files. This warning could be thrown when DVC starts calculating hashes,
and it could point to a troubleshooting page for working with many files.
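
To make the suggestion concrete, here is a minimal sketch of the kind of check being proposed (this is not DVC's actual code; the 100K threshold and the helper names are illustrative):

import os
import warnings

LARGE_DIR_THRESHOLD = 100_000  # illustrative limit, matching the 100K example above


def count_files(path):
    """Count regular files under `path` without materialising the full listing."""
    total = 0
    for _root, _dirs, files in os.walk(path):
        total += len(files)
    return total


def warn_if_large(path):
    n = count_files(path)
    if n > LARGE_DIR_THRESHOLD:
        warnings.warn(
            f"Calculating hashes for {n} files under '{path}' is expected to be slow. "
            "See the troubleshooting tips for working with a large number of files."
        )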

I also wonder what the recommended way to work in these situations is. For one, it seems that some or all hooks
should be dropped. Then, would it be quicker if the user zipped the files so that only the hash of the zip is calculated? Is there another
workaround to speed up the hash calculation? The only solution I see at the moment is removing the outs or the stage altogether.

Finally, another suggestion related to 4: the problem seems to be about too many open files, but the pointer to the
troubleshooting guide only came at the end. The error itself was confusing, in that it made it look like the remote was not working
properly. If DVC can detect that too many files are open and change the error accordingly, that would be helpful, because
if someone stops the operation early (as I was doing at first) they never get to see the recommendation at the end that points
to the right solution.

Reproduce

scale.py

import argparse
import pickle
import os

from tqdm import tqdm
import numpy as np


def scale(files_count, output_path, embedding_dim=100):
    # DVC deletes the directory because it is an output, so recreate it if missing
    if not os.path.exists(output_path):
        os.mkdir(output_path)

    for i in tqdm(range(files_count)):
        embedding = np.random.randn(embedding_dim)

        embedding_path = os.path.join(output_path, f"embedding_{i}.pkl")
        with open(embedding_path, "wb") as f:
            pickle.dump(embedding, f)

if __name__ == "__main__":
    argparser = argparse.ArgumentParser()
    argparser.add_argument("files_count", type=int, help="number of files to create")
    argparser.add_argument("output_path", type=str, help="path to save created files")
    args = argparser.parse_args()

    scale(args.files_count, args.output_path)

dvc.yaml

stages:
  scale:
    cmd: python scale.py 1000000 embeddings
    deps:
      - scale.py
    outs:
      - embeddings/

Expected

dvc repro could throw a warning at the point where it would start calculating hashes. Same for dvc status.

WARNING: Calculating 1M hashes is expected to be slow. Here are some tips on how to work with a lot of files LINK

Environment information

Output of dvc doctor:

DVC version: 2.9.5 (brew)
---------------------------------
Platform: Python 3.9.10 on macOS-12.3.1-arm64-arm-64bit
Supports:
	azure (adlfs = 2022.2.0, knack = 0.9.0, azure-identity = 1.7.1),
	gdrive (pydrive2 = 1.10.0),
	gs (gcsfs = 2022.1.0),
	webhdfs (fsspec = 2022.1.0),
	http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
	https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
	s3 (s3fs = 2022.1.0, boto3 = 1.20.24),
	ssh (sshfs = 2021.11.2),
	webdav (webdav4 = 0.9.4),
	webdavs (webdav4 = 0.9.4)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
dtrifiro added the performance, A: data-sync, A: status, and ui labels on May 3, 2022
dtrifiro (Contributor) commented May 3, 2022

Hi @nsorros, thank you for the detailed report.

A few questions about your points:

  1. does dvc status always take a long time to run, or only when the embeddings directory has been modified? If it is the latter, an upcoming optimization (status: "recalculating" hashes each call #7390) should speed up status considerably in this case
  2. could you provide some more information? For example, a report with the verbose flag: dvc repro -v
  3. Might be related to 1.
  4. How many CPU cores do you have? python -c 'import os; print(os.cpu_count())'

I also wonder what the recommended way to work in these situations is. For one, it seems that some or all hooks
should be dropped. Then, would it be quicker if the user zipped the files so that only the hash of the zip is calculated? Is there another
workaround to speed up the hash calculation? The only solution I see at the moment is removing the outs or the stage altogether.

Creating an archive (zip, tar or gzip) and tracking it as an out instead of tracking the directory would speed DVC up considerably, since it would not require dealing with 1M+ objects.
You could track the archive as an out; stages that require the directory (now an archive file) as a dep could then use it like so:

from zipfile import ZipFile

zip_file = "/path/to/zip"
with ZipFile(zip_file) as archive:
    for file_name in archive.namelist():
        with archive.open(file_name) as fh:
            data = fh.read()
        # do something with data

of course, this approach will not always be possible, depending on how you need to use the directory contents.
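
On the producing side, a hedged variant of the scale.py script above could write everything into a single archive so that DVC only ever tracks one object; a minimal sketch (the function name scale_to_zip and the archive name embeddings.zip are illustrative, not part of the original pipeline):

import pickle
from zipfile import ZIP_DEFLATED, ZipFile

import numpy as np
from tqdm import tqdm


def scale_to_zip(files_count, output_path="embeddings.zip", embedding_dim=100):
    # Write every embedding into one zip so DVC hashes a single file instead of 1M.
    with ZipFile(output_path, "w", ZIP_DEFLATED) as archive:
        for i in tqdm(range(files_count)):
            embedding = np.random.randn(embedding_dim)
            archive.writestr(f"embedding_{i}.pkl", pickle.dumps(embedding))


if __name__ == "__main__":
    scale_to_zip(1000)  # small run to sanity-check the approach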

dtrifiro (Contributor) commented May 3, 2022

Related #7607

nsorros (Contributor, Author) commented May 3, 2022

  1. does dvc status always take a long time to run, or only when the embeddings directory has been modified? If it is the latter, an upcoming optimization (status: "recalculating" hashes each call #7390) should speed up status considerably in this case

It always takes time, since as I understand it DVC recalculates the hashes to check whether something has changed.

  2. could you provide some more information? For example, a report with the verbose flag: dvc repro -v

This might be difficult, as the actual process that fails takes hours to complete, but I will try to reproduce the problem with a different script to give you more information.

  3. Might be related to 1.

I think so, yes.

  4. How many CPU cores do you have? python -c 'import os; print(os.cpu_count())'

4 on the AWS instance (it's a GPU instance) and 8 locally (Apple M1).

Creating an archive (zip, tar or gzip) and tracking it as an out instead of tracking the directory would speed DVC up considerably, since it would not require dealing with 1M+ objects. You could track the archive as an out; stages that require the directory (now an archive file) as a dep could then use it like so:

I will try the zip approach to see how much it speeds things up and report back.
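
For reference, if scale.py were adapted to write a single embeddings.zip (a hypothetical output name), the dvc.yaml above would only need the output swapped from the directory to the archive; a sketch:

stages:
  scale:
    cmd: python scale.py 1000000 embeddings.zip
    deps:
      - scale.py
    outs:
      - embeddings.zip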

Other than the actual problems:

  • I wonder what the recommended way to work with a large number of files is. Possibly this is out of scope for DVC, which is fair. I tried some experiments with git alone and, even though it is faster, things are quite slow on git's end as well, so this might be approaching the limits of version control
  • I also wonder whether DVC should add some warnings when it detects a large number of files, to notify the user

skshetry (Member) commented Mar 5, 2024

We don't have the capacity to work on this in the short to medium term. Also, this item is not very actionable, and we have other, more focused tickets regarding it.

Closing for now.

skshetry closed this as not planned (won't fix, can't repro, duplicate, stale) on Mar 5, 2024