
repro: DVC is slow with million of files #7681

Closed
nsorros opened this issue May 3, 2022 · 4 comments
Labels
A: data-sync (related to dvc get/fetch/import/pull/push), A: status (related to dvc diff/list/status), performance (improvement over resource / time consuming tasks), ui (user interface / interaction)

Comments

nsorros (Contributor) commented May 3, 2022

Description

We are experiencing some issues with DVC in a task that produces 3M files as an output. For context, these are
embeddings from chunks of documents. In this situation some commands error out while others take so long to
complete that working with DVC is not an option. To be fair, producing 3M files that need to be hashed every time
is understandably beyond the limits DVC expects.

I have not been able to reproduce all of the problems below, but let me mention them briefly:

  1. dvc status takes 20+ minutes to calculate hashes
  2. dvc repro fails to complete: the command itself finishes fine, but some step after it produces an error that is never surfaced
  3. git commit with the pre-commit hook takes minutes, since it checks the hashes before switching branches
  4. dvc pull throws ERROR: failed to transfer 'md5: xxx' - Could not connect to the endpoint URL: xxx for a lot of files
  5. git push with the pre-push hook takes so many minutes that the connection to GitHub is lost while dvc is pushing files

For 3 I ended up removing the pre-commit hook.
For 4 I had to raise the open-file limit with ulimit -n 1024 (see the sketch below for doing the same from inside Python).
For 5 I ran dvc push before git push.
For 2 I am not sure what caused the error; it could be related to the number of open files, but I am still investigating.
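
As a side note on 4, the same open-file limit can also be raised from inside a Python process via the standard resource module; a minimal sketch for Unix-like systems (the target of 4096 is just an example value, not something DVC requires):

import resource

# Current soft/hard limits for open file descriptors (RLIMIT_NOFILE).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

target = 4096  # example value; pick what the workload needs
if hard != resource.RLIM_INFINITY:
    # The soft limit cannot exceed the hard limit without extra privileges.
    target = min(target, hard)

resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
print(f"open-file limit raised from {soft} to {target}")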

To reproduce the slowness, I wrote a simple script that produces 1M random numpy vectors and saves them. I am including it below.

I noticed that dvc repro takes minutes, sometimes hours, to complete even when it does not run the command because
the stage is cached. I wonder whether DVC should throw a ⚠️ warning when a user runs a command that takes it
outside some limit, for example 100K files. This warning could be thrown when DVC starts calculating hashes,
and it could point to a troubleshooting page for working with many files.
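
To make the suggestion concrete, here is a minimal sketch of the kind of check being proposed (this is not DVC's actual code; the 100K threshold and the helper names are illustrative):

import os
import warnings

LARGE_DIR_THRESHOLD = 100_000  # illustrative limit, matching the 100K example above


def count_files(path):
    """Count regular files under `path` without materialising the full listing."""
    total = 0
    for _root, _dirs, files in os.walk(path):
        total += len(files)
    return total


def warn_if_large(path):
    n = count_files(path)
    if n > LARGE_DIR_THRESHOLD:
        warnings.warn(
            f"Calculating hashes for {n} files under '{path}' is expected to be slow. "
            "See the troubleshooting tips for working with a large number of files."
        )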

I also wonder what the recommended way to work in these situations is. For one, it seems that some or all hooks
should be dropped. Then, would it be quicker if the user zipped the files so that only the hash of the zip is calculated? Is there another
workaround to speed up the hash calculation? The only solution I see at the moment is removing the outs or the stage altogether.

Finally, another suggestion related to 4: the problem seems to be about too many open files, but the pointer to the
troubleshooting guide only came at the end. The error itself was confusing, in that it made it look like the remote was not working
properly. If DVC can detect that too many files are open and change the error accordingly, that would be helpful, because
if someone stops the operation early (as I was doing at first) they never get to see the recommendation at the end that points
to the right solution.

Reproduce

scale.py

import argparse
import pickle
import os

from tqdm import tqdm
import numpy as np


def scale(files_count, output_path, embedding_dim=100):
    # DVC deletes the directory because it is an output, so recreate it if missing
    if not os.path.exists(output_path):
        os.mkdir(output_path)

    for i in tqdm(range(files_count)):
        embedding = np.random.randn(embedding_dim)

        embedding_path = os.path.join(output_path, f"embedding_{i}.pkl")
        with open(embedding_path, "wb") as f:
            pickle.dump(embedding, f)

if __name__ == "__main__":
    argparser = argparse.ArgumentParser()
    argparser.add_argument("files_count", type=int, help="number of files to create")
    argparser.add_argument("output_path", type=str, help="path to save created files")
    args = argparser.parse_args()

    scale(args.files_count, args.output_path)

dvc.yaml

stages:
  scale:
    cmd: python scale.py 1000000 embeddings
    deps:
      - scale.py
    outs:
      - embeddings/

Expected

dvc repro could throw a warning at the point where it would start calculating hashes. Same for dvc status.

WARNING: Calculating 1M hashes is expected to be slow. Here are some tips on how to work with a lot of files LINK

Environment information

Output of dvc doctor:

DVC version: 2.9.5 (brew)
---------------------------------
Platform: Python 3.9.10 on macOS-12.3.1-arm64-arm-64bit
Supports:
	azure (adlfs = 2022.2.0, knack = 0.9.0, azure-identity = 1.7.1),
	gdrive (pydrive2 = 1.10.0),
	gs (gcsfs = 2022.1.0),
	webhdfs (fsspec = 2022.1.0),
	http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
	https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
	s3 (s3fs = 2022.1.0, boto3 = 1.20.24),
	ssh (sshfs = 2021.11.2),
	webdav (webdav4 = 0.9.4),
	webdavs (webdav4 = 0.9.4)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
dtrifiro added the performance, A: data-sync, A: status, and ui labels on May 3, 2022
dtrifiro (Contributor) commented May 3, 2022

Hi @nsorros, thank you for the detailed report.

A few questions about your points:

  1. does dvc status always take a long time to run, or only when the embeddings directory has been modified? If it is the latter, an upcoming optimization (status: "recalculating" hashes each call #7390) should speed up status considerably in this case
  2. could you provide some more information? For example, a report with the verbose flag: dvc repro -v
  3. Might be related to 1.
  4. How many CPU cores do you have? python -c 'import os; print(os.cpu_count())'

I also wonder what the recommended way to work in these situations is. For one, it seems that some or all hooks
should be dropped. Then, would it be quicker if the user zipped the files so that only the hash of the zip is calculated? Is there another
workaround to speed up the hash calculation? The only solution I see at the moment is removing the outs or the stage altogether.

Creating an archive (zip, tar or gzip) and tracking it as an out instead of tracking the directory would speed DVC up considerably, since it would not require dealing with 1M+ objects.
You could track the archive as an out; stages that require the directory (now an archive file) as a dep could then use it like so:

from zipfile import ZipFile

zip_file = "/path/to/zip"
with ZipFile(zip_file) as archive:
    for file_name in archive.namelist():
        with archive.open(file_name) as fh:
            data = fh.read()
        # do something with data

of course, this approach will not always be possible, depending on how you need to use the directory contents.
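
On the producing side, a hedged variant of the scale.py script above could write everything into a single archive so that DVC only ever tracks one object; a minimal sketch (the function name scale_to_zip and the archive name embeddings.zip are illustrative, not part of the original pipeline):

import pickle
from zipfile import ZIP_DEFLATED, ZipFile

import numpy as np
from tqdm import tqdm


def scale_to_zip(files_count, output_path="embeddings.zip", embedding_dim=100):
    # Write every embedding into one zip so DVC hashes a single file instead of 1M.
    with ZipFile(output_path, "w", ZIP_DEFLATED) as archive:
        for i in tqdm(range(files_count)):
            embedding = np.random.randn(embedding_dim)
            archive.writestr(f"embedding_{i}.pkl", pickle.dumps(embedding))


if __name__ == "__main__":
    scale_to_zip(1000)  # small run to sanity-check the approach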

dtrifiro (Contributor) commented May 3, 2022

Related #7607

nsorros (Contributor, Author) commented May 3, 2022

  1. does dvc status always take a long time to run, or only when the embeddings directory has been modified? If it is the latter, an upcoming optimization (status: "recalculating" hashes each call #7390) should speed up status considerably in this case

It always takes time, since as I understand it DVC recalculates the hashes to check whether something has changed.

  2. could you provide some more information? For example, a report with the verbose flag: dvc repro -v

This might be difficult, as the actual process that fails takes hours to complete, but I will try to reproduce the problem with a different script to give you more information.

  3. Might be related to 1.

I think so, yes.

  4. How many CPU cores do you have? python -c 'import os; print(os.cpu_count())'

4 on the AWS instance (it's a GPU instance) and 8 locally (Apple M1).

Creating an archive (zip, tar or gzip) and tracking it as an out instead of tracking the directory would speed DVC up considerably, since it would not require dealing with 1M+ objects. You could track the archive as an out; stages that require the directory (now an archive file) as a dep could then use it like so:

I will try the zip approach to see how much it speeds things up and report back.
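
For reference, if scale.py were adapted to write a single embeddings.zip (a hypothetical output name), the dvc.yaml above would only need the output swapped from the directory to the archive; a sketch:

stages:
  scale:
    cmd: python scale.py 1000000 embeddings.zip
    deps:
      - scale.py
    outs:
      - embeddings.zip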

Other than the actual problems:

  • I wonder what the recommended way to work with a large number of files is. Possibly this is out of scope for DVC, which is fair. I tried some experiments with git alone and, even though it is faster, things are quite slow on git's end as well, so this might be approaching the limits of version control
  • I also wonder whether DVC should add some warnings when it detects a large number of files, to notify the user

skshetry (Member) commented Mar 5, 2024

We don't have the capacity to work on this in the short to medium term. Also, this item is not very actionable, and we have other, more focused tickets regarding it.

Closing for now.

skshetry closed this as not planned (won't fix, can't repro, duplicate, stale) on Mar 5, 2024