add: taking more than 20min when multithreaded vs 20s with one job #8008

Closed
percevalw opened this issue Jul 12, 2022 · 1 comment
Labels
A: data-management Related to dvc add/checkout/commit/move/remove fs: nfs performance improvement over resource / time consuming tasks

Comments

@percevalw

Bug Report

Description

Following iterative/dvc-objects#99

I initialized a local repository and tried adding a 74 MB folder of 564 files to DVC. I ran the command on a cluster node with 32 CPUs and no remote. Running time dvc add data while forcing the number of workers to 1 (using the change from iterative/dvc-objects#99) produced this output:

$ time dvc add data
100% Adding...|██████████████████████████████████████████████|1/1 [00:26, 26.69s/file]
                                                                                                                                                                                                          
To track the changes with git, run:                                                                                                                                                                       

    git add data.dvc

To enable auto staging, run:

        dvc config core.autostage true

real    0m28.422s
user    0m5.421s
sys     0m4.311s

whereas leaving the number of workers at the default of 32 * 4 = 128 produced this:

$ dvc add data --verbose
2022-07-12 09:35:01,274 DEBUG: built tree 'object 852afa69c8fda0798544716c6f98c630.dir'                                                                                                                   
2022-07-12 09:35:01,277 DEBUG: Computed stage: 'data.dvc' md5: 'None'                                                                                                                             
2022-07-12 09:35:02,010 DEBUG: built tree 'object 852afa69c8fda0798544716c6f98c630.dir'
2022-07-12 09:35:02,014 DEBUG: Preparing to transfer data from 'memory://dvc-staging/144e708b03c791f83cdb1b5d0087c235b149865e23c7f862e93754c444cf30c5' to '/export/home/pwajsburt/dvc-test/.dvc/cache'
2022-07-12 09:35:02,015 DEBUG: Preparing to collect status from '/export/home/pwajsburt/dvc-test/.dvc/cache'
2022-07-12 09:35:02,019 DEBUG: Collecting status from '/export/home/pwajsburt/dvc-test/.dvc/cache'
2022-07-12 09:35:02,647 DEBUG: Preparing to collect status from 'memory://dvc-staging/144e708b03c791f83cdb1b5d0087c235b149865e23c7f862e93754c444cf30c5'
Adding...                                                                                                                                                                                                 
  1%|▏         |Transferring                       8/564 [02:03<2:41:58, 17.48s/file]

I aborted this run, but an earlier run with the same settings did complete, just slowly.

I made sure to clear the cache, as well as any generated files, before each execution. The blocking part of the add command seems to be here: https://github.com/iterative/dvc-data/blob/main/src/dvc_data/transfer.py#L180-L186, and the core.checksum_jobs option does not affect this operation.

The --jobs option is only available together with the --to-remote option, so there is no easy way to disable multithreading. I suspect that parallelizing local copy operations might be the cause. However, if I run dvc add data --to-remote with a local folder set as the remote, no blocking occurs regardless of the number of workers, and the cache fills as expected.

I could not pinpoint why the behavior of the two commands differs, as they both modify a local folder.
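One way to check whether parallel copies on the filesystem itself are the bottleneck, independently of DVC, is to time a plain threaded copy at different worker counts. The following is a minimal sketch under that assumption: copy_files and benchmark are hypothetical helpers written for this test, not part of DVC or dvc-data, and they do not reproduce what transfer.py does (no hashing, linking, or status collection).

```python
import shutil
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_files(src: Path, dst: Path, jobs: int) -> float:
    """Copy every regular file under src into dst with `jobs` threads.

    Returns the elapsed wall-clock time in seconds.
    """
    files = [p for p in src.rglob("*") if p.is_file()]
    dst.mkdir(parents=True, exist_ok=True)

    def copy_one(path: Path) -> None:
        target = dst / path.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(path, target)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # list() waits for all copies and re-raises any worker exception.
        list(pool.map(copy_one, files))
    return time.perf_counter() - start


def benchmark(src: Path, scratch: Path, job_counts=(1, 128)) -> dict:
    """Time copy_files for each worker count, using a fresh destination each time."""
    results = {}
    for jobs in job_counts:
        dst = Path(tempfile.mkdtemp(prefix=f"copy-{jobs}-", dir=scratch))
        try:
            results[jobs] = copy_files(src, dst, jobs)
        finally:
            # Remove the copied files so the next run starts clean.
            shutil.rmtree(dst, ignore_errors=True)
    return results
```

Calling benchmark(Path("data"), Path(".")) from a directory on the NFS mount would give one timing per worker count. If 128 workers are dramatically slower than 1 here as well, the slowdown is likely in the filesystem's handling of concurrent writes rather than in DVC's logic; if not, the difference between add and add --to-remote deserves a closer look.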

Reproduce

Running

$ dvc add data

takes from 20 minutes to hours in my setup with vanilla dvc, but only a few seconds after modifying the source files as in iterative/dvc-objects#99.

Also, in case it helps:

$ dvc remote add test my-cache
$ dvc remote default test
$ dvc add data --to-remote --jobs 128

takes a few seconds, even though it also performs local copy operations and is multithreaded.

Expected

I expect the add command to take a few seconds, whereas it can currently take hours. In any case, being able to configure the number of jobs (not only core.checksum_jobs) globally would be great.

Environment information

$ dvc doctor
DVC version: 2.13.0 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-3.10.0-1160.15.2.el7.x86_64-x86_64-with-glibc2.29
Supports:
        webhdfs (fsspec = 2022.5.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.8),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.8),
        s3 (s3fs = 2022.5.0, boto3 = 1.21.21)
Cache types: hardlink, symlink
Cache directory: nfs4 on nfs.prod.xxxxxxxxxx:/zfspool/home/pwajsburt
Caches: local
Remotes: local
Workspace directory: nfs4 on nfs.prod.xxxxxxxxxx:/zfspool/home/pwajsburt
Repo: dvc, git

Additional Information (if any):

Due to some restrictions, I could export the profiling information from neither viztracer nor cProfile; my apologies.

@karajan1001 karajan1001 added performance improvement over resource / time consuming tasks A: data-management Related to dvc add/checkout/commit/move/remove labels Jul 13, 2022
@efiop efiop added the fs: nfs label Sep 26, 2023
@dberenbaum
Collaborator

Closing as a duplicate of #7607 and others.

@dberenbaum dberenbaum closed this as not planned Won't fix, can't repro, duplicate, stale Apr 4, 2024