Add parallelism options to speed up dvc add #3177
Current workaround: use multiple terminals to run `dvc add`. Update: can't do so because there's a repo-level lock.
Hi @Ykid!
Not sure how that is going to help to parallelize a single `dvc add`, though.
@efiop, it is
@Ykid Also, what was dvc showing when you were looking at CPU utilization? We do run things like checksum computation in parallel: https://github.com/iterative/dvc/blob/master/dvc/remote/base.py#L178
@efiop this is what I got (screenshot of CPU utilization):
@Ykid Yes, but what was it doing at that time? Meaning, what was it showing to stdout? Was it saying that it is computing checksums, or maybe that it is linking?
Should be copying stuff.
For the record: decided to try out hardlinks or symlinks as a workaround.
For the record: linking is still single-threaded in dvc, so we need to look into parallelizing it as well.
@efiop might be a part of, or a follow-up to, the checkout refactor.
I agree with Ykid. As with dvc fetch and other commands, I hope that `dvc add` will be parallelized as well.
I'm running DVC 2.10.2, and checksum calculations for my dataset are taking a long time. From the above thread I'm under the impression that checksum calculations are done in parallel? I'm monitoring top while the checksums are calculating and I only ever see one instance of dvc. Am I missing something?
Hi @mvonpohle, dvc parallelizes with threads rather than separate processes, so you can see them by toggling the thread view in top.
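A quick way to check this from the shell, assuming a Linux box with procps `top` and `pgrep` available (the process pattern below is just an example):

```bash
# Show the threads (not just the process) of a running `dvc add`;
# `-H` enables thread view, `pgrep -d,` emits a comma-separated PID list.
top -H -p "$(pgrep -d, -f 'dvc add')"
```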
Ah, okay. I was looking for processes instead of threads. 😝 Thanks for the clarification!
Do you know if there's a way to give dvc more resources for these commands? I'd think checksum calculations would be ripe for aggressive parallelization.
@mvonpohle, there is a config option for that; see the example below.
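A minimal sketch of what tuning that knob could look like, assuming the option in question is the documented `core.checksum_jobs` setting (that assumption isn't confirmed in this thread):

```bash
# Let hash computation use 16 threads instead of the default
dvc config core.checksum_jobs 16
```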
Closing this issue, as checksum computation in `dvc add` is already parallelized and configurable.
Hello all, I wonder whether it is still possible to make a multiprocessed version of dvc add? This would utilize more than one CPU core in a system, which should further speed up md5 calculations. Multithreading only speeds up md5 calculations within the single process that dvc add currently uses. Multiprocessing would be a great and necessary improvement to dvc add for large datasets.
As far as I understand, the expensive parts of `dvc add` already run in parallel. Which step are you referring to?
Thanks for the quick reply. The step I'm referring to is "Building data objects". Here is a test scenario: I have a large dataset (~100GB) with many small files (500KB each). I would like to add the dataset like this, using a local remote and no caching:
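A hypothetical reconstruction of that setup (the exact commands aren't shown above; the remote name and path are made up, and `--to-remote` is just one way to skip the local cache):

```bash
# Configure a local directory as the default remote (path is illustrative)
dvc remote add -d localstorage /mnt/dvc-storage

# Add the dataset, transferring the data straight to the remote
# instead of populating the local cache
dvc add --to-remote sourcedata/treatment-deid
```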
Htop when "Building data objects": If there were no global lock for the steps in dvc add, something like the following bash script using multiple dvc add commands in parallell, should work: `: ' Pararellized DVC add on sessions. Assuming dataset root directory contains subjects/sessions Example: bash dvc-add-dataset-mp.sh sourcedata/treatment-deid 32
However, this script gets multiple locking errors. I'm not able to reproduce these errors right now, but it is clear that this way of using dvc add (dvc version 2.43.1) does not work correctly at this time. Optimally, what I would like to achieve is to have dvc add utilize as many CPU cores as possible when running the command.
The md5 calculation is not cleanly isolated; it is currently bundled with other internal operations.
thanks @ivartz! Your intention makes total sense. What I'm trying to understand better is why it doesn't utilize multiple cores now, and whether this ticket should be reopened (even if we don't prioritize it right away).
thanks @daavoo! Should we reopen the ticket then? Since, per @skshetry's comment above, almost everything is parallel and we are waiting for some diff changes to do the last part?
That comment is outdated; as of today, it happens sequentially in a single thread.
We removed it in iterative/dvc-data#53, because there was high overhead for small files, and running in a single thread was much faster than multithreaded hashing/building. The state is to blame here, of course. Also, since we are using the mnist dataset as a benchmark, there's a question of whether it's a good representative to optimize for.
@skshetry @efiop any data to support that decision, by chance? (There is nothing in the ticket 🤔)
yes, I don't think it's very relevant tbh. But even in that case it's a bit surprising - what would be the underlying reason for the pool to be slower?
@shcheklein Can't find anything recorded in the ticket, but there are today's benchmark runs on the mnist dataset from dvc-bench (see the pic below; 2.11 is where the changes rolled out). The way we were working with state is suboptimal; the persistent index will be replacing it. add will be migrating to the index-based workflow (cloud versioning is already using it), and that logic is already based on async/threadpool, https://github.com/iterative/dvc-objects/blob/main/src/dvc_objects/fs/generic.py, thanks to great work by @pmrowla.
Thanks for reopening this issue!
Does the async/threadpool here refer to using multiple threads within a single process, or will additional processes also be utilized by add when migrating to the index-based workflow? If it's the former, it will likely not improve the speed of add, according to my previous test case?
mnist is not representative of my data case as described above, and I would argue that my case is common for many users of dvc. Perhaps it is possible to add some logic that lets the size or number of files determine whether to use multiprocessing or multithreading? Thanks for all the replies. Looking forward to seeing whether the index-based workflow can speed up adding many small files (500KB each) contained within a directory (recursively) with a total size of 100GB.
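One crude way to gauge how much raw md5 hashing of such a tree gains from extra cores (plain coreutils, not a dvc feature; the path and process counts are illustrative):

```bash
# Hash the whole tree with 1 process, then with 8, and compare wall time.
time find sourcedata/treatment-deid -type f -print0 \
  | xargs -0 -n 64 -P 1 md5sum > /dev/null

time find sourcedata/treatment-deid -type f -print0 \
  | xargs -0 -n 64 -P 8 md5sum > /dev/null
```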
@ivartz Sorry for the confusion. There are a few different ways we compute hashes. We are filling up our benchmarks. Your data seems to be structurally similar though, or am I missing something? 100G of 500KB files means ~200K small files, which is not the same, but is kinda similar.
We can track this under #7607
One of the projects I did contains directories of more than 1M files. When I do `dvc add dir1`, it takes quite a while to run, and I saw that dvc only utilizes one CPU to do so. It would be nice if there were parallelism options that could utilize multiple cores of the machine.