Add parallelism options to speed up dvc add #3177
Current workaround: use multiple terminals to run `dvc add`. Update: can't do so because there's a repo-level lock.
Hi @Ykid!
Not sure how that is going to help to parallelize a single `dvc add`, though.
@efiop, it is
@Ykid Also, what was dvc showing when you were looking at CPU utilization? We do run things like checksum computation in parallel: https://github.com/iterative/dvc/blob/master/dvc/remote/base.py#L178
@efiop this is what I got (screenshot of CPU utilization):
@Ykid Yes, but what was it doing at that time? Meaning, what was it showing to stdout? Was it saying that it is computing checksums, or maybe that it is linking?
Should be copying stuff.
For the record: decided to try out hardlinks or symlinks as a workaround.
For the record: linking is still single-threaded in dvc, so we need to look into parallelizing it as well.
@efiop might be a part of, or a follow-up to, the checkout refactor.
I agree with Ykid. As with dvc fetch and other commands, I hope that `dvc add` will be parallelized as well.
I'm running DVC 2.10.2, and checksum calculations for my dataset are taking a long time. From the above thread I'm under the impression that checksum calculations are done in parallel? I'm monitoring top while the checksums are calculating and I only ever see one instance of dvc. Am I missing something?
Hi @mvonpohle, dvc parallelizes with threads rather than separate processes, so you can see them by toggling the thread view in top.
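A quick way to check this from the shell, assuming a Linux box with procps `top` and `pgrep` available (the process pattern below is just an example):

```bash
# Show the threads (not just the process) of a running `dvc add`;
# `-H` enables thread view, `pgrep -d,` emits a comma-separated PID list.
top -H -p "$(pgrep -d, -f 'dvc add')"
```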
Ah, okay. I was looking for processes instead of threads. 😝 Thanks for the clarification!
Do you know if there's a way to give dvc more resources for these commands? I'd think checksum calculations would be ripe for aggressive parallelization.
@mvonpohle, there is a config option for that; see the example below.
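A minimal sketch of what tuning that knob could look like, assuming the option in question is the documented `core.checksum_jobs` setting (that assumption isn't confirmed in this thread):

```bash
# Let hash computation use 16 threads instead of the default
dvc config core.checksum_jobs 16
```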
Closing this issue, as checksum computation in `dvc add` is already parallelized and configurable.
Hello all, I wonder whether it is still possible to make a multiprocessed version of dvc add? This would utilize more than one CPU core in a system, which should further speed up md5 calculations. Multithreading only speeds up md5 calculations within the single process that dvc add currently uses. Multiprocessing would be a great and necessary improvement to dvc add for large datasets.
As far as I understand, the expensive parts of `dvc add` already run in parallel. Which step are you referring to?
Thanks for the quick reply. The step I'm referring to is "Building data objects". Here is a test scenario: I have a large dataset (~100GB) with many small files (500KB each). I would like to add the dataset like this, using a local remote and no caching:
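A hypothetical reconstruction of that setup (the exact commands aren't shown above; the remote name and path are made up, and `--to-remote` is just one way to skip the local cache):

```bash
# Configure a local directory as the default remote (path is illustrative)
dvc remote add -d localstorage /mnt/dvc-storage

# Add the dataset, transferring the data straight to the remote
# instead of populating the local cache
dvc add --to-remote sourcedata/treatment-deid
```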
Htop when "Building data objects": If there were no global lock for the steps in dvc add, something like the following bash script using multiple dvc add commands in parallell, should work: `: ' Pararellized DVC add on sessions. Assuming dataset root directory contains subjects/sessions Example: bash dvc-add-dataset-mp.sh sourcedata/treatment-deid 32
However, this script gets multiple locking errors. I'm not able to reproduce these errors right now, but it is clear that this way of using dvc add (dvc version 2.43.1) does not work correctly at this time. Optimally, what I would like to achieve is to have dvc add utilize as many CPU cores as possible when running the command.
The md5 calculation is not cleanly isolated; it is currently bundled with other internal operations.
thanks @ivartz! Your intention makes total sense. What I'm trying to understand better is why it doesn't utilize multiple cores now, and whether this ticket should be reopened (even if we don't prioritize it right away).
thanks @daavoo! Should we reopen the ticket then? Since, per @skshetry's comment above, almost everything is parallel and we are waiting for some diff changes to do the last part?
That comment is outdated; as of today, it happens sequentially in a single thread.
We removed it in iterative/dvc-data#53, because there was high overhead for small files, and running in a single thread was much faster than multithreaded hashing/building. The state is to blame here, of course. Also, since we are using the mnist dataset as a benchmark, there's a question of whether it's a good representative to optimize for.
@skshetry @efiop any data to support that decision, by chance? (There is nothing in the ticket 🤔)
yes, I don't think it's very relevant tbh. But even in that case it's a bit surprising - what would be the underlying reason for the pool to be slower?
@shcheklein Can't find anything recorded in the ticket, but there are today's benchmark runs on the mnist dataset from dvc-bench (see the pic below; 2.11 is where the changes rolled out). The way we were working with state is suboptimal; the persistent index will be replacing it. add will be migrating to the index-based workflow (cloud versioning is already using it), and that logic is already based on async/threadpool, https://github.com/iterative/dvc-objects/blob/main/src/dvc_objects/fs/generic.py, thanks to great work by @pmrowla.
Thanks for reopening this issue!
Does the async/threadpool here refer to using multiple threads within a single process, or will additional processes also be utilized by add when migrating to the index-based workflow? If it's the former, it will likely not improve the speed of add, according to my previous test case?
mnist is not representative of my data case as described above, and I would argue that my case is common for many users of dvc. Perhaps it is possible to add some logic that lets the size or number of files determine whether to use multiprocessing or multithreading? Thanks for all the replies. Looking forward to seeing whether the index-based workflow can speed up adding many small files (500KB each) contained within a directory (recursively) with a total size of 100GB.
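One crude way to gauge how much raw md5 hashing of such a tree gains from extra cores (plain coreutils, not a dvc feature; the path and process counts are illustrative):

```bash
# Hash the whole tree with 1 process, then with 8, and compare wall time.
time find sourcedata/treatment-deid -type f -print0 \
  | xargs -0 -n 64 -P 1 md5sum > /dev/null

time find sourcedata/treatment-deid -type f -print0 \
  | xargs -0 -n 64 -P 8 md5sum > /dev/null
```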
@ivartz Sorry for the confusion. There are a few different ways we compute hashes. We are filling up our benchmarks. Your data seems to be structurally similar though, or am I missing something? 100G of 500KB files means ~200K small files, which is not the same, but is kinda similar.
We can track this under #7607
One of the projects I did contains directories of more than 1M files. When I do `dvc add dir1`, it takes quite a while to run, and I saw that dvc only utilizes one CPU to do so. It would be nice if there were parallelism options that could utilize multiple cores of the machine.