batch dump and batch update of old stages #10630
base: main
Conversation
Codecov Report

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #10630      +/-  ##
==========================================
+ Coverage   90.68%    91.00%   +0.32%
==========================================
  Files         504       504
  Lines       39795     39835      +40
  Branches     3141      3154      +13
==========================================
+ Hits        36087     36253     +166
+ Misses       3042      2949      -93
+ Partials      666       633      -33
@skshetry could you PTAL?
Hi, thank you for the reproducer in #10629. I profiled the reproducer, and that is a lot of time spent reading and writing the DVC files. We should optimize this.

I see you have attempted to fix both issues in the PR. I'd suggest splitting it into two separate PRs, starting with the batching. Regarding …
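For reference, here is a minimal sketch (not from the thread) of how such a profile can be captured with cProfile. It assumes the `dvc.repo.Repo` Python API and that `Repo.commit` accepts `force=True`; the exact signature may differ across DVC versions:

```python
# Minimal profiling sketch: measure where a forced commit spends its time on a
# repo with a large dvc.lock. Assumes dvc is importable and that
# dvc.repo.Repo.commit accepts force=True (adjust for your DVC version).
import cProfile
import pstats

from dvc.repo import Repo

repo = Repo(".")  # run from the root of the DVC repository

with cProfile.Profile() as prof:
    repo.commit(force=True)  # re-commit stages, forcing dvc.lock updates

# Look for cumulative time spent in dvcfile load/dump calls.
pstats.Stats(prof).sort_stats("cumulative").print_stats(20)
```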
Thanks so much for your time and comments!
I'm not sure if I follow, but perhaps this is related to the use case my group has: we have been parsing the DVC DAG to find stages of our dvc.yaml that can be run in parallel. In each parallel job, we run the command in the stage's cmd. Alternatively, instead of running the …

We also considered doing a try-wait cycle for updating the lock file in parallel, but this essentially brings us back to the problem in #10629, where the lock file is updated sequentially and the time taken to update it becomes longer than the time required to run the stages (even in parallel). One might expect batch updates to result in a lag time during which the …

Is this the scenario you're referring to? Or is there a different sequence of commands or actions that would result in a problem? (I'm a VS Code user myself, by the way.)
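For concreteness, our setup is roughly like the following sketch (stage names and commands here are hypothetical, and this is a simplification of what we actually run):

```python
# Hypothetical sketch of the workflow described above: run independent stages'
# commands in parallel, then update dvc.lock for all of them with a single
# `dvc commit` at the end instead of one commit (and lock-file rewrite) per job.
# Stage names and commands are made up; a real script would read them from dvc.yaml.
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Stages known (e.g. from parsing the DAG) to have no dependencies on each other.
independent_stages = {
    "featurize_a": "python featurize.py --part a",
    "featurize_b": "python featurize.py --part b",
}

def run_stage(cmd: str) -> None:
    subprocess.run(cmd, shell=True, check=True)

with ThreadPoolExecutor() as pool:
    list(pool.map(run_stage, independent_stages.values()))

# One commit covering all stages, so dvc.lock is loaded and written per
# `dvc commit` invocation rather than once per stage.
subprocess.run(["dvc", "commit", "--force", *independent_stages], check=True)
```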
I did see PR #8873, but both there and in general I was having trouble figuring out what the goal of merge_versioned and meta versioning is. Is there a place where this is documented? Or is it not something that users are expected to interact with?
In the case of running stages in parallel, see https://dvc.org/doc/command-reference/repro#parallel-stage-execution. (You may get …)

For the versioning question: this versioning stuff is from the Cloud Versioning feature. In this feature, we keep metadata of every single file inside dvc.lock.
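As an illustration of the point above, a cloud-versioned output entry in dvc.lock carries per-file metadata roughly shaped like this (expressed here as a Python dict; the field names follow the cloud-versioning docs, but treat the exact layout as an assumption and all values as made up):

```python
# Illustrative only: roughly the shape of a cloud-versioned output entry as it
# appears in dvc.lock, written as a Python dict. Exact field names and layout
# may differ between DVC versions; values are fabricated for the example.
versioned_out = {
    "path": "data/raw",
    "hash": "md5",
    "files": [  # one entry per file, so dvc.lock grows with the number of files
        {
            "relpath": "part-000.parquet",
            "size": 104857600,
            "md5": "3a7b2f0c9d8e1b4a5c6d7e8f9a0b1c2d",  # made-up value
            "version_id": "kf3G9exampleVersionId",        # cloud object version
            "etag": "3a7b2f0c9d8e1b4a5c6d7e8f9a0b1c2d",   # made-up value
        },
    ],
}
```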
Description
Rather than running stage.dump separately for each stage, which involves loading the dvc.lock file many times, this proposed code groups stages together by their dvcfiles and does a dvcfile.batch_dump (see the stages_by_dvcfile object).

Also, rather than getting the old version of each stage within get_versioned_outs, this proposed code gets all old versions in one step, stores them in an old_stages object, and then passes each old stage to stage.save. This similarly removes a step where the lock file is reloaded once per stage.

My understanding is that these changes improve the speed of dvc commit when the dvc.lock file is very large and slow to load, without changing the function's behavior. That said, I am not 100% sure, especially when it comes to the get_versioned_outs part, which I have less context for. I would appreciate feedback about what this could be missing!
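For illustration, the grouping looks roughly like the following simplified sketch. This is not the actual diff; it assumes each stage exposes a dvcfile attribute and that the proposed batch_dump method loads and writes dvc.lock once per file (its real signature may differ):

```python
# Simplified sketch of the batching idea described above -- not the actual PR
# code. Assumes each stage object has a `dvcfile` attribute and that the
# dvcfile provides a (proposed) batch_dump() that reloads and rewrites
# dvc.yaml/dvc.lock a single time for all of its stages.
from collections import defaultdict

def batch_dump_stages(stages):
    # Group stages by the dvcfile (dvc.yaml / dvc.lock pair) they belong to.
    # Keyed by the dvcfile object here for brevity; real code might group by
    # the dvcfile path instead.
    stages_by_dvcfile = defaultdict(list)
    for stage in stages:
        stages_by_dvcfile[stage.dvcfile].append(stage)

    # One load/dump cycle per dvcfile instead of one per stage.
    for dvcfile, file_stages in stages_by_dvcfile.items():
        dvcfile.batch_dump(file_stages)
```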
Related issues
Fixes #10629, although it seems like it would be even better to have separate lock files.
Checklist