status: "recalculating" hashes each call #7390
Comments
@pared Are you sure the recalculation doesn't happen? I see that subsequent calls take just as long. This seems related to known regressions from the current refactoring. @efiop Is this expected?
A smaller dataset to reproduce:
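The snippet itself was lost in formatting; presumably it was a smaller variant of the reproduction script from the issue body below, along these lines (the file count is illustrative):

```sh
mkdir data
for i in $(seq 1 100); do echo "$i" > "data/file_$i"; done
dvc add data
echo modification | tee data/another_file
time dvc status
time dvc status
```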
Can confirm.
The hashes are not recalculated; they are retrieved from the state db. Times for subsequent runs are the same, though. I ran it with viztracer as follows (same snippet from @pared, but with 50 files instead of 2000 for a clearer trace):
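The exact command was lost in formatting. A plausible invocation, assuming viztracer's standard CLI and that `dvc` resolves to a Python entry-point script (the output filename is an assumption):

```sh
viztracer -o status.json "$(command -v dvc)" status
```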
And visualized with:
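Presumably viztracer's bundled viewer, pointed at the assumed output file from above:

```sh
vizviewer status.json
```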
It appears that 114a07e introduced a performance regression for `dvc status` on modified directories.

After 114a07e, when there is a new untracked file, we are building a new tree on every call. Before 114a07e, we were loading it from state. This is what is causing the slowdown, and I suspect it affects more places beyond `status`.
@dberenbaum strange, what version did you use? On latest master I get:
+ dvc add data
100% Adding...|██████████████████████████████████████████████████████████████████████████████████████████████████|1/1 [00:12, 12.75s/file]
To track the changes with git, run:
git add .gitignore data.dvc
To enable auto staging, run:
dvc config core.autostage true
real 0m13.364s
user 0m8.684s
sys 0m1.900s
+ echo modification
+ tee data/another_file
+ dvc status
data.dvc:
changed outs:
modified: data
real 0m0.959s
user 0m0.759s
sys 0m0.163s
+ dvc status
data.dvc:
changed outs:
modified: data
real 0m0.984s
user 0m0.789s
sys 0m0.162s
@pared I also used latest master. Maybe I misinterpreted your concern. It seems like the two `dvc status` calls take about the same time.
Sorry @dberenbaum, my concern was that we get the `Computing file/dir hashes (only done once)` message on every `status` call.
We have been looking into that with @daavoo, and it seems that before 114a07e the staged directory object was found in the staging odb:

Lines 231 to 232 in 114a07e

However, after the change, the staging db became a ReferenceObjectDB, and the underlying MemoryFileSystem does not seem to find the staged dir, so the tree gets rebuilt on each call.
That's great @pared, thanks! So hashes are being recomputed each time, right?
Do you know if this is true?
@pmrowla Any insight into this?
I'll take a look into this.
So this occurs for modified directories. This was an intentional decision when we moved to using objects, since the previous behavior was to save the .dir file to cache even though the data itself had not been committed yet. The actual md5 for the modified directory (the .dir object) is not saved anywhere until the data is committed, so we have to rebuild the tree on each call.

However, when we rebuild the tree, we don't recalculate file MD5s more than once. Even though the .dir file is not saved to cache, computed hashes are saved in the state db. So when we rebuild the tree, each hash will be loaded from state instead of recomputed entirely:

Lines 97 to 99 in 6c04d0e
Lines 80 to 86 in 6c04d0e
Some performance regression is expected here, since we have to walk the directory in the filesystem for each call. IMO the main problem in this issue is that the progress bar message shown when we build the tree is misleading, since it is hard-coded to say that DVC is recomputing hashes and that this is only supposed to happen once:

Lines 126 to 130 in 6c04d0e
It should probably just say something more accurate. The actual ODB/dvc data fix for this is for us to separate the performance-optimization check from .dir file existence in dvc data, but this isn't really something we can do until the 3.0 cache changes are made. Basically, what we want is to be able to save the object containing the .dir json data to cache, and use a different file (something like abc123.dir.index) as the indicator of whether or not the rest of that directory's files have also been committed to an ODB. One temporary workaround would be to store these loose .dir files somewhere other than the cache.
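As a rough sketch of that proposal (the layout and the `.dir.index` name are illustrative, extending the abc123 example above, not DVC's actual 3.0 design):

```
.dvc/cache/
└── ab/
    ├── c123.dir        # object holding the .dir json data (the tree)
    └── c123.dir.index  # marker: the directory's files are all committed to the ODB
```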
The problem I see is that the performance gap increases with the number of files already tracked in the directory. I don't know how critical the scenario is (it might somehow overlap with LDB scope), but it has a significant impact on datasets with many "small" files, for example with 10k files:
Where we are jumping from 0.8s to 28s. Increasing the number of threads doesn't help, which is probably a different issue.
Apart from the UI fix, I would then add this to a checklist for 3.0 changes to ensure it gets revisited.
Besides `status`, what other commands are affected?

In the previous behavior, how was the `.dir` file for an uncommitted directory handled?
I think we should at least reassess some of these tradeoffs/temporary regressions until we have a clear plan for when they will be "properly" fixed. @pmrowla Do you have thoughts on the level of effort and other pros/cons?
Commands like `status`, which need the hash of a modified directory without committing it, are affected.
We know what the hash of the modified dir will be after the first call, but we don't save the tree anywhere until the data is committed. Previously, since we saved the .dir file to cache at hashing time, subsequent calls could load the tree from cache instead of rebuilding it.
The action point to start with is to change the message, since it clearly does not correspond to what is going on under the hood.
Also related benchmark issue: iterative/dvc-bench#315.
For the record: I added a reproducer in iterative/dvc-bench#341. The issue is in the workspace_status branch. Looks like we need to save raw dir objects.
When staging a directory, always save a "raw dir object" to the odb. If the corresponding ".dir" object has not been added to the odb, `stage()` calls can load the tree from the raw dir object instead of rebuilding it by walking the directory. This can lead to significant speed improvements when calling `dvc status` for modified directories. Fixes #7390
When calling `status` on a modified dir, DVC shows the message `Computing file/dir hashes (only done once)` for each `status` call.

Reproduction script:
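The script itself was lost in formatting. Based on the thread above (a directory of 2000 small files, a new untracked file, then two timed `status` calls), a plausible reconstruction, with file names and sizes as assumptions:

```sh
#!/bin/bash
set -eux

# Fresh repo (paths are illustrative).
mkdir repro && cd repro
git init -q
dvc init -q

# Track a directory with many small files.
mkdir data
for i in $(seq 1 2000); do
    echo "$i" > "data/file_$i"
done
dvc add data

# Modify the directory by adding an untracked file.
echo modification | tee data/another_file

# Each call prints "Computing file/dir hashes (only done once)".
time dvc status
time dvc status
```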
It might not necessarily be an issue with DVC, as the "calculation" for subsequent status calls seems to be faster than the original one. Maybe it's just a problem with the message.
EDIT:
Running the previous script with much bigger files shows that the recalculation does not happen. So it's probably the message that is the problem.