status: "recalculating" hashes each call #7390

Closed · pared opened this issue Feb 16, 2022 · 18 comments · Fixed by #7597
Assignees: dtrifiro
Labels: A: status · bug · p1-important · performance · regression · ui

Comments

@pared
Contributor

pared commented Feb 16, 2022

When calling status on a modified dir, DVC shows the message Computing file/dir hashes (only done once) for each status call.
Reproduction script:

#!/bin/bash

set -exu

TMPDIR=issue-7390
mkdir $TMPDIR
pushd $TMPDIR

wsp=test_wspace
rp=test_repo

rm -rf $wsp && mkdir $wsp && pushd $wsp

mkdir $rp && pushd $rp

git init --quiet
dvc init --quiet
dvc config core.checksum_jobs 1

git add -A
git commit -am "init"

mkdir data

for i in {1..2000}
do
	dd if=/dev/random of=data/file$i bs=1k count=10 >& /dev/null
done

dvc add data
time dvc status
echo modification >> data/another_file

time dvc status 
time dvc status 

It might not necessarily be an issue with DVC, as the "calculation" for subsequent status calls seems to be faster than the original one. Maybe it's just a problem with the message.

EDIT:
Running the previous script with much bigger files shows that the recalculation does not happen, so it's probably the message that is the problem.

@pared pared added bug Did we break something? p2-medium Medium priority, should be done, but less important labels Feb 16, 2022
@pared pared changed the title status: recalculating hashes each call status: "recalculating" hashes each call Feb 16, 2022
@dberenbaum
Collaborator

@pared Are you sure the recalculation doesn't happen? I see that subsequent calls take just as long.

This seems related to known regressions from the current refactoring. @efiop Is this expected?

$ time dvc status
wiki.dvc:
        changed outs:
                modified:           wiki
dvc status  47.80s user 11.55s system 104% cpu 56.869 total
$ time dvc status
wiki.dvc:
        changed outs:
                modified:           wiki
dvc status  47.78s user 11.07s system 102% cpu 57.451 total

A smaller dataset to reproduce:

$ dvc get-url http://images.cocodataset.org/zips/val2017.zip
$ unzip val2017.zip
$ dvc add val2017
$ echo foo > val2017/foo
$ time dvc status
val2017.dvc:
        changed outs:
                modified:           val2017
dvc status  5.62s user 1.18s system 98% cpu 6.896 total
$ time dvc status
val2017.dvc:
        changed outs:
                modified:           val2017
dvc status  5.61s user 1.19s system 95% cpu 7.105 total
$ dvc add val2017
$ time dvc status
Data and pipelines are up to date.
dvc status  0.42s user 0.13s system 95% cpu 0.575 total

@dtrifiro
Contributor

dtrifiro commented Feb 18, 2022

Can confirm:

+ dvc add data
100% Adding...|██████████|1/1 [00:07,  7.09s/file]

To track the changes with git, run:

    git add .gitignore data.dvc

To enable auto staging, run:

	dvc config core.autostage true
+ dvc status
Data and pipelines are up to date.

real	0m0.569s
user	0m0.454s
sys	0m0.077s
+ echo modification
+ dvc status
data.dvc:
	changed outs:
		modified:           data

real	0m2.941s
user	0m2.494s
sys	0m0.243s
+ dvc status
data.dvc:
	changed outs:
		modified:           data

real	0m3.053s
user	0m2.555s
sys	0m0.301s

@daavoo
Contributor

daavoo commented Feb 18, 2022

The hashes are not recalculated; they are retrieved from the md5s db, so the messaging is somewhat misleading.
It's true that "computing" is only done once, but it should not say 'computing hashes' if we are not computing, IMO.

Times for subsequent runs are the same, though.

I ran it with viztracer as follows (same snippet from @pared, but I used 50 files instead of 2000 for clearer debugging):

dvc status --viztracer --viztracer-depth 10
dvc status --viztracer --viztracer-depth 10

And visualized with:

vizviewer viztracer.dvc-20220218_093001.json

@daavoo
Contributor

daavoo commented Feb 18, 2022

It appears that 114a07e introduced a performance regression for status:

# Previous commit
# @34697ef2983cd4fac7cc66aefc8577fdde8980b4
dvc status  0,31s user 0,11s system 64% cpu 0,654 total
# @114a07e7a448b667682fbe331ac06efbf4453494
dvc status  1,46s user 0,32s system 98% cpu 1,807 total

After 114a07e, when there is a new untracked file, we build a new Tree object (_get_tree_obj -> _build_tree) on each status call.

Before 114a07e, we were loading from state (_load_from_state) via the staging DB.

This is what causes the Computing file/dir hashes (only done once) message to appear, along with the performance regression.

I suspect this affects more places beyond status.

cc @efiop @pmrowla

@pared
Contributor Author

pared commented Feb 18, 2022

@dberenbaum Strange, what DVC version did you use? For me on latest master (500 files, 10 MB each):

+ dvc add data
100% Adding...|██████████████████████████████████████████████████████████████████████████████████████████████████|1/1 [00:12, 12.75s/file]
                                                                                                                                          
To track the changes with git, run:                                                                                                       

    git add .gitignore data.dvc

To enable auto staging, run:

        dvc config core.autostage true

real    0m13.364s
user    0m8.684s
sys     0m1.900s
+ echo modification
+ tee data/another_file
+ dvc status
data.dvc:                                                                                                                                 
        changed outs:                                                                                                                     
                modified:           data

real    0m0.959s
user    0m0.759s
sys     0m0.163s
+ dvc status
data.dvc:                                                                                                                                 
        changed outs:                                                                                                                     
                modified:           data

real    0m0.984s
user    0m0.789s
sys     0m0.162s

@dberenbaum
Collaborator

@pared I also used latest master. Maybe I misinterpreted your concern. It seems like the two status calls after modification took the same time, which I thought was the problem?

@daavoo daavoo added the A: status Related to the dvc diff/list/status label Feb 22, 2022
@pared
Contributor Author

pared commented Feb 23, 2022

Sorry @dberenbaum, my concern was that we get Computing file/dir hashes (only done once) each time we call dvc status - so it's clearly not done only once. I am not yet sure whether recalculation actually happens, but the message seems misleading.

@pared
Contributor Author

pared commented Feb 23, 2022

We have been looking into this with @daavoo, and it seems that before 114a07e the staging DB, under .dvc/tmp/staging, was able to locate hash_info for the target:

dvc/dvc/objects/stage.py

Lines 231 to 232 in 114a07e

for odb_ in (odb, staging):
    if odb_.exists(hash_info):

however, after the change, the staging DB became a ReferenceObjectDB, and the underlying MemoryFileSystem does not seem to find the staged dir: _load_from_state fails with FileNotFoundError and the tree gets rebuilt, as @daavoo mentioned in the previous comment. We did not manage to find out exactly why.

@dberenbaum dberenbaum added p1-important Important, aka current backlog of things to do regression Ohh, we broke something :-( and removed p2-medium Medium priority, should be done, but less important labels Feb 23, 2022
@dberenbaum
Collaborator

That's great @pared, thanks! So hashes are being recomputed each time, right?

I suspect this affects more places beyond status.

Do you know if this is true?

@dberenbaum
Collaborator

@pmrowla Any insight into this?

@pmrowla
Contributor

pmrowla commented Feb 24, 2022

I'll take a look into this

@pmrowla
Contributor

pmrowla commented Feb 24, 2022

So this occurs for status because computed .dir hashes are not saved to disk after the status call (the actual .dir json data only exists in the memory fs for the duration of the dvc status call). But this is expected behavior, and IMO the issue here is a UI problem. We display the progress bar but are not actually computing file hashes; @daavoo's initial comment was correct: #7390 (comment).


This was an intentional decision when we moved to using objects, since the previous behavior was to save the .dir file to .dvc/cache, even though none of the actual file data is committed to cache at that time. With that behavior we were actually corrupting .dvc/cache (since having .dir files present without the committed file data breaks the "if dir hash exists in ODB, assume that the entire dir contents also exist in the ODB" performance optimization).

The actual md5 for the modified directory (abc123.dir) is saved and retrieved from the state DB, but we cannot load the actual directory tree (since the json data that tells us abc123.dir contains files def456, abc789, ... is unavailable). So in this case, we rebuild the tree each time (instead of loading it from a .dir file).

However, when we rebuild the tree, we don't recalculate file MD5s more than once. Even though the .dir file is not saved to cache, computed hashes are saved in the state db. So when we rebuild the tree, each hash will be loaded from state instead of recomputed entirely:

dvc/dvc/data/stage.py

Lines 97 to 99 in 6c04d0e

def _stage_file(fs_path, fs, name, odb=None, upload_odb=None, dry_run=False):
    state = odb.state if odb else None
    meta, hash_info = get_file_hash(fs_path, fs, name, state=state)

dvc/dvc/data/stage.py

Lines 80 to 86 in 6c04d0e

def get_file_hash(fs_path, fs, name, state=None):
    if state:
        meta, hash_info = state.get(  # pylint: disable=assignment-from-none
            fs_path, fs
        )
        if hash_info:
            return meta, hash_info

Some performance regression is expected here since we have to walk the filesystem directory for each dvc status call to rebuild the ODB tree object, but the hit comes only from walking the fs tree, not from recomputing MD5s for every file in that tree. Computing MD5s still only happens on the first dvc status call. (Note that this isn't necessarily specific to dvc status; it applies to any other command that checks stage/output status without committing to cache.)
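As a rough illustration of why rebuilding the tree does not recompute MD5s (this is just a sketch of the idea, not DVC's actual State implementation): hashes are cached keyed by the file's path, mtime, and size, and only recomputed when those change:

import hashlib
import os

_state = {}  # path -> (mtime, size, md5); stand-in for the state DB

def file_md5(path):
    st = os.stat(path)
    cached = _state.get(path)
    if cached and cached[:2] == (st.st_mtime, st.st_size):
        return cached[2]  # loaded from "state", not recomputed
    with open(path, "rb") as fobj:
        md5 = hashlib.md5(fobj.read()).hexdigest()
    _state[path] = (st.st_mtime, st.st_size, md5)
    return md5

On the second dvc status call, every unchanged file hits the cached branch, so only the directory walk itself costs time.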


IMO the main problem in this issue is that the progress bar message shown when we build the tree is misleading, since it is hard-coded to say that DVC is recomputing hashes and that this is only supposed to happen once:

dvc/dvc/data/stage.py

Lines 126 to 130 in 6c04d0e

with Tqdm(
    unit="md5",
    desc="Computing file/dir hashes (only done once)",
    disable=no_progress_bar,
) as pbar:

It should probably just say something like Checking status of <dir> and drop the only done once part of the message.
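i.e. something along these lines (illustrative only; the dir_path variable here is hypothetical):

with Tqdm(
    unit="md5",
    desc=f"Checking status of {dir_path}",
    disable=no_progress_bar,
) as pbar: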


The actual ODB/dvc data fix for this is for us to separate the performance-optimization check from .dir file existence in dvc data, but this isn't really something we can do until the 3.0 cache changes are made. Basically, what we want is to be able to save the object containing the .dir json data to cache, and use a different file (something like abc123.dir.index) as the indicator of whether or not the rest of that directory's files have also been committed to an ODB.
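A hypothetical sketch of that check (the helper names and the .index suffix handling are illustrative, not an actual DVC API):

import os

def object_path(cache_dir, oid):
    # content-addressed layout: <cache>/<first 2 chars>/<rest>
    return os.path.join(cache_dir, oid[:2], oid[2:])

def dir_fully_committed(cache_dir, dir_oid):
    # dir_oid is e.g. "abc123.dir"; the .dir object alone would no
    # longer imply the files are present -- a separate marker would
    return os.path.exists(object_path(cache_dir, dir_oid) + ".index")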

One temporary workaround would be to store these loose .dir files somewhere other than .dvc/cache (like in .dvc/tmp/...), which we discussed doing back when these changes were first made but decided not to (based on the assumption that the performance hit here is still acceptable until it is fixed properly in 3.0).

@pmrowla pmrowla added the ui user interface / interaction label Feb 24, 2022
@daavoo
Contributor

daavoo commented Feb 24, 2022

Some performance regression is expected here since we have to walk the filesystem directory for each dvc status call to rebuild the ODB tree object, but the hit comes only from walking the fs tree, not from recomputing MD5s for every file in that tree.

The problem I see is that the performance gap increases with the number of files tracked in directories.

I don't know how critical the scenario is (it might somehow overlap with LDB scope), but it has a significant impact on datasets with many "small" files; for example, with 10k files:

# Previous commit
# @34697ef2983cd4fac7cc66aefc8577fdde8980b4
dvc status  0,80s user 0,29s system 82% cpu 1,329 total
# >=114a07e7a448b667682fbe331ac06efbf4453494
dvc status  28,47s user 6,14s system 99% cpu 34,687 total                                                                                                 

We are jumping from 0.8s to 28s.

Increasing the number of threads doesn't help; that's probably a different issue, related to #7296 (reply in thread).

The actual ODB/dvc data fix for this is is for us to separate the performance optimization check from .dir file existence in dvc data, but this isn't really something we can do until 3.0 cache changes are made. Basically what we want is to be able to save the object containing the .dir json data to cache, and use a different file (something like abc123.dir.index) as the indicator of whether or not the rest of that directory's files have also been committed to an ODB.

Apart from the UI fix, I would then add this to a checklist for 3.0 changes to ensure it gets revisited.

@dtrifiro dtrifiro added the performance improvement over resource / time consuming tasks label Feb 24, 2022
@dberenbaum
Collaborator

Note that this isn't necessarily specific to dvc status; it applies to any other command that checks stage/output status without committing to cache

Besides any --no-commit operations, are there any other commands where this would apply?

This was an intentional decision when we moved to using objects, since the previous behavior was to save the .dir file to .dvc/cache, even though none of the actual file data is committed to cache at that time. With this behavior we were actually corrupting .dvc/cache (since having .dir files present without the committed file data breaks the "if dir hash exists in ODB assume that the entire dir contents also exists in ODB performance optimization).

In the previous behavior, how was the .dir json file used to optimize getting the status of a modified directory?

One temporary workaround would be to store these loose .dir files somewhere other than .dvc/cache (like in .dvc/tmp/...), which we discussed doing back when these changes were first made but decided not to (based on the assumption that the performance hit here is still acceptable until it is fixed properly in 3.0).

I think we should at least reassess some of these tradeoffs/temporary regressions until we have a clear plan for when they will be "properly" fixed. @pmrowla Do you have thoughts on the level of effort and other pros/cons?

@pmrowla
Contributor

pmrowla commented Feb 25, 2022

Besides any --no-commit operations, are there any other commands where this would apply?

Commands like dvc diff that check workspace status

In the previous behavior, how was the .dir json file used to optimize getting the status of a modified directory?

We know what the hash of the modified dir will be after the first dvc status call, since we save it to the state db (assuming that the dir has not been modified again since that status call).

Previously, since we saved the .dir file to cache, we could just json.load() it to get the rest of the information about "what files and hashes are inside this directory". Since we no longer have the .dir file in cache, we have to walk the filesystem instead.
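Roughly, the old fast path amounted to this (a simplified sketch, not the actual DVC code):

import json
import os

def load_dir_tree(cache_dir, dir_oid):
    # e.g. dir_oid = "abc123.dir" -> <cache>/ab/c123.dir
    path = os.path.join(cache_dir, dir_oid[:2], dir_oid[2:])
    with open(path) as fobj:
        # a JSON list like [{"md5": "...", "relpath": "..."}, ...]
        return json.load(fobj)

One json.load() instead of an os.walk() over the whole directory.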

I think we should at least reassess some of these tradeoffs/temporary regressions until we have a clear plan for when they will be "properly" fixed. @pmrowla Do you have thoughts on the level of effort and other pros/cons?

The .dvc/tmp fix isn't too complicated; it's just a bit of a hack.

@pared
Contributor Author

pared commented Mar 1, 2022

The action point for a start is to change the message, since it clearly does not correspond to what is going on under the hood.

@dberenbaum dberenbaum added this to DVC Mar 8, 2022
@dberenbaum dberenbaum moved this to Backlog in DVC Mar 8, 2022
@dberenbaum dberenbaum moved this from Backlog to Todo in DVC Mar 8, 2022
@dtrifiro dtrifiro self-assigned this Mar 15, 2022
@dtrifiro dtrifiro moved this from Todo to In Progress in DVC Mar 22, 2022
@skshetry
Member

skshetry commented Apr 5, 2022

Also related benchmark issue: iterative/dvc-bench#315.

@efiop
Contributor

efiop commented Apr 5, 2022

For the record: added a reproducer in iterative/dvc-bench#341

The issue is in the workspace_status branch. Looks like we need to save the raw .dir object (e.g. if 12345.dir is the properly typed dir object, then 12345 is the raw one) to the local cache somewhere during staging, and try to load it during the next staging if we can't find the proper 12345.dir.
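In other words, something like this load order (exists/load_tree here are illustrative helper names, not the real odb API):

def load_or_build_tree(odb, typed_oid, raw_oid, build_from_workspace):
    # try "12345.dir" first, then the raw "12345" saved at staging time
    for oid in (typed_oid, raw_oid):
        if odb.exists(oid):
            return odb.load_tree(oid)
    return build_from_workspace()  # slow path: walk the directory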

@dtrifiro dtrifiro moved this from In Progress to Review In Progress in DVC May 2, 2022
dtrifiro added a commit to dtrifiro/dvc that referenced this issue May 11, 2022
When staging a directory, always save a "raw dir object" to the odb.
If the corresponding ".dir" object has not been added to the odb,
`stage()` calls can load the tree from the raw dir object instead of
rebuilding it by walking the directory.

This can lead to significant speed improvements when calling
`dvc status` for modified directories.

Fixes iterative#7390
efiop pushed a commit that referenced this issue May 12, 2022
Repository owner moved this from Review In Progress to Done in DVC May 12, 2022