Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvc.data: save and try loading raw dir objects #7597

Merged
merged 1 commit into from
May 12, 2022

Conversation

dtrifiro
Copy link
Contributor

@dtrifiro dtrifiro commented Apr 19, 2022

When staging a directory, always save a "raw dir object" to the odb. If the corresponding ".dir" object has not been added to the odb, stage() calls can load the tree from the raw dir object instead of rebuilding it by walking the directory.
This can lead to significant speed improvements when calling dvc status for modified directories.

Fixes #7390

@dtrifiro dtrifiro changed the title speed up status calls with modified dirs WIP: speed up status calls with modified dirs Apr 19, 2022
@efiop efiop self-requested a review April 19, 2022 13:32
@dtrifiro dtrifiro force-pushed the fix/7390-status-recalculating-hashes branch 2 times, most recently from 86c491e to d6152f7 Compare April 20, 2022 12:58
@dtrifiro dtrifiro force-pushed the fix/7390-status-recalculating-hashes branch 7 times, most recently from 8d4fc81 to 73e1716 Compare April 27, 2022 17:13
@dtrifiro dtrifiro force-pushed the fix/7390-status-recalculating-hashes branch from 73e1716 to a207fcf Compare April 29, 2022 16:15
@dtrifiro

This comment was marked as outdated.

@dtrifiro dtrifiro changed the title WIP: speed up status calls with modified dirs speed up status calls for modified dirs Apr 30, 2022
@dtrifiro dtrifiro force-pushed the fix/7390-status-recalculating-hashes branch from a207fcf to 22fe50b Compare April 30, 2022 16:04
@dtrifiro dtrifiro added the A: status Related to the dvc diff/list/status label Apr 30, 2022
@dtrifiro dtrifiro marked this pull request as ready for review April 30, 2022 16:04
@dtrifiro dtrifiro requested a review from a team as a code owner April 30, 2022 16:04
dvc/data/stage.py Outdated Show resolved Hide resolved
dvc/data/stage.py Outdated Show resolved Hide resolved
@efiop
Copy link
Contributor

efiop commented Apr 30, 2022

@dtrifiro Thanks for the benches! 🙏

Btw, could you take a look at what is still taking that long in relation to this PR? I suppose it is mostly the overhead of those ref obj recreations? Just wondering if there are any low hangers that we could pick until a proper solution with persistent refobjects arrives.

@dtrifiro

This comment was marked as outdated.

@dtrifiro dtrifiro force-pushed the fix/7390-status-recalculating-hashes branch from 22fe50b to 6124165 Compare May 2, 2022 09:02
dvc/data/stage.py Outdated Show resolved Hide resolved
@efiop
Copy link
Contributor

efiop commented May 2, 2022

@dtrifiro Thanks for the traces 🙏 Any particular points you were able to identify? Any particular ideas?

dvc/hash_info.py Outdated Show resolved Hide resolved
@efiop efiop changed the title speed up status calls for modified dirs status: speed up for modified dirs May 2, 2022
@efiop efiop added the optimize Optimizes DVC label May 2, 2022
dvc/data/stage.py Outdated Show resolved Hide resolved
dvc/data/stage.py Outdated Show resolved Hide resolved
@dtrifiro dtrifiro changed the title status: speed up for modified dirs dvc.stage: save and try loading raw dir hashes May 9, 2022
@dtrifiro dtrifiro marked this pull request as draft May 9, 2022 11:03
@dtrifiro dtrifiro changed the title dvc.stage: save and try loading raw dir hashes dvc.stage: save and try loading raw dir objects May 9, 2022
@dtrifiro dtrifiro force-pushed the fix/7390-status-recalculating-hashes branch 3 times, most recently from 352517c to 27b7927 Compare May 10, 2022 08:14
@dtrifiro dtrifiro marked this pull request as ready for review May 10, 2022 08:14
dvc/data/stage.py Outdated Show resolved Hide resolved
@dtrifiro dtrifiro force-pushed the fix/7390-status-recalculating-hashes branch from 27b7927 to 3f24f91 Compare May 10, 2022 09:11
@dtrifiro dtrifiro changed the title dvc.stage: save and try loading raw dir objects dvc.data: save and try loading raw dir objects May 11, 2022
@dtrifiro dtrifiro force-pushed the fix/7390-status-recalculating-hashes branch from 3f24f91 to b8362bc Compare May 11, 2022 07:39
@dtrifiro
Copy link
Contributor Author

Updated benchmarks can be found here:
https://github.com/dtrifiro/dvc-bench/runs/6368942357?check_suite_focus=true#step:8:13

A summary for test_status:

----------------------------------------------------------------------------------- benchmark 'test_status-status': 3 tests ------------------------------------------------------------------------------------
Name (time in s)                                      Min                Max               Mean            StdDev             Median               IQR            Outliers     OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_status-status (test_status/0001_fix-739)     25.2779 (1.0)      25.2779 (1.0)      25.2779 (1.0)      0.0000 (1.0)      25.2779 (1.0)      0.0000 (1.0)           0;0  0.0396 (1.0)           1           1
test_status-status (test_status/0002_main)        36.9391 (1.46)     36.9391 (1.46)     36.9391 (1.46)     0.0000 (1.0)      36.9391 (1.46)     0.0000 (1.0)           0;0  0.0271 (0.68)          1           1
test_status-status (test_status/0003_2.10.0)      36.9755 (1.46)     36.9755 (1.46)     36.9755 (1.46)     0.0000 (1.0)      36.9755 (1.46)     0.0000 (1.0)           0;0  0.0270 (0.68)          1           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------- benchmark 'test_status-status-changed': 3 tests ------------------------------------------------------------------------------------
Name (time in s)                                              Min                Max               Mean            StdDev             Median               IQR            Outliers     OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_status-status-changed (test_status/0001_fix-739)     14.9064 (1.0)      14.9064 (1.0)      14.9064 (1.0)      0.0000 (1.0)      14.9064 (1.0)      0.0000 (1.0)           0;0  0.0671 (1.0)           1           1
test_status-status-changed (test_status/0002_main)        26.9573 (1.81)     26.9573 (1.81)     26.9573 (1.81)     0.0000 (1.0)      26.9573 (1.81)     0.0000 (1.0)           0;0  0.0371 (0.55)          1           1
test_status-status-changed (test_status/0003_2.10.0)      27.1236 (1.82)     27.1236 (1.82)     27.1236 (1.82)     0.0000 (1.0)      27.1236 (1.82)     0.0000 (1.0)           0;0  0.0369 (0.55)          1           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------ benchmark 'test_status-status-changed-noop': 3 tests -----------------------------------------------------------------------------------
Name (time in s)                                                   Min                Max               Mean            StdDev             Median               IQR            Outliers     OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_status-status-changed-noop (test_status/0001_fix-739)      3.9865 (1.0)       3.9865 (1.0)       3.9865 (1.0)      0.0000 (1.0)       3.9865 (1.0)      0.0000 (1.0)           0;0  0.2508 (1.0)           1           1
test_status-status-changed-noop (test_status/0002_main)        26.6765 (6.69)     26.6765 (6.69)     26.6765 (6.69)     0.0000 (1.0)      26.6765 (6.69)     0.0000 (1.0)           0;0  0.0375 (0.15)          1           1
test_status-status-changed-noop (test_status/0003_2.10.0)      27.0214 (6.78)     27.0214 (6.78)     27.0214 (6.78)     0.0000 (1.0)      27.0214 (6.78)     0.0000 (1.0)           0;0  0.0370 (0.15)          1           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

tree = Tree.load(odb, hash_info.as_raw())
tree.check(odb)
except ObjectFormatError:
raise FileNotFoundError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise FileNotFoundError
raise FileNotFoundError from exc

Copy link
Member

@skshetry skshetry May 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we are here:

Suggested change
raise FileNotFoundError
raise FileNotFoundError(errno.ENOENT, "No such object", odb.hash_to_path(hash_info.value))

Or, just use hash_info.value for simplicity.

@dtrifiro dtrifiro force-pushed the fix/7390-status-recalculating-hashes branch from b8362bc to 3c7d08e Compare May 11, 2022 14:06
@dtrifiro
Copy link
Contributor Author

@dtrifiro Thanks for the traces 🙏 Any particular points you were able to identify? Any particular ideas?

We could push performance a little bit more by avoiding calling stat when creating the ReferenceHashFile in staging.add (calling fs.checksum).

Hacking around, I was able to get a working prototype and ran some benchmarks with dvc-bench's test_status. For the large dataset it's around a 4% speedup, although this comes at the price overriding of ReferenceObjectDB.add and duplicating some code, so it might not be worth it.

@dtrifiro dtrifiro force-pushed the fix/7390-status-recalculating-hashes branch from 3c7d08e to 42f87aa Compare May 11, 2022 15:47
When staging a directory, always save a "raw dir object" to the odb.
If the corresponding ".dir" object has not been added to the odb,
`stage()` calls can load the tree from the raw dir object instead of
rebuilding it by walking the directory.

This can lead to significant speed improvements when calling
`dvc status` for modified directories.

Fixes iterative#7390
@dtrifiro dtrifiro force-pushed the fix/7390-status-recalculating-hashes branch from 42f87aa to 1f92c4b Compare May 11, 2022 15:57
@skshetry
Copy link
Member

@dtrifiro, are those improvements on test_status-status and test_status-status-changed expected?

@dtrifiro
Copy link
Contributor Author

@dtrifiro, are those improvements on test_status-status and test_status-status-changed expected?

I forgot to update the main branch for fork after rebasing, meaning the comparison was against a slightly outdated revision. Here's a comparison for this PR against its base (main@3faf4132):

https://github.com/dtrifiro/dvc-bench/runs/6404025103?check_suite_focus=true#step:8:258

-------------------------------- benchmark 'test_status-status': 5 tests --------------------------------
Name (time in s)                                   Median            StdDev            Rounds  Iterations
---------------------------------------------------------------------------------------------------------
test_status-status (test_status/0001_fix-739)     22.8981 (1.10)     0.0000 (1.0)           1           1
test_status-status (test_status/0002_main)        21.4079 (1.03)     0.0000 (1.0)           1           1
test_status-status (test_status/0003_2.10.2)      20.7695 (1.0)      0.0000 (1.0)           1           1
test_status-status (test_status/0004_2.9.5)       32.8662 (1.58)     0.0000 (1.0)           1           1
test_status-status (test_status/0005_2.6.3)       36.3975 (1.75)     0.0000 (1.0)           1           1
---------------------------------------------------------------------------------------------------------

-------------------------------- benchmark 'test_status-status-changed': 5 tests --------------------------------
Name (time in s)                                           Median            StdDev            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------
test_status-status-changed (test_status/0001_fix-739)     12.6867 (1.0)      0.0000 (1.0)           1           1
test_status-status-changed (test_status/0002_main)        12.7397 (1.00)     0.0000 (1.0)           1           1
test_status-status-changed (test_status/0003_2.10.2)      12.8673 (1.01)     0.0000 (1.0)           1           1
test_status-status-changed (test_status/0004_2.9.5)       25.0225 (1.97)     0.0000 (1.0)           1           1
test_status-status-changed (test_status/0005_2.6.3)       25.2563 (1.99)     0.0000 (1.0)           1           1
-----------------------------------------------------------------------------------------------------------------

-------------------------------- benchmark 'test_status-status-changed-noop': 5 tests --------------------------------
Name (time in s)                                                Median            StdDev            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------
test_status-status-changed-noop (test_status/0001_fix-739)      3.7390 (1.0)      0.0000 (1.0)           1           1
test_status-status-changed-noop (test_status/0002_main)        12.7007 (3.40)     0.0000 (1.0)           1           1
test_status-status-changed-noop (test_status/0003_2.10.2)      11.8403 (3.17)     0.0000 (1.0)           1           1
test_status-status-changed-noop (test_status/0004_2.9.5)       24.2830 (6.49)     0.0000 (1.0)           1           1
test_status-status-changed-noop (test_status/0005_2.6.3)       25.2839 (6.76)     0.0000 (1.0)           1           1
----------------------------------------------------------------------------------------------------------------------

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it looks like we could clean this up a bit more, but let's run with it for now and get back to it when we safe ref objects.

@efiop efiop merged commit 7a795f7 into iterative:main May 12, 2022
@dtrifiro dtrifiro deleted the fix/7390-status-recalculating-hashes branch May 12, 2022 16:48
Comment on lines +296 to +297
obj = load(odb, hash_info)
check(odb, obj, check_hash=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, this PR introduced a typo: it should be _odb here 🙁

efiop added a commit to iterative/dvc-data that referenced this pull request May 29, 2022
efiop added a commit to iterative/dvc-data that referenced this pull request May 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: status Related to the dvc diff/list/status optimize Optimizes DVC performance improvement over resource / time consuming tasks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

status: "recalculating" hashes each call
4 participants