-
Notifications
You must be signed in to change notification settings - Fork 632
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added version control diff #1345
Merged
Merged
Changes from all commits
Commits
Show all changes
24 commits
Select commit
Hold shift + click to select a range
992abd1
added commit diff
AbhinavTuli 7987564
diff works
AbhinavTuli 1d9b205
bug fixes
AbhinavTuli 0cd38c1
bug fixes and formatting
AbhinavTuli fe99b77
better formatting
AbhinavTuli 6d229cb
simplify code
AbhinavTuli 85129ed
updated docstrings
AbhinavTuli 9b969fb
move diff utils
AbhinavTuli 7316d2b
add checks for commits
AbhinavTuli e632c9c
fix auto commit
AbhinavTuli 8f1eb3d
more vc fixes
AbhinavTuli 1261ba6
add diff tests
AbhinavTuli 9a69eee
Merge remote-tracking branch 'origin' into vc/diff
AbhinavTuli 35d78ab
fix readonly
AbhinavTuli c337b91
lint fixes
AbhinavTuli 23f3a3f
Merge remote-tracking branch 'origin' into vc/diff
AbhinavTuli 047b7a1
compressed the changes displayed
AbhinavTuli 1f55080
cleaned up diff output
AbhinavTuli 1daf235
fix linting
AbhinavTuli 40e3797
improved tests
AbhinavTuli 255adf8
mypy fix
AbhinavTuli 81c4727
clean up CommitDiff class
AbhinavTuli f86d832
mypy fix
AbhinavTuli 81379a6
cleanup
AbhinavTuli File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,75 @@ | ||
from typing import Set | ||
from hub.core.storage.cachable import Cachable | ||
|
||
|
||
class CommitDiff(Cachable): | ||
"""Stores set of diffs stored for a particular tensor in a commit.""" | ||
|
||
def __init__(self, created=False) -> None: | ||
AbhinavTuli marked this conversation as resolved.
Show resolved
Hide resolved
|
||
self.created = created | ||
self.data_added: Set[int] = set() | ||
self.data_updated: Set[int] = set() | ||
|
||
def tobytes(self) -> bytes: | ||
"""Returns bytes representation of the commit diff | ||
|
||
The format stores the following information in order: | ||
1. The first byte is a boolean value indicating whether the tensor was created in the commit or not. | ||
2. The next 8 bytes are the number of elements in the data_added set, let's call this n. | ||
3. The next 8 * n bytes are the elements of the data_added set. | ||
4. The next 8 bytes are the number of elements in the data_updated set, let's call this m. | ||
5. The next 8 * m bytes are the elements of the data_updated set. | ||
""" | ||
return b"".join( | ||
[ | ||
self.created.to_bytes(1, "big"), | ||
len(self.data_added).to_bytes(8, "big"), | ||
*[idx.to_bytes(8, "big") for idx in self.data_added], | ||
len(self.data_updated).to_bytes(8, "big"), | ||
*[idx.to_bytes(8, "big") for idx in self.data_updated], | ||
] | ||
) | ||
|
||
@classmethod | ||
def frombuffer(cls, data: bytes) -> "CommitDiff": | ||
AbhinavTuli marked this conversation as resolved.
Show resolved
Hide resolved
|
||
"""Creates a CommitDiff object from bytes""" | ||
commit_diff = cls() | ||
|
||
commit_diff.created = bool(int.from_bytes(data[0:1], "big")) | ||
|
||
added_ct = int.from_bytes(data[1:9], "big") | ||
commit_diff.data_added = { | ||
int.from_bytes(data[9 + i * 8 : 9 + (i + 1) * 8], "big") | ||
for i in range(added_ct) | ||
} | ||
|
||
updated_ct = int.from_bytes(data[9 + added_ct * 8 : 17 + added_ct * 8], "big") | ||
offset = 17 + added_ct * 8 | ||
commit_diff.data_updated = { | ||
int.from_bytes(data[offset + i * 8 : offset + (i + 1) * 8], "big") | ||
for i in range(updated_ct) | ||
} | ||
|
||
return commit_diff | ||
|
||
@property | ||
def nbytes(self): | ||
"""Returns number of bytes required to store the commit diff""" | ||
return 17 + (len(self.data_added) + len(self.data_updated)) * 8 | ||
|
||
def add_data(self, global_indexes: Set[int]) -> None: | ||
"""Adds new indexes to data added""" | ||
self.data_added.update(global_indexes) | ||
|
||
def update_data(self, global_index: int) -> None: | ||
"""Adds new indexes to data updated""" | ||
if global_index not in self.data_added: | ||
self.data_updated.add(global_index) | ||
|
||
|
||
def get_sample_indexes_added(initial_num_samples: int, samples) -> Set[int]: | ||
"""Returns a set of indexes added to the tensor""" | ||
if initial_num_samples == 0: | ||
return set(range(len(samples))) | ||
else: | ||
return set(range(initial_num_samples, initial_num_samples + len(samples))) |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we comprehensively changing these emits or cherry picking as we go along?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only have 2 functions that deliberately print out something, diff and log.
Realized that print was a better choice while implementing diff, changed log too for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cluster setup (atm) will capture
stdout
emits as log. Strongly suggest removing any free form debugging logs, and strictly stick tof"<log-level> <log-msg> - <key-0>: <val-0> - [...] <key-n>: <val-n>"
formatting of logs. For debug (check with @absinha18) it is possible thatstderr
is ignored and can be used for free-form debugging emits. We wants logs that we can systemically grep.