use-cases: second iteration of Data Registry case #818
to make it easier to follow the 2 parallel stories...
…mmands use to organize the registry for #818
78ef796 to c49bc0c
Summary: way easier to read now. Title: Data Registry (match menu, easier) ✅
any other DVC repository ✅
For imports to work it should be pushed and you need a default upstream configured; this is not optional. The only exception is if we work within a single machine and import locally. ✅
There are also
It is
dataset -> file_descriptor or fd
# Note repo=, this is the preferred way, more future-proof ✅
with dvc.api.open(data_path, repo=repo_url, rev=...) as fd:
    full_text = fd.read()
    # or
    df = pandas.read_csv(fd)
    # or read line by line
    for line in fd:
        process(line)
    # or any other code consuming fd
    # ...
Updating registries is unfinished? You need:
dvc add music/songs
git commit music/songs.dvc -m "Update songs dataset"
git push
dvc push # this might be done with a hook
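The update steps listed in the comment above could be scripted roughly as follows. This is only a sketch: the helper names `registry_update_commands` and `update_registry` are hypothetical, the `run` parameter exists only so the sequence can be exercised without touching a real repository, and it assumes `dataset_path` has no trailing slash so that the `.dvc` file name can be derived from it.

```python
import subprocess


def registry_update_commands(dataset_path, message):
    """Return the update commands from the comment above as argv lists, in order."""
    return [
        ["dvc", "add", dataset_path],
        # `dvc add music/songs` writes music/songs.dvc, which is what Git tracks.
        ["git", "commit", f"{dataset_path}.dvc", "-m", message],
        ["git", "push"],
        ["dvc", "push"],
    ]


def update_registry(dataset_path, message, run=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for argv in registry_update_commands(dataset_path, message):
        run(argv, check=True)
```

Calling `update_registry("music/songs", "Update songs dataset")` inside a registry checkout would run the four commands in sequence; a Git hook, as the comment suggests, is another way to cover the final `dvc push`.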
@Suor I applied most of your suggestions.
No API docs at all yet. This is the very first mention. Let's review this paragraph as part of #463? (I updated that issue's description.)
The example opens a model file actually. How would you consume it? I'm just leaving a comment inside to avoid getting into technical details but I'm open to suggestions for a model file. Should we use a different function instead of open like you suggested earlier? Not sure about all this.
Just the git commit/push commands are "missing", which is not DVC related.
I see, we're not showing
Awesome stuff!
Left a few comments + I think it would be good to include/mention saving data as part of the workflow: dvc push, git commit?
With models there is no universal code, however, if it's
with dvc.api.open('model.pkl', repo='https://...', mode='rb') as fd:
    model = pickle.load(fd)
# or
model = pickle.loads(dvc.api.read('model.pkl', repo='https://...', mode='rb'))
You can use any one of these examples. They are simple, common enough and useful.
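One detail worth spelling out for the pickle case: pickled data is binary, so the file needs to be opened in binary mode rather than the default text mode. A minimal sketch, assuming `dvc.api.open` accepts a `mode` argument; the `load_pickled_model` helper and its `opener` parameter are hypothetical and exist only so the logic can be tried without a real repository:

```python
import io
import pickle


def load_pickled_model(path, repo, rev=None, opener=None):
    """Load a pickled object from a file tracked in a DVC repository."""
    if opener is None:
        # Imported lazily so the sketch can be loaded without DVC installed.
        import dvc.api
        opener = dvc.api.open
    # Request binary mode: pickle.load() needs bytes, not decoded text.
    with opener(path, repo=repo, rev=rev, mode="rb") as fd:
        return pickle.load(fd)
```

With DVC installed, `load_pickled_model('model.pkl', repo='https://...')` would fetch and unpickle the file; only unpickle files from registries you trust, since `pickle` can execute arbitrary code on load.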
private as well as in #818 (review) and below
@Suor yep, I noticed and mentioned that in #818 (comment) as well. I just wasn't sure we wanted to add that step or whether that would complicate the text. But I think you're right, it's a critical thing to show both in the Building and Updating sections, will add.
OK, addressed everything and resolved conflicts with
Awesome! 👏😎 LGTM!
Fix #795
We go from using regular imports/gets to setting up a dedicated data registry, while we should be comparing no DVC at all (ad-hoc conventions and total mess on S3) vs. the DVC Data Registry – which effectively provides some "meta" information for the same data on S3.
Also closes #779