Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use-cases: second iteration of Data Registry case #818

Merged
merged 32 commits into from
Dec 16, 2019

Conversation

jorgeorpinel
Copy link
Contributor

@jorgeorpinel jorgeorpinel commented Nov 25, 2019

Continuation of #679 and #805

Fix #795

  • Put a diagram near the top.
  • Further simplify intro
  • Introduce some section between intro and example that provides some high level abstract overview of the Git and DVC commands you would use to organize the registry (from use-cases: improvements to data-registry case per Alex' review #805 (review)).
  • We go from using regular imports/gets to setting up a dedicated data registry, while we should be comparing no DVC at all (ad-hoc conventions and total mess on S3) vs. the DVC Data Registry – which effectively provides some "meta" information for the same data on S3.

Also closes #779

@jorgeorpinel

This comment has been minimized.

@shcheklein

This comment has been minimized.

shcheklein
shcheklein previously approved these changes Nov 25, 2019
@jorgeorpinel

This comment has been minimized.

@jorgeorpinel jorgeorpinel requested a review from Suor November 25, 2019 05:08
@jorgeorpinel jorgeorpinel changed the title use-cases: second iteration of data-registry case [WIP] use-cases: second iteration of data-registry case Nov 25, 2019
@jorgeorpinel jorgeorpinel temporarily deployed to dvc-org-pr-818 November 25, 2019 06:37 Inactive
to make it easier to follow the 2 parallel stories...
@jorgeorpinel jorgeorpinel temporarily deployed to dvc-org-pr-818 November 25, 2019 20:27 Inactive
@jorgeorpinel jorgeorpinel changed the title [WIP] use-cases: second iteration of data-registry case [WIP] use-cases: second iteration of Data Registry case Nov 25, 2019
@jorgeorpinel jorgeorpinel temporarily deployed to dvc-org-pr-818 November 25, 2019 23:10 Inactive
@jorgeorpinel jorgeorpinel temporarily deployed to dvc-org-pr-818 November 26, 2019 06:05 Inactive
@jorgeorpinel jorgeorpinel temporarily deployed to dvc-org-pr-818 November 26, 2019 06:24 Inactive
@jorgeorpinel jorgeorpinel force-pushed the use-cases/data-registry branch from 78ef796 to c49bc0c Compare November 26, 2019 06:27
@jorgeorpinel jorgeorpinel temporarily deployed to dvc-landing-use-cases-d-b6xlg5 December 12, 2019 01:45 Inactive
@Suor
Copy link
Contributor

Suor commented Dec 12, 2019

Summary: way easier to read now.

Title: Data Registry (match menu, easier) ✅

any other DVC repositories

any other DVC repository ✅

The actual data is stored in the project's cache and can be pushed

For imports to work it should be pushed and you need a deafult upstream configured, this is not optional. The only exception is if we work within a single machine and import locally. ✅

as well as dvc.api.open()

There are also dvc.api.read() and dvc.api.get_url(). The .read() is the most high-level. Also, why it is not linked to anything? We don't have api documented?

has a --revision option

It is --rev not --revision

with dvc.api.open(data_path, repo_url) as dataset:

dataset -> file_descriptor or fd

# Note repo=, this is the preferred way, more future-proof ✅
with dvc.api.open(data_path, repo=repo_url, rev=...) as fd:
    full_text = fd.read()
    # or
    df = pandas.read_csv(fd)
    # or read line by line
    for line in fd:
        process(line)
    # or any other code consuming fd
    # ...

Updating registries is unfinished? you need:

dvc add music/songs
git commit music/songs.dvc -m "Update songs dataset"
git push
dvc push  # this might be done with a hook

@jorgeorpinel jorgeorpinel mentioned this pull request Dec 13, 2019
2 tasks
@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Dec 13, 2019

@Suor I applied most of your suggestions.

There are also dvc.api.read() and dvc.api.get_url(). The .read() is the most high-level. Also, why it is not linked to anything? We don't have api documented?

No API docs at all yet. This is the very fist mention. Let's review this paragraph as part of #463? (I updated that issue's description.)

dataset -> file_descriptor or fd

The example opens a model file actually. How would you consume it? I'm just leaving a commend inside to avoid getting into technical details but open to suggestions for a model file. Should we use a different function instead of open like you suggested earlier? Not sure about all this.

Updating registries is unfinished

Just the git commit/push commands are "missing", which is not DVC related.

@jorgeorpinel jorgeorpinel temporarily deployed to dvc-landing-use-cases-d-b6xlg5 December 13, 2019 00:27 Inactive
@jorgeorpinel
Copy link
Contributor Author

Updating registries is unfinished

I see, we're not showing dvc push at all in the doc in fact. I think we want to keep it lean and that's why. What do you think @shcheklein ? I guess it's pretty critical to push the data to a remote, should I add both in Building and Updating?

Copy link
Member

@shcheklein shcheklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome stuff!

Left a few comments + I think it would be good to include/mention saving data as part of the worflow - dvc push, git commit?

@Suor
Copy link
Contributor

Suor commented Dec 13, 2019

The example opens a model file actually. How would you consume it?

With models there is no universal code, however, if it's *.pkl then you can simply unpickle it:

with dvc.api.open('model.pkl', repo='https://...') as fd:
    model = pickle.load(fd)
# or 
model = pickle.loads(dvc.api.read('model.pkl', repo='https://...'))

You can use any one if these examples. They are simple, common enough and useful.

Just the git commit/push commands are "missing", which is not DVC related.

dvc push is dvc related, which actually sends new data to remote making it accessible for gets/imports. And if you don't do git push you won't update repo either, so a consumer won't see anything in dvc status and won't be able to dvc update that one.

@jorgeorpinel jorgeorpinel temporarily deployed to dvc-landing-use-cases-d-b6xlg5 December 16, 2019 19:00 Inactive
@jorgeorpinel
Copy link
Contributor Author

dvc push is dvc related, which actually sends new data to remote...

@Suor yep, I noticed and mentioned that in #818 (comment) as well. I just wasn't sure we wanted to add that step or whether that would complicate the text. But I think you're right, it's a critical thing to show both in the Building and Updating sections, will add.

@jorgeorpinel jorgeorpinel temporarily deployed to dvc-landing-use-cases-d-b6xlg5 December 16, 2019 19:19 Inactive
@jorgeorpinel jorgeorpinel temporarily deployed to dvc-landing-use-cases-d-b6xlg5 December 16, 2019 19:21 Inactive
@jorgeorpinel
Copy link
Contributor Author

OK, addressed everything and resolved conflicts with master branch. Merge? 😀

Copy link
Member

@shcheklein shcheklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! 👏😎 LGTM!

@shcheklein shcheklein merged commit b8c3b8d into master Dec 16, 2019
@shcheklein shcheklein deleted the use-cases/data-registry branch December 18, 2019 06:19
@jorgeorpinel jorgeorpinel added A: docs Area: user documentation (gatsby-theme-iterative) C: cases Content of /doc/use-cases type: enhancement Something is not clear, small updates, improvement suggestions labels Mar 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: docs Area: user documentation (gatsby-theme-iterative) C: cases Content of /doc/use-cases type: enhancement Something is not clear, small updates, improvement suggestions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

use-cases: improve Data Registry case use-cases: add API section to data registry case
3 participants