use-cases: second iteration of Data Registry case #818

jorgeorpinel · 2019-11-25T02:12:08Z

Continuation of #679 and #805

Fix #795

Put a diagram near the top.
Further simplify the intro.
Introduce a section between the intro and the example that gives a high-level overview of the Git and DVC commands you would use to organize the registry (from use-cases: improvements to data-registry case per Alex' review #805 (review)).
The doc currently goes from regular imports/gets to setting up a dedicated data registry, but it should compare using no DVC at all (ad-hoc conventions and a total mess on S3) with the DVC Data Registry, which effectively provides "meta" information for the same data on S3.

Also closes #779

Suor · 2019-12-12T16:36:37Z

Summary: way easier to read now.

Title: Data Registry (match menu, easier) ✅

any other DVC repositories

any other DVC repository ✅

The actual data is stored in the project's cache and can be pushed

For imports to work, the data must be pushed and you need a default upstream configured; this is not optional. The only exception is when we work on a single machine and import locally. ✅

as well as dvc.api.open()

There are also dvc.api.read() and dvc.api.get_url(). The .read() is the most high-level. Also, why is it not linked to anything? Don't we have the API documented?

has a --revision option

It is --rev not --revision ✅

with dvc.api.open(data_path, repo_url) as dataset:

dataset -> file_descriptor or fd

# Note repo=, this is the preferred way, more future-proof ✅
with dvc.api.open(data_path, repo=repo_url, rev=...) as fd:
    full_text = fd.read()
    # or
    df = pandas.read_csv(fd)
    # or read line by line
    for line in fd:
        process(line)
    # or any other code consuming fd
    # ...
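The point about `fd` can be illustrated with a minimal sketch: the object yielded by `dvc.api.open()` is file-like, so any code written against ordinary file objects can consume it. The helper names `count_nonempty_lines` and `count_lines_in_registry` are hypothetical, and the registry call assumes `dvc` is installed and the repo has a default remote; the demo at the bottom uses an in-memory stand-in instead of a real repo.

```python
import io


def count_nonempty_lines(fd):
    """Consume any text file-like object line by line."""
    return sum(1 for line in fd if line.strip())


def count_lines_in_registry(path, repo, rev=None):
    """Stream a tracked file from a DVC repo and count its non-empty lines.

    Assumes `dvc` is installed and `repo` has a default remote configured.
    """
    import dvc.api  # imported lazily so the helper above works without DVC

    with dvc.api.open(path, repo=repo, rev=rev) as fd:
        return count_nonempty_lines(fd)


# Stand-in for the remote file: a file-like object is consumed the same way.
sample = io.StringIO("id,title\n1,Intro\n\n2,Outro\n")
print(count_nonempty_lines(sample))
```

Keeping the consuming code separate from the `dvc.api.open()` call means it can be exercised locally without network access.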

Updating registries is unfinished? You need:

dvc add music/songs
git commit music/songs.dvc -m "Update songs dataset"
git push
dvc push  # this might be done with a hook
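For teams that script this step, the same sequence could be wrapped as below. This is only a sketch: the function names `registry_update_commands` and `publish_update` are hypothetical, and it assumes `git` and `dvc` are on the PATH and that the `.dvc` file is already tracked by Git.

```python
import shlex
import subprocess


def registry_update_commands(data_path, message):
    """Return the commands, in order, for publishing a new dataset version."""
    return [
        ["dvc", "add", data_path],
        ["git", "commit", f"{data_path}.dvc", "-m", message],
        ["git", "push"],
        ["dvc", "push"],
    ]


def publish_update(data_path, message, run=subprocess.run):
    """Run each step in order; check=True makes a failing step raise."""
    for cmd in registry_update_commands(data_path, message):
        run(cmd, check=True)


# Dry run: print the commands instead of executing them.
for cmd in registry_update_commands("music/songs", "Update songs dataset"):
    print(shlex.join(cmd))
```

Stopping at the first failure matters here: if `git push` succeeds but `dvc push` is skipped, consumers would see a new version whose data is not yet on the remote.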

jorgeorpinel · 2019-12-13T00:27:35Z

@Suor I applied most of your suggestions.

There are also dvc.api.read() and dvc.api.get_url(). The .read() is the most high-level. Also, why is it not linked to anything? Don't we have the API documented?

No API docs at all yet. This is the very first mention. Let's review this paragraph as part of #463? (I updated that issue's description.)

dataset -> file_descriptor or fd

The example opens a model file actually. How would you consume it? I'm just leaving a comment inside to avoid getting into technical details, but I'm open to suggestions for a model file. Should we use a different function instead of open, like you suggested earlier? Not sure about all this.

Updating registries is unfinished

Just the git commit/push commands are "missing", which is not DVC related.

jorgeorpinel · 2019-12-13T00:29:59Z

Updating registries is unfinished

I see, we're not showing dvc push at all in the doc in fact. I think we want to keep it lean and that's why. What do you think @shcheklein ? I guess it's pretty critical to push the data to a remote, should I add both in Building and Updating?

static/docs/use-cases/data-registries.md

shcheklein

Awesome stuff!

Left a few comments + I think it would be good to include/mention saving data as part of the workflow - dvc push, git commit?

Suor · 2019-12-13T11:26:20Z

The example opens a model file actually. How would you consume it?

With models there is no universal code, however, if it's *.pkl then you can simply unpickle it:

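import pickle
import dvc.api

# mode='rb': pickle needs a binary stream (dvc.api.open defaults to text mode)
with dvc.api.open('model.pkl', repo='https://...', mode='rb') as fd:
    model = pickle.load(fd)
# or
model = pickle.loads(dvc.api.read('model.pkl', repo='https://...', mode='rb'))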

You can use either of these examples. They are simple, common enough and useful.
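The two forms are interchangeable because `pickle.load()` reads from a binary stream while `pickle.loads()` takes the same bytes directly. A small self-contained sketch, using an in-memory payload as a stand-in for what a registry would return (the `model` dict is a hypothetical placeholder, and no DVC call is made here):

```python
import io
import pickle

# Hypothetical stand-in for a trained model stored in the registry.
model = {"weights": [0.1, 0.2], "bias": 0.5}
payload = pickle.dumps(model)

# Stream form: mirrors pickle.load(fd) on a file opened in binary mode.
with io.BytesIO(payload) as fd:
    from_stream = pickle.load(fd)

# Bytes form: mirrors pickle.loads(...) on the raw file contents.
from_bytes = pickle.loads(payload)

assert from_stream == from_bytes == model
```

With `dvc.api`, this assumes the file is requested in binary mode (`mode='rb'`), since pickle cannot decode a text stream.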

Just the git commit/push commands are "missing", which is not DVC related.

dvc push is dvc related, which actually sends new data to remote making it accessible for gets/imports. And if you don't git push, the repo won't be updated either, so a consumer won't see anything in dvc status and won't be able to dvc update it.

jorgeorpinel · 2019-12-16T19:09:02Z

dvc push is dvc related, which actually sends new data to remote...

@Suor yep, I noticed and mentioned that in #818 (comment) as well. I just wasn't sure we wanted to add that step or whether that would complicate the text. But I think you're right, it's a critical thing to show both in the Building and Updating sections, will add.

jorgeorpinel · 2019-12-16T19:22:19Z

OK, addressed everything and resolved conflicts with master branch. Merge? 😀

shcheklein

Awesome! 👏😎 LGTM!

jorgeorpinel added 9 commits November 19, 2019 18:59

use-cases: address smaller points from review (#795)

c31d971

use-cases: reinforce hypothetical phrasing in data registry intro par…

6002cba

…agraph per #795 (comment)

use-cases: partitioned->split in data registry case

47ebae5

per #795 and #795 (comment)

use-cases: geatly simplify mention about project inter-dependency in …

a578c15

…data reg per #795 (comment) and #795 (comment)

use-cases: improve intro to example in data registry case

d9ad1ab

use-cases: rephrase much of the data registry example to improve its …

50b772e

…logic and readability per #795 (comment)

review usage of ellipses thoughout docs

55ab757

per #805 (comment)

use-cases: remove remark about imports getting messy

d125437

per #795 (comment) (and #805 (review))

Merge branch 'master' into use-cases/data-registry

283eef5

jorgeorpinel mentioned this pull request Nov 25, 2019

use-cases: improvements to data-registry case per Alex' review #805

Closed

shcheklein previously approved these changes Nov 25, 2019

jorgeorpinel requested a review from Suor November 25, 2019 05:08

jorgeorpinel changed the title ~~use-cases: second iteration of data-registry case~~ [WIP] use-cases: second iteration of data-registry case Nov 25, 2019

use-cases: further simplify intro of data registry case

3cba8f8

for #818

use-cases: separate example into 2 sections, expand on them

131a27e

to make it easier to follow the 2 parallel stories...

jorgeorpinel changed the title ~~[WIP] use-cases: second iteration of data-registry case~~ [WIP] use-cases: second iteration of Data Registry case Nov 25, 2019

use-cases: comlpete "Building a data registry" section in data-registry

a7dc465

jorgeorpinel mentioned this pull request Nov 26, 2019

use-cases: revise so they're more high level "landing pages" #820

Closed

8 tasks

use-cases: provide high level abstract overview of the Git and DVC co…

57d4059

…mmands use to organize the registry for #818

use-cases: simplify intro and 2nd section in data-registry

c49bc0c

jorgeorpinel force-pushed the use-cases/data-registry branch from 78ef796 to c49bc0c Compare November 26, 2019 06:27

use-cases: updated img subscript for data registry

53ea7c6

jorgeorpinel mentioned this pull request Dec 13, 2019

api: create documentation #463

Closed

2 tasks

use-cases: address Alex' feedback on data registry 2nd iteration

7887ca2

per #818 (comment)

shcheklein reviewed Dec 13, 2019

weekly-digest bot mentioned this pull request Dec 15, 2019

Weekly Digest (8 December, 2019 - 15 December, 2019) #865

Closed

use-cases: addressing more feedback from Ivan

175b75a

private as well as in #818 (review) and below

use-cases: address Alex's feedback from

7a395f8

#818 (comment)

Merge branch 'master' into use-cases/data-registry

f9c1a74

jorgeorpinel requested a review from shcheklein December 16, 2019 19:22

shcheklein approved these changes Dec 16, 2019

shcheklein merged commit b8c3b8d into master Dec 16, 2019

shcheklein deleted the use-cases/data-registry branch December 18, 2019 06:19

weekly-digest bot mentioned this pull request Dec 22, 2019

Weekly Digest (15 December, 2019 - 22 December, 2019) #878

Closed

jorgeorpinel added the A: docs, C: cases, and type: enhancement labels Mar 16, 2023
