introduce dataset dependency #10164

Merged: 6 commits, Jan 30, 2024
Conversation

@skshetry (Member) commented Dec 14, 2023

Example dvc.yaml

datasets:
  - name: dogs
    type: dvcx
    version: 2

  - name: publaynet
    type: webdataset
    url: "http://storage.googleapis.com/nvdata-publaynet/publaynet-train-{000000..000009}.tar"

  - name: stackoverflow
    type: dvc
    url: git@github.com:iterative/example-get-started.git
    rev: main
    path: data/data.xml
Example pipeline usage:

echo foo > foo
dvc stage add -n test -d ds://dogs -d foo "cp foo bar"
dvc repro
dvc status
# change version in `datasets` and run `dvc status` and `dvc repro` again.
Consuming a dvcx dataset:

from dql.query import DatasetQuery
from dvc.api.dataset import DVCxDataset, get

resolved = get(DVCxDataset, "dogs")
ds = DatasetQuery(name=resolved.name, version=resolved.version)
Consuming a DVC dataset:

from dvc.api.dataset import DVCDataset, get
from dvc.fs.dvc import DVCFileSystem

resolved = get(DVCDataset, "stackoverflow")
fs = DVCFileSystem(url=resolved.url, rev=resolved.rev)
with fs.open(resolved.path) as f:
    process_posts(f.readlines())
Consuming a webdataset:

import webdataset as wds
from dvc.api.dataset import WebDataset, get

resolved = get(WebDataset, "publaynet")
dataset = wds.WebDataset(resolved.url)

Requires pydantic>=2 at the moment.

Complete dvc.yaml file

datasets:
- name: dogs
  type: dvcx
  version: 3

- name: publaynet
  type: webdataset
  url:
    http://storage.googleapis.com/nvdata-publaynet/publaynet-train-{000000..000009}.tar

- name: stackoverflow
  url: git@github.com:iterative/example-get-started.git
  rev: main
  path: data/data.xml

- name: tiny
  type: dvcx
  version: 3

stages:
  test:
    cmd: cp foo bar
    deps:
    - ds://dogs
    - foo

Complete dvc.lock file

schema: '2.0'
stages:
  test:
    cmd: cp foo bar
    deps:
    - path: ds://dogs
      dataset:
        name: dogs
        type: dvcx
        version: 3
    - path: foo
      hash: md5
      md5: d3b07384d113edec49eaa6238ad5ff00
      size: 4

Screencast

asciicast (asciinema recording)

Note regarding Public API

We can easily change the API if needed. I wanted to start with something that is type-safe and gives good completions to users.
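
For illustration, here is a minimal sketch of the idea behind passing the class as the first argument (a hypothetical reconstruction, not the PR's actual code, which uses pydantic per the note above; _DATASETS stands in for the parsed datasets: section of dvc.yaml):

from dataclasses import dataclass
from typing import Type, TypeVar

T = TypeVar("T")

@dataclass(frozen=True)
class DVCxDataset:
    name: str
    version: int

# Stand-in for the parsed top-level `datasets:` section of dvc.yaml.
_DATASETS = {"dogs": {"type": "dvcx", "name": "dogs", "version": 2}}

def get(typ: Type[T], name: str) -> T:
    spec = dict(_DATASETS[name])
    spec.pop("type")  # `type` selects the class; it is not a field
    return typ(**spec)

resolved = get(DVCxDataset, "dogs")
print(resolved.version)  # typed as int, so editor completions work

Because the return type follows the class argument, editors can complete resolved.name, resolved.version, and so on per dataset type.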

codecov bot commented Dec 14, 2023

Codecov Report

Attention: 38 lines in your changes are missing coverage. Please review.

Comparison is base (3b56c40) 90.47% compared to head (44677a3) 90.16%.

Files                        Patch %   Lines
dvc/dependency/dataset.py    55.81%    19 missing
dvc/datasets.py               0.00%     6 missing
dvc/repo/graph.py            33.33%     2 missing, 2 partials
dvc/stage/loader.py           0.00%     2 missing, 1 partial
dvc/dependency/__init__.py   33.33%     1 missing, 1 partial
dvc/repo/index.py            71.42%     2 missing
dvc/stage/serialize.py       33.33%     1 missing, 1 partial
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10164      +/-   ##
==========================================
- Coverage   90.47%   90.16%   -0.31%     
==========================================
  Files         493      495       +2     
  Lines       37699    37775      +76     
  Branches     5449     5461      +12     
==========================================
- Hits        34107    34059      -48     
- Misses       2963     3062      +99     
- Partials      629      654      +25     


@skshetry (Member, Author):

@dberenbaum, this PR is a very basic implementation, but I wanted to get feedback on whether this is what you had in mind regarding pipelines and datasets. Please have a look and let me know what you think.

@dberenbaum (Collaborator):

Looks good and aligns with what we discussed.

I added the .dql/config file mentioned in #10123 (comment) but so far haven't been able to get the actual API working:

$ python ds.py
Traceback (most recent call last):
  File "/private/tmp/dvcx/ds.py", line 6, in <module>
    ds = DatasetQuery(name=resolved.name, version=resolved.version)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dql/query/dataset.py", line 1146, in __init__
    ds = data_storage.get_dataset(name)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dql/data_storage/abstract.py", line 1068, in get_dataset
    raise DatasetNotFoundError(f"Dataset {name} not found.")
dql.error.DatasetNotFoundError: Dataset dogs not found.

$ dvcx pull dogs -o dogs
Error: Error when parsing dataset uri

I can try to look into it more or ask the dvcx team, but let me know if you have ideas on what might be wrong.

I think we will need input from the DVCX team on:

  1. What version info to store in dvc.lock
  2. What the API should return

@skshetry (Member, Author):

I think you have to pull the index first before using DatasetQuery. Also, dql pull needs the ds:// scheme:

dql pull ds://dogs --no-cp

@shcheklein (Member):

Cool stuff. Thanks @skshetry for prototyping and sharing this quickly. Folks, can we add the dvc.lock file after running the pipeline, and the complete dvc.yaml, just to see how it looks?

@skshetry (Member, Author) commented Dec 14, 2023

@shcheklein, dvc.lock looks like the following:

schema: '2.0'
stages:
  test:
    cmd: cp foo bar
    deps:
    - path: ds://dogs
      dataset:
        name: dogs
        type: dvcx
        version: 3
    - path: foo
      hash: md5
      md5: d3b07384d113edec49eaa6238ad5ff00
      size: 4

It is inside deps, but we'll likely move it to a separate section, as with params. deps has been used for local/external-filesystem dependencies, so I'd probably create a separate section (but I'm still thinking about it).

The dataset entry currently contains the same information as the top-level datasets section in dvc.yaml and is compared against it to determine whether the stage changed and needs to be reproduced.
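Conceptually the check is just a comparison of the two specs. A minimal sketch of the idea (my simplification; the PR's actual logic lives in dvc/dependency/dataset.py and may differ):

def dataset_changed(lock_entry: dict, yaml_entry: dict) -> bool:
    # Any field difference (e.g. a version bump in the top-level `datasets`
    # section of dvc.yaml) marks the dependency as changed, so `dvc repro`
    # re-runs the stage.
    return lock_entry != yaml_entry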

Complete dvc.yaml file

datasets:
- name: dogs
  type: dvcx
  version: 3

- name: publaynet
  type: webdataset
  url:
    http://storage.googleapis.com/nvdata-publaynet/publaynet-train-{000000..000009}.tar

- name: stackoverflow
  url: [email protected]:iterative/example-get-started.git
  rev: main
  path: data/data.xml

- name: tiny
  type: dvcx
  version: 3

stages:
  test:
    cmd: cp foo bar
    deps:
    - ds://dogs
    - foo

(Also added in the description above.)

@dberenbaum (Collaborator) commented Dec 15, 2023

I think you have to pull the index first before using DatasetQuery.

Yup, that fixed it, thanks!

It seems like this isn't good UX; we will need a way to see what dataset versions are available on Studio without pulling their indexes locally (I think I raised the same concern in the prior PR). Do you know if it's a priority for the DVCX team? Would it make sense to contribute this functionality back?

@skshetry (Member, Author) commented Dec 15, 2023

You can try it out with something like the following:

from dql.catalog import get_catalog
from dql.dataset import create_dataset_uri
from dql.query import DatasetQuery
from dvc.api.dataset import DVCxDataset, get

resolved = get(DVCxDataset, "dogs")
uri = create_dataset_uri(resolved.name, version=resolved.version)
catalog = get_catalog()
catalog.pull_dataset(uri, no_cp=True)
ds = DatasetQuery(name=resolved.name, version=resolved.version, catalog=catalog)

DatasetQuery feels too low-level to build this auto-pull into. Maybe there will be some high-level wrappers that do this.
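
For illustration, such a wrapper could fold the snippet above into a single call (a hypothetical sketch, not part of this PR or dql; it only recombines the calls shown above):

from dql.catalog import get_catalog
from dql.dataset import create_dataset_uri
from dql.query import DatasetQuery

from dvc.api.dataset import DVCxDataset, get

def open_dvcx_dataset(name: str) -> DatasetQuery:
    # Resolve dataset metadata from dvc.yaml, pull only the index (no file
    # copies), and return a query object pinned to that version.
    resolved = get(DVCxDataset, name)
    catalog = get_catalog()
    uri = create_dataset_uri(resolved.name, version=resolved.version)
    catalog.pull_dataset(uri, no_cp=True)
    return DatasetQuery(name=resolved.name, version=resolved.version, catalog=catalog)

ds = open_dvcx_dataset("dogs")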

We were using a different API before, and now we are not responsible for anything other than materializing the metadata that we have. So I'd prefer that someone from the dvcx team propose an API here; it would be wrong for me to try to fit what we have to their API.

We should at least finalize the DVC side before proposing anything. It's too early from our side, I think.

cc @dmpetrov.

@dmpetrov (Member):

@skshetry excellent work!

It would be great to know how the datasets: section is populated. That should help us better understand UI aspects such as --no-cp.

This might be outside of the scope: do we plan to support this type: webdataset? Also, how about importing a cloud-versioned file from S3? What type should it be?

@skshetry (Member, Author) commented Dec 19, 2023

It would be great to know how the datasets: section is populated. That should help us better understand UI aspects such as --no-cp.

The datasets section is populated by users themselves (we could provide a CLI command to generate it in the future).

This might be outside of the scope: do we plan to support this type: webdataset? Also, how about importing a cloud-versioned file from S3? What type should it be?

At the moment, we only provide an API that gives you the metadata from the datasets section; that's all it does. (Note that webdataset is just an example; it could be anything, since it's only metadata.)

I have been thinking about what we can do with this metadata, hence my experiment with pydantic, but I'm not sure we want to be involved in materializing/streaming datasets ourselves. The idea is similar to #2719.

For cloud versioning, it could be like this:

datasets:
  - name: versioned
    type: url
    url: s3://bucket/name?versionId=versionId

And you can read it using s3fs/fsspec:

import fsspec

# `resolved` is the result of dvc.api.dataset.get() for the entry above
with fsspec.open(resolved.url) as f:
    data = f.read()

Also, how about importing a cloud-versioned file from S3? What type should it be?

Note that this is about pipelines only. Importing, as in materializing/instantiating, is not supported and not part of the plan here; dvc import-url is the way to do that.

@dberenbaum (Collaborator):

Can/should dvc collect some lightweight info about the datasets and save them, like stages, with their own entries in dvc.lock?

For example, assume this dvc.yaml:

datasets:
  dogs:
    name: dogs
    type: dvcx
    version: 3
  versioned:
    type: url
    url: s3://bucket/name/dir
    version_aware: true
stages:
  test:
    cmd: cp foo bar
    deps:
    - ds://dogs
    - foo

The dvc.lock could look like:

schema: '2.0'
datasets:
  dogs:
    type: dvcx
    name: dogs
    version: 3
  versioned:
    type: url
    url: s3://bucket/name/dir
    files:
      - relpath: file1
        version_id: versionId
      - relpath: file2
        version_id: versionId
stages:
  test:
    cmd: cp foo bar
    deps:
    - path: ds://dogs
      dataset:
        name: dogs
        type: dvcx
        version: 3
    - path: foo
      hash: md5
      md5: d3b07384d113edec49eaa6238ad5ff00
      size: 4

This would support:

  • cloud-versioned directories
  • lightweight checksums - use version IDs or other available info to capture and validate the version without calculating file hashes or caching (see the sketch after this list), although dvc will not be able to recover old versions if they are no longer available
  • freezing datasets - if you want to keep using this version of the cloud-versioned directory rather than check for the latest version, set frozen: true for that dataset
  • CLI:
    • dvc datasets add dvcx dogs --info name=dogs,version=2 can add the entry to both dvc.yaml and dvc.lock
    • dvc update ds://dogs can update the entry
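
To make the "lightweight checksums" bullet concrete, here is a rough sketch of how validation could work for a version-aware S3 directory (an illustration of the idea only, not DVC code; the VersionId key and the version_aware flag are s3fs specifics and are assumptions here):

from fsspec.core import url_to_fs

def files_unchanged(url: str, locked_files: list) -> bool:
    # Compare the version IDs recorded in dvc.lock against what the remote
    # currently reports -- no file contents are downloaded or hashed.
    fs, root = url_to_fs(url, version_aware=True)
    for entry in locked_files:
        info = fs.info(f"{root}/{entry['relpath']}")
        if info.get("VersionId") != entry["version_id"]:
            return False
    return True

# `locked_files` mirrors the `files:` list in the proposed dvc.lock above,
# e.g. [{"relpath": "file1", "version_id": "..."}]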

@dmpetrov (Member):

Can/should dvc collect some lightweight info about the datasets and save them

It shouldn't, since a portion of the data is in the DB and the data may not be materialized locally. If something is needed, it should be requested through a command like dvcx status ds://cats@v1.

@dberenbaum (Collaborator):

Can/should dvc collect some lightweight info about the datasets and save them

It shouldn't, since a portion of the data is in the DB and the data may not be materialized locally. If something is needed, it should be requested through a command like dvcx status ds://cats@v1.

Yes, this is what I meant by "lightweight", @dmpetrov. I think we are all clear that DVC should not be calculating md5s, etc. It seems like doing something like dvcx status would be ideal here. At a minimum, it would be helpful to list available versions of datasets by name in dvcx. Are there plans for something like that?

@skshetry (Member, Author) commented Jan 4, 2024

We discussed in the team meeting whether we could fit this into the import-url/import workflow:

$ dvc import-url dvcx://dogs@v3 --virtual
$ cat dogs.dvc
md5: 43d8c1bfe46a6cf2cb9dfe00a8e431b3
frozen: true
deps:
- path: dvcx://dogs@v3
outs:
- hash: md5
  path: dogs # which is an alias for dvcx
  pull: false # or, virtual: true

The path will work like an alias or a name, but won't be materialized. If users need to materialize it, they can remove pull: false / virtual: true.

Also, dvc update will work automatically. We can add a top-level datasets section in the future if we need to. @pmrowla mentioned that we already have an artifacts section, which is metadata with an alias pointing to a file path; the datasets section is similar in that regard.

The dataset.get() method will work based on that alias or path.
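
For consumers, that could look roughly like this (a sketch of the shape under discussion, assuming the dvc.api.dataset module from the examples above):

from dvc.api import dataset

# Resolves via the alias recorded in dogs.dvc; nothing is materialized locally.
resolved = dataset.get("dogs")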

@dberenbaum (Collaborator):

Caveat: This somewhat depends on the discussion in https://github.com/iterative/dql/issues/732 and goes back to #10114 (comment), but I'm putting my current thoughts here so I don't lose them.

Does the proposal above mean that some pipeline stage could look like this?

stages:
  process_dataset:
    cmd: ...
    deps:
    - dogs # alias to dvcx dataset?

Questions:

  1. If we support syntax like path: dvcx://dogs@v3, can we list that directly as a stage dependency instead of doing an import?
  2. Does dvc.lock look the same as above with info about the dataset version?
  3. Does dataset.get("dogs") still make sense as an api?
  4. What happens if I put some file/dir at path dogs?

I think it's more important to handle stage deps first, although it seems possible to support both. Handling datasets as deps could replace or streamline some import scenarios, especially when users prefer to stream the data. Maybe we can even combine the ideas and call the top-level dvc.yaml section imports, where entries act like other imports but don't get materialized anywhere in the workspace (the hash info could go into dvc.lock as mentioned above).

@skshetry merged commit 422caa1 into iterative:main on Jan 30, 2024 (16 of 17 checks passed) and deleted the dataset-dependency branch.
BradyJ27 pushed a commit to BradyJ27/dvc that referenced this pull request Apr 22, 2024
* introduce dataset dependency

* support Annotated types

* split

* drop dvc.api.datasets

* drop pydantic and typing-extensions

* drop dvc.datasets