introduce dataset dependency #10164

Merged: 6 commits, Jan 30, 2024
Conversation

@skshetry (Member) commented Dec 14, 2023

Example dvc.yaml

datasets:
  - name: dogs
    type: dvcx
    version: 2

  - name: publaynet
    type: webdataset
    url: "http://storage.googleapis.com/nvdata-publaynet/publaynet-train-{000000..000009}.tar"

  - name: stackoverflow
    type: dvc
    url: git@github.com:iterative/example-get-started.git
    rev: main
    path: data/data.xml
Example pipeline usage:

echo foo > foo
dvc stage add -n test -d ds://dogs -d foo "cp foo bar"
dvc repro
dvc status
# change version in `datasets` and run `dvc status` and `dvc repro` again.
Consuming a dvcx dataset:

from dql.query import DatasetQuery
from dvc.api.dataset import DVCxDataset, get

resolved = get(DVCxDataset, "dogs")
ds = DatasetQuery(name=resolved.name, version=resolved.version)
Consuming a DVC dataset:

from dvc.api.dataset import DVCDataset, get
from dvc.fs.dvc import DVCFileSystem

resolved = get(DVCDataset, "stackoverflow")
fs = DVCFileSystem(url=resolved.url, rev=resolved.rev)
with fs.open(resolved.path) as f:
    process_posts(f.readlines())
Consuming a webdataset:

import webdataset as wds
from dvc.api.dataset import WebDataset, get

resolved = get(WebDataset, "publaynet")
dataset = wds.WebDataset(resolved.url)

Requires pydantic>=2 at the moment.

Complete dvc.yaml file

datasets:
- name: dogs
  type: dvcx
  version: 3

- name: publaynet
  type: webdataset
  url:
    http://storage.googleapis.com/nvdata-publaynet/publaynet-train-{000000..000009}.tar

- name: stackoverflow
  url: git@github.com:iterative/example-get-started.git
  rev: main
  path: data/data.xml

- name: tiny
  type: dvcx
  version: 3

stages:
  test:
    cmd: cp foo bar
    deps:
    - ds://dogs
    - foo

Complete dvc.lock file

schema: '2.0'
stages:
  test:
    cmd: cp foo bar
    deps:
    - path: ds://dogs
      dataset:
        name: dogs
        type: dvcx
        version: 3
    - path: foo
      hash: md5
      md5: d3b07384d113edec49eaa6238ad5ff00
      size: 4

Screencast

asciicast (asciinema recording)

Note regarding Public API

We can easily change the API if needed. I wanted to start with something that is type-safe and gives good completions to users.
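
For illustration, here is a minimal sketch of the idea behind passing the class as the first argument (a hypothetical reconstruction, not the PR's actual code, which uses pydantic per the note above; _DATASETS stands in for the parsed datasets: section of dvc.yaml):

from dataclasses import dataclass
from typing import Type, TypeVar

T = TypeVar("T")

@dataclass(frozen=True)
class DVCxDataset:
    name: str
    version: int

# Stand-in for the parsed top-level `datasets:` section of dvc.yaml.
_DATASETS = {"dogs": {"type": "dvcx", "name": "dogs", "version": 2}}

def get(typ: Type[T], name: str) -> T:
    spec = dict(_DATASETS[name])
    spec.pop("type")  # `type` selects the class; it is not a field
    return typ(**spec)

resolved = get(DVCxDataset, "dogs")
print(resolved.version)  # typed as int, so editor completions work

Because the return type follows the class argument, editors can complete resolved.name, resolved.version, and so on per dataset type.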

codecov bot commented Dec 14, 2023

Codecov Report

Attention: 38 lines in your changes are missing coverage. Please review.

Comparison is base (3b56c40) 90.47% compared to head (44677a3) 90.16%.

Files                        Patch %   Lines
dvc/dependency/dataset.py    55.81%    19 missing
dvc/datasets.py               0.00%     6 missing
dvc/repo/graph.py            33.33%     2 missing, 2 partials
dvc/stage/loader.py           0.00%     2 missing, 1 partial
dvc/dependency/__init__.py   33.33%     1 missing, 1 partial
dvc/repo/index.py            71.42%     2 missing
dvc/stage/serialize.py       33.33%     1 missing, 1 partial
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10164      +/-   ##
==========================================
- Coverage   90.47%   90.16%   -0.31%     
==========================================
  Files         493      495       +2     
  Lines       37699    37775      +76     
  Branches     5449     5461      +12     
==========================================
- Hits        34107    34059      -48     
- Misses       2963     3062      +99     
- Partials      629      654      +25     


@skshetry (Member, Author):

@dberenbaum, this PR is a very basic implementation, but I wanted to get feedback on whether this is what you had in mind regarding pipelines and datasets. Please have a look and let me know what you think.

@dberenbaum (Collaborator):

Looks good and aligns with what we discussed.

I added the .dql/config file mentioned in #10123 (comment) but so far haven't been able to get the actual API working:

$ python ds.py
Traceback (most recent call last):
  File "/private/tmp/dvcx/ds.py", line 6, in <module>
    ds = DatasetQuery(name=resolved.name, version=resolved.version)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dql/query/dataset.py", line 1146, in __init__
    ds = data_storage.get_dataset(name)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dql/data_storage/abstract.py", line 1068, in get_dataset
    raise DatasetNotFoundError(f"Dataset {name} not found.")
dql.error.DatasetNotFoundError: Dataset dogs not found.

$ dvcx pull dogs -o dogs
Error: Error when parsing dataset uri

I can try to look into it more or ask the dvcx team, but let me know if you have ideas on what might be wrong.

I think we will need input from the DVCX team on:

  1. What version info to store in dvc.lock
  2. What the API should return

@skshetry (Member, Author):

I think you have to pull the index first before using DatasetQuery. Also, dql pull needs the ds:// scheme:

dql pull ds://dogs --no-cp

@shcheklein (Member):

Cool stuff. Thanks @skshetry for prototyping and sharing this quickly. Folks, can we add the dvc.lock file after running the pipeline, and the complete dvc.yaml, just to see how it looks?

@skshetry (Member, Author) commented Dec 14, 2023

@shcheklein, dvc.lock looks like the following:

schema: '2.0'
stages:
  test:
    cmd: cp foo bar
    deps:
    - path: ds://dogs
      dataset:
        name: dogs
        type: dvcx
        version: 3
    - path: foo
      hash: md5
      md5: d3b07384d113edec49eaa6238ad5ff00
      size: 4

It is inside deps, but we'll likely move it to a separate section, as with params. deps has been used for local/external-filesystem dependencies, so I'd probably create a separate section (but I'm still thinking about it).

The dataset entry currently contains the same information as the top-level datasets section in dvc.yaml and is compared against it to determine whether the stage changed and needs to be reproduced.
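Conceptually the check is just a comparison of the two specs. A minimal sketch of the idea (my simplification; the PR's actual logic lives in dvc/dependency/dataset.py and may differ):

def dataset_changed(lock_entry: dict, yaml_entry: dict) -> bool:
    # Any field difference (e.g. a version bump in the top-level `datasets`
    # section of dvc.yaml) marks the dependency as changed, so `dvc repro`
    # re-runs the stage.
    return lock_entry != yaml_entry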

Complete dvc.yaml file

datasets:
- name: dogs
  type: dvcx
  version: 3

- name: publaynet
  type: webdataset
  url:
    http://storage.googleapis.com/nvdata-publaynet/publaynet-train-{000000..000009}.tar

- name: stackoverflow
  url: [email protected]:iterative/example-get-started.git
  rev: main
  path: data/data.xml

- name: tiny
  type: dvcx
  version: 3

stages:
  test:
    cmd: cp foo bar
    deps:
    - ds://dogs
    - foo

(Also added in the description above.)

@dberenbaum (Collaborator) commented Dec 15, 2023

I think you have to pull the index first before using DatasetQuery.

Yup, that fixed it, thanks!

It seems like this isn't good UX; we will need a way to see what dataset versions are available on Studio without pulling their indexes locally (I think I raised the same concern in the prior PR). Do you know if it's a priority for the DVCX team? Would it make sense to contribute this functionality back?

@skshetry (Member, Author) commented Dec 15, 2023

You can try it out with something like the following:

from dql.catalog import get_catalog
from dql.dataset import create_dataset_uri
from dql.query import DatasetQuery
from dvc.api.dataset import DVCxDataset, get

resolved = get(DVCxDataset, "dogs")
uri = create_dataset_uri(resolved.name, version=resolved.version)
catalog = get_catalog()
catalog.pull_dataset(uri, no_cp=True)
ds = DatasetQuery(name=resolved.name, version=resolved.version, catalog=catalog)

DatasetQuery feels too low-level to build this auto-pull into. Maybe there will be some high-level wrappers that do this.
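
For illustration, such a wrapper could fold the snippet above into a single call (a hypothetical sketch, not part of this PR or dql; it only recombines the calls shown above):

from dql.catalog import get_catalog
from dql.dataset import create_dataset_uri
from dql.query import DatasetQuery

from dvc.api.dataset import DVCxDataset, get

def open_dvcx_dataset(name: str) -> DatasetQuery:
    # Resolve dataset metadata from dvc.yaml, pull only the index (no file
    # copies), and return a query object pinned to that version.
    resolved = get(DVCxDataset, name)
    catalog = get_catalog()
    uri = create_dataset_uri(resolved.name, version=resolved.version)
    catalog.pull_dataset(uri, no_cp=True)
    return DatasetQuery(name=resolved.name, version=resolved.version, catalog=catalog)

ds = open_dvcx_dataset("dogs")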

We were using a different API before, and now we are not responsible for anything other than materializing the metadata that we have. So I'd prefer that someone from the dvcx team propose an API here; it would be wrong for me to try to fit what we have to their API.

We should at least finalize the DVC side before proposing anything. It's too early from our side, I think.

cc @dmpetrov.

@dmpetrov (Member):

@skshetry excellent work!

It would be great to know how the datasets: section is populated. That should help us better understand UI aspects such as --no-cp.

This might be outside of the scope: do we plan to support this type: webdataset? Also, how about importing a cloud-versioned file from S3? What type should it be?

@skshetry (Member, Author) commented Dec 19, 2023

It would be great to know how the datasets: section is populated. That should help us better understand UI aspects such as --no-cp.

The datasets section is populated by users themselves (we could provide a CLI command to generate it in the future).

This might be outside of the scope: do we plan to support this type: webdataset? Also, how about importing a cloud-versioned file from S3? What type should it be?

At the moment, we only provide an API that gives you the metadata from the datasets section; that's all it does. (Note that webdataset is just an example; it could be anything, since it's only metadata.)

I have been thinking about what we can do with this metadata, hence my experiment with pydantic, but I'm not sure we want to be involved in materializing/streaming datasets ourselves. The idea is similar to #2719.

For cloud versioning, it could be like this:

datasets:
  - name: versioned
    type: url
    url: s3://bucket/name?versionId=versionId

And you can read it using s3fs/fsspec:

import fsspec

# `resolved` is the result of dvc.api.dataset.get() for the entry above
with fsspec.open(resolved.url) as f:
    data = f.read()

Also, how about importing a cloud-versioned file from S3? What type should it be?

Note that this is about pipelines only. Importing, as in materializing/instantiating, is not supported and not part of the plan here; dvc import-url is the way to do that.

@dberenbaum (Collaborator):

Can/should dvc collect some lightweight info about the datasets and save them, like stages, with their own entries in dvc.lock?

For example, assume this dvc.yaml:

datasets:
  dogs:
    name: dogs
    type: dvcx
    version: 3
  versioned:
    type: url
    url: s3://bucket/name/dir
    version_aware: true
stages:
  test:
    cmd: cp foo bar
    deps:
    - ds://dogs
    - foo

The dvc.lock could look like:

schema: '2.0'
datasets:
  dogs:
    type: dvcx
    name: dogs
    version: 3
  versioned:
    type: url
    url: s3://bucket/name/dir
    files:
      - relpath: file1
        version_id: versionId
      - relpath: file2
        version_id: versionId
stages:
  test:
    cmd: cp foo bar
    deps:
    - path: ds://dogs
      dataset:
        name: dogs
        type: dvcx
        version: 3
    - path: foo
      hash: md5
      md5: d3b07384d113edec49eaa6238ad5ff00
      size: 4

This would support:

  • cloud-versioned directories
  • lightweight checksums - use version IDs or other available info to capture and validate the version without calculating file hashes or caching (see the sketch after this list), although dvc will not be able to recover old versions if they are no longer available
  • freezing datasets - if you want to keep using this version of the cloud-versioned directory rather than check for the latest version, set frozen: true for that dataset
  • CLI:
    • dvc datasets add dvcx dogs --info name=dogs,version=2 can add the entry to both dvc.yaml and dvc.lock
    • dvc update ds://dogs can update the entry
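
To make the "lightweight checksums" bullet concrete, here is a rough sketch of how validation could work for a version-aware S3 directory (an illustration of the idea only, not DVC code; the VersionId key and the version_aware flag are s3fs specifics and are assumptions here):

from fsspec.core import url_to_fs

def files_unchanged(url: str, locked_files: list) -> bool:
    # Compare the version IDs recorded in dvc.lock against what the remote
    # currently reports -- no file contents are downloaded or hashed.
    fs, root = url_to_fs(url, version_aware=True)
    for entry in locked_files:
        info = fs.info(f"{root}/{entry['relpath']}")
        if info.get("VersionId") != entry["version_id"]:
            return False
    return True

# `locked_files` mirrors the `files:` list in the proposed dvc.lock above,
# e.g. [{"relpath": "file1", "version_id": "..."}]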

@dmpetrov (Member):

Can/should dvc collect some lightweight info about the datasets and save them

It shouldn't, since a portion of the data is in the DB and the data may not be materialized locally. If something is needed, it should be requested through a command like dvcx status ds://cats@v1.

@dberenbaum (Collaborator):

Can/should dvc collect some lightweight info about the datasets and save them

It shouldn't, since a portion of the data is in the DB and the data may not be materialized locally. If something is needed, it should be requested through a command like dvcx status ds://cats@v1.

Yes, this is what I meant by "lightweight", @dmpetrov. I think we are all clear that DVC should not be calculating md5s, etc. It seems like doing something like dvcx status would be ideal here. At a minimum, it would be helpful to list available versions of datasets by name in dvcx. Are there plans for something like that?

@skshetry (Member, Author) commented Jan 4, 2024

We discussed in the team meeting whether we could fit this into the import-url/import workflow:

$ dvc import-url dvcx://dogs@v3 --virtual
$ cat dogs.dvc
md5: 43d8c1bfe46a6cf2cb9dfe00a8e431b3
frozen: true
deps:
- path: dvcx://dogs@v3
outs:
- hash: md5
  path: dogs # which is an alias for dvcx
  pull: false # or, virtual: true

The path will work like an alias or a name, but won't be materialized. If users need to materialize it, they can remove pull: false / virtual: true.

Also, dvc update will work automatically. We can add a top-level datasets section in the future if we need to. @pmrowla mentioned that we already have an artifacts section, which is metadata with an alias pointing to a file path; the datasets section is similar in that regard.

The dataset.get() method will work based on that alias or path.
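
For consumers, that could look roughly like this (a sketch of the shape under discussion, assuming the dvc.api.dataset module from the examples above):

from dvc.api import dataset

# Resolves via the alias recorded in dogs.dvc; nothing is materialized locally.
resolved = dataset.get("dogs")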

@dberenbaum (Collaborator):

Caveat: This somewhat depends on the discussion in https://github.com/iterative/dql/issues/732 and goes back to #10114 (comment), but I'm putting my current thoughts here so I don't lose them.

Does the proposal above mean that some pipeline stage could look like this?

stages:
  process_dataset:
    cmd: ...
    deps:
    - dogs # alias to dvcx dataset?

Questions:

  1. If we support syntax like path: dvcx://dogs@v3, can we list that directly as a stage dependency instead of doing an import?
  2. Does dvc.lock look the same as above with info about the dataset version?
  3. Does dataset.get("dogs") still make sense as an api?
  4. What happens if I put some file/dir at path dogs?

I think it's more important to handle stage deps first, although it seems possible to support both. Handling datasets as deps could replace or streamline some import scenarios, especially when users prefer to stream the data. Maybe we can even combine the ideas and call the top-level dvc.yaml section imports, where entries act like other imports but don't get materialized anywhere in the workspace (the hash info could go into dvc.lock as mentioned above).

@skshetry merged commit 422caa1 into iterative:main on Jan 30, 2024 (16 of 17 checks passed) and deleted the dataset-dependency branch.
BradyJ27 pushed a commit to BradyJ27/dvc that referenced this pull request Apr 22, 2024
* introduce dataset dependency

* support Annotated types

* split

* drop dvc.api.datasets

* drop pydantic and typing-extensions

* drop dvc.datasets