introduce dataset dependency #10164
Conversation
Codecov Report — Attention:

```
@@            Coverage Diff             @@
##             main   #10164      +/-   ##
==========================================
- Coverage   90.47%   90.16%   -0.31%
==========================================
  Files         493      495       +2
  Lines       37699    37775      +76
  Branches     5449     5461      +12
==========================================
- Hits        34107    34059      -48
- Misses       2963     3062      +99
- Partials      629      654      +25
```

View full report in Codecov by Sentry.
@dberenbaum, this PR is a very basic implementation, but I wanted to get feedback on whether this is what you had in mind with regard to pipelines and datasets. Please have a look and let me know what you think.
Looks good and aligns with what we discussed. I added the dataset and ran into an error:

```console
$ python ds.py
Traceback (most recent call last):
  File "/private/tmp/dvcx/ds.py", line 6, in <module>
    ds = DatasetQuery(name=resolved.name, version=resolved.version)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dql/query/dataset.py", line 1146, in __init__
    ds = data_storage.get_dataset(name)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dql/data_storage/abstract.py", line 1068, in get_dataset
    raise DatasetNotFoundError(f"Dataset {name} not found.")
dql.error.DatasetNotFoundError: Dataset dogs not found.

$ dvcx pull dogs -o dogs
Error: Error when parsing dataset uri
```

I can try to look into it more or ask the dvcx team, but let me know if you have ideas on what might be wrong. I think we will need input from the DVCX team on this.
I think you have to `dql pull ds://dogs --no-cp`.
Cool stuff. Thanks @skshetry for prototyping and sharing this quickly. Folks, can we add the dataset info to the lock file?
@shcheklein, it is inside `dvc.lock`:

```yaml
schema: '2.0'
stages:
  test:
    cmd: cp foo bar
    deps:
    - path: ds://dogs
      dataset:
        name: dogs
        type: dvcx
        version: 3
    - path: foo
      hash: md5
      md5: d3b07384d113edec49eaa6238ad5ff00
      size: 4
```

The complete `dvc.yaml` file:
```yaml
datasets:
- name: dogs
  type: dvcx
  version: 3
- name: publaynet
  type: webdataset
  url: http://storage.googleapis.com/nvdata-publaynet/publaynet-train-{000000..000009}.tar
- name: stackoverflow
  url: [email protected]:iterative/example-get-started.git
  rev: main
  path: data/data.xml
- name: tiny
  type: dvcx
  version: 3

stages:
  test:
    cmd: cp foo bar
    deps:
    - ds://dogs
    - foo
```

(Also added in the description above.)
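As a sketch of how such `datasets` entries might be parsed into typed objects (the PR experimented with pydantic; the helper names and shapes below are hypothetical, using only stdlib dataclasses):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DVCxDatasetSpec:
    # hypothetical spec mirroring a `type: dvcx` entry in dvc.yaml
    name: str
    version: Optional[int] = None

def parse_dvcx_datasets(entries):
    """Collect the dvcx entries from a parsed `datasets:` list."""
    specs = {}
    for entry in entries:
        if entry.get("type") == "dvcx":
            specs[entry["name"]] = DVCxDatasetSpec(
                name=entry["name"], version=entry.get("version")
            )
    return specs
```

Running this over the list above would yield entries for `dogs` and `tiny` while skipping `publaynet` and `stackoverflow`, which have other types.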
Yup, that fixed it, thanks! It seems like this isn't a good UX, though, and we will need a way to see what dataset versions are available on Studio without pulling their indexes locally (I think I raised the same concern in the prior PR). Do you know if this is a priority for the DVCX team? Would it make sense to contribute back this functionality?
You can try it out with something like the following:

```python
from dql.catalog import get_catalog
from dql.dataset import create_dataset_uri
from dql.query import DatasetQuery

from dvc.api.dataset import DVCxDataset, get

resolved = get(DVCxDataset, "dogs")
uri = create_dataset_uri(resolved.name, version=resolved.version)

catalog = get_catalog()
catalog.pull_dataset(uri, no_cp=True)
ds = DatasetQuery(name=resolved.name, version=resolved.version, catalog=catalog)
```

We were using a different API before, and now we are not responsible for anything other than materializing the metadata that we have. So I'd prefer that someone from the dvcx team propose an API here; it'd be wrong for me to try to fit what we have to their API. We should finalize things on the dvc side before proposing anything, at least. It's too early from our side, I think. cc @dmpetrov.
@skshetry excellent work! It would be great to know how this works. This might be outside of the scope: do we plan to support this?
At the moment, we only provide an API to give you the metadata you have in `dvc.yaml`. I have been thinking about what we can do with this metadata, hence my experiment with pydantic. For cloud versioning, it could be like this:

```yaml
datasets:
- name: versioned
  type: url
  url: s3://bucket/name?versionId=versionId
```

And you can read that using s3fs:

```python
import fsspec

fsspec.open(resolved.url)
```

Note that this is about pipelines only.
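For illustration, splitting such a versioned URL into its parts can be done with the standard library alone (the helper name and the example `versionId` value are hypothetical):

```python
from urllib.parse import urlparse, parse_qs

def split_versioned_url(url):
    """Split s3://bucket/key?versionId=... into (bucket, key, version_id)."""
    parsed = urlparse(url)
    # parse_qs returns a list per key; take the first versionId if present
    version_id = parse_qs(parsed.query).get("versionId", [None])[0]
    return parsed.netloc, parsed.path.lstrip("/"), version_id
```

The resulting triple is what a filesystem layer would need to open the exact object version.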
Can/should dvc collect some lightweight info about the datasets? For example, assume this `dvc.yaml`:

```yaml
datasets:
  dogs:
    name: dogs
    type: dvcx
    version: 3
  versioned:
    type: url
    url: s3://bucket/name/dir
    version_aware: true
stages:
  test:
    cmd: cp foo bar
    deps:
    - ds://dogs
    - foo
```

The lock file could then look like this:

```yaml
schema: '2.0'
datasets:
  dogs:
    type: dvcx
    name: dogs
    version: 3
  versioned:
    type: url
    url: s3://bucket/name/dir
    files:
    - relpath: file1
      version_id: versionId
    - relpath: file2
      version_id: versionId
stages:
  test:
    cmd: cp foo bar
    deps:
    - path: ds://dogs
      dataset:
        name: dogs
        type: dvcx
        version: 3
    - path: foo
      hash: md5
      md5: d3b07384d113edec49eaa6238ad5ff00
      size: 4
```

This would support:
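To make the "lightweight" idea concrete, a change check over such lock entries could simply compare the recorded `version_id`s against a fresh listing, with no hashing done by DVC (a sketch; the function and field names are taken from the lock example above, not from any real implementation):

```python
def dataset_changed(locked_files, current_files):
    """Compare version_ids recorded in dvc.lock with a current listing
    of the versioned URL; returns True if anything differs."""
    def as_map(files):
        return {f["relpath"]: f["version_id"] for f in files}
    return as_map(locked_files) != as_map(current_files)
```

A new, deleted, or re-versioned file all show up as a plain dict inequality, which keeps the check cheap.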
It shouldn't, since a portion of the data is in the DB and the data may not be materialized locally. If something is needed, it should be requested through a command like `dvcx status`.
Yes, this is what I meant by "lightweight" @dmpetrov. I think we are all clear that DVC should not be calculating md5s etc. It seems like doing something like `dvcx status` could work here.
We discussed in the team meeting whether we could fit this into the `import-url` workflow:

```console
$ dvc import-url dvcx://dogs@v3 --virtual
$ cat dogs.dvc
md5: 43d8c1bfe46a6cf2cb9dfe00a8e431b3
frozen: true
deps:
- path: dvcx://dogs@v3
outs:
- hash: md5
  path: dogs # which is an alias for dvcx
  pull: false # or, virtual: true
```
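Parsing a URI like `dvcx://dogs@v3` could be sketched with the standard library (a hypothetical helper; real scheme handling would live in dvc/dvcx):

```python
from urllib.parse import urlparse

def parse_dvcx_uri(uri):
    """Split dvcx://<name>@v<version> into (name, version); version is optional."""
    parsed = urlparse(uri)
    if parsed.scheme != "dvcx":
        raise ValueError(f"not a dvcx URI: {uri!r}")
    # urlparse leaves "dogs@v3" in netloc; split off the version suffix
    name, _, ver = parsed.netloc.partition("@")
    return name, int(ver.lstrip("v")) if ver else None
```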
Caveat: this somewhat depends on the discussion in https://github.com/iterative/dql/issues/732 and goes back to #10114 (comment), but I'm putting my current thoughts here so I don't lose them. Does the proposal above mean that some pipeline stage could look like this?

```yaml
stages:
  process_dataset:
    cmd: ...
    deps:
    - dogs # alias to dvcx dataset?
```

Questions:

I think it's more important to handle stage deps first, although it seems possible to support both. Handling datasets as deps could replace or streamline some existing workflows.
* introduce dataset dependency
* support Annotated types
* split
* drop dvc.api.datasets
* drop pydantic and typing-extensions
* drop dvc.datasets
Example `dvc.yaml` (requires `pydantic>=2` at the moment):

Complete dvc.yaml file

Complete dvc.lock file

Screenplay
Note regarding the public API: we can easily change the API if needed. I wanted to start with something that is type-safe and that gives good completions to users.
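A type-safe lookup in the spirit described here might look like the following sketch (the in-memory registry stands in for reading `dvc.yaml`; the `DVCxDataset` fields are assumed from the examples above, not the actual implementation):

```python
from dataclasses import dataclass
from typing import Optional, Type, TypeVar

@dataclass
class DVCxDataset:
    name: str
    version: Optional[int] = None

T = TypeVar("T")

# stub standing in for the datasets declared in dvc.yaml
_REGISTRY = {"dogs": {"name": "dogs", "version": 3}}

def get(typ: Type[T], name: str) -> T:
    """Return the named dataset as the requested spec type, so the
    caller gets editor completions and static type checks on the result."""
    return typ(**_REGISTRY[name])
```

Because the return type is driven by the `typ` argument, `get(DVCxDataset, "dogs")` type-checks as a `DVCxDataset`, which is what makes the completions work.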