This issue was moved to a discussion. You can continue the conversation there.

Dataset storage improvements #1487

Closed
dmpetrov opened this issue Jan 10, 2019 · 35 comments
Labels: feature request, question

Comments

@dmpetrov
Member

dmpetrov commented Jan 10, 2019

There have been many requests related to dataset storage that might require a redesign of DVC internals and the CLI API. I'll list the requirements here in the issue description. It would be great to discuss possible solutions in the comments.

  1. A global place for all the datasets. People tend to use a single DVC repo for all their datasets. Otherwise, the number of Git repos explodes.
    1.1. Reusage. How to reuse these datasets from different projects and even repos?
    1.2. List all datasets.
  2. Dataset versioning.
    2.1. Assign a version/tag/label like 1.3 to a specific dataset. Git tag won't work since we don't need a global tag for all files.
    2.2. See list of versions/tags/labels for a dataset.
    2.3. How to checkout a specific version of a dataset in a convenient way?
    2.4. Ability to get a dataset (with specified version) without Git. ML model deployment scenario when Git is not available in production servers.
  3. Storage visibility for non-technical folks like managers.
    3.1. A human-readable cache would be great, so that a manager can see datasets and models through the S3 web UI.
    3.2. If 3.1. is not possible - some UI is needed.
  4. Diffs for dataset versions (see 2.1.): which files were added\deleted\modified.
  5. Datasets synchronization between machines. It looks like DVC solves this. Should we improve this experience?

Bonus question:

  1. Access control. How can I give access to dataset1 but not dataset2 to a particular user?

The list can be extended.

UPDATE 1/15/19: Added 2.4.

@dmpetrov added the "question" and "feature request" labels Jan 10, 2019
@ghost

ghost commented Jan 10, 2019

A global place for all the datasets.

Why is a shareable directory (NFS, S3 bucket, etc.) not enough? (This would also cover access control.)

Human readable cache would be great.

Indeed, maybe we can store our files with the following naming: .dvc/cache/version/path relative to root dvc + checksum
This way, we can know if a file was renamed and the modifications the file had in a version.
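
A rough sketch of what that layout could look like (paths and checksums below are made up, just to illustrate the idea):

.dvc/cache/v1/data/images.csv_3863d0e317dee0a55c4e59d2ec0eef33
.dvc/cache/v2/data/images.csv_a304afb96060aad90176268345e10355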

Datasets synchronization

Just to be on the same page, we are talking about dvc push and dvc pull, right?

@dmpetrov
Member Author

dmpetrov commented Jan 10, 2019

@MrOutis "Why a shareable directory (NFS, S3 bucket, etc.) is not enough?" - great question! I'd separate the question into two:

  1. Why is a flat version structure in a file system/S3/HDFS (like s3://my/dataset/{ver1, ver2, ver2_cleansed, ver3}) not enough, and why do people need versioning tools?
  • The dataset is evolving fast (new files every week or you modify files too frequently). You don't want to duplicate files by storing copies in each version directory.
  • You build a dataset out of other datasets and don't want to duplicate files. Two raw datasets sources/ds1/{ver1, ver2, ver3 ...} and sources/ds2/{ver1, ver2, ver3 ...}. The third one is based on the first two - sources/myds/ver1 needs to contain sources/ds1/ver5 and sources/ds2/ver2.
  • You want a single "pointer" to a dataset like latest or just a link to dvc repro.
  2. Why is the current DVC not enough?
  • The number of Git repos explodes if you use a single repo for each of the datasets. A team doesn't want to manage 20+ Git repositories.

".dvc/cache/version/path relative to root dvc + checksum" - we might have problems with duplication. The same file might be presented in a few different locations of a repo. Just a fun fact - the initial version of dvc (0.8) worked just like you described but we decided to use a single cache dir.

"Datasets synchronization" - yes, dvc pull\push.

@ternaus you are using a single repo for many datasets. Could you please share your motivation?

@villasv and @sotte, do you guys have any thoughts regarding this topic?

@villasv
Contributor

villasv commented Jan 11, 2019

  1. I'd keep the current cache directory flat structure and sync mechanism

Considering that DVC is a command line tool, I think S3/GCP/NFS/etc should still be the backing storage, which doesn't mean that their issues can't be addressed. I think Git is a role model in terms of "don't worry about what's inside .git", but with effort you can make it sane to inspect what matters.

1.1 Git has the concept of submodules. Admittedly, I find it a huge PITA and avoid it like the plague, but it also shows that it's feasible to reuse projects by composition/linkage. I've never seen how it works internally though, so I'm not sure if it's really something possible to replicate. In fact, git submodules suck so much IMO that I'd slap someone who suggests DVC build on it. Assuming that importing data from another project's cache would be read-only, this might be simpler than it looks.

1.2 Listing all datasets requires us to disambiguate which files in the cache represent the same dataset, and probably be selective about versions as well. But once versioning is figured out, I think this is given.

  2. Git handles tags the same way it handles branches: it simply saves an alias (properly called a ref) for a specific hash. Copying this mechanism allows DVC to keep the cache structure flat and makes checking out easy. The difference is that a git ref points to the state of the entire repo, while a dvc ref would point to the state of a single file, so the ref needs to know the file name as well.

  3. Human-readable cache sounds nice, I know. I too find myself inspecting cache files sometimes. But I only do that because I don't know how to check out old versions and I can't diff them sanely. Is there really any other motivation for having human-readable cache files? (except storage visibility for stakeholders)

3.2 Yes, I would like to have a UI for that, just after saying that I want the cache directory flat. Nothing fancy, though. Just list all refs, an aggregate count of versioned files without refs, and the total size. A stakeholder-friendly version of this might be a web version with nice-to-know things, like GitHub's useless graph of my most active days of the week.

4 Diff is tricky. I've scripted my own, but it's something very opinionated, and dataset comparison is especially complex because sometimes the base object of comparison is not the row, but rows identified by some ID (assuming the dataset is ordered, otherwise it may be just implausible). Also, diffing might vary by file format.

5 Security is a hairy topic. It's achievable with S3 access policy bending, but that's not stakeholder-friendly. Maybe combining data reuse from point 1 and a UI for versions... though I can't see how one could achieve meaningful access control without deeply integrating with the remote storage chosen.

My concerns:

  1. What's the expected behavior of checking out a specific version of some data file A that has been generated by a script S, but that script also generated another file B and depends on input X? Are A, B, S and X versions aware of each other?

  2. Does it really make sense to tag a single file? Git tags the entire repository because having a single file from another revision is an invalid state, as it should be with DVC. Isn't tagging all data files simultaneously enough, like one consistent repository?

  3. Can I reproduce the output of the pipeline after checking out an old version of A? Or is the expected behavior simply to checkout A's correct version for the current code?

@dmpetrov
Member Author

dmpetrov commented Jan 11, 2019

Great feedback @villasv ! Thank you.

You are a DVC expert and it looks like you have an opinion on how to organize projects with DVC. Can I ask you a general question? How would you organize projects for a team which has 20 datasets and 30 projects (ml, analytics, data processing)? One project can use many datasets. One dataset can be reused in many projects. Options that I see:

  1. A separate git-repo for each of the projects.
  • do you need a "golden dataset" and how will you support it (keep dataset copies in S3\GCS)?
  2. A mono-repo with all the datasets and projects.
  3. Some combination of (1) and (2).

Which features in DVC are missing to support this scenario?

Details about your feedback

Totally agree with the cache directory structure and sync mechanism.

Yes, we should think about project composition/linkage. The question is: should each dataset be presented as a single module/repo (so, 20+ repos/datasets for a single team), or should we come up with a structure where a single repo can naturally fit all the datasets (1 repo)?

Re tags - this might require a separate tagging mechanism in addition to Git-tags. This is what I don't like but it might be the only solution.

"Human readable cache". Right, shareholders visibility is a major motivation. The minor one - it might lower the entrance bar to dvc (good for newbies). Yeah, some UI would be great if we keep cache structure.

"Diff" - agree. It might be enough just to have file counts (new, deleted, modified) and sizes.

Regarding your concerns:

  1. In the context of this issue I was talking only about checking out datasets\sources, but not data derivatives.
  2. It does make sense to version a dataset (with or without DVC). The question is how (and whether we should) incorporate this feature into DVC. The problem happens when
  3. If I understood the concern correctly - checking out an old dataset version should work for both of the scenarios. The primary motivation in the current issue is to use a dataset version in current (or new) code.

@villasv
Contributor

villasv commented Jan 11, 2019

Ah, thanks for clarifying. I had the impression that versioning would also be targeting data derivatives. Indeed, that makes all my three concerns vanish, because input files are the root of the dependency trees so checking out a specific version of them can be done in such a way to keep everything consistent.

I didn't mean to use git tags though; what I meant is that DVC could somewhat copy how git tags are done, which is basically a refs directory with alias files. E.g. a file named version 1 contains a single line with the hash of the cache file it references. Like I said, the major difference is that extra attention is needed to also associate the ref with the original file to avoid conflicts.
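
A minimal sketch of what such a refs layout might look like (all names hypothetical, just to make the idea concrete):

.dvc/refs/data/images.dvc/v1.0    # contains one line: the md5 of that version's cache entry
.dvc/refs/data/images.dvc/v1.1
.dvc/refs/models/model.pkl.dvc/v1.0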

Q: How would you organize projects for a team which has 20 datasets and 30 projects (ml, analytics, data processing)?

If a group of projects shares code in scripts or data derivatives, I'd go for a mono-repo. I think that even if they shared only the datasets/sources I might go with a mono-repo, if those datasets/sources aren't very stable, because it's the safest way to ensure everyone is interpreting that data the same way.

Separate repos would be fine if those projects are mostly code and data, or they derive from a "golden dataset" that never changes or changes very slowly. At my work, we have two repos using DVC but they share nothing directly (one is for curation and the other is the result of that curated data being used in production, so one affects the other, but the files are totally independent because the second has a database dump as source).

In fact, the first of these projects - the one for curation - even uses the same repository for the NodeJS API that serves that content and other stuff like Elasticsearch mappings to index the derived data. Those are very coupled and I see no reason to separate them.

Q: What could be different if I had a way to version datasets and share data between repositories?

The first scenario (shared scripts and data derivatives) wouldn't change IMO, if I can't version and distribute derived data as well. I think the scenario of single-source-of-truth is the one that really benefits here.

Why is dvc import not enough? At first glance, the only missing feature is that you don't have explicit versions to understand. Maybe if dvc import were extended to import from another project's cache, it could handle dataset versions semantically.

@polvoazul

polvoazul commented Jan 12, 2019

@dmpetrov

How would you organize projects for a team which has 20 datasets and 30 projects (ml, analytics, data processing)?

I think having a monorepo or 30 repos depends on other factors apart from DVC. DVC could offer full non-duplicated support if we store all DVC files somewhere else, such as ~/.dvc (per user) or /var (global).
There is no real advantage to having .dvc in the Git directory if you assume there is a way to dvc pull your files.

Docker already does this by storing all images and cache layers in /var, and I find that it works quite well.
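
For what it's worth, something close to this can already be approximated by pointing the cache outside the repo (the exact config key may vary between DVC versions, so treat this as a sketch):

dvc config cache.dir ~/.dvc/cache      # per-user cache shared by every local repo
dvc config cache.dir /var/dvc/cache    # or a machine-wide cache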

@polvoazul

polvoazul commented Jan 12, 2019

I will give my impressions on your questions:

There were many requests related to datasets storing which might require a redesign of DVC internals and the cli API. I'll list the requirements here in the issue description. It would be great to discuss possible solutions in comments.

  1. A global place for all the datasets. People tend to use a single DVC repo for all their datasets. Otherwise, the number or git-repos explodes.

True, this is valuable. I suggest using a per-user or per-machine dvc directory (~/.dvc or /var/dvc). See comment above.

1.1. Reusage. How to reuse these datasets from different projects and even repos?
1.2. List all datasets.

Simple to implement if everything is in one place

  2. Dataset versioning.
    2.1. Assign a version/tag/label like 1.3 to a specific dataset. Git tag won't work since we don't need a global tag for all files.

I don't like this very much. The version is the Git commit hash + comment on the related *.dvc file. Use Git for all your versioning/identification problems. It could be helpful to name the DVC cache files with something human-readable (something like <MD5>_<commit message>).

2.2. See list of versions/tags/labels for a dataset.

Just git log the related *.dvc file. If you want to store metadata such as size or number of lines, do it in the .dvc file. It's just a simple text file, and all Git tooling works beautifully.
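
For example, assuming the dataset is tracked by data/images.dvc (hypothetical path):

git log --oneline -- data/images.dvc    # history of the dataset's dvc-file
git show HEAD~3:data/images.dvc         # inspect its metadata at an older commit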

2.3. How to checkout a specific version of a dataset in a convenient way?

git checkout branch/tag; dvc pull;
or
git show branch:path/to/file.dvc > tmp.dvc; dvc pull

  3. Storage visibility for not technical folks like managers.
    3.1. Human readable cache would be great. Thus manager can see datasets and models through S3 web.

This would be very nice indeed. More metadata would be stored in .dvc files, but it is worth it! Maybe we could show a preview of the data such as the shape, columns and the first ten lines.

3.2. If 3.1. is not possible - some UI is needed.
4. Diff's for dataset versions (see 2.1.). Which files were added\deleted\modified.

Use git? I don't understand this. I usually have one dataset per .dvc file, so it is very simple.

  5. Datasets synchronization between machines. It looks like DVC solves this. Should we improve this experience?

Yes. For big files, S3 loses the connection sometimes. Maybe there is a more robust, recoverable S3 download/upload experience. Also, I think most files are in some form of tabular format. We could give special support to some formats (such as CSV) and upload only deltas. This would be VERY NICE. Think rsync --partial.

Bonus question:

  1. Access control. How can I give access to a dataset1 but not to dataset2 to a particular user?

Just use S3/filesystem access control. Leave filesystem permissions to filesystems and we avoid a difficult subject.

The list can be extended.

If I think of anything new I will edit!

@dmpetrov
Member Author

@villasv thank you for the clarification!

Please correct me if I misunderstand your point. You are saying that a global dataset repo\place is not a perfect solution because it does not provide code and data lineage (snapshots) and the environment becomes fragile. In this case, a mono-repo does not have that kind of problem, as you said.

If the above is correct, how about using "data repos" with versions or checksums like dvc data-repo https://github.com/iterative/mydataset1 6cdc2cb or tag ver1.3 instead of the checksum? There is an analogy with packaging systems.

dvc import - yes, the solution can be based on a "golden dataset" somewhere like S3, plus proper imports for versioning from repos (no matter how often they change). You mentioned the explicit versioning issue - I'll think about it. Please let me know if you have any ideas.

@dmpetrov
Member Author

@polvoazul great feedback!

Awesome idea with per-user or per-machine dvc directory. And great analogy with Docker images.

I kind of agree with you about the Git commit hash + comment on the related *.dvc file, but it looks like too heavy an operation for a significant portion of users. Many people have asked about tags per dataset\dvc-file. I'm trying to understand whether it is okay to keep such an advanced scenario as we currently have and move all the complexity into documentation, or whether it is better to simplify the API.

Regarding the S3 connection loss - right, there is a corresponding issue #829.

@villasv
Contributor

villasv commented Jan 15, 2019

I wasn't favoring a mono-repo each with its own cache against having a global cache like polvoazul suggested, only against many-repos each with its own cache. And caching is not the argument here, but sharing and lineage.

I didn't think about a central cache. At first glance, I have nothing against it... but it doesn't change much in the scenarios I've encountered so far. I'd still use mono-repos for projects that share data lineage because the transformations scripts are part of that lineage.

I'm associating the word "repository" here with actual git repositories, I don't see a "dvc repository" dissociated from it. I haven't looked at dvc as a data registry solution so far, only as an integrated LFS/Annex that properly handles reproducibility. Data registry is a nice thing though.

It seems that much of this thread is about making the cache into a proper registry. I think this can be done without actually changing anything about the cache inner structure.

@efiop
Contributor

efiop commented Jan 15, 2019

@villasv I agree, we will probably add support for a few cache locations like in #1450. So that you could have your local cache at .dvc/cache for your local files, as well as maybe /share/dvc/ for global datasets (e.g. some community repo with a special dataset). So the logic of cache directories might not actually change a lot compared to the current one, and would actually be a natural extension of the current approach.
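
A sketch of how such a setup might eventually be configured (the second key is hypothetical, just to illustrate the idea; #1450 tracks the actual design):

dvc config cache.dir .dvc/cache       # project-local cache, as today
dvc config cache.extra /share/dvc     # hypothetical: an additional, shared cache location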

@dmpetrov
Member Author

dmpetrov commented Jan 21, 2019

@villasv yeah, data registry is a great term. Let's start using it.
Where to keep the cache is just an optimization question. The biggest question is how to simplify the dataset experience.

I've created a proposal based on the discussion. The proposal assumes that we keep dataset meta-information such as the dataset name and tags in dvc-files. And it introduces a set of commands to manipulate a dataset as a separate type of object. But in fact, it is just syntactic sugar on top of data files.

The proposed dataset API looks like a pretty reasonable solution which can bring DVC to the next level and does not break the distributed philosophy of DVC and Git, like many other dataset versioning solutions do. The only problem the proposal does not solve is storage visibility. The cache is not human-readable, which is probably not a surprise, so we should go with some additional UI (3.2), probably as an addition to the DVC project.

The proposal: https://gist.github.com/dmpetrov/136dd5df9bcf6de90980cec22355437a

Looking forward to your feedback, guys. Any comments are welcome.

PS: @drorata you had some issues with files versioning and updating versions. You might be interested in participating in this discussion.

@drorata

drorata commented Jan 21, 2019

It's a little hard to join such an in-depth discussion. I will try to share some of the thoughts that I had while reading the comments.

For me, one of the most important concerns which DVC addresses is reproducibility. When it comes to datasets, reproducibility means IMHO the combination of what raw data was used and what transformations were applied. If some (processed) dataset is used by several projects, it becomes in some sense a project of its own, and thus it would make sense to keep it separately. The depending projects, which use this set as an input, would then tap to the same source and enjoy a fixed reference.

I found git-submodules useful, and they can fit into the setting I described rather nicely. A project can fetch the datasets it needs by adding the "data-projects" as sub-modules.

The problem of diffs of datasets is somewhat enigmatic. If the dataset is binary, one is rather lost, and metadata associated with the set might be the only remedy.

Personally, I very much like the close coupling between DVC and git. If at some point I realize that part of the project/data can be used by other projects, it means it has to be extracted into its own world. It is a bit like the flow when you code something in Jupyter, and then you realize it is used across different parts of the project, so you extract it as a module. Then you figure out it is actually useful in different projects, so you turn it into a package on its own. Along this line, I don't understand how linking DVC to the user/machine is going to work; wouldn't it make the link between the data and the project much less clear?

@dmpetrov Thanks for pinging me 👍

@sotte
Contributor

sotte commented Jan 24, 2019

Sorry I'm a bit late to the party. The topic is super interesting! Here are my 2 cents.

What is a dataset? What is a version of a dataset?

Often you get some data from some business unit and it's dumped in some bucket (if you're lucky).

bucket1/2018-01-01/...
bucket1/2018-01-08/...
bucket1/2018-01-15/...
...

bucket2/2018-07/...
bucket2/2018-09/...
...

bucket3/...
...

This does not mean that one bucket corresponds to one "dataset". A "dataset" can consist of many buckets with different "versions" (the date part in the prev. example). A "dataset" can also expand (maybe next time I'll use some weather data to improve my ML model). And for a different project I'll use a different combination of parts and versions.

Because it's not easy to define what a "dataset" is, and because it changes depending on your problem, I think it makes sense to have one place where you can put all your data: the data registry. There you (the data scientist) define what a "dataset" is: a collection of data (more specifically, I think this should be a collection of pointers to content-addressable data, similar to how git handles it). From there you can check out that dataset.

So far we've only talked about the "raw" data you get from the business side. But now the data transformation begins :)

Q: The "raw dataset" has to be transformed and cleaned. Is this also tracked in the data registry? Should this be tracked in a sep. dvc/git project?

Storing Data

Regarding storage: I think it makes sense to store the data in a content addressable way (like git or IPFS).
This takes care of deduplication and can also be combined with chunking (see #829 ). That being said, an additional UI is needed for storage visibility.

git-submodules as data-projects

It feels right to import data as "modules". But I don't like git submodules. I just want to say that there is also git subtree as an alternative to git-submodules.

Misc and personal pet peeve

One thing I don't like about DVC is that newer datasets are supposed to replace old ones in order for dvc repro to work. I normally evaluate new models against different versions of my data to gauge the performance. DVC does not help here! I'm mentioning it because I think it's imperative to be able to have multiple versions of your data at the same time, and maybe this has to be considered in this proposal.

@dmpetrov
Member Author

@drorata Thank you for the feedback!

I have a question for you about reproducibility. Does it make sense to keep the code as a part of a dataset (if it was generated by this code), or can we decouple these two parts (the dataset, and the DVC project with the code which consumes the dataset)?

git-submodules seems like a good idea. But it has some limitations. We need to re-think this part. The diffs that I was talking about are not binary diffs - just differences in size and number of files. I agree that diff, in general, is an enigmatic problem and I don't see any easy solution.

you turn it into a package on its own. Along with this line, I don't understand how linking DVC to the user/machine is going to work;

The analogy with packaging systems is great. We just need to figure out how to connect the pieces. The important questions are:

  1. How to address “packages”. Should we use something like [email protected].* ?
  2. Should we include code in the package or just package datasets? This is related to the question above.

@dmpetrov
Member Author

Hey @sotte, it is great to see you in this discussion!

Is this also tracked in the data registry? Should this be tracked in a sep. dvc/git project?

Ha... you have raised one of the most important questions. Would you mind if I ask you to answer your own question? :) What is your take on this?

  1. Have you ever had an incentive to store transformed datasets (or models) outside of the projects they belong to (like in the data registry)? What was your motivation for doing that?
  2. Do you mean that we will have a separate dvc/git repo for each "transformed" dataset, which will include the code that describes the transformation?

Storing Data

I agree with the “content addressable way” for the storage format (this is how DVC works right now), as well as with the common distrust of git-submodules.

personal pet peeve

That is awesome feedback! This is what I meant by 2.3 ("How to checkout a specific version of a dataset in a convenient way?"). Your description/problem definition is much better.

How would an ideal DVC API look to solve this problem? At least at the idea level.

PS: @tdeboissiere and @ophiry - guys, this discussion is getting more and more interesting and we need your opinions.

@drorata

drorata commented Jan 28, 2019

I suspect that the notion of "data registry" is the entry sign to a slippery slope ending in some sort of yet-another-data(base)-storage-solution. To me, it is crystal clear that decoupling the data from a project would mean that when trying to reuse the same dataset for a different project, it would turn out it has to go through a major (2nd order) preprocessing before it could be used. And what would you do then? Extract this "new" dataset into its own entry in the registry?

I think that coupling the code and the data is crucial, and failing to do that undermines the ability to reproduce results. Each project has its own needs, and the very same data source (e.g. some database) would be used in different ways in different projects.

I would say that the starting point of each project is one or more data sources which are external from the data scientist/ML/etc. standpoint. Transforming this raw data into something you can work with is part of the project. If at some later point in time one realizes that the same raw source can be used for two different projects using the same pipeline of transformations, then it makes sense to extract this pipeline, along with the relation to the raw data source, into its own data solution. In other words, create a flow which takes the tables, for instance, applies the needed transformations/joins/etc., and saves the result into a new table.

By the way, in many cases, I would say that the entry point of the project is not a static dataset, but a code which extracts the data from some source/database. DVC, in my mind, helps to persist this point in time when the data is taken from the source and used by the project, for future reference.

It becomes harder for me to distinguish between some data-storage solution and the direction implied by this thread. What is the problem which is being solved here?

Regarding git-submodules, I am using them in some production solution, but I don't have the mileage to reckon whether they are a viable solution (assuming I know the problem).

@Casyfill

Casyfill commented Jan 28, 2019

FWIW anaconda is building its cataloging package intake, which might be a good solution to issues 1.1, 1.2, 2.4, 3.2 and maybe 2.3?

and it would be great to have 2.1, 2.2 and 4 indeed.
Won't tags solve the question of "human-readable hashes"?

@shcheklein
Member

shcheklein commented Feb 1, 2019

@Casyfill thanks for joining this discussion! :) Quick question, just to get more of a sense of your thought process: 2.1, 2.2, 4 (assigning/listing tags and diffs for datasets) - in what scenario would you use them? To manage sources, external data that is used to start the project? Or some intermediate artifacts/end results - models, preprocessed data, etc.? Do you think about doing this within a project, or to share/provide visibility into different versions to other people on your team?

@drorata thanks! that is a really great and deep answer!

First of all, I'd like to clarify that by dataset management we actually mean "any data artifact" management. From the DVC perspective, any artifact is just a file or a directory. So, to be precise, dataset == input data (in some cases it's a snapshot indeed), intermediate/processed data, end results - models, reports.

It feels there are at least two broad segments we are trying to touch here:

  1. Better UI/UX for intra-project data management. E.g. I have a directory with pre-processed images and I keep adding images from time to time. I want to see those changes in some meaningful way. It's probably similar to having git tags that provide additional semantics, plus having some diff that can show how many files changed, probably their names, etc. It should also include things like dvc checkout specific-version-dataset.

  2. Better addressability, visibility, UI/UX to manipulate, etc for inter-project data management. It covers cases when some artifact produced in a project should be used somewhere else. Examples might include: deploy a model (you need to address it somehow and get an actual blob, for example), reuse a processed dataset.

  3. Global data registry/store. Any artifact can be published. It's like having S3 but with a git repo (on GitHub) on top of it that describes all the changes and what each md5 file means - to search, to be able to remove, etc. Thus @dmpetrov mentioned a UI for that, etc. It's probably also a way for the team to save some resources by sharing some datasets.

Guys, what do you think about this split, and which of the described problems have you seen in your workflow? Any feedback is welcome :)

@tdeboissiere

Exciting discussion! I do not feel qualified to answer all of the use cases given that my team has a very specific workflow. Instead, I will briefly give some context on how we work, what solutions we came up with, and what we think about @dmpetrov's points.

Team description and requirements:

  • Research team working on speech synthesis
  • Need a system to centralize datasets to avoid duplication (limited storage) and avoid the case where 2 researchers are working on a dataset with the same name but with different content.
  • Need a system to also allow for fast exploration (e.g. moonshot research idea requires unconventional data) without strict version control red tape.
  • Our data pipelines are typically long (a few days) and made up of a few distinct stages. We do not want to reprocess earlier stages if that can be avoided.

How we use DVC

  • Research phase: researchers are allowed to create datasets however they want, without controlling them with DVC
  • Development phase: research has matured, a dvc pipeline is built to freeze the dataset. This golden dataset is then used for hyperparameter tuning and then to train models for production. Because data is basically immutable at this stage, we do not tie code + data with dvc, and only use git to version control the code.

Questions

1.1 Reusage:

  • A single repo stores all the code required to build any of the golden datasets.
  • Each researcher can access the same data with simple steps: clone data repository, dvc pull dataset.dvc, use docker container to mount the data repository.
  • Multiple projects can thus easily share the same data source

1.2 List all datasets

  • In the git repo where we have all our data pipelines, all dumps are stored under the following structure:
    ├── code_to_create_datasets
    │
    ├── data_outputs
         ├── dataset1
         ├── dataset2

So it is easy to list all datasets.

2.1 Assign a label to a specific dataset

This isn't something we really need at the moment (recall our datasets usually change very little over time). OTOH, I would implement it along the lines of:

dvc run-and-tag v1
git commit
dvc run-and-tag v2
git commit
# Need to revert to v1 of the pipeline !
dvc show-commit v1
git checkout $(dvc show-commit v1)
dvc checkout

2.2 See list of versions/tags/labels for a dataset

Should be straightforward.

2.3 How to checkout a specific version of a dataset in a convenient way?

Again, not something my team really needs at the moment as our datasets are very stable.

On topic: I recall from earlier discussions that if I do something like:

dvc run -f out.dvc -d dep1 cmd
dvc push
dvc run -f out.dvc -d dep1 slightly-different-cmd
dvc push
dvc gc

Then the output of the first dvc command is removed from the remote/cache. This is not ideal if in the end I decide to use the first command and would like to use its output right away without relaunching it

2.4 Getting data without Git

Not an issue we face.

3.1. Human readable cache would be great.

  • +1 for a UI, especially with powerful search features
  • At the moment, we solve the UI part by making sure each data pipeline outputs its dump in a hierarchical way, which is readily human-readable in the corresponding git repo, e.g. something like
    ├── code_to_create_datasets
    │
    ├── data_outputs_task1
         ├── dataset1
         ├── dataset2
    ├── data_outputs_task2
         ├── dataset1
         ├── dataset2
    ├── data_outputs_shared
         ├── dataset1
         ├── dataset2

Bonus:

As mentioned by others, I think access control should not be handled by DVC itself, apart from a bit of syntactic sugar (for instance, assuming ownership groups can be created on the remote storage, allow the dvc push command to specify which group can access the data).

@ghost

ghost commented Feb 15, 2019

@dmpetrov, @efiop I think we can support a human-readable cache on object storage that supports some kind of "linking" or "referencing" mechanism.

For example, with SSH, we could have symlinks with the name of the object and maybe the branch/version.

There's also a hacky implementation for S3: https://stackoverflow.com/questions/35042316/amazon-s3-multiple-keys-to-one-object

It looks like we already have some information that we can work with: https://github.com/iterative/dvc/blob/master/dvc/remote/local.py#L619-L623

NOTE: I'm actually not 100% sure on this one, but just leaving it here for reference
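
A rough sketch of the symlink idea for an SSH remote (paths and checksum made up); the object stays content-addressed, and a human-readable alias simply points at it:

# content-addressed object, as stored today
/storage/dvc/38/63d0e317dee0a55c4e59d2ec0eef33
# human-readable alias kept alongside it
ln -s ../38/63d0e317dee0a55c4e59d2ec0eef33 /storage/dvc/readable/images-v1.12_cleansed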

@dmpetrov
Member Author

The most difficult part of introducing this new dataset experience is to align it with DVC philosophy.

A couple of quotes from @drorata:

I suspect that the notion of "data registry" is the entry sign to a slippery slope ending in some sort of yet-another-data(base)-storage-solution.

coupling the code and the data is crucial

After collecting more feedback offline and discussing implementation details, we came up with a solution which improves the dataset experience without breaking DVC fundamentals.

✅ A new dvc tag command could nicely solve the dataset navigation issue. With the command you can see:

  • the list of your datasets: dvc tag
  • the whole tag\label\version history of a particular dataset: dvc tag images/
  • the ability to check out a dataset by a tag: dvc checkout images/ -t v1.12_cleansed

This idea was initially mentioned in the issue description, and @villasv emphasized it:

DVC could somewhat copy how git tags are done

Also, some random guy from the Internet mentioned this in a recent Hacker News discussion:

This is definitely needed and DVC has a few cool ideas. I think the most useful feature missing from existing tools is integrating data versioning with git, and simple commands to tag, push, and pull data files.

The dvc tag command covers (1.2) and (2.1)-(2.3) from the issue description.

✅ Introduce dvc diff for getting a summary difference (not a binary diff) between dataset versions, like dvc diff images v1.12_cleansed. This is just a natural addition to dvc tag.

This command covers (4).

✅ Introduce dvc pkg (modules; the name is under discussion) to reuse external repositories. Git submodules might be used under the hood, but that is not the preferred implementation, since DVC should support other VCSs (there is a PR for Mercurial support) and DVC should work even with no VCS - see (2.4) from the issue description. Another reason not to use Git submodules: an additional abstraction layer and command are still needed even with submodules.

❌ I'd suggest keeping semantic versioning (like 2.1.8) outside the scope of this issue because the concept of versioning does not align well with the Git philosophy. We should remember that Git is actually not a versioning system; it is just "the stupid content tracker" (see man git). Kudos to @yarikoptic, who pointed us to this idea recently. Git does not have a notion of semantic versions, and DVC (at the basic level) should not either. This is related to item (2.1) from the issue description.

❌ It looks like a human-readable cache (3.1) is not something that we can solve. There are some potential solutions for specific file storages, as @MrOutis mentioned, but they do not sound like a generalizable approach. From another point of view, the proposed solution with the new tag and pkg experience improves human readability through dvc commands, which might be an even better approach for developers.

Next steps

So this is the plan to improve the dataset storage experience and close the current issue:

  • dvc tag
  • dvc checkout - tag support
  • dvc diff
  • dvc pkg

I'll be marking the items when they are done.
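
To make the intended experience concrete, here is a sketch of how the workflow might look once these commands exist (all syntax is tentative and may change during implementation):

dvc tag                                   # list datasets and their tags
dvc tag images/ v1.13                     # assign a tag to the current version (tentative syntax)
dvc diff images v1.12_cleansed            # summary diff: files added/deleted/modified, sizes
dvc checkout images/ -t v1.12_cleansed    # bring back an older tagged version
dvc pkg https://github.com/iterative/mydataset1    # reuse data from another repository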

What do you guys think about this solution? Do you see any concerns or potential pitfalls?

@drorata

drorata commented Feb 24, 2019

I have one question. Let's assume for a second that dvc will always have the backbone of a versioning system which supports tagging. In this case, I don't understand what the use case is for having tagging functionality provided by both the VCS and DVC. I can always do git tag, and this will also be relevant to the data "tracked" by DVC.

@dmpetrov
Member Author

@drorata Good question!

Git tag is global - it marks all files in a repository. DVC tag is local (per data file\dir) - it is specific to a data artifact. Thus, using Git tags you quickly pollute the tag namespace and can easily lose track of which tag belongs to which dataset\model.

DVC tag localizes this tagging experience at the data file level. It can easily answer questions like:

  • give me a list of datasets and models
  • give me a list of versions\tags of a given model

DVC tag simplifies dvc checkout and dvc diff experience.

Another important reason is optimization. Ideally, we should not have to use Git history to get a data file (a checksum from the dvc cache) with a specific tag. It is important for the model deployment scenario (2.4) when there is no access to Git, only to files (from HEAD). With custom tags we can aggregate this info in dvc-files or keep it separately somewhere like .dvc/tags/....
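
A sketch of what that aggregated tag info might look like on disk (layout purely illustrative):

.dvc/tags/images       # one line per tag, e.g. "v1.12_cleansed 3863d0e317dee0a55c4e59d2ec0eef33"
.dvc/tags/model.pkl    # "v2.0 a304afb96060aad90176268345e10355"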

@drorata

drorata commented Feb 24, 2019

Assuming still that git is available, one can easily use the provided tag mechanism without polluting anything by merely adhering to some conventions/best practices. For example, something like a prefix models-v1-before-hyperparams-tuning or rawdata-v3-including-wind-sensors, etc.
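
For example, sticking to such a prefix convention keeps filtering per artifact easy with plain Git:

git tag rawdata-v3-including-wind-sensors
git tag --list 'rawdata-*'    # list only the tags that belong to the raw-data artifact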

I can imagine that the setting where git is not available is realistic, but in that case and to that end, tagging is a lighter issue, isn't it?

@dmpetrov
Member Author

@drorata I agree with you: in many cases, git-tag conventions\best practices are enough. Usually, it means that people use a repository per problem\model which includes 1-2 input datasets and a reasonable number (within a couple of dozen) of experiments. I'd even say this is the best practice for DVC.

Git tags are not enough in mono-repo scenarios when "People tend to use a single DVC repo for all their datasets" (from the issue description). A close-to-real example: a single git\DVC repo with ~10 datasets and ~5 separate projects inside the repo. The projects might reuse datasets (image-net is reused by 3 projects). Datasets and models are evolving. Some datasets change on a bi-weekly basis. In this kind of setting, you can quickly end up with a few hundred Git tags.

@drorata

drorata commented Feb 25, 2019

I am not familiar with dvc's vision/roadmap, but a mono-repo for data is indeed something else, and I was not aware that it is something dvc is aiming at.

@dmpetrov
Member Author

@drorata we see that a significant number of users use DVC this way. And some companies prefer a mono-repo over a set of repositories.

@drorata

drorata commented Feb 26, 2019

If I understand correctly, this is a rather different use case. Won't some artifact management solution be a natural choice for this case? I have Maven in mind, but others might be better candidates. Or am I missing something?

@dmpetrov
Member Author

Maven is a good analogy, but not because of the new dvc tag command - rather because of dvc pkg. Maven builds projects like DVC pipelines do. Also, it takes care of dependencies; the dependency part is missing in DVC.

With the dvc pkg command we basically start incorporating new use cases into the DVC ecosystem. The current dvc pkg for datasets (static modules) is the first step towards modules with code #1472 (dynamic modules).

Can we utilize existing tools for these use cases? Probably not, because systems from industry are mostly focused on code files, while DVC has a different, data-file-centric model. Systems from academia like WDL are too abstract and haven't had enough traction so far (we would spend more time on integration and supporting their API than on actual work).

Is it a good idea to implement these kinds of module\library\package scenarios in a single tool? On one side, Java (language), Ant (build system), and Maven (dependencies) are separate projects. On the other side, these projects were created in different epochs: when Java and Ant were created, there was no urgent need for Maven. I think it is a good idea to develop all the pieces of the ecosystem together as a single tool\language\project. Modern languages (like Go) pursue this approach.

@dmpetrov
Member Author

@drorata just an example: https://discordapp.com/channels/485586884165107732/485596304961962003/550094574916337669

if I have to separate github repo's one with the data. one that processes the data, will the dvc not be able to detect changes from one repo to another

This is a regular question in our Discord channel.

@shcheklein
Member

shcheklein commented Mar 8, 2019

@dmpetrov just to clarify, which one of the commands solves "2.4. Ability to get a dataset (with specified version) without Git", and what will the interface for that look like? A global version of dvc checkout?

@dmpetrov
Member Author

dmpetrov commented Mar 8, 2019

@shcheklein I mentioned briefly that git submodules won't work, partly because of (2.4).

I expect (2.4) to be a part of the pkg command. Something like dvc pkg https://github.com/dmpetrov/tag_classifier for importing all data files into your dir (through a DVC file and the DVC cache), and dvc pkg --flat https://github.com/dmpetrov/tag_classifier for importing a data file as a regular file (with no DVC files and no repositories).

@ghost

ghost commented May 31, 2020


How about using git-lfs as a remote storage? Why is that not possible?

@yarikoptic

yarikoptic commented May 31, 2020

It seems that I had managed to stay silent, and DataLad wasn't mentioned. FWIW, I think all the desired (and many other) use cases for data access, management, etc. can be instrumented via git submodules + tags, git-annex, and/or the straight datalad API/interface to those. FWIW, LFS is too tied to git, too centralized, and doesn't support removal from it. FWIW, git-annex (and thus DataLad) supports LFS as a special remote. See http://handbook.datalad.org/en/latest/basics/101-138-sharethirdparty.html?highlight=LFS#use-github-for-sharing-content

@dmpetrov mentioned this issue Sep 7, 2020
@efiop closed this as completed May 3, 2021
@iterative locked and limited conversation to collaborators May 3, 2021
