
drop external outputs #9531

Closed · dberenbaum opened this issue Jun 1, 2023 · 9 comments · Fixed by #9570

@dberenbaum (Collaborator)

Drop current implementation of external outputs as part of #7093.

@dberenbaum (Collaborator)

When dropping external outputs, can we continue to support --outs-no-cache with external paths, and can we allow them without the --external flag (which we can drop)?

Originally posted by @dberenbaum in #7093 (comment)
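A rough sketch of what the difference would look like on the command line (stage and bucket names are hypothetical; flags as in DVC 2.x, if I have them right):

# DVC 2.x: an external output requires the explicit --external flag
dvc stage add -n process --external -d process.py -o s3://bucket/out python process.py

# Proposed: the external path is allowed as an uncached output, no --external needed
dvc stage add -n process -d process.py --outs-no-cache s3://bucket/out python process.py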

@dberenbaum (Collaborator)

Hi, I posted this question regarding external outputs on Discord and am moving the discussion here as suggested.

I recently moved a large-scale data analysis from our local server to AWS using a Ray cluster (https://www.ray.io/). The different steps are implemented as individual Ray jobs and compiled into a reproducible pipeline with DVC using a dvc.yaml. Because the jobs/stages execute in AWS, and because the data is too large to store locally, the inputs and outputs of each stage are external AWS S3 URIs. Along the way I ran into a couple of issues (a stage output is always reported "not in cache" even right after dvc commit -f was run; a stage fails to run the first time because the S3 URI prefix does not exist yet) that suggest bugs in DVC's handling of external outputs (I still need to file GitHub issues for these).

However, looking at the roadmap for the next major release, 3.0, I see a to-do item that reads "drop external outputs". So my more immediately pressing questions are:

  • Will AWS S3 URIs for stage inputs and outputs no longer be supported from DVC version 3.0 onwards?
  • What is the recommended way to run large, reproducible data processing in the cloud (AWS), where all stage outputs live only in S3 and stage scripts stream data directly from/to that remote storage?

An example dvc.yaml would look something like:

vars:
- s3_uri_prefix: s3://bucket/prefix
stages:
  stage_1:
    cmd: ray job submit --working-dir . -- python stage_1.py --output ${s3_uri_prefix}/stage_1
    deps:
    - stage_1.py
    outs:
    - ${s3_uri_prefix}/stage_1:
        push: false
  stage_2:
    cmd: ray job submit --working-dir . -- python stage_2.py --input ${s3_uri_prefix}/stage_1 --output ${s3_uri_prefix}/stage_2
    deps:
    - stage_2.py
    - ${s3_uri_prefix}/stage_1
    outs:
    - ${s3_uri_prefix}/stage_2:
        push: false

I'd appreciate any pointers as to how DVC would still fit into the picture here.

Originally posted by @aschuh-hf in #7093 (comment)

@dberenbaum (Collaborator)

@aschuh-hf I'm personally torn on this decision, but ultimately (as you mentioned) there are a lot of bugs and problems with the way we handle external outputs today. The goal is not to give up on the use case you describe (it's more important than ever) but to build a better solution for it.

Besides staying on DVC 2.x for now, you should be able to continue working with pipelines streaming data to/from S3 if you set cache: false. Given that your data is large, do you mostly use DVC to manage your pipeline, or do you also use the cache and check out older versions of your data?
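For example, a minimal sketch of a stage that streams to S3 without caching (bucket and script names are placeholders):

stages:
  process:
    cmd: python process.py --output s3://bucket/prefix/out
    deps:
    - process.py
    outs:
    - s3://bucket/prefix/out:
        cache: false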

To cache your S3 pipeline outputs, I hope we can use cloud versioning to help address this, so please also comment there with your thoughts. This could mean that DVC no longer has to do slow operations like caching an extra copy of your data, so it may actually improve the workflow in the end.
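For reference, enabling cloud versioning on a remote is a one-line config change, assuming the bucket itself has S3 versioning turned on (remote name and bucket are hypothetical):

# mark the remote as version-aware so DVC relies on S3 object versions
dvc remote add -d storage s3://bucket/prefix
dvc remote modify storage version_aware true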

Originally posted by @dberenbaum in #7093 (comment)

@dberenbaum (Collaborator)

Thanks, @dberenbaum, for adding some more motivation for potentially deprecating it.

As a matter of fact, my current use case is a data preprocessing and model evaluation pipeline. In this context I mostly use the dvc.yaml as a means to document the commands I run alongside the source code, to make it easier to reproduce the same results another time or to re-run the same steps with different inputs. For this purpose I don't care much about caching past results beyond the current run, and if I were to revert to an earlier version I would be OK with spending the compute again to reproduce the outputs. Cloud versioning also seems like it would fit well here, though we haven't enabled it for the S3 bucket we currently use for such experimental runs.

I can also confirm that the time spent on caching and re-computing hashes is quite a bottleneck, given how slow it is for thousands of S3 objects. With the Ray cluster, the time spent waiting for DVC to perform these operations is far greater than the actual execution time of the individual stage commands.

It would mostly be sufficient for my use case here if outs and deps merely defined the dependency graph.

It would, however, also be useful if stages would still only re-run if the inputs, parameters, or command of a stage changed. If that can be done via version information provided by cloud versioning backends, that sounds good.

[...] you should be able to continue working with pipelines streaming data to/from s3 if you set cache: false.

Thanks for reminding me of the cache option; I hadn't considered that I could use it to bypass the duplication of the S3 objects (given that the S3 cache is configured to be within the same S3 bucket as the stage outputs).
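A minimal sketch of such a setup under DVC 2.x, with the cache colocated under a prefix in the same bucket as the outputs (names are placeholders):

# point DVC's external S3 cache at a prefix in the same bucket as the stage outputs
dvc remote add s3cache s3://bucket/cache
dvc config cache.s3 s3cache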

"drop external outputs"

I think when I read this, I was mostly concerned that S3 URIs would no longer be valid stage deps or outs. But if these are still supported, that works for me.

Originally posted by @aschuh-hf in #7093 (comment)

@dberenbaum mentioned this issue Jun 1, 2023
@dberenbaum added this to DVC Jun 1, 2023
@dberenbaum moved this to Todo in DVC Jun 1, 2023
@dberenbaum (Collaborator)

@aschuh-hf Moved the discussion to a new issue to keep it organized. It sounds like this should work for your needs, then, if we can do it the way we are planning. One of the arguments in favor of dropping the current approach was to get feedback from users like you who need it, so we can build something that works better.

It would, however, also be useful if stages would still only re-run if the inputs, parameters, or command of a stage changed. If that can be done via version information provided by cloud versioning backends, that sounds good.

Yes, we should ensure this is still the case. Caching is only needed to recover prior versions of the pipeline, which, as you say, may often make more sense to recompute in the rare cases they are needed.

@pokey commented Jun 18, 2023

What are we supposed to do if we have external outputs? I use DVC as a pipeline to generate transcripts, timelines, etc. for videos on my YouTube channel, and I have a bunch of stages (here's one) that copy output artifacts into a separate directory that I use to define a website where I embed my videos.

@efiop (Contributor) commented Jun 18, 2023

@pokey Unfortunately the only option for you is to keep using 2.x.

@pokey commented Jun 19, 2023

Is there a plan to reintroduce such functionality in some form in the future? Or maybe some recommended third-party solution, e.g. a Makefile? 😄

@efiop (Contributor) commented Jun 19, 2023

@pokey You can still use --outs-no-cache if you don't need caching; it seems like that should be enough for your use case?
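As a sketch of how that might look for a copy stage like the one described above, using the dvc.yaml equivalent of --outs-no-cache (paths are hypothetical):

stages:
  copy_transcript:
    cmd: cp transcripts/video1.srt website/assets/video1.srt
    deps:
    - transcripts/video1.srt
    outs:
    - website/assets/video1.srt:
        cache: false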
