
drop external outputs #9531

Closed · dberenbaum opened this issue Jun 1, 2023 · 9 comments · Fixed by #9570

@dberenbaum (Collaborator)

Drop current implementation of external outputs as part of #7093.

@dberenbaum (Collaborator)

When dropping external outputs, can we continue to support --outs-no-cache with external paths, and can we allow them without the --external flag (which we can drop)?

Originally posted by @dberenbaum in #7093 (comment)
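A rough sketch of what the difference would look like on the command line (stage and bucket names are hypothetical; flags as in DVC 2.x, if I have them right):

# DVC 2.x: an external output requires the explicit --external flag
dvc stage add -n process --external -d process.py -o s3://bucket/out python process.py

# Proposed: the external path is allowed as an uncached output, no --external needed
dvc stage add -n process -d process.py --outs-no-cache s3://bucket/out python process.py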

@dberenbaum (Collaborator)

Hi, I posted this question regarding external outputs on Discord and am moving the discussion here as suggested.

I recently moved a large-scale data analysis from our local server to AWS using a Ray cluster (https://www.ray.io/). The different steps are implemented as individual Ray jobs and compiled into a reproducible pipeline with DVC using a dvc.yaml. Because the jobs/stages execute in AWS, and because the data is too large to store locally, the inputs and outputs of each stage are external AWS S3 URIs. Along the way I ran into a couple of issues (a stage output is always reported "not in cache" even right after dvc commit -f was run; a stage fails to run the first time because the S3 URI prefix does not exist yet) that suggest bugs in DVC's handling of external outputs (I still need to file GitHub issues for these).

However, looking at the roadmap for the next major release, 3.0, I see a to-do item that reads "drop external outputs". So my more immediately pressing questions are:

  • Will AWS S3 URIs for stage inputs and outputs no longer be supported from DVC version 3.0 onwards?
  • What is the recommended way to run large, reproducible data processing in the cloud (AWS), where all stage outputs live only in S3 and stage scripts stream data directly from/to that remote storage?

An example dvc.yaml would look something like:

vars:
- s3_uri_prefix: s3://bucket/prefix
stages:
  stage_1:
    cmd: ray job submit --working-dir . -- python stage_1.py --output ${s3_uri_prefix}/stage_1
    deps:
    - stage_1.py
    outs:
    - ${s3_uri_prefix}/stage_1:
        push: false
  stage_2:
    cmd: ray job submit --working-dir . -- python stage_2.py --input ${s3_uri_prefix}/stage_1 --output ${s3_uri_prefix}/stage_2
    deps:
    - stage_2.py
    - ${s3_uri_prefix}/stage_1
    outs:
    - ${s3_uri_prefix}/stage_2:
        push: false

I'd appreciate any pointers as to how DVC would still fit into the picture here.

Originally posted by @aschuh-hf in #7093 (comment)

@dberenbaum (Collaborator)

@aschuh-hf I'm personally torn on this decision, but ultimately (as you mentioned) there are a lot of bugs and problems with the way we handle external outputs today. The goal is not to give up on the use case you describe (it's more important than ever) but to build a better solution for it.

Besides staying on DVC 2.x for now, you should be able to continue working with pipelines streaming data to/from S3 if you set cache: false. Given that your data is large, do you mostly use DVC to manage your pipeline, or do you also use the cache and check out older versions of your data?
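For example, a minimal sketch of a stage that streams to S3 without caching (bucket and script names are placeholders):

stages:
  process:
    cmd: python process.py --output s3://bucket/prefix/out
    deps:
    - process.py
    outs:
    - s3://bucket/prefix/out:
        cache: false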

To cache your S3 pipeline outputs, I hope we can use cloud versioning to help address this, so please also comment there with your thoughts. This could mean that DVC no longer has to do slow operations like caching an extra copy of your data, so it may actually improve the workflow in the end.
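For reference, enabling cloud versioning on a remote is a one-line config change, assuming the bucket itself has S3 versioning turned on (remote name and bucket are hypothetical):

# mark the remote as version-aware so DVC relies on S3 object versions
dvc remote add -d storage s3://bucket/prefix
dvc remote modify storage version_aware true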

Originally posted by @dberenbaum in #7093 (comment)

@dberenbaum (Collaborator)

Thanks, @dberenbaum, for adding some more motivation for potentially deprecating it.

As a matter of fact, my current use case is a data preprocessing and model evaluation pipeline. In this context I mostly use the dvc.yaml as a means to document the commands I run alongside the source code, to make it easier to reproduce the same results another time or to re-run the same steps with different inputs. For this purpose I don't care much about caching past results beyond the current run, and if I were to revert to an earlier version I would be OK with spending the compute again to reproduce the outputs. Cloud versioning also seems like it would fit well here, though we haven't enabled it for the S3 bucket we currently use for such experimental runs.

I can also confirm that the time spent on caching and re-computing hashes is quite a bottleneck, given how slow it is for thousands of S3 objects. With the Ray cluster, the time spent waiting for DVC to perform these operations is far greater than the actual execution time of the individual stage commands.

It would mostly be sufficient for my use case here if outs and deps merely defined the dependency graph.

It would, however, also be useful if stages would still only re-run if the inputs, parameters, or command of a stage changed. If that can be done via version information provided by cloud versioning backends, that sounds good.

[...] you should be able to continue working with pipelines streaming data to/from s3 if you set cache: false.

Thanks for reminding me of the cache option; I hadn't considered that I could use it to bypass the duplication of the S3 objects (given that the S3 cache is configured to be within the same S3 bucket as the stage outputs).
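A minimal sketch of such a setup under DVC 2.x, with the cache colocated under a prefix in the same bucket as the outputs (names are placeholders):

# point DVC's external S3 cache at a prefix in the same bucket as the stage outputs
dvc remote add s3cache s3://bucket/cache
dvc config cache.s3 s3cache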

"drop external outputs"

I think when I read this, I was mostly concerned that S3 URIs would no longer be valid stage deps or outs. But if these are still supported, that works for me.

Originally posted by @aschuh-hf in #7093 (comment)

@dberenbaum mentioned this issue Jun 1, 2023
@dberenbaum added this to DVC Jun 1, 2023
@dberenbaum moved this to Todo in DVC Jun 1, 2023
@dberenbaum (Collaborator)

@aschuh-hf Moved the discussion to a new issue to keep it organized. It sounds like this should work for your needs, then, if we can do it the way we are planning. One of the arguments in favor of dropping the current approach was to get feedback from users like you who need it, so we can build something that works better.

It would, however, also be useful if stages would still only re-run if the inputs, parameters, or command of a stage changed. If that can be done via version information provided by cloud versioning backends, that sounds good.

Yes, we should ensure this is still the case. Caching is only needed to recover prior versions of the pipeline, which, as you say, may often make more sense to recompute in the rare cases they are needed.

@pokey commented Jun 18, 2023

What are we supposed to do if we have external outputs? I use DVC as a pipeline to generate transcripts, timelines, etc. for videos on my YouTube channel, and I have a bunch of stages (here's one) that copy output artifacts into a separate directory that I use to define a website where I embed my videos.

@efiop (Contributor) commented Jun 18, 2023

@pokey Unfortunately the only option for you is to keep using 2.x.

@pokey commented Jun 19, 2023

Is there a plan to reintroduce such functionality in some form in the future? Or maybe some recommended third-party solution, e.g. a Makefile? 😄

@efiop (Contributor) commented Jun 19, 2023

@pokey You can still use --outs-no-cache if you don't need caching; it seems like that should be enough for your use case?
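As a sketch of how that might look for a copy stage like the one described above, using the dvc.yaml equivalent of --outs-no-cache (paths are hypothetical):

stages:
  copy_transcript:
    cmd: cp transcripts/video1.srt website/assets/video1.srt
    deps:
    - transcripts/video1.srt
    outs:
    - website/assets/video1.srt:
        cache: false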
