drop external outputs #9531
When dropping external outputs, can we continue to support Originally posted by @dberenbaum in #7093 (comment) |
Hi, I posted this question regarding external outputs on Discord and am moving the discussion here as suggested. I recently moved a larger-scale data analysis from our local server to AWS, where it runs on a Ray cluster (https://www.ray.io/). The different steps are implemented as individual Ray jobs which are compiled into a reproducible pipeline with DVC using a However, looking at this roadmap for the next major release version 3.0, I can see that there is a To-Do item that reads "drop external outputs". So my more immediately pressing questions are:
An example
I'd appreciate any pointers as to how DVC would still fit into the picture here. Originally posted by @aschuh-hf in #7093 (comment) |
@aschuh-hf I'm personally torn on this decision, but ultimately (as you mentioned) there are a lot of bugs and problems with the way we handle external outputs today. The goal is not to give up on the use case you mention above (it's more important than ever), but to build a better solution for it. Besides staying on DVC 2.x for now, you should be able to continue working with pipelines streaming data to/from S3 if you set To cache your S3 pipeline outputs, I hope we can use cloud versioning to help address this, so please also comment there with your thoughts. This could mean that DVC no longer has to do slow operations like caching an extra copy of your data, so I hope it will end up improving the workflow in the end. Originally posted by @dberenbaum in #7093 (comment) |
Thanks, @dberenbaum, for adding some more motivation for potentially deprecating it. As a matter of fact, my current use case was for a data preprocessing and model evaluation pipeline. I am indeed mostly using the I can also confirm that the time spent on caching and re-computing hashes is quite a bottleneck, given how slow it is for thousands of S3 objects. Given the use of a Ray cluster, the time spent waiting for DVC to perform these operations is far greater than the actual execution of the individual stage commands. It would mostly be sufficient for my use case here if It would, however, also be useful if stages would still only re-run if the inputs, parameters, or command of a stage changed. If that can be done via version information provided by cloud versioning backends, that sounds good.
Thanks for reminding me of the
I think when I read this, I mostly was concerned that S3 URIs would no longer be valid stage Originally posted by @aschuh-hf in #7093 (comment) |
@aschuh-hf Moved the discussion to a new issue to keep it organized. Sounds like it should work for your needs then if we can do it the way we are planning. One of the arguments in favor of dropping the current approach was to get feedback from users like you who need it so we can build something that works better.
Yes, we should ensure this is still the case. The caching is only needed to recover prior versions of the pipeline, which as you say may often make more sense to recompute in the rare cases they are needed. |
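To make the re-run condition discussed above concrete, here is a minimal sketch of how a stage could be skipped without caching its outputs, assuming the storage backend exposes a version identifier per dependency. This is an illustration only, not DVC's implementation: `stage_fingerprint`, `needs_rerun`, the bucket path, and the version strings are all hypothetical.

```python
import hashlib
import json


def stage_fingerprint(cmd, params, dep_versions):
    """Hash the command, parameters, and per-dependency version identifiers.

    dep_versions maps a dependency URI to a version string, for example an
    object version ID reported by a cloud-versioned bucket (assumption).
    """
    payload = json.dumps(
        {"cmd": cmd, "params": params, "deps": dep_versions},
        sort_keys=True,  # stable key order so equal inputs hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_rerun(previous_fingerprint, cmd, params, dep_versions):
    """Return True when the command, parameters, or any input version changed."""
    return previous_fingerprint != stage_fingerprint(cmd, params, dep_versions)


# Hypothetical stage: values below are placeholders, not real resources.
deps = {"s3://example-bucket/raw/": "version-id-1"}
recorded = stage_fingerprint("python preprocess.py", {"seed": 0}, deps)
```

Only the small fingerprint has to be stored between runs, so no extra copy of the data is made; recovering an older output would mean recomputing it or relying on the backend's own versioning.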
Fixes #9531 Docs iterative/dvc.org#4574 Kudos @dberenbaum
What are we supposed to do if we have external outputs? I use dvc as a pipeline to generate transcripts, timelines, etc for videos on my youtube channel, and I have a bunch of stages (here's one) that copy output artifacts into a separate directory that I use to define a website where I embed my videos. |
@pokey Unfortunately the only option for you is to keep using 2.x. |
Is there a plan to reintroduce such functionality in some form in the future? Or maybe some recommended third-party solution, e.g. a Makefile? 😄 |
@pokey You can still use |
Drop current implementation of external outputs as part of #7093.