dvc.yaml: future of foreach stages #5440
Comments
@jorgeorpinel really good questions. It seems like we should be clear with scenarios for It feels like we are mixing up different scenarios while
The initial motivation for @jorgeorpinel you are referencing mostly the 2nd scenario. And I agree that it might not be covered well in the current design, and I don't know if we should focus on this scenario. Based on your experience, how often do users implement (1) and (2)? Which one is the most common case?
OK so what I call "partitioning" is not a misuse after all, and
I personally can only remember wide stage scenarios from past support cases (another example). I have a feeling that (1) will be 90% of the usage but it's too early to tell. We'll find out I suppose 🙂 Also, I forgot to mention that the only small problem for the wide stage case in the current implementation is that if the width keeps changing, dvc.lock will grow linearly until users notice and figure out that they have to delete it regularly. This is because DVC doesn't delete old states from dvc.lock if you change their names in dvc.yaml (to prevent merge conflicts in Git). I don't have a suggestion right now on how to improve this.
Agreed with both of you that the wide stage scenario is what The issue presented in the https://discuss.dvc.org/t/versioning-predictions/656 is unique in that the inputs to In the scenario presented by mauro, the pipeline is being used not for model development/training, but for inference/prediction/scoring of new batches of data. This is a common use case (why else develop ML models?), but DVC does not have an obvious pattern for it right now. I think the suggestions to either create separate directories or delete I'd be interested to get feedback from mauro and others who want to use dvc for prediction using a pre-trained model (or users who are both regularly re-training and then making predictions on new data). |
Yep. It matches the feeling that we had when we were designing this feature. I won't be surprised if we get more signals for the 2nd scenario - that would be the time to improve
Agree. It would be great to hear @skshetry's opinion on this.
💯 it would be great to find more scenarios and feedback. We can use that for the next version of the DVC pipelines.
And also in that per-file provenance is a requirement which is why they don't track entire directories as deps/outs, preferring a per-file
Yeah so maybe that's the unintended use? DVC pipeline ≠ general task automation
Since it's a core part of ML pipelines, I'd prefer to say it's a use that's not well developed yet :)
Hi guys. So back on this, it seems people are finding even broader use cases for the My Q is, is that a 3rd case we are trying to solve, or somewhat of a misuse? Thanks
p.s.
See also #5644. I still think users will assume it behaves like a typical foreach loop (preserving the order of items passed) even when dvc.yaml is clearly declarative, not imperative.
I think this is beyond the scope of what
Closing as stale
Sorry in advance for the long text. As a disclaimer, I plan to tag others here and probably not be too involved in this discussion for now (if it happens). So please feel free to manage the issue as you best see fit.
First a Q: What is the main use case or reason for foreach stages? I haven't been able to find an independent issue where this is explained. It seems to have spawned naturally out of the parameterization feature (OP #3633). The usage is explained in its PR (#4734) but again I can't find the official motivation (what problem gets solved).
Edit: I found some motivation in https://discuss.dvc.org/t/using-dvc-to-keep-track-of-multiple-model-variants/471/2 towards the idea of "generalizing stages", OK let's go with that for now.
The reason I ask is that while it's a greatly engineered feature, the current syntax may encourage a "misuse" of DVC. Specifically, it seems rather difficult to connect the stages defined inside foreach/do to one another (let alone among foreach clauses). For example:
First, the command in each stage should probably change (thus ${item.exec} above).
Second, only if ${item.out}_i = ${item.in}_(i+1) (i = [1 ... len(mylist)]) would these stages form a pipeline. This kind of pattern implies a very careful construction of params.yaml (or embedded vars:).
While quite doable, I doubt all this will be obvious to users.
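To make the chaining condition concrete, here is a minimal sketch of what such a dvc.yaml could look like. It is only an illustration under stated assumptions: the stage name (chain), the scripts (clean.py, featurize.py) and the file names are all made up, and the list is written inline under foreach rather than loaded from params.yaml to keep the fragment self-contained.

```yaml
# dvc.yaml (hypothetical sketch; stage, script and file names are invented)
stages:
  chain:
    foreach:
      # Item 1: its "out" must match the next item's "in" ...
      - exec: python clean.py
        in: raw.csv
        out: clean.csv
      # Item 2: ... so that this stage depends on the previous one.
      - exec: python featurize.py
        in: clean.csv
        out: features.csv
    do:
      # The command changes per item via ${item.exec}.
      cmd: ${item.exec} ${item.in} ${item.out}
      deps:
        - ${item.in}
      outs:
        - ${item.out}
```

As far as I understand, the two generated stages only connect because clean.csv is both the out of the first item and the in of the second. Change either path and you silently get two disconnected stages, which is the point made above.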
We can document it (maybe transfer this to dvc.org if that is the conclusion), but most users tend to jump into usage first and ask questions later. And it seems to me that by far the most intuitive way to use this now is to create a bunch of completely disconnected stages — which earlier led me to discuss #5181 — that are really one stage that partitions data inputs/outputs for batching or parallel processing (which we don't support yet).
Per some earlier discussions (example) I know we don't want foreach stages to be considered "groups", so I assume this is a misuse.
Questions:
$${list.val} instead of foreach/do, as suggested originally).
vars:?
Thanks!