dvc.yaml: future of foreach stages #5440

jorgeorpinel · 2021-02-10T04:52:33Z

Sorry for long text in advance and as a disclaimer, I plan to tag others here and probably not be too involved in this discussion for now (if it happens). So please feel free to manage the issue as you best see fit.

First a Q: What is the main use case or reason for foreach stages? I haven't been able to find an independent issue where this is explained. It seems to have spawned naturally out of the parameterization feature (OP #3633). The usage is explained in its PR (#4734) but again I can't find the official motivation (what problem gets solved).
Edit: I found some motivation in https://discuss.dvc.org/t/using-dvc-to-keep-track-of-multiple-model-variants/471/2 towards the idea of "generalizing stages", OK let's go with that for now.

The reason I ask is that while it's a greatly engineered feature, the current syntax may encourage a "misuse" of DVC. Specifically, it seems rather difficult to connect the stages defined inside foreach/do to one another (let alone among foreach clauses). For example:

stages:
  mystages:
    foreach: ${mylist}
    do:
      cmd: ./${item.exec} in ${item.in} out ${item.out}

First, the command in each stage should probably change (thus ${item.exec} above).
Second, only if ${item.out}_i = ${item.in}_i+1 (i = [1 ... len(mylist)]) would these stages form a pipeline. This kind of patterns imply a very careful construction of params.yaml (or embedded vars:).
While quite doable, I doubt all this will be obvious to users.

We can document it (maybe transfer this to dvc.org if that is the conclusion), but most users tend to jump into usage first and ask questions later. And it seems to me that by far the most intuitive way to use this now is to create a bunch of completely disconnected stages — which earlier led me to discuss #5181 — that are really one stage that partitions data inputs/outputs for batching or parallel processing (which we don't support yet).

So we may end up running into support cases like https://discuss.dvc.org/t/versioning-predictions/656 very often. BTW that case does have a nice usage for this feature: facilitate per-file data provenance.

Per some earlier discussions (example) I know we don't want foreach stages to be considered "groups", so I assume this is a misuse.

Questions:

Is there an alternative syntax that doesn't encourage disconnected stage sets? (perhaps a 2nd substitution expression like $${list.val} instead of foreach/do, as suggested originally).
Or do we want DVC to be everything it is + a dataset partitioning tool (and/or parallel processing framework repro: add scheduler for parallelising execution jobs #755) ?
If so, should we
- consider other features so it's not necessary to manually specify data partitions in params files/ vars:?
- reconsider the fallback run queueing order (repro: fallback order of execution for independent stages #5181)?

Thanks!

See also #5172

The text was updated successfully, but these errors were encountered:

dmpetrov · 2021-02-10T21:29:16Z

@jorgeorpinel really good questions. It seems like we should be clear with scenarios for foreach - what it aims to solve and what is not.

It feels like we are mixing up different scenarios while foreach was designed only with one of these scenarios in mind.

Wide stage. Generate multiple similar artifacts/models for a single next stage.
a. Parallel pipelines. It is a special case of (1).
Long stage. Connect inputs and outputs like ${item.out}_i = ${item.in}_i+1 (i = [1 ... len(mylist)])

The initial motivation for foreach was the 1st scenario. An example, building 3 separate models for different markets (us, gb, ch) and later ensemble them together into a single model and metric.

@jorgeorpinel you are referencing mostly the 2nd scenario. And I agree that it might be not covered well in the current design and I don't know if we should focus on this scenario. Based on your experience, how often users implement (1) and (2). Which one is the most common case?

jorgeorpinel · 2021-02-11T00:52:22Z

OK so what I call "partitioning" is not a misuse after all, and foreach stages can often times be considered groups then.

Based on your experience, how often users implement (1) and (2)?

I personally can only remember wide stage scenarios from past support cases (another example). I have a feeling that (1) will be 90% of the usage but it's too early to tell. We'll find out I suppose 🙂

Also, I forgot to mention that the only small problem for the wide stage case in the current implementation is that if the width keeps changing, dvc.lock will grow linearly until the point that users may notice it and figure out that they have to delete it regularly. This is because DVC doesn't delete old states from dvc.lock if you change their names in dvc.yaml (to prevent merge conflicts in Git). I don't have a suggestion rn on how to improve this.

dberenbaum · 2021-02-11T01:28:45Z

Agreed with both of you that the wide stage scenario is what foreach solves and is its likely intended use (and that parallel execution is an additional but distinct consideration for these scenarios).

The issue presented in the https://discuss.dvc.org/t/versioning-predictions/656 is unique in that the inputs to foreach are completely different with each run. A more typical use case might be something like building an NLP model for 100 different languages, each of which is specified in params.yaml. More may be added over time, but it would be rare to completely change the set of languages.

In the scenario presented by mauro, the pipeline is being used not for model development/training, but for inference/prediction/scoring of new batches of data. This is a common use case (why else develop ML models?), but DVC does not have an obvious pattern for it right now.

I think the suggestions to either create separate directories or delete dvc.lock are reasonable. Another suggestion would be to create a branch (maybe named by date or some identifier of that batch of data) for each run to avoid conflicts in dvc.lock and to prevent any kind of "sequencing" since I'm assuming there is no connection between one run and the next. Another idea we could explore would be to have a command (or more likely reuse an existing command like status or list) that outputs the dvc.lock info specifically for the stages in the current workspace's pipeline/dvc.yaml.

I'd be interested to get feedback from mauro and others who want to use dvc for prediction using a pre-trained model (or users who are both regularly re-training and then making predictions on new data).

dmpetrov · 2021-02-11T06:20:32Z

I have a feeling that (1) will be 90% of the usage but it's too early to tell

Yep. It matches the feeling that we had when we were designing this feature.

I won't be surprised if we get more signals for the 2nd scenario - that would be the time to improve foreach. But the current design looks good.

I think the suggestions to either create separate directories or delete dvc.lock are reasonable.

Agree. It would be great to hear @skshetry opinion on this.

I'd be interested to get feedback from mauro and others who want to use dvc for prediction using a pre-trained model

💯 it would be great to find more scenarios and feedback. We can use that for the next version of the DVC pipelines.

jorgeorpinel · 2021-02-12T08:44:05Z

discuss.dvc.org/t/versioning-predictions/656 is unique in that the inputs to foreach are completely different with each run

And also in that per-file provenance is a requirement which is why they don't track entire directories as deps/outs, preferring a per-file foreach.

the pipeline is being used not for model development/training, but for inference/prediction/scoring of new batches of data

Yeah so maybe that's the unintended use? DVC pipeline ≠ general task automation

dberenbaum · 2021-02-12T13:29:18Z

Yeah so maybe that's the unintended use? DVC pipeline ≠ general task automation

Since it's a core part of ML pipelines, I'd prefer to say it's a use that's not well developed yet :)

jorgeorpinel · 2021-03-22T22:10:35Z

Hi guys. So back on this, it seems people are finding even more broad use cases for the foreach feature as I ~~feared~~ suspected (the name makes it seem like a very general control structure if you are familiar with programming). E.g. in https://discord.com/channels/485586884165107732/485596304961962003/823593858583232522 they are generalizing a Python call with cmd: python ${item.script} ${item.input} ... (but hitting a limitation because the stages all need to have the same no. of dependencies).

My Q is, is that a 3rd case we are trying to solve, or somewhat of a misuse? Thanks

jorgeorpinel · 2021-03-22T22:20:43Z

p.s.

reconsider the fallback run queueing order (#5181)?
the name makes it seem like a very general control structure

See also #5644. I still think users will assume it behaves like a typical foreach loop (preserving the order of items passed) even when dvc.yaml is clearly declarative, not imperative.

dberenbaum · 2021-03-22T23:57:29Z

they are generalizing a Python call with cmd: python ${item.script} ${item.input} ... (but hitting a limitation because the stages all need to have the same no. of dependencies).

I think this is beyond the scope of what foreach supports, and it's a good example of where a Python API for creating the DAG might be better. @skshetry has started that discussion in #5646, so I'll just quote him:

We have already introduced foreach (aka for loops) and parametrization (aka fstring). There's a discussion about introducing map-reduce and nested-foreach within dvc.yaml. The question could be asked another way around: is it worth adding them in dvc.yaml if we can provide a good python API?

The Python API are usually very granular, so they can capture much more scenarios (known and unknown ones) and semantics than the supposedly high-level dvc.yaml file. With python, we have a complete programming language to our side, we can easily provide generalized and specialized APIs to satisfy user needs. In dvc.yaml, this is very much limited (or, we are going to add too many keywords and schemas to support them which makes it hard to read/edit/write.

efiop · 2023-12-08T12:46:13Z

Closing as stale

jorgeorpinel added discussion requires active participation to reach a conclusion product: VSCode Integration with VSCode extension labels Feb 10, 2021

jorgeorpinel changed the title ~~dvc.yaml: foreach stages use case?~~ dvc.yaml: future of foreach stages Feb 12, 2021

jorgeorpinel added the enhancement Enhances DVC label Feb 12, 2021

jorgeorpinel mentioned this issue May 3, 2021

ref: update stage list refs. iterative/dvc.org#2369

Merged

dberenbaum removed the product: VSCode Integration with VSCode extension label May 18, 2021

karajan1001 mentioned this issue May 19, 2021

Run all variants of a foreach stage in parallel #6035

Closed

efiop closed this as not planned Won't fix, can't repro, duplicate, stale Dec 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dvc.yaml: future of foreach stages #5440

dvc.yaml: future of foreach stages #5440

jorgeorpinel commented Feb 10, 2021 •

edited

Loading

dmpetrov commented Feb 10, 2021

jorgeorpinel commented Feb 11, 2021 •

edited

Loading

dberenbaum commented Feb 11, 2021

dmpetrov commented Feb 11, 2021

jorgeorpinel commented Feb 12, 2021

dberenbaum commented Feb 12, 2021

jorgeorpinel commented Mar 22, 2021 •

edited

Loading

jorgeorpinel commented Mar 22, 2021

dberenbaum commented Mar 22, 2021

efiop commented Dec 8, 2023

dvc.yaml: future of foreach stages #5440

dvc.yaml: future of foreach stages #5440

Comments

jorgeorpinel commented Feb 10, 2021 • edited Loading

dmpetrov commented Feb 10, 2021

jorgeorpinel commented Feb 11, 2021 • edited Loading

dberenbaum commented Feb 11, 2021

dmpetrov commented Feb 11, 2021

jorgeorpinel commented Feb 12, 2021

dberenbaum commented Feb 12, 2021

jorgeorpinel commented Mar 22, 2021 • edited Loading

jorgeorpinel commented Mar 22, 2021

dberenbaum commented Mar 22, 2021

efiop commented Dec 8, 2023

jorgeorpinel commented Feb 10, 2021 •

edited

Loading

jorgeorpinel commented Feb 11, 2021 •

edited

Loading

jorgeorpinel commented Mar 22, 2021 •

edited

Loading