Nested Looping in Parameterization #5172

bharathc346 · 2020-12-28T17:23:19Z

This feature request is in regards to the experimental parameterization feature here: https://github.com/iterative/dvc/wiki/Parametrization

Currently if in params.yaml if one has

  models:
    a:
      features: 1
    a:
      features: 2
    b:
      features: 1
    b:
      features: 2

we can have dvc.yaml as so

stages:
  build:
    foreach: ${models}
    do:
      cmd: >- 
          python script.py
          --model ${item}
          --features ${item.features}

Currently dvc.yaml can only loop over one thing e.g models in this case. I am wondering if something like this would be possible:

# params.yaml

models: 
  - a
  - b
features: 
  - 1
  - 2

and

# dvc.yaml

stages:
  build:
    foreach: ${models}
    foreach: ${features}
    do:
      cmd: >- 
          python script.py
          --model ${item_0}
          --features ${item_1}

This would be great as one wouldn't have to specify all possible configurations in params.yaml.
@skshetry tagging you as I believe you have been working on parameterization.

The text was updated successfully, but these errors were encountered:

skshetry · 2020-12-29T09:05:53Z

@bharathc346, we cannot have two foreach as it's not supported in YAML format.
We could support nested foreach as following:

stages:
  build:
    foreach: ${models}
    do:
      foreach: ${features}
      do:
        cmd: >- 
            python script.py
            --model ${item_0}
            --features ${item_1}

But, as you can see, this requires too much indentation. Though, this is what we have in mind right now.

@johnnychen94 suggested introducing something similar to Github's matrix or travis's env in #3633 (comment). But, we are not sure if matrix is more intuitive to our users than the loops. With matrix, it'd look something like following:

stages:
  build:
    matrix:
      - ${models}
      - ${features}
    do:
      cmd: >- 
          python script.py
          --model ${item_0}
          --features ${item_1}

Regarding item_0/item_1 naming, I think it might be better to do similar to what Ansible does.
So, item_0 will be accessible as item[0] and item_1 as item[1]. item will just be a list. Similarly, key[0], key[1], etc. This might make it confusing with item and key as they are now a list, but I don't have any good ideas for now.

Anyway, @bharathc346 thanks for creating the issue. Let's hear some feedback for 2.0 change and given enough interests, we'll come to it. Thanks.

bharathc346 · 2020-12-29T16:03:32Z

I see. I like @johnnychen94 comment about introducing the github matrix. But I think any solution for nested interpolation would be great. Thanks for the detailed comment

skshetry · 2020-12-30T04:16:19Z

@bharathc346, ahh, looks like I used the wrong term. It should be nested foreach.

courentin · 2021-03-11T09:41:06Z

Upvoted, that definitely something that would be nice to have.

Intuitively I was doing something like that:

stages:
  build:
    foreach:
      - fr:
        - a
        - b
      - en:
        - a
        - b
    do:
    ...

However, I feel the matrix is more readable. But I'd be ok with a nested foreach too.

ClementWalter · 2021-05-31T18:29:16Z

I am definitely interested in such a feature. If I were to vote on the naming convention, I would go for matrix as foreach means to me some sort of a parallelizable loop (for - each).

For the keys and items, I would stick with what is well done with the current foreach loop, meaning:

if a list is provided, then the var is item and since there is a matrix, item is a list:

stages:
  build:
    matrix:
      - ["us", "fr", "gb"]
      - [128, 256, 512]
    do:
      cmd: >-
          python script.py
          --model ${item[0]}
          --features ${item[1]}

if a dict is provided, item is a dict to access the values by their names (this second option improves readability and I could even vote to enforce it)

stages:
  build:
    matrix:
      model: ["us", "fr", "gb"]
      features: [128, 256, 512]
    do:
      cmd: >-
          python script.py
          --model ${item.model}
          --features ${item.features}

kiedanski · 2021-06-07T13:00:30Z

Hi, any updates on this feature? I would be really interested in this

ClementWalter · 2021-06-07T13:41:43Z

I am glad to work on it as well if we have a go from the dvc team

johnnychen94 · 2021-06-07T13:53:27Z

The matrix loop is a very trivial case in that it introduces a dense parameter grid. There will always be a need to run a "filter" on this grid to reduce unnecessary tasks.

It would be nice if DVC defines an interchangeable protocol for multiple parameterization tasks so that people can write their own task generation script when needed. The "params.yaml" example I provide in #3633 (comment) is a very naive version of the interchangeable protocol. It would be very trivial work to dump "matrix" definition into something like this, as long as the protocol is well-defined.

#5795 should be also fixed to enable wide usage of this requested feature.

dberenbaum · 2021-06-15T17:12:27Z

Overlap with dvc exp run

There is some overlap between this kind of parametrized loop and dvc exp run --set-param. The current foreach approach creates many separate stages, while dvc exp run creates many variations of the same stage. @johnnychen94 mentioned concurrent execution in another thread, which dvc exp run is already starting to tackle. If it were possible to programmatically define the set of experiments to run with dvc exp run, would people still need matrix?

Nested loop syntax

There seems to be consensus that the matrix syntax makes sense. Personally, I agree with @ClementWalter that enforcing a dict structure is most readable.

Adding matrix might make it challenging to explain the difference between foreach and matrix. This also relates generally to how to balance complexity and flexibility in defining DAGs (see #5646 (comment)).

karajan1001 · 2021-06-16T07:42:35Z

@dberenbaum, For myself, the foreach is a useful functionality when I'm doing some text classifications. In this work, two billion items need to be classified into about 5000 child categories inside 50 parent categories. The project has two levels, in the first one I trained a model to classify items into parent categories, and in the second I use foreach to train a separate model for the classification of the child categories for each parent category. It could not be replaced by exp or other functionalities.

So, If we are facing a project with an even more level of structure, then a nested loop might be an unavoidable solution.

jotsif · 2022-01-05T10:06:46Z

@dberenbaum I think matrix or nested foreach still make sense, even with more dynamic parameter definitions in dvc exp. We use "foreach stages" to define different models (similar model but for different countries and products for example), while using dvc experiments for trying out different parameters for those models. There is a logical difference, and it would not be optimal to start mixing those concepts.

And of course upvoting this feature request 👍🏻

johnyaku · 2022-03-21T04:43:28Z

Here is another upvote.

Either matrix or nested foreach would help me align the structure of the pipeline to the natural structure of the experiment.

Although I am fairly new to DVC, I keep running into limitations. I wonder if DVC might get more mileage by leveraging existing workflow systems (such as Snakemake and NextFlow) and focusing on what DVC does best, which is managing data versions. I guess I could trigger a whole Snakemake workflow as one stage of a DVC pipeline, and then specify which parameters, dependencies and outputs I want to track with DVC. I'm just wondering out loud if it might be more efficient to better integrate DVC with other workflow systems (such as automatically detecting parameters, dependencies and outputs) rather than focus on making DVC pipelines more flexible.

mraxilus · 2022-04-23T07:09:27Z

This is my most wanted feature. Just adding my perspective, as I have a collection of models I want to run on a collection of datasets. If I could nest the foreach, my config size would literally lose an order of magnitude in terms of line count.

As for whether we should use matrix instead, I think although it's visually cleaner it would have to be documented as a separate feature, which feels bloated. Whereas uses nested foreach, although less visually appealing, is a more intuitive extension of the concept (i.e. the first thing I tried after ready about DVC's looping). There would be no way someone would naturally change their foreach to matrix.

behrica · 2022-06-13T20:15:58Z

Just to comment here, that the "nested loop" is just "one way", of tuning the hyper parameters automatically.
There are others (random search or sobol sequences) and ideally DVC should allow each of them.

Allowing "multiple param files" in one go, might be a middle ground. (#7891 )

sjawhar · 2023-06-08T19:43:20Z

Chiming in with support for this feature! For context, see this trickery I'm currently doing with YAML anchors to serve this same purpose, and which is at risk of being broken after 3.0 release.

I think it's definitely possible that some version of matrix could serve my needs (what I'm trying to do is really not that complicated), thought it would depend on the details of the implementation. For example, would each "coordinate" in the constructed matrix need to be a simple scalar, or could one construct a matrix from e.g.

matrix:
  - [foo, bar]
  - - feature: a
      type: json
    - feature: b
      type: npz

I use maps like that a lot, so that feels pretty important.

As an aside, when combined with hydra, this could possibly allow for "toggling" certain pipeline stages. In the example below, you could set is_enabled to an empty list to disable that stage—assuming that matrix doesn't require non-empty lists—or a list of a single value to turn it on. I actually do this currently with foreach and empty lists/maps.

stage_name:
  matrix:
    - ${features}
    - ${is_enabled}
  do:
    ...

dberenbaum · 2023-06-15T19:41:24Z

matrix:
  - [foo, bar]
  - - feature: a
      type: json
    - feature: b
      type: npz

Would you mind continuing that example with the do: part? It's not entirely clear to me how it would look. I'd prefer to use the convention suggested by @ClementWalter above where each of those sections has a key to identify it, so maybe it would look more like:

matrix:
  - name: [foo, bar]
  - config: 
    - feature: a
      type: json
    - feature: b
      type: npz

I think that will also make it easier to reference those items in the do: section, but that's where I'd like to know how you think that would look.

sjawhar · 2023-06-15T20:46:34Z

Would you mind continuing that example with the do: part?

I like the explicit naming, though you then don't need a list at the top level:

stage_foo:
  matrix:
    name: [foo, bar]
    config:
    - feature: a
      type: json
    - feature: b
      type: npz
  do:
    cmd: echo ${item.name} ${item.config.feature}

dberenbaum · 2023-06-16T11:41:10Z

@skshetry Do you think this could eventually replace foreach altogether, or do you think we would need to keep both?

skshetry · 2023-07-12T06:53:19Z

I have a WIP implementation for matrix-do in #9725, which is mostly based on the comment in #5172 (comment).
I would appreciate your feedback on it. 🙂

pmrowla added feature request Requesting a new feature A: templating Related to the templating feature labels Dec 29, 2020

jorgeorpinel mentioned this issue Mar 11, 2021

dvc.yaml: future of foreach stages #5440

Closed

isidentical assigned dberenbaum Jun 15, 2021

isidentical added the discussion requires active participation to reach a conclusion label Jun 15, 2021

dberenbaum added the p2-medium Medium priority, should be done, but less important label May 11, 2023

dberenbaum mentioned this issue Jun 8, 2023

Release 3.0 #7093

Closed

4 tasks

skshetry mentioned this issue Jun 23, 2023

Stage-level vars alternative in DVC 3.X #9639

Closed

dberenbaum added this to DVC Jun 27, 2023

github-project-automation bot moved this to Backlog in DVC Jun 27, 2023

dberenbaum added p1-important Important, aka current backlog of things to do and removed p2-medium Medium priority, should be done, but less important labels Jun 27, 2023

dberenbaum assigned skshetry and unassigned dberenbaum Jul 4, 2023

dberenbaum moved this from Backlog to Todo in DVC Jul 11, 2023

skshetry mentioned this issue Jul 12, 2023

dvc.yaml: add matrix-do #9725

Merged

dberenbaum moved this from Todo to In Progress in DVC Jul 18, 2023

skshetry closed this as completed in #9725 Aug 4, 2023

github-project-automation bot moved this from In Progress to Done in DVC Aug 4, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Nested Looping in Parameterization #5172

Nested Looping in Parameterization #5172

bharathc346 commented Dec 28, 2020 •

edited by skshetry

Loading

skshetry commented Dec 29, 2020 •

edited

Loading

bharathc346 commented Dec 29, 2020 •

edited

Loading

skshetry commented Dec 30, 2020

courentin commented Mar 11, 2021 •

edited

Loading

ClementWalter commented May 31, 2021

kiedanski commented Jun 7, 2021

ClementWalter commented Jun 7, 2021

johnnychen94 commented Jun 7, 2021 •

edited

Loading

dberenbaum commented Jun 15, 2021

karajan1001 commented Jun 16, 2021 •

edited

Loading

jotsif commented Jan 5, 2022

johnyaku commented Mar 21, 2022

mraxilus commented Apr 23, 2022 •

edited

Loading

behrica commented Jun 13, 2022

sjawhar commented Jun 8, 2023 •

edited

Loading

dberenbaum commented Jun 15, 2023

sjawhar commented Jun 15, 2023

dberenbaum commented Jun 16, 2023

skshetry commented Jul 12, 2023

Nested Looping in Parameterization #5172

Nested Looping in Parameterization #5172

Comments

bharathc346 commented Dec 28, 2020 • edited by skshetry Loading

skshetry commented Dec 29, 2020 • edited Loading

bharathc346 commented Dec 29, 2020 • edited Loading

skshetry commented Dec 30, 2020

courentin commented Mar 11, 2021 • edited Loading

ClementWalter commented May 31, 2021

kiedanski commented Jun 7, 2021

ClementWalter commented Jun 7, 2021

johnnychen94 commented Jun 7, 2021 • edited Loading

dberenbaum commented Jun 15, 2021

Overlap with dvc exp run

Nested loop syntax

karajan1001 commented Jun 16, 2021 • edited Loading

jotsif commented Jan 5, 2022

johnyaku commented Mar 21, 2022

mraxilus commented Apr 23, 2022 • edited Loading

behrica commented Jun 13, 2022

sjawhar commented Jun 8, 2023 • edited Loading

dberenbaum commented Jun 15, 2023

sjawhar commented Jun 15, 2023

dberenbaum commented Jun 16, 2023

skshetry commented Jul 12, 2023

bharathc346 commented Dec 28, 2020 •

edited by skshetry

Loading

skshetry commented Dec 29, 2020 •

edited

Loading

bharathc346 commented Dec 29, 2020 •

edited

Loading

courentin commented Mar 11, 2021 •

edited

Loading

johnnychen94 commented Jun 7, 2021 •

edited

Loading

karajan1001 commented Jun 16, 2021 •

edited

Loading

mraxilus commented Apr 23, 2022 •

edited

Loading

sjawhar commented Jun 8, 2023 •

edited

Loading