Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nested Looping in Parameterization #5172

Closed
bharathc346 opened this issue Dec 28, 2020 · 19 comments · Fixed by #9725
Closed

Nested Looping in Parameterization #5172

bharathc346 opened this issue Dec 28, 2020 · 19 comments · Fixed by #9725
Assignees
Labels
A: templating Related to the templating feature discussion requires active participation to reach a conclusion feature request Requesting a new feature p1-important Important, aka current backlog of things to do

Comments

@bharathc346
Copy link

bharathc346 commented Dec 28, 2020

This feature request is in regards to the experimental parameterization feature here: https://github.com/iterative/dvc/wiki/Parametrization

Currently if in params.yaml if one has

  models:
    a:
      features: 1
    a:
      features: 2
    b:
      features: 1
    b:
      features: 2

we can have dvc.yaml as so

stages:
  build:
    foreach: ${models}
    do:
      cmd: >- 
          python script.py
          --model ${item}
          --features ${item.features}

Currently dvc.yaml can only loop over one thing e.g models in this case. I am wondering if something like this would be possible:

# params.yaml

models: 
  - a
  - b
features: 
  - 1
  - 2

and

# dvc.yaml

stages:
  build:
    foreach: ${models}
    foreach: ${features}
    do:
      cmd: >- 
          python script.py
          --model ${item_0}
          --features ${item_1}

This would be great as one wouldn't have to specify all possible configurations in params.yaml.
@skshetry tagging you as I believe you have been working on parameterization.

@pmrowla pmrowla added feature request Requesting a new feature A: templating Related to the templating feature labels Dec 29, 2020
@skshetry
Copy link
Member

skshetry commented Dec 29, 2020

@bharathc346, we cannot have two foreach as it's not supported in YAML format.
We could support nested foreach as following:

stages:
  build:
    foreach: ${models}
    do:
      foreach: ${features}
      do:
        cmd: >- 
            python script.py
            --model ${item_0}
            --features ${item_1}

But, as you can see, this requires too much indentation. Though, this is what we have in mind right now.

@johnnychen94 suggested introducing something similar to Github's matrix or travis's env in #3633 (comment). But, we are not sure if matrix is more intuitive to our users than the loops. With matrix, it'd look something like following:

stages:
  build:
    matrix:
      - ${models}
      - ${features}
    do:
      cmd: >- 
          python script.py
          --model ${item_0}
          --features ${item_1}

Regarding item_0/item_1 naming, I think it might be better to do similar to what Ansible does.
So, item_0 will be accessible as item[0] and item_1 as item[1]. item will just be a list. Similarly, key[0], key[1], etc. This might make it confusing with item and key as they are now a list, but I don't have any good ideas for now.

Anyway, @bharathc346 thanks for creating the issue. Let's hear some feedback for 2.0 change and given enough interests, we'll come to it. Thanks.

@bharathc346
Copy link
Author

bharathc346 commented Dec 29, 2020

I see. I like @johnnychen94 comment about introducing the github matrix. But I think any solution for nested interpolation would be great. Thanks for the detailed comment

@skshetry
Copy link
Member

@bharathc346, ahh, looks like I used the wrong term. It should be nested foreach.

@courentin
Copy link
Contributor

courentin commented Mar 11, 2021

Upvoted, that definitely something that would be nice to have.

Intuitively I was doing something like that:

stages:
  build:
    foreach:
      - fr:
        - a
        - b
      - en:
        - a
        - b
    do:
    ...

However, I feel the matrix is more readable. But I'd be ok with a nested foreach too.

@ClementWalter
Copy link
Contributor

I am definitely interested in such a feature. If I were to vote on the naming convention, I would go for matrix as foreach means to me some sort of a parallelizable loop (for - each).

For the keys and items, I would stick with what is well done with the current foreach loop, meaning:

  • if a list is provided, then the var is item and since there is a matrix, item is a list:
stages:
  build:
    matrix:
      - ["us", "fr", "gb"]
      - [128, 256, 512]
    do:
      cmd: >-
          python script.py
          --model ${item[0]}
          --features ${item[1]}
  • if a dict is provided, item is a dict to access the values by their names (this second option improves readability and I could even vote to enforce it)
stages:
  build:
    matrix:
      model: ["us", "fr", "gb"]
      features: [128, 256, 512]
    do:
      cmd: >-
          python script.py
          --model ${item.model}
          --features ${item.features}

@kiedanski
Copy link

Hi, any updates on this feature? I would be really interested in this

@ClementWalter
Copy link
Contributor

I am glad to work on it as well if we have a go from the dvc team

@johnnychen94
Copy link
Contributor

johnnychen94 commented Jun 7, 2021

The matrix loop is a very trivial case in that it introduces a dense parameter grid. There will always be a need to run a "filter" on this grid to reduce unnecessary tasks.

It would be nice if DVC defines an interchangeable protocol for multiple parameterization tasks so that people can write their own task generation script when needed. The "params.yaml" example I provide in #3633 (comment) is a very naive version of the interchangeable protocol. It would be very trivial work to dump "matrix" definition into something like this, as long as the protocol is well-defined.

#5795 should be also fixed to enable wide usage of this requested feature.

@isidentical isidentical added the discussion requires active participation to reach a conclusion label Jun 15, 2021
@dberenbaum
Copy link
Collaborator

Overlap with dvc exp run

There is some overlap between this kind of parametrized loop and dvc exp run --set-param. The current foreach approach creates many separate stages, while dvc exp run creates many variations of the same stage. @johnnychen94 mentioned concurrent execution in another thread, which dvc exp run is already starting to tackle. If it were possible to programmatically define the set of experiments to run with dvc exp run, would people still need matrix?

Nested loop syntax

There seems to be consensus that the matrix syntax makes sense. Personally, I agree with @ClementWalter that enforcing a dict structure is most readable.

Adding matrix might make it challenging to explain the difference between foreach and matrix. This also relates generally to how to balance complexity and flexibility in defining DAGs (see #5646 (comment)).

@karajan1001
Copy link
Contributor

karajan1001 commented Jun 16, 2021

@dberenbaum, For myself, the foreach is a useful functionality when I'm doing some text classifications. In this work, two billion items need to be classified into about 5000 child categories inside 50 parent categories. The project has two levels, in the first one I trained a model to classify items into parent categories, and in the second I use foreach to train a separate model for the classification of the child categories for each parent category. It could not be replaced by exp or other functionalities.

So, If we are facing a project with an even more level of structure, then a nested loop might be an unavoidable solution.

@jotsif
Copy link

jotsif commented Jan 5, 2022

@dberenbaum I think matrix or nested foreach still make sense, even with more dynamic parameter definitions in dvc exp. We use "foreach stages" to define different models (similar model but for different countries and products for example), while using dvc experiments for trying out different parameters for those models. There is a logical difference, and it would not be optimal to start mixing those concepts.

And of course upvoting this feature request 👍🏻

@johnyaku
Copy link

Here is another upvote.

Either matrix or nested foreach would help me align the structure of the pipeline to the natural structure of the experiment.

Although I am fairly new to DVC, I keep running into limitations. I wonder if DVC might get more mileage by leveraging existing workflow systems (such as Snakemake and NextFlow) and focusing on what DVC does best, which is managing data versions. I guess I could trigger a whole Snakemake workflow as one stage of a DVC pipeline, and then specify which parameters, dependencies and outputs I want to track with DVC. I'm just wondering out loud if it might be more efficient to better integrate DVC with other workflow systems (such as automatically detecting parameters, dependencies and outputs) rather than focus on making DVC pipelines more flexible.

@mraxilus
Copy link

mraxilus commented Apr 23, 2022

This is my most wanted feature. Just adding my perspective, as I have a collection of models I want to run on a collection of datasets. If I could nest the foreach, my config size would literally lose an order of magnitude in terms of line count.

As for whether we should use matrix instead, I think although it's visually cleaner it would have to be documented as a separate feature, which feels bloated. Whereas uses nested foreach, although less visually appealing, is a more intuitive extension of the concept (i.e. the first thing I tried after ready about DVC's looping). There would be no way someone would naturally change their foreach to matrix.

@behrica
Copy link

behrica commented Jun 13, 2022

Just to comment here, that the "nested loop" is just "one way", of tuning the hyper parameters automatically.
There are others (random search or sobol sequences) and ideally DVC should allow each of them.

Allowing "multiple param files" in one go, might be a middle ground. (#7891 )

@dberenbaum dberenbaum added the p2-medium Medium priority, should be done, but less important label May 11, 2023
@dberenbaum dberenbaum mentioned this issue Jun 8, 2023
4 tasks
@sjawhar
Copy link
Contributor

sjawhar commented Jun 8, 2023

Chiming in with support for this feature! For context, see this trickery I'm currently doing with YAML anchors to serve this same purpose, and which is at risk of being broken after 3.0 release.

I think it's definitely possible that some version of matrix could serve my needs (what I'm trying to do is really not that complicated), thought it would depend on the details of the implementation. For example, would each "coordinate" in the constructed matrix need to be a simple scalar, or could one construct a matrix from e.g.

matrix:
  - [foo, bar]
  - - feature: a
      type: json
    - feature: b
      type: npz

I use maps like that a lot, so that feels pretty important.

As an aside, when combined with hydra, this could possibly allow for "toggling" certain pipeline stages. In the example below, you could set is_enabled to an empty list to disable that stage—assuming that matrix doesn't require non-empty lists—or a list of a single value to turn it on. I actually do this currently with foreach and empty lists/maps.

stage_name:
  matrix:
    - ${features}
    - ${is_enabled}
  do:
    ...

@dberenbaum
Copy link
Collaborator

matrix:
  - [foo, bar]
  - - feature: a
      type: json
    - feature: b
      type: npz

Would you mind continuing that example with the do: part? It's not entirely clear to me how it would look. I'd prefer to use the convention suggested by @ClementWalter above where each of those sections has a key to identify it, so maybe it would look more like:

matrix:
  - name: [foo, bar]
  - config: 
    - feature: a
      type: json
    - feature: b
      type: npz

I think that will also make it easier to reference those items in the do: section, but that's where I'd like to know how you think that would look.

@sjawhar
Copy link
Contributor

sjawhar commented Jun 15, 2023

Would you mind continuing that example with the do: part?

I like the explicit naming, though you then don't need a list at the top level:

stage_foo:
  matrix:
    name: [foo, bar]
    config:
    - feature: a
      type: json
    - feature: b
      type: npz
  do:
    cmd: echo ${item.name} ${item.config.feature}

@dberenbaum
Copy link
Collaborator

@skshetry Do you think this could eventually replace foreach altogether, or do you think we would need to keep both?

@dberenbaum dberenbaum added this to DVC Jun 27, 2023
@github-project-automation github-project-automation bot moved this to Backlog in DVC Jun 27, 2023
@dberenbaum dberenbaum added p1-important Important, aka current backlog of things to do and removed p2-medium Medium priority, should be done, but less important labels Jun 27, 2023
@dberenbaum dberenbaum assigned skshetry and unassigned dberenbaum Jul 4, 2023
@dberenbaum dberenbaum moved this from Backlog to Todo in DVC Jul 11, 2023
@skshetry
Copy link
Member

I have a WIP implementation for matrix-do in #9725, which is mostly based on the comment in #5172 (comment).
I would appreciate your feedback on it. 🙂

@dberenbaum dberenbaum moved this from Todo to In Progress in DVC Jul 18, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in DVC Aug 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: templating Related to the templating feature discussion requires active participation to reach a conclusion feature request Requesting a new feature p1-important Important, aka current backlog of things to do
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.