Nested Looping in Parameterization #5172
Comments
@bharathc346, we cannot have two

stages:
  build:
    foreach: ${models}
    do:
      foreach: ${features}
      do:
        cmd: >-
          python script.py
          --model ${item_0}
          --features ${item_1}

But, as you can see, this requires too much indentation. Still, this is what we have in mind right now.

@johnnychen94 suggested introducing something similar to GitHub's

stages:
  build:
    matrix:
      - ${models}
      - ${features}
    do:
      cmd: >-
        python script.py
        --model ${item_0}
        --features ${item_1}

Regarding Anyway, @bharathc346, thanks for creating the issue. Let's hear some feedback on the 2.0 change, and given enough interest, we'll come to it. Thanks. |
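For anyone unfamiliar with matrix-style expansion, here is a rough Python sketch of the semantics being discussed: one command per combination of the listed values. This is only an illustration, not DVC code. The `expand_matrix` helper and the sample values for `models` and `features` are assumptions made up for the example.

```python
from itertools import product


def expand_matrix(axes, template):
    """Render one command per combination of the axes (hypothetical helper)."""
    commands = []
    for combo in product(*axes):
        cmd = template
        # Substitute positional placeholders ${item_0}, ${item_1}, ...
        for index, value in enumerate(combo):
            cmd = cmd.replace("${item_%d}" % index, str(value))
        commands.append(cmd)
    return commands


# Sample values, assumed for illustration only
models = ["us", "fr"]
features = [128, 256]

commands = expand_matrix(
    [models, features],
    "python script.py --model ${item_0} --features ${item_1}",
)
for command in commands:
    print(command)
```

The nested `foreach` form and the `matrix` form would both describe this same Cartesian product; the difference under discussion is only how much indentation the user has to write.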
I see. I like @johnnychen94's comment about introducing the GitHub matrix. But I think any solution for nested interpolation would be great. Thanks for the detailed comment. |
@bharathc346, ahh, looks like I used the wrong term. It should be nested |
Upvoted, that's definitely something that would be nice to have. Intuitively, I was doing something like this:

stages:
  build:
    foreach:
      - fr:
          - a
          - b
      - en:
          - a
          - b
    do:
      ...

However, I feel the |
I am definitely interested in such a feature. If I were to vote on the naming convention, I would go for
For the keys and items, I would stick with what the current foreach loop already does well, meaning:

stages:
  build:
    matrix:
      - ["us", "fr", "gb"]
      - [128, 256, 512]
    do:
      cmd: >-
        python script.py
        --model ${item[0]}
        --features ${item[1]}

stages:
  build:
    matrix:
      model: ["us", "fr", "gb"]
      features: [128, 256, 512]
    do:
      cmd: >-
        python script.py
        --model ${item.model}
        --features ${item.features}
|
Hi, any updates on this feature? I would be really interested in this |
I would be glad to work on it as well if we get the go-ahead from the DVC team. |
The matrix loop is a very trivial case in that it introduces a dense parameter grid. There will always be a need to run a "filter" on this grid to reduce unnecessary tasks. It would be nice if DVC defined an interchangeable protocol for multiple parameterization tasks so that people could write their own task generation script when needed. The "params.yaml" example I provided in #3633 (comment) is a very naive version of such a protocol. It would be trivial to dump a "matrix" definition into something like this, as long as the protocol is well-defined. #5795 should also be fixed to enable wide usage of this requested feature. |
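To make the "filter on a dense grid" point concrete, here is a small Python sketch of a user-written task generator. It is not a DVC feature and not the protocol from #3633; the `generate_tasks` function, the `keep` predicate, and the sample axes are all hypothetical names and values chosen for illustration.

```python
from itertools import product


def generate_tasks(axes, keep=lambda task: True):
    """Build the dense grid from named axes, keeping only tasks accepted by `keep`.

    Hypothetical helper: `axes` maps an axis name to its list of values.
    """
    names = list(axes)
    tasks = []
    for combo in product(*(axes[name] for name in names)):
        task = dict(zip(names, combo))
        if keep(task):
            tasks.append(task)
    return tasks


# Sample axes, assumed for illustration only
axes = {"model": ["us", "fr", "gb"], "features": [128, 256, 512]}

# Assumed filter: skip the largest feature size for the "gb" model
tasks = generate_tasks(
    axes,
    keep=lambda t: not (t["model"] == "gb" and t["features"] == 512),
)
print(len(tasks), "tasks out of", 3 * 3)
```

A script like this could dump its task list to a parameters file, which is roughly the kind of hand-off a well-defined protocol would standardize.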
Overlap with dvc exp run
There is some overlap between this kind of parametrized loop and
Nested loop syntax
There seems to be consensus that the
Adding |
@dberenbaum, For myself, the So, if we are facing a project with even more levels of structure, then a nested loop might be an unavoidable solution. |
@dberenbaum I think And of course upvoting this feature request 👍🏻 |
Here is another upvote. Either Although I am fairly new to DVC, I keep running into limitations. I wonder if DVC might get more mileage by leveraging existing workflow systems (such as Snakemake and NextFlow) and focusing on what DVC does best, which is managing data versions. I guess I could trigger a whole Snakemake workflow as one stage of a DVC pipeline, and then specify which parameters, dependencies and outputs I want to track with DVC. I'm just wondering out loud if it might be more efficient to better integrate DVC with other workflow systems (such as automatically detecting parameters, dependencies and outputs) rather than focus on making DVC pipelines more flexible. |
This is my most wanted feature. Just adding my perspective, as I have a collection of models I want to run on a collection of datasets. If I could nest the As for whether we should use |
Just to comment here that the "nested loop" is just one way of tuning the hyperparameters automatically. Allowing multiple param files in one go might be a middle ground (#7891). |
Chiming in with support for this feature! For context, see this trickery I'm currently doing with YAML anchors to serve this same purpose, which is at risk of being broken after the 3.0 release. I think it's definitely possible that some version of

matrix:
  - [foo, bar]
  - - feature: a
      type: json
    - feature: b
      type: npz

I use maps like that a lot, so that feels pretty important. As an aside, when combined with hydra, this could possibly allow for "toggling" certain pipeline stages. In the example below, you could set

stage_name:
  matrix:
    - ${features}
    - ${is_enabled}
  do:
    ...
|
Would you mind continuing that example with the

matrix:
  - name: [foo, bar]
  - config:
      - feature: a
        type: json
      - feature: b
        type: npz

I think that will also make it easier to reference those items in the |
I like the explicit naming, though you then don't need a list at the top level:

stage_foo:
  matrix:
    name: [foo, bar]
    config:
      - feature: a
        type: json
      - feature: b
        type: npz
  do:
    cmd: echo ${item.name} ${item.config.feature}
|
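As a rough illustration of how named axes with map values could be referenced through dotted paths such as `${item.config.feature}`, here is a Python sketch. It only mimics the idea; it is not how DVC resolves interpolation, and the `expand` and `render` helpers are hypothetical.

```python
import re
from itertools import product


def expand(matrix):
    """Turn named axes into one item (a dict) per combination (hypothetical helper)."""
    names = list(matrix)
    return [
        dict(zip(names, combo))
        for combo in product(*(matrix[name] for name in names))
    ]


def render(template, item):
    """Replace ${item.a.b} placeholders by walking nested keys of `item`."""

    def lookup(match):
        value = item
        for key in match.group(1).split("."):
            value = value[key]
        return str(value)

    return re.sub(r"\$\{item\.([\w.]+)\}", lookup, template)


# Same shape as the stage_foo matrix in the comment above
matrix = {
    "name": ["foo", "bar"],
    "config": [
        {"feature": "a", "type": "json"},
        {"feature": "b", "type": "npz"},
    ],
}

commands = [
    render("echo ${item.name} ${item.config.feature}", item)
    for item in expand(matrix)
]
for command in commands:
    print(command)
```

With named axes, each generated item is a plain mapping, so nested maps need no special handling beyond walking the keys in the placeholder.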
@skshetry Do you think this could eventually replace |
I have a WIP implementation for matrix-do in #9725, which is mostly based on the comment in #5172 (comment). |
This feature request is in regards to the experimental parameterization feature here: https://github.com/iterative/dvc/wiki/Parametrization
Currently, if in params.yaml one has
we can have dvc.yaml as so
Currently, dvc.yaml can only loop over one thing, e.g. models in this case. I am wondering if something like this would be possible:
and
This would be great as one wouldn't have to specify all possible configurations in params.yaml.
@skshetry tagging you as I believe you have been working on parameterization.