task object does not work with custom Groovy functions inside of process directives #4215

Closed
stevekm opened this issue Aug 24, 2023 · 6 comments · Fixed by #4217

Comments

@stevekm
Contributor

stevekm commented Aug 24, 2023

Bug report

I am exploring some more advanced methods of Nextflow workflow and process introspection, and I am encountering difficulties in using the workflow and task implicit objects in different contexts where they seemingly should work.

Expected behavior and actual behavior

I would expect to be able to use the task object with my custom Groovy methods as process directives, since it's already commonly used with other pipeline directives. However, I get errors such as:

ERROR ~ No such variable: task

Steps to reproduce the problem

I have a demo workflow set up like this:

$ tree
.
├── lib
│   └── Utils.groovy
├── main.nf
├── modules
│   └── bazz.nf
└── nextflow.config
  • main.nf
nextflow.enable.dsl=2

// use a custom Groovy function
println Utils.customMessage("main.nf")

include { BAZZ } from './modules/bazz.nf'

workflow {
    BAZZ("Sample1")
}
  • nextflow.config
params {
    submitter = null
}

process {
    cpus = 1
    memory = 250.MB

    // NOTE: THIS DOES NOT ACTUALLY WORK EITHER, for seemingly related reasons
    // executor = 'awsbatch'
    // queue = "spot-nf-job-queue"
    resourceLabels  = {[
        fooLabel: "barValue",
        pipelineProcess: "${task.process}",
        pipelineCPUs: "${task.cpus}",
        pipelineUser: "${workflow.userName}",
        pipelineSubmitter: "${params.submitter}",
        pipelineName: "${workflow.manifest.name}"
    ]}
}

manifest {
    name            = 'workflow-introspection-demo'
    author          = 'Stephen Kelly'
    description     = 'Demo workflow script'
    mainScript      = 'main.nf'
}
  • lib/Utils.groovy
class Utils {
    public static String customMessage (String label) {
        return "customMessage-from-${label}"
    }
}
  • modules/bazz.nf
process BAZZ {
    // this shows how custom methods don't work with the `task` object inside
    // process directives, even though `task` itself does work with process directives
    tag "${task.process}.${id}.tag" // THIS WORKS
    // tag Utils.customMessage("foobarbazz") // THIS WORKS
    // tag Utils.customMessage("${task.process}") // THIS DOESN'T WORK
    // resourceLabels customLabel: Utils.customMessage("${task.process}") // THIS DOESN'T WORK
    // NOTE: usage of someVar here will result in 'null'

    input:
    val(id)

    exec:
    someVar = "foooooo"
    println ">>> BAZZ: ${id}" // `id` is accessible as expected
    println Utils.customMessage("${task.process}.${someVar}.${id}") // custom method here works with both `task` and `someVar` objects
}

The important parts here are the tag and resourceLabels process directives under the BAZZ process scope. I have listed a couple of variations in the comments there to illustrate some of the ways that things are broken.

Note also the usage of id and someVar, which are included to show further confusing discrepancies regarding the scoping of variables within the process.

Program output

When running the above workflow with tag "${task.process}.${id}.tag", it works as expected:

$ nextflow run main.nf
N E X T F L O W  ~  version 23.04.1
Launching `main.nf` [agitated_mclean] DSL2 - revision: 635ebf9986
customMessage-from-main.nf
executor >  local (1)
[e6/cf1877] process > BAZZ (BAZZ.Sample1.tag) [100%] 1 of 1 ✔
>>> BAZZ: Sample1
customMessage-from-BAZZ.foooooo.Sample1

Changing it to tag Utils.customMessage("foobarbazz") also works as expected:

[6c/16d37f] process > BAZZ (customMessage-from-foobarbazz) [100%] 1 of 1 ✔

This shows that you can use the task object in a process directive, such as tag, and you can also use a custom Groovy method's output in the process directive as well.

However, if you combine these, things break. When you change it to tag Utils.customMessage("${task.process}"), it no longer works:

$ nextflow run main.nf
N E X T F L O W  ~  version 23.04.1
Launching `main.nf` [focused_ekeblad] DSL2 - revision: 635ebf9986
customMessage-from-main.nf
ERROR ~ No such variable: task

 -- Check script './modules/bazz.nf' at line: 6 or see '.nextflow.log' file for more details

The task variable does not work here when you try to pass it to the custom Groovy method.

You get the same error when you use it with resourceLabels customLabel: Utils.customMessage("${task.process}") as well, which is ultimately the process directive I wanted to use in the first place.

It seems like something really strange is going on with the scoping of this task variable, which allows it to be used for process directives in some cases and not in others. Some more strange combinations:

  • tag {task.process} works, but tag task.process does not (ERROR ~ No such variable: task)
  • tag Utils.customMessage(task.process) doesn't work (ERROR ~ No such variable: task)
  • tag Utils.customMessage({task.process}) and tag Utils.customMessage({ "${task.process}" }) both give the error
Process 'BAZZ' has been already used -- If you need to reuse the same component, include it with a different name or include it in a different workflow context

Ultimately, the usages of the task variable for process and workflow introspection here have been really confusing and unclear. It's not clear why task works in some cases but not others. It feels like there may be some kind of "magic" happening around these variables behind the scenes that influences these behaviors. Or is it some complicated discrepancy in Groovy variable scoping and initialization?

I am not sure if this is a "bug", an oversight, or just inherent in the design of the framework. Regardless, it's very counter-intuitive that you can use e.g. tag "${task.process}" but can't use tag Utils.myMethod("${task.process}") or even tag Utils.myMethod(task), and from my experience so far this seems to apply to all (?) of the Nextflow process directives.

Ultimately, what I really want is to be able to use both task and workflow from within the nextflow.config scope for process configs, so I could have something like this:

process {
    cpus = 1
    memory = 250.MB

    executor = 'awsbatch'
    queue = "spot-nf-job-queue"
    resourceLabels  = [
        fooLabel: "barValue",
        pipelineProcess: "${task.process}",
        pipelineUser: "${workflow.userName}",
        pipelineName: "${workflow.manifest.name}"
    ]
}

But this obviously does not work either. If you cannot use task indiscriminately inside the process scope, I am not sure how you would be able to use it from the nextflow.config scope.

No matter the solution, it would be great to have more documentation on how this all works, and maybe some advanced examples.

Environment

  • Nextflow version: 23.04.1
  • Java version:
$ java -version
openjdk version "20" 2023-03-21
OpenJDK Runtime Environment Zulu20.28+85-CA (build 20+36)
OpenJDK 64-Bit Server VM Zulu20.28+85-CA (build 20+36, mixed mode, sharing)
  • Operating system: macOS, Linux
@stevekm
Contributor Author
stevekm commented Aug 24, 2023

At the risk of making this issue thread too verbose, I also wanted to note these other discrepant usages of task, particularly in relation to the resourceLabels process directive, which seems unique in that it accepts a Map object (?) instead of a bool or string as most other directives do.


With a modified pipeline that looks like this:

$ tree .
.
├── lib
│   └── Utils.groovy
├── main.nf
├── modules
│   ├── bar.nf
│   └── bazz.nf
└── nextflow.config

  • main.nf
// $ nextflow run main.nf --queue my-aws-queue -work-dir "s3://my_bucket"
nextflow.enable.dsl=2

params.resourceLabels = [fooKey:"barValue"]
include { BAR } from './modules/bar.nf'

workflow {
    BAR("Sample1")
}
  • nextflow.config
params {
    queue = null
}

aws {
    region = 'us-east-2'
    batch {
        cliPath = "/home/ec2-user/miniconda/bin/aws"
    }
}

process {
    executor = 'awsbatch'
    queue = params.queue
}

manifest {
    name            = 'workflow-introspection-demo'
    author          = 'Stephen Kelly'
    description     = 'Demo workflow script'
    mainScript      = 'main.nf'
}
  • lib/Utils.groovy
class Utils {
    public static String customMessage (String label) {
        return "customMessage-from-${label}"
    }

    public static Map customTaskLabels (nextflow.processor.TaskConfig task) {
        def newLabels = [
            pipelineCustomKey: "customValue"
        ]
        newLabels = newLabels + [pipelineProcess: task.process]
        return newLabels
    }

    public static Map customMapLabels (Map taskLabel) {
        def newLabels = [
            pipelineCustomKey: "customValue"
        ]
        newLabels = newLabels + taskLabel
        return newLabels
    }
}
  • modules/bar.nf
process BAR {
    container "ubuntu:latest"
    // resourceLabels params.resourceLabels // THIS WORKS
    // resourceLabels pipelineTask: task.process // ERROR ~ No such variable: task
    // resourceLabels pipelineTask: {task.process} //   Unable to marshall request to JSON: MarshallingType not found for class class Script_9d318d7e$_runScript_closure1$_closure2
    // resourceLabels pipelineTask: "${task.process}" // THIS WORKS
    // resourceLabels params.resourceLabels + [pipelineTask: "${task.process}"] // THIS WORKS
    // resourceLabels Utils.customTaskLabels(task) // ERROR ~ No such variable: task
    // resourceLabels Utils.customMapLabels([barProcessKey:"barProcessValue"]) // THIS WORKS
    // resourceLabels Utils.customMapLabels([barProcessKey:task.process]) // ERROR ~ No such variable: task
    resourceLabels Utils.customMapLabels([barProcessKey:"${task.process}"]) // THIS WORKS

    input:
    val(id)

    script:
    println Utils.customTaskLabels(task) // THIS WORKS
    """
    """
}

I have added two new methods here: customTaskLabels, which attempts to interact with the task object directly, and customMapLabels, which instead uses a generic Map input. Both attempt to output a new map, to be used in the resourceLabels directive.

I included some notes there about combinations that do and do not work, notably:

  • resourceLabels params.resourceLabels works, where I define the map elsewhere in the pipeline and propagate it to the process via the params object
  • resourceLabels pipelineTask: task.process does not work (ERROR ~ No such variable: task) similar to how you could not use task directly with the tag directive
  • resourceLabels pipelineTask: "${task.process}" does work, but the syntax is limiting if we had a large number of other tags we wanted to include with every process (e.g. from introspection of the workflow object as well)
  • resourceLabels params.resourceLabels + [pipelineTask: "${task.process}"] works better, where we can pre-save a large map of labels elsewhere in the pipeline and then append in the task labels inside the Nextflow process
  • resourceLabels Utils.customTaskLabels(task) does not work (ERROR ~ No such variable: task); this one is especially problematic because we likely want to perform multiple operations on task in order to return a large number of introspected attributes as labels, which we don't want to copy/paste into every Nextflow process
  • resourceLabels Utils.customMapLabels([barProcessKey:"${task.process}"]) works, but notably it's only able to access task via string interpolation, which severely limits what we can do with the task object

So this makes the whole thing even more confusing: sometimes you can access the task object directly, sometimes you can access it after wrapping it in a closure, and sometimes you can access its attributes via string interpolation, by wrapping it in a Map (or potentially array) object, or some combination of the above, depending on where in the Nextflow process scope you use it.

@bentsherman
Member

Hi @stevekm , thank you for bringing up this issue. It confused me for a long time and I'm only now understanding it as a result of studying the codebase.

The short answer is to wrap the failing expressions in a closure:

    tag { Utils.customMessage("${task.process}") }
    resourceLabels { [customLabel: Utils.customMessage("${task.process}")] }

The long answer is...

If you don't wrap the value in a closure, it will be evaluated once when the script is executed rather than each time a task is executed. The difference here is important -- executing the script only defines the process, so variables like task or task inputs aren't available yet.

If the value is a closure, it will be "lazily" evaluated each time a task is executed, so that you can use task-specific variables.

But... if the value is a dynamic string, Nextflow will wrap it in a closure when parsing the script (i.e. as a syntax transformation), so that the dynamic string is also lazily evaluated.

But... if the dynamic string is nested in something else like a function call, then it isn't wrapped in a closure.

So you can see why this syntax sugar has caused a lot of confusion over what is and isn't allowed in the process definition... because it wasn't applied comprehensively.

We might be able to fix it by wrapping the value in a closure if it contains a dynamic string, but maybe we should just document the current behavior better.
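To see the eager-vs-lazy distinction in plain Groovy, here is a standalone sketch of the mechanism (my illustration, not Nextflow's actual implementation; Nextflow binds task through the closure's delegate when the task executes):

```groovy
// The closure body is not evaluated at definition time, so `task`
// does not need to exist yet: it is resolved at call time via the delegate.
def directive = { -> "${task.process}" }

// Later, once a "task" exists, bind it as the closure's delegate and call it:
directive.delegate = [task: [process: 'BAZZ']]
directive.resolveStrategy = Closure.DELEGATE_ONLY
assert directive() == 'BAZZ'

// An eager expression fails instead, because `task` is resolved immediately:
// def broken = "${task.process}"   // groovy.lang.MissingPropertyException
```

This is why tag "${task.process}" works (the bare dynamic string gets wrapped in such a closure during parsing) while tag Utils.customMessage("${task.process}") does not (the nested dynamic string is evaluated eagerly).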

@bentsherman
Member

As for someVar, AFAIK directives simply don't have access to variables defined in the script/exec block.

@stevekm
Contributor Author

stevekm commented Aug 24, 2023

Wow, that helps a lot; somehow I missed a few things here:

  • resourceLabels Utils.customTaskLabels(task) does not work, as described before (ERROR ~ No such variable: task)
  • however, resourceLabels { Utils.customTaskLabels(task) } does work, as you describe. Wow this is great, thanks

Following this line of thought, I went back to my original goal of getting resourceLabels directives set using both the task and workflow objects, and what you describe seems to have some interesting effects there as well.

  • nextflow.config
process {
    cpus = 1
    memory = 250.MB
    executor = 'awsbatch'
    queue = params.queue
    resourceLabels = {[ pipelineProcess: task.process, pipelineMemory: task.memory.toString(), pipelineName: workflow.manifest.name ]} // THIS WORKS
    // resourceLabels = {[ pipelineProcess: "${task.process}", pipelineMemory: "${task.memory}", pipelineName: "${workflow.manifest.name}" ]} // Unable to marshall request to JSON: MarshallingType not found for class class org.codehaus.groovy.runtime.GStringImpl
}

So it looks like the reason things were originally breaking for me was that, despite using a closure, I was also using the "${task.process}" syntax inside my closures instead of task.process. This was triggering the error:

Unable to marshall request to JSON: MarshallingType not found for class class org.codehaus.groovy.runtime.GStringImpl

I am glad that simply using {task} works, but it makes the scoping here even more confusing, since I had assumed that the closure would carry a copy of its own environment at creation, which would not have any task object set. So it seems like the closure retrieves a task object from the parent environment from which it gets called? In this case, I guess, from inside the environment of the Nextflow process scope?

It's very counter-intuitive that using a closure is required here while, at the same time, string interpolation breaks the effect of using the closure. All told, it seems like there are multiple different situations where you should vs. shouldn't use string interpolation and/or closures to get the desired effect.

For advanced usages, I find it would be easier to be able to interact with the task object directly; when you use string interpolation, you are limiting yourself to only a single attribute of task that you can access. In this kind of situation, I want to be able to access all of task's attributes without having to write multiple arguments to my custom functions to access each one as a string.

As for someVar, AFAIK directives simply don't have access to variables defined in the script/exec block.

Right, this makes sense, until you try to do it and, instead of getting a variable-not-found error, the value gets silently passed in as null, which adds even more to the confusion. So now we also have situations where sometimes an unset/non-existent variable breaks the pipeline, and sometimes null gets substituted for a missing variable.

Overall I think resourceLabels = {[ pipelineProcess: task.process, ... ]} is pretty much what I wanted, so I will close this Issue. Thanks a lot.

@stevekm stevekm closed this as completed Aug 24, 2023
@stevekm
Contributor Author

stevekm commented Aug 25, 2023

Oh, one follow-up to this: you might want to make liberal use of .toString() on the objects being passed in as values for resourceLabels, otherwise you might get an error like this:

the parameter 'value' may not be null (Service: AWSBatch; Status Code: 400; Error Code: BadRequestException; Request ID: ....; Proxy: null)

@bentsherman
Member

Regarding the Unable to marshall request to JSON error, I think that is because dynamic strings technically have a different type from regular strings (GStringImpl vs String), so some Java libraries that only expect String (instead of CharSequence, which is the common parent class) fail with a GString. As you said, the solution is to cast the dynamic string with toString(), and in your example you can then skip the dynamic string entirely.
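A minimal plain-Groovy illustration of that type distinction (my sketch, not from the thread):

```groovy
def s = 'plain'          // java.lang.String
def g = "${'interp'}"    // groovy.lang.GString: a different runtime type

assert s instanceof String
assert g instanceof GString
assert !(g instanceof String)          // why String-only Java APIs reject it
assert g.toString() instanceof String  // casting resolves the marshalling error
```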

Process config settings work exactly the same way as process directives. If the value is a closure, it will be evaluated for each task, which is why you can use the task variable even in the config file.

Right, this makes sense, until you try to do this and instead of getting a variable not found error, the value gets silently passed in as null.

Good point, Nextflow should throw an error in this case.

oh one follow up to this, might want to make liberal use of .toString() on the objects being passed in as values for resourceLabels, otherwise you might get an error like this;

The main issue here is that AWS Batch does not allow the resource label value to be null. In cases where a label might be null, you might want to consider whether you want to set it to an empty string or not set it at all. With toString() you will set it to the string null, which might not be what you want.
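For instance, one way to drop null-valued labels entirely rather than stringifying them (a sketch building on the config shown earlier; the findAll/collectEntries filtering is my own suggestion, not something the thread or Nextflow prescribes):

```groovy
// nextflow.config (sketch): keep only non-null labels, then force plain Strings
process {
    resourceLabels = {
        [
            pipelineProcess: task.process,
            pipelineName:    workflow.manifest.name,
            pipelineQueue:   params.queue              // may be null
        ]
        .findAll { k, v -> v != null }                 // drop null-valued entries
        .collectEntries { k, v -> [k, v.toString()] }  // avoid GString marshalling issues
    }
}
```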
