Understanding the workflow
Parameters
So far in the tutorial, we have executed the workflow as-is. nextflow has a
built-in way of defining variables that can be tweaked at runtime, called
"parameters". See nextflow/we_demo.nf for an example:
params.proj = 'we_demo'
params.init_flags = '--t 0.10 --log-every 1' // 200 steps each
params.init_seeds = 2
params.init_steps = 200000
params.init_model = 'input/pinn/pinet-hco-adam.yml'
params.init_time = 0.5
The params.proj parameter determines where the output files will be
"published". To change a parameter while running the same workflow, simply add
the corresponding option to nextflow run; for example, adding --init_seeds 2
runs the same workflow with 2 initial seeds (used to generate the initial
dataset).
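As a rough illustration of the override mechanism: a value assigned to params.<name> in the script acts as a default, and a double-dash option of the same name on the command line takes precedence. The file name params_demo.nf and the launch lines in the comments are assumptions made for this sketch and are not part of PiNNAcLe.

```nextflow
// params_demo.nf: hypothetical stand-alone script, not part of PiNNAcLe
params.init_seeds = 2      // default, used when no option is given

workflow {
    // one item per seed, mirroring Channel.of(1..params.init_seeds) in the demo
    Channel.of(1..params.init_seeds).view { seed -> "seed $seed" }
}

// nextflow run params_demo.nf                  -> uses the default (2 seeds)
// nextflow run params_demo.nf --init_seeds 4   -> overrides it (4 seeds)
```

Options with a single dash (such as -entry) are consumed by nextflow itself, whereas double-dash options end up in params.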
Modules and inclusion
Thanks to the DSL2 syntax of nextflow, the workflows in PiNNAcLe are easy to reuse. The snippet below shows how workflow parameters are injected into included processes/workflows: the publishing directories of the sub-workflows are set, and a few parameters required by acle are supplied:
include { mol2box } from "./module/molutils.nf" addParams (publish: "$params.proj/init/pack")
include { md } from "./module/dftb.nf" addParams (publish:"$params.proj/init/md/")
include { convert as mkgeo } from "./module/tips.nf" addParams (publish:"$params.proj/init/ds")
include { convert as mkds } from "./module/tips.nf" addParams (publish:"$params.proj/init/geo")
include { acle } from "./acle.nf" addParams (
    publish: "$params.proj/acle",
    geo_size: geo_size)
Processes
Processes, such as the ones included above, are the building blocks of
nextflow scripts. As shown below, they are simply scripts with a few
input/output "channels". When the workflow runs, the scripts are executed in
the work directory, where the $variables are substituted by whatever is
supplied by the input channels, and the files matching the output patterns are
emitted on the output channels.
process aseSP {
    label "$params.ase_label"
    publishDir "$params.publish/$name"

    input:
    tuple val(name), path(calc, stageAs: 'calc.py'), path(geo), path(aux)

    output:
    tuple val(name), path("sp.xyz")

    script:
    """
    #!/usr/bin/env python
    from ase.io import read, write
    from calc import calc
    atoms = read('$geo')
    atoms.calc=calc
    calc.calculate(atoms, properties=${params.ase_props})
    write('sp.xyz', atoms)
    """
}
It should be easy to tweak the workflow behavior via the process scripts. You might also want to check the nextflow documentation to see what is possible. In addition, a few implementation patterns are followed:
- processes are labelled by the software they depend on and arranged as "modules";
- interchangeable processes in different modules share the same name (e.g. sp, md);[1]
- the input of a process is always one tuple, whose first element is intended as an identifier.[2]
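The identifier convention can be sketched as follows. The process sp_demo and the toy inputs are hypothetical and only stand in for a real labelling step; the point is that, with the identifier as the first tuple element, outputs can be matched back to their inputs with the join operator.

```nextflow
// Hypothetical stand-in for a labelling process; not part of PiNNAcLe.
process sp_demo {
    input:
    tuple val(name), val(geo)                    // one tuple, identifier first

    output:
    tuple val(name), stdout, emit: labelled      // identifier passed through unchanged

    script:
    """
    echo "label for $geo"
    """
}

workflow {
    // toy inputs: [identifier, payload]
    ch_geo = Channel.of(['gen0/geo/0', 'H2O'], ['gen0/geo/1', 'CO2'])

    sp_demo(ch_geo)

    // join matches items on their first element, i.e. the identifier
    ch_geo.join(sp_demo.out.labelled).view()
}
```

Because every process keeps the identifier in the first position, the same join works regardless of which module provided the process.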
Workflow and recipes
In DSL2, workflows are composed of processes, and they can share the same input/output structure so as to behave like a process. However, the "main" workflow to be executed has to be self-contained: it does not receive inputs, but defines the "channels" within itself and feeds them to the sub-workflows, as shown in the following snippet.
workflow entry {
    ch_inputs = Channel.fromList(inputs) \
      | combine(Channel.of(1..params.init_seeds)) \
      | map {name, tag, box, seed -> ["$name-$seed", tag, box.toFloat(), seed]} \
      | mol2box
    // ... (remainder of the workflow omitted)
}
Modules that provide a self-contained workflow are called "recipes" in
PiNNAcLe. The workflow is always named entry and is exposed under the module
name in the main workflow, i.e., the following three commands do the same
thing:
nextflow run nextflow/acle.nf
nextflow run nextflow/acle.nf -entry entry
nextflow run main.nf -entry acle
That is, the -entry option chooses which workflow to run within a workflow
file, and main.nf serves as a shortcut to call those recipes. Other than that,
there is no qualitative difference between recipes and normal modules, and both
can expose reusable sub-workflows, such as the acle workflow in the demo.
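A minimal sketch of this layout, assuming a hypothetical file recipe_demo.nf (the names double_it, params.n and the toy data are illustrative and do not come from PiNNAcLe): the module exposes a reusable sub-workflow with take/emit, plus a self-contained entry workflow that builds its own channels.

```nextflow
// recipe_demo.nf: hypothetical recipe-style module, not part of PiNNAcLe
params.n = 3

// Reusable sub-workflow: receives a channel and emits one, like a process.
workflow double_it {
    take:
    ch_in

    main:
    ch_out = ch_in.map { name, x -> [name, x * 2] }

    emit:
    ch_out
}

// Self-contained workflow: takes no inputs and defines its own channels.
workflow entry {
    ch = Channel.of(1..params.n).map { i -> ["item-$i", i] }
    double_it(ch)
    double_it.out.ch_out.view()
}

// Launch the self-contained workflow explicitly with:
//   nextflow run recipe_demo.nf -entry entry
```

Another script could include double_it from this file and feed it its own channel, which is what makes the sub-workflow reusable independently of the entry workflow.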
[1] This also applies to workflows that share a common input/output pattern. It allows complex workflows to be composed in an implementation-agnostic fashion; for instance, the reference (params.ref) and the machine learning potential (params.mlp) can be swapped for any compatible module.
[2] This ensures that all outputs are trackable by their identifier in a complex workflow. For instance, in the acle workflow, the labelled points are identified as gen/geo/idx; this information is used to map them back to the sampled trajectory when evaluating the performance of the model.