Scientific simulations are often complex beasts: each step of a simulation takes some input data, runs some code on it, and produces some output. Together, the steps form a complex, intricate workflow that can be difficult to deploy, even harder to maintain, and downright impossible for a third party to reproduce.
This is why making this process easier is often a good time investment: it requires thinking logically about a workflow and splitting it into simple steps linked only by input and output data. This alone helps structure a workflow so that it's easier to add, remove or change simulation steps. Workflow management software formalizes this process by letting you define a workflow graph. It also tracks data dependencies, re-runs the steps that need it when input data changes, and allows the configuration of parameter spaces.
There are a good number of workflow management programs designed for scientific computation. Some run as a complex server process that contains a live description of a workflow. In my experience, deploying these systems is not worth the time investment. Instead, I recommend using a rule-based tool like GNU Make (i.e. `Makefile`s) or Snakemake, which runs in Python and is heavily inspired by GNU Make. While it has its own faults, I have found it quite useful for running complex simulations.
Rule-based workflow
In order to satisfy reproducibility requirements for a given scientific study, there must be traceability of how each output (figure, table, etc.) was generated: which code created the output, what parameters were used, which intermediate outputs were processed, etc.
All this can be done with rules, which describe how an output is created from a given set of inputs. A rule can be thought of as a "step" of a simulation pipeline, and rules can be chained together and combined, forming a directed acyclic graph. This allows two things:
- Traceability: following the graph makes it possible to find the inputs (data and code) that were used to generate an output, which is a necessary condition for reproducibility.
- Update of outputs: if an input changes, it is easy to find the rules that need executing to update the outputs to reflect the changes.
These two features together provide a solid step towards reproducible simulation work.
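The update mechanism can be sketched in a few lines of Python. This is a toy model for illustration only (no real tool implements it this way): rules map inputs to outputs, and a changed file marks every downstream rule as needing to re-run.

```python
# Toy rule graph: each rule lists its inputs and outputs.
# Rule names and file names are made up for this sketch.
rules = [
    {"name": "filter", "inputs": ["raw.dat"], "outputs": ["filtered.dat"]},
    {"name": "plot", "inputs": ["filtered.dat"], "outputs": ["figure.png"]},
]

def stale_rules(changed_file):
    """Return names of rules that must re-run after changed_file is modified."""
    dirty = {changed_file}
    to_run = []
    for rule in rules:  # rules are assumed to be topologically ordered
        if dirty & set(rule["inputs"]):
            to_run.append(rule["name"])
            dirty |= set(rule["outputs"])  # its outputs become dirty downstream
    return to_run

print(stale_rules("raw.dat"))  # → ['filter', 'plot']
```

Changing `raw.dat` invalidates both rules, while changing `filtered.dat` would only invalidate the `plot` rule; this is exactly the "update of outputs" property above.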
Snakemake
Snakemake is a tool written in Python to manage rule-based workflows. The workflow definition is a rather simple text file (usually a `Snakefile`), which typically looks like:
```python
rule list_groups_with_users:
    input:
        "/etc/group"
    output:
        "groups_with_users.txt",  # file which contains only groups with users
    shell:
        """cat {input} | awk -F ':' '$4 != "" {{ print $1,$4; }}' > {output}"""

rule sort_group_names:
    input:
        rules.list_groups_with_users.output[0]
    output:
        "sorted_groups.txt",  # sorted file with group name and user
        "only_users.txt",  # only contains the user names
    shell:
        "sort < {input[0]} | tee {output[0]} | cut -d ' ' -f 2 > {output[1]}"

rule filter_by_letter:
    input:
        rules.list_groups_with_users.output[0]
    output:
        "start_with_letter_{letter}.txt",  # only groups starting with a letter
    shell:
        "grep '^{wildcards.letter}' < {input} > {output}"
```
This example filters the file `/etc/group` (which lists all groups on a Linux system) and writes three files. The first contains the group names and users (created by the first rule). The second rule then creates a sorted version of that file and a file with the user names only. This rather pointless application shows that it is possible to chain rule inputs and outputs, and to have multiple outputs.
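The shell pipelines in these rules can be tried on their own, outside Snakemake. Here is a sketch using a small sample file in `/etc/group` format (the file names and contents are made up for the demonstration):

```shell
# Hypothetical sample in /etc/group format (name:password:GID:user_list)
printf 'wheel:x:10:alice,bob\naudio:x:63:\nvideo:x:39:carol\n' > group_sample.txt

# Same pipeline as the first rule: keep groups whose user list (field 4) is non-empty
awk -F ':' '$4 != "" { print $1,$4; }' group_sample.txt > groups_with_users.txt

# Same pipeline as the second rule: sorted pairs, plus a user-names-only file
sort < groups_with_users.txt | tee sorted_groups.txt | cut -d ' ' -f 2 > only_users.txt
cat only_users.txt
# prints:
# carol
# alice,bob
```

The `audio` group is dropped because it has no users; the other two groups survive both steps.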
Executing the workflow with the command `snakemake only_users.txt` (to tell it to generate the `only_users.txt` file) should execute both rules, with an output similar to:
```text
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 20
Rules claiming more threads will be scaled down.
Job stats:
job                       count
----------------------  -------
list_groups_with_users        1
sort_group_names              1
total                         2

Select jobs to execute...
Execute 1 jobs...

[Wed Dec 11 14:56:49 2024]
localrule list_groups_with_users:
    input: /etc/group
    output: groups_with_users.txt
    jobid: 1
    reason: Missing output files: groups_with_users.txt
    resources: tmpdir=/tmp

[Wed Dec 11 14:56:49 2024]
Finished job 1.
1 of 2 steps (50%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Dec 11 14:56:49 2024]
localrule sort_group_names:
    input: groups_with_users.txt
    output: sorted_groups.txt, only_users.txt
    jobid: 0
    reason: Missing output files: only_users.txt; Input files updated by another job: groups_with_users.txt
    resources: tmpdir=/tmp

[Wed Dec 11 14:56:49 2024]
Finished job 0.
2 of 2 steps (100%) done
Complete log: .snakemake/log/2024-12-11T145649.039189.snakemake.log
```
Removing `only_users.txt` and running `snakemake only_users.txt` again should only re-run the last step.
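A simplified, Make-style version of this "is the output up to date?" decision compares file modification times (Snakemake's real logic is richer, also tracking code and parameter changes; the file names below are made up for the demo):

```python
import os

def needs_rebuild(output_path, input_paths):
    """Simplified make-style staleness check: rebuild if the output is
    missing or older than any of its inputs."""
    if not os.path.exists(output_path):
        return True
    out_mtime = os.path.getmtime(output_path)
    return any(os.path.getmtime(p) > out_mtime for p in input_paths)

# Demo with two files and forced timestamps
with open("in.txt", "w") as f:
    f.write("data")
with open("out.txt", "w") as f:
    f.write("result")
os.utime("in.txt", (1000, 1000))   # input older...
os.utime("out.txt", (2000, 2000))  # ...than output: up to date
print(needs_rebuild("out.txt", ["in.txt"]))  # → False
os.utime("in.txt", (3000, 3000))   # "touch" the input
print(needs_rebuild("out.txt", ["in.txt"]))  # → True
```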
The rule syntax is rather straightforward: each rule has a list of inputs and outputs (which are numbered from 0 to N by default, and can be named). The `shell` directive specifies that we want to run a shell command; this is the most flexible option. Alternatively, one can use the `run` directive and write inline Python code directly in the `Snakefile`; the `script` directive, which specifies the name of a Python (or other language) script to be run (Snakemake creates a context for this script which allows it to access the input and output objects); or finally the `notebook` directive, similar to the `script` directive, for which Snakemake allows interactive execution (useful for post-processing/data exploration).
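For illustration, here is what the `run` and `script` directives might look like in practice (the rule names and the script path are made up for this sketch):

```snakemake
rule count_groups:
    input:
        "sorted_groups.txt"
    output:
        "summary.txt"
    run:
        # Inline Python: `input` and `output` are available as objects
        with open(input[0]) as f, open(output[0], "w") as out:
            out.write(f"{sum(1 for _ in f)} groups\n")

rule make_report:
    input:
        "sorted_groups.txt"
    output:
        "report.txt"
    script:
        # Hypothetical script; inside it, Snakemake exposes a `snakemake`
        # object with `snakemake.input` and `snakemake.output`
        "scripts/report.py"
```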
Reading the documentation is highly recommended. Although the examples are often biology-oriented, the features they demonstrate are easily transposed to a mechanics environment.
Here is a list of useful features:
- Wildcards allow specifying parameter values through file names. In the example above, running `snakemake start_with_letter_m.txt` will replace `{wildcards.letter}` in the `shell` directive by `m`. This is very useful to distinguish output files based on parameter values. Multiple wildcards can be used in the same rule.
- Expansion allows specifying a range of values for a wildcard. This is useful to explore a parametric space, or to aggregate the data of several values of one wildcard.
- Rule parameters allow one to specify additional parameters (i.e. non-file inputs) to rules.
- Rule dependencies allow using the output of a rule as input to another without having to specify the name.
- Parameter space exploration.
- Command line arguments
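As an aside, the idea behind expansion can be imitated in plain Python. This sketch is not Snakemake's actual `expand()` implementation, just an illustration of what it computes:

```python
from itertools import product

def expand_like(pattern, **values):
    """Rough imitation of Snakemake's expand(): fill each wildcard in the
    pattern with every combination of the given values."""
    keys = list(values)
    combos = product(*(values[k] for k in keys))
    return [pattern.format(**dict(zip(keys, combo))) for combo in combos]

print(expand_like("start_with_letter_{letter}.txt", letter=["a", "b"]))
# → ['start_with_letter_a.txt', 'start_with_letter_b.txt']
```

Using such a list as a rule's input is how one wildcard's values get aggregated into a single downstream output.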
GNU Make
Make is a program specifically designed to be a build system, i.e. a tool that coordinates the compilation of a program's source code so that an executable or library can be built. Each file of the build process is called a target and is the output of some rule. Although its primary purpose is building software, it can easily be made to manage the outputs of simulations. While it has the advantage of being installed on virtually every Linux machine used for scientific work, it lacks some features (most notably integration with queue systems), which makes it practical only for small cases (although I am sure some shortcomings could be solved with a strong knowledge of Make).
For reference, here is a `Makefile` defining the same rules as the Snakemake example above.
```make
.DELETE_ON_ERROR: # forces make to remove targets of failed rules

# One input, one output
groups_with_users.txt: /etc/group
	cat $< | awk -F ':' '$$4 != "" { print $$1,$$4; }' > $@

# Multiple outputs with grouped targets
sorted_groups.txt only_users.txt &: groups_with_users.txt
	sort < $< | tee sorted_groups.txt | cut -d ' ' -f 2 > only_users.txt

# Rule with pattern
start_with_letter_%.txt: groups_with_users.txt
	grep '^$*' < $< > $@
```
Here are documentation pages for interesting features used in the example:
- Rule syntax
- Wildcards, which are semantically different from wildcards in Snakemake
- Pattern rules, which correspond to wildcards in Snakemake
- Automatic variables, which correspond to the symbols `$@`, `$<` and `$*` in the example
- Grouped targets
One important advantage of Snakemake is the ability to define arbitrarily many wildcards (Make's patterns) in a single rule. Multi-pattern rules do not directly exist in Make. Emulating this feature is cumbersome.
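One common workaround is to generate a concrete rule for each parameter combination with `$(foreach)` and `$(eval)`. Here is a sketch (the target naming scheme and parameters are made up), which illustrates why this approach is more cumbersome than Snakemake's wildcards:

```make
# Emulate a two-parameter rule by generating one concrete rule
# per (letter, count) combination.
LETTERS := a b
COUNTS  := 1 2

define letter_count_rule
start_with_$(1)_top$(2).txt: groups_with_users.txt
	grep '^$(1)' $$< | head -n $(2) > $$@
endef

$(foreach l,$(LETTERS),$(foreach c,$(COUNTS),\
  $(eval $(call letter_count_rule,$(l),$(c)))))
```

Note the `$$<` and `$$@`: automatic variables must be doubly escaped so they survive the `$(eval)` expansion, which is exactly the kind of subtlety that makes this pattern hard to maintain.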