Scientific simulations are often complex beasts: each step of a simulation takes some input data, runs some code on it, and produces some output. Together, the steps form a complex, intricate workflow that can be difficult to deploy, even harder to maintain, and downright impossible for a third party to reproduce.
This is why making this process easier is often a good time investment: it requires thinking logically about a workflow and splitting it into simple steps linked only by input and output data. This alone helps structure a workflow so that it's easier to add, remove or change simulation steps. Workflow management software formalizes this process by letting you define a workflow graph. It also tracks data dependencies, re-runs the steps that need it when input data changes, and allows the configuration of parameter spaces.
There are a good number of workflow management programs designed for scientific computation. Some run as a complex server process that contains a live description of a workflow. In my experience, deploying these systems is not worth the time investment. Instead, I recommend using a rule-based tool like GNU Make (i.e. `Makefile`s) or Snakemake, which runs in Python and is heavily inspired by GNU Make. While it has its own faults, I have found it quite useful for running complex simulations.
Rule-based workflow
In order to satisfy reproducibility requirements for a given scientific study, there must be traceability of how each output (figure, table, etc.) was generated: which code created the output, what parameters were used, which intermediate outputs were processed, etc.
All this can be done with rules, which describe how an output is created from a given set of inputs. A rule can be thought of as a "step" of a simulation pipeline, and rules can be chained together and combined, forming a directed acyclic graph. This allows two things:
- Traceability: following the graph makes it possible to find the inputs (data and code) that were used to generate an output, which is a necessary condition for reproducibility.
- Update of outputs: if an input changes, it is easy to find the rules that need executing to update the outputs to reflect the changes.
These two features together provide a solid step towards reproducible simulation work.
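The update mechanism can be sketched in a few lines of Python. This is a toy model for illustration only (no real tool implements it this way): rules map inputs to outputs, and a changed file marks every downstream rule as needing to re-run.

```python
# Toy rule graph: each rule lists its inputs and outputs.
# Rule names and file names are made up for this sketch.
rules = [
    {"name": "filter", "inputs": ["raw.dat"], "outputs": ["filtered.dat"]},
    {"name": "plot", "inputs": ["filtered.dat"], "outputs": ["figure.png"]},
]

def stale_rules(changed_file):
    """Return names of rules that must re-run after changed_file is modified."""
    dirty = {changed_file}
    to_run = []
    for rule in rules:  # rules are assumed to be topologically ordered
        if dirty & set(rule["inputs"]):
            to_run.append(rule["name"])
            dirty |= set(rule["outputs"])  # its outputs become dirty downstream
    return to_run

print(stale_rules("raw.dat"))  # → ['filter', 'plot']
```

Changing `raw.dat` invalidates both rules, while changing `filtered.dat` would only invalidate the `plot` rule; this is exactly the "update of outputs" property above.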
Snakemake
Snakemake is a tool written in Python to manage rule-based workflows. The workflow definition is a rather simple text file (usually a `Snakefile`), which typically looks like:
```python
rule list_groups_with_users:
    input:
        "/etc/group"
    output:
        "groups_with_users.txt",  # file which contains only groups with users
    shell:
        """cat {input} | awk -F ':' '$4 != "" {{ print $1,$4; }}' > {output}"""

rule sort_group_names:
    input:
        rules.list_groups_with_users.output[0]
    output:
        "sorted_groups.txt",  # sorted file with group name and user
        "only_users.txt",  # only contains the user names
    shell:
        "sort < {input[0]} | tee {output[0]} | cut -d ' ' -f 2 > {output[1]}"

rule filter_by_letter:
    input:
        rules.list_groups_with_users.output[0]
    output:
        "start_with_letter_{letter}.txt",  # only groups starting with a letter
    shell:
        "grep '^{wildcards.letter}' < {input} > {output}"
```
This example filters the file `/etc/group` (which lists all groups on a Linux system) and writes three files. The first contains the group names and users (created by the first rule). The second rule then creates a sorted version of that file and a file with the user names only. This rather pointless application shows that it is possible to chain rule inputs and outputs, and to have multiple outputs.
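The shell pipelines in these rules can be tried on their own, outside Snakemake. Here is a sketch using a small sample file in `/etc/group` format (the file names and contents are made up for the demonstration):

```shell
# Hypothetical sample in /etc/group format (name:password:GID:user_list)
printf 'wheel:x:10:alice,bob\naudio:x:63:\nvideo:x:39:carol\n' > group_sample.txt

# Same pipeline as the first rule: keep groups whose user list (field 4) is non-empty
awk -F ':' '$4 != "" { print $1,$4; }' group_sample.txt > groups_with_users.txt

# Same pipeline as the second rule: sorted pairs, plus a user-names-only file
sort < groups_with_users.txt | tee sorted_groups.txt | cut -d ' ' -f 2 > only_users.txt
cat only_users.txt
# prints:
# carol
# alice,bob
```

The `audio` group is dropped because it has no users; the other two groups survive both steps.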
Executing the workflow with the command `snakemake only_users.txt` (to tell it to generate the `only_users.txt` file) should execute both rules, with an output similar to:
```text
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 20
Rules claiming more threads will be scaled down.
Job stats:
job                       count
----------------------  -------
list_groups_with_users        1
sort_group_names              1
total                         2

Select jobs to execute...
Execute 1 jobs...

[Wed Dec 11 14:56:49 2024]
localrule list_groups_with_users:
    input: /etc/group
    output: groups_with_users.txt
    jobid: 1
    reason: Missing output files: groups_with_users.txt
    resources: tmpdir=/tmp

[Wed Dec 11 14:56:49 2024]
Finished job 1.
1 of 2 steps (50%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Dec 11 14:56:49 2024]
localrule sort_group_names:
    input: groups_with_users.txt
    output: sorted_groups.txt, only_users.txt
    jobid: 0
    reason: Missing output files: only_users.txt; Input files updated by another job: groups_with_users.txt
    resources: tmpdir=/tmp

[Wed Dec 11 14:56:49 2024]
Finished job 0.
2 of 2 steps (100%) done
Complete log: .snakemake/log/2024-12-11T145649.039189.snakemake.log
```
Removing `only_users.txt` and running `snakemake only_users.txt` again should only re-run the last step.
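A simplified, Make-style version of this "is the output up to date?" decision compares file modification times (Snakemake's real logic is richer, also tracking code and parameter changes; the file names below are made up for the demo):

```python
import os

def needs_rebuild(output_path, input_paths):
    """Simplified make-style staleness check: rebuild if the output is
    missing or older than any of its inputs."""
    if not os.path.exists(output_path):
        return True
    out_mtime = os.path.getmtime(output_path)
    return any(os.path.getmtime(p) > out_mtime for p in input_paths)

# Demo with two files and forced timestamps
with open("in.txt", "w") as f:
    f.write("data")
with open("out.txt", "w") as f:
    f.write("result")
os.utime("in.txt", (1000, 1000))   # input older...
os.utime("out.txt", (2000, 2000))  # ...than output: up to date
print(needs_rebuild("out.txt", ["in.txt"]))  # → False
os.utime("in.txt", (3000, 3000))   # "touch" the input
print(needs_rebuild("out.txt", ["in.txt"]))  # → True
```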
The rule syntax is rather straightforward: each rule has a list of inputs and outputs (which are numbered from 0 to N by default, and can be named). The `shell` directive specifies that we want to run a shell command; this is the most flexible option. Alternatively, one can use the `run` directive and write inline Python code directly in the `Snakefile`; the `script` directive, which specifies the name of a Python (or other language) script to be run (Snakemake creates a context for this script which allows it to access the input and output objects); or finally the `notebook` directive, similar to the `script` directive, for which Snakemake allows interactive execution (useful for post-processing/data exploration).
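For illustration, here is what the `run` and `script` directives might look like in practice (the rule names and the script path are made up for this sketch):

```snakemake
rule count_groups:
    input:
        "sorted_groups.txt"
    output:
        "summary.txt"
    run:
        # Inline Python: `input` and `output` are available as objects
        with open(input[0]) as f, open(output[0], "w") as out:
            out.write(f"{sum(1 for _ in f)} groups\n")

rule make_report:
    input:
        "sorted_groups.txt"
    output:
        "report.txt"
    script:
        # Hypothetical script; inside it, Snakemake exposes a `snakemake`
        # object with `snakemake.input` and `snakemake.output`
        "scripts/report.py"
```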
Reading the documentation is highly recommended. Although the examples are often biology-oriented, the features they demonstrate are easily transposed to a mechanics environment.
Here is a list of useful features:
- Wildcards allow specifying parameter values through file names. In the example above, running `snakemake start_with_letter_m.txt` will replace `{wildcards.letter}` in the `shell` directive by `m`. This is very useful to distinguish output files based on parameter values. Multiple wildcards can be used in the same rule.
- Expansion allows specifying a range of values for a wildcard. This is useful to explore a parametric space, or to aggregate the data of several values of one wildcard.
- Rule parameters allow one to specify additional parameters (i.e. non-file inputs) to rules.
- Rule dependencies allow using the output of a rule as input to another without having to specify the name.
- Parameter space exploration.
- Command line arguments
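As an aside, the idea behind expansion can be imitated in plain Python. This sketch is not Snakemake's actual `expand()` implementation, just an illustration of what it computes:

```python
from itertools import product

def expand_like(pattern, **values):
    """Rough imitation of Snakemake's expand(): fill each wildcard in the
    pattern with every combination of the given values."""
    keys = list(values)
    combos = product(*(values[k] for k in keys))
    return [pattern.format(**dict(zip(keys, combo))) for combo in combos]

print(expand_like("start_with_letter_{letter}.txt", letter=["a", "b"]))
# → ['start_with_letter_a.txt', 'start_with_letter_b.txt']
```

Using such a list as a rule's input is how one wildcard's values get aggregated into a single downstream output.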
GNU Make
Make is a program specifically designed to be a build system, i.e. a tool that coordinates the compilation of a program's source code so that an executable or library can be built. Each file of the build process is called a target and is the output of some rule. Although its primary purpose is building software, it can easily be made to manage the outputs of simulations. While it has the advantage of being installed on virtually every Linux machine used for scientific work, it lacks some features (most notably integration with queue systems), which makes it practical only for small cases (although I am sure some shortcomings could be solved with a strong knowledge of Make).
For reference, here is a `Makefile` defining the same rules as the Snakemake example above.
```make
.DELETE_ON_ERROR: # forces make to remove targets of failed rules

# One input, one output
groups_with_users.txt: /etc/group
	cat $< | awk -F ':' '$$4 != "" { print $$1,$$4; }' > $@

# Multiple outputs with grouped targets
sorted_groups.txt only_users.txt &: groups_with_users.txt
	sort < $< | tee sorted_groups.txt | cut -d ' ' -f 2 > only_users.txt

# Rule with pattern
start_with_letter_%.txt: groups_with_users.txt
	grep '^$*' < $< > $@
```
Here are documentation pages for interesting features used in the example:
- Rule syntax
- Wildcards, which are semantically different from wildcards in Snakemake
- Pattern rules, which correspond to wildcards in Snakemake
- Automatic variables, which correspond to the symbols `$@`, `$<` and `$*` in the example
- Grouped targets
One important advantage of Snakemake is the ability to define arbitrarily many wildcards (Make's patterns) in a single rule. Multi-pattern rules do not directly exist in Make. Emulating this feature is cumbersome.
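One common workaround is to generate a concrete rule for each parameter combination with `$(foreach)` and `$(eval)`. Here is a sketch (the target naming scheme and parameters are made up), which illustrates why this approach is more cumbersome than Snakemake's wildcards:

```make
# Emulate a two-parameter rule by generating one concrete rule
# per (letter, count) combination.
LETTERS := a b
COUNTS  := 1 2

define letter_count_rule
start_with_$(1)_top$(2).txt: groups_with_users.txt
	grep '^$(1)' $$< | head -n $(2) > $$@
endef

$(foreach l,$(LETTERS),$(foreach c,$(COUNTS),\
  $(eval $(call letter_count_rule,$(l),$(c)))))
```

Note the `$$<` and `$$@`: automatic variables must be doubly escaped so they survive the `$(eval)` expansion, which is exactly the kind of subtlety that makes this pattern hard to maintain.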