added draft snakemake tutorial

Lucas Frérot 2026-01-13 11:23:00 +01:00
parent 9dab927f35
commit 648cf8d4bd
No known key found for this signature in database
GPG Key ID: 03B54A50E3FBA7E8
2 changed files with 99 additions and 105 deletions

Snakemake-Tutorial.md Normal file

@ -0,0 +1,79 @@
# Snakemake tutorial: free-falling particle
## The computation
We'll study the equation for the velocity of a free-falling particle with
viscous friction, in dimensionless form:
$$ \frac{\mathrm d v}{\mathrm dt} = 1 - \eta v,\qquad v(0) = 0 $$
We will vary the physical parameter $\eta$, the time step $\Delta t$ for the
numerical integration scheme and the total simulated time (which, together with
$\Delta t$, sets the number of time steps).
The following function solves the ODE numerically:
```python
def solve_free_fall(eta, dt, total_time):
    ...
```
It returns a single two-row array: the first row is time and the second row is velocity.
Save the above function in a `simulation.py` file.
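The body of `solve_free_fall` is left out above. As a minimal sketch, one possible implementation is shown here, assuming an explicit (forward) Euler scheme and NumPy arrays; neither choice is prescribed by this tutorial, so treat it as an illustration rather than the reference implementation.

```python
import numpy as np


def solve_free_fall(eta, dt, total_time):
    """Integrate dv/dt = 1 - eta * v, v(0) = 0, with forward Euler (assumed scheme)."""
    # Number of time steps needed to reach total_time
    n_steps = int(round(total_time / dt))

    time = dt * np.arange(n_steps + 1)
    velocity = np.zeros(n_steps + 1)  # velocity[0] = 0 is the initial condition

    for i in range(n_steps):
        # Forward Euler update: v_{i+1} = v_i + dt * (1 - eta * v_i)
        velocity[i + 1] = velocity[i] + dt * (1 - eta * velocity[i])

    # First row is time, second row is velocity
    return np.vstack((time, velocity))
```

For a long enough `total_time`, the last velocity value should approach the terminal velocity $1/\eta$, which is the steady state of the equation above.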
## Steps
We want to do three things for our tutorial study:
- For any value of `eta`, `dt` and `total_time`, run the computation and save the
output data (we'll call it the *simulation* step)
- For any computation, plot the velocity as a function of time and save the plot
file (the *plot* step)
- For a chosen range of `eta`, plot the terminal velocity as a function of `eta`
(the *aggregation* step)
Each of these three things constitutes a workflow step, which means we define a
rule for each. Each will also have its own script.
Since the first two steps do not specify which values are used for `eta`, `dt`
and `total_time`, these three parameters will be **wildcards**, i.e. values that
will be specified by the user when an output is requested (e.g. the user wants
the time-velocity plot for `eta=0.1`, `dt=1e-3`, `total_time=20`).
## Rules
Snakemake rules (defined in a `Snakefile` text file) typically define three
things for a step:
1. The input files (optional)
2. The output files
3. What needs to be executed to generate the output (e.g. a script or shell command)
The first *simulation* step does not require any input other than the wildcard
parameter values, so we can write its rule in the `Snakefile` as:
```
rule simulation:
    output:
        "free_fall,eta={eta},dt={dt},total_time={total_time}.txt"
    script:
        "simulation.py"
```
Here the output file is defined as
`"free_fall,eta={eta},dt={dt},total_time={total_time}.txt"`. The `{...}` syntax
indicates a **wildcard** parameter (here `eta`, `total_time` and `dt`), but the
rest of the output path is free (e.g. the `eta=` part could be omitted).
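To make the wildcard mechanism more concrete, here is a rough sketch of how a requested file name can be matched against an output pattern. This is not Snakemake's actual code: `pattern_to_regex` and `match_wildcards` are hypothetical helpers, and the sketch assumes each wildcard matches the regular expression `.+`, which is Snakemake's default when no wildcard constraint is given.

```python
import re


def pattern_to_regex(pattern):
    """Turn an output pattern with {name} wildcards into a compiled regex."""
    # re.split with a capturing group alternates: literal, name, literal, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)  # literal text of the path
        else:
            regex += f"(?P<{part}>.+)"  # wildcard: assumed to match '.+'
    return re.compile(regex)


def match_wildcards(pattern, filename):
    """Return the wildcard values as a dict of strings, or None if no match."""
    match = pattern_to_regex(pattern).fullmatch(filename)
    return match.groupdict() if match else None


pattern = "free_fall,eta={eta},dt={dt},total_time={total_time}.txt"
wildcards = match_wildcards(pattern, "free_fall,eta=0.5,dt=1e-3,total_time=20.txt")
print(wildcards)
```

Note that the matched values are strings: a script that needs numbers has to convert them itself (e.g. with `float`).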
One can request the output file by running:
```
snakemake -j1 free_fall,eta=0.5,dt=1e-3,total_time=20.txt
```
This command will execute the `simulation` rule with the wildcards replaced by
the specified values. Since `simulation.py` only contains a function, the script
runs without writing anything, and Snakemake reports that the expected output
file is missing. Let's see how to retrieve the wildcard values and output paths
from the script file.
### Snakemake context

@ -1,6 +1,13 @@
Scientific simulations are often complex beasts: each step of a simulation requires some input data, a code to run on this input, and produces some output. Steps together form a complex, intricate workflow that can be difficult to deploy, even harder to maintain, and downright impossible to be reproduced by a third party. Scientific simulations are often complex beasts: each step of a simulation requires some input data, a code to run on this input, and produces some output. Steps together form a complex, intricate workflow that can be difficult to deploy, even harder to maintain, and downright impossible to be reproduced by a third party.
This is why making this process easier is often a good time investment: it often requires thinking logically about a workflow, to split it into simple steps linked only by input and output data. This alone helps structure a workflow so that it's easier to add, remove or change simulation steps. Workflow management software helps to define a workflow graph and formalizes this process. It also allows tracking data dependency, re-run steps that require running when input data changes, and allows the configuration of parameter spaces. This is why making this process easier is often a good time investment: it
typically requires thinking logically about a workflow, to split it into simple
steps linked only by input and output data. This alone helps structure a
workflow so that it's easier to add, remove or change simulation steps. Workflow
management software helps to define a workflow graph and formalizes this
process. It also allows tracking data dependency, re-run steps that require
running when input data changes, and allows the configuration of parameter
spaces (for parameter studies).
There are a good number of workflow management programs designed for scientific There are a good number of workflow management programs designed for scientific
computation. Some run as a complex server process that contain a live computation. Some run as a complex server process that contain a live
@ -14,124 +21,32 @@ useful to run complex simulations.
# Rule-based workflow # Rule-based workflow
In order to satisfy reproducibility requirements for a given scientific study, In order to satisfy reproducibility requirements for a given scientific study,
there must be a traceability of how each output (figure, table, etc.) was there must be traceability of how each output (figure, table, etc.) was
generated: which code created the output, what parameters were used, which generated: which code created the output, what parameters were used, which
intermediate output was processed, etc. intermediate output was processed, etc.
All this can be done with **rules**, which explain how, given a set of inputs, All this can be done with **rules**, which explain how, given a set of inputs,
an output is created. A rule can be thought of as a "step" of a simulation an output is created. A rule can be thought of as a "step" of a simulation
pipeline, and rules can be chained together and combined, forming an *directed pipeline, and rules can be chained together and combined. This allows two things:
acyclic graph*. This allows two things:
- Traceability: following the graph allows to find the inputs (data *and* code) - Traceability: following the graph allows to find the inputs (data *and* code)
that were used to generate an output, which is a necessary condition for that were used to generate an output, which is a necessary condition for
reproducibility. reproducibility.
- Update of outputs: if an input changes, it is easy to find the rules that need - Update of outputs: if an input changes, it is easy to find the rules that need
executing to update the outputs to reflect the changes. executing to update the outputs and reflect the changes.
These two features together provide a solid step towards reproducible simulation These two features together provide a solid step towards a reproducible simulation
work. workflow.
# Snakemake # Snakemake
Snakemake is a tool written in Python to managed rule-based workflows. The Snakemake is a tool written in Python to manage rule-based workflows. The
workflow definition is a rather simple text file (usually a `Snakefile`), which workflow definition is a rather simple text file (usually a `Snakefile`) which
typically looks like: contains the rule definitions.
```python To illustrate how to set up a Snakemake workflow, you can follow the
rule list_groups_with_users: [tutorial](Snakemake-Tutorial.md), where you'll set up a Python simulation of a
input: free-falling particle with viscous friction, a plot of its trajectory as
"/etc/group" a function of time and a plot of the terminal velocity as a function of the viscous friction.
output:
"groups_with_users.txt", # file which contains only groups with users
shell:
"""cat {input} | awk -F ':' '$4 != "" {{ print $1,$4; }}' > {output} """
rule sort_group_names:
input:
rules.list_groups_with_users.output[0]
output:
"sorted_groups.txt", # sorted file with group name and user
"only_users.txt", # only contains the user names
shell:
"sort < {input[0]} | tee {output[0]} | cut -d ' ' -f 2 > {output[1]}"
rule filter_by_letter:
input:
rules.list_groups_with_users.output[0]
output:
"start_with_letter_{letter}.txt", # only groups starting with a letter
shell:
"grep '^{wildcards.letter}' < {input} > {output}"
```
> This example filters the file /etc/group (which contains all groups on a linux
> system) and writes to three files. The first has the group name and users
> (created by the first rule). Then the second rule creates a sorted file and a
> file with the user names only. This rather pointless application shows that it
> is possible to chain rule inputs and outputs, and to have multiple outputs.
Executing the workflow with the command `snakemake only_users.txt` (to tell it
to generate the `only_users.txt` file) should execute both rules, with an output
similar to:
```
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 20
Rules claiming more threads will be scaled down.
Job stats:
job count
---------------------- -------
list_groups_with_users 1
sort_group_names 1
total 2
Select jobs to execute...
Execute 1 jobs...
[Wed Dec 11 14:56:49 2024]
localrule list_groups_with_users:
input: /etc/group
output: groups_with_users.txt
jobid: 1
reason: Missing output files: groups_with_users.txt
resources: tmpdir=/tmp
[Wed Dec 11 14:56:49 2024]
Finished job 1.
1 of 2 steps (50%) done
Select jobs to execute...
Execute 1 jobs...
[Wed Dec 11 14:56:49 2024]
localrule sort_group_names:
input: groups_with_users.txt
output: sorted_groups.txt, only_users.txt
jobid: 0
reason: Missing output files: only_users.txt; Input files updated by another job: groups_with_users.txt
resources: tmpdir=/tmp
[Wed Dec 11 14:56:49 2024]
Finished job 0.
2 of 2 steps (100%) done
Complete log: .snakemake/log/2024-12-11T145649.039189.snakemake.log
```
Removing `only_users.txt` and running `snakemake only_users.txt` should only
re-run the last step.
The rule syntax is rather straight-forward: each rule has a list of inputs and
outputs (which are numbered from `0` to `N` by default, and can be named). The
`shell` directive specifies that we want to run a shell command. This is the
most flexible option. Alternatively, one can use the `run` directive and write
inline python code directly in the `Snakefile`, the `script` directive, which
specifies the name of a Python (or another language)
[script](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#external-scripts)
to be run (Snakemake creates a context for this script which allows it to access
the input and output objects), or finally the [`notebook`
directive](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#jupyter-notebook-integration),
similar to the `script` directive, for which Snakemake allows interactive
execution (useful for post-processing/data exploration).
Reading the Reading the
[documentation](https://snakemake.readthedocs.io/en/stable/index.html) is highly [documentation](https://snakemake.readthedocs.io/en/stable/index.html) is highly