added draft snakemake tutorial

2026-01-13 11:23:00 +01:00 · 648cf8d4bd
parent 9dab927f35
commit 648cf8d4bd
2 changed files with 99 additions and 105 deletions
--- a/Snakemake-Tutorial.md
+++ b/Snakemake-Tutorial.md
@ -0,0 +1,79 @@
+# Snakemake tutorial: free-falling particle
+
+## The computation
+
+We'll study the equation for the velocity of a free-falling particle with
+viscous friction, in dimensionless form:
+
+$$ \frac{\mathrm d v}{\mathrm dt} = 1 - \eta v,\qquad v(0) = 0 $$
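+
+As a sanity check for the numerical results, note that this linear ODE has a
+closed-form solution. Assuming $\eta > 0$, separating variables and applying
+$v(0) = 0$ gives the velocity and its long-time limit (the terminal velocity):
+
+$$ v(t) = \frac{1 - e^{-\eta t}}{\eta},\qquad \lim_{t \to \infty} v(t) = \frac{1}{\eta} $$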
+
+We will vary the physical parameter $\eta$, the time step $\Delta t$ for the
+numerical integration scheme and the total simulated time.
+
+The following function solves the ODE numerically:
+
+```python
+def solve_free_fall(eta, dt, total_time):
+    ...
+```
+
+It returns a single array where the first row is time and the second row is velocity.
+
+Save the above function in a `simulation.py` file.
+
+## Steps
+
+We want to do three things for our tutorial study:
+
+- For any value of `eta`, `dt` and `total_time`, run the computation and save the
+  output data (we'll call it the *simulation* step)
+- For any computation, plot the velocity as a function of time and save the plot
+  file (the *plot* step)
+- For a chosen range of `eta`, plot the terminal velocity as a function of `eta`
+  (the *aggregation* step)
+
+Each of these three things constitutes a workflow step, which means we define a
+rule for each. Each will also have its own script.
+
+Since the first two steps do not specify which values are used for `eta`, `dt`
+and `total_time`, these three parameters will be **wildcards**, i.e. values that
+will be specified by the user when an output is requested (e.g. the user wants
+the time-velocity plot for `eta=0.1`, `dt=1e-3`, `total_time=20`).
+
+## Rules
+
+Snakemake rules (defined in a `Snakefile` text file) typically define three
+things for a step:
+
+1. The input files (optional)
+2. The output files
+3. What needs to be executed to generate the output (e.g. a script or shell command)
+
+The first *simulation* step does not require any input other than the wildcard
+parameter values, so we can write its rule in the `Snakefile` as:
+
+```
+rule simulation:
+    output:
+        "free_fall,eta={eta},dt={dt},total_time={total_time}.txt"
+    script:
+        "simulation.py"
+```
+
+Here the output file is defined as
+`"free_fall,eta={eta},dt={dt},total_time={total_time}.txt"`. The `{...}` syntax
+indicates a **wildcard** parameter (here `eta`, `total_time` and `dt`), but the
+rest of the output path is free (e.g. the `eta=` part could be omitted).
+
+One can request the output file by running:
+
+```
+snakemake -j1 free_fall,eta=0.5,dt=1e-3,total_time=20.txt
+```
+
+This command will execute the `simulation` rule with the wildcards replaced by
+the specified values. Since `simulation.py` only defines a function, no output
+file is written, and Snakemake reports the expected output as missing. Let's
+see how to retrieve the wildcard values and output paths from the script file.
+
+### Snakemake context
--- a/Workflow-management.md
+++ b/Workflow-management.md
@ -1,6 +1,13 @@
 Scientific simulations are often complex beasts: each step of a simulation requires some input data, a code to run on this input, and produces some output. Steps together form a complex, intricate workflow that can be difficult to deploy, even harder to maintain, and downright impossible to be reproduced by a third party.

-This is why making this process easier is often a good time investment: it often requires thinking logically about a workflow, to split it into simple steps linked only by input and output data. This alone helps structure a workflow so that it's easier to add, remove or change simulation steps. Workflow management software helps to define a workflow graph and formalizes this process. It also allows tracking data dependency, re-run steps that require running when input data changes, and allows the configuration of parameter spaces.
+This is why making this process easier is often a good time investment: it
+typically requires thinking logically about a workflow, to split it into simple
+steps linked only by input and output data. This alone helps structure a
+workflow so that it's easier to add, remove or change simulation steps. Workflow
+management software helps to define a workflow graph and formalizes this
+process. It also tracks data dependencies, re-runs the steps that need it
+when input data changes, and allows the configuration of parameter
+spaces (for parameter studies).

 There are a good number of workflow management programs designed for scientific
 computation. Some run as a complex server process that contain a live
@ -14,124 +21,32 @@ useful to run complex simulations.
 # Rule-based workflow

 In order to satisfy reproducibility requirements for a given scientific study,
-there must be a traceability of how each output (figure, table, etc.) was
+there must be traceability of how each output (figure, table, etc.) was
 generated: which code created the output, what parameters were used, which
 intermediate output was processed, etc.

 All this can be done with **rules**, which explain how, given a set of inputs,
 an output is created. A rule can be thought of as a "step" of a simulation
-pipeline, and rules can be chained together and combined, forming an *directed
-acyclic graph*. This allows two things:
+pipeline, and rules can be chained together and combined. This allows two things:

 - Traceability: following the graph allows to find the inputs (data *and* code)
  that were used to generate an output, which is a necessary condition for
  reproducibility.
 - Update of outputs: if an input changes, it is easy to find the rules that need
-  executing to update the outputs to reflect the changes.
+  executing to update the outputs and reflect the changes.

-These two features together provide a solid step towards reproducible simulation
-work.
+These two features together provide a solid step towards a reproducible simulation
+workflow.

 # Snakemake
-Snakemake is a tool written in Python to managed rule-based workflows. The
-workflow definition is a rather simple text file (usually a `Snakefile`), which
-typically looks like:
+Snakemake is a tool written in Python to manage rule-based workflows. The
+workflow definition is a rather simple text file (usually a `Snakefile`) which
+contains the rule definitions.

-```python
-rule list_groups_with_users:
-    input:
-        "/etc/group"
-    output:
-        "groups_with_users.txt", # file which contains only groups with users
-    shell:
-        """cat {input} | awk -F ':' '$4 != "" {{ print $1,$4; }}' > {output} """
-
-rule sort_group_names:
-    input:
-        rules.list_groups_with_users.output[0]
-    output:
-        "sorted_groups.txt", # sorted file with group name and user
-        "only_users.txt",    # only contains the user names
-    shell:
-        "sort < {input[0]} | tee {output[0]} | cut -d ' ' -f 2 > {output[1]}"
-
-rule filter_by_letter:
-    input:
-        rules.list_groups_with_users.output[0]
-    output:
-        "start_with_letter_{letter}.txt", # only groups starting with a letter
-    shell:
-        "grep '^{wildcards.letter}' < {input} > {output}"
-```
-
-> This example filters the file /etc/group (which contains all groups on a linux
-> system) and writes to three files. The first has the group name and users
-> (created by the first rule). Then the second rule creates a sorted file and a
-> file with the user names only. This rather pointless application shows that it
-> is possible to chain rule inputs and outputs, and to have multiple outputs.
-
-Executing the workflow with the command `snakemake only_users.txt` (to tell it
-to generate the `only_users.txt` file) should execute both rules, with an output
-similar to:
-
-```
-Building DAG of jobs...
-Using shell: /usr/bin/bash
-Provided cores: 20
-Rules claiming more threads will be scaled down.
-Job stats:
-job                       count
----------------------  -------
-list_groups_with_users        1
-sort_group_names              1
-total                         2
-
-Select jobs to execute...
-Execute 1 jobs...
-
-[Wed Dec 11 14:56:49 2024]
-localrule list_groups_with_users:
-    input: /etc/group
-    output: groups_with_users.txt
-    jobid: 1
-    reason: Missing output files: groups_with_users.txt
-    resources: tmpdir=/tmp
-
-[Wed Dec 11 14:56:49 2024]
-Finished job 1.
-1 of 2 steps (50%) done
-Select jobs to execute...
-Execute 1 jobs...
-
-[Wed Dec 11 14:56:49 2024]
-localrule sort_group_names:
-    input: groups_with_users.txt
-    output: sorted_groups.txt, only_users.txt
-    jobid: 0
-    reason: Missing output files: only_users.txt; Input files updated by another job: groups_with_users.txt
-    resources: tmpdir=/tmp
-
-[Wed Dec 11 14:56:49 2024]
-Finished job 0.
-2 of 2 steps (100%) done
-Complete log: .snakemake/log/2024-12-11T145649.039189.snakemake.log
-```
-
-Removing `only_users.txt` and running `snakemake only_users.txt` should only
-re-run the last step.
-
-The rule syntax is rather straight-forward: each rule has a list of inputs and
-outputs (which are numbered from `0` to `N` by default, and can be named). The
-`shell` directive specifies that we want to run a shell command. This is the
-most flexible option. Alternatively, one can use the `run` directive and write
-inline python code directly in the `Snakefile`, the `script` directive, which
-specifies the name of a Python (or another language)
-[script](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#external-scripts)
-to be run (Snakemake creates a context for this script which allows it to access
-the input and output objects), or finally the [`notebook`
-directive](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#jupyter-notebook-integration),
-similar to the `script` directive, for which Snakemake allows interactive
-execution (useful for post-processing/data exploration).
+To see how to set up a Snakemake workflow, you can follow the
+[tutorial](Snakemake-Tutorial.md), where you'll set up a Python simulation of a
+free-falling particle with viscous friction, a plot of its velocity as a
+function of time and a plot of the terminal velocity as a function of the viscous friction.

 Reading the
 [documentation](https://snakemake.readthedocs.io/en/stable/index.html) is highly