added draft snakemake tutorial

2026-01-13 11:23:00 +01:00 · 2026-01-13 11:23:00 +01:00 · 648cf8d4bd
parent 9dab927f35
commit 648cf8d4bd
2 changed files with 99 additions and 105 deletions
--- a/Snakemake-Tutorial.md
+++ b/Snakemake-Tutorial.md
@ -0,0 +1,79 @@
 # Snakemake tutorial: free-falling particle
 ## The computation
 We'll study the equation for the velocity of a free-falling particle with
 viscous friction, in dimensionless form:
 $$ \frac{\mathrm d v}{\mathrm dt} = 1 - \eta v,\qquad v(0) = 0 $$
 We will vary the physical parameter $\eta$, the time step $\Delta t$ for the
 numerical integration scheme and the total simulated time.
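For reference, this linear ODE has a closed-form solution for $\eta > 0$. It is a standard result rather than something taken from the tutorial's code, and it is handy for checking the numerical scheme:

$$ v(t) = \frac{1 - e^{-\eta t}}{\eta},\qquad \lim_{t\to\infty} v(t) = \frac{1}{\eta} $$

Substituting back gives $\frac{\mathrm d v}{\mathrm dt} = e^{-\eta t} = 1 - \eta v$ and $v(0) = 0$, so the terminal velocity expected in the aggregation step is $1/\eta$.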
 The following function solves the ODE numerically:
 ```python
 def solve_free_fall(eta, dt, total_time):
    ...
 ```
 It returns a single array where the first row is time and the second row is velocity.
 Save the above function in a `simulation.py` file.
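The body of `solve_free_fall` is elided above. As a minimal sketch of what it could contain, the version below assumes an explicit (forward) Euler scheme and returns a plain list of two rows instead of a NumPy array, to stay dependency-free. Both choices are assumptions made for illustration, not the tutorial's actual implementation.

```python
def solve_free_fall(eta, dt, total_time):
    """Integrate dv/dt = 1 - eta * v with v(0) = 0.

    Assumption: explicit Euler steps of size dt, which are stable
    only when eta * dt < 2.
    Returns [times, velocities]: first row is time, second is velocity.
    """
    n_steps = round(total_time / dt)
    times = [0.0]
    velocities = [0.0]
    for step in range(1, n_steps + 1):
        v = velocities[-1]
        # Euler update: v_{n+1} = v_n + dt * (1 - eta * v_n)
        velocities.append(v + dt * (1.0 - eta * v))
        times.append(step * dt)
    return [times, velocities]


if __name__ == "__main__":
    times, velocities = solve_free_fall(eta=0.5, dt=1e-3, total_time=20)
    # The last velocity should be close to the terminal velocity 1 / eta.
    print(times[-1], velocities[-1])
```

With a small enough `dt`, the last velocity value approaches $1/\eta$ once `total_time` is several multiples of $1/\eta$.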
 ## Steps
 We want to do three things for our tutorial study:
 - For any value of `eta`, `dt` and `total_time`, run the computation and save the
  output data (we'll call it the *simulation* step)
 - For any computation, plot the velocity as a function of time and save the plot
  file (the *plot* step)
 - For a chosen range of `eta`, plot the terminal velocity as a function of `eta`
  (the *aggregation* step)
 Each of these three things constitutes a workflow step, which means we define a
 rule for each. Each will also have its own script.
 Since the first two steps do not specify which values are used for `eta`, `dt`
 and `total_time`, these three parameters will be **wildcards**, i.e. values that
 will be specified by the user when an output is requested (e.g. the user wants
 the time-velocity plot for `eta=0.1`, `dt=1e-3`, `total_time=20`).
 ## Rules
 Snakemake rules (defined in a `Snakefile` text file) typically define three
 things for a step:
 1. The input files (optional)
 2. The output files
 3. What needs to be executed to generate the output (e.g. a script or shell command)
 The first *simulation* step does not require any input other than the wildcard
 parameter values, so we can write its rule in the `Snakefile` as:
 ```
 rule simulation:
    output:
        "free_fall,eta={eta},dt={dt},total_time={total_time}.txt"
    script:
        "simulation.py"
 ```
 Here the output file is defined as
 `"free_fall,eta={eta},dt={dt},total_time={total_time}.txt"`. The `{...}` syntax
 indicates a **wildcard** parameter (here `eta`, `total_time` and `dt`), but the
 rest of the output path is free (e.g. the `eta=` part could be omitted).
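To make the wildcard mechanism concrete, here is a small standalone sketch of how a requested file name can be matched against an output pattern. It is not Snakemake's own code: the helper `match_wildcards` is hypothetical, and it only mimics the default behaviour where each wildcard matches the regular expression `.+`.

```python
import re

PATTERN = "free_fall,eta={eta},dt={dt},total_time={total_time}.txt"


def match_wildcards(pattern, filename):
    """Return the wildcard values found in filename, or None if no match.

    Hypothetical helper for illustration only.
    """
    # re.split with a capturing group alternates literal text and names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
        for i, part in enumerate(parts)
    )
    match = re.fullmatch(regex, filename)
    return match.groupdict() if match else None


if __name__ == "__main__":
    values = match_wildcards(PATTERN, "free_fall,eta=0.5,dt=1e-3,total_time=20.txt")
    print(values)
    # Wildcard values are strings, so convert them before computing.
    print(float(values["eta"]), float(values["dt"]), float(values["total_time"]))
```

The point to retain is that wildcard values reach the script as strings, so they need an explicit conversion such as `float(...)` before being passed to `solve_free_fall`.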
 One can request the output file by running:
 ```
 snakemake -j1 free_fall,eta=0.5,dt=1e-3,total_time=20.txt
 ```
 This command will execute the `simulation` rule with the wildcards replaced by the
 specified values. Since `simulation.py` only defines a function and never writes
 the output file, the job produces nothing and Snakemake reports the output as
 missing. Let's see how to retrieve the wildcard values and output paths from the
 script file.
 ### Snakemake context
--- a/Workflow-management.md
+++ b/Workflow-management.md
@ -1,6 +1,13 @@
 Scientific simulations are often complex beasts: each step of a simulation requires some input data and some code to run on it, and produces some output. Steps together form a complex, intricate workflow that can be difficult to deploy, even harder to maintain, and downright impossible for a third party to reproduce.
-This is why making this process easier is often a good time investment: it often requires thinking logically about a workflow, to split it into simple steps linked only by input and output data. This alone helps structure a workflow so that it's easier to add, remove or change simulation steps. Workflow management software helps to define a workflow graph and formalizes this process. It also allows tracking data dependency, re-run steps that require running when input data changes, and allows the configuration of parameter spaces.
+This is why making this process easier is often a good time investment: it
 typically requires thinking logically about a workflow, to split it into simple
 steps linked only by input and output data. This alone helps structure a
 workflow so that it's easier to add, remove or change simulation steps. Workflow
 management software helps to define a workflow graph and formalizes this
 process. It also tracks data dependencies, re-runs the steps that need
 re-running when input data changes, and allows the configuration of parameter
 spaces (for parameter studies).
 There are a good number of workflow management programs designed for scientific
 computation. Some run as a complex server process that contains a live
@ -14,124 +21,32 @@ useful to run complex simulations.
 # Rule-based workflow
 In order to satisfy reproducibility requirements for a given scientific study,
-there must be a traceability of how each output (figure, table, etc.) was
+there must be traceability of how each output (figure, table, etc.) was
 generated: which code created the output, what parameters were used, which
 intermediate output was processed, etc.
 All this can be done with **rules**, which explain how, given a set of inputs,
 an output is created. A rule can be thought of as a "step" of a simulation
-pipeline, and rules can be chained together and combined, forming an *directed
+pipeline, and rules can be chained together and combined, forming a *directed
 acyclic graph*. This allows two things:
 - Traceability: following the graph makes it possible to find the inputs (data *and* code)
  that were used to generate an output, which is a necessary condition for
  reproducibility.
 - Update of outputs: if an input changes, it is easy to find the rules that need
-  executing to update the outputs to reflect the changes.
+  executing to update the outputs and reflect the changes.
-These two features together provide a solid step towards reproducible simulation
-work.
+These two features together provide a solid step towards a reproducible simulation
+workflow.
 # Snakemake
-Snakemake is a tool written in Python to managed rule-based workflows. The
-workflow definition is a rather simple text file (usually a `Snakefile`), which
-typically looks like:
-```python
-rule list_groups_with_users:
-    input:
-        "/etc/group"
+Snakemake is a tool written in Python to manage rule-based workflows. The
+workflow definition is a rather simple text file (usually a `Snakefile`) which
+contains the rule definitions.
+To illustrate how to set up a Snakemake workflow, you can follow the
+[tutorial](Snakemake-Tutorial.md), where you'll set up a Python simulation of a
+free-falling particle with viscous friction, a plot of its velocity as a
+function of time and a plot of the terminal velocity as a function of the viscous friction.
    output:
        "groups_with_users.txt", # file which contains only groups with users
    shell:
        """cat {input} | awk -F ':' '$4 != "" {{ print $1,$4; }}' > {output} """
 rule sort_group_names:
    input:
        rules.list_groups_with_users.output[0]
    output:
        "sorted_groups.txt", # sorted file with group name and user
        "only_users.txt",    # only contains the user names
    shell:
        "sort < {input[0]} | tee {output[0]} | cut -d ' ' -f 2 > {output[1]}"
 rule filter_by_letter:
    input:
        rules.list_groups_with_users.output[0]
    output:
        "start_with_letter_{letter}.txt", # only groups starting with a letter
    shell:
        "grep '^{wildcards.letter}' < {input} > {output}"
 ```
 > This example filters the file /etc/group (which contains all groups on a Linux
 > system) and writes to three files. The first has the group name and users
 > (created by the first rule). Then the second rule creates a sorted file and a
 > file with the user names only. This rather pointless application shows that it
 > is possible to chain rule inputs and outputs, and to have multiple outputs.
 Executing the workflow with the command `snakemake only_users.txt` (to tell it
 to generate the `only_users.txt` file) should execute the first two rules, with
 an output similar to:
 ```
 Building DAG of jobs...
 Using shell: /usr/bin/bash
 Provided cores: 20
 Rules claiming more threads will be scaled down.
 Job stats:
 job                       count
 ----------------------  -------
 list_groups_with_users        1
 sort_group_names              1
 total                         2
 Select jobs to execute...
 Execute 1 jobs...
 [Wed Dec 11 14:56:49 2024]
 localrule list_groups_with_users:
    input: /etc/group
    output: groups_with_users.txt
    jobid: 1
    reason: Missing output files: groups_with_users.txt
    resources: tmpdir=/tmp
 [Wed Dec 11 14:56:49 2024]
 Finished job 1.
 1 of 2 steps (50%) done
 Select jobs to execute...
 Execute 1 jobs...
 [Wed Dec 11 14:56:49 2024]
 localrule sort_group_names:
    input: groups_with_users.txt
    output: sorted_groups.txt, only_users.txt
    jobid: 0
    reason: Missing output files: only_users.txt; Input files updated by another job: groups_with_users.txt
    resources: tmpdir=/tmp
 [Wed Dec 11 14:56:49 2024]
 Finished job 0.
 2 of 2 steps (100%) done
 Complete log: .snakemake/log/2024-12-11T145649.039189.snakemake.log
 ```
 Removing `only_users.txt` and running `snakemake only_users.txt` should only
 re-run the last step.
 The rule syntax is rather straightforward: each rule has a list of inputs and
 outputs (which are numbered from `0` by default, and can be named). The
 `shell` directive specifies that we want to run a shell command. This is the
 most flexible option. There are three alternatives: the `run` directive, to write
 inline Python code directly in the `Snakefile`; the `script` directive, which
 specifies the name of a Python (or another language)
 [script](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#external-scripts)
 to be run (Snakemake creates a context for this script which allows it to access
 the input and output objects); and the [`notebook`
 directive](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#jupyter-notebook-integration),
 similar to the `script` directive, for which Snakemake allows interactive
 execution (useful for post-processing/data exploration).
 Reading the
 [documentation](https://snakemake.readthedocs.io/en/stable/index.html) is highly