From 648cf8d4bd8ccbace11b10b4e8d18a50b488ff81 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Lucas=20Fr=C3=A9rot?=
Date: Tue, 13 Jan 2026 11:23:00 +0100
Subject: [PATCH] added draft snakemake tutorial

---
 Snakemake-Tutorial.md  |  79 ++++++++++++++++++++++++++
 Workflow-management.md | 125 +++++++----------------------------------
 2 files changed, 99 insertions(+), 105 deletions(-)
 create mode 100644 Snakemake-Tutorial.md

diff --git a/Snakemake-Tutorial.md b/Snakemake-Tutorial.md
new file mode 100644
index 0000000..398249e
--- /dev/null
+++ b/Snakemake-Tutorial.md
@@ -0,0 +1,79 @@
+# Snakemake tutorial : free falling particle
+
+## The computation
+
+We'll study the equation for the velocity of a free falling particle with
+viscous friction, in adimensional form:
+
+$$ \frac{\mathrm d v}{\mathrm dt} = 1 - \eta v,\qquad v(0) = 0 $$
+
+We will vary the physical parameter $\eta$, the time step $\Delta t$ for the
+numerical integration scheme and the number of time steps.
+
+The following function solves the ODE numerically:
+
+```python
+def solve_free_fall(eta, dt, total_time):
+    ...
+```
+
+It returns a single array where the first line is time and the second line is velocity.
+
+Save the above function in a `simulation.py` file.
+
+## Steps
+
+We want to do three things for our tutorial study:
+
+- For any value of `eta`, `dt` and `total_time`, run the computation and save the
+  output data (we'll call it the *simulation* step)
+- For any computation, plot the velocity as function of time and save the plot
+  file (the *plot* step)
+- For a chosen range of `eta`, plot the terminal velocity as a function of `eta`
+  (the *aggregation* step)
+
+Each of these three things constitutes a workflow step, which means we define a
+rule for each. Each will also have its own script.
+
+Since the first two steps do not specify which values are used for `eta`, `dt`
+and `total_time`, these three parameters will be **wildcards**, i.e. values that
+will be specified by the user when an output is requested (e.g. the user wants
+the time-velocity plot for `eta=0.1`, `dt=1e-3`, `total_time=20`).
+
+## Rules
+
+Snakemake rules (defined in a `Snakefile` text file) typically define three
+things for a step:
+
+1. The input files (optional)
+2. The output files
+3. What needs to be executed to generate the output (e.g. a script or shell command)
+
+The first *simulation* step does not require any input other than the wildcard
+parameter values, so we can write its rule in the `Snakefile` as:
+
+```
+rule simulation:
+    output:
+        "free_fall,eta={eta},dt={dt},total_time={total_time}.txt"
+    script:
+        "simulation.py"
+```
+
+Here the output file is defined as
+`"free_fall,eta={eta},dt={dt},total_time={total_time}.txt"`. The `{...}` syntax
+indicates a **wildcard** parameter (here `eta`, `total_time` and `dt`), but the
+rest of the output path is free (e.g. the `eta=` part could be omitted).
+
+One can request the output file by running:
+
+```
+snakemake -j1 free_fall,eta=0.5,dt=1e-3,total_time=20.txt
+```
+
+This command will execute the `simulation` rule with wildcard replaced by the
+specified values. Since `simulation.py` only contains a function, nothing
+happens. Let's see how to retrieve the wildcard values and output paths from the
+script file.
+
+### Snakemake context
diff --git a/Workflow-management.md b/Workflow-management.md
index 0867db1..6ac2caf 100644
--- a/Workflow-management.md
+++ b/Workflow-management.md
@@ -1,6 +1,13 @@
 Scientific simulations are often complex beasts: each step of a simulation requires some input data, a code to run on this input, and produces some output. Steps together form a complex, intricate workflow that can be difficult to deploy, even harder to maintain, and downright impossible to be reproduced by a third party.
 
-This is why making this process easier is often a good time investment: it often requires thinking logically about a workflow, to split it into simple steps linked only by input and output data. This alone helps structure a workflow so that it's easier to add, remove or change simulation steps. Workflow management software helps to define a workflow graph and formalizes this process. It also allows tracking data dependency, re-run steps that require running when input data changes, and allows the configuration of parameter spaces.
+This is why making this process easier is often a good time investment: it
+typically requires thinking logically about a workflow, to split it into simple
+steps linked only by input and output data. This alone helps structure a
+workflow so that it's easier to add, remove or change simulation steps. Workflow
+management software helps to define a workflow graph and formalizes this
+process. It also allows tracking data dependency, re-run steps that require
+running when input data changes, and allows the configuration of parameter
+spaces (for parameter studies).
 
 There are a good number of workflow management programs designed for scientific
 computation. Some run as a complex server process that contain a live
@@ -14,124 +21,32 @@ useful to run complex simulations.
 # Rule-based workflow
 
 In order to satisfy reproducibility requirements for a given scientific study,
-there must be a traceability of how each output (figure, table, etc.) was
+there must be traceability of how each output (figure, table, etc.) was
 generated: which code created the output, what parameters were used, which
 intermediate output was processed, etc.
 
 All this can be done with **rules**, which explain how, given a set of inputs,
 an output is created. A rule can be thought of as a "step" of a simulation
-pipeline, and rules can be chained together and combined, forming an *directed
-acyclic graph*. This allows two things:
+pipeline, and rules can be chained together and combined. This allows two things:
 
 - Traceability: following the graph allows to find the inputs (data *and* code)
   that were used to generate an output, which is a necessary condition for
   reproducibility.
 - Update of outputs: if an input changes, it is easy to find the rules that need
-  executing to update the outputs to reflect the changes.
+  executing to update the outputs and reflect the changes.
 
-These two features together provide a solid step towards reproducible simulation
-work.
+These two features together provide a solid step towards a reproducible simulation
+workflow.
 
 # Snakemake
 
-Snakemake is a tool written in Python to managed rule-based workflows. The
-workflow definition is a rather simple text file (usually a `Snakefile`), which
-typically looks like:
+Snakemake is a tool written in Python to manage rule-based workflows. The
+workflow definition is a rather simple text file (usually a `Snakefile`) which
+contains the rule definitions.
 
-```python
-rule list_groups_with_users:
-    input:
-        "/etc/group"
-    output:
-        "groups_with_users.txt", # file which contains only groups with users
-    shell:
-        """cat {input} | awk -F ':' '$4 != "" {{ print $1,$4; }}' > {output} """
-
-rule sort_group_names:
-    input:
-        rules.list_groups_with_users.output[0]
-    output:
-        "sorted_groups.txt", # sorted file with group name and user
-        "only_users.txt", # only contains the user names
-    shell:
-        "sort < {input[0]} | tee {output[0]} | cut -d ' ' -f 2 > {output[1]}"
-
-rule filter_by_letter:
-    input:
-        rules.list_groups_with_users.output[0]
-    output:
-        "start_with_letter_{letter}.txt", # only groups starting with a letter
-    shell:
-        "grep '^{wildcards.letter}' < {input} > {output}"
-```
-
-> This example filters the file /etc/group (which contains all groups on a linux
-> system) and writes to three files. The first has the group name and users
-> (created by the first rule). Then the second rule creates a sorted file and a
-> file with the user names only. This rather pointless application shows that it
-> is possible to chain rule inputs and outputs, and to have multiple outputs.
-
-Executing the workflow with the command `snakemake only_users.txt` (to tell it
-to generate the `only_users.txt` file) should execute both rules, with an output
-similar to:
-
-```
-Building DAG of jobs...
-Using shell: /usr/bin/bash
-Provided cores: 20
-Rules claiming more threads will be scaled down.
-Job stats:
-job                       count
-----------------------  -------
-list_groups_with_users        1
-sort_group_names              1
-total                         2
-
-Select jobs to execute...
-Execute 1 jobs...
-
-[Wed Dec 11 14:56:49 2024]
-localrule list_groups_with_users:
-    input: /etc/group
-    output: groups_with_users.txt
-    jobid: 1
-    reason: Missing output files: groups_with_users.txt
-    resources: tmpdir=/tmp
-
-[Wed Dec 11 14:56:49 2024]
-Finished job 1.
-1 of 2 steps (50%) done
-Select jobs to execute...
-Execute 1 jobs...
-
-[Wed Dec 11 14:56:49 2024]
-localrule sort_group_names:
-    input: groups_with_users.txt
-    output: sorted_groups.txt, only_users.txt
-    jobid: 0
-    reason: Missing output files: only_users.txt; Input files updated by another job: groups_with_users.txt
-    resources: tmpdir=/tmp
-
-[Wed Dec 11 14:56:49 2024]
-Finished job 0.
-2 of 2 steps (100%) done
-Complete log: .snakemake/log/2024-12-11T145649.039189.snakemake.log
-```
-
-Removing `only_users.txt` and running `snakemake only_users.txt` should only
-re-run the last step.
-
-The rule syntax is rather straight-forward: each rule has a list of inputs and
-outputs (which are numbered from `0` to `N` by default, and can be named). The
-`shell` directive specifies that we want to run a shell command. This is the
-most flexible option. Alternatively, one can use the `run` directive and write
-inline python code directly in the `Snakefile`, the `script` directive, which
-specifies the name of a Python (or another language)
-[script](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#external-scripts)
-to be run (Snakemake creates a context for this script which allows it to access
-the input and output objects), or finally the [`notebook`
-directive](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#jupyter-notebook-integration),
-similar to the `script` directive, for which Snakemake allows interactive
-execution (useful for post-processing/data exploration).
+To illustrate how to setup a Snakemake workflow, you can follow the
+[tutorial](Snakemake-Tutorial.md), where you'll setup a python simulation of a
+free-falling particle with viscous friction, a plot of its trajectory as
+function of time and a plot of the terminal velocity as function of the viscous friction.
 
 Reading the
 [documentation](https://snakemake.readthedocs.io/en/stable/index.html) is highly