added draft snakemake tutorial
parent
9dab927f35
commit
648cf8d4bd
|
|
@ -0,0 +1,79 @@
|
|||
# Snakemake tutorial: free-falling particle
|
||||
|
||||
## The computation
|
||||
|
||||
We'll study the equation for the velocity of a free-falling particle with
viscous friction, in dimensionless form:
|
||||
|
||||
$$ \frac{\mathrm d v}{\mathrm dt} = 1 - \eta v,\qquad v(0) = 0 $$
|
||||
|
||||
We will vary the physical parameter $\eta$, the time step $\Delta t$ of the
numerical integration scheme, and the total simulated time (which, together with
$\Delta t$, sets the number of time steps).
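
For reference, this linear ODE has a closed-form solution (for $\eta > 0$), which can serve as a sanity check on the numerical results:

$$ v(t) = \frac{1 - e^{-\eta t}}{\eta},\qquad \lim_{t \to \infty} v(t) = \frac{1}{\eta} $$

Substituting back gives $\mathrm dv/\mathrm dt = e^{-\eta t} = 1 - \eta v$ with $v(0) = 0$, so the terminal velocity should approach $1/\eta$ when the total time is large compared to $1/\eta$.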
|
||||
|
||||
The following function solves the ODE numerically:
|
||||
|
||||
```python
def solve_free_fall(eta, dt, total_time):
    ...
```
|
||||
|
||||
It returns a single two-row array: the first row holds the time values and the second row the velocity.
|
||||
|
||||
Save the above function in a `simulation.py` file.
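
The body of `solve_free_fall` is left out above. As a minimal sketch, assuming an explicit Euler scheme and NumPy (both are assumptions: the tutorial does not say which integrator or array library is used), it could look like this:

```python
import numpy as np


def solve_free_fall(eta, dt, total_time):
    """Integrate dv/dt = 1 - eta * v, v(0) = 0, with explicit Euler (assumed scheme)."""
    n_steps = int(round(total_time / dt))
    t = np.arange(n_steps + 1) * dt
    v = np.zeros(n_steps + 1)
    for i in range(n_steps):
        # Euler update: v_{n+1} = v_n + dt * (1 - eta * v_n)
        v[i + 1] = v[i] + dt * (1.0 - eta * v[i])
    # First row: time, second row: velocity
    return np.array([t, v])
```

With `eta=0.5`, `dt=1e-3` and `total_time=20`, the last velocity value should be close to the terminal velocity $1/\eta = 2$.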
|
||||
|
||||
## Steps
|
||||
|
||||
We want to do three things for our tutorial study:
|
||||
|
||||
- For any value of `eta`, `dt` and `total_time`, run the computation and save the
  output data (we'll call it the *simulation* step)
- For any computation, plot the velocity as a function of time and save the plot
  file (the *plot* step)
- For a chosen range of `eta`, plot the terminal velocity as a function of `eta`
  (the *aggregation* step)
|
||||
|
||||
Each of these three tasks is a workflow step, so we define one rule for each.
Each rule will also have its own script.
|
||||
|
||||
Since the first two steps do not specify which values are used for `eta`, `dt`
and `total_time`, these three parameters will be **wildcards**, i.e. values that
are specified by the user when an output is requested (e.g. the user wants
the time-velocity plot for `eta=0.1`, `dt=1e-3`, `total_time=20`).
|
||||
|
||||
## Rules
|
||||
|
||||
Snakemake rules (defined in a `Snakefile` text file) typically define three
things for a step:

1. The input files (optional)
2. The output files
3. What needs to be executed to generate the output (e.g. a script or shell command)
|
||||
|
||||
The first step, *simulation*, does not require any input other than the wildcard
parameter values, so we can write its rule in the `Snakefile` as:
|
||||
|
||||
```
rule simulation:
    output:
        "free_fall,eta={eta},dt={dt},total_time={total_time}.txt"
    script:
        "simulation.py"
```
|
||||
|
||||
Here the output file is defined as
`"free_fall,eta={eta},dt={dt},total_time={total_time}.txt"`. The `{...}` syntax
indicates a **wildcard** parameter (here `eta`, `dt` and `total_time`); the
rest of the output path is free-form (e.g. the `eta=` part could be omitted).
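
To make the idea of wildcards concrete, here is a rough sketch of how a requested file name can be matched against an output pattern. The helper `match_wildcards` is hypothetical and is not part of Snakemake; Snakemake's own matching is more elaborate (wildcards can, for instance, carry regex constraints). The sketch only assumes that each `{name}` matches one or more characters:

```python
import re


def match_wildcards(pattern, filename):
    """Return a dict of wildcard values, or None if the file name does not match.

    Hypothetical helper for illustration only, not Snakemake's implementation.
    """
    # Splitting on "{name}" alternates literal text (even indices)
    # and wildcard names (odd indices).
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        regex += f"(?P<{part}>.+)" if i % 2 else re.escape(part)
    match = re.fullmatch(regex, filename)
    return match.groupdict() if match else None


PATTERN = "free_fall,eta={eta},dt={dt},total_time={total_time}.txt"
values = match_wildcards(PATTERN, "free_fall,eta=0.5,dt=1e-3,total_time=20.txt")
print(values)
```

Note that the matched values are strings (`"0.5"`, `"1e-3"`, `"20"`), so a script that needs numbers has to convert them.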
|
||||
|
||||
One can request the output file by running:
|
||||
|
||||
```
snakemake -j1 free_fall,eta=0.5,dt=1e-3,total_time=20.txt
```
|
||||
|
||||
This command will execute the `simulation` rule with the wildcards replaced by the
specified values. Since `simulation.py` only defines a function and never calls
it, nothing is computed and no output file is written. Let's see how to retrieve
the wildcard values and output paths from the script file.
|
||||
|
||||
### Snakemake context
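
When a Python file is run through the `script` directive, Snakemake makes a `snakemake` object available in the script, with attributes such as `snakemake.wildcards` and `snakemake.output`. The sketch below shows one way to read them; the helper `read_context` and the stand-in object used when the file is run outside Snakemake are assumptions made for illustration:

```python
from types import SimpleNamespace


def read_context(context):
    """Return the numeric parameters and the output path from a Snakemake-like context.

    Hypothetical helper: wildcard values arrive as strings, so they are converted here.
    """
    names = ("eta", "dt", "total_time")
    params = {name: float(getattr(context.wildcards, name)) for name in names}
    return params, context.output[0]


# Inside a Snakemake run, `snakemake` is already defined as a global.
# Outside of it (e.g. when testing by hand), fall back to a stand-in object.
context = globals().get("snakemake")
if context is None:
    context = SimpleNamespace(
        wildcards=SimpleNamespace(eta="0.5", dt="1e-3", total_time="20"),
        output=["free_fall,eta=0.5,dt=1e-3,total_time=20.txt"],
    )

params, output_path = read_context(context)
print(params, output_path)
```

From there, the script can call `solve_free_fall(**params)` and write the result to `output_path`.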
|
||||
|
|
@ -1,6 +1,13 @@
|
|||
Scientific simulations are often complex beasts: each step of a simulation requires some input data, some code to run on this input, and produces some output. Together, the steps form a complex, intricate workflow that can be difficult to deploy, even harder to maintain, and downright impossible for a third party to reproduce.

This is why making this process easier is often a good time investment: it
typically requires thinking logically about a workflow, to split it into simple
steps linked only by input and output data. This alone helps structure a
workflow so that it's easier to add, remove or change simulation steps. Workflow
management software helps to define a workflow graph and formalizes this
process. It also tracks data dependencies, re-runs the steps that need
re-running when input data changes, and allows the configuration of parameter
spaces (for parameter studies).
|
||||
|
||||
There are a good number of workflow management programs designed for scientific
computation. Some run as a complex server process that contains a live
|
||||
|
|
@ -14,124 +21,32 @@ useful to run complex simulations.
|
|||
# Rule-based workflow
|
||||
|
||||
In order to satisfy reproducibility requirements for a given scientific study,
there must be traceability of how each output (figure, table, etc.) was
generated: which code created the output, what parameters were used, which
intermediate output was processed, etc.
|
||||
|
||||
All this can be done with **rules**, which explain how, given a set of inputs,
an output is created. A rule can be thought of as a "step" of a simulation
pipeline, and rules can be chained together and combined, forming a *directed
acyclic graph*. This allows two things:
|
||||
|
||||
- Traceability: following the graph makes it possible to find the inputs (data *and* code)
  that were used to generate an output, which is a necessary condition for
  reproducibility.
- Update of outputs: if an input changes, it is easy to find the rules that need
  executing to update the outputs and reflect the changes.
|
||||
|
||||
These two features together provide a solid step towards a reproducible simulation
workflow.
|
||||
|
||||
# Snakemake
|
||||
Snakemake is a tool written in Python to manage rule-based workflows. The
workflow definition is a rather simple text file (usually a `Snakefile`) which
contains the rule definitions, and typically looks like:
|
||||
|
||||
```python
rule list_groups_with_users:
    input:
        "/etc/group"
    output:
        "groups_with_users.txt", # file which contains only groups with users
    shell:
        """cat {input} | awk -F ':' '$4 != "" {{ print $1,$4; }}' > {output} """

rule sort_group_names:
    input:
        rules.list_groups_with_users.output[0]
    output:
        "sorted_groups.txt", # sorted file with group name and user
        "only_users.txt", # only contains the user names
    shell:
        "sort < {input[0]} | tee {output[0]} | cut -d ' ' -f 2 > {output[1]}"

rule filter_by_letter:
    input:
        rules.list_groups_with_users.output[0]
    output:
        "start_with_letter_{letter}.txt", # only groups starting with a letter
    shell:
        "grep '^{wildcards.letter}' < {input} > {output}"
```
|
||||
|
||||
> This example filters the file `/etc/group` (which lists all groups on a Linux
> system) and writes to three files. The first rule creates a file with the group
> names and their users. The second rule then creates a sorted file and a
> file with the user names only. This rather pointless application shows that it
> is possible to chain rule inputs and outputs, and to have multiple outputs.
|
||||
|
||||
Executing the workflow with the command `snakemake only_users.txt` (to tell it
to generate the `only_users.txt` file) should execute the first two rules, with an output
similar to:
|
||||
|
||||
```
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 20
Rules claiming more threads will be scaled down.
Job stats:
job                       count
----------------------  -------
list_groups_with_users        1
sort_group_names              1
total                         2

Select jobs to execute...
Execute 1 jobs...

[Wed Dec 11 14:56:49 2024]
localrule list_groups_with_users:
    input: /etc/group
    output: groups_with_users.txt
    jobid: 1
    reason: Missing output files: groups_with_users.txt
    resources: tmpdir=/tmp

[Wed Dec 11 14:56:49 2024]
Finished job 1.
1 of 2 steps (50%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Dec 11 14:56:49 2024]
localrule sort_group_names:
    input: groups_with_users.txt
    output: sorted_groups.txt, only_users.txt
    jobid: 0
    reason: Missing output files: only_users.txt; Input files updated by another job: groups_with_users.txt
    resources: tmpdir=/tmp

[Wed Dec 11 14:56:49 2024]
Finished job 0.
2 of 2 steps (100%) done
Complete log: .snakemake/log/2024-12-11T145649.039189.snakemake.log
```
|
||||
|
||||
Removing `only_users.txt` and running `snakemake only_users.txt` again should only
re-run the last step.
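
The reasoning behind this behaviour can be sketched in a few lines of Python. This is a deliberately simplified model, not Snakemake's actual scheduler: the function `jobs_to_run` is hypothetical, and it only checks whether the requested files exist, ignoring everything else Snakemake considers (such as file modification times):

```python
import os


def jobs_to_run(target, rules, exists=os.path.exists):
    """Return the names of the rules to execute, in order, to build `target`.

    `rules` maps a rule name to a pair (input files, output files).
    Simplified model for illustration only.
    """
    producers = {out: name for name, (_, outs) in rules.items() for out in outs}
    plan = []

    def visit(path):
        name = producers.get(path)
        if name is None:
            return False  # a source file: no rule creates it
        inputs, _ = rules[name]
        # Visit every input first so that upstream jobs are planned before this one.
        upstream_rerun = [visit(i) for i in inputs]
        if any(upstream_rerun) or not exists(path):
            if name not in plan:
                plan.append(name)
            return True
        return False

    visit(target)
    return plan


RULES = {
    "list_groups_with_users": (["/etc/group"], ["groups_with_users.txt"]),
    "sort_group_names": (
        ["groups_with_users.txt"],
        ["sorted_groups.txt", "only_users.txt"],
    ),
}

# Pretend `only_users.txt` was removed while the other files are still on disk.
on_disk = {"/etc/group", "groups_with_users.txt", "sorted_groups.txt"}
print(jobs_to_run("only_users.txt", RULES, exists=on_disk.__contains__))
# prints ['sort_group_names']
```

When none of the generated files exist, the same function plans `list_groups_with_users` first and `sort_group_names` second, which matches the order seen in the log above.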
|
||||
|
||||
The rule syntax is rather straightforward: each rule has a list of inputs and
outputs (which are indexed from `0` by default, and can be named). The
`shell` directive specifies that we want to run a shell command. This is the
most flexible option. Alternatively, one can use:

- the `run` directive, to write inline Python code directly in the `Snakefile`;
- the `script` directive, which specifies the name of a Python (or another language)
  [script](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#external-scripts)
  to be run (Snakemake creates a context for this script which allows it to access
  the input and output objects);
- the [`notebook`
  directive](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#jupyter-notebook-integration),
  similar to the `script` directive, for which Snakemake allows interactive
  execution (useful for post-processing/data exploration).
|
||||
To illustrate how to set up a Snakemake workflow, you can follow the
[tutorial](Snakemake-Tutorial.md), where you'll set up a Python simulation of a
free-falling particle with viscous friction, a plot of its velocity as a
function of time and a plot of the terminal velocity as a function of the viscous friction.
|
||||
|
||||
Reading the
|
||||
[documentation](https://snakemake.readthedocs.io/en/stable/index.html) is highly
|
||||
|