added text on snakemake

Lucas Frérot 2024-12-11 15:07:12 +01:00
parent 3662963438
commit ce5cdc9871
1 changed files with 128 additions and 2 deletions

@ -2,7 +2,133 @@ Scientific simulations are often complex beasts: each step of a simulation requi
This is why making this process easier is often a good time investment: it requires thinking logically about a workflow and splitting it into simple steps linked only by input and output data. This alone helps structure a workflow so that it's easier to add, remove or change simulation steps. Workflow management software helps to define a workflow graph and formalizes this process. It also tracks data dependencies, re-runs only the steps whose input data has changed, and allows the configuration of parameter spaces.
There are a good number of workflow management programs designed for scientific computation. Some run as a complex server process that contains a live description of a workflow. In my experience, deploying these systems is not worth the time investment. Instead, I recommend using a tool called [Snakemake](https://snakemake.github.io/), which runs in Python and is greatly inspired by `make`, a well-established build system. While it has its own faults, I have found it quite useful to run complex simulations.
There are a good number of workflow management programs designed for scientific
computation. Some run as a complex server process that contains a live
description of a workflow. In my experience, deploying these systems is not
worth the time investment. Instead, I recommend using a rule-based tool like
[GNU Make](https://www.gnu.org/software/make/) (i.e. `Makefile`s) or
[Snakemake](https://snakemake.github.io/), which runs in Python and is greatly
inspired by GNU Make. While Snakemake has its own faults, I have found it quite
useful to run complex simulations.
# Rule-based workflow
To satisfy reproducibility requirements for a given scientific study, it must
be possible to trace how each output (figure, table, etc.) was generated: which
code created the output, what parameters were used, which intermediate output
was processed, etc.
All this can be done with **rules**, which explain how, given a set of inputs,
an output is created. A rule can be thought of as a "step" of a simulation
pipeline, and rules can be chained together and combined, forming a *directed
acyclic graph*. This allows two things:
- Traceability: following the graph makes it possible to find the inputs (data
*and* code) that were used to generate an output, which is a necessary
condition for reproducibility.
- Update of outputs: if an input changes, it is easy to find the rules that need
executing to update the outputs to reflect the changes.
These two features together provide a solid step towards reproducible simulation
work.
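The two properties above can be illustrated with a few lines of plain Python.
This is only a sketch of the idea, not how Make or Snakemake are implemented:
the rule table, the file names and the function names below are all
hypothetical.

```python
# Hypothetical rule table: each rule turns input files into output files.
RULES = {
    "mesh": {"inputs": ["geometry.geo", "mesh.py"], "outputs": ["mesh.msh"]},
    "solve": {"inputs": ["mesh.msh", "solve.py"], "outputs": ["result.h5"]},
    "plot": {"inputs": ["result.h5", "plot.py"], "outputs": ["figure.pdf"]},
}


def producer(path):
    """Return the name of the rule that creates `path`, or None for a source file."""
    for name, rule in RULES.items():
        if path in rule["outputs"]:
            return name
    return None


def trace(path):
    """Traceability: collect every source file (data and code) behind `path`."""
    name = producer(path)
    if name is None:
        return {path}
    sources = set()
    for dependency in RULES[name]["inputs"]:
        sources |= trace(dependency)
    return sources


def rules_to_rerun(changed):
    """Update of outputs: find the rules downstream of a changed file."""
    dirty, todo = {changed}, set()
    grew = True
    while grew:  # repeat until no new rule is affected
        grew = False
        for name, rule in RULES.items():
            if name not in todo and dirty & set(rule["inputs"]):
                todo.add(name)
                dirty |= set(rule["outputs"])
                grew = True
    return todo
```

With this table, `trace("figure.pdf")` returns the four source files
(`geometry.geo`, `mesh.py`, `solve.py`, `plot.py`), and
`rules_to_rerun("solve.py")` returns the `solve` and `plot` rules but not
`mesh`.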
# GNU Make
Make is a program specifically designed to be a build system, i.e. a tool that
coordinates the compilation of a program's source code so that an executable or
library can be built. Each file of the build process is called a *target* and is
the output of some rule. Although its primary purpose is creating build files,
it can easily be made to manage outputs of simulations. While it has the
advantage of being installed on virtually every Linux machine used for
scientific work, it lacks some features (most notably integration with queue
systems), which makes it practical only for small cases (although I am sure some
shortcomings could be solved with a strong knowledge of Make).
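The decision Make takes for each target is based on file modification times: a
target is rebuilt if it does not exist or if any of its prerequisites is newer.
A minimal Python sketch of that check follows; the function name
`is_out_of_date` is made up for illustration, and real Make handles more cases
(phony targets, for instance).

```python
import os


def is_out_of_date(target, prerequisites):
    """Return True if `target` is missing or older than any prerequisite."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # Any prerequisite modified after the target makes the target stale.
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

For a simulation, `target` would be a result file and `prerequisites` the input
data and scripts that produce it.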
# Snakemake
A workflow in Snakemake is defined in a text file called `Snakefile`, the equivalent of Make's `Makefile`. This file defines *rules*, which are a basic unit defining a simulation step with three basic features: input, how to run the code, output. A rule basically explains how a given output is generated. Each output can be used as input to another rule, thereby creating a dependency graph (also called directed acyclic graph). One can then request the creation of a specific output, and the system will know which rules to execute to get to this output.
Snakemake is a tool written in Python to manage rule-based workflows. The
workflow definition is a rather simple text file (usually a `Snakefile`), which
typically looks like:
```python
rule list_groups_with_users:
    input:
        "/etc/group"
    output:
        "groups_with_users.txt",  # file which contains only groups with users
    shell:
        """cat {input} | awk -F ':' '$4 != "" {{ print $1,$4; }}' > {output} """

rule sort_group_names:
    input:
        rules.list_groups_with_users.output[0]
    output:
        "sorted_groups.txt",  # sorted file with group name and user
        "only_users.txt",  # only contains the user names
    shell:
        "sort < {input[0]} | tee {output[0]} | cut -d ' ' -f 2 > {output[1]}"
```
Executing the workflow with the command `snakemake only_users.txt` (to tell it
to generate the `only_users.txt` file) should execute both rules, with an output
similar to:
```
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 20
Rules claiming more threads will be scaled down.
Job stats:
job                       count
----------------------  -------
list_groups_with_users        1
sort_group_names              1
total                         2
Select jobs to execute...
Execute 1 jobs...
[Wed Dec 11 14:56:49 2024]
localrule list_groups_with_users:
    input: /etc/group
    output: groups_with_users.txt
    jobid: 1
    reason: Missing output files: groups_with_users.txt
    resources: tmpdir=/tmp
[Wed Dec 11 14:56:49 2024]
Finished job 1.
1 of 2 steps (50%) done
Select jobs to execute...
Execute 1 jobs...
[Wed Dec 11 14:56:49 2024]
localrule sort_group_names:
    input: groups_with_users.txt
    output: sorted_groups.txt, only_users.txt
    jobid: 0
    reason: Missing output files: only_users.txt; Input files updated by another job: groups_with_users.txt
    resources: tmpdir=/tmp
[Wed Dec 11 14:56:49 2024]
Finished job 0.
2 of 2 steps (100%) done
Complete log: .snakemake/log/2024-12-11T145649.039189.snakemake.log
```
Removing `only_users.txt` and running `snakemake only_users.txt` should only
re-run the last step.
The rule syntax is rather straightforward: each rule has a list of inputs and
outputs (which are indexed from `0` by default, and can be named). The `shell`
directive specifies that we want to run a shell command. This is the most
flexible option. Three alternatives exist: the `run` directive, for inline
Python code written directly in the `Snakefile`; the `script` directive, which
names a Python (or other language) script to be run (Snakemake creates a context
for this script which allows it to access the input and output objects); and the
`notebook` directive, similar to the `script` directive, for which Snakemake
allows interactive execution (useful for postprocessing/data exploration).
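As an illustration of named inputs/outputs and the `run` directive, here is a
sketch assuming a hypothetical rule `count_user_names` that counts the lines of
`only_users.txt`. The helper `count_lines` and the file `user_count.txt` are not
part of the workflow above; the rule itself is shown in a comment, since it is
Snakefile syntax rather than plain Python.

```python
def count_lines(users_path, count_path):
    """Count non-empty lines of `users_path` and write the number to `count_path`."""
    with open(users_path) as src:
        n_lines = sum(1 for line in src if line.strip())
    with open(count_path, "w") as dst:
        dst.write(f"{n_lines}\n")
    return n_lines


# Hypothetical rule calling the helper above from a `run` block. Named inputs
# and outputs are accessed as attributes (input.users, output.count):
#
# rule count_user_names:
#     input:
#         users="only_users.txt",
#     output:
#         count="user_count.txt",
#     run:
#         count_lines(input.users, output.count)
```

Keeping the logic in a regular function makes it possible to test it outside of
Snakemake, while the rule only wires file names to arguments.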
Reading the
[documentation](https://snakemake.readthedocs.io/en/stable/index.html) is highly
recommended. Although the examples are often biology oriented, the features they
demonstrate are easily transposed to a mechanics environment.