added text on snakemake
parent 3662963438
commit ce5cdc9871
@ -2,7 +2,133 @@ Scientific simulations are often complex beasts: each step of a simulation requi

This is why making this process easier is often a good time investment: it requires thinking logically about a workflow and splitting it into simple steps linked only by input and output data. This alone helps structure a workflow so that it is easier to add, remove or change simulation steps. Workflow management software helps define a workflow graph and formalizes this process. It also tracks data dependencies, re-runs the steps whose input data has changed, and allows the configuration of parameter spaces.

There are a good number of workflow management programs designed for scientific computation. Some run as a complex server process that contains a live description of a workflow. In my experience, deploying these systems is not worth the time investment. Instead, I recommend using a tool called [Snakemake](https://snakemake.github.io/), which runs in Python and is greatly inspired by `make`, a very established build system. While it has its own faults, I have found it quite useful to run complex simulations.

There are a good number of workflow management programs designed for scientific
computation. Some run as a complex server process that contains a live
description of a workflow. In my experience, deploying these systems is not
worth the time investment. Instead, I recommend using a rule-based tool like
[GNU Make](https://www.gnu.org/software/make/) (i.e. `Makefile`s) or
[Snakemake](https://snakemake.github.io/), which runs in Python and is greatly
inspired by GNU Make. While it has its own faults, I have found it quite
useful to run complex simulations.

# Rule-based workflow

In order to satisfy reproducibility requirements for a given scientific study,
there must be traceability of how each output (figure, table, etc.) was
generated: which code created the output, what parameters were used, which
intermediate output was processed, etc.

All this can be done with **rules**, which explain how, given a set of inputs,
an output is created. A rule can be thought of as a "step" of a simulation
pipeline, and rules can be chained together and combined, forming a *directed
acyclic graph*. This allows two things:

- Traceability: following the graph makes it possible to find the inputs (data
  *and* code) that were used to generate an output, which is a necessary
  condition for reproducibility.
- Update of outputs: if an input changes, it is easy to find the rules that need
  executing to update the outputs to reflect the changes.

These two features together provide a solid step towards reproducible simulation
work.
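
The two properties above can be illustrated with a minimal Python sketch. It is
not how Make or Snakemake are implemented: the `RULES` table, the file names and
the helpers `trace_inputs` and `rules_to_rerun` are hypothetical, chosen only to
show how a dependency graph answers both questions.

```python
# Hypothetical workflow: each output file maps to the inputs of its rule.
RULES = {
    "mesh.msh": ["geometry.geo"],
    "result.h5": ["mesh.msh", "solver.py", "params.yaml"],
    "figure.pdf": ["result.h5", "plot.py"],
}


def trace_inputs(target, rules=RULES):
    """Traceability: return every file (data and code) `target` depends on."""
    seen = set()
    stack = [target]
    while stack:
        current = stack.pop()
        for dep in rules.get(current, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def rules_to_rerun(changed, rules=RULES):
    """Update: return the outputs to regenerate when `changed` is modified."""
    stale = set()
    frontier = {changed}
    while frontier:
        # An output becomes stale if one of its inputs changed or is stale.
        frontier = {
            out
            for out, deps in rules.items()
            if out not in stale and frontier.intersection(deps)
        }
        stale |= frontier
    return stale


print(sorted(trace_inputs("figure.pdf")))
print(sorted(rules_to_rerun("params.yaml")))
```

In this toy graph, changing `params.yaml` marks `result.h5` and `figure.pdf` as
stale but leaves `mesh.msh` alone, which is the behaviour one expects from a
rule-based tool.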

# GNU Make
Make is a program specifically designed to be a build system, i.e. a tool that
coordinates the compilation of a program's source code so that an executable or
library can be built. Each file of the build process is called a *target* and is
the output of some rule. Although its primary purpose is creating build files,
it can easily be made to manage outputs of simulations. While it has the
advantage of being installed on virtually every Linux machine used for
scientific work, it lacks some features (most notably integration with queue
systems), which makes it practical only for small cases (although I am sure some
shortcomings could be solved with a strong knowledge of Make).
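
To decide whether a target must be rebuilt, Make compares file modification
times: a target is out of date if it does not exist or if one of its
prerequisites is newer. The Python sketch below mimics only this basic check,
as an illustration; the function name `needs_rebuild` is hypothetical and the
sketch ignores other Make features such as phony targets or pattern rules.

```python
import os


def needs_rebuild(target, prerequisites):
    """Return True if `target` is missing or older than any prerequisite.

    Simplified imitation of Make's timestamp comparison, for illustration.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

For example, `needs_rebuild("groups_with_users.txt", ["/etc/group"])` would
return `True` as long as the output file has not been generated yet.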

# Snakemake
A workflow in Snakemake is defined in a text file called `Snakefile`, the equivalent of Make's `Makefile`. This file defines *rules*, the basic units that describe a simulation step with three features: input, how to run the code, and output. A rule essentially explains how a given output is generated. Each output can be used as input to another rule, thereby creating a dependency graph (also called a directed acyclic graph). One can then request the creation of a specific output, and the system will know which rules to execute to get to this output.

Snakemake is a tool written in Python to manage rule-based workflows. The
workflow definition is a rather simple text file (usually a `Snakefile`), which
typically looks like:

```python
rule list_groups_with_users:
    input:
        "/etc/group"
    output:
        "groups_with_users.txt",  # file which contains only groups with users
    shell:
        """cat {input} | awk -F ':' '$4 != "" {{ print $1,$4; }}' > {output} """

rule sort_group_names:
    input:
        rules.list_groups_with_users.output[0]
    output:
        "sorted_groups.txt",  # sorted file with group name and user
        "only_users.txt",  # only contains the user names
    shell:
        "sort < {input[0]} | tee {output[0]} | cut -d ' ' -f 2 > {output[1]}"
```

Executing the workflow with the command `snakemake only_users.txt` (to tell it
to generate the `only_users.txt` file) should execute both rules, with an output
similar to:

```
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 20
Rules claiming more threads will be scaled down.
Job stats:
job                       count
----------------------  -------
list_groups_with_users        1
sort_group_names              1
total                         2

Select jobs to execute...
Execute 1 jobs...

[Wed Dec 11 14:56:49 2024]
localrule list_groups_with_users:
    input: /etc/group
    output: groups_with_users.txt
    jobid: 1
    reason: Missing output files: groups_with_users.txt
    resources: tmpdir=/tmp

[Wed Dec 11 14:56:49 2024]
Finished job 1.
1 of 2 steps (50%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Dec 11 14:56:49 2024]
localrule sort_group_names:
    input: groups_with_users.txt
    output: sorted_groups.txt, only_users.txt
    jobid: 0
    reason: Missing output files: only_users.txt; Input files updated by another job: groups_with_users.txt
    resources: tmpdir=/tmp

[Wed Dec 11 14:56:49 2024]
Finished job 0.
2 of 2 steps (100%) done
Complete log: .snakemake/log/2024-12-11T145649.039189.snakemake.log
```

Removing `only_users.txt` and running `snakemake only_users.txt` should only
re-run the last step.

The rule syntax is rather straightforward: each rule has a list of inputs and
outputs (which are numbered starting from `0` by default, and can be named). The
`shell` directive specifies that we want to run a shell command. This is the
most flexible option. Alternatively, one can use the `run` directive and write
inline Python code directly in the `Snakefile`; the `script` directive, which
specifies the name of a Python (or another language) script to be run (Snakemake
creates a context for this script which allows it to access the input and output
objects); or finally the `notebook` directive, similar to the `script`
directive, for which Snakemake allows interactive execution (useful for
postprocessing/data exploration).

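
As an illustration of the `script` directive, here is a minimal sketch of a
Python script that could replace the `cut` part of the `sort_group_names` rule.
The file name (say `scripts/extract_users.py`) and the helper `extract_users`
are hypothetical. The only Snakemake-specific part is the global `snakemake`
object, which Snakemake provides to scripts run through `script` and which
exposes the rule's `input` and `output`.

```python
def extract_users(sorted_groups_path, users_path):
    """Copy the second space-separated field of each line (the user names)."""
    with open(sorted_groups_path) as src, open(users_path, "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split(" ")
            if len(fields) >= 2:
                dst.write(fields[1] + "\n")


# Snakemake injects a global `snakemake` object when it runs the script;
# the guard lets the file be imported or tested outside of Snakemake.
if "snakemake" in globals():
    extract_users(snakemake.input[0], snakemake.output[0])
```

Keeping the actual work in a plain function makes the script easy to test on
its own, while the last two lines are the only place that depends on Snakemake.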

Reading the
[documentation](https://snakemake.readthedocs.io/en/stable/index.html) is highly
recommended. Although the examples are often biology-oriented, the features they
demonstrate are easily transposed to a mechanics environment.