added text on snakemake
parent
3662963438
commit
ce5cdc9871
@ -2,7 +2,133 @@ Scientific simulations are often complex beasts: each step of a simulation requi
This is why making this process easier is often a good time investment: it requires thinking logically about a workflow and splitting it into simple steps linked only by input and output data. This alone helps structure a workflow so that it's easier to add, remove or change simulation steps. Workflow management software helps define a workflow graph and formalizes this process. It also tracks data dependencies, re-runs the steps that need to run again when input data changes, and allows the configuration of parameter spaces.

There are a good number of workflow management programs designed for scientific computation. Some run as a complex server process that contains a live description of a workflow. In my experience, deploying these systems is not worth the time investment. Instead, I recommend using a tool called [Snakemake](https://snakemake.github.io/), which runs in Python and is heavily inspired by `make`, a very established build system. While it has its own faults, I have found it quite useful for running complex simulations.

There are a good number of workflow management programs designed for scientific
computation. Some run as a complex server process that contains a live
description of a workflow. In my experience, deploying these systems is not
worth the time investment. Instead, I recommend using a rule-based tool like
[GNU Make](https://www.gnu.org/software/make/) (i.e. `Makefile`s) or
[Snakemake](https://snakemake.github.io/), which runs in Python and is heavily
inspired by GNU Make. While it has its own faults, I have found it quite
useful for running complex simulations.

# Rule-based workflow

To satisfy the reproducibility requirements of a given scientific study, it
must be possible to trace how each output (figure, table, etc.) was generated:
which code created the output, what parameters were used, which intermediate
output was processed, etc.

All this can be done with **rules**, which explain how, given a set of inputs,
an output is created. A rule can be thought of as a "step" of a simulation
pipeline, and rules can be chained together and combined, forming a *directed
acyclic graph*. This allows two things:

- Traceability: following the graph makes it possible to find the inputs (data
  *and* code) that were used to generate an output, which is a necessary
  condition for reproducibility.
- Update of outputs: if an input changes, it is easy to find the rules that need
  executing to update the outputs to reflect the changes.

These two features together provide a solid step towards reproducible simulation
work.

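
The two properties above can be illustrated with a minimal Python sketch. This is not how Make or Snakemake are implemented: the `RULES` table, the file names, and the functions `trace` and `rules_to_rerun` are all hypothetical and only serve to show how a rule graph answers both questions.

```python
# Hypothetical rule table: rule name -> (input files, output files).
# Code files are listed as inputs too, so they are part of the graph.
RULES = {
    "mesh": (["geometry.json", "mesh.py"], ["mesh.vtk"]),
    "simulate": (["mesh.vtk", "params.yaml"], ["fields.h5"]),
    "plot": (["fields.h5", "plot.py"], ["figure.pdf"]),
}


def producer_of(rules):
    """Map each output file to the name of the rule that creates it."""
    return {out: name for name, (_, outs) in rules.items() for out in outs}


def trace(target, rules):
    """Traceability: return every file (data and code) `target` depends on."""
    producers = producer_of(rules)
    seen, stack = set(), [target]
    while stack:
        rule = producers.get(stack.pop())
        if rule is None:  # a source file: no rule produces it
            continue
        for dep in rules[rule][0]:
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def rules_to_rerun(changed, rules):
    """Update of outputs: return the rules affected by a change to `changed`."""
    stale_files, stale_rules = {changed}, set()
    progress = True
    while progress:  # repeat until no new stale rule is discovered
        progress = False
        for name, (ins, outs) in rules.items():
            if name not in stale_rules and stale_files & set(ins):
                stale_rules.add(name)
                stale_files.update(outs)
                progress = True
    return stale_rules


print(sorted(trace("figure.pdf", RULES)))
print(sorted(rules_to_rerun("params.yaml", RULES)))  # ['plot', 'simulate']
```

In this toy graph, changing `params.yaml` leaves the `mesh` rule untouched, which is exactly the kind of saving a rule-based tool provides automatically.
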
# GNU Make

Make is a program specifically designed to be a build system, i.e. a tool that
coordinates the compilation of a program's source code so that an executable or
library can be built. Each file of the build process is called a *target* and is
the output of some rule. Although its primary purpose is building programs,
it can easily be made to manage the outputs of simulations. While it has the
advantage of being installed on virtually every Linux machine used for
scientific work, it lacks some features (most notably integration with queue
systems), which makes it practical only for small cases (although I am sure some
shortcomings could be solved with a strong knowledge of Make).

# Snakemake
A workflow in Snakemake is defined in a text file called `Snakefile`, the equivalent of Make's `Makefile`. This file defines *rules*, which are a basic unit defining a simulation step with three basic features: input, how to run the code, output. A rule basically explains how a given output is generated. Each output can be used as input to another rule, thereby creating a dependency graph (also called a directed acyclic graph). One can then request the creation of a specific output, and the system will know which rules to execute to get to this output.

Snakemake is a tool written in Python to manage rule-based workflows. The
workflow definition is a rather simple text file (usually a `Snakefile`), which
typically looks like:

```python
rule list_groups_with_users:
    input:
        "/etc/group"
    output:
        "groups_with_users.txt",  # file which contains only groups with users
    shell:
        """cat {input} | awk -F ':' '$4 != "" {{ print $1,$4; }}' > {output} """

rule sort_group_names:
    input:
        rules.list_groups_with_users.output[0]
    output:
        "sorted_groups.txt",  # sorted file with group name and user
        "only_users.txt",  # only contains the user names
    shell:
        "sort < {input[0]} | tee {output[0]} | cut -d ' ' -f 2 > {output[1]}"
```

Executing the workflow with the command `snakemake only_users.txt` (to tell it
to generate the `only_users.txt` file) should execute both rules, with an output
similar to:

```
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 20
Rules claiming more threads will be scaled down.
Job stats:
job                       count
----------------------  -------
list_groups_with_users        1
sort_group_names              1
total                         2

Select jobs to execute...
Execute 1 jobs...

[Wed Dec 11 14:56:49 2024]
localrule list_groups_with_users:
    input: /etc/group
    output: groups_with_users.txt
    jobid: 1
    reason: Missing output files: groups_with_users.txt
    resources: tmpdir=/tmp

[Wed Dec 11 14:56:49 2024]
Finished job 1.
1 of 2 steps (50%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Dec 11 14:56:49 2024]
localrule sort_group_names:
    input: groups_with_users.txt
    output: sorted_groups.txt, only_users.txt
    jobid: 0
    reason: Missing output files: only_users.txt; Input files updated by another job: groups_with_users.txt
    resources: tmpdir=/tmp

[Wed Dec 11 14:56:49 2024]
Finished job 0.
2 of 2 steps (100%) done
Complete log: .snakemake/log/2024-12-11T145649.039189.snakemake.log
```

Removing `only_users.txt` and running `snakemake only_users.txt` again should
re-run only the last step, since `groups_with_users.txt` already exists and is
up to date.

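
To give an intuition for why only the last step runs again, here is a minimal sketch of a Make-style freshness check: an output needs rebuilding if it is missing or older than one of its inputs. This is a simplification and an assumption for illustration, not Snakemake's actual logic, which takes more criteria into account. The function names `needs_update` and `demo` are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def needs_update(inputs, outputs):
    """Return True if an output is missing or older than the newest input."""
    outputs = [Path(p) for p in outputs]
    if any(not p.exists() for p in outputs):
        return True
    newest_input = max(Path(p).stat().st_mtime for p in inputs)
    oldest_output = min(p.stat().st_mtime for p in outputs)
    return newest_input > oldest_output


def demo():
    """Walk through the cases, using explicit timestamps to avoid timing issues."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "groups_with_users.txt"
        out = Path(tmp) / "only_users.txt"
        src.write_text("audio alice\n")
        os.utime(src, (1000, 1000))
        results = [needs_update([src], [out])]      # output does not exist yet
        out.write_text("alice\n")
        os.utime(out, (2000, 2000))
        results.append(needs_update([src], [out]))  # output newer than input
        os.utime(src, (3000, 3000))
        results.append(needs_update([src], [out]))  # input modified afterwards
        out.unlink()
        results.append(needs_update([src], [out]))  # output removed by hand
        return results


print(demo())  # [True, False, True, True]
```

The last case mirrors the situation described above: the intermediate file is still there, so only the rule producing the removed output has work to do.
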
The rule syntax is rather straightforward: each rule has a list of inputs and
outputs (which are indexed from `0` by default, and can be named). The
`shell` directive specifies that we want to run a shell command. This is the
most flexible option. Alternatively, one can use:

- the `run` directive, to write inline Python code directly in the `Snakefile`;
- the `script` directive, which specifies the name of a Python (or another
  language) script to be run (Snakemake creates a context for this script which
  allows it to access the input and output objects);
- the `notebook` directive, similar to the `script` directive, for which
  Snakemake allows interactive execution (useful for postprocessing/data
  exploration).

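
As a rough sketch of what a file referenced by the `script` directive might look like, the Python below reimplements the `sort | cut` step of the example workflow. The file name `extract_users.py` and the function `extract_users` are hypothetical. When Snakemake runs a Python script this way, it makes a `snakemake` object available, from which `snakemake.input` and `snakemake.output` can be read; the guard at the end is only there so the file can also be imported and tested outside of Snakemake.

```python
# Hypothetical script file: scripts/extract_users.py


def extract_users(input_path, output_path):
    """Sort the lines of "group users" pairs and keep only the users column."""
    with open(input_path) as src:
        lines = sorted(line.rstrip("\n") for line in src if line.strip())
    with open(output_path, "w") as dst:
        for line in lines:
            fields = line.split(" ")
            # Like `cut -d ' ' -f 2`: keep the whole line if there is no space.
            dst.write((fields[1] if len(fields) > 1 else line) + "\n")


# Under the `script` directive, Snakemake defines a global `snakemake` object.
# Outside of Snakemake the name does not exist, so nothing runs on import.
if "snakemake" in globals():
    extract_users(snakemake.input[0], snakemake.output[0])
```

Keeping the logic in a plain function and the Snakemake-specific part in a single line at the bottom makes the script easy to test without running the whole workflow.
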
Reading the
[documentation](https://snakemake.readthedocs.io/en/stable/index.html) is highly
recommended. Although the examples are often biology-oriented, the features they
demonstrate transpose easily to a mechanics environment.