added text on snakemake

2024-12-11 15:07:12 +01:00 · 2024-12-11 15:07:12 +01:00 · ce5cdc9871
parent 3662963438
commit ce5cdc9871
1 changed files with 128 additions and 2 deletions
--- a/Workflow-management.md
+++ b/Workflow-management.md
@ -2,7 +2,133 @@ Scientific simulations are often complex beasts: each step of a simulation requi

 This is why making this process easier is often a good time investment: it often requires thinking logically about a workflow, to split it into simple steps linked only by input and output data. This alone helps structure a workflow so that it's easier to add, remove or change simulation steps. Workflow management software helps to define a workflow graph and formalizes this process. It also allows tracking data dependency, re-run steps that require running when input data changes, and allows the configuration of parameter spaces.

-There are a good number of workflow management programs designed for scientific computation. Some run as a complex server process that contain a live description of a workflow. In my experience, deploying these systems is not worth the time investment. Instead, I recommend using a tool called [Snakemake](https://snakemake.github.io/), which runs in Python and is greatly inspired from `make`, a very established build system. While it has its own faults, I have found it quite useful to run complex simulations.
+There are a good number of workflow management programs designed for scientific
+computation. Some run as a complex server process that contains a live
+description of a workflow. In my experience, deploying these systems is not
+worth the time investment. Instead, I recommend using a rule-based tool like
+[GNU Make](https://www.gnu.org/software/make/) (i.e. `Makefile`s) or
+[Snakemake](https://snakemake.github.io/), which runs in Python and is heavily
+inspired by GNU Make. While Snakemake has its own faults, I have found it quite
+useful for running complex simulations.
+
+# Rule-based workflow
+
+In order to satisfy reproducibility requirements for a given scientific study,
+it must be possible to trace how each output (figure, table, etc.) was
+generated: which code created the output, what parameters were used, which
+intermediate output was processed, etc.
+
+All this can be done with **rules**, which explain how, given a set of inputs,
+an output is created. A rule can be thought of as a "step" of a simulation
+pipeline, and rules can be chained together and combined, forming a *directed
+acyclic graph*. This allows two things:
+
+- Traceability: following the graph reveals the inputs (data *and* code)
+  that were used to generate an output, which is a necessary condition for
+  reproducibility.
+- Update of outputs: if an input changes, it is easy to find the rules that need
+  executing to update the outputs to reflect the changes.
+
+These two features together provide a solid step towards reproducible simulation
+work.
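
To make these two properties concrete, here is a minimal Python sketch of a rule graph. The file names (`geometry.geo`, `mesh.msh`, etc.) and the `RULES`, `trace` and `outdated` names are hypothetical, chosen only for illustration; this is not how Make or Snakemake are implemented internally.

```python
# Hypothetical rule table: each output maps to the inputs (data and code)
# of the rule that creates it.
RULES = {
    "mesh.msh": ["geometry.geo"],
    "result.h5": ["mesh.msh", "solver.py", "params.yaml"],
    "figure.pdf": ["result.h5", "plot.py"],
}


def trace(target, rules):
    """Traceability: return every file that `target` depends on."""
    found = set()
    stack = list(rules.get(target, []))
    while stack:
        item = stack.pop()
        if item not in found:
            found.add(item)
            # Follow the graph upstream through intermediate outputs.
            stack.extend(rules.get(item, []))
    return found


def outdated(changed, rules):
    """Update of outputs: return every output to regenerate if `changed` changes."""
    stale = set()
    grew = True
    while grew:  # repeat until no new stale output is discovered
        grew = False
        for output, inputs in rules.items():
            if output in stale:
                continue
            if changed in inputs or stale.intersection(inputs):
                stale.add(output)
                grew = True
    return stale
```

With this table, `trace("figure.pdf", RULES)` walks back to `geometry.geo`, and `outdated("params.yaml", RULES)` flags `result.h5` and `figure.pdf` while leaving `mesh.msh` alone.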
+
+# GNU Make
+Make is a program specifically designed to be a build system, i.e. a tool that
+coordinates the compilation of a program's source code so that an executable or
+library can be built. Each file of the build process is called a *target* and is
+the output of some rule. Although its primary purpose is creating build files,
+it can easily be made to manage outputs of simulations. While it has the
+advantage of being installed on virtually every Linux machine used for
+scientific work, it lacks some features (most notably integration with queue
+systems), which makes it practical only for small cases (although I am sure some
+shortcomings could be solved with a strong knowledge of Make).
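
Make decides whether a target needs rebuilding by comparing file modification times: a target is rebuilt if it is missing or older than one of its prerequisites. The Python sketch below mimics that check; the `needs_rebuild` helper and the file names are hypothetical, and real Make handles many more cases (phony targets, pattern rules, etc.).

```python
import os
import tempfile


def needs_rebuild(target, prerequisites):
    """Return True if `target` is missing or older than any prerequisite."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_time for p in prerequisites)


# Small demonstration with explicit timestamps in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.dat")
    out = os.path.join(tmp, "output.dat")

    open(src, "w").close()
    status_missing = needs_rebuild(out, [src])  # output does not exist yet

    open(out, "w").close()
    os.utime(src, (1000, 1000))  # input older than output
    os.utime(out, (2000, 2000))
    status_fresh = needs_rebuild(out, [src])

    os.utime(src, (3000, 3000))  # input modified after output was created
    status_stale = needs_rebuild(out, [src])
```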

 # Snakemake
-A workflow in Snakemake is defined in a text file called `Snakefile`, the equivalent of Make's `Makefile`. This file defines *rules*, which are a basic unit defining a simulation step with three basic features: input, how to run the code, output. A rule basically explains how a given output is generated. Each output can be used as input to another rule, thereby creating a dependency graph (also called direct acyclic graph). One can then request the creating of a specific output, and the system will know which rules to execute to get to this output.
+Snakemake is a tool written in Python to manage rule-based workflows. The
+workflow definition is a rather simple text file (usually a `Snakefile`), which
+typically looks like:
+
+```python
+rule list_groups_with_users:
+    input:
+        "/etc/group"
+    output:
+        "groups_with_users.txt", # file which contains only groups with users
+    shell:
+        """cat {input} | awk -F ':' '$4 != "" {{ print $1,$4; }}' > {output} """
+
+rule sort_group_names:
+    input:
+        rules.list_groups_with_users.output[0]
+    output:
+        "sorted_groups.txt", # sorted file with group name and user
+        "only_users.txt",    # only contains the user names
+    shell:
+        "sort < {input[0]} | tee {output[0]} | cut -d ' ' -f 2 > {output[1]}"
+```
+
+Executing the workflow with the command `snakemake only_users.txt` (to tell it
+to generate the `only_users.txt` file) should execute both rules, with an output
+similar to:
+
+```
+Building DAG of jobs...
+Using shell: /usr/bin/bash
+Provided cores: 20
+Rules claiming more threads will be scaled down.
+Job stats:
+job                       count
+----------------------  -------
+list_groups_with_users        1
+sort_group_names              1
+total                         2
+
+Select jobs to execute...
+Execute 1 jobs...
+
+[Wed Dec 11 14:56:49 2024]
+localrule list_groups_with_users:
+    input: /etc/group
+    output: groups_with_users.txt
+    jobid: 1
+    reason: Missing output files: groups_with_users.txt
+    resources: tmpdir=/tmp
+
+[Wed Dec 11 14:56:49 2024]
+Finished job 1.
+1 of 2 steps (50%) done
+Select jobs to execute...
+Execute 1 jobs...
+
+[Wed Dec 11 14:56:49 2024]
+localrule sort_group_names:
+    input: groups_with_users.txt
+    output: sorted_groups.txt, only_users.txt
+    jobid: 0
+    reason: Missing output files: only_users.txt; Input files updated by another job: groups_with_users.txt
+    resources: tmpdir=/tmp
+
+[Wed Dec 11 14:56:49 2024]
+Finished job 0.
+2 of 2 steps (100%) done
+Complete log: .snakemake/log/2024-12-11T145649.039189.snakemake.log
+```
+
+Removing `only_users.txt` and running `snakemake only_users.txt` should only
+re-run the last step.
+
+The rule syntax is rather straightforward: each rule has a list of inputs and
+outputs (which are indexed from `0` by default, and can be named). The `shell`
+directive specifies that we want to run a shell command. This is the most
+flexible option. Alternatively, one can use:
+
+- the `run` directive, to write inline Python code directly in the `Snakefile`;
+- the `script` directive, which specifies the name of a Python (or another
+  language) script to be run (Snakemake creates a context for this script which
+  allows it to access the input and output objects);
+- the `notebook` directive, similar to the `script` directive, for which
+  Snakemake allows interactive execution (useful for postprocessing/data
+  exploration).
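
As a rough illustration of the `script` directive, the sketch below shows what a script extracting the user names from the sorted file of the example above might look like. The `extract_users` helper is hypothetical. Inside a script run by Snakemake, a global object named `snakemake` gives access to the rule's inputs and outputs; the guard at the end lets the file also be imported or tested outside of Snakemake.

```python
def extract_users(in_path, out_path):
    """Write the second whitespace-separated field of each line to `out_path`.

    Lines with fewer than two fields are skipped.
    """
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            fields = line.split()
            if len(fields) >= 2:
                dst.write(fields[1] + "\n")


# Snakemake injects the `snakemake` object when the file is executed through
# the `script` directive; outside of Snakemake this branch is simply skipped.
if "snakemake" in globals():
    extract_users(snakemake.input[0], snakemake.output[0])
```

Keeping the logic in a plain function means it can be checked on a small test file before being plugged into the workflow.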
+
+Reading the
+[documentation](https://snakemake.readthedocs.io/en/stable/index.html) is highly
+recommended. Although the examples are often biology-oriented, the features they
+demonstrate are easily transposed to a mechanics environment.