added text on snakemake

Lucas Frérot 2024-12-11 15:07:12 +01:00
parent 3662963438
commit ce5cdc9871

This is why making this process easier is often a good time investment: it
often requires thinking logically about a workflow and splitting it into simple
steps linked only by input and output data. This alone helps structure a
workflow so that it is easier to add, remove or change simulation steps.
Workflow management software helps define such a workflow graph and formalizes
this process. It also tracks data dependencies, re-runs the steps affected by a
change in input data, and allows the configuration of parameter spaces.

There are a good number of workflow management programs designed for scientific
computation. Some run as a complex server process that contains a live
description of a workflow. In my experience, deploying these systems is not
worth the time investment. Instead, I recommend using a rule-based tool like
[GNU Make](https://www.gnu.org/software/make/) (i.e. `Makefile`s) or
[Snakemake](https://snakemake.github.io/), which runs in Python and is heavily
inspired by GNU Make. While it has its own faults, I have found it quite useful
for running complex simulations.

# Rule-based workflow
In order to satisfy reproducibility requirements for a given scientific study,
there must be traceability of how each output (figure, table, etc.) was
generated: which code created it, which parameters were used, which
intermediate outputs were processed, etc.

All this can be done with **rules**, which explain how, given a set of inputs,
an output is created. A rule can be thought of as a "step" of a simulation
pipeline, and rules can be chained together and combined, forming a *directed
acyclic graph*. This allows two things:

- Traceability: following the graph makes it possible to find the inputs (data
  *and* code) that were used to generate an output, which is a necessary
  condition for reproducibility.
- Update of outputs: if an input changes, it is easy to find the rules that
  need to be re-run to update the outputs to reflect the changes.

These two features together provide a solid step towards reproducible simulation
work.
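
To make these two points concrete, here is a minimal Python sketch of a rule
graph (this is not how Make or Snakemake are implemented, and all file names
are made up): following the graph backwards gives traceability, and comparing
modification times tells which outputs need updating.

```python
import os

# Hypothetical rule graph: each output maps to the inputs (data *and* code)
# it is generated from. File names are made up for illustration.
RULES = {
    "results/stress.csv": ["params.yaml", "scripts/run_simulation.py"],
    "figures/stress.pdf": ["results/stress.csv", "scripts/plot_stress.py"],
}


def trace(target):
    """Traceability: recursively list everything a target was generated from."""
    for dep in RULES.get(target, []):
        yield dep
        yield from trace(dep)


def needs_update(target):
    """Update: a target is stale if missing or older than one of its inputs.

    A real tool (Make, Snakemake) applies this check recursively over the graph.
    """
    if not os.path.exists(target):
        return True
    return any(
        os.path.getmtime(dep) > os.path.getmtime(target)
        for dep in RULES.get(target, [])
        if os.path.exists(dep)
    )


print(list(trace("figures/stress.pdf")))   # every input behind the figure
print(needs_update("figures/stress.pdf"))  # True here, since nothing exists yet
```
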
# GNU Make
Make is a program specifically designed to be a build system, i.e. a tool that
coordinates the compilation of a program's source code so that an executable or
library can be built. Each file of the build process is called a *target* and
is the output of some rule. Although its primary purpose is building software,
it can easily be made to manage the outputs of simulations. While it has the
advantage of being installed on virtually every Linux machine used for
scientific work, it lacks some features (most notably integration with queue
systems), which makes it practical only for small cases (although I am sure
some shortcomings could be solved with a strong knowledge of Make).
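
As an illustration, a hypothetical `Makefile` for a two-step simulation could
look like the following (the scripts `run_simulation.py` and `plot.py`, and all
file names, are placeholders):

```makefile
# Hypothetical two-step workflow: run a simulation, then plot the results.
# Recipe lines must be indented with a tab character.
all: stress.pdf

# Re-run the simulation whenever the parameters or the code change
results.csv: params.yaml run_simulation.py
	python run_simulation.py params.yaml > results.csv

# Post-process the results into a figure
stress.pdf: results.csv plot.py
	python plot.py results.csv stress.pdf

.PHONY: all
```

Running `make` builds `stress.pdf`, and touching `params.yaml` causes both
steps to be re-run.
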
# Snakemake
Snakemake is a tool written in Python to manage rule-based workflows. The
workflow definition is a rather simple text file (usually a `Snakefile`), which
typically looks like:

```python
rule list_groups_with_users:
    input:
        "/etc/group"
    output:
        "groups_with_users.txt",  # file which contains only groups with users
    shell:
        """cat {input} | awk -F ':' '$4 != "" {{ print $1,$4; }}' > {output} """


rule sort_group_names:
    input:
        rules.list_groups_with_users.output[0]
    output:
        "sorted_groups.txt",  # sorted file with group name and user
        "only_users.txt",     # only contains the user names
    shell:
        "sort < {input[0]} | tee {output[0]} | cut -d ' ' -f 2 > {output[1]}"
```
Executing the workflow with the command `snakemake only_users.txt` (to tell it
to generate the `only_users.txt` file) should execute both rules, with an output
similar to:
```
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 20
Rules claiming more threads will be scaled down.
Job stats:
job count
---------------------- -------
list_groups_with_users 1
sort_group_names 1
total 2
Select jobs to execute...
Execute 1 jobs...
[Wed Dec 11 14:56:49 2024]
localrule list_groups_with_users:
input: /etc/group
output: groups_with_users.txt
jobid: 1
reason: Missing output files: groups_with_users.txt
resources: tmpdir=/tmp
[Wed Dec 11 14:56:49 2024]
Finished job 1.
1 of 2 steps (50%) done
Select jobs to execute...
Execute 1 jobs...
[Wed Dec 11 14:56:49 2024]
localrule sort_group_names:
input: groups_with_users.txt
output: sorted_groups.txt, only_users.txt
jobid: 0
reason: Missing output files: only_users.txt; Input files updated by another job: groups_with_users.txt
resources: tmpdir=/tmp
[Wed Dec 11 14:56:49 2024]
Finished job 0.
2 of 2 steps (100%) done
Complete log: .snakemake/log/2024-12-11T145649.039189.snakemake.log
```
Removing `only_users.txt` and running `snakemake only_users.txt` should only
re-run the last step.
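
To check which jobs would run without executing anything, Snakemake's dry-run
flag (`-n`/`--dry-run`) can be used:

```
rm only_users.txt
snakemake -n only_users.txt
```
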
The rule syntax is rather straightforward: each rule has a list of inputs and
outputs (which are numbered from `0` to `N` by default, and can also be named).
The `shell` directive specifies that we want to run a shell command, which is
the most flexible option. The alternatives are:

- the `run` directive, to write inline Python code directly in the `Snakefile`;
- the `script` directive, which gives the name of a Python (or other language)
  script to be run; Snakemake creates a context for this script which gives it
  access to the input and output objects;
- the `notebook` directive, similar to `script`, except that Snakemake allows
  interactive execution of the notebook (useful for postprocessing/data
  exploration).
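
As an illustration, a rule using the `run` directive and one using the `script`
directive could look like this (the rule names, output files and
`scripts/plot.py` are made up):

```python
rule count_users:
    input:
        "only_users.txt"
    output:
        "user_count.txt"
    run:
        # inline Python: `input` and `output` are accessible directly
        with open(input[0]) as fin, open(output[0], "w") as fout:
            fout.write(f"{sum(1 for _ in fin)}\n")


# The hypothetical scripts/plot.py reads snakemake.input[0] and writes to
# snakemake.output[0] through the `snakemake` object provided by Snakemake.
rule plot_user_count:
    input:
        "user_count.txt"
    output:
        "user_count.pdf"
    script:
        "scripts/plot.py"
```
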
Reading the
[documentation](https://snakemake.readthedocs.io/en/stable/index.html) is highly
recommended. Although the examples are often biology-oriented, the features they
demonstrate are easily transposed to a mechanics environment.