diff --git a/Workflow-management.md b/Workflow-management.md
index f06ef2c..31a8241 100644
--- a/Workflow-management.md
+++ b/Workflow-management.md
@@ -2,7 +2,133 @@ Scientific simulations are often complex beasts: each step of a simulation requi
 
 This is why making this process easier is often a good time investment: it often requires thinking logically about a workflow, to split it into simple steps linked only by input and output data. This alone helps structure a workflow so that it's easier to add, remove or change simulation steps. Workflow management software helps to define a workflow graph and formalizes this process. It also allows tracking data dependency, re-run steps that require running when input data changes, and allows the configuration of parameter spaces.
 
-There are a good number of workflow management programs designed for scientific computation. Some run as a complex server process that contain a live description of a workflow. In my experience, deploying these systems is not worth the time investment. Instead, I recommend using a tool called [Snakemake](https://snakemake.github.io/), which runs in Python and is greatly inspired from `make`, a very established build system. While it has its own faults, I have found it quite useful to run complex simulations.
+There are a good number of workflow management programs designed for scientific
+computation. Some run as a complex server process that contains a live
+description of a workflow. In my experience, deploying these systems is not
+worth the time investment. Instead, I recommend using a rule-based tool like
+[GNU Make](https://www.gnu.org/software/make/) (i.e. `Makefile`s) or
+[Snakemake](https://snakemake.github.io/), which runs in Python and is greatly
+inspired by GNU Make. While it has its own faults, I have found it quite
+useful to run complex simulations.
+
+# Rule-based workflow
+
+In order to satisfy reproducibility requirements for a given scientific study,
+it must be possible to trace how each output (figure, table, etc.) was
+generated: which code created the output, what parameters were used, which
+intermediate output was processed, etc.
+
+All this can be done with **rules**, which explain how, given a set of inputs,
+an output is created. A rule can be thought of as a "step" of a simulation
+pipeline, and rules can be chained together and combined, forming a *directed
+acyclic graph*. This allows two things:
+
+- Traceability: following the graph lets one find the inputs (data *and* code)
+  that were used to generate an output, which is a necessary condition for
+  reproducibility.
+- Update of outputs: if an input changes, it is easy to find the rules that need
+  executing to update the outputs to reflect the changes.
+
+These two features together provide a solid step towards reproducible simulation
+work.
+
+# GNU Make
+Make is a program specifically designed to be a build system, i.e. a tool that
+coordinates the compilation of a program's source code so that an executable or
+library can be built. Each file of the build process is called a *target* and is
+the output of some rule. Although its primary purpose is creating build files,
+it can easily be made to manage outputs of simulations. While it has the
+advantage of being installed on virtually every Linux machine used for
+scientific work, it lacks some features (most notably integration with queue
+systems), which makes it practical only for small cases (although I am sure some
+shortcomings could be solved with a strong knowledge of Make).
 
 # Snakemake
-A workflow in Snakemake is defined in a text file called `Snakefile`, the equivalent of Make's `Makefile`. This file defines *rules*, which are a basic unit defining a simulation step with three basic features: input, how to run the code, output. A rule basically explains how a given output is generated. Each output can be used as input to another rule, thereby creating a dependency graph (also called direct acyclic graph). One can then request the creating of a specific output, and the system will know which rules to execute to get to this output.
\ No newline at end of file
+Snakemake is a tool written in Python to manage rule-based workflows. The
+workflow definition is a rather simple text file (usually a `Snakefile`), which
+typically looks like:
+
+```python
+rule list_groups_with_users:
+    input:
+        "/etc/group"
+    output:
+        "groups_with_users.txt",  # file which contains only groups with users
+    shell:
+        """cat {input} | awk -F ':' '$4 != "" {{ print $1,$4; }}' > {output} """
+
+rule sort_group_names:
+    input:
+        rules.list_groups_with_users.output[0]
+    output:
+        "sorted_groups.txt",  # sorted file with group name and user
+        "only_users.txt",  # only contains the user names
+    shell:
+        "sort < {input[0]} | tee {output[0]} | cut -d ' ' -f 2 > {output[1]}"
+```
+
+Executing the workflow with the command `snakemake only_users.txt` (to tell it
+to generate the `only_users.txt` file) should execute both rules, with an output
+similar to:
+
+```
+Building DAG of jobs...
+Using shell: /usr/bin/bash
+Provided cores: 20
+Rules claiming more threads will be scaled down.
+Job stats:
+job                       count
+----------------------  -------
+list_groups_with_users        1
+sort_group_names              1
+total                         2
+
+Select jobs to execute...
+Execute 1 jobs...
+
+[Wed Dec 11 14:56:49 2024]
+localrule list_groups_with_users:
+    input: /etc/group
+    output: groups_with_users.txt
+    jobid: 1
+    reason: Missing output files: groups_with_users.txt
+    resources: tmpdir=/tmp
+
+[Wed Dec 11 14:56:49 2024]
+Finished job 1.
+1 of 2 steps (50%) done
+Select jobs to execute...
+Execute 1 jobs...
+
+[Wed Dec 11 14:56:49 2024]
+localrule sort_group_names:
+    input: groups_with_users.txt
+    output: sorted_groups.txt, only_users.txt
+    jobid: 0
+    reason: Missing output files: only_users.txt; Input files updated by another job: groups_with_users.txt
+    resources: tmpdir=/tmp
+
+[Wed Dec 11 14:56:49 2024]
+Finished job 0.
+2 of 2 steps (100%) done
+Complete log: .snakemake/log/2024-12-11T145649.039189.snakemake.log
+```
+
+Removing `only_users.txt` and running `snakemake only_users.txt` should only
+re-run the last step.
+
+The rule syntax is rather straightforward: each rule has a list of inputs and
+outputs (which are numbered from `0` to `N` by default, and can be named). The
+`shell` directive specifies that we want to run a shell command. This is the
+most flexible option. Alternatively, one can use the `run` directive and write
+inline Python code directly in the `Snakefile`, the `script` directive, which
+specifies the name of a Python (or another language) script to be run (Snakemake
+creates a context for this script which allows it to access the input and output
+objects), or finally the `notebook` directive, similar to the `script`
+directive, for which Snakemake allows interactive execution (useful for
+postprocessing/data exploration).
+
+Reading the
+[documentation](https://snakemake.readthedocs.io/en/stable/index.html) is highly
+recommended. Although the examples are often biology-oriented, the features they
+demonstrate are easily transposed to a mechanics environment.
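
To make the rule idea concrete, here is a minimal sketch in plain Python of how a rule-based tool could decide which steps to run. It is an illustration only and not how Make or Snakemake are implemented: the `Rule` class and the `needs_run`, `build` and `demo` helpers are hypothetical names, and staleness is judged purely from file modification times (an assumption; real tools track more than that). The demo mirrors the two-rule example above using a small made-up group file instead of `/etc/group`.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """Hypothetical stand-in for a rule: inputs, outputs, and how to build."""
    name: str
    inputs: List[str]
    outputs: List[str]
    action: Callable[[List[str], List[str]], None]


def needs_run(rule: Rule) -> bool:
    """A rule is stale if an output is missing or older than any input."""
    if not all(os.path.exists(o) for o in rule.outputs):
        return True
    oldest_output = min(os.path.getmtime(o) for o in rule.outputs)
    return any(os.path.getmtime(i) > oldest_output for i in rule.inputs)


def build(target: str, rules: List[Rule], log: List[str]) -> None:
    """Walk the dependency graph depth-first and run only the stale rules."""
    producers = {out: r for r in rules for out in r.outputs}
    rule = producers.get(target)
    if rule is None:
        # No rule produces this file, so it must be an existing source input.
        if not os.path.exists(target):
            raise FileNotFoundError(target)
        return
    for dep in rule.inputs:
        build(dep, rules, log)
    if needs_run(rule):
        rule.action(rule.inputs, rule.outputs)
        log.append(rule.name)


def filter_groups(inputs: List[str], outputs: List[str]) -> None:
    """Keep only groups that list at least one user (fourth ':' field)."""
    with open(inputs[0]) as src, open(outputs[0], "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split(":")
            if len(fields) >= 4 and fields[3]:
                dst.write(f"{fields[0]} {fields[3]}\n")


def sort_groups(inputs: List[str], outputs: List[str]) -> None:
    """Write the input lines in sorted order."""
    with open(inputs[0]) as src:
        lines = sorted(src)
    with open(outputs[0], "w") as dst:
        dst.writelines(lines)


def demo(workdir: str) -> List[List[str]]:
    """Build twice; return the names of the rules executed each time."""
    source = os.path.join(workdir, "group")
    filtered = os.path.join(workdir, "groups_with_users.txt")
    final = os.path.join(workdir, "sorted_groups.txt")
    with open(source, "w") as f:
        f.write("wheel:x:10:alice\nnobody:x:99:\naudio:x:63:bob\n")
    rules = [
        Rule("list_groups_with_users", [source], [filtered], filter_groups),
        Rule("sort_group_names", [filtered], [final], sort_groups),
    ]
    first: List[str] = []
    build(final, rules, first)   # nothing exists yet, so both rules run
    os.remove(final)
    second: List[str] = []
    build(final, rules, second)  # only the last step is stale now
    return [first, second]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(demo(workdir))
```

As in the Snakemake run above, deleting the final output and rebuilding only re-executes the last rule, because the intermediate file is still present and not older than its input.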