added text on snakemake
parent 3662963438
commit ce5cdc9871
				|  | @ -2,7 +2,133 @@ Scientific simulations are often complex beasts: each step of a simulation requi | |||
| 
 | ||||
| This is why making this process easier is often a good time investment: it requires thinking logically about a workflow, to split it into simple steps linked only by input and output data. This alone helps structure a workflow so that it's easier to add, remove or change simulation steps. Workflow management software helps to define a workflow graph and formalizes this process. It also tracks data dependencies, re-runs only the steps affected when input data changes, and allows the configuration of parameter spaces. | ||||
| 
 | ||||
| There are a good number of workflow management programs designed for scientific computation. Some run as a complex server process that contains a live description of a workflow. In my experience, deploying these systems is not worth the time investment. Instead, I recommend using a tool called [Snakemake](https://snakemake.github.io/), which runs in Python and is heavily inspired by `make`, a very established build system. While it has its own faults, I have found it quite useful for running complex simulations. | ||||
| There are a good number of workflow management programs designed for scientific | ||||
| computation. Some run as a complex server process that contains a live | ||||
| description of a workflow. In my experience, deploying these systems is not | ||||
| worth the time investment. Instead, I recommend using a rule-based tool like | ||||
| [GNU Make](https://www.gnu.org/software/make/) (i.e. `Makefile`s) or | ||||
| [Snakemake](https://snakemake.github.io/), which runs in Python and is heavily | ||||
| inspired by GNU Make. While Snakemake has its own faults, I have found it quite | ||||
| useful for running complex simulations. | ||||
| 
 | ||||
| # Rule-based workflow | ||||
| 
 | ||||
| In order to satisfy reproducibility requirements for a given scientific study, | ||||
| it must be possible to trace how each output (figure, table, etc.) was | ||||
| generated: which code created the output, what parameters were used, which | ||||
| intermediate output was processed, etc. | ||||
| 
 | ||||
| All this can be done with **rules**, which explain how, given a set of inputs, | ||||
| an output is created. A rule can be thought of as a "step" of a simulation | ||||
| pipeline, and rules can be chained together and combined, forming a *directed | ||||
| acyclic graph*. This allows two things: | ||||
| 
 | ||||
| - Traceability: following the graph makes it possible to find the inputs (data | ||||
|   *and* code) that were used to generate an output, which is a necessary | ||||
|   condition for reproducibility. | ||||
| - Update of outputs: if an input changes, it is easy to find the rules that need | ||||
|   to be executed so that the outputs reflect the changes. | ||||
| 
 | ||||
| These two features together provide a solid step towards reproducible simulation | ||||
| work. | ||||
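
The two properties above can be illustrated with a minimal sketch in plain Python. It is not how Make or Snakemake are implemented; the `Rule` class, the helper functions and the file names (`geometry.geo`, `params.yaml`, etc.) are all hypothetical and only serve to show the idea of walking a rule graph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One workflow step: a set of input files is turned into output files."""
    name: str
    inputs: tuple
    outputs: tuple


def trace(target, rules):
    """Traceability: names of the rules needed to build `target`, in run order."""
    producers = {out: rule for rule in rules for out in rule.outputs}
    ordered = []

    def visit(path):
        rule = producers.get(path)
        if rule is None or rule in ordered:  # source file, or already scheduled
            return
        for dependency in rule.inputs:  # dependencies are scheduled first
            visit(dependency)
        ordered.append(rule)

    visit(target)
    return [rule.name for rule in ordered]


def outdated(changed, rules):
    """Update of outputs: set of rule names to re-run after `changed` is modified."""
    dirty = {changed}
    to_run = set()
    progress = True
    while progress:  # propagate until no new rule is affected
        progress = False
        for rule in rules:
            if rule.name not in to_run and dirty.intersection(rule.inputs):
                to_run.add(rule.name)
                dirty.update(rule.outputs)
                progress = True
    return to_run


# Hypothetical three-step simulation pipeline
rules = [
    Rule("mesh", ("geometry.geo",), ("mesh.msh",)),
    Rule("simulate", ("mesh.msh", "params.yaml"), ("results.h5",)),
    Rule("plot", ("results.h5",), ("figure.pdf",)),
]

print(trace("figure.pdf", rules))      # rules behind the figure
print(outdated("params.yaml", rules))  # rules affected by a parameter change
```

In this toy pipeline, changing `params.yaml` leaves the `mesh` rule untouched, which is exactly the kind of saving a rule-based tool provides on expensive simulation steps.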
| 
 | ||||
| # GNU Make | ||||
| Make is a program specifically designed to be a build system, i.e. a tool that | ||||
| coordinates the compilation of a program's source code so that an executable or | ||||
| library can be built. Each file of the build process is called a *target* and is | ||||
| the output of some rule. Although its primary purpose is creating build files, | ||||
| it can easily be made to manage outputs of simulations. While it has the | ||||
| advantage of being installed on virtually every Linux machine used for | ||||
| scientific work, it lacks some features (most notably integration with queue | ||||
| systems), which makes it practical only for small cases (although I am sure some | ||||
| shortcomings could be solved with a strong knowledge of Make). | ||||
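
As a rough illustration of how Make decides whether a target must be rebuilt, the sketch below compares file modification times in Python. It is a simplified picture under the assumption that only timestamps matter; the function name `needs_rebuild` and the file names are hypothetical, and the timestamps are set by hand to keep the example deterministic.

```python
import os
import tempfile


def needs_rebuild(target, prerequisites):
    """Return True if `target` is missing or older than any prerequisite."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_time for p in prerequisites)


with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "input.dat")   # hypothetical simulation input
    result = os.path.join(workdir, "output.dat")  # hypothetical simulation output

    open(source, "w").close()
    missing = needs_rebuild(result, [source])     # output does not exist yet

    open(result, "w").close()
    os.utime(source, (100, 100))                  # input older than output
    os.utime(result, (200, 200))
    fresh = needs_rebuild(result, [source])

    os.utime(source, (300, 300))                  # input modified afterwards
    stale = needs_rebuild(result, [source])

print(missing, fresh, stale)
```

The same check, applied recursively along the dependency graph, is what lets a rule-based tool skip steps whose outputs are already up to date.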
| 
 | ||||
| # Snakemake | ||||
| A workflow in Snakemake is defined in a text file called `Snakefile`, the equivalent of Make's `Makefile`. This file defines *rules*, which are a basic unit defining a simulation step with three basic features: input, how to run the code, output. A rule basically explains how a given output is generated. Each output can be used as input to another rule, thereby creating a dependency graph (also called directed acyclic graph). One can then request the creation of a specific output, and the system will know which rules to execute to get to this output. | ||||
| Snakemake is a tool written in Python to manage rule-based workflows. The | ||||
| workflow definition is a rather simple text file (usually a `Snakefile`), which | ||||
| typically looks like: | ||||
| 
 | ||||
| ```python | ||||
| rule list_groups_with_users: | ||||
|     input: | ||||
|         "/etc/group" | ||||
|     output: | ||||
|         "groups_with_users.txt", # file which contains only groups with users | ||||
|     shell: | ||||
|         """cat {input} | awk -F ':' '$4 != "" {{ print $1,$4; }}' > {output} """ | ||||
| 
 | ||||
| rule sort_group_names: | ||||
|     input: | ||||
|         rules.list_groups_with_users.output[0] | ||||
|     output: | ||||
|         "sorted_groups.txt", # sorted file with group name and user | ||||
|         "only_users.txt",    # only contains the user names | ||||
|     shell: | ||||
|         "sort < {input[0]} | tee {output[0]} | cut -d ' ' -f 2 > {output[1]}" | ||||
| ``` | ||||
| 
 | ||||
| Executing the workflow with the command `snakemake only_users.txt` (to tell it | ||||
| to generate the `only_users.txt` file) should execute both rules, with an output | ||||
| similar to: | ||||
| 
 | ||||
| ``` | ||||
| Building DAG of jobs... | ||||
| Using shell: /usr/bin/bash | ||||
| Provided cores: 20 | ||||
| Rules claiming more threads will be scaled down. | ||||
| Job stats: | ||||
| job                       count | ||||
| ----------------------  ------- | ||||
| list_groups_with_users        1 | ||||
| sort_group_names              1 | ||||
| total                         2 | ||||
| 
 | ||||
| Select jobs to execute... | ||||
| Execute 1 jobs... | ||||
| 
 | ||||
| [Wed Dec 11 14:56:49 2024] | ||||
| localrule list_groups_with_users: | ||||
|     input: /etc/group | ||||
|     output: groups_with_users.txt | ||||
|     jobid: 1 | ||||
|     reason: Missing output files: groups_with_users.txt | ||||
|     resources: tmpdir=/tmp | ||||
| 
 | ||||
| [Wed Dec 11 14:56:49 2024] | ||||
| Finished job 1. | ||||
| 1 of 2 steps (50%) done | ||||
| Select jobs to execute... | ||||
| Execute 1 jobs... | ||||
| 
 | ||||
| [Wed Dec 11 14:56:49 2024] | ||||
| localrule sort_group_names: | ||||
|     input: groups_with_users.txt | ||||
|     output: sorted_groups.txt, only_users.txt | ||||
|     jobid: 0 | ||||
|     reason: Missing output files: only_users.txt; Input files updated by another job: groups_with_users.txt | ||||
|     resources: tmpdir=/tmp | ||||
| 
 | ||||
| [Wed Dec 11 14:56:49 2024] | ||||
| Finished job 0. | ||||
| 2 of 2 steps (100%) done | ||||
| Complete log: .snakemake/log/2024-12-11T145649.039189.snakemake.log | ||||
| ``` | ||||
| 
 | ||||
| Removing `only_users.txt` and running `snakemake only_users.txt` should only | ||||
| re-run the last step. | ||||
| 
 | ||||
| The rule syntax is rather straightforward: each rule has a list of inputs and | ||||
| outputs, which are indexed starting from `0` by default and can also be named. | ||||
| The `shell` directive specifies that we want to run a shell command. This is | ||||
| the most flexible option. Alternatively, one can use the `run` directive to | ||||
| write inline Python code directly in the `Snakefile`, or the `script` | ||||
| directive, which specifies the name of a Python (or another language) script to | ||||
| be run. Snakemake creates a context for this script which allows it to access | ||||
| the input and output objects. Finally, the `notebook` directive is similar to | ||||
| the `script` directive, but Snakemake allows interactive execution of the | ||||
| notebook (useful for postprocessing/data exploration). | ||||
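
To make the `script` directive more concrete, here is a sketch of what such a script could look like. The file name (say `scripts/extract_users.py`) and the function `extract_user_lists` are hypothetical; the only Snakemake-specific part is the `snakemake` object, which Snakemake injects into the script and which gives access to the rule's inputs and outputs. The guard at the end is a design choice of this sketch, so that the file can also be imported and tested without Snakemake.

```python
def extract_user_lists(groups_path, users_path):
    """Write the second whitespace-separated column of each line to `users_path`."""
    with open(groups_path) as groups, open(users_path, "w") as users:
        for line in groups:
            fields = line.split()
            if len(fields) >= 2:  # skip lines without a user column
                users.write(fields[1] + "\n")


# When run through the `script` directive, Snakemake defines a global
# `snakemake` object; outside Snakemake this block is simply skipped.
if "snakemake" in globals():
    extract_user_lists(snakemake.input[0], snakemake.output[0])
```

A rule using it would replace its `shell` entry with `script: "scripts/extract_users.py"`, assuming that path relative to the `Snakefile`.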
| 
 | ||||
| Reading the | ||||
| [documentation](https://snakemake.readthedocs.io/en/stable/index.html) is highly | ||||
| recommended. Although the examples are often biology-oriented, the features they | ||||
| demonstrate are easily transposed to a mechanics environment. | ||||
|  |  | |||
	 Lucas Frérot