added text on snakemake
parent 3662963438
commit ce5cdc9871
				|  | @ -2,7 +2,133 @@ Scientific simulations are often complex beasts: each step of a simulation requi | ||||||
| 
 | 
 | ||||||
| This is why making this process easier is often a good time investment: it requires thinking logically about a workflow, to split it into simple steps linked only by input and output data. This alone helps structure a workflow so that it's easier to add, remove or change simulation steps. Workflow management software helps to define a workflow graph and formalizes this process. It also allows tracking data dependencies, re-running only the steps affected by a change in input data, and configuring parameter spaces. | This is why making this process easier is often a good time investment: it requires thinking logically about a workflow, to split it into simple steps linked only by input and output data. This alone helps structure a workflow so that it's easier to add, remove or change simulation steps. Workflow management software helps to define a workflow graph and formalizes this process. It also allows tracking data dependencies, re-running only the steps affected by a change in input data, and configuring parameter spaces. | ||||||
| 
 | 
 | ||||||
| There are a good number of workflow management programs designed for scientific computation. Some run as a complex server process that contain a live description of a workflow. In my experience, deploying these systems is not worth the time investment. Instead, I recommend using a tool called [Snakemake](https://snakemake.github.io/), which runs in Python and is greatly inspired from `make`, a very established build system. While it has its own faults, I have found it quite useful to run complex simulations. | There are a good number of workflow management programs designed for scientific | ||||||
|  | computation. Some run as a complex server process that contains a live | ||||||
|  | description of a workflow. In my experience, deploying these systems is not | ||||||
|  | worth the time investment. Instead, I recommend using a rule-based tool like | ||||||
|  | [GNU Make](https://www.gnu.org/software/make/) (i.e. `Makefile`s) or | ||||||
|  | [Snakemake](https://snakemake.github.io/), which runs in Python and is greatly | ||||||
|  | inspired by GNU Make. While Snakemake has its own faults, I have found it quite | ||||||
|  | useful for running complex simulations. | ||||||
|  | 
 | ||||||
|  | # Rule-based workflow | ||||||
|  | 
 | ||||||
|  | To satisfy the reproducibility requirements of a given scientific study, it | ||||||
|  | must be possible to trace how each output (figure, table, etc.) was | ||||||
|  | generated: which code created the output, what parameters were used, which | ||||||
|  | intermediate output was processed, etc. | ||||||
|  | 
 | ||||||
|  | All this can be done with **rules**, which explain how, given a set of inputs, | ||||||
|  | an output is created. A rule can be thought of as a "step" of a simulation | ||||||
|  | pipeline, and rules can be chained together and combined, forming a *directed | ||||||
|  | acyclic graph*. This allows two things: | ||||||
|  | 
 | ||||||
|  | - Traceability: following the graph makes it possible to find the inputs (data *and* code) | ||||||
|  |   that were used to generate an output, which is a necessary condition for | ||||||
|  |   reproducibility. | ||||||
|  | - Update of outputs: if an input changes, it is easy to find the rules that need | ||||||
|  |   executing to update the outputs to reflect the changes. | ||||||
|  | 
 | ||||||
|  | These two features together provide a solid step towards reproducible simulation | ||||||
|  | work. | ||||||
|  | 
 | ||||||
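As a minimal sketch of these two properties, the following Python snippet models rules as plain records and walks the resulting graph. The rule names, file names and helper functions (`trace`, `rules_to_rerun`) are hypothetical and only serve the illustration; this is not how Make or Snakemake are implemented.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


# Hypothetical three-step pipeline; file names are illustrative only.
RULES = [
    Rule("mesh", ("geometry.json", "mesh.py"), ("mesh.msh",)),
    Rule("simulate", ("mesh.msh", "solver.py"), ("fields.h5",)),
    Rule("plot", ("fields.h5", "plot.py"), ("figure.pdf",)),
]

# Map each output file to the rule that produces it.
PRODUCER = {out: rule for rule in RULES for out in rule.outputs}


def trace(target: str) -> set[str]:
    """Return every file (data and code) that `target` depends on."""
    rule = PRODUCER.get(target)
    if rule is None:  # a source file: nothing upstream
        return set()
    found = set()
    for inp in rule.inputs:
        found.add(inp)
        found |= trace(inp)
    return found


def rules_to_rerun(changed: str) -> list[str]:
    """Return, in execution order, the rules affected by a changed file."""
    stale = {changed}
    order = []
    # RULES is listed in dependency order, so a single pass is enough here.
    for rule in RULES:
        if stale & set(rule.inputs):
            order.append(rule.name)
            stale |= set(rule.outputs)
    return order


print(sorted(trace("figure.pdf")))   # traceability
print(rules_to_rerun("solver.py"))   # update of outputs
```

Changing `solver.py` in this toy graph marks `simulate` and `plot` for re-execution but leaves `mesh` alone, which is the behaviour one expects from a rule-based tool.
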
|  | # GNU Make | ||||||
|  | Make is a program specifically designed to be a build system, i.e. a tool that | ||||||
|  | coordinates the compilation of a program's source code so that an executable or | ||||||
|  | library can be built. Each file of the build process is called a *target* and is | ||||||
|  | the output of some rule. Although its primary purpose is creating build files, | ||||||
|  | it can easily be made to manage outputs of simulations. While it has the | ||||||
|  | advantage of being installed on virtually every Linux machine used for | ||||||
|  | scientific work, it lacks some features (most notably integration with queue | ||||||
|  | systems), which makes it practical only for small cases (although I am sure some | ||||||
|  | shortcomings could be solved with a strong knowledge of Make). | ||||||
| 
 | 
 | ||||||
| # Snakemake | # Snakemake | ||||||
| A workflow in Snakemake is defined in a text file called `Snakefile`, the equivalent of Make's `Makefile`. This file defines *rules*, which are a basic unit defining a simulation step with three basic features: input, how to run the code, output. A rule basically explains how a given output is generated. Each output can be used as input to another rule, thereby creating a dependency graph (also called direct acyclic graph). One can then request the creating of a specific output, and the system will know which rules to execute to get to this output. | Snakemake is a tool written in Python to manage rule-based workflows. The | ||||||
|  | workflow definition is a rather simple text file (usually a `Snakefile`), which | ||||||
|  | typically looks like: | ||||||
|  | 
 | ||||||
|  | ```python | ||||||
|  | rule list_groups_with_users: | ||||||
|  |     input: | ||||||
|  |         "/etc/group" | ||||||
|  |     output: | ||||||
|  |         "groups_with_users.txt", # file which contains only groups with users | ||||||
|  |     shell: | ||||||
|  |         """cat {input} | awk -F ':' '$4 != "" {{ print $1,$4; }}' > {output} """ | ||||||
|  | 
 | ||||||
|  | rule sort_group_names: | ||||||
|  |     input: | ||||||
|  |         rules.list_groups_with_users.output[0] | ||||||
|  |     output: | ||||||
|  |         "sorted_groups.txt", # sorted file with group name and user | ||||||
|  |         "only_users.txt",    # only contains the user names | ||||||
|  |     shell: | ||||||
|  |         "sort < {input[0]} | tee {output[0]} | cut -d ' ' -f 2 > {output[1]}" | ||||||
|  | ``` | ||||||
|  | 
 | ||||||
|  | Executing the workflow with the command `snakemake only_users.txt` (to tell it | ||||||
|  | to generate the `only_users.txt` file) should execute both rules, with an output | ||||||
|  | similar to: | ||||||
|  | 
 | ||||||
|  | ``` | ||||||
|  | Building DAG of jobs... | ||||||
|  | Using shell: /usr/bin/bash | ||||||
|  | Provided cores: 20 | ||||||
|  | Rules claiming more threads will be scaled down. | ||||||
|  | Job stats: | ||||||
|  | job                       count | ||||||
|  | ----------------------  ------- | ||||||
|  | list_groups_with_users        1 | ||||||
|  | sort_group_names              1 | ||||||
|  | total                         2 | ||||||
|  | 
 | ||||||
|  | Select jobs to execute... | ||||||
|  | Execute 1 jobs... | ||||||
|  | 
 | ||||||
|  | [Wed Dec 11 14:56:49 2024] | ||||||
|  | localrule list_groups_with_users: | ||||||
|  |     input: /etc/group | ||||||
|  |     output: groups_with_users.txt | ||||||
|  |     jobid: 1 | ||||||
|  |     reason: Missing output files: groups_with_users.txt | ||||||
|  |     resources: tmpdir=/tmp | ||||||
|  | 
 | ||||||
|  | [Wed Dec 11 14:56:49 2024] | ||||||
|  | Finished job 1. | ||||||
|  | 1 of 2 steps (50%) done | ||||||
|  | Select jobs to execute... | ||||||
|  | Execute 1 jobs... | ||||||
|  | 
 | ||||||
|  | [Wed Dec 11 14:56:49 2024] | ||||||
|  | localrule sort_group_names: | ||||||
|  |     input: groups_with_users.txt | ||||||
|  |     output: sorted_groups.txt, only_users.txt | ||||||
|  |     jobid: 0 | ||||||
|  |     reason: Missing output files: only_users.txt; Input files updated by another job: groups_with_users.txt | ||||||
|  |     resources: tmpdir=/tmp | ||||||
|  | 
 | ||||||
|  | [Wed Dec 11 14:56:49 2024] | ||||||
|  | Finished job 0. | ||||||
|  | 2 of 2 steps (100%) done | ||||||
|  | Complete log: .snakemake/log/2024-12-11T145649.039189.snakemake.log | ||||||
|  | ``` | ||||||
|  | 
 | ||||||
|  | Removing `only_users.txt` and running `snakemake only_users.txt` should only | ||||||
|  | re-run the last step. | ||||||
|  | 
 | ||||||
|  | The rule syntax is rather straightforward: each rule has a list of inputs and | ||||||
|  | outputs (which are indexed from `0` by default, and can be named). The | ||||||
|  | `shell` directive specifies that we want to run a shell command. This is the | ||||||
|  | most flexible option. Alternatively, one can use the `run` directive and write | ||||||
|  | inline Python code directly in the `Snakefile`, or the `script` directive, which | ||||||
|  | specifies the name of a Python (or another language) script to be run (Snakemake | ||||||
|  | creates a context for this script which allows it to access the input and output | ||||||
|  | objects). Finally, the `notebook` directive is similar to the `script` | ||||||
|  | directive, but lets Snakemake execute a notebook interactively (useful for | ||||||
|  | postprocessing/data exploration). | ||||||
|  | 
 | ||||||
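To give an idea of what inline Python could look like, here is a sketch of the filtering done by the `awk` command of the first rule, written as a plain Python function. The function name `groups_with_users` and the sample data are made up for this illustration. Inside a `run` directive one would presumably read the lines from `input[0]` and write the result to `output[0]`, but the snippet below is kept free of Snakemake so that it runs on its own.

```python
def groups_with_users(lines):
    """Yield "name users" for each /etc/group entry that lists users."""
    for line in lines:
        fields = line.rstrip("\n").split(":")
        # Fields of /etc/group are name:password:GID:user_list
        if len(fields) >= 4 and fields[3] != "":
            yield f"{fields[0]} {fields[3]}"


# Made-up sample in the /etc/group format.
sample = [
    "root:x:0:\n",
    "wheel:x:10:alice,bob\n",
    "audio:x:92:alice\n",
]
print("\n".join(groups_with_users(sample)))
```

Whether this is preferable to the `shell` version is a matter of taste: the shell one-liner is shorter, while the Python version is easier to test in isolation.
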
|  | Reading the | ||||||
|  | [documentation](https://snakemake.readthedocs.io/en/stable/index.html) is highly | ||||||
|  | recommended. Although the examples are often biology-oriented, the features they | ||||||
|  | demonstrate are easily transposed to a mechanics environment. | ||||||
|  |  | ||||||
	 Lucas Frérot