added dataset page from Lars' group wiki

Lucas Frérot 2024-12-18 13:26:33 +01:00
parent 9054ee638a
commit e184f85690
1 changed files with 156 additions and 2 deletions

@ -1,2 +1,156 @@
# Data management policies
# Data management tools
#### Disclaimer: this page is adapted from the internal wiki of Lars Pastewka's group
# Datasets
A _dataset_ is a collection of files that belong together logically. For example, this could be...
* ...input files for a simulation and the output of that simulation.
* ...data generated during post-processing of a simulation.
* ...the LaTeX manuscript of a paper, the raw data underlying that paper and the scripts required to plot that data.
The idea behind a dataset is that it becomes immutable at some point. The process of making a dataset immutable is called 'freezing'. Once a dataset is frozen, it can be moved between storage systems but can no longer be edited. Changing a frozen dataset therefore requires creating a new dataset and deleting the old one. While this may seem cumbersome, it has the advantage that the storage backend can be kept simple (and cheap) and that inadvertent modification of primary data is not possible.
## Granularity
What goes into a _dataset_ is the decision of the dataset's creator. In general, a finer granularity makes it easier to move, copy and back up datasets. We distinguish between...
* ..._primary data_, i.e. the input files of a simulation and the output of that simulation...
* ...and _derived data_, which is obtained by processing primary or derived data. Derived data could for example be the structure factor or some correlation function obtained from a molecular dynamics trajectory. The trajectory itself would be primary data.
This distinction allows you, for example, to freeze a dataset once a simulation has finished, while there may still be postprocessing steps that follow. Those would go into a separate dataset. The relationship between datasets can be specified using the `derived_from` property discussed below.
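To illustrate how `derived_from` links datasets, here is a minimal Python sketch that walks such links back to the primary data. The identifiers, the `readmes` dictionary and the `primary_sources` helper are hypothetical and only mimic the structure of the metadata template shown further down. They are not part of `dtool`.

```python
# Hypothetical README metadata of three datasets, keyed by made-up identifiers.
readmes = {
    "uuid-trajectory": {"description": "MD run (primary data)"},
    "uuid-structure-factor": {
        "description": "Structure factor",
        "derived_from": [{"uuid": "uuid-trajectory"}],
    },
    "uuid-figures": {
        "description": "Plots for the paper",
        "derived_from": [{"uuid": "uuid-structure-factor"}],
    },
}


def primary_sources(uuid, readmes):
    """Return the identifiers of all primary datasets that `uuid` derives from."""
    parents = readmes[uuid].get("derived_from", [])
    if not parents:
        # No `derived_from` entry: this dataset is primary data itself.
        return {uuid}
    found = set()
    for parent in parents:
        found |= primary_sources(parent["uuid"], readmes)
    return found


print(primary_sources("uuid-figures", readmes))
```

Because every derived dataset records its parents, the provenance of a plot can be traced to the simulation that produced the underlying data.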
## Owners
Each dataset has at least one owner. This is typically the person who created the dataset. __Note that the owner has scientific responsibility for the contents of the dataset.__ Part of this scientific responsibility is for example ensuring that the data has not been fabricated or falsified. Attaching an owner to a dataset allows traceability of these scientific responsibilities.
## Reviewers
Before archival, we will review the dataset. This review process ensures that the data inside the dataset conforms with our data management policies. The name(s) of the reviewer(s) will be attached to the dataset.
## Metadata
A dataset _always_ has metadata attached to it. The absolute minimum of this metadata would be the user who owns the dataset and the creation date. Please always use your clear name, your email and your [ORCID](https://orcid.org/).
The template below reflects the status quo of administrative metadata. Feel free to add any other type of metadata to the dataset, even if not standardized. The more data is available, the easier it will be later to search for a specific dataset.
# dtool
The above abstract description of a dataset requires a standardized implementation. One possibility to maintain datasets is with a simple tool called [dtool](https://dtool.readthedocs.io/). `dtool` defines a standardized way to attach metadata to a dataset and handles transfer of datasets between storage systems. Please maintain all of your data in `dtool` datasets. There is also a [paper](https://peerj.com/articles/6562/#) describing `dtool`.
You may find more help and (possibly livMatS-specific) usage guidelines on dtool at the [livMatS RDM wiki](https://github.com/livMatS/RDM-Wiki-public), including an illustrated [GUI quick start guide](https://github.com/livMatS/RDM-Wiki-public/blob/master/rdm/dtool/src/020_gui/005_quick_start.md) and a [PDF-compiled version of those guidelines](https://github.com/livmats/RDM-Wiki-public/releases/latest/download/dtool-guidelines-latest.pdf).
## Installation
Install `dtool` via
```
python3 -m pip install [--user] dtool
```
## Configuration
Please tell `dtool` about yourself. It needs your name and your email address. This information will later be used in the metadata template.
```
dtool config user name "Your Full Name"
dtool config user email "your@email.com"
```
## Managing datasets
A `dtool` dataset consists of a number of files and some metadata, including information on file sizes, hashes (generated automatically) and keywords set by the user. Start by creating an empty dataset:
```
dtool create <name_of_the_dataset>
```
This will create an empty dataset that contains the file `README.yml` and the subdirectory `data`. There is also a hidden directory `.dtool` that contains administrative data used by dtool.
You can add files to the dataset by
```
dtool add item <file> <name_of_the_dataset>
```
or by simply placing the files into the subdirectory `<name_of_the_dataset>/data/`. Note that a possible workflow is to create a dataset before running a simulation, and then run that simulation within the `data` subdirectory of the dataset.
## Editing metadata
Metadata is a type of dictionary that is attached to the dataset. In its simplest form, it is a set of keys with accompanying values. At any stage you can add keys and values to the dataset using
```
dtool readme edit <name_of_the_dataset>
```
or by just editing the file `README.yml` inside the dataset directory. (Note that `dtool readme edit` just launches an editor -- typically `vi` -- for that file.) This file is in the [YAML](https://yaml.org/) format. Make sure to familiarize yourself with _YAML_. You can use [yamllint](https://github.com/adrienverge/yamllint) to check the syntax.
When editing the metadata `README` file, you are completely free in choosing key-value pairs. `dtool` incorporates a mechanism in which you can set predefined keys by providing a _template_. This template can be stored in any file. Our recommendation is to use `$HOME/.dtool_readme.yml`. Please use the following template file:
```yaml
project: Project name
description: Short description
owners:
  - name: {DTOOL_USER_FULL_NAME}
    email: {DTOOL_USER_EMAIL}
    username: {username}
    orcid: Please obtain an ORCID at orcid.org
funders:
  - organization: Please add funding organization
    program: Please add the specific program within which funding occurred
    code: Please add funding code of that organization
creation_date: '{date}'
expiration_date: '{date}'
derived_from:
  - uuid: UUID of the primary data or the previous simulation step
```
Note that this structure allows the specification of an arbitrary number of owners and [funders](Funders.md). `derived_from` is optional and only necessary if the dataset contains derived data. In that case, please specify the UUID of the primary data.
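The placeholders in curly braces are filled in when the README is generated. As a rough illustration of that substitution step, the following Python sketch fills a shortened copy of the template using plain string formatting. This is an assumption about the mechanism, not a description of `dtool`'s internals, and the `fill_template` helper and the example values are hypothetical.

```python
# Shortened copy of the template above; placeholders are in curly braces.
template = """\
owners:
  - name: {DTOOL_USER_FULL_NAME}
    email: {DTOOL_USER_EMAIL}
    username: {username}
creation_date: '{date}'
"""


def fill_template(template, full_name, email, username, date):
    """Substitute the placeholders with concrete values (illustrative only)."""
    return template.format(
        DTOOL_USER_FULL_NAME=full_name,
        DTOOL_USER_EMAIL=email,
        username=username,
        date=date,
    )


# Example values are made up.
filled = fill_template(
    template,
    full_name="Jane Doe",
    email="jane.doe@example.org",
    username="jdoe",
    date="2024-12-18",
)
print(filled)
```

Keys without placeholders, such as `project` or `description`, are the ones you are expected to fill in by hand.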
Tell `dtool` to use this template by executing
```
dtool config readme-template $HOME/.dtool_readme.yml
```
Note that you can of course keep multiple templates for different projects and switch between them with this command.
You can now fill the `README.yml` of your dataset by calling
```
dtool readme interactive <name_of_the_dataset>
```
This launches an interactive mode in which you are queried for the metadata specified in the template file. Please use the above template. You can of course add any further keys to the template file.
## 'Freezing' a dataset
Datasets can be modified until they are frozen. 'Freeze' the dataset with
```
dtool freeze <name_of_the_dataset>
```
This saves hashes of all files in the metadata for the dataset. The hashes can be checked later to verify that the data has not been modified since:
```
dtool verify <name_of_the_dataset>
```
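To make the idea behind freezing and verifying concrete, here is a minimal Python sketch that records a hash for every file and later compares the files on disk against that record. It is a conceptual illustration using only the standard library, not `dtool`'s implementation. The function names are hypothetical and the choice of SHA-256 is an assumption made for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_manifest(data_dir):
    """Map each file's relative path to its hash (the 'freezing' step)."""
    data_dir = Path(data_dir)
    return {
        p.relative_to(data_dir).as_posix(): file_hash(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def verify(data_dir, manifest):
    """Return True if the files on disk still match the recorded hashes."""
    return build_manifest(data_dir) == manifest


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    data.mkdir()
    (data / "output.txt").write_text("energy = -1.0\n")

    manifest = build_manifest(data)     # record hashes at freeze time
    unchanged = verify(data, manifest)  # nothing has been touched yet

    (data / "output.txt").write_text("energy = -2.0\n")
    modified = verify(data, manifest)   # the content no longer matches

print(unchanged, modified)
```

Any change to a file's content, as well as an added or removed file, alters the manifest and makes the comparison fail.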
Freezing will also generate a unique identifier (a UUID) for your dataset. You can query the UUID with
```
dtool uuid <name_of_the_dataset>
```
When referring to a dataset, always use the UUID, as the name of the dataset is not necessarily unique.
## Listing datasets
`dtool ls .` will give you a list of all datasets in the present directory. The color of the name allows you to distinguish between live (red) and frozen (green) datasets.
## Copying datasets
Datasets can be copied with `dtool cp`. Note that `dtool` supports multiple storage backends.
## Optional: Per-file metadata
Metadata for individual files can be added to a dataset in the form of so-called [overlays](https://dtool.readthedocs.io/en/latest/working_with_overlays.html#creating-overlays), which contain a boolean or string value for each file in the dataset. This could be used to flag simulation input files, MD trajectories, etc. The example below uses a glob pattern to flag all files ending in ".nc" as MD trajectories:
```
dtool overlays glob <name_of_the_dataset> is_MD_trajectory '*.nc' | dtool overlays write <name_of_the_dataset> -
```
(Note that overlays created by the first part of the command are not saved anywhere, and thus need to be piped directly into the second part of the command.)
You can display the overlays for a dataset with
```
dtool overlays show <name_of_the_dataset>
```
Please refer to the [documentation](https://dtool.readthedocs.io/en/latest/working_with_overlays.html#creating-overlays) for all options to create an overlay.
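Conceptually, a glob overlay assigns `True` or `False` to every file depending on whether its name matches the pattern. The Python sketch below mimics this with the standard library. The `glob_overlay` helper and the file names are hypothetical, and the result is keyed by relative path purely for readability. How `dtool` stores overlays internally is not shown here.

```python
from fnmatch import fnmatchcase


def glob_overlay(relpaths, pattern):
    """Return a boolean value for each file: does its path match the pattern?"""
    return {relpath: fnmatchcase(relpath, pattern) for relpath in relpaths}


# Made-up file names inside a dataset's data directory.
files = ["in.lammps", "traj.nc", "log.txt"]
is_md_trajectory = glob_overlay(files, "*.nc")
print(is_md_trajectory)
```

Every file receives a value, including those that do not match, which is what makes the overlay usable as a filter later on.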
## Removing datasets
Datasets will be removed by an administrator or an automated script once their expiration date has passed. There is no other way to remove a dataset than changing the expiration date and waiting for the next cleanup.
## Invalid datasets
You may want to delete a dataset because you made an error in the simulation script and the dataset now interferes with your postprocessing workflow. Since the dataset will stay on the storage for a while, make it explicit that the dataset is invalid: `dtool tag set <URI> invalid`. When you select datasets for postprocessing, make sure they have not been marked as invalid. With the mongo query interface, this can be achieved by adding `{"tags": {"$not":{"$in": ["invalid"]}}}` ([mongodb doc](https://docs.mongodb.com/manual/reference/operator/query/in/#use-the-in-operator-to-match-values-in-an-array)) in an `$and` clause. In Python, you can check that a dataset instance is valid with `"invalid" not in dataset.list_tags()`.
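As a small illustration of this filtering step, the Python sketch below drops invalid datasets from a list of records and builds the corresponding mongo-style filter. The records, the `is_valid` helper and the `name` criterion in the `$and` clause are hypothetical placeholders. Only the `tags` condition is taken from the text above.

```python
# Hypothetical dataset records, e.g. as returned by a lookup query.
datasets = [
    {"name": "run-001", "tags": ["production"]},
    {"name": "run-002", "tags": ["invalid"]},
    {"name": "run-003", "tags": []},
]


def is_valid(tags):
    """A dataset counts as valid unless it carries the 'invalid' tag."""
    return "invalid" not in tags


valid_names = [d["name"] for d in datasets if is_valid(d["tags"])]
print(valid_names)

# The same condition as a mongo-style filter, combined with another
# (placeholder) criterion in an `$and` clause.
query = {
    "$and": [
        {"name": {"$regex": "^run-"}},
        {"tags": {"$not": {"$in": ["invalid"]}}},
    ]
}
```

Filtering on the query side keeps invalid datasets out of the result set entirely, while the Python check is useful when you already hold dataset objects.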