Disclaimer: this page is adapted from the internal wiki of Lars Pastewka's group
Datasets
A dataset is a collection of files that belong together logically. For example, this could be...
- ...input files for a simulation and the output of that simulation.
- ...data generated during post-processing of a simulation.
- ...the LaTeX manuscript of a paper, the raw data underlying that paper and the scripts required to plot that data.
The idea behind a dataset is that it becomes immutable at some point. The process of making a dataset immutable is called 'freezing'. Once a dataset is frozen, it can be moved between storage systems but no longer be edited. Changing a dataset would then require the creation of a new dataset and the deletion of the old dataset. While this may seem cumbersome, it has the advantage that the storage backend can be kept simple (and cheap) and that inadvertent modification of primary data is not possible.
Granularity
What goes into a dataset is the decision of the dataset's creator. In general, a finer granularity makes it easier to move, copy and back up datasets. We distinguish between...
- ...primary data, i.e. the input files of a simulation and the output of that simulation...
- ...and derived data, which is obtained by processing primary or derived data. Derived data could for example be the structure factor or some correlation function obtained from a molecular dynamics trajectory. The trajectory itself would be primary data.
For example, this distinction makes it possible to freeze a dataset once a simulation has finished, even if post-processing steps are still to follow. Those go into a separate dataset. The relationship between datasets can be specified using the derived_from property discussed below.
Owners
Each dataset has at least one owner. This is typically the person who created the dataset. Note that the owner has scientific responsibility for the contents of the dataset. Part of this scientific responsibility is for example ensuring that the data has not been fabricated or falsified. Attaching an owner to a dataset allows traceability of these scientific responsibilities.
Reviewers
Before archival, we will review the dataset. This review process ensures that the data inside the dataset conforms with our data management policies. The name(s) of the reviewer(s) will be attached to the dataset.
Metadata
A dataset always has metadata attached to it. The absolute minimum of this metadata would be the user who owns the dataset and the creation date. Please always use your full name, your email address and your ORCID.
The template below reflects the status quo of administrative metadata. Feel free to add any other type of metadata to the dataset, even if not standardized. The more metadata is available, the easier it will be to find a specific dataset later.
dtool
The above abstract description of a dataset requires a standardized implementation. One possibility for maintaining datasets is a simple tool called dtool. dtool defines a standardized way to attach metadata to a dataset and handles the transfer of datasets between storage systems. Please maintain all of your data in dtool datasets. There is also a paper describing dtool.
You may find more help and (possibly livMatS-specific) usage guidelines on dtool at the livMatS RDM wiki, including an illustrated GUI quick-start guide and a PDF-compiled version of those guidelines.
Installation
Install dtool via
python3 -m pip install [--user] dtool
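After installation, you can check that the command-line client is available; the standard --help flag lists all subcommands:
dtool --help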
Configuration
Please tell dtool about yourself. It needs your name and your email address. This information will later be used in the metadata template.
dtool config user name "Your Full Name"
dtool config user email "your@email.com"
Managing datasets
A dtool dataset consists of a number of files and some metadata, including information on file sizes, hashes (generated automatically) and keywords set by the user. Start by creating an empty dataset:
dtool create <name_of_the_dataset>
This will create an empty dataset that contains the file README.yml and the subdirectory data. There is also a hidden directory .dtool that contains administrative data used by dtool.
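For illustration, the directory layout of a freshly created dataset looks roughly like this:
<name_of_the_dataset>/
├── README.yml    (dataset metadata, filled in later)
├── data/         (place your files here)
└── .dtool/       (administrative data used by dtool, do not modify)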
You can add files to the dataset by
dtool add item <file> <name_of_the_dataset>
or by just placing the files into the subdirectory <name_of_the_dataset>/data/. Note that a possible workflow is to create a dataset before running a simulation and then run that simulation within the data subdirectory of the dataset.
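A minimal end-to-end workflow for a single simulation could therefore look as follows; the dataset name is a placeholder, and the readme and freeze commands are explained below:
dtool create 2024-01-15-lj-melt
cd 2024-01-15-lj-melt/data
# ... run your simulation here so that all input and output files end up in data/ ...
cd ../..
dtool readme interactive 2024-01-15-lj-melt
dtool freeze 2024-01-15-lj-melt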
Editing metadata
Metadata is a dictionary attached to the dataset. In its simplest form, it is a set of keys with accompanying values. At any stage you can add keys and values to the dataset using
dtool readme edit <name_of_the_dataset>
or by just editing the file README.yml inside the dataset directory. (Note that dtool readme edit just launches an editor -- typically vi -- for that file.) This file is in the YAML format. Make sure to familiarize yourself with YAML. You can use yamllint to check the syntax.
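For example, assuming yamllint is installed (it is a separate Python package), you can check the syntax of a dataset's metadata file with:
python3 -m pip install [--user] yamllint
yamllint <name_of_the_dataset>/README.yml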
When editing the metadata README file, you are completely free in choosing key-value pairs. dtool incorporates a mechanism for setting predefined keys by providing a template. This template can be stored in any file; our recommendation is to use $HOME/.dtool_readme.yml. Please use the following template file:
project: Project name
description: Short description
owners:
  - name: {DTOOL_USER_FULL_NAME}
    email: {DTOOL_USER_EMAIL}
    username: {username}
    orcid: Please obtain an ORCID at orcid.org
funders:
  - organization: Please add funding organization
    program: Please add the specific program within which funding occurred
    code: Please add funding code of that organization
creation_date: '{date}'
expiration_date: '{date}'
derived_from:
  - uuid: UUID of the primary data or the previous simulation step
Note that this structure allows the specification of an arbitrary number of owners and funders. derived_from is optional and only necessary if the dataset contains derived data; in that case, please specify the UUID of the primary data (or of the previous processing step).
Tell dtool to use this template by executing
dtool config readme-template $HOME/.dtool_readme.yml
Note that you can of course keep multiple templates for different projects and switch between them with this command.
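For example, with one template file per project (the file names below are hypothetical), switching projects is a single command:
dtool config readme-template $HOME/.dtool_readme_project_a.yml
dtool config readme-template $HOME/.dtool_readme_project_b.yml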
You can now fill the README.yml of your dataset by calling
dtool readme interactive <name_of_the_dataset>
This launches an interactive mode in which you are queried for the metadata specified in the template file. Please use the above template. You can of course add any further keys to the template file.
'Freezing' a dataset
Datasets can be modified until they are frozen. 'Freeze' the dataset with
dtool freeze <name_of_the_dataset>
This saves hashes of all files in the metadata for the dataset. The hashes can be checked later to verify that the data has not been modified since:
dtool verify <name_of_the_dataset>
Freezing will also generate a unique identifier (a UUID) for your dataset. You can query the UUID with
dtool uuid <name_of_the_dataset>
When referring to a dataset, always use the UUID, as the name of the dataset is not necessarily unique.
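For example, after freezing a simulation dataset you can look up its UUID and paste it into the derived_from field of the post-processing dataset that builds on it; the dataset name below is a placeholder:
dtool uuid 2024-01-15-lj-melt
# copy the printed UUID into the derived_from section of the README.yml
# of the dataset that holds the post-processed data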
Listing datasets
dtool ls .
will give you a list of all datasets in the present directory. The color of the name allows you to distinguish between live (red) and frozen (green) datasets.
Copying datasets
Datasets can be copied with dtool cp. Note that dtool supports multiple storage backends.
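For example, a local dataset can be copied to a different base URI. The target below is a hypothetical S3 bucket and requires the corresponding storage-backend plugin (e.g. dtool-s3) to be installed:
dtool cp ./2024-01-15-lj-melt s3://my-archive-bucket/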
Optional: Per-file metadata
Metadata for individual files can be added to a dataset in the form of so-called overlays, which contain a boolean or string value for each file in the dataset. This could be used to flag simulation input files, MD trajectories, etc. The example below uses a regular expression to flag all files ending in ".nc" as MD trajectories:
dtool overlays glob <name_of_the_dataset> is_MD_trajectory '*.nc' | dtool overlays write <name_of_the_dataset> -
(Note that overlays created by the first part of the command are not saved anywhere, and thus need to be piped directly into the second part of the command.)
You can display the overlays for a dataset with
dtool overlays show <name_of_the_dataset>
Please refer to the documentation for all options to create an overlay.
Removing datasets
Datasets will be removed by an administrator or an automated script once their expiration date has passed. There is no other way to remove a dataset than changing the expiration date and waiting for the next cleanup.
Invalid datasets
You may want to delete a dataset because you made an error in the simulation script and it now interferes with your post-processing workflow. Since the dataset will stay on the storage for a while, make it explicit that the dataset is invalid: dtool tag set <URI> invalid. When you select datasets for post-processing, make sure they have not been marked as invalid. With the mongo query interface, this can be achieved by adding {"tags": {"$not": {"$in": ["invalid"]}}} (see the mongodb documentation) in an and clause. In Python, you can check that a dataset instance is valid with "invalid" not in dataset.list_tags().