3 Data management

Disclaimer: this page is adapted from the internal wiki of Lars Pastewka's group

Datasets

A dataset is a collection of files that belong together logically. For example, this could be...

  • ...input files for a simulation and the output of that simulation.
  • ...data generated during post-processing of a simulation.
  • ...the LaTeX manuscript of a paper, the raw data underlying that paper and the scripts required to plot that data.

The idea behind a dataset is that it becomes immutable at some point. The process of making a dataset immutable is called 'freezing'. Once a dataset is frozen, it can be moved between storage systems but can no longer be edited. Changing a dataset would then require the creation of a new dataset and the deletion of the old dataset. While this may seem cumbersome, it has the advantage that the storage backend can be kept simple (and cheap) and that inadvertent modification of primary data is not possible.

Granularity

What goes into a dataset is the decision of the dataset's creator. In general, a finer granularity makes it easier to move, copy and back up datasets. We distinguish between...

  • ...primary data, i.e. the input files of a simulation and the output of that simulation...
  • ...and derived data, which is obtained by processing primary or derived data. Derived data could for example be the structure factor or some correlation function obtained from a molecular dynamics trajectory. The trajectory itself would be primary data.

For example, this distinction allows freezing a dataset once a simulation has finished, while postprocessing steps may still follow. Those would go into a separate dataset. The relationship between datasets can be specified using the derived_from property discussed below.

Owners

Each dataset has at least one owner. This is typically the person who created the dataset. Note that the owner has scientific responsibility for the contents of the dataset. Part of this scientific responsibility is for example ensuring that the data has not been fabricated or falsified. Attaching an owner to a dataset allows traceability of these scientific responsibilities.

Reviewers

Before archival, we will review the dataset. This review process ensures that the data inside the dataset conforms with our data management policies. The name(s) of the reviewer(s) will be attached to the dataset.

Metadata

A dataset always has metadata attached to it. The absolute minimum of this metadata would be the user who owns the dataset and the creation date. Please always use your real name, your email and your ORCID.

The template below reflects the status quo of administrative metadata. Feel free to add any other type of metadata to the dataset, even if it is not standardized. The more metadata is available, the easier it will be to search for a specific dataset later.

dtool

The above abstract description of a dataset requires a standardized implementation. One possibility to maintain datasets is with a simple tool called dtool. dtool defines a standardized way to attach metadata to a dataset and handles the transfer of datasets between storage systems. Please maintain all of your data in dtool datasets. There is also a paper describing dtool.

You may find more help and (possibly livMatS-specific) usage guidelines on dtool at the livMatS RDM wiki, including an illustrated GUI quick-start guide and a PDF-compiled version of those guidelines.

Installation

Install dtool via

python3 -m pip install [--user] dtool
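
As a quick check that the installation succeeded, ask the command-line client for its version; the following should print it (the exact output format may vary between releases):

dtool --version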

Configuration

Please tell dtool about yourself. It needs your name and your email address. This information will later be used in the metadata template.

dtool config user name "Your Full Name"
dtool config user email "your@email.com"
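
By default, dtool stores these settings in the file $HOME/.config/dtool/dtool.json. Running either command without a value should echo the current setting, so you can verify your configuration with:

dtool config user name
dtool config user email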

Managing datasets

A dtool dataset consists of a number of files and some metadata, including information on file sizes, hashes (generated automatically) and keywords set by the user. Start by creating an empty dataset:

dtool create <name_of_the_dataset>

This will create an empty dataset that contains the file README.yml and the subdirectory data. There is also a hidden directory .dtool that contains administrative data used by dtool.
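
The layout of a freshly created dataset therefore looks like this:

<name_of_the_dataset>/
├── README.yml
├── data/
└── .dtool/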

You can add files to the dataset by

dtool add item <file> <name_of_the_dataset>

or by just placing the files into the subdirectory <name_of_the_dataset>/data/. Note that a possible workflow is to create a dataset before running a simulation, and then to run that simulation within the data subdirectory of the dataset.
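
As an illustration of that workflow, a complete session could look as follows (the dataset name, input files and simulation command are hypothetical placeholders):

dtool create indentation_run
cd indentation_run/data
run_my_simulation input.in > output.log   # everything written here becomes part of the dataset
cd ../..
dtool add item notes.txt indentation_run  # add a file that lives outside the data directory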

Editing metadata

Metadata is a type of dictionary that is attached to the dataset. In its simplest form, it is a number of keys with accompanying values. At any stage you can add keys and values to the dataset using

dtool readme edit <name_of_the_dataset>

or by just editing the file README.yml inside the dataset directory. (Note that dtool readme edit just launches an editor -- typically vi -- for that file.) This file is in the YAML format. Make sure to familiarize yourself with YAML. You can use yamllint to check the syntax.
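
For example, to check the metadata file of a dataset with yamllint:

yamllint <name_of_the_dataset>/README.yml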

When editing the metadata README file, you are completely free in choosing key-value pairs. dtool incorporates a mechanism in which you can set predefined keys by providing a template. This template can be stored in any file. Our recommendation is to use $HOME/.dtool_readme.yml. Please use the following template file:

project: Project name
description: Short description
owners:
  - name: {DTOOL_USER_FULL_NAME}
    email: {DTOOL_USER_EMAIL}
    username: {username}
    orcid: Please obtain an ORCID at orcid.org
funders:
  - organization: Please add funding organization
    program: Please add the specific program within which funding occurred
    code: Please add funding code of that organization
creation_date: '{date}'
expiration_date: '{date}'
derived_from:
  - uuid: UUID of the primary data or the previous simulation step

Note that this structure allows the specification of an arbitrary number of owners and funders. derived_from is optional and only necessary if this is derived data. In that case, please specify the UUID of the primary data or of the previous processing step.
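
As an illustration, a filled-in README.yml for a derived dataset could look like this (all names, codes and UUIDs below are hypothetical):

project: Nanoindentation of amorphous silicon
description: Radial distribution functions computed from the indentation trajectories
owners:
  - name: Jane Doe
    email: jane.doe@example.com
    username: jdoe
    orcid: 0000-0002-1825-0097
funders:
  - organization: Example Funding Agency
    program: Example Excellence Program
    code: EX 1234
creation_date: '2024-12-18'
expiration_date: '2034-12-18'
derived_from:
  - uuid: 1a2b3c4d-5e6f-7a8b-9c0d-1e2f3a4b5c6d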

Tell dtool to use this template by executing

dtool config readme-template $HOME/.dtool_readme.yml

Note that you can of course keep multiple templates for different projects and switch between them with this command.

You can now fill the README.yml of your dataset by calling

dtool readme interactive <name_of_the_dataset>

This launches an interactive mode in which you are queried for the metadata specified in the template file. Please use the above template. You can of course add any further keys to the template file.

'Freezing' a dataset

Datasets can be modified until they are frozen. 'Freeze' the dataset with

dtool freeze <name_of_the_dataset>

This saves hashes of all files in the metadata for the dataset. The hashes can be checked later to verify that the data has not been modified since:

dtool verify <name_of_the_dataset>

Freezing will also generate a unique identifier (a UUID) for your dataset. You can query the UUID with

dtool uuid <name_of_the_dataset>
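
This prints a 36-character identifier, for example (hypothetical value):

af6727bf-29c7-43dd-b42f-a5d7ede28337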

When referring to a dataset, always use the UUID, as the name of a dataset is not necessarily unique.

Listing datasets

dtool ls . will give you a list of all datasets in the present directory. The color of the name allows you to distinguish between live (red) and frozen (green) datasets.

Copying datasets

Datasets can be copied with dtool cp. Note that dtool supports multiple storage backends.
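
For example, to copy a dataset into another directory that serves as an archive location (the destination path below is a hypothetical base URI; remote backends such as S3 become available through additional dtool plugins):

dtool cp <name_of_the_dataset> /path/to/archive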

Optional: Per-file metadata

Metadata for individual files can be added to a dataset in the form of so-called overlays, which contain a boolean or string value for each file in the dataset. This could be used to flag simulation input files, MD trajectories, etc. The example below uses a glob pattern to flag all files ending in ".nc" as MD trajectories:

dtool overlays glob <name_of_the_dataset> is_MD_trajectory '*.nc' | dtool overlays write <name_of_the_dataset> -

(Note that overlays created by the first part of the command are not saved anywhere, and thus need to be piped directly into the second part of the command.)

You can display the overlays for a dataset with

dtool overlays show <name_of_the_dataset>

Please refer to the documentation for all options to create an overlay.

Removing datasets

Datasets will be removed by an administrator or an automated script once their expiration date has passed. There is no other way to remove a dataset than manipulating the expiration date and waiting for the next cleanup.

Invalid datasets

You may want to delete a dataset because you made an error in the simulation script and it now interferes with your postprocessing workflow. Since the dataset will stay on the storage for a while, make it explicit that the dataset is invalid: dtool tag set <URI> invalid. When you select datasets for postprocessing, make sure they have not been marked as invalid. With the mongo query interface, this can be achieved by adding {"tags": {"$not": {"$in": ["invalid"]}}} (see the MongoDB documentation) in an $and clause. In Python, you can check that a dataset instance is valid with "invalid" not in dataset.list_tags().
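
For example, to flag a broken dataset and inspect its tags afterwards (assuming the dtool tag subcommands shown here are available in your installation):

dtool tag set <URI> invalid
dtool tag ls <URI>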

Git LFS