7. Datasets and Containers

When developing new AI methods or evaluating existing ones, ML/AI researchers and scientists routinely use public datasets. Often the very same datasets are used by different research groups, each of which downloads them to its own storage. For example, more than one research group might download the AlphaFold database needed for predicting 3D protein structures (see https://alphafold.ebi.ac.uk/, >2TB). This situation has previously led to data replication and wasted storage capacity for both users and LRZ.

To avoid this, the LRZ AI Systems offer a dedicated Data Science Storage (DSS) container for storing public datasets and, potentially, Enroot container images that are of interest to more than one researcher for a period of time. Datasets may remain available as long as there is continued demand and are removed otherwise.

How to request the addition of public datasets

Users interested in a particular dataset may open a ticket with the LRZ Servicedesk and should provide at least the following information:

Licensing: The dataset must be available for public usage, allow redistribution and require neither an individual license nor a registration.
Justification: There must be sufficient public interest in the dataset.
Instructions: There must be clear instructions on how the data can be downloaded, for example a bash script that downloads the data and verifies that it is complete and in the desired format. Ideally, the script has already been tested successfully on the AI Systems. It can then be shared, for example, directly within the ticket or via Sync&Share.
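
A download script of the kind described under Instructions could be sketched as follows. This is only a minimal sketch under stated assumptions, not an LRZ-provided tool: the function names download_all and verify_dataset, the URL list file and the checksum file are hypothetical, and the sketch assumes that curl and GNU sha256sum are installed and that the dataset provider publishes SHA-256 checksums.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a dataset download script; all names are illustrative.

# Download every URL listed in a text file (one URL per line) into target_dir.
download_all() {
    local url_list="$1" target_dir="$2" url
    mkdir -p "$target_dir" || return 1
    while IFS= read -r url || [ -n "$url" ]; do
        [ -z "$url" ] && continue   # skip empty lines
        # --fail: treat HTTP errors as failures
        # --continue-at -: resume a partially downloaded file
        (cd "$target_dir" && curl --fail --location --remote-name --continue-at - "$url") || return 1
    done < "$url_list"
}

# Check that the data is complete by comparing it against a checksum file
# in sha256sum format. checksum_file should be an absolute path.
verify_dataset() {
    local target_dir="$1" checksum_file="$2"
    (cd "$target_dir" && sha256sum --check --quiet "$checksum_file")
}
```

A call might look like download_all urls.txt /path/to/dataset followed by verify_dataset /path/to/dataset /path/to/dataset/SHA256SUMS, where urls.txt and SHA256SUMS are placeholder files supplied together with the request.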

Acceptance & implementation of the request are subject to feasibility and available resources.

An example ticket request could look as follows:

Please specify your incident/request: AI topics
Please choose an AI category: Request new Dataset offer

Description: The AlphaFold dataset (https://alphafold.ebi.ac.uk/), which requires >2TB of storage, is becoming popular for protein structure prediction within the ML community. This dataset is used in the methods x and y by the research groups A, B and C.

The dataset is publicly available (https://github.com/deepmind/alphafold#genetic-databases) under an Apache-2.0 license. It can be downloaded with the scripts provided here: https://github.com/deepmind/alphafold/tree/main/scripts/. The instructions for doing this are:

install the aria2c dependency

execute 
$ bash scripts/download_all_data.sh <DOWNLOAD_DIR>

How to request Enroot images on the AI systems

Users interested in a particular image need to:

make sure the image is licensed for public usage and requires no individual license or registration
make sure the image is not already available directly from NVIDIA NGC, Docker Hub or another public repository
open a ticket with the location of the Dockerfile for building the image and a justification for public interest (including the expected target audience)
provide clear instructions for building the image (in case it deviates from the standard procedure)
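
Build instructions attached to such a ticket could be sketched as follows. This is an illustrative sketch only: it assumes that the standard procedure is a Docker build followed by an Enroot import from the local Docker daemon, which this page does not confirm. The helper names run and build_enroot_image, the DRY_RUN variable, and the image name and tag are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: image name, tag and build context are placeholders.

# Print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Build a Docker image from the Dockerfile in context_dir, convert it into an
# Enroot squashfs image, and unpack that image into a named container.
build_enroot_image() {
    local name="$1" tag="$2" context_dir="$3"
    run docker build --tag "${name}:${tag}" "$context_dir" || return 1
    # dockerd:// imports the image from the local Docker daemon
    run enroot import --output "${name}.sqsh" "dockerd://${name}:${tag}" || return 1
    run enroot create --name "$name" "${name}.sqsh" || return 1
}
```

Running DRY_RUN=1 build_enroot_image myimage 1.0 . only prints the three commands, which is a convenient way to paste the exact build steps into the ticket.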

Acceptance & implementation are subject to feasibility and available resources.