Data Science Storage
Welcome to the LRZ Data Science Storage documentation.
Overview
LRZ's Data Science Storage (DSS) is LRZ's approach to meeting the demands and requirements of data-intensive science. To that end, DSS implements a data-centric management approach, which gives you the ability to:
- Store vast amounts of data for as long as the data is important to you or the science community
- Access this data from the whole LRZ computing ecosystem (SuperMUC-NG, LinuxCluster, Compute Cloud, AI-Systems, VMWare, Remote Visualization, Housed Customer Compute Systems)
- Share this data between arbitrary users of the LRZ computing ecosystem
- Access and transfer this data worldwide via a high-performance, WAN-optimized transfer protocol, using a simple web-based graphical user interface
- Share this data with arbitrary users around the globe, as you are already used to from services such as LRZ Sync+Share, Dropbox or Google Drive
Additionally, we provide a new type of data archive based on the DSS solution stack, called Data Science Archive (DSA). DSA relates to DSS roughly as AWS Glacier relates to AWS S3. The idea is that you can put vast amounts of data into the archive, where it eventually freezes; before you can access the data again, you have to thaw it. Thawing happens either implicitly, when you use Globus Online to transfer the data, or explicitly, via a small CLI tool that talks directly to the DSA service.
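The freeze and thaw behaviour described above can be pictured as a small state model. The following is only a conceptual sketch: the class and function names (`ArchivedObject`, `transfer`) are hypothetical and do not correspond to the DSA service, its CLI tool or Globus Online.

```python
from enum import Enum


class State(Enum):
    ONLINE = "online"  # data can be read directly
    FROZEN = "frozen"  # data must be thawed before it can be read


class ArchivedObject:
    """Hypothetical stand-in for a piece of data stored in an archive."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = State.ONLINE

    def freeze(self) -> None:
        # Assumption: in a real archive this happens eventually, on the
        # archive's own schedule, not on user request.
        self.state = State.FROZEN

    def thaw(self) -> None:
        # Models the explicit thaw step.
        self.state = State.ONLINE

    def read(self) -> str:
        if self.state is State.FROZEN:
            raise RuntimeError(f"{self.name} is frozen; thaw it first")
        return f"contents of {self.name}"


def transfer(obj: ArchivedObject) -> str:
    """Models a transfer tool that thaws implicitly before reading."""
    if obj.state is State.FROZEN:
        obj.thaw()
    return obj.read()
```

In this model, calling `read()` on a frozen object fails, while `transfer()` succeeds because it thaws first. This mirrors the difference between explicit and implicit thawing.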
Who can use it
Currently, DSS is available either as a so-called joint project offer or as an on-demand offer from LRZ. In addition, SuperMUC-NG projects are eligible to apply for on-demand storage space on a GCS-funded DSS system.
Joint project offer means that we will analyze your requirements together and then provide you with an offer for purchasing, implementing and housing a dedicated DSS building block for 5 years that exactly fits your needs. The offer consists of a one-time investment part, for which you can usually apply for funding from DFG or other funding agencies, and a yearly service fee, which covers the costs of managing and running the storage system at LRZ. Economically attractive configurations for this model start at around 1 PB.
On-demand offer means that you can get DSS storage on a shared DSS building block, pre-financed by LRZ. Accounting is done per TB and per year. The minimum contract term is one year and the minimum storage space is 20 TB.
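As a rough illustration of per-TB, per-year accounting with the minimums stated above, the arithmetic could be sketched as follows. This is a sketch under assumptions: the function names are hypothetical, whole contract years are assumed, and `price_per_tb_year` is a value supplied by the caller, since no actual price is given here.

```python
MIN_CAPACITY_TB = 20  # minimum storage space stated above
MIN_TERM_YEARS = 1    # minimum contract term stated above


def tb_years(capacity_tb: float, term_years: int) -> float:
    """Return the accounted volume in TB-years, or raise if below a minimum."""
    if capacity_tb < MIN_CAPACITY_TB:
        raise ValueError(f"minimum storage space is {MIN_CAPACITY_TB} TB")
    if term_years < MIN_TERM_YEARS:
        raise ValueError(f"minimum contract term is {MIN_TERM_YEARS} year")
    return capacity_tb * term_years


def estimated_cost(capacity_tb: float, term_years: int,
                   price_per_tb_year: float) -> float:
    """Multiply TB-years by a caller-supplied (hypothetical) unit price."""
    return tb_years(capacity_tb, term_years) * price_per_tb_year
```

For example, 50 TB over 3 years amounts to 150 TB-years. The actual terms are defined by the offer you receive from LRZ.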
DSA is available as an on-demand offer for SuperMUC-NG and Linux Cluster projects upon request.
If you are interested in getting an offer for DSS or DSA, please contact us via the LRZ Servicedesk.
How does it work
Management
At the very top level of the DSS data management approach are so-called data projects. These are special LRZ projects that serve as an organizational envelope for the data associated with the project. Each data project is assigned grants from one or more data pools.
Additionally, each data project contains a set of users, called data curators. The data curators manage, own and take responsibility for the data stored within the data project. Typically, the principal investigator (PI) of a research project and one or more proxies hold the role of data curator.
In the context of a data project and within the limits of its grants, the data curators create one or more so-called containers. Within the containers, users can store data. Containers also have certain properties, such as:
- who is allowed to access the container
- how these access rights are enforced within the container
- how much data can be stored in the container
- how much data a specific user can store in the container
- to which Cloud or VMWare machines the container is exported
- whether data in the container should be protected by regular tape backups or even archives
These properties are managed by the data curators using the DSSWeb Web Portal.
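The relationship between data projects, curators, grants and containers can be sketched as a small data model. This is an illustration only: all class and field names are hypothetical, the grants from one or more data pools are simplified to a single number, and the code does not reflect how DSSWeb is implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Container:
    name: str
    quota_tb: float                                   # how much data fits in the container
    access_groups: Set[str] = field(default_factory=set)   # who may access it
    user_quota_tb: Dict[str, float] = field(default_factory=dict)  # per-user limits
    nfs_exports: Set[str] = field(default_factory=set)     # machines it is exported to
    tape_backup: bool = False                         # protected by regular tape backups


@dataclass
class DataProject:
    name: str
    curators: Set[str]
    grant_tb: float                                   # simplified: one total grant
    containers: List[Container] = field(default_factory=list)

    def create_container(self, curator: str, container: Container) -> None:
        """Only curators may create containers, and only within the grant."""
        if curator not in self.curators:
            raise PermissionError(f"{curator} is not a data curator of {self.name}")
        used = sum(c.quota_tb for c in self.containers)
        if used + container.quota_tb > self.grant_tb:
            raise ValueError("container quota exceeds the remaining grant")
        self.containers.append(container)
```

The check in `create_container` expresses the two rules from the text: containers are created by data curators, and only within the limits of the project's grants.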
Access to data
A DSS container is essentially a special directory on a shared file system (the data pool). Associated with this directory are an enforcement policy and one or more user groups, which define the users who can access the container. The enforcement policy defines how the access rights, defined by the data curators on the container level, are enforced on the data placed within the container.
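Since a container appears as a directory, a user can inspect which group owns it and what that group is allowed to do. The sketch below assumes that access rights show up as ordinary POSIX group ownership and permission bits; whether and how a given enforcement policy maps to these bits is an assumption, not something stated here. The function name `group_access` is hypothetical.

```python
import os
import stat


def group_access(path: str) -> dict:
    """Report the owning group id and the group permission bits of a directory."""
    info = os.stat(path)
    mode = info.st_mode
    return {
        "gid": info.st_gid,                    # numeric id of the owning group
        "read": bool(mode & stat.S_IRGRP),     # group may list the directory
        "write": bool(mode & stat.S_IWGRP),    # group may create or delete entries
        "enter": bool(mode & stat.S_IXGRP),    # group may traverse the directory
    }
```

Calling `group_access` on a container directory returns a small dictionary that can be compared against the access rights the data curator intended.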
On SuperMUC and the LinuxCluster, all DSS file systems are currently available on the Login Nodes via the path /dss/. For virtual machines in the Compute Cloud or the LRZ VMWare infrastructure, data curators can export their containers via NFS to these machines.
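On a login node, a quick way to see what is visible below /dss/ is to list its subdirectories. The following is a minimal sketch; the function name is hypothetical, and it assumes that the DSS file systems appear as directories directly below the root path. On a machine without DSS access it simply returns an empty list.

```python
from pathlib import Path
from typing import List


def list_dss_filesystems(root: str = "/dss") -> List[str]:
    """Return the names of the directories directly below root, sorted."""
    base = Path(root)
    if not base.is_dir():
        # DSS is not mounted here, e.g. outside the login nodes.
        return []
    return sorted(p.name for p in base.iterdir() if p.is_dir())


if __name__ == "__main__":
    for name in list_dss_filesystems():
        print(name)
```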
Outside of the LRZ compute ecosystem, DSS containers can be accessed and shared via Globus Online.
Infrastructure
On the infrastructure level, DSS consists of multiple storage building blocks that are integrated into a single virtual storage system using the IBM Spectrum Scale software-defined storage stack. This software allows us to scale DSS to hundreds of PBs, if required. The storage building blocks are connected to a dedicated high-performance interconnect, which provides high-speed connections to all of LRZ's compute systems. In addition, DSA has two tape libraries attached, located in separate data centres, which hold the frozen data.
Where to go from here
- If you are interested in our DSS or DSA offerings, please contact us via the LRZ Servicedesk (Datenhaltung → Data Science Storage)
- If you are a principal investigator looking for information on how to start a data project, continue reading here.
- If you are a data curator of a data project looking for information on how to manage containers and data within a data project, continue reading here.
- If you are a user who has been invited to access a container and are looking for information on how to use DSS, continue reading here.
- If you are a system owner looking for information on how to grant storage of your system to individual data projects, continue reading here.
Where to get further help
If you have a question regarding DSS that is not answered by our documentation, or if you have encountered a problem with DSS, please contact us via the LRZ Servicedesk and open a ticket for the service Datenhaltung → Data Science Storage.