Your tasks: Data storage

What features do you need in a storage solution when collecting data?

Description

The need for data storage arises early in a research project, because you need space for your data as soon as collection or generation starts. It is therefore good practice to think about storage solutions during the data management planning phase, and to request (and, where necessary, pay for) storage in advance.

The storage solution for your data should fulfil certain criteria (e.g. space, access and transfer speed, duration of storage), which should be discussed with the IT team. You may choose a tiered storage system, which assigns data to various types of storage media based on requirements for access, performance, recovery and cost. Tiered storage allows you to classify data according to levels of importance, assign it to the appropriate storage tier, and move it to a different tier later. For example, once analysis is completed, you can move data to a lower tier for preservation or archiving.

Tiered storage is commonly classified as “hot” or “cold”. “Hot” storage is associated with fast access speed, high access frequency and high-value data, and consists of faster drives such as Solid State Drives (SSDs). This storage is usually located close to the user, such as on campus, and incurs high costs. “Cold” storage is associated with low access speed and frequency, and consists of slower drives or tapes. This storage is usually off-premises and incurs low costs.
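
The hot/cold distinction can be illustrated with a minimal sketch. The `Dataset` fields, the `suggest_tier` function and the threshold of 10 reads per month are all hypothetical choices made for this example; real tiering policies are defined together with your IT team.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    reads_per_month: int     # how often the data is accessed (assumed metric)
    needs_fast_access: bool  # e.g. input to an analysis that is still running


def suggest_tier(ds: Dataset, hot_threshold: int = 10) -> str:
    """Return "hot" or "cold" using a purely illustrative rule."""
    if ds.needs_fast_access or ds.reads_per_month >= hot_threshold:
        return "hot"
    return "cold"


# Hypothetical datasets, used only to show the rule in action.
datasets = [
    Dataset("raw_sequencing_run", reads_per_month=40, needs_fast_access=True),
    Dataset("finished_project_2019", reads_per_month=0, needs_fast_access=False),
]
for ds in datasets:
    print(ds.name, "->", suggest_tier(ds))
```

A rule like this is only a starting point for the discussion with IT; cost and recovery requirements usually matter as well.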

Considerations

When looking for solutions to store your data during the collection or generation phase, you should consider the following aspects:

The volume of your data is an important factor in determining the appropriate storage solution. At a minimum, try to estimate the volume of raw data that you are going to generate or collect.
What kind of access/transfer speed and access frequency will be required for your data?
Knowing where the data will come from is also crucial. If the data comes from an external facility or needs to be transferred to a different server, you should think about an appropriate data transfer method.
It is a good practice to have a copy of the original raw data in a separate location, to keep it untouched and unchanged (not editable).
Knowing how long the raw data, as well as data processing pipelines and analysis workflows, need to be stored, especially after the end of the project, is also relevant when choosing storage.
It is highly recommended to have metadata, such as an identifier and file description, associated with your data (see Metadata management page). This is useful if you want to retrieve the data years later or if your data needs to be shared with your colleagues for collaboration. Make sure to keep metadata together with the data or establish a clear link between data and metadata files.
In addition to the original “read-only” raw (meta)data files, you need storage for files used for data processing and analysis as well as the workflows/processes used to produce the data. For these, you should consider:
- Who is allowed to access the data (in case of collaborative projects), how do they expect to access the data and for what purpose.
- Check if you have the rights to give access to the data, in case of legal limitations or third party rights (for instance, collaboration with industry).
- Consult policy for data sharing outside the institute/country (see Compliance and Monitoring page).
Consider how you will keep track of changes (version control), resolve conflicts and trace changes back.
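
One way to keep an untouched copy of a raw data file is to copy it to a separate location, verify the copy with a checksum, and remove write permissions. The sketch below uses only the Python standard library; the function names `sha256sum` and `protect_raw_copy` are hypothetical, and it assumes a POSIX-like file system. Institutional snapshots or backups are usually a more robust option.

```python
import hashlib
import shutil
import stat
from pathlib import Path


def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file without loading it all at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def protect_raw_copy(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir`, verify the copy and make it read-only."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    # Fails if a read-only copy already exists, which avoids silent overwrites.
    shutil.copy2(source, target)
    if sha256sum(source) != sha256sum(target):
        raise IOError(f"Checksum mismatch for {target}")
    # Leave only read permission for owner, group and others.
    target.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return target
```

Storing the checksum next to the copy (for example in a README file) lets you confirm years later that the file is still unchanged.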

Solutions

Provide an estimate about the volume of your raw data (i.e., is it in the order of Megabytes, Gigabytes or Terabytes?) to the IT support in your institute when consulting for storage solutions.
Clarify if your data needs to be transferred from one location to another. Try to provide IT with as much information as possible about the system where the data will come from. See our Data Transfer page for additional information.
Ask for a tiered storage solution that gives you easy and fast access to the data for processing and analysis. Explain to the IT support what machine or infrastructure you need to access the data from and if other researchers should have access as well (in case of collaborative projects).
Ask if the storage solution includes an automatic management of versioning, conflict resolution and back-tracing capabilities (see also our Data Organisation page).
Ask the IT support in your institute if they offer technical solutions to keep a copy of your (raw) data secure and untouched (snapshot, read-only access, backup…). You could also keep a copy of the original data file in a separate folder as “read-only”.
For small data files and private or collaborative projects within your institute, commonly accessible Cloud Storage is usually provided by the institute, such as NextCloud (on-premises), Microsoft OneDrive, DropBox, Box, etc. Do not use personal accounts on similar services for this purpose; adhere to the policies of your institute.
Funders or universities usually require you to store raw data and data analysis workflows (for reproducible results) for a certain amount of time after the end of the project (see our Preserve page). Check the data policy for your project or institute to know if a copy of the data should also be stored at your institute for a specific time after the project. This helps you budget for storage costs and helps your IT support estimate the storage resources needed.
Make sure to generate good documentation (i.e., README file) and metadata together with the data. Follow best practices for folder structure, file naming and versioning systems (see our Data Organisation page). Check if your institute provides a (meta)data management system, such as iRODS, DataVerse, FAIRDOM-SEEK or OSF. See All tools and resources table below for additional tools.
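
To give IT support an order-of-magnitude figure, you could measure a pilot dataset and report its size in familiar units. The following is a minimal sketch using the Python standard library; the function names are hypothetical, and decimal units (1 KB = 1000 B) are an assumption, since some systems report binary units instead.

```python
from pathlib import Path


def directory_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files below `root`."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def human_readable(num_bytes: float) -> str:
    """Format a byte count using decimal units (assumption: 1 KB = 1000 B)."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if num_bytes < 1000:
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1000
    return f"{num_bytes:.1f} PB"


# Example with a made-up figure: a pilot run of 2 * 10**12 bytes.
print(human_readable(2 * 10**12))
```

Multiplying the size of one pilot sample by the number of planned samples then gives a rough projection for the whole project.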

How do you estimate computational resources for data processing and analysis?

Description

In order to process and analyse your data, you will need access to computational resources. These range from your laptop and local compute clusters to High Performance Computing (HPC) infrastructures. However, it can be difficult to estimate the amount of computational resources needed for a process or an analysis.

Considerations

Below, you can find some aspects that you need to consider to be able to estimate the computational resource needed for data processing and analysis:

The total volume of data is an important factor in estimating the computational resources needed.
Consider how much data volume you need “concurrently or at once”. For example, consider the possibility to analyse a large dataset by downloading or accessing only a subset of the data at a time (e.g., stream 1 TB at a time from a big dataset of 500 TB).
Define the expected speed and the reliability of connection between storage and compute.
Determine which software you are going to use. If it is proprietary software, check for possible licensing issues. Check whether it only runs on specific operating systems (Windows, macOS, Linux…).
Establish if and what reference datasets you need.
In the case of collaborative projects, define who can access the data and the computational resources for analysis (specify from what device, if possible). Check the policy on data access between different countries. Try to establish a versioning system.
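
The idea of working on only part of the data at a time can be sketched as follows. This example streams a file in fixed-size chunks so that memory use stays bounded regardless of file size; `read_in_chunks`, `count_newlines` and the 64 MB default chunk size are illustrative assumptions, and counting newlines is just a stand-in for a real analysis step.

```python
from pathlib import Path
from typing import Iterator


def read_in_chunks(path: Path, chunk_size: int = 64 * 1024 * 1024) -> Iterator[bytes]:
    """Yield successive chunks of a file instead of reading it all at once."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk


def count_newlines(path: Path, chunk_size: int = 64 * 1024 * 1024) -> int:
    """Count newline characters while holding at most one chunk in memory."""
    return sum(chunk.count(b"\n") for chunk in read_in_chunks(path, chunk_size))
```

Whether an analysis can be split this way depends on the method; some tools need the full dataset in memory, which changes the resources you have to request.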

Solutions

Try to estimate the volume of:
- Raw data files necessary for the process/analysis.
- Data files generated during the computational analysis as intermediate files.
- Results data files.
Communicate your expectations about speed and the reliability of connection between storage and compute to the IT team. This could depend on the communication protocols that the compute and storage systems use.
Ask colleagues or bioinformatics support who have done similar work before how long the analysis is likely to take. This could save you money and time.
If you need reference datasets (e.g. reference genomes such as the human genome), ask IT if they provide them, or consult bioinformaticians who can set up automated retrieval of public reference datasets.
For small data files and private projects, using the computational resources of your own laptop might be fine, but make sure to preserve the reproducibility of your work by using data analysis software such as Galaxy or R Markdown.
For small data volume and small collaborative projects, a commonly accessible Cloud Storage, such as Nextcloud (on-premises) or Owncloud might be fine. Adhere to the policies of your institute.
For large data volume and bigger collaborative projects, you need a large storage volume on fast hardware that is closely tied to a computational resource accessible to multiple users.
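
The three volumes listed above (raw, intermediate and results files) can be combined into a rough total. The sketch below is simple arithmetic under stated assumptions: the function name, the idea of expressing intermediate and results volumes as multiples of the raw volume, and the 20% safety margin are all hypothetical and should be replaced with figures from colleagues who have run similar analyses.

```python
def estimate_total_storage_gb(
    raw_gb: float,
    intermediate_factor: float,
    results_factor: float,
    safety_margin: float = 0.2,
) -> float:
    """Estimate total storage as raw + intermediate + results, plus a margin.

    The factors express intermediate and results volumes as multiples of the
    raw data volume (an assumption made for this sketch).
    """
    intermediate_gb = raw_gb * intermediate_factor
    results_gb = raw_gb * results_factor
    subtotal = raw_gb + intermediate_gb + results_gb
    return subtotal * (1 + safety_margin)


# Made-up example: 500 GB raw, intermediates 2x raw, results 0.1x raw.
total = estimate_total_storage_gb(500, intermediate_factor=2.0, results_factor=0.1)
print(f"Request roughly {total:.0f} GB")
```

With these made-up numbers the subtotal is 1550 GB, so the request comes to about 1860 GB once the margin is added.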

Where should you store the data after the end of the project?

Description

After the end of the project, all the relevant (meta)data (to guarantee reproducibility) should be preserved for a certain amount of time, which is usually defined by funder or institutional policy. However, where to preserve data that are no longer needed for active processing or analysis is a common question in data management.

Considerations

Data preservation does not refer to a place or to a specific storage solution, but rather to “how” data are stored. As described in our Preservation page, numerous precautions need to be implemented by people with a variety of technical skills to preserve data.
Estimate the volume of the (meta)data files that need to be preserved after the end of the project. Consider using a compressed file format to minimize the data volume.
Define the amount of time (hours, days…) that you could wait in case the data needs to be reanalysed in the future.
It is a good practice to publish your data in public data repositories. Usually, data publication in repositories is a requirement for scientific journals and funders. Repositories preserve your data for a long time, sometimes for free. See our Data Publication page for more information.
Institutes or universities could have specific policies for data preservation. For example, your institute can ask you to preserve the data internally for 5 years after the project, even if the same data is available in public repositories.

Solutions

Based on the funder or institutional policy about data preservation, the data volume and the retrieval time span, discuss with the IT team what preservation solutions they can offer (e.g. data archiving services in your country) and the costs, so that you can budget for it in your DMP.
Publish your data in public repositories, and they will preserve the data for you.
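
Before handing data over for preservation, it can help to pack the files into a single compressed archive and record a checksum, so that the volume is reduced and integrity can be checked on retrieval. This is a minimal sketch using the Python standard library; the function name `archive_directory` is hypothetical, and gzip-compressed tar is only one possible format. Check which formats your archive or repository accepts.

```python
import hashlib
import tarfile
from pathlib import Path


def archive_directory(source_dir: Path, archive_path: Path) -> str:
    """Write `source_dir` to a .tar.gz archive and return its SHA-256 checksum.

    `archive_path` should be outside `source_dir`, otherwise the archive
    would try to include itself.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=source_dir.name)
    digest = hashlib.sha256()
    with open(archive_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Keep the returned checksum in your documentation. Data that are already compressed (many image and sequencing formats) will shrink very little.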

More information

Relevant tools and resources

Tool or resource	Description	Related pages	Registry
Amazon Web Services	Amazon Web Services	Data analysis Data transfer	Training
b2share	Store and publish your research data. Can be used to bridge between domains	Data publication Bioimaging data	Standards/Databases
BIONDA	BIONDA is a free and open-access biomarker database, which employs various text mining methods to extract structured information on biomarkers from abstracts of scientific publications	Researcher Human data Proteomics	Tool info
Box	Cloud storage and file sharing service	Data Steward: infrastructure Data transfer	Training
CERNBox	CERNBox cloud data storage, sharing and synchronization
CS3	Cloud Storage Services for Synchronization and Sharing (CS3)
DATAVERSE	Open source research data repository software. Different instances available	Researcher Data Steward: research Data Steward: infrastructure IFB	Training
DropBox	Cloud storage and file sharing service	Data Steward: infrastructure Data transfer
e!DAL	Electronic data archive library is a framework for publishing and sharing research data	Data Steward: infrastructure	Tool info
FAIRDOM-SEEK	Data, model and SOPs management for projects, from preliminary data to publication, support for running SBML models etc.	Data Steward: infrastructure NeLS Microbial biotechnology IFB Machine actionability	Tool info Training
FAIRDOMHub	Data, model and SOPs management for projects, from preliminary data to publication, support for running SBML models etc. (public SEEK instance)	Researcher NeLS Documentation and metadata Microbial biotechnology Machine actionability	Standards/Databases
Google Drive	Cloud Storage for Work and Home	Data transfer
iCloud	Data sharing	Data analysis Data transfer
iRODS	Integrated Rule-Oriented Data System (iRODS) is open source data management software for a cancer genome analysis workflow.	Data Steward: infrastructure TransMed Bioimaging data	Tool info
Microsoft Azure	Cloud storage and file sharing service from Microsoft	Data Steward: infrastructure Data transfer
Microsoft OneDrive	Cloud storage and file sharing service from Microsoft	Data Steward: infrastructure
NextCloud	As fully on-premises solution, Nextcloud Hub provides the benefits of online collaboration without the compliance and security risks.	Data Steward: infrastructure Data transfer
OHDSI	Multi-stakeholder, interdisciplinary collaborative to bring out the value of health data through large-scale analytics. All our solutions are open-source.	Researcher Data Steward: research Data analysis TransMed Toxicology data	Tool info
OMERO	OMERO is an open-source client-server platform for managing, visualizing and analyzing microscopy images and associated metadata	Documentation and metadata Data Steward: research Data Steward: infrastructure OMERO Bioimaging data	Tool info Training
OpenStack	OpenStack is an open source cloud computing infrastructure software project and is one of the three most active open source projects in the world. Different instances available	Data analysis TransMed IFB	Training
OSF	OSF (Open Science Framework) is a free, open platform to support your research and enable collaboration.	Researcher Data Steward: research	Training
OwnCloud	Cloud storage and file sharing service	Data Steward: infrastructure Data transfer Data analysis
Research Data Management Platform (RDMP)	Data management platform for automated loading, storage, linkage and provision of data sets	Data Steward: infrastructure	Tool info
Research Object Crate (RO-Crate)	RO-Crate is a lightweight approach to packaging research data with their metadata, using schema.org. An RO-Crate is a structured archive of all the items that contributed to the research outcome, including their identifiers, provenance, relations and annotations.	Documentation and metadata Data organisation Data Steward: research Researcher Microbial biotechnology Machine actionability	Standards/Databases
Rucio	Rucio - Scientific Data Management	Data analysis Data transfer
ScienceMesh	ScienceMesh - frictionless scientific collaboration and access to research services	Data analysis Data transfer
SeaFile	SeaFile File Synchronization and Share Solution	Data transfer
semares	All-in-one platform for life science data management, semantic data integration, data analysis and visualization	Researcher Data Steward: research Documentation and metadata Data analysis Data Steward: infrastructure
tranSMART	Knowledge management and high-content analysis platform enabling analysis of integrated data for the purposes of hypothesis generation, hypothesis validation, and cohort discovery in translational research.	Researcher Data Steward: research Data analysis TransMed	Tool info
TSD	Norwegian Services for sensitive data	Sensitive data TSD	Training
National resources
Flemish Supercomputing Center (VSC)	VSC is Flanders’ most highly integrated high-performance research computing environment, providing world-class services to government, industry, and researchers.	Data Steward: research Data Steward: infrastructure Data analysis
e!DAL-PGP	Plant Genomics and Phenomics Research Data Repository	Documentation and metadata Researcher Data Steward: research Data Steward: infrastructure Plant sciences Plant Genomics
GHGA	The German Human Genome-Phenome Archive	Documentation and metadata Researcher Data Steward: research
FAIRDOM-SEEK	Data management platform for organising, sharing and publishing research datasets, models, protocols, samples, publications and other research outcomes.	Documentation and metadata Researcher Data Steward: research Data Steward: infrastructure
PANGAEA	Data Publisher for Earth & Environmental Science	Documentation and metadata Researcher Data Steward: research
Fairdata.fi	With the Fairdata Services you can store, share and publish your research data with easy-to-use web tools.	CSC Researcher Data Steward: research Data publication Existing data
Sensitive Data Services for Research	CSC Sensitive Data Services for Research are designed to support secure sensitive data management through web-user interfaces accessible from the user’s own computer	CSC Researcher Data Steward: research Sensitive data Data analysis Data publication Human data
NIRD	The National Infrastructure for Research Data (NIRD) offers storage services, archiving services, and processing capacity for computing on the stored data. It offers services and capacities to any scientific discipline that requires access to advanced, large-scale, or high-end resources for storing, processing, publishing research data or searching digital databases and collections. This service is owned and operated by Sigma2 NRIS, which is a joint collaboration between UiO, UiB, NTNU, UiT, and UNINETT Sigma2	Data transfer NeLS
Norwegian Research and Education Cloud (NREC)	NREC is an Infrastructure-as-a-Service (IaaS) project, commonly referred to as a cloud infrastructure, between the University of Bergen and the University of Oslo, with additional contributions from NeIC (Nordic e-Infrastructure Collaboration) and Uninett. An IaaS is a self-service infrastructure where you spawn standardized servers and storage instantly, as needed, from a given resource quota. OpenStack	Data analysis
Educloud Research	Educloud Research is a platform provided by the Centre for information Technology (USIT) at the University of Oslo (UiO). This platform provides access to a work environment accessible to collaborators from other institutions or countries. This service provides a storage solution and a low threshold HPC system that offers batch job submission (SLURM) and interactive nodes. Data up to the red classification level can be stored/analysed.	Data analysis Sensitive data
TSD	The TSD – Service for Sensitive Data, is a platform for collecting, storing, analysing and sharing sensitive data in compliance with the Norwegian privacy regulation. TSD is developed and operated by UiO.	Human data Data analysis Sensitive data TSD
HUNTCloud	The HUNT Cloud, established in 2013, aims to improve and develop the collection, accessibility and exploration of large scale information. HUNT Cloud offers cloud services, lab management, and is a key service that has established a framework for data protection, data security, and data management. HUNT Cloud is owned by NTNU and operated by HUNT Research Centre at the Department of Public Health and Nursing at the Faculty of Medicine and Health Sciences.	Human data Data analysis Sensitive data
SAFE	SAFE (secure access to research data and e-infrastructure) is a solution for secure processing of sensitive personal data in research at the University of Bergen. SAFE is based on the “Norwegian Code of conduct for information security in the health and care sector” (Normen) and ensures that confidentiality, integrity, and availability are preserved when processing sensitive personal data. Through SAFE, the IT department offers a service where employees, students and external partners get access to dedicated resources for processing of sensitive personal data.	Human data Data analysis Sensitive data
BioData.pt Service Hub	BioData.pt Service Hub includes several data management resources, tools and services available for researchers in Life Sciences.	Researcher Data Steward: research Data analysis
BioData.pt Data Management Portal (DMPortal)	This instance of DataVerse is provided by BioData.pt. We can help you write and maintain data management plans for your research. DATAVERSE	Researcher Data Steward: research
SNIC	The Swedish National Infrastructure for Computing (SNIC) is a national research infrastructure that makes available large-scale high-performance computing resources, storage capacity, and advanced user support, for Swedish research.	Data analysis