Your domain: Structural Bioinformatics

Introduction

Structural bioinformatics provides scientific methods to analyse, predict, and validate the three-dimensional structure of biological macromolecules such as proteins, RNA, DNA, and carbohydrates, including small molecules bound to them. It also provides an important link to the genomics and structural biology communities. One objective of structural bioinformatics is to create new methods for analysing and manipulating biological macromolecular data in order to predict structures, functions, and interactions. This document describes guidelines for depositing structure predictions together with relevant metadata according to FAIR principles. Although the guidelines focus on the deposition process, predictors usually need to collect the relevant metadata while making the predictions so that it is available at deposition time.

Description

Researchers in the field should be able to find predictions of macromolecular structures, access their coordinates, understand how and why they were produced, and obtain estimates of model quality to judge whether a model is suitable for a specific application. The considerations and solutions described below are written from the perspective of protein structure predictions, but they also apply to other types of macromolecular structures.

Considerations

- Is your prediction based on experimental data (i.e. integrative or hybrid modelling) or purely in silico?
  - This is important to define the appropriate deposition system.
- What is the purpose of the structure prediction? Is it a large-scale modelling effort using automated prediction methods to (for instance) generally increase structural coverage of known proteins or a single modelling effort performed, possibly with manual intervention, for a specific application?
  - This is important to define the appropriate deposition system.
- What is the source for the sequences of the modelled proteins?
  - This is important to cross-link with existing databases such as UniProtKB.
- What modelling steps were performed?
  - Descriptions here can vary widely among modelling methods but should be detailed enough to enable reproducibility and include references to methods described in manuscripts and publicly available software or web services.
- What input data were used for the modelling steps?
  - For protein structure predictions, this commonly includes the identification of homologous proteins from sequence databases with or without coverage by experimental structures. Knowing the input data greatly facilitates further analysis and reproducibility of the structure prediction.
- What is the expected accuracy of the structure prediction?
  - This is commonly referred to as “model quality” or “model confidence” and is of major relevance to determine whether a given model can be used for downstream analysis. Quality estimates should enable users to judge the expected accuracy of the prediction both globally and locally.
- Under which licence terms can others use your models?
  - Depending on the deposition system, there will be predefined and commonly permissive terms of use, but if this is to be restricted or if models are made available in a self-hosted system, an appropriate usage policy must be defined.

Solutions

There are three main options to make your models available:
- Deposit in ModelArchive for theoretical models of macromolecular structures. Models deposited in ModelArchive are made available under the CC BY-SA 4.0 licence (see the ModelArchive website for details).
- Deposit in PDB-Dev for models using integrative or hybrid modelling. Models deposited in PDB-Dev are made available under the CC0 1.0 licence (see the PDB-Dev website for details). If theoretical models were used as part of the modelling, they can either be included in the PDB-Dev deposition or, if they are expected to be useful by themselves, deposited in ModelArchive and referenced from the PDB-Dev entry.
- Make models available through a dedicated web service for large-scale modelling efforts that are updated regularly using automated prediction methods. Unified access to such services can be provided through the 3D-Beacons network, which is being developed by the ELIXIR 3D-BioInfo Community. The data providers currently connected to the network are listed in the 3D-Beacons documentation. An appropriate licence must be associated with the models (check the RDMkit licensing page for guidance on this) and must be compatible with CC BY 4.0 if the models are to be distributed in the 3D-Beacons network.
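The choice between the three options above can be sketched as a small decision function. This is only an illustration of the guidance in this section: the class `ModellingProject`, its field names, and the function `suggest_deposition_route` are hypothetical and not part of any deposition system.

```python
from dataclasses import dataclass


@dataclass
class ModellingProject:
    # Hypothetical fields mirroring two of the considerations listed earlier.
    uses_experimental_data: bool  # integrative or hybrid modelling?
    large_scale_automated: bool   # automated pipeline, updated regularly?


def suggest_deposition_route(project: ModellingProject) -> str:
    """Return the option from this section that best matches the project."""
    if project.uses_experimental_data:
        # Integrative or hybrid models go to PDB-Dev.
        return "PDB-Dev"
    if project.large_scale_automated:
        # Regularly updated large-scale efforts are served by their own service.
        return "dedicated web service (optionally via 3D-Beacons)"
    # Purely in silico, single modelling efforts go to ModelArchive.
    return "ModelArchive"
```

For example, `suggest_deposition_route(ModellingProject(False, False))` returns `"ModelArchive"`. Real projects may need a more nuanced decision, for instance when theoretical models are reused inside an integrative model.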
Model coordinates are preferably stored in the standard PDB archive format PDBx/mmCIF. Although the legacy PDB format often suffices for storing model coordinates and is still widely used, it is no longer being modified or extended.
Model quality estimates can be computed globally, per residue, and per residue pair. The estimates should be computed using a relatively recent and well-benchmarked tool or by the structure prediction method itself. Please check CAMEO, CASP, and CAPRI to find suitable quality estimators. The 3D-BioInfo Community is also currently working to further improve benchmarking for protein complexes, protein-ligand interactions, and nucleic acid structures. By convention, the main per-residue quality estimates are stored in place of B-factors in model coordinate files. In mmCIF files, any number of quality estimates can be properly described and stored in the ma_qa_metric category of the PDBx/mmCIF ModelCIF Extension Dictionary described below.
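The B-factor convention mentioned above can be illustrated with a minimal sketch that reads per-residue estimates from legacy PDB-format `ATOM` records. It assumes the standard fixed-column layout (chain identifier in column 22, residue number in columns 23-26, insertion code in column 27, B-factor in columns 61-66) and assumes that all atoms of a residue carry the same value, so only the first atom of each residue is read. The function names are hypothetical, and the simple mean is just one possible global summary, not a recommended estimator.

```python
from typing import Dict, Iterable, Optional, Tuple

ResidueKey = Tuple[str, int, str]  # (chain ID, residue number, insertion code)


def per_residue_scores(pdb_lines: Iterable[str]) -> Dict[ResidueKey, float]:
    """Collect the B-factor column of the first atom seen for each residue."""
    scores: Dict[ResidueKey, float] = {}
    for line in pdb_lines:
        if not line.startswith("ATOM  "):
            continue  # skip HETATM, TER, REMARK and other records
        key = (line[21], int(line[22:26]), line[26].strip())
        # setdefault keeps the value of the first atom of each residue
        scores.setdefault(key, float(line[60:66]))
    return scores


def global_mean(scores: Dict[ResidueKey, float]) -> Optional[float]:
    """Average the per-residue values; returns None for an empty model."""
    if not scores:
        return None
    return sum(scores.values()) / len(scores)
```

A typical call would be `per_residue_scores(open("model.pdb"))`, where `model.pdb` is a placeholder file name. For mmCIF files, reading the dedicated quality categories with an mmCIF parser is preferable to relying on the B-factor column.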
Metadata for theoretical models of macromolecular structures should preferably be stored using the PDBx/mmCIF ModelCIF Extension Dictionary, independently of the deposition process. The extension is being developed by the ModelCIF working group with input from the community. Feedback and change requests are welcome and can be submitted on GitHub. The same information can also be provided manually during deposition in ModelArchive, and there is additional documentation on how to provide metadata and on its minimal requirements. Generally, the metadata must include:
- A short description of the study for which the model was generated
- If available, a citation to the manuscript referring to the models
- The source for the sequences of modelled proteins with references to databases such as UniProtKB
- Modelling steps with references to available software or web services used and to manuscripts describing the method
- Input data needed for the modelling steps (for instance, in homology modelling this could include the PDB identifiers for the template structures used for modelling and their alignments to the target protein)
- Model quality estimates
If necessary, accompanying data can be provided in separate files using different file formats. The files can be added to ModelArchive depositions and referred to in the PDBx/mmCIF ModelCIF extension format.
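Because the metadata listed above is best collected while the predictions are being made, it can help to track it in a simple record and check for gaps before deposition. The following is a minimal sketch under that assumption: the class `ModelMetadata` and its field names are hypothetical and do not correspond to ModelCIF category names or to any ModelArchive interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ModelMetadata:
    # Hypothetical record of the metadata items listed in this section.
    description: str = ""
    citation: Optional[str] = None  # only required "if available"
    sequence_sources: List[str] = field(default_factory=list)
    modelling_steps: List[str] = field(default_factory=list)
    input_data: List[str] = field(default_factory=list)
    quality_estimates: List[str] = field(default_factory=list)

    def missing_items(self) -> List[str]:
        """Return the names of mandatory items that are still empty."""
        required = {
            "description": self.description,
            "sequence_sources": self.sequence_sources,
            "modelling_steps": self.modelling_steps,
            "input_data": self.input_data,
            "quality_estimates": self.quality_estimates,
        }
        return [name for name, value in required.items() if not value]
```

Calling `missing_items()` on a record that only has a description returns the four remaining item names, which gives a quick reminder of what still has to be gathered before depositing.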

More information

Training

Training in TeSS

Relevant tools and resources

Tool or resource	Description	Related pages	Registry
3D-Beacons	Network providing unified programmatic access to experimentally determined and predicted structure models
CAMEO	Continuous evaluation of the accuracy and reliability of protein structure prediction methods in a fully automated manner		Tool info Standards/Databases
CAPRI	Critical assessment of structure prediction methods for protein-protein interactions
CASP	Biennial critical assessment of techniques for protein structure prediction		Tool info
ModelArchive	Repository for theoretical models of macromolecular structures with DOIs for models	Biomolecular simulation data Data publication	Standards/Databases
PDB	The Protein Data Bank (PDB)	Researcher Intrinsically disordered proteins	Tool info Training
PDB-Dev	Prototype archiving system for structural models obtained using integrative or hybrid modeling	Biomolecular simulation data
PDBx/mmCIF format and tools	Information about the standard PDB archive format PDBx/mmCIF, its dictionaries and related software tools		Standards/Databases
PDBx/mmCIF ModelCIF Extension Dictionary	Extension of the PDBx/mmCIF dictionary for theoretical models of macromolecular structures
UniProt	Comprehensive resource for protein sequence and annotation data	Documentation and metadata Researcher Intrinsically disordered proteins Microbial biotechnology Proteomics	Tool info Standards/Databases Training