Analysis and consensus of currently available intrinsic protein disorder annotation sources in the MobiDB database
© Di Domenico et al.; licensee BioMed Central Ltd. 2013
Published: 22 April 2013
Intrinsic protein disorder is becoming an increasingly important topic in protein science. During the last few years, intrinsically disordered proteins (IDPs) have been shown to play a role in many important biological processes, e.g. protein signalling and regulation. This has sparked a need to better understand and characterize different types of IDPs, their functions and roles. Our recently published database, MobiDB, provides a centralized resource for accessing and analysing intrinsic protein disorder annotations.
Here, we present a thorough description and analysis of the data made available by MobiDB, providing descriptive statistics on the various available annotation sources. Version 1.2.1 of the database contains annotations for ca. 4,500,000 UniProt sequences, covering all eukaryotic proteomes. In addition, we describe a novel consensus annotation calculation and its related weighting scheme. The comparison between disorder information sources highlights how the MobiDB consensus captures the main features of intrinsic disorder and correlates well with manually curated datasets. Finally, we demonstrate the annotation of 13 eukaryotic model organisms through MobiDB's datasets, and of an example protein through the interactive user interface.
MobiDB is a central resource for intrinsic disorder research, containing both experimental data and predictions. In the future it will be expanded to include additional information for all known proteins.
Intrinsic protein disorder is becoming an increasingly important topic in protein science [1–3]. Protein function has traditionally been thought to be determined by tertiary structure. Over the last decade, however, intrinsically disordered proteins (IDPs) have been found to play a role in many important biological processes [4–6]. IDPs are widespread in natural proteins, especially in eukaryotic organisms [7, 8], and are frequently associated with molecular recognition [9, 10]. They have been observed to be common among hub proteins, i.e. those with many interaction partners, and also to play a key role in human disease. In addition, protein disorder is important for experimental protein characterization, since difficulties often arise when long disordered regions are present, which frequently happens at the N and C termini. IDPs represent a heterogeneous concept with many different and elusive definitions, which can be traced back to different indirect experimental methods.
Sources of disorder information
Currently available sources for intrinsic disorder annotations can be divided into two main groups. The first group includes annotations inferred from experiment, with evidence in publications. The second group includes annotations automatically extracted by computational tools. The latter can be further subdivided into automatic annotations derived from experimental sources, and automatic annotations obtained from software predictors.
There are currently two available sources of intrinsic protein disorder information with evidence in publications. The DisProt database, a manually curated repository, features disorder and structure annotations for 667 proteins (version 6.00). The IDEAL database, also manually curated, contains information on 209 proteins. The Protein Data Bank (PDB) constitutes the main source of available experimentally-based disorder annotations, with over 70,000 different structures. It is widely accepted that missing residues from X-ray structures correlate well with intrinsically disordered residues. These missing regions can easily be extracted from structure files deposited in the PDB. Some 6,000 structures solved by NMR experiments are generally deposited as structural ensembles in a single file. These can be used to detect residue mobility which, in a way that is analogous to the missing X-ray regions, is a good indicator of intrinsic disorder. NMR structures were only recently considered in disorder prediction, supporting the long-held belief in different flavours of disorder [1, 3, 21].
A great number of intrinsic disorder predictors have been developed over the last few years, with two main scenarios emerging for their application. The first is represented by predictions of disorder on a relatively small number of proteins with maximum accuracy, such as in the CASP experiment. Most existing prediction methods, such as Disopred, VSL1 and CSpritz, have been trained for this scenario. A more practical scenario is, however, represented by the genome-scale analysis of disorder [1, 8], where some performance is sacrificed to achieve results in a reasonable time frame. This usually entails using a method that does not require a multiple-sequence alignment, thereby speeding up computation by several orders of magnitude. DisEMBL, IUPred and, more recently, ESpritz have all been developed with this scenario in mind.
In the following, we will describe the construction of the MobiDB database of experimental and predicted disorder annotations in proteins. In particular, we will compare the different annotation sources and how they are integrated. A coherent consensus disorder definition will be derived and used to annotate the proteomes of a set of representative model organisms.
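As an illustration of how missing X-ray regions can be read from a structure file, the following minimal sketch collects the residues listed in the REMARK 465 records of a PDB-format file and merges consecutive residue numbers into regions. It is not the custom script used for MobiDB; the function names are hypothetical, and the sketch assumes the usual REMARK 465 layout (optional model number, residue name, chain identifier, sequence number).

```python
import re
from collections import defaultdict

# Matches REMARK 465 data lines such as "REMARK 465     MET A     1".
# Free-text header lines of the same remark do not match this pattern.
_REMARK_465 = re.compile(
    r"^REMARK 465\s+(?:(\d+)\s+)?([A-Z0-9]{1,3})\s+(\S)\s+(-?\d+)([A-Z]?)\s*$"
)


def missing_residues(pdb_text):
    """Return {chain: [residue numbers]} for residues listed as missing."""
    missing = defaultdict(list)
    for line in pdb_text.splitlines():
        match = _REMARK_465.match(line)
        if match:
            chain = match.group(3)
            missing[chain].append(int(match.group(4)))
    return dict(missing)


def to_regions(numbers):
    """Merge consecutive residue numbers into (start, end) regions."""
    regions = []
    for n in sorted(set(numbers)):
        if regions and n == regions[-1][1] + 1:
            regions[-1] = (regions[-1][0], n)
        else:
            regions.append((n, n))
    return regions


# Hypothetical excerpt used only to exercise the functions above.
EXAMPLE = """\
REMARK 465 MISSING RESIDUES
REMARK 465   M RES C SSSEQI
REMARK 465     MET A     1
REMARK 465     GLY A     2
REMARK 465     SER A    10
REMARK 465     ALA A    11
REMARK 465     LYS B     5
"""

example_regions = {
    chain: to_regions(nums) for chain, nums in missing_residues(EXAMPLE).items()
}
```

The residue numbers obtained this way follow the PDB numbering, so they would still need to be mapped onto the reference sequence before being used as annotations.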
Materials and methods
Data loading is performed as a three-step process. In the first step, annotations are extracted from each annotation source and stored as two FASTA files. One of these files contains the annotating sequences, and the other the annotations extracted from them. An extra comma-separated file is generated which links the annotating sequences to their corresponding reference sequences. In the second step, a script takes the output files of the first step and generates tab-separated files compatible with the database engine's batch-loading mechanism. During this step, if an annotating sequence covers only part of its corresponding reference sequence, an alignment between the two is performed. Any gaps that the alignment introduces in the annotating sequence are also transferred to the extracted annotation. The third and final step consists simply of loading the data in batch into the database. To maximize loading performance, the affected database indices are dropped before the insertion begins. The resulting database constitutes the backend of the application, which is then accessed by the user interface.
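The gap transfer in the second step can be sketched as follows. This is a minimal illustration rather than the pipeline's own script: it assumes that the annotation is a string with one symbol per residue of the annotating sequence and that gaps are written as `-`. The function name `transfer_gaps` and the symbols `D` and `S` are hypothetical.

```python
def transfer_gaps(aligned_sequence, annotation, gap="-"):
    """Copy the gaps of an aligned annotating sequence into its annotation.

    aligned_sequence: annotating sequence after alignment to the reference,
                      with gap characters inserted.
    annotation:       one symbol per residue of the ungapped sequence.
    """
    ungapped_length = sum(1 for c in aligned_sequence if c != gap)
    if ungapped_length != len(annotation):
        raise ValueError("annotation length does not match the sequence")

    symbols = iter(annotation)
    # Emit a gap wherever the aligned sequence has one; otherwise consume
    # the next annotation symbol so that residues and symbols stay paired.
    return "".join(gap if c == gap else next(symbols) for c in aligned_sequence)


# Hypothetical example: 'D' marks disorder, 'S' marks structure.
aligned_annotation = transfer_gaps("AC--DE", "DDSS")
```

After this step the annotation has the same length as the aligned annotating sequence, so each annotation symbol can be read column by column against the reference.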
Disorder data and resources in MobiDB
All of the aforementioned disorder sources are integrated into the MobiDB database. XML files from the DisProt and IDEAL databases are parsed for annotations. Information on the corresponding UniProt entries to be linked to those sources is also included in the XML files. Annotating sequences from PDB files are extracted by means of custom scripts (X-ray) and the MOBI server (NMR). These annotations are then linked to their corresponding UniProt protein sequences by means of the SIFTS database. In order to capture different flavours of disorder, seven in silico disorder predictors are run against all the reference sequences: three ESpritz flavours (X-ray, NMR, DisProt) and two flavours each for IUPred (short, long) and DisEMBL (remark465 and hot loops).
Overview of the databases used in MobiDB 1.2.1. The databases used and relevant references are listed with the description of extracted information and the version or download date included in MobiDB.
Extracted information (surviving table entries): Disorder and structure (three databases); Functional domain annotations (one database).
Disorder consensus and weighting
The consensus for a region is defined in terms of two quantities: the sum of weights of annotations considering the region disordered, and the sum of weights of annotations considering the region structured. The annotation score indicates the strength of a given consensus annotation. It is the sum of the weights of every annotation that agrees with the final consensus for a certain region. Its purpose is to allow the classification of regions according to the amount of data backing up the resulting annotation, an amount that also depends on the relative weight of each annotation. In all cases, the sums are calculated over all the annotations corresponding to a certain position of the reference sequence. This may be visualized as the columns in an alignment between the reference sequence and its corresponding annotating sequences. Where an annotating sequence has no annotation for a certain reference sequence position, its contribution to the sum is zero. In all cases the minimum value of the sums is zero, and the maximum depends on the number of annotations available and the weight assigned to each of them.
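The per-position sums and the annotation score can be sketched as below. The decision rule (the state with the larger weight sum wins) and the handling of ties and of positions without any annotation are assumptions made for this illustration; they are not taken from the text. The labels `D`, `S` and `?` and the function name are likewise hypothetical.

```python
def consensus_at_position(annotations):
    """Weighted consensus for one position of the reference sequence.

    annotations: list of (label, weight) pairs, one per annotating sequence,
                 where label is 'D' (disordered), 'S' (structured) or None
                 when the source has no annotation for this position.
    Returns (consensus_label, annotation_score).
    """
    # Sources without an annotation at this position contribute zero.
    disorder_sum = sum(w for label, w in annotations if label == "D")
    structure_sum = sum(w for label, w in annotations if label == "S")

    if disorder_sum == 0 and structure_sum == 0:
        return None, 0.0                  # assumption: no data, no consensus
    if disorder_sum > structure_sum:      # assumption: larger sum wins
        return "D", disorder_sum
    if structure_sum > disorder_sum:
        return "S", structure_sum
    return "?", disorder_sum              # assumption: ties are flagged


# Hypothetical column: one strong and one weak vote for disorder,
# one vote for structure, and one source with no annotation.
label, score = consensus_at_position(
    [("D", 0.5), ("S", 0.2), ("D", 0.05), (None, 0.05)]
)
```

The score returned is the sum of the weights agreeing with the consensus, so a region supported only by predictors receives a much lower score than one backed by experimental data.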
The weight of a PDB X-ray annotation is calculated from r, the resolution of the experiment, and rT, a user-defined maximum resolution threshold. This threshold allows the user to set a baseline in the form of a minimum resolution required for a structure to provide a significant annotation. In the case where the resulting weight is smaller than 0.2, a fixed value of 0.2 is assigned. PDB NMR structures are assigned a fixed weight of 0.2 each, to reflect the usually higher uncertainty in coordinates obtained by NMR experiments when compared to their X-ray counterparts. Finally, predictor-generated annotations are given a weight of 0.05, which allows experimentally obtained data to prevail whenever they are available.
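The weighting scheme can be sketched as follows. The floor of 0.2 for X-ray annotations and the fixed weights for NMR structures and predictors come from the text. The formula relating r and rT is not reproduced here, so the linear form `1 - r / rT` used below is purely an assumed stand-in, and all names are hypothetical.

```python
MIN_XRAY_WEIGHT = 0.2    # floor applied to X-ray weights (from the text)
NMR_WEIGHT = 0.2         # fixed weight per PDB NMR structure (from the text)
PREDICTOR_WEIGHT = 0.05  # fixed weight per predictor (from the text)


def xray_weight(resolution, threshold):
    """Weight of a PDB X-ray annotation.

    resolution: resolution r of the experiment, in Angstrom.
    threshold:  user-defined maximum resolution threshold rT, in Angstrom.

    ASSUMPTION: the raw weight decreases linearly from 1 towards 0 as r
    approaches rT. Only the floor is documented in the text.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    raw = 1.0 - resolution / threshold
    return max(MIN_XRAY_WEIGHT, raw)


# Under the assumed linear form, a 2.5 A structure with rT = 5 A gets 0.5;
# structures at or beyond the threshold fall back to the 0.2 floor.
example_weights = [xray_weight(r, 5.0) for r in (2.5, 4.9, 6.0)]
```

Whatever the exact formula, the floor means that every deposited structure keeps a weight at least four times that of a single predictor.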
Sequence conservation and disorder classification
In order to provide information regarding the sequence conservation of disorder, MobiDB also annotates sequence conservation on groups of orthologous protein sequences. For each reference sequence in the database, a search is performed in the OMA Browser database to look for a corresponding group of orthologs. If such a group is found and contains at least 10 members, a multiple sequence alignment is constructed with CLUSTALW. A position in the alignment is considered conserved if the same residue is present in at least 50% of the sequences. Whenever such sequence conservation annotations are available, disordered regions in reference sequences are classified in a way analogous to the definitions introduced by Bellay and co-workers. If the region is disordered and its sequence conserved, it is defined as "constrained disorder". If, on the other hand, the region is disordered but the sequence not conserved, it is termed "flexible disorder".
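The conservation rule and the resulting classification can be sketched as below. The 10-member minimum and the 50% cut-off come from the text; treating gaps as non-matching characters that still count towards the number of sequences is an assumption, and the function names are hypothetical.

```python
from collections import Counter

MIN_GROUP_SIZE = 10           # minimum number of orthologs (from the text)
MIN_CONSERVED_FRACTION = 0.5  # same residue in at least 50% of sequences


def has_enough_members(alignment):
    """True if the ortholog group is large enough to be aligned and scored."""
    return len(alignment) >= MIN_GROUP_SIZE


def conserved_columns(alignment, min_fraction=MIN_CONSERVED_FRACTION):
    """Return one boolean per alignment column.

    alignment: list of equal-length aligned sequences ('-' marks a gap).
    ASSUMPTION: gaps never count as the conserved residue, but gapped
    sequences still count in the denominator.
    """
    n_sequences = len(alignment)
    flags = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        most_common = Counter(residues).most_common(1)
        flags.append(
            bool(most_common) and most_common[0][1] / n_sequences >= min_fraction
        )
    return flags


def classify_disorder(is_disordered, is_conserved):
    """Label a disordered position as constrained or flexible disorder."""
    if not is_disordered:
        return None
    return "constrained disorder" if is_conserved else "flexible disorder"


# Hypothetical toy alignment (too small for real use, see MIN_GROUP_SIZE).
toy_flags = conserved_columns(["ACD", "ACE", "AGF", "A-G"])
```

In this toy alignment the first two columns pass the 50% cut-off and the third does not, so a disordered residue in the third column would be labelled "flexible disorder".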
Results and discussion
In order to assess the available information on disorder, it was first necessary to create a new database. MobiDB was thus designed with three main goals in mind: performance, scalability and usability. The database had to maintain good performance both when loading, so that it can be updated frequently, and when querying, so that it is useful to the public by providing fast response times. It had to be scalable, meaning that performance levels can be maintained when expanding with further information. Last but not least, it had to provide high levels of usability, giving the user a centralized, flexible and intuitive way to access intrinsic disorder information. Updates for MobiDB are carried out through a three-step loading process integrated into a single, automated pipeline (see Methods). This allows for the easy regeneration of the entire database with up-to-date information in less than a week. Based on this, and on the update frequencies of the different sources integrated into MobiDB, we have set a quarterly update interval: every three months MobiDB will be updated to keep up with recent additions to its information sources.
Use cases for MobiDB
There are two main use cases for MobiDB. The first one is the analysis of a single protein by means of the user interface. The second one is the generation of a custom dataset for offline analysis. Both actions are available after performing a database search, or after accessing one of the browse options. MobiDB supports the UniProt complex search syntax, through a web service call to the UniProt server. This allows users to build sophisticated queries with various filters, e.g. organisms and subcellular localizations. All proteins matching the search parameters will be listed along with relevant information for each entry in the Search results page.
From the search results, the user can click on a protein name and be directed to the Protein analysis page. This page features four interactive widgets, each containing different pieces of information regarding the selected protein. The Reference sequence information widget contains general information related to the chosen reference protein, extracted from the UniProt database. The Annotation sources widget contains the different annotated regions from each annotating sequence that has been linked to the reference sequence. The Annotations plot widget provides a graphical representation of the available annotations associated with the reference sequence. This contains general annotations such as Pfam annotations and disorder consensus, as well as all available disorder annotation sources.
Instead of analysing a single protein via the graphical interface, the user can opt to download a dataset containing multiple entries. This can be done by pressing the download button in the top left of the search results page. The exported dataset is composed of two FASTA files, one containing all relevant reference and annotating sequences and the other containing all the corresponding annotations. Pre-computed datasets are available in the download section of the MobiDB website for the different experimental data sources, as well as for each of the 297 complete proteomes.
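For offline analysis, the two exported files need to be read together. The sketch below pairs each sequence with its annotation, assuming that both files use identical header lines for corresponding records and that an annotation has one symbol per residue. Neither assumption is confirmed by the text, and the header format and annotation symbols in the example are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def pair_records(sequence_text, annotation_text):
    """Return {header: (sequence, annotation)} for headers found in both files.

    ASSUMPTION: matching records share the same header, and records whose
    sequence and annotation differ in length are skipped as inconsistent.
    """
    sequences = parse_fasta(sequence_text)
    annotations = parse_fasta(annotation_text)
    return {
        header: (seq, annotations[header])
        for header, seq in sequences.items()
        if header in annotations and len(annotations[header]) == len(seq)
    }


# Hypothetical file contents; real exports may use other headers and symbols.
SEQUENCES = ">P00001|source1\nMKT\nAYI\n>P00002|source1\nGGS\n"
ANNOTATIONS = ">P00001|source1\nDDDSSS\n>P00002|source1\nDD\n"

paired = pair_records(SEQUENCES, ANNOTATIONS)
```

Here the second record is dropped because its annotation is shorter than its sequence; a stricter reader could raise an error instead.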
Comparison between disorder data sources. The different disorder data sources are compared in terms of available sequence entries and distribution of ordered and disordered residues. The distribution of disordered region lengths is also shown in terms of the lowest (1st) and highest (3rd) quartiles, median and mean.
Overview of the disorder definitions used. The labels used for disorder data sources throughout the paper are defined. The type column lists whether the source contains experimental information (Exp), predictions (Pred) or consensus (Cons).
DisProt database annotations
IDEAL database annotations
PDB NMR annotations
PDB X-ray annotations, resolution threshold of 2.5 Å
PDB X-ray annotations, resolution threshold of 5 Å
PDB-xray and PDB-nmr annotations, resolution threshold of 2.5 Å
PDB-xray and PDB-nmr annotations, resolution threshold of 5 Å
DisEMBL remark 465 predictions
DisEMBL hot loops predictions
ESpritz DisProt predictions
ESpritz NMR predictions
ESpritz X-ray predictions
IUPred long predictions
IUPred short predictions
Full MobiDB consensus without DisProt (type: Cons, Exp, Pred)
Full MobiDB consensus without IDEAL (type: Cons, Exp, Pred)
Full MobiDB consensus without manually curated data (DisProt and IDEAL) (type: Cons, Exp, Pred)
Full MobiDB consensus (all sources) (type: Cons, Exp, Pred)
Single protein analysis
We have presented a detailed description of MobiDB, a database of experimental and predicted disorder in proteins, and its main features, disorder consensus and weighting. The database is highly modular and extensible, allowing inclusion of a growing amount of information. A comparison between different disorder data sources highlights how the MobiDB consensus captures the main features of intrinsic disorder and correlates well with the manually curated datasets from DisProt and IDEAL. In more detail, the DisProt curation is best approximated with a combination of disorder predictors, allowing a robust estimation of the presence of disorder in eukaryotic genomes, roughly confirming the higher incidence of disorder in higher organisms. In the future we plan to expand MobiDB to include additional information for all known proteins, both from experimental sources and new predictors, with the goal of making it an increasingly useful, centralized source of data for intrinsic disorder research.
The authors are grateful to members of the BioComputing UP lab for useful discussions.
This project was supported by funding from the University of Padova (CPDA098382, CPDR097328), FIRB Futuro in Ricerca (RBFR08ZSXY) and Cariplo (2017/0724) to S.T.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 7, 2013: Italian Society of Bioinformatics (BITS): Annual Meeting 2012. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S7
1. Schlessinger A, Schaefer C, Vicedo E, Schmidberger M, Punta M, Rost B: Protein disorder--a breakthrough invention of evolution?. Curr Opin Struct Biol. 2011, 21: 412-418. 10.1016/j.sbi.2011.03.014.
2. Tompa P: Unstructural biology coming of age. Curr Opin Struct Biol. 2011, 21: 419-425. 10.1016/j.sbi.2011.03.012.
3. Dunker AK, Oldfield CJ, Meng J, Romero P, Yang JY, Chen JW, Vacic V, Obradovic Z, Uversky VN: The unfoldomics decade: an update on intrinsically disordered proteins. BMC Genomics. 2008, 9 (Suppl 2): S1. 10.1186/1471-2164-9-S2-S1.
4. Wright PE, Dyson HJ: Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J Mol Biol. 1999, 293: 321-331. 10.1006/jmbi.1999.3110.
5. Dunker AK, Obradovic Z: The protein trinity--linking function and disorder. Nat Biotechnol. 2001, 19: 805-806. 10.1038/nbt0901-805.
6. Tompa P: Intrinsically unstructured proteins. Trends Biochem Sci. 2002, 27: 527-533. 10.1016/S0968-0004(02)02169-2.
7. Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT: Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol. 2004, 337: 635-645. 10.1016/j.jmb.2004.02.002.
8. Pancsa R, Tompa P: Structural disorder in eukaryotes. PLoS ONE. 2012, 7: e34687. 10.1371/journal.pone.0034687.
9. Tompa P, Fuxreiter M: Fuzzy complexes: polymorphism and structural disorder in protein-protein interactions. Trends Biochem Sci. 2008, 33: 2-8. 10.1016/j.tibs.2007.10.003.
10. Fong JH, Shoemaker BA, Garbuzynskiy SO, Lobanov MY, Galzitskaya OV, Panchenko AR: Intrinsic disorder in protein interactions: insights from a comprehensive structural analysis. PLoS Comput Biol. 2009, 5: e1000316. 10.1371/journal.pcbi.1000316.
11. Dosztányi Z, Chen J, Dunker AK, Simon I, Tompa P: Disorder and sequence repeats in hub proteins and their implications for network evolution. J Proteome Res. 2006, 5: 2985-2995. 10.1021/pr060171o.
12. Uversky VN, Oldfield CJ, Dunker AK: Intrinsically disordered proteins in human diseases: introducing the D2 concept. Annu Rev Biophys. 2008, 37: 215-246. 10.1146/annurev.biophys.37.032807.125924.
13. Uversky VN, Radivojac P, Iakoucheva LM, Obradovic Z, Dunker AK: Prediction of intrinsic disorder and its use in functional proteomics. Methods Mol Biol. 2007, 408: 69-92. 10.1007/978-1-59745-547-3_5.
14. Orosz F, Ovádi J: Proteins without 3D structure: definition, detection and beyond. Bioinformatics. 2011, 27: 1449-1454. 10.1093/bioinformatics/btr175.
15. Sickmeier M, Hamilton JA, LeGall T, Vacic V, Cortese MS, Tantos A, Szabo B, Tompa P, Chen J, Uversky VN, Obradovic Z, Dunker AK: DisProt: the Database of Disordered Proteins. Nucleic Acids Res. 2007, 35: D786-793. 10.1093/nar/gkl893.
16. Fukuchi S, Sakamoto S, Nobe Y, Murakami SD, Amemiya T, Hosoda K, Koike R, Hiroaki H, Ota M: IDEAL: Intrinsically Disordered proteins with Extensive Annotations and Literature. Nucleic Acids Res. 2012, 40: D507-511. 10.1093/nar/gkr884.
17. Berman H, Henrick K, Nakamura H, Markley JL: The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res. 2007, 35: D301-303. 10.1093/nar/gkl971.
18. Brandt BW, Heringa J, Leunissen JAM: SEQATOMS: a web tool for identifying missing regions in PDB in sequence context. Nucleic Acids Res. 2008, 36: W255-259. 10.1093/nar/gkn237.
19. Martin AJM, Walsh I, Tosatto SCE: MOBI: a web server to define and visualize structural mobility in NMR protein ensembles. Bioinformatics. 2010, 26: 2916-2917. 10.1093/bioinformatics/btq537.
20. Walsh I, Martin AJM, Di Domenico T, Tosatto SCE: ESpritz: accurate and fast prediction of protein disorder. Bioinformatics. 2012, 28: 503-509. 10.1093/bioinformatics/btr682.
21. Vucetic S, Obradovic Z, Vacic V, Radivojac P, Peng K, Iakoucheva LM, Cortese MS, Lawson JD, Brown CJ, Sikes JG, Newton CD, Dunker AK: DisProt: a database of protein disorder. Bioinformatics. 2005, 21: 137-140. 10.1093/bioinformatics/bth476.
22. Deng X, Eickholt J, Cheng J: A comprehensive overview of computational protein disorder prediction methods. Mol Biosyst. 2012, 8: 114-121. 10.1039/c1mb05207a.
23. Monastyrskyy B, Fidelis K, Moult J, Tramontano A, Kryshtafovych A: Evaluation of disorder predictions in CASP9. Proteins. 2011, 79 (Suppl 10): 107-118.
24. Obradovic Z, Peng K, Vucetic S, Radivojac P, Dunker AK: Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins. 2005, 61 (Suppl 7): 176-182.
25. Walsh I, Martin AJM, Di Domenico T, Vullo A, Pollastri G, Tosatto SCE: CSpritz: accurate prediction of protein disorder segments with annotation for homology, secondary structure and linear motifs. Nucleic Acids Res. 2011, 39: W190-196. 10.1093/nar/gkr411.
26. Linding R, Jensen LJ, Diella F, Bork P, Gibson TJ, Russell RB: Protein disorder prediction: implications for structural proteomics. Structure. 2003, 11: 1453-1459. 10.1016/j.str.2003.10.002.
27. Dosztányi Z, Csizmok V, Tompa P, Simon I: IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics. 2005, 21: 3433-3434. 10.1093/bioinformatics/bti541.
28. Di Domenico T, Walsh I, Martin AJM, Tosatto SCE: MobiDB: a comprehensive database of intrinsic protein disorder annotations. Bioinformatics. 2012, 28: 2080-2081. 10.1093/bioinformatics/bts327.
29. Tagari M, Tate J, Swaminathan GJ, Newman R, Naim A, Vranken W, Kapopoulou A, Hussain A, Fillon J, Henrick K, Velankar S: E-MSD: improving data deposition and structure quality. Nucleic Acids Res. 2006, 34: D287-290. 10.1093/nar/gkj163.
30. The UniProt Consortium: Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012, 40: D71-75.
31. Altenhoff AM, Schneider A, Gonnet GH, Dessimoz C: OMA 2011: orthology inference among 1000 complete genomes. Nucleic Acids Res. 2011, 39: D289-294. 10.1093/nar/gkq1238.
32. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: Clustal W and Clustal X version 2.0. Bioinformatics. 2007, 23: 2947-2948. 10.1093/bioinformatics/btm404.
33. Bellay J, Han S, Michaut M, Kim T, Costanzo M, Andrews BJ, Boone C, Bader GD, Myers CL, Kim PM: Bringing order to protein disorder through comparative genomics and genetic interactions. Genome Biol. 2011, 12: R14. 10.1186/gb-2011-12-2-r14.
34. Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, Heger A, Holm L, Sonnhammer ELL, Eddy SR, Bateman A, Finn RD: The Pfam protein families database. Nucleic Acids Res. 2012, 40: D290-301. 10.1093/nar/gkr1065.
35. Cuff AL, Sillitoe I, Lewis T, Clegg AB, Rentzsch R, Furnham N, Pellegrini-Calace M, Jones D, Thornton J, Orengo CA: Extending CATH: increasing coverage of the protein structure universe and linking structure with function. Nucleic Acids Res. 2011, 39: D420-426. 10.1093/nar/gkq1001.
36. Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983, 22: 2577-2637. 10.1002/bip.360221211.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.