DeltaProt: a software toolbox for comparative genomics
- Steinar Thorvaldsen^{1}Email author,
- Tor Flå^{1} and
- Nils P Willassen^{2}
DOI: 10.1186/1471-2105-11-573
© Thorvaldsen et al; licensee BioMed Central Ltd. 2010
Received: 25 June 2010
Accepted: 23 November 2010
Published: 23 November 2010
Abstract
Background
Statistical bioinformatics is the study of biological data sets obtained by new micro-technologies by means of proper statistical methods. For a better understanding of environmental adaptations of proteins, orthologous sequences from different habitats may be explored and compared. The main goal of the DeltaProt Toolbox is to provide users with important functionality that is needed for comparative screening and studies of extremophile proteins and protein classes. Visualization of the data sets is also the focus of this article, since visualizations can play a key role in making the various relationships transparent. This application paper is intended to inform the reader of the existence, functionality, and applicability of the toolbox.
Results
We present the DeltaProt Toolbox, a software toolbox that may be useful in importing, analyzing and visualizing data from multiple alignments of proteins. The toolbox has been written in MATLAB™ to provide an easy and user-friendly platform, including a graphical user interface, while ensuring good numerical performance. Problems in genome biology may be easily stated thanks to a compact input format. The toolbox also offers the possibility of utilizing structural information from the SABLE or other structure predictors. Different sequence plots can then be viewed and compared in order to find their similarities and differences. Detailed statistics are also calculated during the procedure.
Conclusions
The DeltaProt package is open source and freely available for academic, non-commercial use. The latest version of DeltaProt can be obtained from http://services.cbu.uib.no/software/deltaprot/. The website also contains documentation, and the toolbox comes with real data sets that are intended for training in applying the models to carry out bioinformatical and statistical analyses of protein sequences.
Equipped with the new algorithms proposed here, DeltaProt serves as an auxiliary analysis tool capable of visualizing and comparing orthologus protein sequences. The framework of the algorithms also enables easy incorporation of extra information on structure, if such data is available.
Background
Does life existing in extreme environments favour particular amino acids in its proteins, and can this preference be interpreted in terms of the physicochemical properties of these amino acids? Comparative bioinformatics [1] is an emerging field in which the information produced from sequence alignments is analyzed to find out more about the essential properties of proteins in diverse living systems. The success of genome technologies depends heavily on the correct statistical analyses of genomic data, and the new interdisciplinary area of statistical bioinformatics aims to modify available classical and nonclassical statistical methods and develop new methodologies. To explore and analyze the flow of experimental data from molecular biology requires efficient flexible data structures and statistical tools designed with the particular challenges of high-throughput data in mind. Statistical bioinformatics deals particularly with questions of modelling, handling and interpretation of this type of data, especially for modern molecular and post-genomic biology.
The novelty of analytical methods used in statistical bioinformatics and the special nature of the data generated by genome technologies, require the development of new software tools. There is a need for computationally powerful and open/expandable tools, which at the same time should be easy to use. DeltaProt was designed as a Matlab™ [2] companion with a set of functions and algorithms for modelling multiple alignments of proteins using a range of methods. The approach has been successfully applied in research on extremophilic organisms [3–6], and is of special relevance for membrane proteins since it is very difficult to obtain structural models of this important class. This is done by applying comparative statistical methods to comparisons of extremophile proteins versus orthologous genes from organisms of normal (e.g. mesophilic) habitats. Statistical analyses are best at distinguishing valid conclusions from random noise.
The information stored in protein sequences can be analyzed at different levels. Features of proteins that can be obtained as a single numerical value, such as volume or amino acid composition, can be easily compared across large numbers of genomes using statistical methods. Additionally, multiple variants of a gene can be compared between genomes by direct alignment to identify substitution patterns. All genes extracted from two or more genomes provide a further level at which comparisons can be carried out. Regardless of the method used, visual representation of the results becomes essential to display and interpret the findings.
Implementation
Here we describe the organization and capabilities of the toolbox, highlighting its key features. This tool provides several methods of comparative analyses, and enables computational scientists to study sequence alignments by rapidly prototyping new algorithms and performing computational experiments. DeltaProt provides access to implementation details and encourages modification and extension of capabilities.
The toolbox is distributed in the standard Matlab language, making it compatible with multiple platforms (Windows, Mac, Linux and Unix). DeltaProt requires at least Matlab™ R14 (2005). Some functions require the Matlab™ Statistics Toolbox. The installation of the package consists of unpacking a compressed file to the desired location. No manual compilation is necessary. The computational requirements are heavily dependent on the size of the data set. Generally, the toolbox only requires high numerical processing capabilities due to a large data set, for example if whole proteomes consisting of more than 1000 alignments, each with 10 sequences, are to be processed.
DeltaProt is supported by a short user manual and a complete webpage with examples, the purpose of which is to facilitate the use of the software for new users. The modular, open-source nature of the software allows users to understand how results are calculated and to add additional functionalities to customized solutions for their own sequence analysis.
Import of data
Results and Discussion
This section outlines the algorithms and performance of the different modules of DeltaProt by considering several illustrative examples of results. Throughout the rest of this paper, we will demonstrate the use of DeltaProt by applying it to a dataset of 65 membrane proteins from six microbial genomes [4]. Users only need to define their sequence data via simple and compact input files. Advanced users can 'tweak' many configuration settings in order to fine-tune for the different data sets.
In DeltaProt we present statistical methods and trend-tests which are useful when the protein sequences in the alignments can be divided into two or more groups based on known phenotypic traits of the host organism, such as preference of optimal growth temperature (mesophile, intermediate, psychrophile) or environmental metagenomic samples. The phenotypic group is assumed to be an ordinal stochastic variable in the statistical model.
The toolbox consists of a set of statistical routines. We consider both the amino acid sequence compositions and the substitution patterns to determine whether there are underlying trends that explain the observed variation between the phenotypic groups being analyzed. More than 80 different physicochemical properties of the amino acid [10, 11] may also be applied in order to reduce the sequence alphabet to measurements. Each situation is analyzed by appropriate statistical methods. The differences between the phenotypic groups may also be compared graphically. Amino acid composition, substitution patterns and physiochemical properties along the sequences may all be visualized. The statistical models have a very flexible design, and principles like pooling the data to reduce variability by measuring the average instead of the individuals, and blocking to reduce the effect of some varying factor that are identifies but uninteresting, may be applied. In cases where data from several different taxonomic orders are available, this information can be utilized, by a stratified blocking approach, where the biological order is included in the model. The stratification of the subjects into disjoint sets increases the power of the test to detect association, because like subjects are compared to like subjects.
The approach assumes statistical independence among the sequence samples in each group at each strata, but it may in certain cases, be modified to treat phylogenetically dependent sequences within the groups. When there are multiple dependent samples in a group, we compute the mean values in each group at each stratum and apply the statistical test to the means as representative observations.
Statistical models and tests available in DeltaProt for one protein alignment
Groups | Composition | Substitutions | Properties |
---|---|---|---|
2 | Two-way Anova | Mantel-Haenszel | Linear regression |
> 2 | Linear regression | _ | Linear regression |
Statistical models and tests available in DeltaProt to detect common trends in several protein alignments
Groups | Composition | Substitutions | Properties |
---|---|---|---|
2 | Cumulative Mann-Kendall Wilcoxon paired^{§} | Fisher's exact Chi-square^{§} | Cumulative Mann-Kendall Wilcoxon paired^{§} |
> 2 | Cumulative Mann-Kendall | _ | Cumulative Mann-Kendall |
Composition
Substitutions
Substitutions of amino acid x→ y
Substitutions of x: | y | Not y |
---|---|---|
Adaptation M→P: | ${N}_{x,y}^{(M,P)}$ 89 | ${N}_{x,non-y}^{(M,P)}$ 1754 |
Background M→M: | ${N}_{x,y}^{(M,M)}$ 14 | ${N}_{x,non-y}^{(M,M)}$ 514 |
where G_{ MP } is the number of ordered pairs of sequences in the two groups. This use of the group sample mean may also remove some of the random genetic drift in the data.
To be able to detect additional substitution biases, we can also pool the substitution data that share the same outcome in the psychrophilic population, e.g. we study the substitutions x → aa, where x may be any amino acid and aa is given.
Regression based on properties
In our case we are basically modeling discrete categorical data. However, the amino acids have many different physicochemical properties, and the amino acid alphabet can be efficiently reduced to univariate measurement data based on the physicochemical properties of each acid, one at a time. The mean property value Y is computed for each protein sequence and analyzed by regression, with the phenotypic traits of the host organism as the predictor variable X. The phenotypes may be grouped in an ordinal stochastic variable, i.e. preference of optimal growth temperature given as mesophile(X = 1), intermediate(X = 2), and psychrophile(X = 3) [5].
where β_{ 0 } is the intercept term of the baseline taxonomic order, and β_{ s } are the additional strata-specific intercept term for dummy variable I_{ s } .
False discovery rate
DeltaProt creates an output of summary statistics. These summary statistics include mean, standard deviation and p-values from the comparative tests. When tests of multiple hypotheses are carried out, the significance levels should be adjusted to account for the increased probability of false positives. For multiple test correction, the False Discovery Rate analysis (FDR) of Benjamini and Yekutieli [18] may be applied.
Limitations
The code for running DeltaProt is entered through a command line interface (CLI) to Matlab, and some key input such as the phenotypic group assignment of the sequences in the alignment(s) must be hard-coded in the source code. Biologists are often unfamiliar with programming computers and interacting with CLIs, and eventually a graphical user interface (GUI), may facilitate the practical use of the software. The approach we use is also suitable for high-throughput sequence analysis through the CLI. Therefore, for large datasets it would be useful to develop a more automatic manner to handle the whole process, from aligning sequences to statistical analysis. We have not ported our application to other similar software platforms like Octave, Scilab or R. With respect to future work on the software, it would also be useful to include analyses of DNA-sequences with physicochemical properties of nucleotides [19, 20].
Similar software
The well known program WebLogo [21] generates a graphical representation of an amino acid or nucleic acid multiple sequence alignment, where each logo consists of stacks of symbols, one stack for each position in the sequence. WebLogo contains no statistical analysis, however Blogo [22] is the same kind of web-based tool that detects and displays statistically significant position-specific sequence bias with reduced background noise. Some other toolboxes have been developed in Matlab for bioinformatic-related analyses. These include MBEToolbox [23, 24] and the bioinformatics toolbox developed by Mathworks [2]. MBEToolbox can aid in the exploration, interpretation and visualization of data in molecular biology and evolution. It includes statistical and sequence manipulation functions relevant to molecular evolution, phylogeny inference, and relevant graphic plots. The bioinformatics toolbox from Mathworks offers a wide range of general bioinformatics functions, including molecular evolution.
DeltaProt provides a range of specialised methods which are more relevant to comparative genomics than those included in either of the other projects, and as such it makes a contribution to the bioinformatics analyses that can be performed in the Matlab environment. It may be used to examine if physicochemical differences between proteins may be related to differences in the environment of the source organisms rather than to phylogeny.
Conclusions
Multiple alignments play a key role in bioinformatics, and provide a very rich set of interesting statistical and data-analysis problems. Addressing these problems will require both development of effective statistical methods and distribution of the implementation of those methods to biologists generating the data. Here we have presented DeltaProt, a toolbox for statistical and visual analyses of alignments from the domain of comparative molecular biology, and we believe that it may provide a good software environment in which to carry out some of this research related to extremophilic organisms. These are organisms that are able to tolerate and thrive in extreme conditions with respect to for example salt, pH, toxic chemicals, radiation, starvation, pressure and temperature. Adaptation to multiple extreme environments (polyextremophiles) makes untangling and isolating the adaptive mechanism(s) to a particular environment even more difficult. Statistical bioinformatics accounts for the inherent variation found in biological data that is generated as part of the bioinformatics investigation, and aims to statistically identify those significant changes in data that answer biological questions. DeltaProt is fully accessible within Matlab and capable of analyzing orthologous protein sequences in seconds. The toolbox is able to handle a large quantity of multiple-aligned sequences in order to look for common adaptive trends at the molecular level. Users can control extra structural attributes, as well as drawing details, to get sufficiently customized output.
Availability and requirements
Project name: DeltaProt, a Matlab toolbox for comparative genomics
Project homepage: The toolbox can be downloaded from the following websites, which also offer documentation (installation instructions, manual and tutorial): http://www.math.uit.no/bi/deltaprot/ or http://services.cbu.uib.no/software/deltaprot/
Operating systems: Cross-platform (dependent on Matlab availability, tested on Windows and Mac).
Programming language: Matlab versions 7.1-7.10 (2010a) is required.
Other requirements: Matlab Statistical Toolbox.
License: The toolbox can be obtained and used for free for academic purposes, and is under the creative commons license. The conditions of the license can be found on: http://creativecommons.org/licenses/by-nc-sa/3.0/
Any restrictions to use by non-academics: Following the previous license.
Declarations
Acknowledgements
We thank Bjørn Altermark, Erik Hjerde and Chris Fenton for their assistance in data acquisition. The present study was supported by the National Program for Research in Functional Genomics in Norway (FUGE) in The Research Council of Norway.
Authors’ Affiliations
References
- Ussery DW, Wassenaar TM, Borini S: Computing for comparative microbial genomics. Springer; 2009.View ArticleGoogle Scholar
- MatlabThe MathWorks, Inc. Natick, MA (US); [http://www.mathworks.com]
- Altermark B, Thorvaldsen S, Moe E, Smalås AO, Willassen NP: Sequence comparison and environmental adaptation of a bacterial endonuclease. Computational Biology and Chemistry 2007, 31(3):163–172. 10.1016/j.compbiolchem.2007.03.003View ArticlePubMedGoogle Scholar
- Thorvaldsen S, Hjerde E, Fenton C, Willassen NP: Molecular characterization of cold adaptation based on ortholog protein sequences from Vibrionaceae species. Extremophiles 2007, 11(5):719–732. 10.1007/s00792-007-0093-yView ArticlePubMedGoogle Scholar
- Thorvaldsen S, Ytterstad E: Environmental adaptation of proteins: Regression models with simple physicochemical properties. Computational Biology and Chemistry 2009, 33(5):351–356. 10.1016/j.compbiolchem.2009.07.002View ArticlePubMedGoogle Scholar
- Siglioccolo A, Bossa F, Pascarella S: Structural adaptation of serine hydroxymethyltransferase to low temperatures. International Journal of Biological Macromolecules 2010, 46(1):37–46. 10.1016/j.ijbiomac.2009.09.009View ArticlePubMedGoogle Scholar
- Fraczkiewicz R, Braun W: Exact and efficient analytical calculation of the accessible surface areas and their gradients for macromolecules. J Comp Chem 1998, 19(3):319–333. 10.1002/(SICI)1096-987X(199802)19:3<319::AID-JCC6>3.0.CO;2-WView ArticleGoogle Scholar
- Adamczak R, Porollo A, Meller J: Accurate Prediction of Solvent Accessi-bility Using Neural Networks Based Regression. Proteins: Structure, Function and Bioinformatics 2004, 56: 753–767. 10.1002/prot.20176View ArticleGoogle Scholar
- Adamczak R, Porollo A, Meller J: Combining Prediction of Secondary Structure and Solvent Accessibility in Proteins. Proteins: Structure, Function and Bioinformatics 2005, 59: 467–75. 10.1002/prot.20441View ArticleGoogle Scholar
- Gromiha MM, Oobatake M, Sarai A: Important amino acid properties for enhanced thermostability from mesophilic to thermophilic proteins. Biophysical Chemistry 1999, 82(1):51–67. 10.1016/S0301-4622(99)00103-9View ArticlePubMedGoogle Scholar
- Kawashima S, Ogata H, Kanehisa M: AAindex: Amino Acid Index Database. Nucleic Acids Research 1999, 27(1):368–369. 10.1093/nar/27.1.368View ArticlePubMedPubMed CentralGoogle Scholar
- Kendall MG, Gibbons JD: Rank Correlation Methods. 5th edition. Edward Arnold; 1990.Google Scholar
- Jones DT, Taylor WR, Thornton JM: The rapid generation of mutation data matrices from protein sequences. Computer Applications in the Biosciences 1992, 8(3):275–282.PubMedGoogle Scholar
- Chakravarty S, Varadarajan R: Elucidation of factors responsible for enhanced thermal stability of proteins: A structural genomics based study. Biochemistry 2002, 41(25):8152–8161. 10.1021/bi025523tView ArticlePubMedGoogle Scholar
- Agresti A: Exact inference for categorical data: recent advances and continuing controversies. Statistics in Medicine 2001, 20(17–18):2709–2722. 10.1002/sim.738View ArticlePubMedGoogle Scholar
- Andres AM, Quevedo MJS, Mato AS: Fisher's mid-p-value arrangement in 2×2 comparative trials. Computational Statistics & Data Analysis 1998, 29(1):107–115.View ArticleGoogle Scholar
- Agresti A: Categorical Data Analysis. 2nd edition. John Wiley & Sons; 2002.View ArticleGoogle Scholar
- Benjamini Y, Yekutieli D: The control of the false discovery rate in multiple testing under dependency. Ann Stat 2001, 29(4):1165–1188. 10.1214/aos/1013699998View ArticleGoogle Scholar
- Florquin K, Saeys Y, Degroeve S, Rouze P, Van de Peer Y: Large-scale structural analysis of the core promoter in mammalian and plant genomes. Nucleic Acids Research 2005, 33(13):4255–4264. 10.1093/nar/gki737View ArticlePubMedPubMed CentralGoogle Scholar
- Friedel M, Nikolajewa S, Suhnel J, Wilhelm T: DiProDB: a database for dinucleotide properties. Nucleic Acids Research 2009, 37(Sp Iss):D37-D40. 10.1093/nar/gkn597View ArticlePubMedPubMed CentralGoogle Scholar
- Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: A sequence logo generator. Genome Research 2004, 14: 1188–1190. 10.1101/gr.849004View ArticlePubMedPubMed CentralGoogle Scholar
- Li W, Yang B, Liang S, Wang Y, Whiteley C, Cao Y, Wang X: BLogo: A tool for visualization of bias in biological sequences. Bioinformatics 2008, 24(19):2254–2255. 10.1093/bioinformatics/btn407View ArticlePubMedGoogle Scholar
- Cai JJ, Smith DK, Xia X, Yuen KY: MBEToolbox: a Matlab toolbox for sequence data analysis in molecular biology and evolution. BMC Bioinformatics 2005, 6: 64. 10.1186/1471-2105-6-64View ArticlePubMedPubMed CentralGoogle Scholar
- Cai JJ, Smith DK, Xia X, Yuen KY: MBEToolbox 2.0: An enhanced version of a MATLAB toolbox for Molecular Biology and Evolution. Evolutionary Bioinformatics 2006, 2: 179–182.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.