Detecting false positive sequence homology: a machine learning approach
- M. Stanley Fujimoto†1,
- Anton Suvorov†2,
- Nicholas O. Jensen†2,
- Mark J. Clement1 and
- Seth M. Bybee2
© Fujimoto et al. 2016
Received: 11 May 2015
Accepted: 19 February 2016
Published: 24 February 2016
Accurate detection of homologous relationships of biological sequences (DNA or amino acid) amongst organisms is an important and often difficult task that is essential to various evolutionary studies, ranging from building phylogenies to predicting functional gene annotations. There are many existing heuristic tools, most commonly based on bidirectional BLAST searches, that are used to identify homologous genes and assign them to two fundamentally distinct classes: orthologs and paralogs. Because these methods rely only on heuristic filtering based on significance score cutoffs and have no cluster post-processing tools available, they can often produce clusters of unrelated (non-homologous) sequences. Consequently, sequence data extracted from incomplete genome/transcriptome assemblies, originating from low-coverage sequencing or produced by de novo processes without a reference genome, are susceptible to high false positive rates of homology detection.
In this paper we develop biologically informative features that can be extracted from multiple sequence alignments of putative homologous genes (orthologs and paralogs) and further utilized in the context of guided experimentation to verify false positive outcomes. We demonstrate that our machine learning method, trained on both known homology clusters obtained from OrthoDB and randomly generated sequence alignments (non-homologs), successfully identifies apparent false positives inferred by heuristic algorithms, especially among proteomes recovered from low-coverage RNA-seq data. Approximately 42 % and 25 % of the putative homologies predicted by InParanoid and HaMStR, respectively, were classified as false positives on the experimental data set.
Our process increases the quality of output from other clustering algorithms by providing a novel post-processing method that is both fast and efficient at removing low quality clusters of putative homologous genes recovered by heuristic-based approaches.
One of the most fundamental tasks of modern comparative evolutionary phylogenomics is to identify common (homologous) genes that originated through complex biological mechanisms such as speciation, multiple gene losses/gains, horizontal gene transfers and deep coalescence. When homologous sequences are identified, they are usually grouped and aligned together to form clusters. Homologous DNA sequences (and their amino acid translations) can be further subdivided into two major classes: orthologs and paralogs. Orthologs are defined as homologous genes in different species that arose due to speciation events, whereas paralogs have evolved from gene duplications. Moreover, orthologous genes are more likely to exhibit a similar tempo and mode of evolution, thus preserving overall sequence composition and physiological function. Paralogs, instead, tend to follow different evolutionary trajectories leading to subfunctionalization, neofunctionalization or both. Nevertheless, this phenomenon, called the ortholog conjecture, is still debated and requires additional validation, since it has been shown that even between closely related species some orthologs can diverge such that they eventually lose common functionality.
The accurate detection of sequence homology and subsequent binning into the aforementioned classes is essential for robust reconstruction of evolutionary histories in the form of phylogenetic trees. To date, numerous computational algorithms and statistical methods have been developed to perform orthology/paralogy assignments for genic sequences. Methodologically, these approaches employ heuristic-based or evidence (phylogenetic tree)-based identification strategies, which produce varying frequencies of false positive or false negative results. The majority of heuristic algorithms rely on the principle of Reciprocal Best Hit (RBH), where BLAST hit scores (e-values) approximate evolutionary similarity between two biological sequences. Further algorithmic augmentations of those heuristics, for instance Markov graph clustering (unsupervised learning), enable the definition of orthologous/paralogous clusters from multiple pairwise comparisons. Despite their relatively low computational complexity, these algorithms have been shown to overestimate the number of putative homologies (i.e., they show higher rates of false positive detection than evidence-based methods).
- Proteome training sets composed of phylogenetically “meaningful” taxa for construction of core ortholog clusters may not be available,
- Identification of informative core ortholog clusters may be somewhat cumbersome due to incomplete and/or low coverage sequencing,
- The pHMMs may not contain any relevant compositional or phylogenetic properties about biological sequences that constitute MSA, and
- Inability to explicitly identify paralogy limits the use of HaMStR for some evolutionary applications.
Hence, homologous clusters inferred from various multiple sequences require further validation to improve confidence in orthology/paralogy classification.
Here, we propose a unique approach to identify false positive homologies detected by heuristic methods, for example HaMStR or InParanoid. Our machine learning method uses phylogenetically guided inferred homologies to identify non-homologous (false positive) clusters of sequences. This improves the accuracy of heuristic searches, such as those that rely on BLAST.
Library preparation and RNA-seq
For the experimental data set (OD_S) we used 18 Odonata (dragonflies and damselflies) and 2 Ephemeroptera (mayflies) species. Total RNA was extracted from the eye tissues of each taxon using NucleoSpin RNA II columns (Clontech) and reverse-transcribed into cDNA libraries using the Illumina TruSeq RNA v2 sample preparation kit that both generates and amplifies full-length cDNAs. Prepped Ephemeroptera mRNA libraries were sequenced on an Illumina HiSeq 2000 producing 101 bp paired-end reads by the Microarray and Genomic Analysis Core Facility at the Huntsman Cancer Institute at the University of Utah, Salt Lake City, UT, USA, while all Odonata preps were sequenced on a GAIIx producing 72 bp paired-end reads by the DNA sequencing center at Brigham Young University, Provo, UT, USA. The expected insert sizes were 150 bp and 280 bp respectively. Raw RNA-seq reads were deposited in the National Center for Biotechnology Information (NCBI), Sequence Read Archive, see Additional file 1.
Read trimming and de novo transcriptome assembly
The read libraries were trimmed using the Mott algorithm implemented in PoPoolation with default parameters (minimum read length = 40, quality threshold = 20). For the assembly of the transcriptome contigs we used Trinity, currently the most accurate de novo assembler for RNA-seq data, under the default parameters.
Downstream transcriptome processing
To identify putative protein sequences within the Trinity assemblies we used TransDecoder (http://transdecoder.github.io), a utility integrated into the comprehensive Trinotate pipeline (http://trinotate.github.io) that is specifically developed for automatic functional annotation of transcriptomes. TransDecoder identifies the longest open reading frames (ORFs) within each assembled DNA contig; the subset of the longest ORFs is then used to empirically estimate parameters for a Markov model based on hexamer distribution. The reference null distribution that represents non-coding sequences is constructed by randomizing the composition of these longest contigs. During the next decision step, each longest determined ORF and its 5 alternative reading frames are tested using the trained Markov model. If the log-likelihood coding/noncoding ratio is positive and is the highest among the six frames, the putative ORF with the correct reading frame is retained in the protein collection (proteome). For more details about the RNA-seq libraries, assemblies and predicted proteomes see Additional file 1.
Construction of Drosophila data set
Ten high quality Drosophila raw RNA-seq data sets (DROSO) were obtained from NCBI (Additional file 2). First we trimmed the reads using PoPoolation and subsampled the read libraries to the size of the smallest library (Drosophila biarmipes). Then, two additional data sets corresponding to 50 % and 10 % of the scaled libraries were constructed by randomly drawing reads from the original full-sized libraries. Finally, de novo transcriptome assembly and protein prediction were conducted as outlined above for these three data sets. These data sets were used to test whether homology clusters derived from low-coverage RNA-seq libraries contain more false positives.
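The scaling and down-sampling steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the in-memory representation of paired-end reads as tuples, and the fixed seed are all assumptions made for the example. Mates are kept together so that each draw removes or retains a whole read pair.

```python
import random


def subsample_pairs(read_pairs, target_size, rng):
    """Randomly draw `target_size` read pairs without replacement."""
    if target_size > len(read_pairs):
        raise ValueError("target size exceeds library size")
    return rng.sample(read_pairs, target_size)


def build_depth_series(libraries, fractions=(1.0, 0.5, 0.1), seed=0):
    """Scale every library to the smallest one, then to each fraction of it.

    `libraries` maps a (hypothetical) library name to a list of read pairs.
    Each fraction is drawn from the original full-sized library.
    """
    rng = random.Random(seed)
    smallest = min(len(pairs) for pairs in libraries.values())
    series = {}
    for fraction in fractions:
        target = round(smallest * fraction)
        series[fraction] = {
            name: subsample_pairs(pairs, target, rng)
            for name, pairs in libraries.items()
        }
    return series


# Toy libraries with made-up read identifiers, for illustration only.
toy_libraries = {
    "lib_A": [("a%d/1" % i, "a%d/2" % i) for i in range(200)],
    "lib_B": [("b%d/1" % i, "b%d/2" % i) for i in range(100)],
}
depth_series = build_depth_series(toy_libraries)
```

In practice the reads would be streamed from FASTQ files rather than held in memory; the sketch only shows the sampling logic.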
Gene homology inference
To predict probable homology relationships between proteomes we used the heuristic predictor InParanoid/MultiParanoid, which is based on the RBH concept [12, 17]. Among heuristic-based methods for sequence homology detection, OrthoMCL and InParanoid have been shown to exhibit comparably high specificity and sensitivity, as estimated by Latent Class Analysis, so in the present study we used InParanoid/MultiParanoid v. 4.1 for simplicity of computational implementation. InParanoid first performs bidirectional BLAST searches between two proteomes to detect bidirectional BLAST hits (BBHs) in a pairwise manner. For this step, we used default parameters with the BLOSUM62 protein substitution matrix and a bit score cutoff of 40 for the all-against-all BLAST search. Next, MultiParanoid forms multi-species groups using single linkage. Because the MultiParanoid clustering algorithm is inefficient, we instead performed a transitive closure to compile homology clusters for all species together. Transitive closure is an operation on a relation defined over a set of values. Formally, a relation on a set S is transitive if, for all values A, B and C in S, whenever A is related to B and B is related to C, A is also related to C. The transitive closure of a relation (transitive or not) adds every relationship implied by transitivity that is not already present. When a relation is already transitive, its transitive closure is identical to itself. For the pairwise relationships produced by InParanoid, we constructed orthologous clusters using transitive closure, where gene identifiers were the values and homology was the relationship.
For example, our OD_S data set consisted of N = 20 proteomes, so we performed N × (N − 1)/2 = 190 pairwise InParanoid queries. A simple transitive closure yielded a total of 13,998 homology clusters for OD_S. The DROSO data set yielded 20,676, 18,584 and 17,067 homology clusters for the 100 %, 50 % and 10 % libraries, respectively. Putative homologous genes were then aligned to form individual MSA homology clusters for the subsequent analyses using MAFFT v. 6.864b with the “--auto” flag, which automatically selects between accuracy- and speed-oriented alignment strategies.
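The clustering step can be illustrated with a short sketch. It assumes that homology is treated as a symmetric relationship, so that the transitive closure of the pairwise calls reduces to finding connected components; a union-find structure is one common way to do this and is not necessarily what the authors implemented. The gene identifiers below are hypothetical.

```python
from collections import defaultdict


def pair_count(n):
    """Number of unordered proteome pairs, n * (n - 1) / 2."""
    return n * (n - 1) // 2


def transitive_closure_clusters(pairs):
    """Group gene identifiers into clusters from pairwise homology calls.

    Assumes the relationship is symmetric, so each cluster is a connected
    component of the graph whose edges are the pairwise calls.
    """
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for gene_a, gene_b in pairs:
        root_a, root_b = find(gene_a), find(gene_b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(set)
    for gene in list(parent):
        clusters[find(gene)].add(gene)
    return sorted((sorted(members) for members in clusters.values()),
                  key=lambda members: members[0])


# Hypothetical pairwise calls between three proteomes.
pairwise_calls = [
    ("sp1|g1", "sp2|g7"),
    ("sp2|g7", "sp3|g4"),  # links sp3|g4 to sp1|g1 through sp2|g7
    ("sp1|g9", "sp3|g2"),
]
homology_clusters = transitive_closure_clusters(pairwise_calls)
n_queries = pair_count(20)
```

Note that single-linkage merging of this kind lets one spurious pairwise hit join two otherwise unrelated groups, which is one way false positive clusters can arise.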
Additionally, we utilized HaMStR v. 13.2.3 under default parameters to delineate putative orthologous sequences in the OD_S proteome sets. A total of 5,332 core 1-to-1 ortholog clusters of 5 arthropod species (Ixodes scapularis, Daphnia pulex, Rhodnius prolixus, Apis mellifera and Heliconius melpomene) for training pHMMs were retrieved from the latest version of OrthoDB. We used Rhodnius prolixus (triatomid bug) as the reference core proteome because it is the species with a publicly available proteome that is phylogenetically closest to the Ephemeroptera/Odonata lineage. As previously described, each core ortholog cluster was aligned to create an MSA using MAFFT and converted into an HMM profile using HMMER v. 3.0. BBHs against the reference proteome were derived using reciprocal BLAST.
Construction of ground-truth training sets
The OrthoDB database is one of the most comprehensive collections of putative orthologous relationships predicted from proteomes across a vast taxonomic range. These data are particularly useful for construction of training sets since OrthoDB clusters were detected using a phylogeny-informed approach collated with available functional annotations. Hence, training sets constructed from OrthoDB clusters have the inherent benefit of both an evolutionary and a physiological assessment, resulting in more precise filtering for false positive homology.
The key to our method was the development of labeled training sets that were used to train supervised machine learning classifiers. Homology clusters were already known and annotated in OrthoDB. There were, however, no annotated clusters that represented non-homology clusters from random alignments. Thus, we created and annotated our own set of non-homology clusters through a generative process. We created these clusters in two ways: by aligning randomly drawn sequences and by evolving sequences taken from the homology clusters.
We extracted 5,332 homology (H) clusters from the predefined OrthoDB profile called “single copy in > 70 % of species” across the entire arthropod phylogeny in the database, and then aligned them. Non-homology (NH) clusters were generated by i) aligning sequences drawn randomly from the totality of the protein sequences, with cluster size sampled from Poisson(λ), where λ = 44.3056 was estimated as the average cluster size of the H clusters, and ii) evolving the sequences taken from H clusters. The evolving process used PAML to generate a random binary tree for each sequence within a cluster. The discretized number of terminal branches for each random tree was sampled from a normal distribution with a mean of 50 and a standard deviation of 15. Within each cluster, individual sequences were evolved along their respective randomly generated trees using Seq-Gen. We used WAG + I as the substitution model for the amino acid sequences, specifying the proportion of invariable sites (-i) at 0 %, 25 % and 50 %. Then, to form NH clusters, a single evolved sequence from the terminal branches was selected randomly from each tree. By doing so, we simulated more realistic clusters in which the evolved sequences had diverged enough to be considered non-homologous to each other.
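The two sampling steps described above (Poisson-distributed cluster sizes and discretized normal tip counts) can be sketched with the Python standard library. This is only an illustration under stated assumptions: the Poisson draw uses Knuth's multiplication method, the lower bound of two sequences or tips is a guard added here and is not stated in the text, and the protein pool is a list of made-up identifiers. Tree generation and sequence evolution (PAML, Seq-Gen) are not reproduced.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_tip_count(rng, mean=50.0, sd=15.0, minimum=2):
    """Discretized normal draw for the number of terminal branches.

    The `minimum` guard is an assumption; the text does not say how
    non-positive draws were handled.
    """
    return max(minimum, round(rng.gauss(mean, sd)))


def random_nh_cluster(protein_pool, rng, lam=44.3056):
    """Pick a Poisson-sized set of randomly drawn sequences from the pool."""
    size = min(len(protein_pool), max(2, sample_poisson(lam, rng)))
    return rng.sample(protein_pool, size)


rng = random.Random(0)
toy_pool = ["protein_%04d" % i for i in range(5000)]  # hypothetical identifiers
nh_cluster = random_nh_cluster(toy_pool, rng)
tip_count = sample_tip_count(rng)
```

The sequences drawn for each NH cluster would then be aligned in the same way as the H clusters so that both classes pass through identical feature extraction.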
From the H and NH clusters, two different sets of training, validation and testing partitions were formed. The first set (EQUAL) had an equal number of homology, randomly aligned, 0 % invariable-site evolved, 25 % invariable-site evolved and 50 % invariable-site evolved clusters across the combined training, validation and testing data sets. In the second set (PROP), half of the clusters were homology clusters, while the remaining half was composed of equal parts randomly aligned, 0 % invariable-site evolved, 25 % invariable-site evolved and 50 % invariable-site evolved clusters. The combined data sets were then partitioned into training, validation and testing sets by randomly sampling from the pool of clusters and assigning 80 % of the clusters (8,800) to training, 10 % (1,100) to validation and the last 10 % (1,100) to testing.
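A random 80/10/10 partition of this kind can be sketched as below. The function name and the use of a seeded shuffle are assumptions for illustration; the text states only that clusters were sampled randomly into the three partitions.

```python
import random


def partition(clusters, rng, train_frac=0.8, valid_frac=0.1):
    """Shuffle the pool and split it into training, validation and testing."""
    shuffled = list(clusters)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test


# 11,000 placeholder cluster indices stand in for the labeled clusters.
train_set, valid_set, test_set = partition(range(11000), random.Random(1))
```

With 11,000 clusters in the pool this reproduces the partition sizes of 8,800, 1,100 and 1,100 given in the text.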
Features used to train the machine learning algorithms. Each feature was calculated for every cluster.
- The number of positions identified by Aliscore as randomly aligned
- The length of the alignment
- # of Sequences: The number of sequences in the alignment
- # of Gaps: Number of base positions marked with a gap
- # of Amino Acids: Number of amino acids in the alignment
- Longest non-aligned sequence length minus shortest non-aligned sequence length
- Amino Acid Charged: Standard deviation for the proportions of amino acids in the charged class for each sequence
- Amino Acid Uncharged: Standard deviation for the proportions of amino acids in the uncharged class for each sequence
- Amino Acid Special: Standard deviation for the proportions of amino acids in the non-charged and non-hydrophobic class for each sequence
- Amino Acid Hydrophobic: Standard deviation for the proportions of amino acids in the hydrophobic class for each sequence
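Most of the features listed above can be computed directly from an aligned cluster. The sketch below is an illustration under several assumptions that the text does not settle: the residue class memberships are a common textbook grouping chosen here, gaps are counted as individual gap characters, the population standard deviation is used, and the dictionary keys are hypothetical names. The Aliscore feature requires the external Aliscore program and is omitted.

```python
from statistics import pstdev

# Assumed residue classes; the source does not list class membership.
RESIDUE_CLASSES = {
    "charged": set("DEKRH"),
    "uncharged": set("STNQ"),
    "special": set("CGP"),
    "hydrophobic": set("AVILMFYW"),
}


def alignment_features(aligned):
    """Compute cluster features from equal-length aligned sequences."""
    ungapped = [seq.replace("-", "") for seq in aligned]
    lengths = [len(seq) for seq in ungapped]
    features = {
        "alignment_length": len(aligned[0]),
        "n_sequences": len(aligned),
        "n_gaps": sum(seq.count("-") for seq in aligned),
        "n_amino_acids": sum(lengths),
        "length_range": max(lengths) - min(lengths),
    }
    for name, residues in RESIDUE_CLASSES.items():
        # Proportion of each sequence's residues that fall in this class.
        proportions = [
            sum(aa in residues for aa in seq) / len(seq) if seq else 0.0
            for seq in ungapped
        ]
        features["sd_" + name] = pstdev(proportions)
    return features


toy_alignment = ["MKV-LLDE", "MKVALLDE", "M-V-LLE-"]
toy_features = alignment_features(toy_alignment)
```

For the toy alignment this yields an alignment length of 8, three sequences, four gap characters, 20 amino acids and a length range of 3.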
For detection of false positive homology we used several supervised machine learning algorithms to learn from the labeled data instances. Supervised machine learning algorithms take labeled instances of a particular event as input. From the features associated with these labeled instances, the algorithm learns to classify other, unlabeled instances. A number of different algorithms were tried in order to find a model that performed well. The Waikato Environment for Knowledge Analysis (WEKA) software was used for training the supervised machine learning classifiers and for evaluating the test data sets. A set of models was trained and compared using the arthropod data set (see Training data sets for additional information).
The machine learning parameters used for each of the different algorithms in WEKA
- weka.classifiers.functions.MultilayerPerceptron -L 0.1 -M 0.05 -N 3000 -V 0 -S 0 -E 40 -H a
- Support Vector Machine (SVM): weka.classifiers.functions.SMO -C 1.0 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.functions.supportVector.PolyKernel -C
- weka.classifiers.trees.RandomForest -I 10 -K 0 -S 1
- weka.classifiers.functions.Logistic -R 1.0E-8 -M -1
- Meta-Classifier w/o Logistic Regression: weka.classifiers.meta.Stacking -X 10 -M "weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a" -S 1 -B "weka.classifiers.trees.RandomForest -I 10 -K 0 -S 1" -B "weka.classifiers.bayes.NaiveBayes " -B "weka.classifiers.functions.SMO -C 1.0 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0""
- Meta-Classifier w/ Logistic Regression: weka.classifiers.meta.Stacking -X 10 -M "weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a" -S 1 -B "weka.classifiers.functions.Logistic -R 1.0E-8 -M -1" -B "weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a" -B "weka.classifiers.trees.RandomForest -I 10 -K 0 -S 1" -B "weka.classifiers.bayes.NaiveBayes " -B "weka.classifiers.functions.SMO -C 1.0 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0""
The training data set was used as input to the machine learning model for parameter selection. For the arthropod data set, 80 % of the data were used for training, while 10 % were reserved for validation and the last 10 % for testing. Machine learning algorithms learned from the combination of H and NH clusters in the data set to differentiate the two. A trained model could then be used to classify unlabeled instances as homologous or non-homologous. A total of 8,800 instances from the OrthoDB arthropod data set were used as the training set for each of the PROP and EQUAL data sets. In the PROP data set, there were 4,378 H and 4,422 NH clusters. In the EQUAL data set, there were 1,753 H and 7,047 NH clusters.
The validation data sets were used after the model had been trained on the training data set. Applying the trained model to the validation set showed how well the model generalized. 10 % of the arthropod data set formed the arthropod validation set. The models trained using the arthropod training set were validated only with the arthropod instances. If a model did not perform adequately on the validation set, the parameters of the machine learning algorithm were modified in an attempt to improve performance. The re-trained models were then re-validated on their respective validation sets. The process was repeated until adequate performance of the learning algorithm was reached. The OrthoDB arthropod validation set consisted of 1,100 instances for each of the PROP and EQUAL data sets. The PROP data set had 566 H and 534 NH clusters. The EQUAL data set had 238 H and 862 NH clusters.
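The train, validate and re-tune cycle described above can be sketched generically. The example below is not the WEKA workflow used in the study: a tiny k-nearest-neighbour classifier stands in for the real models, the neighbourhood size k stands in for the tuned parameters, and the one-dimensional feature vectors are invented toy values (label 1 for H, 0 for NH).

```python
def knn_classifier(train, k):
    """Return a predict function that votes among the k nearest neighbours."""
    def predict(features):
        nearest = sorted(
            train,
            key=lambda inst: sum((a - b) ** 2 for a, b in zip(inst[0], features)),
        )[:k]
        votes = sum(label for _, label in nearest)
        return 1 if 2 * votes > k else 0
    return predict


def accuracy(predict, instances):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(predict(x) == y for x, y in instances) / len(instances)


def select_parameter(train, valid, candidates):
    """Train once per candidate and keep the best by validation accuracy."""
    scored = [(accuracy(knn_classifier(train, k), valid), k) for k in candidates]
    best_accuracy, best_k = max(scored, key=lambda item: item[0])
    return best_k, best_accuracy


# Invented toy instances: (feature vector, label).
toy_train = [((0.1,), 0), ((0.2,), 0), ((0.3,), 0),
             ((0.8,), 1), ((0.9,), 1), ((1.0,), 1)]
toy_valid = [((0.15,), 0), ((0.85,), 1), ((0.25,), 0), ((0.95,), 1)]
best_k, best_accuracy = select_parameter(toy_train, toy_valid, (1, 3, 5))
```

Keeping the test partition out of this loop is what allows the final test accuracy to serve as an unbiased estimate after tuning.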
We tested our filtering process by applying the arthropod classifiers trained on the ground-truth data set to the DROSO and OD_S data sets. Unlike the testing sets mentioned in the previous section, the ground truth for these data sets was unknown. We examined the number of clusters filtered and manually inspected a subset of the filtered clusters to verify that only false positive homology clusters were removed. To the authors’ knowledge, no other post-processing methods for cluster filtering exist, so our approach is novel. The filtering processes that do exist are heuristic-based approaches, such as an e-value cutoff, that are built-in modules of the clustering software. Therefore, for comparison, we only examined the number of clusters filtered from the output of InParanoid and HaMStR.
Results and discussion
Summary of arthropod machine learning model performance (columns: OrthoDB Arthropod EQUAL, OrthoDB Arthropod PROP; rows include Support Vector Machine (SVM), Meta-Classifier w/o Logistic Regression and Meta-Classifier w/ Logistic Regression)
Summary of InParanoid and HaMStR cluster filtering
We have demonstrated a machine learning method that can be used to differentiate homology and non-homology clusters based on characteristics of known good and bad clusters. These results can be seen in our trained models’ ability to achieve high classification accuracy on the test data sets, as well as in the number of clusters that were removed from the experimental OD_S data set. We developed a training set of known good and bad clusters; no such set was previously available, and its absence had made supervised machine learning impossible. Using a feature set that we developed, we tested various machine learning algorithms and found that, when trained on our training data sets, the meta-classifier with logistic regression consistently matched or outperformed the other models, performing on par with the meta-classifier without logistic regression.
We also applied our method to other data sets. It was especially useful when applied to the OD_S data set, where it filtered out many clusters with false positive homology. We showed that our method is effective in settings where non-model organisms are being studied and the transcriptome assembly quality is low, primarily due to low coverage sequencing or partial RNA degradation.
This paper has demonstrated the usefulness of machine learning in finding homology clusters by quickly removing low quality clusters without using any additional heuristics. The clusters that are retained can then be used later in higher quality phylogeny reconstruction and/or other analyses of gene evolution. In the future, we aim to explore machine learning approaches to clustering sequences more deeply to produce more refined and reliable homology clusters.
We thank Gavin J. Martin and Nathan P. Lord for the generation of sequence data, T. Heath Ogden for providing specimens and Eric Ringger for his help with machine learning model selection and valuable discussion. We also thank the National Science Foundation for funding this research in the form of a grant awarded to both SMB and MJC (IOS-1265714).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Koonin EV. Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet. 2005;39:309–38.
- Gabaldon T, Koonin EV. Functional and evolutionary implications of gene orthology. Nat Rev Genet. 2013;14(5):360–6.
- Dessimoz C, Gabaldon T, Roos DS, Sonnhammer EL, Herrero J, Quest for Orthologs Consortium. Toward community standards in the quest for orthologs. Bioinformatics. 2012;28(6):900–4.
- Delsuc F, Brinkmann H, Philippe H. Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet. 2005;6(5):361–75.
- Kristensen DM, Wolf YI, Mushegian AR, Koonin EV. Computational methods for Gene Orthology inference. Brief Bioinform. 2011;12(5):379–91.
- Overbeek R, Fonstein M, D’Souza M, Pusch GD, Maltsev N. The use of gene clusters to infer functional coupling. Proc Natl Acad Sci U S A. 1999;96(6):2896–901.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
- Li L, Stoeckert Jr CJ, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13(9):2178–89.
- Chen F, Mackey AJ, Vermunt JK, Roos DS. Assessing performance of orthology detection strategies applied to eukaryotic genomes. PLoS One. 2007;2(4):e383.
- Ebersberger I, Strauss S, von Haeseler A. HaMStR: profile hidden markov model based search for orthologs in ESTs. BMC Evol Biol. 2009;9:157.
- Eddy SR. Profile hidden Markov models. Bioinformatics. 1998;14(9):755–63.
- Remm M, Storm CE, Sonnhammer EL. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol. 2001;314(5):1041–52.
- Kofler R, Orozco-terWengel P, De Maio N, Pandey RV, Nolte V, Futschik A, Kosiol C, Schlotterer C. PoPoolation: a toolbox for population genetic analysis of next generation sequencing data from pooled individuals. PLoS One. 2011;6(1):e15925.
- Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29(7):644–52.
- Zhao QY, Wang Y, Kong YM, Luo D, Li X, Hao P. Optimizing de novo transcriptome assembly from short-read RNA-Seq data: a comparative study. BMC Bioinformatics. 2011;12 Suppl 14:S2.
- Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, Couger MB, Eccles D, Li B, Lieber M, et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc. 2013;8(8):1494–512.
- Alexeyenko A, Tamas I, Liu G, Sonnhammer EL. Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics. 2006;22(14):e9–15.
- Katoh K, Misawa K, Kuma K, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002;30(14):3059–66.
- Waterhouse RM, Tegenfeldt F, Li J, Zdobnov EM, Kriventseva EV. OrthoDB: a hierarchical catalog of animal, fungal and bacterial orthologs. Nucleic Acids Res. 2013;41(Database issue):D358–365.
- Meusemann K, von Reumont BM, Simon S, Roeding F, Strauss S, Kuck P, Ebersberger I, Walzl M, Pass G, Breuers S, et al. A phylogenomic approach to resolve the arthropod tree of life. Mol Biol Evol. 2010;27(11):2451–64.
- Eddy SR. Accelerated Profile HMM Searches. PLoS Comput Biol. 2011;7(10):e1002195.
- Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24(8):1586–91.
- Rambaut A, Grassly NC. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci. 1997;13(3):235–8.
- Whelan S, Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 2001;18(5):691–9.
- Misof B, Misof K. A Monte Carlo approach successfully identifies randomness in multiple sequence alignments: a more objective means of data exclusion. Syst Biol. 2009;58(1):21–34.
- Castresana J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol. 2000;17(4):540–52.
- Kuck P, Meusemann K, Dambach J, Thormann B, von Reumont BM, Wagele JW, Misof B. Parametric and non-parametric masking of randomness in sequence alignments can be improved and leads to better resolved trees. Front Zool. 2010;7:10.
- Kreil DP, Ouzounis CA. Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res. 2001;29(7):1608–15.
- Wang GZ, Lercher MJ. Amino acid composition in endothermic vertebrates is biased in the same direction as in thermophilic prokaryotes. BMC Evol Biol. 2010;10:263.
- Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. SIGKDD Explor Newsl. 2009;11(1):10–8.
- Hughes AL. Evolutionary conservation of amino acid composition in paralogous insect vitellogenins. Gene. 2010;467(1–2):35–40.
- Misof B, Liu S, Meusemann K, Peters RS, Donath A, Mayer C, Frandsen PB, Ware J, Flouri T, Beutel RG, et al. Phylogenomics resolves the timing and pattern of insect evolution. Science. 2014;346(6210):763–7.