MultiLoc2: integrating phylogeny and Gene Ontology terms improves subcellular protein localization prediction

Background Knowledge of subcellular localization of proteins is crucial to proteomics, drug target discovery and systems biology since localization and biological function are highly correlated. In recent years, numerous computational prediction methods have been developed. Nevertheless, there is still a need for prediction methods that show more robustness and higher accuracy. Results We extended our previous MultiLoc predictor by incorporating phylogenetic profiles and Gene Ontology terms. Two different datasets were used for training the system, resulting in two versions of this high-accuracy prediction method. One version is specialized for globular proteins and predicts up to five localizations, whereas a second version covers all eleven main eukaryotic subcellular localizations. In a benchmark study with five localizations, MultiLoc2 performs considerably better than other methods for animal and plant proteins and comparably for fungal proteins. Furthermore, MultiLoc2 performs clearly better when using a second dataset that extends the benchmark study to all eleven main eukaryotic subcellular localizations. Conclusion MultiLoc2 is an extensive high-performance subcellular protein localization prediction system. By incorporating phylogenetic profiles and Gene Ontology terms MultiLoc2 yields higher accuracies compared to its previous version. Moreover, it outperforms other prediction systems in two benchmarks studies. MultiLoc2 is available as user-friendly and free web-service, available at: .


Background
A eukaryotic cell is organized into different membranesurrounded compartments which are specialized for different cellular functions. However, most cellular proteins are synthesized in the cytoplasm and need to be transported to their final location to fulfill their biological function. The whole protein sorting process is not completely understood but, in principle, it depends on signals in the amino acid sequence or signal patches on the protein surface.
There are diverse applications for the knowledge of the localization of the complete proteome, the localizome, in the fields of proteomics, drug target discovery and systems biology. Since subcellular localization is highly correlated with biological function, it is possible to draw conclu-sions from the knowledge of a protein's localization regarding its cellular role. Hence, subcellular localization is a key functional characteristic of proteins. Proteins destined for the extracellular space or the cell surface are especially of pharmaceutical interest as they are easily accessible drug targets. The integration of large-scale localization data with diverse omics data, produced by highthroughput techniques, will help in understanding cellular function. Localization data can be used to validate or analyze protein-protein interactions inferred from twohybrid experiments or biochemical pathways inferred from microarray expression data.
In recent years large-scale sequencing projects have caused a rapid growth of sequence information and increased the number of proteins but without any further annotation in public databases. Determining the localization of proteins using experimental methods alone is expensive and timeconsuming.
Fast and accurate computational prediction methods provide an attractive complement to experimental methods. In the last decade numerous computational methods, which can be roughly divided into sequence-based and annotation-based methods [1,2], have been developed. Sequence-based predictors only use the amino acid sequence of the query protein as input. They are based either on the detection of sequence-coded sorting signals like N-terminal targeting peptides [3][4][5][6][7][8][9][10][11] and nuclear localization signals (NLS) [11] or use the fact that the amino acid composition of a protein is correlated with its localization [12]. The latter methods [2,[13][14][15][16][17][18][19][20][21] use different kinds of composition information like the overall, paired, gapped-paired, surface or pseudo amino acid composition from the protein sequence or sequence profiles. More recent and advanced methods combine composition information with the detection of sorting signals [22,23]. Annotation-based predictors search the sequence for functional domains and motifs [24,25] or use textual information like Swiss-Prot keywords [26,27], Gene Ontology (GO) terms [28,29] or PubMed abstracts [30,31]. If such information is not available for the query protein most of these methods transfer annotation from close homologs. Nair and Rost [32] quantitatively showed that proteins with sufficiently similar sequences usually are close homologs that function at the same localization site. Annotation-based predictors often report higher performance than sequence-based predictors which, however, are more general and robust and can also be used for novel proteins for which no additional information is present and no annotated close homologs can be found. In addition to the predictors of the two categories, there are also hybrid approaches which combine sequencebased and annotation-based information [33][34][35][36][37] and can therefore profit from the advantages of both worlds. A fur-ther category are meta predictors, which integrate the prediction results of multiple tools [38,39].
Although there already exist a lot of computational prediction methods, there is still room for improvement. This is due to the fact that the protein sorting process is very complex and not yet well understood. Only a small portion of proteins have clearly identifiable sorting signals in their primary sequence. As a consequence, available prediction methods are often either specialized for the prediction of very few localizations with higher accuracy or for the prediction of a wide range of localizations with reduced accuracy. A further challenge is how to deal with proteins present in multiple locations [40,41].
The aim of this work was not to develop a completely novel algorithm or protein coding but to create a reliable and efficient predictor with maximum prediction accuracy. Especially, in the context of large-scale genome annotations scientific users rely on high-accuracy subcellular localization predictions. A difference of a few percent points in performance can mean hundreds of correctly or incorrectly classified proteins in genome annotation. Hence, well-tuned predictors can make a significant difference in this area. To this end, we extend our previously published support vector machine (SVM) based predictor MultiLoc [23], which utilizes overall amino acid composition and the presence of known sorting signals. We show that the performance of MultiLoc can be clearly improved by incorporating phylogenetic profiles and GO terms inferred from the primary sequence leading to a highaccuracy prediction system that covers all main eukaryotic subcellular localizations. Phylogenetic profiles encode evolutionary information in the form of patterns of protein inheritance among the species. Marcotte et al. [42] originally applied this approach to distinguish mitochondrial and non-mitochondrial proteins. GO terms were previously combined with sequence-based information in the form of pseudo amino acid composition in a group of predictors [36,37].
The GO terms are used as primary prediction criteria and pseudo amino acid composition is used if no GO term can be found. Our extensive MultiLoc2 prediction system integrates composition and sorting signal information with phylogenetic profiles and GO terms towards a common localization prediction. The extended MultiLoc system is trained on two different datasets resulting in two versions with different resolutions. MultiLoc2-LowRes is a low resolution predictor that is specialized for globular proteins and predicts up to five localizations for animals, fungi and plants. MultiLoc2-HighRes is a high resolution predictor that covers all 11 main eukaryotic subcellular localizations. The main reason for creating MultiLoc2-LowRes additionally to MultiLoc2-HighRes is to provide a predictor with superior prediction accuracy for globular proteins. For certain applications it is sufficient to discriminate between secreted proteins and the localizations of globular proteins.
The MultiLoc2 approach was compared with current stateof-the-art tools (BaCelLo [19], LOCtree [2], Protein Prowler [9], TargetP [5] and WoLF PSORT [22]) using independent datasets sharing very low sequence identity with the training datasets of all compared tools. We found MultiLoc2 to perform considerably better than related tools for animals and plants and comparable for fungal proteins in a benchmark study with five localizations. Since GO terms are not always available, we evaluate MultiLoc2 as purely sequence-based and found the performance only slightly reduced but still better or comparable with other tools. Furthermore, MultiLoc2-HighRes performs clearly better compared with WoLF PSORT using a second independent dataset that extends the benchmark study to all main eukaryotic subcellular localizations. The second benchmark study was performed only between these two predictors since the remaining methods are specialized for a smaller amount of localizations. Both versions of MultiLoc2 are available online as web interface at http://www-bs.informatik.uni-tuebingen.de/Services/ MultiLoc2. The online version provides fast access to MultiLoc2 for a limited number of query sequences. Furthermore, a stand-alone version (including the source code of the method) is available from the website. The stand-alone version is suitable for large-scale offline prediction jobs.
In the following sections the MultiLoc2 system is described in detail together with the training and test datasets used, followed by the performance evaluation and the results of the benchmark studies.

Implementation
MultiLoc2 architecture The MultiLoc prediction system described earlier [23] is based on the integration of the output of four sequencebased subclassifiers (SVMTarget, SVMSA, SVMaac and MotifSearch) into a protein profile vector. The subclassifiers utilize the overall amino acid composition or search for specific sorting signals. MultiLoc2 extends the original architecture with two new classifiers based on phylogenetic profiles (PhyloLoc) and GO terms (GOLoc). As stated in the introduction, there are two versions of MultiLoc2 which differ in the number of predictable localizations. MultiLoc2-HighRes can deal with nuclear (nu), cytoplasmic (cy), mitochondrial (mi), chloroplast (ch), extracellular (ex), plasma membrane (pm), peroxisomal (pe), endoplasmic reticulum (er), Golgi apparatus (go), lysosomal (ly) and vacuolar (va) proteins. MultiLoc2-LowRes is specialized for globular proteins and predicts secretory pathway (SP) proteins (separated into the six classes ex, pm, er, go, ly, va in MultiLoc2-HighRes) as well as nu, cy, mi and ch. Similar to its previous version, MultiLoc2 is available for plant, animal and fungal protein localization prediction. A scheme of the overall architecture of MultiLoc2 is shown in Fig. 1. A query sequence is processed by a first layer of six subprediction methods. The results from these methods are collected in the pro- tein profile vector, which is used as input for the final layer of SVMs, which in turn outputs the final localization prediction. In both layers one-vs-one SVMs are used for classification. The corresponding figure of MultiLoc2-LowRes is also available [see Additional file 1]. The original four sequence-based classifiers are briefly described in the next section, followed by details of PhyloLoc and GOLoc.

Subprediction methods SVMTarget
SVMTarget is based on the detection of N-terminal targeting peptides to predict ch, mi, SP and other (OT) localizations for plant proteins and only mi, SP and OT for animal and fungal proteins. A sliding window approach scans the N-terminal part of a given query sequence. The partial amino acid composition in the window is used as input for the SVMs. The output of SVMTarget is a probability for each localization.

SVMSA
SVMSA scans the sequence for a signal anchor (SA) which can be present in membrane proteins of the secretory pathway instead of a signal peptide. Therefore, SVMSA complements SVMTarget. SAs are also detected using a sliding window approach based on partial amino acid composition. SVMSA is specialized for membrane proteins and is therefore not included in MultiLoc2-LowRes.

SVMaac
SVMaac is based on the overall amino acid composition of the query sequence and outputs a probability for each localization. In contrast to the original MultiLoc, the binary one-versus-all classification is replaced by a oneversus-one procedure since a slightly performance increase could be achieved.

MotifSearch
MotifSearch outputs five binary features that encode the presence or absence of sequence motifs relevant to protein sorting such as nuclear localization signals (NLSs). Two additional binary features represent the presence or absence of a DNA-binding domain or a plasma membrane receptor domain.

PhyloLoc
Proteins within the same subcellular localization tend to share a similar taxonomic distribution of homologous proteins in other genomes [42]. This kind of information can be represented as a profile [43] which encodes the pattern of presence or absence of a given protein in a set of genomes. Marcotte et al. [42] applied phylogenetic profiles for the distinction of mitochondrial and non-mitochondrial proteins using 31 genomes and a linear discrimination function. PhyloLoc is based on phyloge-netic profiles derived from 78 fully sequenced genomes and SVMs to predict all of the localizations of the MultiLoc2 predictors. The genomes were retrieved from the National Center for Biotechnology Information (NCBI) web site (downloaded between 6th and 9th February 2008). We used all available eukaryotic (20) and archaean (33) genomes and a non-redundant set of 25 bacterial genomes [see Additional file 1]. The input of PhyloLoc (as shown in Fig. 2) is a vector of similarities between the query sequence and the best sequence match in each genome using BLAST. The BLAST homology searches are performed using default settings. The bit score B qi of the best sequence match of the query sequence q in genome i and the self bit score B qq of q aligned with itself are used to calculate the similarity S qi which is defined as: S qi = B qi /B qq . Due to the fact that B qi is always smaller than B qq , the values of S qi range from zero to one. Values close to one indicate presence of the query protein and values close to zero indicate absence. The calculation of phylogenetic profiles based on bit scores was also previously used for the functional annotation of bacterial genomes [44]. An important point to note is that, although BLAST is used, creating phylogenetic profiles is not an annotation-based or homology-based method as sometimes described in the literature. The reason is that there is no annotation-transfer from the aligned sequences. Actually, it is irrelevant whether the proteins of the genomes are annotated or not. Proteins with similar phylogenetic profiles are co-inherited and do not have to be close homologs [45].
PhyloLoc and GOLoc architecture Figure 2 PhyloLoc and GOLoc architecture. The architectures of PhyloLoc and GOLoc from MultiLoc2-LowRes. The input of PhyloLoc is a vector of similarities (phylogenetic profile) between the query sequence and the best sequence match in each genome inferred from BLAST. The input of GOLoc is a binary-coded vector representing the GO terms of the query sequence inferred from InterPro using InterProScan. Phy-loLoc and GOLoc use one-versus-one SVMs to process their input and to calculate probability estimates for each localization.

GOLoc
The Gene Ontology (GO) is a controlled vocabulary for uniformly describing gene products in terms of biological processes, cellular components and molecular function across all organisms [46]. It has been shown that GO terms can be used to improve the performance of subcellular protein localization prediction methods [47,48]. In the literature to date, there are three possibilities for obtaining GO annotation terms for a query sequence. If the UniProt [49] accession number is known, one can simply extract the GO annotation from the UniProt database [50]. However, this procedure fails for novel proteins without accession number. Another possibility is to search for homologous proteins annotated with GO terms using BLAST [28,29]. This becomes difficult in cases where proteins have no close homolog or proteins have many homologs, because no GO term can be obtained or GO terms might be ambiguous. A further method of inferring GO terms is InterProScan [51] used, for example, by Chou and Cai [52]. Given a protein sequence, the tool scans against various pattern and signature data sources collected by the InterPro project [53]. InterPro also provides a mapping of the detected protein domains and functional sites to GO terms.
Our subpredictor GOLoc is based on GO terms calculated using InterProScan. Since the GO terms are derived directly from the query sequence, we avoid the drawbacks of using accession numbers or BLAST. The input of GOLoc is a binary-coded vector which represents all GO terms of the training sequences (see Fig. 2). GO terms present in the query sequence are set to 1 in the vector and to 0 otherwise [see Additional file 1].

Datasets BaCelLo
The datasets used for training and testing MultiLoc2-LowRes against comparable predictors were obtained from the BaCelLo website. The homology-reduced training dataset was extracted from Swiss-Prot release 48 and contains 2597 animal, 1198 fungal and 491 plant proteins resulting in three kingdom-specific predictors. By ignoring proteins annotated as 'membrane' or 'transmembrane', only globular proteins were considered.
The animal and fungal proteins represent four localizations (nu, cy, mi, SP) and the plant proteins five localizations (with the addition of ch). The independent test dataset (BaCelLo IDS) was extracted from Swiss-Prot release 54. Only proteins added to the database starting from release 49 were considered. Furthermore, proteins sharing a sequence identity >30% to at least one protein from release 48 were removed. This ensured that all test proteins were novel to the predictors in the benchmark study since all of them were trained using Swiss-Prot proteins up to release 48. In order to avoid a bias towards the prediction of over-represented protein classes, all sequences which share the same localization and a sequence identity >30% were clustered into 432 animal, 418 fungi and 132 plant groups. More information concerning the creation of the BaCelLo data sets can be found in Pierleoni et al [19] and Casadio et al. [54].

Höglund
For training MultiLoc2-HighRes we applied the original dataset used to train MultiLoc [23], which contains 5959 eukaryotic proteins extracted from Swiss-Prot release 42. The data set covers 11 localizations (cy, ch, er, ex, go, ly, mi, nu, pe, pm, va). To also compare the prediction performance of MultiLoc2 with WoLF PSORT in regard to the localizations not present in the BaCelLo test dataset, we created a second independent dataset (Höoglund IDS) which covers seven localizations (er, ex, go, ly, pe, pm, va). Therefore, animal, fungal and plant proteins of these localizations were extracted from Swiss-Prot release 55.3 in the same way as the BaCelLo independent dataset. However, in the case of the plant proteins, we increased the allowed sequence identity threshold to 40% in order to obtain enough data. We used BLASTClust to cluster the sequences using 30% pairwise sequence identity for the animal and fungal proteins and 40% for the plant proteins. The whole procedure delivered 158 animal, 106 fungi and 30 plant groups.
For each training dataset (BacelLo and Höglund), a table that shows the number of protein sequences for each localization is listed elsewhere [see Additional file 1]. Two different datasets were used for training since this allows a more objective performance comparison between MultLoc2-HighRes and its predecessor MultiLoc on the one hand and in particular between MultLoc2-LowRes and further methods on the other hand.

SVM training and performance evaluation
All building blocks (except MotifSearch) of MultiLoc2 were trained using SVMs [55] from the LIBSVM [56] software. Throughout, we used the radial basis kernel function and optimized the c and g parameters by grid search. Furthermore, we defined weights for each class in order to reduce the over-prediction effect when using unbalanced training datasets. To this end, we used built-in functionality of LIBSVM. The weighting scheme assigns weight 1.0 to the largest class and higher weights to the remaining classes. The weights of these classes are simply calculated by dividing the size of the largest class by that of each smaller class. The probability estimates calculated by LIB-SVM were used for ranking the final predicted localizations and choosing the most probable one.
We used five-fold cross-validation for training and evaluating the prediction performance. Additionally, independent datasets were used for testing MultiLoc2 and comparison with other prediction methods. Therefore, all test proteins share low sequence similarity with proteins in the training datasets.
Localization-specific performance results were expressed using sensitivity (SE), specificity (SP) and the Matthews correlation coefficient (MCC) defined as: To evaluate the overall prediction performance, we used average sensitivity (AVG), which is also known as the average localization-specific accuracy, as primary measure. The average sensitivity is better suited than the overall accuracy (ACC), the percentage of correctly predicted proteins of all localizations. The reason is that all prediction methods are trained on unbalanced datasets with strongly varying numbers of proteins per localization. This often biases the prediction towards the localization with the most representations in the training dataset. Hence an unbalanced test dataset would also normally lead to a distorted performance evaluation when using the ACC only. To calculate the performance measures for the independent datasets, we used the average rates of true and false predicted proteins within each cluster.

Cross-validation performance
The impact of the MultiLoc2 extensions on the overall prediction performance was evaluated using 5-fold crossvalidation. The results are summarized in Table 1 [23] since the SVMaac architecture has slightly changed. Adding PhyloLoc or GOLoc individually to MultiLoc already increased the performance considerably, whereas the performance gain caused by GOLoc is slightly higher compared to PhyloLoc. However, the best performance is achieved by the addition of both subpredictors, obtaining MultiLoc2. Similar trends can be detected regarding the overall accuracies. The standard deviations of the MultiLoc2-LowRes plant version are higher compared to the other versions due to the fact that the number of training sequences in the dataset is considerably lower.

Comparison with related tools
In a recently published study [54] five selected top-performing sequence-based prediction methods (BaCelLo, LOCtree, Protein Prowler, TargetP and WoLF PSORT) were compared using an independent dataset (see Section 2.3.1). Based on this benchmark study, we compared the performance of MultiLoc2 against these five methods using the same test setting. The benchmark study considered five subcellular localizations (nu, cy, mi, ch, SP). Furthermore, a virtual class nu/cy, containing nu and cy    proteins, was created in order to ensure a fair comparison with TargetP and Protein Prowler which do not discriminate between these two localizations. To deal with WoLF PSORT and LOCtree, predicted sublocalizations of the secretory pathway were grouped into the SP class. A similar approach was followed for the evaluation of MultiLoc2-HighRes. Depending on the inclusion of the virtual nu/cy class, the number of tested classes is three or four for animals and fungi as well as four or five for plants.
We also evaluated the performance of only sequencedbased predictions of MultiLoc2 by disregarding GO terms to simulate the case of unavailability of GO terms. Table   Table 2 2 shows the localization-specific performance results using sensitivity, specificity and MCC and Table 3 summarizes the overall performances using AVG and ACC. Note that the number of SP clusters for fungi (9) and plants (6) and the mi clusters for plants (6) is quite small compared to the remaining localizations. Therefore, some care should be taken when interpreting the prediction results. Small clusters have only a small influence on the ACC, however, a large influence on the AVG.
MultiLoc2-LowRes always yields the highest ACCs and AVGs for animal and plant proteins and hence outperforms all other predictors. The reason for this outstanding result is that MultiLoc2-LowRes is, in general, better suited to discriminate between nu and cy and between mi and ch proteins (see Table 2), which is a known challenge in the prediction of protein subcellular localization. For fungal proteins the ACCs are the highest and the AVGs are the second highest after the BaCelLo predictor. One reason for the reduced AVG performance is that on average only 34% of the fungal proteins are annotated with GO terms by InterProScan. The annotation-rate is higher for animals (43%) and plants (79%). Compared to MultiLoc2-LowRes, the performance of MultiLoc2-HighRes is, not surprisingly, reduced, since it is a more general predictor not specialized for globular proteins and covering a wider range of localizations. However, for animal and plant proteins the AVGs of MultiLoc2-HighRes are equal or higher compared to the remaining methods. Similar to MultiLoc2-LowRes, MultiLoc2-HighRes performs worse for fungal proteins. The AVGs are still better than LOCtree, however, worse compared with Protein Prowler, TargetP and WoLF PSORT.
If we simulate the case in which no GO terms are available for any test proteins, the overall performances of the MultiLoc2 predictors are slightly reduced but still better than those of the other methods for animal and plant and comparable for fungal proteins [see Additional file 1].
In a second benchmark study, MultiLoc2-HighRes and WoLF PSORT were compared using the Höglund independent dataset (see Section 2.3.2). In contrast to the other predictors, both methods allow the prediction of all main eukaryotic subcellular localizations. We further note that WoLF PSORT can also distinguish between the cytoskeleton within the cytoplasm. In this comparison we only consider those localizations (ex, pm, pe, er, go, ly, va) not tested in the previous study. Since it is known that discriminating between these classes is challenging, we also evaluated whether the tested proteins could be correctly predicted within the top three ranked localizations.
The results of this study are summarized in Table 4. MultiLoc2-HighRes always achieves clearly higher AVGs.
In particular, the AVG within the top three locations of MultiLoc2 is about twice as high as that of WoLF PSORT. A similar result is observed regarding the ACCs.
MultiLoc2-HighRes has a much lower bias towards overrepresented localizations and, thus, almost never shows zero sensitivity for a localization with few representatives. This again proves high robustness of MultiLoc2, even in cases of many localizations.

Conclusion
Our new approach for predicting protein subcellular localization, MultiLoc2, integrates several subpredictors based on the overall amino acid composition, the detection of sorting signals, phylogenetic profiles and GO terms. Compared to the original MultiLoc architecture, the robustness and prediction performance is clearly improved. The different resolutions of MultiLoc2 were compared with current state-of-the-art sequence-based methods using independent datasets.
MultiLoc2-LowRes is specialized for globular proteins and offers kingdom-specific prediction of up to five localizations based on the BaCelLo dataset. On the other hand, MultiLoc2-HighRes is able to deal with membrane proteins and predicts all of the main eukaryotic localizations based on a dataset that consists of a mixture of animal, fungal and plant proteins. In comparison with five other methods, the MultiLoc2 predictors performed better for animal and plant proteins whereas MultiLoc2-LowRes outperforms MultiLoc2-HighRes in general. However, the performance of MultiLoc2-HighRes is remarkable since it is able to predict more localizations than the other tools  (51) 46 (57) The average sensitivity and the overall accuracy (in parentheses) for the prediction of three and four classes for animals and fungi and four and five classes for plants are shown. Both measures are given in percentages. The top-scoring average sensitivity and average accuracy are highlighted in bold. Results for Protein Prowler and TargetP predictions are only available for a reduced number of classes since nu and cy are grouped as nu/cy. except for WoLF PSORT. We also simulated the scenario in which no GO term is available for any test proteins, which makes the prediction sequence-based only. The resulting performance of the MultiLoc2 predictors is slightly reduced but still better for animals and plants and comparable for fungi. Therefore, we conclude that the MultiLoc2 approach is very robust and well suited for novel proteins without relevant sequence similarity to annotated proteins but can also benefit from the presence of calculated GO annotation from the sequence using InterProScan.
In a second benchmark study we evaluated the prediction performance of MultiLoc2-HighRes compared to WoLF PSORT for proteins localized in the peroxisomes and in the sublocalizations of the secretory pathway. For all three eukaryotic kingdoms, MultiLoc2-HighRes performs clearly better. In particular, MultiLoc2-HighRes shows much better sensitivity throughout all localizations and yields high robustness. However, the results indicate that the classification in all main eukaryotic localizations is still a challenging task and leaves room for improvement for future work.
The flexible architecture of MultiLoc2 is based on the easily extendable protein profile vector. In the future, this will allow us to integrate more heterogeneous and relevant information to further improve the prediction accu-racy. In particular, we plan to investigate in further sequences-based or annotation-based information such as protein-protein interaction and text-terms from PubMed abstracts. Moreover, handling of proteins present in multiple locations is an open challenge.