Knowing where a protein resides in a cell can help biologists elucidate its functions. With the completion of various large-scale genome sequencing projects, an exponentially growing number of new protein sequences have been discovered [1, 2]. Computational methods are required to identify the subcellular locations of these proteins automatically and accurately.
Conventional methods for subcellular-localization prediction can be roughly divided into sequence-based methods and annotation-based methods. Sequence-based methods include: (1) sorting-signal-based methods, such as PSORT, WoLF PSORT and SignalP [6, 7]; (2) composition-based methods, such as amino-acid compositions (AA), amino-acid pair compositions (PairAA), gapped amino-acid pair compositions (GapAA) [9, 10], and pseudo amino-acid composition (PseAA) [11, 12]; and (3) homology-based methods, such as Proteome Analyst and other predictors.
Annotation-based methods make use of the correlation between the annotations (usually the functional annotations) of a protein and its subcellular localization. Among them, methods based on Gene Ontology (GO) information are particularly attractive. GO is a set of standardized vocabularies that annotate the functions of genes and gene products across different species. The term ‘ontology’ originally refers to a systematic account of existence. In the GO database, the annotations of gene products are organized in three related ontologies: cellular components, biological processes, and molecular functions. Cellular components refer to the substances that constitute cells and living organisms. Example substances are proteins, nucleic acids, membranes, and organelles. The majority of these substances are located within cells, but some are located outside cells (in extracellular areas). A biological process is a sequence of events achieved by one or more ordered assemblies of molecular functions. A molecular function is achieved by activities that can be performed at the molecular level by individual gene products or by assembled complexes of gene products.
As a result of the GO Consortium annotation effort, the Gene Ontology Annotation (GOA) database has become a large and comprehensive resource for proteomics research. The database provides structured annotations to non-redundant proteins from many species in the UniProt Knowledgebase (UniProtKB) using standardized GO vocabularies through a combination of electronic and manual techniques. The large-scale assignment of GO terms to UniProtKB entries (or accession numbers) was done by converting a proportion of the existing knowledge held within the UniProtKB database into GO terms. The GO annotation database also includes a series of cross-references to other databases. Thus, the systematic integration of GO annotations and the UniProtKB database can be exploited for subcellular localization prediction. Specifically, given the accession number of a protein, a set of GO terms can be retrieved from the GO annotation database file. In UniProtKB, each protein has an accession number, and in the GO annotation database, each accession number may be associated with zero, one, or more distinct GO terms. Conversely, one GO term may be associated with zero, one, or many different accession numbers. This means that the mappings between accession numbers and GO terms are many-to-many.
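The many-to-many relationship described above can be sketched with two dictionaries of sets, one per direction. This is only an illustrative sketch: the accession numbers and GO identifiers below are made up, and the names `PAIRS`, `build_maps`, and `go_terms_for` are hypothetical helpers, not part of any GOA tooling.

```python
from collections import defaultdict

# Hypothetical (accession number, GO term) pairs. The identifiers are
# invented for illustration and are not real annotations.
PAIRS = [
    ("ACC001", "GO:0000001"),
    ("ACC001", "GO:0000002"),
    ("ACC002", "GO:0000002"),
    ("ACC001", "GO:0000002"),  # a repeated association; sets keep it once
]


def build_maps(pairs):
    """Index the pairs in both directions (accession -> terms, term -> accessions)."""
    acc_to_go = defaultdict(set)
    go_to_acc = defaultdict(set)
    for acc, go in pairs:
        acc_to_go[acc].add(go)
        go_to_acc[go].add(acc)
    return acc_to_go, go_to_acc


acc_to_go, go_to_acc = build_maps(PAIRS)


def go_terms_for(acc):
    """Return the distinct GO terms of an accession; empty if it has none."""
    return sorted(acc_to_go.get(acc, ()))


print(go_terms_for("ACC001"))           # two distinct terms
print(go_terms_for("ACC999"))           # an accession with zero terms
print(sorted(go_to_acc["GO:0000002"]))  # one term shared by two accessions
```

An accession that is absent from the index simply yields an empty list, which corresponds to the "zero GO terms" case in the text.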
From the perspective of GO-term extraction, the GO-based predictors can be classified into three categories: (1) using InterProScan to search against a set of protein signature databases [21–25]; (2) using the accession numbers of proteins to search against the GO annotation database, such as Euk-OET-PLoc and an integrated method; and (3) using the accession numbers of homologous proteins retrieved from BLAST to search against the GO annotation database, such as ProLoc-GO and Cell-PLoc 2.0.
However, there exist multi-location proteins that can simultaneously reside at, or move between, two or more different subcellular locations. Unfortunately, most of the existing methods are limited to the prediction of single-location proteins. These methods generally exclude multi-label proteins or assume that multi-location proteins do not exist. In fact, proteins with multiple locations play important roles in some metabolic processes that take place in more than one compartment, such as fatty acid β-oxidation in the peroxisome and mitochondria, and antioxidant defense in the cytosol, mitochondria and peroxisome.
There are a few predictors [33, 37, 38] specifically designed for predicting the subcellular localization of viral proteins, which are generated by viruses in various cellular compartments of the host cell or virus-infected cells. Studying the subcellular localization of viral proteins provides information about their destructive tendencies and consequences [33, 37, 38]. It is also beneficial to the annotation of the functions of viral proteins and the design of antiviral drugs. To the best of our knowledge, there are two predictors, namely Virus-mPLoc and iLoc-Virus, capable of predicting multi-label viral proteins. iLoc-Virus performs better than Virus-mPLoc because the former has a better formulation for reflecting GO information and a better way to predict the number of subcellular location sites of a query protein. Recently, a method called the KNN-SVM ensemble classifier was proposed to deal with multi-label proteins, including viral proteins. It was found that the performance of the KNN-SVM predictor is comparable to that of iLoc-Virus and better than that of Virus-mPLoc.
Conventional methods specialized for plant proteins, such as TargetP and Predotar, can only deal with single-label proteins. Plant-mPLoc and iLoc-Plant are state-of-the-art predictors that can deal with both single-label and multi-label proteins of plants. iLoc-Plant performs better than Plant-mPLoc because of improvements similar to those of iLoc-Virus over Virus-mPLoc.
This paper proposes an efficient multi-label predictor, namely mGOASVM, for multi-label protein subcellular localization prediction. Here, the prefix “m” stands for multiple, meaning that the predictor can deal with proteins with multiple subcellular locations. mGOASVM differs from other predictors in that (1) it adopts a new decision scheme for an SVM classifier so that it can effectively deal with datasets containing both single-label and multi-label proteins; (2) it selects a set of distinct, relevant GO terms to form a more informative GO subspace; and (3) it constructs the GO vectors by using the frequency of occurrences of GO terms instead of using 1-0 values [23, 37, 41] for indicating the presence or absence of some predefined GO terms. The results on two benchmark datasets and a newly created dataset of novel proteins demonstrate that these three properties enable mGOASVM to predict multi-location proteins and outperform state-of-the-art predictors such as iLoc-Virus and iLoc-Plant.
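To make the contrast between frequency-based and 1-0 GO vectors concrete, the following is a minimal sketch rather than the mGOASVM implementation. It assumes that the GO terms retrieved for a protein arrive as a list in which a term may appear more than once, and that the GO subspace is an ordered list of distinct terms. The function name `go_vectors` and all identifiers in the example are hypothetical.

```python
from collections import Counter


def go_vectors(retrieved_terms, go_subspace):
    """Build a frequency vector and a 1-0 vector over the same GO subspace.

    retrieved_terms: GO terms retrieved for one protein (repeats allowed;
                     this input format is an assumption of the sketch).
    go_subspace:     ordered list of the distinct GO terms used as dimensions.
    """
    counts = Counter(retrieved_terms)
    # Frequency vector: number of occurrences of each subspace term.
    tf = [counts[term] for term in go_subspace]
    # 1-0 vector: presence or absence of each subspace term.
    binary = [1 if c > 0 else 0 for c in tf]
    return tf, binary


# Made-up identifiers, for illustration only.
subspace = ["GO:0000001", "GO:0000002", "GO:0000003"]
retrieved = ["GO:0000002", "GO:0000001", "GO:0000002", "GO:0000009"]

tf, binary = go_vectors(retrieved, subspace)
print(tf)      # [1, 2, 0]
print(binary)  # [1, 1, 0]
```

In this toy case the two representations differ only in the second dimension: the frequency vector records that the term occurred twice, whereas the 1-0 vector records only that it was present. Terms outside the subspace (here `GO:0000009`) are ignored by both.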