- Research article
- Open Access
TransportTP: A two-phase classification approach for membrane transporter prediction and characterization
© Li et al; licensee BioMed Central Ltd. 2009
- Received: 8 July 2009
- Accepted: 14 December 2009
- Published: 14 December 2009
Membrane transporters play crucial roles in living cells. Experimental characterization of transporters is costly and time-consuming. Current computational methods for transporter characterization still require extensive curation efforts, especially for eukaryotic organisms. We developed a novel genome-scale transporter prediction and characterization system called TransportTP that combined homology-based and machine learning methods in a two-phase classification approach. First, traditional homology methods were employed to predict novel transporters based on sequence similarity to known classified proteins in the Transporter Classification Database (TCDB). Second, machine learning methods were used to integrate a variety of features to refine the initial predictions. A set of rules based on transporter features was developed by machine learning using well-curated proteomes as guides.
In a cross-validation using the yeast proteome for training and the proteomes of ten other organisms for testing, TransportTP achieved an equivalent recall and precision of 81.8%, based on TransportDB, a manually annotated transporter database. In an independent test using the Arabidopsis proteome for training and four recently sequenced plant proteomes for testing, it achieved a recall of 74.6% and a precision of 73.4%, according to our manual curation.
TransportTP is the most effective tool for eukaryotic transporter characterization to date.
- Gene Ontology
- Hidden Markov Model
- Transporter Family
- Pfam Domain
Membrane transporter proteins, or simply transporters, play crucial roles in living cells, such as importing essential nutrients and exporting toxic cellular metabolites, mediating signal transduction, and maintaining ionic and osmotic homeostasis.
A handful of in vitro and in vivo experimental methods have been developed and applied to study transporter mechanisms, such as patch-clamp techniques for analyzing ion channels, and heterologous expression and mutant complementation approaches, which are often coupled with the use of isotopically labeled substrates [2, 3]. These methods are costly and time-consuming, which limits their application to identifying transporters on a large scale. Computational methods are therefore desirable for selecting and sorting potential targets on a genome scale prior to laboratory experiments. Homology searching against experimentally determined transporters is to date the most common approach to inferring novel transporters, as exemplified by BLAST searches against the Transporter Classification Database (TCDB) [5, 6]. Using this approach, a putative transporter database named TransportDB was constructed for hundreds of completely sequenced genomes [7, 8]. The wide adoption of TCDB for transporter annotation is due to its unique characteristics. It contains comprehensive information on experimentally characterized transporters. These transporters are organized within a simple tree structure based on both function and homology, which contains over 550 transporter families. The transporter families possess a distinct functional-phylogenetic property, i.e. the members of a transporter family are not only homologous but also share similar transport mechanisms. Other classification systems such as Pfam and Gene Ontology do not have this property. In the Pfam system, a specific domain may be contained within proteins of different functions. In the Gene Ontology system, a transporter may logically belong to multiple transporter terms or functions because Directed Acyclic Graphs (DAGs) are used to represent relationships among protein functions.
Homology methods may reveal new putative transporters, but they can also generate many false positive assignments, due to systematic errors arising from homology inference, since nontrivial functional variations may be introduced during gene duplication or domain shuffling [11, 12]. For example, paralogs often exhibit distinct functions. As a result, more sophisticated modeling of transporter families has been proposed, including profile-based methods like HMMER and PST, and machine learning methods like SVMProt. Profile-based methods rely heavily on regions of local conservation among family members. However, the level of conservation within many transporter families, such as potassium channels, may be very low [16, 17], which limits the effectiveness of these methods. Machine learning methods can sidestep the lack of conservation by using quite distinct features such as physicochemical properties and overall amino acid composition [15, 18, 19]. However, both kinds of methods, especially the machine learning ones, require many examples of a transporter family for effective modeling, which is a limitation for the many transporter families with few experimentally determined members.
Recently, some integrative methods have been reported that incorporate at least two data sources or methods. We previously proposed a nearest neighbor approach that integrated BLAST, Hidden Markov Model (HMM) and topology analysis. Another transporter annotation pipeline named TransAAP was launched along with TransportDB. It searches TCDB, Pfam and Gene Ontology, and applies a few empirical rules to reach a decision. Though TransAAP currently works effectively for prokaryotic organisms, it is still weak in handling eukaryotic organisms. Thus, existing computational tools for transporter prediction, including integrative methods, suffer from insufficient predictive coverage or low accuracy. Extensive curation efforts are still required for transporter annotation on a genome-wide scale.
TransportTP was implemented as both a set of Linux command-line programs and a web server. It requires only basic input: nucleotide or amino acid sequences, and prediction options such as the training organisms and the initial e-value threshold. The predictive results are presented in a user-friendly manner through a web interface, which enumerates all evidence used in decision making and provides cross-links from the evidence to related databases for further verification. Results are sortable by various user-selected criteria, to accommodate specific interests or curation emphasis. The system runs efficiently because it uses parallel computing across multiple CPUs in a single computer and/or across computers on a local network.
Data and performance assessment
The databases for implementation and testing of TransportTP were downloaded in September 2008. The TCDB used for the initial classifier consisted of 5,005 transporters within 557 TC families/superfamilies, of which 173 possessed at least 5 members with corresponding HMMs. The terms family and superfamily are used interchangeably in this manuscript because both occur at the third taxonomic level of the TC system. If a superfamily existed in a hierarchical branch, the superfamily was studied rather than its constituent TC families. The Pfam (ver 23.0), Gene Ontology (GO) and SwissProt (ver 14.2) databases contained 10,340 domains, 163,260 curated sequences and 398,181 curated sequences, respectively. Eleven organisms were chosen for cross-validation of TransportTP, including seven model organisms (Escherichia coli O157:H7 EDL933, Saccharomyces cerevisiae S288C, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana, Oryza sativa and Homo sapiens), and four non-model organisms (Picrophilus torridus DSM 970, Photobacterium profundum SS9, Desulfotalea psychrophila LSv54 and Aspergillus fumigatus). Sequences and functional annotations of the eleven organisms were acquired from NCBI, except for S. cerevisiae and O. sativa, which were obtained from SGD (ftp://ftp.yeastgenome.org/yeast/data_download/sequence/genomic_sequence/orf_protein/archive/) and JCVI (ftp://ftp.jcvi.org/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_5.0/), respectively. Five plant organisms were chosen for case studies: Sorghum bicolor, Populus trichocarpa, Vitis vinifera, Physcomitrella patens and Chlamydomonas reinhardtii. The genomic sequences of the five plant organisms were downloaded from JGI Microbial Genomics (ftp://ftp.jgi-psf.org/pub/JGI_data), except for Vitis vinifera, which was downloaded from Genoscope (http://www.genoscope.cns.fr/externe/Download/Projets/Projet_ML/data/annotation/Vitis_vinifera_peptide_v1.fa).
These three measurements are also known as sensitivity, selectivity, and F-measure, respectively. The average value of each measurement over multiple testing organisms reflects the predictive performance of TransportTP for the given training organism(s). Specificity and the Receiver Operating Characteristic curve were not used in our assessment because non-transporters far outnumber transporters, which makes these measures less informative.
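The defining formulas for the three measurements are not preserved in this text. As a minimal sketch, assuming the standard definitions of recall, precision and F-measure (the harmonic mean of the two), and assuming predictions and benchmark annotations can be compared as plain sets of protein identifiers, the measurements could be computed as follows. The function name and the toy identifiers are hypothetical.

```python
def evaluate(predicted: set, benchmark: set) -> dict:
    """Compare predicted transporters with benchmark annotations.

    Assumes the standard definitions; the article's own equation is not
    preserved, so this is an illustration rather than the authors' code.
    """
    tp = len(predicted & benchmark)   # predicted and annotated
    fp = len(predicted - benchmark)   # predicted but not annotated
    fn = len(benchmark - predicted)   # annotated but missed

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    total = recall + precision
    # F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / total if total else 0.0
    return {"recall": recall, "precision": precision, "f_measure": f_measure}


# Toy example with hypothetical protein identifiers.
scores = evaluate({"a", "b", "c", "d"}, {"a", "b", "c", "e", "f"})
```

Because true negatives never enter these formulas, the measurements are unaffected by the large number of non-transporters in a proteome, which is the motivation given above for preferring them to specificity.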
The performance of TransportTP was tested by two cross-validation schemas: (1) Leave-one-in (LOI) cross-validation, i.e. choosing the proteome of one model organism for training and the proteomes of other ten organisms for testing; (2) Leave-multiple-in (LMI) cross-validation, i.e. choosing the proteomes of all seven model organisms for training and only the proteomes of the four non-model organisms for testing. Redundant protein sequences with a similarity of 40% or above were removed during the training to avoid potential overestimation of prediction accuracy. TransportDB [7, 8] was used as a benchmark transporter database in the cross-validation.
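The two schemas can be sketched as simple train/test splits over the eleven organisms. This is only an illustration of how the proteomes are partitioned; the function names are hypothetical, and the 40% redundancy filtering and the training itself are not shown.

```python
MODEL = [
    "E. coli", "S. cerevisiae", "D. melanogaster", "C. elegans",
    "A. thaliana", "O. sativa", "H. sapiens",
]
NON_MODEL = ["P. torridus", "P. profundum", "D. psychrophila", "A. fumigatus"]


def leave_one_in(model, non_model):
    """Yield (train, test) pairs: one model organism trains, the other ten test."""
    everything = list(model) + list(non_model)
    for organism in model:
        yield [organism], [o for o in everything if o != organism]


def leave_multiple_in(model, non_model):
    """Return one (train, test) pair: all model organisms train, non-model test."""
    return list(model), list(non_model)


loi_splits = list(leave_one_in(MODEL, NON_MODEL))
lmi_train, lmi_test = leave_multiple_in(MODEL, NON_MODEL)
```

Each LOI split trains on a single proteome, so the test set always contains ten organisms; LMI yields a single split with seven training and four testing proteomes.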
Table: The cross-validation results yielded by yeast. Columns: Num of proteins; Predictions by TransportTP; Annotations in TransportDB; Text mining validated; Balanced accuracy (%). Summary rows: Average on model proteomes; Average on non-model proteomes; Average on all testing proteomes.
Predictive performance differed significantly between major transporter classes. The best performance was achieved for carriers, followed by channels, and finally primary active transporters. More specifically, TransportTP achieved a balanced accuracy of 87.9% on 74 carrier families, 61.5% on 31 channel families, and 54.6% on 8 primary active transporter families that TransportDB reported on the ten non-yeast testing organisms, at the e-value threshold of 0.1 (see Additional file 2 for details). This difference in performance is likely due to factors such as distinct transporter mechanisms within transporter classes, rather than differences in sequence divergence among transporter families, since the correlation between the balanced accuracy and the sequence identity of the predicted families/superfamilies on the ten testing organisms was only 0.23, a weak correlation.
Table: The cross-validation results yielded by all model organisms. Columns: Num of proteins; Predictions by TransportTP; Annotations in TransportDB; Text mining validated; Balanced accuracy (%). Summary row: Average on non-model proteomes.
Comparative study of performance
The performance of TransportTP was also studied by comparing it with other approaches over a broad range of e-value thresholds between 10 and 1e-50, to reveal the comprehensive characteristics of TransportTP and alternative strategies.
The advantages of TransportTP are further demonstrated in Figure 4, which shows the average balanced accuracy on the ten non-yeast testing organisms versus the e-value threshold adopted in the initial classifier. TransportTP outperformed both BLAST plus HMM and the BLAST search alone in balanced accuracy, especially at commonly used e-value thresholds. At the e-value threshold of 10, TransportTP outperformed BLAST plus HMM by 49.3%. This large increase in balanced accuracy resulted from an increase in precision from 14.1% to 63.8% along with a relatively small decrease in recall from 91.6% to 87.2% (corresponding to the rightmost points of the two curves in Figure 3). At this high e-value threshold, most transporters were covered by the initial classifier, making it possible for the refining classifier to discriminate the true positives from the false positives. At very low e-value thresholds, many true positives are excluded and few false positives are included, limiting the power of the refining classifier to generate effective discriminative rules.
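As a quick arithmetic check, assuming balanced accuracy is the F-measure (the harmonic mean of precision and recall), the precision and recall values quoted above reproduce the reported gain to within the rounding of the inputs. The helper name is hypothetical.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition)."""
    return 2 * precision * recall / (precision + recall)


# Values quoted for the e-value threshold of 10.
blast_plus_hmm = f_measure(0.141, 0.916)   # roughly 0.244
transporttp = f_measure(0.638, 0.872)      # roughly 0.737
gain = transporttp - blast_plus_hmm        # roughly 0.49
```

The small gap between this value and the reported 49.3% is what one would expect from averaging per-organism scores and from rounding the published precision and recall.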
TransportTP was also compared with other reported methods. Compared with our previous work, which was similar to our initial classifier but included TMS filtering, TransportTP achieved a better balanced accuracy: 81.8% versus 67.0%. More specifically, TransportTP significantly improved the precision from 55.9% to 81.8%, with a slight sacrifice in recall from 83.6% to 81.8% (for fairness, validations through the text mining program in the previous approach were excluded from this comparison). Compared with SVMProt, TransportTP achieved better recall and precision and, most importantly, much larger coverage of TC families, since SVMProt only achieved an average recall of 81.0% and an average precision of 26.1% among five TC superfamilies and three families in an independent evaluation set.
The performance of TransportTP was investigated further using sequences from five model plant organisms, chosen for the availability of their whole proteomes and for their evolutionary divergence. Sorghum bicolor (sorghum), Populus trichocarpa (poplar) and Vitis vinifera (grape) are important agricultural organisms. Physcomitrella patens (moss) is one of the simplest plant models for studies of plant functional evolution, and Chlamydomonas reinhardtii (green alga) is a single-celled green alga, representing the group from which land plants evolved. These organisms span a large part of the plant family tree, from single-celled to vascular plants, and from monocots to dicots, so the performance on these organisms should reflect the performance of TransportTP on plant transporters in general. Finally, although most of these organisms were recently sequenced, they lack good annotation for membrane transporters.
The performance of TransportTP on plant organisms was evaluated by comparing the predictions of the program with manual curation by our biologists. The automatic transporter prediction was trained on the model plant organism Arabidopsis thaliana at an e-value threshold of 10, to cover as many transporters as possible. Manual transporter curation was carried out via human review of all transporter-related evidence for candidate proteins, specifically transmembrane proteins and homologs of TCDB transporters. The evidence included all data utilized by TransportTP, such as predicted membrane topology, presence of homologs in TCDB, conserved Pfam domains, Gene Ontology terms, and presence of homologs in the SwissProt database. In addition, evidence that was difficult to handle by automatic prediction, such as the annotation of homology hits in the NCBI NR-Refseq database, was also reviewed. Curated putative membrane transporters were organized into three confidence levels according to the strength of the evidence for the classification. Level one corresponded to the highest confidence, in which almost all expected pieces of evidence for a transporter superfamily/family supported the classification. Level two corresponded to moderate confidence, where a minor piece of evidence was conflicting or missing (such as a protein length slightly shorter than expected). Level three corresponded to the lowest confidence, in which multiple types of evidence or one important piece of evidence were missing (such as a lack of characteristically conserved domains and/or a protein length that was too short), raising doubts about the transporter functionality or gene annotation. The union of the level-one and level-two sets was taken as the benchmark for manual curation of membrane transporters, while level three was only taken as reference and disregarded in further analysis. The detailed results are hosted at http://bioinfo3.noble.org/transporter/model.htm.
Case study results of TransportTP on five plant organisms.
Predictions of the green alga were less successful presumably because of the evolutionary distance from Arabidopsis, the training organism. Nevertheless, TransportTP still achieved a recall of 56.6% with respect to the manual curation approach and a precision of 72.0% on the automatic prediction. Compared with TransportDB which includes P. patens and C. reinhardtii, TransportTP achieved a recall of 77.6% and a precision of 78.6% on P. patens at an e-value threshold of 1, and achieved a recall of 71.5% and a precision of 72.8% on C. reinhardtii at an e-value threshold of 0.01 (see Additional file 5 for details). The results demonstrate a solid performance of TransportTP in predicting transporters on plant organisms using a model plant proteome for training.
Further investigation of the comparative results revealed that the confidence levels of manual curation were correlated with the recall of TransportTP on the corresponding groups. Specifically, 79.6%, 49.2% and 19.4% of curated transporters in confidence levels 1, 2 and 3, respectively, were exactly predicted by TransportTP on the five plant organisms explored (details not shown). The low recall of TransportTP for confidence level 3 does not undermine the effectiveness of TransportTP, since the manual curation could not reliably assign this group of potential transporters to superfamilies/families either, indicating major problems in the classification of these proteins.
We adopted a stringent assessment in the cross-validation of TransportTP, where only predictions matching curated transporters in TransportDB were counted as correct during training and testing. However, TransportDB manually excluded some categories of transporters, so these categories were trained as negatives and could not be credited in testing for any genome being annotated. Therefore, if the coverage of the benchmark database improves, the measured predictive performance should increase further, both for the cross-validation and for the case studies.
We did not adopt the standard k-fold or leave-one-out cross-validation but instead used a leave-one-in strategy, because the underlying transporter mechanisms in different organisms may be distinct even though they share substantial similarity. For example, the distribution of transporter families and the gating mechanisms of transporters are likely to differ between prokaryotic and eukaryotic organisms. Thus, combining multiple organisms for training a classification model may not always increase the predictive accuracy, as shown in Additional file 4.
We did not adopt a traditional Support Vector Machine (SVM) but used an ensemble of balanced SVMs in the refining classifier, to handle the unbalanced data produced by the initial classifier. The comparative results of using a traditional SVM without the ensemble and balancing techniques are shown in Additional file 6. Although TransportTP with and without the ensemble of balanced SVMs achieved almost equivalent balanced accuracy at the same e-value threshold in the initial classifier, TransportTP outperformed traditional SVMs when measured by precision versus recall, especially when the initial predictions were highly imbalanced between true and false positives. Similarly, the ensemble and balancing techniques improved precision versus recall for decision trees like CART. However, the balanced random forest of decision trees was not adopted in TransportTP because of its over-training on model organisms and unstable performance, as shown in Additional file 6.
Despite its effectiveness and efficiency, TransportTP has some potential pitfalls. Firstly, although non-homology evidence contributes significantly to the performance, sequence homology still plays a nontrivial role in the inference of novel transporters. Proteins that are very similar to transporters but are not in fact transporters, for example receptors or sensors that evolved from transporters but diverged functionally after the evolutionary split, may be incorrectly annotated as transporters. Fortunately, the inclusion of characteristic features such as the presence of conserved domains and the number of transmembrane segments in our method may help to distinguish these proteins from transporters to a great extent.
Secondly, TransportTP targets individual transporters rather than multimeric transporters, which are complex functional structures assembled from the products of multiple genes. For example, human TAP transporters are dimers, while glutamate transporters are trimers. For such multimeric transporters, each polypeptide chain is classified individually into a transporter family according to the characteristics assigned to that chain, even when the other chains needed to complete the transporter function are not found. We chose to leave open the possibility for researchers to find the whole multimeric transporter structure, which would require further analysis such as examining the interactions among the chains and the functional evolution of each chain.
Thirdly, TransportTP generally relies on complete protein sequences for classification, since partial sequences cast doubt on the functionality of the gene/peptide and are therefore confined to lower confidence levels. When handling partial sequences such as some ESTs, the classification will depend on the key regions of the protein that are available, such as unique conserved domains and transmembrane segments. In a nutshell, the system is built to rely on whole protein sequences, but can be useful for partial sequences, with some loss of robustness.
In summary, the effectiveness and utility of TransportTP have been demonstrated in detail through cross-validation between model and non-model organisms, a comparative study with alternative strategies in the cross-validation, and the case studies of plant organisms. TransportTP will be of importance for researchers working on the annotation of newly assembled genomes, especially eukaryotic genomes, and will probably be used as an additional step for classifying genes/proteins that cannot be clearly classified as transporters using existing database resources. The approach of TransportTP may also be of interest for improving broader classification tasks, showing how new classification rules can be extracted from sequences through a combination of homology and non-homology evidence.
The framework of TransportTP is shown in Figure 1. It consisted of two components, a pre-processor and a predictor, and two interfaces, i.e. a web interface and a command line interface. The predictor was further divided into two phases: an initial classifier followed by a refining classifier.
The pre-processor constructed transporter-related databases and models for the predictor. A transporter-related Gene Ontology database was constructed through the extraction of a subgraph rooted at the term GO:0022857, which corresponds to transmembrane transporter activity. The subgraph contained 561 transporter terms and 5,393 associated transporter sequences. Meanwhile, a transporter-related Pfam domain database was constructed from: (1) all cross-links between Pfam domains and TC families embedded in the Pfam database, (2) additional mappings between TC families and Pfam domains, constructed via an all-versus-all HMM search between TC sequences and Pfam domains, where a TC family was linked to a Pfam domain if at least a given proportion (50% in our implementation) of the members of the TC family contained the Pfam domain, and (3) other manually curated mappings. Consequently, 487 Pfam domains were mapped to 320 TC families or superfamilies (see Additional file 7). The construction of transporter-related databases was necessary because including non-transporter terms or domains contributed little to the predictive performance while significantly increasing the computational cost. Two kinds of models were built for the predictor. An HMM was constructed for each TC family/superfamily in TCDB with sufficient cardinality, for use in the initial classifier, using the SAM program because it does not require a pre-alignment of member proteins. A classification model was constructed for the refining classifier from some well-studied model organisms; during this training the initial classifier was invoked but the refining classifier was not, to avoid potential circularity.
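The second mapping rule (link a TC family to a Pfam domain when at least 50% of the family's members contain the domain) can be sketched as follows. This is a minimal illustration, assuming the HMM search results have already been reduced to a set of domain identifiers per protein; the function name, data layout and toy data are hypothetical.

```python
from collections import defaultdict


def map_families_to_domains(family_members, protein_domains, min_fraction=0.5):
    """Link each TC family to the Pfam domains found in enough of its members.

    family_members:  dict of TC family id -> list of member protein ids
    protein_domains: dict of protein id -> set of Pfam domain ids it contains
    min_fraction:    minimum share of members that must carry the domain
    """
    mapping = defaultdict(set)
    for family, members in family_members.items():
        if not members:
            continue
        counts = defaultdict(int)
        for protein in members:
            for domain in protein_domains.get(protein, set()):
                counts[domain] += 1
        for domain, n in counts.items():
            if n / len(members) >= min_fraction:
                mapping[family].add(domain)
    return dict(mapping)


# Toy data: four hypothetical members of one family.
families = {"2.A.1": ["p1", "p2", "p3", "p4"]}
domains = {
    "p1": {"PF07690"},
    "p2": {"PF07690", "PF00083"},
    "p3": {"PF07690"},
}
family_to_pfam = map_families_to_domains(families, domains)
```

In the toy data, PF07690 occurs in three of four members and is linked, while PF00083 occurs in only one and is not.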
In equation 2, p is an unknown protein for prediction, s(p) is the weighted score, f(p) is the classification function and λ is the threshold used to discriminate transporters from non-transporters; q_ij is the jth transporter in the ith TC family F_i, blast(p, q_ij) is the BLAST e-value between the protein p and the transporter q_ij, and hmm(p, MDL_i) is the HMM score of protein p on the transporter family F_i, where MDL_i is the HMM built for the family, |F_i| is the cardinality of family F_i, κ is the threshold of family cardinality for construction of HMMs and τ is the threshold of the BLAST search. The three cases in calculating s(p) for transporter q_ij correspond to 1) both BLAST and HMM scores being available, 2) only the HMM e-value being available and 3) only the BLAST e-value being available.
The refining classifier was designed to distinguish false positives from true positives generated by the initial classifier, because numerous false positives may be generated due to the lack of negative features of non-transporters in the TCDB database. The discriminative rules between false positives and true positives were learned from some well-studied model organisms based on a variety of transporter-related features. Proteins excluded as non-transporters by the initial classifier were not refined further: because the initial classifier was designed to cover most true positives, the small number of false negatives, i.e. missed transporters, would not greatly influence the overall predictive accuracy, while skipping them greatly increased efficiency. Seven categories of features for initially categorized proteins were extracted from transporter-related databases and the input sequences. The first was the basic homology scores against the TCDB database generated by the initial classifier, specifically the BLAST and HMM e-value scores, calculated by Tera-BLASTP and SAM, respectively. The second category of features was the initially categorized transporter class, such as channels, carriers, or primary active transporters, and the size of the initially categorized transporter family, since the size may affect the quality of the homology inference. The family size was transformed logarithmically to prevent it from dominating the other features.
where tms(p) was the number of TMS of a protein p, taken as the maximum of the numbers calculated by HMMTOP and TMPRED, π and σ were the mean and standard deviation of the number of TMS in the initially categorized TC family F of the protein p, and ξ was a tiny constant to prevent the denominator from being zero.
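The equation these symbols belong to is not preserved in this text. A minimal sketch of one feature consistent with the description, assuming it measures how far tms(p) lies from the family mean π in units of σ + ξ, is given below. The absolute value, the use of the population standard deviation, the value of ξ and all function names are assumptions.

```python
from statistics import mean, pstdev

XI = 1e-6  # hypothetical value for the tiny constant


def tms_count(hmmtop_count: int, tmpred_count: int) -> int:
    """tms(p): the larger of the two predicted segment counts."""
    return max(hmmtop_count, tmpred_count)


def tms_deviation(tms: int, family_tms_counts, xi: float = XI) -> float:
    """Assumed feature: |tms(p) - pi| / (sigma + xi) for the assigned family."""
    pi = mean(family_tms_counts)
    sigma = pstdev(family_tms_counts)  # population SD; the original may differ
    return abs(tms - pi) / (sigma + xi)


# Toy family whose members have 10, 12 and 14 predicted segments.
family = [10, 12, 14]
typical = tms_deviation(tms_count(11, 12), family)
atypical = tms_deviation(tms_count(6, 5), family)
```

With a family in which every member has the same segment count, σ is zero and ξ alone keeps the division defined, so any deviation from the family mean produces a very large feature value.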
The fourth category of features for an initially categorized protein was the consistency of TC families among the top-K homologs of the protein in the TCDB (K is a small constant), evaluated as the proportion of the top-K homologs belonging to the same TC family as the top homolog of the protein in the TCDB. High consistency is a strongly positive sign for a potential transporter. A variant of this feature was added for cases where the cardinality of the predicted family was smaller than K; in this variant, K was reduced to the cardinality of the initially categorized family so that the maximum achievable consistency could be captured.
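The consistency feature and its size-adjusted variant can be sketched as follows, assuming the TCDB hits are available as a list of TC family identifiers ordered from best to worst hit. The value K = 5 and the function names are hypothetical; the article states only that K is a small constant.

```python
def topk_consistency(hit_families, k=5):
    """Share of the top-k hits that fall in the same TC family as the top hit.

    hit_families: TC family id of each TCDB homolog, best hit first.
    """
    top = hit_families[:k]
    if not top:
        return 0.0
    return sum(1 for family in top if family == top[0]) / len(top)


def adjusted_topk_consistency(hit_families, family_size, k=5):
    """Variant for small families: shrink k to the family's cardinality."""
    return topk_consistency(hit_families, k=min(k, family_size))


hits = ["2.A.1", "2.A.1", "2.A.3", "2.A.1", "1.A.1"]
plain = topk_consistency(hits)                            # 3 of 5 hits agree
adjusted = adjusted_topk_consistency(hits, family_size=2)  # both of 2 agree
```

The adjustment matters because a family with only two members can never fill more than two of the top five positions, so the unadjusted proportion would understate its consistency.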
The fifth category of features for an initially categorized protein was the occurrence of transporter-related Pfam domains, calculated by HMMER with a user-specified e-value threshold. The occurrence of transporter-related domains provided an important clue for potential transporters, especially for families with characteristically conserved domains. Two Pfam features of an unknown protein were extracted for the refining classifier. The first was the best e-value among the transporter-related Pfam domains found, and the second was whether the Pfam domains found agreed with the initially categorized TC family of the unknown protein, determined by checking the mapping between Pfam domains and TC families established in the pre-processor.
The sixth category of features for an initially categorized protein was the hits of transporter-related Gene Ontology terms, obtained by a BLAST search of the protein against the transporters attached to transporter-related GO terms. A hit to a transporter-related GO term was a positive sign of a potential transporter, and the best e-value among the hit transporter sequences in the transporter-related GO database was extracted as a GO feature of the unknown protein for the refining classifier. Whether one of the hit GO terms agreed in function with the initially categorized TC family was considered as another feature of the unknown protein. To simplify the situation where each transporter logically belongs to a branch of GO terms, TransportTP only counted the GO terms directly attached to a transporter as hit terms and searched for functional agreement between these hit terms and the initially predicted TC families, using a text mining program we developed. The text mining program judged two descriptions to be consistent if enough significant overlapping words were found. Obviously insignificant English words were filtered out, but compatible words were counted, based on a series of compatibility rules derived from biological activity; for example, the abbreviation K+ was compatible with words such as potassium, ions and metal. The last category of features was the negative information from UniProt/SwissProt. If the nearest neighbor of an unknown protein in SwissProt was more similar to the protein than its nearest neighbor in TCDB, and its function in SwissProt implied a non-transporter function, such as a transcription factor, the initial categorization of the protein based on TCDB was more likely to be a false positive. Based on this principle, the existence of this kind of conflict between the nearest neighbors in TCDB and SwissProt was extracted as a feature of the protein for the refining classifier.
The list of keywords chosen for detection of non-transporter functions is shown in Additional file 8.
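The SwissProt conflict feature can be sketched as a simple Boolean test. This is a minimal illustration, assuming that "more similar" is judged by comparing e-values (lower is more similar) and that a non-transporter function is detected by keyword matching in the SwissProt description. The keyword tuple below is a hypothetical stand-in for the actual list in Additional file 8, and the function name is also hypothetical.

```python
# Hypothetical stand-in for the keyword list in Additional file 8.
NON_TRANSPORTER_KEYWORDS = ("transcription factor", "receptor", "sensor")


def swissprot_conflict(tcdb_evalue, swissprot_evalue, swissprot_description,
                       keywords=NON_TRANSPORTER_KEYWORDS):
    """Return True when SwissProt evidence contradicts the TCDB assignment.

    A conflict requires both conditions from the text:
    1. the best SwissProt hit is more similar than the best TCDB hit
       (assumed here to mean a lower e-value), and
    2. the SwissProt description implies a non-transporter function.
    """
    closer_in_swissprot = swissprot_evalue < tcdb_evalue
    description = swissprot_description.lower()
    implies_non_transporter = any(word in description for word in keywords)
    return closer_in_swissprot and implies_non_transporter


conflict = swissprot_conflict(1e-20, 1e-80, "Transcription factor MYB44")
no_conflict = swissprot_conflict(1e-20, 1e-80, "Sugar transport protein 1")
```

Plain substring matching is crude; a description such as "transporter-like receptor" would still trigger the flag, which is one reason the result is used as a feature for the refining classifier rather than as a hard filter.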
The classification value of an initially categorized protein for the refining classifier was defined as whether the initial classifier categorized the protein with a correct TC family. A false classification value meant that the protein should be removed from the final predictions. A true classification value of a training protein in the refining classifier was determined by the match with a curated transporter at family or superfamily level in TransportDB [7, 8]. TransportDB was chosen as a benchmark database because the transporters collected in this database were curated by biologists, hence the data therein was relatively reliable.
where N_major and N_minor were the numbers of instances in the majority and minority classes, respectively, and β was a constant. If β was large enough (10 in our implementation), most instances in the majority class would have a chance to be explored. The balanced SVMs in the ensemble voted by their confidences to decide the final predicted classification value for an unknown protein. Similar stochastic and voting strategies have been applied to imbalanced data and proved effective [37, 38].
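The equation that defines how N_major, N_minor and β combine is not preserved in this text. The sketch below assumes one plausible reading: the ensemble contains about β · N_major / N_minor members, each trained on all minority instances plus an equally sized random sample of the majority class, so that each majority instance is drawn about β times on average. The SVM training itself is omitted, and all function names are hypothetical.

```python
import math
import random


def ensemble_size(n_major: int, n_minor: int, beta: int = 10) -> int:
    """Assumed ensemble size: beta * N_major / N_minor, rounded up."""
    return math.ceil(beta * n_major / n_minor)


def balanced_subsets(majority, minority, beta=10, seed=0):
    """Yield balanced training sets: all minority plus a majority sample."""
    rng = random.Random(seed)
    for _ in range(ensemble_size(len(majority), len(minority), beta)):
        yield rng.sample(majority, len(minority)) + list(minority)


def vote(confidences) -> bool:
    """Combine signed member confidences (positive means transporter)."""
    return sum(confidences) > 0


# Toy data: 20 false positives versus 5 true positives.
majority = [("neg", i) for i in range(20)]
minority = [("pos", i) for i in range(5)]
subsets = list(balanced_subsets(majority, minority))
decision = vote([0.9, -0.2, 0.4])
```

Summing signed confidences lets a few confident members outweigh several uncertain ones, which is one way to read "vote by their confidences"; a simple majority of member labels would be an alternative.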
Project name: A two-phase transporter categorization system
Project home page: http://bioinfo3.noble.org/transporter.
Operating system: Platform independent.
Programming language: C++, Perl and JSP.
Other requirements: Java application server.
Any restrictions to use by non-academics: Licence needed for non-academic use; source code available for academic purposes upon request.
The authors thank Drs. Xinbin Dai, Ranamalie Amarasinghe, Carolyn Young, Rakesh Kaundal and Ji He for their valuable comments and suggestions. The authors appreciate the funding support by The Samuel Roberts Noble Foundation, Inc.
- Sakmann B, Neher E: Patch clamp techniques for studying ionic channels in excitable membranes. Annu Rev Physiol 1984, 46: 455–472. 10.1146/annurev.ph.46.030184.002323
- Hsu L, Chiou T, Chen L, Bush D: Cloning a plant amino acid transporter by functional complementation of a yeast amino acid transport mutant. Proc Natl Acad Sci USA 1993, 90: 7441–7445. 10.1073/pnas.90.16.7441
- Kuze K, Graves P, Leahy A, Wilson P, Stuhlmann H, You G: Heterologous expression and functional characterization of a mouse renal organic anion transporter in mammalian cells. J Biol Chem 1999, 274: 1519–1524. 10.1074/jbc.274.3.1519
- Altschul S, Gish W, Miller W, Myers E, Lipman D: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410.
- Saier MJ, Tran C, Barabote R: TCDB: the Transporter Classification Database for membrane transport protein analyses and information. Nucleic Acids Res 2006, (34 Database):D181-D186. 10.1093/nar/gkj001
- Saier M, Yen M, Noto K, Tamang D, Elkan C: The Transporter Classification Database: recent advances. Nucleic Acids Res 2009, (37 Database):D274-D278. 10.1093/nar/gkn862
- Ren Q, Kang K, Paulsen I: TransportDB: a relational database of cellular membrane transport systems. Nucleic Acids Res 2004, (32 Database):D284-D288. 10.1093/nar/gkh016
- Ren Q, Chen K, Paulsen I: TransportDB: a comprehensive database resource for cytoplasmic membrane transport systems and outer membrane channels. Nucleic Acids Res 2007, (35 Database):D274-D279. 10.1093/nar/gkl925
- Sonnhammer E, Eddy S, Durbin R: Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 1997, 28: 405–420. 10.1002/(SICI)1097-0134(199707)28:3<405::AID-PROT10>3.0.CO;2-L
- Ashburner M, Ball C, Blake J, et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25: 25–29. 10.1038/75556
- Devos D, Valencia A: Intrinsic errors in genome annotation. Trends Genet 2001, 17: 429–431. 10.1016/S0168-9525(01)02348-4
- Koski L, Golding G: The closest BLAST hit is often not the nearest neighbor. J Mol Evol 2001, 52: 540–542.
- Doolittle R: Similar amino acid sequences: chance or common ancestry? Science 1981, 214: 149–159. 10.1126/science.7280687
- Bejerano G, Yona G: Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics 2001, 17: 23–43. 10.1093/bioinformatics/17.1.23
- Lin H, Han L, Cai C, Ji Z, Chen Y: Prediction of transporter family from protein sequence by support vector machine approach. Proteins 2006, 62: 218–231. 10.1002/prot.20605
- Dibrov P, Fliegel L: Comparative molecular analysis of Na+/H+ exchangers: a unified model for Na+/H+ antiport? FEBS Lett 1998, 424: 1–5. 10.1016/S0014-5793(98)00119-7
- Heil B, Ludwig J, Lichtenberg-Frate H, Lengauer T: Computational recognition of potassium channel sequences. Bioinformatics 2006, 22: 1562–1568. 10.1093/bioinformatics/btl132
- Gromiha M, Yabuki Y: Functional discrimination of membrane proteins using machine learning techniques. BMC Bioinformatics 2008, 9: 135. 10.1186/1471-2105-9-135
- Lee M, Jeong C, Kim D: Predicting and improving the protein sequence alignment quality by support vector regression. BMC Bioinformatics 2007, 8: 471. 10.1186/1471-2105-8-471
- Li H, Dai X, Zhao X: A nearest neighbor approach for automated transporter prediction and categorization from protein sequences. Bioinformatics 2008, 24: 1129–1136. 10.1093/bioinformatics/btn099
- Apweiler R: Functional information in SWISS-PROT: the basis for large-scale characterisation of protein sequences. Brief Bioinform 2001, 2: 9–18. 10.1093/bib/2.1.9
- Platt JC: Fast training of support vector machines using sequential minimal optimization. In Advances in kernel methods: support vector learning. Cambridge, MA, USA: MIT Press; 1999:185–208.
- Pruitt K, Tatusova T, Maglott D: NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2005, 33(suppl 1):D501-D504.
- Saier MJ: A functional-phylogenetic classification system for transmembrane solute transporters. Microbiol Mol Biol Rev 2000, 64: 354–411. 10.1128/MMBR.64.2.354-411.2000
- Quinlan R: C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann Publishers; 1993.
- Breiman L: Random Forests. Machine Learning 2001, 45: 5–32. 10.1023/A:1010933404324
- Boles E, Andre B: Role of transporter-like sensors in glucose and amino acid signalling in yeast. Top Curr Genet 2004, 9: 121–153.
- Abele R, Tampe R: Function of the transport complex TAP in cellular immune recognition. Biochimica et Biophysica Acta (BBA) - Biomembranes 1999, 1461(2):405–419. 10.1016/S0005-2736(99)00171-6
- Yernool D, Boudker O, Jin Y, Gouaux E: Structure of a glutamate transporter homologue from Pyrococcus horikoshii. Nature 2004, 431: 811–818. 10.1038/nature03018
- Krogh A, Brown M, Mian IS, Sjolander K, Haussler D: Hidden Markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology 1994, 235: 1501–1531. 10.1006/jmbi.1994.1104
- Alam I, Dress A, Rehmsmeier M, Fuellen G: Comparative homology agreement search: An effective combination of homology-search methods. Proc Natl Acad Sci USA 2004, 101: 13814–13819. 10.1073/pnas.0405612101
- Atteson K: Calculating the exact probability of language-like patterns in biomolecular sequences. Proceedings of the sixth International Conference on Intelligent Systems for Molecular Biology (ISMB), Canada 1998, 17–24.
- Tusnady G, Simon I: The HMMTOP transmembrane topology prediction server. Bioinformatics 2001, 17: 849–850. 10.1093/bioinformatics/17.9.849
- Hofmann K, Stoffel W: TMbase - A database of membrane spanning proteins segments. Biol Chem 1993, 374: 166.
- Horton P, Nakai K: Better prediction of protein cellular localization sites with the k nearest neighbors classifier. Proc Int Conf Intell Syst Mol Biol, Halkidiki, Greece 1997, 5: 147–152.
- Witten I, Frank E: Data mining: practical machine learning tools and techniques with Java implementations. ACM SIGMOD Record 2002, 31: 76–77. 10.1145/507338.507355
- Akbani R, Kwek S, Japkowicz N: Applying Support Vector Machines to Imbalanced Datasets. ECML 2004, 39–50.
- Wang BX, Japkowicz N: Boosting Support Vector Machines for Imbalanced Data Sets. ISMIS 2008, 38–47.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.