Dataset
We downloaded all prokaryotic proteomes from the integr8 FTP site ftp://ftp.ebi.ac.uk/pub/databases/integr8/fasta/proteomes/ on May 12, 2009 and selected all NCBI-annotated organisms that could be uniquely associated with an integr8 proteome file via the integr8 proteome report ftp://ftp.ebi.ac.uk/pub/databases/integr8/proteome_report.txt. The annotation of these 1032 organisms for the phenotypes "Endospores", "Gram stain", "Motility" and "Oxygen Requirement" was extracted from the respective columns of the table on the NCBI prokaryotic genome project web site ftp://ftp.ncbi.nlm.nih.gov/genomes/genomeprj/lproks_0.txt. These proteomes were used for the validation procedure (see below). As a separate test set, we downloaded an additional 443 proteomes and the associated annotation from the same sites on October 10, 2009. The complete list of organisms and their phenotype annotation can be found in additional file 1.
Construction of Pfam domain and COG profiles
In this work, the Pfam domain profile associated with a particular organism is a domain occurrence vector of dimension d = 10797, where d corresponds to the maximum family index (PF10797) in version 23.0, section "A" of the Pfam database [13]. This vector has nonzero entries only for dimensions associated with Pfam domains that occur in the organism's genome. To calculate the occurrences, we ran Pfam domain detection on each protein in each genome using the UFO method [15]. The resulting vector x of absolute domain occurrence counts is normalized to relative domain frequencies such that
$\sum_{j=1}^{d} x_j = 1$, where $x_j$ denotes the j-th entry of x.
Furthermore, each dimension of the domain feature space with a standard deviation different from zero is normalized to unit standard deviation. The data matrices with UFO counts for all organisms of the validation and test sets, along with the organism names and relevant NCBI phenotype annotation, can be found in comma-separated values (CSV) format in additional file 5.
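The two normalization steps described above can be sketched as follows. This is a minimal illustration, not the original implementation: the function names are hypothetical, the toy counts are made up, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import pstdev


def relative_frequencies(counts):
    """Scale one organism's absolute domain counts so that they sum to 1."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def scale_to_unit_std(profiles):
    """Divide each dimension by its standard deviation across organisms.

    Dimensions with zero standard deviation are left unchanged.
    Assumption: population standard deviation (not stated in the text).
    """
    n_dims = len(profiles[0])
    stds = [pstdev([row[j] for row in profiles]) for j in range(n_dims)]
    return [
        [v / s if s > 0 else v for v, s in zip(row, stds)]
        for row in profiles
    ]


# Toy example: three organisms, three domain families (made-up counts).
counts = [[2, 0, 2], [1, 0, 3], [4, 0, 0]]
profiles = scale_to_unit_std([relative_frequencies(c) for c in counts])
```

The second domain family never occurs in the toy data, so its column has zero standard deviation and is passed through unscaled.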
For comparison, we also calculated organism-specific profiles of full gene frequencies in terms of Clusters of Orthologous Groups (COGs, [17]) detected in the organisms' protein sequences. For COG detection, we downloaded the COG database in RPS-BLAST (Reverse PSI-BLAST, [26]) format from ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/little_endian/Cog_LE.tar.gz and searched the organisms' proteomes against this database using the default e-value threshold of 0.01. The 5665 COGs give rise to 5665-dimensional organism-specific feature vectors, which we normalized as described above.
Phenotype prediction on domain profiles
To predict microbial phenotypes from organism-specific domain profiles, we use a supervised classification approach. For this purpose, we cast the prediction of the four phenotype categories as four binary classification problems, each discriminating between two realizations of one phenotype. Here, organisms that are annotated as "yes" ("Endospores"), "+" ("Gram stain"), "motile" ("Motility") or "aerobic" ("Oxygen Requirement") are considered positive examples, and organisms that are annotated as "no"/"-"/"non-motile"/"anaerobic" are used as negative examples of the respective two-class problem. Domain profiles of organisms without one of these annotations for a particular phenotype are not considered for evaluation. Furthermore, organisms with an identical UFO Pfam domain profile were reduced to a single representative organism. Thus, the total number of examples as well as the numbers of positive and negative examples vary across the phenotype categories. An overview of the number of examples associated with each phenotype, together with histograms of the phylogenetic distribution of all phenotypes, can be found in additional file 6.
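The construction of one binary problem can be illustrated with a short sketch. The data layout (pairs of a profile and an annotation dictionary) and the function name are hypothetical. Two details are assumptions, because the text does not specify them: duplicates are removed after filtering by annotation, and the first organism encountered is kept as the representative.

```python
POSITIVE = {"Endospores": "yes", "Gram stain": "+",
            "Motility": "motile", "Oxygen Requirement": "aerobic"}
NEGATIVE = {"Endospores": "no", "Gram stain": "-",
            "Motility": "non-motile", "Oxygen Requirement": "anaerobic"}


def build_examples(organisms, phenotype):
    """Return (profiles, labels) for one two-class problem.

    organisms: iterable of (profile, annotation) pairs, where annotation
    maps a phenotype name to its annotated value (hypothetical layout).
    """
    seen = set()
    profiles, labels = [], []
    for profile, annotation in organisms:
        value = annotation.get(phenotype)
        if value == POSITIVE[phenotype]:
            label = 1
        elif value == NEGATIVE[phenotype]:
            label = -1
        else:
            continue  # no usable annotation for this phenotype
        key = tuple(profile)
        if key in seen:
            continue  # identical profile: keep only the first organism
        seen.add(key)
        profiles.append(list(profile))
        labels.append(label)
    return profiles, labels


organisms = [
    ([0.2, 0.8], {"Motility": "motile"}),
    ([0.2, 0.8], {"Motility": "motile"}),      # duplicate profile
    ([0.7, 0.3], {"Motility": "non-motile"}),
    ([0.5, 0.5], {"Gram stain": "+"}),         # no motility annotation
]
profiles, labels = build_examples(organisms, "Motility")
```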
To learn discriminative phenotype prediction models, we use regularized least-squares classifiers (RLSC, [27]). RLS classifiers are closely related to the widely used Support Vector Machines (SVM) and have been shown to provide similar classification performance while being simpler to implement [28]. Furthermore, the RLSC method is computationally efficient and the learned discriminative weight vector can be interpreted in terms of the underlying features. To take into account the imbalanced number of positive and negative examples in each category, we apply a "balanced" implementation of RLSC [29]. The balanced RLSC error function for N examples and a regularization parameter λ can be written as
$E(w) = \sum_{i=1}^{N} b_i (w^{T} x_i - y_i)^2 + \lambda w^{T} w$ (1)
where w is the discriminative weight vector and $y_i \in \{-1, 1\}$ is the class label of the i-th example. The vector b contains the example-specific balancing factors $b_i$, for which we use the inverse size of the class a particular example belongs to. For reasons of computational efficiency, we apply the RLSC method in a kernel-based manner using a linear kernel, whereby the kernel matrix K of all examples is computed by $K = X^{T} X$. The matrix X corresponds to the matrix of all N organism-specific domain profiles, $X = [x_1, ..., x_N]$. The abovementioned error function can now be minimized by
$w = X (K + \lambda B)^{-1} y$ (2)
where B is a matrix that contains the inverse elements of the vector b as diagonal elements. In case of multi-class learning problems with M classes, the label vector y has to be replaced by a matrix $Y = [z_1, ..., z_M]$, $z_{i,j} \in \{0, 1\}$, where $z_i$ is a class-specific vector with non-zero values for examples belonging to class i. The minimization then yields a matrix $W = [w_1, ..., w_M]$ of M class-specific discriminative weight vectors.
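As an illustration of the kernel-based solution for the two-class case, the following sketch computes the organism-specific weights (K + λB)^-1 y with a linear kernel. It is a dependency-free toy version under stated assumptions: the function names and the small Gaussian elimination solver are hypothetical, the three example profiles are made up, and a real analysis with hundreds of organisms would rely on a numerical linear algebra library instead.

```python
def solve(a, rhs):
    """Solve a z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def regularized_kernel(profiles, labels, lam):
    """Return K + lam * B for the linear kernel K = X^T X."""
    n = len(profiles)
    class_size = {c: labels.count(c) for c in set(labels)}
    k = [[sum(p * q for p, q in zip(profiles[i], profiles[j]))
          for j in range(n)] for i in range(n)]
    for i in range(n):
        # b_i = 1 / class size, so the diagonal of B holds 1 / b_i = class size.
        k[i][i] += lam * class_size[labels[i]]
    return k


def fit_balanced_rlsc(profiles, labels, lam):
    """Return the N organism-specific weights (K + lam * B)^-1 y."""
    system = regularized_kernel(profiles, labels, lam)
    return solve(system, [float(y) for y in labels])


# Toy data: two positive organisms and one negative organism.
profiles = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
labels = [1, 1, -1]
alpha = fit_balanced_rlsc(profiles, labels, lam=0.1)
```

For λ > 0 the matrix K + λB is symmetric positive definite, so the linear system always has a unique solution.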
To evaluate the influence of the regularization parameter λ, we generated 20 random partitions of the data set, each with 70% training and 30% validation examples. Using these partitions, we computed the average area under the precision-recall curve (PRC, [16]) over all partitions to determine the best parameter $\lambda \in \{10^{m} \mid m = -5, -4, ..., 5\}$. Finally, we tested the prediction performance of our method in terms of the harmonic mean (also known as F1-measure) $F_1 = 2 \cdot S_n \cdot S_p / (S_n + S_p)$, which combines sensitivity $S_n$ and specificity $S_p$. To estimate the generalization performance of our approach on the set of 443 test genomes, we evaluated the discriminative model associated with the highest validation performance.
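The parameter search can be sketched as a loop over the λ grid and the random partitions. The names below are hypothetical, and `score_fn` stands in for whatever routine trains a model on the training indices and returns the validation score (for example the area under the precision-recall curve). Whether the original partitions were stratified by class is not stated, so this sketch uses plain random shuffling.

```python
import random

# Candidate values 10^-5, 10^-4, ..., 10^5.
LAMBDA_GRID = [10.0 ** m for m in range(-5, 6)]


def random_splits(n_examples, n_splits=20, train_fraction=0.7, seed=0):
    """Return a list of (train_indices, validation_indices) pairs."""
    rng = random.Random(seed)
    n_train = round(train_fraction * n_examples)
    splits = []
    for _ in range(n_splits):
        order = list(range(n_examples))
        rng.shuffle(order)
        splits.append((order[:n_train], order[n_train:]))
    return splits


def select_lambda(score_fn, splits, grid=LAMBDA_GRID):
    """Return the grid value with the highest mean validation score.

    score_fn(lam, train_idx, val_idx) is a hypothetical callback that
    returns a validation score; higher is better.
    """
    def mean_score(lam):
        return sum(score_fn(lam, tr, va) for tr, va in splits) / len(splits)

    return max(grid, key=mean_score)


splits = random_splits(10)
```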
The kernel-based RLSC model associated with a phenotype classification problem is represented by a vector of N organism-specific weights. For fast prediction of phenotypes, the discriminant in the original feature space of Pfam protein domain profiles can be calculated by a linear combination of the learned organism-specific weights and the domain profiles in X. The phenotype prediction for a newly sequenced organism then only requires the construction of the organism's Pfam domain profile and the computation of the dot product of the feature space discriminant and the domain profile.
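The mapping from organism-specific weights back to the feature space, and the resulting prediction step, can be sketched as follows. The function names and toy values are hypothetical. Thresholding the dot product at zero is an assumption: it follows from the ±1 labels and the absence of a bias term in the description, but the text does not state the decision rule explicitly.

```python
def feature_space_discriminant(profiles, alpha):
    """Return w = sum_i alpha_i * x_i, one weight per domain family."""
    n_dims = len(profiles[0])
    return [sum(a * x[j] for a, x in zip(alpha, profiles))
            for j in range(n_dims)]


def predict(w, profile):
    """Return +1 or -1 from the sign of the dot product of w and the profile.

    Assumption: decision threshold at zero, no bias term.
    """
    score = sum(wj * xj for wj, xj in zip(w, profile))
    return 1 if score >= 0 else -1


# Toy training profiles with made-up organism-specific weights.
training_profiles = [[1.0, 0.0], [0.0, 1.0]]
organism_weights = [0.5, -0.5]
w = feature_space_discriminant(training_profiles, organism_weights)
```

Once w is computed, classifying a new organism costs a single dot product and does not require access to the training profiles.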
The feature space discriminant also allows inspection of the learned discriminative features in terms of phenotype-specific Pfam domain families. For each phenotype prediction problem, we assemble two ranked lists of domain families that are associated with the 50 largest positive and the 50 largest negative weights in the profile space discriminant, respectively. This allows direct identification of indicative and counter-indicative domain families.
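Assembling the two ranked lists amounts to sorting the discriminant weights. A minimal sketch, with a hypothetical function name and made-up weights and identifiers:

```python
def rank_domains(weights, domain_ids, top_k=50):
    """Return (indicative, counter_indicative) lists of (domain, weight).

    indicative: up to top_k largest positive weights, largest first.
    counter_indicative: up to top_k most negative weights, most negative first.
    """
    ranked = sorted(zip(weights, domain_ids), key=lambda p: p[0], reverse=True)
    indicative = [(d, w) for w, d in ranked[:top_k] if w > 0]
    counter = [(d, w) for w, d in reversed(ranked[-top_k:]) if w < 0]
    return indicative, counter


weights = [0.5, -2.0, 0.0, 1.5, -0.1]
domain_ids = ["PF00001", "PF00002", "PF00003", "PF00004", "PF00005"]
indicative, counter = rank_domains(weights, domain_ids, top_k=2)
```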
For further interpretation of the abovementioned lists and for identification of functional modules, we also calculate clustering dendrograms for the 50 most discriminative domain families based on their phylogenetic profile, i.e. the absence/presence pattern of a particular domain across all organisms. To calculate the dendrograms, we apply the Matlab® function 'dendrogram' to the domain phylogenetic profiles using average linkage (UPGMA) and the profile correlation as a distance measure. Dendrogram branch colors are calculated by the 'dendrogram' function using the default color threshold at 70% linkage branch length.
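The distance between two phylogenetic profiles can be illustrated as follows. This sketch assumes that the correlation distance is one minus the Pearson correlation of the two presence/absence vectors, which is the usual convention but is not spelled out in the text. The function names are hypothetical, and the resulting matrix would then be passed to an average-linkage clustering routine.

```python
from math import sqrt


def correlation_distance(p, q):
    """Return 1 - Pearson correlation of two presence/absence profiles."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    if var_p == 0 or var_q == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - cov / sqrt(var_p * var_q)


def distance_matrix(phylo_profiles):
    """Return the symmetric matrix of pairwise correlation distances."""
    n = len(phylo_profiles)
    return [[0.0 if i == j else
             correlation_distance(phylo_profiles[i], phylo_profiles[j])
             for j in range(n)] for i in range(n)]


# Toy presence/absence patterns of three domains across four organisms.
phylo_profiles = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
distances = distance_matrix(phylo_profiles)
```

Domains that are present in all organisms or in none have no defined correlation, so they need to be handled separately before clustering.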
Comparison with pathway-based prediction
For comparison with a method that can in principle be used for microbial phenotype prediction, we evaluate our approach on the dataset presented in [8]. Because the two works use different sets of phenotypic traits, only the two phenotypes "Gram stain" and "Oxygen Requirement" can be directly compared. As in [8], we selected the organisms associated with the phenotypic traits "aerobic"/"anaerobic" and "gram positive"/"gram negative" as the respective positive and negative examples of the phenotype category. In Kastenmüller et al. [8], the commercial Pedant database [30] was used for collection of genome features and phenotype annotation. As a consequence, not all organisms present in the Pedant database can be found in the NCBI prokaryote genome project. For our evaluation, we used the data of all organisms whose taxonomy ID could be mapped to an NCBI taxonomy ID. In total, we used 166 of 201 possible organisms for "Gram stain" and 92 of 113 possible organisms for "Oxygen Requirement". However, our inspection of the list of all organisms revealed that most of the missing organisms share their species name with an organism that is present, so the performance values we obtain here can be expected to provide lower bounds.
For direct comparison of the results, we measure the prediction performance in terms of the area under the ROC curve [16] and the product of sensitivity and specificity, as in [8]. In the original work, a ten-fold cross-validation procedure was performed. To get a reliable estimate of the prediction performance, we repeated the cross-validation 100 times using random partitions. Furthermore, we compared our approach to the best of the four classification methods used in [8], as indicated by the classification quality diagrams in additional file 2 of the original work.
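The repeated cross-validation scheme can be sketched as a generator of training and test index sets. The function name is hypothetical, and the sketch uses unstratified random folds, since the text does not say whether class proportions were preserved within folds.

```python
import random


def repeated_kfold(n_examples, n_folds=10, n_repeats=100, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold cross-validation.

    Each repeat reshuffles the examples and splits them into n_folds
    disjoint folds; every fold serves once as the test set.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_examples))
        rng.shuffle(order)
        folds = [order[f::n_folds] for f in range(n_folds)]
        for f in range(n_folds):
            test_idx = folds[f]
            train_idx = [i for g, fold in enumerate(folds) if g != f
                         for i in fold]
            yield train_idx, test_idx


# Small demonstration: 25 examples, 5 folds, 3 repeats.
partitions = list(repeated_kfold(25, n_folds=5, n_repeats=3))
```

Averaging a performance measure over all yielded partitions gives the repeated cross-validation estimate.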