Software development
The PICA framework was initially released in 2010 and has not received further improvements by the original developers since then. We downloaded the open-source code from the project's webpage. The initial release did not highlight SVM as a stand-alone method. Therefore, we re-wrote parts of PICA's SVM plug-in as to enable its use for phenotype prediction in novel genomes. For instance, we ensured that all COGs are always mapped to correct SVM features, especially if not all COGs in the prediction set had already been present in the training data. In addition, we extended the software by implementing a function to extract the most predictive features from linear SVM models. Dumping and ranking features enables the user to interpret the models from a biological perspective: the COGs with the highest discriminative power for presence or absence of a phenotype are ranked at the top. If specific proteins are already known to be part of the genetic underpinning of a phenotype of interest, the user finding the according COGs ranked highly in the list might gain confidence in the biological relevance of the predictions. Unexpected high-ranking COGs hint at previously unknown protein functions associated with the phenotype. However, these may also derive from taxonomic correlations rather than a shared phenotype, which can be caused by insufficient training data. Several minor changes were applied to the software in order to improve code quality. The current release can be obtained from the PICA website or directly from https://github.com/univieCUBE/PICA
Machine learning considerations
The original publication for PICA focused primarily on association rule mining for phenotype prediction [18]. The authors introduced SVMs as an alternative for decision-tree classifiers and trained those only with COGs from association rule antecedents mined beforehand (CPAR2SVM). We re-evaluated the example data sets and performed 10 replicates of 5-fold cross-validation with the SVM and CPAR2SVM approaches. We observed no significant difference in phenotype prediction quality between the two methods (Figure 1). However, the run time was approximately three times lower (average over ten phenotypic traits) using SVM only (Figure 2). Considering these results, we decided to drop association rule mining and performed all further predictions based solely on support vector classification. Furthermore, we repeated the evaluation employing additional SVM kernels. The linear kernel, which is used by PICA by default, is often suboptimal compared to others like e.g. radial basis functions (RBF). Nevertheless, our results clearly show the linear kernel performing better than or at least as good as RBF, sigmoid and polynomial kernels for all ten phenotypic traits (Figure 3). Besides from being most accurate in our test cases, the linear kernel has further general advantages: it is computationally less expensive than other kernels and facilitates feature ranking, which is a non-trivial problem for other kernels.
Coping with increasing problem sizes
We initially evaluated PICA with its example data set, which comprises COG profiles based on eggNOG 2.0 [20]. This database has been updated to version 4.0 [7] since PICA was released to the public, which brought several changes. One important difference for our purposes is the increasing number of COGs overall as well as specifically on LUCA level. The number the COGs present in at least one of the species from the example data set increased from 47,615 to 192,421. We repeated the evaluation of the ten example phenotypic traits with COG profiles from eggNOG 4.0, while maintaining the original phenotypic trait profiles, in order to estimate the impact of increasing dimensionality on computation time. The average time required for the cross-validation procedure increased from 12 minutes to 43 minutes. Thus, we deem phenotype predictions based on eggNOG 4.0 profiles computationally tractable. No significant differences were observed for prediction accuracy (data not shown).
Furthermore, we created virtual species to assess the effect of an increasing number of genomes in training sets in terms of run time and memory consumption. For real traits, it will be impossible to increase the sizes of genome sets to arbitrary numbers, because of insufficient biological knowledge. Several sets of increasing size were compiled and cross-validations were performed for virtual phenotypic traits. Our results indicate that problem sizes of 5,000 genomes with approximately 200,000 different COGs in the training set are still computationally feasible (Figure 4). A typical phenotype prediction scenario currently involves only several hundred genomes with trait labels. We expect the number of COGs not to rise dramatically in upcoming eggNOG versions. Also, not all present COGs are relevant for microbes. Thus, we conclude that PICA is capable of dealing with future large data sets.
Assessing the applicability for metagenomic data
The accuracy of phenotype prediction in metagenomes is bounded by genome completeness. Bins extracted from metagenomes are often incomplete due to limitations and prediction errors of assembly and binning algorithms. We therefore assessed the prediction quality for incomplete genomes in the following procedure: at first, 10 replicates of 5-fold cross-validation were performed for the ten example traits based on eggNOG 4.0 genotype data. The results of these represent a performance baseline, i.e. an estimation of maximal accuracy per trait. In further cross-validations, phenotypic trait models were built from complete genomes in the training set. The genomes in the test sets were simulated for incompleteness by random removal of x percent of the total number of COGs per genome (for x in [10, 20, ..., 90]; 3 replicates). Consequently, phenotypic trait prediction was conducted on these incomplete genomes. All models show approximately sigmoidal response curves for increasing completeness (Figure 5). No values below 50 percent balanced accuracy were observed, because the SVMs tend to assign all genomes to one class in cases of insufficient data, which necessarily produces balanced accuracies of 0.5. Most models achieve almost maximal performance with 60-70 percent complete genomes. The model for detecting Gram-negative type cell envelopes works well in completeness regimes as low as 30-40 percent. On the other hand, predicting psychrophily is difficult even for complete genomes, which may be caused by the very small training set (17 positives). For the generally well performing models, only the endospore-forming and photosynthetic traits profit significantly from completeness higher than 70 percent. We conclude that sensible phenotypic trait predictions are possible for metagenomic bins that do not comprise complete genomes.
Application on a novel trait based on genome reduction
So far we have investigated the phenotype prediction of traits, which can be inferred mainly from the presence of certain proteins. For instance, genes encoding particular porins are essential for the outer cell membrane of Gram-negative bacteria, or genes encoding specific regulators are characteristic for genomes of spore-forming microbes. However, phenotypes might also be determined by the absence of genes. In order to assess the possibility of predicting such traits, we investigated obligate intracellular lifestyle in bacteria. We hypothesize, that this trait should largely be predicted by absence of genes in the reduced genomes of such microbes. We trained a model for obligate intracellular against free-living species and facultative intracellular species (Additional file 1: Table S1). Only complete genomes were considered in this case to enable inference based on feature absence. We deemed a genome complete if at least 39 out of 40 universal prokaryotic marker COGs [31] were present. Cross-validation was performed to estimate the model quality. We observed a prediction accuracy of 0.997 ± 0.010, which is all positives as well as all negatives were predicted correctly in all cases, except for Cand. Pelagibacter sp. IMCC9063. This free-living organism with a small streamlined genome, which indeed shows some features of obligate intracellular microbes [32], was misclassified as obligate intracellular in 3 out of 10 folds.
Subsequently, we created a model from all 97 species with intracellular labels and performed feature ranking. All top 50 features are negative predictors, i.e. their absence is predictive for the obligate intracellular trait (Additional file 1: Table S2). We only find 2 positive predictors in the top 100 and 14 in the top 200 (data not shown). Hence, predicting this phenotype is indeed primarily based on gene absence.
We tested all prokaryotic species in eggNOG 4.0 (core and periphery genomes filtered for marker COGs as described above) for obligate intracellular lifestyle. A total of 169 species is predicted intracellular, which consists of 160 bacteria (Figure 6) and 9 archaea (Additional file 1: Table S3). As expected, many known intracellular bacteria were found in our analysis, e.g. several Mycoplasma, Chlamydia, Borrelia and Rickettsia species, or Coxiella burnetti. Six Bartonella species were predicted obligate intracellular although they are truly facultative intracellular and can be cultivated in vitro. Yet they are highly fastidious and show genome features indicative of a host-integrated metabolism [33]. It is evident, that the discriminative power of the model between obligate and facultative intracellular species is suboptimal so far. This is not surprising as there is a smooth transition between these life styles, largely blurred by our ability to meet growth requirements or simulate intracellular conditions in the lab for some microbes that are host-associated in nature. In addition, the very small number of facultative intracellular organisms (six bacteria) in the training set further aggravated this distinction. A number of free-living Firmicutes, including several Lactobacillus species, are examples for predicted intracellular microbes, for which the reason for their unexpected classification still needs to be deciphered. A number of archaea, including Crenarchaeota from the class of Thermoprotei, have also been predicted as obligate intracellular, but in fact represent free-living microbes (Additional file 1, Table S3). This mis-classification is not surprising as metabolic pathways show pronounced differences between bacteria and archaea, with archaea lacking many classical pathways and containing unique modified variants [34]. A model based on a training set with exclusively intracellular bacteria is thus not suited for the analysis of archaeal genomes. Future models might be trained, however, for microbial eukaryotes.