Scoring amino acid mutation to predict pandemic risk of avian influenza virus

Background: Avian influenza viruses can directly cross species barriers and infect humans with high fatality. Because their antigens are novel to the human host, they pose a serious challenge to public health. The pandemic risk of avian influenza viruses should therefore be analyzed, and a prediction model should be constructed for virology applications. Results: A total of 178 signature positions in 11 viral proteins were first screened as features using the scores of five amino acid factors and their random forest rankings. The support vector machine algorithm achieved good performance. The most important amino acid factor (Factor 5) and the minimal range of signature positions (63 amino acid residues) were also explored. Moreover, human-origin avian influenza viruses carrying three or four genome segments from a human virus were predicted to have pandemic risk with high probability. Conclusion: Using machine learning methods, the present paper scores amino acid mutations and predicts pandemic risk with good performance. Although the long evolutionary distances between avian and human viruses suggest that avian influenza viruses in nature still need time to become established in the human host, it is notable that H7N9 and H9N2 avian viruses carry high pandemic risk. Electronic supplementary material: The online version of this article (10.1186/s12859-019-2770-0) contains supplementary material, which is available to authorized users.


Background
Influenza A virus contains eight segments of single-stranded, negative-sense RNA. Segment 4 encodes the hemagglutinin (HA) gene and segment 6 encodes the neuraminidase (NA) gene. According to the antigenic characteristics of HA and NA, avian influenza A virus has 16 HA subtypes and nine NA subtypes [1]. Because the viral genome mutates rapidly, phenotypes such as antigenicity, drug resistance, and virulence can change in a relatively short time. Moreover, the segmented genome facilitates reassortment and promotes rapid phenotypic change [1].
Avian influenza virus (AIV) can cross the species barrier and cause fatal infections in humans, which has caused large economic losses and attracted extensive public attention. The highly pathogenic AIV of the H5N1 subtype was first reported in Asia in 1996 [2]. That H5N1 virus can directly cross the species barrier and fatally infect the respiratory system was confirmed by the isolation of human-origin H5N1 virus from clinical samples in 1997 [3,4]. Human infections with the H5N1 subtype have been reported widely and continuously since 2003, and a large amount of data has been deposited in public databases [5][6][7][8]. Besides H5N1, other subtypes can also infect humans by direct interspecies transmission. Two infection cases of H9N2 were reported, in 1999 and 2003 [9,10]. H7N7 virus infected farmers in the Netherlands in 2003 [11]. Moreover, H7N9 emerged in 2013, and human cases are still being reported [12,13]. In terms of transmission efficiency, interspecies transmission of AIV has two phenotypes: (1) the virus remains prevalent among poultry or causes human infection with low probability; (2) the virus adapts to the human host and transmits from human to human with high efficiency. Thus far, AIVs in nature have not shown the second phenotype, which indicates that they are at an initial stage of adaptation to the new host and transmit among humans with low efficiency.
Seasonal and pandemic influenza viruses transmit among humans with high efficiency. Unfortunately, a growing number of reports on transmission efficiency have shown that AIVs with adequate amino acid (AA) mutations can transmit among mammals with high efficiency, which strongly suggests that the pandemic risk of AIVs among humans is rising [14][15][16][17][18][19][20]. Because of their high fatality and antigenic novelty for the human host, AIVs pose a serious challenge to public health. Computational tools from bioinformatics are therefore needed to screen mutations in viral proteins, both for studying highly efficient transmission among humans and for predicting the transmission phenotype and the corresponding pandemic risk of AIVs.
In a previous study, five amino acid factors summarized from 491 highly redundant amino acid attributes were associated with specific physiochemical amino acid properties, namely, polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge [21]. In this paper, we used five AA factors to transform viral proteins and used the random forest (RF) method to select features from high-dimensional protein data and score them by their contributions to the efficiency of transmission and pandemic risk. After ranking the positions containing important mutation information, the classifier could predict the transmission phenotype of high efficiency to evaluate the pandemic risk. In the paper, we first identified 178 signature mutation positions by the RF scoring, then predicted AIV occurrence by four popular machine learning methods. Using the most effective classifier, we explored the important amino acid factors and the minimal range of signature positions. The study results could benefit pandemic surveillance and future study on the efficiency of AIV transmission.

Dataset
The final dataset contained 869 high-quality AIV strains (440 avian-origin AIVs with H1-H14, H16 subtypes; 429 human-origin AIVs with H5N1, H5N6, H7N3, H7N7, H7N9 and H9N2 subtypes) and 914 seasonal, pandemic human, and artificial viruses (H1N1, H1N2, H3N2 subtype; H5N1 artificial virus). As the 869 AIVs have low efficiency of transmission and low pandemic risk among human, they were regarded as negative samples. The 914 human or artificial viruses were regarded as positive samples since they were verified to have high efficiency of transmission among humans or mammals. The information related to these strains is summarized in Additional file 1.

Signature amino acid residues
The importance score at each position in the 11 viral proteins was computed by the RF model to screen the signature positions. The slope of the ranked-score curve changed markedly at an importance score of 10 (Fig. 1a). Therefore, 10 was preliminarily selected as the cutoff score. This yielded 178 signature positions, from which the initial amino acid mutation set was generated for further machine learning.
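The cutoff-based screening described above can be sketched in a few lines. This is a minimal Python illustration and not the authors' R code; the function name `select_signature_positions`, the toy scores, and the choice of an inclusive comparison (score >= cutoff) are assumptions made for the example.

```python
def select_signature_positions(scores, cutoff=10.0):
    """Return 1-based alignment positions whose importance score reaches
    the cutoff, ordered from the highest score to the lowest.

    Assumption: the comparison is inclusive (score >= cutoff); the paper
    does not state whether the boundary value itself is kept.
    """
    ranked = sorted(enumerate(scores, start=1), key=lambda p: p[1], reverse=True)
    return [pos for pos, score in ranked if score >= cutoff]


# Hypothetical importance scores for a six-position alignment.
toy_scores = [0.4, 12.5, 3.1, 27.0, 9.9, 10.0]
selected = select_signature_positions(toy_scores, cutoff=10.0)
print(selected)
```

Raising the `cutoff` argument shrinks the selected set, which is the same mechanism used later to search for a smaller mutation set.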
As shown in Table 1, the hemagglutinin protein (HA) contained the largest number of signature positions (41 amino acid residues; about 41/178 = 23%), suggesting that HA is very important for highly efficient transmission of AIVs among humans. HA is mainly involved in receptor-binding and fusion activities. Positions HA102-HA290 are located in or close to the host receptor-binding region [22,23], and HA158, H163, HA189, HA190, HA224, HA226, HA228H are reportedly related to the specificity of receptor binding [14][15][16][17][18][19]. HA94, HA101, HA327, HA367, and HA393 are located at or near the fusion peptide [24], which triggers fusion activity in acidic environments and favors transmission to humans. The HA327 position in the cleavage site is an important virulence site [25]. Position 627 in the polymerase basic protein 2 (PB2) has been implicated in increased replication or virulence of AIVs in mammals and in transmission among humans [19,26]. Positions 93 and 95 in the matrix protein 2 (M2), which are associated with viral particle assembly [27], were also screened. Positions 372 and 375 in the nucleoprotein (NP) are reportedly involved in intracellular transport of viral proteins [28,29]. The viral proteins were transformed by the five amino acid factors and 178 signature positions were screened by the RF method.

Fig. 1 Importance score curve and the performances of k-nearest neighbor (KNN), naïve Bayes (NB), support vector machine (SVM), and random forest (RF) classifiers. a The ranked scores were calculated from five AA factors using the random forest method. The x and y coordinates denote the total length of the 11 protein alignments and the importance scores, respectively. The cutoff value 10 is indicated by the thin horizontal line. b Performances of the four classifiers were evaluated from 100 repeats of 10-fold cross-validation. The area under the curve (AUC) ranges from 0 to 1
Some of the signature positions have been verified as related to the mechanism of interspecies transmission or to highly efficient transmission among humans, which supports the rationale of the model and should benefit prediction accuracy. Moreover, the remaining amino acid mutations, which have not been experimentally verified, may facilitate exploration of the molecular mechanisms of highly efficient transmission among humans.

Performance of the prediction model
Ten-fold cross-validation and the receiver operating characteristic (ROC) curve were used to evaluate the performance of the classifiers. The area under the ROC curve (AUC) was used to assess the four classifiers and their parameter settings. As shown in Fig. 1b, the performances differed markedly. The median AUCs of the support vector machine (SVM) and RF models were almost 1, whereas that of the k-nearest neighbor (KNN) model was almost 0.5. The KNN model performed poorly, possibly because of nonlinear prediction rules in the feature space. The performance of the naïve Bayes (NB) classifier was slightly poorer and less stable than those of the SVM and RF classifiers. Considering its suitability for small sample sizes and its computational complexity, the SVM classifier was selected as the optimal machine learning model for predicting the pandemic risk of AIVs.

Contributions of the AA factors
AIVs were characterized by the scores of the 178 amino acid mutations. The five AA factors are associated with specific physicochemical amino acid properties: polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge. To understand the importance of the five AA factors, the SVM classifier was evaluated on all combinations of the factors. As shown in Fig. 2a, most of the stable performances of the SVM classifier came from AA Factor 5 alone or from combinations that included AA Factor 5. Notably, the median AUC values were almost 1 and remained stable with AA Factor 5 alone. The SVM classifiers using AA Factor 1 alone or AA Factor 2 alone did not perform as well as the classifier using AA Factor 5. These results indicate an important role for AA Factor 5 in the mechanism of AIV transmission. Therefore, AA Factor 5 was employed in further analysis.
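To make the size of this search concrete, the combinations of the five AA factors can be enumerated as below. This is a hedged Python sketch; it assumes that "all combination patterns" means every non-empty subset of the five factors, which the text does not state explicitly, and the name `factor_subsets` is hypothetical.

```python
from itertools import combinations

FACTORS = (1, 2, 3, 4, 5)  # AA Factor 1 to AA Factor 5


def factor_subsets(factors=FACTORS):
    """Yield every non-empty subset of the AA factors, smallest first."""
    for size in range(1, len(factors) + 1):
        yield from combinations(factors, size)


subsets = list(factor_subsets())
# Subsets that include AA Factor 5, the factor highlighted in the text.
with_factor5 = [s for s in subsets if 5 in s]
print(len(subsets), len(with_factor5))
```

Under this assumption there are 2^5 - 1 = 31 feature sets to evaluate, 16 of which contain AA Factor 5.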

Contributions of the mutation sets
As mentioned above, 178 mutation sites were obtained with a cutoff value of 10. To further explore the minimum mutation set associated with transmission efficiency, the cutoff value was increased in steps of 1. The SVM classifier was again evaluated with all five AA factors together. As shown in Fig. 2b, the SVM classifier destabilized at higher cutoffs and achieved its most stable and best performance at a cutoff of 13. The performance of the SVM classifier with AA Factor 5 alone was also calculated for different cutoffs. As shown in Fig. 3a, the SVM classifier performed stably and well up to a cutoff of 17, and the best performance was achieved at a cutoff of 13, which gave 63 signature positions (Table 2). These 63 signature residues were regarded as the minimum mutation set of amino acid residues and were transformed by AA  (Fig. 3c), which suggested that direct interspecies transmission once occurred.
As shown in Table 2, the 63 signature positions were screened with a cutoff value of 13. The nucleoprotein (NP) contained the largest number of signature positions (12 amino acid residues; about 12/63 = 19%), suggesting that NP is very important for the host range of influenza virus [1]. The HA protein contained a similar number of signature positions (11 amino acid residues; about 11/63 = 17%), which further supports the importance of HA for highly efficient transmission of AIVs among humans. Although amino acid mutations in the HA protein are essential for AIV transmission in mammals [14][15][16][17][18][19], mutations in other proteins are also necessary and should be further verified experimentally [14,15,20]. The distribution of mutations across different viral proteins suggests that synergy and nonlinearity among viral proteins deserve attention in the study of AIVs.

Pandemic risk of human-origin AIVs
It has been proposed that a potential pandemic may be triggered by the reassortment of viral genomes [1], in which genome segments of human viruses (excluding the HA segment) are inserted into the genome of AIVs. To evaluate the pandemic risk of human-origin AIVs, genome reassortment between human-origin AIVs and human influenza viruses (seasonal human virus and the 2009 pandemic virus) was simulated artificially. As shown in Table 3, at least three or four genome segments were needed to change the transmission phenotype with high probability (≥ 0.90). The computed results were compatible with the report of Zhang Y., et al. 2013 [20]. Notably, the pandemic risk was high for H7N9 virus (only three segments needed) and H9N2 virus (flexible patterns of genome reassortment), which is very important for future surveillance of avian influenza virus.
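One way to picture this reassortment experiment is to enumerate which non-HA segments of an avian genome are replaced by their human-virus counterparts. The Python sketch below is illustrative only: the function name `reassortants` and the placeholder genomes are hypothetical, and the paper does not state whether every possible pattern was enumerated in this way.

```python
from itertools import combinations

# The eight genome segments of influenza A virus, in segment order 1 to 8.
SEGMENTS = ("PB2", "PB1", "PA", "HA", "NP", "NA", "M", "NS")


def reassortants(avian, human, n_swapped, keep=("HA",)):
    """Yield (swapped_segments, genome) for every way of replacing
    n_swapped avian segments with human ones, never touching `keep`."""
    candidates = [s for s in SEGMENTS if s not in keep]
    for swapped in combinations(candidates, n_swapped):
        genome = {s: (human[s] if s in swapped else avian[s]) for s in SEGMENTS}
        yield swapped, genome


# Placeholder genomes: each segment is just a label, not a real sequence.
avian = {s: "avian_" + s for s in SEGMENTS}
human = {s: "human_" + s for s in SEGMENTS}

three_segment = list(reassortants(avian, human, 3))
four_segment = list(reassortants(avian, human, 4))
print(len(three_segment), len(four_segment))
```

With HA fixed, there are C(7,3) = 35 three-segment patterns and C(7,4) = 35 four-segment patterns; each resulting genome would then be encoded and scored by the classifier.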

Discussion
Avian influenza viruses can cross the species barrier, potentially causing a human pandemic. In this paper, AIV pandemic risk was predicted by the SVM model with excellent performance. We first screened 178 mutation positions in the 11 viral proteins by the RF method. Some of the residues at these positions have been related to interspecies transmission in earlier reports, such as HA158, H163, HA189, HA190, HA224, HA226, HA228H [14][15][16][18], H163 [17], HA94, HA101, HA327, HA367, and HA393 [24], M2 93, M2 95 [27], NP372, NP375 [28,29], and PB2 627 [26], which supports the accuracy and biological relevance of the prediction model. The proposed model provides important clues for future surveillance in the field of virology and is a useful pre-screening tool for phenotype screening in high-level biological safety laboratories.
Amino acid mutations in the HA protein are essential for highly efficient transmission in mammals [16], but mutations in other viral proteins are also necessary [14,15]. Mutations in different proteins introduce synergy and nonlinearity among these viral proteins, which is supported by the results in this paper. The KNN classifier, which relies on raw distances in the feature space, showed poor predictive performance on the initial set of 178 signature positions. Moreover, the minimal signature position set was composed of 63 amino acid residues distributed among different viral proteins, as shown in Table 2. This synergistic effect deserves attention in further studies. Moreover, the NP protein contained the largest number of signature positions in the minimal set (Table 2), and NP is considered important for the host range of influenza virus [1]. The role of the NP protein in transmission should be a focus of future work. The molecular characteristics of AA Factor 5 are related to electrostatic charge, with high coefficients on isoelectric point and net charge [21]. Electrostatic charge is strongly related to the binding of biological molecules, such as the binding between viral surface proteins and host receptors and the binding between viral enzymes and host molecules. The weaker performance of the other four factors relative to AA Factor 5 may suggest that host receptor binding and viral polymerase activity play key roles in the adaptation to the human host and in the highly efficient transmission of avian influenza virus.
Four popular classifiers were used to predict the phenotype of AIVs. With empirical parameters, the SVM model performed well whereas the KNN model did not. The KNN parameter was adjusted from k = 1 to 20 and the performance was still not good. The reason may be that the size of the dataset was not adequate for the dimension of the feature vector. In this paper, each of the 1783 influenza viruses in the final dataset was represented by a vector of 178 × 5 = 890 dimensions. The KNN algorithm performed weakly on our data.
As shown in Table 3, three or four genome segments were needed for H7N9 and H9N2 viruses to change the transmission phenotype with high probability (≥ 0.90), which is very important for future surveillance of AIVs. Moreover, when avian and human viruses with the predicted genome pattern are found in the same region or in the same case, the pandemic risk should be noted.

Conclusions
The 178 signature mutations in 11 viral proteins were first screened by the random forest model. AIV pandemic risk was predicted by the SVM model with excellent performance. Although the long evolutionary distance between avian and human influenza viruses suggests that avian influenza viruses in nature still need a long time to become established among humans, it is notable that H7N9 and H9N2 AIVs carry high pandemic risk. The novel findings in this paper provide important clues for pandemic surveillance.

Dataset
The genome data of 16,551 influenza viruses isolated from nature were collected from the EpiFlu public database [30,31], and those of six artificial H5N1 viruses with pandemic risk were collected from ref. [14]. The data were processed and modeled using multiple public bioinformatics tools and algorithms, as shown in Fig. 4. The strains were isolated between January 1996 and February 2016. The details of data cleaning are the same as those in refs. [32][33][34].
The final dataset for predicting pandemic risk contained two categories of viruses with respect to pandemic risk: 1) 869 high-quality AIV strains with low transmission efficiency among humans: 440 avian-origin AIVs (H1-H14, H16 subtypes) and 429 human-origin AIVs (H5N1, H5N6, H7N3, H7N7, H7N9 and H9N2 subtypes); 2) 914 influenza strains with high transmission efficiency among humans: 908 seasonal or pandemic human influenza viruses (H1N1, H1N2 and H3N2 subtypes) and six artificial H5N1 viruses [14]. To balance the size of the two classes and to limit the high similarity among viral protein sequences, the seasonal and pandemic human viruses from nature were required to differ in isolation location, isolation time, or antigen subtype. The information related to these strains is summarized in Additional file 1.

Scoring amino acid mutation
A random forest is a collection of a large number of decision trees. The contribution of each feature to each tree in the random forest was calculated, and all features were ranked by their average contribution across all trees in the model. The random forest method is widely used for feature selection in prediction problems and can rank the importance of a large number of features for discriminating between categories. In this paper, the transmission phenotype of high efficiency was predicted to evaluate the pandemic risk. Before the classifier models were constructed, molecular features associated with transmission efficiency were first screened. The positive samples (high transmission efficiency) and negative samples (low transmission efficiency) were then classified using the amino acid positions selected by their importance scores.
The RF method was used to screen the signature mutations in the 11 viral proteins [35]. To facilitate the computation of importance scores, the 11 proteins of each strain were artificially concatenated in the following order: polymerase basic protein 2 (PB2), polymerase basic protein 1 (PB1), the second protein expressed from the PB1 gene (PB1-F2), polymerase acidic protein (PA), hemagglutinin (HA), nucleoprotein (NP), neuraminidase (NA), matrix protein 1 (M1), matrix protein 2 (M2), non-structural protein 1 (NS1), and nuclear export protein (NEP). The resulting artificial protein, 4620 amino acids long, was transformed into numerical sequences of amino acid factor values. Any deletions or insertions in the protein were replaced by zeros. All of the viruses were processed sequentially and input to the RF model for the ranking of signature positions. Breiman's random forest algorithm was used with default settings. Because five factors were used to select the features and construct the classifiers, the final importance score at each position was the sum of the five per-factor scores. In brief, high-scoring positions were important for distinguishing positive and negative samples. Signature positions with high scores were regarded as important amino acid mutations associated with the phenotype of highly efficient transmission.
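The concatenation and numerical encoding steps can be illustrated with a short Python sketch. It is a simplified stand-in rather than the authors' pipeline: the factor table `TOY_FACTORS` holds made-up placeholder numbers (not the published factor scores of ref. [21]), only two short toy proteins are used, and mapping unknown symbols to zero alongside gaps is an assumption.

```python
# Hypothetical placeholder values for three residues; these are NOT the
# published amino acid factor scores from ref. [21].
TOY_FACTORS = {
    "A": (0.1, 0.2, 0.3, 0.4, 0.5),
    "K": (1.1, 1.2, 1.3, 1.4, 1.5),
    "M": (2.1, 2.2, 2.3, 2.4, 2.5),
}


def concatenate_proteins(strain, order):
    """Join the aligned protein sequences of one strain in a fixed order."""
    return "".join(strain[name] for name in order)


def encode(sequence, factor_table, factor_index):
    """Map each residue to one factor value.

    Alignment gaps (and, by assumption, any unknown symbol) become 0.0.
    """
    return [
        factor_table[res][factor_index] if res in factor_table else 0.0
        for res in sequence
    ]


# Toy strain with two short aligned proteins ("-" marks an alignment gap).
strain = {"PB2": "MK-", "HA": "KA"}
artificial_protein = concatenate_proteins(strain, order=("PB2", "HA"))
factor5_values = encode(artificial_protein, TOY_FACTORS, factor_index=4)
print(artificial_protein, factor5_values)
```

Running `encode` once per factor index gives five numerical sequences per strain, one for each AA factor, which is the form in which the data would be passed to a feature-ranking model.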

Constructing the predicting model
A two-class model was constructed to predict and evaluate the pandemic risk of AIVs. After the amino acid mutations in all 11 viral proteins were ranked, each strain was represented as a numeric vector of length 5N, where N is the number of positions in the screened amino acid residue set. The pandemic risk was then predicted by four popular machine learning models: 1) Support vector machine [36]. The optimal hyperplane was determined with the regularization parameter C = 1 and the radial basis function (RBF) kernel as defaults. 2) Random forest [35]. The RF model was implemented with the default parameters of the package. 3) Naïve Bayes [36]. The NB model was also implemented with the default parameters of the package. 4) K-nearest neighbor [37]. The KNN classifier is a nonparametric method that assigns a sample to a category by a majority vote of its neighbors; the number of neighbors in this paper was set to 3 (k = 3). All four classifiers were implemented in the R environment with related packages.
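Of the four models, the KNN rule is simple enough to show in full. The following Python sketch implements a majority vote over the k = 3 nearest neighbors; it is not the R implementation used in the paper, and the Euclidean distance, the toy two-dimensional points, and the labels "low" and "high" are assumptions chosen for illustration.

```python
from collections import Counter
from math import dist  # Euclidean distance, available from Python 3.8


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by a majority vote among its k nearest training points.

    Assumption: neighbors are ranked by Euclidean distance.
    """
    nearest = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-D feature vectors; real strains would have 5N dimensions.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(train_x, train_y, (0.5, 0.5)))
print(knn_predict(train_x, train_y, (5.5, 5.5)))
```

Because the vote depends entirely on distances, the rule becomes less reliable as the number of dimensions grows relative to the number of samples.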

Evaluating the performance of different classifiers
All four models were trained on 823 positive samples (high transmission efficiency) and 782 negative samples (low transmission efficiency) randomly selected from the cleaned dataset of influenza viruses. The remaining 10% of samples (91 positive and 87 negative samples) were reserved as an independent test dataset for assessing the performance of the classifiers. Ten-fold cross-validation and the receiver operating characteristic (ROC) curve were used to evaluate the performance of the SVM, NB, RF and KNN classifiers. The area under the ROC curve (AUC) was used to assess the four classifiers and their parameter settings. To compare the classifiers, we repeated the evaluation process 100 times and plotted the distributions of the resulting AUC values. The AUC, which ranges from 0 to 1, was calculated in R [38]. The performance and robustness of the four classifiers were evaluated by the AUC values and their distributions. The 1783 influenza viruses in the final dataset were visualized by multidimensional scaling in R [37].
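For readers unfamiliar with the AUC, it can be computed directly as the fraction of positive-negative pairs in which the positive sample receives the higher score, with ties counted as one half. The Python sketch below uses this pairwise definition; it is not the R routine cited in the paper, and the labels and scores are invented toy values.

```python
def auc(labels, scores):
    """Area under the ROC curve via the pairwise (rank) definition.

    labels: 1 for positive samples, 0 for negative samples.
    scores: classifier outputs, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # A positive outranking a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
print(auc(toy_labels, toy_scores))
```

A value near 1 means positives are almost always ranked above negatives, while a value near 0.5 corresponds to ranking no better than chance.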

Artificial simulation of genome reassortment
Because human influenza viruses and human-origin avian influenza viruses exist simultaneously in nature, mixed infection in one case could give rise to a pandemic virus through genome reassortment [20]. The optimal SVM classifier was used to analyze the artificial simulation of genome reassortment between three human-origin AIVs and three human viruses. The artificial data were processed and predicted as described above. Platt scaling was used to transform the output of the SVM model into a probability over the two classes in order to evaluate the pandemic risk of the reassortant viruses.
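Platt scaling passes the SVM decision value f through a fitted sigmoid, P(positive | f) = 1 / (1 + exp(A·f + B)). The Python sketch below shows only this final mapping; the parameter values `a = -2.0` and `b = 0.0` are hypothetical, since in practice A and B are fitted to held-out decision values, and the fitting step is not shown.

```python
from math import exp


def platt_probability(decision_value, a, b):
    """Platt's sigmoid: P(positive | f) = 1 / (1 + exp(a * f + b)).

    `a` and `b` are assumed to have been fitted beforehand; a negative `a`
    makes larger decision values map to higher probabilities.
    """
    return 1.0 / (1.0 + exp(a * decision_value + b))


# Hypothetical fitted parameters, for illustration only.
a, b = -2.0, 0.0
p_boundary = platt_probability(0.0, a, b)   # a sample on the decision boundary
p_far = platt_probability(2.0, a, b)        # a sample far on the positive side
print(p_boundary, round(p_far, 3))
```

A reassortant would then be flagged as high risk when this probability reaches a chosen threshold, such as the 0.90 level used in the Results.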
In the paper, three human viruses with high effi-