Scoring amino acid mutation to predict pandemic risk of avian influenza virus
BMC Bioinformatics volume 20, Article number: 288 (2019)
Abstract
Background
Avian influenza viruses can cross species barriers directly and infect humans with high fatality. Because their antigens are novel to the human host, they pose a serious challenge to public health. The pandemic risk of avian influenza viruses should therefore be analyzed, and a prediction model should be constructed for virology applications.
Results
The 178 signature positions in 11 viral proteins were first screened as features using the scores of five amino acid factors and their random forest rankings. The support vector machine algorithm achieved good performance. The most important amino acid factor (Factor 5) and the minimal set of signature positions (63 amino acid residues) were also identified. Moreover, human-origin avian influenza viruses carrying three or four genome segments from a human virus were predicted to have pandemic risk with high probability.
Conclusion
Using machine learning methods, this paper scores amino acid mutations and predicts pandemic risk with good performance. Although the long evolutionary distances between avian and human viruses suggest that avian influenza viruses in nature still need time to become established in the human host, it is notable that H7N9 and H9N2 avian viruses carry high pandemic risk.
Background
Influenza A virus contains eight segments of single-stranded negative-sense RNA. Segment 4 encodes the hemagglutinin (HA) gene and segment 6 encodes the neuraminidase (NA) gene. According to the antigenic characteristics of HA and NA, avian influenza A virus has 16 HA subtypes and nine NA subtypes [1]. Because the viral genome mutates rapidly, phenotypes such as antigenicity, drug resistance, and virulence can change in a relatively short time. Moreover, the segmented genome facilitates reassortment and promotes rapid phenotypic change [1].
Avian influenza virus (AIV) can cross the species barrier and cause fatal infections in humans, which has caused huge economic losses and attracted extensive public attention. The highly pathogenic AIV of the H5N1 subtype was first reported in Asia in 1996 [2]. That H5N1 virus can cross species barriers directly and fatally infect the respiratory system was confirmed by the isolation of human-origin H5N1 virus from clinical samples in 1997 [3, 4]. Human infections with the H5N1 subtype have been reported continuously and widely since 2003, and a large amount of data has been deposited in public databases [5,6,7,8]. Besides H5N1, other subtypes can also infect humans by direct interspecies transmission. Two H9N2 infection cases were reported in 1999 and 2003 [9, 10]. H7N7 virus infected farmers in the Netherlands in 2003 [11]. Moreover, H7N9 emerged in 2013, and human cases are still being reported [12, 13]. In terms of transmission efficiency, interspecies transmission of AIV has two phenotypes: (1) circulating among poultry or causing human infection with low probability; (2) adaptation to the human host and human-to-human transmission with high efficiency. Thus far, AIVs in nature have not shown the second phenotype; they remain at the stage of initial adaptation to the new host, with low efficiency of transmission among humans.
Seasonal and pandemic influenza viruses transmit among humans with high efficiency. Unfortunately, a growing number of reports on transmission efficiency have shown that AIVs with adequate amino acid (AA) mutations can transmit efficiently among mammals, which strongly suggests that the pandemic risk of AIVs among humans is rising [14,15,16,17,18,19,20]. Because of their high fatality and antigenic novelty to the human host, AIVs pose a serious challenge to public health. Computational bioinformatics tools are therefore needed to screen mutations in viral proteins, both for studying highly efficient transmission among humans and for predicting the transmission phenotype and the corresponding pandemic risk of AIVs.
In a previous study, five amino acid factors summarized from 491 highly redundant amino acid attributes were associated with specific physicochemical amino acid properties, namely, polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge [21]. In this paper, we used the five AA factors to transform viral proteins and used the random forest (RF) method to select features from the high-dimensional protein data and score them by their contributions to transmission efficiency and pandemic risk. After the positions containing important mutation information were ranked, a classifier could predict the high-efficiency transmission phenotype to evaluate the pandemic risk. We first identified 178 signature mutation positions by RF scoring and then predicted AIV pandemic risk with four popular machine learning methods. Using the most effective classifier, we explored the most important amino acid factors and the minimal set of signature positions. The results could benefit pandemic surveillance and future studies on the efficiency of AIV transmission.
Results
Dataset
The final dataset contained 869 high-quality AIV strains (440 avian-origin AIVs of the H1–H14 and H16 subtypes; 429 human-origin AIVs of the H5N1, H5N6, H7N3, H7N7, H7N9 and H9N2 subtypes) and 914 seasonal, pandemic human, and artificial viruses (H1N1, H1N2 and H3N2 subtypes; H5N1 artificial virus). As the 869 AIVs have low transmission efficiency and low pandemic risk among humans, they were regarded as negative samples. The 914 human or artificial viruses were regarded as positive samples because they have been verified to transmit with high efficiency among humans or mammals. The information related to these strains is summarized in Additional file 1.
Signature amino acid residues
The importance score at each position in the 11 viral proteins was computed by the RF model to screen the signature positions. The slope of the curve changed markedly at an importance score of 10 (Fig. 1a). Therefore, 10 was preliminarily selected as the cutoff score. In total, 178 signature positions were found, and the initial amino acid mutation set was generated for further machine learning.
As shown in Table 1, the hemagglutinin protein (HA) contained the largest number of signature positions (41 amino acid residues; about 41/178 = 23%), suggesting that HA is very important for highly efficient transmission of AIVs among humans. HA is mainly involved in receptor-binding and fusion activities. Positions HA102-HA290 are located in or close to the host receptor-binding region [22, 23], and HA158, H163, HA189, HA190, HA224, HA226, HA228H are reportedly related to the specificity of receptor binding [14,15,16,17,18,19]. HA94, HA101, HA327, HA367, and HA393 are located at or near the fusion peptide [24], which triggers fusion activity in acidic environments and favors transmission to humans. The HA327 position in the cleavage site is an important virulence site [25]. Position 627 in the polymerase basic protein 2 (PB2) has been implicated in increased replication or virulence of AIVs in mammals and in transmission among humans [19, 26]. Positions 93 and 95 in the matrix protein 2 (M2), which are associated with viral particle assembly [27], were also screened. Positions 372 and 375 in the nucleoprotein (NP) are reportedly involved in intracellular transport of viral proteins [28, 29].
The viral proteins were transformed by the five amino acid factors, and 178 signature positions were screened by the RF method. Some of the signature positions have been verified to be related to the mechanism of interspecies transmission or to highly efficient transmission among humans, which supports the rationale of the model construction and benefits prediction accuracy. Moreover, the remaining amino acid mutations, which lack experimental verification, could facilitate the exploration of the molecular mechanisms of highly efficient transmission among humans.
Performance of the prediction model
The 10-fold cross validation and the receiver operating characteristic (ROC) curve were used to evaluate the performance of the classifiers. The area under the ROC curve (AUC) indicates the optimal parameters for the four classifiers. As shown in Fig. 1b, the performances clearly differed. The median AUCs of the support vector machine (SVM) and RF models were almost 1, while that of the K-nearest neighbor (KNN) model was almost 0.5. The KNN model did not perform well, possibly because of nonlinear prediction rules in the feature space. The performance of the Naïve Bayes (NB) classifier was slightly poorer and less stable than those of the SVM and RF classifiers. Considering its suitability for small samples and its computational complexity, the SVM classifier was selected as the optimal machine learning model for predicting the pandemic risk of AIVs.
Contributions of the AA factors
AIVs were characterized by the scores of 178 amino acid mutations. The five AA factors were associated with specific physicochemical amino acid properties: polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge. To understand the importance of the five AA factors, the SVM classifier was used to evaluate all combinations of factors. As shown in Fig. 2a, most of the stable performances of the SVM classifier were obtained with AA Factor 5 or with combinations that included AA Factor 5. Notably, the median AUC values were almost 1 and remained stable under AA Factor 5 alone. The performances of the SVM classifiers under AA Factor 1 or AA Factor 2 alone were not as good as under AA Factor 5. These results indicate an important role for AA Factor 5 in the mechanism of AIV transmission. Therefore, AA Factor 5 was employed in further analysis.
Contributions of the mutation sets
One hundred seventy-eight mutation sites were obtained under a cutoff value of 10, as mentioned above. To further explore the minimum mutation set associated with transmission efficiency, the cutoff value was increased in steps of 1. The SVM classifier was first evaluated with the five AA factors together. As shown in Fig. 2b, the SVM classifier destabilized at higher cutoffs and achieved its most stable and best performance at a cutoff of 13. The performance of the SVM classifier with AA Factor 5 alone was also calculated for different cutoffs. As shown in Fig. 3a, the SVM classifier performed stably and well up to a cutoff of 17, and the best performance was achieved at a cutoff of 13, which gave 63 signature positions (Table 2). These 63 signature residues were regarded as the minimum mutation set of amino acid residues and were transformed by AA Factor 5 alone to show the pattern of avian and human influenza viruses by the multidimensional scaling method [see Additional file 2].
The distribution of human and avian influenza viruses in two dimensions is shown in Fig. 3b. In terms of pandemic risk, most avian viruses clustered at the lower left, while human viruses formed three separate clusters on the right. Avian influenza viruses 1 (EPI_ISL_64953, A/turkey/NC/353568/2005, H3N2), 2 (EPI_ISL_3141, A/Duck/Nanchang/4–184/2000, H2N9) and 3 (EPI_ISL_3362, A/duck/NC/91347/2001, H1N2) were close to the human viruses and should be closely monitored in the future. Group 4 was composed of seasonal human and avian viruses of the H3N2 subtype isolated from 2005 to 2013 in North America (Fig. 3c), which suggests that direct interspecies transmission once occurred.
As shown in Table 2, 63 signature positions were screened with a cutoff value of 13. The nucleoprotein (NP) contained the largest number of signature positions (12 amino acid residues; about 12/63 = 19%), suggesting that NP is very important for the host range of influenza virus [1]. The HA protein contained a similar number of signature positions (11 amino acid residues; about 11/63 = 17%), which further confirms that HA is very important for highly efficient transmission of AIVs among humans. Although amino acid mutations in the HA protein are essential for AIV transmission in mammals [14,15,16,17,18,19], mutations in other proteins are also necessary and should be further verified experimentally [14, 15, 20]. The distribution of mutations across different viral proteins suggests that synergy and nonlinearity among viral proteins deserve attention in the study of AIVs.
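The cutoff sweep described in this paragraph can be sketched in a few lines. This is only an illustration in Python, not the authors' R code: the importance scores are invented, positions are hypothetical indices into the concatenated protein, and treating the cutoff as inclusive (score ≥ cutoff) is an assumption, since the text does not state it.

```python
def positions_above_cutoff(importance, cutoff):
    """Return the positions whose summed importance score is >= cutoff."""
    return sorted(pos for pos, score in importance.items() if score >= cutoff)


def sweep_cutoffs(importance, start, stop):
    """Map each integer cutoff in [start, stop] to the positions it keeps."""
    return {c: positions_above_cutoff(importance, c) for c in range(start, stop + 1)}


# Invented scores keyed by hypothetical position index (illustration only).
toy_importance = {12: 21.4, 340: 17.9, 1875: 13.2, 2210: 10.6, 4001: 4.1}

selected = sweep_cutoffs(toy_importance, 10, 13)
for cutoff, positions in selected.items():
    print(cutoff, len(positions), positions)
```

In a full analysis each cutoff's position set would be passed to the classifier and the resulting AUC distribution compared, which is the step the paper reports in Fig. 2b and Fig. 3a.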
Pandemic risk of human-origin AIVs
It has been proposed that a potential pandemic may be triggered by the reassortment of viral genomes [1], in which genome segments of human viruses (excluding the HA segment) are inserted into the genome of AIVs. To evaluate the pandemic risk of human-origin AIVs, an artificial simulation of genome reassortment between human-origin AIVs and human influenza viruses (seasonal human virus and 2009 pandemic virus) was performed. As shown in Table 3, at least three or four genome segments were needed to change the transmission phenotype with high probability (≥ 0.90). The computed results are compatible with the report of Zhang et al. (2013) [20]. Notably, there was high pandemic risk for H7N9 virus (only three segments needed) and H9N2 virus (flexible patterns of genome reassortment), which is very important for future surveillance of avian influenza virus.
Discussion
Avian influenza viruses can cross the species barrier, potentially causing a human pandemic. In this paper, AIV pandemic risk was predicted by the SVM model with excellent performance. We first screened 178 mutation positions in the 11 viral proteins by the RF method. Some of the residues at these positions have been related to interspecies transmission in earlier reports, such as HA158, H163, HA189, HA190, HA224, HA226, HA228H [14,15,16, 18], H163 [17], HA94, HA101, HA327, HA367, and HA393 [24], M2 93, M2 95 [27], NP372, NP375 [28, 29], PB2 627 [26], which supports the accuracy and biological relevance of the prediction model. The proposed models provide important clues for future surveillance in the field of virology and are a useful pre-screening tool for phenotype screening in high-level biological safety laboratories.
Amino acid mutations in the HA protein are essential for highly efficient transmission in mammals [16], but mutations in other viral proteins are also necessary [14, 15]. Mutations in different proteins introduce synergy and nonlinearity among these viral proteins, which is supported by the results of this paper. The distance-based KNN classifier showed poor predictive performance on the initial set of 178 signature positions. Moreover, the minimal signature position set was composed of 63 amino acid residues distributed among different viral proteins, as shown in Table 2. This synergistic effect deserves attention in further studies. In addition, the NP protein contained the largest number of signature positions (12 amino acid residues; about 12/63 = 19%), suggesting that NP is very important for the host range of influenza virus [1]. The role of the NP protein in transmission should be a focus of future work.
The molecular characteristics of AA Factor 5 are related to electrostatic charge, with high coefficients on isoelectric point and net charge [21]. Electrostatic charge is strongly related to the binding of biological molecules, such as the binding between viral surface proteins and host receptors and the binding between viral enzymes and host molecules. The poorer performance of the other four factors may suggest that host receptor binding and viral polymerase activity play key roles in adaptation to the human host and in highly efficient transmission of avian influenza virus.
Four popular classifiers were used to predict the phenotype of AIVs. With the empirical parameters, the SVM model performed well while the KNN model did not. The KNN parameter was adjusted from k = 1 to 20, and the performance was still poor. The reason may be that the size of the dataset was not adequate for the dimension of the feature vector: each of the 1783 influenza viruses in the final dataset was represented by a vector of 178 × 5 = 890 dimensions. The KNN algorithm therefore performed weakly on our data.
As shown in Table 3, three or four genome segments were needed for H7N9 and H9N2 viruses to change the transmission phenotype with high probability (≥ 0.90), which is very important for future surveillance of AIVs. Moreover, when avian and human viruses with the predicted genome pattern are found in the same region or in the same case, the pandemic risk should be taken seriously.
Conclusions
The 178 signature mutations in 11 viral proteins were first screened by the random forest model. AIV pandemic risk was predicted by the SVM model with excellent performance. Although the long evolutionary distance between avian and human influenza viruses suggests that avian influenza viruses in nature still need a long time to become established in humans, it is notable that H7N9 and H9N2 AIVs carry high pandemic risk. The findings of this paper provide important clues for pandemic surveillance.
Methods
Dataset
The genome data of 16,551 influenza viruses isolated from nature were collected from the EpiFlu public database [30, 31], and those of six artificial H5N1 viruses with pandemic risk were collected from ref. [14]. The data were processed and modeled using multiple public bioinformatics tools and algorithms, as shown in Fig. 4. The strains were isolated between January 1996 and February 2016. The details of data cleaning are the same as those in refs. [32,33,34].
The final dataset for predicting pandemic risk contained two categories of viruses in terms of pandemic risk: 1) 869 high-quality AIV strains with low transmission efficiency among humans: 440 avian-origin AIVs (H1–H14, H16 subtypes) and 429 human-origin AIVs (H5N1, H5N6, H7N3, H7N7, H7N9 and H9N2 subtypes); 2) 914 influenza strains with high transmission efficiency among humans: 908 seasonal or pandemic human influenza viruses (H1N1, H1N2 and H3N2 subtypes) and six artificial H5N1 viruses [14]. To balance the data size and reduce the high similarity among viral protein sequences, the seasonal and pandemic human viruses from nature were required to differ in isolation location, isolation time, or antigenic subtype. The information related to these strains is summarized in Additional file 1.
Scoring amino acid mutation
Random forest is an ensemble of a large number of decision trees. The contribution of each feature to each tree in the random forest was calculated, and all features were ranked according to their average contribution across all trees in the model. The random forest method is widely used for feature selection in prediction problems and can rank the importance of features at large scale to discriminate between categories. In this paper, the transmission phenotype of high efficiency was predicted to evaluate pandemic risk. Before the classifier models were constructed, molecular features associated with transmission efficiency were first screened. The positive samples (high transmission efficiency) and negative samples (low transmission efficiency) were then classified by their importance scores at each amino acid position.
The RF method was used to screen the signature mutations in the 11 viral proteins [35]. To facilitate the computation of importance scores, the 11 proteins in each strain were artificially concatenated in the following order: polymerase basic protein 2 (PB2), polymerase basic 1 (PB1), the second protein expressed in the PB1 gene (PB1-F2), polymerase acidic protein (PA), hemagglutinin (HA), nucleoprotein (NP), neuraminidase (NA), matrix protein 1 (M1), matrix protein 2 (M2), non-structural protein 1 (NS1), and nuclear export protein (NEP). The concatenated artificial protein, 4620 amino acids in length, was transformed into numerical sequences of the amino acid factors. Any deletions or insertions in the protein were replaced by zeros. All of the viruses were processed sequentially and input to the RF model for ranking the signature positions. Breiman’s random forest algorithm was used with default settings. As five factors were used to select the features and construct the classifiers, the final importance score at each position was the sum of the five calculations. In brief, high-scoring positions were important for distinguishing positive and negative samples. Signature positions with high scores were regarded as important amino acid mutations associated with the phenotype of highly efficient transmission.
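The encoding and score-summing steps above can be illustrated with a small Python sketch. It is not the authors' R implementation. The factor table below holds placeholder numbers for four residues only, not the published values from ref. [21], and the function names are hypothetical.

```python
# Placeholder factor scores: NOT the published values from ref. [21].
# Each residue maps to five numbers (Factor 1 .. Factor 5).
TOY_FACTORS = {
    "A": (-0.5, -1.0, -1.5, -0.5, 0.0),
    "K": (1.5, -0.5, 0.5, 0.0, 1.5),
    "E": (1.0, -1.5, 0.0, 0.5, -1.5),
    "G": (-0.5, 1.5, -2.0, 0.5, 0.0),
}
# Gaps are replaced by zeros; in this toy version any symbol missing
# from the table is treated the same way.
GAP = (0.0, 0.0, 0.0, 0.0, 0.0)


def concatenate(proteins, order):
    """Join the aligned proteins of one strain in a fixed order."""
    return "".join(proteins[name] for name in order)


def encode(sequence, factor_index):
    """Turn an aligned sequence into one numeric value per position."""
    return [TOY_FACTORS.get(aa, GAP)[factor_index] for aa in sequence]


def sum_importance(per_factor_scores):
    """Final score per position = sum of the five per-factor scores."""
    return [sum(scores) for scores in zip(*per_factor_scores)]


strain = {"PB2": "AK", "HA": "E-G"}          # two toy proteins, one gap
artificial = concatenate(strain, ["PB2", "HA"])
factor5 = encode(artificial, 4)              # index 4 is Factor 5
totals = sum_importance([[1, 2], [3, 4], [0, 1], [2, 2], [4, 1]])
print(artificial, factor5, totals)
```

Running the random forest once per factor and summing the five importance vectors position by position gives a single ranking over the concatenated protein, which is what the cutoff is later applied to.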
Constructing the predicting model
A two-class model was constructed to predict and evaluate the pandemic risk of AIVs. After the amino acid mutations in all 11 viral proteins were ranked, each strain was represented as a numeric vector of length 5N, where N is the number of residues in the screened amino acid set. The pandemic risk was then predicted by four popular machine learning models: 1) Support vector machine [36]. The optimal hyperplane was determined with the regularization parameter C (C = 1) and the radial basis function (RBF) kernel as default. 2) Random forest [35]. The RF model was implemented with the default parameters in the package. 3) Naïve Bayes [36]. The NB model was also implemented with the default parameters in the package. 4) K-nearest neighbor [37]. The KNN classifier is a nonparametric method that determines a sample's category by a majority vote of its neighbors; the number of neighbors in this paper was set to 3 (k = 3). All four classifiers were implemented in the R environment with related packages.
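As a rough illustration of the 5N representation and of the KNN majority vote, the following Python sketch may help. It is not the R code used in the study. The ordering of values inside the vector (all five factors for one position, then the next position) and the Euclidean distance are assumptions, and the data are invented.

```python
import math
from collections import Counter


def build_vector(factor_rows, signature_positions):
    """Collect the five factor values at each signature position (length 5N).

    factor_rows holds five per-position lists, one per AA factor.
    """
    return [row[pos] for pos in signature_positions for row in factor_rows]


def knn_predict(train_x, train_y, query, k=3):
    """Label a query by majority vote among its k nearest training vectors."""
    ranked = sorted(zip(train_x, train_y), key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Five toy factor rows over a 4-residue protein; value = 10 * factor + position.
rows = [[10 * f + p for p in range(4)] for f in range(5)]
vector = build_vector(rows, [1, 3])          # N = 2, so 5N = 10 values

# Invented 2-D training data: "low" and "high" transmission efficiency.
train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5]]
train_y = ["low", "low", "low", "high", "high"]
print(len(vector), knn_predict(train_x, train_y, [5.5, 5.0]))
```

With k = 3 the vote for the query above is two "high" neighbors against one "low" neighbor, so the query is labeled "high".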
Evaluating the performance of different classifiers
All four models were trained on 823 positive samples (high transmission efficiency) and 782 negative samples (low transmission efficiency) randomly selected from the cleaned dataset of influenza viruses. The remaining 10% of samples (91 positive and 87 negative samples) were reserved as an independent test dataset for assessing the performances of the classifiers. The 10-fold cross validation and the receiver operating characteristic curve were used to evaluate the performance of the SVM, NB, RF and KNN classifiers. The area under the ROC curve indicates the optimal parameters for the four classifiers. To compare the classifier performances, we repeated the evaluation process 100 times and plotted the distributions of the resulting AUC values. The AUC was calculated in R [38]. The AUC value ranges from 0 to 1. The performance and robustness of the four classifiers were evaluated by the AUC values and their distribution. The 1783 influenza viruses in the final dataset were visualized by the multidimensional scaling method in R [37].
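For readers less familiar with the AUC, the sketch below computes it directly from its pairwise definition: the fraction of positive/negative pairs in which the positive sample receives the higher score, with ties counted as one half. This is a plain Python illustration, not the ROCR routine cited in the text, and the scores are invented. The last lines only check that the reported training and test counts add up to the dataset sizes.

```python
def auc(scores, labels):
    """Area under the ROC curve from the pairwise (rank) definition."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
partial = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
chance = auc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])
print(perfect, partial, chance)

# Sample counts reported in the text: training plus test per class.
positives_total = 823 + 91
negatives_total = 782 + 87
print(positives_total, negatives_total, positives_total + negatives_total)
```

An AUC near 1 means the classifier ranks almost every positive strain above every negative one, while a value near 0.5, as reported for KNN, is what random scoring would give.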
Artificial simulation of genome reassortment
As human influenza viruses and human-origin avian influenza viruses exist simultaneously in nature, mixed infection in one case could give rise to a pandemic virus through genome reassortment [20]. The optimal SVM classifier was used to analyze the artificial simulation of genome reassortment between three human-origin AIVs and three human viruses. The artificial data were processed and predicted as described above. Platt scaling was used to transform the output of the SVM model into a probability over the two classes and thus to evaluate the pandemic risk of the reassortant viruses.
In this paper, three human viruses with high transmission efficiency were taken from the positive samples: A/Ohio/09/2015 (EPI_ISL_179403; H1N1), A/Wisconsin/13/2015 (EPI_ISL_176723; H3N2), and A/Sichuan/1/2009 (EPI_ISL_30411; H1N1; 2009 pandemic swine virus). Three human-origin avian viruses with low transmission efficiency were taken from the negative samples: A/Egypt/682/2015 (EPI_ISL_195659; H5N1), A/Zhejiang/9/2015 (EPI_ISL_192505; H7N9), and A/Hunan/44558/2015 (EPI_ISL_203644; H9N2).
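Platt scaling maps an SVM decision value f to a probability through a fitted sigmoid, P(positive | f) = 1 / (1 + exp(A·f + B)). The Python sketch below shows only this mapping. The parameters A and B are normally fitted by maximum likelihood on held-out decision values; the values used here are arbitrary placeholders, not those of the study, and the helper names are hypothetical.

```python
import math


def platt_probability(decision_value, a, b):
    """Map an SVM decision value f to 1 / (1 + exp(a * f + b))."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


def is_high_risk(probability, threshold=0.90):
    """Apply the probability threshold used in the paper (>= 0.90)."""
    return probability >= threshold


# Placeholder sigmoid parameters (assumed, not fitted to any data).
A, B = -2.0, 0.0

for f in (-1.0, 0.0, 2.0):
    p = platt_probability(f, A, B)
    print(f, round(p, 3), is_high_risk(p))
```

With a negative A, larger decision values give probabilities closer to 1, so a reassortant far on the positive side of the hyperplane is flagged once its probability reaches 0.90.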
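One way to picture the simulated reassortment is to enumerate every avian backbone in which a fixed number of segments, never HA, is replaced by the human virus's segments. The Python sketch below does this with placeholder strings standing in for sequences. Which reassortment patterns the authors actually enumerated is not stated in the text, so the exhaustive enumeration is an assumption.

```python
from itertools import combinations

# The eight genome segments of influenza A virus.
SEGMENTS = ["PB2", "PB1", "PA", "HA", "NP", "NA", "M", "NS"]


def reassortants(avian, human, n_swapped, keep=("HA",)):
    """Yield (swapped_segments, genome) for an avian backbone carrying
    n_swapped human segments; segments listed in `keep` stay avian."""
    candidates = [s for s in SEGMENTS if s not in keep]
    for swapped in combinations(candidates, n_swapped):
        genome = {s: (human[s] if s in swapped else avian[s]) for s in SEGMENTS}
        yield swapped, genome


# Placeholder "sequences": a tag per segment instead of real data.
avian = {s: "avian_" + s for s in SEGMENTS}
human = {s: "human_" + s for s in SEGMENTS}

three = list(reassortants(avian, human, 3))
four = list(reassortants(avian, human, 4))
print(len(three), len(four), three[0][0])
```

Each generated genome would then be translated, encoded with the AA factors, and scored by the classifier, and the patterns whose predicted probability reaches 0.90 correspond to the entries reported in Table 3.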
Abbreviations
- AA: Amino acid
- AIV: Avian influenza virus
- AUC: Area under the ROC curve
- HA: Hemagglutinin
- KNN: K-nearest neighbor
- M1: Matrix protein 1
- M2: Matrix protein 2
- NA: Neuraminidase
- NB: Naïve Bayes
- NEP: Nuclear export protein
- NP: Nucleoprotein
- NS1: Non-structural protein 1
- PA: Polymerase acidic protein
- PB1: Polymerase basic 1
- PB1-F2: The second protein expressed in the PB1 gene
- PB2: Polymerase basic protein 2
- RBF: Radial basis function
- RF: Random forest
- ROC: Receiver operating characteristic curve
- SVM: Support vector machine
References
1. Webster RG, Bean WJ, Gorman OT, Chambers TM, Kawaoka Y. Evolution and ecology of influenza A viruses. Microbiol Rev. 1992;56:152–79.
2. Xu X, Subbarao K, Cox NJ, Guo Y. Genetic characterization of the pathogenic influenza A/goose/Guangdong/1/96 (H5N1) virus: similarity of its hemagglutinin gene to those of H5N1 viruses from the 1997 outbreaks in Hong Kong. Virology. 1999;261:15–9.
3. Claas EC, Osterhaus AD, van Beek R, De Jong JC, Rimmelzwaan GF, Senne DA, et al. Human influenza A H5N1 virus related to a highly pathogenic avian influenza virus. Lancet. 1998;351:472–7.
4. Subbarao K, Klimov A, Katz J, Regnery H, Lim W, Hall H, et al. Characterization of an avian influenza A (H5N1) virus isolated from a child with a fatal respiratory illness. Science. 1998;279:393–6.
5. Chen H, Smith GJ, Li KS, Wang J, Fan XH, Rayner JM, et al. Establishment of multiple sublineages of H5N1 influenza virus in Asia: implications for pandemic control. Proc Natl Acad Sci U S A. 2006;103:2845–50.
6. Li KS, Guan Y, Wang J, Smith GJ, Xu KM, Duan L, et al. Genesis of a highly pathogenic and potentially pandemic H5N1 influenza virus in eastern Asia. Nature. 2004;430:209–13.
7. Zhu QY, Qin ED, Wang W, Yu J, Liu BH, Hu Y, et al. Fatal infection with influenza A (H5N1) virus in China. N Engl J Med. 2006;354:2731–2.
8. Shu YL, Yu HJ, Li DX. Lethal avian influenza A (H5N1) infection in a pregnant woman in Anhui province, China. N Engl J Med. 2006;354:1421–2.
9. Peiris M, Yuen KY, Leung CW, Chan KH, Ip PL, Lai RW, et al. Human infection with influenza H9N2. Lancet. 1999;354:916–7.
10. Butt KM, Smith GJ, Chen H, Zhang LJ, Leung YH, Xu KM, et al. Human infection with an avian H9N2 influenza A virus in Hong Kong in 2003. J Clin Microbiol. 2005;43:5760–7.
11. Fouchier RA, Schneeberger PM, Rozendaal FW, Broekman JM, Kemink SA, Munster V, et al. Avian influenza A virus (H7N7) associated with human conjunctivitis and a fatal case of acute respiratory distress syndrome. Proc Natl Acad Sci U S A. 2004;101:1356–61.
12. Gao R, Cao B, Hu Y, Feng Z, Wang D, Hu W, et al. Human infection with a novel avian-origin influenza A (H7N9) virus. N Engl J Med. 2013;368:1888–97.
13. Cao HF, Liang ZH, Feng Y, Zhang ZN, Xu J, He H. A confirmed severe case of human infection with avian-origin influenza H7N9: a case report. Exp Ther Med. 2015;9:693–6.
14. Herfst S, Schrauwen EJ, Linster M, Chutinimitkul S, de Wit E, Munster VJ, et al. Airborne transmission of influenza A/H5N1 virus between ferrets. Science. 2012;336:1534–41.
15. Imai M, Watanabe T, Hatta M, Das SC, Ozawa M, Shinya K, et al. Experimental adaptation of an influenza H5 HA confers respiratory droplet transmission to a reassortant H5 HA/H1N1 virus in ferrets. Nature. 2012;486:420–8.
16. Glaser L, Stevens J, Zamarin D, Wilson IA, García-Sastre A, Tumpey TM, et al. A single amino acid substitution in 1918 influenza virus hemagglutinin changes receptor binding specificity. J Virol. 2005;79:11533–6.
17. Mishin VP, Novikov D, Hayden FG, Gubareva LV. Effect of hemagglutinin glycosylation on influenza virus susceptibility to neuraminidase inhibitors. J Virol. 2005;79:12416–24.
18. Sorrell EM, Wan H, Araya Y, Song H, Perez DR. Minimal molecular constraints for respiratory droplet transmission of an avian–human H9N2 influenza A virus. Proc Natl Acad Sci U S A. 2009;106:7565–70.
19. Li X, Shi J, Guo J, Deng G, Zhang Q, Wang J, et al. Genetics, receptor binding property, and transmissibility in mammals of naturally isolated H9N2 avian influenza viruses. PLoS Pathog. 2014;10:e1004508.
20. Zhang Y, Zhang Q, Kong H, Jiang Y, Gao Y, Deng G, et al. H5N1 hybrid viruses bearing 2009/H1N1 virus genes transmit in guinea pigs by respiratory droplet. Science. 2013;340:1459–63.
21. Atchley WR, Zhao J, Fernandes AD, Drüke T. Solving the protein sequence metric problem. Proc Natl Acad Sci U S A. 2005;102:6395–400.
22. Stevens J, Corper AL, Basler CF, Taubenberger JK, Palese P, Wilson IA. Structure of the uncleaved human H1 hemagglutinin from the extinct 1918 influenza virus. Science. 2004;303:1866–70.
23. Hulse DJ, Webster RG, Russell RJ, Perez DR. Molecular determinants within the surface proteins involved in the pathogenicity of H5N1 influenza viruses in chickens. J Virol. 2004;78:9954–64.
24. Chen J, Skehel JJ, Wiley DC. N- and C-terminal residues combine in the fusion-pH influenza hemagglutinin HA (2) subunit to form an N cap that terminates the triple-stranded coiled coil. Proc Natl Acad Sci U S A. 1999;96:8967–72.
25. Schrauwen EJA, de Graaf M, Herfst S, Rimmelzwaan GF, Osterhaus ADME, Fouchier RAM. Determinants of virulence of influenza A virus. Eur J Clin Microbiol Infect Dis. 2014;33:479–90.
26. Hatta M, Gao P, Halfmann P, Kawaoka Y. Molecular basis for high virulence of Hong Kong H5N1 influenza A viruses. Science. 2001;293:1840–2.
27. Iwatsuki-Horimoto K, Horimoto T, Noda T, Kiso M, Maeda J, Watanabe S, et al. The cytoplasmic tail of the influenza A virus M2 protein plays a role in viral assembly. J Virol. 2006;80:5233–40.
28. Bullido R, Gomez-Puertas P, Albo C, Portela A. Several protein regions contribute to determine the nuclear and cytoplasmic localization of the influenza A virus nucleoprotein. J Gen Virol. 2000;81:135–42.
29. Iwatsuki-Horimoto K, Horimoto T, Fujii Y, Kawaoka Y. Generation of influenza A virus NS2 (NEP) mutants with an altered nuclear export signal sequence. J Virol. 2004;78:10149–55.
30. Elbe S, Buckland-Merrett G. Data, disease and diplomacy: GISAID’s innovative contribution to global health. Global Chall. 2017;1:33–46. https://doi.org/10.1002/gch2.1018.
31. Shu Y, McCauley J. GISAID: global initiative on sharing all influenza data-from vision to reality. Euro Surveillance. 2017;22:30494. https://doi.org/10.2807/1560-7917.ES.2017.22.13.30494.
32. Qiang X, Kou Z, Fang G, Wang Y. Scoring amino acid mutations to predict avian-to-human transmission of avian influenza viruses. Molecules. 2018;23:1584.
33. Qiang X, Kou Z. Predicting interspecies transmission of avian influenza virus based on wavelet packet decomposition. Comput Biol Chem. 2018. https://doi.org/10.1016/j.compbiolchem.2018.11.029.
34. Wang J, Kou Z, Duan M, Ma C, Zhou Y. Using amino acid factor scores to predict avian-to-human transmission of avian influenza viruses: a machine learning study. Protein Peptide Lett. 2013;20:1115–21.
35. Liaw A, Wiener M. Classification and regression by randomForest. R News. 2002;2:18–22.
36. Chang C, Lin C. LIBSVM: a library for support vector machines. ACM T Intel Syst Tec. 2011;2:1–27.
37. Venables WN, Ripley BD. Modern applied statistics with S. 4th ed. New York: Springer; 2002.
38. Sing T, Sander O, Beerenwinkel N, Lengauer T. ROCR: visualizing classifier performance in R. Bioinformatics. 2005;21:3940–1.
Acknowledgements
We would like to acknowledge the originating and submitting laboratories of the viral sequences from GISAID’s EpiFlu Database.
Funding
Publication costs are funded by the National Natural Science Foundation of China (61632002) and the Natural Science Foundation of Guangdong Province of China (2018A030313380).
Availability of data and materials
The datasets analyzed during the current study are available in the EpiFlu repository, https://www.gisaid.org [30, 31]. The nomenclature for influenza viruses in the final dataset is provided as Additional file 1. The clustering details for the MDS method are provided as Additional file 2.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 8, 2019: Decipher computational analytics in digital health and precision medicine. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-8.
Author information
Contributions
XQ and ZK implemented and performed all computational work. ZK and XQ wrote the manuscript. Both authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files
Additional file 1:
The nomenclature for influenza virus in the final dataset. (XLSX 98 kb)
Additional file 2:
The clustering details for the MDS method. (XLSX 135 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Qiang, X., Kou, Z. Scoring amino acid mutation to predict pandemic risk of avian influenza virus. BMC Bioinformatics 20 (Suppl 8), 288 (2019). https://doi.org/10.1186/s12859-019-2770-0
DOI: https://doi.org/10.1186/s12859-019-2770-0