FluTyper-an algorithm for automated typing and subtyping of the influenza virus from high resolution mass spectral data

Background High resolution mass spectrometry has been employed to rapidly and accurately type and subtype influenza viruses. The detection of signature peptides with unique theoretical masses enables the unequivocal assignment of the type and subtype of a given strain. This analysis has, to date, required the manual inspection of mass spectra of whole virus and antigen digests. Results A computer algorithm, FluTyper, has been designed and implemented to achieve the automated analysis of MALDI mass spectra recorded for proteolytic digests of the whole influenza virus and antigens. FluTyper incorporates the use of established signature peptides and newly developed naïve Bayes classifiers for four common influenza antigens, hemagglutinin, neuraminidase, nucleoprotein, and matrix protein 1, to type and subtype the influenza virus based on their detection within proteolytic peptide mass maps. Theoretical and experimental testing of the classifiers demonstrates their applicability at protein coverage rates normally achievable in mass mapping experiments. The application of FluTyper to whole virus and antigen digests of a range of different strains of the influenza virus is demonstrated. Conclusions FluTyper algorithm facilitates the rapid and automated typing and subtyping of the influenza virus from mass spectral data. The newly developed naïve Bayes classifiers increase the confidence of influenza virus subtyping, especially where signature peptides are not detected. FluTyper is expected to popularize the use of mass spectrometry to characterize influenza viruses.


Background
Influenza is a leading cause of death throughout the developed world and contributes to between 250,000 and 500,000 deaths every year worldwide [1]. On three occasions last century, global pandemics resulted in millions of deaths, while recent pandemic threats have been posed by strains of avian [2] and swine origin [3]. Much higher rates of infection exist in the general population that, while not life threatening, inflict illness and suffering. The virus also imposes a significant social and economic burden through productivity losses in the workplace [4]. The genetic analysis of the influenza virus is derived from RT-PCR sequencing of amplified gene segments for the major antigens of the virus [5]. Most work is focused on the hemagglutinin gene because of its primary role in antigenic drift [6]. This is aided by the Influenza Virus Resource, a sequence database developed by the National Center for Biotechnology Information (NCBI) [7], which provides access to genetic sequence data and facilitates multiple sequence alignments, phylogenetic analysis and the generation of clusters [8,9]. It is typical, in a retrospective analysis, for a strain from the most dominant genetic cluster within one influenza season to be recommended by the WHO for the vaccine in the following season. Antigenic change is measured primarily with the hemagglutination inhibition (HI) assay [10], where antisera raised from infection of a host with one strain are cross-reacted with other uncharacterized and reference strains in parallel. New computational approaches have been developed to analyze HI data [11] that increase the reliability with which antigenic differences can be assessed, and this has been aided by mass spectrometric approaches [12] that enable epitopic domains to be localized [13][14][15][16][17]. Antigenic maps allow for the visualization of antigenic relationships among many strains in order to follow the short- and long-term evolution of the virus [18]. 
These maps can aid the comparison of antigenic data derived from different laboratories and enable such data to be more reliably interpreted. Epidemiological modeling to predict whether new emerging strains are likely to cause widespread epidemics in future seasons is also under development [19,20]. The inclusion of antigenic drift and cross-immunity data can improve the reliability of these models. We have recently developed the most direct and rapid method yet to survey influenza from the perspective of the viral protein antigens [21][22][23][24]. Antigens recovered from the virus or present in whole virus or vaccine preparations are digested with site-specific proteases and the peptide products are analyzed by high resolution mass spectrometry [25]. The mass accuracy attained in these analyses enables the unambiguous identification of conserved signature peptides that are specific to a given type or subtype of the influenza virus. The signature peptides are unique in mass when compared to the in silico digest of all influenza proteins across all strains and hosts and of those proteins known to contaminate virus preparations. To date, the analysis of high resolution mass spectra of influenza proteolytic preparations has required manual interpretation through the identification of signature peptide masses that indicate the type or the subtype of an influenza virus. Manual interpretation can be performed when signature peptides dominate a mass spectrum, but it does not establish the degree of confidence in typing and subtyping strains. Further, spectral analysis often involves the detection of multiple signature peptides, some of low abundance, or in some cases establishing the type and subtype without signature peptides (Po > 90-95). 
Existing algorithms such as the Mascot Peptide Mass Fingerprinting algorithm [26] can be used to identify proteins within a mass spectrum; however, such algorithms do not provide any level of confidence for the type and subtype of the virus from which the proteins are identified. This is particularly a problem when signature peptides are not detected in a given mass spectrum. To extend our previous work and automate the analysis of high resolution mass spectra of influenza proteolytic preparations, the FluTyper algorithm has been developed. FluTyper implements methods to deisotope, filter and detect peaks from mass spectra. Peaks are then matched against established signature peptides from common antigens [21][22][23][24]. In addition, naïve Bayes classifiers have been developed to provide statistical confidence for type and subtype assignments where few or no signature peptides are available. Here the basis of the FluTyper algorithm is described and its application for the automated analysis of MALDI mass spectra derived from antigen and whole virus digests is demonstrated.

Algorithm overview
FluTyper has been designed to utilize naïve Bayes classifiers for the typing and subtyping of proteolytic influenza mass spectra. FluTyper is divided into two main parts: first, the algorithm generates naïve Bayes classifiers and determines unique signature peptides; second, the algorithm pre-processes query mass spectra and determines the virus type and subtype using the classifiers and signature peptides (Figure 1). Naïve Bayes classifiers are generated for four common influenza antigens: hemagglutinin (HA), neuraminidase (NA), nucleoprotein (NP), and matrix protein 1 (M1). Subsequently, the FluTyper algorithm uses all classifiers, in combination, for the computation of the type and subtype probabilities and the identification of proteolytic signature peptides from each mass spectrum analyzed.

Pre-processing of high resolution mass spectra
Mass spectra of tryptic influenza peptides are pre-processed prior to typing and subtyping using the naïve Bayes classifier. First, a user-defined threshold is used to remove peaks that are considered to be noise (typically set at a signal-to-noise ratio of 2). Second, all isotope clusters are identified and the spectrum is deisotoped. The deisotoping method used is adapted from the THRASH algorithm [27]. The method involves iterating through each peak in the thresholded mass spectrum starting from the lowest m/z value. As the algorithm proceeds, each peak is compared to previous peaks to determine if it belongs to an existing isotopic cluster. If a peak belongs to an existing isotopic cluster, the peak is removed and its intensity is added to the existing monoisotopic peak. To evaluate the composition of isotopic clusters, the model amino acid averagine ($\mathrm{C_{4.9384}H_{7.7583}N_{1.3577}O_{1.4773}S_{0.0417}}$) [28] is used to define both the predicted distance between isotopic peaks and the intensity distribution of ions within an isotopic cluster. A major advantage of mass spectral data acquired by MALDI is that the tryptic peptide ions generated are almost exclusively singly charged (i.e. [M+H]+ ions). This eliminates the need to charge-deconvolute the mass spectrum.
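The thresholding and deisotoping steps can be illustrated with a minimal sketch. It is not the FluTyper implementation: it assumes singly charged ions, uses a fixed isotope spacing of about 1.00335 Da in place of the averagine model, does not check the isotopic intensity distribution, and applies a plain intensity cutoff rather than a signal-to-noise estimate. The function name `preprocess` and its parameters are hypothetical.

```python
# Approximate 13C - 12C mass difference in Da (spacing for singly charged ions).
ISOTOPE_SPACING = 1.00335


def preprocess(peaks, min_intensity, tol_da=0.005, max_isotopes=5):
    """Threshold and deisotope a peak list.

    peaks: iterable of (mz, intensity) tuples.
    Returns a list of (monoisotopic_mz, summed_intensity) tuples.
    """
    # Step 1: drop peaks below the intensity cutoff, then sort by m/z.
    kept = sorted(p for p in peaks if p[1] >= min_intensity)

    # Step 2: walk from low to high m/z; each cluster is
    # [monoisotopic_mz, summed_intensity, number_of_peaks_assigned].
    clusters = []
    for mz, intensity in kept:
        for cluster in clusters:
            expected = cluster[0] + cluster[2] * ISOTOPE_SPACING
            if cluster[2] < max_isotopes and abs(mz - expected) <= tol_da:
                # Peak is the next isotope of an existing cluster:
                # fold its intensity into the monoisotopic peak.
                cluster[1] += intensity
                cluster[2] += 1
                break
        else:
            # No cluster claims this peak, so it starts a new one.
            clusters.append([mz, intensity, 1])
    return [(c[0], c[1]) for c in clusters]
```

For example, `preprocess([(1000.5, 100), (1001.5034, 55), (1002.5067, 18), (1500.7, 40), (1200.0, 1)], min_intensity=5)` collapses the first three peaks into one monoisotopic peak at m/z 1000.5 and discards the low-intensity peak at m/z 1200.0.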

Naïve Bayes classifiers for the typing and subtyping of the influenza virus
Non-redundant HA, NA, NP and M1 sequence sets for human strains of influenza virus types A and B, and subtypes H1N1 (excluding pandemic (H1N1) 2009 sequences), pandemic (H1N1) 2009 (P2009), H3N2 and H5N1 were retrieved from the NCBI Influenza Virus Sequence Database [7]. Each set of sequences was then aligned using ClustalW [29] to enable the relative frequency of occurrence Po(M, T) of each unique theoretical monoisotopic tryptic peptide ion [M+H]+, M, for a given type or subtype, T, to be determined. Tryptic peptide fragments were generated allowing for up to 2 missed cleavages, with fixed carbamidomethyl cysteine and optional modifications of methionine, glutamic acid and cysteine residues in the form of oxidized methionine, pyroglutamate and acrylamide adducts with cysteine. A naïve Bayes classifier is a simple probabilistic classifier based on the application of Bayes' theorem. Using the classifier, the type or subtype of an influenza virus can be determined as follows:

$$p(T \mid M_1, \ldots, M_n) = \frac{p(T)\,\prod_{i=1}^{n} p(M_i \mid T)}{p(M_1, \ldots, M_n)} \qquad (1)$$

where $p(T \mid M_1, \ldots, M_n)$ is the probability of type or subtype T given the set of masses $M_1, \ldots, M_n$, and the terms on the right-hand side are derived from the protein sequence alignments. The independent probability for each mass to be present for a given type or subtype, $p(M_i \mid T)$, is given by its relative frequency of occurrence:

$$p(M_i \mid T) = P_o(M_i, T) \qquad (2)$$

The assumption is made that the presence of each peptide ion mass derived from a particular protein is independent of that of any other mass (i.e. that the presence of one tryptic peptide is independent of the presence of another). Where a particular mass $M_i$ is present in one type or subtype, but not another, Laplace's rule of succession is applied, where 1 is added to the number of observed events to avoid zero probabilities. This adjustment is also useful to account for noise peaks that may be present in mass spectral data. The prior probability, p(T), reflects the probability of occurrence of a given type or subtype, T, and is estimated based on the relative number of sequences in the NCBI database for T. However, this value may be adjusted as necessary to match the observed occurrence of different influenza types and subtypes in a particular season. Finally, the independent probability of observing peaks $M_1, \ldots, M_n$ is given by:

$$p(M_1, \ldots, M_n) = \sum_{T \in \{T_a, T_b, \ldots, T_x\}} p(T)\,\prod_{i=1}^{n} p(M_i \mid T) \qquad (3)$$

where $T_a$, $T_b$, $T_x$, etc. are all the possible types or subtypes being analyzed. A naïve Bayes classifier is built for each of the HA, NA, M1 and NP antigens used to type and subtype the virus.
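The classifier described above can be sketched in a few lines. This is an illustrative sketch, not the FluTyper code: the function name `posterior` and the input layout are hypothetical, and the smoothed estimate (count + 1) / (n + 2) is one common reading of Laplace's rule of succession (the text states only that 1 is added to the number of observed events). Log-probabilities are used to avoid numerical underflow when many masses are multiplied.

```python
import math


def posterior(observed, counts, n_sequences, priors):
    """Naive Bayes posterior over types/subtypes for a set of matched masses.

    observed:    list of matched peptide mass identifiers
    counts:      {T: {mass_id: number of sequences of T containing the mass}}
    n_sequences: {T: total number of sequences of T}
    priors:      {T: p(T)}
    """
    log_joint = {}
    for t, prior in priors.items():
        log_p = math.log(prior)
        for mass_id in observed:
            # Smoothed relative frequency (assumed form of Laplace's rule).
            p = (counts[t].get(mass_id, 0) + 1) / (n_sequences[t] + 2)
            log_p += math.log(p)
        log_joint[t] = log_p

    # Normalise over all candidate types/subtypes (the evidence term).
    highest = max(log_joint.values())
    unnormalised = {t: math.exp(v - highest) for t, v in log_joint.items()}
    total = sum(unnormalised.values())
    return {t: u / total for t, u in unnormalised.items()}
```

With hypothetical counts in which two matched masses occur in 90 and 80 of 100 H1N1 sequences but in only 0 and 5 of 100 H3N2 sequences, and equal priors, the posterior strongly favours H1N1.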
To assess the peak matching false discovery rate, decoy naïve Bayes classifier models are generated using randomly permutated sequences from the same set of influenza proteins.
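As an illustration of how theoretical tryptic masses and decoy sequences might be generated, the following sketch performs an in silico tryptic digest, computes monoisotopic [M+H]+ masses with fixed carbamidomethyl cysteine, and shuffles a sequence to produce a decoy. Several points are assumptions rather than details from the text: cleavage is suppressed before proline, the optional modifications are omitted, and the helper names are hypothetical.

```python
import random

# Monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276
CARBAMIDOMETHYL = 57.021464  # fixed modification on cysteine


def tryptic_peptides(sequence, max_missed=2):
    """Return the set of tryptic peptides with up to max_missed missed cleavages."""
    pieces, start = [], 0
    for i, residue in enumerate(sequence):
        following = sequence[i + 1] if i + 1 < len(sequence) else ""
        # Cleave C-terminal to K/R; the "not before P" rule is an assumption.
        if residue in "KR" and following != "P":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])
    peptides = set()
    for i in range(len(pieces)):
        last = min(i + max_missed, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.add("".join(pieces[i:j + 1]))
    return peptides


def mh_mass(peptide):
    """Monoisotopic [M+H]+ mass with carbamidomethylated cysteines."""
    mass = sum(MONO[residue] for residue in peptide) + WATER + PROTON
    return mass + CARBAMIDOMETHYL * peptide.count("C")


def decoy(sequence, seed=0):
    """Randomly permute a sequence, preserving its amino acid composition."""
    residues = list(sequence)
    random.Random(seed).shuffle(residues)
    return "".join(residues)
```

Digesting `decoy(sequence)` with the same functions gives a decoy mass list of the same composition, against which spurious matches can be counted.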

Uniqueness of peptide ion masses in naïve Bayes classifiers
Since the naïve Bayes classifier is trained on theoretical protein sequences from specific influenza proteins alone, it is necessary to validate that the tryptic peptide masses are unique to influenza. This is performed as described previously [21]. Briefly, each theoretical monoisotopic mass, M, from each type and subtype present in the naïve Bayes classifier, is compared against the theoretical monoisotopic tryptic ion masses [M+H]+ from a custom database containing all non-redundant influenza protein sequences, and those of possible contaminants, including human keratin, bovine/porcine trypsin and several chicken proteins that have been found to commonly contaminate egg-propagated virus preparations or are introduced during sample preparation. The included egg-derived chicken protein contaminants are based on our own observations and their identity was confirmed by MALDI tandem mass spectrometry (unpublished observations; spectra available upon request). Other unknown contaminants are always possible, but because high resolution mass spectrometry routinely achieves mass accuracies better than 1 ppm, the misassignment of contaminants will be largely avoided. Masses are generated for predicted tryptic peptide ions allowing for up to 2 missed cleavages and the same post-translational modifications as described in the previous section. The uniqueness, U M, is defined as the difference (in parts per million, ppm) between M and the closest theoretical mass of a tryptic peptide derived from a contaminant or from an influenza antigen with at least 10 entries in the custom database.
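The uniqueness calculation amounts to a nearest-neighbour search in a sorted list of background masses. The sketch below is one hedged way to express it; the function names are hypothetical, the background list is assumed to already exclude the peptide under test, and the restriction to antigens with at least 10 database entries is not modelled.

```python
import bisect


def ppm_difference(mass_a, mass_b):
    """Absolute difference between two masses in ppm, relative to mass_b."""
    return abs(mass_a - mass_b) / mass_b * 1e6


def uniqueness(mass, background):
    """Distance in ppm from `mass` to the closest mass in `background`.

    background: ascending sorted list of theoretical [M+H]+ masses from
    contaminants and other database entries.
    """
    if not background:
        return float("inf")
    i = bisect.bisect_left(background, mass)
    # Only the neighbours on either side of the insertion point can be closest.
    neighbours = background[max(i - 1, 0):i + 1]
    return min(ppm_difference(n, mass) for n in neighbours)
```

For a peptide at 1000.004 Da and background masses of 1000.000, 1000.010 and 1200.000 Da, the closest neighbour is 0.004 Da away, a uniqueness of about 4 ppm.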
Peak matching, signature peptide identification and computation of type and subtype probabilities using naïve Bayes classifiers
In a mass spectrum, typically only a portion of the theoretical tryptic peptides is observed experimentally. This may be due to a range of factors, from incomplete proteolytic cleavage to the presence of unanticipated post-translational modifications. It is therefore necessary to first define the set of theoretical tryptic peptide masses that are actually observed within a specified mass error tolerance. The list of theoretical masses used for matching is determined by the specified protein (HA, NA, NP, M1 or all).
Where the mass of an observed peak is within the mass error tolerance of two or more theoretical masses, the closest theoretical mass is selected. For a matching peak to be selected for further analysis, the mass must be sufficiently unique, as defined by:

$$\Delta M < 0.5\,U_M$$

where ΔM is the mass error (in ppm) between the observed mass and theoretical tryptic peptide mass, and U M is the uniqueness as described in the previous section. A scaling of U M by a factor of 0.5 is necessary to ensure that there cannot be another tryptic contaminant peptide mass present that is closer to the observed mass than the theoretical mass. The concept of using signature peptides to type and subtype the influenza virus has been previously described [21]. A signature peptide is defined as a theoretical tryptic peptide that is exclusively present in one type or subtype, but not in any of the others. In the FluTyper algorithm, a signature peptide is defined as any theoretical tryptic peptide, M, where Po(M, T) > 0.7 for one type or subtype and Po(M, T) = 0 for all other types or subtypes for a given influenza protein. Since few signature peptides may be indicative of a particular subtype of the virus, indicator peptides are also used by the algorithm. An indicator peptide is defined similarly to a signature peptide, with the exception that it may occur in the sequence of antigens from other viral subtypes with Po(M, T) < 0.1. For the computation of type and subtype probabilities, the naïve Bayes classifier (1) is applied using the set of matching peaks. For typing, this provides a probability that a set of masses is from influenza A (p(FluA|
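The matching rule (closest theoretical mass, within tolerance, and with a mass error below half the uniqueness U M) and the signature/indicator definitions can be sketched as follows. The function names, the input layout and the default cutoffs as parameters are hypothetical; this is a sketch of the stated rules, not the FluTyper source.

```python
def match_peaks(observed, theoretical, tol_ppm=3.0):
    """Match observed masses to theoretical masses.

    theoretical: list of (mass, uniqueness_ppm) tuples.
    Returns a list of (observed_mass, theoretical_mass, error_ppm).
    """
    matches = []
    for obs in observed:
        # Select the closest theoretical mass.
        best = min(theoretical, key=lambda t: abs(obs - t[0]), default=None)
        if best is None:
            continue
        mass, uniq = best
        error = abs(obs - mass) / mass * 1e6
        # Accept only if within tolerance and sufficiently unique.
        if error <= tol_ppm and error < 0.5 * uniq:
            matches.append((obs, mass, error))
    return matches


def label_peptide(po, signature_cutoff=0.7, indicator_cutoff=0.1):
    """Label a peptide from its frequencies of occurrence.

    po: {T: Po(M, T)} for one peptide mass M.
    Returns ("signature", T), ("indicator", T) or (None, None).
    """
    top = max(po, key=po.get)
    if po[top] <= signature_cutoff:
        return (None, None)
    others = [value for t, value in po.items() if t != top]
    if all(value == 0 for value in others):
        return ("signature", top)
    if all(value < indicator_cutoff for value in others):
        return ("indicator", top)
    return (None, None)
```

A peak with a 2.7 ppm error against a theoretical mass whose uniqueness is only 4 ppm is rejected even though it lies inside a 3 ppm tolerance, because 2.7 is not below 0.5 × 4.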

Implementation
Since it is only necessary to generate a naïve Bayes classifier when new sequences have been added to the custom database, the implementation of the FluTyper algorithm is divided into two applications: the naïve Bayes classifier and signature peptide generator, and the mass spectrum analysis program (Figure 1). The classifier and signature peptide generator accepts ClustalW-aligned sequences as input, computes the frequency of occurrence of theoretical tryptic peptides and determines the uniqueness of their masses. The output is a table containing all data necessary for naïve Bayes classification and signature peptide determination. The second component of FluTyper accepts a mass spectrum in ASCII format and the classification tables as input. FluTyper outputs the type and subtype prediction based on signature peptides and naïve Bayes probabilities. The number of matches to peptides from decoy sequences is also shown to provide an estimate of the false positive peak matching rate. A summary of all peaks identified can also be downloaded in tab-delimited format. FluTyper is implemented using GNU C++. A web interface has been developed for the second component of FluTyper and can be accessed at http://www.cancerresearch.unsw.edu.au/CRCWeb.nsf/page/flutyper (see Figure S1 for a screenshot of the interface and Table S1 for a description of the parameters).

Theoretical evaluation of naïve Bayes classifier
The performance of the naïve Bayes classifiers was evaluated as a function of protein coverage. For each protein (i.e. HA, NA, NP or M1), 500 random subsets of theoretical tryptic peptides representing 0-100% coverage of the protein were generated for each protein sequence used to train the classifier. Each set of theoretical tryptic peptide masses represents a simulated mass spectrum.
Leave-one-out cross-validation was performed, meaning that a new classifier was built each time, leaving out the protein sequence being tested. For the purpose of this evaluation, a subset of masses was considered to be typed or subtyped if p(T|M 1 ...M n ) > 0.7 for any T. Figures 2A and 2B show the percentage of simulated mass spectra conclusively classified as a function of protein coverage for typing and subtyping, respectively. For typing, a classification rate of over 90% was achieved with greater than 25% protein coverage in all cases. For subtyping, a classification rate of over 90% was achieved with greater than 30% protein coverage for HA, NA and NP. However, M1 was less reliable, with a classification rate limited to around 80% at protein coverage greater than 40%. The low classification rate for M1 is due to a combination of factors. First, the M1 protein has around 50% fewer amino acids than NP, NA and HA and therefore also has fewer tryptic peptide masses that can be used by the naïve Bayes classifier. Second, the M1 protein is more conserved between different influenza subtypes than NP, NA and HA; thus the classifier may not be able to distinguish the subtype even with full protein coverage.
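The simulation described above can be sketched as follows. This is a simplified illustration under several assumptions: coverage is approximated as the fraction of peptides sampled (not the fraction of residues covered), leave-one-out retraining is omitted, a small probability floor stands in for Laplace smoothing, and all names are hypothetical.

```python
import math
import random


def posterior(masses, po, priors, floor=1e-3):
    """Naive Bayes posterior from frequencies of occurrence po[T][mass]."""
    log_joint = {
        t: math.log(priors[t])
        + sum(math.log(max(po[t].get(m, 0.0), floor)) for m in masses)
        for t in priors
    }
    highest = max(log_joint.values())
    weights = {t: math.exp(v - highest) for t, v in log_joint.items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


def classification_rate(peptides, true_type, po, priors, coverage,
                        trials=500, cutoff=0.7, seed=0):
    """Fraction of simulated spectra correctly classified above the cutoff.

    peptides: list of peptide mass identifiers for one protein sequence.
    coverage: fraction (0-1) of peptides drawn for each simulated spectrum.
    """
    rng = random.Random(seed)
    n_drawn = round(coverage * len(peptides))
    hits = 0
    for _ in range(trials):
        simulated_spectrum = rng.sample(peptides, n_drawn)
        post = posterior(simulated_spectrum, po, priors)
        best = max(post, key=post.get)
        if post[best] > cutoff and best == true_type:
            hits += 1
    return hits / trials
```

Calling `classification_rate` over a grid of coverage values gives a curve of the kind plotted against protein coverage; counting cases where the best type exceeds the cutoff but differs from `true_type` would give the corresponding false positive rate.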
In the case of typing (Figure 2C), the false positive rate (FPR) is less than 1% in all cases and 0% at protein coverage greater than 25%. For subtyping (Figure 2D), the FPR was less than 1% at protein coverage of 20% or greater for HA and less than 5% with increased sequence coverage for NA. HA performed more favorably than NA since the neuraminidases of H1N1 and H5N1 are similar, while the hemagglutinin antigens of H1N1, H3N2 and H5N1 are all significantly different. On the other hand, the NA classifier was able to distinguish HxN1 and H3N2 subtypes with 0% FPR (data not shown). For NP, the FPR is 10% at low protein coverage and decreases to 5% with increased coverage. For M1, the FPR is just under 10%, independent of the protein coverage. The high apparent FPR for NP and M1 for subtyping can be expected since the subtype of a virus is characterized by the isoform of its HA and NA proteins. For instance, reassortment of a virus can lead to the introduction of an NP protein from one subtype into another (e.g. H1N1 to H3N2) without changing the subtype of the actual virus. For example, the translated NP protein sequence derived from the NCBI entry gi148466309 is designated as an H3N2 subtype, but the actual sequence is in fact more similar to other H1N1 NP sequences. The theoretical testing results demonstrate that the use of naïve Bayes classifiers is appropriate at the protein coverage levels expected from experimental mass spectra, where 20-30% or greater protein coverage is common. Crucially, the false positive rate is less than 1% for typing and is still below 10% for subtyping using M1 and NP proteins. It is evident from testing that for confident assignment of the virus subtype, the use of HA or NA tryptic peptides would be most desirable.

Testing with experimental influenza mass spectra
To demonstrate FluTyper using experimental data, mass spectra were acquired from tryptic digests prepared from whole virus preparations and gel-separated influenza antigens. Mass spectra were generated for common human influenza virus strains including influenza type B strain B/Victoria/504/2000, type A (H1N1) strain A/Solomon Islands/03/06 and type A (H3N2) strain A/Brisbane/10/2007 (Additional file 1). The type and subtype of these three strains are in common with those viruses that are in circulation in humans today. All samples were analyzed using default FluTyper settings: a relative peak intensity cutoff of 0.001%, a peak matching tolerance of 3 ppm, a frequency of occurrence (Po) cutoff of 0.6, one missed cleavage and optional modification of methionine oxidation.
The high resolution mass spectrum of a whole virus digest of influenza type B strain B/Victoria/504/2000 is shown in Figure 3A. The 15 signature peptides for influenza type B identified enable the virus type to be confidently assigned (  Figure 3B). In total, there are 18 peaks with Po of > 0.6 and the peaks are matched within     Figure 3C). In total, 11 peptides are identified by FluTyper (Table 3). While 5 type A influenza signature peptides are identified, no subtype indicator or signature peptides were found. In this case, the naïve Bayes classifier provides the only means for subtype determination. Using the Po values shown in Table 3, the classifier generates probabilities of 0.9998, 0.0002, 0 and 0 for H1N1, H3N2, H5N1 and P2009, respectively, indicating that the peptides identified are almost certain to have come from the H1N1 subtype.
To validate the naïve Bayes classification, the protein sequence coverage is shown in Table 4. In the case of the whole virus digests, sequence coverage ranged from 10.5% to 42% for the type A (H3N2) virus and from 10.3% to 27.9% for the type B virus. The combined FPR, estimated from Figures 2B and 2D as the product of the individual antigen FPRs, is < 0.1% for both type A (H3N2) and type B. For type A (H1N1), as expected, only nucleoprotein was identified for the in-gel digestion of this antigen, with a sequence coverage of 24.8%. Based on theoretical testing from Figure 2D, there is an approximately 8% chance that the spectrum could be misidentified. As discussed earlier, the high false positive rate is due to the fact that the subtype of an influenza virus is defined based on hemagglutinin and neuraminidase; hence the possibility of reassortment cannot be excluded. Nevertheless, the nano-scale preparation and mass spectrometry analysis of whole virus digests described here provides highly reliable subtyping results for influenza using FluTyper.
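The combined false positive rate described above is a product of per-antigen rates, which treats the antigens as independent. A minimal sketch follows; the function name and the example values are hypothetical and are not taken from Figure 2 or Table 4.

```python
import math


def combined_fpr(antigen_fprs):
    """Combined false positive rate as the product of per-antigen rates.

    antigen_fprs: iterable of per-antigen FPRs expressed as fractions (0-1).
    Assumes misclassification events are independent across antigens.
    """
    return math.prod(antigen_fprs)


# Hypothetical per-antigen FPRs for HA, NA and NP at some coverage level.
example = combined_fpr([0.01, 0.05, 0.08])
print(f"combined FPR = {example:.6f}")
```

Because each factor is below 1, every additional antigen detected in a whole virus digest lowers the combined estimate, which is why a single-antigen in-gel digest carries a higher chance of misassignment.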

Conclusions
The FluTyper algorithm has been developed for automated typing and subtyping of influenza virus using high resolution mass spectral data. FluTyper incorporates the use of influenza antigen signature peptides previously identified in this laboratory. Furthermore, to increase the confidence of subtyping, naïve Bayes classifiers have been developed for four common influenza antigens: hemagglutinin, neuraminidase, nucleoprotein, and matrix protein 1. Theoretical testing of the classifiers demonstrates their applicability at protein coverage rates expected in mass mapping experiments. Using laboratory-grown virus samples analyzed by high resolution mass spectrometry, it is shown that FluTyper can rapidly and reliably type and subtype strains of the influenza virus that are in common circulation in humans. Through the use of other signature peptides and classifiers, it is anticipated that the FluTyper algorithm could be applied to the typing/classification of other viruses and bacteria.

; 20 min at room temperature in the dark) was followed by tryptic digestion as previously described [21]. Cleaved peptides were extracted by repeated sonication in 60% acetonitrile containing 0.1% trifluoroacetic acid. Extracted peptides were dried completely in a vacuum concentrator and dissolved in 25 mM NH4HCO3.

MALDI FT-ICR mass spectrometry
MALDI FT-ICR mass spectra were recorded on a 7T Bruker APEX-Qe instrument (Bruker Daltonics, Billerica, MA, USA) in the positive ion mode as previously described [21][22][23][24]. Briefly, mass spectra were acquired for 1 M data points using broadband excitation. Mass spectra were calibrated externally using a mixture of peptides comprising Angiotensin I, adrenocorticotropic hormone (ACTH) fragments containing residues 1-17, 7-38 and 18-39, and a synthetic hemagglutinin antigen-derived peptide. Mass spectra were processed using the Data Analysis v3.4 software (Bruker Daltonics, Billerica, MA, USA) and recalibrated internally using identified peptide ions in each spectrum derived from the viral proteins. Mass lists were exported as tab-delimited files. Mass accuracies of between 0.1 and 1 ppm are routinely achieved for all ions detected, with mass resolutions (FWHM) exceeding 100,000.