Mass spectrometry-based protein identification by integrating de novo sequencing with database searching
© Wang and Wilson; licensee BioMed Central Ltd. 2013
Published: 21 January 2013
Mass spectrometry-based protein identification is a very challenging task. The main identification approaches include de novo sequencing and database searching. Both approaches have shortcomings, so an integrative approach has been developed. The integrative approach first infers partial peptide sequences, known as tags, directly from tandem spectra through de novo sequencing, and then uses these sequences in a database search to see if a close peptide match can be found. However, the current implementation of this integrative approach has several limitations. Firstly, simplistic de novo sequencing is applied and only very short sequence tags are used. Secondly, most integrative methods apply an algorithm similar to BLAST to search for exact sequence matches and do not accommodate sequence errors well. Thirdly, in these methods the integrated de novo sequencing makes a limited contribution to the scoring model, which is still largely based on database searching.
We have developed a new integrative protein identification method which can integrate de novo sequencing more efficiently into database searching. Evaluated on large real datasets, our method outperforms popular identification methods.
Mass spectrometry (MS) is a commonly used, high-throughput tool for studying proteins. The procedure of MS-based protein identification involves digesting proteins into peptides, which are then separated, ionised, fragmented, and captured by mass spectrometers. Proteins are finally identified from the peaks of the captured mass spectra using computational methods, where each peak theoretically represents a peptide fragment ion. However, accurate identification of proteins from tandem mass spectra is a very challenging task, and existing methods can typically identify fewer than 50% of the proteins in a complex sample [1–3]. Therefore, there is a critical need for new protein identification methods that can improve identification accuracy and reliability.
Existing protein identification methods can be categorised into two approaches: the database search approach and the de novo sequencing approach. The database search approach is the more widely used of the two. It identifies proteins by generating theoretical spectra in silico from a given protein sequence database and comparing experimental spectra with the theoretical ones to find the closest matches. A number of methods have been developed, for example SEQUEST, which applies a cross-correlation scoring model, X!Tandem, which uses a hyper-geometric scoring model, OMSSA, which applies a Poisson scoring model, and MASCOT, which employs a probability-based scoring model. Despite having the advantage of robustness, the database search approach has several limitations. It is only effective if the proteins of interest are already known and the utilised database contains the correct protein sequences. Unfortunately, this condition is difficult to meet since many studies involve unknown proteins and protein modifications [8, 9]. Therefore, only a portion of the identifications reported by database search methods is correct. In addition, specifying the enzyme used in the proteolytic digestion can also exclude the correct peptides from the database search space and lead to erroneous identifications.
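The in silico step of a database search can be illustrated with a minimal sketch. The code below computes singly charged b- and y-ion m/z values for a candidate peptide and counts how many of them are matched by observed peaks. The function names, the 0.5 Da tolerance, and the shared-peak count are assumptions made for illustration; they are not the scoring models of SEQUEST, X!Tandem, OMSSA, or MASCOT.

```python
# Monoisotopic residue masses in Da for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O
PROTON = 1.00728   # mass of a proton


def theoretical_ions(peptide):
    """Return (b_ions, y_ions): singly charged m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # prefixes of length 1 .. n-1
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # suffixes of length 1 .. n-1
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


def shared_peak_count(observed_mz, peptide, tol=0.5):
    """Count theoretical b/y ions that lie within `tol` Da of an observed peak.

    A deliberately simple stand-in for a real scoring model (assumption).
    """
    b_ions, y_ions = theoretical_ions(peptide)
    return sum(
        any(abs(obs - theo) <= tol for obs in observed_mz)
        for theo in b_ions + y_ions
    )


# Usage: rank two hypothetical database peptides against one spectrum.
observed = [58.03, 147.11, 300.0]
ranked = sorted(["GAR", "GAK"], key=lambda p: shared_peak_count(observed, p),
                reverse=True)
```

In this toy case the peptide GAK explains more peaks than GAR and is ranked first. A real search engine would also consider charge states, neutral losses, and peak intensities.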
The de novo sequencing approach identifies proteins by extracting protein sequence information directly from the spectrum peaks derived from peptide fragment ions without recourse to any protein database. Existing de novo sequencing methods can be classified into two categories. In the first category, which includes Sherenga and Lutefisk, the problem is projected into graph theory and algorithms used for finding the maximum path in a network topology are applied to achieve identification. In the second category, exemplified by PepNovo, probability models for inferring protein sequences from the spectrum peaks are applied. In both categories the main idea remains the same: to find the longest possible peptide sequence that best matches the experimental spectrum. The de novo sequencing approach is the only feasible means for finding novel proteins, detecting amino acid mutations, and so on. However, de novo sequencing is difficult because tandem mass spectra are inherently deficient. Even if the optimal path can be obtained, it may not always yield the correct peptide sequence because peptide fragment ions are usually under-represented and many intense peaks in the spectra may derive from various interferences.
An "intermediate" approach has been proposed to integrate the aforementioned two approaches: short peptide sequence fragments or "tags" are inferred directly from the spectrum and a database search is performed to find complete peptide sequences that match the sequence fragments. Thus, the identification process is able to incorporate information from the two heterogeneous approaches. This integrative approach has great potential and several methods have been developed, including GutenTag, Inspect, and MultiTag. These methods perform favourably compared to existing database search and de novo sequencing methods. However, the current implementation of this integrative approach has several limitations. Firstly, the utilised de novo sequencing mechanisms are rather simplistic. Therefore, the inferred sequence tags are short, and usually consist of three amino acid residues. Such short tags offer only limited information and may not significantly improve the accuracy. When the sample is complex, the errors in these tags may increase and lead to incorrect identifications. Secondly, most methods try to find exact sequence matches of the tags in the database. This undermines the identification of new proteins and protein modifications. Even though methods like MultiTag take a step forward by tolerating a couple of mismatches, only marginal improvement can be obtained. Thirdly, existing sequence tag methods still apply database search-centred scoring models to which de novo sequencing makes little contribution. With the introduction of high precision ion trap instruments, this leaves many signal-rich spectra seriously under-utilised.
Therefore, we have developed a new integrative method, NovoDB, for protein identification. The method extends the integrative approach introduced by the sequence tag methods and has several advantages. Firstly, it incorporates a sophisticated de novo sequencing algorithm and infers the peptide sequences in a data-driven manner. Much longer sequence tags can be inferred accurately. Secondly, it does not rely on finding exact sequence tag matches in the database but employs a dynamic programming approach to better tolerate sequencing errors. Thirdly, our method employs a simple scoring model that gives more weight to the de novo sequencing. Evaluated on large datasets, generally our method is able to identify more proteins at the same false discovery rate (FDR) when compared to 3 popular methods, including database search-based X!Tandem, de novo sequencing-based PepNovo, and sequence tag-based GutenTag.
The first stage is to preprocess the spectra and normalise the peak intensities. Our method uses two versions of the peak intensities: the continuous intensities and the discrete intensities. For each spectrum, the method first determines the baseline intensity and divides each peak's intensity by the baseline so that a normalised intensity is obtained. The continuous intensities are used for the ion matching and the final score calculation, while the discrete intensities are used for the de novo sequencing-based tag inference. The normalised peak intensities are discretised into four levels: no signal, low signal, medium signal, and strong signal. The method removes low signal peaks with a sliding window mechanism, discarding all but the top several peaks within each window. Because different regions of the spectrum have different characteristics, our method organises peaks into five regions based on the mass-to-charge ratio and utilises this information in the sequence tag inference.
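The preprocessing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not say how the baseline is determined, where the discretisation cut-offs lie, how wide the sliding window is, or how the five regions are delimited. The median baseline, the cut-offs (1, 3, 10), the 100 m/z window, the number of peaks kept, and the equal-width regions are all hypothetical choices, as are the function names.

```python
from statistics import median


def normalise(peaks, baseline=None):
    """Divide each intensity by a baseline (median intensity, an assumption).

    `peaks` is a list of (mz, intensity) pairs.
    """
    if baseline is None:
        baseline = median(intensity for _, intensity in peaks)
    return [(mz, intensity / baseline) for mz, intensity in peaks]


def discretise(norm_intensity, cutoffs=(1.0, 3.0, 10.0)):
    """Map a normalised intensity to a level 0..3.

    0 = no signal, 1 = low, 2 = medium, 3 = strong; cut-offs are hypothetical.
    """
    return sum(norm_intensity >= c for c in cutoffs)


def window_filter(peaks, width=100.0, keep=3):
    """Keep a peak only if it is among the `keep` most intense peaks
    within a window of `width` m/z units centred on it."""
    kept = []
    for mz, intensity in peaks:
        neighbours = [i for m, i in peaks if abs(m - mz) <= width / 2]
        rank = sum(i > intensity for i in neighbours)
        if rank < keep:
            kept.append((mz, intensity))
    return kept


def region_of(mz, precursor_mass, n_regions=5):
    """Assign a peak to one of `n_regions` equal-width m/z regions (assumed)."""
    return min(n_regions - 1, int(n_regions * mz / precursor_mass))


# Usage on a toy spectrum.
raw = [(100, 10), (110, 20), (120, 5), (130, 40), (400, 2)]
norm = normalise(raw)
levels = [discretise(i) for _, i in norm]
filtered = window_filter(norm, width=100.0, keep=2)
```

Keeping both `norm` (continuous) and `levels` (discrete) mirrors the two intensity versions described in the text: the former would feed ion matching and final scoring, the latter the tag inference.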
The second stage is to infer a number of peptide sequence tags directly from the spectrum. Instead of inferring short sequence tags, which usually leads to misidentifications, NovoDB applies a more sophisticated algorithm to dynamically infer longer peptide sequences in a data-driven fashion. This is achieved by incorporating a hybrid de novo sequencing approach which integrates a Bayesian Network probability model with a dynamic programming algorithm to infer the most probable tags. The sequence tag inference stage consists of three major steps.
Given a preprocessed spectrum S, NovoDB builds the spectrum graph and adds an edge between two vertices if their mass difference approximates the residue mass of an amino acid or another mass offset of a residue derived from ion degradation. Since the most intense peaks tend to be b- and y-ions, the spectrum graph has vertices for both interpretations of each peak. A vertex for an empty peptide and a vertex for the intact peptide are also added. Our method extends the Bayesian Network model used by PepNovo to calculate the probability of observing each vertex of the constructed spectrum graph. The details can be found in . Each vertex of the network contains a conditional probability table given the values of its parent vertices. The probability tables are trained by using the large-scale Seattle dataset.
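A spectrum graph of the kind described above can be sketched in a few lines. The code below is an illustrative simplification, not NovoDB's implementation: it assumes singly charged peaks, considers only plain residue masses (no degradation offsets), does not merge near-identical vertices, and leaves out the Bayesian Network vertex scoring entirely. The function name, the 0.02 Da tolerance, and the representation of vertices as prefix masses are assumptions.

```python
# Monoisotopic residue masses in Da for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O
PROTON = 1.00728   # mass of a proton


def build_spectrum_graph(peak_mz, total_residue_mass, tol=0.02):
    """Return (nodes, edges) for a simplified spectrum graph.

    Each node is a candidate prefix mass; each edge (i, j, aa) says that
    nodes[j] - nodes[i] matches the residue mass of `aa` within `tol` Da.
    """
    # Vertices for the empty peptide and the intact peptide.
    vertices = {0.0, total_residue_mass}
    for mz in peak_mz:
        # Interpretation 1: the peak is a b-ion, so it encodes a prefix mass.
        vertices.add(mz - PROTON)
        # Interpretation 2: the peak is a y-ion, so it encodes a suffix mass.
        vertices.add(total_residue_mass - (mz - WATER - PROTON))
    nodes = sorted(v for v in vertices if -tol <= v <= total_residue_mass + tol)

    edges = []
    for i, low in enumerate(nodes):
        for j in range(i + 1, len(nodes)):
            diff = nodes[j] - low
            for aa, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tol:
                    edges.append((i, j, aa))
    return nodes, edges


# Usage: two peaks from the hypothetical peptide GAK (b1 and y1).
total = sum(RESIDUE_MASS[aa] for aa in "GAK")
nodes, edges = build_spectrum_graph([58.02874, 147.1128], total)
```

Even this toy graph shows the ambiguities the text refers to: the prefix mass of GA is almost identical to the residue mass of Q, so an edge labelled Q appears alongside the G and A edges, and the wrong interpretation of each peak contributes spurious vertices.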
One advantage of the model is that P_real can distinguish the likely combinations of ions and ion degradations from the unlikely combinations.
NovoDB finds several top ranking asymmetric paths as the most probable peptide sequences. The method employs the dynamic programming algorithm proposed in  to obtain a set of highly scored peptide sequences by exploring the sub-optimal space of the spectrum graph. There are two reasons for considering sub-optimal paths rather than the optimal path alone. Firstly, a number of vertices on the optimal path may be false positives because it is common that many intense peaks derive from interferences. Secondly, vertices representing the real fragment ions may not always have the highest score and thus will not be included in the optimal path. It is normal for real fragment ions to have low intensities or to go undetected altogether. The highly similar segments of the sequences correspond to the fragment ions that are likely to be correctly identified, while the ambiguous segments are where the ions are hardly distinguishable from baseline noise. Given these characteristics, the most likely peptide sequence tags are extracted by adapting a dynamic programming-based algorithm similar to ClustalW. In this case, the introduced "gaps" between the sub-optimal peptide sequences correspond to the ambiguous sections of the tandem mass spectrum. Thus, the method is able to dynamically generate sequence tags longer than three amino acid residues.
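The idea of reading tags off the segments where sub-optimal sequences agree can be sketched as follows. The sketch assumes the candidate sequences have already been aligned to equal length, with "-" marking a gap; the alignment step itself (the ClustalW-like algorithm) is not reproduced. The function name and the minimum tag length are hypothetical.

```python
def consensus_tags(aligned, min_len=3):
    """Extract maximal runs of columns on which all aligned candidates agree.

    `aligned` is a list of equal-length strings, with '-' marking a gap.
    Runs shorter than `min_len` residues are discarded.
    """
    tags, current = [], []
    for column in zip(*aligned):
        if len(set(column)) == 1 and column[0] != "-":
            current.append(column[0])      # unanimous, non-gap column
        else:
            if len(current) >= min_len:
                tags.append("".join(current))
            current = []                   # ambiguous column ends the run
    if len(current) >= min_len:
        tags.append("".join(current))
    return tags


# Usage: three hypothetical sub-optimal sequences for one spectrum.
candidates = ["PEPTIDEK", "PEPSIDEK", "PEPT-DEK"]
tags = consensus_tags(candidates)
```

Here the candidates disagree in the fourth and fifth columns, so two tags, PEP and DEK, flank the ambiguous section. With cleaner spectra the agreeing runs grow longer, which is how tag length can adapt to each spectrum.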
After sequence tags are obtained, the next stage is to query a database to see if matches can be found. This is important for three reasons. Firstly, the information provided by the database can fill the gaps that de novo sequencing leaves. Secondly, the sequences directly inferred from the spectrum may not be sufficient to uniquely identify a protein. Thirdly, even if the sequences of a novel protein are not present in the database, homologous proteins may have been discovered, and they provide crucial information for validating and understanding the novel protein.
Our method applies a sequence similarity search based on a tailored WU-BLAST algorithm. The algorithm produces error-tolerant scores and does not require long and identical sequences to produce a confident protein hit. The sequence tag query algorithm identifies all high scoring pairs of regions having high local sequence similarities, namely between an individual peptide's sequences in the query and a protein's sequences from the database. We have introduced several modifications to the BLOSUM62 matrix to suit the sequence query in the context of mass spectrometry. Scores for the two pairs of isobaric amino acid residues (glutamine and lysine, leucine and isoleucine) are replaced by their average values. The specificity of trypsin is considered by reserving the K symbol for the C-terminal lysine and by introducing a new value averaged between arginine and lysine to represent a cleavage site preceding the peptide sequence. Undefined amino acid residues are introduced with zero scores in order to increase the similarity score if peptide sequence tags are incomplete and contain errors. NovoDB ranks the reported peptide hits by similarity score S_s and constrains the total number of query hits.
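Two of the matrix modifications described above, averaging over isobaric pairs and zero-scoring undefined residues, can be illustrated with a small sketch. The exact averaging rule is not spelled out in the text, so the rule used here (average the base scores over all combinations within the isobaric classes) is one possible reading and should be treated as an assumption. Only a four-residue excerpt of BLOSUM62 is included to keep the block self-contained, the trypsin-specific symbols are omitted, and the ungapped scan stands in for the WU-BLAST search; all function names are hypothetical.

```python
from itertools import product

# Excerpt of BLOSUM62 for four residues; the matrix is symmetric,
# so each pair is stored once.
BASE = {
    ("I", "I"): 4, ("L", "L"): 4, ("I", "L"): 2,
    ("K", "K"): 5, ("Q", "Q"): 5, ("K", "Q"): 1,
    ("I", "K"): -3, ("I", "Q"): -3, ("L", "K"): -2, ("L", "Q"): -2,
}
# Residues that are (near-)isobaric and hard to tell apart in a spectrum.
ISOBARIC = {"I": "IL", "L": "IL", "K": "KQ", "Q": "KQ"}


def base_score(a, b):
    """Look up a score in the symmetric excerpt."""
    return BASE[(a, b)] if (a, b) in BASE else BASE[(b, a)]


def ms_score(a, b):
    """Substitution score adapted for mass spectrometry (assumed rule).

    'X' is an undefined residue and scores zero against everything;
    members of an isobaric class share the class-averaged score.
    """
    if a == "X" or b == "X":
        return 0.0
    pairs = list(product(ISOBARIC.get(a, a), ISOBARIC.get(b, b)))
    return sum(base_score(x, y) for x, y in pairs) / len(pairs)


def best_ungapped_score(tag, protein):
    """Slide the tag along the protein and return the best ungapped score.

    Assumes len(protein) >= len(tag).
    """
    return max(
        sum(ms_score(t, p) for t, p in zip(tag, protein[off:off + len(tag)]))
        for off in range(len(protein) - len(tag) + 1)
    )


# Usage: a tag with one undefined residue against a toy protein sequence.
score = best_ungapped_score("LXK", "QQIIQ")
```

Under this rule a tag residue L matches a database I as well as it matches L, and an undefined residue neither rewards nor penalises the alignment, so tags with isobaric confusions or small gaps can still produce a confident hit.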
where N_b and N_y represent the number of assigned b- and y-ions respectively. The ion matching score as given assumes an underlying hyper-geometric distribution for a valid match. This model has been shown to be very effective. The ion matching score is calculated for every candidate protein returned by the database query.
S_c, S_s and the inferred sequence tags are also reported in the final output.
To evaluate the performance of our method, we use the raw spectra from two large-scale datasets as a benchmark: (1) the Aurum dataset and (2) the CPTAC dataset from Clinical Proteomic Technologies Assessment for Cancer. The Aurum dataset is generated from a mixture of 246 known human proteins. The CPTAC dataset comes from a large-scale study of the reproducibility and repeatability of the Universal Proteomics Standard Set 1 (UPS1).
We compare NovoDB with three other widely used algorithms: (1) the de novo sequencing method PepNovo, (2) the database search method X!Tandem, and (3) the sequence tag method GutenTag. PepNovo is one of the most widely used de novo sequencing methods. X!Tandem has been shown to outperform the commercial SEQUEST and MASCOT database search engines on some data. GutenTag has been used as a benchmark for evaluating sequence tag-based methods.
Sequences obtained by de novo sequencing are valuable and can significantly increase identification coverage when effectively integrated with database searching. With the current rapid development of new instruments, this becomes crucial because the identification coverage of database search methods cannot be significantly improved by increasing the resolution of the spectra. This serious bottleneck may be due to the reliance on databases, which are seldom complete. On the other hand, the performance of the de novo sequencing approach improves as the resolution of the spectra increases. In recent years, proteomics research has shifted from a macro qualitative analysis to a micro perspective that includes the study of glycosylation and phosphorylation. Unfortunately, database search methods may lead to misidentifications for these applications [3, 12]. De novo sequencing remains the only feasible approach in this situation. Therefore, it is essential to integrate de novo sequencing into database searching.
The evaluation results indicate that one should be careful in choosing the length of peptide sequence tags. Feeding longer tags will facilitate database searching and potentially increase the identification coverage. However, this may lead to more sequence errors. Existing tag-based methods choose to use very short sequence tags, e.g. three residues long. Such an approach may perform well when the spectrum has a high signal-to-noise ratio or the protein composition is simple. This explains why GutenTag performs much better on the CPTAC dataset. However, when the spectra are complicated, it becomes difficult for this approach to succeed. Therefore, it is critical to dynamically choose a proper sequence tag length based on each individual spectrum. Results demonstrate that this may be achieved by exploring the top ranking sub-optimal solutions of the spectrum graph. Based on our evaluation, sequence tags of 6 to 7 residues seem to yield the best results. This should be studied further.
How to effectively integrate de novo sequencing and database searching into a single scoring model is an open question. In existing integrative methods, the incorporated de novo sequencing algorithm is normally simplistic and cannot contribute directly to the score calculation. Given poor quality spectra, this design is quite reliable. However, such a design cannot efficiently utilise the high precision and high resolution provided by the new instruments and may lead to sub-optimal results. It is therefore important to incorporate a sophisticated de novo sequencing algorithm and a more global scoring model that can give de novo sequencing more weight. Based on our evaluation, when the de novo sequencing component can directly contribute to the score calculation, a simple scoring model, as presented by our NovoDB approach, may work well. In theory, it is very desirable to incorporate more advanced scoring models that can more effectively integrate the de novo sequencing component with the database search component. A more advanced scoring model may further improve the identification accuracy. However, there is always a trade-off between the complexity of the scoring model and the computational cost. Therefore, one has to keep a good balance between the two when designing new scoring models. The design of more advanced scoring models is a very interesting direction for future research.
Protein identification plays a key role in mass spectrometry-based protein research. Existing protein identification methods have limitations which usually lead to low identification coverage. We have developed a new integrative protein identification method which can integrate de novo sequencing more efficiently into database searching. Evaluated on large real datasets, our method outperforms popular identification methods. This performance demonstrates that, in order to significantly improve identification coverage and accuracy, it may be necessary to effectively integrate heterogeneous approaches into protein identification.
The publication of this article was funded by the Australian National Health and Medical Research Council (NHMRC) grant 525453.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 2, 2013: Selected articles from the Eleventh Asia Pacific Bioinformatics Conference (APBC 2013): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S2.
This project was funded by the Australian National Health and Medical Research Council (NHMRC) grant 525453.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.