SAMPI: Protein Identification with Mass Spectra Alignments
- Hans-Michael Kaltenbach^{1}Email author,
- Andreas Wilke^{2} and
- Sebastian Böcker^{3}Email author
DOI: 10.1186/1471-2105-8-102
© Kaltenbach et al; licensee BioMed Central Ltd. 2007
Received: 05 July 2006
Accepted: 26 March 2007
Published: 26 March 2007
Abstract
Background
Mass spectrometry based peptide mass fingerprints (PMFs) offer a fast, efficient, and robust method for protein identification. A protein is digested (usually by trypsin) and its mass spectrum is compared to simulated spectra for protein sequences in a database. However, existing tools for analyzing PMFs often suffer from missing or heuristic analysis of the significance of search results and insufficient handling of missing and additional peaks.
Results
We present an unified framework for analyzing Peptide Mass Fingerprints that offers a number of advantages over existing methods: First, comparison of mass spectra is based on a scoring function that can be custom-designed for certain applications and explicitly takes missing and additional peaks into account. The method is able to simulate almost every additive scoring scheme. Second, we present an efficient deterministic method for assessing the significance of a protein hit, independent of the underlying scoring function and sequence database. We prove the applicability of our approach using biological mass spectrometry data and compare our results to the standard software Mascot.
Conclusion
The proposed framework for analyzing Peptide Mass Fingerprints shows performance comparable to Mascot on small peak lists. Introducing more noise peaks, we are able to keep identification rates at a similar level by using the flexibility introduced by scoring schemes.
Background
Protein identification using mass spectrometry has become one of the central tools in proteomics and systems biology [1]: With growing protein sequence databases such as SwissProt [2], fast and accurate identification of a sample protein remains a central problem. There are two common strategies for protein identification using mass spectrometry: Peptide Mass Fingerprints [3] and protein identification from peptide sequence information using tandem mass spectrometry [4].
Peptide mass fingerprinting (PMF) is preceded by a protein separation step using gel or chromatographic separation. The separated protein is digested by specific enzymatic cleavage such as tryptic digestion, followed by mass spectrometric measurement of the resulting peptides. The resulting mass spectrum has to be preprocessed into a list of signal peaks that form the input to identification algorithms. In our approach, we concentrate on Matrix Assisted Laser Desorption/Ionization (MALDI) [5], the predominant ionization technique for PMF. This technique produces mainly singly charged ions, allowing us to talk of the mass m of a molecule, instead of its mass-to-charge ratio m/z.
To identify a measured protein from a sequence database, the database sequences are digested in-silico and each predicted peak list is matched and scored with the measured peak list. Usually, computation of the statistical significance should follow, using a statistical background model. Software routinely used for identification of proteins using PMF includes the commercial systems Mascot [6] which uses peak counting together with heuristic information and ProFound [7] which relies on a bayesian scoring scheme. These systems have a comparable performance [8].
In [9], we presented a new approach for PMF protein identification. The approach is based on a re-formulation of the identification problem as a global alignment problem. Further, p-values of identifications are computed using a combinatorial algorithm using uniform character frequencies.
The statistics for p-value computation is extended to a broader class of digestion enzymes and to arbitrary protein sequence models of independently and identically distributed (i.i.d.) amino acids in [10], where this model is also shown to be consistent with corresponding empirical SwissProt data.
Here, we validate the theoretical approach of peak list alignments as introduced in [9] and show the applicability of this approach on real proteomics data. We discuss several aspects of general scoring schemes to be used in peak list alignments; such schemes provide a unified framework that allows emulation and combination of already existing methods and ideas. We demonstrate how missing and additional peaks can explicitly be taken into account and peak intensities can be consistently added into the scoring procedure. We evaluate our method, called SAMPI (SAMPI: aligning mass spectra for protein identification), on real proteomics mass spectrometry data and compare the method to PMF identification using the standard software Mascot.
Results and discussion
To evaluate our method, 375 PMF tryptic mass fingerprints of charge state [M + H]^{+} from an in-house proteomics experiment on the organism Corynebacterium glutamicum (Cg) are measured on a Bruker Ultraflex mass spectrometer. The proteins are separated using SDS-PAGE gel electrophoresis before mass measurement. Well-separated spots are digested with trypsin and peptide masses measured by mass spectrometry. For identification, trypsin is set as a cleavage enzyme, Carbamidomethyl is set as a fixed mass modification of ≈ +57 Da for Cysteine, a mass tolerance of 1 Da is set, and no missed cleavages are allowed.
Processing the raw spectra
Databases
Both sets of peak lists are identified using Mascot versions v1.9 and v2.1 and the Gaussian scoring scheme described below with different parameters.
For estimating the false positive rate, we proceed as follows: In a first step, all peak lists are identified using the in-house Cg protein sequence database with 3,510 sequences. The sequence identifiers for each identification are recorded for later comparison.
In a second step, all peak lists are again identified using a concatenation of the SwissProt database, release 48 with 155,824 entries, to the above Cg database. An identification is assumed to be correct if it is the same Cg sequence as recorded before. Conversely, an identification is assumed to be incorrect, if it is not a Cg sequence.
This approach makes it necessary to discard protein sequences that are equal to or highly similar to any Cg sequence from the SwissProt database beforehand. We therefore do an all-against-all comparison of the SwissProt database with 194,317 sequences and the Cg database with BLAST [11]. Sequences from the SwissProt database with an e-value of 10^{-30} or better are discarded and the remaining 155,824 sequences are appended to the Cg database.
The performance of PMF identification algorithms can thus be evaluated and compared independently on the set of sample mass spectra.
Results
Runtimes
Bruker | PL | |
---|---|---|
SAMPI | 23 min (4.2 sec) | 99 min (15.8 sec) |
Mascot | 43.5 min (8 sec) | 192 min (30.7 sec) |
Identification results
w/out intensity | w/intensity | ||||||
---|---|---|---|---|---|---|---|
Bruk. | PL | PL_{316} | Bruk. | PL | PL_{316} | ||
Mascot v1.9 | 123 | 58 | 53 | - | - | - | |
Mascot v2.1 | 119 | 59 | 53 | - | - | - | |
SAMPI | |||||||
c _{1} | c _{2} | ||||||
-0.1 | -0.1 | 112 | 56 | 51 | 72 | 106 | 87 |
-0.2 | -0.2 | 111 | 56 | 51 | 78 | 96 | 92 |
-0.3 | -0.3 | 96 | 54 | 48 | 65 | 103 | 98 |
-0.4 | -0.3 | 89 | 53 | 48 | 52 | 110 | 105 |
-0.4 | -0.4 | 91 | 54 | 49 | 54 | 108 | 103 |
-0.4 | -0.5 | 94 | 53 | 48 | 57 | 109 | 104 |
Conclusion
We propose a new formulation of protein identification using Peptide Mass Fingerprinting as an alignment problem. We introduce general peak-wise scoring schemes and show how these can be used to score two peak lists by dynamic programming. The scoring schemes provide a large amount of flexibility by allowing the user to independently set matching, additional and missing scores. They also allow consistent inclusion of additional features such as peak intensities into the identification process.
A mathematical model based on random weighted strings is used to efficiently estimate the statistical significance of an alignment score. This model only needs character frequencies as parameters, which can be estimated even with small sequence databases, allowing the use of species-specific protein data. The significance score is then computed to get comparable scores independent of the underlying database and sequence lengths. We also propose a first example of an alignment scoring scheme, called Gaussian score, using mass difference and robust intensities. We tested our approach on biological PMF data and compared our results to the standard software Mascot. We were able to correctly identify a comparable number of proteins using peak lists produced by the machine vendor's peak detection software. Using our own peak detection with about 8–10 times as many peaks, we showed the flexibility of our approach by correctly identifying approximately the same number of proteins whereas the performance of Mascot dropped considerably to about half the number of correct identifications.
For the near future, we plan to incorporate missed cleavages into the program. They can easily be handled by the alignment algorithm, but the statistical model has to be extended slightly. Further, we plan to incorporate the method into the ProDB proteomics platform [13] and to compare it to other protein identification tools besides Mascot. As already discussed in the Results and Discussion section, the incorporation of intensity information is not optimal. This is not due to the framework but rather due to the use of scaled intensity values. Nevertheless, using full rank statistics or other probabilistic intensity incorporations are to be investigated in the future. As a last point, the score normalization is not as good as expected, especially on smaller peak lists; the reasons and possible improvements are to be investigated. The flexible alignment framework together with the deterministic, model-based significance computation seems promising, although some improvements are clearly necessary.
Methods
To identify a measured peak list using peak lists predicted from database sequences, a measure is needed for the similarity of two peak lists. Our scoring of similarity is based on a peak-wise scoring function to score a pair of peaks, one of them possibly being a "gap" peak. The optimal matching of two peak lists can then be computed in a way similar to global sequence alignment.
For computing a statistical significance of an alignment score, we introduce a null-model based on a random protein model and estimate the alignment score distribution.
The general method works as follows, where the individual steps are explained in detail below: In a pre-processing step, a peak list is computed from each entry in the particular protein sequence database; it is called the predicted peak list of the sequence. Further, several statistics are computed for later use in the identification's significance estimation. These statistics are the length distribution and the joint length-mass distribution of cleavage fragments. From these, the occurrence probabilities are computed for each possible fragment mass and each protein length contained in the database. Now, the highest scoring protein sequence is computed for each measured spectrum by aligning this spectrum to each predicted spectrum, computing the alignment score and returning the sequence with the highest scoring predicted spectrum. Further, a statistical significance is computed for each such alignment score.
Spectra alignments
Peaks and peak lists
Every peak p_{ i }has a mass m_{ i }∈ $\mathcal{M}$ and possibly other attributes (a_{i,1}, ..., a_{i,k}) ∈ $\mathcal{A}$, k ≥ 0. A peak list $\mathcal{S}$ of length n is a list $\mathcal{S}$ = {p_{1}, ..., p_{ n }} of peaks p_{ i }∈ $\mathcal{M}$ × $\mathcal{A}$. A peak list is sorted by mass, thus m_{ i }<m_{ j }if i <j.
Note that we allow the set $\mathcal{A}$ of peak attributes to be empty. The simplest type of peak is a peak having only its mass m ∈ ℝ. A peak with mass and relative intensity could be represented as an element of ℝ × [0,1].
Scoring peaks and spectra
Let $\mathcal{S}$_{ p }= {p_{1}, ..., p_{ n }} and $\mathcal{S}$_{ m }= {p'_{1}, ..., p'_{ n' }} be two peak lists of length n and n', respectively. For 1 ≤ i ≤ n and 1 ≤ j ≤ n', let p_{ i }∈ $\mathcal{M}$ × $\mathcal{A}$ and p'_{ j }∈ $\mathcal{M}$ × $\mathcal{A}$' where we allow the sets of additional peak attributes $\mathcal{A}$ and $\mathcal{A}$' to be different. An example would be a measured peak list $\mathcal{S}$_{ m }, with relative intensity and a predicted peak list $\mathcal{S}$_{ p }with the generating string fragment as additional attributes. To compute an optimal matching between peak list $\mathcal{S}$_{ p }and peak list $\mathcal{S}$_{ m }, we first define a scoring function that scores two individual peaks.
A peak-wise scoring function score is a function
score: ($\mathcal{S}$_{ p }∪ {ε}) × ($\mathcal{S}$_{ m }∪ {ε}) → ℝ
mapping a predicted and a measured peak to a real value. Here, ε denotes a special "gap" peak. For two peaks p ∈ $\mathcal{S}$_{ p }and p' ∈ $\mathcal{S}$_{ m }, we say that score(p, p') is a matching score. We call score(p, ε) a missing score and p a missing peak, as it is not matched to any peak in $\mathcal{S}$_{ m }. Similarly, score(ε, p') is called an additional score for an additional peak p'. For completeness, we define score(ε, ε) := -∞.
We want to stress that missing and additional peaks are likely to be seen even if the measured spectrum stems from a measurement of a known sequence. Additional peaks, peaks that are seen in the measurement but cannot be explained by the sequence, may simply be chemical noise from the biochemical sample preparation. Missing peaks may occur due to incorrect peak detection or failed ionization of the corresponding fragment. Of course, missing and additional peaks also occur if the measured spectrum does not stem from the sequence under investigation.
Example 1 (Peak counting)
Using only peak mass as attribute, a peak counting score could ignore missing and additional peaks, i.e. set score(p, ε) = score(ε, p') = 0, and give a positive score whenever the difference of the two peak masses m and m' is not too large:
for some positive constant δ.
Noting that it would be meaningless to match two pairs of peaks that overcross in mass, we compute the optimal matching between two spectra, i.e. the matching yielding the highest sum of peak-wise scores, as a global alignment, using the well-known dynamic programming recurrence. Let E[i, j] denote the score for the optimal matching between the two spectra up to peaks p_{ i }and p'_{ j }, respectively. Then the alignment table is computed as
The score score($\mathcal{S}$_{ p }, $\mathcal{S}$_{ m }) of the optimal matching, given in E[n, n'], is called the alignment score of the spectra $\mathcal{S}$_{ p }and $\mathcal{S}$_{ m }.
As in the case of sequence alignment, the optimal matching itself can be recovered by backtracking in the dynamic programming table E[·, ·]. The alignment score can be computed in time O(n·n'), but faster implementations are possible, using only a diagonal band in E[·, ·].
This approach is a standard technique [14, 15], and has been successfully applied to such diverse problems as tree ring and liquid chromatography matching [16, 17]. A more formal model of peak list alignment can be found in [9].
Scoring schemes
Although a peak in a measured peak list is described at least by its mass and absolute intensity, most identification algorithms only make use of its mass [18]. This is partly because mass is the most discriminative parameter measured and partly because intensity depends heavily on the actual parameter settings of the machine. The basis for many schemes is the observation that a measurement error between the "real" mass of a molecule and the measured mass can be described by a Gaussian distribution with mean 0 and a standard deviation sd dependent on the machine settings and experiment type. The mean might also deviate from 0 if the machine is not calibrated correctly. We will now introduce a family of scoring schemes, the Gaussian scores, that will be used in further sections to demonstrate the applicability of our approach. Note however, that the approach is by no means limited to this mass measurement error distribution.
Mass difference
The matching score score(p, p') for two peaks p and p' with masses m and m', respectively, is the probability of a Gaussian distributed random variable Z with mean 0 and standard deviation sd to exceed ±|m - m'| in the respective direction, i.e., score(p, p') = ℙ(|Z| ≥ |m - m'|). This score drops exponentially from 1 to 0 with increasing mass difference. As it is always positive, we set the score to -∞ if it falls below 0.05, that is, if the mass difference exceeds ≈ 2·sd. A similar approach is taken in ProFound and the tandem MS software SCOPE [19], whereas Mascot uses a constant positive matching score similar to that of Example 1.
Robust incorporation of intensities
In order to incorporate intensities of measured peaks into the scoring, we applied methods from robust statistics successfully used in tandem MS scoring [20]: All peaks in the peak list were ranked according to their absolute intensity. The intensity of the 10% highest abundant peaks were set to 1, the intensity of the 10% lowest abundant peaks to 0. The intensities of the remaining peaks were scaled linearly between 0 and 1. Thus, a very high abundant peak of chemical noise or a small number of wrongly detected, low abundant peaks cannot spoil the interpretation of the whole peak list. Up to this point, we only use intensity values of measured peaks, resulting in an asymmetric scoring scheme. Given an appropriate prediction model [18, 21], it would also be possible to incorporate predicted intensities into the scoring. Writing int(p') for a peak's scaled intensity, the matching score is multiplied with (1 + 2 int(p'))/3, yielding a factor of 1/3 for lowest and 1 for highest abundant peaks. Again, the approach is suitable for using any other incorporation of intensity information, such as logarithmic transforms as proposed, e.g., in [22].
Scoring gap peaks
Using peak-wise scoring schemes allows us to explicitly take additional and missing peaks into account.
For additional peaks, a constant penalty c_{1} is given. If intensities are used in the scoring, this penalty is again multiplied by the scaled intensity of the peak: score(ε, p') = c_{1}·int(p'). Very low abundant peaks are then penalized by 0 and thus simply ignored, and very high abundant peaks that are not explained are highly penalized. For missing peaks, the Gaussian score always gives a constant penalty c_{2}, but as with matching scores, predicted peak intensities could also be used.
Background model and significance of alignment scores
To estimate the significance of a score of a measured spectrum and a sequence of certain length L, we compute a table of mass occurrence probabilities in random weighted strings in a preprocessing step. Using these probabilities, we get a background model for predicted spectra allowing us to estimate the contribution of each measured peak to the overall alignment score under a well-defined null-model without sampling. We proceed as follows: After introducing a formal model of random protein sequences and their digestion, we compute the joint length-mass distribution of cleavage fragments in such random proteins. We then compute the probability that in a random protein of given length, at least one fragment of certain mass m occurs and will thus give rise to a corresponding peak in the predicted spectrum. All these quantities can be computed once in a pre-processing step. Using the mass occurrence probabilities, we estimate the expectation and variance of an alignment score for computing p-values of such the score.
Weighted strings
A weighted alphabet is a finite alphabet Σ together with a weight or mass function μ: Σ → ∨_{>0}, assigning a mass to each of its characters. Its domain can be extended to strings s ∈ Σ* by setting $\mu (s):={\displaystyle {\sum}_{i=1}^{\left|s\right|}\mu ({s}_{i})}$. Such strings are called weighted strings.
If each character σ ∈ Σ occurs with probability ℙ(σ), we call an i.i.d. sequence of such characters a random weighted string. The parameters of this model, i.e., the character frequencies, can be robustly estimated from a sequence database. As they are the only parameters needed for subsequent significance computations, we can use species-specific models where only small sequence databases are available.
Example weighted amino acids
σ | A (Ala) | C (Cys) | D (Asp) | E (Glu) | ... | Y (Tyr) |
---|---|---|---|---|---|---|
μ(σ) | 71.0371 | 103.0092 | 115.0269 | 129.0426 | ... | 163.0633 |
μ*(σ) | 710 | 1030 | 1150 | 1290 | ... | 1631 |
ℙ(σ) | 0.0785 | 0.0154 | 0.0531 | 0.0661 | ... | 0.0306 |
This model is readily extendible to capture distributions of masses for each character, such that copies of the same amino acid in a protein may have different masses. This allows to model isotopic mass distributions and different amino acid masses due to post-translational modifications such as phosphorylation or methylation.
Cleavage schemes
Most proteases cleave a peptide right after the occurrence of a specific cleavage character in the amino acid sequence, except in the presence of a prohibition character directly following the cleavage character. In the case of trypsin, the set of cleavage characters is Γ = {K, R} and the set of prohibition characters is Π = {P}. Together, Γ and Π form a cleavage scheme.
Applying a cleavage scheme on a weighted string results in a fragmentation of this string, a set of successive, non-overlapping substrings, the fragments.
Example 2 (Fragmentation of a string)
Let Σ := {A, B, C}, be a weighted alphabet with weights μ(A) = 1, μ(B) = 2, μ(C) = 3, let Γ := {B}, Π := {A} be a cleavage scheme on Σ. Then the string s = ABBACCBACBBB is fragmented into the fragments AB, BACCBACB, B, B of weights μ(AB) = 3, μ(BACCBACB) = 17 and μ(B) = 2.
For the sake of brevity, we concentrate on cleavage schemes without prohibition characters. Then, a fragment is simply a string of non-cleavage characters followed by a cleavage character. Generalizations to arbitrary cleavage schemes and more details on the stochastic models and efficient computation can be found in [10].
Mass occurrence probabilities
Let f^{ L }[l, m*] denote the probability that the first fragment of a random weighted string of length L has length l and integer mass m*. The main recurrence is given for the length-mass distribution of the inner part of a fragment, consisting solely of non-cleavage characters:
with initial condition f'[0, 0] = 1.
The fragment length-mass distribution can be computed by adding the cleavage character to the right and taking care of the finite string length L.
Lemma 1 (Fragment probabilities)
The fragment probability f^{ L }[l, m*] is given for l <L by
and for the boundary l = L by
The probability that a fragment of length l does not have mass m is computed as $\overline{f}$^{ L }[l, m*] = u[l] - f^{ L }[l, m*], where u[l] denotes the probability that a fragment has length l; it is a geometric distribution.
Taking the complementary probability $\overline{f}$^{ L }[l, m*], we can compute the mass occurrence probability p[L, m*] that at least one fragment of mass m* occurs in the fragmentation of a random weighted string of length L.
Lemma 2
The occurrence probability p[L, m*] = 1 - $\overline{p}$[L, m*] of mass m* in a random weighted string of length L is given by $\overline{p}$[0, m*] = 1 and
Alignment score distribution
The alignment score distribution is efficiently and deterministically estimated by adding the contribution of each peak to the overall alignment score. To compute the contribution of each measured peak p'_{ j }, ∈ $\mathcal{S}$_{ m }, let ${\mathcal{U}}_{j}$_{ j }denote the support of peak p'_{ j }, that is the set of integer masses m* for which a predicted peak p having integer mass m* would contribute a positive matching score score(p, p'_{ j }) > 0. Let ${X}_{j}^{match}(L)$ be the random variable that contains the sum of the matching scores over all peaks p in any spectrum $\mathcal{S}$_{ p }generated by a random weighted string of length L that have masses in ${{\mathcal{U}}^{\prime}}_{j}$. The expectation and variance of this random variable are then given by
Similarly, we define random variables ${X}_{j}^{add}(L)$ for the additional scores. Assuming independence of peaks, the overall matching and additional scores are simply the sum of these scores:
The missing scores are given for all masses that are not inside the support of any measured peak:
We omit the details for additional and missing peaks and again refer the interested reader to [9]. The alignment score score($\mathcal{S}$_{ p }, $\mathcal{S}$_{ m }) for a measured spectrum $\mathcal{S}$_{ m }and a random predicted spectrum $\mathcal{S}$_{ p }generated by a random weighted string of length L is finally given by
score($\mathcal{S}$_{ p }, $\mathcal{S}$_{ m }) = X^{match}(L) + X^{add}(L) + X^{miss}.
As score($\mathcal{S}$_{ p }, $\mathcal{S}$_{ m }) is the sum of nearly independent random variables, we can expect it to have a normal distribution for reasonable scoring schemes and peak lists. This distribution is completely determined by its expectation and variance.
Note that the alignment algorithm computes an optimal one-to-one peak matching score, whereas the estimation procedure corresponds to a many-to-one matching of peaks. As shown below, neither the peak independence assumptions nor the one-to-one peak matching are a problem in practice, as violations of either assumption do not contribute enough to alter the score distribution noticably.
Parameter sets for evaluation
Parameter | std. dev. sd | missing score | additional score | intensity used |
---|---|---|---|---|
A | 0.8 | -0.1 | -0.1 | No |
B | 0.8 | -0.4 | -0.3 | Yes |
Using significance as score
As the alignment score is an additive score, its value and distribution is dependent on the number of peaks in the measured and predicted spectra. This makes it difficult to compare alignment scores for different measured spectra and sequence lengths.
To avoid these problems, we will not use the alignment score itself, but its significance to rank the candidate sequences, a method previously shown to be effective for tandem MS data [20]. For each pair of measured and predicted spectra, the alignment score is computed, its distribution is estimated using the method described, and the p-value – the probability that a random sequence of the same length as the aligned sequence gives an alignment score at least as good as the computed one – is computed from this distribution. We then take – log_{10}(p-val.) as the significance score to rank the candidates. In the evaluation, we always used the significance score unless explicitly stated otherwise.
Declarations
Acknowledgements
Hans-Michael Kaltenbach is supported by the "International NRW Graduate School in Bioinformatics and Genome Research." Implementation of the presented method was provided by Tobias Marschall, Marcel Martin, Matthias Steinrücken, and Henner Sudek. The authors want to thank Sven Rahmann for helpful discussions and Martina Mahne for providing the original mass spectrometry data.
Authors’ Affiliations
References
- Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature 2003, (422):198–207. 10.1038/nature01511Google Scholar
- Bairoch A, Boeckmann B: The SWISS-PROT protein sequence data bank. Nucleic Acids Res 1992, 20: 2019–2022.PubMed CentralView ArticlePubMedGoogle Scholar
- Henzel WJ, Watanabe C, Stults JT: Protein Identification: The Origins of Peptide Mass Fingerprints. J Am Soc Mass Spectrometry 2003, 14: 931–942. 10.1016/S1044-0305(03)00214-9View ArticleGoogle Scholar
- Sadygov RG, Cociorva D, Yates JR: Large-scale database searching using tandem mass spectra: looking up the answer in the back of the book. Nat Methods 2004, 1(3):195–202. [http://dx.doi.org/10.1038/nmeth725] 10.1038/nmeth725View ArticlePubMedGoogle Scholar
- Karas M, Hillenkamp F: Laser desorption ionization of proteins with molecular masses exceeding 10,000 Daltons. Anal Chem 1988, 60: 2299–2301. 10.1021/ac00171a028View ArticlePubMedGoogle Scholar
- Perkins D, Pappin D, Creasy D, Cottrell J: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20: 3551–3567. 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2View ArticlePubMedGoogle Scholar
- Zhang W, Chait BT: ProFound: an expert system for protein identification using mass spectrometric peptide mapping information. Anal Chem 2000, 72(11):2482–2489. 10.1021/ac991363oView ArticlePubMedGoogle Scholar
- Chamrad DC, Körting G, Stühler K, Meyer HE, Klose J, Blüggel M: Evaluation of algorithms for protein identification from sequence databases using mass spectrometry data. Proteomics 2004, 4: 619–628. 10.1002/pmic.200300612View ArticlePubMedGoogle Scholar
- Böcker S, Kaltenbach HM: Mass Spectra Alignments and Their Significance. In Combinatorial Pattern Matching. Volume 3537. Edited by: Apostolico A, Crochemore M, Park K. Springer; 2005:429–441.View ArticleGoogle Scholar
- Kaltenbach HM, Sudek H, Böcker S, Rahmann S: Statistics of cleavage fragments in random weighted strings. Tech. Rep. TR-2005–06, Technische Fakultät der Universität Bielefeld, Abteilung Informationstechnik 2005. [http://bieson.ub.uni-bielefeld.de/volltexte/2006/900/]Google Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410.View ArticlePubMedGoogle Scholar
- Havilio M, Haddad Y, Smilansky Z: Intensity-based statistical scorer for tandem mass spectrometry. Anal Chem 2003, 75: 435–444. 10.1021/ac0258913View ArticlePubMedGoogle Scholar
- Wilke A, Rückert C, Bartels D, Dondrup M, Goesmann A, Hüser AT, Kespohl S, Linke B, Mahne M, McHardy AC, Pühler A, Meyer F: Bioinformatics support for high-throughput proteomics. J Biotechnol 2003, 106(2–3):147–56. 10.1016/j.jbiotec.2003.08.009View ArticlePubMedGoogle Scholar
- Gusfield D: Algorithms on Strings. Trees, and Sequences. Cambridge University Press; 1997.View ArticleGoogle Scholar
- Huang X, Waterman MS: Dynamic programming algorithms for restriction map comparison. Comput Appl Biosci 1992, 8(5):511–520.PubMedGoogle Scholar
- Wenk C: Applying an Edit Distance to the Matching of Tree Ring Sequences in Dendrochronology. Proceedings of Combinatorial Pattern Matching (CPM99) 1999, 1645: 223–242.View ArticleGoogle Scholar
- Bylund D, Danielsson R, Malmquist G, Markides KE: Chromatographic alignment by warping and dynamic programming as a pre-processing tool for PARAFAC modelling of liquid chromatography-mass spectrometry data. Journal of Chromatography A 2002, 961: 237–244. 10.1016/S0021-9673(02)00588-5View ArticlePubMedGoogle Scholar
- Gay S, Binz PA, Hochstrasser DF, Appel RD: Peptide mass fingerprinting peak intensity prediction: Extracting knowledge from spectra. Proteomics 2002, 2: 1374–1391. 10.1002/1615-9861(200210)2:10<1374::AID-PROT1374>3.0.CO;2-DView ArticlePubMedGoogle Scholar
- Bafna V, Edwards N: SCOPE: A probabilistic model for scoring tandem mass spectra against a peptide database. Bioinformatics 2001, 17: S13-S21. 10.1093/bioinformatics/17.1.13View ArticlePubMedGoogle Scholar
- Wan Y, Yang A, Chen T: PepHMM: A Hidden Markov Model Based Scoring Function for Mass Spectrometry Database Search. In Proc of RECOMB 2005. Volume 3500. Springer; 2005:342–356.Google Scholar
- Schütz F, Kapp EA, Simpson RJ, Speed TP: Deriving statistical models for predicting peptide tandem MS product ion intensities. Biochem Soc Trans 2003, 31(Pt 6):1479–1483.View ArticlePubMedGoogle Scholar
- Wolski WE, Lalowski M, Martus P, Herwig R, Giavalisco P, Gobom J, Sickmann A, Lehrach H, Reinert K: Transformation and other factors of the peptide mass spectrometry pairwise peak-list comparison process. BMC Bioinformatics 2005, 6: 285. 10.1186/1471-2105-6-285PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.