 Research
 Open access
 Published:
A graphbased approach for proteoform identification and quantification using topdown homogeneous multiplexed tandem mass spectra
BMC Bioinformatics volumeÂ 19, ArticleÂ number:Â 280 (2018)
Abstract
Background
Topdown homogeneous multiplexed tandem mass (HomMTM) spectra are generated from modified proteoforms of the same protein with different posttranslational modification patterns. They are frequently observed in the analysis of ultramodified proteins, some proteoforms of which have similar molecular weights and cannot be well separated by liquid chromatography in mass spectrometry analysis.
Results
We formulate the topdown HomMTM spectral identification problem as the minimum error ksplittable flow problem on graphs and propose a graphbased algorithm for the identification and quantification of proteoforms using topdown HomMTM spectra.
Conclusions
Experiments on a topdown mass spectrometry data set of the histone H4 protein showed that the proposed method identified many proteoform pairs that better explain the query spectra than single proteoforms.
Background
In topdown mass spectrometry (MS), separating similar proteoforms is a challenging problem. A ultramodified protein may have many similar proteoforms with similar weights and different posttranslational modification (PTM) patterns. These proteoforms are often not well separated in topdown MS analysis [1]. A multiplexed tandem mass (MTM) spectrum is generated when tandem mass spectrometry (MS/MS) is used to analyze two or more proteoforms with similar molecular masses that are not separated by protein separation methods [2]. Despite the complexity of MTM spectra, they have been extensively studied because the interpretation of these spectra provides valuable information about modifications and quantification of proteoforms of ultramodified proteins [1â€“3]. For example, MTM spectra are frequently observed and analyzed in studies of histone proteins, which play important roles in epigenetics and gene regulation [4, 5].
MTM spectra can be divided into two main types: heterogeneous multiplexed tandem mass (HetMTM) spectra and homogeneous multiplexed tandem mass (HomMTM) spectra. While HetMTM spectra are generated from proteoforms of two or more different proteins, HomMTM ones from proteoforms of the same protein with different PTM patterns. In dataindependent acquisition MS, which has been rapidly developed in the past several years, the precursor ions in a large masstocharge ratio (m/z value) interval are collected for MS/MS analysis, resulting in complex HetMTM spectra [6, 7]. In spectral identification, a HetMTM spectrum is searched against a protein database to find k proteins/peptides that best explain the spectrum [2], where k is a userdefined parameter. The problem is computational challenging because its search space is proportional to n^{k}, where n is the number of proteins/peptides in the database.
In the analysis of HomMTM spectra, we often focus on purified proteins, whose sequences are known. Let P be an unmodified protein sequence and S a HomMTM spectrum generated from k modified proteoforms of P. Denote \(\mathcal {Q}_{M}\) as the set of modified proteoforms of P that match the precursor mass of S. The HomMTM spectral identification problem is to find k proteoforms in \(\mathcal {Q}_{M}\) and their relative abundances to maximize the similarity between the theoretical spectra of the proteoforms and the spectrum S [1].
DiMaggio et al. first studied the HomMTM spectral identification problem and proposed a mixed integer linear optimization framework for solving it [1]. The proposed framework demonstrated good performance on the analysis of middledown MTM spectra of histone proteins, but the exponential time complexity of integer linear optimization makes it inefficient for analyzing topdown HomMTM spectra of long protein sequences.
In topdown MS, many software tools have been developed for the identification of proteoforms with PTMs and other alterations [8â€“12]. However, these software tools are designed for analyzing tandem mass spectra from single proteoforms, not multiplexed ones. Using these tools to analyze an MTM spectrum reports only one proteoform instead of multiple ones.
We formulate the minimum error ksplittable flow (MEkSF) problem on graphs and convert the HomTMT spectral identification problem to the MEkSF problem. To our best knowledge, the MEkSF problem has not been studied. However, the maximum ksplittable flow (MkSF) problem, which is related to the MEkSF problem, has been extensively studied and has various applications in commodity transportation and telecommunication network optimization [13â€“18].
Let G be a connected graph with edge capacities, a source vertex, and a sink vertex. A flow is ksplittable if it can be decomposed to k or less than k paths. These paths are neither required to be different, nor edge/vertex disjoint. The MkSF problem aims at finding a ksplittable flow in G from the source to the sink such that the edge capacity constraints are not violated and the flow value is maximized.
Baier et al. [13, 14] first investigated the MkSF problem and proved the NPhardness of the problem on directed graphs for k=2. They proposed approximation algorithms with a performance ratio \(\frac {2}{3}\) for the maximum 2 and 3splittable flow problem and presented a \(\frac {1}{2}\)approximation algorithm for the general MkSF problem. Koch and Spenke [15] studied the complexity and approximability of the MkSF problem for different values of kâ‰¥2 on directed and undirected graphs. In particular, they proved that the problem is NPhard for k=2 on directed and undirected graphs and showed that, for an arbitrary constant k, the problem cannot be approximated with a performance ratio better than \(\frac {5}{6}\). Koch, Skutella and Spenke [16] decoupled the MkSF problem into two steps: the first step called packing finds the flow values of the k paths in an optimal solution; the second step called routing reports the optimal paths of the k flow values. The packing procedure was described for general directed graphs, while the routing for graphs with bounded treewidth. Finally, they proposed a polynomial algorithm for the MkSF problem on graphs of bounded treewidth when k is a constant and presented a polynomialtime approximation scheme when k is part of the input.
Unlike the MkSF problem, an instance graph of the MEkSF problem has capacities on vertices instead of edges. In addition, it is allowed that a flow violates the vertex capacity constrains. That is, the flow value on a vertex may be larger than its capacity. The difference between the flow value and the capacity on a vertex (the flow value may be smaller or larger than the capacity) is defined as the error of the vertex. Let G be a connected graph with integer vertex capacities, a source vertex, and a sink vertex. Given a total integer flow value f, the objective of the MEkSF problem is to find a ksplittable flow F in G from the source to the sink such that the flow value of F is f and the sum of the errors on the vertices is minimized.
We prove that the MEkSF problem is NPhard when k is part of the input and propose a polynomial time algorithm for the problem on layered directed graphs when k=2. We tested the algorithm on a topdown MS/MS data set of the human histone H4 protein. Experimental results showed that the proposed method identified many proteoform pairs (path pairs in the graph) that provided better explanation for the query spectra than single proteoforms reported by MSAlignE [9], an existing tool for the identification of ultramodified proteins.
Methods
The MEkSF problem
Let G=(V,E) be a directed graph with a source vertex s and a sink vertex t. Each vertex vâˆˆV has a positive integer capacity \(c(v) \in \mathbb {Z}^{+}\). Let \(\mathcal {A}\) denote the set of all simple stpaths (without circles) in G. A ksplittable stflow F contains k pairs (A_{1},f_{1}),â‹¯,(A_{k},f_{k}) where A_{i} is a path in \(\mathcal {A}\) and \(f_{i} \in \mathbb {Z}^{+}\) is the integer flow value on A_{i}, for 1â‰¤iâ‰¤k. The paths A_{1},â‹¯,A_{k} may share vertices and/or edges. The flow value of F is the sum \(\sum ^{k}_{i=1}f_{i}\). Let F(v) be the set of the pairs (A_{i},f_{i}) in F satisfying that A_{i} contains vertex vâˆˆV. The flow value of v is the sum of the flow values of the pairs in F(v), denoted by \(f(v) = \sum _{(A_{i},f_{i}) \in F(v)} f_{i}\). The error on v is the difference between the flow value and the capacity of the vertex, denoted by Îµ(v)=f(v)âˆ’c(v). The error of F is the sum or the errors of all vertices in G, denoted by \(\varepsilon (F) =\sum _{v \in V} \varepsilon (v)\). The MEkSF problem is defined as follows.
Definition 1
Given a directed graph G with integer vertex capacities, a source vertex s and a sink vertex t, and an integer flow value f, the MEkSF problem is to find a ksplittable flow F in G from s to t such that the flow value of F is f and the error of F is minimized.
The HomMTM spectral identification problem
When a purified protein is analyzed and the target protein is known, the objective of MS analysis is to identify and quantify modified proteoforms of the protein [19, 20]. Although hundreds of PTMs have been found on various proteins, it is common that only several expected PTMs are observed on the target protein. For example, expected PTMs on histone proteins include methylation, dimethylation, trimethylation, acetylation, and phosphorylation. In this study, only proteoforms with expected PTMs are considered as candidates in HomMTM spectral identification.
Let \(\mathcal {Q}\) be the set of all proteoforms of an unmodified protein P with expected PTMs and \(\mathcal {Q}_{M}\) a subset of \(\mathcal {Q}\) containing all the proteoforms with a molecular mass M. For example, when acetylation on lysine residues is the only expected PTM, the set \(\mathcal {Q}\) for the protein AKGKL contains three proteoforms AK[acetylation]GKL, AKGK[acetylation]L, and AK[acetylation]GK[acetylation]L. When M is the sum of the mass of the protein and the mass shift of an acetylation, \(\mathcal {Q}_{M}\) contains only two proteoforms AK[acetylation]GKL and AKGK[acetylation]L. Let S be a HomMTM spectrum with a precursor mass M generated from proteoforms Q_{1},Q_{2},â€¦,Q_{k} of protein P. The molecular masses of the k proteoforms is the same to the precursor mass of S. That is, \(Q_{1}, Q_{2},\ldots, Q_{k} \in \mathcal {Q}_{M}\). In practice, errors in the precursor mass of M are allowed, and the set \(\mathcal {Q}_{M}\) contains all proteoforms whose molecular masses are similar to M (the difference is within an error tolerance).
Definition 2
Given a set \(\mathcal {T}\) of expected PTMs, a protein P, a HomMTM spectrum S with a precursor mass M, and a number k, the HomMTM spectral identification problem is to find k proteoforms \(Q_{1}, Q_{2},\ldots, Q_{k} \in \mathcal {Q}_{M}\) and their abundances that best explain the spectrum S.
Representing the HomMTM spectral identification problem as a graph problem
We will formulate the HomMTM spectral identification problem as the MEkSF problem. The proposed method can be applied to tandem mass spectra with various fragmentation methods, such as collisioninduced dissociation (CID), higherenergy collision dissociation (HCD), and electrontransfer dissociation (ETD). Here HCD tandem mass spectra are used to explain the method. Only one type of Nterminal fragment ions and one type of Cterminal fragment ions are considered in the method to simply the analysis.
Tandem mass spectra of proteoforms in topdown MS often contain high charge state fragment ions and isotopomer envelopes. The first step in interpreting these spectra is to convert a spectrum into a list of monoisotopic fragment masses using topdown spectral deconvolution tools, such as Thrash [21] and MSDeconv [22]. In the following analysis, we assume that the spectrum S is a deconvoluted tandem mass spectrum.
The target protein P is represented as a sequence of amino acids a_{1}a_{2},â€¦,a_{n}. The ith prefix residue mass of P is the sum of the residue masses of its first i amino acids, that is, \(p_{i} = \sum _{k=1}^{i} \text {Mass}(a_{k})\), where Mass(a_{k}) is the residue mass of a_{k}. Specifically, p_{0}=0. The ith suffix residue mass of P is the sum of the residue masses of its last i amino acids, that is, \(s_{i} = \sum _{k=ni+1}^{n} \text {Mass}(a_{k})\). Because of the existence of PTMs, a proteoform Q in \(\mathcal {Q}_{M}\) may have prefix residue masses different from those of P, and the ith prefix residue mass of Q is the sum of p_{i} and the mass shifts of the PTMs on the first i amino acids in Q. Two different proteoforms in \(\mathcal {Q}_{M}\) may have the same ith prefix residue mass because they have the same mass shifts on the first ith amino acids. For example, GK[acetylation]GKL and GKGK[acetylation]L have the same 4th prefix residue mass because the first 4 amino acids in the two proteoforms have the same PTM acetylation on different sites. Similarly, different proteoforms in \(\mathcal {Q}_{M}\) may have the same ith suffix residue mass. Let \(\mathcal {P}_{i}\) (\(\mathcal {S}_{i}\)), for 0â‰¤iâ‰¤n, be the set of all different ith prefix (suffix) residue masses of the proteoforms in \(\mathcal {Q}_{M}\). The ith prefix residue mass and the nâˆ’ith suffix residue mass of a proteoform in Q are called complementary masses. Each prefix residue mass in \(\mathcal {P}_{i}\) has a corresponding complementary suffix residue mass in \(\mathcal {S}_{ni}\). Let \(\mathcal {T}_{i}\) be the set of expected PTMs that can occur on the ith amino acid in P. A mass m_{1} in \(\mathcal {P}_{i}\) is a preceding mass of another mass m_{2} in \(\mathcal {P}_{i+1}\) if m_{2}âˆ’m_{1} matches Mass(a_{i+1}) or the sum of Mass(a_{i+1}) and the mass shift of a PTM in \(\mathcal {T}_{i}\).
Theoretical masses in \(\mathcal {P}_{1},\mathcal {P}_{2},\ldots, \mathcal {P}_{n}\), \(\mathcal {S}_{1}, \mathcal {S}_{2}, \ldots, \mathcal {S}_{n}\) are compared with deconvoluted fragment masses in S to find matched ones. Mass shifts determined by fragment ion types are added these theoretical masses in the matching because the theoretical prefix or suffix residue masses may have mass shifts compared with their corresponding experimental fragment masses. For example, the mass 18.015 Dalton (Da) of a water molecule is added to theoretical suffix residue masses to match experimental yion neutral fragment masses. The raw intensity of a prefix residue mass m is the sum of intensities of neutral fragment masses in S that match either m or the complementary suffix residue mass of m, denoted by Inte(m). The relative intensity of a prefix residue mass is the ratio between the raw intensity of the mass and the largest raw intensity of all prefix residue masses (Fig. 1a).
A directed graph G containing n+1 layers is generated from the sets of prefix residue masses \(\mathcal {P}_{0}, \mathcal {P}_{1}, \ldots, \mathcal {P}_{n}\) with five steps (Fig. 1b). (1) A vertex is added to the ith layer of G for each mass in \(\mathcal {P}_{i}\). (2) A vertex u in the ith layer is connected to another vertex v in the i+1th layer by a directed edge if and only if the mass corresponding to u is a preceding mass of the mass corresponding to v. (3) The only vertex in layer 0 is labeled as the source vertex, and the only vertex in layer n is labeled as the sink vertex. (4) We remove all vertices that are not on any path from the source to the sink. (5) Let m_{1},m_{2},â€¦,m_{k} be the prefix masses corresponding to the remaining vertices in a layer of G. The capacity of the vertex corresponding to mass m_{i} is defined as \(\frac {\text {Inte}(m_{i})} {\sum _{j=1}^{k} \text {Inte}(m_{j})}\).
Each path from the source to the sink in G corresponds to a proteoform in \(\mathcal {Q}_{M}\), and the flow of a path corresponds to the relative abundance of the proteoform. Using this method, the HomMTM spectral identification problem is transformed into an MEkSF problem on a graph, in which the total flow value is fixed (100 was used in the experiments).
An algorithm for the ME2SF problem
The MEkSF problem is NPhard on directed acyclic graphs when k is part of the input, which can be proved by reducing from the partition problem [23]. (See Additional file 1.) Here we propose a dynamic programming algorithm for the MEkSF problem for k=2 on layered directed graphs.
A directed graph G=(V,E) is a layered one if there exists a partition of its vertex set V={V_{1},V_{2},â‹¯,V_{h}}, such that (u,v)âˆˆE if and only if uâˆˆV_{i} and vâˆˆV_{i+1} for 1â‰¤iâ‰¤hâˆ’1. Let G=({V_{1},â‹¯,V_{h}},E) be a layered directed graph in which V_{1}={s} and V_{h}={t}. Following the terminology introduced in the studies of the MkSF problem, which has many applications on commodity transportation, a flow value pair (f_{1},f_{2}), f_{1},f_{2}â‰¥0, is called a packing, and the packing is optimal for the ME2SF problem if there is an optimal ME2SF (P_{1},f_{1}),(P_{2},f_{2}).
Koch et al. has proved that the MkSF problem can be solved in polynomial time on graphs with bounded treewidth, including layered directed graphs, when k is a constant [16]. The method consists of two steps: the packing step finds candidates for the flow values of the k paths in an optimal flow, and the routing step reports k paths with the minimum error for each candidate. Similarly, in the proposed algorithm for the ME2SF problem, we first generate candidate packings that contain an optimal one, then find the best routing for each packing.
In the packing step, the total flow value f is fixed, a naive approach is to enumerate all possible packings (f_{1},f_{2}) such that f_{1}+f_{2}=f. The number of candidate packings is O(f), which may be an exponential function of the length of the input. Below we show that it is sufficient to consider O(V) packings to solve the ME2SF problem.
A set \(\mathcal {S}\) of candidate packings with an O(V) size is generated as follows: (1) for each vertex vâˆˆV with c(v)<f, a candidate packing (c(v),fâˆ’c(v)) is added to \(\mathcal {S}\); (2) a special packing (f,0) is added to \(\mathcal {S}\). The total number of candidate packings in \(\mathcal {S}\) is no large than V+1.
We will prove the candidate set \(\mathcal {S}\) contains at least one optimal packing. Let F=(P_{1},f_{1}),(P_{2},f_{2}) be an optimal solution to the ME2SF problem, in which V_{1} is the set of vertices in P_{1}, V_{2} is the set of vertices in P_{2}, and V_{1}â‰ V_{2}. Let v^{âˆ—} be a vertex in (V_{1}âˆ’V_{2})âˆª(V_{2}âˆ’V_{1}) with the minimum capacity error. A vertex v with f(v)<c(v), f(v)=c(v), f(v)>c(v) is called an under flow, perfect flow, over flow vertex, respectively. The numbers of over flow and under flow vertices in V_{1}âˆ’V_{2} are denoted as \(n_{1}^{+}\) and \(n_{1}^{}\), respectively; the numbers of over flow and under flow vertices in V_{2}âˆ’V_{1} are denoted as \(n_{2}^{+}\) and \(n_{2}^{}\), respectively.
Lemma 1
If v^{âˆ—} is not a perfect flow vertex, then \(n_{1}^{+} + n_{2}^{} = n_{1}^{}+n_{2}^{+}\). That is, the sum of the numbers of over flow vertices in V_{1}âˆ’V_{2} and under flow vertices in V_{2}âˆ’V_{1} equals the sum of the numbers of under flow vertices in V_{1}âˆ’V_{2} and over flow vertices in V_{2}âˆ’V_{1}.
Proof
We prove the lemma by contradiction. If \(n_{1}^{+} + n_{2}^{} < n_{1}^{} + n_{2}^{+}\), then we increase the flow value of P_{1} by Î´=Îµ(v^{âˆ—}) and decrease the flow value of P_{2} by Î´ to obtain a new flow (P_{1},f_{1}+Î´),(P_{2},f_{2}âˆ’Î´). By increasing the flow value in P_{1}, the error of each over flow vertex in V_{1}âˆ’V_{2} increases by Î´, and the error of each under flow vertex in V_{1}âˆ’V_{2} decreases by Î´ because Î´=Îµ(v^{âˆ—}) is the smallest error of the vertices in (V_{1}âˆ’V_{2})âˆª(V_{2}âˆ’V_{1}). By decreasing the flow value in P_{2}, the error of each over flow vertex in V_{2}âˆ’V_{1} decreases by Î´, and the error of each under flow vertex in V_{2}âˆ’V_{1} increases by Î´. In addition, the errors of the vertices not in (V_{1}âˆ’V_{2})âˆª(V_{2}âˆ’V_{1}) do not change. As a result, the error of the new flow is \(\varepsilon (F) + \left (n_{1}^{+} + n_{2}^{}  n_{1}^{}  n_{2}^{+}\right) \delta \), which is smaller than the error of F. This is a contradiction. Similarly, if \(n_{1}^{+} + n_{2}^{} > n_{1}^{} + n_{2}^{+}\), then the error of the flow (P_{1},f_{1}âˆ’Î´),(P_{2},f_{2}+Î´) is smaller than the error of F, which is a contradiction. â–¡
Theorem 1
The candidate set S contains at least one optimal packing of G.
Proof
Let F=(P_{1},f_{1}),(P_{2},f_{2}) be an optimal solution to the ME2SF problem. We consider two cases: (1) P_{1} and P_{2} are the same and (2) P_{1} and P_{2} are different. In the first case, the two paths are the same, then (P_{1},f),(P_{2},0) is an optimal solution and \((f,0) \in \mathcal {S}\) is an optimal packing. In the second case, we will prove that there exists an optimal solution F^{â€²}=(P_{1},f1â€²),(P_{2},f2â€²) such that (f1â€²,f2â€²) or (f2â€²,f1â€²) is in \(\mathcal {S}\).
Suppose P_{1} and P_{2} are different and (V_{1}âˆ’V_{2})âˆª(V_{2}âˆ’V_{1}) is not empty. Let v^{âˆ—} be a vertex with the minimum error in (V_{1}âˆ’V_{2})âˆª(V_{2}âˆ’V_{1}). Without loss of generality, we assume that v^{âˆ—}âˆˆV_{1}âˆ’V_{2}. If v^{âˆ—} is a perfect flow edge, then F^{â€²}=F and (f_{1},f_{2})=(c(v^{âˆ—}),fâˆ’c(v^{âˆ—}))âˆˆS. Otherwise, based on Lemma 1, \(n_{1}^{+} + n_{2}^{} = n_{1}^{} + n_{2}^{+}\). By changing the flow values (f_{1},f_{2}) to (c(v^{âˆ—}),fâˆ’c(v^{âˆ—})), we obtain a new flow F^{â€²}=(P_{1},c(v^{âˆ—})),(P_{2},fâˆ’c(v^{âˆ—})). The difference between the errors of F and F^{â€²} is \(\left (n_{1}^{+} + n_{2}^{}  n_{1}^{}  n_{2}^{+}\right) \varepsilon = 0\). As a result, F^{â€²} is an optimal solution, and \((c(v^{*}), fc(v^{*})) \in \mathcal {S}\) is an optimal packing. â–¡
In the routing step, we propose a dynamic programming algorithm for finding a flow (P_{1},f_{1},),(P_{2},f_{2}) for a packing (f_{1},f_{2}) such that the error of the flow is minimized. We first introduce partial flows that are used in the routing algorithm. A path pair with flows (P_{1},f_{1}),(P_{2},f_{2}) is called a partial 2splittable stflow if P_{1} and P_{2} start at the source s. The two paths in the partial flow may not end at the sink t. The error of a partial flow is defined similarly as the error of a 2splittable stflow.
For each ordered vertex pair (v_{1},v_{2}) (v_{1} and v_{2} may be the same) in a layered directed graph G=({V_{1},V_{2},â€¦,V_{h}},E), we define D(v_{1},v_{2}) as the minimum error of all partial 2splittable flows (P_{1},f_{1}),(P_{2},f_{2}) such that P_{1} ends at v_{1} and P_{2} ends at v_{2}. A vertex pair (v1â€²,v2â€²) (\(v^{\prime }_{1}\) and \(v^{\prime }_{2}\) may be the same) precedes vertex pair (v_{1},v_{2}) if (v1â€²,v_{1}),(v2â€²,v_{2})âˆˆE. The error of an ordered vertex pair (v_{1},v_{2}) for a packing (f_{1},f_{2}) is defined as
Let T(v_{1},v_{2}) be the set of all precedent pairs of (v_{1},v_{2}). We use a dynamic programming algorithm to fill out D(v_{1},v_{2}) for all vertex pairs v_{1},v_{2} in the same layer. The recurrence function for computing D(v_{1},v_{2}) is
After obtaining the value D(t,t) for the sink, we use backtracking to find the best path pair for the ME2SF problem. The time complexity of the routing algorithm is O(l^{4}h) where l is the largest number of vertices in a layer and h is the number of layers in G. Since O(V) packings are searched for finding the best solution, the time complexity of the algorithm for the ME2SF problem is O(l^{4}hV). In practice, the value l is not large in most cases, and the proposed algorithm is efficient for the ME2SF problem.
Results
We implemented the dynamic programming algorithm for the ME2SF problem in C++ and tested it on a topdown MS/MS data set of the human histone H4 protein. The experiments were performed on a Linux server with Intel(R) Xeon(R) E52680 2.5 GHz CPU.
Data set
Core histone proteins collected from normal human dermal fibroblasts were separated using a 2dimensional reverse phase hydrophilic interaction liquid chromatography (RPHILIC) system. Histone H4 isolated in the first dimension of the separation was analyzed using an LTQ Orbitrap Velos with a resolution of 60k for MS and MS/MS spectra. In total, 1,626 CID and 1,626 ETD spectra were acquired. Details of the experiment can be found in ref. [9].
Proteoform identification
All topdown tandem mass spectra were deconvoluted using MSDeconv [22]. In the proposed algorithm, the error tolerances for precursor and fragment masses were set to 15 partsper million (ppm); the maximum mass difference between the molecular mass of the unmodified protein sequence and the precursor mass of the spectrum was set to 200 Da; five PTMs were treated as expected ones (Table 1). Spectral deconvolution often introduces Â±1 Da errors into precursor masses of topdown tandem mass spectra. To address this problem, Â±1 Da errors were also allowed in matching precursor masses to the molecular masses of proteoforms.
With a cutoff of 10 matched fragment ions, the proposed algorithm identified 441 spectra matched to single proteoforms and 184 spectra matched to proteoform pairs. The running time was about 25 minutes and the memory requirement was about 32 GB. If the proteoform pair matched to a spectrum provides explanation for many fragment ions that are not explained by one proteoform, it is highly possible that the spectrum is a HomTMT spectrum. For 39 of the 184 spectra, the proteoforms pairs have at least 10 more explained fragment ions than the single high abundance proteoforms in these pairs. In addition, for 26 of the 184 spectra, the proteoform pairs have at least 20% more explained peak intensity than the single high abundance proteoforms.
Comparison with MSAlignE
MSAlignE [9] was employed to align the histone H4 protein with the deconvoluted tandem mass spectra in the data set. With the same error tolerances and expected PTMs described in the previous subsection, MSAlignE identified 1037 proteoformspectrummatches with at least 10 matched fragment ions. The main reason that MSAlignE identified more spectra is that unexpected PTMs are allowed in MSAlignE, but not in the proposed method. The 184 spectra matched to proteoform pairs by the proposed method were all identified by MSAlignE. For these spectra, we compared the single proteoforms reported by MSAlignE and the proteoform pairs reported by the proposed method in the number of matched fragment ions and the explained peak intensities. Compared with MSAlignE, the proposed method increased the number of matched fragment ions by at least 10 for 43 spectra (Fig. 2) and increased the explained peak intensities by at least 20% for 26 spectra, demonstrating that these proteoform pairs better explain the spectra than the single proteoforms.
Parameter selection
The size of the graph generated from a protein sequence and a set of expected PTMs may be huge due to the combination of PTMs. We can reduce the size of the graph by introducing a bound for the sum of mass shifts introduced by PTMs. When the bound increases from 50 Da to 600 Da, the size of the graph generated from the histone H4 protein and the five expected PTMs increases significantly: the number of vertices increases from 606 to 77,246; the number of edges increases from 761 to 124,633 (Fig. 3). The size of the graph is proportional to hq^{t}, where h is the number of layers in the graph, q is the largest number of PTM sites in a proteoform, and t is the number of expected PTM types. The bound for the sum of mass shifts of PTMs is used to limit the number q. In practice, when the number of expected PTM types t is 1 or 2, the size of the graph increases slowly with respect to q and a large bound 1000 or 1500 Da can be used to allow more PTM sites in a proteoform. When the number t is 4 or 5, the size of the graph increases rapidly with respect to q and a small bound 500 or 600 Da should be used to guarantee the efficiency of the algorithm.
Discussion
Because experimental mass spectra contain many noise peaks and miss many fragment peaks, it is not easy to confidently identify more than 2 proteoforms from a HomMTM spectrum. Therefore, we focus on the HomMTM spectral identification problem for k=2. The proposed algorithm in the routing step can be extended to the case with k>2, but the one in the packing step cannot. A trivial algorithm for the packing step is to enumerate all combinations of flow values for the k paths, and the number of candidate packings is O(f^{kâˆ’1}), where f is the total flow value. Coupled with the dynamic programming algorithm for the routing step, the time complexity of the combined method is O(l^{2k}hf^{kâˆ’1}).
Conclusions
We formulated the HomMTM spectral identification problem as the MEkSF problem on graphs and proposed an efficient algorithm for solving the ME2SF problem on layered directed graphs. The experiments on the histone H4 data set demonstrated that the proposed algorithm is capable of identifying many topdown MTM spectra and gives better explanation for these spectra using proteoform pairs.
Abbreviations
 CID:

Collisioninduced dissociation
 Da:

Dalton
 ETD:

Electrontransfer dissociation
 GB:

Gigabyte
 HCD:

Higherenergy collision dissociation
 HetMTM spectrum:

Heterogeneous multiplexed tandem mass spectrum
 HomMTM spectrum:

Homogeneous multiplexed tandem mass spectrum
 LTQ:

Linear trap quadrupole
 ME2SF:

Minimum error 2splittable flow
 MEkSF:

Minimum error ksplittable flow
 MkSF:

Maximum ksplittable flow
 MS:

Mass spectrometry
 MS/MS:

Tandem mass spectrometry
 MTM spectrum:

Multiplexed tandem mass spectrum
 NPhard:

Nondeterministic polynomialtime hard
 PTM:

Posttranslational modification
 ppm:

Partsper million
 RPHILIC:

Reverse phase hydrophilic interaction liquid chromatography
References
DiMaggio Jr PA, Young NL, Baliban RC, Garcia BA, Floudas CA. A mixed integer linear optimization framework for the identification and quantification of targeted posttranslational modifications of highly modified proteins using multiplexed electron transfer dissociation tandem mass spectrometry. Mol Cell Proteomics. 2009; 8:2527â€“43.
Wang J, PerezSantiago J, Katz JE, Mallick P, Bandeira N. Peptide identification from mixture tandem mass spectra. Mol Cell Proteomics. 2010; 9:1476â€“85.
Wang J, Bourne PE, Bandeira N. MixGF: spectral probabilities for mixture spectra from more than one peptide. Mol Cell Proteomics. 2014; 13:3688â€“97.
Cosgrove MS, Wolberger C. How does the histone code work?. Biochem Cell Biol. 2005; 83:468â€“76.
Strahl BD, Allis CD. The language of covalent histone modifications. Nature. 2000; 403:41â€“5.
Distler U, Kuharev J, Navarro P, Levin Y, Schild H, Tenzer S. Drift timespecific collision energies enable deepcoverage dataindependent acquisition proteomics. Nat Methods. 2014; 11:167â€“70.
Rost HL, Rosenberger G, Navarro P, Gillet L, Miladinovic SM, Schubert OT, Wolski W, Collins BC, Malmstrom J, Malmstrom L, Aebersold R. OpenSWATH enables automated, targeted analysis of dataindependent acquisition ms data. Nat Biotechnol. 2014; 32:219â€“23.
Liu X, Sirotkin Y, Shen Y, Anderson G, Tsai YS, Ting YS, Goodlett DR, Smith RD, Bafna V, Pevzner PA. Protein identification using topdown spectra. Mol Cell Proteomics. 2012; 11:111â€“008524.
Liu X, Hengel S, Wu S, ToliÄ‡ N, PaÅ¡aToliÄ‡ L, Pevzner PA. Identification of ultramodified proteins using topdown tandem mass spectra. J Proteome Res. 2013; 12:5830â€“8.
Sun RX, Luo L, Wu L, Wang RM, Zeng WF, Chi H, Liu C, He SM. pTop 1.0: A highaccuracy and highefficiency search engine for intact protein identification. Anal Chem. 2016; 88:3082â€“90.
Kou Q, Xun L, Liu X. TopPIC: a software tool for topdown mass spectrometrybased proteoform identification and characterization. Bioinformatics. 2016; 32:3495â€“7.
Kou Q, Wu S, ToliÄ‡ N, PaÅ¡aToliÄ‡ L, Liu Y, Liu X. A mass graphbased approach for the identification of modified proteoforms using topdown tandem mass spectra. Bioinformatics. 2017; 33:1309â€“16.
Baier G, KÃ¶hler E, Skutella M. On the ksplittable flow problem. In: Algorithms ESA 2002. Lecture Notes in Computer Science. vol. 2461. Berlin Heidelberg: Springer: 2002. p. 101â€“13.
Baier G, KÃ¶hler E, Skutella M. The ksplittable flow problem. Algorithmica. 2005; 42:231â€“48.
Koch R, Spenke I. Complexity and approximability of ksplittable flows. Theor Comput Sci. 2006; 369:338â€“47.
Koch R, Skutella M, Spenke I. Maximum ksplittable s, tflows. Theor Comput Syst. 2008; 43:56â€“66.
Caramia M, Sgalambro A. An exact approach for the maximum concurrent ksplittable flow problem. Optim Lett. 2008; 2:251â€“65.
Caramia M, Sgalambro A. A fast heuristic algorithm for the maximum concurrent ksplittable flow problem. Optim Lett. 2010; 4:37â€“55.
Moradian A, Kalli A, Sweredoski MJ, Hess S. The topdown, middledown, and bottomup mass spectrometry approaches for characterization of histone variants and their posttranslational modifications. Proteomics. 2014; 14:489â€“97.
Yuan ZF, Arnaudo AM, Garcia BA. Mass spectrometric analysis of histone proteoforms. Annu Rev Anal Chem. 2014; 7:113â€“28.
Horn DM, Zubarev RA, McLafferty FW. Automated reduction and interpretation of high resolution electrospray mass spectra of large molecules. J Am Soc Mass Spectrom. 2000; 11:320â€“32.
Liu X, Inbar Y, Dorrestein PC, Wynne C, Edwards N, Souda P, Whitelegge JP, Bafna V, Pevzner PA. Deconvolution and database search of complex tandem mass spectra of intact proteins: a combinatorial approach. Mol Cell Proteomics. 2010; 9:2772â€“82.
Korf RE. A complete anytime algorithm for number partitioning. Artif Intell. 1998; 106:181â€“203.
Funding
This work and publication of this article were supported by the National Institute of General Medical Sciences, National Institutes of Health (NIH) through Grant R01GM118470.
Availability of data and materials
All data used in this paper are available at Massive (http://massive.ucsd.edu) with the identification number MSV000082054.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 19 Supplement 9, 2018: Selected articles from the 13th International Symposium on Bioinformatics Research and Applications (ISBRA 2017): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume19supplement9.
Author information
Authors and Affiliations
Contributions
KZ and XL formulated the HomMTM spectral identification problem and designed the graphbased algorithm. KZ implemented the algorithm and carried out the experiments. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisherâ€™s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1
Supplementary material. (PDF 162 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Zhu, K., Liu, X. A graphbased approach for proteoform identification and quantification using topdown homogeneous multiplexed tandem mass spectra. BMC Bioinformatics 19 (Suppl 9), 280 (2018). https://doi.org/10.1186/s1285901822734
Published:
DOI: https://doi.org/10.1186/s1285901822734