EnCOUNTer: a parsing tool to uncover the mature N-terminus of organelle-targeted proteins in complex samples
© The Author(s). 2017
Received: 1 July 2016
Accepted: 10 March 2017
Published: 20 March 2017
Characterization of mature protein N-termini by large scale proteomics is challenging. This is especially true for proteins undergoing cleavage of transit peptides when they are targeted to specific organelles, such as mitochondria or chloroplast. Protein neo-N-termini can be located up to 100–150 amino acids downstream from the initiator methionine and are not easily predictable. Although some bioinformatics tools are available, they usually require extensive manual validation to identify the exact N-terminal position. The situation becomes even more complex when post-translational modifications take place at the neo-N-terminus. Although N-terminal acetylation occurs mostly in the cytosol, it is also observed in some organelles such as chloroplast. To date, no bioinformatics tool is available to define mature protein starting positions, the associated N-terminus acetylation status and/or yield for each proteoform. In this context, we have developed the EnCOUNTer tool (i) to score all characterized peptides using discriminating parameters to identify bona fide mature protein N-termini and (ii) to determine the N-terminus acetylation yield of the most reliable ones.
Based on large scale proteomics analyses using the SILProNAQ methodology, tandem mass spectrometry favoured the characterization of thousands of peptides. Data processing using the EnCOUNTer tool provided an efficient and rapid way to extract the most reliable mature protein N-termini. Selected peptides were subjected to N-terminus acetylation yield determination. In an A. thaliana cell lysate, 1232 distinct proteotypic N-termini were characterized of which 648 were located at the predicted protein N-terminus (position 1/2) and 584 were located further downstream (starting at position > 2). A large number of these N-termini were associated with various well-defined maturation processes occurring on organelle-targeted proteins (mitochondria, chloroplast and peroxisome), secreted proteins or membrane-targeted proteins. It was also possible to highlight some protein alternative starts, splicing variants or erroneous protein sequence predictions.
The EnCOUNTer tool provides a unique way to accurately extract the most relevant mature protein N-terminal peptides from large scale experimental datasets. Such data processing allows the identification of the exact N-terminus position and the associated acetylation yield.
Keywords: N-terminal modifications, Protein maturation, Acetylation, Quantitation, Processing tool, Organelle proteins, Transit peptide, Cleavage site
N-terminal acetylation (NTA) is one of the major protein modifications of the eukaryotic cytosol and occurs mainly co-translationally [1, 2]. In plants, most chloroplast proteins are encoded in the nucleus, translated in the cytosol and targeted to the chloroplast by a transit peptide that is cleaved upon arrival inside the organelle. Large scale analyses show that 20–40% of these proteins are N-acetylated in their mature chloroplastic form [3, 4]. Determining the cleavage site of the transit peptide (TP) is still challenging. The cleavage positions of the mitochondrial or plastid TP (mTP or cTP) can be predicted using the TargetP or ChloroP software [5, 6], but the predictions are not always reliable. Although experimental data provide useful information, it remains difficult to identify the true N-terminal peptides amid the multitude of internal peptides identified in a large scale experiment. In addition, determining the NTA yield is difficult with the tools currently available. As an example, Mascot Distiller (MD) allows NTA quantitation using peptides N-terminally labeled with d3- (heavy, H) or d0- (light, L) acetyl. Although this tool was used to define Lys ε-acetylation yield, it was originally designed for differential protein quantitation. The NTA yield for each proteoform (especially for the mitochondrial and plastidic mature proteins) is not readily available and requires additional processing.
Therefore, a new tool is required to extract and combine the data computed by Mascot and Mascot Distiller. The combined outputs must provide a list of the mature N-termini and the associated accurate NTA yields. Although alternative tools such as MaxQuant can perform H/L ratio quantitation, the EnCOUNTer script currently handles no input file formats other than those of Mascot and Mascot Distiller.
The EnCOUNTer tool (Extraction and Calculation Of Unbiased N-Termini) uses a stepwise approach. First, the characterized peptides are scored to discriminate between protein N-termini at position 1–2 and downstream N-termini (DNT) of the protein sequence. This determination is based on a curated experimental dataset. Second, EnCOUNTer recalculates the average NTA yield taking into account the first residue of the characterised mature proteins. Finally, it provides an exhaustive list of the processed N-termini with the recalculated unbiased NTA yield. The EnCOUNTer tool was trained using a manually validated dataset (Additional file 1: Table S1). As a proof of concept, the optimized parameters were applied to a complex Arabidopsis thaliana experimental dataset obtained after enrichment of the mature protein N-termini using the SILProNAQ approach. This experimental dataset provided 584 DNT peptides (related to 383 distinct proteins), of which 338 were quantified for NTA yield. Some of these N-termini (112 hits) were experimentally validated and their positions correlated well with known cleavage sites of signal peptides, mTPs or cTPs (based on UniProtKB/Swiss-Prot annotations). Others (224 hits) were in accordance with transit peptide cleavage site predictions within a range of ± 2 residues. In addition, 648 protein N-termini were characterized at positions 1 or 2 (on the initiator methionine or after its excision), of which 303 were also quantified for protein NTA yield (Additional file 2: Table S2).
Sample preparation and raw data acquisition
Proteins extracted from A. thaliana Col. 0 seedlings were used to perform N-terminus enrichment using SCX chromatography. Briefly, 1 mg of protein was denatured and reduced, followed by cysteine alkylation with iodoacetamide. After cold acetone precipitation, proteins were resuspended in 50 mM NH4HCO3 and digested twice with 1/100 (w/w) TPCK-treated porcine trypsin (Sigma-Aldrich) for 1.5 h at 37 °C. Peptides were desalted with Sep-Pak columns and the retained material was eluted with 80% acetonitrile (ACN), 0.1% TFA and then evaporated to dryness. The collected material was resuspended in Strong Cation eXchange (SCX) LC buffer (5 mM KH2PO4, 30% ACN and 0.05% formic acid) and injected into an Alliance HPLC system using a fluorimeter detector (Waters) equipped with a polysulfoethyl A column (200 × 2.1 mm, 5 μm 200 Å; PolyLC, Columbia, MD). Peptides were eluted with a KCl gradient (SCX-LC buffer B: 350 mM KCl in SCX-LC buffer A; 0–5 min, 0% B; 15–40 min, 5–26% B; 40–45 min, 26–35% B). Fractions were collected every 2 min for 40 min and the solvent was evaporated to dryness before storage at −20 °C. Fractions eluted from SCX columns with retention times of 3 to 22 min were analyzed as previously described 1 with an Easy Nano-LC II (Thermo Scientific) coupled to a LTQ-Orbitrap™ Velos (Thermo Scientific). Finally, data processing usually combines a few acquisition files, i.e. 10 files related to each individual SCX fraction (1 h analysis) and 6 files related to combined fractions (for more details, see ). Furthermore, acquisition files obtained from SCX fractions 5 and 6 were used as training dataset and testing dataset, respectively.
Mascot Distiller/Mascot data processing and *.xml exports
MD extracted data were submitted to Mascot 2.4 software for protein identification and post-translational modification characterization. The database used was “The Arabidopsis Information Resource” (TAIR ver. 10; www.arabidopsis.org). The parent and fragment mass tolerances were 5 ppm and 0.4 Da, respectively. Additionally, carbamidomethylcysteine and d3-acetyl on Lys were defined as fixed modifications and methionine oxidation as a variable modification. Semi-trypsin was defined as the enzyme cleavage rule with up to 6 missed cleavages. The peptide N-terminus acetylation status, i.e. d3-Acetyl (chemically induced modification) or d0-Acetyl (endogenous modification), was investigated using the Mascot quantification option (associated with the MD parameters). These parameters (“Acetylation [MD]” quantification method) are available in Additional file 3. Then, MD uploaded the Mascot processing results and parsed them using relaxed parameters (the minimum peptide identification score was set at 25, the P-value at 0.2, and the peak correlation coefficient, the area fraction coefficient and the precursor standard error at 0.1). Irrelevant and false positive peptide hits generated at this step were filtered out at the final stage of the EnCOUNTer process.
EnCOUNTer also required protein identification data generated by Mascot. These data were automatically exported in xml files with the same MD parameters for the P-value and the Mascot score threshold. Additionally, the “MudPIT Scoring” and “Bold Red peptides” options were selected. These exported files contain all “Protein Hit Information” except pI and Taxonomy ID and all “Peptide Match Information” except the frame number and the unassigned queries.
In short, the EnCOUNTer tool requires the MD exported file, the associated Mascot results and a parameter file. Although the tool can be used with the default parameters provided (Additional file 4), the scoring parameters were optimized here using a relevant training dataset. During the scoring parameter optimization, the EnCOUNTer tool required an additional file containing the list of “curated N-termini” (True / False N-termini; Additional file 1: Table S1). At the end of the optimization, a file containing all optimized values was generated (*.json). This file can be applied to other experimental datasets (from a similar origin) without re-optimizing the scoring scheme.
The EnCOUNTer tool parses the pre-processed data exported from MD and the Mascot identification tool. Mascot matched queries (only Mascot first-ranked peptide sequences) and the associated protein AC were extracted from the MD xml file. Each of these entries was enriched with information, e.g. peptide sequence, starting position, and MD processing results such as H/L ratio and signal quality coefficients. Then, the collected results were complemented with data extracted from the Mascot exported files, such as the peptide identification score and the identification E value. Of note, some peptides were not proteotypic and were shared between a few distinct proteins or, alternatively, between different translational isoforms of the same protein (especially for the TAIR database). The redundancy is noted and these data can easily be removed at will. The shared peptides were also distinctively labelled in the final result list.
N-terminus scoring function
The EnCOUNTer tool should discriminate internal peptides from the mature protein N-termini. Biological features of the TPs of nuclear-encoded mitochondrial/plastidic proteins, such as sequence composition and average length [13–16] (also observed in experimental datasets [3, 17, 18]), highlighted some properties useful to define relevant scoring coefficients (Additional file 5: Figure S1 and Additional file 6: Figure S2). To this end, we defined a scoring function based on six distinct coefficients related to i) peptide “starting position”, ii) residues around the “starting position”, iii) characterized N-terminal modifications, iv) alternative start positions in the vicinity of the “starting position”, v) matched peptide redundancies and finally vi) the “Localization” score. Some of these features could be optimized from the training dataset (such as “starting position” or the “residues around the starting position”) whereas others should be defined by the user to favor or penalize experimental observations (such as “data redundancy” or “multiple transit peptide cleavage sites”).
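The text does not state how the six coefficients are combined into the final score. Because a weight of 1 is described as having a neutral effect, the minimal sketch below assumes that the coefficients are multiplied. The function name, argument names and numerical values are illustrative assumptions and are not taken from the EnCOUNTer script.

```python
def encounter_score(spec, bound=1.0, acetyl=1.0, prox=1.0, rep=1.0, loc=1.0):
    """Combine the six coefficients into one score.

    ASSUMPTION: the coefficients are multiplied, so a weight of 1 leaves the
    score unchanged, a weight above 1 favours the candidate and a weight
    below 1 penalises it. The source does not give the actual formula.
    """
    score = spec
    for weight in (bound, acetyl, prox, rep, loc):
        if weight < 0:
            raise ValueError("weights must be non-negative")
        score *= weight
    return score


# Hypothetical candidate: Spec score of 70 inside the optimal position range
# (Bound weight 2), every other coefficient left neutral.
inside = encounter_score(70.0, bound=2.0)
# Same Spec score, but a starting position far outside the expected range.
outside = encounter_score(70.0, bound=0.1)
```

Under this assumption, a candidate with a good sequence context but an implausible starting position is pushed well below a candidate located in the favoured range.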
Peptide “starting position” score (Bound Score)
Based on the experimental training dataset, EnCOUNTer determines the optimal range (OptiMin and OptiMax) where “true” N-termini are the most frequently distributed. The Matthews Correlation Coefficient (MCC) was determined for all possible combinations of positions between the two endpoints of the N-terminal distribution range of the “True” hits for the DNT candidates (defined as ExpMin and ExpMax). The range with the highest MCC provides the optimum endpoints (OptiMin and OptiMax). This positional range is associated with a scoring weight of 2 to favor the characterization of these N-termini. This calculation was followed by a “K fold cross validation” (using 10 randomized fractions) to determine the robustness of the prediction, and the results were exported in the *.bound file (specifically for the “bound” K fold test) and the *.json file (all optimized values).
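The range search described above can be sketched as an exhaustive scan. This is a minimal illustration, assuming that a candidate is predicted “True” when its starting position falls inside the tested range and that ties are resolved in favour of the narrowest range (the tie rule is an assumption). The function names and the toy positions are hypothetical.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def optimal_range(true_pos, false_pos):
    """Scan every (lo, hi) pair between ExpMin and ExpMax of the "True" hits.

    A candidate is predicted "True" when lo <= position <= hi.
    Returns (OptiMin, OptiMax, best MCC).
    """
    exp_min, exp_max = min(true_pos), max(true_pos)
    best_lo, best_hi, best_score = exp_min, exp_max, -2.0
    for lo in range(exp_min, exp_max + 1):
        for hi in range(lo, exp_max + 1):
            tp = sum(lo <= p <= hi for p in true_pos)
            fp = sum(lo <= p <= hi for p in false_pos)
            fn = len(true_pos) - tp
            tn = len(false_pos) - fp
            score = mcc(tp, tn, fp, fn)
            narrower = (hi - lo) < (best_hi - best_lo)
            # ASSUMPTION: on equal MCC, keep the narrowest range.
            if score > best_score or (score == best_score and narrower):
                best_lo, best_hi, best_score = lo, hi, score
    return best_lo, best_hi, best_score


# Toy starting positions (hypothetical, not taken from Table S1).
true_positions = [3, 25, 30, 42, 55, 60, 71, 106]
false_positions = [4, 9, 110, 250, 400, 800, 1104]
opti_min, opti_max, best_mcc = optimal_range(true_positions, false_positions)
```

With these toy positions, the scan drops the isolated “True” hit at position 3 because keeping it would also admit the two “False” hits at positions 4 and 9.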
Nevertheless, some relevant candidates (experimental “True” N-termini) were still present outside of these optimal values, i.e. between ExpMin and OptiMin or between OptiMax and ExpMax. Since an experimental dataset may differ slightly from the training dataset (from which the ExpMin and ExpMax values were extracted), the ExpMin and ExpMax values were adjusted by the standard deviation observed during the “K fold cross validation”, used as an estimate of the dataset variability (the adjusted values are defined as Min and Max, respectively). Both ranges, i.e. Min/OptiMin and OptiMax/Max, were associated with a scoring weight of 1 (neutral effect on the result), which prevented their elimination at this stage. All other positions are associated with a scoring weight below 1 (e.g. 0.1) to penalize such less biologically relevant positions. Starting positions 1–2 were subjected to a special scoring detailed below.
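The three positional zones and their weights (2, 1 and, for example, 0.1) can be written as a small lookup. This is a sketch only: the function name and signature are hypothetical, and the Min and Max values used in the example are invented, whereas 14 and 78 are the optimum endpoints reported for Fraction 5.

```python
def bound_weight(position, opti_min, opti_max, range_min, range_max,
                 optimal=2.0, neutral=1.0, penalty=0.1):
    """Return the Bound weight of a downstream starting position (> 2).

    Positions 1-2 follow a separate scoring and are not handled here.
    """
    if opti_min <= position <= opti_max:
        return optimal   # region where "True" N-termini are most frequent
    if range_min <= position <= range_max:
        return neutral   # Min..OptiMin or OptiMax..Max: kept, not favoured
    return penalty       # outside Min..Max: penalised


# OptiMin/OptiMax as reported for Fraction 5; Min/Max are hypothetical.
weights = [bound_weight(p, 14, 78, 3, 110) for p in (45, 100, 500)]
```

Keeping the neutral zone separate from the penalised zone means that a plausible but less frequent cleavage position is neither rewarded nor discarded by the Bound coefficient alone.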
Residues around the starting position (“Spec” Score)
Determination of the “Spec” score based on the tMCC determined for each possible residue (Xxx) at the defined position Pi (in the P-n - Pn range).
A “K fold cross validation” (subdivided in 10 subsets) was applied after the optimization step to determine the robustness of the prediction. The “K fold cross validation” result was exported in the *.spec (specifically for the “Spec” K fold test) and *.json (all optimized values).
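As an illustration of the per-residue tMCC, the sketch below computes, for each window position and each residue, the MCC of “this residue is present at this position” as a predictor of a “True” N-terminus, then shifts it by one unit (tMCC = MCC + 1, following the definition of tMCC as the MCC translated by 1 unit). How EnCOUNTer aggregates these values into the final “Spec” score is not stated in the text and is not attempted here. The function names and toy sequence windows are hypothetical.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def tmcc_table(true_windows, false_windows):
    """Return {(index, residue): tMCC}, with tMCC = MCC + 1 (range 0..2).

    Each window is a string of residues spanning P-n..Pn around a candidate
    starting position; all windows must have the same length.
    """
    table = {}
    length = len(true_windows[0])
    residues = set("".join(true_windows + false_windows))
    for i in range(length):
        for res in residues:
            tp = sum(w[i] == res for w in true_windows)
            fp = sum(w[i] == res for w in false_windows)
            fn = len(true_windows) - tp
            tn = len(false_windows) - fp
            table[(i, res)] = mcc(tp, tn, fp, fn) + 1.0
    return table


# Toy windows of four residues (hypothetical sequences).
true_windows = ["RAAS", "RSAT", "KAAS"]
false_windows = ["GLAK", "DEAL", "PKAG"]
table = tmcc_table(true_windows, false_windows)
```

In this toy example, a residue found at the same position in every window is uninformative and receives a tMCC of 1, while a residue enriched in the “True” windows scores above 1 and one enriched in the “False” windows scores below 1.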
N-terminal modifications (Acetyl Score)
Due to the MD processing applied during the peptide identification step, peptide N-terminal modifications are restricted to d0/d3-NTA. Three different situations can occur (d0-NTA, d3-NTA and d0/d3-NTA). It can be useful to treat such peptides differently, especially for the GAP test, where the main goal is to identify the N-terminal acetylated (NTAed) proteins, and to rate them with values higher than 1 to favor the modification or below 1 to penalize it. Characterization of the d0/d3-Ac pair reinforces the legitimacy of such N-termini (the MS/MS spectra related to d0-Ac and d3-NTA could be considered as two independent events) and a score higher than 1 could be applied.
Alternative start positions (Prox score)
With R = user defined weight and m = number of alternative cleavage sites experimentally characterized in the defined window (±5 residues range defined in the “Default parameters”);
Peptide redundancy (Rep Score)
Where K is the score associated to such event (K = 2 is defined in the “Default parameters”) and q = number of experimental occurrences of the investigated starting position;
Localization score (Loc Score)
It is experimentally infrequent [2, 20, 21] to characterize mature protein N-termini both at the N-terminal side of the predicted protein (Pos 1–2) and further downstream in the same sample. Thus, it could be interesting to take advantage of such information to penalize/favor DNT peptides. The weight applied to DNT hits should be defined at will in the configuration file.
Protein N-terminal scoring at position 1–2
Since a negative dataset could not be defined for the N-termini at position 1 and 2, automated optimization of the score is not possible. The “Spec” score for these peptides is set at the optimized “Spec-score-threshold” (automatically defined during parameter optimization) to favor the final NTA quantification of these peptides. Of note, the other scoring coefficients (i.e. characterized N-terminal modification and peptide redundancy) were applied for these positions. Consequently, the final EnCOUNTer score for these peptides (Position 1–2) cannot be compared with the DNT associated scores.
Scoring parameters optimization and calculation
Since a few parameters, such as sample preparation or species, influence the type and number of downstream N-termini (True or False hits), a test sample dataset should be used to optimize the parameters. Alternatively, default parameters are provided for the A. thaliana samples.
This optimization finishes with a “K fold cross validation” to provide some insight into the prediction robustness, and the results are saved in the *.score (specifically for the final EnCOUNTer score K fold test), *.json (all optimized values) and *.param (summary of all parameters) files.
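The K fold cross validation used throughout the optimization can be sketched generically as follows. This is an illustration under stated assumptions, not the EnCOUNTer implementation: the helper names, the `fit` and `evaluate` callables and the toy data are hypothetical, and the folds are simply dealt from a shuffled index list.

```python
import random
import statistics


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and deal them into k nearly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(items, fit, evaluate, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, repeat k times.

    fit(train_items) -> model; evaluate(model, test_items) -> float.
    Returns (mean, standard deviation) of the k test values.
    """
    folds = k_fold_indices(len(items), k, seed)
    values = []
    for held_out in range(k):
        test = [items[i] for i in folds[held_out]]
        train = [items[i] for f, fold in enumerate(folds)
                 if f != held_out for i in fold]
        values.append(evaluate(fit(train), test))
    return statistics.mean(values), statistics.stdev(values)


# Toy usage: items are (score, is_true) pairs and the "model" is a threshold.
items = [(score, score >= 50) for score in range(100)]


def fit(train):
    lowest_true = min(s for s, label in train if label)
    highest_false = max(s for s, label in train if not label)
    return (lowest_true + highest_false) / 2.0


def evaluate(threshold, test):
    correct = sum((s >= threshold) == label for s, label in test)
    return correct / len(test)


folds = k_fold_indices(len(items), k=10, seed=1)
mean_acc, sd_acc = cross_validate(items, fit, evaluate, k=10, seed=1)
```

Reporting the mean together with the standard deviation over the ten held-out folds is what allows values of the form “x ± y” to be quoted for each metric.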
NTA quantification function
EnCOUNTer data export
The final results were exported in a *.csv file providing protein AC’s, the proteotypicity, the starting position, the N-terminal modifications characterised, the mature N-terminal sequence (first 10 residues after the starting position), the EnCOUNTer score, the < H/L > ratio (and deviation), the N-terminus acetylation yield (Average, Min and Max values). An additional file was also exported containing all collected and processed data (EnCOUNTer Intermediary file).
Training and testing dataset
Table 1 Distribution of the manually checked peptides for the training and testing datasets (Fraction 5 and Fraction 6, respectively; based on Additional file 1: Table S1)
Columns: Hits for Fraction 5 | Hits for Fraction 6
Rows for Position 1 and 2: True Protein N-termini | False Protein N-termini
Rows for Position > 2: True downstream N-termini | Ambiguous downstream N-termini | False downstream N-termini
The EnCOUNTer script should be launched in a command prompt window together with the required files (fully described in the help support and the user manual). First, EnCOUNTer determines the optimized scoring parameters using the training dataset (MD and Mascot exported files) and the reference N-terminal list. A few files are exported at the end of the optimization, including the optimized “scoring parameter” (*.json file) and the detailed results of the optimization and “K fold cross validation” (*.bound, *.param, *.score and *.spec). Second, the experimental datasets (MD and Mascot exported files) are scored using the previously optimized parameters to discriminate the mature N-termini and quantify the associated NTA yield. At the end of the process, the EnCOUNTer script provides two distinct files, i.e. the intermediary and final EnCOUNTer results. The intermediary file provides the detailed values used to determine the EnCOUNTer score and the individual NTA quantitation, whereas the final EnCOUNTer file provides the aggregated results per distinct proteoform (EnCOUNTer score and the final NTA yield).
Results and discussion
Training and testing datasets
Two experimental samples were defined as training and testing datasets, i.e. fraction 5 and fraction 6, respectively. The peptides characterized after the Mascot identification step are filtered using Mascot-associated values, namely the peptide E-value and the minimum Mascot score defined in the configuration files. These thresholds should be adapted to reach a 1% False Discovery Rate (FDR) at the peptide level. Applying these thresholds, false positive identifications for the expected N-terminal peptide (position 1 and 2) were infrequent (Table 1). As an example, no false candidate was characterized in Fraction 5 and only one probable false hit was listed in Fraction 6 (Additional file 1: Table S1). For these starting positions, the associated localizations were mainly the cytosol (49 hits), the membrane/vacuole (17 hits), the peroxisome (6 hits) or the mitochondria (without mTP, 5 hits). Only one plastidic protein (AT2G44640.1) was characterized with a mature N-terminus at position 2. This infrequent but not unusual chloroplastic N-terminus was confirmed experimentally and reported in PPDB. The characterized N-termini at position 1–2 corresponded well to the expected cytoplasmic localizations.
Additionally, 595 peptides were characterized with downstream starting positions (Start position > 2). These hits were sorted into True N-termini (mature protein N-termini; 203 hits), False N-termini (erroneous mature N-termini; 329 hits) and ambiguous N-termini (mainly poor MS/MS spectra quality or inconsistencies with previous biological and experimental facts; 63 hits). Only the True/False candidates were used during the EnCOUNTer training step. The main subcellular localization is the chloroplast with 73% of the candidates (149 hits) for the “True” dataset. Other locations such as cytosol, membrane or mitochondria were also found (21, 7 and 5%, respectively). In contrast, the “False” dataset exhibits random locations, and similar distributions were also observed in the Fraction 6 dataset (Table 1 and Additional file 1: Table S1). These two manually curated datasets (Fraction 5 and 6) were used during the EnCOUNTer training and testing steps, respectively.
N-terminus scoring optimization
Residues around the starting position (“Spec” Score)
Table 2 Results of the automated optimisation of the Bound and Spec parameters using the K fold cross validation result (n = 10) and the final scoring scheme using the same validation approach

Dataset or Scoring Scheme | K fold subset | OptiMin | OptiMax | EnCOUNTer or Spec threshold | TP | TN | FP | FN | Accuracy | Sensitivity | Specificity | False Discovery Rate | MCC
Spec (True dataset) | training | - | - | > 62.9 ± 2.0 | 162 ± 3 | 288 ± 3 | 7 ± 2 | 21 ± 2 | 94.0 ± 0.3% | 88.5 ± 1.0% | 97.4 ± 0.5% | 4.6 ± 0.9% | 0.87 ± 0.01
Spec (True dataset) | testing | - | - | - | 16 ± 2 | 31 ± 3 | 2 ± 2 | 4 ± 2 | 88.5 ± 4.1% | 78.7 ± 7.3% | 94.9 ± 4.2% | 9.3 ± 7.7% | 0.67 ± 0.11
Bound (True dataset) | training | 17 ± 4 | 80 ± 6 | - | 164 ± 4 | 241 ± 6 | 56 ± 6 | 19 ± 3 | 84.3 ± 1.5% | 89.8 ± 0.7% | 81.2 ± 1.8% | 25.3 ± 1.6% | 0.71 ± 0.01
Bound (True dataset) | testing | - | - | - | 18 ± 3 | 26 ± 4 | 7 ± 3 | 3 ± 1 | 82.9 ± 5.5% | 87.7 ± 4.3% | 80.0 ± 7.8% | 26.6 ± 8.7% | 0.67 ± 0.09
All data together
Spec / Bound / Prox (True dataset) | training | 17 ± 4 | 80 ± 6 | > 129.9 ± 0.6 | 167 ± 4 | 293 ± 5 | 3 ± 1 | 16 ± 2 | 96.1 ± 0.6% | 91.2 ± 0.6% | 99.1 ± 0.2% | 1.6 ± 0.3% | 0.92 ± 0.01
Spec / Bound / Prox (True dataset) | testing | - | - | - | 19 ± 4 | 33 ± 5 | 0 ± 1 | 2 ± 2 | 95.9 ± 2.9% | 91.1 ± 5.3% | 98.7 ± 2.3% | 1.9 ± 3.2% | 0.91 ± 0.06
Spec / Bound / Prox (False dataset) | training | 86 ± 9 | 300 ± 1 | < 69.3 ± 6.1 (*) | 272 ± 5 | 108 ± 4 | 74 ± 3 | 24 ± 4 | 79.5 ± 0.5% | 92.0 ± 1.2% | 59.1 ± 1.7% | 21.4 ± 0.6% | 0.59 ± 0.01
Spec / Bound / Prox (False dataset) | testing | - | - | - | 30 ± 3 | 12 ± 3 | 9 ± 3 | 3 ± 3 | 78.0 ± 4.1% | 90.5 ± 6.7% | 58.1 ± 10.1% | 22.0 ± 5.4% | 0.55 ± 0.08
Fraction 5 dataset: True dataset (Spec only); True dataset (Spec, Bound); True dataset (Spec, Bound, Prox); False dataset (Spec Only); False dataset (Spec, Bound, Prox); False dataset (Stringent params)
Fraction 6 dataset: True dataset (Spec, Bound, Prox)
Finally, a K fold cross validation (k = 10) was performed to determine the robustness of this approach. The accuracy reached 88.5 ± 4.1% and the specificity 94.9 ± 4.2%, with 9.3 ± 7.7% FDR (Table 2 and Additional file 8: Table S3). Although additional features should be used to prevent the loss of “True” hits, the results obtained using only the “Spec” score are extremely promising.
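For reference, the quality metrics quoted here follow from the four confusion-matrix counts using the standard definitions. The sketch below assumes that the counts 162, 288, 7 and 21 listed for the Spec score are the true positives, true negatives, false positives and false negatives, in that order; this reading is an assumption. The function name is hypothetical.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),      # true positive rate
        "specificity": tn / (tn + fp),      # true negative rate
        "fdr": fp / (tp + fp),              # false discovery rate
        "mcc": (tp * tn - fp * fn) / denom,
    }


# ASSUMPTION: counts read as TP, TN, FP, FN for the Spec score.
metrics = confusion_metrics(tp=162, tn=288, fp=7, fn=21)
```

With these counts the function returns roughly 94.1% accuracy, 88.5% sensitivity, 97.6% specificity, 4.1% FDR and an MCC of 0.88, close to the averaged values reported for the Spec score.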
Peptide “starting position” score (Bound Score)
For most proteins, the mature N-terminus is located on the first two residues of the protein sequence (position 1–2). Nevertheless, some protein N-termini are located further downstream (Position > 2). For example, the mTP cleavage position is expected between positions 20–70, whereas the cTP cleavage position of A. thaliana nuclear-encoded proteins is expected between positions 40–70 [16, 33]. In our training dataset (Additional file 1: Table S1), the downstream starting positions were distributed from position 3 to 106 (defined as ExpMin and ExpMax) for the validated candidates (“True” dataset) vs. 4 to 1104 for the irrelevant candidates (“False” dataset).
Interestingly, a few distinct TP regions (Additional file 5: Figure S1) could be highlighted and are associated with proteins carrying a signal peptide (positions 20 to 35 [34, 35]), a mitochondrial TP (positions 25 to 65 [16, 28]) or a plastidic TP (positions 30 to 95). In comparison, the starting positions of the “False” candidates were evenly distributed. It is therefore useful to favor or penalize selected regions depending on the protein training set. This allows the EnCOUNTer tool to define, from the training dataset, the optimum range where mature N-terminal positions are characterized. The optimum range for the Fraction 5 dataset lies between positions 14 and 78, with 84.4% accuracy and 80.9% specificity at 25.6% FDR. The associated “K fold cross validation” (K = 10) highlights the robustness of this determination (Additional file 8: Table S3). This parameter cannot be used alone but always in combination with, at least, the Spec score. When combining the “Spec” and “Bound” scores, 95.1% accuracy and 99.1% specificity with 1.6% FDR are reached on the Fraction 5 training dataset. This combination clearly improved the EnCOUNTer discrimination power compared to the “Spec” score alone.
Influence of the other scoring coefficients
By default, reliable predictions are reached using the Spec score and the Bound score together (95.1% accuracy at 1.6% FDR on the training dataset). Nevertheless, the prediction specificity or sensitivity can be improved using the additional coefficients Acetyl, Prox, Rep and/or Loc, depending on the coefficient applied (data not shown). As an example, the combination of the Spec, Bound and Prox coefficients provides a final 96.1% accuracy and 99.1% specificity with 1.6% FDR. The associated K fold cross validation (k = 10) provided 95.9 ± 2.9% accuracy and 98.7 ± 2.3% specificity at 1.9 ± 3.2% FDR (Table 2 and Additional file 8: Table S3).
Although the overall accuracy could be improved using different scoring combinations, this was usually detrimental to the sensitivity. Depending on the goal (sensitivity, accuracy, specificity), scoring coefficient combinations could be adapted to reach better results than those provided in Table 2, i.e. better accuracy or better sensitivity. In our hands, the combination of the scoring coefficients Spec, Bound and Prox provides a good starting compromise (Table 2) that could be optimized at will. These optimized parameters were applied to the Fraction 6 testing dataset and discriminated the N-termini with 91.3% sensitivity and 98.2% specificity (4.3% FDR).
Protein N-terminal Acetylation quantitation
As previously mentioned, MD provides protein NTA quantitation without distinguishing the multiple proteoforms of a protein. This is the case for protein At1g16080.1, for which four distinct N-terminal positions could be characterized (positions 42, 44, 45 and 48; Additional file 1: Table S1). MD gave a single NTA yield of 35.5% (Min = 0.4%; Max = 98.98%) whereas EnCOUNTer provided four distinct NTA yields, i.e. 100.0% (Min: 99.8%; Max: 100.0%), 29.5%, 42.4% (Min: 42.0%; Max: 42.8%) and 2.1%, respectively, one for each proteoform. Another frequent MD processing error is the aggregation of H/L values associated with internal peptides. As an example, the MD quantification of the At2g16600.1 protein combines the NTA yields associated with positions 2 and 21 for a final NTA yield of 99.2% (Min = 26.5%; Max = 100.0%), whereas EnCOUNTer quantifies only the N-terminus at position 2 with 99.8% NTA (Min = 98.5%; Max = 100.0%). Furthermore, the EnCOUNTer score of the peptide starting at position 21 is below the EnCOUNTer threshold and this peptide is not considered as a significant N-terminus. EnCOUNTer thus discriminates the different N-termini and provides an NTA yield for each of them, with an error range below 1% on average (Additional file 1: Table S1).
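The effect of grouping measurements by proteoform can be illustrated with a short sketch. It assumes that H is the d3-acetyl signal (chemically labelled, hence originally free N-termini) and L the d0-acetyl signal (endogenous acetylation), so that the yield of one measurement is L / (H + L) = 1 / (1 + H/L). How EnCOUNTer averages the individual values and corrects for the first residue is not reproduced here. The function names, protein identifier and ratios are hypothetical.

```python
def nta_yield(h_over_l):
    """Fraction of endogenously acetylated N-termini from one H/L ratio.

    ASSUMPTION: H is the d3-acetyl (chemically labelled) signal and L the
    d0-acetyl (endogenous) signal, so yield = L / (H + L) = 1 / (1 + H/L).
    """
    if h_over_l < 0:
        raise ValueError("H/L ratio must be non-negative")
    return 1.0 / (1.0 + h_over_l)


def yields_per_proteoform(measurements):
    """Group (protein, start position, H/L) tuples and summarise each group.

    Returns {(protein, position): (average, minimum, maximum)} as fractions.
    """
    grouped = {}
    for protein, position, ratio in measurements:
        grouped.setdefault((protein, position), []).append(nta_yield(ratio))
    return {key: (sum(v) / len(v), min(v), max(v))
            for key, v in grouped.items()}


# Hypothetical protein with two mature N-termini (positions 42 and 48).
measurements = [
    ("PROT_A", 42, 0.0),   # fully acetylated
    ("PROT_A", 42, 0.0),
    ("PROT_A", 48, 3.0),   # 25% acetylated
    ("PROT_A", 48, 1.0),   # 50% acetylated
]
per_proteoform = yields_per_proteoform(measurements)
pooled = sum(nta_yield(r) for _, _, r in measurements) / len(measurements)
```

In this toy case the pooled value (about 69%) describes neither proteoform, whereas the grouped values separate a fully acetylated N-terminus at position 42 from a partially acetylated one at position 48.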
Example of application
As an example of application, our whole experimental dataset (N-terminus enriched fraction from A. thaliana leaf lysate) was processed using the optimized EnCOUNTer parameters. The parameters used were based on the results obtained during the optimization phase (Table 2), i.e. the combination of the Spec, Bound and Prox coefficients. In total, 3964 potential N-termini were listed, of which 1554 had an EnCOUNTer score higher than the threshold (EnCOUNTer Threshold = 130.1). After the removal of the non-proteotypic peptides, 1257 probable mature N-termini were listed, of which 649 were located at position 1–2 and 608 at positions downstream of the protein N-terminus (Position >2). The NTA yield was determined for 594 N-termini (excluding non-proteotypic N-termini), of which 275 were located at position 1–2 and 319 were associated with DNT (Additional file 2: Table S2).
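The selection steps described in this paragraph (score threshold, removal of non-proteotypic peptides, split between position 1–2 and downstream positions) can be summarised in a few lines. This is a sketch only: the dictionary keys, the function name and the candidate values are hypothetical, and only the threshold of 130.1 comes from the text.

```python
def select_n_termini(candidates, threshold):
    """Keep proteotypic candidates scoring above the threshold.

    Each candidate is a dict with the (hypothetical) keys "score",
    "proteotypic" and "start". Returns (position 1-2 hits, downstream hits).
    """
    kept = [c for c in candidates
            if c["score"] > threshold and c["proteotypic"]]
    protein_nt = [c for c in kept if c["start"] <= 2]
    downstream = [c for c in kept if c["start"] > 2]
    return protein_nt, downstream


# Hypothetical candidates; 130.1 is the EnCOUNTer threshold quoted above.
candidates = [
    {"protein": "P1", "start": 2, "score": 150.0, "proteotypic": True},
    {"protein": "P2", "start": 45, "score": 140.0, "proteotypic": True},
    {"protein": "P3", "start": 60, "score": 90.0, "proteotypic": True},
    {"protein": "P4", "start": 33, "score": 200.0, "proteotypic": False},
]
protein_nt, downstream = select_n_termini(candidates, 130.1)
```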
In addition to these expected N-termini, 608 downstream N-termini could be characterized with an EnCOUNTer score higher than the threshold, of which 319 were quantified for NTA. The pattern of the DNT-NTA yield, with 8% of the downstream N-termini fully acetylated (>95%), 25% partially NTAed and 67% not acetylated (<5%), was clearly different (Fig. 3b) from the protein NTA profile (Fig. 3a). The subcellular distribution (Fig. 3c) was also strongly modified, and 73% of the downstream N-termini were associated with plastidic proteins. Additionally, DNT also revealed mitochondrial N-termini (13%) resulting from mTP excision and alternative maturation of peroxisomal proteins (e.g. At2g33150.1), membrane proteins (e.g. At3g06035.1 or At5g19250.1) or vacuolar proteins (At5g60360.1 or At2g23000). As previously observed for Pos 1–2, some of the SUBA subcellular localisations were erroneous, e.g. cytosolic localisation for At1G12900 or At4g26300 while they are localized in the chloroplastic stroma. Some of the DNTs could also be a consequence of alternative splicing or an alternative start position (e.g. At1g66240), or of errors in the gene starting position (At1g23820). Most of the 608 downstream N-termini highlighted by the EnCOUNTer tool were clearly due to protein maturation processes. This result confirms the added value of EnCOUNTer to highlight mature protein N-termini in complex peptide mixtures.
From a few thousand experimentally characterised N-termini, the EnCOUNTer tool is able to extract the most relevant mature protein N-termini with 96.1% accuracy and 99.1% specificity on the training dataset (91.3% sensitivity and 98.2% specificity using the Fraction 6 dataset). Furthermore, the EnCOUNTer tool is able to provide a reliable NTA yield for each distinct proteoform, both at the expected protein N-terminus (Pos 1–2) and downstream.
Applied to a large experimental dataset, the EnCOUNTer tool was able to characterize more than 1200 N-termini, of which almost 600 were quantified for NTA yield. The characterised DNT could be associated with different maturation processes, including the targeting of nuclear-encoded proteins to various organelles (e.g. mitochondria, chloroplast or peroxisome), cytosolic maturations involving transient targeting peptides (e.g. membrane or secreted proteins) or erroneously assigned protein starts. This tool provides a unique way to determine the experimental position of the mature protein N-terminus and the NTA yield for a few hundred up to thousands of candidates. It is especially useful to determine accurately and rapidly the influence of various stresses on protein N-terminal status and N-terminal modification yield.
Plastid transit peptide
endogenous acetyl group
deuterated acetyl group
False Discovery Rate
Matthews Correlation Coefficient
Mitochondrial transit peptide
Strong Cation eXchange
translated (by 1 unit upper) Matthews Correlation Coefficient
This study has benefited from the facilities and expertise of the SICaPS platform of I2BC (Institute for Integrative Biology of the Cell). The authors also thank A. Estreicher and Y. Vandenbrouck for their advice and expertise.
This work was supported by the French National Research Agency (grant No. ANR-13-BSV6-0004) and is directly associated with the development of the EnCOUNTer tool. JD benefited from the support of the “LabEX Saclay Plant Sciences-SPS” (ANR-10-LABX-0040-SPS) for a stipend. The Lidex BIG (https://www.universite-paris-saclay.fr/fr/recherche/projet/lidex-big) financed the experimental analyses that provided the raw material used for the development and optimisation of the EnCOUNTer tool.
Availability of data and materials
The EnCOUNTer script implemented in Python 2.7.11 is freely available at https://mycore.core-cloud.net/public.php?service=files&t=4de3e3e100c6c3ba94114947c9ff929f. The mass spectrometry proteomics data and supplementary files supporting the conclusions of this article have been deposited to the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the PRIDE partner repository with the dataset identifier PXD005720.
WB was involved in the acquisition of data, the conception and design of the EnCOUNTer tool, and the analysis and interpretation of data. JPS and JD developed the Python script. WIB, CG and TM were involved in drafting the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.