PTM quantification by MS has largely been peptide-centric [30]: proteins are digested into peptides with protease(s), samples are enriched for the modification of interest, and the peptides are detected by liquid chromatography-tandem mass spectrometry (LC-MS/MS). Although the methods available for high-throughput analysis of intact proteoforms are steadily growing [31, 32], most laboratories currently use bottom-up proteomics and peptide-centric quantitation for PTM studies. To develop a robust data analysis workflow for such datasets, we analyzed two distinct experiments featuring (1) UPS1 standards and (2) enrichment of protein cysteines from the yeast proteome. In both experiments, the standards were spiked into *Chlamydomonas* protein extract at different concentrations to determine our ability to correctly identify changing abundances.

### Performance evaluation

To assess the impact of missing data on the differential analysis, we provide a comparative analysis of multiple imputation, contrasting our empirical distribution-based missing data model with random forest imputation [20]. We evaluated the performance of the entire data analysis pipeline using combinations of data processing modules to identify the optimal processing pipeline. The performance evaluation used receiver operating characteristic (ROC) curves to compare our pipeline with a *t*-test, with and without an LFC cut-off, plotting the false positive rate (FPR) on the x-axis against the true positive rate (TPR) on the y-axis (see eq. 7).

$$ \mathrm{FPR}=\frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}};\qquad \mathrm{TPR}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} $$

(7)

where FP = false positives, TN = true negatives, TP = true positives, FN = false negatives.
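Equation 7 maps directly onto code; a minimal Python sketch (the counts below are illustrative, not taken from the datasets):

```python
def fpr_tpr(tp: int, fp: int, tn: int, fn: int) -> tuple[float, float]:
    """Compute false positive rate and true positive rate (eq. 7)."""
    fpr = fp / (fp + tn)  # fraction of true negatives wrongly called significant
    tpr = tp / (tp + fn)  # fraction of true positives recovered
    return fpr, tpr

# Illustrative counts: 400 of 403 spiked-in peptides recovered,
# 50 background features wrongly flagged out of 10,549 negatives
print(fpr_tpr(tp=400, fp=50, tn=10499, fn=3))
```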

We compared our analysis with an FDR-corrected *t*-test using core R [29] functions (*t.test* with default settings and *p.adjust* with the *method = “fdr”* flag). We also included a hybrid decision method in which an LFC cut-off threshold was added to the statistical decision at significance level *alpha* (default 0.05). The *alpha* and LFC cut-offs were each varied ten times for both the pipeline and the *t*-test (*alpha* between 0.05 and 0.001, and LFC between 0 and 2), either separately or both at the same time. For the *t*-test, a single imputed dataset was used.
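The hybrid decision rule (significance *and* fold-change) can be sketched as follows; the paper's analysis was done in R, so this vectorized Python version, including the function name, is our own illustration:

```python
import numpy as np

def hybrid_decision(p_adj: np.ndarray, lfc: np.ndarray,
                    alpha: float = 0.05, lfc_cutoff: float = 1.0) -> np.ndarray:
    """A feature is called significant only if it passes BOTH the
    FDR-adjusted p-value threshold and the absolute LFC cut-off."""
    return (p_adj < alpha) & (np.abs(lfc) > lfc_cutoff)

p_adj = np.array([0.010, 0.040, 0.200, 0.001])
lfc   = np.array([2.5, 0.3, 1.8, -1.5])
hybrid_decision(p_adj, lfc)  # -> [True, False, False, True]
```

Note that the third feature has a large fold change (1.8) but fails the *alpha* criterion, and the second passes *alpha* but fails the LFC cut-off; both are rejected.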

Comparison with random forest imputation, which has been shown to perform well on MS metabolomics data [21], was performed using the R package missForest [33]. Default settings were used except for *maxiter*, which was set to allow 20 iterations. A single dataset was imputed using random forest and processed through *limma* with the same hybrid decision method (LFC cut-off and *alpha*).
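For readers working in Python rather than R, the missForest procedure can be approximated with scikit-learn's `IterativeImputer` wrapped around a random-forest regressor; this is an analogous sketch with toy data, not the exact missForest algorithm:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

# Toy intensity matrix: rows = features, columns = samples; NaN = missing
X = np.array([[10.2, 10.5, np.nan, 10.3],
              [12.1, np.nan, 12.4, 12.0],
              [ 8.9,  9.1,  9.0, np.nan],
              [11.0, 11.2, 11.1, 10.9]])

# max_iter=20 mirrors the maxiter setting used with missForest in the paper
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=20, random_state=0)
X_imputed = imputer.fit_transform(X)  # observed values are left unchanged
```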

### Performance analysis using Chlamydomonas total proteome-UPS1 dataset

The *Chlamydomonas*-UPS1 (hereafter referred to as just UPS1) dataset is a global proteomics dataset containing 10,952 features (after filtering) and three different concentrations of spiked-in UPS1: 25, 50, or 100 fmol per 1 μg *Chlamydomonas* lysate (referred to as 25, 50, and 100, respectively) with four replicates each. In each condition, there were 403 confidently identified UPS1 peptides, which were considered true positives (TP). Since each replicate in each condition was derived from the same technical *Chlamydomonas* lysate, the endogenous *Chlamydomonas* features were considered true negatives (TN).

For the largest difference in concentration (100/25), there was clear discrimination between TP and TN relative to the comparisons between 100/50 and 50/25 (Additional file 3: Figure S1). The comparison with an LFC = 2 had better discrimination between TN and TP and a larger variation of TP (Additional file 3: Figure S1A). At LFC = 1 (100/50 and 50/25), the TP had lower variation but were harder to discriminate from TN (Additional file 3: Figure S1B and S1C). In total, there were 1083 missing values (0.82%) across 471 features (4.3%), for which imputation was performed.

To evaluate the performance of the pipeline, we plotted receiver operating characteristic (ROC) curves by varying the LFC or *alpha* cut-off, or both simultaneously. *Alpha* was set to 0.05 when varying the LFC, while the LFC was set to 0 when varying *alpha*. We started by examining the performance of *limma* with normal or robust regression and found that robust regression outperformed normal regression in all comparisons, achieving a higher TPR when allowing an FPR below 3% (Additional file 4: Figure S2). Accordingly, robust regression was used for all comparisons in the UPS1 dataset. We then compared our pipeline to an FDR-corrected *t*-test (hereafter referred to as *t*-test only) with or without an LFC cut-off. The lower LFC cut-off was set to infinity, so that no feature could be called significant on a negative LFC, as all TP had a positive LFC. We found that for comparisons between conditions with an LFC of 1 (100/50 and 50/25), our pipeline performed equally well in both, while the *t*-test had reduced performance in the comparison of the lower concentrations (50/25, Fig. 2a and b). In the comparison with an LFC of 2 (100/25), both the *t*-test and our pipeline detected more TP, but with slightly more FP; the *t*-test was the more conservative of the two, missing more TP while picking up fewer FP (Fig. 2c). Notably, the LFC criterion was the most important parameter for both our pipeline and the *t*-test, regardless of whether both criteria were varied simultaneously or only the LFC.
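The ROC sweep used to generate such curves can be sketched by holding *alpha* fixed and varying the LFC cut-off; only the upper cut-off is applied here, since all TP had a positive LFC (names and data are illustrative):

```python
import numpy as np

def roc_sweep(p_adj, lfc, is_tp, lfc_grid, alpha=0.05):
    """Return (FPR, TPR) pairs for a grid of LFC cut-offs at fixed alpha."""
    points = []
    for cut in lfc_grid:
        called = (p_adj < alpha) & (lfc > cut)  # upper LFC cut-off only
        tp = (called & is_tp).sum();  fp = (called & ~is_tp).sum()
        fn = (~called & is_tp).sum(); tn = (~called & ~is_tp).sum()
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

p_adj = np.array([0.01, 0.01, 0.01, 0.01])
lfc   = np.array([2.0, 0.5, 1.5, 0.2])
is_tp = np.array([True, False, True, False])
roc_sweep(p_adj, lfc, is_tp, lfc_grid=[1.0])  # -> [(0.0, 1.0)]
```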

### Performance analysis using redox-enriched Chlamydomonas-yeast dataset

The *Chlamydomonas*-yeast dataset is from an enrichment method for proteins bearing reversibly oxidized cysteine residues [21]. It contained six technical replicates of *Chlamydomonas* lysate, where three were spiked with 50 ng yeast proteome per 1 μg *Chlamydomonas* lysate while the other three received 100 ng yeast per 1 μg *Chlamydomonas* lysate. In total, it contained 2229 peptides with previously oxidized Cys, of which 449 were TP yeast features. Compared to the UPS1 dataset, discrimination between TP and TN in this dataset was more difficult due to increased overlap of the TP and TN distributions in the compared conditions (see scatterplots in Additional file 3: Figure S1 and Additional file 5: Figure S3). It had a total of 963 missing values (7.2%) across 435 features (20%), for which imputation was performed.

Similar to the UPS1 analysis, performance on the yeast dataset was evaluated with ROC curves, and we started by comparing whether *limma* performed best with robust or normal regression. We found that normal regression improved performance over robust regression, with robust regression showing, on average, a lower TPR and a higher FPR (Additional file 6: Figure S4). We therefore used *limma* with normal regression for the yeast dataset. Comparing our pipeline with a *t*-test, we first varied either *alpha* or the LFC cut-off, or both at the same time, using values identical to the UPS1 dataset. We found that the *t*-test had a higher TPR and lower FPR at more stringent settings, independent of whether the criteria were changed individually or simultaneously, but it again failed to reach as high a TPR as our pipeline (Fig. 3a). At less stringent settings, our pipeline could identify almost all TP at the expense of a small increase in FP. We then performed imputation on the missing values. For the *t*-test, we generated one imputed dataset, while for our pipeline we performed 100 multiple imputations. Both the *t*-test and our pipeline improved in TP, but both picked up more FP (Fig. 3b). The dynamics of the TPR and FPR were similar between the *t*-test and our pipeline regardless of whether the data were filtered or imputed. Importantly, we found that performing imputation instead of filtering out features with missing values led to a drastic improvement in TP for both the *t*-test and our pipeline. We also found that the parameter with the highest impact was the LFC, consistent with the results from the UPS1 dataset.

We compared our imputation method with random forest imputation (using the *missForest* package [33]). Our multiple-imputation method performed slightly better than random forest imputation (Fig. 4) in terms of the maximum achievable TPR and when allowing a large FPR (above 20%), while performing comparably at lower FPRs. While random forest missing-data algorithms perform better on high-correlation data, their performance generally degrades for data that are missing not at random [34]. Our multiple-imputation method shows increased robustness as a result of binomial testing of the outcome of the linear regression analysis.
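The multiple-imputation-plus-binomial-test idea can be illustrated with a simplified sketch; here a plain *t*-test stands in for the *limma* regression, and missing values are drawn from each group's observed distribution, so this is an analogy to, not a reproduction of, the paper's pipeline:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def mi_binomial_pvalue(a, b, n_imp=100, alpha=0.05):
    """Impute n_imp times, test each completed dataset, then binomial-test
    whether the feature was significant more often than chance (alpha)."""
    hits = 0
    for _ in range(n_imp):
        a_c = np.where(np.isnan(a),
                       rng.normal(np.nanmean(a), np.nanstd(a), a.size), a)
        b_c = np.where(np.isnan(b),
                       rng.normal(np.nanmean(b), np.nanstd(b), b.size), b)
        if stats.ttest_ind(a_c, b_c).pvalue < alpha:
            hits += 1
    return stats.binomtest(hits, n_imp, alpha, alternative="greater").pvalue

a = np.array([10.0, 10.1, np.nan, 10.2])
b = np.array([13.0, 13.1, 13.2, np.nan])
mi_binomial_pvalue(a, b)  # small p-value: significant across imputations
```

A feature that is only significant in a few imputed datasets will fail the binomial test, which is one way the multiple-imputation scheme gains robustness over a single imputation.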

Computationally, our method is inexpensive, since it samples normal distributions to impute missing data. Improved missing-data models can be developed from information on the missingness mechanisms specific to PTM quantification protocols. Stochastic sampling methods [35] can be used for left-censored imputation of MNAR values, in combination with bootstrapping and other statistical resampling methods for MAR/MCAR imputation when the percentage of missing data is expected to be large.
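As an example of such a left-censored MNAR model, missing values can be drawn from a normal distribution down-shifted toward the detection limit; the shift of 1.8 and width of 0.3 observed standard deviations used below are the convention popularized by the Perseus platform, not parameters from this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def impute_left_censored(x, shift=1.8, scale=0.3):
    """Replace NaNs with draws from a normal distribution centred `shift`
    observed-SDs below the observed mean, with width `scale` SDs,
    modelling values missing because they fell below the detection limit."""
    observed = x[~np.isnan(x)]
    mu, sd = observed.mean(), observed.std()
    out = x.copy()
    n_missing = int(np.isnan(x).sum())
    out[np.isnan(x)] = rng.normal(mu - shift * sd, scale * sd, size=n_missing)
    return out

x = np.array([20.1, 21.4, np.nan, 19.8, np.nan, 20.7])
impute_left_censored(x)  # NaNs replaced by low-intensity draws
```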