Comparison of methods for the detection of outliers and associated biomarkers in mislabeled omics data

Background Previous studies have reported that labeling errors are not uncommon in omics data. Potential outliers may severely undermine the correct classification of patients and the identification of reliable biomarkers for a particular disease. Three methods have been proposed to address the problem: sparse label-noise-robust logistic regression (Rlogreg), robust elastic net based on the least trimmed square (enetLTS), and Ensemble. Ensemble is an ensemble classification approach based on distinct feature selection and modeling strategies. The accuracy of biomarker selection and outlier detection of these methods needs to be evaluated and compared so that the appropriate method can be chosen. Results The accuracy of variable selection, outlier identification, and prediction of the three methods (Ensemble, enetLTS, Rlogreg) was compared on simulated datasets and an RNA-seq dataset. On simulated datasets, Ensemble had the highest variable selection accuracy, as measured by a comprehensive index, and the lowest false discovery rate among the three methods. When the sample size was large and the proportion of outliers was ≤5%, the positive selection rate of Ensemble was similar to that of enetLTS. However, when the proportion of outliers was 10% or 15%, Ensemble missed some variables that affected the response variables. Overall, enetLTS had the best outlier detection accuracy, with false positive rates < 0.05 and high sensitivity, and enetLTS still performed well when the proportion of outliers was relatively large. With 1% or 2% outliers, Ensemble showed high outlier detection accuracy, but with higher proportions of outliers Ensemble missed many mislabeled samples. Rlogreg and Ensemble were less accurate in identifying outliers than enetLTS. The prediction accuracy of enetLTS was better than that of Rlogreg. Running Ensemble on a subset of data after removing the outliers identified by enetLTS improved the variable selection accuracy of Ensemble.
Conclusions When the proportion of outliers is ≤5%, Ensemble can be used for variable selection. When the proportion of outliers is > 5%, Ensemble can be used for variable selection on a subset after removing outliers identified by enetLTS. For outlier identification, enetLTS is the recommended method. In practice, the proportion of outliers can be estimated according to the inaccuracy of the diagnostic methods used.

Kurnaz et al. [11] proposed the robust elastic net based on the least trimmed square (enetLTS), which is a robust method for linear and logistic regression based on the EN penalty. The least trimmed square (LTS) is used to obtain robust results, and the EN penalty allows for variable selection in high-dimensional sparse settings. The trimming procedure is highly robust, but also leads to a loss in efficiency, and therefore, a reweighted step is considered in enetLTS.
To detect outliers in high-dimensional datasets that include mislabeled samples, Lopes et al. [7] proposed Ensemble, an ensemble classification approach based on distinct procedures, including logistic regression with elastic net (EN) regularization, sparse partial least squares (PLS)-discriminant analysis (SPLS-DA), and sparse generalized PLS (SGPLS). Cook's distance (Cook's D) is used to evaluate each sample's outlierness. A final consensus list of observations sorted by their outlierness level is obtained using rank product statistics corrected for multiple testing.
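The consensus ranking step can be illustrated with a minimal sketch. It assumes that each model supplies one outlierness score per sample (for example, Cook's D) and combines the per-model ranks by their geometric mean, which is the usual form of the rank product. The function name `rank_product` and the toy scores are hypothetical, and the significance test and multiple-testing correction used by Ensemble are not reproduced here.

```python
import math

def rank_product(scores_by_model):
    """Geometric mean of per-model ranks (rank 1 = most outlying sample).

    scores_by_model: one list of outlierness scores per model, all in the
    same sample order. Larger scores are assumed to mean "more outlying".
    """
    n = len(scores_by_model[0])
    log_sum = [0.0] * n
    for scores in scores_by_model:
        # Sort sample indices from the most to the least outlying.
        order = sorted(range(n), key=lambda i: -scores[i])
        for rank, i in enumerate(order, start=1):
            log_sum[i] += math.log(rank)
    k = len(scores_by_model)
    return [math.exp(s / k) for s in log_sum]

# Toy scores for three samples under three hypothetical models.
toy_scores = [[0.9, 0.1, 0.5], [0.8, 0.2, 0.1], [0.7, 0.1, 0.3]]
rp = rank_product(toy_scores)  # smaller rank product = stronger consensus outlier
```

Sample 0 is ranked first by all three toy models, so its rank product is 1, the smallest possible value.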
The three methods Rlogreg, enetLTS, and Ensemble are described in detail in section 1 of Additional File 1. For variable selection in a sparse high-dimensional setting, all three methods use regularization of the parameters. Rlogreg uses the L1 penalty and enetLTS uses the EN, where the regularizer is a linear combination of the L1 and L2 penalties. The EN tends to select more variables and groups of correlated variables than the L1 penalty. In Ensemble, the combination of L1 and L2 penalties is set in the EN, SPLS-DA, and SGPLS models. For robustness, in Rlogreg, Bootkrajang et al. [4] set label-flipping probabilities as parameters in the maximum likelihood estimator to model the proportion of mislabeled samples. In enetLTS, Kurnaz et al. [11,12] applied the least trimmed square (LTS) to the EN and used the C-step algorithm to identify the optimal subset without outliers. In Ensemble, the models are fitted on the original datasets with outliers, without considering robustness. In Rlogreg, the detected outliers are the misclassified observations, that is, observations with response y = 1 that are predicted as 0, or observations with y = 0 that are predicted as 1. In enetLTS, outliers are detected using large Pearson residuals in the reweighted step. In Ensemble, Cook's D is used to evaluate outlierness, and the consensus ranking of outlierness is obtained using the rank product test.
These three methods, which address label errors in high-dimensional data, are based on different principles. Their pros and cons need to be explored so that the appropriate method can be chosen when dealing with high-dimensional datasets with mislabeled samples. Hence the accuracy of variable selection and outlier detection of these methods needs to be evaluated. The evaluation and exploration of these methods can also provide guidance for improving the results in the next step.
This article is organized as follows. In the Results section, we present the results of simulation studies that were conducted to evaluate the performance of the three methods, including the accuracy of variable selection, outlier detection, and prediction. The three methods were also compared by applying them to a TNBC dataset. The results are then discussed and conclusions are drawn. The simulation settings and performance measures are described in the Methods section.

Results
Simulation results for the comparison of the three methods
In this section, we present a simulation study to investigate the performance of Rlogreg, Ensemble, and enetLTS. The accuracy of variable selection, outlier identification, and prediction, and the running time of the three methods are compared in scenarios with various sample sizes, dimensions, and proportions of outliers. Details of the simulation setting and performance measures are presented in the last section.
The results of running the three methods on simulated datasets are shown in Figs. 1-5. Section S2 (Table S2-1 to S2-4) of Additional File 1 gives the results in more detail, including the accuracy of variable selection, outlier detection, and prediction.
The accuracy of variable selection in the high-dimensional setting was measured using the positive selection rate (PSR) and the false discovery rate (FDR) [13]. PSR is the proportion of truly disease-related biomarkers that are selected, and FDR is the proportion of selected biomarkers that are not related to the disease. We used a comprehensive indicator, GM [13,14], which is the geometric mean of PSR and (1 − FDR), to measure the accuracy of variable selection. A high PSR and a low FDR give a high GM, which indicates high accuracy of variable selection.
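As an illustration of these three indicators, the following minimal sketch computes PSR, FDR, and GM from a set of selected variables and a set of truly associated variables. The function name `selection_accuracy` and the toy index sets are hypothetical; GM is taken as the square root of PSR × (1 − FDR), the geometric mean of the two quantities.

```python
import math

def selection_accuracy(selected, true_vars):
    """Return (PSR, FDR, GM) for one variable selection result."""
    selected, true_vars = set(selected), set(true_vars)
    true_positives = len(selected & true_vars)
    # PSR: share of the truly associated variables that were selected.
    psr = true_positives / len(true_vars)
    # FDR: share of the selected variables that are not truly associated.
    fdr = (len(selected) - true_positives) / len(selected) if selected else 0.0
    # GM: geometric mean of PSR and (1 - FDR).
    gm = math.sqrt(psr * (1.0 - fdr))
    return psr, fdr, gm

# Toy example: 30 true variables, of which 24 are selected, plus 1 false selection.
true_vars = range(30)
selected = list(range(24)) + [100]
psr, fdr, gm = selection_accuracy(selected, true_vars)
# psr = 0.8, fdr = 0.04, gm = sqrt(0.8 * 0.96)
```

Because GM multiplies the two terms, a method with a moderate PSR but a very low FDR can still reach a higher GM than a method with a high PSR and a high FDR.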
The PSRs and FDRs of the three methods when n = 100 and p = 200 and 1000 are shown in Fig. 1. Ensemble had the lowest FDR among the three methods, and the Ensemble PSR was lower than the enetLTS PSR. The PSR and FDR of enetLTS were both high. Rlogreg had the lowest PSR among the three methods, and its FDR was high. The GMs indicated that Ensemble had the highest variable selection accuracy, followed by enetLTS, and then Rlogreg. The PSRs generally decreased when the proportion of outliers increased, whereas the enetLTS PSRs did not change much.
The PSRs and FDRs of the three methods when n = 500 and p = 200 and 1000 are shown in Fig. 2. When the sample size was increased from n = 100 to n = 500, the variable selection accuracy improved for all three methods, and improved the most for Ensemble. When the proportion of outliers was ≤5%, the Ensemble PSR was close to the enetLTS PSR (approximately 0.8) and the Ensemble FDR was very low (about 0.01), whereas the enetLTS FDR was much higher. These results show that with ≤5% outliers, the Ensemble variable selection accuracy was very high. However, with 10% and 15% outliers, the Ensemble PSRs decreased to about 0.6 and 0.5, respectively, whereas the enetLTS PSRs decreased very little with the higher proportions of outliers; with 15% outliers, the enetLTS PSR was about 0.7. The Rlogreg FDR was similar to the enetLTS FDR, but the Rlogreg PSR was much lower than the enetLTS PSR, so the Rlogreg variable selection accuracy was the lowest among the three methods.
These results show that Ensemble had the highest variable selection accuracy, as measured by the GM, among the three methods, mainly because its FDR was much lower than the FDRs of the other two methods. The Ensemble variable selection accuracy was better with the large sample size. When the sample size was n = 500 and the proportion of outliers was ≤5%, the Ensemble and enetLTS PSRs were similar. However, when the proportion of outliers was 10% or 15%, the Ensemble and enetLTS PSRs were quite different, which implies that Ensemble may miss some variables that affect the response variables whereas the enetLTS PSRs decreased very little.
The outlier detection accuracy of the three methods is shown in Fig. 3. Here we used two indicators from screening tests, sensitivity (Sn) and false positive rate (FPR) [13]. The outliers to be identified are regarded as the patients to be detected in a screening test. Sn represents the proportion of truly mislabeled samples that are identified as outliers. FPR represents the proportion of correctly labeled samples that are determined to be mislabeled. enetLTS and Rlogreg had the highest Sn, but the Rlogreg FPRs were higher (and > 0.05 in some cases) than the enetLTS FPRs, which were all within 0.05. Ensemble had the lowest Sn and FPRs among the three methods. When the proportion of outliers was 1% or 2%, the Ensemble and enetLTS Sns were similar, but with ≥5% outliers, the differences between the Sns began to increase. With 15% outliers, the Ensemble Sn dropped to about 0.25. When the sample size was n = 500, the enetLTS Sn increased by 10 to 20% compared with its value for the smaller sample size (n = 100). The enetLTS Sn decreased slightly with the higher proportions of outliers, but the decrease was relatively small.
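The two outlier detection indicators can be sketched as follows, using the standard screening-test definitions: Sn is the share of truly mislabeled samples that are flagged, and FPR is the share of correctly labeled samples that are flagged. The function name `detection_accuracy` and the toy indices are hypothetical.

```python
def detection_accuracy(flagged, mislabeled, n_samples):
    """Return (Sn, FPR) for one outlier detection result.

    flagged:    indices of samples reported as outliers
    mislabeled: indices of samples whose labels are truly wrong
    n_samples:  total number of samples
    """
    flagged, mislabeled = set(flagged), set(mislabeled)
    true_positives = len(flagged & mislabeled)
    false_positives = len(flagged - mislabeled)
    sn = true_positives / len(mislabeled)
    fpr = false_positives / (n_samples - len(mislabeled))
    return sn, fpr

# Toy example: 100 samples, 10 mislabeled; 8 of them are flagged plus 2 clean samples.
sn, fpr = detection_accuracy(list(range(8)) + [50, 51], range(10), 100)
# sn = 0.8, fpr = 2 / 90
```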
These results show that, overall, enetLTS had the highest outlier detection accuracy among the three methods. The enetLTS Sn was high and the FPRs were all within 0.05, even when the proportion of outliers was relatively large. With 1% or 2% outliers, Ensemble had high outlier detection accuracy, but with higher proportions of outliers many outliers were missed. Although the Rlogreg Sn was high, the FPRs also were high and sometimes exceeded 5%.
The prediction accuracy of the three methods is shown in Fig. 4. Their predictive performances were evaluated by calculating the misclassification rate (MR) on test datasets without outliers. The enetLTS MR was lower than the Rlogreg MR in most cases. Because the Ensemble variables are the intersection of the variables selected by the three models EN, SPLS-DA, and SGPLS, the Ensemble MR could not be computed.
A leverage point is an outlier in the space of the independent variables; for example, a sample whose gene expression data deviate from those of most samples. We examined the effect of mislabeled observations on the methods when their independent variables also deviate from those of most observations. Because gene expression data are quality controlled before they are analyzed, samples are unlikely to deviate greatly. The degree of deviation was therefore set to 3; that is, the independent variables of the mislabeled samples followed independent normal distributions N(3,1). The results of the three methods on the simulated data were similar with and without leverage points. Table S2-6 of Additional File 1 gives the results in more detail.
To make the scenario closer to real data, we set up a simulated dataset based on the triple-negative breast cancer (TNBC) dataset [7], which is publicly available from The Cancer Genome Atlas (TCGA) Data Portal. The simulated dataset had a sample size of n = 1000, among which 500 samples had y values of 1 (TNBC group) and 500 samples had y values of 0 (non-TNBC group). A subset with p = 5000 was selected randomly from the TNBC dataset. The mean vector and covariance matrix for the TNBC and the non-TNBC groups were obtained from the corresponding subsets, and then the normally distributed random variables were generated. We randomly selected samples and changed the labels to obtain the misclassified samples. The proportions of misclassified samples were set as 2.5, 5, 10, and 15%. The ability of the three methods to identify outliers is shown in Fig. 5 and Additional File 1: Table S2-5.
The outlier detection accuracy was highest for enetLTS with FPRs of < 1% and the highest Sn among the three methods, as shown in Fig. 5. Although the Rlogreg Sn was close to that of enetLTS, its FPR of about 8% was much higher than the enetLTS FPR. The Ensemble FPR was very low (close to 0), but its Sn was low, especially when the proportion of outliers was large; with 10% or 15% outliers, the Sn was < 30%.
The results show that, when the proportion of outliers was ≤5%, the variable selection accuracy was highest for Ensemble. However, with 10% or 15% outliers, although the overall GM still showed that Ensemble had the highest accuracy, the Ensemble PSR was lower than the enetLTS PSR, and some variables that affect the dependent variable were missed. With even higher proportions of outliers, the difference between these two methods would further increase. These results also show that enetLTS had the highest outlier detection accuracy.

Combining two methods to improve the accuracy of variable selection
When the proportion of outliers was small, Ensemble had high accuracy of variable selection; however, when the proportion of outliers was large, its accuracy was greatly reduced. Conversely, regardless of the proportion of outliers, enetLTS had the highest outlier detection accuracy among the three methods.
Considering the advantages and disadvantages of the two methods, we combined them. That is, when the proportion of outliers was large, enetLTS was used first to identify the outliers, and Ensemble was then applied to the subset with the outliers removed. Because the proportion of misclassified samples in the subset is smaller, the accuracy of Ensemble's variable selection is expected to increase.
With 15% outliers, enetLTS was used to identify the outliers, and then Ensemble was run on the subset. As a result, the Ensemble PSR increased from 0.533 to 0.644, and the GM increased from 0.714 to 0.786. The results are shown in Table 1.
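The two-stage idea can be written as a small generic wrapper. This is only a sketch: `detect_outliers` and `select_variables` are hypothetical callables standing in for an enetLTS-style detector and an Ensemble-style selector, neither of which is implemented here, and the data are assumed to be plain Python lists.

```python
def two_stage_selection(X, y, detect_outliers, select_variables):
    """Remove detected outliers, then run variable selection on the rest.

    detect_outliers(X, y)  -> iterable of row indices flagged as outliers
    select_variables(X, y) -> selected variables (any representation)
    Both callables are placeholders supplied by the caller.
    """
    outliers = set(detect_outliers(X, y))
    keep = [i for i in range(len(y)) if i not in outliers]
    X_clean = [X[i] for i in keep]
    y_clean = [y[i] for i in keep]
    return select_variables(X_clean, y_clean), sorted(outliers)
```

Keeping the detector and the selector as arguments makes it easy to compare the selection result with and without the outlier removal step on the same data.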

The computation times of the three methods
The computation times of the three methods are summarized in Table 2. The computations were performed on an Intel Core i7-6500U @ 2.50 GHz processor. The CPU time is reported in seconds as an average over five repetitions. As Table 2 shows, the computation time of Rlogreg was considerably lower than that of the other two methods because the regularization parameter λ was determined using Bayesian regularization, which saved the time that cross-validation would take. The regularization parameters of Ensemble and enetLTS were both determined by cross-validation. However, enetLTS required much more time because cross-validation was conducted at each iterative step of the C-step algorithm.

Results of the analysis on a TNBC dataset
We compared the application of the three methods on a TNBC dataset from the TCGA-BRCA data collection. The BRCA RNA-Seq fragments per kilobase per million (FPKM) dataset was imported using the 'brca.data' R package (https://github.com/averissimo/brca.data/releases/download/1.0/brca.data_1.0.tar.gz).
A total of n = 1019 patients with solid tumors and 19,688 genes, including the three key TNBC-associated genes, estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2), were considered for further analysis. To account for possible confounding, like Lopes et al. [7], we included two variables, age and ethnicity, which were statistically significant in univariate logistic regression, and the missing observations in these
The TNBC response variable Y was created based on the clinical variables ER and PR as detected by immunohistochemistry (IHC), and HER2 as detected by IHC and/or fluorescence in situ hybridization (FISH); Y was "1" (TNBC) when ER, PR, and HER2 were negative and "0" when at least one of the three variables was positive. There were three variables for HER2: HER2 (IHC) level, HER2 (IHC), and HER2 (FISH). The values of IHC level were "0" (negative), "1+" (negative), "2+" (indeterminate), and "3+" (positive); the values of IHC status were "equivocal", "indeterminate", "negative", and "positive". Thirteen individuals had discrepant labels for HER2 (IHC) level and HER2 (IHC) status, and 15 individuals had inconsistent labels for HER2 (IHC) and HER2 (FISH). These individuals were potential outliers and were referred to as "suspect individuals" by Lopes et al. [7]. We checked whether the outliers detected by the three methods included these suspect individuals.
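The labeling rule for Y described above can be sketched as a small function. The status strings and the function name `tnbc_label` are hypothetical; returning `None` when no marker is positive but at least one is neither negative nor positive (for example, equivocal or indeterminate) is an assumption, because the text defines Y only for the all-negative and at-least-one-positive cases.

```python
def tnbc_label(er, pr, her2):
    """Return 1 (TNBC), 0 (non-TNBC), or None when the rule is undetermined."""
    statuses = (er, pr, her2)
    if any(s == "positive" for s in statuses):
        return 0  # at least one of ER, PR, HER2 is positive
    if all(s == "negative" for s in statuses):
        return 1  # triple negative
    return None   # assumption: equivocal or indeterminate cases are left unlabeled

labels = [
    tnbc_label("negative", "negative", "negative"),   # 1
    tnbc_label("positive", "negative", "negative"),   # 0
    tnbc_label("negative", "negative", "equivocal"),  # None
]
```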
The distributions of the FPKM values for ER, PR, and HER2 in the TNBC and non-TNBC groups are presented in Table 4.
To analyze the TNBC dataset, for Rlogreg, the initial setting of the Γ matrix was $\Gamma = \begin{pmatrix} 0.925 & 0.075 \\ 0.075 & 0.925 \end{pmatrix}$ and, for Ensemble and enetLTS, the parameters were set in accordance with the settings used with the simulated data. Rlogreg detected 109 outliers, all of which were non-TNBCs. They were ranked from high to low according to the absolute values of the Pearson residuals and the top 20 outliers are listed in Table 5. The ER, PR, and HER2 genes in the 109 predicted non-TNBC patients all had low expression values, which indicated they should have been classified as TNBC patients. For example, individual "TCGA-AN-A0FJ" had different HER2 labels (positive by IHC status and negative by IHC level) and was classified as non-TNBC; however, the low HER2 (14.28), ER (0.08), and PR (0.04) expression values indicated that this individual was more likely to be a TNBC patient. Similarly, individual "TCGA-AN-A0FX" was labeled positive by IHC status and negative by IHC level and was classified as non-TNBC; however, the HER2 (24.02), ER (0.08), and PR (0.04) expression values indicated that this individual might be a TNBC patient. Individual "TCGA-LL-A5YP" also was classified as non-TNBC but had discordant HER2 labels

Table 2 The computation times of the three methods for the datasets with n = 500, p = 1,000, and ε =0.

24) of the individuals who were ranked 13 and 14 indicated they were likely labeled correctly as non-TNBC. These results suggest that Rlogreg may produce false positives for outlier detection.
Ensemble identified 30 outliers and 5 genes. Of the 30 outliers, 10 were labeled TNBC and 20 were labeled non-TNBC. Table 7 lists the 20 outliers with the minimum q-values. All outliers are listed in Table S5-1 of Additional File 1.
There were 28 suspect individuals in the TNBC dataset. Among the 30 outliers identified by Ensemble, three were suspect individuals; among the 68 outliers detected by enetLTS, seven were suspect individuals; and among the 109 outliers detected by Rlogreg, nine were suspect individuals. Because the true labels of the individuals in the TNBC dataset are not known, we regarded these 28 suspect individuals as truly mislabeled individuals and compared the outlier detection accuracy of the three methods. We used Sn_Ref and FPR_Ref as references for the true Sn and FPRs, which are shown in Table 8. For suspect individuals, the enetLTS and Rlogreg Sn_Ref values were high, whereas they were low for Ensemble. However, because Rlogreg identified a large number of outliers, its FPR_Ref also was high. These results are similar to those obtained with the simulated data. Although 3% of the samples in the TNBC dataset had inconsistent labels, nearly 300 individuals were tested using only one method for the detection of HER2, and IHC was the only method used for ER and PR detection in all the individuals. Because false positives and false negatives will appear in the IHC test, the actual proportion of misclassified samples in the TNBC dataset may be higher. Therefore, Sn_Ref will likely underestimate the true Sn, and FPR_Ref may overestimate the actual FPR.
To test the robustness of the outliers selected by the three methods, 5000 genes were selected randomly from the 19,690 variables to form a random gene set. A total of 66 outliers were identified in the random gene set using enetLTS, 62 of which coincided with those in the original TNBC dataset, including seven suspect individuals with inconsistent labels. A total of 33 outliers were identified in the random gene set using Ensemble, 25 of which coincided with those in the original TNBC dataset, including four suspect individuals with inconsistent labels. A total of 125 outliers were identified in the random gene set using Rlogreg, 92 of which coincided with those in the original TNBC dataset, including nine suspect individuals with inconsistent labels. Therefore, the results for the random gene set mostly coincided with the results for the original TNBC dataset.
The results for the random gene sets are described in detail in Additional File 1. The 32 genes selected by Rlogreg are listed in Table 9. Among them, the gene encoding the fatty acid binding protein FABP7 was reported in [15] to be up-regulated in the TNBC dataset, and elevated FABP7 expression levels have been associated with poor prognosis. Other genes selected by Rlogreg, namely KISS1 [16], IGF2BP2 [17], CALCA [18], PLA1A [19], and FAM171A1 [20], have been reported to be related to breast cancer or other types of cancer. However, the three key TNBC-associated genes ER, PR, and HER2 were not among the genes selected by Rlogreg.
The five genes selected by Ensemble are shown in Table 10. ESR1 (i.e., ER), one of the three key genes of TNBC, was among them. The other four, CA12 [21], AGR2 [22], TFF1 [23], and AGR3 [24], have been reported to be up- or down-regulated in TNBC.
A total of 433 genes were selected by enetLTS. The 40 genes with the largest absolute coefficients are listed in Table 11 and details of all the selected genes are provided in Table S4-6 of Additional File 1. The five genes selected by Ensemble were among the 433 genes. Two key genes of TNBC, ER and PR, were among the genes selected by enetLTS. Other genes selected by enetLTS, FOXA1 [25], GATA3 [25], SPDEF [26], FOXC1 [27], EN1 [28], HORMAD1 [29], KRT16 [30], and CT83 [31], have been reported to be related to TNBC.
We did not know the true genes associated with TNBC, nor the true outliers in the TNBC dataset. The simulation results showed that when the proportion of outliers was > 5%, the Ensemble PSR was lower than the enetLTS PSR, and the larger the proportion of outliers, the greater was the gap between the two. Ensemble selected only five genes, and some genes that have been reported to be related to TNBC were missed, probably because of the relatively large proportion of misclassified samples. Although only 3% of the samples in this study had inconsistent labels, nearly 300 individuals were tested using only one method to detect HER2, and only one method, IHC, was used to detect ER and PR in all the individuals in the TNBC dataset. It has been reported that up to 20% of IHC tests for ER and PR worldwide might be inaccurate (false negative or false positive), mainly due to variations in preanalytic variables, thresholds for positivity, and interpretation criteria [32]. Therefore, there may be more misclassified individuals in the TNBC dataset.
The results showed that when the proportion of outliers was relatively large, enetLTS had high outlier detection accuracy, and when the proportion of outliers was low, Ensemble had high variable selection accuracy. We combined the advantages of these two methods and removed the 68 outliers identified by enetLTS, then ran Ensemble on a subset of 856 samples, which further improved the accuracy of gene selection. The prediction index MR of the three models in Ensemble was much lower on the TNBC subset with 68 outliers removed than it was on the original TNBC dataset, as shown in Table 12; the EN MR decreased from 0.012 to 0, the SPLS-DA MR decreased from 0.064 to 0.008, and the SGPLS MR decreased from 0.059 to 0.015. Figures 6 and 7 show the intersection of the three Ensemble models for screening genes on the original TNBC dataset and on the subset with outliers removed. With the subset, all the genes screened by SGPLS overlapped with the genes screened by the other two models. The intersection of the genes screened by EN and SPLS-DA also increased from eight in the original TNBC dataset to 26 in the subset. These results show that the consistency of the genes screened by the three Ensemble models was greatly increased after removing outliers.

Table 9 Genes selected by Rlogreg for the TNBC dataset
Because the prediction accuracy of the three Ensemble models was very high, the genes selected by pairs of models were also listed in Table 14. Among the genes selected by both EN and SPLS-DA, ESR1, one of three key variables, and PHGDH [37], RARRES1 [38], SPDEF [26], PSAT1 [37], and FABP7 [15] have been reported to be related to TNBC. Among the genes selected by both SPLS-DA and SGPLS, SLPI [39], TFF1 [23], and KRT6B [40] have been reported to be related to TNBC. Other selected genes, SLC40A1 [41], ADAMTS15 [42], THSD4 [43], GREB1 [44], and SLC44A4 [45] have been reported to be related to breast cancer, and ZG16B [46], FDCSP [47], and SRARP [48] have been associated with other types of tumors. The correlation between these genes and TNBC should be verified by further experiments.

Discussion
Mislabeled samples in omics data lead to two main problems: how to identify associated biomarkers accurately and avoid the influence of mislabeled samples, and how to detect mislabeled samples accurately. Rlogreg had the lowest variable selection accuracy among the three methods tested, and lower outlier identification accuracy than enetLTS. The computation time for Rlogreg was considerably lower than those of the other two methods because the regularization parameter λ is determined using Bayesian regularization, which is faster than cross-validation. However, this way of determining λ affected the accuracy of variable selection, which was worse than that of enetLTS, which uses cross-validation to determine the regularization parameter. Additionally, because Bootkrajang et al. [4] set the misclassified samples predicted by Rlogreg as outliers, many outliers were detected and the FPR was high. Ensemble and enetLTS both use cross-validation to determine the regularization parameters. However, enetLTS required much more time than Ensemble because cross-validation is conducted at each iterative step of the C-step algorithm in enetLTS. For outlier detection, individuals with Pearson residuals larger than Φ⁻¹(0.9875), the 0.9875 quantile of the standard normal distribution, were regarded as outliers in enetLTS. The outlier detection accuracy of enetLTS was better than that of Rlogreg, and enetLTS was stable when the proportion of outliers increased. Least trimmed square (LTS) is an effective method to solve the masking phenomenon in which many outliers are located close together in a low-dimensional dataset [49]. The results showed that enetLTS also was an effective method to detect outliers in a high-dimensional dataset. For variable selection, enetLTS is equivalent to EN applied to the subset with outliers removed. EN tends to include all relevant variables, so more variables are selected, resulting in high PSRs and FDRs for enetLTS.
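The residual-based flagging rule can be sketched as follows, assuming that the cutoff is the 0.9875 quantile of the standard normal distribution (about 2.24) and that it is applied to the absolute Pearson residual. The fitted probabilities are taken as given inputs here; in enetLTS they would come from the reweighted fit, which is not reproduced. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def pearson_residuals(y, prob):
    """Pearson residuals of a binary response given fitted probabilities."""
    return [(yi - pi) / math.sqrt(pi * (1.0 - pi)) for yi, pi in zip(y, prob)]

def flag_outliers(y, prob, level=0.9875):
    """Indices whose absolute Pearson residual exceeds the normal quantile."""
    cutoff = NormalDist().inv_cdf(level)  # roughly 2.24 for level = 0.9875
    residuals = pearson_residuals(y, prob)
    # Assumption: the comparison uses the absolute value of the residual.
    return [i for i, r in enumerate(residuals) if abs(r) > cutoff]

# Toy example: the third sample has y = 1 but a fitted probability of only 0.05.
flagged = flag_outliers([1, 0, 1, 0], [0.9, 0.1, 0.05, 0.5])  # [2]
```

A sample is flagged only when its observed label is very unlikely under the fitted model, which is why a confident but wrong prediction (probability 0.05 for an observed 1) is flagged while an uncertain one (probability 0.5) is not.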
We ran Ensemble on the TNBC dataset after removing the outliers identified by enetLTS. One of the three models included in Ensemble is EN. Therefore, using Ensemble is equivalent to running three methods on this subset, and by keeping only the variables in their intersection, which are strongly correlated with the dependent variable, the FDR can be reduced.
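The intersection step can be illustrated with a minimal sketch. The function name `consensus_variables` and the placeholder gene names are hypothetical; each argument stands for the set of variables selected by one model.

```python
def consensus_variables(*selections):
    """Variables selected by every model (intersection of all selections)."""
    sets = [set(s) for s in selections]
    return sorted(set.intersection(*sets))

# Placeholder selections from three hypothetical models.
en_sel = ["g1", "g2", "g3", "g7"]
splsda_sel = ["g1", "g2", "g5"]
sgpls_sel = ["g2", "g1", "g3"]
consensus = consensus_variables(en_sel, splsda_sel, sgpls_sel)  # ["g1", "g2"]
```

A variable selected by only one or two models (here g3, g5, and g7) is dropped, which lowers the FDR at the cost of possibly missing weaker true signals.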
The EN model in Ensemble also was compared with enetLTS by Kurnaz et al. [11]. The results showed that on contaminated data, enetLTS performed better than the EN for variable selection with a lower FPR and better precision of coefficients. However, associated genes with substantial effects were not influenced by outliers and were detected by all three Ensemble models. Then, by finding the intersection of variables selected by the three Ensemble models, the genes with the strongest effects are selected. Because the intersection contained few genes, Ensemble had a very low FDR and a lower PSR than enetLTS. When the proportion of outliers was relatively large, we found that the variable selection accuracy of Ensemble decreased, especially the PSR. This may be because, without considering robustness, when the original datasets with outliers were used, the variable selection of the three Ensemble models may have been influenced by the outliers. EnetLTS performs the EN estimation after removing outliers, so the enetLTS PSR is less affected by the outliers.

Table 13 Genes selected by Ensemble for the TNBC subset*
CA12, GABRP, VGLL1, AGR2, GATA3, FOXA1, TFF3, AGR3, KRT16
*This subset is the original dataset after removing the 68 outliers identified by enetLTS.

All the results showed that the variable selection accuracy of Ensemble on high-dimensional data was high. Ensemble is a way of modeling using a variety of different base models, so, as long as the base models are diverse and independent, Ensemble will usually have low error rates. Many studies on Ensemble models have combined multiple machine learning methods to improve the accuracy of predictions [50]. Because the different models "look" at the data from different angles, they are not flexible enough to make changes in the training set to more accurately summarize new data and improve the generalization ability of the model. Rather, the Ensemble approach aims to seek the wisdom of the group to build a model that is closer to reality [51].
Overall, the outlier detection accuracy of Ensemble was worse than that of enetLTS. Ensemble achieved consensus with rank product statistics corrected by multiple testing, which led to fewer outliers detected by Ensemble and a lower Sn than enetLTS. Further, Cook's D derived from EN, SPLS-DA, or SGPLS may have been influenced by outliers. Conversely, the Pearson residual of enetLTS was derived from the subset without outliers, which may explain why enetLTS detected more outliers with a higher Sn than Ensemble.
When the proportion of outliers was relatively low, such as ≤5%, the results showed that Ensemble had high variable selection accuracy. Although none of the three Ensemble models considered robustness, they were less affected by the proportion of outliers, so the FDR of variable selection was reduced by intersection. When the proportion of outliers was large, such as > 5%, the results showed that using enetLTS first to identify outliers and then using Ensemble improved the overall variable selection accuracy. In practice, the proportion of outliers can be determined according to the inaccuracy rate of the diagnostic methods used; for example, the inaccuracy rate of IHC detection is about 20%. EnetLTS is the recommended method for the identification of misclassified samples regardless of the proportion of outliers in a dataset.
The identified misclassified samples need to be further checked using more accurate tests or multiple tests so that experimental or diagnostic errors can be corrected to avoid subsequent treatment failure caused by the wrong treatment. If, after verification, the identified misclassified samples were found not to be caused by such errors, it may mean that the disease classification of these samples had different response patterns compared to their covariate combinations. Taking the TNBC data for example, if the identified misclassified sample is labeled TNBC, and it is not a diagnostic error after verification, it indicates he/she should be labeled non-TNBC based on the genes screened from the vast majority of individuals. For these heterogeneous samples, we suggest that further analysis can be done in this way. The propensity score [52] can be used to match these heterogeneous TNBC individuals according to the gene expression values of genes (screened from the vast majority of individuals) among non-TNBC individuals. In this way, specific genes related to TNBC can be found in these heterogeneous TNBC individuals, and these specific genes are different from associated genes screened from the vast majority of individuals. Further research may find suitable individualized treatment for these heterogeneous TNBC patients. The analysis of identified heterogeneous samples needs further research.

Conclusions
When the proportion of outliers is relatively low (≤ 5%), Ensemble can be used for variable selection. When the proportion of outliers is > 5%, Ensemble can be used for variable selection on the subset of data that remains after removing the outliers identified by enetLTS. For outlier identification, enetLTS is the recommended method. In practice, the proportion of outliers in a dataset can be estimated from the inaccuracy rate of the diagnostic methods used.

Simulation settings
We generated n = 100 and n = 500 observations from a p-dimensional multinormal distribution $N(0, \Sigma_p)$ with p = 200 and p = 1000. The (i, j) element of $\Sigma_p$ was set to $0.9^{|i-j|}$, $1 \le i, j \le p$. We assumed high correlation coefficients among variables because close correlation among genes is frequently observed.
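As a minimal sketch of this covariance structure (not the authors' implementation; the function names and the choice of Python are illustrative assumptions), predictors whose (i, j) covariance is $0.9^{|i-j|}$ can be drawn with a stationary AR(1) recursion, which has exactly this covariance and avoids factorising a p × p matrix:

```python
import math
import random


def ar1_covariance(p, rho=0.9):
    """Return the p x p matrix whose (i, j) entry is rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]


def sample_ar1_row(p, rho, rng):
    """Draw one observation from N(0, Sigma_p), Sigma_p[i][j] = rho ** |i - j|.

    Uses x_1 ~ N(0, 1) and x_j = rho * x_{j-1} + sqrt(1 - rho^2) * e_j,
    with e_j ~ N(0, 1) independent, so every x_j has unit variance.
    """
    scale = math.sqrt(1.0 - rho * rho)
    row = [rng.gauss(0.0, 1.0)]
    for _ in range(1, p):
        row.append(rho * row[-1] + scale * rng.gauss(0.0, 1.0))
    return row


def sample_predictors(n, p, rho=0.9, seed=0):
    """Return n independent rows of p correlated predictors (list of lists)."""
    rng = random.Random(seed)  # hypothetical seed, only for reproducibility
    return [sample_ar1_row(p, rho, rng) for _ in range(n)]


if __name__ == "__main__":
    X = sample_predictors(n=100, p=200)
    print(len(X), len(X[0]))
```

The recursion is used here only because it keeps the example dependency-free; sampling from the full covariance matrix with a linear algebra library would give draws from the same distribution.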
We fixed the coefficient vector as $\beta^T = (1, \ldots, 1, 0, \ldots, 0)$: the first 30 $\beta_i$ were set to one and the others were set to zero. The response variable was generated from a Bernoulli distribution, $y_i \sim B(1, \pi_i)$, where $\mathrm{logit}(\pi_i) = x_i^T \beta$ for $i = 1, 2, \ldots, n$. We considered the following scenarios.
(1) Outliers in the response: We set $\frac{1}{3}\varepsilon$ of the observations, selected randomly from the class $y_i = 0$, to one, and $\frac{2}{3}\varepsilon$ of the observations, selected randomly from the class $y_i = 1$, to zero. Asymmetric mislabeling was used because it is usually more harmful and harder to detect than symmetric mislabeling. We considered $\varepsilon$ = 0.01, 0.02, 0.05, 0.10, and 0.15.
(2) Outliers in both the response and predictors: This was the same as scenario (1); however, the $\varepsilon$ of observations with outliers in the response also contained outliers in the predictors, which followed an independent N(3, 1) distribution.
(3) To make the simulation scenario closer to real data, we set up a simulation based on the TNBC dataset. The new datasets were simulated with sample size N = 1000, of which 500 observations had a y value of 1 (TNBC class) and 500 observations had a y value of 0 (non-TNBC class). A subset of dimension p = 5000 was randomly selected from the TNBC dataset. The mean vector and covariance matrix of the TNBC group and of the non-TNBC group were obtained separately from this subset, and normally distributed random variables were then generated.
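The response generation and the asymmetric mislabeling of scenario (1) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, ε is interpreted as a fraction of the total sample size n, and the numbers of flipped labels are rounded to the nearest integer.

```python
import math
import random


def sigmoid(t):
    """Numerically stable inverse logit."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def simulate_labels(X, n_signal=30, rng=random):
    """Draw y_i ~ Bernoulli(pi_i) with logit(pi_i) = x_i^T beta.

    beta has ones in its first n_signal positions and zeros elsewhere,
    so x_i^T beta is the sum of the first n_signal predictors.
    """
    labels = []
    for row in X:
        pi = sigmoid(sum(row[:n_signal]))
        labels.append(1 if rng.random() < pi else 0)
    return labels


def flip_labels(y, eps, rng=random):
    """Flip about eps*n/3 labels from 0 to 1 and 2*eps*n/3 from 1 to 0.

    Assumption: eps is a fraction of all n observations; counts are rounded
    and capped at the size of each class.
    Returns the observed labels and the sorted indices of flipped samples.
    """
    n = len(y)
    zeros = [i for i, v in enumerate(y) if v == 0]
    ones = [i for i, v in enumerate(y) if v == 1]
    k01 = min(round(eps * n / 3), len(zeros))
    k10 = min(round(2 * eps * n / 3), len(ones))
    flipped = rng.sample(zeros, k01) + rng.sample(ones, k10)
    y_obs = list(y)
    for i in flipped:
        y_obs[i] = 1 - y_obs[i]
    return y_obs, sorted(flipped)


if __name__ == "__main__":
    rng = random.Random(0)
    X = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(100)]
    y_true = simulate_labels(X, rng=rng)
    y_obs, outliers = flip_labels(y_true, eps=0.15, rng=rng)
    print(len(outliers), "labels flipped")
```

Returning the flipped indices alongside the observed labels keeps the ground truth available for the outlier detection measures.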
For each setting mentioned above, we compared the performance of Rlogreg, Ensemble, and enetLTS. For Rlogreg, the initial gamma matrix was set to $\begin{pmatrix} 0.85 & 0.15 \\ 0.15 & 0.85 \end{pmatrix}$, which means that the initial probability of a label being flipped, either from the true label one to the observed label zero or from the true label zero to the observed label one, was 0.15.

Performance measures
The evaluation criteria were divided into three categories. The first category concerns the variable selection accuracy.
(1) Model size: the number of non-zero coefficients in the estimated model.
(2) Positive selection rate (PSR) and false discovery rate (FDR): $\mathrm{PSR} = \frac{TP}{TP + FN}$ and $\mathrm{FDR} = \frac{FP}{TP + FP}$, where true positive TP is the number of coefficients that are non-zero in the true model and were estimated as non-zero, false positive FP is the number of coefficients that are zero in the true model but were estimated as non-zero, and false negative FN is the number of non-zero coefficients that were estimated as zero. PSR is the proportion of the non-zero coefficients in the true model that were selected, and FDR is the proportion of the non-zero estimated coefficients that are false positives.
(3) The geometric mean of PSR and (1-FDR) (GM): We calculated the geometric mean of PSR and (1-FDR) to evaluate the selection performance of the methods comprehensively.
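The three variable selection measures can be computed from the true and estimated coefficient vectors as in the sketch below. The function name and the returned dictionary are illustrative assumptions; the formulas follow the definitions above, with PSR = TP / (TP + FN), FDR = FP / (TP + FP), and GM = sqrt(PSR × (1 − FDR)).

```python
import math


def selection_metrics(true_beta, est_beta):
    """Model size, PSR, FDR and their geometric mean GM.

    A coefficient counts as selected when its estimate is non-zero.
    Ratios with an empty denominator are reported as 0.0 (an assumption
    made here so that the function is total).
    """
    pairs = list(zip(true_beta, est_beta))
    tp = sum(1 for t, e in pairs if t != 0 and e != 0)
    fp = sum(1 for t, e in pairs if t == 0 and e != 0)
    fn = sum(1 for t, e in pairs if t != 0 and e == 0)
    psr = tp / (tp + fn) if tp + fn else 0.0
    fdr = fp / (tp + fp) if tp + fp else 0.0
    return {
        "model_size": tp + fp,
        "PSR": psr,
        "FDR": fdr,
        "GM": math.sqrt(psr * (1.0 - fdr)),
    }


if __name__ == "__main__":
    true_beta = [1.0] * 30 + [0.0] * 170
    # Hypothetical estimate: 24 true signals kept, 6 noise variables selected.
    est_beta = [0.5] * 24 + [0.0] * 6 + [0.2] * 6 + [0.0] * 164
    print(selection_metrics(true_beta, est_beta))
```

GM rewards a method only when it both recovers the true signals and keeps false discoveries low, which is why it serves as the comprehensive index.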
The second category of indicators evaluates the accuracy of outlier detection.
(1) The number of outliers (Num): the number of outliers detected by a method.
(2) Sensitivity (Sn) and false positive rate (FPR): $\mathrm{Sn} = \frac{TP^*}{TP^* + FN^*}$ and $\mathrm{FPR} = \frac{FP^*}{FP^* + TN^*}$, where true positive $TP^*$ is the number of actual outliers that were also detected as outliers, false positive $FP^*$ is the number of individuals with correct labels that were detected as outliers, false negative $FN^*$ is the number of actual outliers that were misclassified as individuals with correct labels, and true negative $TN^*$ is the number of individuals with correct labels that were also identified as having correct labels.
Sn represents the proportion of actual outliers that were correctly identified. FPR represents the proportion of individuals with correct labels that were wrongly categorized as outliers.
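A minimal sketch of the outlier detection measures, assuming the true and detected outliers are given as collections of sample indices (the function name and interface are hypothetical):

```python
def outlier_detection_metrics(true_outliers, detected_outliers, n):
    """Num, Sn and FPR for one method on a dataset of n samples.

    Sn  = TP* / (TP* + FN*): share of actual outliers that were detected.
    FPR = FP* / (FP* + TN*): share of correctly labeled samples flagged.
    Ratios with an empty denominator are reported as 0.0 (an assumption).
    """
    truth = set(true_outliers)
    found = set(detected_outliers)
    tp = len(truth & found)
    fp = len(found - truth)
    fn = len(truth - found)
    tn = n - tp - fp - fn
    sn = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"Num": len(found), "Sn": sn, "FPR": fpr}


if __name__ == "__main__":
    # Hypothetical example: 10 mislabeled samples among 100, 8 of them found,
    # plus 2 correctly labeled samples flagged by mistake.
    truth = range(10)
    found = list(range(8)) + [50, 51]
    print(outlier_detection_metrics(truth, found, n=100))
```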
The third category of indicators evaluates the prediction accuracy.
(1) Misclassification rate (MR): MR is the fraction of observations that were misclassified by the fitted model, $\mathrm{MR} = \frac{1}{n} \sum_{i=1}^{n} I(\hat{y}_i \neq y_i)$, where the predicted probability is $\hat{p}_i = \frac{\exp(x_i' \hat{\beta})}{1 + \exp(x_i' \hat{\beta})}$ and $\hat{y}_i$ is the predicted response derived from $\hat{p}_i$. Misclassified observations are samples with response y = 1 that were predicted as zero, or samples with y = 0 that were predicted as one.
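The misclassification rate can be sketched as below. The cut-off of 0.5 used to turn a predicted probability into a predicted class is an assumption made for illustration, since the threshold is not stated here; the function names are likewise hypothetical, and no intercept is included, as in the formula for the predicted probability.

```python
import math


def predicted_probability(x, beta_hat):
    """Logistic probability exp(x'b) / (1 + exp(x'b)), computed stably."""
    t = sum(xj * bj for xj, bj in zip(x, beta_hat))
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def misclassification_rate(X, y, beta_hat, threshold=0.5):
    """Fraction of observations whose predicted class differs from y.

    threshold=0.5 is an assumed cut-off, not one taken from the text.
    """
    errors = 0
    for x, yi in zip(X, y):
        y_hat = 1 if predicted_probability(x, beta_hat) > threshold else 0
        errors += int(y_hat != yi)
    return errors / len(y)


if __name__ == "__main__":
    # Hypothetical test set with one predictor and a fitted coefficient of 1.
    X_test = [[2.0], [-2.0], [1.0], [-1.0]]
    y_test = [1, 0, 0, 0]
    print(misclassification_rate(X_test, y_test, beta_hat=[1.0]))
```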
Training data and test data were generated according to the above sampling schemes. Training data were generated to fit the model and test data to evaluate the model. The test data were generated without outliers. For each setting, we calculated the average of the performance measures over 100 simulation replicates implemented in MATLAB [53] (for Rlogreg only) and R software [54,55].