
Fast and robust imputation for miRNA expression data using constrained least squares

Abstract

Background

High dimensional transcriptome profiling, whether through next generation sequencing techniques or high-throughput arrays, may result in scattered variables with missing data. Data imputation is a common strategy to maximize the inclusion of samples by using statistical techniques to fill in missing values. However, many data imputation methods are cumbersome and risk introduction of systematic bias.

Results

We present a new data imputation method using constrained least squares and algorithms from the inverse problems literature, and present applications for this technique in miRNA expression analysis. The proposed technique is shown to offer imputation that is orders of magnitude faster than, and at least as accurate as, similar methods from the literature.

Conclusions

This study offers a robust and efficient algorithm for data imputation, which can be used, e.g., to improve cancer prediction accuracy in the presence of missing data.

Background

Next generation sequencing technologies have revolutionized high-throughput analysis of the transcriptome. However, zero values present an inherent problem when analyzing the expression matrices generated through these techniques. When transcript levels are relatively high in some samples, but not in others of the same type, or when the dimensionality of the data is high, technical zeros become even more likely. Distinguishing technical zeros from true biological null expression is essential for correct data interpretation.

To highlight the importance of data imputation, in terms of retaining classification accuracy, we consider the example problem of classifying images of handwritten digits. Classifying images of handwritten digits is a well studied problem in machine learning, and the test accuracy exceeds \(99\%\) using state-of-the-art models [1]. See Fig. 1a, where we show an example synthetic image of a handwritten 1. The same image, but with some missing pixels, is shown in Fig. 1b. The locations of the missing pixels are selected at random, and uniformly. If we wish to classify the images with missing pixels, then it is ill-advised to perform no data imputation (i.e., imputing zeros), as the accuracy would suffer. See Fig. 1c, where we show the effect of imputing zeros on the AUC, classification accuracy (ACC) and \(F_1\) score with the percentage of missing pixels. We see a decrease in classification accuracy and \(F_1\) score when more than \(10\%\) of the pixels are missing, and the reduction in accuracy is more pronounced as the percentage of missing pixels increases.
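To make the setup concrete, the masking step behind Fig. 1b can be sketched in a few lines of Matlab. This is an illustrative sketch only; the image, the fraction of missing pixels, and the variable names are our own stand-ins, not the code used for the experiments.

    % Mask a uniformly random subset of pixels; leaving the masked entries
    % at zero corresponds to the "zero imputation" baseline of Fig. 1c.
    img = rand(28, 28);                          % stand-in for a 28 x 28 digit image
    f = 0.3;                                     % fraction of missing pixels
    drop = randperm(numel(img), round(f * numel(img)));
    img_missing = img;
    img_missing(drop) = 0;                       % zero imputation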

Fig. 1

a Example image of a handwritten number 1. b The same image, but with missing pixels. c AUC, classification accuracy (ACC) and \(F_1\) score with % of missing pixels

Several strategies have been described for data imputation in gene expression and miRNA expression analysis [2,3,4,5,6,7,8,9,10,11]. Two popular techniques are “VIPER” and “scImpute.” In [3], the authors introduce “VIPER”, which implements data imputation on gene expression data using a combination of lasso (or an elastic net) with a box constrained regression. That is, first a set of neighboring cells is found, whose expression values are related to those of the missing cell, using lasso (or elastic nets). Then a box constrained regression is performed on the selected neighbors to fill in the missing gene expression values. The reason for using lasso as preprocessing is that the quadratic programming code employed in [3] for box constrained regression does not scale well to large matrices, and thus lasso (or an elastic net) is used to select a subset of candidate nearest neighbors to reduce the array size before nonnegative regression. In [9], the authors introduce “scImpute”, which shares a similar intuition to VIPER. First, in the scImpute algorithm, the cells are clustered into K groups using spectral clustering [12]. Then, the missing cell expressions are reconstructed from their neighboring cells by a nonnegative constrained regression. That is, the missing values are imputed using nonnegative linear combinations (i.e., linear combinations with nonnegative coefficients) of their nearest neighbors, where the neighboring cells are determined by spectral clustering. We choose to focus on VIPER (specifically the lasso variant) and scImpute for comparison here, as they share the most similarities with the proposed method.

Here we present a novel, fast method for data imputation using constrained Conjugate Gradient Least Squares (CGLS), borrowing ideas from the imaging and inverse problems literature. As an example of a desired application for this work, we present miRNA expression analysis, with a particular focus on cancer prediction. As shown below, highly accurate cancer prediction is possible using simple classifiers (e.g., a softmax function is used here), on a wide variety of data sets, in the case when all (or a large fraction) of the miRNA expression values are known, and there is little to no missing data. It is not always possible to measure all expression values contained in the training set, for every patient, however. To combat this, we aim to impute the missing values using the known expressions, so that we can retain use of our accurate model fit to the full set of miRNA available in the training set. We propose to reconstruct the missing data via nonnegative constrained regression, but with the further constraint that the regression weights sum to 1. Such constraints ensure that the imputed values lie within the range of the training data, with the idea to prevent overfitting. We enforce the regression weights to sum to 1 as a hard constraint in our objective, so that the nonnegative least squares and weight normalization steps are carried out simultaneously. To solve our objective, we apply the nonnegative CGLS algorithm of [13], typically applied in inverse problems and image reconstruction. The CGLS code we apply does not suffer the scaling issues encountered in, e.g., VIPER, for large matrices, and can efficiently process large-scale expression arrays. The algorithm we propose offers a fast, efficient, and accurate imputation without the need for preprocessing steps, e.g., as in VIPER and scImpute. Our method is also completely nonparametric, and thus requires no tuning of hyperparameters before imputation, in contrast to scImpute and VIPER, which require that two hyperparameters be tuned. Such parameters may be selected, for example, by cross validation, as is suggested in [3]. However, cross validation is slow, particularly for large data, and is thus impractical for clinical applications. To demonstrate the technique, we test the performance on miRNA expression data publicly available in the literature, and give a comparison to VIPER and scImpute. Specifically, as a measure of performance, we focus on how effectively each method retains the classification accuracy with the percentage of missing data (as in the curves shown in Fig. 1c). The proposed method is shown to be orders of magnitude faster than VIPER and scImpute, with greater than or equal accuracy, for the examples of interest considered here in cancer prediction.

Results

The method proposed here will be denoted by Fast Linear Imputation (FLI), for the remainder of this paper. The FLI algorithm and the core objective functions are discussed in detail in the appendix, section “Description of FLI”. In this section, we present a comparison of FLI, and the methods VIPER [3] and scImpute [9] from the literature on publicly available miRNA expression data [14,15,16] and synthetic handwritten image data. The specific implementations of VIPER and scImpute used here are discussed in the appendix, section “Implementation of VIPER and scImpute”. FLI is also compared against (unconstrained) regression, mean, and zero imputation as baseline. The classification model, selection of hyperparameters, and classification metrics are detailed in the appendix.

Synthetic handwritten image results

In this section, we present our results on the synthetic handwritten image data discussed in the introduction. The handwritten image data is included as a visual example to show clearly how the imputation methods are performing and to give some intuition as to why some methods perform better than others. For more details on this data see section “Data sets”.

Fig. 2

Handwritten image data results. a–c AUC, ACC and \(F_1\) scores with percentage of missing dimensions. d–f Mean (\(\epsilon _{\mu }\)), standard deviation (\(\epsilon _{\sigma }\)) and maximum (\(\epsilon _{M}\)) imputation errors over all test patients, with percentage of missing dimensions. The method is given in the figure legend

In Fig. 2a–c we present plots of the AUC, ACC and \(F_1\) scores with the percentage of missing dimensions, for each method. Figure 2d–f show the corresponding plots of the mean, standard deviation and maximum imputation errors, over all test patients. The plots in Fig. 2e, f are cropped on the vertical axis to better highlight the errors corresponding to the more competitive methods. This cuts off the end of the error curve corresponding to scImpute, which spikes when \(\approx 95\%\) of the dimensions are missing. In Table 1, we present the average values over the curves in Fig. 2a–f, as a measure of the average performance over all possible levels of missing data. The mean, standard deviation, and maximum imputation times, over all test patients, are given in Table 1c.

Table 1 Handwritten image data results

In terms of retaining the classification accuracy, FLI, VIPER, and scImpute offer comparable performance. FLI and VIPER are joint best and offer mean AUC, ACC, and \(F_1\) scores exceeding \(99\%\). For the baseline methods, namely (unconstrained) regression, mean, and zero imputation, we see a reduction in the classification accuracy, and the reduction is more pronounced when \(>70\%\) of the dimensions are missing, as evidenced by the curves in Fig. 2a–c. In terms of the imputation error, FLI offers the most consistent imputation accuracy, when compared to VIPER and scImpute, in the sense that FLI offers the smallest standard deviation and maximum errors.

Fig. 3

Example image reconstructions of the handwritten 1 shown in the introduction, using all methods considered. The number of missing pixels is 550, i.e., \(550/784\approx 70\%\) of all pixels. The ground truth is also shown for comparison

For an example image imputation, see Fig. 3, where we show image reconstructions of the handwritten 1 discussed in the introduction (Fig. 1a). We see ghosting artifacts in the VIPER reconstruction, and a significant blurring effect in the scImpute reconstruction. The regression imputation appears overfit, and introduces severe artifacts. FLI offers the clearest and sharpest image, with relatively few artifacts. Thus, in some cases, the VIPER and scImpute reconstructions introduce artifacts. While this is not enough to confuse the classifier (i.e., the classification accuracy is still retained), the imputation error is less consistent when compared to FLI. In particular, the average maximum imputation error offered by FLI, over all levels of missing dimensions, is \(17\%\) lower than that of the next best performing method, namely VIPER. See the third row of Table 1b. FLI is also orders of magnitude faster than VIPER and scImpute, as indicated by the imputation times of Table 1c.

Singapore results

Here we present our results on the miRNA expression data of Chan et al. [16], collected from Singaporean patients. This data includes significant batch effects due to different measurement technologies. See section “Data sets” for more details.

See Fig. 4a–c for plots of the classification accuracy, and Fig. 4d–f for the imputation errors with the percentage of missing dimensions. See Table 2 for the mean values over the curves in Fig. 4a–f, and the mean, standard deviation, and maximum imputation times. In this example, FLI offers the best performance in terms of retaining the AUC, ACC and \(F_1\) score, on average, across all levels of missing dimensions. As evidenced by Fig. 4a, FLI offers the highest AUC over all levels of missing dimensions. We see a similar effect in the ACC and \(F_1\) score curves of Fig. 4b, c, although, in a minority of cases, scImpute slightly outperforms FLI. The retention of the classification accuracy is significantly reduced using regression, mean and zero imputation, when compared to FLI, VIPER, and scImpute.

Fig. 4

Singapore data results. a–c AUC, ACC and \(F_1\) scores with percentage of missing dimensions. d–f Mean (\(\epsilon _{\mu }\)), standard deviation (\(\epsilon _{\sigma }\)) and maximum (\(\epsilon _{M}\)) imputation errors over all test patients, with percentage of missing dimensions. The method is given in the figure legend

When compared to VIPER and scImpute, FLI offers the best performance in terms of the mean, standard deviation, and maximum imputation error across all levels of missing dimensions. Zero imputation offers the best maximum and standard deviation error over all methods; the mean error offered by zero imputation is significantly higher than that of FLI, scImpute, and VIPER, however. We would expect the standard deviation of a zero imputation to be low, as there is little variation among the imputations (i.e., many of the values are zeros). The maximum error curves of Fig. 4f indicate that, for some patients, the imputation error is high when using FLI, VIPER, and scImpute, as simply imputing zeros offers less error. Such patients can be considered outliers, and do not greatly affect the overall classification accuracy, as evidenced by the plots of Fig. 4a–c.

Table 2 Singapore data results

The imputation time offered by FLI is orders of magnitude faster than that of VIPER and scImpute. For example, FLI is approximately three orders of magnitude faster, in terms of mean imputation time, than VIPER, which was the next best performing method in terms of AUC, ACC, and \(F_1\) score, after FLI. The imputation time offered by FLI is also more consistent than that of VIPER, as evidenced by the \(t_{\sigma }\) scores. When compared to scImpute, FLI is approximately one order of magnitude faster in terms of mean and maximum imputation time, and is more consistent, with lower standard deviation. Regression, mean, and zero imputation are the fastest methods, but at the cost of accuracy.

This example was included given the presence of significant batch effects, as discussed at the beginning of this section, and in more detail in section “Data sets”. It provides evidence that FLI is the strongest performer, compared to similar methods such as VIPER and scImpute, in terms of accuracy and imputation time, when imputing data in the presence of batch effects.

Korea results

Here we present our results on the miRNA expression data of Lee et al. [17], collected from Korean patients. For more details on this data, see section “Data sets”, point (3).

Fig. 5

Korea data results. a–c AUC, ACC and \(F_1\) scores with percentage of missing dimensions. d–f Mean (\(\epsilon _{\mu }\)), standard deviation (\(\epsilon _{\sigma }\)) and maximum (\(\epsilon _{M}\)) imputation errors over all test patients, with percentage of missing dimensions. The method is given in the figure legend

Table 3 Korea data results

In Fig. 5a–c we present plots of the AUC, ACC and \(F_1\) scores with the percentage of missing dimensions, for each method. Figure 5d–f show the corresponding plots of the mean, standard deviation and maximum imputation errors. The plots in Fig. 5d, f are cropped to \(\epsilon =0.35\) on the vertical axis to better highlight the errors corresponding to the more competitive methods. Thus, parts of some of the error curves are missing in Fig. 5d, f. For example, in most cases (i.e., for most levels of missing dimensions considered), the zero imputation mean and maximum errors exceed 0.35, which is why much of the light blue curves corresponding to zero imputation are missing from the plots. In Table 3a, b, we present the average values over the curves in Fig. 5a–f. The mean, standard deviation, and maximum imputation times, over all test patients, are given in Table 3c.

In this example, FLI, VIPER, and scImpute offer similar levels of performance in terms of retaining the classification accuracy, with FLI offering the best average AUC, and scImpute the best average ACC and \(F_1\) scores. A standard regression imputation is also effective in retaining the classification accuracy up to approximately \(85\%\) of dimensions missing, and is comparable to FLI, VIPER, and scImpute within this range. We see a sharp reduction in accuracy in the regression curves (i.e., the purple curves of Fig. 5a–c) when more than \(85\%\) of the dimensions are missing, however, and regression significantly underperforms FLI, scImpute, and VIPER at this limit. For mean and zero imputation, we see a more gradual reduction in accuracy, when compared to regression. The imputation errors offered by FLI, VIPER, scImpute, and mean imputation are comparable, and outperform regression and zero imputation. When compared to VIPER and scImpute, FLI offers a mean imputation time that is orders of magnitude faster. The imputation time offered by FLI is also more consistent, with lower standard deviation and maximum, when compared to VIPER and scImpute. As was the case in the previous examples, regression, mean, and zero imputation are the fastest methods, but at the cost of accuracy.

Discussion

In this paper we introduced FLI, a fast, robust data imputation method based on constrained least squares. To illustrate the technique, we tested FLI on synthetic handwritten image and real miRNA expression data sets, and gave a comparison to two similar methods from the literature, namely VIPER [3] and scImpute [9]. We also compared against (unconstrained) regression, mean, and zero imputation as baselines. The results highlight the effectiveness of FLI in retaining the classification accuracy in cancer prediction applications using miRNA expression data, and in image classification. When compared to VIPER and scImpute, FLI was shown to offer greater than or equal imputation accuracy, with imputation speed orders of magnitude faster, in all examples considered. VIPER, scImpute, and FLI significantly outperformed regression, mean, and zero imputation in terms of imputation accuracy, in all examples considered, but were slower given the greater computational complexity. For further validation of FLI on two more real miRNA expression data sets, see the appendix, section “Additional miRNA expression results”.

In section “Singapore results”, we considered an example expression data set collected from Singaporean patients, which included significant batch effects. When batch effects were present, FLI was shown to outperform VIPER and scImpute in terms of retaining the classification accuracy and imputation error. On the handwritten image and Korean data sets, considered in sections “Synthetic handwritten image results” and “Korea results”, such batch effects were not detected. In these examples, the imputation accuracy offered by FLI, VIPER, and scImpute was comparable. This study provides evidence that FLI offers the best imputation accuracy, compared to similar methods from the literature, on data with batch effects. This is important, since batch effects are common in medical data [18], and thus an imputation method which is effective in combating batch effects, without the need for a priori batch correction steps, is desirable.

In all examples considered, FLI was shown to be orders of magnitude faster than VIPER and scImpute. FLI is also completely nonparametric, and thus more straightforward to implement, in contrast to VIPER and scImpute, which each require the tuning of two hyperparameters. It is suggested in [3] to tune the lasso parameter used by VIPER via cross validation. The reason for using lasso as preprocessing in VIPER is that the quadratic programming code employed in [3] for nonnegative regression does not scale well to large matrices, so lasso (or elastic nets) is used to select a subset of candidate nearest neighbors before nonnegative regression. A similar intuition is used in scImpute [9], whereby the training data is clustered into K groups using spectral clustering [12] before nonnegative regression; that is, the test samples are imputed using linear combinations of their nearest neighbors, where the neighbors are determined a priori by spectral clustering. The algorithm we propose does not suffer such scaling issues for large matrices, and requires no preprocessing steps or tuning of hyperparameters before imputation. Cross validation is slow, and resulted in long imputation times (on the order of minutes) when using VIPER. As noted by the authors in [3], the quadratic programming algorithm used to implement box constrained regression is slow, which is why lasso preprocessing is proposed. The nonnegative least squares code of scImpute, applied in [9], also suffers efficiency issues; to combat this, the authors propose to limit the number of regression weights a priori using spectral clustering. FLI does not suffer such efficiency concerns, and thus offers a faster and more straightforward imputation than VIPER and scImpute. This is important in applications where large numbers of samples need to be processed quickly (e.g., large gene expression arrays); in such applications, FLI offers the most practical imputation time.

Conclusions and further work

The technique FLI proposed here offers accurate and fast imputation for miRNA expression data. In particular, the imputation offered by FLI was sufficiently accurate to retain the classification accuracy in cancer prediction and handwritten image recognition problems when a large proportion (up to \(85\%\)) of the dimensions were missing. Thus, FLI offers an effective means to classify samples with missing data, without the need for model retraining.

The application of focus here is miRNA expression analysis and cancer prediction. Since miRNA expression variables are highly correlated, as indicated by the plots in Fig. 6, FLI is ideal for miRNA expression data. FLI is not exclusive to miRNA, however, and is generalizable to any data set with highly correlated variables. This is validated by the multivariate normal data results in section “Multivariate normal examples” in the appendix. The current iteration of FLI requires a full training set (i.e., with no missing data) for the imputation, which can be considered a limitation of FLI. In further work, we aim to address this limitation, develop FLI for more general missing data problems, and further test the generalizability of FLI on a variety of expression technologies commonly applied in cancer prediction (e.g., single-cell RNA, mRNA, protein expression, metabolite expression).

In this study, we assumed the locations of the missing data points to be random and uniform. In practice, the distribution of dropout events may be nonuniform. For example, in miRNA sequencing, the lower limit of detection is related to sequencing depth; thus, within the technical zero range, there may be a broad range of true expression values. We hypothesize that such expressions will more frequently be dropouts, when compared to more significantly expressed miRNA. In further work, we aim to test the effectiveness of FLI, and the methods of the literature, in the case when the dropout distribution is nonuniform, once the distribution of dropout events is decided upon.

Availability of data and materials

All real miRNA expression data sets considered here are publicly available online. See section “Data sets”. The Japanese data (discussed in point 1 of section “Data sets”) can be downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE113740 and https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE113486. The Keller data set (point 2 of section “Data sets”) can be downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE31568. The Korean data of Lee (point 3) can be downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85589. The Singaporean data of Chan (point 4) can be downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE41922 and https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE42072. The synthetic handwritten digit data (point 5) can be downloaded from the Matlab machine learning toolbox, and is available for Matlab subscribers. The FLI code introduced here, and the code to generate the multivariate normal data (discussed in point 6 of section “Data sets”), is available from the authors upon reasonable request.

References

  1. Ciregan D, Meier U, Schmidhuber J. Multi-column deep neural networks for image classification. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE; 2012. pp. 3642–9.

  2. Huang M, Wang J, Torre E, Dueck H, Shaffer S, Bonasio R, Murray JI, Raj A, Li M, Zhang NR. SAVER: gene expression recovery for single-cell RNA sequencing. Nat Methods. 2018;15(7):539–42.

  3. Chen M, Zhou X. VIPER: variability-preserving imputation for accurate gene expression recovery in single-cell RNA sequencing studies. Genome Biol. 2018;19(1):1–15.

  4. Peng T, Zhu Q, Yin P, Tan K. SCRABBLE: single-cell RNA-seq imputation constrained by bulk RNA-seq data. Genome Biol. 2019;20(1):88.

  5. Slawski M, Hein M. Sparse recovery by thresholded non-negative least squares. Adv Neural Inf Process Syst. 2011;24:1926–34.

  6. Van Dijk D, Sharma R, Nainys J, Yim K, Kathail P, Carr AJ, Burdziak C, Moon KR, Chaffer CL, Pattabiraman D, et al. Recovering gene interactions from single-cell data using data diffusion. Cell. 2018;174(3):716–29.

  7. Van den Berge K, Perraudeau F, Soneson C, Love MI, Risso D, Vert J-P, Robinson MD, Dudoit S, Clement L. Observation weights unlock bulk RNA-seq tools for zero inflation and single-cell applications. Genome Biol. 2018;19(1):1–17.

  8. Lun AT, Bach K, Marioni JC. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 2016;17(1):1–14.

  9. Li WV, Li JJ. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat Commun. 2018;9(1):1–9.

  10. Zhang L, Zhang S. Comparison of computational methods for imputing single-cell RNA-sequencing data. IEEE/ACM Trans Comput Biol Bioinf. 2018;17(2):376–89.

  11. Gong W, Kwak I-Y, Pota P, Koyano-Nakagawa N, Garry DJ. DrImpute: imputing dropout events in single cell RNA sequencing data. BMC Bioinform. 2018;19(1):1–10.

  12. Ng AY, Jordan MI, Weiss Y. On spectral clustering: analysis and an algorithm. In: Advances in Neural Information Processing Systems. 2002. pp. 849–56.

  13. Gazzola S, Wiaux Y. Fast nonnegative least squares through flexible Krylov subspaces. SIAM J Sci Comput. 2017;39(2):655–79.

  14. Yamamoto Y, Kondo S, Matsuzaki J, Esaki M, Okusaka T, Shimada K, Murakami Y, Enomoto M, Tamori A, Kato K, et al. Highly sensitive circulating microRNA panel for accurate detection of hepatocellular carcinoma in patients with liver disease. Hepatol Commun. 2020;4(2):284–97.

  15. Usuba W, Urabe F, Yamamoto Y, Matsuzaki J, Sasaki H, Ichikawa M, Takizawa S, Aoki Y, Niida S, Kato K, et al. Circulating miRNA panels for specific and early detection in bladder cancer. Cancer Sci. 2019;110(1):408–19.

  16. Chan M, Liaw CS, Ji SM, Tan HH, Wong CY, Thike AA, Tan PH, Ho GH, Lee AS-G. Identification of circulating microRNA signatures for breast cancer detection. Clin Cancer Res. 2013;19(16):4477–87.

  17. Lee J, Lee HS, Park SB, Kim C, Kim K, Jung DE, Song SY. Identification of circulating serum miRNAs as novel biomarkers in pancreatic cancer using a penalized algorithm. Int J Mol Sci. 2021;22(3):1007.

  18. Elias KM, Fendler W, Stawiski K, Fiascone SJ, Vitonis AF, Berkowitz RS, Frendl G, Konstantinopoulos P, Crum CP, Kedzierska M, et al. Diagnostic potential for a serum miRNA neural network for detection of ovarian cancer. eLife. 2017;6:e28932.

  19. Keller A, Leidinger P, Bauer A, ElSharawy A, Haas J, Backes C, Wendschlag A, Giese N, Tjaden C, Ott K, et al. Toward the blood-borne miRNome of human diseases. Nat Methods. 2011;8(10):841–3.

  20. Al-Saffar AAM, Tao H, Talab MA. Review of deep convolution neural network in image classification. In: 2017 International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET). IEEE; 2017. pp. 26–31.

  21. Qin Z, Zeng Q, Zong Y, Xu F. Image inpainting based on deep learning: a review. Displays. 2021;102028.

  22. Gao B, Pavel L. On the properties of the softmax function with application in game theory and reinforcement learning. 2017. arXiv preprint arXiv:1704.00805.

  23. Caliński T, Harabasz J. A dendrite method for cluster analysis. Commun Stat-theory Methods. 1974;3(1):1–27.

  24. Zha Z, Wen B, Zhang J, Zhou J, Zhu C. A comparative study for the nuclear norms minimization methods. In: 2019 IEEE International Conference on Image Processing (ICIP). IEEE; 2019. pp. 2050–4.


Acknowledgements

Not applicable.

Funding

This research received support from the grant K12 HD000849, awarded to the Reproductive Scientist Development Program by the Eunice Kennedy Shriver National Institute of Child Health and Human Development (KME). The authors also wish to acknowledge funding support from the GOG Foundation, as part of the Reproductive Scientist Development Program (KME), Robert and Deborah First Family Fund (KME), the Massachusetts Life Sciences Center Bits to Bytes Program (JWW, KME), and Abcam, Inc (JWW). The funders listed above contributed funding for this study, but did not contribute to the writing or preparing of the manuscript.

Author information

Contributions

JWW developed FLI and the original idea. The experiments were conducted by JWW. Analyses of results were performed by JWW and KE. KE was a major contributor in writing the manuscript, and provided expert insight from a medical background needed to communicate this work to a medical audience. All authors have read and approved the manuscript.

Corresponding author

Correspondence to James W. Webber.

Ethics declarations

Ethics approval and consent to participate

No new human or animal data is presented here.

Consent for publication

There are no issues regarding consent for publication.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix: Materials and methods

In this section we describe in detail our data imputation strategy FLI, and discuss the classification models and metrics used in our results.

Description of FLI

Let \(X=(\mathbf {x}_1,\ldots ,\mathbf {x}_p)\in \mathbb {R}^{n_1\times p}\) denote some training data, with no missing entries, and let \(Z=(\mathbf {z}_1,\ldots ,\mathbf {z}_p)\in \mathbb {R}^{n_2\times p}\) denote some unseen data with NaN columns. Let \(\mathcal {I}_1=\{i_1,\ldots ,i_{p_1}\}\subset \{1,\ldots ,p\}\) denote the indices of the NaN columns, and let \(\mathcal {I}_2=\{j_1,\ldots ,j_{p_2}\}=\{1,\ldots ,p\}\backslash \mathcal {I}_1\), where \(p=p_1+p_2\). Then we aim to find

$$\begin{aligned} \text {arg\,min}_{V\in \mathcal {V}_+}\left\| \begin{pmatrix} \mathbf {x}^T_{j_1}\\ \vdots \\ \mathbf {x}^T_{j_{p_2}} \end{pmatrix}V-\begin{pmatrix} \mathbf {z}^T_{j_1} \\ \vdots \\ \mathbf {z}^T_{j_{p_2}}\end{pmatrix}\right\| _2^2+\alpha \sum _{j=1}^{n_2}\left( \sum _{i=1}^{n_1}v_{ij}-1\right) ^2, \end{aligned}$$
(1)

where

$$\mathcal {V}_+=\{V\in \mathbb {R}^{n_1\times n_2} : v_{ij}\ge 0,\ \forall \ 1\le i\le n_1, 1\le j\le n_2\}$$

is the set of matrices of size \(n_1\times n_2\) with non-negative entries. Here \(\alpha\) is set orders of magnitude larger than the maximum entry of X, so that the columns of V all sum to 1 as a hard constraint. Specifically, we set \(\alpha =10^6\times \max _{i,j}\left( X_{ij}\right)\). The missing values are then imputed via

$$\begin{aligned} \begin{pmatrix} \mathbf {z}^T_{i_1} \\ \vdots \\ \mathbf {z}^T_{i_{p_1}}\end{pmatrix}=\begin{pmatrix} \mathbf {x}^T_{i_1}\\ \vdots \\ \mathbf {x}^T_{i_{p_1}} \end{pmatrix}V. \end{aligned}$$
(2)

The \(\alpha\) term and nonnegativity constraints ensure that the imputed values lie within the range of the training data. To solve (1), we apply the nonnegative CGLS algorithm of [13], typically applied in inverse problems and image reconstruction. Such imputation methods, which use linear combinations of the known expressions to reconstruct the missing values, are appropriate when there is a high level of linear dependence across expressions. The miRNA expressions are highly correlated, and thus FLI (and VIPER and scImpute) are appropriate for imputing miRNA expression data. See section “Singular value plots”, where we show singular value plots highlighting the linear dependence across miRNAs for the data sets considered. The FLI algorithm detailed above is written in Matlab and is available from the authors upon request.
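For illustration, a minimal Matlab sketch of solving (1) and imputing via (2), for a single test sample, is given below. Here Matlab's built-in lsqnonneg stands in for the nonnegative CGLS solver of [13] used in our implementation, and the variable names are illustrative; the sum-to-one penalty is absorbed into the least squares system as an extra row weighted by \(\sqrt{\alpha }\).

    % Sketch of FLI for one test sample: Xtrain is n1 x p (complete), and
    % ztest is 1 x p with NaNs marking the missing entries.
    function ztest = fli_impute(Xtrain, ztest)
        miss  = isnan(ztest);                   % indices I_1 (missing)
        known = ~miss;                          % indices I_2 (observed)
        alpha = 1e6 * max(Xtrain(:));           % hard sum-to-one constraint
        % Augmented system: minimize ||A v - b||_2^2 + alpha*(sum(v) - 1)^2
        % subject to v >= 0, by appending the weighted sum-to-one row.
        A = [Xtrain(:, known)'; sqrt(alpha) * ones(1, size(Xtrain, 1))];
        b = [ztest(known)'; sqrt(alpha)];
        v = lsqnonneg(A, b);                    % nonnegative regression weights
        ztest(miss) = (Xtrain(:, miss)' * v)';  % imputation step (2)
    end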

Classification and error metrics

Here we introduce the metrics which will be used to assess the quality of the results. Let TP, FP, TN, and FN denote the number of true positives, false positives, true negatives, and false negatives, respectively, in a binary classification. Then, we report the following classification metrics in our comparisons:

  1. The classification accuracy

     $$\begin{aligned} \text {ACC}=\frac{\text {TP}+\text {TN}}{\text {TP}+\text {TN}+\text {FP}+\text {FN}} \end{aligned}$$

  2. The \(F_1\) score

     $$\begin{aligned} F_1=\frac{2\text {TP}}{2\text {TP}+\text {FP}+\text {FN}} \end{aligned}$$

  3. The Area Under the Curve (AUC) values corresponding to the Receiver Operating Characteristic (ROC).

All classification metrics above take values between 0 and 1. A value closer to 1 indicates a better performance, and vice versa.
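As a concrete illustration, the metrics above can be computed in Matlab as follows, assuming binary 0/1 vectors y (ground truth) and yhat (predictions), and a vector of classifier scores (names ours); perfcurve is part of the Statistics and Machine Learning Toolbox.

    % Confusion counts and the reported classification metrics.
    TP = sum(yhat == 1 & y == 1);  FP = sum(yhat == 1 & y == 0);
    TN = sum(yhat == 0 & y == 0);  FN = sum(yhat == 0 & y == 1);
    ACC = (TP + TN) / (TP + TN + FP + FN);
    F1  = 2 * TP / (2 * TP + FP + FN);
    [~, ~, ~, AUC] = perfcurve(y, scores, 1);   % area under the ROC curve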

Let \(\mathbf {x}\in \mathbb {R}^p\) denote a vector of ground truth expression values corresponding to a given patient, and let \(\mathbf {x}_{\epsilon }\in \mathbb {R}^p\) be an approximation (e.g., obtained through imputation). Then we define the least squares error

$$\begin{aligned} \epsilon =\frac{\Vert \mathbf {x}-\mathbf {x}_{\epsilon }\Vert _2}{\Vert \mathbf {x}\Vert _2}. \end{aligned}$$
(3)

We use \(\epsilon\) to measure the imputation error in section “Results”.

Data sets

We consider the following six data sets for testing; points (1)–(4) are real miRNA expression data sets from the literature, and points (5)–(6) are synthetic:

  1. Serum miRNA expression data of [14, 15], collected from \(n=2460\) Japanese patients. The number of expression values measured is \(p=2565\). The authors provide expression values for 1123 control patients, and for patients with 15 different types of diseases (the case number varies with the disease), including bladder cancer, hepatocellular carcinoma (HCC), breast cancer, ovarian cancer, and hepatitis. To compile the data used here, we combine the data sets of [14] and [15], making sure to delete any replicate patients. We focus on the HCC and bladder cancer case patients, as most samples are provided for these diseases. We consider two separate binary classification problems, whereby we aim to separate the control set from the bladder cancer and HCC patients.

  2. The expression data of [19], collected by Keller et al. from \(n=454\) German patients. The number of expression values measured is \(p=863\). The samples are comprised of 70 healthy control patients, and patients with 14 different cancer and noncancer diseases (the case number varies with the disease), including melanoma, ovarian cancer, multiple sclerosis, and pancreatic cancer.

  3. miRNA expression data of [17], collected by Lee et al. from 232 Korean patients. The number of expression values measured is \(p=2578\), all of which are used in our classifications. The samples include 88 patients with pancreatic cancer (PC) and 19 healthy controls. In [17], the authors combine the 19 healthy patients with 10 cholelithiasis patients to form a larger control set of 29 patients for use in the PC classifications. We use the same control set here, and aim to separate the controls from the PC patients. In total, we consider \(n=117\) patients.

  4. miRNA expression data of [16], collected by Chan et al. from \(n=116\) Singaporean patients, consisting of 67 patients with breast cancer and 49 healthy controls. There are \(p=116\) expression values for each patient. The samples are comprised of two cohorts, one of size \(n=62\), and the other of size \(n=54\). The expression values of each cohort are measured using different technologies, which creates a batch effect. This example is included to test the effectiveness of each considered imputation method in the presence of batch effects.

  5. Synthetic handwritten image data: the “Digits” database from Matlab. We focus on the images of zeros and ones provided for our classifications. That is, we consider a binary classification problem whereby we aim to separate images of zeros from images of ones. The data consists of 988 images of zeros and 1026 images of ones, each of size \(28\times 28\). In this case, \(n=1026+988=2014\), and \(p=28^2=784\).

  6. Synthetic multivariate normal data. This data consists of \(n=1000\) samples drawn from a multivariate normal distribution with dimension \(p=1000\), mean zero, and correlation matrix C, where C is the square \(p\times p\) matrix whose diagonal entries are 1 and all other entries are fixed at some \(R\in [0,1)\). For more details, see section “Multivariate normal examples”.

The data sets discussed in points 1–6 above were chosen to have variety in population type and size, cancer/disease type, and dimension size. We consider four miRNA data sets (discussed in points 1–4 above) from four distinct populations (namely, Japan, Germany, Korea, and Singapore, respectively), which include a number of different cancer and disease types, and control groups, e.g., melanoma, pancreatic cancer, breast cancer, and healthy controls, as well as a variety of different sample and dimension sizes. The chosen data sets also offer a wide variety of ratios between control and case group sizes. For example, the Japanese data includes more controls than cases, for all diseases considered (e.g., 1123 healthy controls vs 40 ovarian cancer samples), whereas the data collected by Keller et al., from German patients, has a more equal balance between cases and controls (e.g., 70 healthy controls vs 15 ovarian cancer samples).

We also consider two synthetic data sets, namely handwritten digit and multivariate Gaussian data, as discussed in points 5 and 6. The handwritten image data is included as a visual example to show the intuition of FLI and the other methods considered. That is, one can imagine miRNA expression imputation as filling in missing pixels of an image (or “inpainting”). The aim is to help the reader to see clearly the effects of the different imputation methods, e.g., as in Fig. 3, and understand how the methods are performing and why some methods perform better than others. We feel the inpainting analogy also provides context more generally to a machine learning audience, as image classification and inpainting are highly studied problems in machine learning [20, 21].

The multivariate normal data is included as an example where we have the ability to more precisely control the correlation level among the variables, and show the relation between variable correlation and imputation accuracy and classification performance. See section “Multivariate normal examples”.

Classification model

Here we discuss the classification model used to classify patients in the real miRNA expression data experiments conducted in this paper. Let \(X\in \mathbb {R}^{n\times p}\) be a set of imputed miRNA expression data (i.e., with no missing values). To classify patients, we train a softmax function [22] classification model

$$\begin{aligned} P(y=j \ |\ \mathbf {x})=\frac{e^{\mathbf {x}^T\mathbf {w}_j+\mathbf {b}_j}}{\sum _{i=0}^{n_c-1}e^{\mathbf {x}^T\mathbf {w}_i+\mathbf {b}_i}}, \end{aligned}$$
(4)

where \(j\in \{0,1,\ldots ,n_c-1\}\) is the class label, with \(n_c>1\) the number of classes, \(\mathbf {x}\in \mathbb {R}^p\) is a patient sample after imputation (e.g., one row of X), and the \((\mathbf {w}_j,\mathbf {b}_j)\) are weights and biases to be trained. Here y denotes the class label assigned to \(\mathbf {x}\). The class with the highest probability P is then chosen for membership.
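As one standard way to fit the model (4) in Matlab, multinomial logistic regression can be used (mnrfit/mnrval, Statistics and Machine Learning Toolbox). This is a hedged sketch with illustrative variable names; the training code used for our experiments may differ in detail.

    % Fit the softmax weights and biases (w_j, b_j) of (4), then classify.
    % ytrain contains class labels in {1, ..., n_c}.
    B = mnrfit(Xtrain, ytrain);        % multinomial logistic regression fit
    P = mnrval(B, Xtest);              % class probabilities P(y = j | x)
    [~, yhat] = max(P, [], 2);         % assign the most probable class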

Validation methods

Here we discuss the validation methods used in the experiments conducted. We use a multiple hold-out set validation. That is, we randomly and uniformly set aside a test subset of \(n_t<n\) samples from the data. We then randomly and uniformly assign a subset of \(m<p\) variables to NaN for each test patient, where m is determined by the percentage of missing dimensions (see section “Results”). The missing dimensions are then imputed, and the test patients are classified using the model discussed in section “Classification model”. The above process is repeated over \(n_T\) trials and the results are averaged. For the data sets with small n, i.e., data sets (2)–(3) of section “Data sets”, where \(n<150\), we set \(n_t=10\) and \(n_T=30\). For data set (1) of section “Data sets”, where \(n>1000\), we set \(n_t=300\) and \(n_T=1\). In total, there are \(n_T\times n_t=300\) test trials for each classification performed.
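For concreteness, one hold-out trial can be sketched as follows, where impute and classify_patients are placeholders (our notation) for the imputation method and the softmax model under test.

    % One hold-out trial: hold out n_t samples, mask m dimensions per test
    % sample uniformly at random, impute, then classify.
    test_idx = randperm(n, n_t);
    Xtest  = X(test_idx, :);
    Xtrain = X(setdiff(1:n, test_idx), :);
    m = round(pct_missing / 100 * p);           % number of missing dimensions
    for i = 1:n_t
        Xtest(i, randperm(p, m)) = NaN;         % uniform random missing mask
    end
    Xtest = impute(Xtrain, Xtest);              % placeholder: FLI, VIPER, etc.
    yhat  = classify_patients(Xtest);           % placeholder: softmax model (4)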

Implementation of VIPER and scImpute

Here we discuss the implementations of VIPER and scImpute used in this paper. Both algorithms were implemented in Matlab; the R code provided in [3, 9] was run from within Matlab. The Matlab formulations of VIPER and scImpute are available from the authors upon request.

scImpute

We perform scImpute as explained in [9], but with a few technical changes which we shall now discuss. First, since no outliers were detected in the data sets used here, we do not implement the outlier removal stage of scImpute discussed in [9]. In some instances, the Nonnegative Least Squares (NNLS) solver employed in [9] suffered crashing issues for underdetermined systems. Hence, in the case of underdetermined system matrices, we multiplied both sides by the matrix transpose so that the input to the NNLS solver was a square matrix. That is, we solved the normal equations, which is an equivalent problem. In rare cases, the NNLS solver produced the error “Matrix inversion failed”. In such instances, we imputed the mean value over the nearest neighbors determined by spectral clustering (i.e., the spectral clustering step described in [9]).
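In Matlab-like form, the workaround amounts to the following, where A and b denote the regression matrix and target passed to the NNLS step (names ours), and lsqnonneg stands in for the NNLS solver of [9].

    % For underdetermined A, pass the normal equations to the NNLS solver
    % so that its input matrix is square.
    v = lsqnonneg(A' * A, A' * b);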

VIPER

VIPER is implemented exactly as discussed in [3], with lasso used as preprocessing.

In [3, 9], the authors propose methods to determine the locations of the missing data. In this paper, we assume knowledge of the locations of the missing data a priori, and hence we do not implement such aspects of VIPER and scImpute.

Hyperparameter selection

Here we discuss the selection of hyperparameters for the methods compared against. For scImpute, there are two hyperparameters, namely the number of clusters K and the affinity parameter \(\sigma\) used for the spectral clustering step. Here we are using the notation of [9] and [12]. In [12], it is suggested to choose \(\sigma\) so that the within-cluster variances are minimized. We use a similar idea and choose \(K\in \{2,\ldots ,10\}\) and \(\sigma\) such that the ratio of the between- and within-cluster variances is maximized, i.e., we maximize the Calinski-Harabasz statistic [23]. In [9], no method for choosing K or \(\sigma\) is given, so we use the ideas of [12] (cited in [9]) to choose K and \(\sigma\).
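As an illustration, the Calinski-Harabasz criterion can be maximized over \(K\in \{2,\ldots ,10\}\) in Matlab via evalclusters (Statistics and Machine Learning Toolbox). Note this snippet uses k-means for brevity, whereas scImpute uses spectral clustering, so it sketches the selection criterion rather than our exact implementation.

    % Choose the number of clusters maximizing the Calinski-Harabasz statistic.
    ev = evalclusters(X, 'kmeans', 'CalinskiHarabasz', 'KList', 2:10);
    K  = ev.OptimalK;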

For lasso VIPER, the lasso parameter is chosen by 10-fold cross validation, and the threshold parameter is set to \(t=0.001\), as in [3]. Note here we are using the notation of [3].

Fig. 6

Singular value plots of the expression matrices X corresponding to the Japanese, Singapore, handwritten digits, and Keller data. The nuclear norm values \(\Vert X\Vert _*\) are given in the figure subcaptions

As discussed in section “Description of FLI”, the \(\alpha\) parameter of FLI (see equation (1)) is set orders of magnitude larger than the maximum entry of the expression training matrix X, so that the imputation weights all sum to 1 as a hard constraint. Specifically, we set \(\alpha =10^6\times \max _{i,j}\left( X_{ij}\right)\), i.e., six orders of magnitude larger than the maximum entry of X.

The remaining methods compared against for baseline, namely regression, mean and zero imputation, have no hyperparameters.

Singular value plots

Here we show plots of the singular values of the expression matrices X corresponding to some of the data sets considered. See Fig. 6. We see a high level of linear dependence among the expression values, as indicated by the singular value plots. The nuclear norm [24] is commonly used to approximate the matrix rank. We use the nuclear norm here to measure the level of linear dependence among the miRNA expression values. A smaller nuclear norm indicates greater linear dependence, and vice versa. Based on the nuclear norm values, there is higher linear dependence among the Japanese patients, and thus an imputation using linear combinations (e.g., as with FLI) is likely to be more accurate. The Keller data shows the least linear dependence among the expression values, and thus we expect the FLI imputation (and those of VIPER and scImpute) to be less accurate in this case.
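The quantities plotted in Fig. 6 are straightforward to compute; a short Matlab sketch is given below.

    % Singular value spectrum and nuclear norm of an expression matrix X.
    s   = svd(X);           % singular values
    nuc = sum(s);           % nuclear norm ||X||_* = sum of singular values
    semilogy(s)             % fast decay indicates high linear dependence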

Multivariate normal examples

To extend the discussion of section “Singular value plots”, in this section we test the imputation accuracy of FLI on correlated multivariate normal data with varying levels of linear dependence among the variables, and plot the imputation error against the correlation level.

Fig. 7

Example multivariate normal data sets for varying R, when \(p=2\). The two classes are highlighted in blue and red in each case. Here, \(R\sim 1\) means \(R=1-10^{-6}\). Note, R cannot be exactly 1 as then C would not be positive definite

To generate data, we sample from a multivariate normal distribution with mean zero and correlation matrix C, where C is the square \(p\times p\) matrix whose diagonal entries are 1 and all other entries are fixed at some \(R\in [0,1)\). We use R to vary the linear dependence level among the p variables. We separate the data into two classes which are divided by the linear classification boundary \(\{\mathbf {x}\cdot (1,\ldots ,1)^T=0\}\), i.e., the \((p-1)\)-dimensional hyperplane through the origin which is perpendicular to \((1,\ldots ,1)^T\). We choose this classification boundary so that all variables have equal correlation to the class labels, and thus have equal weight towards classification performance. In Fig. 7, we show example data sets simulated in this way for varying R, when \(p=2\), for visualization. The number of samples is \(n=1000\). In this case, the classification boundary is the line \(y=-x\).
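A short Matlab sketch of this data generator is given below (mvnrnd is from the Statistics and Machine Learning Toolbox; variable names are ours).

    % Equicorrelated multivariate normal data with off-diagonal correlation R,
    % and class labels split by the hyperplane x . (1, ..., 1)^T = 0.
    C = R * ones(p) + (1 - R) * eye(p);    % correlation matrix
    X = mvnrnd(zeros(1, p), C, n);         % n samples, mean zero, dimension p
    y = double(sum(X, 2) > 0);             % class labels from the boundary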

Fig. 8

a–c Multivariate normal imputation results. d Relation between R and the nuclear norm \(\Vert X\Vert _*\)

To test the effectiveness of FLI, we set \(p=1000\) and generate multivariate normal data with \(n=1000\) samples. Then, we vary R and the percentage of missing dimensions and record the changes in imputation error and classification accuracy. The classification model is a softmax function, as described in section “Classification model”. See Fig. 8 for our results.

In Fig. 8a, we plot the percentage of missing dimensions against \(\epsilon _{\mu }\) (as in Figs. 2d, 4d and 5d) for a variety of R values, and in Fig. 8b we plot the mean value over the curves in Fig. 8a against R. The same metric was used to measure imputation performance in Tables 1, 2, and 3 in the main text. We also include error bars in Fig. 8b, whose width is twice the standard deviation of the \(\epsilon _{\mu }\) values over the curves in Fig. 8a. A lower mean and standard deviation \(\epsilon _{\mu }\) score indicates more accurate, more consistent imputation performance as the percentage of missing dimensions varies, and vice versa. As R increases, the mean and standard deviation \(\epsilon _{\mu }\) scores decrease, and the relation appears continuous and monotonic. This is as we would expect, and is consistent with the singular value analysis in sub-section “Singular value plots”. In Fig. 8c, we plot the mean AUC, \(F_1\), and ACC scores against R. As R increases, the classification accuracy increases, and the softmax model retains high classification accuracy (\(\text {AUC},F_1,\text {ACC}>0.95\)) with only mild correlation among the variables (\(R>0.15\)), further indicating the imputation abilities of FLI for binary classification problems. In Fig. 8d, we plot R against the nuclear norm \(\Vert X\Vert _*\) of the data, to show the relation between R and the nuclear norm introduced in sub-section “Singular value plots”. The plot shows a continuous, inverse, monotonic relationship between R and \(\Vert X\Vert _*\). Thus, as R increases, or, equivalently, as \(\Vert X\Vert _*\) decreases, we can expect to see increased imputation accuracy and greater retention of the classification performance using FLI.

Additional miRNA expression results

Here we present additional validation of FLI on two more real miRNA expression data sets from the literature.

Japan and Keller data results

In this section, we test the accuracy of FLI on the miRNA expression data of [14, 15], collected from Japanese patients, and the expression data of Keller et al. [19], collected from German patients.

Fig. 9

Keller and Japan data results, using FLI. a–c AUC, ACC and \(F_1\) scores with percentage of missing dimensions. The disease classified is given in the figure legend

For the Japanese data, we focus specifically on the binary classification problems of separating bladder cancer and HCC patients from controls, as is considered in [14, 15]. For the Keller data, we focus on separating melanoma and multiple sclerosis patients from the control set. We chose melanoma and multiple sclerosis, as the softmax function classifier (discussed in section “Classification model”) offered the best accuracy scores in terms of AUC when separating these diseases from controls, compared to the other diseases considered by Keller. In total, we consider four binary classification problems: two associated with the Japanese data (i.e., the HCC and bladder cancer classifications), and two from the Keller data set (i.e., the melanoma and multiple sclerosis classifications). See Fig. 9, where we show curves of the classification accuracy scores with the percentage of missing dimensions, for each of the four classifications considered. In Table 4, we present the average scores over the curves shown in Fig. 9, as an average measure of the performance of FLI over all levels of missing dimensions.

Table 4 Keller and Japan data results

The results offer further validation of the effectiveness of FLI in retaining the classification accuracy on two more large miRNA expression data sets from the literature. The imputation times of Table 4b further validate the efficiency of FLI. For example, the maximum imputation time recorded, over all test patients processed, was \(t_M=1.15\) s. On the Japanese data, FLI is most effective in retaining the classification accuracy, when compared to the Keller data, and offers AUC, ACC, and \(F_1\) scores exceeding 0.95 up to \(70\%\) of dimensions missing. On the Keller data, FLI is less effective in retaining the classification accuracy, and offers AUC, ACC, and \(F_1\) scores exceeding 0.95 up to approximately \(50\%\) of dimensions missing. The mean curve values presented in Table 4a highlight the difference in the effectiveness of FLI on the Japanese and Keller data sets. In Fig. 6, we showed singular value plots of the expression data matrices corresponding to the Keller and Japanese data sets. The plots indicated a higher level of linear dependence among the Japanese miRNA set, when compared to that of Keller. Thus, we would expect an imputation such as FLI, which uses linear combinations of the training expressions to impute the missing expressions, to be less effective on the Keller data set, as there is less linear dependence among the miRNA subset chosen by Keller. The results observed here are thus in line with the findings of Fig. 6.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


Cite this article

Webber, J.W., Elias, K.M. Fast and robust imputation for miRNA expression data using constrained least squares. BMC Bioinformatics 23, 145 (2022). https://doi.org/10.1186/s12859-022-04656-4
