Fast and robust imputation for miRNA expression data using constrained least squares

Background: High dimensional transcriptome profiling, whether through next generation sequencing techniques or high-throughput arrays, may result in scattered variables with missing data. Data imputation is a common strategy to maximize the inclusion of samples by using statistical techniques to fill in missing values. However, many data imputation methods are cumbersome and risk introduction of systematic bias.

Results: We present a new data imputation method using constrained least squares and algorithms from the inverse problems literature, and present applications for this technique in miRNA expression analysis. The proposed technique is shown to offer imputation orders of magnitude faster, with accuracy greater than or equal to that of similar methods from the literature.

Conclusions: This study offers a robust and efficient algorithm for data imputation, which can be used, e.g., to improve cancer prediction accuracy in the presence of missing data.

but with some missing pixels, is shown in Fig. 1b. The locations of the missing pixels are selected uniformly at random. If we wish to classify the images with missing pixels, then it is ill-advised to perform no data imputation (i.e., to impute zeros), as the accuracy would suffer. See Fig. 1c, where we have shown the effect of imputing zeros on the AUC, classification accuracy (ACC), and F1 score with the percentage of missing pixels. We see a decrease in classification accuracy and F1 score when more than 10% of the pixels are missing, and the reduction in accuracy is more pronounced as the percentage of missing pixels increases. Several strategies have been described for data imputation in gene expression and miRNA expression analysis [2][3][4][5][6][7][8][9][10][11]. Two popular techniques are "VIPER" and "scImpute". In [3], the authors introduce "VIPER", which implements data imputation on gene expression data using a combination of lasso (or an elastic net) with a box constrained regression. That is, first a set of neighboring cells, which have related expression values to the missing cell, is found using lasso (or an elastic net). Then a box constrained regression is performed on the selected neighbors to fill in the missing gene expression values. The reason for using lasso as preprocessing is that the quadratic programming code employed in [3] for box constrained regression does not scale well to large matrices, and thus lasso (or an elastic net) is used to select a subset of candidate nearest neighbors to reduce the array size before nonnegative regression. In [9], the authors introduce "scImpute", which shares a similar intuition with VIPER. First, in the scImpute algorithm, the cells are clustered into K groups using spectral clustering [12]. Then, the missing cell expressions are reconstructed from their neighboring cells by a nonnegative constrained regression.
That is, the missing values are imputed using nonnegative linear combinations (i.e., a linear combination with nonnegative coefficients) of their nearest neighbors, where the neighboring cells are determined by spectral clustering. We choose to focus on VIPER (specifically the lasso variant) and scImpute for comparison here, as they share the most similarities with the proposed method.
Here we present a novel, fast method for data imputation using constrained Conjugate Gradient Least Squares (CGLS), borrowing ideas from the imaging and inverse problems literature. As an example of a desired application for this work, we present miRNA expression analysis, with a particular focus on cancer prediction. As shown below, highly accurate cancer prediction is possible using simple classifiers (e.g., a softmax function is used here), on a wide variety of data sets, in the case when all (or a large fraction) of the miRNA expression values are known, and there is little to no missing data. It is not always possible to measure all expression values contained in the training set for every patient, however. To combat this, we aim to impute the missing values using the known expressions, so that we can retain use of our accurate model fit to the full set of miRNA available in the training set. We propose to reconstruct the missing data via nonnegative constrained regression, but with the further constraint that the regression weights sum to 1. Such constraints ensure that the imputed values lie within the range of the training data, with the idea to prevent overfitting. We enforce the regression weights to sum to 1 as a hard constraint in our objective, so that the nonnegative least squares and weight normalization steps are carried out simultaneously. To solve our objective, we apply the nonnegative Conjugate Gradient Least Squares (CGLS) algorithm of [13], typically applied in inverse problems and image reconstruction. The CGLS code we apply does not suffer the scaling issues encountered in, e.g., VIPER for large matrices, and can efficiently process large scale expression arrays.

Fig. 1: (a) Example image of a handwritten number 1. (b) The same image with missing pixels. (c) AUC, classification accuracy, and F1 score versus % of missing pixels.
The algorithm we propose offers a fast, efficient, and accurate imputation without the need for preprocessing steps, e.g., as in VIPER and scImpute. Our method is also completely nonparametric, and thus requires no tuning of hyperparameters before imputation, in contrast to scImpute and VIPER, which require that two hyperparameters be tuned. Such parameters may be selected, for example, by cross validation, as is suggested in [3]. However, cross validation is slow, particularly for large data, and is thus impractical for clinical applications. To demonstrate the technique, we test the performance on miRNA expression data publicly available in the literature, and give a comparison to VIPER and scImpute. Specifically, as a measure of performance, we focus on how effectively each method retains the classification accuracy with the percentage of missing data (as in the curves shown in Fig. 1c). The proposed method is shown to be orders of magnitude faster than VIPER and scImpute, with accuracy greater than or equal to that of these methods, for the examples of interest considered here in cancer prediction.

Results
The method proposed here will be denoted by Fast Linear Imputation (FLI), for the remainder of this paper. The FLI algorithm and the core objective functions are discussed in detail in the appendix, section "Description of FLI". In this section, we present a comparison of FLI, and the methods VIPER [3] and scImpute [9] from the literature on publicly available miRNA expression data [14][15][16] and synthetic handwritten image data. The specific implementations of VIPER and scImpute used here are discussed in the appendix, section "Implementation of VIPER and scImpute". FLI is also compared against (unconstrained) regression, mean, and zero imputation as baseline. The classification model, selection of hyperparameters, and classification metrics are detailed in the appendix.

Synthetic handwritten image results
In this section, we present our results on the synthetic handwritten image data discussed in the introduction. The handwritten image data is included as a visual example to show clearly how the imputation methods are performing and to give some intuition as to why some methods perform better than others. For more details on this data see section "Data sets". In Fig. 2a-c we present plots of the AUC, ACC and F1 scores with the percentage of missing dimensions, for each method. Figure 2d-f show the corresponding plots of the mean, standard deviation and maximum imputation errors, over all test patients. The plots in Fig. 2e, f are cropped on the vertical axis to better highlight the errors corresponding to the more competitive methods. This cuts off the end of the error curve corresponding to scImpute, which spikes when ≈ 95% of the dimensions are missing. In Table 1, we present the average values over the curves in Fig. 2a-f, as a measure of the average performance over all possible levels of missing data. The mean, standard deviation, and maximum imputation times, over all test patients, are given in Table 1c.
In terms of retaining the classification accuracy, FLI, VIPER, and scImpute offer comparable performance. FLI and VIPER are joint best and offer mean AUC, ACC, and F1 scores exceeding 99%. For the baseline methods, namely (unconstrained) regression, mean, and zero imputation, we see a reduction in the classification accuracy, and the reduction is more pronounced when > 70% of the dimensions are missing, as evidenced by the curves in Fig. 2a-c. In terms of the imputation error, FLI offers the most consistent imputation accuracy when compared to VIPER and scImpute, in the sense that FLI offers the smallest standard deviation and maximum errors.
For an example image imputation, see Fig. 3, where we have shown image reconstructions of the handwritten "1" image discussed in the introduction (Fig. 1a). We see ghosting artifacts in the VIPER reconstruction, and a significant blurring effect in the scImpute reconstruction. The regression imputation appears overfit, and introduces severe artifacts. FLI offers the clearest and sharpest image, with relatively few artifacts. So, in some cases, artifacts are introduced by the VIPER and scImpute reconstructions. While this is not enough to confuse the classifier (i.e., the classification accuracy is still retained), the imputation error is less consistent when compared to FLI. In particular, the average maximum imputation error offered by FLI, over all levels of missing dimensions, is 17% lower than that of the next best performing method, namely VIPER. See the third row of Table 1b. FLI is also orders of magnitude faster than VIPER and scImpute, as indicated by the imputation times of Table 1c.

Singapore results
Here we present our results on the miRNA expression data of Chan et al. [16], collected from Singaporean patients. This data includes significant batch effects due to different measurement technologies. See section "Data sets" for more details. See Fig. 4a-c for plots of the classification accuracy, and Fig. 4d-f for the imputation errors with the percentage of missing dimensions. See Table 2 for the mean values over the curves in Fig. 4a-f, and the mean, standard deviation, and maximum imputation times. In this example, FLI offers the best performance in terms of retaining the AUC, ACC and F1 score, on average, across all levels of missing dimensions. As evidenced by Fig. 4a, FLI offers the highest AUC over all levels of missing dimensions. We see a similar effect in the ACC and F1 score curves of Fig. 4b, c, although, in a minority of cases, scImpute slightly outperforms FLI. The retention of the classification accuracy is significantly reduced using regression, mean and zero imputation, when compared to FLI, VIPER, and scImpute.
FLI offers the best performance in terms of the mean, standard deviation, and maximum imputation error, across all levels of missing dimensions, when compared to VIPER and scImpute. A zero imputation offers the best maximum and standard deviation error over all methods. The mean error offered by zero imputation is significantly higher than that of FLI, scImpute and VIPER, however. We would expect the standard deviation of a zero imputation to be low, as there is not much variation among the imputations (i.e., many of the values are zeros). The maximum error curves of Fig. 4f indicate that, for some patients, the imputation error is high when using FLI, VIPER and scImpute, as simply imputing zeros offers less error. Such erroneous patients can be considered outliers, and do not greatly affect the overall classification accuracy, as evidenced by the plots of Fig. 4a-c. The imputation time offered by FLI is orders of magnitude faster than that of VIPER and scImpute. For example, FLI is approximately three orders of magnitude faster, in terms of mean imputation time, when compared to VIPER, which was the next best performing method in terms of AUC, ACC and F1 score, after FLI. The imputation time offered by FLI is also more consistent when compared to VIPER, as evidenced by the t_σ scores. When compared to scImpute, FLI is approximately one order of magnitude faster in terms of mean and maximum imputation time, and is more consistent, with lower standard deviation. Regression, mean and zero imputation are the fastest methods, but at the cost of accuracy.
This example was included given the presence of significant batch effects, as discussed at the beginning of this section, and in more detail in section "Data sets". This example provides evidence that FLI outperforms similar methods, such as VIPER and scImpute, in terms of accuracy and imputation time, when imputing data in the presence of batch effects.

Korea results
Here we present our results on the miRNA expression data of Lee et al. [17], collected from Korean patients. For more details on this data see section "Data sets", point (3).
In Fig. 5a-c we present plots of the AUC, ACC and F1 scores with the percentage of missing dimensions, for each method. Figure 5d-f show the corresponding imputation error plots; the plots in Fig. 5d, f are cropped to ε = 0.35 on the vertical axis to better highlight the errors corresponding to the more competitive methods. Thus, parts of some of the error curves are missing in Fig. 5d, f. For example, in most cases (i.e., for most levels of missing dimensions considered), the zero imputation mean and maximum errors exceed 0.35, which is why much of the light blue curves corresponding to zero imputation are missing from the plots.
In Table 3a, b, we present the average values over the curves in Fig. 5a-f. The mean, standard deviation, and maximum imputation times, over all test patients, are given in Table 3c. In this example, FLI, VIPER, and scImpute offer similar levels of performance in terms of retaining the classification accuracy, with FLI offering the best average AUC, and scImpute the best average ACC and F1 scores. A standard regression imputation is also effective in retaining the classification accuracy up to approximately 85% of dimensions missing, and is comparable to FLI, VIPER, and scImpute within this range. We see a sharp reduction in accuracy in the regression curves (i.e., the purple curves of Fig. 5a-c) when more than 85% of the dimensions are missing, however, and regression significantly underperforms FLI, scImpute, and VIPER at this limit. For mean and zero imputation, we see a more gradual reduction in accuracy when compared to regression. The imputation errors offered by FLI, VIPER, scImpute, and mean imputation are comparable, and outperform regression and zero imputation. When compared to VIPER and scImpute, FLI offers a mean imputation time which is orders of magnitude faster. The imputation time offered by FLI is also more consistent, with lower standard deviation and maximum, when compared to VIPER and scImpute. As was the case in the previous examples, regression, mean, and zero imputation are the fastest methods, but at the cost of accuracy.

Discussion
In this paper we introduced FLI, a fast, robust data imputation method based on constrained least squares. To illustrate the technique, we tested FLI on synthetic handwritten image and real miRNA expression data sets, and gave a comparison to two similar methods from the literature, namely VIPER [3] and scImpute [9]. We also compared against (unconstrained) regression, mean, and zero imputation as baseline. The results highlight the effectiveness of FLI in retaining the classification accuracy in cancer prediction applications using miRNA expression data, and in image classification. When compared to VIPER and scImpute, FLI was shown to offer greater than or equal imputation accuracy, with imputation speed orders of magnitude faster than scImpute and VIPER, in all examples considered. VIPER, scImpute, and FLI significantly outperformed regression, mean and zero imputation in terms of imputation accuracy, in all examples considered, but were slower given the greater computational complexity. For further validation of FLI on two more real miRNA expression data sets, see appendix 6.
In section "Singapore results", we considered an example expression data set collected from Singaporean patients, which included significant batch effects. When batch effects were present, FLI was shown to outperform VIPER and scImpute in terms of retaining the classification accuracy, and imputation error. On the handwritten image and Korean data sets, considered in sections "Synthetic handwritten image results" and "Korea results", such batch effects were not detected. In these examples, the imputation accuracy offered by FLI, VIPER, and scImpute was comparable. This study provides evidence that FLI offers optimal imputation accuracy, when compared to the methods of the literature, on batch data. This is important, since batch effects are common in medical data [18], and thus an imputation which is effective in combating batch effects, without the need for a priori batch correction steps, is desirable.
In all examples considered, FLI was shown to be orders of magnitude faster than VIPER and scImpute. FLI is also completely nonparametric, and thus more straightforward to implement, in contrast to VIPER and scImpute, which require the tuning of two hyperparameters. It is suggested in [3] to tune the lasso parameter used by VIPER via cross validation. The reason for using lasso as preprocessing in VIPER is that the quadratic programming code employed in [3] for box constrained regression does not scale well to large matrices, and thus lasso (or an elastic net) is used to select a subset of candidate nearest neighbors before nonnegative regression. A similar intuition is used for scImpute in [9], whereby the training data is clustered into K groups using spectral clustering [12] before nonnegative regression. That is, the test samples are imputed using linear combinations of their nearest neighbors, where the neighbors are determined a priori by spectral clustering. The nonnegative least squares code of scImpute, applied in [9], also suffers efficiency issues, which the authors combat by limiting the number of regression weights a priori via the clustering step. Cross validation is slow, however, and resulted in long imputation times (on the order of minutes) when using VIPER. The algorithm we propose does not suffer such scaling issues for large matrices, requires no preprocessing steps before imputation, and requires no tuning of hyperparameters.
FLI thus offers a faster and more straightforward imputation, when compared to VIPER and scImpute. This is important in applications where large numbers of samples need to be processed quickly (e.g., large gene expression arrays). In such applications, FLI offers the most practical imputation time, in comparison to VIPER and scImpute.

Conclusions and further work
The technique FLI proposed here offers accurate and fast imputation for miRNA expression data. In particular, the imputation offered by FLI was sufficiently accurate to retain the classification accuracy in cancer prediction and handwritten image recognition problems when a large proportion (up to 85%) of the dimensions were missing. Thus, FLI offers an effective means to classify samples with missing data, without the need for model retraining.
The application of focus here is miRNA expression analysis and cancer prediction. Since miRNA expression variables are highly correlated, as indicated by the plots in Fig. 6, FLI is ideal for miRNA expression data. FLI is not exclusive to miRNA, however, and is generalizable to any data set with highly correlated variables. This is validated by the multivariate normal data results in section "Multivariate normal examples" in the appendix. The current iteration of FLI requires a full training set (i.e., with no missing data) for the imputation, which can be considered a limitation of FLI. In further work we aim to address this limitation and develop FLI for more general missing data problems, and to test further the generalizability of FLI on a variety of expression technologies commonly applied in cancer prediction (e.g., single-cell RNA, mRNA, protein expression, metabolite expression).
In this study, we assumed the locations of the missing data points to be random and uniform. In practice, the distribution of drop out events may be nonuniform. For example, in miRNA sequencing, the lower limit of detection is related to sequencing depth, thus within the technical zero range there may be a broad range of true expression values. We hypothesize that such expressions will more frequently be drop outs, when compared to more significantly expressed miRNA. In further work, we aim to test the effectiveness of FLI, and the methods of the literature, in the case when the drop out distribution is nonuniform, once the distribution of drop out events is decided upon.

Appendix: Materials and methods
In this section we describe in detail our data imputation strategy FLI, and discuss the classification models and metrics used in our results.

Description of FLI
Let X = (x_1, ..., x_p) ∈ R^{n_1×p} denote some training data, with no missing entries, and let Z = (z_1, ..., z_p) ∈ R^{n_2×p} denote some unseen data with NaN columns. Let I_1 = {i_1, ..., i_{p_1}} ⊂ {1, ..., p} denote the indices of the NaN columns, and let I_2 = {j_1, ..., j_{p_2}} = {1, ..., p}\I_1, where p = p_1 + p_2. Let X_{I_2} ∈ R^{n_1×p_2} and Z_{I_2} ∈ R^{n_2×p_2} denote the restrictions of X and Z to the columns indexed by I_2. Then we aim to find

V* = argmin_{V ∈ R^{n_1×n_2}_+} ‖ [X_{I_2}^T ; α 1_{n_1}^T] V − [Z_{I_2}^T ; α 1_{n_2}^T] ‖_F^2,   (1)

where R^{n_1×n_2}_+ is the set of matrices of size n_1 × n_2 with non-negative entries, 1_n denotes the vector of n ones, and [A ; b^T] denotes the matrix A with the row b^T appended. Here α is set orders of magnitude larger than the maximum entry of X, so that the columns of V all sum to 1 as a hard constraint. Specifically, we set α = 10^6 × max_{i,j} X_{ij}. The missing values are then imputed via

Z_{I_1}^T = X_{I_1}^T V*.

The α term and nonnegativity constraints ensure that the imputed values lie within the range of the training data. To solve (1) we apply the nonnegative CGLS algorithm of [13], typically applied in inverse problems and image reconstruction. Such imputation methods, which use linear combinations of the known expressions to reconstruct the missing values, are appropriate when there is a high level of linear dependence across expressions. The miRNA expressions are highly correlated, and thus FLI (and VIPER and scImpute) are appropriate for imputing miRNA expression data. See section "Singular value plots", where we show singular value plots highlighting the linear dependence across miRNAs for the data sets considered. The FLI algorithm detailed above is written in Matlab and is available from the authors upon request.
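The objective above can be sketched in code. The authors' implementation is in Matlab and uses the nonnegative CGLS solver of [13]; the sketch below is a minimal Python illustration of objective (1) only, with SciPy's general-purpose NNLS solver standing in for constrained CGLS, and with the function name `fli_impute` our own.

```python
import numpy as np
from scipy.optimize import nnls

def fli_impute(X, Z, alpha_scale=1e6):
    """Impute the NaN entries of test rows Z using nonnegative, sum-to-one
    combinations of the training rows X. Illustrative sketch of objective (1):
    the sum-to-one constraint is enforced by an appended row weighted by alpha."""
    X = np.asarray(X, float)
    Z = np.asarray(Z, float).copy()
    alpha = alpha_scale * X.max()                  # hard sum-to-one penalty row
    for i in range(Z.shape[0]):
        miss = np.isnan(Z[i])
        if not miss.any():
            continue
        # Fit Z[i, known] ~ v @ X[:, known] with v >= 0 and sum(v) = 1.
        A = np.vstack([X[:, ~miss].T, alpha * np.ones(X.shape[0])])
        b = np.concatenate([Z[i, ~miss], [alpha]])
        v, _ = nnls(A, b)
        Z[i, miss] = v @ X[:, miss]                # impute the missing columns
    return Z
```

Solving per test sample is equivalent here because the columns of V in (1) decouple; the constrained CGLS solver of [13] handles the full matrix system at once.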

Classification and error metrics
Here we introduce the metrics which will be used to assess the quality of the results. Let TP, FP, TN, and FN denote the number of true positives, false positives, true negatives, and false negatives, respectively, in a binary classification. Then, we report the following classification metrics in our comparisons:

1. The classification accuracy ACC = (TP + TN)/(TP + FP + TN + FN);
2. The F1 score F1 = 2TP/(2TP + FP + FN);
3. The Area Under the receiver operating characteristic Curve (AUC).

All classification metrics above take values between 0 and 1. A value closer to 1 indicates a better performance, and vice versa. Let x ∈ R^p denote a vector of ground truth expression values corresponding to a given patient, and let x_ε ∈ R^p be an approximation (e.g., obtained through imputation). Then we define the least squares error ε = |x − x_ε|_2 / |x|_2. We use ε to measure the imputation error in section "Results".
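As a minimal Python illustration, the confusion-matrix metrics and the imputation error can be computed as below. A relative (normalized) least squares error is assumed, and the function names are our own; AUC is omitted since it is computed from the full ROC curve rather than from a single confusion matrix.

```python
import numpy as np

def classification_metrics(tp, fp, tn, fn):
    """Classification accuracy and F1 score from binary confusion counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return acc, f1

def relative_error(x_true, x_approx):
    """Relative least squares imputation error: |x - x_eps|_2 / |x|_2."""
    x_true = np.asarray(x_true, float)
    return np.linalg.norm(x_true - x_approx) / np.linalg.norm(x_true)
```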

Data sets
We consider the following four data sets from the literature for real data testing:

1. Serum miRNA expression data of [14,15], collected from n = 2460 Japanese patients. The number of expression values measured is p = 2565. The authors provide expression values for 1123 control patients, and for patients with 15 different types of diseases (the case number varies with the disease), including bladder cancer, hepatocellular carcinoma (HCC), breast cancer, ovarian cancer, and hepatitis. To compile the data used here, we combine the data sets of [14] and [15], making sure to delete any replicate patients. We focus on the HCC and bladder cancer case patients, as most samples are provided for these diseases. We consider two separate binary classification problems, whereby we aim to separate the control set from the bladder cancer and HCC patients.
2. The expression data of [19].
3. miRNA expression data of [17]. In [17], the authors combine the 19 healthy patients with 10 cholelithiasis patients to form a larger control set of 29 patients for use in the PC classifications. We use the same control set here, and aim to separate the controls from the PC patients. In total we consider n = 117 patients. The authors provide p = 2578 expression values for each patient, all of which are used in our classifications.
4. miRNA expression data of [16].

We also consider two synthetic data sets:

5. Handwritten digit image data.
6. Multivariate Gaussian data.

The data sets discussed in points 1-6 above were chosen to have variety in population type and size, cancer/disease type, and dimension size. We consider four miRNA data sets (points 1-4 above) from four distinct populations (namely, Japan, Germany, Korea, and Singapore, respectively), which include a number of different cancer and disease types, and control groups, e.g., melanoma, pancreatic cancer, breast cancer, and healthy controls, as well as a variety of different sample and dimension sizes. The chosen data sets also offer a wide variety of ratios between control and case group sizes. For example, the Japanese data includes more controls than cases for all diseases considered (e.g., 1123 healthy controls vs 40 ovarian cancer samples), whereas the data collected by Keller et al., from German patients, has a more equal balance between cases and controls (e.g., 70 healthy controls vs 15 ovarian cancer samples). The handwritten image data (point 5) is included as a visual example to show the intuition of FLI and the other methods considered. That is, one can imagine miRNA expression imputation as filling in missing pixels of an image (or "inpainting"). The aim is to help the reader to see clearly the effects of the different imputation methods, e.g., as in Fig. 3, and understand how the methods are performing and why some methods perform better than others. We feel the inpainting analogy also provides context more generally to a machine learning audience, as image classification and inpainting are highly studied problems in machine learning [20,21].
The multivariate normal data is included as an example where we have the ability to more precisely control the correlation level among the variables, and show the relation between variable correlation and imputation accuracy and classification performance. See section "Multivariate normal examples".

Classification model
Here we discuss the classification model used to classify patients in the real miRNA expression data experiments conducted in this paper. Let X ∈ R^{n×p} be a set of imputed miRNA expression data (i.e., with no missing values). To classify patients, we train a softmax function [22] classification model

P(y = j | x) = exp(w_j · x + b_j) / Σ_{k=0}^{n_c−1} exp(w_k · x + b_k),

where j ∈ {0, 1, ..., n_c − 1} is the class label, with n_c > 1 the number of classes, x ∈ R^p is a patient sample after imputation (e.g., one row of X), and the (w_j, b_j) are weights and biases to be trained. Here y denotes the class label assigned to x. The class with the highest probability P is then chosen for membership.
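The prediction step of such a softmax model can be sketched as follows (a Python illustration; the fitting of the weights is omitted, and `softmax_predict` is our own name, not the authors' code):

```python
import numpy as np

def softmax_predict(X, W, b):
    """Assign each row of X the class j maximizing
    P(y = j | x) = exp(w_j . x + b_j) / sum_k exp(w_k . x + b_k).
    W has shape (n_classes, p) and b has shape (n_classes,);
    both are assumed already trained."""
    logits = X @ W.T + b
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)             # class probabilities
    return P.argmax(axis=1)                       # most probable class per row
```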

Validation methods
Here we discuss the validation methods used in the experiments conducted. We use a multiple hold out set validation. That is, we randomly and uniformly set aside a subset of n_t < n samples from the data. After which, we randomly and uniformly assign a subset of m < p variables to NaN for each test patient, where m is determined by the percentage of missing dimensions (see section "Results"). Then the missing dimensions are imputed, and the test patients are classified using the model discussed in section "Classification model". The above process is repeated over n_T trials and the results are averaged. For the data sets with small n, i.e., data sets (2)-(3) of section "Data sets", where n < 150, we set n_t = 10 and n_T = 30. For data set (1) of section "Data sets", where n > 1000, we set n_t = 300 and n_T = 1. In total there are n_T × n_t = 300 test trials for each classification performed.
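The sampling scheme above might be sketched as follows (a Python illustration of the hold-out and masking steps only; imputation and classification then follow as in the main text, and the function name is our own):

```python
import numpy as np

def holdout_masks(n, p, n_t, pct_missing, rng):
    """One trial of the multiple hold-out validation: pick n_t test samples
    uniformly at random from n, then mask m = round(pct_missing * p / 100)
    variables per test sample, again uniformly at random. Returns the test
    indices and a boolean mask of the entries to set to NaN."""
    test_idx = rng.choice(n, size=n_t, replace=False)
    m = int(round(pct_missing * p / 100))
    mask = np.zeros((n_t, p), dtype=bool)
    for i in range(n_t):
        mask[i, rng.choice(p, size=m, replace=False)] = True
    return test_idx, mask
```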

Implementation of VIPER and scImpute
Here we discuss the implementations of VIPER and scImpute used in this paper. Both algorithms are implemented in Matlab. The R code provided in [3,9] was run from within Matlab. The Matlab formulations of VIPER and scImpute are available from the authors upon request.

scImpute
We perform scImpute as explained in [9], but with a few technical changes, which we shall now discuss. First, since no outliers were detected in the data sets used here, we do not implement the outlier removal stage of scImpute discussed in [9]. In some instances, the Nonnegative Least Squares (NNLS) solver employed in [9] suffered crashing issues for underdetermined systems. Hence, in the case of underdetermined system matrices, we multiplied both sides by the matrix transpose so that the input to the NNLS solver was a square matrix. That is, we solved the normal equations, which is an equivalent problem. In rare cases, the NNLS solver produced the error "Matrix inversion failed". In such instances, we imputed the mean value over the nearest neighbors determined by spectral clustering (i.e., the spectral clustering step described in [9]).
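The normal equations workaround can be sketched as below (a Python illustration, with SciPy's NNLS standing in for the solver used in [9]; the function name is our own):

```python
import numpy as np
from scipy.optimize import nnls

def nnls_safe(A, b):
    """Nonnegative least squares with the workaround described above:
    for underdetermined systems (more columns than rows), pass the square
    normal-equations system A^T A x = A^T b to the solver instead."""
    A = np.asarray(A, float)
    b = np.asarray(b, float)
    if A.shape[0] < A.shape[1]:            # underdetermined: square it up
        x, _ = nnls(A.T @ A, A.T @ b)
    else:
        x, _ = nnls(A, b)
    return x
```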

VIPER
VIPER is implemented exactly as discussed in [3], with lasso used as preprocessing.
In [3,9], the authors propose methods to determine the locations of the missing data. In this paper, we assume knowledge of the locations of the missing data a-priori, and hence we do not implement such aspects of VIPER and scImpute.

Hyperparameter selection
Here we discuss the selection of hyperparameters for the methods compared against. For scImpute, there are two hyperparameters, namely the number of clusters K and the affinity parameter σ used for the spectral clustering step. Here we are using the notation of [9] and [12]. In [12], it is suggested to choose σ so that the within cluster variances are minimized. We use a similar idea and choose K ∈ {2, ..., 10} and σ such that the ratio of the between and within cluster variances is maximized, i.e., we maximize the Calinski-Harabasz statistic [23]. In [9], no method for choosing K or σ is given, so we use the ideas of [12] (cited in [9]) to choose K and σ.
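The Calinski-Harabasz statistic used for this selection can be computed directly from a candidate cluster labeling (a Python sketch of the score only; the spectral clustering step that produces the labelings is omitted):

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Calinski-Harabasz statistic: the ratio of between-cluster to
    within-cluster dispersion, scaled by (n - K) / (K - 1). Larger values
    indicate tighter, better-separated clusters; candidate (K, sigma)
    pairs are scored with this and the maximizer is kept."""
    X = np.asarray(X, float)
    labels = np.asarray(labels)
    n = X.shape[0]
    ks = np.unique(labels)
    K = len(ks)
    mean_all = X.mean(axis=0)
    between = sum((labels == k).sum()
                  * np.sum((X[labels == k].mean(axis=0) - mean_all) ** 2)
                  for k in ks)
    within = sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
                 for k in ks)
    return (between / (K - 1)) / (within / (n - K))
```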
For lasso VIPER, the lasso parameter is chosen by 10-fold cross validation, and the threshold parameter is set to t = 0.001 , as in [3]. Note here we are using the notation of [3].
As discussed in section "Description of FLI", the α parameter of FLI (see equation (1)) is set orders of magnitude larger than the maximum entry of the expression training matrix X, so that the imputation weights all sum to 1 as a hard constraint. Specifically, α = 10^6 × max_{i,j} X_{ij}, i.e., six orders of magnitude larger than the maximum entry of X.
The remaining methods compared against for baseline, namely regression, mean and zero imputation, have no hyperparameters.

Singular value plots
Here we show plots of the singular values of the expression matrices X corresponding to some of the data sets considered. See Fig. 6. We see a high level of linear dependence among the expression values, as indicated by the singular value plots. The nuclear norm [24] is commonly used to approximate the matrix rank. We use the nuclear norm here to measure the level of linear dependence among the miRNA expression values. A smaller nuclear norm indicates greater linear dependence, and vice versa. Based on the nuclear norm values, there is higher linear dependence among the Japanese patients, and thus an imputation using linear combinations (e.g., as with FLI) is likely to be more accurate. The Keller data shows the least linear dependence among the expression values, and thus we expect the FLI imputation (and those of VIPER and scImpute) to be less accurate in this case.
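For reference, the nuclear norm is simply the sum of the singular values, and is straightforward to compute (a Python sketch; for matrices of equal energy, a smaller value indicates stronger linear dependence, as in the comparison above):

```python
import numpy as np

def nuclear_norm(X):
    """Nuclear norm |X|_* = sum of the singular values of X, used here
    as a proxy for linear dependence among the columns of X."""
    return np.linalg.svd(np.asarray(X, float), compute_uv=False).sum()
```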

Multivariate normal examples
To extend the discussion of section "Singular value plots", in this section we test the imputation accuracy of FLI on correlated multivariate normal data with varying levels of linear dependence among the variables, and plot the imputation error against the correlation level.
To generate data, we sample from a multivariate normal distribution with mean zero and correlation matrix C, where C is the square p × p matrix whose diagonal entries are 1 and all other entries are fixed at some R ∈ [0, 1). We use R to vary the linear dependence level among the p variables. We separate the data into two classes which are divided by the linear classification boundary {x · (1, . . . , 1)^T = 0}, i.e., the (p − 1)-dimensional hyperplane through the origin which is perpendicular to (1, . . . , 1)^T. We choose this classification boundary so that all variables have equal correlation to the class labels, and thus have equal weight towards classification performance. In Fig. 7, we show example data generated in this way. To test the effectiveness of FLI, we set p = 1000 and generate multivariate normal data with n = 1000 samples. Then, we vary R and the percentage of missing dimensions and record the changes in imputation error and classification accuracy. The classification model is a softmax function, as described in section "Classification model". See Fig. 8 for our results.
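The data-generation scheme above can be sketched as follows (with a smaller p than in the experiments, purely for speed):

```python
import numpy as np

rng = np.random.default_rng(4)
p, n, R = 50, 1000, 0.5   # smaller p than in the experiments, for speed

# Correlation matrix C: ones on the diagonal, R everywhere else.
C = np.full((p, p), R)
np.fill_diagonal(C, 1.0)

# Zero-mean multivariate normal samples with correlation matrix C.
X = rng.multivariate_normal(np.zeros(p), C, size=n)

# Class labels from the hyperplane x . (1, ..., 1)^T = 0.
labels = (X @ np.ones(p) > 0).astype(int)
```

Note that the equicorrelation matrix C has eigenvalues 1 − R (with multiplicity p − 1) and 1 + (p − 1)R, so it is positive definite for any R ∈ [0, 1) and the sampling is well defined.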
In Fig. 8a, we plot the percentage of missing dimensions against ǫ_µ (as in Figs. 2d, 4d and 5d) for a variety of R values, and in Fig. 8b we plot the mean value over the curves in Fig. 8a against R. The same metric was used to measure imputation performance in Tables 1, 2, and 3 in the main text. We also include error bars in Fig. 8b, whose width is twice the standard deviation of the ǫ_µ values over the curves in Fig. 8a. A lower mean and standard deviation of the ǫ_µ score indicates more accurate, more consistent imputation performance as the percentage of missing dimensions varies, and vice versa. As R increases, the mean and standard deviation of the ǫ_µ scores decrease, and the relation appears continuous and monotonic. This is as we would expect and is consistent with the singular value analysis in sub-section "Singular value plots". In Fig. 8c, we plot the mean AUC, F_1, and ACC scores against R. As R increases, the classification accuracy increases, and the softmax model retains high classification accuracy (AUC, F_1, ACC > 0.95) with only mild correlation among the variables (R > 0.15), further indicating the imputation abilities of FLI for binary classification problems. In Fig. 8d, we plot R against |X|_* (i.e., the nuclear norm of the data) to show the relation between R and the nuclear norm introduced in sub-section "Singular value plots". The plot shows a continuous, inverse, monotonic relationship between R and |X|_*. Thus, as R increases, or, equivalently, as |X|_* decreases, we can expect to see increased imputation accuracy and greater retention of the classification performance using FLI.

Additional miRNA expression results
Here we present additional validation of FLI on two more real miRNA expression data sets from the literature.

Japan and Keller data results
In this section, we test the accuracy of FLI on the miRNA expression data of [14, 15], collected from Japanese patients, and the expression data of Keller et al. [19], collected from German patients.
For the Japanese data, we focus specifically on the binary classification problems of separating bladder cancer and HCC patients from controls, as considered in [14, 15]. For the Keller data, we focus on separating melanoma and multiple sclerosis patients from the control set. We chose melanoma and multiple sclerosis as the softmax classifier (discussed in section "Classification model") offered the best AUC scores when separating these diseases from controls, compared to the other diseases considered by Keller. In total, we consider four binary classification problems: two associated with the Japanese data (the HCC and bladder cancer classifications), and two from the Keller data set (the melanoma and multiple sclerosis classifications). See Fig. 9, where we show curves of the classification accuracy scores against the percentage of missing dimensions for each of the four classifications considered. In Table 4 we present the average scores over the curves shown in Fig. 9, as an average measure of the performance of FLI over all levels of missing dimensions. The results offer further validation of the effectiveness of FLI in retaining the classification accuracy on two more large miRNA expression data sets from the literature. The imputation times of Table 4b further validate the efficiency of FLI; for example, the maximum imputation time recorded, over all test patients processed, was t_M = 1.15 s. FLI is most effective in retaining the classification accuracy on the Japanese data, offering AUC, ACC, and F_1 scores exceeding 0.95 with up to 70% of dimensions missing. On the Keller data, FLI is less effective in retaining the classification accuracy, offering AUC, ACC, and F_1 scores exceeding 0.95 with up to approximately 50% of dimensions missing.
Table 4: Keller and Japan data results.

The mean curve values presented in Table 4a reflect the difference in the effectiveness of FLI on the Japanese and Keller data sets. In Fig. 6, we showed singular value plots of the expression data matrices corresponding to the Keller and Japanese data sets. The plots indicated a higher level of linear dependence among the Japanese miRNA set, when compared to that of Keller. Thus we would expect an imputation such as FLI, which uses linear combinations of the training expressions to impute the missing expressions, to be less effective on the Keller data set, as there is less linear dependence among the miRNA subset chosen by Keller. The results observed here are thus in line with the findings of Fig. 6.