In many modern data analysis applications, the user is faced with data matrices that have a huge number of columns and/or rows. Such matrices arise in disciplines ranging from astronomy through genomics and social sciences to zoology. As a specific example, let us consider gene expression microarray data. In a typical study, hundreds of thousands of probe expressions are measured for a large number of samples. This methodology has had a significant impact on gene expression research, but the publication of studies with dissimilar or contradictory results has raised concerns about the reliability of this technology, especially when all of the individual gene expression values are of interest. On the other hand, when the goal is more modest, *e.g.*, just classifying the samples into a few categories, there is typically ample information available in the data, and one can hope that the huge redundancy in the data compensates for the possible errors of the technology.

In such cases, it is common to employ one of several dimensionality reduction methods in order to identify low-dimensional features for use by a downstream analyst. Many popular methods, *e.g.*, Principal Component Analysis (PCA), multidimensional scaling, recently-popular nonlinear-dimensionality reduction methods, etc., boil down to the Singular Value Decomposition (SVD). The singular vectors, or principal components, associated with the largest singular values have strong optimality properties, and they can often be quite useful as a tool to summarize and identify major patterns of the data. (See, e.g., [1] as a nice example in the field of genomics and [2] for a fast matrix factorization algorithm.) Nevertheless, it is typically quite hard for a geneticist or downstream data analyst to interpret those vectors in terms of the application domain from which the data are drawn. The reason for this is that the singular vectors are mathematical abstractions defined for any matrix, and they are typically linear combinations of all of the input data. This has been noted most explicitly by Kuruvilla *et al.* [3]. After describing the many uses of the vectors provided by the SVD and PCA in DNA microarray analysis, they bluntly conclude that "While very efficient basis vectors, the (singular) vectors themselves are completely artificial and do not correspond to actual (DNA expression) profiles. … Thus, it would be interesting to try to find basis vectors for all experiment vectors, using actual experiment vectors and not artificial bases that offer little insight."

To address these and other issues, Mahoney and Drineas [4] proposed the *CUR matrix decomposition* method. CUR decompositions are low-rank matrix decompositions that are explicitly expressed in terms of a small number of actual columns and/or actual rows of the data matrix:

*A* ≈ *CUR*,     (1)

where *A* is the original data matrix, *C* consists of a small number of actual columns of *A*, *R* consists of a small number of actual rows of *A*, and *U* is a small, carefully constructed matrix that guarantees that the product *CUR* is close to *A*. Since they are constructed from actual data elements, CUR decompositions are interpretable by practitioners of the field from which the data are drawn (to the extent that the original data are). For example, CUR decompositions have been used for interpretable data analysis of DNA single-nucleotide polymorphism data [5–7].

The theory of CUR matrix decompositions works as follows [4, 8]. To determine which columns to include in *C* (and similarly for *R*), one computes an "importance score" for each column of *A* and then randomly samples a small number of columns from *A* using that score as an "importance sampling" probability distribution. This importance score depends on the matrix *A* and an input rank parameter *k*. If *v*_{j}^{ξ} is the *j*-th element of the *ξ*-th right singular vector of *A*, then the *normalized statistical leverage scores* equal

*π*_{j} = (1/*k*) ∑_{ξ=1}^{*k*} (*v*_{j}^{ξ})^{2},     (2)

for all *j* = 1,…,*n*. These quantities, up to scaling, equal the diagonal elements of the so-called "hat matrix," *i.e.*, the projection matrix onto the span of the top *k* right singular vectors of *A* [9]. As such, they have a natural statistical interpretation as a "leverage score" or "influence score" associated with each of the data points, and they have been widely used for outlier identification in diagnostic regression analysis.
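As a concrete illustration, the scores of Equation (2) can be computed directly from a truncated SVD. The following is a minimal sketch in Python/NumPy (the rCUR package itself is implemented in R; the function name here is only illustrative):

```python
import numpy as np

def leverage_scores(A, k):
    """Normalized statistical leverage scores of the columns of A,
    relative to the top-k right singular subspace (Equation (2))."""
    # Rows of Vt are the right singular vectors of A.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k, :]                   # top k right singular vectors
    pi = np.sum(Vk**2, axis=0) / k   # pi_j = (1/k) * sum_xi (v_j^xi)^2
    return pi

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
pi = leverage_scores(A, k=4)
# The scores are nonnegative and sum to 1 (each singular vector has unit
# norm), so they form a valid importance-sampling distribution.
```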

The basic algorithm for choosing columns from a matrix, call it ColumnSelect, takes as input *any* *m* × *n* matrix *A*, a rank parameter *k*, and an error parameter *ε*, and then performs the following.

- 1.
First, compute *v*^{1},…,*v*^{k} (the top *k* right singular vectors of *A*) and the normalized statistical leverage scores of Equation (2).

- 2.
Second, keep the *j*-th column of *A* with probability *p*_{j} = min{1, *cπ*_{j}}, for all *j* ∈ {1,…,*n*}, where *c* = *O*(*k* log *k*/*ε*^{2}).

- 3.
Third, return the matrix *C* consisting of the selected columns of *A*.
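The three steps above can be sketched in a few lines of NumPy (an illustrative translation, with a fixed sampling budget `c` standing in for the theoretical *O*(*k* log *k*/*ε*^{2}); the real implementation is the R package described below):

```python
import numpy as np

def column_select(A, k, c, rng):
    """Sketch of ColumnSelect: keep column j with probability
    min(1, c * pi_j), where pi_j is its leverage score."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    pi = np.sum(Vt[:k, :]**2, axis=0) / k          # Equation (2)
    keep = rng.random(A.shape[1]) < np.minimum(1.0, c * pi)
    cols = np.flatnonzero(keep)
    return A[:, cols], cols

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
C, cols = column_select(A, k=4, c=10, rng=rng)
# C consists of actual columns of A, indexed by cols.
```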

In some applications, this restricted CUR decomposition, *A* ≈ *P*_{C}*A* = *CX*, where *X* = *C*^{+}*A*, is of interest, where *C*^{+} denotes a Moore-Penrose generalized inverse of the matrix *C*.^{a}
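In NumPy terms, this restricted decomposition amounts to projecting *A* onto the span of the selected columns. A minimal sketch, assuming *C* has already been constructed (here a few columns are picked by hand rather than by ColumnSelect):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 12))
C = A[:, [0, 3, 7]]           # actual columns of A stand in for ColumnSelect

X = np.linalg.pinv(C) @ A     # X = C^+ A
A_cx = C @ X                  # A ~ P_C A = C X: projection of A onto span(C)

# The Frobenius error is the residual of projecting A onto span(C);
# the selected columns themselves are reproduced exactly.
err = np.linalg.norm(A - A_cx)
```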

In other applications, one wants a CUR matrix decomposition in terms of columns and rows simultaneously. The basic algorithm for this performs the following.

- 1.
Run ColumnSelect on *A* with *c* = *O*(*k* log *k*/*ε*^{2}) to choose columns of *A* and construct the matrix *C*.

- 2.
Run ColumnSelect on *A*^{T} with *r* = *O*(*k* log *k*/*ε*^{2}) to choose rows of *A* (columns of *A*^{T}) and construct the matrix *R*.

- 3.
Define the matrix *U* as *U* = *C*^{+}*A**R*^{+}.
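Putting the pieces together, the full decomposition can be sketched end-to-end as follows (illustrative Python with hypothetical helper names and fixed sampling budgets; the actual implementation is the rCUR R package):

```python
import numpy as np

def cur_decomposition(A, k, c, r, rng):
    """Sketch of the CUR algorithm: leverage-score sampling of columns
    and rows, then U = C^+ A R^+."""
    def select(M, budget):
        # ColumnSelect on M: leverage scores of M's columns, then sampling.
        _, _, Vt = np.linalg.svd(M, full_matrices=False)
        pi = np.sum(Vt[:k, :]**2, axis=0) / k
        return np.flatnonzero(rng.random(M.shape[1]) < np.minimum(1.0, budget * pi))

    cols = select(A, c)       # step 1: ColumnSelect on A
    rows = select(A.T, r)     # step 2: ColumnSelect on A^T
    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)   # step 3: U = C^+ A R^+
    return C, U, R

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 25))
C, U, R = cur_decomposition(A, k=5, c=12, r=12, rng=rng)
# C U R has the same shape as A and approximates it using actual
# columns (C) and rows (R) of A.
```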

Thus, in contrast to PCA and the SVD, where the low-dimensional basis consists of singular vectors that are linear combinations of all the data vectors, here the matrices *C* and *R* consist of a small number of actual columns and rows of *A*, respectively. The details of this procedure, including the use of randomness, are important for the strong underlying theory
[4, 8, 10]; but in practice several variations that exploit the structure identified by the statistical leverage scores perform very well. The practical design decisions made for our implementation are described in the next section.

In this paper, we describe the rCUR package, a freely available, open-source R implementation of the CUR matrix decomposition method. We summarize the functionality and features of the package, which allow the user to obtain the statistical leverage scores and the matrices *C*, *U*, and *R* through simple S4 classes and methods. In certain cases, we have found the statistical leverage scores themselves to be directly useful, and thus we also describe variations for selecting the columns or rows that deviate from the theory described above. We then demonstrate the strength of the technique on a microarray study. In particular, we show that even for a very large set of heterogeneous samples from various experiments, rCUR is able to select a few percent of the probes that have the same classification capacity as the original full set.

Finally, it should be emphasized that this CUR approach is very different from the classical statistical perspective, in which statistical leverage scores have been used in diagnostic regression analysis to identify outliers and errors
[9]; and that Bien *et al.* [11] have described the connections between CUR matrix decompositions and sparse optimization methods. See also
[12] for a previous example in the genomics literature of gene selection via outliers.