### The h-plot and covariance Biplot

Gabriel [3] introduced the biplot as a method for displaying the elements of a matrix as inner products of vectors corresponding to the rows and columns of a matrix. Gower and Hand [23] have extended and generalized the ideas of Gabriel. Two earlier descriptions of the biplot for microarray data are Chapman [19] and Pittelkow and Wilson [14]. There is some confusion about the definition of biplots [23] and a variety of biplots have been proposed, some more suited than others to the analysis of microarray data.

Any matrix, **Z**, of rank *r* and size *N*_{c} by *N*_{g} can be factored as

**Z** = **CG**^{T} (6)

where **C** is a matrix of size *N*_{c} by *r*, and **G** is *N*_{g} by *r*. Any element of **Z** can therefore be written as the inner product *z*_{ij} = *c*_{i}*g*_{j}^{T}, where *c*_{i} is a row of **C** and *g*_{j} is a row of **G**.

The biplot, as described by Gabriel [3], is the plot of all the *N*_{c} + *N*_{g} vectors, *c*_{i}, *i* = 1, ..., *N*_{c} and *g*_{j}, *j* = 1, ..., *N*_{g}. Originally, both the rows and the columns were represented on the biplot by vectors (lines from the origin, or rays). It then became common practice to use the vector representation only for the columns (variables). With microarray data, however, the vector representation is used for the rows (microarrays).

To be practical as a display, *r* would need to be two or three. The matrix of gene expression indices is not, in general, of sufficiently low rank to plot usefully. The usual approach, in this case, is to find a rank *k* approximation to **Z** where *k* is usually two or three. An exact biplot representation of the rank-*k* approximation matrix is known as an 'approximate biplot', but for ease of notation we drop the qualifier 'approximate'.

It is known that the singular value decomposition (SVD) can be used to approximate any rectangular matrix by a matrix of the same size but of lower rank, such that the sum of squares of the differences between the elements of the matrix and its approximation is minimized. Let the SVD of a rank-*r* matrix, **Z**, be **Z** = **UΛV**^{T} where **U**, size *N*_{c} by *N*_{c}, and **V**^{T}, size *N*_{g} by *N*_{g}, are orthogonal matrices such that **U**^{T}**U** = **I** and **V**^{T}**V** = **I**. The notation **I** is used to denote a conformable identity matrix.

The matrix **Λ** of size *N*_{c} by *N*_{g} has elements *λ*_{ij} = 0, if *i* ≠ *j*, and *λ*_{ij} = *λ*_{i}, if *i* = *j*. The scalars *λ*_{i} are ordered such that *λ*_{1} ≥ *λ*_{2} ≥ ... ≥ *λ*_{r} > 0 and are the singular values of **Z** and the positive square roots of the nonzero eigenvalues of **Z**^{T}**Z** and **ZZ**^{T}. Since **Z**^{T}**Z** = **VΛ**^{2}**V**^{T}, the columns of **V**, the right singular vectors of **Z**, are also the eigenvectors of **Z**^{T}**Z**. Similarly it can be shown that the columns of **U**, the left singular vectors of **Z**, are the eigenvectors of **ZZ**^{T}. A rank-*k* (*k* ≤ *r*) approximation to **Z** is given by

**Ẑ**_{k} = **U**_{k}**Λ**_{k}**V**_{k}^{T}

where **U**_{k} and **V**_{k} are matrices comprising the first *k* columns of **U** and **V** respectively and **Λ**_{k} is a submatrix of **Λ** formed from the first *k* columns and rows of **Λ**.
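The rank-*k* approximation can be illustrated numerically. The following is a minimal sketch using NumPy, not the authors' implementation. The function name `rank_k_approximation`, the random test matrix and its dimensions are assumptions made purely for illustration.

```python
import numpy as np


def rank_k_approximation(Z, k):
    """Truncated SVD of Z: returns U_k, the k leading singular values, V_k and Z_hat."""
    # full_matrices=False gives the compact SVD; singular values are returned in descending order
    U, lam, Vt = np.linalg.svd(Z, full_matrices=False)
    U_k, lam_k, V_k = U[:, :k], lam[:k], Vt[:k, :].T
    Z_hat = (U_k * lam_k) @ V_k.T  # same as U_k @ diag(lam_k) @ V_k.T
    return U_k, lam_k, V_k, Z_hat


# Hypothetical data: 6 microarrays (rows) by 20 genes (columns)
rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 20))

U_k, lam_k, V_k, Z_hat = rank_k_approximation(Z, k=2)

# The residual sum of squares equals the sum of the squared discarded singular values
all_lam = np.linalg.svd(Z, compute_uv=False)
rss = np.sum((Z - Z_hat) ** 2)
print(np.isclose(rss, np.sum(all_lam[2:] ** 2)))
```

The final check reflects the least-squares property stated above: the error of the best rank-*k* approximation is determined entirely by the singular values that are dropped.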

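The identity **Λ**_{k} = **Λ**_{k}^{α}**Λ**_{k}^{1−α}, where 0 ≤ *α* ≤ 1, can be used to factorize the rank-*k* approximation, **Ẑ**_{k} = **U**_{k}**Λ**_{k}**V**_{k}^{T}. If **C**_{k} and **G**_{k} are defined as **C**_{k} = **U**_{k}**Λ**_{k}^{α}, and **G**_{k} = **V**_{k}**Λ**_{k}^{1−α}, then **Ẑ**_{k} = **C**_{k}**G**_{k}^{T}.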

The factorization of the rank-*k* approximation **Ẑ**_{k} into **C**_{k}**G**_{k}^{T} is not unique since for any (conformable) invertible matrix **A**, **Ẑ**_{k} = (**C**_{k}**A**)(**G**_{k}(**A**^{T})^{-1})^{T} = **C**_{k}**G**_{k}^{T}. The transformations, **A** and (**A**^{T})^{-1}, differ only by a scaling. To see this, let the singular value decomposition of the matrix **A** be **A** = **U**^{†}**Λ**^{†}**V**^{†T}, where **U**^{†} and **V**^{†} are *k* × *k* orthogonal matrices and **Λ**^{†} is *diag*(*λ*^{†}_{1}, ..., *λ*^{†}_{k}) with *λ*^{†}_{1}, ..., *λ*^{†}_{k} the singular values of the decomposition. Then (**A**^{T})^{-1} = **U**^{†}(**Λ**^{†})^{-1}**V**^{†T}. Thus, if **W** is a diagonal matrix of nonzero elements *w*_{i}, *i* = 1, ..., *k*, this indeterminacy can be made clear by writing

**Ẑ**_{k} = (**U**_{k}**Λ**_{k}^{α}**W**)(**V**_{k}**Λ**_{k}^{1−α}**W**^{-1})^{T}

To ease notation in the following, assume that *k* is given, say 2, and drop the subscript, writing **C** = **C**_{k} and **G** = **G**_{k}.
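The non-uniqueness of the factorization can be checked numerically. The sketch below assumes the scaled factorization takes the form **C** = **U**_{k}**Λ**_{k}^{α}**W** and **G** = **V**_{k}**Λ**_{k}^{1−α}**W**^{-1}, with **W** diagonal and nonzero. The function `biplot_factors`, its parameters and the random data are hypothetical names chosen for illustration.

```python
import numpy as np


def biplot_factors(Z, k=2, alpha=0.0, w=1.0):
    """Return (C, G) with C = U_k L_k^alpha W and G = V_k L_k^(1-alpha) W^-1.

    w is a scalar or a length-k sequence holding the nonzero diagonal of W.
    """
    U, lam, Vt = np.linalg.svd(Z, full_matrices=False)
    U_k, lam_k, V_k = U[:, :k], lam[:k], Vt[:k, :].T
    w_diag = np.asarray(w, dtype=float) * np.ones(k)
    C = U_k * lam_k**alpha * w_diag          # scales column m by lam_m^alpha * w_m
    G = V_k * lam_k**(1.0 - alpha) / w_diag  # scales column m by lam_m^(1-alpha) / w_m
    return C, G


# Hypothetical data: 6 microarrays (rows) by 20 genes (columns)
rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 20))

C_a, G_a = biplot_factors(Z, k=2, alpha=0.0, w=1.0)
C_b, G_b = biplot_factors(Z, k=2, alpha=0.5, w=[2.0, 3.0])

# The coordinates differ, but both pairs reproduce the same rank-2 approximation
print(np.allclose(C_a @ G_a.T, C_b @ G_b.T), np.allclose(C_a, C_b))
```

The choice of *α* and **W** therefore changes where the points fall in the plane without changing any of the inner products being displayed.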

Different scalings affect the position of the points in the plane. If the matrix **Z** is gene centered, i.e. if, for each gene, the average gene expression measurement is subtracted from its individual measurements, and the biplot factorization uses *α* = 0 and **W** = √(*N*_{c} − 1)**I** or **W** = √*N*_{c} **I**, then the configuration of the gene points is determined from the variance-covariance matrix of the genes. (Or, more correctly, an approximation to the variance-covariance matrix of the genes.)

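Given *α* = 0 and **W** = √(*N*_{c} − 1)**I** (or **W** = √*N*_{c} **I**, with *N*_{c} replacing *N*_{c} − 1 in what follows), the coordinates are **C** = √(*N*_{c} − 1)**U**_{k} and **G** = **V**_{k}**Λ**_{k}/√(*N*_{c} − 1), so that **GG**^{T} = **V**_{k}**Λ**_{k}^{2}**V**_{k}^{T}/(*N*_{c} − 1) ≈ **Z**^{T}**Z**/(*N*_{c} − 1), the variance-covariance matrix of the genes.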
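The link between the gene coordinates and the variance-covariance matrix can be verified on a small example. This sketch assumes the divisor *N*_{c} − 1 (that is, **W** = √(*N*_{c} − 1)**I**) and *α* = 0. The data, the helper `covariance_biplot_coords` and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 5))   # hypothetical data: 8 microarrays (rows) by 5 genes (columns)
Z = X - X.mean(axis=0)        # gene centring: subtract each gene's mean over the microarrays
n_c = Z.shape[0]
scale = np.sqrt(n_c - 1)      # W = sqrt(N_c - 1) I

U, lam, Vt = np.linalg.svd(Z, full_matrices=False)


def covariance_biplot_coords(k):
    """alpha = 0 factorization: C = sqrt(N_c - 1) U_k, G = V_k L_k / sqrt(N_c - 1)."""
    C = scale * U[:, :k]
    G = Vt[:k, :].T * lam[:k] / scale
    return C, G


S = np.cov(Z, rowvar=False)   # gene-by-gene sample covariance (divisor N_c - 1)

C_full, G_full = covariance_biplot_coords(len(lam))  # no rank reduction: G G^T reproduces S
C_2, G_2 = covariance_biplot_coords(2)               # k = 2: G G^T only approximates S

print(np.allclose(G_full @ G_full.T, S))
print(np.linalg.norm(S - G_2 @ G_2.T) / np.linalg.norm(S))
```

With all singular values retained the gene inner products equal the sample covariances exactly. With *k* = 2 the second printed value gives the relative error of the approximation.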

The covariance Biplot uses the coordinates **C** (microarray points) and **G** (gene points). The h-plot uses the coordinates **G**.

Using the geometry of inner products, the juxtaposition of microarrays and genes provides information about the size of the mean corrected gene expression. Any element, *ẑ*_{ij}, of the rank-2 approximation to **Z** can be calculated as the inner product between the two 2-dimensional row vectors, *c*_{i} (the *i*th microarray point) and *g*_{j} (the *j*th gene point), as follows:

*ẑ*_{ij} = *c*_{i}*g*_{j}^{T} = ||*c*_{i}|| ||*g*_{j}|| cos *θ*_{ij}

where cos *θ*_{ij} is the cosine of the angle subtended at the origin between the vectors *c*_{i} and *g*_{j}, and ||·|| denotes length. Thus if the point represented by *g*_{j} is close to the point *c*_{i} then it can be deduced that the gene represented by the point *g*_{j} is relatively up regulated in the microarray represented by the point *c*_{i}. If *g*_{j} is on the opposite side of the plot to *c*_{i} then the gene is relatively down regulated on the microarray.
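A small worked example may make the geometric reading concrete. The coordinates below are invented for illustration only. The sketch decomposes the inner product into the two lengths and the cosine of the angle at the origin.

```python
import numpy as np

c_i = np.array([1.5, 0.5])       # hypothetical microarray point
g_near = np.array([2.0, 1.0])    # hypothetical gene point on the same side of the origin
g_far = np.array([-2.0, -0.5])   # hypothetical gene point on the opposite side


def inner_product_parts(c, g):
    """Return (||c||, ||g||, cos(theta)) for two 2-dimensional points."""
    length_c = np.linalg.norm(c)
    length_g = np.linalg.norm(g)
    cos_theta = (c @ g) / (length_c * length_g)
    return length_c, length_g, cos_theta


for g in (g_near, g_far):
    length_c, length_g, cos_theta = inner_product_parts(c_i, g)
    fitted = length_c * length_g * cos_theta  # equals the inner product c @ g
    print(fitted, cos_theta)
```

A positive cosine corresponds to a gene that is relatively up regulated on the microarray, and a negative cosine to one that is relatively down regulated.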

How accurate these reductions are depends on how good the approximation is in the lower ranked space. Two measures of goodness-of-fit for a *k*-dimensional display, which we call *I*_{1} and *I*_{2}, are available. They range from zero to one, with a high value indicating a close approximation. *I*_{1} = (Σ_{i=1}^{k} *λ*_{i}^{2}) / (Σ_{i=1}^{r} *λ*_{i}^{2}) is an absolute goodness of fit statistic, and *I*_{2} = (Σ_{i=1}^{k} *λ*_{i}^{4}) / (Σ_{i=1}^{r} *λ*_{i}^{4}) is a goodness of fit statistic for the variances and covariances between the genes.
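The two statistics are straightforward to compute from the singular values. The sketch below assumes the ratio definitions usual for biplots: squared singular values for *I*_{1} and fourth powers for *I*_{2}. That form is an assumption here, as are the function name `goodness_of_fit` and the random data.

```python
import numpy as np


def goodness_of_fit(Z, k=2):
    """Return (I1, I2) for a k-dimensional display of Z.

    I1: share of the total sum of squares of Z captured by the first k components.
    I2: the corresponding share for Z^T Z, using fourth powers of the singular values.
    """
    lam = np.linalg.svd(Z, compute_uv=False)  # singular values in descending order
    I1 = np.sum(lam[:k] ** 2) / np.sum(lam ** 2)
    I2 = np.sum(lam[:k] ** 4) / np.sum(lam ** 4)
    return I1, I2


# Hypothetical gene-centred data: 6 microarrays (rows) by 20 genes (columns)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 20))
Z = X - X.mean(axis=0)

I1, I2 = goodness_of_fit(Z, k=2)
print(I1, I2)
```

Under these definitions *I*_{2} is never smaller than *I*_{1}, because raising the ordered singular values to a higher power gives more weight to the leading ones.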

An alternative geometric interpretation of the inner product is as the product of the length of one of the vectors and the signed length of the projection of the other vector onto it. Since *c*_{i}(*g*_{j} − *g*_{j'})^{T} = *ẑ*_{ij} − *ẑ*_{ij'}, where *ẑ*_{ij} denotes the fitted value of *z*_{ij}, row points can be projected onto a line joining two column points to obtain a ranking of the rows in terms of column differences. Further, since (*c*_{i} − *c*_{i'})*g*_{j}^{T} = *ẑ*_{ij} − *ẑ*_{i'j}, one can project column points onto a row difference vector to estimate the difference in elements between two rows. Since (*c*_{i} − *c*_{i'})(*g*_{j} − *g*_{j'})^{T} = *ẑ*_{ij} − *ẑ*_{i'j} − *ẑ*_{ij'} + *ẑ*_{i'j'} is the interaction between genes *j* and *j'* on chips *i* and *i'*, row difference vectors which lie at right angles to column difference vectors are indicative of no interaction. Further interpretations can be found in [3, 13] and [14].

Since log(*z*_{ij}/*z*_{i'j}) ≃ (*c*_{i} − *c*_{i'})*g*_{j}^{T}, gene points can be projected onto a chip difference vector to estimate the log ratio of gene expression between two microarrays. Projection of gene points onto a vector joining two microarray points provides a visual ranking of the genes in terms of their log ratios on the two microarrays.
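The projection argument can be sketched in a few lines. The example assumes that the data matrix already holds log expression values, so that a difference between two microarrays is a log ratio, and it uses *α* = 0 with **W** = **I** for simplicity. The data and all variable names are hypothetical.

```python
import numpy as np

# Hypothetical log expression data: 6 microarrays (rows) by 30 genes (columns)
rng = np.random.default_rng(2)
X = rng.normal(size=(6, 30))
Z = X - X.mean(axis=0)          # gene centring

U, lam, Vt = np.linalg.svd(Z, full_matrices=False)
k = 2
C = U[:, :k]                    # microarray points (alpha = 0, W = I)
G = Vt[:k, :].T * lam[:k]       # gene points
Z_hat = C @ G.T                 # rank-2 fitted values

i, i_prime = 0, 3               # the two microarrays being compared
d = C[i] - C[i_prime]           # chip difference vector

est_diff = G @ d                             # (c_i - c_i') g_j^T for every gene j
signed_proj = est_diff / np.linalg.norm(d)   # signed length of each gene's projection onto d
ranking = np.argsort(-signed_proj)           # genes ordered from largest to smallest estimate

print(np.allclose(est_diff, Z_hat[i] - Z_hat[i_prime]))
print(ranking[:5])
```

Because the projection lengths differ from the estimated differences only by the constant factor ||*c*_{i} − *c*_{i'}||, reading the order of the projected gene points along the difference vector gives the same ranking as the fitted log ratios.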

Let *g*_{j} and *c*_{i} denote the gene and microarray coordinates respectively. Then the coordinates can be seen as the solutions to the bilinear model,

*z*_{ij} = *c*_{i}*g*_{j}^{T} + *ε*_{ij}

There is an indeterminacy in this model which is approached in different ways. Wentzell *et al* [24], for example, use multivariate curve resolution methods from chemistry to resolve the indeterminacy in the solution space. Note that Wentzell *et al*'s profiles are another kind of profile, distinct from those described here. In the h-plot and covariance-plot, the indeterminacy is resolved so that the inner products of the gene vectors (**GG**^{T}) approximate the variance-covariance matrix of the genes.