Dissecting systems-wide data using mixture models: application to identify affected cellular processes

Background Functional analysis of data from genome-scale experiments, such as microarrays, requires an extensive selection of differentially expressed genes. Under many conditions, the proportion of differentially expressed genes is considerable, making the selection criteria a balance between the inclusion of false positives and the exclusion of false negatives. Results We developed an analytical method to determine a p-value threshold from a microarray experiment that is dependent on the quality and design of the data set. To this aim, populations of p-values are modeled as mathematical functions in which the parameters to describe these functions are estimated in an unsupervised manner. The strength of the method is exemplified by its application to a published gene expression data set of sporadic and familial breast tumors with BRCA1 or BRCA2 mutations. Conclusion We present an objective and unsupervised way to set thresholds adapted to the quality and design of the experiment. The resulting mathematical description of the data sets of genome-scale experiments enables a probabilistic approach in systems biology.


Background
Functional analysis of microarray data, e.g. to reveal enrichment of promoter sequences, metabolic pathways or signalling cascades, requires an extensive selection of differentially expressed genes. Arbitrarily chosen thresholds for fold-changes and/or significance of change are commonly used to split the genes in subsets of alternative (differentially expressed) and null (non-differentially expressed) genes, with acceptable proportions of false positives and false negatives. Which proportion is acceptable depends on the research question. For example, to find markers of a treatment or signatures of a mutation, it is usually enough to fix the FDR-control level [1], and select the most significant alternative genes. However, most differentially expressed genes will not be selected. For other purposes, e.g. when comparing differences between treatments or mutations, there is need for more thorough comparisons. For each gene we would like to know the probability of it being alternative or null. Such an approach will let us select genes that are truly -or truly not -differentially expressed, with associated estimates of false positives.
The differential expression of genes is evaluated after a comparison of gene expression levels, yielding a list of pvalues. Commonly, a threshold is then fixed for either the number of genes selected, or for their p-value, and those genes corresponding to the k smallest p-values are classified as alternative. However, because of the large number of p-values involved, even if all features are null some small p-values may be observed due to pure chance. Thus, it is important to consider an estimate for the proportion of alternative features while determining the threshold.
The problem of estimating the proportion of alternative and null features has been handled by several authors [2][3][4][5]. In many experimental settings the alternative features make up a considerable fraction. This is a consequence of genes being connected in networks; altered expression of one gene can affect expression levels of multiple targets. Begley and co-workers [6] showed that the methylating agent MMS induced changes in the transcription level of 33% of the genes in the S. cerevisiae genome and oxidizing agent t-BuOOH altered 38% of the genes. In a study of familial and sporadic human breast cancers by Hedenfalk and co-workers [7], Storey and Tibshirani [5] estimated 33% of the genes to be differentially expressed between BRCA1 and BRCA2 mutation positive tumors. However, even though over 1,000 genes were considered to be changed, only 160 genes could be selected with a pFDR of 5% as a consequence of the overlap between alternative and null genes. It is intuitive that using only 16% of the differentially expressed genes for functional analysis is suboptimal, and we will show that this is indeed the case. To detect subtle but coordinated changes in gene expressions, it is advantageous to examine the joint behavior of inter-connected sets of genes [8], which implies the need for a generous selection. Significant changes in cellular processes can be detected in cases where the alteration of individual gene expressions is not significant [9].
The threshold determination has so far been left to the researcher. In practice it is either guided by limitations in the number of features to be verified after being classified as alternative, or by the desired false positive proportion in the final list. However, with the increasing rate at which new experiments are being performed, and decreasing cost of verification experiments, there will be a growing need for unsupervised ways of determining thresholds yielding a list of features with acceptable false positive and false negative rates.
We present here an objective and unsupervised way to set thresholds that are adapted to the quality and design of the experiment. Description of the distribution of both null and alternative p-values as mathematical functions will give the researcher the possibility to select genes depending on the probability of being alternative or null. Hedenfalk et al [7] determined gene expression patterns in 21 tumors from breast cancer patients. Seven tumors were sporadic with unknown mutations, whereas the remainder were familial cancers, in which one of the known genes associated with breast cancer was mutated: BRCA1 (7 patients) or BRCA2 (8 patients). RNAs from the three different sources were hybridized to arrays containing 6,512 cDNA clones. After discarding low quality spots [5,7], more than 3,000 clones remained. The measurements from these genes were tested against the null hypothesis that there is no differential expression across two conditions [5]. The resulting p-values can be visualized in density histograms (figure 1A,C and supplementary material). The p-values appear to follow the expected distribution as being composed of a population of truly null (non-differentially expressed) genes with p-values uniformly distributed among [0, 1] and a population of truly alternative (differentially expressed) genes with pvalues that tend to be close to zero. Similar shapes of distributions can be generated for all four tested comparisons, BRCA1 to sporadic, BRCA2 to sporadic, BRCA1 to BRCA2 and both BRCA1 and BRCA2 to sporadic.

Decomposing distributions and setting thresholds
To further examine the gene expression responses, we describe the alternative and null distributions as mathematical functions. For this purpose, we plot the genes sorted by their p-values, p, against the percentile ranks (rank position/the total number of p-values), y (figure 1B) to reveal F -1 , the inverse of the underlying cumulative density function. F -1 is estimated non-parametrically. When comparing BRCA1 mutation-positive tumors against sporadic tumors, is convex upward on the entire interval  figure 1A). To select differentially expressed genes, we use the critical  point p 0 as a p-value threshold. As a result, the null hypothesis was accepted for 2,083 genes with a p > 0.19. These genes all have at least a twice larger probability of being null than alternative. In general, since we have a mathematical description of f(p) any point can be chosen as threshold. For instance, we can calculate a point, p 1 , where genes have at least a two times larger probability of being alternative than null. Hundred-and-eighty genes have a p-value below this threshold of p 1 = 0.04, and are selected as alternative. The proportion of false positives, as estimated by the ratio of integrals of f 0 and f in the interval [0, p 1 ], amounts to 20%. The pFDR estimated by Storey and Tibshirani [5] is 34%. Notably, in this comparison at a pFDR of 5%, being the proposed threshold by Storey and Tibshirani [5], no gene is selected as being differentially expressed.
Comparing expression data from BRCA2 and sporadic tumors shows similar patterns ( = 0.75 and = 11.1) (supplementary figure 1). Here, using p 0 as the p-value threshold 2,168 genes are selected as being null. Only 20% of the genes with p <p 0(BRCA1) or p <p 0(BRCA2) are in common in both comparisons, suggesting that BRCA1 and BRCA2 affect different target genes.
Information on the genes affected in both BRCA1 and BRCA2 mutation-positive tumors can be obtained by comparing the data from the familial tumors together against the sporadic. The curvature of , revealing the genes in common between BRCA1 and BRCA2 mutated tumors, has two local maxima (supplementary figure 1). Also when the data from BRCA1 is compared to that from BRCA2 mutated tumors (figure 1C-D), the curvature plot has several local maxima. These observations indicate that the original distributions might be constituted of p-values from two or more populations of truly alternative genes in addition to the truly null genes. Each of these groups is represented by a close to linear part in the plot of .
The critical points of the curvature of and the corresponding p-values let us estimate the parameters to describe the distributions of the groups.

Simulation study
In order to verify how the method works in a case for which the result is known, a simulation study was performed. It was assumed for simplicity that each alternative p-value followed an exponential distribution with rate λ. We also compared the proportions of false positives and false negatives between different methods for selecting genes, e.g. while controlling the family wise error-rate (FWER) or the FDR (figure 3). The proportion of false calls largely depends on the separation between alternative and null features (the value of λ) for thresholds based directly on the p-values. As to be expected, when correcting for multiple testing, i.e. controlling the FWER or FDR, the proportion of false positives remains constant. Similar calculations can be made for complex populations with more than one population of alternative features.

Functional analysis
To discern alternated cellular processes, we performed an analysis at gene set level [8,9]. To obtain a description of the function of the affected genes (p <p 0 ) in the BRCA study, the genes on the arrays were associated with Gene Ontology (GO) terms [10]. The annotation of the genes in the original data files were updated to the Unigene 170 build (24 April 2004). Of the informative transcripts on the array, 62% could be attributed to gene sets that represent specific GO terms. Subsequently, the hypergeometriĉ π 0λ  where x is the number of affected genes in the gene set, N is the total number of affected genes, n is the total number of genes in the gene set and m is the total number of genes not in the gene set. In cases of several p 0 , we chose p 0 corresponding to the major peak of the curvature plot, or if two major peaks are close to each other, the one associated with the largest p 0 .
The significant gene sets and the affected genes in the four comparisons are shown in supplementary tables 1-4.
When comparing BRCA2 and sporadic, only three gene sets were significantly (p < 0.01) altered, one consisting of genes involved in DNA repair. For BRCA1, 15 genes sets are significantly (p < 0.01) altered. The genes involve DNA binding, phosphorylation and cell cycle. Among the six enriched gene sets in the comparison between familial (BRCA1 and BRCA2) and sporadic tumors, the most significant enrichment is that of genes associated with the mitochondria. Almost as significant is the lack of electron transport genes among the selected genes.
We then tried to select genes using alternative methods. When controlling the FWER or the Benjamini-Yuketieli For visualization of the distributions, we used a modified version of 'gene set enrichment analysis' [8], where the genes were sorted on the p-values to form a sequence. Genes belonging to the tested gene set are attributed a value of and the remaining genes are given the value . The cumulative sum of the sequence is calculated and plotted for a few gene sets in figure 4. The more a gene set deviates from a uniform distribution, the greater absolute value of the sum. For these gene sets, the threshold at p 0 coincides with the maximum of the cumulative sum.

Discussion
High-throughput techniques such as gene expression profiling by microarrays allows rapid screening of large amounts of data simultaneously. However, the staggering amount of data produced causes new problems, such as how to determine a meaningful p-value threshold between differentially expressed and unchanged genes. Methods such as Benjamini & Hochberg's [1] or Benjamini & Yekutieli's [11] control false positives proportions, but yield no information about false negatives proportions. Our method can produce a full description of the probability density function for differentially expressed genes, which makes it possible to rationally choose a desired ratio between π 0 and 1 -π 0 for which the p-value threshold can be calculated.
The most important features of our proposed p-value threshold are that it is data dependent and unsupervised. It relies on an independent estimate of the unchanged genes proportion, . The method works best for π 0 ∈ (0.20,0.95) and λ > 5, situations in which the number of alternative features is non-negligible and has a minimum separation from the null features.
The value of λ will increase with the separation of the populations of alternative and null genes. The power of the statistical test will also be reflected on λ, as well as the quality of the data. On commercial platforms with multiple synthetic probes, λ tends to be higher than on microarrays spotted with PCR products (data not shown).
As the tolerable amount of false calls depends on the research question, relaxed selection criteria are needed for questions which require a balance between the inclusion of false positives and the exclusion of false negatives. Con-ventional ways of selecting genes, such as stringent control of the FWER or FDR, select few, if any, false positives whereas the proportion of false negatives is at its maximum even with good separation between alternative and null features ( figure 3). Only when relaxing the thresholds are we able to bring the two rates into equilibrium. Ideally, with no overlapping populations between truly alternative and truly null genes and a single population of alternative genes, the threshold at p 0 will select of the total genes as differentially expressed. However, in practice there is always some overlap in p-values between truly alternative and truly null genes. Selection of genes with p <p 0 will frequently include many genes which can be useful under some circumstances. For instance, although in the examples of the case study the FDRs are high at p 0 , almost all alternative genes are included, as can be estimated by . In the functional analysis, the plots of the cumulative sums show that setting the threshold at p 0 is biologically valid. Up until these points there is still a contribution of genes in the relevant gene sets. A more stringent selection would neglect the evidence of many genes being marginally differentially expressed. By selecting genes through controlling the FDR, similar results can be obtained with relaxed thresholds. The analysis of biological function that we present on the BRCA data set was expected to reveal groups of genes related to the cellular processes that are affected by either mutant BRCA1 or mutant BRCA2. Indeed for certain predefined gene sets, e.g. genes involved in DNA repair, cell cycle progression or nucleic acid interactions, the proportion of genes with p <p 0 is higher than can be expected from a random distribution. The observation that the threshold p 0 coincides with the maximum of deviation from zero in figure 4 suggests that p 0 is at the breakpoint where the enrichment of the gene set culminates. We noticed that there are more gene sets significantly affected by mutated BRCA1 than mutated BRCA2. Next to "protein phosphorylation", specifically gene sets involved in cell cycle regulation are enriched, including "cell cycle", and the somewhat less significantly induced gene sets "mitosis" and "spindle". As for mutated BRCA2, there is no indication that its impairment affects cell cycle progression; the genes in the gene set "mitosis" appear even to be depleted among the affected genes. The involvement of DNA repair is additionally endorsed by the less significantly induced gene sets "response to stress" and "induction of apoptosis by extracellular signals". From the literature we know that the BRCA network deals with lesions that block or interfere with DNA replication. BRCA1 has a role in DNA damage sensing but its precise function is not known (for review see [12]). BRCA2 is directly involved in DNA repair [13]. Interestingly, when BRCA1 and BRCA2 mutation positive tumors are com- pared together against sporadic tumors, mitochondrial genes are affected but not electron transport genes. Until recently, the main function of mitochondria was thought to be the provision of energy for the cell through the creation of an electrical potential across its membrane. However, mitochondria are also involved in one route of apoptosis and recently there has been evidence for dysfunctional mitochondria being associated with premature ageing [14]. These are functions that could possibly be connected to the BRCA network.

Conclusion
The method presented here will allow researchers to set unsupervised thresholds to select alternative features from mixed populations of p-values. As the discipline of systems biology evolves, there will be a need to compare global measurements of different levels (RNA, proteins, metabolites, etc.). The evidence of features being alternative can be used as weights in a comparison. In this context, a mathematical description of significant features enables a probabilistic approach to identify affected pathways.

Setting thresholds
The p-values are defined as independent continuous random variables P 1 , P 2 , ..., P n taking values in the interval [0, 1]. Let us represent by f and F the probability density and cumulative probability functions of the distribution of a generic p-value P, respectively. We require that f is twice differentiable.
We represent by f 0 (p) and f 1 (p) the p-value densities for a null feature and an alternative feature, respectively. For a generic feature, its p-value density can be written as the mixture We represent by p 0 the p-value threshold defining the largest p-value corresponding to features identified as alternative. We shall define p 0 as the maximum of the second derivative of F -1 . The second derivative correspond to the curvature of the original function. Using (1), we get that p 0 satisfies Note that there need not be a unique value of p 0 satisfying (2). The derivation of the equation is given in the appendix.
For an intuitive understanding, consider the extreme case where the alternative features under study all have p-values equal to zero, whereas the null features have a uniform distribution over [0, 1]. Then F has probability mass at p = 0 equal to the proportion of alternative features (1π 0 ), and in the interval (0, 1] is described by a straight line between (0, 1 -π 0 ) and (1, 1). It is clear that the best threshold p 0 is the one that selects all features with zero pvalues. This corresponds to the turning point of F -1 In practice, the alternative features will not all have zero pvalues, but the turning point of F -1 , where its second derivative is zero, remains the best threshold to yield a compromise between false positives and false negatives proportions.

Application to exponential distributions
To give some insight about how the proposed threshold works, let us consider the case when the alternative p-values have an exponential distribution. This consists of one of the simplest functional forms for a monotonically decreasing density within [0, 1] with probability density function given by where , a constant guaranteeing that . As λ → 0, f 1 approaches the uniform distribution. Equation (2)  Thus, the threshold is set at a point where the proportion of the alternative features is half the proportion of the null features. In other words, the genes with a p-value less than this threshold all have some evidence against the null hypothesis, whereas very few truly differentially expressed genes are excluded. For many purposes this definition gives a too wide selection, and it is more desirable to have a point where the proportion of alternative genes is higher than the proportion of null genes, e.g. at a point p 1 where . At p 1 , the proportion of alternative genes is twice the proportion of null genes. However, with the functional description given above, any point depending on the ratio between the proportions of null and alternative genes can be found in an unsupervised manner.
The dependency of the threshold p 0 on λ and π 0 , is visualized in figure 5. In this figure, it is clear that the suggested threshold p 0 varies with π 0 and λ in the desired way: as λ increases, the separation between the null and alternative p-values distributions increases and, therefore, p 0 decreases. On the other hand, as the proportion of null genes π 0 increases, naturally it becomes harder to identify the alternative genes, and again p 0 decreases.
For practical reasons it is desirable to have an estimate of the proportion of false positives in the selection. Provided the described functions, we can integrate f 0 and f in the interval [0, max(p)]. The ratio between the two will be an estimate of the proportion of false positives.
Extensions to more than one alternative component are straightforward and given in appendix. Application of the more flexible beta distribution yield comparable results (see appendix).
The p-value threshold dependencies on π 0 and λ

Figure 5
The p-value threshold dependencies on π 0 and λ. p-values come from a mixture of π 0 null genes and a single population Non-parametric approach In practice, there is interest in making as few distributional assumptions as possible so as to yield a robust approach. We suggest estimating F non-parametrically. This can be done by plotting the sorted p-values of all genes against their percentile rank (as the scatterplots in figures 1B and 1D) and estimating F -1 by fitting a smooth function -such as a cubic spline -to the data points. The smoothing parameters are chosen by cross-validation.
In non-parametric regression, the equivalent degrees of freedom for noise (EDF) is defined by EDF = tr{I -A(α)}, where A(α) is the hat matrix associated with spline smoothing with smoothing parameter α. The relation between EDF, the generalized cross validation (GCV) score and the residual sum of squares [15,16] can be written as The optimal EDF, and thus smoothing parameter, to fit the smoothing spline is found by minimizing the residual sum of squares.
To get the curvature of the spline, , it is differentiated numerically twice. The critical points p 0 , of this second derivative can be identified.
The R code and additional information is available at http://www.medgencentre.nl/pla.