Multiple-input multiple-output causal strategies for gene selection

Background Traditional strategies for selecting variables in high dimensional classification problems aim to find sets of maximally relevant variables able to explain the target variations. If these techniques may be effective in generalization accuracy they often do not reveal direct causes. The latter is essentially related to the fact that high correlation (or relevance) does not imply causation. In this study, we show how to efficiently incorporate causal information into gene selection by moving from a single-input single-output to a multiple-input multiple-output setting. Results We show in synthetic case study that a better prioritization of causal variables can be obtained by considering a relevance score which incorporates a causal term. In addition we show, in a meta-analysis study of six publicly available breast cancer microarray datasets, that the improvement occurs also in terms of accuracy. The biological interpretation of the results confirms the potential of a causal approach to gene selection. Conclusions Integrating causal information into gene selection algorithms is effective both in terms of prediction accuracy and biological interpretation.


Background
Supervised analysis of genomic datasets (gene expression microarray or comparative genomic hybridization array for instance) with a large number of features and a respectively small number of samples requires the adoption of either regularization or feature selection strategies [1]. The most common feature selection strategies select or rank the variables according to a relevance score. In ranking, for instance, the score of each variable is the univariate association with the target returned by a measure of relevance, like mutual information, correlation, or p-value. If on one hand the ranking is widely used for its simple implementation and its low complexity, on the other hand it suffers from well-known limitations. A drawback is that ranking relies on univariate terms and as such it cannot take into consideration higher-order interaction terms or redundancy between features [2]. Another limitation is that ranking techniques are not able to distinguish between causes and effects. This is due to the fact that univariate correlation (or relevance) does not imply causation [3]. This problem is not solved in multivariate feature selection approaches since their cost function typically takes into consideration accuracy but disregards causal aspects. Nowadays the importance of bringing causality into play when designing feature selection methods is more widely acknowledged in the bioinformatics and the machine learning communities [4,5]. This is typically the case in microarray classification, where the goal is, for example, to distinguish between tumor classes or predict the effects of therapies on the basis of gene expression profiles [6]. In these settings the number of input variables, represented by the number of gene probes, is huge (typically several thousands) while the number of samples, represented by the patients' tumors, is very limited (a few hundreds) making the selection of relevant genes a challenging task. Moreover the inference of causal relationships between variables plays a major role in the context of genomic studies since more and more biologists and medical doctors expect data analysis to provide not only accurate prediction models (for prognostic purposes) but also insights into the mechanisms associated with disease and appropriate therapeutic targets.
It is well established that the detection of causal patterns cannot be carried out in a bivariate (single-input single-output) context and that at least a trivariate setting has to be considered [7]. This is put into evidence by the literature on graphical models where arc orientation relies on notions of conditional independence (requiring at least three terms) [8] and by the work on information theoretic methods for network inference [9]. In particular this paper will focus on the notion of feature interaction, a three-way mutual information that differs from zero when group of attributes are complementary [10]. The role of interaction in feature selection has already been discussed in the machine learning literature. Jakulin proposed an heuristic based on interaction for selecting attributes within the naive Bayesian classifier [11]. Meyer et al. proposed a filter algorithm which relies on the maximization of an information theoretic criterion, denoted Double Input Symmetrical Relevance (DISR), which implicitly takes into account the interaction, or complementarity between variables, in the choice of the features [12]. Watkinson et al. used a notion of synergy related to feature interaction to assign a score to a pair of genes and then measured the degree of confidence that one of the genes regulates the other [9]. A causal filter algorithm which computes interaction between inputs has been recently proposed in [5]. However it is unclear whether these techniques are capable of recovering the set of features that are both relevant and causal, in high-dimensional problems, such as in microarray analysis.
The contributions of this paper can be summarized as follows. First we introduce a new causal filter based on the interaction information and we show how to estimate this quantity in a multiple-input multiple-output setting. Second we assess the capacity of such filter to prioritize causal variables by using a synthetic case study. Third we measure from an accuracy and a biological point of view the performance of such causal filter in a number of prognostic studies in breast cancer. We advocate that a multiple-input multiple-output approach is particularly relevant in clinical studies where it is common that more than a single target variable is collected. This is the case of prognostic studies of breast cancer patients where several clinical indices, including patients' tumor size and histological grade, are collected together with the survival of the patients and the gene expressions of their tumor. It is worth to note that, in spite of their availability, these additional phenotypes are usually not taken into consideration since statistical studies focus on survival prediction and adopt single-output methods. This paper describes an original multiple-input multiple-output score which combines a conventional relevance term with a causal term. This additional term quantifies the causal role of the features and allows the prioritization of causal variables in the resulting ranking. We carried out a synthetic study, where the set of causal dependencies is known, which shows that causal variables are highly ranked once this score is adopted. We performed a meta-analysis of six publicly available breast cancer microarray datasets to assess the improvement of using our causal relevance score in terms of accuracy over the conventional ranking. The related discussion shows also that it is possible to carry out a biological interpretation of the role of selected variables which allows to discriminate between potentially causal and relevant, yet non causal, features. The source code, documentation and data are open-source and publicly available from http://mlg.ulb.ac.be/software/ and http:// compbio.dfci.harvard.edu/pubs/mimocausal/.

Mutual information and interaction
Let us consider a multiple-input multiple-output (MIMO) classification problem characterized by n input variables X = {x i , i = 1,..., n} and m targets Y = {y j , j = 1,..., m} where x i ∈ X is continuous and y j ∈ Y j = {c j1 , . . . , c jC } . Let us denote y 1 as the primary target and the remaining m -1 outputs as secondary targets. We make this distinction since, though we assume that the goal of classification is to predict y 1 , we want to take advantage of the causal information which can be extracted by multiple targets. We begin by reviewing some notions of information theory by considering three random (boldface) variables, notably two inputs x 1 , x 2 and the primary target y 1 . The mutual information [13] between the continuous variables x 1 and x 2 is defined in terms of their probabilistic density functions p(x 1 ), p(x 2 ) and p(x 1 , x 2 ) as where H is the entropy and the convention 0 log 0 0 = 0 is adopted. This quantity measures the amount of stochastic dependence between x 1 and x 2 and is also called two-way interaction [11]. Note that, if x 1 and x 2 are Gaussian distributed the following relation holds where r is the Pearson correlation coefficient.
Let us now consider the target y 1 , too. The conditional mutual information I(x 1 ; x 2 |y 1 ) [13] between x 1 and x 2 , once y 1 is given, is defined by The conditional mutual information is null iff x 1 and x 2 are conditionally independent given y 1 . The change of dependence between x 1 and x 2 due to the knowledge of y 1 is measured by the three-way interaction information defined in [14] as This measure quantifies the amount of mutual dependence that cannot be explained by bivariate interactions. When it is different from zero, we say that x 1 , x 2 and y 1 three-interact. A non-zero interaction can be either negative, which denotes a synergy or complementarity between the variables, or positive, which indicates redundancy. Because of the symmetry of the H operator in (3), we have By (4) we derive Since the joint information of x 1 and x 2 to y 1 can be written as I((x 1 ; x 2 ); y) = I(x 2 ; y) + I(x 1 ; y|x 2 ) it follows that by adding I(x 2 ; y) to both sides of (5) we obtain I((x1; x2); y)−I(x1; y)+I(x2; y)−I(x1; x2; y) = I(x1; y)+I(x2; y)+I(x1; x2|y)−I(x1; x2) (6) Note that the above relationships hold also when either x 1 or x 2 are vectorial random variables.

Feature selection, causality and interaction
Consider a multiple-class classification problem where x X ⊂ ℝ n is the n-variate input and y 1 ∈ Y is the primary target variable. Let A = {1,..., n} be the set of indices of the n inputs. Let us formulate the feature selection problem as the problem of finding the subset X* of v > 0 variables such that X * = arg max where the score s(X S ) of a subset X S of variables is given by the mutual information it brings to the target.
In other words, for a given number v of variables the optimal feature set is the one that maximizes the information about the target. Note that this formulation of the feature selection problem, also known as Max-Dependency [12,15], is classifier-independent.
If we want to carry out the maximization (7), both an estimation of I and a search strategy in the space of subsets of X are required. As far as the search is concerned, according to the Cover and Van Campenhout theorem [16], to be assured of finding the optimal feature set of size v, all feature subsets should be assessed. Given the infeasibility of exhaustive approaches for large n, we will consider here only forward selection search approaches. Forward selection starts with an empty set of variables and incrementally updates the solution by adding the variable that is expected to bring the best improvement (according to a given criterion). The hill-climbing search selects a subset of v <n variables in v steps by exploring only v i=0 (n − 1) configurations. For this reason the forward approach is commonly adopted in filter approaches for classification problems with high dimensionality [17,18].
If v = 1 the optimal set returned by (7) is composed of the most relevant variable, that is the one carrying the highest mutual information to y. For v > 1, we need to provide an incremental solution to (7) in order to obtain, given a set of d variables, the (d + 1) th feature which maximizes the increase of the dependency where (X S , x k ) stands for the set of variables resulting from the union of X S and x k . Since for large d the term s((X S , x k )) requires the computation of multivariate mutual information, its estimation is often prone to illconditioning and large variance. This led to the adoption of low variate approximations in literature, like the univariate approximation which leads to a ranking of the variables according to their mutual information with the target. More advanced approaches rely on bivariate decompositions [12] like where s((x i , x k )) quantifies the amount of information that x i and x k contain jointly about y 1 .
However a feature selection procedure targeting the Max-Dependency is not able in general to discriminate between causal and non causal dependencies. For instance in a selection procedure applied to a dataset derived from a causal process like the one in Figure 1, the effect x 4 could be more highly ranked than the direct causes x 1 and x 2 .
Here we propose to modify the conventional score s(X) into a causal score s c (X) able to keep into consideration the causal information returned by the adoption of a multiple output configuration. This is made possible by integrating in the score an interaction term which is strictly related to the notion of causal dependency.

Interaction and causal dependency
This section aims to establish the link between information theory and causality. Causality is at the same time an essential and imprecise notion in scientific discovery. In order to avoid any ambiguity, here we adopt the formalism of causal Bayesian network which is a sound and convenient framework for reasoning about causality between random variables [8]. This means that all causal dependencies between variables are expressed by a directed acyclic graph where the existence of an oriented edge from a node x i to a node x j means that x i directly causes x j . In formal terms we assume that the Causal Markov condition, the Causal Faithfulness and the Causal Sufficiency conditions hold [4]. Several works in literature showed that the structure of a causal graph can, to some extent, be inferred from observational data. The vast majority of these works rely on statistical tests of conditional independence [19]. Here we present a way to reason about causality which do not use independence tests but estimate an information theory score to prioritize potential causes.
Let us consider a triplet made of two inputs x i , x j and one target y 1 . As discussed in [4] six possible configurations of directed acyclic graphs involving three variables can occur. One configuration is trivial and corresponds to a completely unconnected graph. One configuration corresponds to a single arrow chain (for example only x i and x j are linked) and it is well known in literature that for a system of two variables the causal structure is not distinguishable. Another configuration corresponds to a fully connected graph and in this case the lack of independencies implies that the direction of the arrows cannot be determined. The remaining configurations can be illustrated and detected by studying the relationship [5] between the sign of I(x i ; x j ; y 1 ) and causal patterns of the triplet, like the ones sketched in Figure 1.
A negative interaction I(x i ; x j ; y 1 ) means that the knowledge of the value y 1 increases the amount of dependence between x i and x j ; this situation occurs in the presence of a collider. According to the label of the collider we can have two cases: i) the common effect configuration (for example the pattern involving x 1 , x 2 and y 1 , also known as the explaining-away effect) and ii) the spouse configuration (the pattern involving x 3 , x 5 and y 1 in Figure 1 where x 3 is the common descendant of y 1 and x 5 ). This is a consequence of the fact that, if we assume Causal Faithfulness, the graph structure entails that the two parents are independent (null mutual information) but conditionally dependent (conditional mutual information bigger than zero). Note also that both configurations are characterized by the presence of a collider.
On the contrary a positive interaction I(x i ; x j ; y 1 ) between x i and x j means that the knowledge of y 1 decreases the amount of dependence. This situation occur in two cases: i) the common cause configuration (for example, two dependent effects x 3 and x 4 become independent once the value of the common cause y 1 is known as illustrated in Figure 1) and ii) the causal chain configuration where one of the variables (let say, x 1 ) is the cause and the other (let say, x 4 ) is the effect of y 1 . This is due to the fact that the graph entails the dependence between x i and x j as well as their conditional independence (null conditional mutual information).
So far we have considered a single-output configuration. However causal patterns can be better identified if we consider a multiple-output configuration, for instance the two output configuration sketched in Figure  2. If y 1 and y 2 are two outputs representing different observations of the same phenomenon (for example a disease) we expect that the causal configurations concerning the first output appear also for the second one. This is a reasonable assumption in breast cancer clinical studies where the measured phenotypes (size and Figure 1 Single-output case with different causal patterns: (i) common effect or explaining away effect configuration involving x 1 , x 2 and y 1 ; (ii) spouse configuration involving x 5 and y 1 ; (iii) common cause configuration involving y 1 , x 3 , x 4 ; and (iv) causal chain configuration involving x 1 , y 1 , x 4 . histological grade of the tumor for instance) can be considered as different manifestations of the state of the tumor.
Let us consider for instance the inputs x 1 and x 2 and the two targets y 1 and y 2 : the common effect configurations between x 1 and x 2 and y 1 holds also for the triplet x 1 and x 2 and y 2 . The same happens for the common cause pattern involving both the triplet x 3 , x 4 , y 1 and x 3 , x 4 , y 2 . The presence of multiple outputs can therefore make more robust the identification of a causal pattern, especially in data configurations characterized by a very large number of variables.
In the following we will take advantage of these considerations to design a causal filter able to extract from observed data causal dependencies between variables.

The MIMO causal filter
The link between causality and interaction discussed in the previous section suggests that, if we want to detect causality without estimating large variate dependencies, we may search for patterns like the one sketched in Figure 3. This dependency pattern is characterized by two causal inputs and two outputs and can be detected when the following two conditions are satisfied: 1 the interaction I(x 1 ; x 2 ; y 1 ) is negative 2 the interaction I(x 1 ; x 2 ; y 2 ) is negative In what follows we implement this idea into a MIMO causal filter where input variables belonging to causal patterns like the one in Figure 3 are prioritized.
For the pair of inputs x 1 and x 2 and the pair of outputs y 1 and y 2 , we define a structural score which is composed of two multiple-input interaction terms. The magnitude of this score depends on whether x 1 and x 2 jointly play a joint causal role on y 1 and y 2 , or in other words, the pattern in Figure 3 is encountered. This means that the higher the term C(x 1 , x 2 ), the higher is the evidence that the pair x 1 , x 2 be a cause of y 1 ad y 2 . This score plays a similar role to the score that is maximized in structural identification of Bayesian networks [20]. If in that case the score measures the likelihood of the data for a given graph structure, here the quantity C(x 1 , x 2 ) measures the likelihood of the data for a structural pattern where the pair x 1 , x 2 has a causal role.
In the case of bivariate output (m = 2) we propose then a causal version s c of the univariate score s which accounts both for the relevance and the causal role of a pair of input variables x 1 and x 2 s c ((x 1 , x 2 )) = I(x 1 ; y 1 ) + I(x 2 ; y 1 ) + λC(x 1 , x 2 ) where l > 0 stands for the degree of causality imposed to the selection. If we adopt the filter approximation (10) the incremental formula takes the form In other terms this formulation suggests to add at the (d + 1) th step, among all the remaining variables, the one which has the better combination of relevance and causality, where the causal term is obtained by averaging over the selected variables and the considered outputs. x 1 x 2 y 1 y 2 Figure 3 Two-inputs two-outputs causal pattern.
Note that in the case of m > 2 targets the structural score (11) is obtained by averaging the interaction terms over the m variables. Similarly to what is done in regularization approaches [21] where specific configurations (typically those with higher complexity) are penalized by adding a complexity term to the one measuring the error, the causality parameter l in (13) is expected to penalize input variables with no causal role (positive interaction). Note that for l = 0 the selection rule (13) boils down to the rule (9). The following section will study the impact of the causality term on the accuracy and the stability of a filter algorithm implementing the rule (13).

Results
In this section we perform two experiments to assess the role of the causation term in the feature selection process. The first one is based on a number of synthetic datasets generated by simulating a causal Bayesian network while the second relies on public microarray breast cancer datasets to assess the approach in a real data setting.

Synthetic data
This experiment focuses on the prioritization of causes in a set of classification tasks defined on the basis of simulated data generated by the causal structure depicted in Figure 4. Note that this causal structure aims to represent in a very simplified manner a stochastic dependency characterized by a number of indirect (nodes 1-3) and direct causes (nodes 4-8), a latent non measurable variable (node 9), one observable primary target (node 10), two secondary targets (nodes 11-12), a set of additional effects (nodes 13-29) and a number of independent and irrelevant variables (nodes [30][31][32][33][34][35][36][37][38][39][40]. In order to set up an analogy with the real data experiments of the following subsection, we could make the assumption that the latent variable represents the cancer progression, the three targets denote a set of observable measures depending on the cancer state (patients' prognosis, size and histological grade of the tumor for instance), and that all other variables represent the expression of genes whose activity could play a causal role, be determined as an effect of the disease or be completely irrelevant. It is worth to note that also in the presence of a hidden variable the interaction between marginally independent causes given an effect is negative. This is due to the fact that conditioning on the hidden variable or on one of his children is equivalent in terms of d-separation between the variables [8] and consequently is equivalent in terms of the sign of the interaction. We simulate a number of multivariate datasets from the causal structure in Figure 4 and for each of them we rank the inputs of the MIMO classification problem by using the conventional ranking approach based on mutual information (Equation (9)) and our novel approach based on causality (Equation (13)). The stochastic dependency between parents and descendants of the network is modeled by a linear regression where the parameters are uniformly sampled in [-2, 2] and the noise distribution is Gaussian with zero mean and standard deviation s. We carry out a series of experiments, each characterized by 150 datasets and an increasing noise standard deviation ranging between 0.01 and 0.4. All the variables are continuous apart from the variables 10, 11, and 12, which correspond to the targets y 1 , y 2 , and y 3 of the classification task and are discretized to two binary values. Note that all measures are centered and scaled in order to have a zero mean and unit standard deviation; this allows for a better understanding of the impact of the noise amplitude on the ranking.
The quality of our causal prioritization strategy is assessed by measuring the average ranking of the direct causes (nodes 4-8) and the percentage of time that the direct causes are ranked among the first 5 variables. These two measures (together with a 90% confidence interval) for different values of λ are shown in Figure 5 and 6 respectively. These plots show that by increasing the value of λ, the average ranking position of direct causes decreases (direct causes are better prioritized) and that the percentage of correct selection increases (among the first ranked variables we find the direct causes with higher probability). The improvement occurs in a consistent manner for different values of the noise standard deviation though the detection of causal terms become less accurate as the noise increases. Note also that the very bad performance of the ranking (λ = 0) strategy (0% rate of correct selection) derives from the very large number of effects which tend to be ranked before the real causes.

Real expression data
The real data experiment consists of 6 public microarray datasets derived from breast cancer clinical studies (Table 1) in order to compare the generalization accuracy of the selection returned by the conventional ranking approach based on mutual information (Equation (9)) with the accuracy of the selection returned by our novel approach based on causality (Equation (13)).
All the microarray studies analyzed hereafter are characterized by the collection of gene expression data (the inputs X representing n = 13,091 unique genes), the survival data (the primary target y 1 ) and 2 additional clinical (secondary) variables about the state of the tumor, namely the histological grade and the tumor size. These clinical variables are well known by clinicians to be highly relevant for prognosis since large tumors of high grade are usually aggressive and lead to poor prognosis. Each experiment was conducted in a meta-analytical [22] and cross-validation [23] framework, that is the set of variables are selected by relying on the samples of several datasets and the validation is performed on a set of samples not used for the selection. In order to adopt a classification framework, the survival of the patients was transformed in a binary class such as low or high risk of the patients given their clinical outcome at five years as in [24]. We conducted two sets of meta-analysis validation experiments to compare the conventional ranking approach (λ = 0 case) and our causal version for different values of λ: • Holdout: we carried out 100 training-and-test repetitions where for each repetition the training set is composed of half of the samples of each dataset and the test is composed of the remaining ones.
• Leave-one-dataset-out where for each dataset the features used for classification are selected without considering the patients of the dataset itself. Once the selection is over, 100 holdout repetitions are used to assess the generalization power of the selected set of features.
All the mutual information terms are computed by using the Gaussian approximation (2). This allows the meta-analysis integration at the correlation level by means of the weighted estimation approach proposed by [22]. All the experiments were repeated for three sizes of the gene signature (number of selected features): v = 20, 50, 100.
The quality of the selection is represented by the accuracy of a Naive Bayes classifier measured by four different criteria: the Area Under the ROC curve (AUC), the Root Mean Squared Error (RMSE), the SAR (Squared error, Accuracy, and ROC score introduced by [25]) and the precision-recall F score measure [26]. Table 2 reports for the holdout experiment the value of the four performance criteria for different values of v and λ. Table 3 refers to the leave-one-dataset-out experiments for v = 20, v = 50, and v = 100, respectively. Note that the W-L (Win-Loss) line reports the number of datasets for which the causal filter is significantly more (W) or less (L) accurate than the ranking filter according both to the McNemar test [27] (p-value < 0.05 adjusted for multiple testing by Holm's method [28]) and the Wilcoxon paired test on squared errors (p-value < 0.05 adjusted for multiple testing by Holm's method).

Discussion
In the previous section we reported the accuracy results of the traditional ranking approach and our novel method based on a causal relevance score. Here we discuss the added value of our causal approach both from a quantitative and qualitative perspective.
The performance measured in cross-validation suggests that the incorporation of a causal term leads to a significant improvement of classification accuracy. This improvement is observed for different validation configurations and different sizes of the prognostic gene signature. From these results we can conclude that (i) causal feature selection is interesting also for a prediction perspective and (ii) relevant (prognostic) information is contained into secondary output variables (in our case tumor size and histological grade). Although the absolute improvement is only moderate (3% to 6% depending on the validation configurations and performance estimates), the use of our causal ranking strategy in more sophisticated modeling approach for prognosis, such as in [29], may help develop more clinically relevant prognostic classifiers in breast cancer.
The other advantage of our approach is that the introduction of a causality term leads to an interpretation of the causal role of the selected genes. We illustrate this characteristic in Figure 7 by comparing, through Gene Ontological (GO) terms, gene rankings with increasing degree of causality using a pre-ranked gene set enrichment analysis (GSEA) [30]. By quantifying how the causal rank of genes diverges from the conventional one (λ = 0) with respect to λ we can identify the gene sets that are potential causes or effects of breast cancer.
Genes that remains among the top ranked ones for increasing λ can be considered as relevant (they contain predictive information about survival) and causal. Genes whose rank increases for increasing λ are putative causes: they have less relevance than other genes (for example, those being direct effects) but they are potentially causal. These genes would have been missed by conventional ranking, where they would appear as false negatives if we interpret the outcome of conventional ranking in causal terms. Genes whose rank decreases for STK 159 [53] VDX 344 [54,55] UNT 137 (92) [56] MAINZ 200 [57] TRANSBIG 198 [58] Duplicated patients between studies have been removed in two studies, UPP and UNT; the remaining unique patients are reported in brackets. All the datasets have been generated from Affymetrix technology and normalized using fRMA [51]. We consider for analysis the 13,091 unique genes common in all datasets.
increasing λ are putative effects in the sense that they are relevant but probably not causal. This set of genes could be erroneously considered as causal, and represent false positives if we interpret the outcome of conventional ranking in causal terms.
Since genes are not acting in isolation but rather in pathways, we analyzed the gene rankings in terms of gene set enrichment. As described in [30], the normalized enrichment score (NES) computed in GSEA enables quantification of the strength of association of a gene set Table 3       (GO term) with a phenotye of interest, here poor or good prognosis (survival). In more details, given a list of genes L ranked by their prognostic relevance and an a priori defined set of genes S (for example genes sharing the same GO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout L or primarily found at the top or bottom; gene sets associated with the prognosis phenotype tend to show the latter distribution. NES reflects the degree to which a gene set S is overrepresented at the extremes (top or bottom) of the entire ranked list L. The score is calculated by walking down the list L, increasing a running-sum statistic when a gene in S is encountered and decreasing it when genes not in S are encountered. The magnitude of the increment depends on the statistic used to rank the genes in L. In our study the statistic of a gene is simply its rank (the most relevant genes have the largest ranks) and its sign depends on the association of its expression with survival: positive sign if over-expression is associated with poor survival and inversely. The score is the maximum deviation from zero encountered in the "walk"; it corresponds to a weighted Kolmogorov-Smirnov-like statistic [30,31]. Finally the score is normalized for each gene set to account for the size of the gene set, yielding a NES.
We computed NES for multiple genome-wide rankings generated with increasing values of λ, and displayed in Figure 7 the score of the 3 most enriched GO terms which are identified as being potentially (A) both causes and effects, (B) causes, and (C) effects of breast tumorigenesis (GSEA results for all the GO terms are provided in Additional File 1, 2 and 3). The first group of GO terms that show similar enrichment scores independently of their level of causality are implicated in cell movement and division, cellular respiration and regulation of cell cycle ( Figure 7A). The first GO term involves genes encoding for the Rho family of GTPases proteins that are among key regulators of actin and microtubule cytoskeleton [32] and are often overexpressed in human breast cancers [33]. Bromberg et al. showed that, when affected by RNF5, this family of proteins may cause dysregulation of cell proliferation to promote tumor progression [34]. The second GO term represents the co-enzyme metabolic process which includes proteins showed to be early indicators of breast cancer [35]; perturbation of these co-enzymes might cause cancers by compromising the structure of important enzyme complexes implicated in mitochondrial functions [35]. Genes involved in the third GO term "regulation cyclin-dependent protein kinase activity" are key players in cell cycle regulation and inhibition of such kinases proved to block proliferation of human breast cancer cells [36]. Moreover, Moore et al. recently highlighted the role of cyclin-dependent kinases as progesterone activators that could give raise to tumors and sustain their progression in breast cancer [37]. Figure 7B displays the GO terms that are increasingly enriched with their degree of causality, involving genes that are putative causes of the tumorigenesis affecting patients' survival; these genes might have been missed by the conventional ranking approach (λ = 0). Counterintuitively, the three GO terms in this category are related to the immune system what is sought to be more an effect of the tumor growth as lymphocytes Normalized Enrichment Score  Figure 7 Most enriched GO terms with respect to l according to a pre-ranked gene set enrichment analysis (GSEA): (A) GO terms enriched in the conventional ranking and having a high degree of causality for tumorigenesis; (B) GO terms increasingly enriched with respect to larger l, suggesting they are putative causes for tumorigenesis; (C) GO terms decreasingly enriched with respect to larger l, suggesting they are putative effects for tumorigenesis. The normalized enrichment score (NES) depends on the genome-ranking of the genes, which in turn depends on λ. Larger the NES of a GO term, stronger the association of this gene set with survival; the sign of NES reflects the direction of association of the GO term with survival, a positive score meaning that over-expression of the genes implies worst survival and inversely.
strike cancer cells as they proliferate. However, several independent research groups showed that frequent usage of aspirin significantly decrease the long-term risk of cancer death by correcting immune system dysfunction [38,39], findings that have been confirmed in breast cancer [40], what supports that the immune system might have a causal role in tumorigenesis. There is strong evidence of interplay between immune system and tumors since solid tumors are commonly infiltrated by immune cells; in contrast to infiltration of cells responsible for chronic inflammation, the presence of high numbers of lymphocytes, especially T cells, has been reported to be an indicator of good prognosis in many cancers [41], what concours with the sign of the enrichment (negative enrichment; Figure 7B). We and others have reported that gene expression signatures representing the immune response process were associated with a better prognosis in particular subtypes of breast cancer [29,42,43]. The last group of GO terms are less enriched when the degree of causality increases and the vast majority of the corresponding genes are related to cell-cycle and proliferation ( Figure 7C). Cell-cycle and proliferationrelated genes, such as for example Ki67, have been used for many decades to describe breast tumors: High levels of Ki67 have been correlated with worse prognosis and are also known to be associated with high tumor grade and negativity of estrogen receptor status [44,45]. We and others have shown that a quantitative measurement of proliferation genes using mRNA gene expression could provide an accurate assessment of prognosis of breast cancer patients [43,46,47]. We also have shown that only one of those genes, AURKA, which is significantly enriched in this case in the M phase GO term, was sufficient to recapitulate the prognostic performance of different prognostic signatures [48]. However the enrichment of these proliferation-related genes seems to be a downstream effect of the breast tumorigenesis instead of its cause.
Our approach allows to identify biological processes that may be direct causes of cancer. These processes are likely to be missed by conventional methods. Given the promising performance of our approach, we plan to integrate our method in analytical frameworks combining efficiently the available clinical data and a priori biological knowledge, potentially retrieved from biomedical literature [49] or pathway database [50], in order to unravel gene sets or network of genes causal of cancer patients' survival.