Game Theory
In this section we introduce some basic game theoretical notions and definitions. A coalitional game is a pair (N, v), where N denotes a finite set of players and v : 2N→ ℝ the characteristic function, with v(∅) = 0. Often we identify a coalitional game (N, v) with the corresponding characteristic function v. A group of players T ⊆ N is called a coalition and v(T) is the worth of the coalition T.
The unanimity game (N, u
R
) based on R ⊆ N is the game described by u
R
(T) = 1 if R ⊆ T and u
R
(T) = 0, otherwise. Every coalitional game (N, v) can be written as a linear combination of unanimity games in a unique way, i.e. v = ∑S⊆N, S≠∅λ
S
(v) u
S
(see for instance [33]. The coefficients are called unanimity coefficients or dividends of the game (N, v).
Given a coalitional game (N, v), an allocation or payoff vector is a vector (x
i
)i∈N∈ ℝNassigning to player i ∈ N the amount x
i
.
A solution for a class of coalitional games is a function ψ that assigns a payoff vector ψ(v) to every coalitional game in the class. A well-known solution for coalitional games is the Shapley value, introduced by [7].
The Shapley value assigns to each player his average marginal contribution over all the possible orderings, i.e. permutations, of the players. Formally, given a coalitional game (N, v), the Shapley value assigns to player i ∈ N:
(1)
where π is a permutation of the players, P(π; i) is the set of players that precede player i in the permutation π and n is the cardinality of N.
In [6], the definition of microarray game was introduced as a coalitional game (N, ) with the objective to stress the relevance ('sufficiency') of groups of genes in relation to a specific condition. Let N = {1,...,n} be a set of genes. On a single microarray experiment on N, a sufficient requirement to realize in a coalition S ⊆ N the association between a condition and an expression property is that all the genes showing that expression property belong to coalition S (sufficiency principle). Different expression properties for genes might be considered like, e.g., under- or over-expression, strong variation, abnormal expression etc. A group of genes S ⊆ N which realizes the association between the expression property and the condition on a single array is called a winning coalition for that array. For example, consider a single microarray experiment on a set of genes N = {1, 2,...,10} under a given condition (e.g., exposure to air pollution) and suppose that only genes 1, 3 and 7 show the expression property (e.g., over-expression). Then, each set of genes S ⊆ N with 1, 3, 7 ∈ S is a winning coalition in such an experiment.
Moving to k ≥ 1 microarray experiments on N, we refer to a Boolean matrix B ∈ {0,1}n × k, where the Boolean values 0 – 1 represent two complementary expression properties, for example the property of normal expression (coded by 0) and the property of abnormal expression (coded by 1). Let B.
j
be the j-th column of B. We define the support of B.
j
, denoted by sp(B.
j
), as the set sp(B.
j
) = {i ∈ N | B
ij
= 1}.
The microarray game corresponding to B is the coalitional game (N, ) with N = {1,...,n} and where : 2N→ ℝ+ is such that (T) is the rate of occurrences of the coalition T as a winning coalition, i.e., is the rate of occurrences of the coalition T as a superset of the supports in the Boolean matrix B. in formula, we define (T), for each T ∈ 2N\ {∅}, as the value
(2)
where Θ(T) = {j ∈ K | sp(B.
j
) ⊆ T, sp(B.
j
) ≠ ∅}, with K = {1,...,k} and where card(Θ(T)) is the cardinality of Θ(T). Finally, we define (∅) = 0.
Note that the expression of a gene is a continuous variable which hypothetically may assume whatever value across different samples, then it is not at all easy to identify good criteria to discriminate between different expression properties. The binarization method used in this work is described in section Data processing for CASh. For alternative binarization methods in gene expression analysis, see for instance [34, 35].
Comparative Analysis of Shapley Value
Consider two groups of microarray experiments on the same set of genes N = {1,...,n}, respectively collected under two different conditions 1 and 2. Let B1 ∈ {0, 1}n × kand B1 ∈ {0, 1}n × hbe two Boolean matrices, where B1 is obtained via a discretization procedure from an expression data set with k biological samples under condition 1, and B2 is obtained via a discretization procedure from an expression data set with h biological samples under condition 2.
Let , be the microarray games corresponding to the Boolean matrices B1 and B2, respectively. Let ϕ() be the Shapley value on the game and let ϕ() be the Shapley value on the game .
Consider the following absolute difference of Shapley values
(3)
for each i ∈ N, where ϕ
i
() is the Shapley value of gene i in the microarray game corresponding to the Boolean matrix B1 and ϕ
i
() is the Shapley value of gene i in the microarray game corresponding to the Boolean matrix B2.
We formally present a procedure (Algorithm 1) to test the null hypothesis that each gene in N has the same Shapley value between the two conditions 1 and 2. In fact we want to test the null hypothesis δ
i
(ϕ(), ϕ()) = 0 against the alternative hypothesis δ
i
(ϕ(), ϕ()) ≠ 0. More precisely, we introduce a test procedure based on a non-parametric Bootstrap method of re-sampling with replacement (see [12, 13] as general introduction to Bootstrap methods; see [36] as a Bootstrap application to microarray analysis), which is able to test the null hypothesis of no difference between the means of two random samples without assuming under the null hypothesis that the probability distributions in the populations are the same.
Algorithm 1
INPUT:
-
Two Boolean matrices B1 ∈ {0,1}n × k, B2 ∈ {0, 1}n × h, with n, k, h ∈ {1, 2,...};
-
an integer number b of Bootstrap re-samples (with replacement).
OUTPUT:
BEGIN:
-
Compute the observed Shapley value difference for each i ∈ N, where and are the microarray game correspondingto B1 and B2, respectively.
FOR
r
: 1 TO
b
BEGIN:
-
Compute the r1-th Bootstrap re-sample (with replacement) on the column indices {1,...,k} of B1; compute the r2-th Bootstrap re-sample (with replacement) on the column indices {1,...,h} of B2, respectively.
-
Define the Boolean matrix corresponding to the r1-th re-sample and the Boolean matrix corresponding to the r2-th re-sample.
-
Compute the Bootstrap Shapley value difference , for each i ∈ N, where , are the microarray games corresponding to the Boolean matrices and , respectively.
END.
-
for each i ∈ N, compute the (un-adjusted for multiple comparisons) estimate Achieved Significance Level (ASL) or p-value p
i
of each gene i ∈ N as follows .
END.
In Additional files, a more detailed version of the pseudo-code of Algorithm 1 [see Additional file 1] and its implementation [see Additional file 4] are given. A discussion on the generation of raw p-values using bootstrap method and the related procedures to adjust p-values for multiple hypothesis testing is provided [see Additional file 2]. Calculations of CASh on a numerical instance are also illustrated [see Additional file 3].
Description of data processing
We analyzed the microarray gene expression data published in [14]. Study subjects were children from the Teplice (TP) area in the north and from the rural Prachatice (PR) in the south of the Czech Republic, for a total of 47 children; 23 from the TP area and 24 from the PR area. For details on study population, collection and processing of blood, RNA isolation and microarray analysis of gene expression see [14].
Data pre-processing
Raw data files from ImaGene (BioDiscovery, Marina del Rey, CA, USA) published in [14] were uploaded into Expressionist Refiner Array (Genedata AG, Basel, Switzerland) for data transformation. Data transformations were applied in the following order: background was corrected according to [37, 38]; LOWESS correction with a smoothing factor of 0.1 to remove any nonlinearity between the two channels was applied [39]; expression ratios of the subjects's sample with respect to the common reference sample were calculated using a specific bayesian algorithm to estimate the most likely expression signal given the measurements for the spot and background intensities. The following data were derived:
-
Expression ratio = signal Cy5/signal Cy3;
-
Signal to noise (S/N) ratio for each channel = ;
-
Relative error computed = .
For the analysis of the data the quality thresholds were set as follows:
In addition, only the transcripts were used which have, after the previously described filtering, at least 50% valid values per group, i.e. ≥ 12 valid values for the PR group and also ≥ 12 for the TP group. From the about 20000 spots on the microarray, 5873 fulfill the above described quality and filtering criteria and were used for further statistical analysis. These spots from the series of 47 experiments are represented as a gene expression matrix X, with n = 5873 (after filtering) rows and 47 columns, where the i-th row consists of a 47-elements expression vector X
i
. = (Xi 1,...,Xi 47), for a single gene sequence i.
On such a matrix, a t-test analysis was used to identify genes significantly differing in expression between the two groups of individuals (TP compared to PR).
Data processing for CASh
The final matrix X of 5873 genes and 47 samples that was distilled from the data filtering and preparation as described above, was split in two distinct expression matrices, XTPand XPR, whose columns were selected from X accordingly to the 23 subjects from TP area and the 24 subjects from PR area.
First, a procedure aimed to discriminate over-regulated levels of gene expression with respect to expressions measured in the PR area was applied. Each continuous value in the vector Xi.= (Xi 1,...,Xi 47) which was equal to or greater than Mean [] + Stdev[] was coded as 1, or as 0 if otherwise. Consequently, a Boolean matrix B+ with n rows and 47 columns and with values {0, 1} was generated from X. Separately, a procedure aimed to discriminate under-regulated levels of gene expression with respect to expressions measured in the PR area was applied. Each continuous value in the vector X
i
. = (Xi 1,...,Xi 47) which was equal to or smaller than Mean[] - Stdev[] was coded as 1, or as 0 if otherwise. Consequently, a Boolean matrix B- with n rows and 47 columns with values {0, 1} was also generated from X. According to the distinction between PR and TP biological samples, the Boolean matrix B+ was split in two different Boolean matrices BTP+and BPR+, and the Boolean matrix B- was split in two other Boolean matrices BTP-and BPR-. By relation (2), from the Boolean matrix BTP+the microarray game is defined and, in a similar way, the microarray game from the Boolean matrix BTP-is also defined.
In order to remove those genes whose high level of Shapley value may be attributed to chance, we applied the Bootstrap-based Algorithm 1. In practice, we applied Algorithm 1 (b = 1000) with BTP+in the role of B1 and BPR+in the role of B2 and the un-adjusted p-values were computed. In a similar manner, we applied Algorithm 1 (b = 1000) with BTP-in the role of B1 and BPR-in the role of B2 and the corresponding un-adjusted p-values were computed.
As further criterium for filtering, genes with Shapley value smaller than the mean plus the standard deviation in both microarray games and were filtered out. Following this criterium, 838 genes were selected in game and 889 genes were selected in game ) (see the highlighted intervals of Figure 1).
The overlap between two lists with the same number n of genes is defined as the following fraction
for n ≥ 1.
Hierarchical cluster analysis and gene ontology
We used hierarchical clustering and K-means clustering to detect similarity relationships in gene expressions between TP and PR areas, based on the set of genes selected by CASh and t-test. In hierarchical clustering, all agglomerative hierarchical clusters were computed using the Euclidean distance between single vectors and the Ward method [40]. In K-means clustering, the algorithm of Hartigan and Wong [41] is used, the number of clusters a priori specified is K = 2 and the maximum number of iterations allowed is 10000. Before clustering analysis, we imputed missing values by the k-Nearest Neighbors method (k = 5) [42]. Heat-maps were representative of logged gene expression values, which are centered and scaled in the row direction. Statistical analysis were performed with Expressionist Pro from Genedata or the software R [see http://www.r-project.org]. Classification accuracy of clusters is measured as the percentage of correctly classified subjects. More precisely, classification accuracy was computed in two steps: first, each cluster is assigned to the area (TP or PR) with the majority of subjects in the cluster; second, the accuracy is computed according to the following ratio:
For functional annotation analysis, we used the online software DAVID [20], which employs a modified Fishers exact test [43, 44] to derive biological themes within particular gene sets defined by functional annotation. In this way, over-representation of a particular annotation term corresponding to a group of genes was quantified in terms of the p-value computed in the test procedure.
Simulation study
A simulated gene expression data-set of of n = 1000 genes obtained by random samples from normal distributions under two simulated conditions which are denoted by a class variable y ∈ {1, 2}. Nine hundred genes were randomly sampled from a normal distribution with mean= 1 and stdev = 1 under both conditions 1 and 2. The remaining 100 genes were slpit in two sets of target genes (i.e., to be discovered genes), since the parameters of the normal distributions from which the expression values are sampled change with the conditions: 50 genes were randomly sampled from a normal distribution with mean = 2 and stdev = 1 under condition 1, and from a normal distribution with mean = 1 and stdev = 1 under condition 2; remaining 50 genes were randomly sampled from a normal distribution with mean = 1 and stdev = 1 under condition 1, and from a normal distribution with mean = 1 and stdev = 2 under condition 2. To be processed by CASh, each randomly sampled continuous value of gene i, i = 1,..., 1000, under condition 1 (condition 2) which was equal to or greater than the average expression of gene i plus its standard deviation under condition 2 (condition 1) was coded as 1, or as 0 if otherwise.