Data description and preprocess
The human (HG-U133A and GNFIH) and mouse (GNF1M) Affymetrix microarray data (Affymetrix, Snata Clara, CA) [2] were retrieved from Gene Expression Omnibus (GEO) of National Center for Biotechnology Information (NCBI) [16] (Human: GDS594 and GDS596; Mouse: GDS 592). The data resource contains the expression data from 73 human tissues (GDS594/GDS596) and 69 mouse tissues (GDS592). Only common homology tissues from human and mouse were considered in the following analysis (24 orthologous adult tissue listed in Additional file 1). And for those biological replicates we assigned the average expression value as the signal. After mapping probes to the microarray chip platform annotation files, the final datasets include 14746 human genes and 13048 mouse genes. The homology gene pairs’ information in human and mouse was gained from NCBI (http://www.nicbi.nlm.gov/HomoloGene). After extracting the unique hit homology pairs, we identified 5055 orthologous gene pairs have expression data in both human and mouse. We used packages from BioConductor (http://www.bioconductor.org/) for functional annotation of those genes. Package ‘org.Hs.eg.db’ and ‘org.Mm.eg.db’ (version 2.2.11) were used for Gene Ontology (GO) mapping of human and mouse genes respectively. Package ‘KEGG.db’ (version 2.2.11) was used for pathway mapping of genes. Only these GO modules with over five genes on the chip will be used in the following analysis.
Estimation of gene expression divergence in tissues
A p-rank method [17] was performed to evaluate the gene expression distance between two tissues. To obtain the p-rank value of gene expression, a gene’s expression signal value was first ranked among all genes in each tissue, and then divided by the number of genes n.
Let p
i,t1
and p
i,t2
indicate the p-rank values of gene i in tissues t1 and t2. We calculated the gene divergence E
i
for every gene between each two tissues on the fate map as:
E
i
= |p
i,t
1 – p
i,t
2|
In each pair of tissues, those genes with divergence in the lowest 5% of all gene divergence values were classified as the conservatively expressed genes, which have the least expression divergence between two tissues. Furthermore, those genes which were expressed as the top 5% in a group of tissues or in all the tissues were considered as ubiquitously conservatively expressed genes in corresponding tissue group.
Gene set based measurement for inter tissue expression similarity
Gene Ontology (GO) [18] provides a controlled vocabulary of terms for describing gene product characteristics and gene product annotation data. Consider a set of n genes in one Gene Ontology module; three kinds of measurements were defined for characterizing the profiling of this GO among the tissues. If expression distance was significantly small for a given GO module in a pair of tissues, the expression level of this module was considered conservative in the tissue pair. If a significant p value was obtained in the Kolmogorov–Smirnov test (KS test) [19] for the expression of a given GO module in a pair of tissues, the expression distribution of this module was referred to as a divergent pattern in these tissues. If Pearson correlation coefficient was significant for a given GO module in a pair of tissues, their expression relationship was referred to as a linear correlated pattern [20].
Tissue expression distance
Let S
i,g,t
denote the (log2 transformed) expression levels of gene i of GO g in tissue t. The expression distance of GO g in tissue t1 and tissue t2 is calculated as [4]:
while n is the number of gene in GO g.
The D values in all tissue pairs in human and mouse approximately follow a normal distribution (see in Additional file 2). Based on the cumulative distribution function of the normal distribution, a p value was calculated to represent the significance of the expression similarity of GO g in tissue t1 and tissue t2:
µ for the mean of D values and σ for the standard deviation of D values.
For a sub-group of tissues in the fate map’s branch, the intro group expression similarity of GO g is estimated using an integrated p value [21]. Since D ~ N(µ,σ2), D
sub
(the mean value of D values in a sub-group of tissues) should follow D
sub
~ N(µ,σ2/ N), N for the sample size in this group. Thus the integrated p value could be calculated as:
P
g,t–sub
= Pr(X ≤ x) = ϕ
µ,σ
2/ N(x), x = Dg,t 1–t 2
The GO module with a small p value or integrated p value was considered as conserved GO module in a pair or a group of tissues.
Tissue expression difference
To identify differentially expressed GO modules in a pair of tissues, a nonparametric KS test was performed [22]. The p-rank values of the genes in certain GO module from a pair of tissues were used. A p value (denoted as p
g,t1-t2
) less than 0.05 denoted that the difference of expression status of this GO module in two tissues should be considered as statistically significant.
For a group of tissues, the p values from the KS test in each pair of tissues from that group were integrated. The p values p
g,ti-tj
were transformed to Z
g,ti-tj
with quantile function of normal distribution. Then Z score for a sub-group of tissues (Z
sub
) was summarized from Z
g,ti-tj
with the function:
, n for the number of pair wise p values in this group
If these GO modules were not signature modules of a group of tissues, the p
g,ti-tj
would follow uniform distribution. If p
g,ti-tj
followed uniform distribution, Z
g,ti-tj
would follow norm distribution. As a result, Z score would also follow norm distribution. A significant small value of Z comparing to normal distribution corresponded to the significantly being perturbed of the GO modules under these tissues.
Tissue expression correlation
The expression correlation of GO m in a pair of tissues t1 and t2 was defined as the Pearson’s correlation coefficient [23] of the profiles of genes in GO m in tissue t1 and in tissue t2:
, while n is the number of gene in GO m.
A high r indicates a high similarity of the expression profiles of the GO module in two tissues, thus this GO module was considered as correlated expressed GO module in tissues. The threshold for r between two tissues was set to be r > 0.9, while the threshold for r in a group of tissues was set to be r
mean
> 0.9.
Gene enrichment analysis
For all the genes defined as tissue conserved ones, we made gene module enrichment using in GO modules and KEGG pathways. The enrichment significance of tissue conserved genes was calculated in a hyper genomic distribution model. Let k be the number of sequences corresponding to genes of interest (i.e. conservatively expressed genes) in certain GO module (or KEGG pathway). The total number of sequences in genes of interest is n. Given the total size of genes in the libraries (m) and in all tissues (N), the probability of observing k or more sequences for gene g in liver can be calculated by the formula:
The Bonferroni correction was used to adjust the p-values for liver enrichment gene identification based on the hyper-geometric distribution [21]. When performing n tests, with each of them being significant with probability β, the probability that at least one of them comes out significant should be less than n*β. The calculation was conducted in R platform with ‘stats’ package (http://www.r-project.org).
Supplementary files
The analysis workflow was presented in Additional file 3. All the supplementary files were available at our website: http://omics.biosino.org:14000/kweb/tissue_expression/