ROKU: a novel method for identification of tissue-specific genes
© Kadota et al; licensee BioMed Central Ltd. 2006
Received: 08 November 2005
Accepted: 12 June 2006
Published: 12 June 2006
One of the important goals of microarray research is the identification of genes whose expression is considerably higher or lower in some tissues than in others. We would like to have ways of identifying such tissue-specific genes.
We describe a method, ROKU, which selects tissue-specific patterns from gene expression data for many tissues and thousands of genes. ROKU ranks genes according to their overall tissue specificity using Shannon entropy and detects tissues specific to each gene if any exist using an outlier detection method. We evaluated the capacity for the detection of various specific expression patterns using synthetic and real data. We observed that ROKU was superior to a conventional entropy-based method in its ability to rank genes according to overall tissue specificity and to detect genes whose expression pattern are specific only to objective tissues.
ROKU is useful for the detection of various tissue-specific expression patterns. The framework is also directly applicable to the selection of diagnostic markers for molecular classification of multiple classes.
A major challenge of microarray analysis is to detect genes whose expression in a single or small number of tissues is significantly different than in other tissues. Accurate identification of such tissue-specific genes can allow researchers to deduce the function of their tissues and organs at the molecular level .
Several methods have been used for this purpose [1–5]. Of these, Schug et al.  demonstrated the effectiveness of using Shannon information theoretic entropy for ranking genes according to their tissue-specificity, from restricted (tissue-specific) expression to average (ubiquitous/housekeeping) expression. However, there is also a severe disadvantage. The entropy does not explain to which tissue a gene is tissue-specific, only measuring the degree of overall tissue specificity of the gene. Hence further analysis to identify specific tissues is needed. Although Schug et al.  proposed a new statistic (Q) based on entropy to estimate the degree of a gene's specificity on a particular tissue, the issue of redundancies remains where top-ranked genes as specific to tissue A are also top-ranked as specific to tissue B. We assert such genes are not specific to A or B, but rather are genes specific to both A and B. For example, we observed that two of the top five probesets specific to liver were also found in the top five probesets specific to gall bladder . The issue of such redundancies is a concern with any ranking-based method, such as pattern-matching , when the number of interrogated tissues increases. Methods of identifying genes specific only to objective tissues are needed.
Unlike ranking-based methods, methods based on outlier detection are free from the issue of redundancies because they identify tissues corresponding to both over- and under-expressed outliers for each gene [3, 5]. Therefore, these methods can treat equally various types of tissue-specific genes: (1) 'up-type' genes selectively over-expressed in a single or small number of tissues compared to the others, (2) 'down-type' genes selectively under-expressed, and (3) 'mixed-type' genes selectively over- and under-expressed in some tissues. Although the mixed-type is possible, the first two types (up-type and down-type) of expression patterns are particularly important because those genes may be associated with fundamental biological phenomena and may contain particular tissue-specific diagnostic markers. Using outlier-detection-based methods, however, ranking genes according to their degree of overall tissue-specificity is difficult.
This complementary relationship between ranking-based and outlier-based methods led us to develop a combined approach, ROKU. ROKU analyzes any type of tissue-specific genes (up-, down-, and mixed-type) in two steps. First, it ranks genes according to overall tissue-specificity using Shannon entropy, and second, for each gene, it identifies specific tissues whose observations are regarded as outliers using a method of Kadota et al. . We applied the method to both synthetic and real gene expression data and demonstrated its utility by comparison with other methods.
Results and discussion
Definition of tissue-specific genes
Data processing and its effect on Shannon entropy calculation
When one gene vector x= (x 1 , x 2 , ..., x N ) is given, the entropy H(x) can be calculated by equation 1 (See Methods). The range of H is from 0 whose gene expression is perfectly restricted in a single tissue (Fig. 1a) to log2(N) whose gene expression pattern is flat in all the interrogated tissues (Fig. 1b). We therefore rely on the low entropy score for the identification of tissue-specific genes. The black scatter plots in Fig. 1 are synthetic expression observations for N tissues (i.e., N = 10 in this case). The entropy H for each gene vector x is given by the number in black above the figures. Clearly, direct calculation of the entropy for raw gene vector x works well only for detecting tissue-specific genes in a narrow sense (Figs. 1a and 1c) but not for those in a broad sense (Figs. 1d–f). The H scores (3.22, 3.29, 3.23 for Figs. 1d–f, respectively) of tissue-specific genes in a broad sense are close to the maximum value (log210 = 3.32) and cannot identify those genes as 'tissue-specific'.
To detect tissue-specific genes in a broad sense, we introduce a simple method that processes a given gene vector x and makes a new vector x'. Data processing is done by subtracting the one-step Tukey biweight and by taking the absolute value of equation 2 (see Methods). The Tukey biweight yields a robust weighted mean able to resist 50% of outliers . The scatter plots of processed vectors are shown in red in Fig. 1. The entropy scores, H(x'), for the processed vectors to obvious tissue-specific genes in a broad sense (Figs. 1d–f) are considerably lower than those for x. This is because the relative expression levels for specific tissues (highlighted tissues) become high after data processing. For example, the value (0.04) for tissue 3 in Fig. 1e becomes 0.75 after data processing. Since the base-line value is 0.1 (1/N, N = 10) in this case, such high values decisively contribute low entropy to the gene expression pattern. Also, entropy scores, H(x') and H(x), to non-specific (or randomly expressed) genes are quite similar and close to the maximum (3.32) (Figs. 1g and 1h). These results demonstrate the adequacy for our strategy for detecting tissue-specific genes in a broad sense at least on typical/hypothetical expression data.
Analysis of real data
To further investigate the validity of our method (ROKU), we applied the method to a public gene expression matrix consisting of 36 normal human tissues and 22,283 probesets . Briefly, ROKU (1) processes each probeset expression vector and makes a processed vector x', (2) calculates the entropy H(x'), and (3) assigns specific tissues to each probeset whose observations are detected to be 'outliers' (see Methods). We compared the performance of ROKU to that of Schug's method, which directly uses the original/non-processed vector x for measuring the entropy H(x) . The two entropy scores (H(x') and H(x)) for all probesets are available in the additional file [see Additional file 1].
To compare the agreement of top-ranked probesets between ROKU and Schug's method we analyzed the percentage of common probesets in a top-ranked set of ~22,283 probesets. About 80% of ~3,000 top-ranked probesets are common, indicating that ROKU does not change the rank of probesets drastically (data not shown). One way to compare the effect of the data processing used in ROKU to that used in Schug's method is to sort probesets in order of increasing magnitude by the difference between the two entropy scores (H(x') - H(x)) calculated by the two methods. Since ROKU outputs relatively low entropy to each probeset compared to Schug's method as a whole [see Additional file 1], the average value of (H(x') - H(x)) tends to be negative: -0.425 (4.314 for ROKU; 4.739 for Schug's method).
Comparison of entropy scores by ROKU and Schug's method. Probesets and two entropy scores estimated by ROKU and Schug's method are shown. The numbers in parenthesis are the ranks of the probesets. One can easily rank all probesets by this order using the additional file [see Additional file 1].
H(x') - H(x)
Probesets with the lowest (H( x ') - H( x )) scores
Probesets with the highest (H( x ') - H( x )) scores
There are 858 probesets satisfying H(x') > H(x): processed expression vectors are less tissue-specific than the original vectors. Visual evaluation for those probesets showed no probeset exists whose entropy score is improperly assigned, i.e., no obvious tissue-specific probesets exist. These results demonstrate the data processing strategy used in ROKU successfully estimated/ranked probesets by their overall tissue specificity on real data. We verified such trends in other microarray datasets (data not shown).
Note that ROKU is inferior to Schug's method (i.e., direct application of entropy to measuring tissue specificity) in rare cases. For example, consider a gene expression pattern of constant high expression in N/2 tissues and low expression in other tissues. ROKU gives the processed expression pattern as 'flat' and H(x') = log2(N). Accordingly, ROKU cannot distinguish such differential expression patterns from constant expression patterns because it gives the same entropy scores for the two patterns. In other words, H(x') is not useful for identifying non-specific genes. Nevertheless, this disadvantage is not a problem for detecting the tissue-specific expression patterns we focused on. We also observed that there was no probeset suffer from this disadvantage in the real data set.
Detection of specific tissues as outliers
As mentioned earlier, the entropy does not indicate which tissues are specific though it can rank genes according to their degrees of overall tissue specificity. To identify such specific tissue when they exist, ROKU employs an outlier-detection-based method proposed by Kadota et al.  (see Methods for details). Regardless of over- and/or under-expressed outliers, it can return specific tissues corresponding to outliers for each gene. Accordingly, an outlier matrix can be constructed (consisting of 1 for over-expressed outliers, -1 for under-expressed outliers, and 0 for non-outliers) that corresponds to the original gene expression matrix by applying the method. Genes with any expression pattern of interest can be detected using the outlier matrix. The outlier matrix is also available in the additional file [see Additional file 1].
For example, ROKU identifies 59 probesets specific to lung and 291 probesets specific to fetal lung and of course no redundancies exist between the two sets by virtue of the advantage of the original method . Since ROKU is a combined method consisting of calculation of an entropy and assignment of specific tissues to each gene, ROKU can compensate for the disadvantage of the original method  by assigning an entropy score H(x'): ROKU can rank genes with particular tissue-specific patterns by their overall tissue specificity. We compared the performance of ROKU to that of Schug's Q t (x) statistic  which can also rank genes specific to a tissue t.
List of top ten genes specific to lung and to fetal lung. Probesets and two entropy scores estimated by ROKU and Schug's method are shown. *Probesets indicate those listed to be "specific to lung" are also listed to be "specific to fetal lung" and vice versa. Note that Schug's method has strong redundancy when similar tissues are selected independently and therefore cannot detect probesets specific only to the objective tissue.
Specific to Lung
Specific to Fetal lung
Of 22,283 probesets analyzed, 16,072 exhibit one or more specific tissues. We observed that most of them consist of specific up-expression patterns, such as Figs. 1c and 1d [see Additional file 1]. This is probably because the distribution of gene expression levels from the dataset we used here roughly follows an exponential distribution in which the probability of a gene's expression observation decays exponentially (data not shown). Still we appreciate the merit of ROKU being able to detect genes with various types of tissue-specific expression patterns, as shown in Fig. 1.
Effect of different quantification algorithms on gene ranking
There are some ways for extending this work. First, we used an outlier-detection-based method  for the detection of specific tissues. Some other methods, such as Sprent's non-parametric method  and its derivative, could be applicable. The outputs of these methods vary with the selected parameters. A comparative study of these methods is the next task. Second, the current work did not discuss the statistical significance of observed differences in gene expression. We plan to combine the significance analysis, such as a method of Sharov et al. , with the current method.
In this work, we propose a novel method (ROKU) for the detection of genes with tissue-specific expression patterns. ROKU was developed to compensate for the disadvantages of two conventional methods [3, 4] by combining the advantages of the two. Using synthetic expression data, we demonstrated its potential applicability for the detection of various types of specific expression patterns. Although most of the detected tissue-specific genes in real microarray data exhibit one type of expression pattern (i.e., 'up-type' genes selectively over-expressed in a single or small number of tissues compared to the others), the entropy-based gene ranking by ROKU outperforms the two original methods. ROKU can be a powerful tool for selecting genes specific to tissues of interest.
Publicly available Affymetrix U133A oligonucleotide microarray data for 22,283 genes in 36 various normal human tissues  were downloaded from the author's website . For the most part we used the data quantified using MAS5 (Micro Array Suite 5 from Affymetrix) software. Other quantified data using the RMA algorithm  were also analyzed to compare the effects of different quantification algorithms. RMA quantification was performed by the justRMA() function in R  using raw data (Affymetrix CEL files).
Gene ranking by Shannon entropy
The use of Shannon entropy  to rank genes by their overall tissue specificity here is the same as described in Schug et al. . Consider one gene's expression vector x= (x 1 , x 2 , ..., x N ) for N tissues and an observation x t for tissue t. The entropy of the gene is calculated as
where p t is the relative expression of x t for tissue t defined as . H ranges from zero to log2(N), with the value 0 for genes expressed in a single tissue (Fig. 1a) and log2(N) for genes expressed uniformly in all the interrogated tissues (Fig. 1b).
To equally identify down- and mixed-types of tissue-specific genes as well as up-type genes, we processed the original vector x. The processed observation x t ' for tissue t is defined as
where T bw is the one-step Tukey biweight, a popular statistic robust against outliers. It provides as much robustness as a median and is also used to estimate the expression signal from each probe set in the Affymetrix Micro Array Suite (MAS 5.0) software package [7, 13]. The parameters for the calculation of T bw are the same as those adopted in the tukey.biweight() function in R package 'affy' (i.e., c = 5, ε = 0.0001) . Our method (ROKU) uses the processed expression vector x' of a gene, while Schug et al.  uses the original vector x, to calculate the gene's entropy (H(x') and H(x)) as a measure of the overall tissue specificity.
Detecting specific tissues as outliers
As mentioned above, the entropy does not indicate which tissues are specific, but is a measure of the overall tissue specificity of a gene. We imagine observations in specific tissues to be easily visualized as outliers on the over- and/or under-expressed side if any exist. We used an outlier-detection-based method proposed by Kadota et al.  to detect tissues with specific expression patterns. According to Kadota et al. , the statistic U for identifying outliers is defined as
where n and s denote the numbers of non-outlier and outlier candidates, and σ denotes the standard deviation (SD) of the observations of the n non-outlier candidates. The procedure is first, normalize the gene vector x= (x1, x2, ..., x N ) for N (=n+s) tissues by subtracting the mean and dividing by the SD; second, sort the normalized values (i.e., Z-scores) by order; third, calculate the statistics U for various combinations of outlier candidates starting from both sides of the values; finally, regard tissues corresponding to outliers detected in the combination of the minimum U as 'specific'.
We would like to thank T. Konishi, S. Nakamura and M. Abe for helpful discussions and J. J. Rodrigue for editing the English. This study was performed through Special Coordination Funds for Promoting Science and Technology of the Ministry of Education, Culture, Sports, Science and Technology, the Japanese Government.
- Greller LD, Tobin FL: Detecting selective expression of genes and proteins. Genome Res 1999, 9: 282–296.PubMed CentralPubMedGoogle Scholar
- Pavlidis P, Noble WS: Analysis of strain and regional variation in gene expression in mouse brain. Genome Biol 2001, 2: research0042. 10.1186/gb-2001-2-10-research0042PubMed CentralView ArticlePubMedGoogle Scholar
- Kadota K, Nishimura SI, Bono H, Nakamura S, Hayashizaki Y, Okazaki Y, Takahashi K: Detection of genes with tissue-specific expression patterns using Akaike's Information Criterion (AIC) procedure. Physiol Genomics 2003, 12: 251–259.View ArticlePubMedGoogle Scholar
- Schug J, Schuller WP, Kappen C, Salbaum JM, Bucan M, Stoeckert CJ Jr: Promoter features related to tissue specificity as measured by Shannon entropy. Genome Biol 2005, 6: R33. 10.1186/gb-2005-6-4-r33PubMed CentralView ArticlePubMedGoogle Scholar
- Ge XJ, Yamamoto S, Tsutsumi S, Midorikawa Y, Ihara S, Wang SM, Aburatani H: Interpreting expression profiles of cancers by genome-wide survey of breadth of expression in normal tissues. Genomics 2005, 86: 127–141. 10.1016/j.ygeno.2005.04.008View ArticlePubMedGoogle Scholar
- Grant GR, Liu J, Stoeckert CJ Jr: A practical false discovery rate approach to identifying patterns of differential expression in microarray data. Bioinformatics 2005, 21: 2684–2690. 10.1093/bioinformatics/bti407View ArticlePubMedGoogle Scholar
- Hubbell E, Liu WM, Mei R: Robust estimators for expression analysis. Bioinformatics 2002, 18: 1585–1592. 10.1093/bioinformatics/18.12.1585View ArticlePubMedGoogle Scholar
- Irrizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003, 4: 249–264. 10.1093/biostatistics/4.2.249View ArticleGoogle Scholar
- Sharov AA, Dudekula DB, Ko MS: A web-based tool for principal component and significance analysis of microarray data. Bioinformatics 2005, 21: 2548–2549. 10.1093/bioinformatics/bti343View ArticlePubMedGoogle Scholar
- Normal tissue data[http://www.genome.rcast.u-tokyo.ac.jp/normal/]
- R Project[http://www.r-project.org/]
- Shannon CE, Weaver W: The mathematical theory of communication. Univ of Illinois Press. Champaign, Illinois; 1963.Google Scholar
- Hoaglin DC, Mosteller F, Tukey JW: Understanding Robust and Exploratory Data analysis. Wiley, New York; 2000.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.