ROKU: a novel method for identification of tissue-specific genes

Background One of the important goals of microarray research is the identification of genes whose expression is considerably higher or lower in some tissues than in others. We would like to have ways of identifying such tissue-specific genes. Results We describe a method, ROKU, which selects tissue-specific patterns from gene expression data for many tissues and thousands of genes. ROKU ranks genes according to their overall tissue specificity using Shannon entropy and detects tissues specific to each gene if any exist using an outlier detection method. We evaluated the capacity for the detection of various specific expression patterns using synthetic and real data. We observed that ROKU was superior to a conventional entropy-based method in its ability to rank genes according to overall tissue specificity and to detect genes whose expression pattern are specific only to objective tissues. Conclusion ROKU is useful for the detection of various tissue-specific expression patterns. The framework is also directly applicable to the selection of diagnostic markers for molecular classification of multiple classes.


Background
A major challenge of microarray analysis is to detect genes whose expression in a single or small number of tissues is significantly different than in other tissues. Accurate identification of such tissue-specific genes can allow researchers to deduce the function of their tissues and organs at the molecular level [1].
Several methods have been used for this purpose [1][2][3][4][5]. Of these, Schug et al. [4] demonstrated the effectiveness of using Shannon information theoretic entropy for ranking genes according to their tissue-specificity, from restricted (tissue-specific) expression to average (ubiquitous/housekeeping) expression. However, there is also a severe disadvantage. The entropy does not explain to which tissue a gene is tissue-specific, only measuring the degree of overall tissue specificity of the gene. Hence further analysis to identify specific tissues is needed. Although Schug et al. [4] proposed a new statistic (Q) based on entropy to estimate the degree of a gene's specificity on a particular tissue, the issue of redundancies remains where top-ranked genes as specific to tissue A are also top-ranked as specific to tissue B. We assert such genes are not specific to A or B, but rather are genes specific to both A and B. For example, we observed that two of the top five probesets specific to liver were also found in the top five probesets specific to gall bladder [4]. The issue of such redundancies is a concern with any ranking-based method, such as patternmatching [2], when the number of interrogated tissues increases. Methods of identifying genes specific only to objective tissues are needed.
Unlike ranking-based methods, methods based on outlier detection are free from the issue of redundancies because they identify tissues corresponding to both over-and under-expressed outliers for each gene [3,5]. Therefore, these methods can treat equally various types of tissuespecific genes: (1) 'up-type' genes selectively overexpressed in a single or small number of tissues compared to the others, (2) 'down-type' genes selectively underexpressed, and (3) 'mixed-type' genes selectively over-and under-expressed in some tissues. Although the mixed-type is possible, the first two types (up-type and down-type) of expression patterns are particularly important because those genes may be associated with fundamental biological phenomena and may contain particular tissue-specific diagnostic markers. Using outlier-detection-based methods, however, ranking genes according to their degree of overall tissue-specificity is difficult.
This complementary relationship between ranking-based and outlier-based methods led us to develop a combined approach, ROKU. ROKU analyzes any type of tissue-specific genes (up-, down-, and mixed-type) in two steps. First, it ranks genes according to overall tissue-specificity using Shannon entropy, and second, for each gene, it identifies specific tissues whose observations are regarded as outliers using a method of Kadota et al. [3]. We applied the method to both synthetic and real gene expression data and demonstrated its utility by comparison with other methods.

Definition of tissue-specific genes
We first show typical examples of various types of gene expression patterns. We here divided tissue-specific genes into two levels, a narrow sense and a broad sense. Genes over-expressed in a small number of tissues but unexpressed or slightly expressed in others, such as those shown in Figs. 1a and 1c, are defined as tissue-specific genes in a narrow sense, while genes over-and/or underexpressed in a small number of tissues compared to other tissues are defined as tissue-specific in a broad sense (the latter group includes the former). We focused here on the latter case and wanted to identify such expression patterns (see black scatter plots in Figs. 1d-f). We use two terms ("genes" and "probesets") interchangeably throughout this paper.

Data processing and its effect on Shannon entropy calculation
When one gene vector x = (x 1 , x 2 , ..., x N ) is given, the entropy H(x) can be calculated by equation 1 (See Methods). The range of H is from 0 whose gene expression is perfectly restricted in a single tissue (Fig. 1a) to log 2 (N) whose gene expression pattern is flat in all the interrogated tissues (Fig. 1b). We therefore rely on the low entropy score for the identification of tissue-specific genes. The black scatter plots in Fig. 1 are synthetic expression observations for N tissues (i.e., N = 10 in this case). The entropy H for each gene vector x is given by the number in black above the figures. Clearly, direct calculation of the entropy for raw gene vector x works well only for detecting tissue-specific genes in a narrow sense (Figs. 1a and 1c) but not for those in a broad sense (Figs. 1d-f). The H scores (3.22, 3.29, 3.23 for Figs. 1d-f, respectively) of tissue-specific genes in a broad sense are close to the maximum value (log 2 10 = 3.32) and cannot identify those genes as 'tissue-specific'.
To detect tissue-specific genes in a broad sense, we introduce a simple method that processes a given gene vector x and makes a new vector x'. Data processing is done by subtracting the one-step Tukey biweight and by taking the absolute value of equation 2 (see Methods). The Tukey biweight yields a robust weighted mean able to resist 50% of outliers [6]. The scatter plots of processed vectors are shown in red in Fig. 1. The entropy scores, H(x'), for the processed vectors to obvious tissue-specific genes in a broad sense (Figs. 1d-f) are considerably lower than those for x. This is because the relative expression levels for specific tissues (highlighted tissues) become high after data processing. For example, the value (0.04) for tissue 3 in Fig. 1e becomes 0.75 after data processing. Since the baseline value is 0.1 (1/N, N = 10) in this case, such high values decisively contribute low entropy to the gene expression pattern. Also, entropy scores, H(x') and H(x), to nonspecific (or randomly expressed) genes are quite similar and close to the maximum (3.32) (Figs. 1g and 1h). These results demonstrate the adequacy for our strategy for detecting tissue-specific genes in a broad sense at least on typical/hypothetical expression data.

Analysis of real data
To further investigate the validity of our method (ROKU), we applied the method to a public gene expression matrix consisting of 36 normal human tissues and 22,283 probesets [5]. Briefly, ROKU (1) processes each probeset expression vector and makes a processed vector x', (2) calculates the entropy H(x'), and (3) assigns specific tissues to each probeset whose observations are detected to be 'outliers' (see Methods). We compared the performance of ROKU to that of Schug's method, which directly uses the original/non-processed vector x for measuring the entropy H(x) [4]. The two entropy scores (H(x') and H(x)) for all probesets are available in the additional file [see Additional file 1].
To compare the agreement of top-ranked probesets between ROKU and Schug's method we analyzed the percentage of common probesets in a top-ranked set of 22,283 probesets. About 80% of ~3,000 top-ranked probesets are common, indicating that ROKU does not change the rank of probesets drastically (data not shown). One way to compare the effect of the data processing used in ROKU to that used in Schug's method is to sort probesets in order of increasing magnitude by the difference between the two entropy scores (H(x') -H(x)) calculated by the two methods. Since ROKU outputs relatively low entropy to each probeset compared to Schug's method as a whole [see Additional file 1], the average value of (H(x') -H(x)) tends to be negative: -0.425 (4.314 for ROKU; 4.739 for Schug's method). Table 1 lists the ten lowest-and ten highest (H(x') -H(x)) valued probesets and Fig. 2 shows expression profiles for the two lowest-and two highest probesets listed in Table  1. The difference is greatest for the probeset '206319_s_at'. This is mainly because the relative expression for the testis changes from 0.35 to 0.75 by virtue of data processing. ROKU gives a low entropy (H(x') = 1.950 and H(x') <H(x)) for the probeset '206319_s_at' and a high entropy (H(x') = 4.729 and H(x') > H(x)) for the probeset '201131_s_at'. This is quite reasonable because visual evaluation admits the former to be tissue-specific and the latter to be non-specific. Schug's method, however, gives quite similar values (4.235 for the former and 4.228 for the latter) for the two probesets: the entropy for the former is higher than that for the latter. Processed data are represented by red circles. Specific expression observations detected to be outliers by [3] are highlighted. Numbers in black and red indicate Shannon entropy scores for the original and the processed data, respectively. Shannon entropy can range from (a) 0 to (b) 3.32 in this case (logarithm to the base 2 of 10). Expression patterns such as (a) and (c) are defined as tissue-specific genes in a narrow sense. Tissue-specific genes in a broad sense include various expression patterns such as (a, c, d) up-type, (e) down-type, and (f) mixed-type. By virtue of data processing, ROKU can detect tissue-specific genes in a broad sense. Meanwhile, ROKU gives relatively high scores (close to 3.32) for non-specific gene expression patterns such as (g) and (h).

Shannon entropy calculation for various tissue-specific expression patterns
Expression patterns of probesets listed in Table 1  Figure 2 Expression patterns of probesets listed in Table 1. Expression patterns of probesets with the two (a) lowest-and (b) highest (H(x') -H(x)) scores are shown. Other legends are the same as given in Fig. 1.
There are 858 probesets satisfying H(x') > H(x): processed expression vectors are less tissue-specific than the original vectors. Visual evaluation for those probesets showed no probeset exists whose entropy score is improperly assigned, i.e., no obvious tissue-specific probesets exist. These results demonstrate the data processing strategy used in ROKU successfully estimated/ranked probesets by their overall tissue specificity on real data. We verified such trends in other microarray datasets (data not shown).
Note that ROKU is inferior to Schug's method (i.e., direct application of entropy to measuring tissue specificity) in rare cases. For example, consider a gene expression pattern of constant high expression in N/2 tissues and low expression in other tissues. ROKU gives the processed expression pattern as 'flat' and H(x') = log 2 (N). Accordingly, ROKU cannot distinguish such differential expression patterns from constant expression patterns because it gives the same entropy scores for the two patterns. In other words, H(x') is not useful for identifying non-specific genes. Nevertheless, this disadvantage is not a problem for detecting the tissue-specific expression patterns we focused on. We also observed that there was no probeset suffer from this disadvantage in the real data set.

Detection of specific tissues as outliers
As mentioned earlier, the entropy does not indicate which tissues are specific though it can rank genes according to their degrees of overall tissue specificity. To identify such specific tissue when they exist, ROKU employs an outlierdetection-based method proposed by Kadota et al. [3] (see Methods for details). Regardless of over-and/or under-expressed outliers, it can return specific tissues corresponding to outliers for each gene. Accordingly, an outlier matrix can be constructed (consisting of 1 for overexpressed outliers, -1 for under-expressed outliers, and 0 for non-outliers) that corresponds to the original gene expression matrix by applying the method. Genes with any expression pattern of interest can be detected using the outlier matrix. The outlier matrix is also available in the additional file [see Additional file 1].
For example, ROKU identifies 59 probesets specific to lung and 291 probesets specific to fetal lung and of course no redundancies exist between the two sets by virtue of the advantage of the original method [3]. Since ROKU is a combined method consisting of calculation of an entropy and assignment of specific tissues to each gene, ROKU can compensate for the disadvantage of the original method [3] by assigning an entropy score H(x'): Table 1

Probeset ID Entropy H(x') -H(x) H(x') H(x)
Probesets with the lowest (H(x') -H(x) ROKU can rank genes with particular tissue-specific patterns by their overall tissue specificity. We compared the performance of ROKU to that of Schug's Q t (x) statistic [4] which can also rank genes specific to a tissue t. Fig. 3 shows the top-ranked gene expression profiles specific to (a) lung and (b) fetal lung identified by ROKU's H(x') statistic and Schug's Q t (x) statistic [4]. The Q t (x) statistic for a tissue t in a gene expression vector x is defined as Q t (x) = H(x) -log 2 (p t ) (see Methods for details). Clearly, ROKU can detect probesets whose expression patterns are specific only to each of the objective tissues (lung or fetal lung) while Schug's Q statistic cannot. This is because a low Q t (x) statistic indicates that gene x is relatively highly expressed in a small number of tissues including tissue t, but does not always indicate whether the expression pattern of x is specific only to the tissue t. Indeed, both probesets ('215454_x_at' detected as specific to lung and '205982_x_at' specific to fetal lung) identified by Schug's Q t (x) statistic include another tissue in addition to the objective tissue. We analyzed this trend in the top-ranking probesets ( Table 2). We assert that these probesets are not specific to lung (or fetal lung) but are specific to both lung and fetal lung. Although the choice of which method should be used is, of course, dependent on individual research purposes, our method (ROKU) is superior to Schug's Q statistic for detecting genes specific only to tissues of interest.
Of 22,283 probesets analyzed, 16,072 exhibit one or more specific tissues. We observed that most of them consist of specific up-expression patterns, such as Figs. 1c and 1d [see Additional file 1]. This is probably because the distribution of gene expression levels from the dataset we used here roughly follows an exponential distribution in which the probability of a gene's expression observation decays exponentially (data not shown). Still we appreciate the merit of ROKU being able to detect genes with various types of tissue-specific expression patterns, as shown in Fig. 1.

Effect of different quantification algorithms on gene ranking
As discussed in Grant et al. [6], a serious issue regarding any method is the choice of quantification algorithms, such as MAS5 [7] or RMA [8]; different choices can output different subsets of top-ranked genes. We compared the influence on gene ranking when the same raw data are MAS5-quantified and RMA-quantified. Fig. 4 shows the percentages of common probesets in a top-ranked set of Table 2: List of top ten genes specific to lung and to fetal lung. Probesets and two entropy scores estimated by ROKU and Schug's method are shown. *Probesets indicate those listed to be "specific to lung" are also listed to be "specific to fetal lung" and vice versa. Note that Schug's method has strong redundancy when similar tissues are selected independently and therefore cannot detect probesets specific only to the objective tissue. Expression patterns of probesets listed in Table 2  Figure 3 Expression patterns of probesets listed in Table 2. Expression patterns of top-ranked probesets specific to (a) lung and to (b) fetal lung are shown. Other legends are the same as given in Fig. 1. Note that the two methods output different topranked probesets and probesets detected by ROKU are specific only to the objective tissue.

ROKU
22,283 probesets between MAS5 data and RMA data, by gene ranking using ROKU (red circle) and Schug's method (black circle). Although both methods (ROKU and Schug's method) output relatively low percentages of common probesets, especially in the 100 top-ranked probesets (about 31% for ROKU; about 3% for Schug's method), the percentages for ROKU were consistently higher than those for Schug's method. This result indicates gene ranking based on ROKU is more robust against data transformation than Schug's method.
There are some ways for extending this work. First, we used an outlier-detection-based method [3] for the detection of specific tissues. Some other methods, such as Sprent's non-parametric method [5] and its derivative, could be applicable. The outputs of these methods vary with the selected parameters. A comparative study of these methods is the next task. Second, the current work did not discuss the statistical significance of observed differences in gene expression. We plan to combine the significance analysis, such as a method of Sharov et al. [9], with the current method.

Conclusion
In this work, we propose a novel method (ROKU) for the detection of genes with tissue-specific expression patterns. ROKU was developed to compensate for the disadvantages of two conventional methods [3,4] by combining the advantages of the two. Using synthetic expression data, we demonstrated its potential applicability for the detection of various types of specific expression patterns. Although most of the detected tissue-specific genes in real microarray data exhibit one type of expression pattern (i.e., 'up-type' genes selectively over-expressed in a single or small number of tissues compared to the others), the entropy-based gene ranking by ROKU outperforms the two original methods. ROKU can be a powerful tool for selecting genes specific to tissues of interest.

Microarray data
Publicly available Affymetrix U133A oligonucleotide microarray data for 22,283 genes in 36 various normal human tissues [5] were downloaded from the author's website [10]. For the most part we used the data quantified using MAS5 (Micro Array Suite 5 from Affymetrix) software. Other quantified data using the RMA algorithm [8] were also analyzed to compare the effects of different quantification algorithms. RMA quantification was performed by the justRMA() function in R [11] using raw data (Affymetrix CEL files).

Gene ranking by Shannon entropy
The use of Shannon entropy [12] to rank genes by their overall tissue specificity here is the same as described in Schug et al. [4]. Consider one gene's expression vector x = (x 1 , x 2 , ..., x N ) for N tissues and an observation x t for tissue t. The entropy of the gene is calculated as where p t is the relative expression of x t for tissue t defined as . H ranges from zero to log 2 (N), with the value 0 for genes expressed in a single tissue (Fig. 1a) and log 2 (N) for genes expressed uniformly in all the interrogated tissues (Fig. 1b).
To equally identify down-and mixed-types of tissue-specific genes as well as up-type genes, we processed the original vector x. The processed observation x t ' for tissue t is defined as where T bw is the one-step Tukey biweight, a popular statistic robust against outliers. It provides as much robustness as a median and is also used to estimate the expression signal from each probe set in the Affymetrix Micro Array Suite (MAS 5.0) software package [7,13]. The parameters for the calculation of T bw are the same as those adopted in the tukey.biweight() function in R package 'affy' (i.e., c = 5, ε = 0.0001) [11]. Our method (ROKU) uses the processed expression vector x' of a gene, while Schug et al. [4]  Effect of different quantification algorithms on gene ranking Figure 4 Effect of different quantification algorithms on gene ranking. MAS5-and RMA-quantified data are compared. The higher the percentage of common probesets between the two, the more rank-invariant property the method has. ROKU gives a more invariant gene ranking than Schug's method.
uses the original vector x, to calculate the gene's entropy (H(x') and H(x)) as a measure of the overall tissue specificity.

Detecting specific tissues as outliers
As mentioned above, the entropy does not indicate which tissues are specific, but is a measure of the overall tissue specificity of a gene. We imagine observations in specific tissues to be easily visualized as outliers on the over-and/ or under-expressed side if any exist. We used an outlierdetection-based method proposed by Kadota et al. [3] to detect tissues with specific expression patterns. According to Kadota et al. [3], the statistic U for identifying outliers is defined as where n and s denote the numbers of non-outlier and outlier candidates, and σ denotes the standard deviation (SD) of the observations of the n non-outlier candidates. The procedure is first, normalize the gene vector x = (x 1 , x 2 , ..., x N ) for N (=n+s) tissues by subtracting the mean and dividing by the SD; second, sort the normalized values (i.e., Z-scores) by order; third, calculate the statistics U for various combinations of outlier candidates starting from both sides of the values; finally, regard tissues corresponding to outliers detected in the combination of the minimum U as 'specific'.