Analyzing miRNA co-expression networks to explore TF-miRNA regulation

Background Current microRNA (miRNA) research in progress has engendered rapid accumulation of expression data evolving from microarray experiments. Such experiments are generally performed over different tissues belonging to a specific species of metazoan. For disease diagnosis, microarray probes are also prepared with tissues taken from similar organs of different candidates of an organism. Expression data of miRNAs are frequently mapped to co-expression networks to study the functions of miRNAs, their regulation on genes and to explore the complex regulatory network that might exist between Transcription Factors (TFs), genes and miRNAs. These directions of research relating miRNAs are still not fully explored, and therefore, construction of reliable and compatible methods for mining miRNA co-expression networks has become an emerging area. This paper introduces a novel method for mining the miRNA co-expression networks in order to obtain co-expressed miRNAs under the hypothesis that these might be regulated by common TFs. Results Three co-expression networks, configured from one patient-specific, one tissue-specific and a stem cell-based miRNA expression data, are studied for analyzing the proposed methodology. A novel compactness measure is introduced. The results establish the statistical significance of the sets of miRNAs evolved and the efficacy of the self-pruning phase employed by the proposed method. All these datasets yield similar network patterns and produce coherent groups of miRNAs. The existence of common TFs, regulating these groups of miRNAs, is empirically tested. The results found are very promising. A novel visual validation method is also proposed that reflects the homogeneity as well as statistical properties of the grouped miRNAs. This visual validation method provides a promising and statistically significant graphical tool for expression analysis. Conclusion A heuristic mining methodology that resembles a clustering motivation is proposed in this paper. However, there remains a basic difference between the mining method and a clustering approach. The heuristic approach can produce priority modules (PM) from an miRNA co-expression network, by employing a self-pruning phase, which are analyzed for statistical and biological significance. The mining algorithm minimizes the space/time complexity of the analysis, and also handles noise in the data. In addition, the mining method reveals promising results in the unsupervised analysis of TF-miRNA regulation.


Introduction
Microarray analysis of microRNAs (miRNAs) is a high throughput method for studying miRNA expression in cultured cells or tissues. The three datasets used in this analysis are the resultant of three different microarray experiments using human (Homo Sapience) miRNAs. We describe the materials and methods employed in the study in the sections ahead.

Materials
Three microarray datasets are experimented in this study. One of them comprises the expression values of schizophrenia patient-specific miRNAs [Perkins et al., 2007] while the others contain expression values of tissue-specific miRNAs [Baskerville, 2005] and stem cell-based expression profiling for different subjects [Laurent et al., 2008]. These three are described hereunder. [Perkins et al., 2007] This dataset contains the expression values of postmortem human brain tissues, obtained from the Harvard Brain Tissue Resource Center, consisting of frozen blocks (300-500 mg/block) from the postmortem prefrontal cortex (Brodmann area nine from 15 individuals with schizophrenia and 21 unaffected comparison subjects). The subjects taken were free of neurodegenerative pathology and the tissues were groupmatched for gender, age, postmortem interval, and hemisphere.

Schizophrenia Dataset
The dataset was prepared through a microarray experiment where oligonucleotide probes were synthesized in duplicate for 264 miRNAs antisense to the mature sequence. The mature sequences were verified from the reported results accumulated in the Sanger miRNA registry. All arrays were prepared from the same batch using the conventional hybridization process. The microarray analysis started with the extraction of data from the GPR files and the data points having foreground values lower than 1.5 times of the local background were eliminated. Following this elimination, the probes from which greater than 40% of the data points were removed, were discarded out of the analysis. This pre-processing reduced the dataset comprising a total of 239 miRNAs. After the background subtraction all the data were log-transformed and the missing values were estimated using the well-known k-NN [Troyanskaya et al., 2001]. The data were normalized using rank invariant normalization for comparison across the samples. The per-sample mean of the two rank invariant normalized probes was used for the analysis. Univariate calculations of differential expression were estimated using two-class (unpaired test) Statistical Analysis of Microarrays through 500 permutations with an FDR of 5%.
Thus, the final dataset includes the expression values received over 36 experiments for 239 miRNAs. The statistical information pertaining the microarray data is -minimum expression value = 6.03, maximum expression value = 15.88, average expression value = 7.27, standard deviation of the expression values over the entire dataset = 1.38.
1.2 Tissue-specific Dataset [Baskerville, 2005] This dataset includes the global expression of miRNAs found in various human tissues. The probes used in the microarray experiment were based on DNA oligonucleotides. Based on the techniques developed earlier to clone and sequence the populations of small RNAs, hybridization and the processing of the arrays were performed. A pilot array was used to explore the hybridization conditions and the array design options during the study. The experimentation achieved expression profiling of 175 miRNAs taken from 24 different human organs and the HeLa S3 cell line. Ongoing research physically identifies some of these miRNAs in brain, lung, liver, spleen, muscle, and bone marrow having highly specific expression in these organs.
There are 43 missing values out of the total 4375 values (∼ 0.98%) in the complete dataset. These missing values are imputed by the BPCA method [Brock et al., 2008]. The statistical information pertaining the microarray data is -minimum expression value = 0.45, maximum expression value = 7095.17, average expression value = 32.33, standard deviation of the expression value over the entire dataset considered = 221.28. These are after the estimation of the missing values. [Laurent et al., 2008] This microarray-based miRNA expression profiling was performed using a novel modification of the highthroughput gene expression profiling methodology of assays. This combines cDNA-mediated annealing, selection, extension and ligation assay. It applied a solid-phase primer extension (after target hybridization) to enhance the discrimination among homologous miRNA sequences. In addition, PCR with universal primers was used to amplify all targets prior to array hybridization. One specific assay oligonucleotide was designed for each miRNA, consisting of three parts: at the 5' end was a universal PCR priming site; in the middle was an address sequence, complementary to a corresponding capture sequence on the array; and at the 3' end was a miRNA-specific sequence. Pooled assay oligonucleotides corresponding to the human miRNAs are first annealed to cDNA. An allele-specific primer extension step is then carried out; the assay oligonucleotides are extended only if their 3' bases are complementary to their cognate sequence in the cDNA template. The extended products are then amplified by PCR using common primers, of which one is fluorescently labeled, and hybridized to a microarray bearing the complementary address sequences.

Stem Cell Dataset
A subset of this dataset is used in this study. The extracted dataset includes 130 columns over the 439 human miRNAs without any missing values. In this expression data, minimum expression value = 0, maximum expression value = 29632.83, average expression value = 982.18 and standard deviation of the expression values over the entire dataset = 1067.54.

Methods
The methods employed in the study are described in detail in the following subsections.

Proximity Measure
The zero mean and unit normalization of an expression vector (a row vector) E = [E 1 E 2 . . . E n ] of dimension n is computed as, Note that, the normalization done using Eqn.
(1) turns the expression vector E into a unit vector E (||E|| = 1). Further applying squared Euclidean measure on this normalized data maps the co-expression values within [−1, 1]. But, with this zero mean unit normalization the squared Euclidean distance metric coincides with the Pearson correlation coefficient. As the study intends to prepare a co-expression network, a correlation measure will not therefore fit here as appropriate. Moreover, the study is based on a Fuzzy Complete Graph (FCG) that reflects the degree of fuzzy membership values (similarity) between the vertices. Thus, the necessity of a novel proximity measure becomes apparent here.

Sensitivity Analysis
The assignment of optimal association density threshold (δ) value and the density decay constant (ξ) play an important role in the selection of significant module by the proposed mining methodology. This parameter, not tuned properly might cause the inclusion of irrelevant miRNAs in the significant module selected or might disrupt the comprehensiveness of this significant module. Naturally, the sensitivity of parameters in an unsupervised method is guided by the data itself. With this prior knowledge, we have observed the distribution of fuzzy membership values within an FCG to get an indication of threshold of the association density. Two kinds of histogram plots have been prepared for both the FCGs derived from the datasets. The first one is shown in the main paper whereas the other is shown in Fig. S1.
In the first kind of histogram (see main paper), the average fuzzy membership values of all the miRNAs are computed with respect to the other miRNAs and its histogram is prepared. These are column-specific values. Whereas, in the other one ( Fig. S1), for each experiment included in the dataset, the histogram of the fuzzy membership values of all the miRNAs with respect to the others are combined in a single plot. This highlights the distribution of fuzzy membership values of all the miRNA pairs over the entire dataset. From both of these, we can follow that for higher δ values we get higher associativity within an miRNA module. On observing Fig. S1, one can easily note the long tail of miRNA connectivity for lower δ values. By the careful judgement of these plots, we selected the Association Density threshold to be 0.95, 0.99 and 0.93 for the patient-specific, tissue-specific and stem cell datasets respectively. To care about the noise that might be present in the data, the parameters have been tuned to slightly higher values from that of realizable by perception from the plots.

Silhouette Index Measure for Priority Modules
For quantitatively testing the significance of the priority modules (PM s) found by the proposed method (for various δ values) and some of the well-known clustering approaches, the Silhouette index [Rousseeuw, 1987], a well-known cluster validity index, has been computed. The Silhouette index can measure the degree of compactness within a single PM and its separation from the set of other PM s (assuming it to be the second cluster) [Bandyopadhyay et al., 2008]. The distance metric used is the squared Euclidean distance. The intercluster separation is calculated as, (2) Again, the intracluster variance is calculated as, Using Eqn.
(2) and Eqn. (3), the average Silhouette width of the cluster C, SI C/V , with respect to the complete reference set V , is calculated as, The value of SI C/V ranges within [-1,+1], with higher values indicating better clustering result.

Validation of Pruning Phase
A clique is defined to be a complete subgraph of a graph. The vertices within a clique is therefore fully associative with each other. So, by exploring the largest clique sizes we have tried to verify the comprehensiveness of the size of the significant modules selected from the patient-specific and tissue specific datasets by the proposed method. For this purpose, the two FCGs, derived from the two datasets, have been mapped initially into two unweighted graphs. This is done by removing those edges having fuzzy membership values less than 0.95, 0.99 and 0.93 from the complete graphs for the three datasets respectively. Following an upper bound of a clique given in [Amin et al., 1972], we have computed the upper bound of clique sizes of the unweighted graphs derived from the FCGs. According to this work [Amin et al., 1972], the clique size γ(G) of a graph G is upper bounded by the following relation, In Eqn. (5), N ≤−1 denotes the number of eigenvalues, of the adjacency matrix of G, which are no larger than −1.

Results on Gene Expression Data
For verifying the scalability of the proposed mining method, we have applied it on a larger gene expression dataset. The dataset contains 6167 yeast genes and their expression profiles over a total of 52 time points of DNA damage. The missing values of the dataset were imputed with BPCA [Brock et al., 2008] and the final FCG is prepared by computing the co-expression values. The parameters of the mining algorithm were set to δ = 0.6 and ξ = 0.01. The distribution of the sizes of the PM s found by the proposed heuristic mining method is shown in Fig. S2(II). As for the miRNA co-expression networks, here too we found that the statistical coherence within these modules are very high. The expression profile plot of the selected significant 216 genes and the background set of genes are shown in Fig. S2(I). We found a total of 25 PM s, with a few of these being small in size. Specially, the first few are very small modules but they have very high coherence which decreases in the later ones.

Visual Validation
Expression profile plot is a well-known tool for the visualization of expression data [Eisen et al., 1998]. It shows the plots of the degree of expression values of all the expression vectors in combination. Thus, a compact expression profile plot (set of the expression plots spanning over a compact band) represents a coherent module. To highlight the decreasing coherence the PM s, the expression profile plots of these clusters are shown in Fig. S3 and Fig. S4.
Here, a novel visual validation tool, namely the quartile deviation plot (QDP), has been used to illustrate the effectiveness of the pruning method. This tool incorporates many statistical information in the plot, not only representing the level of expression values. A QDP is a combination of some boxplots. It prepares the boxplot of the expression values over each experiment. Such a columnwise boxplot of the entire dataset reflects the lower quartile, median, and upper quartile values of an experimental column. These boxplots of all the experiments (be it patient-specific or tissue-specific) are then combined to prepare the final plot. The maximum whisker length, in the units of interquartile range, is taken to be 1.5, which is a default one [McGill et al., 1978]. The QDPs of the priority clusters are shown in the main paper.

Co-occurrence p-value Computation
To validate a mining solution in the form of a set of PM s, it is worth exploring the probability of receiving the result by chance. So, the computation of p-value is of importance here. Earlier, for similar cases a randomization model that preserves the degree distribution in a network by edge-swapping has been effectively used [Shalgi et al., 2007]. We construct here a cluster matrix, of size n × k, where n denotes the number of miRNAs pruned from the complete set and k is the number of PM s. The element in the position (i, j) is "1" if miRNA i is found in PM j, otherwise "0". One such example of a cluster matrix is shown in Table S1.  39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99  100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99  100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129  Now, the values of the matrix are altered repeatedly to make it random. In this randomization subroutine, the values are actually swapped randomly between two miRNA pairs to keep the distribution of the values same in the matrix. After preparing the final random matrix, the co-occurrence of the miRNA pairs are computed in the matrix. Based on this randomized result and the co-occurrence count of the original matrix, the p-value is computed. Suppose, the co-occurrence of miRNAs are found n times out of a total N cases of the cluster matrix and computing in the randomized matrix, the count is found to be e times out of a total N cases. Then, the hypergeometric p-value is computed as, A poor p-value computed from a solution establishes the truth of not receiving the result by chance.

Transcription Analysis
For the transcription regulatory analysis within the PM s found by the proposed method, the well-known UCSC Table browser [Karolchik et al., 2004] is used. This manually updated database contains many established results in tabular format. The browser helps to retrieve the data associated with a track in text format and to calculate the intersections between various tracks. We have mainly used two tables. The first one accumulates the locations of the human miRNAs in the chromosomes including the information on start sites and end sites. The table is downloadable from the UCSC After accumulating the data on the locations of miRNAs for both the datasets, we have searched for the transcription factors (TFs) binding in the 10kb upstream region of the individual miRNAs. The complete table has been downloaded from the UCSC Table browser [Karolchik et al., 2004]  This track shows microRNA target sites in the defined regions. Now, by studying the retrieved tables, we accumulate the results finally into some tables. These tables show the list of TFs which are found in the 5 ′ untranscribed region (UR) of more than one miRNAs common to a single PM. Thus, by unsupervised means we can find out common transcription targets of miRNAs which helps to reconstruct the regulatory network between miRNAs.