BMC Bioinformatics (BioMed Central), Methodology article
Construction and use of gene expression covariation matrix

Background
One essential step in the massive analysis of transcriptomic profiles is the calculation of the correlation coefficient, a value used to select pairs of genes with similar or inverse transcriptional profiles across a large fraction of the biological conditions examined. Until now, the choice between the two available methods for calculating the coefficient has been dictated mainly by technological considerations. Specifically, in analyses based on double-channel techniques, researchers have been required to use covariation correlation, i.e. the correlation between gene expression changes measured between several pairs of biological conditions, expressed for example as fold-change. In contrast, in analyses based on single-channel techniques, scientists have been restricted to the use of coexpression correlation, i.e. the correlation between gene expression levels. To our knowledge, nobody has ever examined the possible benefits of using covariation instead of coexpression in massive analyses of single-channel microarray results.

Results
We describe here how single-channel techniques can be treated like double-channel techniques and used to generate both gene expression changes and covariation measures. We also present a new method that allows the calculation of both positive and negative correlation coefficients between genes. First, we perform systematic comparisons between two given biological conditions and classify, for each comparison, genes as increased (I), decreased (D), or not changed (N). As a result, the original series of n gene expression level measures assigned to each gene is replaced by an ordered string of n(n-1)/2 symbols, e.g. IDDNNIDID....DNNNNNNID, the length of the string corresponding to the number of comparisons. In a second step, positive and negative covariation matrices (CVM) are constructed by calculating statistically significant positive or negative correlation scores for any pair of genes by comparing their strings of symbols.
Conclusion
This new method, applied to four different large data sets, has allowed us to construct distinct covariation matrices with similar properties. We have also developed a technique to translate these covariation networks into graphical 3D representations, and found that the local assignation of the probe sets was conserved across the four chip set models used, which encompass three different species (humans, mice, and rats). The application of adapted clustering methods succeeded in delineating six conserved functional regions that we characterized using Gene Ontology information.
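The string-comparison step summarized in the Results can be illustrated with a toy sketch. The function below merely counts concordant (I/I or D/D) and discordant (I/D or D/I) comparisons between two probe sets; the article's actual correlation scores are statistically derived significance values, so this is an illustration of the principle only, and the function name is hypothetical.

```python
def covariation_counts(s1, s2):
    """Count concordant (I/I or D/D) and discordant (I/D or D/I)
    comparisons between two probe sets' variation strings, where each
    symbol ('I', 'D' or 'N') records one pairwise comparison of
    biological conditions."""
    if len(s1) != len(s2):
        raise ValueError("strings must cover the same comparisons")
    pos = sum(1 for a, b in zip(s1, s2) if a == b and a in "ID")
    neg = sum(1 for a, b in zip(s1, s2) if {a, b} == {"I", "D"})
    return pos, neg

# Two probe sets observed over ten comparisons:
pos, neg = covariation_counts("IDDNNIDIDN", "IDINNDDDIN")
```

A statistically significant excess of concordant symbols would contribute an entry to the positive CVM, and an excess of discordant symbols an entry to the negative CVM.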


-SELECTION OF CHIPSET MODELS AND DOWNLOADING OF DATA
The method that we used to calculate CVMs relies on the ability to perform systematic comparisons between any two biological conditions present in sets of experiments coming from different laboratories. For this reason, not all technologies are suitable for this approach, which can only be applied if the signals from two different chips are directly comparable. Single-channel technologies allow such direct comparisons in two possible ways. First, the in situ probe synthesis technique used, for example, by Affymetrix [1] or NimbleGen [2] results in a nearly constant probe density across the surface of a chip. Second, density variations, which are inherent to the process of probe spotting, can be corrected using internal controls, as is done with the Applied Biosystems technology. Other technologies based on spotting, e.g. cDNA or oligonucleotide spotted arrays, must use double-channel hybridization schemes with external controls to compensate for surface dissimilarities in probe density; because of this, they do not deliver reliable signal values that can be directly compared between different chips. Another important characteristic for our approach is the availability of datasets obtained over a large panel of experimental conditions with the same standardized collection of probes. These two criteria, plus others described in the Introduction, led us to restrict our analysis to datasets originating from Affymetrix platforms, and to consider four chipset models for which the number of biological conditions studied was greater than 100 as of October 2004 (see supplemental file exp-biol-pointinfo.ods). We organized the downloaded data into a structure in which the samples are grouped according to the biological conditions and experiments. With this nomenclature, experiments and samples correspond, respectively, to GEO Data Sets (GDS) and GEO Samples (GSM) in GEO.
A given biological condition groups all the replicated experimental points (table 1, "before filtration" section).

-CHARACTERISATION AND VALIDATION OF SAMPLES
A preliminary step in RDAM is the transformation of each signal into a relative rank, i.e. the placing of the signal value within the signal distribution, expressed on a scale of 0-100. This transformation is important because it acts as a normalisation procedure that allows different samples to be compared. However, if some probe sets are missing, this transformation will result in a systematic bias towards higher ranks, which could affect the RDAM analysis. To minimize this possibility, we decided to eliminate all samples having more than 500 missing probe sets. Figure 1 indicates the number of missing probe sets in the U95 dataset. Although no samples were eliminated for this chipset model, we note that experiment 18 had a high number of missing probe sets (441). This experiment was analyzed with dChip, and it is possible that these probe sets were dropped during this step for an unknown reason. For all the other samples, the number of missing probe sets was less than 100 and corresponded to dispensable probe sets such as spikes and controls. All of these missing probe sets had their signals set equal to the minimum of the existing signals. Another important potential source of bias is the truncation of the data, which generally results from the application of a threshold at some step in the data processing. The consequence of this is that numerous different probe sets have identical signal values, causing the true order of the probe sets to be lost and leading to the random assignation of their ranks. To identify this potential problem, we sorted the signal values and searched for the longest run of identical values. Figure 2, which displays the size of the longest run (red line), shows that the entire experiment 8, and around half of experiments 6 and 28, had runs longer than 500 values, which led us to filter out the corresponding samples. In the same figure, we have plotted the total number of different signal values used in the sample (blue line).
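The rank transformation and the two quality checks described above can be sketched as follows. This is a minimal illustration: the 0-100 relative rank, the replacement of missing signals by the minimum, and the longest-run scan are from the text, while the mean-rank handling of ties and the function names are assumptions.

```python
def fill_missing(signals):
    """Set missing signals (None) to the minimum of the existing signals."""
    present = [s for s in signals if s is not None]
    m = min(present)
    return [m if s is None else s for s in signals]

def relative_ranks(signals):
    """Place each signal within its distribution on a 0-100 scale
    (tied values share their mean rank; an assumed convention)."""
    n = len(signals)
    order = sorted(range(n), key=lambda i: signals[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and signals[order[j + 1]] == signals[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0            # 0-based mean rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = 100.0 * mean_rank / (n - 1)
        i = j + 1
    return ranks

def longest_run(signals):
    """Length of the longest run of identical values in the sorted signals,
    used to detect truncated (thresholded) data."""
    s = sorted(signals)
    best = cur = 1
    for prev, val in zip(s, s[1:]):
        cur = cur + 1 if val == prev else 1
        best = max(best, cur)
    return best
```

A sample would then be flagged for removal when `longest_run` exceeds 500, as described above.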
This quantity is related to the numerical precision of the measurement, i.e. the number of significant digits employed. If the numerical precision is high, then all measurements will be different, as is the case, exceptionally, in experiment 18. Conversely, a very low level of numerical precision will result in numerous runs of identical values and will thus drastically decrease the diversity of the signal values; this is exemplified by experiment 6, in which the 12,625 probe sets of the U95 chip A are represented by only about 400 different values. However, it is worth noting that even with such a low level of precision, the effect on the RDAM analysis will be negligible, because the random assignation of rank will only occur over a very small range: the mean number of probe sets having the same signal in this case will be around 30 (for a run of 100 probe sets having exactly the same signal, the rank would vary between r and r ± 0.79, a difference which would only exceptionally be statistically significant). Consequently, we did not use this information to filter out any samples. The downloaded data were initially processed with either MAS4 or MAS5, the two main versions of the Affymetrix analysis suite currently available. With respect to our method, the most important difference between these two versions is the way in which they calculate the single compound signal of a probe set from the several measured probe signals (around 20 per probe set): MAS4 can generate negative numbers, whereas MAS5 only produces positive numbers. As the field describing the processing software is not mandatory in GEO, we searched for the relevant information in the headers of the GSM files, as explained in the Methods section. On the basis of these incomplete pieces of information, we first tentatively assigned a MAS version number to 26 experiments out of 32.
In order to confirm this tentative assignation and to complete the classification, we plotted the minimal signal values (figure 3, blue line). The occurrence of numerous negative values in experiment 21 suggests that its initial classification as MAS5-processed is erroneous. The absence of negative signals in experiment 28, which was classified as MAS4, is explained by the application of a threshold for the first half of the samples. Experiment 13, which was classified as MAS4, was changed to MAS5 because no negative signals were present. Experiments not classified in the first step were designated MAS4 for experiments 9 and 24 and MAS5 for experiments 8, 10, 16 and 26. Experiment 18 was analyzed with dChip and experiment 4 with RMA.
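The confirmation step above reduces to a simple test on the minimal signal value; the sketch below is a hypothetical illustration of that rule (as experiment 28 shows, a thresholded MAS4 experiment can still lack negative signals, so the absence of negatives is only suggestive).

```python
def infer_mas_version(min_signal):
    """Tentatively assign a MAS version from an experiment's minimal
    compound signal: only MAS4 can generate negative values."""
    return "MAS4" if min_signal < 0 else "MAS5 (or thresholded MAS4)"
```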

-SELECTION OF REPLICATES
We performed the multiple biological condition comparisons by applying the Rank Difference Analysis of Microarray (RDAM) method [3,4]. As RDAM can take advantage of duplicated measurements to evaluate the noise distribution and to improve statistical power, we did not consider biological conditions represented by only one sample, and we selected exactly two replicates for all of the other biological conditions. We calculated the distance between any two samples (see Methods) and constructed a dendrogram. This graphical representation of the whole dataset allowed us to identify the two most similar replicates for each condition, to filter out questionable samples, and to visualize the quality of the entire dataset. We show in figure 4 some branches of the U95 dendrogram. Panel A displays some samples whose names contain the character @ to indicate that the corresponding biological condition was not replicated and therefore not used (see also sample 27 in panel E). In panel B, the null distance between the pairs of samples (125,137) and (122,124) in experiment 5 means that the same sample was used twice under two different names. We encountered the same situation in panel C, where the same sample was used in two different experiments. Panels A, C and E show that similar biological conditions are grouped. It is worth noting that the samples of panels A and C, both of which pertain to muscle, are grouped separately according to their pathological or physiological origins. Such grouping of similar biological conditions independently of their experimental source is not systematic, and experiments 2 (panel E) and 25 (panel D) show a situation in which the experimental origin had a greater weight than the biological condition in the grouping process.
The samples are named as follows: a descriptive part is constructed by concatenating information recovered from the GDS files (for example, Dendrit_Leish-m is the concatenation of the cell type (dendritic cells) and the name of an infecting pathogen, Leishmania major). This name may be followed by the rank of the replicate (e.g. r001, r002). At the end are indicated the rank of the experiment, the rank of the biological condition, and the rank of the sample, preceded respectively by the letters E, C, and P. Other information and the decision drawn from the dendrogram processing are indicated before this composite name. The condition rank is recalled, followed by a symbol indicating, for example, that the sample is used as a member of the duplicates representing the biological condition (>>>>), or that it has been discarded because of an absence of duplicates (@) or for some other reason, as indicated in the text (>-<). All panels use the same scale to represent distances.
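The selection of the two most similar distinct replicates from inter-sample distances can be sketched as follows. The distance used here (mean absolute rank difference) is a placeholder assumption, as the actual distance is defined in the Methods; a null distance flags the same sample reused under two names, as in panels B and C of figure 4.

```python
from itertools import combinations

def rank_distance(r1, r2):
    """Mean absolute difference between two rank profiles (placeholder metric)."""
    return sum(abs(a - b) for a, b in zip(r1, r2)) / len(r1)

def select_replicates(samples):
    """Pick the two closest *distinct* samples of one biological condition.

    `samples` maps sample name -> rank profile.  A distance of exactly 0
    indicates the same sample reused under two names and is rejected.
    """
    best = None
    for (n1, r1), (n2, r2) in combinations(samples.items(), 2):
        d = rank_distance(r1, r2)
        if d == 0:
            continue  # identical profiles: same sample used twice
        if best is None or d < best[0]:
            best = (d, n1, n2)
    return best

# "a" and "b" are identical files; "a" and "c" are genuine close replicates.
samples = {"a": [0, 10, 20], "b": [0, 10, 20], "c": [1, 11, 22], "d": [50, 60, 70]}
chosen = select_replicates(samples)
```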
In a first step, the two most similar duplicates were selected, on the condition that they were distinct. We then inspected the scatter plots of the ranks of duplicates and found that three classes could easily be defined, as exemplified in figure 5. Most of the duplicates were similar to those shown in panels B and C and were considered adequate. Some duplicates were clearly of poor quality, such as those displayed in panel D, and the corresponding conditions (i.e. the entire experiment 6 of the U95 dataset) were not used. Conversely, some duplicates were abnormally similar, e.g. as shown in panel A, and were discarded as well because they were technical replicates (visible in panel F of figure 4). There were also some intermediate cases between those shown in panels A and B for which it was difficult to determine whether they were technical replicates. These cases were resolved by visual inspection of comparison heat maps (see figures 8 and 10). Following the application of these successive control steps, 101 biological conditions out of 237 were eliminated from the U95 dataset. The discarded conditions corresponded to the following: 2 not measured with chip A, 70 without replicates, 12 with a longest run greater than 500, 11 with the same sample used twice, 4 with poor reproducibility, and 2 with technical replicates. Table 1 summarizes the effects of this first filtration step on the four chipset models used in this study: 18% of the experiments (24/134) and 32% of the biological conditions (302/945) were eliminated.

-RANK ASSIGNATION FOR DATA ANALYZED BY MAS4
The RDAM method was initially applied to data processed by MAS5, and all signals were given a different rank. Unlike with MAS5, however, the algorithm used in the MAS4 version of the Affymetrix software can give negative values when calculating the compound signals of particular probe sets. Because negative values cannot be used as a measure of transcript concentration, it might be tempting to assign a rank of zero to all negative signals. Another possibility would be to rank all values, irrespective of their sign, which would have the advantage of allowing MAS4 and MAS5 data to be treated similarly. We tested both of these ranking schemes on biological conditions analyzed with MAS4, and found that there was little difference between the CVMs constructed from the two. Figure 6 shows that in the CVMs constructed either by using a different rank for each signal value or by setting the rank to zero for negative signals, all probe sets have roughly the same connectivity degree. We thus concluded that the ranking scheme used for MAS5 data could be applied to MAS4 data as well.
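The two candidate ranking schemes compared above can be sketched as follows. The function names are illustrative, and the ranks here are 0-based, whereas the article expresses relative ranks on a 0-100 scale.

```python
def rank_all(signals):
    """Scheme 1: rank every signal value irrespective of its sign
    (0-based ranks), so MAS4 and MAS5 data are treated identically."""
    order = sorted(range(len(signals)), key=lambda i: signals[i])
    ranks = [0] * len(signals)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def rank_clamp_negative(signals):
    """Scheme 2: all negative MAS4 signals are forced to rank zero,
    since negative values cannot measure a transcript concentration."""
    return [0 if s < 0 else r for s, r in zip(signals, rank_all(signals))]
```

As reported above, the CVMs built from the two schemes were found to differ little, so the MAS5 scheme (rank everything) was retained.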

Figure 6 -Similarity in connectivity degree between CVMs constructed using two types of ranking applied to biological conditions analyzed with MAS4 in U95 experiments.
For a particular probe set, the connectivity degree represents its number of neighbours, i.e. the number of probe sets that have a correlation with it. In red are plotted the connectivity degrees between the two CVMs with positively correlated values, and in blue are the connectivity degrees between the two CVMs with negatively correlated values.
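The connectivity degree defined in this legend can be computed directly from the list of significantly correlated probe-set pairs; a minimal sketch, assuming the CVM is stored as a pair list:

```python
def connectivity_degrees(pairs, probe_sets):
    """Number of neighbours of each probe set, i.e. the number of probe
    sets it is correlated with in the covariation matrix."""
    degree = {p: 0 for p in probe_sets}
    for a, b in pairs:
        degree[a] += 1
        degree[b] += 1
    return degree
```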

-MAS4 VS MAS5 COMPARISONS
Another point of concern is the compatibility of MAS4 and MAS5 results. It is worth noting that the samples belonging to the ten biological conditions common to experiments 2 (GSE96) and 25 (GSE803) form two disjoint groups in the dendrogram (figure 4). This absence of reproducibility could reflect either a systematic artefactual bias in one of the experiments or the existence of differences between the two types of results. While the first explanation cannot be ruled out, an analysis of an experiment [5] in which the samples were treated with the two versions of the MAS software demonstrated that the type of result had a predominant effect, as shown in figure 7. It must be stressed that in this particular experiment the samples were scanned using the settings recommended by Affymetrix for the MAS4 version (photomultiplier tube (PMT) set at 100%). The drastic effect observed in figure 7 therefore results from the changes in the algorithm. In addition, we observed that the use of the MAS5 algorithm results in a dendrogram that groups samples belonging to the same biological condition, whereas they are mixed in dendrograms constructed from results analyzed by the MAS4 algorithm. It therefore seems undesirable to mix MAS4 and MAS5 results, because numerous statistically significant variations could be present between identical biological conditions processed by the two methods. It would thus be preferable to consider MAS5 results exclusively.

Figure 7 -Effect of MAS software version.
Samples (GSE558 [6]) are named by concatenation of the biological description, MAS version number, and replicate rank. NTG: non-transgenic; DN: dominant-negative PI3 kinase; CA: constitutively active PI3 kinase. The dendrogram was constructed in the same way as that displayed in figure 4 (see the 'distance between samples' section in the Methods).

-COMPARISONS BETWEEN BIOLOGICAL CONDITIONS
Following the first filtration step, we performed systematic comparisons between every pair of biological conditions, e.g. 136*135/2=9180 comparisons for the U95 chip set model (see column 'first filtration -#biol' of table 1). A graphical representation of these comparisons was used to display their general characteristics. An inspection of panel A of figure 8, which shows the estimated total variation for the increased and decreased probe sets in each comparison, shows that intra-experiment comparisons detected far less variation than those carried out between experiments. This property, which results in the presence of dark squares all along the main diagonal, reflects the tight relationships that exist between the samples from these experiments (see, for example, experiments 3, 4, 6, 9, 12, 13, and 17).

Figure 8 -Systematic comparisons between biological conditions (U95 chip set model).
A-Estimated total variation for the increased and decreased probe sets in each comparison. B-Fraction of probe sets considered significant at the level of 10% FDR. C-Fraction of probe sets considered significant at the level of 1% FDR. D-Excess of one type of variation: in the comparison i vs j, if the estimated total variation for the increased probe sets is larger than the estimated variation for the decreased probe sets, then the ratio increased:decreased is plotted at (i,j) and (j,i) is set to one. Conversely, if the estimated total variation for the increased probe sets is lower than that for the decreased probe sets, then the ratio decreased:increased is plotted at (j,i) and (i,j) is set to one.
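The construction of panel D can be sketched as follows, assuming the increased and decreased total variations are given as upper-triangular matrices (zero variations are not handled in this minimal illustration).

```python
def excess_matrix(inc, dec):
    """Build the panel-D display matrix from upper-triangular matrices of
    estimated total variation for increased (inc) and decreased (dec)
    probe sets; cells that do not receive a ratio are set to one."""
    n = len(inc)
    m = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if inc[i][j] >= dec[i][j]:
                m[i][j] = inc[i][j] / dec[i][j]   # increased variation in excess
            else:
                m[j][i] = dec[i][j] / inc[i][j]   # decreased variation in excess
    return m
```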
An inspection of the comparison heat maps revealed that experiments 19 and 22 were abnormal, which had not been detected in the previous quality filtering steps. In all of the comparisons involving the biological conditions of experiment 19, virtually no probe sets varied at the level of 10% FDR (panel B); this could be partly explained by the poor reproducibility between replicates in this experiment. Conversely, in all the comparisons involving the biological conditions of experiment 22, at least 60% of the probe sets varied. As shown in figure 9, this abnormal outcome is probably due to an error in the assignation of signals to the probe sets, resulting in a randomization of the signal values in the GSM files. It is sometimes difficult to determine by simple visual inspection that two replicates of a given biological condition are technical replicates, because their rank plot is intermediate between categories A and B of figure 5. In such cases, the inspection of comparison heat maps can help to confirm this status. For example, figure 10 shows that in the comparison heat map of chipset MG-U74, experiments 8 (GSE369), 44 (GSE934) and 46 (GSE986) display a systematically high level of variation with respect to all other conditions and should therefore be considered as containing technical replicates. However, experiment 37 (GSE700), which has a similar pattern, does not seem to have technical replicates, as judged from its rank plots. Having selected all the suitable biological conditions, we next performed systematic comparisons between pairs of conditions by applying the RDAM method (see Methods). For example, the 71 biological conditions conserved in the U95 dataset, out of the 237 initially present, gave 71*70/2=2485 comparisons. Figure 11 displays the results of these comparisons for the four chipset models used in this study.
It can be concluded from the evenly distributed values of the fraction of varying probe sets that the two filtration steps we applied were successful in retaining only biological conditions that do not produce abnormal patterns in this type of representation. Table 3 presents some statistical properties of the comparisons performed in the four chipset models used. The first conclusion is that a very high and rather constant mean selection yield is observed: 78% to 89% of the estimated total variation is retained when a selection at 10% FDR is applied. The second conclusion is that the mean total variation is very high as well (17%, 14%, 18% and 16%, i.e. 2150, 3110, 2250 and 1400 varying probe sets per comparison, respectively).

Figure 11 -Graphical display of the total variation present in all the comparisons made in each of the four datasets (HG-U95, HG-U133, MG-U74 and RG-U34).
These plots display only the biological conditions that were analyzed by MAS5 and that passed all the quality filters. No "outlier" pattern is visible, unlike in figures 7 and 10.

-DETECTION OF REDUNDANT BIOLOGICAL CONDITIONS
Another potential source of bias in the construction of CVMs is the presence of multiple experiments containing different numbers of similar biological conditions. If an experiment contains many similar biological conditions, as is the case, for example, with GSE314 (experiment 5 of figure 10, with 16 similar conditions), then most of the probe sets varying in a given comparison between one condition from this experiment and a condition from another experiment will also vary in comparisons between the other similar conditions of the first experiment and the same second condition. For smaller experiments, e.g. GSE363 (experiment 7 of figure 10, with 4 similar conditions), the effect will also be present, but to a smaller extent. It is therefore important to give the same weight to each biological condition and to eliminate redundant conditions. We constructed a distance matrix and defined as redundant those conditions that shared more than 80% identity (see figure 12 and the 'detection of redundant conditions' section of the Methods). After this process, the number of conditions in chipset models HG-U95, HG-U133, MG-U74, and RG-U34 was, respectively, 37, 84, 89, and 41 (versus 71, 126, 205, and 96 at the completion of the quality filtering steps).
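The elimination of redundant conditions can be sketched as follows. The 80% identity threshold is from the text, while the identity measure (fraction of identical calls between two condition profiles) and the greedy keep-first order are assumptions made for this illustration.

```python
def identity(v1, v2):
    """Fraction of comparisons in which two conditions give the same call."""
    return sum(a == b for a, b in zip(v1, v2)) / len(v1)

def drop_redundant(conditions, threshold=0.80):
    """Keep conditions greedily, dropping any that shares more than
    `threshold` identity with an already-kept condition.

    `conditions` is a list of (name, call profile) pairs.
    """
    kept = []
    for name, calls in conditions:
        if all(identity(calls, kc) <= threshold for _, kc in kept):
            kept.append((name, calls))
    return [name for name, _ in kept]

# "c2" duplicates "c1" exactly and is dropped; "c3" is unrelated and kept.
names = drop_redundant([("c1", "IIDDN"), ("c2", "IIDDN"), ("c3", "DDNII")])
```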

Figure 12 -Portion of the dendrogram of HG-U133 biological conditions showing selected nonredundant conditions.
The conditions are named as follows: at the beginning is indicated the rank of the biological condition, e.g. C120, followed by a descriptive part constructed as explained in the legend of figure 4, e.g. Lymphoblast_721_B, and finally by the rank of the experiment, e.g. E002. The decision made in view of the distance matrix is indicated after this composite name, using a symbol showing that the biological condition was either used (<==>) or discarded as redundant (----).