GFS: fuzzy preprocessing for effective gene expression analysis

Background Gene expression data produced on high-throughput platforms such as microarrays is susceptible to much variation that obscures useful biological information. Therefore, preprocessing data with a suitable normalization method is necessary, and has a direct and massive impact on the quality of downstream data analysis. However, it is known that standard normalization methods perform poorly, specially in the presence of substantial batch effects and heterogeneity in gene expression data. Results We present Gene Fuzzy Score (GFS), a simple preprocessing technique, that is able to largely reduce obscuring variation while retaining useful biological information. Using four sets of publicly available datasets containing batch effects and heterogeneity, we compare GFS with three standard normalization techniques as well as raw gene expression. Each method is evaluated with respect to the quality, consistency, and biological coherence of its processed output. It is found that GFS outperforms other transformation techniques in all three aspects. Conclusion Our approach to preprocessing is a stronger alternative to popular normalization techniques. We demonstrate that it achieves the essential goal of preprocessing – it is effective at making expression values from multiple samples comparable, even when they are from separate platforms, in independent batches, or belong to a heterogeneous phenotype.


Introduction
Gene expression profiling experiments and analysis are often designed with the objective of verifying one or more hypotheses that can help in building effective diagnostic or prognostic models in clinical settings. Typically, expression data are collected from groups manifesting differences in certain properties of interest, such as disease types or states, developmental stages, and response to specific treatments or interventions over time. The collected data are then mined for appropriate variation patterns relevant to the hypotheses under consideration. The underlying assumption in such studies is that the input gene expression values from different samples accurately reflect the amounts of RNA produced by the corresponding genes and, thus, are properly comparable. However, in practice, unless an effective normalization technique is applied to preprocess the expression data, a number of factors may lead to the violation of this assumption [1,2].
Firstly, the entire technical process of isolation and quantification of RNA leading up to the final measurements is unlikely to be completely error-free, as inaccuracies may insinuate any of the steps in the long procedure. Secondly, with change in time, place, and other variables in experimental settings, systematic biases of non-biological origins invariably enter during measurement experiments in the form of batch effects. When such biases are correlated with the biological properties under investigation, they can severely confound interesting variation [3]. Thirdly, differences in experimental settings may also introduce changes in local environments of cells, thus inducing fluctuations in gene expression that further contribute to noise in the measurement data [4].
All these factors together make it improbable for multiple samples to naturally have comparable expression values. Therefore, we rely heavily on the capabilities of a preprocessing method to recover meaningful biological information, and remove or account for noise in the form of obscuring variation. Yet, it was reported [2] that popular normalization techniques are not very successful in discriminating between real and obscuring variation to produce quality input for downstream gene expression analysis. In fact, it was noted by Luo et al. [2] that preprocessing using common methods led to reduction in the quality of subsequent predictive models in upto 25% of the cases.
To mitigate the performance issues commonly presented by preprocessing techniques, we propose Gene Fuzzy Score (GFS), a transformation method that uses fuzzy scores derived from rank values of gene expression within individual samples. We chose four different sets of gene expression data containing substantial batch effects and heterogeneity for the analysis. On these datasets, we compared the performance of GFS and other popular preprocessing methods with respect to the quality, consistency, and biological coherence of their processed output.

Background
Preprocessing techniques typically attempt to make expression values from multiple samples comparable in two different ways: 1. by scaling expression values such that each sample has an equal value for a statistic such as mean or median; or 2. by adjusting expression values such that each sample has the same expression distribution across genes.
The first approach includes methods such as mean and median scaling, and is popular for Affymetrix genechips. For example, in the mean scaling method, the mean gene expression value of each microarray in the sample is first calculated, and a grand mean is then computed as the mean of all means. Finally, expression value of each microarray in the sample is scaled such that the mean expression of each microarray is equal to the grand mean. Median scaling also follows the same procedure, with the mean statistic being replaced by median. While these methods are simple to implement, they assume that expression values of all samples share a linear relationship. They -especially mean scaling -also suffer from a few other drawbacks such as sensitivity to outlier distortions [5].
The second approach includes more sophisticated methods such as z-score and quantile normalization. In z-score normalization, the expression values of genes in each microarray are transformed to fit the standard normal distribution with a mean of zero and 1 unit standard deviation. On the other hand, quantile normalization uses the rank values of gene expression within individual microarrays to make the distribution of all microarrays identical in statistical properties. Since ranks are known to be relatively more robust to batch effects than absolute expression values [1], this is expected to lead to better performance on datasets with batch effects. In the quantile normalization procedure, the expression values of each microarray are first sorted in ascending order, and the mean expression corresponding to each rank across microarrays is stored separately. Following this, the original expression values in each microarray are assigned ranks based on their relative quantitative order. Finally, a transformed matrix is obtained by replacing each gene rank value by the mean expression value corresponding to that rank as stored earlier.
The z-score and quantile normalization methods are relatively more robust to outliers, provided that the number of microarrays in a dataset is sufficiently large. However, the actual distributions of underlying data are assumed to be identical in all samples, and specifically assumed to be Gaussian in case of z-score normalization. This assumption is especially likely to break down in datasets with disease state samples where the regular functions of the genes and their synchronization with each other may be substantially disrupted. In such cases, the expression patterns within a disease sample may not be identical to samples of the normal phenotype. It also may not be identical to other disease samples if the disease is heterogeneous and is able to manifest itself through the exploitation and/or breaking of multiple mechanisms.
It is also commonly observed that low-expression genes and proteins exhibit a much greater coefficient of variance than highly expressed ones in their expression levels (see figure 2E in the work by Goh et al. [6]). Thus, the expression rank of low-expression genes is highly unstable. This may adversely affect the performance of a ranking-based normalization method such as quantile normalization.
Therefore, we are inspired to present GFS as a preprocessing technique for gene expression. Like quantile normalization, our method also makes use of gene expression ranks instead of absolute values, thus earning more robustness to batch effects. However, unlike the above techniques, we do not make any assumptions on the simi larity of distribution or the equality of any mean-, medianlike statistic across samples. Moreover, in our method, we fuzzify the expression ranks such that irrelevant fluctuations introduced by minor differences in ranks are alleviated, and noise from low-ranked genes is discarded.
The idea of fuzzification has also been used earlier in a few gene expression profile analysis methods [7,8] and also proteomic profile analysis methods [6,9]. However, these works merely use it as a component of their respective methods, and do not study its role and effectiveness as a normalization procedure.
A single gene expression matrix was produced by merging the two DMD datasets from Haslett et al. [10] and Pescatori et al. [11]. Similarly, data were merged from Armstrong et al. [12] and Golub et al. (Leukemia) [13], as also from Yeoh et al. [14] and Ross et al. (ALL subtypes) [15].
Note that each of the first three pairs of the chosen datasets (as in Table 1) are independent and were produced on different microarray platforms. Thus, the merged gene expression matrices obtained from them contain batch effects by default. We consider only genes that are common in the two samples of the dataset pair, and run all the four preprocessing techniques -GFS, mean scaling, z-score normalization, and quantile normalization -on these input matrices, and evaluate their effectiveness in dealing with batch effects. To observe the effect of preprocessing on highly heterogenous data, we also use another more heterogeneous dataset from Yeoh et al. [14] that has 9 disease subtypes (ALL) and normal patient samples to compare the selected methods. Thus, in total, four sets of input gene expression matrices belonging to three different disease types are used in our analysis.

Approach
In GFS, we transform a raw gene expression matrix by making use of the rank values of genes within each microarray, rather than by using their absolute expression values. Further, we use two quantile thresholds -θ 1 and θ 2 -to assign a fuzzified score to each gene in each patient. Ranks below θ 2 in a sample are all reduced to a score of zero, those above θ 1 are given a score of 1, and intermediate ranks are interpolated to obtain a score between 0 and 1. In particular, let r(g i , p j ) be the rank of gene expression of a gene g i in patient p j , and q(p j , θ) be the rank corresponding to the upper θth quantile of gene expression in patient p j . Then, the gene fuzzy score s(g i , p j ) assigned to a gene g i in patient p j is given by the following function: Apart from the use of rank values in computing transformed scores, GFS also benefits from the fact that it allows for selection of quantile thresholds such that noise from low-ranked genes is safely removed by assigning a score of 0, while genes with very high expression are all treated equally with a score of 1. For the purpose of uniformity in comparison, we fix θ 1 to 5% and θ 2 to 15% for all GFS runs mentioned in this paper. However, using a θ 1 value between 5 to 10% and θ 2 value between 15 to 20% also leads to similar results.
In evaluating the proposed approach against other normalization techniques discussed earlier, we focus on three salient questions in this paper: 1. Does the preprocessing technique produce consistent results across different datasets, provided that they have the same composition of different phenotypes? 2. What is the quality of the output produced by the processing technique? How well does the processing retain useful information while mitigating obscuring effects? 3. Is the output produced by the technique biologically coherent? We compared GFS with three standard normalization methods described in the previous section -mean scaling, z-score normalization, and quantile normalization. The description of our design and approach to each experiment is given in the next section.

Visualizing data after PCA transformation
We preprocess the raw gene expression matrices with each of the four methods -mean scaling, z-score normalization, quantile normalization and GFS. For each method, we select the top 15% genes with maximum variance in the processed matrix, as these are most likely to be the genes contributing to interesting variation. We then reduce the processed matrix to include only these high variance genes, and apply PCA transformation on the reduced matrix. A scatter plot of the coordinates corresponding to the first two principal components (PC1 and PC2) of each sample is visualized.
A good preprocessing method is expected to show a clear clustering of samples of the same phenotype, and separation between samples of different phenotypes. Moreover, the quality of clustering would ideally not be adversely affected by the presence of samples from multiple batches in the data.
Observations: While in the Leukemia, DMD, and childhood ALL datasets, samples from different batches are clearly separated, GFS (Fig. 5) shows the best phenotypewise clustering of samples among all preprocessing techniques. Mean scaling (Fig. 2) does not perform well on any of the datasets, and in some cases, obscures the separation seen even in raw gene expression (Fig. 1). This degredation in peformance is in line with previous findings [2]. Z-score normalization shows good performance on DMD and Leukemia (Fig. 3) datasets, and quantile normalization performs well only on the DMD dataset (Fig. 4).
In case of the more heterogeneous ALL dataset (9 disease subtypes and normal samples), GFS is the only method to discriminate between samples of the different ALL subtypes (Figs. 1, 2, 3, 4, 5 (a)).
From the PCA scatterplots for all the three datasets with batch effects (Leukemia, DMD, and ALL with 2 subtypes), we observed that samples from two batches are always clearly separated along PC1. This implies that the first principal component is highly enriched in batch effects. Therefore, we exclude the first principal component (PC1), and draw scatterplots corresponding to the second and third principal component (PC2, PC3). In PC2 vs PC3 scatterplots, there is very less separation between samples from different batches but belonging to the same phenotype, as compared to that in PC1 vs PC2 scatterplots (Figs. 1, 2, 3, 4, 5). This trend is consistent across all three datasets with batch effects. Thus, removing PC1 can be an effective technique to reduce batch effects in gene expression data to a great extent. However, for the more heterogeneous ALL dataset where batch effects are absent, removing PC1 results in loss of important variation information, and subsequently, less clearer separation between different phenotypes.

Comparing processing quality
Quality of a preprocessing method is determined by its ability to separate interesting from obscuring variation. An inferior preprocessing method will lead to an output in which expression variation across microarrays would be confounded with irrelevant information. In contrast, expression variation across microarrays in the output of an ideal preprocessing method will correspond to interesting biological variation alone.
Experiment: We estimate the quality of preprocessing methods with respect to the capability of their transformed output to separate samples of different phenotypes. In particular, we randomly select 15% of the genes, reduce the processed matrix to include the selected genes, and apply PCA on the resultant matrix. The PCA co-ordinates of all samples are then used to compute a clustering performance metric called the silhouette score. The silhouette score is calculated based on the mean intra-cluster distance a and the mean nearest-cluster distance b for each sample, as (b − a)/max(a, b) [16]. The score ranges from -1 to 1. In general, a higher silhouette score indicates a better clustering.
For the ALL dataset with 9 subtypes, co-ordinates corresponding to the first three principal components are used, while for the other three datasets with batch effects, co-ordinates corresponding to only the second and third principal components are used. This is repeated over 1000 iterations, and the distribution of silhouette scores corresponding to each preprocessing method is used to infer the quality of clusters formed by its transformed output.
Observations: For all the four datasets, the distribution of silhouette scores obtained using randomly chosen 15% genes is stable at a higher value in case of GFS, in comparison to other preprocessing methods (see Fig. 6). This shows that the assigned scores to each microarray-gene pair after GFS preprocessing are more relevant to the interesting variation in gene expression and thus, even randomly chosen features are better able to capture the phenotype-based clusters. Moreover, the reference silhouette scores obtained from the top 15% variance genes in GFS processed matrices are consistently higher than the 75th percentile score of its null distribution obtained from random 15% genes, across all datasets (Fig. 6). For quantile normalization, while the silhouette scores obtained from its top 15% variance genes are also consistently higher than the 75th percentile score of the corresponding null distribution, these observed silhouette scores are consistently lower than those for GFS. On the other hand, the silhouette scores derived using the top 15% variance genes in z-score normalized and raw expression are lower than the 75th percentile score of their corresponding null distributions in the DMD dataset and ALL dataset with 2 subtypes. The silhouette score computed on top 15% variance genes in scaled expression data is lower than the median score of its null distribution in all datasets. This shows GFS-processed expression values are more effective than the other methods.
The silhouette scores obtained from the PCA transformed co-ordinates of samples using the top 15% highvariance genes are recorded in Tables 2 and 3. In all datasets, with and without the first principal component (which is often the richest in batch effects), GFS is seen to have a better score relative to other processing methods. Also, in the three datasets with batch effects, removing PC1 improves phenotype-wise clustering, while in the heterogeneous ALL dataset with no batch effects, removing PC1 leads to discarding important variation and thus a reduction in clustering performance.

Comparing consistency
It is important that a reliable preprocessing method produces an output that remains consistent in multiple runs over datasets of the same type. For instance, if two datasets of the same disease are transformed by a preprocessing method, and the genes indicated to have the highest contribution to interesting variation have very little overlap, it is natural to infer that the variation is confounded by noise and the genes are likely to be false positives. In contrast, consistency in such output affirms that the preprocessing method is indeed reliable, since similarity in input ensures similarity in output. Thus, a preprocessing technique assigning meaningfully transformed expression values should indicate a consistent set of high-variance genes, when applied to different datasets with the same phenotype distribution. Experiment: In order to evaluate the consistency of different preprocessing methods, we split each dataset into two datasets such that each contains the same number of samples of each phenotype, independently apply the preprocessing technique on the resultant split data, and obtain the two resulting lists of the top 15% high-variance genes from the splits. Further, we apply PCA to the normalized data, and remove genes that have a coefficient of zero in all of the first three principal components for the ALL dataset with 9 disease subtypes. For the other three batch effects-ridden datasets, we only remove genes that have a coefficient of zero in the second and third principal component. This process is repeated 100 times using different splits of each dataset. We then examine the distribution of similarity (measured in terms of the jaccard coefficient) between the two gene lists.
Observations: A consistent preprocessing technique would be expected to demonstrate a high overlap in highvariance genes. It is seen that the distribution of jaccard coefficient when the split datasets are processed using GFS, is stable at an equal or higher value in all the datasets (Fig. 7). The other methods fluctuate in performance and, in some cases, show worse consistency than raw gene expression.

Comparing biological coherence
For a phenotype to manifest, the causal genes often co-ordinate with other genes, and seldom act alone. Therefore, genes contributing to interesting variation in data are more likely to be connected to each other in biological pathways. Thus, we expect that a more biologically coherent preprocessing technique will result in high-variance genes that induce significantly more and/or bigger subnetworks on known biological pathways.
Experiment: We assess the biological coherence of the preprocessing methods by examining the subnetwork size distribution obtained when high-variance genes are used to induce subnetworks on pathways. The subnetwork size distribution for each processing method is obtained as follows: 5. Using genes in step 4, induce subnetworks on known pathways from the PathwayAPI database [17] and store the subnetwork size distribution.
To generate the null model, step 2 is replaced with randomly selecting 15% of all genes, and steps 1-5 are repeated over 1000 iterations. Finally, for each subnetwork size, a p-value is calculated as the proportion of subnetwork frequencies in the null model Fig. 6 Null distributions of silhouette scores obtained with raw and processed expression matrices taking 15% random genes as features (the three dashed lines show 25th quartile, median and 75th quartile, while the red dot indicates the score obtained from top 15% variance genes). a ALL (9 Subtypes). b ALL (2 Subtypes). c DMD. d Leukemia found to be greater than the frequency from original distribution.
The same analysis is repeated for the three datasets with batch effects by modifying step 4 to include only those genes that have a non-zero coefficient in the second or third principal component.
Observations: The distribution of subnetwork sizes induced by the top 15% variance genes are shown in Fig. 8 (using the first three principal components) and Fig. 9 (using PC2 and PC3 only). The figures show the actual subnetwork count distribution across different subnetwork sizes, while the inset figures show the corresponding percentage frequencies. In the Leukemia dataset and ALL dataset with 2 subtypes, GFS has the highest percentage frequency of subnetworks of size greater than or equal to 5 and, in most datasets, GFS induces more subnetworks overall.
From the low p-values in Tables 4, 5, 6, 7, we observe that the significance of frequencies is high for subnetworks induced by GFS, regardless of their size. Further, comparison with other methods shows that the frequency of subnetworks induced by high-variance genes in GFS-processed datasets is much more significant than those induced on datasets processed with other methods and raw gene expression. Hence, we infer that GFS-transformed output is highly biologically coherent. Moreover, we observe that on excluding the batch effectsenriched PC1 from the analysis, the p-values corresponding to larger subnetwork sizes are lower than those of smaller sizes, indicating higher significance, and hence  greater biological coherence, of the large subnetwork sizes.

Effect of sample size on performance of GFS
To examine the effect of sample size on GFS, we randomly selected samples of the size of 0.25, 0.50, 0.75 times the original sample size over 100 iterations. We then noted the range of silhouette scores obtained from the iterations for each sample size. (For the heterogeneous ALL dataset, the first three PCs were used to calculate the silhouette scores, while for the other datasets, only the second and third PCs were used). As expected, Fig. 10 shows that the clustering performance improves with increase in sample size. Interestingly, the boxplots in Fig. 10, interpreted together with Tables 2 and 3, also indicate that the median performance of GFS when provided with even 0.25 times of the entire sample size is still comparable with, and often better than, that of other normalization methods when they are supplied with the entire sample size.

Conclusion
An effective preprocessing technique is expected to transform the gene expression matrix such that data of the same phenotype from different sources is made similar. This can be achieved by removing or accounting for obscuring noise in gene expression measurement, and retaining interesting variation relevant to properties of biological interest. Such a processing is essential to ensure reliable downstream analysis of gene expression data. However, popular normalization techniques do not necessarily improve the quality of expression data, and sometimes even exacerbate the issue by mistaking real variation for noise and discarding it.    ------------21  ------------1     We discussed a new approach, Gene Fuzzy Score, to address this issue and compared it with other popular preprocessing methods with respect to three important criteria. First, we assessed the capability of the transformed output of each technique to resolve differences in phenotypes within the dataset. Secondly, we estimated the consistency of their output when presented with different datasets with the same phenotype distribution. Finally, we analysed the distributions of size of subnetworks induced by genes indicated to be sources of interesting variation in each processed expression matrix. In each of these aspects, GFS was successful in improving the transformation outcome, proving its applicability in datasets with batch effects and heterogeneity. Moreover, the performance of GFS improves with increase in sample size.
A recurring observation from our experiments is that in datasets with significant batch effects, the batch effects are generally captured by the first principal component in PCA. Thus, applying a PCA transformation and excluding the first principal component from subsequent analysis leads to significant reduction in batch effects in any dataset, and improves the performance of all preprocessing techniques. Further, we note that GFS outperforms other methods irrespective of whether this additional step is implemented.
Another merit of GFS is the interpretability of its transformed outcome. A biologist may quickly understand how highly the gene is ranked in a particular patient. For b, when a gene has a GFS score of 0.5 in a patient, it means the gene is in the top 10% most highly expressed genes in that patient (assuming θ 1 and θ 2 are set at 5 and 15% respectively). Thus, apart from being a robust and effective preprocessing technique, GFS is also easily interpretable.
While we evaluated GFS only on microarray gene expression, it is conceivable that the method may be applied to data obtained from other high-throughout technologies such as RNA-seq and SWATH proteomics. Exploring this possibility remains the subject of our future work.