Integration of RNA-Seq data with heterogeneous microarray data for breast cancer profiling

Background Nowadays, many public repositories containing large microarray gene expression datasets are available. However, the problem lies in the fact that microarray technology are less powerful and accurate than more recent Next Generation Sequencing technologies, such as RNA-Seq. In any case, information from microarrays is truthful and robust, thus it can be exploited through the integration of microarray data with RNA-Seq data. Additionally, information extraction and acquisition of large number of samples in RNA-Seq still entails very high costs in terms of time and computational resources.This paper proposes a new model to find the gene signature of breast cancer cell lines through the integration of heterogeneous data from different breast cancer datasets, obtained from microarray and RNA-Seq technologies. Consequently, data integration is expected to provide a more robust statistical significance to the results obtained. Finally, a classification method is proposed in order to test the robustness of the Differentially Expressed Genes when unseen data is presented for diagnosis. Results The proposed data integration allows analyzing gene expression samples coming from different technologies. The most significant genes of the whole integrated data were obtained through the intersection of the three gene sets, corresponding to the identified expressed genes within the microarray data itself, within the RNA-Seq data itself, and within the integrated data from both technologies. This intersection reveals 98 possible technology-independent biomarkers. Two different heterogeneous datasets were distinguished for the classification tasks: a training dataset for gene expression identification and classifier validation, and a test dataset with unseen data for testing the classifier. Both of them achieved great classification accuracies, therefore confirming the validity of the obtained set of genes as possible biomarkers for breast cancer. Through a feature selection process, a final small subset made up by six genes was considered for breast cancer diagnosis. Conclusions This work proposes a novel data integration stage in the traditional gene expression analysis pipeline through the combination of heterogeneous data from microarrays and RNA-Seq technologies. Available samples have been successfully classified using a subset of six genes obtained by a feature selection method. Consequently, a new classification and diagnosis tool was built and its performance was validated using previously unseen samples.


Background
Cancer is the second leading cause of death worldwide, just behind cardiovascular disease. Specifically, breast cancer is one of the five most dangerous cancers in the world, showing a high mortality rate according to World Health Organization (WHO), and being the cancer with the highest impact among the female population [1]. Nowadays, many breast cancer diagnoses are performed when a patient presents several related symptoms, thus increasing the mortality risk. If the cancer has spread, treatment becomes more difficult, and generally the chances of surviving are significantly lower. However, cancers that are diagnosed at an early stage are more likely to be treated successfully. Therefore, it is primordial to find biomarkers that allow an early diagnosis of breast cancer. Two sequencing technologies, microarray and RNA-Seq, have been used for obtaining gene expression. They are briefly described next.

Microarray technology
Microarray has been the main sequencing technology used in the last two decades until the arrival of Next Generation Sequencing techniques. The most extended microarray platforms are Affymetrix [2] and Illumina [3], leading the second one the RNA-Seq sequencing technology market. Nevertheless, there are other very important microarray manufacturers such as Agilent [4], Exiqon [5] or Taqman [6]. A high simultaneous number of genes can be measured at expression level from the use of microarrays. The expression values are achieved by means of microscopic DNA spots attached to a solid surface which have followed a hybridization process. Once this process is completed, it is possible to read the expression values with a laser, and consequently store the quantification levels in a .CEL file [7].

RNA-Seq technology
As a natural evolutionary step in the treatment of biological information from DNA, RNA-Seq is gradually replacing the widespread use of microarrays. Although its application was originally intended for genomic transcription study, it also allows achieving a mapping between the levels of transcription and gene expression [8]. In this sense, its combination with other functional genomics methods allows enhancing the analysis of gene expression. This is achieved through the quantification of the total number of reads that are mapped to each locus in the transcriptome assembly step. RNA-Seq read counts robustness has been validated against predecessor technologies such as microarrays or quantitative polymerase chain reaction (qPCR) [9].

Comparison between both technologies
RNA-Seq offers an important number of advantages over microarrays, although the cost of RNA-Seq experiments is also higher than in microarray technology nowadays: • RNA-Seq allows detecting the variation of a single nucleotide. • RNA-Seq does not require genomic sequence knowledgement. • RNA-Seq provides quantitative expression levels.
• RNA-Seq provides isoform-level expression measurements. • RNA-Seq offers a broader dynamic range than microarrays.
In spite of these advantages, microarrays are still used due to their lower costs. Besides, as microarrays have been used for a longer period, there exist many robust statistical and operational methods for their processing [10][11][12][13][14][15].
There are many significant microarray experiments already available to the research community, and there is also even a high number of microarray datasets that have not been analyzed so far. These datasets might have information that could reveal important facts and candidate biomarkers. In any case, there is no doubt that RNA-Seq is the present technology, but it can also take advantage of the available data from microarray technology. As Nookaew et al. explained [16], there is a high consistency between RNA-Seq and microarray, which encourages to continue using microarray as a versatile tool for gene expression analysis.
The main objective of this work is to find possible breast cancer biomarkers from patient and control samples acquired via NCBI GEO web platform [17]. To this end, an exhaustive search has been done in order to obtain statistically significant samples from both microarray and RNA-Seq series. Two datasets have been considered in this study, one for training and one for testing. The training dataset has been used to extract the Differentially Expressed Genes (DEGs), and to design a classifier. The test dataset has been considered for the assessment of the DEGs selection and classification processes.
In the case of RNA-Seq samples, cqn package [18] has been used to calculate the expression values from the BAM/SAM file. Once the expression values were available, they were merged and normalized with the microarray data. Gene expression was achieved through a joint study of all series that allowed integration among microarrays and RNA-Seq data.
Most of the previous studies in the selection of biomarkers perform this process through statistical tools over a given dataset and a given technology. However, this work takes an innovative step forward by combining different datasets and microarray technologies together with RNA-Seq data. Furthermore, this research also builds an smart breast cancer classifier with the aim of achieving early diagnosis when unlabeled samples are presented. To this end, the minimum-Redundancy Maximum-Relevance (mRMR) [19] feature selection algorithm was applied in order to select the most relevant genes to perform the classification. Also, three different classification algorithms have been implemented and their results compared. The first classifier makes use of Support Vector Machines (SVM) [20,21]. Alternatively, Random Forest (RF) [22] and k-Nearest Neighbor (k-NN) [23] classifiers have also been designed.
This paper has been structured as follows. This section has shown the introduction and state of the art of this work. Next section explains the methodology followed in this study. It begins by describing the available data series that have been used for this research. Later, the pipeline for processing and classifying the data is shown. An innovative step for automatic sample classification is described using machine learning techniques. The results and discussion section shows the integrated gene expression, revealing those genes that remain unchanged regardless of the technology used in the gene expression identification process. Furthermore, this section underlines the validity of the proposed approach and its utility in breast cancer early diagnosis using the developed classification tool. Finally, the conclusions section summarizes the most important contributions of this study for breast cancer diagnosis and genetic profiling.

Microarray and RNA-Seq series
The first issue that must be addressed concerns the definition of the kind of sample that is going to be used, along with the determination of the tissue or cell that the sample comes from. As a result, a wide search through the NCBI-GEO platform has been done with the objective of finding datasets belonging to both the selected cell lines and the considered technologies. In this study, control samples have been selected from the MCF10A cell line [24]. This cell line is classified as a healthy non-tumorigenic epithelial cell line. Various breast cancer cell lines were selected as cancer samples (MCF7 and HS578T) [25,26]. Besides, not every sample from each of the series has been selected, as there are samples that do not belong to the cell lines required, or they have been treated with some kind of drug that could produce some noise in the final results.
Once the requirements for selecting the desired samples were established, an exhaustive search of Affymetrix and Illumina series was carried out for microarray data. On the other hand, RNA-Seq data was selected from Illumina HiSeq technology. Only datasets containing the above-mentioned cell lines were selected. Table 1 summarizes the selected series for this study. As it can be seen, the NCBI GEO database offers a larger availability of microarray data when compared with the number of RNA-Seq samples. Two separated supersets have been created, one for training predictive models, and the other for their testing, both containing microarray as well as RNA-Seq samples. The training dataset is made up of 108 microarray samples: 65 samples from Affymetrix, 43 from Illumina, and 24 RNA-Seq samples. On the other hand, the test set is made up of 120 samples of microarray (108 of Illumina and 12 of Affymetrix) as well as 6 samples of RNA-Seq. These series are publicly available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi? acc=S.NAME where S.NAME is the name of each series at NCBI GEO.

Microarray pipeline
The first step in the methodology for microarray data is to put together all the selected series, independently of their technology (Affimetrix or Illumina). Consequently, a quality analysis assessment was performed across the series, in order to detect and consequently remove any possible outlier. This outliers detection and removal was performed through arrayQualityMetrics R package [27], which computes the Kolmogorov-Smirnov statistic K a between the distribution of each array and the distribution of the pooled data. Next, sample normalization was performed using the limma R package normalizedBe-tweenArrays function [10], in order to remove dynamic expression variability between samples. Once the samples were normalized, the expressed gene values were obtained. Figure 1 outlines the microarray data analysis pipeline.

RNA-Seq pipeline
The pipeline proposed by Anders et al. [28] has been followed for the extraction of RNA-Seq data as it is shown in Fig. 2. Starting from the SRA original files, several tools like sra-toolkit [29], tophat2 [30], bowtie2 [31], samtools [32] and htseq [33] have been used to obtain the read count for each gene. Once the read count files were obtained, the expression values were calculated using the cqn and the NOISeq R packages [34].

Integrated pipeline
A new data processing pipeline is proposed in this work which extends the classical gene expression data analysis pipeline in two ways. On one hand, this pipeline integrates data from both microarray and RNA-Seq technologies. Furthermore, once the integration has been carried out, a gene selection process and an assessment through a classification process were performed, using separated training and test datasets. The workflow of the entire pipeline is shown in Fig. 3. In a first step, sample integration of data from both microarrays and RNA-Seq technologies has been carried out using the merge function from base R package.
Once the gene expression values have been obtained for each technology separately, a normalization of all joint technologies was applied using the normalizedBe-tweenArrays function cited before over all datasets available (see Table 1). These tasks are essential in order to  have available a right normalization of the biological data and its subsequent processing [35,36]. We have to note that each of the series in Table 1 were originally differently quantified depending on the respective technology and manufacturer.
The next steps in the pipeline for gene expression levels calculation and extraction of DEGs, were made only over the training dataset, thus leaving the test dataset for later assessment.
Gene extraction was performed at different levels using the limma R package, both at individual levels (microarray data and RNA-Seq data separately) and at integrated level (joined microarray and RNA-Seq data).

Classification
Once a set of possible target genes which can be considered as biomarkers for breast cancer were identified, we proceeded to assess the results through three different classification technologies: SVM, RF and k-NN. The main objective of this stage is the validation of the behavior of the selected genes at the arrival of new unseen samples. The selected genes and the training dataset were used for designing the classification models, which were later evaluated over the test dataset.
• SVM: These models are supervised learning algorithms which assign categories to new samples. This algorithm is based on the idea of separating data from different categories through a hyperplane. The algorithm calculates the maximum-margin hyperplane that maximizes the distance between different classes. For overlapped data, this type of models turn a reduced space into a higher dimensional space using a kernel function, in order to Moreover, the algorithm tolerates making classification errors, which are controlled by the γ hyperparameter, in order to improve the generalization capability of the model [20,21]. • RF: This method grows many single classification trees with the purpose of building a forest of classification trees. For the classification, the algorithm assigns the input vector to be classified to each tree of the forest. Once that each individual tree performs classification, the forest chooses the class having the largest number of votes over all the trees. After each tree is built, all of the data are run down the tree, and proximities are computed for each pair of cases. If two cases occupy the same terminal node, their proximity is increased by one. At the end of the run, the proximities are normalized by dividing by the number of trees. Proximities are used with the aim of replacing missing data, locating outliers and producing illuminating low-dimensional views of the data [22,37]. • k-NN: This supervised method is based on assigning to a new unseen sample, the class corresponding to the predominant one in the k nearest neighbors (most similar samples) from the known labeled data. It is a well-known fast and easy-to-use technique which however provides a comparable performance to other well-known more complex techniques [23,38].
Ten-fold cross-validation was used over the training dataset to obtain the optimal hyperparameters for these methodologies: σ (kernel width) and γ for SVMs, number of trees for RF and k for k-NN.

Gene ranking: mRMR
Additionally, a feature selection process was performed through the mRMR [19] algorithm over the candidate biomarkers, with the objective of finding a reduced subset of genes that gives similar classification accuracy than the initial complete set of genes. In this way, the reduction of the number of genes allows the creation of a more simple and interpretable classifier, as well as more computationally efficient, while maintaining the robustness of the method. This algorithm creates a ranking of features, DEGs in our case. mRMR algorithm uses mutual information as the criterion for variables relevance, computing relevance and redundancy among variables (i.e. genes), and sorting them so that they bring largest relevance with respect to the class (cancer/no cancer) and, at the same time, they have lowest redundancy among themselves. Therefore, this algorithm will rank in first position the gene that contains the maximum relevance information, but the following genes will provide also minimum redundant information (apart from maximum relevance as regards to the class) with respect to the already selected genes, and so forth.

Results and discussion
This section will focus on presenting and discussing the obtained results coming from the experimentation process followed in this study. It is divided into two subsections: first subsection shows the results for the process of obtaining the set of DEGs; while second subsection will show the results of the classification process making use of the former set of genes.

Integrated gene expression
This subsection describes the process and results of the DEGs extraction. As it was previously stated in the methods section, series belonging to different technologies and platforms have been integrated. The objective of this integration is twofold: first, to increase the number of samples that will be used as input to our method, thus improving the robustness and stability of the results. Second, the obtained results will be independent of a single technology, as they proceed from different sources. The presence of RNA-Seq samples increases the dynamical midrange of the genes, making the results more accurate and robust. Furthermore, the number of available samples is greatly increased thanks to the availability of microarray data stored in public repositories.
When working with heterogeneous data, normalization is one of the most sensitive steps in the whole process, as a mistake in this step could cause interpretation errors, and could lead to a false set of expressed genes. Figure 4 shows the need of normalization for both training and test datasets together due to the difference of the dynamic range between samples. To this end, both training and test datasets have been subjected to a joint normalization using the normalizeBetweenArrays function from the limma R package, thus achieving the same dynamic range for all the samples. Figure 5 shows the results once the joint normalization was applied. As it can be seen, the dynamic range between samples has been corrected. In the next step, only the training dataset will be used in the process for identifying the DEGs.
We therefore proceeded to identify the DEGs both for each technology separately (microarray & RNA-Seq) and for the integrated dataset. Several restrictions were imposed in order to determine the expressed genes: the fold change in the expression values of the selected genes was set to be greater or equal than 2 and the p-value was set to be less or equal than 0.001. These constraints ensure that the chosen expressed genes are statistically significant, therefore showing different behavior between patient and healthy samples. These restrictions were applied to the three microarray, RNA-Seq and integrated datasets, so that three sets with different expressed genes were obtained. Finally, through the intersection of the three groups of expressed genes, a total of common 98 DEGs were found. These genes comply with the restrictions and they are differentially expressed in all datasets as the intersection shows (Fig. 6). Consequently, the obtained genes are differentially expressed independently of the gene expression technology, excluding possible noisy genes.
A boxplot of the mean gene expression values of the 98 DEGs for the samples in the training dataset is shown in Fig. 7. It shows a clear differentiation between the average value of the cancer cell lines samples and the average value of the MCF10A non-cancer cell line samples. Furthermore, the statistical information of the intersection set of 98 DEGs is shown in Table 2. Table 2 shows five statistics values computed by the li mma package (logFC, t-statistic, p-value, adj.p.val. and B).
The log-fold change (logFC) represents the difference between breast cancer and control expressed values. If | logFC |≥ 2 it means that there exists significant differences between cancer and control values. The second value in Table 2 is the moderated t-statistic, which is the ratio between the log2-fold change value for each gene and it respective standard error. The next value is the p-value (p-val) which represents the probability of obtaining a result equal or higher than what it was observed when the null hypothesis is true. The adjusted p-value indicates which proportion of comparisons within a family of comparisons (hypothesis tests) are significantly different. The B-statistic (B) is the log-odds that a given gene is differentially expressed. Figure 8 depicts a hierarchical clustering using the list of 98 invariant expressed genes. As it can be seen, the cluster is split into two group of samples, one belonging to control samples and the other to breast cancer samples. Thus

Classification results
Once the DEGs were identified in the previous subsection, this subsection assesses the performance of these genes through a classification process when new samples are presented. For that purpose, the classification algorithms SVM, RF and k-NN have been implemented. The whole training dataset formed by 132 samples has been used as the input data for the classifier ( Table 1). The 98 DEGs values were normalized to range between [-1,1], and have been chosen as classification features, ordered by a mutual information-based ranking provided by the mRMR algorithm. Moreover, for a further assessment of the classifier against new unseen samples, a test dataset made up of 126 samples has been equally normalized and used for testing (Table 1).
Following the proposed integrated pipeline in this work (see Fig. 3), once the samples were correctly integrated and the 98 DEGs were found, a classification method using these genes has been applied. Results for all the algorithms in the validation stage using the 98 genes reached an accuracy equal to 100%. Therefore, all samples belonging to the training dataset were successfully classified. When the classifier using 98 genes was applied to test samples, an accuracy above 95% was reached by the three algorithms, rising up to a 97% in the case of SVMs and RFs, thus confirming the robustness of the proposed pipeline approach (see Table 3).   Afterwards, a feature selection process has been applied in order to reduce the cardinality of the 98 DEGs. As a result, the mRMR algorithm returned a gene ranking based on mutual information. Figure 9 shows the validation (10-CV over the training dataset) and test results using the three algorithms: SVMs, RFs and k-NN. These validation results are above 98% using only the first gene of the ranking for classification for the three algorithms, and above 99.2% using a reduced set of the first six genes in the ranking. Moreover, classification results when using the new 126 unseen samples of the test set and the three methods, rose up to coherent results with an accuracy of 96.8% using SVMs, 94.1% using k-NN, nevertheless lower for RFs with a 87.4%. Therefore, the classifier performs in a similar way to the behavior observed in the validation results for two of the classifiers. Consequently, the main set of 98 DEGs was reduced to the later six genes set, which allow discerning if new samples are cancerous or not, with an expected error around a 3.2% when using a SVM classifier. These differences in performance among classification techniques are usual in this type of problems, and a number of papers comparing classification techniques for biological data can be found in the literature [37,[39][40][41]. In the results above-mentioned, using only 6 genes, SVMs attains an optimal performance near that attained using the complete set of 98 genes. This behavior is also seen in the k-NN technique, although with a lower performance. RF on the other hand obtains similar results than SVMs when the complete set of 98 genes are used, but fails to design a simpler classifier with a low number of genes with optimal performance [39,40]. Thus, these results support the design of an optimal classifier based on SVMs with only six genes attaining the excellent aforementioned results.
Finally, once the potential biomarker genes were identified as the reduced subset of six genes, a literature review and biological study was done in order to reveal the relation between those genes and their involvement in breast cancer ( Table 4). The first five of these six genes have been formerly reported as genes involved in breast cancer, whilst the sixth gene is present in breast cancerous tissue, although with no evidence of a direct implication with breast cancer development. This means that the results attained by the proposed integrated pipeline are coherent, as the reduced subset of six genes is formed by genes Fig. 9 Validation and test classification results with SVM, RF and k-NN using the most relevant genes obtained by mRMR related with breast cancer. Furthermore, these genes can be used for classification and diagnosis purposes over new unseen samples. They can be designated as a new breast cancer biomarker signature when these types of cell lines data are present. Figure 10 shows a hierarchical cluster built with the small six genes subset. Two distinct groups are clearly identified, as it also happened in Fig. 8: one matching control samples and the other matching breast cancer samples. Therefore, this indicates that the expression profiles of these genes constitute a possible diagnosis criteria for breast cancer. Figure 11 shows a boxplot for each of the six genes representing the average expression value for the cancerous samples (red), and control samples (green). As it can be seen, average expression values between cancerous and control samples are clearly differentiated, thus reaffirming their potential as breast cancer biomarkers.

Conclusions
This work has presented the possibility of integrating data from different gene expression analysis technologies. On the one hand, microarrays, which have been widely used in the last two decades and, on the other hand, RNA-Seq Table 4 Relationship of the top 6 expressed genes with breast cancer Gene symbol Gene name Relationship between protein and breast cancer SFRP1 Secreted frizzled-related protein 1 Inhibition of SFRP1 increases the proliferation, migration and invasion of breast cancer cells. SFRP1 exerted this function by activating Wnt/βcatenin signaling pathway in breast carcinogenesis [42,43].

GSTM3
Glutathione S-transferase mu 3 GSTM3 is suggested as an important modifier that impacts on individual susceptibility to develop breast cancer among premenopausal women [44]. High expression of GSTM3 is related to protective genotypes against breast cancer SULT1E1 Gulfotransferase family 1E member 1 SULT1E1 is an enzyme that catalyzes the sulfation of active 17β-estradiol into inactive form. SULT1E1 is highly expressed in normal mammary epithelial cells and rarely expressed in breast cancer cells. However, its over-expression in breast carcinomas is considered to retard tumor cell growth by arresting cell cycles and inducing apoptosis and may thus improve the prognosis of breast cancer [45,46].

MB
Myoglobin MB plays a functional role in breast cancer progression by promoting the growth of fully oxygenated cells through the control of fatty acid homeostasis and lipogenesis [47,48]. MB is dose-dependent downregulated by 17β-estradiol in breast cancer cells [49].

TRIM29
Tripartite motif containing 29 TRIM29 is considered a breast cancer tumor suppressor. Low TRIM29 expression in breast cancer is associated with more aggressive tumor features. Suppression of the oncogenic transcription factor TWIST1 expression is one mechanism suggested by which TRIM29 functions as a suppressor of breast cancer development [50].

VSTM2L V-set and transmembrane domain containing 2 like
Although VSTM2L is detected in breast cancer tissues, to date there are no relation between its expression and breast cancer development in the current literature. that is the technology meant to replace microarrays definitely. An exhaustive search from the NCBI-GEO public repository has been performed in order to collect breast cancer samples from both technologies. The intersection of DEGs in microarray, RNA-Seq, and the integrated dataset, has allowed identifying a set of candidates biomarkers for diagnosis of this disease.
Thereafter, feature selection through mRMR was applied in order to select the most relevant biomarkers subset. Three different classification models (SVMs, RFs, and k-NN) were designed from the training dataset and the selected DEGs and compared. These classifier were validated with the test dataset achieving outstanding results for the three algorithms when the complete set of 98 DEGs were used. In conclusion, results show that the expressed genes can be designated as robust biomarkers for breast cancer diagnosis when specific cell lines samples are used. Furthermore, even with a small subset of six of those genes, a great validation accuracy was reached (99%). Also, classification results over new unseen data show great accuracy, specially over SVM classification (96.8%). Five of these top six genes have been formerly reported as genes that show biological relation with breast cancer, which reinforce the designation of the expression profiles of these genes for breast cancer diagnosis.