- Methodology article
- Open Access
- Published:

# Multiple-platform data integration method with application to combined analysis of microarray and proteomic data

*BMC Bioinformatics*
**volume 13**, Article number: 320 (2012)

## Abstract

### Background

It is desirable in genomic studies to select biomarkers that differentiate between normal and diseased populations based on related data sets from different platforms, including microarray expression and proteomic data. Most recently developed integration methods focus on correlation analyses between gene and protein expression profiles. The correlation methods select biomarkers with concordant behavior across two platforms but do not directly select differentially expressed biomarkers. Other integration methods have been proposed to combine statistical evidence in terms of ranks and p-values, but they do not account for the dependency relationships among the data across platforms.

### Results

In this paper, we propose an integration method to perform hypothesis testing and biomarkers selection based on multi-platform data sets observed from normal and diseased populations. The types of test statistics can vary across the platforms and their marginal distributions can be different. The observed test statistics are aggregated across different data platforms in a weighted scheme, where the weights take into account different variabilities possessed by test statistics. The overall decision is based on the empirical distribution of the aggregated statistic obtained through random permutations.

### Conclusion

In both simulation studies and real biological data analyses, our proposed method of multi-platform integration has better control over false discovery rates and higher positive selection rates than the uncombined method. The proposed method is also shown to be more powerful than rank aggregation method.

## Background

In gene expression experiments, the expression levels of thousands of genes are simultaneously monitored to study the underlying biological process. In proteomic data, the protein levels or protein counts are measured for thousands of genes simultaneously. In addition, there are other types of genomic data with different sizes, formats and structures. Each distinct data type, such as gene expression, protein counts, or single nucleotide polymorphisms, provide potentially valuable and complementary information regarding the involvement of a given gene in a biological process. Many biomarkers that play important roles in biological processes behave differently in treatment versus control groups; this phenomenon can be observed consistently across various data platforms. Therefore, integrating related data sets from different sources is crucial to correctly identify the significant underlying biomarkers. Integrative analysis of multiple data types would improve the identification of biomarkers of clinical end points [1]. However, the integration of data from different sources poses a number of challenges. First, genomic data come in a wide variety of data formats. For example, expression data are recorded as continuous measurements, whereas proteomic data often consist of discrete counting variables. One may wish to convert data into a common format and common dimension, but this is not always practical or feasible [2]. Second, different data sets are collected under different experimental settings. Therefore, the distribution of the measurements as well as the quality of the experiments may vary from data set to data set. Third, measurements obtained across different data platforms could be collected from the same or related biological samples. Therefore, measurements across different data types could have complicated dependency relationships.

The practice of combining different data sources to perform classification analysis has been considered in the literature. Efforts to integrate data and improve classification accuracy are widely seen in recent studies [3–5]. In contrast to performing classification on biological samples, our main objective is to select important biomarkers for an underlying biological process. Correlation analysis has been proposed to integrate diverse data types and assimilate them into biological models for the prediction of cellular behavior and clinical outcome. Tian et al. [6] performed a correlation analysis of protein and mRNA expression data using the cosine correlation metric for comparison. Bussey et al. [7] integrated data on DNA copy number with gene expression levels and drug sensitivities in cancer cell lines based on Pearson’s correlation coefficients. Adourian et al. [8] presented a cross-compartment correlation network approach to integrate proteomic, metabolomic, and transcriptomic data for selecting circulating biomarkers; partial pairwise Pearson’s correlations controlling for treatment group means were calculated. The markers with concordant RNA and protein expression were included in the prediction models, while discordant ones were excluded. However, this approach might miss some important biological information, such as protein-protein interactions and protein-gene interactions [9]. Another limitation is that correlation analysis mainly captures the strength of the correlation among measurements across different platforms; however, strong correlation only demonstrates consistent outcome across different platforms and does not directly translate to significant involvement in a biological process. Furthermore, statistical evidence from complicated data sets, such as factorial experiments, times series, or longitudinal data, cannot be summarized.

The problem of how to reliably combine data from different experiment platforms to identify significant biomarkers has recently received considerable attention in the bioinformatics literature. The rank aggregation method [10] has been proposed for ranking genes by similarity to the disease genes in Gene Ontology, pathways, transcription factor binding sites, and sequence, then aggregating this rankings to get the final result. Rhodes et al. [11] combined four independent data sets to identify genes deregulated in prostate cancer. For each gene in each data set, a p-value was obtained as an indication of the probability that the gene was differentially expressed. P-values for different data sets were subsequently aggregated to provide an overall estimate of the genes’ significance of being differentially expressed during prostate cancer. However, combining genes’ ranks in the rank aggregation approach or p-values in the meta-profiling method ignores the underlying multivariate distributions of the ranks or p-values. Furthermore, data quality may vary across different data sources. The two aggregation methods detailed above essentially give equal weights to different data sets. Thus, we propose to combine statistical evidence across different platforms through summary statistics instead of raw data. For each experimental platform, we formulate a null hypothesis and construct the summary test statistic. By randomization, we obtain the null distribution of the vector of statistics across different platforms. The test statistics are summarized across different platforms in a weighted scheme, where the weights take into account different variabilities possessed by the statistics. The method allows the use of different types of summary statistics from different platforms, which gives great flexibility and generality with respect to its application.

The proposed method is similar in spirit to a meta-analysis. Both methods combine statistical evidence across multiple data sets. However, in meta-analysis different data sets are based on the same type of experiments or observational studies, and therefore the measurements are the same variables. Across different data sets, the quality of the data may vary. The goal of meta-analysis is to fully utilize all the information from different data sets and construct a weighted estimate of the effect size. Different weighting schemes are available depending on the statistical models [12]. On the other hand, data integration focuses on integrating statistical evidence across different experimental types. There is no common effect size to estimate across various data sets. In our proposed method, we use a weighted average of the test statistics across different data platforms, but the test statistics are summaries of evidence towards different sub-hypotheses rather than summaries of common effect size as in meta-analysis. The proposed integration method does not check for differences across the platforms.

## Methods

The aim of our multi-platform integration method is to select a set of significant biomarkers that are involved in a biological process and thus behave differently in the treatment group and the control group. In order to combine statistical evidence across different platforms, our method requires that analogous hypotheses based on the features being measured are formulated for each platform. Each null analogous hypothesis specifies the unrelatedness of the biomarker in that particular experimental setting, but all of them infer the unrelatedness of the biomarker to the biological process being investigated. Based on the set of Q analogous hypotheses for Q data sources, we construct a set of Q corresponding test statistics for each type of data. The test statistics can be different and tailored to the specific experimental settings. For example, if the microarray experiment has a multifactorial design, the appropriate test statistic can be an F statistic based on an ANOVA test. If the proteomics experiment generates counting data for diseased versus normal groups, the appropriate test statistic can be a nonparametric Wilcoxon rank sum test. A vector of observed statistics across multi-platforms is obtained. We then randomly permute data across diseased and control groups. All measurements from different platforms are permuted. In this way, we obtain an empirical null distribution of the vector of test statistics. In order to pool the randomized values of the statistics across the biomarkers to form the empirical null distribution, we assume data from different biomarkers are independent or have an exchangeable correlation structure. For the validity of the randomization procedure, we assume an exchangeable covariance structure for the measurements within each platform. Finally, we construct a weighted sum of the test statistics across different platforms with the weights being the inverse of the empirical standard deviation of each statistic. We determine a set of significant biomarkers based on the aggregated test statistic.

In the following, we demonstrate our method by integrating microarray expression data and proteomic data as an example. We consider two experiments, the first having microarray expression data measured on *l*_{1} diseased samples and *l*_{2} control samples and the second having proteomic data measured on *m*_{1} diseases samples and *m*_{2} control samples. The objective is to find biomarkers significantly involved in disease development.

Step 1): Define two analogous null hypotheses. For microarray data, the null hypothesis would be *H*_{01}: the gene’s mRNA level is the same in diseased and normal populations; for proteomic data, the null hypothesis would be *H*_{02}: the protein level is the same in diseased and normal populations.

Step 2): Based on the hypotheses, construct two test statistics, *t*_{
m
} and *t*_{
p
}, tailored to each type of data. Consequently, we obtain a vector of two observed statistics (*t*_{
m
},*t*_{
p
})^{′} across two data platforms. The test statistics can be of any type as long as they summarize information from the data and can be used to assess the statistical significance of the data toward the hypotheses. Let {x}_{1}={\left({x}_{11},\dots ,{x}_{1{l}_{1}}\right)}^{\prime} denote the *l*_{1} gene expression measurements in the disease group, {x}_{2}={\left({x}_{21},\dots ,{x}_{2{l}_{2}}\right)}^{\prime} denote the *l*_{2} gene expression measurements in the control group, {\overline{x}}_{1}={\sum}_{j=1}^{l1}{x}_{1j}/{l}_{1}, and {\overline{x}}_{2}={\sum}_{j=1}^{l2}{x}_{2j}/{l}_{2}. Similarly, {y}_{1}={\left({y}_{11},\dots ,{y}_{1{m}_{1}}\right)}^{\prime} denotes the *m*_{1} protein measurements in the disease group and {y}_{2}={\left({y}_{21},\dots ,{y}_{2{m}_{2}}\right)}^{\prime} denotes the *m*_{2} protein measurements in the control group, {\overline{y}}_{1}={\sum}_{j=1}^{{m}_{1}}{y}_{1j}/{m}_{1}, and {\overline{y}}_{2}={\sum}_{j=1}^{{m}_{2}}{y}_{2j}/{m}_{2}. For illustration purpose, we adopt Student’s t-statistic for each of the data:

and

where *s*^{2} denotes the sample variance. The test statistics should be formulated so that a larger test statistic in the positive direction indicates more evidence towards the alternative hypotheses. For example, if Student’s t-statistic is used, then a one-sided alternative hypothesis corresponds to a one-sided t-statistic, whereas the two-sided alternative leads to the absolute value of the t-statistic. Consider *n* genes being measured in the experiments and we obtain *n* vectors of test statistics (*t*_{
mi
},*t*_{
pi
})^{′}, *i* = 1,…,*n*, from the data sets.

Step 3): The samples are randomly permuted across diseased and control groups. If the same sample is being measured across different platforms, all the measurements from the different platform are permuted simultaneously. The simultaneous permutation preserves the dependency relationship among the measurements from different platforms. Based on random permutation, we obtain an empirical null distribution of the vector (*t*_{
m
},*t*_{
p
})^{′}.

Step 4): The aggregated test statistic will be:

where {\stackrel{\u02c6}{\sigma}}_{1} and {\stackrel{\u02c6}{\sigma}}_{2} are the estimated standard deviations of *t*_{
m
} and *t*_{
p
} based on the empirical null distribution, and *t*_{
m
} and *t*_{
p
} are the observed t-statistics or the absolute values of the t-statistics based on the direction of the alternative hypotheses. At significance level *α*, we choose a threshold *C*_{
α
}, such that {P}_{{H}_{01}\cap {H}_{02}}\left({t}_{A},>,{C}_{\alpha}\right)=\alpha. Specifically, *C*_{
α
} is the 100(1−*α*)*%* percentile of *t*_{
A
}, which can be obtained from the empirical null distribution. Construct a decision line that separates selected significant biomarkers and nonsignificant biomarkers. The resulting separation line is:

All the biomarkers with (*t*_{
m
},*t*_{
p
}) above the separation line will be declared as significantly involved in the disease development.

In the more general case, suppose we have Q data platforms with the observed test statistics (*t*_{1},…,*t*_{
Q
})^{′}. From random permutation, we obtain the joint empirical distribution of this vector of test statistics under the global null hypothesis. Let {\stackrel{\u02c6}{\sigma}}_{1}^{2},\dots ,{\stackrel{\u02c6}{\sigma}}_{Q}^{2} denote the estimated variance of the individual test statistics.The aggregated test statistic takes the form:

The resulting critical region will take the form:

where *C*_{
α
} is the 100(1−*α*)*%* percentile of *t*_{
A
}. Any biomarker with *t*_{
A
} > *C*_{
α
} will be selected as behaving significantly differently between the diseased group and control group.

Our method aggregates actual values of the test statistics across different data platforms, which preserves more information compared to the rank aggregation method. Moreover, our method assigns different weights to each data set according to the variability of the test statistics: larger the variation in the test statistic, the smaller the weight assigned to it, and vice versa. The threshold *C*_{
α
} is determined based on the empirical null distribution of the aggregated test statistics, which implicitly takes into account the dependency relationships among the test statistics. Furthermore, our method can deal with different data types and formats generated by various experimental settings.

There are two major ways to perform the multiplicity adjustment. The first is the Bonferroni correction. If we wish to control the familywise type I error rate at *α*^{∗}, then the individual level *α* = *α*^{∗}/*n*, where *n* is the total number of biomarkers. When *n* is large, the Bonferroni correction leads to very stringent tests with *α* being very small. Alternatively, we can control the number of false discoveries. To set the number of false discoveries to be equal to or less than *f* , then \alpha =f/\left(n\stackrel{\u02c6}{\mathit{\pi}}\right), where \stackrel{\u02c6}{\mathit{\pi}} is the estimated proportion of non-differentially expressed biomarkers. If there is no \stackrel{\u02c6}{\mathit{\pi}} available, we use \stackrel{\u02c6}{\mathit{\pi}}=1 and that gives a conservative value for *α*.

Different platforms can be used to test different sub-hypothesis. All of these sub-hypotheses should be concordant in supporting the overall biological hypothesis. For example, the involvement of a gene in disease development can be supported by both mRNA expression level changes and proteomic level changes. In most cases, changes in measurements from different platforms are expected to occur in the same direction. However, our method is also applicable even if the changes are in different directions, as long as the statistical evidence from both sources can be combined. For example, consider *H*_{10}: mRNA is increasing in normal group; *H*_{20}: antibody count is decreasing in normal group. Even though the actual measurements from two platforms are negatively correlated, we can construct the test statistics *t*_{1} and *t*_{2} so that the positive value of the statistics supports the alternative hypotheses and the weighted average can be used as combined evidence of the involvement of the biomarker in the process.

## Results

### Results on simulated data

In this section, we examine the performance of our proposed method by examining its positive selection rates and false discovery rates under various testing scenarios. We simulate data sets from *Q* different platforms. The number *Q* is set to be either 2 or 5. For the *q* th experiment, the data set is denoted as *X*_{
q
}. For each data set, we assume that *n* different biomarkers are measured, *X*_{
q
} = {\left({X}_{q1}^{\prime},\dots ,{X}_{\mathit{qn}}^{\prime}\right)}^{\prime}. For the *i* th biomarker, *X*_{
qi
}= {\left({X}_{\mathit{qi}1}^{\prime},{X}_{\mathit{qi}2}^{\prime}\right)}^{\prime}, where *X*_{qi 1} denotes data from the control group with mean *μ*_{qi 1} and *X*_{qi 2} denotes data from the diseased group with mean *μ*_{qi 2}. The total number of biomarkers is set to be *n* = 1000. Among the *n* biomarkers, let *g* denote the number of biomarkers that are related to the biological process of interest, i.e. *μ*_{qi 1} ≠ *μ*_{qi 2}. The number *g* of differentially expressed (DE) biomarkers is set to be 200. The number of measurements for each biomarker obtained from each platform is set to be 10, in which 5 are from the control group and the other 5 are from the disease group. We also consider different effect sizes. For continuous data, we generate {X}_{\mathit{qi}}\sim \mathrm{MVN}\left({\left({\mu}_{\mathit{qi}1}^{\prime},{\mu}_{\mathit{qi}2}^{\prime}\right)}^{\prime},\mathrm{\Sigma}\right), where Σ has an exchangeable correlation structure with correlation *ρ*. The correlation *ρ* is set to be either 0 or 0.5. For differentially expressed markers, *μ*_{qi 1} = 0 × 1_{
m
}, *μ*_{qi 2} = *e* × 1_{
m
}, where *e* is the effect size and *m* = 5 is number of measurements. Discrete data *X*_{
qi
}is generated from a Poisson(*λ*) distribution, where *λ*_{qi 1} = *μ*_{qi 1} for the control group and *μ*_{qi 2} = *μ*_{qi 1} + *e* for the diseased group. The *g* differentially expressed markers are divided into two groups with *g*_{1} = 100 and *g*_{2} = 100. Each group is assigned a different effect size *e*. For each platform, the alternative hypothesis can be either left-sided, right-sided or two-sided. The number of permutation is 100. All of the permuted values from the *n* biomarkers are pooled together to form the empirical null distribution. The results are summarized for 100 simulated data sets.

To compare our multi-platform integration method with the individual platform analysis method, the positive selection rate (PSR) and false discovery rate (FDR) are calculated to assess the performance of each method for selecting the differentially expressed biomarkers:

and

Tables 1, 2, and 3 provide detailed simulation settings and results at the *α* = 0.05 significance level. From the results, we can see that our multi-platform integration method has the highest PSR and the lowest FDR with the smallest variance compared to all other individual platform analyses in all scenarios. In addition, such advantage is consistently observed regardless of whether or not there is correlation among the measurements obtained for each biomarkers. Table 1 summarizes the results for the integrative analysis based on two different platforms. Given different effect sizes, different sided alternatives, and different correlations, the increase in PSR is consistently about 40% and the decrease in FDR is about 30% compared to the results from individual platforms. Table 2 summarizes the results for the integrative analysis based on five different platforms. Given different simulation scenarios, the increase in PSR for most cases is about 60% and the decrease in FDR is about 40% compared to the results from individual platforms. This shows that by integrating more data from different sources, we are improving the sensitivity and selectivity of the proposed method. Table 3 summarizes the results for the integrative analysis based on two different platforms, where the first consists of continuous data and the second consists of discrete data. Similar to the setting with two continuous data sets, the increase in PSR is about 40% and the decrease in FDR is about 30% compared to the results from individual platforms.

Figure 1 demonstrates decision lines from different methods. The plot is constructed based on the results from one simulated data set and contains three decision lines: the vertical line using data from the first individual platform, the horizontal line using data from the second individual platform, and the dashed line based on our multi-platform integration method. Our decision line provides a greatly improved separation of the differentially and non-differentially expressed biomarkers. Moreover, the individual platform analysis misidentifies some of the data points compared to our method.

As we examine a large number of biomarkers, we need to investigate the control of the false discovery rate of the proposed method with regards to multiple hypothesis testing [13]. Given a fixed cut-off value of *α*, we obtain the realized false discovery rate \mathit{FDR}=\left(\mathit{FP}\right)/\left(\stackrel{\u02c6}{T}P\right)) and its estimates F\stackrel{\u02c6}{D}R=\left(\stackrel{\u02c6}{F}P\right)/\left(\stackrel{\u02c6}{T}P\right), where *FP* denotes the number of false positive biomarkers, \stackrel{\u02c6}{F}P=\mathit{n}\mathit{\pi}\alpha is the estimated number of false positive biomarkers, \stackrel{\u02c6}{T}P is the total number of biomarkers claimed as positive, *π* is the proportion of non-differentially expressed genes, and \stackrel{\u02c6}{\mathit{\pi}} is its estimator. We can control the estimated number of false positive discoveries by selecting the significance level of the approaches. We expect that the estimated \stackrel{\u02c6}{F}P should be close to the true *FP*; the F\stackrel{\u02c6}{D}R should be close to the true *FDR* as well. Under the simulation setting of scenario 2 left-sided case in Table 1, the control of the false discovery rate of our proposed method under different significance levels is examined and presented in Table 4. With *π* = 0.8 and *α* = 0.005, \stackrel{\u02c6}{F}P is aimed to be controlled at 4. On average, our method produces 3.84 false positives, whereas the first and second individual platform analyses has 4.65 and 5.00 false positives, respectively. The corresponding average F\stackrel{\u02c6}{D}R of our method is 0.0225, which is close to the true *FDR* of 0.0214. This demonstrates the integrative analysis yields satisfactory control of false discovery rate, which is improved compared to individual platform analyses.

### Results on real data

In this section, we apply our method to data from a study of growth and stationary phase adaption in *Streptomyces coelicolor* provided by Jayapal et al. [16]. The data set contains both isobaric stable isotope labeled peptide (iTRAQ^{TM})-derived shotgun proteomic data and DNA microarray transcriptome data. To study different growth stages of *S. coelicolor* M145 cells, eight time point cell samples (7, 11, 14, 16, 22, 26, 34, and 38 h) were collected. Because the iTRQA^{TM} system can only analyze four distinct samples in a single experiment, the eight protein samples were distributed across three runs of mass spectrometric (MS) analysis, The protein sample from 11 h was run in three MS experiments, so it serves as a reference. Therefore, protein abundance ratios {r}_{j/11\mathit{hr},k}^{i} were obtained from experimental run *k* for protein *i* in sample *j* hr with respect to the 11 h reference. Protein identification and quantification were carried out by comparing the raw spectral data against a theoretical proteome of *S. coelicolor* using proteinPilot^{TM} software and the inbuilt Paragon^{TM} search engine. Only proteins identified with ≥ 99*%* confidence were considered for further analysis. Finally, all identified proteins were further processed to yield a protein abundance ratio with respect to the first time point (7 h) sample using {r}_{j/7\mathrm{hr}}^{i}={r}_{j/11\mathrm{hr}}^{i}/{r}_{7\mathrm{hr}/11\mathrm{hr}}^{i}. Ultimately, only 886 proteins identified in the 7 h sample could be used for our analysis.

For microarray data, total mRNA from the same eight time point samples were isolated and a spotted DNA microarray experiment was conducted. Hybridization was performed using genomic DNA (gDNA) as a reference. The mRNA abundance was obtained using _{log2}[cDNA/gDNA]. To be consistent with the protein data, mRNA abundance data from different samples were processed to calculate _{log2}[cDNAi/cDNA_{7hr}] for each sample with respect to the first time point sample. Only gene expression values with protein values (894 genes) were analyzed. To deal with missing values, we deleted genes that had no values for mRNA at all or had at least five missing values in the protein data set. The rest of the missing values for genes were imputed by using R package MICE. In total, the number of genes suitable for the subsequent integrative analysis was 886. Based on the growth curve, time points were divided into two groups; those from 7, 11, 14 and 16 h represented the growth phase and those from 22, 26, 34 and 38 h represented the stationary phase.

The objective of our analysis is now to select the biomarkers that are differentially expressed between the two phases. We apply our multi-platform integration method to identify differentially expressed biomarkers. For the mRNA data, we formulate the null hypothesis as *H*_{0}: the mRNA expression level is the same between the two phases. Similarly, for protein data, the null hypothesis is formulated as *H*_{0} : the protein ratio is the same between the two phases. For both mRNA data and protein data, two-sided alternatives are considered in the analysis. For each platform, we use Student’s t-statistics to summarize the statistical evidence, which are denoted as *t*_{
m
} and *t*_{
p
}. To obtain the multivariate null distribution, 100 permutations are conducted. The overall correlation between *t*_{
m
} and *t*_{
p
} is 0.2787. The variances of *t*_{
m
} and *t*_{
p
} are 3.0489 and 3.6411, respectively. Based on the decision line constructed at the significance level *α* = 0.05, our method detects 172 differential expressed genes with an estimated \stackrel{\u02c6}{F}P equal to 44. Individual analysis on the mRNA data and the protein data detects 137 and 143 genes, respectively. Figure 2 depicts the decision lines for all three comparative analyses: the vertical lines using the mRNA data, the horizontal lines using the protein data, and the dashed lines using our multi-platform integration method.

Nine differentially expressed genes are identified by our method but not by the other two methods. Among these, we identify biosynthetic enzymes (SCO5080 actVA5, SCO5072 actVIORFI) involved in actinorhodin production. These genes are up-regulated only at late stages of the culture and produce antibiotics during the stationary phase. Expression of two genes encoding malate oxidoreductase (SCO2951) and translation elongation factor G (SCO4661) have been found to be depressed during the stationary phase compared with the growth phase [17]. Table 5 summarizes the nine genes and the associated literature confirmations [16–21].

## Discussion

An ongoing problem in proteomics is that extremely small sample sizes often occur, largely due to biological reasons. To investigate the performance of our method in such situations, we consider a case for each platform wherein the control and the diseased groups each have only two measurements. Our method is applied and the simulation results shown in Table 6, scenario 1. Due to the small sample size, the positive selection rate is rather low and the false discovery rate rather high. Nevertheless, the combined method still outperforms the single platform method.

We also consider the situation in which data on the same biomarker from *n* platforms have a multivariate distribution and the data from the diseased group are independent of those from the control group. The new simulation results are summarized in Table 6, scenario 2. The correlation between the platforms is set to 0.5, and the other parameters are the same as in Table 1, scenario 1, right-sided test. Due to the high correlation among the platforms, the gain in power of the aggregated method is less pronounced than that of the independence case. This is because different platforms contribute overlapping information when they are highly correlated.

The proposed method allows different ways of constructing *t*_{
m
} and *t*_{
p
} as long as they provide summarized statistical evidence for that platform. The Student’s *t*-statistic is adopted in the paper simply for illustration purpose. Alternatively, we can simply use the unstandardized differences: {t}_{m}={\overline{x}}_{1}-{\overline{x}}_{2}, and {t}_{p}={\overline{y}}_{1}-{\overline{y}}_{2}. Then we proceed with the randomization, obtain the estimated variances for *t*_{
m
} and *t*_{
p
} and form a weighted linear sum statistic. To compare the empirical performance of the standardized versus unstandardized versions, we conduct simulations under the setting 1 of Table 1 with right-sided test. The results are summarized in Table 6, scenario 3. The two versions have comparable performance in terms of PSR and FDR. The unstandardized version of *t*_{
m
} and *t*_{
p
} has a slightly higher PSR and a slightly lower FDR.

An alternative way of combining test statistics across different platforms is to form a multivariate quadratic statistic. Given two platforms, for example, we consider an alternative test statistic

where \stackrel{\u02c6}{\Sigma} is the estimated covariance matrix of the vector (*t*_{
m
},*t*_{
p
}) obtained from the empirical null distribution. Such multivariate statistic can be used to test the overall null hypothesis against two-sided alternatives, while the weighted linear statistic that we propose can be used to test one-sided alternatives or two-sided alternatives. Thus, our method is more broadly applicable. We further conduct simulations to compare the multivariate quadratic form with our proposed weighted linear statistic for two-sided tests under the setting of scenario 2, Table 1, with results included in Table 7. For two-sided alternatives, the quadratic statistic has very similar performance to our proposed weighted linear statistic, with a slightly lower PSR and a slightly higher FDR.

Finally, we compare our method with the existing robust rank aggregation method [14] with results included in Table 8. The inference from rank aggregation method is based on the ranks of the test statistics. The ranking can in some degree reflect the significance of the test statistics. But the position of the rank does not always translate into the relatedness of the biomarker to the underlying biological mechanism. The rank aggregation method assigns p-values of the observed ranks under the null hypothesis that the normalized ranks of all biomarkers are uniformly distributed. But this is a null hypothesis which can correspond to two totally different situations: all the biomarkers are not related to the biological process or all of them are related with equal effect size. This evaluation of p-values under such global null hypothesis has two implications. First of all, if all the biomarkers are related to the biological process with equal or similar effect sizes, the observed ranks will appear non-informative and thus the method will have little power to detect them. Secondly, the p-value of each observed rank is calculated under the global null hypothesis. Thus, the rank aggregation has a correct error control under the global null hypothesis but has no correct error control under other configurations of the individual hypotheses. In other words, it lack the strong control of the error rate under different configurations of the individual hypothesis [15]. On the other hand, our method assigns p-values under the individual null hypotheses and thus have a strong control of the error rate. This means our method’s actual false discovery rate and estimated false discovery rate will be in good agreement no matter how many of the genes belong to the null situation and how many belong to the alternative situation. While in contrast, the rank aggregation will tend to be very conservative if there are many biomarkers belonging to the alternative situation. To demonstrate this, we choose the number of significant markers ranging from 100, 200 to 400. It is shown in Table 8 that the rank aggregation behaves very conservatively in the presence of large number of significant markers. For instance, with five platforms and 200 significant biomarkers, our proposed method has a PSR of 0.9995 and a FDR of 0.1399, while the competing rank aggregation method has a much lower PSR of 0.4995 and FDR of 0.0823. This comparison further demonstrates the advantage of the proposed method.

## Conclusion

With the advent of various types of genomic technologies, it is imperative to develop a method that can integrate different types of genomic data to solve biological questions. We develop a general framework for data integration across multiple data platforms. For each data set, a test statistic is formed to summarize the statistic evidence toward the specific null hypothesis tailored to the data platform. The types of test statistics can vary and their marginal distributions can be different. The observed test statistics can then be aggregated across different data platforms. The overall decision is based on the empirical distribution of the aggregated statistic obtained through random permutations. Our method can accommodate different experimental designs and various data types across platforms.

## References

Reif D, White B, Moore J: Integrated analysis of genetic, genomic and proteomic data.

*Expert Rev Proteomics*2004, 1: 67–75. 10.1586/14789450.1.1.67Hamid J, Hu P, Roslin M, Ling V, Greenwood C, Beyene J: Data integration in genetics and genomics: methods and challenges.

*Human Genomics Proteomics*2009, 9: 869093.Lanckriet G, Bie T, Cristianini N, Jordan M, Noble S: A statistical framework for genomic data fusion.

*Bioinformatics*2004, 20: 2626–2635. 10.1093/bioinformatics/bth294Daemen A, Gevaert O, De Bie T, Debucquoy A, Machiels J, De Moor B, Haustermans K: Integrating microarray and proteomics data to predict the response on cetuximab in patients with rectal cancer.

*Pac Symp Biocomputing*2008, 13: 166–177.Buness A, Ruschhaupt M, Kuner R, Tresch A: Classification across gene expression microarrray studies.

*Bioinformatics*2009, 10: 453.Tian Q, Stepaniants S, Mao M, Weng L, Feetham M, Doyle M, Yi E, Dai H, Thorsson V, Eng J, Goodlett D, Berger J, Gunter B, Linseley P, Stoughton R, Aebersold R, Collins S, Hanlon W, Hood L: Integrated genomic and proteomic analyses of gene expression in mammalian cells.

*Mol Cell Proteomics*2004, 3: 960–969. 10.1074/mcp.M400055-MCP200Bussey K, Chin K, Lababidi S, Reimers M, Reinhold W, Kuo W, Gwadry F, Kouros-Mehr H, Fridlyand J, Jain A, Collins C, Nishizuka S, Tonon G, Roschke A, Gehlhaus K, Kirsch I, Scudiero D, Gray J, Weinstein J, Ajay: Integrating data on DNA copy number with gene expression levels and drug sensitivities in the NCI-60 cell line panel.

*Mol Cancer Ther*2006, 5: 853–867. 10.1158/1535-7163.MCT-05-0155Adourian A, Jennings E, Balasubramanian R, Hines W, Damian D, Plasterer T, Clish C, Stroobant P, McBurney R, Verheij E, Bobeldijk I, van der Greef J, Lindberg J, Kenne K, Andersson U, Hellmold H, Nilsson K, Salter H, Schuppe-Koistinen I: Correlation network analysis for data integration and biomarker selection.

*R Soc Chem*2003, 4: 249–259.Ma Y, Ding Z, Qian Y, Wan Y, Tosun K, Shi X, Castranova V, Harner E, Guo N: An integrative genomic and proteomic approach to chemosensitivity prediction.

*Int J Oncol*2009, 34: 107–115.Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, De Smet F, Tranchevent L, De Moor B, Marynen P, Hassan B, Carmeliet P, Moreau Y: Gene prioritization through genomic data fusion.

*Nat Biotechnol*2006, 24: 537–544. 10.1038/nbt1203Rhodes D, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan A: Large-scale meta analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression.

*Proc Natl Acad Sci U S A*2004, 101(25):9309–9314. 10.1073/pnas.0401994101Hu P, Greenwood C, Beyene J: Statistical methods for meta-analysis of microarray data: A comparative study.

*Inf Syst Front*2006, 8: 9–20. 10.1007/s10796-005-6099-zGao X: Construction of null statistics in permutation based multiple testing for multi-factorial microarray experiments.

*Bioinformatics*2006, 22: 1486–1494. 10.1093/bioinformatics/btl109Kolde R, Laur S, Adler P, Vilo J: Robust rank aggregation for gene list integration and meta-analysis.

*Bioinformatics*2012, 4: 573–580.Hochberg Y, Tamhane A:

*Multiple Comparison Procedures*. New Jersey: Wiley; 1987.Jayapal K, Philp R, Kok Y, Yap M, Sherman D, Griffin T, Hu W: Uncovering genes with divergent mRNA-protein dynamics in Streptomyces coelicolor.

*PLoS One*2008, 3: e2097. 10.1371/journal.pone.0002097Manteca A, Sanchez J, Jung H, Schwamle V, Jensen O: Quantitative proteomics analysis of Streptomyces coelicolor development demonstrates that onset of secondary metabolism coincides with hypha differentiation.

*Mol Cell Proteomics*2010, 9(7):1423–1436. 10.1074/mcp.M900449-MCP200Bentley S, Chater K, Cerdeno-Tarraga A, Challis G, Thomson N, James K, Harris D, Quail M, Kieser H, Harper D, Bateman A, Brown S, Chandra G, Chen C, Collins M, Cronin A, Fraser A, Goble A, Hidalgo J, Hornsby T, Howarth S, Huang C, Kieser T, Larke L, Murphy L, Oliver K, O’Neil S, Rabbinowitsch E, Rajandream M, Rutherford K, Rutter S, Seeger K, Saunders D, Sharp S, Squares R, Squares S, Taylor K, Warren T, Wietzorrek A, Woodward J, Barrell B, Parkhill J, Hopwood D: Complete genome sequence of the model actionomycete Streptomyces coelicolor A3(2).

*Nature*2002, 417: 141–147. 10.1038/417141aMehra S, Lian W, Jayapal K, Charaniya S, Sherman D, Hu W: A framework to analyze multiple time series data: A case study with Streptomyces coelicolor.

*J Ind Microbiol Biotechnol*2006, 33(2):159–172. 10.1007/s10295-005-0034-7Jayapal K, Sui S, Philp R, Kok Y, Yap M, Griffin T, Hu W: Multitagging proteomic strategy to estimate protein turnover rates in dynamic systems.

*J Proteome Res*2010, 9: 2087–2097. 10.1021/pr9007738Nieselt K, Battke F, Herbig A, Bruheim P, Wentzel A, Jakobsen O, Sletta H, Alam M, Merlo M, Moore J, Omara W, Morrissey E, Juarez-Hermosillo M, Rodriguez-Garcia A, Nentwich M, Thomas L, Iqbal M, Legaie R, Gaze WH, Challis G, Jansen R, Dijkhuizen L, Rand D, Wild D, Bonin M, Reuther J, Wohlleben W, Smith M, Burroughs N, Martin J, Hodgson D, Takano E, Breitling R, Ellingsen T, Wellington E: The dynamic architecture of the metabolic switch in Streptomyces coelicolor.

*BMC Genomics*2010, 11: 10. 10.1186/1471-2164-11-10

## Acknowledgements

The authors are grateful to Dr. Lei Nie for his discussion and comments on our project. The authors are very thankful to the editor, associate editor and three referees. Their comments and suggestions lead to a much improved manuscript.

## Author information

### Authors and Affiliations

### Corresponding author

## Additional information

### Competing interests

The authors declare that they have no competing interests.

### Authors’ contributions

SW, XG, YX, XW and ZF developed the algorithm, SW and YX implemented the algorithm, YX, ZF, and XY performed data analysis; and XG supervised the project. All authors read and approved the final manuscript.

## Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

## Rights and permissions

**Open Access**
This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License (
https://creativecommons.org/licenses/by/2.0
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Cite this article

Wu, S., Xu, Y., Feng, Z. *et al.* Multiple-platform data integration method with application to combined analysis of microarray and proteomic data.
*BMC Bioinformatics* **13**, 320 (2012). https://doi.org/10.1186/1471-2105-13-320

Received:

Accepted:

Published:

DOI: https://doi.org/10.1186/1471-2105-13-320

### Keywords

- False Discovery Rate
- Rank Aggregation
- Data Platform
- Decision Line
- Lower False Discovery Rate