Multiple-platform data integration method with application to combined analysis of microarray and proteomic data
- Shicheng Wu^{1},
- Yawen Xu^{1},
- Zeny Feng^{2},
- Xiaojian Yang^{2},
- Xiaogang Wang^{1} and
- Xin Gao^{1}
https://doi.org/10.1186/1471-2105-13-320
© Wu et al.; licensee BioMed Central Ltd. 2012
Received: 21 February 2012
Accepted: 2 November 2012
Published: 2 December 2012
Abstract
Background
It is desirable in genomic studies to select biomarkers that differentiate between normal and diseased populations based on related data sets from different platforms, including microarray expression and proteomic data. Most recently developed integration methods focus on correlation analyses between gene and protein expression profiles. The correlation methods select biomarkers with concordant behavior across two platforms but do not directly select differentially expressed biomarkers. Other integration methods have been proposed to combine statistical evidence in terms of ranks and p-values, but they do not account for the dependency relationships among the data across platforms.
Results
In this paper, we propose an integration method to perform hypothesis testing and biomarker selection based on multi-platform data sets observed from normal and diseased populations. The types of test statistics can vary across the platforms and their marginal distributions can be different. The observed test statistics are aggregated across different data platforms in a weighted scheme, where the weights take into account different variabilities possessed by the test statistics. The overall decision is based on the empirical distribution of the aggregated statistic obtained through random permutations.
Conclusion
In both simulation studies and real biological data analyses, our proposed method of multi-platform integration has better control over false discovery rates and higher positive selection rates than the uncombined method. The proposed method is also shown to be more powerful than the rank aggregation method.
Background
In gene expression experiments, the expression levels of thousands of genes are simultaneously monitored to study the underlying biological process. In proteomic data, the protein levels or protein counts are measured for thousands of genes simultaneously. In addition, there are other types of genomic data with different sizes, formats and structures. Each distinct data type, such as gene expression, protein counts, or single nucleotide polymorphisms, provides potentially valuable and complementary information regarding the involvement of a given gene in a biological process. Many biomarkers that play important roles in biological processes behave differently in treatment versus control groups; this phenomenon can be observed consistently across various data platforms. Therefore, integrating related data sets from different sources is crucial to correctly identify the significant underlying biomarkers. Integrative analysis of multiple data types would improve the identification of biomarkers of clinical end points [1]. However, the integration of data from different sources poses a number of challenges. First, genomic data come in a wide variety of data formats. For example, expression data are recorded as continuous measurements, whereas proteomic data often consist of discrete counting variables. One may wish to convert data into a common format and common dimension, but this is not always practical or feasible [2]. Second, different data sets are collected under different experimental settings. Therefore, the distribution of the measurements as well as the quality of the experiments may vary from data set to data set. Third, measurements obtained across different data platforms could be collected from the same or related biological samples. Therefore, measurements across different data types could have complicated dependency relationships.
The practice of combining different data sources to perform classification analysis has been considered in the literature. Efforts to integrate data and improve classification accuracy are widely seen in recent studies [3–5]. In contrast to performing classification on biological samples, our main objective is to select important biomarkers for an underlying biological process. Correlation analysis has been proposed to integrate diverse data types and assimilate them into biological models for the prediction of cellular behavior and clinical outcome. Tian et al. [6] performed a correlation analysis of protein and mRNA expression data using the cosine correlation metric for comparison. Bussey et al. [7] integrated data on DNA copy number with gene expression levels and drug sensitivities in cancer cell lines based on Pearson’s correlation coefficients. Adourian et al. [8] presented a cross-compartment correlation network approach to integrate proteomic, metabolomic, and transcriptomic data for selecting circulating biomarkers; partial pairwise Pearson’s correlations controlling for treatment group means were calculated. The markers with concordant RNA and protein expression were included in the prediction models, while discordant ones were excluded. However, this approach might miss some important biological information, such as protein-protein interactions and protein-gene interactions [9]. Another limitation is that correlation analysis mainly captures the strength of the correlation among measurements across different platforms; however, strong correlation only demonstrates consistent outcome across different platforms and does not directly translate to significant involvement in a biological process. Furthermore, statistical evidence from complicated data sets, such as factorial experiments, time series, or longitudinal data, cannot be summarized.
The problem of how to reliably combine data from different experiment platforms to identify significant biomarkers has recently received considerable attention in the bioinformatics literature. The rank aggregation method [10] has been proposed for ranking genes by similarity to the disease genes in Gene Ontology, pathways, transcription factor binding sites, and sequence, then aggregating these rankings to get the final result. Rhodes et al. [11] combined four independent data sets to identify genes deregulated in prostate cancer. For each gene in each data set, a p-value was obtained as an indication of the probability that the gene was differentially expressed. P-values for different data sets were subsequently aggregated to provide an overall estimate of the genes’ significance of being differentially expressed during prostate cancer. However, combining genes’ ranks in the rank aggregation approach or p-values in the meta-profiling method ignores the underlying multivariate distributions of the ranks or p-values. Furthermore, data quality may vary across different data sources. The two aggregation methods detailed above essentially give equal weights to different data sets. Thus, we propose to combine statistical evidence across different platforms through summary statistics instead of raw data. For each experimental platform, we formulate a null hypothesis and construct the summary test statistic. By randomization, we obtain the null distribution of the vector of statistics across different platforms. The test statistics are summarized across different platforms in a weighted scheme, where the weights take into account different variabilities possessed by the statistics. The method allows the use of different types of summary statistics from different platforms, which gives great flexibility and generality with respect to its application.
The proposed method is similar in spirit to a meta-analysis. Both methods combine statistical evidence across multiple data sets. However, in meta-analysis different data sets are based on the same type of experiments or observational studies, and therefore the measurements are the same variables. Across different data sets, the quality of the data may vary. The goal of meta-analysis is to fully utilize all the information from different data sets and construct a weighted estimate of the effect size. Different weighting schemes are available depending on the statistical models [12]. On the other hand, data integration focuses on integrating statistical evidence across different experimental types. There is no common effect size to estimate across various data sets. In our proposed method, we use a weighted average of the test statistics across different data platforms, but the test statistics are summaries of evidence towards different sub-hypotheses rather than summaries of common effect size as in meta-analysis. The proposed integration method does not check for differences across the platforms.
Methods
The aim of our multi-platform integration method is to select a set of significant biomarkers that are involved in a biological process and thus behave differently in the treatment group and the control group. In order to combine statistical evidence across different platforms, our method requires that analogous hypotheses based on the features being measured are formulated for each platform. Each analogous null hypothesis specifies the unrelatedness of the biomarker in that particular experimental setting, but all of them infer the unrelatedness of the biomarker to the biological process being investigated. Based on the set of Q analogous hypotheses for Q data sources, we construct a set of Q corresponding test statistics, one for each type of data. The test statistics can be different and tailored to the specific experimental settings. For example, if the microarray experiment has a multifactorial design, the appropriate test statistic can be an F statistic based on an ANOVA test. If the proteomics experiment generates counting data for diseased versus normal groups, the appropriate test statistic can be a nonparametric Wilcoxon rank sum statistic. A vector of observed statistics across the platforms is obtained. We then randomly permute data across diseased and control groups. All measurements from different platforms are permuted. In this way, we obtain an empirical null distribution of the vector of test statistics. In order to pool the randomized values of the statistics across the biomarkers to form the empirical null distribution, we assume data from different biomarkers are independent or have an exchangeable correlation structure. For the validity of the randomization procedure, we assume an exchangeable covariance structure for the measurements within each platform. Finally, we construct a weighted sum of the test statistics across different platforms with the weights being the inverse of the empirical standard deviation of each statistic. We determine a set of significant biomarkers based on the aggregated test statistic.
In the following, we demonstrate our method by integrating microarray expression data and proteomic data as an example. We consider two experiments, the first having microarray expression data measured on l_{1} diseased samples and l_{2} control samples and the second having proteomic data measured on m_{1} diseased samples and m_{2} control samples. The objective is to find biomarkers significantly involved in disease development.
Step 1): Define two analogous null hypotheses. For microarray data, the null hypothesis would be H_{01}: the gene’s mRNA level is the same in diseased and normal populations; for proteomic data, the null hypothesis would be H_{02}: the protein level is the same in diseased and normal populations.
where s^{2} denotes the sample variance. The test statistics should be formulated so that a larger test statistic in the positive direction indicates more evidence towards the alternative hypotheses. For example, if Student’s t-statistic is used, then a one-sided alternative hypothesis corresponds to a one-sided t-statistic, whereas the two-sided alternative leads to the absolute value of the t-statistic. Suppose n genes are measured in the experiments; we then obtain n vectors of test statistics (t_{ mi },t_{ pi })^{ ′ }, i = 1,…,n, from the data sets.
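As an illustration of the orientation rule for one-sided and two-sided alternatives, the following Python sketch computes a two-sample t-type statistic for a single gene on a single platform. The unequal-variance (Welch) standard error, the function name `oriented_t`, and the example measurements are assumptions made for this sketch and are not taken from the paper.

```python
import math
import statistics


def oriented_t(control, diseased, alternative="right"):
    """Two-sample t-type statistic, oriented so that larger values
    give more evidence for the chosen alternative hypothesis.

    The unequal-variance (Welch) standard error is an assumption of this sketch.
    """
    diff = statistics.mean(diseased) - statistics.mean(control)
    # statistics.variance is the sample variance s^2 (divisor n - 1)
    se = math.sqrt(statistics.variance(control) / len(control)
                   + statistics.variance(diseased) / len(diseased))
    t = diff / se
    if alternative == "right":      # alternative: diseased mean is larger
        return t
    if alternative == "left":       # alternative: diseased mean is smaller
        return -t
    if alternative == "two-sided":  # alternative: the means differ
        return abs(t)
    raise ValueError("alternative must be 'right', 'left' or 'two-sided'")


# Hypothetical measurements for one gene on one platform
control = [0.1, -0.3, 0.2, 0.0, -0.1]
diseased = [1.2, 0.8, 1.5, 0.9, 1.1]
t_right = oriented_t(control, diseased, "right")
t_left = oriented_t(control, diseased, "left")
t_two_sided = oriented_t(control, diseased, "two-sided")
```

With these hypothetical values the diseased mean is larger, so the right-sided and two-sided statistics are positive while the left-sided statistic is negative.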
Step 3): The samples are randomly permuted across diseased and control groups. If the same sample is being measured across different platforms, all the measurements from the different platforms are permuted simultaneously. The simultaneous permutation preserves the dependency relationship among the measurements from different platforms. Based on random permutation, we obtain an empirical null distribution of the vector (t_{ m },t_{ p })^{ ′ }.
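The simultaneous permutation in Step 3 can be sketched as follows. This is a minimal illustration that assumes both platforms are measured on the same samples in the same order and that uses a simple difference in means as the statistic for both platforms; the function names and the toy data are hypothetical.

```python
import random
import statistics


def mean_diff(values, labels):
    """Difference in means: diseased (label 1) minus control (label 0)."""
    diseased = [v for v, g in zip(values, labels) if g == 1]
    control = [v for v, g in zip(values, labels) if g == 0]
    return statistics.mean(diseased) - statistics.mean(control)


def permutation_null(microarray, proteomic, labels, n_perm=100, seed=0):
    """Pooled empirical null of (t_m, t_p) from joint label permutations.

    microarray[i] and proteomic[i] hold the measurements of biomarker i
    on the same samples, in the same sample order as `labels`.
    """
    rng = random.Random(seed)
    null_pairs = []
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)  # one relabelling of the samples ...
        for x, y in zip(microarray, proteomic):
            # ... applied to both platforms, which keeps their dependence
            null_pairs.append((mean_diff(x, shuffled), mean_diff(y, shuffled)))
    return null_pairs


# Toy data: 2 biomarkers, 3 control and 3 diseased samples
labels = [0, 0, 0, 1, 1, 1]
microarray = [[0.1, 0.2, 0.0, 1.0, 1.1, 0.9], [0.5, 0.4, 0.6, 0.5, 0.6, 0.4]]
proteomic = [[5, 6, 5, 9, 8, 9], [3, 4, 3, 4, 3, 3]]
null_pairs = permutation_null(microarray, proteomic, labels, n_perm=50)
```

Pooling the permuted values over all biomarkers, as done here, relies on the independence or exchangeability assumption stated for the method.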
All the biomarkers with (t_{ m },t_{ p }) above the separation line will be declared as significantly involved in the disease development.
where C_{ α } is the 100(1−α)th percentile of the empirical null distribution of t_{ A }. Any biomarker with t_{ A } > C_{ α } will be selected as behaving significantly differently between the diseased group and control group.
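A minimal sketch of the aggregation and thresholding step is given below. It assumes that the aggregated statistic is the unnormalized weighted sum t_A = t_m/sd_m + t_p/sd_p, with the standard deviations estimated from the pooled permutation values, and that C_α is taken as an order statistic of the permuted t_A values. The exact form used in the paper may differ, and all function names are hypothetical.

```python
import math
import statistics


def aggregate(pairs, sd_m, sd_p):
    """Weighted sum of the two statistics with inverse-SD weights."""
    return [tm / sd_m + tp / sd_p for tm, tp in pairs]


def select_biomarkers(observed, null_pairs, alpha=0.05):
    """Return indices of biomarkers with t_A > C_alpha, and C_alpha itself."""
    # Empirical standard deviations of each statistic under permutation
    sd_m = statistics.stdev([tm for tm, _ in null_pairs])
    sd_p = statistics.stdev([tp for _, tp in null_pairs])
    null_ta = sorted(aggregate(null_pairs, sd_m, sd_p))
    # Order statistic used as the empirical 100(1 - alpha)th percentile
    k = math.ceil((1 - alpha) * len(null_ta)) - 1
    k = max(0, min(len(null_ta) - 1, k))
    c_alpha = null_ta[k]
    observed_ta = aggregate(observed, sd_m, sd_p)
    selected = [i for i, ta in enumerate(observed_ta) if ta > c_alpha]
    return selected, c_alpha


# Hypothetical permutation values and observed statistics
null_pairs = [(-1.0, -2.0), (0.0, 0.0), (1.0, 2.0), (0.5, -1.0), (-0.5, 1.0)]
observed = [(10.0, 20.0), (0.0, 0.0)]
selected, c_alpha = select_biomarkers(observed, null_pairs, alpha=0.05)
```

Because the threshold comes from the permuted values of the aggregated statistic itself, any dependence between t_m and t_p that the joint permutation preserves is reflected in C_α.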
Our method aggregates actual values of the test statistics across different data platforms, which preserves more information compared to the rank aggregation method. Moreover, our method assigns different weights to each data set according to the variability of the test statistics: the larger the variation in the test statistic, the smaller the weight assigned to it, and vice versa. The threshold C_{ α } is determined based on the empirical null distribution of the aggregated test statistics, which implicitly takes into account the dependency relationships among the test statistics. Furthermore, our method can deal with different data types and formats generated by various experimental settings.
There are two major ways to perform the multiplicity adjustment. The first is the Bonferroni correction. If we wish to control the familywise type I error rate at α^{∗}, then the individual level is α = α^{∗}/n, where n is the total number of biomarkers. When n is large, the Bonferroni correction leads to very stringent tests with α being very small. Alternatively, we can control the number of false discoveries. To keep the number of false discoveries equal to or less than f, we set $\alpha = f/(n\hat{\pi})$, where $\hat{\pi}$ is the estimated proportion of non-differentially expressed biomarkers. If no $\hat{\pi}$ is available, we use $\hat{\pi}=1$, which gives a conservative value for α.
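The two adjustments amount to simple arithmetic, which the following short Python sketch works through. The function names are hypothetical; the numerical values n = 1000 and a proportion of 0.8 non-differentially expressed biomarkers are chosen to match the simulation setting used later in the paper.

```python
def bonferroni_level(alpha_star, n):
    """Per-biomarker level that controls the familywise error rate at alpha_star."""
    return alpha_star / n


def level_for_false_discoveries(f, n, pi_hat=1.0):
    """Per-biomarker level that keeps the expected number of false
    discoveries at or below f; pi_hat = 1 is the conservative default."""
    return f / (n * pi_hat)


n = 1000
alpha_bonferroni = bonferroni_level(0.05, n)                  # very stringent
alpha_fd = level_for_false_discoveries(40, n, pi_hat=0.8)     # allows 40 false discoveries
alpha_fd_conservative = level_for_false_discoveries(40, n)    # pi_hat unknown
```

Here allowing f = 40 false discoveries with an estimated proportion of 0.8 corresponds to an individual level of about 0.05, whereas the conservative choice with the proportion set to 1 gives the smaller level 0.04.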
Different platforms can be used to test different sub-hypotheses. All of these sub-hypotheses should be concordant in supporting the overall biological hypothesis. For example, the involvement of a gene in disease development can be supported by both mRNA expression level changes and proteomic level changes. In most cases, changes in measurements from different platforms are expected to occur in the same direction. However, our method is also applicable even if the changes are in different directions, as long as the statistical evidence from both sources can be combined. For example, consider H_{10}: mRNA is increasing in the normal group; H_{20}: antibody count is decreasing in the normal group. Even though the actual measurements from the two platforms are negatively correlated, we can construct the test statistics t_{1} and t_{2} so that the positive value of the statistics supports the alternative hypotheses and the weighted average can be used as combined evidence of the involvement of the biomarker in the process.
Results
Results on simulated data
In this section, we examine the performance of our proposed method by examining its positive selection rates and false discovery rates under various testing scenarios. We simulate data sets from Q different platforms. The number Q is set to be either 2 or 5. For the q th experiment, the data set is denoted as X_{ q }. For each data set, we assume that n different biomarkers are measured, X_{ q } = ${\left({X}_{q1}^{\prime},\dots ,{X}_{\mathit{qn}}^{\prime}\right)}^{\prime}$. For the i th biomarker, X_{ qi }= ${\left({X}_{\mathit{qi}1}^{\prime},{X}_{\mathit{qi}2}^{\prime}\right)}^{\prime}$, where X_{qi 1} denotes data from the control group with mean μ_{qi 1} and X_{qi 2} denotes data from the diseased group with mean μ_{qi 2}. The total number of biomarkers is set to be n = 1000. Among the n biomarkers, let g denote the number of biomarkers that are related to the biological process of interest, i.e. μ_{qi 1} ≠ μ_{qi 2}. The number g of differentially expressed (DE) biomarkers is set to be 200. The number of measurements for each biomarker obtained from each platform is set to be 10, of which 5 are from the control group and the other 5 are from the diseased group. We also consider different effect sizes. For continuous data, we generate ${X}_{\mathit{qi}}\sim \mathrm{MVN}\left({\left({\mu}_{\mathit{qi}1}^{\prime},{\mu}_{\mathit{qi}2}^{\prime}\right)}^{\prime},\mathrm{\Sigma}\right)$, where Σ has an exchangeable correlation structure with correlation ρ. The correlation ρ is set to be either 0 or 0.5. For differentially expressed markers, μ_{qi 1} = 0 × 1_{ m }, μ_{qi 2} = e × 1_{ m }, where e is the effect size and m = 5 is the number of measurements per group. Discrete data X_{ qi } are generated from a Poisson(λ) distribution, where λ_{qi 1} = μ_{qi 1} for the control group and λ_{qi 2} = μ_{qi 2} = μ_{qi 1} + e for the diseased group. The g differentially expressed markers are divided into two groups with g_{1} = 100 and g_{2} = 100. Each group is assigned a different effect size e.
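The data-generating scheme described above can be sketched with the Python standard library as follows. The sketch assumes unit variances for the continuous data (the text specifies only the correlation structure), induces the exchangeable correlation ρ ≥ 0 through a shared normal factor, and draws Poisson counts with Knuth's multiplication method. All function names are hypothetical.

```python
import math
import random


def simulate_continuous(effect, rho, m=5, rng=random):
    """One biomarker on one platform: m control and m diseased values with
    unit variance and exchangeable correlation rho (0 <= rho < 1)."""
    shared = rng.gauss(0.0, 1.0)  # common factor inducing correlation rho
    noise = [math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
             for _ in range(2 * m)]
    control = noise[:m]                         # mean 0
    diseased = [effect + z for z in noise[m:]]  # mean e
    return control, diseased


def poisson(lam, rng=random):
    """Poisson draw by Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_discrete(mu_control, effect, m=5, rng=random):
    """One biomarker on a count platform: Poisson(mu) vs Poisson(mu + e)."""
    control = [poisson(mu_control, rng) for _ in range(m)]
    diseased = [poisson(mu_control + effect, rng) for _ in range(m)]
    return control, diseased


rng = random.Random(1)
cont_control, cont_diseased = simulate_continuous(effect=2.0, rho=0.5, rng=rng)
disc_control, disc_diseased = simulate_discrete(mu_control=5, effect=3, rng=rng)
```

A non-differentially expressed biomarker is obtained by setting the effect to 0 in either function.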
For each platform, the alternative hypothesis can be either left-sided, right-sided or two-sided. The number of permutations is 100. All of the permuted values from the n biomarkers are pooled together to form the empirical null distribution. The results are summarized for 100 simulated data sets.
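The summary measures reported in the tables can be computed along the following lines. This sketch assumes that the positive selection rate (PSR) is the proportion of truly differentially expressed biomarkers that are selected and that the false discovery rate (FDR) is the proportion of selected biomarkers that are not truly differentially expressed; these working definitions and the function names are assumptions made for illustration.

```python
import statistics


def psr_fdr(selected, truly_de):
    """Positive selection rate and false discovery rate for one data set."""
    selected, truly_de = set(selected), set(truly_de)
    true_positives = len(selected & truly_de)
    psr = true_positives / len(truly_de)
    # By convention the FDR is set to 0 when nothing is selected
    fdr = (len(selected) - true_positives) / len(selected) if selected else 0.0
    return psr, fdr


def summarize(rates):
    """Mean and sample variance of a rate across simulated data sets."""
    return statistics.mean(rates), statistics.variance(rates)


# Hypothetical selection: biomarkers 0-3 are truly DE, 0, 1, 2 and 5 are selected
psr, fdr = psr_fdr(selected=[0, 1, 2, 5], truly_de=[0, 1, 2, 3])
psr_mean, psr_var = summarize([0.75, 0.80, 0.70])
```

In this toy example three of the four truly differentially expressed biomarkers are selected and one of the four selections is false, so the PSR is 0.75 and the FDR is 0.25.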
Table 1. The simulation settings and results for two platforms with continuous data
Setting / measure | Multi-platform | 1st individual | 2nd individual |
---|---|---|---|
Scenario 1: ρ = 0; g = g_{1} + g_{2} = 200 | | | |
Right-sided, Experiment 1: e = 0.5 for g_{1} = 100; e = 2 for g_{2} = 100 | | | |
Right-sided, Experiment 2: e = 1.5 for g_{1} = 100; e = 1 for g_{2} = 100 | | | |
PSR Mean | 0.7895 | 0.5372 | 0.5588 |
PSR Var | 0.0007 | 0.0007 | 0.0010 |
FDR Mean | 0.1907 | 0.2680 | 0.2600 |
FDR Var | 0.0007 | 0.0013 | 0.0009 |
Left-sided, Experiment 1: e = -0.5 for g_{1} = 100; e = -2 for g_{2} = 100 | | | |
Left-sided, Experiment 2: e = -1.5 for g_{1} = 100; e = -1 for g_{2} = 100 | | | |
PSR Mean | 0.7908 | 0.5330 | 0.5556 |
PSR Var | 0.0006 | 0.0006 | 0.0012 |
FDR Mean | 0.1891 | 0.2673 | 0.2649 |
FDR Var | 0.0006 | 0.0009 | 0.0011 |
Two-sided, Experiment 1: e = -1 for g_{1} = 100; e = 1.5 for g_{2} = 100 | | | |
Two-sided, Experiment 2: e = 2 for g_{1} = 100; e = -1 for g_{2} = 100 | | | |
PSR Mean | 0.6988 | 0.4113 | 0.5403 |
PSR Var | 0.0011 | 0.0011 | 0.0010 |
FDR Mean | 0.2145 | 0.3202 | 0.2694 |
FDR Var | 0.0007 | 0.0016 | 0.0012 |
Scenario 2: ρ = 0.5; g = g_{1} + g_{2} = 200 | | | |
Right-sided, Experiment 1: e = 0.5 for g_{1} = 100; e = 2 for g_{2} = 100 | | | |
Right-sided, Experiment 2: e = 1.5 for g_{1} = 100; e = 1 for g_{2} = 100 | | | |
PSR Mean | 0.9405 | 0.6319 | 0.7819 |
PSR Var | 0.0003 | 0.0005 | 0.0007 |
FDR Mean | 0.1560 | 0.2410 | 0.2051 |
FDR Var | 0.0005 | 0.0009 | 0.0007 |
Left-sided, Experiment 1: e = -0.5 for g_{1} = 100; e = -2 for g_{2} = 100 | | | |
Left-sided, Experiment 2: e = -1.5 for g_{1} = 100; e = -1 for g_{2} = 100 | | | |
PSR Mean | 0.9400 | 0.6316 | 0.7871 |
PSR Var | 0.0002 | 0.0004 | 0.0006 |
FDR Mean | 0.1605 | 0.2419 | 0.2024 |
FDR Var | 0.0005 | 0.0007 | 0.0006 |
Two-sided, Experiment 1: e = -1 for g_{1} = 100; e = 1.5 for g_{2} = 100 | | | |
Two-sided, Experiment 2: e = 2 for g_{1} = 100; e = -1 for g_{2} = 100 | | | |
PSR Mean | 0.9377 | 0.6670 | 0.7327 |
PSR Var | 0.0003 | 0.0010 | 0.0007 |
FDR Mean | 0.1622 | 0.2270 | 0.2122 |
FDR Var | 0.0005 | 0.0009 | 0.0007 |
Table 2. The simulation settings and results for five platforms with continuous data
Setting / measure | Multi-plat | 1st ind. | 2nd ind. | 3rd ind. | 4th ind. | 5th ind. |
---|---|---|---|---|---|---|
Scenario 1: ρ = 0; g = g_{1} + g_{2} = 200 | | | | | | |
Exp1: e = 1.5 for g = 200 | | | | | | |
Exp2: e = 1.5 for g_{1} = 100; e = 1 for g_{2} = 100 | | | | | | |
Exp3: e = -0.5 for g_{1} = 100; e = -2 for g_{2} = 100 | | | | | | |
Exp4: e = -1 for g_{1} = 100; e = 1.5 for g_{2} = 100 | | | | | | |
Exp5: e = 2 for g_{1} = 100; e = -1 for g_{2} = 100 | | | | | | |
PSR Mean | 0.9517 | 0.5601 | 0.4130 | 0.4464 | 0.4213 | 0.4471 |
PSR Var | 0.0002 | 0.0012 | 0.0011 | 0.0004 | 0.0010 | 0.0005 |
FDR Mean | 0.1572 | 0.2605 | 0.3299 | 0.3108 | 0.3205 | 0.2727 |
FDR Var | 0.0004 | 0.0011 | 0.0018 | 0.0009 | 0.0010 | 0.0010 |
Scenario 2: ρ = 0.5; g = g_{1} + g_{2} = 200 | | | | | | |
Exp1: e = 1.5 for g = 200 | | | | | | |
Exp2: e = 1.5 for g_{1} = 100; e = 1 for g_{2} = 100 | | | | | | |
Exp3: e = -0.5 for g_{1} = 100; e = -2 for g_{2} = 100 | | | | | | |
Exp4: e = -1 for g_{1} = 100; e = 1.5 for g_{2} = 100 | | | | | | |
Exp5: e = 2 for g_{1} = 100; e = -1 for g_{2} = 100 | | | | | | |
PSR Mean | 0.9998 | 0.8360 | 0.6655 | 0.5682 | 0.6712 | 0.5699 |
PSR Var | 2.7e-06 | 0.0006 | 0.0010 | 0.0004 | 0.0010 | 0.0008 |
FDR Mean | 0.1281 | 0.1898 | 0.2217 | 0.2593 | 0.2314 | 0.2093 |
FDR Var | 0.0004 | 0.0006 | 0.0009 | 0.0007 | 0.0007 | 0.0008 |
Table 3. The simulation settings and results for two platforms with continuous data and discrete data
Setting / measure | Multi-platform | 1st individual | 2nd individual |
---|---|---|---|
Experiment 1: Continuous; ρ = 0; e = 0.5 for g_{1} = 100; e = 2 for g_{2} = 100 | | | |
Experiment 2: Discrete; μ_{qn 1} = 5, e = 3 for g = 200 | | | |
PSR Mean | 0.7356 | 0.5327 | 0.5228 |
PSR Var | 0.0008 | 0.0004 | 0.0012 |
FDR Mean | 0.1967 | 0.2702 | 0.2763 |
FDR Var | 0.0008 | 0.0012 | 0.0012 |
Table 4. True positives and false discovery rates with π = 0.8
Method / statistic | α = 0.05 | α = 0.01 | α = 0.005 |
---|---|---|---|
$\widehat{FP}$ | 40 | 8 | 4 |
Multi-platform | | | |
$\widehat{TP}$ | 224 | 165 | 143 |
(std) | 6.5547 | 6.0820 | 5.5202 |
FP | 44.8125 | 8.0250 | 3.8375 |
(std) | 7.3348 | 3.4778 | 2.263 |
FDR | 0.1563 | 0.0386 | 0.0214 |
(std) | 0.0219 | 0.0161 | 0.0125 |
$\widehat{FDR}$ | 0.1428 | 0.0388 | 0.0225 |
(std) | 0.0041 | 0.0014 | 0.0009 |
1st individual | | | |
$\widehat{TP}$ | 165 | 107 | 91 |
(std) | 8.8797 | 5.3066 | 4.9031 |
FP | 50.5125 | 9.9000 | 4.6500 |
(std) | 8.9101 | 3.4982 | 2.1766 |
FDR | 0.2431 | 0.0736 | 0.0406 |
(std) | 0.0326 | 0.0246 | 0.0183 |
$\widehat{FDR}$ | 0.1940 | 0.0600 | 0.0353 |
(std) | 0.0103 | 0.0030 | 0.0019 |
2nd individual | | | |
$\widehat{TP}$ | 197 | 106 | 79 |
(std) | 7.2442 | 8.2303 | 6.3222 |
FP | 48.9250 | 9.6000 | 5.000 |
(std) | 7.1862 | 3.5750 | 2.5376 |
FDR | 0.1986 | 0.0721 | 0.0506 |
(std) | 0.0245 | 0.0258 | 0.0251 |
$\widehat{FDR}$ | 0.1630 | 0.0607 | 0.0408 |
(std) | 0.0060 | 0.0048 | 0.0033 |
Results on real data
In this section, we apply our method to data from a study of growth and stationary phase adaption in Streptomyces coelicolor provided by Jayapal et al. [16]. The data set contains both isobaric stable isotope labeled peptide (iTRAQ^{ TM })-derived shotgun proteomic data and DNA microarray transcriptome data. To study different growth stages of S. coelicolor M145 cells, eight time point cell samples (7, 11, 14, 16, 22, 26, 34, and 38 h) were collected. Because the iTRAQ^{ TM } system can only analyze four distinct samples in a single experiment, the eight protein samples were distributed across three runs of mass spectrometric (MS) analysis. The protein sample from 11 h was run in all three MS experiments, so it serves as a reference. Therefore, protein abundance ratios ${r}_{j/11\mathrm{hr},k}^{i}$ were obtained from experimental run k for protein i in sample j hr with respect to the 11 h reference. Protein identification and quantification were carried out by comparing the raw spectral data against a theoretical proteome of S. coelicolor using ProteinPilot^{ TM } software and the inbuilt Paragon^{ TM } search engine. Only proteins identified with ≥ 99% confidence were considered for further analysis. Finally, all identified proteins were further processed to yield a protein abundance ratio with respect to the first time point (7 h) sample using ${r}_{j/7\mathrm{hr}}^{i}={r}_{j/11\mathrm{hr}}^{i}/{r}_{7\mathrm{hr}/11\mathrm{hr}}^{i}$. Ultimately, only 886 proteins identified in the 7 h sample could be used for our analysis.
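The change of reference from the 11 h sample to the 7 h sample is a ratio of ratios, which the following small Python sketch works through for one protein. The function name and the numerical ratios are hypothetical and serve only to illustrate the formula above.

```python
def rereference_to_7h(ratios_vs_11h):
    """Convert abundance ratios relative to 11 h into ratios relative to 7 h.

    ratios_vs_11h maps a time point (in hours) to r_{j/11hr} for one protein
    and must contain the 7 h entry.
    """
    r_7_vs_11 = ratios_vs_11h[7]
    return {hour: r / r_7_vs_11 for hour, r in ratios_vs_11h.items()}


# Hypothetical ratios for one protein within one MS run
ratios_vs_11h = {7: 0.5, 11: 1.0, 14: 1.5, 16: 2.0}
ratios_vs_7h = rereference_to_7h(ratios_vs_11h)
```

After the conversion the 7 h entry equals 1 by construction, and a protein that cannot be quantified at 7 h has no valid denominator, which is consistent with only proteins identified in the 7 h sample being usable.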
For microarray data, total mRNA from the same eight time point samples was isolated and a spotted DNA microarray experiment was conducted. Hybridization was performed using genomic DNA (gDNA) as a reference. The mRNA abundance was obtained using log_{2}[cDNA/gDNA]. To be consistent with the protein data, mRNA abundance data from different samples were processed to calculate log_{2}[cDNA_{i}/cDNA_{7hr}] for each sample i with respect to the first time point sample. Only gene expression values with protein values (894 genes) were analyzed. To deal with missing values, we deleted genes that had no values for mRNA at all or had at least five missing values in the protein data set. The remaining missing values were imputed using the R package MICE. In total, the number of genes suitable for the subsequent integrative analysis was 886. Based on the growth curve, time points were divided into two groups: those from 7, 11, 14 and 16 h represented the growth phase and those from 22, 26, 34 and 38 h represented the stationary phase.
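On the log scale the same change of reference becomes a subtraction, and the eight time points are then split into the two phases that play the role of the two groups in the test. The sketch below assumes that the gDNA reference is comparable across arrays, so that log_{2}[cDNA_{i}/cDNA_{7hr}] can be obtained as a difference of log-ratios; the function names and the expression values are hypothetical.

```python
GROWTH_HOURS = (7, 11, 14, 16)
STATIONARY_HOURS = (22, 26, 34, 38)


def rereference_log_ratios(log2_vs_gdna):
    """Turn log2[cDNA/gDNA] per time point into log2[cDNA/cDNA_7hr]."""
    baseline = log2_vs_gdna[7]
    return {hour: value - baseline for hour, value in log2_vs_gdna.items()}


def split_by_phase(values_by_hour):
    """Group the per-time-point values into growth and stationary phases."""
    growth = [values_by_hour[h] for h in GROWTH_HOURS]
    stationary = [values_by_hour[h] for h in STATIONARY_HOURS]
    return growth, stationary


# Hypothetical log2[cDNA/gDNA] values for one gene
log2_vs_gdna = {7: 1.0, 11: 1.5, 14: 1.25, 16: 2.0,
                22: 3.0, 26: 3.5, 34: 2.75, 38: 3.25}
log2_vs_7h = rereference_log_ratios(log2_vs_gdna)
growth, stationary = split_by_phase(log2_vs_7h)
```

The two lists returned by `split_by_phase` can then be passed to whatever two-group statistic is chosen for the microarray platform.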
Table 5. SCO summaries for the 9 genes that are identified by the multi-platform integration method but not by the individual platform analyses
SCO | Sanger | Sanger | Sanger | Sanger | TIGR | Related |
---|---|---|---|---|---|---|
abbreviation | annotation | category | subcategory | category | paper* | |
SCO1958 | uvrA | ABC excision | Macromolecule | DNA-replication, | excinuclease ABC, | [17] |
nuclease subunit A | metabolism | repair, restr./modific’n | A subunit | [17] | ||
SCO2940 | other | putative | Not classified | Not classified | xanthine | |
oxidoreductase | (included putative | (included putative | dehydrogenase, | |||
assignments) | assignments) | putative | ||||
SCO2951 | other | putative malate | Central intermediary | Other central | malate | |
oxidoreductase | metabolisms | intermediary metabolism | oxidoreductase | |||
SCO3094 | other | conserved | hypothetical | Conserved in | conserved | |
hypothetical | protein | organism other than | hypothetical | |||
protein | protein | Escherichia coli | protein | |||
SCO4661 | fusA | elongation | Macromolecule | Proteins - | translation | |
factor G | metabolism | translation and | elongation | |||
modification | factor G | |||||
SCO5072 | actVIORF1 | hydroxylacyl-CoA | Secondary | PKS | hydroxylacyl-CoA | |
dehydrogenase | metabolism | PKS | dehydrogenase | |||
SCO5080 | actVA5 | putative | Secondary | PKS | putative | |
hydrolase | metabolism | PKS | hydrolase | |||
SCO6219 | Other | putative ATP/GTP | Protein | Serine/ | [17] | |
binding protein, | kinases | threonine | ||||
putative serine | ||||||
SCO6222 | other | putative | Not classified | Not classified | aminotransferase, | |
aminotransferase | (included putative | (included putative | class I | |||
assignments) | assignments) |
Discussion
Table 6. Additional simulations
Setting / measure | Multi-plat | 1st ind. | 2nd ind. |
---|---|---|---|
Scenario 1: Extremely small sample size; two measurements from each group | | | |
PSR Mean | 0.3022 | 0.2363 | 0.2179 |
PSR Var | 0.0009 | 0.0006 | 0.0007 |
FDR Mean | 0.3782 | 0.4436 | 0.4694 |
FDR Var | 0.0023 | 0.0025 | 0.0027 |
Scenario 2: Correlation among platforms set to 0.5; disease and normal groups are independent | | | |
PSR Mean | 0.6689 | 0.5365 | 0.5578 |
PSR Var | 0.0009 | 0.0008 | 0.0011 |
FDR Mean | 0.2255 | 0.2690 | 0.2641 |
FDR Var | 0.0008 | 0.0010 | 0.0010 |
Scenario 3: Non-standardized version of t_{ m } and t_{ p }, i.e. t_{ m } = $\overline{x}_{2}-\overline{x}_{1}$, t_{ p } = $\overline{y}_{2}-\overline{y}_{1}$ | | | |
PSR Mean | 0.8142 | 0.5479 | 0.5992 |
PSR Var | 0.0009 | 0.0005 | 0.0010 |
FDR Mean | 0.1586 | 0.2358 | 0.2235 |
FDR Var | 0.0006 | 0.0011 | 0.0010 |
We also consider the situation in which data on the same biomarker from Q platforms have a multivariate distribution and the data from the diseased group are independent of those from the control group. The new simulation results are summarized in Table 6, scenario 2. The correlation between the platforms is set to 0.5, and the other parameters are the same as in Table 1, scenario 1, right-sided test. Due to the high correlation among the platforms, the gain in power of the aggregated method is less pronounced than in the independence case. This is because different platforms contribute overlapping information when they are highly correlated.
The proposed method allows different ways of constructing t_{ m } and t_{ p } as long as they provide summarized statistical evidence for that platform. The Student’s t-statistic is adopted in the paper simply for illustration purposes. Alternatively, we can simply use the unstandardized differences: ${t}_{m}={\overline{x}}_{2}-{\overline{x}}_{1}$, and ${t}_{p}={\overline{y}}_{2}-{\overline{y}}_{1}$. Then we proceed with the randomization, obtain the estimated variances for t_{ m } and t_{ p } and form a weighted linear sum statistic. To compare the empirical performance of the standardized versus unstandardized versions, we conduct simulations under scenario 1 of Table 1 with the right-sided test. The results are summarized in Table 6, scenario 3. The two versions have comparable performance in terms of PSR and FDR. The unstandardized version of t_{ m } and t_{ p } has a slightly higher PSR and a slightly lower FDR.
Table 7. Comparison with the quadratic test statistic t_{ Q }
Setting / measure | Multi-plat | Quadratic |
---|---|---|
PSR Mean | 0.9377 | 0.9155 |
PSR Var | 0.0003 | 0.0004 |
FDR Mean | 0.1622 | 0.1804 |
FDR Var | 0.0005 | 0.0005 |
Quadratic: Exp1: e = -1 for g_{1} = 100; e = 1.5 for g_{2} = 100 | | |
Exp2: e = 2 for g_{1} = 100; e = -1 for g_{2} = 100 | | |
Table 8. Comparison with Robust Rank Aggregation Method
Setting | Measure | Multi-plat | RRA |
---|---|---|---|
1. ρ = 0.5; g = g_{1} + g_{2} = 100 | | | |
Exp1: e = 1.5 for g = 200 | PSR Mean | 1.000 | 0.7497 |
Exp2: e = 1.5 for g_{1} = 100; e = 1 for g_{2} = 100 | PSR Var | 1.98e-6 | 0.0012 |
Exp3: e = -0.5 for g_{1} = 100; e = -2 for g_{2} = 100 | FDR Mean | 0.2803 | 0.0912 |
Exp4: e = -1 for g_{1} = 100; e = 1.5 for g_{2} = 100 | FDR Var | 0.0011 | 0.0003 |
Exp5: e = 2 for g_{1} = 100; e = -1 for g_{2} = 100 | | | |
2. ρ = 0.5; g = g_{1} + g_{2} = 200 | | | |
Exp1: e = 1.5 for g = 100 | PSR Mean | 0.9995 | 0.4995 |
Exp2: e = 1.5 for g_{1} = 50; e = 1 for g_{2} = 50 | PSR Var | 0.23e-06 | 0.0008 |
Exp3: e = -0.5 for g_{1} = 50; e = -2 for g_{2} = 50 | FDR Mean | 0.1399 | 0.0823 |
Exp4: e = -1 for g_{1} = 50; e = 1.5 for g_{2} = 50 | FDR Var | 0.0004 | 0.0004 |
Exp5: e = 2 for g_{1} = 50; e = -1 for g_{2} = 50 | | | |
3. ρ = 0.5; g = g_{1} + g_{2} = 400 | | | |
Exp1: e = 1.5 for g = 100 | PSR Mean | 0.9992 | 0.1133 |
Exp2: e = 1.5 for g_{1} = 50; e = 1 for g_{2} = 50 | PSR Var | 2.23e-6 | 0.0002 |
Exp3: e = -0.5 for g_{1} = 50; e = -2 for g_{2} = 50 | FDR Mean | 0.0402 | 0.0796 |
Exp4: e = -1 for g_{1} = 50; e = 1.5 for g_{2} = 50 | FDR Var | 0.0001 | 0.0015 |
Exp5: e = 2 for g_{1} = 50; e = -1 for g_{2} = 50 | | | |
Conclusion
With the advent of various types of genomic technologies, it is imperative to develop a method that can integrate different types of genomic data to solve biological questions. We develop a general framework for data integration across multiple data platforms. For each data set, a test statistic is formed to summarize the statistical evidence toward the specific null hypothesis tailored to the data platform. The types of test statistics can vary and their marginal distributions can be different. The observed test statistics can then be aggregated across different data platforms. The overall decision is based on the empirical distribution of the aggregated statistic obtained through random permutations. Our method can accommodate different experimental designs and various data types across platforms.
Declarations
Acknowledgements
The authors are grateful to Dr. Lei Nie for his discussion of and comments on our project. The authors also thank the editor, the associate editor and three referees, whose comments and suggestions led to a much improved manuscript.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.