Empirical Bayes method for reducing false discovery rates of correlation matrices with block diagonal structure
 Clare Pacini^{1, 3}Email authorView ORCID ID profile,
 James W. Ajioka^{2} and
 Gos Micklem^{1, 3}
DOI: 10.1186/s128590171623y
© The Author(s) 2017
Received: 5 July 2016
Accepted: 1 April 2017
Published: 12 April 2017
Abstract
Background
Correlation matrices are important in inferring relationships and networks between regulatory or signalling elements in biological systems. With currently available technology sample sizes for experiments are typically small, meaning that these correlations can be difficult to estimate. At a genomewide scale estimation of correlation matrices can also be computationally demanding.
Results
We develop an empirical Bayes approach to improve covariance estimates for gene expression, where we assume the covariance matrix takes a block diagonal form. Our method shows lower false discovery rates than existing methods on simulated data. Applied to a real data set from Bacillus subtilis we demonstrate it’s ability to detecting known regulatory units and interactions between them.
Conclusions
We demonstrate that, compared to existing methods, our method is able to find significant covariances and also to control false discovery rates, even when the sample size is small (n=10). The method can be used to find potential regulatory networks, and it may also be used as a preprocessing step for methods that calculate, for example, partial correlations, so enabling the inference of the causal and hierarchical structure of the networks.
Keywords
Empirical Bayes CorrelationBackground
Correlation analysis is a common approach for identifying relationships between the expression of genes or proteins. Initially the covariance matrix would be calculated which may then be either inverted to find a partial correlation matrix, or standardised to give the correlation matrix. The correlation matrix is often used in downstream analysis such as graphical inference or using partial correlations, that further provides causal information. One problem with this type of analysis is that often, by necessity, sample sizes are small and this reduces the power of detecting correlations and increases the false discovery rate.
Methods for improving the estimation of correlations have to date included the shrinkage approach, Corpcor [1]. However, this method is applied uniformly to the full correlation matrix. In practice, recent graphical model approaches have used the block diagonal form of the covariance matrix to improve computational efficiency and interpretation [2]. In our case, we assume that sets of genes, within an operon for prokaryotes or under the control of one transcription factor for eukaryotes, will strongly correlate with each other. Such correlated sets are then represented by a single block, and the full correlation matrix is then split into separate blocks or transcriptional units. Consequently, a block diagonal structure of the covariance matrix is used to model this. A different covariance matrix in block diagonal form is used for each condition present in the data, meaning that different transcriptional units may be present in a case compared to a control matrix.
In conjunction with assuming the block diagonal form of the covariance matrix, we have taken an empirical Bayes approach to model the covariance, and similarly calculate correlations. This approach uses the data to generate the prior information matrix and borrows information from across the genes to estimate covariances in such a way as to lower the false discovery rates (FDRs). This is important in downstream analysis where we would like to limit the number of false discoveries when identifying candidates to test experimentally.
Implementation
We use an empirical Bayes approach to infer covariances and by simple extension, correlations. The theoretical basis of this, using conjugate priors, is derived in [3]. In that paper, Champion et al. use an independence prior, and a flat prior with constant covariance. We introduce a new prior that is a mixture of these two extremes, the block diagonal prior.
From this we can see that the inverse Wishart requires two hyperparameters, the matrix z and scalar λ.
Calculating hyperparameters
In empirical Bayes methods both hyperparameters λ and z are estimated from the data, rather than using a hierarchical model and assigning a prior distribution to each of the parameters, or by having to choose the parameter values where little prior knowledge is available. In choosing the matrix z we are looking for an appropriate prior matrix for the covariance matrix η. The Inverse Wishart distribution requires that the matrix z is positive definite, block diagonal matrices are positive definite if and only if each of the blocks are positive definite. Therefore we construct a block diagonal prior matrix where each of the blocks have a constant (non perfect) correlation. The estimated correlation matrix from this method η ^{0} is then a mixture of the sample correlation matrix S and the block diagonal prior matrix z.
As we are using a block diagonal matrix for z we have added an extra level of estimation to the model over other methods that fix the structure of the prior matrix z a priori as, for example, the independence or flat correlation matrix. By using the block diagonal matrix the different groupings or blocks need to be determined for each data set separately. There are three methods for generating the blocks for the matrix z available in the package covEB. The first is to provide a list of the block assignments for each variable in the covariance matrix. Each entry in the list should comprise a set of variables that together will form one block. The algorithm then uses the average correlation within these blocks to calculate a constant value for this block in the matrix z, with all elements outside the block set to zero.
The second method is to provide a threshold parameter (or correlation level). This could be based on prior knowledge or calculated using an existing method for thresholding covariance matrices such as those described in Bickel et al. [4]. Given this correlation level (used for simplicity to allow the user to set thresholds in [1,1]), we set all elements of the sample correlation matrix below this threshold to zero and identify the block diagonal structure. This is done by treating the correlation matrix as a weighted graph and using the cluster function in the R package igraph [5] to separate the graph into disjoint subnetworks such that each element in one subnetwork is completely separate from all elements in the other subnetworks, that is the correlation between them is zero. Each subnetwork then represents one block in the block diagonal matrix.
The third method requires no prior information from the user, instead the Akaike’s Information Criteria (AIC) metric is used to select the threshold level that determines the block diagonal structure. The AIC is defined as AIC=2p−2 ln(L), where p are the number of parameters estimated in the model and ln(L) is the loglikelihood of the data given the model. In our case, we calculate the likelihood of observing the given covariance matrix for different block diagonal matrices (models). We assume the data is from a multivariate normal distribution with covariance matrix Ω. Ω is the diagonal matrix calculated (as outlined above) for different threshold values, this threshold value is chosen to minimise the AIC statistic.
Simulated data
For the simulated data we use the method of Hardin to generate block diagonal matrices [6]. We generated 100 sample covariance matrices for 50 genes, each covariance matrix had a block diagonal structure with 5 blocks of different sizes. Correlations within blocks were set to 0.7, random samples were generated for each of these correlation matrices for three different sample sizes of 10, 15 and 20. We include a simulated dataset under these parameters in the R package covEB. We denote the empirical Bayes method using a block diagonal prior as EB, for comparison two other methods are used to calculate the correlation matrix, the Pearson correlation matrix, and the Corpcor method of Schäfer et al. are used to estimate elements within blocks and all the elements outside these blocks are set to zero. We calculated the Frobenius norm between each of the estimation methods and the true block diagonal matrix. We estimated the correlation matrices and calculated the average false and true positive rates and their standard deviations over the 100 simulations, for each of the methods mentioned above.
Biological data
The Bacillus subtilis data were taken from the paper [7]. A subset of 19 samples that are clustered according to the affinity propagation clustering with Euclidean distance was used. The genes were filtered leaving those with variance above the median. For each method we used known transcriptional unit information from BSubCyc (http://www.bsubcyc.org) to generate the prior groupings, selecting 56 known transcriptional units each with at least 5 genes in them.
Results and discussion
Simulation results for the empirical Bayes method with known groupings (blocks)
Pearson block  Corpcor block  EB block  EB  

Mean n=10  5.67  20.41  4.08  9.23 
Sd n=10  1.49  1.44  1.64  1.72 
Mean n=15  4.10  17.99  2.81  7.59 
Sd n=15  0.98  0.93  1.05  1.13 
Mean n=20  4.08  18.42  2.69  7.39 
Sd n=20  0.76  0.80  0.75  0.90 
Simulation results with known threshold level and groups then estimated from the data
Pearson  Corpcor  EB  EB AIC  

Mean n=10  19.27  23.48  17.65  17.86 
Sd n=10  3.42  3.70  3.34  3.14 
Mean n=15  12.30  17.45  12.72  13.89 
Sd n=15  4.34  2.04  3.81  3.21 
Mean n=20  10.52  17.52  11.46  13.19 
Sd n=20  3.96  1.81  3.28  2.81 
Simulation results with known groups (first three columns) and using a known threshold level to estimate groups or blocks from the data (threshold) and the EB AIC method for when neither the threshold or groups are known
EB  Corpcor  Pearson  EB threshold  CorpCor threshold  Pearson threshold  EB AIC  

n=10  FDR mean  0.00  0.00  0.00  0.04  0.26  0.16  0.04 
FDR sd  0.00  0.00  0.00  0.09  0.12  0.09  0.09  
TPR mean  0.70  0.54  0.66  0.39  0.56  0.66  0.36  
TPR sd  0.21  0.12  0.15  0.21  0.13  0.15  0.21  
n=15  FDR mean  0.00  0.00  0.00  0.09  0.16  0.13  0.06 
FDR sd  0.00  0.00  0.00  0.11  0.12  0.11  0.11  
TPR mean  0.90  0.32  0.83  0.75  0.59  0.81  0.61  
TPR sd  0.10  0.08  0.10  0.19  0.19  0.12  0.21  
n=20  FDR mean  0.00  0.00  0.00  0.11  0.11  0.10  0.10 
FDR sd  0.00  0.00  0.00  0.10  0.10  0.09  0.10  
TPR mean  0.97  0.44  0.90  0.91  0.55  0.88  0.79  
TPR sd  0.04  0.09  0.06  0.09  0.14  0.08  0.11 
For the Bacillus subtilis data, we calculated the FDR rates based on the known transcriptional units for the EB method at 6%, this is an acceptable error level for most experiments that usually aim for a FDR of 5%. By construction with the Pearson and Corpcor method we set elements outside blocks to zero, therefore comparing these to the prior information the FDR for both methods is zero. As a comparison, without imposing the block diagonal structure on these matrices, the Pearson and Corpcor method had an FDR of 20% and 39% respectively. The EB method also resulted in similar or improved true positive rates of 93% compared to 92% and 43% for Pearson and Corpcor respectively.
Conclusions
Above we show that the EB method is closer to the blocks within the true matrix, as calculated using the Frobenius norm, for each of the sample sizes when using either the known groupings or those generated from the simulated data (using a known thresholding level). Further using a 5% significance level the EB method has lower false discovery rates than the Pearson covariance matrix and existing Corpcor method.
The simulation results indicate that we are able to improve the false discovery rates when estimating correlations that can be used in downstream analysis. The EB shows particular improvements when the sample size was as small as ten replicates. This is important as many experiments have comparable levels of replication by necessity.
Further, controlling the false discovery rate is particularly useful when the network inferences are used to drive experimental hypotheses, as we are interested in identifying the highest possible value links for subsequent laboratory analysis. The EB method was also able to find connections between transcriptional units sharing the same regulator (PurR). This shows how the EB method is flexible, controlling error rates whilst also allowing significance connections between genes or transcriptional units that are supported by the data.
Availability of data and materials
Project name: covEB
Project home page: http://bioconductor.org/packages/covEB/
Operating system(s): Platform independent
Programming language: R
Other requirements: R (≥ 3.3)
License: GPL3
Any restrictions to use by nonacademics: None
Abbreviations
 cys(C,H,P):

Cysteine C,H,P
 EB:

Empirical Bayes
 FDRs:

False discovery rates
 hem(B,C,X):

Heme B,C,X’
 PerR:

Transcriptional repressor of the peroxide region
 PurR:

Purine biosynthesis
 TPR:

True positive rate
 YtrA:

Transcriptional regulator YtrA (GntR family)
Declarations
Acknowledgements
We would like to thank Simon Tavaré for his advice and useful comments on the manuscript. The authors are also grateful to the reviewer for their suggestions that greatly improved this manuscript.
Funding
CP is funded by the Engineering and Physical Sciences Research Council (EPSRC) doctoral program.
Authors’ contributions
CP conceived study and carried out the work with advice from JWA and GM. CP, JWA and GM wrote the manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Schäfer J, Strimmer K. A shrinkage approach to largescale covariance matrix estimation and implications for functional genomics. Stat Appl Genet Mol Biol. 2005; 4(1):32.Google Scholar
 Danaher P, Wang P, Witten DM. The joint graphical lasso for inverse covariance estimation across multiple classes. J R Stat Soc Ser B Stat Methodol. 2014; 76(2):373–97.View ArticleGoogle Scholar
 Champion CJ. Empirical Bayesian estimation of normal variances and covariances. J Multivar Anal. 2003; 87(1):60–79.View ArticleGoogle Scholar
 Bickel PJ, Levina E. Covariance regularization by thresholding. Ann Stat. 2008; 36(6):2577–604.View ArticleGoogle Scholar
 Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal. 2006; Complex Systems:1695. http://igraph.org.Google Scholar
 Hardin J, Garcia SR, Golan D. A method for generating realistic correlation matrices. Ann Appl Stat. 2013; 7(3):1733–62.View ArticleGoogle Scholar
 Nicolas P, Mäder U, Dervyn E, Rochat T, Leduc A, Pigeonneau N, et al. Conditiondependent transcriptome reveals highlevel regulatory architecture in Bacillus subtilis. Science. 2012; 335(6072):1103–6.View ArticlePubMedGoogle Scholar