Data and essentiality scoring
The data used in this study come from publicly available CRISPR knockout screen datasets, downloaded from the Cancer Dependency Map database (Avana dataset) [10]. Four different pipelines for measuring gene knockout fitness effects (gene essentiality) were used: BAGEL2 [12], a Z-score model [18], Ceres [14], and Chronos [15]. The BAGEL2 algorithm generates log Bayes Factors (BF) to report gene essentiality for each cell line, with positive scores indicating essentiality. Z-score values are generated using a Gaussian mixture model, with negative scores indicating essentiality. The Ceres algorithm removes principal components highly related to copy-number-specific effects and scales the data so that the median essential score is −1 and the median non-essential score is 0. Chronos models the read-count data assuming a negative binomial distribution and removes copy-number-related bias, with the scores scaled similarly to Ceres.
Bayes Factors (BF) and Z-scores were calculated using raw read count data from the DepMap 20Q4 release. Ceres scores were downloaded from the 20Q4v2 release, and Chronos gene effect scores were downloaded from the 22Q2 release. We considered only the genes and cell lines common to the BF, Z-score, and Ceres datasets, which resulted in genes-by-cell-lines data matrices of size 17,834 × 730. The Chronos data were processed to include only the genes and cell lines present in the intersection with the Ceres, BF, and Z-score data, resulting in a data matrix of size 17,662 × 727.
Normalization techniques
We compared four normalization techniques: two variance normalization methods and two covariance normalization methods. Quantile normalization is a variance normalization technique applied to mitigate screen-quality bias and to allow comparison between different samples. It first ranks the genes within each sample by magnitude, computes the mean of the genes occupying the same rank across samples, substitutes the values of all genes at that rank with the mean value, and then restores the genes to their original order (a minimal sketch is shown below). The Boyle principal component analysis approach aims to remove the technical confounding introduced by olfactory receptor genes, which have highly correlated profiles across different genetic backgrounds [7]. This method applies principal component analysis to the gene-by-cell-line essentiality matrix restricted to olfactory receptors and then subtracts the first four principal components from the original score matrix, implemented here as described in Wainberg et al. [9].
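The quantile normalization step can be illustrated with the following minimal sketch, assuming the input is a genes-by-cell-lines pandas DataFrame; it is illustrative only and may differ in detail from the pipeline code.

```python
import numpy as np
import pandas as pd

def quantile_normalize(scores: pd.DataFrame) -> pd.DataFrame:
    """Quantile-normalize a genes-by-cell-lines matrix so every column
    (cell line) shares the same empirical distribution."""
    # Mean value at each rank, averaged across cell lines
    rank_means = pd.DataFrame(np.sort(scores.values, axis=0)).mean(axis=1)
    rank_means.index = np.arange(1, len(rank_means) + 1)   # ranks 1..n_genes
    # Rank genes within each cell line (ties broken by order of appearance),
    # then map each rank back to its mean value
    ranks = scores.rank(axis=0, method="first").astype(int)
    return ranks.apply(lambda col: col.map(rank_means))
```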
The covariance normalization methods, also known as whitening or sphering transformations, aim to remove dependencies between features in the data matrix. These methods apply a linear transformation to the data matrix X using a “whitening” matrix W, such that the resulting normalized data matrix \(\tilde{X}\) has a covariance matrix equal to the identity matrix (Eq. 1). To perform this, the data matrix is first centered by subtracting the mean across all samples, and the covariance matrix, denoted \(\Sigma\), is computed (Eq. 2). The covariance matrix is symmetric and positive semi-definite (all eigenvalues are non-negative); provided it is full rank, it can be inverted and decomposed as a product of two simpler matrices. Many “whitening” matrices can perform the linear transformation described above, and we used two different options that satisfy the condition. The first, described in Wainberg et al. [9], uses a Cholesky decomposition of the inverse covariance matrix (Eq. 3), with the linear transformation performed as described in (Eq. 4). The second, termed PCA whitening, uses the eigen-decomposition of the covariance matrix (Eq. 5), with the transformation performed as described in (Eq. 6).
$$\tilde{X} = WX,\;s.t.\;Cov\left( {\tilde{X}} \right) = I$$
(1)
$$Let\;Cov\left( X \right) = \Sigma$$
(2)
Cholesky whitening:
$$\Sigma^{ - 1} = LL^{T}$$
(3)
$$Let\;W = L^{T} \to \tilde{X} = L^{T} X.$$
(4)
PCA whitening:
$$\Sigma = EDE^{ - 1}$$
(5)
$$Let\;W = D^{ - 1/2} E^{T} \to \tilde{X} = D^{ - 1/2} E^{T} X$$
(6)
where \(D = \mathrm{diag}(\boldsymbol{\lambda})\) is a diagonal matrix containing the eigenvalues \(\lambda_{i}\) on its diagonal, and E is the orthogonal matrix of eigenvectors.
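As a concrete reference for Eqs. (1)–(6), the following is a minimal numpy sketch of both whitening transformations. It assumes the input matrix is arranged features-by-observations, so that W acts on the feature dimension as in Eq. (1), and that the covariance matrix is full rank (otherwise the inverse and the Cholesky factor do not exist); it is an illustration rather than the exact pipeline code.

```python
import numpy as np

def cholesky_whiten(X: np.ndarray) -> np.ndarray:
    """Whiten X (features x observations) via Cholesky decomposition
    of the inverse covariance matrix (Eqs. 2-4)."""
    Xc = X - X.mean(axis=1, keepdims=True)        # center each feature
    sigma = np.cov(Xc)                            # covariance matrix (Eq. 2)
    L = np.linalg.cholesky(np.linalg.inv(sigma))  # Sigma^-1 = L L^T (Eq. 3)
    return L.T @ Xc                               # X_tilde = L^T X (Eq. 4)

def pca_whiten(X: np.ndarray) -> np.ndarray:
    """Whiten X via eigen-decomposition of the covariance matrix (Eqs. 5-6)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    sigma = np.cov(Xc)
    eigvals, E = np.linalg.eigh(sigma)            # Sigma = E D E^T
    D_inv_sqrt = np.diag(1.0 / np.sqrt(eigvals))  # requires full-rank covariance
    return D_inv_sqrt @ E.T @ Xc                  # X_tilde = D^{-1/2} E^T X (Eq. 6)
```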
Statistical measures of similarity
We utilized two statistical measures to quantify co-essentiality: Pearson’s correlation coefficient (PCC) and ordinary least squares (OLS) regression. Using Pearson’s correlation, we calculate the correlation coefficient for all possible gene pairs (Eq. 7), resulting in a gene-by-gene correlation matrix, and rank the gene pairs by their correlation coefficient values.
$$\rho_{x,y} = \frac{{cov\left( {x,y} \right)}}{{\sigma_{x} \sigma_{y} }},\;{\text{where}}\;\sigma \;{\text{denotes}}\;{\text{standard}}\;{\text{deviation}}$$
(7)
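A minimal sketch of this step, assuming `scores` holds the genes-by-cell-lines essentiality matrix as a numpy array; np.corrcoef treats rows as variables, so the result is the gene-by-gene correlation matrix described above.

```python
import numpy as np

# scores: genes-by-cell-lines matrix (e.g., 17,834 x 730); each row is a gene profile
corr = np.corrcoef(scores)                 # gene-by-gene Pearson correlation matrix

# Rank gene pairs by correlation coefficient (upper triangle only, no self-pairs)
i, j = np.triu_indices_from(corr, k=1)
order = np.argsort(-corr[i, j])            # descending correlation
ranked_pairs = list(zip(i[order], j[order], corr[i, j][order]))
```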
Using ordinary least squares (implemented in Python3 using the numpy.linalg.lstsq function), we estimate the parameter vector \(b\) in the linear regression model \(y = bx + \varepsilon\), where \(y\) contains the essentiality scores of one gene and \(x\) is a two-column matrix whose first column contains the other gene’s essentiality scores and whose second column is an intercept column of all ones. We ran OLS for each gene pair and calculated log P-values, resulting in a gene-by-gene matrix of P-values. When the Cholesky whitening transformation is applied, both \(x\) and \(y\) are transformed by the triangular Cholesky matrix, which constitutes the generalized least squares (GLS) method described in Wainberg et al. [9]. We rank the gene pairs by the log P-values.
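The per-pair regression can be sketched as follows; this is an illustrative implementation that assumes the slope P-value is obtained from a two-sided t-test on the numpy.linalg.lstsq fit, and it may differ in detail from the published pipeline.

```python
import numpy as np
from scipy import stats

def ols_log_pvalue(x: np.ndarray, y: np.ndarray) -> float:
    """Regress y on x with an intercept and return log10 of the two-sided
    P-value for the slope."""
    n = len(x)
    X = np.column_stack([x, np.ones(n)])              # gene column + intercept column
    coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    dof = n - 2
    s2 = resid @ resid / dof                          # residual variance
    cov_beta = s2 * np.linalg.inv(X.T @ X)            # covariance of the estimates
    t_stat = coef[0] / np.sqrt(cov_beta[0, 0])        # t-statistic for the slope
    p = 2 * stats.t.sf(abs(t_stat), dof)              # two-sided P-value
    return np.log10(p)
```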
It is important to note that in least squares linear regression the slope \(b\) is given by (Eq. 8). Rewriting it as in (Eq. 9) and substituting (Eq. 7) shows how the correlation coefficient \(\rho\) factors in. When covariance normalization is applied, whether with Cholesky or PCA whitening, the transformed data have an identity covariance matrix, meaning all features are uncorrelated and the variance along each feature is one. With the variances (\(\sigma^{2}\)) equal to one in the covariance-normalized data, the slope reduces to the correlation coefficient, so PCC and OLS yield equivalent results.
$$b = \frac{{cov\left( {x,y} \right)}}{{\sigma_{x}^{2} }}$$
(8)
$$\frac{{cov\left( {x,y} \right)}}{{\sigma_{x}^{2} }} = \frac{{cov\left( {x,y} \right)}}{{\sigma_{x} \sigma_{y} }} \times \frac{{\sigma_{y} }}{{\sigma_{x} }} = \rho_{x,y} \frac{{\sigma_{y} }}{{\sigma_{x} }}$$
(9)
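As a brief numerical check of Eqs. (8)–(9): when both variables have unit variance, the OLS slope equals the Pearson correlation coefficient. The snippet below illustrates this with simulated data, where the rescaling mimics the unit-variance property of covariance-normalized data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.6 * x + rng.normal(size=500)

# Rescale to unit variance, mimicking the effect of covariance normalization
xs, ys = x / x.std(), y / y.std()

slope = np.cov(xs, ys)[0, 1] / xs.var(ddof=1)   # OLS slope (Eq. 8)
rho = np.corrcoef(xs, ys)[0, 1]                 # Pearson correlation (Eq. 7)
assert np.isclose(slope, rho)                   # identical when variances are one
```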
Evaluation of pathway enrichment with log-likelihood scoring
For each method, we took the top 50,000 ranked gene pairs and binned them into bins of 1,000 gene pairs for further analysis. To evaluate the performance of these methods, we conducted pathway enrichment tests with the log-likelihood scheme described by Lee et al. [19], using multiple annotated pathway databases. The log-likelihood scoring quantifies the accuracy of each network, i.e., its ability to reconstruct known biological pathways and processes. The pathway databases we considered are KEGG [21], Gene Ontology (GO) Biological Process [22], and Reactome [20]. Additionally, we considered pre-processed versions of GO and Reactome in which pathways containing gene sets involved in mitochondrial translation and oxidative phosphorylation were removed. We refer to these versions as CleanGO and CleanReactome. Only six pathways were removed from Reactome in this process, and the bulk of our analysis was done with the resulting CleanReactome. In the log-likelihood scoring pipeline, we filter the pathways within each reference file to include only those containing between 5 and 400 genes.
Using the gene pairs from each bin and the annotated reference set, we determine whether both genes in a pair are annotated in the same pathway (true positive) or in different pathways (false positive). If a gene from our sample is not annotated in any pathway in the reference set, the pair is not counted. We then compare the ratio of the frequencies of true positives and false positives in our sample with the background expectation, i.e., the ratio of the frequencies of all annotated gene pairs operating in the same pathway versus in different pathways in the reference set. The log-likelihood score is given by (Eq. 10).
$$LLS = \log_{2} \left( {\frac{{\Pr \left( {true\;pos} \right)/\Pr \left( {false\;pos} \right)}}{{\Pr \left( {all\;reference\;pos} \right)/\Pr \left( {all\;reference\;neg} \right)}}} \right)$$
(10)
Scores above zero indicate that the method tends to link genes in the same pathway, and higher scores indicate more confident linkages. Log-likelihood scores were computed both cumulatively and locally (per bin) for each method.
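The scoring of a single bin can be sketched as follows; this is an illustrative implementation of Eq. (10), assuming `gene2pathways` maps each annotated gene to the set of pathways containing it (unannotated genes are simply absent from the mapping). In practice the background ratio over all annotated gene pairs would be precomputed once per reference set.

```python
import numpy as np
from itertools import combinations

def log_likelihood_score(pairs, gene2pathways):
    """LLS of Eq. (10) for one bin of gene pairs."""
    tp = fp = 0
    for g1, g2 in pairs:
        if g1 not in gene2pathways or g2 not in gene2pathways:
            continue                                   # unannotated pairs are not counted
        if gene2pathways[g1] & gene2pathways[g2]:
            tp += 1                                    # shared pathway: true positive
        else:
            fp += 1                                    # no shared pathway: false positive

    # Background expectation over all pairs of annotated genes in the reference set
    ref_pos = ref_neg = 0
    for g1, g2 in combinations(gene2pathways, 2):
        if gene2pathways[g1] & gene2pathways[g2]:
            ref_pos += 1
        else:
            ref_neg += 1

    return np.log2((tp / fp) / (ref_pos / ref_neg))
```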
The hu.MAP 2.0 Protein Complexes List was downloaded from the humap2 website (http://humap2.proteincomplexes.org/static/downloads/humap2/humap2_complexes_20200809.txt) [24]. We created a list of unique gene pairs from each protein complex, a total of 57,911 gene pairs, and calculated overall LLS for this list. The LL score for hu.MAP 2.0 was 4.75, with coverage of 10,060 unique genes. We used this score as a reference against which to compare the performance of the networks formed with the various methods.
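The list of unique within-complex gene pairs can be built along the following lines; the column name ('genenames'), the comma-separated file format, and the space-separated gene symbols are assumptions about the downloaded file layout and should be adjusted to the actual humap2 file.

```python
from itertools import combinations
import pandas as pd

# Local copy of the humap2 complexes file referenced above (layout assumed)
complexes = pd.read_csv("humap2_complexes_20200809.txt")

pairs = set()
for members in complexes["genenames"]:
    genes = sorted(set(str(members).split()))
    pairs.update(combinations(genes, 2))   # all unique gene pairs within one complex
```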
Gene set enrichment analysis
To analyze the differences between the highest-scoring networks, we examined the enrichment of genes exclusive to those networks. We used the hu.MAP 2.0 LLS as a cut-off for the PCA-whitening covariance-normalized Ceres, Bayes Factor, and Chronos networks. We identified the edges and genes exclusive to each network and conducted enrichment analysis on the exclusive gene sets using the GSEAPY “enrichr” Python module [31] with the reference sets ‘KEGG_2021_Human’, ‘GO_Biological_Process_2021’, ‘GO_Cellular_Component_2021’, and ‘GO_Molecular_Function_2021’.
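A minimal sketch of the enrichment call, where `exclusive_genes` is a placeholder for the list of gene symbols exclusive to one network (e.g., the PCA-whitened Ceres network); argument availability may vary slightly across GSEAPY versions.

```python
import gseapy as gp

enr = gp.enrichr(
    gene_list=exclusive_genes,           # genes exclusive to one network
    gene_sets=[
        "KEGG_2021_Human",
        "GO_Biological_Process_2021",
        "GO_Cellular_Component_2021",
        "GO_Molecular_Function_2021",
    ],
    organism="human",
    outdir=None,                         # return results without writing report files
)
print(enr.results.head())                # enriched terms with P-values and adjusted P-values
```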
Partial correlation
We utilized partial correlation to reveal the conditional relationship between two genes after controlling for the effect of a third gene. We calculate the partial correlation of two genes \(x\) and \(y\), while controlling for the effect of a third gene \(z\), using the recursive formula (Eq. 11). For each pair of genes \(x\) and \(y\) with \(\left| {\rho_{x,y} } \right| > 0.15\), we calculate the partial correlation with respect to every other gene in the network and identify the gene that yields the highest partial correlation coefficient. To quantify the change in the correlation coefficient after accounting for the effect of a third gene, we calculate the ratio between the two coefficients (Eq. 12).
$$\rho_{x,y\_z} = \frac{{\rho_{x,y} - \rho_{x,z} \rho_{y,z} }}{{\sqrt {\left( {1 - \rho_{x,z}^{2} } \right)\left( {1 - \rho_{y,z}^{2} } \right)} }}$$
(11)
$$\frac{{\rho_{x,y\_z}^{2} }}{{\rho_{x,y}^{2} }}$$
(12)
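A sketch of the search over controlling genes, implementing Eqs. (11)–(12) directly on the gene-by-gene Pearson correlation matrix; as described above, the scan is only performed for pairs with \(|\rho_{x,y}| > 0.15\).

```python
import numpy as np

def partial_correlation(rho_xy, rho_xz, rho_yz):
    """First-order partial correlation of x and y controlling for z (Eq. 11)."""
    return (rho_xy - rho_xz * rho_yz) / np.sqrt((1 - rho_xz**2) * (1 - rho_yz**2))

def best_controlling_gene(corr, x, y):
    """Scan every other gene z and return the one yielding the highest partial
    correlation for the pair (x, y), plus the ratio of Eq. (12).
    `corr` is the gene-by-gene Pearson correlation matrix."""
    best_z, best_p = None, -np.inf
    for z in range(corr.shape[0]):
        if z in (x, y):
            continue
        p = partial_correlation(corr[x, y], corr[x, z], corr[y, z])
        if p > best_p:
            best_z, best_p = z, p
    return best_z, best_p, best_p**2 / corr[x, y]**2   # ratio of Eq. (12)
```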
Our partial correlation analysis revealed many gene trios (x–y–z) in which the positive correlation coefficients between genes x and y and between genes y and z are boosted after controlling for the effect of the other gene, while the correlation coefficient between genes x and z is negative. A master data list of these trios is available in Additional file 2: Table S1. We created a network from these moonlighting trios, consisting of gene pairs whose edges are the positive partial correlation coefficients, as well as gene pairs with negative PCC edges. The network can be viewed in Cytoscape [32], and the data table used to create the network is available in Additional file 2: Table S2.