Curation of breast cancer studies
The Metabric study datasets were downloaded from EMBL-EBI with the study ID EGAS00000000083. Study datasets were comprised of the discovery set and the validation set, as well as a third smaller group of normal control samples. For the expression data of each set of samples, probe-level data were aggregated to the gene level and each sample was adjusted using quantile normalization. For the copy number variation variant data, each gene’s CNV status was found by calculating the mean of the values of the probes covering that gene. The TCGA Breast invasive carcinoma (BRCA) study data were also downloaded and contained mostly tumor samples and some normal samples. The HiSeq expression data were log transformed and median centered. The BRCA CNV data were downloaded from Firehose, and GISTIC gene-level output were used directly. For the GSE18229 study and the GSE20624 study, expression data were downloaded from the UNC microarray database, aggregated from the probe-level to the gene-level and quantile normalized. For the GSE20685 study, the expression data were downloaded from the GEO database. For the GSE22133 study, the expression data were aggregated from the probe level to the gene level and quantile normalized. For the CNV data, the values of the probes covering each gene were averaged to become the CNV status of that gene.
EM algorithm
We devised an EM algorithm to separate the whole tumor patient population into 3 groups, “higher than normal”, “similar to normal” and “lower than normal”. To do this, we assumed that the expression values of a certain gene in the tumor patient population were a mixture of 3 Gaussian distributions (3 modes), corresponding to each of the 3 groups mentioned above. We assumed those of the normal patient corresponded only to the middle component. To avoid assignment of a patient to an unreasonable mode, we assumed these 3 Gaussian distribution shared the same variance. Then the log likelihood function could be written as:
$$ LL\left({\overrightarrow{\mathrm{x}}}_{tumor},{\overrightarrow{\mathrm{x}}}_{normal};\overrightarrow{\pi},\overrightarrow{\mu},\sigma \right)=\sum \limits_{i=1}^{\# tumor}\log \left(\sum \limits_{j=1}^3f\left({\mathrm{x}}_{\mathrm{tumor},i};{\mu}_j,\sigma \right)\times {\pi}_j\right)+\sum \limits_{i=1}^{\# normal}\log \left(f\left({\mathrm{x}}_{normal,i};{\mu}_2,\sigma \right)\right) $$
\( f\left(\mathrm{x};\mu, \sigma \right)=\kern0.5em \frac{1}{\sqrt{2\pi}\sigma }{e}^{-\frac{{\left(\mathrm{x}-\mu \right)}^2}{2{\sigma}^2}} \) is the density function of normal distribution. \( {\overrightarrow{x}}_{tumor} \) and \( {\overrightarrow{x}}_{normal} \) are the vectors of expression levels of a certain gene in the tumor patient population and normal patient population. \( \overrightarrow{\pi} \) is a 3-element vector specifying the proportion of patients that belong to each of the 3 modes. \( \overrightarrow{\mu} \) is a 3-element vector specifying the mean of the 3 Gaussian distributions, subject to μ1 ≤ μ2, μ2 ≤ μ3. σ is the standard deviation of the 3 Gaussian distributions.
For each round, the EM algorithm was started by updating the responsibilities \( \overrightarrow{\gamma} \), which is a vector with #tumor elements: \( {\gamma}_{i,j}=\frac{f\left({\mathrm{x}}_{tumor,i};{\mu}_j,\sigma \right)}{\sum \limits_{k=1}^3f\left({\mathrm{x}}_{tumor,i};{\mu}_k,\sigma \right)} \). Then \( \overrightarrow{\pi} \) is updated by \( {\pi}_j=\frac{\sum \limits_{i=1}^{\# tumor}{\gamma}_{i,j}}{\# tumor} \) , \( \overrightarrow{\mu} \) is updated by \( {\mu}_j=\frac{\sum \limits_{i=1}^{\# tumor}{x}_{tumor,i}{\gamma}_{i,j}+I\left(j=2\right)\sum \limits_{i=1}^{\# normal}{x}_{normal,i}}{\sum \limits_{i=1}^{\# tumor}{\gamma}_{i,j}+I\left(j=2\right)\times \# normal}\left(j=1,2,3\right) \), but the inequality bounds require that:
if μ1 > μ2, μ2 ≤ μ3, then \( {\mu}_1={\mu}_2=\frac{\sum \limits_{j=1}^2\sum \limits_{i=1}^{\# tumor}{x}_{tumor,i}{\gamma}_{i,j}+I\left(j=2\right)\sum \limits_{i=1}^{\# normal}{x}_{normal,i}}{\sum \limits_{j=1}^2\sum \limits_{i=1}^{\# tumor}{\gamma}_{i,j}+I\left(j=2\right)\times \# normal} \);
if μ1 ≤ μ2, μ2 > μ3, then \( {\mu}_2={\mu}_3=\frac{\sum \limits_{j=2}^3\sum \limits_{i=1}^{\# tumor}{x}_{tumor,i}{\gamma}_{i,j}+I\left(j=2\right)\sum \limits_{i=1}^{\# normal}{x}_{normal,i}}{\sum \limits_{j=2}^3\sum \limits_{i=1}^{\# tumor}{\gamma}_{i,j}+I\left(j=2\right)\times \# normal} \);
and if μ1 > μ2, μ2 > μ3, then \( {\mu}_1={\mu}_2={\mu}_3=\frac{\sum \limits_{j=1}^3\sum \limits_{i=1}^{\# tumor}{x}_{tumor,i}{\gamma}_{i,j}+I\left(j=2\right)\sum \limits_{i=1}^{\# normal}{x}_{normal,i}}{\sum \limits_{j=1}^3\sum \limits_{i=1}^{\# tumor}{\gamma}_{i,j}+I\left(j=2\right)\times \# normal} \).
Finally, σ is updated by \( {\left[\frac{\sum \limits_{i=1}^{\# tumor}\sum \limits_{j=1}^3{\gamma}_{i,j}{\left({\mathrm{x}}_{tumor,i}-{\mu}_j\right)}^2+\sum \limits_{i=1}^{\# normal}{\left({\mathrm{x}}_{normal,i}-{\mu}_2\right)}^2}{\sum \limits_{i=1}^{\# tumor}\sum \limits_{j=1}^3{\gamma}_{i,j}+\# normal}\right]}^{\raisebox{1ex}{$1$}\!\left/ \!\raisebox{-1ex}{$2$}\right.} \).
The EM iterations were stopped when the log likelihood reached convergence. When μ1 < μ2, μ2 < μ3, and πi > 0.01, i = 1, 2, 3 were all satisfied, this gene was said to exhibit trimodality distribution. Then two cutoff values were calculated by \( {cutoff}_{12}=\frac{\mu_1^2-{\mu}_2^2-2{\sigma}^2\log \left(\frac{\pi_1}{\pi_2}\right)}{2\left({\mu}_1-{\mu}_2\right)} \) and \( {cutoff}_{23}=\frac{\mu_2^2-{\mu}_3^2-2{\sigma}^2\log \left(\frac{\pi_2}{\pi_3}\right)}{2\left({\mu}_2-{\mu}_3\right)} \). Sometimes cutoff12 > μ2 or cutoff12 < μ1 could occur. When that happened, an ad hoc rule applied to set cutoff12 at the 10% quantile of the expression values of the tumor samples. Similarly, cutoff23 was set at the 90% quantile when cutoff23 > μ3 or cutoff23 < μ2. Finally the true membership of each tumor sample to the three modes was decided by comparing their expression values to cutoff12 and cutoff23. An empirical π was calculated by the proportion of tumor patients belonging to each mode.
Gene essentiality analysis
The gene essentiality screening data were downloaded from the 2012 Cancer Discovery study [19]. In this study, a continuous GARP score was defined for each gene in every cell line. A lower score for a gene meant that the cell line was more reliant on the expression of this gene for survival. We used the expression data downloaded from the Cancer Cell Encyclopedia (CCLE) website [2]. The whole CCLE dataset contained the expression data of 58 breast cancer cell lines. 29 of these cell lines were also used in the gene essentiality screening study.
Statistical tests
Survival analysis performed in this study was done using functions from the R survival package. To test the tri-modal association of each gene’s expression level with overall survival, the “low”, “middle”, and “high” categorical variables were input into the Cox proportional hazard model, with or without adjusting for other variables. The P value for the “low” group was assigned by testing the null hypothesis that “low” group patients had no worse overall survival than “middle” group patients, and the same applied for “high” group p values. All survival analysis was censored at 20 years.
To test the proportional trend of two groups of patients in tumor graded 1, 2 and 3, a modified version of prop.trend.test function from the stats R package was used. The p value generated by prop.trend.test was from a two-tailed test, while a one-tailed p value was calculated from it by examining the sign of the coefficient. The one-tailed p value was for the null hypothesis that “low” (“high”) group patients tended to have lower grade tumors when compared to “middle” group patients. To compare “low” vs. “middle” groups for example, the test in essence generated a smaller p value when more advanced grade tumors were more likely to be “low” group patients rather than “middle” group patients.