Predictive modeling of gene expression regulation

Background In-depth analysis of regulation networks of genes aberrantly expressed in cancer is essential for better understanding tumors and identifying key genes that could be therapeutically targeted. Results We developed a quantitative analysis approach to investigate the main biological relationships among different regulatory elements and target genes; we applied it to Ovarian Serous Cystadenocarcinoma and 177 target genes belonging to three main pathways (DNA REPAIR, STEM CELLS and GLUCOSE METABOLISM) relevant for this tumor. Combining data from ENCODE and TCGA datasets, we built a predictive linear model for the regulation of each target gene, assessing the relationships between its expression, promoter methylation, expression of genes in the same or in the other pathways and of putative transcription factors. We proved the reliability and significance of our approach in a similar tumor type (basal-like Breast cancer) and using a different existing algorithm (ARACNe), and we obtained experimental confirmations on potentially interesting results. Conclusions The analysis of the proposed models allowed disclosing the relations between a gene and its related biological processes, the interconnections between the different gene sets, and the evaluation of the relevant regulatory elements at single gene level. This led to the identification of already known regulators and/or gene correlations and to unveil a set of still unknown and potentially interesting biological relationships for their pharmacological and clinical use. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04481-1.


S0.3 TCGA BRCA basal-like patient samples
List of the 122 TCGA Breast Invasive Carcinoma patient samples of Basal-like subtype, according to the PAM50 molecular subtype classification, used for validation.

S1. Data analysis: Algorithm and results
The proposed algorithm and its main operations for extracting and analyzing data for each target gene are summarized in Figure S1.1. After the initial data extraction and preparation, described in the main paper, the data analysis part is composed of two subsequent main phases: Feature selection and then linear regression, detailed as follows.

S1.1 Feature selection
As described in the main paper, the first data analysis phase for each target gene consists in selecting its best subset of features for the subsequent regression modeling task, avoiding to consider the ones found nonrelevant to model the regulation of the expression of the target gene. This feature selection process allows to perform a more accurate analysis for each target gene in a shorter time, reducing the dimension of the set of features with respect to the set of observations; considering the whole set of possible features, in fact, would highly increase the computational complexity and would result in an underdetermined regression problem. The feature selection process ( Figure S1.2(a) and Figure S1.2(b)) iterates along the list of target genes belonging to the gene set selected for the analysis, and for each of such genes it executes the following operations for each of the data matrices considered: a. The current gene data matrix M i is imported and all its empty columns are removed, since they do not contribute to the regression model (e.g., columns with all NaN methylation values, corresponding to genes for which no promoter methylation probes are measured in TCGA, or columns regarding genes whose expression is zero in the considered TCGA samples); b. All those features belonging to the previous evaluated set of features (i.e., data matrix), if any, of the same gene that were not selected by the previous feature selection process, if any, are removed. We assume those same features could not be selected in the current feature selection process, since they have already been discarded once. The features retrieved from the previous selection, along with the new features in the current data matrix, are involved in the new feature selection process; this means that features previously selected may be discarded as a result of the current feature selection, because some of the new added features may better explain the expression of the target gene; c. On such features of the current data matrix M i , a forward feature selection is first performed by evaluating the linear regression performance of growing subsets of these features: starting from a feature, all the other features are added one by one according to the one providing the best regression performance on a training set of samples; then, all defined subsets of features are evaluated on a testing set of different samples, and the subset with the best cross-validation regression performance is retained. In order to reduce the possible bias induced by the particular patient samples considered for the training and test sets, we randomly split the available TCGA aliquots into five, possibly equal, subgroups of samples. This partition is performed only once at the beginning of the data analysis, and then the same five subgroups of samples are used for all the feature selections of all target genes. Thus, for each target gene and each of its data matrices considered, we execute the feature selection five times, using each time four of the sample subgroups as training set and the fifth remaining subgroup as testing set, according to a k-fold cross-validation process with k = 5. Finally, the intersection of all the features extracted in the five testing sets is computed to obtain the final selected features (i.e., those extracted in all five subgroups) for the current target gene in the current data matrix. Figure S1.2(a) and Figure  S1.2(b) illustrate this feature selection procedure.

S1.2 Linear regression
The second phase of our data analysis consists in fitting a linear model on each individual target gene and on the set of features selected in the previous phase for each of its data matrices, with the aim of extracting the most significant features of each data matrix (i.e., feature type) that better explain the target gene expression (i.e., output variable), quantifying their effect. Just as done for the feature selection, the process iterates along the list of target genes belonging to the gene set under analysis and their considered data matrices M i , fitting a linear model as follows: a. First, data are normalized to allow comparisons of results; since heterogeneous data are used for assessing the regulation of the target gene expression, normalization is needed for consistently comparing results not only within but also across the different individual regression models defined. The Z-score normalization is applied to convert the values of each considered variable (i.e., feature or target gene expression) to have a distribution with mean_value = 0 and variance = 1, using the following formula: This normalization allows comparing the regression coefficients assigned to the different features within the same or in different models and establishing which one has the highest impact on the output variable; b. Then, we build the best linear regression model by fitting an ordinary least squares (OLS) model to the normalized data.
For each target gene, from each data matrix we extract and evaluate only the relevant features for the regulation of the target gene expression, according to the values of their confidence intervals. A confidence interval represents an interval estimate that is supposed to contain the true value of an unknown parameter. It has an associated confidence level (i.e., 95%) quantifying the probability that the parameter value (in our case, the estimated regression coefficient) lies in the interval. So, only if the effect of a feature is very unlikely to be zero (i.e., its 95% confidence interval does not contain the zero value), then the feature is considered relevant for the model. This means that the actual set of relevant features predicted involved in the regulation of the expression of the target gene is smaller than, or equal to, the set of features selected by the previous feature selection process and used as input for the regression model. Furthermore, the coefficient of determination (R 2 Figure S1.2(b): Feature selection procedure: subsequently evaluated data matrices for each target gene. score) is calculated to measure the regression fit quality. Figure S1.3 shows how this identification of relevant features works, by considering as an example the gene TKT and the features selected from its data matrix M3.
Finally, all the results (i.e., target genes and their predicted regulatory features) are graphically represented through expression networks, one for each evaluated target gene set and feature set type (i.e., data matrix), grouping target genes in subclasses according to their function inside their gene set. In our specific application case, we considered 3 sets of target genes, each grouping the genes known to compose one of the pathways more relevant in Ovarian Serous Cystadenocarcinoma (OV), as follows: ▪ DNA_REPAIR pathway genes (genomic instability due to DNA damages is one of the main causes of different types of cancer, including ovarian tumor); ▪ STEM_CELLS pathway genes (tumor stem cells represent one of the main tumor relapsing mechanisms, regulating tumor's either sensitive or resistant response to chemotherapeutic drugs); ▪ GLUCOSE_METABOLISM pathway genes (there is a strict correlation between glucose addiction and cancer platinum-based therapy, and the resistance of some cancer cells to this type of therapy may be associated with an alteration in their metabolic profile).
The common color code shared by the drawn networks allows an easy interpretation of the results and the assessment of the effect of both methylation and transcription factors, as well as of the interrelationships between pathways of interest (i.e., how each target gene is regulated and how in turn it participates, if so, in regulating other target genes). Section S3 of this document illustrates all the resulting gene expression networks for OV cancer analysis.

S1.3 Results: Additional details
Here, we report additional details of the main results obtained for each evaluated target gene set (i.e., pathway gene set) and feature set type (i.e., data matrix) model.

S1.3.1 DNA_REPAIR pathway
Model M2: the expression of each target gene is considered regulated by genes in the DNA_REPAIR pathway and genes encoding for transcription factors binding to the target gene promoters. The Adjusted R 2 score resulted higher than 0.6 only for 3 genes of the pathway (POLE, TP53BP1 and FANCD2), reaching the maximum value of 0.67 for the gene POLE. Model M3: the expression of each target gene is considered regulated by genes in the DNA_REPAIR pathway and genes encoding for transcription factors binding to the target gene promoters or to the promoters of other genes in the DNA_REPAIR pathway. The Adjusted R 2 score resulted higher than 0.6 for 6 genes of the pathway (FANCC, TP53BP1, ERCC2, FANCD2, POLQ and BRCA1), reaching the maximum value of 0.64 for the gene FANCC.
For 5 genes of the pathway, the gene methylation is selected as one of the relevant features involved in the regulation of their expression: Differently from the previous model M2, gene CDK12 loses its methylation from the set of relevant regulatory features, because more relevant regulators are found among the set of candidate regulatory genes of the pathway, including the selected regulatory genes SUZ12, BCLAF1, RUNX1 and ZNF217.
Model M5: the expression of each target gene is considered regulated by genes in the DNA_REPAIR pathway, genes encoding for transcription factors binding to the target gene promoters or to the promoters of other genes in the DNA_REPAIR pathway, genes in the STEM_CELLS or GLUCOSE_METABOLISM pathway and their regulatory genes. The Adjusted R 2 score resulted higher than 0.6 for 8 genes of the pathway (POLB, FANCC, POLQ, TP53BP1, FANCD2, ERCC2, POLE and ERCC1), reaching the maximum value of 0.68 for gene POLB. For 4 genes of the pathway the gene methylation is selected as one of the relevant features involved in the regulation of their expression: Differently from the previous model M3, gene ERCC1 loses its methylation from the set of its relevant regulatory features, because more relevant regulators are found among the set of candidate regulatory genes of the pathway, including the selected regulatory genes HNRNPUL1 and NCOA3 of the STEM_CELLS pathway.
There is one single most frequent regulator, selected as a relevant regulatory feature for 5 DNA repair genes: ➢ XRCC3 (candidate regulatory gene of the STEM_CELLS pathway) is selected as: -2 nd (out of 11 features) for gene POLQ, with regression coefficient 0.1916; -4 th (out of 9 features) for gene POLE, with regression coefficient 0.1625; -5 th (out of 11 features) for gene POLE, with regression coefficient 0.1098; -9 th (out of 11 features) for gene FANCC, with regression coefficient 0.0985; -9 th (out of 12 features) for gene PARP1, with regression coefficient -0.0767; Figure S1.4 at the end of the section displays the 8 genes of the DNA_REPAIR pathway that overall show the best linear fit (Adjusted R 2 > 0.6) in model M5 and the set of their relevant regulatory features.

S1.3.2 STEM_CELLS pathway
Model M2: the expression of each target gene is considered regulated by genes in the STEM_CELLS pathway and genes encoding for transcription factors binding to the target gene promoters. The Adjusted R 2 score resulted higher than 0.6 for 5 genes of the pathway (PTPRC, ITGA4, PECAM1, ENG and LATS1), reaching the maximum value of 0.77 for the gene PTPRC.
For 10 genes of the pathway the gene methylation is selected as one of the relevant features involved in the regulation of their expression: Only one single most frequent regulator is selected as a relevant regulatory feature for 9 stem cells target genes: ➢ PTPRC (gene of the STEM_CELLS pathway) is selected as: Model M3: the expression of each target gene is considered regulated by genes in the STEM_CELLS pathway and genes encoding for transcription factors binding to the target gene promoters or to the promoters of other genes in the STEM_CELLS pathway. The Adjusted R 2 score resulted higher than 0.6 for 11 genes of the pathway (PTPRC, ITGA4, LATS1, DNMT1, JAK2, NOTCH2, MAML1, PECAM1, CHEK1, AXL and ENG), reaching the maximum value of 0.82 for the gene PTPRC.
For 10 genes of the pathway the gene methylation is selected as one of the relevant features involved in the regulation of their expression: Differently from the previous model M2, methylation becomes relevant for gene DACH1, while gene ENG loses its methylation from the set of its relevant regulatory features, because more relevant regulators are found among the set of candidate regulatory genes of the pathway, in particular the selected regulatory genes TEAD2 and ZNF143 of the STEM_CELLS pathway.
Only one single most frequent regulator is selected as a relevant regulatory feature for 10 stem cells target genes: ➢ PAX8 (candidate regulatory gene of the STEM_CELLS pathway) is selected as: -1 st (out of 1 feature) for gene NOS2, with regression coefficient -0.1792; -2 nd (out of 8 features) for gene BMP7, with regression coefficient 0.2424; -2 nd (out of 3 features) for gene NANOG, with regression coefficient -0.1257; -6 th (out of 7 features) for gene LIN28B, with regression coefficient -0.1251; -6 th (out of 6 features) for gene MYCN, with regression coefficient -0.1533; -7 th (out of 7 features) for gene KLF4, with regression coefficient -0.1542; -8 th (out of 8 features) for gene CD44, with regression coefficient -0.1997; -10 th (out of 12 features) for gene MYC, with regression coefficient -0.1491; -11 th (out of 13 features) for gene ETFA, with regression coefficient -0.1424; -11 th (out of 11 features) for gene ITGA4, with regression coefficient -0.1075; Model M5: the expression of each target gene is considered regulated by genes in the STEM_CELLS pathway, genes encoding for transcription factors binding to the target gene promoters or to the promoters of other genes in the STEM_CELLS pathway, genes in the DNA_REPAIR or GLUCOSE_METABOLISM pathway and their regulatory genes. The Adjusted R 2 score resulted higher than 0.6 for the same 11 genes of the pathway as in model M3 (PTPRC, ITGA4, DNMT1, LATS1, MAML1, JAK2, CHEK1, PECAM1, NOTCH2, AXL and ENG), reaching the maximum value of 0.82 for the gene PTPRC.
For 9 genes of the pathway, the gene methylation is selected as one of the relevant features involved in the regulation of their expression, i.e., the same as in M3, with the exception of gene DACH1: The same single most frequent regulator is selected as a relevant regulatory feature for the same 10 stem cells target genes as in model M3: ➢ PAX8 (candidate regulatory gene of the STEM_CELLS pathway) is selected as: -2 nd (out of 2 features) for gene NOS2, with regression coefficient -0.1858; -2 nd (out of 8 features) for gene BMP7, with regression coefficient 0.2264; -2 nd (out of 3 features) for gene NANOG, with regression coefficient -0.1257; -6 th (out of 9 features) for gene LIN28B, with regression coefficient -0.1277; -6 th (out of 7 features) for gene MYCN, with regression coefficient -0.1625; -8 th (out of 8 features) for gene CD44, with regression coefficient -0.1997; -9 th (out of 9 features) for gene KLF4, with regression coefficient -0.1344; -10 th (out of 12 features) for gene ETFA, with regression coefficient -0.1213; -11 th (out of 13 features) for gene MYC, with regression coefficient -0.1132; -12 th (out of 12 features) for gene ITGA4, with regression coefficient -0.0921; Figure S1.5 at the end of the section displays the 11 genes of the STEM_CELLS pathway that overall show the best linear fit (Adjusted R 2 > 0.6) in model M5 and the set of their relevant regulatory features.

S1.3.3 GLUCOSE_METABOLISM pathway
Model M2: the expression of each target gene is considered regulated by genes in GLUCOSE_METABOLISM pathway and genes encoding for transcription factors binding to the target gene promoters. The Adjusted R 2 score resulted higher than 0.6 only for 3 genes of the pathway (SDHD, DLAT and ACO2), reaching the maximum value of 0.78 for the gene SDHD. For 14 genes of the pathway, the gene methylation is selected as one of the relevant features involved in the regulation of their expression: -AGL, with regression coefficient -0.2198; There are 2 most frequent regulators, each selected as a relevant regulatory feature for 10 glucose metabolism target genes: ➢ SUCLG1 (gene of the GLUCOSE_METABOLISM pathway) is selected as: Model M3: the expression of each target gene is considered regulated by genes in GLUCOSE_METABOLISM pathway and genes encoding for transcription factors binding to the target gene promoters or to the promoters of other genes in the GLUCOSE_METABOLISM pathway. The Adjusted R 2 score resulted higher than 0.6 for 6 genes of the pathway (SDHD, TPI1, DLAT, MDH1, ACO2, and PRPS1), reaching the maximum value of 0.78 for the gene SDHD.
For 12 genes of the pathway, the gene methylation is selected as one of the relevant features involved in the regulation of their expression: There is one single most frequent regulator, selected as a relevant regulatory feature for 8 glucose metabolism target genes: ➢ ILK (candidate regulatory gene of the GLUCOSE_METABOLISM pathway) is selected as:

S2. Computational evaluation: ARACNe
ARACNe is a an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context: starting from gene expression data, it generates gene expression networks by considering triplets of genes and iteratively removing the weakest of the three relationships at each iteration, according to its assigned Mutual Information (MI) value and to an arbitrary threshold (I 0 ) set a priori by the user (i.e., only relationships between genes quantified by a MI value higher than I 0 are considered and, for each triplet, the edge with the lowest MI value is removed). The threshold allows defining the dimension of the generated network: the higher is I 0 , the smaller are the networks and the number of displayed relationships. An example of how the dimension of the output network changes in relation to different values of the threshold I 0 is shown in Figure S2.1: nodes represent the genes of interest, while edges identify their correlations computed by ARACNe.
ARACNe works under a specific set of initial approximations that, although proved to be reasonable, may lead the algorithm to wrong network inferences. Being a computational method, there is no guarantee that its results are correct or complete; some relationships it identifies may be wrong, as well as it may miss some relationships. However, being a recognized computational method, we used ARACNe for evaluating the associations found during our data analysis, by comparing our ovarian cancer regression model networks with the corresponding networks generated by the ARACNe algorithm. Depending on how this comparison is made and due to the fact that the linear regression algorithm is a different computational procedure with respect to the ARACNe algorithm, differences are expected. Specifically, according to the computational approach we defined, which is focused on identifying best-predicting sets of regulators although in case omitting potential regulators with a lower predicting power, the regression models computed on a subset of selected features can detect less correlations than ARACNe, although some of such correlations may not be found by ARACNe. In addition, ARACNe is made for building only gene expression networks without taking methylation or other regulatory elements into account, as conversely our novel modeling allows. In the end, as a result of this comparison, for most target genes a set of common features between the two approaches is expected, i.e., a set of the main relationships revealed by the regression models that are also extracted by the ARACNe algorithm: this allows verifying that our initial aim and its implementation through a combined feature selection and linear regression approach leads to meaningful results that are worth investigating through further biological experimental analyses. We processed through the ARACNe algorithm the information contained in M3 and M5 matrices for the Ovarian Serous Cystadenocarcinoma, generating two corresponding networks. As done for our linear regression procedure, this processing was performed for each considered set of genes, i.e., for the sets of genes belonging to the DNA_REPAIR, STEM_CELLS, or GLUCOSE_METABOLISM pathway (see Section S0.1). For each pathway, the two networks corresponding to the M3 and M5 matrices, respectively, that were generated using ARACNe were then used for the comparison: the former network takes the genes of the pathway and the complete set of their candidate regulators as input, along with their expression values in the different ovarian cancer TCGA samples, to allow a complete processing using the list of genes in the pathway as hubs of the network; the latter network adds expression information about genes of the other considered pathways and their candidate regulators, to allow a complete processing by setting the list of genes in the studied pathway as nodes and the list of their candidate regulatory genes as transcription factors. The whole process is executed by setting the lowest possible Mutual Information threshold (I 0 = 0.001), in order to obtain the highest number of correlations. The generated networks (six in total) are then compared with the corresponding ones from our regression models based on the M3 and M5 matrices for the same pathways. Figure S2.2 summarizes the set of operations performed for the generation of pathway gene networks by our regression models and by ARACNe for their comparison.
Overall comparison of results for the considered ovarian cancer data are provided in Figure S2.3, while Figure  S2.4 shows an example of the detailed comparison for some genes in M5. The latter one was performed by matching relevant features found associated with the indicated gene through our regression models or through ARACNe, and by assessing their ranking among the gene associated features found, based on the association regression coefficient or mutual information, respectively; such rankings resulted very similar overall.   Finally, Figure S2.5 summarizes the results of the comparison with ARACNe by varying the MI threshold, for each considered pathway.

S3. Gene expression networks
All our final results are graphically represented through the following reported expression networks, grouping target genes in subclasses according to the considered gene set (i.e., pathway, in our case) they belong to. All the drawn networks follow the same color code: in red the genes in the DNA_REPAIR pathway, in orange the genes in the STEM_CELLS pathway, in pink the genes in the GLUCOSE_METABOLISM pathway, in green the gene methylations and in light blue the regulatory genes regardless of the pathway they belong to. The relationships are represented by directed edges starting in the regulatory element and incoming in the target gene: red edges correspond to positive regulations, while grey edges correspond to negative regulations, and edge labels correspond to the regression coefficient resulting from the computed model.

S4.1 Feature selection
Our library genereg implements 5 different feature selection procedures, offering the user the possibility to select which one to use for the data analysis. The default one is the forward feature selection (FFS) combined with the 5-fold cross-validation process that we propose and describe in the main paper. The first two alternative methods follow the same principles of the default one, but they differ in the way the features selected in previous steps are treated; instead, the last two methods are based on the Lasso algorithm (one with re-evaluation of the features, the other one applied on all features together), combined with the 5-fold crossvalidation process. The following subsections present a brief description of how these methods work in detail.

S4.1.1 Incremental forward feature selection with feature re-evaluation
This is the multistep incremental method we propose: in each step (i.e., for each data matrix used), it considers only the features extracted as most relevant in the previous step (if any) and it re-evaluates them with all the new features in the current data matrix during a new feature selection process. In particular, for each target gene and for each input data matrix, the feature selection process is performed five times in each step: in order to reduce possible biases, the set of TCGA data samples is randomly split into five, possibly equal, groups of samples that are used to create five different testing sets. Therefore, for each target gene the feature selection is performed five times, one for each generated testing set (using the remaining samples as training set), according to a k-fold cross-validation process, setting k = 5. The intersection of the five sets of extracted features is computed to obtain the final selected features for the target gene in the considered data matrix. Figure S4.1: Incremental forward feature selection with feature re-evaluation.

S4.1.2 Incremental forward feature selection with no feature re-evaluation
This incremental method is quite similar to the default procedure: it is a FFS with 5-fold cross-validation using the same training and test sets of the default method, but it does not implement the feature re-evaluation process. In each step (i.e., for each data matrix) it considers only the new features added to the current matrix with respect to the previous one, maintaining the features selected in the previous step (if any) as fixed and without re-evaluating them during the new feature selection process. Thus, the set of features selected for each data matrix surely contains all the features selected for the previous data matrix, if any, in case along with new relevant features.

S4.1.3 Forward feature selection on all features
This method is a complete forward feature selection with 5-fold cross-validation, using the same training and test sets of the default method, considering all the features contained in the current data matrix regardless of the ones selected in the previous step. It allows analyzing all the potential regulatory features together, without any previous pre-selection process. Thus, it is particularly useful when applied to the matrix M5, which contains the whole set of candidate regulatory features. This allows building one single regression model for each target gene, considering all its features at once and all together. However, the user needs to pay attention to this type of selection process because the computational complexity (thus the execution time) inevitably increases, in case greatly, due to the high number of input features.

S4.1.4 Incremental Lasso feature selection with feature re-evaluation
This is an incremental method with re-evaluation of the features, very similar to the default one; however, it differs in the algorithm used to identify the relevant features: Lasso instead of forward feature selection, combined with 5-fold cross-validation. Here, the features selected are the ones with a coefficient different from zero, according to the Lasso algorithm, selecting the best tuning parameter (alpha). A Lasso regression with multiple possible alpha values is executed and, as a result, the best alpha value is identified; finally, the Lasso regression is performed according to the selected alpha value, in order to extract the most relevant features. More in details: for each target gene, 5 Lasso feature selection processes are performed, using the same training and test sets of the default method, and the intersection of the five sets of extracted features is computed to obtain the final selected features for the target gene in the considered data matrix. In each of the 5 processes, the best tuning parameter alpha is identified among 10 different candidate values (0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10) and then that same alpha value is used to perform a Lasso feature selection for linear regression, in order to select relevant features, i.e., the ones with a non-zero regression coefficient.

S4.1.5 Lasso feature selection on all features
This method is a complete Lasso feature selection with 5-fold cross-validation using the same training and test sets of the default method, considering all the features contained in the current matrix regardless of the ones selected in the previous step. It allows to analyze all the potential regulatory features together, without any previous pre-selection process. Thus, this method is extremely useful if applied to matrix M5, which contains the whole set of candidate regulatory features. This allows building one single regression model for each target gene, considering all its features at once and all together. Differently from the FFS, Lasso guarantees a very low computational complexity, even if analyzing the whole set of features; thus, it is preferable to FFS when applied to large input matrices.

S4.2 Results and comparisons
In this Section, we report brief comparisons between the results obtained through our default method and using the alternative methods on Ovarian Cancer data. Pairwise comparison details comprise the following plots: -a detailed comparison summarizing the results of model M5 for a set of 6 restricted target genes, identified as most relevant and interesting during the evaluation of our default method results for the OV tumor (BRCA1, CDK12, ERCC1, CHEK1, DNMT1, PTPRC), highlighting similarities and differences in the final regression model built in the two compared methods; -line charts summarizing the number of relevant features selected by the two compared methods and the common ones, for each OV genomic pathway; -mean and confidence interval plots for each OV genomic pathway in the two compared methods; -Fisher's exact test summary, considering the number of features extracted by the two compared methods, in order to assess the significance of the distribution of these values and whether this number is significantly different between the two methods, building a contingency table and computing the p-value as follows: Method 1 Method 1 = our default method Method 2 = alternative method a = features extracted in both method 1 and method 2 b = features extracted in method 2, but not in method 1 c = features extracted in method 1, but not in method 2 d = features extracted neither in method 1 nor in method 2 (always considering only features extracted from method 1 and/or 2, so this always equals 0) Overall, all performed comparisons show that a relevant number of the features that our method selects are selected also by the other methods, which generally select more features but not in a statistically significant higher amount. The forward feature selection method applied on all features together was the only exception, resulting selecting statistically significantly less features than our approach; furthermore, it was the only one that did not identify the relationship between the DNMT1 and CHEK1 genes, although it has been confirmed in OV cancer PDX models and in a model of conditional CHECK1 knock-out in HCT116 cultured ovarian carcinoma cells; all other relationships experimentally confirmed that we tested (see main paper Results subsection Experimental validation of found biological correlations in independent Ovarian and Breast cancer samples) were identified by all considered methods.

S4.2.1 Our incremental FFS method vs. FFS on all features
• • ERCC1 (Adjusted R 2 M5 = 0.60 vs. 0.52): 3 relevant features identified by our default method (10 total features) are confirmed by this first alternative method (7 total features), with similar regression coefficients (the most relevant feature is the same in both methods, ERCC2): These results confirm the strong relationship between ERCC1 and ERCC2 (experimentally confirmed in OV and BRCA PDX models), identifying ERCC2 as the most relevant feature in both methods, playing a key role in the expression regulation of gene ERCC1.
• CHEK1 (Adjusted R 2 M5 = 0.66 vs. 0.64): 5 relevant features identified by our default method (12 total features) are confirmed by this first alternative method (8 total features), with the following regression coefficients (the most relevant feature is the same in both methods, NFRKB): However, the alternative method does not select the methylation of CHEK1 as one of its relevant features, differently from what happens by applying the default method (methylation is selected with a regression coefficient of -0.1144).
• DNMT1 (Adjusted R 2 M5 = 0.70 vs. 0.62): 3 relevant features identified by our default method (8 total features) are confirmed by this first alternative method (4 total features), with the following regression coefficients (the most relevant feature is the same in both methods, SMARCA4): However, CHEK1 is not identified by the alternative method, despite the relationship between DNMT1 and CHEK1 has been confirmed in OV cancer PDX models and in a model of conditional CHECK1 knock-out in HCT116 cultured ovarian carcinoma cells (see main paper Results subsection Experimental validation of found biological correlations in independent Ovarian and Breast cancer samples).      • CDK12 (Adjusted R 2 M5 = 0.43 vs. 0.53): all the relevant features identified by our default method are confirmed by this alternative method, with the following regression coefficients: A point to stress in this case is the different behavior of these two methods with respect to the gene methylation, which is not identified as relevant by the default method, but it is selected in the alternative one (with coefficient -0.1380).
• ERCC1 (Adjusted R 2 M5 = 0.60 vs. 0.61): all the relevant features identified by our default method, except three of them (E2F4, NCO3 and PTTG1), are confirmed by this method, with the following regression coefficients (the most relevant feature is the same in both methods, ERCC2):

FFS All Features
Also in this case, this confirms the strong relationship between ERCC1 and ERCC2 (experimentally confirmed in OV and BRCA PDX models), identifying ERCC2 as the most relevant feature in both methods, playing a key role in the expression regulation of gene ERCC1. However, there is a different behavior with respect to the gene methylation, which is not identified as relevant by the default method, but it is selected in the alternative one (with coefficient -0.0926).
• CHEK1 (Adjusted R 2 M5 = 0.66 vs. 0.67): 6 relevant features identified by our default method (12 total features) are confirmed by this alternative method (13 total features), including gene methylation, with the following regression coefficients (the most relevant feature is the same in both methods, NFRKB): • DNMT1 (Adjusted R 2 M5 = 0.70 vs. 0.66): 5 relevant features identified by our default method (8 total features) are confirmed by this alternative method (8 total features), with the following regression coefficients (the most relevant feature is the same in both methods, SMARCA4): Both methods identify CHEK1 as one of the most relevant features, confirming the relevance of the DNMT1-CHEK1 biological relationship, which has been confirmed in OV cancer PDX models and in a model of conditional CHECK1 knock-out in HCT116 cultured ovarian carcinoma cells (see main paper Results subsection Experimental validation of found biological correlations in independent Ovarian and Breast cancer samples).     This confirms the strong relationship between ERCC1 and ERCC2 (experimentally confirmed in OV and BRCA PDX models), with ERCC2 playing a key role in the expression regulation of gene ERCC1. However, there is a different behavior with respect to the gene methylation, which is not identified as relevant by the default method, but it is selected in this alternative one (with coefficient -0.1001).
• CHEK1 (Adjusted R 2 M5 = 0.66 vs. 0.75): 7 relevant features identified by our default method (12 total features) are confirmed by this alternative method (23 total features), including gene methylation, with the following regression coefficients (the most relevant feature is the same in both methods, NFRKB):

S4.2.4 Our incremental FFS method vs. Lasso on all features
• BRCA1 (Adjusted R 2 M5 = 0.59 vs. 0.63): all the relevant features identified by our default method, except two of them (CBX5 and SUZ12), are confirmed by this alternative method, including gene methylation, with the following regression coefficients:   • CDK12 (Adjusted R 2 M5 = 0.43 vs. 0.41): all the relevant features identified by our default method, except 2 of them (ATM and NCOA3), are confirmed by this alternative method, with the following regression coefficients (the most relevant feature is the same in both methods, SUZ12): The alternative method identifies one new relevant feature, excluded by the default method: TAF1 (0.2696).
• ERCC1 (Adjusted R 2 M5 = 0.60 vs. 0.56): 4 relevant features identified by our default method (10 total features) are confirmed by this method (10 total features), with the following regression coefficients (the most relevant feature is the same in both methods, ERCC2): These results confirm, even in this case, the strong relationship between ERCC1 and ERCC2 (experimentally confirmed in OV and BRCA PDX models). However, there is a different behavior with respect to the gene methylation, which is not identified as relevant by the default method, but it is selected in this alternative one (with coefficient -0.0898).
• CHEK1 (Adjusted R 2 M5 = 0.66 vs. 0.67): 7 relevant features identified by our default method (12 total features) are confirmed by this alternative method (13 total features), including gene methylation, with the following regression coefficients (the most relevant feature is the same in both methods, NFRKB): • DNMT1 (Adjusted R 2 M5 = 0.70 vs. 0.60): only 2 relevant features identified by our default method (8 total features) are confirmed by this alternative method (2 total features), with the following regression coefficients (the most relevant feature is the same in both methods, SMARCA4):       • ERCC1 (Adjusted R 2 M5 = 0.56 vs. 0.52): 4 relevant features identified by Lasso (7 total features) are confirmed by the FFS (10 total features), with the following regression coefficients (the most relevant feature is the same in both methods, ERCC2):

S4.2.5 Lasso on all features vs. FFS on all features
These results confirm, even in this case, the strong relationship between ERCC1 and ERCC2 (experimentally confirmed in OV and BRCA PDX models), identifying ERCC2 as the most relevant feature in both methods. However, there is a different behavior with respect to the gene methylation, which is not identified as relevant by the FFS method, but it is selected in the Lasso one (with coefficient -0.0898).
• CHEK1 (Adjusted R 2 M5 = 0.67 vs. 0.64): 4 relevant features identified by Lasso (13 total features) are confirmed by the FFS (8 total features), with the following regression coefficients (the most relevant feature is the same in both methods, NFRKB): However, there is a different behavior with respect to the gene methylation, which is not identified as relevant by the FFS method, but it is selected in the Lasso one (with coefficient -0.0922).
• DNMT1 (Adjusted R 2 M5 = 0.60 vs. 0.62): only 1 relevant feature identified by Lasso (out of the 2 total features) is confirmed by the FFS (4 total features), with the following regression coefficient (the single common feature is the most relevant feature in both methods, SMARCA4): However, only Lasso identifies CHEK1 as one of the most relevant features, despite the relationship between DNMT1 and CHEK1 has been confirmed in OV cancer PDX models and in a model of

S4.2.6 Overall summary comparison of the considered methods
Here, we report the overall comparison among the five feature selection methods considered, based on the number of features that each method extracts and their consensus with the other methods, or on the number of target genes for which a method could extract significant features and the goodness of these features in modelling the target gene expression according to the Bayesian Information Criterion (BIC), or on the method execution time.
For each method, the following table shows the total number of features extracted and those of them selected also by other methods (all four or less), or by no other method, while Figure S4.31 illustrates the percentage of the features extracted by a method that were selected also by all other 4 methods, and the percentage of the features not selected by all other methods that were selected by 3, 2, 1 other methods or by none of them:   For each feature selection method considered, the following table shows the number of target genes (Models) out of the 177 total evaluated ones for which the method could extract significant features, and the average and standard deviation of the Bayesian Information Criterion (BIC) for these models, computed on the M5 data matrix features of the target gene or on all its features (BIC is a well-known criterion for model selection among a finite set of models, whose lowest value indicates the preferred model):

S4.2.7 Common features in all 5 feature selection methods
Here, we list the common features identified in all 5 methods considered, for each target gene in OV tumor. These features can be considered as most reliable since they were unveiled by all the different methods: