Identification of biomarkers predictive of metastasis development in early-stage colorectal cancer using network-based regularization

Peixoto, Carolina; Lopes, Marta B.; Martins, Marta; Casimiro, Sandra; Sobral, Daniel; Grosso, Ana Rita; Abreu, Catarina; Macedo, Daniela; Costa, Ana Lúcia; Pais, Helena; Alvim, Cecília; Mansinho, André; Filipe, Pedro; Costa, Pedro Marques da; Fernandes, Afonso; Borralho, Paula; Ferreira, Cristina; Malaquias, João; Quintela, António; Kaplan, Shannon; Golkaram, Mahdi; Salmans, Michael; Khan, Nafeesa; Vijayaraghavan, Raakhee; Zhang, Shile; Pawlowski, Traci; Godsey, Jim; So, Alex; Liu, Li; Costa, Luís; Vinga, Susana

doi:10.1186/s12859-022-05104-z

Research
Open access
Published: 16 January 2023

Identification of biomarkers predictive of metastasis development in early-stage colorectal cancer using network-based regularization

Carolina Peixoto¹,
Marta B. Lopes^2,3,
Marta Martins⁴,
Sandra Casimiro⁴,
Daniel Sobral^6,7,
Ana Rita Grosso^6,7,
Catarina Abreu⁵,
Daniela Macedo⁵,
Ana Lúcia Costa⁵,
Helena Pais⁵,
Cecília Alvim⁵,
André Mansinho^4,5,
Pedro Filipe⁵,
Pedro Marques da Costa⁵,
Afonso Fernandes⁴,
Paula Borralho⁴,
Cristina Ferreira⁵,
João Malaquias⁵,
António Quintela⁵,
Shannon Kaplan⁸,
Mahdi Golkaram⁸,
Michael Salmans⁸,
Nafeesa Khan⁸,
Raakhee Vijayaraghavan⁸,
Shile Zhang⁸,
Traci Pawlowski⁸,
Jim Godsey⁸,
Alex So⁸,
Li Liu⁸,
Luís Costa^4,5 &
…
Susana Vinga^1,9

BMC Bioinformatics volume 24, Article number: 17 (2023) Cite this article

2928 Accesses
5 Citations
2 Altmetric
Metrics details

Abstract

Colorectal cancer (CRC) is the third most common cancer and the second most deathly worldwide. It is a very heterogeneous disease that can develop via distinct pathways where metastasis is the primary cause of death. Therefore, it is crucial to understand the molecular mechanisms underlying metastasis. RNA-sequencing is an essential tool used for studying the transcriptional landscape. However, the high-dimensionality of gene expression data makes selecting novel metastatic biomarkers problematic. To distinguish early-stage CRC patients at risk of developing metastasis from those that are not, three types of binary classification approaches were used: (1) classification methods (decision trees, linear and radial kernel support vector machines, logistic regression, and random forest) using differentially expressed genes (DEGs) as input features; (2) regularized logistic regression based on the Elastic Net penalty and the proposed iTwiner—a network-based regularizer accounting for gene correlation information; and (3) classification methods based on the genes pre-selected using regularized logistic regression. Classifiers using the DEGs as features showed similar results, with random forest showing the highest accuracy. Using regularized logistic regression on the full dataset yielded no improvement in the methods’ accuracy. Further classification using the pre-selected genes found by different penalty factors, instead of the DEGs, significantly improved the accuracy of the binary classifiers. Moreover, the use of network-based correlation information (iTwiner) for gene selection produced the best classification results and the identification of more stable and robust gene sets. Some are known to be tumor suppressor genes (OPCML-IT2), to be related to resistance to cancer therapies (RAC1P3), or to be involved in several cancer processes such as genome stability (XRCC6P2), tumor growth and metastasis (MIR602) and regulation of gene transcription (NME2P2). We show that the classification of CRC patients based on pre-selected features by regularized logistic regression is a valuable alternative to using DEGs, significantly increasing the models’ predictive performance. Moreover, the use of correlation-based penalization for biomarker selection stands as a promising strategy for predicting patients’ groups based on RNA-seq data.

Peer Review reports

Introduction

Colorectal cancer (CRC) is one of the leading causes of cancer-related deaths worldwide. In 2018, it was the third most common cancer, with around 1.8 million new cases and the second most deathly cancer with almost 900 thousand deaths (9% of all cancer-related deaths) [1]. CRC begins as a benign adenomatous polyp, which develops into an advanced adenoma with high grade dysplasia and then progresses to invasive cancer. Invasive cancers that are confined within the wall of the colon (stages I and II) are curable. However, if untreated, they may spread to regional lymph nodes (stage III) or later metastasize to distant sites (stage IV) [2].

CRC is a very heterogeneous disease that can develop via distinct pathways involving different combinations of genetic and epigenetic changes [3]. These genetic differences between patients may lead to differences in susceptibility where cancers deriving from the same tissue may be stratified into disease subtypes [4]. Genetic and epigenetic heterogeneity poses a problem for the diagnosis and therapy of cancer. For example, it can lead to incorrect treatment decisions. CRC has three main types known, divided by their origin and expression: sporadic form (60%–80% of the cases), family type (20%–40%) and hereditary type [5]. Sporadic CRC may appear in individuals who carry no mutation that makes them susceptible to developing this type of cancer. Regarding the family type, no gene has been found to be related to the disease. However, there is a higher chance of developing this tumor when family members have suffered from sporadic colon cancer. In these cases, environmental factors play a critical role. Hereditary type may be divided into two subtypes whether patients show adenomatous polyps - familial adenomatous polyposis (FAP), or not - hereditary nonpolyposis colorectal cancer [5].

Metastasis is the major cause of death in CRC patients, and approximately 20% of the patients already have metastases at diagnosis [6]. In this context, it is vital to diagnose CRC at an early-stage and accurately identify patients likely to progress to metastasis in order to improve CRC patients outcomes. Tumor surgical removal is the treatment of choice for early localized CRC disease (stage II-III) [7]. 50% of stage III patients are cured by surgery, whereas 20% of patients will survive due to the addition of adjuvant chemotherapy and 30% will relapse in 2-3 years [8]. Altogether, only 20% of stage III patients benefit from chemotherapy, exposing 80% of patients to unnecessary toxicity [9]. Therefore, one of the main challenges is to identify those stage II-III CRC patients where adjuvant chemotherapy is crucial to improve their outcomes.

Many studies try to understand tumor biology and mechanisms that lead to metastasis; notwithstanding, the identification of the factors influencing metastatic tumor cells, especially in colorectal cancer, remains poor [10]. Consequently, over the years, there was an increase in molecular profiling of tumors using next-generation sequencing (NGS), such as RNA sequencing (RNA-seq), which constitutes an important tool widely used in cancer research for studying the transcriptional landscape and molecular pathways [11].

Supervised learning comes as a natural choice for helping in the classification of patients into metastatic and non-metastatic, based on NGS data. Some of the widely used classifiers applied to RNA-seq data are Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), and Support Vector Machine (SVM) [12,13,14]. However, despite the invaluable information provided by NGS, the intrinsic high dimensionality of gene expression data may compromise the classification learning task and severally hamper an accurate selection of biomarkers. Therefore, feature selection plays a pivotal role in the selection of informative genes preceding the classification of RNA-Seq data for disease prediction and diagnosis, to enhance accuracy in disease classification [15]. Furthermore, ranking of the features according to their relevance to the classification problem and further selection of the best ones can improve the performance of the prediction model [13].

Another common way to address the data high-dimensionality challenge is to use classification algorithms that control the model’s complexity through regularization [16, 17]. One option is to regularize the log-likelihood function of the LR model. Two of the most commonly used penalties are the lasso ($\ell _1$-norm) and the ridge ($\ell _2$-norm) [14] regularizers, whose combination leads to the Elastic Net [18]. Network analysis has also shown enormous potential in precision medicine, helping to identify key biomarkers and therapeutic targets in cancer [19]. Several studies used network-based regularizers to improve model accuracy and interpretability. Prior network knowledge may be based on protein-protein interactions [20], or from the correlation matrix of the gene expression values [21, 22].

In sum, several studies demonstrated that using supervised learning methods in microarray gene expression data [23] is a very promising technique and that the integration of gene expression profiles with network information may help to identify markers correlated with metastasis [24]. Also, in the context of colorectal cancer, some classifiers developed to investigate metastasis were based only on clinical data (e.g., sex, age at diagnosis, histological subtype, stage, primary site) [25]. Therefore, there is still an urgent need of methods for the identification of factors influencing metastatic tumor cells, especially in colorectal cancer.

In this work, we try to find a set of biomarkers that may predict the risk of metastasis using transcriptomic data from a cohort of CRC patients followed at the Hospital de Santa Maria (Lisbon), one of the largest hospitals in Portugal.

To achieve this goal, we applied and tested different classification methods using transcriptomic data, and proposed a new combined model that showed higher classification accuracy compared to its model counterparts. Altogether, we proposed a new pipeline for the selection of putative biomarkers based on patients’ gene correlation matrices.

Materials and methods

To identify important genes involved in the CRC metastasis process, several classification methods applied to RNA-seq data were tested. The pipeline of this study is represented in Fig. 1. All the methods were implemented in the R statistical software [26] and the corresponding code is available at https://github.com/sysbiomed/iTwiner.git.

Datasets

Primary tumor samples from patients diagnosed with CRC disease from June 2010 to October 2017 were collected as part of a prospective biobanking project approved by the Ethical Committee of Hospital de Santa Maria, all procedures were performed in accordance with relevant guidelines. Patients were followed at the Oncology Division of Hospital Santa Maria, Lisbon, and were treated as per institutional clinical practice in accordance with international guidelines, namely ESMO and NCCN guidelines. Cases were staged according to The American Joint Committee on Cancer (AJCC) staging system, 8th edition, and patients had not received neoadjuvant chemo or radiotherapy prior to sample collection. Whole transcriptome sequencing (WTS) was performed by Illumina Inc.

The dataset used in this study comprises 110 samples from early-stage (II and III) CRC patients with both clinical and transcriptomic (RNA-seq) data. This was obtained from two different cohorts of CRC patients from Hospital Santa Maria (Lisbon, Portugal): Cohort 1: Cohort described in [27] containing 111 samples, available under accession number EGAS00001005276 (European Genome-Phenome Archive). This cohort has 26 samples from primary stage II-III colorectal tumors that did not metastasize, 34 primary stage II-III colorectal tumors that metastasize in three years of follow up, 12 adjacent normal colonic mucosa, and 39 metastasis of CRC patients. From this cohort, only the primary colorectal tumors samples were used ($n_{T1}=45$), from early-stage CRC that metastasize ($n_{PM1}=19$), and did not metastasize ($n_{P1}=26$). Cohort 2: Cohort described in [28] containing 114 samples, already available in NCBI Database under accession number PRJNA689313. We used $n_{T2}=65$ samples that correspond to early-stage CRC that metastasize ($n_{PM2}=11$) and early-stage CRC that did not metastasize ($n_{P2}=54$).

The clinical dataset descriptive statistics are summarized in Table 1. The sex (Female or Male), tissue of cancer primary site (Colon or Rectum), stage of the disease (II or III), sidedness of primary site (Right or Left side of the colon), and age variables were selected for further analysis. For the classification methods, two groups of interest were selected, early-stage (II-III) patients that do not metastasize (P, $n_{P1}+n_{P2}=80$) and early-stage patients that metastasize (PM, $n_{PM1}+n_{PM2}=30$) during the follow-up time period. Given the resulting imbalanced groups and the problems in classification that were obtained due to class imbalance, an undersampling strategy was taken for model training by splitting the initial dataset into three different smaller datasets, i.e., DATASET1 ($n = 60$), DATASET2 ($n = 55$), and DATASET3 ($n = 55$). For each dataset, PM patients were the same ($n = 30$) and P patients were randomly divided into three groups ($n_1 = 30$, $n_2 = 25$, $n_3 = 25$). With this strategy, we exploit all the data collected while keeping class balance in each classification procedure. Other data partitions may be tested using the available code.

The original gene expression dataset was comprised of 39,103 variables (genes). After excluding the genes with a constant expression (standard deviation of zero), a dataset with 37,504 variables (genes) was obtained.

A preliminary study of the datasets was performed to verify the statistical significance of the differences between factor variables across the P and PM groups of patients using the Fisher’s Exact test, namely to the variables sex, tissue type, stage of the disease, and sidedness. For the age variable, a t-test was used to compare the mean between the two groups. Subsequent survival analysis was performed for each dataset used, with the main goal of studying the time until an event of interest, i.e., death, occurs [29]. Here, we compared the differences in survival between several groups of interest – namely, stages of the disease (II vs. III), sidedness (Right vs. Left side of the colon), and class (P vs. PM) – using the log-rank test [30].

Finally, differential gene expression analysis was performed to identify genes differentially expressed between the two patient groups (P and PM). To perform this analysis, the edgeR R software package was used, employing an FDR (false discovery rate) cut-off of 0.05 to identify differentially expressed genes (DEGs). These genes were further used for classification.

Table 1 Distribution of the patients of each dataset (D1 - DATASET1, D2 - DATASET2, D3 - DATASET3) used regarding sex (Female, Male), tissue type (Colon, Rectum), stage of the disease (II, III), sidedness (Right, Left) and age; * p-value comparing P and PM class groups using the Fisher exact test

Full size table

Classification methods

Classification is a supervised learning method, where the model learns from a set of predefined samples with given class labels (training dataset). The knowledge inferred from this is applied to classify unknown samples (a test dataset) accordingly [13].

In this work, three binary classification approaches were used to distinguish early-stage CRC patients that metastasize from those that did not: 1) classification methods based on a subset of relevant genes (DEGs), 2) classification via regularized logistic regression with embedded feature selection applied to the full dataset, and 3) all classifiers based on the relevant features identified by regularized logistic regression (Fig. 1).

Binary classification

Regarding binary classification, five different classifiers were tested: decision trees, support vector machines (linear and radial), logistic regression, and random forest. One of the limitations of these methods emerges when using high-dimensional data. Since a high number of features may lead to problems in classification analysis, a smaller subset containing only genes found to be differentially expressed between the groups P and PM was used. The difference in the expression level of genes is found useful in classification in order to identify disease biomarkers [13]. Decision trees (DT) are one of the most used classifiers. The tree complexity, measured by the number of nodes and number of features used, has a crucial effect on its accuracy [31]. Certain parameters for tree construction were fixed as explained in the section “Method evaluation and comparison”. SVMs have been successfully applied to a wide variety of biological applications, such as the classification of microarray gene expression profiles. Here we tested both linear (svmL) and radial (svmR) kernel functions [32]. Logistic regression allows the analysis of binary outcomes using a logistic function [33]. This method will be explained in more detail below. Finally, Random forest (RF) is an ensemble learning method for classification, operating by constructing a multitude of decision trees [34] to get a more accurate and stable prediction. These classification procedures were performed using the R software caret package.

Regularized logistic regression

Another approach that has been widely used for classification problems in cancer is logistic regression [35, 36]. This method is used for modeling a binary response variable [37]. In this specific case, we investigated how metastasis may be predicted using gene expression levels from early-stage CRC patients.

The logistic regression model estimates the probability of belonging to a given class ($Y_i=1$) by:

$$\begin{aligned} P({Y}_i=1|{{\textbf {X}}}_i,{\varvec{\beta }})=\frac{\exp ({{\textbf {X}}}_i^T {\varvec{\beta }})}{1+\exp ({{\textbf {X}}}_i^T {\varvec{\beta }})}, \end{aligned}$$

(1)

where ${{\textbf {X}}}_{i}, i = 1,\ldots ,n$, is the vector of the p covariates (gene expression values) of the i-th patient, and ${{\varvec{\beta }}} = (\beta _{1}, \beta _{2}, \dots ,\beta _{p})$ are the corresponding regression coefficients.

The parameters ${{\varvec{\beta }}}$ of the logistic model are estimated by maximizing the log-likelihood function, given by

$$\begin{aligned} l({\varvec{\beta }})&= \sum _{i=1}^{n} \Big \{y_i \log P (Y_i=1|{{\textbf {X}}}_i,{{\varvec{\beta }}}) + (1 - y_i) \log [1 - P (Y_i=1|{{\textbf {X}}}_i,{{\varvec{\beta }}})]\Big \}, \end{aligned}$$

(2)

where the binary variable $y_i$ indicates to which group observation i belongs to, either a patient known to have metastasized in the future (group PM, $y_i=0$) or to a patient whose tumor did not metastasize (group P, $y_i=1$).

One of the most used techniques to handle high-dimensional gene expression data is regularization [38]. The most common regularizer is the Elastic Net ([18]), which combines the $\ell _1$-norm and the squared $\ell _2$-norm of the parameters:

$$\begin{aligned} F({\varvec{\beta }}) = \lambda \Big \{ \alpha \left\Vert {\varvec{\beta }} \right\Vert _1 + (1-\alpha ) \left\Vert {\varvec{\beta }} \right\Vert _2^2 \Big \}, \end{aligned}$$

(3)

where $0 \le \alpha \le 1$. When $\alpha =1$, the least absolute shrinkage and selection operator (Lasso) is obtained, whereas $\alpha =0$ corresponds to the Ridge regression. Lasso may set coefficients to zero, resulting in a sparse model with fewer coefficients. Ridge regression, on the other hand, is a continuous shrinkage method that minimizes the residual sum of squares, keeping all the predictors in the model [39]. The parameter $\lambda$ that controls the penalizing weight is usually chosen with cross-validation.

Incorporating network-based regularizers in classifiers may improve model interpretability leading to parameter estimation towards meaningful biological solutions. This network information may be obtained from the data correlation itself. For example, Twiner was recently proposed as a regularizer based on pairwise correlations between the features in two distinct groups A and B [21]. This method allows the selection of similarly correlated genes in two groups (e.g., in two given diseases). Here we propose a variant of Twiner, the iTwiner, in which the more different a gene’s correlation pattern is between two groups (metastatic and non-metastatic), the less penalized will be in the regularization term of logistic regression.

Given two correlation matrices for A and B, ${\Sigma _A = [ {\varvec{\sigma }}_1^A,...,{\varvec{\sigma }}_p^A}]$ and ${\Sigma _B = [ {\varvec{\sigma }}_1^B,...,{\varvec{\sigma }}_p^B}]$, respectively, where each column ${\varvec{\sigma }}_j \in \mathbb {R}^p$ represents the correlation of each feature $j=1,\ldots ,p$ with the remaining ones, the dissimilarity measure $d_j(A,B)$ of feature j between A and B is given by the angle of the corresponding vectors

$$\begin{aligned} d_j(A,B) = \arccos { \frac{<{\varvec{\sigma }}_j^A,{\varvec{\sigma }}_j^B>}{\Vert {\varvec{\sigma }}_j^A \Vert \cdot \Vert {\varvec{\sigma }}_j^B \Vert } }, \quad j=1,\ldots , p. \end{aligned}$$

(4)

The regularizer is constructed using these distances, to promote the selection of genes whose correlation patterns are more distant between A and B. The penalty term is given by

$$\begin{aligned} F({\varvec{\beta }}) = \lambda \Big \{ \alpha \Vert \textbf{q} \circ {\varvec{\beta }} \Vert _1 + (1-\alpha ) \Vert \textbf{q} \circ {\varvec{\beta }} \Vert ^2_2 \Big \}, \end{aligned}$$

(5)

where vector $\textbf{q}=(w_1^{-1},\ldots ,w_j^{-1},\ldots ,w_p^{-1})$ represents the inverse of the normalized distances $w_j = d_j(A,B) / \max _k d_k(A,B)$.

The iTwiner adapts the former regularization in order to penalize, now in an inverse way, the gene expression correlation similarities between the two groups (P vs. PM). The main rationale, in this context, is to select biomarker signatures that indeed reflect the different correlation patterns between the metastatic vs. non-metastatic early-stage CRC patients.

Method evaluation and comparison

In this work we tested three different approaches to find the best CRC metastasis classifier: 1) Classifiers based on DEGs; 2) Regularized logistic regression applied to the full dataset; 3) Classifiers based on genes pre-selected from regularization (instead of DEGs).

As explained above, the original dataset was split into three smaller datasets due to the existing class imbalance. For each dataset used, samples were randomly divided into a training set (for model construction) and a test set (for model evaluation), comprising 70% and 30% of the data, respectively. To obtain statistically reliable predictive measurements, 10-fold cross-validation was performed on the training set to optimize the $\lambda$ parameter in regularized logistic regression. Regarding decision trees, the minimum number of observations that must exist in a node in order for a split to be attempted and the minimum number of observations in the final node were fixed (minsplit = 4; minbucket = minsplit/3, respectively). After testing manually some values, these were the ones that gave the best estimated tree. Also, a 10-fold cross-validation was used across all runs to tune maxdepth and estimate the best tree, guaranteeing models’ comparison. This estimation procedure and hyper-parameter optimization was performed using the R software package rpart. For support vector machine, random forest and logistic regression classifiers, the train function from caret package was used to perform hyper-parameter optimization from a training set using the default 10-fold cross-validation. To mitigate the variability of these procedures, train and test sets were randomly generated 100 times, keeping the same fixed split (70%-30%).

For the EN model the parameter that controls sparsity was set to $\alpha = 0.2$ and for iTwiner $\alpha = 0.05$, which selected an adequate number of variables to be further analyzed and interpreted. Notwithstanding, different $\alpha$ parameters may be tested to select different gene set sizes, using the code made available.

To evaluate the models’ performance, depending on the class predicted by the classifier and the true class of the patient (non-metastatic - P or metastatic - PM), four different results can be obtained: True positive (TP) - patient predicted as positive (non-metastatic) and the patient was non-metastatic; False positive (FP) - patient predicted as positive (non-metastatic) but the patient did metastasize; True negative (TN) - patient predicted as negative (metastatic) and the patient metastasized; False negative (FN) - patient predicted as negative (metastatic) but the patient did not metastasize. Using these results, the following measures on the test set were used as indicators of the performance of the classifiers: Accuracy (fraction of correct predictions - Acc), number of misclassifications (Miscl), Sensitivity (fraction of actual positive cases), Specificity (fraction of actual negative cases) and AUC (area under the ROC curve). The median values of all performance indexes obtained for train and test sets across the 100 runs were used for comparison.

To perform the analysis described above, glmnet [40] package was used in R statistical software. The ${{\textbf {q}}}$ vector was introduced as a penalty factor in the glmnet function.

Results and discussion

Different gene expression profiles are expected in early-stage patients that will metastasize compared to non-metastatic patients, as a consequence of molecular, biochemical, and genetic variations that make metastatic cells able to migrate from the primary tumor to other body sites [41]. In this work, several classification and feature selection strategies based on RNA-seq data were evaluated to distinguish early-stage (II-III) CRC patients that metastasize from those that do not, and to find a subset of genes that may be predictive of CRC metastasis.

Exploratory analysis

The data used to perform this analysis is described in Table 1. As explained before, the full dataset was divided into three. These were analyzed individually and patients were divided into several groups regarding important clinical factors such as sex (Female and Male), tissue (Colon and Rectum), stages of the disease (II and III), sidedness (Right, Left), and age. The statistical significance of the differences between groups P and PM for each clinical factor can be found in Table 1. Most clinical factors yielded no significance in the differences between groups. This is an important step to guarantee that further differences found between P and PM groups are related to gene expression data and not to possible clinical confounding factors.

Afterward, to assess if there were differences in the survival probability regarding clinical factors, survival analysis was performed (Fig. 2) for each dataset used, and the significance of the differences was determined via the log-rank test. As expected, stage III of the disease (more advanced stage), was related to a higher risk of death compared to stage II. This was observed in all datasets, however, only DATASET1 had significant results (p value = 0.0091). Also, PM patients showed worst survival probability when compared to P patients, significant in all datasets (p value < 0.001). Finally, regarding sidedness, no statistically significant results were found. However, for DATASET1 and DATASET2, there was a tendency for the right side to be related with worst survival probability, as shown in the literature [42].

Differential gene expression analysis was performed in all datasets to find differential expressed genes (DEGs) between P and PM patient groups. In DATASET1, a total of 9533 DEGs were found. Among those, 1589 were up-regulated and 7944 down-regulated in PM patients. In DATASET2, 1840 DEGs were found, 835 up-regulated and 1005 down-regulated in PM. Finally, 138 DEGs were found in DATASET3, 39 up-regulated and 99 down-regulated. Given the high number of DEGs found in each dataset, a smaller gene set containing only the fifty DEGs that exhibited the lowest p-values between the two tissues was created, for ease of model building and interpretation. The list of ranked genes can be found in Table 2. To compare if there were DEGs found in common between datasets, a Venn diagram was constructed (Fig. 3).

The DEGs found in common between datasets are represented in Table 3, where log fold change (LogFC) is also shown. Negative values refer to down-regulated genes and positive values to up-regulated genes in PM patients. As we can see, twelve genes were considered DEGs in at least two datasets between tissues of early-stage metastatic patients and non-metastatic, with 3 DEGs in common between all of the datasets tested, GBP4, IDO1, IGHV4-34. Interestingly, all these three genes have important implications in immune regulation, highly relevant for cancer progression [43,44,45]. Several of the other genes identified have previously been involved in cancer cells migration, invasion and metastasis such as LRP4, LGR6, APOL1 and CXCL11 [46,47,48,49].

Table 2 List of the top fifty genes with lowest p-value found between P vs. PM, regarding different gene expression analysis, subsequently used for classification analysis in each dataset tested

Full size table

Table 3 DEGs found in common at least in two datasets used with fold change regarding primary patients that will metastasize (PM). logFC - log fold change; Multiple testing correction is performed by applying the Benjamini-Hochberg method on the p-values, to control the false discovery rate (FDR)

Full size table

Classification based on the DEGs

To classify primary patients that metastasize, five distinct classification methods were used: decision trees (DT), random forest (RF), linear and radial support vector machine (svmL and svmR, respectively), and logistic regression (LR). Due to high-dimensionality problems, the full gene expression dataset cannot be directly used. Therefore, we decide to perform feature selection using only DEGs found between early-stage patients that metastasize and those that do not metastasize. This is a common approach used to reduce feature dimension before classification. Since the number of DEGs found in each dataset was different, we used the 50 DEGs with the lowest p-value described above as means to use the same gene dataset dimension as input to all classifiers in all datasets tested. After training the classifiers 100 times, several performance evaluation metrics were calculated in the test set, such as accuracy, misclassifications, sensitivity, specificity, and area under the ROC curve (AUC). The median results of all runs obtained for each dataset in the test set are displayed on Table 4 (all performances for train and test sets may be found in the Additional file 1: Table S1). Also, pairwise comparisons of the accuracy obtained for classifiers may be found on Additional file 1: Table S2. It is shown that the results are similar between the different methods tested. Nonetheless, RF was the best classifier obtained, presenting higher accuracy (0.72, 0.71, and 0.71) and AUC (0.72, 0.71, and 0.69) in all datasets tested, and the lowest number of misclassifications (5) in the test set.

Table 4 Median classifiers performance results (and standard deviation in parenthesis) obtained for test sets for the 100 runs tested using five classification methods applied to the fifty DEGs with lowest p-value

Full size table

Table 5 Median values (and standard deviation in parenthesis) of the performance metrics in test set by regularized LR methods across 100 runs applied to the full dataset

Full size table

Regularized logistic regression

The second approach tested to distinguish early-stage CRC patients that metastasize from those that do not, was to use regularized LR with different types of penalization for feature selection: Elastic net (EN) and the correlation-based regularizer iTwiner. The test set results obtained for these methods applied to the full dataset are described in Table 5 (results for train and test sets may be found in the Additional file 1: Table S3).

The performance of these methods was similar to the classifiers tested above, where higher accuracy in test set was obtained by the iTwiner method (mean $\text{Acc} = 0.69$). Interestingly, most of the misclassifications in DATASET2 and DATASET3 using both approaches were false negatives (FN), meaning that these methods classified wrongly patients that do not metastasize in patients that metastasize. Since non-metastatic patients can indeed metastasize in the future it would be of great value to do a follow-up on these patients that were labeled wrongly by these methods.

The median number of selected variables (genes) by the two methods (across the 100 runs) used to separate the two groups (P vs. PM) was 48 for EN and 38 for iTwiner (Table 5). Also, the number of genes selected in at least 50% of the 100 runs tested in EN was smaller when compared to iTwiner, indicating that the iTwiner method is more stable since more genes are consistently selected as important for the classification of early-stage patients that metastasize. Moreover, to assess which genes are being recurrently selected by the methods, independently of the dataset used, a Venn diagram was constructed (Fig. 4). Here, we compared for each regularizer (EN and iTwiner) the top fifty genes selected in each dataset (Table 6), as the most likely genes to be metastatic biomarkers in CRC patients. Interestingly, only a few biomarkers were found to be DEGs between the P and PM groups, represented in bold the down- and in underline the up-regulated genes in the PM group. Also, we can see that using iTwiner, a higher number of genes is selected in common across the three datasets tested, which once more stands as evidence of improved stability and robustness of the selected feature sets, irrespective of the dataset used.

Looking closely at the genes selected by each classifier in the different datasets tested (Table 6), EN only selected RAC1P3 in common to all datasets. This gene is a pseudogene of the Rac family of small GTPase whose role in cancer is still unknown. Regarding iTwiner, six genes were selected by the three datasets tested, RAC1P3, XRCC6P2, EEF1B2P6, HSPD1P7, TRBV11-1, HORMAD2. The majority of these genes are pseudogenes with an unknown role in cancer. However, HORMAD2 has been reported to have tumor suppressor functions, and its expression was seen down-regulated in cancer [50]. Here we showed that HORMAD2 gene was down-regulated in early-stage patients that metastasize (represented in bold in Table 6).

Table 6 List of genes ranked by the number of times that were selected by regularized LR methods used. Genes colored in underline and bold represent DEGs found up- and down-regulate in PM tissues, respectively

Full size table

Classification based on regularized-selected genes

The final procedure to find a set of biomarkers involved in metastasis processes of early-stage CRC patients was to use classification methods based on the features previously identified by regularized LR. In particular, using the genes pre-selected by regularization (EN, iTwiner) as an alternative to the DEGs (Fig. 1, method 3), to try to improve the classification performance.

Here, the five classifiers used earlier (DT, svmL, svmR, LR, and RF) were applied to the two smaller gene sets obtained by regularized LR. To have the same dataset dimension as before, the fifty genes selected by EN and iTwiner penalties ranked in Table 6 were used as input to the classifiers. This was done to each dataset as previously. Performances obtained for classifiers train and test sets using genes pre-selected by EN and iTwiner penalties may be found on the Additional file 1: Tables S4 and S6, respectively. Pairwise comparisons of the accuracy obtained for classifiers may be found on Additional file 1: Tables S5 and S7.

Regarding classifiers applied to gene sets based on EN penalties (Table 7), for all datasets tested, the best results in test set were obtained using svmR ($Acc = 0.78, 0.76, 0.76$) and RF ($Acc = 0.78, 0.76, 0.76$) methods. Also, in the RF method, the specificity of the results was higher, i.e., most of the misclassifications were FN. This means that the classifier labeled patients as metastatic even though they were non-metastatic at the three years follow-up time.

Afterward, we tested the same classifiers applied to a different gene set based on iTwiner penalization (Table 8). The best accuracy was obtained by RF classifier as before ($Acc = 0.86, 0.82, 0.76$). However, using this iTwiner penalization improved the specificity of the classifier (Specificity = 1 for all datasets).

Table 9 presents the mean performance results for all the tested combinations of classifiers and feature selection methods: DEGs found between P and PM patient group (Table 9, DEG +), genes pre-selected by EN regularizer (Table 9, EN +) and genes pre-selected by the iTwiner (Table 9, iTwiner +). For all gene selection methods tested, the best performance classifier was RF showing the highest accuracy and specificity. Overall, using the genes found by regularization, considering different penalty vectors (and so different information used for selection), instead of using DEGs found between groups, improved in a significant way the accuracy of the classifiers (Table 9). A pairwise comparison using Wilcoxon rank sum test with Benjamini & Hochberg p-value correction was performed to assess the statistically significant differences between the groups (Additional file 1: Table S8).

Moreover, for most classifiers tested (DT, svmL, and RF), if the selection of genes is based on correlation matrices (iTwiner), the performance of the models increases significantly, leading to the most accurate results. To better visualize these, Fig. 5 shows boxplots of the classifiers’ accuracy obtained for all gene selection methods applied to each dataset tested. Overall, when gene sets were obtained by regularization (EN + and iTwiner +), higher accuracy was obtained. This is well observed in the RF classifier (Fig. 5d).

Table 7 Median values (and standard deviation in parenthesis) obtained in test set for the five classification methods using fifty most frequently genes pre-selected by LR with EN penalization

Full size table

Table 8 Median values (and standard deviation in parenthesis) obtained in test set for the five classification methods using fifty most frequently genes pre-selected by LR with iTwiner penalization

Full size table

Table 9 Mean performance metrics values of the three datasets tested (test set) obtained for the classification methods applied to different gene sets based on DEGs, EN and iTwiner

Full size table

Conclusions

CRC is one of the leading causes of cancer-related deaths worldwide, being metastasis the major cause in these patients. Therefore, it is crucial to accurately diagnose CRC at an early-stage and understand the molecular mechanisms underlying metastasis. Several studies have tried to understand tumor biology and metastasis mechanisms by comparing early-stage versus metastatic tumors. We explore the relevance of studying early-stage (II-III) tumors that do not metastasize versus those that metastasize, in three years of follow-up. However, this is not an easy task since the high-dimensionality of gene expression data leads to problems in classification methods. As such, feature selection methods are important for selecting informative genes prior to classification, to improve their accuracy.

Here we present two major contributions to the discovery of metastatic biomarkers in CRC based on classification and feature selection. The first contribution is a new network-based feature selection method, iTwiner, that promotes the selection of genes with distinct correlation patterns in metastatic and non-metastatic patients, and has shown to significantly increase the classifiers’ predictive performance. Moreover, the proposed iTwiner regularizer selected the most stable and robust gene sets, including tumor suppressor genes and genes involved in several cancer processes like tumor growth and metastasis.

The second contribution proposes using gene sets pre-selected by regularized LR (via EN and iTwiner) as input features in the classification learning task, with proven improved performance compared to using DEGs as features, across many datasets and classifiers tested. Correlation-based penalization via the iTwiner penalty selected the best gene set for accurately distinguishing the two groups of patients, placing iTwiner as a promising strategy in the classification of CRC patients based on RNA-seq data and for the disclosure of biomarkers of CRC metastasis.

As future work, other types of classifiers may be tested, such as Gradient Boosting, Gaussian Process or neural networks, and since different hyper-parameter values may affect the classifiers’ performance, a more in depth investigation on optimization and tuning of parameters should be addressed. Also, studying the output of the binary classifiers and comparing those with genes selected by regularization methods would be an interesting next step, followed by gene function analysis to describe the biological role of genes and find potential enriched mechanisms and pathways.

Availability of data and materials

Two cohorts of CRC patients from Hospital Santa Maria (Lisbon, Portugal): Cohort 1: Cohort described in [27] containing 111 samples, already available under accession number EGAS00001005276 (European Genome-Phenome Archive) - https://ega-archive.org/search-results.php?query=EGAS00001005276; Cohort 2: Cohort described in [28] containing 114 samples, already available in NCBI Database under accession number PRJNA689313 - https://www.ncbi.nlm.nih.gov/bioproject/PRJNA689313. Code used is available at https://github.com/sysbiomed/iTwiner.git.

Abbreviations

CRC::: Colorectal cancer
RNA-seq::: RNA sequencing
DEGs::: Differentially Expressed Genes
FAP::: Familial Adenomatous Polyposis
NGS::: Next-Generation Sequencing
LR::: Logistic Regression
DTs::: Decision Trees
RF::: Random Forest
SVMs::: Support Vector Machines
ESMO::: European Society for Medical Oncology
NCCN::: National Comprehensive Cancer Network
AJCC::: The American Joint Committee on Cancer
WTS::: Whole Transcriptome Sequencing
P::: Early stage patients that do not metastasize
PM::: Early stage patients that metastasize
FDR::: False discovery rate
svmL::: Linear support vector machine
svmR::: Radial support vector machine
EN::: Elastic net
TP::: True Positive
FP::: False Positive
TN::: True Negative
FN::: False Negative
Acc::: Accuracy
Miscl::: Misclassification
AUC::: Area Under the Curve
ROC::: Receiver Operator Characteristic curve
logFC::: Log Fold Change

References

Jung G, Hernández-Illán E, Moreira L, Balaguer F, Goel A. Epigenetics of colorectal cancer: biomarker and therapeutic potential. Nat Rev Gastroenterol Hepatol. 2020;17(2):111–30.
Article Google Scholar
Markowitz SD, Bertagnolli MM. Molecular basis of colorectal cancer. N Engl J Med. 2009;361(25):2449–60.
Article CAS Google Scholar
Phipps AI, Limburg PJ, Baron JA, Burnett-Hartman AN, Weisenberger DJ, Laird PW, Sinicrope FA, Rosty C, Buchanan DD, Potter JD, et al. Association between molecular subtypes of colorectal cancer and patient survival. Gastroenterology. 2015;148(1):77–87.
Article CAS Google Scholar
Beerenwinkel N, Schwarz RF, Gerstung M, Markowetz F. Cancer evolution: mathematical models and computational inference. Syst Biol. 2015;64(1):1–25.
Article Google Scholar
Arvelo F, Sojo F, Cotte C. Biology of colorectal cancer Ecancermedicalscience. 2015;9.
Ferlay J, Shin H-R, Bray F, Forman D, Mathers C, Parkin DM. Estimates of worldwide burden of cancer in 2008: Globocan 2008. Int J Cancer. 2010;127(12):2893–917.
Article CAS Google Scholar
Twelves C, Wong A, Nowacki MP, Abt M, Burris H III, Carrato A, Cassidy J, Cervantes A, Fagerberg J, Georgoulias V, et al. Capecitabine as adjuvant treatment for stage III colon cancer. N Engl J Med. 2005;352(26):2696–704.
Article CAS Google Scholar
Auclin E, Zaanan A, Vernerey D, Douard R, Gallois C, Laurent-Puig P, Bonnetain F, Taieb J. Subgroups and prognostication in stage iii colon cancer: future perspectives for adjuvant therapy. Ann Oncol. 2017;28(5):958–68.
Article CAS Google Scholar
Johnston PG. Stage II colorectal cancer: to treat or not to treat. Oncologist. 2005;10(5):332–4.
Article Google Scholar
Pretzsch E, Bösch F, Neumann J, Ganschow P, Bazhin A, Guba M, Werner J, Angele M. Mechanisms of metastasis in colorectal cancer and metastatic organotropism: hematogenous versus peritoneal spread. J Oncol. 2019;2019.
Wang Y, Mashock M, Tong Z, Mu X, Chen H, Zhou X, Zhang H, Zhao G, Liu B, Li X. Changing technologies of RNA sequencing and their applications in clinical oncology. Front Oncol. 2020;10:447.
Article CAS Google Scholar
Zhang Y-H, Huang T, Chen L, Xu Y, Hu Y, Hu L-D, Cai Y, Kong X. Identifying and analyzing different cancer subtypes using RNA-seq data of blood platelets. Oncotarget. 2017;8(50).
Jabeen A, Ahmad N, Raza K. Machine learning-based state-of-the-art methods for the classification of RNA-seq data. Springer. 2018;133–172.
Tan KM, Petersen A, Witten D. Classification of RNA-seq data Springer. 2014:219–46.
Jain D, Singh V. Feature selection and classification systems for chronic disease prediction: a review. Egypt Inform J. 2018;19(3):179–89.
Article Google Scholar
Mohamed E, El Houby E, Wassif KT, Salah AI. Survey on different methods for classifying gene expression using microarray approach. Int J Comput Appl. 2016;975:8887.
Google Scholar
Vinga S. Structured sparsity regularization for analyzing high-dimensional omics data. Brief Bioinform. 2021;22(1):77–87.
Article Google Scholar
Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc: Ser B (Statistical Methodology). 2005;67(2):301–20.
Article Google Scholar
Ozturk K, Dow M, Carlin DE, Bejar R, Carter H. The emerging potential for network analysis to inform precision cancer medicine. J Mol Biol. 2018;430(18):2875–99.
Article CAS Google Scholar
Veríssimo A, Carrasquinha E, Lopes MB, Oliveira AL, Sagot M-F, Vinga S. Sparse network-based regularization for the analysis of patientomics high-dimensional survival data. bioRxiv, 2018;403402
Lopes MB, Casimiro S, Vinga S. Twiner: correlation-based regularization for identifying common cancer gene signatures. BMC Bioinform. 2019;20(1):1–15.
Article CAS Google Scholar
Peixoto C, Lopes MB, Martins M, Costa L, Vinga S. Tcox: correlation-based regularization applied to colorectal cancer survival data. Biomedicines. 2020;8(11):488.
Article CAS Google Scholar
Burton M, Thomassen M, Tan Q, Kruse TA. Gene expression profiles for predicting metastasis in breast cancer: a cross-study comparison of classification methods. Sci World J. 2012;2012.
Chuang H-Y, Lee E, Liu Y-T, Lee D, Ideker T. Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007;3(1):140.
Article Google Scholar
Riihimäki M, Hemminki A, Sundquist J, Hemminki K. Patterns of metastasis in colon and rectal cancer. Sci Rep. 2016;6(1):1–9.
Article Google Scholar
R Core Team: R: A Language and Environment for Statistical Computing.
Sobral D, Martins M, Kaplan S, Golkaram M, Salmans M, Khan N, Vijayaraghavan R, Casimiro S, Fernandes A, Borralho P, et al. Genetic and microenvironmental intra-tumor heterogeneity impacts colorectal cancer evolution and metastatic development. Commun Biol. 2022;5(1):1–14.
Article Google Scholar
Golkaram M, Salmans ML, Kaplan S, Vijayaraghavan R, Martins M, Khan N, Garbutt C, Wise A, Yao J, Casimiro S, et al. Hervs establish a distinct molecular subtype in stage II/III colorectal cancer with poor outcome. NPJ Genom Med. 2021;6(1):1–11.
Article Google Scholar
Walters SJ. What is a cox model? 1999.
Jager KJ, Van Dijk PC, Zoccali C, Dekker FW. The analysis of survival data: the Kaplan–Meier method. Kidney Int. 2008;74(5):560–5.
Article Google Scholar
Rokach L, Maimon O. Decision trees. In: Data Mining and Knowledge Discovery Handbook, pp. 2005;165–192.
Noble WS. What is a support vector machine? Nat Biotechnol. 2006;24(12):1565–7.
Article CAS Google Scholar
LaValley MP. Logistic regression. Circulation. 2008;117(18):2395–9.
Article Google Scholar
Cutler A, Cutler DR, Stevens JR. Random forests. In: Ensemble Machine Learning, pp. 2012;157–75.
Algamal ZY, Lee MH. Penalized logistic regression with the adaptive lasso for gene selection in high-dimensional cancer classification. Expert Syst Appl. 2015;42(23):9326–32.
Article Google Scholar
Algamal ZY, Lee MH. Regularized logistic regression with adjusted adaptive elastic net for gene selection in high dimensional cancer classification. Comput Biol Med. 2015;67:136–45.
Article CAS Google Scholar
Bewick V, Cheek L, Ball J. Statistics review 14: logistic regression. Crit Care. 2005;9(1):1–7.
Article Google Scholar
Huang H-H, Liu X-Y, Liang Y. Feature selection and cancer classification via sparse logistic regression with the hybrid $L_{{1/2}+ 2}$ regularization. PLoS ONE. 2016;11(5):0149675.
Article Google Scholar
Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc: Ser B (Methodological). 1996;58(1):267–88.
Google Scholar
Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1–22.
Article Google Scholar
Poturnajova M, Furielova T, Balintova S, Schmidtova S, Kucerova L, Matuskova M. Molecular features and gene expression signature of metastatic colorectal cancer. Oncol Rep. 2021;45(4):1–1.
Article Google Scholar
Baran B, Ozupek NM, Tetik NY, Acar E, Bekcioglu O, Baskin Y. Difference between left-sided and right-sided colorectal cancer: a focused review of literature. Gastroenterol Res. 2018;11(4):264.
Article CAS Google Scholar
Uyttenhove C, Pilotte L, Théate I, Stroobant V, Colau D, Parmentier N, Boon T, Van den Eynde BJ. Evidence for a tumoral immune resistance mechanism based on tryptophan degradation by indoleamine 2, 3-dioxygenase. Nat Med. 2003;9(10):1269–74.
Article CAS Google Scholar
Wang Q, Wang X, Liang Q, Wang S, Xiwen L, Pan F, Chen H, Li D. Distinct prognostic value of mRNA expression of guanylate-binding protein genes in skin cutaneous melanoma. Oncol Lett. 2018;15(5):7914–22.
Google Scholar
Xochelli A, Baliakas P, Kavakiotis I, Agathangelidis A, Sutton L-A, Minga E, Ntoufa S, Tausch E, Yan X-J, Shanafelt T, et al. Chronic lymphocytic leukemia with mutated ighv4-34 receptors: shared and distinct immunogenetic features and clinical outcomes. Clin Cancer Res. 2017;23(17):5292–301.
Article CAS Google Scholar
Zhou X, Xia E, Bhandari A, Zheng C, Xiang J, Guan Y, Zhang X. Lrp4 promotes proliferation, migration, and invasion in papillary thyroid cancer. Biochem Biophys Res Commun. 2018;503(1):257–63.
Article CAS Google Scholar
Wang F, Dai C-Q, Zhang L-R, Bing C, Qin J, Liu Y-F. Downregulation of lgr6 inhibits proliferation and invasion and increases apoptosis in human colorectal cancer. Int J Mol Med. 2018;42(1):625–32.
CAS Google Scholar
Lin J, Xu Z, Xie J, Deng X, Jiang L, Chen H, Peng C, Li H, Zhang J, Shen B. Oncogene apol1 promotes proliferation and inhibits apoptosis via activating notch1 signaling pathway in pancreatic cancer. Cell Death Dis. 2021;12(8):1–11.
Article Google Scholar
Cao Y, Jiao N, Sun T, Ma Y, Zhang X, Chen H, Hong J, Zhang Y. Cxcl11 correlates with antitumor immunity and an improved prognosis in colon cancer. Front Cell Dev Biol. 2021;9.
Lin Q, Hou S, Guan F, Lin C. Hormad 2 methylation-mediated epigenetic regulation of gene expression in thyroid cancer. J Cell Mol Med. 2018;22(10):4640–52.
Article CAS Google Scholar

Download references

Acknowledgements

Not applicable

Funding

This work was partially supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with references PD/BD/139146/2018, CEECINST/00102/2018, IF/00409/2014, UIDB/50021/2020 (INESC-ID), UIDB/50022/2020 (IDMEC), UIDB/04516/2020 (NOVA LINCS), UIDB/00297/2020 and UIDP/00297/2020 (NOVA MATH) and projects MONET (PTDC/CCI-BIO/4180/2020) and MATISSE (DSAIPA/DS/0026/2019). The biobanking of CRC from Hospital Santa Maria, Lisbon, Portugal, was supported by a grant from the Official Portuguese Funding Agency for Science and Technology (FCT: PIC/IC/82821/2007). This project has received funding from the European Union’s Horizon 2020 research and innovation program under Grant Agreement No. 951970 (OLISSIPO project).

Author information

Authors and Affiliations

INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Rua Alves Redol 9, 1000-029, Lisbon, Portugal
Carolina Peixoto & Susana Vinga
NOVA Laboratory for Computer Science and Informatics (NOVA LINCS), NOVA School of Science and Technology, 2829-516, Caparica, Portugal
Marta B. Lopes
Center for Mathematics and Applications (NOVA MATH), NOVA School of Science and Technology (FCT NOVA), 2829-516, Caparica, Portugal
Marta B. Lopes
Instituto de Medicina Molecular - João Lobo Antunes, Faculdade de Medicina de Lisboa, Avenida Professor Egas Moniz, 1649-028, Lisbon, Portugal
Marta Martins, Sandra Casimiro, André Mansinho, Afonso Fernandes, Paula Borralho & Luís Costa
Oncology Division, Hospital de Santa Maria, Centro Hospitalar Lisboa Norte, Lisbon, Portugal
Catarina Abreu, Daniela Macedo, Ana Lúcia Costa, Helena Pais, Cecília Alvim, André Mansinho, Pedro Filipe, Pedro Marques da Costa, Cristina Ferreira, João Malaquias, António Quintela & Luís Costa
Associate Laboratory i4HB - Institute for Health and Bioeconomy, NOVA School of Science and Technology, Universidade NOVA de Lisboa, 2829-516, Caparica, Portugal
Daniel Sobral & Ana Rita Grosso
UCIBIO - Applied Molecular Biosciences Unit, Department of Life Sciences, NOVA School of Science and Technology, Universidade NOVA de Lisboa, 2829-516, Caparica, Portugal
Daniel Sobral & Ana Rita Grosso
Illumina Inc., 5200 Illumina Way, San Diego, CA, 92122, USA
Shannon Kaplan, Mahdi Golkaram, Michael Salmans, Nafeesa Khan, Raakhee Vijayaraghavan, Shile Zhang, Traci Pawlowski, Jim Godsey, Alex So & Li Liu
IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais, 1, 1049-001, Lisbon, Portugal
Susana Vinga

Authors

Carolina Peixoto
View author publications
You can also search for this author in PubMed Google Scholar
Marta B. Lopes
View author publications
You can also search for this author in PubMed Google Scholar
Marta Martins
View author publications
You can also search for this author in PubMed Google Scholar
Sandra Casimiro
View author publications
You can also search for this author in PubMed Google Scholar
Daniel Sobral
View author publications
You can also search for this author in PubMed Google Scholar
Ana Rita Grosso
View author publications
You can also search for this author in PubMed Google Scholar
Catarina Abreu
View author publications
You can also search for this author in PubMed Google Scholar
Daniela Macedo
View author publications
You can also search for this author in PubMed Google Scholar
Ana Lúcia Costa
View author publications
You can also search for this author in PubMed Google Scholar
Helena Pais
View author publications
You can also search for this author in PubMed Google Scholar
Cecília Alvim
View author publications
You can also search for this author in PubMed Google Scholar
André Mansinho
View author publications
You can also search for this author in PubMed Google Scholar
Pedro Filipe
View author publications
You can also search for this author in PubMed Google Scholar
Pedro Marques da Costa
View author publications
You can also search for this author in PubMed Google Scholar
Afonso Fernandes
View author publications
You can also search for this author in PubMed Google Scholar
Paula Borralho
View author publications
You can also search for this author in PubMed Google Scholar
Cristina Ferreira
View author publications
You can also search for this author in PubMed Google Scholar
João Malaquias
View author publications
You can also search for this author in PubMed Google Scholar
António Quintela
View author publications
You can also search for this author in PubMed Google Scholar
Shannon Kaplan
View author publications
You can also search for this author in PubMed Google Scholar
Mahdi Golkaram
View author publications
You can also search for this author in PubMed Google Scholar
Michael Salmans
View author publications
You can also search for this author in PubMed Google Scholar
Nafeesa Khan
View author publications
You can also search for this author in PubMed Google Scholar
Raakhee Vijayaraghavan
View author publications
You can also search for this author in PubMed Google Scholar
Shile Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Traci Pawlowski
View author publications
You can also search for this author in PubMed Google Scholar
Jim Godsey
View author publications
You can also search for this author in PubMed Google Scholar
Alex So
View author publications
You can also search for this author in PubMed Google Scholar
Li Liu
View author publications
You can also search for this author in PubMed Google Scholar
Luís Costa
View author publications
You can also search for this author in PubMed Google Scholar
Susana Vinga
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

C.P. performed the conceptualization, methodology, programming and software development, analysis and writing the original Draft. M.B.L. contributed for the methodology and review and editing. M.M performed the management of the samples and DNA extraction and was a contributor in review and editing the manuscript. S.C. provided resources, the CRC biobank. D.S. and A.R.G performed data curation. C.A., D.M., A.L.C., H.P., C.A., A.M., P.F., P.M.C., J.M. and A.Q. performed clinical data collection. A.F., P.B. and C.F. did the pathology analysis. S.K., M.G., M.S., N.K., R.V., S.Z., T.P., J.G., A.S. and L.L. performed WTS. L.C. contributed for the conceptualization and supervision. S.V. contributed for the conceptualization, methodology, review and editing the manuscript, supervision and project administration All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Luís Costa or Susana Vinga.

Ethics declarations

Ethical approval and consent to participate

Data obtained from studies approved by the ethics committee from Hospital de Santa Maria, Centro Hospitalar Universitário Lisboa Norte (Lisbon, Portugal) and all patients provided signed informed consent. All procedures were performed in accordance with relevant guidelines.

Consent for publication

Not applicable

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1

. Performance results and comparison of classifiers.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Peixoto, C., Lopes, M.B., Martins, M. et al. Identification of biomarkers predictive of metastasis development in early-stage colorectal cancer using network-based regularization. BMC Bioinformatics 24, 17 (2023). https://doi.org/10.1186/s12859-022-05104-z

Download citation

Received: 15 March 2022
Accepted: 07 December 2022
Published: 16 January 2023
DOI: https://doi.org/10.1186/s12859-022-05104-z

Identification of biomarkers predictive of metastasis development in early-stage colorectal cancer using network-based regularization

Abstract

Introduction