A deep learning-based framework for lung cancer survival analysis with biomarker interpretation

Background Lung cancer is the leading cause of cancer-related deaths in both men and women in the United States, and it has a much lower five-year survival rate than many other cancers. Accurate survival analysis is urgently needed for better disease diagnosis and treatment management. Results In this work, we propose a survival analysis system that takes advantage of recently emerging deep learning techniques. The proposed system consists of three major components. 1) The first component is an end-to-end cellular feature learning module using a deep neural network with global average pooling. The learned cellular representations encode high-level biologically relevant information without requiring individual cell segmentation, and are aggregated into patient-level feature vectors using a locality-constrained linear coding (LLC)-based bag of words (BoW) encoding algorithm. 2) The second component is a Cox proportional hazards model with an elastic net penalty for robust feature selection and survival analysis. 3) The third component is a biomarker interpretation module that can help localize the image regions that contribute to the survival model's decisions. Extensive experiments show that the proposed survival model has excellent predictive power on a public lung cancer dataset (The Cancer Genome Atlas) in terms of two commonly used metrics: the log-rank test (p-value) of the Kaplan-Meier estimate and the concordance index (c-index). Conclusions In this work, we have proposed a segmentation-free survival analysis system that takes advantage of the recently emerging deep learning framework and well-studied survival analysis methods such as the Cox proportional hazards model. In addition, we provide an approach to visualize the discovered biomarkers, which can serve as concrete evidence supporting the survival model's decisions.


Background
Lung cancer is the leading cause of cancer-related deaths in both men and women in the United States. An estimated 158,080 Americans died from lung cancer in 2016, accounting for approximately 27% of all cancer deaths. The five-year survival rate of lung cancer is 17.7%, which is lower than that of many other leading cancers, such as colon cancer (64.4%) and breast cancer (89.7%). There are two main types of lung cancer: small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). NSCLC accounts for the majority of lung cancers (80-85%) and has two major subtypes: adenocarcinoma (AC), representing approximately 40%, and squamous cell carcinoma (SC), representing approximately 25-30%. Accurate survival analysis is essential for personalized treatment management and prognosis. For example, closer follow-up and more aggressive treatment might benefit patients with poorer prognoses [1].

Currently, histopathology images serve as the gold standard for lung cancer diagnosis and are primarily evaluated by pathologists or doctors. However, this process is labor intensive, time consuming, and subject to high interobserver variability. Recently, automated histopathological analysis systems [2] have been shown to provide accurate, consistent, and valuable decision support for the diagnosis of different diseases, such as breast cancer [3], pancreatic neuroendocrine tumors [4], lymphoma [5], and lung cancer [1,[6][7][8]. With the emergence of deep learning methods that have achieved great successes in computer vision [9][10][11][12], in this work, we aim to develop a deep learning-based lung cancer survival analysis system that can provide accurate prediction of patient survival outcomes and identify important image biomarkers.

Cell localization

Pathologists make diagnostic decisions based on cellular and inter-cellular level morphology, and thus accurate cell localization is a prerequisite step for lung cancer survival analysis. Since cells usually exhibit circular or approximately circular shapes and their sizes fall within a relatively small range, many methods [13][14][15] are designed to fully utilize this prior information. These methods primarily consist of three steps: cell confidence map generation, cell center localization, and optional post-processing. Here, we refer to a cell confidence map as an intermediate image transform that highlights cell centers. For example, in [14], Byun et al. apply a Laplacian of Gaussian (LoG) filter with a fixed scale to locate nuclei on retinal images, assuming that the cell size is known a priori and that cell detection can be achieved by locating the maximum filter response in a neighborhood of predefined size. Observing that many cells exhibit round shapes, Veta et al. [16] used a fast radial symmetry transform (FRST) [17] to generate cell confidence maps. Following this idea of radial symmetry-based voting, Parvin et al. [18] proposed an iterative voting approach based on (weakly) radial symmetries, which is adaptive to geometric perturbation and can handle elliptical objects. However, the iterative procedure in [18] is computationally expensive. To address this issue, Qi et al. [15] proposed single-pass voting for cell detection, which performs only one round of voting and computes the final cell centers by applying mean shift clustering [19] to the vote image.
Recently, deep learning-based models, especially convolutional neural networks (CNNs), have attracted particular interest [9][10][11][12] and achieved state-of-the-art performance in various vision tasks, such as image classification [20,21], object detection [22,23], and segmentation [24]. Great successes in applying CNNs to medical image analysis have also been reported [25][26][27][28]. In [25], Cireşan et al. applied a deep CNN for automatic mitotic cell detection in breast cancer histology images. In [27], Song et al. first computed a pixel-wise coarse segmentation with a CNN and obtained the final nuclei locations using a fast min-cut/max-flow graph inference algorithm [29]. In both [25] and [27], the CNN models are applied to testing images in a sliding-window manner for pixel-wise classification, which is computationally expensive. Recently, Long et al. [30] proposed the fully convolutional network (FCN). In contrast with conventional CNN methods [9][10][11][12], an FCN is trained in an end-to-end manner and can produce output maps with the same size as the inputs, and thus it is both asymptotically and absolutely efficient [30].

Survival analysis
Survival analysis is a well-studied field in health statistics that aims at predicting the time until the occurrence of an event of interest, such as the onset of a disease, tumor recurrence, or death after some treatment intervention. The time between the beginning of follow-up and the occurrence of the event is called the survival time or failure time. One important issue to be considered in survival analysis is censoring, which occurs, for example, when a subject is lost to follow-up during the study period or does not experience the event of interest before the study ends. Since the exact survival times of censored subjects are unknown, and such subjects account for a large portion of the data, standard statistical methods such as linear regression are not suitable for survival time data. The most commonly used method in survival analysis is the Cox proportional hazards model [31]. Other methods include the Kaplan-Meier estimate [32] for calculating survival probabilities and the log-rank test [33] for comparing the survival outcomes of two or more subject groups.
Recently, several works have been published regarding survival analysis of lung cancer using pathological image features [1,[6][7][8]. In [6], Wang et al. proposed three groups of image morphological features (geometry features, pixel intensity statistics, and texture features) extracted from the segmented cell regions, and the Cox proportional hazards model is used to select image features that are correlated with patient survival outcomes. Similarly, Yao et al. [7] enhanced the three groups of image features by including the spatial distributions of cell subtypes (tumor, lymphocyte, stromal) and built a separate survival model for each of the two major subtypes of NSCLC: adenocarcinoma and squamous cell carcinoma. In [1], the cells are first segmented using the Otsu threshold selection method [34], and then a total of 9,879 quantitative features are extracted from each image patch using CellProfiler [35]. All of these works succeeded in building powerful survival models and finding valuable biomarkers. For example, in [6], pixel intensity and texture features are found to be correlated with survival outcomes; in [7], image features related to cell subtype distributions, cell shape, and granularity are selected; in [1], Zernike shape, texture, and the radial distribution of pixel intensity are among the top predictive features.
However, the aforementioned methods [1,6,7] face several limitations. First, they all require cell segmentation as a prerequisite step. As opposed to cell detection, accurate, robust, and efficient cell segmentation remains a challenging task because 1) there exist significant variations in intra- and inter-cellular intensity, especially for cancer cells across different patients, and 2) cells are often clustered into clumps, such that they might partially overlap with one another. Inaccurate cell segmentation can harm the discriminative power of some morphological features, such as cell granularity or shape. Second, the cell features are hand-crafted; thus, they are error-prone and do not contain any high-level information related to diagnosis. In recent years, deep learning-based models have achieved state-of-the-art results in feature representation learning. Zhu et al. [8] proposed a deep learning-based survival model that takes 1024 × 1024 image patches as input and treats the classification outputs as patient risk scores for survival analysis. However, in practice, pathologists or doctors make diagnostic decisions based on cellular-level image information, and these discriminative details can be lost in Zhu's architecture, which downsamples the 1024 × 1024 inputs to 20 × 20 feature maps in the last convolutional layer. More importantly, this system misses the opportunity to use well-developed survival analysis methods, such as the Cox proportional hazards model and its variants [31,36,37], to identify and interpret biomarkers. To date, work that combines the deep learning framework with these classic survival analysis methods remains absent. Finally, individual cellular features are typically aggregated into a patient-level feature vector using relatively simple statistics-based methods, such as taking the mean, median, or standard deviation of each feature dimension [1,6,7]. More advanced local feature aggregation methods, such as bag of words (BoW) [38] or sparsity-based BoW variations [39,40], have not been investigated.
To overcome the aforementioned challenges, we propose a survival analysis system that takes advantage of the emerging deep learning framework [20,21] and well-studied survival analysis methods [31,36,37]. An overview of the proposed system is provided in Fig. 1; it consists of three main components: 1) an end-to-end cellular feature learning module using a deep neural network with global average pooling, where the learned cellular representations encode high-level biologically relevant information without requiring individual cell segmentation and are aggregated into patient-level feature vectors using a locality-constrained linear coding (LLC)-based bag of words (BoW) encoding algorithm; 2) a Cox proportional hazards model with an elastic net penalty for robust feature selection and survival analysis; and 3) a biomarker interpretation module that helps localize the image regions contributing to the survival model's decisions. Extensive experiments demonstrate that the proposed survival model provides excellent predictive power on testing data in terms of two commonly used survival analysis metrics: the log-rank test (p-value) of the Kaplan-Meier estimate and the concordance index (c-index). Furthermore, the proposed system can easily visualize the selected biomarkers, which can serve as regions of interest (ROIs). In this scenario, pathologists or doctors can validate the automatically generated survival analysis results by examining these ROIs in the raw image data. We argue that such systems merit greater investment in the future.

Cell detection via end-to-end learning
In pathological image analysis, cells are the regions of interest, and it is important to achieve accurate and robust cell detection. Conventional CNN architectures [9][10][11][12] usually adopt a sliding-window strategy to make dense predictions in the testing stage, which is computationally expensive. Fortunately, the FCN [30] architecture upsamples the output layer back to the input image resolution. Furthermore, in order to propagate more contextual information to the higher-resolution layers, the U-Net architecture [41] makes the expansive and contracting paths roughly symmetrical by gradually upsampling layers in the expansive path and concatenating them with the corresponding layers in the contracting path. U-Net has been proven effective for cell segmentation and tracking [42]. In this work, we adopt the U-Net architecture for cancer cell detection.
Denote $T = \{(x, y) \in \mathcal{X} \times \mathcal{Y}\}$ as the training data, where x represents a training image and $y = v * G$ is the corresponding cell center probability map; here $v(i,j) = 1$ if there is a human cell center annotation at pixel location $(i,j)$ and $v(i,j) = 0$ otherwise, and G denotes a Gaussian kernel with standard deviation $\sigma$. Let o denote the network output; the loss function can then be defined as a pixel-wise weighted mean squared error:

$$\ell(o, y) = \frac{1}{hw} \sum_{i=1}^{h} \sum_{j=1}^{w} \Big( \beta + \frac{y(i,j)}{\bar{y}} \Big) \big( o(i,j) - y(i,j) \big)^2, \qquad (1)$$

where h and w denote the height and width of x, respectively, and $\bar{y}$ represents the mean value of y. $\beta$ is a predefined constant that controls the weight of background pixels, chosen as $\beta = 0.2$ in our implementation. The final cell center coordinates are obtained via non-maximum suppression on the output map o. Please refer to [42] for more details of the U-Net architecture.
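As a concrete illustration of the peak-picking step, the following is a minimal sketch assuming the output map o is a 2-D numpy array; the window size and threshold are hypothetical parameters rather than values from the paper.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def detect_cell_centers(o, window=7, threshold=0.2):
    """Non-maximum suppression on the output map o: keep pixels that are
    the maximum within a local window and exceed a confidence threshold."""
    local_max = maximum_filter(o, size=window)    # local maximum in each window
    peaks = (o == local_max) & (o > threshold)    # local maxima above threshold
    rows, cols = np.nonzero(peaks)
    return list(zip(rows, cols))                  # (i, j) cell center coordinates
```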

Biological information bearing feature learning

Cellular feature extraction
After cell localization, discriminative cell representations must be learned for survival analysis. In the current literature, cellular descriptors are still composed of hand-crafted image features, and a proportion of them, such as cell size and shape features, rely on cell segmentation; inaccurate cell segmentation reduces the discriminative power of these features. Recently, image features based on the outputs of the last fully connected layers of CNN models have emerged as state-of-the-art generic representations for visual recognition [43-46]. Inspired by the success of CNN features, in this work, we propose to train a deep neural network to classify cells into different risk groups based on the corresponding patient survival times, and we build cell descriptors from activations within the learned network. The proposed architecture in Fig. 2 is based on VGG-Net [47], from which we remove the layers after conv5 and instead add a convolutional layer followed by a global average pooling (GAP) layer with pooling size 5 and stride 5, which is fed directly into the softmax layer for outputting the risk scores. The newly added convolutional layer uses 3 × 3 kernels with 512 feature maps, stride 1, and padding 1. The activations of the GAP layer are used as the feature representations of cells for the subsequent survival analysis. The GAP layer design is inspired by [48], which uses a GAP layer as a structural regularizer that helps the deep neural network prevent overfitting and improves generalization; it was later used in [46] to localize the discriminative image regions (attention) in classification tasks. Note that global average pooling summarizes the spatial information and is thus inherently robust to spatial translations of the inputs. This property is particularly important for robust cell feature learning because inaccurate or inconsistent cell center localization directly leads to input translations.
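To make the architecture concrete, here is a minimal sketch in modern tf.keras (the paper used a Theano-era Keras/Caffe setup, so the layer names and API details below are illustrative, not the authors' code):

```python
# VGG-16 truncated after conv5 (block5_conv3), plus a new 3x3/512 conv and a
# 5x5, stride-5 average pooling whose 512-D output is the cell descriptor.
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(80, 80, 3))
conv5 = base.get_layer("block5_conv3").output               # 5 x 5 x 512
x = layers.Conv2D(512, 3, strides=1, padding="same",
                  activation="relu", name="new_conv")(conv5)
gap = layers.AveragePooling2D(pool_size=5, strides=5)(x)    # 1 x 1 x 512
feat = layers.Flatten(name="gap_feature")(gap)              # 512-D descriptor
out = layers.Dense(2, activation="softmax")(feat)           # low/high risk

model = Model(base.input, out)        # for fine-tuning on risk labels
extractor = Model(base.input, feat)   # for extracting cellular features
```

For an 80 × 80 input patch, conv5 yields 5 × 5 feature maps, so the 5 × 5, stride-5 average pooling is exactly global, and the flattened 512-dimensional vector serves as the cell descriptor.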
To illustrate the robustness of the GAP layer with respect to translations of the inputs, we provide both quantitative and qualitative analyses. Specifically, we first randomly sample 1000 cells from histopathological images of our dataset, The Cancer Genome Atlas (TCGA), pass them forward through the trained network, and treat the activations of the GAP layer as their features $X \in \mathbb{R}^{D \times N}$, where D = 512 and N = 1000. Then, we randomly shift the cell center coordinates in four directions (left, right, top, and bottom) by a certain amount t and extract the corresponding features $X_t \in \mathbb{R}^{D \times N}$ as well. The Euclidean distances between the features of the translated cells and the originals, computed per cell as $D_t = \|X - X_t\|_2 \in \mathbb{R}^N$, are then calculated. We repeat this process for different translation amounts, with t = {1, 3, 5, 7, 9, 11} pixels. For comparison, histograms of oriented gradients (HOG) [49] features are also computed. The median values of $D_t$, normalized by the median inter-cell distances of X, are plotted against t in Fig. 3a. It can be observed that, compared with the HOG image features, the perturbations of the GAP features caused by input translations are smooth and insignificant. Furthermore, for each translated cell, we conduct top-k retrieval against the original cells and compute the precision. The top-k retrieval precision values for different t are shown in Fig. 3b, showing that the retrieval accuracy using GAP features remains high even for a large translation (for example, 11 pixels). In contrast, the retrieval performance using HOG features deteriorates significantly as the translation grows. The robustness of the GAP cellular features with respect to input translations effectively compensates for inaccuracy or inconsistency in cell detection.
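The distance computation itself is straightforward; the sketch below assumes a hypothetical crop(img, center) helper returning an 80 × 80 patch and the extractor model from the earlier sketch, and shows one of the four shift directions.

```python
import numpy as np
from scipy.spatial.distance import pdist

def normalized_shift_distance(imgs, centers, t, extractor, crop):
    """Median GAP-feature distance under a t-pixel shift, normalized by
    the median inter-cell distance of the original features."""
    X  = extractor.predict(np.stack([crop(im, c) for im, c in zip(imgs, centers)]))
    Xt = extractor.predict(np.stack([crop(im, (c[0], c[1] + t))   # shift right by t
                                     for im, c in zip(imgs, centers)]))
    d_t = np.linalg.norm(X - Xt, axis=1)          # per-cell distances D_t
    return np.median(d_t) / np.median(pdist(X))   # normalize by inter-cell spread
```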
In addition, we probe into the neural network and localize the discriminative image regions (attention) for cell classification by following the method in [46]. Several examples of the training patches and the computed class activation maps (CAMs) are shown in Fig. 4; the maps highlight the discriminative image regions for cell classification. It can be observed that the network generally places greater attention on the central regions, as desired. However, when the cells are not located at the image patch centers due to inaccurate cell detection, the network adjusts its attention accordingly. This further shows that the activations of the GAP layer are robust with respect to image translations and are suitable as cell representations.
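For reference, a minimal sketch of the CAM computation of [46] for this architecture: because the GAP output feeds the softmax directly, the map for a class is simply the last conv layer's feature maps weighted by that class's softmax weights (the layer handles below follow the earlier illustrative sketch).

```python
import numpy as np

def compute_cam(model, feature_maps, class_idx):
    """feature_maps: h x w x K activations of the last conv layer for one
    patch (e.g., from a sub-model ending at 'new_conv'); the model's final
    Dense layer holds the K x num_classes softmax weight matrix."""
    W = model.get_layer(index=-1).get_weights()[0]   # shape (K, num_classes)
    cam = feature_maps @ W[:, class_idx]             # h x w weighted sum of maps
    cam -= cam.min()
    return cam / (cam.max() + 1e-8)                  # normalize to [0, 1]
```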

Aggregating cellular features
Given a patient p and a set of D-dimensional cellular descriptors $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{D \times N}$ from p, we aim to aggregate the individual cellular representations X into a single feature vector f as p's representation. One of the simplest yet most effective local descriptor aggregation methods is the BoW model [38]. Given a learned codebook with M entries, $B = [b_1, b_2, \ldots, b_M] \in \mathbb{R}^{D \times M}$, the BoW method converts each cellular descriptor $x_i$ into an M-dimensional code $c_i \in \mathbb{R}^M$, and then all of the cellular codes $C = [c_1, c_2, \ldots, c_N] \in \mathbb{R}^{M \times N}$ are pooled into a single vector $f \in \mathbb{R}^M$. There are many encoding methods, such as hard vector quantization (VQ) [50] and sparse coding-based soft VQ [39,40]. Among these, LLC [40] is used to project each descriptor into its local coordinate system for patient-wise representation learning. Letting $B_i \in \mathbb{R}^{D \times k}$ denote the k nearest neighbors of $x_i$, where $k \ll M$, the code $c_i$ can be obtained by solving a small constrained least-squares system:

$$\min_{\tilde{c}_i} \; \| x_i - B_i \tilde{c}_i \|_2^2 \quad \text{s.t.} \quad \mathbf{1}^\top \tilde{c}_i = 1, \qquad (2)$$

where the k coefficients of $\tilde{c}_i$ are placed at the corresponding entries of $c_i$ and all other entries are set to zero. The final feature representation f for patient p is obtained via 1) sum pooling, $f = \sum_{i=1}^{N} c_i$, or 2) max pooling, $f = \max(c_1, c_2, \ldots, c_N)$, where the max is element-wise, followed by $\ell_2$ normalization, $f = f / \|f\|_2$.

Cox proportional hazards model with elastic net penalty

In health informatics, the Cox proportional hazards model [31] is one of the most commonly used approaches for survival time analysis and is defined as:

$$s_i(t) = s_0(t) \exp(f_i^\top \beta), \qquad (3)$$

where $s_i(t)$ is the hazard for observation i at time t, $s_0(t)$ is the baseline hazard and is left unspecified, $f_i$ is the feature vector of patient i, and $\beta \in \mathbb{R}^M$ is the parameter vector. The estimate of $\beta$ is obtained by maximizing the partial log-likelihood:

$$L(\beta) = \sum_{i:\,\text{uncensored}} \Big( f_i^\top \beta - \log \sum_{j \in R_i} \exp(f_j^\top \beta) \Big), \qquad (4)$$

where $R_i = \{j \mid y_j \geq t_i\}$ is the set of subjects still at risk at time $t_i$. For high-dimensional data with M > P, where P is the number of patients, an $\ell_1$ penalty (lasso) [51] or $\ell_2$ penalty (ridge regression) is added to avoid degenerate solutions. The $\ell_1$ penalty tends to generate sparse solutions, which are often desired for feature selection. However, as indicated in [36], it also causes one significant problem: if two predictors are strongly correlated, the lasso will pick one and entirely ignore the other. To tackle this problem, we add the elastic net penalty [37], a mixture of $\ell_1$ and $\ell_2$, to (4); the problem thus becomes:

$$\hat{\beta} = \arg\max_{\beta} \; L(\beta) - \lambda \Big( \alpha \|\beta\|_1 + \frac{1 - \alpha}{2} \|\beta\|_2^2 \Big), \qquad (5)$$

where $L(\beta)$ is defined in Eq. (4), $\|\cdot\|_1$ denotes the $\ell_1$ penalty, $\|\cdot\|_2^2$ denotes the squared $\ell_2$ penalty, $\lambda$ controls the overall regularization strength, and $\alpha$ balances between $\ell_1$ and $\ell_2$. In our implementation, we solve (5) using the cyclical coordinate descent algorithm [36] implemented in the glmnet package (http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html).
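To make the encoding step concrete, below is a minimal numpy sketch of the approximated LLC solution of Eq. (2) used in [40] (k-NN basis, constrained least squares with a sum-to-one constraint), together with the pooling and $\ell_2$ normalization; the regularization constant is a standard numerical stabilizer, not a value from the paper.

```python
import numpy as np

def llc_encode(X, B, k=5, reg=1e-4):
    """X: D x N cellular descriptors; B: D x M codebook.
    Returns M x N codes, each column with k nonzero coefficients."""
    M, N = B.shape[1], X.shape[1]
    C = np.zeros((M, N))
    d2 = ((X[:, None, :] - B[:, :, None]) ** 2).sum(axis=0)  # M x N squared distances
    for i in range(N):
        nn = np.argsort(d2[:, i])[:k]           # k nearest codewords (B_i)
        Z = B[:, nn].T - X[:, i]                # k x D shifted local basis
        G = Z @ Z.T                             # local Gram matrix
        G += reg * np.trace(G) * np.eye(k)      # regularize for stability
        w = np.linalg.solve(G, np.ones(k))
        C[nn, i] = w / w.sum()                  # enforce the sum-to-one constraint
    return C

C = llc_encode(np.random.rand(512, 2000), np.random.rand(512, 256))
f = C.sum(axis=1)                               # sum pooling (or C.max(axis=1))
f /= np.linalg.norm(f)                          # l2 normalization
```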

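A hedged sketch of the model-fitting step in Eq. (5) follows; the paper solves it with the glmnet R package, and here the Python package scikit-survival is assumed as a stand-in, with synthetic placeholder data.

```python
import numpy as np
from sksurv.linear_model import CoxnetSurvivalAnalysis
from sksurv.util import Surv

rng = np.random.default_rng(0)
F = rng.normal(size=(60, 256))               # 60 patients x 256-D aggregated features
time = rng.exponential(scale=365, size=60)   # survival / censoring times (days)
event = rng.random(60) < 0.6                 # True = death observed, False = censored

y = Surv.from_arrays(event=event, time=time)
cox = CoxnetSurvivalAnalysis(l1_ratio=0.5)   # l1_ratio plays the role of alpha in Eq. (5)
cox.fit(F, y)

risk = cox.predict(F)                        # linear predictor f^T beta as risk score
selected = np.nonzero(cox.coef_[:, -1])[0]   # features surviving the penalty
```
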
Evaluation
We use two metrics to measure the predictive power of the survival model: the Kaplan-Meier estimate (KME) [32], which, combined with the log-rank test, measures the survival differences between two or more groups, and the concordance index (c-index) [52], which reveals the relative risks between patients.
Kaplan-Meier Estimate. In clinical trials, it is important to accurately and robustly measure the fraction of patients who survive a certain amount of time after treatment, in spite of censored observations. For this purpose, the Kaplan-Meier estimate is the simplest yet most effective way to compute survival rates. For two or more groups of subjects, the log-rank test is conducted to measure the difference between their survival distributions. The survival outcomes of two groups are considered significantly different if the p-value of the log-rank test is less than 0.05.
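As an illustration, here is a minimal sketch of this comparison using the lifelines package (an assumption; the paper does not specify a toolkit for this step):

```python
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

def compare_groups(time, event, high_risk):
    """time, event: arrays of survival times and event indicators;
    high_risk: boolean array giving the predicted group of each patient."""
    km_lo, km_hi = KaplanMeierFitter(), KaplanMeierFitter()
    km_lo.fit(time[~high_risk], event[~high_risk], label="low risk")
    km_hi.fit(time[high_risk], event[high_risk], label="high risk")
    res = logrank_test(time[~high_risk], time[high_risk],
                       event_observed_A=event[~high_risk],
                       event_observed_B=event[high_risk])
    return km_lo, km_hi, res.p_value   # significant difference if p < 0.05
```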
Concordance Index. This is another commonly used metric for measuring survival model performance, calculated as the fraction of pairs whose predicted survival risks are correctly ordered among all pairs of subjects that can actually be ordered. The survival times $t_i$ and $t_j$ for patients $p_i$ and $p_j$ can be ordered if $p_i$ is uncensored and $t_i < t_j$; note that $t_j$ is the censoring time if $p_j$ is censored. Let $G = (V, E)$ denote a directed graph, where the vertices V denote all of the patients, and a directed edge $e_{ij}$ exists between two nodes $v_i$ and $v_j$ if the corresponding patient $p_i$ of $v_i$ is uncensored and $t_i < t_j$. Edges can only originate from uncensored nodes. Given a pair of patients $(p_i, p_j) \in E$ and their risk scores $r_i$ and $r_j$, $p_i$ and $p_j$ are considered concordant if $r_i > r_j$. For the concordance index (c-index), a value of 0.5 corresponds to a random guess, and 1 is the best.
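The definition above translates directly into code; here is a minimal numpy sketch (counting ties as 0.5 is a common convention, not stated in the paper):

```python
import numpy as np

def concordance_index(time, event, risk):
    """time: observed times; event: True if death observed; risk: scores."""
    num, den = 0.0, 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue                       # edges only leave uncensored nodes
        for j in range(n):
            if time[i] < time[j]:          # pair (i, j) can be ordered
                den += 1
                if risk[i] > risk[j]:
                    num += 1.0             # concordant pair
                elif risk[i] == risk[j]:
                    num += 0.5             # tie
    return num / den                       # 0.5 = random guess, 1.0 = perfect
```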

Dataset
The proposed framework is validated using a dataset downloaded from the TCGA data portal. TCGA is a collection of cancer specimens with additional clinical information and histopathology slide images. In total, 121 patients with image annotations and complete survival information are collected, and they are randomly partitioned into two sets, Set I and Set II, which contain 63 and 58 patients, respectively. A detailed description of the two sets is summarized in Table 1. A patient is labeled as high risk if his/her known days to death value is less than $T_1$, and as low risk if either the days to death or days to last follow up value is greater than $T_2$. Note that patients whose days to death values are unknown and whose days to last follow up values are less than $T_2$ are excluded, since their risk labels cannot be determined.

Pre-processing Cell detection
In this work, we implemented the U-Net architecture using Theano and Keras. Both training and testing are conducted on a machine equipped with an Intel Xeon E5-1650 CPU and an NVIDIA Quadro K4000 GPU. The qualitative cell detection results for two large image patches are shown in Fig. 5, and several zoomed-in patches are also provided for better illustration. It can be observed that the trained model provides desirable cell center localization results.

Feature learning
The feature learning model is fine-tuned from the pretrained VGG model (https://github.com/BVLC/caffe/wiki/Model-Zoo) using the Caffe [53] implementation on an NVIDIA Quadro K4000 GPU. For each training image, cell centers are localized using the aforementioned cell detection method. For each detected cell, an 80 × 80 patch around it is cropped as a training sample, and the label of the training sample is determined by the corresponding patient's risk status. For each patient, 2000 cells are randomly selected for feature extraction. The model is trained using stochastic gradient descent with an initial learning rate of 0.00001 and a minibatch size of 10. The training is stopped after 100,000 iterations.

Survival analysis without feature selection
We validate the proposed cell feature learning framework for survival analysis using two different setups: 1) Evaluation I, using the selected 26 patients in Set I for training and the rest of Set I plus all patients in Set II for testing; 2) Evaluation II, using the selected 23 patients in Set II for training and the remaining patients in Set II plus all patients in Set I for testing. In both setups, we maximize the testing set sizes. On the training set, a binary classifier (logistic regression in this paper) is trained to classify each patient as low risk or high risk. On the testing set, the model output $r_i \in [0, 1]$ for patient $p_i$ is treated as his/her survival risk score, with a larger value denoting a higher survival risk and vice versa. When computing p-values, a risk score threshold r = 0.5 is used to partition the testing patients into two groups, low/high risk, by their predicted r values, and the p-values are then computed from the log-rank test. When computing c-index values, the raw risk score r is used, and no binary thresholding is involved.
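A minimal sketch of this scoring protocol follows, with synthetic placeholder data standing in for the aggregated patient features (the sizes mirror Evaluation I for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
F_train = rng.normal(size=(26, 256))       # 26 training patients x 256-D features
y_train = rng.integers(0, 2, 26)           # binary low/high risk labels
F_test = rng.normal(size=(95, 256))        # 37 + 58 testing patients

clf = LogisticRegression(max_iter=1000).fit(F_train, y_train)
r = clf.predict_proba(F_test)[:, 1]        # survival risk score r_i in [0, 1]
high_risk = r >= 0.5                       # grouping used for the log-rank test
# the c-index is computed on the raw scores r, with no thresholding
```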
The detailed survival analysis results of Evaluation I and Evaluation II using both the p-value and c-index metrics are listed in Table 2. Note that in addition to the patient features aggregated via LLC reconstruction, the conventional encoding method, BoW, is also tested. For both BoW and LLC encoding, the dictionary B is learned via k-means clustering with k equal to 256. Meanwhile, two other local feature aggregation methods are tested for comparison: 1) Cellular voting: we treat the softmax output of the deep learning model (see Fig. 2) as the cell's risk score, and the patient's risk score is determined by averaging all of the cellular scores. 2) Aggregate statistics: as in [1,6,7], the mean, median, and standard deviation of each cellular feature are computed and then concatenated into one single vector; the resulting feature dimension is 1536.
From Table 2, we observe the following: 1) In terms of both the p-value and c-index metrics, the built survival model achieves satisfactory survival prediction outcomes with varying cellular feature aggregation methods, comparable to those achieved in [8] on a different dataset, the National Lung Screening Trial (NLST) lung cancer data. We can conclude that the learned cellular features indeed encode patient survival information and generalize effectively to the testing data. 2) The other feature aggregation methods, cellular voting and aggregate statistics, are not able to produce robust and accurate survival predictions on both sets of testing data. For example, though the cellular voting method exhibits good p-value performance, its c-index value is not satisfactory; the aggregate statistics method achieves good results in terms of both p-value and c-index in Evaluation I, but its p-value performance in Evaluation II is poor. The Kaplan-Meier curves stratified using the proposed survival model are shown in Fig. 6, where the numbers of patients at risk at the beginning of each time interval are also reported. These values can be used to reconstruct the Kaplan-Meier curves and provide better insight into the survival prediction results.

Survival analysis with feature selection

The frequencies of the selected features for both BoW and LLC-sum encodings are shown in Fig. 7. Since the BoW and LLC-sum features are derived from the same dictionary B, their selected features can be aligned for comparison. It can be observed that the selected features, which correspond to certain entries (or cluster centers) in the dictionary B, are very consistent across both encoding methods. We can conclude that the cells that fall into those centers carry discriminative information about patient survival outcomes.
After feature selection, a survival model is built on the training data, and its predictive power is validated on the testing data. The Kaplan-Meier curves and the p-values of the log-rank test using both LLC-sum and LLC-max encodings are provided in Fig. 8. In addition, the detailed numerical analysis results are listed in Table 3. Note that 1) the reported table values denote averaged results, and 2) the survival prediction outcomes without feature selection are also computed for comparison. It can be observed that the adopted feature selection strategy, which adds an elastic net penalty to the Cox model (Eq. 5), can find features that are highly correlated with patient survival outcomes. For example, for both LLC-sum and LLC-max encodings, the selected 30 features achieve performance similar to that of using all features (256 dimensions).

Biomarker analysis and visualization
In this experiment, we use all patients in Set I for training and all patients in Set II for testing. No cross validation is involved.
We first conduct univariate survival analysis using the selected features. Four features, x70, x93, x107 and x164, are chosen for examination. For the i-th feature, $X_i \in \mathbb{R}^P$ denotes its values for all patients, where P is the number of patients. A stump classifier is then trained on the training patients, and the learned threshold is used to partition the testing patients into the low- and high-risk groups. When computing the c-index, no risk scores need to be learned, and the raw feature values are used directly for both the training and testing sets. The survival analysis results for both training and testing data are provided in Table 4, and the Kaplan-Meier curves are provided in Fig. 9. It can be observed that some biomarkers do carry discriminative survival information. For example, x93 produces excellent prediction results in terms of both p-value and c-index on both the training and testing sets. In fact, under the same setup, x93 achieves a better c-index value, 0.7157, than the models built using all features (0.6834) and the top 30 selected features (0.6750).
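For clarity, here is a small sketch of the stump-based univariate partition, assuming scikit-learn as the toolkit (the paper does not specify how the stump is trained):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def stump_partition(train_values, train_labels, test_values):
    """Learn a single threshold on one feature (e.g., x93) from the training
    patients and use it to split the testing patients into two risk groups."""
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(train_values.reshape(-1, 1), train_labels)
    threshold = stump.tree_.threshold[0]       # the learned split point
    groups = stump.predict(test_values.reshape(-1, 1))
    return groups, threshold
```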
Since each feature value is closely related to the number of occurrences of cells belonging to a certain cell cluster, it is helpful to visualize those cells. Note that the cluster center dictionary B is learned via k-means clustering, which means that the cell samples that fall into a certain cluster can be efficiently identified. Several cell samples corresponding to features x70, x93, x107, and x164 are shown in Fig. 10. Because the survival model makes predictions using only the cells that fall into the selected cluster centers, visualizing these attention cells is also informative. One example can be found in Fig. 11, which shows that the cells that contributed to the survival model's decision are localized and that they are nearly all tumor cells. This suggests that our survival analysis system can not only provide numeric conclusions regarding patients' survival outcomes but can also provide visual evidence supporting its decisions. We argue that the latter ability is equally important, since it gives pathologists or doctors the opportunity to re-assess those biomarkers using their expertise and knowledge. We believe that this machine learning framework for computer-aided diagnosis should receive greater investment in the future.

Conclusion
In this work, we have proposed a segmentation-free survival analysis system that takes advantage of the recently emerging deep learning framework and well-studied survival analysis methods such as the Cox proportional hazards model. Extensive experiments demonstrate that the proposed survival model offers excellent predictive power for the TCGA lung cancer dataset in terms of two commonly used survival analysis metrics: the log-rank test (p-value) of the Kaplan-Meier estimate and concordance index (c-index). In addition, we provide an approach to visualize the discovered biomarkers, which can serve as concrete evidence supporting the survival model's decisions.