Deep learning model integrating positron emission tomography and clinical data for prognosis prediction in non-small cell lung cancer patients

Background: Lung cancer is the leading cause of cancer-related deaths worldwide. Non-small cell lung cancer (NSCLC) accounts for approximately 85% of all lung cancers. The Cox proportional hazards model (CPH), the standard method for survival analysis, has several limitations. The purpose of our study was to improve survival prediction in patients with NSCLC by incorporating prognostic information from F-18 fluorodeoxyglucose positron emission tomography (FDG PET) images into a traditional survival prediction model built on clinical data.

Results: The multimodal deep learning model showed the best performance, with a C-index of 0.756 and a mean absolute error of 399 days under five-fold cross-validation, followed by ResNet3D for PET (0.749 and 405 days) and CPH for clinical data (0.747 and 583 days).

Conclusion: The proposed deep learning-based integrative model combining the two modalities improved survival prediction in patients with NSCLC.

Despite advances in medical examinations, risk stratification of individual patients for precision medicine remains limited. An important reason for this limitation is the difficulty of integrating different types of data containing prognostic information, a hurdle that cannot be addressed using the traditional Cox proportional hazards model (CPH), the standard method for survival analysis in the medical field [3].
During the last few decades, radiomic texture analysis combined with CPH has been actively investigated for survival prediction [4]. Traditional radiomic texture analysis is based on extracting manually designed (handcrafted) features from a manually or automatically segmented region of interest [5,6]. However, traditional radiomics models with handcrafted features are limited in their ability to extract prognostic information from high-dimensional medical images [7-10]. Moreover, handcrafted feature extraction is laborious and time-consuming. Deep learning-based survival prediction models have recently outperformed traditional feature extraction methods, particularly on high-dimensional medical images [11-15].
Deep learning has also revolutionized image recognition. The convolutional neural network (CNN), composed of multiple convolutional and pooling layers, is the dominant framework for image recognition [16]. A CNN builds layers of features while maintaining spatial information by operating on raw image input. An important property of a CNN is that it sees parts rather than the entire image, exploiting the association between each pixel and its surrounding pixels. However, deeper networks suffer from vanishing and exploding gradients; the ResNet architecture addresses this with shortcut connections that add residuals to the network. As a result, CNN models have been applied to various tasks in medical imaging, such as classification, detection, segmentation, and prognostic prediction [16-18].
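For illustration, a minimal sketch of a residual block with a shortcut connection, assuming PyTorch; the names and layer choices are illustrative, not the exact blocks used in the study.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with an identity shortcut.

    The block learns a residual F(x) and outputs F(x) + x,
    which mitigates vanishing gradients in deep networks.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x                                  # shortcut path
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + residual                          # add residual to main path
        return self.relu(out)
```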
F-18 fluorodeoxyglucose positron emission tomography (FDG PET) imaging, a type of functional whole-body imaging, is a promising tool for prognostic prediction in patients with lung cancer. FDG PET provides information on disease pathophysiology that is difficult to capture in clinical data [19]. We aimed to improve the prediction of survival times in patients with non-small cell lung cancer (NSCLC), which accounts for the majority of all lung cancers, using a multimodal deep learning approach that integrates different types of medical data, namely clinical variables and whole-body FDG PET images.

Data preparation
Clinical variables and FDG PET images were collected from patients who were diagnosed with and treated for NSCLC between January 2011 and December 2017 at Chonnam National University Hwasun Hospital. Clinical data and PET images were obtained at approximately the same time as the lung cancer diagnosis. FDG PET/computed tomography (CT) scans were acquired according to standardized imaging protocols at our institution, and to test generalization, two types of PET/CT scanners were used: Discovery ST and Discovery 600 (both GE Medical Systems, Milwaukee, WI, USA). The three-dimensional (3D) PET images (whole-body axial images) had an image matrix of 128 × 128 × 427. Because coronal maximum intensity projection (MIP) images of FDG PET have shown promising results for survival prediction in patients with NSCLC [20], coronal MIP PET images were also obtained for comparison with the 3D PET images. MIP PET images were generated by projecting the voxels of maximum intensity in parallel from the viewpoint onto the coronal plane. Patients without complete clinical data or an adequate pretreatment F-18 FDG PET/CT were excluded from the present study; therefore, the datasets contained no missing data. The treatment strategy for each patient, as determined by the multidisciplinary team, was recommended to the patient. This study was approved by the Institutional Review Board of our institution (CNUHH-2019-194).
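For illustration, a minimal sketch of deriving a coronal MIP from a 3D PET volume, assuming NumPy; which array axis corresponds to the anterior-posterior direction depends on the export convention and is an assumption here.

```python
import numpy as np

def coronal_mip(volume: np.ndarray, ap_axis: int = 1) -> np.ndarray:
    """Project the voxel of maximum intensity along the anterior-posterior
    axis, collapsing a 3D volume (e.g., 128 x 128 x 427) to a 2D coronal image."""
    return volume.max(axis=ap_axis)

pet = np.random.rand(128, 128, 427).astype(np.float32)  # placeholder volume
mip = coronal_mip(pet)   # shape (128, 427): one coronal 2D image
```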
A total of 2687 NSCLC patients (2005 men and 682 women, with a mean age of 67.95 ± 9.63 years) were included in this study. The datasets were split into two groups, 80% for training and 20% for testing. The patient characteristics for each dataset are listed in Table 1. There were no statistically significant differences among the features of each set based on a t-test for continuous variables and a Chi-square test for categorical variables. At the time of analysis, 1857 patients had died and 830 had been censored.

Statistics and performance metrics
The overall survival (OS) time was measured from the date of clinical diagnosis to the date of death. We predicted the absolute survival time and the 2-year and 5-year survival status of each patient. The median residual life was used as the predicted remaining survival time (Fig. 1).
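As a point of reference, one common formalization of the median residual life is sketched below; the exact convention used in the study is an assumption.

```latex
% Median residual life at time t, given survival function S (assumed convention):
% the additional time m(t) after which conditional survival drops to one half.
\[
  m(t) \;=\; \inf\left\{\, m \ge 0 \;:\; \frac{S(t+m)}{S(t)} \le \frac{1}{2} \,\right\},
\]
% so the predicted survival time of a patient who has survived to time t is
% t + m(t); at t = 0 this reduces to the median survival time.
```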
Baseline differences between the training and testing sets were assessed using a t-test for continuous variables and a chi-square test for categorical variables. Survival curves were generated using the Kaplan-Meier method and compared using the log-rank test [21]. Multivariate CPH regression analyses were conducted to estimate the prognostic effect of clinical features. Statistical significance was set at p < 0.05.
To compare the performance of the models in predicting the OS of an individual patient, we used the C-index, the MAE, and the classification accuracy of the survival status. Owing to the presence of censoring in survival data, the evaluation metrics frequently used for regression, such as the root mean squared error and R², are inappropriate for estimating prediction performance; specialized metrics such as the C-index and MAE are preferred for survival analysis [22]. The performance metrics were calculated and averaged over stratified five-fold cross-validation sets. The C-index is the fraction of all comparable pairs of subjects whose predicted survival times are correctly ordered; it estimates the probability that, for each pair, the predicted survival times are in the same order as the actual survival times [23-26]. Because the C-index considers the relative risk of an event rather than absolute survival times, we added the MAE, defined as the average absolute difference between the predicted median residual lifetimes and the actual observed OS times (ground truth) [22,27]. Lower MAE values indicate better model performance. We measured the MAE in the subgroup of uncensored patients (n = 1857) because censored data underestimate the survival time [28]. The classification accuracy of the 2- and 5-year survival status was also evaluated using the predicted residual life; higher accuracy indicates better performance. Furthermore, we conducted a subgroup analysis comparing each model with the ground truth survival curve and MAE according to the overall stage.
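A minimal sketch of these metrics, assuming the lifelines package; the variable names (pred_time, true_time, event) are illustrative.

```python
import numpy as np
from lifelines.utils import concordance_index

def evaluate(pred_time: np.ndarray, true_time: np.ndarray, event: np.ndarray):
    """event: 1 = death observed, 0 = censored; times in days."""
    # C-index: fraction of comparable pairs whose predicted survival times
    # are ordered the same way as the observed times (censoring-aware).
    cindex = concordance_index(true_time, pred_time, event_observed=event)
    # MAE computed only on uncensored patients, because censored
    # times underestimate the true survival time.
    dead = event == 1
    mae = np.mean(np.abs(pred_time[dead] - true_time[dead]))
    # Accuracy of the 2-year survival status derived from the predicted time.
    acc_2y = np.mean((pred_time[dead] > 730) == (true_time[dead] > 730))
    return cindex, mae, acc_2y
```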

Experimental setup
All experiments were conducted on a computer with an Intel(R) Xeon(R) Silver 4210R CPU and four Nvidia RTX 3090 GPUs with 24 GB of memory each. The Adam optimizer was applied with a learning rate of 1e-4, a batch size of 6 per graphics processing unit (GPU) for 3D images (set according to the GPU memory capacity), and a batch size of 125 for clinical data. Training was run with early-stopping callbacks with a patience of three epochs; a minimal sketch of this configuration is given at the end of this subsection.

Table 2 presents the results of the multivariate CPH model. The model included nine clinical features, most of which were statistically significant risk factors for poor OS. Older age, male sex, and an advanced TNM stage were independent predictors of poor OS, whereas squamous cell carcinoma was associated with favorable survival outcomes.
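A minimal sketch of the training loop implied by the setup above, assuming PyTorch; `model`, the data loaders, and the helper functions are placeholders.

```python
import torch

# Assumed placeholders: model, train_loader, val_loader,
# train_one_epoch(), and validate() are defined elsewhere.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

max_epochs = 200                          # placeholder upper bound
best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(max_epochs):
    train_one_epoch(model, train_loader, optimizer)
    val_loss = validate(model, val_loader)
    if val_loss < best_val:               # improvement: reset patience counter
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:        # early stopping with a patience of 3
            break
```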

Survival prediction models using clinical features
The DeepSurv MLP model using clinical features had two hidden layers, with variants of 32, 64, and 128 nodes per layer, and used the Gaussian error linear unit (GELU) as the activation function [29]. Unlike the rectified linear unit (ReLU), which gates inputs by their sign, GELU weights inputs by their value; it is a smooth nonlinear activation that is also used in the MLP blocks of the Vision Transformer (ViT) model [30]. A comparison of the DeepSurv MLP variants with the CPH model showed similar values for all models; however, the MLP with 64 nodes performed best in terms of the MAE and accuracy (Table 3). Therefore, we chose the DeepSurv MLP with 64 nodes for the final multimodal model.
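A minimal sketch of the chosen clinical model, assuming PyTorch; the widths follow the 64-node, two-hidden-layer variant described above, while the exact layer arrangement is an assumption.

```python
import torch.nn as nn

def clinical_mlp(n_features: int, width: int = 64) -> nn.Sequential:
    """DeepSurv-style MLP: two GELU hidden layers, one log-hazard output."""
    return nn.Sequential(
        nn.Linear(n_features, width), nn.GELU(),
        nn.Linear(width, width), nn.GELU(),
        nn.Linear(width, 1),   # log hazard ratio for the Cox partial likelihood
    )
```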

Survival prediction models using PET images
For survival prediction using 2D MIP images, ResNet models with 18, 34, and 50 layers were compared; performance improved as the number of layers increased. ResNet with 50 layers (ResNet-50) outperformed CPH in terms of the MAE and classification accuracy, but not the C-index. For survival prediction using 3D PET images, ResNet3D models with 10, 18, and 34 layers were compared. Because whole-body PET images have a large volume, ResNet3D variants with relatively low network depth were evaluated. The CNN models using 3D PET images outperformed the models using 2D MIP images on all metrics, and ResNet3D with 34 layers (ResNet3D-34) achieved the best performance among all PET models (Table 4). Therefore, we chose ResNet3D-34 with 3D PET images for the final multimodal model.
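A minimal sketch of adapting an off-the-shelf 3D ResNet to single-channel PET volumes with a single risk output, assuming torchvision (which ships only the 18-layer variant, r3d_18); this stands in for the 10/18/34-layer models compared here and is not the authors' exact implementation.

```python
import torch.nn as nn
from torchvision.models.video import r3d_18

model = r3d_18(weights=None)
# Replace the RGB stem with a 1-channel (PET intensity) stem.
model.stem[0] = nn.Conv3d(1, 64, kernel_size=(3, 7, 7),
                          stride=(1, 2, 2), padding=(1, 3, 3), bias=False)
# One output node: the predicted log hazard used for survival prediction.
model.fc = nn.Linear(model.fc.in_features, 1)
```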

Multimodal deep learning
The DeepSurv MLP model using clinical features showed better performance than the CPH model in terms of the MAE and the classification accuracy of the 2- and 5-year survival status. The final multimodal model combined the ResNet3D-34 image features and the DeepSurv MLP clinical features through two additional fully connected layers. The proposed multimodal model showed the best performance among all prediction models. The C-index was highest for the multimodal model, reaching 0.756 ± 0.01 under five-fold cross-validation. In addition, the MAE was the smallest (approximately 1 year). Furthermore, the 2- and 5-year classification accuracies were the highest, reaching 0.743 ± 0.02 and 0.933 ± 0.01, respectively (Table 5).

Figure 2 shows the Kaplan-Meier curves comparing the distribution of the ground truth survival times and the survival times predicted by each model in the test set. Log-rank tests were conducted to evaluate the similarity of the survival distributions. There were no statistically significant differences between the ground truth and the ResNet3D model (p = 0.17) or between the ground truth and the multimodal model (p = 0.29). However, there was a significant difference between the ground truth and CPH (p < 0.001). In early-stage (I, II, III) NSCLC patients, the CPH model (p < 0.001) differed significantly from the actual survival curve, whereas ResNet3D (p = 0.629) and the proposed multimodal model (p = 0.416) did not. In the advanced stage (IV), the CPH (p = 0.026) and ResNet3D (p = 0.028) models differed significantly from the actual survival curve, whereas the proposed multimodal model (p = 0.362) did not. Prediction models that use PET images as part of the input thus provided more accurate survival predictions than the model using only clinical data in early-stage NSCLC patients, and the proposed multimodal model showed no significant difference from the actual survival curve and provided more accurate survival predictions than the other models at all stages.

The survival curve for each patient with NSCLC was estimated from the predicted hazard ratios. Figure 3 shows individual survival curve estimates for a representative 60-year-old male patient with stage III NSCLC and no history of smoking. The patient's actual observed survival time was 252 days. The predicted survival time of each model was estimated using the median residual life. The residual lifetimes predicted by the CPH, ResNet3D, and multimodal models were 788, 159, and 251 days, respectively. The multimodal model predicted the actual survival time most accurately, with the smallest error (1 day), compared with ResNet3D (93 days) and CPH (536 days).
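A minimal sketch of the Kaplan-Meier and log-rank comparison above, assuming the lifelines package; treating the predicted survival times as fully observed events is an assumption of this sketch, not a detail confirmed by the paper.

```python
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

# Assumed arrays: true_time, pred_time (days); event (1 = death, 0 = censored).
kmf = KaplanMeierFitter()
kmf.fit(true_time, event_observed=event, label="ground truth")  # observed curve

# Log-rank test between observed and predicted survival distributions;
# the predicted times are treated here as fully observed (event = 1).
result = logrank_test(true_time, pred_time,
                      event_observed_A=event,
                      event_observed_B=np.ones_like(pred_time))
print(result.p_value)   # p >= 0.05 suggests similar distributions
```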

In the subgroup analysis of the MAE according to the overall stage, the advantage of the models using PET data (ResNet3D-34 and the multimodal model) over the model using only clinical data (CPH) was more prominent in the early stages (I, II, and III) (Fig. 4). In the early stages, the ResNet3D-34 and multimodal models differed from CPH to a statistically significant degree, and the MAE of the CPH model was correspondingly larger.

Discussion
The prediction of the prognosis of individual patients is important for estimating the effectiveness of treatment and improving patient care [31]. In the present study, we proposed a multimodal deep learning model that integrates two heterogeneous modalities (clinical data and 3D PET images) with joint fusion to predict the OS time of NSCLC patients. The integrative multimodal model showed improved prognostic performance compared with the traditional CPH model using clinical data, the ResNet model using 2D PET images, and the ResNet3D model using 3D PET images. The proposed model appears to effectively combine the information inherent in the two modalities and reflect it in the survival prediction. This is probably because, unlike 2D ResNet, ResNet3D can learn additional information, such as the spatial context around the tumors. Furthermore, ResNet3D handles a relatively small axial area close to the tumor, so its attention is not distracted by uninformative non-tumor areas in the images [32,33]. ResNet3D has previously outperformed other 3D-CNN models, including C3D and RGB-I3D, on Kinetics, a large-scale video dataset [33], and our results are consistent with that study.
Traditional radiomic approaches for predicting cancer prognosis using imaging data have been actively investigated [4,34]. However, the handcrafted feature extraction of radiomics is laborious and time-consuming and cannot exploit the complete information in the images. Because deep learning-based models have shown good performance in image classification, localization, detection, segmentation, and registration, deep learning-based survival prediction has been investigated to overcome these limitations, although it has not yet been fully explored [35]. Whereas the traditional CPH predicts the hazard function and requires specific assumptions to evaluate the survival time, the proposed model directly predicts the individual survival time (residual life). Direct prediction of the survival time, rather than of a hazard or distribution function, provides a more intuitive interpretation of the prognostic predictions [36].
In the present study, both 2D and 3D PET images were evaluated as input data for survival prediction. The prediction model using 3D PET images performed much better than that using 2D MIP images [37,38]. MIP is a common method for visualizing 3D images by converting them into 2D images [39]; MIP PET images project the voxels of maximum intensity in parallel from the viewpoint onto the plane. Although MIP images reduce the data size and the computing power required during training, they may be limited in reflecting the spatial information of the tumor, which contains useful prognostic information. The use of whole 3D medical images might therefore be more robust than 2D images for prognostic prediction in cancer patients [17].
The present study has certain limitations. First, the number of features in the clinical data was limited because it was difficult to collect medical data through electronic medical records. However, we included essential clinical risk factors that are preferentially collected and readily used as prognostic factors in the real world. The TNM stage alone is often considered as a prognostic factor when making decisions regarding treatment and management, owing to the lack of an appropriate model incorporating information from different modalities. Moreover, we included major risk factors for NSCLC, such as age, sex, histology, smoking history, and the TNM stage. Second, PET images without lesion annotation were used. Although lesion annotation might have improved the predictive performance of the deep learning models, lung cancer patients may have multiple metastatic lesions, ranging from several to hundreds, and annotating such lesions takes a significant amount of time and effort by physicians. Instead, we attempted to improve accuracy and generalizability by collecting data from a relatively large number of patients. Finally, the present study still has room for performance improvements through state-of-the-art CNN models. Variants of ResNet, such as the 3D densely connected convolutional network (3D-DenseNet) and ResNet(2+1)D, have been proposed and have outperformed ResNet3D in imaging analysis [40-42]. Further research is necessary to address the challenges of predicting prognosis using state-of-the-art CNN models for medical imaging applications.

Conclusion
The results of the present study indicate that a deep learning model integrating clinical data and PET images can improve the prognostic prediction power in NSCLC patients, especially those with early-stage disease. The proposed multimodal deep learning model successfully integrates different types of medical data and provides intuitive prognostic predictions to physicians and NSCLC patients.

Method
The modeling process that combines the two modalities is shown in Fig. 1. First, we compared the performance of DeepSurv with that of the traditional CPH to choose a suitable model for the clinical data. DeepSurv is a multilayer perceptron (MLP) adapted for survival analysis, a form of feedforward deep neural network (DNN). DeepSurv predicts the effects of clinical covariates on the hazard rate, parameterized by the weights of the network. The loss function for DeepSurv comprises the negative log partial likelihood from the CPH and a regularization term. The open-source DeepSurv code by Katzman et al. was used [43]. To optimize the hyperparameters of DeepSurv, MLPs with three hidden-layer widths (32, 64, and 128 nodes) were compared using Harrell's concordance index (C-index) and the mean absolute error (MAE).
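A minimal sketch of the negative log Cox partial likelihood at the core of the DeepSurv loss, assuming PyTorch (the regularization term can be supplied by weight decay); this is an illustration, not the authors' exact implementation.

```python
import torch

def neg_log_partial_likelihood(risk: torch.Tensor,
                               time: torch.Tensor,
                               event: torch.Tensor) -> torch.Tensor:
    """risk: predicted log hazards, shape (N,); time: survival times, shape (N,);
    event: 1 if death observed, 0 if censored."""
    order = torch.argsort(time, descending=True)      # sort so the risk set of
    risk, event = risk[order], event[order].float()   # subject i is indices 0..i
    log_cumsum = torch.logcumsumexp(risk, dim=0)      # log-summed hazards over risk set
    # Negative mean log partial likelihood over uncensored subjects only.
    return -((risk - log_cumsum) * event).sum() / event.sum().clamp(min=1.0)
```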
Then, we compared the predictive ability of ResNet3D on 3D PET images with that of ResNet on two-dimensional (2D) MIP images. Because ResNet contains shortcut connections that turn the network into its residual counterpart and allow the stacked layers to fit a residual mapping, we adopted ResNet to extract features from the PET images [17]. For survival prediction using 2D MIP images, ResNet models with 18, 34, and 50 layers were compared; for survival prediction using 3D PET images, ResNet3D models with 10, 18, and 34 layers were compared. We used a model structure with batch normalization and a ReLU activation after each convolutional layer. The convolution kernel size was 3 × 3 × 3, convolution layers with a stride of two were used for downsampling, and adaptive average pooling was applied before the final fully connected layer [33]. The final multimodal model was constructed by combining the CNN with the optimal parameters for PET and the DNN with the optimal parameters for the clinical data [44].
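A minimal sketch of the joint-fusion architecture described above, assuming PyTorch; the fusion width and head depth are assumptions, and `image_net`/`clinical_net` stand for the ResNet3D-34 and DeepSurv MLP feature extractors truncated before their output layers.

```python
import torch
import torch.nn as nn

class MultimodalSurvival(nn.Module):
    """Joint fusion: concatenate image and clinical features, then a small head."""
    def __init__(self, image_net: nn.Module, clinical_net: nn.Module,
                 img_dim: int, clin_dim: int):
        super().__init__()
        self.image_net = image_net        # ResNet3D-34 up to its feature layer
        self.clinical_net = clinical_net  # DeepSurv MLP up to its feature layer
        self.head = nn.Sequential(        # fusion head (width is an assumption)
            nn.Linear(img_dim + clin_dim, 64), nn.GELU(),
            nn.Linear(64, 1),             # predicted log hazard
        )

    def forward(self, pet: torch.Tensor, clinical: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([self.image_net(pet), self.clinical_net(clinical)], dim=1)
        return self.head(feats)
```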