Identifying diseases that cause psychological trauma and social avoidance by GCN-Xgboost

Background With the rapid development of medical treatment, many patients consider not only survival time but also quality of life. Changes in physical, psychological and social function during and after treatment cause considerable difficulty for patients and their families. According to the bio-psycho-social medical model, mental health plays an important role in treatment. Therefore, medical staff need to know which diseases have a high potential to cause psychological trauma and social avoidance (PTSA). Results First, we collected diseases known to cause PTSA from the literature. Then, we calculated the similarities between related diseases, based on their known related genes, to build a disease network. Next, we obtained the proteins related to these diseases from UniProt and used them as disease features. Thus, in the disease network, each node denotes a disease and carries the information of its related proteins, and each edge carries the similarity between two diseases. A graph convolutional network (GCN) was then used to encode the disease network, so that each disease’s own features and its relationships with other diseases were extracted. Finally, Xgboost was used to identify PTSA diseases. Conclusion We developed a novel method, ‘GCN-Xgboost’, and compared it with several traditional methods. Under leave-one-out cross-validation, its AUC and AUPR were higher than those of the existing methods tested. In addition, case studies were carried out to verify our results. We also discuss the trajectory of social avoidance and distress during acute survival of breast cancer patients.

As shown in Fig. 2, some proteins are related to more than 1000 diseases, whereas others are associated with fewer than 100 diseases. Therefore, the features are sparse.

Comparison experiments
Since only 23 diseases are known to cause PTSA, we used leave-one-out cross-validation to test the performance of GCN-Xgboost. We divided all diseases into 23 groups, each containing one known disease and one group of unknown diseases. In each round, one group was used as the test set and the rest as the training set. We compared our method with support vector machine (SVM), artificial neural network (ANN), deep neural network (DNN) and random forest (RF). Figure 3 shows the AUC and AUPR of the results.
As Fig. 3 shows, GCN-Xgboost performed best among the five methods, with an AUC of 0.97 and an AUPR of 0.78. The second-best method was DNN, probably because it can learn complex non-linear relationships from sparse data. SVM performed worst, probably because it handles high-dimensional features poorly.
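The grouping used for this validation can be sketched as follows. This is only an illustration under stated assumptions: the function name `make_folds`, the round-robin split of the shuffled unknown diseases and the fixed random seed are hypothetical choices, not details given in the text.

```python
import random

def make_folds(known, unknown, seed=0):
    """Pair each known disease with one group of unknown diseases.

    Assumption: unknown diseases are shuffled and dealt round-robin into
    len(known) groups; the paper does not state how groups were formed.
    """
    rng = random.Random(seed)
    shuffled = list(unknown)
    rng.shuffle(shuffled)
    k = len(known)
    groups = [shuffled[i::k] for i in range(k)]
    folds = []
    for i, positive in enumerate(known):
        test = [positive] + groups[i]
        held_out = set(test)
        # Everything not held out (known and unknown) forms the training set.
        train = [d for d in list(known) + shuffled if d not in held_out]
        folds.append((train, test))
    return folds

# Toy example with 3 known and 10 unknown diseases (names are placeholders).
known = ["k0", "k1", "k2"]
unknown = ["u%d" % i for i in range(10)]
folds = make_folds(known, unknown)
```

With 23 known diseases, the same call would yield 23 folds, each holding out exactly one known disease.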

The power of GCN
Although GCN-Xgboost performed best among these methods, we also wanted to know how much the GCN contributes. Therefore, we used Xgboost alone to identify diseases that can cause PTSA and compared the results with those of GCN-Xgboost.
The results are shown in Table 1.
As Table 1 shows, the AUC changed little after adding GCN, but the AUPR changed substantially: the AUPR of Xgboost alone was only 0.61, whereas that of GCN-Xgboost was 0.78, which suggests that GCN-Xgboost reduces false positives. Because GCN encodes the similarities between diseases, more information is provided to the classifier, so the method performs better.
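Why AUPR can move while AUC barely changes is easier to see on a toy ranking with few positives. The sketch below is not the evaluation code of this study: it assumes AUC is computed as the fraction of correctly ordered positive/negative pairs and AUPR is estimated by average precision (one common estimator), and the labels and scores are made up.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive (no tie handling)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Two positives among ten samples; two negatives outrank the second positive.
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
toy_auc = auc(labels, scores)                 # 14 of 16 pairs ordered correctly
toy_aupr = average_precision(labels, scores)  # (1/1 + 2/4) / 2
```

In this toy case two false positives near the top cost far more in average precision than in AUC, which is the pattern described above for data with very few positive samples.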

Case study
After verifying the effectiveness of GCN-Xgboost, we used it to identify diseases that can cause PTSA. All known PTSA diseases were used as positive samples, and we randomly selected 100 unknown diseases as negative samples to build the model. In total, 228 diseases were identified as diseases that cause PTSA. To check these results, we searched the literature for supporting case studies. Flatt et al. [19] reported that Alzheimer's disease is very likely to cause PTSD. They also found that people with PTSD and depression have twice the risk of dementia.
Yi-Frazier et al. [20] found that adolescents with type 2 diabetes and their families experience significant psychological stress.

PTSA in breast cancer
Breast cancer patients are well known to be at high risk of PTSA.
From February 2017 to October 2017, 200 eligible patients with breast cancer were selected by random sampling from the Department of Breast Surgery at the Shanxi Provincial Tumor Hospital. After written informed consent was obtained, trained researchers filled out the questionnaire for each patient.
All selected patients met the following four conditions: (1) breast cancer was diagnosed by pathological examination and the patient had agreed to mastectomy; (2) age ≥ 18 years; (3) the patient had received primary school or higher education and was able to communicate effectively; (4) the patient was aware of the diagnosis and participated voluntarily.
Patients were excluded if they met any of the following conditions: (1) they had complications, such as heart disease, hypertension, and kidney disease; (2) they had other malignancies; (3) they were receiving antipsychotics for mental disorders.
The questionnaire included: (1) basic information: age, occupation, education, retirement status, payment method for medical care, marital status, religion, and menopause status; (2) disease-related data: breast volume, severity of alopecia, breast cancer family history, and willingness to undergo contralateral prophylactic mastectomy; (3) basic information of the spouse: age, nationality, religion, education, occupation, and retirement status.
The Social Avoidance and Distress Scale (SADS) [21], developed by Watson and Friend in 1969, consists of 14 items measuring social avoidance and 14 items measuring social distress. Each item is answered "yes" or "no". The reliabilities of the avoidance and distress subscales are 0.87 and 0.85, respectively. Item scores are summed to obtain a total score. A total score higher than 9 indicates that the patient is suffering from social avoidance and distress. The total score for healthy individuals in China is 8.03 ± 4.86.
The Self-Esteem Scale (SES), developed by Rosenberg in 1965, consists of 10 items. Each item is rated on a four-point scale, where 1 = strongly agree, 2 = agree, 3 = disagree, and 4 = strongly disagree. Therefore, the total score ranges from 10 to 40. A total score lower than 25 indicates low self-esteem, 26-32 indicates moderate self-esteem, and 33 or higher indicates high self-esteem. It is the most commonly used instrument for measuring self-esteem in China.
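The two scoring rules can be written down as a short sketch. The function names are hypothetical, and two assumptions are made: each SADS item has already been keyed to 0 or 1 before summation, and an SES total of exactly 25, which the stated bands do not cover, is reported as "unspecified" rather than assigned to a band.

```python
def sads_total(item_scores):
    """Sum 28 SADS item scores, each already keyed to 0 or 1 (assumed input)."""
    if len(item_scores) != 28 or any(s not in (0, 1) for s in item_scores):
        raise ValueError("expected 28 item scores of 0 or 1")
    return sum(item_scores)

def has_social_avoidance_and_distress(total):
    """Apply the cut-off stated in the text: a total higher than 9."""
    return total > 9

def ses_category(total):
    """Map an SES total (10 to 40) to the bands stated in the text."""
    if not 10 <= total <= 40:
        raise ValueError("SES total must be between 10 and 40")
    if total < 25:
        return "low"
    if 26 <= total <= 32:
        return "moderate"
    if total >= 33:
        return "high"
    return "unspecified"  # total == 25 falls between the stated bands

# Example: a patient answering 10 of the 28 SADS items in the keyed direction.
example_total = sads_total([1] * 10 + [0] * 18)
example_flag = has_social_avoidance_and_distress(example_total)
```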
Alopecia was graded according to the National Cancer Institute Common Terminology Criteria for Adverse Events (NCI-CTCAE) 4.0: grade 0, no alopecia; grade 1, hair loss < 50%, which is visible only close up and may need to be covered by a different hairstyle; grade 2, hair loss > 50%, which needs to be covered by wigs or hats.
Breast volume was defined as brassiere cup size, i.e., the difference between the upper and lower chest circumferences. The cup size was recorded as A to E.
Considering that the number of patients will decrease during follow-up, the sample size was increased by 20%. A total of 800 questionnaires were distributed in four rounds of surveys.
Four rounds of face-to-face surveys were conducted by trained researchers, one in each of the four phases of treatment: (1) after diagnosis but before mastectomy, (2) after mastectomy but before chemotherapy, (3) at mid-chemotherapy (in the second cycle), (4) at the end of chemotherapy. A total of 192 patients completed all four rounds, and a total of 768 valid questionnaires were collected.
As shown in Table 2, the questionnaire scores differed significantly among the four phases of acute survival. The mean SADS score over the four phases was 12.87 ± 5.71, which was significantly higher than that of healthy individuals in China (t = 11.741, P < 0.001).
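The reported t value can be approximately reproduced from the summary statistics. The text does not name the test or the sample size used, so the sketch below assumes a one-sample t-test of the patient mean against the healthy-population mean of 8.03, with n = 192 (the number of patients who completed all four rounds).

```python
import math

def one_sample_t(sample_mean, sample_sd, n, reference_mean):
    """t statistic for a sample mean against a fixed reference value."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - reference_mean) / standard_error

# Summary statistics from the text; n = 192 is an assumption.
t_value = one_sample_t(12.87, 5.71, 192, 8.03)
```

Under these assumptions the statistic comes out close to the reported 11.741; the small remaining difference would be consistent with rounding of the published mean and standard deviation.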
As shown in Table 3, statistical analysis revealed significant differences in self-esteem among the four phases of acute survival. The proportion of patients with low self-esteem was largest after mastectomy but before chemotherapy (28.1%). Thereafter, the number of patients with low self-esteem decreased, while the number of patients with moderate self-esteem increased. The results of the univariate analysis of social avoidance and distress are shown in Table 4. Breast size, willingness to undergo contralateral prophylactic mastectomy, self-esteem, and spouse education are factors associated with significant differences in social avoidance and distress.
Compared with a spouse with elementary education or below, a spouse with high school/technical education is a protective factor against social avoidance. Compared with low self-esteem, moderate self-esteem is a protective factor against social avoidance. Willingness to undergo contralateral preventive mastectomy in genetic mutation carriers is a risk factor for social avoidance.

Discussion
Breast cancer patients experience severe social avoidance and distress during acute survival, especially in the stage between mastectomy and chemotherapy. Mastectomy can induce psychological and physical stress. Moreover, the loss of femininity after the operation exacerbates the distress. Breast loss, together with the hair loss, nausea and weakness caused by chemotherapy, seriously affects the mood of patients. They may even worry about being disliked by others and thus avoid social interaction. Medical staff should cooperate with patients' families to understand and support patients, create a relaxed and positive environment for them, and enhance their sense of family and social belonging. Self-esteem is a person's emotional experience and evaluation of the self in the social process. It is the core of self-awareness and an important indicator of mental health. Self-esteem affects patients' cognition, emotion, behavior, and mental health. In this study, the number of patients with low self-esteem was largest in the period between mastectomy and chemotherapy. This may be related to the decline in self-care ability, self-identity disorder and weakened social role function. Patients tend to avoid social interactions and become more sensitive to interpersonal relationships, anxious and distressed. Self-esteem is a protective factor for mental health. An optimistic and positive attitude towards reality can enhance resilience. Medical staff should share cases of patients who have successfully fought the disease and recommend breast reconstruction and rehabilitation to help patients with low self-esteem improve their self-evaluation and emotional experience, and should encourage them to express their emotions.
It has been suggested that the spouse's concern about the patient's appearance is an important factor in postoperative depression. The negative emotions of the spouse further increase the psychological burden of the patient, whereas the support of the spouse provides positive psychological support. The results of this study indicate that the education level of the spouse may be related to social avoidance. A well-educated spouse may help patients understand and deal with the disease correctly, choose the best treatment plan, and provide them with positive psychological support to reduce their negative emotions. Therefore, medical staff should provide the spouses of breast cancer patients with the necessary psychological and informational support, improve their ability to care for the patients, and encourage them to support the patients so as to reduce the patients' social avoidance.
The results of this study indicate that contralateral preventive mastectomy for genetic mutation carriers increases the likelihood of social avoidance or aggravates social distress. According to reports, the risk of contralateral breast cancer in patients with unilateral breast cancer increases by 0.5-0.75% each year. Contralateral mastectomy has been shown to be effective for genetic mutation carriers. In this study, 56.25% of subjects were willing to undergo contralateral prophylactic mastectomy. However, this willingness is a risk factor for social avoidance and distress. Loss of both breasts, surgical trauma, increased risk of complications, and financial burden lead to fear, anxiety and depression.
To sum up, medical staff should pay attention to the psychological changes of breast cancer patients during the entire acute survival period, especially after mastectomy and in the middle period of chemotherapy, and provide them with positive psychological support. Medical staff are obliged to help patients improve self-evaluation, promote psychological adjustment and enhance their ability to cope with stress. In addition, although contralateral preventive mastectomy can effectively prevent breast cancer, it may increase psychological and physical trauma, cause or increase social avoidance and distress, and reduce the patient's quality of life. Therefore, contralateral prophylactic mastectomy should only be performed under strict indications to avoid excessively aggressive treatment.

Conclusions
PTSA seriously threatens patients' mental health and places a burden on society. With the advancement of medical technology, patients seek not only a physiological cure but also a psychological one. PTSA is related to the quality of life of patients after treatment. Therefore, special care is needed for patients with diseases that may cause PTSA. To achieve personalized treatment, we first need to know which diseases can cause PTSA. However, surveying hundreds of patients for each disease is costly in time and money. Therefore, in this paper, we developed 'GCN-Xgboost' to identify diseases that cause PTSA.
First, we calculated the similarities of diseases based on their related genes. Then, we obtained their related proteins from UniProt and built a disease network. GCN was used to encode the network and extract features for each disease. After encoding, the feature of each disease contains not only its related proteins but also its relationships with other diseases. Finally, Xgboost was used to build a model to identify diseases that cause PTSA.
We verified our method by cross-validation and compared it with other existing methods. After verifying its effectiveness, we carried out case studies to check the accuracy of our results. Finally, we discussed PTSA in breast cancer.

Methods
Figure 4 shows the workflow of our method. First, we searched PubMed for diseases that cause PTSA. Then, Disease Ontology (DO) [22] was used to obtain the diseases related to them. After that, a gene-based similarity calculation method was used to calculate the similarities between all the obtained diseases, and a disease network was built from these similarities. Second, we obtained the proteins related to each disease from UniProt [23] and encoded them as disease features, so that each node in the disease network also carries information about its proteins. GCN was then used to extract features from the disease network. Finally, Xgboost was used for classification. We labeled known diseases that cause PTSA as 1 and unknown diseases as 0.

Calculating disease similarity
Most diseases are associated with genes. Therefore, we calculated the similarity of diseases based on their genes. We obtained disease-related genes from HumanNet [24]. Each gene interaction in HumanNet has a log likelihood score (LLS). First, the LLS values are normalized (Eq. 1), where $g_i$ and $g_j$ denote the i-th and j-th gene, respectively, and $LLS_N(g_i, g_j)$ is the LLS after normalization.
The functional similarity score of two genes is then given by Eq. 2, where $e(i, j) \in E(\mathrm{HumanNet})$ means that the interaction edge between $g_i$ and $g_j$ is included in HumanNet.
To calculate the association between one gene $g$ and a gene set $G = \{g_1, g_2, \ldots, g_k\}$, we use Eq. 3, where $k$ denotes the number of genes in $G$.
Finally, two diseases can be regarded as two gene sets $G_1$ and $G_2$, and the similarity between them is calculated by Eq. 4, where $g_{1i}$ is the i-th gene of $G_1$, $m$ denotes the number of genes in $G_1$ and $n$ denotes the number of genes in $G_2$.
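The four steps can be sketched in code under explicit assumptions, since the exact formulas are not reproduced here: LLS is min-max normalized, two identical genes have similarity 1, unconnected genes have similarity 0, the gene-to-set score is the maximum similarity to any member of the set, and the disease similarity is the sum of these scores in both directions divided by m + n. All function names and the toy scores are hypothetical.

```python
def normalize_lls(lls):
    """Min-max normalize raw edge scores to [0, 1] (assumed normalization)."""
    low, high = min(lls.values()), max(lls.values())
    return {edge: (score - low) / (high - low) for edge, score in lls.items()}

def gene_similarity(a, b, lls_n):
    """1 for the same gene, normalized LLS if an edge exists, else 0."""
    if a == b:
        return 1.0
    return lls_n.get(frozenset((a, b)), 0.0)

def gene_to_set(gene, gene_set, lls_n):
    """Association of one gene with a gene set: best match in the set (assumed)."""
    return max(gene_similarity(gene, other, lls_n) for other in gene_set)

def disease_similarity(genes_1, genes_2, lls_n):
    """Sum of gene-to-set scores in both directions, divided by m + n (assumed)."""
    total = sum(gene_to_set(g, genes_2, lls_n) for g in genes_1)
    total += sum(gene_to_set(g, genes_1, lls_n) for g in genes_2)
    return total / (len(genes_1) + len(genes_2))

# Toy network with made-up scores; edges are unordered gene pairs.
raw_lls = {
    frozenset(("A", "B")): 1.0,
    frozenset(("B", "C")): 3.0,
    frozenset(("C", "D")): 2.0,
}
lls_n = normalize_lls(raw_lls)
similarity = disease_similarity(["A", "B"], ["B", "C"], lls_n)
```

In this toy case, gene A has no normalized link to the second set while B and C match perfectly, so three of the four gene-to-set scores equal 1 and the similarity is 0.75.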

Encoding method
First, we searched PubMed for diseases that cause PTSA. Then, we used DO to obtain further diseases related to them. In total, we found 23 diseases that can cause PTSA, and these are related to 2387 diseases in DO. According to UniProt, these diseases correspond to 6875 proteins, which can serve as the features of each disease. The encoding is:
$$F_d = (P_1, P_2, \ldots, P_n)$$
where $F_d$ is the feature vector of disease $d$ and $P_i$ indicates whether the i-th protein is related to this disease: $P_i = 1$ if the protein is related to the disease according to UniProt, and $P_i = 0$ otherwise. $n$ is the number of proteins used.
Since we obtained 6875 proteins in total, n would be 6875, which makes the feature dimension very large. Therefore, we selected the 523 most common proteins, each associated with at least 100 diseases, as features, so n = 523 in our method. Each disease thus has a feature vector of dimension 1 × 523.
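A minimal sketch of this encoding is given below. The function names, the dictionary input format and the toy threshold are assumptions made for illustration; in the setting described above, the threshold would be 100 diseases.

```python
from collections import Counter

def select_proteins(disease_to_proteins, min_diseases):
    """Keep proteins associated with at least min_diseases diseases."""
    counts = Counter(
        protein
        for proteins in disease_to_proteins.values()
        for protein in set(proteins)
    )
    return sorted(p for p, c in counts.items() if c >= min_diseases)

def encode_diseases(disease_to_proteins, selected):
    """Binary vector per disease: 1 if the protein is related to it, else 0."""
    features = {}
    for disease, proteins in disease_to_proteins.items():
        related = set(proteins)
        features[disease] = [1 if p in related else 0 for p in selected]
    return features

# Toy annotation table (placeholder names), with a threshold of 2 diseases.
annotations = {
    "d1": ["P1", "P2"],
    "d2": ["P2", "P3"],
    "d3": ["P2", "P3"],
}
selected = select_proteins(annotations, min_diseases=2)
features = encode_diseases(annotations, selected)
```

Here P1 is dropped because it is linked to only one disease, so each toy disease is encoded over the two remaining proteins.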
Through the process above, we built a disease network from the disease similarities and the disease features. In this network, each node is a disease and each edge carries the similarity between two diseases. There are therefore 2387 nodes in the network, and each node contains the features of its disease. GCN was then used to encode the network.
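For a given graph G = (V, E), V denotes the nodes and E denotes the edges. GCN aims to use a nonlinear function to transform the network into the output, layer by layer:
$$H^{(l+1)} = f\left(H^{(l)}, A\right)$$
with $H^{(0)} = X$, which is the feature matrix of the nodes.
First, we need to obtain the Laplacian matrix $L$:
$$L = D - A$$
where $D$ is the degree matrix, which is calculated from the adjacency matrix $A$:
$$D_{ii} = \sum_{j} A_{ij}$$
$D$ is a diagonal matrix. Then, we normalize $L$ as follows:
$$L^{sym} = D^{-1/2} L D^{-1/2} = I_N - D^{-1/2} A D^{-1/2}$$
The elements of $L^{sym}$ are defined as
$$L^{sym}_{ij} = \begin{cases} 1 & i = j \text{ and } \deg(v_i) \neq 0 \\ -\dfrac{1}{\sqrt{\deg(v_i)\deg(v_j)}} & i \neq j \text{ and } v_i \text{ adjacent to } v_j \\ 0 & \text{otherwise} \end{cases}$$
With the Laplacian matrix, we can perform spectral convolution on the graph. To overcome the overfitting caused by too many parameters, a 'Chebyshev' method has been proposed. In this method, the filter function is:
$$g_{\theta'}(\Lambda) \approx \sum_{k=0}^{K} \theta'_k T_k(\tilde{\Lambda})$$
where $\tilde{\Lambda} = \frac{2}{\lambda_{max}} \Lambda - I_N$, $\Lambda$ is the diagonal matrix of eigenvalues of $L^{sym}$, $\lambda_{max}$ is its largest eigenvalue, and $\theta'$ is a vector of Chebyshev coefficients. The Chebyshev polynomials are defined recursively as:
$$T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x), \quad T_0(x) = 1, \quad T_1(x) = x$$
If we let $\lambda_{max} = 2$ and $K = 1$, the first-order linear approximation of the spectral convolution is:
$$g_{\theta'} \star x \approx \theta'_0 x + \theta'_1 \left(L^{sym} - I_N\right) x = \theta'_0 x - \theta'_1 D^{-1/2} A D^{-1/2} x$$
Therefore, the output of GCN is:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)$$
where $\tilde{A} = A + I_N$, $\tilde{D}$ is the degree matrix of $\tilde{A}$, $W^{(l)}$ is the weight matrix of layer $l$, and $\sigma$ is the activation function. Overall, after encoding by GCN, the feature of each disease contains not only its protein features but also its relationships with other diseases.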
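A single propagation step of this kind can be sketched with plain Python lists. This is an illustration only, assuming the standard first-order GCN layer with self-loops added to the adjacency matrix and a ReLU activation; the weight matrix is fixed by hand here, whereas in practice it would be learned, and the adjacency entries could be the disease similarities rather than 0/1 values.

```python
import math

def normalize_adjacency(adj):
    """Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gcn_layer(adj, features, weights):
    """One layer: ReLU(normalized adjacency x features x weights)."""
    a_hat = normalize_adjacency(adj)
    z = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, value) for value in row] for row in z]

# Two connected nodes with one-hot features and identity weights.
adj = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]
encoded = gcn_layer(adj, features, weights)
```

After one layer, each node's vector is the average of its own features and its neighbour's, which is how a disease's encoding comes to reflect the diseases it is similar to.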

Classification by Xgboost
Xgboost was proposed by Tianqi Chen [25]. Its main advantage for our work is that the input can be a sparse matrix. Since our features are very sparse, Xgboost is well suited to handling them.
Since Xgboost is derived from the Gradient Boosting Decision Tree (GBDT) [26], we first introduce the workflow of GBDT. The objective function consists of two parts: a loss function and a regularization term.
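If T trees are trained, the prediction of the model is the sum of the outputs of the T trees. The base classifier of both Xgboost and GBDT is CART, so the objective function sums the loss over all samples and the regularization terms over all trees. Our target is to obtain each tree $f_i$. The t-th tree is trained on the basis of the previous (t − 1) trees, so the t-th objective function evaluates the loss on the prediction of the first (t − 1) trees plus the output of the t-th tree, together with the regularization term of the t-th tree. Because the previous trees are fixed, the loss then depends only on the new tree. To obtain the regularization term, a decision tree could be defined as: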