A novel method detecting the key clinic factors of portal vein system thrombosis of splenectomy & cardia devascularization patients for cirrhosis & portal hypertension

Background Portal vein system thrombosis (PVST) is potentially fatal for patients if the diagnosis is not timely or the treatment is not proper. There hasn’t been any available technique to detect clinic risk factors to predict PVST after splenectomy in cirrhotic patients. The aim of this study is to detect the clinic risk factors of PVST for splenectomy and cardia devascularization patients for liver cirrhosis and portal hypertension, and build an efficient predictive model to PVST via the detected risk factors, by introducing the machine learning method. We collected 92 clinic indexes of splenectomy plus cardia devascularization patients for cirrhosis and portal hypertension, and proposed a novel algorithm named as RFA-PVST (Risk Factor Analysis for PVST) to detect clinic risk indexes of PVST, then built a SVM (support vector machine) predictive model via the detected risk factors. The accuracy, sensitivity, specificity, precision, F-measure, FPR (false positive rate), FNR (false negative rate), FDR (false discovery rate), AUC (area under ROC curve) and MCC (Matthews correlation coefficient) were adopted to value the predictive power of the detected risk factors. The proposed RFA-PVST algorithm was compared to mRMR, SVM-RFE, Relief, S-weight and LLEScore. The statistic test was done to verify the significance of our RFA-PVST. Results Anticoagulant therapy and antiplatelet aggregation therapy are the top-2 risk clinic factors to PVST, followed by D-D (D dimer), CHOL (Cholesterol) and Ca (calcium). The SVM (support vector machine) model built on the clinic indexes including anticoagulant therapy, antiplatelet aggregation therapy, RBC (Red blood cell), D-D, CHOL, Ca, TT (thrombin time) and Weight factors has got pretty good predictive capability to PVST. It has got the highest PVST predictive accuracy of 0.89, and the best sensitivity, specificity, precision, F-measure, FNR, FPR, FDR and MCC of 1, 0.75, 0.85, 0.92, 0, 0.25, 0.15 and 0.8 respectively, and the comparable good AUC value of 0.84. The statistic test results demonstrate that there is a strong significant difference between our RFA-PVST and the compared algorithms, including mRMR, SVM-RFE, Relief, S-weight and LLEScore, that is to say, the risk indicators detected by our RFA-PVST are statistically significant. Conclusions The proposed novel RFA-PVST algorithm can detect the clinic risk factors of PVST effectively and easily. Its most contribution is that it can display all the clinic factors in a 2-dimensional space with independence and discernibility as y-axis and x-axis, respectively. Those clinic indexes in top-right corner of the 2-dimensional space are detected automatically as risk indicators. The predictive SVM model is powerful with the detected clinic risk factors of PVST. Our study can help medical doctors to make proper treatments or early diagnoses to PVST patients. This study brings the new idea to the study of clinic treatment for other diseases as well.


Background
Portal vein system thrombosis (PVST) refers to the blockage or narrowing of the portal vein, splenic and superior mesenteric veins, or intrahepatic portal vein branches, by a thrombus [1]. It is relatively rare and its clinical manifestations range from asymptomatic to severe complications including fever, abdominal pain, nausea, vomiting, and ileus [2]. The formation of PVST could increase the risk of upper gastrointestinal bleeding, hepatic coma or even fatal intestinal necrosis [3]. Moreover, PVST imposes difficulty on further liver transplantation [4,5]. With the development of imageological examination, more and more studies have shown that the incidence of PVST after splenectomy is significantly higher than previously reported. The reported incidence of PVST after splenectomy is different greatly, ranging from 0.36% [6] to even 80% [7]. Why are there so much inconsistence in the incidence of post-splenectomy PVST? It comes from the difference in examination methods, types of study, time and frequency of postoperative examinations, and the underlying diseases, etc. [8]. Up to now, the specific mechanisms leading to the formation of PVST after splenectomy are not known. It is generally agreed that hemodynamic changes of the portal venous system [9][10][11], blood hypercoagulability [3], cecum induced by splenic vein ligation [12], local inflammatory reaction [13], and irrational use of coagulants [14] are all important factors affecting the occurrence of PVST. Some studies also demonstrated that the formation of PVST was related to the volume of spleen, diameter of portal vein, prothrombin time (PT), plasma D-dimer level, and the function and quality of platelet rather than the count of platelet [15][16][17][18]. So far, it has been controversial in the role of early prophylactic anticoagulation in preventing PVST. This is because of concerning the risk of inducing bleeding, especially in the cirrhotic patients [19][20][21]. However, in the last decade some studies demonstrated that both pro-and anticoagulation elements were concomitantly reduced in liver cirrhosis patients [22,23], and the occurrence of bleeding for these patients was mainly due to the severity of portal pressure, endothelial dysfunction and bacterial infections, but not the disturbed hemostasis [24]. These studies provide the fundamental science for the prophylactic application of anticoagulation in these patients. Although the study to PVST has attracted many researchers [25][26][27][28][29][30] and some of them have found that prophylactic anticoagulation therapy can effectively prevent PVST after splenectomy even to cirrhotic patients [31], there are not any standard regimen for PSVT prophylaxis having been developed, and furthermore there are not any researchers focusing on detecting risk factors of PVST after splenectomy in cirrhotic patients by introducing machine learning to this field. Therefore we devote ourselves to this field.
We first propose a novel feature selection algorithm named RFA-PVST (Risk Factor Analysis for PVST) to detect the clinic risk factors of PVST, then we introduce the typical learning machine SVM to build the predictive model to PVST. We collect the clinic data of 92 splenectomy and cardia devascularization patients for cirrhosis and portal hypertension from the highest level hospital in PR China.
In our RFA-PVST, we propose the definition of discernibility and independence for each index to imply the capability of it in telling a PVST patient from non-PVST patients, and the differences of an index to other indices, respectively. The detected clinic indexes are with much higher discernibility and independence. The SVM model built on the detected risk factors can effectively tell PVST patients from non-PVST patients, and help medicine doctors to make proper cure decisions or early diagnoses to potential PVST patients. 5-fold cross validation experimental results on the aforementioned 92 clinic patients, and the statistic test between RFA-PVST and available famous feature selection algorithms demonstrate that the clinic risk factors detected by our RFA-PVST are statistically significant on which a very powerful predictive model is built.

Results
This section will display the clinic risk factors of PVST detected by our proposed RFA-PVST, and the power of these risk factors in recognizing PVST patients by the performance of the SVM model based on them in terms of its accuracy shorted as Acc in the following of this paper, sensitivity, specificity, precision, F-measure, FPR (false positive rate), FNR (false negative rate), FDR (false discovery rate), AUC (area under ROC curve) and MCC (Matthews correlation coefficient). The performance comparison are shown between our RFA-PVST and the available feature selection algorithms including mRMR [32], SVM-RFE [33], Relief [34], S-weight [35] and LLEScore [36]. The statistic test results between our RFA-PVST and the aforementioned feature selection algorithms are also presented. Figure 1 displays the collection of all clinic indexes in circles in the 2-dimension space with discernibility as xaxis and independence as y-axis. The red circle indicates clinic risk factors, meaning the area of the rectangle enclosed by coordinate lines and axes is much bigger than the rest ones. Table 1 lists clinic indexes in descending order by their risk degrees in 5-fold cross validation experiments. The underlined bold font means the detected risk factors, corresponding to the red circle depicting clinic indexes in Fig. 1. Table 2 displays the performance of 5 different SVM models of 5-fold cross validation experiments on the test subsets in terms of Acc, AUC, sensitivity, specificity, precision, F-measure, FNR, FPR, FDR, and MCC. Table 3 displays the average results of 5-fold cross validation experiments in terms of same metrics as that in Table 2 under same conditions. The underlined bold fonts in Tables 2 and 3 mean the best results.

Statistic test results of RFA-PVST
Friedman's test with α = 0.05 of our proposed RFA-PVST and mRMR, SVM-RFE, Relief, S-weight and LLEScore are displayed in Table 4 in terms of Acc, AUC, sensitivity, specificity, and precision of the SVM predictive models of PVST with the same number of risk indexes detected by each algorithm, respectively.
The multiple comparison test between each pair of algorithms at the confidence level of 0.95 is displayed in Table 5 in terms of Acc, AUC, sensitivity, specificity, and precision. The upper triangle of each test shows the mean rank difference between algorithms, and the lower triangle the statistical significance between each pair of algorithms, where * is the tag of strong significance between corresponding algorithms in the corresponding metrics.

Discussion
This section will discuss all of the experimental results displayed in the section of results.

Clinic risk factor discussion
The results in Fig. 1 disclose that our proposed metric RD is useful in detecting the clinic indexes with higher risk degree. The red circle clinic indexes in Fig. 1 comprise risk clinic indicators of PVST, and can be detected by our RFA-PVST automatically. The results in Fig. 1 reveal that the risk clinic factors for each fold experiment are variant for the variance of exemplars in each training subset of 5-fold cross validation experiments. However the number of risk factors of 5-fold cross validation experiments is from 2 to 8 with average 5. The common clinic indexes are anticoagulant therapy (with ID 32) and antiplatelet aggregation therapy (with ID 33) among 5 risk clinic indicator subsets detected by our proposed RFA-PVST. The clinic indexes of CHOL with ID 7, Ca with ID 17 and D-D with ID 31 appear 3 times among 5 subsets. This fact implies that anticoagulant therapy and antiplatelet aggregation therapy are the first two important risk indicators to predict PVST patients followed by the comparable important clinic indicators of CHOL, Ca and D-D.
The results in Table 1 disclose that antiplatelet aggregation therapy (with ID of 33) is the riskiest clinic index to PVST, followed by anticoagulant therapy (with ID of 32). The WBC (with ID of 20) and INR (with ID of 27) are the clinic indexes with the least risk degree causing PVST. In addition, the results in Table 1 tell us that although the training samples are variant, the first two clinic risk factors are same in each fold of 5-fold cross validation experiment, which further indicate that our proposed RFA-PVST algorithm is powerful in finding the risk clinic factor of PVST.
The results in Table 2 tell us that the performance of different PVST predictive models on test exemplars are variant in terms of Acc, AUC, sensitivity, specificity, precision, F-measure, FNR, FPR, FDR and MCC. The Although its AUC is not the best one among 5-fold cross validation experiments, it has got the comparable good AUC value of 0.84. Therefore we can conclude that these 8 clinic indexes are important clinic indicators on which the sound prediction model can be built to predict whether PVST will take place or not for splenectomy with cardia devascularization patients for liver cirrhosis and portal hypertension. The results in Table 3 tell us that our RFA-PVST can detect risk clinic indicators with which a SVM classifier can be built with best mean predictive accuracy, AUC, specificity, precision, FPR and FDR. Although this predictive model can only recognize 70% PVST patients in terms of sensitivity, not as good as that by SVM-RFE and Relief which can detect all PVST patients, our predictive model can detect 45% non-PVST patients while SVM-RFE and Relief cannot detect any one. This fact means that the predictive models by SVM-RFE and Relief exist the fatal error of recognizing all non-PVST patients as PVST ones, while the SVM classifier based on the risk indicators detected by our proposed RFA-PVST can make excellent tradeoff between sensitivity and specificity.

Statistic test result discussion
It can be seen from the results in Table 4 that p < 0.05 holds for all metrics used to do statistic test, including Acc, AUC, Sensitivity, Specificity, and Precision. So we can conclude that the strong significant difference exist between our RFA-PVST and the compared algorithms, including mRMR, SVM-RFE, Relief, S-weight and LLE-Score, that is the risk indicators detected by our RFA-PVST are statistically significant.
The multiple comparison test results in Table 5 in terms of accuracy (Acc), AUC, sensitivity, specificity, and precision of predictive models of PVST based on the risk indicators detected by the related algorithms reveal that our RFA-PVST can detect the risk clinic factors The underlined bold fonts mean the detected risk factors with much better predictive power to PVST for splenectomy plus cardia devascularization patients for liver cirrhosis and portal hypertension, compared to mRMR, SVM-RFE, Relief, S-weight and LLEScore. The results disclose the fact that our RFA-PVST is powerful in detecting the clinic risk indexes to predict whether PVST will happen or not on splenectomy plus cardia devascularization patients for liver cirrhosis and portal hypertension.

Conclusions
A novel algorithm named RFA-PVST is proposed to detect the clinic risk indicators of PVST for splenectomy and cardia devascularization patients for liver cirrhosis and portal hypertension. The discernibility and independence are defined for each clinic index. All of the clinic indexes are scatted in a 2-dimensional space with independence and discernibility as y-axis and x-axis, respectively. Those clinic indexes in top-right corner of the 2-dimensional space are detected automatically as risk indicators. The SVM classifier is built on the detected risk indicators to predict whether the PVST will happen or not on a splenectomy plus cardiac devascularization patient for liver cirrhosis and portal hypertension. 5-flod cross validation experiments on the clinic data of 92 patients disclose that antiplatelet aggregation therapy is the riskiest clinic index, followed by anticoagulant therapy. Taking the two therapies may lead to PVST for splenectomy plus cardiac devascularization patients for liver cirrhosis and portal hypertension. CHOL, Ca, and D-D are also important risk factors. Anticoagulant therapy, antiplatelet aggregation therapy, RBC, D-D, CHOL, Ca, TT, and weight comprise the clinic risk indicators to PVST. The predictive model based on these 8 risk indicators is very powerful.
Furthermore, the comparison between our proposed RFA-PVST and available typical feature selection algorithms including mRMR, SVM-RFE, Relief, S-weight and LLEScore demonstrate that our RFA-PVST is very powerful to detect the risk clinic indicators to recognize PVST from non-PVST patients. The significant test between the aforementioned algorithms reveal that there is strong significant difference between our RFA-PVST and the famous available feature selection algorithms. In addition, it is fantastic that our study results are coincident with that from references [17,37] about D-D is a clinic risk indicator of PVST.
We can conclude that our study is significant in the field of detecting risk factors causing PVST for splenectomy and cardia devascularization patients for liver cirrhosis and portal hypertension. It can help medical doctors to make proper treatments or early diagnoses to PVST patients. This study also provides a new idea to the clinic treatment of other diseases.

Methods
This section will first introduce the data used in this paper, then the preprocessing method will be introduced for the data. It should be noted that we are authorized to use the data under the condition of deleting the privacy information of patients. Then the SVM learning machine will be briefly introduced. After that we will introduce the idea of our proposed novel algorithm RFA-PVST in detail, and the methods building a SVM classifier in the clinic risk factors detected by our RFA-PVST. Finally the statistical test method will be introduced to value the significant difference between our RFA-PVST and other classic methods.

Data used in this paper
This subsection will cover the data information and the data preprocessing methods used in this paper.

Raw data
We collected clinic data of 92 patients of splenectomy with cardia devascularization for liver cirrhosis and portal hypertension from one of the first level hospital in PR China. The patients are partitioned into two groups,  The clinic indexes of these 92 patients are listed in Table 7. There are 33 clinic indexes, including 6 countable clinic indicators such as age, gender, weight, bleeding volume, anticoagulant therapy, antiplatelet aggregation therapy, and the other 27 measurable indexes. The measuring clinic indexes are recorded daily or every other day after operations and the date was also recorded at the same time. There are two therapy for patients were adopted to prevent PVST after operations including anticoagulant therapy and antiplatelet aggregation therapy. The anticoagulant therapy comprises giving patients low molecular

Data preprocessing
The age and the bleeding volume indexes use the original record value. Gender, anticoagulant therapy, and antiplatelet aggregation therapy are treated as Boolean variables, where male is 0 and female is 1, and without anticoagulant therapy is expressed as 0 and 1 otherwise, and without antiplatelet aggregation therapy is 0 and 1 otherwise. The median of measurable values is taken as the value for that measurable clinic indexes. If PVST occurred then the label for the patient is 1, which belongs to positive class, otherwise the label is − 1, belonging to negative class. To avoid the influence on experimental results from variant measurement metrics for different clinic indexes, we successively normalize and discretize data in (1) and (2).
where x i, j is the specific value of the j th index for the i th patient, and max(x j ), and min (x j ) are the maximum and minimum value of the j th index, respectively.
where μ i is the mean value of index j (1 ≤ j ≤ 33), and its standard deviation is σ i , then the discretized value for the index is d i, j in (2).

Support vector machines
SVM is a typical learning machine coined by Vapnik in 1920s [38]. It is based on the VC (Vapnik-Chervonenkis) dimension and the structure risk minimization with sound theoretic basics and concise mathematic model. It is a learning machine for small exemplars, and has got best generalization by making the optimal trade-off between the model complexity and the learning ability. SVM has been widely used in biomedical filed, and has greatly influenced the diagnosis and predictions of diseases [39][40][41][42]. The characteristic of SVM is that it maps the samples in low dimensional input space into highdimensional feature space via kernel functions, so that the inseparable exemplars in low dimensional input space has become separable in high-dimensional feature space by an optimal hyperplane. The popular used kernel functions are here.
linear kernel functions: , σ is positive real.

RFA-PVST algorithm
Feature selection is to detect several features from original ones to construct the feature subset making a specific criterion optimized [43]. The nature of feature selection is to display samples in a low dimensional space by those selected several features while preserving the pattern of samples as that in its original high dimensional space as much as possible [43]. It is usually implemented by erasing redundant and less important features while preserving the important ones. The selected features not only can preserve the classification power of original system, but also can reduce the complexity of classification model while improving its generalization [43][44][45]. The selected features preserve their physical properties with good interpretability, such that feature selection study has been paid much more attention by experts from statistics and machine learning fields, and has been widely applied to disease diagnoses [39][40][41][42]. The selected features do help medicine doctors to make proper decisions and take proper diagnoses to related patients. We propose RFA-PVST algorithm to detect clinic risk factors of PVST for splenectomy and cardia devascularization patients for cirrhosis and portal hypertension, so as to build the predictive model for PVST via the detected risk factors. The 92 post-splenectomy and cardia devascularization patients comprise exemplars for liver cirrhosis and portal hypertension, and their clinic indexes as features. The detecting clinic risk indexes is in fact a feature selection procedure.
We define the discernibility and independence for each clinic index, and plot the curve of independence with discernibility for all clinic indexes in a 2dimensional space with discernibility and independence as x-axis and y-axis, respectively. All clinic indexes in top-right corner of the 2-dimensional space comprise risk factors for they are with both comparatively high discernibility and high independence, while the less risk ones lie in bottom-left corner. To quantify how much contributions of a clinic index to telling a PVST patient form non-PVST patients, we define the risk degree for each clinic index as the product of its discernibility and its independence, that is, the area of the rectangle enclosed by coordinate lines and axes in the 2dimensional space. Consequently the clinic indexes with much higher risk degree than the rest ones are detected out and the SVM classifier is built based on the risk factors to predict whether the splenectomy and cardia devascularization patients for liver cirrhosis and portal hypertension are PVST patients or not.
Let training dataset D = {x 1 , x 2 , ⋯, x n } ∈ R m × n , where m is the number of patients and n the number of clinic indexes. We define dis j , ind j , and RD j to express the discernibility, independence, and risk degree for the clinic index j(1 ≤ j ≤ n), respectively in (3)-(7). Definition 1 Discernibility: Let N 0 and N 1 be the number of patients with and without PVST, respectively, and S(j) be the statistics of Wilcoxon signed rank test for clinic index j, x i, j is the value of sample i in its clinic index j, then the discernibility dis j of clinic index j is defined in (3), and S(j) is calculated in (4).
where χðÁÞ ¼ From the Definition 1, we can see that dis j of clinic index j can express its discernibility between patients with PVST and without PVST very well, so it can be used to value whether the clinic index j is a risk factor or not of causing PVST for splenectomy and cardia devascularization patients for liver cirrhosis and portal hypertension.
Definition 2 Independence: The independence ind j of clinic index j is defined in (5), where x j and x k are vectors of clinic index j and k. It is a negative exponential function of the correlation coefficient pr between clinic index j and its most correlated clinic index k with higher discernibility. For the clinic index j with the highest discernibility to PVST, its independence is defined as the negative exponential function of the correlation coefficient pr between j and its least correlated clinic index k. This correlation coefficient pr can be any kind of parameters to express the correlation between two variables. We adopt Pearson coefficient in our study. In order to unify the positive or negative correlation between clinic indexes, we adopt the absolute of Pearson coefficient expressed in (6), where X,Y are vectors of any two clinic indexes, and X is the mean vector of X, Y the mean vector of Y.
The above independence definition disclose that the less correlation of a clinic index with other indexes, the stronger is its independence, and vice versa. This definition is coincident with the principles in nature. In addition, the definition in (5) guarantees that the clinic index with the highest discernibility for PVST definitely has got the independence as high as possible, which further guarantees that it will be definitely selected as risk factors of PVST.

Definition 3 Risk Degree (RD):
The risk degree of clinic index j is defined as the product of its discernibility and independence in (7), which is the area of the rectangle enclosed by its coordinate lines and axes, where the discernibility is the x-coordinate and independence the ycoordinate.
The main steps of the proposed RFA-PVST are described as follows.
Input: Training dataset D ∈ R m × n , m is the number of patients, n is the number of clinic indexes, Y is the label vector indicating PVST patients or not.
Output: Set S of risk factors. BEGIN let S = ∅, F = {all clinic factors}; FOR j = 1 to n DO BEGIN calculate dis j for clinic index j in eq. (3); calculate ind j for clinic index j in eq. (5); calculate RD j for clinic index j in equation in (8); END //of FOR Plot all clinic indexes in the 2-dimensional space with discernibility as x-axis and independence as y-axis; Select clinic indexes in topright corner to comprise set S of risk factors; END Constructing predictive models 5-cross validation experiments are conducted, and SVM learning machines with RBF (Radial Basis Function) kernel functions are adopted. The proposed RFA-PVST is used to detect risk factors of PVST. The SVM classifier is constructed based on the detected risk factors. The performance of this SVM classifier is compared to that based on the indices by available feature selection algorithms to evaluate the power of RFA-PVST in detecting factors to recognize PVST patients.

Selecting parameters for SVM
The kernel function and its parameters are very important for a SVM learning machine [46]. We take RBF kernel function and grid search technique to find the optimal penalty parameter C and kernel function parameter γ for SVM. The grid search technique is to first set the specific range for C and γ, respectively, then test each pair of (C, γ) on training subset by cross validation experiments to find the best pair of (C, γ). Finally, the pair (C, γ) with the highest cross validation accuracy is the best pair parameters to be selected.
Building SVM model for predicting PVST 5-fold cross validation experiments are done on our collected clinic data of splenectomy plus cardia devascularization for liver cirrhosis and portal hypertension. The patients with PVST and without PVST are partitioned into 5 balanced parts respectively, so as to get 5 subsets of exemplars for 5-fold cross validation experiments. The RFA-PVST algorithm is conducted on training subset to get risk factors to construct set S. Then we construct the new training subset TS new whose exemplars only embodying risk factors from set S. The best pair of parameters (C, γ) is found on TS new . Finally the SVM classifier is built based on the best pair of parameters (C, γ) and the new training subset TS new to predict PVST.

Evaluation methods
The power of our proposed RFA-PVST is evaluated in two aspects. First, it is evaluated by the performance of the SVM classifier built on the selected risk indexes by proposed RFA-PVST. Second, it is evaluated by the significant statistic test between the SVM classifiers built on the risk indexes by RFA-PVST and by other popular feature selection algorithms.

Model evaluation
The performance of the SVM classifier is tested by exemplars in test subset in terms of predictive accuracy shorted as Acc, sensitivity, specificity, precision, F-measure, FPR (False positive rate), FNR (False negative rate), FDR (False discovery rate), AUC(Area under an ROC curve) and MCC(Matthews correlation coefficient). ROC is the acronym of receiver operating characteristic curve, which is a very famous metric to evaluate a model. AUC is the quantity value of ROC [47,48]. These metrics are defined in eqs. (8)-(17) based on the confusion matrix in Table 8. The power of our RFA-PVST is compared to the available feature selection algorithms including mRMR [32], SVM-RFE [33], Relief [34], S-weight [35] and LLEScore [36].