Feature selection algorithm based on dual correlation filters for cancer-associated somatic variants

Background Since the development of sequencing technology, an enormous amount of genetic information has been generated, and human cancer analysis using this information is drawing attention. As the effects of variants on human cancer become known, it is important to find cancer-associated variants among countless variants. Results We propose a new filter-based feature selection method applicable for extracting cancer-associated somatic variants considering correlations of data. Both variants associated with the activation and deactivation of cancer’s characteristics are analyzed using dual correlation filters. The multiobjective optimization is utilized to consider two types of variants simultaneously without redundancy. To overcome high computational complexity problem, we calculate the correlation-based weight to select significant variants instead of directly searching for the optimal subset of variants. The proposed algorithm is applied to the identification of melanoma metastasis or breast cancer stage, and the classification results of the proposed method are compared with those of conventional single correlation filter-based method. Conclusions We verified that the proposed dual correlation filter-based method can extract cancer-associated variants related to the characteristics of human cancer.

, and the close relationship between the somatic variant and the human caner has been known [6,7].
Various methods have been applied to study the effect of somatic variant on the human cancer [8][9][10][11]. Somatic variants that could affect the selection of treatment and response of melanoma were studied based on the case-control study [8]. In [9], the driver genes for breast cancer metastasis were discovered by using the synonyms and non-synonymous ratio. A new ranking system calculating relative importance of somatic variants was proposed considering its effect size [10]. Furthermore, the pattern comparison of somatic variants between primary and metastatic tumors was performed for two colorectal cancer patients [11].
If the data to be analyzed has a large number of features, a subset of features can efficiently describe the data while reducing unrelated features [12]. Because somatic variants are high dimensional information and related to various characteristics of the individual, it is important to find variants that are closely associated with the cancer [13,14]. In general, the feature selection method can be divided into the filter method, wrapper method, and embedded method [13,15]. The filter method measures the importance of a subset of features according to the predefined criteria. Therefore, the filter method has less computational burdensome. On the other hand, the wrapper method is computationally expensive because it uses the prediction model with learning process to select a subset of features. Then, it usually provides very good performance. Finally, the embedded method combines the advantages of previous two methods by selecting a subset of features as part of the learning process.
Researchers have developed cancer-related feature selection algorithms for various genetic information [16][17][18][19][20][21][22]. In case of [16], the genetic algorithm-based feature selection method was applied to improve the decision-making process considering the tissue image of breast cancer patients. To find the set of genes for cancer classification, the quantum-behaved binary particle swarm optimization (BPSO) [17], the forward search method considering the weight local modularity [18], and the kernel-based clustering method for gene selection using double radial basis function kernels [19] were suggested. On the other hand, the identification methods of cancer-driving variants were developed by considering mutation timing of variants [20] and utilizing the gradient tree boosting and iterative search method [21]. Micro-RNA variants associated with metastasis of endometrial cancer were also analyzed using the recursive elimination technique in [22].
To understand the human cancer, it is necessary to comprehensively study the algorithm that selects the small number of variants related to the cancer's specific characteristics. Despite previous studies, the extraction of genetic variants that are significantly associated with the cancer's characteristics is still a difficult problem because of enormously high dimensionality of genetic information. Although the filter-based feature selection requires a relatively short time compared to other methods, the filter method also requires a lot of computations in case of genetic information. At the same time, the high performance of selection also needs to analyze complex and delicate functions of genetic information. In this paper, we propose a new modified filter-based feature selection method by improving computational complexity and classification performance for the selection of cancer-associated somatic variants. We mainly addressed the following issues here: • The concept of dual correlation filter-based feature selection (DCFS) is proposed to extract all the significant features associated with one of the opposing characteristics while avoiding redundancy using multiobjective optimization. • The weight value based on DCFS estimates the importance of features and is utilized to overcome the computational complexity problem of feature selection in highdimensional data. • The proposed DCFS-based method is applied to extract cancer-associated variants to identify melanoma metastasis or detailed stages of breast cancer. Then, the classification performance of proposed method is compared with that of conventional single correlation-based feature selection (CFS).

Data sets
The national cancer institute (NCI) shares the genomic data of cancer through the data repository called genomic data common (GDC). We obtained annotated somatic variant files of patients with melanoma (SKCM) or breast cancer (BRCA) from the GDC portal (https ://porta l.gdc.cance r.gov/). Table 1 summarizes information for two data sets analysed in this paper. Regarding the melanoma metastasis, there are a total of 467 files in the SKCM set consisting of 104 primary tumors and 363 metastatic tumors. Regarding the stage II breast cancer, there are a total of 537 files for stage IIA (314) or stage IIB (223) tumors in the BRCA set. For the SKCM set, melanoma patients with primary tumors were defined as the negative class, and melanoma patients with metastatic tumors were defined as the positive class. On the other hand, for the BRCA set, we set breast cancer patients of stage IIA as the negative class and stage IIB as the positive class. Somatic variant files in SKCM and BRCA sets contain SNVs and InDels, and they were generated according to the DNA-Seq analysis pipeline. This pipeline includes the elimination of germline variants, the comparison of allele frequencies between paired normal and tumor samples, the quality control of the alignment workflow, and the

Classification performance measurements
When there are two classes defined as the positive and negative, there are four classification results: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). TP means that the data in the positive class is correctly classified as positive. On the other hand, TN means that the data in the negative class is correctly classified as negative. Conversely, FP and FN indicate that data in the positive and negative classes are miss-classified as opposite classes, respectively. The classification accuracy (Acc) represents the percentage of correctly categorized data as follows When we consider the unbalanced data, high Acc can be achieved even if all data is classified into one class. In this case, F 1 score may be a more fair classification performance measurement.

Variant filtering results
The 3-step variants filtering was conducted for both SKCM and BRCA sets as follows.
Step 1 Using ANNOVAR [24], we conducted annotations for somatic variants and identified the functional role of each variant. Then, only the variant that could directly affect protein synthesis were remained. After filtering, there were only the non-synonymous variants in coding regions.
Step 2 To remove somatic variants commonly detected in humans, three public databases of gnomAD, ESP6500, and ExAC were investigated, and somatic variants reported in these databases were removed. On the other hand, variants that were reported in COSMIC database [25], which is the global database of somatic variants found in human cancer, were contained in our analysis even if they were registered in gnomAD, ESP6500, or ExAC. (2) having p-value > 0.01 in Fisher's t-test were removed.
As a result, we got the variants data matrix E SKCM ∈ R 467×200,814 for SKCM set and E BRCA ∈ R 537×37,449 for BRCA set. The number of variants for the two data sets according to the filtering steps are shown in Table 1.

Classification using DCFS weighting algorithm with BPSO
We used the proposed DCFS weighting algorithm for cancer-associated somatic variants selection for the SKCM and BRCA sets. To confirm the performance of the proposed DCFS-based feature selection, we also applied the conventional CFS concept to the proposed weighting algorithm instead of DCFS. Pearson's correlation coefficient (PCC), which is a basic measurement of linear correlation between two variables, was applied for correlation analysis in this study. After selecting top D 1 variants, BPSO [17] was applied to the selected D 1 variants to find the optimal set of cancer-associated variants that maximize the classification performance. The utilized parameters for the weighting algorithm and BPSO are listed in Table 2. We set F 1 score as the fitness function. To measure classification performance, support vector machine (SVM) [26] with k-fold cross validation was applied with k = 10.
In Table 3, the number of selected variants (Num), classification accuracy (Acc), and F 1 score ( F 1 ) for selected D 1 and D 2 variants are compared. CFS-D 1 and DCFS-D 1 refer to the case of using selected D 1 features considering CFS-weight and DCFS-weight for classification, respectively. On the other hand, CFS-D 2 and DCFS-D 2 indicate the case of using selected D 2 features considering CFS-weight and DCFS-weight for classification, respectively. For the SKCM set, the number of selected cancer-associated variants was reduced maintaining classification performance in the case of DCFS-D 2 than DCFS-D 1 . At D 1 = 400 and D 1 = 500 , the classification performance in the case of CFS-D 2 was improved compared to the case of CFS-D 1 . For the BRCA set, the case of CFS-D 2 could have improved performance than the case of CFS-D 1 at D 1 = 200 and D 1 = 300 . However, classification performances at D 1 = 400 and D 1 = 500 were not significantly improved. On the other hand, the case of DCFS-D 2 was able to choose a feature set with high classification performance in all cases.

Classification using DCFS weighting algorithm with machine learning
Using selected D 1 features, we performed classifications of SKCM and BRCA sets by applying deep neural network (DNN). The utilized DNN structure consisted of two fully-connected hidden layers with 512 and 256 neurons, respectively. ReLU was used as the activation function for hidden layers, and softmax was applied to the output layer for binary classification. The batch size was 50, and the number of epochs was 100. We calculated the average Acc value when the 30% of the randomly selected test samples were classified using the model trained with remained samples. We implemented the model using the DNNClassifier class from tensorflow's tf.estimator module. Figure 1 provides the classification accuracy in the case of using D 1 and D 2 variants selected based on CFS-weight and DCFS-weight for the SKCM and BRCA sets. In general, Acc value of SKCM set was higher than that of BRCA set. When the number of selected features were increased, classification performances were also increased. Therefore, we could confirm that the performance of DNN classifier is affected by the number of features used for classification. Also, the classification performances of CFS-weight and DCFS-weight were similar when using the D 1 or D 2 features. For the BRCA set, the classification performance when using the D 1 features was relatively larger than when using the D 2 features. This phenomenon was also found for the SKCM set, but the gap between two cases using D 1 and D 2 features was relatively small.

Pathway and phenotype analysis
To ensure the reliability of the selected significant features, variant filtering was conducted before the feature selection. However, there is no guarantee that the selected features will actually affect the phenotype of the disease [27]. Thus, in order to conclude that the features selected by the proposed method are not accidentally discovered and are actually associated with cancer, clinical studies should confirm their role in cancer biology. Several cancer-related genes are known to be associated with more than one cancer, and pathway analysis can explore biological causes by examining changes in gene expression caused by mutations. Therefore, we performed pathway and phenotype analysis for selected variants to discuss their biological significance related to human disease. The human phenotype ontology (HPO) provides phenotypic abnormalities encountered in human disease [28]. The disease association of the gene containing the selected variant was confirmed through the HPO database. On the other hand, kyoto encyclopedia of genes and genomes (KEGG) provides a collection of pathway maps and a collection of disease entries focusing only on the perturbants [29]. We investigated HPO and KEGG databases to see if a relationship between pathway and disease was reported for the gene in which the selected variant was present. We also searched COSMIC database, which summarizes the effects of variants on human cancers.
Tables 4 and 5 provide search results for HPO, KEGG, and COSMIC databases of the selected variants in the case of using DCFS-D 2 at D 1 = 50 for the SKCM and BRCA data sets, respectively. For the SKCM set, 5 genes and 6 genes were found in HPO and KEGG databases, and 6 variants were reported in COSMIC database. Among the 17 selected variants, there were 3 variants that were not registered in any For the BRCA set, DNN classifier had similar performance for CFS and DCFS methods, but the gap between the cases using D 1 and D 2 features was relatively large of the three databases. For the BRCA set, information about 6 genes and 10 genes were collected from HPO and KEGG databases, respectively. In the case of COSMIC database, 18 variants among 25 selected variants were reported in association with human cancer. On the other hand, there were 3 variants that were not detected in three databases. BRAF, NRAS, and KIT are three well-known genes associated with melanoma, and BRCA1 and BRCA2 are two representative genes associated with breast cancer. The genes that play a role in inducing or inhibiting metastasis have been actively studied, but have not yet been clearly identified. Also, the genes involved in cancer progression are well known, but the detailed progression of subgroup classification of stage II breast cancer has not been addressed. Since we confirmed the relationship between cancer and disease for most of the extracted genes, it is worth to study the biological function of the extracted genes for melanoma metastasis or subgroup of stage II breast cancer through clinical studies.

Effect of correlation analysis method
To compare the impact of correlation analysis method on the proposed DCFS method, we considered PCC and Spaerman's rank correlation coefficient (SCC). PCC is a famous measure of the correlation between two data sets. PCC is defined as  where µ a and µ b refer to the mean value of the vector a and b , respectively. The range of PCC values is [−1, 1] . The closer to 1, the higher is the positive correlation. Conversely, the closer to -1, the higher is the negative correlation, and 0 means no correlation. PCC measures linear relationships between two vectors a and b . On the other hand, SCC can consider nonlinear relationships where the amount of change is not constant. Instead of raw data a and b , SCC is calculated based on ranking values ra and rb: where µ ra and µ rb refer to the mean value of ra and rb , respectively. The range of SCC values is also [−1, 1]. Based on PCC and SCC, DCFS-weighting algorithm selected D 1 variants, and BPSO was applied to get DCFS-D 2 . Figure 2 illustrates the number of selected variants according to D 1 . In general, PCC and SCC selected similar number of features for both SKCM and BRCA data sets. Furthermore, classifications were performed using SVM and k-fold cross validation with k = 10 for the SKCM and BRCA data sets. Figure 3 compares the classification performances when PCC and SCC are utilized for DCFS-D 2 according to D 1 , respectively. In case of the number of selected variants, it was hard to see any special trends according to the selection of correlation analysis method. Also, the classification performances were similar for PCC and SCC. As a result, the choice of correlation analysis method had little effect on the classification performance of SKCM and BRCA data sets. When the correlation between two variables is week, the difference between the PCC and SCC values is small. However, when the correlation is strong, the difference between the two values becomes larger depending on whether the correlation is linear.
In the SKCM and BRCA data sets, there were very few variants with strong correlations, and the impact of correlation analysis method seems to be small.

Complexity
CFS is a filter-based feature selection method and has the advantage of very fast computation. The complexity of the CFS merit maximization of Eq. (5) depends on the use of optimization methods. The proposed DCFS merit just needs a modified CFS merit calculation of Eq. (6). Therefore, the increase of computational complexity of DCFS compared to CFS is insignificant. When there are V variants and S samples with V >> S , the calculation complexity of DCFS merit follows O(V 2 ) , which is the same as the conventional CFS merit calculation. To find the optimal subset of features maximizing the DCFS merit, 2 V − 1 calculations of the DCFS merit are required, and the complexity becomes O(2 V V 2 ) . The proposed DCFS weighting algorithm can reduce the number of the DCFS merit calculation whose complexity is denoted as

Conclusions
In this study, we proposed the concept of DCFS to analyze cancer-associated somatic variants, containing SNVs and InDels. In order to reduce the computational complexity and to eliminate the effects of errors and biases, non-significant variants were removed considering the functional role, previous studies of disease association, and the reliability of variant. The DCFS merit was defined based on the multiobjective optimization to obtain the cancer-associated variants set related to activation and deactivation of the cancer's characteristics without redundancy. Because of high dimensionality of genetic information, we suggested DCFS weighting algorithm to reduce the complexity of feature selection procedure. We applied our proposed algorithm to identify metastasis of melanoma or the subgroup of stage II breast cancer. BPSO was used for DCFS maximization for significant variant selection, and a neural network was applied for classification of data. In addition, pathway and phenotype analysis were performed to study the effects of the variants selected by the proposed algorithm on the cancer phenotype. As a result, we verified that proposed DCFS algorithm could select cancer-associated variants resulting in high classification performance. We also discussed the impact of the choice of correlation analysis method on the proposed method. In summary, we believe that the proposed method can be applied to various analysis of genomic data and various feature selection analysis.

Variant filtering
Somatic variants, containing SNV and InDel, can be detected from NGS data by using various variant calling methods such as VarScan2 [23] and SomaticSniper [30]. After the variant calling, there are too many variants, and the non-critical variants also can be contained. We can remove the non-critical variants considering the functional role of variant, previous researches on variant, and the reliability of variant. The 3-step filtering procedure is summarized in Fig. 4.
Step 1 The functional role of variant is identified. The variants in non-coding regions that affect cancer have been studied [31,32]. However, the analysis of variant in the non-coding region is still more challenging than in the coding region, because it is difficult to interpret the functional role of variant. We focus on studying variants in coding regions that are directly related to protein synthesis. Therefore, only the variants in coding regions including exons, 3 ′ UTR, and 5 ′ UTR are extracted. Even if the variant is in the coding region, the amino acid sequence may not be modified and may not cause actual protein-coding change. This silent variation is called synonymous variant and removed with ambiguous variants, which are caused by error in sequencing and calling procedures.
Step 2 The previous researches on variant are investigated. The significant variant related to cancer is not commonly detected in humans. Therefore, we exclude the variant if it is already registered in public databases. However, the variant is re-included in the study if previous studies have reported the effect of the variant in the human cancer.
Step 3 The reliability of variant is confirmed. For the quality control of data, it should be confirmed that the variant dose not occur accidentally during the process of variant acquisition. Also, it is necessary to take into account the association with the cancer. Therefore, the variant is selected when the reads having the variant occur frequently in cancer samples and do not occur in normal samples.

DCFS
There are two types of cancer-associated variants that encourage and suppress the expression of certain characteristic of cancer. Both types of cancer-associated variants should be considered. However, if we extract these two types of variants separately, some of extracted variants can be included in both types. Therefore, we propose the concept of DCFS utilizing the multiobjective optimization to eliminate these redundant variants and generalize the correlation-based cancer-associated variants selection.
In [33], a filter-based feature selection method based on the correlation of data called CFS was proposed. CFS approach selects the least number of features that are closely related to the data class. In other words, CFS selects a set of features that are strongly correlated with the class but not each other. The merit criterion of CFS for a feature set f consisting of n features are as follows: where r f c is the mean of the correlations between a feature and the class, and r f f is the mean of the correlations between two features. The optimal subset of features with the maximum merit is selected.

Previous research investigation
• The variant does not occur in humans in common.
• The role of the variant in the cancer has already been known.
• The discovery of variant is not a coincidence.
• The variant is closely related to the cancer samples. In the first step, we identified the functional role of variant considering whether it directly affects protein synthesis. Next, we searched public databases to eliminate non-critical variants, and also referred to previous studies to include variants known to be important for the cancer. Finally, the last step confirmed that the variant did not occur accidentally during the sequencing or variant calling and was deeply related to the cancer samples The proposed DCFS extends CFS to find the smallest feature subset associated with two conflicting classes of data. The set of significant features in DCFS satisfies the following two conditions: • The selected feature is highly correlated with only one class.
• The selected features are not correlated with each other.
The first condition constrains the selected feature to be not correlated to both opposing characteristics. The second condition encourages that there is no duplicate information in the selected feature set. Let the data can be divided as two classes: positive or negative. Then, we can define two merit criterion M p and M n . In the case of M p , the selected features are the set of significant features specifically associated with the positive class. Also, in the case of M n , the selected features are specifically related to the negative class. DCFS maximizes M p and M n taking into account the relationship between features and the two classes simultaneously.
Let the data matrix be E , where an element e sv refers v-th feature of s-th sample for all v ∈ {1, 2, . . . , V } and s ∈ {1, 2, . . . , S} . Then, a column vector e v means values of v-th feature of all S samples. Also, the column vector c p and c n represent the positive and negative class index of samples, respectively. If the selection vector is x , where an element On the other hand, x v = 0 means v-th feature is not selected. To consider both objective functions M p and M n at the same time, we use the multiobjective optimization problem. Then, x is determined by following equation: where α ∈ [0, 1] is a scalarization parameter, |x| = v x v is the number of selected features, r ee (x) is the mean of the correlations between any two features, r ec p (x) is the mean of the correlations between a feature and the class index c p , and r ec n (x) is the mean of the correlations between a feauture and the class index c n . These three mean correlation values are defined as where a · b refers the element-wise multiplication between the vectors a and b of the same length, and ρ(a, b) is the correlation coefficient between a and b . In Eq. (6), the (6)  |x| + |x|(|x| − 1)r ee (x) scalarization parameter α ∈ [0, 1] adjusts the importance of the two objective functions, which are M p and M n . For the multiobjective optimization problem, there may not be a single solution because multiple objective functions can conflict with each other. In this case, there is one or more Pareto optimal solutions. Pareto solutions mean that we need to reduce other objective values to improve one objective value. The linear scalarization finds the most appropriate Pareto optimal solution using the parameter α . When α = 1 , the correlation with the positive class determines the objective function and only the positive class related features are extracted. Conversely, if α = 0 , the features associated with the negative class are selected considering the correlation with the negative class. If let α = 0.5 , the problem fairly considers two objective functions, and significant features related to both classes are extracted without duplicating information.

DCFS weighting algorithm
If the data dimension is very large, selecting the optimal subset of features based on the filter method is also complex. Therefore, we define the DCFS-weight to indicate the expected importance of each feature. Then, we can pre-select candidate critical features to alleviate the computational complexity problem.
To calculate the DCFS-weight for each feature, the proposed DCFS weighting algorithm iteratively performs DCFS calculation on a randomly selected subset of features. Let the number of iteration be T 1 . The sum of DCFS values of each feature is defined as the vector w val = (w val 1 , w val 2 , . . . , w val V ) . Similarly, w num = (w num 1 , w num 2 , . . . , w num V ) is defined to count the number of times that each feature is selected during T 1 iterations. Both w val and w num are initialized with a zero vector and updated through T 1 iterations. At a t-th iteration, a size of feature subset φ t ∈ � is randomly determined, and φ t features are randomly selected among V features. Then, DCFS value, which is M DCFS (E t ) , is calculated from E t , which is the data matrix only for selected feature subset I t , and w val i and w num i for all i ∈ I t are updated. M DCFS (E t ) is added to w val i for all i ∈ I t , and w num i is increased by 1 for all i ∈ I t . After T 1 iterations, we can calculate the DCFS-weight vector as follows By using the DCFS weighting algorithm, we can calculate the DCFS-weight w , and top D 1 features are extracted as the candidate significant features.
After using the DCFS weighting algorithm, we can get the data matrix E ′ only for the selected D 1 features. By considering D 1 << V features, the DCFS merit optimization based on Eq. (6) requires lower computational complexity compared to the case considering all V features. BPSO [17] is applied to find the optimal set of features by maximizing M DCFS (E ′ ) . Particle swarm optimization (PSO) is an optimization method inspired by the social behavior of bird or fish groups [34,35]. BPSO was developed for the discrete search space by modifying PSO and can be applied to the feature selection [17,36]. A feature selection is defined as a particle and is represented by a position vector that indicates whether each feature is selected or not. Each particle moves in the search space with the dimension D 1 and updates its position information repeatedly to maximize