DVA: predicting the functional impact of single nucleotide missense variants

Background In the past decade, single nucleotide variants (SNVs) have been identified as having a significant relationship with the development and treatment of diseases. Among them, prioritizing missense variants for further functional impact investigation is an essential challenge in the study of common disease and cancer. Although several computational methods have been developed to predict the functional impacts of variants, the predictive ability of these methods is still insufficient in the Mendelian and cancer missense variants. Results We present a novel prediction method called the disease-related variant annotation (DVA) method that predicts the effect of missense variants based on a comprehensive feature set of variants, notably, the allele frequency and protein–protein interaction network feature based on graph embedding. Benchmarked against datasets of single nucleotide missense variants, the DVA method outperforms the state-of-the-art methods by up to 0.473 in the area under receiver operating characteristic curve. The results demonstrate that the proposed method can accurately predict the functional impact of single nucleotide missense variants and substantially outperforms existing methods. Conclusions DVA is an effective framework for identifying the functional impact of disease missense variants based on a comprehensive feature set. Based on different datasets, DVA shows its generalization ability and robustness, and it also provides innovative ideas for the study of the functional mechanism and impact of SNVs.

For example, SNVs located in the coding region of the DNA sequence may lead to the different amino acid translation, resulting in the impact on the function of gene products in multiple ways, e.g., by affecting the protein's interaction with other proteins, its stability or it active sites.These changes may have an important impact on the overall function and signal transmission of the gene product.Therefore, how to accurately interpret the functional impact of missense variants is particularly important.
Various computational methods have been proposed to predict the functional impact of variants.Prediction methods based on a single type of information (for example, conservation/evolution/sequence homology) were proposed earlier.For example, Kumar et al. [6] proposed a prediction method called SIFT that predicts the effects of missense variants (amino acid substitutions, AASs) based on the sequence homology and the main idea of SIFT is that highly conservative positions tend to not tolerate variants/ substitutions, while low conservative positions are just opposite.The prediction method based on single information can provide a simple, intuitive and easy to explain prediction result, which is convenient for relevant researchers and clinical staff to use in practical work.However, its prediction results are often less accurate.Therefore, many methods have been developed to improve prediction accuracy by merging multiple types of variant features.For example, Shihab et al. [7] developed a novel method called FATHMM-MKL that integrated 10 different variant features (such as sequence conservation, histone modification, footprints, GC content, transcription factor binding sites) to improve the accuracy of functional impact prediction of variants.In addition to combining the features of variants, some methods also take the prediction scores of other methods as features to improve prediction accuracy.Such as REVEL [8] and MISTIC [9], these methods all use the scores of other prediction methods as features to predict the functional impact of variants.These different types of computational methods have advanced the study of the prediction of the functional impact of variants.However, most of these methods do not perform well enough [10] in the functional impact prediction of missense variants.Therefore, how to construct a comprehensive feature set and an efficient computational model to improve the prediction accuracy is still a key challenge.
To solve the above problems, we proposed a method called disease-related variant annotation (DVA, http:// bionet.org.cn/ DVA), which systematically integrated multiple features including conserved elements, allele frequencies in different populations, and protein-protein interaction (PPI) network feature transformation.The sequence conservation (DNA or protein) often means that a sequence has been maintained by natural selection and is considered to have functional value [11][12][13].Therefore, when a variant or its corresponding amino acid substitution occurs at a highly conserved site, the possibility of harmful effects often is greater than the possibility of harmlessness.In addition to sequence conservation, allele frequency is also used as a predictor, that is, low-frequency variants usually trend to be disease-related, and high-frequency variants trend to be neutral [14].Here, allele frequency is used as a predictor variable to predict the functional impact of variants.Except for the perspective of single-site and population statistics, the important impact of missense variants may lie in the interaction of protein molecules.For a protein molecule to perform a certain function, it often needs to be combined or coordinated with other protein molecules.However, previous methods rarely take this into account.Therefore, we employ this (PPI network) as the novel feature to predict the functional impact of variants, which is extracted by graph embedding.To sum up, the DVA method constructs a comprehensive feature set including sequence conservation, allele frequency, and PPI structure feature to predict the functional impact of variants accurately.The novelty of our approach lies in the combination of a set of features including conserved elements, allele frequencies in different populations, and PPI network feature transformation, to build the random forest model that achieves the significant improvement of accuracy within different complex diseases and cancers using missense variant datasets.

Experimental results for somatic cancer missense variants
Experimental results on somatic cancer variants are shown in Fig. 1 and Table 1.The area under receiver operating characteristic curve (AUROC) for DVA is 0.979.Here, we compared the DVA method to 14 prediction methods that were recently developed, widely used: SIFT [6], PROVEAN [15], MutationTester [16], MutationAssessor [17], FATHMM-MKL [7], DANN [18], MetaSVM [19], MetaLR [19], ClinPred [14], CADD [20], PrimateAI [21], REVEL [8], M-CAP [22], and MISTIC [9].The prediction scores of these methods were obtained from the webserver or software provided by authors, ANNOVAR, or the dbNSFP v3/v4 database.The AUROCs for the other individual prediction methods ranged from 0.506 to 0.84.Among them, the highest AUROC value was 0.84 achieved by the ClinPred method and the lowest AUROC value was 0.506 achieved by the MISTIC method.As a result, many prediction methods performed poorly on somatic cancer variants.Nevertheless, DVA significantly outperformed other functional impact prediction methods on such data.These results demonstrate that DVA has a good ability to predict the functional impact of somatic cancer variants.
In addition to the AUROC, accuracy, precision, recall, and F1-score also were employed to evaluate the performance of DVA and state-of-the-art prediction methods.As shown in Table 1, the performance of the previous algorithms was relatively poor and the DVA method has achieved the best performance among the four evaluation criteria (accuracy, precision, recall, and F1-score).In particular, the recall for DVA was higher Fig. 1 The AUROCs of 15 different prediction methods using somatic cancer missense variants than other methods by at least 0.2.The proposed method significantly outperformed other prognostic predictors of functional impact of somatic cancer variants.

Experimental results for missense variants in ClinVar database
Experimental results on the ClinVar missense variants are shown in Fig. 2 and Table 2.The AUROC value of DVA is 0.977.The AUROC for the other individual prediction methods ranged from 0.601 to 0.959.For example, REVEL and ClinPred achieve relatively good performance, with AUROC values of 0.915 and 0.959, respectively.The AUROCs of other prediction tools fluctuate around 0.8 (AUROC for MetaSVM is 0.861; AUROC for MetaLR is 0847; AUROC for CADD is 0.851; AUROC for DANN is 0.79; AUROC for PrimateAI is 0.773; AUROC for FATHMM-MKL is 0.777; AUROC for MutationAssessor is 0.845; AUROC for MISTIC is 0.871; AUROC for PROVEAN is Fig. 2 The AUROCs of 15 different prediction methods using missense variants in the ClinVar database 0.849; AUROC for SIFT is 0.821; AUROC for M-CAP is 0.857).MutationTaster has the worst predictive performance in the ClinVar missense variants with an AUROC of 0.601.These results demonstrate that DVA has a good ability to predict the functional impact of missense variants of different mendelian diseases.
As shown in Table 2, the DVA method also has achieved the best performance compared to other prediction tools.The recall of DVA is 0.926 at least 3.4 percentage points higher than other prediction tools.The precision, accuracy, and F1-score of DVA also have achieved the excellent performance, which indicates that DVA is effective in representing and distinguishing the differences between disease-related and neutral variants.

Experimental results for missense variants in VariBench database
Experimental results on the VariBench missense variants are shown in Fig. 3 and Table 3.As shown in Fig. 3, the DVA method has the best performance and its AUROC value is 0.858.For other prediction methods, the highest AUROC value was 0.813 achieved by the REVEL method and the lowest AUROC value was 0.54 achieved by the Mutation-Taster method.The deep neural network (DNN) has achieved an overwhelming advantage in some research fields of computer science, such as computer vision and natural language process.In the aspect of functional impact of variants, some prediction methods also employed deep neural networks.However, the performance of these methods did not significantly outperform other methods.DNN models require a lot of training data sets.Among all kinds of biological data, sequence data has a large scale to meet the requirements of training.Thus, PrimateAI employed a DNN model to predict the functional impact of variants using multi-sequences.Although DNN is used, the prediction result of PrimateAI based on sequence data is not excellent.DANN also is a prediction method based on the DNN model.Compared with other methods, DANN doesn't show the significantly overwhelming performance too.Based on the current observations, DNN has not achieved significant success in this field.Thus, more excellent machine learning algorithms and feature sets may still be a better choice.These results demonstrate that DVA has a good ability to predict the functional impact of missense variants in VariBench database.As shown in Table 3, the DVA method also has achieved the best performance compared to other prediction tools.The accuracy, precision, and F1-score of DVA also have the best performance compared to state-of-the-art methods.The prediction ability of the DVA method is also proved on the VariBench dataset.

Experimental results for missense variants based on cross-database validation
Prediction methods usually achieve better performance based the training data and testing data, which come from the same database.As shown in the Fig. 2 and Table 2, DVA achieves excellent performance (AUROC is 0.977) and the compared prediction    of most methods are around 0.6), while DVA still achieves excellent performance (AUROC is 0.934).These results demonstrate that DVA has better robustness.

Feature importance analysis
We analyzed the importance of selected features for the prediction performance of DVA method and used randomForest package to obtain importance scores using missense variants in the VariBench dataset.Figure 5 represents the importance level of the top 20 features.Among the top 20 features, allele frequency features are the most important features in our method and accounting for 9 of the top 20 features.The PPI graph embedding features are next important features and accounting for 8 of the top 20 features.The importance of conservative score features is relatively lower than that of the first two types of features, but there are still three conservative features in the top 20 features.So, it has also played a certain role in predicting the functional impact of missense variants.From the above feature importance analysis, it can be concluded that the two new features performed in the DVA method play a significant role in predicting the functional impact of missense variants.

Conclusions
Since the completion of human gene mapping, human beings have entered the post genome era.Annotation and analysis of gene and gene product functions are critical study at the post genome era.In addition to the general gene function research, the functional impact of variants occurred in the gene coding region is a very important and meaningful direction.For example, research by Shajani-Yi et al. [23] showed that some "key" genes have been found to carry disease-related variants in different cancers.In the Glioma/Glioblastoma cancer, 67% (16/24) of tumors exhibited one disease-related variant, of which 94% (15/16) were in IDH1 and 6% (1/16) were in PIK3CA.In the Colon Adenocarcinoma, 67% (73/109) of tumors with disease-related variants had more than one variant in addition to TP53.Daboub et al. [24] claimed a report of Parkes Weber syndrome associated with two disease-related variants in RASA1.Timms et al. [25] Fig. 5 The degree of contribution from top 20 features used by DVA discovered that BRCA disease-related variants were observed in all breast cancer subtypes.These studies also show that disease-related variants have a greater correlation with cancer.However, there are several variants stored in different databases, which are not yet clear about their possible effects.Therefore, how to better predict diseaserelated variants can better enable researchers or clinicians to focus on the scope of cancer research, thereby promoting precision medicine.With DVA, we describe an effective framework for identifying the functional impact of disease missense variants based on a comprehensive feature set.We demonstrate that the performance of DVA is much better than the state-of-the-art prediction methods based on different missense variants.Meanwhile, we observed the general robustness of the DVA, and it will be innovative for the study of the functional mechanism and impact of SNVs.
New features and appropriate model may contribute to the improved prediction performance.First, two new types of features have been introduced to significantly improve the predictive ability of this algorithm: (a) variant allele frequencies in different populations.When a SNV is widely present in the population, it often has no pathogenic effect on the molecular function or organism.On the contrary, if a SNV only appears in a few individuals, its impact on the organism may be more pathogenic.It has been less considered in previous prediction methods.(b) PPI network features.Whether it is normal organic operations or harmful molecular changes, it is often not a single factor that promotes its development, but the interaction of multiple molecules or changes in key molecules that lead to essential changes in the entire working mechanism.In the past, little attention has been paid to the interaction of different molecules in the research of the functional impact of variants.Here we used the graph embedding (PPI network) to represent the interaction of molecules with different variants.Second, a random forest model was constructed to predict the functional impact of variants.By merging the different types of features and the random forest model, the DVA algorithm has significantly improved the prediction of the impact of variants.
In this article, the graph embedding representation of protein-protein interactions is used as a novel feature to predict functional impact of variants and it improves the performance of the proposed method effectively.In addition to protein-protein interaction, protein structure also is an informative and detailed feature to protein function and may improve the performance of prediction methods.However, applying protein structure to predict functional impact of variants is still challenging.There are several problems to be fixed: (1) Although the wild protein structures are accessible in several database [26][27][28], the mutated protein structure usually is not provided; (2) Protein structure is complicated.How to construct protein structure representation as the input feature of machine learning methods is still a critical challenge.In the future, we will continue to develop novel methods which focus on addressing existing problems to predict functional impact of variants.

DVA datasets
In this study, we used three datasets (ClinVar(v20210922) [3], VariBench [29] and COSMIC [4]/VariSNP(v2017) [30]) to assess the performance of the proposed method and the state-of-the-art prediction methods.All of datasets were categorized as the disease-related and neutral missense variants, which were selected as the positive (disease-related or recurrent in cancer tissues) and negative (neutral) labels, respectively.To evaluate the performance of DVA and the state-of-the-art prediction methods appropriately, we established the following rules to filter out genetic variants: (1) The functional impact of variants contained in benchmark datasets should be predictable by all the state-of-the-art prediction methods.
(2) The variants should be rare, namely, the gnomAD allele frequencies (AF) of these variants are less than 1%.(3) Each variant should not occur in the training set of the state-of-the-art prediction methods.
Finally, we obtained three datasets: (1) 12,569 recurrent missense variants from the COSMIC (https:// cancer.sanger.ac.uk/ cosmic) dataset and 16,873 neutral missense variants from the VariSNP database; (2) 3706 disease-related missense variants and 2929 neutral missense variants from the ClinVar database; (3) 2965 disease-related missense variants and 3198 neutral missense variants from the VariBench database, which is integrated into the filtered versions of HumVar [31], ExoVar [32], VariBench, and SwissVar [33] datasets.There are some overlaps between the original databases.Through data preprocessing, we give priority to retaining duplicate variant data in ClinVar, then retaining variant data in Varibench, and finally retaining variant data in COSMIC and VariSNP.

Performance evaluation
The performance of the state-of-the-art prediction methods and DVA was evaluated using accuracy, precision, recall, F1-score, and the area under the ROC curve: In the equations above, the following parameters are defined: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).The positive cases denote the disease-related, deleterious, or pathogenic missense variants, while the negative cases denote the neutral or benign missense variants.Accuracy is the rate at which the prediction method correctly classifies the positive and negative cases.The Precision and Recall represent the ratio of real positive cases to predicted positive cases and correctly predicted positive to correctly predicted cases, respectively.F1-score is a compromise between precision and recall.The Receiver Operating (1) TP + TN TP + FP + TN + FN Characteristic (ROC) curve is a plot that illustrates the predictive ability of the prediction method.The Area Under the ROC curve (AUROC) is a numerical representation of the ROC curve to indicate the performance of the prediction method more conveniently.The AUROC, accuracy, precision, recall, and F1-score were obtained using the pROC [34] package implemented by the R language.

Feature matrix construction
In this section, we will introduce the comprehensive feature set of DVA.There are three kinds of variant features: conserved element features, allele frequency features, and PPI network features as shown in the Fig. 6.

Conserved element features
Here, the DVA algorithm incorporates a total of 8 conserved element features as predictive features to identify the functional impact of missense variants: The Genomic Evolutionary Rate Profiling++ (GERP++) [35] discoveries constrained elements in multiple alignments by quantifying substitution deficits, which represent substitutions that would have occurred if the element were neutral but did not occur because the element has been under functional constraint.GERP++ is widely used as the common feature in the prediction of functional impact of variants.In addition to GERP++, The PHylogenetic Analysis with Space/Time models (PHAST) [36] and phyloP [37] and transcription conservation element are also integrated into DVA features, which increases the diversity of feature sets.All these conserved element features were obtained from the ANNOVAR database.

PPI network features
When a variant occurs in the "multi-connected" gene that has more neighbor genes in the PPI network, its impact may be greater.Therefore, we introduce the characteristics of the PPI network into the prediction of SNV function effects to improve accuracy.However, the dimension of PPI network is too high, which will undoubtedly increase the computational complexity, it will affect the prediction work.The graph embedding [44] based on network structure and random work is a good way to solve this problem.The specific steps are as follows: • Based on protein-protein interaction database (STRING [45]), the binary adjacency matrix of PPI network will be obtained.• A special random walk with two parameters p and q is performed to guide the walker.The transition probability is as follows: where β pq (t, v, x) is the transition probability between current node v and its neighbor node x .The node t is the last step node, which is passed by the random walker.The p and q are walk direction parameter, respectively.Parameter d tx is the distance of node t and node x .The β pq (t, v, x) determines the probability of the random walker moving from node v to the next node.
• Based on the transition probability β pq (t, v, x) , start node t and walk step l , the random walk sequence of the start node t will be obtained as shown in the Fig. 6. • Repeat random walk process for each node in the PPI network and obtain several walk sequences for PPI network to represent local structure information.• Fit the Skip-Gram [46] neural network using the representation vectors of the start node (as input vector) and each node in the random walk sequence of the start node (as output vector).• Finally, the k-dimensional vectors will be used to represent the graph embedding information of the whole PPI network.The feature matrix is U k ∈ R n×k as follows: (5) β pq (t, v, x) =

DVA model and cross-validation
We used random forest, a machine learning technique, to predict whether a missense variant is disease-related or neutral.Due to the differences in training samples, tenfold cross-validation for each dataset was performed to enhance the robustness of the results for the DVA method, respectively.The detailed steps of tenfold cross-validation are: (1) randomly divide the entire dataset into 10 groups of equal size; (2) for each unique group: First, take the unique group as the test set.Second, take the remaining groups as the training set.Third, fit the prediction model on the training set and evaluate it on the test set; (3) take the average of ten evaluation results as the final result.We trained random forest on the set of variants by using the randomForest [47] package with 500 binary classification trees to predict whether a missense variant is disease-related or neutral.The DVA score represents the fraction of the final prediction score which votes for the variant being disease-related or neutral.

Fig. 3
Fig.3The AUROCs of 15 different prediction methods using missense variants in the VariBench database

Fig. 4
Fig.4 The AUROCs of 15 different prediction methods on COSMIC dataset

Fig. 6
Fig.6The overview of the DVA method

Table 1
Performance of prediction methods using the somatic cancer missense variants

Table 2
Performance of prediction methods using missense variants in the ClinVar database

Table 3
Performance of prediction methods using missense variants in the VariBench database methods also achieve good or fair performance (the AUROCs of most methods are around 0.85) based the training data and testing data, which come from the same database.However, when prediction methods are training in one dataset and testing in another dataset, the testing result may decrease.Thus, we perform another experiment to discuss whether DVA and other methods still have the better prediction ability when training data and test data come from different databases according to your suggestion.Most prediction methods (such as MISTIC and ClinPred) are trained on ClinVar or similar types of datasets.Thus, ClinVar and COSMIC/VariSNP were chosen as the training set and testing data, respectively.Experimental results are shown in the Fig.4and Table4, the AUROC and accuracy of 15 methods have decreased.However, performance of other 14 methods is significantly reduced (the AUROCs