VarSight: prioritizing clinically reported variants with binary classification algorithms

Background
When applying genomic medicine to a rare disease patient, the primary goal is to identify one or more genomic variants that may explain the patient's phenotypes. Typically, this is done through annotation, filtering, and then prioritization of variants for manual curation. However, prioritization of variants in rare disease patients remains a challenging task due to the high degree of variability in phenotype presentation and molecular source of disease. Thus, methods that can identify and/or prioritize variants to be clinically reported in the presence of such variability are of critical importance.

Methods
We tested the application of classification algorithms that ingest variant annotations along with phenotype information for predicting whether a variant will ultimately be clinically reported and returned to a patient. To test the classifiers, we performed a retrospective study on variants that were clinically reported to 237 patients in the Undiagnosed Diseases Network.

Results
We treated the classifiers as variant prioritization systems and compared them to four variant prioritization algorithms and two single-measure controls. We showed that the trained classifiers outperformed all other tested methods, with the best classifiers ranking 72% of all reported variants and 94% of reported pathogenic variants in the top 20.

Conclusions
We demonstrated how freely available binary classification algorithms can be used to prioritize variants even in the presence of real-world variability. Furthermore, these classifiers outperformed all other tested methods, suggesting that they may be well suited for working with real rare disease patient datasets.

Electronic supplementary material
The online version of this article (10.1186/s12859-019-3026-8) contains supplementary material, which is available to authorized users.

Figure 1 contains a visualization of the overall workflow for training VarSight models. For details, refer to the main paper.

Overview
There are a total of 50 features used as inputs to the models. Two of those features are phenotype-based features and the remaining 48 are derived from Codicem (http://envisiongenomics.com/codicem-analysis-platform/) extractions.

Phenotype Features
For both methods, we relied on annotations and ontological structures provided by the Human Phenotype Ontology (HPO). Specifically, we used the hp.obo and ALL_SOURCES_ALL_FREQUENCIES_phenotype_to_genes.txt files corresponding to data version releases/2018-10-09.

HPO-cosine
The first phenotype feature is called the "HPO-cosine" score. Given P terms in the HPO, a phenotype profile is represented by a P-dimensional vector where dimension p is set to 1.0 if the corresponding phenotype term is present in the profile. Note that if a phenotype is present in the profile, all ancestor terms as defined by the ontology are automatically set as present as well. Given two profiles, we calculate the "HPO-cosine" score by computing the dot product of the vectors and normalizing by their lengths. Using this measure, values close to 1.0 are highly similar (the vectors are "pointing" in similar directions) whereas those close to 0.0 are not (the vectors are mostly perpendicular).
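As a minimal sketch of the calculation described above, the binary profile vectors can be represented as term sets, since the dot product of two 0/1 vectors is just the size of the intersection of their "present" terms. The toy ontology, term names, and ancestor map below are invented for illustration; the actual implementation lives in HPOUtil.py and parses hp.obo.

```python
import math

def close_ancestors(terms, parents):
    """Expand a set of terms to include all ancestors defined by the ontology."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, []))
    return closed

def hpo_cosine(profile_a, profile_b, parents):
    """Cosine similarity of two binary term-presence vectors."""
    a = close_ancestors(profile_a, parents)
    b = close_ancestors(profile_b, parents)
    dot = len(a & b)  # dot product of 0/1 vectors = size of shared term set
    return dot / (math.sqrt(len(a)) * math.sqrt(len(b)))

# Toy ontology: HP:3 and HP:4 are children of HP:2, which is a child of HP:1
parents = {"HP:2": ["HP:1"], "HP:3": ["HP:2"], "HP:4": ["HP:2"]}
score = hpo_cosine({"HP:3"}, {"HP:4"}, parents)  # shares HP:2 and HP:1 -> 2/3
```

Note that the ancestor closure is what gives sibling terms a nonzero similarity: the two profiles above share no explicit terms, yet score 2/3 because they meet at common ancestors.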
For each gene, we created a phenotype profile based on the HPO annotations relating genes to phenotypes. For each patient, we created a phenotype profile and calculated the cosine score between the patient profile and all genes to create an ordered rank from 1 to G, where G is the total number of genes annotated by the HPO. For each variant, we then found the smallest rank, R, of genes tied to the variant and stored the normalized value (R/G) as the "HPO-cosine" feature tied to the variant. For precise implementation details, please refer to the source code (https://github.com/HudsonAlpha/VarSight/blob/master/scripts/HPOUtil.py).

PyxisMap
The second phenotype feature is derived from PyxisMap (https://github.com/HudsonAlpha/LayeredGraph). We derived our data from PyxisMap v1.2, and we ran the standard installation script that downloads all required data sources on December 19, 2018.
To run PyxisMap, we passed in the patient phenotype terms and obtained an ordered rank of all genes. For each variant, we then selected the best rank from genes tied to the variant, normalized the rank, and stored it as the "PyxisMap" feature tied to the variant. Note that this process is identical to how ranks were handled in the "HPO-cosine" feature. For precise implementation details, please refer to the source code (https://github.com/HudsonAlpha/VarSight/blob/master/scripts/PyxisMapUtil.py).
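The best-rank normalization shared by the "HPO-cosine" and "PyxisMap" features can be sketched as follows. Gene names here are placeholders; the real code maps each variant to its overlapping genes via the Codicem annotations.

```python
def best_rank_feature(variant_genes, gene_ranks):
    """Return R/G, where R is the smallest rank among a variant's genes
    and G is the total number of ranked genes."""
    total_genes = len(gene_ranks)
    best = min(gene_ranks[g] for g in variant_genes if g in gene_ranks)
    return best / total_genes

# Illustrative ranking of four genes (rank 1 = most phenotypically relevant)
gene_ranks = {"GENE_A": 1, "GENE_B": 5, "GENE_C": 10, "GENE_D": 2}
feature = best_rank_feature(["GENE_B", "GENE_D"], gene_ranks)  # min(5, 2) / 4 = 0.5
```

Lower feature values therefore indicate variants in genes that the phenotype ranking placed closer to the top.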

Codicem Features
Codicem annotates each variant using many different data sources. We selected the annotations that we believed were most frequently used by analysts when reviewing a case. Each annotation is assigned a feature type corresponding to how that annotation was turned into a feature for the models:
1. float - The feature is a single floating-point decimal value that is copied directly from the annotation.
2. single - The annotation is a categorical field with a single value per variant. Each allowed value is assigned a numerical value, and that value is used as the feature.
3. multiple - The annotation is a categorical field with multiple values per variant. The total number of occurrences of each allowed value is calculated per variant, and each count is stored as a feature. This is the only annotation type resulting in multiple features per variant.
4. float reduce - The annotation is a floating-point decimal with multiple values per variant. These float values are combined using a reduction method (typically minimum or maximum, as defined by the JSON configuration), and the single-value result of this reduction is used as the feature.
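The four feature types above can be sketched as simple transforms. The field names, category codings, and allowed values below are invented for illustration; the real mappings are defined in fields_metadata.json.

```python
def float_feature(annotation):
    """'float': the annotation value is copied directly."""
    return float(annotation)

def single_feature(annotation, coding):
    """'single': each allowed category value maps to a number."""
    return coding[annotation]

def multiple_features(annotations, allowed):
    """'multiple': one count feature per allowed category value."""
    return {value: annotations.count(value) for value in allowed}

def float_reduce_feature(annotations, reducer=max):
    """'float reduce': collapse multiple floats with min/max per the config."""
    return reducer(annotations)

# Illustrative inputs (not actual Codicem annotation values)
coding = {"benign": 0, "vus": 1, "pathogenic": 2}
coded = single_feature("vus", coding)
counts = multiple_features(["missense", "missense", "stop_gained"],
                           ["missense", "stop_gained", "synonymous"])
reduced = float_reduce_feature([0.12, 0.87, 0.45])  # default reducer is max
```

As noted in the text, only the "multiple" type fans out into more than one feature per variant, which is why 48 Codicem annotations can yield a different number of model inputs than annotations.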
The JSON describing features is available on GitHub (https://github.com/HudsonAlpha/VarSight/blob/master/CODI_metadata/fields_metadata.json), and a selection of that information is reproduced in Table 1.

Hyperparameter Tuning
For each classifier, a selection of hyperparameters was tested based on recommendations from sklearn and imblearn. We performed a brute-force search over the entire hyperparameter space using the GridSearchCV method of sklearn. This method tries every combination of hyperparameters and selects the combination with the best performance as defined by the user. To define performance, we performed stratified 10-fold cross validation (StratifiedKFold in sklearn) on the training data and optimized for the best F1-score (see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html for more information).
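The search setup described above can be sketched with sklearn directly. The classifier, grid values, and synthetic data below are placeholders, not the actual VarSight grids or training data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=200, random_state=42)

# Illustrative hyperparameter grid; GridSearchCV tries every combination
grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=grid,
    scoring="f1",                     # optimize for best F1-score
    cv=StratifiedKFold(n_splits=10),  # stratified 10-fold cross validation
)
search.fit(X, y)
best = search.best_params_
```

With 4 grid points and 10 folds, this setup fits 40 models plus one final refit on the full training data with the winning combination.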

Execution
Exomiser was run twice using two different final prioritization algorithms: hiPhive and hiPhive (human only).

.jar --analysis [ymlFN]
The ymlFN parameter is a templated file with parameters for the VCF filename, the HPO terms, and the output path. The template for the hiPhive execution is available at https://github.com/HudsonAlpha/VarSight/blob/master/ExomiserTemplates/hiPhive-template.yml and the template for the hiPhive (human only) execution is available at https://github.com/HudsonAlpha/VarSight/blob/master/ExomiserTemplates/hiPhive_human-template.yml. These templates were populated with the corresponding information for each patient using the same variants and HPO terms as were used in the VarSight classifiers.

Installation
We followed the instructions on the Phen-Gen website (http://phen-gen.org, click on "DOWNLOAD") to download Phen-GenV1 and followed directions in the README file to install and run Phen-Gen.

Execution
Phen-Gen takes the HPO terms and VCF file as input to the algorithm. For the HPO terms, we created a plain-text file with one HPO term per line as instructed. We used the same pre-filtered VCF file as was used in all other external tools.
Phen-Gen also requires extra parameters that were not needed by other tools, specifically a pedigree file with sex information, a mode of inheritance, and a predictor type. For the pedigree file, we simply created a one-member pedigree with the sex field populated with the annotated sex of the patient. For mode of inheritance, we selected dominant, as this tended to rank more variants than recessive. Finally, we selected the predictor type to be "genomic" as opposed to "coding" only. In the templated command used to run Phen-Gen, hpoFN is the filename of the HPO file, vcfFN is the filename of the VCF file, and pedFN is the filename of the pedigree file.

Installation
We followed the instructions on the GitHub README (https://github.com/bio-ontology-research-group/phenomenet-vp) to download the distribution file for DeepPVP v2.1.1.

Results Extended

Ranking Boxplot
Figure 2 shows the rankings from our experimental results in a boxplot format.

Figure 2: Results Boxplot. This image shows the rankings from our experiments in a boxplot format (see Tables 3 and 4 in the main paper for summarized results). Each reported variant rank is a red circle on the corresponding method with an overlaid, standard boxplot for all rankings for the method. Note, the values in the figure represent only the ranking of reported variants in the test set.

Codicem Filtering
The following sections describe the filter that was passed to Codicem for generating the test and training set variants. Figure 3 shows the sequential order of the filters, and the following subsections briefly describe each filter unit's purpose.

Total Depth filter
The total depth filter is primarily a QC-related filter intended to remove variant calls with low total coverage. This filter only passes variants that have at least 8 reads overlapping the locus. Figure 4 shows a visualization of the filter.

Percentage of Reads Supporting Allele filter
This filter is also a QC-related filter intended to remove variant calls with low support for the alternate allele. The filter only passes variants with more than 15% of the reads supporting the alternate allele. Figure 5 shows a visualization of the filter.
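The two QC filters above can be combined into a single predicate. The thresholds (at least 8 reads of total depth, more than 15% alternate allele support) come from the text; the function and parameter names are illustrative.

```python
def passes_qc(total_depth, alt_reads, min_depth=8, min_alt_fraction=0.15):
    """Return True if a variant call passes both QC filters:
    total depth >= 8 reads and alternate allele fraction > 15%."""
    if total_depth < min_depth:
        return False
    return (alt_reads / total_depth) > min_alt_fraction

ok = passes_qc(total_depth=30, alt_reads=12)           # 40% alt support, passes
low_depth = passes_qc(total_depth=5, alt_reads=3)      # fails: depth below 8
low_support = passes_qc(total_depth=100, alt_reads=10) # fails: only 10% support
```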

Population Allele Frequency filter
This filter removes variants that have a high population frequency. There are three general ways for a variant to pass this filter: 1) be a rare variant in GnomAD AND a rare variant in ExAC; 2) be semi-rare in GnomAD, semi-rare in ExAC, AND have a high-confidence damaging annotation from HGMD; or 3) be semi-rare in GnomAD, semi-rare in ExAC, AND have a ClinVar classification labeling it as pathogenic, likely pathogenic, or a drug response. Additionally, there is one variant from the gene F5 (rs6025) allowed through this filter. Note that this specific variant has a very high alternate allele frequency because the reference allele is actually the disease-causing mutation. Figure 6 shows a visualization of the filter.
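A hedged sketch of the three-way pass logic described above. The "rare" and "semi-rare" cutoffs below are placeholders; the actual thresholds live in the Codicem filter configuration and are not stated in this text.

```python
RARE, SEMI_RARE = 0.001, 0.01  # illustrative cutoffs only

def passes_frequency_filter(gnomad_af, exac_af, hgmd_damaging=False,
                            clinvar_path_or_drug=False, rsid=None):
    """Population allele frequency filter with the F5 (rs6025) exception."""
    if rsid == "rs6025":
        return True  # F5 exception: the reference allele is the pathogenic one
    if gnomad_af <= RARE and exac_af <= RARE:
        return True  # way 1: rare in both GnomAD and ExAC
    if gnomad_af <= SEMI_RARE and exac_af <= SEMI_RARE:
        # ways 2 and 3: semi-rare in both, plus HGMD or ClinVar support
        return hgmd_damaging or clinvar_path_or_drug
    return False
```

For example, a semi-rare variant passes only when backed by a high-confidence damaging HGMD annotation or a qualifying ClinVar classification, while a common variant fails regardless of annotations (rs6025 excepted).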

HGMD, ClinVar, CADD, and Effects filter
This filter removes variants that are not predicted or annotated to have an effect on a transcript. A variant can pass this filter in any of four ways: 1) have an HGMD accession, 2) have a ClinVar accession, 3) have an effect that is predicted to modify a transcript, or 4) have a very high CADD score. Figure 7 shows a visualization of the filter.

Gene Has Associated Disease filter
This filter removes variants for which there is no annotated disease name in HGMD, OMIM, or ClinVar. If any of those annotations for the variant has a disease name, then the variant will pass this filter. Figure 8 shows a visualization of the filter.

Red Herring filter
This filter removes variants that are sequencing artifacts. Our analysts found that some variants are not found in population databases, but show up frequently (> 20% of samples) in our sequencing data. These variants are believed to be sequencing artifacts, so Codicem maintains a database of these variants to filter out during analysis. Figure 9 shows a visualization of the filter.

Repeats filter
This filter removes variants that are polymorphic repeats and/or artifacts from sequencing. It filters out any variants that are found in repeat tracks if there is no associated entry in HGMD or ClinVar. Figure 10 shows a visualization of the filter.

Figure 9: Red Herring filter. This filter removes variants that are commonly found through sequencing but do not appear in population databases like gnomAD or ExAC. These variants are believed to be sequencing artifacts, so they are labeled as "Red Herring" variants and filtered out.

Figure 10: Repeats filter. This filter removes variants found in repeat tracks if there is no associated entry in HGMD or ClinVar. These variants are generally considered polymorphic and/or sequencing artifacts unless there are annotations from disease databases.

Figure 11: Unknowns filter. This filter removes variants with an "Unknown" effect on transcription if there is no additional support for that variant through either CADD or HGMD.

Unknowns filter
This filter removes variants that have an "Unknown" effect on transcription if there isn't additional support for that variant through either CADD or HGMD. If CADD is greater than 10 or there is an HGMD accession tied to the variant, it will pass this filter. Figure 11 shows a visualization of the filter.
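The rule above translates directly into a small predicate. The thresholds (CADD > 10, presence of an HGMD accession) come from the text; the field names are illustrative.

```python
def passes_unknowns_filter(effect, cadd, hgmd_accession=None):
    """Unknown-effect variants pass only with CADD > 10 or an HGMD accession;
    variants with any other effect are unaffected by this filter."""
    if effect != "Unknown":
        return True
    return cadd > 10 or hgmd_accession is not None

kept = passes_unknowns_filter("Unknown", cadd=25.1)
also_kept = passes_unknowns_filter("Unknown", cadd=4.2, hgmd_accession="CM000001")
dropped = passes_unknowns_filter("Unknown", cadd=4.2)
```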

Near Splice filter
This filter removes near splice variants that do not have additional annotated or predicted support indicating that the variant would impact splicing. If a variant is a "Near Splice Site Alteration", that variant will only pass the filter if one of the following is true: 1) both splice prediction algorithms predict an impact on splicing, 2) ClinVar has a pathogenic or likely pathogenic annotation, or 3) CADD is relatively high. Figure 12 shows a visualization of the filter.

Figure 12: Near Splice filter. This filter removes "Near Splice Site Alteration" variants that are not supported to be deleterious by splice prediction algorithms, ClinVar annotations, or CADD predictions.