VariantscanR: an R-package as a clinical tool for variant filtering of known phenotype-associated variants in domestic animals

Background Since the introduction of next-generation sequencing (NGS) techniques, whole-exome sequencing (WES) and whole-genome sequencing (WGS) have not only revolutionized research, but also diagnostics. The gradual switch from single gene testing to WES and WGS required a different set of skills, given the amount and type of data generated, while the demand for standardization remained. However, most of the tools currently available are solely applicable for human analysis because they require access to specific databases and/or simply do not support other species. Additionally, a complicating factor in clinical genetics in animals is that genetic diversity is often dangerously low due to the breeding history. Combined, there is a clear need for an easy-to-use, flexible tool that allows standardized data processing and preferably, monitoring of genetic diversity as well. To fill these gaps, we developed the R-package variantscanR that allows an easy and straightforward identification and prioritization of known phenotype-associated variants identified in dogs and other domestic animals. Results The R-package variantscanR enables the filtering of variant call format (VCF) files for the presence of known phenotype-associated variants and allows for the estimation of genetic diversity using multi-sample VCF files. Next to this, additional functions are available for the quality control and processing of user-defined input files to make the workflow as easy and straightforward as possible. This user-friendly approach enables the standardisation of complex data analysis in clinical settings. Conclusion We developed an R-package for the identification of known phenotype-associated variants and calculation of genetic diversity. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-023-05426-6.


Introduction
Since the introduction of next-generation sequencing (NGS) techniques, whole-exome sequencing (WES) and whole-genome sequencing (WGS) not only revolutionized research, but also diagnostics. At present time, most of the tools available for the evaluation of NGS data are solely applicable for human analysis. On top of this, Veterinary Medicine also faces the additional problem that genetic diversity can be dangerously low, especially in the dog. For this reason, we developed variantscanR, an R-package, for the easy and straightforward identification of known disease-causing variants from a large collection of variants present in dogs and other animals.
The R-package variantscanR enables the filtering of variant call format (VCF file) files for the presence of known disease-causing variants. In addition to the main variant-filtering function, the package allows for the estimation of genetic diversity using both single and multisample VCF files. User-defined, file format-specific options are available for the quality control of certain parameters using online database screening. Finally, next to the filtering of known disease-causing variants, an additional step is optional to gather the remaining variants located within the genes of interest.
VariantscanR is an R package, designed for the use as a clinical tool for variant filtering of known diseasecausing variants in domestic animals. The package focuses on cats and dogs but can be used for other animals if the required data and information for this animal is available.
This vignette is divided into four parts:

Preprocessing Variant filtering Diversity Extra
A visual representation of the workflow is given below. To explain the workflow of this package, an example will be used and elaborated in this vignette. Disclaimer: Not the entire VCF dataset is used for as this is a very large file.
file:///private/var/folders/fg/58ryc6k94cj16ycl0cnjh2fh0000gn/T/RtmpB465O3/preview-b9234fe3d1d4.dir/variantscanR_vignette.html 2/15 Example: Labrador retriever WES data were obtained from 16 different dogs for an independent project at the laboratory. The data used in this vignette is for purely demonstrative purposes. One sample, from here on referred to as sample_9, was chosen randomly out of the pool of 16 dogs and is a yellow, female Labrador retriever.

Preprocessing
For the filtering of variants, the pipeline needs a least 3 different input files. The first one being the file containing the variants of interest (VOI file). Secondly, the VCF file to which the variants of interest are filtered against and thirdly, a BED file for the annotation of variants is needed throughout the worflow. These 3 files require a certain format, which is explained in the following sections.

Input files
(1) VCF file VCF stands for 'Variant Call Format' and is a standardized text file format used for reporting SNPs, indel and structural variations. A VCF file is composed of 3 main parts: Meta-data, starting with '##' A header that starts with '#CHROM' Variant call records The first 9 columns of the header line and the variant call record line give information on the variants found.
The VCF file already has a certain fixed format and doesn't need adjusting. In fact, please never edit a VCF in a word processor by hand, because it will disrupt the formatting!

(2) Variants Of Interest file (VOI file)
The VOI file contains the known disease-causing variants you are interested in. This file does need a certain format so it can be used in downstream analysis. Following columns need to be specified: The columns in bold are mandatory, the other ones are optional. The optional columns need to be included in the file, but don't need values, so NA values are allowed.
However, the function also allows another 'specific' format that can be downloaded straight from the internet, to make it a little more user-friendly. You can upload the omia.csv file into the R environment using: file <-read.csv("file_name.csv"). Make sure that the file is located within your working directory, otherwise you can use: file_path <-"path/to/your/file.csv" data <-read.csv(file_path) the pRocess function is able to handle the .csv file. And that was it! Now you are ready to do some variant filtering!

(Extra) Chromosome names table
Next, depending on the reference genomes used for creating the VCF file, the chromosome notation can have different denominations. The chromosomenameR function allows the conversion, if needed. The chromosomes have multiple different nomenclatures depending on the reference used for mapping. For example, in dogs, chromosome 1 can be noted as CM000001.3 or NC_006583.3, using the GenBank or RefSeq sequence, respectively. This notation needs to be changed to the 'chrN' notation, with N being the number of the chromosome. This is done using the chromosomerenamR function.
For this conversion a file a needed providing both names. For the dog, such a file is available in the package Data folder. The original names are stored in the first column and the 'chrN' names are placed in the second column.

VCFscanneR
Because the VCF file format is not one of the standard formats that can be uploaded in the R environment, the vcfscannR function was created. This function uploads a multisample sample VCF file into the R environment and turns it into a single sample VCF, ready for downstream analysis. The function has 2 parameters. The first parameter is the name of the VCF file that needs to be uploaded OR contains the path to the file. Both .vcf and .vcf.gz formats are accepted. The second parameter is the name of the sample of interest. The sample name has to match with that of the VCF file. In our example, the name is sample_9.
The example VCF file (SNPs.recode.subset.rename.vcf.gz) has 31 rows and 10 columns and contains variants found within chromosome 5.
After uploading the multisample VCF file with vcfscanneR function, the VCF looks like this: ## Scanning file to determine attributes.  Now, our VCF file is ready for variant filtering.

annotateR_NCBI.R
As was explained in the input section above, the BED file itself needs anotating. For this step, 2 separate functions were created that essentially do the same thing. One function can be used for a NCBI RefSeq BED file. Whereas the other function was designed to handle an Ensemble Genes (see section 'input file').
For demonstrative purposes, only the NCBI specific function, annotateR_NCBI function will be demonstrated for as the BED file used in this example is NCBI RefSeq based.

NOTE:
The entire workflow has been performed on the same dataset using the same variants file for both the NCBI refseq BED file and the Ensembl genes BED file. It was noticed that overall less variants were attained after filtering with the Ensembl genes BED file compared to the NCBI refseq BED file. This is mostly due to different annotation of the reference sequence and that less gene names were present/annotated in the Ensembl genes BED file.

pRocess
Now that all the input files have been amended for downstream analysis. A final preprocessing step, that is a mandatory in the pipeline, can be performed.
The pRocess function has several parameters of which a couple are optional. Because files that are manually made or adjusted are prone to human errors, quality control might be useful. The pRocess function includes some optional quality control filters. For example, the genomic locations provided for each variants are screened for errors by comparison with a reference genome. If a mismatch is encountered, the user will get notified of this error and will be asked to check this information for this variant. Next to this, an assembly version check can be performed by providing the "refseq" parameter with this information. If a variant does not belong to the right verion, that variant is removed from the VCF file. This is "CanFam3.1" in our example. If the variants file is downloaded from the OMIA website (as explained above in the Input files section), it is possible to provide the function with this information by setting the OMIA paramater to TRUE (this is FALSE by default).
In summary, we have 6 parameters: 1. variants_file: Required data("dog_BED_NCBI", package = "variantscanR") data("dog_allfields_NCBI", package = "variantscanR") This shows a good example of the quality control that is performed on the location of the variants. For the first SNP located on chromosome 6 at location 55146549, the user-defined variants_file states that a C is should be present at that location. However, when comparing with an online Reference Sequence, no match is found. The output tells the user to double check the information on this variant. When looking up this variant on OMIA.org, from where the file was downloaded, it shows that it is an insertion of C on that location, which explains why it did not match the reference. For as all the other information was correct, it won't cause any problems for downstream analysis. For the 5th variant, no reference was given and therefore, that row needs revisioning. The output of this function is difficult to show because raw .html files are created that will pop up in your internet browser. This head of the report is shown here, but it is not very appealing. For this reason, images of the output are provided in the following section.

Reporting
For demonstrative purposes, not the entire VCF dataset was used for the worked out example in this vignette. However, we do believe it is important to work out a real example using the entire dataset and show the outcome. For this reason, images were made of the real outcome and are shown below.
So if we go through the entire pipeline, after the variantfiltR function, an html report is created. The minimal output is an interactive overview of the various table created after filtering (fig2). By clicking on the desired table, the output will be displayed.

Figure 2: Report overview of tables
Next to this, breeding advice is also provided for heterozygous (fig3) and homozygous (fig4) variants according to the inheritance pattern.  The second table contains variants that are not homozygous wild type but are not known to have implications in or known to occur in the breed of interest. Still, this kind of information might be important to report (fig6).

Figure 6
Lastly, the third

Diversity
The variantscanR package also features a diversity analysis.

diveRsity
The input is a multisample VCF file and the name of one specific sample the user is interested in. Next to the VCF file, the user should also provide a file (this can be for example an Excel file) with the breeds of the samples. The order of the rows should match the order of the samples in the VCF file.This means that the first row contains the breed of the first sample and so on. Please make sure that the column name for that column is 'Breed' with the capital letter 'B'.
This shows the first 6 rows of an example 'Breeds' file: Changing the column names can be done like this: colnames(Breeds) <-"Breed" Most of the output of the diveRsity function are graphs, however, also a table with the actual diversity values is given. The measure used to quantify diversity is the average heterozygosity, which is calculated as follows: With being the level of heterozygosity, the number of heterozygous loci in the sample of interest and , the total number of loci used.
is the same for every sample of the multisample VCF file.

Reporting
As with the variantfiltR function, the not the entire VCF dataset was used in the worked-out example for the diveRsity function. However, the graphs of the real example are displayed below.

Figure 8: Basic overview
The second graph reveals the name of every sample. The third graph only reveals the name of the sample of interest, that was provided as a parameter to the diveRsity function.

Figure 10
Lastly, the fourth graph only shows the highest sample of every breed.

Figure 11
Next to the graphs, a table is included in the output of the diveRstity function, containing the heterozygosity values per sample.

Figure 12
extrafiltR When there is genetic heterogeneity, the variant responsible for a certain phenotype might not be known and as such are not present in the VOI file. The disease-associated variant might however occur in genes known to be associated with the phenotype. The optional extrafiltR function collects all other variants present in the genes that contain the variants of interest as this might be valuable input in these cases. A dataframe is rendered in the R environment. This dataframe can be searched by the user to identify potentially interesting variants.