Loqusdb: added value of an observations database of local genomic variation

Background Exome and genome sequencing is becoming the method of choice for rare disease diagnostics. One of the key remaining challenges is distinguishing the disease-causing variants from the benign background variation. After analysis and annotation of the sequencing data there are typically thousands of candidate variants requiring further investigation. One of the most effective and least biased ways to reduce this number is to assess the rarity of a variant in any population. Currently, there are a number of reliable sources of information for major population frequencies when considering single nucleotide variants (SNVs) and small insertions and deletions (INDELs), with gnomAD as the most prominent public resource available. However, local variation or frequencies in sub-populations may be underrepresented in these public resources. In contrast, for structural variation (SV), the background frequency in the general population is more or less unknown, mostly due to challenges in calling SVs in a consistent way. Keeping track of local variation is one way to overcome these problems and significantly reduce the number of potential disease-causing variants retained for manual inspection, both for SNVs and SVs.
Results Here, we present loqusdb, a tool to solve the challenge of keeping track of any type of variant observations from genome sequencing data. Loqusdb was designed to handle a large flow of samples and, unlike other solutions, samples can be added continuously to the database without rebuilding it, facilitating improvements and additions. We assessed the added value of a local observations database using 98 samples annotated with information from a background of 888 unrelated individuals.
Conclusions We show both how powerful SV analysis can be when filtering for population frequencies and how the number of apparently rare SNVs/INDELs can be reduced by adding local population information even after annotating the data with other large frequency databases, such as gnomAD. In conclusion, we show that a local frequency database is an attractive and necessary addition to the publicly available databases that facilitate the analysis of exome and genome data in a clinical setting.

For SNVs/INDELs the sequencing data were preprocessed and analyzed according to the standard GATK best practice procedure [1]. More details about the data processing are described in [2]. The SNV files were then decomposed and normalized with VT version 0.5772 (https://genome.sph.umich.edu/wiki/Vt). Structural variants were generated using FindSV (https://github.com/J35P312/FindSV), a pipeline that combines output from the SV callers TIDDIT 2.2.6 [3] and CNVnator version 0.3.3. FindSV was executed using the binary alignment (BAM) files from the preprocessing stage described above. We chose to only include variants from confidently callable parts of the genome as described in [4], to avoid problematic regions, often of low complexity, where variant calls are uncertain.

A. Data processing, analysis and filtering
First, a local database, SweGenDB, was constructed from variants detected in 888 individuals from the SweGen cohort by loading their data into a local loqusdb version 2.4 database (see Supplementary section D, Construction of local databases). We then annotated the variants detected in 98 other SweGen samples with allele frequencies from gnomAD version 2.1.1. SNVs/INDELs were annotated with VEP v92 [5], and SVs were annotated with SVDB (https://github.com/J35P312/SVDB) (see Supplementary Materials E and F for more details). The variants were then also annotated with the observed number of occurrences (observations) per alternative genotype in SweGenDB, both for SNVs/INDELs and SVs (see Supplementary Materials G for more details). Next, we ran different filtering scenarios to compare the number of variants that could be dismissed based on their frequency in gnomAD, SweGenDB or both. For all scenarios, we removed variants with allele frequencies (AF) higher than 1% according to the gnomAD VCF key POPMAX AF. Sets of genes with known or suspected association with disease were collected into gene panels based on disease phenotypes. This is a common approach to reduce the number of variants under investigation to a clinically relevant subset of candidate variants. The variable measured was the number of variants left to interpret, scaled by different sizes of gene panels. We used the intellectual disability (ID) gene panel version 2.833 (979 consensus genes) from PanelApp (https://panelapp.genomicsengland.co.uk), and a Mendeliome panel consisting of all disease-associated genes in Online Mendelian Inheritance in Man (OMIM) (https://omim.org/; 3756 genes). In the OMIM panel, all genes where a disease relationship is "established" or "provisional" were included. We also investigated how the number of individuals in a local observation database affects the number of variants that can be dismissed.
All experiments were performed for both SVs and SNVs/INDELs.
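The filtering scenarios described in this section can be illustrated with a minimal Python sketch. It is not the code used in the study: the `Variant` record, the gene names, the toy values, the local observation cutoff `max_obs`, and the choice to keep variants that are absent from gnomAD are all assumptions made for illustration. Only the 1% gnomAD cutoff and the restriction to a gene panel come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    gene: str
    popmax_af: Optional[float]  # gnomAD POPMAX AF; None if absent (assumption)
    local_obs: int              # observations in the local database


def passes_gnomad(v: Variant, max_af: float = 0.01) -> bool:
    """Keep variants with AF at or below the cutoff, or absent from gnomAD."""
    return v.popmax_af is None or v.popmax_af <= max_af


def passes_local(v: Variant, max_obs: int) -> bool:
    """Keep variants seen at most max_obs times locally (hypothetical cutoff)."""
    return v.local_obs <= max_obs


def count_remaining(variants, panel, max_obs=None):
    """Count panel variants left after gnomAD and, optionally, local filtering."""
    kept = [v for v in variants if v.gene in panel and passes_gnomad(v)]
    if max_obs is not None:
        kept = [v for v in kept if passes_local(v, max_obs)]
    return len(kept)


# Toy data, not taken from the study.
variants = [
    Variant("GENE_A", 0.0002, 0),   # rare everywhere
    Variant("GENE_A", 0.05, 40),    # common in gnomAD
    Variant("GENE_B", None, 25),    # absent from gnomAD, frequent locally
    Variant("GENE_C", 0.001, 1),    # outside the panel
]
panel = {"GENE_A", "GENE_B"}

gnomad_only = count_remaining(variants, panel)
gnomad_and_local = count_remaining(variants, panel, max_obs=10)
```

In this toy example the locally frequent `GENE_B` variant survives the gnomAD filter but is dismissed once local observations are taken into account, which is the kind of effect the comparison is designed to measure.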

B. Chromosome Y haplogroups
The chromosome Y haplogroups were predicted using Yleaf 2.0 [6]. Yleaf was run with a command in which $1 is the WGS BAM file and $2 is the output file. The command was run for all individuals in the SweGen cohort. Individuals with a non-zero Q score were considered male, and all other individuals were assumed to be female; statistical assessments were performed using the Mann-Whitney U test. Results are visualised in Figure 1 and Tables I and II.
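For readers unfamiliar with the two steps named above, the following Python sketch shows the sex assignment rule and how a Mann-Whitney U statistic is computed from ranks. The function names and the example numbers are hypothetical, and the sketch returns only the U statistics, not a p-value; in practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used.

```python
def assign_sex(q_score: float) -> str:
    """Non-zero Yleaf Q score is treated as male, otherwise female."""
    return "male" if q_score != 0 else "female"


def rank_with_ties(values):
    """Return 1-based ranks, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two samples using rank sums."""
    x, y = list(x), list(y)
    ranks = rank_with_ties(x + y)
    rank_sum_x = sum(ranks[:len(x)])
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


# Toy per-individual counts for two groups, not taken from the study.
group_a = [1, 2, 3]
group_b = [4, 5, 6]
u_a, u_b = mann_whitney_u(group_a, group_b)
```

A U value of 0 for one group means every value in that group is smaller than every value in the other, as in the toy data above.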

C. Sample ID profiling
Loqusdb has a genotype-based sample ID profiling feature to detect whether a case has already been inserted into the database. This avoids sample duplication in the database, which would distort the observation counts. The user provides a set of positions; each case then gets a combined mutation profile depending on the variation it carries at those positions. These variants should be common variants in the population, chosen so that each combination is unique. Loqusdb comes with a standard set of 50 SNVs that can be used as a default set of positions for sample profiling. When a new case is loaded, loqusdb creates a string based on the variant calls at these positions and computes the Hamming distance to all existing cases in the database. For more information see the loqusdb documentation.
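The idea behind profile comparison can be sketched in a few lines of Python. This is an illustration of Hamming-distance matching in general, not loqusdb's implementation: the genotype encoding ("0", "1", "2" per position), the function names, the case identifiers and the distance threshold are all assumptions.

```python
def profile_string(genotype_codes):
    """Join per-position genotype codes into one profile string.

    The encoding is hypothetical: "0" = homozygous reference,
    "1" = heterozygous, "2" = homozygous alternative.
    """
    return "".join(genotype_codes)


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same positions")
    return sum(x != y for x, y in zip(a, b))


def find_similar(new_profile, existing_profiles, max_distance):
    """Return IDs of existing cases within max_distance of the new profile."""
    return [
        case_id
        for case_id, profile in existing_profiles.items()
        if hamming_distance(new_profile, profile) <= max_distance
    ]


# Toy profiles over four positions, not real data.
existing = {"case_1": "0120", "case_2": "2211"}
new_case = profile_string(["0", "1", "2", "1"])
possible_duplicates = find_similar(new_case, existing, max_distance=1)
```

Allowing a small non-zero distance, rather than requiring an exact match, tolerates occasional genotyping differences between two runs of the same sample; the appropriate threshold depends on the number of positions and the call quality.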

D. Construction of local databases
Local databases were constructed using loqusdb version 2.4 (https://github.com/moonso/loqusdb). The databases were constructed using the following command:
    loqusdb --port $PORT load --variant-file $SNV_VCF --sv-variants $SV_VCF --gq-treshold 0 -
Where SNV_VCF is an SNV VCF file, SV_VCF is an SV VCF file, PORT is the port used by the loqusdb instance, and SAMPLENAME is the name of the sample. The command is run once for each individual to be loaded into the database. This command was used to construct six databases of different sizes (10, 48, 100, 196, 296 and 888 individuals, respectively); these databases were subsequently used to annotate the VCF files of 98 individuals not included in any of the databases.

E. Annotation of SNV VCF files
The resulting VCF files were annotated with VEP. In addition to annotating, the VEP command filters for variants with a low, moderate, or high consequence (https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html). The resulting filtered VCF file was annotated using each one of the internal frequency databases. This was done to remove variants lacking the POPMAX AF sites, as SVDB requires all variants to carry the frequency information specified by the −−frequency tag parameter.

G. Annotation of local observations
Finally, all VCFs, both SNVs and SVs, were annotated with the local observations from the loqusdb instance. For SNVs the command was:
    loqusdb --port $PORT annotate $SNV_VCF > $FOLDER/$SNV_FILE
and for SVs:
    loqusdb --port $PORT annotate --sv $SV_VCF > $FOLDER/$SV_VCF
Where SNV_VCF is an SNV VCF file, SV_VCF is an SV VCF file, PORT is the port used by the MongoDB instance, and FOLDER is the folder that the annotated file will be written to. These two commands are run separately for each individual and database size.

Tables

TABLE I: Table S1. SV variants filtered and haplogroup for each individual.