Data preparation
Whole genome sequencing of all samples was performed on the NovaSeq 6000 platform with a target coverage of 30 by the commercial provider. The NEBNext Ultra DNA Library Prep Kit was used for library preparation.
Dataset structure
We analyzed in total 124 WGS samples, all corresponding to healthy Czech individuals. Out of them, 10 samples formed a threegenerational family with relationship coefficients of \(\left( {0, 0.125, 0.25, 0.5} \right)\). In the article, we refer to this family as family \({\mathbf{A}}\). Another 12 samples within our dataset formed a different threegenerational family with relationship coefficients of (\(0, 0.0625, 0.125, 0.25, 0.5\)). In the article, we refer to this family as family \({\mathbf{B}}\). The remaining 102 samples were unrelated individuals.
Data preprocessing and SV identification
The fastq files corresponding to individual probands were processed by the generic data preprocessing pipeline published by Broad Institute [8]. The pipeline aligns sequences within fastq files to the hg38 genome build, performs base recalibration, and produces analysisready bam files. We also performed quality control using another pipeline published by Broad Institute [9]. Based on quality analysis, we excluded three samples with very low coverage.
The SVs were detected using Manta [10]. The vcf files containing detected SVs were merged by Survivor in such a way that variants occurring close to each other were not merged by the program. This was achieved by setting the maximum allowed distance between merged SVs to 1 bp. We also used only SVs longer than 50 bp for the following analyses.
Formal definitions
In this work, the SV data is represented by a genotype matrix with \(N\) rows corresponding to individual SVs and \(M\) columns corresponding to individual probands. The elements of the matrix correspond to the genotypic state of the variant in a given individual. We distinguished three genotypic states: \(0\) represents the reference homozygous genotype, \(1\) represents the heterozygous genotype, and \(2\) represents the alternative homozygous genotype.
Every SV can then be represented by a genotype vector \(v_{i}\) of length \(M\), containing a genotype state of \(i\)th SV for every individual, where \(i = 1, \ldots ,N\).
We defined the merging of any subset of SVs as a simple summation: \(\mathop \sum \nolimits_{{i \in {\varvec{s}}}} v_{i}\), where \({\varvec{s}} \subseteq \left\{ {1, \ldots ,N} \right\}\).
We call SV pair with corresponding vectors \(v_{i}\) and \(v_{j}\) “mergingincompatible” if \(k \in \left\{ {1, \ldots ,M} \right\}\) exists, where \(v_{i} { }\left( k \right) \ne 0\) \(\wedge\) \(v_{j} { }\left( k \right) \ne 0\) and \(i,j \in \left\{ {1, \ldots ,N} \right\}\).
To demonstrate the concepts used in this work, we also represented separately the genotypic state of the \(i\)th SV in the members of any parent–child trio by a vector \(u_{i}\) of length three, with the convention that the child genotype appears at the first position.
We call \(u_{i}\) nontrivial if \(l \in \left\{ {1,2,3} \right\}\) exists, where \(u_{i} { }\left( l \right) \ne 0\).
Algorithms for SV clustering
The differences between commonly used clustering strategies lie mainly in the definition of the dissimilarity measures used for clustering SVs of the same type. Two basic definitions are widely used. The first is defined as a function of overlap between genomic regions corresponding to SV pair [4, 7]. We used the following form in this work:
$$D_{1} = 1  \frac{{\left {g_{1} \cap g_{2} } \right}}{{\max \left( {\left {g_{1} } \right,\left {g_{2} } \right} \right)}}$$
where \(g_{1}\) and \(g_{2}\) are genomic intervals corresponding to two SVs.
The second is the maximum of two distances between the starting and ending position of the SV pair [6]. We used the following form in this work:
$$D_{2} = \max \left( {\left {{\text{start}}\left( {g_{1} } \right)  {\text{start}}\left( {g_{2} } \right)} \right,\left {{\text{end}}\left( {g_{1} } \right)  {\text{end}}(g_{2} } \right} \right)$$
where \({\text{start}}\left( {g_{1} } \right)\) denotes the starting genomic position of \(g_{1}\) and \({\text{end}}\left( {g_{1} } \right)\) denotes the ending genomic position of \(g_{1}\), analogically for \(g_{2}\).
With the definition of the dissimilarity measure, the clustering procedure is straightforward: we must select a threshold value (\(D_{max}\)) and then find components of the graph defined by pairs with a value of \(D_{1\left( 2 \right)} \le D_{max}\). These components are formed by SVs that we assume correspond to a single SV detected with slightly different breakpoints within different samples. All SVs forming the component are then merged into one.
There is one ambiguity at this point that must be considered. It could happen that during the merging of two SVs, both are presented within a single sample in a nonreference genomic state. It is ambiguous what the resulting genotype should be for this sample after merging both SVs. We call a pair of SVs that cannot be unambiguously merged “mergingincompatible” in the article—see the “Formal definitions” section above. Surprisingly, this problem is not mentioned in the studies, where SV clustering is used to merge large numbers of samples. According to our knowledge, there are two possible explanations for this: 1) these two SVs are on different strands of DNA, or 2) both are on the same strand. It follows that the abovementioned SVs represent different variants detected within a single sample and should not be merged.
A simple solution to this situation may be not to merge SVs in the cluster that show the presence of a mergingincompatible SV pair. We presented the results of this strategy within the article for the sake of completeness. We refer to this as “trivial strategy.”
We also used this more convenient strategy to solve the abovementioned problem.
Let us assume we obtained components for some value of \(D_{max}\).

1)
Find all components where at least one mergingincompatible SV pair exists.

2)
For every component defined in step 1, do the following:

a.
Find maximal \(D < D_{max}\), where all SVs from the component are distributed within subcomponents that do not contain any mergingincompatible SV pairs.

b.
Replace the component with set of subcomponents identified in step 2a.
We refer to this strategy as “corrected strategy.”
New algorithm for SV clustering
Combinatorial search of all SV pairs and triplets that can be merged into a Mendelianconsistent single SV
Mendelianconsistent SVs can only be created by merging two or three SVs represented by a nontrivial vector \(u\). The addition of any other nontrivial vector \(u\) to the triplet must result in a mergingincompatible SV pair (see section “Formal definitions”). This fact ensures the computational feasibility of the combinatorial search.
From the combinatorial viewpoint, 26 distinct vector pairs \(u\) existed, resulting in Mendelianconsistent SVs after merging. In contrast, there were only seven distinct vector triplets \(u\) with the same properties.
The algorithm searches for all the pairs and triplets (groups) within every trio and tests the following two conditions for every identified group of SVs:

1)
The group of SVs cannot be mergingincompatible if we consider all samples (not only trio members).

2)
\(D_{1\left( 2 \right)} \le D_{max}\) must hold for every pair of SVs within the group.
Only groups meeting both criteria are considered for the next step of the algorithm.
Reduction to disjoint set of SV groups
It is theoretically possible that the algorithm could identify two nondisjoint groups of SVs within a single trio, both satisfying the criteria defined in the previous section. It follows that these two groups are mergingincompatible, so we must decide which group to retain for further analysis. We implemented the following heuristic strategy to deal with intersecting SV groups.
At the first step we represented the identified superset of SV groups as an unoriented graph, where every identified SV group corresponded to a unique vertex. The edge between two vertices exists if the corresponding SV groups share at least one SV (i.e., having a nonzero intersection). We then identified all connected components within the graph and applied the following procedure on every such component, \(c_{i}\):

1)
Select SVs group from \(c_{i}\) which is detected in highest number of trios or randomly if all groups are equal.

2)
Test if the given group is mergingcompatible with already merged groups from \(c_{i}\) and if the merging results in a Mendelian consistent SV in all trios.

3)
Delete the group from the queue. If the test in step 2 succeeds, merge the group with already merged groups from \(c_{i}\) and add the group into the newly formed reduced component, \(c_{ir}\), where \(c_{ir} \subseteq c_{i}\).

4)
Return to step 1.
Constrained clustering
We constructed another type of graph where vertices correspond to individual SVs (and not the SVs groups as in the previous case). The edge exists between two SVs if both belong to any reduced component (\(c_{ir}\)).
Another required ingredient of constrained clustering is a cannotlink matrix corresponding to the logical triangular matrix that contains which SV pairs are mergingincompatible.
Constrained clustering can be described by the following procedure:

1)
Select all pairs of SVs with a minimum value of \(D_{1\left( 2 \right)}\) that are not part of any single component.

2)
For every pair of SVs identified in the previous step do the following:

a.
Merge the two components to which the examined SV pair belongs and test if the newly formed component contains any mergingincompatible SV pairs.

b.
If the test in 2a. fails, accept the new edge connecting the investigated SV pair.

3)
After all pairs with an actual minimum value of \(D_{1\left( 2 \right)}\) are examined, delete them from the queue and return to step 1.
We refer to this strategy as a “constrained clustering strategy.”
Measures used for SV dataset quality evaluation
Mendelian inheritance error
The first measure used for clustering quality evaluation was the fraction of SVs corresponding to Mendelian errors \(\left( {f_{MEI} } \right)\). Mendelian inheritance error (MIE) represents the combination of a child’s and its parents’ three genotypes that are inconsistent with Mendelian inheritance. We computed \(f_{MEI}\) for every trio and then averaged it over families \({\mathbf{A}}\) and \({\mathbf{B}}\). There is general agreement that most Mendelianerroneous genotype configurations are caused by sequencing/algorithm detection errors. Only a tiny fraction of Mendelian errors (where parents’ genotypes correspond to the reference homozygotes and the child genotype corresponds to an alternative heterozygote) can be caused by denovo mutation.
The number of Mendelianconsistent SVs resulting from merging SV clusters exhibiting Mendelian errors
We defined the new measure of SV dataset quality on the assumption that the Mendelian errors exhibit a specific pattern. We assumed that the observed Mendelian inconsistency can emerge as a product of erroneous determination of SV position in one or more of a trio’s members. The single SV will then be represented as two or three different SVs detected in slightly different positions. As a result, one or more alleles may be missing in the genotype configuration of the family trio because it is erroneously detected as a different SV with a different position. We will demonstrate this concept using formalism established above.
Let us assume we have a parent–child trio with an SV corresponding to Mendelianconsistent genotype configuration \(\left( {2,1,1} \right)\). If this SV is detected in one of the parents in a slightly different position, we will observe two neighboring SVs with the following genotype configurations: \(\left( {2,1,0} \right)\) and \(\left( {0,0,1} \right)\). We can see that the first configuration represents Mendelian error and the second is Mendelian consistent. Both configurations can be unambiguously merged into a single Mendelianconsistent SV.
According to this concept, we propose the following quantity to measure SV dataset quality: the number of Mendelianconsistent SVs resulting from merging SV clusters exhibiting at least one MEI.
Kinship prediction
We used the R package Demerelate [12] to predict kinship categories for two families within our dataset. We used two different estimators: the Loiselle coefficient [13] and the proportion of shared alleles \(\left( {S_{xy} } \right)\) [14]. The Loiselle coefficient represents a more complex measure that considers population SV frequencies estimated from nonrelated individuals within our dataset and corrects for a small sample size. The Loiselle coefficient, therefore, reflects the quality of the whole data set. In contrast, \(S_{xy}\) represents a simple measure based only on allele sharing between paired individuals. We presented the results obtained with \(S_{xy}\) only in Additional file 1 because they are similar to the results obtained with the Loiselle estimator.
We used the error related to the ability with which both estimators can separate individual kinship categories as a measure of dataset quality. Let us assume we have \(n\) kinship categories \(i = 1, \ldots ,n\) corresponding to relationship coefficients \(r_{i}\), where \(r_{i} < r_{i + 1}\) for \(i = 1, \ldots ,n  1\). We computed the following quantity for every kinship category:
$$L_{i} = \mathop \sum \limits_{{\begin{array}{*{20}c} {j < k} \\ {j,k \in i} \\ \end{array} }} \delta \left( {j,k} \right)$$
where \(\delta \left( {j,k} \right) = 1\) if \(C_{i} \left( {j,k} \right)\left\langle {\min \left( {C_{i + 1} } \right) \wedge C_{i} \left( {j,k} \right)} \right\rangle \max \left( {C_{i  1} } \right)\), otherwise \(\delta \left( {j,k} \right) = 0\) and where \(C_{i} \left( {j,k} \right)\) denotes the estimated value of relatedness for individuals \(j\) and \(k\), and \({\text{max}}\left( {C_{i} } \right)\) denotes maximum estimated value within kinship category \(i\), analogically for \(\min \left( {C_{i} } \right)\). On the basis of quantity \(L_{i}\), we can define the error rate for every kinship category as:
$$s_{i} = 1  \frac{{L_{i} }}{{\left( {\begin{array}{*{20}c} {\left i \right} \\ 2 \\ \end{array} } \right)}}$$
where \(\left i \right\) denotes the number of pairs of individuals within the kinship category \(i\). For the purpose of comparing SV clustering methods, we used the average value of \(s_{i}\) across all kinship categories presented within the pedigree, \(\left\langle s \right\rangle = \frac{1}{n}\mathop \sum \limits_{i} s_{i}\).
Deviation from Hardy–Weinberg equilibrium
Deviation from the Hardy–Weinberg principle is a widely used measure for quality evaluation of datasets containing population genetic variants. We followed the same methodology as applied in the work of Karczewski et al. [4]. We computed the chisquare pvalue using the Hardy–Weinberg R package and applied the Bonferroni correction. Those SVs with p < 0.05 after the Bonferroni correction were considered to violate the Hardy–Weinberg equilibrium. To compare the SV clustering methods, we used the fraction of SVs that are in Hardy–Weinberg equilibrium.
Randomization of SV distribution within clusters
We generated an ensemble of only ten randomized samples for the corrected clustering strategy for each dissimilarity measure (\(D_{1}\) and \(D_{2}\)). We generated only 20 randomizations in total due to the high computational complexity. The main purpose of the random model within this work was to exclude the null hypothesis that the behavior of the quantities used for dataset quality evaluation is determined by randomness, and, thus, their importance in clustering method comparisons is limited. Our goal during the randomizations was to preserve the size distribution of clusters before the correction for mergingincompatible SVs (as defined for corrected strategy). We performed randomizations by reshuffling the rows of the genotype matrix associated with each SV type. This procedure is equivalent to randomly partitioning SVs of the same type into clusters having the identical size distribution as exhibited by the real data. The randomization was performed before the correction because we needed to ensure that mergingincompatible SV pairs would not be presented within the final clusters.