 Research
 Open Access
 Published:
Convex hulls in hamming space enable efficient search for similarity and clustering of genomic sequences
BMC Bioinformatics volume 21, Article number: 482 (2020)
Abstract
Background
In molecular epidemiology, comparison of intrahost viral variants among infected persons is frequently used for tracing transmissions in human population and detecting viral infection outbreaks. Application of UltraDeep Sequencing (UDS) immensely increases the sensitivity of transmission detection but brings considerable computational challenges when comparing all pairs of sequences. We developed a new population comparison method based on convex hulls in hamming space. We applied this method to a large set of UDS samples obtained from unrelated cases infected with hepatitis C virus (HCV) and compared its performance with three previously published methods.
Results
The convex hull in hamming space is a data structure that provides information on: (1) average hamming distance within the set, (2) average hamming distance between two sets; (3) closeness centrality of each sequence; and (4) lower and upper bound of all the pairwise distances among the members of two sets. This filtering strategy rapidly and correctly removes 96.2% of all pairwise HCV sample comparisons, outperforming all previous methods. The convex hull distance (CHD) algorithm showed variable performance depending on sequence heterogeneity of the studied populations in real and simulated datasets, suggesting the possibility of using clustering methods to improve the performance. To address this issue, we developed a new clustering algorithm, khulls, that reduces heterogeneity of the convex hull. This efficient algorithm is an extension of the kmeans algorithm and can be used with any type of categorical data. It is 6.8times more accurate than kmode, a previously developed clustering algorithm for categorical data.
Conclusions
CHD is a fast and efficient filtering strategy for massively reducing the computational burden of pairwise comparison among large samples of sequences, and thus, aiding the calculation of transmission links among infected individuals using thresholdbased methods. In addition, the convex hull efficiently obtains important summary metrics for intrahost viral populations.
Background
Comparison of large samples of diverse genetic variants and obtaining summary statistics are common tasks in many fields, from computational searches of databases to cancer research and molecular epidemiology. For instance, comparison between populations of mitochondrial DNA variants from normal tissues and tumours brings important insights [1].
Phylogenetic analysis of viral sequences is frequently used in investigation of outbreaks and transmission chains [2,3,4,5,6], usually using a single sequence per infected individual. However, many viruses such as hepatitis C virus (HCV) exist as a population of numerous genetic variants in each infected individual. It was observed that minority variants in the source are often the ones responsible for transmission, indicating that the use of a single sequence per individual lacks sensitivity for detecting such transmissions [7]. Molecular analysis of intrahost viral populations sampled from infected persons was shown to be very efficient in detection of HCV transmissions in outbreak investigations [8,9,10,11]. Statistical analysis of intrahost HCV variants obtained from epidemiologically characterized outbreaks allowed for the development of a simple and accurate thresholdbased approach for detecting HCV transmissions [7].
Sampling of intrahost HCV variants, including minority variants, by UltraDeep Sequencing (UDS) improves sensitivity of transmission detection [7]. However, comparing all sequences from each pair of samples creates a considerable computational challenge. For instance, a relatively small dataset of 401 HCV samples required to perform 80,200 pairwise sample comparisons, for a total of 4.56 × 10^{10} pairwise sequence comparisons [12]. As the number of tested cases increases over time, detection of transmission networks becomes computationally impractical. However, we have observed that less than 1% of all samplepairs are usually linked by transmission, even in highrisk populations. Therefore, an exhaustive search over all pairs of sequences is very inefficient because the great majority of sample pairs are above certain relatedness threshold, which corresponds to 3.77% for HCV [7]. Briefly, it would be very advantageous to remove most of these pairs in order to reduce the number of computations needed to establish transmission on a set of samples.
Similar problems are encountered under different names in various areas of computer science [13,14,15,16,17,18,19]. There are many methods for rapid string comparison. Detection of viral transmissions is affected by three factors: (1) owing to a continuous addition of new data samples during outbreak investigations, there is a need of comparing two or more populations of homologous sequences, which is different from comparison of a sequence to a static data structure or a static database; (2) a need of exact distance calculations makes any available fast heuristic and approximate methods unsuitable for the purpose; and (3) use of the relatedness threshold allows for application of filtering strategies. These factors have been addressed by three previously published filtering algorithms: the kmer bloom filter based on a lower bound of the maximal common substring [20]; the Hamming radius of each population and the triangle inequality [20]; and the signature method based on comparing kmers and kchunks of every pair of sequences [21].
The current work was motivated by intuition that comparison of distances among every pair of points in a set can be replaced by calculation of distances among points located at the surface of the set. For the Euclidean plane (Fig. 1a), the convex hull of a set of points X is the smallest convex set that contains X, which can be visualized as the shape enclosed by a rubber band stretched around X [22]. However, every point in the hamming space is “at the surface” [23], which would apparently make this intuition useless for our purposes. Notwithstanding, each convex hull has a schema that describes it succinctly [24]. For instance, for the set of strings Pn = (010000, 011010, 111000, 010010, 011110), the convex hull is *1***0, where * means that the position can be 0 or 1. Thus, the convex hull in the hamming space is a set comprising all binary strings matching the schema (16 sequences in this example).
The Multiple Sequence Alignment (MSA) of nucleotide sequences from viral populations can be used to calculate a vector of frequencies of nucleotide states (A, C, G, T or gap) at each position in the MSA. This is analogous to the convex hull schema described above, with the added information of frequency for each state (Fig. 1b). These vectors are also known as Position weight matrices (PWMs), positionspecific scoring matrices or weighted patterns [25]. They have been used in models of transcription factor binding sites and in online matching algorithms [26].
Here, we present a fast and efficient filtering strategy that removes most pairwise comparisons, aiding the calculation of transmission links by thresholdbased methods. In addition, we show that application of convex hull significantly reduces the computational cost of many important summary statistics that are routinely calculated for each population.
Results
Useful metrics derived from the convexhull
The convex hull of each file provides the following information:

1
Average Hamming distance The average hamming distance (AD_{p}) among all sequences within the population p is a very common summary statistic of the genetic heterogeneity of a population. However, it is computationally challenging as it requires \(\left( {n^{2}  n} \right)/2\) pairwise comparisons and then calculates the average. Using the convex hull, we can obtain AD_{p} in the following manner:
$$AD{\text{p}} = \frac{{\mathop \sum \nolimits_{a = 1}^{l} \mathop \sum \nolimits_{i \ne j}^{{}} f_{ai} f_{ai} }}{{\left( {n^{2}  n} \right)/2}}$$where n is the number of sequences, l is the length, number of positions, f_{ai} is the frequency of nucleotide i at position a and f_{aj} is the frequency of nucleotide j at position a.

2
CC_{h} of each sequence The average distance from each sequence h to all others within the population, its closeness centrality CC_{h}, is an important measure of centrality that is also computationally challenging, as we need to make \(\left( {n^{2}  n} \right)/2\) pairwise comparisons and then calculate the average of each sequence. Using the convex hull, we can easily obtain CC_{h} in the following manner:
$$CC_{h} = \mathop \prod \limits_{a = 1}^{l} f_{a}$$where l is the length (number of positions in the MSA) and f_{a} is the population frequency of the nucleotide present at sequence h, position a.

3
Average distance between two populations The average hamming distance (AD_{pq}) among all sequences of two populations, p and q, is a very common statistic in population genetics that is the basis for several measures of genetic relatedness. However, it is computationally challenging as it requires \(p_{n} *q_{n}\) pairwise comparisons before calculating the average. Using the convex hull, n AD_{p} can be obtained using:
$$AD{\text{pq}} = \frac{{\mathop \sum \nolimits_{a = 1}^{l} \mathop \sum \nolimits_{i \ne j}^{{}} f_{pai} f_{qai} }}{{n^{2} }}$$where f_{pai} is the frequency of nucleotide i at position a in the set p and f_{qaj} is the frequency of nucleotide j at position a in the set q.

4
Lower and upper bound of all distances between two sets Figure 1c shows the convex hulls of two example populations. In this example, there are 2 positions with empty intersection, therefore, the distance between any sequence in p and any sequence in q must be 2 or more. This is a lower bound of all the distances between members of p and q. This metric helps to safely discard any population pairwise comparison if its value is higher than the threshold T. Thus, the convex hull distance between two populations p and q, CHD_{pq}, can be defined by:
$$CHD_{pq} = \mathop \sum \limits^{l} a$$where a is equal to 0 if any nucleotide in position a of set p is also present in set q, or equal to 1 otherwise. If CHD_{pq} > T, this pair of samples can be safely discarded.
In this example there are 3 positions with empty or partial intersection, thus the distance between any sequence in p and any sequence in q must be 3 or less. This is the upper bound of all the distances between members of p and q.
Filtering performance
To illustrate a greater efficiency of the filtering approach over the full calculation of all sequence pairs, we generated 20 files with a variable number of sequences and applied both algorithms. The time taken by the CHD filter grows very slowly when the number of sequences per file increases, because the number of pairwise sample comparisons remains constant, whereas the fulldistance method increases in a quadratic manner (Fig. 2a).
For the viral dataset, the CHD filtering algorithm quickly removed 96.2% of all samplepairs, outperforming the other three algorithms used here for comparison (Fig. 2b). In the viral dataset, the average divergence between CHD and actual minimal distance was only 7.84% (Fig. 2c), whereas divergence between the upper bound and maximal distance was 2.3times greater (18.32%).
We performed simulations to test performance of the CHD distance with 100 varying levels of genetic heterogeneity (Fig. 2d). As expected, the greater the diversity, the greater the divergence between true minimal distance and CHD.
kHulls performance
Splitting each population into clusters creates sets of lower diversity than the original, which could provide a boost in CHD performance. Here, we propose a new way to split populations into clusters: a modified version of the kmode algorithm, where the mode (consensus) of the cluster is replaced with the convex hull of the cluster, thus using more information. We performed simulations to test performance of the khulls algorithm with varying levels of cluster separation (Fig. 3a). Each simulated dataset contained either 2, 3 or 4 true clusters. The khulls method was found to be consistently correct in classification of sequences into clusters when the community separation was > 0.25. A community separation of 0.25 means that only 25% of the total variation is found between clusters and 75% of the variation is random.
At community separations > 0.5, both the kmode and khulls approaches were equal in performance. However, for lower values of community separation, khulls is 6.8times more accurate. We observed that a higher sensitivity was because khulls uses more data (all sequences of the cluster instead of just its consensus) and thus avoids many ties in the cluster assignments.
Although algorithms of the kmeans family can be performed several times from random starting points, we proposed a heuristic to find the initial seeds that makes use of the closeness centrality calculated using the convex hull. In order to evaluate its performance, we used all possible sets of seeds and ranked them by their average distance. It was found that, in average, our approach identifies a set of seeds outperforming 99.9946% of all other sets.
The main problem of clustering is the difficulty of choosing the optimal number of clusters, as the clusters’ quality always improves (by any intracluster or intercluster measure) the number of clusters is increased. We used the gap statistic [27] to choose the best clustering solution, in terms of how different it is from a random solution of the same size. Figure 3b shows how the average gap score for 100 random simulations (with a community separation of 0.4) changes with different cluster solutions. For each number of true clusters (n = 2, 3 or 4) the correct solution has the highest gap score. Over all levels of community separation, we found that the method tends to choose the correct number of clusters when the community separation is > 0.35. Regarding the membership within a cluster, we found that geometric mean showed better performance than arithmetic mean due to a reduction in the number of ties.
kHulls on HCV dataset
We applied the khulls algorithm to HCV and simulated datasets to evaluate the CHD filtering performance. On the HCV data, clustering results in correct filtering of 99.92% of sample pairs, which is an improvement over the scenario without clustering (96.2%). On the simulated data with no true clusters, a forced clustering into two groups moderately improves the CHD performance while adding very little computational work (Fig. 1d). With low mutation rates (e.g. < 0.05), clustering improves the CHD performance by 2.4fold. However, it must be considered that the performance of any filtering strategy would be benefited by splitting the population into clusters, but we only measured its effect on the CHD method.
Discussion
The convex hull in hamming space is a simple data structure that is efficient in generating information on: (1) average hamming distance within the set, (2) average hamming distance between two sets; (3) closeness centrality of each sequence; and (4) lower and upper bound of all the pairwise distances among the members of two sets. These metrics are routinely calculated by measuring all possible sequence pairs. The approach described here significantly reduces the number of pairs required to be assessed to accurately calculate these metrics. Although we have not provided formal proofs for these shortcuts, all these formulas were tested for correctness by comparing their results with the ones obtained by using the full calculation, with identical results. These comparisons were performed on all datasets, both with the real HCV datasets and simulated datasets produced with a variety of diversity levels.
The CHD filtering algorithm showed the best performance among the studied methods. CHD is also the simplest, which results in a fast processing time (< 1 s on a desktop computer to complete analysis of any dataset used here). In addition, the convex hull of each sample generates a very small file that needs to be calculated only once. Storage of this file requires minimal space, making it readily available for repeat usage when new files are deposited to the database.
The CHD filtering strategy proposed here is applicable in many settings. It can be used for the detection of viral transmissions by a thresholdbased method [7] or can be applied to any sets of data with common categorical variables for their rapid comparison.
As expected, a greater diversity results in a greater divergence between the true minimal distance and CHD. Thus, the CHD filtering performance on other viral datasets is expected to be negatively affected by high diversity, for instance, when a patient is infected with > 1 strain. However, this situation can be alleviated by application of clustering algorithms.
Usually, clustering of genomic sequences or other categorical data is performed on distance matrices for all pairs of sequences or variables using methods such as:

(1)
a similarity tree (e.g. UPGMA) with selection of a level of clustering that satisfies certain constraints (e.g. bootstrap support or a threshold level of difference).

(2)
a community detection algorithm applied to a network based on the distance matrix.

(3)
correspondence analysis (like principal component analysis but applied to categorical data using the distance matrix) with the following application of a secondary clustering algorithm.
A different method type, which can be applied to categorical data, is the kmode algorithm [28], which is a version of the kmeans algorithm applicable to clustering large data sets with categorical values. The algorithm works by calculating a cluster mode of each position. Thus, each sequence is compared to the consensus of a cluster, as opposed to the centroid in continuous data. Here, we improved the kmode algorithm by replacing the consensus with the convex hull of a cluster, thus retaining more information on clusters and achieving a higher accuracy (6.8X) when the community separation is weak. We also showed that the gap statistic provides an efficient way for selection of the best partition solution for the khulls algorithm.
Before advent of UDS, the detection of viral transmissions was based on phylogenetic analyses of a single viral sequence per patient. Increased UDS sampling of intrahost viral variants from each infected person improved the sensitivity of transmission detection [7] but amplified the computational burden by immensely increasing the number of sequences needed to be compared. Novel molecular surveillance technologies like the Global Hepatitis Outbreak and Surveillance Technology [29] rapidly generate and accrue massive numbers of intrahost viral variants from many infected individuals, presenting significant computational challenges to timely and accurately process molecular data. It is estimated that 2.0–2.8 million people have chronic HCV infection in the United States [30]. Efficient molecular surveillance on such a large population would necessitate fundamental improvements in the capacity to manage and analyze rapidly growing molecular data to assist in devising public health interventions to control and eliminate viral diseases. Computational approaches reducing the computational burden caused by massive data, such as our approach presented here, are important for enhancing surveillance efforts.
Conclusions
We present a fast and efficient filtering strategy that drastically reduces the amount of computation required for comparison of large samples of genomic sequences. The proposed filtering algorithm has many applications in different fields dealing with comparison of massive datasets of categorical variables such as analysis of sequences of intrahost viral variants for the detection of transmission and transmission networks. In addition, this approach is efficient in calculation of important summary metrics for viral populations.
Methods
Problem definition
Given P = (P_{1}, P_{2},…), a set of samples where each P_{i} is associated with a set S_{i} = (S_{i}^{1}, S_{i}^{2}, …) of homologous sequences, enumerate all sample pairs (P_{i}, P_{j}) where any pairwise sequence comparisons (S_{i}^{x}, S_{j}^{y}) has a hamming distance lower than the relatedness threshold, T (see Fig. 1). Given that every sequencepair needs to be considered, it yields an O(n^{2}) algorithm, where n is the number of sequences.
Viral dataset
The viral dataset is composed of previously published HCV sequences obtained from 401 HCV infected individuals [7, 12, 20, 21, 31, 32]. The sequences, obtained by UDS, encode the HCV E1/E2 junction (an amplicon of 306 bp, being 264 bp after removing primer sequences), which contains the HCV hypervariable region 1. The average number of unique sequences per sample was 534.3. For each samplepair, all its sequences are used to create a multiplesequence alignment (MSA), which is then used to calculate the hamming distance between every pair of sequences. The two samples are considered related if the minimum of the hamming distance between their sequences is smaller than T, which in our case corresponds to 3.77% [7]. All samplepairs in this viral dataset are above this threshold and thus are unrelated to each other, covering the spectrum of HCV diversity within and between subtypes (1a, 1b, 2a, 3a and 4a).
Performance comparison
We measure the performance of the method by testing what percentage of all population pairwise comparisons are safely discarded because they cannot have any pair of sequences with a hamming distance below the threshold. The rationale of the approach is that the great majority of sample pairs have sequences that are highly different (unrelated and it would be advantageous to remove these pairs in order to reduce the amount of computation needed to establish transmission on a set of samples. Every samplepair is still considered, yielding an O(p^{2}) algorithm, where p is the number of samples, but a single comparison is made, instead of all sequence pairs. We compare our performance with three previously published algorithms:

(1)
Maximal common substring [20], based on the lower bound of the maximal common substring, which is applied to the kmer bloom filter.

(2)
The Hamming radius filter [20], based on the Hamming radius of each population and the triangle inequality.

(3)
The signature method [21], based on comparing kmers and kchunks of every pair of sequences.
kHulls clustering algorithm
The use of a clustering method is expected to improve the performance of the convex hull distance. Here, we modify the kmode algorithm [28] by replacing the consensus cluster with the convex hull of the cluster, thus gaining more information about the cluster. The algorithm is as follows:

1
Choose a desired number of clusters, k.

2
Find k seeds:

a
Calculate the convex hull of all data

b
1st seed Calculate the closeness centrality of each sequence with the formula described for closeness centrality (CC_{h}). The most distant sequence is the 1st seed.

c
2nd seed Calculate the distance from the 1st seed to all others and identify the most distant.

d
For each additional seed: Calculate the distance from the previous seeds to all other sequences and find the sequence with the highest geometric mean.

a

3
Preparations:

a
Transform each sequence into the convex hull format.

b
Each cluster begins with just one seed.

a

4
While the cluster membership changes:

a
Calculate the Euclidean distance between each sequence’s convex hull and each cluster’s convex hull.

b
Assign the sequence to its closest cluster and obtain a list of cluster memberships.

c
Recalculate the convex hull of each cluster.

a
In order to find the optimal number of clusters, we used the gap score [27] in the following manner:

1
Calculate the goodness of the clustering solution. For each position on each cluster, calculate the Shannon entropy, then average over all positions.

2
Calculate an average of “within cluster” goodness (rather than “between cluster”).

3
Compare this value with the one obtained by 1000 random partitions of the same size by means of the gap score.

4
Choose the solution with the best score.
Simulated datasets
To test the performance of the CHD and the khulls algorithm, we created several simulated datasets with varying degrees of sequence heterogeneity and variable number of true clusters. For measuring the effect of genetic heterogeneity on the CHD distance, 5000 simulated pairs of files were created, with 50 replicas for each level of mutation rate (100 levels) and with each pair comparing two populations with the same level of simulated diversity.
For the clustering algorithm, 39,900 simulated datasets were created, with 100 replicas for each parameter combination. Each file consisted of 48 sequences, each sequence 50 nucleotides long. The sequences could be separated into 2, 3 or 4 true clusters, with 20 levels of community separation, ranging from 0.05 to 0.95. For instance, a community separation of 0.05 means that only 5% of the total variation is found between clusters and 95% of the variation is random.
Availability of data and materials
Data and software are available upon request.
Abbreviations
 HCV:

Hepatitis C virus
 UDS:

UltraDeep Sequencing
 CHD:

Convex hull distance
 CC:

Closeness centrality
 PWM:

Position weight matrices
 MSA:

Multiple Sequence Alignment
 AD:

Average distance
References
 1.
Campo DS, Nayak V, Srinivasamoorthy G, Khudyakov Y. Entropy of mitochondrial DNA circulating in blood is associated with hepatocellular carcinoma. BMC Med Genom. 2019;12(Suppl 4):74.
 2.
Spada E, Abbate I, Sicurezza E, Mariano A, Parla V, Rinnone S, Cuccia M, Capobianchi MR, Mele A. Molecular epidemiology of a hepatitis C virus outbreak in a hemodialysis unit in Italy. J Med Virol. 2008;80(2):261–7.
 3.
Bracho MA, Gosalbes MJ, Blasco D, Moya A, GonzalezCandelas F. Molecular epidemiology of a hepatitis C virus outbreak in a hemodialysis unit. J Clin Microbiol. 2005;43(6):2750–5.
 4.
GonzalezCandelas F, Bracho MA, Wrobel B, Moya A. Molecular evolution in court: analysis of a large hepatitis C virus outbreak from an evolving source. BMC Biol. 2013;11:76.
 5.
Prosperi MC, De Luca A, Di Giambenedetto S, Bracciale L, Fabbiani M, Cauda R, Salemi M. The threshold bootstrap clustering: a new approach to find families or transmission clusters within molecular quasispecies. PLoS ONE. 2010;5(10):e13619.
 6.
Feray C, Bouscaillou J, Falissard B, Mohamed MK, Arafa N, Bakr I, ElHoseiny M, Daly ME, ElKafrawy S, Plancoulaine S, et al. A novel method to identify routes of hepatitis C virus transmission. PLoS ONE. 2014;9(1):e86098.
 7.
Campo D, Xia G, Dimitrova Z, Lin Y, GanovaRaeva L, Punkova L, Ramachandran S, Thai H, Sims S, Rytsareva I, et al. Accurate genetic detection of hepatitis C virus transmissions in outbreak settings. J Infect Dis. 2015;213(6):957–65.
 8.
Nainan O, Alter M, KruszonMoran D, Gao F, Xia G, McQuillan G, Margolis H. Hepatitis C virus genotypes and viral concentrations in participants of a general population survey in the United States. Gastroenterology. 2006;131(2):478–84.
 9.
Thompson N, Novak R, WhiteComstock M, Xia G, GanovaRaeva L, Ramachandran S, Khudyakov Y, Bialek S, Williams I. Patienttopatient hepatitis C virus transmissions associated with infection control breaches in a hemodialysis unit. J Nephrol Therapeutics. 2012;S10:002.
 10.
GanovaRaeva LM, Dimitrova ZE, Campo DS, Lin Y, Ramachandran S, Xia GL, Honisch C, Cantor CR, Khudyakov YE. Detection of hepatitis C virus transmission by use of DNA mass spectrometry. J Infect Dis. 2013;207(6):999–1006.
 11.
Ramachandran S, Purdy MA, Xia GL, Campo DS, Dimitrova ZE, Teshale EH, Teo CG, Khudyakov YE. Recent population expansions of hepatitis B virus in the United States. J Virol. 2014;88(24):13971–80.
 12.
Rytsareva I, Campo DS, Zheng Y, Sims S, Thankachan SV, Tetik C, Chirag J, Chockalingam SP, Sue A, Aluru S, et al. Efficient detection of viral transmissions with nextgeneration sequencing data. BMC Genom. 2017;18(Suppl 4):372.
 13.
Qin J, Wang W, Lu Y, Xiao C, Lin X. Efficient exact edit similarity query processing with the asymmetric signature scheme. In: ACM SIGMOD international conference on management of Data, SIGMOD’11: 2011; New York, p. 1033–1044 (2011).
 14.
Gionis A, Indyk P, Motwani R. Similarity search in high dimensions via hashing. In: 25th international conference on very large data bases, VLDB ’99 1999. Morgan Kaufmann Publishers Inc., San Francisco, CA, p. 518–529 (1999)
 15.
Li C, Wang B, Yang X. Vgram: improving performance of approximate queries on string collections using variablelength grams. In: 33rd international conference on very large data bases, VLDB ’07 2007; Vienna, p. 303–314 (2007).
 16.
Medvedev P, Scott E, Kakaradov B, Pevzner P. Error correction of highthroughput sequencing datasets with nonuniform coverage. Bioinformatics. 2011;27(13):137–41.
 17.
Nikolenko S, Korobeynikov A, Alekseyev M. Bayeshammer: Bayesian clustering for error correction in singlecell sequencing. BMC Genom. 2013;14(1):7.
 18.
Broder A, Glassman S, Manasse M, Zweig G. Syntactic clustering of the web. Comput Netw ISDN Syst. 1997;29(8–13):1157–66.
 19.
Altschul S, Gish W, Miller W, Myers E, Lipman D. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
 20.
Rytsareva I, Campo D, Zheng Y, Sims S, Tetik C, Chirag J, Chockalingam S, Thankachan S, Sue A, Aluru S et al. Efficient detection of viral transmission with thresholdbased methods. In: 5th IEEE international conference on computational advances in bio and medical sciences (ICCABS): October 15–17 2015; Miami, FL, USA (2015)
 21.
Tsyvina V, Campo DS, Sims S, Zelikovsky A, Khudyakov Y, Skums P. Fast estimation of genetic relatedness between members of heterogeneous populations of closely related genomic variants. BMC Bioinform. 2018;19(Suppl 11):360.
 22.
de Berg M, van Kreveld M, Overmars M, Schwarzkopf O. Computational geometry: algorithms and applications. Berlin: Springer; 2008.
 23.
Eigen M. From strange simplicity to complex familiarity: a treatise on matter, information, life and thought. Oxford: Oxford University Press; 2013.
 24.
Moraglio A. Towards a geometric unification of evolutionary algorithms. Colchester: University of Essex; 2007.
 25.
Stormo GD, Schneider TD, Gold L, Ehrenfeucht A. Use of the ‘Perceptron’ algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Res. 1982;10(9):2997–3011.
 26.
Korhonen J, Martinmaki P, Pizzi C, Rastas P, Ukkonen E. MOODS: fast search for position weight matrix matches in DNA sequences. Bioinformatics. 2009;25(23):3181–2.
 27.
Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc B. 2001;63:411–23.
 28.
Huang Z. Extensions to the kmeans algorithm for clustering large data sets with categorical values. Data Min Knowl Discov. 1998;2(3):283–304.
 29.
Longmire A, Sims S, Rytsareva I, Campo D, Skums P, Dimitrova Z, Ramachandran S, Medrzycki M, Thai H, GanovaRaeva L, et al. GHOST: global hepatitis outbreak and surveillance technology. BMC Genom. 2017;18(Suppl 10):916.
 30.
Hofmeister MG, Rosenthal EM, Barker LK, Rosenberg ES, Barranco MA, Hall EW, Edlin BR, Mermin J, Ward JW, Ryerson AB. Estimating prevalence of hepatitis C virus infection in the United States, 2013–2016. Hepatology. 2019;69(3):1020–31.
 31.
Williams I. Epidemiology of hepatitis C in the United States. Am J Med. 1999;107(6B):2S9S.
 32.
Campo DS, Dimitrova Z, Yamasaki L, Skums P, Lau DT, Vaughan G, Forbi JC, Teo CG, Khudyakov Y. Nextgeneration sequencing reveals large connected networks of intrahost HCV variants. BMC Genom. 2014;15(Suppl 5):S4.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 21 Supplement 18, 2020: Proceedings from the 8th Workshop on Computational Advances in Molecular Epidemiology (CAME 2019). The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume21supplement18.
Funding
All work and publication costs were funded by the Centers for Disease Control and Prevention, which had no role in: the design of the study and collection, analysis or interpretation of data or in writing the manuscript.
Author information
Affiliations
Contributions
DSC designed the study. DSC designed and implemented the convex hull filter, DSC and YK analysed data and wrote manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Research was conducted as approved by Institutional Review Board of Centers for Disease Control and Prevention. The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.
Consent for publication
Consent for publication was obtained with the internal CDC clearance system.
Competing interests
The authors declare they don’t have any competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Campo, D.S., Khudyakov, Y. Convex hulls in hamming space enable efficient search for similarity and clustering of genomic sequences. BMC Bioinformatics 21, 482 (2020). https://doi.org/10.1186/s1285902003811z
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285902003811z
Keyword
 Population distance
 Hamming
 Clustering
 Centrality