LEMON: a method to construct the local strains at horizontal gene transfer sites in gut metagenomics

Background Horizontal Gene Transfer (HGT) refers to the transfer of genetic materials between organisms through mechanisms other than parent-offspring inheritance. HGTs may affect human health through a large number of microorganisms, especially the gut microbiomes which the human body harbors. The transferred segments may lead to complicated local genome structural variations. Details of the local genome structure can elucidate the effects of the HGTs. Results In this work, we propose a graph-based method to reconstruct the local strains from the gut metagenomics data at the HGT sites. The method is implemented in a package named LEMON. The simulated results indicate that the method can identify transferred segments accurately on reference sequences of the microbiome. Simulation results illustrate that LEMON could recover local strains with complicated structure variation. Furthermore, the gene fusion points detected in real data near HGT breakpoints validate the accuracy of LEMON. Some strains reconstructed by LEMON have a replication time profile with lower standard error, which demonstrates HGT events recovered by LEMON is reliable. Conclusions Through LEMON we could reconstruct the sequence structure of bacteria, which harbors HGT events. This helps us to study gene flow among different microbial species.

biofluids and tissues, such as skin, lung, mouth. They are often associated with a range of human diseases and health conditions, from diabetes, colorectal cancer, to autism. The Human Microbiome Project [8] was launched in 2008 to study and understand the human microbiota. Some functions of the human microbiome, including antibiotic resistance and adaption to nutrients [9], are susceptible to HGT events. Mediated by phage, HGT in S.aureus occurs 1000 times more often than was thought, which greatly accelerates S.aureus to evolve resistance to antibiotics [10].
It is necessary to understand HGTs better. However, our current research is mainly focused on inferring ancient (lineage) HGTs from genomic sequences [11]. While the inference result is seriously affected by complex external factors [12]. For example, during the process of evolution, the transferred genome segments had been subjected to loss, mutation and duplication [13]. The inserted genes may also change the expression and functions of the gene around the insertion sites, resulting in very complicated structural variations [14], and temper with the receptor genome's stabilities [15][16][17]. These possibilities complicate our detection of the HGTs. Better results can be achieved if we can anticipate these changes and correct for their effects. Recent efforts in human microbiomes provide us with such an opportunity.
LEMON(https://github.com/lichen2018/LEMON) takes use of existing shotgun NGS datasets to detect HGT breakpoints, identify the transferred segments, and reconstruct the local strain, which has complicated structure variation.

Methods
HGT events result in the integration of DNA segments from one species to another species, which will generate local strains containing DNA segments from different species. Figure 1. illustrates the workflow of LEMON. Fig. 1 The workflow of reconstructing HGT strains from pair-ends shotgun reads. Firstly, we map paired-end shotgun reads against reference genomes using Burrows Wheeler Aligner (BWA). Then, we select junction reads, whose two sides are mapped to two different references, from the set of mapped reads. Thirdly, by treating junction reads as points on a two-dimensional plane, we apply Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to find candidate HGT breakpoints. Fourthly, we utilize split reads to get the exact positions of HGT breakpoints. Fifthly, according to the detected breakpoints, we could split references into segments, which are linked by junction reads. The coverage of each segment is calculated according to the number of mapped reads on it. Sixthly, we utilize the linked segments to construct a connected graph. By inserting dummy edges, we make the graph fully connected. Seventhly, we balance the coverage of each segment. Finally, we traverse the graph to find local strains. Each local strain should start from the first segment of one receptor and end with the last segment of the same receptor

References
Only the assembly results from multiple time-point metagenomics data of one individual can be used to discover the HGT events that exactly occurred between these time points. However, these samples are insufficient in published data, and this method cannot evaluate the difference in HGT events between different samples. To solve this problem, we construct a reference set S. We collected all of the bacterial genomes from the National Center for Biotechnology Information (NCBI) [18]. We selected one genome for each taxonomy as a representative genome to reduce the interference from homologous regions based on (1) the genome was annotated as reference or representative by NCBI; (2) or the one has minimal scaffolds number and highest completeness with contamination <10% in The Genome Taxonomy Database (GTDB) taxonomy evaluation results for 109,419 bacterial genomes [19]. The reference set contains 16,093 species with 1,246,881 scaffolds. Given a shotgun genomic read dataset R, we utilize BWA [20] to align reads against the references to identify the set of donors and receptors involved. These references with adequately covered segments are then retained. We denote the set of donors and receptors as D and H respectively.

Breakpoints and segments
The donor segments and receptor segments interweave in a local strain, separated by HGT breakpoints. We need to identify the breakpoints and segments involved in the local strains from D and H. The data are heterogeneous. According to studies on virus integration [21] and studies on gut metagenomics, breakpoints are surrounded by mutations such as Single Nucleotide Variation (SNV), Copy Number Variation (CNV), short indels and inversions [22]. Hence we detect the breakpoints as follows.
First, we map paired-end shotgun reads to reference genomes using BWA, here all references are indexed together to generate Burrows-Wheeler Transform (BWT) indexes. If the two sides of a paired-end read are mapped to two different references, we call such a pair a junction pair, such as junction pairs A and B in  The two sides of one junction pair are mapped to two different references. B x and B y are mapped positions of junction pair B on reference X and Y respectively. c Furthermore, all junction pairs mapped to the same two references are transformed to points in a coordinate system with their mapped positions as coordinates, e.g. junction pairs A and B are transformed to points A (A x , A y ) and B (B x , B y ). d We apply DBSCAN using Euclidean distance to cluster the breakpoint pair candidates, such as S 1 and S 2 . The maximum diameter of the cluster circle is the insert size of pair-end reads the same two references are transformed to points in a coordinate system with their mapped positions as coordinates. We then apply the clustering algorithm DBSCAN [23] using Euclidean distance to cluster the breakpoint pair candidates. A cluster that is supported by at least one junction pair is further subjected to analysis to determine the exact positions of its breakpoints. Next, we identify the split reads which support a cluster. A read is split if it can be partitioned into two parts, with each part mapped to a different reference; we say it supports a cluster if the mapped positions belong to the cluster. Each cluster contains multiple breakpoint pair candidates.
To find the exact positions, we use a scoring scheme to rank the candidate positions. The candidate with the highest score is reported as the final breakpoint pair position. The scoring scheme evaluates the split reads that support the cluster. Suppose that there are two references involved in the cluster, R 1 and R 2 . Given a candidate genomic positions pair p 1 and p 2 which belongs to R 1 and R 2 , respectively, we identify the split reads aligned to R 1 where the alignment terminated at position p 1 . Denote the portions of such a split read s mapped to R 1 and R 2 as e 1 (s) and e 2 (s), respectively. Then, the score is defined according to the alignment qualities of e 2 (s) against R 2 nearby p 2 . The alignment quality q s of e 2 (s) is calculated as m/l, where m is the number of matched positions and should be at least 15 bps, l is the length of e 2 (s). The quality score of p 2 is calculated as 1 − s log(1 − q s ) of the alignment qualities of the split read which supports the respective cluster. Similarly we calculate the score for p 1 . The candidate with the highest score is reported as the final breakpoint pair positions of the cluster.
Each breakpoint involves two segments. Denote the segment pair as u x u , v x v , and x u , x v ∈ {+, −} where + and − respectively indicates the positive and negative strands, and u, v ∈ V . We call such a pair a connected pair.
Denote the set of connected pairs as E, the copy number of e ∈ E as c(e) w.r.t. R. It is easy to see that the segments and connected pairs specify a graph G as illustrated in Fig. 1.

Local strains
Assume that there are k HGT events captured in G. Since each HGT event results in one local strain, there are k local strains to be constructed according to G. The first and last segments of each local strain are from the same receptor. Without loss of generality let all the first segments and the last segments of the local strains at the integration sites be denoted B = {b 1 , ..., b k }, and T = {t 1 , ..., t k }, respectively. Denote all the other segments involved in the HGTs as V = {v 1 , ..., v n } (excluding B and T). Next, we estimate the number of copies (copy number) of each segment v within R, and denote it as c(v), where v ∈ B ∪ T ∪ V . Factors such as species coverage are incorporated into the copy number estimation. For example, if segment v i is belong to a receptor r, the coverage of v i is cov(v i ) and the average coverage of r is cov(r), then the initial copy number of v i is estimated as Our task is to identify the k local strains captured in G. B and T can be identified with the input data. We assume that the copy numbers of the first segment and the last segment are the same for each strain, that is, c(b i ) = c(t i ), without loss of generality.

Connectivity
We formulate the problem as a Eulerian circuit problem to find the k local strains.
First, we insert dummy edges to transform the solution into a circuit. Without loss of generality, we assume Second, we insert edges and nodes to ensure connectivity. In the ideal case, for each vertex v ∈ V , there should be a path that starts from a source node in S, passes through v, and ends up at some target in T. However, due to sequencing errors, edges or nodes can be missing or spuriously introduced. If no target and source can reach a node v, we remove v and its adjacent edges from G. If a path exists from some vertex s to v, but no path exists from v to a target t, we insert vertices and edges to form a path from v to t. The inserted edge candidates are taken from the reference sets D and S. If v belongs to the same reference as t, this would suffice to reconstruct the path. Otherwise, we add edges that connect v and some nodes on the reference of t. In both situations, we use the minimum number of edges required. All the introduced edges and vertex are assigned a copy number of 1.

Balancing the graph
Denote the set of inbound edges of u as in(u) and the outbound edges of u as out(u). That is, The in-copy and and out-copy of a vertex v are defined as c in (u) = e∈in(u) c(e) and c out (u) = e∈out(u) c(e). In the ideal case, we should have c in (u) = c(u) = c out (u), but this may be broken due to experimental and sequencing errors.
We propose an integer linear programming (ILP) approach to optimize the degree balance property. First, assign each segment u (respectively μ) to a target copy number t(u) (respectively t(μ)) according to Eq. 1c (respectively 1d), to satisfy the degree balance property (Eq. 1b). Then, the following program minimizes the disagreement between the assignment copy number and the target copy number (Eq. 1a).

Finding Eulerian circuit
It can be shown that a Eulerian circuit to the graph constructed as illustrated in Fig. 1 gives a solution to our local strain reconstruction problem. However, the problem may yield multiple solutions. Each local strain should start from the first segment of one receptor and end with the last segment of the same receptor as illustrated in Fig. 1. Each reconstructed strain may contain several HGT events. Let the number of HGT events that contributes to the local strains i be denoted as h i . We choose a solution in which each reconstructed strain i has h i as large as possible.

Evaluation metrics
To evaluate the performance of LEMON, we construct true local strains containing transferred segments and use LEMON to recover them. We propose two metrics Reconstruction Accuracy and Detection Rate to measure results. We denote the true local strains as  (2), We set the number of repetitions N as 8 and define the mean value of reconstruction accuracyRA as follows, n k denotes the number of non-empty h i in the j-th repetitions.
An acceptable detection of one transferred segment should have its breakpoint pair located within 20 bp [24] of the true breakpoint pair as mentioned in "Breakpoints and segments" section. The Detection Rate is defined as the rate of the number of acceptable recovered transferred segments and the number of all transferred segments. The average detection rate is the average of 8 repetitions for each test.

Software parameter setting
Most third-party tools used in this article are set default parameters, including BWA, LUMPY [25], iRep [26], and STAR-Fusion [27]. The parameters of DBSCAN are eps, which is the maximum radius of one cluster, and min sample , which is the minimum number of points in one cluster. In our paper, eps is set as the average insert size. min sample is set as 1.

HGT events detection in simulated human gut microbiomes
To simulate human gut microbiome with different complexity as mentioned in "Local strains" section, we constructed 5 simulated microbiomes containing 160, 320, 640, 1280, 2560 species, respectively. For each simulated microbiome, 5 different amounts (20, 40, 60, 80, 100) of HGT events were generated. For each HGT event, we randomly selected two reference sequences as the receptor and donor respectively. On the donor, we randomly selected one 10k bp sequence region as a transferred segment and inserted it to a randomly selected insertion position on the receptor. In this simulation, each HGT event contained one transferred segment. We denoted the new receptor sequence containing transferred segments as the true local strain. All true local strains were used to generated 20X paired-end reads with WGSIM [28] as an input of LEMON. Eight repetitions were performed for every test.
In order to prove the performance of LEMON, we compared its performance with another popular breakpoint detection-based structural variant discovery software LUMPY. Figure 3 illustrates a Comparison of Reconstruction Accuracy and Detection Rate between LEMON and LUMPY under different simulated conditions. The red dot in each boxplot denotes the mean value. As we can see, most mean values of Reconstruction Accuracy and Detection Rate achieved by LEMON are higher than those achieved by LUMPY, which demonstrates that LEMON can reconstruct more accurate strains and detect more transferred segments than LUMPY.

HGT breakpoints detection
In order to evaluate the performance of LEMON in HGT breakpoints detection as mentioned in "Breakpoints and segments" section, we applied it to local strains with different coverage and compared the performance with LUMPY. HGT events with random receptors, donors and breakpoints were generated in 100 randomly selected microbials, resulting in 60 local strains with 4260 HGT breakpoints. The HGT breakpoint is the insertion position of donor segments such as s 1 and s 2 in Fig. 2a. The paired-end reads generated from these local strains with 10 different values (2X, 5X, 10X, 15X, 20X, 30X, 40X, 50X, 60X, and 70X) of depth were input of LEMON and LUMPY. The performance is measured in terms of Sensitivity and False Discovery Rate (FDR) in breakpoints detection, and the bearing bias is 20 bp. If the distance between the detected breakpoint position and the true position is larger than 20 bp, we treat this detected position as one false detected position. Then if the distance is less than 20 bp, the detected position is treated as one true detected position. Therefore, FDR is the rate of false discovered positions among all discovered breakpoint positions. Sensitivity is the rate of true detected positions among all true breakpoint positions. LEMON has higher sensitivity and lower FDR than LUMPY across different coverage levels as illustrated in Fig. 4

HGT strains reconstruction with complicated HGT event structure
In this simulation, we set the number of species s to 2560 and the number of HGT events to 100. In order to simulate a complicated HGT event structure, we changed the number of transferred segments in each HGT event from 1 to 5. Simulated paired-end reads were generated from   Figure 5c gives an example of one HGT event containing a different number of transferred segments. Figure 5d and e demonstrate that LEMON has better performance than LUMPY across different levels of coverage, especially at low coverage levels, such as 5X and 10X.
In Fig. 6, we compare local strains reconstructed by LEMON and LUMPY at 5X and 10X coverage levels. The true local strain contains two HGT events A and B. Each  HGT event has five transferred segments. +/-in each segment represents the forward/reverse direction of the segment. At 5X, LEMON detects one transferred segment Green -D6 and the RA is 0.6087, while LUMPY fails to detect anyone transferred segment and the RA is 0.5238. At 30X, LEMON detects all transferred segment and the RA is 1.0, which means LEMON has reconstructed the same structure as the true local strain, while LUMPY fails to detect Red +D1 and the RA is 0.9565. Therefore, LEMON could detect more transferred segments than LUMPY and reconstruct more accurate strains across different coverage levels.

Highly complex HGT structures do exist in real metagenomic data
We applied LEMON on a recently released metagenomic dataset [29] to reconstruct local strains containing HGT events. Some reconstructed local strains have complex HGT events, such as the reconstructed local strain of NZ_DS990133.1 in sample F-5 as shown in Fig. 7).
As we can see from Fig. 8, segments from one donor are not always inserted into the receptor as a whole. Sometimes they are inserted into the receptor together with segments from other donors. For example. Segments D1-D2-D3 and D2-D3-D4 from NZ_GG703855.1

Local strains reconstructed by LEMON can assist replication timing profile restoring
We used iRep to estimate the replication timing profile of each bacterium in metagenomics data [29]. iRep utilizes linear regression to evaluate the coverage distribution across the genome to determine the PTR (peak-to-trough ratio), which is the ratio between the coverage at the origin and terminus of replication. However, due to the limitations of the reference sequence and the low sequencing depth of most species, we typically got very few replication timing profiles in a single metagenomics sample.
We applied iRep to evaluate two coverage distributions for each receptor. The first coverage distribution is evaluated based on the original reference of the receptor containing no HGT event. The second coverage distribution is evaluated based on the reconstructed reference of the receptor containing HGT events. According to the two coverage distributions, we estimated two replication timing profiles(including PTR value, predicted origin, and terminus position) for each receptor. Since iRep utilizes the regression method to estimate replication timing profiles, we use Standard Error to measure the accuracy of the estimated replication timing profiles. Figure 8 demonstrates the change of Standard Error at the origin and terminus of replication before and after reconstructions, some reconstructed local strains have much lower Standard Error, which means that LEMON help to reconstruct strains containing HGT events with more accurate restoring replication timing profile.

Verifying HGT breakpoints with gene fusion breakpoints detected from metatranscriptome data
In order to verify the HGT breakpoints detected by LEMON, we analyzed the IBD (Inflammatory Bowel We identify 3 main reasons for discrepancies between STAR-Fusion-detected gene fusion breakpoints and our HGT breakpoints: 1) The results of STAR-fusion are based on STAR aligner, while our algorithm is based on BWA. STAR aligner and BWA employ different alignment algorithm, giving rise to different breakpoints results; 2) Limited sequencing data. The amount of metagennomics sequencing data in the IBD data set is around 5G per sample, and the amount of metatranscriptome data is 2G per sample. This is insufficient for the statistical significance required for finding all the breakpoints; 3) Based on our statistics in Table 2, most of HGT breakpoints occur in the non-coding region.
In summary, it is reasonable to find only 3 matching breakpoints in 17 pairs of data.

Conclusions and discussion
In this paper we present LEMON, a novel HGT discovery software that can detect HGT events and reconstruct strains containing multiple HGT events with complicated structural variation.
Using LEMON to reconstruct the sequence structure of bacteria allows us to study the metagenomics problem from the sequence level, thus no longer subjected to the comparison of abundance. For example, since HGT is the fundamental mechanism for the spread of antibiotic resistance in bacteria, by utilizing LEMON we could detect transferred Antibiotic Resistance Genes (ARG) [31], determine the corresponding donors and receptors, and reconstruct strains of receptors, which harbor the transferred ARG. Therefore, we could get a better understanding of the transfer mechanism of ARG among bacteria.
However, as the amount of sequencing data is generally insufficient for current metagenomics analysis, challenges remain in identifying the HGTs sensitively and accurately. This results in several shortcomings in LEMON. First, LEMON remains weak in finding HGT between the sequences that do not exist in the reference genome. Second, because we only consider the reads of unique mapping, the HGT on the repeat region cannot be identified.
At present, our reference set only contains the genome of bacteria. However, human gut microbiome also contains other microorganisms such as fungi and viruses. Therefore, a reference library that contains sequences of bacterial, viral and fungal more completely would be highly desirable for HGT analysis of the microbiome.