Problem formulation
Given n positions on two homologous sequences belonging to a diploid organism and m reads obtained after a sequencing experiment, we can reduce each read to a fragment vector f∈{0,1,−}^{n}, where 0 denotes a position that is equal to the reference sequence, 1 denotes a SNP with respect to the reference sequence, and − indicates a position that is not covered by the read. We define a haplotype as a vector h∈{0,1}^{n}, that is, the combination of SNPs and wildtype positions belonging to one of the two chromosomes. Given the two haplotypes h_{1} and h_{2}—which refer to the first and second copy of the chromosome, respectively—a position j (with j∈{1,…,n}) is said to be heterozygous if and only if \(h_{1_{j}} \neq h_{2_{j}}\), otherwise j is homozygous.
Let M be the “fragment matrix”, that is, the m×n matrix containing all fragments. Two distinct fragments f and g are said to be in conflict if there is a position j (with j∈{1,…,n}) such that f_{j}≠g_{j} and f_{j},g_{j}≠−; otherwise, they are in agreement. M is conflict-free if there are two different haplotypes h_{1} and h_{2} such that each row M_{i} (with i∈{1,…,m}) is in agreement with either h_{1} or h_{2}. The overall haplotype assembly process is outlined in Fig. 1.
We can extend the definitions of heterozygous and homozygous to the column level as follows: a column c of M is homozygous if all its values are either in {0,−} or in {1,−}; otherwise, c is heterozygous, meaning that its values are in {0,1,−} and both a SNP and a wildtype occur at that position. Finally, we can detect the case where two distinct fragments are in conflict, and measure their diversity, by defining a distance D(·,·) that counts the number of differing values between two fragments. Namely, given f=(M_{i1},…,M_{in}) and g=(M_{l1},…,M_{ln}) of M (with i,l∈{1,…,m}), we consider:
$$ D(\mathbf{f}, \mathbf{g}) = \sum_{j=1}^{n} d(f_{j}, g_{j}), $$
(1)
where d(f_{j},g_{j}) is defined as:
$$ d(x,y) = \left\{\begin{array}{ll} 1, & \text{if}\ x \neq y,\ x \neq -, \text{ and}\ y \neq -\\ 0, & \text{otherwise} \end{array}\right.. $$
(2)
Equation (1) defines the extended Hamming distance between two ternary strings f and g [19], that is, the total number of positions at which the characters of f and g both belong to {0,1} and differ, according to Eq. (2).
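The distance in Eqs. (1)–(2) can be sketched as follows; encoding fragments as strings over the characters '0', '1', and '-' is an illustrative choice, not GenHap's internal representation:

```cpp
#include <cstddef>
#include <string>

// Extended Hamming distance between two ternary fragments (Eqs. 1-2):
// a position contributes 1 only when both characters belong to {0,1}
// and differ; '-' marks a position not covered by the read.
int fragmentDistance(const std::string &f, const std::string &g) {
    int dist = 0;
    for (std::size_t j = 0; j < f.size() && j < g.size(); ++j) {
        if (f[j] != '-' && g[j] != '-' && f[j] != g[j])
            ++dist;
    }
    return dist;
}
```

Two fragments are in conflict exactly when this distance is greater than zero.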
If M is conflict-free, then it can be partitioned into two disjoint matrices M_{1} and M_{2}, each one containing a set of conflict-free fragments. We can infer the two haplotypes h_{1} and h_{2} from M_{1} and M_{2}, respectively, as follows:
$$ h_{k_{j}} = \left\{\begin{array}{ll} 1, & \text{if}\ N_{1_{j}}(\mathbf{M}_{k}) \geq N_{0_{j}}(\mathbf{M}_{k})\\ 0, & \text{otherwise} \end{array}\right., $$
(3)
where j∈{1,…,n}, k∈{1,2}, and \(N_{0_{j}}(\mathbf {M}_{k})\), \(N_{1_{j}}(\mathbf {M}_{k})\) denote the number of 0s and 1s in the jth column, respectively. In such a way, N_{0}(M_{k}) is the vector consisting of the number of 0s of each column j using the reads of the partition M_{k}, while N_{1}(M_{k}) is the vector consisting of the number of 1s of each column j represented by the partition M_{k}.
In order to solve the wMEC problem, N_{0} and N_{1} are calculated using the m×n weight matrix W, which stores the weight associated with each position in each fragment. Like M, W can be divided into the two disjoint partitions W_{1} and W_{2}, whose row indices correspond to those in M_{1} and M_{2}, respectively. We can extend Eq. (3) to take the weights into account as follows:
$$ h_{k_{j}} = \left\{\begin{array}{ll} 1, & \text{if}\ N_{1_{j}}(\mathbf{W}_{k}) \geq N_{0_{j}}(\mathbf{W}_{k})\\ 0, & \text{otherwise} \end{array}\right., $$
(4)
where j∈{1,…,n}, k∈{1,2}, and \(N_{0_{j}}(\mathbf {W}_{k})\), \(N_{1_{j}}(\mathbf {W}_{k})\) denote the sum of the weights associated with the 0 and 1 elements in the jth column, respectively.
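Equation (4) amounts to a per-column weighted majority vote. A minimal sketch, assuming fragments are stored as strings over '0', '1', and '-' with weights in a parallel matrix (names and encoding are illustrative):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Infer one haplotype from a partition of fragments and their weights
// (Eq. 4): at column j, emit '1' if the weighted support for 1 is at
// least the weighted support for 0, and '0' otherwise.
std::string inferHaplotype(const std::vector<std::string> &partition,
                           const std::vector<std::vector<double>> &weights,
                           std::size_t n) {
    std::string h(n, '0');
    for (std::size_t j = 0; j < n; ++j) {
        double n0 = 0.0, n1 = 0.0; // weighted counts N0_j and N1_j
        for (std::size_t i = 0; i < partition.size(); ++i) {
            if (partition[i][j] == '0') n0 += weights[i][j];
            else if (partition[i][j] == '1') n1 += weights[i][j];
        }
        h[j] = (n1 >= n0) ? '1' : '0';
    }
    return h;
}
```

Setting all weights to 1 recovers the unweighted rule of Eq. (3).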
The distance D(·,·) given in Eq. (1) can also be used to evaluate the distance between a fragment and a haplotype, by means of the following error function:
$$ \mathcal{E}(\mathbf{M}_{1},\mathbf{M}_{2}, \mathbf{h}_{1}, \mathbf{h}_{2}) = \sum_{k=1}^{2} \sum_{\mathbf{f} \in \mathbf{M}_{k}} D(\mathbf{f}, \mathbf{h}_{k}). $$
(5)
The best partitioning of M can be obtained by minimizing Eq. (5), thereby inferring h_{1} and h_{2} with the least number of errors. Equation (5) is used as the fitness function in GenHap.
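The fitness of Eq. (5) can be computed directly from the two partitions and their inferred haplotypes. A minimal sketch, assuming fragments and haplotypes are stored as strings over '0', '1', and '-' (an illustrative encoding):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Mismatches between a fragment and a haplotype (Eq. 1 restricted to
// covered positions): '-' in the fragment is ignored.
int mismatches(const std::string &f, const std::string &h) {
    int d = 0;
    for (std::size_t j = 0; j < f.size(); ++j)
        if (f[j] != '-' && f[j] != h[j]) ++d;
    return d;
}

// Error function of Eq. (5): sum of the distances between each fragment
// and the haplotype of the partition it belongs to. A conflict-free
// partitioning reaches fitness 0 (lower is better).
int fitness(const std::vector<std::string> &m1, const std::string &h1,
            const std::vector<std::string> &m2, const std::string &h2) {
    int err = 0;
    for (const auto &f : m1) err += mismatches(f, h1);
    for (const auto &f : m2) err += mismatches(f, h2);
    return err;
}
```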
GenHap: haplotype assembly using GAs
GAs are population-based optimization strategies mimicking Darwinian processes [25–27]. In GAs, a population P of randomly generated individuals undergoes a selection mechanism and is iteratively modified by means of genetic operators (i.e., crossover and mutation). Among the existing metaheuristics for global optimization, GAs are the most suitable technique in this context thanks to the discrete structure of the candidate solutions, which is well-suited to efficiently tackle the intrinsic combinatorial nature of the haplotype assembly problem. In the most common formulation of GAs, each individual C_{p} (with p∈{1,…,P}) encodes a possible solution of the optimization problem as a fixed-length string of characters taken from a finite alphabet. Based on a quality measure (i.e., the fitness value), each individual is involved in a selection process in which individuals characterized by good fitness values have a higher probability of being selected for the next iteration. Finally, the selected individuals undergo the crossover and mutation operators to possibly improve the offspring and to introduce new genetic material into the population.
GenHap exploits a simple and efficient structure for individuals, which encodes a partition of the fragment matrix M as a binary string. In particular, each individual \(C_{p}=[C_{p_{1}}, C_{p_{2}}, \ldots, C_{p_{m}}]\) (with p∈{1,…,P}) is encoded as a circular array of size m (i.e., the number of reads). In order to obtain the two partitions M_{1} and M_{2}, C_{p} is evaluated as follows: if the ith bit is equal to 0, then read i belongs to M_{1}; otherwise, read i belongs to M_{2}. Once the two partitions are computed, GenHap infers the haplotypes h_{1} and h_{2} by applying Eq. (4). Finally, Eq. (5) is exploited to calculate the number of errors made by partitioning M as encoded by each individual of P. This procedure is iterated until the maximum number of iterations T is reached, the number of errors is equal to 0, or the fitness value of the best individual does not improve for θ=⌈0.25·T⌉ iterations.
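The decoding of an individual into the two partitions can be sketched as follows (names are illustrative; GenHap's internal code may differ):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode an individual: bit i assigns read i to partition M1 (bit 0)
// or M2 (bit 1). Reads are represented here as ternary strings.
void decodeIndividual(const std::vector<int> &bits,
                      const std::vector<std::string> &reads,
                      std::vector<std::string> &m1,
                      std::vector<std::string> &m2) {
    m1.clear();
    m2.clear();
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (bits[i] == 0) m1.push_back(reads[i]);
        else              m2.push_back(reads[i]);
    }
}
```

Evaluating an individual thus reduces to decoding it, inferring h_{1} and h_{2} via Eq. (4), and summing the errors via Eq. (5).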
Among the different selection mechanisms employed by GAs (e.g., roulette wheel [25], ranking [26], tournament [27]), GenHap exploits tournament selection to create an intermediate population P^{′}, starting from P. In each tournament, κ individuals are randomly selected from P and the individual with the best fitness value is added to P^{′}. The size of the tournament κ is related to the selection pressure: if κ is large, then individuals characterized by worse fitness values have a low probability of being selected, so the variability of P^{′} might decrease.
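Tournament selection can be sketched as follows, assuming a minimization setting in which lower fitness values (fewer errors) are better; names are illustrative:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Tournament selection of size kappa: sample kappa individuals
// uniformly at random (with replacement) and return the index of the
// one with the lowest fitness value.
std::size_t tournament(const std::vector<double> &fitnessValues,
                       std::size_t kappa, std::mt19937 &rng) {
    std::uniform_int_distribution<std::size_t> pick(0, fitnessValues.size() - 1);
    std::size_t best = pick(rng);
    for (std::size_t t = 1; t < kappa; ++t) {
        std::size_t cand = pick(rng);
        if (fitnessValues[cand] < fitnessValues[best]) best = cand;
    }
    return best;
}
```

Repeating this |P| times fills the intermediate population P′; larger κ increases the selection pressure, as discussed above.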
Afterwards, the genetic operators (i.e., crossover and mutation) are applied to the individuals belonging to P^{′} to obtain the offspring for the next iteration. GenHap exploits a single-point crossover with a mixing ratio equal to 0.5. Crossover is applied with a given probability c_{r} and allows for the recombination of two parent individuals C_{y},C_{z}∈P^{′} (for some y,z∈{1,…,P}), generating two offspring that possibly have better characteristics with respect to their parents.
In order to increase the variability of the individuals, one or more elements of the offspring can be modified by applying the mutation operator. GenHap makes use of a classic mutation in which the elements \(C_{p_{e}}\) (with e∈{1,…,m}) of the individual can be flipped (i.e., from 0 to 1 or vice versa) with probability m_{r}. Besides this mutation operator, GenHap implements an additional bit-flipping mutation in which a random number of consecutive elements of the individual is mutated according to probability m_{r}. This operator is applied if the fitness value of the best individual does not improve for a given number of iterations (2 in our tests).
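The two mutation operators can be sketched as follows; the exact way GenHap draws the run of consecutive elements is not specified above, so flipping a uniformly drawn segment is an assumption of this sketch:

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Classic mutation: each bit of the individual is flipped
// independently with probability mr.
void classicMutation(std::vector<int> &bits, double mr, std::mt19937 &rng) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    for (auto &b : bits)
        if (u(rng) < mr) b = 1 - b;
}

// Segment mutation (assumed form): flip a run of consecutive bits
// between two uniformly drawn endpoints, used when the best fitness
// stagnates to inject larger perturbations.
void segmentMutation(std::vector<int> &bits, std::mt19937 &rng) {
    std::uniform_int_distribution<std::size_t> pos(0, bits.size() - 1);
    std::size_t a = pos(rng), b = pos(rng);
    if (a > b) std::swap(a, b);
    for (std::size_t e = a; e <= b; ++e)
        bits[e] = 1 - bits[e];
}
```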
Finally, to prevent the quality of the best solution from decreasing during the optimization, GenHap exploits an elitism strategy, so that the best individual from the current population is copied into the next population without undergoing the genetic operators.
Unlike the work in [12], GenHap solves the wMEC problem instead of the unweighted MEC formulation, by means of Eq. (4). Moreover, unlike other heuristic strategies, such as ReFHap [15] and ProbHap [16], we did not assume the all-heterozygosity of the phased positions [19]; under that assumption, every column would correspond to a heterozygous site, implying that h_{1} must be the complement of h_{2}. In addition, since the required execution time as well as the problem difficulty increase with the number of reads and SNPs, to efficiently solve the wMEC problem we split the fragment matrix M into Π=⌊m/γ⌋ submatrices consisting of γ reads (see Fig. 2). Following a divide et impera approach [28], the computational complexity can be tackled by partitioning the entire problem into smaller and manageable subproblems, each one solved by a GA that converges to a solution characterized by two sub-haplotypes with the least number of corrections to the SNP values. The solutions to the subproblems achieved by the Π GA instances are finally combined. This approach is feasible thanks to the long reads with higher coverage produced by second- and third-generation sequencing technologies. As a matter of fact, highly overlapping reads allow us to partition the problem into easier subproblems, avoiding the possibility of obtaining incorrect reconstructions during the merging phase.
The parameter γ, used for the calculation of Π, depends on the coverage value and on the nature of the sequencing technology; its value must be set so as to avoid introducing discrete haplotype blocks that do not exist in the input matrix M. Generally, the intervals where several independent historical recombination events occurred separate discrete blocks, revealing greater haplotype diversity for the regions spanning the blocks [7].
GenHap first detects all the haplotype blocks inside the fragment matrix M and then, in each block, automatically sets γ equal to the mean coverage of that block to partition the reads. Notice that GenHap solves each block sequentially and independently, obtaining a number of haplotype pairs equal to the number of detected blocks. In doing so, for each block GenHap executes Π different GA optimizations, one for each subproblem, calculating 2·Π sub-haplotypes. The length of the individuals is equal to γ, except for the last subproblem, which could have a number of reads smaller than γ (accordingly, the length of its individuals could be smaller than γ).
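The splitting of a block's reads into subproblems can be sketched as follows; here the last subproblem simply keeps the remaining reads, matching the remark that it may contain fewer than γ reads (an illustrative reading, not GenHap's exact code):

```cpp
#include <cstddef>
#include <vector>

// Split the m read indices of a block into subproblems of gamma reads
// each; the last subproblem keeps the remainder and may be smaller
// than gamma. Each subproblem is then solved by one GA instance.
std::vector<std::vector<std::size_t>> splitBlock(std::size_t m,
                                                 std::size_t gamma) {
    std::vector<std::vector<std::size_t>> subproblems;
    for (std::size_t start = 0; start < m; start += gamma) {
        std::vector<std::size_t> chunk;
        for (std::size_t i = start; i < m && i < start + gamma; ++i)
            chunk.push_back(i);
        subproblems.push_back(chunk);
    }
    return subproblems;
}
```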
Since the problem is divided into Π subproblems, two subproblems referring to contiguous parts of the two chromosome copies might contain some overlapping positions that can be either homozygous or heterozygous. However, the reads covering an overlapping position might not be entirely included in the same subproblem. For this reason, during the GA-based optimizations, all the phased positions are assumed to be heterozygous. If a position j is homozygous (i.e., all the reads covering this position have the same value, belonging to {0,−} or {1,−}, in both subpartitions and in every read covering it), then only one of the two sub-haplotypes will have the correct value. This specific value is correctly assigned to the sub-haplotype covered by the highest number of reads, following a majority rule. As soon as the two sub-haplotypes are obtained, all the possible uncorrected heterozygous sites are removed and the correct homozygous values are assigned by checking the columns of the two subpartitions. Finally, once all Π subproblems are solved, GenHap recombines the sub-haplotypes to obtain the two entire haplotypes h_{1} and h_{2} of the block under analysis.
GenHap is also able to find and mask ambiguous positions by replacing their 0 or 1 value with an X symbol. We highlight that an ambiguous position is a position covered only by the reads belonging to one of the two haplotypes.
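The masking step can be sketched as follows, assuming that the haplotype lacking read support at a column is the one whose value is replaced by X (names and the string encoding are illustrative):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Mask ambiguous positions: a column covered only by the reads
// assigned to one of the two partitions cannot be phased reliably,
// so the value of the unsupported haplotype is replaced by 'X'.
void maskAmbiguous(const std::vector<std::string> &m1,
                   const std::vector<std::string> &m2,
                   std::string &h1, std::string &h2) {
    for (std::size_t j = 0; j < h1.size(); ++j) {
        bool covered1 = false, covered2 = false;
        for (const auto &f : m1)
            if (f[j] != '-') { covered1 = true; break; }
        for (const auto &f : m2)
            if (f[j] != '-') { covered2 = true; break; }
        if (covered1 && !covered2) h2[j] = 'X'; // only haplotype 1 supported
        if (covered2 && !covered1) h1[j] = 'X'; // only haplotype 2 supported
    }
}
```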
Implementation
In order to efficiently solve the wMEC problem and tackle its computational complexity, GenHap detects the haplotype blocks inside the matrix M and then, for each block, splits the corresponding portion of M into Π submatrices consisting of γ reads. In so doing, the convergence speed of the GA is increased thanks to the lower number of reads to partition in each subproblem with respect to the total number of reads of the whole problem. As shown in Fig. 3, the Π submatrices are processed in parallel by means of a divide et impera approach that exploits a Master-Slave distributed programming paradigm [29, 30] to speed up the overall execution of GenHap. This strategy allowed us to distribute the computation in the presence of multiple cores. As a matter of fact, GenHap works by partitioning the initial set of reads into subsets and solving them by executing different GA instances. This strategy can be exploited in GenHap, as it solves the wMEC problem working on the rows of the fragment matrix M; on the contrary, HapCol works on the columns of M, which cannot be independently processed in parallel.
The functioning of our Master-Slave implementation can be summarized as follows:

1. the Master allocates the resources and detects the haplotype blocks inside the fragment matrix. For each detected block, it partitions the portion of the matrix M into Π submatrices and offloads the data onto the available Σ Slaves (in real scenarios, Σ≪Π). During this phase, each Slave generates the initial population of the GA;
2. the σth Slave (with σ∈{1,…,Σ}) executes the assigned wMEC subtask, running the GA for either θ non-improving iterations or T maximum iterations, independently of the other Slaves;
3. the process is iterated until all the wMEC subtasks are terminated;
4. the Master recombines the subsolutions received from the Slaves, and returns the complete wMEC solution for the block under analysis.
GenHap was entirely developed in the C++ programming language, exploiting the Message Passing Interface (MPI) specification to leverage multi-core Central Processing Units (CPUs).