The GMASS score
The GMASS score represents the structural similarity of a pair of genome assemblies based on the length and number of similar genomic regions defined as consensus segment blocks (CSBs) in the assemblies. The CSBs are aligned assembly regions between two assemblies which can contain small number of substitutions and small gaps shorter than a chosen resolution (Fig. 1a). The way of constructing CSBs is described in the following subsection. The GMASS score is defined as Eq. 1 based on the assumption that the number and length of CSBs between two assemblies is equivalent to the number and length of sequence fragments (scaffolds or chromosomes) of the two assemblies as the two assemblies become more similar.
$$ GMASS=\frac{\ {\sum}_{i=r}^R{S}_i\ }{n}=\frac{\ {\sum}_{i=r}^R\left\{{L}_i\left({A}_1,{A}_2\right)\times {C}_i\left({A}_1,{A}_2\right)\right\}\ }{n} $$
(1)
The GMASS score is the average of structural similarity scores (Si) between two assemblies A1 and A2 across a certain range of CSB resolutions. In Eq. 1, Si denotes the structural similarity score at the CSB resolution i belonging to a set of n different resolutions with a minimum value r and a maximum value R. Si is calculated by the product of two terms related to the total length of the CSBs (Li(A1, A2)) and the number of CSBs (Ci(A1, A2)).
Li(A1, A2) is the fraction of the total length of the CSBs against the total length of assemblies used to construct the CSBs with the CSB resolution i defined as Eq. 2. In Eq. 2, l(Ax) is the total length of sequence fragments larger than resolution i in the assembly Ax, and l(CSBsx) is the total length of blocks in the CSBs belonging to Ax defined as CSBsx.
$$ {L}_i\left({A}_1,{A}_2\right)=\frac{l\left({CSBs}_1\left)+l\right({CSBs}_2\right)}{l\left({A}_1\right)+l\left({A}_2\right)} $$
(2)
For example, suppose CSBs are constructed for the two scaffold assemblies A1 and A2 using 10 Kbp resolution. If total 2 Mbp and 2.5 Mbp scaffold sequences of A1 and A2 respectively are used in this construction, and total 1.8 Mbp and 2.2 Mbp scaffold sequences of A1 and A2 respectively are actually included in the CSBs, then l(A1)= 2 Mbp, l(A2)= 2.5 Mbp, l(CSBs1)= 1.8 Mbp, and l(CSBs2)= 2.2 Mbp. Therefore, L10Kbp(A1, A2) = (1.8 + 2.2)/(2 + 2.5) = 0.89.
Ci(A1, A2) denotes the similarity between two assemblies in terms of the total number of CSBs constructed with the CSB resolution i defined as Eq. 3.
$$ {C}_i\left({A}_1,{A}_2\right)=\left\{{}_{1-\frac{c\left({CSB}_{S_1}\right)+c\kern0.22em \left({CSB}_{S_2}\right)-c\;\left({A}_1\right)-c\;\left({A}_2\right)}{\mathit{\max}c\;\left({A}_1\right)+\mathit{\max}c\;\left({A}_2\right)-c\;\left({A}_1\right)-c\;\left({A}_2\right)},\kern5.879990em otherwise}^{\frac{c\left({CSB}_{S_1}\right)+c\kern0.22em \left({CSB}_{S_2}\right)}{c\;\left({A}_1\right)+c\;\left({A}_2\right)},\kern1.2em if\;c\;\left({A}_1\right)+c\;\left({A}_2\right)\kern0.36em \ge c\kern0.24em \left({CSB}_{S_1}\right)+c\;\left({CSB}_{S_2}\right)}\kern0.36em \right. $$
(3)
Here c(X) is the number of sequence fragments larger than given resolution i (if X is an assembly) or the number of blocks constructed using the CSB resolution i (if X are CSBs). We note that if X are CSBs, then c(CSBs1) and c(CSBs2) are the same because of the definition of a CSB. maxc(X) is the maximum number of CSBs that can be constructed from an assembly X using the CSB resolution i. If the sum of the number of CSBs from two assemblies is less than or equal to the sum of sequence fragments larger than resolution i in the two assemblies, Ci(A1, A2) is defined as the fraction obtained from those two numbers (the upper part of Eq. 3). Otherwise, the positive difference between those two numbers is calculated, and its fraction against a similar difference using the sum of the maximum number of CSBs that can be constructed using the two assemblies is obtained. The result is subtracted from 1 to make a positive correlation between Ci(A1, A2) and assembly similarity (the lower part of Eq. 3). The maximum number of CSBs maxc(Ax) of an assembly Ax is calculated by Eq. 4.
$$ maxc\left({A}_x\right)=\frac{l\left({A}_x\right)}{i} $$
(4)
Here i is the resolution used to construct CSBs. Eq. 4 implies that the minimum length of a CSB is i, and therefore at most l(Ax)/i CSBs can be constructed. This was empirically validated using the GAGE and Assemblathon 1 dataset.
For example, suppose again that CSBs are constructed for the two scaffold assemblies A1 and A2 with l(A1)= 2 Mbp and l(A2)= 2.5 Mbp using 10 Kbp resolution. If total 20 and 24 scaffolds of A1 and A2 respectively are used in this construction, and total 20 CSBs are constructed, then c(A1)= 20, c(A2)= 24, and \( c\left({CSB}_{S_1}\right)= \) \( c\left({CSB}_{S_2}\right) \) = 20. In this case, c(A1) + c(A2) ≥ c(CSBs1) + c(CSBs2) holds, and therefore, C10Kbp(A1, A2)= (20 + 20)/(20 + 24) = 0.91 by the upper part of Eq. 3. On the other hand, if \( c\left({CSB}_{S_1}\right)= \) \( c\left({CSB}_{S_2}\right) \) = 25, the upper part of Eq. 3 cannot be used because C10Kbp(A1, A2)= (25 + 25)/(20 + 24) > 1 based on the upper part of Eq. 3 and the Ci(A1, A2) score between 0 and 1 is needed. In this case, Ci(A1, A2) is calculated by the lower part of Eq. 3 using c(A1)= 20, c(A2)= 24, \( c\left({CSB}_{S_1}\right)= \) \( c\left({CSB}_{S_2}\right) \) = 25, maxc(A1)= 2000/10 = 200, and maxc(A2)= 2500/10 = 250. Therefore, C10Kbp(A1, A2)= 1 – (25 + 25–20 – 24)/(200 + 250–20 – 24) = 0.99.
Calculation of the GMASS score for simulated and benchmark datasets
In this study, CSBs were constructed from three datasets: simulated genome assemblies, GAGE [18], and Assemblathon 1 [20]. In the case of the simulated genome assemblies, the pairwise alignments of simulated genome assemblies (reference against 11 other assemblies with different divergence to the reference) were obtained from previous study [26]. This dataset simulated the evolution of two human chromosomes 21 and 22 from a common ancestor of 12 species with varying divergences. The synteny block generation program in inferCars [27] was then used to construct CSBs between each pair of assemblies with a given resolution.
In the case of the other two datasets, assembly sequences as a form of scaffolds were downloaded from a website with their datasets. Then repeats in the assemblies were first masked using RepeatMasker (version 4.0.5) [28], and pairwise alignments between two different assemblies were generated using LASTZ (version 1.02) [29] with parameters “-E=150 -H=2000 -K=4300 -L=2200 -M=254 -O=600 -T=2 -Y=15000”. In evaluation of the GMASS score, self-alignments of each assembly were also constructed using LASTZ parameters “-O=400 -E=30 -X=910 -Y=9400 -K=3000 -L=3000 -H=2000 -T=1 -Z=1”. These parameter values were based on the recommendation found in the UCSC genome browser [30]. Once pairwise alignments of genome assemblies were prepared, the synteny block generation program in inferCars [27] was used to construct CSBs between a pair of assemblies with a given resolution.
In the process of the CSB generation between two assemblies, one of the assemblies needs to be a reference. In the case of the simulated assembly dataset, a reference assembly specified in that study was also used as the reference in this study. Therefore, the CSBs of total 11 pairs (reference against 11 other assemblies) were constructed. In the GAGE and Assemblathon 1 dataset, a total of eight assemblies by eight assembly programs (ABySS2 [31], ALLPATHS-LG [13], Bambus2 [32], CABOG [8], SGA [33], SOAPdenovo [12], MSR-CA [34] and Velvet [14]) and five assemblies by five assembly programs (ABySS [11], ALLPATHS-LG [13], Celera [10], Meraculous [35] and SOAPdenovo [12]) exist respectively, and CSBs of all possible assembly pairs were constructed by using one assembly with a higher N50 as the reference. In terms of the resolution, five different resolutions (100, 200, 300, 400, and 500 Kbp) were used for the simulated assembly dataset. In the case of the GAGE and Assemblathon 1 dataset, N50, N60, N70, N80, and N90 values of a non-reference assembly were used as resolutions.
To assess the relationship between the GMASS score and structural similarity of assembly pairs, the Pearson’s correlation coefficient test was conducted between structural similarities and the GMASS scores calculated using CSBs between simulation datasets. The correlation analyses were also conducted for Li(A1, A2), Ci(A1, A2) and Si scores in a given resolution i. The evolutionary distances which were used for constructing simulated assemblies were treated as indicators of sequence similarity between two assemblies. Information about the evolutionary distance, including phylogenetic tree topology and evolutionary distances between a reference assembly and the simulated assemblies were obtained from [26].