New approach: BREC
BREC is designed following the workflow represented in Fig. 1. To ensure that the broadest range of species could be analyzed by our tool, we designed a pipeline that adapts behavior with respect to input data. Each step of the workflow relies mostly on statistical analysis, adaptive algorithms, and decision proposals led by empirical observation.
The workflow starts with a pre-processing module (called "Step 0") aiming to prepare the data prior to the analysis. Then, it follows six main steps: (1) estimate Marey Map-based local recombination rates, (2) identify chromosome type, (3) prepare the HCB identification, (4) identify the centromeric boundaries, (5) identify the telomeric boundaries, and (6) extrapolate the local recombination rate map and generate an interactive plot containing all BREC outputs (see Fig. 1). Each step is detailed hereafter and summarised in Additional file 16.
Step 0 - Apply data pre-processing
Since we have noticed that BREC estimates are sensitive to the quality of input data, we propose a pre-processing step to assess data quality and suggest an optional data cleaning for outliers. As such, we could ensure proper functioning during further steps.
Data quality control (DQC) The quality of input data is tested regarding two criteria: (1) the density of markers and (2) the homogeneity of their distribution on the physical map along a given chromosome. First, the mean density, defined as the number of markers per physical map length, is computed. This value is compared with the minimum required threshold of 2 markers/Mb. Based on the displayed results, the user gets to decide if data cleaning is required or not. The threshold of 2 markers/Mb is selected based on a simulation process that allowed to test BREC results while decreasing markers density until the observed HCB estimates seemed to be no longer exploitable (see "Simulated data for quality control testing" section). Second, the distribution of input data is tested via a comparison with a simulated uniform distribution of identical markers density and physical map length. This comparison is applied using Pearson's \(\chi^2\) test [31], which allows examining how close the observed distribution (input data) is to the expected one (simulated data).
Data cleaning The cleaning step aims to reduce the disruptive impact of noisy data, such as outliers, in order to provide a more accurate recombination rate and heterochromatin boundary results. If the input data fails to pass the Data Quality Control (DQC) test, the user has the option to apply or not a cleaning process. This process consists of identifying the extreme outliers and eliminating them upon the user's confirmation. Outliers are detected using the distribution statistics of the genetic map (see Additional file 18). More precisely, inter-marker distances (separating each two consecutive points) are computed along the genetic map. Using a boxplot, distribution statistics (quartiles, mean, median) are applied on these inter-marker distances to identify outliers, which are chosen as the 5% of the data points with a greater genetic distance than the maximum extreme value, and should be discarded. Thus, the cleaning targets markers for which the genetic distance is quite larger than most of the rest. After the first cleaning iteration, DQC is applied again to assess the new density and distribution. The user can also choose to bypass the cleaning step, but BREC's behavior is no longer guaranteed in such cases.
Step 1 - Estimate Marey Map-based local recombination rates
Once the data are cleaned, the recombination rate can be estimated based on the Marey map [20] approach by: (1) correlating genetic and physical maps, (2) generating two regression models -third degree polynomial and Loess- that better fits these data, (3) computing the prime derivative for both models which will represent preliminary recombination maps for the chromosome. The primary purpose of interpolation here is to provide local recombination rate estimates for any given physical position, instead of only the ones corresponding to available markers.
At this point, both recombination maps are used to identify the chromosome type as well as the approximate position of centromeric and telomeric regions. Nevertheless, as a final output, BREC will return only the Loess-based adjusted map for recombination rates since it provides finer local estimates than the polynomial-based map.
Step 2 - Identify chromosome type
BREC provides a function to identify the type of a given chromosome according to the position of its centromere. This function is based on the physical position of the smallest value of recombination rate estimates, which primarily indicates where the centromeric region is more likely to be located. Our experimentation allowed to come up with the following scheme (see Additional file 10). Two main types are identified: telocentric and atelocentric [32]. Atelocentric type could be either metacentric (centromere located approximately in the center with almost two equal arms) or not metacentric (centromere located between the center and one of the telomeres). The latter includes the two most known subtypes, submetacentric and acrocentric (recently considered types rather than subtypes). It is tricky for BREC to distinguish between submetacentric and acrocentric chromosomes correctly. Their centromeres' position varies slightly, and capturing this variation (based on the smallest value of recombination rate on both maps -polynomial and Loess-) could not be achieved yet. Therefore, we chose to provide this result only if the implemented process allowed to identify the subtype automatically. Otherwise, the user gets the statistics on the chromosome's data and is invited to decide according to further a priori knowledge. The two subtypes (metacentric and not metacentric) are distinguished following intuitive reasoning inspired by their definition found in the literature. First, BREC identifies whether the chromosome is an arm (telocentric) or not (atelocentric). Then, it tests if the physical position of the smallest value of the estimated recombination rate is located between 40% to 60% interval. In this case, the subtype is displayed as metacentric. Otherwise, it is displayed as not metacentric. The recombination rate is estimated using the Loess model ("LOcal regrESSion") [33, 34].
Step 3 - Prepare the HCB identification
The HCB identification is a purely statistical approach relying on the coefficient of determination \(R^2\), which measures how good the generated regression model fits the input data [35]. We chose this approach because the Marey map usually exhibits a lower quality of markers (density and distribution) on the heterochromatin regions. Thus, we aim to capture this transition from high to low quality regions (or vice versa) as it reflects the transition from euchromatin to heterochromatin regions (or vice versa). The coefficient \(R^2\) is defined as the cumulative sum of squares of differences between the interpolation and observed data. \(R^2\) values are accumulated along the chromosome. In order to eliminate the biased effect of accumulation, \(R^2\) is computed twice: \(R^2-forward\) starts the accumulation from the beginning of the chromosome to provide the left centromeric and left telomeric boundaries. In contrast, \(R^2-backwards\) starts from the end of the chromosome, providing the right centromeric and right telomeric boundaries. These \(R^2\) values were calculated using the rsq package in R. To compute \(R^2\) cumulative vectors, rsq function is applied on the polynomial regression model. In fact, there is no such function for non-linear regression models like the Loess because, in such models, high \(R^2\) does not always indicate a good fit. A sliding window is defined and applied on the \(R^2\) vectors to precisely analyze their variations (see details in the next step). In the case of a telocentric chromosome, the position of the centromere is then deduced as the left or the right side of the arm, while in the case of an atelocentric chromosome, the existence of a centromeric gap is investigated.
Step 4 - Identify centromeric boundaries
Since the centromeric region is known to present reduced recombination rates, the starting point for detecting its boundaries is the physical position corresponding to the smallest polynomial-based recombination rate value. A sliding window is then applied to expand the starting point into a region based on \(R^2\) variations in two opposite directions. The sliding window's size is automatically computed for each chromosome as the largest value of ranges between each two consecutive positions on the physical map (indicated as i and \(i+1\) in Eq. 1). After making sure the sliding window includes at least two data points, the mean of local growth rates inside the current window is computed and tested compared to zero. If it is positive (resp. negative) on the forward (resp. backward) \(R^2\) curve, the value corresponding to the window's ending edge is returned as the left (resp. right) boundary. Else, the window moves by a step value equal to its size.
$$\begin{aligned} sliding\_window\_size(chromosome)= max\{|physPos_{i+1} - physPos_i| : 1 \le i \le n-1\} \end{aligned}$$
(1)
There are some cases where chromosome data present a centromeric gap. Such a lack of data produces biased centromeric boundaries. To overcome this issue, chromosomes with a centromeric gap are handled with a slightly different approach. After comparing the mean of local growth rates regarding to zero, accumulated slopes of all data points within the sliding window are computed, adding one more point at a time. If the mean of accumulated slopes keeps the same variation direction as the mean of growth rates, the centromeric boundary is set as the window's ending edge. Else, the window slides by the same step value as before (equal to its size). The difference between the two chromosome types is that only one sliding window is used for the telocentric case, its starting point is the centromeric side, and it moves away from it. As for the atelocentric case, two sliding windows are used (one on each \(R^2\) curve), their starting point is the same, and they move in opposite directions to expand the centromere into a region.
Step 5 - Identify telomeric boundaries
Since telomeres are considered heterochromatin regions as well, they also tend to exhibit low fitness between the regression model and the data points. More specifically, the accumulated \(R^2\) curve tends to present a significant depletion around telomeres. Therefore, a telomeric boundary is defined here as the physical position of the most significant depletion corresponding to the smallest value of the \(R^2\) curve. As such, in the telocentric case, only one \(R^2\) curve is used. It gives one boundary of the telomeric region (the other boundary is defined by the beginning of the left telomere or the end of the right telomere). Whilst in the atelocentric case, where the are two telomeres, the depletion on \(R^2-forward\) detects the end of the left telomeric region, and the depletion on \(R^2-backwards\) detects the beginning of the right telomeric region. The other two boundaries (the beginning of the left telomere and the end of the right telomere) are defined to be, respectively, the same values of the two markers with the smallest and the largest physical position available within the input data of the chromosome of interest.
Step 6 - Extrapolate the local recombination rate estimates and generate interactive plot
The extrapolation of recombination rate estimates at the identified centromeric and telomeric regions automatically performs an adjustment by resetting the initial biased values to zero along these heterochromatin ranges. Finally, all of the above BREC outputs are combined to generate one interactive plot to display for visualization and download (see details in "Easy, fast and accessible tool via an R-package and a Shiny app" section).
It is important to emphasize that throughout the whole main process module, only Step 1 " Estimating Marey map-based local recombination rates " comes from previous methods ([20, 21]). Otherwise, each of the steps 2-6 are fully developed (designed and implemented) within BREC and represent a new contribution, in addition to step zero " Data pre-processing ", as mentioned above.
Data and implementation
Validation data
The only input dataset to provide for BREC is genetic and physical maps for one or several chromosomes. A simple CSV file with at least two columns for both maps is valid. If the dataset is for more than one chromosome or the whole genome, a third column, with the chromosome identifier, is required.
Our results have been validated using Release 5 of the fruit fly D. melanogaster [36, 37] genome as well as the domesticated tomato Solanum lycopersicum genome (version SL3.0).
We also tested BREC using other datasets of different species: house mouse (Mus musculus castaneus, MGI) chromosome 4 [38], roundworm (Caenorhabditis elegans, ws170) chromosome 3 [39], zebrafish (Danio rerio, Zv6) chromosome 1 [40], respectively (see Additional file 13), as samples from the multi-genome dataset included within BREC (see further details on the full built-in dataset in "Description of main components of the Shiny app" section).
Fruit fly genome D.melanogaster Physical and genetic maps are available for download from the FlyBase website (http://flybase.org/; Release 5) [26]. This genome is represented here with five chromosomal arms: 2L, 2R, 3L, 3R, and X (see Additional file 2), for a total of 618 markers, 114.59Mb of physical map and 249.5cM of genetic map. This dataset is manually curated and is already clean from outliers. Therefore, the cleaning step offered within BREC was skipped.
Tomato genome S. lycopersicum Domesticated tomato with 12 chromosomes has a genome size of approximately 900Mb. Based on the latest physical and genetic maps reported by the Tomato Genome Consortium [28], we present both maps content (markers number, markers density, physical map length, and genetic map length) for each chromosome in Additional file 15. For a total of 1957 markers, 752.47Mb of physical map and 1434.49cM of genetic map along the whole genome.
Simulated data for quality control testing
We call data scenarios, the layout in which the data markers are arranged along the physical map. For experimentally testing the limits of BREC, various data scenarios have been specifically designed based on D. melanogaster chromosomal arms (see Additional file 9).
In an attempt to investigate how the markers' density varies within and between the five chromosomal arms of D. melanogaster Release 5 genome, the density has been analyzed in two ways: locally (with 1Mb-bins) and globally (on the whole chromosome). Additional file 4 shows the results of this investigation, where each little box indicates how many markers are present within the corresponding region of size 1Mb on the physical map. The mean value represents the global density. It is also shown in Additional file 2 where the values are slightly different. This is due to computing the markers' density in two different ways with respect to the analysis. Additional file 2, presenting the genomic features of the validation dataset, shows markers density in Column 3, which is simply the result of the division of markers number (in column 2) by the physical map length (in Column 4). For example, in the case of chromosomal arm X, this gives \(165/21.22 = 7.78 markers/Mb\). On the other hand, Additional file 4, aimed for analyzing the variation of local markers density, displays the mean of of all 1Mb-bins densities, which is calculated as the sum of local densities divided by the number of bins, and this gives \(165/22 = 7.5 markers/Mb\).
The exact same analysis has been conducted on the tomato genome S. lycopersicum where the only difference lies in using 5-Mb instead of 1-Mb bins, due to the larger size of its chromosomes (see Additional file 5).
Validation metrics
The measure we used to evaluate the resolution of BREC's HCB is called shift hereafter. It is defined as the difference between the observed heterochromatin boundary (\(observed\_HCB\)) and the expected one (\(expected\_HCB\)) in terms of physical distance (in Mb)(see Equation 2).
$$\begin{aligned} shift = | observed\_HCB - expected\_HCB | \end{aligned}$$
(2)
The shift value is computed for each heterochromatin boundary independently. Therefore, we observe only two boundaries on a telocentric chromosome (one centromeric and one telomeric). In comparison, we observe four boundaries in the case of an atelocentric chromosome (two centromeric giving the centromeric region and two telomeric giving each of the two telomeric regions).
The shift measure was introduced not only to validate BREC's results with the reference equivalents but also to empirically calibrate the DQC module, where we are mostly interested in the variation of its value as per variations of the quality of input data.
Implementation and Analysis
The entire BREC project was developed using the R programming language (version 3.6.3/2020-02-29) and the RStudio environment (version 1.2.5033).
The graphical user interface is build using the shiny and shinydashboard packages. The web-based interactive plots are generated by the plotly package. Data simulations, result analysis, reproducible reports, and data visualizations are implemented using a large set of packages such as tidyverse, dplyr, R markdown, Sweave and knitr among others. The complete list of software resources used is available on the online version of the BREC package accessible at https://github.com/GenomeStructureOrganization.
From inside an R environment, the BREC package can be downloaded and installed using the command in the code chunk in Additional file 19. In case of installation issues, further documentation is available online on the ReadMe page of the GitHub repository. If all runs correctly, the BREC shiny application will be launched on your default internet browser (see Shiny interface screenshots in Additional file 14).
All BREC experiments have been carried out using a personal computer with the following specs:
-
Processor: Intel\(^{\text{\textregistered}}\) \(\hbox {Core}^{\mathrm{TM}}\) i7-7820HQ CPU @ 2.90GHz x 8
-
Memory: 32Mo
-
Hard disc: 512Go SSD
-
Graphics: NV117 / Mesa Intel\(^{\text{\textregistered}}\) HD Graphics 630 (KBL GT2)
-
Operating system: 64-bit Ubuntu 20.04 LTS
Description of main components of the Shiny app
Build-in dataset
Users can either run BREC on a dataset of 44 genomes, mainly imported from [41], enriched with two mosquito genomes from [42] and updated with D. melanogaster Release 6 from FlyBase [26] (see Additional files 20 and 21), already available within the package, or, load new genomes data according to their own interest.
User-specific genomic data should be provided as inputs within at least a 3-column CSV file format, including for each marker: chromosome identifier, genetic distance, and physical distance, respectively. On the other hand, outputs from BREC running results are represented via interactive plots.
GUI input options
The BREC shiny interface provides the user with a set of options to select as parameters for a given dataset (see Fig. 3a). These options are mainly necessary in case the user works on his/her own dataset and this way the appropriate parameters would be available to choose from. First, a tab to specify the running mode (one chromosome). Then, a radio button group to choose the dataset source (existing within BREC or importing new dataset). For the existing datasets case, there is a drop-down scrolling list to select one of the available genomes (over 40 options), a second one for the corresponding physical map unit (Mb or pb) and a third one for the chromosome ID (available based on the dataset and not the genome biologically speaking). While for the import new dataset case, three more objects are added (see Fig. 3b); a fileInput to select csv data file, a textInput to enter the genome name (optional), and a drop-down scrolling list to select the data separator (comma , semicolon or tab character -set as the default-). As for the Loess regression model, the span parameter is required. It represents the percentage of how many markers to include in the local smoothing process. There is a numericInput object set by default at value 15% with an indication about the range of the span values allowed (min = 5%, max = 100%, step = 5%). The user should keep in mind that the span value actually goes from zero to one, yet, in a matter of simplification, BREC handles the conversion on its own. Thus, for example, a value of zero basically means that no markers are used for the local smoothing process by Loess, and so, it will induce a running error. Lastly, there is a checkbox to apply data cleaning if checked. Otherwise, the cleaning step will be skipped. This options could save the user some running time if s/he already have a priori knowledge that a specific genome's dataset has already been manually curated). The user is then all set to hit the Run button. BREC will start processing the chromosome of interest by identifying its type (telocentric or atelocentric). Since this step is quite difficult to automatically get the correct result, the user might be invited to interfere via a popup alert asking for a chromosome type confirmation (see Fig. 3b). As shown in Additional file 14a, all available genomes could be accessed from the left-hand panel (in dark grey) and specifically on the tab " Genomic data " where two pages are available: " Download data files " which provides a data table corresponding to the selected genome on a scrolling list along with download buttons, and " Dataset details " displaying a more global overview of the whole build-in aata repository (see Additional file 14b). To give a glance at the GUI outputs, Fig. 3c shows BREC results displayed within an interactive plot where the user will have the an interesting experience by hovering over the different plot lines and points, visualising markers labels, zooming in and out, saving a snapshot as a PNG image file, and many more available options thanks to the plotly package.