Programming language and data sets
RhoTermPredict algorithm was written in the Python (v.3.6) language [32], and requires libraries BioPython, numpy, re and openpyxl. It accepts as input bacterial genome sequences, and provides as output the coordinates of putative Rho-dependent terminators elements (RUT and RNAP pause sites) with a score assigned to them that indicates the probability that the extracted region actually corresponds to a Rho-dependent terminator (see below for the scoring assignment method). For Rho-dependent terminator predictions, we used available genomic sequences of the model microorganism E. coli K-12 substr. MG1655 (National Center for Biotechnology Information, accession code NC_000913.3) (see below). For the prediction quality evaluation, we used E. coli K-12 genomic annotation [33] together with relative RNA-Seq data (see below) and a list of Rho-dependent transcription terminators obtained by [18] (see below). In addition to E. coli K-12, with the purpose of evaluating the algorithm performances on other bacterial genomes, we run RhoTermPredict on genome sequences from B. subtilis 168 (NC_000964.3) [17]. In fact, although the transcription termination mechanisms in B. subtilis have remained poorly defined for a long time, recently the action of Rho has been verified in it [17, 21]. The procedure to search for putative Rho-dependent terminators is reported below.
Procedure to search for putative rho-dependent terminators
RhoTermPredict actually searches for some mandatory elements (putative RUT and RNAP pause sites) and for other optional elements whose presence could increase the prediction score. RhoTermPredict search is not based on “nucleotide sequence homology”, but rather on “conserved motifs in the nucleotide sequences”, and these motifs were inferred from literature data on RUT sites and on RNA Polymerase pausing sites. In fact, a consensus motif common to all Rho-dependent transcription terminators has been previously proposed in E. coli and S. enterica [5, 9, 28].
On that basis, we elaborated the following two-step procedure to detect Rho-dependent terminators (Fig. 1): i.) The first step is the identification of the putative “RUT site”. To do this, the algorithm scans a window of 78 nt over the query sequence, one nt at a time, until the C/G ratio of the window exceeds the threshold value of 1 and with regularly spaced cytosine residues within the window (every 11–13 nt). Then, by scanning a window of 128 nt (starting from the previous position where the C/G content of the window reached a value > 1), the 78 nt long region with maximal C/G content and with regularly spaced cytosine residues (herein referred to as RUT site) is selected. Therefore RhoTermPredict searches for a putative 78 nt long RUT site characterized by a maximal C/G content (in any case greater than 1) localized in a 128 nt long region (128 stands for 78 + 50: 78 is the extension of the RUT site, 50 an arbitrary value chosen to maximize the C/G content of the RUT site). This consensus motif and its extension (78 nt) fit well with structural and functional properties of Rho hexamer and its interaction with C-rich RNA sequences at the level of its primary RNA binding domain [5, 34,35,36]. This binding leads to the positioning of the RNA into the secondary RNA binding domain, which in turn activates the ATPase for its translocase/RNA–DNA helicase functions [6, 37]. ii.) The second step is the identification of a putative pause site for RNAP in a region extended up to 150 nt downstream from the 3′-end of the selected RUT site. The RNAP pause sites searched were hairpin structures [38, 39] (with a GC-rich stem and a loop constituted by 4–8 nt). Alternatively to hairpin structures, in the same 150 nt long region, we considered as putative RNAP pause site the presence of the consensus pause-inducing sequence element G− 11G− 10(C/T)− 1G+ 1 (where −1 corresponds to the position of the RNA 3′ end) identified by [40, 41]. However, the presence of the previous element close to a putative hairpin structure (precisely within the hairpin structure extended up to 5 nt upstream from its 5′ end and 5 nt downstream from its 3′ end) provides a higher score to the prediction.
RhoTermPredict also allows to predict multiple putative terminators in a single query region, and to search for terminators in both strands.
Procedure for rho-dependent terminator scoring assignment
The maximum score that could be assigned by our algorithm to a terminator prediction is 15, while the minimum is 6 (a minimum of 3 point for the RUT site and also a minimum of 3 points for a pause site). An addition of 1 point is assigned if the C/G ratio of RUT site > 1.25, of 2 points if such ratio > 1.5, of 3 if > 2. Regarding to the hairpin structure as predicted pause site, an other point is attributed if the GC-content of the hairpin stem is > whole genome GC-content + 10, instead 2 points if it is > whole genome GC-content + 20 (because it is known that the hairpin stem is GC-rich) while an extra 0.5 point it is assigned if the hairpin loop length is < 6 nt or if the hairpin stem length > 4. Finally 3 extra points are assigned if the consensus pause-inducing sequence element G−11G− 10(C/T)−1G+ 1 is present near the putative hairpin pause site.
Dataset of rho-dependent terminators, and construction of positive and negative sequence datasets
To test the reliability of RhoTermPredict predictions we used a total of 1264 regions containing Rho-dependent termination sites (defined as “BCM significant transcripts”, BSTs) obtained by [18] growing up the microorganism E. coli K-12 with or without the specific Rho inhibitor bicyclomycin at a concentration that reduces Rho function without affecting the rate of cell growth. Therefore a differential expression of these BSTs regions indicates the presence of a Rho-dependent transcription terminator.
Starting from the genome of E. coli K-12 (NC_000913.3) and the just-indicated BSTs regions, we generated a Rho-dependent terminator set (positive set) consisting of 300 nt long sequences immediately upstream all the BSTs regions. The positive set consisted of 1264 sequences. Instead, we constructed the negative set with all the intergenic regions (IRs) < = 300 nt and > = 200 nt in length in which terminators were not expected. To do this, we considered all the IRs that separated two divergently oriented coding sequences (CDSs). The negative set consisted of 195 sequences.
For B. subtilis 168 we produced a positive and a negative set of sequences in the same way as for E. coli K-12. In this case, the positive set was represented by 34, 300 nt-long, sequences that were immediately upstream from the genomic regions, obtained by [17], relative to extended mRNAs in the mutant strain of B. subtilis 168 lacking the termination factor Rho. As for E. coli K-12, as a negative set of sequences we considered all the IRs < = 300 nt and > = 200 nt in length separating two divergently oriented CDSs. For B. subtilis 168, the negative set consisted of 149 sequences.
Estimation of RhoTermPredict performances
RhoTermPredict performances were evaluated by using the following statistical measures:
$$ \mathrm{Recall}\ \left(\mathrm{sensitivity}\ \mathrm{or}\ \mathrm{the}\ \mathrm{true}\ \mathrm{positive}\ \mathrm{rate}\right)=\mathrm{TP}/\left(\mathrm{TP}+\mathrm{FN}\right) $$
$$ \mathrm{Precision}\ \left(\mathrm{the}\ \mathrm{positive}\ \mathrm{predictive}\ \mathrm{value}\right)=\mathrm{TP}/\left(\mathrm{TP}+\mathrm{FP}\right) $$
$$ \mathrm{Specificity}\ \left(\mathrm{the}\ \mathrm{true}\ \mathrm{negative}\ \mathrm{rate}\right)=\mathrm{TN}/\left(\mathrm{TN}+\mathrm{FP}\right) $$
$$ \mathrm{Accuracy}\ \left(\mathrm{the}\ \mathrm{fraction}\ \mathrm{of}\ \mathrm{samples}\ \mathrm{correctly}\ \mathrm{classified}\right)=\left(\mathrm{TP}+\mathrm{TN}\right)/\left(\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}\right) $$
$$ {\mathrm{F}}_1-\mathrm{score}\ \left(\mathrm{the}\ \mathrm{harmonic}\ \mathrm{mean}\ \mathrm{of}\ \mathrm{Precision}\ \mathrm{and}\ \mathrm{Recall}\right)={2}^{\ast }{\mathrm{Precision}}^{\ast}\mathrm{Recall}/\left(\mathrm{Precision}+\mathrm{Recall}\right) $$
where TP = True positives, FP = False positives, FN = False negatives and TN = True negatives.
We considered as either true positive (TP) or false positive (FP) any sequences of either the positive or the negative set in which the algorithm predicted a terminator, respectively. Importantly, at most one TP was considered for each sequence of the positive set. We considered as either true negative (TN) or false negative (FN) any sequences of either the negative or the positive set in which the algorithm did not predict any terminator, respectively.
RNA-sequencing, reads mapping quality assessment
Ribosomal RNAs were depleted using RiboZero Gram negative kit (Epicentre,Illumina) and strand-specific sequencing libraries were constructed using the ScriptSeqTM v2 RNAseq library preparation kit. After that, the purified cDNA library was sequenced on an Illumina GAIIx/Solexa or MiSeq platform (Illumina, San Diego, CA) with 76-bp paired-end reads. The BAM files for each condition analysed are publicly available at Sequence Reads Archive (SRA) under accession number BioProject PRJNA483864. Alignment to the reference strain of E. coli K-12 genome (Ref Seq NC_000913) was done using bowtie2 (with sensitive options, corresponding to -D 15 -R 2 -L 22 -i S,1,1.15; see the Bowtie2 manual for the explanation of the flags –D,-R,-L,S, http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#reporting). Evaluation of strand specificity and gene coverage was performed using BEDTools (v2.20.1) and SAMtools (v0.1.19). Wig files were generated from the aligned BAM files by using BEDTools (v2.20.1). To avoid bias caused by multi-mapping reads the non-deterministic option and end-to-end mode were used to force a single assignment of multi-mapping reads to the best scoring region (if present) or in the case of regions with identical scores reads were randomly assigned. Mapped reads with MAPQ (mapping quality) greater than 30 were analyzed to determine the read counts per protein-coding gene.
Genome-wide analysis and validation with RNA-Seq data
We also run RhoTermPredict on the whole genome of E. coli K-12 to test its genome-wide predictive performances. To validate these predictions, we used the RNA-Seq data above described. We considered the predicted regions as putative Rho-dependent terminators if they were in positions in which there was a negative transcription gradient in RNA-Seq data, as suggested in other works [17, 21].
Precisely, we considered, in this case, a prediction as TP if there was a decrease of read value by a factor of at least 1.5 between the read value at the putative RUT site 5′-end point and the read count value 150 nt downstream from putative RUT site 3′-end point (hence after the putative RNAP pause site), the first with a read count value ≥10. We considered this constraint in order to analyze only the predictions near an expressed DNA region in the RNA-Seq data. In fact, a prediction close to a not-expressed DNA region could be incorrectly considered as a FP.