Machine learning and statistics shape a novel path in archaeal promoter annotation

Background Archaea are a vast and unexplored domain. Bioinformatic techniques might enlighten the path to a higher quality genome annotation in varied organisms. Promoter sequences of archaea have the action of a plethora of proteins upon it. The conservation found in a structural level of the binding site of proteins such as TBP, TFB, and TFE aids RNAP-DNA stabilization and makes the archaeal promoter prone to be explored by statistical and machine learning techniques. Results and discussions In this study, experimentally verified promoter sequences of the organisms Haloferax volcanii, Sulfolobus solfataricus, and Thermococcus kodakarensis were converted into DNA duplex stability attributes (i.e. numerical variables) and were classified through Artificial Neural Networks and an in-house statistical method of classification, being tested with three forms of controls. The recognition of these promoters enabled its use to validate unannotated promoter sequences in other organisms. As a result, the binding site of basal transcription factors was located through a DNA duplex stability codification. Additionally, the classification presented satisfactory results (above 90%) among varied levels of control. Concluding remarks The classification models were employed to perform genomic annotation into the archaea Aciduliprofundum boonei and Thermofilum pendens, from which potential promoters have been identified and uploaded into public repositories. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04714-x.

processes include the transcription of an RNA molecule. This mid-step operation is carried out by the RNAP enzyme and configures a central process in the genetic information flux across all domains. The way the transcription occurs in archaea roughly resembles the eukaryotes [5]. In fact, these two domains are evolutionary siblings and archaea might have given origin to eukarya [6]. The overall structure of this process in these two domains presents a certain level of conservation. Indeed, the eukaryotic model poses as a more specialized version of its archaeal counterpart. For instance, while archaea employs a single RNAP to transcribe all genes, animals and plants make use of three and five different enzymes, respectively [7,8].
The recruitment of RNAP to the DNA is mediated by a DNA segment defined as a promoter sequence whose presence is necessary for the initiation of the transcription. The typical archaeal promoter element possesses three basal transcription factor binding sites. These additional proteins are TATA-box Binding Protein (TBP), Transcription Factor B (TFB), and Transcription Factor E (TFE) and they are needed for correctly directing RNAP to its precise site of action [9]. On a nucleic acid level, these proteins bind to: (i) a wTTATwww set of nucleotides, located at − 25, matching the TBP binding site, where w means A or T in the IUPAC code; (ii) an ssnAA sequence located around two nucleotides upstream TATA and a TAC sequence located in the range of − 1/ − 10, due to its two-extremity binding, TFB stabilizes TBP and the two combined create the Pre Initiation Complex (PIC), where s means C or G and n means any nucleotide in the IUPAC code; (iii) a TFE protein has the function of assisting PIC formation, hence, its binding preference varies according to the promoter and organism [9,10].
The conservation found around the binding site of transcription factor proteins in the archaeal genome might be used as input in a way that the recognition of these regulators is able to provide a more reliable annotation. The promoter prediction task is well developed in other branches of life than archaea. Such tools have succeeded in classifying these regulators in eukarya and bacteria. However, due to the particularities archaea have, a universal promoter classifier is an open scientific question.
In this work, we systematically locate the potential promoters of unannotated archaea by using their structural properties in comparison to random and sequences where no promoters have been identified. To do so, the well-conserved nature of archaeal promoters is employed and stressed.

Promoter sequences
A total of 3630 experimentally validated promoters of three different archaea were employed in this study: 1340 sequences of Haloferax volcanii, 1048 sequences of Sulfolobus solfataricus, and 1248 sequences of Thermococcus kodakarensis. These are model organisms in the Euryarchaea and TACK superphylum. These organisms were selected because there is available transcriptome information, which enables the extraction of promoter sequences associated with a given transcript.
The original data contains 1001 nucleotides per sequence with their Transcription Start Sites (TSS) mapped. Only primary TSS (pTSS) from the published transcriptomic data was considered. Next, a sub-sequence containing 100 nucleotides, i.e. − 80 to + 20 was extracted. This region comprises the reported core promoter in H. volcanii and S. solfataricus [11,12] and it has been reported as sufficient to initiate transcription in archaea [12]. Furthermore, the precise location of these organisms' promoters was reported to be located in the proposed range [12][13][14][15][16]. Annotations and lists of the promoters used in this study are available at https:// zenodo. org/ record/ 51375 51.

Control datasets
The classification methods of this study were stressed with three forms of control. First, we, through a self-developed Python script, shuffled the 100-nucleotides original sequences. A second control dataset was used by selecting the downstream sequences from + 21 to + 121. By this, we wanted to test the validity of our method by assessing sequences that do not indicate promoter activity nor have a TATA-box; and finally, we performed a second method for shuffling sequences, proposed by [17]; i.e., we divided the 100 nucleotide sequences into 20 blocks of 5 nucleotides each, then, we shuffled 12 of the blocks. By doing this, the consensual motifs such as TATA-boxes might be preserved in a way that our identification method is tensioned.

Structural parametrization
The totality of the sequences of this study (promoters and controls) were submitted through a structural coding in order to represent genetic information into numeric attributes. This representation captures specific sequence properties associated with regulatory regions such as promoters [18]. The parameter chosen for this study is DNA Duplex Stability (DDS). This feature has been employed as a way to represent the richness of GC base-pairs due to their extra hydrogen bond [19][20][21][22][23]. In this regard, Eq. 1 was used to calculate the DNA duplexes reported in [21]. It hinges on the assignment of a numeric attribute in sliding dinucleotide windows.

Classification through a statistical approach
Firstly, position-specific slices of 8 nucleotides (6 nucleotides matching the TATA-box, a spacer of 2 nucleotides, and 2 nucleotides comprising the BRE element) were extracted and averaged in each dataset (promoters, and three controls). Then, an interval was set ranging from the plus and minus values of the standard deviation formed upon the promoter dataset (Eq. 2).
Finally, a sequence was labeled as a promoter if its TATA + BRE nucleotides belongs to the range of Interval . Otherwise, it was classified as a non-promoter. A visual representation of the statistical method for classifying archaeal promoter sequences is available at: https:// doi. org/ 10. 5281/ zenodo. 51541 10.

Classification through an artificial neural network approach
In order to validate the simulation process, a k-fold-cross validation method was considered, where k = 10. This method involves in reserving 1/10 of the dataset to be used in the (2) Interval = x promoter ± σ promoter testing. The training is done with the remaining 9/10 shares. This grants that any biased data point gets covered. The validation process was done following the sample method in R [24].
Artificial Neural Network (ANN) simulations took place in the R environment through the neuralnet package [25]. The algorithm chosen to fit the ANN was the resilient backpropagation, since it has already succeeded in classifying genomic data [26]. The number of neurons in the hidden layer was set to 2 since a too complex curve to fit data points might be seen as a non-productive decision in machine learning [27]. The number of iterations over the training dataset, i.e. epochs, was increased until the validation and training errors kept dropping [28]. Finally, the maximum number of steps the ANN was allowed to reach until convergence was 200,000 as an attempt to balance computational costs. The R script that performed the ANN simulation is available at https:// github. com/ gusta vsgan zerla/ ANN---Archa eal-class ifica tion. git.

Assessment of classification
A binary classification might get assessed through an error matrix with predicted and actual values of a classification; this enables the performance of a classification technique to be evaluated. The elements that compose the matrix belong to two classes, positives and negatives and they are: True Positives (TPs), which correspond to the number of elements of a d class correctly predicted as a member of d class; True Negatives (TNs), which are elements that do not belong to a d class and have been assigned as non-d class; False positives (FPs), which comprises a member of d class classified as a non-d class; and False negatives (FNs), which encompasses members that do not belong to a d class and have been assigned as d class. With the error matrix, it is possible to calculate performance metrics of a binary classification.
The first metric is Accuracy (Eq. 3), which measures the proportion of correct predictions in the whole dataset (both TPs and TNs).
The second metric is Precision that verifies how many of the observations predicted as positive are actually positive and it is calculated through Eq. 4.
Next, Recall, which assesses how many of the TPs are actual TPs, is obtained through Eq. 5.
Finally, Specificity that calculates the detection rate of TNs throughout the entire dataset. It is obtained by Eq. 6.

Validation of the methods
To provide a validation for the methods proposed in this study, upstream sequences whose promoter activity has not been experimentally described yet, were downloaded from the RSAT prokaryotic database (http:// embnet. ccg. unam. mx/ rsat/) in its Sep 23 12:30:06 2021 version. The database contains upstream regions for 211 archaeal organisms. Two archaeal genomes, Aciduliprofundum boonei (741 sequences of 400 nucleotides each) and Thermofilum pendens (1926 sequences of 400 nucleotides each) exhibited a promoter-like profile in a previous study [29], i.e. the codification into DDS of upstream regions was found to be statistically similar to experimentally validated promoters, indicating that these particular upstream regions might contain promoters. Therefore, 400 nucleotides sequences got their TATA-box and TFB sites extracted. The nonparametric Kruskal Wallis test was employed to determine if the groups of experimental and potential promoters hold statistical differences. Finally, lists of annotated potential promoters of these two organisms are provided. A flowchart describing the classification method and the validation of the findings is described in Fig. 1.

DNA duplex stability parametrized archaeal promoters differ from control sequences
In order for getting the binding sites of transcription factor proteins represented by numeric inputs, genetic information was coded into DDS. Promoter sequences have already been well represented by DDS [19]. Concerning the coding of genetic information into DDS as well as locating areas of interest for turning promoters unmatched, Fig. 2 has been provided. The plotting of promoter sequences and their negative controls reveal that the binding site of transcription factor proteins is only Fig. 1 Overview of the classification rationale employed in this study. The figure is divided into i, ii, iii (train) and iv (test). (i) represents the conversion of genetic information into numeric attributes related to DDS, which is used as input of two classification methods. (ii) matches the Artificial Neural Network phase of classification; (iii) conveys information of how the classification was achieved through statistics. Both ii and iii were performed with experimentally verified promoters. Finally, the test, (iv) represents the validation process with upstream sequences whose promoters have not been identified yet. Each sequence undergoes through i, ii, and iii; then, the final decision is computed whether the sequence is a promoter or not found within promoters; these areas are observed in the promoter line, around positions − 28, − 32, and in the range of − 10 to + 1.
The initiation of transcription in archaea has been reported to need two transcription factor proteins: a TBP and a TFB, homolog to eukaryotic TFIIB [7,10]. Additionally, a second strong signal was observed around positions − 10 and + 1, matching the Proximal Promoter Element. This area consists of the binding site of a protein namely TFE, which has been reported to optimize the transcription in archaea by stabilizing the formation of a PIC [30]. Considering these organisms have limited genomes and need to have their metabolic demands matched in order to thrive, the presence of transcription optimizer proteins such as TFE plays a pivotal role in the gene expression.
Next, we have observed conserved binding sites of promoter recruitment transcription factor proteins in archaea with varied GC content (H. volcanii = 66.13%, T. kodakarensis = 50.67%, and S. solfataricus = 34.48%), More GC would indicate less potential binding sites for such proteins, as reported in [29]. However, our rationale has been able to find the binding site despite the amount of GC in a particular archaeon, Therefore, the binding site of these three proteins are clear in the plots representing the promoters in all organisms, suggesting that DDS succeeded in well representing promoter elements in archaea.

Statistical classification succeeds in the distinction of promoter sequences
In order to promote a classification method, the mean values of TATA + BRE sites of promoter sequences as well as three levels of control were converted into DDS. For each observation, the average of the promoter sequence differs from the three levels of control, where the closer from the promoter score is the shuffled sequence, followed by the downstream, and finally, the block shuffling process ( Table 1).
The dissimilarity observed in the promoters and the three forms of controls in Table 1 enables the statistical form of classification proposed in this study. The promoter interval was ranged and all the sequences got their data classified into promoter or nonpromoter. The results were then computed on an error matrix (Table 2), from which the precision value remains the same in every organism against their controls, since its calculation relies on positive values. The assessment of Table 2 indicates a higher recall value. The most satisfactory scores were achieved by the block form of control whilst the last was found in shuffled sequences, with the exception of S. solfataricus, in which downstream drags behind shuffled.
The statistical method of classification reported in this study has proven satisfactory in a way that it did not employ techniques encompassing machine learning. Firstly, the lower Precision value in the method suggests the model classified too many False Positives, this means non-promoters were classified as promoters in some instances. A reason for this to happen is the diversification found in archaea [2] and the dissimilarity in owning conserved binding sites [29]. Secondly, the most fine counts were achieved in the block form of control, matching the identification of Table 1, in which Table 1 Flagships of archaeal classification based on statistics The results displayed in Table 1 have information on the averaged values of the region encompassing TBP and TFB (BRE extremity) for four datasets: (i) experimentally verified promoters; (ii) control done in blocks; (iii) downstream sequences as control and; (iv) control with shuffled sequences. The standard deviation for the average of the promoters is also provided  the means of the blocks are the furthest from the promoters. Additionally, the method has presented satisfactory scores regarding recall, a metric that is sensible towards False Negatives. This phenomenon is explained due to the transcription machinery of different archaea being quite similar. If a sequence that lacks conserved binding sites of TBP and TFB, it is very unlikely to be classified as an archaeal promoter. The stress of the statistical model, brought by an inter-archaea classification (Table 3), similarities have been found in S. solfataricus and T. kodakarensis, confirming what was proposed by Takemasa et al. [31] in order to turn T. kodakarensis and S. solfataricus as regulatory chassis for hyperthermophilic archaea. These two particular organisms were reported to be similar in terms of their AT% throughout the genome [29], while H. volcanii has higher GC. We found the nucleotide composition directly affects the classification outcome, since it relies on conserved binding sites of transcription factor proteins.
Statistics has been proved as an adequate way to classify promoter sequences of archaea. This method is highlighted to its ease to implement, since it does not require extensive computational costs. Indeed, descriptive statistics is seen as a precursor of machine learning in classificatory nature [32].

Artificial Neural Network conveys a sturdier classification
In order to achieve more robust classification scores, ANNs were used. In the ANN simulation, the architecture that protruded satisfactory scores follows: (i) seven neurons in the input layer; (ii) two neurons in the hidden layer, and; (iii) one neuron in the output layer. Table 4 indicates the results achieved by the ANN simulation, with a default tradeoff value of 0.5 in computing the output of the model. The four parameters tested in the classification (Accuracy, Precision, Recall, and Specificity) are evenly spread among different forms of control. For a mean of the three forms of control against each classification parameter, please see Additional file 1: Table S1, in which Table 3 Results of the inter-organism statistical method of classification The classification rationale (formed upon the interval of mean ± standard deviation) was tested in different organisms. The results displayed in each cell have information of the averaged results of the three forms of control. The test of an archaeon with its own rationale for statistical classification was omitted, for it, see  Fig. 3. A second application of ANNs was conducted to evaluate if the pattern of one archaeon might be employed to classify another. Following this rationale, a new simulation was achieved in which the ANN was trained with one organism and tested with Table 4 Results of the ANN-based classification Each cell of this table contains the performance achieved by the best epoch for weight updating across the training dataset, i. e. the epochs were no longer increased when the convergence error became stable. For more details on the ANN simulation, see "Classification through an artificial neural network approach" and "Artificial Neural Network conveys a sturdier classification" sections

Accuracy (%) Precision (%) Recall (%) Specificity (%)
H.  . 3 ROC curves for the best ANN simulation in each archaeon. Once the best architecture for classifying archaea (each organism is presented in a separate panel) with neural networks was defined, the classification threshold was adjusted to produce ROC curves. The default output neuron yields a value and if it's bigger than 0.5, it gets classified as a promoter, otherwise, it is classified as a non-promoter. Each tick in the ROC curves represents an adjusted decision threshold, varying from 0 (x axis = 0, y axis = 0) to 1 (x axis = 1, y axis = 1) another. The results of this new simulation are available in Table 5, from which there is a leaning towards S. solfataricus and T. kodakarensis. The H. volcanii logistics produced classification results far distant from each other (e.g. from 60.87% recall to 96.23% specificity in a crossing of S. solfataricus and H. volcanii).
The results brought by the ANN classification suggest the model succeeded in classifying archaeal promoters, distinguishing them from three variations of control. In fact, this machine learning approach has succeeded in encountering promoters [17,22,23]. An implementation of similar nature was performed in [23] through the classification of bacterial promoters. The results obtained in this present study outperformed the bacterial classification because of the structure of the archaeal promoter in comparison to bacteria, which contains sigma factor proteins to direct RNAP to specific sites.
By outshining the statistical classification, the mathematical robustness of the ANN method [33] has proven uneven. Also, the rationale found in such method has matched the statistical classification, but overcame it. A good indicator to observe prediction validity is brought by ROC curves, which plots the specificity cost in gaining more recall [34]. The most evident characteristics are observed in the block control, which is found in the upper left corner of the plotting areas, confirming what the statistical analysis has found and validating the findings of Table 4 and Additional file 1: Table S1. The evenly spread scores (not fluctuating more than 1% in the metrics of each archaeon) certify the success of classification of ANN, suggesting the conservation protruded by a DDS codification of transcription factor binding sites has sufficiently turned promoter sequences unique.
The verification of inter-organism rationale of classification has evidenced that S. solfataricus and T. kodakarensis share similarities, evidenced by the acceptable classification scores between these two archaea. The high values of recall observed in H. volcanii vs. S. solfataricus and T. kodakarensis suggest that very few False Positives were identified, meaning that it was rare for the model to incorrectly classify H. volcanii promoters, this is due to the divergent amount of GC in this organism, reported in [29]. The Table 5 Results of the inter-organism ANN method of classification In this classification, the architecture that was trained with data of one archaeon and its controls was tested with other organisms and their controls. The results displayed in each cell have information of the averaged results of the three forms of control. The testing data of an archaeon with its own ANN architecture was omitted, for it, see To validate the newly identified promoters, they have been compared with experimentally verified promoters. In this sense, Fig. 3 holds information of the DDS profile of A. boonei and T. pendens as well as the other three archaea. In Fig. 4, there is a conserved The method proposed in this study was able to hand in regulatory annotation upon the genomes of A. boonei and T. pendens. To do so, we systematically characterized promoters from well-known archaea [29] and systematically used the algorithmized information in order to locate promoters in unannotated upstream regions of these organisms.
Many factors such as the diversity of archaea and their relatively recent discovery creates the need for high quality genome annotation. This is the moment when in-silico approaches provide help to experimental biology by curating data [20]. The boxplots portrayed in Fig. 5 showed two groups of organisms. No taxonomic inferences, i.e., boxplots with similar averages, could be made upon these since T. kodakarensis and A. boonei are Euryarchaea while S. solfataricus and T. pendens belong to the Crenarchaeota division, the statistical resemblance of these organisms requires further analysis. We also suggest using the model of H. volcanii in order to locate promoters in archaea that have high GC content. The statistical similarity found between verified and potential promoters advocate the robustness of the method proposed.

Concluding remarks
The results gathered in this study reveal the classification of promoter sequences in archaea susceptible to the percentage of GC in specific organisms. Moreover, the classification indicates a novel way of predicting promoter sequences in unannotated archaeal genomes through a combination of artificial neural networks and statistics. In this Fig. 5 Boxplots of promoter sequences of five different archaea. Promoter sequences from five archaea were converted into DDS. Each position (− 80 to 0) had its value averaged to produce the boxplots. H. volcanii has been placed apart from the other organisms due to its higher GC content. Groups of promoters consisting of A. boonei and S. solfataricus and T. pendens and T. kodakarensis were observed