  • Methodology Article
  • Open access

Transductive learning as an alternative to translation initiation site identification

Abstract

Background

Correctly identifying protein coding regions is an important open problem in molecular biology. It is challenging because knowledge of the underlying biological systems is limited and the conserved features of messenger RNA (mRNA) are poorly characterized. Computational methods that help discover patterns for identifying Translation Initiation Sites (TIS) are therefore needed. In Bioinformatics, machine learning methods based on inductive inference, such as the Inductive Support Vector Machine (ISVM), have been widely applied. Much less attention has been given to machine learning methods based on transductive inference, such as the Transductive Support Vector Machine (TSVM). Transductive inference performs well on problems in which unlabeled sequences considerably outnumber labeled ones. TIS prediction may likewise benefit from transductive methods, because the number of new sequences grows rapidly as the Genome Project enables the study of new organisms. This work therefore investigates transductive learning for TIS identification and compares its results with those of the inductive method.

Results

Transductive inference yields better F-measure and sensitivity than the inductive method for predicting the TIS. It also has the lowest failure rate in identifying the TIS, producing fewer False Negatives (FN) than the ISVM. The ISVM and TSVM methods were validated with molecules from the most representative organisms in the RefSeq database: Rattus norvegicus, Mus musculus, Homo sapiens, Drosophila melanogaster and Arabidopsis thaliana. The transductive method achieved F-measure and sensitivity above 90%, higher than the values obtained with ISVM. Both approaches were implemented in the TransduTIS tool, as TransduTIS-I (ISVM) and TransduTIS-T (TSVM), which is available through a web interface. They were compared with the TISHunter, TIS Miner and NetStart tools, with satisfactory results.

Conclusions

In terms of precision, the ISVM and TSVM classifiers give similar results. However, the TSVM approach brings an improvement, especially in F-measure and sensitivity. Moreover, the results point to a promising application of TSVM: organisms in the initial phase of study, with few identified sequences in the databases.

Background

Cells use the transcription and translation processes to interpret and express their genetic information [1]. Only a portion of the transcribed messenger RNA (mRNA), called the Coding Sequence (CDS), is translated into protein. Correctly identifying the protein coding region is one of the main problems in molecular biology, and it motivates the search for conserved features in the mRNA sequence that enable the detection of a CDS region.

In eukaryotes, the CDS region is delimited by a start codon and a stop codon. The start codon, usually the AUG triplet and also known as the Translation Initiation Site (TIS), marks the start of protein synthesis, one of the most important processes in the regulation of gene expression [2]. Translation often begins at the first occurrence of an AUG codon [3], but it can also begin at other codons, as indicated in [4]. Similarly, the stop codon, identified by one of the triplets UAA, UAG or UGA, marks the end of the translation process.

The translation initiation site directly influences the protein produced and may alter its structure and function in the cellular environment. The lack of known conserved features for identifying the translation initiation site makes TIS prediction a complex problem.

The scanning model in eukaryotes [5] assumes that the ribosome initially binds to the mRNA sequence at the 5′ end and moves toward the 3′ region. In [3], the authors establish the concepts of upstream and downstream regions and of the reading phase of the mRNA sequence by the ribosome during protein production. This process is illustrated in Fig. 1.

Fig. 1 Representation of an mRNA sequence according to the scanning model in eukaryotes

Identifying the TIS is a non-trivial task because mRNA molecules can contain, depending on the organism, thousands of nucleotides, and because translation is driven by an intracellular context that is difficult to simulate. Additionally, the identification process is a combinatorial problem of the order of \(4^{n}\), where n is the number of nucleotides considered in the analysis.

The task of predicting the TIS can be modeled as a binary classification problem: a sequence is positive when it contains a TIS and negative otherwise. However, TIS prediction induces a natural unbalance in the databases, since each mRNA sequence has only one AUG codon identified as the start codon (TIS), while all other AUG codons are identified as non-TIS (nTIS). For instance, the unbalance for the organisms Mus musculus and Rattus norvegicus is 1:23 and 1:131, respectively [6]. Such unbalance can be addressed by two approaches: oversampling and undersampling. Oversampling artificially generates samples of the minority class in order to balance the database. For instance, the SMOTE algorithm [7] follows this approach and can be applied to generate positive sequences (TIS), which form the minority class.

Furthermore, undersampling selects samples within the majority class in order to obtain approximately the amount of samples contained in the minority class. In [6], the authors introduced an undersampling method called M-Clus, which performs clustering of the samples contained in the majority class and selects the centroid or most significant elements from each cluster to integrate the database used to build the classifier. Thus, the number of clusters to be considered corresponds to the number of samples available in the minority class.

Both oversampling and undersampling approaches present problems due to the biological context modification. The first method generates artificial samples from the minority class, enabling the creation of samples possibly inconsistent with the class. Similarly, the second approach fails to consider samples from the majority class that may be relevant for classification. In order to deal with the loss of relevant information caused by undersampling, [6] propose a method of knowledge inclusion called inAKnow. This method classifies sequences from the downstream region using a previous model generated from sequences belonging to the upstream region. These new sequences are included in the final model building.

The approach used in this study avoids the unbalance problem inherent in TIS prediction by not treating every non-TIS occurrence of the AUG triplet as nTIS (negative class). From the biological point of view, AUG triplets found in the same reading phase as a TIS are more similar to the TIS class than to the nTIS class. Such similarity was verified in [8] by studying the translation mechanism in HIV mRNA molecules and identifying the restart of translation caused by the presence of an AUG triplet near a stop codon. Under this assumption, we use as nTIS only AUG codons that are upstream of the TIS and out of its reading phase.

In addition, because the inductive SVM classifier performs well on high-dimensional classification problems in different domains [9], it has often been used for TIS prediction. In the experiments carried out by [10] and [11], the inductive SVM gained accuracy in TIS prediction with kernel functions such as the locality-improved kernel and the Salzberg kernel, reaching an accuracy of 88.6% on the database used in [12]. The TISHunter program [13] proposes the use of edit kernel functions together with a methodology to control for redundancy in the genetic code, which converts the nucleotides of the downstream region into an amino acid sequence prior to SVM training. This methodology reached 99.9% accuracy on the same database used in [12] and 96.7% accuracy on human mRNA from the NCBI Reference Sequence (RefSeq) database [14]. Although the TISHunter predictor has produced very satisfactory results, it requires a specific kernel function. The approach proposed in this work uses the RBF function, which is a standard kernel in classification problems.

In addition, TISHunter is a TIS predictor and does not work as a classifier. In other words, for each mRNA molecule it indicates a single TIS, without classifying the other AUGs of the molecule. When the prediction is wrong, there is no indication of other AUGs that could be the TIS. This information is important for anyone who wants to promptly identify the beginning of translation. Besides that, the authors of [15] mention that the success of TISHunter depends on the existence of related proteins or cDNA sequences in the database. They also highlight that the kernel function, once determined for the training set, cannot be easily adapted. Therefore, there is a need for new approaches to TIS identification.

With the progress of the Genome Project [16], a growing number of molecules are sequenced and made available in the RefSeq database daily [14]. However, organisms with a small number of molecules, such as Nasonia vitripennis, which had only 35 Reviewed molecules available on 22nd April 2014, are a challenge for classification. In such cases, inductive inference does not possess enough information to train the model. Transductive inference, introduced by [17], is an alternative that overcomes this problem. The core idea behind transductive inference is to build a classifier using two data sets: 1) the original training set, which contains the already classified data, and 2) the prediction set, whose elements are not labeled yet. Thus, transductive inference has more information available for training than inductive inference and can be considered an alternative for solving the TIS prediction problem in a single process step.

Transductive inference can be classified as semi-supervised learning [18]. This kind of learning combines the categories of supervised and unsupervised learning methods. In machine learning, these two techniques are fundamentally different. Unsupervised learning seeks inherent patterns in an unlabeled data set. Its techniques are directly related to the density estimation problem in statistics, which aims to estimate the density function of a set of observed data.

Supervised learning aims to discover a mapping from x to y given a training set of pairs \((x_{i},y_{i})\), where \(y_{i} \in Y\) is called the label or target of sample \(x_{i}\), and \(Y = (y_{i})^{T}_{i \in [n]}\) represents the vector of labels in the training data. As in unsupervised learning, a requirement is that the pairs \((x_{i},y_{i})\) be sampled independently and identically distributed [11].

Semi-supervised learning techniques make use of unlabeled data during the training process. Generally, this kind of learning is useful where there is a small amount of labeled data and a large amount of unlabeled data, as in the TIS prediction problem, in which the unlabeled data are the new molecules whose TIS has not been identified yet. Note that the TIS identification process usually requires a human expert or biochemical experiments, which makes labeling expensive and complex. This reinforces the need for a technique that automates the identification of the TIS, such as the Transductive SVM (TSVM).

According to [17], the term “transductive” refers to a pattern recognition problem: given the classifications \(y_{i}, i=1,\ldots,l\), of the l labeled samples \(x_{1},\ldots,x_{l}\) from the training set, the goal is to discover the classification of the k unlabeled samples \(x_{l+1},\ldots,x_{l+k}\) from the prediction set. This contrasts with inductive inference, in which the goal is to find a function that describes the problem and then use it to classify the prediction set.

During the training process, the transductive learning algorithm has access to the l training vectors \(X_{train}\), their labels \(Y_{train}\) (Eq. 1), and the u unlabeled prediction samples \(X_{pred}\) (Eq. 2).

$$\begin{array}{*{20}l} X_{train} &= \{x_{t_{1}},\ldots,x_{t_{l}}\} \qquad Y_{train}=\{y_{t_{1}},\ldots,y_{t_{l}}\} \end{array} $$
(1)
$$\begin{array}{*{20}l} X_{pred}&=\{x_{p_{1}},\ldots,x_{p_{u}}\} \end{array} $$
(2)

Transductive learning uses the sets \(X_{train}\), \(Y_{train}\) and \(X_{pred}\) to predict the labels of the prediction samples (Eq. 3).

$$ Y_{pred}^{*}=\{y_{p_{1}}^{*},\ldots,y_{p_{u}}^{*}\} $$
(3)

The goal is to minimize the fraction of incorrect predictions on the prediction set (Eq. 4).

$$ Err_{pred}(Y_{pred}^{*}) = \frac{1}{u} \sum_{i=1}^{u} \delta_{0/1} (y_{p_{i}}^{*},y_{p_{i}}) $$
(4)

where \(\delta_{0/1}(y_{p_{i}}^{*},y_{p_{i}})\) is 0 if \(y_{p_{i}}^{*} = y_{p_{i}}\) and 1 otherwise, and \(y_{p_{i}}\) is the true label of the prediction sample \(x_{p_{i}}\).
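The error of Eq. 4 is simply the share of prediction-set labels that disagree with the true labels. A minimal sketch in Python follows; the function name and the label values (+1 for TIS, -1 for nTIS) are assumptions made for illustration and do not come from the TransduTIS implementation.

```python
def transductive_error(predicted, actual):
    """Fraction of prediction-set labels that differ from the true labels (Eq. 4)."""
    if len(predicted) != len(actual):
        raise ValueError("label lists must have the same length")
    if not predicted:
        raise ValueError("the prediction set is empty")
    # The 0/1 loss contributes 1 for every mismatch and 0 otherwise.
    mismatches = sum(1 for p, a in zip(predicted, actual) if p != a)
    return mismatches / len(predicted)


# Hypothetical labels: +1 marks a TIS, -1 marks an nTIS.
error = transductive_error([1, -1, 1, 1], [1, 1, 1, -1])
```

In practice the true labels of the prediction set are unknown at prediction time, so this quantity can only be measured on held-out data whose labels were hidden from the classifier.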

As previously mentioned, inductive methods are often used for TIS prediction, whereas the application of transductive methods has not been discussed in this context. Note that the main purpose of TIS prediction is to correctly identify positive AUG triplets (TIS), not necessarily to find an inductive function that represents the problem. It is important to emphasize that inductive methods may fail on new molecules, since the new sequences may have different characteristics from the sequences used to train the model. Transductive methods, on the other hand, readjust the model for each new sequence to be predicted. It is therefore relevant to consider and analyze the application of transductive inference to the TIS prediction problem.

Consequently, this work compares the behavior of the Transductive SVM (TSVM) and the Inductive SVM (ISVM) applied to the TIS identification problem. To do so, we consider two scenarios for the training set: the first uses 90% of the dataset for training and 10% for validation, and the second uses 10% for training and 90% for validation. The results show that the proposed approach based on transductive inference provides better F-measure and sensitivity than the inductive method for organisms with a smaller number of molecules (Rattus norvegicus and Mus musculus). The methods were tested with molecules from the most representative organisms in the RefSeq database: Rattus norvegicus, Mus musculus, Homo sapiens, Drosophila melanogaster and Arabidopsis thaliana. The transductive method achieved F-measure and sensitivity above 90%, higher than the values obtained with ISVM.

This paper is organized as follows. The “Methods” section describes the databases considered in this study and the procedures used in data preparation. The criteria for defining the window size for the extraction of positive and negative sequences are analyzed and discussed. That section also presents the definition of the SVM parameters and the adopted validation process. The “Results and discussion” section presents the comparison between the ISVM and TSVM classifiers and a comparative study with the NetStart, TISHunter and TIS Miner programs. Finally, the “Conclusions” section presents the final considerations.

Methods

This section presents the procedures carried out to evaluate inductive and transductive inference for TIS identification. We describe the databases used in the tests, the definition of the window size, the extraction of positive and negative sequences, the definition of the SVM parameters and the evaluation metrics.

Figure 2 schematically shows the methodology used in this work, illustrating all activities performed to investigate the TSVM behavior for the TIS prediction problem and to compare the ISVM and TSVM methods. This methodology will be described in the next sections.

Fig. 2 Schematic representation of the methodology used to evaluate ISVM and TSVM for the TIS prediction problem

Materials

The databases used in our experiments (see Fig. 2) were extracted from the public repository RefSeq [14] of the NCBI (National Center for Biotechnology Information) on 22nd April 2014 for the following organisms: Rattus norvegicus (1383 molecules), Mus musculus (1097 molecules), Homo sapiens (21,528 molecules), Drosophila melanogaster (27,764 molecules), Caenorhabditis elegans (26,066 molecules) and Arabidopsis thaliana (35,173 molecules), which together represent 96.07% of the molecules available in this repository. The remaining 3.93% of the molecules available in the RefSeq database (distributed among 14 organisms) were not considered in our study because they do not yield enough sequences to train the classifiers. For example, with our methodology it was possible to extract only 23 positive sequences and 18 negative sequences for the organism Nasonia vitripennis. This number of sequences is, in general, not sufficient to train a classifier.

Although the organism Caenorhabditis elegans has a large number of molecules, it could not be analyzed because its molecules contain only the CDS region. In other words, this organism does not have an upstream region sufficient for our methodology.

Each molecule is identified according to its inspection level and classified as Model, Inferred, Predicted, Provisional, Reviewed, Validated or WGS. In this work we have considered only mRNA molecules with inspection level Reviewed, since those records undergo a thorough review process.

Window size definition

This section discusses the criteria used to define the size of the analysis window, which is part of the data preparation stage of the methodology proposed in this work (see Fig. 2).

According to the experiments carried out by [6, 11], the size of the window used to extract nucleotide sequences directly influences the quality of the prediction model. A preliminary study [6] indicates that asymmetric windows give the prediction model higher accuracy. Consequently, our work adopts asymmetric windows in which the upstream region has fewer nucleotides, as discussed below.

To define the number of nucleotides in the upstream region, we have considered the ribosome scanning model and the Kozak consensus [3], which identifies a conserved pattern in the -6, -5, -4, -3, -2, -1, +1, +2, +3 and +4 positions (GCC[A or G]CCAUG[G]), with a predominance of the nucleotides [A or G] and [G] in positions -3 and +4, respectively. A larger number of nucleotides in the upstream region was used by [1], in which -7 was identified as a conserved position. In our experiments, we use windows with 9 nucleotides in the upstream region, since the mRNA chain is scanned three nucleotides at a time and this size guarantees that our analysis includes the previously identified conserved positions. In addition, a small upstream region avoids the unnecessary elimination of sequences.

To define the number of nucleotides in the downstream region, we have taken into account the results obtained by [1] and [13], whose authors suggest that a pattern defining the TIS exists in the CDS region of a molecule. In [13], the authors use windows with 150 nucleotides in the downstream region for the tests on the database used by [12] and 270 nucleotides in the downstream region for the validation on human mRNAs. However, these sizes were defined empirically for those databases and do not take into account the possibility of a protein pattern in the downstream region.

To evaluate the existence of such a pattern for the TIS in the downstream region, we have varied the number of nucleotides considered in this region based on an analysis of the CDS sizes of the studied organisms. Figure 3 depicts a box plot of the CDS sizes found in each organism. For the sake of readability, we have omitted the outliers typical of this type of plot. CDS sizes in the range of values limited by the box represent 75% of all CDS sizes found in each organism. Therefore, choosing a number of nucleotides in the downstream region close to the CDS size may affect the classifier’s performance, because most of the information from this region would be considered. Figure 3 shows that most of the evaluated molecules have CDS regions with sizes ranging from 800 to 2000 nucleotides, limits shown as dashed lines. The organism Drosophila melanogaster has CDS regions larger than 2000 nucleotides; however, windows with more than 2000 nucleotides prevent the study of organisms with fewer molecules, such as Arabidopsis thaliana.

Fig. 3 Box plot of the CDS region size per organism

To define a common number of nucleotides in the downstream region for all studied organisms, we observed in Fig. 3 that the CDS region of the organism Mus musculus is mostly distributed from 800 to 2000 nucleotides. Choosing the number of nucleotides in the downstream region within this interval makes it possible to consider much of the information contained in the CDS region of the remaining organisms.

To identify the number of nucleotides in the downstream region, we have analyzed the frequency histogram for the organism Mus musculus (see Fig. 4); the intervals smaller than 2000 are listed in Table 1. The frequency histogram was generated using the package fdth in R version 2.12.2.

Fig. 4 Frequency histogram of the intervals in the size of the CDS region from Mus musculus

Table 1 Frequency histogram of the intervals in the size of the CDS region from Mus musculus

We have defined the number of nucleotides in the downstream region as the median of the interval of each class in the frequency histogram of the CDS region size for Mus musculus (Table 1). We have eliminated the class with median 1930 because our preliminary experiments with this window size did not generate a training set of representative size for the organism Rattus norvegicus. Although the first two intervals are outside the range from 800 to 2000, they were considered in the analysis. In this way, we evaluate how the classifier’s performance changes as more information about the CDS region becomes available. Therefore, the numbers of nucleotides in the downstream region used for the extraction window are 235, 518, 800, 1081, 1365 and 1650.

Extraction of positive and negative sequences

For each window size established in the previous section, the sequences were extracted using the TransduTIS program developed in this work. A negative sequence (nTIS) can be differentiated according to its location, upstream or downstream, and according to the ribosome reading phase [3]. In this work we only consider windows whose AUG lies no further than the end of the CDS region. We therefore guarantee that all sequences used to generate the classification model contain at least a portion of the CDS region, which supposedly contains a pattern to predict the TIS [13].

The nTIS sequences located in the upstream region and in the reading phase of the TIS [5] are classified as upstream in phase (UPIP), and those out of the reading phase of the TIS are called upstream out of phase (UPOP). Likewise, sequences located in the CDS region and in the reading phase of the TIS are classified as CDS in phase (CDSIP), and those out of the reading phase of the TIS are called CDS out of phase (CDSOP), as shown in Fig. 5.

Fig. 5 An mRNA sequence with its regions identified
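The four nTIS categories can be derived from the position of each AUG relative to the annotated TIS. The sketch below is one possible reading of these definitions and is not the TransduTIS code: the function name, the 0-based coordinates and the convention that `cds_end` points one past the last CDS nucleotide are all assumptions.

```python
def classify_aug_codons(mrna, tis_pos, cds_end):
    """Label every AUG up to the end of the CDS by region and reading phase.

    mrna    -- mRNA string over the alphabet A, C, G, U
    tis_pos -- 0-based index of the A of the annotated start codon (assumed input)
    cds_end -- 0-based index one past the last CDS nucleotide (assumed input)
    """
    labels = {}
    pos = mrna.find("AUG")
    while pos != -1:
        # An AUG is in phase when its offset from the TIS is a multiple of 3.
        in_phase = (pos - tis_pos) % 3 == 0
        if pos == tis_pos:
            labels[pos] = "TIS"
        elif pos < tis_pos:
            labels[pos] = "UPIP" if in_phase else "UPOP"
        elif pos < cds_end:
            labels[pos] = "CDSIP" if in_phase else "CDSOP"
        # AUGs located after the CDS are ignored.
        pos = mrna.find("AUG", pos + 1)
    return labels


# Toy molecule: the TIS is at index 9 and the CDS ends with the stop codon UAA.
toy = "AUGCAUGCCAUGAUGCAUGCCUAA"
toy_labels = classify_aug_codons(toy, tis_pos=9, cds_end=24)
# toy_labels == {0: "UPIP", 4: "UPOP", 9: "TIS", 12: "CDSIP", 16: "CDSOP"}
```

Searching from `pos + 1` rather than `pos + 3` keeps AUGs in every reading frame, which is what allows out of phase occurrences to be found.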

Preliminary experiments using negative sequences (nTIS) UPIP, CDSOP, CDSIP as input to the SVM resulted in relatively low F-measure results, around 70% for the organism Mus musculus. Additionally, results from [13] indicate that UPIP sequences possess a very similar biological context to the TIS. These sequences may even start the protein translation process and be interrupted early on by the presence of a stop codon [8]. Thus, the sequences used as input for the inductive SVM (ISVM) and transductive SVM (TSVM) were only negative UPOP and positive TIS, as previously identified in Fig. 5.

During the sequence extraction process, we have preprocessed the database (see Fig. 2) to eliminate duplicated sequences, prioritizing the sequences from the (minority) positive class (TIS). Removing duplicated sequences consists of eliminating repeated occurrences of a sequence; the remaining sequences are called unique and the removed ones are called duplicated. Table 2 presents the number of sequences extracted per window size and organism, and the number of duplicated sequences disregarded when training the classifier. Note that, in general, the number of duplicated sequences is greater for small window sizes, which confirms the need to eliminate duplicates.

Table 2 Amount of sequences extracted by classification and amount of duplicated sequences eliminated during the preprocessing

Still regarding Table 2, the CDS region contains a higher number of duplicated sequences, which reinforces the possibility that conserved information exists in this region of the mRNA sequence. Additionally, there are more nTIS sequences of type UPOP than of type UPIP, indicating that UPOP sequences are more representative, which justifies the choice made in this work.

In addition to identical sequences assigned to the same class, there were also identical sequences assigned to different classes, i.e., classified as TIS in one molecule and as nTIS in another. This is rare and occurs mostly in the organism Drosophila melanogaster, at a proportion of about 1:5000 of the total number of extracted sequences. In this work, we disregarded the sequences with conflicting classifications.
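The two cleaning rules described here, keeping one copy of each repeated sequence and discarding sequences that appear under both labels, can be sketched as follows. This is an illustrative assumption about the procedure, not the authors' code; the function name and the `(window, label)` tuple layout are hypothetical.

```python
def deduplicate(samples):
    """Keep one copy of each window and drop windows seen with both labels.

    samples -- iterable of (window, label) pairs, with label "TIS" or "nTIS"
    """
    samples = list(samples)
    # First pass: record every label observed for each distinct window.
    labels_seen = {}
    for window, label in samples:
        labels_seen.setdefault(window, set()).add(label)
    # Second pass: emit the first occurrence of each unambiguous window.
    unique, emitted = [], set()
    for window, label in samples:
        if len(labels_seen[window]) > 1:
            continue  # conflicting classifications are disregarded
        if window in emitted:
            continue  # repeated occurrence of a sequence already kept
        emitted.add(window)
        unique.append((window, label))
    return unique


cleaned = deduplicate([
    ("AAA", "TIS"), ("AAA", "TIS"),   # duplicate within one class
    ("CCC", "nTIS"),
    ("GGG", "TIS"), ("GGG", "nTIS"),  # same window, conflicting labels
])
```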

The TIS prediction problem is essentially unbalanced because each analyzed molecule has, with rare exceptions, only one TIS among several AUG codons that do not start protein translation. However, as presented in Table 3 (column TIS/nTIS), this problem has been alleviated by eliminating duplicates and using only upstream out of phase negative sequences (UPOP). Still, it is important to note that for the organism Arabidopsis thaliana the number of available TIS sequences is higher than the number of nTIS sequences for windows with 235, 518 and 800 nucleotides in the downstream region.

Table 3 Amount of sequences after the elimination of duplicated sequences

Besides the duplicated sequences, we have eliminated sequences whose windows extend beyond the nucleotides available in the molecule, in either the upstream or the downstream direction.
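A window around a candidate AUG, together with the rule that discards windows exceeding the molecule, might be sketched as below. The text does not state whether the downstream count includes the AUG triplet itself, so this sketch assumes it is counted after the triplet; the function name and defaults are likewise illustrative.

```python
def extract_window(mrna, aug_pos, upstream=9, downstream=235):
    """Return the window around an AUG, or None if it does not fit the molecule.

    Assumption: `downstream` nucleotides are counted after the AUG triplet.
    """
    start = aug_pos - upstream
    end = aug_pos + 3 + downstream
    if start < 0 or end > len(mrna):
        return None  # window longer than the available nucleotides
    return mrna[start:end]


# Toy molecule with 9 upstream nucleotides, an AUG and 5 downstream nucleotides.
toy_mrna = "C" * 9 + "AUG" + "G" * 5
window = extract_window(toy_mrna, aug_pos=9, upstream=9, downstream=5)
```

Under this assumption, a window with 235 downstream nucleotides has 9 + 3 + 235 = 247 positions.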

Similar to [4, 19], the sequences were encoded as binary strings, using 4 bits to represent each nucleotide: A, C, G and U are encoded as 1000, 0100, 0010 and 0001, respectively.
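This 4-bit encoding can be illustrated with a short Python sketch. The bit patterns follow the text; the function name and the choice to raise an error on other symbols are assumptions.

```python
# Bit patterns for each nucleotide, as given in the text.
ENCODING = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "U": (0, 0, 0, 1),
}


def encode_window(window):
    """Encode a nucleotide window as a flat list of 4 bits per nucleotide."""
    bits = []
    for nucleotide in window.upper():
        # Symbols other than A, C, G, U raise KeyError in this sketch.
        bits.extend(ENCODING[nucleotide])
    return bits


encoded = encode_window("AUG")
# encoded == [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0]
```

A window of n nucleotides therefore becomes a feature vector of 4n binary components.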

SVM parameter definition

Another stage of the proposed methodology is to define the parameters of the SVM algorithms to be used in the ISVM and TSVM classifiers. This activity is directly linked to the training process, as can be seen in Fig. 2.

For problems that are not linearly separable, such as TIS prediction, it is necessary to use slack variables that relax the constraints of the optimization problem, allowing some misclassification, and a kernel function that maps the training data to a feature space. The parameter C, known as the penalty parameter, determines the weight attributed to each incorrect classification made by the classifier: the higher its value, the more specific the classifier and the less tolerant it is of incorrect classifications.

The efficiency of these two classifiers depends on the proper selection of the kernel function parameters and of C, the parameter that softens the margin of the optimal separating hyperplane. Our work uses the Gaussian RBF (Radial Basis Function) kernel (Eq. 5), whose parameter σ controls the width of the Gaussian function. Following the convention commonly found in implementations of SVM classifiers, we use the parameter γ instead, defined as \(\gamma =\frac {1}{2\sigma ^{2}}\).

$$ K(x_{i},x_{j}) = \exp\left(-\gamma\parallel x_{i} - x_{j} \parallel^{2}\right) $$
(5)
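As an illustration, the RBF kernel can be evaluated directly on two encoded vectors. The sketch below adopts the libsvm convention, K = exp(-γ‖x_i - x_j‖²) with γ > 0 and γ = 1/(2σ²); the function names are assumptions and the code is independent of any SVM library.

```python
import math


def gamma_from_sigma(sigma):
    """Convert the Gaussian width sigma to gamma = 1 / (2 * sigma^2)."""
    return 1.0 / (2.0 * sigma ** 2)


def rbf_kernel(x_i, x_j, gamma):
    """Gaussian RBF kernel exp(-gamma * ||x_i - x_j||^2), with gamma > 0."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


# Two different one-hot nucleotides (A and C) differ in exactly two bits,
# so their squared distance is 2.
k_same = rbf_kernel([1, 0, 0, 0], [1, 0, 0, 0], gamma_from_sigma(1.0))
k_diff = rbf_kernel([1, 0, 0, 0], [0, 1, 0, 0], gamma_from_sigma(1.0))
```

With binary-encoded windows, the squared distance equals twice the number of mismatching nucleotide positions, so the kernel value decays with the number of mismatches between two windows.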

The parameters were defined using the grid search method [20] implemented in libsvm. This method selects the best set of parameters through an exhaustive search within a predefined range of values for each parameter. A preliminary experiment with this method used all 1454 sequences from Mus musculus for a window size of 235 (see Table 3) and required about 5 hours of processing to find the best pair of parameters (C,γ). The experiment was executed on a high-performance SGI Altix server at the National Supercomputing Center of the Federal University of Rio Grande do Sul.

Due to the large number of available molecules (around 20 thousand) for the remaining analyzed organisms and the high runtime of the grid search (which depends on the SVM’s execution time and the number of records in the training set), we use 10% of the available sequences. Those sequences were chosen at random using the Mersenne Twister generator [21], while keeping the ratio of positive (TIS) to negative (nTIS) classes. The grid search was executed for each organism and window size defined in Table 3. See Additional file 1 for the values of the parameters (C,γ) found by the grid search using the RBF kernel function, which were used for the training of ISVM and TSVM.
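A class-preserving 10% subsample of this kind can be sketched with Python's standard `random` module, whose generator is the Mersenne Twister. The function name, the seed and the rounding rule are assumptions for illustration; the original selection procedure may differ in these details.

```python
import random


def stratified_subsample(samples, fraction=0.10, seed=0):
    """Draw the same fraction from each class so the TIS/nTIS ratio is kept.

    samples -- list of (window, label) pairs
    """
    rng = random.Random(seed)  # Mersenne Twister generator
    by_label = {}
    for window, label in samples:
        by_label.setdefault(label, []).append((window, label))
    subset = []
    for label in sorted(by_label):
        group = by_label[label]
        # Keep at least one sample per class (an assumption of this sketch).
        k = max(1, round(len(group) * fraction))
        subset.extend(rng.sample(group, k))
    return subset


# Hypothetical data set with 100 TIS and 300 nTIS windows.
data = [("tis%d" % i, "TIS") for i in range(100)]
data += [("ntis%d" % i, "nTIS") for i in range(300)]
subset = stratified_subsample(data)
```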

The results were assessed using the metrics \(\text{Precision} = 100 \times \frac {TP}{TP+FP}\), \(\text{Sensitivity} = 100 \times \frac {TP}{TP+FN}\) and \(\text{F-measure} = 2 \times \frac {\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}\) (where TP = True Positive, TN = True Negative, FP = False Positive and FN = False Negative), together with the ROC (Receiver Operating Characteristic) curve [22].
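These three metrics can be computed directly from the confusion-matrix counts. The sketch below follows the formulas above, with values expressed as percentages; returning 0.0 when a denominator is zero is an assumption of this sketch, and the counts in the example are made up.

```python
def precision(tp, fp):
    """Precision as a percentage: 100 * TP / (TP + FP)."""
    return 100.0 * tp / (tp + fp) if (tp + fp) else 0.0


def sensitivity(tp, fn):
    """Sensitivity as a percentage: 100 * TP / (TP + FN)."""
    return 100.0 * tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(prec, sens):
    """Harmonic mean of precision and sensitivity."""
    return 2.0 * prec * sens / (prec + sens) if (prec + sens) else 0.0


# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
p = precision(90, 10)    # 90.0
s = sensitivity(90, 30)  # 75.0
f = f_measure(p, s)
```

Because false negatives enter only the sensitivity, a classifier that misses many TIS can keep a high precision while its F-measure drops.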

Validation process

We have applied the 10-fold cross-validation method to validate the model statistically. It consists of subdividing the available data set into 10 folds of the same size, of which 9 are used for training and the remaining one for validation.

However, this validation process induces a favorable context to the inductive learning techniques because 90% (9 folds) of the available data goes for training and the remaining one (10%) for the validation. Thus, in order to compare the performance of ISVM and TSVM in a more balanced context, we have proposed experiments in two different scenarios.

From now on, the traditional cross-validation is referred to as Scenario 1. Scenario 1 is valid for evaluating the transductive classifier in an unfavorable context. However, it is important to evaluate which context is best for applying each type of inference. Consequently, we propose a variation of the cross-validation method that simulates a context in which the data available for training are scarce. It inverts the cross-validation model, i.e., 10% (1 fold) of the data are available for training and the remaining 90% for model validation. From now on, this scenario is called Scenario 2. Data from both Scenario 1 and Scenario 2 are used for training the ISVM and TSVM (refer to Fig. 2).
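The two scenarios differ only in which side of each fold split is used for training. A minimal sketch of the index bookkeeping is given below; the function name and the interleaved fold assignment are assumptions, and shuffling or stratification of the data is left out for brevity.

```python
def scenario_splits(n_samples, n_folds=10, inverted=False):
    """Yield (train_indices, validation_indices) for each fold.

    inverted=False -- Scenario 1: 9 folds for training, 1 for validation
    inverted=True  -- Scenario 2: 1 fold for training, 9 for validation
    """
    indices = list(range(n_samples))
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held in range(n_folds):
        single = folds[held]
        rest = [idx for j, fold in enumerate(folds) if j != held for idx in fold]
        if inverted:
            yield single, rest
        else:
            yield rest, single


scenario_1 = list(scenario_splits(100))
scenario_2 = list(scenario_splits(100, inverted=True))
```

For the transductive classifier, the validation indices would additionally be passed to training as unlabeled samples, while the inductive classifier sees only the training indices.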

Results and discussion

These experiments aim to analyze the behavior of ISVM and TSVM for the TIS prediction problem. As previously described, this analysis was performed using 6 window sizes for sequence extraction in two different scenarios, which differ in the number of sequences available for training.

Table 4 presents the precision obtained for both methods, ISVM and TSVM. It is possible to observe that the precision of the ISVM and TSVM is similar for both scenarios, with few exceptions. The largest differences are found in the Rattus norvegicus and Mus musculus organisms, which have few training sequences (see Table 3).

Table 4 Validation precision results using ISVM and TSVM methods for the Scenarios 1 and 2

For Scenario 2, in which only 10% of the sequences are available for training, the precision of both classifiers is lower, as expected. It is important to observe that the greater the number of training sequences for an organism, the greater the precision obtained with the ISVM and TSVM classifiers. However, for Scenario 2, the sensitivity shown in Table 5 indicates that the ISVM classifier fails to identify the TIS. This occurs for the Rattus norvegicus and Mus musculus organisms, which have few molecules.

Table 5 Validation sensitivity results using ISVM and TSVM methods for the Scenarios 1 and 2

Evaluating precision and sensitivity separately gives only a partial idea of which classifier is better for the TIS prediction problem. Therefore, the F-measure (the harmonic mean of sensitivity and precision) was used to compare the performance of the ISVM and TSVM classifiers, taking both precision and sensitivity into account. Table 6 presents the F-measure results, which indicate that the TSVM is better than the ISVM for the organisms that have fewer molecules, in this case Rattus norvegicus and Mus musculus. These results reinforce that the TSVM is better suited to organisms that have fewer molecules or are understudied.

Table 6 Validation F-measure results using ISVM and TSVM methods for the Scenarios 1 and 2

We further evaluated the performance of the ISVM and TSVM classifiers with ROC curves. Figure 6 a and b illustrate the ROC curves for the Rattus norvegicus and Mus musculus organisms, respectively. As already discussed, in Scenario 2 the TSVM classifier is better than the ISVM classifier (Fig. 6 a). Although the area under the ROC curve in Scenario 2 is slightly smaller for the transductive classifier (AUC = 0.837 for the transductive and 0.917 for the inductive classifier for the Rattus norvegicus organism), the best classification model, that is, the one closest to the point (0,100%), with a higher true positive rate and a lower false positive rate, is obtained by the TSVM classifier.

Fig. 6

ROC curve for a Rattus norvegicus and b Mus musculus organisms

On the other hand, in the inductive scenario (Scenario 1), which has a higher number of training sequences, the inductive classifier presented better results than the transductive one. This conclusion is based on the area under the ROC curve: AUC = 0.973 for the inductive and 0.917 for the transductive classifier for the Rattus norvegicus organism. The same behavior was observed for the Mus musculus organism (Fig. 6 b).
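The ROC discussion uses two different criteria: the area under the curve and the operating point closest to the ideal corner (0,100%). The sketch below shows how the two can disagree. It assumes Euclidean distance as the meaning of "closest", uses rates in the range 0 to 1 rather than percentages, and works on two made-up curves that are not the paper's data.

```python
import math


def auc_trapezoid(points):
    """Area under a ROC curve given (fpr, tpr) points with rates in [0, 1]."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def closest_to_ideal(points):
    """Return the (fpr, tpr) point nearest to the ideal corner (0, 1)."""
    return min(points, key=lambda p: math.hypot(p[0], 1.0 - p[1]))


# Made-up curves, NOT the paper's data: curve_b has the smaller area
# but contains the single operating point nearest to the ideal corner.
curve_a = [(0.0, 0.0), (0.3, 0.9), (1.0, 1.0)]
curve_b = [(0.0, 0.0), (0.2, 0.2), (0.25, 0.85), (1.0, 1.0)]

auc_a, auc_b = auc_trapezoid(curve_a), auc_trapezoid(curve_b)
best_a, best_b = closest_to_ideal(curve_a), closest_to_ideal(curve_b)
```

Here `curve_a` has the larger area (0.80 against 0.74), while the best point of `curve_b` lies closer to the corner, which is the same kind of situation described for Scenario 2.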

Another important result concerns the size of the analyzed extraction windows. The F-measure results (see Table 6) show that the greater the number of nucleotides in the downstream region of the extraction window, the better the performance of the classifiers. Nevertheless, performance is similar for windows with 1081, 1365 or 1650 nucleotides in the downstream region, whereas larger windows considerably reduce the number of sequences available for training (see Table 3). Therefore, for the evaluated organisms, it is appropriate to use the smallest of the largest window sizes. In this work, we consider 1081 nucleotides in the downstream region, regardless of the organism.

By analyzing these results, it is possible to observe that the TSVM method better suits organisms with few labeled sequences, e.g., the Rattus norvegicus and Mus musculus organisms. Using the ISVM raises a question: for how long does the inductive classifier remain valid? To handle this situation, it is necessary to retrain the classifier constantly in order to ensure its accuracy and representativeness, since the frequency with which new sequences (intrinsically different from those in the original training set) are added to the database may compromise the classifier’s performance.

Although the TSVM classifier, by the transductive principle itself, needs to be readjusted for each new sequence, this increases the reliability of the classification process. The readjustment is justified when the organisms have few sequenced molecules. The retraining implies a higher computational cost than inductive methods. However, this cost can be reduced if each readjustment considers the SVs of the previous readjustment in addition to the new sequences.

Table 7 presents the number of SVs used in the TSVM approach and the elapsed time for the classification of one molecule from each organism.

Table 7 TSVM’s retraining computational cost

Comparative study

To assess our approach in a real scenario of TIS identification, the next stage of this work is a comparative analysis against some of the main programs for TIS prediction.

For the comparative study, we used test sets that were not included in the training of the ISVM and TSVM classifiers. This new database comprises data from RefSeq extracted between 22 April and 22 September 2014.

The test sets have the following number of molecules for each considered organism: Rattus norvegicus (125 molecules), Mus musculus (36 molecules), Homo sapiens (113 molecules), Drosophila melanogaster (106 molecules) and Arabidopsis thaliana (15 molecules).

The considered programs in this evaluation are the following: TISHunter [13], TIS Miner [11], NetStart [12] and TransduTIS, developed for this work, which implements the inductive (TransduTIS-I) and transductive (TransduTIS-T) approaches.

We developed a Python script (endnote 8) to automate the tests with TISHunter, TIS Miner and NetStart. To evaluate TISHunter, we used its URL (endnote 9) to submit each mRNA for testing with the default settings. The TIS Miner program was evaluated using its URL (endnote 10) with default parameters and the number of predictions set to the maximum value. We used a classification threshold of 0.6 for this program: each AUG with a score greater than 0.6 is considered a positive prediction, and each AUG with a score lower than 0.6 is considered a negative prediction. Finally, to evaluate NetStart we used its URL (endnote 11) and set its parameters to vertebrate. All the tests are available at the address given in endnote 8.
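The thresholding rule applied to the TIS Miner scores can be written as a small function. This is only a sketch: the function name is hypothetical, and because the text covers only scores strictly above or below 0.6, a score exactly equal to the threshold is reported here as undecided (`None`), which is an assumption of this example.

```python
TIS_MINER_THRESHOLD = 0.6  # value stated in the text


def classify_score(score, threshold=TIS_MINER_THRESHOLD):
    """Map a predictor score for one AUG to "TIS" or "nTIS".

    A score exactly equal to the threshold is left undecided (None),
    since the text does not say how that case was handled.
    """
    if score > threshold:
        return "TIS"
    if score < threshold:
        return "nTIS"
    return None


# Illustrative scores, not real tool output.
calls = [classify_score(s) for s in (0.91, 0.60, 0.12)]
```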

Both ISVM and TSVM were tested with extraction windows of 1090 nucleotides (1081 in the downstream region and 9 in the upstream region). Molecules that did not meet these conditions were not considered in the tests.
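A minimal sketch of this window extraction is given below. It assumes that the 1081 downstream nucleotides start at the A of the candidate AUG, so that 9 + 1081 gives the 1090-nucleotide window; the text does not state where the codon itself is counted. The function name and the toy molecule are hypothetical.

```python
UPSTREAM = 9       # nucleotides before the candidate codon (from the text)
DOWNSTREAM = 1081  # nucleotides from the codon onward (assumed to include AUG)


def extract_windows(mrna, upstream=UPSTREAM, downstream=DOWNSTREAM):
    """Return (position, window) for every AUG that has a complete window.

    Candidates too close to either end of the molecule are skipped, which
    mirrors the exclusion of molecules that do not meet the size conditions.
    """
    mrna = mrna.upper().replace("T", "U")  # accept cDNA-style input as well
    windows = []
    start = mrna.find("AUG")
    while start != -1:
        if start >= upstream and start + downstream <= len(mrna):
            windows.append((start, mrna[start - upstream:start + downstream]))
        start = mrna.find("AUG", start + 1)
    return windows


# Toy molecule: 9 filler bases, one AUG, then filler up to the window size.
toy = "C" * 9 + "AUG" + "C" * 1078
result = extract_windows(toy)
```

A molecule shorter than the window, or an AUG with fewer than 9 upstream nucleotides, produces no window and is therefore left out.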

Table 8 presents the test results for each studied organism. We also present the number of hits and not hits for each tool analyzed. A hit corresponds to an AUG that is a TIS and was classified as TIS, and a not hit corresponds to an AUG that is a TIS but was classified as nTIS. It is important to highlight that TISHunter is essentially a predictor, so it was not possible to infer information about its classification process to build a confusion matrix. For the calculation of hits and not hits, only occurrences of AUG in the upstream region were considered.
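The hit and not hit counts defined above can be tallied as in the following sketch. The function name and the label lists are illustrative assumptions, not output from any of the compared tools.

```python
def hit_counts(true_labels, predicted_labels):
    """Count hits and not hits over the AUGs that are annotated as TIS.

    hit     -- AUG is a TIS and was classified as TIS
    not hit -- AUG is a TIS but was classified as nTIS
    """
    hits = not_hits = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth != "TIS":
            continue  # nTIS candidates do not enter either count
        if predicted == "TIS":
            hits += 1
        else:
            not_hits += 1
    return hits, not_hits


# Made-up labels for four candidate AUGs.
truth = ["TIS", "nTIS", "TIS", "TIS"]
preds = ["TIS", "TIS", "nTIS", "TIS"]
counts = hit_counts(truth, preds)
```

By these definitions a hit is a true positive and a not hit is a false negative, so hits / (hits + not hits) corresponds to the sensitivity.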

Table 8 Comparison among methods

By analyzing the results, we observed that TransduTIS-T has the best hit and not hit counts among the evaluated tools. This means that the model proposed here was better able to characterize the context of TIS prediction, which is important for identifying the highest possible number of AUG codons that are truly TIS. Thus, researchers in TIS identification may more safely analyze the proteins generated from this identification. The TISHunter program [13], which uses Edit Kernel functions, obtained significant results as well, reinforcing the hypothesis that conserved features in the CDS region contribute to TIS prediction.

Conclusions

In this paper we compare the inductive (ISVM) and transductive (TSVM) classification methods for TIS identification. We describe the sequence extraction process, the preprocessing adopted and the elimination of duplicate sequences, which are important aspects of TIS prediction. We also present an approach that avoids class imbalance, a common situation in TIS identification. In addition, we have demonstrated the viability of using asymmetric extraction windows with a large number of nucleotides in the downstream region.

The results show that the TSVM approach yielded an improvement, especially in F-measure and sensitivity, for organisms that have a small number of mRNA molecules, as observed for the Rattus norvegicus and Mus musculus organisms. For organisms with a larger number of sequences, the inductive approach is recommended. When compared with other tools in a real scenario of TIS identification, the transductive approach proved to be efficient for TIS identification in mRNA molecules.

Although the proposed methodology has achieved satisfactory results, some limitations can be mentioned. First, the sequence extraction process depends on a fixed window size in both the upstream and downstream regions. This limits the classification of some molecules, as observed for the Caenorhabditis elegans organism, which has a small upstream window. Another limitation is the retraining of the TSVM classifier that is required whenever the TIS of new molecules must be identified.

Finally, this work provides a web interface, TransduTIS-I and TransduTIS-T, for the identification of TIS.

Endnotes

1 Available at http://tishunter.ucr.edu/

2 Available at http://www.ncbi.nlm.nih.gov/

3 A description of each status is available at http://www.ncbi.nlm.nih.gov/books/NBK21091/

4 https://cran.r-project.org/web/packages/fdth/

5 Available at http://transdutis.com.br/

6 Available at https://www.csie.ntu.edu.tw/~cjlin/libsvm/

7 More information available at http://www.cesup.ufrgs.br

8 http://www.icei.pucminas.br/projetos/dsrgroup/?wpdmpro=transdutis

9 http://tishunter.ucr.edu/cgi-bin/tishunter.cgi

10 http://dnafsminer.bic.nus.edu.sg/cgi-bin/tis.pl

11 http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi

Abbreviations

CDS:

Coding sequence

CDSIP:

CDS in phase

CDSOP:

CDS out of phase

FN:

False negatives

ISVM:

Inductive SVM

mRNA:

Messenger RNA

NCBI:

National Center for Biotechnology Information

nTIS:

non-TIS

RBF:

Radial basis function

RefSeq:

Reference sequence

RNA:

Ribonucleic acid

ROC:

Receiver operating characteristic

SV:

Support vectors

SVM:

Support vector machine

TIS:

Translation initiation site

TSVM:

Transductive SVM

UPIP:

Upstream in phase

UPOP:

Upstream out of phase

URL:

Uniform resource locator

References

  1. Tzanis G, Berberidis C, Vlahavas I. Mantis: a data mining methodology for effective translation initiation site prediction. In: Engineering in Medicine and Biology Society, 2007. EMBS 2007. 29th Annual International Conference of the IEEE. IEEE: 2007. p. 6343–347.

  2. Nakagawa S, Niimura Y, Gojobori T, Tanaka H, Miura K-i. Diversity of preferred nucleotide sequences around the translation initiation codon in eukaryote genomes. Nucleic Acids Res. 2008; 36(3):861–71.

  3. Kozak M. Compilation and analysis of sequences upstream from the translational start site in eukaryotic mrnas. Nucleic Acids Res. 1984; 12(2):857–72.

  4. Hatzigeorgiou AG. Translation initiation start prediction in human cdnas with high accuracy. Bioinformatics. 2002; 18(2):343–50. doi:http://dx.doi.org/10.1093/bioinformatics/18.2.343.

  5. Kozak M. Initiation of translation in prokaryotes and eukaryotes. Gene. 1999; 234(2):187–208.

  6. Silva LM, de Souza Teixeira FC, Ortega JM, Zárate LE, Nobre CN. Improvement in the prediction of the translation initiation site through balancing methods, inclusion of acquired knowledge and addition of features to sequences of mrna. BMC Genomics. 2011; 12(Suppl 4):9.

  7. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. Smote: synthetic minority over-sampling technique. J Artif Intell Res. 2002; 16(1):321–57.

  8. Luukkonen B, Tan W, Schwartz S. Efficiency of reinitiation of translation on human immunodeficiency virus type 1 mrnas is determined by the length of the upstream open reading frame and by intercistronic distance. J Virol. 1995; 69(7):4086–94.

  9. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995; 20(3):273–97.

  10. Zien A, Rätsch G, Mika S, Schölkopf B, Lengauer T, Müller KR. Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics. 2000; 16(9):799–807.

  11. Liu H, Wong L. Data mining tools for biological sequences. J Bioinforma Comput Biol. 2003; 1(01):139–67.

  12. Pedersen AG, Nielsen H. Neural network prediction of translation initiation sites in eukaryotes: perspectives for est and genome analysis. In: Ismb. Vol. 5: 1997. p. 226–33.

  13. Li H, Jiang T. A class of edit kernels for svms to predict translation initiation sites in eukaryotic mrnas. J Comput Biol. 2005; 12(6):702–18.

  14. Pruitt KD, Maglott DR. Refseq and locuslink: Ncbi gene-centered resources. Nucleic Acids Res. 2001; 29(1):137–40.

  15. Jia Zeng RA, Demetrick D. Adaptive multi-agent architecture for functional sequence motifs recognition. Bioinformatics. 2009; 25(23):3084–92.

  16. Chain PSG, et al. Genomics. genome project standards in a new era of sequencing. Science (New York). 2009; 326:236–7.

  17. Gammerman A, Vovk V, Vapnik V. Learning by transduction. In: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc: 1998. p. 148–55.

  18. Chapelle O, Schölkopf B, Zien A, (eds). Semi-Supervised Learning. Cambridge: MIT Press; 2006. http://www.kyb.tuebingen.mpg.de/ssl-book.

  19. Stormo GD, Schneider TD, Gold LM. Characterization of translational initiation sites in e. coli. Nucleic Acids Res. 1982; 10(9):2971–96.

  20. Chang CC, Lin CJ. LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol. 2011; 2(3):27:1–27:27.

  21. Matsumoto M, Nishimura T. Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Trans Model Comput Simul (TOMACS). 1998; 8(1):3–30.

  22. Li Y, Ray D, Ye P. Identification of germ cell-specific genes in mammalian meiotic prophase. BMC Bioinforma. 2013; 14(1):72. doi:http://dx.doi.org/10.1186/1471-2105-14-72.

Acknowledgements

We thank the DSRgroup for the support, and the Supercomputing National Center (CESUP) of the Federal University of Rio Grande do Sul (UFRGS) for making available the computational resources for the execution of the experiments.

Funding

Research reported in this publication was supported by Foundation for Research Support of the State of Minas Gerais (FAPEMIG), the Brazilian National Council for Scientific and Technological Development (CNPq) and the Engineering Institute of School of Engineering of Minas Gerais - EMGE.

Availability of data and materials

All the data and materials are available at http://www.icei.pucminas.br/projetos/dsrgroup/?wpdmpro=transdutis.

Authors’ contributions

LZ designed the study. CP developed the methods, conducted the tests and wrote the research paper. CN and LZ have provided the expertise and have reviewed the data analysis. All authors read and approved the final manuscript.

Authors’ information

Cristiane Neri Nobre and Luis Enrique Zárate are members of the DSRgroup (http://www.icei.pucminas.br/projetos/dsrgroup/)

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable.

Author information

Corresponding author

Correspondence to Cristiano Lacerda Nunes Pinto.

Additional file

Additional file 1

SVM parameters obtained by executing the Grid Search method. Because the remaining analyzed organisms have a large number of available molecules (around 20 thousand) and Grid Search has a high runtime (determined by the SVM’s execution time and the number of records in the training set), we used 10% of the available sequences. These sequences were chosen with the Mersenne Twister method while keeping the ratio of positive (TIS) to negative (nTIS) classes. Grid Search was executed for each organism and window size defined in this work. This table presents the values of the parameters (C,γ) found by Grid Search with the RBF kernel function, which were used for training the ISVM and TSVM. (XLS 28.0 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Nunes Pinto, C.L., Nobre, C.N. & Zárate, L.E. Transductive learning as an alternative to translation initiation site identification. BMC Bioinformatics 18, 81 (2017). https://doi.org/10.1186/s12859-017-1502-6

  • DOI: https://doi.org/10.1186/s12859-017-1502-6

Keywords