Metabolic network prediction through pairwise rational kernels

Background Metabolic networks are represented by the set of metabolic pathways. Metabolic pathways are series of biochemical reactions in which the product (output) of one reaction serves as the substrate (input) of another reaction. Many pathways remain incompletely characterized, and one of the major challenges of computational biology is to obtain better models of metabolic pathways. Existing models depend on the annotation of the genes, so errors accumulate when pathways are predicted from incorrectly annotated genes. Pairwise classification methods are supervised learning methods used to classify new pairs of entities. Some of these classification methods, e.g., Pairwise Support Vector Machines (SVMs), use pairwise kernels. Pairwise kernels describe similarity measures between two pairs of entities. Using pairwise kernels to handle sequence data requires long processing times and large storage. Rational kernels are kernels based on weighted finite-state transducers that represent similarity measures between sequences or automata. They have been used effectively in problems that handle large amounts of sequence information, such as protein essentiality, natural language processing and machine translation.
Results We create a new family of pairwise kernels using weighted finite-state transducers, called Pairwise Rational Kernels (PRKs), to predict metabolic pathways from a variety of biological data. PRKs take advantage of the simpler representations and faster algorithms of transducers. Because raw sequence data can be used, the predictor model avoids the errors introduced by incorrect gene annotations. We then developed several experiments with PRKs and Pairwise SVMs to validate our methods using the metabolic network of Saccharomyces cerevisiae. When PRKs are used, our method executes faster than with other pairwise kernels. In addition, when PRKs are combined with other simple kernels that include evolutionary information, the accuracy values improve while construction and execution times remain low.
Conclusions The power of using kernels is that almost any sort of data can be represented using kernels. Therefore, completely disparate types of data can be combined to add power to kernel-based machine learning methods. When we compared our proposal using PRKs with other similar kernels, the execution times decreased with no loss of accuracy. We also showed that combining PRKs with other kernels that include evolutionary information can improve the accuracy. As our proposal can use any type of sequence data, genes do not need to be properly annotated, which avoids the accumulation of errors caused by incorrect previous annotations.


Related work
Metabolic networks allow the modelling of molecular systems to understand the underlying biological mechanisms in a cell [1]. Metabolic networks are represented by the set of metabolic pathways. Metabolic pathways are a series of biochemical reactions, in which the product (output) from one reaction serves as the substrate (input) to another reaction. The experimental determination of metabolic networks, based on known biological data such as DNA or protein sequences, or gene expression data, is still very challenging [2]. Thus, there have been several efforts to develop supervised learning methods to determine genes coding for missing enzymes and predict unknown parts of metabolic networks [3,4].
Most of the methods to predict metabolic networks assume that the genome annotation is correct, e.g., Pathway Tools [4], a software application to predict metabolic networks using information from BioCyc databases [5]. Pathway Tools uses a two-part algorithm, in which part 1 infers the reactions catalyzed by the organism from the set of enzymes present in the annotated genome, and part 2 infers the metabolic pathways present in the organism from the reactions found in part 1. Considering that BioCyc and MetaCyc have a huge amount of available data, this application can potentially make precise metabolic pathway predictions [6]. However, part 2 is based on the annotated genes, and if there are errors in the annotation, the inferred pathways will not be correct. Therefore, these methods intrinsically accumulate errors due to incorrect genome annotations.
To tackle this problem, we have previously proposed using information directly related to the sequence as the primary data (e.g., genomic and proteomic data) [7]. We obtained the best accuracy values using Support Vector Machine (SVM) methods combined with string kernels representing the sequence data. We experimentally demonstrated that SVMs outperform other methods, such as kernel matrix regression, for predicting metabolic networks. This is consistent with recent results showing the usefulness of SVMs in bioinformatics [8]. However, our solution [7] was computationally expensive in terms of execution time because of sequence data manipulation.
Other authors have also combined SVM and other supervised learning techniques with kernel methods to predict metabolic networks [9][10][11]. The main advantage of using kernel methods is that heterogeneous data can be represented and combined simultaneously. Thus, if disparate types of data can be manipulated as kernels, data from many sources can be made to contribute uniformly to the information in a training set when building a model [12].
Yamanishi [9] and Kotera et al. [11] described the theory and implementation of GENIES, a web application that allowed prediction of the unknown parts of metabolic networks using supervised graph inference and kernel methods. Several algorithms were implemented in GENIES to find the decision or predictive functions for supervised network inference. Some of these algorithms were Kernel Canonical Correlation Analysis (KCCA) [13,14], Expectation-Maximization (EM) algorithm [15] and Kernel Matrix Regression (KMR) [9]. The authors developed several experiments, but they did not use sequence data. Therefore, one of the motivations to extend our previous research [7] was to use sequence data combined with these algorithms. As noted above, we obtained the best accuracy values with the SVM method combined with sequence kernels, but with high execution times.
To address these high computational costs, we consider the results from Allauzen et al. [16], who proposed a method to predict protein essentiality using SVMs and manipulating sequence data using rational kernels. The authors designed two sequence kernels (called general domain-based kernels), which are instances of rational kernels. To handle the large amount of data (6190 domains each with around 3000 protein sequences), automata representation was used to create the rational kernels. Their results showed that the final kernels favourably predicted protein essentiality. We note, however, that none of the previous works using rational kernels in bioinformatics [16][17][18] have considered problems related to biological network predictions.
Based on the fact that the rational kernels described by Allauzen et al. [16] can be extended to other problems, we define new kernels to be applied to metabolic network predictions. In this research, we represent sequence data using rational kernels. Rational kernels take advantage of the fast algorithms for, and efficient representation of, transducers for sequence manipulations to improve performance. As sequence data can be used, raw genomic or proteomic information may be considered, and this method avoids problems associated with incorrect annotation when predicting metabolic networks. Additionally, the current work is the first to combine rational kernels (using finite-state transducers) [17][18][19][20] with known pairwise kernels [10][21][22][23] to obtain pairwise rational kernels. While the kernel techniques proposed in this paper can be applied equally to any machine learning tools that employ kernel methods, such as KCCA, EM or KMR, we have focused on SVMs as an illustration of their capability to reduce computational costs. We have also chosen SVM methods in light of the experimental results we obtained in previous works [7], as well as the efficiency and effectiveness of SVM methods to predict protein essentiality [16]. http://www.biomedcentral.com/1471-2105/15/318

Automata and transducers
Automata define a mathematical formalism to analyze and model real problems through useful machines [24]. An automaton has a set of states (generally represented by circles) and transitions (generally represented by arrows). The automaton moves from one state to another state (makes a transition) when activated by an event or function. One variant of an automaton is the finite-state machine. A finite-state machine can be used to model a simple system, such as a turnstile or traffic lights, or complex systems such as sophisticated spaceship controls [25].
Automata work on sequences of symbols, where $\Sigma^*$ denotes all the finite sequences using the symbols of the alphabet $\Sigma$, including $\epsilon$, which represents the empty symbol. In order to formally define automata and transducers, we follow the notation used by Cortes et al. [17]. An automaton A is a 5-tuple $(\Sigma, Q, I, F, \delta)$ [24], where $\Sigma$ is the input alphabet set, Q is the state set, $I \subset Q$ is the subset of initial states, $F \subset Q$ is the subset of final states, and $\delta \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times Q$ is the transition set. A transition $\iota \in \delta$ describes the action of moving from one state to another when a condition (input symbol) is encountered.
Similarly, a Finite-State Transducer (FST) is an automaton in which each transition carries an output label in addition to the input label. Based on the above definition, an FST T is a 6-tuple $(\Sigma, \Delta, Q, I, F, \delta)$ [18], where the new term $\Delta$ is the output alphabet and the transition set is now $\delta \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times (\Delta \cup \{\epsilon\}) \times Q$. Similar to the previous definition, a transition $\iota \in \delta$ is the action of moving from one state to another when the input symbol from $\Sigma$ is encountered and the output symbol from $\Delta$ is produced.
As an example, a weighted transducer is shown in Figure 1(a). We use the colon to separate the input and output labels of a transition and the slash to separate the weight value (i.e., the notation is input:output/weight). States are represented by circles, where the initial states are bold circles and the final states are double circles. Only the initial and final states have associated weights (the notation is state/weight). Example 1 shows how to compute the weight that the transducer T assigns to two given sequences x and y (i.e., T(x, y)). In this case, we define the alphabets $\Sigma = \{G, C\}$ and $\Delta = \{G, C\}$.
Example 2. The weight (or value) that the automaton A in Figure 1(b) associates with $y = CCG \in \Sigma^*$ is computed as A(CCG) = 1 * 2 * 3 * 6 * 1 + 1 * 3 * 1 * 4 * 1 = 36 + 12 = 48, considering that there are two accepting paths labelled with CCG, each of which contributes the product of its initial weight, its transition weights and its final weight.
There are several operations defined on automata and transducers, such as inverse and composition. Given any transducer T, the inverse $T^{-1}$ is the transducer obtained when the input and output labels are swapped for each transition. The composition of the transducers $T_1$ and $T_2$, with input and output alphabets both equal to $\Sigma$, is a weighted transducer, denoted by $T_1 \circ T_2$, provided that the sum given by $(T_1 \circ T_2)(x, y) = \sum_{z \in \Sigma^*} T_1(x, z)\, T_2(z, y)$ is well-defined for all x, y.
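The path-sum computation of Example 2 can be sketched in Python as follows. This is only an illustration: the automaton below is hypothetical (its topology is an assumption and is not taken from Figure 1(b)), chosen so that its two accepting paths for CCG carry the same weights as in Example 2. The function name `automaton_weight` and the dictionary-based representation are also assumptions, and epsilon transitions are not handled.

```python
from collections import defaultdict

def automaton_weight(sequence, initial, final, transitions):
    """Sum over all accepting paths labelled with `sequence` of the product
    of the initial weight, the transition weights and the final weight."""
    # forward[state] = total weight of all partial paths that reach `state`
    forward = dict(initial)
    for symbol in sequence:
        reached = defaultdict(float)
        for state, weight in forward.items():
            for target, w in transitions.get((state, symbol), []):
                reached[target] += weight * w
        forward = reached
    return sum(w * final[s] for s, w in forward.items() if s in final)

# Hypothetical automaton (assumed topology) with two accepting paths for CCG.
initial = {0: 1.0}            # state -> initial weight
final = {5: 1.0}              # state -> final weight
transitions = {               # (state, symbol) -> [(next state, weight)]
    (0, "C"): [(1, 2.0), (2, 3.0)],
    (1, "C"): [(3, 3.0)],
    (2, "C"): [(4, 1.0)],
    (3, "G"): [(5, 6.0)],
    (4, "G"): [(5, 4.0)],
}

weight_ccg = automaton_weight("CCG", initial, final, transitions)
print(weight_ccg)  # 1*2*3*6*1 + 1*3*1*4*1 = 48.0
```

Propagating partial weights state by state avoids enumerating paths explicitly, so the cost grows with the sequence length and the number of transitions rather than with the number of paths.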

Rational kernels
In order to manipulate sequence data, FSTs provide a simple representation as well as efficient algorithms such as composition and shortest-distance [18]. Rational Kernels, based on Finite-State Transducers, are effective for analyzing sequences with variable lengths [17].
As a formal definition, a function $k : \Sigma^* \times \Delta^* \to \mathbb{R}$ is a rational kernel if there exists a weighted finite-state transducer (WFST) U such that k coincides with the function defined by U, i.e., k(x, y) = U(x, y) for all sequences $(x, y) \in \Sigma^* \times \Delta^*$ [17]. From now on, we consider the input and output alphabets to have the same symbols (i.e., $\Sigma = \Delta$), and only the terms $\Sigma$ and $\Sigma^*$ will be used.
In order to compute the value of U(x, y) for a particular pair of sequences $(x, y) \in \Sigma^* \times \Sigma^*$, the composition algorithm of weighted transducers is used [17]:
• First, $M_x$, $M_y$ are considered as trivial weighted transducers representing x, y respectively, where $M_x(x, x) = 1$ and $M_x(v, w) = 0$ for $v \neq x$ or $w \neq x$. $M_x$ is obtained from the linear finite automaton representing x by augmenting each transition with an output label identical to the input label and by setting all transition, initial and final weights to one. $M_y$ is obtained in a similar way from y.
• Then, by definition of weighted transducer composition: $(M_x \circ U \circ M_y)(x, y) = M_x(x, x)\, U(x, y)\, M_y(y, y) = U(x, y)$.
Based on this representation, a two-step algorithm is defined by Cortes et al. [17] to obtain k(x, y) = U(x, y).

Algorithm 1 Rational Kernel Computation
INPUT: pair of sequences (x, y) and a WFST U
(i) compute N using composition as $N = M_x \circ U \circ M_y$
(ii) compute the sum of all paths of N using the shortest-distance algorithm, which is equal to U(x, y)
RESULT: value of k(x, y) = U(x, y)

Using Algorithm 1, the overall complexity to compute one value of the rational kernel is $O(|U||M_x||M_y|)$, where |U| remains constant. In practice, this complexity is reduced to $O(|U| + |M_x| + |M_y|)$ for many kernels that have been used in areas such as natural language processing and computational biology. For example, Algorithm 1 for the n-gram kernel has linear complexity (see a detailed description of the n-gram kernel below).
Kernels used in training methods for discriminant classification algorithms (e.g., SVM) need to satisfy Mercer's condition or, equivalently, be Positive Definite and Symmetric (PDS) [18]. Cortes et al. [18] have proven a result (Theorem 1) that gives a general method to construct a PDS rational kernel from any WFST T, namely by taking $U = T \circ T^{-1}$.

n-gram kernel as a rational kernel
Hofmann et al. [26] have defined a class of similarity measures between two biological sequences as a function of the number of equal subsequences that they have. An example of such measures is the spectrum kernel defined by Leslie et al. [27]. Similarity values are the results of summing all the products of the counts for the same subsequences. It is also referred to in computational biology as the k-mer or n-gram kernel. In the rest of this paper, we use the term n-gram to follow the notation of Hofmann et al. [26] and Cortes et al. [17].
The n-gram kernel is defined as $k_n(x, y) = \sum_{|z|=n} c_x(z)\, c_y(z)$ for a fixed integer n, where the sum runs over the subsequences z of length n and $c_a(b)$ is the number of times that the subsequence b appears in a. $k_n$ can be represented as a rational kernel using the weighted transducer $U_n = T_n \circ T_n^{-1}$, where the transducer $T_n$ is defined as $T_n(x, z) = c_x(z)$ for all $x, z \in \Sigma^*$ with |z| = n [18]. For example, for n = 2, $k_2(x, y) = \sum_{|z|=2} c_x(z)\, c_y(z)$ is the rational kernel where z ranges over all the subsequences in $\Sigma^*$ of size 2 and $T_2(x, z) = c_x(z)$ counts how many times z occurs in x.
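The definition of the n-gram kernel can be illustrated with a short Python sketch that counts n-grams directly instead of composing transducers. The function names `ngram_counts` and `ngram_kernel` are assumptions introduced for illustration; the sketch evaluates the same sum of count products but does not use the WFST machinery.

```python
from collections import Counter

def ngram_counts(sequence, n):
    """c_x(z): number of occurrences of every length-n substring z of x."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))

def ngram_kernel(x, y, n):
    """k_n(x, y) = sum over |z| = n of c_x(z) * c_y(z)."""
    cx, cy = ngram_counts(x, n), ngram_counts(y, n)
    # Only n-grams that occur in x can contribute a non-zero product.
    return sum(count * cy[z] for z, count in cx.items())

# GCGC contains GC twice and CG once; CGCG contains CG twice and GC once.
k2 = ngram_kernel("GCGC", "CGCG", 2)
print(k2)  # 2*1 + 1*2 = 4
```

Because each sequence is scanned once and only the n-grams present in x are visited, the cost is linear in the lengths of the two sequences, which mirrors the linear complexity stated for the n-gram rational kernel.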
Allauzen et al. [16] extended the construction of this kernel, $k_n$, to measure the similarity between sequences represented by automata. First, they define the count of a sequence z in a weighted automaton A as $c_A(z) = \sum_{u \in \Sigma^*} c_u(z)\, A(u)$, where u ranges over the set of sequences in $\Sigma^*$ that can be represented by the automaton A. This equation sums, over each u, the number of times z occurs in u multiplied by the weight (or value) associated with the sequence u in the automaton A (computed as in Example 2).
Then, the similarity measure between the weighted automata $A_1$ and $A_2$, according to the n-gram kernel $k_n$, is defined as $k_n(A_1, A_2) = \sum_{|z|=n} c_{A_1}(z)\, c_{A_2}(z)$. Based on this definition and using Algorithm 1, the n-gram rational kernel can be constructed in time $O(|U_n| + |M_x| + |M_y|)$, as described by Allauzen et al. [16] and Mohri et al. [28].
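The extension to weighted automata can be sketched in the same style. As a simplifying assumption, each automaton is represented here by the finite set of sequences it accepts together with their weights A(u), stored in a dictionary; real automata may accept very large or infinite sets, in which case the transducer algorithms are required. The function names are hypothetical.

```python
from collections import defaultdict

def automaton_counts(weighted_sequences, n):
    """c_A(z) = sum over u of c_u(z) * A(u), for an automaton given as a
    finite mapping {u: A(u)} (an assumption made for this sketch)."""
    counts = defaultdict(float)
    for u, weight in weighted_sequences.items():
        for i in range(len(u) - n + 1):
            counts[u[i:i + n]] += weight   # each occurrence of z adds A(u)
    return counts

def automaton_ngram_kernel(a1, a2, n):
    """k_n(A1, A2) = sum over |z| = n of c_A1(z) * c_A2(z)."""
    c1, c2 = automaton_counts(a1, n), automaton_counts(a2, n)
    return sum(value * c2[z] for z, value in c1.items() if z in c2)

a1 = {"GCG": 0.5, "CCG": 0.5}   # two weighted sequences
a2 = {"CGC": 1.0}               # trivial automaton for one sequence
k = automaton_ngram_kernel(a1, a2, 2)
print(k)  # c_A1: GC=0.5, CG=1.0, CC=0.5; c_A2: CG=1, GC=1 -> 1.5
```

When both automata are trivial (a single sequence with weight one), this reduces to the n-gram kernel between the two sequences.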
Yu et al. [29] have verified that n-gram sequence kernels alone are not good enough to predict protein interactions. We address their concerns in our experiments by combining n-gram with other kernels that include evolutionary information.

Pairwise kernels
We apply kernel methods to the problem of predicting relationships between two given entities, i.e., pairwise prediction. Models that solve this problem take two instances as input and output the relationship between them. Kernels used in these models need to define similarities between two arbitrary pairs of entities. Typically, the construction of a pairwise kernel K is based on a simple kernel k, where k : X × X → R. In this paper four different pairwise kernels are investigated: Direct Sum Learning Pairwise Kernel [21], Tensor Learning Pairwise Kernel (or Kronecker Kernel) [22,30,31], Metric Learning Pairwise Kernel [23] and Cartesian Pairwise Kernel [10].
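How a pairwise kernel K is built from a simple kernel k can be sketched as below. The four forms shown are the ones commonly given in the literature for these kernel families; they are an assumption here, and the cited references should be consulted for the exact definitions. The toy kernel `dot` and all function names are hypothetical.

```python
def direct_sum(k, p, q):
    """Assumed form: K = k(x1, x2) + k(y1, y2)."""
    (x1, y1), (x2, y2) = p, q
    return k(x1, x2) + k(y1, y2)

def tensor_product(k, p, q):
    """Assumed (symmetrized) form: K = k(x1,x2) k(y1,y2) + k(x1,y2) k(y1,x2)."""
    (x1, y1), (x2, y2) = p, q
    return k(x1, x2) * k(y1, y2) + k(x1, y2) * k(y1, x2)

def metric_learning(k, p, q):
    """Assumed form: K = (k(x1,x2) - k(x1,y2) - k(y1,x2) + k(y1,y2))^2."""
    (x1, y1), (x2, y2) = p, q
    return (k(x1, x2) - k(x1, y2) - k(y1, x2) + k(y1, y2)) ** 2

def cartesian(k, p, q):
    """Assumed form: K = k(x1,x2) [y1 == y2] + [x1 == x2] k(y1,y2)."""
    (x1, y1), (x2, y2) = p, q
    return k(x1, x2) * (y1 == y2) + (x1 == x2) * k(y1, y2)

def dot(a, b):
    """Toy simple kernel on real numbers, used only for illustration."""
    return a * b

p, q = (1.0, 2.0), (3.0, 4.0)
print(direct_sum(dot, p, q), tensor_product(dot, p, q),
      metric_learning(dot, p, q), cartesian(dot, p, q))
```

Any simple kernel with the same call signature, including a sequence kernel, can be passed in place of `dot`.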

Pairwise support vector machine
The rationale for the preceding discussion on representing disparate types of data as kernels is to enable us to use them in machine learning formalisms such as Support Vector Machines (SVMs). SVMs are supervised models with associated learning algorithms, used for classification and regression analysis [33]. In this research, we use SVMs for classification. SVMs represent the data as vectors in a vector space (i.e., input or feature space). As a training set, several entities $x_i$ (vectors) classified in two categories are given. An SVM is trained to find a hyperplane that separates the vector space into two parts, such that each part of the feature space groups the entities of the same category. Then, a new entity x can be classified depending on its location in the feature space relative to the hyperplane [33].
Pairwise Support Vector Machines, instead, classify pairs of entities (x, y) [32]. Let us formally define the binary Pairwise Support Vector Machine formulation, following Brunner et al. [32]: given training data $((x_i, y_j), d_i)$, where $d_i$ takes binary values (e.g., the pair $(x_i, y_j)$ is classified as +1 or −1), i = 1, . . . , n, j = 1, . . . , n, and a mapping function $\Phi$, the Pairwise SVM method finds the optimal hyperplane, $w^T \Phi(x_i, y_j) + b = 0$, which separates the points into two categories. One of the solutions is based on the dual formalism of the optimization problem described in Cortes et al. [33]. In this case the decision function is $f(x, y) = \sum_{i,j} \alpha_{ij}\, K((x_i, y_j), (x, y)) + b$, where K is the pairwise kernel, $(x_i, y_j)$ is the set of training examples, α is obtained from the Lagrange multipliers as a function of w (the normal vector) and b is the offset of the hyperplane (see Cortes et al. [33] for more details). Here, α and b are the parameters "learned" during the training process. Thus, f classifies new pairs (x, y).

Metabolic networks
In this work, the metabolic network is represented as a graph, in which the vertices are the enzymes, and the edges are the enzyme-enzyme relations (two proteins are enzymes that catalyze successive reactions in known pathways). Figure 2 represents a graphical transition from a metabolic pathway to a graph.
In a traditional representation of a metabolic pathway, enzymes are vertices (nodes), and metabolites are edges (branches). Following Yamanishi [9], we represent it differently, where the interactions between pairs of enzymes are considered discrete data points. For example, in Figure 2(a), the enzyme numbered EC 5.3.1.9 can create D-fructose-6-phosphate as a product, which is in turn used as a substrate by the enzyme numbered EC 2.7.1.11. This means there is an enzyme-enzyme relation between EC 5.3.1.9 and EC 2.7.1.11. Then, we create a graph in which enzyme-enzyme relations become edges and enzymes are nodes as is shown in Figure 2(b). If there is a relation between two enzymes, such a relation is classified as +1 (i.e., interacting pair). Enzyme-enzyme pairs for which no relation exists are classified as −1 (noninteracting pairs). Figure 2(c) describes these classifications, which are used as training set in the SVM method.
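The conversion from enzyme-enzyme relations to labelled pairs can be sketched as follows. The relation between EC 5.3.1.9 and EC 2.7.1.11 comes from the example above; the third node, `enzyme_C`, is a hypothetical placeholder added only so that non-interacting pairs appear, and the function name `label_pairs` is an assumption.

```python
from itertools import combinations

def label_pairs(enzymes, relations):
    """Label every unordered enzyme pair +1 if an enzyme-enzyme relation
    (an edge of the graph) exists between them, and -1 otherwise."""
    edges = {frozenset(edge) for edge in relations}
    return {
        (a, b): (+1 if frozenset((a, b)) in edges else -1)
        for a, b in combinations(sorted(enzymes), 2)
    }

enzymes = ["EC 5.3.1.9", "EC 2.7.1.11", "enzyme_C"]   # enzyme_C is hypothetical
relations = [("EC 5.3.1.9", "EC 2.7.1.11")]           # relation from the text

labels = label_pairs(enzymes, relations)
for pair, label in labels.items():
    print(pair, label)
```

Using `frozenset` for the edges makes the relation symmetric, so the order in which the two enzymes of a pair are listed does not matter.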

Using pairwise kernel and SVM to predict metabolic networks
The input data, considered as the training example dataset $((x_i, y_i), d_i)$, is a set of known pairs of enzymes (or genes) classified in two categories (interacting or non-interacting pairs). Figure 3(a) shows an example of the input data, obtained from the metabolic network described in Figure 2(c). Figure 3(b) shows the computation of the pairwise kernel between pairs of training examples (i.e., $K((x_1, y_1), (x_2, y_2))$). Several state-of-the-art pairwise kernels were mentioned above. For example, if we consider the Tensor Product Pairwise Kernel K [22], then $K((x_1, y_1), (x_2, y_2))$ is computed using a simple kernel k (e.g., k could be the simple Phylogenetic (PFAM) kernel described by Ben-Hur et al. [22]). The PFAM kernel ($k_{pfam}(x, y)$) describes similarity measures based on the PFAM database [34] between the gene x and the gene y. Thus, the Tensor Product Pairwise Kernel K can be defined using the PFAM kernel $k_{pfam}$ as its simple kernel. For example, in Figure 3(b)-bottom, if the genes are associated with the variables as follows: $x_1$ = YAR071W, $y_1$ = YAL002W, $x_2$ = YDR127W, $y_2$ = YAL038W, then the Tensor Product Pairwise Kernel K((YAR071W, YAL002W), (YDR127W, YAL038W)) is computed from the $k_{pfam}$ values between these genes.
A Pairwise SVM based on the dual formalism of the optimization problem is represented in Figure 3(c). The parameters $\alpha_{ij}$ and b are learned using the pairwise kernel K and the training dataset $(x_i, y_i)$. Finally, new pairs of enzymes or genes (x, y) can be classified as interacting or non-interacting, depending on the evaluation of the decision function f (see an example representation in Figure 3(d)). By predicting the gene interactions of the other unseen examples, all the metabolic pathways can be predicted.
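The classification step can be sketched in Python as follows. This is a toy illustration, not a trained model: the coefficients `alphas`, the offset `b`, the training pairs and both kernel functions are hypothetical values chosen by hand, and the coefficients are assumed to already carry the sign of the class label.

```python
def shared_symbols(a, b):
    """Toy simple kernel (assumption): number of distinct shared symbols."""
    return float(len(set(a) & set(b)))

def product_pairwise_kernel(p, q):
    """Toy pairwise kernel (assumption): k(x1, x2) * k(y1, y2)."""
    (x1, y1), (x2, y2) = p, q
    return shared_symbols(x1, x2) * shared_symbols(y1, y2)

def decision_function(pair, training_pairs, alphas, b, kernel):
    """f(x, y) = sum_i alpha_i * K(training_pair_i, (x, y)) + b."""
    return sum(a * kernel(tp, pair) for tp, a in zip(training_pairs, alphas)) + b

def classify(pair, training_pairs, alphas, b, kernel):
    """+1 (interacting) if f(x, y) > 0, otherwise -1 (non-interacting)."""
    f = decision_function(pair, training_pairs, alphas, b, kernel)
    return 1 if f > 0 else -1

training_pairs = [("GC", "CG"), ("AA", "TT")]   # hypothetical training pairs
alphas = [0.5, -0.5]                            # hypothetical signed coefficients
b = -0.25                                       # hypothetical offset

f_new = decision_function(("GG", "CC"), training_pairs, alphas, b,
                          product_pairwise_kernel)
print(f_new, classify(("GG", "CC"), training_pairs, alphas, b,
                      product_pairwise_kernel))
```

In practice the coefficients and the offset come from an SVM solver run on the precomputed pairwise kernel matrix; only the evaluation of f is shown here.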
The pairwise kernel computation is one of the most expensive tasks, in both processing and storage, during the prediction of metabolic networks. Using sequence data causes even longer execution times and larger storage needs. However, as mentioned above, sequence data has the advantage of avoiding the error accumulation caused by dependencies on genome annotation. In addition, SVMs combined with sequence kernels achieved better accuracy values than other supervised learning methods for metabolic network inference [7]. Therefore, we focus on improving the computation and representation of pairwise kernels by incorporating rational kernels to manipulate the sequence data. To accomplish this, we propose a new framework called Pairwise Rational Kernels.

Pairwise rational kernels
In this section, we propose new pairwise kernels based on rational kernels, i.e., Pairwise Rational Kernels (PRKs). They are obtained using rational kernels as the simple kernels k. We have defined four PRKs, based on the notations and definitions in the Background Section above.
Definition 1. Given $X \subseteq \Sigma^*$ and a transducer U, a function K : (X × X) × (X × X) → R is a Pairwise Rational Kernel when it is one of the four pairwise kernels introduced above (Direct Sum, Tensor Product, Metric Learning or Cartesian) with the rational kernel U used as the simple kernel k. We denote these kernels by $K_{PRKDS}$, $K_{PRKT}$, $K_{PRKM}$ and $K_{PRKC}$, respectively.
Following Theorem 1, if we construct U using a weighted transducer T, such that $U = T \circ T^{-1}$, then we guarantee that U is a Positive Definite and Symmetric (PDS) kernel. Being PDS is a necessary condition for using kernels in training classification algorithms. Since all the kernels defined above are results of PDS kernel operations, the PRKs are also PDS [35].

Algorithm
We have designed a general algorithm, Algorithm 2, to compute the kernels using the composition of weighted transducers. This is an extension of Algorithm 1. It takes as input the transducers $M_{x_1}$, $M_{y_1}$, $M_{x_2}$, $M_{y_2}$, which represent the sequences $x_1, y_1, x_2, y_2 \in X$, and the Weighted Finite-State Transducer U, and outputs the value of $K((x_1, y_1), (x_2, y_2))$.

Algorithm 2 Pairwise Rational Kernel Computation
INPUT: pairs of sequences $(x_1, y_1)$, $(x_2, y_2)$ and WFST U
(i) obtain $M_{x_1}$, $M_{y_1}$, $M_{x_2}$, $M_{y_2}$ and use transducer composition to compute the transducers $N_1$, $N_2$, $N_3$, $N_4$
(ii) compute the sum of all paths of $N_1$, $N_2$, $N_3$, $N_4$ using the shortest-distance algorithm
(iii) compute the formulas in Definition 1

In our implementation described below, we use the n-gram rational kernel as the kernel U (see the n-gram kernel as a rational kernel Section for more details). Then, the complexity of steps (i) and (ii) is $O(|M_{x_1}| + |M_{y_1}| + |M_{x_2}| + |M_{y_2}|)$. Step (iii) adds a constant time complexity. We conclude that PRKs based on n-gram kernels can also be computed in time $O(|M_{x_1}| + |M_{y_1}| + |M_{x_2}| + |M_{y_2}|)$.
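The three steps of Algorithm 2 can be illustrated with a direct-counting sketch in Python. Several assumptions are made: n-gram counting stands in for transducer composition and shortest-distance, the four base values are taken to be U evaluated on the cross combinations of the four sequences, and the final step uses the commonly cited Metric Learning form because it involves all four values. The function names are hypothetical.

```python
from collections import Counter

def ngram_counts(sequence, n):
    """Counts of all length-n substrings; stands in for steps (i)-(ii)."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))

def base_value(counts_a, counts_b):
    """U(a, b) for the n-gram kernel, computed from precomputed counts."""
    return sum(c * counts_b[z] for z, c in counts_a.items())

def prk_metric(pair1, pair2, n=3):
    """Assumed Metric Learning PRK: (N1 - N3 - N4 + N2)^2."""
    (x1, y1), (x2, y2) = pair1, pair2
    # Each sequence is scanned once, so the cost is linear in total length.
    cx1, cy1, cx2, cy2 = (ngram_counts(s, n) for s in (x1, y1, x2, y2))
    n1 = base_value(cx1, cx2)   # U(x1, x2)
    n2 = base_value(cy1, cy2)   # U(y1, y2)
    n3 = base_value(cx1, cy2)   # U(x1, y2)
    n4 = base_value(cy1, cx2)   # U(y1, x2)
    return (n1 - n3 - n4 + n2) ** 2   # step (iii): constant-time combination

value = prk_metric(("GCGC", "CGCG"), ("GCGC", "CGCG"), n=2)
print(value)  # (5 - 4 - 4 + 5)^2 = 4
```

Computing the counts once per sequence and reusing them for all four base values is what keeps the overall cost linear in the sequence lengths.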

Experiments
In this section we describe experiments to predict metabolic networks using pairwise SVMs combined with PRKs. We aim to demonstrate the advantage of using PRKs to improve execution time during the computation of the pairwise kernels and the training process, while maintaining or improving accuracy values.

Dataset
We used data from the yeast Saccharomyces cerevisiae [36]. This species was selected to compare our methods, implementations and results with other methods that also predict biological networks for Saccharomyces cerevisiae [9,10,22].
The data for this species were taken from the KEGG pathway [37] and converted to a graph as described in the previous section (see Figure 2 for more details). There were 755 nodes and 2575 interacting pairs in the graph for this species. As we used SVM methods for the metabolic network inference, a balanced dataset is preferable. This dataset, however, has unbalanced proportions of interacting (+1) and non-interacting (−1) pairs (there were 282060 non-interacting pairs). In order to balance our dataset, we followed the procedure recommended by Yu et al. [29], using the program BRS-noint to select non-interacting pairs. Yu et al. [29] describe the bias introduced by the selection of non-interacting pairs during the training process and the accuracy estimation. To eliminate this bias, the BRS-noint program is used to create a "balanced" negative set that maintains the right distribution of non-interacting and interacting pairs. As a result, we obtained 2574 non-interacting pairs, for a total of 5149 pairs in the training process.

Training process and kernel computation
The known part of the metabolic network was converted into a graph, from which the pairs of the training set were obtained, corresponding to Figure 3(a). The PRK representation coincides with Figure 3. Here, we describe the computation of PRKs (which is the main contribution of this research), given the data from the yeast Saccharomyces cerevisiae:
• each of the 755 known genes was represented as a trivial weighted automaton (i.e., $A_{x_1}, A_{x_2}, \ldots, A_{x_{755}}$) using its nucleotide sequence,
• the n-gram kernel, with n = 3, was used as the rational kernel, so that $U(A_{x_1}, A_{x_2}) = \sum_{|z|=3} c_{A_{x_1}}(z)\, c_{A_{x_2}}(z)$ (see the n-gram kernel as a rational kernel Section for more details),
• Algorithm 2 was implemented to obtain the K values,
• as an example, the Tensor Product Pairwise Rational Kernel in Definition 1 is obtained from the 3-gram values U computed between the automata of the genes in the two pairs,
• finally, all the PRKs K with positive eigenvalues were normalized, so that longer sequences, which may contain more n-grams, do not obtain higher similarity values merely because of their length [16].
With these results and other values corresponding to the 3-gram rational kernel, $K_{PRKT}$ is computed as $K_{PRKT}((x_1, y_1), (x_2, y_2)) = 0.3$, where 0.3 is a measure of similarity.
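The effect of normalization can be illustrated with a short Python sketch. The scheme shown, dividing each entry by the geometric mean of the two corresponding diagonal entries, is a common choice and is an assumption here, since the exact normalization is not spelled out in the text; the matrix values are hypothetical.

```python
import math

def normalize_gram(gram):
    """Return K'(a, b) = K(a, b) / sqrt(K(a, a) * K(b, b)).
    Assumes a square matrix with strictly positive diagonal entries."""
    size = len(gram)
    return [
        [gram[i][j] / math.sqrt(gram[i][i] * gram[j][j]) for j in range(size)]
        for i in range(size)
    ]

# Hypothetical 2 x 2 kernel matrix: the second item has a larger
# self-similarity, as a longer sequence with more n-grams would.
gram = [[4.0, 2.0],
        [2.0, 9.0]]

normalized = normalize_gram(gram)
print(normalized)  # diagonal entries become 1.0, off-diagonal 2 / 6
```

After this rescaling every item has unit self-similarity, so similarity values are comparable across sequences of different lengths.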

SVM and predicting process
To implement the pairwise SVM method, we use the sequential minimal optimization (SMO) technique from the package LIBSVM [40] in combination with the OpenKernel library [39]. During the training process, the decision function was obtained by estimating the parameters, as shown in Figure 3(c). The prediction process then allows new pairs of nucleotide sequences to be classified as interacting or non-interacting by evaluating the decision function. Example 4 describes the prediction process, similar to the process described in Figure 3(d), but using nucleotide sequences.
Example 4. This example describes the prediction process. Suppose we want to know whether x = CTCAAAGTCTTAATGCTTGGACAAATTGAAATTGG and y = TCTACAGAGTCGTCCTTCGTCTACCGGGAAAAT, which represent abbreviated nucleotide sequences, interact or do not interact. The decision function, f(x, y), was previously obtained during the training process (see the Pairwise support vector machine Section for more details). If the value of the decision function f(x, y) is greater than 0, the pair (x, y) interacts; otherwise the pair (x, y) does not interact. Suppose that the evaluation yields a value greater than 0. Then, we predict that these nucleotide sequences (x, y) interact in the context of the metabolic network of the yeast Saccharomyces cerevisiae.
In this case, we used 755 genes during the training process, but the species has more than 6000 genes [41]. The rest of the metabolic pathways can then be predicted by classifying all other pairs of genes (or pairs of raw nucleotide sequences) as interacting or non-interacting, using the decision function f. Note that the decision function is obtained once during the training process, but can be used as often as needed during the prediction process.
The advantage of using sequence data is that nucleotide sequences can be used even if they are not annotated. Also, any other type of sequence data, e.g., from high-throughput analysis, can be considered and combined, using a similar implementation.

Experiment description and performance measures
We used pairwise SVM with PRKs for metabolic network prediction, using the data and algorithms described above. We ran experiments for twelve different kernels. First, we used the four PRKs described in Definition 1 with the 3-gram rational kernel (i.e., $K_{PRKDS-3gram}$, $K_{PRKT-3gram}$, $K_{PRKM-3gram}$ and $K_{PRKC-3gram}$). In addition, combinations of PRKs with other kernels were considered. We included the phylogenetic kernel ($K_{phy}$) described by Yamanishi [9] and the PFAM kernel ($K_{pfam}$) described by Ben-Hur et al. [22]. A second set of experiments was then developed combining PRKs with the phylogenetic kernel (i.e., $K_{PRKDS-3gram} + K_{phy}$, $K_{PRKT-3gram} + K_{phy}$, $K_{PRKM-3gram} + K_{phy}$ and $K_{PRKC-3gram} + K_{phy}$). Finally, we combined PRKs with the PFAM kernel, obtaining the $K_{PRKDS-3gram} + K_{pfam}$, $K_{PRKT-3gram} + K_{pfam}$, $K_{PRKM-3gram} + K_{pfam}$ and $K_{PRKC-3gram} + K_{pfam}$ kernels. Since the phylogenetic and PFAM kernels are PDS, the resulting combinations are also PDS [35].
To assess the advantages of the PRK framework, we developed a new set of experiments with the same dataset, but without using finite-state transducers. We considered the pairwise (n-gram) kernel, i.e., $K_{T-3gram}$, which denotes the pairwise tensor product described in the Pairwise kernels Section. To be consistent with the previous experiments, we combined the $K_{T-3gram}$ kernel with the phylogenetic kernel ($K_{phy}$) and the PFAM kernel ($K_{pfam}$), i.e., the $K_{T-3gram} + K_{phy}$ and $K_{T-3gram} + K_{pfam}$ kernels, respectively. The pairwise SVM algorithm was used to predict the metabolic network using the same dataset described above. Table 1 describes the groups created to compare these kernels with the equivalent PRKs.
All the experiments were executed on a PC with an Intel Core i7 processor and 8MB RAM. To validate the model, we used the 10-fold cross-validation method and measured the average Area Under the Curve of Receiver Operating Characteristic (AUC ROC) score.
Cross-validation is a suitable approach to validate the performance of predictive models. In k-fold cross-validation, the original dataset is randomly partitioned into k equal-sized subsets. Then, the model is trained k times. Each time, one of the k subsets is reserved for testing and the remaining k − 1 subsets are used for training. The final value is obtained as the average of the k results (see Kohavi et al. [42] for more details).
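The partitioning step of k-fold cross-validation can be sketched in Python as follows. The function name `k_fold_splits` and the fixed random seed are assumptions made for reproducibility of the sketch; when the number of items is not divisible by k, the fold sizes differ by at most one.

```python
import random

def k_fold_splits(n_items, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)        # random partition
    folds = [indices[i::k] for i in range(k)]   # k near-equal subsets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 5149 training pairs and 10 folds, as in the experiments described here.
splits = list(k_fold_splits(5149, 10))
print(sorted(len(test) for _, test in splits))
```

Each pair appears in exactly one test fold, so every example is used for testing once and for training k − 1 times.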
A Receiver Operating Characteristic (ROC) curve is a plot of the True Positive Rate (TPR) versus the False Positive Rate (FPR) for different possible cut-offs of a binary classifier system. A cut-off defines a level for discriminating positive and negative categories. ROC curve analysis is used to assess the overall discriminatory ability of the SVM binary classifiers. The area under the curve (average AUC score) has been used as a metric to evaluate the strength of the classification.
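The AUC can be computed without drawing the curve, since it equals the fraction of (positive, negative) example pairs that the classifier ranks correctly, with ties counted as one half. The following Python sketch uses this pairwise formulation; the function name `auc_score` and the example scores are hypothetical.

```python
def auc_score(scores, labels):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive example receives
    the higher score; ties count as one half."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == -1]
    correct = sum(
        (p > q) + 0.5 * (p == q) for p in positives for q in negatives
    )
    return correct / (len(positives) * len(negatives))

# Hypothetical decision values f(x, y) and true labels for four pairs.
scores = [0.9, 0.4, 0.6, 0.1]
labels = [1, 1, -1, -1]

auc = auc_score(scores, labels)
print(auc)  # 3 of the 4 (positive, negative) pairs are ordered correctly: 0.75
```

A value of 0.5 corresponds to a random ranking and 1.0 to a perfect separation of the two categories. The quadratic loop is adequate for a sketch; sorting-based formulations are preferable for large test sets.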
In addition, the 95% Confidence Intervals (CIs) have been computed following the method described by Cortes and Mohri [43]. The authors provide a distribution-independent technique to compute confidence intervals for average AUC values. The variance depends on the number of positive and negative examples (2575 and 2574 in our case) and on the number of classification errors, which ranged between 889 and 1912 in our experiments.
Table 2 shows the SVM performance, execution times and 95% CIs grouped by the kernels mentioned above. As we can see, the experiments using only the PRKs have the best execution times (Exp. I), as the transducer representations and algorithms speed up the processing. However, the [...] [29] with PPI networks. Yu et al. [29] stated that simple sequence-based kernels, such as n-gram kernels, do not properly predict protein interactions. However, when Yu et al. [29] combined sequence kernels with other kernels that incorporate evolutionary information, the accuracy of the model predictor was improved.

Results and discussion
We obtained similar results for metabolic network prediction: when the PHY and PFAM kernels were included (Experiments II and III, respectively), accuracies were improved while maintaining adequate processing times. The best accuracy value was obtained by combining the PRK-Metric-3gram and PFAM kernels (average AUC=0.844). Other papers have used similar kernel combinations to improve the prediction of biological networks, such as Ben-Hur et al. [22] and Yamanishi [9]. However, rational kernels have not been used in previous research.
Ben-Hur et al. [22] report an average AUC value of 0.78 for PFAM kernels, while Yamanishi [9] reports an average AUC of 0.77 for the PHY kernel for predicting Saccharomyces cerevisiae metabolic pathways. We previously carried out similar experiments using SVM methods [7], obtaining AUC values of 0.92 for the PFAM kernel and 0.80 for the PHY kernel, with execution times of 12060 and 7980 seconds, respectively. However, in all these cases a random selection of negative and positive training data was used. As noted by Yu et al. [29], random selection of the training data for machine learning tools biases the average AUC values towards genes (or proteins) with large numbers of interactions. As such, the high AUC results in these previous works cannot be directly compared to the results in this paper. We have employed the balanced sampling techniques suggested by Yu et al. [29] to combat bias in the training set. Our results, with average AUC values in the range 0.5-0.844, are comparable to, and in some cases exceed, the results obtained by Yu et al. [29] with balanced sampling, which range from 0.5 to 0.75 across several different kernels for protein interaction problems. We obtained these results with execution times of 15-140 seconds. With the exception of the direct sum kernel, all of the confidence intervals lie above the performance of a random classifier.
We developed one more experiment with the PFAM kernel as a simple kernel of the Pairwise Tensor Product (K pfam ), using balanced sampling as suggested by Yu et al. [29]. Note that it is not a PRK; it is a regular pairwise kernel using PFAM as a simple kernel, similar to the example in the Using pairwise kernel and SVM to predict metabolic networks Section. As a result, the average AUC was 0.61 and the execution time was 122 seconds. When we compare these values with the results in Table 2 Exp. I, we can see that the kernels K PRKM−3gram [...]
[...] where N fs is the number of times Algorithm A failed and Algorithm B succeeded, and N sf is the number of times Algorithm A succeeded and Algorithm B failed. When z is equal to 0, the two algorithms have similar performance. Additionally, if N fs is larger than N sf then Algorithm B performs better than Algorithm A, and vice versa. We computed the z scores considering Algorithm A as the SVM algorithm using the Pairwise Tensor Product (K pfam ) and three different choices of Algorithm B, namely SVM with three different PRKs from Table 2 (i.e., K PRKM−3gram , K PRKC−3gram and K PRKT−3gram + K pfam , mentioned above). In all cases, we obtained z scores greater than 0 (i.e., 4.73, 4.54 and 7.51), which means that the PRKs performed better. These z scores also showed that the difference was statistically significant at a confidence level of 99% (based on the Two-tailed Prediction Confidence Levels described by [45]).
The Cartesian Kernel has not been widely used since it was defined by Kashima et al. [10]. Kashima et al. [10] used Expression, Localization, Chemical and Phylogenetic kernels to predict metabolic networks. Each of these is a non-sequence kernel. In the current experiments we computed, for the first time, the pairwise Cartesian kernel with a rational kernel (sequence kernel) to represent sequence data for metabolic network prediction.
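The z-score comparison of Algorithm A and Algorithm B used above can be sketched as follows. The formula itself is not reproduced in this section, so the sketch assumes the widely used McNemar-style statistic with continuity correction, z = (|N sf − N fs| − 1) / sqrt(N sf + N fs); this form, the function names and the toy predictions are assumptions for illustration. The critical value 2.576 is the standard two-tailed normal threshold for a 99% confidence level.

```python
import math

def disagreement_counts(truth, pred_a, pred_b):
    """Return (N sf, N fs): cases where exactly one of A and B is correct."""
    n_sf = sum(a == t and b != t for t, a, b in zip(truth, pred_a, pred_b))
    n_fs = sum(a != t and b == t for t, a, b in zip(truth, pred_a, pred_b))
    return n_sf, n_fs

def mcnemar_z(n_sf, n_fs):
    """Assumed McNemar-style z score with continuity correction."""
    disagreements = n_sf + n_fs
    if disagreements == 0:
        return 0.0  # The two algorithms never disagree.
    return (abs(n_sf - n_fs) - 1) / math.sqrt(disagreements)

# Hypothetical toy predictions for eight test pairs.
truth  = [1, 1, 0, 0, 1, 0, 1, 0]
pred_a = [1, 0, 1, 0, 0, 1, 0, 0]
pred_b = [1, 1, 0, 0, 1, 0, 0, 1]
n_sf, n_fs = disagreement_counts(truth, pred_a, pred_b)

# Hypothetical counts on a larger test set.
z = mcnemar_z(n_sf=10, n_fs=40)
significant_99 = z > 2.576  # two-tailed threshold for 99% confidence
print(n_sf, n_fs, round(z, 2), significant_99)
```

Only the cases on which the two algorithms disagree enter the statistic, so the comparison is unaffected by examples that both algorithms classify correctly or both misclassify.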
Cartesian kernels [10] were defined as an alternative that improves the computational performance of the Tensor Product Pairwise Kernel [22]. The three experiments shown in Table 2 are consistent with this claim, as we obtained better accuracy and execution times when we used the Cartesian Pairwise Rational Kernel (K PRKC−3gram ) rather than the Tensor Product Rational Kernel (K PRKT−3gram ). Compared with the results of Kashima et al. [10], we obtained better average AUC values (i.e., 0.844 vs 0.79) and approximately the same average execution time (i.e., 93 seconds). Kashima et al. [10] used non-sequence data and random selection of positive and negative data for training.
Figure 4 shows the results of the experiments comparing the PRK framework with other pairwise kernels. The three comparative groups described in Table 1 were used. As can be seen, the execution times were better in all three groups when the PRKs were used. This indicates that PRKs compute faster because rational kernels use finite-state transducer operations and representations, improving the performance.
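To make the difference between the tensor product and Cartesian pairwise kernels discussed above concrete, the following sketch evaluates both on a toy base kernel. It assumes the symmetric forms commonly attributed to Ben-Hur et al. [22] and Kashima et al. [10]; the definitions given in the Pairwise kernels Section take precedence if they differ. The base-kernel values and function names are hypothetical.

```python
# Hypothetical symmetric base-kernel values between four entities 0..3.
BASE = [
    [4, 2, 1, 0],
    [2, 3, 1, 1],
    [1, 1, 5, 2],
    [0, 1, 2, 3],
]

def k(i, j):
    """Base (simple) kernel between two entities."""
    return BASE[i][j]

def tensor_product_pairwise(pair1, pair2):
    """Assumed form: K((a,b),(c,d)) = k(a,c)k(b,d) + k(a,d)k(b,c)."""
    (a, b), (c, d) = pair1, pair2
    return k(a, c) * k(b, d) + k(a, d) * k(b, c)

def cartesian_pairwise(pair1, pair2):
    """Assumed form: base-kernel terms survive only for a shared entity."""
    (a, b), (c, d) = pair1, pair2
    return (
        k(a, c) * (b == d)
        + (a == c) * k(b, d)
        + k(a, d) * (b == c)
        + (a == d) * k(b, c)
    )

shared = ((0, 1), (0, 2))    # the two pairs share entity 0
disjoint = ((0, 1), (2, 3))  # the two pairs share no entity

print(tensor_product_pairwise(*shared), cartesian_pairwise(*shared))
print(tensor_product_pairwise(*disjoint), cartesian_pairwise(*disjoint))
```

Under these assumed forms, the Cartesian kernel is zero for every two pairs that share no entity, so its Gram matrix is sparse, which is one plausible reason for the shorter execution times.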
The power of using kernels is that almost any sort of data can be represented using kernels. Therefore, completely disparate types of data can be combined to add power to kernel-based machine learning methods [8]. For example, coefficients describing the relative amounts of metabolites involved in a biochemical reaction (i.e., stoichiometric data) can also be represented as kernels and added to strengthen the predictive model. For instance, the reaction catalyzed by fructose-bisphosphate aldolase [EC 4.1.2.13] splits 1 molecule of fructose 1,6-bisphosphate into 2 molecules of glyceraldehyde 3-phosphate, where the relative amounts of substrate and product are represented by the coefficients 1 and 2, respectively. A stoichiometric kernel would therefore encode coefficients for all substrates and products, where enzymes that do not interact would have stoichiometric coefficients of 0. Other authors [46][47][48] have defined and used similar types of stoichiometric data, which can be converted into kernels to be considered with PRKs.
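One possible way to turn stoichiometric data into a kernel is sketched below. This is an assumption for illustration only: no stoichiometric kernel was implemented in this work, and the linear kernel, the metabolite abbreviations and the two additional enzymes are hypothetical. Only the aldolase coefficients (1 and 2) are taken from the example above.

```python
def stoichiometric_kernel(enzyme_x, enzyme_y):
    """Linear kernel on stoichiometric coefficient vectors.

    Each enzyme is a dict mapping metabolite -> coefficient; metabolites
    that are absent from a dict have an implicit coefficient of 0.
    """
    return sum(
        coeff * enzyme_y.get(metabolite, 0)
        for metabolite, coeff in enzyme_x.items()
    )

# Aldolase coefficients follow the example in the text; F16BP and G3P are
# abbreviations for fructose 1,6-bisphosphate and glyceraldehyde 3-phosphate.
aldolase = {"F16BP": 1, "G3P": 2}

# Hypothetical enzymes used only to illustrate shared and unshared metabolites.
enzyme_b = {"G3P": 1, "metabolite_X": 1}
enzyme_c = {"metabolite_Y": 1}

k_ab = stoichiometric_kernel(aldolase, enzyme_b)
k_ac = stoichiometric_kernel(aldolase, enzyme_c)
k_aa = stoichiometric_kernel(aldolase, aldolase)
print(k_ab, k_ac, k_aa)
```

Enzymes that share no metabolite obtain a kernel value of 0, which matches the convention described above, and the resulting Gram matrix could be added to a PRK in the same way as the phylogenetic and PFAM kernels.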

Conclusion
In this paper, we introduced a new framework called Pairwise Rational Kernels, in which pairwise kernels are obtained from transducer representations, i.e., rational kernels. We defined the framework, developed general algorithms and tested them with the pairwise Support Vector Machine method to predict metabolic networks.
We used a dataset from the yeast Saccharomyces cerevisiae to validate and compare our proposal with similar models using data from the same species. We obtained better execution times than the other models, while maintaining adequate accuracy values. Therefore, PRKs improved the performance of the pairwise-SVM algorithm used in the training process of the supervised network inference methods.
In these methods, the learning process is executed once to obtain the decision function. The decision function can then be used as many times as necessary to predict interactions between the other sequences in the species and thus predict the metabolic pathways.
The methods in this research used sequence data (e.g., nucleotide sequences) to predict these interactions. Genes do not need to be correctly annotated as the raw sequences can be used. Therefore, our methods were able to avoid the error accumulation due to wrong gene annotations.
As future work, our proposal will be used to produce a set of candidate pathway interactions from the same and other species, which could then be experimentally validated. In addition, other pairwise rational kernels may be developed using other finite-state transducer operations.