Open Access

Directed acyclic graph kernels for structural RNA analysis

BMC Bioinformatics 2008, 9:318

DOI: 10.1186/1471-2105-9-318

Received: 13 April 2008

Accepted: 22 July 2008

Published: 22 July 2008

Abstract

Background

Recent discoveries of a large variety of important roles for non-coding RNAs (ncRNAs) have been reported by numerous researchers. In order to analyze ncRNAs by kernel methods including support vector machines, we propose stem kernels as an extension of string kernels for measuring the similarities between two RNA sequences from the viewpoint of secondary structures. However, applying stem kernels directly to large data sets of ncRNAs is impractical due to their computational complexity.

Results

We have developed a new technique based on directed acyclic graphs (DAGs) derived from base-pairing probability matrices of RNA sequences that significantly increases the computation speed of stem kernels. Furthermore, we propose profile-profile stem kernels for multiple alignments of RNA sequences which utilize base-pairing probability matrices for multiple alignments instead of those for individual sequences. Our kernels outperformed the existing methods with respect to the detection of known ncRNAs and kernel hierarchical clustering.

Conclusion

Stem kernels can be utilized as a reliable similarity measure of structural RNAs, and can be used in various kernel-based applications.

Background

Recent discoveries of a large variety of important roles for non-coding RNAs (ncRNAs), including gene regulation and the maturation of mRNAs, rRNAs and tRNAs, have been reported by many researchers. Most functional ncRNAs form secondary structures related to their functions, and secondary structures without pseudoknots can be modeled by stochastic context-free grammars (SCFGs) [1, 2]. Therefore, several computational methods based on SCFGs have been developed for modeling and analyzing functional ncRNA sequences [3–14]. These grammatical methods work very well if the secondary structures of the target ncRNAs are modeled successfully. However, building such stochastic models is difficult: it requires constructing complicated models, preparing a sufficient number of training sequences, and/or obtaining prior knowledge for families that contain non-uniform and/or non-homologous sequences, such as snoRNA families. Thus, we need more robust methods for performing structural ncRNA analysis. Meanwhile, support vector machines (SVMs) and other kernel methods are being actively studied, and have been proposed for solving various problems in many research fields, including bioinformatics [15]. These methods are more robust than other existing methods, and we therefore considered using kernel methods, including SVMs, instead of the grammatical methods to analyze functional ncRNAs.

Several kernels for ncRNA sequences have been developed so far [16–19]. Kin et al. have proposed marginalized count kernels for RNA sequences [16]. Their kernels calculate marginalized count vectors of base-pair features under SCFGs trained with a given dataset, and compute the inner products. Therefore, marginalized count kernels inherit the drawback of the grammatical methods. Washietl et al. have developed a program called RNAz, which detects structurally conserved regions from multiple alignments by using SVMs [17]. RNAz employs the averaged z-score of the minimum free energy (MFE) for each sequence and the structure conservation index (SCI). Assuming that the MFE of the common secondary structure is close to that of each sequence if a given multiple alignment is structurally conserved, the SCI is defined as the ratio of the MFE of the common secondary structure to the average MFE of the individual sequences. These features allow for the detection of structurally conserved regions. However, since these features cannot measure the structural similarities between RNA sequences, it is difficult to apply them to other aspects of structural RNA analysis, such as detecting particular families. Several methods that rely on features specific to given target families (e.g. miRNAs and snoRNAs) have been proposed [18, 19]. These family-specific methods perform well in detecting their target families. However, in order to apply this strategy to other families, it is necessary to develop new features for every family.

For the purpose of analyzing ncRNAs using kernel methods including support vector machines, we have proposed stem kernels, which extend the string kernels to measure the similarities between two RNA sequences from the viewpoint of secondary structures [20]. The feature space of the stem kernels is defined by enumerating all possible common base pairs and stem structures of arbitrary lengths. However, since the computational time and memory size required for the naive implementation of stem kernels are of the order of $O(n^4)$, where $n$ is the length of the input RNA sequence, applying stem kernels directly to large data sets of ncRNAs is impractical.

Therefore, we develop a new technique based on directed acyclic graphs (DAGs) derived from base-pairing probability matrices of RNA sequences, which significantly reduces the computational time of stem kernels. The time and space complexity of this method are approximately of the order of $O(n^2)$. Furthermore, we propose profile-profile stem kernels for multiple alignments of RNA sequences, which utilize base-pairing probability matrices for multiple alignments instead of those for individual sequences.

Methods

In this section, we propose new kernels for analyzing ncRNAs. First, an outline of our previous work is provided, after which the proposed new technique based on directed acyclic graphs (DAGs) derived from base-pairing probability matrices of RNA sequences is described. Finally, the proposed kernels are extended to kernels for multiple alignments of RNA sequences by utilizing averaged base-pairing probability matrices.

Naive stem kernel algorithms

Before proposing the new method, we briefly describe stem kernels, which have been proposed as an extension of the string kernels for measuring the similarities between two RNA sequences from the viewpoint of secondary structures [20]. The feature space of the stem kernels is defined by enumerating all possible common base pairs and stem structures of arbitrary lengths. The stem kernel calculates the inner product of common stem structure counts. In other words, the more stem structures two RNA sequences have in common, the more similar they are. However, the time needed for the explicit enumeration of all substructures grows exponentially, which renders this method infeasible for long sequences. We have therefore developed an algorithm for calculating stem kernels which is based on the dynamic programming technique. For an RNA sequence $x = x_1 x_2 \ldots x_n$ ($x_k \in \{A, C, G, U\}$), we denote a contiguous subsequence $x_j \ldots x_k$ by $x[j, k]$, and the length of $x$ by $|x|$. The empty sequence is indicated by $\epsilon$. For a base $a$, the complementary base is denoted as $\bar{a}$. For a string $x$ and a base $a$, $xa$ denotes the concatenation of $x$ and $a$. For two RNA sequences $x$ and $x'$, the stem kernel $K$ is defined recursively as follows:
$$
\begin{aligned}
K(\epsilon, x') &= K(x, \epsilon) = 1, \quad \text{for } \forall x, x', \\
K(xa, x') &= K(x, x') + \sum_{x_k = \bar{a}} \; \sum_{i < j \ \text{s.t.}\ x'_i = \bar{a},\ x'_j = a} K(x[k+1, |x|],\ x'[i+1, j-1]).
\end{aligned}
\tag{1}
$$

Both the time and the memory required for the calculation of $K(x, x')$ are of the order of $O(|x|^2|x'|^2)$, which renders this method impractical for large data sets of ncRNAs.
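The recursion in Eq. (1) can be sketched directly with memoization. This is an illustrative sketch only, not the authors' implementation: the function name `stem_kernel` and the `COMPLEMENT` table are hypothetical, only Watson-Crick complements are assumed, and caching on substrings is chosen for readability rather than for the stated time and memory bounds.

```python
from functools import lru_cache

# Assumption: Watson-Crick complements only (no G-U wobble pairs).
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}


@lru_cache(maxsize=None)
def stem_kernel(x: str, y: str) -> int:
    """Naive stem kernel K(x, y) following the recursion of Eq. (1)."""
    # Base case: K(eps, y) = K(x, eps) = 1.
    if not x or not y:
        return 1
    prefix, a = x[:-1], x[-1]
    a_bar = COMPLEMENT[a]
    # First term: structures that do not use the last base of x.
    total = stem_kernel(prefix, y)
    # Second term: the last base a of x pairs with some x_k = a_bar,
    # matched against every pair (i, j) in y with y_i = a_bar, y_j = a.
    for k, base in enumerate(prefix):
        if base != a_bar:
            continue
        inner_x = prefix[k + 1:]
        for i in range(len(y)):
            if y[i] != a_bar:
                continue
            for j in range(i + 1, len(y)):
                if y[j] == a:
                    total += stem_kernel(inner_x, y[i + 1:j])
    return total
```

For example, `stem_kernel("AU", "AU")` counts the empty structure plus the single shared A-U pair. The nested loops over `k`, `i`, and `j` make the quartic cost of the naive formulation visible.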

Stem kernels with DAG representation

Here, we develop a new technique based on directed acyclic graphs (DAGs) derived from base-pairing probability matrices of RNA sequences, which significantly reduces the time needed for computing stem kernels. Figure 1 contains a diagram illustrating the calculation of the new kernels.
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-9-318/MediaObjects/12859_2008_Article_2303_Fig1_HTML.jpg
Figure 1

Averaged base-pairing probability matrices and DAG kernels using the dynamic programming technique enable us to calculate profile-profile stem kernels for multiple alignments of RNA sequences. (a) Given a pair of multiple alignments, (b) calculate the base-pairing probability matrices for each sequence in the multiple alignments and average these base-pairing probabilities with respect to the columns of each alignment. (c) Build a DAG for the averaged base-pairing probability matrix, where each vertex corresponds to a base pair whose probability is above a predefined threshold. (d) Calculate a kernel value for a pair of DAGs for the multiple alignments by using the DAG kernel and the dynamic programming technique.

First, for each RNA sequence $x = x_1 x_2 \ldots x_n$, we calculate a base-pairing probability matrix $P^x$ using the McCaskill algorithm [21]. We denote the base-pairing probability of $(x_i, x_j)$ by $P^x_{ij}$, which is defined as:
$$
P^x_{ij} = \mathbb{E}[I_{ij} \mid x] = \sum_{y \in \mathcal{Y}(x)} p(y \mid x)\, I_{ij}(y),
\tag{2}
$$

where $\mathcal{Y}(x)$ is the ensemble of all possible secondary structures of $x$, $p(y \mid x)$ is the posterior probability of $y$ given $x$, and $I_{ij}(y)$ is an indicator function, which equals 1 if the $i$-th and the $j$-th nucleotides form a base pair in $y$ and 0 otherwise. We employ the Vienna RNA package [22] for computing these expected counts (2) using the McCaskill algorithm.
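To make the expectation in Eq. (2) concrete, the sketch below evaluates it over a small, hand-written ensemble. This is a toy illustration under stated assumptions: in practice the ensemble is exponentially large and the McCaskill algorithm computes the same quantity by dynamic programming; the function name and the input format (a list of probability and base-pair-set tuples, with 0-based positions) are hypothetical.

```python
def base_pair_probabilities(n, ensemble):
    """Evaluate Eq. (2) by explicit summation over a listed ensemble.

    n        -- sequence length
    ensemble -- iterable of (p, pairs), where p is the posterior probability
                p(y|x) of one structure y and pairs is its set of (i, j)
                base pairs with 0-based i < j (hypothetical toy format)
    """
    P = [[0.0] * n for _ in range(n)]
    for prob, pairs in ensemble:
        for i, j in pairs:
            # The indicator I_ij(y) is 1 exactly for the listed pairs.
            P[i][j] += prob
            P[j][i] += prob  # keep the matrix symmetric
    return P


# Three structures of a length-6 sequence with probabilities summing to 1.
toy_ensemble = [
    (0.5, {(0, 5)}),
    (0.3, {(0, 5), (1, 4)}),
    (0.2, set()),
]
P = base_pair_probabilities(6, toy_ensemble)
```

Here the pair (0, 5) appears in two structures, so its probability is the sum of their posteriors.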

Subsequently, we build a DAG for the base-pairing probability matrix, where each vertex corresponds to a base pair whose probability is above a predefined threshold $p^*$. Let $G_x = (V_x, E_x)$ be the DAG for an RNA sequence $x$, where $V_x$ and $E_x$ are the vertices and edges of $G_x$, respectively. For each $v_i = (k, l) \in V_x$, $(x_k, x_l)$ is a likely base pair, in other words, $P^x_{kl} \geq p^*$. Each $e_{ij} \in E_x$ is an edge from vertex $v_i$ to vertex $v_j$.

For vertices $v_i = (k, l)$ and $v_{i'} = (k', l')$, we define a partial order: $v_i \prec v_{i'}$ if and only if $k < k'$ and $l > l'$. An edge $e_{ii'}$ connects vertices $v_i$ and $v_{i'}$ if and only if $v_i \prec v_{i'}$ and there exists no $v_j \in V_x$ such that $v_i \prec v_j \prec v_{i'}$.
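The construction just described (threshold the matrix, order the base pairs by nesting, keep only edges with no intermediate vertex) can be sketched as follows. The function name `build_dag` and the return format are hypothetical, and the cubic-time edge search is chosen for clarity, not efficiency.

```python
def build_dag(P, threshold):
    """Build (vertices, edges) from a base-pairing probability matrix P.

    vertices -- list of base pairs (k, l), k < l, with P[k][l] >= threshold
    edges    -- dict mapping a vertex index to the indices of its children
    """
    n = len(P)
    # Enumerating k in increasing order yields a topological order, because
    # v < v' in the partial order requires k < k'.
    vertices = [(k, l) for k in range(n) for l in range(k + 1, n)
                if P[k][l] >= threshold]

    def precedes(v, w):
        # v encloses w: k < k' and l > l'.
        return v[0] < w[0] and v[1] > w[1]

    edges = {i: [] for i in range(len(vertices))}
    for i, v in enumerate(vertices):
        for j, w in enumerate(vertices):
            if not precedes(v, w):
                continue
            # Keep the edge only if no vertex u lies strictly between v and w.
            if not any(precedes(v, u) and precedes(u, w) for u in vertices):
                edges[i].append(j)
    return vertices, edges
```

With three nested base pairs above the threshold, the result is a chain: the outermost pair is linked to the middle pair only, since the middle pair lies between it and the innermost one.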

Finally, we calculate a kernel value between two DAGs representing RNA structure information through the DAG kernel using a dynamic programming technique. The vertices in the DAG can be numbered in a topological order such that $i < j$ is satisfied for every edge $e_{ij}$; in other words, there are no directed paths from $v_j$ to $v_i$ if $i < j$. Thus, we can apply the dynamic programming technique as follows:
$$
K(G_x, G_{x'}) = \sum_{v_i \in root(G_x),\ v_{i'} \in root(G_{x'})} r(i, i')
$$
$$
r(i, i') =
\begin{cases}
K_v(v_i, v_{i'}) + g_v(v_i) + g_v(v_{i'}) & (\nexists\, j, j' \ \text{s.t.}\ j > i,\ j' > i') \\[4pt]
K_v(v_i, v_{i'}) + g_v(v_i) \sum\limits_{j > i} g_e(e_{ij})\, r(j, i') + g_v(v_{i'}) & (\nexists\, j' \ \text{s.t.}\ j' > i') \\[4pt]
K_v(v_i, v_{i'}) + g_v(v_i) + g_v(v_{i'}) \sum\limits_{j' > i'} g_e(e_{i'j'})\, r(i, j') & (\nexists\, j \ \text{s.t.}\ j > i) \\[4pt]
K_v(v_i, v_{i'}) \sum\limits_{j > i,\ j' > i'} K_e(e_{ij}, e_{i'j'})\, r(j, j') + g_v(v_i) \sum\limits_{j > i} g_e(e_{ij})\, r(j, i') + g_v(v_{i'}) \sum\limits_{j' > i'} g_e(e_{i'j'})\, r(i, j') & (\text{otherwise})
\end{cases}
\tag{3}
$$

where $root(G)$ is the set of vertices which have no incoming edges, $K_v$ and $K_e$ are kernel functions for vertices and edges, respectively, and $g_v$ and $g_e$ are gap penalties for vertices and edges, respectively. $K$ calculates the sum of kernel values for all pairs of possible substructures of $G_x$ and $G_{x'}$. Each of these kernel values is composed of the product of the sub-kernels $K_v$, $K_e$, $g_v$ and $g_e$. Therefore, $K$ is a convolution kernel and is positive semi-definite if $K_v$ and $K_e$ are also positive semi-definite [23].
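The dynamic programme of Eq. (3) can be sketched with memoized recursion. This is one possible reading, not the authors' code: it assumes the four cases distinguish whether $v_i$ and $v_{i'}$ have outgoing edges, that the sums run over the children of each vertex, and that a DAG is given as a hypothetical `(vertices, edges)` pair in which `edges` maps a vertex index to its child indices. The sub-kernels are passed in as plain functions.

```python
from functools import lru_cache


def dag_kernel(dag_x, dag_y, k_v, k_e, g_v, g_e):
    """Evaluate K(G_x, G_x') of Eq. (3) under the assumptions stated above."""
    vx, ex = dag_x
    vy, ey = dag_y

    def edge(vertices, i, j):
        # An edge is represented by the pair of base pairs it connects.
        return (vertices[i], vertices[j])

    @lru_cache(maxsize=None)
    def r(i, ip):
        kv = k_v(vx[i], vy[ip])
        gx, gy = g_v(vx[i]), g_v(vy[ip])
        if not ex[i] and not ey[ip]:          # case 1: both are leaves
            return kv + gx + gy
        skip_x = sum(g_e(edge(vx, i, j)) * r(j, ip) for j in ex[i])
        skip_y = sum(g_e(edge(vy, ip, jp)) * r(i, jp) for jp in ey[ip])
        if not ey[ip]:                        # case 2: only v_i' is a leaf
            return kv + gx * skip_x + gy
        if not ex[i]:                         # case 3: only v_i is a leaf
            return kv + gx + gy * skip_y
        both = sum(k_e(edge(vx, i, j), edge(vy, ip, jp)) * r(j, jp)
                   for j in ex[i] for jp in ey[ip])
        return kv * both + gx * skip_x + gy * skip_y   # case 4

    def roots(edges):
        # Roots are vertices that never appear as a child.
        children = {j for out in edges.values() for j in out}
        return [i for i in edges if i not in children]

    return sum(r(i, ip) for i in roots(ex) for ip in roots(ey))


# Toy usage with all sub-kernels set to 1, on a two-vertex chain.
one = lambda *args: 1
chain = ([(0, 5), (1, 4)], {0: [1], 1: []})
value = dag_kernel(chain, chain, one, one, one, one)
```

Because each `r(i, i')` is cached, every pair of vertices is evaluated once, and each evaluation touches at most the product of the two out-degrees.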

The time and the memory required for the computation of $K$ are of the order of $O(c^2|V_x||V_{x'}|)$ and $O(|V_x||V_{x'}|)$, respectively, where $c$ is the maximum out-degree of $G_x$ and $G_{x'}$. We can control $|V_x|$ using the predefined threshold for base pairs, $p^*$. When $p^* = 0$, $V_x$ contains all possible base pairs, i.e., $|V_x| = n(n-1)/2$. When $p^* > 0$, since each base can take part in at most $1/p^*$ of the base pairs in $V_x$, $|V_x|$ is proportional to the length $n$ of the RNA sequence $x$. Since in many cases $c \ll |V_x|$, the time and the memory required for this algorithm are approximately of the order of $O(n^2)$ for sufficiently large values of $p^*$.
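The linear bound on $|V_x|$ can be spelled out in one line. The only ingredient beyond the text is the standard fact that a base has at most one partner in any single secondary structure, so the pairing probabilities of base $i$ sum to at most 1:

$$
\sum_{j} P^x_{ij} \le 1
\;\Longrightarrow\;
\#\{\, j : P^x_{ij} \ge p^* \,\} \le \frac{1}{p^*}
\;\Longrightarrow\;
|V_x| \le \frac{n}{2p^*},
$$

where the factor $1/2$ arises because each base pair is counted once from each of its two bases. For instance, $p^* = 0.01$ gives $|V_x| \le 50n$.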

Several choices of sub-kernels $K_v$, $K_e$, $g_v$ and $g_e$ in Eq. (3) are available. In order to connect the DAG-based stem kernels to the naive stem kernels calculated from Eq. (1), we first define simple sub-kernels as follows:
$$
K_v(v, v') =
\begin{cases}
1 & \left(\begin{array}{l}\bar{x}_k = x_l \ \text{and}\ (x_k, x_l) = (x'_{k'}, x'_{l'}) \\ \text{for } v = (k, l) \in V_x \ \text{and}\ v' = (k', l') \in V_{x'}\end{array}\right) \\
0 & (\text{otherwise})
\end{cases}
\tag{4}
$$
$$
K_e(e, e') =
\begin{cases}
1 & (e \in E_x \ \text{and}\ e' \in E_{x'}) \\
0 & (\text{otherwise})
\end{cases}
\tag{5}
$$
$$
g_v(v) = 1, \quad \forall v \in V_x \cup V_{x'}
\tag{6}
$$
$$
g_e(e) = 1, \quad \forall e \in E_x \cup E_{x'}.
\tag{7}
$$

When $p^* \to 0$, the DAG-based stem kernels calculated from Eq. (3) with the above sub-kernels approach the naive stem kernels calculated from Eq. (1). Both Eqs. (1) and (3) recursively traverse all substructures of $x$ and $x'$ in the sense of the partial order $\prec$, and when $p^* = 0$, the substructures of $x$ and $x'$ that contribute to the kernel value are identical for both kernels under these sub-kernels. More sophisticated kernels can be constructed using substitution scoring matrices, as in the local alignment kernels [24]:
$$
K_v(v, v') = \exp\!\left( P^x_{kl}\, P^{x'}_{k'l'} \cdot \alpha \cdot S(x_k, x_l, x'_{k'}, x'_{l'}) \right)
\quad (\text{for } v = (k, l) \in V_x \ \text{and}\ v' = (k', l') \in V_{x'}),
\tag{8}
$$
$$
K_e(e, e') =
\begin{cases}
\gamma^{\,n(e) + n(e')} & (e \in E_x \ \text{and}\ e' \in E_{x'}) \\
0 & (\text{otherwise})
\end{cases}
\tag{9}
$$
$$
g_v(v) = \gamma^2, \quad \forall v \in V_x \cup V_{x'}
\tag{10}
$$
$$
g_e(e) = \gamma^{\,n(e)}, \quad \forall e \in E_x \cup E_{x'},
\tag{11}
$$

where $S(x_k, x_l, x'_{k'}, x'_{l'})$ is a substitution scoring function from a base pair $(x_k, x_l)$ to a base pair $(x'_{k'}, x'_{l'})$, $\alpha > 0$ is a weight parameter for base pairs, $\gamma > 0$ is the decay factor for loop regions, and $n(e)$ is the number of nucleotides in the loop region enclosed by the base pairs at both ends of an edge $e$.
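The sub-kernels of Eqs. (8) to (11) translate into a few short functions. The sketch below rests on two labeled assumptions: `toy_score` is a hypothetical stand-in for a real substitution matrix such as RIBOSUM, and `loop_length` reads $n(e)$ as the number of unpaired positions between the outer and the inner base pair of an edge, which is one plausible interpretation of the definition above.

```python
import math


def toy_score(pair_x, pair_y):
    """Hypothetical stand-in for S: +1 for identical base pairs, -1 otherwise."""
    return 1.0 if pair_x == pair_y else -1.0


def loop_length(edge):
    """n(e) for an edge from outer pair (k, l) to inner pair (k2, l2).

    Assumption: counts the nucleotides strictly between the two base pairs
    on both strands.
    """
    (k, l), (k2, l2) = edge
    return (k2 - k - 1) + (l - l2 - 1)


def vertex_kernel(p_x, p_y, pair_x, pair_y, alpha, score=toy_score):
    """Eq. (8): exp(P_kl * P'_k'l' * alpha * S(...))."""
    return math.exp(p_x * p_y * alpha * score(pair_x, pair_y))


def edge_kernel(edge_x, edge_y, gamma):
    """Eq. (9): gamma ** (n(e) + n(e'))."""
    return gamma ** (loop_length(edge_x) + loop_length(edge_y))


def vertex_gap(gamma):
    """Eq. (10): constant penalty gamma ** 2 for skipping a base pair."""
    return gamma ** 2


def edge_gap(edge, gamma):
    """Eq. (11): gamma ** n(e)."""
    return gamma ** loop_length(edge)
```

Under this reading, two directly stacked base pairs enclose no loop nucleotides, so their edge contributes a factor of 1, while longer loops are damped geometrically by $\gamma < 1$.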

In our experiments, we employed the RIBOSUM 80-65 [9] for S, and p* = 0.01, α = 0.1, γ = 0.4, which were optimized by cross-validation tests. In order to prevent sequence length bias, we normalize our kernels K as follows:
$$
K'(G_x, G_{x'}) = \frac{K(G_x, G_{x'})}{\sqrt{K(G_x, G_x)\, K(G_{x'}, G_{x'})}}.
$$
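Applied to a whole Gram matrix, this normalization divides each entry by the geometric mean of the two corresponding diagonal entries. A minimal sketch, assuming the matrix is a list of lists with strictly positive diagonal (the function name is hypothetical):

```python
import math


def normalize_gram(gram):
    """Return K'[i][j] = K[i][j] / sqrt(K[i][i] * K[j][j])."""
    n = len(gram)
    return [[gram[i][j] / math.sqrt(gram[i][i] * gram[j][j])
             for j in range(n)]
            for i in range(n)]


normalized = normalize_gram([[4.0, 2.0], [2.0, 9.0]])
```

After normalization every self-similarity equals 1, so longer sequences, which accumulate more substructures, no longer dominate the kernel values.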

Stem kernels can be applied only to RNA secondary structures. However, primary sequences are still important for calculating the similarities between a pair of RNA sequences. Therefore, in order to take into account both primary sequences and secondary structures, we combine our stem kernels with the local alignment kernels by adding them.

Profile-profile stem kernels

If multiple alignments of homologous RNA sequences are available, we can calculate their base-pairing probability matrices more precisely by averaging the individual base-pairing probability matrices in accordance with the given multiple alignment [25]. The algorithm of the DAG-based stem kernels for a pair of RNA sequences can be extended to that for a pair of multiple alignments of RNA sequences using averaged base-pairing probability matrices. Since the method of averaged base-pairing probability matrices has been shown to be accurate and robust by Kiryu et al. [25], we can expect it to improve the proposed stem kernel method. We call these profile-profile stem kernels.

We denote the $i$-th column of a multiple alignment $A$ by $A_i$, the nucleotide of the $j$-th sequence in $A_i$ by $a_{ij}$, and the number of aligned sequences in $A$ by $num(A)$. We can calculate the averaged base-pairing probability matrix of a given multiple alignment $A$ as follows:
$$
P^A_{kl} = \frac{1}{num(A)} \sum_{x \in A} P'^{\,x}_{kl},
\qquad
P'^{\,x}_{kl} =
\begin{cases}
P^{x'}_{\rho(k)\rho(l)} & (\text{if neither } x_k \text{ nor } x_l \text{ is a gap}) \\
0 & (\text{otherwise}),
\end{cases}
$$
where $x'$ is the sequence $x$ with all gaps removed and $\rho(k)$ is the index on $x'$ of the $k$-th column of $A$. After constructing $P^A_{kl}$, we can build DAGs, and the kernel $K_v$ for columns can be calculated by replacing the substitution function $S$ in Eq. (8) with
$$
S(A_k, A_l, A'_{k'}, A'_{l'}) = \frac{1}{num(A)\, num(A')} \sum_{i=1}^{num(A)} \sum_{i'=1}^{num(A')} S'(a_{ki}, a_{li}, a'_{k'i'}, a'_{l'i'}),
$$
$$
S'(a_{ki}, a_{li}, a'_{k'i'}, a'_{l'i'}) =
\begin{cases}
S(a_{ki}, a_{li}, a'_{k'i'}, a'_{l'i'}) & (\text{if none of } a_{ki},\ a_{li},\ a'_{k'i'} \text{ and } a'_{l'i'} \text{ is a gap}) \\
0 & (\text{otherwise}).
\end{cases}
$$
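The averaging of base-pairing probability matrices over an alignment can be sketched as follows. The function name, the `-` gap character, and the input format (one gapped string per sequence plus one matrix per ungapped sequence) are assumptions made for illustration.

```python
def averaged_bpp(alignment, bpp_per_seq, gap="-"):
    """Average per-sequence base-pairing matrices over alignment columns.

    alignment   -- list of gapped strings of equal length
    bpp_per_seq -- list of square matrices, one per sequence, indexed by
                   positions of the ungapped sequence
    """
    length = len(alignment[0])
    avg = [[0.0] * length for _ in range(length)]
    for row, P in zip(alignment, bpp_per_seq):
        # rho maps an alignment column to the index in the ungapped sequence.
        rho, idx = {}, 0
        for col, ch in enumerate(row):
            if ch != gap:
                rho[col] = idx
                idx += 1
        # Columns containing a gap in this sequence contribute zero.
        for k in rho:
            for l in rho:
                avg[k][l] += P[rho[k]][rho[l]]
    num = len(alignment)
    return [[value / num for value in row] for row in avg]


# Two aligned sequences; the first has a gap in column 2.
P1 = [[0.0] * 3 for _ in range(3)]
P1[0][2] = P1[2][0] = 0.8
P2 = [[0.0] * 4 for _ in range(4)]
P2[0][3] = P2[3][0] = 0.6
P2[1][2] = P2[2][1] = 0.2
PA = averaged_bpp(["GA-C", "GAAC"], [P1, P2])
```

The resulting matrix can then be thresholded and turned into a DAG in the same way as the matrix of a single sequence.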

Results and Discussion

In this section, we present some of the results of our experiments in order to confirm the validity of our method as well as a discussion of those results.

Discrimination with SVMs and other kernel machines

We performed several experiments in which SVMs based on our kernel attempted to detect known ncRNA families. The accuracy was assessed using the specificity (SP) and the sensitivity (SN), which are defined as follows:
$$
SP = \frac{TN}{TN + FP}, \qquad SN = \frac{TP}{TP + FN},
$$

where TP is the number of correctly predicted positives, FP is the number of incorrectly predicted positives, TN is the number of correctly predicted negatives, and FN is the number of incorrectly predicted negatives. Furthermore, the area under the receiver operating characteristic (ROC) curve, i.e., the ROC score, was also used for evaluation. The ROC curve plots the true positive rates (= SN) as a function of the false positive rates (= 1 - SP) for varying decision thresholds of a classifier.
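As a minimal sketch of these measures, the following Python functions compute SP, SN and the ROC score from labels and decision values. The function names and the 1/0 label encoding are assumptions made for illustration. The ROC score is computed through the equivalent rank statistic (the fraction of positive-negative pairs that are ranked correctly, with ties counted as one half) rather than by integrating the curve explicitly.

```python
def specificity_sensitivity(labels, predictions):
    """Return (SP, SN) for binary labels and predictions encoded as 1 / 0."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y != 1 and p != 1)
    fp = sum(1 for y, p in pairs if y != 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p != 1)
    return tn / (tn + fp), tp / (tp + fn)


def roc_score(labels, scores):
    """Area under the ROC curve via pairwise ranking of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    # A positive scored above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise form is quadratic in the number of examples, which is acceptable for datasets of the size used here; a sort-based implementation would be preferable for much larger sets.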

In our first experiment, the discrimination ability and the execution time of the stem kernels were tested on our previous dataset used in [20], which includes five RNA families: tRNAs, miRNAs (precursor), 5S rRNAs, H/ACA snoRNAs, and C/D snoRNAs. We chose 100 sequences in each RNA family from the Rfam database [26] as positive samples such that the pairwise identity was not above 80% for any pair of sequences, and 100 randomly shuffled sequences with the same dinucleotide composition as the positives were generated as negative samples for each family. The discrimination performance was evaluated using 10-fold cross-validation. In order to determine an appropriate cutoff threshold p* for the base-pairing probabilities, we performed the experiments for p* ∈ {0.1, 0.01, 0.001, 0.0001}. Figure 2 shows the accuracy and the calculation time for each threshold. Since the accuracy for p* = 0.01 was slightly better than that for the other values, and the calculation time in this case was acceptable for practical use, we fixed p* = 0.01 as the default cutoff threshold of the base-pairing probabilities. Then, we compared the DAG-based stem kernels with the naive stem kernels. The experimental results shown in Table 1 indicate that the DAG-based kernels are roughly three orders of magnitude faster than the naive kernels, owing to the approximation by a predefined threshold of the base-pairing probability. Furthermore, in spite of using an approximation, the DAG-based kernels attain slightly higher ROC scores than the naive kernels for all families except miRNA, which we attribute to the convolution with the local alignment kernels and the removal of low-likelihood base pairs that may create noise.
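The effect of the cutoff threshold p* can be illustrated with a short sketch that keeps only the likely base pairs of a base-pairing probability matrix. The function name `likely_base_pairs` and the nested-list representation of the matrix are assumptions for illustration; whether a pair whose probability equals p* exactly is kept is also an assumption, since the text does not specify the boundary case.

```python
def likely_base_pairs(bpp, cutoff=0.01):
    """Return (i, j, p) for all pairs i < j whose probability p is at least cutoff.

    bpp is assumed to be a square nested list in which bpp[i][j] holds the
    base-pairing probability of positions i and j (only i < j is read).
    """
    n = len(bpp)
    return [
        (i, j, bpp[i][j])
        for i in range(n)
        for j in range(i + 1, n)
        if bpp[i][j] >= cutoff
    ]


# Toy 4-nt example: lowering the cutoff admits more candidate base pairs,
# which enlarges the graph that the kernel has to traverse.
toy_bpp = [
    [0.0, 0.0, 0.01, 0.9],
    [0.0, 0.0, 0.005, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
for p_star in (0.1, 0.01, 0.001):
    print(p_star, len(likely_base_pairs(toy_bpp, p_star)))
```

Because the number of retained pairs determines the number of nodes in each DAG, the cutoff trades calculation time against the amount of structural information kept.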
Table 1

Comparison of the discrimination capabilities of the naive stem kernels and the DAG-based stem kernels.

| ncRNA type | Naive ROC | Naive SP | Naive SN | Naive time (s) | DAG ROC | DAG SP | DAG SN | DAG time (s) |
|---|---|---|---|---|---|---|---|---|
| tRNA | 0.97 | 0.82 | 0.94 | 0.9 | 0.98 | 0.93 | 0.86 | 9.9 × 10⁻⁴ |
| 5S rRNA | 0.97 | 0.97 | 0.74 | 5.1 | 1.00 | 1.00 | 0.95 | 2.2 × 10⁻³ |
| miRNA | 0.88 | 0.65 | 0.88 | 1.6 | 0.86 | 0.88 | 0.69 | 9.7 × 10⁻⁴ |
| H/ACA snoRNA | 0.80 | 0.80 | 0.54 | 12.8 | 0.89 | 0.90 | 0.72 | 4.1 × 10⁻³ |
| C/D snoRNA | 0.78 | 0.55 | 0.79 | 4.7 | 0.87 | 0.91 | 0.71 | 2.0 × 10⁻³ |

The dataset contains five RNA families: tRNAs, miRNAs, 5S rRNAs, H/ACA snoRNAs, and C/D snoRNAs. ncRNA type: name of the target ncRNA family. ROC: ROC score, equal to the area under the ROC curve. SP: specificity of the discrimination of the target ncRNA family. SN: sensitivity of the discrimination of the target ncRNA family. Time: averaged time for each kernel computation on a 2.0 GHz AMD Opteron processor.

https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-9-318/MediaObjects/12859_2008_Article_2303_Fig2_HTML.jpg
Figure 2

Calculation time and ROC scores for various cutoff threshold values of the base-pairing probabilities. We timed the DAG-based stem kernels in calculating a kernel matrix for each family of the training set containing 100 positives and 100 negatives, and confirmed the accuracy of their discrimination through the ROC scores.

Next, we performed an experiment on a large dataset of multiple alignments, which was used to train RNAz [17]. This dataset includes 7,169 original alignments from 12 ncRNA families, extracted from the Rfam database [26], with the exception of the signal recognition particle (SRP) RNA and RNase P, which were extracted from [27, 28]. Each alignment consists of two to ten sequences aligned by CLUSTAL-W [29], and the mean pairwise identities are between 50% and 100%. The dataset also includes 7,169 negatives, which were generated from the original alignments by shuffling the columns such that the conservation rate of each column was preserved [30]. In this experiment, for each RNA family, an SVM was trained to distinguish the original alignments of the target RNA family from all other original and shuffled alignments in the dataset. We compared the profile-profile stem kernels with the local alignment kernels [24], which only consider the primary sequences of RNAs. To account for multiple alignments, we extended the local alignment kernels using the same technique as for the profile-profile stem kernels.
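The idea behind column shuffling can be sketched as follows. This is a deliberately simplified illustration, not the procedure of [30], which may impose further constraints on which columns are exchanged; the function name `shuffle_alignment_columns` and the representation of an alignment as a list of equal-length strings are assumptions. The sketch only shows why per-column conservation is retained: every column is moved as a whole, so its composition is unchanged while the pairing between columns is disrupted.

```python
import random


def shuffle_alignment_columns(alignment, seed=None):
    """Return a copy of the alignment with its columns randomly permuted.

    alignment: list of equal-length strings, one row per aligned sequence.
    Each column is moved intact, so its nucleotide and gap composition,
    and hence its conservation, is the same before and after shuffling.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows of an alignment must have the same length")
    order = list(range(length))
    random.Random(seed).shuffle(order)
    return ["".join(row[c] for c in order) for row in alignment]


toy_alignment = ["GGCAU", "GGC-U", "GACAU"]
print(shuffle_alignment_columns(toy_alignment, seed=1))
```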

The discrimination performance of both kernels was evaluated with 10-fold cross-validation. Table 2 presents the experimental results for this dataset. The stem kernels attained nearly perfect discrimination for all families in this dataset, while the local alignment kernels failed to discriminate some families. In particular, the sensitivity of the local alignment kernels for tmRNA and RNase P was especially low (0.037 and 0.018, respectively). Furthermore, the stem kernels collected fewer support vectors than the local alignment kernels, which we attribute to the robustness of the stem kernels with respect to secondary structures. This is a desirable feature, since prediction with an SVM requires kernel values only between the input sequence and the support vectors.
Table 2

Non-coding RNA detection using SVMs in comparing the stem kernels with the local alignment kernels.

| ncRNA type | Rfam Accession | N | Stem ROC | Stem SP | Stem SN | Stem nSV | Local alignment ROC | Local alignment SP | Local alignment SN | Local alignment nSV |
|---|---|---|---|---|---|---|---|---|---|---|
| 5S ribosomal RNA | RF00001 | 449 | 1.000 | 1.000 | 0.996 | 164.9 (1.3) | 1.000 | 1.000 | 0.996 | 4013.0 (31.1) |
| U2 spliceosomal RNA | RF00004 | 566 | 0.999 | 1.000 | 0.993 | 631.2 (4.9) | 0.999 | 1.000 | 0.986 | 4117.5 (31.9) |
| tRNA | RF00005 | 495 | 0.998 | 1.000 | 0.998 | 234.8 (1.8) | 1.000 | 1.000 | 0.998 | 4287.2 (33.2) |
| Hammerhead ribozyme III | RF00008 | 588 | 1.000 | 1.000 | 0.997 | 221.2 (1.7) | 1.000 | 1.000 | 0.997 | 2452.1 (19.0) |
| U3 snoRNA | RF00012 | 471 | 1.000 | 1.000 | 0.996 | 266.2 (2.1) | 0.998 | 1.000 | 0.870 | 4665.3 (36.2) |
| U5 spliceosomal RNA | RF00020 | 510 | 1.000 | 1.000 | 0.996 | 525.5 (4.1) | 1.000 | 1.000 | 0.994 | 4060.0 (31.5) |
| tmRNA | RF00023 | 730 | 1.000 | 1.000 | 0.997 | 685.8 (5.3) | 0.975 | 1.000 | 0.037 | 4677.7 (36.2) |
| Group II intron | RF00029 | 604 | 1.000 | 1.000 | 0.993 | 482.7 (3.7) | 1.000 | 1.000 | 0.990 | 4217.3 (32.7) |
| mir-10 | RF00104 | 620 | 1.000 | 1.000 | 0.998 | 59.5 (0.5) | 1.000 | 1.000 | 0.998 | 159.6 (1.2) |
| U70 snoRNA | RF00156 | 608 | 0.999 | 1.000 | 0.990 | 195.0 (1.5) | 0.999 | 1.000 | 0.992 | 3811.8 (29.5) |
| RNase P | - | 656 | 1.000 | 1.000 | 0.991 | 490.6 (3.8) | 0.905 | 1.000 | 0.018 | 4729.2 (36.6) |
| SRP RNA | - | 872 | 1.000 | 1.000 | 0.995 | 441.5 (3.4) | 0.908 | 1.000 | 0.900 | 4373.9 (33.9) |
| Total | | 7169 | 1.000 | 1.000 | 0.995 | 4398.9 (2.9) | 0.977 | 1.000 | 0.788 | 45564.6 (29.5) |

ncRNA type: name of the target ncRNA family. Rfam Accession: accession number of the target ncRNA family in Rfam. N: number of alignments. ROC: ROC score, equal to the area under the ROC curve. SP: specificity of the discrimination of the target ncRNA family. SN: sensitivity of the discrimination of the target ncRNA family. nSV: number of support vectors collected during training, with their percentage of the number of training alignments in parentheses.

In addition, we employed another kernel machine instead of SVMs, called support vector data description (SVDD) [31], which calculates a spherically shaped boundary around a dataset so as to increase the robustness against outliers without the need for negative examples. Many applications of SVMs to biological problems require the artificial generation of negative examples, such as shuffled positive sequences. However, since most artificial negatives can be easily distinguished from positives, the generation of artificial negative examples is a crucial problem in attaining practical prediction performance [32]. SVDD avoids this problem by using only positive examples. We applied SVDD instead of SVMs to the above dataset. Table 3 shows, surprisingly, that there is little difference in accuracy between SVMs and SVDD. This result indicates that the negative examples produced by shuffling the alignments contribute very little to learning the classifiers with our kernels. Furthermore, the number of support vectors in SVDD decreased significantly in comparison to SVMs.
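For reference, the usual soft-margin form of SVDD from [31] can be written as below. The symbols are introduced here for illustration only and do not appear in the text above: a is the center of the sphere, R its radius, ξ_i are slack variables, C is a trade-off constant, φ is the feature map induced by the kernel K, and α_i are the coefficients of the training examples.

```latex
\min_{R,\,\mathbf{a},\,\boldsymbol{\xi}} \; R^2 + C \sum_{i} \xi_i
\quad \text{subject to} \quad
\|\phi(x_i) - \mathbf{a}\|^2 \le R^2 + \xi_i, \qquad \xi_i \ge 0 .

% With the center expressed as a = \sum_i \alpha_i \phi(x_i),
% a new example z is accepted as a member of the family when
K(z, z) - 2 \sum_{i} \alpha_i K(x_i, z) + \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) \le R^2 .
```

Only kernel values among positive examples and between positives and the query enter the decision, which is why a kernel such as the stem kernel can be used without generating any negatives.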
Table 3

Non-coding RNA detection using SVDD in comparing the stem kernels with the local alignment kernels.

| ncRNA type | Rfam Accession | N | Stem ROC | Stem SP | Stem SN | Stem nSV | Local alignment ROC | Local alignment SP | Local alignment SN | Local alignment nSV |
|---|---|---|---|---|---|---|---|---|---|---|
| 5S ribosomal RNA | RF00001 | 449 | 1.000 | 1.000 | 0.940 | 27.8 (6.9) | 1.000 | 1.000 | 0.886 | 48.4 (12.0) |
| U2 spliceosomal RNA | RF00004 | 566 | 0.997 | 0.999 | 0.912 | 51.8 (10.2) | 0.999 | 1.000 | 0.844 | 92.0 (18.1) |
| tRNA | RF00005 | 495 | 0.983 | 0.948 | 0.939 | 26.8 (6.0) | 0.999 | 0.999 | 0.853 | 67.0 (15.0) |
| Hammerhead ribozyme III | RF00008 | 588 | 1.000 | 0.998 | 0.971 | 14.2 (2.7) | 1.000 | 1.000 | 0.968 | 19.3 (3.6) |
| U3 snoRNA | RF00012 | 471 | 1.000 | 1.000 | 0.915 | 36.3 (8.6) | 0.959 | 1.000 | 0.775 | 95.5 (22.5) |
| U5 spliceosomal RNA | RF00020 | 510 | 0.999 | 0.998 | 0.939 | 30.3 (6.6) | 1.000 | 1.000 | 0.882 | 57.2 (12.5) |
| tmRNA | RF00023 | 730 | 1.000 | 1.000 | 0.881 | 83.1 (12.6) | 0.757 | 1.000 | 0.037 | 636.5 (96.9) |
| Group II intron | RF00029 | 604 | 0.996 | 0.989 | 0.942 | 30.9 (5.7) | 0.999 | 1.000 | 0.922 | 48.7 (9.0) |
| mir-10 | RF00104 | 620 | 1.000 | 1.000 | 0.977 | 13.3 (2.4) | 1.000 | 1.000 | 0.984 | 10.7 (1.9) |
| U70 snoRNA | RF00156 | 608 | 0.998 | 0.996 | 0.952 | 25.5 (4.7) | 1.000 | 1.000 | 0.951 | 29.0 (5.3) |
| RNase P | - | 656 | 0.998 | 1.000 | 0.887 | 66.2 (11.2) | 0.629 | 1.000 | 0.006 | 587.5 (99.5) |
| SRP RNA | - | 872 | 1.000 | 1.000 | 0.939 | 54.4 (6.9) | 0.994 | 1.000 | 0.881 | 95.3 (12.1) |
| Total | | 7169 | 0.998 | 0.995 | 0.932 | 460.6 (7.1) | 0.938 | 1.000 | 0.729 | 1787.1 (27.7) |

ncRNA type: name of the target ncRNA family. Rfam Accession: accession number of the target ncRNA family in Rfam. N: number of alignments. ROC: ROC score, equal to the area under the ROC curve. SP: specificity of the discrimination of the target ncRNA family. SN: sensitivity of the discrimination of the target ncRNA family. nSV: number of support vectors collected during training, with their percentage of the number of training alignments in parentheses.

In the experiments above, we trained SVMs with the stem kernels to detect particular ncRNA families. In contrast, the SVMs in RNAz are trained to detect any structural ncRNAs, including unknown ncRNAs [17]. In [17], in order to demonstrate that RNAz is capable of discovering unknown ncRNAs with no bias toward the ncRNA families of the training set, SVMs were trained by excluding particular families of ncRNAs, and were then used for classifying the excluded ncRNAs and the shuffled negatives. We attempted the same training scheme as described in [17] to investigate the ability of the stem kernels to discover unknown ncRNAs, using the same dataset as in the experiment of Table 2. The ROC scores in this test were 0.699 for the stem kernels, 0.582 for the local alignment kernels, and 0.949 for RNAz. This result suggests that the ability of the stem kernels to discover unknown ncRNAs is weaker than that of RNAz. The key to discovering unknown structural ncRNAs is detecting evolutionarily conserved structures in multiple sequence alignments. The structure conservation index (SCI) used in RNAz directly assesses the structure conservation in multiple alignments, which contributes to its ability to detect unknown structural ncRNAs. However, since the SCI cannot measure the structural similarities between RNA sequences, it is difficult to apply it to other aspects of structural RNA analysis, such as detecting particular families. The stem kernels, on the other hand, evaluate common stem structures between two multiple alignments. In other words, the stem kernels are a measure not of structure conservation but of structural similarity between ncRNAs. Therefore, the stem kernels can be applied to various kernel methods, including not only SVMs but also kernel principal component analysis (KPCA) and kernel canonical correlation analysis (KCCA) [15].

Remote homology search

Furthermore, we conducted a remote homology search of ncRNAs using SVMs with our kernel. Our kernel method was compared with INFERNAL [7] based on profile SCFGs. INFERNAL has been recommended for RNA homology search by the benchmark of currently available RNA homology search tools called BRAliBase III [33]. This benchmark dataset contains tRNAs, 5S rRNAs and U5 spliceosomal RNAs, which have relatively conserved sequences and/or secondary structures, whereby both INFERNAL and our kernel can easily detect homologs (data not shown).

Therefore, we performed a more practical remote homology search on the dataset shown in Table 4, which includes 47 sequences of H/ACA snoRNAs and 41 sequences of C/D snoRNAs in C. elegans from the literature [34]. The mean pairwise identities of these families are too low for their homologs to be discovered by existing methods. For each family, non-homologs were generated by shuffling every sequence 10 times while preserving dinucleotide frequencies. From each family, we sampled twenty query sets of 5 sequences and twenty query sets of 10 sequences. Using these query sets, we attempted to search for homologs among all of the original and the shuffled sequences.
Table 4

Summary of the dataset for the experiment of the remote homology search.

| ncRNA type | N | Length | %id |
|---|---|---|---|
| H/ACA snoRNA | 47 | 145.1 | 29% |
| C/D snoRNA | 41 | 84.6 | 30% |

ncRNA type: name of the target ncRNA family. N: number of sequences in the dataset of the target ncRNA family. Length: average length of the sequences. %id: mean pairwise identity of the dataset.

For INFERNAL, each query was aligned by CLUSTAL-W [29], folded by RNAalifold [35], and converted into a covariance model (CM). The CM searched for homologous sequences in the dataset, calculating a bit score for each sequence. A ROC curve can be plotted using the bit scores as decision values.

For the stem kernel, every sequence for each query was shuffled 10 times in order to generate negative samples. Then, the SVM with the stem kernel learned the discrimination model from the query and the negatives. The model searched for homologous sequences in the dataset, calculating an SVM class probability for each sequence. A ROC curve can be plotted in this case using SVM class probabilities as decision values.

Figures 3 and 4 display the ROC curves of the homology searches of H/ACA snoRNAs and C/D snoRNAs by INFERNAL and SVMs with the stem kernels. The stem kernel produced more precise results than INFERNAL with respect to searching the target families for homologs. In particular, in the H/ACA snoRNAs experiment, the stem kernel was capable of detecting them accurately even with queries of 5 sequences. However, the accurate identification of C/D snoRNAs was problematic for both methods, which can be attributed to the poor secondary structures of C/D snoRNAs. In fact, the identification of C/D snoRNAs is difficult for many structure-based methods.
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-9-318/MediaObjects/12859_2008_Article_2303_Fig3_HTML.jpg
Figure 3

ROC curves of the remote homology searches of H/ACA snoRNAs in C. elegans from [34], comparing our kernels with INFERNAL. For each of 20 query sets of 5 (or 10) sequences, we searched for homologous sequences among all of the original and the shuffled sequences.

https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-9-318/MediaObjects/12859_2008_Article_2303_Fig4_HTML.jpg
Figure 4

ROC curves of the remote homology searches of C/D snoRNAs in C. elegans from [34], comparing our kernels with INFERNAL. For each of 20 query sets of 5 (or 10) sequences, we searched for homologous sequences among all of the original and the shuffled sequences.

Note that the sequences in the datasets shown in Table 4 are only remotely homologous to each other, which makes it difficult for RNAalifold to calculate common secondary structures from alignments produced by CLUSTAL-W. INFERNAL searches a given sequence for the common secondary structure of the query sequences, and thus the CM search fails if no acceptable covariance model for the query sequences can be generated. Although using structural alignments for ncRNAs might improve the homology search with INFERNAL, it is not certain that given query sequences have common secondary structures. In such cases, it is difficult for any alignment program to produce robust alignments with acceptable common secondary structures. In fact, the secondary structures of the snoRNAs used in our experiments are so diverse that we often did not obtain suitable multiple alignments for building CMs even when using structural alignment programs (data not shown). In contrast, SVMs calculate kernel values, i.e., pairwise similarities, between every pair of examples, and learn a weight for each example so as to maximize the margin between the positives and the negatives. The trained SVMs then predict the class of a new example from the weighted sum of the kernel values between the new example and the training examples. In other words, SVMs classify by a weighted vote of the training examples using the optimized weights. This approach reduces the risk of mispredictions and makes decisions that are more robust than those of methods which rely on a single representative, such as a common secondary structure of the query sequences. That is, SVMs with our kernel require no common secondary structure of the query sequences, and can make robust predictions in remote homology searches of structural ncRNAs. This approach, however, requires a number of kernel computations for each sequence to be analyzed that is proportional to the number of support vectors collected in training the SVMs. Therefore, prediction takes a long time if training does not keep the number of support vectors small.
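The weighted-sum decision described above, and its cost in kernel evaluations, can be sketched as follows. This is a generic illustration of the standard SVM decision function, not the authors' code: the function names are hypothetical, and `linear_kernel` is a toy stand-in for the stem kernel, which would be far more expensive per call.

```python
def svm_decision_value(x, support_vectors, alphas, labels, bias, kernel):
    """Return f(x) = sum_i alpha_i * y_i * K(x_i, x) + b over the support vectors.

    One kernel evaluation is needed per support vector, so the prediction
    cost grows linearly with the number of support vectors.
    """
    return sum(
        alpha * y * kernel(sv, x)
        for sv, alpha, y in zip(support_vectors, alphas, labels)
    ) + bias


def linear_kernel(u, v):
    """Toy kernel used only for illustration (plain dot product)."""
    return sum(a * b for a, b in zip(u, v))


# Two support vectors with opposite labels "vote" on a new example;
# the sign of the decision value gives the predicted class.
toy_svs = [(1.0, 0.0), (0.0, 1.0)]
toy_alphas = [0.5, 0.5]
toy_labels = [1, -1]
print(svm_decision_value((2.0, 1.0), toy_svs, toy_alphas, toy_labels, 0.0, linear_kernel))
```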

Kernel hierarchical clustering

We performed kernel hierarchical clustering using the weighted pair group method with arithmetic mean (WPGMA) with the stem kernels on the same dataset as [36], extracted from the Rfam database [26], which contains 503 ncRNA families and a total of 3,901 sequences that have no more than 80% sequence identity and do not exceed 400 nt in length. Figure 5 shows the resulting dendrogram of the dataset, indicating some typical families, where sequences of the same family are likely to be contained in the same cluster (see also Additional files 1 and 2). We evaluated the degree of agreement between the obtained clusters and the Rfam classification by converting the problem of cluster comparison into a binary classification problem in the same way as described in [36]: For every clustering cutoff threshold of the distance on the dendrogram, let the number of true positives (TP) be the number of sequence pairs in the same cluster which belong to the same family of Rfam. Analogously, let the number of false positives (FP) be the number of sequence pairs in the same cluster which belong to different families, the number of false negatives (FN) be the number of sequence pairs from the same family which lie in different clusters, and the number of true negatives (TN) be the number of sequence pairs from different families which lie in different clusters. The ROC curve plots the true positive rates as a function of the false positive rates for different clustering thresholds. Figure 6 shows the ROC curves for our kernel and LocARNA [36]. LocARNA produced hierarchical clusters whose ROC score was 0.781, while our kernel produced a score of 0.894.
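The pair-counting scheme described above can be sketched for a single clustering cutoff as follows. The function name `pair_confusion` and the list-based inputs are assumptions for illustration; the cluster labels would come from cutting the dendrogram at one threshold, and repeating the computation over all thresholds yields the points of the ROC curve.

```python
from itertools import combinations


def pair_confusion(clusters, families):
    """Count (TP, FP, FN, TN) over all unordered pairs of sequences.

    clusters[i]: cluster label of sequence i at the chosen cutoff
    families[i]: reference family of sequence i (e.g. its Rfam family)
    """
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_family = families[i] == families[j]
        if same_cluster and same_family:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_family:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def roc_point(clusters, families):
    """Return (false positive rate, true positive rate) for one cutoff."""
    tp, fp, fn, tn = pair_confusion(clusters, families)
    return fp / (fp + tn), tp / (tp + fn)
```

Enumerating all pairs is quadratic in the number of sequences, which amounts to a few million pairs for a dataset of several thousand sequences.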
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-9-318/MediaObjects/12859_2008_Article_2303_Fig5_HTML.jpg
Figure 5

The dendrogram resulting from applying our kernel and WPGMA to the dataset. Some clusters containing typical families are indicated, such as 5S rRNA, tRNA, miRNA and RNase P. This dendrogram was produced from Additional file 1, which is a Newick format file calculated by our kernel and WPGMA. A magnifiable version of this dendrogram is available as Additional file 2.

https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-9-318/MediaObjects/12859_2008_Article_2303_Fig6_HTML.jpg
Figure 6

ROC curves of the degree of agreement between the clustering and the Rfam families, comparing our kernels with LocARNA.

LocARNA and the DAG-based stem kernels use a similar approximation technique, in which base pairs whose base-pairing probability is below a predefined threshold are disregarded. One of the most important differences between the two methods is that LocARNA calculates a score for only the best-scoring secondary structure with bifurcations, while stem kernels sum the scores over an ensemble of common stem structures, including suboptimal structures. In other words, stem kernels can be regarded as a variant of the Sankoff algorithm [37] that calculates the partition function without any bifurcations. This feature of stem kernels underlies their robustness in measuring structural similarities.

Conclusion

We have developed a new technique for analyzing structural RNA sequences using kernel methods. This technique is based on directed acyclic graphs (DAGs) derived from base-pairing probability matrices of RNA sequences, and significantly reduces the computation time for stem kernels. Our method considers only likely base pairs whose base-pairing probability is above a predefined threshold. The kernel values are calculated using DAG kernels, where each DAG is produced from these likely base pairs. Furthermore, we have proposed profile-profile stem kernels for multiple alignments of RNA sequences, which utilize the averaged base-pairing probability matrices of multiple alignments of RNA sequences.

Our kernels outperformed the existing methods in the detection of known ncRNAs with SVMs and in kernel hierarchical clustering. In the experiments where SVMs were used, the stem kernels achieved nearly perfect discrimination on the dataset, and collected fewer support vectors than the local alignment kernels owing to the robustness of the stem kernels with respect to secondary structures. Therefore, stem kernels can be used as a reliable similarity measure for structural RNAs, and can be utilized in various applications of kernel methods.

The new technique proposed in this paper significantly increases the computation speed for stem kernels, which is a step toward the realization of a genome-scale search of ncRNAs using stem kernels. Since our method is capable of detecting remote homology, it is possible to discover new ncRNAs which cannot be detected with existing methods.

Availability

Our implementation of the profile-profile stem kernels is available at http://www.ncrna.org/software/stem-kernels/ under the GNU public license. It takes RNA sequences or multiple alignments, and calculates a kernel matrix, which can be used as an input for a popular SVM tool called LIBSVM [38]. Furthermore, our software is capable of parallel processing using the Message Passing Interface (MPI) [39].
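As an illustration of how a kernel matrix can be passed to LIBSVM, the sketch below writes a matrix in LIBSVM's precomputed-kernel text format, in which each line starts with the label, followed by `0:<serial number>` and then the kernel values indexed from 1 (used with the `-t 4` option of `svm-train`). The function name `format_precomputed_kernel` is hypothetical, and whether the authors' software emits exactly this layout is an assumption.

```python
def format_precomputed_kernel(labels, kernel_matrix):
    """Render a kernel matrix in LIBSVM's precomputed-kernel text format.

    labels       : class labels, one per example (e.g. 1 or -1)
    kernel_matrix: square nested list with kernel_matrix[i][j] = K(x_i, x_j)
    """
    lines = []
    for serial, (label, row) in enumerate(zip(labels, kernel_matrix), start=1):
        fields = [str(label), f"0:{serial}"]  # index 0 carries the example's serial number
        fields += [f"{j}:{value:.6g}" for j, value in enumerate(row, start=1)]
        lines.append(" ".join(fields))
    return "\n".join(lines)


toy_kernel_matrix = [[1.0, 0.25], [0.25, 1.0]]
print(format_precomputed_kernel([1, -1], toy_kernel_matrix))
```

For prediction, the same format is used with one line per test example, holding its kernel values against all training examples.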

Declarations

Acknowledgements

This work was supported in part by a grant from "Functional RNA Project" funded by the New Energy and Industrial Technology Development Organization (NEDO) of Japan, and was also supported in part by Grant-in-Aid for Scientific Research on Priority Area "Comparative Genomics" No. 17018029 from the Ministry of Education, Culture, Sports, Science and Technology of Japan. We thank Dr. S. Washietl and Dr. I. L. Hofacker for providing us with their large-scale dataset of multiple alignments of non-coding RNAs. We also thank our colleagues from the RNA Informatics Team at the Computational Biology Research Center (CBRC) for fruitful discussions.

Authors’ Affiliations

(1) Japan Biological Informatics Consortium (JBIC)
(2) Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST)
(3) Department of Biosciences and Informatics, Keio University
(4) Department of Computational Biology, Graduate School of Frontier Sciences, The University of Tokyo

References

  1. Eddy SR: Non-coding RNA genes and the modern RNA world. Nat Rev Genet 2001, 2(12):919–929. doi:10.1038/35103511
  2. Searls DB: The language of genes. Nature 2002, 420(6912):211–217. doi:10.1038/nature01255
  3. Eddy SR, Durbin R: RNA sequence analysis using covariance models. Nucleic Acids Res 1994, 22(11):2079–2088. doi:10.1093/nar/22.11.2079
  4. Sakakibara Y, Brown M, Hughey R, Mian IS, Sjölander K, Underwood RC, Haussler D: Stochastic context-free grammars for tRNA modeling. Nucleic Acids Res 1994, 22(23):5112–5120. doi:10.1093/nar/22.23.5112
  5. Knudsen B, Hein J: RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics 1999, 15(6):446–454. doi:10.1093/bioinformatics/15.6.446
  6. Rivas E, Eddy SR: Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics 2001, 2: 8. doi:10.1186/1471-2105-2-8
  7. Eddy SR: A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinformatics 2002, 3: 18. doi:10.1186/1471-2105-3-18
  8. Sakakibara Y: Pair hidden Markov models on tree structures. Bioinformatics 2003, 19(Suppl 1):i232-i240. doi:10.1093/bioinformatics/btg1032
  9. Klein RJ, Eddy SR: RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinformatics 2003, 4: 44. doi:10.1186/1471-2105-4-44
  10. Sato K, Sakakibara Y: RNA secondary structural alignment with conditional random fields. Bioinformatics 2005, 21(Suppl 2):ii237-ii242. doi:10.1093/bioinformatics/bti1139
  11. Holmes I: Accelerated probabilistic inference of RNA structure evolution. BMC Bioinformatics 2005, 6: 73. doi:10.1186/1471-2105-6-73
  12. Dowell RD, Eddy SR: Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints. BMC Bioinformatics 2006, 7: 400. doi:10.1186/1471-2105-7-400
  13. Pedersen JS, Bejerano G, Siepel A, Rosenbloom K, Lindblad-Toh K, Lander ES, Kent J, Miller W, Haussler D: Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol 2006, 2(4):e33. doi:10.1371/journal.pcbi.0020033
  14. Do CB, Woods DA, Batzoglou S: CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics 2006, 22(14):e90-e98. doi:10.1093/bioinformatics/btl246
  15. Schölkopf B, Tsuda K, Vert JP: Kernel Methods in Computational Biology. Cambridge, MA: MIT Press; 2004.
  16. Kin T, Tsuda K, Asai K: Marginalized kernels for RNA sequence data analysis. Genome Inform 2002, 13: 112–122.
  17. Washietl S, Hofacker IL, Stadler PF: Fast and reliable prediction of noncoding RNAs. Proc Natl Acad Sci U S A 2005, 102(7):2454–2459. doi:10.1073/pnas.0409169102
  18. Hertel J, Stadler PF: Hairpins in a Haystack: recognizing microRNA precursors in comparative genomics data. Bioinformatics 2006, 22(14):e197-e202. doi:10.1093/bioinformatics/btl257
  19. Hertel J, Hofacker IL, Stadler PF: SnoReport: Computational identification of snoRNAs with unknown targets. Bioinformatics 2008, 24(2):158–164. doi:10.1093/bioinformatics/btm464
  20. Sakakibara Y, Popendorf K, Ogawa N, Asai K, Sato K: Stem kernels for RNA sequence analyses. J Bioinform Comput Biol 2007, 5(5):1103–1122. doi:10.1142/S0219720007003028
  21. McCaskill JS: The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 1990, 29(6–7):1105–1119. doi:10.1002/bip.360290621
  22. Hofacker IL: Vienna RNA secondary structure server. Nucleic Acids Res 2003, 31(13):3429–3431. doi:10.1093/nar/gkg599
  23. Haussler D: Convolution kernels on discrete structures. Tech. Rep. UCSC-CRL-99-10. Department of Computer Science, University of California at Santa Cruz; 1999.
  24. Saigo H, Vert JP, Ueda N, Akutsu T: Protein homology detection using string alignment kernels. Bioinformatics 2004, 20(11):1682–1689. doi:10.1093/bioinformatics/bth141
  25. Kiryu H, Kin T, Asai K: Robust prediction of consensus secondary structures using averaged base pairing probability matrices. Bioinformatics 2007, 23(4):434–441. doi:10.1093/bioinformatics/btl636
  26. Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A: Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res 2005, 33(Database issue):D121-D124.
  27. Rosenblad MA, Gorodkin J, Knudsen B, Zwieb C, Samuelsson T: SRPDB: Signal Recognition Particle Database. Nucleic Acids Res 2003, 31: 363–364. doi:10.1093/nar/gkg107
  28. Brown JW: The Ribonuclease P Database. Nucleic Acids Res 1999, 27: 314. doi:10.1093/nar/27.1.314
  29. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22(22):4673–4680. doi:10.1093/nar/22.22.4673
  30. Washietl S, Hofacker IL: Consensus folding of aligned sequences as a new measure for the detection of functional RNAs by comparative genomics. J Mol Biol 2004, 342: 19–30. doi:10.1016/j.jmb.2004.07.018
  31. Tax DM, Duin RP: Support vector data description. Machine Learning 2004, 54: 45–66. doi:10.1023/B:MACH.0000008084.60811.49
  32. Babak T, Blencowe BJ, Hughes TR: Considerations in the identification of functional RNA structural elements in genomic alignments. BMC Bioinformatics 2007, 8: 33. doi:10.1186/1471-2105-8-33
  33. Freyhult EK, Bollback JP, Gardner PP: Exploring genomic dark matter: a critical assessment of the performance of homology search methods on noncoding RNA. Genome Res 2007, 17: 117–125. doi:10.1101/gr.5890907
  34. Deng W, Zhu X, Skogerbø G, Zhao Y, Fu Z, Wang Y, He H, Cai L, Sun H, Liu C, Li B, Bai B, Wang J, Jia D, Sun S, He H, Cui Y, Wang Y, Bu D, Chen R: Organization of the Caenorhabditis elegans small non-coding transcriptome: genomic features, biogenesis, and expression. Genome Res 2006, 16: 20–29. doi:10.1101/gr.4139206
  35. Hofacker IL, Fekete M, Stadler PF: Secondary structure prediction for aligned RNA sequences. J Mol Biol 2002, 319(5):1059–1066. doi:10.1016/S0022-2836(02)00308-X
  36. Will S, Reiche K, Hofacker IL, Stadler PF, Backofen R: Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput Biol 2007, 3(4):e65. doi:10.1371/journal.pcbi.0030065
  37. Sankoff D: Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM Journal on Applied Mathematics 1985, 45(5):810–825. doi:10.1137/0145048
  38. Fan RE, Chen PH, Lin CJ: Working set selection using second order information for training support vector machines. Journal of Machine Learning Research 2005, 6: 1889–1918. [http://www.csie.ntu.edu.tw/~cjlin/libsvm/]
  39. Pacheco P: Parallel Programming with MPI. Morgan Kaufmann; 1996.

Copyright

© Sato et al; licensee BioMed Central Ltd. 2008

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.