IgTM: An algorithm to predict transmembrane domains and topology in proteins

Background Due to their role of receptors or transporters, membrane proteins play a key role in many important biological functions. In our work we used Grammatical Inference (GI) to localize transmembrane segments. Our GI process is based specifically on the inference of Even Linear Languages. Results We obtained values close to 80% in both specificity and sensitivity. Six datasets have been used for the experiments, considering different encodings for the input sequences. An encoding that includes the topology changes in the sequence (from inside and outside the membrane to it and vice versa) allowed us to obtain the best results. This software is publicly available at: Conclusion We compared our results with other well-known methods, that obtain a slightly better precision. However, this work shows that it is possible to apply Grammatical Inference techniques in an effective way to bioinformatics problems.


Background
Membrane proteins are involved in a variety of important biological functions [1,2] where they play the role of receptors or transporters. The number of transmembrane segments of a protein and some characteristics such as loop lengths can identify features of the proteins, as well as their role [3]. Therefore, it is very important to predict the location of transmembrane domains along the sequence, since these are the basic structural building blocks defining the protein topology. Several works have dealt with this prediction task from different approaches, mainly using Hidden Markov Models (HMM) [4][5][6], neural networks [7,8] or statistical analysis [9]. A rich literature is available on proteins prediction. For reviews on different methods for predicting transmembrane domains in proteins, we refer the reader to [10][11][12].
This work addresses the problem of protein transmembrane domains prediction by making use of a Grammatical Inference (GI) based approach. GI is a particular case of Inductive Inference, an iterative process that takes into account a set of facts and tries to obtain a model consistent with the available data. In GI the model resulting from the induction process is a formal grammar (that generates a formal language) inferred from a set of sample strings, composed by a set M + of strings belonging to a target formal language and, in some cases, another set Mof strings that do not belong to the language. The results of the inference process gives as the result a language (hypothesis) that, in essence, models all the common features of the strings. This grammatical approach is suitable for the task due to the sequential nature of the information. Some works apply formal languages methods to molecu-lar biology [13]. Figure 1 depicts a general GI scheme. Several classifications of the GI algorithms can be made, for instance: when both sets are non-empty we remalgo[cont2]Algorithm refer to complete presentation algorithms; positive presentation algorithms are those that use an empty Mset; taking into account these algebraic properties of the obtained languages, it is possible to distinguish between characterisable and non-characterisable algorithms. It is difficult to identify what information is suitable to be considered into M -, therefore we will take into account only positive presentation in our approach. For more information, we refer the reader to [14][15][16].
Usually, the model used in GI is a finite state abstract machine commonly named finite automata. HMMs are closely related to finite automata, and therefore our approach is also related to several works that succesfully tackle this task [4][5][6]. Nevertheless, it is to note that the topology of a HMM, number of states and their connection, is a priori fixed by an expert that takes profit from known information. Once the topology is fixed, the available data is used to set the probability of each transition of the HMM. As stated above, the input of a GI algorithm is a set of sequences, therefore no aid from an expert is needed, because both the topology of the automaton and the probability between states is automatically stablished by the algorithm.
Generally speaking, HMMs provide a good solution when the topology of the HMM can be fairly set. In that case, the sequences provided are used just to set the transition probabilities among states. A GI approach tries to extract more information from the sequences and provides good prediction tools using only sequential information. The most important drawback of GI is the lack of enough data to infer proper models.
GI has been used previously in various bioinformatics related tasks, such as gene-finding or prediction of coiled coil domains [17]. The good performance of those works leds us to apply GI algorithms to the prediction of other domains in proteins, such as transmembrane segments.
Our work takes into account a set of protein sequences with known evidence of transmenbrane domains. Firstly, these sequences are processed in order to distinguish among inner, outer and transmembrane residues. This labelling allows to obtain an Even Linear structure (that considers a relationship among the symbols in a sequence, such that the first and the last symbols are related, the second and the last but one are also related and so on). It is possible to model this structure by using an Even Linear Language (ELL) that can be learned using GI techniques. The obtained language is then used to build a probabilistic transducer (an abstract machine that processes an input sequence and obtains another output sequence or transduction with an occurrence probability). The resulting transducer allows to process any unknown protein sequence to obtain a transduction. The transduction shows those detected transmembrane domains. The experimental results have been compared with TMHMM 2.0 [4], Pred-TMR [9], Prodiv-TMHMM [6], HMMTOP 2.0 [18,5], PHOBIUS [19][20][21], TMpred [22,23] and MEMSAT3 [24].

Introduction
We consider the prediction of transmembrane domains as a transduction problem. That is, given an amino acid Process of a Grammatical Inference process Figure 1 Process of a Grammatical Inference process. A formal language can be represented by means of an automaton or a grammar, that will be used together with the problem sequences in the analysis phase. The output of the analysis phase can be a transduction of the input sequence, an error-free form of the input, or a value that tells whether the sequence belongs to the language or not. sequence, the output of our system is a sequence with the same length which distinguishes between those amino acids that are within a transmembrane domain and those that are not.
The available data are transformed in a training set with even linear structure. An item of the data set is a string whose first half is made up by the symbol sequence of the protein and the second by the symbols of the expected output string in reverse order. In order to learn the ELL with this set, we considered as the main feature the segments of a given length k set as a input parameter. The class of ktss languages is a well-known subclass of the regular languages and it is characterized by the set of segments of length k that appear in the words of the language, therefore, we can take profit of previous learning results in order to address this task [25][26][27].
The transducer is obtained using the structure of the inferred ELG (Even Linear Grammar). The general method is described in Algorithm 1. Please refer to section Notation and definition to details.

Algorithm 1 Transmembrane Grammatical Inference approach
Input: • A set P of amino acid sequences with known transmembrane domains.
• A set L of domain labeled sequences. Each string x in P has its corresponding string l x in L.

Output:
• A transducer to locate transmembrane domains.

End
The returned transducer can be used to analyse problem sequences to obtain the corresponding transduction.

Datasets
Due to the fact that each approach to transmembrane prediction uses its own dataset, in order to test our approach six different datasets has been considered. The first one was a set of 160 membrane proteins used in [4], which we refer to as the TMHMM set. Experimental topology data is available for these proteins, most of them have been analysed with biochemical and genetic methods (these methods are not always reliable), and only a small number of membrane protein domains of this dataset have been determined at an atomic resolution. The dataset contains 108 multi-spanning and 52 single-spanning proteins. The original dataset was larger, but those proteins whith conflicting topologies for different experiments were not included.
The second set used was TMPDB [28], whose latest version (Release 6.3) contains 302 transmembrane protein sequences (276 alpha-helical sequences, 17 beta-stranded sequences and 9 alpha-helical sequences with short poreforming alpha-helices buried in the membrane). The topologies of these sequences are based on definite experimental evidences such as X-ray crystallography, NMR, gene fusion technique, substituted cysteine accessibility method, Asp(N)-linked glycosylation experiment and other biochemical methods. The third and fourth datasets are subsets of TMPDB, where homologous proteins have been removed: the third set, TMPDB-α-nR, contains 230 alpha-helix non redundant proteins; and the fourth set TMPDB-αβ-nR, has been obtained by adding 15 β-barrel proteins to the third set.
The fifth dataset used is the 101-Pred-TMR database, a set of 101 non-homologous proteins, extracted form Swiss-Prot database, used in [9,29]. These proteins were selected from a set of 155 proteins, discarding those with more than 25% of similarity.
The last dataset used was the MPTOPO dataset [30]. In its last version (August 2007) the set contains 185 proteins: 25 of them β-barrels and the rest α-helix transmembrane.
All the segments have been experimentally validated. The 3D structure of 119 of these proteins has been determined using x-ray diffraction or NMR methods, therefore, these transmembrane segments are known precisely. The rest of transmembrane segments correspond to 41 helices that have been identified by experimental techniques such as gene fusion, proteolytic degradation, and amino acid deletion. The proteins whose topologies are based solely on hydropathy plots have not been included in the dataset.

Codification
Protein sequences can be considered as strings from a 20 symbols alphabet, where each symbol represents one of xl x r the amino acids. In order to reduce the alphabet size without loss of information, we considered an encoding based on some properties of the amino acids (originally proposed by Dayhoff). The Table 1 shows the correspondence of each amino acid for Dayhoff encoding. This encoding has been previously used in some GI papers [31][32][33].

Performance measures
Several measures are suitable to evaluate the results. Some of them, addressing gene-finding problems, are reviewed in [34]. This measures can also be applied to functional domain location tasks. Among all the proposed measures, Sensitivity and Specificity are probably the most used. Intuitively, Sensitivity (Sn) measures the probability of predicting a particular residue inside a domain. Specificity (Sp) measures the probability of predicted residues to be actually into a domain. Therefore, Sn and Sp can be computed as follows: Where: Note that neither Sn nor Sp, took individually, constitute an exhaustive measure. A single value that summarizes both measures into a better one is the Correlation Coeffi-cient (CC), also referred to as Mathews Correlation Coefficient [35]. It can be computed as follows: Unfortunately, although CC has some interesting statistical properties [34], it has also an undesirable drawback. It is not defined if any factor of the root is equal to zero. In the literature there exist some measures that overcome this inconvenient, in this work we will use the Approximate Correlation (AC) which is defined as follows: We have to note that we were not able to calculate CC for every sample of the testing set (independently the dataset considered). In those cases, the samples were not taken into account. The Approximate Correlation AC has a 100% coverage, including those samples for which it was not possible to calculate CC or Sp. This can explain the relevant difference between AC and CC observed in some experiments. In addition to this, we have used the common segment-based measure Segment overlap, (Sov δ obs ) defined by [36]: where N is the total number of residues observed within all the domains of the protein, s 1 and s 2 are two overlaped segments, E is {end(s 1 ); end(s 2 )}, B is {beg(s 1 ); beg(s 2 )} and δ is a parameter for the accepted (maximal) deviation. We used a value of δ = 3.
We have also calculated the number of segments correctly predicted at three accuracy thresholds: 100%, 90% and 75%, that is, number of segments with the 100%, 90% or more, and 75% or more of their amino acids are correctly predicted. This measure is similar to Sensibility, but it is based on segments. Therefore it is necessary to calculate also the Sp measure in order to complement it. This measure allows to obtain a reliable evaluation for those segments that contain false negatives not only at the extremities of the segment. For example, this occurs when a viewed segment is recognized as more than one segment, and there are some false negatives between two of this predicted segments. Figure 2 shows how this measure is calculated.

Experimentation
Note that our approach needs some information to learn a model. In order to obtain probabilistic relevance in the  We hereby provide a description of each experiment, all the experiments but the last consider a previous reduction using the Dayhoff code: The first one (exp1) considered a two-classes encoding, that is, residues inside and outside a transmembrane domain; the second experiment (exp2) added another class in order to consider the topology of the protein (inner and outer residues); the third experiment (exp3) also included a class to distinguish among transmembrane domains with previous inner and outer regions; the fourth (exp4) experiment took into account the previous encoding with a special labelling of the last five residues of each region preceding a transmembrane one; the fifth experiment (exp5) added special symbols to track the transition to a transmembrane and out to one; the last experiment (exp6) did not consider the Dayhoff encoding and used the annotation of the second experiment.
Each of the experiments builds a different model for the language of the TM proteins, that highlights differents propierties of them, by searching different patterns among the amino acids, depending on whether they belong, for instance, to a TM zone or not (exp1), to an inner or outer zone (exp2 and exp6), to a TM domain with previous inner or outer regions (exp3), to the sequence of the last 5 residues that precede a TM segment (exp4), or to the set of amino acids that represent a transition from a TM zone to an inner or outer zone, or vice versa (exp5). Figure 3 shows the annotation and encoding of an example sequence for each different experimental configuration.
Once encoded the sequences, and for each of the described encodings, a set of experiments were run to test the best learning parameter of the inference algorithm. The best accuracy was obtained in the experiment with the configuration of exp 5 and exp 6 . The HMM-based methods we compared our system with, obtain a slightly better precision. The difference in results can be explained with the fact that GI algorithms need a greater quantity of data than the amount needed by Hidden Markov Models in order to achieve the same accuracy.
The main advantage of our approach is that it learns the topology of the model from samples, without the need of the external knowledge, as in HMM-based methods, where states and edges are determined by an expert. In a GI method, the automata are built by the algorithm, which stablishes the topology, number of states, the transitions or edges between states and probabilities of transition. Tables 2, 3 Although it may seem erroneous or non-sense to build a model to predict both α and β transmembrane domains, we would like to illustrate with this experiment the way a GI approach distinguishes from other approaches: if the dataset contains enough data (sequences in our case) from differents classes (α-helices and β-barrels), the model obtained should be able recognize all the different patterns. Table 7 compares the results of the experiment carried out over TMPDB-α-nR and TMPDB-αβ-nR datasets. The results with TMPDB-αβ-nR are slightly worse, but it can be explained because the set of β-barrel proteins contains only 15 sequences, and it is difficult to learn an Example of segments correctly predicted with different levels of precision Figure 2 Example of segments correctly predicted with different levels of precision. The first line shows the protein to predict, the second one is the prediction. This example shows a segment completely predicted (a), one segment correctly predicted at least at 75\% (b) and one segment predicted at least at 90\% (c).

Conclusion
This work addresses the problem of the localization of transmembrane segments within proteins by making use of Grammatical Inference (GI) algorithms. GI has been effectively used in some bioinformatic related tasks, such as gene-finding or prediction of coiled coil domains. IgTM exploits the features of proteins by using Even Linear Languages as the inferred class of languages. We tested different labellings for the input sequences, with the best accuracy achieved using a labelling that takes into account several changes in the sequence topology: from inside and outside the membrane to it and vice versa. We compared our method with other methods to predict transmembrane domains in proteins, obtaining slightly less accuracy with respect to them. This should be due to the fact that in GI the training phase need more data than the most common approach, based on Hidden Markov Mod-   els. In addition to this, many of the available prediction tools are closed, that is, there is no way to know exactly the training set used by the tools which we have compared igTM with, therefore it is possible that some of our six datasets included proteins used by these tools in the training phase (in this case, the tools we compare our algorithm with, would obtain better results). The same problem happens with online prediction tools, where the data considered to build the tools is not available. Then, since the other methods can have been trained on sequences that share homology with the test set (or even sequences included in the test set), the comparison could be not very reliable. However, the obtained results show that GI can be used effectively in bioinformatics related tasks. Furthermore, the main advantage of GI when applied to bioinformatics tasks is that an expert is not needed in order to give additional information (in this case the topology of transmembrane proteins). An online version of IgTM is publicly available at http:// www.dsic.upv.es/users/tlcc/bio/bio.html It remains as a future work to use this method together with another one (based on HMM or not). This could lead to improve the performance. At present we are testing other inference algorithms to learn the automata, the use of new codings to the sequences [37,38], and the consideration of new datasets (for instance the Möller dataset [39]).  Experimental results and comparison with results of other methods, with the TMPDB-α-nR α-helix database. As discussed in the Performance Measures section, 100%, ≤ 90% and ≥ 75% stand for segments correctly detected in that percentage.

Introduction
Our approach considers the concatenation of the protein symbols with the inverted annotation string, the whole considered as an ELL string. We subsequently apply a transformation to it, in order to obtain a string belonging to a regular language. The transformation is done by joining the first symbol of the first half with the last of the second one, the second symbol of the first half with the second-last symbol, and so on. Then, a GI process learns a language building a transducer that accepts the first part of each symbol (the one coming from the first half of the string) and returns the second part as output. The test phase consists in using Viterbi's algorithm to analyse the string. This algorithm returns the transduction that is most likely to be produced by the input string.

Notation and definitions
Let Σ be an alphabet and Σ* the set of words over the alphabet. A language is any subset of Σ*, that is a set of words. For any word x over Σ* let x i denote the i-th symbol of the sequence. Let |x| denote the length of the word and let x r denote the reverse of x. Let also λ denote the empty word. A grammar is denoted by G = (N, Σ, P, S) where N and Σ are the auxiliar and terminal alphabets, P is the set of productions and S ∈ N is the initial symbol or axiom. Intuitively, a grammar can be seen as a rewritting system that uses the set of productions to generate a set of words over Σ*. The language generated by a grammar G is denoted by L(G).  An Even Linear Grammar (ELG) is a context-free grammar [40] where the productions are of the forms: The class of Even linear Languages (ELL) is a subclass of the context free languages and includes properly the class of regular languages. Given an ELG, it is possible to obtain an equivalent one where the productions are of the form: The learning of ELL can be reduced to the inference of regular languages [41]. Intuitively, this function relates the first and last symbols of the word, as well as the second and the last but one, and so on. Once applied the function σ, it is possible to use any regular language inference algorithm to learn a language over the alphabet [Σ × Σ]* ∪ [Σ]*, that is, the alphabet of paired symbols. The learned language can be processed to undo the transformation σ as follows: Several inference algorithms are suitable to be applied, each obtaining a different solution. In fact, if the GI algorithm identifies a subclass of regular languages, then a subclass of ELL is obtained and applied with good performance.
A finite state transducer is an abstract machine formally defined by a system τ = (Q, Σ, Δ, q 0 , Q F , E) where: Q is a set of states, Σ and Δ are respectively the input and output alphabets, q 0 is the initial state, Q F ⊆ Q is the set of final states and E ⊆ (Q × Σ* × Δ* × Q) is the set of transitions of the transducer. A transducer processes an input string (word of a language), and outputs another string. A successful path in a transducer is a sequence of transitions (q 0 , x 1 , y 1 , q 1 ), (q 1 , x 2 , y 2 , q 2 ), ..., (q n-1 , x n , y n , q n ) where q n ∈ Q F and for 1 ≤ i ≤ n: q i ∈ Q, x i ∈ Σ* and y i ∈ Δ*. Note that a path can be denoted as (q 0 , x 1 x 2 ... x n , y 1 y 2 ... y n , q n ) whenever the sequence of states are not of particular concern. A transduction is defined as a function t: Σ* → Δ* where t(x) = y if and only if there exist a successful path (q 0 , x, y, q n ). Figure 4 shows an example of transducer, and the transduction that an accepted sequence generates. We refer the interested reader to [42].

Grammatical inference approach to transmembrane segments prediction
We consider the transmembrane segments prediction problem as a transduction problem. That is, given an amino acid sequence, the output of our system is a sequence with the same length which distinguishes between those amino acids within transmembrane segment and those that are not. In our work, we took into account the special features of our problem to propose a method based on inference of ELL.

A B S
First of all, we had to transform the available data to obtain a training set with even linear structure. This set was used to infer an ELL. The transducer is obtained using the structure of the inferred ELG. Given a ELG G = (N, Σ, P, S) that does not contain productions of the form A → a, a ∈ Σ, it is possible to obtain a transducer τ = (N, Σ, Σ, The resulting transducer is shown in Figure 4. As we stated before, the learning problem for ELL can be reduced to the problem of learning regular languages. In our work, in order to learn the ELL, we use an algorithm to infer k-testable in the strict sense (k-TSS) languages [25][26][27]. The class of ktss languages is contained into the regular languages one; it is characterized by the set of segments of length k that appear in the words of the language.
Our approach considered a set of protein sequences P with known transmembrane domains and another set L of strings over an alphabet of labels Δ = {i, o, M}. For each sequence x in P, a labeled sequence l x is obtained. The labelling allows to distinguish the transmembrane segments from the non-transmembrane ones. That is, given the string x = x 1 x 2 ... x n ∈ P and its corresponding labeled string l x , = l 1 l 2 ... l n ∈ L, l i = M whenever x i correspond to a transmembrane segment, l i = i, when correspond to a inner segment, and l i = o, when correspond to a outer segment.
These sets were combined to obtain another set, named M, with the strings . Note that the strings in this set have an even linear structure and an even length. The set M was used to obtain a probabilistic transducer by ELL inference. The general method is summarized in Algorithm 1.
The returned transducer can be used to analyse problem sequences to obtain the corresponding transduction. It is possible that the transducer may result to be non-deterministic and the test sequences may not belong to the language accepted by the transducer. Therefore, an errorcorrecting parser (for instance Viterbi's algorithm) is necessary to analyze the test sequences. We employed a standard configuration of Viterbi's algorithm used when a GI approach is applied to pattern recognition tasks (i.e. [33]).

Complexity
The igTM method is composed by two phases: inference and analysis. The first one consists in inferring an transducer from the sequences of the dataset.
The execution time of the GI algorithm used in this work is linear with the size of the dataset. The space requirements of this step is bounded by |Σ| k , where Σ is the alphabet of the samples and k is the parameter of the k-tss algorithm. Therefore, depending on the parameter used, the automaton obtained can be relatively big. The transformation of the automaton into a transducer is bounded by a polynomial of degree k.
The execution time of the second phase, the analysis one, is linear respect the size of the string to analyse. The space requirements are bounded by the size of the transducer and the analized string.