### The dynamic Bayesian network

A DBN is a directed graphical model in which nodes represent random variables and arcs represent dependencies between nodes. The architecture of our DBN model is illustrated in Fig. 2(b). There are six nodes in total for each residue. Specifically, the node *AA*_{i} (*i* = 1, 2, 3...) contains the PSI-BLAST profile of residue *i*, which is a 20-dimensional vector corresponding to the 20 scores in the PSSM. The node *R*_{i} stores a replica of the profiles of a series of residues before *i*, i.e. the profiles of residues *i*-1, *i*-2, *i*-3, ..., *i*-*L*_{AA}, as shown in Fig. 2(b), where *L*_{AA} is a profile window size indicating the range of the dependency for the profiles. As shown in Fig. 2(b), all the dependencies between *AA*_{i} and its neighboring sites, *AA*_{i-1}, *AA*_{i-2}, ..., *AA*_{i-L_AA}, can be summarized into one single connection to *R*_{i}, simplifying the topology of the graph. The state-space of *R*_{i} is 21·*L*_{AA}-dimensional, with 20·*L*_{AA} dimensions storing the profiles of the past residues and the extra *L*_{AA} dimensions representing the "over-terminus" state.
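
The encoding of *R*_{i} can be sketched as follows. The exact layout of the 21·*L*_{AA} dimensions is not specified above, so the per-position ordering used here (20 profile scores followed by one "over-terminus" flag) is an assumption for illustration:

```python
def build_window_vector(profiles, i, L_AA):
    """Sketch of the replica node R_i: for each of the L_AA residues
    preceding residue i, append its 20 PSSM scores plus one extra
    dimension flagging the "over-terminus" state (a position that falls
    before the first residue). `profiles` is a list of 20-dimensional
    PSSM rows, one per residue; the layout is an assumed one."""
    vec = []
    for k in range(1, L_AA + 1):
        j = i - k
        if j >= 0:
            vec.extend(profiles[j])   # 20 profile scores of residue i-k
            vec.append(0.0)           # not over-terminus
        else:
            vec.extend([0.0] * 20)    # no profile available before N-terminus
            vec.append(1.0)           # over-terminus indicator
    return vec
```
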

The node *SS*_{i} describes the secondary structure state of residue *i*, which has a discrete state-space of three elements: H, E, and C. The node *d*_{i} plays a role similar to *R*_{i}, but here describes the joint distribution with the secondary structure states of residues *i*-1, *i*-2, ..., *i*-*L*_{SS}, where *L*_{SS} is the secondary structure window size indicating the range of the dependency, as shown in Fig. 2(b). Again, the node *d*_{i} is introduced to simplify the topology of the graph, yet to keep a long-range dependency between the profile (*AA*_{i}) and the secondary structure (*SS*_{i-1}, *SS*_{i-2}, ...). The dimension of *d*_{i} is 4·*L*_{SS}, where 3·*L*_{SS} dimensions come from the joint past secondary structure states and the extra *L*_{SS} from the "over-terminus" situation.
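
The state of *d*_{i} can be pictured as an *L*_{SS}-tuple over the four symbols {O, H, E, C}, with O padding positions that fall before the N-terminus. A minimal sketch (the function name and string encoding are illustrative):

```python
def ss_history_tuple(ss, i, L_SS):
    """Sketch of the node d_i: the joint secondary-structure states of
    the L_SS residues preceding residue i, read from a string `ss` over
    {H, E, C}. Positions before the N-terminus get the "over-terminus"
    symbol O, so each position ranges over 4 symbols."""
    return tuple(ss[i - k] if i - k >= 0 else "O" for k in range(1, L_SS + 1))
```
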

The nodes *D*_{i} and *F*_{i}, with a specified parameter *D*_{max} and two elements, respectively, are introduced to mimic a duration-HMM [22]. Specifically, *D*_{i} represents the distance (measured in residues) from position *i* to the end of the corresponding secondary structure segment. For example, in a segment whose end residue is at position *j*, the value of *D*_{i} is set to *j*-*i*+1. Note that the state-space of *D*_{i} requires that the maximum segment length not exceed *D*_{max}. To cope with longer segments, a modified definition of *D*_{i} is introduced as follows: when the length of a segment ≤ *D*_{max}, the value of *D*_{i} is set as described above; when the length of the segment > *D*_{max}, for example *D*_{max}+3, *D*_{i} is set to *D*_{max} for the first four residues of the segment and to *D*_{max}-1, *D*_{max}-2, ..., 1 for the rest. In this way, the lengths of segments longer than *D*_{max} are modeled by a geometric distribution (see below). The value of the node *F*_{i} is deterministically dependent on *D*_{i}: if *D*_{i} > 1, *F*_{i} = 1; if *D*_{i} = 1, *F*_{i} = 2.
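
The modified definition of *D*_{i} and the derived flag *F*_{i} can be computed from a secondary-structure string as in the sketch below (`duration_labels` and the string encoding are our own names, not from the paper):

```python
def duration_labels(ss, D_max):
    """Compute D_i and F_i for every residue of a secondary-structure
    string `ss`. Within a segment of length n, D counts down n, n-1,
    ..., 1 from the segment start but is capped at D_max; e.g. a
    segment of length D_max + 3 reads D_max four times, then
    D_max-1, ..., 1. F_i = 1 while D_i > 1 and F_i = 2 when D_i = 1
    (the last residue of a segment)."""
    D = []
    i = 0
    while i < len(ss):
        j = i
        while j + 1 < len(ss) and ss[j + 1] == ss[i]:
            j += 1                            # j: last residue of the segment
        for p in range(i, j + 1):
            D.append(min(j - p + 1, D_max))   # distance to segment end, capped
        i = j + 1
    F = [1 if d > 1 else 2 for d in D]
    return D, F
```
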

Each node described above is assigned a specific conditional probability distribution (CPD) function according to the pattern of connections shown in Fig. 2(b), except for *R*_{i}, which is a "root" node [22] with no "parent node" and which is observable in both training and prediction. Specifically, the CPD of *AA*_{i} (*i* = 1, 2, 3...) is modeled by a conditional linear Gaussian function, defined by

*P*(*AA*_{i} = **y** | *R*_{i} = **u**, *SS*_{i} = *α*, *d*_{i} = *γ*) = *N*(**y**; **w**_{α,γ}**u** + **c**_{α,γ}, **Σ**_{α,γ}), (7)

where *N*(**y**; **μ**, **Σ**) represents a Gaussian distribution with mean **μ** and covariance **Σ**, **u** is a 21·*L*_{AA}-dimensional vector, *α* is one of H, E, and C, and *γ* is one of the *L*_{SS}-tuples formed by four elements: O, H, E, and C (O represents the "over-terminus" state). The distribution function is characterized by the mean **μ**_{α,γ} = **w**_{α,γ}**u** + **c**_{α,γ}, where **w**_{α,γ} is a 20 × 21·*L*_{AA} matrix and **c**_{α,γ} is a 20-dimensional vector, and by the covariance **Σ**_{α,γ}. The subscripts *α* and *γ* indicate that the parameters **w**_{α,γ}, **c**_{α,γ}, and **Σ**_{α,γ} depend on the states of *SS*_{i} and *d*_{i}. Second, the CPD of *SS*_{i} (*i* = 2, 3, 4...) is defined by

*P*(*SS*_{i} = *β* | *SS*_{i-1} = *α*, *F*_{i-1} = *f*) = *T*_{α}(*β*) if *f* = 2; = 1 if *f* = 1 and *β* = *α*; = 0 otherwise, (8)

where *T*_{α}(*β*) is the transition probability from the secondary structure state *α* to the state *β*. Third, the CPD of *d*_{i} (*i* = 2, 3, 4...) is defined by

*P*(*d*_{i} = *λ* | *d*_{i-1} = *γ*, *SS*_{i-1} = *α*) = 1 if *λ*_{1} = *α* and *λ*_{j} = *γ*_{j-1} for *j* = 2, 3, ..., *L*_{SS}; = 0 otherwise, (9)

where

*λ*
_{
j
}and

*γ*
_{
j
}(

*j* = 1, 2, ...

*L*
_{
SS
}) are the

*j*th elements of the

*L*
_{
SS
}-tuples

*λ* and

*γ*, respectively. Fourth, the CPD of

*D*
_{
i
}(

*i* = 2, 3, 4...) is defined by

where *g*_{α}(*n*) is the segment length distribution given the secondary structure state *α*, and *h*_{α} is the probability for *D*_{i} to maintain the value *D*_{max} given *SS*_{i} = *α* and *D*_{i-1} = *D*_{max}. Using this function, the probability of producing a segment of length *n* (*n* ≥ *D*_{max}) is proportional to (1-*h*_{α})·*h*_{α}^{n-D_max}, i.e. a geometric distribution. The validity of using such a distribution to model segments longer than *D*_{max} is supported by Fig. 3(a), in which all the helices, sheets, and coils show exponential tails in their segment length distributions. Fig. 3(a) also indicates that a proper *D*_{max} should be 13, beyond which all the distributions can be fitted well by exponential functions (see the inset of Fig. 3(a)). Finally, the CPD of *F*_{i} (*i* = 1, 2, 3...) is defined by

*P*(*F*_{i} = 1 | *D*_{i} = *n*) = 1 if *n* > 1; *P*(*F*_{i} = 2 | *D*_{i} = *n*) = 1 if *n* = 1, (11)

Note that the CPDs of *SS*_{1}, *d*_{1}, and *D*_{1} have definitions similar to those of *SS*_{i}, *d*_{i}, and *D*_{i} (*i* = 2, 3, 4...) but with an independent set of parameters.
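
Eq. (7) can be evaluated numerically as in the sketch below, for a single (*α*, *γ*) parameter set. Toy dimensions are used rather than the 20 × 21·*L*_{AA} shapes of the actual model, and a full DBN would index **w**, **c**, and **Σ** by the states of *SS*_{i} and *d*_{i}:

```python
import numpy as np

def clg_density(y, u, W, c, cov):
    """Evaluate the conditional linear Gaussian density N(y; W u + c, cov)
    of Eq. (7) for one (alpha, gamma) parameter set. In the paper's model,
    W is 20 x (21*L_AA), c is 20-dimensional, and cov is 20 x 20; here
    the shapes are arbitrary for illustration."""
    mu = W @ u + c                              # conditional mean w u + c
    d = y - mu
    k = y.shape[0]
    _, logdet = np.linalg.slogdet(cov)          # log |cov|, numerically stable
    log_p = -0.5 * (k * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(cov, d))
    return np.exp(log_p)
```
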

The parameters of the CPDs described above are derived by applying the maximum likelihood (ML) method to the training set. In prediction, the marginal probability distribution of *SS*_{i} (*i* = 1, 2, 3...) is computed using the forward-backward (FB) algorithm [22], and the state of *SS*_{i} with the maximum probability is taken as the prediction for residue *i*. Both the ML and FB algorithms are implemented using the Bayes Net Toolbox [38].
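
As an illustration of the FB step, the sketch below computes posterior marginals for a plain discrete chain; the predicted label of position *i* is then the argmax of row *i*. The real model runs FB over the full DBN (via the Bayes Net Toolbox), so the transition matrix and emission likelihoods here are stand-ins:

```python
import numpy as np

def forward_backward(pi, T, lik):
    """Posterior marginals P(state_t | all evidence) for a chain with
    initial distribution pi (S,), transition matrix T (S, S), and
    per-position emission likelihoods lik (n, S). Alpha and beta passes
    are rescaled at each step for numerical stability; the rescaling
    cancels when the rows are renormalized at the end."""
    n, S = lik.shape
    alpha = np.zeros((n, S))
    beta = np.zeros((n, S))
    alpha[0] = pi * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, n):                        # forward pass
        alpha[t] = (alpha[t - 1] @ T) * lik[t]
        alpha[t] /= alpha[t].sum()
    beta[-1] = 1.0
    for t in range(n - 2, -1, -1):               # backward pass
        beta[t] = T @ (lik[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)
```
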

### Training and combinations

Training is done in two different ways, depending on the datasets involved. For the datasets CB513 and SD576, the standard *N*-fold cross-validation strategy is adopted, where *N* is either 7 or 10. That is, the dataset is split into *N* subsets with approximately equal numbers of sequences in each, and then *N*-1 of them are used for training while the remaining one is used for testing; the process is repeated *N* times with a rotation of the testing subset, ensuring that every protein sequence is tested exactly once. The second way of training concerns the dataset EVAc6, for which there exists a separate large dataset, EVAtrain, with low sequence identity (< 25%) to EVAc6. It is therefore customary to use EVAtrain as the training set and EVAc6 as the test set.
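
The *N*-fold rotation can be sketched as follows (a round-robin split into roughly equal folds; the paper's exact assignment of sequences to subsets may differ):

```python
def cross_validation_splits(sequences, N):
    """Split `sequences` into N roughly equal folds and yield the N
    (train, test) rotations, so that every sequence is tested exactly
    once, as in the N-fold protocol above (N = 7 or 10)."""
    folds = [sequences[k::N] for k in range(N)]   # round-robin assignment
    for k in range(N):
        test = folds[k]
        train = [s for j, fold in enumerate(folds) if j != k for s in fold]
        yield train, test
```
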

Note that the DBN and NN models are usually trained on the same training set, in order to allow a comparison and to be combined later to form DBNN. However, the detailed training process of the DBN differs somewhat from that of the NN, owing to the different architectures of the models. The DBN takes two sets of data as input, one for the profile and the other for the secondary structure; each set is a sliding window with the "current" residue located at the right end. The correlation information between the "current" residue and its neighbors is stored in the data, but depends on the direction in which the window slides (from N-terminus to C-terminus or the reverse). We therefore run the DBN model in both directions and then average the results (see below). The NN, on the other hand, takes only one sliding window, with the "current" residue located at the center of the window. Finally, the training of DBNN is simply the training of DBN and NN on the same dataset.

When a sequence is selected for either training or testing, the original PSSM generated by PSI-BLAST can be transformed into the interval [0, 1] by one of two strategies: linear transformation [Eq. (3)] or sigmoid transformation [Eq. (4)]. In addition, as mentioned above, the two directions, from N-terminus to C-terminus (NC) or the reverse (CN), give rise to different correlation structures, so we treat them separately. As a result, four basic DBN models are generated, corresponding to the four combinations above: (i) DBN_{linear+NC}, (ii) DBN_{linear+CN}, (iii) DBN_{sigmoid+NC}, and (iv) DBN_{sigmoid+CN}, where the subscripts are self-explanatory. The NN, on the other hand, is split into two kinds according to the transformation applied to the PSSM, and the corresponding models are denoted NN_{linear} and NN_{sigmoid}, respectively.
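
Eqs. (3) and (4) are not reproduced in this section, so the transforms below are common forms given only as stand-ins; in particular, the clipping bounds of the linear map are assumed:

```python
import math

def linear_transform(x, lo=-5.0, hi=5.0):
    """Map a raw PSSM score into [0, 1] by clipping to [lo, hi] and
    rescaling linearly (the bounds are assumed, not from the paper)."""
    return min(max((x - lo) / (hi - lo), 0.0), 1.0)

def sigmoid_transform(x):
    """Map a raw PSSM score into (0, 1) with the logistic function."""
    return 1.0 / (1.0 + math.exp(-x))
```
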

The six basic models described above are believed to contain complementary information, and they are combined to form three final models. Two combination strategies are used. The first is a simple averaging of the output scores and is used to form the two architecture-based final models, DBN_{final} and NN_{final}. It proceeds in two steps. One first averages the outputs of DBN_{linear+NC} and DBN_{linear+CN} to form DBN_{linear}, and those of DBN_{sigmoid+NC} and DBN_{sigmoid+CN} to form DBN_{sigmoid}. Then, DBN_{linear} and DBN_{sigmoid} are combined to form DBN_{final}. Similarly, NN_{linear} and NN_{sigmoid} are combined to form NN_{final}.
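
The averaging step can be sketched as below, assuming each model outputs one [H, E, C] score triple per residue (the representation is illustrative):

```python
def average_scores(*score_lists):
    """Average per-residue 3-state score vectors from several models,
    e.g. DBN_linear+NC and DBN_linear+CN -> DBN_linear. Each argument
    is a list of [H, E, C] triples, one per residue."""
    out = []
    for per_residue in zip(*score_lists):          # scores of one residue
        out.append([sum(s[k] for s in per_residue) / len(per_residue)
                    for k in range(3)])
    return out
```
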

The second strategy consists in using a new neural network, which has the same architecture as the basic NN models except that it takes as inputs the scores output by all six basic models (DBN_{linear+NC}, DBN_{linear+CN}, DBN_{sigmoid+NC}, DBN_{sigmoid+CN}, NN_{linear}, and NN_{sigmoid}). This final model is named DBNN, and it is the one that shows the best performance among the models mentioned above.
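
The input assembly for the combining network might look like the following sketch: for each residue, the [H, E, C] scores of the six basic models are concatenated into one feature vector. Any window context around the residue is omitted here, and the representation is assumed:

```python
def stacked_features(model_scores):
    """Build per-residue inputs for the combining network: concatenate
    the [H, E, C] score triples of all basic models (six in DBNN,
    giving 18 values per residue). `model_scores` is a list of
    per-model score lists, each with one triple per residue."""
    n = len(model_scores[0])                       # number of residues
    return [[v for m in model_scores for v in m[i]] for i in range(n)]
```
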