In this section we describe our novel method for predictive classification of DNA-binding propensity of proteins using so-called ball histograms. The motivation for the method is the observation that distributions of certain types of amino acids differ significantly between DNA-binding and non-DNA-binding proteins. This suggests that information about the distributions of some amino acids could be used to construct predictive models able to classify proteins as binding or non-binding given their spatial structure. We propose the following approach, which is able to capture fine differences between these distributions. It consists of four main steps. First, so-called templates are found, which determine the amino acids whose distributions should be captured by ball histograms. In the second step, ball histograms are constructed for all proteins in a training set. Third, a transformation method is used to convert these histograms to a form usable by standard machine learning algorithms. Finally, a random forest classifier [10] is learned on this transformed dataset and then used for classification. We chose the random forest learning algorithm because it is known to cope well with large numbers of attributes, as arise in our case of ball histograms [11]. We also experimented with a support vector machine in the final step; however, it was consistently outperformed by the random forest classifier.
Ball histograms
A template is a list of some Boolean amino acid properties. A property may, for example, refer to the charge of the amino acid (e.g. Positive), but it may also directly stipulate the amino acid type (e.g. Arginine). An example of a template is (Arg, Lys, Polar) or (Positive, Negative, Neutral). A bounding sphere of a protein structure is a sphere with center located in the geometric center of the protein structure and with radius equal to the distance from the center to the farthest amino acid of the protein plus the diameter of the sampling ball which is a parameter of the method. We say that an amino acid falls within a sampling ball if the alpha-carbon of that amino acid is contained in the sampling ball in the geometric sense.
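For concreteness, templates and properties could be represented as in the following minimal Python sketch. The property groupings (e.g. Arg, Lys and His counted as Positive) follow standard amino-acid classifications; the identifiers and data structures are our illustrative choices, not the authors' implementation.

# Boolean amino-acid properties, each given by the set of
# three-letter residue codes that satisfy it. The groupings are
# standard biochemistry; the table itself is illustrative.
PROPERTIES = {
    "Arg":      {"ARG"},
    "Lys":      {"LYS"},
    "Positive": {"ARG", "LYS", "HIS"},
    "Negative": {"ASP", "GLU"},
    "Polar":    {"SER", "THR", "ASN", "GLN", "TYR", "CYS"},
}

def complies(residue_type, prop):
    """True if an amino acid of the given type satisfies the property."""
    return residue_type.upper() in PROPERTIES[prop]

# A template is simply a tuple of property names:
template = ("Arg", "Lys", "Polar")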
Given a protein structure, a template $\tau = (f_1, \ldots, f_k)$, a sampling-ball radius $R$ and a bounding sphere $S$, a ball histogram is defined as

$$H_{\tau,R}(t_1, \ldots, t_k) = \frac{1}{Z} \iiint_S I_{\tau,R}(x, y, z, t_1, \ldots, t_k) \, dx \, dy \, dz \qquad (1)$$

where $I_{\tau,R}(x, y, z, t_1, \ldots, t_k)$ is an indicator function which we will define in turn. The expression $Z$ is meant as a normalization factor; it ensures that $\sum_{(t_1, \ldots, t_k)} H_{\tau,R}(t_1, \ldots, t_k) = 1$. In order to define the indicator function $I_{\tau,R}$ we first need to define an auxiliary indicator function $\hat{I}_{\tau,R}$:

$$\hat{I}_{\tau,R}(x, y, z, t_1, \ldots, t_k) =
\begin{cases}
1 & \text{if the sampling ball with radius } R \text{ centered at } (x, y, z) \text{ contains exactly } t_i \\
  & \text{amino acids complying with property } f_i \text{ for each } i \in \{1, \ldots, k\}, \\
0 & \text{otherwise.}
\end{cases}$$

Notice that $\hat{I}_{\tau,R}$ does not make any distinction between a sampling ball that contains no amino acid at all and a sampling ball that contains some amino acids of which none complies with the properties in the template $\tau$. Therefore, if we used $\hat{I}_{\tau,R}$ in place of $I_{\tau,R}$, the histograms would be affected by the amount of empty space in the bounding spheres. Thus, for example, there might be a big difference between the histograms of otherwise similar proteins where one is oblong and the other more curved. In order to get rid of this unwanted dependence of the indicator function $I_{\tau,R}$ on the proportion of empty space in the sampling balls, we define $I_{\tau,R}$ in such a way that it ignores the empty space. For $(t_1, \ldots, t_k) \neq (0, \ldots, 0)$ we set

$$I_{\tau,R}(x, y, z, t_1, \ldots, t_k) = \hat{I}_{\tau,R}(x, y, z, t_1, \ldots, t_k).$$

In the case when $(t_1, \ldots, t_k) = (0, \ldots, 0)$ we set $I_{\tau,R}(x, y, z, 0, \ldots, 0) = 1$ if and only if $\hat{I}_{\tau,R}(x, y, z, 0, \ldots, 0) = 1$ and the sampling ball with radius $R$ centered at $(x, y, z)$ contains at least one amino acid.
Ball histograms capture the joint probability that a randomly picked sampling ball (see Figure 2) containing at least one amino acid contains exactly $t_1$ amino acids complying with property $f_1$, $t_2$ amino acids complying with property $f_2$, etc. They are invariant to rotation and translation of protein structures, which is an important property for classification. Also note that the histograms would not change if we increased the size of the bounding sphere.
The indicator function $I_{\tau,R}$ makes a crisp distinction between the case where an amino acid falls within a sampling ball and the case where it falls outside it. This could be extended to capture a more complex case by replacing the value 1 with the fraction of the amino acid that falls within the sampling ball; however, for simplicity, we do not consider this case in this paper.
Ball-histogram construction
Computing the integral in Equation (1) exactly is infeasible, therefore we decided to use a Monte-Carlo method. The method starts by finding the bounding sphere. First, the geometric center $C$ of all amino acids of a given protein $P$ is computed (each amino acid is represented by the coordinates of its alpha-carbon). The radius $R_S$ of the bounding sphere for the protein structure $P$ is then computed as

$$R_S = \max_{a \in P} \operatorname{dist}(C, a) + 2R$$

where $R$ is the given sampling-ball radius and the maximum is taken over the alpha-carbons $a$ of the amino acids of $P$. After that, the method collects a pre-defined number of samples from the bounding sphere. For each sampling ball the algorithm counts the number of amino acids in it which comply with the particular properties contained in the given template, and increments the corresponding bin in the histogram. In the end, the histogram is normalized.
Example 1. Let us illustrate the process of histogram construction. Consider the template (Arg, Lys) and assume we already have a bounding sphere. The algorithm starts by placing a sampling ball randomly inside the bounding sphere. Assume the first such sampling ball contains the following amino acids: 2 Arginines and 1 Leucine; we therefore increment (by 1) the histogram's bin associated with the vector (2, 0). Then, in the second sampling ball, we get 1 Histidine and 1 Aspartic acid, so we increment the bin associated with the vector (0, 0). We continue this process until we have gathered a sufficient number of samples. In the end we normalize the histogram. Examples of such histograms are shown in Figures 3 and 4.
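The construction can be sketched compactly in Python. The following Monte-Carlo estimator is a minimal illustration of the procedure described above, not the authors' code; it assumes the PROPERTIES table and the complies helper from the earlier sketch, alpha-carbon coordinates as an (n, 3) array, and residue types as three-letter codes.

import numpy as np
from collections import defaultdict

def ball_histogram(coords, types, template, R, n_samples=100000, rng=None):
    """Monte-Carlo estimate of the ball histogram of Equation (1)."""
    rng = rng or np.random.default_rng()
    coords = np.asarray(coords, dtype=float)
    center = coords.mean(axis=0)                                  # geometric center C
    R_S = np.linalg.norm(coords - center, axis=1).max() + 2 * R   # bounding-sphere radius

    hist, total = defaultdict(int), 0
    for _ in range(n_samples):
        # Uniform point in the bounding sphere: random direction,
        # radius proportional to the cube root of a uniform variate.
        d = rng.normal(size=3)
        d /= np.linalg.norm(d)
        p = center + d * R_S * rng.uniform() ** (1.0 / 3.0)

        inside = np.linalg.norm(coords - p, axis=1) <= R
        if not inside.any():
            continue  # empty sampling balls are ignored, as in the definition of I
        counts = tuple(sum(complies(types[i], f) for i in np.flatnonzero(inside))
                       for f in template)
        hist[counts] += 1
        total += 1

    # Normalize over the non-empty samples so the bin heights sum to 1.
    return {bin_: c / total for bin_, c in hist.items()}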
Predictive classification using ball histograms
In the preceding sections we explained how to construct ball histograms, but not how to use them for predictive classification. One possible approach would be to define a metric on the space of normalized histograms and then use either a nearest-neighbour or a nearest-centroid classifier. Since our preliminary experiments with these classifiers did not give satisfactory predictive accuracies, we decided to follow a different approach inspired by a method from relational learning known as propositionalization [12], which transfers complicated relational descriptions to attribute-value representations.
The transformation method is quite straightforward. It looks at all histograms generated from the protein structures in a training set and creates a numerical attribute for each vector of property occurrences that is non-zero in at least one of the histograms. After that, an attribute vector is created for each training example using the collected attributes; the values of its entries correspond to the heights of the bins in the respective histogram. After this transformation, a random forest classifier is learned on the attribute-value representation and then used for predictive classification.
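As an illustration, the transformation could look as follows; representing histograms as dictionaries mapping bin vectors to normalized heights, and using scikit-learn's random forest with an arbitrary forest size, are our assumptions for this sketch.

from sklearn.ensemble import RandomForestClassifier

def histograms_to_attributes(histograms):
    """Propositionalization: one numeric attribute per bin vector that is
    non-zero in at least one training histogram; absent bins become 0.0."""
    attributes = sorted({bin_ for h in histograms for bin_ in h})
    matrix = [[h.get(bin_, 0.0) for bin_ in attributes] for h in histograms]
    return attributes, matrix

# Usage on a training set (train_histograms and y_train assumed given):
# attributes, X = histograms_to_attributes(train_histograms)
# clf = RandomForestClassifier(n_estimators=500).fit(X, y_train)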
In practice, the optimal sampling-ball radius needs to be estimated. This can be done by creating several sets of histograms, and their respective attribute-value representations, corresponding to different radii, and then selecting the best-performing radius using an internal cross-validation procedure.
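A sketch of this selection step, building on the helpers above; the candidate radii, the 5-fold setup and the forest size are illustrative choices, not values from the paper.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def select_radius(proteins, labels, template, radii=(6.0, 8.0, 10.0)):
    """Pick the sampling-ball radius with the best internal
    cross-validated accuracy (candidate radii are illustrative)."""
    best_radius, best_score = None, float("-inf")
    for R in radii:
        hists = [ball_histogram(coords, types, template, R)
                 for coords, types in proteins]
        _, X = histograms_to_attributes(hists)
        score = cross_val_score(RandomForestClassifier(n_estimators=500),
                                X, labels, cv=5).mean()
        if score > best_score:
            best_radius, best_score = R, score
    return best_radius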
Construction of templates
A question that we left unanswered so far in the description of our method is how to construct appropriate templates, which would allow us to accurately predict DNA-binding propensity. An all-inclusive strategy, where the template would simply list all possible properties, is clearly infeasible. A template with n properties generates training samples with a number of attributes d that is exponential in n (each attribute corresponds to a vector of n counts, so with at most $t_{\max}$ occurrences per property there can be up to $(t_{\max}+1)^n$ bins). Furthermore, machine learning theory [13] indicates that the number of training samples needed to preserve accurate classification grows exponentially in d. In effect, the required number of training samples grows doubly exponentially with the size of the template. It is thus crucial that the template consist of only a small number of relevant properties. On the other hand, omitting some amino acids completely might be a problem as well. A possible solution is to use several templates of bounded size instead of one big template, because the number of attributes d grows only linearly with the number of templates.
One possibility could be to use templates with properties corresponding to amino acid types believed to play an important role in the DNA-binding process according to the literature. This could mean, for example, using the four charged amino acids Arg, Lys, Asp and Glu as properties in a template, since they are known to often interact with the negatively charged backbone as well as with the bases of DNA [14–16], or other amino acids identified as important, e.g. the eight amino acids used in [7]. We performed such experiments in [17].
Here we advance this strategy by developing an automated method for template construction. The basic idea of the method is to find templates that maximize the distance between the average histograms of the two classes (DNA-binding and non-DNA-binding proteins). Intuitively, such templates should allow us to construct classifiers with good discriminative ability. We construct the templates heuristically using a variant of the best-first search algorithm (Algorithm 1, Template Search) to maximize the distance between the average histograms of the two classes. In this paper we use the Bhattacharyya distance [18], which is defined as
$$D_B(h_a, h_b) = -\ln \left( \sum_{x \in X} \sqrt{h_a(x) \, h_b(x)} \right)$$

where $h_a$ and $h_b$ are histograms and $X$ is their support set. Other distances could be used as well.
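With histograms represented as dictionaries, as in the sketches above, this definition translates directly into code:

import math

def bhattacharyya_distance(h_a, h_b):
    """Bhattacharyya distance between two normalized histograms.
    Returns infinity when the histograms have disjoint support."""
    support = set(h_a) | set(h_b)
    bc = sum(math.sqrt(h_a.get(x, 0.0) * h_b.get(x, 0.0)) for x in support)
    return math.inf if bc == 0.0 else -math.log(bc)

Note that the distance is infinite exactly when the two histograms share no non-zero bin, a fact used in Example 2 below.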
The following example shows that templates cannot be constructed greedily. Although we do not directly prove hardness of the template-search problem here, the example gives an intuition of why the problem is probably hard.
Example 2. Assume that we have histograms for DNA-binding and non-DNA-binding proteins as shown below, and that we want to find an optimal template of length 2. It can easily be verified that greedy search starting with an empty template would construct either the template (Arg, Gly) or (Lys, Gly), but not the optimal template (Arg, Lys).
Consider now the case where the search algorithm starts from the empty template. In the first step, the template (Gly) is constructed because it maximizes the distance between the histograms for the two classes; both (Arg) and (Lys) would give rise to identical histograms for the two classes. In the next step, Arg or Lys is added to this template. However, the resulting template is clearly not optimal, as can be checked by a routine calculation of the Bhattacharyya distance, which is finite for the discovered sub-optimal template but would be infinite for the optimal template (Arg, Lys).
Similarly, if we wanted to construct an optimal template of length 1 and started with the maximal template (Arg, Lys, Gly), iteratively removing its elements while greedily maximizing the distance between the histograms of the two classes, we would again end up with a sub-optimal template. In the first step, Gly would be removed, yielding the optimal template of size 2, (Arg, Lys). However, in this case we want to construct a template of length 1; therefore, in the next step we would create either the template (Arg) or (Lys), but not the optimal one, (Gly) (note that the distances for the templates (Arg) and (Lys) are 0).
In order to avoid repeatedly constructing histograms from the whole dataset, we construct a histogram corresponding to the largest possible template (containing all amino acid properties) and then, during the best-first search, obtain the histograms of the other templates by marginalising this largest histogram. While searching for a single template using best-first search is quite straightforward, searching for several templates is more complicated: we need to find not only a set of templates making the distances between the average histograms as large as possible, but these templates should also be sufficiently diverse. There are several possible ways to enforce diversity in the template set; we decided to follow a fast heuristic approach. During the template search we penalize all candidate templates which are subsets of some template already discovered in previous runs of the procedure.
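The marginalisation step mentioned above could look as follows, again with histograms as dictionaries mapping count vectors to heights; summing out the dropped properties is exactly marginalisation of the joint distribution, so the sub-template's histogram is recovered without re-sampling.

from collections import defaultdict

def marginalize(full_hist, full_template, sub_template):
    """Derive a sub-template's histogram from the largest template's
    histogram by summing over the dropped count coordinates."""
    idx = [full_template.index(f) for f in sub_template]
    sub = defaultdict(float)
    for counts, height in full_hist.items():
        sub[tuple(counts[i] for i in idx)] += height
    return dict(sub)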
Algorithm 1 Template Search

function TemplateSearch(λ)
    Templates ← ∅
    for i = 1 to NumberOfTemplates do
        Templates ← Templates ∪ {BestFirstSearch(Templates, λ)}
    end for
    return Templates
end function

function BestFirstSearch(Templates, λ)
    E+ ← set of positive examples (DNA-binding proteins)
    E- ← set of negative examples (non-DNA-binding proteins)
    Open ← {()}
    Open ← Open ∪ {t_i ∩ t_j | t_i, t_j ∈ Templates}
    Closed ← ∅
    BestTemplate ← ()
    Scores ← HeuristicScore(Open)
    while Open ≠ ∅ do
        Template ← remove best template from Open according to Scores
        if (D_x(Template, E+, E-) > D_x(BestTemplate, E+, E-)) ∧ (Template ∉ Templates) then
            BestTemplate ← Template
        end if
        for T ∈ Expand(Template) do
            if T ∉ Closed then
                Closed ← Closed ∪ {T}
                Open ← Open ∪ {T}
                if ∃ T' ∈ Templates : T ⊆ T' then
                    Score ← λ · HeuristicScore(T)
                else
                    Score ← HeuristicScore(T)
                end if
                Scores ← Scores ∪ {Score}
            end if
        end for
    end while
    return BestTemplate
end function
BestFirstSearch(Templates, λ). Here D_x denotes the chosen distance (in our case the Bhattacharyya distance) between the average histograms of the positive and negative examples under the given template. In order to direct the search early to the most promising regions of the search space, we first initialize the set Open with all pairwise intersections of the already discovered templates. Intuitively, sub-templates which appear in more templates constitute a kind of core shared by the most informative templates. This, together with the penalization of redundant templates via the factor λ, helps the algorithm visit the most promising parts of the search space.
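For illustration, this seeding of the Open list could be written as follows (a sketch, with templates treated as tuples of property names):

from itertools import combinations

def seed_open_list(templates):
    """Initialize Open with the empty template plus all pairwise
    intersections of the previously discovered templates."""
    open_set = {()}
    for a, b in combinations(templates, 2):
        open_set.add(tuple(sorted(set(a) & set(b))))
    return open_set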