### Residue contact number

We use two definitions of contact number in this study: the "discrete" and the "consecutive" contact number. The "discrete" contact number, $N_d$, is defined as the number of $C_\beta$ atoms on other residues located within a sphere of radius $r_d$ centred on the $C_\beta$ atom of the residue of interest. The discrete contact number for the $i$-th residue in a sequence of $M$ residues is given by

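$$N_d^{i} = \sum_{j} \sigma(r_{i,j}), \qquad \sigma(r_{i,j}) = \begin{cases} 1, & r_{i,j} < r_d \\ 0, & r_{i,j} \ge r_d \end{cases} \tag{1}$$

where $r_{i,j}$ is the distance between the $C_\beta$ atoms of the $i$-th and $j$-th residues, and the sum runs over the residues $j$ that are separated from residue $i$ by at least two amino acids in the sequence. Note that $N_d^{i}$ is an integer. By replacing the step function $\sigma(r_{i,j})$ with a sigmoid function, the contact number becomes a real number; this gives the "consecutive" contact number. This procedure was previously adopted by Kinjo et al. [3] to smooth the discrete contact numbers. A particular sigmoid function is given by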

$$\sigma(r_{i,j}) = \frac{1}{1 + \exp[3(r_{i,j} - r_d)]} \tag{2}$$

We tried four values of $r_d$ (8 Å, 10 Å, 12 Å and 14 Å) with both the discrete and the consecutive definitions, giving eight combinations, all of which are used in our SVR approach.
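As an illustration, the two definitions can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the code used in the study: the names `sigmoid_weight`, `contact_numbers` and `min_sep` are hypothetical, and `min_sep=3` reflects one reading of "separated in sequence by at least two amino acids" (two residues lying between the pair).

```python
import math

def sigmoid_weight(r_ij, r_d):
    """Smooth contact weight of Eq. (2): 1 / (1 + exp[3 (r_ij - r_d)])."""
    x = 3.0 * (r_ij - r_d)
    if x > 700.0:  # exp() would overflow; the weight is effectively zero
        return 0.0
    return 1.0 / (1.0 + math.exp(x))

def contact_numbers(cb_coords, r_d=12.0, min_sep=3, smooth=False):
    """Return one contact number per residue.

    cb_coords: list of (x, y, z) C-beta coordinates in angstroms, in sequence order.
    min_sep:   smallest |i - j| accepted as a contact partner (assumption).
    smooth:    False uses the step function ("discrete" definition),
               True uses the sigmoid of Eq. (2) ("consecutive" definition).
    """
    numbers = []
    for i, a in enumerate(cb_coords):
        count = 0       # discrete contact number (integer)
        weight = 0.0    # consecutive contact number (real)
        for j, b in enumerate(cb_coords):
            if abs(i - j) < min_sep:
                continue  # skip the residue itself and its close sequence neighbours
            r_ij = math.dist(a, b)
            if r_ij < r_d:
                count += 1
            weight += sigmoid_weight(r_ij, r_d)
        numbers.append(weight if smooth else count)
    return numbers
```

Calling `contact_numbers(coords, r_d=8.0)` and `contact_numbers(coords, r_d=8.0, smooth=True)` for each of the four radii would yield the eight combinations mentioned above.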

### Normalization of contact number

The distributions of contact numbers can be approximated by normal distributions, as shown in Fig. 1. For a given $r_d$, we calculate the mean ($\bar{N}$) and standard deviation ($SD$) of the contact numbers $N$. The normalized contact number $N_{norm}$ is then determined by the following formula:

$$N_{norm} = \frac{N - \bar{N}}{SD} \tag{3}$$

In the first step, we predict the normalized contact number because 1) the data are easier to handle, and 2) the results for different $r_d$ thresholds are easier to compare. In the second step, we recover the absolute contact numbers from their predicted normalized values by inverting this equation.
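The two steps amount to a z-score transform and its inverse, which can be sketched as follows. The function names are hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not say which was used.

```python
import statistics

def fit_normalizer(contact_numbers):
    """Mean and standard deviation of the contact numbers for one r_d."""
    mean = statistics.fmean(contact_numbers)
    sd = statistics.pstdev(contact_numbers)  # population SD (assumption)
    return mean, sd

def normalize(n, mean, sd):
    """First step: absolute contact number -> normalized contact number."""
    return (n - mean) / sd

def denormalize(n_norm, mean, sd):
    """Second step: predicted normalized value -> absolute contact number."""
    return n_norm * sd + mean
```

The mean and standard deviation would be fitted separately for each $r_d$ and each contact number definition.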

### Sequence coding

We predict the contact number from the local protein sequence. For a given residue, the local sequence contains the seven nearest-neighbour residues on its N-terminal side and the seven on its C-terminal side, so the local sequence forms a window of fifteen amino acids. We code each residue in the window using the PSI-BLAST position-specific scoring matrix [12]. The matrices are obtained by querying the input sequence with PSI-BLAST against the NCBI non-redundant protein sequence database for three rounds, masking coiled-coil and low-complexity regions [13]. The elements in a row of the matrix reflect the probabilities of the 20 amino acids occurring at that position. All elements are divided by 10 for normalization, so each residue is represented by a 20-dimensional vector. Since residues in coiled-coil and low-complexity regions do not have meaningful scores, we encode them with an orthogonal scheme: in the 20-dimensional vector coding such a residue, only the entry representing its amino acid type is set to 0.5, and all other entries are set to zero. To handle terminal residues, we extend the 20-dimensional vector to 21 dimensions for all residues. A vector whose last entry is 0.5 and whose other entries are zero represents a blank residue added at the N-terminus or the C-terminus to complete a local sequence of 15 residues. For all other residues, the 21st entry is set to zero. In summary, a residue is coded by a 315-dimensional vector (15 window positions × 21 entries).
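The coding scheme can be sketched as follows, assuming the PSI-BLAST matrix has already been parsed into one list of 20 scores per residue, with `None` marking masked residues. The names `encode_residue` and `encode_window`, the `None` convention and the column order in `AMINO_ACIDS` are all assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # column order is an assumption
HALF_WINDOW = 7                       # seven neighbours on each side

def encode_residue(pssm_row=None, aa=None):
    """Return the 21-dimensional code for one window position.

    pssm_row: 20 PSSM scores, or None if the residue is masked.
    aa:       one-letter amino acid code, used only for masked residues;
              None together with pssm_row=None means a blank (padding) residue.
    """
    if pssm_row is not None:
        return [s / 10.0 for s in pssm_row] + [0.0]  # scaled scores, blank flag off
    vec = [0.0] * 21
    if aa is None:
        vec[20] = 0.5                       # blank residue beyond a terminus
    else:
        vec[AMINO_ACIDS.index(aa)] = 0.5    # orthogonal code for a masked residue
    return vec

def encode_window(sequence, pssm, center):
    """Return the 315-dimensional input vector for the residue at `center`."""
    features = []
    for k in range(center - HALF_WINDOW, center + HALF_WINDOW + 1):
        if k < 0 or k >= len(sequence):
            features.extend(encode_residue())  # pad with a blank residue
        else:
            features.extend(encode_residue(pssm[k], sequence[k]))
    return features
```

Each call to `encode_window` yields 15 × 21 = 315 values, matching the dimension stated above.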

### Support vector regression

To find the function relating the local protein sequence to the normalized contact number, we use ε-insensitive support vector regression (ε-SVR) [14, 15]. The expected function can be formulated as

$$f(X_i) = \langle W, \Phi(X_i) \rangle + b \tag{4}$$

where $W$ is the weight vector and $b$ is the bias. $\Phi(X_i)$ is a non-linear function mapping a data point from the input space to the feature space, so SVR is able to perform non-linear regression. The goal of the regression is to find the optimal $W$ and $b$ under some optimisation criterion. In ε-SVR, only errors greater than ε are penalized, and two non-negative slack variables $\xi$ and $\xi^*$ measure the deviation of samples outside the ε-insensitive tube. The optimisation problem can be expressed as

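$$\min_{W,\,b,\,\xi,\,\xi^*} \; \frac{1}{2}\|W\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \quad \text{subject to} \quad \begin{cases} y_i - f(X_i) \le \varepsilon + \xi_i \\ f(X_i) - y_i \le \varepsilon + \xi_i^* \\ \xi_i,\ \xi_i^* \ge 0 \end{cases} \tag{5}$$

where $y_i$ is the target value of the $i$-th of $n$ training samples and $C$ is the regularization constant that determines the trade-off between the norm and the error penalty.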
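To make the role of the ε-insensitive tube and of $C$ concrete, the penalty can be sketched as below. This is an illustrative sketch assuming the standard ε-SVR objective; the function names, and the choice of which slack variable measures deviations above versus below the tube, are assumptions.

```python
def eps_insensitive_loss(y_true, y_pred, eps=0.01):
    """Zero inside the tube |y_true - y_pred| <= eps, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)

def slack_variables(y_true, y_pred, eps=0.01):
    """Return (xi, xi_star): how far the target lies above / below the tube."""
    xi = max(0.0, y_true - y_pred - eps)
    xi_star = max(0.0, y_pred - y_true - eps)
    return xi, xi_star

def svr_objective(weight_norm_sq, targets_and_predictions, C=5.0, eps=0.01):
    """Half the squared weight norm plus C times the summed tube violations."""
    penalty = sum(eps_insensitive_loss(y, f, eps) for y, f in targets_and_predictions)
    return 0.5 * weight_norm_sq + C * penalty
```

A larger $C$ weights the tube violations more heavily relative to the norm term, which is the trade-off described above.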

The solution of the above problem was given by the authors of ε-SVR [14, 15] as

$$f(X) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) \langle \Phi(X_i), \Phi(X) \rangle + b \tag{6}$$

where $\alpha_i$ and $\alpha_i^*$ are Lagrange multipliers and the sum runs over the $n$ training samples. We can replace $\langle \Phi(X_i), \Phi(X) \rangle$, the inner product of $\Phi(X_i)$ and $\Phi(X)$, by a kernel function $K(X_i, X)$ if $K(X_i, X) = \langle \Phi(X_i), \Phi(X) \rangle$. The radial basis function is used in our study, as given by

$$K(X_i, X) = \exp(-\gamma \|X_i - X\|^2) \tag{7}$$

where *γ* is a parameter to be tuned by the user.

We fix ε at 0.01, $\gamma$ at 0.01 and $C$ at 5.0 throughout, because this set of parameters yielded the best performance in our previous work [6, 8]. Several software packages, such as SVMlight [16], can be used to find the solution.
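Once a solver has returned the support vectors, their coefficients and the bias, prediction reduces to a kernel expansion. The sketch below assumes the standard form in which each support vector carries the coefficient $\alpha_i - \alpha_i^*$; the function names are hypothetical, and the code does not reproduce the SVMlight interface or its model file format.

```python
import math

def rbf_kernel(x_i, x, gamma=0.01):
    """Radial basis function kernel of Eq. (7): exp(-gamma * ||x_i - x||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x))
    return math.exp(-gamma * sq_dist)

def svr_predict(x, support_vectors, coeffs, bias, gamma=0.01):
    """Evaluate f(x) = sum_i coeffs[i] * K(X_i, x) + bias.

    coeffs[i] stands for (alpha_i - alpha_i^*) of the i-th support vector.
    """
    expansion = sum(c * rbf_kernel(sv, x, gamma)
                    for sv, c in zip(support_vectors, coeffs))
    return expansion + bias
```

Here each `x` would be a 315-dimensional coded window and the returned value a predicted normalized contact number; the default `gamma=0.01` follows the setting reported above.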

### Dataset preparation and prediction evaluation

To test our approach, we selected 945 unique protein chains that were previously used for the prediction of protein ASA and were prepared with PDB-REPRDB [17]. The structures were solved by X-ray crystallography at a resolution better than 2.0 Å and with an R-factor below 0.2. All chains contain at least 60 amino acids, and the pairwise sequence identity is below 25%. The protein names can be found in additional file 1 (supplementary material).

The proteins are randomly divided into three groups of 315 chains each. Each group is used in turn for training, with the remaining two groups used for testing. Each group is therefore tested twice, once by each of the functions derived from the other two groups, giving six sets of test results.
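This train-on-one, test-on-two protocol can be sketched as follows. The function name `three_group_rounds` and the fixed random seed are assumptions made for reproducibility of the illustration; the text does not describe how the random division was generated.

```python
import random

def three_group_rounds(chain_ids, seed=0):
    """Split chains into three groups and list every (train, test) pairing.

    Each group serves once as the training set, and each of the two
    remaining groups is tested separately against the resulting model.
    """
    ids = list(chain_ids)
    random.Random(seed).shuffle(ids)            # random division (seed is an assumption)
    groups = [ids[k::3] for k in range(3)]      # three groups of near-equal size
    rounds = []
    for train_idx in range(3):
        for test_idx in range(3):
            if test_idx != train_idx:
                rounds.append((groups[train_idx], groups[test_idx]))
    return groups, rounds
```

With 945 chains this gives three groups of 315 and six (training, test) pairs, so each group appears as a test set twice.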

Pearson's correlation coefficients and root mean square errors are calculated over all residues and for individual proteins. In addition, absolute errors are calculated for residues with different contact numbers. To compare with previous classification methods, we use different thresholds to classify residues as "contacted" or "non-contacted" and compute the overall accuracy, defined as the ratio of the number of correctly predicted residues to the total number of residues.
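The evaluation measures can be sketched as below. The function names are hypothetical, and treating a contact number at or above the threshold as "contacted" is an assumption, since the text does not say on which side of the threshold each class falls.

```python
import math

def pearson(observed, predicted):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    sq_err = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(sq_err / len(observed))

def two_state_accuracy(observed, predicted, threshold):
    """Fraction of residues whose two-state class is predicted correctly."""
    correct = sum((o >= threshold) == (p >= threshold)
                  for o, p in zip(observed, predicted))
    return correct / len(observed)
```

Applying these functions to the pooled residues of a test group gives the per-residue figures, and applying them chain by chain gives the per-protein figures.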