A structural study for the optimisation of functional motifs encoded in protein sequences

Background: A large number of PROSITE patterns select false positives and/or miss known true positives. It is possible that, at least in some cases, the weak specificity and/or sensitivity of a pattern is due to the fact that one or more functional and/or structural key residues are not represented in the pattern. Multiple sequence alignments are commonly used to build functional sequence patterns. If residues structurally conserved in proteins sharing a function cannot be aligned in a multiple sequence alignment, they are likely to be missed in a standard pattern construction procedure.

Results: Here we present a new procedure aimed at improving the sensitivity and/or specificity of poorly performing patterns. The procedure can be summarised as follows:
1. Residues structurally conserved in different proteins that are true positives for a pattern are identified by means of a computational technique and by visual inspection.
2. The sequence positions of the structurally conserved residues falling outside the pattern are used to build extended sequence patterns.
3. The extended patterns are optimised on the SWISS-PROT database for their sensitivity and specificity.
The method was applied to eight PROSITE patterns. Whenever structurally conserved residues are found in the surface region close to the pattern (seven out of eight cases), the addition of information inferred from structural analysis is shown to improve pattern selectivity and, in some cases, sensitivity as well. In some of the cases considered, the procedure allowed the identification of functionally interesting residues, whose biological role is also discussed.

Conclusion: Our method can be applied to any type of functional motif or pattern (not only PROSITE ones) that is not able to select all and only the true positive hits and for which at least two true positive structures are available.
The computational technique for the identification of structurally conserved residues is already available on request and will soon be accessible on our web server. The procedure is intended for pattern database curators and for scientists interested in a specific protein family for which no specific or selective patterns are yet available.
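To make the idea of an extended sequence pattern concrete, the following is a minimal sketch of how a pattern written in PROSITE syntax can be translated into a regular expression and scanned against a sequence. It is not the procedure used in this study: the function names, the simplified core pattern `C-x(2)-C-H`, the extension `x(3,5)-[FY]` and the test sequences are all hypothetical and chosen only for illustration. The sketch handles the common syntax elements (`x`, `[..]`, `{..}`, repeat counts and the `<`/`>` anchors) and ignores rarer cases.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-syntax pattern into a Python regular expression.

    Illustrative only: supports x, [ABC], {ABC}, repeats such as x(2) or
    x(2,4), and the '<' / '>' terminal anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_matches(pattern: str, sequence: str):
    """Return (start, matched_text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


# Hypothetical core pattern and a hypothetical extension that adds one
# conserved position located 3 to 5 residues downstream of the core.
CORE = "C-x(2)-C-H"
EXTENDED = "C-x(2)-C-H-x(3,5)-[FY]"

# Toy sequences (not real proteins): both contain the core motif,
# but only the first also carries the extra conserved residue.
seq_with_extra = "AAACGGCHAAAAYAA"
seq_without_extra = "AAACGGCHAAAAAAA"
```

In this toy example the core pattern matches both sequences, whereas the extended pattern matches only the first one, which is the sense in which adding positions outside the original pattern can remove false positives.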

The LIPOCALIN PROSITE signature is derived from a stretch of residues common to all the proteins belonging to the lipocalin protein family [8,9].
Single-stranded RNA binding proteins are characterised by one or more copies of a putative RNA-binding domain. This domain, about 90 amino acids long [10,11], displays two highly conserved regions. The first region is used to build the RRM_RNP_1 PROSITE pattern.
For each PROSITE pattern, two extended patterns are proposed, each displaying a better correlation (C) than the original PROSITE pattern and a correlation comparable to that of the other: the former pattern with higher sensitivity and the latter with higher selectivity (see Table 4).
The PROSITE region of the former is unmodified, while the PROSITE region of the latter is slightly 'softened' (see Table 3). The choice between the two motifs when analysing novel sequences will therefore depend on whether sensitivity or selectivity is preferred.
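The trade-off between the two extended patterns can be illustrated with a short calculation. The definitions used in the study are not restated in this section, so the sketch below rests on assumptions: sensitivity is taken as TP/(TP+FN), selectivity as TP/(TP+FP), and the correlation C as the Matthews correlation coefficient. The function name and the counts in the example are hypothetical and are not taken from Tables 2 to 4.

```python
import math


def pattern_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute sensitivity, selectivity and correlation from match counts.

    Assumed definitions (not confirmed by the text):
      sensitivity = TP / (TP + FN)
      selectivity = TP / (TP + FP)
      correlation = Matthews correlation coefficient
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    selectivity = tp / (tp + fp) if (tp + fp) else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    correlation = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "selectivity": selectivity,
        "correlation": correlation,
    }


# Hypothetical counts for an original pattern and two extended variants.
original = pattern_metrics(tp=90, fp=10, fn=10, tn=890)
fewer_false_positives = pattern_metrics(tp=90, fp=0, fn=10, tn=900)
fewer_false_negatives = pattern_metrics(tp=98, fp=8, fn=2, tn=892)
```

With these made-up counts, both variants reach a higher correlation than the original pattern, one by removing false positives and the other by recovering false negatives.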
The extended 1 patterns generally match fewer false positives and the same number of true positives and false negatives as the corresponding PROSITE patterns on both SWISS-PROT releases used as optimisation and test datasets (Tables 2 and 3). There are two exceptions: the AA_TRNA_LIGASE_II_2 extended 1 pattern  Table 3], these three false negatives become true positive matches. The number of false positives also increases, though it remains lower than that of the EGF_1 PROSITE pattern.
This means that, in the case of the EGF_1 PROSITE pattern, even three extended patterns might be considered, for some purposes.
The extended 2 patterns yield fewer false negatives than the corresponding PROSITE patterns. This implies a greater number of true positives and, in every case but one (the AA_TRNA_LIGASE_II_1 extended 2 pattern), a lower number of false positives as well. On both SWISS-PROT releases, the AA_TRNA_LIGASE_II_1 extended 2 pattern matches a slightly higher number of false positive sequences than the original PROSITE pattern does.
All the remaining true positives that have been missed (see Tables 2 and 3, AA_TRNA_LIGASE_II_2, ASP_PROTEASE and LIPOCALIN) are partial sequences of the extended patterns, namely sequences that are not matched because they are fragments.
The number of such partial sequences is underlined in the corresponding columns of Table 3.

2) THIOL_PROTEASE_HIS
The THIOL_PROTEASE_HIS PROSITE pattern is built around the histidine residue of the Cys-His-Asn catalytic triad that characterises the proteolytic enzymes belonging to the eukaryotic thiol protease family [12]. In this case only one extended pattern was built, which performs better than the original PROSITE pattern.
This pattern (see Tables 2 and 3, last rows) matches the same number of true positives as the PROSITE one and displays selectivity and specificity values equal to one, which means that it detects no false positives on either of the SWISS-PROT releases used.

3) CYTOCHROME_C
The CYTOCHROME_C pattern is built around the heme-binding site of the cytochrome c protein family [13,14] and contains two cysteine residues known to be bound to the heme group and a histidine residue, which is one of the two axial ligands of the heme iron.
In the case of the CYTOCHROME_C PROSITE pattern, the computational and visual analysis of the heme-binding site region did not highlight structurally conserved residues across the entire set of aligned structures, except for some residues belonging to the short three-dimensional fragment corresponding to the PROSITE pattern. Therefore, the corresponding R-HET was only partially filled with Heavy Elements (see above). Indeed, the heme-binding site occupies a very small spatial region and is found in proteins with very different folds. Even if in many cases the local structure of the heme-binding site comprises similar loops, there are structures (e.g. the cytochrome c 1cpq structure) for which the 3D match occurs in a completely different local structure, namely an α-helix, although still interacting with the heme. Mondal et al. [15] showed that, at least in some cases, true PDB hits of a PROSITE pattern display structural plasticity depending on the context (e.g. interaction with ligands and DNA).
Therefore, when creating a 3D template for a PROSITE pattern, it might be important to take into account all the known distinct conformational states of the pattern. Similar conclusions are drawn by Lin et al. [16]. A more refined analysis of the nr-PDB true positive structures of the CYTOCHROME_C PROSITE pattern might make it possible to cluster structures with a similar 3D conformation of the pattern (e.g. bound to the heme). Structures belonging to the same group may display a greater number of conserved residues on the surface region surrounding the heme-binding site. Indeed, some residues were found to be conserved in small subsets of structures and were used to build a rough extended sequence pattern. However, this rough CYTOCHROME_C extended pattern gave rise only to extended sequence patterns that performed very poorly.