Functionally specified protein signatures distinctive for each of the different blue copper proteins

Background Proteins with similar functions from different sources can be identified by a conserved cluster of amino acids in their sequences, referred to as a pattern, motif, signature or fingerprint. The widespread use of protein sequence analysis, on par with the growth of databases, underlines the importance of patterns or signatures for retrieving related sequences. Blue copper proteins are found in the electron transport chains of prokaryotes and eukaryotes. The signatures that already exist in the databases, such as type 1 copper blue, multiple copper oxidase, cyt b/b6, photosystem 1 psaA&B, psaG&K, and Rieske iron sulphur protein, are not specific to blue copper proteins, as their names suggest. Most profile and motif databases strive to classify protein sequences into a broad spectrum of protein families. This work describes signatures designed from the copper-binding motifs of blue copper proteins. The common feature of all blue copper proteins is a trigonal planar arrangement of two nitrogen ligands (each from a histidine) and one sulphur-containing thiolate ligand (from a cysteine), with strong interactions between the copper center and these ligands.
Results Conserved motifs shared among sequences are often crucial to the structure or function of the protein and can therefore provide a signature of family membership. The blue copper proteins chosen for the study were plantacyanin, plastocyanin, cucumber basic protein, stellacyanin, dicyanin, umecyanin, uclacyanin, cusacyanin, rusticyanin, sulfocyanin, halocyanin, azurin, pseudoazurin, amicyanin and nitrite reductase, which have been identified in both eukaryotes and prokaryotes. ClustalW analysis of the protein sequences of each blue copper protein was the basis for designing the protein signatures or peptides. The signatures and peptides identified in this study were designed around the active site region, which contains the amino acids bound to the copper atom. Each signature was highly specific for its kind of blue copper protein, and false picks were minimized. The set of signatures designed specifically for the blue copper proteins was entirely different from the existing broad spectrum signatures mentioned in the background section.
Conclusions These signatures can be very useful for the annotation of uncharacterized proteins and are highly specific in retrieving blue copper protein sequences of interest from the non redundant databases, which contain a large number of protein sequences.


Background
Most proteins can be grouped, on the basis of similarities in their amino acid sequences, into a limited number of protein families. Proteins or protein domains that belong to a particular family usually share functional attributes and are derived from a common ancestor. Highly conserved sequences in protein families are generally important for the function of a protein and/or for the maintenance of its three-dimensional structure. Within the last decade, the sensitivity of sequence searching techniques has been improved by profile or motif-based analysis, which uses information derived from multiple sequence alignments to construct and search for sequence patterns [1][2][3][4]. By studying the constant and variable properties of such groups, a signature can be derived for a protein family or domain that distinguishes its members from all unrelated proteins. The problem of fast exact and approximate searching for a pattern that contains classes of characters and bounded size gaps (CBG) in a text has a wide range of applications, among which a very important one is protein pattern matching [5]. Unlike single-sequence similarity, a profile or motif can exploit additional information, such as the position and identity of residues that are conserved throughout the family, as well as variable insertion and deletion probabilities [6]. These signatures can be used to assign a newly sequenced protein to a specific family and thus to formulate hypotheses about its function.
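A pattern built from classes of characters and bounded size gaps maps naturally onto a regular expression. The following is a minimal sketch, assuming a PROSITE-style notation in which elements are separated by hyphens, `x` is any residue, `[..]` is a set of allowed residues, `{..}` is a set of excluded residues and `(n)` or `(n,m)` is a repeat range. The function names and the toy pattern are hypothetical and chosen only for illustration; the pattern is not one of the signatures from this study or from any database.

```python
import re

# Hypothetical toy pattern: a Cys ... His ... Met/Gln arrangement with bounded gaps.
TOY_PATTERN = "C-x(1,2)-[PG]-x(0,1)-H-x(2,4)-[MQ]"


def prosite_to_regex(pattern: str) -> str:
    """Translate a subset of PROSITE-style pattern syntax into a Python regex."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Split each element into its core and an optional (n) or (n,m) repeat.
        m = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, low, high = m.groups()
        if core == "x":                                   # any residue
            token = "."
        elif core.startswith("[") and core.endswith("]"):  # allowed residues
            token = core
        elif core.startswith("{") and core.endswith("}"):  # excluded residues
            token = "[^" + core[1:-1] + "]"
        elif len(core) == 1 and core.isalpha() and core.isupper():
            token = core                                   # one fixed residue
        else:
            raise ValueError(f"unsupported element: {element!r}")
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return "".join(parts)


def find_hits(pattern: str, sequence: str):
    """Return (1-based start position, matched segment) for each hit."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


print(prosite_to_regex(TOY_PATTERN))
print(find_hits(TOY_PATTERN, "AAYFCTPHPFMKK"))
```

The sketch covers only the elements named above; anchors and other extensions of the notation are deliberately left out, and `finditer` reports non-overlapping matches only.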
The protein sequences retrieved by a keyword search vary widely between databases, owing to their different levels of redundancy. This could also reflect the different strengths and weaknesses of the analysis algorithms used in the different databases. Using the broad range signatures that exist in the databases, such as type 1 copper blue, multiple copper oxidase, cyt b/b6, photosystem 1 psaA&B, psaG&K, and Rieske iron sulphur protein, to retrieve blue copper proteins brings out different kinds of copper proteins and many unrelated proteins. A second search then becomes necessary to sort out the required blue copper proteins. A pattern database would be more selective, as it can identify family members based on conserved patterns in the functional region. Keeping these broad spectrum signatures in mind, more specific and targeted protein signatures were designed for each of the blue copper proteins. The diagnostic success of these specific signatures over the broad range signatures mentioned above lies in the number of true positives picked, with few or no false positives, from the non redundant databases.
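The comparison of true positives against false positives can be made explicit as precision and recall. The sketch below is illustrative only: the accession identifiers and hit lists are invented toy data, not results from this study, and the function name `evaluate_signature` is hypothetical.

```python
def evaluate_signature(retrieved, known_members):
    """Compare the sequences a signature retrieves with the known family members."""
    retrieved, known = set(retrieved), set(known_members)
    tp = len(retrieved & known)   # true positives: family members that were picked
    fp = len(retrieved - known)   # false positives: unrelated sequences picked
    fn = len(known - retrieved)   # false negatives: family members that were missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if known else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Invented toy data: four known family members (P1..P4) and unrelated hits (X1..X9).
known = ["P1", "P2", "P3", "P4"]
broad_hits = ["P1", "P2", "P3"] + [f"X{i}" for i in range(1, 10)]
specific_hits = ["P1", "P2", "P3", "P4"]

print(evaluate_signature(broad_hits, known))
print(evaluate_signature(specific_hits, known))
```

In this toy case the broad signature has a precision of 0.25 and a recall of 0.75, while the specific signature reaches 1.0 on both measures, which is the behaviour a well-targeted signature is meant to approach.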
Blue copper proteins, also known as cupredoxins, are small, soluble proteins (10-14 kDa) whose active site contains a type 1 copper [7]. All these type 1 blue copper proteins possess an eight stranded Greek key beta barrel or beta sandwich fold and have a highly conserved active site architecture. The type 1 blue copper proteins function by shuttling electrons from a protein acting as an electron donor to another acting as an electron acceptor in various biological systems, such as bacterial and plant photosynthesis [8,9]. During the electron transfer process, the copper ion changes from the diamagnetic Cu(I) to the paramagnetic Cu(II) oxidation state [10]. The coordination of the copper is determined by the conformation of its three closest ligands, two histidine nitrogens and a cysteine sulphur, and of a fourth, more distant ligand, a methionine sulphur [11]. As an example, the coordination sphere of the copper ion in the blue copper protein rusticyanin is shown in Figure 1. Type 1 copper sites are characterized by an intense blue color due to copper bound to thiolate [12]. They show an absorption at 600 nm and an unusual EPR signal, which arises from the asymmetric copper site. Most of the cupredoxins have similar redox potentials, ranging from 260 to 375 mV, and function at pH values ranging from 6 to 8 [9]. Rusticyanin is an exception, with a very high redox potential of 680 mV [13].
The use of active site patterns or signatures is rapidly becoming one of the essential tools of sequence analysis [14,15]. Although the sequences of the different blue copper proteins have diverged appreciably, the copper ligand sites are conserved. Applying the functionally specified signatures directly to databases helps in the quick retrieval of protein sequences related to each signature. The protein sequences thus retrieved were found to be highly specific to a particular blue copper protein. Because they are highly specific, these signatures allow uncharacterized proteins to be mined efficiently from the vast number of sequences deposited in different databases.
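Because the ligand sites stay conserved while the surrounding residues diverge, a signature can be read off the columns of an aligned block. The following is a simplified sketch of that idea and not the procedure used in this study: it assumes the sequences have already been aligned (for example with ClustalW) and trimmed to the active site window, and the three aligned segments and the `max_class_size` parameter are hypothetical.

```python
def derive_pattern(aligned, max_class_size=2):
    """Build a PROSITE-style pattern from equal-length aligned segments."""
    if len({len(segment) for segment in aligned}) != 1:
        raise ValueError("aligned segments must be non-empty and of equal length")
    elements = []
    for column in zip(*aligned):
        residues = sorted(set(column))
        if "-" in residues or len(residues) > max_class_size:
            elements.append("x")                  # gapped or variable column
        elif len(residues) == 1:
            elements.append(residues[0])          # fully conserved residue
        else:
            elements.append("[" + "".join(residues) + "]")  # small residue class
    # Collapse consecutive wildcard columns into a single x(n) element.
    collapsed, run = [], 0
    for element in elements + [None]:
        if element == "x":
            run += 1
            continue
        if run:
            collapsed.append("x" if run == 1 else f"x({run})")
            run = 0
        if element is not None:
            collapsed.append(element)
    return "-".join(collapsed)


# Hypothetical aligned active site segments, invented for illustration.
toy_block = ["CTPHPFM", "CSPHAGM", "CTGHYLQ"]
print(derive_pattern(toy_block))
```

The conserved cysteine and histidine columns survive as fixed positions, while divergent columns become wildcards, so the resulting pattern stays anchored on the copper-binding residues.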

Differentiation of blue copper proteins based on source of origin and active site tabulation
The eukaryotic blue copper proteins chosen for the study were plantacyanin, plastocyanin, cucumber basic protein, stellacyanin, umecyanin, uclacyanin, and cusacyanin. The prokaryotic blue copper proteins were rusticyanin, sulfocyanin, halocyanin, azurin, pseudoazurin, auracyanin, amicyanin and blue nitrite reductase. Plastocyanins are found in both eukaryotes and prokaryotes. Tables 1 and 2 describe the active site functional region for each of the blue copper proteins mentioned above. The active site functional region indicates the amino acids in the respective blue copper protein that are bound to the copper atom. For example, in plantacyanin (Protein Data Bank ID 1F56), the histidine at position 34, the cysteine at position 74, the histidine at position 79 and the methionine at position 84 are bound to the copper atom involved in the electron transport chain.

Keyword search for the specified blue copper proteins in different databases
The number of sequences retrieved for each protein from the different databases by keyword search is given in Table 3. As Table 3 shows, a keyword search is not an effective or precise way to retrieve sequences of a particular kind. If it is used, it is time consuming, because the protein of interest has to be filtered once again from the retrieved sequences. For example, a keyword search for plastocyanin gives widely varying output: 901 sequences in NCBI, 41 in SwissProt, 10 in TrEMBL, 350 in the Protein Information Resource, 375 in EMBL, and 41 in PDB.

A search for the existing signatures for the blue copper proteins
The signatures already available for each of the blue copper proteins, retrieved from the Prosite motif database, are listed in Table 4. The number of protein sequences retrieved from the PIR nREF database when these existing signatures are used as input is shown in Table 5. Table 4 indicates that most of the blue copper proteins carry the type 1 blue copper signature, PS00196. The multiple copper oxidase signatures present in rusticyanin (PS00079 and PS00080, Table 4) retrieve 799 and 366 sequences, respectively, as shown in Table 5. Because the number of known rusticyanin sequences is very small, a secondary search becomes necessary. Even if such a signature annotates an unknown protein as a rusticyanin, that protein still has to be sought among all the sequences retrieved. Table 4 shows that plastocyanin has cyt b-heme (PS00192), cyt b QO (PS00193), photosystem 1 PSAAB (PS00419), photosystem 1 PSAGK (PS01026), Rieske 1 (PS00199) and Rieske 2 (PS00200) as its signatures. As their names suggest, these signatures are very broad in spectrum. The number of sequences picked out by these signatures (Table 5) clearly indicates that most of them pick many sequences other than plastocyanin, and that some of them miss some plastocyanin sequences.

Discussion
The members of a protein family can be identified by collecting the sequences that match entries in profile or motif databases. Protein signatures are sequence motifs that are diagnostic of a protein family and indicate function. Signatures are matched to protein sequences in the non redundant databases and are scored using a dynamic programming algorithm that permits flexibility in gap distance and residue type [16]. Generating a signature involves identifying the residues in a protein sequence that impart functional properties to the protein. Protein signatures are efficient miners of related protein sequences that have the same functional residues and belong to the same class of proteins, from among the abundant sequences present in the non redundant databases. All copper ions in living cells are protein bound, because free copper is toxic. In most copper proteins, the copper ion, which can change its valence state, is mainly involved in the catalysis of biological processes or in the transport of electrons between different proteins in a cell. Blue copper proteins, also known as cupredoxins, have a type I copper site. They possess a single copper functional domain. The coordination of copper in most of the blue copper proteins is determined by the conformation of its three closest ligands, two histidine nitrogens and a cysteine sulfur, and of a fourth, more distant ligand, a methionine sulfur [11]. In auracyanin, stellacyanin and umecyanin the methionine is replaced by a glutamine residue, which binds to the copper atom as the fourth ligand.
A keyword search gives varied results from the different databases, as indicated in Table 3, owing to their different levels of redundancy. When a functionally related protein signature is used instead, only relevant related sequences are picked out from the non redundant database, as seen in Table 6. Protein signatures are therefore far more effective than keyword searches in extracting closely related sequences from different databases. The signatures already available for the blue copper proteins, such as the Cyt b/b6, Photosystem 1 PSAGK, Rieske iron sulfur protein, and type I copper blue signatures, are broad spectrum signatures. PS00196, the existing type I blue copper signature, picked out 589 sequences when submitted to the PIR nREF database, as shown in Table 5. We have also ensured that the active site region containing the amino acids bound to the copper atom is present in all our signatures. Protein signatures designed around the active site region should be very efficient for the annotation of uncharacterized proteins. In one study, the authors used metal binding patterns of metalloproteins present in the Protein Data Bank to search gene banks for new metalloproteins [17].
Protein signatures can, in a way, be compared to the primers used for amplification. The more specific and concise a primer, the more specific the amplification; similarly, the more specific the protein signature, the more significant the picks from the non redundant databases. Specific signatures thus reduce the time taken to pool related sequences from the abundant sequences available in the non redundant databases. For example, as shown in Table 4, the type 1 blue copper signature is present in plantacyanin, stellacyanin, umecyanin, cusacyanin, halocyanin, azurin, auracyanin and nitrite reductase among the sixteen different blue copper proteins. When this signature is used as a query in the Prosite database, an unknown protein can be annotated only as a type 1 blue copper protein; it cannot be classified as a particular blue copper protein. The newly designed signatures or peptides will help in assigning an uncharacterized protein to its exact subtype of blue copper protein. In this study, we have designed signatures based on functional properties, which contain the amino acid residues that bind the copper atom. We conclude that we have been successful in designing functionally related protein signatures for the blue copper proteins.

Conclusions
Signatures designed around the functionally important regions of a protein are valuable for annotation. In this study, specific signatures were designed around the active site regions of each of the blue copper proteins plantacyanin, plastocyanin, uclacyanin, stellacyanin, rusticyanin, sulfocyanin, amicyanin, halocyanin, pseudoazurin, azurin and nitrite reductase. These will be very useful for annotating uncharacterized proteins as blue copper proteins. Further, because of their high specificity to each subclass, they can be used in classifying the various subtypes of blue copper proteins.

Differentiation of blue copper proteins based on source of origin and active site tabulation
The blue copper proteins were distinguished based on the source of origin as prokaryotic and eukaryotic. The active site residues of the eukaryotic and prokaryotic blue copper proteins, which bind the copper metal atom, were identified from the Protein Data Bank and tabulated.

Keyword search for the specified blue copper proteins in different databases
The name of each blue copper protein was submitted as a keyword query to the NCBI, SwissProt, TrEMBL, PIR, EMBL and PDB databases, and the number of sequences retrieved from each database was tabulated.

A search for the existing signatures for the blue copper proteins
The existing signatures for each of these blue copper proteins were identified from the Prosite motif database and tabulated. These existing signatures were then used as query patterns in the PIR Motif/Peptide match, and a search was made against the PIR-nREF database.
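The effect of querying a sequence collection with patterns of different breadth can be reproduced on a small local scale. The sketch below is not the PIR Motif/Peptide match service and does not call it; it assumes a FASTA-formatted text held locally and patterns already written as regular expressions. The sequences, identifiers and both patterns are invented toy data.

```python
import re
from io import StringIO

# Invented toy sequences in FASTA format.
TOY_FASTA = """>seqA toy cupredoxin-like segment
AAYFCTPHPF
MKK
>seqB toy unrelated protein
MKTLLVAGGS
>seqC toy partial match
GGCSGHAAKW
"""

# Hypothetical patterns: a short, broad one and a longer, narrower one.
SIGNATURES = {
    "broad": r"C.{1,2}[PG].{0,1}H",
    "narrow": r"C.{1,2}[PG].{0,1}H.{2,4}[MQ]",
}


def read_fasta(handle):
    """Parse FASTA text into a dict of identifier -> sequence."""
    records, header, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:].split()[0], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


def scan(records, signatures):
    """List, for each signature, the identifiers of the sequences it matches."""
    return {
        name: sorted(acc for acc, seq in records.items() if re.search(rx, seq))
        for name, rx in signatures.items()
    }


hits = scan(read_fasta(StringIO(TOY_FASTA)), SIGNATURES)
print(hits)
```

With this toy input the broad pattern picks both `seqA` and `seqC`, whereas the narrow pattern picks only `seqA`, mirroring on a small scale why a broad signature calls for a secondary search.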