Identification of putative domain linkers by a neural network – application to a large sequence database
© Miyazaki et al; licensee BioMed Central Ltd. 2006
Received: 24 February 2006
Accepted: 27 June 2006
Published: 27 June 2006
The reliable dissection of large proteins into structural domains represents an important issue for structural genomics/proteomics projects. To provide a practical approach to this issue, we tested the ability of a neural network to identify domain linkers from the SWISSPROT database (101602 sequences).
Our search detected 3009 putative domain linkers adjacent to or overlapping with domains, as defined by sequence similarity to either Protein Data Bank (PDB) or Conserved Domain Database (CDD) sequences. Among these putative linkers, 75% were "correctly" located within 20 residues of a domain terminus, and the remaining 25% were found in the middle of a domain, and probably represented failed predictions. Moreover, our neural network predicted 5124 putative domain linkers in structurally un-annotated regions without sequence similarity to PDB or CDD sequences, which suggests the possible existence of novel structural domains. As a comparison, we performed the same analysis by identifying low-complexity regions (LCRs), which are known to encode unstructured polypeptide segments, and observed that the fraction of LCRs that correlate with domain termini is similar to that of domain linkers. However, domain linkers and LCRs appeared to identify different types of domain boundary regions, as only 32% of the putative domain linkers overlapped with LCRs.
Overall, our study indicates that the two methods detect independent and complementary regions, and that the combination of these methods can substantially improve the sensitivity of the domain boundary prediction. This finding should enable the identification of novel structural domains, yielding new targets for large scale protein analyses.
Structural genomics/proteomics projects seek to establish high-throughput techniques by promoting routine protein structure determination either by X-ray crystallography or NMR spectroscopy [1–7]. However, the determination of large protein structures remains a major hurdle, especially for NMR, which requires elaborate techniques and time-consuming analyses. Even when X-ray crystallography is employed, the average size of proteins determined by this method and listed in the PDB (Protein Data Bank) is about 230 residues. This situation not only reflects the difficulty of determining large protein structures, but also that of expressing and purifying them. Meanwhile, most large proteins are assembled from structural domains, which are structurally independent units that are able to fold into a native structure even when isolated from the rest of the protein. Thus, dissecting large proteins into their structural domains can provide several candidates for swift structural analysis by either X-ray crystallography or NMR spectroscopy.
Protein dissection is often a long and tedious process. Limited proteolysis is the prevalent experimental method for determining structural domain boundaries [9–12], but it does not alleviate the problems related to the expression and purification of large proteins. Screening methods for detecting natively folded proteins without relying on a specific functional activity have recently been developed [13, 14], and they may serve as tools to isolate natively folded domains from a library of randomly generated protein fragments, thus alleviating the need to first purify the full-length protein. However, experimental methods are usually time-consuming, and less expensive computer-aided methods for detecting putative domains in protein sequences have practical value for all types of high-throughput proteomics projects.
Various theoretical methods for identifying domains in protein sequences have recently been reported. These include well-established sequence similarity searches against existing domain databases, such as Pfam or SMART [16–19]. A major limitation of these methods is their inherent inability to identify completely novel domains. On the other hand, methods that do not rely on a pre-existing domain database can be valuable tools in high-throughput structural genomics projects, as they can identify novel, natively folded domains suitable for structural analysis [20, 21]. Thus, the prediction of domain organization based on sequence information alone is presently an actively investigated topic.
Recently, domain prediction methods based on sequence information alone have been developed. They rely on the statistics of residue contact in domains, the statistics of domain size distribution, the sequence characteristics of domain linkers [25–27], the amino acid composition of domain linkers [28–30], covariance analysis, and the conservation of hydrophobic clusters. Some of the aforementioned methods for detecting domain boundary sequence characteristics use neural networks [25–27]. Neural networks have been successfully applied to the prediction of several aspects of protein structure, such as secondary structures [34, 35], β turns, structural classes, and stabilization centers, but their use in domain boundary recognition is relatively new.
In this paper, we used our neural network to search for putative domain linker regions in the SWISSPROT database. The aim of the present study was threefold. First, we asked if our neural network – which was trained with a small data set of 74 multi-domain proteins derived from SCOP – could be applied to a practical problem, specifically, that of detecting protein domains for structural genomics/proteomics projects from a large sequence dataset. Second, we were interested in comparing our predictions, which rely only on sequence characteristics, with traditional methods that detect domains by sequence similarity to domain databases; here, we used the Protein Data Bank (PDB) and the Conserved Domain Database (CDD). Last, we examined the possibility of improving the detection of domain boundaries by combining the detection of the putative domain linkers with that of the low-complexity regions, which encode unstructured protein sequence segments. Overall, the present analysis confirmed our previous study, and indicated that our neural network can efficiently detect domain boundaries even when applied to a large and "real" sequence database.
Results and discussion
Detection of putative domain linkers by the neural network
Table 1 (numerical values not recovered). Columns: Sequence regions detected; No. of sequences (a); No. of sequence regions (b); No. of residues (c). Rows: Low-complexity regions with SEG parameters (45, 3.4, 3.75) (e), (45, 2.9, 3.2), (45, 2.6, 2.9) and (45, 2.45, 2.75); Putative domain linkers with cutoffs of 0.90 (f), 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97 and 0.98; Low-complexity regions (45, 2.9, 3.2) + Putative domain linkers (0.95) (g).
Assignment of 'putative structural domains'
For the purposes of this discussion, we define 'putative structural domains' as sequence segments with high similarity to PDB or CDD sequences (sequence identity > 30% and sequence overlap > 85%; see details in the Materials and methods section). Putative structural domains are thus expected to fold into a native structure or at least to form a domain, and we used them to assess the correctness of the predicted domain boundaries. As anticipated, a substantial fraction of the SWISSPROT sequences is covered by known putative structural domains. Specifically, from a total of 101602 SWISSPROT sequences, 38470 sequences (corresponding to, respectively, 38% and 27% on a sequence and residue basis) had similarity to a PDB sequence, and 64349 sequences (43% on a residue basis) had similarity to a CDD sequence (Table 1).
Correlation between predicted linkers and putative structural domain termini
The putative structural domains as defined above may contain multiple structural domains, and, hence, some linkers in class 4 may be correctly located. Our calculations thus slightly underestimate the actual performance of both the neural network and the LCR predictions (see also next section). However, the underestimations are likely to be very small, and concern only a few percent of the putative linkers, as most proteins in the PDB (and many in the CDD) are single structural domain proteins [28, 29].
Detection of low-complexity regions
Most large-scale sequence databases contain a substantial number of long, unstructured, disordered regions that may interfere with systematic searches for structural domains. Thus, the detection of unstructured portions of proteins, as defined by low-complexity regions (LCRs), which are unlikely to fold into a globular structure, or by structurally disordered regions, may help predict domain boundaries, although this was not the original intent. Here, we examined whether LCRs, as detected by SEG, overlapped with domain boundaries. Two parameters in the SEG program, called trigger and extension complexity, control the balance between the number of detected regions (Table 1) and the ratio of correct matches relative to incorrect ones (data not shown). In order to analyze approximately the same number of sequences as that of the putative linkers detected with the cutoff of 0.95, we set the trigger complexity to 2.9 and the extension complexity to 3.2, which yielded 8539 low-complexity regions (Table 1). Using an error window of 20 residues, the percentages of correct matches (classes 1 and 2), overlaps (class 4) and unknown locations (class 3) were 26.3%, 10.3% and 63.4%, respectively (Figure 1C). Thus, the positions of the LCRs correlate with the termini of the putative structural domains at a level similar to that observed for the domain linkers (Figure 1B).
Comparison of domain boundaries detected by domain linkers and LCRs
Table (numerical values not recovered). Columns: Putative domain linkers (a), split into "Uniquely linker regions" and "Overlapped with low-complexity regions"; "Overlapped with putative domain linkers"; "Uniquely low complexity". Rows: Correct matches of both ends (class 1); Correct matches of either end (class 2); Unknown locations (class 3); Overlaps (class 4).
As a result of their complementarity, the sensitivity of the domain detection was clearly improved by combining the LCR and linker predictions (Table 1; Figure 3). A combined search yielded 13946 domain boundaries, i.e., only 2726 fewer than the total of the LCR and linker sequences. Furthermore, the domain boundary sequences identified by a combined LCR-linker search were categorized into the 4 classes in percentages similar to those identified by the separate LCR and linker searches. Thus, the total number of correctly predicted domain termini increased 1.6-fold, while the fraction of incorrect predictions (false positives) remained unchanged.
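The counts quoted in the text can be cross-checked with a few lines of arithmetic. The sketch below assumes that the linker total is the sum of the two linker counts reported earlier (3009 near annotated domains and 5124 in un-annotated regions); the function name `combined_region_count` is hypothetical and used only for illustration.

```python
def combined_region_count(n_lcr, n_linker, n_shared):
    """Regions found by either method, counting shared regions once."""
    return n_lcr + n_linker - n_shared

n_lcr = 8539             # LCRs detected with SEG parameters (45, 2.9, 3.2)
n_linker = 3009 + 5124   # ASSUMPTION: sum of the two linker counts reported earlier
n_shared = 2726          # reported difference between the sum and the combined search

print(combined_region_count(n_lcr, n_linker, n_shared))  # 13946
```

Under this assumption, the result agrees with the 13946 domain boundaries reported for the combined search.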
Comparison with random guesses
Domain termini and error windows
Our study strongly suggests that sequence characteristics alone, as detected by either our neural network or SEG, can identify domain boundaries in protein sequences even without sequence similarity to existing domain databases. There is a clear correlation between the termini of putative structural domains and the positions of both the domain linkers and the LCRs. Furthermore, our neural network and SEG are complementary for detecting domain boundaries, and when combined, the sensitivity of the domain boundary prediction is increased without decreasing its specificity. Overall, our study shows that a domain identification protocol based on domain boundary prediction can be applied to practical problems, such as the identification of novel structural domains, and thus will yield new targets for large-scale protein analyses.
Sequence databases and estimation of the putative structural domains
A total of 101602 SWISSPROT protein sequences were used in the present investigation. Since the putative structural domains needed to be structurally independent units, we located all of the sequences with high similarity to PDB and CDD sequences, using the BLAST and RPS-BLAST programs [48, 49]. To ensure the structural identity as much as possible, we required a sequence identity greater than 30% and a sequential overlap greater than 85% over the entire length of the corresponding PDB or CDD sequence. Thus, putative structural domains detected by similarity to a PDB sequence are likely to fold into a structure similar to the corresponding PDB structure. Analogously, putative structural domains detected by similarity to sequences in the CDD, which is a compilation of conserved protein domain sequences imported from Pfam and SMART, are likely to correspond to a natively folded domain, although their structures have not necessarily been determined.
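The two similarity criteria can be illustrated as a filter over alignment hits. This is a minimal sketch, not the authors' pipeline: the `Hit` record and its field names are hypothetical, the hits are assumed to have been parsed already from the BLAST or RPS-BLAST output, and the overlap is assumed to be the aligned span divided by the full length of the PDB or CDD sequence.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """One alignment hit (hypothetical record, not a BLAST output format)."""
    query_start: int     # first aligned residue in the SWISSPROT sequence (1-based)
    query_end: int       # last aligned residue in the SWISSPROT sequence
    identity_pct: float  # percent identity over the alignment
    subject_start: int   # first aligned residue in the PDB/CDD sequence
    subject_end: int     # last aligned residue in the PDB/CDD sequence
    subject_len: int     # full length of the PDB/CDD sequence

def is_putative_domain(hit, min_identity=30.0, min_overlap=85.0):
    """Apply the >30% identity and >85% overlap criteria to one hit."""
    covered = hit.subject_end - hit.subject_start + 1
    overlap_pct = 100.0 * covered / hit.subject_len  # ASSUMED overlap definition
    return hit.identity_pct > min_identity and overlap_pct > min_overlap

def putative_domains(hits):
    """Return (start, end) query segments for the hits that pass the filter."""
    return [(h.query_start, h.query_end) for h in hits if is_putative_domain(h)]

hits = [
    Hit(10, 110, 45.0, 1, 100, 105),    # passes both criteria
    Hit(200, 250, 80.0, 1, 50, 100),    # covers only 50% of the subject
    Hit(300, 400, 25.0, 1, 100, 100),   # identity too low
]
print(putative_domains(hits))
```

Only the first hit in this toy input satisfies both thresholds, so a single segment is returned.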
Putative domain linkers predicted by the neural network
We used a neural network with two hidden units, trained to distinguish between domain linker and non-linker regions. The prediction procedure was identical to that reported in our previous paper, except for the following two points. (1) The prediction was carried out over the entire protein sequence, namely from the start to the end of each target sequence, because the SWISSPROT sequences may contain unstructured termini. Indeed, in our previous study, we assumed that a 60 residue length is the minimum for a polypeptide to fold independently, and we omitted the 60 terminal residues of the multi-domain protein sequences from the prediction, because the protein structures were known, and we knew that no unstructured termini were present. (2) Predicted domain linkers were not ranked, because under the stringent conditions (cutoff 0.90–0.98; see below) examined here, the prediction success rate was sufficiently high without such a procedure.
The smoothing window size and the threshold parameters were fixed to 19 and 0.5, respectively, as in our previous study. However, we set the cutoff parameter to values ranging from 0.90 to 0.98, because a high cutoff yields a better prediction specificity at the cost of the prediction sensitivity. The specificity and sensitivity for the first ranked domain linkers predicted with a cutoff of 0.90 are 81.8% and 10.3%, respectively, as calculated with a ten-fold jack-knife.
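The interplay of the three parameters can be sketched as follows. The exact roles of the threshold and the cutoff are defined in the previous paper and are not restated here, so the rule used below is an assumption made only for illustration: per-residue network outputs are smoothed with a centered moving average, a candidate region is a maximal run with a smoothed score above the threshold, and the region is reported only if its peak reaches the cutoff. The function names are hypothetical.

```python
def smooth(scores, window=19):
    """Centered moving average; windows are truncated at the sequence ends."""
    half = window // 2
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out

def call_linkers(scores, window=19, threshold=0.5, cutoff=0.95):
    """ASSUMED rule: keep maximal runs with smoothed score > threshold
    whose peak smoothed score is at least the cutoff."""
    smoothed = smooth(scores, window)
    regions, start = [], None
    # A -inf sentinel closes a run that extends to the last residue.
    for i, s in enumerate(smoothed + [float("-inf")]):
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            if max(smoothed[start:i]) >= cutoff:
                regions.append((start, i - 1))  # 0-based, inclusive
            start = None
    return regions

long_signal = [0.0] * 30 + [1.0] * 40 + [0.0] * 30
short_signal = [0.0] * 30 + [1.0] * 10 + [0.0] * 30
print(call_linkers(long_signal))
print(call_linkers(short_signal))
```

In this toy input, the short signal is averaged down below the cutoff by the 19-residue window and is discarded, which illustrates why a high cutoff trades sensitivity for specificity.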
Sequence entropy (also called Shannon's entropy) has been used to quantify the complexity of amino acid sequences, and several studies have examined the relationship between the sequence entropy and the globularity of proteins [42, 43]. According to these studies, the sequence entropy of globular proteins is generally high, with a lower limit of around 2.9.
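As an illustration of the quantity discussed above, the following sketch computes the Shannon entropy (in bits) of the residue composition of a sequence segment, and of every 45-residue window along a sequence. It shows only the entropy measure itself; SEG's segmentation procedure is more elaborate, and the function names are hypothetical.

```python
import math
from collections import Counter

def sequence_entropy(segment):
    """Shannon entropy (bits) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def window_entropies(sequence, window=45):
    """Entropy of every full-length window along the sequence."""
    return [sequence_entropy(sequence[i:i + window])
            for i in range(len(sequence) - window + 1)]

print(sequence_entropy("AACC"))                             # two equally frequent residues
print(round(sequence_entropy("ACDEFGHIKLMNPQRSTVWY"), 3))   # all 20 residues once
```

A segment made of a single residue type has an entropy of 0 bits, whereas a segment with all 20 amino acids at equal frequency reaches the maximum of log2(20), about 4.32 bits; the lower limit of about 2.9 quoted for globular proteins lies between these extremes.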
SEG is a program that identifies low-complexity regions in protein sequences. This program was originally intended to distinguish between globular and non-globular regions. In this study, we used SEG to check whether a correlation between the low-complexity regions and the putative structural domain termini existed. Three parameters in SEG, the trigger window length, the trigger complexity and the extension complexity, are used to assign low-complexity regions. We set the trigger window length to 45 residues, in line with previous studies [43, 51]. To obtain a number of LCRs similar to that of the linkers predicted with a cutoff of 0.95, the trigger and extension complexities were set to 2.9 and 3.2, respectively (Table 1 and Figures 1 and 3).
Evaluation of putative domain linkers and low-complexity regions
We evaluated the validity of the prediction of the domain boundaries from their positions relative to the putative structural domains as defined above. The predicted domain boundaries were divided into four classes (Figure 1A), using an error window to accommodate the ambiguity in the terminal positions of both the predicted domain boundaries and the putative structural domains. A predicted domain boundary was considered to be correctly located when its end was separated from a putative structural domain by fewer residues than specified by the error window (Figure 1A). Class 1 includes predicted domain boundaries in which the closest ends are located within the error window of a putative structural domain. Predicted domain boundaries with both ends located within the error window of the N and C terminal ends of two putative structural domains are categorized in class 2. Class 3 consists of predicted domain boundaries that are separated from any putative structural domain by a number of residues larger than the error window. Class 4 (overlaps) consists of predicted domain boundaries that are located within a putative structural domain, away from its termini.
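The classification can be sketched as below. This is an illustrative reading of the procedure, not the authors' code: an end of a predicted region is assumed to match when any domain terminus lies fewer than `error_window` residues away, and descriptive labels are used instead of class numbers. The function name and the toy coordinates are hypothetical.

```python
def classify_boundary(region, domains, error_window=20):
    """Classify a predicted boundary region relative to putative domains.

    region  -- (start, end) of the predicted linker or LCR
    domains -- list of (start, end) putative structural domains
    """
    start, end = region
    termini = [t for d in domains for t in d]

    def near_terminus(pos):
        # ASSUMPTION: strict "fewer residues than the error window"
        return any(abs(pos - t) < error_window for t in termini)

    matched = sum(near_terminus(p) for p in (start, end))
    if matched == 2:
        return "both_ends"   # correct match of both ends
    if matched == 1:
        return "one_end"     # correct match of either end
    if any(start <= d_end and d_start <= end for d_start, d_end in domains):
        return "overlap"     # inside a domain, away from its termini
    return "unknown"         # far from any annotated domain

domains = [(1, 100), (130, 250)]
for region in [(105, 125), (255, 300), (170, 185), (400, 420)]:
    print(region, classify_boundary(region, domains))
```

The four toy regions fall, in order, into the "both_ends", "one_end", "overlap" and "unknown" categories.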
We assumed the success rate of a blind prediction, i.e. a prediction without any a priori information, to be the probability that a randomly assigned position matches a terminal residue of a putative structural domain. Four classes were defined similarly to those used to evaluate the putative domain linkers and the low-complexity regions. For example, a randomly picked residue was considered to be correctly located and was classified in class 1, when the end of a putative structural domain was found within the error window. The success rates (quality index) for the blind prediction, the putative domain linkers and the low-complexity regions were calculated as the rate of correct matches (classes 1 and 2) relative to both the correct and incorrect matches (classes 1, 2 and 4).
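The quality index defined above is a simple ratio, sketched here with the LCR percentages reported in the text (26.3% correct matches and 10.3% overlaps with a 20-residue error window). The function name is hypothetical; unknown locations (class 3) are excluded, as in the definition.

```python
def quality_index(correct, overlaps):
    """Correct matches (classes 1 and 2) relative to correct plus
    incorrect matches (classes 1, 2 and 4); class 3 is ignored."""
    total = correct + overlaps
    if total == 0:
        raise ValueError("no regions fall in classes 1, 2 or 4")
    return correct / total

# LCR percentages quoted in the text for a 20-residue error window.
print(round(quality_index(26.3, 10.3), 3))  # 0.719
```

The same function applies to raw counts as well as to percentages, because only the ratio matters.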
We thank the members of the Protein Research Group (RIKEN, GSC) for discussions, and the Informatics Infrastructure Team (RIKEN, GSC) for the computational environment. The training of the neural network was performed on a Fujitsu VPP700E supercomputer at RIKEN, Wako campus. Satoshi Miyazaki passed away during the course of this work. He was a gifted graduate student, a kind and generous person. Y.K. and S.Y. wish to dedicate this paper to his memory.
- O'Toole N, Raymond S, Cygler M: Coverage of protein sequence space by current structural genomics targets. J Struct Funct Genomics 2003, 4(2–3):47–55. 10.1023/A:1026156025612
- Kim SH: Shining a light on structural genomics. Nat Struct Biol 1998, 5 Suppl: 643–645. 10.1038/1334
- Shapiro L, Lima CD: The Argonne Structural Genomics Workshop: Lamaze class for the birth of a new science. Structure 1998, 6(3):265–267. 10.1016/S0969-2126(98)00030-6
- Brenner SE, Barken D, Levitt M: The PRESAGE database for structural genomics. Nucleic Acids Res 1999, 27(1):251–253. 10.1093/nar/27.1.251
- Mallick P, Goodwill KE, Fitz-Gibbon S, Miller JH, Eisenberg D: Selecting protein targets for structural genomics of Pyrobaculum aerophilum: validating automated fold assignment methods by using binary hypothesis testing. Proc Natl Acad Sci U S A 2000, 97(6):2450–2455. 10.1073/pnas.050589297
- Yokoyama S, Hirota H, Kigawa T, Yabuki T, Shirouzu M, Terada T, Ito Y, Matsuo Y, Kuroda Y, Nishimura Y, Kyogoku Y, Miki K, Masui R, Kuramitsu S: Structural genomics projects in Japan. Nat Struct Biol 2000, 7 Suppl: 943–945. 10.1038/80712
- Chandonia JM, Brenner SE: The impact of structural genomics: expectations and outcomes. Science 2006, 311(5759):347–351. 10.1126/science.1121018
- Wider G, Wuthrich K: NMR spectroscopy of large molecules and multimolecular assemblies in solution. Curr Opin Struct Biol 1999, 9(5):594–601. 10.1016/S0959-440X(99)00011-1
- Dalzoppo D, Vita C, Fontana A: Folding of thermolysin fragments. Identification of the minimum size of a carboxyl-terminal fragment that can fold into a stable native-like structure. J Mol Biol 1985, 182(2):331–340. 10.1016/0022-2836(85)90349-3
- Parrado J, Conejero-Lara F, Smith RA, Marshall JM, Ponting CP, Dobson CM: The domain organization of streptokinase: nuclear magnetic resonance, circular dichroism, and functional characterization of proteolytic fragments. Protein Sci 1996, 5(4):693–704.
- Hubbard SJ: The structural aspects of limited proteolysis of native proteins. Biochim Biophys Acta 1998, 1382(2):191–206.
- Christ D, Winter G: Identification of protein domains by shotgun proteolysis. J Mol Biol 2006, 358(2):364–71. Epub 2006 Feb 13. 10.1016/j.jmb.2006.01.057
- Waldo GS, Standish BM, Berendzen J, Terwilliger TC: Rapid protein-folding assay using green fluorescent protein. Nat Biotechnol 1999, 17(7):691–695. 10.1038/10904
- Hagihara Y, Kim PS: Toward development of a screen to identify randomly encoded, foldable sequences. Proc Natl Acad Sci U S A 2002, 99(10):6619–24. Epub 2002 May 7. 10.1073/pnas.102172099
- Hondoh T, Kato A, Yokoyama S, Kuroda Y: Computer-aided NMR assay for detecting natively folded structural domains. Protein Sci 2006, 15(4):871–83. Epub 2006 Mar 7. 10.1110/ps.051880406
- Schultz J, Copley RR, Doerks T, Ponting CP, Bork P: SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res 2000, 28(1):231–234. 10.1093/nar/28.1.231
- Schultz J, Milpetz F, Bork P, Ponting CP: SMART, a simple modular architecture research tool: identification of signaling domains. Proc Natl Acad Sci U S A 1998, 95(11):5857–5864. 10.1073/pnas.95.11.5857
- Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer EL: The Pfam protein families database. Nucleic Acids Res 2002, 30(1):276–280. 10.1093/nar/30.1.276
- Marchler-Bauer A, Panchenko AR, Shoemaker BA, Thiessen PA, Geer LY, Bryant SH: CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res 2002, 30(1):281–283. 10.1093/nar/30.1.281
- Kuroda Y, Tani K, Matsuo Y, Yokoyama S: Automated search of natively folded protein fragments for high-throughput structure determination in structural genomics. Protein Sci 2000, 9(12):2313–2321.
- George RA, Heringa J: Protein domain identification and improved sequence similarity searching using PSI-BLAST. Proteins 2002, 48(4):672–681. 10.1002/prot.10175
- Kong L, Ranganathan S: Delineation of modular proteins: domain boundary prediction from sequence information. Brief Bioinform 2004, 5(2):179–192. 10.1093/bib/5.2.179
- Kikuchi T, Nemethy G, Scheraga HA: Prediction of the location of structural domains in globular proteins. J Protein Chem 1988, 7(4):427–471. 10.1007/BF01024890
- Wheelan SJ, Marchler-Bauer A, Bryant SH: Domain size distributions can predict domain boundaries. Bioinformatics 2000, 16(7):613–618. 10.1093/bioinformatics/16.7.613
- Miyazaki S, Kuroda Y, Yokoyama S: Characterization and prediction of linker sequences of multi-domain proteins by a neural network. J Struct Funct Genomics 2002, 2(1):37–51. 10.1023/A:1014418700858
- Sim J, Kim SY, Lee J: PPRODO: prediction of protein domain boundaries using neural networks. Proteins 2005, 59(3):627–632. 10.1002/prot.20442
- Liu J, Rost B: Sequence-based prediction of protein domains. Nucleic Acids Res 2004, 32(12):3522–3530. 10.1093/nar/gkh684
- Tanaka T, Yokoyama S, Kuroda Y: Improvement of domain linker prediction by incorporating loop-length-dependent characteristics. Biopolymers 2006, 84(2):161–168. 10.1002/bip.20361
- Tanaka T, Kuroda Y, Yokoyama S: Characteristics and prediction of domain linker sequences in multi-domain proteins. J Struct Funct Genomics 2003, 4(2–3):79–85. 10.1023/A:1026163008203
- Dumontier M, Yao R, Feldman HJ, Hogue CW: Armadillo: domain boundary prediction by amino acid composition. J Mol Biol 2005, 350(5):1061–1073. 10.1016/j.jmb.2005.05.037
- Rigden DJ: Use of covariance analysis for the prediction of structural domain boundaries from multiple protein sequence alignments. Protein Eng 2002, 15(2):65–77. 10.1093/protein/15.2.65
- George RA, Heringa J: SnapDRAGON: a method to delineate protein structural domains from sequence data. J Mol Biol 2002, 316(3):839–851. 10.1006/jmbi.2001.5387
- Hirst JD, Sternberg MJ: Prediction of structural and functional features of protein and nucleic acid sequences by artificial neural networks. Biochemistry 1992, 31(32):7211–7218. 10.1021/bi00147a001
- Qian N, Sejnowski TJ: Predicting the secondary structure of globular proteins using neural network models. J Mol Biol 1988, 202(4):865–884. 10.1016/0022-2836(88)90564-5
- Rost B, Sander C: Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol 1993, 232(2):584–599. 10.1006/jmbi.1993.1413
- Shepherd AJ, Gorse D, Thornton JM: Prediction of the location and type of beta-turns in proteins using neural networks. Protein Sci 1999, 8(5):1045–1055.
- Chandonia JM, Karplus M: Neural networks for secondary structure and structural class predictions. Protein Sci 1995, 4(2):275–285.
- Dosztanyi Z, Fiser A, Simon I: Stabilization centers in proteins: identification, characterization and predictions. J Mol Biol 1997, 272(4):597–612. 10.1006/jmbi.1997.1242
- Bairoch A, Apweiler R: The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res 2000, 28(1):45–48. 10.1093/nar/28.1.45
- Lo Conte L, Brenner SE, Hubbard TJ, Chothia C, Murzin AG: SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res 2002, 30(1):264–267. 10.1093/nar/30.1.264
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28(1):235–242. 10.1093/nar/28.1.235
- Wootton JC, Federhen S: Analysis of compositionally biased regions in sequence databases. Methods Enzymol 1996, 266: 554–571.
- Romero P, Obradovic Z, Li X, Garner EC, Brown CJ, Dunker AK: Sequence complexity of disordered protein. Proteins 2001, 42(1):38–48. 10.1002/1097-0134(20010101)42:1<38::AID-PROT50>3.0.CO;2-3
- Nagano K: Logical analysis of the mechanism of protein folding. I. Predictions of helices, loops and beta-structures from primary structure. J Mol Biol 1973, 75(2):401–420. 10.1016/0022-2836(73)90030-2
- Lewis PN, Scheraga HA: Predictions of structural homologies in cytochrome c proteins. Arch Biochem Biophys 1971, 144(2):576–583. 10.1016/0003-9861(71)90363-8
- Chou PY, Fasman GD: Prediction of protein conformation. Biochemistry 1974, 13(2):222–245. 10.1021/bi00699a002
- Westbrook J, Feng Z, Chen L, Yang H, Berman HM: The Protein Data Bank and structural genomics. Nucleic Acids Res 2003, 31(1):489–491. 10.1093/nar/gkg068
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403–410. 10.1006/jmbi.1990.9999
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389
- Rumelhart DE, Hinton GE, Williams RJ: Learning representations by back-propagating errors. Nature 1986, 323: 533–536. 10.1038/323533a0
- Wootton JC: Non-globular domains in protein sequences: automated segmentation using complexity measures. Comput Chem 1994, 18(3):269–285. 10.1016/0097-8485(94)85023-2
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.