PepBank - a database of peptides based on sequence text mining and public peptide data sources
© Shtatland et al; licensee BioMed Central Ltd. 2007
Received: 16 March 2007
Accepted: 01 August 2007
Published: 01 August 2007
Peptides are important molecules with diverse biological functions and biomedical uses. To date, there does not exist a single, searchable archive for peptide sequences or associated biological data. Rather, peptide sequences still have to be mined from abstracts and full-length articles, and/or obtained from the fragmented public sources.
We have constructed a new database (PepBank), which at the time of writing contains a total of 19,792 individual peptide entries. The database has a web-based user interface with a simple, Google-like search function, advanced text search, and BLAST and Smith-Waterman search capabilities. The major source of peptide sequence data comes from text mining of MEDLINE abstracts. Another component of the database is the peptide sequence data from public sources (ASPD and UniProt). An additional, smaller part of the database is manually curated from sets of full text articles and text mining results. We show the utility of the database in different examples of affinity ligand discovery.
We have created and maintain a database of peptide sequences. The database has biological and medical applications, for example, to predict the binding partners of biologically interesting peptides, to develop peptide based therapeutic or diagnostic agents, or to predict molecular targets or binding specificities of peptides resulting from phage display selection. The database is freely available on http://pepbank.mgh.harvard.edu/, and the text mining source code (Peptide::Pubmed) is freely available above as well as on CPAN (http://www.cpan.org/).
Peptides have emerged as important affinity ligands for diagnostic and therapeutic medical uses as well as materials for a host of applications in biotechnology. While many excellent databases exist that provide protein sequence data [1–3], protein interaction data [4–9], and peptide data [10–13], a substantial fraction of literature data remains untapped. Unfortunately, the wealth of the peptide sequences in these sources is often difficult to access by modern methods of sequence similarity searching, because peptide sequences are not extracted in a suitable format. We therefore sought to address this issue by developing a combination of automatically mining MEDLINE abstracts for peptide sequences, combining the existing bioinformatics sources, and manually curating the full text articles and MEDLINE text mining results. The data, available through a web-based interface for simple and more advanced text search and BLAST and Smith-Waterman sequence similarity search, proved useful in our own work. Examination of initial data yielded some surprises as well, providing an incentive for us to make further improvements to the database. We hope that the peptide database, the associated tools, and the text mining algorithm will be useful to the larger biomedical community.
Peptides are defined by International Union of Pure and Applied Chemistry and International Union of Biochemistry and Molecular Biology (IUPAC-IUB) as compounds "produced by amide formation between a carboxyl group of one amino acid and an amino group of another" . In this paper, we use the term "peptides" as a common synonym for oligopeptides, which are defined as having "fewer than about 10–20 residues". We thus currently use an IUPAC-IUB length cut-off of 20 amino acid residues or less. Many of the peptides used as pharmaceutical and diagnostic agents fall within this cut-off.
Naturally occurring peptides function as hormones, transmitters, and modulators of numerous biological processes . Both naturally occurring and synthetic peptides are used in therapeutic applications , for example somatostatin analogs in tumor radiotherapy [16, 17] and oxytocin to induce labor . Examples of diagnostic uses include membrane-translocating agents , receptor targeting agents , and enzyme substrates . Driven by the great interest in the diverse applications of peptides, the new peptidomics field is rapidly emerging . The functions of peptides, including their interacting partners, are determined by their sequence and similar to longer proteins, can be predicted based on sequence similarity.
Prior knowledge can be used to predict or shorten the list of possible binding partners of a given peptide of interest, provided a peptide shares significant sequence similarity with other peptides or proteins whose binding partners are known [20, 23]. One can also use a sequence similarity search to remove peptides with similarity to other peptides with known, undesirable properties such as non-specific binding  or toxicity. Computational predictions are relatively fast and inexpensive, but require a peptide sequence database with links to peptide data, for use with sequence similarity search methods such as basic local alignment search tool (BLAST) [25, 26] or Smith-Waterman search [27, 28]. The non-sequence (text) data in such a peptide database can be queried with text search tools for biological, therapeutic or diagnostic applications, for example to find peptides that are enzyme inhibitors and whose sequences are available.
We searched through the existing bioinformatics sources, and found no single source that fully suited our needs. With the exception of the Receptor Ligand Contacts (RELIC) database and web-server  and Artificially Selected Proteins/Peptides Database (ASPD) , most large protein sequence and interaction databases that allow both sequence similarity and text annotation searches have two major drawbacks. First, most of their sequences are of biological origin, while many phage display [29, 30] or combinatorial screens yield non-biological sequence hits. There is no large repository of chemically generated unnatural sequences, similar to what PubChem  or ChemBank  are for compounds. Second, there exists less data on short peptides than on longer proteins, and usually no facile way to restrict the search to short sequences only. This is important because performing an unrestricted sequence similarity search often results in a large proportion of false positives due to hits to proteins in which the peptide sequence is buried and not accessible for binding, or is in a conformation different from that in a shorter peptide. The same sequence may have different binding properties when displayed on a phage versus when presented as part of the native protein . Sequence similarity based predictions are further hampered for conformationally constrained peptides, designed specifically to have properties different from the same sequence in linear form . ASPD  and RELIC  databases do not have these drawbacks, are well curated, but are relatively small compared with the large amount of sequence data in the MEDLINE abstracts. For example, the ASPD database has 1,717 entries of 20 amino acid or shorter sequences. RELIC (a server with many useful peptide sequence analysis tools) has 3,632 peptide sequences that result from phage display selections, but only 7 distinct targets to which they bind. Other peptide databases have different purposes and are more specialized by design, for example antimicrobial (the Antimicrobial Peptide Database (APD) , and others [12, 34, 35]), phosphorylation sites (Scansite ), or major histocompatibility complex related (SYFPEITHI , EPIMHC , and others [39–42]).
In order to create a database suitable for the identification of affinity ligands, we developed text mining methods to extract peptide sequences from MEDLINE abstracts and compiled them in a single, easily searchable database. While far from complete, the database is a useful publicly available source of peptide sequences and the associated data. Below we show how the database was constructed, how it functions, and how it can be used to identify targeting ligands.
Construction and content
Database model and overview
The application is using the open source Ruby on Rails framework  with a MySQL database  in the backend. The BLAST search [25, 26] was implemented using the NCBI binaries . The Smith-Waterman search was implemented using the SSEARCH program from the FASTA3 distribution [27, 28, 47]. The databases for sequence similarity searches included, in addition to sequences, the motifs, with any variable positions replaced with X for simplicity (for example, motif 'P(P/S)GH(Y/F)K' was used as 'PXGHXK').
The database was constructed from the following sources (with the current number of entries in parentheses): text mining of MEDLINE abstracts (13,596 entries), manual curation of full text PDF articles (859), and other public sources: ASPD (1,717) and UniProt (3,620), as described in the sections below. The total number of entries is currently 19,792. A small fraction of the peptide sequences resulting from MEDLINE abstract text mining were manually curated: 1,773 entries were validated as correct peptide sequences, and 170 of those were more fully annotated with additional interaction data present in the abstract. The database continues to grow as the new data are added to the sources such as MEDLINE and UniProt.
MEDLINE abstract text mining
In order to identify abstracts with peptide sequences, the entire MEDLINE database with its 15 million records was downloaded from the National Library of Medicine (NLM) ftp site . The text mining code was written in Perl, a language selected due to its text processing capabilities, and widely used in many important biomedical literature text mining applications [49–51]. Data were processed in 3 steps. First, each abstract was assigned a score based on how likely it was to contain a peptide sequence anywhere within the text. Second, each individual word was assigned a score based on how likely it was to contain a peptide sequence. For each word, a combined score was then computed based on both the word score and the abstract score. Thus, in total, we used three types of scores (abstract, word, and combined). Third, the sequences associated with the words were cleaned, and ambiguities resolved. After these tasks were completed, the words were ranked by the combined score and included in the peptide database based on empirically determined thresholds. Each unique sequence per abstract identified by text mining was assigned one database entry. Multiple occurrences of the same sequence in different forms, such as 'RGD' and 'Arg-Gly-Asp', were considered a single entry.
Text mining was performed on a Fedora Core 5 Linux virtual machine running on an HP DL320 server with two 3 GHz Xeon processors, allocated 512 MB of RAM. The data resided on a file server connected via Gigabit Ethernet. Text mining of the entire MEDLINE (baseline distribution and updates) took 44 hours, with an additional 16 hours for pre-processing: downloading, uncompressing/compressing and parsing MEDLINE distribution files. The resulting database was 35 MB. Incremental weekly processing of MEDLINE updates took on average under 1 hour.
Step 1. Classification of abstracts
MEDLINE entries that were either duplicates, or did not have abstracts, or were older than 1950 were removed. The older abstracts, which were published prior to the development of Edman degradation , did not contain peptide sequences. Several pattern categories of interest were created, such as those related to peptides, phage display, proteases, and others. For each abstract, the total number of matches to patterns in each category was computed, for example, for the 'peptide' category this included the number of matches to 'peptid' or 'hormone', and if at least one of these patterns was present, additionally included the number of matches to less specific patterns such as 'sequenc' or 'motif'. The title, abstract, medical subject heading terms and the chemical list were all scored. Some of the abstracts, especially those published before mid-1990s, often include peptide sequences which are related to protein digestion and sequencing. These sequences usually represent parts of longer proteins, rather than individual peptides, and were thus scored differently. Any matches to this 'digestion' category of patterns were counted. The abstract score was computed as the sum of the number of matches to categories 'peptide' and 'phage', minus the number of matches to the 'digestion' category. Additional terms were added to the abstract score for matches to more than one pattern category in the same abstract, for example the number of matches to patterns from the 'phage' category multiplied by the number of matches to the 'peptide' category. Phosphorylated peptides, such as those selected using the oriented phosphopeptide library technique , were not scored any differently from other peptides, that is, neither included not excluded specifically. There is a useful resource, Scansite, dedicated specifically to the phosphorylated peptides , which can be used for this application. Texts with a large number or fraction of words in all caps tend to produce many false positives, thus the abstract score was decreased for such abstracts. The abstract score was then transformed for convenience to the (0,1) interval using the function: y = x/(1+x). An abstract score below 0 was assigned to 0. An abstract related to peptide sequences tended to have a score close to 1, and an unrelated one to 0.
Step 2. Classification of words
Each abstract was split into words on whitespace. Each word was matched against a series of peptide sequence pattern categories, in order of decreasing specificities of patterns, until the first successful match. The pattern categories were: full names of amino acids (longest, most specific, such as 'valine' or 'valyl'), 3 letter symbols (such as 'Val') and 1 letter symbols (such as 'V', least specific). Because the recommendations of IUPAC-IUB for reporting peptide sequences  were not followed in a large number of abstracts, we had to use a complex classification method and added methods to clean sequences and resolve the ambiguities. Any word that matched a pattern of peptide sequence of at least two amino acids was assigned a score. The score was an empirically calculated measure used to distinguish peptide sequences from other terms, such as nucleic acid sequences, gene symbols, acronyms and all caps English words, which they sometimes closely resemble or are even identical to, when taken out of context.
The above score was defined by several factors. The length/amino acid symbol factor was based on the length of the sequence in amino acids (higher score for longer sequence patterns, which were more specific) and on the type of amino acid symbols used (higher score for the more specific full names than for 1 letter symbols). The degenerate amino acid factor was based on the fraction and the total number of degenerate amino acids (lower score for degenerate amino acids such as 'X' or 'Xaa', which may represent, for example, the starting randomized phage display library rather than the selected peptide). Other factors reflected similarity to either of the following categories: Roman numerals, nucleic acid sequences, gene names and gene symbols, English words, scientific terms or abbreviations, or a combination of the above. The list of abbreviations was derived from the comprehensive ADAM database . The list of gene names and symbols was derived from Entrez Gene , UniProt  and Human Gene Nomenclature (HGNC)  databases. An additional factor represented similarity of a given word to protein sequences relative to English words. It was computed for all words that matched a pattern of sequences in 1 letter amino acid symbols. The word was broken up into overlapping k-mers. For example, for k = 3, word 'EYHHYNK' was broken up into 'EYH', 'YHH', 'HHY', 'HYN', 'YNK'. The proportions of all possible k-mers were precomputed in the databases of known protein sequences (from UniProt) and non-sequences (here, English words from MEDLINE abstracts not related to peptides), designated Pp and Pn, respectively. We used the databases of protein sequences and non-sequences of 8 × 107 k-mers each, with k = 3, replacing counts of 0 with 1 to avoid division by 0. The protein/English word similarity factor was defined as the product over all overlapping k-mers within the word of (Pp/Pn). For a word with all k-mers equally frequent among sequences and non-sequences, the factor was 1, while for a word such as 'EYHHYNK' in which on average the k-mers were more frequent in protein sequences than in English words, the factor was greater than 1.
The word score was transformed to the (0,1) interval, similarly as in the abstract score. The word score thus depended only on the properties of the word itself, rather than on the context (the properties of the abstract). The combined word/abstract score was then computed for each word, and reflected the abstract score, the word score, and the maximum word score for all words in the abstract, included because sequences tend to occur together in abstracts. The combined word/abstract score sc was computed according to the formula
sc = sa(w1sw + w2sm), for sw > 0,
sc = 0, for sw = 0,
where sa is the abstract score, sw is the word score of the current word, sm is the maximum word score for all words in the abstract, and w1, w2 are the weights (w1 > w2). The combined score varied in the (0,1) interval. Words that matched peptide sequence patterns in abstracts related to peptides tended to have a score close to 1, and close to 0 otherwise.
Step 3. Clean-up
Words that matched peptide sequence patterns were cleaned in a series of steps and converted to 1 letter amino acid symbols, as follows. The terminal marks and modifications, such as 'H(2)N-' or '-CO-Ph', were removed. Numbers representing amino acid positions were removed. Other modifications, such as phosphate in 'pY' were removed. Motifs such as '(L/I)' or 'L/I' were resolved. Amino acids that do not have a 1 letter IUPAC symbol were replaced with X. As a result, a large variety of different sequence formats were resolved, including 'N-acetyl-l-aspartyl-l-glutamyl-l-valyl-l-aspartyl-7-amino-4-methylcoumarin' to 'DEVD', 'Gly1-Val2-Thr3-Ser4' to 'GVTS', '(Arg-Glu(EDANS)-Ser-Gln)' to 'RESQ', 'TRDI-pY-ETD-pY-pY-RK' to 'TRDIYETDYYRK', and others.
To estimate precision of text mining, 50 sequences with the combined score above the threshold for inclusion in PepBank were selected at random from the text mining output. Each of these positive predictions was manually verified, whether or not the word contained a peptide sequence (40 out of 50 were found correctly, precision = 0.8), and whether or not the word contained a peptide sequence AND the sequence was parsed 100% correctly (35 out of 50 correct, precision = 0.7). If the identified sequence was a partial protein sequence, rather than a peptide or a phage display sequence, it was considered an error: such sequences are typically entered in protein databases and do not need to be mined from text (most of the errors in precision were of this type). One or more incorrect amino acid was also considered an error.
For estimating recall, we created a separate test set of 50 sequences by searching in PubMed for recent review articles using as a query "peptide OR peptides" alone or in combination with "sequence OR sequences", and followed the PubMed abstract links for the references cited in the reviews. Peptide sequences were manually extracted from the abstracts without any automated pattern matching. The text mining output with the combined score above the threshold for inclusion in PepBank was matched against these positive real cases. Again, for each case we manually verified whether or not the algorithm found the word, which contained this peptide sequence (12 out of 50 correct, recall = 0.24), and whether or not the algorithm found the word AND the sequence was parsed 100% correctly (10 out of 50 correct, recall = 0.2). Most of the errors in recall were due to blanks (often typos) inside peptide sequences or due to unrecognized amino acid modifications.
The pioneering method to identify DNA and protein sequences in text, based on Markov models was described by Wren and co-workers . Our text mining method, while similar in spirit, has different goals and thus uses a different sequence identification strategy. One of our main goals was to rapidly identify peptides with potential therapeutic and diagnostic utility (including those derived from phage display peptides), rather than identifying peptide epitopes and providing an aid to their manual curation. We also use extensive context information from the abstract, and collect peptide motifs in addition to sequences. We clean the sequences and provide access to the data for biologists through a simple web-based interface for text and sequence similarity searches. We do not place a minimum length restriction on sequences, such as 6 amino acids, because many therapeutic peptides are relatively short, for example the well-known RGD motif and many others found in phage display. Due to the substantial differences in goals and methods between our approach and that of others, it may be interesting to develop in the future a hybrid method combining the strengths of both approaches.
All peptide sequences with length 20 or below were extracted from ASPD  and UniProt , and fields that mapped to PepBank were parsed and stored (for example, interactor fields from ASPD, peptide fields from UniProt). The links from PepBank to the source databases were provided for all entries. Many of the peptides were stored in UniProt as part of the longer precursor proteins, producing peptides on cleavage. These peptide sequences were extracted using the UniProt feature table by selecting those with feature key "peptide" or "chain" and feature length under 20. Additional entries were manually curated, capturing the available interaction data, from the full text articles on phage display in PDF format. The articles were chosen to represent a small but diverse selection of reports within this field.
Utility and discussion
The web-based user interface to PepBank offers text search (both Quick and Advanced), as well as sequence similarity search (BLAST and Smith-Waterman algorithms). The Quick Search function offers a simple, Google-like search for biologists looking for peptide data in all fields. Advanced Search options include querying data by individual fields. Exact search, wildcard (*) and any single character (_) are supported in text search, which enables, for example, searching for a sequence pattern as a query. The results of the text search are displayed as a table sortable in the browser, with hyperlinks to the original sources (MEDLINE/PubMed, ASPD, UniProt) and to more detailed information.
Text search example: VEGFR related peptides
To determine whether the database would yield target leads against known drug targets, we randomly chose a set of 20 defined drug targets from the 547 approved drug target data set in DrugBank . The randomly chosen drug targets were not skewed towards peptide receptors and included: squalene epoxidase, RAF proto-oncogene serine/threonine-protein kinase, muscarinic acetylcholine receptor M4, opioid mu receptor (OP3), adenosine A1 receptor, GABA transaminase, amidophosphoribosyltransferase precursor, tryptophan 5-hydroxylase 1, apoptosis regulator Bcl-2, matrix protein M2, vascular endothelial growth factor receptor 2 precursor, amiloride-sensitive sodium channel gamma-subunit, ribonucleotide reductase, cAMP phosphodiesterase, coagulation factor VIII, high affinity immunoglobulin epsilon receptor alpha-subunit precursor, retinol-binding protein I, glycine alpha 2 receptor, cytochrome P450 51, GABA-A receptor subunit (C. elegans). Relevant peptides were defined as those interacting with the target or its ortholog, or modulating the function of the target, for example by acting as a competitor. Relevant peptides in our database were identified in approximately 25% of the above drug targets.
Sequence similarity search examples
As an illustrative example, we performed an all-against-all BLAST search of PepBank sequences. One of the surprises was the discovery of an exact match to sequence 'GETRAPL' from phage display selection for peptides that bind to secreted protein acidic and rich in cysteine (SPARC) . The sequence had a BLAST hit with an E-value of 0.06 to an isolate from phage display selection of peptides that bind human saphenous vein smooth muscle cells . Following the BLAST results, we then found that in addition to these 2 selections, the exact same sequence was isolated independently multiple times by different groups in selections with unrelated targets. GETRAPL was found in phage display selections of peptides that bind human immunodeficiency virus type 1 (HIV-1) accessory viral protein (Vpr) , chromatin high mobility group protein 1, box A (HMGB1) from rat , mouse skeletal muscle tissue in vivo , and mouse brain cells in vivo .
We suggest that one of the utilities for PepBank is to search the peptide sequences of interest to the user with BLAST or Smith-Waterman algorithms to find any important similarities to the known peptides collected in our database. In this example, the search can be used to remove a relatively nonspecific binder GETRAPL. Note that searching PepBank with these tools is a unique resource: an exact match may be easy to find, but using a partial match such as GETRA as a query finds GETRAPL only in PepBank, but not in PubMed  or on Google. Searching with BLAST  or with Smith-Waterman/SSEARCH methods  using GETRAPL as a query against nr database  gives no peptide hits cited above. A large interactions database IntAct  gives no hits for GETRAPL query at all.
Another surprise discovery in the all-against-all BLAST search of PepBank sequences was the multiple occurrence of the sequence SVSVGMKPSPRP. The sequence had several exact matches over its entire length of 12 amino acids, with an E-value of 1 × 10-6. It was isolated in phage display selection for peptides that bind to DNA . In this selection SVSVGMKPSPRP was the only sequence studied due to its dominance (9 out of 10) in the selected pool. The exact same sequence was isolated in phage display selection for peptides binding to human monoclonal IgM , and to the mirror image of Alzheimer's disease amyloid peptide Abeta(1–42) . The sources for these sequences were MEDLINE abstract text mining, ASPD database, and manually curated full text articles, respectively. In addition, SVSVGMKPSPRP occurs in several patents [71, 72]. Several groups note multiple isolation of this remarkable sequence in their own and other, unrelated, experiments [73, 74]. The sequence has also been identified in a recent excellent review  which covers the important topic of target-unrelated sequences in phage display. Interestingly, all of the studies with both GETRAPL and SVSVGMKPSPRP were done with the phage display libraries from the same manufacturer, thus suggesting a library- or methodology-specific phenomenon. Both sequences illustrate one of the suggested utilities for PepBank, namely that one can search it with a sequence query using BLAST or Smith-Waterman algorithms to find any important similarities to the known peptides.
A new text mining tool was developed and used to identify peptide sequences in MEDLINE abstracts. These data were combined with two of the public sources of peptide sequence data, ASPD and UniProt, as well as with manually curated peptide data. The database application was developed to query the data using text and sequence similarity search through a web-based user interface. The utility of PepBank was demonstrated using different examples of peptide sequences. The results show that the database has valuable biological and medical applications. In the future, we plan to add other public sources of peptide data, such as the peptide subset of the Molecular Interaction database (MINT) , and other sources for text mining, such as full-text journal articles. Also, in the future we will apply machine learning techniques to improve the accuracy of text mining to extract sequences. In the next release, we plan to add the functionalities to download the data in a standard format, such as PSI MI, and to search the database for peptide motifs.
Availability and requirements
We thank Timo Duchrow, Vladimir Kubatin, Lee Josephson, Ching Tung, Elena Aikawa, Kim Kelly and Rajesh Anbazhagan for helpful discussions and feedback on the database, Jason Brown and Brett Dikeman for system administration work and Melissa Carlson for editorial assistance. We are grateful to the authors and curators of the resources we used: ADAM (in particular, Neil Smalheiser), MEDLINE/NLM, UniProt and ASPD, and to anonymous reviewers for their comments. This work was supported in part by NIH grants PO1-AI54904 (RW), P50-CA86355 (RW), U54-CA126515 (RW).
- Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Mazumder R, O'Donovan C, Redaschi N, Suzek B: The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 2006, 34 (Database issue): D187-91. 10.1093/nar/gkj161.PubMed CentralView ArticlePubMedGoogle Scholar
- Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, Geer LY, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Ostell J, Miller V, Pruitt KD, Schuler GD, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatusova TA, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2007, 35 (Database issue): D5-12. 10.1093/nar/gkl1031.PubMed CentralView ArticlePubMedGoogle Scholar
- Kersey PJ, Duarte J, Williams A, Karavidopoulou Y, Birney E, Apweiler R: The International Protein Index: an integrated database for proteomics experiments. Proteomics. 2004, 4 (7): 1985-1988. 10.1002/pmic.200300721.View ArticlePubMedGoogle Scholar
- Alfarano C, Andrade CE, Anthony K, Bahroos N, Bajec M, Bantoft K, Betel D, Bobechko B, Boutilier K, Burgess E, Buzadzija K, Cavero R, D'Abreo C, Donaldson I, Dorairajoo D, Dumontier MJ, Dumontier MR, Earles V, Farrall R, Feldman H, Garderman E, Gong Y, Gonzaga R, Grytsan V, Gryz E, Gu V, Haldorsen E, Halupa A, Haw R, Hrvojic A, Hurrell L, Isserlin R, Jack F, Juma F, Khan A, Kon T, Konopinsky S, Le V, Lee E, Ling S, Magidin M, Moniakis J, Montojo J, Moore S, Muskat B, Ng I, Paraiso JP, Parker B, Pintilie G, Pirone R, Salama JJ, Sgro S, Shan T, Shu Y, Siew J, Skinner D, Snyder K, Stasiuk R, Strumpf D, Tuekam B, Tao S, Wang Z, White M, Willis R, Wolting C, Wong S, Wrong A, Xin C, Yao R, Yates B, Zhang S, Zheng K, Pawson T, Ouellette BF, Hogue CW: The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Res. 2005, 33 (Database issue): D418-24. 10.1093/nar/gki051.PubMed CentralView ArticlePubMedGoogle Scholar
- Chatr-aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni G: MINT: the Molecular INTeraction database. Nucleic Acids Res. 2007, 35 (Database issue): D572-4. 10.1093/nar/gkl950.PubMed CentralView ArticlePubMedGoogle Scholar
- Kerrien S, Alam-Faruque Y, Aranda B, Bancarz I, Bridge A, Derow C, Dimmer E, Feuermann M, Friedrichsen A, Huntley R, Kohler C, Khadake J, Leroy C, Liban A, Lieftink C, Montecchi-Palazzi L, Orchard S, Risse J, Robbe K, Roechert B, Thorneycroft D, Zhang Y, Apweiler R, Hermjakob H: IntAct--open source resource for molecular interaction data. Nucleic Acids Res. 2007, 35 (Database issue): D561-5. 10.1093/nar/gkl958.PubMed CentralView ArticlePubMedGoogle Scholar
- Mewes HW, Frishman D, Mayer KF, Munsterkotter M, Noubibou O, Pagel P, Rattei T, Oesterheld M, Ruepp A, Stumpflen V: MIPS: analysis and annotation of proteins from whole genomes in 2005. Nucleic Acids Res. 2006, 34 (Database issue): D169-72. 10.1093/nar/gkj148.PubMed CentralView ArticlePubMedGoogle Scholar
- Mishra GR, Suresh M, Kumaran K, Kannabiran N, Suresh S, Bala P, Shivakumar K, Anuradha N, Reddy R, Raghavan TM, Menon S, Hanumanthu G, Gupta M, Upendran S, Gupta S, Mahesh M, Jacob B, Mathew P, Chatterjee P, Arun KS, Sharma S, Chandrika KN, Deshpande N, Palvankar K, Raghavnath R, Krishnakanth R, Karathia H, Rekha B, Nayak R, Vishnupriya G, Kumar HG, Nagini M, Kumar GS, Jose R, Deepthi P, Mohan SS, Gandhi TK, Harsha HC, Deshpande KS, Sarker M, Prasad TS, Pandey A: Human protein reference database--2006 update. Nucleic Acids Res. 2006, 34 (Database issue): D411-4. 10.1093/nar/gkj141.PubMed CentralView ArticlePubMedGoogle Scholar
- Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, Eisenberg D: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002, 30 (1): 303-305. 10.1093/nar/30.1.303.PubMed CentralView ArticlePubMedGoogle Scholar
- Mandava S, Makowski L, Devarapalli S, Uzubell J, Rodi DJ: RELIC--a bioinformatics server for combinatorial peptide analysis and identification of protein-ligand interaction sites. Proteomics. 2004, 4 (5): 1439-1460. 10.1002/pmic.200300680.View ArticlePubMedGoogle Scholar
- Valuev VP, Afonnikov DA, Ponomarenko MP, Milanesi L, Kolchanov NA: ASPD (Artificially Selected Proteins/Peptides Database): a database of proteins and peptides evolved in vitro. Nucleic Acids Res. 2002, 30 (1): 200-202. 10.1093/nar/30.1.200.PubMed CentralView ArticlePubMedGoogle Scholar
- Wade D, Englund J: Synthetic antibiotic peptides database. Protein Pept Lett. 2002, 9 (1): 53-57. 10.2174/0929866023408986.View ArticlePubMedGoogle Scholar
- Wang Z, Wang G: APD: the Antimicrobial Peptide Database. Nucleic Acids Res. 2004, 32 (Database issue): D590-2. 10.1093/nar/gkh025.PubMed CentralView ArticlePubMedGoogle Scholar
- IUPAC-IUB Joint Commission on Biochemical Nomenclature (JCBN). Nomenclature and symbolism for amino acids and peptides. Recommendations 1983. Biochem J. 1984, 219 (2): 345-373.Google Scholar
- Barrett JC, Elmore DT: Amino Acids and Peptides. 1998, Cambridge , Cambridge University PressView ArticleGoogle Scholar
- Cremonesi M, Ferrari M, Bodei L, Tosi G, Paganelli G: Dosimetry in Peptide radionuclide receptor therapy: a review. J Nucl Med. 2006, 47 (9): 1467-1475.PubMedGoogle Scholar
- Reubi JC, Macke HR, Krenning EP: Candidates for peptide receptor radiotherapy today and in the future. J Nucl Med. 2005, 46 Suppl 1: 67S-75S.PubMedGoogle Scholar
- Patka JH, Lodolce AE, Johnston AK: High- versus low-dose oxytocin for augmentation or induction of labor. Ann Pharmacother. 2005, 39 (1): 95-101.View ArticlePubMedGoogle Scholar
- Reynolds F, Weissleder R, Josephson L: Protamine as an efficient membrane-translocating peptide. Bioconjug Chem. 2005, 16 (5): 1240-1245. 10.1021/bc0501451.View ArticlePubMedGoogle Scholar
- Kelly KA, Allport JR, Tsourkas A, Shinde-Patil VR, Josephson L, Weissleder R: Detection of vascular adhesion molecule-1 expression using a novel multimodal nanoparticle. Circ Res. 2005, 96 (3): 327-336. 10.1161/01.RES.0000155722.17881.dd.View ArticlePubMedGoogle Scholar
- Messerli SM, Prabhakar S, Tang Y, Shah K, Cortes ML, Murthy V, Weissleder R, Breakefield XO, Tung CH: A novel method for imaging apoptosis using a caspase-1 near-infrared fluorescent probe. Neoplasia. 2004, 6 (2): 95-105. 10.1593/neo.03214.PubMed CentralView ArticlePubMedGoogle Scholar
- Schulz-Knappe P, Zucht HD, Heine G, Jurgens M, Hess R, Schrader M: Peptidomics: the comprehensive analysis of peptides in complex biological mixtures. Comb Chem High Throughput Screen. 2001, 4 (2): 207-217.View ArticlePubMedGoogle Scholar
- Liu C, Bhattacharjee G, Boisvert W, Dilley R, Edgington T: In vivo interrogation of the molecular display of atherosclerotic lesion surfaces. Am J Pathol. 2003, 163 (5): 1859-1871.PubMed CentralView ArticlePubMedGoogle Scholar
- Menendez A, Scott JK: The nature of target-unrelated peptides recovered in the screening of phage-displayed random peptide libraries with antibodies. Anal Biochem. 2005, 336 (2): 145-157. 10.1016/j.ab.2004.09.048.View ArticlePubMedGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410.View ArticlePubMedGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.PubMed CentralView ArticlePubMedGoogle Scholar
- Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol. 1981, 147 (1): 195-197. 10.1016/0022-2836(81)90087-5.View ArticlePubMedGoogle Scholar
- Pearson WR: Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics. 1991, 11 (3): 635-650. 10.1016/0888-7543(91)90071-L.View ArticlePubMedGoogle Scholar
- Smith GP: Filamentous fusion phage: novel expression vectors that display cloned antigens on the virion surface. Science. 1985, 228 (4705): 1315-1317. 10.1126/science.4001944.View ArticlePubMedGoogle Scholar
- Smith GP, Petrenko VA: Phage Display. Chem Rev. 1997, 97 (2): 391-410. 10.1021/cr960065d.View ArticlePubMedGoogle Scholar
- Tolliday N, Clemons PA, Ferraiolo P, Koehler AN, Lewis TA, Li X, Schreiber SL, Gerhard DS, Eliasof S: Small molecules, big players: the National Cancer Institute's Initiative for Chemical Genetics. Cancer Res. 2006, 66 (18): 8935-8942. 10.1158/0008-5472.CAN-06-2552.View ArticlePubMedGoogle Scholar
- Craig L, Sanschagrin PC, Rozek A, Lackie S, Kuhn LA, Scott JK: The role of structure in antibody cross-reactivity between peptides and folded proteins. J Mol Biol. 1998, 281 (1): 183-201. 10.1006/jmbi.1998.1907.View ArticlePubMedGoogle Scholar
- Uchiyama F, Tanaka Y, Minari Y, Tokui N: Designing scaffolds of peptides for phage display libraries. J Biosci Bioeng. 2005, 99 (5): 448-456. 10.1263/jbb.99.448.View ArticlePubMedGoogle Scholar
- Gueguen Y, Garnier J, Robert L, Lefranc MP, Mougenot I, de Lorgeril J, Janech M, Gross PS, Warr GW, Cuthbertson B, Barracco MA, Bulet P, Aumelas A, Yang Y, Bo D, Xiang J, Tassanakajon A, Piquemal D, Bachere E: PenBase, the shrimp antimicrobial peptide penaeidin database: sequence-based classification and recommended nomenclature. Dev Comp Immunol. 2006, 30 (3): 283-288. 10.1016/j.dci.2005.04.003.View ArticlePubMedGoogle Scholar
- Seebah S, Suresh A, Zhuo S, Choong YH, Chua H, Chuon D, Beuerman R, Verma C: Defensins knowledgebase: a manually curated database and information source focused on the defensins family of antimicrobial peptides. Nucleic Acids Res. 2007, 35 (Database issue): D265-8. 10.1093/nar/gkl866.PubMed CentralView ArticlePubMedGoogle Scholar
- Obenauer JC, Cantley LC, Yaffe MB: Scansite 2.0: Proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Res. 2003, 31 (13): 3635-3641. 10.1093/nar/gkg584.PubMed CentralView ArticlePubMedGoogle Scholar
- Rammensee H, Bachmann J, Emmerich NP, Bachor OA, Stevanovic S: SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics. 1999, 50 (3-4): 213-219. 10.1007/s002510050595.View ArticlePubMedGoogle Scholar
- Reche PA, Zhang H, Glutting JP, Reinherz EL: EPIMHC: a curated database of MHC-binding peptides for customized computational vaccinology. Bioinformatics. 2005, 21 (9): 2140-2141. 10.1093/bioinformatics/bti269.View ArticlePubMedGoogle Scholar
- Bhasin M, Singh H, Raghava GP: MHCBN: a comprehensive database of MHC binding and non-binding peptides. Bioinformatics. 2003, 19 (5): 665-666. 10.1093/bioinformatics/btg055.View ArticlePubMedGoogle Scholar
- Blythe MJ, Doytchinova IA, Flower DR: JenPep: a database of quantitative functional peptide data for immunology. Bioinformatics. 2002, 18 (3): 434-439. 10.1093/bioinformatics/18.3.434.View ArticlePubMedGoogle Scholar
- Govindarajan KR, Kangueane P, Tan TW, Ranganathan S: MPID: MHC-Peptide Interaction Database for sequence-structure-function information on peptides binding to MHC molecules. Bioinformatics. 2003, 19 (2): 309-310. 10.1093/bioinformatics/19.2.309.View ArticlePubMedGoogle Scholar
- Sathiamurthy M, Hickman HD, Cavett JW, Zahoor A, Prilliman K, Metcalf S, Fernandez Vina M, Hildebrand WH: Population of the HLA ligand database. Tissue Antigens. 2003, 61 (1): 12-19. 10.1034/j.1399-0039.2003.610102.x.View ArticlePubMedGoogle Scholar
- Hermjakob H, Montecchi-Palazzi L, Bader G, Wojcik J, Salwinski L, Ceol A, Moore S, Orchard S, Sarkans U, von Mering C, Roechert B, Poux S, Jung E, Mersch H, Kersey P, Lappe M, Li Y, Zeng R, Rana D, Nikolski M, Husi H, Brun C, Shanker K, Grant SG, Sander C, Bork P, Zhu W, Pandey A, Brazma A, Jacq B, Vidal M, Sherman D, Legrain P, Cesareni G, Xenarios I, Eisenberg D, Steipe B, Hogue C, Apweiler R: The HUPO PSI's molecular interaction format--a community standard for the representation of protein interaction data. Nat Biotechnol. 2004, 22 (2): 177-183. 10.1038/nbt926.View ArticlePubMedGoogle Scholar
- Ruby on Rails. [http://www.rubyonrails.org]
- MySQL. [http://www.mysql.com]
- The National Center for Biotechnology Information (NCBI) ftp site. [ftp://ftp.ncbi.nih.gov/]
- University of Virginia FASTA server. [http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml]
- The National Library of Medicine (NLM) ftp site. [ftp://ftp.nlm.nih.gov/]
- Bajdik CD, Kuo B, Rusaw S, Jones S, Brooks-Wilson A: CGMIM: automated text-mining of Online Mendelian Inheritance in Man (OMIM) to identify genetically-associated cancers and candidate genes. BMC Bioinformatics. 2005, 6: 78-10.1186/1471-2105-6-78.PubMed CentralView ArticlePubMedGoogle Scholar
- Crasto CJ, Marenco LN, Migliore M, Mao B, Nadkarni PM, Miller P, Shepherd GM: Text mining neuroscience journal articles to populate neuroscience databases. Neuroinformatics. 2003, 1 (3): 215-237. 10.1385/NI:1:3:215.View ArticlePubMedGoogle Scholar
- Donaldson I, Martin J, de Bruijn B, Wolting C, Lay V, Tuekam B, Zhang S, Baskin B, Bader GD, Michalickova K, Pawson T, Hogue CW: PreBIND and Textomy--mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics. 2003, 4: 11-10.1186/1471-2105-4-11.PubMed CentralView ArticlePubMedGoogle Scholar
- Edman P: Method for determination of the amino acid sequence in peptides. Acta Chem Scand. 1950, 4: 283-293.View ArticleGoogle Scholar
- Songyang Z, Shoelson SE, Chaudhuri M, Gish G, Pawson T, Haser WG, King F, Roberts T, Ratnofsky S, Lechleider RJ: SH2 domains recognize specific phosphopeptide sequences. Cell. 1993, 72 (5): 767-778. 10.1016/0092-8674(93)90404-E.View ArticlePubMedGoogle Scholar
- Zhou W, Torvik VI, Smalheiser NR: ADAM: another database of abbreviations in MEDLINE. Bioinformatics. 2006, 22 (22): 2813-2818. 10.1093/bioinformatics/btl480.View ArticlePubMedGoogle Scholar
- Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 2007, 35 (Database issue): D26-31. 10.1093/nar/gkl993.PubMed CentralView ArticlePubMedGoogle Scholar
- Eyre TA, Ducluzeau F, Sneddon TP, Povey S, Bruford EA, Lush MJ: The HUGO Gene Nomenclature Database, 2006 updates. Nucleic Acids Res. 2006, 34 (Database issue): D319-21. 10.1093/nar/gkj147.PubMed CentralView ArticlePubMedGoogle Scholar
- Wren JD, Hildebrand WH, Chandrasekaran S, Melcher U: Markov model recognition and classification of DNA/protein sequences within large text databases. Bioinformatics. 2005, 21 (21): 4046-4053. 10.1093/bioinformatics/bti657.View ArticlePubMedGoogle Scholar
- Underiner TL, Ruggeri B, Gingrich DE: Development of vascular endothelial growth factor receptor (VEGFR) kinase inhibitors as anti-angiogenic agents in cancer therapy. Curr Med Chem. 2004, 11 (6): 731-745. 10.2174/0929867043455756.View ArticlePubMedGoogle Scholar
- An P, Lei H, Zhang J, Song S, He L, Jin G, Liu X, Wu J, Meng L, Liu M, Shou C: Suppression of tumor growth and metastasis by a VEGFR-1 antagonizing peptide identified from a phage display library. Int J Cancer. 2004, 111 (2): 165-173. 10.1002/ijc.20214.View ArticlePubMedGoogle Scholar
- Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J: DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 2006, 34 (Database issue): D668-72. 10.1093/nar/gkj067.PubMed CentralView ArticlePubMedGoogle Scholar
- Kelly KA, Waterman P, Weissleder R: In vivo imaging of molecularly targeted phage. Neoplasia. 2006, 8 (12): 1011-1018. 10.1593/neo.06610.PubMed CentralView ArticlePubMedGoogle Scholar
- Work LM, Nicklin SA, Brain NJ, Dishart KL, Von Seggern DJ, Hallek M, Buning H, Baker AH: Development of efficient viral vectors selective for vascular smooth muscle cells. Mol Ther. 2004, 9 (2): 198-208. 10.1016/j.ymthe.2003.11.006.View ArticlePubMedGoogle Scholar
- Sankovich SE, Koleski D, Baell J, Matthews B, Azad AA, Macreadie IG: Design and assay of inhibitors of HIV-1 Vpr Cell Killing and growth arrest activity using microbial assay systems. J Biomol Screen. 1998, 3 (4): 299-304. 10.1177/108705719800300409.View ArticleGoogle Scholar
- Dintilhac A, Bernues J: HMGB1 interacts with many apparently unrelated proteins by recognizing short amino acid sequences. J Biol Chem. 2002, 277 (9): 7021-7028. 10.1074/jbc.M108417200.View ArticlePubMedGoogle Scholar
- Smith BF, Samoilova T: Methods and compositions for targeting compounds to muscle. United States Patent 6399575. 2001Google Scholar
- Smith BF, Samoilova T, Baker HJ: Methods and compositions for targeting compounds to the central nervous system. United States Patent 6399575. 2002Google Scholar
- The National Center for Biotechnology Information (NCBI) BLAST server. [http://www.ncbi.nlm.nih.gov/BLAST/]
- Wolcke J, Weinhold E: A DNA-binding peptide from a phage display library. Nucleosides Nucleotides Nucleic Acids. 2001, 20 (4-7): 1239-1241. 10.1081/NCN-100002526.View ArticlePubMedGoogle Scholar
- Messmer BT, Sullivan JJ, Chiorazzi N, Rodman TC, Thaler DS: Two human neonatal IgM antibodies encoded by different variable-region genes bind the same linear peptide: evidence for a stereotyped repertoire of epitope recognition. J Immunol. 1999, 162 (4): 2184-2192.PubMedGoogle Scholar
- Wiesehan K, Buder K, Linke RP, Patt S, Stoldt M, Unger E, Schmitt B, Bucci E, Willbold D: Selection of D-amino-acid peptides that bind to Alzheimer's disease amyloid peptide abeta1-42 by mirror image phage display. Chembiochem. 2003, 4 (8): 748-753. 10.1002/cbic.200300631.View ArticlePubMedGoogle Scholar
- Atkinson HJ, McPherson MJ, Winter MD: Control of crop pests & animal parasites through direct neuronal uptake. United States Patent 20030181376. 2003Google Scholar
- Robbins PD, Mi Z, Frizzell R, Glorioso JC, Gambotto A, Mai JC: Identification of peptides that facilitate uptake and cytoplasmic and/or nuclear transport of proteins, DNA and viruses. United States Patent 20030219826. 2003Google Scholar
- Arnaiz B, Madrigal-Estebas L, Todryk S, James TC, Doherty DG, Bond U: A novel method to identify and characterise peptide mimotopes of heat shock protein 70-associated antigens. J Immune Based Ther Vaccines. 2006, 4: 2-10.1186/1476-8518-4-2.PubMed CentralView ArticlePubMedGoogle Scholar
- Kolb G, Boiziau C: Selection by phage display of peptides targeting the HIV-1 TAR element. RNA Biol. 2005, 2 (1): 28-33.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.