Two Pfam protein families characterized by a crystal structure of protein lpg2210 from Legionella pneumophila
- Penelope Coggill†1, 2,
- Ruth Y Eberhardt1, 2,
- Robert D Finn3,
- Yuanyuan Chang4, 5,
- Lukasz Jaroszewski4, 5,
- Adam Godzik4, 5,
- Debanu Das5, 6,
- Qingping Xu5, 6,
- Herbert L Axelrod5, 6,
- L Aravind7,
- Alexey G Murzin8 and
- Alex Bateman†2
© Coggill et al.; licensee BioMed Central Ltd. 2013
Received: 25 June 2013
Accepted: 21 August 2013
Published: 3 September 2013
Every genome contains a large number of uncharacterized proteins that may encode entirely novel biological systems. Many of these uncharacterized proteins fall into related sequence families. By applying sequence and structural analysis we hope to provide insight into novel biology.
We analyze a previously uncharacterized Pfam protein family called DUF4424 [Pfam:PF14415]. The recently solved three-dimensional structure of the protein lpg2210 from Legionella pneumophila provides the first structural information pertaining to this family. This protein additionally includes the first representative structure of another Pfam family called the YARHG domain [Pfam:PF13308]. The Pfam family DUF4424 adopts a 19-stranded beta-sandwich fold that shows similarity to the N-terminal domain of leukotriene A-4 hydrolase. The YARHG domain forms an all-helical domain at the C-terminus. Structure analysis allows us to recognize distant similarities between the DUF4424 domain and individual domains of M1 aminopeptidases and tricorn proteases, which form massive proteasome-like capsids in both archaea and bacteria.
Based on our analyses we hypothesize that the DUF4424 domain may have a role in forming large, multi-component enzyme complexes. We suggest that the YARHG domain may play a role in binding a moiety in proximity with peptidoglycan, such as a hydrophobic outer membrane lipid or lipopolysaccharide.
Keywords: Domain of unknown function; Protein family; Protein structure; DUF4424; YARHG domain; Sequence analysis
A significant percentage of the proteins encoded by all known genomes are uncharacterized proteins that have never been studied experimentally and do not show significant sequence similarity to any known proteins. A frequent first step in the analysis of such proteins is their classification into protein families. Many research groups focus on identifying and defining new protein families, which are then deposited into protein family databases and used for annotation of proteins by resources such as UniProtKB. Some protein families consist entirely of uncharacterized proteins, and are therefore typically defined as domains of unknown function (DUFs) or uncharacterized protein families (UPFs). The Pfam database now contains a large collection of these families. In this work we have analyzed the DUF4424 family [Pfam:PF14415] and the YARHG domain [Pfam:PF13308]. The YARHG domain is also experimentally uncharacterized, but it was named for its highly conserved characteristic motif found in many of the sequences.
In a synergistic effort, the NIH Protein Structure Initiative (PSI) centers are systematically targeting uncharacterized proteins with the goal of providing structural information for a significant portion of the protein universe, often using Pfam for guidance in target selection. In this instance, the Joint Center for Structural Genomics (JCSG) has solved the crystal structure of a hypothetical protein (lpg2210) [UniProtKB:Q5ZTF2] encoded in the genome of L. pneumophila subsp. pneumophila str. Philadelphia 1 as a representative of the DUF4424 family, and deposited the coordinates in the Protein Data Bank as [PDB:4g2a]. L. pneumophila invades and replicates within human monocytes and alveolar macrophages, and also within amoebae, and is the established causative agent of legionellosis or Legionnaires’ disease.
Little is known about the lpg2210 protein. In a gene expression study, lpg2210 was found to be induced in the post-exponential growth phase, when L. pneumophila is known to express a variety of virulence factors in vitro. Thus, it is plausible that lpg2210 and other members of this family, all of which have a signal sequence and are secreted proteins, may play a role in virulence.
Results, Methods and Discussion
The first structure of the YARHG domain gives us the opportunity to look at the reason that the YARHG motif is conserved among members of the family. Detailed examination of the structure shows that the most conserved sequence-region corresponds to the structural region that contains a rather unusual feature, i.e. that of crossing loops. The underlying loop connects helices H8 and H9 (Figure 1), whereas the next loop, between H9 and H10, crosses over the underlying loop. The YARHG sequence motif (the actual sequence in lpg2210 is YAQYG) maps onto the C-terminal part of H8 with the conserved Gly residue forming its C-terminal cap. The other conserved residues probably contribute to the stabilization of this structural feature. In particular, the small residue (Ala) is packed against several conserved aromatic residues ‘upstream’ of the YARHG motif (F308, F317, W322 and Y323). The sequence and structural conservation of this region suggest again that it might contribute to the binding of a yet unknown specific ligand of this family.
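The observation that the lpg2210 sequence (YAQYG) departs from the name-giving YARHG motif can be illustrated with a small motif scan. The following is a minimal sketch only: the relaxed pattern `YA..G` (Tyr, Ala, any two residues, Gly) is our assumption, chosen because it covers both variants mentioned above, and it is not the Pfam profile HMM that actually defines the family. The function name and the toy sequence are hypothetical.

```python
import re

# Assumed relaxed pattern: Tyr, Ala, any two residues, Gly.
# It matches both YARHG (family name) and YAQYG (lpg2210 variant).
# The lookahead allows overlapping hits to be reported.
RELAXED_MOTIF = re.compile(r"(?=(YA..G))")


def find_motif(sequence):
    """Return (1-based start position, matched text) for every hit."""
    return [
        (match.start() + 1, match.group(1))
        for match in RELAXED_MOTIF.finditer(sequence.upper())
    ]


if __name__ == "__main__":
    # Toy sequence invented for illustration; not a real protein.
    toy_sequence = "MKTLLYARHGWPLNNYAQYGE"
    for position, text in find_motif(toy_sequence):
        print(position, text)
```

A pattern this short will produce chance matches in real proteomes, so hits would need to be checked against the full domain model before being interpreted.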
The sequence alignment of the DUF4424 domain, shown for a representative set in Figure 3A, does not reveal any strongly conserved short motifs suggestive of a particular function.
Pfam family DUF4424 is found predominantly in Gram-negative bacteria, in particular in Fusobacteria and Proteobacteria species, where it is found in Alpha-, Beta-, Gamma- and Delta-proteobacteria. Many of these species make up part of the natural gut and oral flora of humans, but can also be involved in human disease. For example, Fusobacterium nucleatum is an oral bacterium, indigenous to the human oral cavity, that plays a role in periodontal disease. The organism is commonly recovered from different monomicrobial and mixed infections in humans and animals. It is a key component of periodontal plaque due to its abundance and its ability to co-aggregate with other species in the oral cavity. The sequence containing the CARDB domain came from a genomic analysis of deep-sea sediments and is annotated as being of archaeal origin.
In bacterial families it has been shown that analysis of the gene-neighborhood can give hints about the function of a protein family. Our analysis of genomic contexts using MicrobesOnline and STRING did not show any clearly recurrent associations with other genes that might give a hint towards function.
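The idea of a "recurrent association" can be made concrete with a toy calculation: for each genome, collect the families encoded near the gene of interest, then ask which families recur in a large fraction of genomes. The sketch below is an illustration under stated assumptions and is not the procedure used by MicrobesOnline or STRING; the function name, the 50% threshold and the family labels are all hypothetical.

```python
from collections import Counter


def recurrent_neighbors(neighborhoods, min_fraction=0.5):
    """Return neighbor families seen in at least min_fraction of genomes.

    neighborhoods maps a genome label to the family labels encoded
    near the query gene in that genome (hypothetical input format).
    """
    counts = Counter()
    for families in neighborhoods.values():
        # Count each family at most once per genome.
        counts.update(set(families))
    total = len(neighborhoods)
    if total == 0:
        return []
    return sorted(
        family for family, seen in counts.items()
        if seen / total >= min_fraction
    )


if __name__ == "__main__":
    # Invented example data; labels do not refer to real families.
    toy = {
        "genome_1": ["famA", "famB"],
        "genome_2": ["famA", "famC"],
        "genome_3": ["famD"],
    }
    print(recurrent_neighbors(toy))
```

An empty result from such a count, as with three genomes that share no neighbor family, corresponds to the situation reported here, where no neighbor recurs often enough to hint at function.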
In this work we present the novel structure of the lpg2210 protein from the bacterium Legionella pneumophila. The structure confirms the domain organization that was inferred through careful sequence analysis in the Pfam database. The N-terminal domain was found to share structural similarity to a variety of peptidase-associated domains. Based on these structural similarities we suggest that the lpg2210 protein is a part of a multiple-component enzyme, possibly pairing with a catalytic partner. Our analysis is also suggestive of a role for the YARHG domain in binding a moiety in proximity with peptidoglycan. This could be a hydrophobic outer membrane lipid or lipopolysaccharide. In this capacity it might help anchor a variety of extracellular peptidase domains to the cell-surface.
We are grateful to the Sanford Burnham Medical Research Institute for hosting the DUF annotation jamboree in June 2013, which allowed the authors to collaborate on this work. We would like to thank all the participants of this workshop for their intellectual contributions to this work, who, in addition to the authors, were: Padmaja Natarajan, Marco Punta, Neil Rawlings, Daniel Rigden, Mayya Sedova, Anna Sheydina, John Wooley. We thank the members of the JCSG high-throughput structural biology pipeline for their contribution to this work.
Wellcome Trust (grant numbers WT077044/Z/05/Z); Howard Hughes Medical Institute; Work by LA is supported by the intramural funds of the National Library of Medicine, USA.; NIH (R01GM101457); This work was supported in part by National Institutes of Health Grant U54 GM094586 from the NIGMS Protein Structure Initiative to the Joint Center for Structural Genomics. The DUF annotation jamboree was supported by National Science Foundation (IIS-0646708 and IIS-1153617); Portions of this research were carried out at the Stanford Synchrotron Radiation Lightsource, a Directorate of SLAC National Accelerator Laboratory and an Office of Science User Facility operated for the U.S. Department of Energy Office of Science by Stanford University. The SSRL Structural Molecular Biology Program is supported by the DOE Office of Biological and Environmental Research, and by the National Institutes of Health, National Institute of General Medical Sciences (including P41GM103393). Work by AGM was supported by the UK Medical Research Council [MC_U105192716]. The contents of this publication are solely the responsibility of the authors and do not necessarily represent the official views of NIGMS, NCRR or NIH.
- Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, Heger A, Holm L, Sonnhammer ELL, Eddy SR, Bateman A, Finn RD: The Pfam protein families database. Nucleic Acids Res. 2012, 40: D290-D301. 10.1093/nar/gkr1065.
- Coggill P, Bateman A: The YARHG domain: an extracellular domain in search of a function. PLoS ONE. 2012, 7: e35575. 10.1371/journal.pone.0035575.
- Edwards RL: Metabolic cues and regulatory proteins that govern Legionella pneumophila differentiation and virulence. PhD thesis. 2008, University of Michigan, Cellular and Molecular Biology Department.
- Holm L, Rosenström P: Dali server: conservation mapping in 3D. Nucleic Acids Res. 2010, 38: W545-W549. 10.1093/nar/gkq366.
- Thunnissen MM, Nordlund P, Haeggström JZ: Crystal structure of human leukotriene A(4) hydrolase, a bifunctional enzyme in inflammation. Nat Struct Biol. 2001, 8: 131-135. 10.1038/84117.
- Andreeva A, Murzin AG: Structural classification of proteins and structural genomics: new insights into protein folding and evolution. Acta Crystallogr Sect F Struct Biol Cryst Commun. 2010, 66: 1190-1197. 10.1107/S1744309110007177.
- Ye Y, Godzik A: Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics. 2003, 19 (Suppl 2): ii246-ii255. 10.1093/bioinformatics/btg1086.
- Anantharaman V, Aravind L: Application of comparative genomics in the identification and analysis of novel families of membrane-associated receptors in bacteria. BMC Genomics. 2003, 4: 34. 10.1186/1471-2164-4-34.
- Consortium UP: The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res. 2009, 37: D169-D174.
- Aas JA, Paster BJ, Stokes LN, Olsen I, Dewhirst FE: Defining the normal bacterial flora of the oral cavity. J Clin Microbiol. 2005, 43: 5721-5732. 10.1128/JCM.43.11.5721-5732.2005.
- Hallam SJ, Putnam N, Preston CM, Detter JC, Rokhsar D, Richardson PM, DeLong EF: Reverse methanogenesis: testing the hypothesis with environmental genomics. Science. 2004, 305: 1457-1462. 10.1126/science.1100025.
- Snel B, Bork P, Huynen MA: The identification of functional modules from the genomic association of genes. Proc Natl Acad Sci USA. 2002, 99: 5890-5895. 10.1073/pnas.092632599.
- Dehal PS, Joachimiak MP, Price MN, Bates JT, Baumohl JK, Chivian D, Friedland GD, Huang KH, Keller K, Novichkov PS, Dubchak IL, Alm EJ, Arkin AP: MicrobesOnline: an integrated portal for comparative and functional genomics. Nucleic Acids Res. 2010, 38: D396-D400. 10.1093/nar/gkp919.
- Franceschini A, Szklarczyk D, Frankild S, Kuhn M, Simonovic M, Roth A, Lin J, Minguez P, Bork P, Mering C, Jensen LJ: STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 2013, 41: D808-D815. 10.1093/nar/gks1094.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.