- Open Access
Towards bioinformatics assisted infectious disease control
© Sintchenko et al; licensee BioMed Central Ltd. 2009
- Published: 5 February 2009
This paper proposes a novel framework for bioinformatics assisted biosurveillance and early warning to address the inefficiencies in traditional surveillance as well as the need for more timely and comprehensive infection monitoring and control. It leverages on breakthroughs in rapid, high-throughput molecular profiling of microorganisms and text mining.
This framework combines the genetic and geographic data of a pathogen to reconstruct its history and to identify the migration routes through which the strains spread regionally and internationally. A pilot study of Salmonella typhimurium genotype clustering and temporospatial outbreak analysis demonstrated better discrimination power than traditional phage typing. Half of the outbreaks were detected in the first half of their duration.
The microbial profiling and biosurveillance focused text mining tools can enable integrated infectious disease outbreak detection and response environments based upon bioinformatics knowledge models and measured by outcomes including the accuracy and timeliness of outbreak detection.
- Microbial Profile
- Outbreak Detection
- Infectious Disease Control
- MLVA Type
- Panton Valentine Leucocidin
Current early warning systems successfully detect large outbreaks but show significant delays and low sensitivity in detecting moderate and small epidemics [1, 2]. It is also estimated that a one-week delay in the implementation of control measures for SARS resulted in a 2.6-fold increase in the mean epidemic size and a four-week extension of the mean epidemic duration .
The body of knowledge on the molecular profiles and epidemiology of pathogens accumulated in databases and biomedical literature has been also expanding. We have argued that bioinformatics pathogen profiling can help to predict patient outcomes, identify markers that can be applied for early diagnosis and enable tailored interventions, commonly referred to as "personalized medicine" . We propose in this paper a novel framework for bioinformatics assisted biosurveillance and therapy that addresses the inefficiencies in traditional public health surveillance as well as the need for more timely and comprehensive infectious disease monitoring and control. It leverages from recent breakthroughs in high-throughput molecularprofiling of microorganisms and text mining as well as upon the growing electronic sources of knowledge about the molecular epidemiology of pathogens with epidemic potential.
A pathogen profile can be considered as a single multivariate observation or a set of observations, comprised of classes of specific attributes, which are designed for a comparison with a knowledgebase. In contrast to traditional subtyping based on phenotypic characteristics, such as serotype, biotype, phage type or antibiogram (susceptibility to antimicrobials), genetic profiling describes the phenotypic potential within the nucleic acid sequence. Sequence-based typing on one hand and randomly amplified polymorphic DNA, plasmid fingerprinting or pulsed-field gel electrophoresis (PFGE) on the other, can be viewed as examples of direct and indirect methods of assessing nucleic acid sequence, respectively. They provide both strain typing and key phylogenic data .
Methods based on repetitive sequences present at several loci in the genome have been developed with the potential to generate consistent pathogen profiles that are amenable to standardization and database cataloging. Genotyping systems based on comparison of sizes and numbers of different DNA fragments separated by gel electrophoresis express their results in continuous values which are less reliable than direct sequence-based methods, due to the lack of precision and reproducibility. Typing systems that use markers with categorized values (e.g., multilocus sequence typing, MLST) are intrinsically highly reproducible and more appropriate for library typing. Typing techniques like MLST that allow classification consistent with phylogenetic relationships are useful tools for studying global epidemiology. However, techniques that provide real-time or near real-time typing capability are likely to provide the most relevant information and therefore have the most impact.
The emergence of multi-resistant Staphylococcus aureus (MRSA) as a significant cause of hospital- and community-acquired morbidity and mortality has prompted calls for better surveillance systems supported by analysis of the molecular epidemiology of MRSA . Different genotypes produce many virulence factors associated with certain severe infections and caused by MRSA clones distributed worldwide. Accumulating evidence suggests associations between specific MRSA genotypes, clinical syndromes and patient outcomes. Specifically, super-antigenic exotoxins appear to be major virulence factors in hospital MRSA clones, and enterotoxin A may be involved in the septic shock. Panton Valentine Leucocidin (PV) has emerged as a virulence factor in community-acquired MRSA infections responsible for the high mortality associated with necrotizing pneumonia .
Genotyping assisted infectious disease control
The searches and analyses are conducted on two levels. First, genotypes or pathogen profiles are compared with local databases to detect recent transmission or to predict therapy outcomes. Then, global search is performed to collect the information about temporal and spatial distribution of genotypes or profiles obtained, in order to discover transmission pathways and enhance clinical policies. Once a profile is constructed, it can be matched with those of others or with existing datasets using similarity measures and clustering techniques. Probabilistic or information theoretic distance functions and clustering algorithms have been applied to match microbial profiles .
Genomic profiling of pathogens, knowledge representation and text mining
The majority of new genes and genomic markers are found in GenBank, while reports of the occurrence of known genes in new locations or their phenotypic characteristics are available from either Internet sources or peer-reviewed manuscripts. The presence or absence of particular genes or genetic elements can now be indicated by DNA or RNA sequencing-based typing systems which produce binary or octal representations. Information extraction will be an increasingly important measure for locating and assembling critical information relevant to the surveillance of infectious disease outbreaks in a timely fashion. There is already substantial prior work that aims to extract information from news articles reporting outbreaks, applied to online sources such as blogs and news feeds . Text mining applications for detecting genetic mutations have also recently emerged . In order to track the spread of pathogens and their various strains, we envision the mining of PubMed publications. Automated search agents and text mining utilities would be able to monitor new publications and check for say new strains of antibiotic resistance, tracking behavior over time.
Biosurveillance pilot study
Near real time Salmonella surveillance was employed to test the framework and to further explore the potential of molecular typing-based early warning for outbreaks of diseases with epidemic potential. In early 2006, the Centre for Infectious Diseases & Microbiology, The University of Sydney, Westmead Hospital introduced routine Salmonella typhimurium (STM) fingerprinting. Specifically, clonally related strains were identified by multi-locus variable number tandem repeat analysis (MLVA) which was shown to be superior to gel-based typing for both surveillance and outbreak investigations of STM . MLVA is based on the identification of short sequence repeats that vary in copy number in the microbial genome at various loci. MLVA detects polymorphisms at five different sites in the STM genome. Four regions of detection are on the bacterial chromosome and one is located on the serotype specific plasmid pSLT. MLVA has high discriminatory power within clonal species and appears to be more rapid and more amenable to standardization than pulse-field gel electrophoresis for both surveillance and outbreak investigations of STM .
Early warning for population health and infection control
Microbial genotyping improves the efficiency of outbreak investigations as it confirms or refutes epidemiological links among cases and between cases and potential environmental sources, thus triggering public health investigations. In our pilot, two types of early warning were considered: frequency alarm and the new MLVA type alarm. However, the utility of pathogen profiling goes beyond specific questions related to the investigation of possible outbreaks. It can be used also for disease monitoring, by identifying the transmission events and the associations between microbial types and clinical outcomes. Molecular profiling can assist in the assessment of the reproductive rates of infection during epidemics, in making infection control policies more pathogen-specific . Molecular typing also facilitates the detection of chains and patterns of infection transmission and the construction of epidemic trees.
More bacterial genomes are being sequenced by the growing number of laboratories and genomic data stored and annotated online. Public databases such as MLSTNet http://www.mlst.net, PulseNet http://www.cdc.gov/pulsenet/ and the BioPortal http://www.bioportal.org, among others, allow access and matching of bacterial or viral isolates [14, 15]. Linking systematically annotated profiles with clinical and research databases can identify previously unrecognized associations between phenotype, genotype, environment and host responses and can potentially identify the specific genes that govern them. Networks, which are created by relationships among phenotype, disease expression, environment, experimental context and associated genes with differential expression, could provide new insights into microbial interactions and pathogenesis [3, 14]. This approach has been fruitful in metagenomics, and information management systems, capable of assisting with genotyping or functional genomics are being developed.
A particularly interesting prospect of this framework is the integration of molecular typing with epidemiological information, potentially achieving the global real-time epidemiological surveillance of pathogens with epidemic potential. Our MLVA pilot relied on a single database source for strain comparisons. However, surveillance typing networks such as SeqNet or second-generation PulseNet have improved the speed of outbreak detection and made possible the exchange of data between laboratories. In such 'on-line' surveillance systems, novel and previously characterized strains can be compared and grouped by cluster analysis. Spatial surveillance using emerging geographic information systems enhances the likelihood that even localized events will be detected and their extent and variables measured in space and time . The output from these systems ultimately needs to be integrated into clinical and diagnostic processes.
The systematic construction of pathogen profiles would enable data to be integrated and shared, which is essential for successful surveillance and disease management. The heterogeneity of pathogen, host and environment imply that any individual marker is rarely sufficient to reflect the clinical and epidemiological complexity and ensure a reliable prediction of outcomes. However, a combination of markers is more likely to do so. Current genomic technology can rapidly and reliably produce this type of data from large numbers of strains to monitor routes of transmission or contribute to inventories of microbial gene polymorphisms and translate them into innovative typing procedures. Selective gene profiling can be directly linked to phenotypic characteristics such as virulence and antimicrobial resistance.
Antimicrobial therapy optimization
Innovative bioinformatics tools that predict drug resistance or response to therapy from genotype have been developed to provide clinical decision support. These tools use either a statistical approach, in which the inferred model and prediction are treated as regression problems, or machine learning algorithms, in which the model is addressed as a classification problem. A statistical learning approach to the ranking of therapeutic choices often relies on a direct correlation between the baseline microbial profile, the therapeutic decision and the response to treatment e.g. expected reduction in viral load resulting from anti-HIV combination therapy . It highlights the need for databases that can be widely shared, and allow the correlation of quality-controlled data from genotypic resistance assays and treatment regimens with short- and long-term clinical outcomes.
The bioinformatics assisted infectious disease control framework targets questions of immediate public health and clinical utility and utilizes the molecular genotyping of biothreats in a clinically focused manner. The concept of pathogen profiling described here provides a framework for data integration and sharing to ensure that the flood of data from new molecular technologies will be used effectively in public health surveillance and disease management. The identification of new sets of molecular markers and the development of biosurveillance focused text mining tools will enable innovative integrated infectious disease outbreak detection and response environments based upon bioinformatics knowledge models and measured by outcomes including the accuracy and timeliness of outbreak detection and response.
Octal code profiles were assigned to each isolate for comparison and spatio-temporal clustering. A genotyping cluster was defined as a maximal set of at least four isolates that share the same MLVA genotype, among a set of isolates from infected patients, each with an associated date and location (e.g., collection date and patient's residential address) . This definition was tested retrospectively using the space-time permutation scan statistic method with default values for bonds on cluster size (50% of population at risk and 50% of study period).
The system was implemented as both a stand-alone and as a web application with Map Objects (ESRI) GIS for map visualisation. MLVA patterns were matched using hierarchical cluster analysis. Minimum spanning trees analysis was also used to overcome the potential loss of information introduced by the latter.
The authors thank Drs Q Wang, JR Iredell, L Hiley and GL Gilbert for their expert advice and providing genotyping data. Funding from the Australian Research Council and National Health & Medical Research Council (Grants 358351, 457472, 457122) and from a Capacity Building Infrastructure Grant from the New South Wales Department of Health is acknowledged.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 2, 2009: Selected Proceedings of the First Summit on Translational Bioinformatics 2008. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S2.
- Dato V, Wagner MM, Fapohunda A: How outbreaks of infectious disease are detected: a review of surveillance systems and outbreaks. Public Health Rep 2004, 119: 464–471. 10.1016/j.phr.2004.07.003PubMed CentralView ArticlePubMedGoogle Scholar
- Bravata DM, McDonald KM, Smith WM, Rydzak C, et al.: Systematic Review: Surveillance systems for early detection of bioterrorism-related diseases. Emerg Infect Dis 2004, 10: 100–8.PubMed CentralView ArticlePubMedGoogle Scholar
- Sintchenko V, Iredell JR, Gilbert GL: Genomic profiling of pathogens for disease management and surveillance. Nat Microbiol Rev 2007, 5: 464–70. 10.1038/nrmicro1656View ArticleGoogle Scholar
- Agius P, Kreiswirth BN, Naidich S, Bennett KP: Typing Staphylococcus aureus using the spa gene and novel distance measures. IEEE Trans Comp Biol Bioinform 2007, 4(4):693–704. 10.1109/tcbb.2007.1053View ArticleGoogle Scholar
- Tristan A, Ferry T, Durand G, Dauwalder O, et al.: Virulence determinants in community and hospital methicillin-resistant Staphylococcus aureus . J Hosp Infect 2007, 65(Suppl 2):105–9. 10.1016/S0195-6701(07)60025-5View ArticlePubMedGoogle Scholar
- Mellmann A, Friedrich AW, Rosenkotter R, Rothganger J, Karch H, Reintjes R, Harmsen D: Automated DNA sequence-based early warning system for the detection of methicillin-resistant Staphylococcus aureus outbreaks. PLoS Med 2006, 3: e3. 10.1371/journal.pmed.0030033View ArticleGoogle Scholar
- McDonald CJ, Haff SM, Suico JG, Hill G, Leavelle D, Aller R, et al.: LOINC, a Universal Standard for Identifying Laboratory Observations: A 5-year update. Clin Chem 2003, 49: 624–633. 10.1373/49.4.624View ArticlePubMedGoogle Scholar
- Sax U, Scmidt S: Integration of genomic data in electronic health records. Methods Inf Med 2005, 44: 546–50.PubMedGoogle Scholar
- Grishman R, Huttunen S, Yangarber R: Information extraction for enhanced access to disease outbreak reports. J Biomed Inform 2002, 35: 236–46. 10.1016/S1532-0464(03)00013-3View ArticlePubMedGoogle Scholar
- Caporaso JG, Baumgartner WA Jr, Randolph DA, Cohen DA, Hunter L: MutationFinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics 2007, 23: 1862–1865. 10.1093/bioinformatics/btm235PubMed CentralView ArticlePubMedGoogle Scholar
- Kersey P, Apweiler R: Linking publication, gene and protein data. Nature Cell Biol 2006, 8: 1183–9. 10.1038/ncb1495View ArticlePubMedGoogle Scholar
- Lindstedt BA, Varlund T, Aas L, Kapperud G: Multiple-locus variable-number tandem-repeats analysis of Salmonella enterica subsp. enterica serovar Typhimurium using PCR multiplexing and multicolor capillary electrophoresis. J Microbiol Methods 2004, 59: 163–72. 10.1016/j.mimet.2004.06.014View ArticlePubMedGoogle Scholar
- Torpdahl M, Sorensen G, Lindstedt B, Moller Nielsen E: Tandem repeat analysis for surveillance of human Salmonella Typhimurium infections. Emerg Infect Dis 2007, 13(3):388–395.PubMed CentralView ArticlePubMedGoogle Scholar
- Chan M-S, Maiden MCJ, Spratt BG: Database-driven multi locus sequence typing (MLST) of bacterial pathogens. Bioinformatics 2001, 17: 1077–1083. 10.1093/bioinformatics/17.11.1077View ArticlePubMedGoogle Scholar
- Swaminathan B, Gerber-Smidt P, Ng L-K, Lukinmaa S, Kam K-M, Rolando S, et al.: Building PulseNet International: An interconnected system of laboratory networks to facilitate timely public health recognition and response to foodborne disease outbreaks and emerging foodborne diseases. Foodborne Path Dis 2006, 3: 36–50. 10.1089/fpd.2006.3.36View ArticlePubMedGoogle Scholar
- Revesz P, Wu S: Spatiotemporal reasoning about epidemiological data. Artif Intell Med 2006, 38: 157–70. 10.1016/j.artmed.2006.05.001View ArticlePubMedGoogle Scholar
- Lengauer T, Sing T: Bioinformatics-assisted anti-HIV therapy. Nat Rev Microbiol 2006, 4: 790–797. 10.1038/nrmicro1477View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.