- Open Access
Data mining tools for Salmonella characterization: application to gel-based fingerprinting analysis
© Zou et al.; licensee BioMed Central Ltd. 2013
- Published: 9 October 2013
Pulsed field gel electrophoresis (PFGE) is currently the most widely and routinely used method by the Centers for Disease Control and Prevention (CDC) and state health labs in the United States for Salmonella surveillance and outbreak tracking. Major drawbacks of commercially available PFGE analysis programs have been their difficulty in dealing with large datasets and the limited availability of analysis tools. There exists a need to develop new analytical tools for PFGE data mining in order to make full use of valuable data in large surveillance databases.
In this study, a software package was developed comprising five bioinformatics approaches explored and implemented for the analysis and visualization of PFGE fingerprints. The approaches include PFGE band standardization, Salmonella serotype prediction, hierarchical cluster analysis, distance matrix analysis and two-way hierarchical cluster analysis. PFGE band standardization makes cross-group analysis of large datasets possible. The Salmonella serotype prediction approach allows users to predict serotypes of Salmonella isolates based on their PFGE patterns. The hierarchical cluster analysis approach can be used to clarify subtypes and phylogenetic relationships among groups of PFGE patterns. The distance matrix and two-way hierarchical cluster analysis tools allow users to directly visualize the similarities and dissimilarities of any two individual patterns and the inter- and intra-serotype relationships of two or more serotypes, and provide a summary of the overall relationships between user-selected serotypes as well as the distinguishing band markers of these serotypes. The functionalities of these tools were illustrated using PFGE fingerprinting data from CDC's PulseNet.
The bioinformatics approaches included in the software package developed in this study were integrated with the PFGE database to enhance the data mining of PFGE fingerprints. Fast and accurate prediction makes it possible to elucidate Salmonella serotype information before conventional serological methods are pursued. The development of bioinformatics tools to distinguish the PFGE markers and serotype specific patterns will enhance PFGE data retrieval, interpretation and serotype identification and will likely accelerate source tracking to identify the Salmonella isolates implicated in foodborne diseases.
- Data mining
- bioinformatics tools
- data analysis.
Food safety remains an important concern, due in part to the globalization of the food supply, and foodborne illnesses create an important public health burden in the United States. CDC data indicate that nearly 48 million people become ill, 128,000 are hospitalized, and 3,000 die from foodborne illnesses each year, and non-typhoidal Salmonella enterica is one of the leading causes of illness among the top 31 known foodborne pathogens. The characteristics of Salmonella infections have changed over time, including changes in the frequency of antimicrobial-resistant Salmonella subtypes implicated and the frequency of different serotypes among isolates associated with human infections.
Multiple phenotypic and genotypic Salmonella subtyping methods have been developed to efficiently detect cases of human salmonellosis. These methods include traditional phenotype-based approaches such as serotyping; genotype-based methods such as Pulsed Field Gel Electrophoresis (PFGE) [3, 5]; and DNA sequence-based methods including DNA microarray analysis, multi-locus sequence typing (MLST) [6, 7], multi-locus variable-number tandem repeat analysis (MLVA) [8, 9] and next-generation sequencing (NGS) [10–14]. Each of the subtyping approaches has been applied in Salmonella outbreak strain identification and source tracking; however, each has its own strengths and weaknesses in terms of sensitivity, cost, speed, and robustness.
Large amounts of molecular subtyping data have been generated by academia, private companies and government agencies. As new technologies develop, it is anticipated that new analytical methods will be applied more often in combination with conventional assays to characterize and subtype foodborne isolates, thereby enhancing the current food safety and regulatory science paradigm. Given the large amount of emerging data and technologies, major challenges include data management, storage, analysis and retrieval, and how to connect and communicate data generated by the various subtyping methods. Data mining seeks to find new and interesting patterns and relationships in huge amounts of data. It involves bioinformatics approaches that combine biological data with computational tools and statistical methods to analyze, summarize and transform data into useful information to improve food safety. Such a systematic approach facilitates the extraction and correlation of patterns of knowledge that are implicit in the stored databases.
PFGE is currently the molecular subtyping method most widely and routinely used by CDC and state health labs in the US for Salmonella surveillance and outbreak investigation. Although PFGE provides less detailed genetic information than NGS and other DNA sequence-based methods, it has been used successfully for over twenty years to type Salmonella from human patients, foods, and food animal sources because of its discriminatory power, low cost and high reproducibility [3, 5]. PulseNet (http://www.cdc.gov/pulsenet), the CDC's molecular surveillance network for foodborne infections, has the largest and richest Salmonella subtyping database in the world, storing more than 350,000 PFGE patterns of more than 500 serotypes since 1996. Data mining of this valuable database will provide resources to study the ecology, epidemiology, transmission, and evolution of the emerging Salmonella serotypes.
Several commercial software applications have been used to analyze PFGE data, such as BioNumerics (Applied Maths, Inc., Austin, TX), GelCompar II (Applied Maths, Kortrijk, Belgium) and Fingerprinting II version 3 (Bio-Rad, Hercules, USA). BioNumerics is the default software in the PulseNet standard protocol [18, 19] and has been widely used in PulseNet participating laboratories and other public health laboratories that perform PFGE subtyping of bacterial foodborne pathogens for surveillance and outbreak investigations. These programs are currently used to analyze PFGE gel images and generate dendrograms that cluster PFGE patterns from different strains of foodborne pathogens. Apart from cluster analysis, no other methodologies or commercial tools are applicable to PFGE data, which limits the use of this subtyping technology in understanding the genetic diversity of foodborne bacteria. In addition, BioNumerics and other software are limited in the number of samples they can handle (fewer than 20,000 patterns for BioNumerics), which is an obstacle to meta-analysis and data mining of PFGE data.
In this study, in order to systematically investigate and characterize PFGE patterns of Salmonella isolates, the BACPAK knowledgebase was created and systematic approaches were assembled into a functional software package for PFGE data mining. The approaches include PFGE band standardization, Salmonella serotype prediction, hierarchical cluster analysis, distance matrix analysis and two-way hierarchical cluster analysis. The development of this software package and the application of its approaches provide a better understanding of Salmonella genetic diversity and epidemiology, and contribute to PFGE-based characterization and surveillance of Salmonella isolates in outbreak investigations.
Bacterial pathogen knowledgebase (BACPAK) construction and PFGE database
Table 1. The data composition in BACPAK and the Salmonella PFGE fingerprints database. (Table body not recovered. Surviving labels: antimicrobial susceptibility test; antimicrobial resistance gene PCR; plasmid sequence information; number of patterns; Paratyphi B var. L(+) tartrate+; Typhimurium var. 5-.)
PFGE band standardization
Before analysis with the developed tools, the bands of all PFGE patterns should be normalized. For example, when using the Salmonella serotype prediction tool, the bands of the tested Salmonella isolates should be normalized to the band classes stored within the database, which are used to develop the training sets for the prediction tools. To accomplish this band normalization, the NCTR fixed band method was implemented to standardize the band classes for cross-group analysis. In this method, the mean of the sizes of two adjacent bands in the training data is used as the cutoff for normalizing the corresponding bands of each new sample. As an example, assume that the training data have a set of descending bands sized s1, s2, s3, s4, ..., and the test sample consists of descending bands t1, t2, t3, t4, .... If t1 ≤ (s1+s2)/2, t1 is normalized to s2, and if t1 > (s1+s2)/2, it is adjusted to s1. A total of 39,830 PFGE patterns were band standardized and stored in the database.
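The midpoint rule described above can be sketched in a few lines of Python. This is an illustration only, not the authors' R implementation: the function name normalize_bands and the example band sizes are hypothetical, and the handling of test bands that fall outside the range of the standard band classes (snapping to the nearest end class) is an assumption that the text does not specify.

```python
def normalize_bands(test_bands, standard_bands):
    """Snap each test band to a standard band class using midpoint cutoffs.

    A test band t lying between adjacent standards s_hi > s_lo is assigned to
    s_lo when t <= (s_hi + s_lo) / 2 and to s_hi otherwise. Choosing the
    nearest standard, with ties going to the smaller one, is equivalent.
    """
    normalized = []
    for t in test_bands:
        # Sort key: (distance to the standard, standard size).
        # The second element breaks exact-midpoint ties toward the smaller band.
        nearest = min(standard_bands, key=lambda s: (abs(s - t), s))
        normalized.append(nearest)
    return normalized


# Hypothetical band classes and test bands (sizes in kb).
standards = [600, 400, 300, 200]
sample = [520, 500, 310, 190]
print(normalize_bands(sample, standards))
```

In this example 520 lies above the 600/400 midpoint of 500 and is assigned to 600, while 500 sits exactly on the midpoint and is assigned to 400, matching the "≤" branch of the rule.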
Salmonella serotype prediction from PFGE fingerprints
Previous studies have reported two classification algorithms, Random Forest (RF) and Support Vector Machine (SVM), for predicting Salmonella serotypes from PFGE fingerprints [21, 24]. The scripts for the algorithms were based on the R packages "randomForest" and "e1071" (R version 2.12.1), respectively. Based on the prediction accuracies, the SVM algorithm was chosen and its scripts were implemented as a practical tool for Salmonella serotype prediction from PFGE fingerprints. The normalized database consisting of 39,830 patterns from 32 serotypes was used as the default standard and training set.
Hierarchical cluster analysis
The distances between any two standardized PFGE patterns were measured, and hierarchical cluster analysis was performed by the complete linkage method using the "hcluster" package in R. The scripts were converted to a computational tool provided in BACPAK.
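To make the complete linkage step concrete, the following is a minimal pure-Python sketch of agglomerative clustering on a precomputed distance matrix. It is not the R code used in BACPAK: the function name complete_linkage and the toy distances are hypothetical, and the naive search over all cluster pairs is meant for illustration rather than for tens of thousands of patterns.

```python
def complete_linkage(dist):
    """Agglomerative clustering with complete linkage.

    dist is a symmetric n x n matrix (list of lists) of pairwise distances.
    Returns the merges in order as (members_a, members_b, height) tuples.
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: the distance between two clusters is the
                # largest distance between any pair of their members.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Toy example: four patterns placed on a line at 0, 1, 5 and 7.
points = [0, 1, 5, 7]
toy_dist = [[abs(p - q) for q in points] for p in points]
for step in complete_linkage(toy_dist):
    print(step)
```

The merge heights returned by the function are the values a dendrogram would plot on its distance axis.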
Distance matrix development and two-way hierarchical cluster analysis
In the distance matrix analysis approach, scripts were written in R to calculate the Jaccard distance between PFGE patterns as a measure of the dissimilarity of inter- or intra-serotype patterns. Colors from blue to red indicate Jaccard distance values ranging from 0 to 1. The scripts were implemented as a tool to identify the differences and relationships among the various Salmonella patterns within specific serotypes and among the targeted serotypes.
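For two patterns represented as sets A and B of standardized band classes, the Jaccard distance is 1 - |A ∩ B| / |A ∪ B|. A short Python sketch of the distance and the resulting matrix is given below; the function names and band values are hypothetical, and treating two empty patterns as identical (distance 0) is an assumption made here to avoid division by zero.

```python
def jaccard_distance(bands_a, bands_b):
    """Jaccard distance between two PFGE patterns given as band collections."""
    a, b = set(bands_a), set(bands_b)
    union = a | b
    if not union:
        # Assumption: two patterns with no bands are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(patterns):
    """Symmetric matrix of pairwise Jaccard distances."""
    return [[jaccard_distance(p, q) for q in patterns] for p in patterns]


# Hypothetical standardized patterns (band classes in kb).
patterns = [
    [600, 400, 300],
    [600, 300, 200],
    [600, 400, 300],
]
for row in distance_matrix(patterns):
    print(row)
```

A value of 0 means the two patterns share all of their bands and 1 means they share none, which corresponds to the two ends of the blue-to-red color scale.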
In the two-way hierarchical cluster analysis, scripts were coded in R to calculate, for each target serotype, the average proportion of patterns with a band present at every designated band location. These proportions, ranging from 0 to 1, form the characteristic parameters of the serotype. Hierarchical cluster analysis with complete linkage was then applied to the dissimilarity between any two serotypes, calculated as the Euclidean distance between their characteristic parameters. The scripts were implemented to perform a two-way clustering analysis of the PFGE patterns, in which both serotypes and band locations were clustered according to dissimilarity measures to simultaneously identify the associations between serotypes and band locations.
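The characteristic parameters and the serotype-to-serotype distance can be illustrated with the Python sketch below. It is a simplified stand-in for the R scripts, covering only the profile and distance calculations and not the two-way clustering itself; the function names, serotype labels and band values are used purely for illustration.

```python
from collections import defaultdict
from math import sqrt


def serotype_profiles(labelled_patterns, band_classes):
    """Proportion of patterns per serotype that carry each band class.

    labelled_patterns is a list of (serotype, bands) pairs.
    Returns {serotype: [proportion for each entry of band_classes]}.
    """
    grouped = defaultdict(list)
    for serotype, bands in labelled_patterns:
        grouped[serotype].append(set(bands))
    return {
        serotype: [sum(b in p for p in pats) / len(pats) for b in band_classes]
        for serotype, pats in grouped.items()
    }


def euclidean(u, v):
    """Euclidean distance between two equal-length profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Hypothetical data: two patterns per serotype, four band classes (kb).
band_classes = [600, 400, 300, 200]
data = [
    ("Enteritidis", [600, 300]),
    ("Enteritidis", [600, 200]),
    ("Typhimurium", [400, 300]),
    ("Typhimurium", [400, 300]),
]
profiles = serotype_profiles(data, band_classes)
print(profiles)
print(euclidean(profiles["Enteritidis"], profiles["Typhimurium"]))
```

Clustering the rows of the profile table groups serotypes, and clustering its columns groups band locations, which is the sense in which the analysis is two-way.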
To begin to address the need for improved analytical tools for PFGE analysis, a software package consisting of the integrated data mining techniques and the PFGE database was established and stored within NCTR's BACPAK. BACPAK captures and stores data including antimicrobial susceptibility data, plasmid sequence data, PCR data on antimicrobial resistance genes and PFGE data (Table 1). The PFGE database consisted of 45,923 semi-randomly selected PFGE patterns submitted to PulseNet from 2005 to 2010 (Table 1). Based on the statistics of the Salmonella Annual Report 2006 and the Salmonella Annual Summary Tables 2009 from CDC, isolates from the 32 serotypes represented in the BACPAK database comprised more than 80% of all Salmonella reported over the past 14 years in the US.
The approach for PFGE band standardization
Band normalization is key to comparing patterns from different datasets. Because BioNumerics was unstable when handling more than 20,000 PFGE patterns, the implemented NCTR fixed band method was especially useful for large dataset analysis. It showed higher accuracies than the conventional BioNumerics fixed band method when used to normalize PFGE bands for Salmonella serotype prediction, and made meta-analysis possible to clarify the inter- and intra-serotype relationships in a large dataset. In addition, the NCTR fixed band method converted the gel-image band classes into digital parameters in the model, so the bands of future candidates can be normalized without uploading and saving the standard band classes in BioNumerics.
The prediction approach for Salmonella serotype prediction based on PFGE patterns
The prediction algorithm was developed as described previously to identify Salmonella serotypes based on their PFGE patterns [21, 24]. In these studies, the NCTR fixed band method coupled with SVM classification produced the highest average predictive accuracy for serotype determination (96.1%). Therefore, the SVM algorithm was coupled with the NCTR standardization method, and the R scripts were implemented as a computational prediction tool installed in BACPAK.
Table: Five selected test Salmonella isolates, the prediction results and the distinguished band markers identified by the two-way hierarchical cluster analysis tool for identification of five serotypes ("X" stands for band presence). (Table body not recovered. Surviving column headers: Test Salmonella isolates; Distinguished band markers (kb).)
The original prediction tools were developed by a supervised classification approach [21, 24]. This approach focused on studying the association between PFGE patterns and serotypes determined using traditional serological methods, and on applying the information learned from the training set as the rules for prediction in the test set. The prediction accuracy was measured by applying the prediction model built on the training set to a test set that emulates the population of future profiles to be analyzed. If the samples in the training set do not adequately represent the samples likely to be encountered in use, bias may occur. The training set used in these studies represents more than 80% of all the isolates reported to CDC; therefore, the prediction tool should be able to predict most Salmonella serotypes. As such, this tool should be especially useful for predicting the serotype of outbreak isolates before conventional methods are carried out in a laboratory. Refinement of the predictive tool is an ongoing effort as additional PFGE data become available and are incorporated into the training dataset to improve the prediction accuracies.
Hierarchical cluster analysis
Distinguishing serotype relationships: distance matrix and two-way hierarchical cluster analysis
The five functional tools were assembled and integrated into a software package for studying PFGE profiles to better understand the genetic diversity of Salmonella and other foodborne pathogens. The analysis tools included in the package allow the systematic analysis of PFGE data from various aspects and enable meta-analysis of PFGE profiles from large data sets. The software package is currently available in the NCTR internal BACPAK knowledgebase. BACPAK, as a general-purpose bioinformatics pipeline for foodborne pathogen analysis, will be a new addition to the FDA bioinformatics tools at http://www.fda.gov/ScienceResearch/BioinformaticsTools/default.htm.
Although NGS and other sequencing technologies are advancing rapidly in foodborne pathogen subtyping, PFGE is still the most widely used method to characterize Salmonella strains isolated from outbreaks. In the developed software package, PFGE band standardization normalizes the data for cross-group analysis of large datasets. The Salmonella serotype prediction tool based on PFGE patterns allows rapid and accurate prediction of the serotypes of outbreak isolates before conventional serological methods are pursued. It is also advantageous for characterizing an isolate that is serotyped as "unknown" by conventional methods, or for a laboratory where standard serotyping is not available. Hierarchical cluster analysis can be used to clarify the subsets of a group of PFGE patterns for source tracking and identification of outbreak isolates. Since Salmonella serotypes can be closely related in terms of their virulence and antimicrobial resistance profiles [17, 30–33], our distance matrix analysis and two-way hierarchical cluster analysis tools make it possible to study the relationships between phenotypes and genotypes of Salmonella isolates and to distinguish band markers and PFGE pattern diversity for serotype identification, especially in large datasets. In principle, these approaches could be applied to other gel-based analyses and other pathogens in the future. Combined with Salmonella genome sequencing data, the distinct serotype-specific patterns and bands may provide useful information on the distribution of serotypes in the population and potentially reduce the need for laborious analyses, such as traditional serotyping. In addition, the PFGE analysis tools in the software package are expected to support in silico pattern construction to match PFGE data with NGS data in future studies.
This work and the publication were funded by the Food Protection Plan of the FDA. We are grateful to Ms. Beth Juliar and Dr. Tzu-Pin Lu for critical reading of this manuscript.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 14, 2013: Proceedings of the Tenth Annual MCBIOS Conference. Discovery in a sea of data. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S14.
1. Scallan E, Hoekstra RM, Angulo FJ, Tauxe RV, Widdowson MA, Roy SL, Jones JL, Griffin PM: Foodborne illness acquired in the United States--major pathogens. Emerg Infect Dis. 2011, 17 (1): 7-15. 10.3201/eid1701.P11101.
2. CDC: Salmonella: annual summary, 2006. CDC, Division of Foodborne, Bacterial and Mycotic Diseases. Atlanta, GA.
3. Wattiau P, Boland C, Bertrand S: Methodologies for Salmonella enterica subsp. enterica Subtyping: Gold Standards and Alternatives. Appl Environ Microbiol. 2011, 77 (22): 7877-7885. 10.1128/AEM.05527-11.
4. Grimont PAD, Weill FX: Antigenic Formulae of the Salmonella Serovars. World Health Organization Collaborating Center for Reference and Research on Salmonella. 2007, Institut Pasteur, Paris, France, 9.
5. Kerouanton A, Marault M, Lailler R, Weill FX, Feurer C, Espie E, Brisabois A: Pulsed-field gel electrophoresis subtyping database for foodborne Salmonella enterica serotype discrimination. Foodborne Pathog Dis. 2007, 4 (3): 293-303. 10.1089/fpd.2007.0090.
6. Kidgell C, Reichard U, Wain J, Linz B, Torpdahl M, Dougan G, Achtman M: Salmonella typhi, the causative agent of typhoid fever, is approximately 50,000 years old. Infect Genet Evol. 2002, 2 (1): 39-45. 10.1016/S1567-1348(02)00089-8.
7. Stepan RM, Sherwood JS, Petermann SR, Logue CM: Molecular and comparative analysis of Salmonella enterica Senftenberg from humans and animals using PFGE, MLST and NARMS. BMC Microbiol. 2011, 11: 153. 10.1186/1471-2180-11-153.
8. Beranek A, Mikula C, Rabold P, Arnhold D, Berghold C, Lederer I, Allerberger F, Kornschober C: Multiple-locus variable-number tandem repeat analysis for subtyping of Salmonella enterica subsp. enterica serovar Enteritidis. Int J Med Microbiol. 2009, 299 (1): 43-51. 10.1016/j.ijmm.2008.06.002.
9. Chiou CS, Hung CS, Torpdahl M, Watanabe H, Tung SK, Terajima J, Liang SY, Wang YW: Development and evaluation of multilocus variable number tandem repeat analysis for fine typing and phylogenetic analysis of Salmonella enterica serovar Typhimurium. Int J Food Microbiol. 2010, 142 (1-2): 67-73. 10.1016/j.ijfoodmicro.2010.06.001.
10. Allard MW, Luo Y, Strain E, Li C, Keys CE, Son I, Stones R, Musser SM, Brown EW: High resolution clustering of Salmonella enterica serovar Montevideo strains using a next-generation sequencing approach. BMC Genomics. 2012, 13: 32. 10.1186/1471-2164-13-32.
11. Cao G, Zhao S, Strain E, Luo Y, Timme R, Wang C, Brown E, Meng J, Allard M: Draft genome sequences of eight Salmonella enterica serotype newport strains from diverse hosts and locations. J Bacteriol. 2012, 194 (18): 5146. 10.1128/JB.01171-12.
12. Lienau EK, Strain E, Wang C, Zheng J, Ottesen AR, Keys CE, Hammack TS, Musser SM, Brown EW, Allard MW: Identification of a salmonellosis outbreak by means of molecular sequencing. N Engl J Med. 2011, 364 (10): 981-982. 10.1056/NEJMc1100443.
13. Hoffmann M, Zhao S, Luo Y, Li C, Folster JP, Whichard J, Allard MW, Brown EW, McDermott PF: Genome sequences of five Salmonella enterica serovar Heidelberg isolates associated with a 2011 multistate outbreak in the United States. J Bacteriol. 2012, 194 (12): 3274-3275. 10.1128/JB.00419-12.
14. Timme RE, Allard MW, Luo Y, Strain E, Pettengill J, Wang C, Li C, Keys CE, Zheng J, Stones R: Draft genome sequences of 21 Salmonella enterica serovar enteritidis strains. J Bacteriol. 2012, 194 (21): 5994-5995. 10.1128/JB.01289-12.
15. Elkins CA, Kotewicz ML, Jackson SA, Lacher DW, Abu-Ali GS, Patel IR: Genomic paradigms for food-borne enteric pathogen analysis at the USFDA: case studies highlighting method utility, integration and resolution. Food Addit Contam Part A Chem Anal Control Expo Risk Assess. 2012.
16. Barrett TJ, Gerner-Smidt P, Swaminathan B: Interpretation of pulsed-field gel electrophoresis patterns in foodborne disease investigations and surveillance. Foodborne Pathog Dis. 2006, 3 (1): 20-31. 10.1089/fpd.2006.3.20.
17. Gerner-Smidt P, Hise K, Kincaid J, Hunter S, Rolando S, Hyytia-Trees E, Ribot EM, Swaminathan B: PulseNet USA: a five-year update. Foodborne Pathog Dis. 2006, 3 (1): 9-19. 10.1089/fpd.2006.3.9.
18. Swaminathan B, Barrett TJ, Hunter SB, Tauxe RV: PulseNet: the molecular subtyping network for foodborne bacterial disease surveillance, United States. Emerg Infect Dis. 2001, 7 (3): 382-389. 10.3201/eid0703.017303.
19. Ribot EM, Fair MA, Gautom R, Cameron DN, Hunter SB, Swaminathan B, Barrett TJ: Standardization of pulsed-field gel electrophoresis protocols for the subtyping of Escherichia coli O157:H7, Salmonella, and Shigella for PulseNet. Foodborne Pathog Dis. 2006, 3 (1): 59-67. 10.1089/fpd.2006.3.59.
20. Zou W, Chen HC, Hise KB, Tang H, Foley SL, Meehan J, Lin WJ, Nayak R, Xu J, Fang H: Meta-analysis of pulsed-field gel electrophoresis fingerprints based on a constructed Salmonella database. PLoS One. 2013, 8 (3): e59224. 10.1371/journal.pone.0059224.
21. Zou W, Lin WJ, Hise KB, Chen HC, Keys C, Chen JJ: Prediction system for rapid identification of Salmonella serotypes based on pulsed-field gel electrophoresis fingerprints. J Clin Microbiol. 2012, 50 (5): 1524-1532. 10.1128/JCM.00111-12.
22. Breiman L: Random Forests. Machine Learning. 2001, 45 (1): 5-32. 10.1023/A:1010933404324.
23. Vapnik V: The Nature of Statistical Learning Theory. 1995, New York: Springer.
24. Zou W, Lin WJ, Foley SL, Chen CH, Nayak R, Chen JJ: Evaluation of pulsed-field gel electrophoresis profiles for identification of Salmonella serotypes. J Clin Microbiol. 2010, 48 (9): 3122-3126. 10.1128/JCM.00645-10.
25. Murtagh F: Lectures in Computational Statistics: Multidimensional Clustering Algorithms (Compstat Lectures, No 4). 1985, Springer-Verlag.
26. Jaccard P: Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles. 1901, 37: 547-579.
27. Deza E, Deza MM: Encyclopedia of Distances. 2009, Springer.
28. CDC: Salmonella Annual Summary Tables 2009. CDC, Division of Foodborne, Bacterial and Mycotic Diseases. Atlanta, GA.
29. Wonderling L, Pearce R, Wallace FM, Call JE, Feder I, Tamplin M, Luchansky JB: Use of pulsed-field gel electrophoresis to characterize the heterogeneity and clonality of Salmonella isolates obtained from the carcasses and feces of swine at slaughter. Appl Environ Microbiol. 2003, 69 (7): 4177-4182. 10.1128/AEM.69.7.4177-4182.2003.
30. Gerner-Smidt P, Whichard JM: Foodborne disease trends and reports. Foodborne Pathog Dis. 2010, 7 (6): 609-611. 10.1089/fpd.2010.9998.
31. Zou W, Frye JG, Chang CW, Liu J, Cerniglia CE, Nayak R: Microarray analysis of antimicrobial resistance genes in Salmonella enterica from preharvest poultry environment. J Appl Microbiol. 2009, 107 (3): 906-914. 10.1111/j.1365-2672.2009.04270.x.
32. Zou W, Al-Khaldi SF, Branham WS, Han T, Fuscoe JC, Han J, Foley SL, Xu J, Fang H, Cerniglia CE: Microarray analysis of virulence gene profiles in Salmonella serovars from food/food animal environment. J Infect Dev Ctries. 2011, 5 (2): 94-105.
33. Frye JG, Lindsey RL, Meinersmann RJ, Berrang ME, Jackson CR, Englen MD, Turpin JB, Fedorka-Cray PJ: Related antimicrobial resistance genes detected in different bacterial species co-isolated from swine fecal samples. Foodborne Pathog Dis. 2011, 8 (6): 663-679. 10.1089/fpd.2010.0695.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.