MuTrack: a genome analysis system for large-scale mutagenesis in the mouse
© Baker et al 2004
Received: 16 October 2003
Accepted: 03 February 2004
Published: 03 February 2004
Modern biological research makes possible the comprehensive, high-throughput study and development of heritable mutations in the mouse model. Using techniques spanning genetics, molecular biology, histology, and behavioral science, researchers may examine, with varying degrees of granularity, numerous phenotypic aspects of mutant mouse strains directly pertinent to human disease states. Success of these and other genome-wide endeavors relies on a well-structured bioinformatics core that brings together investigators from widely dispersed institutions and enables them to seamlessly integrate data, observations and discussions.
MuTrack was developed as the bioinformatics core for a large mouse phenotype screening effort. It is a comprehensive collection of on-line computational tools and tracks thousands of mutagenized mice from birth through senescence and death. It identifies the physical location of mice during an intensive phenotype screening process at several locations throughout the state of Tennessee and collects raw and processed experimental data from each domain. MuTrack's statistical package allows researchers to access a real-time analysis of mouse pedigrees for aberrant behavior, and subsequent recirculation and retesting. The end result is the classification of potential and actual heritable mutant mouse strains that become immediately available to outside researchers who have expressed interest in the mutant phenotype.
MuTrack demonstrates the effectiveness of using bioinformatics techniques in data collection, integration and analysis to identify unique result sets that are beyond the capacity of a solitary laboratory. By employing the research expertise of investigators at several institutions for a broad-ranging study, the Tennessee Mouse Genome Consortium (TMGC) has amplified the effectiveness of any one consortium member. The bioinformatics strategy presented here lends future collaborative efforts a template for a comprehensive approach to large-scale analysis.
The rapid diversification of experimental techniques, expertise and public domain data has necessitated a shift away from the traditional institutionally-centric research paradigm. Indeed, an inclination towards comprehensive approaches to biological research on a genome-wide scale dictates that any one single institution may not contain the critical mass of physical and intellectual resources necessary to address certain broad biological questions. We describe herein an approach to this challenge that focuses on the creation of inter-institutional research teams that leverage existing internet technologies to bring together wide-ranging expertise in an efficient and effective analysis system.
While research teams are common at the institutional or local level, they seldom exist across several institutions, mostly for logistical reasons. Effective distributed collaborations require the implementation of an infrastructure that handles a fundamental array of information processes unique to non-local research communities. Researchers must have mechanisms for exhaustive electronic data storage, curation, and sharing. They must be permitted to make observations about the data and the experimental process, and they must have access to computational tools that assist in the extraction of new knowledge from the common warehouse of shared data. Concurrently, researchers in a distributed collaboration must find the bioinformatics core flexible enough to handle the immense diversity of information produced by modern experimental techniques, and structured enough to enforce machine-readable data types for future analysis. Finally, distributed data systems must meet ease-of-use requirements while simultaneously applying explicit control over who has access to data sets and observations.
The utility of employing the mouse as a model for human disease is well documented [6–9]. Traditional methods of site-directed in vivo mutagenesis are tedious and require prior knowledge of gene function and location. Alternative approaches, developed to induce primarily single base pair changes in a genome region of interest, are also effective at producing recessive and dominant heritable mutations in the mouse [12, 13] but lack the specificity of traditional approaches. As a result, any single mutation event may be silent or effective and may lie within a gene directing a visible phenotypic characteristic, a gene without phenotypic consequence, or in a non-coding region. In order to produce substantive phenotypic anomalies in large-scale germ-cell strategies, such as N-ethyl-N-nitrosourea (ENU) directed mutagenesis, the production and phenotypic classification of vast numbers of mouse pedigrees from birth through senescence and death is required.
The system implemented to satisfy this bioinformatics task is named MuTrack, and has evolved into the central mechanism that supports the functions of the broad based TMGC consortium. It resides as a collection of database-backed, on-line analysis tools capable of tracking mouse breeding schemes, the shipment of mutant mice throughout the consortium and the exchange of physical samples, ranging from sperm to histological sections. In total, it collects raw and processed data and observations from the twenty-two discrete phenotype testing domains and provides a real-time statistical analysis of possible phenodeviant mouse lineages based on the collected experimental data. It simultaneously allows member researchers to select mice for secondary and tertiary study to test mutant heritability and provides a means to distribute new mutant strains to researchers outside the collaboration. To date, it has aided in the successful identification of 75 new mutant mouse strains, and has screened more than 22,500 individual mice.
Successful development of heritable mouse mutations will contribute to our understanding of human disease states through the development of new mouse models. Of equal consequence, the implementation of a workable and collaborative data sharing architecture represents a significant advancement in the way researchers bring to bear comprehensive high-throughput analysis in biology's information rich environment.
MuTrack accepts two types of import formats: web-based forms or direct upload of text in a comma separated value (CSV) format. Husbandry information is generally accepted through web-based forms containing pre-calculated attributes where possible. Domain investigators may submit internet forms or use preformatted spreadsheets that parallel the Microsoft Excel paradigm. Immediately upon submission, these files are pipelined through an error checking process and uploaded into their respective database relations. The error checking process includes examination for proper formatting, data type constraints, and maintenance of the testing pipeline structure, ensuring the testing of mice in proper chronological order. Domain investigators may likewise search any information associated with their testing domain and export search results in CSV or tab-delimited formats. Image information collected via on-line means from the neural histology and eye cores may be exported in png format along with the dynamically generated statistical graphs associated with any mouse pedigree or testing domain.
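The error checking steps described above (formatting, data type constraints, and pipeline order) can be sketched in a few lines of Python. This is only an illustration: the column names, the domain names, and the `validate_upload` function are hypothetical, since the actual MuTrack upload schema and implementation are not given here.

```python
import csv
import io
from datetime import date

# Hypothetical column layout; the real MuTrack upload schema is not described.
REQUIRED_COLUMNS = ["mouse_id", "test_date", "domain", "value"]

# Hypothetical pipeline: a mouse is assumed to visit domains in this order.
PIPELINE_ORDER = ["behavior", "eye", "histology"]


def validate_upload(csv_text, last_domain_by_mouse):
    """Return a list of (line_number, message) problems found in a CSV upload."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Formatting check: the header must match the expected columns exactly.
    if reader.fieldnames != REQUIRED_COLUMNS:
        return [(1, "unexpected header: %r" % (reader.fieldnames,))]
    errors = []
    for row in reader:
        line = reader.line_num
        # Data type constraints: an ISO date and a numeric measurement.
        try:
            date.fromisoformat(row["test_date"])
        except (TypeError, ValueError):
            errors.append((line, "test_date is not an ISO date"))
        try:
            float(row["value"])
        except (TypeError, ValueError):
            errors.append((line, "value is not numeric"))
        # Pipeline structure: the domain must come after the last one visited.
        domain = row["domain"]
        if domain not in PIPELINE_ORDER:
            errors.append((line, "unknown domain"))
            continue
        previous = last_domain_by_mouse.get(row["mouse_id"])
        if previous is not None and (
            PIPELINE_ORDER.index(domain) <= PIPELINE_ORDER.index(previous)
        ):
            errors.append((line, "domain out of pipeline order"))
    return errors
```

An upload would be loaded into the database only when the returned list is empty; otherwise the line numbers and messages can be shown to the submitting investigator.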
MuTrack seamlessly integrates with the strong analysis tools in the SAS statistical system, allowing incorporation of more complex and highly appropriate data analysis into the simple user interface. Robust estimates of the population mean and standard deviation are calculated from pedigree means using SAS (Version 8.2) to eliminate contamination biases inherent to the detection of unknown mutants from a set of observations. The robust mean is obtained from the Univariate procedure with the Trim option set at 0.25. The Trim option is selected to reduce the influence of phenodeviant pedigrees on the mean and to reduce the movement of the estimates of central tendency as new mice are tested for each domain. By trimming the extremes (i.e. defined or suspected phenodeviants), the central data remaining should be a close, unbiased and robust representation of the "normal" mice, thus giving a more stable and accurate population to predict against. The robust standard deviation estimator, Median Absolute Deviation (MAD) sigma, is obtained using the SAS Univariate procedure robust option. The estimator is insensitive to the inflated variance that results when outlier pedigrees are present.
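For readers without access to SAS, the two robust estimates can be approximated with the Python standard library. The sketch below is not the SAS code used by MuTrack. It assumes that the trim proportion is removed from each tail, rounding the count up, and it takes MAD in its usual robust-statistics sense: the median absolute deviation about the median, multiplied by 1.4826 so that it estimates the standard deviation of normally distributed data. The function names are illustrative.

```python
import math
import statistics


def trimmed_mean(values, proportion=0.25):
    """Mean after dropping ceil(n * proportion) values from each tail."""
    ordered = sorted(values)
    k = math.ceil(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("trimming removed every observation")
    return statistics.fmean(kept)


def mad_sigma(values):
    """Median absolute deviation, rescaled to estimate a normal SD."""
    center = statistics.median(values)
    deviations = [abs(v - center) for v in values]
    return 1.4826 * statistics.median(deviations)


# Eight made-up pedigree means, one of them an obvious outlier.
pedigree_means = [10, 11, 12, 13, 14, 15, 16, 100]
robust_mean = trimmed_mean(pedigree_means)  # 13.5, unaffected by the 100
robust_sd = mad_sigma(pedigree_means)       # about 2.97
```

The ordinary mean of the same eight values is 23.875, which shows how strongly a single phenodeviant pedigree would otherwise shift the reference population.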
Each pedigree is averaged and measured against the trimmed population mean for distance. Outliers are flagged (highlighted) at plus or minus 1.645 SD (5% in each tail, 10% in total, under a normal distribution) and again at plus or minus 1.96 SD (2.5% in each tail, 5% in total) from the mean to alert investigators about the possibility of an outlier. Each investigator is expected to take these results and compare them against their own notes about the pedigree. Re-tests are called based on these results.
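The two-level flagging rule can be written as a small function. This is a sketch under the assumption that the distance is expressed in units of the robust standard deviation; the flag labels and the function name are illustrative, not MuTrack's.

```python
def flag_pedigree(pedigree_mean, robust_mean, robust_sd):
    """Classify a pedigree by its distance from the robust population mean."""
    if robust_sd <= 0:
        raise ValueError("robust_sd must be positive")
    distance = abs(pedigree_mean - robust_mean) / robust_sd
    if distance >= 1.96:
        return "strong"   # beyond the outer threshold
    if distance >= 1.645:
        return "mild"     # beyond the inner threshold only
    return "none"
```

For example, with a robust mean of 13.5 and a robust SD of 3.0, a pedigree mean of 18.5 lies about 1.67 SD away and receives the milder flag, while a pedigree mean of 20.0 lies about 2.17 SD away and receives the stronger one.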
Using the methods from above, an investigator may select a testing area of interest, based on experimental domains. Any pedigree with a field lying 1.645 SD or more from the mean is included in a two-way table with the appropriate cell highlighted. All scripts are batched on a weekly basis to provide a data management overview, but are also generated dynamically as users engage in database queries.
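A two-way table of this kind (pedigrees as rows, test fields as columns) might be assembled as follows. The input layout, a nested dictionary of signed distances in SD units, and the use of asterisks to stand in for highlighting are assumptions made for this sketch.

```python
def outlier_table(z_scores, threshold=1.645):
    """Build a pedigree-by-field table, keeping only pedigrees with a flagged cell.

    z_scores maps pedigree -> {field: signed distance from the mean in SD units}.
    Returns (column_names, rows), where each row is (pedigree, list_of_cells).
    """
    flagged = {
        pedigree: fields
        for pedigree, fields in z_scores.items()
        if any(abs(z) >= threshold for z in fields.values())
    }
    columns = sorted({name for fields in flagged.values() for name in fields})
    rows = []
    for pedigree in sorted(flagged):
        cells = []
        for name in columns:
            z = flagged[pedigree].get(name)
            if z is None:
                cells.append("")                # field not measured
            elif abs(z) >= threshold:
                cells.append("*%.2f*" % z)      # "highlighted" cell
            else:
                cells.append("%.2f" % z)
        rows.append((pedigree, cells))
    return columns, rows
```

Pedigrees with no field at or beyond the threshold are left out entirely, so the table stays short enough to review by eye.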
The data for a particular test in a domain are plotted in a histogram with a normal curve overlaid using the Capability procedure in SAS. Drop-down lines indicating 1 and 2 SD from mean are also included. This tool, along with normality assessment statistics generated using SAS Univariate procedure, alerts the investigator to non-normal data distributions and the presence of outliers. Specific data plots, such as those resulting from non-parametric analysis, may be done at a consortium member's request.
A tool has been added using the T-Test procedure to assess whether blindness in the 33TNK strain has any effect on testing. A list of mice known to be sighted or blind is used to create a dataset with which this comparison is made. Any domain that used any of these mice is eligible for this test.
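The comparison between sighted and blind mice is a two-sample t-test. The sketch below computes the unequal-variance (Welch) form of the statistic and its approximate degrees of freedom using only the Python standard library. Whether MuTrack reports the pooled or the unequal-variance result is not stated, so the choice of Welch's form here is an assumption, and the p-value lookup is omitted.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    # Squared standard errors of each sample mean.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df
```

A large absolute t value for a given test would suggest that blindness influences that measurement, in which case scores from blind mice should be interpreted separately.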
The aging data are evaluated using the Boxplot procedure. The Test Tables for Aging tool uses the plots to follow the growth of a pedigree across time. The investigator uses this to compare weight gain or loss relative to the "family" and to check for outliers. Another tool looks at all pedigrees within a family together at a particular age to identify outliers, as it is believed that slower growing mice live longer.
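The outlier check that a box plot supports visually can also be done numerically. The sketch below assumes the conventional rule of flagging values more than 1.5 interquartile ranges beyond the quartiles; the text does not say which box plot style MuTrack uses, and the function name is illustrative.

```python
import statistics


def boxplot_outliers(weights):
    """Return the values lying beyond 1.5 interquartile ranges of the quartiles."""
    q1, _median, q3 = statistics.quantiles(weights, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [w for w in weights if w < low or w > high]
```

Applied to the weights of every mouse in a family at one age, the returned values point to animals or pedigrees that grow unusually slowly or quickly.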
An exception to the open-source paradigm is the choice of database framework. The Oracle 8i DBMS comes with extensive redundancies that allow for seamless data recovery of edits or interrupted transactional processing resulting from hardware or software failure or operator error. Data clashes are prevented at the interface level as well as at the database level, ensuring that only one record exists for each data iteration. Log tables transparently save edited data, allowing the recovery of results edited in earlier sessions. MuTrack also implements intrinsic concurrency functions that search the database for duplicate or non-standard records.
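The combination of a uniqueness constraint and a log table can be illustrated with a trigger. The sketch below uses SQLite through Python's built-in `sqlite3` module rather than Oracle 8i, and the table and column names are hypothetical; it shows the general pattern, not MuTrack's schema.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a result table and its log table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE result (
            mouse_id  TEXT NOT NULL,
            test_name TEXT NOT NULL,
            value     REAL,
            PRIMARY KEY (mouse_id, test_name)  -- one record per mouse and test
        );
        CREATE TABLE result_log (
            mouse_id   TEXT,
            test_name  TEXT,
            old_value  REAL,
            changed_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
        -- Copy the previous value into the log before every edit.
        CREATE TRIGGER result_audit BEFORE UPDATE ON result
        BEGIN
            INSERT INTO result_log (mouse_id, test_name, old_value)
            VALUES (OLD.mouse_id, OLD.test_name, OLD.value);
        END;
    """)
    return conn
```

With this layout, updating a row in `result` leaves the earlier value recoverable from `result_log`, and an attempt to insert a second record for the same mouse and test raises `sqlite3.IntegrityError`.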
Extensive local expertise and experience in Oracle and SAS technologies contributed to the decision to avoid open-source alternatives such as PostgreSQL and R, respectively. We believe that a future implementation of a comparable system in a completely open-source environment is feasible.
Because MuTrack is available as a web-based platform, numerous considerations about internet navigation, security and accessibility were addressed. The site maintains a consistent look and feel designed around dynamically generated web-pages. A generalized view of each data representation is located within one click of the main page, and each relation is one click away from any other relation. When a user becomes familiar with one area of MuTrack they will, by similarity, be familiar with all areas. Computational tools that deal with statistical analysis and pages designed as areas for free-form textual observation are complex and require specific homepages one level down from the main MuTrack page. In these cases, web navigation is menu driven, allowing users to make very specific observations or drill down to a specific statistical test performed on particular mice or mouse pedigrees. Most areas within MuTrack are available to the public using a guest password, while specialized sites are limited to TMGC researchers in general and specific domain investigators in particular.
The TMGC mutagenesis project uses two distinct and well-identified breeding strategies that have been summarized in the recent literature [13, 21]. While both strategies differ in their molecular focus, they maintain the need to sustain large stocks of breeding mice for several generations. MuTrack begins the process of sample tracking by forcing technicians to input unique mouse information into a Mouse ID relation for mice of generation zero. Once mice exist within the system they are put on a mating schedule based on age and lineage. MuTrack tracks the removal of fertilized embryos from test-generation mice and manages their shipment and implantation into immunologically clean surrogate mothers located at a different institution. New mice are tracked through their Litter and are entered into the Mouse ID relation after Weaning.
During the breeding process sample tissues are often collected and stored for later analysis; the database must likewise account for destroyed mice. Hence, the Mouse Disposal and Tissue Sample tools reside within this domain and may be accessed by any privileged user anywhere within MuTrack. These represent integral processes in the highly structured chain-of-custody standards enforced at the interface and database levels. Indeed, the primary computational concern of the analysis pipeline is the location, status, and ownership of each mouse or tissue sample generated by the consortium. Adequate appraisal of this information provides project supervisors the ability to maintain a constant flow of animals through the testing domains, and reduces the amount of experimental data lost to logistical oversights.
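The chain-of-custody idea, in which every change of location or status is recorded, can be sketched as a small data structure. The field names, status values, and methods below are hypothetical and are not taken from the MuTrack Mouse ID relation.

```python
from dataclasses import dataclass, field


@dataclass
class Mouse:
    """Minimal custody record for one animal (illustrative fields only)."""
    mouse_id: str
    generation: int
    location: str
    status: str = "alive"
    # Each entry is (date, from_location, to_location or None for disposal).
    custody_log: list = field(default_factory=list)

    def ship(self, destination, on_date):
        """Move the mouse to another site and record the transfer."""
        if self.status != "alive":
            raise ValueError("cannot ship a mouse that is not alive")
        self.custody_log.append((on_date, self.location, destination))
        self.location = destination

    def dispose(self, on_date):
        """Record disposal; the mouse stays in the database for accounting."""
        self.custody_log.append((on_date, self.location, None))
        self.status = "disposed"
```

Because disposal changes the status rather than deleting the record, the location and fate of every animal remain answerable from the log.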
Table: Size and Scope of MuTrack Database (Number of Attributes; Number of Tuples; Number of Discrete Data Points)
The second concern addressed by MuTrack is that of data integrity and security. Once information has been submitted to the database it can only be removed by the database administrator. Investigators may edit individual data items, but MuTrack tracks updated data in mirror log relations, adding another layer to data recoverability. In addition, while other researchers and those entering the site using the public password have access to view and search data, only the primary domain investigator has permission to download, submit, delete or edit information.
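These access rules amount to a small permission table. The sketch below is one possible reading of the paragraph: viewing and searching are open to everyone, changes to the data are limited to the primary investigator of the domain that owns the data, and permanent removal is reserved for the database administrator. The role names, action names, and the distinction between "delete" and "remove" are assumptions made for illustration.

```python
from enum import Enum


class Role(Enum):
    GUEST = "guest"                      # public password
    MEMBER = "member"                    # other TMGC researchers
    DOMAIN_INVESTIGATOR = "investigator"
    ADMIN = "admin"                      # database administrator


READ_ACTIONS = {"view", "search"}
INVESTIGATOR_ACTIONS = {"download", "submit", "edit", "delete"}


def is_allowed(role, action, user_domain=None, data_domain=None):
    """Decide whether a user may perform an action on data from a domain."""
    if action in READ_ACTIONS:
        return True
    if action in INVESTIGATOR_ACTIONS:
        # Only the primary investigator of the owning domain may change data.
        return (
            role is Role.DOMAIN_INVESTIGATOR
            and user_domain is not None
            and user_domain == data_domain
        )
    if action == "remove":
        # Permanent removal from the database is left to the administrator.
        return role is Role.ADMIN
    return False
```

Unknown actions are refused by default, so a new operation has to be granted explicitly before anyone can use it.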
The main strength of MuTrack lies in its ability to initiate a real-time analysis of phenotype domain data to classify subtle phenodeviants. Analysis tools are designed to compare any particular mouse against members of its same litter, pedigree, generation or against control pedigrees and pedigrees under similar mutational pressure. These processes are entirely dynamic.
TMGC consortium members have access to more complete computational tools as described in the methods section. Tools in this domain also compute dynamic reports with the aim of isolating statistical outliers, but are more robust in sample selection, test selection, and cross-domain test comparisons. In addition, tests located in this controlled space correct for blindness, a side-effect of some breeding strategies, sex, aging and other variables of particular concern to the testing domain. Researchers can create dynamic weekly reports that use trimmed testing sets and can create publication-quality histograms of data sets. An exhaustive list of available administrative and analysis tools is available on the MuTrack site.
During primary mouse screening researchers rely on the statistical analysis generated by MuTrack's computational tools to make determinations about the deviation of a pedigree's phenotype. Any primary domain investigator may set a "deviation" flag via online switches, indicating that the mouse pedigree is a 'putative mutant', or putant. Following a structured decision tree (Figure 3), MuTrack initiates an automatic alert and the physical retesting of the putant pedigree. If pedigrees continue to be classified as statistically aberrant, the domain investigator is given the opportunity to promote putants to cutants, or 'confirmed mutants'. MuTrack then initiates a process to test the phenotype deviation for heritability. Putant and cutant pedigrees return to the same testing domain that first noticed the primary abnormality and, in addition, are tested in secondary domains, some of which are located outside of the consortium. Secondary domains provide alternate methodologies for quantification of phenotype abnormalities that serve to refine phenotype characteristics. MuTrack combines and interprets data from primary and secondary testing domains and forwards results to a TMGC committee that makes the final determination of mutant heritability. Positive mice are determined to be mutants and are made available to the mutant mouse distribution effort (along with visual and lethal mutants), located on the Jackson Laboratory website. A listing of current mouse mutants is available at the Tennessee Mouse Genome Consortium homepage.
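The progression from primary screening to putant, cutant, and mutant can be modeled as a simple state machine. The decision tree of Figure 3 is not reproduced here, so the states, event names, and the "cleared" outcome below are a simplified assumption based only on the description in this paragraph.

```python
from enum import Enum


class Status(Enum):
    PRIMARY = "primary screening"
    PUTANT = "putative mutant"
    CUTANT = "confirmed mutant"
    MUTANT = "heritable mutant"
    CLEARED = "no deviation confirmed"


# Allowed transitions: (current status, event) -> next status.
TRANSITIONS = {
    (Status.PRIMARY, "investigator_flags_deviation"): Status.PUTANT,
    (Status.PUTANT, "retest_still_aberrant_and_promoted"): Status.CUTANT,
    (Status.PUTANT, "retest_normal"): Status.CLEARED,
    (Status.CUTANT, "committee_confirms_heritability"): Status.MUTANT,
    (Status.CUTANT, "committee_rejects_heritability"): Status.CLEARED,
}


def advance(status, event):
    """Return the next status, or raise if the event is not valid here."""
    try:
        return TRANSITIONS[(status, event)]
    except KeyError:
        raise ValueError(
            "event %r is not allowed in status %r" % (event, status.value)
        ) from None
```

Keeping the transitions in one table makes it impossible, for instance, to release a pedigree as a mutant without it first having been promoted to cutant.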
The diversification of experimental techniques in all areas of biological research has caused a trend in laboratory specialization that exceeds the ability of any single primary investigator to provide comprehensive validation of genome-wide investigations. Simultaneously, the excess of quantitative data and empirical observations produced by varying research techniques far outstrips the ability of computational tools to adequately analyze the data for meaningful inferences. These issues combined with finite labor and funding resources have forced large research projects to use bioinformatics techniques to extract a maximum of information at a reasonable cost from geographically dispersed researchers. Researchers at the TMGC are attempting to bring together research teams using a centralized on-line database and analysis toolbox. Because distributed bioinformatics collaborations are relatively unknown quantities in large-scale hypothesis driven research, the TMGC was forced to engineer a system de novo to meet its particular needs.
The MuTrack system was initially released as the central bioinformatics tool for the TMGC in February, 2001. The database responsible for collecting experimental data and generating dynamic web content, including data analysis and knowledge exchange, has grown by the average rate of 34,000 tuples per month. The system has proven to be flexible, robust, extensible and, most importantly, has to date helped to elucidate 75 new heritable phenotypes.
While the system is fundamentally sound it is not exhaustive. Development continues to incorporate ongoing research as it moves into the molecular characteristics of mouse phenodeviants. Ideally, future mutants will be categorized at both gross and molecular granularity and MuTrack will be used to bring together genetic observations and phenotypic effects. Incorporation of primitive phenotyping ontologies will greatly increase our ability to communicate new phenodeviants. Computational systems are under development that will enable MuTrack to support recombination analysis, including the examination of quantitative trait loci, and to make reasonable inferences about molecular networks and gene regulation. Operationally, it is beyond the scope of MuTrack to create a panacea for the needs of every mouse-centric research scenario, but it remains our goal to maintain the software flexibility necessary to allow future application development in a variety of concerted research directions.
Lessons learned from MuTrack can contribute favorably to future distributed team research directives. First, there is no immediately apparent generic or proprietary solution to every problem encountered during the development of distributed bioinformatics software. Research, by definition, produces either novel data types or requires the novel interpretation of data. Cogent engineering of software must be conducted in conjunction with a clear biological hypothesis to demonstrate progress in either area. Secondly, the compulsory use of MuTrack's data collection, analysis and results reporting tools by consortium researchers has greatly aided in the refinement of the system for external users. Bioinformatics systems are capable of producing substantive results only if meaningful data is collected and analyzed, and robust software is only created under real conditions of use. Finally, future large-scale projects that rely heavily on centralized software must allow individual researchers the ability to supplement generalized computational results with free-form observations. To this end, MuTrack developers are attempting to incorporate data analysis systems and results-reporting functions with virtual publication areas, where consortium members may collaborate in the construction of publication quality documents.
There are currently several large-scale and genome-wide research projects that rely heavily on bioinformatics for the elucidation of novel observations. MuTrack provides a working framework for these projects.
MuTrack is available to members of the TMGC neuromutagenesis phenotyping project. There are currently twenty-two discrete testing and husbandry domains located at seven independent institutions within the state of Tennessee that make daily contributions to, or take advantage of, MuTrack data, knowledge, or analysis. Non-members can access a limited number of web interfaces via the TMGC homepage when using the directed public password and username.
We would like to thank Daniel Goldowitz Ph.D. and Gene Rinchik Ph.D. for their critical reading of this manuscript. In addition, we would like to thank Elissa Chesler, Ph.D. for her statistical insights. We would also like to acknowledge the funding and collaborative support provided by the TMGC member institutions.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.