T-REX: software for the processing and analysis of T-RFLP data
© Culman et al; licensee BioMed Central Ltd. 2009
Received: 06 January 2009
Accepted: 06 June 2009
Published: 06 June 2009
Despite increasing popularity and improvements in terminal restriction fragment length polymorphism (T-RFLP) and other microbial community fingerprinting techniques, there are still numerous obstacles that hamper the analysis of these datasets. Many steps are required to process raw data into a format ready for analysis and interpretation. These steps can be time-intensive, error-prone, and can introduce unwanted variability into the analysis. Accordingly, we developed T-REX, free, online software for the processing and analysis of T-RFLP data.
Analysis of T-RFLP data generated from a multiple-factorial study was performed with T-REX. With this software, we were able to i) label raw data with attributes related to the experimental design of the samples, ii) determine a baseline threshold for identification of true peaks over noise, iii) align terminal restriction fragments (T-RFs) in all samples (i.e., bin T-RFs), iv) construct a two-way data matrix from labeled data and process the matrix in a variety of ways, v) produce several measures of data matrix complexity, including the distribution of variance between main and interaction effects and sample heterogeneity, and vi) analyze a data matrix with the additive main effects and multiplicative interaction (AMMI) model.
T-REX provides a free, platform-independent tool to the research community that allows for an integrated, rapid, and more robust analysis of T-RFLP data.
The high-throughput nature of terminal restriction fragment length polymorphism (T-RFLP) makes this technique amenable to generating comprehensive datasets in the study of microbial communities. Despite continued improvements, the analysis of these datasets still requires numerous steps and data manipulations in order to interpret the results. These steps often become obstacles to the analysis, as they are time-intensive and prone to user and analytical error. Currently, some of the greatest obstacles in T-RFLP data analysis are: i) distinguishing true peaks from noise, ii) aligning peaks across samples, iii) creating a two-way data matrix of T-RFs by samples from tabulated raw data, iv) rapid manipulation of data matrices, and v) determining which multivariate analysis is most appropriate for a particular dataset. Collectively, these obstacles create research inefficiencies, reduce method standardization, and may limit the amount of information gained from the analysis overall. To address these obstacles, we have developed T-REX (T-RFLP analysis EXpedited), a free, web-based tool to aid in the analysis of T-RFLP data. In this paper, we introduce and outline the functions of T-REX and how it addresses each of the above obstacles.
Distinguishing true terminal restriction fragments (i.e., true peaks) from background fluctuations in fluorescence is often a major challenge in T-RFLP data analysis. The selection of a baseline threshold can dramatically affect the complexity of the community fingerprint and downstream analyses, resulting in signal loss or noise retention. A common procedure is to apply an arbitrary baseline threshold across all samples to delineate true peaks from noise [1–3]. However, this approach is less than optimal, as noise in a sample varies in proportion to the amount of DNA subject to analysis, causing variation in the signal-to-noise ratio between profiles. Various approaches have been described to address this issue [2–5]. In particular, those that seek to objectively eliminate noise on a sample-by-sample basis, such as a variable percentage threshold or the recursive selection of true peaks based on standard deviations of peak areas, can be more effective at minimizing this bias.
While the base pair size of every T-RF is determined in relation to an internal size standard, sizing errors can occur due to random fluctuations, purine content, and fluorophores [6, 7]. These analytical errors in determining fragment length can result in T-RF drift between samples, in which the same fragment is assigned a different size in different samples. These errors are either ignored and treated as analytical error, corrected through painstaking manual alignment, or aligned using an automated approach [2, 8]. However, to date, there have been no reports comparing the effects of these three approaches. Since most peak alignment software is not integrated with downstream multivariate analyses, it is often difficult to determine the effects of alignment on the overall interpretation of the data.
Multivariate statistical analyses are commonly required to interpret T-RFLP data and to examine the impact of environmental variables or treatments on microbial community composition. Raw T-RFLP data exported from GeneMapper™, Peak Scanner™, or similar size-calling software are typically in a tabulated or listed format, where one column contains all the records for each variable (i.e., one column for all T-RF sizes, one column for all peak heights, etc.). However, these data often need to be formatted into a two-way data matrix to facilitate import into a statistical software package capable of analyzing multivariate data. The formatting of tabulated raw data into a data matrix is generally performed manually or with a tool such as a pivot table in MS Excel, after samples have been labeled with information pertaining to the experimental design (sampling period, treatment, replicate number, etc.). These formatting approaches can be laborious and error-prone.
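The reformatting step described above can be illustrated with a minimal Python sketch. It is not the T-REX implementation; the record keys (`sample`, `trf`, `height`) and the function name `build_matrix` are hypothetical, chosen only to show how a listed format pivots into a samples-by-T-RFs matrix in which absent peaks become zeros.

```python
from collections import defaultdict

def build_matrix(records, value_key="height"):
    """Pivot long-format peak records into a samples-by-T-RFs matrix.

    Each record is one peak (one row of the tabulated export). Cells with
    no recorded peak are filled with 0.0.
    """
    cells = defaultdict(float)
    samples, trfs = set(), set()
    for rec in records:
        samples.add(rec["sample"])
        trfs.add(rec["trf"])
        cells[(rec["sample"], rec["trf"])] += rec[value_key]
    rows = sorted(samples)   # row labels: samples
    cols = sorted(trfs)      # column labels: T-RF sizes
    matrix = [[cells.get((s, t), 0.0) for t in cols] for s in rows]
    return rows, cols, matrix

# Hypothetical listed-format data: one peak per record.
records = [
    {"sample": "A1", "trf": 72, "height": 150.0},
    {"sample": "A1", "trf": 145, "height": 820.0},
    {"sample": "B1", "trf": 72, "height": 95.0},
    {"sample": "B1", "trf": 301, "height": 410.0},
]
rows, cols, matrix = build_matrix(records)
```

Here `matrix[0]` holds the peak heights of sample A1 at T-RF sizes 72, 145, and 301, with a zero for the T-RF that A1 lacks.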
A thorough analysis of large T-RFLP datasets requires various data matrix manipulations, such as examining all three types of data (presence/absence, peak height, peak area), relativization of peak height or peak area, averaging replicated samples, examining specific experimental factors, deleting spurious T-RFs, etc. Most spreadsheet software applications are not amenable to these more sophisticated manipulations, making an exhaustive analysis of these data difficult. In our experience, the rational exploration of T-RFLP data, which properly accounts for experimental design, replication, and differences in signal-to-noise ratios, can reveal patterns in ordinations that are obscured in less complete approaches to data analysis [9, 10].
Finally, there is a lack of consensus in the literature today about which statistical analyses are most appropriate for T-RFLP data. In a comparative study of multivariate methods, Culman et al. reported that the sample heterogeneity and percent interaction effects of a T-RFLP dataset can be used as criteria to select the appropriate statistical approach for data analysis. Although sample heterogeneity can easily be calculated, the calculation of interaction effects is more algorithmically involved. Culman et al. also demonstrated the utility of the Additive Main Effects and Multiplicative Interaction (AMMI) model as a robust and advantageous method for T-RFLP analysis. This model is found in only a few multivariate software packages offered today.
Currently Available Software for T-RFLP Analysis
Currently, there are few options to choose from when analyzing T-RFLP microbial community data. Most software that has been developed is aimed at referencing T-RFLP profiles with a sequence database (e.g., TAP-TRFLP [11, 12], MiCA, PAT, TRAMPR). There are, however, a few available packages that do aid with exploratory multivariate data analysis. T-Align implements an algorithm to align peaks, hence reducing the potential for subjective bias during peak alignment. Another package, T-RFLP Stats, allows users to align peaks (as does T-Align), group samples based on various classification procedures, and then reference these profiles to a clone library. However, a drawback is that this software is written in three separate languages (R, Perl and SAS), requiring three separate platforms. These platforms are all primarily command-line driven and can be cumbersome for inexperienced users. SAS also requires a purchased license for use. In addition, T-RFLP Stats offers no labeling procedure to designate and format raw data, nor does it perform any ordination analyses, which some argue are superior to classification procedures for the exploratory analysis of microbial community data. A few commercial software packages have become available in recent years that offer a range of features for electropherogram manipulation, with some limited multivariate procedures, most notably GelQuest (SequentiX, Germany), Genemarker (SoftGenetics, USA), and Torast (Dresden, Germany). However, the high costs of these programs make them inaccessible to some research labs. In addition, features and functions vary widely between these programs, as most were not primarily designed to facilitate T-RFLP analysis.
We developed T-REX to address current obstacles encountered in T-RFLP data analysis. We sought to build a program that integrated pertinent functions to streamline T-RFLP analysis. T-REX allows users to i) label raw data with attributes related to the experimental design of the samples, ii) determine a baseline threshold for identification of true peaks over noise, iii) align T-RFs in all samples (bin T-RFs), iv) construct a two-way data matrix from labeled data and process the matrix in a variety of ways, v) produce several measures of data matrix complexity, including the distribution of variance between main and interaction effects and sample heterogeneity, and vi) analyze a data matrix with the AMMI model. T-REX offers users a consolidated, flexible and rapid analysis of T-RFLP data.
Uploading Data and Labeling Procedure (Upload Data and My Projects)
The first step in using T-REX is to create a project. A new project is created by uploading and labeling raw data. These two operations happen simultaneously and require two files: i) the raw data file and ii) the label file. The raw data file is the tabulated file exported from GeneMapper®, Peak Scanner™, or similar size-calling software that contains the peak information for a set of samples. The label file contains a set of labels/attributes that describe each sample and often correspond to factors in the experimental design. Both files should be simple text files in tab-delimited format (see the T-REX documentation for specific guidelines on file formats). Once a project is created, it can be renamed, merged, or deleted in the My Projects page. Users can also return to pre-existing projects and load them in this page for further manipulation.
T-REX has several functions to appropriately handle replicated, missing or multiplexed T-RFLP data. Users can define which samples are replicates when uploading data (or manually in the Sample Summary page) and T-REX will provide information based on these defined replicates. Missing data occur when there is a discrepancy between samples in the raw data and label files, or when poor quality samples are flagged due to data processing procedures. T-REX accounts for missing data, allowing users to omit samples of poor quality without sacrificing the information that replicated data provide. In addition to replicated and missing data, T-REX is amenable to multiplexing T-RFLP methodologies. If a sample contains multiple fluors, peaks of the same fluor are processed as a unit of peaks, keeping them distinct from peaks of other fluors. The program documentation outlines specific guidelines for dealing with replicated, missing, or multiplexed data.
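The labeling step and the notion of missing data as a discrepancy between the two files can be sketched as follows. This is only an illustration under stated assumptions: the column names (`sample`, `dye`, `size`, `height`, `depth`, `management`) and the helper names are hypothetical and do not reflect the file layout required by T-REX, which is specified in its documentation.

```python
import csv
import io

def read_tab(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

def label_peaks(raw_rows, label_rows, key="sample"):
    """Attach each sample's labels to its peaks and report discrepancies."""
    labels = {row[key]: row for row in label_rows}
    labeled = [{**labels[p[key]], **p} for p in raw_rows if p[key] in labels]
    raw_names = {p[key] for p in raw_rows}
    unlabeled = sorted(raw_names - set(labels))   # peaks without a label entry
    no_peaks = sorted(set(labels) - raw_names)    # labels without any peaks
    return labeled, unlabeled, no_peaks

# Hypothetical file contents (column names are assumptions).
RAW = "sample\tdye\tsize\theight\nS1\tB\t72.4\t150\nS1\tB\t145.1\t820\nS3\tB\t72.6\t95\n"
LABELS = "sample\tdepth\tmanagement\nS1\t0-10\tprairie\nS2\t0-10\tagricultural\n"

labeled, unlabeled, no_peaks = label_peaks(read_tab(RAW), read_tab(LABELS))
```

In this toy input, sample S3 has peaks but no labels and sample S2 has labels but no peaks, so both are reported rather than silently dropped.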
Viewing and Editing Individual Samples (Sample Summary)
Filtering out Noise from True Peaks (Filter Noise)
T-REX uses the approach outlined by Abdo et al. to find true peaks and reduce background noise. True peaks are identified as those whose height (or area) exceeds the standard deviation (assuming zero mean) computed over all peaks and multiplied by the factor specified in the box provided. The procedure is then repeated on the peaks that were not identified as true. The iterations continue until no new true peaks are found. The noise filtering can be applied to all samples or just selected samples in the active project. Users should select an appropriate standard deviation multiplier based on the original electropherograms and the results of the filtering procedure. The program allows for rapid manipulation of the multiplier and subsequent review of results in the Sample Summary page if a user wants to determine an appropriate multiplier empirically (Figure 4). At any time the filtering procedure can be cleared and the data reverted to their original state with the 'Clear filtering' button.
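The iterative procedure described above can be sketched in a few lines of Python. This follows only the verbal description in this section (zero-mean standard deviation, a user-chosen multiplier, repetition on the remaining peaks); it is not taken from the T-REX source, and any preprocessing applied by T-REX or by Abdo et al. before this step is not represented. The function name and example values are hypothetical.

```python
import math

def filter_noise(values, multiplier=1.0):
    """Return sorted indices of peaks judged to be true peaks.

    On each pass, the standard deviation (assuming zero mean) is computed
    over the peaks not yet accepted; peaks above multiplier * sd are
    accepted. Passes repeat until no new peak is accepted.
    """
    remaining = list(range(len(values)))
    true_idx = []
    while remaining:
        sd = math.sqrt(sum(values[i] ** 2 for i in remaining) / len(remaining))
        cutoff = multiplier * sd
        found = [i for i in remaining if values[i] > cutoff]
        if not found:
            break
        true_idx.extend(found)
        remaining = [i for i in remaining if values[i] <= cutoff]
    return sorted(true_idx)

# Hypothetical peak heights for one sample.
heights = [1000, 800, 50, 40, 30, 20, 10, 10]
strict = filter_noise(heights, multiplier=2.0)
lenient = filter_noise(heights, multiplier=1.0)
```

With these values, a multiplier of 2 keeps only the two large peaks, while a multiplier of 1 also accepts the peaks of height 50 down to 20 over successive passes, which illustrates why the multiplier should be checked against the electropherograms.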
Automated Alignment of Peaks (Align T-RFs)
Peak alignment in T-REX is performed on the set of currently active peaks and occurs automatically whenever this set changes as a result of data manipulation by the user. T-REX offers users two functions to align peaks in the Align T-RFs page. With the default option ('Round to the nearest integer'), peaks are simply rounded to the nearest nucleotide (integer) size. Alternatively, an automated alignment of peaks across all samples is possible. This function is modeled on the approach taken by the software program T-Align. Briefly, the smallest peak across all samples is identified and tagged. Peaks within the range specified by the clustering threshold are then identified and grouped into a T-RF. The next smallest peak across all samples not falling into the first T-RF is identified and tagged. Peaks within the specified clustering threshold are identified and grouped with the second T-RF. This process continues until all peaks are grouped into a T-RF.
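The two alignment options can be contrasted with a short Python sketch. It is an illustration of the description above, not the T-REX or T-Align code. One assumption is made explicit here: the "range specified by the clustering threshold" is measured from the tagged (smallest) peak of each T-RF and is inclusive. Function names and peak sizes are hypothetical.

```python
import math

def round_bins(sizes):
    """Round each peak size to the nearest integer (halves round up)."""
    return [int(math.floor(s + 0.5)) for s in sizes]

def cluster_bins(sizes, threshold=0.5):
    """Assign each peak a T-RF index by greedy clustering from the smallest peak.

    Assumption: a peak joins the current T-RF if it lies within `threshold`
    of that T-RF's tagged (smallest) peak; otherwise it starts a new T-RF.
    """
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    bins = [None] * len(sizes)
    anchor = None
    bin_id = -1
    for i in order:
        if anchor is None or sizes[i] - anchor > threshold:
            anchor = sizes[i]   # tag the next smallest ungrouped peak
            bin_id += 1
        bins[i] = bin_id
    return bins

# Hypothetical peak sizes pooled across all samples.
sizes = [72.6, 145.1, 72.4, 73.5, 72.8, 145.3]
rounded = round_bins(sizes)
clustered = cluster_bins(sizes, threshold=0.5)
```

In this example, rounding places the peaks at 72.4 and 72.6 in different T-RFs (72 and 73), whereas clustering with a threshold of 0.5 groups 72.4, 72.6, and 72.8 into a single T-RF.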
Grouping Samples into Environments (Environments)
The Environments page allows users to rapidly classify samples into environments based on the given labels. This approach is especially useful when replication in an experiment occurred at multiple scales (e.g., analytical, field) and a user wants to compare results based on these different ways of defining replication. Users can assign and/or reassign replicated samples into environments by using the provided checkboxes to define the set of labels that determine an environment. Samples will be considered replicates (i.e., belonging to the same environment) if they have identical sets of checked label values. The Environments page can be used as an alternative to specifying replicates at data upload, or to change the environment assignments made at the upload stage.
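The rule that samples sharing identical values for the checked labels form one environment can be sketched as a simple grouping. The label names and sample names below are hypothetical, and the function is an illustration rather than the T-REX implementation.

```python
from collections import defaultdict

def assign_environments(samples, checked):
    """Group sample names by their values for the checked labels.

    Samples with identical values for every checked label fall into the
    same environment and are therefore treated as replicates.
    """
    envs = defaultdict(list)
    for name, labels in samples.items():
        key = tuple(labels[c] for c in checked)
        envs[key].append(name)
    return dict(envs)

# Hypothetical labeled samples.
samples = {
    "S1": {"depth": "0-10", "management": "prairie", "site": "1"},
    "S2": {"depth": "0-10", "management": "prairie", "site": "2"},
    "S3": {"depth": "0-10", "management": "agricultural", "site": "1"},
    "S4": {"depth": "10-20", "management": "prairie", "site": "1"},
}
sites_as_replicates = assign_environments(samples, ["depth", "management"])
unreplicated = assign_environments(samples, ["depth", "management", "site"])
```

Leaving `site` unchecked makes S1 and S2 replicates of the same environment; checking all three labels yields one environment per sample, which corresponds to an un-replicated analysis.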
Export Labeled Data to Use Elsewhere (Export Labeled Data)
The Export Labeled Data page was designed for users who want to take advantage of T-REX's rapid labeling procedure and data manipulation functions, but analyze their data with another software program. After data are uploaded and labeled, users can export the labeled data directly, or can manipulate the data before exporting. The Sample Summary page indicates the current status of a project and will reflect the exact details of the data that will be exported.
Data Matrix Construction and AMMI analysis (Data Matrix/AMMI)
The Data Matrix/AMMI page allows users to first construct a two-way data matrix and then run the AMMI model on this data matrix. Data matrix construction involves six steps. The first two steps require that all peaks be assigned to a particular T-RF and that each sample be associated with an environment. Typically, both these conditions are automatically satisfied and require no special action. The third step allows users to specify which type of data to use for data matrix construction (presence/absence, peak height or peak area), and if these data are to be averaged across replicates and/or relativized within samples. The fourth step allows users to select which experimental factors should be included in the data matrix. Users have the option of selecting all, or only a subset of specific fluors and/or factors to be included in the data matrix and subsequent analysis. The fifth step allows users to omit rare T-RFs or samples with poor peak representation.
T-RFs can be omitted based on number or percentage of occurrences across samples. Total number of T-RFs or the cumulative peak height or area can be used to eliminate certain samples. This T-RF and sample filtering step represents a final quality control on the resulting data matrix. Selecting 'Create Data Matrix' in the sixth step will take the user to another page where a data matrix is ready for download, and various data matrix properties are displayed, including total numbers of samples and T-RFs present, the maximum, minimum, and average number (average richness) of T-RFs across samples, and sample heterogeneity.
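Three of the matrix-processing options described above (relativization within samples, averaging across replicates, and omission of rare T-RFs) can be sketched as follows. This is a minimal illustration with hypothetical function names and data; the exact definitions and order of operations used by T-REX are given in its documentation and are not assumed here.

```python
def relativize(matrix):
    """Divide each value by its sample (row) total; empty rows stay zero."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out

def average_replicates(rows, matrix, env_of):
    """Average the rows of samples that belong to the same environment."""
    groups = {}
    for name, row in zip(rows, matrix):
        groups.setdefault(env_of[name], []).append(row)
    envs = sorted(groups)
    averaged = [[sum(col) / len(col) for col in zip(*groups[e])] for e in envs]
    return envs, averaged

def drop_rare_trfs(cols, matrix, min_samples=2):
    """Keep only T-RFs that occur (value > 0) in at least min_samples samples."""
    keep = [j for j in range(len(cols))
            if sum(1 for row in matrix if row[j] > 0) >= min_samples]
    return [cols[j] for j in keep], [[row[j] for j in keep] for row in matrix]

# Hypothetical samples-by-T-RFs matrix of peak heights.
rows = ["S1", "S2", "S3"]
cols = [72, 145, 301]
matrix = [[100, 300, 0], [50, 150, 0], [0, 200, 200]]
env_of = {"S1": "prairie", "S2": "prairie", "S3": "agricultural"}

relative = relativize(matrix)
envs, averaged = average_replicates(rows, relative, env_of)
kept_cols, kept_matrix = drop_rare_trfs(cols, matrix, min_samples=2)
richness = [sum(1 for v in row if v > 0) for row in matrix]  # T-RFs per sample
```

Relativizing before averaging, as done here, gives each replicate equal weight regardless of its overall signal strength; that ordering is a choice made for the example.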
File types generated by T-REX that are available to download (file: description).
(Tables one – four)
AMMI Graphing Data: Environment and T-RF scores for graphing
Data Matrix: Data matrix for additional analyses with other software
Transposed Data Matrix: Data matrix for additional analyses with other software
MATMODEL output file: Full MATMODEL output
MATMODEL input file: MATMODEL input file
Environments Assigned to Samples: Defines which samples are replicates
Labeled Data (list format): Labeled raw data
Zipped folder containing all files (opens with a compatible .zip extractor): Archive of all output files
Summary of Results and Output (Results Summary)
The Results Summary page reports the results of relevant basic data matrix properties and summarizes the results of the AMMI analysis in one place. The 'T-RF Abundance table' reports the number of samples (samples present) and percentage of samples (% of samples present) in which each T-RF occurs. All generated output files are also available for download at this page.
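The two quantities reported in the 'T-RF Abundance table' are straightforward to compute from a data matrix. The sketch below is an illustration with a hypothetical function name and data, treating any value greater than zero as presence.

```python
def trf_abundance(cols, matrix):
    """For each T-RF, count the samples containing it and the percentage."""
    n_samples = len(matrix)
    table = []
    for j, trf in enumerate(cols):
        present = sum(1 for row in matrix if row[j] > 0)
        table.append((trf, present, 100.0 * present / n_samples))
    return table

# Hypothetical presence/absence matrix: 4 samples (rows) by 3 T-RFs (columns).
cols = [72, 145, 301]
matrix = [[1, 1, 0], [1, 1, 0], [0, 1, 1], [0, 1, 0]]
abundance = trf_abundance(cols, matrix)
```

Each tuple holds the T-RF size, the number of samples in which it is present, and the corresponding percentage of samples.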
We used T-REX to analyze 16S T-RFLP data generated from soils under two different management histories (harvested tallgrass prairie and adjacent agricultural fields) from five different sites across north central Kansas (Culman et al., unpublished). Soil was sampled at three depth intervals (0 – 10 cm, 10 – 20 cm, and 20 – 40 cm) in June 2007. T-RFLP procedures were conducted as previously described. The data were subjected to several quality control procedures: T-RF Alignment (clustering threshold = 0.5), Noise Filtering (peak area, standard deviation multiplier = 1), and elimination of samples with fewer than 20 T-RFs. This initial processing indicated that all 30 samples were of good quality and suitable for inclusion in the final ordination analyses. Processed data were subjected to AMMI analysis with T-REX in two separate ways: first, with data defined as un-replicated (3 depths × 2 management histories × 5 different sites), and second, with each site defined as a replicate (3 depths × 2 management histories × 5 replicates). The un-replicated analysis was performed to gain insight into variability between sites; the second analysis, with sites defined as replicates, allowed for a more focused analysis of the experimental factors of primary interest, management history and depth. With replicated data, the AMMI analysis provides a calculation of interaction pattern and noise, giving a clearer picture of the strength of the interaction term.
Defining replication was easily performed in the Environments page. Sample heterogeneity calculations provided by T-REX were high relative to T-RFLP datasets previously encountered. As a result, we also used nonmetric multidimensional scaling (NMS) to analyze the data. The T-REX-constructed data matrices were exported and subjected to NMS in R via the metaMDS function in the vegan package. NMS parameters were manipulated in a variety of ways, but the final analyses were performed with metaMDS default parameters with the following exceptions: autotransform = FALSE, 100 runs. NMS ordination results were graphed in R. After observing the AMMI ordination results in the scatterplot provided by T-REX, graphing scores were exported and graphed in R for publication purposes.
In addition to ordination results, the AMMI analysis provides a breakdown of the contributions of variation from the three sources in the data matrix: i) T-RFs, ii) environments, and iii) T-RF × environment interactions. The variation from T-RFs reflects variability in the means of different T-RFs, while the variation from environments reflects the number of peaks or overall signal strength in T-RFLP profiles. The variation from T-RF × environment interactions reflects how T-RFs respond differentially to the environments. For our research objectives, the interaction variation was the source of primary interest, as we were concerned with the response of microbial community profiles (T-RFs) to different depths and management histories (environments). Culman et al. found that variation due to interaction effects reflects how similar or dissimilar the microbial communities are, and could be used as a tool to objectively assess differences across multiple datasets.
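The breakdown into the three sources can be illustrated with the standard two-way sum-of-squares decomposition for an un-replicated T-RFs-by-environments matrix. This sketch covers only the additive (ANOVA) step; the multiplicative step of AMMI, which further decomposes the interaction residuals, and the handling of replicated data are not shown. The function name and example matrices are hypothetical, and the output is not claimed to match the T-REX or MATMODEL reports.

```python
def variance_partition(matrix):
    """Percent of total sum of squares due to T-RFs, environments, interaction.

    `matrix` has T-RFs as rows and environments as columns, one value per cell.
    """
    n_trf = len(matrix)
    n_env = len(matrix[0])
    grand = sum(sum(row) for row in matrix) / (n_trf * n_env)
    trf_means = [sum(row) / n_env for row in matrix]
    env_means = [sum(col) / n_trf for col in zip(*matrix)]
    ss_trf = n_env * sum((m - grand) ** 2 for m in trf_means)
    ss_env = n_trf * sum((m - grand) ** 2 for m in env_means)
    ss_int = sum(
        (matrix[i][j] - trf_means[i] - env_means[j] + grand) ** 2
        for i in range(n_trf)
        for j in range(n_env)
    )
    total = ss_trf + ss_env + ss_int
    if total == 0:
        raise ValueError("matrix has no variation to partition")
    return {
        "T-RFs": 100.0 * ss_trf / total,
        "environments": 100.0 * ss_env / total,
        "interaction": 100.0 * ss_int / total,
    }

# Hypothetical matrices: one with interaction, one purely additive.
mixed = variance_partition([[6, 2], [2, 2]])
additive = variance_partition([[1, 3], [5, 7]])
```

In the first matrix the three sources contribute equally; in the second, every T-RF responds identically across environments, so the interaction share is zero.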
Results and discussion
Table: T-REX output of the percent variation from each source in the three datasets (dataset label: Agricultural + Prairie Soil).
The T-RFLP dataset in this study contained three factors (depth, management history, and site), all of which were detectable drivers of bacterial community structure. However, the strength of site differences varied depending on management practice. Hence, exploratory data analyses and data matrix manipulation were required to elucidate which factors exerted the greatest influence on bacterial community structure within a specified treatment. T-REX aided in an integrated and rapid manipulation of these data matrices, enabling a thorough analysis of this dataset.
In addition to rapid data matrix manipulation, T-REX also produced a more robust dataset, because the data were subjected to several quality control procedures prior to data matrix construction: T-RF Alignment, Noise Filtering, and elimination of samples with fewer than 20 T-RFs. This initial processing ensured that all samples were of acceptable quality. The calculations of sample heterogeneity and interaction effects generated by T-REX were also used as prescriptive indicators that the data were complex and that non-parametric analyses, such as NMS, may yield more discriminatory ordination results. However, the overall trends revealed by NMS did not differ from the ordination results of the AMMI analyses (not shown).
T-REX facilitates an integrated and streamlined analysis of microbial community data with a suite of flexible functions that allows researchers to choose the most appropriate data manipulations based on research objectives. T-REX also enables researchers to implement the AMMI analysis, a method that holds many advantages for microbial community data analysis. In addition, this software provides a tool for the research community to rapidly and robustly test the effects of various data processing methods on the overall results of datasets. Many of these processing methods are known sources of analytical variability, but there is no consensus in the literature on how to most appropriately minimize this variability. We intend to focus the continued development of T-REX on a more sophisticated T-RF alignment algorithm, as well as on integrating NMS and permutational multivariate analysis of variance. T-REX will allow microbial community analyses to continue to develop as an important tool in understanding microbial community dynamics and their effects on ecosystem processes.
Availability and requirements
Project name: T-REX
Project home page: http://trex.biohpc.org
Operating system(s): Platform independent for users
Programming language: Microsoft ASP.NET and MS SQL Server platforms
License: GNU GPL
Any restrictions to use by non-academics: none
The authors wish to acknowledge Noah Spies for contributing to ideas and developments of an earlier version of this software. Funding for this software development was provided by the National Science Foundation (IGERT Fellowship DGE 0221658) and by the Microsoft Corporation.
- Blackwood CB, Marsh T, Kim SH, Paul EA: Terminal restriction fragment length polymorphism data analysis for quantitative comparison of microbial communities. Appl Environ Microbiol 2003, 69: 926–932. doi:10.1128/AEM.69.2.926-932.2003
- Dunbar J, Ticknor LO, Kuske CR: Phylogenetic specificity and reproducibility and new method for analysis of terminal restriction fragment profiles of 16S rRNA genes from bacterial communities. Appl Environ Microbiol 2001, 67: 190–197. doi:10.1128/AEM.67.1.190-197.2001
- Abdo Z, Schuette UME, Bent SJ, Williams CJ, Forney LJ, Joyce P: Statistical methods for characterizing diversity of microbial communities by analysis of terminal restriction fragment length polymorphisms of 16S rRNA genes. Environ Microbiol 2006, 8: 929–938. doi:10.1111/j.1462-2920.2005.00959.x
- Sait L, Galic M, Strugnell RA, Janssen PH: Secretory antibodies do not affect the composition of the bacterial microbiota in the terminal ileum of 10-week-old mice. Appl Environ Microbiol 2003, 69: 2100–2109. doi:10.1128/AEM.69.4.2100-2109.2003
- Osborne CA, Rees GN, Bernstein Y, Janssen PH: New threshold and confidence estimates for terminal restriction fragment length polymorphism analysis of complex bacterial communities. Appl Environ Microbiol 2006, 72: 1270–1278. doi:10.1128/AEM.72.2.1270-1278.2006
- Kaplan CW, Kitts CL: Variation between observed and true terminal restriction fragment length is dependent on true TRF length and purine content. J Microbiol Methods 2003, 54: 121–125. doi:10.1016/S0167-7012(03)00003-4
- Marsh TL: Culture-independent microbial community analysis with terminal restriction fragment length polymorphism. Methods Enzymol 2005, 397: 308–329. doi:10.1016/S0076-6879(05)97018-3
- Smith CJ, Danilowicz BS, Clear AK, Costello FJ, Wilson B, Meijer WG: T-Align, a web-based tool for comparison of multiple terminal restriction fragment length polymorphism profiles. FEMS Microbiol Ecol 2005, 54: 375–380. doi:10.1016/j.femsec.2005.05.002
- Culman SW, Duxbury JM, Lauren JG, Thies JE: Microbial community response to soil solarization in Nepal's rice-wheat cropping system. Soil Biol Biochem 2006, 38: 3359–3371. doi:10.1016/j.soilbio.2006.04.053
- Culman SW, Gauch HG, Blackwood CB, Thies JE: Analysis of T-RFLP data using analysis of variance and ordination methods: a comparative study. J Microbiol Methods 2008, 75: 55–63. doi:10.1016/j.mimet.2008.04.011
- Cole JR, Chai B, Marsh TL, Farris RJ, Wang Q, Kulam SA, Chandra S, McGarrell DM, Schmidt TM, Garrity GM, et al.: The Ribosomal Database Project (RDP-II): previewing a new autoaligner that allows regular updates and the new prokaryotic taxonomy. Nucleic Acids Res 2003, 31: 442–443. doi:10.1093/nar/gkg039
- Marsh TL, Saxman P, Cole JR, Tiedje JM: Terminal restriction fragment length polymorphism analysis program, a web-based research tool for microbial community analysis. Appl Environ Microbiol 2000, 66: 3616–3620. doi:10.1128/AEM.66.8.3616-3620.2000
- Shyu C, Soule T, Bent SJ, Foster JA, Forney LJ: MiCA: A web-based tool for the analysis of microbial communities based on terminal-restriction fragment length polymorphisms of 16S and 18S rRNA genes. Microb Ecol 2007, 53: 562–570. doi:10.1007/s00248-006-9106-0
- Kent AD, Smith DJ, Benson BJ, Triplett EW: Web-based phylogenetic assignment tool for analysis of terminal restriction fragment length polymorphism profiles of microbial communities. Appl Environ Microbiol 2003, 69: 6768–6776. doi:10.1128/AEM.69.11.6768-6776.2003
- Fitzjohn RG, Dickie IA: TRAMPR: an R package for analysis and matching of terminal-restriction fragment length polymorphism (TRFLP) profiles. Mol Ecol Notes 2007, 7: 583–587. doi:10.1111/j.1471-8286.2007.01744.x
- Grant A, Ogilvie LA, Blackwood CB, Marsh TL, Sang-Hoon K, Paul EA: Terminal restriction fragment length polymorphism data analysis. Appl Environ Microbiol 2003, 69: 6342–6343. doi:10.1128/AEM.69.10.6342-6343.2003
- Singh BK, Nazaries L, Munro S, Anderson IC, Campbell CD: Use of multiplex terminal restriction fragment length polymorphism for rapid and simultaneous analysis of different components of the soil microbial community. Appl Environ Microbiol 2006, 72: 7278–7285. doi:10.1128/AEM.00510-06
- R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria; 2008. [http://www.R-project.org]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.