FIND: A new software tool and development platform for enhanced multicolor flow analysis
© Dabdoub et al; licensee BioMed Central Ltd. 2011
Received: 3 November 2010
Accepted: 10 May 2011
Published: 10 May 2011
Flow Cytometry is a process by which cells, and other microscopic particles, can be identified, counted, and sorted mechanically through the use of hydrodynamic pressure and laser-activated fluorescence labeling. As immunostained cells pass individually through the flow chamber of the instrument, laser pulses cause fluorescence emissions that are recorded digitally for later analysis as multidimensional vectors. Current, widely adopted analysis software limits users to manual separation of events based on viewing two or three simultaneous dimensions. While this may be adequate for experiments using four or fewer colors, advances have lead to laser flow cytometers capable of recording 20 different colors simultaneously. In addition, mass-spectrometry based machines capable of recording at least 100 separate channels are being developed. Analysis of such high-dimensional data by visual exploration alone can be error-prone and susceptible to unnecessary bias. Fortunately, the field of Data Mining provides many tools for automated group classification of multi-dimensional data, and many algorithms have been adapted or created for flow cytometry. However, the majority of this research has not been made available to users through analysis software packages and, as such, are not in wide use.
We have developed a new software application for analysis of multi-color flow cytometry data. The main goals of this effort were to provide a user-friendly tool for automated gating (classification) of multi-color data as well as a platform for development and dissemination of new analysis tools. With this software, users can easily load single or multiple data sets, perform automated event classification, and graphically compare results within and between experiments. We also make available a simple plugin system that enables researchers to implement and share their data analysis and classification/population discovery algorithms.
The FIND (Flow Investigation using N-Dimensions) platform presented here provides a powerful, user-friendly environment for analysis of Flow Cytometry data as well as providing a common platform for implementation and distribution of new automated analysis techniques to users around the world.
The advent of Flow Cytometry (FC) and subsequent enhancements allowing for polychromatic investigation has proved invaluable to those involved in basic research and medical diagnosis. By differentially staining the cell population using dyes specific for various characteristics, each cell produces a fluorescence signature characteristic of the bound dyes. Thus, by measuring the fluorescence at multiple wavelengths, a multidimensional quantitative signature can be defined for each cell. However, from the time of introduction the overwhelmingly predominant method for analyzing and interpreting data gathered from flow cytometers has been manual gating. In this process, event data is visualized using bivariate plots such as histograms, contour, scatter, and density plots. Users must then peruse the entirety of the data, two channels (dimensions) at a time and manually isolate sections of the plot by visually applying geometric constructs (rectangles, circles, ellipses) to group or separate the data. In this manner, specific cell populations are identified and separated from the mass of data.
With advances in multi-laser flow cytometers, commercial machines are now capable of gathering up to 20 channels (or more using quantum dots over traditional organic fluorophores ) of data per event. In addition, current research into adding mass spectrometry capability to flow cytometers (mass cytometry ) promises determination of up to 100 different biomarkers. The number of unique 2D plots needed to cover a dataset is , thus for 20 channels of data almost 190 plots are necessary, and for 100 channels, nearly 5000 plots.
The main barrier to the widespread adoption of such tools is a technological one. Published research describing classification methods takes one of two routes. Authors may write their own processing code and rely on importing analyzed data into generic or FC-specific programs for visualization and data-interaction [3, 5, 6]. Alternately, some authors have repurposed software intended for analysis of other types of data such as microarrays . While ultimately accomplishing their purpose, such methods are far from ideal for reusability and wider adoption. Indeed, these approaches expose an underlying need: a user-friendly, purpose-built software platform on which researchers can easily implement analysis techniques, algorithms, and new visualizations. In this paper, we present the development of such a platform with the main goal of easing future development and encouraging widespread use of alternate tools, providing a system that makes complete use of the available data dimensions. Consequently, our tool will greatly aid user analysis in time as well as consistency.
The main analytical focus of the most widely used industry software, such as FlowJo and FCS Express, is manual gating. FlowJo in particular provides one method of algorithmic analysis, based on Probability Binning , however it is currently in beta status and available on only one platform. Such software with large user bases are understandably slow to change, and perhaps reluctant to provide users with analysis tools that are not widely recognized.
Currently available software providing automated analysis methods include the more generalized analysis packages such as the flow cytometry suite , provided as part of the Bioconductor  project, the stand-alone software Flow , and specialized tools such as that provided by the authors of the recently published automated analysis technique: FLAME .
The FC analysis libraries available through the Bioconductor project are excellent, yet, while being based on R has many benefits, it is at its core a command-line based software. For many users, command-line access is a foreign concept, and this fact alone will prevent them from ever trying it. However, far from being in competition with Bioconductor, there is potential for FIND to make use of Bioconductor through a bridge library, RPy, that allows access to R functionality from Python programs. The program Flow suffers from similar problems in the area of usability. The interface itself can be confusing and tends to be overly technical, as is the documentation which states the program is aimed more at developers than end-users. Software in the category of algorithm-specific programs, such as the webservice FLAME, certainly serve a useful purpose, but offer no generality or measures for comparison to other methods. No algorithm will be appropriate for every dataset, and continuity of interface is important for work-flow efficiency.
The Python programming language was chosen for its high-level syntax, ease of use, multi-platform independence, and the availability of excellent scientific and numerical analysis packages (among them, SciPy and NumPy). In addition, Python interfaces well with the C++ and Java programming languages, allowing code in both to be used from within Python programs (Boost.Python and Jython, respectively). Our foremost goals while developing FIND were: user friendliness, and simplicity of design, implementation, and maintenance. In pursuit of these goals, we chose the wxWidgets library which allows true cross-platform development of windowed applications with the native look and feel of the operating system under which the program is run.
The design and implementation of FIND was guided by the well-known Model-View-Controller (MVC) architectural pattern, in which the underlying data model is separated from the user interface and the controlling system logic. This design paradigm allows for greater flexibility, clean implementation, and simplified overall debugging and maintenance. The other major concerns in design centered around the two user populations we envisioned for this software: Normal users of Flow Cytometry analysis software, and researchers involved in improving the analysis and visualization of FC data. The former target population was represented throughout the development process by five researchers experienced in the design and analysis of FC experiments. The latter population was considered carefully throughout and well-represented by the developer.
Finally, we provide pre-built platform specific executables for Microsoft Windows and Mac OS X. These packages are entirely self-contained and require no additional software to be installed. Additionally, all plugins are located in a folder external to the executables, allowing users to easily "install" extended functionality to the program.
Results and Discussion
The first concern in the design of FIND was simplifying the user experience and streamlining the analysis process. Thus, FIND consists of a single window split into a data pane and a display pane. The data pane lists loaded data sets, their subdivisions and clusterings, as well as stored figure sets, in a hierarchical tree. Each item in the tree is selectable and has a context menu available with a number of generic and item-specific actions available to the user. For example, all items may be plotted, but plotting methods are specified as applicable to data items alone, clustering items alone, or both.
The display pane is further subdivided into a plotting pane and a dimension-selection pane. The plotting pane may be configured to display a grid of subplots, each selectable and configurable independently. This forms the heart of the display system and was designed to collect data visualizations in one location. By default, all subplots are bound to the user-selected dimensions, and are updated instantly when the selection is changed. Individual plots may be unbound via a simple checkbox that indicates the bound status of the currently selected subplot.
One consequence of many automatic classification algorithms is some degree of non-determinism in cluster ordering. This does not negatively impact the resulting analysis, but it does pose an issue for later visualization. The simplest method for visually distinguishing the events belonging to different clusters is through color variation. However, for such differentiation to be more meaningful across multiple runs of an algorithm, whether for the same or separate data sets, similar clusters should be given the same color. Thus, we have applied the measure of cluster similarity used in the Bakker Schut et al. algorithm  to reorder the clusters such that those most similar appear with the same color when plotted (Figure 6).
While the grid layout display provided by FIND enables easy comparative visualization and analysis of multiple datasets, it doubles as a figure creation tool. Users simply select the 'Save Figure' menu item in the Plot menu, and are able to instantly create usable figures for presentation or publication in the following formats: PNG, PDF, PostScript, EPS (Encapsulated PostScript), and SVG (Scalable Vector Graphics). Exporting the current visualization grid eliminates the need for additional steps in figure creation and additional complexity in subplot placement, thus saving time and effort (Figure 6).
One of the most important steps in analysis at any point, and especially the end, is saving the work that has been done. FIND allows users to save the entire state of the analysis as two files: one containing all the qualitative information about data structure, attributes, and visualizations, the second a simple binary file containing all the loaded numeric data. These two files can be packaged and transported in any manner for later use on the same machine or any other machine running FIND. Finally, all of the preceding topics are discussed in more detail and with examples in the user documentation included in Additional file 1 as well as on the project website.
A major component of the FIND application is enabling researchers and software developers to extend the functionality of the system in various ways. The main focus of the plugin system is automated population discovery and giving researchers a common platform for implementing their research. However, FIND provides additional access modalities; specifically we have categorized them into the following: Graphing, Transformations, I/O, and Analysis.
Graphing plugins offer authors a hook into the plotting subsystem to implement new visualization options. In order to reduce complexity, graphing plugins are visible in the Plugins menu (indicating their availability), but only usable from the plot submenu in the data tree context menu. Furthermore, authors specify graphing plugins as applicable to either datasets (items in the tree representing a file or a child dataset) or clusterings. For example, the scatterplot is applicable to both, while the histogram plot is only available for datasets. As a result, the context menu is built dynamically to fit the tree item selected. Transformation plugins provide means to transform the data to a more meaningful data space, such as the widely used Hyperlog  transformation. All transformations (plugin or built-in) can be called through the FIND API and applied to given datasets. I/O (input/output) plugins allow the import and export of data and automated analysis results. This will enable the inclusion of various data formats from other programs/FC machines, and allow for FIND results to be saved for later use as needed. The last modality, Analysis, is intended to provide an all-purpose means for plugin authors to offer novel analysis or visualization options that do not fall into any of the other plugin categories. On the user side, plugins are easily installed by saving the plugin files into the appropriate subdirectory of the plugins directory located with the FIND executable. Complete details on writing plugins for FIND, with code examples, are available in the developer documentation included in Additional file 1 as well as on the project website.
Finally, as software engineering is never a flawless science, errors and bugs will occur. Thus FIND catches unexpected errors, and can report these directly to us. This will aid in the development and improvement of FIND, as well as provide users a more technically specific means of feedback. In addition, FIND provides a mechanism enabling users to check for updates to the program, allowing them to stay current with fixes and enhancements as development continues.
It is becoming clear that the advance of biochemical technology will soon outstrip the ability of users to manually analyze Flow Cytometry data effectively and in a reasonable amount of time. Two-dimensional visual analysis is useful, but at the very least should be combined with multi-dimensional mathematical analysis to minimize the risk of losing important information. The answer lies in the direction of intelligent automated analysis tools , however, without a user-friendly platform on which to implement, widespread adoption of such tools has been and will continue to be extremely difficult. Additionally, it is wasted and redundant effort for researchers to reinvent the wheel or attempt to repurpose existing software each time a new analysis tool is developed.
A more serious problem lies in the fact that, without widespread testing of algorithms on multiple different datasets, it will be difficult to verify the accuracy and usefulness of new software analysis tools. Fortunately, these needs are coming to be recognized in the form of calls for increased development and improvement [7, 8], the IEEE Bioinformatics Standards for Flow Cytometry Working Group designed to provide data standard proposals, and an assessment competition (FlowCAP) similar to the annual protein structure prediction competition CASP. FIND can be very useful in facilitating these efforts, providing a platform upon which to build and test.
Finally, it should not be overlooked that automated analysis methods which simply present the researcher with results as a finished product, but which do not facilitate the researcher's comprehension by supporting exploration and evaluation of the results, are almost as non-productive as a total lack of automation. It has been repeatedly demonstrated that human experts either reject and manually re-do automated analyses which are not presented in such a way as to facilitate exploration, comparison and understanding, or come to rely upon these analyses without adequate concern for confirming their ongoing validity [17–20]. Even the best-of-breed research algorithms for FC analysis therefore fail to deliver the benefits that they could, because they universally operate in isolation, and do not provide internal comparisons to other methods. Such omissions, as trivial as they may seem, critically impede the wide adoption of improved methods, and absorb considerable human resources which could be released for more productive tasks if the improved methods can be incorporated within a familiar comprehensive analysis framework.
To conclude, here we have presented FIND, a new cross-platform software package that provides a basic visualization and development platform for analysis of Flow Cytometry data, while maintaining a focus on end-user accessibility. FIND presents an easy to use integrated interface, behind which exists a powerful plugin system based on the modern, widely used language Python, and the excellent numeric and scientific computational toolset available to it.
Availability and requirements
Project name: Flow Investigation using N-Dimensions
Project home page: http://www.justicelab.org/find
Alternate page: http://www.mathmed.org/#find
Operating Systems: Platform independent
Programming language: Python
Other requirements: None
License: Version 0.3.1 of FIND is released under the GPLv3
Restrictions: FIND is free for Academic use only
FCS Express http://www.denovosoftware.com
Massively Multiparametric Mass Cytometer Analyzer Project http://www.stemspec.ca
FLAME: FLow analysis with Automated Multivariate Estimation http://www.broadinstitute.org/cancer/software/genepattern/modules/FLAME
SciPy and NumPy http://www.scipy.org
Bioinformatics Standards for Flow Cytometry http://flowcyt.sourceforge.net
FlowCAP - Flow Cytometry: Critical Assessment of Population Identification Methods http://flowcap.flowsite.org
Critical Assessment of protein Structure Prediction (CASP) http://predictioncenter.org
The authors would like to thank Dr. Santiago Partida-Sanchez, Dennis J. Horvath Jr., and Dr. Ashay S. Patel for their invaluable input and assistance.
- Perfetto SP, Chattopadhyay PK, Roederer M: Seventeen-colour flow cytometry: unravelling the immune system. Nat Rev Immunol 2004, 4(8):648–655.View ArticlePubMedGoogle Scholar
- Bandura DR, Baranov VI, Ornatsky OI, Antonov A, Kinach R, Lou X, Pavlov S, Vorobiev S, Dick JE, Tanner SD: Mass Cytometry: Technique for Real Time Single Cell Multitarget Immunoassay Based on Inductively Coupled Plasma Time-of-Flight Mass Spectrometry. Analytical Chemistry 2009, 81(16):6813–6822.View ArticlePubMedGoogle Scholar
- Schut TCB, Grooth BGD, Greve J: Cluster analysis of flow cytometric list mode data on a personal computer. Cytometry 1993, 14(6):649–659.View ArticleGoogle Scholar
- Wilkins MF, Hardy SA, Boddy L, Morris CW: Comparison of five clustering algorithms to classify phytoplankton from flow cytometry data. Cytometry 2001, 44(3):210–217.View ArticlePubMedGoogle Scholar
- Lugli E, Pinti M, Nasi M, Troiano L, Ferraresi R, Mussi C, Salvioli G, Patsekin V, Robinson JP, Durante C, Cocchi M, Cossarizza A: Subject classification obtained by cluster analysis and principal component analysis applied to flow cytometric data. Cytometry Part A 2007, 71A(5):334–344.View ArticleGoogle Scholar
- Finn WG, Carter KM, Raich R, Stoolman LM, Hero AO: Analysis of clinical flow cytometric immunophenotyping data by clustering on statistical manifolds: Treating flow cytometry data as high-dimensional objects. Cytometry Part B: Clinical Cytometry 2009, 76B: 1–7.View ArticleGoogle Scholar
- Lizard G: Flow cytometry analyses and bioinformatics: Interest in new softwares to optimize novel technologies and to favor the emergence of innovative concepts in cell research. Cytometry Part A 2007, 71A(9):646–647.View ArticleGoogle Scholar
- Lo K, Brinkman RR, Gottardo R: Automated gating of flow cytometry data via robust model-based clustering. Cytometry Part A 2008, 73A(4):321–332.View ArticleGoogle Scholar
- Hofmann M, Zerwes H: Identification of organspecific T cell populations by analysis of multiparameter flow cytometry data using DNA-chip analysis software. Cytometry Part A 2006, 69A(6):533–540.View ArticleGoogle Scholar
- Roederer M, Treister A, Moore W, Herzenberg LA: Probability binning comparison: A metric for quantitating univariate distribution differences. Cytometry 2001, 45: 37–46.View ArticlePubMedGoogle Scholar
- Hahne F, LeMeur N, Brinkman RR, Ellis B, Haaland P, Sarkar D, Spidlen J, Strain E, Gentleman R: flowCore: a Bioconductor package for high throughput flow cytometry. BMC Bioinformatics 2009, 10: 106. [PMID: 19358741] [PMID: 19358741]PubMed CentralView ArticlePubMedGoogle Scholar
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biology 2004, 5(10):R80. [PMID: 15461798] [PMID: 15461798]PubMed CentralView ArticlePubMedGoogle Scholar
- Frelinger J, Kepler T, Chan C: Flow: Statistics, visualization and informatics for flow cytometry. Source Code for Biology and Medicine 2008, 3: 10.PubMed CentralView ArticlePubMedGoogle Scholar
- Pyne S, Hu X, Wang K, Rossin E, Lin T, Maier LM, Baecher-Allan C, McLachlan GJ, Tamayo P, Haer DA, Jager PLD, Mesirov JP: Automated high-dimensional flow cytometric data analysis. Proceedings of the National Academy of Sciences 2009, 106(21):8519–8524.View ArticleGoogle Scholar
- Arthur D, Vassilvitskii S: k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. New Orleans, Louisiana: Society for Industrial and Applied Mathematics; 2007:1027–1035.Google Scholar
- Bagwell CB: Hyperlog-A flexible log-like transform for negative, zero, and positive valued data. Cytometry Part A 2005, 64A: 34–42.View ArticleGoogle Scholar
- Parasuraman R, Riley V: Humans and automation: Use, misuse, disuse, abuse. Human Factors 1997., 39(2):Google Scholar
- Skitka LJ, Mosier K, Burdick MD: Accountability and automation bias. International Journal of Human-Computer Studies 2000, 52(4):701–717.View ArticleGoogle Scholar
- Wickens CD: Imperfect and unreliable automation and its implications for attention allocation, information access and situation awareness. Tech. rep., University of Illinois at Urbana-Champaign; 2000.Google Scholar
- Lee JD, See KA: Trust in Automation: Designing for Appropriate Reliance. Human Factors: The Journal of the Human Factors and Ergonomics Society 2004, 46: 50–80.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.