Supervised learning for the automated transcription of spacer classification from spoligotype films
© Jeffries et al; licensee BioMed Central Ltd. 2009
Received: 21 May 2009
Accepted: 12 August 2009
Published: 12 August 2009
Molecular genotyping of bacteria has revolutionized the study of tuberculosis epidemiology, yet these established laboratory techniques typically require subjective and laborious interpretation by trained professionals. In the context of a Tuberculosis Case Contact study in The Gambia we used a reverse hybridization laboratory assay called spoligotype analysis. To facilitate processing of spoligotype images we have developed tools and algorithms to automate the classification and transcription of these data directly to a database while allowing for manual editing.
Features extracted from each of the 1849 spots on a spoligo film were classified using two supervised learning algorithms. A graphical user interface allows manual editing of the classification before export to a database. The application was tested on ten films of differing quality, and the results of the best classifier were compared to expert manual classification, giving a median correct classification rate of 98.1% (interquartile range: 97.1% to 99.2%), with an automated processing time of less than 1 minute per film.
The software implementation offers considerable time savings over manual processing whilst allowing expert editing of the automated classification. The automatic upload of the classification to a database reduces the chances of transcription errors.
Genotyping of M. tuberculosis complex isolates has enhanced TB control and contact tracing while providing valuable insights into tuberculosis transmission and pathogenesis. Recently, strain differences were found to affect clinical presentation, and unravelling the genes responsible for these phenotypic differences might lead to the identification of drug and vaccine targets.
Spacer oligonucleotide typing (spoligotype) analysis is the most user-friendly and commonly applied genotyping tool for M. tuberculosis isolates worldwide. Global spoligotype databases include 'fingerprints' from thousands of M. tuberculosis complex isolates from diverse regions. Based on hybridization of the direct repeat region of M. tuberculosis, spoligotype analysis generates reproducible binary patterns of 43 spacers, which can readily be shared electronically. This 43-spacer binary format can be transcribed as a 15-digit code, although no international standardization has been established. While spoligotype analysis lacks the resolution of the 'gold standard' genotyping method, IS6110 restriction fragment length polymorphism (RFLP), it has several advantages compared with that technique. Firstly, it relies on PCR amplification of M. tuberculosis DNA, which requires much less DNA and can be applied directly to sputum samples. Secondly, the analysis of up to 43 isolates can be completed within one day. Thirdly, isolates with fewer than 6 bands on IS6110 RFLP can be genotyped at a higher resolution by spoligotype analysis. Finally, spoligotype patterns can distinguish between subspecies and clades within the M. tuberculosis complex and are phylogenetically informative.
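As an illustration of one widely used (though, as noted, not internationally standardized) convention, the 43 binary spacers can be condensed into a 15-digit code by reading the first 42 spacers as 14 triplets, each converted to an octal digit, with the 43rd spacer appended as the final digit. A minimal sketch; the function name is ours:

```python
def spoligo_octal(binary_pattern):
    """Condense a 43-character binary spacer string into a 15-digit code.

    One common octal convention: the first 42 spacers are read as 14
    triplets, each converted to an octal digit, and the 43rd spacer is
    appended as the final digit. (No international standard exists; this
    is one widely used scheme.)
    """
    if len(binary_pattern) != 43 or set(binary_pattern) - {"0", "1"}:
        raise ValueError("expected a 43-character string of 0s and 1s")
    digits = [str(int(binary_pattern[i:i + 3], 2)) for i in range(0, 42, 3)]
    digits.append(binary_pattern[42])
    return "".join(digits)

print(spoligo_octal("1" * 43))  # all spacers present -> 777777777777771
```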
Spoligotyping generates arrays of spots, and data entry and classification are typically performed either manually, which can result in errors, or with gel-analysis software that is often expensive. Spoligotype analysis is a robust method with reproducibility of over 90%, but it can be affected by the subjective determination of the hybridization signal, especially after repeated use of the membrane. Hybridization detection is not an all-or-nothing process, and slight variations in the quality and repeatability of results require these images to be double-checked by experienced staff. Alternatively, a fully automated multiplex bead-based Luminex hybridization assay can be used. Whereas several papers have described novel data mining methods for spoligotype data [11, 12], to our knowledge none have developed software to facilitate the acquisition of the images and classification of spots on a complete spoligotype film generated by membrane-based hybridization. Commercial software packages have facilities for the semi-automated capture and classification of spoligo film images. The BioNumerics platform has a 'Character Types' module which can process spot information from tif files. Quantity One (Bio-Rad) allows images to be segmented to quantify spot information via the 'Volume Tools' function.
To support the spoligotype analysis of a Tuberculosis Case Contact study in The Gambia, we used two supervised learning algorithms, a neural network and a support vector machine, to categorize the hybridization signal as positive or negative. These algorithms were incorporated into a software package that automated the process of gridding, classification, expert verification and transcription of data from the scanned spoligo film directly into a database.
Spoligotype analysis was performed according to a standardized method using commercially prepared membranes from Isogen Biosciences (Maarsen, the Netherlands). The assay was repeated for isolates with spacers that were difficult to classify, and for those without any hybridization signal.
The spoligotype film was then scanned using a transparency-enabled scanner (Microtek ScanMaker i800) at a low resolution of 300 pixels per inch, with 256-level greyscale. Transparency-enabled scanners have a light source in the lid of the scanner as well as the base. The images were scanned to tiff files (although any common image file type can be used), then appropriately cropped and rotated so that the rows (isolates) were truly horizontal. The automated processing of spoligo films can be divided into four main processes: automatic grid placement, feature extraction, supervised learning and data storage.
Algorithm – Automatic grid placement
The image could now be sheared to make the rows and columns orthogonal, but this creates an edge effect problem and the displayed image is no longer identical to the film. For these reasons the grid lines were rotated rather than the image.
The number of columns in a film is set at 43 and the number of rows has a maximum of 43. Given these parameters it is trivial to place an initial grid, allowing for the rotation effect. Each grid line is then moved -15 to +15 pixels (0 to 15 for edges) from its initial position and the mean intensity for each position is calculated. White areas in the image have high intensity values and dark areas low values, and the grid line is moved to the position of highest intensity.
For noisy data with no well-defined dark cells the numerator tends to be large and the denominator small; for the examples in Figures 4A and 4B, Tnoise = 0.11 and 3.93 respectively. Based on trial and error, the threshold was set at 0.93, which successfully relocated the initial grid lines for all 874 lines on the ten films analysed here (note that one film had only 38 isolates), using the following criterion.
If c > 0 and Tnoise < 0.93:
    Move the grid line to the highest intensity peak
Else:
    Do not move the grid line
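The intensity search described above can be sketched as follows, shown for a vertical grid line, with Python/NumPy standing in for the Matlab implementation. The Tnoise criterion is omitted, as its exact definition depends on figures not reproduced here:

```python
import numpy as np

def relocate_vertical_line(image, x0, search=15):
    """Move a vertical grid line to the brightest nearby column.

    image: 2-D array of grey levels (white areas have high intensity).
    x0: initial column position of the grid line.
    Searches +/-`search` pixels (clamped at the image edges, matching the
    0-to-15 window used at the borders) and returns the column whose mean
    intensity is highest, i.e. the gap between spot columns.
    """
    lo = max(0, x0 - search)
    hi = min(image.shape[1] - 1, x0 + search)
    means = [image[:, x].mean() for x in range(lo, hi + 1)]
    return lo + int(np.argmax(means))

# Synthetic strip: dark background with a single bright gap at column 12.
film = np.zeros((20, 30))
film[:, 12] = 255.0
print(relocate_vertical_line(film, 10))  # 12
```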
Algorithm – Feature extraction and supervised learning
The success of any supervised learning technique depends on the ability of the features extracted to discriminate. There is no prescriptive rule for the determination of these features and the selection difficulty is compounded in this case by the wide range of film quality and the large number of cells that have to be processed. Features that discriminate on low noise/high quality films might not be the same as those on films of inferior quality. Given that the automated implementation has to run in real time, a small number of features should be extracted from each cell. Initially simple summary statistics are extracted from each cell, i.e. the mean and the median pixel intensity for each cell.
The largest dark region within a cell is then selected, as follows. The intensity of the pixels is thresholded, by setting those below the lower quartile intensity for the cell to white and the remaining pixels to black. The largest of the resulting black regions is selected as the region of interest. The vertical and horizontal profiles of the cell pixels through the centre of this region are then extracted and smoothed. The profiles are smoothed using a B-splines smoother, with a uniformly high degree of smoothing, to reduce the number of noisy peaks. The B-splines smoother is appropriate for a real-time application as it has a fast computational implementation. The amplitude of each smoothed profile and the number of minima are calculated. Positive cells typically yield profiles having well-defined, centrally located minima and large amplitudes, in contrast to negative cells.
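A simplified sketch of this feature extraction, in Python/NumPy as a stand-in for the Matlab implementation. It centres the profiles on the darkest pixel rather than the centroid of the largest below-lower-quartile region, and a moving-average smoother stands in for the B-splines smoother, so its feature count and values differ slightly from the full implementation:

```python
import numpy as np

def cell_features(cell):
    """Simplified feature vector for one spoligotype cell (2-D grey array).

    Computes the mean and median intensity, then the amplitude and number
    of interior minima of smoothed horizontal and vertical profiles through
    the darkest pixel. The darkest pixel stands in for the centroid of the
    largest dark region, and a moving average stands in for the B-splines
    smoother, so this is a sketch rather than the exact method.
    """
    mean, median = cell.mean(), np.median(cell)
    r, c = np.unravel_index(np.argmin(cell), cell.shape)

    def smooth(profile, width=5):
        return np.convolve(profile.astype(float),
                           np.ones(width) / width, mode="same")

    def amplitude_and_minima(profile):
        p = smooth(profile)
        interior = p[1:-1]
        n_minima = int(np.sum((interior < p[:-2]) & (interior < p[2:])))
        return p.max() - p.min(), n_minima

    h_amp, h_min = amplitude_and_minima(cell[r, :])
    v_amp, v_min = amplitude_and_minima(cell[:, c])
    return np.array([mean, median, h_amp, h_min, v_amp, v_min])

# A positive cell: dark spot on a light background gives large amplitudes.
cell = np.full((11, 11), 200.0)
cell[4:7, 4:7] = 20.0
print(cell_features(cell))
```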
This gives seven features which are fast to compute (computations have to be repeated for up to 43 × 43 cells) and are able to discriminate between positive and negative cells from a range of different quality films. At the expense of information, the dimensionality of multivariate training sets can be reduced using principal component analysis. For this application, where the dimensionality of the features is small and a wide range of film qualities may have to be processed, there is little computational advantage in attempting to reduce the dimensionality.
Based on a training set of the seven extracted features where the true classification of the cell is known, supervised learning algorithms can classify cells based on unseen sets of cell features. The implementation of two common supervised learning algorithms, a neural network and a support vector machine are described here and compared in the following section.
A feed-forward neural network links inputs to outputs by a series of one-way connections. The inputs are the seven features for each cell and there are two outputs representing a positive or negative cell. A unary encoding is used for the outputs in the form of a two-element vector [a b], where a perfect prediction is [0 1] for a positive cell and [1 0] for a negative cell.
Between the input and output layers there is at least one set of neurons, which connects the inputs to the relevant outputs for the training set. More complex relationships can be modelled by increasing the number of neurons and/or the number of layers between the inputs and outputs. There is no prescriptive method for determining the number of neurons or hidden layers; for the films classified here, 2 layers with 5 neurons in each layer gave good performance across all films. The number of layers and neurons are parameters that can easily be changed. The network was trained using back-propagation, and the Levenberg-Marquardt algorithm was used to optimise the weights. One of the key problems with neural networks is overfitting, where the error on the training set is very small but the network does not generalise well when used to process unseen data. To overcome this problem the training sets are divided into two: a training set and a validation set. The validation set is classified at each iteration (epoch) during training, based on the weights obtained from the training set only. If the network overfits the data, the error on the validation set tends to increase. An early stopping rule is applied if this occurs for 6 consecutive iterations, and the network based on the minimum of the validation error is returned. The requirement for the neural network to converge in a real-time implementation restricts the size of the training set. Simulation showed little improvement in classification performance for sets of more than 3% of the cells, split evenly between the training and the validation set.
Support vector machines are closely related to neural networks. They construct a hyperplane separating the input vectors into two categories and so are ideal for binary classification problems. Support vector machines have integral protection against overfitting, creating a soft margin between categories that allows some misclassifications. To allow non-linear separations a kernel function is used to transform the data, and in contrast to determining the numbers of neurons and hidden layers for a neural network, only an appropriate kernel needs to be chosen for a support vector machine. Simulation showed that a linear kernel function gave the most appropriate fit, with non-linear fits showing overfitting problems. In common with the neural network simulation, a training set of 1.5% of the cells was adequate. Given the integral protection against overfitting and the use of a linear kernel, no further control for overfitting was necessary and a validation set was not required.
A neural network is initialized with a set of random weights, so repeat training sessions will converge to different classifications from the same training set. This is in contrast to a support vector machine, which has a one-to-one relationship between the training set and the resulting classification.
Algorithm – Data storage and retrieval
The data from the classified films are automatically uploaded to an Access database, although any common database system, such as SQL Server or Oracle, could be substituted. Each film is stored as a separate table, where each row is labelled with a unique isolate identifier and a row position identifier. To enhance data quality and portability, film images are stored directly within the database (see implementation for technical details).
The algorithms above were tested on ten films, whose quality ranged from very high to images with uneven exposure and noisy artefacts. The gold standard for comparison with the supervised learning algorithms was provided by manual classification by an experienced laboratory technician, verified by a senior scientist. Visual confirmation by the above staff showed good automatic grid placement for the ten films, and a neural network and a support vector machine were applied to each spoligotype film using identical training sets.
A rank-based analysis of variance on the percentage of correct classifications, allowing for the clustering effect of film, shows no significant difference (p = 0.90) between the classifiers for the 10-fold cross-validation. For the data simulated from smaller training sets, the support vector machine showed a borderline significant improvement (p = 0.047) over the neural network. The smaller training sets, compared with those from the 10-fold cross-validation, resulted in a significant (p < 0.001) performance degradation for both classifiers. Generally, for realistic training sets the support vector machine gives slightly better classification, shows less dependence on the training set and is computationally faster to fit. In terms of a real-time software implementation for small training sets, there is little functional difference between the two methods.
The best films have median correct classification rates above 99%. For the poorer quality films there is, as expected, a lower classification rate and a greater dependence on the training set selection. The results in Figure 8 are conservative, as the training sets were chosen randomly and there was no constraint to include training cells having diverse characteristics. In reality, decisions about difficult-to-classify cells may depend on neighbouring cells and exposure artefacts, which is why the software implementation facilitates user re-classification. The worst performance occurred for the third film, which has a very uneven spot exposure, making classification difficult even by manual inspection.
There are many image processing morphological operations and filters that attempt to improve the quality of an image, but these had little effect on the classification error for the poorer quality films. The disadvantage of image transformations is that the film image can be noticeably different from the actual film, which does not promote user confidence. The only transformation that was universally applied and accepted was a simple contrast stretch.
Software for the preceding applications was developed in Matlab (The Mathworks Inc, version R2008b), which offers a rich program development environment and dedicated image processing and neural network toolboxes. Matlab is compatible with Mac OS X, Linux and Windows operating systems. Although it is ideal for developing and testing code, the language is interpreted, which makes some of the more intensive numerical operations relatively slow, affecting the user-friendliness of this interactive software application. Code written and compiled in C++ can be incorporated into Matlab applications, resulting in faster execution speed, often greater than 10-fold. The fast Fourier transform, two-dimensional interpolation, and the morphological operations are written in C++.
The code is controlled from a graphical user interface and films are defined by their image name, where the actual image can be in any of the common graphical file types. The code then runs (on a machine with a Xeon 3.4 GHz CPU and 1 GB of RAM) in the following order (user actions in italics):
1. User selects film for processing from file browser
2. Automatic grid placement (5 seconds)
3. User can manually move grid lines if required
4. Extracts features from all cells (15 seconds)
5. User chooses training set and marks any regions excluded from classification
6. User chooses classifier (Neural network or support vector machine)
7. Trains and displays classified film (1 second)
8. User verifies and edits classification
9. Writes classification and film image to a Microsoft Access database (1 second)
Even though the process is completely automatic, the user can override any of the computer-generated results. Every line of the grid is a selectable graphics object, and the lines can be moved left, right, up or down one pixel at a time. After classification, any of the cell classifications can be overwritten before the results are uploaded to a database. Connection to the Access database is via an OLE (object linking and embedding) database connector, and all database actions are written automatically from the Matlab code using SQL (structured query language), so no knowledge of Access is required. Rather than storing a link to the spoligotype film image, the actual film image is stored directly in the database for integrity and portability. This requires the image to be converted to a BLOB (binary large object) and written from Matlab to an OLE field in the relevant database table. A typical-resolution spoligotype film, after resizing to a quarter of its original size, consumes 500 kB of space in the database table. Storage within a single Access database is limited only by the product limit of 2 GB, a constraint that would not affect a server-side database such as SQL Server.
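The BLOB round trip can be illustrated with Python's built-in sqlite3 module standing in for the Access/OLE connector; the table and column names are illustrative:

```python
import sqlite3

# In-memory database; the table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (film_name TEXT PRIMARY KEY, image BLOB)")

# A placeholder byte string stands in for a real scanned tiff image.
fake_tiff_bytes = b"II*\x00" + b"\x00" * 64
conn.execute("INSERT INTO films VALUES (?, ?)",
             ("film_01", sqlite3.Binary(fake_tiff_bytes)))
conn.commit()

# The image round-trips byte-for-byte, so no separate image files are needed.
stored = conn.execute("SELECT image FROM films WHERE film_name = ?",
                      ("film_01",)).fetchone()[0]
assert bytes(stored) == fake_tiff_bytes
```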
The results in Figure 8 were obtained using randomly selected training sets, but intelligent choice of the training set is likely to give superior performance. The film in Figure 11 is film 1 from Figure 8 and had a median correct classification rate of 97.4%. The support vector machine classification displayed in Figure 11 correctly classified 98.1% of cells, with 8 cells having weak classification (marked with green ticks and crosses), including 3 incorrectly classified cells.
Films can be re-classified at any time and users are warned before updating any existing classifications in the database. An additional graphical tool displays previously classified films and allows cell classifications to be viewed and if necessary edited.
We developed a method for automating the transcription of spacer classification from spoligo films. With a large throughput of films from the laboratory, the process had to offer a high degree of automation, but allow manual re-classification and provide a systematic method for storing the data. There appear to be no commercial packages exclusively for the processing of spoligo films.
Matlab is a high level programming language with excellent graphical capabilities, dedicated image processing, supervised learning tools and sophisticated native support for database connectivity, which make it ideal for processing and managing spoligo film data. C++ can also be incorporated into Matlab programs, allowing this application to run within a realistic timeframe.
After scanning, the processing of the spoligo films was divided into four categories: grid application, classification, manual verification and data storage. The automated processes run in less than 1 minute, which is considerably less than manual processing. The implementation described here uses either a neural network or a support vector machine as the classification tool. Both provide robust classification, with the support vector machine showing superior performance on realistically sized training sets, although in terms of the number of cells correctly identified the difference is of little practical relevance. Given the speed of film-specific classifiers there is little advantage in using the same global classifier for multiple films. There are many other types of classification algorithm, which could easily be incorporated into the modular code.
The Matlab source code is available for download from the 'statistics and data management' page of The Gambia MRC website http://www.mrc.gm and from the corresponding author. It requires the image processing, neural network and bioinformatics (for the support vector machine) toolboxes. Bespoke code or alternative classifiers could be used to remove the reliance on toolboxes. Data are currently output to a Microsoft Access 2007 database, but there is native support for all the database systems most commonly used in medical research. The modular nature of the code makes it straightforward to add additional output formats. For example, the input format for the spolTools software is text with the format "name: binary pattern: cluster size" (where cluster size refers to the number of isolates with a particular pattern). The Matlab code stores output data in a two-dimensional matrix which can easily be written to a text file, from where it could be pasted into the spolTools page. Although this paper has concentrated on films with 43 spacers, the only limitation on the addition of further spacers is the size of the scanner.
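Producing such a text export is straightforward in any language; a hypothetical Python sketch with illustrative isolate names (here one line per isolate; grouping to one line per unique pattern is equally simple):

```python
from collections import Counter

# Illustrative classified isolates: name -> 43-character binary pattern.
patterns = {
    "isolate_001": "1" * 43,
    "isolate_002": "1" * 43,
    "isolate_003": "1" * 20 + "0" * 23,
}

# Cluster size = number of isolates sharing a particular pattern.
cluster_size = Counter(patterns.values())

lines = [f"{name}: {pattern}: {cluster_size[pattern]}"
         for name, pattern in sorted(patterns.items())]
print("\n".join(lines))
```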
Many aspects of the gridding, feature extraction, and supervised learning are generally applicable to other laboratory image-based analyses, and could be adapted to the analysis of microarray or well-plate images. Further research will optimise the software implementation, explore the development of platform independent executable code and compare the speed and accuracy of traditional manual data entry against classification with the automated method.
We developed a user friendly software package that can capture and classify data generated by reverse hybridization methods, such as spoligotype analysis. Although fully automated the software allows for manual editing, before uploading the data to an Access database. Tools are also provided to visualize existing data and make retrospective changes. The software is publicly available, and potential users can contact us for assistance, modification and additions to the software.
Availability and requirements
The Matlab (version R2008b) source code, user guide and anonymous test film images are available from the 'statistics and data management' page of The Gambia MRC website http://www.mrc.gm and from the corresponding author. Note that the source code requires the image processing, neural network and bioinformatics toolboxes.
We thank Christopher Sola for his insightful comments. This work was supported by MRC funds and by NIH grant TW006083. We also thank Dr Margaret Pinder for her assistance with the preparation of the paper.
- Daley CL: Molecular epidemiology: a tool for understanding control of tuberculosis transmission. Clin Chest Med 2005, 26:217–231. doi:10.1016/j.ccm.2005.02.005
- Malik AN, Godfrey-Faussett P: Effects of genetic variability of Mycobacterium tuberculosis strains on the presentation of disease. Lancet Infect Dis 2005, 5:174–183.
- Filliol I, Driscoll JR, van Soolingen D, et al.: Snapshot of moving and expanding clones of Mycobacterium tuberculosis and their global distribution assessed by spoligotyping in an international study. J Clin Microbiol 2003, 41:1963–1970. doi:10.1128/JCM.41.5.1963-1970.2003
- Kamerbeek J, Schouls L, Kolk A, van Agterveld M, van Soolingen D, Kuijper S, Bunschoten A, Molhuizen H, Shaw R, Goyal M, van Embden J: Simultaneous detection and strain differentiation of Mycobacterium tuberculosis for diagnosis and epidemiology. J Clin Microbiol 1997, 35:907–914.
- Dale JW, Brittain D, Cataldi AA, et al.: Spacer oligonucleotide typing of bacteria of the Mycobacterium tuberculosis complex: recommendations for standardised nomenclature. Int J Tuberc Lung Dis 2001, 5:216–219.
- Gori A, Esposti AD, Bandera A, et al.: Comparison between spoligotyping and IS6110 restriction fragment length polymorphisms in molecular genotyping analysis of Mycobacterium tuberculosis strains. Mol Cell Probes 2005, 19:236–244. doi:10.1016/j.mcp.2005.01.001
- Ferdinand S, Valetudie G, Sola C, Rastogi N: Data mining of Mycobacterium tuberculosis complex genotyping results using mycobacterial interspersed repetitive units validates the clonal structure of spoligotyping-defined families. Res Microbiol 2004, 155:647–654. doi:10.1016/j.resmic.2004.04.013
- Zanden AG, Kremer K, Schouls LM, Caimi K, Cataldi A, Hulleman A, Nagelkerke NJD, van Soolingen D: Improvement of differentiation and interpretability of spoligotyping for Mycobacterium tuberculosis complex isolates by introduction of new spacer oligonucleotides. J Clin Microbiol 2002, 40:4628–4639. doi:10.1128/JCM.40.12.4628-4639.2002
- Kremer K, van Soolingen D, Frothingham R, et al.: Comparison of methods based on different molecular epidemiological markers for typing of Mycobacterium tuberculosis complex strains: interlaboratory study of discriminatory power and reproducibility. J Clin Microbiol 1999, 37:2607–2618.
- Cowan LS, Diem L, Brake MC, Crawford JT: Transfer of a Mycobacterium tuberculosis genotyping method, spoligotyping, from a reverse line-blot hybridization, membrane-based assay to the Luminex multianalyte profiling system. J Clin Microbiol 2004, 42:474–477. doi:10.1128/JCM.42.1.474-477.2004
- Driscoll JR, Bifani PJ, Mathema B, McGarry MA, Zickas GM, Kreiswirth BN, Taber HW: Spoligologos: a bioinformatic approach to displaying and analyzing Mycobacterium tuberculosis data. Emerg Infect Dis 2002, 8:1306–1309.
- Sebban M, Mokrousov I, Rastogi N, Sola C: A data-mining approach to spacer oligonucleotide typing of Mycobacterium tuberculosis. Bioinformatics 2002, 18:235–243. doi:10.1093/bioinformatics/18.2.235
- Hill PC, Brookes RH, Fox A, et al.: Large-scale evaluation of enzyme-linked immunospot assay and skin test for diagnosis of Mycobacterium tuberculosis infection against a gradient of exposure in The Gambia. Clin Infect Dis 2004, 38:966–973. doi:10.1086/382362
- Ripley BD: Pattern Recognition and Neural Networks. CUP, UK; 2004.
- Cristianini N, Shawe-Taylor J: An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. CUP, UK; 2002.
- Eilers PHC, Marx BD: Flexible smoothing with B-splines and penalties. Statistical Science 1996, 11:89–102. doi:10.1214/ss/1038425655
- Mardia KV, Kent JT, Bibby JM: Multivariate Analysis. Academic Press, London, New York; 1979.
- Demsar J: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 2006, 7:1–30.
- Tang C, Reyes JF, Luciani F, Francis AR, Tanaka MM: spolTools: online utilities for analyzing spoligotypes of the Mycobacterium tuberculosis complex. Bioinformatics 2008, 24(20):2414–2415. doi:10.1093/bioinformatics/btn434
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.