Benchmark for multi-cellular segmentation of bright field microscopy images

Background: Multi-cellular segmentation of bright field microscopy images is an essential computational step when quantifying the collective migration of cells in vitro. Despite the availability of various tools and algorithms, no publicly available benchmark has been proposed for evaluating and comparing the different alternatives.

Description: A uniform framework is presented to benchmark algorithms for multi-cellular segmentation in bright field microscopy images. A freely available set of 171 manually segmented images from diverse origins was partitioned into 8 datasets and used to evaluate three leading tools designed for this task.

Conclusions: The presented benchmark resource for evaluating segmentation algorithms of bright field images is the first public annotated dataset for this purpose. This annotated dataset of diverse examples allows fair evaluation and comparison of future segmentation methods. Scientists are encouraged to assess new algorithms on this benchmark and to contribute additional annotated datasets.

Brief description of the evaluated algorithms: TScratch [1] is freely available software that uses the fast discrete curvelet transform [2] to segment and measure the area occupied by cells in an image. The curvelet transform extracts gradient information at several scales, orientations, and positions in a given image, and encodes it as curvelet coefficients. TScratch selects two scale levels that fit the gradient details found in cell contours, and combines them to generate a curvelet magnitude image, which incorporates the details of the original image at the selected scales. Morphological operators are then applied to refine the curvelet magnitude image. As a final step, an automatic threshold partitions the curvelet magnitude image into occupied and free regions. This approach was first applied to edge detection in microscopy images [3]. TScratch can be used via a GUI or directly from the source code; both are available, together with a detailed user manual, at http://www.cse-lab.ethz.ch/index.php?&option=com_content&view=article&id=363. It is by far the most developed software of the three options, featuring a convenient user interface, the ability to set several parameters, and compatibility with most platforms.
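A full curvelet transform is beyond a short example, but the overall pipeline (multi-scale edge magnitude, combination, morphological refinement, automatic thresholding) can be sketched with Gaussian gradient magnitudes standing in for the curvelet coefficients; the scales and the naive global threshold below are illustrative choices, not TScratch's actual values:

```python
import numpy as np
from scipy import ndimage as ndi

def segment_two_scale(img, sigma_fine=1.0, sigma_coarse=3.0):
    """Sketch of a TScratch-style pipeline: combine edge magnitude at two
    scales, refine with morphology, and threshold into occupied/free regions.
    Gaussian gradient magnitude stands in for the curvelet coefficients."""
    img = img.astype(float)
    fine = ndi.gaussian_gradient_magnitude(img, sigma_fine)
    coarse = ndi.gaussian_gradient_magnitude(img, sigma_coarse)
    magnitude = fine + coarse                      # combined "magnitude image"
    mask = magnitude > magnitude.mean()            # placeholder automatic threshold
    mask = ndi.binary_closing(mask, iterations=2)  # bridge gaps along cell contours
    return ndi.binary_opening(mask)                # drop isolated noise responses
```

On an image whose cell-covered regions are textured and whose free regions are flat, the combined magnitude is high over cells and near zero elsewhere, so the thresholded mask separates the two.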
Topman et al. [4] used the standard deviation of pixel intensities as a measure of texture. It is calculated at two different scales for every given image, producing an image of texture homogeneities for each scale. A threshold is automatically defined from the smaller scale, as half the intensity of the highest local maximum in the histogram of the texture image. This threshold is applied to segment the texture images of both scales. The final segmentation is the intersection of the two segmented masks, followed by morphological operators. Source code is available in [4].
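The two-scale texture pipeline can be sketched as follows. The local standard deviation is computed with uniform filters; the threshold below is a simplified stand-in for the paper's histogram-based rule, and the window sizes default to the 9- and 25-pixel scales mentioned above:

```python
import numpy as np
from scipy import ndimage as ndi

def local_std(img, size):
    """Standard deviation of pixel intensities in a size x size window."""
    img = img.astype(float)
    mean = ndi.uniform_filter(img, size)
    sq_mean = ndi.uniform_filter(img * img, size)
    return np.sqrt(np.maximum(sq_mean - mean * mean, 0))

def segment_texture(img, small=9, large=25, thr=None):
    """Two-scale texture segmentation in the spirit of Topman et al.:
    threshold both texture images with a threshold derived from the
    smaller scale, then intersect the masks and clean up morphologically."""
    t_small, t_large = local_std(img, small), local_std(img, large)
    if thr is None:
        thr = 0.5 * t_small.max()  # simplified stand-in for the histogram rule
    mask = (t_small > thr) & (t_large > thr)
    return ndi.binary_opening(ndi.binary_closing(mask))
```

Textured (cell-covered) regions have a high local standard deviation at both scales, while flat background falls below the threshold at either scale, so the intersection suppresses spurious single-scale responses.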
MultiCellSeg [5] is freely available software that applies machine-learning-based classification to local patches within an image. Each patch is represented by basic image features and fed to a designated pre-trained support vector machine [6] (using the LIBSVM implementation [7]). Post-processing includes additional classification of larger connected components and graph-cut segmentation to reclassify erroneous regions and refine the segmentation. A basic GUI and the source code are available at http://www.cs.tau.ac.il/~assafzar/. A main feature is that MultiCellSeg requires no parameter tuning, as the classifiers are computed from the training images. However, it is the slowest of the three algorithms presented here.
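A minimal sketch of the patch-classification idea, using scikit-learn's SVC (which wraps LIBSVM). The patch size, the feature set, and the majority-vote labeling rule are illustrative assumptions, and MultiCellSeg's post-processing steps are omitted:

```python
import numpy as np
from sklearn.svm import SVC

def patch_features(patch):
    """Basic intensity and texture features for one patch (illustrative choice)."""
    return [patch.mean(), patch.std(),
            np.abs(np.diff(patch, axis=0)).mean(),   # vertical roughness
            np.abs(np.diff(patch, axis=1)).mean()]   # horizontal roughness

def train_patch_classifier(images, masks, size=16):
    """Train an SVM (LIBSVM via scikit-learn) on labeled patches.
    A patch is labeled 'occupied' (1) if most of its ground-truth
    pixels belong to cells, otherwise 'free' (0)."""
    X, y = [], []
    for img, gt in zip(images, masks):
        for i in range(0, img.shape[0] - size + 1, size):
            for j in range(0, img.shape[1] - size + 1, size):
                X.append(patch_features(img[i:i+size, j:j+size]))
                y.append(int(gt[i:i+size, j:j+size].mean() > 0.5))
    return SVC(kernel='rbf', gamma='scale').fit(X, y)
```

Because the classifier is learned from annotated training images, no per-image parameter has to be tuned at segmentation time, which mirrors the parameter-free property noted above.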
Parameter tuning: For all algorithms, RGB images were converted to gray-level images by the built-in Matlab function rgb2gray. The set of parameters for each algorithm was set once and then used to compare the algorithms' performance across all 8 datasets. MultiCellSeg is parameter-free, so no parameters were set. TScratch was evaluated with its default set of parameters. The algorithm by Topman et al. [4] was evaluated using scales of 9 and 25 pixels to match the values reported in the paper; small changes to these values had a minor effect on the algorithm's performance. Additional attention was given to the thresholding method, as described below.
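Outside of Matlab, the same conversion can be reproduced with the ITU-R BT.601 luminance weights that rgb2gray uses (Matlab additionally rounds and casts for integer inputs, which is omitted in this sketch):

```python
import numpy as np

def rgb2gray(rgb):
    """Gray-level conversion with the BT.601 luminance weights,
    the same weights used by Matlab's built-in rgb2gray."""
    return 0.2989 * rgb[..., 0] + 0.5870 * rgb[..., 1] + 0.1140 * rgb[..., 2]
```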
Extensive evaluation of Topman's thresholding: While examining Topman's algorithm, we discovered that the automatic threshold extraction method performed worse than a constant threshold. We therefore evaluated the performance of different values for this parameter. The best performance for 6 of the 8 datasets was achieved at a threshold value of 0.03. These results are summarized in Additional file 6: Table S3. Next, we tried to set the threshold parameter based on the training images: the threshold value was selected to maximize the average F-measure over the images in the training set, and then applied to the test datasets. The performance was similar to the results obtained before; see Table 1 for details.
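The training-based threshold selection amounts to a simple sweep over candidate values. This sketch assumes the texture images are already computed, and picks the candidate that maximizes the average pixelwise F-measure against the training ground truth:

```python
import numpy as np

def f_measure(pred, gt):
    """Pixelwise F-measure (harmonic mean of precision and recall)
    between a predicted and a ground-truth binary mask."""
    tp = np.logical_and(pred, gt).sum()
    if tp == 0:
        return 0.0
    precision, recall = tp / pred.sum(), tp / gt.sum()
    return 2 * precision * recall / (precision + recall)

def select_threshold(texture_images, gt_masks, candidates):
    """Return the candidate threshold maximizing the average F-measure
    over the (texture image, ground truth) training pairs."""
    scores = [np.mean([f_measure(t > thr, gt)
                       for t, gt in zip(texture_images, gt_masks)])
              for thr in candidates]
    return candidates[int(np.argmax(scores))]
```

The selected threshold is then held fixed and applied unchanged to the test datasets, so no test-set information leaks into the parameter choice.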

Assessing the baseline variance in annotation:
A second expert annotated a subset of arbitrarily chosen images from 7 of the 8 datasets (excluding "Scatter", which has only 6 images). To assess the baseline variance introduced by different human annotators, we used the average F-measure to compare the second annotation to the official ground-truth annotation, in the same manner used when comparing the different algorithms. Four datasets ("Init", "SN15", "HEK293", and "MDCK") achieved an agreement F-measure of 0.98 or higher, while "TScratch" had a median F-measure of 0.95. To gain better confidence in the two remaining high-variance datasets, a second annotation was performed on all images in these sets: the "Melanoma" median F-measure was 0.93, while "Microfluidics" was 0.81. Indeed, the performance of all algorithms on "Microfluidics" was significantly inferior to that on the other datasets (Table 1). Example visualizations of the agreements and inconsistencies between the two annotations on images from the less consistent datasets are presented in Additional file 5: Figure S2. We conclude that datasets of scattered cells have a higher baseline variance, since the F-measure is sensitive to inconsistent annotations at cell borders (the advantages of the F-measure are its simplicity and its sensitivity: when the score is high, the annotations are certainly very similar). Additional file 4: Table S2 summarizes all baseline variance results. The second annotation is also provided as part of the benchmark.