Learning Sparse Representations for Fruit-Fly Gene Expression Pattern Image Annotation and Retrieval

Background Fruit fly embryogenesis is one of the best understood animal development systems, and the spatiotemporal gene expression dynamics in this process are captured by digital images. Analysis of these high-throughput images will provide novel insights into the functions, interactions, and networks of animal genes governing development. To facilitate comparative analysis, web-based interfaces have been developed to conduct image retrieval based on body part keywords and images. Currently, the keyword annotation of spatiotemporal gene expression patterns is conducted manually. However, this manual practice does not scale with the continuously expanding collection of images. In addition, existing image retrieval systems based on the expression patterns may be made more accurate using keywords. Results In this article, we adapt advanced data mining and computer vision techniques to address the key challenges in annotating and retrieving fruit fly gene expression pattern images. To boost the performance of image annotation and retrieval, we propose representations integrating spatial information and sparse features, overcoming the limitations of prior schemes. Conclusions We perform systematic experimental studies to evaluate the proposed schemes in comparison with current methods. Experimental results indicate that the integration of spatial information and sparse features lead to consistent performance improvement in image annotation, while for the task of retrieval, sparse features alone yields better results.


Background
Embryos undergo a temporally ordered differentiation process, starting as basic undifferentiated eggs. Through the process of differentiation, gene expressions take on increasingly complex patterns. Transcriptional regulation of the fruit-fly Drosophila melanogaster is one of the best understood examples of the regulatory networks that govern gene expression patterning. An understanding of the regulatory networks responsible for gene patterning in Drosophila embryos has been aided by digital images produced via in situ hybridization [1][2][3]. These images *Correspondence: jieping.ye@asu.edu 1 Center for Evolutionary Medicine and Informatics, The Biodesign Institute, Arizona State University, Tempe, AZ 85287, USA 2 Ira A. Fulton Schools of Engineering, Arizona State University, Tempe, AZ 85287, USA Full list of author information is available at the end of the article document the spatiotemporal dynamics of differentiation found in Drosophila embryos. A comparative analysis of these images is beneficial for the understanding of functions and interactions in gene networks [4][5][6][7][8][9][10][11][12][13][14]. To facilitate these discoveries, tools have been developed to searching for images based on keywords that describe embryonic structures [15], and searching for images based on gene expression patterns [13,14]. Images for these tools have been obtained from databases of Drosophila embryonic images, e.g. the Berkeley Drosophila Genome Project (BDGP), and they are annotated with a controlled vocabulary (CV) [1,2] (Figure 1). The CV terms describe the developmental and anatomical properties of gene expression during embryogenesis [1]. Currently, groups of BDGP images are manually annotated with CV terms. This is done collectively so that not all images in a group necessarily correspond with each CV annotation. The http://www.biomedcentral.com/1471-2105/13/107 manual nature of these tasks puts an inordinate burden on biologists as the collection of Drosophila gene expression patterns are growing rapidly [1]. It is therefore imperative to investigate efficient and effective computational methods to automate this task [16][17][18].
Image annotation and image retrieval problems have been studied extensively in computer vision and machine learning. However, natural images are the most common subjects of study for image annotation and image retrieval problems; and commonly-used annotation and retrieval techniques may not be effective for our task. For example, unlike most natural images, BDGP images have all been aligned and scaled semi-automatically. The binary feature vector (BFV) representation have been developed correlate pattern similarities between images [13], however the BFV representation is not robust to distortions; there were also some studies which tried to use robust descriptors to represent the BDGP images [19][20][21][22], however they have not exploited spatial information. It is desirable to represent images in a way that takes advantage of the spatial properties of image features, while at the same time being robust to image distortions. In our annotation problem, we are interested in collectively annotating groups of images, with each group annotated with multiple CV terms. Previous studies have revealed that ignoring group memberships can be detrimental to annotation performance [19], and formulating the task as learning the function between local input patterns and CV terms lead to significant performance improvement [21].
In this article we propose a novel approach for the automated annotation and retrieval of Drosophila melanogaster images. We present an image representation model that takes advantage of the spatial information provided by the BDGP images while at the same time being more robust against distortions. We also take advantage of a state-of-the-art learning model in order to boost the performance of our tasks. Our feature representation framework is inspired by the spatial bag-of-words (BoW) approach for image representation. The BoW approach involves first extracting features from local patches on images. These patches are then quantized to a visual word that has been determined by a pre-computed codebook. Our approach involves extracting these local patches from each image in a group, while maintaining a record of the locations where features are extracted. Thus, our bag-of-words method is essentially a spatial-bag-of-words method. As previous experiments have discovered [16], using only one codebook word to describe a local patch does not capture the slight differences between a word and the actual feature. Therefore, we have adopted a sparse learning framework in order to take advantage of multiple codebook words that show varying levels of similarity to a single feature, leading to a "visual sentence" representation of the image patch.
We have tested our methods on BDGP images from the FlyExpress database (www.flyexpress.net). Annotation results from our study show that the spatial-bag-of-words approach consistently outperforms the non-spatial, bagof-words approach as well as the binary feature vector approach. Results also show that incorporating the sparse learning framework into our representation model further improves performance. While for the image retrieval task, experiments show that utilizing the sparse representation alone is sufficient.

Methods
In this section, we describe the bag-of-words (BoW) and the sparse learning representations for gene expression pattern image annotation and retrieval.

The bag-of-words approach
The bag-of-words method was originally used for text classification problems where each document is represented as a feature vector indicating the frequency of each word in the document. Such feature vector representation is used to classify documents into one or more categories. This text categorization approach has been adapted to image analysis [23]. Specifically, images are represented as http://www.biomedcentral.com/1471-2105/13/107 a collection of "visual words", based on features extracted from the images [24].
In the BoW approach for image representation, invariant visual features are usually extracted from a subset of images [24] to produce a visual codebook using a clustering algorithm, though a recent study shows that the clustering process is not really essential [25]. Here the cluster centers are considered to be visual words. From this codebook, each feature from an image patch is quantized to the closest visual word in the codebook. A histogram is then created to represent the number of occurrences of each word located in an image. This histogram is a global representation because it only tracks the number of occurrences of each word in an image but not the location of those words, thereby the spatial layout of local image features is not captured. This is considered as one of the major drawbacks of the BoW model [19]. Next, we discuss each step involved in the BoW model when applied to fruit fly images in details.

Feature detection
Feature detection involves locating regions in an image to serve as representative boundaries for visual words. We are using images that have been properly scaled and aligned semi-automatically. We use a series of overlapping circles to represent areas where feature information is extracted to construct a single visual word. An example of these overlapping circles is shown in Figure 2. In our experiments, the radius of the patches are set to 16.

Feature description
Based on the regions described above, a local feature is extracted from each of the overlapping circle. Because of its robustness against variations in image scale and rotation, we use the scale-invariant feature transform (SIFT) x W0 x W1 x descriptor [26] for representing each local patch. Thus, each image consists of a collection of feature vectors.

Codebook generation
The codebook is constructed by obtaining a collection of representative vectors from the extracted features. We use the common generation approach of selecting a subset of images and then using the k-means algorithm to cluster their SIFT feature vectors [27]. The number of cluster centers which represent the visual words can be set manually. For our image annotation and retrieval problem, we have set this number to 2000. The SIFT feature vectors can then  be quantized to the closest codebook centers in order to form a visual word representation for each image.
Once the codebook has been created, we can assign codebook words to features extracted from image patches. Formally, assume the number of patches (feature vectors) for a given image is I and the size of the codebook is J. Define e ij = 1 if the i th feature vector is assigned to the j th codeword, and 0 otherwise. Then the given image can be represented as ( 1 )

The spatial bag-of-words approach
A major limitation of the BoW approach is that the spatial information of local image features is not encoded, as the bag-of-words representation is an un-ordered collection of visual words. A previous study on a bag-of-words approach [19] for automated annotation of Drosophila embryo image groups showed encouraging results, and a recent study [21] showed that using spatial information together with visual information is better than using only visual information. We expect the performance can be further improved by taking advantage of the spatial information, i.e., the location where visual words are found within images. Intuitively, the additional spatial information of visual words within images may facilitate the classification of images when the discriminant features are restricted to a certain region, which is the case for our CV terms. This can be implemented by adopting a method similar to the spatial pyramid matching scheme [28].
Our approach for image representation is based on an implementation of the spatial bag-of-words method. Like the BoW method, the spatial BoW method creates a histogram for each image, counting the number of times each word appears in an image. Additionally, the spatial BoW tracks the position where each visual word is located. Therefore, the spatial BoW method benefits from the robustness of the BoW method while also taking advantage of the spatial properties of images.
A spatial bag-of-words is much like a normal bag-ofwords except that it is represented by a larger feature vector. While a histogram of an image is represented by a non-spatial bag-of-words, H, a spatial bag-of-words consists of multiple non-spatial bags, concatenated. Specifically, for each image with n spatial sections, a spatial bag M n can be represented as M n = [H 1 , H 2 , . . . , H n ], where each H i corresponds to a non-spatial bag-of-words for a particular spatial section. Thus we have n bagsof-words from n spatial sections on each image that are concatenated to form M n . This way, different sections of a spatial vector represent different sections of an image. Our automated annotation representation is created by partitioning feature patches into 3 by 6 sections on each image. This representation creates a multiple of 18 in added dimensionality to a non-spatial representation of the same visual words. For each image group in the study we also create a global bag-of-words representation to test the differences in annotation performance that are seen between the global and the spatial approaches. Figure 2 shows a global bag-of-words representation, a 2 by 2 spatial BoW representation, and a 4 by 4 spatial BoW representation Figure 5 The AUC of individual terms on three data sets from stage range 13-16. The three figures, from top to down, show the performance with 40, 50, and 60 terms, respectively. http://www.biomedcentral.com/1471-2105/13/107 below the circular feature representations of two separate images.

The sparse spatial representation
The original BoW representation, as applied to image analysis, assigns each feature vector to the closest visual word in the dictionary. Denote the feature vector obtained for a given patch as y ∈ R d and the dictionary matrix as D ∈ R d×c , in which each column is a centroid (visual word). Then, the assignment of an image patch to a visual word can be written formally as the following optimization problem: Clearly, the constraints enforce that only one element in the solution e will be set to one, which corresponds to the visual word most similar to the image patch y. In this case, relationships between a feature vector and other visual words are discarded. This would not be a problem if a feature vector is an exact match with the visual word that it is assigned to, as in the case of text classification. However for images, a feature vector may be close to multiple visual words. In such cases, the relationship with the closest word would be overestimated and the relationships with the other similar words would be lost, leading to degenerated representation accuracy.
The sparse approach for BoW representation addresses this problem by assigning feature vectors to multiple visual words simultaneously. We seek to represent the local patch using "visual sentence" with a set of "words" instead of a single one. Besides the selection of visual words to form this sentence, we also need to evaluate the "contributions". A commonly used approach is to formulate this problem as a sparse learning problem, which can be solved by state-of-the-art algorithms.
Mathematically, the generalization from visual word to visual sentence can be done by relaxing the constraint in (2). We construct the representation vector x ∈ R c , such that for the i th entry, i = 1, . . . , c, x i = w i when the i th keyword is selected with contribution w i , and 0 when the keyword is not selected.
In order to make x sparse (contains multiple 0 entries), an 1 regularization is imposed, resulting in the following optimization problem: In which | · | 1 is the 1 norm and λ is a parameter that controls the sparsity. In our experiments, λ is fixed to be 0.01. This problem is closely related to LASSO [29], and can be solved by many existing software packages, such as SLEP [30].
The comparison between "visual word" and "visual sentence" for image representation is illustrated in Figure 3. As shown in the figure, the sparse learning provides more smooth representation.
Integrating the spatial and sparse approaches into the BoW representation model is therefore expected to produce a more accurate description of Drosophila images. We have created both sparse and non-sparse versions of both our global and spatial bag-of-words representations, and compare different combinations of approaches for image annotation and retrieval. Detailed performance evaluation can be found in the results section.

Data description
The Drosophila gene expression pattern images used in our study are obtained from the FlyExpress database, which contains standardized images obtained from the Berkeley Drosophila Genome Project (BDGP). In BDGP, the Drosophila embryogenesis is divided into six stage ranges (1-3, 4-6, 7-8, 9-10, 11-12, 13-16). The first stage range is not included in this study because of the small number of CV terms used to describe its images. Images from the remaining stage ranges are annotated separately in their respective groups because the majority of terms are stage range specific. The second through sixth stage ranges consist of 1081, 877, 1072, 2113, and 2816 image  groups, respectively. The last two stage ranges contain the largest number of lateral images as well as the highest counts of CV terms.

Evaluation of annotation performance
We employ the one-against-rest support vector machines (SVM) to annotate the gene expression pattern images, where the SVM builds a decision boundary between image groups that contain a particular term and the remaining image groups. We employ the LIBSVM package [31] and the linear kernel is used. The regularization parameter is set to 1 in all cases. Our proposed method combines both the spatial and sparse approaches and is denoted by SVM Spatial+Sparse . We compare our method with those that utilize only sparse, only spatial, or global bag-of-words approaches. These approaches are denoted by SVM Sparse , SVM Spatial , and SVM Global , respectively. The performance comparison of the four representations in terms of AUC and macro F1 scores is summarized in Tables 1 and 2, respectively. Since most CV terms are stage-range specific, we annotate the image groups according to their stage ranges separately. The numbers and proportions of positive samples for the 10 most frequent term in each stage range are summarized in Table 3. For each stage range, we begin with the 10 terms that appear most frequently, and then we add additional terms in the order of their frequencies with a step size of 10. This results in different numbers of data sets in each stage range, depending on the total number of CV terms in that stage range. The extracted data sets are randomly partitioned into disjoint training and testing sets using the ratio 1:1 for each term. For each data set, we generate 30 random partitions and the average performance is reported. Because our method models each individual term separately, we can compare the results of our method against the results of the other method on a term-by-term basis. For example, we can compare annotation results of our method with the non-spatial method in stage range 13-16, term by term, where 40 CV terms are used. In this comparison, of the 40 terms being studied, 39 saw an average increased AUC performance and 31 saw average increased F1 Score (F1) performance. Due to space limitation, we will not show each individual term by term comparison. Instead, we show the results for each stage range where various numbers of CV terms are used. Table 1 shows a comparison of AUC results for all four methods discussed. The best results for each case are highlighted in bold. The results show that both the spatial and the sparse methods consistently outperform the nonspatial method in terms of average AUC. The results also show that combining both sparse and spatial approaches outperforms any of the other three methods. The results indicate that the sparse approach offers improved performance over the spatial approach for the earlier stage ranges, and that the two approaches are comparable for the last stage range. The poorer performance of the spatial approach for the earlier stages may have been due to the less developed embryonic structures found earlier in embryogenesis. Combining the spatial and sparse approaches resulted in the best results, particularly in the later stage ranges. Table 2 shows a similar type of comparison as in Table 1. The only difference is that F1 score is used as a comparison measure instead of AUC. We observe a similar trend: both the spatial and sparse methods outperform the global approach; the sparse approach performs slightly better than the spatial approach in the earlier stages, and they achieve similar performance during the last stage. Again, we can observe that combining the sparse and spatial approaches generates better results than using sparse or spatial information alone.
We have observed that there were significant differences in performance increases between earlier stage ranges where Drosophila embryos were less developed and later stage ranges where embryos were more developed. We also observe that there are certain terms that benefit far greater from a spatial bag-of-words approach than other terms. For example, mesectoderm anlage in statu nascendi, central brain anlage, crystal cell specific anlage, hypopharynx primordium P2, procrystal cell, and crystal cell are all stage dependent terms that showed the most dramatic increases in annotation performance. These increases in performance were consistent across multiple stage range tests, where the number of terms being annotated varied. There are also a number of terms such as pole cell, mesectoderm primordium, foregut primordium, germ cell, embryonic central brain neuron, embryonic central brain glia, and lateral cord glia that showed good performance across multiple stage ranges, where various numbers of CV terms were annotated. We included detailed performance evaluation of individual terms in 6 different data sets in Figures 4 and 5.
There are pioneering works on constructing feature representations for Drosophlia gene expression image annotation. Zhou et al. [32] applied multi-resolution 2D wavelet discrete transform followed by min-Redundancy max-Relevance feature selection. Puniyani et al. [12] proposed an automatic system named "SPEX 2 " that performs pattern extraction using Markov random field and further extracts features using the SIFT descriptor and singular value decomposition. Using the top 10 most frequent terms [12] in the BDGP data set, Zhou's system achieved an average F1 score of about 0.35, while Puniyani's method achieved about 0.45. For comparison purposes, we extract the individual F1-scores for the same terms. Our Sparse + Spatial representation yields an average F1-score of 0.64, which outperforms both methods.

Comparison of different classifiers
Since the main focus of this section is to demonstrate the performance of various image representations, we fix our classifier to be SVM with linear kernel for its effectiveness in high-dimensional data. However, it will also be interesting to investigate how different classifiers perform in this task. As an illustrative example, we use stage range 11-12 with sparse representation and test the classification performance of three different classifiers including SVM, logistic regression and ridge regression. The performance in terms of sensitivity and specificity is reported in Table 4. For all three methods, we apply 4-fold cross validation for parameter selection. As we can see in Table 4, the three classifiers achieve comparable overall performance, and SVM achieves slightly higher sensitivity.

Performance of over-sampling
As we can see in Tables 2 and 4, when the number of labels is large, the average sensitivity as well as F1 score is quite low. This is due to the dramatic lack of positive samples for some labels. For example, in stage range 11-12, when we use 50 labels, the proportion of positive samples in these 50 labels can be as low as 0.8%. In this subsection, we present some preliminary results on tackling this problem with over-sampling.
The over-sampling method works as follows. Before training a classifier for a particular label, we first do random sampling on the positive samples with replacement, so that the number of positive samples is equal to the negative ones. Then, we train the classifier using the balanced samples. We test this method using the same setting as in the previous subsection, and the classification performance is presented in Table 5. As we can see in Tables 4 and 5, the over-sampling method provides promising improvements in this example, especially when the number of labels is large. For example, when using the logistic regression on annotating 50 labels, the oversampling improves sensitivity from 0.2 to 0.36. Exploring methods such as over-sampling to further improve the classification performance will be an interesting future direction.

Evaluation of retrieval performance
Based on the proposed image representations, we obtain the pair-wise similarity for every two images in the database, which can be used for image retrieval. In our study, the representative images for different views and stage ranges from the well-known Interactive Fly website a are used as queries. Then, for a given method and a query image, we select 8 images with the highest similarity values to obtain a set of query results. Note that the query images are removed from the results since they are always the one with highest similarity. Sample query results from different views and stage ranges are presented in Figures 6,7,8,9 and 10. First, we will compare different methods by visually inspecting the images retrieved for each query. The first conclusion we can draw from the figures is that the methods based on the bag-of-words (the first three columns) generally outperform the one that utilizes the binary representation only. For example, for the stripe patterns such as those in Figures 6 and 8, the BFV method retrieves less than 4 similar images in its top 8 matches, and in Figure 6, even the best match looks quite different from the query image. Also, we can observe that among the three proposed methods, the sparse representations generally yield more satisfactory results, particularly when the layout of the pattern is subtle, such as the ones in Figure 7.
We also give brief interpretations of the retrieved images by analyzing the functions of the corresponding genes in the biological process annotated in the gene ontology b . Figure 6 shows a stripe pattern expressed by gene odd, obtained from the dorsal view, in stage range 4-6. odd is in charge of the periodic partitioning. The retrieved genes prd and slp1 are about periodic partitioning and blastoderm segmentation, respectively. Both of them are closely related to the query gene. We also observe that several other retrieved genes, such as comm, comm2, run, trn and Alhambra, are not directly related to the segmentation process. However, they are all involved in the development of the nerve system. It will be interesting to examine how these two functions are related. Figure 8 shows a pattern expressed by gene slp1, during stage range 9-10. As we can see, all of the three "visual sentence" based approaches retrieved 6 images with slp1 expressed. The rest of the genes retrieved, such as slp2 which is involved in periodic partitioning, and en which is associated with the head segmentation process, are all closely connected to the blastoderm segmentation controlled by slp1. Figure 9 is taken from the lateral view, during stage range 11-12. The corresponding gene pdm2 is linked to the nervous system development. We can observe that our proposed method with the "visual sentence" concept returns 2 images with the same gene as the top query results. The gene nub takes part in the fate determination of ganglion mother cell, neuroblast. beat-IIIc and wg are related to the formation of synapse and endoderm, respectively. Figure 10 illustrates a pattern expressed by gene Gasp, during stage range 13-16, taken from the lateral view. The spatial and sparse representation retrieves 4 images with the same gene, compared to 2 images by spatial BoW and 1 image obtained by BFV. Gasp as well as CG13676 is involved in the chitin metabolic process. Another gene, Idgf2, which is related to the chitin catabolic process, is also closely related. The trh gene, which affects the epithelial cell fate determination and open tracheal system, is also related because chitin regulates epithelial tube morphogenesis; in addition to its classical role, protecting mature epithelia.

Conclusions
This article presents computational methods for annotating Drosophila gene expression pattern images, and identifying similar images based on gene patterning. In both tasks, images are represented as bags-of-words. The size of the bags is determined by the spatial properties of a representation. For both applications, a sparse learning framework was used. Results on the FlyExpress database indicate that the proposed annotation method outperforms the non-sparse, non-spatial bag-of-words method, as well as approaches that would use either a sparse or spatial framework. In our study, the bag-of-words representations were created by partitioning image features with local feature patches. Terms that saw the greatest increases in annotation accuracy may only reside in specific regions of Drosophila embryos during a given stage of development. one promising direction is to create local bag-of-words from these regions in order to eliminate some of the noise created by other unrelated regions, when searching for specific embryonic structures. This technique is commonly referred to as region of interest (ROI). We plan to explore this in the future. Endnotes a http://www.sdbonline.org/fly/aimain/1aahome.htm b http://www.geneontology.org/