A bag-of-words approach for Drosophila gene expression pattern annotation

Background: Drosophila gene expression pattern images document the spatiotemporal dynamics of gene expression during embryogenesis. A comparative analysis of these images could provide a fundamentally important way of studying the regulatory networks governing development. To facilitate pattern comparison and searching, groups of images in the Berkeley Drosophila Genome Project (BDGP) high-throughput study were manually annotated with a variable number of anatomical terms using a controlled vocabulary. Considering that the number of available images is rapidly increasing, it is imperative to design computational methods to automate this task. Results: We present a computational method to annotate gene expression pattern images automatically. The proposed method uses the bag-of-words scheme to utilize the existing information on pattern annotation and annotates images using a model that exploits correlations among terms. The proposed method can annotate images individually or in groups (e.g., according to developmental stage). In addition, it can integrate information from different two-dimensional views of embryos. Results on embryonic patterns from BDGP data demonstrate that our method significantly outperforms other methods. Conclusion: The proposed bag-of-words scheme is effective in representing a set of annotations assigned to a group of images, and the model employed to annotate images successfully captures the correlations among different controlled vocabulary terms. The integration of existing annotation information from multiple embryonic views improves annotation performance.


Background
Study of the interactions and functions of genes is crucial to deciphering the mechanisms governing cell-fate differentiation and embryonic development. The DNA microarray technique is commonly used to measure the expression levels of a large number of genes simultaneously. However, this technique primarily documents the average expression levels of genes, with information on spatial patterns often unavailable [1,2]. In contrast, RNA in situ hybridization uses gene-specific probes and illuminates the spatial patterns of gene expression precisely. Recent advances in this high-throughput technique have generated spatiotemporal information for thousands of genes in organisms such as Drosophila [1,3] and mouse [4]. Comparative analysis of the spatiotemporal patterns of gene expression can potentially provide novel insights into the functions and interactions of genes [5-7].
The embryonic patterning of Drosophila melanogaster along the anterior-posterior and dorsal-ventral axes represents one of the best understood examples of a complex cascade of transcriptional regulation during development. Systematic understanding of the mechanisms underlying this patterning is facilitated by the comprehensive atlas of spatial patterns of gene expression during Drosophila embryogenesis, which has been produced by the in situ hybridization technique and documented in the form of digital images [1,8]. To provide flexible tools for pattern searching, the images in the Berkeley Drosophila Genome Project (BDGP) high-throughput study are annotated with anatomical and developmental ontology terms using a controlled vocabulary (CV) [1] (Figure 1). These terms integrate the spatial and temporal dimensions of gene expression by describing a developmental "path" that documents the dynamic process of Drosophila embryogenesis [1,2]. Currently, the annotation is performed manually by human curators. However, the number of available images is now rapidly increasing [5,9-11]. It is therefore desirable to design computational methods to automate this task.

Figure 1
Sample image groups and their associated terms in the BDGP database http://www.fruitfly.org for the segmentation gene engrailed in 5 stage ranges.
Stage range 4-6: dorsal ectoderm anlage in statu nascendi; mesectoderm anlage in statu nascendi; segmentally repeated; trunk mesoderm anlage in statu nascendi; ventral ectoderm anlage in statu nascendi.
Stage range 7-8: dorsal ectoderm primordium; hindgut anlage; mesectoderm primordium; procephalic ectoderm anlage; trunk mesoderm primordium P2; ventral ectoderm primordium P2.
Stage range 9-10: inclusive hindgut primordium; mesectoderm primordium; procephalic ectoderm primordium; trunk mesoderm primordium; ventral ectoderm primordium.
Stage range 11-12: atrium primordium; brain primordium; clypeo-labral primordium; dorsal epidermis primordium; gnathal primordium; head epidermis primordium P1; hindgut proper primordium; midline primordium; ventral epidermis primordium; ventral nerve cord primordium.
Stage range 13-16: atrium; embryonic brain; embryonic central nervous system; embryonic dorsal epidermis; embryonic epipharynx; embryonic head epidermis; embryonic large intestine; embryonic ventral epidermis; ventral midline; ventral nerve cord.
The particular nature of this problem poses several challenges that an automated method must address. Owing to the effects of stochastic processes during development, no two embryos develop identically. Also, the quality of the obtained data is limited by current image processing techniques. Hence, the shape and position of the same embryonic structure may vary from image to image. Indeed, this has been considered one of the major impediments to automating this task [1]. Thus, invariance to local distortions in the images is an essential requirement for an automatic annotation system. Furthermore, in the original BDGP study, gene expression pattern images are annotated collectively in small groups using a variable number of terms. Images in the same group may share certain anatomical and developmental structures, but not all terms assigned to a group apply to every image in the group. This requires approaches that retain the original group membership information of the images, because we need to test the accuracy of the new method using existing (and independent) annotation data. Prior work on this task [12] ignored such groups and assumed that all terms are associated with every image in a group, which may adversely impact its effectiveness on the BDGP data. Finally, Drosophila embryos are 3D objects, and they are documented as 2D images taken from multiple views. Since certain embryonic structures can be seen only in specific two-dimensional projections (views), it is beneficial to integrate images with different views to make the final annotation.

In this article we present a computational method for annotating gene expression pattern images. This method is based on the bag-of-words approach, in which invariant visual features are first extracted from local patches on the images and then quantized to form the bag-of-words representation of the original images.
This approach is known to be robust to distortions in the images [13,14], and it has demonstrated impressive performance on object recognition problems in computer vision [15] and on image classification problems in cell biology [16]. In our approach, invariant features are first extracted from local patches on each image in a group. These features are then quantized based on precomputed "visual codebooks", and images in the same group with the same view are represented as a bag of words. Thus, our approach can take advantage of the group membership information of images as in the BDGP study. To integrate images with different views, we propose to construct a separate codebook for images with each view. Then image groups containing images with multiple views can be represented as multiple bags, each containing words from the corresponding view. We show that multiple bags can be combined to annotate the image group collectively. After representing each image group as multiple bags of words, we employ a recently developed classification model [17] to annotate the image groups. This model can exploit the correlations among different terms, leading to improved performance. Experimental results on the gene expression pattern images obtained from the FlyExpress database http://www.flyexpress.net show that the proposed approach outperforms other methods consistently. Results also show that integration of images with multiple views improves annotation performance. The overall flowchart of the proposed method is depicted in Figure 2.

Methods
The proposed method is based on the bag-of-words approach, which was originally used in text mining and is now commonly employed in image and video analysis problems in computer vision [15,18-20]. In this approach, invariant features [21] are first extracted from local regions of images or videos, and a visual codebook is constructed by applying a clustering algorithm to a subset of the features, where the cluster centers are considered "visual words" in the codebook. Each feature in an image is then quantized to the closest word in the codebook, and an entire image is represented as a global histogram counting the number of occurrences of each word in the codebook. The size of the resulting histogram equals the number of words in the codebook, and hence the number of clusters obtained from the clustering algorithm. The codebook is usually constructed by applying the flat k-means clustering algorithm or a hierarchical alternative [14]. This approach is derived from the bag-of-words models used in text document categorization and has been shown to be robust to distortions in images. One potential drawback of this approach is that the spatial information conveyed in the original images is not represented explicitly. This, however, can be partially compensated for by sampling dense and redundant features from the images. The bag-of-words representation for images has been shown to yield competitive performance on object recognition and retrieval problems after postprocessing procedures such as normalization or thresholding [14,15]. The basic idea behind the bag-of-words approach is illustrated in Figure 3.
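The quantization step just described can be sketched as follows. This is a toy illustration with hypothetical two-dimensional features and a fixed three-word codebook; in the actual method the features are 128-dimensional SIFT descriptors and the codebook comes from k-means clustering.

```python
import numpy as np

def bag_of_words(features, codebook):
    """Quantize each local feature to its nearest codebook word (Euclidean
    distance) and return the global histogram of word counts."""
    # Pairwise squared distances: (n_features, n_words)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)                      # nearest word per feature
    return np.bincount(words, minlength=len(codebook))

# Toy example: 6 two-dimensional local features, 3 "visual words"
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
features = np.array([[0.1, 0.0], [0.9, 1.1], [4.8, 5.2],
                     [0.2, 0.1], [1.1, 0.9], [5.1, 4.9]])
hist = bag_of_words(features, codebook)
print(hist)  # [2 2 2]
```

The histogram length equals the codebook size, regardless of how many features the image contributes.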
For our problem, the images in the BDGP database are annotated collectively in small groups. Hence, we propose to extract invariant visual features from each image in a group and represent the images in the same group with the same view as a bag of visual words. The 3D nature of the embryos and the 2D layout of the images mean that certain body parts can only be captured by images taken from certain views. For example, the body part "ventral midline" can only be identified from images taken from the ventral view. Hence, one of the challenges in automated gene expression pattern annotation is the integration of images with different views. We propose to construct a separate codebook for images with each view and quantize image groups containing images with multiple views as multiple bags of visual words, one for each view. The bags for multiple views can then be concatenated to annotate the image groups collectively. After representing each image group as a bag-of-words, we apply a recently developed multi-label classification method [17] that can extract information shared among different terms, leading to improved annotation performance.

Feature extraction
The images in the FlyExpress database have been standardized semi-automatically, including alignment. Three common methods for generating local patches on images are those based on affine region detectors [22], random sampling [23], and regular patches [24]. We extract dense features on regular patches, since such features are commonly used for aligned images. The radius and spacing of the regular patches are set to 16 pixels in the experiments (Figure 4). Owing to the limitations of image processing techniques, local variations may exist in the images. Thus, we extract invariant features from each regular patch. In this article, we apply the SIFT descriptor [21,25] to extract local visual features, since it has been applied successfully in other image-related applications [21]. In particular, each feature vector is computed as a set of orientation histograms on a 4 × 4 grid of subregions, with each histogram containing 8 bins. This leads to a SIFT feature vector with 128 (4 × 4 × 8) dimensions for each patch.

Figure 3
Illustration of the bag-of-words approach. First, a visual codebook is constructed by applying a clustering algorithm to a subset of the local features from training images, and the center of each cluster is considered a unique "visual word" in the codebook. Each local feature in a test image is then mapped to the closest visual word, and each test image is represented as a (normalized) histogram of visual words.
Note that although invariance to scale and orientation no longer holds, since we do not apply the SIFT interest point detector, the SIFT descriptor is still robust to variations in position, illumination, and viewpoint [25].

Codebook construction
In this article, we consider images taken from the lateral, dorsal, and ventral views, since the number of images from other, intermediate views is small. For each stage range, we build a separate codebook for images with each view. Since the visual words of the codebooks are expected to serve as representatives of the embryonic structures, the images used to build the codebooks should contain all the embryonic structures that the system is expected to annotate. Hence, we select codebook images so that each embryonic structure appears in at least a certain number of images. This number is set to 10, 5, and 3 for the codebooks of lateral, dorsal, and ventral images, respectively, based on the total number of images with each view (Table 1). The SIFT features computed from regular patches on the codebook images are then clustered using the k-means algorithm. Since this algorithm depends on the initial centers, we repeat it with ten random initializations and select the one resulting in the smallest summed within-cluster distance. We study the effect of the number of clusters (i.e., the size of the codebook) on performance below and set this number to 2000, 1000, and 500 for lateral, dorsal, and ventral images, respectively.
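As a sketch, scikit-learn's KMeans with n_init=10 performs exactly this repeat-and-keep-best procedure, retaining the run with the smallest summed within-cluster distance (inertia). The toy 16-dimensional features below stand in for 128-dimensional SIFT descriptors.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(features, n_words, n_restarts=10, seed=0):
    """Cluster a sample of local features into a visual codebook.
    n_init mirrors the ten random initializations described in the text,
    from which the run with the smallest inertia is kept."""
    km = KMeans(n_clusters=n_words, n_init=n_restarts, random_state=seed)
    km.fit(features)
    return km.cluster_centers_          # (n_words, d) array of visual words

# Toy stand-in for SIFT features (real descriptors are 128-dimensional;
# the paper uses 2000/1000/500 words for lateral/dorsal/ventral views)
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 16))
codebook = build_codebook(feats, n_words=8)
print(codebook.shape)  # (8, 16)
```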

Pattern representation
After the codebooks for all views are constructed, the images in each group are quantized separately for each view. In particular, features computed on regular patches on images with a certain view are compared with the visual words in the corresponding codebook, and the word closest to the feature in terms of Euclidean distance is used to represent it. Then the entire image group is represented as multiple bags of words, one for each view.
Since the order of the words in the bag is irrelevant as long as it is fixed, the bag can be represented as a vector counting the number of occurrences of each word in the image group. Let $c_1, \ldots, c_m \in \mathbb{R}^d$ be the $m$ cluster centers (codebook words) and let $v_1, \ldots, v_n \in \mathbb{R}^d$ be the $n$ features extracted from images in a group with the same view, where $d$ is the dimensionality of the local features ($d = 128$ for SIFT). Then the bag-of-words vector $w$ is $m$-dimensional, and the $k$-th component $w_k$ of $w$ is computed as

$$w_k = \sum_{i=1}^{n} \mathbb{I}\!\left(k = \arg\min_{1 \le j \le m} \|v_i - c_j\|_2\right),$$

where $\mathbb{I}(\cdot)$ is the indicator function. It follows that $\sum_{k=1}^{m} w_k = n$, since each feature is assigned to exactly one word.
Based on this design, the vector representations for the individual views can be concatenated so that the images in a group with different views are integrated (Figure 3). Let $w^{l}$, $w^{d}$, and $w^{v}$ be the bag-of-words vectors for the images in a group with the lateral, dorsal, and ventral views, respectively. Then the bag-of-words vector $w$ for the entire image group is the concatenation

$$w = \left[\, (w^{l})^T, (w^{d})^T, (w^{v})^T \,\right]^T.$$

To account for the variability in the number of images in each group, we normalize the bag-of-words vector to unit length. (An image group is a group of gene expression pattern images of a particular gene at a particular stage range.) Note that since not all image groups contain images from all views, the corresponding

Figure 4
Illustration of the image patches on which the SIFT features are extracted. We extract local features on regular patches on the images where the radius and spacing of the regular patches are set to 16 pixels.
bag-of-words vector is the zero vector if a specific view is absent.
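The per-view quantization, concatenation, and normalization can be sketched as follows. The toy two-word codebooks and two-dimensional features, as well as the function name and data layout, are ours for illustration.

```python
import numpy as np

def group_vector(view_features, codebooks,
                 views=("lateral", "dorsal", "ventral")):
    """Represent one image group as concatenated per-view bag-of-words
    vectors, L2-normalized; an absent view contributes a zero block."""
    blocks = []
    for view in views:
        cb = codebooks[view]
        feats = view_features.get(view)            # None if view is absent
        if feats is None or len(feats) == 0:
            blocks.append(np.zeros(len(cb)))
            continue
        d2 = ((feats[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
        counts = np.bincount(d2.argmin(axis=1), minlength=len(cb))
        blocks.append(counts.astype(float))
    w = np.concatenate(blocks)
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w             # unit length

# Hypothetical toy codebooks (2 words per view, 2-D features)
codebooks = {v: np.array([[0.0, 0.0], [1.0, 1.0]])
             for v in ("lateral", "dorsal", "ventral")}
group = {"lateral": np.array([[0.1, 0.0], [0.9, 1.0]])}  # no dorsal/ventral
w = group_vector(group, codebooks)
print(w)  # lateral block [1, 1] normalized; dorsal/ventral blocks are zero
```

With the real codebook sizes (2000 + 1000 + 500), this yields the 3500-dimensional group vectors used below.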

Pattern annotation
After representing each image group as a global histogram using the bag-of-words representation, the gene expression pattern image annotation problem is reduced to a multi-label classification problem, since each group of images can be annotated with multiple terms. (We use the terms "label" and "term" interchangeably; the former is common in the machine learning literature, while the latter is more relevant for our application.) Multi-label problems have been studied extensively in the machine learning community, and one simple and popular approach is to construct a binary classifier for each label, resulting in a set of independent binary classification problems. However, this approach fails to capture the correlation information among different labels, which is critical for applications such as gene expression pattern image annotation, where the semantics conveyed by different labels are correlated. To this end, various methods have been developed to exploit the correlation information among different labels so that performance can be improved [17,26-29]. In [17], a shared-subspace learning framework was proposed to exploit the correlation information in multi-label problems. We apply this formulation to the gene expression pattern image annotation problem in this article.
We are given a set of $n$ input data vectors $\{x_i\}_{i=1}^{n} \subset \mathbb{R}^d$ ($d = 3500$ if all three views are used), which are the bag-of-words representations of $n$ image groups. Let the terms associated with each of the $n$ image groups be encoded in the label indicator matrix $Y \in \mathbb{R}^{n \times m}$, where $m$ is the total number of terms, and $Y_{i\ell} = 1$ if the $i$th image group has the $\ell$th term and $Y_{i\ell} = -1$ otherwise. In the shared-subspace learning framework proposed in [17], a binary classifier is constructed for each label to discriminate it from the rest. However, unlike approaches that build the binary classifiers independently, a low-dimensional subspace is assumed to be shared among the labels. The predictive functions in this framework consist of two parts: one part is contributed from the original data space, and the other is derived from the shared subspace, as follows:

$$f_\ell(x) = w_\ell^T x + v_\ell^T \Theta x, \quad \ell = 1, \ldots, m, \qquad (1)$$

where $w_\ell \in \mathbb{R}^d$ and $v_\ell \in \mathbb{R}^r$ are the weight vectors, $\Theta \in \mathbb{R}^{r \times d}$ is the linear transformation used to parameterize the shared low-dimensional subspace, and $r$ is the dimensionality of the shared subspace. The transformation $\Theta$ is common to all labels, and it has orthonormal rows, that is, $\Theta \Theta^T = I$. In this formulation, the input data are projected onto a low-dimensional subspace by $\Theta$, and this low-dimensional projection is combined with the original representation to produce the final prediction.
In [17] the parameters $\{w_\ell, v_\ell\}_{\ell=1}^{m}$ and $\Theta$ are estimated by minimizing the following regularized empirical risk:

$$\min_{\{w_\ell, v_\ell\},\, \Theta} \; \sum_{\ell=1}^{m} \left( \sum_{i=1}^{n} L\!\left(f_\ell(x_i), y_i^{\ell}\right) + \alpha \|w_\ell\|^2 + \beta \|w_\ell + \Theta^T v_\ell\|^2 \right) \qquad (2)$$

subject to the constraint $\Theta \Theta^T = I$, where $L$ is a loss function, $y_i^{\ell} = Y_{i\ell}$, and $\alpha > 0$ and $\beta > 0$ are the regularization parameters. It can be shown that when the least squares loss is used, the optimization problem in Eq. (2) can be expressed as

$$\min_{U, V, \Theta:\, \Theta\Theta^T = I} \; \|XU - Y\|_F^2 + \alpha \|U - \Theta^T V\|_F^2 + \beta \|U\|_F^2, \qquad (3)$$

where $X = [x_1, \ldots, x_n]^T \in \mathbb{R}^{n \times d}$ is the data matrix, $\|\cdot\|_F$ denotes the Frobenius norm of a matrix [30], $u_\ell = w_\ell + \Theta^T v_\ell$, $U = [u_1, \ldots, u_m]$, and $V = [v_1, \ldots, v_m]$. The optimal $\Theta^*$ can be obtained by solving a generalized eigenvalue problem, as summarized in the following theorem:

Theorem 1. Let $X$, $Y$, and $\Theta$ be defined as above. Then the optimal $\Theta$ that solves the optimization problem in Eq. (3) can be obtained by solving the following trace maximization problem:

$$\max_{\Theta:\, \Theta\Theta^T = I} \; \mathrm{tr}\!\left( \left( \Theta S_2 \Theta^T \right)^{-1} \Theta S_1 \Theta^T \right), \qquad (4)$$

where $S_1$ and $S_2$ are defined as

$$S_1 = B^{-1} X^T Y Y^T X B^{-1}, \qquad S_2 = I - \alpha B^{-1}, \qquad B = X^T X + (\alpha + \beta) I.$$

For high-dimensional problems where $d$ is large, an efficient algorithm for computing the optimal $\Theta$ is also proposed in [17]. After the optimal $\Theta$ is obtained, the optimal values of $\{w_\ell, v_\ell\}_{\ell=1}^{m}$ can be computed in closed form.
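A sketch of this least-squares solution on toy data, assuming the $S_1$, $S_2$, and $B$ defined above. The efficient large-$d$ algorithm of [17] is not shown; a direct d × d inverse is used instead, which would be costly for d = 3500.

```python
import numpy as np
from scipy.linalg import eigh

def fit_shared_subspace(X, Y, r, alpha=1.0, beta=1.0):
    """Least-squares shared-subspace multi-label fit (a sketch of the
    formulation above). X: (n, d) features, Y: (n, m) labels in {-1, +1},
    r: shared-subspace dimensionality. Returns W, V, Theta."""
    n, d = X.shape
    B = X.T @ X + (alpha + beta) * np.eye(d)
    Binv = np.linalg.inv(B)              # O(d^3); [17] gives a faster route
    G = Binv @ X.T @ Y
    S1 = G @ G.T                         # B^{-1} X'Y Y'X B^{-1}
    S2 = np.eye(d) - alpha * Binv        # positive definite since beta > 0
    # Top-r generalized eigenvectors of S1 q = lambda S2 q span the optimal
    # subspace; the trace objective depends only on the row space of Theta,
    # so we may re-orthonormalize to satisfy Theta Theta^T = I.
    _, evecs = eigh(S1, S2)              # eigenvalues in ascending order
    Q, _ = np.linalg.qr(evecs[:, -r:])
    Theta = Q.T
    # Closed-form U for this Theta, then V = Theta U and W = U - Theta^T V
    U = np.linalg.solve(B - alpha * Theta.T @ Theta, X.T @ Y)
    V = Theta @ U
    W = U - Theta.T @ V
    return W, V, Theta

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(40, 10)), np.sign(rng.normal(size=(40, 3)))
W, V, Theta = fit_shared_subspace(X, Y, r=2)
scores = X @ (W + Theta.T @ V)           # f_l(x) for every label
print(Theta.shape, scores.shape)         # (2, 10) (40, 3)
```

Predicted annotations are obtained by thresholding the score matrix column by column.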

Results and discussion
We report and analyze the experimental results on gene expression pattern annotation in this section. We also demonstrate the performance improvements achieved by integrating images with multiple views and study the effect of the codebook size on the annotation performance. The performance for each individual term is also presented and analyzed.

Data description
In our experiments, we use Drosophila gene expression pattern images retrieved from the FlyExpress database [8], which contains standardized versions of images obtained from the BDGP high-throughput study [1,2]. The images are standardized semi-automatically, and all images are scaled to 128 × 320 pixels. The embryogenesis of Drosophila has been divided into six discrete stage ranges (stages 1-3, 4-6, 7-8, 9-10, 11-12, and 13-16) in the BDGP high-throughput study [1]. Since most of the CV terms are stage-range specific, we annotate the images in each stage range separately. Drosophila embryos are 3D objects, and the FlyExpress database contains 2D images taken from several different views (lateral, dorsal, ventral, and other intermediate views) of the 3D embryos. The number of CV terms, the number of image groups, and the number of images with each view in each stage range are summarized in Table 1.
We can observe that most of the images are taken from the lateral view. In stage range 13-16, the number of dorsal images is also comparable to that of the lateral images. We study the performance improvement obtained by using images with different views, and results show that incorporating images with dorsal views can improve performance consistently, especially in stage range 13-16 where the number of dorsal images is large. In contrast, the integration of ventral images results in marginal performance improvement at the price of an increased computational cost, since the number of ventral images is small. Hence, we only use the lateral and dorsal images in evaluating the relative performance of the compared methods.

Evaluation of annotation performance
We apply the multi-label formulation proposed in [17] to annotate the gene expression pattern images. To demonstrate the effectiveness of this formulation in exploiting the correlation information among different labels, we also report the annotation performance achieved by one-against-rest linear support vector machines (SVM), in which each linear SVM builds a decision boundary between image groups with and without one particular term. Note that in this method the labels are modeled separately, and hence no correlation information is captured. To compare the proposed method with existing approaches for this task, we report the annotation performance of a prior method [31], which used the pyramid match kernel (PMK) algorithm [32-34] to construct the kernel between two sets of feature vectors extracted from two sets of images. We report the performance of kernels constructed from the SIFT descriptor and that of composite kernels combined from multiple kernels as in [31]. For the composite kernels, we apply the three kernel combination schemes (star, clique, and kCCA) and report the best performance on each data set. Note that the method proposed in [12] requires a training set containing embryos that are annotated individually, and it has been shown [31] that this requirement leads to low performance on the BDGP data, in which the images are annotated in small groups. Hence, we do not report these results. In the following, the multi-label formulation proposed in [17] is denoted ML LS , and the one-against-rest SVM is denoted SVM. The pyramid match kernel approaches based on the SIFT and composite features are denoted PMK SIFT and PMK comp , respectively. All model parameters are tuned using 5-fold cross-validation.
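A minimal sketch of the one-against-rest baseline with scikit-learn. The data here are synthetic stand-ins; the actual experiments use the bag-of-words group vectors and cross-validated regularization parameters.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in for bag-of-words vectors (n groups x d words) and a
# 0/1 label-indicator matrix (n groups x m terms)
rng = np.random.default_rng(0)
X = rng.random((60, 50))
Y = (rng.random((60, 4)) < 0.3).astype(int)

# One binary linear SVM per term; no inter-term correlation is modeled,
# which is exactly the limitation the shared-subspace formulation addresses
clf = OneVsRestClassifier(LinearSVC(max_iter=10000))
clf.fit(X, Y)
pred = clf.predict(X)
print(pred.shape)  # (60, 4)
```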
From Table 1 we can see that the first stage range (1-3) is annotated with only two terms, so we do not report results for this stage range. In the other five stage ranges, we remove terms that appear in fewer than 5 training image groups in a stage range, which yields data sets in which 60 or fewer terms need to be considered in every case. The two primary reasons for this decision are (1) terms that appear in too few image groups are statistically too weak to be learned effectively, and (2) we use 5-fold cross-validation to tune the model parameters, and each term should appear in each fold at least once. Therefore, the maximum numbers of terms reported in Tables 2, 3, 4, 5, and 6 represent the "all terms" test.
The experiments are geared toward examining the change in accuracy of our annotation method as we use an increasingly large set of vocabulary terms. We begin with the 10 terms that appear in the largest number of image groups and then add additional terms in order of their frequencies. By virtue of this design, experiments with 10 terms should show higher performance than those with, for example, 50 terms, because the 10 most frequent terms appear more often in the image groups of the training data sets. The extracted data set is partitioned into training and test sets using a 1:1 ratio for each term, and the training data are used to construct the classification model.
The agreement between the predicted annotations and the expert data provided by human curators is measured using the area under the receiver operating characteristic (ROC) curve, called AUC [35], the F1 measure [36], sensitivity, and specificity. For AUC, the value for each term is computed and the average performance across multiple terms is reported. For the F1 measure, there are two ways to average performance across multiple terms, called macro-averaged F1 and micro-averaged F1, and we report both. For each data set, the training and test sets are randomly generated 30 times, and the average performance and standard deviations are reported in Tables 2, 3, 4, 5, and 6. To compare the performance of all methods across different values of sensitivity and specificity, we show the ROC curves of 9 randomly selected terms on two data sets from stage ranges 11-12 and 13-16 in Figures 5 and 6.
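These measures can be computed with scikit-learn as follows, using hypothetical ground truth, hard predictions, and continuous scores for 5 image groups and 3 terms.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

# Hypothetical ground truth, hard predictions, and classifier scores
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]])
Y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1], [0, 0, 0]])
scores = np.array([[0.9, 0.2, 0.8], [0.1, 0.7, 0.85],
                   [0.8, 0.4, 0.3], [0.2, 0.1, 0.9], [0.4, 0.3, 0.2]])

# AUC is computed per term from the scores, then averaged across terms
auc = np.mean([roc_auc_score(Y_true[:, k], scores[:, k]) for k in range(3)])
# Macro-F1 averages per-term F1 scores; micro-F1 pools all decisions first
macro = f1_score(Y_true, Y_pred, average="macro")
micro = f1_score(Y_true, Y_pred, average="micro")
print(round(auc, 3), round(macro, 3), round(micro, 3))  # 0.944 0.756 0.769
```

Macro-averaging weights every term equally, so rare terms influence it strongly; micro-averaging is dominated by frequent terms.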
We can observe from Tables 2, 3, 4, 5, and 6 and Figures 5 and 6 that approaches based on the bag-of-words representation (ML LS and SVM) consistently outperform the PMK-based approaches (PMK SIFT and PMK comp ). Note that since both the shared-subspace formulation and SVM are based on the bag-of-words representation, the benefit of this representation can be assessed by comparing the performance of both methods to the two approaches based on PMK. In particular, ML LS outperforms PMK SIFT and PMK comp on all 18 data sets in terms of all three performance measures (AUC, macro F1, and micro F1). In all cases, the performance improvements tend to be larger for the two F1 measures than for AUC. It can also be observed from Figures 5 and 6 that the ROC curves for SVM and the shared-subspace formulation are always above those based on the pyramid match algorithm, indicating that both methods outperform the previous methods across all classification thresholds. A similar trend has been observed on other data sets, but the detailed results are omitted due to space constraints.
This shows that the bag-of-words scheme is more effective in representing the image groups than the PMK-based approach. Moreover, we can observe that ML LS outperforms SVM on most of the data sets for all three measures. This demonstrates that the shared-subspace multi-label formulation can improve performance by capturing the correlation information among different labels. For the PMK-based approaches, PMK comp outperforms PMK SIFT on all of the data sets. This is consistent with the prior results obtained in [31] that the integration of multiple kernel matrices derived from different features improves performance.

Performance of individual terms
To evaluate the relative performance on the individual terms used, we report the AUC values achieved by the proposed formulation on 6 data sets in Figures 7 and 8. One major outcome of our analysis was that some terms were consistently assigned to the wrong image groups. For example, the terms "hindgut proper primordium", "Malpighian tubule primordium", "garland cell primordium", "salivary gland primordium", and "visceral muscle primordium" in stage range 11-12 achieve low AUC on all three data sets. Similarly, the terms "ring gland", "embryonic anal pad", "embryonic proventriculus", "gonad", and "embryonic/larval garland cell" achieve low AUC on all three data sets in stage range 13-16. For most of these terms, the low performance is caused by the fact that they appear in very few image groups. Such low frequencies provide too few training examples for effective learning. Therefore, the number of images available for training will need to be increased to improve performance on these terms.

Integration of images with multiple views
To evaluate the effect of integrating images with multiple views, we report the annotation performance when using only lateral images; lateral and dorsal images; and lateral, dorsal, and ventral images. In particular, we extract six data sets from stage range 13-16 with the number of terms ranging from 10 to 60 in steps of 10. The average performance in terms of AUC, macro F1, and micro F1 achieved by ML LS over 30 random trials is shown in Figure 9. We observe that performance can be improved significantly by incorporating the dorsal-view images. In contrast, the incorporation of ventral images results in only slight performance improvement. In the other stage ranges, the integration of images with multiple views either improves performance or keeps it comparable. This may be due to the fact that the dorsal-view images are most informative for annotating embryos in stage range 13-16, as large morphological movements happen on the dorsal side in this stage range. Similar trends are observed when the SVM classifier is applied.

Effect of codebook size
The size of the codebook is a tunable parameter, and we evaluate its effect on annotation performance using a subset of lateral images from stage range 13-16 with 60 terms. In particular, the size of the codebook for this data set is increased gradually from 500 to 4000 with a step size of 500, and the performance of ML LS and SVM is plotted in Figure 10. In most cases the performance improves with a larger codebook, but it can also decrease in certain cases, such as the performance of ML LS measured by macro F1. In general, the performance does not change significantly with codebook size. Hence, we set the codebook size to 2000 for lateral images in the previous experiments to balance annotation performance and computational cost. An interesting observation from Figure 10 is that the performance differences between ML LS and SVM tend to be larger for small codebook sizes. This may reflect the fact that a small codebook cannot capture the complex patterns in image groups. This representational insufficiency can be compensated for effectively by sharing information among different labels using the shared-subspace multi-label formulation. For a large codebook size, the performance of ML LS and SVM tends to be close.

Figure 5
The ROC curves for 9 randomly selected terms on a data set from stage range 11-12. Each figure corresponds to the ROC curves for a term. The circles on the curves show the corresponding decision points, which are tuned on the training set based on the F1 score.

Conclusion
In this article we present a computational method for automated annotation of Drosophila gene expression pattern images. This method represents image groups using the bag-of-words approach and annotates the groups using a shared-subspace multi-label formulation. The proposed method annotates images in groups, and hence retains the image group membership information as in the original BDGP study. Moreover, multiple sources of information conveyed by images with different views can be integrated naturally in the proposed method. Results on images from the FlyExpress database demonstrate the effectiveness of the proposed method.
Figure 6
The ROC curves for 9 randomly selected terms on a data set from stage range 13-16. Each figure corresponds to the ROC curves for a term. The circles on the curves show the corresponding decision points, which are tuned on the training set based on the F1 score.

Figure 7
The AUC of individual terms on three data sets from stage range 11-12. The three figures, from top to bottom, show the performance on data sets with 30, 40, and 50 terms, respectively.

In constructing the bag-of-words representation in this article, we use only SIFT features. Prior results on

Figure 8
The AUC of individual terms on three data sets from stage range 13-16. The three figures, from top to bottom, show the performance on data sets with 40, 50, and 60 terms, respectively.
other image-related applications show that integration of multiple feature types may improve performance [37]. We plan to extend the proposed method to integrate multiple feature types in the future. In addition, the bag-of-words representation is obtained by the hard assignment approach, in which a local feature vector is assigned only to the closest visual word. A recent study [38] shows that the soft assignment approach, which assigns each feature vector to multiple visual words based on their distances, usually results in improved performance. We will explore this in the future.

Figure 9
Comparison of annotation performance achieved by ML LS when images from different views (lateral, lateral+dorsal, and lateral+dorsal+ventral) are used on 6 data sets from stage range 13-16. In each figure, the x-axis denotes the data sets with different numbers of terms. For each data set, 30 random partitions of the training and test sets are generated and the average performance is reported.

Figure 10
The change of performance when the codebook size increases gradually from 500 to 4000 with a step size of 500 on a data set in stage range 13-16 with 60 terms. In each case, the average performance and standard deviation over 30 random partitions of the training and test sets are shown. A similar trend has been observed in other stage ranges.