Identifying spatially similar gene expression patterns in early stage fruit fly embryo images: binary feature versus invariant moment digital representations

Background Modern developmental biology relies heavily on the analysis of embryonic gene expression patterns. Investigators manually inspect hundreds or thousands of expression patterns to identify those that are spatially similar and to ultimately infer potential gene interactions. However, the rapid accumulation of gene expression pattern data over the last two decades, facilitated by high-throughput techniques, has produced a need for the development of efficient approaches for direct comparison of images, rather than their textual descriptions, to identify spatially similar expression patterns. Results The effectiveness of the Binary Feature Vector (BFV) and Invariant Moment Vector (IMV) based digital representations of the gene expression patterns in finding biologically meaningful patterns was compared for a small (226 images) and a large (1819 images) dataset. For each dataset, an ordered list of images, with respect to a query image, was generated to identify overlapping and similar gene expression patterns, in a manner comparable to what a developmental biologist might do. The results showed that the BFV representation consistently outperforms the IMV representation in finding biologically meaningful matches when spatial overlap of the gene expression pattern and the genes involved are considered. Furthermore, we explored the value of conducting image-content based searches in a dataset where individual expression components (or domains) of multi-domain expression patterns were also included separately. We found that this technique improves performance of both IMV and BFV based searches. Conclusions We conclude that the BFV representation consistently produces a more extensive and better list of biologically useful patterns than the IMV representation. The high quality of results obtained scales well as the search database becomes larger, which encourages efforts to build automated image query and retrieval systems for spatial gene expression patterns.


Background
The complexity of animal body form arises from a single fertilized egg cell in an odyssey of gene expression and regulation that controls the multiplication and differentiation of cells [1][2][3]. For over two decades, Drosophila melanogaster (the fruit fly) has been a canonical model animal for understanding this developmental process in the laboratory. The raw data from experiments consist of photographs (two dimensional images) of the Drosophila embryo showing a particular gene expression pattern revealed by a gene-specific probe in wildtype and mutant backgrounds. Manual, visual comparison of these spatial gene expressions is usually carried out to identify overlaps in gene expression and to infer interactions [4][5][6].
Whole fruit fly embryo and other related gene expression patterns have been published in a wide variety of research journals since the late 1980s. These efforts have now entered a high-throughput phase with the systematic determination of patterns of gene expression (e.g., [7]). As a result, the amount of data currently available has doubled, leading to the imminent availability of multiple expression patterns of every gene in the Drosophila genome [7]. In addition, the use of micro-array technology to study Drosophila development has revealed additional and important insights into changes in gene expression levels over time and under different conditions at a genomic scale [8,9].
With this rapid increase in the amount of available primary gene expression images, searchable textual descriptions of images have become available [7,10,11]. However, a direct comparison of the gene expression patterns depicted in the images is also desirable to find biologically similar expression patterns, because textual descriptions (even using a highly structured and controlled vocabulary) cannot fully capture all aspects of an expression pattern. In fact, there is a need for automated identification of images containing overlapping or similar gene expression patterns [6,12] in order to assist researchers in the evaluation of similarity between a given expression pattern and all other existing (comparable) patterns in the same way that the BLAST [13] technique functions for DNA and protein sequences. Of course, unlike the genomes with four letters and proteomes with 20 letters, all gene expression anatomies cannot be easily reduced to, and thus represented by, a small number of components.
We previously proposed a binary coded bit stream pattern to represent gene expression pattern images [6]. In this digital representation, referred to as the Binary Feature Vector (BFV; BSV in [6]), the unstained pixels in the images (white regions and background) were denoted by a value of 0 and the stained areas (colored and foreground: gene expression) were denoted by a value of 1.
Based on the BFV representations of the expression pattern, we proposed a Basic Expression Search Tool for Images (BESTi) [6] with an aim to produce biologically significant gene expression pattern matches using image content alone, without any reference to textual descriptions. We found that the BESTi approach generated biologically meaningful matches to query expression patterns [6].
In this paper, we explore how a more sophisticated digital representation of gene expression patterns, based on Invariant Moment Vectors (IMV, [14]), performs in generating an ordered list of best-matching images that contain gene expression patterns similar to or overlapping with that depicted in a query image. IMV are frequently used in natural image processing (e.g., optical character recognition [15]) and have a number of desirable properties, including the compensation for variations of scale, translation, and rotation. If successful, IMV representations hold the promise of producing significantly shorter computing times for image-to-image matching compared to BFV.
Previously, we had examined the performance of the BFV representation for a limited dataset of early stage images [6]. Here we compare the relative performances of BFV and IMV first using a dataset containing 226 images (from 13 research papers). Then we test for scalability of the BESTi search by using a seven times larger dataset containing 1819 (1593 new + 226 previous) images from 262 additional research papers (list available upon request from the authors). Both datasets contained lateral views of early stage (1-8) embryos.
During these investigations, we also developed another measure of image-to-image similarity for the BFV representation. This measure is aimed at finding images that contain as much of the query image expression pattern as possible, but without penalizing for the presence of any expression outside the overlap region in the target image. In addition, we examined whether partitioning a multidomain expression pattern into multiple BFV representations, each containing only one domain, yields a better result set.
Recently, Peng and Myers [16] have proposed a different procedure involving the global and local Gaussian Mixture Model (GMM) of the pixel intensities (of expression) to identify images with similar patterns. This GMM method is expected to find images with intensity and spatial similarities. This is different from the BFV and IMV methods examined here, which are intended to find only spatially similar patterns. This focus is important because, as mentioned in [6], the differences in gene expression intensity among images in published literature can arise simply due to the use of different techniques or illumination conditions, or for biological reasons. However, the method of Peng and Myers [16] appears to be promising and we plan to examine its effectiveness in a separate paper.

Data set generation
An image database of 226 gene expression pattern images was initially generated using data from the literature [17][18][19][20][21][22][23][24][25][26][27][28][29]. All were lateral images and exhibited early stage (1-8) expression patterns. These images were selected because they had some commonality of gene expression (as seen by the human eye), which allowed us to evaluate the performance of the BESTi in finding correct as well as false matches under controlled conditions. BESTi was also tested for scalability on a larger dataset containing 1819 (1593 plus the 226) lateral views of early stage embryos. These 1593 images were obtained from 262 articles.
In order to present comprehensible result sets in this paper, we have primarily discussed the findings from the dataset of 226 and provided information on how those queries scaled when they were conducted for the larger dataset. In general, our focus was to show the retrieval of biologically significant matches based on both the visual overlap of the spatial gene expression pattern and the genes associated with the pattern retrieved.
Each image was standardized and the binary expression pattern extracted following the procedures described previously [6]. These extracted patterns, their invariant moments (φ 1 through φ 7), and binary feature representations were stored in a database. We also calculated and stored the expression area (the number of 1's in the binary feature representation of the image), the X and Y coordinates of the centroid (x̄, ȳ), and the principal angle (θ) for each extracted pattern.
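As an illustration of the stored descriptors, the following minimal sketch computes the expression area and the centroid of a binary pattern. It assumes the pattern is held as a list of rows of 0/1 values, with x as the column index and y as the row index; the function names and the toy pattern are hypothetical and are not taken from the BESTi implementation.

```python
def expression_area(pattern):
    """Count the stained (value 1) pixels in a binary pattern given as rows of 0/1."""
    return sum(sum(row) for row in pattern)


def centroid(pattern):
    """Return (x_bar, y_bar): the mean column and mean row index of the stained pixels."""
    area = expression_area(pattern)
    if area == 0:
        raise ValueError("pattern has no stained pixels")
    x_sum = sum(x for row in pattern for x, v in enumerate(row) if v)
    y_sum = sum(y for y, row in enumerate(pattern) for v in row if v)
    return x_sum / area, y_sum / area


# Hypothetical 4x6 pattern with a stained block toward the anterior (left) side.
toy = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(expression_area(toy))  # 4
print(centroid(toy))         # (1.5, 1.5)
```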
To quantify the similarity of gene expressions in two images, we computed two measures (S S , S C ) based on the BFV representation (See equations 2 and 3 in Methods). S S is designed to find gene expression patterns with overall similarity to the query image, whereas S C is for finding images that contain as much of the query image expression pattern as possible without penalizing for the presence of any expression outside the overlap region in the target image. For a given pair of gene expression patterns (A and B), S S is the same irrespective of which image in the pair is the query image. That is, S S (A,B) = S S (B,A). This is not so for S C , because S C measures how much of the query gene expression pattern is contained in the image. Therefore, S C (A,B) ≠ S C (B,A).
For the IMV representation, we computed one dissimilarity measure (D φ, equation 13 in Methods). Results from D φ should be compared to those from S S, because neither measure depends on which image of the pair is the query, i.e., D φ (A,B) = D φ (B,A), and both capture overall similarity or dissimilarity.

Matches and their biological significance
The evaluation of BESTi was geared towards determining the biological validity of the results obtained from the image matching procedure. All results were based solely on quantitative similarities between images without using any textual descriptions. All images were lateral views from the early stages of fruit fly embryogenesis and were oriented anterior end to the left and dorsal to the top. We refer to the images retrieved as the BESTi-matches. Figure 1A shows the query image with gene expression restricted to the anterior (left) portion of the embryo, except that the expression is absent at the anterior terminus [22]. The query image depicts the expression of the sloppy paired (slp1) gene in a wildtype embryo. The BESTi-matches based on the S S measure for the BFV representation are given in Figure 1A1-A8. BESTi retrieves images showing similar expression patterns, all of which are from the same research article as the query image [22]. These images depict the expression patterns of sloppy paired genes (slp1 and slp2) in a variety of genetic backgrounds or in combination with a head gap gene orthodenticle (otd); all of these genes are essential for pattern formation in Drosophila head development [30]. In fact, slp1 and slp2 are tightly linked genes found in the slp locus of the Drosophila genome. They are not only closely related in their primary sequence structure, but also significantly similar in their expression pattern (compare Figure 1A7 and 1A8).

Performance of BFV-S S search
A search was conducted using the same query image and same similarity measure (S S) on the larger dataset. Figure 2 shows the top-35 matches, which contain all 8 matches shown in Figure 1A (images with blue colored legends). This allowed us to directly compare the quality of matches between the two datasets. Analysis of the larger database of images yields more matches for the same S S cut-off value, as expected. A visual inspection reveals that these are all relevant images (Figure 2), with the larger dataset yielding more images for otd (20 images, Figure 2C). Images with expression patterns from slp1, slp2 and combined otd expression are found in Figure 2A, 2B and 2D. More importantly, searches in the larger dataset provide images containing expression patterns of additional genes: Kruppel (Kr), hunchback (hb), bicoid (bcd), nanos, snail, hu-li tai shao (hts) and hairy (Figure 2E-K). Since these images did not exist in the smaller dataset, they were not included in the search results in Figure 1A. All are biologically useful matches because combinatorial input from gap genes (Kr, hb) along with slp1 establishes the domains of segment polarity genes in the head [22]. As for the snail, hts and hairy genes, there is no known interaction between them and slp1 (the gene in the query image) in the wildtype embryo, but the images show overlap in gene expression due to the genetic backgrounds used [31][32][33]. Therefore, they are also biologically relevant matches.
Figure caption (fragment): BESTi-matches are arranged in descending order starting with the best hit for the given search image. Values of difference in centroids (∆C XY) and principal angles (∆θ) are also given. Each image is identified by the last name of the first author of the original research article and the figure number with the following abbreviations: Ashe [19]; Casares [20]; Gaul1 [28]; Grossniklaus [22]; Hartmann [24]; Hulskamp1 [27]; Hulskamp3 [26].

Performance of IMV search
We used the same query image for the IMV method applied to the smaller dataset (D φ, results in Figure 1B) and compared the results to the BFV-S S search. In this case, we obtain images containing expressions of hb, Kr, tailless (tll), slp1, hairy and infra-abdominal (iab) (type I transcript). It is clear that the IMV search produces some biologically disconnected matches. For example, Figures 1B2 and 1B4-B7 exhibit no visual overlap in gene expression pattern with the query. Furthermore, even the biologically significant matches were retrieved out of order (Figure 1B1 before 1B3). This happens because D φ retrieves expression patterns that are of similar shape and/or size, regardless of the translation or rotation with respect to the query image.
A comparison of the results from the smaller and larger datasets for the IMV measure is given in Figure 3. Twenty-six images were retrieved from the larger dataset when we used the same maximum distance value for the same query image. Of these, only two images contained the expression pattern of slp1. Since both the S S and D φ measures capture overall similarity or dissimilarity, we can use Figures 2 and 3 to compare the relative effectiveness of the BFV and IMV methods on the larger dataset. We clearly see that the BFV method performs much better in retrieving both overlapping and similar expression patterns that are also biologically significant.
In addition to the Hu moments, one could also compute Zernike moments, which are based on the polar coordinate system. Both Hu moments and Zernike moments are susceptible to the same problem: expression patterns that have a similar shape but are translated to different locations in the embryo appear in the same result set. We chose to study the Hu Invariant Moment Vectors mainly because the centroid of the image can be used to distinguish between similarly shaped but translated expression patterns. With Zernike moments, the image must be inherently contained within a unit circle anchored at the centroid [34]. Thus, there is no straightforward method to eliminate the translational problem.
Using the Hu moments, the spatial location problem can be corrected by considering the Euclidean difference in the centroid location, expressed in pixels (∆C XY), between the query and each result. In the case of the BFV-S S search results in Figure 1 (A1-A8), the maximum ∆C XY is less than or only slightly greater than the minimum ∆C XY for the IMV search results (Figure 1 B1-B8). Therefore, in the present case, the IMV-based BESTi search results need to be pared down using the centroid location difference. For example, if we retain only results with a ∆C XY less than or equal to 50 pixels, the images shown in Figure 1 B2, B4-B7 would be removed, producing a more meaningful result set.
Figure 1C shows the result for the same query image as used in Figure 1A, but using the newly devised S C measure for the BFV representation (BFV-S C search). This is expected to retrieve images with gene expression patterns that contain the largest amount of overlap with the expression pattern in the query image. The top eight hits shown (Figure 1C1-C8) all contain over 93% of the query expression pattern: five of the matches are to the expression of hunchback (hb; C1, C3-C6) and the remaining three are from slp1 under different genetic backgrounds. As mentioned above, the combinatorial input from gap genes (including hb) along with slp1 establishes the domains of segment polarity genes in the head [22]. Therefore, gene expression patterns found by the BFV-S C search are for developmentally connected genes. However, using the same query image, the BFV-S C search yielded only two images in common with the BFV-S S results (Figure 1; C7 and C8 are the same as A5 and A4, respectively).
This difference occurs because S S is designed to find gene expression patterns with overall similarity to the query image (Figure 1A), whereas S C is intended for finding images that contain as much of the query image expression pattern as possible, regardless of any gene expression in the result image outside the region of overlap with the query image. Therefore, BFV-S S and BFV-S C have the capability of finding gene expression patterns from different biological perspectives.

Using the same minimum similarity value for the BFV-S C search in the larger dataset yielded 55 images, given in Figure 4. Gene expression patterns of slp1 and otd accounted for 8 of these images (Figure 4A and 4B). Twenty-two images contained expression patterns of the various gap genes hb, Kr, kni and tll (Figure 4C, 4E-F, 4I-L) that were co-expressed with bcd and nanos (Figure 4E and 4J) or with en (Figure 4I). Five other genes, developmentally connected to slp1, the gene in the query image, were also retrieved in this result set (eve, twist, dpp (decapentaplegic) [35]; en (engrailed) [36]; arm (armadillo) [37]; Figure 4M-Q). These images were not found in the top-35 of the S S result set, which accentuates the different capabilities of the two BFV similarity measures in retrieving biologically relevant matches. The remaining images had expression patterns of AS-C, sc (scute), snail, hairy, zen (zerknullt), run, Hsp83, nmo (nemo), Tc'hb, iab, hts and sog (Figure 4D, 4G-H, 4R-Z), which are not known to be directly related to the gene slp1. All but seven of these images (Figures 4D3-D4, 4H1-H2, 4R1, 4X1 and 4Y1) were from a different developmental stage than the query image. Hence, by limiting the results to those from a specific stage, extraneous matches can be removed. The seven images having the same stage as the query image were retrieved because of their significant overlap (more than 94%) with the query gene expression pattern. We therefore observe that the new measure S C has the potential to identify images containing expression patterns of developmentally connected genes other than those retrieved by S S, which improves the overall performance of the BFV method and the BESTi tool.

Analysis of multi-domain gene expression patterns
Due to the presence of multiple areas of expression, some patterns in the database that appeared to contain much better matches (by eye and biologically) to the query image were not found or ranked very high. Hence, we also analyzed multi-domain expression patterns separately for the smaller dataset. Developmental biologists are also interested in finding such patterns as they contain overlaps with the expression domains in the query image. In fact, a large number of the expression patterns available today contain multiple isolated domains of expressions since more than one topologically distinct region of expression may be produced by many genes, transgenic constructs, probes or experimental techniques (multiple staining). In such cases, we need to consider each of these regions individually as well as in the context of the composite pattern. Biologically, it is important to consider them separately because different regions of expression may be under the control of distinct cis-regulatory sequences [e.g., [28,38]] or may represent the expression of different genes in a multiply-stained embryo.
We generated multiple images from the same initial image and included them in the target dataset. This resulted in 192 additional images (418 total) in the database, all of which were components of the initial gene expression patterns. The images were separated into expression regions horizontally and/or vertically depending on the gene expression. For this new set of images, the IMV as well as BFV representations were re-calculated and the BESTi query constructed as above.
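The paper separates expression regions horizontally and/or vertically according to the gene expression. As a loose illustration only, the sketch below swaps in a different and fully automatic technique, 4-connected component labeling by flood fill, to produce one binary image per isolated domain. The function name and the toy pattern are hypothetical, and touching or overlapping domains would not be separated by this approach.

```python
def split_domains(pattern):
    """Split a binary pattern into one binary image per 4-connected stained region."""
    height, width = len(pattern), len(pattern[0])
    seen = [[False] * width for _ in range(height)]
    domains = []
    for y0 in range(height):
        for x0 in range(width):
            if not pattern[y0][x0] or seen[y0][x0]:
                continue
            # Flood-fill outward from this unvisited stained pixel.
            domain = [[0] * width for _ in range(height)]
            stack = [(y0, x0)]
            seen[y0][x0] = True
            while stack:
                y, x = stack.pop()
                domain[y][x] = 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and pattern[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            domains.append(domain)
    return domains


# Hypothetical pattern with an anterior block and a separate posterior stripe.
two_domains = [
    [1, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
]
parts = split_domains(two_domains)
print(len(parts))  # 2
```

Each returned image has the same dimensions as the original, so it can be compared against a query with the same similarity measures as any other entry in the database.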
Results from BFV-S S and IMV queries for this data set are given in Figures 1D and 1E, respectively. Now, many images with multiple regions of expression are retrieved in the result set (Figure 1D: D1-D8) and many of them show an even better match with the query pattern than those in Figure 1A for the BFV-based BESTi search. For instance, gene expression patterns are now retrieved (with more than 55% pattern similarity) from embryos with the expression of tailless (tll), which is known to interact with slp1 in defining the embryonic head [22], and with a composite expression of race (related to angiotensin converting enzyme), sog (short gastrulation) and eve (even-skipped) due to enhanced race expression in the anterior domain caused by a transgenic construct causing ectopic expression of sog [19]. Therefore, the strategy of dividing multidomain expression data into individual domains provides additional flexibility to query individual components or sub-sets of complex expression patterns. Results also improved for IMV (Figure 1E), but again the outcome reinforced the need to use the difference in centroid to limit the result set.
Next we examine the performance of S S, S C and D φ in finding BESTi matches for a query pattern with multiple regions of expression (Figure 5A). This complex expression pattern consists of anterior and posterior domains caused by enhanced race expression resulting from dosage alteration of dpp in a gastrulation defective (gd) mutant background, and a middle stripe due to misexpressed sog using an eve stripe-2 enhancer (Figure 2d in [19]). The results from this query are shown in Figure 5A1-A8 (only the original image set (226) was used as the target database in this case). We again find that S S finds many images from the same paper as well as some images from other research articles with similar expression patterns. The results correctly include expression patterns of eve (Figure 5A4), of another pair-rule gene (ftz: fushi tarazu; Figure 5A6), and of two other developmentally related genes [39,40].
When D φ is used as a search criterion, it produces some correct matches in the result set ( Figure 5B1-B8). However, it generally fails to rank biologically meaningful matches as the best matches. Use of the centroid in this case is also not productive, as most of the matches show very close centroids. The principal angle (θ) value calculated does not show a significant difference in the early stage embryos used in this study. The results using the S C based search are given in Figure 5C1-C8. They show a number of images in common with the S S results. However, as expected, there are significant differences between the two searches.
The results in Figures 5D and 5E demonstrate the power of the BESTi-search when the multi-domain expression data are represented in their component patterns (domain database). In this case, all the BESTi searches are based on the use of S S as the search criterion. These searches are based on the complete expression (Figure 5D) and on one of its components (bottom-left domain, Figure 5E). All but one of the BESTi-matches in Figure 5D contain both domains of expression. In contrast, the use of only the left, anterior domain (Figure 5E) in the BESTi search produces many other images in which the gene expression pattern is similar to only the anterior-ventral query pattern. Therefore, the use of individual expression components as search arguments increases the potential of directly identifying different overlapping expression patterns.

Conclusions
We have found that it is possible to identify biologically significant gene expression patterns from a dataset by first extracting numeric signature descriptors and then using those descriptors in a computerized search of the database for expression patterns with similar signatures or maximum pattern similarities. We find that the BFV methodologies provide a longer and more biologically meaningful set of expression pattern matches than IMV. Even though IMV representations will produce much faster retrieval speeds for large collections of embryogenesis images, the lack of biological validity of the BESTi-matches retrieved makes IMV undesirable for the present problem. Instead, investigations and strategies aimed at improving the real time performance of the BFV representation will better serve developmental biology research.

Methods
The wide variety of input methodologies, illumination conditions, equipment, and publication venues involved in the acquisition and presentation of gene expression patterns makes the available gene expression pattern data rather diverse. Extracting a gene expression pattern from its background requires the use of a combination of manual and automatic techniques. Each image is first standardized into a binary image as described in [6]. The standardized images are then represented using the Binary Feature Vector (BFV) [6] and the Invariant Moment Vectors (IMV) [14]. Similarity measures S S and S C are derived from the BFV: S S is the one's complement of the distance metric D E presented in [6], and S C is a new measure introduced in this paper. The third metric, D φ, is derived from the invariant moment vectors.
Figure caption (fragment): BESTi search results with multiple domains of expression using smaller database. Original data used to generate these expression patterns are shown above this row. BESTi-matches are arranged in descending order starting with the best hit for the given search statistic. Values of difference in centroids (∆C XY) and principal angles (∆θ) are also given for panels A, B and C. Each image is identified by the last name of the first author of the original research article and the figure number, with the abbreviations as follows: Ashe [19]; Arnosti [17]; Borggreve [18]; Casares [20]; Gaul1 [28]; Gaul2 [29]; Grossniklaus [22]; Hartmann [24]; Hulskamp1 [27]; Hulskamp2 [25]; Hulskamp3 [26].

Binary Sequence Vector analysis
The binary coded bit stream pattern, in which the two possible states indicate staining over or under a threshold value, is called the Binary Feature Vector (BFV). This is referred to as the Binary Sequence Vector (BSV) in [6]. In other words, we represent each image as a sequence of 1's and 0's, where the black pixels (stained areas) are denoted by a value of 1 and the white pixels (unstained and background) are denoted by a value of 0. This BFV holds the gene expression and localization pattern information of each image.
The expression patterns are ordered by evaluating a set of difference values, D E, between the binary feature vectors of every possible pair of images in the dataset. D E was introduced in [6] and is formally given as
$$D_E = \frac{\mathrm{Count}(A\ \mathrm{XOR}\ B)}{\mathrm{Count}(A\ \mathrm{OR}\ B)} \qquad (1)$$
The term Count(A XOR B) corresponds to the number of pixels not spatially common to the two images and the term Count(A OR B) provides the normalizing factor, as it refers to the total number of stained pixels (expression area) depicted in either of the two images being compared. For simplicity, we use the one's complement of D E as a measure of similarity of gene expression patterns between two images. This measure, S S, is given by the equation
$$S_S = 1 - D_E = \frac{\mathrm{Count}(A\ \mathrm{AND}\ B)}{\mathrm{Count}(A\ \mathrm{OR}\ B)} \qquad (2)$$
S S quantifies the amount of similarity based on the overlap between two expression patterns. S S is equal to 1 when the two expression patterns are identical (D E = 0).
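A minimal sketch of how D E and S S could be computed and used to order a small database against a query follows. It assumes patterns are equally sized lists of rows of 0/1 values; the function names, the toy database, and the convention that two empty patterns have D E = 0 are illustrative assumptions rather than details of the BESTi implementation.

```python
def count_or(a, b):
    """Number of pixels stained in either pattern."""
    return sum(1 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb) if pa or pb)


def count_xor(a, b):
    """Number of pixels stained in exactly one of the two patterns."""
    return sum(1 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)
               if bool(pa) != bool(pb))


def d_e(a, b):
    """Difference value: Count(A XOR B) / Count(A OR B)."""
    union = count_or(a, b)
    if union == 0:
        return 0.0  # assumption: two empty patterns are treated as identical
    return count_xor(a, b) / union


def s_s(a, b):
    """Overall similarity: the one's complement of D_E."""
    return 1.0 - d_e(a, b)


def rank_by_s_s(query, database, cutoff=0.0):
    """Return (score, name) pairs with score >= cutoff, best match first."""
    scored = [(s_s(query, pattern), name) for name, pattern in database.items()]
    return sorted((hit for hit in scored if hit[0] >= cutoff),
                  key=lambda hit: -hit[0])


# Hypothetical one-row patterns, for illustration only.
query = [[1, 1, 0, 0]]
database = {
    "identical": [[1, 1, 0, 0]],
    "shifted": [[0, 1, 1, 0]],
    "disjoint": [[0, 0, 1, 1]],
}
for score, name in rank_by_s_s(query, database, cutoff=0.3):
    print(name, round(score, 3))
# identical 1.0
# shifted 0.333
```

Because both counts are symmetric in the two patterns, swapping the query and the target leaves the score unchanged.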
We introduce a new similarity measure in this paper that does not penalize for any non-overlapping region. The measure S C quantifies the amount of similarity based on the containment of one expression pattern in the other and is given by
$$S_C = \frac{\mathrm{Count}(A\ \mathrm{AND}\ B)}{\mathrm{Count}(A)} \qquad (3)$$
where A is the query image and B is the target image. If the entire query expression pattern is contained within a result image, i.e., there is complete overlap with respect to the query image, S C is equal to 1. Note that S C (A,B) ≠ S C (B,A), because the denominator corresponds to the gene expression area of the query image.
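The containment measure and its asymmetry can be illustrated with the short sketch below. It assumes that the numerator is the number of stained pixels shared by the query and the target and that the denominator is the stained area of the query, as described in the text; the function names and the toy patterns are hypothetical.

```python
def count_ones(a):
    """Expression area: number of stained pixels in a pattern."""
    return sum(1 for row in a for p in row if p)


def count_and(a, b):
    """Number of pixels stained in both patterns."""
    return sum(1 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb) if pa and pb)


def s_c(query, target):
    """Fraction of the query expression area that is also stained in the target."""
    area = count_ones(query)
    if area == 0:
        raise ValueError("query pattern has no stained pixels")
    return count_and(query, target) / area


# Hypothetical patterns: a narrow query stripe inside a broad target domain.
narrow = [[0, 1, 1, 0, 0, 0]]
broad = [[1, 1, 1, 1, 1, 0]]
print(s_c(narrow, broad))  # 1.0 (the query is fully contained in the target)
print(s_c(broad, narrow))  # 0.4 (only 2 of the 5 query pixels are covered)
```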

Invariant Moment Vector (IMV) analysis
Some methodologies of image analysis produce numeric descriptors that compensate for variations of scale, translation and rotation. In the following section, we describe the invariant moment analysis of gene expression data. Invariant moment calculations have been used in optical character recognition and other applications for many years [15].
To calculate these invariant moment descriptors, the standardized binary image [6] is converted to a binary representation of the same pattern (BFV). From this binary sequence of the image, the invariant moments and other descriptors are extracted using the method of [14,41]. The continuous scale equations used define seven invariant moments, φ 1 through φ 7, where φ 7 is a skew invariant to distinguish mirror images.
In the above, φ 1 and φ 2 are second order moments and φ 3 through φ 7 are third order moments. φ 1 (the sum of the second order moments) may be thought of as the "spread" of the gene expression pattern; whereas the square root of φ 2 (the difference of the second order moments) may be interpreted as the "slenderness" of the pattern. Moments φ 3 through φ 7 do not have any direct physical meaning, but include the spatial frequencies and ranges of the image.
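For reference, the standard continuous-scale formulation of Hu's seven invariants [14] can be sketched as follows. This is the textbook form and is assumed, not confirmed, to match the equations used here. In particular, the scale normalization η pq is an assumption: the later remark that φ 1 and φ 2 carry different dimensions suggests that central moments μ pq without normalization may have been used in some calculations.

```latex
\begin{align*}
m_{pq} &= \iint x^{p} y^{q} f(x,y)\,dx\,dy, \qquad
\bar{x} = \frac{m_{10}}{m_{00}}, \quad \bar{y} = \frac{m_{01}}{m_{00}} \\
\mu_{pq} &= \iint (x-\bar{x})^{p} (y-\bar{y})^{q} f(x,y)\,dx\,dy, \qquad
\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\,(p+q)/2+1}} \\
\phi_1 &= \eta_{20} + \eta_{02} \\
\phi_2 &= (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 \\
\phi_3 &= (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2 \\
\phi_4 &= (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2 \\
\phi_5 &= (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})
          \left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] \\
       &\quad + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})
          \left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] \\
\phi_6 &= (\eta_{20} - \eta_{02})
          \left[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]
          + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}) \\
\phi_7 &= (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})
          \left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] \\
       &\quad - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})
          \left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]
\end{align*}
```

In this form φ 1 is the sum of the two second order moments and φ 2 is built from their difference, which is consistent with the "spread" and "slenderness" interpretations given above.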
In order to provide a discriminator for image inversion (and rotation), sometimes called the "6", "9" problem, it has been suggested [14,42] that the principal angle be used to determine "which way is up". This is extremely important in embryo images because gene expression at the anterior and posterior regions may simply appear to be mirror images of each other to the invariant moments, but biologically they are completely distinct. The principal axis of the gene expression pattern f(x, y) is the line through the centroid (x̄, ȳ) about which the rotational inertia is minimum. The angle of the principal axis is called the principal angle θ. It is calculated knowing that the moment of inertia of f around the line
$$(y - \bar{y})\cos\theta = (x - \bar{x})\sin\theta,$$
which is a line through (x̄, ȳ) at angle θ, is
$$I(\theta) = \iint \left[(x - \bar{x})\sin\theta - (y - \bar{y})\cos\theta\right]^2 f(x, y)\,dx\,dy.$$
We can find the θ value at which the moment of inertia is minimum by differentiating this equation with respect to θ and setting the result equal to zero. This produces the following equation:
$$\tan 2\theta = \frac{2\mu_{11}}{\mu_{20} - \mu_{02}},$$
where μ pq denotes the central moment of order (p, q) of f. Using the condition |θ| < 45° one can distinguish the "6" from the "9" and rotationally similar gene expression patterns.
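The principal angle can be sketched numerically as below. The sketch assumes the standard result tan 2θ = 2μ 11 / (μ 20 − μ 02), where μ pq are central moments of the binary pattern, and returns the root with |θ| ≤ 45°. The function names, the handling of the degenerate case μ 20 = μ 02, and the toy patterns are assumptions made for illustration. Note that with image rows as y, a positive angle tilts downward to the right.

```python
import math


def central_moment(pattern, p, q):
    """Central moment mu_pq of a binary pattern (x = column index, y = row index)."""
    area = sum(sum(row) for row in pattern)
    if area == 0:
        raise ValueError("pattern has no stained pixels")
    x_bar = sum(x for row in pattern for x, v in enumerate(row) if v) / area
    y_bar = sum(y for y, row in enumerate(pattern) for v in row if v) / area
    return sum((x - x_bar) ** p * (y - y_bar) ** q
               for y, row in enumerate(pattern)
               for x, v in enumerate(row) if v)


def principal_angle(pattern):
    """Angle in degrees, |theta| <= 45, with tan(2*theta) = 2*mu11 / (mu20 - mu02)."""
    mu11 = central_moment(pattern, 1, 1)
    mu20 = central_moment(pattern, 2, 0)
    mu02 = central_moment(pattern, 0, 2)
    if mu20 == mu02:
        # Degenerate case (assumed convention): no preferred axis, or exactly 45 degrees.
        return 0.0 if mu11 == 0 else math.copysign(45.0, mu11)
    # The stationarity condition alone does not separate minimum from maximum
    # inertia; this simply returns the root inside the +/-45 degree window.
    return math.degrees(0.5 * math.atan(2 * mu11 / (mu20 - mu02)))


bar = [[1, 1, 1, 1]]                 # horizontal stripe
tilted = [[1, 1, 0, 0],
          [0, 0, 1, 1]]              # stripe stepping down to the right
print(principal_angle(bar))          # 0.0
print(round(principal_angle(tilted), 3))  # 22.5
```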
In invariant moment analysis, our initial method of image comparison calculates the Euclidean distance between the images using all moments (φ 1 through φ 7) and combinations of these moments. For example, if the first two invariant moments are used, then each image i is described by the pair
$$\left(\phi_{i1},\ \sqrt{\phi_{i2}}\right)$$
and the distance D ij between a pair of images i and j, where i, j = 1, 2, ..., n, is given by
$$D_{ij} = \sqrt{\left(\phi_{i1} - \phi_{j1}\right)^2 + \left(\sqrt{\phi_{i2}} - \sqrt{\phi_{j2}}\right)^2}$$
This can be expanded to use all of the moment variables.
Here, the Euclidean distance, D φ, between any two images is calculated as
$$D_\phi = \sqrt{\sum_{j=1}^{7} \left(\phi_{ij} - \phi_{qj}\right)^2} \qquad (13)$$
where i and q designate the images whose distance is being calculated and j designates the parameters used in the distance calculation, with j = 1, 2, ..., 7. This assumes that all moments have the same dimensions or that they are dimensionless.
Using this method, it is possible to rank each of the images in order of their similarity based on, for example, the first two invariant moments that have clear-cut physical meanings. Expansion to include additional moments or parameters can be performed in a number of ways. It is possible to add additional parameters to the distance calculation, making sure that each of the parameters has the same dimension. For example, φ 1 has the dimension of distance squared, while φ 2 has the dimension of the fourth power of distance, thus requiring the square root function to equalize dimensions for comparable distance calculation purposes. In general, the greater the number of invariant moments used in the distance calculation, the more selective the ranking. We have also allowed for the use of the centroids and principal angle as a means of list limiting.
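The ranking and list-limiting steps can be sketched as follows. The sketch assumes each database entry stores a seven-element moment vector and a centroid in pixel coordinates. The record layout, the function names, and the numeric values are hypothetical; the 50-pixel centroid limit is taken from the example discussed in the Results.

```python
import math


def d_phi(moments_i, moments_q):
    """Euclidean distance between two invariant moment vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(moments_i, moments_q)))


def centroid_difference(c1, c2):
    """Euclidean distance in pixels between two (x, y) centroids."""
    return math.hypot(c1[0] - c2[0], c1[1] - c2[1])


def imv_search(query, database, max_centroid_diff=None):
    """Rank entries by D_phi (smallest first), optionally limiting by centroid."""
    hits = []
    for name, entry in database.items():
        shift = centroid_difference(query["centroid"], entry["centroid"])
        if max_centroid_diff is not None and shift > max_centroid_diff:
            continue  # similar shape, but located elsewhere in the embryo
        hits.append((d_phi(entry["moments"], query["moments"]), name))
    return sorted(hits)


# Hypothetical records, for illustration only.
query = {"moments": (0.20, 0.01, 0, 0, 0, 0, 0), "centroid": (100, 80)}
database = {
    "same_shape_far": {"moments": (0.20, 0.01, 0, 0, 0, 0, 0), "centroid": (300, 80)},
    "similar_near": {"moments": (0.21, 0.01, 0, 0, 0, 0, 0), "centroid": (110, 85)},
}
print([name for _, name in imv_search(query, database)])
# ['same_shape_far', 'similar_near']
print([name for _, name in imv_search(query, database, max_centroid_diff=50)])
# ['similar_near']
```

Without the centroid limit, the translated pattern ranks first because its moments are identical to those of the query, which is the behavior the list limiting is meant to suppress.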

Authors' contributions
SK originally conceived the project, developed the image distance measures based on the BFV representation, wrote an early version of the manuscript, and edited it until the final version. RG was responsible for writing new and using pre-existing programs to perform the image distance and parameter calculations, helped prepare the figures, searched the literature for gene expression data, maintained the database of gene expression pattern images, and helped in writing the manuscript. BVE provided the IMV method description, managed the day-to-day activities in the project, and did significant editing to produce the manuscript in the desired format for the journal. SP originally proposed the use of invariant moment vectors for biological image analysis, contributed significantly to the image distance and parameter calculations, and provided critical feedback during the later stages of revision.