BMC Bioinformatics BioMed Central

Background The Complete Arabidopsis Transcript MicroArray (CATMA) initiative combines the efforts of laboratories in eight European countries [1] to deliver gene-specific sequence tags (GSTs) for the Arabidopsis research community. The CATMA initiative offers the power and flexibility to regularly update the GST collection according to evolving knowledge about the gene repertoire. These GST amplicons can easily be reamplified and shared, subsets can be picked at will to print dedicated arrays, and the GSTs can be cloned and used for other functional studies. This ongoing initiative has already produced approximately 24,000 GSTs that have been made publicly available for spotted microarray printing and RNA interference. Results GSTs from the CATMA version 2 repertoire (CATMAv2, created in 2002) were mapped onto the gene models from two independent Arabidopsis nuclear genome annotation efforts, TIGR5 and PSB-EuGène, to consolidate a list of genes that were targeted by previously designed CATMA tags. A total of 9,027 gene models were not tagged by any amplified CATMAv2 GST, and 2,533 amplified GSTs were no longer predicted to tag an updated gene model. To validate the efficacy of GST mapping criteria and design rules, the predicted and experimentally observed hybridization characteristics associated to GST features were correlated in transcript profiling datasets obtained with the CATMAv2 microarray, confirming the reliability of this platform. To complete the CATMA repertoire, all 9,027 gene models for which no GST had yet been designed were processed with an adjusted version of the Specific Primer and Amplicon Design Software (SPADS). A total of 5,756 novel GSTs were designed and amplified by PCR from genomic DNA. Together with the pre-existing GST collection, this new addition constitutes the CATMAv3 repertoire. It comprises 30,343 unique amplified sequences that tag 24,202 and 23,009 protein-encoding nuclear gene models in the TAIR6 and EuGène genome annotations, respectively. To cover the remaining untagged genes, we identified 543 additional GSTs using less stringent design criteria and designed 990 sequence tags matching multiple members of gene families (Gene Family Tags or GFTs) to cover any remaining untagged genes. These latter 1,533 features constitute the CATMAv4 addition. Conclusion To update the CATMA GST repertoire, we designed 7,289 additional sequence tags, bringing the total number of tagged TAIR6-annotated Arabidopsis nuclear protein-coding genes to 26,173. This resource is used both for the production of spotted microarrays and the large-scale cloning of hairpin RNA silencing vectors. All information about the resulting updated CATMA repertoire is available through the CATMA database http://www.catma.org.


Background
Rhabdomyosarcoma (RMS) is the most common sarcoma of the soft-tissue, the third most common extracranic solid tumor in children [1]. The incidence of RMS is about four new cases per million per year [2]. RMS is believed to arise from regulatory disruption of growth and differentiation of skeletal muscle stem cells [3,4].
Several approaches, such as FISH, Comparative Genomic Hybridization (CGH), representational difference analysis [5][6][7] were proposed to investigate at a genomic level DNA amplification in tumour samples. Microarray technology methods, applied to identification of genomewide chromosomal imbalances tried to overcome limitations of conventional CGH by hybridizing sample DNA to mapped sequences instead of metaphase chromosomes [8]. More recently, comparative expressed sequence hybridisation (CESH) was applied to the analysis of chromosomal regions including genes overexpressed in a leukemic cell line and in four rhabdomyosarcoma cell lines [9]. A number of discrete regions of increased tumour expression (RITEs) have been identified by a computational method, which compared gene expression levels in unbiased EST libraries from tumour and normal tissues. Several RITEs were identified, corresponding to chromosomal segments previously detected by CGH analysis. Permutation analyses suggested that chromosomal and genomic distribution of RITEs is probably not random [10]. Unfortunately, this study did not consider muscle tissues or muscle tumours.
The aim of this study was to build and compare transcript maps of normal human skeletal muscle and of alveolar rhabdomyosarcoma, in order to identify specific regions of the human genome involved in development of this tumour. This would facilitate discovery of novel genes playing a role in rhabdomyosarcoma and, possibly, of novel tumour markers.
In this paper we report on detection of chromosomal regions harbouring genes mostly expressed in RMS or downregulated in RMS, if compared with normal skeletal muscle, by using different computational methods.

Reconstruction of expression profiles at genome level
We reconstructed tissue expression profiles by a computational approach, starting from unbiased cDNA libraries data. In particular, the expression profile of normal adult skeletal muscle (SM) was reconstructed by using information from 26,964 ESTs, corresponding to 8,517 genes expressed in SM, such as from data obtained by merging four cDNA libraries. Human alveolar rhabdomyosarcoma (ARMS) expression profile was reconstructed from 24,175 ESTs, derived from one library, accounting for 4,549 genes expressed in ARMS. Full data are accessible online [11].
The number of ESTs reported in each UniGene cluster recorded in the expression profile was used to estimate the level of activity of the corresponding gene, as previously described [12], under the assumption that the number of detected ESTs per gene is a function of its transcript frequency in the original population of messenger RNA. Thus, each expression profile is a catalogue of genes expressed in a given tissue. Each gene in the catalogue is identified by UniGene cluster ID and RefSeq accession number and it is associated to the encoded product description and linked to LocusLink and to GenBank databases.

Estimation of density of expressed genes along chromosomes
Genomic map positions, established at one-nucleotide level of resolution, were obtained by querying UCSC BLAT facility with representative DNA sequences of each of 8,517 genes expressed in skeletal muscle and of 5,520 genes expressed in ARMS, according to RefSeq or to Uni-Gene database. Location of each gene in the human genome sequence is given by starting and ending position of the transcribed region, expressed as bp distances from the short arm telomere.
After stringent alignment quality check (see Methods) 98% of genes considered by the study were unambiguously located to their chromosomal positions. In particular, the precise chromosomal location was determined for 8,308 genes expressed in SM and for 5,414 genes expressed in ARMS.
Chromosomal base-pair coordinates of such genes were used to establish "gene density" along each chromosome. Gene density was expressed as a fraction of the genomic sequence covered by gene-associated sequences, in chromosomal regions spanned by a sliding window. Gene density of skeletal muscle genes and of ARMS genes was calculated using 1 Mb-wide windows with 10 Kb overlap, thus obtaining 304,691 different values of gene density along the whole human genome. Window width was set to 1 Mb, in order to scan the genome with a window wider than the average length of most human genes. Taking into account that the average size of human genes is about 14 Kb and the median value is 27 Kb [13], that gene density is highly variable in the human genome, ranging from zero to about 100 genes per Mb (e.g. 6p21.33), and that only a fraction of all human genes is expressed in a given tissue, the selected dimension of one megabase for the window used in the present study should be adequate for measuring gene density in the human genome. The shift between adjacent windows was set to 10 Kb, in order to obtain sufficiently numerous data points, for a precise calculation of gene density along human chromosomes.

Comparison between density of genes expressed in normal skeletal muscle and ARMS
Gene density values, normalized to the total number of expressed genes, were plotted against the base-pair scale length of different human autosomes and of the X chromosome.
Observed gene density values ranged from zero to 1.993 for SM and from zero to 1.197 for ARMS. Average gene density were 0.129 and 0.069 for SM and ARMS, respectively. The difference between normalised ARMS and normal skeletal muscle gene densities was measured. For 27.6% of the genome ARMS gene density was higher than

SM.
The absolute value of gene density difference was higher than 0.3 for 10,653 windows (7.6%). A total of 2,384 windows showing over 0.6 absolute difference were observed (0.7%), corresponding to 33 chromosomal regions harbouring genes overexpressed in ARMS if compared with SM and to 48 chromosomal segments, which appeared strongly downregulated in ARMS (Table 1 and Figure 2). The 0.6 threshold for the observed difference of gene density has been selected in order to pick up genomic regions in which the difference was either moderate or high, without producing an unmanageable increase of genomic regions included and, in particular, to avoid their fragmentation.
Complete plots of gene densities along different human chromosomes are available as supplementary material at our website [14]. As an example, details of the human X chromosome are shown in Figure 1.
Identification of genes differentially expressed in normal and tumour tissues may provide a valuable help in understanding tumour development, in discovering diagnostic and prognostic markers and in identifying novel targets for drug therapy. Digital expression profiling of six different solid tumour types has been used for discovering novel cancer genes [15], but the data set included both normalized and subtracted cDNA libraries and Fisher's exact test was inappropriately used. Artificial neural networks have been used to classify different cancer samples or cell lines on the basis of their expression signature [16].
As far as ARMS is concerned, a cDNA microarray study showed consistent patterns of gene expression in different ARMS cell lines [17].
In the present study, by associating transcripts to the corresponding gene sequence on the human genome, we mapped at genomic level genes expressed in SM and in ARMS. Moreover, we calculated density along each human chromosome of such genes, by using a sliding window approach. Because of the relatively low gene density per window, a statistical approach for comparing gene density was considered to be less convenient than a "qualitative", plot-based approach.
In human genome, possible clustering of genes highly or specifically expressed in specific tissues has been suggested by different studies (see for instance [18][19][20]) which considered both normal and cancer tissues. Fujii and colleagues [21] found that numerous groups of genes expressed in specific tumours, such as squamous cell carcinoma or adenocarcinoma, cluster in given regions of the human genome. On the other hand, recurrent non random imbalances pertaining specific chromosomal regions are frequently reported in tumours. Several experimental or computational methods are currently used for identification of such regions.
We identified specific chromosomal regions hosting more genes expressed in ARMS than expressed in SM. This provides a starting point for identifying novel candidate can-  An effort was made to compare our results with CESH data [9], but the two approaches are different in terms of Density of skeletal muscle and of RMS genes associated sequences along human X chromosome Figure 1 Density of skeletal muscle and of RMS genes associated sequences along human X chromosome. Gene density was calculated in chromosomal regions spanned by sliding windows, as the fraction of genomic sequence covered by geneassociated sequences, on the basis of 15,165 windows of 1 Mb with an overlap between adjacent windows of 10 Kb. As an example, details about gene density on the human X chromosome are shown. Gene density values were normalized to the total number of expressed genes and plotted together on the same base-pair scale axis. Chromosomal regions in which absolute difference between RMS and skeletal muscle gene density was over 0.6 were selected. On the X chromosome, one region resulted to harbour mostly genes expressed in RMS (p22.31-p22.22) and two regions resulted to contain genes expressed in fully differentiated muscle but silent in RMS (q21. 2  sensitivity. At its highest definition CESH measures hybridization intensities on a 450-bands resolution metaphase, whereas our procedure localizes genes on the human genome sequence, at single nucleotide level. However, it may be noticed that in the present study we identified with high confidence one chromosomal region (2p23.2-p23.1) containing a segment showing considerable differential gene density between RMS and SM, within a chromosomal region in which overexpression of RMS Regions of human chromosomes with absolute difference between RMS and skeletal muscle gene density over 0.6 Figure 2 Regions of human chromosomes with absolute difference between RMS and skeletal muscle gene density over 0.6. Thirty-three chromosomal regions harboring mostly genes expressed in RMS are boxed with continue line, whereas 48 chromosomal segments, possibly related to function of differentiated skeletal muscle but silent in RMS are indicated by dashed boxes. genes was detected by CESH [9]. In addition, in the long arm of chromosome 6, reportedly underexpressed in all ARMS lines analyzed by CESH, we found four different regions harboring genes expressed in SM but silenced in ARMS.
From a technical point of view, comparison with the work reporting detection of RITEs [10] was more feasible and meaningful even if they did not take in consideration any SM or ARMS library. We compared chromosomal regions harboring genes more expressed in ARMS than in SM with the list of regions showing increased gene expression in tumour of brain, breast, liver, and lung, identified by Zhou and colleagues [10]. We found that many regions showing higher gene densities in ARMS than in SM were included in this list of regions of increased expression in tumours of several tissues (these regions are indicated with asterisks in Table 1). Therefore, some of these regions could be associated to a general neoplastic phenotype, consequent to genomic rearrangements or due to deregulation of expression of adjacent genes.
Chromosomal regions where differential expression of genes was noticed in ARMS in this study, but were never reported previously, appear as candidate regions for harboring genes possibly involved in origin of ARMS or possible targets for chemotherapy.
Several recurring cytogenetical abnormalities have been observed by Comparative Genomic Hybridization (CGH) studies in RMS cell lines. They involved the gain or loss of complete chromosomes or of specific chromosomal regions. In spite of the different nature of these data, we attempted to find possible overlap between the location of chromosomal segments apparently rich or poor in genes expressed in RMS, identified by the present study, with those of chromosomal regions reportedly involved in prominent imbalances observed in rhabdomyosarcoma by several CGH studies. Several correspondences between ESTs based and CGH results were observed, such as overexpression of RMS genes in chromosomal regions reportedly amplified or translocated in RMS, and underexpression of RMS genes in genomic regions deleted in RMS. In particular, overexpression of genes in ARMS, in regions 1p31.3, 3p12.3, 22q12.3 and 7p14.2-14.1, is consistent respectively with 1p31-p21, 3p12 and 22q amplification and gain of 7p reported by Gordon and colleagues [22]. Underexpression of ARMS genes in regions, 3p24.1, 3p22.3, 3p14.2, 5q34-q32.1, 10q23.33 and 13q32.1 is consistent with loss of 3p, 5q32-qter, 10q23 and chromosome 13 [22]. Furthermore, gain of 2q, reported in 40% of 45 different samples of alveolar and embryonal RMS, 7q (31%), 11q (31%) and 16q (27%) [23] is consistent with our finding of overexpression in RMS of regions 2q22.1 and 2q22.2-q22.3, 7q11.22 and 7q22.1, 11q22.3 and 16q23.3. Moreover, Pandita and colleagues [24] reported gain of regions of chromosomes 2, 5, 7, 8, 11 and 12 in the majority of considered RMS, which is consistent with our findings of eleven distinct regions in these chromosomes harbouring genes expressed in ARMS but silent in SM.

Conclusions
In this study we developed a method and the associated software for the comparative analysis of gene expression in genome. All developed software is freely available online [14]. We identified chromosomal segments showing differential gene expression in human skeletal muscle and in alveolar rhabdomyosarcoma, which appear as candidate regions for harboring genes involved in origin of ARMS and for being possible targets for drug treatment or markers of tumour development.

Methods
We considered for the study only "unbiased" UniGene cDNA libraries pertaining to selected human tissues, such as cDNA clones collections which did not underwent, during their construction, normalization or subtraction processes. These methods deeply bias ESTs frequencies and alter correlation between frequency of ESTs pertaining to a specific gene in the library and gene expression level in the considered tissue.
All data presented in this paper were mined from Uni-Gene release #159 [25], by perl software designed for completing a fully automated procedure for data retrieval and expression profiles reconstruction. After downloading human UniGene clusters and libraries data, expression profiles are reconstructed by eventual pooling of libraries and calculation of expression data. HTML pages are automatically built. Each gene in a profile is identified by gene name and description, UniGene cluster, LocusLink number and GenBank ID of the longest sequence representative of the cluster.
The number of ESTs per each UniGene cluster, pertaining to a specific gene in a given library (or pool of libraries), was used to estimate the gene expression level as a percentage of the total detected transcriptional activity in the tissue (total number of ESTs per library or pool of libraries) [11]. The whole computational work was accomplished by a software pipeline estimating the level of expression of genes and producing expression profiles, integrating gene expression and annotation data in HTML format with links to external resources as LocusLink and GenBank databases. The interactive console tool is conceived to give the possibility of automatically download of large UniGene files and of lists of cDNA libraries files. By simple commands expression profiles are produced. Different profiles could also be merged to create expression data matrices.
A perl-code web client was developed to automatically obtain genome map positions of large sets of genes, requiring only the availablility of a transcribed sequence for each gene. Gene location was established at a nucleotide level resolution, by iteratively querying BLAT search facility at UCSC [26] with the representative nucleotide sequence. Representative sequences of UniGene clusters, or, whenever available, reference sequences of corresponding genes, as established by NCBI RefSeq project [27], were used. BLAT, after searching for similarity between the query sequence and the human genome assembly, outputs the best scoring alignments. We defined stringent thresholds for extension and identity of alignments reported between genomic and query nucleotide sequences, in order to exclude spurious results: only alignments extending for more than 400 bp with an identity percentage greater than 95% were considered. In this way, we obtained tissue-specific chromosomal maps of genes expressed in rhabdomyosarcoma or muscle, encompassing the whole genome. The web client takes as input file a list of UniGene clusters and needs the Hs.data and Hs.sequniq files and their indexes. For each gene, gene name and sequence could be retrieved and this information is used to BLAT the sequence to the Human Genome Sequences and to produce a file output with the list of UniGene clusters genomic positions.
Perl software was also developed to calculate gene density along chromosomes, by using a sliding window. Basically, using as input a list of chromosome lengths in bp and a list of gene locations (start and end of transcribed regions, as bp distances from the p telomere) and given selected window size and length of overlap between adjacent windows, the software allow the calculation of gene density along chromosomes with the method of sliding windows. The calculated gene density for a specific window is associated to its central point and, ultimately, a list of gene densities is compiled.
For each of reconstructed expression profiles, a list of sequences, corresponding to the set of detected transcripts, and the base-pair coordinates corresponding to their start and ending points on the human genome assembly, were used to calculate, for each profile, the fraction of the genome covered by gene associated sequences.
A density value, calculated on the selected window length, was therefore associated to the point coordinate of the human genome sequence corresponding to the center of that window. In this study, gene densities were calculated by using a window length of 1 Mb and a shift of 10 Kb. In presence of overlapping genes, the contribution of each one to the coverage was considered separately. Therefore, the calculated "gene density" resulted to be more than one in some regions.