AbsIDconvert: An absolute approach for converting genetic identifiers at different granularities
© Mohammad et al.; licensee BioMed Central Ltd. 2012
Received: 15 February 2012
Accepted: 9 August 2012
Published: 12 September 2012
High-throughput molecular biology techniques yield vast amounts of data, often by detecting small portions of ribonucleotides corresponding to specific identifiers. Existing bioinformatic methodologies categorize and compare these elements using inferred descriptive annotation given this sequence information irrespective of the fact that it may not be representative of the identifier as a whole.
All annotations, no matter the granularity, can be aligned to genomic sequences and therefore annotated by genomic intervals. We have developed AbsIDconvert, a methodology for converting between genomic identifiers by first mapping them onto a common universal coordinate system using an interval tree which is subsequently queried for overlapping identifiers. AbsIDconvert has many potential uses, including gene identifier conversion, identification of features within a genomic region, and cross-species comparisons. The utility is demonstrated in three case studies: 1) a comparative genomic study mapping Plasmodium gene sequences to corresponding human and mosquito transcriptional regions; 2) a cross-species study of Incyte clone sequences; and 3) an analysis of human Ensembl transcripts mapped by Affymetrix® and Agilent microarray probes. AbsIDconvert currently supports ID conversion of 53 species for a given list of input identifiers, genomic sequence, or genome intervals.
AbsIDconvert provides an efficient and reliable mechanism for conversion between identifier domains of interest. The flexibility of this tool allows for custom definition of identifier domains, contingent upon the availability and determination of a genomic mapping interval. As the genomes and the sequences for genetic elements are further refined, this tool will become increasingly useful and accurate. AbsIDconvert is freely available as a web application or downloadable as a virtual machine at: http://bioinformatics.louisville.edu/abid/.
The Nucleic Acids Research (NAR) 2012 database issue features 1,380 databases covering various aspects of molecular biology including sequences, gene expression, structures, pathways and diseases. Most of these databases are independent of each other and have been created as a result of the respective developers’ domain of interest and resource limitations. Due to a lack of standard naming conventions, most of these databases prefer to assign their own custom generated identifiers (IDs) to the biological entities. Major public databases such as GenBank and RefSeq use accession numbers, Gene Ontology (GO) uses a naming convention from organism-specific databases, the HUGO (Human Genome Organization) Gene Nomenclature Committee (HGNC) uses the gene symbol and a custom generated ID, Entrez uses numeric integers, sequencing projects use systematic names and biologists sometimes use additional aliases. As an example, the breast cancer early onset gene has the official gene symbol of BRCA2 provided by HGNC and an associated ID 1101, Ensembl gene ID ENSG00000139618, OMIM (Online Mendelian Inheritance in Man) ID 600185, HPRD (Human Protein Reference Database)[9, 10] ID 02554, RefSeq ID NM_000059, GenBank Accession U43746, Entrez Gene ID 675, VEGA (the Vertebrate Genome Annotation database) gene ID OTTHUMG00000017411, UCSC[12, 13] gene ID uc001uub.1, UniProt ID P51587, and gene aliases FAD, FAD1, BRCC2, FANCD1, FACD, FANCD.
Fortunately, there is a wealth of information available to the research community in a wide variety of databases. However, it is often difficult to extract or integrate information about a particular biological entity from multiple resources. For instance, a researcher may be interested in extracting functional information spread across different databases for a biological entity such as a gene or a protein; comparing two independent pathways which use different types of identifiers; or comparing results across species, platforms or labs. The lack of a common identifier across these heterogeneous and sometimes redundant biological databases makes the functional analysis of biological data tedious, time consuming, and error prone.
One solution to handle heterogeneous databases is to use a global identifier for annotations such as the one described by MIRIAM (Minimum Information Requested In the Annotation of biochemical Models). MIRIAM requires a global identifier to contain both the data source and an internal identifier. For example, urn:miriam:hgnc:brca2 is composed of urn:miriam, which defines the notation to be a URN (Uniform Resource Name) using the MIRIAM scheme, followed by the data type hgnc and the identifier brca2. This method appears promising and has the potential to solve some of the previously mentioned problems, but very few databases follow this standard. Another solution is to manually search for these genes one by one in publicly available databases such as Entrez, KEGG[16, 17], or GEO[18, 19] and infer their functionality. This method is fruitful when the number of genes is small, but is impractical for high-throughput experiments, where the number of gene fragments can be on the order of tens of thousands or more. A third solution is to use an ID converter tool backed by a database that stores all possible annotations: a list of IDs is input as a query and converted into the corresponding target IDs in a precise and efficient way.
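The structure of the MIRIAM identifier described above can be illustrated with a minimal sketch. The function and type names below (`parse_miriam_urn`, `MiriamId`) are hypothetical and introduced only for illustration; the sketch assumes the simple four-part layout `urn:miriam:<data type>:<identifier>` shown in the example and is not a full implementation of the MIRIAM specification.

```python
from typing import NamedTuple


class MiriamId(NamedTuple):
    """The two informative parts of a MIRIAM-style URN (hypothetical type)."""
    data_type: str   # data source, e.g. "hgnc"
    identifier: str  # identifier internal to that source, e.g. "brca2"


def parse_miriam_urn(urn: str) -> MiriamId:
    """Split 'urn:miriam:<data type>:<identifier>' into its components.

    Assumes the simple layout used in the text; anything after the third
    colon is treated as the identifier.
    """
    parts = urn.split(":", 3)
    if len(parts) != 4 or parts[0].lower() != "urn" or parts[1].lower() != "miriam":
        raise ValueError(f"not a MIRIAM URN: {urn!r}")
    return MiriamId(data_type=parts[2], identifier=parts[3])


example = parse_miriam_urn("urn:miriam:hgnc:brca2")
# example.data_type is "hgnc" and example.identifier is "brca2"
```

Keeping the data source and the internal identifier as separate fields is what allows two databases to reuse the same internal identifier without ambiguity.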
Another difficulty in the development of such tools is the dynamic nature of annotations. Of late, rapid advances in sequencing and their declining costs have enabled researchers to perform novel sequencing as well as resequencing projects. These result in an increased depth of coverage of a genomic sequence, with gaps being filled and repeats more accurately mapped. Sometimes, the sequence underlying a genetic entity may change, and on a less frequent basis the whole genomic sequence needs to be updated (as of August 1st, 2012, the currently available genome versions for human, mouse and rat are 19 (GRCh37), 10 (GRCm38) and 5 (RGSC 5.0) respectively). These changes may modify the structural and functional annotations of a genetic entity (GenBank, RefSeq and Ensembl are updated every day). Frequent updates in annotations also create problems in the manufacturing of DNA microarrays. Microarray chips are designed and their probes are annotated using the current build of a specific genome. Regardless of the care taken in this design, the system will include flaws due to the combination of the delay inherent in the process of microarray design-manufacture-deployment (compounded by the latency to use) and the dynamic nature of annotations. Attempts to address these problems have been the focus of a number of previous studies. Gautier et al. found redundancies in the annotations of Affymetrix® probes at the sequence level, with probes that map to multiple RefSeq genes. Such ambiguities may result in inaccurate interpretations. AffyProbeMiner uses RefSeq and GenBank’s validated complete coding sequences to regroup the probes on an Affymetrix® chip into consistent probe sets. In their study, regrouping of the probes affected almost 65% of the probes on the HG-U133A chip. Harbig et al. reidentified the Affymetrix® U133 plus 2.0 GeneChip® array probes in an attempt to increase the reproducibility of microarray experiments. They used BLAST to remap the probes against the genome and redefined approximately 37% of the probes. These studies suggest that redefinition or reorganization of probesets will improve the analytical accuracy of the microarray data, a process that would be greatly facilitated by a means for high-throughput query and mapping/comparison of given sequences (such as microarray probes) to other genomic annotations stored across a wide variety of databases.
Currently available ID conversion tools
The problem of ID conversion persists even though a number of tools exist to address this problem. Some of these are generic and perform ID conversion for probes, genes, proteins, and additional annotations while others are more specific to DNA microarray probes. Organism support varies with many of the tools catering to either a single organism or a small set of comparable species. In addition, cross–species comparison is variable, with most methodologies providing only intra–species conversion. Almost every approach uses some sort of relational database with the unique identifier being Ensembl IDs, RefSeq IDs, or custom generated IDs. A brief description of some popular tools follows.
DAVID (Database for Annotation, Visualization and Integrated Discovery)[24–26] is a web based structural and functional annotation tool to extract biological meaning from a gene list. It uniquely generates custom IDs for querying a set of relations and is dependent on annotations from other databases. A component of DAVID, DICT (DAVID gene ID Conversion Tool), facilitates ID conversion. EASE, developed by the DAVID Bioinformatics team, is a customizable, standalone, Windows® desktop software application with analytic capabilities similar to those of DAVID. Babelomics[29, 30] is an integrated web based tool for structural and functional annotation with an ID converter being one of its components. This component uses a universal index linked to Ensembl to create a database of 11 species. g:Convert, a component of g:Profiler, allows arbitrary conversion of genes, proteins and probes into one another. Every alias in g:Profiler is mapped through a three-level index of gene, transcript and protein Ensembl IDs. For each index level, all corresponding IDs are stored in the database. The Hyperlink Management System and ID Converter System automatically updates and maintains hyperlink information among major public biological and chemical databases. It downloads data every day from authoritative databases and produces a large correspondence table which is used to show the most up-to-date URL for genes of interest. Users can use CGI programs to create hyperlinks to this data. Synergizer assigns a unique internally generated identifier, “peg”, to all external IDs that refer to the same biological entity. It mostly uses the NCBI “gene2accession” file to maintain a database of synonym relationships and produce a simple web interface. MADGene uses correspondence tables and allows conversions in an efficient way. The Clone/Gene ID Converter, MatchMiner, the Gene name converter in GeneMerge, RESOURCERER and GeneLynx are additional ID conversion tools.
Some of the ID conversion tools are more specific, such as those that work only at the probe level. GATExplorer is a web based tool for analysis and visualization of Affymetrix® probes at the genomic and transcriptomic level. It performs de novo mapping of all the probes of Affymetrix®’s expression and exon arrays against the transcriptome of the corresponding organism using BLAST and records the coordinates on the genome. Unmapped probes are mapped to an ncRNA database downloaded from RNAdb. Only the perfect match alignment is selected while mapping these probes. The location of a gene or probe on the genome can be visualized along with all the transcripts present in that region. NetAffx™, provided by Affymetrix®, performs ID conversion of Affymetrix® probes for different organisms and has a feature to perform structural and functional annotation. PLANdbAffy is a Probe-Level ANnotation database for Affymetrix® microarrays (HG-U133A, HG-U133B, HG-U133 plus 2.0, Human Exon 1.0, Human Gene 1.0) that uses BLAT to map individual probes onto the human genome. These probes are then annotated using information extracted from RefSeq. ProbeMatchDB uses a number of public databases to perform cross-species and cross-platform probe mapping. The database conversions are enabled by UniGene and HomoloGene identifiers. UniProt’s[45, 46] ID mapping tool works on the gene and protein level and converts gene IDs into UniProt IDs and vice versa.
Some software tools have unique methods for mapping between different IDs. Onto–Translate[47, 48] converts one type of IDs into another by calculating the optimal path between IDs, taking into account the “trustworthiness” of data contained in various databases. The AliasServer uses a custom generated unique 64-bit reference identifier which is computed from the amino acid sequence using the CRC (Cyclic Redundancy Check) algorithm where each ID is a unique combination of species identifier, type of database and the ID itself.
Feature comparison of different conversion tools (as of April 2012). Table body lost in extraction; the surviving cells indicate columns for supported ID types (genes, proteins, probes, ESTs, UniGene clusters), organisms (H, M, R, O), output formats (html, txt, xls) and availability (web, API, ⇓, VM).
ID converter tools, data sources and availability. Table body lost in extraction; the data sources named in the surviving cells are GenBank, RefSeq, KEGG, OMIM, UniGene, GO, Ensembl, TRANSFAC, Reactome, NCBI, RGD, SGD, WormBase, EcoCyc, PubMed, UCSC, GEO, Entrez, EMBL, Affymetrix® and Agilent.
Drawbacks associated with existing approaches
As stated previously, annotations are dynamic and databases such as Ensembl and RefSeq are updated daily making it difficult to keep the databases of ID conversion tools current. This is more problematic when the intermediate IDs are custom generated as these require more effort to update. Most of the tools are based on a relational database and the dynamic nature of annotations may introduce database anomalies because of the frequent insertion, deletion and updating of the annotations. If a gene is discovered, deleted or updated in any of these databases, or the annotations corresponding to an entity are added, deleted or updated, then all the databases or correspondence tables also need to be updated. In the case of microarray experiments, if a probe corresponds to a recently deleted entity then that probe annotation needs to be edited as well. Updating any of these authoritative databases may induce a chain-reaction for any other systems using that information and any experimental result deduced from the updated probe may become invalid. Those tools that generate their own unique identifier such as DAVID, Synergizer or Babelomics, although efficient, face a similar situation and need to be updated frequently. As updating an annotation database is labor and resource intensive, some of the tools cannot afford to update their knowledgebase regularly.
Absolute (sequence based) method for ID conversion
A feature of biological entities that is currently ignored in ID conversion is the sequence mapping information. For species where a reference genome is available, all nucleic acid and protein-based annotations, no matter the granularity, can be aligned to that reference genome sequence and therefore annotated by genomic intervals. Once the absolute genomic coordinates on a reference genome for all entities have been determined, these can be queried to find all overlapping entities, thus performing ID conversion. This conversion uses the same two step method as adopted by most of the ID conversion tools, considering the genomic coordinates as the basis of conversion, rather than the annotation level information used by other tools. Compared to other types of intermediate IDs, the intervals on a reference genome sequence are relatively static, and remapping of entities to modified genomic sequences is relatively trivial, making it possible to easily update the system. Using interval trees, conversion by finding overlapping intervals is fast and efficient.
Once structural annotation for each of the identifiers is available, AbsIDconvert can query this information. This query step uses the structural annotation information of each identifier and the organism-specific database generated from the previous step. AbsIDconvert assumes two biological entities (nucleic acid or protein) are the same if their genomic intervals are identical, overlap, or one is contained within the other. As the number of annotations is large and frequent insertions and deletions are routine, an efficient data structure for storage and computational operations is needed. Considering that the structural annotation is in the form of genomic intervals, a modified Red-Black tree, known as an interval tree, is used to store the information for all IDs. An interval tree maintains a dynamic set of elements, with each element x containing an interval int[x]. This int[x] stores the start and end of the interval along with other auxiliary information. The data structure is dynamic in nature and can perform insertions and deletions efficiently in O(log₂ n) time, where n is the number of elements. Interval trees have been shown to be efficient for working with a large number of genomic intervals.
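The idea of storing identifiers as intervals and retrieving everything that overlaps a query can be sketched as follows. This is a simplified illustration and not the AbsIDconvert implementation: it uses a plain, unbalanced binary search tree augmented with the maximum end coordinate of each subtree, so the logarithmic bound only holds when the tree happens to be balanced (a Red-Black tree adds the rebalancing omitted here). The identifiers and coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    start: int                      # interval start (inclusive)
    end: int                        # interval end (inclusive)
    label: str                      # identifier stored with the interval
    max_end: int                    # largest end coordinate in this subtree
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], start: int, end: int, label: str) -> Node:
    """Insert an interval keyed by its start; keep max_end up to date."""
    if root is None:
        return Node(start, end, label, max_end=end)
    if start < root.start:
        root.left = insert(root.left, start, end, label)
    else:
        root.right = insert(root.right, start, end, label)
    root.max_end = max(root.max_end, end)
    return root


def query(root: Optional[Node], start: int, end: int,
          hits: Optional[List[str]] = None) -> List[str]:
    """Return labels of all stored intervals overlapping [start, end]."""
    if hits is None:
        hits = []
    # Nothing in this subtree ends at or after the query start: prune.
    if root is None or root.max_end < start:
        return hits
    query(root.left, start, end, hits)
    if root.start <= end and start <= root.end:
        hits.append(root.label)
    # Right subtree starts are >= root.start; skip it if that is past the query.
    if root.start <= end:
        query(root.right, start, end, hits)
    return hits


# Hypothetical annotations on one chromosome.
tree = None
for s, e, name in [(100, 500, "GENE_A"), (120, 480, "TX_A1"),
                   (450, 474, "PROBE_1"), (900, 1200, "GENE_B")]:
    tree = insert(tree, s, e, name)

converted = query(tree, 450, 474)   # identifiers overlapping the probe interval
```

Here a 25-base probe interval is "converted" to the gene and transcript whose intervals contain it, which is the overlap-based notion of identity described above.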
There are four possible ways in which AbsIDconvert may be queried:
Lookup identifiers: Given a mixed list of identifiers, AbsIDconvert can determine the types of identifiers in the list. This step uses the relational database created in the preprocessing step and can efficiently categorize the IDs in the list.
Batch conversion of IDs: Given a list of identifiers, AbsIDconvert uses the interval tree to find their genomic coordinates. Once the coordinate information is available, all overlapping identifiers can be found by querying the interval tree. This uses the IRanges and GenomicRanges packages internally to maintain the genomic intervals, which are based on Allen’s Interval Algebra. Users can specify various range parameters using the interface. The overlap type (‘type’) parameter may take any one of ‘any’, ‘start’, ‘end’, ‘equal’ or ‘within’ as its value. By default ‘any’ overlap is accepted. If the ‘type’ value is ‘start’ or ‘end’, then the query intervals are required to have a start or end coordinate, respectively, that matches that of the subject intervals in the database. If ‘type’ is ‘equal’ then only those subjects are retrieved which have exactly the same coordinates. For ‘within’, the query must be contained wholly within the subject intervals. Another parameter specifies the maximum gap (‘maxgap’) allowed between subject and query intervals for them to be considered overlapping. The default value is zero, which means that no gap is allowed between the subject and query intervals. This parameter is useful for finding genes in the flanking regions of the specified intervals. The third parameter is the minimum overlap (‘minoverlap’), which specifies the minimum number of overlapping base pairs needed to consider the query and subject as overlapping. The default value is one. The last parameter is the ‘select’ parameter, which specifies which overlaps will be reported. By default, all overlapping intervals will be reported. Selecting ‘first’, ‘last’ or ‘arbitrary’ will report the first, the last or an arbitrary overlapping interval from the result. A simple example using intervals is shown in Figure 5. In this case, the reference genome is 10 bp long. The subject database contains four intervals s1, s2, s3 and s4. The query also consists of four intervals q1, q2, q3 and q4. Considering default values for the range parameters, q1 overlaps with s1, q2 and q3 overlap with all the intervals in the subject, whereas q4 overlaps with s2, s3 and s4. If the values of the parameters are type = ‘within’, maxgap = 0, minoverlap = 1, select = ‘all’ then q1 overlaps with s1, q2 with s2 and q4 with s2 and s3. If the values of the parameters are type = ‘end’, maxgap = 1, minoverlap = 1, select = ‘all’ then q2 overlaps with s2, q3 with s3 and q4, and q4 with s2.
Intervals as input to AbsIDconvert: A unique feature of the ID conversion is to find target identifiers corresponding to a given interval. For example, next-generation sequencers generally map the DNA sequences or reads to a reference genome and output the interval for each aligned read. Finding desired target identifiers corresponding to these intervals is routinely required. AbsIDconvert efficiently converts these coordinates into target identifiers in a high-throughput manner. For instance, a user of AbsIDconvert is able to take a set of intervals upstream of a set of transcription start sites to determine if any features are annotated proximal to the regions of interest.
Sequences as input to AbsIDconvert: Sometimes a user may be interested in finding all identifiers that correspond to a particular sequence or a list of sequences. For instance, a user may be interested in finding all gene names and Entrez IDs corresponding to a set of sequences. In this case, AbsIDconvert maps these sequences to the corresponding genome (or any other genome for cross-species comparisons), determines the genomic intervals they belong to, and then retrieves all the desired target identifiers that overlap these intervals. Due to the computational complexity involved in mapping long sequences using a generic mapping algorithm such as BLAT or BLAST, the web version of AbsIDconvert supports only short sequence mapping using Bowtie. Longer sequences can be mapped using BLAT in the virtual machine version of AbsIDconvert. Sequence output from next-generation sequencing technologies can be handled efficiently using AbsIDconvert. Alternatively, the coordinate information may be obtained by submitting the sequences to Galaxy[60–62] or the UCSC genome browser and subsequently inputting the intervals into AbsIDconvert. Mapping parameters can be specified by the user through the interface. The first parameter is the maximum number of mismatches, which can range from zero (default) to three. The second mapping parameter specifies which type of alignments are to be reported. The default value is ‘all Best’, in which all best alignments will be reported by Bowtie. However, ‘all’, ‘k’ or ‘k Best’ can be selected for Bowtie output. AbsIDconvert also has another parameter, ‘Do not report (.more)’, that takes an integer value and specifies that Bowtie will suppress all alignments for a particular read if the total number of reportable alignments for that read is more than the specified value. The default value of -1 specifies that all alignments will be accepted. For instance, if this value is set to 100, then Bowtie will suppress all alignments for reads that map to more than 100 locations on the genome. This is an effective option to keep repeat sequences or short sequences out of the output, because they are more likely to map at multiple locations on the genome.
AbsIDconvert supports 53 major species for performing ID conversion on a list of identifiers and a list of intervals. It also has sequence level mapping support for 12 major species including Homo sapiens, Mus musculus, Rattus norvegicus, Bos taurus, Gallus gallus, Sus scrofa, Xenopus tropicalis, Anopheles gambiae, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, and Danio rerio. AbsIDconvert converts the input (intervals, IDs and sequences) into target identifiers with links to authoritative databases. All intermediate interval files are available to download for later use. It also generates custom annotation files that can be used to view the IDs simultaneously (chromosome–wise) as a custom track in the UCSC Genome Browser. The performance and potential uses for AbsIDconvert are discussed in the following sections.
Results and discussion
Intervals vs. relational database
The genomic coordinate information for different identifier types mapped to 53 species was stored as intervals. An interval tree method was implemented and used to store and query the corresponding interval information for each identifier type. For comparison with relational databases, an equivalent MySQL database was implemented to perform ID conversion based on coordinate information, and the run times of the two methods were compared.
Run time (sec.) to convert 1,000 IDs from one type to another using web–based AbsIDconvert
1. NCBI contains the most up-to-date information and its annotations are correct.
2. An Entrez ID may be annotated by more than one gene symbol.
3. Given an Entrez ID x, if a tool converts x to a set of gene symbols, Y (x → Y), and NCBI annotates x to another set of gene symbols, Z (x → Z), then accuracy terms can be defined as:
- True positives (TP) are those conversions in which the converted gene symbol set contains all the gene symbol(s) annotated by NCBI (i.e. Z ≠ ∅ and Z ⊆ Y).
- False positives (FP) are unexpected results. This includes incorrect conversions (Z ⊈ Y), as well as those conversions in which NCBI does not annotate an Entrez ID with any gene symbol, but a tool finds some gene symbol corresponding to that Entrez ID (Z = ∅ and Y ≠ ∅).
- False negatives (FN) are missing conversions in which a tool could not find corresponding gene symbol(s) (Z ≠ ∅ and Y = ∅).
- True negatives (TN) are the correct absence of conversion, in which neither NCBI nor a particular tool converts an Entrez ID to any gene symbol (Z = ∅ and Y = ∅).
4. Accuracy is defined as
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \qquad (1)$$
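The accuracy terms can be illustrated with a short sketch that classifies each conversion from the two symbol sets. It reads a true negative as the case where neither NCBI nor the tool reports a symbol, and it takes accuracy to be (TP + TN) / (TP + FP + FN + TN); both are assumptions where the extracted text is ambiguous. The function names and the Entrez IDs and gene symbols in the example are hypothetical.

```python
from typing import Dict, Set, Tuple


def classify(tool_symbols: Set[str], ncbi_symbols: Set[str]) -> str:
    """Classify one conversion: Y = tool_symbols, Z = ncbi_symbols."""
    if not ncbi_symbols:
        # NCBI has no symbol: correct only if the tool also reports none.
        return "TN" if not tool_symbols else "FP"
    if not tool_symbols:
        return "FN"                     # tool missed an annotated ID
    # Tool reported something: correct only if it covers every NCBI symbol.
    return "TP" if ncbi_symbols <= tool_symbols else "FP"


def accuracy(tool: Dict[str, Set[str]],
             ncbi: Dict[str, Set[str]]) -> Tuple[Dict[str, int], float]:
    """Count TP/FP/FN/TN over all IDs in the reference and compute accuracy."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for entrez_id, reference in ncbi.items():
        counts[classify(tool.get(entrez_id, set()), reference)] += 1
    total = sum(counts.values())
    value = (counts["TP"] + counts["TN"]) / total if total else 0.0
    return counts, value


# Hypothetical reference annotation (Z) and tool output (Y) for five IDs.
ncbi = {"1": {"A"}, "2": {"B", "C"}, "3": set(), "4": {"D"}, "5": set()}
tool = {"1": {"A", "A2"}, "2": {"B"}, "3": set(), "4": set(), "5": {"E"}}

counts, acc = accuracy(tool, ncbi)
```

In this toy data, ID 1 is a true positive (the tool's set contains the NCBI symbol), ID 2 a false positive (one NCBI symbol is missing), ID 3 a true negative, ID 4 a false negative and ID 5 a false positive.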
Entrez ID to gene symbol conversion accuracy
Of the 94 Entrez IDs that AbsIDconvert was not able to convert but other tools were (Additional file 2), most were either “not on current assembly”, meaning that the reference sequence for that Entrez ID could not be mapped to the current genome (28 IDs), but could be mapped to previous genome assemblies; or “not annotated on reference assembly”, indicating that the sequence cannot be found on the reference assembly at all (61 IDs). Five conversions were found where the Entrez IDs reported had since been deleted and replaced (DAVID and MADGene both converted these IDs).
In a second conversion test, 1,000 randomly sampled Entrez IDs were converted to RefSeq IDs using ten of the 19 tools listed in Table 1 (the others are not able to perform this type of conversion and were not evaluated). There are many different classes of RefSeq IDs, including mRNA (ID starts with NM_), RNA (NR_), protein (NP_), as well as predicted versions of each one (XM_, XR_ and XP_ respectively). How RefSeq IDs are segregated for conversion differs among the tools tested. For example, a number of tools combine all the different types of RefSeq IDs into one converted ID type while others treat each one separately. Other tools ignore the predicted RefSeq IDs and only consider mRNA and RNA. For example, AbsIDconvert’s RefSeq database combines both mRNA and RNA, whereas MADGene includes predicted products (XM). DAVID and Synergizer have separate options for RNA and mRNA RefSeq. Therefore, to enable comparison between all the tools, only those conversions that result in mRNA or RNA RefSeq IDs are considered, and for those tools that report them separately, the results from both conversions were combined. In addition, any predicted RefSeq IDs (i.e. those that begin with X) were removed.
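The normalization applied before comparing tools (keep mRNA and RNA accessions, drop protein and predicted ones, merge separately reported results) can be sketched as below. The prefix-to-class table follows the classes listed in the text; the function names are hypothetical, and apart from NM_000059 and NM_017511, which appear in the article, the accessions in the example are made-up placeholders.

```python
from typing import Iterable, List

# RefSeq accession prefixes and the classes named in the text.
REFSEQ_CLASSES = {
    "NM_": "mRNA",
    "NR_": "RNA",
    "NP_": "protein",
    "XM_": "predicted mRNA",
    "XR_": "predicted RNA",
    "XP_": "predicted protein",
}

COMPARABLE = {"mRNA", "RNA"}  # classes kept for the cross-tool comparison


def refseq_class(accession: str) -> str:
    """Return the RefSeq class implied by the accession prefix."""
    return REFSEQ_CLASSES.get(accession[:3].upper(), "other")


def comparable_ids(*result_lists: Iterable[str]) -> List[str]:
    """Merge one or more result lists and keep only mRNA/RNA accessions."""
    merged = {acc for results in result_lists for acc in results}
    return sorted(acc for acc in merged if refseq_class(acc) in COMPARABLE)


# A tool that reports mRNA and RNA separately: combine, then filter.
mrna_results = ["NM_000059", "XM_000001"]          # XM_000001 is a placeholder
rna_results = ["NR_000002", "NP_000003"]           # placeholders
kept = comparable_ids(mrna_results, rna_results)
```

Merging into a set before filtering also removes duplicates when the same accession is returned by both of a tool's conversion options.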
Entrez ID to RefSeq ID conversion accuracy
The results from the four most accurate tools were investigated further. A total of 497 Entrez IDs were converted by all four tools (Figure 8(b)). AbsIDconvert converted 586, followed by MADGene (551), DAVID (549) and Onto-Translate (501). Five conversions specific to MADGene were not found by AbsIDconvert (Additional file 3). In this case, AbsIDconvert correctly mapped the Entrez IDs to the genome (Additional file 4); however, the corresponding RefSeq IDs were not in the data obtained from UCSC. Other conversions that AbsIDconvert did not report were found to be false positives reported by other tools. For example, DAVID and Onto-Translate both reported converting “4586” to “NM_017511” and “441956” to “NM_001013729”; however, the genomic intervals for those IDs do not overlap, and both RefSeq IDs are shown in NCBI as “permanently suppressed”. For the twenty conversions specific to DAVID, the reported RefSeq IDs were found to be associated with different Entrez IDs in NCBI (Additional file 5).
The thirty-eight Entrez IDs converted only by AbsIDconvert were investigated further to verify whether they were “correct”. Thirty-three are in agreement with the NCBI data (Additional file 6). For the other five, we examined the genomic intervals of both the Entrez IDs and reported RefSeq IDs to verify that they do indeed overlap (intervals are reported in Additional file 7). In all cases the converted IDs do have overlapping intervals, with two of the Entrez IDs having been discontinued and replaced since the initial construction of the AbsIDconvert database: “100505905” (to “23189” on March 2, 2012) and “100652874” (to “100505641” on Feb 3, 2012).
To better assess the accuracy of AbsIDconvert compared to other tools, the Entrez to RefSeq ID conversion was repeated ten times, randomly choosing 1,000 Entrez IDs each time. Out of the 10,000 randomly selected Entrez IDs, 8,974 were unique. AbsIDconvert converted 5,700 (63.5%), followed by MADGene (5,343, 59.5%), DAVID (5,254, 58.5%) and Onto-Translate (4,786, 53.3%) (Figure 8(c)). A total of 945 (10.5%) of the IDs were exclusively converted by AbsIDconvert.
In the third conversion, 1,000 randomly sampled human Affymetrix® GeneChip HG-U133 Plus 2.0 probesets were converted to Agilent Cgh44b probes (Figure 8(d)). This type of cross-platform conversion is important in meta-analysis studies where results are drawn by integrating and analyzing data from a number of independent studies/platforms. As this type of conversion is available only in Synergizer, we compared the conversion results of this tool with AbsIDconvert. Synergizer converted 183 whereas AbsIDconvert converted 162 probesets. The small number of conversions is primarily due to the design differences of the probes on these chips. Two questions required deeper investigation: 1. Why was AbsIDconvert not able to convert 64 Affymetrix® IDs that were successfully converted by Synergizer; and 2. Are the 43 conversions exclusive to AbsIDconvert valid? To answer these, we extracted the design annotation of all the Affymetrix® GeneChip HG-U133 Plus 2.0 probesets provided by Affymetrix’s NetAffx along with the design annotations for the Agilent Cgh44b probes supplied by Agilent. These provided the individual locations of each probe on the hg19 genome, thereby enabling investigation of the interval separation between the probesets.
In order to examine the 64 probesets converted by Synergizer but not by AbsIDconvert, the genomic location(s) of the Affymetrix® probesets were compared to the genomic locations of the Agilent probes. Fifty-six (out of 64) of the probes are separated according to their genomic locations and do not overlap at all. This separation ranges from 75 to 418,671 bp with a median separation of 4,736 bp. Further analysis determines that these all lie in the regions between the individual probes of the respective probesets and therefore have no shared sequence identity.
Most of the ID converter tools including Synergizer map the genetic entities (probes, probesets) spanning tens of bases to an intermediary such as Ensembl that is at a coarser granularity spanning a few kilobases with possible intronic regions. While performing conversions, these tools only use the probe annotation, disregarding the actual sequence information. The above false positives provided by Synergizer are likely the result of ignoring the sequence level information as the two types of probes actually span different genomic intervals.
Next we considered conversions found exclusively by AbsIDconvert. Based on the official annotation from NetAffx™, we found that the intervals for all 43 Affymetrix® probesets actually contain or overlap the converted Agilent probes, with a mean overlap of 56.43 bases. Considering that most of the Agilent probes are 60 bases long and that an Affymetrix® probeset contains overlapping 25 bp probes, this indicates that most of these Agilent probes are contained within the Affymetrix® probeset region. These probesets were then checked at the probe level, and it was determined that the converted Agilent probes either overlap individual Affymetrix® probes to some extent or are completely contained, with a mean overlap length of 38.70 bp. We are not sure why Synergizer was unable to convert these 43 probesets; however, the official annotation confirms these conversions and bolsters our confidence in the power and accuracy of our sequence-based ID conversion.
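The overlap lengths discussed here can be computed directly from interval endpoints. The sketch below assumes closed (start, end) coordinates on the same chromosome and strand; the function name and all coordinates are hypothetical and chosen only to mirror the 60-base and 25-base probe lengths mentioned above.

```python
from statistics import mean

def overlap_length(a, b):
    """Number of bases shared by two closed intervals (start, end); 0 if disjoint."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

# Hypothetical 60-base probe and two hypothetical 25-base probes.
long_probe = (1000, 1059)
partly_overlapping = (1040, 1064)   # extends past the end of the long probe
fully_contained = (1010, 1034)      # lies entirely inside the long probe

overlaps = [overlap_length(long_probe, p) for p in (partly_overlapping, fully_contained)]
print(overlaps, mean(overlaps))
```

A 25-base probe that is fully contained contributes its whole length, whereas a probe that only partly overlaps contributes fewer bases, which is why a mean overlap can fall between zero and the probe length.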
Three illustrative case studies were explored to demonstrate the capabilities of AbsIDconvert. The first case study considers sequence-based mapping of identifiers in a comparative genomics analysis of organisms involved in malaria; the second examines remapping of probes to annotations within and across species using a historical cDNA platform from Incyte; and the third identifies Ensembl transcripts mapped by Agilent and Affymetrix® arrays.
Case study 1: Comparative genomics: Plasmodium mapped to human and Anopheles gambiae
Recent studies have surveyed the role of both host and pathogen genetic variability to determine molecular signatures for host-pathogen interactions. While the interactions between a pathogen and its host are often mediated by the host immune system responses to the pathogen, host-pathogen relationships theoretically have the potential to create a metagenomic environment whereby the total transcriptome is contributed by both the host and pathogen genes. In some cases, such as Neisseria meningitidis, a direct interaction between host and pathogen genes has been demonstrated. As an illustrative example, it might be possible that shared sequence similarities between pathogen and host genes play a role in host gene regulation via pathogen genes and gene products that provide additional promoter sites, miRNA targets, and binding motifs similar to those found in the host. To test the feasibility of this idea in the context of malaria, we used AbsIDconvert to identify coding sequences identical between the Plasmodium falciparum (PF) and Plasmodium vivax (PV) species and the human and Anopheles genomes.
Plasmodium is a parasite responsible for causing malaria in humans, primarily in tropical and subtropical areas. About 3.3 billion people are at risk of this disease, leading to 250 million malaria cases and one million deaths worldwide every year (http://www.who.int/features/factfiles/malaria/). Altogether, four Plasmodium species are responsible, and they are carried by the female Anopheles gambiae mosquito. PF and PV are the most common, with PF being the deadliest.
A total of 75 gene fragments from PF (PF_Hg19 in Figure 9(a)) had an exact sequence match to 692 human genes (PF_Hg19 in Figure 9(b)). For PV, the aligned number of gene fragments and corresponding genes were 17 (PV_Hg19 in Figure 9(a)) and 340 (PV_Hg19 in Figure 9(b)), respectively. These numbers indicate that the gene fragments align to multiple locations on the human genome. Among the genes that were mapped from PF and PV gene fragments, a total of 134 genes were common. When the same gene fragment sequences from PF and PV were aligned to the Anopheles gambiae genome (AnoGam2), a total of 99 (PF_AnoGam2 in Figure 9(a)) gene fragments from PF were mapped to 87 (PF_AnoGam2 in Figure 9(b)) different genes, showing that the correspondence between the gene fragments and genes is largely one-to-one. These numbers for PV were 12 (PV_AnoGam2 in Figure 9(a)) and 31 (PV_AnoGam2 in Figure 9(b)), respectively.
Significantly enriched (p-value < 0.001, number of genes ≥ 2) Gene Ontology biological processes for the P. falciparum and P. vivax genes
positive regulation of developmental growth
central nervous system development
regulation of glycoprotein biosynthetic process
extracellular structure organization
retinal ganglion cell axon guidance
positive regulation of axonogenesis
homophilic cell adhesion
smooth muscle tissue development
organic substance transport
regulation of glucose transport
positive regulation of glycogen biosynthetic process
positive regulation of glucose metabolic process
positive regulation of carbohydrate metabolic process
positive regulation of cellular carbohydrate metabolic process
actin cytoskeleton organization
actin filament-based process
Case study 2: Reinterpretation of prior datasets
Homologous genes can be compared across species using NCBI’s Homologene resource when gene names are known. However, if sequence information is available, it is preferable to use that sequence to determine whether homology exists based on sequence conservation, particularly in cases where probes of known sequence are being used to measure a specific gene, such as in DNA microarrays or in situ hybridization. Both methodologies were applied to the Incyte array used in [69–71].
Comparison of Homologene and sequence based homologs
Case study 3: Meta-analytic studies across platforms
Meta-analysis enables the integration of many different experiments with a common research hypothesis. However, high-throughput -omics meta-analyses are hindered by heterogeneity in DNA microarray designs (length and location of probes), data acquisition, and analysis, as well as by inter- and intra-study variability. Therefore, many meta-analyses use the same species or even the same array platform to mitigate some of these heterogeneities. Nevertheless, many studies still attempt cross-platform and inter-species meta-analyses, and tools such as AILUN (Array Information Library Universal Navigator), A-MADMAN (Annotation-based microarray data meta-analysis tool), and LOLA (List Of Lists Annotated) enable cross-species meta-analysis using Entrez ID, gene symbol, or other IDs as a conversion intermediary. AbsIDconvert can perform cross-platform and cross-species analysis efficiently using the sequence-based approach. We previously demonstrated that AbsIDconvert efficiently and accurately converted Affymetrix® HG_U133Plus2.0 probes into Agilent Cgh105a probes, among other types of conversions.
To our knowledge, AbsIDconvert is the only gene ID conversion tool based on genomic coordinates/intervals. This is a novel and important contribution in the realm of gene ID conversion due to the large variety of genetic entities in current use by biologists, the need to convert between them, and the fact that most biological entities (nucleic acid, protein entities, etc.) have an associated sequence. Mapping the entity sequence to a reference genome sequence provides the corresponding genomic interval, which allows determination of other entities that have overlapping genomic intervals.
The interval basis of AbsIDconvert provides ease of flexibility with respect to any additions, deletions or updates of the underlying objects, requiring only adding of intervals, removing intervals, or modifying the intervals themselves, respectively. This makes it possible to easily keep the structure updated as the current state of biological knowledge changes. A major update is only required when the underlying genome changes, a fairly rare occurrence for most organisms, especially when compared to how often other genomic databases are modified.
These intervals also allow easy discovery of genetic entities that only partially overlap with queried IDs/intervals, or that are within a specified distance nearby. More frequently, researchers are interested in those genes that are near specific genomic intervals corresponding to various types of genetic control elements such as transcription factor binding sites, enhancers, untranslated regions, and hyper/hypo methylated regions. AbsIDconvert makes it easy to find those entities that overlap or lie nearby regions of interest. With the incorporation of a sequence mapping algorithm, AbsIDconvert integrates the determination of genomic intervals for any supplied sequence, making it possible to easily find and convert between IDs from any platform and organism, such as the examination of correspondence of the human EST clones with rat and mouse genes (case study 2) and of plasmodium and human genes (case study 1). We do not know of any other system that can easily accomplish these types of analyses.
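A query for entities that overlap or lie near a region of interest can be sketched as follows. This is a simple linear scan for illustration, not the interval tree used by AbsIDconvert; the `Feature` record, the `features_near` function, and all coordinates are hypothetical. Distance is taken here as the difference between the nearest endpoint coordinates.

```python
from collections import namedtuple

# Hypothetical record type: closed interval [start, end] on a named chromosome.
Feature = namedtuple("Feature", "name chrom start end")

def features_near(query, features, max_distance=0):
    """Features on the query's chromosome that overlap it or lie within max_distance."""
    lo = query.start - max_distance
    hi = query.end + max_distance
    return [f for f in features
            if f.chrom == query.chrom and f.start <= hi and f.end >= lo]

# Made-up annotations and a made-up region of interest (e.g. a binding site).
annotations = [
    Feature("geneA", "chr1", 1500, 3000),
    Feature("geneB", "chr1", 900, 1010),
    Feature("geneC", "chr2", 1000, 2000),
    Feature("geneD", "chr1", 5000, 6000),
]
site = Feature("site1", "chr1", 1000, 1020)

overlapping = [f.name for f in features_near(site, annotations)]
within_500 = [f.name for f in features_near(site, annotations, max_distance=500)]
print(overlapping, within_500)
```

Widening the query by a flanking distance and then testing for overlap turns a "nearby" search into an ordinary overlap search, so the same lookup structure can serve both kinds of query.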
AbsIDconvert can greatly facilitate the work of those involved in meta-analysis studies. When comparing studies where the species and/or platform varies, this methodology has clear advantages over others because it is based on common genomic coordinates.
The use of an interval tree structure makes it possible to perform large conversions quickly and efficiently. This method is efficient when dealing with genomic intervals and has a significant advantage over other methods such as relational databases. Although theoretically limited by working memory, none of the interval trees generated and used by AbsIDconvert require more than 300 MB of RAM on the deployed server, with the majority being rather small (less than 10 MB). If the data cannot fit into main memory, a method such as that proposed by Arge et al. [76, 77] can be used to maintain the interval tree efficiently in secondary memory.
AbsIDconvert is provided as a web page at http://bioinformatics.louisville.edu/abid/, and is also available as a virtual machine for those wishing to run a local instance. Future work will include providing command line access, a RESTful interface, and modifying the interface to utilize a workflow management tool for genomic data such as Galaxy, where the primary data units are genomic sequences and intervals.
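To make the data structure concrete, the following is a minimal sketch of a textbook centered interval tree with an overlap query. It is not the AbsIDconvert implementation, whose internals are not described here; the class name, the closed-coordinate convention, and the sample intervals are assumptions made for illustration. In practice one such tree would presumably be built per chromosome.

```python
class IntervalNode:
    """Centered interval tree over closed intervals given as (start, end, label)."""

    def __init__(self, intervals):
        # Use the median endpoint as the center, so at least one interval
        # contains it and both subtrees are strictly smaller.
        points = sorted(p for s, e, _ in intervals for p in (s, e))
        self.center = points[len(points) // 2]
        left, right, here = [], [], []
        for iv in intervals:
            if iv[1] < self.center:
                left.append(iv)       # entirely left of the center
            elif iv[0] > self.center:
                right.append(iv)      # entirely right of the center
            else:
                here.append(iv)       # contains the center
        self.by_start = sorted(here, key=lambda iv: iv[0])
        self.by_end = sorted(here, key=lambda iv: iv[1], reverse=True)
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def query(self, qs, qe):
        """Return all stored intervals overlapping the closed interval [qs, qe]."""
        hits = []
        if qe < self.center:
            # Intervals here end at or after the center, so only the start matters.
            for iv in self.by_start:
                if iv[0] > qe:
                    break
                hits.append(iv)
            if self.left:
                hits += self.left.query(qs, qe)
        elif qs > self.center:
            # Intervals here start at or before the center, so only the end matters.
            for iv in self.by_end:
                if iv[1] < qs:
                    break
                hits.append(iv)
            if self.right:
                hits += self.right.query(qs, qe)
        else:
            # The query spans the center, so every interval stored here overlaps.
            hits.extend(self.by_start)
            if self.left:
                hits += self.left.query(qs, qe)
            if self.right:
                hits += self.right.query(qs, qe)
        return hits

# Made-up intervals on one chromosome.
tree = IntervalNode([
    (100, 500, "geneA"), (120, 144, "probe1"),
    (450, 509, "probe2"), (900, 1500, "geneB"),
])
labels = sorted(label for _, _, label in tree.query(130, 140))
print(labels)
```

Each query descends only into subtrees that can contain an overlap and stops scanning a node's sorted lists at the first non-overlapping entry, which is what keeps large batches of lookups fast compared with scanning every interval.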
FM was involved in all aspects of the project and was the main code developer. ECR and JCP designed the overall project goals. ECR was responsible for directing the project to completion. RMF played a large role in the project design, identifying sources of genetic entities for inclusion, and the design of the case studies. BJH provided critical assessment and usability design. All authors contributed to the preparation of the manuscript. All authors read and approved the final manuscript.
This work was partially funded by National Institutes of Health (NIH) grants P20RR016481, 3P20RR016481-09S1, and 8P20GM103436-12 and by Department of Energy (DOE) contract DE-EM0000197. Partial support was also provided by the Paralyzed Veterans of America (fellowship to BJH) and by the Kentucky Spinal Cord and Head Injury Research Trust (JCP). Its contents are solely the responsibility of the authors and do not represent the official views of the funding organizations.
- Galperin MY, Fernández-Suárez XM: The 2012 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Res 2012, 40: D1-D8. [http://www.ncbi.nlm.nih.gov/pubmed/22144685] doi:10.1093/nar/gkr1196
- Benson DA, Karsch-Mizrachi I, Clark K, Lipman DJ, Ostell J, Sayers EW: GenBank. Nucleic Acids Res 2012, 40(Database issue):D48-D53. [http://www.ncbi.nlm.nih.gov/pubmed/22144687]
- Maglott DR, Katz KS, Sicotte H, Pruitt KD: NCBI’s LocusLink and RefSeq. Nucleic Acids Res 2000, 28: 126–128. [http://www.ncbi.nlm.nih.gov/pubmed/10592200] doi:10.1093/nar/28.1.126
- The Gene Ontology Consortium: The Gene Ontology: enhancements for 2011. Nucleic Acids Res 2012, 40(D1):D559-D564. [http://www.ncbi.nlm.nih.gov/pubmed/22102568]
- Seal RL, Gordon SM, Lush MJ, Wright MW, Bruford EA: genenames.org: the HGNC resources in 2011. Nucleic Acids Res 2011, 39(Database issue):D514–9. [http://www.ncbi.nlm.nih.gov/pubmed/20929869]
- Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res 2011, 39(Database issue):D52-D57. [http://dx.doi.org/10.1093/nar/gkq1237]
- Flicek P, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, Fitzgerald S, Gil L, Gordon L, Hendrix M, Hourlier T, Johnson N, Kähäri AK, Keefe D, Keenan S, Kinsella R, Komorowska M, Koscielny G, Kulesha E, Larsson P, Longden I, McLaren W, Muffato M, Overduin B, Pignatelli M, Pritchard B, Riat HS, et al.: Ensembl 2012. Nucleic Acids Res 2012, 40(Database issue):D84-D90. [http://www.ncbi.nlm.nih.gov/pubmed/22086963]
- Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 2005, 33(Database issue):D514-D517. [http://www.ncbi.nlm.nih.gov/pubmed/15608251]
- Prasad TSK, Kandasamy K, Pandey A: Human Protein Reference Database and Human Proteinpedia as discovery tools for systems biology. Methods Mol Biol 2009, 577: 67–79. [http://www.ncbi.nlm.nih.gov/pubmed/19718509] doi:10.1007/978-1-60761-232-2_6
- Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V, Muthusamy B, Gandhi TKB, Gronborg M, Ibarrola N, Deshpande N, Shanker K, Shivashankar HN, Rashmi BP, Ramya MA, Zhao Z, Chandrika KN, Padma N, Harsha HC, Yatish AJ, Kavitha MP, Menezes M, Choudhury DR, Suresh S, Ghosh N, Saravana R, Chandran S, Krishna S, Joy M, et al.: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res 2003, 13(10):2363–2371. [http://www.ncbi.nlm.nih.gov/pubmed/14525934] doi:10.1101/gr.1680803
- Wilming LG, Gilbert JGR, Howe K, Trevanion S, Hubbard T, Harrow JL: The vertebrate genome annotation (Vega) database. Nucleic Acids Res 2008, 36(Database issue):D753-D760. [http://www.ncbi.nlm.nih.gov/pubmed/18003653]
- Fujita PA, Rhead B, Zweig AS, Hinrichs AS, Karolchik D, Cline MS, Goldman M, Barber GP, Clawson H, Coelho A, Diekhans M, Dreszer TR, Giardine BM, Harte RA, Hillman-Jackson J, Hsu F, Kirkup V, Kuhn RM, Learned K, Li CH, Meyer LR, Pohl A, Raney BJ, Rosenbloom KR, Smith KE, Haussler D, Kent WJ: The UCSC Genome Browser database: update 2011. Nucleic Acids Res 2011, 39(Database issue):D876-D882. [http://dx.doi.org/10.1093/nar/gkq963]
- Karolchik D, Hinrichs AS, Kent WJ: The UCSC Genome Browser. Curr Protoc Bioinformatics 2009, Chapter 1: Unit1.4. [http://dx.doi.org/10.1002/0471250953.bi0104s28]
- Magrane M, UniProt Consortium: UniProt Knowledgebase: a hub of integrated protein data. Database (Oxford) 2011, 2011: bar009. [http://www.ncbi.nlm.nih.gov/pubmed/21447597] doi:10.1093/database/bar009
- Laibe C, Novère NL: MIRIAM Resources: tools to generate and resolve robust cross-references in Systems Biology. BMC Syst Biol 2007, 1: 58. [http://dx.doi.org/10.1186/1752-0509-1-58] doi:10.1186/1752-0509-1-58
- Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000, 28: 27–30. [http://www.ncbi.nlm.nih.gov/pubmed/10592173] doi:10.1093/nar/28.1.27
- Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M: KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res 2012, 40(Database issue):D109-D114. [http://www.ncbi.nlm.nih.gov/pubmed/22080510]
- Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Muertter RN, Holko M, Ayanbule O, Yefanov A, Soboleva A: NCBI GEO: archive for functional genomics data sets–10 years on. Nucleic Acids Res 2011, 39(Database issue):D1005-D1010. [http://dx.doi.org/10.1093/nar/gkq1184]
- Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 2002, 30: 207–210. [http://www.ncbi.nlm.nih.gov/pubmed/11752295] doi:10.1093/nar/30.1.207
- Gautier L, Møller M, Friis-Hansen L, Knudsen S: Alternative mapping of probes to genes for Affymetrix chips. BMC Bioinformatics 2004, 5: 111. [http://dx.doi.org/10.1186/1471-2105-5-111] doi:10.1186/1471-2105-5-111
- Liu H, Zeeberg BR, Qu G, Koru AG, Ferrucci A, Kahn A, Ryan MC, Nuhanovic A, Munson PJ, Reinhold WC, Kane DW, Weinstein JN: AffyProbeMiner: a web resource for computing or retrieving accurately redefined Affymetrix probe sets. Bioinformatics 2007, 23(18):2385–2390. [http://dx.doi.org/10.1093/bioinformatics/btm360] doi:10.1093/bioinformatics/btm360
- Harbig J, Sprinkle R, Enkemann SA: A sequence-based identification of the genes detected by probesets on the Affymetrix U133 plus 2.0 array. Nucleic Acids Res 2005, 33(3):e31. [http://dx.doi.org/10.1093/nar/gni027] doi:10.1093/nar/gni027
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403–410. [http://dx.doi.org/10.1006/jmbi.1990.9999]
- Dennis G, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 2003, 4(5):P3. [http://www.ncbi.nlm.nih.gov/pubmed/12734009] doi:10.1186/gb-2003-4-5-p3
- Huang DW, Sherman BT, Zheng X, Yang J, Imamichi T, Stephens R, Lempicki RA: Extracting biological meaning from large gene lists with DAVID. Curr Protoc Bioinformatics 2009, Chapter 13: Unit 13.11. [http://dx.doi.org/10.1002/0471250953.bi1311s27]
- Sherman BT, Huang DW, Tan Q, Guo Y, Bour S, Liu D, Stephens R, Baseler MW, Lane HC, Lempicki RA: DAVID Knowledgebase: a gene-centered database integrating heterogeneous gene annotation resources to facilitate high-throughput gene functional analysis. BMC Bioinformatics 2007, 8: 426. [http://dx.doi.org/10.1186/1471-2105-8-426] doi:10.1186/1471-2105-8-426
- Huang DW, Sherman BT, Stephens R, Baseler MW, Lane HC, Lempicki RA: DAVID gene ID conversion tool. Bioinformation 2008, 2(10):428–430. [http://www.ncbi.nlm.nih.gov/pubmed/18841237] doi:10.6026/97320630002428
- Hosack DA, Dennis G, Sherman BT, Lane HC, Lempicki RA: Identifying biological themes within lists of genes with EASE. Genome Biol 2003, 4(10):R70. [http://dx.doi.org/10.1186/gb-2003-4-10-r70] doi:10.1186/gb-2003-4-10-r70
- Al-Shahrour F, Carbonell J, Minguez P, Goetz S, Conesa A, Tárraga J, Medina I, Alloza E, Montaner D, Dopazo J: Babelomics: advanced functional profiling of transcriptomics, proteomics and genomics experiments. Nucleic Acids Res 2008, 36(Web Server issue):W341-W346. [http://dx.doi.org/10.1093/nar/gkn318]
- Medina I, Carbonell J, Pulido L, Madeira SC, Goetz S, Conesa A, Tárraga J, Pascual-Montano A, Nogales-Cadenas R, Santoyo J, García F, Marbà M, Montaner D, Dopazo J: Babelomics: an integrative platform for the analysis of transcriptomics, proteomics and genomic data with advanced functional profiling. Nucleic Acids Res 2010, 38(Web Server issue):W210-W213. [http://dx.doi.org/10.1093/nar/gkq388]
- Reimand J, Kull M, Peterson H, Hansen J, Vilo J: g:Profiler–a web-based toolset for functional profiling of gene lists from large-scale experiments. Nucleic Acids Res 2007, 35(Web Server issue):W193-W200. [http://dx.doi.org/10.1093/nar/gkm226]
- Imanishi T, Nakaoka H: Hyperlink Management System and ID Converter System: enabling maintenance-free hyperlinks among major biological databases. Nucleic Acids Res 2009, 37(Web Server issue):W17-W22. [http://dx.doi.org/10.1093/nar/gkp355]
- Berriz GF, Roth FP: The Synergizer service for translating gene, protein and other biological identifiers. Bioinformatics 2008, 24(19):2272–2273. [http://dx.doi.org/10.1093/bioinformatics/btn424] doi:10.1093/bioinformatics/btn424
- Baron D, Bihouee A, Teusan R, Dubois E, Savagner F, Steenman M, Houlgatte R, Ramstein G: MADGene: retrieval and processing of gene identifier lists for the analysis of heterogeneous microarray datasets. Bioinformatics 2011, 27(5):725–726. [http://www.ncbi.nlm.nih.gov/pubmed/21216776] doi:10.1093/bioinformatics/btq710
- Alibés A, Yankilevich P, Cañada A, Díaz-Uriarte R: IDconverter and IDClight: conversion and annotation of gene and protein IDs. BMC Bioinformatics 2007, 8: 9. [http://dx.doi.org/10.1186/1471-2105-8-9] doi:10.1186/1471-2105-8-9
- Bussey KJ, Kane D, Sunshine M, Narasimhan S, Nishizuka S, Reinhold WC, Zeeberg B, Ajay W, Weinstein JN: MatchMiner: a tool for batch navigation among gene and gene product identifiers. Genome Biol 2003, 4(4):R27. [http://www.ncbi.nlm.nih.gov/pubmed/12702208] doi:10.1186/gb-2003-4-4-r27
- Castillo-Davis CI, Hartl DL: GeneMerge–post-genomic analysis, data mining, and hypothesis testing. Bioinformatics 2003, 19(7):891–892. [http://www.ncbi.nlm.nih.gov/pubmed/12724301] doi:10.1093/bioinformatics/btg114
- Tsai J, Sultana R, Lee Y, Pertea G, Karamycheva S, Antonescu V, Cho J, Parvizi B, Cheung F, Quackenbush J: RESOURCERER: a database for annotating and linking microarray resources within and across species. Genome Biol 2001, 2(11):SOFTWARE0002. [http://www.ncbi.nlm.nih.gov/pubmed/16173164]
- Lenhard B, Hayes WS, Wasserman WW: GeneLynx: a gene-centric portal to the human genome. Genome Res 2001, 11(12):2151–2157. [http://www.ncbi.nlm.nih.gov/pubmed/11731507] doi:10.1101/gr.199801
- Risueño A, Fontanillo C, Dinger ME, Rivas JDL: GATExplorer: genomic and transcriptomic explorer; mapping expression probes to gene loci, transcripts, exons and ncRNAs. BMC Bioinformatics 2010, 11: 221. [http://dx.doi.org/10.1186/1471-2105-11-221] doi:10.1186/1471-2105-11-221
- Liu G, Loraine AE, Shigeta R, Cline M, Cheng J, Valmeekam V, Sun S, Kulp D, Siani-Rose MA: NetAffx: Affymetrix probesets and annotations. Nucleic Acids Res 2003, 31: 82–86. [http://www.ncbi.nlm.nih.gov/pubmed/12519953] doi:10.1093/nar/gkg121
- Nurtdinov RN, Vasiliev MO, Ershova AS, Lossev IS, Karyagina AS: PLANdbAffy: probe-level annotation database for Affymetrix expression microarrays. Nucleic Acids Res 2010, 38(Database issue):D726-D730. [http://dx.doi.org/10.1093/nar/gkp969]
- Kent WJ: BLAT–the BLAST-like alignment tool. Genome Res 2002, 12(4):656–664. [http://dx.doi.org/10.1101/gr.229202]
- Wang P, Ding F, Chiang H, Thompson RC, Watson SJ, Meng F: ProbeMatchDB–a web database for finding equivalent probes across microarray platforms and species. Bioinformatics 2002, 18(3):488–489. [http://www.ncbi.nlm.nih.gov/pubmed/11934751] doi:10.1093/bioinformatics/18.3.488
- Jain E, Bairoch A, Duvaud S, Phan I, Redaschi N, Suzek BE, Martin MJ, McGarvey P, Gasteiger E: Infrastructure for the life sciences: design and implementation of the UniProt website. BMC Bioinformatics 2009, 10: 136. [http://www.ncbi.nlm.nih.gov/pubmed/19426475] doi:10.1186/1471-2105-10-136
- UniProt Consortium: Ongoing and future developments at the Universal Protein Resource. Nucleic Acids Res 2011, 39(Database issue):D214-D219. [http://www.ncbi.nlm.nih.gov/pubmed/21051339]
- Khatri P, Draghici S, Ostermeier GC, Krawetz SA: Profiling gene expression using onto-express. Genomics 2002, 79(2):266–270. [http://dx.doi.org/10.1006/geno.2002.6698] doi:10.1006/geno.2002.6698
- Draghici S, Khatri P, Bhavsar P, Shah A, Krawetz SA, Tainsky MA: Onto-Tools, the toolkit of the modern biologist: Onto-Express, Onto-Compare, Onto-Design and Onto-Translate. Nucleic Acids Res 2003, 31(13):3775–3781. [http://www.ncbi.nlm.nih.gov/pubmed/12824416] doi:10.1093/nar/gkg624
- Iragne F, Barré A, Goffard N, Daruvar AD: AliasServer: a web server to handle multiple aliases used to refer to proteins. Bioinformatics 2004, 20(14):2331–2332. [http://dx.doi.org/10.1093/bioinformatics/bth241] doi:10.1093/bioinformatics/bth241
- Haider S, Ballester B, Smedley D, Zhang J, Rice P, Kasprzyk A: BioMart Central Portal–unified access to biological data. Nucleic Acids Res 2009, 37(Web Server issue):W23-W27. [http://www.ncbi.nlm.nih.gov/pubmed/19420058]
- Guberman JM, Ai J, Arnaiz O, Baran J, Blake A, Baldock R, Chelala C, Croft D, Cros A, Cutts RJ, Di Genova A, Forbes S, Fujisawa T, Gadaleta E, Goodstein DM, Gundem G, Haggarty B, Haider S, Hall M, Harris T, Haw R, Hu S, Hubbard S, Hsu J, Iyer V, Jones P, Katayama T, Kinsella R, Kong L, Lawson D, et al.: BioMart Central Portal: an open database network for the biological community. Database (Oxford) 2011, 2011: bar041. [http://www.ncbi.nlm.nih.gov/pubmed/21930507] doi:10.1093/database/bar041
- Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T, Birney E: EnsMart: a generic system for fast and flexible access to biological data. Genome Res 2004, 14: 160–169. [http://dx.doi.org/10.1101/gr.1645104]
- van Iersel MP, Pico AR, Kelder T, Gao J, Ho I, Hanspers K, Conklin BR, Evelo CT: The BridgeDb framework: standardized access to gene, protein and metabolite identifier mapping services. BMC Bioinformatics 2010, 11: 5. [http://www.ncbi.nlm.nih.gov/pubmed/20047655] doi:10.1186/1471-2105-11-5
- Cote RG, Jones P, Martens L, Kerrien S, Reisinger F, Lin Q, Leinonen R, Apweiler R, Hermjakob H: The Protein Identifier Cross-Referencing (PICR) service: reconciling protein identifiers across multiple source databases. BMC Bioinformatics 2007, 8: 401. [http://www.ncbi.nlm.nih.gov/pubmed/17945017] doi:10.1186/1471-2105-8-401
- Mohammad F, Flight R, Harrison B, Petruska J, Rouchka E: Interval Trees for Detection of Overlapping Genetic Entities. 2011 11th IEEE International Conference on Bioinformatics and Bioengineering. IEEE; 2011, 278–281.
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009, 10(3):R25. [http://dx.doi.org/10.1186/gb-2009-10-3-r25] doi:10.1186/gb-2009-10-3-r25
- Pages H, Aboyoun P, Lawrence M: IRanges: Infrastructure for manipulating intervals on sequences. R package version 2010, 1(6):1–23.
- Aboyoun P, Pages H, Lawrence M: GenomicRanges: Representation and manipulation of genomic intervals. R package version 2010, 1(6):1–25.
- Allen J: Maintaining knowledge about temporal intervals. Commun of the ACM 1983, 26(11):832–843. doi:10.1145/182.358434
- Goecks J, Nekrutenko A, Taylor J, Galaxy Team: Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 2010, 11(8):R86. [http://www.ncbi.nlm.nih.gov/pubmed/20738864] doi:10.1186/gb-2010-11-8-r86
- Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J: Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol 2010, Chapter 19: Unit 19.10.1–21. [http://www.ncbi.nlm.nih.gov/pubmed/20069535]
- Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A: Galaxy: a platform for interactive large-scale genome analysis. Genome Res 2005, 15(10):1451–1455. [http://www.ncbi.nlm.nih.gov/pubmed/16169926] doi:10.1101/gr.4086505
- Affymetrix HG-U133 Plus 2.0 annotation file [https://www.affymetrix.com/analysis/downloads/na32/ivt/HG-U133_Plus_2.na32.annot.csv.zip]
- Agilent Cgh annotation file [https://earray.chem.agilent.com/earray]
- Khor CC, Hibberd ML: Revealing the molecular signatures of host-pathogen interactions. Genome Biol 2011, 12(10):229. [http://www.ncbi.nlm.nih.gov/pubmed/22011345] doi:10.1186/gb-2011-12-10-229
- Tan LKK, Carlone GM, Borrow R: Advances in the development of vaccines against Neisseria meningitidis. N Engl J Med 2010, 362(16):1511–1520. [http://www.ncbi.nlm.nih.gov/pubmed/20410516] doi:10.1056/NEJMra0906357
- Aurrecoechea C, Brestelli J, Brunk BP, Dommer J, Fischer S, Gajria B, Gao X, Gingle A, Grant G, Harb OS, Heiges M, Innamorato F, Iodice J, Kissinger JC, Kraemer E, Li W, Miller JA, Nayak V, Pennington C, Pinney DF, Roos DS, Ross C, Stoeckert CJ, Treatman C, Wang H: PlasmoDB: a functional genomic database for malaria parasites. Nucleic Acids Res 2009, 37(Database issue):D539-D543. [http://www.ncbi.nlm.nih.gov/pubmed/18957442]
- Kwiatkowski DP: How malaria has affected the human genome and what human genetics can teach us about malaria. Am J Hum Genet 2005, 77(2):171–192. [http://www.ncbi.nlm.nih.gov/pubmed/16001361] doi:10.1086/432519
- Sacheck JM, Hyatt JPK, Raffaello A, Jagoe RT, Roy RR, Edgerton VR, Lecker SH, Goldberg AL: Rapid disuse and denervation atrophy involve transcriptional changes similar to those of muscle wasting during systemic diseases. FASEB J 2007, 21: 140–155. [http://dx.doi.org/10.1096/fj.06-6604com]
- Lecker SH, Jagoe RT, Gilbert A, Gomes M, Baracos V, Bailey J, Price SR, Mitch WE, Goldberg AL: Multiple types of skeletal muscle atrophy involve a common program of changes in gene expression. FASEB J 2004, 18: 39–51. [http://www.ncbi.nlm.nih.gov/pubmed/14718385] doi:10.1096/fj.03-0610com
- Jagoe RT, Lecker SH, Gomes M, Goldberg AL: Patterns of gene expression in atrophying skeletal muscles: response to food deprivation. FASEB J 2002, 16(13):1697–1712. [http://www.ncbi.nlm.nih.gov/pubmed/12409312] doi:10.1096/fj.02-0312com
- Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Federhen S, Feolo M, Fingerman IM, Geer LY, Helmberg W, Kapustin Y, Krasnov S, Landsman D, Lipman DJ, Lu Z, Madden TL, Madej T, Maglott DR, Marchler-Bauer A, Miller V, Karsch-Mizrachi I, Ostell J, Panchenko A, Phan L, Pruitt KD, Schuler GD, et al.: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2012, 40(Database issue):D13-D25. [http://www.ncbi.nlm.nih.gov/pubmed/22140104]
- Chen R, Li L, Butte AJ: AILUN: reannotating gene expression data automatically. Nat Methods 2007, 4(11):879. [http://dx.doi.org/10.1038/nmeth1107-879] doi:10.1038/nmeth1107-879
- Bisognin A, Coppe A, Ferrari F, Risso D, Romualdi C, Bicciato S, Bortoluzzi S: A-MADMAN: annotation-based microarray data meta-analysis tool. BMC Bioinformatics 2009, 10: 201. [http://dx.doi.org/10.1186/1471-2105-10-201] doi:10.1186/1471-2105-10-201
- Cahan P, Ahmad AM, Burke H, Fu S, Lai Y, Florea L, Dharker N, Kobrinski T, Kale P, McCaffrey TA: List of lists-annotated (LOLA): a database for annotation and comparison of published microarray gene lists. Gene 2005, 360: 78–82. [http://dx.doi.org/10.1016/j.gene.2005.07.008] doi:10.1016/j.gene.2005.07.008
- Arge L, Vitter J: Optimal dynamic interval management in external memory. Foundations of Computer Science, 1996. Proceedings., 37th Annual Symposium on IEEE 1996, 560–569.
- Arge L, Vitter J: Optimal External Memory Interval Management. SIAM J Comput 2003, 32: 1488–1508. [http://portal.acm.org/citation.cfm?id=944295.945604] doi:10.1137/S009753970240481X
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.