From: Ultra-Structure database design methodology for managing systems biology data and analyses

Different representations of genomic regions. Panel A shows an entry from the Known Genes dataset from UCSC, using its native database structure. Here, "Tx" stands for "Transcription" and "Cd" for "Coding"; thus "Tx Start" and "Tx Stop" define the bounds of the particular transcript's transcribed region, just as "Cd Start" and "Cd Stop" define the bounds of its coding region. Panel B illustrates how various annotations of interest are represented using this data structure (the figure does not represent any particular gene, and is not to scale). The transcribed region and coding region are both explicitly defined, while untranslated regions must be inferred. Exons are represented altogether differently, using arrays of start and stop positions. Introns must be inferred as the regions lying between these exons. All these types of annotations are of interest to researchers, but the variety of representations used here poses challenges to querying. The data structure also assumes that all genes will fit this basic structure, which may not always be the case (e.g., trans-spliced genes can be composed of segments from different chromosomes).
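
To make the querying difficulty concrete, the following is a minimal Python sketch of the inference steps the caption describes. It is not taken from the paper: the class `KnownGeneRow`, the field names, the helper functions, and the example coordinates are all hypothetical, and the sketch assumes 0-based, half-open coordinates with exon arrays sorted in genomic order.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[int, int]  # assumed half-open: [start, end)


@dataclass
class KnownGeneRow:
    """Hypothetical in-memory form of one entry like the one in Panel A."""
    name: str
    chrom: str
    strand: str            # "+" or "-"
    tx_start: int          # "Tx Start"
    tx_end: int            # "Tx Stop"
    cds_start: int         # "Cd Start"
    cds_end: int           # "Cd Stop"
    exon_starts: List[int]
    exon_ends: List[int]


def exons(row: KnownGeneRow) -> List[Interval]:
    # Exons are stored as two parallel arrays, so pair them up first.
    return list(zip(row.exon_starts, row.exon_ends))


def introns(row: KnownGeneRow) -> List[Interval]:
    # Introns are never stored; they are the gaps between consecutive exons.
    ex = exons(row)
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(ex, ex[1:])]


def utrs(row: KnownGeneRow) -> Tuple[List[Interval], List[Interval]]:
    """Return (five_prime_utr, three_prime_utr) as lists of exon pieces."""
    if row.cds_start >= row.cds_end:
        # Assumption: an empty coding region means there are no UTRs to report.
        return [], []
    # Exonic sequence lying before the coding region on the genome ...
    left = [(s, min(e, row.cds_start)) for s, e in exons(row)
            if s < row.cds_start]
    # ... and exonic sequence lying after it.
    right = [(max(s, row.cds_end), e) for s, e in exons(row)
             if e > row.cds_end]
    # Which side is 5' depends on the strand.
    return (left, right) if row.strand == "+" else (right, left)


# Made-up entry for illustration; it does not describe a real gene.
example = KnownGeneRow(
    name="hypothetical.1", chrom="chr1", strand="+",
    tx_start=100, tx_end=1000, cds_start=250, cds_end=850,
    exon_starts=[100, 500, 800], exon_ends=[300, 600, 1000],
)
example_introns = introns(example)
example_five_utr, example_three_utr = utrs(example)
```

Each annotation type needs its own access pattern in this sketch: a direct field read for the transcribed and coding regions, array pairing for exons, and derived computations for introns and untranslated regions. A gene whose segments lie on different chromosomes cannot be expressed at all, since the row holds a single `chrom` value.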

