Next generation models for storage and representation of microbial biological annotation

Quest, Daniel J; Land, Miriam L; Brettin, Thomas S; Cottingham, Robert W

doi:10.1186/1471-2105-11-S6-S15

BMC Bioinformatics

Table 1 A comparison of five common data storage technologies currently deployed in annotation systems.


	Free Text	Tab/Line Delimited	XML	RDF/XML	Relational-DB
Description Logic (FOL)	NO	NO	NO	YES	NO
Ontology Standards	NO	NO	NO	YES	NO
Centralized/not scalable	NO	YES*	NO	NO	YES**
Human Readable	YES	YES	NO	YES	YES
Domain Expert Understandable	YES	YES	NO	YES	NO
Data Structure	NONE	Single Table	Tree	Graph	Relational Tables
Data Expectations	NONE	NONE	Schema – Constraints	Inference rules	Schema – Constraints
Native Format	Text	Text	Text	Text	Binary
Query Engine Language	Programmed by hand	Programmed by hand	Libraries available	SPARQL	SQL
Naming Standard	NO UNA****	NO UNA	NO UNA	NO UNA	UNA
Sequence Storage Solution	In Text	In Text	XML/Indexed	XML/Indexed	Indexed
CWA/OWA***	OWA	CWA	CWA	OWA	CWA
Search Speed (Worst Case)	NP-Hard	O(n)	O(n)	P	O(log n)
Update Speed (Worst Case)	NP-Hard	O(n)	O(n)	P	O(log n)
Conversion to Semantic	Data loss possible – done by hand	No data loss – done by hand	No data loss – done with robust libraries	-	No data loss – library usage and some added labeling by hand
Conversion from Semantic	No data loss	Data loss	Data loss	-	Data loss

Free text is used in repositories such as scientific journals. Tab/Line delimited files are used in popular formats such as FASTA, GFF, and BLAST. Tab/Line delimited files also constitute the bulk of program output from most bioinformatics software. Mature tools and sequence repositories such as GenBank support XML output. Many OWL-based ontology repositories exist for semantic data integration; however, RDF/XML data is currently scarce. Relational databases typically do not provide direct access to the data; instead, a programming interface is provided for access to the underlying database. Free text is the most flexible, and also the least machine-readable. Relational databases are the most formal structures, with the fastest indexing and searching capabilities. Relational technology requires the greatest investment in computational expertise, while free text is the most natural. XML and RDF/XML are designed for modification over time and for sharing data.
In the rows discussing search speed and update speed, O(log n), O(n), P and NP-Hard are computer science terms indicating a range of how fast a computer solution to a particular problem can be obtained. P indicates that a solution can be found in polynomial time; NP-Hard means that the solution space explodes relative to the input size, so NP-Hard problems are not expected to be solvable on a computer in reasonable time. O(log n), O(n), and P are all solvable efficiently on a computer.
In the rows discussing conversion to and from RDF/XML, Turtle, and other semantic-aware data storage technologies, loss of information includes schema, constraints, data and formatting. For example, when converting from a relational schema to tab-delimited files, information is lost because the schema, triggers and views cannot be represented in tab-delimited files. These columns are therefore more than just data: they comprise the data together with the descriptions surrounding the data, which support logical conclusions and the execution of computer code in reasonable time. In the conversion from free text to semantic standards, assumptions and domain expertise may be lost.
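The loss described for a relational-to-tab-delimited conversion can be illustrated with a minimal sketch. It assumes Python's standard `sqlite3` and `csv` modules as a stand-in for a relational database and a tab-delimited export; the `gene` table, its columns, and the helper names `export_tsv` and `constraint_enforced` are hypothetical and do not come from the article.

```python
import csv
import io
import sqlite3

# Hypothetical schema, for illustration only: a gene table with constraints.
SCHEMA = """
CREATE TABLE gene (
    locus_tag TEXT PRIMARY KEY,
    start_pos INTEGER NOT NULL CHECK (start_pos > 0),
    product   TEXT
)
"""


def export_tsv(conn):
    """Dump the gene table to tab-delimited text (header row plus data rows)."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    cur = conn.execute(
        "SELECT locus_tag, start_pos, product FROM gene ORDER BY locus_tag"
    )
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur)
    return buf.getvalue()


def constraint_enforced(conn):
    """Return True if the database rejects a row with a negative start position."""
    try:
        conn.execute("INSERT INTO gene VALUES ('bad', -5, 'invalid')")
    except sqlite3.IntegrityError:
        return True
    conn.rollback()  # undo the probe row so the table is left unchanged
    return False


# Source database: the schema carries the constraints.
source = sqlite3.connect(":memory:")
source.execute(SCHEMA)
source.execute("INSERT INTO gene VALUES ('gene_0001', 100, 'hypothetical protein')")
source.commit()

tsv_text = export_tsv(source)

# Rebuild a table from the TSV alone: only column names and values survive,
# so the restored table has no types, keys, or CHECK constraints.
rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
header, data = rows[0], rows[1:]
restored = sqlite3.connect(":memory:")
restored.execute(f"CREATE TABLE gene ({', '.join(header)})")
restored.executemany("INSERT INTO gene VALUES (?, ?, ?)", data)
restored.commit()

print(constraint_enforced(source))    # the original schema rejects the bad row
print(constraint_enforced(restored))  # the round-tripped table accepts it
```

The values survive the round trip, but the rule that a start position must be positive exists only in the original schema, which is the kind of loss the legend refers to.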
*Assuming all information is in one file. If multiple files exist, then an indexing system needs to be developed to organize information.
**Relational databases are assumed to exist as a single installation on a powerful single resource. New database technologies have lessened this restriction in recent years.
***CWA – Closed World Assumption, missing information treated as false. OWA – Open World Assumption, missing information treated as unknown.
****UNA – Unique Name Assumption, each individual has a single unique name.
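The difference between the two world assumptions in the footnotes can be sketched in a few lines. This is an illustrative toy under stated assumptions, not how any particular database or reasoner is implemented: facts are modelled as a set of subject-predicate-object tuples, and the names `FACTS`, `holds_cwa`, and `holds_owa` as well as the example identifiers are hypothetical.

```python
from typing import Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical facts, for illustration only.
FACTS: Set[Triple] = {
    ("gene_0001", "encodes", "protein_A"),
}


def holds_cwa(facts: Set[Triple], triple: Triple) -> bool:
    """Closed world: a statement that is not recorded is treated as false."""
    return triple in facts


def holds_owa(facts: Set[Triple], triple: Triple) -> Optional[bool]:
    """Open world: a recorded statement is true; an unrecorded one is unknown.

    None stands for "unknown". In this toy model nothing is ever provably
    false, because no explicit negative statements are stored.
    """
    return True if triple in facts else None


known = ("gene_0001", "encodes", "protein_A")
missing = ("gene_0001", "encodes", "protein_B")

print(holds_cwa(FACTS, known), holds_owa(FACTS, known))
print(holds_cwa(FACTS, missing), holds_owa(FACTS, missing))
```

For the missing statement, the closed-world check answers false while the open-world check answers unknown, which mirrors the CWA/OWA row of the table.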


ISSN: 1471-2105
