Clinical Bioinformatics (CBI) can be defined as "the clinical application of bioinformatics-associated sciences and technologies to understand molecular mechanisms and potential therapies for human diseases" [1]. Being specifically focused on the clinical context, CBI is characterized by the challenge of integrating molecular and clinical data to accelerate the translation of knowledge discovery into effective treatment and personalized medicine. CBI shares methods and goals with Translational Bioinformatics (TBI), which has been defined as the "development of storage, analytic, and interpretative methods to optimize the transformation of increasingly voluminous biomedical data - genomic data in particular - into proactive, predictive, preventive, and participatory health management" [2]. CBI and TBI can thus be considered almost synonymous terms, both being related to the same set of scientific questions. In this paper we will refer to CBI, to stress the clinical decision-making aspects of bioinformatics, although we note that in current practice the two terms are used interchangeably.
More specifically, CBI is aimed at providing methods and tools to support two different decision-makers. On the one hand, it should assist clinicians in dealing with clinical genomics (biomarker discovery), genomic medicine (identification of genotype/phenotype correlations), pharmacogenomics and genetic epidemiology at the point of care (see [3] for a detailed discussion); on the other hand, it must support researchers in the proper reuse of clinical data for research purposes [4]. For this reason, together with bioinformatics problems related to the management, analysis and integration of "-omics" data, CBI needs to deal with the proper definition of clinical decision-support strategies, an area deeply studied in the context of medical informatics and artificial intelligence in medicine. CBI is therefore at the confluence of different disciplines, and may foster the definition of a comprehensive framework to handle and manage all kinds of biomedical data, supporting their transformation into information and knowledge.
Although the main aim of CBI is very ambitious, a variety of enabling factors strongly support research in this direction. First of all, in the last few years genome sequencing and other high-throughput experimental techniques have generated vast amounts of molecular data, which, when coupled with clinical data and properly exploited by researchers, may lead to major biomedical discoveries.
Second, new diagnostic and prognostic tests based on molecular biomarkers are increasingly available to clinicians, considerably refining the capability of dissecting diseases and, at the same time, enlarging the decision space on the basis of the improved assessment of risk.
Third, the increasing online availability of the "bibliome", i.e., the biomedical text corpus made up of published manuscripts, abstracts, textual comments and reports, as well as direct-to-Web publications, has stimulated the development of new algorithms able to semi-automatically extract knowledge from these texts so as to make it available in computable formats. Such algorithms have proved able to effectively combine the information reported in the text with that contained in biological knowledge repositories, and are increasingly used for hypothesis generation or corroboration of clinical findings. Their use in the clinic poses challenges, but they may become an important tool to support decision-making.
Finally, the steady growth of publicly available data and knowledge sources, and the possibility to easily access low-cost, high-throughput molecular technologies, have made computational technologies and bioinformatics increasingly central in genomic medicine; in particular, cloud computing is being recognised as a key technology for the future of genomic research, facilitating large-scale translational research.
Network Tools and Applications in Biology (NETTAB) Workshops are a series of meetings focused on the most promising and innovative ICT tools and on their usefulness in Bioinformatics [5]. They aim at introducing participants to the most promising evolving network standards and technologies being applied to the biomedical domain. Each year, the workshop focuses on a different technology or domain, for which talks on basic technologies, tools and platforms of interest, as well as real applications, are presented. The NETTAB 2011 workshop, held in Pavia, Italy, in October 2011, was aimed at presenting some of the most relevant methods, tools and infrastructures currently available for CBI.
In this paper, we report the viewpoints and opinions of three world CBI leaders who were invited to take part in a panel discussion at the NETTAB workshop on the next challenges and future opportunities of the field.
Looking at CBI from the technological side, these experts have identified three areas that need advancement and further research: the development of data warehouses and ICT infrastructures for data sharing, the definition of standards for sharing phenotypic data, and the development of novel tools for efficient search computing solutions. In the remainder of this editorial we report these opinions and discuss their relevance to the field.
ICT infrastructures for supporting clinical bioinformatics: important design features of the i2b2 system
i2b2 (Informatics for Integrating Biology and the Bedside) is an NIH-funded National Center for Biomedical Computing based at Partners HealthCare System; it provides an integrated framework for using clinical data for research [4].
The back end of i2b2 has a modular software design, called the 'Hive,' that manages everything having to do with how data is stored and accessed. The front end of i2b2 is the i2b2 Web client, a user interface that allows researchers to query and analyze the underlying data. The software is open source and can be extended by users once the core cells of the Hive are included and correctly configured.
To date, i2b2 has been deployed at over 70 sites around the world, where it is being used for cohort identification, hypothesis generation and retrospective data analysis. At many of these sites, additional functionality is being developed to suit the needs of the researchers.
Several aspects of i2b2 contribute to its rapid adoption by the clinical research community. The first is that it is open source: not only is it free to try, but there is a built-in set of collaborators - other users - with whom to engage, both to get help with questions and to foster innovation. The open source, self-service nature of i2b2 allows investigators to try out ideas stepwise, at their own pace and at no financial cost. The online documentation and community wiki are kept up-to-date and greatly assist in user support. The second is that the open source licence, combined with the modularity of the design, enforces backward compatibility with existing research, so that such work can be added to the i2b2 platform rather than becoming obsolete.
But perhaps the key to the utility of i2b2 is the simplicity of its database design. A research data warehouse typically includes data from disparate sources, such as electronic health records, administrative systems, genetic and research data, and lab results, to name a few. The structure of the i2b2 database allows this data to be aggregated and optimized for rapid cross-patient searching in a way that is transparent to the user. The specific design and flexibility of the data model support the addition of new research data as it is amassed, while allowing users to construct complex queries against the multiple source systems.
i2b2 data is stored in a star schema, first described by Kimball [6]. A very large central fact table (observation_fact) is surrounded by and connected to the smaller dimension tables, i.e., the patient, observer, visit, concept and modifier dimensions (Figure 1). A fact is defined as an observation on a patient, made at a specific time, by a specific observer, during a specific event. Dimension tables hold descriptive information and attributes about the facts.
The star schema is optimized for analytic querying and reporting. Its design tends to mirror the way users think about and use data, which is important since users must understand what data is available in order to formulate queries. The straightforward connections between the fact and dimension tables mean that navigating the database via joins and drilling into or rolling up dimensional data is simple and quick. The design allows the fact table to grow to billions of rows while maintaining performance. Another advantage of the fact table design is that it is well suited to handling "sparse" data, i.e., data with many possible attributes (such as all possible medical concepts) of which only a few apply to any given patient. In this model, only positive facts are recorded, resulting in more efficient storage.
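To make the star-schema design concrete, the following minimal sketch (in Python with SQLite) builds a toy fact table and two dimension tables and runs a cross-patient query over them. Table and column names are simplified, illustrative approximations of the i2b2 data model rather than its exact schema, and the inserted records are invented.

```python
import sqlite3

# A hypothetical, simplified i2b2-style star schema: one large fact table
# surrounded by small dimension tables (only two dimensions are shown here).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient_dimension (patient_num INTEGER PRIMARY KEY,
                                birth_date TEXT, sex_cd TEXT);
CREATE TABLE concept_dimension (concept_cd TEXT PRIMARY KEY, name_char TEXT);
CREATE TABLE observation_fact  (patient_num INTEGER, concept_cd TEXT,
                                observer_id TEXT, encounter_num INTEGER,
                                start_date TEXT, nval_num REAL);
""")

# Only "positive" facts are stored: one row per observation actually made.
conn.executemany("INSERT INTO observation_fact VALUES (?,?,?,?,?,?)", [
    (1, "ICD9:250.00",  "dr_a", 10, "2011-03-02", None),  # diabetes diagnosis
    (1, "LOINC:4548-4", "lab",  10, "2011-03-02", 7.9),   # HbA1c result
    (2, "ICD9:414.01",  "dr_b", 11, "2011-05-20", None),  # coronary disease
])
conn.executemany("INSERT INTO concept_dimension VALUES (?,?)", [
    ("ICD9:250.00", "Diabetes mellitus type 2"),
    ("LOINC:4548-4", "Hemoglobin A1c"),
    ("ICD9:414.01", "Coronary atherosclerosis"),
])

# Cross-patient query: how many distinct patients have a diabetes code and an
# HbA1c value above 7? The join walks from the fact table to a dimension table.
count = conn.execute("""
    SELECT COUNT(DISTINCT f.patient_num)
    FROM observation_fact f
    JOIN concept_dimension c ON c.concept_cd = f.concept_cd
    WHERE c.name_char LIKE 'Diabetes%'
      AND f.patient_num IN (SELECT patient_num FROM observation_fact
                            WHERE concept_cd = 'LOINC:4548-4'
                              AND nval_num > 7)
""").fetchone()[0]
print(count)  # -> 1
```

Note that the query only ever touches rows for facts that were actually recorded, which is what makes the "positive facts only" storage strategy efficient for sparse clinical data.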
Perhaps the most powerful aspect of the i2b2 database design is the design of the metadata. In i2b2, metadata is the vocabulary, all the medical terms that describe the facts in the database. Metadata is what allows users to interact with the database. A typical clinical data warehouse may have 100,000 to 500,000 concepts, including ICD-9 [7], SNOMED-CT [8], CPT [9], HCPCS [10], NDC [11] and LOINC [12] codes, as well as a host of local codes from in-house systems. Without an intuitive and easy-to-use structure, users would be stymied in understanding and using the codes. In i2b2, a hierarchical folder system is used to group the concepts. General terms are located in higher level folders, with more specific but related terms in folders and leaves underneath. The way the metadata looks in the i2b2 Web client directly reflects its structure in the table (Figure 2). A user can drill up and down in the folders in the user interface to clearly see the hierarchy and find terms of interest.
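The hierarchical grouping of concepts can be illustrated with a small, hypothetical sketch in which each term carries a path-like full name, loosely inspired by the way i2b2 organizes its metadata; the folder paths and codes below are invented for illustration.

```python
# Hypothetical metadata entries: each term carries a path-like full name
# (a simplification of the hierarchical i2b2 metadata table) and, for leaf
# terms, the concept code used in the fact table.
metadata = {
    "\\Diagnoses\\Endocrine\\Diabetes mellitus": "ICD9:250.00",
    "\\Diagnoses\\Endocrine\\Thyroid disorder": "ICD9:244.9",
    "\\Diagnoses\\Circulatory\\Coronary atherosclerosis": "ICD9:414.01",
    "\\Labs\\Hemoglobin A1c": "LOINC:4548-4",
}

def codes_under(folder: str) -> list[str]:
    """Expand a folder into the concept codes of every term underneath it."""
    return [code for path, code in metadata.items() if path.startswith(folder)]

# Dragging the "Endocrine" folder into a query is, behind the scenes,
# a query over all of its descendant concept codes.
print(codes_under("\\Diagnoses\\Endocrine"))  # -> ['ICD9:250.00', 'ICD9:244.9']
```

In this path-based view, adding a new coding system or hiding a discarded code reduces to adding or pruning folders, which is what keeps the metadata maintainable.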
Maintaining and updating the metadata is a significant but workable challenge. New medical codes are constantly being created, and old codes are discarded or changed. The structure of the metadata must be able to seamlessly absorb new codes while remaining backward compatible with old coding schemes. The hierarchical classification scheme of i2b2 makes it easy to map new codes to existing folders and to create new folders as needed. Entire new coding systems can be added simply by creating a new folder. Discarded codes can remain in the hierarchy next to newer ones and be used to reference older data, or be hidden to discourage their use in new queries.
One goal of i2b2 is to help integrate data from the many different sources that exist in modern day healthcare institutions in order to present a comprehensive view of patient care for research. The simple and intuitive design of the i2b2 database enables users to construct complex queries over these disparate data sources.
Using the new generation of Healthcare and Life Sciences standards for Personalized Medicine
The success of Personalized Medicine (PM) at the point of care depends on the effective use of PM knowledge (e.g., pharmacogenomic interpretation of somatic mutations in tumor tissues) while considering the patient's complete medical history (e.g., other diseases, medications, allergies, and genetic mutations).
In order for PM knowledge to be effectively applied to patient medical records, the representations of data and knowledge need to be standardized, given the heterogeneity of their original formats. Both data and knowledge are nowadays generated by a variety of sources, each using proprietary formats and idiosyncratic semantics that are often not represented explicitly (for example, when contextual data is unstructured and thus cannot be parsed by decision support applications).
Interpretation of clinical data typically starts with parsing the metadata, e.g., the predefined schemas of clinical information systems. However, these schemas (most often relational) cannot accommodate the complexity of contextual data representation. Thus, it is important to have a richer language allowing the explicit representation of the patient-specific context of each discrete data item, of how it relates to other data items, and of how it fits within the entire health history of an individual.
Dispersed and disparate medical records of a patient are often inconsistent and incoherent. A patient-centric, longitudinal electronic health record (EHR) based on international standards (e.g., CEN EHR 13606 [13]) could provide a coherent and explicit representation of the data's semantics. New PM evidence, generated by clinical research and validated in clinical trials and by data mining, should be represented in alignment with clinical data representations, in a way that lends itself to PM realization. A constantly growing stream of raw data is available today in both research and clinical environments, e.g., DNA sequences and expression data along with rare variants and their presumed affected function, as well as sensor data along with deduced personal alerts.
The representation of such raw data should adhere, as much as possible, to common and agreed-upon reference models (e.g., HL7/ISO RIM - Reference Information Model [14] or the openEHR RM - Reference Model [13]) that provide unified representations of the common constructs needed for health information representation. For example, any observation could be represented in the same way in terms of its attributes, such as id, timing, code, value, method and status, but more importantly, using the same reference models could lead to the standard representation of clinical statements (e.g., "observation of gall bladder acute inflammation indicated having a procedure of cholecystectomy", or "EGFR variations cause resistance to Gefitinib"), where implicit semantics can become explicit and thus processable by decision support applications.
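As a rough illustration of what such a unified construct buys, the sketch below models a generic observation with the attributes listed above (id, timing, code, value, method, status) as a plain Python data class. The class, field names and example codes are illustrative assumptions and do not reproduce the actual HL7 RIM or openEHR class definitions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# A hypothetical, minimal "observation" construct with the attributes listed
# above; it does not reproduce the real HL7 RIM or openEHR classes.
@dataclass
class Observation:
    obs_id: str
    timing: datetime
    code: str               # e.g., a SNOMED-CT code or a gene identifier
    value: Optional[str]    # coded, numeric or textual result
    method: Optional[str]
    status: str             # e.g., "completed"

# Two very different clinical statements share one standardized shape, so a
# decision-support rule can process both without source-specific parsing.
cholecystitis = Observation("obs-1", datetime(2011, 10, 12, 9, 30),
                            code="SNOMED-CT:65275009",  # acute cholecystitis
                            value="present", method="ultrasound",
                            status="completed")
egfr_variant = Observation("obs-2", datetime(2011, 10, 12, 11, 0),
                           code="HGNC:3236",            # EGFR gene
                           value="T790M",               # an example variant
                           method="sequencing", status="completed")
print(cholecystitis.code, "|", egfr_variant.value)
```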
The abovementioned reference models can underlie the logical models of health data warehousing. Such warehousing could maintain the richest semantic representation of data and knowledge in a way that is also interoperable with other information systems. Performing specific tasks, such as summarizing patient data or analyzing cohort data in research studies, needs more optimized representations of the data and knowledge persisted in warehouses. Data marts are such optimized representations, and multiple data marts could be derived from a single warehouse. For example, the star schema underlying the i2b2 framework (see Figure 1) could be seen as a generic data mart for translational research that could be based on data exported from a standardized data warehouse maintained by a single health organization or across organizations, such as in the case of clinical affinity domains or integrated delivery networks.
In many cross-enterprise warehousing efforts, the main format used to convey patient data is the Clinical Document Architecture (CDA) standard [15]. CDA documents strike a balance between physicians' narratives and structured data in order to facilitate the gradual transition from unstructured clinical notes to standardized and structured data. The same transformation should also take place in knowledge representations, from scientific papers in natural language to structured knowledge, for example.
The efforts to apply Natural Language Processing (NLP) to health information could be connected to healthcare information technologies through standards like CDA, which uses the clinical statement concept. The fundamental outputs of NLP can be mapped onto the constituents of clinical statements, and CDA can thus be a good "catcher" of the results of NLP run over unstructured health information.
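A deliberately simplified sketch of this "catcher" role is given below: the hypothetical output of an NLP pipeline (a concept code, a value, a negation flag and the original text span) is recast into a small CDA-inspired XML observation entry. Real CDA entries follow a much richer, schema-defined structure, so the element and attribute names here are only indicative.

```python
import xml.etree.ElementTree as ET

# Hypothetical NLP output for one sentence of an imaging report: a concept
# code, a value, a negation flag and the original text span.
nlp_result = {"code": "SNOMED-CT:65275009",   # acute cholecystitis
              "value": "present",
              "negated": False,
              "source_text": "acute inflammation of the gall bladder on US"}

# Recast it as a small, CDA-inspired <observation> entry; names are indicative.
obs = ET.Element("observation",
                 negationInd=str(nlp_result["negated"]).lower())
ET.SubElement(obs, "code", code=nlp_result["code"])
ET.SubElement(obs, "value").text = nlp_result["value"]
ET.SubElement(obs, "originalText").text = nlp_result["source_text"]

print(ET.tostring(obs, encoding="unicode"))
# <observation negationInd="false"><code code="SNOMED-CT:65275009" />...
```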
Search and extraction of relevant information from large amounts of data
The continuously increasing amount of available data poses significant technological and computational challenges, both for their management (collection, storage, integration, preservation) and for their effective use (access, sharing, search, extraction, analysis). This issue is becoming predominant in several fields and is being addressed in different ways, according to the peculiarities of each specific field.
The Web is a paradigmatic field for this aspect. A rapidly growing mass of data is flooding the Web. Yet, by leveraging the typically linked nature of Web data, technological and computational advancements are preventing (at least for now) our drowning in it. Automatic robots crawl Web resources, collect their key data and store them in powerful database management systems. Effective indexing and ranking techniques, such as Google's PageRank [16], have been implemented to efficiently catalogue and sort Web resources according to their key data and likely relevance. This enables Web search engines to provide result lists whose top 10 or 20 items often include the one(s) that can reasonably answer numerous, yet simple, user search questions.
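The core idea behind such link-based ranking can be sketched, in very reduced form, as a power iteration over the link graph with a damping factor; the toy graph below is purely illustrative and is not meant to reproduce the actual PageRank implementation.

```python
# A toy link graph: each page lists the pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}
damping = 0.85  # standard damping factor

# Power iteration: each page repeatedly shares its score among the pages
# it links to, until the scores stabilize.
for _ in range(50):
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outgoing in links.items():
        share = damping * rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: -kv[1]))
# 'C', which every other page links to, ends up with the highest score.
```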
Such ability, which is tremendously boosting the Web as an extraordinarily easy-to-use source of information, is based on the assumption that user searches are mainly aimed at finding "at least one" or "the most evident" item that can answer the user's question. Current Web search technologies are not enough when search questions become more complex, simultaneously involve different topics, or require the retrieval of most (if not all) of the available items regarding the question, possibly ordered according to different user-defined features. Furthermore, only a limited portion of all the data accessible through the Web is estimated to be actually findable by current search engines: the vast "deep Web", including dynamic pages returned in response to a query or accessed through a form, resources protected by passwords, sites limiting access through various security technologies (e.g., CAPTCHAs), and pages reachable only through script-generated links, remains hidden.
In the CBI field especially, the amount of collected data is continuously and rapidly increasing, in particular with the recent accumulation of -omics data. Moreover, compared to the Web, the current ability to extract relevant biomedical information and to answer even common CBI questions is far more limited, for several reasons.
First, biomedical-molecular data - which are of various types - are stored in several different formats, within systems that are distributed, heterogeneous, and often not interoperable. Furthermore, a lot of important information is subjectively described in free text, within chief complaints, discharge letters, clinical reports or referrals, which are intrinsically unstructured. The adoption of electronic medical or health records can significantly enhance the availability and sharing of clinical data and information, which are still paper-based in many healthcare sites. Yet, the digitalization of health data alone is far from sufficient; having clinical reports and referrals in PDF format is evidently not enough to solve the information extraction and question answering issues. A standard data and information representation according to a shared reference model has to be adopted, together with controlled terminologies and ontologies to objectively describe medical and biomolecular findings. Moreover, the use of advanced Natural Language Processing techniques suited to the clinical domain, in order to extract and structure information from such medical textual descriptions, can also greatly help, as sketched below.
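As a minimal example of the kind of help NLP and controlled terminologies can provide, the sketch below maps free-text expressions found in a clinical note to terminology codes using a tiny hand-made lexicon; real clinical NLP pipelines rely on far richer lexicons plus negation, uncertainty and context handling.

```python
import re

# A tiny, hand-made lexicon mapping free-text expressions to controlled
# terminology codes; purely illustrative.
lexicon = {
    r"\bmyocardial infarction\b|\bheart attack\b": "SNOMED-CT:22298006",
    r"\bdiabetes( mellitus)?\b":                   "SNOMED-CT:73211009",
}

def extract_codes(text: str) -> list[str]:
    """Return the codes of all lexicon entries found in a clinical note."""
    text = text.lower()
    return [code for pattern, code in lexicon.items() if re.search(pattern, text)]

note = "Patient with known diabetes mellitus, admitted after a heart attack."
print(extract_codes(note))  # -> ['SNOMED-CT:22298006', 'SNOMED-CT:73211009']
```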
Second, typical biomedical-molecular questions are generally more complex than Web search questions. They often involve multiple types of data, as well as topics that usually have several attributes. In many cases, retrieving only a few of the items related to a biomedical-molecular search question, or even the top K items according to some user-defined ranking, may not be enough for a proper answer, which may instead require the exploration of all available items and their attributes.
Advanced search computing techniques are being developed to answer complex, multi-topic Web search questions involving the integration of possibly ranked partial search results [17]. These techniques can also be applied in the CBI domain to tackle such issues, at least partially. Yet, the complex and heterogeneous nature of biomedical data, as well as the multifaceted structure of clinical settings, pose formidable technological and organizational challenges for the effective management and use of biomedical-molecular data. In particular, the integrated search and retrieval of bio-data, and their comprehensive analysis towards the extraction of relevant information [18] and the inference of biomedical knowledge, constitute some of the major challenges for the present and future of CBI, with a potentially remarkable impact on the advancement of clinical research and patient treatment.
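A minimal sketch of the rank-aware combination of partial results that search computing performs is given below: two hypothetical services return ranked lists about different topics, and their results are joined on a shared key with an aggregated score. The service contents, gene and drug names, scores and the scoring rule are all illustrative assumptions.

```python
# Ranked partial results from two hypothetical services: a gene-disease
# service and a drug-gene service. Genes, drugs and scores are invented.
gene_disease = [("EGFR", 0.95), ("KRAS", 0.80), ("TP53", 0.60)]
drug_gene = [("gefitinib", "EGFR", 0.90), ("erlotinib", "EGFR", 0.85),
             ("selumetinib", "KRAS", 0.70)]

def rank_join(genes, drugs):
    """Join the two ranked lists on the gene and order by a combined score."""
    gene_scores = dict(genes)
    joined = [(drug, gene, gene_scores[gene] * score)
              for drug, gene, score in drugs if gene in gene_scores]
    return sorted(joined, key=lambda item: -item[2])

for drug, gene, score in rank_join(gene_disease, drug_gene):
    print(f"{drug:12s} via {gene}: {score:.2f}")
# gefitinib    via EGFR: 0.85
# erlotinib    via EGFR: 0.81
# selumetinib  via KRAS: 0.56
```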