SYMBIOmatics: Synergies in Medical Informatics and Bioinformatics – exploring current scientific literature for emerging topics

Background The SYMBIOmatics Specific Support Action (SSA) is "an information gathering and dissemination activity" that seeks "to identify synergies between the bioinformatics and the medical informatics" domains in order to improve collaborative progress between them. As part of the project, experts in both research fields will be identified and approached through a survey. To provide input to the survey, the scientific literature was analysed to extract topics relevant to both medical informatics and bioinformatics. Results This paper presents results of a systematic analysis of the scientific literature from medical informatics research and bioinformatics research. In the analysis, pairs of words (bigrams) from the leading bioinformatics and medical informatics journals were used as indicators of existing and emerging technologies and topics over the periods 2000–2005 ("recent") and 1990–1999 ("past"). We identified emerging topics that were equally important to bioinformatics and medical informatics in recent years, such as microarray experiments, ontologies, open source, text mining and support vector machines. Emerging topics that evolved only in bioinformatics were system biology, protein interaction networks and statistical methods for microarray analyses, whereas emerging topics in medical informatics were grid technology and tissue microarrays. Conclusion We conclude that although both fields have their own specific domains of interest, they share common technological developments that tend to be initiated by new developments in biotechnology and computer science.


Background
The SYMBIOmatics Specific Support Action (SSA) is a European funded project. The main goal is to identify synergies between the bioinformatics (BI) and medical informatics (MI) research domains. In addition to experts that are approached through a survey, input will also be gathered from the analysis of scientific literature. In this paper, we focus on the analysis of scientific literature.
Bioinformatics (BI) and medical informatics (MI) are two research fields that have matured in the past 20 years. They serve the needs of different but related research communities: BI provides solutions to scientists doing biological research, whereas MI meets the demands of clinical personnel, practitioners and scientists in medical research [1,2]. Although biological research may be part of a medical research project, it is often unclear how BI and MI research are coupled together [3]. Both research domains profit from progress in new IT developments and computer science as well as related scientific fields (e.g. physics, mathematics, etc.). However, the degree of exchange of new developments between the BI and the MI research domain has not been analysed [4]. Some indications of cross-fertilisation between the BI and the MI domain have been reported [5]. Both domains share a common IT infrastructure (e.g. electronic databases and terminologies), and scientists in both domains adopt solutions from the other domain if they work in an interdisciplinary environment (e.g. biological research done in a clinical environment) [6]. Last but not least, both domains share the common goal of providing new IT-based solutions to biomedical research and contributing to the treatment and cure of diseases. As a result, synergies between MI and BI research can be expected, as both contribute to medical or biological research that aims at a better understanding of the molecular basis of diseases, e.g. the genetic predisposition for a disease [7].
Although BI and MI contribute to biomedical research and share information technology, the extent to which researchers in the BI domain contribute to ongoing work in MI research and vice versa has not yet been analysed. Some researchers will be active in both fields, i.e. they collaborate with researchers from the BI and the MI domain and publish in journals reporting on MI research as well as in journals for BI research. A different indicator of cross-fertilisation between both domains is the uptake of new technologies from the other domain, e.g. post-processing of data from microarray experiments and the use of controlled vocabularies such as UMLS and gene ontology. Although it can be expected that BI and MI researchers benefit from common research, development and collaborations, it is as yet unclear to what extent researchers are active in both fields and how current and future collaborations can lead to benefits for both sides. Therefore we analysed a large set of publications from BI and MI research to identify topics that are relevant to both research domains.
The scientific literature forms the repository of research accomplished in the past. Medline provided by the National Library of Medicine (NLM, Bethesda, MD, U.S.A) is the most comprehensive set of documents of biomedical research covering BI and MI research as well. Each Medline abstract contains in a condensed form details on technologies applied and results obtained. As part of the SYMBIOmatics project abstracts from BI and MI journals were processed to extract topics that are shared between the BI and MI domain and thus have the potential for synergies for both.
In recent years Medline abstracts have been used to extract facts such as protein-protein interactions, functional annotations of proteins, pathway information, point mutations, gene-disease associations and other protein or gene related information [8][9][10][11]. All of these approaches rely on existing terminological resources to extract facts from the literature that are linked to known terms. However, no terminological resource represents all BI and MI topics. On the contrary, every new scientific publication could contain a new topic, depending on the potential of the solution presented in the document. Others have proposed to extract paradigm shift patterns from the text, but rely on known syntactical patterns for the representation of such facts [12]. Such patterns are not available for new emerging technologies or for common topics between the BI and the MI domain. The identification of microparadigms, i.e. chains of collective reasoning, and of discourse structure in the documents is likewise not suitable, since new emerging technologies are not part of a discourse structure [13,11]. As a result, we chose to analyse the distribution of bigrams in the literature to find evidence for new emerging technologies.
The rest of the document is organized as follows. The "Results" section reports on identified and shared topics between both domains. In the "Discussion" section we interpret the findings and discuss shortcomings of our approach. In the "Methods" section we describe the generation of the corpus and the extraction of bigrams.

Results
The BI journal corpus contains 8,696 documents and the MI journal corpus 6,309 documents (table 1). The BI query corpus consists of 142,656 documents in comparison to 49,119 documents in the MI query corpus; 689 documents were in both corpora (not shown). Comparing the statistical parameters that describe the BI journal corpus and the MI journal corpus, we find that the size of both corpora and the distribution of bigrams extracted from both corpora are similar (table 1).
Analyzing the overlap between the BI query corpus and the two journal corpora shows that 44% of the BI journal corpus is contained in the BI query corpus, whereas only 3% of the MI journal corpus overlaps with the BI query corpus (table 2). Conversely, 62% of the MI journal corpus overlaps with the MI query corpus, but only 8% of the BI journal corpus does. The MI journal corpus seems to be more homogeneous than the BI journal corpus and is better represented in the MI query corpus than the BI journal corpus is in the BI query corpus.
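As an illustration of how such overlap percentages can be derived, the following minimal sketch assumes that each corpus is available as a set of document identifiers (for example PubMed IDs). The function name and the toy identifiers are hypothetical and are not taken from the project infrastructure.

```python
from typing import Set


def overlap_percent(journal_corpus: Set[str], query_corpus: Set[str]) -> float:
    """Share (in %) of the journal corpus that is also contained in the query corpus."""
    if not journal_corpus:
        return 0.0
    shared = journal_corpus & query_corpus
    return 100.0 * len(shared) / len(journal_corpus)


# Toy identifiers, invented for illustration only.
journal = {"d1", "d2", "d3", "d4"}
query = {"d2", "d3", "d9"}

# 2 of the 4 journal documents also occur in the query corpus.
print(overlap_percent(journal, query))
```

Note that the measure is asymmetric: it is normalised by the size of the journal corpus, which is why the overlap is reported per journal corpus rather than per query corpus.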
We extracted the publication date of the documents from the BI journal corpus and the MI journal corpus and calculated the distribution over time (figure 1). We observe a strong increase in publications in the BI field over the past 5 years, whereas the main growth in publications in the MI field took place between 1990 and 2000. In the case of the BI journal corpus the most frequent bigrams over the past 15 years are "gene expression" (Df = 711), "amino acid" (Df = 490) and "protein sequence" (Df = 438; table 3). In the same way, the selection of the most frequent bigrams from the MI journal corpus ("information system", Df = 899; "health care", Df = 881; and "decision support", Df = 536; table 4) again shows the same distribution as across the whole document set. We conclude that researchers working on the most relevant topics to the MI and the BI domain generate a continuous stream of publications for every journal and conference of the domain.
For the identification of new technologies and topics we identified those bigrams that were mentioned during the period 2000-2005 but only at a low frequency before (called "emerging bigrams"). From all emerging bigrams we selected the 15 bigrams with the highest document frequency and compared them to bigrams that had the highest document frequency amongst recent and past documents. In the BI journal corpus, the most frequent emerging bigrams were "microarray datum" (emerged 2000, Df = 268), "microarray experiment" (2000, Df = 184) and "microarray data" (2000, Df = 169). The first bigram is already amongst the highest ranking bigrams during 2000-2005 (position 12) and is more frequent than the bigrams ranked 8-10 for the past 15 years. The importance of microarray experiments for the BI domain is reflected in the high frequency of publications attached to this emerging technology and, in addition, by other bigrams in the list of the top 15 (e.g. "expression profile", "cdna microarray", "microarray technology" and "microarray gene"). Other topics that had a strong representation in recent documents are "gene ontology", "support vector" and "vector machine", "protein interaction" and "interaction network", "whole genome" and "nucleotide polymorphism".
[Table caption, beginning missing] ... 1990-1999). From all documents, bigrams were extracted from noun phrases (for details see text). The analysis was restricted to bigrams with a document frequency of at least 20. In the set of recent documents we identified those bigrams that were not mentioned before 2000 ("emerging"). The BI journal corpus and the MI journal corpus are similar in terms of the document members and contained bigrams.
Top ranking emerging bigrams in the MI journal corpus were "patient safety" (Df = 64), "gene expression" (Df = 44) and "medical error" (Df = 41). The frequency of the emerging bigrams was much lower than the frequency of the top ranking emerging bigrams in the BI journal corpus and much lower than the frequency of bigrams in recent and past documents. This shows that new developments emerged in the MI domain at a lower frequency in recent documents than in the BI domain. A few bigrams such as "gene expression", "open source" and "expression datum" are typically attributed to the BI domain. Other bigrams such as "support vector" and "vector machine" show that the MI domain as well as the BI domain profit from new developments in computer science and mathematics.
We extracted all bigrams with high TfIdf values that emerged between 2000 and 2005, i.e. all bigrams that were not mentioned before 2000 and that had a high frequency in the corpus. As expected, microarray experiments and technologies related to microarrays were the most prominent developments starting in 2000 (table 5). Other emerging topics refer to "gene ontology", "support vector" and "vector machines", "text mining", "open source", "system biology", "association study" and others. From 2002 to 2003 new topics are again related to microarray experiments, such as "false discovery" and "discovery rate", "r package" and "microarray study", whereas others are related to ontologies ("go term", "go annotation"). During this period and during 2004-2005 new topics refer to splicing ("splicing event") and text mining ("biocreative task", "task 1a", "task 2").
In the MI domain, new topics between 2000 and 2001 emerged at a lower frequency (TfIdf value). In common with the BI domain, the topics "open source", "expression datum", "support vector" and "vector machine" emerged (table 6). In contrast to the BI domain, the topics "medical error", "snomed ct" and "study background" were prominent. During 2002 to 2003 bigrams related to microarray technology appeared, as did the topic "gene ontology", all of which are primarily attributed to the BI domain but did not necessarily originate there. In the past 2 years, "grid technology" and "ubiquitous computing" in particular emerged, as well as the tissue microarray data exchange specification ("tma des", "microarray data", "exchange specification").
Altogether, a number of topics are shared between the BI and the MI domain that have developed over the past 5 years (microarray experiments, ontologies, open source, text mining, support vector machines). All of them are the basis of synergetic development.

Discussion
Both the BI and the MI domain undergo fast changes: new biomedical and IT technologies are introduced and lead to changes in research. The rate of publications in the BI domain shows a strong increase over the past years, with a large portion of the research work directly linked to microarray experiments. The importance of microarray experiments for biomedical research is also visible in the MI domain and is likely to become more visible there in the future.
In the MI domain, "patient safety" and "medical error" were strong emerging topics, reflecting concerns resulting from recent studies showing that errors in medical treatment could be avoided with better IT support [14]. By their nature these topics are unlikely to be of importance to the BI domain. Finally, not all emerging topics could be identified in our analysis, since it relies on the extraction of bigrams. New topics that have not been identified are telemedicine, pharmacogenomics, biochips and lab-on-a-chip.

Conclusion
From our analysis of the scientific literature for bioinformatics and medical informatics we find that although both fields have their own specific domains of interest, they share common topics. The analysis of microarray experiments as a shared topic is driven by the new technology changing biological and medical research. Other topics such as text mining and ontology development are co-evolving in both domains, and support vector machines have been introduced to both domains at the same time by new developments in computer science and mathematics. According to our literature analysis, these topics currently form the core of synergies between the BI and the MI domain. New topics currently relevant to the BI domain, related to population genetics and system biology, may become more prominent in the near future.

Methods
All corpora were separated into two sets: the first one covering the years 2000 to 2005 ("recent documents") and the second one covering the years 1990 to 1999 ("past documents"). All corpora were processed in the same way using a modular information extraction infrastructure available from the European Bioinformatics Institute [20]. The compute server was a Linux farm of 220 IBM dual-CPU nodes (1.2-2.8 GHz, 2 GB RAM).
The noun phrases were selected from the documents, where a noun phrase is represented by the language pattern "Det (Adj|Adv|Noun)+ Noun+". All noun phrases were processed to extract all contained bigrams, which then serve as features of the document representing the content. A bigram is any combination of two consecutive words from the noun phrase. The leading determiner was dropped. Every word of the noun phrase was normalized to lower case and lemmatized to use the base form only. For example, the noun phrase "the protein secondary structure" was split up into the bigrams "protein secondary" and "secondary structure". Every Medline abstract was represented by a list of bigrams extracted from the document.
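The bigram extraction step described above can be sketched as follows. This is a minimal illustration rather than the project's actual pipeline: it assumes that the noun phrase has already been identified and is passed in as a whitespace-separated string, the determiner list is an illustrative assumption, and `naive_lemma` is a hypothetical stand-in for a real lemmatizer.

```python
from typing import Callable, List, Tuple

# Assumption: a small illustrative determiner list; the text does not enumerate one.
DETERMINERS = {"the", "a", "an", "this", "that", "these", "those"}


def naive_lemma(word: str) -> str:
    """Hypothetical stand-in for a real lemmatizer: strips a simple plural 's' only."""
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "is", "us")):
        return word[:-1]
    return word


def noun_phrase_bigrams(
    phrase: str, lemmatize: Callable[[str], str] = naive_lemma
) -> List[Tuple[str, str]]:
    """Return all pairs of consecutive words of a noun phrase.

    The leading determiner is dropped, and every word is lower-cased
    and reduced to its base form before the pairs are built.
    """
    tokens = [token.lower() for token in phrase.split()]
    if tokens and tokens[0] in DETERMINERS:
        tokens = tokens[1:]
    lemmas = [lemmatize(token) for token in tokens]
    return list(zip(lemmas, lemmas[1:]))


print(noun_phrase_bigrams("the protein secondary structure"))
# [('protein', 'secondary'), ('secondary', 'structure')]
```

A noun phrase with n words after removal of the determiner yields n - 1 bigrams, so a single-word phrase contributes no features.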
The extraction of bigrams from noun phrases is advantageous in comparison to the use of single terms from noun phrases, since single terms tend to be ambiguous. On the other hand, bigrams are less specific than noun phrases, since bigrams are shorter and have less syntactical variability.
For every bigram, the frequency in the document was calculated (term frequency, Tf) as well as the frequency of the bigram in all documents of the corpus (document frequency, Df), resulting in the TfIdf value (Tf / Df) for every bigram. Recent and past documents were processed separately. For every document, the bigrams were ranked according to their TfIdf value and the 10 bigrams with the highest TfIdf score were selected for further analysis. Note that some documents contain bigrams that have only a relatively low TfIdf score in comparison to the whole set of all identified bigrams. Such documents either deal with new developments or with a niche research topic. These bigrams were also included in the analysis, since they represent a document. Bigrams that were mentioned in fewer than 20 documents over the period from 1990 to 2005 were excluded from further analysis. All bigrams were again ranked according to their TfIdf value.
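The scoring and selection steps can be illustrated with a short sketch. It assumes that every document is given as a list of bigram strings and uses the Tf / Df ratio stated in the text; the function names and the toy documents are hypothetical, and ties are broken alphabetically, which is an assumption not specified in the text.

```python
from collections import Counter
from typing import Dict, List


def document_frequency(docs: List[List[str]]) -> Counter:
    """Df: number of documents in which each bigram occurs at least once."""
    df: Counter = Counter()
    for bigrams in docs:
        df.update(set(bigrams))
    return df


def top_bigrams_per_doc(docs: List[List[str]], k: int = 10) -> List[List[str]]:
    """Rank the bigrams of each document by Tf / Df and keep the k best."""
    df = document_frequency(docs)
    selected = []
    for bigrams in docs:
        tf = Counter(bigrams)
        scores = {bigram: tf[bigram] / df[bigram] for bigram in tf}
        # Highest score first; alphabetical order as a tie-breaker (assumption).
        ranked = sorted(scores, key=lambda bigram: (-scores[bigram], bigram))
        selected.append(ranked[:k])
    return selected


def filter_by_df(df: Dict[str, int], min_df: int = 20) -> Dict[str, int]:
    """Drop bigrams that occur in fewer than min_df documents."""
    return {bigram: n for bigram, n in df.items() if n >= min_df}


# Toy documents, invented for illustration only.
docs = [
    ["gene expression", "gene expression", "amino acid"],
    ["gene expression", "protein sequence"],
    ["amino acid"],
]
print(top_bigrams_per_doc(docs, k=1))
print(filter_by_df(document_frequency(docs), min_df=2))
```

In the second toy document, "protein sequence" outranks "gene expression" because it occurs in only one document, which shows how the ratio favours bigrams that are specific to a document.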
We computed 2 bigram lists for each of the 4 corpora: one list contained the bigrams for the recent documents and the other for the past documents. We extracted from the bigram list of the recent documents all the bigrams that were mentioned amongst the bigrams of the past documents at a very low document frequency (Df < 4) and which had a high Df score after 1999, which resulted in the list of "emerging bigrams". Any bigram not mentioned before a given time period is called a "new bigram".
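The selection of emerging bigrams can be sketched as follows, assuming that the document frequencies of the past and the recent documents are available as dictionaries. The threshold Df < 4 is taken from the text; because no explicit cut-off for a "high" recent Df is given, this sketch simply ranks the candidates and keeps the top entries, which is an assumption. The function names and toy counts are hypothetical.

```python
from typing import Dict, List, Tuple


def emerging_bigrams(
    df_past: Dict[str, int],
    df_recent: Dict[str, int],
    max_past_df: int = 4,
    top_n: int = 15,
) -> List[Tuple[str, int]]:
    """Bigrams with past Df below max_past_df, ranked by their recent Df."""
    candidates = [
        (bigram, n)
        for bigram, n in df_recent.items()
        if df_past.get(bigram, 0) < max_past_df
    ]
    # Highest recent Df first; alphabetical order as a tie-breaker (assumption).
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return candidates[:top_n]


def new_bigrams(df_past: Dict[str, int], df_recent: Dict[str, int]) -> List[str]:
    """Bigrams of the recent documents that never occur in the past documents."""
    return sorted(b for b in df_recent if df_past.get(b, 0) == 0)


# Toy counts, invented for illustration only.
df_past = {"topic a": 300, "topic b": 2}
df_recent = {"topic a": 400, "topic b": 180, "topic c": 90}

print(emerging_bigrams(df_past, df_recent))
print(new_bigrams(df_past, df_recent))
```

With these toy counts, "topic b" and "topic c" qualify as emerging, while only "topic c" is new in the strict sense of never having been mentioned before.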

Authors' contributions
All authors have worked together on the SYMBIOmatics project and have contributed to the discussion on the synergies between the BI and MI research domains. The literature analysis is the result of collaborative work between Dietrich Rebholz-Schuhmann, Erik van Mulligen, Jean-Louis Coatrieux and Johan van der Lei and was delivered as a result of a working package to the project. Special thanks belong to Graham Cameron, Dominic Clark, Fernando Martin-Sanchez and Luciano Milanesi for feedback on the manuscript and for integration of the analysis into the consultation work of the project.