- Open Access
Analyzing the field of bioinformatics with the multi-faceted topic modeling technique
© The Author(s). 2017
- Published: 31 May 2017
Bioinformatics is an interdisciplinary field at the intersection of molecular biology and computing technology. To characterize the field as convergent domain, researchers have used bibliometrics, augmented with text-mining techniques for content analysis. In previous studies, Latent Dirichlet Allocation (LDA) was the most representative topic modeling technique for identifying topic structure of subject areas. However, as opposed to revealing the topic structure in relation to metadata such as authors, publication date, and journals, LDA only displays the simple topic structure.
In this paper, we adopt the Tang et al.’s Author-Conference-Topic (ACT) model to study the field of bioinformatics from the perspective of keyphrases, authors, and journals. The ACT model is capable of incorporating the paper, author, and conference into the topic distribution simultaneously. To obtain more meaningful results, we use journals and keyphrases instead of conferences and bag-of-words.. For analysis, we use PubMed to collected forty-six bioinformatics journals from the MEDLINE database. We conducted time series topic analysis over four periods from 1996 to 2015 to further examine the interdisciplinary nature of bioinformatics.
We analyze the ACT Model results in each period. Additionally, for further integrated analysis, we conduct a time series analysis among the top-ranked keyphrases, journals, and authors according to their frequency. We also examine the patterns in the top journals by simultaneously identifying the topical probability in each period, as well as the top authors and keyphrases. The results indicate that in recent years diversified topics have become more prevalent and convergent topics have become more clearly represented.
The results of our analysis implies that overtime the field of bioinformatics becomes more interdisciplinary where there is a steady increase in peripheral fields such as conceptual, mathematical, and system biology. These results are confirmed by integrated analysis of topic distribution as well as top ranked keyphrases, authors, and journals.
- Text mining
- Topic modeling
- ACT model
- Keyphrase extraction
Over the years, academic subject areas have converged to form a variety of new, interdisciplinary fields. Bioinformatics is one example. Research domains from molecular biology to machine learning are used in conjunction to better understand complex biological systems such as cells, tissues, and the human body. Due to the complexity and broadness of the field, bibliometric analysis is often adopted to assess the current knowledge structure of a subject area, specify the current research themes, and identify the core literature of that area .
Bibliometrics identifies research trends using quantitative measures such as a researcher’s number of publications and citations, journal impact factors, and other indices that can measure impact or productivity of author or journal [2–5]. In addition, other factors such as the affiliation of authors, collaborations, and citation data are often incorporated into bibliometric analysis [6–9].
Previous studies mainly rely on quantitative measures and suffer from the lack of content analysis. To incorporate content analysis into bibliometrics, text-mining techniques are applied. Topic-modeling techniques are mostly adopted to identify the topics of a subject area while analyzing that area more abundantly [10–13]. These techniques allow for enriched content analysis. As an extension of Latent Dirichlet Allocation (LDA), which is the best received topic-modeling technique, Steyvers et al.  proposed the author-topic modeling technique that analyzes authors and topics simultaneously. They identify the authors’ impact or productivity of researchers in a given subject area [15, 16]. By adding multiple conditions to LDA, Tang et al.  suggested a new methodology, called the Author-Conference-Topic (ACT) model that analyzes the author, conference, and topic in one model to understand the subject area in an integrated manner.
In this paper, we apply the ACT model to examine interdisciplinary nature of bioinformatics. Unlike studies that use extended versions of LDA for topic analysis, the ACT model enables us to analyze topic, author, and journal at one time, providing an integrated view for understanding bioinformatics. The research questions that we are to investigate in this paper are: 1) What are the topical trends of bioinformatics over time? 2) Who are the key contributors in major topics of bioinformatics?, and 3) Which journal is leading which topic?
To address these questions, we collect PubMed articles in XML format and extract metadata and content such as the PMID, author, year, journal, title, and abstract. From the title and abstract, we extract keyphrases, which provide more meaningful interpretations than single words, as an input of the ACT model. We also divide the collected datasets into four time periods to examine the topic changes over time. The results of ACT model–based analysis show that various topics begin to appear and mixed subject topics become more apparent over time.
The rest of the paper is organized as follows. In the Background section, we discuss work related to bibliometric analysis and topic modeling. We then describe the proposed method in the Methods section. We analyze and discuss the results of leading topics, authors, and journals in the Result and Discussion section. Finally, we conclude the paper and suggest future lines of inquiry in Conclusions.
Bibliometric analysis identifies the research trends in a given subject area and core journals or documents, and helps with contrastive analysis. Many bibliometric studies use the number of published articles or journal impact factors to measure research productivity or to identify core journals in a specific field. Soteriades and Falalgas  applied quantitative and qualitative measurements to analyze the fields of preventive medicine, occupational and environmental medicine, epidemiology and public health using the number of articles and impact factor. Ugolini et al.  measured research productivity and evaluated the publication trends in the field of cancer molecular epidemiology. To quantify productivity, they used the number of articles and average and sum of impact factors. To evaluate publication trends, they collected and divided the keywords from MeSH terms about the publication into six groups. Ramos et al.  measured the national research activity of the tuberculosis field, using impact factor and the first author’s address. Claude et al.  examined research productivity by using distribution of publications related to medicine and ANN, the subfield of biology. They used the number of publications, impact factor, and journal category compared with national gross domestic product (GDP). In the bioinformatics field, Patra and Mishra  used the number of articles, publication of each journal, publication type, and the impact factor of journals to understand the growth of bioinformatics. They also found the core journals in the bioinformatics fields. Using author affiliation, they applied Lotka’s law to assess the distribution of each author’s productivity. Chen et al.  identified research trends using statistical methods based on the type of publication, language, and distribution of nation or institution. They measured h-index, adding statistical materials with the number of citations. Through this, they analyzed the research productivity by topic, institution, and journal. In addition, they conducted a keyword analysis to comprehend the research trend in a macroscopic view.
Mainstream bibliometrics research focuses on identifying the knowledge structure of a certain field with quantitative measures. In addition, some studies use author information or the collaboration pattern among authors to understand the certain field. Seglen and Aksnes  used the size and the productivity of research groups in the microbiology field in Norway as a measurement for bibliometric analysis. Geaney et al.  performed bibliometric analysis and density-equalizing mapping on scientific publications related to type 2 diabetes mellitus. They collected citation data and used various citation-oriented measures such as the number of citations, the average number of citations per journal, the total number of publications, impact factor, and eigenfactor score. To conduct content analysis and study the collaboration pattern between authors and the core sub-field of AIDS, Macías-Chapula and Mijangos-Nolasco  analyzed MeSH thesaurus using check tags, main headings, and subheadings of each MeSH term hierarchy. In addition, to measure the national research productivity, they used the authors’ address information. Bornmann and Mutz  recently identified the development of modern science by bibliometric analysis. They divide the data into three time periods to analyze the changes of fields over time.
Text mining applied to bibliometrics
Recently, there have been many attempts to apply text-mining techniques to bibliometric analysis to identify the knowledge structure of the field or measure its influence on other researchers and their fields and productivity. Song and Kim  collected full-text articles from PubMed Central and computed their citation relation. They infer the knowledge structure and understand the trend of the bioinformatics field. In a similar vein, Song et al.  measured the influence and productivity of bioinformatics by mining full-text articles retrieved from PubMed Central. To calculate the field’s productivity, they identified the most productive author, nation, institution, and topic word; to calculate its influence, they identified the most-cited paper, author, and rising researcher. Song et al.  analyzed topic evolution in the bioinformatics field using DBLP data in the field of Computer Science. To identify topic trends over time, they divided a dozen years (2000–2011) into four periods and applied the Markov Random Field-based topic clustering. For automatic clustering labeling, they calculated topic similarity based on Within-Period Cluster Similarity (WPCS) and Between-Period Cluster Similarity (BPCS). Their approach created topic graphs that show interaction among topics over someperiod of time. Lee et al.  mapped the Alzheimer’s disease field in three different perspectives: indexer, author, and citer. They applied entity-metrics  the extended notion of bibliometrics, to analyze the field by constructing four kinds of networks that convey these three perspectives.
These studies identify the knowledge structure of a certain field by constructing bibliometric networks or databases with text-mining techniques. The most prevalent approach is to apply topic modeling to content analysis as a part of bibliometrics. Starting from the probabilistic Latent Semantic Indexing (pLSI)  model, Latent Dirichlet Allocation (LDA)  is the most accepted topic modeling technique for bibliometrics. While each document consists of a set of topics in pLSI, using the LDA model a more precise manipulation is added to organize the topics. Yan  used the LDA model to measure the influence and popularity of library and information science. He also identified the most-cited area and the patterns in this field. Jeong and Song ’s research measured the time gap among three different resources—web, patent, and scientific publication—in two research domains by applying the LDA model. The basic input unit for LDA is a set of documents. To organize author information into topics, Rosen-Zvi et al.  and Steyvers et al.  proposed the author-topic model with different theoretical background. Li et al.  identified the relations between authors and topics by using the author-topic model. They analyze the topic distribution to examine how many authors are associated with a certain topic. Also through the number of authors, they identify topics that are studied by many researchers. Tang et al.  proposed the ACT model which identifies paper, author, and conference simultaneously. Additionally, they developed the ArnetMiner system for mining academic research social networks.Tang et al.  also supplement ArnetMiner for a topic level expertise search over heterogeneous networks using the ACT model. It generates the most issued topics, author’s interestedness, paper search, academic suggestion, and experts in a specific field. Kim et al.  adopted the ACT model in terms of citation analysis. They collected the dataset in the field of oncology from PubMed Central, which provides the full-text articles in the biomedical field. They utilized the ACT model for analyzing citation sentences and journals instead of abstracts and conferences.
In conclusion, most previous studies identified knowledge structures by adopting not only bibliometric analysis but text-mining techniques such as the LDA model. To supplement bibliometric analysis, there are many attempts to incorporate content analysis into bibliometrics by adopting the LDA model text-mining techniques. However, the main limitation of this application of the LDA model,the representative method for trend analysis, is that it only explains topical trends by using one parameter such as bag-of-words on documents via topical terms. It is not sufficient to conduct comprehensive analysis for understanding knowledge disciplines. Therefore, in this paper, we apply the ACT model to the bioinformatics field for integrated analysis. Applying the ACT model, we aim to explore the importance of authors and journals in relation to topics. We divided the collected datasets into four periods to trace the changes of topic, author, journal ranking over time, and combine the results with bibliometric analysis.
Statistics of collected publications
Number of Papers
Journal of Molecular Biology
The EMBO Journal
Journal of Theoretical Biology
Human Molecular Genetics
Protein Science : a publication of the Protein Society
Journal of Proteome Research
Journal of Biotechnology
PLOS Computational Biology
BMC Research Notes
Trends in Biochemical Sciences
Trends in Genetics
Journal of Molecular Modeling
Molecular & cellular proteomics : MCP
Trends in Biotechnology
Bulletin of Mathematical Biology
Journal of Proteomics
Journal of Computer-Aided Molecular Design
BMC Systems Biology
Pharmacogenetics and Genomics
Statistical Methods in Medical Research
Journal of ComputationalNeuroscience
Molecular Systems Biology
Theoretical Biology and Medical Modeling
Comparative and Functional Genomics
Briefings in Functional Genomics & Proteomics
Algorithms for Molecular Biology
Journal of Biomedical Semantics
EURASIP Journal on Bioinformatics and Systems Biology
Source Code for Biology and Medicine
Data preprocessing and keyphrase extraction
Time-based statistics for 20 years
Number of Papers
We extract various metadata, such as the PMID, author, publication year, journal title, title, and abstract, from XML formatted records. After XML processing, we combine the title with abstract and conduct keyphrase extraction. For keyphrase extraction, we use MAUI, which has the keyphrase model trained with MeSH terms . In this dataset, there are 500 documents and several keys consisting of MeSH terms about each documents, which were manually assigned by the indexer. MAUI is a newer version of the keyphrase extraction algorithm KEA . Keyphrase extraction enables researchers to select representative phrases to make topic detection more meaningful. Therefore, we use keyphrases extracted from the title or abstract as our input for the ACT model instead of individual words.
Example of results of keyphrase extraction and other metadata from PMID of 26030820
encoding cell amplitude frequency modulation
Micali Gabriele, Aquino Gerardo, Richards David M, Endres Robert G
PLOS computational biology
Down-Regulation | Ion Channels | Ions | L Cells (Cell Line) | Ligands | Social Control, Formal | Social Control, Informal | Up-Regulation
ACT Model Application
The ACT model, proposed by Tang et al.  as an extension of the LDA model , is a unified topic model for modeling various metadata simultaneously. This model starts with the assumption that the order of the topic created by the paper, author, and conference is same. It also estimates the statistical distribution associated with all topics for the purpose of discovering latent topic distribution related with paper, author, and conference. In this paper, two metadata types are changed. First, conference is replaced with journal. Also, a bag-of-keyphrases are used instead of a bag-of-words to represent documents in a more precise manner.
Notation and description of the ACT model
Total number of words in paper d
Total number of authors in paper d
Total number of papers
Total number of authors
Selected number of topics
Hyper-parameters of Dirichlet distribution
Average of Pearson correlation coefficients result
Number of Runs
Pearson correlation coefficients
Perplexity result of topic model
Number of Topics
Together with this result, we analyzed the results of the ACT model.
We analyze leading authors and journals in relation to topics over time. In the following section, we provide the detail explanations of the trend per period.
Topic analysis per period
The results of our time series topic analysis show that topics seem to be more distinct and subdivided closer to present. In addition, new topics have emerged in recent years, and they do not make a new cluster, which means the exclusive topics become apparent. The results also show that research fields such as molecular biology, genomics, genetics, and proteomics play a supplementary role in biology, but also become diversified into a unique field.
First period analysis
In the first period (1996–2000), five dominant topic clusters are identified (Additional file 1: Appendix 1). Those five topics are mainly associated with proteins and peptides. Phrases such as molecular biology and chemical compound are widespread, and thermodynamics- and kinematics-related topics appear. These topics are composed of jargon in their specific fields. The mathematical biology field is shown by topical phrases such as database, cluster analysis, model, theoretical, and software.
Topics 0, 2, and 3 are about molecular biology, which are derived from biochemistry and composed of hydrogen bonding–related chemical compounds such as enzymes or lipids. Topics 4, 5, 6, and 7 are related to proteins, peptides, and protein structure. Topics 9 and 14 include words such as ‘probability’ and ‘statistics’, which are related to mathematical biology. Topics 13, 17, 18, and 19 cover mutagenesis, disease, and syndromes. These are all related with genetic diseases. Mutagenesis consists of gene mutation, and syndromes are caused by genetic disorder. Topic 19 includes the word ‘genetic’ that is a parent category of previously mentioned words. Topics 15 and 16 consist of kinetics.
Protein-related topics are dominant, and authors involved in peptide and protein structure are prevalent in the first period. Authors who are in topic 5, such as Fersht A.R., Thornton J. M., Dobson C. M., Serrano L. and Karplus M., have a high probabilistic distribution value, which means they are leading researchers in this area. Their research interest is mainly in protein structure, and they have publications in the Journal of Molecular Biology. This journal appears in almost all of the topics related to protein and deals with structure and function of macromolecules, complexes, and protein folding.
Second period analysis
There are four topic clusters and one exclusive topic in the second period (Additional file 1: Appendix 2). In the second period (2001–2005), studies about genetics and genomics are actively conducted, and protein-related topics are diversified into subfields such as proteomics. In addition, mathematical biology and computational biology–related topics are maintained in this period.
Topics 1, 2, 5, 7, and 11 include DNA mechanism, molecular structure, genetics, genomics, and diseases caused by DNA or genome such as Down syndrome, DNA transposable elements, and ribonucleases. Topics 0, 3, 14, and 16 are mainly about proteomics, specifically focusing on protein structure. Topics 12, 18, and 19 contain biotechnology, molecular modeling, and structure. Topics 8 and 9 focus on mathematical biology and computational biology. Topic 4 exclusively contains enzymology-related phrases such as enzyme activators and oxygen. Enzymology-related topics are less common compared with the first period.
The second period mainly focuses on gene-related topics. Topic 5 has the highest probabilistic distribution among top-ranked authors such as Petsko Gregory A., Aravind L., Koonin Eugene V., Gerstein Mark., and Hurst Laurence D. They are interested in genomics and biomedical engineering. Those authors publish papers in Genome Biology. Genome Biology covers subject matters related to genomics and post-genomics. Similar to the first period, protein-related research is a major topic in the second period. Top-ranked authors in this topic include Aebersold Ruedi, Roepstorff Peter, Righetti Pier Giorgio, Sanchez Jean-Charles, and Jungblut Peter R. These authors are pioneers of proteomics. Their papers are published in the Journal of Molecular Biology and Proteomics.
Third period analysis
In the third period (2006–2010), the topics are divided into three clusters: genomics, proteomics, and other (Additional file 1: Appendix 3). Different from the first two periods, four exclusive topics exist and seem to be distinct from topics in the other three periods. For instance, studies about genomics or proteomics are more diversified than in the earlier periods. Exclusive topics that are not included in two large fields emerge, indicating that bioinformatics research is conducted in various fields related to bioinformatics.
Topics 3, 7, 10, 11, 13, and 16 consist of proteomics, protein evolution, and protein structure. Proteomics-related topics are subdivided. The representative journals in the area are Proteomics, the Journal of Proteome Research, and the Journal of Proteomics. Topics 5, 6, 12, 14, and 19 are gene-related topics such as gene expression, gene transcription, and genomics. Gene-related studies become prevalent in the second period. The distinct topics that appear in the third period are topics 0, 15, 17, and 18. Topic 0 is about molecular biology, especially focusing on hydrogen bonding. In the first and second periods, topic 15 includes various topics related to theoretical biology. Topic 17 is related to hepatitis, the infection in liver cells and tissues. Different from previous periods, in the third period topics are associated with specific diseases. Topic 18 includes peptide-associated phrases, and, unlike prior periods, concrete themes like specific chemical compounds and protein appear.
Overall, protein-related topics are most common in the third period. The third period also has more sub-divided and distinct topics than previous periods do. In this period, general topics such as proteomics appear, as do specific topics such as protein evolution, protein analytics, and protein ubiquitin. Among these areas, the topic with the highest distribution is analytics about protein, and it is sub-categorized in proteomics. Top-ranked authors in this period include Mann Matthias, Aebersold Ruedi, Smith Richard D., Heck Albert J. R., and Thongboonkerd Visith. They are experts in protein analytics, and commonly use mass spectrometry for their analyses. They actively publish in the Journal of Proteome Research and Proteomics. These two journals are top-rated journal in protein-related topics. The Journal of Proteome Research is computer technology–oriented and focused on protein-analysis research. The journal with the highest probabilistic distribution in all topic areas is the EMBO Journal. This journal is focused on molecular biology and also covers proteomics.
Fourth period analysis
The fourth period (2011–2015) shows three major topic clusters and two exclusive topics (Additional file 1: Appendix 4). Similar to the third period, the topics related with genomics and proteomics are further divided into subfields and represent concrete topical characteristics. Compared with the third period’s results, theoretical biology–related topics form one cluster. The compositions of the cluster are one big topic (systems biology) and four sub-divided topics.
Topics 1 and 16 are theoretical biology–related, and topics 6 and 10 are about systems biology. They can be clustered as a broader category of system biology. The representative journals in this cluster are PLOS Computational Biology, Journal of Theoretical Biology, and Journal of Computational Neuroscience, which are focused on systems biology. Topics 0, 11, 12, 18, and 19 are about genetics and genomics. Topics 4, 9, 13, and 17 represent proteomics. Exclusive topics are topics 8 and 15, each of which is related to molecular biology and cell biology. Topic 8 includes phrases like hydrogen bonding, and GTP-binding proteins, and topic 15 contains phrases like enteroendocrine cells and COS cells. The top journals in these areas are biochemistry, journal of molecular biology, and journal of molecular modeling.
In the fourth period, the major topics are systems biology, genomics, and proteomics. Topics that are not in the main stream of bioinformatics are found in this period, and topics about theoretical biology and systems biology become a distinct cluster. This means that these areas are growing in the bioinformatics area. The representative researchers in this area are Nowak Martin A., Iwasa Yoh, Steel Mike, Dieckmann Ulf, and Paninski Liam. They are mostly involved in mathematics and theoretical biology. The journal which has the highest probabilistic distribution is the Journal of Theoretical Biology. This journal is focused on research that combines biology and topics such as statistical analysis, mathematical definition, comparative research, experiment, and computer simulation. The second raked the Journal of Bioinformatics, which mainly accepts research about genome bioinformatics and computational biology.
In this section, we analyzed the results from three different perspectives: topical keyphrase, journal, and author. In addition, to further identify which researchers and journals focus on which topic over time, the results of the ACT model (top-ranked keyphrases, authors, and journals) are examined in an integrated perspective.
Time series analysis
One interesting observation is that keyphrases related with genes or genetic processes such as ‘gene expression’, ‘down-regulation’, and ‘up-regulation’ were not ranked high in the first period. However, they emerged as top keyphrases in later periods. In particular, ‘proteome’, ‘reproducibility of results’, ‘proteomics’, and ‘genotype’ did not appear in the first period but emerged gradually after that. In addition, the term ‘algorithms’ appeared most frequently in the fourth period. In author perspective, across the four periods, the number of unique authors was 1,396. Top ranked author Robinson Richard appeared in five topics. Seven authors, including Gross Liza, appeared four times, 45 authors appeared three times, 137 authors appeared twice, and the remaining 1,184 authors were shown only in one topic. There was no author who appears in all four periods. Thirty-nine authors appeared across three periods, 125 authors appeared in two periods, and 1,210 authors appeared only in a period. In journal centered view, only 21 out of 46 journals appeared in the first period. In the second and third period, 34 and 46 journals were presented respectively. Forty-five journals appeared in the last period; one the journal ‘Briefings in Functional Genomics & Proteomics’ was not shown in the last period.
These results imply that the bioinformatics field is diversified and new topical disciplines are recently emerged. For instance, proteomics-related topics start to appear in the second period, become segmented into detail research fields and later evolved in the third and fourth periods. In addition, while conceptual biology–related topics exist in the first period, they become clearly progressed in the fourth period. Conversely, the topics about kinetics appear in the first period, but then fade out.
Integrated view of graph pattern analysis
The list of journals showed in all periods
Sum of Probability
Average of Probability
Bioinformatics (Oxford, England)
Bulletin of mathematical biology
Human molecular genetics
Journal of biotechnology
Journal of computational neuroscience
Journal of computer-aided molecular design
Journal of molecular biology
Journal of theoretical biology
Statistical methods in medical research
The EMBO journal
Trends in biochemical sciences
Trends in biotechnology
Trends in genetics
Not only the rising and falling patterns, but the convex and concave patterns of journals exist. In graph (g), BMC Genomics was the journal showing the convex pattern. BMC Genomics dealt with genome-scale analysis, functional genomics, and proteomics. The average impact factor from 2000 to 2015 was 4.0464. The topical probability of the journal rose from 1st period (0.00125) to 3rd period (1.19977), but since then the probability dropped to 0.51576. The gap between maximum and minimum value was 1.19852. The journal changed the status to open access in 2000, and maybe it caused the slight rise from 1st period to 2nd period. For the drastic increase from 2nd period to 3rd period, the number of total citations increased approximately 2 times according to the JCR. The increasing and decreasing number in total citations may cause the wave in graph (g). The top 5 ranked authors’ research interests had various characteristics. Jenkins N.A and Copeland N.G was a couple. Jenkins N.A was interested in molecular & cellular biology, while the husband Copeland N.G was in the biochemistry field. They collaborated a lot and both were celebrated researchers in the field. The other researchers had also different research areas; treatment or preventive therapy (Petsko Gregory A), genomic medicine (Nakamura Y), and genome analysis (Aravind L). Through all 4 periods, the journal mainly published articles which dealt with the genomic related keyphrases. The authors and keyphrases located in this journal were not quite related to the informatics fields. Last, the concave pattern was shown in graph (h), and Human molecular genetics was the representative journal. The journal had steadily decreasing impact factor flow in 20 years (e.g., 1st period: 9.05475 to 4th period: 6.8766). The gap between maximum and minimum probability was 0.81973. In graph (h), the drastic fallen point exists. Top ranked authors related with the journal focused on genetics. Copeland N.G and Jenkins N.A appeared again. Rest of the authors did not overlap in research areas. The journal was interested in broad genetics related topical keyphrases but not related with informatics fields.
Through the pattern analysis integrated with journal, author, and keyphrase, we identified that the bioinformatics field was a converging area, and certain journals clearly showed rising and falling patterns. Different from period analysis, the integrated view of analysis showed journals’ topical trends over time along with top journals and researchers.
Bioinformatics mainly tackles biological problems at the molecular level using applied mathematics, information science, statistics, computer science, chemistry, and biochemistry. This characteristic of bioinformatics has driven the field to become interdisciplinary, combining approaches from various fields to make use of a large amount of data.
In this study, we investigate the bioinformatics field using the ACT model to conduct comprehensive topic analyses of keyphrases, authors, and journals. To this end, we collect 46 journals belonging to the bioinformatics field by searching journal name in PubMed, yielding 170,099 papers. To analyze topic evolution over time, we divide the collected datasets into four, five-year periods: 1996–2000, 2001–2005, 2006–2010, and 2011–2015. In a time series topic analysis, we examine topic clusters within period. In more recent periods, distinct characteristics of the field emerge and more new topics are presented independently. In addition, we analyze trends in keyphrases, authors, and journals. Our keyphrase analysis similarly indicates the emergence of greater interdisciplinary research over time. In our author analysis, we observe the pattern of authors who appear in top rank. In our journal analysis, we analyze the common topic area of top journals and identify major focus areas of those top journals, including computational biology, theoretical biology, and mathematical biology. Also, we examine topic distribution over journal with top ranked authors and keyphrases. In this analysis, journals are identified as four patterns over time such as rising, falling, convex, and concave patterns.
The results of these analyses imply that the bioinformatics field is highly interdisciplinary, consisting of active convergence studies. In addition, we observed that the characteristics of the bioinformatics field become more distinct and more specific, and the supporting role of peripheral fields of bioinformatics, such as conceptual, mathematical, and systems biology, gradually increases over time, though the core fields of proteomics, genomics, and genetics are still the major topics. This is consistently confirmed by analysis of topic distribution of journals over time as well as integrated analysis of top ranked keyphrases, authors, and journals.
In the future, we plan to apply the same approach to other domains, such as information science. We also plan to use other metadata such as MeSH terms and bio-entities to compare with keyphrases. In addition, we plan to explore how to infer authors’ interests by time series analysis and identify representative authors and the journals that are the best suited for a paper on a particular subject.
This work was supported by the Bio-Synergy Research Project (NRF-2013M3A9C4078138) of the Ministry of Science, ICT and Future Planning through the National Research Foundation. This work was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. NRF-2014R1A2A2A01004454). The publication charges for this article was funded by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP).
Availability of data and materials
GEH conceived and designed the study. GEH and KYK conducted the analysis and drafted the manuscript. MS designed and coordinated the study. All authors contributed to the writing of the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
About this supplement
This article has been published as part of BMC Bioinformatics Volume 18 Supplement 7, 2017: Proceedings of the Tenth International Workshop on Data and Text Mining in Biomedical Informatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-18-supplement-7.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Dong D, Chen ML. Publication trends and co-citation mapping of translation studies between 2000 and 2015. Scientometrics. 2015;105(2):1111–28.View ArticleGoogle Scholar
- Chen H, Wan Y, Jiang S, Cheng Y. Alzheimer’s disease research in the future: bibliometric analysis of cholinesterase inhibitors from 1993 to 2012. Scientometrics. 2014;98(3):1865–77.View ArticleGoogle Scholar
- Soteriades E, Falagas M. A bibliometric analysis in the fields of preventive medicine, occupational and environmental medicine, epidemiology, and public health. BMC Public Health. 2006;6(1):1.View ArticleGoogle Scholar
- Ugolini D, Puntoni R, Perera FP, Schulte PA, Bonassi S. A bibliometric analysis of scientific production in cancer molecular epidemiology. Carcinogenesis. 2007;28(8):1774–9.View ArticlePubMedGoogle Scholar
- Wang L, Chen X, Bao A, Zhang X, Wu M, Hao Y, He J. A bibliometric analysis of research on Central Asia during 1990–2014. Scientometrics. 2015;105(2):1223–37.View ArticleGoogle Scholar
- Bornmann L, Mutz R. Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references. J Assoc Inf Sci Technol. 2015;66(11):2215–22.View ArticleGoogle Scholar
- Geaney F, Scutaru C, Kelly C, Glynn RW, Perry IJ. Type 2 Diabetes Research Yield, 1951-2012: Bibliometrics Analysis and Density-Equalizing Mapping. PLoS One. 2015;10(7):e0133009.View ArticlePubMedPubMed CentralGoogle Scholar
- Macías-Chapula CA, Mijangos-Nolasco A. Bibliometric analysis of AIDS literature in Central Africa. Scientometrics. 2002;54(2):309–17.View ArticleGoogle Scholar
- Seglen P, Aksnes D. Scientific productivity and group size: A bibliometric analysis of Norwegian microbiological research. Scientometrics. 2000;49(1):125–43.View ArticleGoogle Scholar
- Jeong DH, Song M. Time gap analysis by the topic model-based temporal technique. J Informetrics. 2014;8(3):776–90.View ArticleGoogle Scholar
- Song M, Kim SY. Detecting the knowledge structure of bioinformatics by mining full-text collections. Scientometrics. 2013;96(1):183–201.View ArticleGoogle Scholar
- Song M, Kim SY, Zhang G, Ding Y, Chambers T. Productivity and influence in bioinformatics: A bibliometric analysis using PubMed central. J Assoc Inf Sci Technol. 2014;65(2):352–71.View ArticleGoogle Scholar
- Yan E. Research dynamics, impact, and dissemination: A topic-level analysis. J Assoc Inf Sci Technol. 2015;66(11):2357–72.View ArticleGoogle Scholar
- Steyvers M, Smyth P, Rosen-Zvi M, Griffiths T. Probabilistic author-topic models for information discovery. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. New York: ACM; 2004. p. 306–315.Google Scholar
- Li D, Okamoto J, Liu H, Leischow S. A bibliometric analysis on tobacco regulation investigators. BioData Mining. 2015;8(1):1.View ArticleGoogle Scholar
- Rosen-Zvi M, Griffiths T, Steyvers M, Smyth P. The author-topic model for authors and documents. In: Proceedings of the 20th conference on Uncertainty in artificial intelligence. Banff: AUAI Press; 2004. p. 487–494.Google Scholar
- Tang J, Zhang J, Yao L, Li J, Zhang L, Su Z. Arnetminer: extraction and mining of academic social networks. In: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. New York: ACM; 2008. p. 990–998.Google Scholar
- Ramos JM, Padilla S, Masia M, Gutierrez F. A bibliometric analysis of tuberculosis research indexed in PubMed, 1997–2006. Int J Tuberc Lung Dis. 2008;12(12):1461–8.PubMedGoogle Scholar
- Claude R, Charles-Daniel A, Jean A, Jean-Francois G. Bibliometric overview of the utilization of artificial neural networks in medicine and biology. Scientometrics. 2004;59(1):117–30.View ArticleGoogle Scholar
- Patra SK, Mishra S. Bibliometric study of bioinformatics literature. Scientometrics. 2006;67(3):477–89.View ArticleGoogle Scholar
- Song M, Heo GE, Kim SY. Analyzing topic evolution in bioinformatics: investigation of dynamics of the field with conference data in DBLP. Scientometrics. 2014;101(1):397–428.View ArticleGoogle Scholar
- Lee D, Kim WC, Charidimou A, Song M. A Bird's-Eye View of Alzheimer's Disease Research: Reflecting Different Perspectives of Indexers, Authors, or Citers in Mapping the Field. J Alzheimers Dis. 2015;45(4):1207–22.PubMedGoogle Scholar
- Ding Y, Song M, Han J, Yu Q, Yan E, Lin L, Chambers T. Entitymetrics: Measuring the impact of entities. PLoS One. 2013;8(8):e71416.View ArticlePubMedPubMed CentralGoogle Scholar
- Hofmann T. Probabilistic latent semantic indexing. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. Berkeley: ACM; 1999. p. 50–57.Google Scholar
- Blei DM, Ng AY, Jordan MI. Latent dirichlet allocation. J Mach Learn Res. 2003;3:993–1022.Google Scholar
- Tang J, Zhang J, Jin R, Yang Z, Cai K, Zhang L, Su Z. Topic level expertise search over heterogeneous networks. Mach Learn. 2011;82(2):211–37.View ArticleGoogle Scholar
- Kim HJ, An J, Jeong YK, Song M. Exploring the Leading Authors and Journals in Major Topics by Citation Sentences and Topic Modeling. In: BIRNDL@ JCDL. Newark: ACL; 2016. p. 42–50.Google Scholar
- Medelyan O. Human-competitive automatic topic indexing. PhD thesis. New Zealand: Department of Computer Science, University of Waikato; 2009.Google Scholar
- Witten IH, Paynter GW, Frank E, Gutwin C, Nevill-Manning CG. KEA: Practical automatic keyphrase extraction. In: Proceedings of the fourth ACM conference on Digital libraries. Berkeley: ACM; 1999. p. 254–255.Google Scholar