Subtyping irritable bowel syndrome using cluster analysis: a systematic review

Zarei, Diana; Saghazadeh, Amene; Rezaei, Nima

doi:10.1186/s12859-023-05567-8

Research
Open access
Published: 15 December 2023

Diana Zarei^1,2,
Amene Saghazadeh^3,4 &
Nima Rezaei^3,4,5

BMC Bioinformatics volume 24, Article number: 478 (2023)

Abstract

Background

Irritable bowel syndrome (IBS) is a common chronic functional gastrointestinal disorder associated with a wide range of clinical symptoms. To characterize this heterogeneity, some researchers have used cluster analysis (CA), a group of unsupervised learning methods that identify homogeneous clusters of entities based on their similarity.

Objective and methods

This literature review aims to identify published articles that apply CA to IBS patients. We searched relevant keywords in PubMed, Embase, Web of Science, and Scopus. We reviewed studies in terms of the selected variables, participants’ characteristics, data collection, methodology, number of clusters, clusters’ profiles, and results.

Results

Among the 14 articles focused on the heterogeneity of IBS, eight of them utilized K-means Cluster Analysis (K-means CA), four employed Hierarchical Cluster Analysis, and only two studies utilized Latent Class Analysis. Seven studies focused on clinical symptoms, while four articles examined anocolorectal functions. Two studies were centered around immunological findings, and only one study explored microbial composition. The number of clusters obtained ranged from two to seven, showing variation across the studies. Males exhibited lower symptom severity and fewer psychological findings. The association between symptom severity and rectal perception suggests that altered rectal perception serves as a biological indicator of IBS. Ultra-slow waves observed in IBS patients are linked to increased activity of the anal sphincter, higher anal pressure, dystonia, and dyschezia.

Conclusion

IBS has different subgroups based on different factors. Most IBS patients have low clinical severity, good QoL, high rectal sensitivity, delayed left colon transit time, increased systemic cytokines, and changes in microbial composition, including increased Firmicutes-associated taxa and depleted Bacteroidetes-related taxa. However, the number of clusters is inconsistent across studies due to methodological heterogeneity. CA, a valuable unsupervised learning method, is sensitive to hyperparameters such as the number of clusters and the random initialization of cluster centers. Because the number of clusters is chosen by the analyst and the initialization is random, outcomes can differ even with the same algorithm. This has implications for future research and practical applications, necessitating further studies to improve our understanding of IBS and develop personalized treatments.

Introduction

Irritable bowel syndrome (IBS) is a chronic functional gastrointestinal (GI) disorder that manifests with abdominal pain, bloating, and altered bowel habits in the absence of any organic disorder or biological markers [1,2,3]. IBS predominantly affects women [4]. The global prevalence based on ROME III criteria is 9.2%, whereas, based on the ROME IV version, it is estimated at 3.8% [5]. The burden of IBS is significant: individual patients, their families, society, and the health care system are all affected [5]. Patients with IBS frequently report lower quality of life (QoL). In particular, those in the diarrhea-predominant subgroup have lower income because of absence from work, and their partners and families are also affected by the burden of the disease because these patients might avoid traveling, socializing, etc.

Diagnosing and treating patients with IBS is challenging because there is no single cause [6]. The following possible causes have been considered: mucosal inflammation, mucosal immune activation, changes in intestinal permeability, alteration in the gut microbiome, and post-infectious changes [7]. According to the latest published criteria (ROME IV), IBS has four subtypes [8]. However, almost one-third of patients may experience intermittent symptoms. This intermittency complicates subtyping; patients in the same subgroup may have different underlying mechanisms [9, 10].

To address heterogeneity in research and analysis, various approaches have been used, including subgroup analysis, stratification, regression modeling, and cluster analysis (CA) [11,12,13,14]. CA, in particular, has been valuable in identifying distinct subgroups within datasets. However, it is important to choose the appropriate clustering algorithm to ensure reliable and meaningful results. Researchers should carefully consider the best approach to address heterogeneity and enhance the interpretation of their findings.

Accordingly, several researchers have used CA, a group of unsupervised learning methods that classify entities or objects into different homogeneous groups or clusters based on their similarity [15,16,17]. Many algorithms have been introduced, but some are more frequently used [18]. CA has several benefits; for instance, it can refine diagnostic criteria to yield a more comprehensive and meaningful profile, help interpret heterogeneous outcomes, and guide treatment adjustment [19,20,21]. CA has been used in hypothesis generation, finding a topography, data exploration, and data reduction [22,23,24]. CA also has some specific uses; it can identify a group of genes with similar biological functions [25] or identify a group of patients that need targeted interventions [22, 23].

CA has several advantages over other methods. It allows researchers to uncover hidden patterns and structures in complex datasets without making assumptions about data distributions, making it a versatile technique [14]. However, it is important to note that CA is sensitive to the initial configuration and the choice of algorithm, which means different results can be obtained [26]. To address this, researchers should carefully select appropriate algorithms and validate the stability of the clusters obtained [27]. Furthermore, it is essential to understand that CA alone does not provide causal relationships or explanations, so further analysis and interpretation are required. Despite these limitations, CA remains a powerful tool for gaining insights into data structures across various fields.

However, there are challenges with using CA [28]. The sample size is calculated based on the variables included in the analysis and the number of identified clusters [29]. To achieve sufficient power, we need to have a large sample size (greater than 200) and split it into two groups: one for training and one for validation [30]. The results can be reported when the same subgroups are obtained in multiple samples of the target population [31]. This article reviews CA studies in IBS.

Methods

We conducted the present systematic review based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Additional file 1).

Search strategy

We searched PubMed, Embase, Scopus, and the Web of Science from inception until November 03, 2022 for relevant published articles in English without restricting the publication date. We used a combination of keywords related to irritable bowel and cluster analysis. Additional file 2 includes the queries used for searching each database.

Selection criteria

We included studies on patients with IBS who were over 17 years old and had no organic GI disorder. Non-English articles and animal studies were excluded.

Methods of review

Study selection was a four-step process: identification, screening, eligibility, and inclusion. First, in the identification step, we gathered all search records obtained from the databases and removed duplicates. Then, we screened the search results by title and abstract. In the third step, we assessed the full text of the potentially eligible articles, and in the final step we included those that met the inclusion criteria in our systematic review.

Data extraction

We evaluated the methods and results section of each included article. Specifically, we retrieved details on the following items: study design, participants’ characteristics, diagnostic criteria, the variables considered for clustering, data collection methods, data preprocessing techniques, clustering algorithms, validation, interpretation of the results, number of clusters, findings, limitations, and suggestions for future studies.

Study design

Studies were eligible for inclusion in the present review if their results were obtained from original research. Review articles, systematic reviews, and meta-analyses were excluded. Cohort studies, cross-sectional studies, and case-control studies were included.

Participants’ characteristics

We included studies that were conducted on adult IBS patients and evaluated both sexes.

Variables

Selecting relevant variables for discriminating clusters is very important. The variables included were related to GI symptoms, bowel habits, pain, bloating, psychological disturbances, QoL, anorectal function, colon transit time (CTT), anal pressure waves, cytokine levels, mast cell (MC) numbers, and microbial composition.

Data collection method

The methods of collecting participants’ data or tools for evaluating patients were reviewed: questionnaire, direct interview, data collection on consecutive days, a rectal examination tool, etc.

Data preprocessing methods

Because the data obtained in the studies might differ in units or scale, we checked whether each study applied standardization or data normalization before CA.
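Why standardization matters before CA can be illustrated with a minimal sketch. It assumes z-score scaling, which is only one of several normalization options and is not necessarily what any reviewed study used. The patient values and the function name `zscore_columns` are hypothetical.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Rescale each column to mean 0 and population standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    return [
        # A constant column (sd == 0) carries no information; map it to 0.
        [(value - m) / sd if sd > 0 else 0.0 for value, m, sd in zip(row, means, sds)]
        for row in rows
    ]

# Hypothetical patients: (pain score on a 0-10 scale, colon transit time in hours).
patients = [(2.0, 20.0), (4.0, 35.0), (6.0, 50.0), (8.0, 65.0)]
scaled = zscore_columns(patients)

for raw, z in zip(patients, scaled):
    print(raw, [round(v, 3) for v in z])
```

Without rescaling, the transit-time column would dominate any Euclidean distance simply because its numeric range is larger; after rescaling, both variables contribute on a comparable scale.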

Cluster analysis

CA is a group of machine learning algorithms that classify data into homogeneous groups that are as dissimilar as possible from one another [32]. There are different types of clustering algorithms (Fig. 1). K-means CA and hierarchical cluster analysis (HCA) are the most frequently used [33]. K-means CA is often preferred due to its good measurement capability. One feature of this algorithm is that the number of clusters, K, must be specified before analysis [34]. There are different methods for choosing the optimal number of clusters in K-means CA, for example, the Bayesian information criterion (BIC), the Akaike information criterion (AIC), and the elbow method. The distance metric is another important feature of K-means CA, which uses Euclidean distance.
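The role of K, the Euclidean metric, and random initialization can be sketched in a few lines. The following is a minimal pure-Python version of the standard Lloyd iteration, not the implementation used by any reviewed study. The toy data, the helper names (`sq_dist`, `kmeans`, `best_of_restarts`), and the choice of ten restarts are all assumptions made for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm; returns (labels, centers, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initialization from the data
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    wcss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, wcss

def best_of_restarts(points, k, n_init=10):
    """Run several random initializations and keep the lowest-WCSS solution."""
    return min((kmeans(points, k, seed=s) for s in range(n_init)), key=lambda run: run[2])

# Hypothetical standardized data with two visibly separated groups.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.3)]

# Elbow method: inspect how the within-cluster sum of squares falls as K grows.
wcss_by_k = {k: best_of_restarts(points, k)[2] for k in range(1, 5)}
labels, centers, _ = best_of_restarts(points, 2)
print(wcss_by_k)
print(labels)
```

On data like this, the within-cluster sum of squares drops sharply from K = 1 to K = 2 and only marginally afterwards, which is the "elbow" used to choose K. Repeating the run from several seeds is one common way to reduce the dependence on random initialization.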

HCA converts a distance matrix of all items’ similarity measurements into a hierarchy of nested groups. In this method, two different approaches are used: agglomerative and divisive [34]. HCA aims to group similar objects together based on their attributes and characteristics. In the agglomerative approach, each object begins as a separate cluster and is progressively merged with others to create larger clusters. This process continues until all subjects are consolidated into a single cluster or until a predetermined stopping condition is satisfied [14].

Latent class analysis (LCA) is another popular method and is a kind of finite mixture model (FMM). In this method, hidden clusters are uncovered by some predetermined multifactorial feature [35]. LCA estimates the probability of belonging to each latent class for each individual, allowing researchers to understand the heterogeneity within a population. By uncovering these latent classes, LCA provides insights into the structure and patterns of categorical data [36].

Principal component analysis (PCA) is a method that reduces the dimensionality of data before analysis [37], increasing the interpretability of the results while minimizing information loss. PCA does this by creating new uncorrelated variables [38].
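The agglomerative procedure described for HCA can be sketched as follows. This is a minimal illustration that assumes average linkage and Euclidean distance; the reviewed studies may have used other linkage rules. The data and the function names `average_linkage` and `agglomerative` are hypothetical.

```python
from itertools import combinations
from math import dist

def average_linkage(cluster_a, cluster_b, points):
    """Mean pairwise Euclidean distance between members of two clusters."""
    total = sum(dist(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def agglomerative(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every object starts alone
    merges = []  # record of the nested hierarchy, in merge order
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]], points),
        )
        merges.append((sorted(clusters[a]), sorted(clusters[b])))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters, merges

# Hypothetical standardized data with two visibly separated groups.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.3)]

two_clusters, _ = agglomerative(points, 2)
one_cluster, full_history = agglomerative(points, 1)
print(sorted(sorted(c) for c in two_clusters))
print(len(full_history))
```

Stopping at `n_clusters=1` reproduces the full hierarchy, whose merge history corresponds to the dendrogram; stopping earlier plays the role of the predetermined stopping condition.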

Cluster validation

One of the most critical steps in CA is the evaluation of the clusters obtained from the analysis. There are some methods for this assessment, such as Silhouette and Davies-Bouldin indexes [28, 39].
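As an illustration of how such an index works, the sketch below computes the mean Silhouette coefficient from its usual definition: for each point, a is the mean distance to its own cluster, b is the mean distance to the nearest other cluster, and the score is (b - a) / max(a, b). Euclidean distance, the toy data, and the function name `silhouette_score` are assumptions for this example.

```python
from math import dist

def silhouette_score(points, labels):
    """Mean Silhouette coefficient; requires at least two clusters."""
    cluster_ids = set(labels)
    if len(cluster_ids) < 2:
        raise ValueError("silhouette needs at least two clusters")
    n = len(points)
    scores = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # usual convention for a singleton cluster
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in range(n) if labels[j] == other)
            / labels.count(other)
            for other in cluster_ids
            if other != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / n

# Hypothetical standardized data with two visibly separated groups.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.3)]

good = silhouette_score(points, [0, 0, 0, 1, 1, 1])  # matches the visible groups
poor = silhouette_score(points, [0, 1, 0, 1, 0, 1])  # mixes the two groups
print(round(good, 3), round(poor, 3))
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 or below suggest that the partition does not reflect the structure of the data.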

Interpretation of the results

The main goal in conducting CA studies is to obtain subgroups and relevant individual characteristics. CA alone is insufficient for determining the characteristics of clusters and assessing the relationship between different variables. Therefore, after the clustering results are obtained, other methods are applied to interpret them, for instance, Bayesian inference.

Results

As illustrated in Fig. 2, the database search retrieved 413 records, of which 166 were duplicates. We screened the remaining 247 unique records by title and abstract, of which 25 appeared potentially eligible. During full-text review, we excluded 11 articles because they did not assess outcomes of interest [40,41,42,43,44], did not use CA [45,46,47], did not include IBS patients [48, 49], or had no available full text [50]. Finally, 14 eligible articles were included in this review. The included articles were published between 1995 and 2021.
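The selection counts above can be cross-checked with simple arithmetic. In this short sketch, the per-reason exclusion counts are inferred from the number of references cited for each reason, which is an assumption about how the citations map to excluded articles.

```python
records_retrieved = 413
duplicates = 166
screened = records_retrieved - duplicates

full_text_assessed = 25
# Counts inferred from the citation ranges given for each exclusion reason.
exclusions = {
    "outcomes of interest not assessed": 5,  # refs 40-44
    "CA not used": 3,                        # refs 45-47
    "no IBS patients": 2,                    # refs 48, 49
    "full text unavailable": 1,              # ref 50
}
excluded = sum(exclusions.values())
included = full_text_assessed - excluded

print(screened, excluded, included)
```

Under this reading, the reported numbers are internally consistent: 247 records screened, 11 excluded at full-text review, and 14 included.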