Our pipeline performs a grid search of the parameter space by evaluating all combinations of input parameters. It proceeds in five stages:
1. Compute the Gower distance matrix [13] for the input dataset;
2. Enumerate all combinations of input parameters;
3. For each set of input parameters, (i) compute the Mapper graph; and (ii) identify statistically significant representative topological features (i.e. clusters);
4. Among all Mapper graphs, rank clusters in terms of their impurity [14, p. 309] with respect to a chosen outcome or variable of interest; and
5. Visualise and summarise the top five clusters.
In an example application with a sample of 430 patients, we aimed to identify subgroups that were similar in terms of baseline clinical and genetic characteristics (140 variables) as well as outcome (remission following 12 weeks of treatment). Input data were a mixture of categorical and continuous variables.
Step 1
We first construct a distance matrix for the clinical and genetic predictors in the input dataset. To accommodate a mix of continuous and categorical variables we use the Gower distance [13], implemented in the gower package for Python [15]. This computes distances between pairs of observations using an appropriate measure for each variable (the Manhattan distance for continuous variables; the Sørensen-Dice coefficient for categorical variables) and then averages these across all variables into a single distance ranging from 0 to 1. Importantly, outcomes are not used to derive the Gower matrix.
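As an illustration, a minimal sketch of this step using the gower package follows; the toy variables below are hypothetical stand-ins for the 140 predictors of the example application.

```python
# Minimal sketch of Step 1 with the gower package. The toy columns are
# hypothetical; object-dtype columns are treated as categorical (Dice
# measure) and numeric columns with range-normalised Manhattan distance.
import pandas as pd
import gower

predictors = pd.DataFrame({
    "age": [34, 51, 29, 62],                # continuous
    "genotype": ["AA", "AG", "GG", "AG"],   # categorical
    "severity_score": [18.0, 24.5, 12.0, 30.0],
})

# Per-variable distances are averaged into one distance in [0, 1];
# outcomes are deliberately excluded from this matrix.
distance_matrix = gower.gower_matrix(predictors)
print(distance_matrix.shape)  # (4, 4)
```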
Step 2
We then define sets of input parameters for the Mapper algorithm. While some parameters can be derived automatically [16], several must be specified by the user, including: (i) the choice of filter(s); (ii) gain; (iii) resolution; and (iv) clustering algorithm. 'Gain' and 'resolution' control how the range of the filter function is divided into intervals (see Additional file 1: Fig. S1): the 'gain' refers to the overlap between consecutive intervals, whereas the 'resolution' refers to the diameter of the intervals. By choosing the number of intervals and the percentage overlap between them, the user can adjust the level of detail at which to view their data. For a single filter, resolution can be derived automatically, but it must be specified when combining multiple filters. We enumerate all combinations of parameters and store these as inputs for subsequent steps (i.e. a grid search). Since optimal parameters depend on the input dataset, we recommend exploring a range of values. Our example application considered combinations of the following (a sketch of the resulting grid appears after this list):
i. Five filters, comprising two 'data filters' based on continuous predictor variables; two 'computed filters' based on the first two components from Principal Components Analysis (PCA); and combinations of data and computed filters;
ii. Four values for gain (0.1, 0.2, 0.3, 0.4);
iii. Six values for resolution (1, 3, 5, 10, 30, 50);
iv. Two clustering algorithms (density-based spatial clustering of applications with noise, DBSCAN; and agglomerative clustering).
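A minimal sketch of the grid enumeration follows; the filter labels are hypothetical stand-ins for the filters described above.

```python
# Minimal sketch of the Step 2 grid search setup.
from itertools import product

filters = ["data_1", "data_2", "pca_1", "pca_2", "data_plus_pca"]
gains = [0.1, 0.2, 0.3, 0.4]
resolutions = [1, 3, 5, 10, 30, 50]
clusterers = ["dbscan", "agglomerative"]

# Every combination of input parameters becomes one Mapper run.
parameter_grid = list(product(filters, gains, resolutions, clusterers))
print(len(parameter_grid))  # 5 * 4 * 6 * 2 = 240 combinations
```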
In our application the 'data filters' were chosen on theoretical grounds: we considered as filters variables known to be important for the outcomes in question. An alternative approach would be to consider all continuous variables in the input dataset as candidate filters. Following the steps described below, the pipeline would then identify the 'optimal' clusters having considered all candidate filters. This approach would be computationally intensive, since the search grid would expand substantially. However, by allowing all filters to be considered and ranked (based on cluster homogeneity in terms of the outcome variable, as described below), this process would provide an effective form of feature selection; the ranked list of filters would indicate their importance.
Step 3
For each set of input parameters, we (i) compute the Mapper graph; (ii) identify representative topological features; and (iii) evaluate the statistical significance of each representative feature using the bootstrap. This uses resampling to assess whether a given topological feature is robust to small variations in the dataset [16].
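A minimal sketch of one Mapper run follows, using KeplerMapper as one possible implementation (illustrative only; the bootstrap significance test is not shown, and the random data stand in for the Gower matrix and filter values of the real pipeline).

```python
# Minimal sketch of Step 3 for a single parameter combination.
import numpy as np
import kmapper as km
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
raw = rng.random((430, 430))
distance_matrix = (raw + raw.T) / 2     # symmetric placeholder distances
np.fill_diagonal(distance_matrix, 0.0)  # zero self-distances
lens = rng.random((430, 1))             # placeholder filter values

mapper = km.KeplerMapper()
graph = mapper.map(
    lens,
    X=distance_matrix,
    precomputed=True,                              # X holds distances
    cover=km.Cover(n_cubes=10, perc_overlap=0.3),  # resolution, gain
    clusterer=DBSCAN(metric="precomputed", eps=0.4, min_samples=5),
)
# graph["nodes"] maps node ids to member indices; connected components
# of the graph are the candidate topological features.
print(len(graph["nodes"]))
```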
Step 4
From the list of candidate topological features, we rank clusters by how well they separate the chosen outcome of interest (i.e. homogeneity within the cluster). We first exclude non-significant features and those that are very small or very large (\(<5\) or \(>95\) percent of the sample). We then calculate homogeneity for each feature with respect to the chosen outcome, as well as the percentage improvement in homogeneity compared to the overall homogeneity of the sample. For binary outcomes, homogeneity is assessed using the Gini impurity [14, p. 309], defined as \(1 - p^2 - (1 - p)^2\), where \(p\) is the proportion of individuals in the feature experiencing the outcome of interest. Lower values indicate greater homogeneity, down to a minimum of 0, at which point all individuals in the cluster fall into a single outcome category. For continuous outcomes, homogeneity could be measured using the standard deviation. Homogeneity is calculated for each candidate feature separately, as well as for the overall sample. Finally, we sort all features by their percentage improvement in homogeneity.
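A minimal sketch of this ranking criterion follows, assuming a binary outcome coded 0/1; the example numbers are illustrative.

```python
# Minimal sketch of the Step 4 homogeneity calculation and ranking score.
import numpy as np

def gini_impurity(outcomes: np.ndarray) -> float:
    """Gini impurity 1 - p^2 - (1 - p)^2 for a binary 0/1 outcome vector."""
    p = outcomes.mean()
    return 1.0 - p**2 - (1.0 - p)**2

def improvement(feature_outcomes: np.ndarray, all_outcomes: np.ndarray) -> float:
    """Percentage improvement in homogeneity relative to the full sample."""
    overall = gini_impurity(all_outcomes)
    return 100.0 * (overall - gini_impurity(feature_outcomes)) / overall

# Example: a candidate cluster whose members mostly remit.
all_outcomes = np.array([0, 1] * 215)            # 430 patients, p = 0.5
cluster_outcomes = np.array([1] * 18 + [0] * 2)  # p = 0.9 within cluster
print(round(improvement(cluster_outcomes, all_outcomes), 1))  # 64.0
```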
Step 5
We select the top five features and summarise each by:
a. Describing differences in each predictor between members and non-members of the chosen feature, including p-values to indicate statistical significance;
b. Predicting membership of the feature using gradient boosted trees (XGBoost); and
c. Visualising the Mapper graph and highlighting the chosen topological feature.
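A minimal sketch of item (b) follows, assuming a 0/1 membership vector for the chosen feature; the random data are placeholders for the baseline predictor matrix.

```python
# Minimal sketch of Step 5(b): predicting feature membership from the
# baseline predictors with gradient boosted trees.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((430, 140))             # placeholder predictor matrix
member = rng.integers(0, 2, size=430)  # placeholder membership labels

X_train, X_test, y_train, y_test = train_test_split(
    X, member, test_size=0.2, random_state=0
)
model = xgb.XGBClassifier(n_estimators=100, max_depth=3)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
# Feature importances indicate which predictors characterise the cluster.
```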