 Research
 Open access
 Published:
A cell abundance analysis based on efficient PAM clustering for a better understanding of the dynamics of endometrial remodelling
BMC Bioinformatics volume 24, Article number: 440 (2023)
Abstract
Background
Singlecell RNA sequencing (scRNAseq) is a powerful tool for investigating cell abundance changes during tissue regeneration and remodeling processes. Differential cell abundance supports the initial clustering of all cells; then, the number of cells per cluster and sample are evaluated, and the dependence of these counts concerning the phenotypic covariates of the samples is studied. Analysis heavily depends on the clustering method. Partitioning Around Medoids (PAM or kmedoids) represents a wellestablished clustering procedure that leverages the downstream interpretation of clusters by pinpointing real individuals in the dataset as cluster centers (medoids) without reducing dimensions. Of note, PAM suffers from high computational costs and memory requirements.
Results
This paper proposes a method for differential abundance analysis using PAM as a clustering method and negative binomial regression as a statistical model to relate covariates to cluster/cell counts. We used this approach to study the differential cell abundance of human endometrial cell types throughout the natural secretory phase of the menstrual cycle. We developed a new R package scellpam, that incorporates an efficient parallel C++ implementation of PAM, and applied this package in this study. We compared the PAMBS clustering method with other methods and evaluated both the computational aspects of its implementation and the quality of the classifications obtained using distinct published datasets with known subpopulations that demonstrate promising results.
Conclusions
The implementation of PAMBS, included in the scellpam package, exhibits robust performance in terms of speed and memory usage compared to other related methods. PAM allowed quick and robust clustering of sets of cells with a size ranging from 70,000 to 300,000 cells. https://cran.rproject.org/web/packages/scellpam/index.html. Finally, our approach provides important new insights into the transient subpopulations associated with the fertile time frame when applied to the study of changes in the human endometrium during the secretory phase of the menstrual cycle.
Background
Singlecell transcriptomic maps have provided unprecedented insights into the identity of the cellular components of a given tissue. Each cell is described using a high dimensional vector, providing the number of short reads aligned over unitary regions (genes) of a reference genome. As a result, we obtain a highdimensional transcriptomic description of each cell. Each biological condition (described with phenotypic covariates) provides one or several samples containing thousands of cells, and generally, we consider a selected set of phenotypic variables describing the distribution of samples. A central biological question involves detecting changes in cell populations with regard to their phenotypic description.
Differential abundance analyses comprise three steps: Step one employs a cluster analysis for all cells without considering their biological condition (phenotypic variation) to define (possible) cell types/states; Step two calculates the number of cells per sample and cluster, which yields a count per cluster for each sample, with each sample having associated phenotypic covariates; Step three consists of studying how these counts depend on said phenotypic covariates, revealing the population changes in the area of interest.
The classification of sequenced cells into functional groups (clusters) determines the populations and states constituting the tissue under study [1]. Cell population abundances can become altered under pathological conditions [2, 3]. Evaluating alterations in population abundances under varying conditions using singlecell RNA sequencing (scRNAseq) data requires thoughtful consideration of every step of the analytical workflow: quality control filters, clustering, and statistical hypothesis testing of differential abundance [4, 5].
This paper proposes the use of partitioning around medoids (PAM) [6] as the clustering procedure for the first step and a negative binomial regression for the third step [7, 8].
A range of clustering methods have been proposed and used for singlecell analysis. Duó et al. [9] systematically evaluated the performance of 14 clustering algorithms implemented in R [10]. The authors “found substantial differences in the performance, run time and stability between the methods with SC3 and Seurat showing the most favorable results”. SingleCell Consensus Clustering (SC3) is a tool for unsupervised clustering of scRNAseq data [11], and Seurat [12,13,14,15] is a widely used toolkit for singlecell genomics.
As Luecken et al. [16] highlight in their tutorial on best practices in scRNAseq analysis, the default clustering algorithm implemented both in Seurat and Scanpy (SingleCell Analysis in Python) [17] is the Louvain community detection algorithm [18], a hierarchical clustering method initially developed to extract the community structure of large networks. In addition, the authors recommend the use of this algorithm. In contrast, Traag et al. [19] reported that the Louvain algorithm may yield arbitrary and poorly connected communities; instead, they introduced the Leiden algorithm to overcome associated issues. We emphasize that no single method exists that works optimally in all cases; clustering success and method choice should be (at least partially) evaluated through the biological meaning of the cell groups obtained.
As proposed in [6] chapter 2, PAM represents a robust example of an intuitive clustering method that minimizes a welldefined reasonable objective function: the sum of distances to the nearest cluster center.
The original algorithm was later improved with faster versions (FASTPAM1 with and without eager swapping [20]), which has been evaluated using several benchmark problems and displays robustness and interpretability. Section 2.2 details the clustering algorithm itself. Nevertheless, the direct application of PAM to largescale problems such as singlecell analysis had been prevented by implementation problems, mainly due to memory requirements and long execution times. The most commonly used PAM implementation, the R package cluster [21], cannot manage more than 65,536 observations/cells, and the distances are saved using double precision; therefore, any R package based on the cluster package inherits these constraints (as is the case of RaceID2 [22]).
Our implementation of the PAMBS method (named after PAM with BUILD + SWAP) in the R package scellpam [23], allows the choice of the distance data type and yields a memory reduction of 50% using float precision vs. double. Further details of this implementation and its possibilities are included in Sect. 2.2.2 (description of the original algorithms) and Sect. 3.1.1 (computational evaluation of our implementation).
Although we primarily aimed to provide a reliable software implementation of PAM that can manage a large number of single cells, we have an equal interest in analyzing those changes in the human endometrium that occur during the menstrual cycle using PAM as the clustering method.
During a natural human menstrual cycle, the endometrium undergoes shedding, complete regeneration, and remodeling. The relevance of this cycle to human reproduction has driven many studies since its first description, from classical histology [24] to wholetissue transcriptomic profiling [25, 26]. Current knowledge at the singlecell level [27] has provided insight into the temporal gene expression changes in the epithelial and stromal cell compartments; however, how transcriptomic changes become reflected in the shape of detected cell populations remains incompletely explored.
Materials and methods
Menstrual cycle data
The dataset analysed later in the biological application is described bellow. scRNAseq 10X data was collected from a previous study on the natural menstrual cycle [27] (available in the GEO repository under accession number GSE111976). Fresh endometrial biopsies from healthy individuals were used in this previous study. Isolation and sequencing protocols are detailed in a previous publication.
Raw expression counts per cell formed the input material to Seurat package (v.3.1.2) and downstream scripts in R. Briefly, count matrices were evaluated by quality control metrics that included feature counts (\(\ge 2500\) counts per cell), the number of genes detected (\(\ge 750\) genes per cell), and the percentage of expression of mitochondrial genes (\(\le 25\%\) mitochondrial ratio per cell), and then submitted to dimensional reduction, default clustering, and identification of cell populations. Multiple sets of highly variable genes (HVGs) were obtained using FindVariableFeatures with the vst method containing: 100, 500, 1000, 2000, 3000, 4000, and 5000 HVGs. Each cluster differentially expressed genes were assessed (one vs. rest) using the FindMarkers function (which internally uses the Wilcoxon test), the pvalues were adjusted using the BenjaminiHochberg method, and the false discovery rate (FDR) used was 0.05.
Potential poorquality clusters with no uniquely expressed genes, doublets (detected by DoubletFinder 2.0.2), or enriched in cells with extreme values in the three quality metrics cited above were evaluated. Finally, no lowquality cluster was found, and 71,032 cells in total were retained for downstream analysis.
Methods
Outline
Figure 1 provides a detailed outline of our approach for analyzing RNAseq data.

1.
The process starts with the raw count matrix, which provides the short read aligned counts for different cells and samples.

2.
The raw count matrix was normalized using two different procedures: rawn and log1n. The rawn method normalizes the expression of each gene in each cell, by dividing it by the total cell count. The log1n method is analogous, but with the logarithm of the counts plus one.

3.
The highly variable genes (HVGs) in the dataset were identified using the vst method implemented in the FindVariableFeatures function of the R package Seurat [15], and this list is used by scellpam to filter out HVGs.

4.
The dissimilarity matrix of the cells is calculated by using the scellpam package [23], where the Euclidean and the Manhattan metrics, alongside one minus the modulus of the Pearson correlation coefficient have been implemented. A dissimilarity matrix for each metric is also calculated.

5.
The whole set of cells is classified into k groups using PAMBS implemented in scellpam.

6.
The number of cells for each cluster and sample is calculated.

7.
Finally, the likelihood and quasilikelihood approaches proposed in [7, 8] respectively are used to evaluate the differential abundance among the obtained clusters for the different biological conditions or samples.
Additional file 1 contains the entire workflow in R code and applied functions from the new scellpam package.
Algorithms
We first review the algorithmic principles of PAM to understand its suitability in this context.
Let X be a set of n points (in this case representing cells) in a pdimensional space (representing gene expression), and let d be a metric or dissimilarity between them. Let k be the number of groups being considered. The method obtains an optimal set \(M\subset X\) consisting of k points called medoids \(M=\{x_{m_1},..,x_{m_k}\}\) taken from X which minimizes to
where the sum extends to all points in X and \(x_{m_i}\) is the element of M such that \(m_i = arg \min _{k\in M} d(x_i,x_{m_k})\) i.e. the medoid closest to each \(x_i\). This induces clustering: each cluster comprises the points closer to medoid k than any other point. This subsequently implies that cluster representatives (the medoids) always represent members of the initial set, different from other algorithms such as kmeans.
The algorithm for obtaining the set M entails two stages: selecting an initial set of medoids and swapping pairs of points between the set M and the rest of the set X until no further reduction of TD is found.
We refer the reader to [28] for additional information on the available options for these stages, including the BUILD and LAB alternatives for the first stage.
The algorithm commonly selected for the initial stage is BUILD due to the superior quality of results (a lower initial TD value compared to alternatives like LAB in all our datasets); however, BUILD is more computationally intensive.
For the second stage, the algorithm called FATSPAM1 in [20] represents the best option balancing speed and quality of results.
FASTPAM1 is the fastest known algorithm that proceeds deterministically (swapping is always carried out by choosing the option that reduces TD to the most significant degree at each iteration). It is the one implemented in scellpam.
The pseudocodes for BUILD and FASTPAM are shown as Algorithms 1 and 2, respectively. Both have been taken from [20].
Our implementation of PAM, programmed in C++, involves several improvements concerning previous packages:

Parallel computing is transparently used (i.e.: without user intervention) in many parts. The number of cores used can be selected automatically, and work division is carried out automatically and internally by all functions.

Our implementation allows the use of any storage data type (signed/unsigned int, float, double) for the original data X and either float or double for the distance matrix d (i.e., float can be used instead of double, which requires half the memory if the precision is sufficient for the user’s purposes).

The distance matrix calculation, a step prior to PAM, is carried out in parallel. The parallelization has been carefully designed, i.e., the number of distance pairs calculated by each thread is the same, so that all threads finish simultaneously and the time spent by one thread is the total execution time. (See [29] for more details).

As in other packages (Seurat, SC3), initial data X can be loaded as a sparse matrix, which significantly reduces the memory requirements. Moreover, this fact is also considered in the distance calculation: components whose value is equal to zero in both vectors are bypassed and components with value equal to zero in exactly one of the vectors receive special treatment, which contributes to the time reduction.

The choice of the initialization phase for PAM can be BUILD or LAB, and a parallel implementation compensates for the higher computational cost of BUILD. The internal loop in lines 13–25 of the Algorithm 1 is done in parallel.

Similarly, the SWAP phase uses the fastest variant (FASTPAM1), and is also implemented in parallel (loop in lines 12–21 of the Algorithm 2).

A function for calculating silhouettes [30] is also implemented in parallel.
For each cell i, the silhouette index s(i) is obtained as follows,
where \(\mid C_I \mid\) is the number of points in cluster I and
and
d(i, j) representing the dissimilarity between points i and j.
The silhouette index s(i) belongs to \([1, 1]\). It is close to +1 when the point is wellcentered in its own cluster. It is nearly 0 for points in the border between two clusters, and approaches 1 when the point would be better classified in its neighboring cluster. The average of the s(i)’s over all observations (cells) measures how well the data have been clustered or the cluster structure.
Results
We present two types of results in this section: one focuses on the performance of the implementation of PAMBS in scellpam, and the other concerns the study of changes in the endometrium. More precisely, Sect. 3.1, which evaluates the performance of PAMBS, is divided into two subsections: Sect. 3.1.1 is devoted to evaluating purely computational aspects of the implementation of PAMBS in scellpam. In this case, there is no “true” partition to compare against.
After the computational study (in terms of time and memory usage depending on the number of cells and genes), Sect. 3.1.2 is dedicated to the comparison between the classifications obtained by other clustering methods and those obtained by PAMBS when “true” partitions are known, which is based on study by Duó et al. [9]. Here, we retrieve the results obtained by the authors in their comparison of different clustering methods on several datasets for which the true partition is assumed to be known. Then, we implement our method on these same datasets. Section 3.2 is the most interesting one, from a biological perspective, as it presents the results of our study of cell population changes that occur in the human endometrium during the secretory phase of the menstrual cycle.
Two studies on PAMBS performance
Computational comparisons
We compare the results obtained by applying the implementation of PAMBS in scellpam with those obtained by the standard implementation of PAM in the cluster R package [21].
The first factor to validate is the use of the float data type. Obviously, using float reduces the memory by half, but it must be corroborated that results are equivalent. To validate our implementation, we took several samples of 60,000 cells from a total of 71,032 cells (due to the limitation of the package cluster to 65,536 observations). The use of double type requires a memory of 13.76 GiB to store the distance matrix; our package using float type requires 6.70 GiB. Clustering results were the same in all cases (same set of medoids M), which supports the correctness of our implementation (and also that of the cluster package) in algorithmic terms, indicating the probable absence of programming errors, since both implementations were created by different teams and do not share any code. Regarding execution times, we observed an almost linear time reduction in the distance matrix calculation with the number of distances computed (not the case for the cluster package). The structure of the PAM algorithm indicates that a time reduction strictly linear with the number of cells cannot be expected, but it was substantial. We evaluated the time performance and memory usage of PAMBS using different datasets with increasing cell number sizes:

1.
scRNAseq on the human endometrium throughout the natural menstrual cycle; \(N=71,032\) cells. NCBI GEO accession number GSE111976 [27], which we denoted as Wang.

2.
scRNAseq of superficial endometrial biopsies; \(N=100,307\) cells. ArrayExpress accession number EMTAB10287 [31], which we denoted as Garcia.

3.
scRNAseq of the endometriotic endothelium; \(N=118,144\) cells. NCBI GEO accession number GSE213216 [32], which we denoted as Fonseca.

4.
The merging of the previous three datasets; \(N=289,483\) cells, which we denoted as Merge.
We also compared PAMBS to other broadly used clustering methods: Scanpy (1.9.3) [17], SC3 (1.28.3) [11], and Seurat (3.1.2) [13]. RaceID2 [22] could not be applied to any dataset due to memory exhaustion, even with the smallest dataset (Wang).
We made comparisons using the same computer (AMDRyzen Threadripper 3990X processor at 2.2GHz) with 128 GiB RAM exclusively devoted to this task. As the distance matrix calculation is only required as a first step by PAM, our package cannot be compared in this respect with Scanpy, Seurat or SC3. To make a meaningful comparison, we used the daisy function of the cluster package on a sample of 35, 516 cells. Calculating the distance matrix with daisy took approximately 6.25 days. The scellpam package took 18 h using the serial version (one thread) and 27 min with parallel implementation (with 128 threads) using double as data type.
Table 1 contains the results of the distance matrix calculation for Pearson dissimilarity with scellpam using 128 threads in the above datasets in terms of time spent (in s) and memory used (in MiB).
Concerning the complete execution, application of PAMBS to the aforementioned subset of 35, 516 cells takes approximately the same amount of time to package cluster (290 s) as our package scellpam (272 s) when using a single thread; however our parallel implementation reduced this time to 14 s.
The entire Wang set (71, 032 cells) cannot be managed by cluster; however, scellpam used 33 min to calculate the distance matrix and 3 min and 17 s for PAMBS to run (both in parallel). Additional file 2 contains a table describing all results. This results demonstrate that, even if further reductions can still be achieved, integration with R and the overall ease of use favor the feasibility of our current approach.
Table 2 displays the results obtained by applying PAMBS, using 64 threads for the aforementioned datasets with 30 medoids. and presents spent time (in s), used memory (in MiB), and the number of medoid swap iterations.
Of note, the number of SWAP iterations in the second phase of PAMBS mainly determines the total time; the number of genes has no relevance here as this number has been subsumed when calculating the distance/dissimilarity matrix.
Other trials carried out with a number of medoids between 25 and 45 provided comparable results: total time for PAM in the Wang dataset varied from 159 to 252 s, increasing with the number of medoids. Additional file 2 contains a table with the details of this experiment.
Of note, while the algorithm that computes the silhouettes does not involve a significant time cost, it gains an advantage from parallelization; for example, the execution time associated with the Wang dataset becomes reduced from approximately 18 s in serial to 7 s using 64 threads.
Finally, we applied the Scanpy, SC3 and Seurat methods to the same datasets. Table 3 reports the results regarding execution time (in s) and used memory (in Mib).
SC3 was unsuccessful with 100 and 500 HVGs as the dimensionality reduction phase produced singular matrices; moreover, both Seurat and SC3 failed when applied to the most extensive cell set containing all genes due to insufficient memory.
Nevertheless, it is fair to remark that these methods were designed for application with smaller sets of relevant genes, rather than in a highly dimensional space; however, it is also true that PAMBS is not subject to this limitation as dimensions are subsumed after distance calculation.
This factor provides an advantage, but it also makes PAM slower than Scanpy. Additionally, PAMBS is also slower than Seurat when not all genes are used. This increase in comparative speed derives from the fact that these packages use hierarchical clustering algorithms (the Louvain community detection algorithm, [18]). In contrast, PAM represents a clustering algorithm that considers all distances between cells in all steps, which allows the cluster reassignment of any cell at any step.
Comparison of classifications
As noted above, this section uses a part of the study conducted by Duó et al. [9] The partitions obtained by the authors can be retrieved from their DuoClustering2018 package [33]. This allowed the comparison of our own results with those of the other methods. While we provide information regarding the methodologies, the datasets analyzed and their conclusions for clarity we refer the reader to [9] for a more detailed explanation.
The authors considered and evaluated 14 clustering algorithms using 12 datasets: 9 real datasets and 3 simulated datasets. The true partition of each dataset is known. Three methods to reduce the number of genes provided as input to the clustering methods were used for each dataset. This provides \(12 \times 3 = 36\) possibilities to evaluate the algorithms.
Table 1 in [9] provides an overview of the datasets used in the study (e.g. sequencing protocol, number of cells, number of features...), while Table 2 describes each method. We include the information shown in Table 2 to improve the readability of this paper. The clustering methods compared are:

ascend (v0.5.0): Principal Component Analysis (PCA) dimension reduction (dim=30) and iterative hierarchical clustering.

CIDR (v0.1.5): PCA dimension reduction based on zeroimputed similarities followed by hierarchical clustering.

FlowSOM (v1.12.0): PCA dimension reduction (dim=30) followed by selforganizing maps \((5 \times 5, 8 \times 8\) or \(15 \times 15\) grid, depending on the number of cells in the dataset) and hierarchical consensus metaclustering to merge clusters.

monocle (v2.8.0): tSNE (tdistributed stochastic neighbor embedding) dimension reduction (initial PCA dim=50, tSNE dim=3) followed by densitybased clustering.

PCAH: PCA dimension reduction (dim=30) and hierarchical clustering with Ward.D2 linkage.

PCAKmeans: PCA dimension reduction (dim=30) and kmeans clustering with 25 random starts

pcaReduce (v1.0): PCA dimension reduction (dim=30) and kmeans clustering through an iterative process; stepwise merging of clusters by joint probabilities and reducing the number of dimensions by PCA with the lowest variance; repeated 100 times following consensus clustering using the clue package ([34, 35]).

RACEID2 (version: March 3, 2017): kmedoids clustering based on Pearson correlation dissimilarities.

RtsneKmeans: tSNE dimension reduction ((initial PCA dim=50, tSNE dim=3, perplexity = 30) and kmeans clustering with 25 random starts.

SAFE (v2.1.0): Ensemble clustering using SC3, CDIR, Seurat and tSNE + kmeans.

SC3 (v1.8.0): PCA dimension reduction or Laplacian graph. kmeans clustering on different dimensions; hierarchical clustering on consensus matrix obtained by kmeans.

SC3svm (v1.8.0): Using SC3 to derive the clusters for half of the cells, then using a support vector machine (SVM) to classify the remaining cells.

Seurat (v2.3.1): Dimension reduction by PCA (dim=30) followed by nearest neighbor graph clustering.

TSCAN (v1.18.0): PCA dimension reduction followed by modelbased clustering.
The hyperparameter values for all clustering algorithms and datasets can be accessed using the function duo_clustering_all_parameter_settings_v2() in the package [33]. For instance, the Seurat range resolutions for the KumarTCC dataset filtered by HVG10 ranged from 0.3 to 1.5 in increments of 0.1. Additional file 3 provides additional details.
The authors of [9] employed nine publicly available scRNAseq datasets and three simulated datasets with varying degrees of separation to evaluate the methods. Table 4 (a reduced version of Table 1 in [9]) notes the dataset names and the number of cells, features, and true populations to be recovered.
The names of the three gene filtering methods are Expr10 (10% genes with the highest average expression), HVG10 (10% most highly variable genes, HVGs) and M3Drop10 (the dropout rate of the genes represents a function of the mean expression level, which keeps 10% of genes.
Duó et al. [9] aimed to assess the ability to recover known populations, the run times, and the stability of the methods.
The authors used the Adjusted Rand Index (ARI) [36] to compare two partitions to evaluate how well the clusters recovered the true populations. This index measures the agreement between two classifications, not necessarily with the same number of clusters. If \(P=\{ 1,\ldots s\}\) and \(P^*=\{ 1,\ldots ,r\}\) denote two partitions of a given dataset with s and r clusters respectively, the ARI(\(P,P^*\)) is defined as
where \(n_{ij}\) is the number of individuals belonging to class i in the first clustering (P) and to class j in the second one (\(P^*\)), \(a_i\) (\(1 \le i \le r\)) is the number of individuals in class i in \(P^*\) and \(b_j\) (\(1 \le j \le s\)) is the number of individuals in class j in P.
While this index has been widely used to quantify agreements between two partitions for all clusters simultaneously, it is not straightforward to interpret. The closer the index value is to 1, the better the agreement; however, a tendency to mainly reflect the degree of agreement between the partitions on the large clusters must be considered. Clusters with few elements have less influence on ARI values [37].
Within the conclusions (page 7 in [9]), the authors state that the highest ARI values are obtained for the “wellseparated datasets” (Kumar, KumarTCC and SimKumar4easy). “All the methods failed to recover the partition of the cells by time point in the Trapnell datasets, where the ARIs were consistently below 0.5.” They also noted that “the M3Drop filtering consistently led to a worse performance for the simulated datasets, while the performance was more similar to the other filterings for real datasets”. They also comment, “While none of the methods consistently outperformed the others \(\ldots\) SC3 and Seurat often showed the best performance.”
The partitions obtained by applying each method on each dataset for different values of the number of clusters (k) can be retrieved using functions included in [33]. For instance, the function clustering_summary_filteredExpr10_KohTCC_v2() contains the clustering results from the performance evaluation of the clustering methods analyzed in [9] when tested on the dataset KohTCC filtered by Expr10.
Our primary interest was evaluating whether our implementation of PAM produces ARI values similar to those obtained by other methods.
Additional file 3 includes the R code for the following analysis and a detailed discussion.

For each method and dataset in [33], the true partition is known. The partition obtained when k is equal to the true number of clusters is obtained; then the ARI between both is calculated.

The scellpam package allows the choice of the dissimilarity measure (\(L_1\), \(L_2\) and Pearson), the normalization method (rawn and log1n), and the number of clusters k; for each dataset, k is set as its true number of clusters and PAMBS was applied for all six possible combinations of dissimilarity and method, and for the 36 datasets mentioned above.

Comparisons were carried out between the partitions obtained by each clustering method and the true partitions, resulting in an ARI value for each case. Table 5 displays the ARI scores for the different datasets in [9] and the three filtering methods obtained by applying PAMBS with the options \(L_1\) and rawn. Additional file 3 reports the ARI scores for all the methods and datasets in Tables 1, 2, 5, 6, 9 and 10. Tables 1 and 2 refer to the datasets filtered with the Expr10 procedure, Tables 5 and 6 refer to those filtered with HVG10, and Tables 9 and 10 refer to those filtered with M3Drop10.

We performed three separate analyses, one for each gene filtering method as each produces a different dataset when applied to the same initial dataset. We used the Friedman rank sum test [38] for each analysis to evaluate for differences between methods. We used the implementation by Eisinga et al. [39] to perform exact allpairs comparison tests for the posthoc analysis. We excluded the ascend and SAFE methods from the comparison as they failed to return a partition with the true number of groups for some datasets.
We now report the results obtained using the Expr10 gene filtering method. Additional file 3 details the results obtained with HVG10 and M3Drop.
The conclusion was to reject the null hypothesis of no difference between methods (Friedman chisquared = 84.243, df = 17, pvalue = 6.699e\(\)11).
Concerning the posthoc analysis, we found no significant differences (\(\alpha =0.05\)) between PAMBS with normalization=rawn and distance=L1 and SC3, Seurat, or any other method. However, when we parameterize PAMBS with other combinations of normalization and distance, we found significant differences with some methods, such as SC3 or Seurat, which performed better. Tables 3 and 4 in Additional file 3 contain the pvalues corresponding to the all pairs of comparisons.
Additional file 3 details the results obtained with HVG10 and M3Drop (Tables 7 and 8, and 11 and 12, respectively).
We obtained Fig. 2 using the function plot_performance() in [33].Figure 4 in Additional file 3) graphically compares PAMBS with the other clustering methods. Each row corresponds to a dataset and each column to a clustering method. White squares indicate a method that failed to return clustering with the same number of groups as the true partition for a given dataset.
We executed PAMBS setting ctype=L1 and normalization = rawn. Given the deterministic nature of PAMBS, the median ARI is simply the ARI. Concerning ARI values for the remaining methods, they were obtained from the information retrieved from [33], as explained in the text.
As a qualitative summary of this section, PAMBS does not display a significant difference (\(\alpha =0.05\)) with any other clustering methods for the studied datasets filtered by HVG10 or Expr10 when making an appropriate choice of its parameters (normalization type and distance/dissimilarity choice). However, SC3 performs significantly better than PAMBS (\(\alpha =0.05\)) when genes are filtered using the M3Drop method. In general, no method outperforms the remaining methods for all datasets and gene filtering options.
Human endometrium study
PAMBS clustering and a likelihoodbased differential abundance test for the detection of changes in human endometrial cell populations
The primary aim of this paper was to evaluate differential cell abundance throughout the natural menstrual cycle using the number of cells per sample and cluster as inputs. This study searches for possible associations between cluster counts and the phenotypic variables describing the biological samples. The counts and the clustering compactness depend on the selection of the normalization method, the dissimilarity measure, the number of HVGs, and the number of clusters. We evaluated such compactness using the mean silhouette of the clusters [30].
As previously stated, the Rpackage scellpam also provides a parallel calculation of the silhouette index for the resulting clustering. Silhouette width has previously been used in scRNAseq literature to evaluate clustering performance [40].
In all cases, the distance/dissimilarity used to calculate the silhouette is the same as that employed to build the cluster with PAM. Figure 3a–c demonstrate these mean silhouettes as a function of the number of clusters, dissimilarities, and normalization methods. The primary purpose of silhouette use was to support the choice of the normalization method, dissimilarity, and number of HVGs.
Although the silhouette provides a good proxy of cluster structure, outcomes must be validated by the biological coherence of cell groups, which must exhibit homogeneous expression patterns. We used the silhouette as a guide to set the clustering parameters. Among the combinations with high silhouette, we chose the combination that yielded the strongest signal of abundance changes for welldelimited cell types; thus, we chose normalization with the log1n method, Pearson dissimilarity and 100 highly significant genes.
We considered three different generalized linear models [7], each using a different predictor. The first model uses time, defined as the day of the menstrual cycle; the second model uses time2, a binary variable indicating if the day of the menstrual cycle is less than or equal to day 20 (value 0) or greater than 20 (value 1); while the third model uses the predictor phase, given in the sample description as the traditional classification of the menstrual cycle phases (level 3 = late proliferative canonical phase; level 4 = early to midsecretory canonical phase; and level 5 = late secretory canonical phase). In summary, we can evaluate menstrual time in three manners: one uses the day, while the others employ relevant biological knowledge.
We tested the null hypothesis of no effect of the chosen predictor (i.e., the corresponding coefficient equal to zero) for the two first models, which evaluates the cluster counts’ dependence on the corresponding predictor. The third model has a categorical predictor  the menstrual phase  with three levels; the phase is coded using two dummy variables, and we tested if both coefficients can be jointly considered null. Additionally, we considered comparisons of each pair of levels.
We applied two of the testing procedures implemented in the edgeR package. The first [7] assumes a generalized linear model with a negative binomial response. The second quasilikelihood approach, modifying the meanvariance relationship of the negative binomial model, is proposed in [8]. In our case, the number of significant clusters detected with the quasilikelihood approach remains lower than with the likelihood approach. Additional file 4 contains the results for both approaches; from this point forward, we will use the likelihood approach.
We observed more significant clusters using raw count normalization instead of logarithmic normalization. Figure 4a displays the number of significant clusters obtained using different dissimilarity measures with time2 as the predictor. Overall, Euclidean distance appears to provide more significant clusters, followed by the Pearson dissimilarity, and finally, the Manhattan metric (\(L_1\)). The different criteria to choose the normalization and metric seem reasonable, which makes this an open question. Different choices should be evaluated in each separate case.
Figure 4b reports the number of significant clusters using the whole set of cells (labeled without) and using a filtered set of cells, considering the silhouette values (i.e., cells with low silhouette values are removed) (labeled with). Specifically, we applied successive steps of removing the 15% of cells with a lower silhouette until at least 60% of the remaining cells have a silhouette higher than 0.7 were applied. Cell filtering provides more significant clusters when fitting the models with time and time2; however, this statement does not remain true using the model with phase (as shown in Additional file 4).
Although the model using time2 as a predictor variable encounters fewer significant clusters than the model using phase as a predictor, the first method is preferred given the consistency of results. The number of significant clusters remains similar with and without filtering using the model with time2, while the model with phase as a predictor finds almost all clusters significant without filtering (28 to 30 using different normalization methods and metric/dissimilarity) but a number from 3 to 8 with filtering.
We compared the distinct phases of the menstrual cycle with the corresponding comparisons. Only comparing the two last phases (3 and 4) provided significant clusters using the likelihood approach. Comparisons of time2 levels and phase levels 3 and 4 remain biologically similar since they overlap the natural time frames of the menstrual cycle. Intriguingly, while a few clusters display significance when comparing 3 and 4 phases, the global evaluation of the predictor indicates the significance of almost all clusters. The unbalanced number of samples between phases could yield this result. Additional file 4 contains the global evaluation and the comparisons.
Due to its superior stability across numerous factors, such as choice of normalization, metric/dissimilarity, and silhouette filtering, we used the model with “time2” as the predictor variable in subsequent analyses. We made this choice in favor of consistency and reliability. Figure 5 displays a multidimensional scaling plot corresponding to the medoids obtained using log normalization with a Pearson dissimilarity. Medoids from the two most abundant cell types (epithelial and stromal) display trends in opposite directions and increasing distances from the center of the scaling plot, which we interpret as cell states in progression through the menstrual cycle. We observed natural killer (NK) cells and Tlymphocyte medoids positioned far from the center, denoting their distinct immune nature compared to the remaining cell types.
Most endometrial cell types annotated in our dataset possessed clusters that reported a significant pvalue after day 20 of the menstrual cycle (i.e., epithelium, stroma, endothelium, perivascular cells, and immune cells (Fig. 6b). These cells (colored blue in Fig. 6a) represent the changing subpopulations of each primary cell type that modify their abundance during this time phase.
Differentially expressed genes of changing populations have direct involvements in endometrial receptivity
We performed a differential expression evaluation between each significant population from the abundance test against its peer nonsignificant cells within each cell type considered. Fig. 7 summarizes the obtained results.
The PAEP, GPX3, and CXCL14 genes, which displayed overexpression in most cell types analyzed, represent reportedly robust markers of the receptive endometrium [41, 42]. PAEP (progestagenassociated endometrial protein or Glycodelin) is a progesteroneregulated gene that regulates critical fertilization steps [43, 44]. The gene encodes four glycoforms differing concerning glycosylation patterns  glycodelinS, A, F, and C. These glycoforms have distinct, essential roles in maintaining a uterine environment suitable for pregnancy and in the timing and occurrence of the appropriate sequence of events in the fertilization process [45], including immunomodulatory activities. The presence of glycodelins primarily associates with the epithelial compartment [27, 46, 47]; however, our analysis detected the presence of PAEP expression at the singlecell resolution beyond the limits of the epithelium, with significantly altered levels observed in populations within the vascular endothelium, perivascular compartment, and immune cell types (macrophages, Tlymphocytes and granulocytes).
Differential expression analysis in the endothelium also identified genes related to cellular stress (NUPR1) [48], heat shockrelated proteins (CRYAB) [49], mitochondria genes (MTND4L), and genes related to immune activation and proinflammatory response (SLPI and CXCL8). This gene set relates closely to the inflammatory processes in the late secretory phase. This inflammatory phenomenon plays a dual role: first, the endothelial recruitment of immune cells to infiltrate the endometrium and activate the production of embryo adhesion molecules by the luminal epithelial cells [50], and second, the promotion of a controlled proinflammatory environment that later supports the regeneration and healing of the endometrium during menstruation [51]. Furthermore, we discovered that a subpopulation of perivascular cells resident cells of the surrounding blood vessels—expressed CXCL2 and CXCL8, contributing to the proinflammatory microenvironment [52]. In short, the identified transcriptomic signature supports using PAMBS as an effective tool to detect changes in cell abundances between different biological states.
Discussion
As part of this study, we aimed to investigate how cellular populations change in abundance in a dynamically modulated tissue, the human endometrium, throughout the menstrual cycle. The cellular dynamics of a highly renewable tissue remain challenging to dissect. Physiologically, the endometrium enters a narrow window of receptivity known as the window of implantation (WOI) [53] that is structurally and biochemically ideal for embryo implantation [54] during the second half of the menstrual cycle (functionally known as the secretory phase). The duration and precise timing of the WOI are subject to broad inter and intraindividual variation [24, 55], and colossal efforts have been devoted to determining the gene expression patterns that control this timing [25, 26]. The development of singlecell strategies has provided indepth knowledge of gene expression patterns categorized by tissue celltype strata; however, populationcentered changes remained previously unapproached (to the best of our knowledge).
Due to the general importance associated with fertility and assisted reproductive technologies, we also aimed to study the remodeling of cell populations during the WOI by developing a robust procedure to cluster cells and statistically evaluate changes in their abundance.
Our results prove our ability to determine changing cell populations during the WOI. The genes associated with these subpopulations denote their transcriptional transition to a different cellular state. Like cell cycle genes, the described behavior of WOI genes and their temporal expression patterns in different cell types might interfere with further data analysis, becoming a potential source of unwanted variation. Management and proceedings related to cell cycle genes have already been implemented as routine in wellestablished analysis pipelines [56, 57]. Their expression allows for the detection of cells in active division, but their removal is highly recommended for downstream steps such as trajectory inference analysis.
To achieve our primary goal, we implemented PAM as PAMBS on the dataset of our interest. We evaluated the performance of PAMBS by comparing it to a benchmark study by Duó et al [9]. The datasets available can be considered small or mediumsized, and the results obtained by applying PAMBS, with an appropriate parameter selection, remained similar to those of the additional similar methods.
Conclusions
From a technical point of view, this paper contributes to the field by describing a sound and welltested clustering method, PAM, through our PAMBS implementation to large sets of data (assemblies of tens or even hundreds of thousands of cells with the expression of thousands of genes) thanks to efficient memory use and automatic parallelization for the most widely available hardware (multicore processors). The complete workflow for analyzing scRNAseq datasets involves choosing the normalization procedure, the distance/dissimilarity type, and the number of features (which, in this case, corresponds to the number of HVGs). Notably, such choices should be made for each dataset. Filtering by silhouette represents an optional available procedure that may guide the clustering method in some cases. In this bioinformatics context, clustering success must be evaluated by exploring the biological meaning of the obtained cell groups, as the final purpose is to capture insight from each dataset to answer different biological questions (e.g., to describe a set of cell subpopulations that carry out a unique biological process in the tissue under study). We captured transcriptional profiles associated with the principal cell compartments (the epithelium and stroma) and lessstudied cell populations within the endometrium and the WOI context. These lessstudied cell populations include the endothelium, perivascular cells, and distinct immune cell types. We detected cellular abundance changes in all main cell lineages entering the WOI using the proposed approach. We employed the scellpam package to fully understand the transcriptional landscape within the endometrium and the WOI, thus providing an alternative insight into the intricate molecular processes in these specific biological contexts. In summary, we contribute to both the fields of reproductive biology and computational biology; we support a better understanding of the dynamics of endometrial remodeling in reproductive biology and report a readytouse novel methodology for singlecell data analysis in computational biology.
Availability of data and materials
The datasets analyzed during the current study are available in the GEO and ArrayExpress repositories:
scRNAseq on human endometrium across the natural menstrual cycle; N = 71,032 cells. NCBI GEO accession number GSE111976 [27].
scRNAseq of endometrial superficial biopsies; N = 100,307 cells. ArrayExpress accession number EMTAB10287 [31].
scRNAseq of endometriosis; N = 118,144 cells. NCBI GEO accession number GSE213216 [32].
No cell lines were used in this study.
Code availability
The package developed for and used in this study, scellpam, is freely available under GNU license from the CRAN repository: https://cran.rproject.org/web/packages/scellpam/index.html.
Abbreviations
 ARI:

Adjusted Rand index
 HVG:

Highly variable genes
 PAM:

Partitioning Around Medoids
 PAMBS:

PAM with BUILD + SWAP
 PCA:

Principal Component Analysis
 scRNAseq:

Singlecell RNA sequencing
 tSNE:

Tdistributed Stochastic Neighbor Embedding
 WOI:

Window of implantation
References
Trapnell C. Defining cell types and states with singlecell genomics. Genome Res. 2015;25(10):1491–8. https://doi.org/10.1101/gr.190595.115.
Zhao J, Jaffe A, Li H, Lindenbaum O, Sefik E, Jackson R, Cheng X, Flavell RA, Kluger Y. Detection of differentially abundant cell subpopulations in scRNAseq data. Proc Natl Acad Sci USA 2021;118(22).
Ramachandran P, Dobie R, WilsonKanamori JR, Dora EF, Henderson BEP, Luu NT, Portman JR, Matchett KP, Brice M, Marwick JA, Taylor RS, Efremova M, VentoTormo R, Carragher NO, Kendall TJ, Fallowfield JA, Harrison EM, Mole DJ, Wigmore SJ, Newsome PN, Weston CJ, Iredale JP, Tacke F, Pollard JW, Ponting CP, Marioni JC, Teichmann SA, Henderson NC. Resolving the fibrotic niche of human liver cirrhosis at singlecell level. Nature. 2019;575(7783):512–8. https://doi.org/10.1038/s4158601916313.
Lun ATL, Richard AC, Marioni JC. Testing for differential abundance in mass cytometry data. Nat Methods. 2017;14(7):707–9. https://doi.org/10.1038/nmeth.4295.
Dann E, Henderson NC, Teichmann SA, Morgan MD, Marioni JC. Differential abundance testing on singlecell data using knearest neighbor graphs. Nat Biotechnol. 2021;40(2):245–53. https://doi.org/10.1038/s4158702101033z.
Kaufman L, Rousseeuw PJ. Finding groups in data. An introduction to cluster analysis. Hoboken: Wiley; 1990.
McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNASeq experiments with respect to biological variation. Nucleic Acids Res. 2012;40(10):4288–97. https://doi.org/10.1093/nar/gks042.
Lund SP, Nettleton D, McCarthy DJ, Smyth GK. Detecting differential expression in RNAsequence data using quasilikelihood with shrunken dispersion estimates. Stat Appl Genet Mol Biol. 2012. https://doi.org/10.1515/15446115.1826.
Duò A, Robinson MD, Soneson C. A systematic performance evaluation of clustering methods for singlecell RNAseq data. F1000Research. 2020;7:1141. https://doi.org/10.12688/f1000research.15666.3.
R Core Team: R: A language and environment for statistical computing. R foundation for statistical computing, Vienna, Austria 2022. R Foundation for Statistical Computing. https://www.Rproject.org/
Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan KN, Reik W, Barahona M, Green AR, Hemberg M. SC3: consensus clustering of singlecell RNAseq data. Nat Methods. 2017;14(5):483–6. https://doi.org/10.1038/nmeth.4236.
Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of singlecell gene expression data. Nat Biotechnol. 2015;33:495–502. https://doi.org/10.1038/nbt.3192.
Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating singlecell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 2018;36:411–20. https://doi.org/10.1038/nbt.4096.
Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM, Hao Y, Stoeckius M, Smibert P, Satija R. Comprehensive integration of singlecell data. Cell. 2019;177:1888–902. https://doi.org/10.1016/j.cell.2019.05.031.
Hao Y, Hao S, AndersenNissen E, Mauck WM, Zheng S, Butler A, Lee MJ, Wilk AJ, Darby C, Zagar M, Hoffman P, Stoeckius M, Papalexi E, Mimitou EP, Jain J, Srivastava A, Stuart T, Fleming LB, Yeung B, Rogers AJ, McElrath JM, Blish CA, Gottardo R, Smibert P, Satija R. Integrated analysis of multimodal singlecell data. Cell. 2021. https://doi.org/10.1016/j.cell.2021.04.048.
Luecken MD, Theis FJ. Current best practices in singlecell RNAseq analysis: a tutorial. Mol Syst Biol 2019. https://doi.org/10.15252/msb.20188746
Wolf FA, Angerer P, Theis FJ. SCANPY: largescale singlecell gene expression data analysis. Genome Biol. 2018. https://doi.org/10.1186/s1305901713820.
Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech Theory Exp. 2008;2008(10):10008. https://doi.org/10.1088/17425468/2008/10/p10008.
Traag VA, Waltman L, Eck NJ. From Louvain to Leiden: guaranteeing wellconnected communities. Sci Rep. 2019. https://doi.org/10.1038/s4159801941695z.
Schubert E, Rousseeuw PJ. Fast and eager kmedoids clustering: O(k) runtime improvement of the pam, clara, and clarans algorithms. Inf Syst. 2021;101: 101804. https://doi.org/10.1016/j.is.2021.101804.
Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K. Cluster: cluster analysis basics and extensions. 2022. R package version 2.1.4—For new features, see the ’Changelog’ file (in the package source). https://CRAN.Rproject.org/package=cluster
Grün D, Muraro MJ, Boisset JC, Wiebrands K, Lyubimova A, Dharmadhikari G, Born M, Es J, Jansen E, Clevers H, Koning EJP, Oudenaarden A. De novo prediction of stem cell identity using singlecell transcriptome data. Cell Stem Cell. 2016;19:266. https://doi.org/10.1016/J.STEM.2016.05.010.
Domingo J. Scellpam: applying partitioning around medoids to single cell data with high number of cells. 2022. R package version 1.4. https://CRAN.Rproject.org/package=scellpam
Noyes RW, Hertig AT, Rock J. Dating the endometrial biopsy. Fertil Steril. 1950;1(1):3–25. https://doi.org/10.1016/s00150282(16)300620.
Riesewijk A. Gene expression profiling of human endometrial receptivity on days LH+2 versus LH+7 by microarray technology. Mol Hum Reprod. 2003;9(5):253–64. https://doi.org/10.1093/molehr/gag037.
DíazGimeno P, Horcajadas JA, MartínezConejero JA, Esteban FJ, Alamá P, Pellicer A, Simón C. A genomic diagnostic tool for human endometrial receptivity based on the transcriptomic signature. Fertil Steril. 2011;95(1):50–6015. https://doi.org/10.1016/j.fertnstert.2010.04.063.
Wang W, Vilella F, Alama P, Moreno I, Mignardi M, Isakova A, Pan W, Simon C, Quake SR. Singlecell transcriptomic atlas of the human endometrium during the menstrual cycle. Nat Med. 2020;26(10):1644–53. https://doi.org/10.1038/s415910201040z.
Schubert E, Rousseeuw PJ. Faster kmedoids clustering: Improving the pam, clara, and clarans algorithms. In: Amato G, Gennaro C, Oria V, Radovanović M, editors. Similarity search and applications. Cham: Springer; 2019. p. 171–87.
Domingo J, Leon T, Dura E. Scellpam: an R package/C++ library to perform parallel partitioning around medoids on scrnaseq data sets. BMC Bioinform. 2023;24(1):342. https://doi.org/10.1186/s12859023054711.
Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987;20:53–65. https://doi.org/10.1016/03770427(87)901257.
GarciaAlonso L, Handfield LF, Roberts K, Nikolakopoulou K, Fernando RC, Gardner L, Woodhams B, Arutyunyan A, Polanski K, Hoo R, SanchoSerra C, Li T, Kwakwa K, Tuck E, Lorenzi V, Massalha H, Prete M, Kleshchevnikov V, Tarkowska A, Porter T, Mazzeo CI, Dongen S, Dabrowska M, Vaskivskyi V, Mahbubani KT, Park JE, JimenezLinan M, Campos L, Kiselev VY, Lindskog C, Ayuk P, Prigmore E, Stratton MR, SaebParsy K, Moffett A, Moore L, Bayraktar OA, Teichmann SA, Turco MY, VentoTormo R. Mapping the temporal and spatial dynamics of the human endometrium in vivo and in vitro. Nat Genet. 2021;53(12):1698–711. https://doi.org/10.1038/s41588021009722.
Fonseca MAS, Haro M, Wright KN, Lin X, Abbasi F, Sun J, Hernandez L, Orr NL, Hong J, ChoiKuaea Y, Maluf HM, Balzer BL, Fishburn A, Hickey R, Cass I, Goodridge HS, Truong M, Wang Y, Pisarska MD, Dinh HQ, ElNaggar A, Huntsman DG, Anglesio MS, Goodman MT, Medeiros F, Siedhoff M, Lawrenson K. Singlecell transcriptomic analysis of endometriosis. Nat Genet. 2023;55(2):255–67. https://doi.org/10.1038/s41588022012541.
Duó A, Soneson C. DuoClustering2018: Data, Clustering Results and Visualization Functions From Duó et al (2018). 2022. R package version 1.14.0
Hornik K. Clue: cluster ensembles. 2023. R package version 0.364. https://CRAN.Rproject.org/package=clue
Hornik K. A CLUE for CLUster Ensembles. J Stat Softw. 2005;. https://doi.org/10.18637/jss.v014.i12
Hubert L, Arabie P. Comparing partitions. J Classif. 1985;2(1):193–218. https://doi.org/10.1007/bf01908075.
Warrens MJ, Hoef H. Understanding the adjusted rand index and other partition comparison indices based on counting object pairs. J Classif. 2022;39(3):487–509.
Pohlert T. PMCMRplus: Calculate Pairwise Multiple Comparisons of Mean Rank Sums Extended. 2022. R package version 1.9.6. https://CRAN.Rproject.org/package=PMCMRplus
Eisinga R, Heskes T, Pelzer B, Te Grotenhuis M. Exact pvalues for pairwise comparison of friedman rank sums, with application to comparing classifiers. BMC Bioinform. 2017;18(1):68.
Hie B, Bryson B, Berger B. Efficient integration of heterogeneous singlecell transcriptomes using scanorama. Nat Biotechnol. 2019;37(6):685–91. https://doi.org/10.1038/s4158701901133.
Talbi S, Hamilton AE, Vo KC, Tulac S, Overgaard MT, Dosiou C, Shay NL, Nezhat CN, Kempson R, Lessey BA, Nayak NR, Giudice LC. Molecular phenotyping of human endometrium distinguishes menstrual cycle phases and underlying biological processes in normoovulatory women. Endocrinology. 2006;147(3):1097–121. https://doi.org/10.1210/en.20051076.
Bhagwat SR, Chandrashekar DS, Kakar R, Davuluri S, Bajpai AK, Nayak S, Bhutada S, Acharya K, Sachdeva G. Endometrial receptivity: a revisit to functional genomics studies on human endometrium and creation of HGExERdb. PLoS ONE. 2013;8(3):58419. https://doi.org/10.1371/journal.pone.0058419.
Oehninger S, Coddington CC, Hodgen GD, Seppala M. Factors affecting fertilization: endometrial placental protein 14 reduces the capacity of human spermatozoa to bind to the human zona pellucida. Fertil Steril. 1995;63(2):377–83. https://doi.org/10.1016/s00150282(16)573725.
Rachmilewitz J, Riely GJ, Tykocinski ML. Placental protein 14 functions as a direct tcell inhibitor. Cell Immunol. 1999;191(1):26–33. https://doi.org/10.1006/cimm.1998.1408.
Chiu PCN, Chung MK, Koistinen R, Koistinen H, Seppala M, Ho PC, Ng EHY, Lee KF, Yeung WSB. Cumulus oophorusassociated glycodelinc displaces spermbound glycodelina and f and stimulates spermatozoazona pellucida binding. J Biol Chem. 2007;282(8):5378–88. https://doi.org/10.1074/jbc.m607482200.
Seppälä M, Suikkari AM, Julkunen M. Human endometrial proteins. Reprod Nutr Dév. 1988;28(6B):1649–54. https://doi.org/10.1051/rnd:19881009.
Turco MY, Gardner L, Hughes J, CindrovaDavies T, Gomez MJ, Farrell L, Hollinshead M, Marsh SGE, Brosens JJ, Critchley HO, Simons BD, Hemberger M, Koo BK, Moffett A, Burton GJ. Longterm, hormoneresponsive organoid cultures of human endometrium in a chemically defined medium. Nat Cell Biol. 2017;19(5):568–77. https://doi.org/10.1038/ncb3516.
Martin TA, Li AX, Sanders AJ, Ye L, Frewer K, Hargest R, Jiang WG. Nupr1 and its potential role in cancer and pathological conditions (review). Int J Oncol. 2021.
Zhang J, Liu J, Wu J, Li W, Chen Z, Yang L. Progression of the role of cryab in signaling pathways and cancers. OncoTargets and therapy 2019.
Granot I, Gnainsky Y, Dekel N. Endometrial inflammation and effect on implantation improvement and pregnancy outcome. Reproduction. 2012;144(6):661–8. https://doi.org/10.1530/rep120217.
Critchley HO, Maybin JA, Armstrong GM, Williams AR. Physiology of the endometrium and regulation of menstruation. Physiol Rev. 2020;100(3):1149–79. https://doi.org/10.1152/physrev.00031.2019.
Stark K, Eckart A, Haidari S, Tirniceriu A, Lorenz M, Brühl ML, Gärtner F, Khandoga AG, Legate KR, Pless R, Hepper I, Lauber K, Walzog B, Massberg S. Capillary and arteriolar pericytes attract innate leukocytes exiting through venules and ‘instruct’ them with patternrecognition and motility programs. Nat Immunol. 2012;14(1):41–51. https://doi.org/10.1038/ni.2477.
Harper MJK. The implantation window. Baillieres Clin Obstet Gynaecol. 1992;6(2):351–71. https://doi.org/10.1016/s09503552(05)800926.
Wilcox AJ, Baird DD, Weinberg CR. Time of implantation of the conceptus and loss of pregnancy. N Engl J Med. 1999;340(23):1796–9. https://doi.org/10.1056/nejm199906103402304.
Noyes RW, Hertig AT, Rock J. Dating the endometrial biopsy. Am J Obstet Gynecol. 1975;122(2):262–3. https://doi.org/10.1016/s00029378(16)335001.
Barron M, Li J. Identifying and removing the cellcycle effect from singlecell RNAsequencing data. Sci Rep. 2016. https://doi.org/10.1038/srep33892.
Lun ATL, McCarthy DJ, Marioni JC. A stepbystep workflow for lowlevel analysis of singlecell RNAseq data with bioconductor. F1000Research 2016;5:2122. https://doi.org/10.12688/f1000research.9501.2
Acknowledgements
None applicable.
Funding
This publication is part of the PGC type B with reference PID2020117114GBI00 funded by the Spanish Ministry of Science and Education, MCIN/ AEI/10.13039/501100011033/ and of the I+D+i funded project CellStats (PI2021010) from the VLCBIOCLINIC program 2021. B.R. was supported by the H2020funded project Human Uterus Cell Atlas (HUTER) (2020/2021) (Grant Agreement 874867). R.P.M was supported by an Industrial Doctorate Grant, (DIN2020011069) from the Spanish Ministry of Science and Innovation (MICINN).
Author information
Authors and Affiliations
Contributions
JD developed and implemented the code for the package and methods used in the study. OK performed data analysis and developed R code for the study. TL contributed to the development of R code, implemented the methods, and interpreted the results. RP conducted data analysis and contributed to the interpretation of the results. GA and BR designed the study and contributed to the interpretation of the results. All authors participated in writing the paper and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
No human or animal subjects were used for this study/No ethical issues are involved/Not applicable.
Consent for publication
Not applicable
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
. Scellpam package installation and workflow.
Additional file 2
. Tables with additional comparisons for PAMBS execution times.
Additional file 3
. Graphical and numerical comparisons among PAMBS and other methods.
Additional file 4
. R code and results for the differential cell abundance analysis.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Domingo, J., KutsyrKolesnyk, O., Leon, T. et al. A cell abundance analysis based on efficient PAM clustering for a better understanding of the dynamics of endometrial remodelling. BMC Bioinformatics 24, 440 (2023). https://doi.org/10.1186/s12859023055696
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859023055696