Characterization of retinal cell types is an important field of study with wide applications in ophthalmology and regenerative medicine. With the advent of single-cell RNA sequencing (scRNA-seq), computational methods for gene reporting can yield valuable insights into genes that are important in determining cell fate. Human pluripotent stem cells (hPSCs) can be used to generate retinal cell types in vitro, with potential applications in curing age-related macular degeneration, retinitis pigmentosa and other retina-related causes of blindness. However, gene reporting and characterization of these cell types are difficult because they differentiate asynchronously in complex cultures. In addition, more datasets are available for mouse models than for human or organoid models. We propose using Boolean implication analysis to improve the prediction accuracy of existing correlational methods for gene reporting.
Previous methods in vivo and in vitro
One of the most common methods to study the effect of key genes on retinal development is the use of genetically modified “knockout” murine models, which are frequently used to validate differentially expressed genes from microarray data [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]. Fluorescent gene reporter lines are widely used to check for gene expression in single cells, or purified populations of a single cell type [2, 21,22,23,24,25]. Bulk RNA sequencing (RNA-seq) has helped define the transcriptomes of larger populations of retinal cell types [3, 9, 14, 17, 21, 24, 26,27,28,29,30,31,32,33,34,35]. To study the characteristics of isolated cells or droplets, flow cytometry was formerly a major method [36, 37]. Single-cell RNA sequencing (scRNA-seq) is increasingly common today and is one of the most detailed methods to profile transcriptomes of retinal cell types and subtypes [2, 8, 13, 22, 38,39,40,41,42,43,44,45,46,47,48].
Most studies on retinal cell types have relied upon murine models, but many increasingly study human donor retinas [6, 30, 31, 48,49,50], especially in order to profile retinal disease [31, 43, 50,51,52,53]. Glaucoma, age-related macular degeneration and retinal light damage have also been studied in murine models [7, 14, 29, 34, 35, 54, 55]. Some studies have grown cell lines in vitro from fetal retina [49, 56], whereas others have used human pluripotent, induced pluripotent or embryonic stem cells to generate purified cell populations or retinal organoids [2, 3, 8, 28, 38, 57,58,59]. In order to study the development of retinal cell types over time, the lineage of stem cell progeny  and time course data from different time points (using PCR and RNA-seq) have been investigated [39, 41, 54].
Previous computational methods
Differential expression analysis is the most common method to identify retinal cell type-specific genes and biomarkers from microarray, RNA-seq and scRNA-seq data [10, 13, 14, 17, 24, 29,30,31, 39, 41, 46, 47, 53, 56, 59]. In single-cell analysis, dimension reduction through Principal Component Analysis, which reduces the size of the data and allows visualization, is often performed before hierarchical clustering is used to identify cell clusters [2, 7, 30, 41, 42, 49, 56, 60]. Cell clusters can be assigned to different cell types or subtypes based on the expression of key marker genes. AI-guided identification of cell clusters has recently been investigated.
scRNA-seq data provides opportunities for in-depth analysis of the transcriptome of individual cells, and subsequent characterization of cell types, subtypes and regions of retina. However, scRNA-seq data is highly noisy and contains large numbers of zeroes, among which true and false negatives are indistinguishable. Many of these zeroes are dropouts, caused by a failure to capture or amplify a transcript. As a result, scRNA-seq data generates sparse arrays with a high false omission rate and a low negative predictive value.
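The two quantities named above can be made concrete with a small calculation. The sketch below treats each observed zero as a "negative" call and computes the false omission rate (the fraction of zeroes that are really dropouts) and the negative predictive value (the fraction that are true absences). The counts are hypothetical and chosen only for illustration; in real scRNA-seq data the split between true and false negatives is not directly observable, so such counts would only be available from a simulation or a spike-in control.

```python
from typing import Tuple


def zero_call_quality(true_negatives: int, false_negatives: int) -> Tuple[float, float]:
    """Return (false omission rate, negative predictive value) for observed zeroes.

    true_negatives:  zeroes where the transcript is genuinely absent
    false_negatives: zeroes caused by dropout (transcript present but not detected)
    """
    total_zeroes = true_negatives + false_negatives
    if total_zeroes == 0:
        raise ValueError("no zero counts to evaluate")
    false_omission_rate = false_negatives / total_zeroes
    negative_predictive_value = true_negatives / total_zeroes
    return false_omission_rate, negative_predictive_value


# Hypothetical example: of 1,000 observed zeroes, 400 are dropouts.
for_rate, npv = zero_call_quality(true_negatives=600, false_negatives=400)
print(f"FOR = {for_rate:.2f}, NPV = {npv:.2f}")  # FOR = 0.40, NPV = 0.60
```

Because the two values always sum to 1, every additional dropout raises the false omission rate and lowers the negative predictive value by the same amount.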
Most studies, to date, have been highly dependent on cell clustering, which is not always achievable, especially in datasets containing immature or developing cells. Pseudo-time analysis, which maps single-cell trajectories along developmental processes, has been applied to retinal organoids, and takes into account transitory states rather than discrete clusters. However, these approaches are hindered by asynchronous differentiation of cell types in retina and the symmetric nature of clustering algorithms. Correlational methods for ranking gene expression are also widely used, bypassing the need to discover cell clusters and identifying co-expressed genes in complex cultures, including developing retinal organoids [2, 8, 23, 27, 49, 64].
Identifying relationships between genes has led towards broader goals of graph [47, 60] and network-based analysis [9, 10, 17, 25, 27, 31, 60, 65]. Gene expression networks can be used to identify transitions between phenotypes and disease states, paving the way for clinical target identification. Correlational analysis is traditionally used to derive co-expression networks, and knockout murine models are used to directly investigate the effect of one gene’s absence. However, the symmetric nature of correlation can lead to loss of valuable information and does not provide insight into the expression of genes over time. Bayesian networks of gene regulation and expression in the retina mainly identify transcription factors and their targets [60, 66]. Hence, the motivation of our work was to develop a universally applicable state-of-the-art method that filtered out noise, could be applied to a wide variety of datasets and lent insight into gene expression over differentiation.
A Boolean approach
Boolean logic is a simple mathematical relationship between two values such as high/low or 1/0. We propose using Boolean implication (“if–then” relationships) to study the dependency between genes from scRNA-seq data. Research by Sahoo et al. has shown that analysis of Boolean implication relationships is better at filtering out noise than a correlational approach. Analysis of Boolean implication also lends insight into asymmetric relationships that correlation disregards.
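As an illustration of how an "if–then" relationship between two genes might be tested on binarized data, the sketch below scores "A high ⇒ B high" by checking how empty the (A high, B low) quadrant is compared with what independence would predict. It is written in the spirit of a sparse-quadrant test and is not the authors' implementation. The function names, the toy data and the cut-offs (statistic above 3.0, error rate below 0.1) are assumptions made for this example.

```python
from math import sqrt
from typing import Sequence, Tuple


def high_implies_high(a_high: Sequence[bool], b_high: Sequence[bool]) -> Tuple[float, float]:
    """Score "A high => B high" from per-cell high/low calls.

    Returns (statistic, error_rate), where the statistic measures how far the
    (A high, B low) quadrant falls below its expected count under independence.
    """
    if len(a_high) != len(b_high):
        raise ValueError("both genes must be measured in the same cells")
    n = len(a_high)
    n_a_high = sum(a_high)
    n_b_low = n - sum(b_high)
    if n_a_high == 0 or n_b_low == 0:
        return 0.0, 0.0  # quadrant is trivially empty; no evidence either way
    observed = sum(1 for a, b in zip(a_high, b_high) if a and not b)
    expected = n_a_high * n_b_low / n
    statistic = (expected - observed) / sqrt(expected)
    error_rate = 0.5 * (observed / n_a_high + observed / n_b_low)
    return statistic, error_rate


def holds(statistic: float, error_rate: float,
          min_statistic: float = 3.0, max_error: float = 0.1) -> bool:
    """Apply illustrative cut-offs (assumed values, not tuned for any dataset)."""
    return statistic > min_statistic and error_rate < max_error


# Toy data for 100 cells: 29 (A high, B high), 1 (A high, B low),
# 30 (A low, B high), 40 (A low, B low).
a_high = [True] * 30 + [False] * 70
b_high = [True] * 29 + [False] * 1 + [True] * 30 + [False] * 40

forward = holds(*high_implies_high(a_high, b_high))  # A high => B high
reverse = holds(*high_implies_high(b_high, a_high))  # B high => A high
print(forward, reverse)  # True False
```

In this toy data the relationship holds in one direction only, which is the kind of asymmetry a symmetric correlation coefficient cannot express. The single (A high, B low) cell is tolerated as noise rather than breaking the relationship.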
While Boolean implication, like correlation, does not imply causation, asymmetric Boolean relationships can be thought of in terms of subsets. For example, the relationship Gene A high ⇒ Gene B high indicates that all cells with Gene A high are a subset of those with Gene B high. This allows for analysis of developmentally regulated genes using Boolean implication, first pioneered in the MiDReG tool published by Sahoo et al.
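The subset reading of an implication can be shown directly with sets of cells. In the sketch below, Gene A high ⇒ Gene B high is true exactly when every cell in which A is high also appears among the cells in which B is high. The cell identifiers are hypothetical and are not taken from any dataset.

```python
# Hypothetical cell identifiers for the cells in which each gene is called high.
cells_a_high = {"cell_02", "cell_05", "cell_07"}
cells_b_high = {"cell_01", "cell_02", "cell_05", "cell_07", "cell_09"}

# "A high => B high" holds when the A-high cells are a subset of the B-high cells.
a_implies_b = cells_a_high <= cells_b_high

# The converse fails here: some B-high cells do not have A high.
b_implies_a = cells_b_high <= cells_a_high

# Cells that make the relationship asymmetric (B high while A is not high).
b_without_a = sorted(cells_b_high - cells_a_high)

print(a_implies_b, b_implies_a, b_without_a)
# True False ['cell_01', 'cell_09']
```

The exact subset test shown here is the idealized case; on noisy single-cell data a small number of violating cells would normally be tolerated by a statistical test rather than a strict set comparison.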
In previous research, Boolean methods have led to the discovery of prognostic biomarkers for bladder and colon cancer [69,70,71]. These methods have also led to characterization of hematopoietic stem cells and identification of B and T cell precursors [72, 73]. Our methods have not previously been applied to stem cell-derived retinal cell types, but have yielded insights into changes in transcriptional profiles of healthy retina and retinoblastoma. The StepMiner and BooleanNet algorithms were developed for microarray data by Sahoo et al. to identify Boolean implication relationships between genes, but have since been applied to a wide variety of high-throughput data, such as RNA-seq, scRNA-seq and microbiome data [68, 75,76,77].
A small number of previous studies have used single-cell RNA-seq data to construct gene regulatory networks that use Boolean relationships such as AND, OR and NOT to model processes such as hematopoiesis. These studies also begin by binarizing the data, but they build dynamic executable models (sequential logic with memory), which are classically different from Boolean implication relationships, which follow combinational logic (memoryless). Qiu 2020 recognizes that binarizing gene expression values can “embrace the dropouts” in single-cell data by using zero values in the data to characterize cell types. However, the data is binarized by simply replacing any non-zero values with 1, losing the quantitative information of gene expression. In this work, the StepMiner algorithm computes a threshold that considers the quantitative expression values before binarizing them as low or high. This approach focuses on Boolean implication relationships as they can identify cell populations based on a relationship between two genes and shed light on gene expression during differentiation.
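The difference between the two binarization strategies can be sketched as follows. The first function marks any non-zero value as 1. The second fits a single rising step to the sorted expression values, choosing the split that minimizes the squared error around the two level means, and places the threshold midway between those means. This is a simplified, StepMiner-style sketch under stated assumptions, not the published implementation: it omits any noise margin around the threshold, uses a simple quadratic-time search, and the function names and expression values are hypothetical.

```python
from typing import List, Sequence


def step_threshold(values: Sequence[float]) -> float:
    """Fit one rising step to the sorted values and return the midpoint of its two levels."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values to fit a step")
    best_sse = float("inf")
    best_threshold = ordered[0]
    for split in range(1, n):  # try every position for the step
        low, high = ordered[:split], ordered[split:]
        mean_low = sum(low) / len(low)
        mean_high = sum(high) / len(high)
        sse = (sum((v - mean_low) ** 2 for v in low)
               + sum((v - mean_high) ** 2 for v in high))
        if sse < best_sse:
            best_sse = sse
            best_threshold = (mean_low + mean_high) / 2
    return best_threshold


def binarize_nonzero(values: Sequence[float]) -> List[bool]:
    """Call a gene 'high' whenever any expression is detected."""
    return [v > 0 for v in values]


def binarize_step(values: Sequence[float]) -> List[bool]:
    """Call a gene 'high' only above the fitted step threshold."""
    threshold = step_threshold(values)
    return [v > threshold for v in values]


# Hypothetical expression values for one gene across eight cells.
expression = [0.0, 0.0, 0.5, 1.0, 1.0, 6.0, 7.0, 8.0]

print(step_threshold(expression))    # 3.75
print(binarize_nonzero(expression))  # weakly expressing cells are called high
print(binarize_step(expression))     # only the three strongly expressing cells are high
```

With these values, non-zero binarization groups the cells at 0.5 and 1.0 with those at 6.0 to 8.0, whereas the step-based threshold keeps them apart, which is the quantitative information the non-zero rule discards.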