Tissue homeostasis results from an intricate gene regulation network that, when disrupted, can lead to disease. Gene expression is an intermediate state linking the genome to phenotypic outcomes. Biomedical studies have relied on relating gene expression levels to disease state, progression, outcome and treatment [1,2,3]. Changes in gene expression have helped identify patterns and signatures associated with disease (e.g. community-acquired pneumonia [4] or tuberculosis [5]), improve the understanding of aging conditions [6, 7] and complement genetic information in Mendelian disease diagnosis [8]. Furthermore, analyzing these alterations can facilitate the discovery, development and assessment of novel drug treatments [9,10,11], the optimization of drug application and dosage in cancer therapies [12, 13], or the prediction of response to therapy [3].
Advances in sequencing technologies have produced an unprecedented amount of multi-modal molecular data. These advances opened the possibility for a thorough and informative interrogation of genotype-phenotype associations in complex diseases, presenting new opportunities for precision medicine. However, finding statistically significant and meaningful associations between molecular data and phenotypic or clinical annotations remains challenging. This difficulty is exacerbated by the high dimensionality of the data and, as is the case with gene expression, a considerable stochastic component.
Machine Learning (ML) seeks to automatically capture statistical associations in data while improving knowledge with additional evidence [14,15,16]. ML is of particular interest in computational biology because it can describe biological phenomena without the need to explicitly model them. Supervised ML builds mathematical models that map the input data, described by several attributes or features, to the corresponding output value or label for each instance or sample. The model can then be used to predict the outcome of unseen or incoming data. Figure 1 shows a general flowchart of a supervised learning (SL) pipeline: (1) train/test data split; (2) model training; (3) model evaluation. Discrete categories are determined through classification tasks (Fig. 1a); when predicting numerical values, the task is called regression (Fig. 1b).
In the context of gene expression analysis, there are several examples of phenotype inference from predictive models. Classification has been used for predicting therapeutic or drug response [17, 18], as well as cancer molecular subtype from gene expression, DNA methylation data, or both [19]. Use cases for regression include the prediction of expression levels of target genes from landmark genes [20] or mutational effects from DNA sequences [21].
While ML provides an adequate solution for statistical association challenges, it often incurs substantial computational demands. Processing power and memory requirements are amplified by the size of the data and the complexity of the models. When building models on a single machine, such as a laptop, it might be too costly or even impossible to load large amounts of data into memory. Moreover, while high-performance computing (HPC) solutions are capable of meeting these requirements, it is not always trivial to fully leverage these resources. Therefore, implementations for scalable computing of ML and other data analytics tasks are required.
There are several frameworks for scalable data analytics available for the widely used Python programming language. In this review we focus on one such framework, Dask [22]. Dask divides data into smaller blocks on which to perform highly parallel computations, thus allowing datasets larger than memory to be processed on a single machine. It integrates tightly with existing libraries for Python data analytics and shares a similar interface for implementation, which minimizes the need for code rewrites and facilitates the transition to HPC environments. Here we explore the potential of Dask for scalable ML and data science, applied to bulk and single-cell transcriptomics, providing usage recommendations supported by numerous code examples and performance tests.
This paper begins with a review on the need for scalable data analysis in computational biology. We start by describing the standard supervised ML workflow and how it can be adapted for gene expression data analysis. In the following section, we look into the specifics of the Python programming language and how it can be used for ML and data science, discussing its advantages and limitations. We then introduce the general concepts of distributed and parallel computing and enumerate different scalable ML frameworks, with a focus on Dask. We perform several benchmarks and comparisons highlighting the advantages of Dask. Finally, we propose guidelines for the efficient use of Dask for transcriptomics analysis. This review is supported by examples of different tasks, comparative performance tests, and supporting code.
Building gene expression predictive models
Gene expression predictive models are built using supervised learning, which often relies on the data being stored in a structured format such as tables (see Fig. 1). The labels (categorical or numerical) represent the phenotype or the clinical information to be inferred.
Supervised learning pipelines encompass several key steps [23]. First, gene expression and phenotype data are loaded. At this point, preprocessing can be applied to prepare the data for the remaining steps. The data is then split into training and test sets. ML algorithms learn from the training data to generate a predictive model. In supervised learning, training consists of finding the coefficients, known as model parameters, that provide the best mapping between input features and output labels, as judged by intermediate validation scores. Lastly, the trained model is evaluated on the hold-out test set. The test set is used to estimate the generalization error, i.e. how well the model behaves on unseen data. To ensure the robustness of this estimate, k-fold cross-validation (CV) [24, 25] is customarily performed during model training. This strategy consists of creating several (k) train/test partitions from the original dataset, repeating the model-building process k times, and averaging the k evaluation scores. If the model reaches an evaluation score that is deemed sufficiently high for a specific application, it can be deployed to work with new incoming data.
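As an illustration, a minimal sketch of this workflow using scikit-learn could look as follows; the expression matrix and phenotype labels below are synthetic placeholders standing in for real data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data: 200 samples x 2,000 genes, binary phenotype labels
rng = np.random.default_rng(0)
X = rng.lognormal(size=(200, 2000))
y = rng.integers(0, 2, size=200)

# (1) Hold out a test set to estimate the generalization error
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# (2) k-fold cross-validation (k = 5) on the training set
model = RandomForestClassifier(random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# (3) Train on the full training set and evaluate on the hold-out test set
model.fit(X_train, y_train)
print("Test accuracy: %.3f" % model.score(X_test, y_test))
```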
For the most part, the development of transcriptomics predictive models follows this traditional approach. However, because gene expression levels obtained from RNA-seq experiments are noisy and span a wide dynamic range, data preprocessing may have an impact on downstream results. Therefore, feature selection, scaling and normalization are required [26,27,28,29,30,31,32]. Feature selection removes less informative genes, for example those with low variability or low expression levels. It can also select biologically relevant gene subsets (e.g. protein coding). Scaling is used to attenuate fold differences in expression ranges, typically by shifting the data from a linear to a logarithmic scale. Sample normalization methods account for differences that arise from the sequencing process, such as library size, batch effects and gene structure.
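The sketch below illustrates these preprocessing steps on a hypothetical raw count matrix (samples as rows, genes as columns); the filtering thresholds and the counts-per-million normalization are arbitrary choices used only for demonstration:

```python
import numpy as np
import pandas as pd

# Hypothetical raw count matrix: samples as rows, genes as columns
rng = np.random.default_rng(0)
counts = pd.DataFrame(rng.poisson(lam=5, size=(100, 2000)),
                      columns=[f"gene_{i}" for i in range(2000)])

# Feature selection: keep genes above an (arbitrary) minimum total count
filtered = counts.loc[:, counts.sum(axis=0) >= 10]

# Library-size normalization: counts per million (CPM) per sample
cpm = filtered.div(filtered.sum(axis=1), axis=0) * 1e6

# Scaling: shift from a linear to a logarithmic scale (pseudocount avoids log(0))
log_cpm = np.log2(cpm + 1)

# Optionally keep only the most variable genes as input features
top_genes = log_cpm.var(axis=0).nlargest(500).index
X = log_cpm[top_genes]
```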
Although models are trained automatically, the building process can be controlled by static values defined a priori by the user, known as hyper-parameters [33]. Different hyper-parameter values influence the performance of the final predictive model in different ways. Therefore, hyper-parameter optimization (HPO) can be performed during CV in an attempt to lower the generalization error. If the model reaches an optimized performance according to the evaluation measures, the model is selected and applied to new incoming data. Determining the best set of hyper-parameter values is a computationally expensive combinatorial problem. It requires traversing a multidimensional grid of values, with a model to be trained and tested for each combination. Random search alleviates this issue by sampling a limited number of hyper-parameter combinations [34].
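A minimal sketch of both search strategies with scikit-learn is shown below; the estimator and the hyper-parameter grid are illustrative choices rather than recommendations, and the data is synthetic:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=500, random_state=0)
model = RandomForestClassifier(random_state=0)

# Grid search: every combination in the grid is trained and cross-validated
grid = {"n_estimators": [100, 200, 500], "max_depth": [None, 5, 10]}
grid_search = GridSearchCV(model, grid, cv=5).fit(X, y)

# Random search: only n_iter sampled combinations are evaluated
dists = {"n_estimators": randint(100, 500), "max_depth": [None, 5, 10]}
rand_search = RandomizedSearchCV(model, dists, n_iter=10, cv=5,
                                 random_state=0).fit(X, y)

print(grid_search.best_params_, rand_search.best_params_)
```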
To accurately estimate the performance of an ML algorithm, k-fold CV and HPO can be combined into a strategy known as nested CV. While this provides more reliable performance estimates, it further exacerbates the computational cost incurred by the two techniques in isolation.
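Conceptually, nested CV wraps a hyper-parameter search inside an outer cross-validation loop, as in the following sketch (again with an illustrative estimator, grid and synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=200, n_features=500, random_state=0)

# Inner loop: hyper-parameter optimization on the training folds only
inner = GridSearchCV(RandomForestClassifier(random_state=0),
                     {"max_depth": [None, 5, 10]}, cv=3)

# Outer loop: each held-out fold evaluates a model whose hyper-parameters
# were tuned without ever seeing that fold
nested_scores = cross_val_score(inner, X, y, cv=5)
print(nested_scores.mean(), nested_scores.std())
```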
Scientific computing with Python
Python is an interpreted, high-level and general-purpose programming language with a focus on readability [35]. It offers interactive and scripting modes that allow for quick prototyping and deployment of a broad range of scientific applications. Data science and ML are areas where Python’s role has greatly expanded in recent years. The Scientific Python Environment (SPE) is at the core of this expansion. The SPE is based on several highly efficient libraries that support numerical and scientific computation, namely: NumPy [36], for storing and operating over large and multi-dimensional arrays and matrices; SciPy [37], for fundamental routines and algorithms in scientific and technical computation; pandas [38], for data manipulation and analysis; matplotlib [39], for plotting and visualization; and scikit-learn [40], for traditional ML. Python and the SPE have thus become popular tools for data science and ML [23]. However, as the size of the data increases and the tasks become more complex and expensive, there is a need for improving program efficiency, namely in terms of execution runtime.
Python concurrent and parallel computation
Parallelism and concurrency are two approaches for improving the efficiency of a program. Their use is highly dependent on the underlying architecture of the central processing unit (CPU). In Python, they can be implemented through standard-library modules such as multiprocessing and threading.
A process is an independent instance executed on a processor core or node with a dedicated memory space. To speed up computation in CPU-intensive tasks, multiprocessing spawns multiple processes that execute parts of a task in parallel. A thread runs within a process and allows parts of a program to run concurrently whenever there is a separate flow of execution. Multiple threads can run in the same process and share the same memory space, which reduces the overhead associated with copying and moving data. Multithreading thus enables concurrency by making progress on multiple independent tasks at the same time.
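The contrast between the two models can be sketched with the standard-library concurrent.futures interface; cpu_bound below is a toy function used only for illustration:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def cpu_bound(n):
    # Toy CPU-intensive task: sum of squares up to n
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    inputs = [1_000_000] * 4

    # Threads share the parent process memory space: cheap to start and no data
    # copying, but pure-Python code cannot run in parallel (see the GIL below)
    with ThreadPoolExecutor(max_workers=4) as pool:
        thread_results = list(pool.map(cpu_bound, inputs))

    # Processes each get their own interpreter and memory space, so CPU-bound
    # Python code can run truly in parallel across cores
    with ProcessPoolExecutor(max_workers=4) as pool:
        process_results = list(pool.map(cpu_bound, inputs))
```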
When multiple threads are launched at the same time, problems may arise if computations are performed out of order. The standard Python implementation addresses this issue by enforcing the global interpreter lock (GIL) mechanism, which ensures that only a single thread executes Python code at a time. To overcome this important limitation, it is necessary to carry out computations outside the GIL, either by rewriting code in a compiled language like C or by using specialized libraries. In particular, SPE libraries implement several strategies for speeding up computation. For example, NumPy and SciPy can efficiently perform numerical linear algebra operations and bypass the GIL by leveraging low-level Basic Linear Algebra Subprograms implementations such as OpenBLAS [41, 42]. SPE libraries also come with the native capability to spawn multiple processes, or jobs, to parallelize computation across several CPU cores. Yet, increasing the number of processes does not always speed up computation; in some cases, excessive parallelization can even be detrimental to overall performance. Therefore, understanding how parallel processes might nest or interact with each other is critical for improving program performance.
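For instance, scikit-learn exposes this process-based parallelism through the n_jobs parameter, as in the following sketch (the dataset and settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=100, random_state=0)

# n_jobs=-1 asks scikit-learn to use one worker per available CPU core
model = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)

# Parallelism can also be requested at the cross-validation level; combining
# both levels (plus the threads used internally by BLAS) can oversubscribe
# the CPU and end up slowing the computation down
scores = cross_val_score(model, X, y, cv=5, n_jobs=1)
print(scores.mean())
```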
Limitations of the scientific Python ecosystem
For all their promise, the advantages of parallelism can be difficult to achieve. To bypass the GIL, multiprocessing has to launch a Python instance containing a full, in-memory copy of the data for each additional CPU core in use. Certain ML tasks, such as nested CV or hyper-parameter optimization, also incur heavy computational burdens. Hence, the parallelization of ML pipelines rapidly fills the available memory. Furthermore, increases in data size lead to difficulties in data processing and analysis. The limitations of a single machine become apparent when datasets no longer fit comfortably in memory or take too long to load and process. One approach to alleviate these issues is to share the workload across several machines, but in Python this is not a straightforward task. Data science libraries like pandas or scikit-learn were not designed to operate on distributed systems [43]. It is thus imperative to find computational paradigms that can simultaneously build multiple ML models from large or massive datasets, take full advantage of all available CPU cores through efficient parallelization, and offer the possibility to scale beyond a single machine.
Scalable data science
The transformative power of data science is supported by advances in computing capabilities that enable the production, collection, storage and processing of increasingly larger amounts of data. At the same time, the computational requirements of large datasets have become an important bottleneck in data analysis. This bottleneck can be overcome by improving the quality and/or quantity of computational resources available to a single machine (scaling up) or by spreading the computational load across several machines (scaling out).
Scalable data analytics frameworks can assist in implementing these strategies, ensuring that available resources are fully utilized and expanding computational capabilities in tasks such as ML. For example, when a dataset does not entirely fit in memory, out-of-core computation can be a solution. Out-of-core algorithms are optimized to efficiently fetch and transfer batches of data from disk to memory and to process them in smaller blocks, effectively extending the amount of data that can be analyzed beyond the available memory.
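As a simple illustration of the out-of-core idea, a large table can be processed in fixed-size blocks using the chunked reader available in pandas; "expression.csv" is a hypothetical file that would not fit in memory at once:

```python
import pandas as pd

running_sum, n_rows = None, 0

# Read and process the hypothetical file in blocks of 10,000 rows at a time
for chunk in pd.read_csv("expression.csv", chunksize=10_000):
    numeric = chunk.select_dtypes("number")
    running_sum = numeric.sum() if running_sum is None else running_sum + numeric.sum()
    n_rows += len(numeric)

# Per-gene mean expression, computed without holding the full table in memory
mean_expression = running_sum / n_rows
```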
For several years, the Apache Spark [44, 45] framework has been a popular choice for scalable data analytics. Spark is a unified engine written in Scala for distributed data processing. As a part of the all-in-one Apache ecosystem, Spark interfaces well with other Apache projects and is capable of integrating and parallelizing computation across several languages, namely Java, R, and Python (via the Pyspark library). However, integration with these languages requires additional programming efforts and often incurs serialization costs, as abstract data structures need to be converted between languages.
As such, several frameworks and tools are currently being developed within the Python ecosystem to scale one or more steps of data analysis across multiple CPUs (and, in some cases, GPUs) while minimizing the need for code rewrites (Table 1). Some frameworks focus on out-of-core parallelization of large tabular datasets, while others support a complete analysis pipeline in a distributed environment. Overall, there is an ongoing effort to develop scalable data science frameworks that provide seamless integration and compatibility with existing SPE code. These frameworks aim to alleviate the burden associated with technical implementation, allowing researchers to focus on scientific questions.
Among the solutions for scalable data science in Python, this paper highlights the Dask framework [22].
Scaling computational biology with Dask
Dask has been applied to a variety of scalable problems in computational biology. It has been used to introduce parallel computation in molecular dynamics analysis and simulations [46,47,48,49], efficiently handle genotype data as arrays [50], assess the reliability of methodologies for the analysis of human brain white matter connections [51], serve as a component in general neuroimaging pipelines [52], and power a workflow for gene regulatory network model exploration [53].
Applications of Dask specific to omics data analysis include scaling the reconstruction of gene regulatory networks from single-cell yeast RNA [54], integrative analysis of multi-omics data [55], inferring gene regulatory networks from large single-cell gene expression datasets [56, 57], analysis of high resolution rRNA sequencing data from multiple amplicons [58], and the study of tissue organization and cellular communications derived from spatial omics data analysis and visualization [59].
Furthermore, Dask plays a supporting role as an underlying component of several data science and machine learning tools used in computational biology. For example, the GPU-accelerated tool RAPIDS [60] builds on top of the Dask framework to scale its data preparation and model training steps across multiple GPUs and machines [23]. In this assisting capacity, Dask has made it possible to scale single-cell analysis pipelines to upwards of millions of cells [61]. In large-scale distributed environments, such as supercomputers, Dask has been deployed as a workflow manager for GPU-accelerated computation to predict protein structures from genomic sequences [62].
The Dask framework
Dask [22] is a native Python framework that extends the SPE’s capabilities for multi-core and out-of-core computation. This allows Dask to scale beyond a single machine and to handle datasets too large to fit in memory. By cloning the APIs of commonly used libraries like NumPy, pandas or scikit-learn, Dask minimizes the changes required to port pre-existing code [36]. This simplicity greatly reduces the barrier to entry for analysts who are new to distributed and parallel computing, which is especially important in domains such as computational biology and bioinformatics, where data analysis pipelines are often developed on local workstations with the SPE by scientists without a formal computer science background.
The Dask framework combines blocked algorithms with task scheduling to achieve parallel and out-of-core computation. The framework is composed of collections, task graphs and schedulers. Collections are data structures that represent and organize the data. Task graphs and schedulers determine how computations are performed (Fig. 2).
Dask implements three collection types specialized in building task graphs for structured and unstructured data: Dask Arrays, Dask Dataframes and Dask Bags. Dask Dataframes are limited to two-dimensional tables, while Dask Arrays can have higher dimensionality. A Dask Dataframe splits the dataset into multiple partitions along the index; each partition is a pandas Dataframe that can be processed independently, parallelizing the workload. A Dask Array creates chunks by splitting a NumPy array along both rows and columns (or, more generally, along any of its dimensions). Bags are a more flexible data type that can hold any combination of Python objects; they are particularly useful for working with unstructured data in tasks such as graph or text mining. Because it mimics the interface of the pandas and NumPy libraries, Dask offers many of the same statistical and analytical functionalities.
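The following sketch illustrates the Dask Array and Dask Dataframe collections; the array dimensions, chunk sizes and file pattern are arbitrary examples:

```python
import dask.array as da
import dask.dataframe as dd

# Dask Array: a 100,000 x 5,000 matrix split into 10,000 x 1,000 chunks
x = da.random.random((100_000, 5_000), chunks=(10_000, 1_000))
gene_means = x.mean(axis=0)       # builds a task graph; nothing is computed yet
print(gene_means.compute()[:5])   # triggers the parallel computation

# Dask Dataframe: a collection of CSV files read as a single table, partitioned
# along the row index ("expression-*.csv" is a hypothetical set of files)
df = dd.read_csv("expression-*.csv")
per_gene_mean = df.select_dtypes("number").mean().compute()
```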
In addition, Dask provides the more general Delayed and Futures interfaces, which operate as building blocks for implementing custom data structures and algorithms in Dask. Like the specialized collections described above, Dask Delayed builds task graphs that are executed lazily as needed, whereas Futures execute functions eagerly and concurrently across multiple cores and machines.
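A minimal sketch contrasting the two interfaces, using toy functions and an in-process local cluster purely for illustration, could look like this:

```python
import dask
from dask.distributed import Client

def double(x):
    return 2 * x

def add(a, b):
    return a + b

# Delayed: function calls only build a task graph, evaluated on compute()
lazy = dask.delayed(add)(dask.delayed(double)(1), dask.delayed(double)(2))
print(lazy.compute())  # -> 6

# Futures: work starts immediately on the workers of a (local) cluster
if __name__ == "__main__":
    client = Client(processes=False)  # small in-process cluster for illustration
    future = client.submit(add, client.submit(double, 1),
                           client.submit(double, 2))
    print(future.result())            # -> 6
    client.close()
```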
To efficiently handle collections, Dask builds task graphs with blocked algorithms. Blocked algorithms are an approach to out-of-core, distributed computation. Blocked algorithms split data into smaller blocks that are loaded on demand and processed in parallel [63].
A task graph is a directed acyclic graph used to keep track of all tasks and their order. Each node is a function and the edges are objects generated by the preceding function node and used as input by the succeeding node (Fig. 2). Graph evaluation is performed lazily, meaning that task processing is delayed until computation is explicitly called. Due to the great flexibility of blocked algorithms, task graphs can interface with several problem assignment algorithms, such as task scheduling [64].
Task scheduling allows Dask to achieve parallelization by dividing the program into smaller tasks. For instance, when summing three partitions of three numbers each, the scheduler assigns one worker to each partition and processes them in parallel. Once this is done, the resulting partial sums are aggregated into a final solution (Fig. 2).
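This partition-summing example translates directly into a few lines of Dask Array code:

```python
import numpy as np
import dask.array as da

# Nine numbers split into three partitions of three numbers each
x = da.from_array(np.arange(1, 10), chunks=3)

# One task sums each chunk in parallel; a final task aggregates the
# partial sums into the overall result
print(x.sum().compute())  # -> 45
```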
In Dask, there are two groups of task schedulers: single-machine and distributed. Single-machine schedulers leverage local processes or threads, while distributed schedulers operate both locally and across multiple machines. Single-machine schedulers require no setup and incur a smaller overhead, whereas distributed schedulers are more sophisticated and offer additional features, such as a diagnostics dashboard for monitoring performance.
Single-machine schedulers include the threaded scheduler, multiprocessing scheduler and single-threaded synchronous scheduler. The single-threaded scheduler has no parallelism, performing all computations in a single thread, which facilitates debugging. The threaded scheduler leverages a single machine’s entire thread pool, which allows for faster computation times for code that releases the GIL. Lastly, the multiprocessing scheduler provides parallelism by assigning tasks to local processes. This is useful for bypassing the GIL when it cannot be released, as is often the case with pure Python code. However, because inter-process communication is slower than threaded execution, the multiprocessing scheduler does not always reduce execution times.
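In practice, the scheduler can be selected per computation or set globally, as in the following sketch (the array and chunk sizes are arbitrary):

```python
import dask
import dask.array as da

if __name__ == "__main__":  # guard needed by the multiprocessing scheduler
    x = da.random.random((10_000, 1_000), chunks=(1_000, 1_000))

    # Threaded scheduler (default for arrays and dataframes): fastest when the
    # underlying code releases the GIL, as NumPy operations do
    r_threads = x.mean().compute(scheduler="threads")

    # Multiprocessing scheduler: bypasses the GIL for pure-Python tasks, at the
    # cost of inter-process communication
    r_processes = x.mean().compute(scheduler="processes")

    # Synchronous scheduler: everything runs in one thread, easing debugging
    r_sync = x.mean().compute(scheduler="synchronous")

    # A scheduler can also be set globally for subsequent computations
    dask.config.set(scheduler="threads")
```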
While single-machine schedulers ship tasks directly to local thread or process pools, distributed schedulers launch and interface with computations through clients that submit tasks to a scheduler, which in turn assigns them to workers. In Dask, a worker corresponds to one process with its own pool of threads. A dedicated scheduler process coordinates the workers and directs them to execute the individual tasks found in the graphs. As tasks are completed, workers become free and are assigned any remaining tasks until all tasks are executed.
The primary purpose of distributed schedulers is to parallelize computation in clusters. However, the clients that interface with distributed schedulers can also be initialized locally, which might be desirable for a number of reasons. As previously mentioned, the distributed scheduler grants users access to a diagnostics dashboard that allows them to assess computational performance. Distributed schedulers also offer improved data locality, that is, moving computations closer to where the data is stored, which is often more efficient than the reverse operation. In such cases, the additional overhead of setting up a distributed scheduler locally may prove worthwhile. Lastly, the distributed scheduler can leverage additional worker resources, such as extra memory for the local processing of large datasets or extra CPUs for CPU-intensive computations.
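A minimal sketch of a locally initialized distributed scheduler is shown below; the number of workers, threads per worker and array sizes are arbitrary choices:

```python
import dask.array as da
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    # A distributed scheduler running locally: four worker processes,
    # each with two threads
    cluster = LocalCluster(n_workers=4, threads_per_worker=2)
    client = Client(cluster)
    print(client.dashboard_link)   # URL of the diagnostics dashboard

    x = da.random.random((50_000, 1_000), chunks=(5_000, 1_000))
    print(x.mean().compute())      # executed through the distributed scheduler

    client.close()
    cluster.close()
```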