Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics

Background: Distributed approaches based on the MapReduce programming paradigm have started to be proposed in the Bioinformatics domain, due to the large amount of data produced by next-generation sequencing techniques. However, the use of MapReduce and related Big Data technologies and frameworks (e.g., Apache Hadoop and Spark) does not necessarily produce satisfactory results, in terms of both efficiency and effectiveness. We discuss how the development of distributed and Big Data management technologies has affected the analysis of large datasets of biological sequences. Moreover, we show how the choice of different parameter configurations, and the careful engineering of the software with respect to the specific framework under consideration, may be crucial in order to achieve good performance, especially on very large amounts of data. We choose k-mer counting as a case study for our analysis, and Spark as the framework on which to implement FastKmer, a novel approach for the extraction of k-mer statistics from large collections of biological sequences, for arbitrary values of k.

Results: One of the most relevant contributions of FastKmer is the introduction of a module for balancing the statistics aggregation workload over the nodes of a computing cluster, in order to overcome data skew while allowing for full exploitation of the underlying distributed architecture. We also present the results of a comparative experimental analysis showing that our approach is currently the fastest among those based on Big Data technologies, while exhibiting very good scalability.

Conclusions: We provide evidence that the usage of technologies such as Hadoop or Spark for the analysis of big datasets of biological sequences is productive only if the architectural details and the peculiar aspects of the framework under consideration are carefully taken into account in the algorithm design and implementation.


Introduction
With the rapid growth of biological sequence datasets and the evolution of sequencing technologies, many algorithms and software tools commonly used for the analysis of biological sequences are becoming obsolete. For this reason, computational approaches based on frameworks for big data processing have started to be proposed in order to deal with problems involving large amounts of biological data [1,2,3,4,5]. Unfortunately, the fundamental domain of alignment-free linguistic and informational analysis of genomic and proteomic sequences, e.g., [6,7,8,9,10,11,12], has so far received little attention in the big data context. In this respect, an important task that is at the heart of this domain is the collection of k-mer statistics, i.e., how many times each sequence of length k over a finite alphabet appears in a set of biological sequences, at a genomic scale. Once such information is available, it can be used to compute many informational and linguistic indices [7,13], and it is also at the basis of de Bruijn graph assemblers and error/repeat detection systems.
Recently, the software tool KCH [14] has been proposed for the linguistic and informational analysis of biological sequences, based on Hadoop [15] and MapReduce [16]. It allows for an efficient collection and analysis of k-mers from a collection of genomic sequences. KCH has been the first tool showing that big data technologies can be superior to highly-optimized shared-memory multiprocessor approaches, even when considering mid-size problem instances. This methodological contribution, combined with the results in [17], gives groundbreaking experimental evidence that big data technologies can be extremely effective for the solution of a broad spectrum of computational problems in the Life Sciences, going from basic primitives to full-fledged analysis and storage pipelines. However, in quantitative terms, that is only a first step towards the acquisition of full knowledge of how big data technologies can affect Computational Biology and Bioinformatics.
In this manuscript we first discuss how the introduction of new distributed technologies for big data processing is affecting the biological domain. Then, we focus on the problem of k-mer counting as a case study, and in particular we present a distributed approach for k-mer counting that extends, and is more efficient than, KCH. The system is called FastKmer, and it has been carefully engineered in order to maximize the potential of Apache Spark [18], the big data framework on which it is based. As a result, to the best of our knowledge, FastKmer is the fastest distributed system available so far for extracting k-mer statistics from large genomic and meta-genomic sequences, using arbitrary values of k. Indeed, particular attention has been paid to enforcing a balanced distribution of the workload of k-mer statistics aggregation tasks over the nodes of a computing cluster. This same approach may be useful in other scenarios involving more general notions than k-mers, like spaced words and seeds (see [19,20] and references therein). The performance of FastKmer has been assessed through a comparative experimental analysis with other distributed k-mer statistics systems over real-world datasets.
The rest of the manuscript is organized as follows. In Section 2, some background for our work is provided. In Section 3, the FastKmer algorithm is described, along with some implementation details. The results of an experimental evaluation of FastKmer, and of an improved version of it, are presented in Section 4, together with an experimental comparison with other distributed systems for the collection of k-mer statistics. Finally, some conclusions and future directions for our work are outlined in Section 5.

Background
Here we first give a brief history of the evolution of distributed computing, then provide some details on the MapReduce programming paradigm, and finally describe the main approaches proposed for k-mer counting in this context.

Computational Power via Commodity Hardware
In the late nineties, in order to cope with an increasing demand for computing power, the design and deployment of data centers went through deep changes, giving birth to the so-called server consolidation process. The big mainframes and supercomputers left their place to small standard 19" racks with many x86 processors. In that period, the new data centers consisted of three separate components: servers, storage and networking. In turn, those racks could be interconnected so as to have a form of parallelism in terms of distributed computing, de facto obtained via commodity hardware. Thanks to their scalability and virtually unlimited computational power, those new networks offered an effective and affordable solution to cope with the challenging requirements posed by the spread of the Internet. To give a few examples: (a) the first versions of the Google web pages index were constructed via a systematic use of an approach to distributed computing based on commodity hardware [21,22]; (b) Apache designed the Hadoop framework with the aim of devising a tool supporting the efficient processing of large datasets, exploiting cluster resources through data partitioning and the MapReduce paradigm. Specifically, Hadoop was originally designed to run on many small servers with local CPU, memory, and storage. The availability of local resources enabled the concept of a physical grid, as a typical (at the time) Non Uniform Memory Access (NUMA) architecture. Communication among nodes was provided by the Hadoop Distributed File System, acting as a sort of common storage where data could be shared (if necessary). This solution was found to be extremely flexible and scalable and, therefore, started to be adopted by many big players.
At the same time, Hadoop was also installed on sets of workstations (commodity hardware), providing unexpected computing power. These kinds of clusters were able to exploit the full power of the Hadoop architecture, providing balanced resource usage and the exploitation of data locality.
Summing up, since their introduction, both MapReduce and Hadoop have become a cornerstone of big data processing. Quite remarkably, the key to their success is that MapReduce programming, supported by Hadoop, provided programmers with a quite convenient environment for developing effective applications, allowing them to focus on the specific task at hand rather than on issues such as synchronization and process-to-process communication, as opposed to what traditional, low-level primitives such as those provided by the MPI standard (Message Passing Interface [23]) or its ancestor, PVM (Parallel Virtual Machine [24]), allowed for. Indeed, within such programming environments, concurrency must be explicitly handled by the programmer and the running program strongly depends on the physical network topology. In the realm of bioinformatics, this point is well illustrated in the Magellan Final Report [25] regarding the collection of k-mer statistics in large meta-genomic datasets: MPI solutions would work, but it was much more convenient to use Hadoop and MapReduce, also considering the availability of higher-level tools like Pig [26,27].

From Commodity Hardware to the Cloud: Converged Architectures
The main drawbacks of the big data processing approaches outlined in the preceding section concerned the inherent complexity of installing software and middleware on many nodes and of managing their configuration changes.
With the beginning of the new millennium, the application of Virtualization [28,29,30] in the big data context promised to help speed up this process. To this end, manufacturers assembled pre-tested components: servers, storage and networking converged in a single rack, in order to speed up the tuning and testing phases. Moreover, the introduction of a new software layer, namely the hypervisor, with the promise of taking care of all hardware details, introduced the concept of the Virtual Machine (VM). Hundreds of VMs became manageable with a single click, and this enabled the design of very large clusters with thousands of computing cores and high amounts of working memory. All these advantages led to converged compute and storage for virtualized environments. To keep this flexibility, VMs may have access to storage areas, represented by a logical partition of a huge disk array, identified by a Logical Unit Number, or LUN, a term that refers, by extension, also to the partition itself. The storage is therefore deployed in Storage Area Networks (SANs): networks, separated from the compute servers, which connect storage resources, typically using fiber optics cabling. Such a solution is very effective but introduces a serious scalability issue, as this connection might become a bottleneck for the entire architecture (from the cluster perspective). Indeed, such an approach is effective for cloud resource providers that offer custom-made Infrastructure as a Service (IaaS), where each infrastructure is virtual and independent of the others. However, it introduces undesirable side effects when applied to distributed architectures such as Hadoop: although not immediately evident, the SAN represents an actual bottleneck in the Hadoop cluster design, because in this context data locality becomes only a logical concept, as storage is fundamentally physically separated from compute. More explicitly, node data locality no longer exists. As a side effect, by adding nodes to the cluster, the I/O channel might suffer from congestion, even with a small number of physical clients: scalability, both in terms of response time and of throughput, may be compromised, independently of any careful design of the program being executed. To face such issues and to restore the concept of data locality as originally meant, local onboard memory caches, sometimes using SSD technology, were specifically introduced on the CPU board. This approach mitigated the problem but increased the management complexity, so that big players, like Amazon, singled out their MapReduce offer by supporting it with highly specialized managed services, i.e., Amazon Elastic MapReduce (Amazon EMR). In such a case, instead of a set of standard VMs, the provider offers a Hadoop cluster with the required resources (number of nodes, VCores, RAM and storage) already configured and tuned up, ready to run. In other words, the initial offer of a virtual infrastructure (IaaS) became a full platform (Platform as a Service, or PaaS for short), adding to the computing resources the required middleware and the skills needed to get the best performance. In some cases, these services include special-purpose and dedicated hardware resources, far from the standard architecture used for Amazon Elastic Web Services. Therefore, such services allow customers to develop, run and manage their distributed applications without being concerned with the complexity of building, tuning and maintaining the infrastructure.

Storage Closer to Compute in the Cloud: Hyperconverged Architectures
In the past five years, to leverage the concept of data locality for distributed clusters, manufacturers introduced the concept of hyperconverged hardware. A new server generation has been designed by assembling CPU, memory and storage (disk or SSD) in a single box, thereby substituting the SAN and its disk arrays with a local distributed storage. This change was supported by VMWare, which added to its hypervisor new capabilities dedicated to Software Defined Storage (SDS), replacing the classic SAN with a Virtual SAN (VSAN), thus gaining a higher level of abstraction, obtained by merging the storage available on each node and allowing for a more granular and flexible segmentation of storage resources. This change was even more effective because redundancy and reliability were handled at the node level instead of in a single storage cabinet. In this way, data locality again became related to the physical distance between stored data and the CPU that should process it, and the whole cluster storage again behaves as a sort of NUMA architecture, allowing nodes to distinguish between locally and remotely stored data. Local data can be accessed through high-speed I/O channels, while remote data can be accessed through slower network connections. Scalability is enforced because adding new boxes (more CPU, RAM and storage) improves the overall I/O performance, providing new local I/O channels without overloading the inter-node network connection. As far as Hadoop is concerned, with this hardware architecture data locality can be exploited again, because each box hosts a set of cluster virtual nodes. All these instances can access their data, which will be kept close to the CPU by the SDS capabilities.

The MapReduce Programming Paradigm and the Middleware Supporting It

The Paradigm
MapReduce [22] is a paradigm for the processing of large amounts of data on a distributed computing infrastructure. Assuming the input data is organized as a set of (key, value) pairs, it is based on the definition of two functions.
The map function processes an input (key, value) pair and returns a (possibly empty) intermediate set of (key, value) pairs. The reduce function merges all the intermediate values sharing the same key to form a (possibly smaller) set of values. These functions are run, as tasks, on the nodes of a distributed computing cluster. All the activities related to the management of the lifecycle of these tasks, as well as the collection of the map results and their transmission to the reduce functions, are transparently handled by the underlying framework (implicit parallelism), with no burden on the programmer side.
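As an illustration of the paradigm, the following plain-Scala sketch (no framework involved; function names are ours) emulates the map/shuffle/reduce flow on a word-count example:

```scala
// Minimal in-memory sketch of the MapReduce contract.
// mapFn emits intermediate (key, value) pairs; reduceFn merges values per key.
def mapFn(line: String): Seq[(String, Int)] =
  line.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

def reduceFn(key: String, values: Seq[Int]): (String, Int) =
  (key, values.sum)

// The framework normally shuffles intermediate pairs by key between the two
// phases; here groupBy plays that role, sequentially and in memory.
def mapReduce(input: Seq[String]): Map[String, Int] =
  input.flatMap(mapFn)                                      // map phase
       .groupBy(_._1)                                       // shuffle by key
       .map { case (k, kvs) => reduceFn(k, kvs.map(_._2)) } // reduce phase
```

In a real deployment, the map and reduce calls would run as independent tasks on different cluster nodes, with the framework handling the shuffle.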

Apache Hadoop
Apache Hadoop is the most popular framework supporting the MapReduce paradigm. It allows for the execution of distributed computations thanks to the interplay of two architectural components: YARN (Yet Another Resource Negotiator) [31] and HDFS (Hadoop Distributed File System) [32]. YARN manages the lifecycle of a distributed application by keeping track of the resources available on a computing cluster, and by allocating them for the execution of application tasks modeled after one of the supported computing paradigms. HDFS is a distributed and block-structured file system, designed to run on commodity hardware and able to provide fault tolerance through data replication.
A basic Hadoop cluster is composed of a single master node and multiple slave nodes. The master node arbitrates the assignment of computational resources to the applications to be run on the cluster, and maintains an index of all the directories and files stored in the HDFS distributed file system. Moreover, it keeps track of the slave nodes physically storing the data blocks making up these files. Slave nodes, in turn, host a set of workers (also called containers), in charge of running the map and reduce tasks of a MapReduce application, as well as of using the local storage to maintain a subset of the HDFS data blocks.
One of the main characteristics of Hadoop is its ability to exploit data-local computing. This term refers to the possibility of moving tasks closer to the data they need to operate on (rather than the opposite). This greatly reduces network congestion and increases the overall throughput of the system when processing large amounts of data. Moreover, in order to reliably maintain files and to properly balance the load between the different nodes of a cluster, large files are automatically split into smaller blocks, replicated, and spread across different nodes.

Apache Spark
Spark is a fast and general distributed system for cluster computing on big data. It provides high-level APIs in multiple languages, such as Scala, Java and Python. It consists of two main blocks: a programming model that creates a dependency graph, and an optimized runtime system which uses this graph to schedule work units on a cluster, and also ships code and data to the worker nodes, where they are processed by executor processes.
Resilient Distributed Datasets At the core of the Spark programming model is the Resilient Distributed Dataset (RDD) abstraction: a fault-tolerant, parallel data structure that can be created and manipulated using a rich set of operators. Programmers start by defining one or more RDDs through transformations of data that originally resides on stable storage, or of other RDDs. Examples of transformations are operations like map, filter or reduceByKey. Transformations are lazy, in the sense that they return a new RDD which depends on the old one, but they are not performed until an action involving that RDD is specified in the pipeline graph. Actions are operations that return a value to the application or export data to a storage system.
Apart from its internal cluster manager, Spark applications can also run on external cluster managers like Hadoop YARN [31] or Mesos [33]. Moreover, a Spark application can run over a distributed file system, e.g., HDFS [32]. This allows each worker node of a cluster to read input data and write output data using a local disk, rather than a remote file server.
Partitions, parallelism, and shuffling By default, Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed data, to optimize transformation operations it creates partitions to hold the data chunks. The number of partitions of an RDD reflects the degree of parallelism (number of tasks) employed by Spark while processing it. When an RDD is created by loading files from HDFS, its number of partitions is equal to the number of input splits of the original file on HDFS. The Spark mechanism for redistributing data across partitions is called shuffling. It occurs when certain transformations, such as groupByKey or reduceByKey, are issued on an RDD and cause data to move across different processes or over the wire (between executors on separate machines). An RDD obtained via a shuffle transformation of another RDD will inherit its number of partitions. However, as far as the choice of a "good" number of partitions is concerned, what is typically desired is to have at least as many partitions as cores. A way to influence this choice is to specify a custom value for the spark.default.parallelism property. Another option is to introduce a custom partitioner. Partitioners are objects that define how the elements of a key-value RDD are partitioned by key. The Spark default partitioner (i.e., HashPartitioner) chooses the partition where to map an element as the Java Object.hashCode value of its key, taken modulo the number of partitions and adjusted to be non-negative. It is also possible to implement a custom partitioner class, defining a custom method for assigning keys to partitions. This feature is useful when, for some reason, the default partitioner causes RDD data to be unevenly distributed across partitions.
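For illustration, the key-to-partition mapping performed by a hash partitioner can be sketched in plain Scala as follows (a simplified reconstruction of the behavior described above, not Spark's actual source code):

```scala
// Map a (possibly negative) hash to a non-negative partition index.
// Plain x % mod would yield negative indices for negative hashes.
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

// Sketch of a default hash-partitioning rule: partition by the key's
// hashCode, modulo the number of partitions (null keys go to partition 0).
def partitionFor(key: Any, numPartitions: Int): Int =
  if (key == null) 0 else nonNegativeMod(key.hashCode, numPartitions)
```

A custom partitioner replaces `partitionFor` with an application-specific rule, which is exactly the hook FastKmer exploits to balance its bins (see below).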

Big Data Based Approaches for the Analysis of Biological Sequence Datasets: The Special Case of k-mer Counting
Modern high-throughput technologies produce large collections of sequence data, and several methodologies have been proposed for their efficient storage and analysis [34,35]. Recently, approaches based on MapReduce and big data technologies have been proposed (see, e.g., [36], and [3] for a complete review on this topic). An important issue in this context is the computation of k-mer statistics, which becomes challenging when sets of sequences at a genomic scale are involved. Due to the importance of this task in several applications (e.g., genome assembly [37] and alignment-free sequence analysis [34,35]), many methods using shared-memory multiprocessor architectures or distributed computing have been proposed. The basic pattern followed by most of these methods is to maintain a shared data structure (typically, a hash table) to be updated according to the k-mers extracted from a collection of input files by one or more concurrent tasks. When memory is not enough to maintain all the extracted k-mers, these can be organized into disjoint partitions and temporarily saved to file without aggregation. Then, they are loaded into memory one partition at a time and summed to return the definitive k-mer statistics.
Here, we provide a summary of the main techniques proposed for k-mer counting in the biological scenario, divided into two main categories: those designed to work on shared-memory and/or multiprocessor systems, and those implemented for distributed systems (the interested reader can find a deeper survey of them in [14]).
Shared Memory, Multi-Processor Systems MSPKmerCounter [38] introduced a disk-based approach where consecutive k-mers are not saved individually but are first compressed, according to a shared minimizer, into a single superkmer. This solution leads to a significant reduction in the amount of data to be temporarily saved on disk and then recovered to memory, thus allowing for a significant performance boost with respect to other algorithms. The notion of minimizer has been refined in KMC2 [39] and, later, in KMC3 [40], with the introduction of k-mer signatures. These are a specialization of minimizers, built with the idea of discouraging an extremely imbalanced partitioning of superkmers among the different bins, while keeping the overall bin sizes as small as possible. An additional contribution provided by these systems is in the counting phase: input superkmers are broken into (k, x)-mers, a compact representation of a sequence of k-mers, and sorted efficiently using a parallel version of radix sort [41].

Distributed Systems
The applicability and scalability of multiprocessor shared-memory architectures is inherently constrained by architectural factors, such as the maximum number of processing cores on a processor and the maximum amount of memory on a single hosting machine. Distributed systems make it possible to overcome these limitations. Indeed, the availability of an arbitrary number of independent computation nodes allows the data structure used to keep the k-mer statistics in memory to be virtually extended to any size, while using the network as a temporary buffer between the extraction phase and the aggregation phase. This is the approach followed by Kmernator [42] and Kmerind [43]. Both these tools are developed as MPI-based parallel applications and are able to handle datasets whose size is proportional to the overall memory of the MPI-based system where they are run. However, the development and management of an in-house MPI-based supercomputing facility is usually very complex and expensive.
An alternative approach that is gaining popularity in the Bioinformatics community is the usage of big data processing frameworks, such as Apache Hadoop and Spark. As already mentioned, these technologies are cheaper and adopt a simpler programming model than MPI. To cite only some examples, BioPig [44] is a Hadoop-based analytic toolkit for the processing of genomic data. It has been developed as an extension of the Pig language which, in turn, offers a set of data querying and transformation primitives that are translated into MapReduce jobs. BioPig includes a module, called pigKmer, which makes it possible to extract and count the k-mers existing in a set of sequences. Each sequence is split into several blocks, saved on the different nodes of a distributed system, with each block being processed by a distinct task. The k-mers extracted this way are then aggregated, using a reduce operation, and finally counted. An alternative distributed k-mer counter is the one provided by ADAM [45], a Spark-based toolkit for exploring genomic data, which follows the same application pattern as BioPig. The algorithmic approach of these two systems is pretty naive, so they are able to process very large genomic sequences, but at the expense of very poor performance.
The first and, to date, the only distributed system able to efficiently extract k-mer statistics from large collections of genomic sequences, with k ≤ 31, is KCH [14]. It is a distributed system based on MapReduce and Hadoop which follows a two-level aggregation strategy. In particular, it first partitions the universe of possible k-mers into a fixed number of bins (291, by default), and then extracts the k-mer counts from a collection of input sequences in two stages of alternate map and reduce tasks. In the first stage, each map task creates a distinct hash table for each bin and updates it with the statistics of the k-mers extracted from a chunk of the input sequences. At the end of this stage, each map task returns its collection of hash tables holding the partial k-mer counts. During the second stage, all the hash tables corresponding to the same bin are aggregated by a reduce task, and the result is saved to file. This strategy is able to significantly reduce the communication overhead between the different nodes of the system, thus allowing for execution times that are up to 100× faster than those of BioPig, when run on fairly large sequences.

Basics
Let Σ be an alphabet and S be a finite set of collections of sequences over Σ. A cumulative statistic records how many times each of the k-mers in Σ^k appears in the collections of sequences in S. Here Σ = {A, C, G, T} and S is a collection of genomes or meta-genomes.
Algorithms for computing k-mer statistics usually share a common structure: they first process the sequences in S from left to right in order to extract all k-mers, and then perform aggregation and evaluation. A naive implementation, in which all k-mers are extracted individually in a sliding-window fashion, is highly redundant in space. Indeed, for an input of n characters, generating all k-mers determines an unfolded sequence of (n − k + 1) · k symbols. Since, by definition, consecutive k-mers along a sequence share k − 1 symbols, it would be beneficial to have a compressed representation of them, where all contiguous k-mers are stored in a compact sequence. Unfortunately, to be able to collect the statistics, especially in a distributed setting where different portions of the input data are processed on physically separated machines, we need a way to keep together all the instances of each unique k-mer for the evaluation phase. A clever solution to this problem is based on the notion of minimizers [46,38,47].
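The naive sliding-window extraction, and the space blow-up quantified above, can be sketched as follows (illustrative code, not part of FastKmer):

```scala
// Naive sliding-window extraction of all k-mers from a sequence.
def allKmers(s: String, k: Int): Seq[String] =
  (0 to s.length - k).map(i => s.substring(i, i + k))

// Total number of symbols in the unfolded k-mer sequence:
// (n - k + 1) * k, versus the n symbols of the original input.
def unfoldedSize(n: Int, k: Int): Int = (n - k + 1) * k
```

For n = 5 and k = 3, the unfolded representation already takes 9 symbols instead of 5; at genomic scale, this redundancy dominates the I/O and network cost.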
Minimizers Given a k-mer s, a minimizer of s is a word of length m (with m fixed a priori) occurring in s. Usually, many consecutive k-mers have the same minimizer and, therefore, can be compressed into a single sequence of more than k symbols, a superkmer, significantly reducing the redundancy. Minimizers may be used to partition k-mers into multiple disjoint sets while retaining adjacent k-mers in the same partition: superkmers can be distributed into different bins according to their related minimizer, ensuring that all the instances of a given k-mer will appear in the same bin (see Section 3.3.1 for further details).
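A minimal sketch of minimizers and of the splitting of a sequence into superkmers follows (our simplified reconstruction: minimizers are taken as plain lexicographic minima, without the canonicalization and signature rules used by the actual tools):

```scala
// Minimizer of a k-mer: its lexicographically smallest substring of length m.
def minimizer(kmer: String, m: Int): String =
  (0 to kmer.length - m).map(i => kmer.substring(i, i + m)).min

// Split a sequence into superkmers: each maximal run of consecutive k-mers
// sharing the same minimizer is stored once, as a single compact substring.
def superkmers(seq: String, k: Int, m: Int): Seq[String] =
  if (seq.length < k) Seq.empty
  else {
    val mins = (0 to seq.length - k).map(i => minimizer(seq.substring(i, i + k), m))
    // positions where the minimizer changes delimit the superkmers
    val breaks = 0 +: (1 until mins.length).filter(i => mins(i) != mins(i - 1)) :+ mins.length
    breaks.sliding(2).map { case Seq(a, b) => seq.substring(a, b - 1 + k) }.toSeq
  }
```

For example, "ACGTACGT" with k = 4 and m = 2 yields three superkmers totaling 14 symbols, against the 20 symbols of the five individually unfolded k-mers.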

k-mer Statistics Collection on Spark: the FastKmer Algorithm
Here FastKmer is described, focusing also on the engineering aspects which make it more efficient than its competitors (e.g., KCH).

Design Overview
The FastKmer algorithm is implemented as the Spark pipeline described in Figure 1 (the figure does not show the initial configuration/parameterization stage). The core of the pipeline consists of two main stages, preceded by a preliminary stage responsible for fetching the input dataset, i.e., a FASTA/FASTQ file [48,49], from HDFS storage, and for delivering all of its blocks to the first stage of the pipeline (leftmost portion of Figure 1). The first stage performs the extraction of superkmers. The second stage computes and collects the k-mer statistics. Both stages are described in detail in the following.

First stage: extracting superkmers
Since consecutive k-mers share k − 1 symbols, an approach that exhaustively extracts all k-mers with a sliding window and propagates them to the next stage leads to high redundancy. To address this issue, the first stage of our approach processes all input sequences so as to guarantee a degree of compression: sequences are broken into superkmers according to their corresponding signatures, which are in turn used to implement a binning mechanism. This is a variant of the Minimum Substring Partitioning (MSP) technique [50,51], which is in turn based on the notion of minimizers. In particular, FastKmer adopts a slightly different notion, that of signatures [39], i.e., canonical minimizers of length m (a tunable parameter) that do not start with AAA or ACA, nor contain AA anywhere except at their beginning. A toy example of the splitting of a sequence into superkmers using signatures is depicted in Figure 2.
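The signature admissibility rule described above can be sketched as follows (our reading of the rule; for brevity, the sketch works on plain strings and omits the canonicalization step, i.e., it does not compare a candidate with its reverse complement):

```scala
// An m-mer is an admissible signature candidate unless it starts with
// AAA or ACA, or contains AA anywhere other than at its very beginning.
def allowedSignature(s: String): Boolean =
  !s.startsWith("AAA") &&
  !s.startsWith("ACA") &&
  s.indexOf("AA", 1) < 0 // "AA" allowed only at position 0

// Signature of a k-mer: its smallest admissible m-length substring,
// if any admissible candidate exists.
def signature(kmer: String, m: Int): Option[String] = {
  val candidates = (0 to kmer.length - m)
    .map(i => kmer.substring(i, i + m))
    .filter(allowedSignature)
  if (candidates.isEmpty) None else Some(candidates.min)
}
```

Excluding such highly repetitive m-mers discourages a few pathologically popular signatures (e.g., poly-A runs) from attracting a disproportionate share of superkmers.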
From signatures to bins Since signatures are based on lexicographic ordering, superkmers having a given signature s are mapped to one of a set of B bins (B being a parameter) using a shift-based integer hash function, thus aiming at a uniform distribution of superkmers over the processing units of the subsequent phase.
The output of the first stage is therefore a sequence of bins, where each bin is described by an integer key in the range {1, . . ., B} and holds a sequence of superkmers. Bins originating from different distributed workers are then automatically aggregated by Spark, based on their key, in an intermediate phase before the next stage (red shuffling phase in Figure 1).
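For illustration, the mapping of an integer-encoded signature to one of the B bins might be sketched as follows (the mixer below is a generic shift-based integer hash in the spirit of well-known Wang/Jenkins-style mixers; FastKmer's actual function is not reproduced here):

```scala
// Generic shift-based integer mixer: scrambles the bits of the input so
// that lexicographically close signatures do not map to close bin indices.
def mix(x0: Int): Int = {
  var x = x0
  x = ((x >>> 16) ^ x) * 0x45d9f3b
  x = ((x >>> 16) ^ x) * 0x45d9f3b
  (x >>> 16) ^ x
}

// Map an integer-encoded signature to one of B bins (non-negative index).
def bin(signatureCode: Int, B: Int): Int =
  (mix(signatureCode) & Int.MaxValue) % B
```

Because signatures are ordered lexicographically, hashing them before the modulo avoids clustering similar signatures into the same few bins.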

Second stage: partitioned counting
The second stage is responsible for the counting phase: thanks to the signature-based binning process, all the instances of a given k-mer are guaranteed to reside in the same bin. Therefore, each bin is processed independently, and all the k-mers contained therein are inserted into a hash table that also maintains their relative counts. After each bin has been processed, the corresponding table is traversed and the counts are stored on HDFS.
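A sketch of the per-bin counting logic follows (illustrative only: a mutable Map stands in for the actual hash table, and superkmers are modeled as plain strings):

```scala
import scala.collection.mutable

// Count the k-mers of a single bin. Since all instances of a k-mer land in
// the same bin, these local counts are already the exact, final counts.
def countKmersInBin(superkmersInBin: Seq[String], k: Int): Map[String, Int] = {
  val table = mutable.Map.empty[String, Int]
  for (sk <- superkmersInBin; i <- 0 to sk.length - k) {
    val kmer = sk.substring(i, i + k)        // unfold the superkmer
    table(kmer) = table.getOrElse(kmer, 0) + 1
  }
  table.toMap
}
```

Since no cross-bin aggregation is needed, bins can be counted fully in parallel, one task per bin, and each result written out as soon as its bin is done.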

Implementation details
FastKmer has been implemented as a Spark application using Scala [52]. Input sequences are read by FastKmer using FASTdoop [53]. FastKmer represents k-mers (and, likewise, superkmers) of arbitrary length using an Array[Long], where the information is encoded in a binary representation. Since the alphabet of valid nucleotides consists of four items, Σ = {A, C, G, T}, a string in Σ^k can be encoded using only two bits per symbol.
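A possible 2-bit packing scheme along these lines is sketched below (illustrative: the actual FastKmer encoding may differ in bit ordering and padding):

```scala
// 2-bit nucleotide codes: A=0, C=1, G=2, T=3; 32 symbols fit in one Long.
def code(c: Char): Long = c match {
  case 'A' => 0L
  case 'C' => 1L
  case 'G' => 2L
  case 'T' => 3L
  case _   => sys.error(s"invalid nucleotide: $c")
}

// Pack a nucleotide string into an Array[Long], 2 bits per symbol.
def pack(s: String): Array[Long] = {
  val words = new Array[Long]((s.length + 31) / 32)
  for (i <- s.indices)
    words(i / 32) |= code(s(i)) << (2 * (i % 32))
  words
}

// Decode `len` symbols back from the packed representation.
def unpack(words: Array[Long], len: Int): String =
  (0 until len).map { i =>
    "ACGT".charAt(((words(i / 32) >>> (2 * (i % 32))) & 3L).toInt)
  }.mkString
```

Such a packing divides the memory footprint of a k-mer by four with respect to one byte per character, and lets equality tests and hashing operate on a handful of Longs.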

Results and Discussion
Here we describe the results of an experimental analysis showing how different choices of the parameters and of the Spark-related configurations may result in different performance of the algorithm presented as a case study.

Setup
Testing Platform The experiments have been performed on the Microsoft Azure Cloud infrastructure. In particular, an 8-node Spark 2.1.0 cluster has been deployed on HDInsight, Microsoft's cloud distribution of the Hadoop ecosystem (Hadoop 2.7.3), based on the Hortonworks Data Platform (HDP). Two cluster nodes act as head nodes and are equipped with an 8-core 2.4 GHz Intel Xeon E5-2673 v3 processor and 28GB of RAM each; the other six nodes are workers, each with two 8-core Intel Xeon E5-2673 v3 processors for a total of 16 cores, 112GB of RAM and an 800GB local SSD disk, for an overall disk capacity of 4.8TB. The job configuration consists of 16 Spark executors with 2 cores each (for a total of 32 workers).

Datasets
The dataset considered here is from [54]. In particular, the SRR094926 run of the SRP004875 SRA study (available on the NCBI Sequence Read Archive) has been used, for a total size of about 125GB (FASTA format). For tuning experiments, a different file containing only the first 32GB of this run has been considered. See Table 1 for summary information.

Experimental Evaluation
Here we present a study of different configurations and parameters, as well as of their main implications on the performance of FastKmer.
Value of k For all experiments, we examine the running time of the k-mer statistics collection task on our datasets for two reference values of k: 28 and 55.
Signature length Preliminary experiments have been performed in order to tune the signature length parameter m (data not shown but available upon request). The result is that small values of m increase the probability that consecutive k-mers share the same minimizer and thus reduce the I/O cost at the end of the first stage. However, if m is too small, the distribution of partition sizes may become skewed, and the largest partition might not fit in memory. On the other hand, a large m makes the distribution of partition sizes more even, at the cost of a higher redundancy (with no compression for m → k).
The assignment which yields, on average, the best performance on the considered datasets is m = 10.This is in line with the results in [39,50] on datasets of comparable characteristics.

Number of bins
We now consider the number of bins B used for the signature binning scheme. At the start of the second stage, each partition contains a number of bins to be processed. Having few bins decreases the overall memory management overhead to be paid at the beginning and at the end of the processing of each bin, but results in few, very large bins, which might exceed the memory available to a worker process. On the other hand, a larger number of bins allows for a better granularity of the distributed execution, as each bin can be processed independently, at the price of an increased memory management overhead.
Spark parallelism Aside from the parameters of the proposed algorithm, there is a Spark-specific parameter which may have an impact on the running time and cluster usage, that is, the Spark parallelism level (p). This parameter corresponds to the number of tasks spawned by Spark (as well as to the number of partitions, see Section 2.2.3), and has a side effect on the number of bins mapped to partitions: if bin numbers are uniformly spread, each task will receive a number of bins that tends to B/p.
On large bins As previously stated, when using the minimizer-based approach, the distribution of superkmers associated to signatures can be very uneven, with small (in terms of lexicographic ordering) or particularly frequent signatures tending to gather a very large fraction of superkmers compared to the other ones. This is partly mitigated by the choice of signatures within a suitably filtered set of canonical minimizers that do not start with common prefixes (see Section 3.3.1), and by the hash-based mapping of signatures to bins. Nevertheless, since the scheme is data-oblivious, it might still produce large bins. In our distributed setting, this is particularly relevant because large bins can introduce bottlenecks, where the running time is dominated by a few workers that take more time than the others, thus leading to a non-optimal utilization of the cluster.
Figure 3 shows the running time of FastKmer for values of B corresponding to powers of two between 512 and 16,384 and parallelism levels ranging from 32 to 512 (corresponding, respectively, to 1 and 16 average tasks per core). As for the number of bins, performance improves consistently for values of B up to 8192, after which we observe no improvement (this holds also for higher values of B, not plotted for legibility). The x-axis shows the parallelism p: again, for both values of k, we see a performance increase when we raise p up to 320 tasks. No improvement, if not a slight deterioration, is noticeable for higher values of p. This is expected: whilst higher parallelism tends to better spread big bins across many partitions, more tasks also determine more scheduling overhead for their start and finalization. Spark tuning guidelines suggest a number of tasks in the range of 2–3 per core: our choice of p leads to 10 tasks per core on our test instance (32 workers), suggesting a possibly heavy scheduling-related overhead.
Intuitively, increasing both B and p mitigates the big bins problem: (i) mapping signatures to more bins potentially spreads big signatures better; (ii) a higher number of tasks better spreads big bins, as work units are more granular and distributed across more tasks. Yet this does not fully remedy the fact that bins can have very different sizes.
A further inspection of the distribution of the single-task running times for low values of B and p showed that some tasks take much longer than others (peaking at as much as 50% cluster underutilization for some configurations). With larger values of B and p, and in particular for the best configuration B = 8192 and p = 320, the problem is indeed mitigated, yet we still observe a single task running at the end of the job for about 5% of the running time.
Another problem is that the optimal values of B and p ultimately depend on the dataset. We wish to have a solution that allows for a degree of adaptability of the algorithm to dataset variability and, more generally, exhibits better load balancing guarantees. In the next section, we explore a promising direction of improvement to address these issues.

Comparative Experimental Analysis
Here we present an experimental comparison of FastKmer against other big data systems for the extraction of k-mer statistics. For each system described in Section 2.3, the runtime configuration suggested by the respective authors has been adapted to the considered testing platform.

Testing Platform
The testing platform used for this analysis is a 33-node Linux-based Hadoop cluster, with one node acting as resource manager and the remaining nodes used as workers. Each node of this cluster is equipped with two 8-core

Running Times
In this test we considered both the 125GB and the 32GB datasets, using k = 28 and k = 55. As presented in Table 2, we were not able to run ADAM on our testing platform because of memory issues. A further investigation revealed that this system extracts k-mers from an input FASTA file by first converting it into another format, through an operation that requires the whole file to be loaded in the main memory of the driver process. This approach, apart from being extremely inefficient, prevents the system from working when the driver process does not have enough memory for this operation (as in our case). As expected, the performance of BioPig is considerably lower than that of FastKmer and KCH, taking more than 10 hours to complete on the 125GB dataset. Indeed, the lack of any aggregation strategy during the k-mer extraction phase and the choice of a standard character-based encoding for the extracted k-mers significantly increase the amount of data to be moved from the extraction phase to the evaluation phase, thus putting a heavy burden on the overall execution time of this system. We now turn to the case of KCH. We recall that this system has been developed to support only values of k smaller than 32. For k = 28, we notice that FastKmer is about 20% faster than KCH when processing the 32GB dataset. The difference becomes even more significant on the 125GB dataset, where FastKmer is about two times faster than KCH. To explain this, consider that KCH aggregates k-mers in bins at a much coarser level and that it lacks a scheduling strategy able to ensure an even distribution of the workload among the nodes of the underlying distributed system.

Scalability
A comparison between FastKmer and KCH with respect to cluster scale and dataset size is presented here. Figure 4 shows the running time comparison of FastKmer and KCH for k = 28 (again, k = 55 is not considered as KCH supports only values up to k = 31). From Figure 5 we see that FastKmer outperforms previous distributed approaches in terms of running time, and scales better for an increasing number of workers.

Coping with Data Skew: a Multiprocessor Scheduling Inspired Partitioner
As previously stated, the unbalanced partitioning of bins observed in our experiments is mainly driven by the minimizer-based scheme. Bins exacerbate this effect even more, as they contain multiple signatures (and possibly many such "big" ones), with the consequence that a few larger bins lead some workers to run for much longer than the others. The standard partitioner of Spark does not come to the rescue, as it maps bin ids to partitions according to their hashCode (Section 2.2.3), and therefore cannot take their size into consideration. The problem of achieving a balanced distribution of the workload induced by bins, while taking into account the actual number of available processing units, can be framed as an instance of the more general Multiprocessor Scheduling (MPS) problem [55]. In this problem, the input consists of m identical machines and n jobs, where each job has a processing time greater than zero. The goal is to assign jobs to machines so as to minimize the maximum load of a machine (i.e., the sum of the processing times of its jobs) which, as all the machines operate in parallel, corresponds to the actual schedule time. Computationally, MPS is NP-hard, therefore FastKmer resorts to a widely used and simple-to-implement heuristic, the Longest Processing Time (LPT) algorithm. LPT first sorts jobs by decreasing processing time and then assigns each of them to the machine with the earliest end time (i.e., the lowest load) so far. This algorithm achieves an upper bound of (4/3 − 1/(3m)) · OPT [56]. In our setting, jobs correspond to bins, their processing time is estimated as the number of k-mers contained in the corresponding bin (the bin size, in the following), and the number of machines is the number of partitions. From the FastKmer viewpoint, the integration of this scheduling algorithm requires two main modifications to the original pipeline. At the beginning of the computation, a new preliminary stage is run to derive an estimation of bin
sizes by examining a sample of the input data (whose size can be specified, and defaults to 1%). This estimate is then used to compute a schedule using LPT. In turn, the resulting schedule is used by a custom partitioner, replacing the default one available in Spark, to map bins to partitions at the end of the first stage.
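The LPT heuristic described above can be sketched as follows (illustrative Python; in FastKmer the resulting assignment backs a custom Spark partitioner, and bin sizes are the sampled k-mer counts):

```python
import heapq

def lpt_schedule(bin_sizes, num_partitions):
    # Longest Processing Time: sort bins by estimated size (descending),
    # then greedily assign each bin to the currently least-loaded
    # partition. Returns a map bin_id -> partition index.
    loads = [(0, p) for p in range(num_partitions)]  # (load, partition)
    heapq.heapify(loads)
    assignment = {}
    for bin_id, size in sorted(bin_sizes.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(loads)       # least-loaded partition
        assignment[bin_id] = p
        heapq.heappush(loads, (load + size, p))
    return assignment
```

A custom partitioner then only needs to look up `assignment[bin_id]` instead of hashing the bin id, so the shuffle places heavy bins on otherwise lightly loaded partitions.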

Granularity of working units
In order to further mitigate the "big bins problem", we also consider a variant allowing only one signature per bin, i.e., using the signature value itself to implement the binning process. In our multiprocessor scheduling analogy, the set of superkmers belonging to a signature represents a job. This choice achieves two major benefits: (i) it gets rid of a parameter (B); (ii) it allows for the finest granularity of work units for the second phase of the task, which proves to be particularly convenient for our custom-partitioner based implementation. The next section reports an analysis of the performance improvement of our algorithm with custom partitioning.

Results
Figure 6 reports a running time comparison between an implementation of FastKmer using the default Spark partitioner (left) and one implementing our custom multiprocessor scheduling-based partitioning scheme (right), run on our 32GB dataset. For our custom partitioner, it further compares two different granularities for the work units: bins and signatures (solid vs. dashed lines). For the bin granularity (B = 8192, resulting from the previous analysis), the impact of the custom partitioner is moderate. For the signature granularity, instead, the improvement of the custom partitioner has a consistently higher margin, suggesting an important impact of the imbalanced signature distribution. On a related note, it can be noticed that the improvement starts at the lowest level of parallelism (1 task per CPU core) and increases up to 128 total tasks (4 tasks per CPU core). After 128 tasks we see no improvement: this is also expected, as the quality of an LPT schedule with respect to the optimal solution decreases with a higher number of machines (in accordance with the LPT bound discussed in Section 4.4).
Based on these results, our default implementation makes use of our custom partitioning scheme, with signature-based binning.

Conclusions and Future Directions
It is worth remarking that the advantages of technologies like Hadoop or Spark for the analysis of big data come at a cost. A naive usage of these technologies may lead to solutions that, although able to run on big data, are inefficient.
In this paper, we have presented FastKmer, an efficient system for the extraction of k-mer statistics from large collections of genomic and metagenomic sequences, for arbitrary values of k. FastKmer is, to the best of our knowledge, the fastest distributed k-mer statistics system to date, not only because it implements a clever algorithm for the extraction and aggregation of k-mers, but also because it has been purposely engineered and tuned so as to extract the most from the underlying Spark framework. This is especially the case for the different strategies we developed for distributing the k-mer aggregation workload over a Spark cluster, which can be used as well in more general Bioinformatics application scenarios.
As a future direction, we observe that the internal architecture of FastKmer has been conceived so as to make it easy to integrate its workflow into more complex data processing pipelines. For instance, consider the case of distributed alignment-free algorithms: these could use FastKmer as a sub-pipeline to extract the k-mers from each sequence of a collection for later comparison.

Figure 4 :
Figure 4: Running time comparison of FK and other algorithms on the 32GB dataset with k fixed to 28 and an increasing number of executors.

Figure 5 :
Figure 5: Running time comparison of FK and other algorithms for various values of k and an increasing number of workers.

Figure 6 :
Figure 6: Comparison of execution times of FastKmer on the 32GB dataset for k ∈ {28, 55}, using the default (left) or the custom (right) MPS-based partitioning method, for an increasing parallelism level. The number of bins is set to 8192. For the custom partitioning scheme, the performance of the signature granularity (as proposed in Section 4.4) is also shown, marked with a dashed line.

Table 1 :
Number of distinct and total k-mers for our datasets.
Algorithm execution times on the 32GB dataset with k ∈ {28, 55} and an increasing number of bins (B) and parallelism level.Best combination shown in bold.

Table 2 :
Running time (minutes) for various distributed k-mer counting algorithms, with a time limit of 10 hours.Dash symbols represent combinations where the value of k is not supported by the algorithm.