The challenge
As mentioned previously, the unbiased miRNA functional enrichment analysis involves the examination of a large number of biological processes (or, equivalently, annotation categories) to identify those, which are more likely to be affected by the gene-targets of a miRNA group. During this type of analysis, both biological processes and miRNAs are represented as gene sets: each biological process is represented by the genes involved in it, while each miRNA by its gene-targets.
It becomes evident that computing the biological process overlap of a miRNA group (see the “Background” section) involves the calculation of the intersection between the set of genes targeted by the miRNA group and the set of genes involved in the biological process. Moreover, the set of genes targeted by each miRNA group needs to be calculated “on the fly” by performing union operations on the gene sets of each miRNA in the group. Therefore, the unbiased miRNA functional enrichment analysis relies on performing a very large number of set unions and intersections. For instance, for a given query miRNA group of size 10, about 10 million unions and more than 8 billion intersections are required to produce a p-value.
The state-of-the-art implementation of the unbiased miRNA functional enrichment analysis [15] uses hash tables (more specifically, Python sets1) to represent gene sets. The advantage of this data structure is that performing union and intersection operations for small sets is usually very fast. Both operations are performed by executing a variant of the hash-join algorithm [17]. On the other hand, hash-join becomes very inefficient when operating on large sets.
Unfortunately, in the case of the unbiased miRNA functional enrichment analysis, all union operations are performed on large gene sets. This is attributed to the fact that each of these gene sets corresponds to the predicted targets of a particular miRNA. Since miRNA target prediction algorithms usually produce hundreds or even thousands of results (interactions) for a single miRNA, it becomes evident that most of the performed union operations can be quite slow if hash-join is used.
To overcome this problem, the bitset (or bit-vector) [17], an alternative data structure, which is more suitable for the representation of large sets, can be used. When sets of genes are implemented as bitsets, unions and intersections between them can be calculated by performing bitwise operations on bit blocks. In particular, bitwise-or can be used to get the union of two sets, while bitwise-and to get their intersection. Such operations are efficient for large sets, since their execution time is not affected by the size of the set2. Additionally, the representation of gene sets as bitsets is more efficient, memory-wise, in the case of relatively large sets of genes (like those produced by miRNA target prediction algorithms).
The calculation of the targets of each miRNA group would benefit greatly by the use of bitwise-or, since, as previously mentioned, it involves a large number of union operations on large gene sets represented by dense bitsets. In this case, bitsets also have a reduced memory footprint compared to hash-tables. On the other hand, gene sets related to biological processes, as provided by Gene Ontology annotations [9], usually consist of a small number of genes. Therefore, hash-join on these sets can be rather efficient3.
The previous discussion suggests that a hybrid solution, using bitwise operations for unions and hash-join for intersections, seems more suitable than both of the aforementioned approaches. Unfortunately, this hybrid approach has a major drawback. The gene sets generated by bitwise-or for all miRNA groups must be provided as input to the hash-join algorithm for the calculation of the biological process overlaps. However, the bitwise-or algorithm produces gene sets represented as bitsets, while hash-join requires its input in the form of hash tables. Therefore, a data structure conversion must be performed, introducing an important execution overhead that counterbalances any gains in efficiency.
The BUFET approach
Our approach, called BUFET, is demonstrated in Fig. 1. It combines the best characteristics of bitset- and hash-table-based methods without suffering from the aforementioned shortcomings of a hybrid approach. It takes advantage of the efficiency of bitwise-or in calculating the union of large sets to produce the gene sets targeted by particular miRNA groups. These gene sets are represented as bitsets, called miRNA group bitsets.
Meanwhile, the biological process overlap of each miRNA group is calculated as follows: for each gene annotated as part of the biological process, the respective bit in the miRNA group bitset is examined. If the bit is set, then the value of a counter is increased by one (its value is initially zero). Otherwise, the value of the counter remains intact. After all genes related to the biological process have been considered, the value of the counter provides the size of the intersection and, subsequently, the biological process overlap. Since the genes are used to probe the miRNA group bitset, we refer to this method as bit-probing.
Further optimizations were introduced in order to achieve additional performance improvements. First, biological processes that have no common genes with the miRNA group under examination can be excluded from the analysis (since no interference by the miRNAs in the group with the process is recorded). Additionally, BUFET supports full utilization of multi-core computing systems by supporting parallelization at the biological process level.
It should be noted that parallel execution is also supported by the state-of-the-art approach presented in [15]. However, in contrast to the use of multiprocessing adopted by this approach to implement parallelization, BUFET uses multithreading. The advantage of multithreading over multiprocessing is that all processes running in parallel have access to the same part of the main memory. This eliminates the need to copy data across processes, thus reducing the execution time and memory footprint.
On the other hand, an issue with this approach is that the bitsets containing the targets of the random miRNA groups have to be calculated and stored in main memory. This step is necessary, so that every thread is able to access the data in order to calculate a p-value. Consequently, this increases the memory footprint, although, the amount of memory required does not pose a big challenge for contemporary computers. More specifically, none of the many real-world analysis scenarios examined during our experiments resulted in the allocation of more than 3.5 GB of RAM to our script.
Functionality and source code
BUFET is provided as a free, open source software licensed under GPL v3 (a download link is provided in the “Availability and requirements” section). Its core is implemented in C++ for greater efficiency, while a Python wrapper script facilitates its execution and its incorporation in existing bioinformatics workflows.
The input of the BUFET software consists mainly of two CSV files: one containing miRNA-to-gene interactions and another containing associations of biological functions with particular genes. The proper format of these files is described in the software download page. It should be noted that BUFET provides flexibility, enabling the users to upload miRNA-to-gene interactions based on the prediction algorithm of their choice (e.g., TargetScan [8], DIANA-microT [5, 6], miRanda [7], etc.) and to use biological function annotations collected by their preferred source (e.g., GO [9], KEGG [10, 11], or PANTHER [12]).
Finally, BUFET also performs Benjamini-Hochberg FDR correction [18]. More specifically, following the method in [19], we assume that 5% (and 1%) of the produced p-values (under the 0.05 threshold) are false positives, while the rest are significant results. P-values significant at FDR 0.05 are marked with “*” while p-values significant at 0.01 are marked with “**” in the output file.