NormiRazor: tool applying GPU-accelerated computing for determination of internal references in microRNA transcription studies

Background Multi-gene expression assays are an attractive tool in revealing complex regulatory mechanisms in living organisms. Normalization is an indispensable step of data analysis in all those studies, since it removes unwanted, non-biological variability from data. In targeted qPCR assays it is typically performed with respect to prespecified reference genes, but the lack of robust strategy of their selection is reported in literature, especially in studies concerning circulating microRNAs (miRNA). Unfortunately, this problem impedes translation of scientific discoveries on miRNA biomarkers into widely available laboratory assays. Previous studies concluded that averaged expressions of multi-miRNA combinations are more stable references than single genes. However, due to the number of such combinations the computational load is considerable and may be hindering for objective reference selection in large datasets. Existing implementations of normalization algorithms (geNorm, NormFinder and BestKeeper) have poor performance and may require days to compute stability values for all potential reference as the evaluation is performed sequentially. Results We designed NormiRazor - an integrative tool which implements those methods in a parallel manner on a graphics processing unit (GPU) using CUDA platform. We tested our approach on publicly available miRNA expression datasets. As a result, the times of executions on 8 datasets containing from 50 to 400 miRNAs (subsets of GSE68314) decreased 18.7 ±0.6 (mean ±SD), 104.7 ±4.2 and 76.5 ±2.2 times for geNorm, BestKeeper and NormFinder with respect to previous Python implementation. To allow for easy access to normalization pipeline for biomedical researchers we implemented NormiRazor as an online platform where a user could normalize their datasets based on the automatically selected references. It is available at norm.btm.umed.pl, together with instruction manual and exemplary datasets. Conclusions NormiRazor allows for an easy, informed choice of reference genes for qPCR transcriptomic studies. As such it can improve comparability and repeatability of experiments and in longer perspective help translate newly discovered biomarkers into readily available assays.

(Continued from previous page) Conclusions: NormiRazor allows for an easy, informed choice of reference genes for qPCR transcriptomic studies. As such it can improve comparability and repeatability of experiments and in longer perspective help translate newly discovered biomarkers into readily available assays.

Background
Despite the boom of next-generation sequencing techniques, quantitative RT-PCR (qPCR) assays remain a popular method for evaluating gene expression. Furthermore, due to recent advances in design of highly specific probes this method is becoming effective for analysis of short non-coding RNAs [1]. Although such experiments are easy to conduct and non-expensive, they are burdened with a problem of unwanted, non-biological variability in obtained data. Such variability comes mainly from inequality of RNA concentrations, slight differences in samples' handling and devices' operation and changes in experimental environment. Since it cannot be avoided at the prelaboratory or laboratory levels, this undesirable component of variation should be removed by a data pre-processing operation, called normalization [2]. Otherwise, those inaccuracies could shadow true biologically relevant differences. Many normalization strategies have been already suggested, but the problem is still considered open to new proposals, especially in those areas where intensive studies have been started only recently [3,4]. One of such fields, that prompted our group to review available normalization algorithms, is research on circulating microRNAs (miRNAs) as potential biomarkers [5,6]. In past years, circulating miRNAs were shown as very promising diagnostic tools in cancer research [7,8], radiation exposure [9], diabetes [10], cardiology [11] and numerous other areas. Problems with finding optimal normalization strategy often pose a major obstacle in translating results of research into clinically applicable diagnostic tools [12,13]. In such studies, the first stage usually consists in screening the whole miRNA expression profile in order to identify those miRNAs, which are expressed differentially in the studied condition in comparison to the control group. At this stage, normalization is best performed using a statistical characteristics of samples (e.g. mean expression). It is routinely done when the data comes from high-throughput qPCR [14] or RNA sequencing [15]. However, the approach described above is impossible to apply when the qPCR assay targets only few differentially expressed miRNAs (or genes). The mean expression cannot be used as reference in this scenario as the miRNAs are chosen on the basis of betweengroup differences. Therefore, in such situations, normalization must be performed with respect to endo-or exogenous control genes or total input RNA [16]. Exogenous controls are the RNAs (synthetic or coming from other species) that are artificially introduced to the sample in known amounts for the purpose of quality control of consecutive cycles of PCR reaction and normalization procedures. In contrast, endogenous reference genes originate from a tissue sample itself and are selected for their known and experimentally validated, stable expression, in most cases unchanged by pathological conditions or other clinical factors. In miRNA expression experiments either some stably expressed miRNAs or other small RNAs are chosen as those references. However, there is still no consensus on choosing proper internal references for qPCR miRNA studies, especially in biofluids [17,18].
Literature review [3,4,19,20] revealed that the most commonly used method of selecting reference miRNAs is the application of algorithms that were previously designed for identification of housekeeping genes for RNA tissue expression studies: geNorm [21], NormFinder [22] and BestKeeper [23]. GeNorm is an iterative procedure that finds optimal normalizers on the basis of pairwise variation of candidate reference genes. In each iteration a stability score is calculated for all candidates and the one scoring the worst is removed. Then, the procedure is repeated until only 2 genes remain. Best-Keeper is based on the assumption that the geometric mean of all stably expressed genes in particular experiment (called Best Keeper Index, BKI) provides good or even optimal reference for this experiment. Thus, if there is a reason to choose only one or few reference genes they should be the ones that correlate the strongest with BKI. NormFinder is based on the mathematical model of intra-and intergroup variability and aims at finding such reference genes that are characterized by no or limited variation. In multi-group experimental designs (eg. patients affected by particular disease and healthy controls), that are typical for biomedical sciences, NormFinder treats the variability within groups and between groups as two distinct and equally important constituents of overall variation. Therefore, the final stability measure takes both into account.
In these methods potential reference genes are ranked according to their respective stability scores and the most stable one is applied as a normalization factor. Stability, in this context, is understood as the invariability of expression of genes or microRNAs between experiments that are subject to changing technical aspects and whose samples may represent different pathological conditions or clinical factors. Although this approach is commonly used (if any systematic procedure of finding reference genes is applied at all), it has recently been questioned by members of our research team [24] as an inadequate method of selecting normalization factors.
A new algorithm was therefore proposed that instead of analyzing only single miRNAs, also calculates stability scores for all of their 2-and 3-element combinations in geNorm (GN), NormFinder (NF or NFG when samples' groups are provided) and BestKeeper (BK). It was proven that combination-based normalizers are statistically better than single reference genes in terms of stability [4]. Moreover, it was showed that in many cases the most stable reference is formed by averaging expression of 2 or 3 miRNAs that do not necessarily score the highest when analyzed separately [24]. While the gain in finding more stable references is evident, the major obstacle in practical application of this procedure is its computational cost that grows with the number of miRNAs in a dataset and the number of elements in a combination. Initial multi-thread implementation in Python [24] revealed the problem of long execution times for larger datasets which could not be solved in a fully satisfactory way by CPU parallelization. Following the observation that computations for each combination are independent from one another, we decided to reimplement the proposed algorithm to run on graphical processing units (GPUs). With this idea in mind, we designed a high-speed parallel processing software platform for unbiased combinatorial reference gene selection for normalization of expression data -NormiRazor.

Implementation for combination-based references
Following [24], we redesigned the three most popular normalization algorithms in the transcriptomic community, geNorm, BestKeeper and NormFinder, to facilitate calculations of rankings for combination-based references. At the same time, we reimplemented them in Compute Unified Device Architecture (CUDA) model and encapsulated in a self-contained web application called NormiRazor.
The comprehensive combination-based method of identifying reference circulating miRNAs, described in the "Background" section, consists of several operations in the following order: 1 Reading and preprocessing input data. Step 1 consists of reading data (gene expression and samples' group designation), checking format correctness and handling missing values. For further analysis we only consider the miRNA detected in all samples, as they are most likely to not produce missing values in the next targeted experiment. In step 2 single genes are ranked from the most to the least stable one. Additionally, to reduce computation times for larger datasets (e.g. from NGS), we limit the analysis to 250 genes with the highest stability. From the remaining genes, step 3 creates the list of all combinations of given length that serves later as a queue of tasks distributed for parallel processing on a GPU. Step 4 assesses stability of all combinations and, as the most computationally expensive one, is executed on a GPU. The exact number of combinations (N) generated for a given set is dependent on the number of elements (miRNAs) in that set and can be calculated using the formula: where n is the number of elements in a set and r is the length of a combination. Finally, steps 5-6 summarize the results and allow the user to normalize their data to one of the recommended references. This whole process is also illustrated in Fig. 1a.

Performance optimization
Firstly, the algorithm generates a list of all potential combination-based references. The only optimization for this part with respect to previous Python implementation was rewriting it as a C++ recursive function. Similarly, the aggregation of all previously generated results does not perform very intensive computations, so it is also executed on CPU.
Calculation of stability indices for combinations presented the greatest area of optimization in the form of: Mathematical optimization consisted mainly in limiting operations repeated in every iteration to only one execution. Formal algebraic description of the algorithms and some minor modifications introduced by us is presented in detail in Additional File 1. It was also applicable to Python implementation, but it still have not resulted in achievement of reasonably short execution times that researchers looking for reference miRNA would accept. Thus, observing that calculation for each combination are relatively simple, totally independent from one another and identical in terms of computations and differ only by data, we decided to utilize GPU with its SIMD (single instruction, multiple data) processing paradigm. CUDA technology was chosen due to its established position in research applications, especially in bioinformatics and widespread availability of CUDA-enabled GPUs. The final part of the optimization is the previously described limiting of the number of potential miRNAs to 250. This number is sufficient for a typical dataset containing 150-250 miRNA, detectable in a majority of samples and as such suitable to serve as reference genes. Still, the dataset of such size generates from 551,300 to 2,573,000 unique 3-miRNA combination to analyze.
We divided the main part of the application into three separate paths, each of them corresponding to one of proposed normalization algorithms. We then implemented them as CUDA C kernels embedded into a C++ application (scheme in Fig. 1b), as to achieve maximal performance improvement. In order to further optimize the application, we decided to implement most of the components of the entire procedure on our own with parameters tailored to our needs rather than using more general third-party solutions.

Results generation in NormiRazor
At the beginning the stability values for single miRNAs are calculated by three normalization algorithms. Since the outputs of those methods are not comparable, for the purpose of selection of the best individual microRNAs, they are normalized to the range [0, 1]. Then, the program calculates their average across the algorithms and on this basis identifies the list of up to 250 most stable candidates that are later used to form the combination-based references.
Combination stability calculation is performed on a GPU and starts with generation of combination-based reference by averaging expression of 2 or 3 miRNAs from the above mentioned list. When stability of each newly formed potential reference is calculated, this combination is treated as an additional gene, appended to the original expression dataset. Resulting stability of a combination and its ranking in the dataset is saved for the comparison with all other combinations of miRNAs. The combinations are sorted by their ranking and then their stability values, in result of which a separate list is generated for each normalization algorithm. New standardized rankings from value range [0,1] are assigned to the combinations based on their position in sorted list, generated by particular algorithm. The results of all algorithms are aggregated by averaging the rankings from all normalization algorithms applied and the list of the best normalizers based on the outcome of this procedure is generated. We decided to present 10 recommended combinations (top 5 for 2-element combinations and top 5 for 3-element combinations) to the user and then allow them to normalize their dataset with one of these reference sets. The output file is kept in the same format as for the input file to allow for easy inclusion of our normalization strategy into existing pipelines. The complete stability rankings for all combinations are also available for download.
Finally, having identified the need for a readily available tool for users with limited experience in programming, we designed graphical user interface (GUI) for our algorithm that is available on our institution's server at norm.btm.umed.pl.

Benchmark
We decided that the main measure of efficiency and user-friendliness of our application would simply be total execution time, since it is the most discernible for the end user and ultimately affects their experience the most. As a way of providing an insight into the particular parts of operation of the application, we also measured the execution times of some distinct parts of the algorithm, including purely computational one, named "kernel" in the "Results" section. We performed time of execution tests on datasets either with varying number of miRNAs (test 1) or samples (test 2). In the first set of tests, we used subsets of expression set GSE68314, available in Gene Expression Omnibus (GEO), and we adjusted the number of miRNAs from 50 to 400 with a step of 50, while we always included all 8 available samples. Benchmark with a varying number of samples (10 to 50, step of 10) was based on another GEO dataset -GSE90828. In this second benchmark we included 156 miRNAs without missing data. For both tests, separate experiments were performed for 2-and 3-element references (mean expression of 2 or 3 miRNAs, respectively). Each measurement was repeated 5 times and the result is presented as mean with standard deviation.
For the purpose of testing the performance improvement of the proposed computing method, we employed two benchmarking platforms. The system specifications of both platforms are presented in Suppl. Table 1 (Additional File 2). We used the Python algorithms implementations from [24] as a baseline for our comparisons.
Finally, an example of NormiRazor practical application is presented where we attempted a biologically-oriented validation of references found by our application. It is based on the experiment described in [25]. Dataset GSE121513, which contains miRNA expression profiles obtained by qPCR of 95 neuroblastoma samples with and without amplification of MYCN gene, was used. MYCN is a transcription factor that binds to miR-17-92 promoter and is known to affect expression of a few miRNAs. According to TransmiR v2.0 [26], experimentally validated targets of MYCN are: mir-106a, mir-17, mir-18a, mir-19a, mir-19b, mir-20a and mir-92a. Thus, we can expect that their expression should differ between MYCN-amplified and MYCN-non-amplified samples and observing such difference indicates that correct normalization was applied. We tested this assumption with data normalized to different references, including the ones found by NormiRazor.
All datasets used in tests are publicly available and were generated by qPCR panel assays. Thus, the data were expressed as qPCR quantification cycles C q that reflect number of replication cycles after which each miRNA becomes detectable; thus, lower values of C q indicates higher miRNA expression [2]. Fig. 2. We observed that the total execution time for 3-miRNA references and kernel execution time can be accurately modeled by following relations (for both implementations):

Comparison of total execution time between CUDA and Python implementation for 3-miRNA references is shown in
• BestKeeper and NormFinder: with N being the number of miRNAs in a dataset, L C the number of miRNAs in a combination-based reference and a regression coefficient. Division by 100 was included to limit the range of values produced after exponentiation. Curves generated in accordance to the above models are shown in Fig. 2.
For the purpose of quantitative comparison, we defined a speed-up gained by CUDA implementation as: where a Py and a CUDA are coefficients from models for Python and CUDA implementations, respectively. Standard deviation (SD) of the speed-up was calculated according to [27] as: Obtained speed-ups for 3-miRNA combinations were the highest in case of BestKeeper with execution of whole algorithm 100 times faster in CUDA than in Python and the lowest in geNorm with the program still running almost 20 times faster than before (Table  1 and Suppl. Table 2 in Additional File 2). In the cases of BestKeeper and both versions of NormFinder computational cores (kernels) of algorithms were sped up from about 6000 to almost 22000 times, depending on particular setting, while in the case of geNorm the kernel speed-up was comparable with the total speed-up (Suppl. Fig. 1 in Additional File 2). A difference of such magnitude was caused by the distribution of execution time between particular operations. From tables in Fig. 3, one can observe that GPU-related operations still take the majority of the execution time of geNorm, which is not the case for BestKeeper and NormFinder. Therefore, the speed-up of the two latter methods affects the whole execution speed-up to a smaller extent than in case of geNorm. The computational core takes the majority of time of GPU operations only in geNorm, while in BestKeeper and NormFinder greater fraction of time is needed for memory allocation and copying data than for actual calculations (Fig. 3). Distribution of execution time between all (CPU and GPU) operations is presented in Suppl. Fig. 2 and for comparison similar plot for Python implementation is shown in Suppl. Fig. 3 in Additional File 2.
Analogical analysis was performed with varying number of samples in dataset (test 2) and the results are shown in Suppl. Fig. 4 and 5 in Additional File 2. We also performed Fig. 3 Distribution of execution time between GPU-related operations in CUDA implementations for 3-element normalizers. Tables below the graphs indicate percentage of the total execution time (CPU+GPU) dedicated to operations on GPU. Test 1 on platform 1. cudaMalloc: memory allocation on GPU, memcpyHtD: copying data from RAM to GPU memory, memcpyDtH: copying data from GPU memory to RAM a comparison of efficiency of GPUs in two benchmark platforms and we obtained very similar kernel execution times in both of them (Suppl. Fig. 6 in Additional File 2).
Our next goal was to check the validity of identified references when they are applied to real data. Figure 4 presents the distribution of qPCR C q values (mean-centered) after different normalization protocols were applied to the data. In panel A, no normalization was applied, thus sample means and range of values were clearly different between samples. The next two panels (B and C) present the effects of normalization to the best 3-miRNA-based reference (averaged expression of miR-1976, miR-30d-5p and miR-214-3p) and normalization to the mean of miRNAs detected in all samples, respectively. Those 2 approaches equalized distributions of C q , such that distributions in all samples became very similar. Additionally, methods B and C do not produce discernibly different results; thus, they may be equally useful. In panels D (original reference 1) and E (original reference 2), references which were proposed in [28] by the authors of the chosen dataset, were used. Finally, panel F shows detrimental consequences of using a very poor reference (mean of miR-144-3p, miR-1909-3p, miR-451a, identified as the worst by NormiRazor) and highlights how a poor choice in this regard may spawn false positive findings.
Effect of normalization methods on variability of miRNA expression between samples was presented in Fig. 5. We observed that normalization to mean expression (gray curve) and normalization to a reference identified by our algorithm (blue curve) produced very similar distributions of expression variances. Moreover, both of them as well as normalization to the original reference 1 (solid green line) were significantly different than lack of normalization plotted in red (p<0.0001 in Kolmogorov-Smirnov test for all comparisons). Standard deviation of expression was on average lowered after of those  normalizations, excluding purposefully selected worst reference; thus, we conclude that all of these approaches effectively reduce variability. Original reference 2, on the contrary, does not produce distribution different from the one obtained without any normalization (p=0.1201). It is also clear that normalization to poor reference introduces additional amount of variability and is actually worse than lack of normalization (p<0.0001).
Finally, we demonstrated applicability of references found by NormiRazor in differential expression analysis on the exemplary neuroblastoma dataset (GSE121513). The results of this biologically-oriented validation are presented in Fig. 6. We calculated the fold changes (FC) of miRNA expression between MYCN-amplified and non-amplified neuroblastoma samples after applying different normalization schemes. Figure 6 shows FC values for miRNAs that are known to be affected by MYCN and thus we expected that their expression should be different between two types of neuroblastoma. Whenever statistically significant (FDR-corrected p<0.05) difference was detected, we placed an asterisk above the respective bar. When no normalization was employed, none of miRNAs of interest were found differentially expressed. When a poor reference was tested, only one difference became significant and obtained fold-changes were highly variable. Differential expression analyses after all proper normalizations detected differential expression of 7 out of 9 experimentally validated miRNAs (miR-19a, miR-17-5p, miR-18a, miR-18a*, miR-19b, miR-20a, miR-92). Thus, it must be stated that a proper normalization (either to mean expression or the internal reference) is essential for conclusions of the study.

Discussion
In the present study we proposed a new GPU-based computational approach to the recently published [24] extensive algorithm searching for reference miRNAs. Since the algorithm is based on an analysis of averaged expression of 2 or 3 miRNA as potential reference, it is highly computationally demanding. We identified that majority of calculations are easily parallelizable and can be implemented on a GPU. In our testing environment we observed that the GPU version runs from 20 up to more than 100 times faster than a previous multicore Python implementation. Speed-ups in such a range seem to be typical for GPU reimplementations of bioinformatics algorithms that previously run on CPU. For instance, GBOOST -the tool for a gene-gene interaction analysis -achieved a 40fold speed-up with respect to its CPU predecessor BOOST [29]. Even better results were reported by the authors of CUDA-miRanda that is described as 166-times faster than miRanda [30]. Obviously, interpreting such results, one must take into account characteristics of a particular testing platform (model of CPU and GPU), however, it is still clear that in many applications GPU computing is beneficial.
In the case of searching for reference genes/miRNA, the execution time of an algorithm is an essential part of the user experience. Since finding the reference is only a small, although important part of gene expression experiment, upon which following actions depend, no researcher would accept waiting for the results for an extended period. The longest computational section in our application, namely calculating stabilities for 3-element references by geNorm, takes about 100 seconds with a list of 250 candidate miRNAs, which was set as a technical limit in the final web application. Thus, taking into account the queuing of tasks and aggregation of results, reference miRNAs could be found in about 15 min, which we believe to be an acceptable waiting period.
Another important aspect to consider is the quality of identified references. Previous work of our team showed that averaged expression of 2 or 3 miRNA is a more stable reference than any single miRNA with the measure of stability defined as a score from BK, GN and NF [24]. In this study, we additionally showed that references found by NormiRazor effectively reduce variation of miRNA expression. The comparison with the normalization with respect to the mean of all miRNA (considered sometimes as the most reliable one [31]) did not show significant differences in the variability reduction capability.
Many studies stress that normalization strongly affects interpretation of experimental results [3,18,32]. Considering the last stage of our validation, we fully agree with this statement. Differential miRNA expression, that was expected to be observed due to MYCN amplification, was detected when normalization was performed by our algorithm and by the method proposed in the article, upon which this part of validation analysis was based [25]. When our approach is applied those differences are even more pronounced, so actually a careful, statistical choice of reference miRNA may be even better than the normalization to the mean expression. It seems contradictory to the results reported by Mestdagh et al. [25] who argued that the normalization to mean is the most effective method. This may be caused by them comparing their solution only with some limited set of candidate small non-coding RNA (in majority not belonging to the miRNA family), while we scanned all miRNAs expressed in all samples together with their 2-and 3-element combinations.
The entire computing suite is available as a web application. Distributed in this form it is independent of users' computer hardware and software and it does not require the possession of any CUDA-enabled GPU. Compared with previous solutions, it definitely has several advantages. Initial implementations of geNorm, BestKeeper and NormFinder are already about 15 years old and are not directly compatible with new software. First version of geNorm as VBA applet for Microsoft Excel is no longer available, since according to its website it is "no longer compatible with latest versions of Excel, slow, buggy, difficult to use" (https://genorm.cmgg.be/). Currently geNorm is implemented in commercial qbase+ software, which however is not freely available to the research community. Similarly, BestKeeper cannot be downloaded from its original source. Besides, BestKeeper analysis was originally limited to at most 10 candidate genes that with currently available computational resources is no longer a reasonable limitation. NormFinder macro for Microsoft Excel is the only one that is still available and working even with newest version of Microsoft Office suite. Although it would not be effective for extensive search in combination-based approach, we must acknowledge its stability, that is relatively uncommon in case of bioinformatics tools [33]. Some newer implementations have been developed for all the above-mentioned algorithms. NormFinder was rewritten in R and became part of some packages including NormqPCR and SlqPCR. GeNorm and Best-Keeper are included in ctrlGene R package. GeNorm belongs also to NormqPCR. They are all in use, but they require R programming skills from their users, who often are biologists or medical doctors rather than data scientists. RefFinder [34] is the only freely available GUI tool that combines all 3 algorithms together. A broad review of software packages for identification of reference genes states that all implementations produce consistent results [35], however they are still capable of analyzing only single genes and not their combinations.

Conclusions
Our tool provides researchers working with gene expression data with an easy to use, fast resource for reference miRNAs/genes selection. Its intended and tested field of application are quantitative RT-PCR studies of miRNA expression, however, we are convinced that it can be applied in other omic studies that need to account for technical variability between the samples, similarly to original geNorm, BestKeeper and NormFinder.