NormiRazor: tool applying GPU-accelerated computing for determination of internal references in microRNA transcription studies

Grabia, Szymon; Smyczynska, Urszula; Pagacz, Konrad; Fendler, Wojciech

doi:10.1186/s12859-020-03743-8

Software
Open access
Published: 29 September 2020

NormiRazor: tool applying GPU-accelerated computing for determination of internal references in microRNA transcription studies

Szymon Grabia^1,2^na1,
Urszula Smyczynska¹^na1,
Konrad Pagacz^1,3 &
…
Wojciech Fendler^1,4

BMC Bioinformatics volume 21, Article number: 425 (2020) Cite this article

2262 Accesses
16 Citations
Metrics details

Abstract

Background

Multi-gene expression assays are an attractive tool in revealing complex regulatory mechanisms in living organisms. Normalization is an indispensable step of data analysis in all those studies, since it removes unwanted, non-biological variability from data. In targeted qPCR assays it is typically performed with respect to prespecified reference genes, but the lack of robust strategy of their selection is reported in literature, especially in studies concerning circulating microRNAs (miRNA). Unfortunately, this problem impedes translation of scientific discoveries on miRNA biomarkers into widely available laboratory assays. Previous studies concluded that averaged expressions of multi-miRNA combinations are more stable references than single genes. However, due to the number of such combinations the computational load is considerable and may be hindering for objective reference selection in large datasets. Existing implementations of normalization algorithms (geNorm, NormFinder and BestKeeper) have poor performance and may require days to compute stability values for all potential reference as the evaluation is performed sequentially.

Results

We designed NormiRazor - an integrative tool which implements those methods in a parallel manner on a graphics processing unit (GPU) using CUDA platform. We tested our approach on publicly available miRNA expression datasets. As a result, the times of executions on 8 datasets containing from 50 to 400 miRNAs (subsets of GSE68314) decreased 18.7 ±0.6 (mean ±SD), 104.7 ±4.2 and 76.5 ±2.2 times for geNorm, BestKeeper and NormFinder with respect to previous Python implementation. To allow for easy access to normalization pipeline for biomedical researchers we implemented NormiRazor as an online platform where a user could normalize their datasets based on the automatically selected references. It is available at norm.btm.umed.pl, together with instruction manual and exemplary datasets.

Conclusions

NormiRazor allows for an easy, informed choice of reference genes for qPCR transcriptomic studies. As such it can improve comparability and repeatability of experiments and in longer perspective help translate newly discovered biomarkers into readily available assays.

Background

Despite the boom of next-generation sequencing techniques, quantitative RT-PCR (qPCR) assays remain a popular method for evaluating gene expression. Furthermore, due to recent advances in design of highly specific probes this method is becoming effective for analysis of short non-coding RNAs [1]. Although such experiments are easy to conduct and non-expensive, they are burdened with a problem of unwanted, non-biological variability in obtained data. Such variability comes mainly from inequality of RNA concentrations, slight differences in samples’ handling and devices’ operation and changes in experimental environment. Since it cannot be avoided at the prelaboratory or laboratory levels, this undesirable component of variation should be removed by a data pre-processing operation, called normalization [2]. Otherwise, those inaccuracies could shadow true biologically relevant differences. Many normalization strategies have been already suggested, but the problem is still considered open to new proposals, especially in those areas where intensive studies have been started only recently [3, 4]. One of such fields, that prompted our group to review available normalization algorithms, is research on circulating microRNAs (miRNAs) as potential biomarkers [5, 6]. In past years, circulating miRNAs were shown as very promising diagnostic tools in cancer research [7, 8], radiation exposure [9], diabetes [10], cardiology [11] and numerous other areas. Problems with finding optimal normalization strategy often pose a major obstacle in translating results of research into clinically applicable diagnostic tools [12, 13]. In such studies, the first stage usually consists in screening the whole miRNA expression profile in order to identify those miRNAs, which are expressed differentially in the studied condition in comparison to the control group. At this stage, normalization is best performed using a statistical characteristics of samples (e.g. mean expression). It is routinely done when the data comes from high-throughput qPCR [14] or RNA sequencing [15].

However, the approach described above is impossible to apply when the qPCR assay targets only few differentially expressed miRNAs (or genes). The mean expression cannot be used as reference in this scenario as the miRNAs are chosen on the basis of between-group differences. Therefore, in such situations, normalization must be performed with respect to endo- or exogenous control genes or total input RNA [16]. Exogenous controls are the RNAs (synthetic or coming from other species) that are artificially introduced to the sample in known amounts for the purpose of quality control of consecutive cycles of PCR reaction and normalization procedures. In contrast, endogenous reference genes originate from a tissue sample itself and are selected for their known and experimentally validated, stable expression, in most cases unchanged by pathological conditions or other clinical factors. In miRNA expression experiments either some stably expressed miRNAs or other small RNAs are chosen as those references. However, there is still no consensus on choosing proper internal references for qPCR miRNA studies, especially in biofluids [17, 18].

Literature review [3, 4, 19, 20] revealed that the most commonly used method of selecting reference miRNAs is the application of algorithms that were previously designed for identification of housekeeping genes for RNA tissue expression studies: geNorm [21], NormFinder [22] and BestKeeper [23]. GeNorm is an iterative procedure that finds optimal normalizers on the basis of pairwise variation of candidate reference genes. In each iteration a stability score is calculated for all candidates and the one scoring the worst is removed. Then, the procedure is repeated until only 2 genes remain. BestKeeper is based on the assumption that the geometric mean of all stably expressed genes in particular experiment (called Best Keeper Index, BKI) provides good or even optimal reference for this experiment. Thus, if there is a reason to choose only one or few reference genes they should be the ones that correlate the strongest with BKI. NormFinder is based on the mathematical model of intra- and intergroup variability and aims at finding such reference genes that are characterized by no or limited variation. In multi-group experimental designs (eg. patients affected by particular disease and healthy controls), that are typical for biomedical sciences, NormFinder treats the variability within groups and between groups as two distinct and equally important constituents of overall variation. Therefore, the final stability measure takes both into account.

In these methods potential reference genes are ranked according to their respective stability scores and the most stable one is applied as a normalization factor. Stability, in this context, is understood as the invariability of expression of genes or microRNAs between experiments that are subject to changing technical aspects and whose samples may represent different pathological conditions or clinical factors. Although this approach is commonly used (if any systematic procedure of finding reference genes is applied at all), it has recently been questioned by members of our research team [24] as an inadequate method of selecting normalization factors.

A new algorithm was therefore proposed that instead of analyzing only single miRNAs, also calculates stability scores for all of their 2- and 3-element combinations in geNorm (GN), NormFinder (NF or NFG when samples’ groups are provided) and BestKeeper (BK). It was proven that combination-based normalizers are statistically better than single reference genes in terms of stability [4]. Moreover, it was showed that in many cases the most stable reference is formed by averaging expression of 2 or 3 miRNAs that do not necessarily score the highest when analyzed separately [24]. While the gain in finding more stable references is evident, the major obstacle in practical application of this procedure is its computational cost that grows with the number of miRNAs in a dataset and the number of elements in a combination. Initial multi-thread implementation in Python [24] revealed the problem of long execution times for larger datasets which could not be solved in a fully satisfactory way by CPU parallelization. Following the observation that computations for each combination are independent from one another, we decided to reimplement the proposed algorithm to run on graphical processing units (GPUs). With this idea in mind, we designed a high-speed parallel processing software platform for unbiased combinatorial reference gene selection for normalization of expression data - NormiRazor.

Implementation

Implementation for combination-based references

Following [24], we redesigned the three most popular normalization algorithms in the transcriptomic community, geNorm, BestKeeper and NormFinder, to facilitate calculations of rankings for combination-based references. At the same time, we reimplemented them in Compute Unified Device Architecture (CUDA) model and encapsulated in a self-contained web application called NormiRazor.

The comprehensive combination-based method of identifying reference circulating miRNAs, described in the “Background” section, consists of several operations in the following order:

1
Reading and preprocessing input data.
2
Calculation of stability indices for single genes by geNorm, BestKeeper and NormFinder.
3
Determination of all possible 2- and 3-element combinations of input miRNAs.
4
Calculation of stability indices for combinations by the same algorithms and creating rankings of combinations according to their stability.
5
Aggregation of results from all normalization algorithms and choice of recommended references among combinations.
6
Normalization of input dataset with respect to selected reference (on user request).

Step 1 consists of reading data (gene expression and samples’ group designation), checking format correctness and handling missing values. For further analysis we only consider the miRNA detected in all samples, as they are most likely to not produce missing values in the next targeted experiment. In step 2 single genes are ranked from the most to the least stable one. Additionally, to reduce computation times for larger datasets (e.g. from NGS), we limit the analysis to 250 genes with the highest stability. From the remaining genes, step 3 creates the list of all combinations of given length that serves later as a queue of tasks distributed for parallel processing on a GPU. Step 4 assesses stability of all combinations and, as the most computationally expensive one, is executed on a GPU. The exact number of combinations (N) generated for a given set is dependent on the number of elements (miRNAs) in that set and can be calculated using the formula:

$$\begin{array}{@{}rcl@{}} N = \frac{n!}{r! (n-r)!} \end{array} $$

(1)

where n is the number of elements in a set and r is the length of a combination. Finally, steps 5-6 summarize the results and allow the user to normalize their data to one of the recommended references. This whole process is also illustrated in Fig. 1a.

Performance optimization

Firstly, the algorithm generates a list of all potential combination-based references. The only optimization for this part with respect to previous Python implementation was rewriting it as a C++ recursive function. Similarly, the aggregation of all previously generated results does not perform very intensive computations, so it is also executed on CPU.

Calculation of stability indices for combinations presented the greatest area of optimization in the form of:

1
Reducing redundant parts of calculations to a single execution (mathematical optimization).
2
Parallel implementation on CUDA-enabled GPU.
3
Limitation of miRNAs set size by the exclusion of those that are the least probable to form stable combinations.

Mathematical optimization consisted mainly in limiting operations repeated in every iteration to only one execution. Formal algebraic description of the algorithms and some minor modifications introduced by us is presented in detail in Additional File 1. It was also applicable to Python implementation, but it still have not resulted in achievement of reasonably short execution times that researchers looking for reference miRNA would accept. Thus, observing that calculation for each combination are relatively simple, totally independent from one another and identical in terms of computations and differ only by data, we decided to utilize GPU with its SIMD (single instruction, multiple data) processing paradigm. CUDA technology was chosen due to its established position in research applications, especially in bioinformatics and widespread availability of CUDA-enabled GPUs. The final part of the optimization is the previously described limiting of the number of potential miRNAs to 250. This number is sufficient for a typical dataset containing 150-250 miRNA, detectable in a majority of samples and as such suitable to serve as reference genes. Still, the dataset of such size generates from 551,300 to 2,573,000 unique 3-miRNA combination to analyze.

We divided the main part of the application into three separate paths, each of them corresponding to one of proposed normalization algorithms. We then implemented them as CUDA C kernels embedded into a C++ application (scheme in Fig. 1b), as to achieve maximal performance improvement. In order to further optimize the application, we decided to implement most of the components of the entire procedure on our own with parameters tailored to our needs rather than using more general third-party solutions.

Results generation in NormiRazor

At the beginning the stability values for single miRNAs are calculated by three normalization algorithms. Since the outputs of those methods are not comparable, for the purpose of selection of the best individual microRNAs, they are normalized to the range [0, 1]. Then, the program calculates their average across the algorithms and on this basis identifies the list of up to 250 most stable candidates that are later used to form the combination-based references.

Combination stability calculation is performed on a GPU and starts with generation of combination-based reference by averaging expression of 2 or 3 miRNAs from the above mentioned list. When stability of each newly formed potential reference is calculated, this combination is treated as an additional gene, appended to the original expression dataset. Resulting stability of a combination and its ranking in the dataset is saved for the comparison with all other combinations of miRNAs. The combinations are sorted by their ranking and then their stability values, in result of which a separate list is generated for each normalization algorithm. New standardized rankings from value range [0,1] are assigned to the combinations based on their position in sorted list, generated by particular algorithm. The results of all algorithms are aggregated by averaging the rankings from all normalization algorithms applied and the list of the best normalizers based on the outcome of this procedure is generated. We decided to present 10 recommended combinations (top 5 for 2-element combinations and top 5 for 3-element combinations) to the user and then allow them to normalize their dataset with one of these reference sets. The output file is kept in the same format as for the input file to allow for easy inclusion of our normalization strategy into existing pipelines. The complete stability rankings for all combinations are also available for download.

Finally, having identified the need for a readily available tool for users with limited experience in programming, we designed graphical user interface (GUI) for our algorithm that is available on our institution’s server at norm.btm.umed.pl.

Benchmark

We decided that the main measure of efficiency and user-friendliness of our application would simply be total execution time, since it is the most discernible for the end user and ultimately affects their experience the most. As a way of providing an insight into the particular parts of operation of the application, we also measured the execution times of some distinct parts of the algorithm, including purely computational one, named “kernel” in the “Results” section. We performed time of execution tests on datasets either with varying number of miRNAs (test 1) or samples (test 2). In the first set of tests, we used subsets of expression set GSE68314, available in Gene Expression Omnibus (GEO), and we adjusted the number of miRNAs from 50 to 400 with a step of 50, while we always included all 8 available samples. Benchmark with a varying number of samples (10 to 50, step of 10) was based on another GEO dataset - GSE90828. In this second benchmark we included 156 miRNAs without missing data. For both tests, separate experiments were performed for 2- and 3-element references (mean expression of 2 or 3 miRNAs, respectively). Each measurement was repeated 5 times and the result is presented as mean with standard deviation.

For the purpose of testing the performance improvement of the proposed computing method, we employed two benchmarking platforms. The system specifications of both platforms are presented in Suppl. Table 1 (Additional File 2). We used the Python algorithms implementations from [24] as a baseline for our comparisons.

Finally, an example of NormiRazor practical application is presented where we attempted a biologically-oriented validation of references found by our application. It is based on the experiment described in [25]. Dataset GSE121513, which contains miRNA expression profiles obtained by qPCR of 95 neuroblastoma samples with and without amplification of MYCN gene, was used. MYCN is a transcription factor that binds to miR-17-92 promoter and is known to affect expression of a few miRNAs. According to TransmiR v2.0 [26], experimentally validated targets of MYCN are: mir-106a, mir-17, mir-18a, mir-19a, mir-19b, mir-20a and mir-92a. Thus, we can expect that their expression should differ between MYCN-amplified and MYCN-non-amplified samples and observing such difference indicates that correct normalization was applied. We tested this assumption with data normalized to different references, including the ones found by NormiRazor.

All datasets used in tests are publicly available and were generated by qPCR panel assays. Thus, the data were expressed as qPCR quantification cycles C _q that reflect number of replication cycles after which each miRNA becomes detectable; thus, lower values of C _q indicates higher miRNA expression [2].

Results

Comparison of total execution time between CUDA and Python implementation for 3-miRNA references is shown in Fig. 2. We observed that the total execution time for 3-miRNA references and kernel execution time can be accurately modeled by following relations (for both implementations):

geNorm:
$$\begin{array}{@{}rcl@{}} t = a \left(\frac{N}{100} \right)^{L_{C}+2} \end{array} $$
(2)
Fig. 2
Comparison of total execution time of Python and CUDA implementations. Analysis of 3-element normalizers on dataset with varying number of miRNAs. Points represent mean and SD of 5 repeated experiments (SD is usually almost negligible), curves are model predictions. Benchmark on platform 1
Full size image
BestKeeper and NormFinder:
$$\begin{array}{@{}rcl@{}} t = a \left(\frac{N}{100} \right)^{L_{C}} \end{array} $$
(3)

with N being the number of miRNAs in a dataset, L _C the number of miRNAs in a combination-based reference and a regression coefficient. Division by 100 was included to limit the range of values produced after exponentiation. Curves generated in accordance to the above models are shown in Fig. 2.

For the purpose of quantitative comparison, we defined a speed-up gained by CUDA implementation as:

$$\begin{array}{@{}rcl@{}} SP = \frac{a_{Py}}{a_{CUDA}} \end{array} $$

(4)

where a _Py and a _CUDA are coefficients from models for Python and CUDA implementations, respectively. Standard deviation (SD) of the speed-up was calculated according to [27] as:

$$\begin{array}{@{}rcl@{}} SD(SP) = \frac {\sqrt{ SD \left(a_{Py}^{2} \right) + {{\left(\frac{a_{Py}}{a_{CUDA}}\right)}^{2}} SD \left(a_{CUDA}\right)^2 }} {a_{CUDA}} \end{array} $$

(5)

Obtained speed-ups for 3-miRNA combinations were the highest in case of BestKeeper with execution of whole algorithm 100 times faster in CUDA than in Python and the lowest in geNorm with the program still running almost 20 times faster than before (Table 1 and Suppl. Table 2 in Additional File 2). In the cases of BestKeeper and both versions of NormFinder computational cores (kernels) of algorithms were sped up from about 6000 to almost 22000 times, depending on particular setting, while in the case of geNorm the kernel speed-up was comparable with the total speed-up (Suppl. Fig. 1 in Additional File 2). A difference of such magnitude was caused by the distribution of execution time between particular operations. From tables in Fig. 3, one can observe that GPU-related operations still take the majority of the execution time of geNorm, which is not the case for BestKeeper and NormFinder. Therefore, the speed-up of the two latter methods affects the whole execution speed-up to a smaller extent than in case of geNorm. The computational core takes the majority of time of GPU operations only in geNorm, while in BestKeeper and NormFinder greater fraction of time is needed for memory allocation and copying data than for actual calculations (Fig. 3). Distribution of execution time between all (CPU and GPU) operations is presented in Suppl. Fig. 2 and for comparison similar plot for Python implementation is shown in Suppl. Fig. 3 in Additional File 2.

Table 1 Speed-up (ratio of execution times) gained by the CUDA implementation with respect to the previous Python version. Benchmark on platform 1

Full size table

Analogical analysis was performed with varying number of samples in dataset (test 2) and the results are shown in Suppl. Fig. 4 and 5 in Additional File 2. We also performed a comparison of efficiency of GPUs in two benchmark platforms and we obtained very similar kernel execution times in both of them (Suppl. Fig. 6 in Additional File 2).

Our next goal was to check the validity of identified references when they are applied to real data. Figure 4 presents the distribution of qPCR C _q values (mean-centered) after different normalization protocols were applied to the data. In panel A, no normalization was applied, thus sample means and range of values were clearly different between samples. The next two panels (B and C) present the effects of normalization to the best 3-miRNA-based reference (averaged expression of miR-1976, miR-30d-5p and miR-214-3p) and normalization to the mean of miRNAs detected in all samples, respectively. Those 2 approaches equalized distributions of C _q, such that distributions in all samples became very similar. Additionally, methods B and C do not produce discernibly different results; thus, they may be equally useful. In panels D (original reference 1) and E (original reference 2), references which were proposed in [28] by the authors of the chosen dataset, were used. Finally, panel F shows detrimental consequences of using a very poor reference (mean of miR-144-3p, miR-1909-3p, miR-451a, identified as the worst by NormiRazor) and highlights how a poor choice in this regard may spawn false positive findings.

Effect of normalization methods on variability of miRNA expression between samples was presented in Fig. 5. We observed that normalization to mean expression (gray curve) and normalization to a reference identified by our algorithm (blue curve) produced very similar distributions of expression variances. Moreover, both of them as well as normalization to the original reference 1 (solid green line) were significantly different than lack of normalization plotted in red (p<0.0001 in Kolmogorov-Smirnov test for all comparisons). Standard deviation of expression was on average lowered after of those normalizations, excluding purposefully selected worst reference; thus, we conclude that all of these approaches effectively reduce variability. Original reference 2, on the contrary, does not produce distribution different from the one obtained without any normalization (p=0.1201). It is also clear that normalization to poor reference introduces additional amount of variability and is actually worse than lack of normalization (p<0.0001).

Finally, we demonstrated applicability of references found by NormiRazor in differential expression analysis on the exemplary neuroblastoma dataset (GSE121513). The results of this biologically-oriented validation are presented in Fig. 6. We calculated the fold changes (FC) of miRNA expression between MYCN-amplified and non-amplified neuroblastoma samples after applying different normalization schemes. Figure 6 shows FC values for miRNAs that are known to be affected by MYCN and thus we expected that their expression should be different between two types of neuroblastoma. Whenever statistically significant (FDR-corrected p<0.05) difference was detected, we placed an asterisk above the respective bar. When no normalization was employed, none of miRNAs of interest were found differentially expressed. When a poor reference was tested, only one difference became significant and obtained fold-changes were highly variable. Differential expression analyses after all proper normalizations detected differential expression of 7 out of 9 experimentally validated miRNAs (miR-19a, miR-17-5p, miR-18a, miR-18a*, miR-19b, miR-20a, miR-92). Thus, it must be stated that a proper normalization (either to mean expression or the internal reference) is essential for conclusions of the study.

Discussion

In the present study we proposed a new GPU-based computational approach to the recently published [24] extensive algorithm searching for reference miRNAs. Since the algorithm is based on an analysis of averaged expression of 2 or 3 miRNA as potential reference, it is highly computationally demanding. We identified that majority of calculations are easily parallelizable and can be implemented on a GPU. In our testing environment we observed that the GPU version runs from 20 up to more than 100 times faster than a previous multicore Python implementation. Speed-ups in such a range seem to be typical for GPU reimplementations of bioinformatics algorithms that previously run on CPU. For instance, GBOOST – the tool for a gene-gene interaction analysis – achieved a 40-fold speed-up with respect to its CPU predecessor BOOST [29]. Even better results were reported by the authors of CUDA-miRanda that is described as 166-times faster than miRanda [30]. Obviously, interpreting such results, one must take into account characteristics of a particular testing platform (model of CPU and GPU), however, it is still clear that in many applications GPU computing is beneficial.

In the case of searching for reference genes/miRNA, the execution time of an algorithm is an essential part of the user experience. Since finding the reference is only a small, although important part of gene expression experiment, upon which following actions depend, no researcher would accept waiting for the results for an extended period. The longest computational section in our application, namely calculating stabilities for 3-element references by geNorm, takes about 100 seconds with a list of 250 candidate miRNAs, which was set as a technical limit in the final web application. Thus, taking into account the queuing of tasks and aggregation of results, reference miRNAs could be found in about 15 min, which we believe to be an acceptable waiting period.

Another important aspect to consider is the quality of identified references. Previous work of our team showed that averaged expression of 2 or 3 miRNA is a more stable reference than any single miRNA with the measure of stability defined as a score from BK, GN and NF [24]. In this study, we additionally showed that references found by NormiRazor effectively reduce variation of miRNA expression. The comparison with the normalization with respect to the mean of all miRNA (considered sometimes as the most reliable one [31]) did not show significant differences in the variability reduction capability.

Many studies stress that normalization strongly affects interpretation of experimental results [3, 18, 32]. Considering the last stage of our validation, we fully agree with this statement. Differential miRNA expression, that was expected to be observed due to MYCN amplification, was detected when normalization was performed by our algorithm and by the method proposed in the article, upon which this part of validation analysis was based [25]. When our approach is applied those differences are even more pronounced, so actually a careful, statistical choice of reference miRNA may be even better than the normalization to the mean expression. It seems contradictory to the results reported by Mestdagh et al. [25] who argued that the normalization to mean is the most effective method. This may be caused by them comparing their solution only with some limited set of candidate small non-coding RNA (in majority not belonging to the miRNA family), while we scanned all miRNAs expressed in all samples together with their 2- and 3-element combinations.

The entire computing suite is available as a web application. Distributed in this form it is independent of users’ computer hardware and software and it does not require the possession of any CUDA-enabled GPU. Compared with previous solutions, it definitely has several advantages. Initial implementations of geNorm, BestKeeper and NormFinder are already about 15 years old and are not directly compatible with new software. First version of geNorm as VBA applet for Microsoft Excel is no longer available, since according to its website it is “no longer compatible with latest versions of Excel, slow, buggy, difficult to use” (https://genorm.cmgg.be/). Currently geNorm is implemented in commercial qbase+ software, which however is not freely available to the research community. Similarly, BestKeeper cannot be downloaded from its original source. Besides, BestKeeper analysis was originally limited to at most 10 candidate genes that with currently available computational resources is no longer a reasonable limitation. NormFinder macro for Microsoft Excel is the only one that is still available and working even with newest version of Microsoft Office suite. Although it would not be effective for extensive search in combination-based approach, we must acknowledge its stability, that is relatively uncommon in case of bioinformatics tools [33]. Some newer implementations have been developed for all the above-mentioned algorithms. NormFinder was rewritten in R and became part of some packages including NormqPCR and SlqPCR. GeNorm and BestKeeper are included in ctrlGene R package. GeNorm belongs also to NormqPCR. They are all in use, but they require R programming skills from their users, who often are biologists or medical doctors rather than data scientists. RefFinder [34] is the only freely available GUI tool that combines all 3 algorithms together. A broad review of software packages for identification of reference genes states that all implementations produce consistent results [35], however they are still capable of analyzing only single genes and not their combinations.

Conclusions

Our tool provides researchers working with gene expression data with an easy to use, fast resource for reference miRNAs/genes selection. Its intended and tested field of application are quantitative RT-PCR studies of miRNA expression, however, we are convinced that it can be applied in other omic studies that need to account for technical variability between the samples, similarly to original geNorm, BestKeeper and NormFinder.

Availability and requirements

Project name: NormiRazorProject home page:https://norm.btm.umed.plOperating system(s): Platform independentProgramming language: C++, CUDA, Python, PHPLicense: GNU GPLSource code:https://git.btm.umed.pl/zbimt/normirazorAny restrictions to use by non-academics: None

Availability of data and materials

Data used in the study were downloaded from Gene Expression Omnibus and appropriate accession numbers of datasets are referenced in text. Availability of software is described above in “Availability and requirements” section.

Abbreviations

BK:: BestKeeper
CPU:: Central processing unit
C _q :: Quantification cycle in qPCR
CUDA:: Compute unified device architecture
FC:: Fold change
FDR:: False discovery rate
GEO:: Gene expression omnibus
GN:: GeNorm
GPU:: Graphics processing unit
GUI:: Graphical user interface
miRNA:: microRNA
NF:: NormFinder
NFG:: NormFinder with samples’groups assignment
PCR:: Polymerase chain reaction
qPCR:: Real-time quantitative polymerase chain reaction
RT-PCR:: Reverse transcription polymerase chain reaction
SD:: Standard deviation
SP:: Speed-up
SIMD:: Single instruction, multiple data

References

Jensen SG, Lamy P, Rasmussen MH, Ostenfeld MS, Dyrskjøt L, Ørntoft TF, Andersen CL. Evaluation of two commercial global miRNA expression profiling platforms for detection of less abundant miRNAs. BMC Genomics. 2011; 12:435. https://doi.org/10.1186/1471-2164-12-435.
Article CAS PubMed PubMed Central Google Scholar
Livak KJ, Schmittgen TD. Analysis of Relative Gene Expression Data Using Real-Time Quantitative PCR and the 2- ΔΔCT Method. Methods. 2001; 25(4):402–8. https://doi.org/10.1006/meth.2001.1262.
Article CAS PubMed Google Scholar
Faraldi M, Gomarasca M, Sansoni V, Perego S, Banfi G, Lombardi G. Normalization strategies differently affect circulating miRNA profile associated with the training status. Sci Rep. 2019; 9(1):1–13. https://doi.org/10.1038/s41598-019-38505-x.
Article CAS Google Scholar
Drobna M, Szarzynska-Zawadzka B, Daca-Roszak P, Kosmalska M, Jaksik R, Witt M, Dawidowska M. Identification of Endogenous Control miRNAs for RT-qPCR in T-Cell Acute Lymphoblastic Leukemia. Int J Mol Sci. 2018; 19:2858. https://doi.org/10.3390/ijms19102858.
Article PubMed Central CAS Google Scholar
Fendler W, Malachowska B, Meghani K, Konstantinopoulos PA, Guha C, Singh VK, Chowdhury D. Evolutionarily conserved serum microRNAs predict radiation-induced fatality in nonhuman primates. Sci Transl Med. 2017; 9(379):1–12. https://doi.org/10.1126/scitranslmed.aal2408.
Article CAS Google Scholar
Elias KM, Fendler W, Stawiski K, Fiascone SJ, Vitonis AF, Berkowitz RS, Frendl G, Konstantinopoulos P, Crum CP, Kedzierska M, Cramer DW, Chowdhury D. Diagnostic potential for a serum miRNA neural network for detection of ovarian cancer. eLife. 2017; 6:1–28. https://doi.org/10.7554/eLife.28932.
Article Google Scholar
Nakamura K, Sawada K, Yoshimura A, Kinose Y, Nakatsuka E, Kimura T. Clinical relevance of circulating cell-free microRNAs in ovarian cancer. Mol Cancer. 2016; 15(1):48. https://doi.org/10.1186/s12943-016-0536-0.
Article PubMed PubMed Central CAS Google Scholar
Ogata-Kawata H, Izumiya M, Kurioka D, Honma Y, Yamada Y, Furuta K, Gunji T, Ohta H, Okamoto H, Sonoda H, Watanabe M, Nakagam H, Yokota J, Kohno T, Tsuchiya N. Circulating exosomal micrornas as biomarkers of colon cancer. PLoS ONE. 2014; 9(4):92921. https://doi.org/10.1371/journal.pone.0092921.
Article CAS Google Scholar
Małachowska B, Tomasik B, Stawiski K, Kulkarni S, Guha C, Chowdhury D, Fendler W. Circulating microRNAs as Biomarkers of Radiation Exposure: a systematic review and meta-analysis,. Int J Radiat Oncol Biol Phys. 2019; 106(2):390–402. https://doi.org/10.1016/j.ijrobp.2019.10.028.
Article PubMed CAS Google Scholar
Satake E, Pezzolesi MG, Dom ZIM, Smiles AM, Niewczas MA, Krolewski AS. Circulating miRNA profiles associated with hyperglycemia in patients with type 1 diabetes. Diabetes. 2018; 67(5):1013–23. https://doi.org/10.2337/db17-1207.
Article CAS PubMed PubMed Central Google Scholar
Nonaka CKV, Macêdo CT, Cavalcante BRR, De Alcântara AC, Silva DN, Bezerra MDR, Caria ACI, Tavora FRF, Neto JDDS, Noya-Rabelo MM, Rogatto SR, Dos Santos RR, Souza BSDF, Soares MBP. Circulating miRNAs as potential biomarkers associated with cardiac remodeling and fibrosis in chagas disease cardiomyopathy. Int J Mol Sci. 2019; 20(16):1–16. https://doi.org/10.3390/ijms20164064.
Article CAS Google Scholar
Tiberio P, Callari M, Angeloni V, Daidone MG, Appierto V. Challenges in using circulating miRNAs as cancer biomarkers. BioMed Res Int. 2015; 2015:731479.
Article PubMed PubMed Central CAS Google Scholar
Witwer KW, Halushka MK. Toward the promise of microRNAs - Enhancing reproducibility and rigor in microRNA research. RNA Biology. 2016; 13(11):1103–16. https://doi.org/10.1080/15476286.2016.1236172.
Article PubMed PubMed Central Google Scholar
Mohammadian A, Mowla SJ, Elahi E, Tavallaei M, Nourani MR, Liang Y. Normalization of miRNA qPCR high-throughput data: A comparison of methods. Biotechnol Lett. 2013; 35(6):843–51. https://doi.org/10.1007/s10529-013-1150-5.
Article CAS PubMed Google Scholar
Zyprych-Walczak J, Szabelska A, Handschuh L, Górczak K, Klamecka K, Figlerowicz M, Siatkowski I. The impact of normalization methods on RNA-Seq data analysis. BioMed Res Int. 2015; 2015:621690. https://doi.org/10.1155/2015/621690.
Article CAS PubMed PubMed Central Google Scholar
Schwarzenbach H, Nishida N, Calin GA, Pantel K. Clinical relevance of circulating cell-free microRNAs in cancer. Nat Rev Clin Oncol. 2014; 11(3):145–56. https://doi.org/10.1038/nrclinonc.2014.5.
Article CAS PubMed Google Scholar
Marabita F, De Candia P, Torri A, Tegnér J, Abrignani S, Rossi RL. Normalization of circulating microRNA expression data obtained by quantitative real-time RT-PCR. Brief Bioinform. 2016; 17(2):204–12. https://doi.org/10.1093/bib/bbv056.
Article PubMed Google Scholar
Corral-Vazquez C, Blanco J, Salas-Huetos A, Vidal F, Anton E. Normalization matters: tracking the best strategy for sperm miRNA quantification. Mol Hum Reprod. 2017; 23(1):45–53. https://doi.org/10.1093/molehr/gaw072.
Article CAS PubMed Google Scholar
Peltier HJ, Latham GJ. Normalization of microRNA expression levels in quantitative RT-PCR assays: Identification of suitable reference RNA targets in normal and cancerous human solid tissues. RNA. 2008; 14(5):844–52. https://doi.org/10.1261/rna.939908.
Article CAS PubMed PubMed Central Google Scholar
Sauer E, Madea B, Courts C. An evidence based strategy for normalization of quantitative PCR data from miRNA expression analysis in forensically relevant body fluids. Forensic Sci Int Genet. 2014; 11(1):174–81. https://doi.org/10.1016/j.fsigen.2014.03.011.
Article CAS PubMed Google Scholar
Vandesompele J, De Preter K, Pattyn I, Poppe B, Van Roy N, De Paepe A, Speleman R. Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes. Genome Biol. 2002; 3(7):0034–1003411. https://doi.org/10.1186/gb-2002-3-7-research0034 http://arxiv.org/abs/1465-6906.
Article Google Scholar
Andersen CL, Ledet-Jensen J, Orntoft TF. Normalization of Real-Time quantitative reverse transcription- PCR data: a model-based variance estimation approach to identify genes suited for normalization, applied to bladder and colon cancer data sets. Cancer Res. 2004; 64:5245–50. https://doi.org/10.1158/0008-5472.CAN-04-0496.
Article CAS PubMed Google Scholar
Pfaffl MW, Tichopad A, Prgomet C, Neuvians TP. Determination of stable housekeeping genes, differentially regulated target genes and sample integrity: BestKeeper - Excel-based tool using pair-wise correlations. Biotechnol Lett. 2004; 26(6):509–15. https://doi.org/10.1023/B:BILE.0000019559.84305.47.
Article CAS PubMed Google Scholar
Pagacz K, Kucharski P, Smyczynska U, Grabia S, Chowdhury D, Fendler W. A systemic approach to screening high-throughput RT-qPCR data for a suitable set of reference circulating miRNAs. BMC Genomics. 2020; 21(1):111. https://doi.org/10.1186/s12864-020-6530-3.
Article CAS PubMed PubMed Central Google Scholar
Mestdagh P, Van Vlierberghe P, De Weer A, Muth D, Westermann F, Speleman F, Vandesompele J. A novel and universal method for microRNA RT-qPCR data normalization. Genome Biol. 2009; 10(6):64. https://doi.org/10.1186/gb-2009-10-6-r64.
Article CAS Google Scholar
Tong Z, Cui Q, Wang J, Zhou Y. TransmiR v2.0: An updated transcription factor-microRNA regulation database. Nucleic Acids Res. 2019; 47(D1):253–8. https://doi.org/10.1093/nar/gky1023.
Article CAS Google Scholar
Dunlap WP, Silver NC. Confidence intervals and standard errors for ratios of normal variables. Behav Res Methods Instrum Comput. 1986; 18(5):469–71.
Article Google Scholar
Schlosser K, McIntyre LA, White RJ, Stewart DJ. Customized internal reference controls for improved assessment of circulating MicroRNAs in disease. PLoS ONE. 2015; 10(5):1–22. https://doi.org/10.1371/journal.pone.0127443.
Article CAS Google Scholar
Yung LS, Yang C, Wan X, Yu W. GBOOST: A GPU-based tool for detecting gene-gene interactions in genome-wide case control studies. Bioinformatics. 2011; 27(9):1309–10. https://doi.org/10.1093/bioinformatics/btr114.
Article CAS PubMed PubMed Central Google Scholar
Wang S, Kim J, Jiang X, Brunner SF, Ohno-Machado L. GAMUT: GPU accelerated microRNA analysis to uncover target genes through CUDA-miRanda. BMC Med Genet. 2014; 7:9. https://doi.org/10.1186/1755-8794-7-S1-S9.
Article CAS Google Scholar
D’haene B, Mestdagh P, Hellemans J, Vandesompele J. miRNA Expression Profi ling: from reference genes to global mean normalization. In: Next-Generation MicroRNA Expression Profi Ling Technology: Methods and Protocols, Methods in Molecular Biology vol. 822. Totowa: Humana Press: 2012. p. 261–72. https://doi.org/10.1007/978-1-61779-427-8. Chap. 18.
Google Scholar
Matoušková P, Bártíková H, Boušová I, Hanušová V, Szotáková B, Skálová L. Reference genes for real-time PCR quantification of messenger RNAs and microRNAs in mouse model of obesity. PLoS ONE. 2014; 9(1):1–11. https://doi.org/10.1371/journal.pone.0086033.
Article CAS Google Scholar
Mangul S, Mosqueiro T, Abdill RJ, Duong D, Mitchell K, Sarwal V, Hill B, Brito J, Littman RJ, Statz B, Lam AKM, Dayama G, Grieneisen L, Martin LS, Flint J, Eskin E, Blekhman R. Challenges and recommendations to improve the instability and archival stability of omics computational tools. PLoS Biology. 2019; 17(6):1–16. https://doi.org/10.1371/journal.pbio.3000333.
Article CAS Google Scholar
Xie F, Xiao P, Chen D, Xu L, Zhang B. miRDeepFinder: A miRNA analysis tool for deep sequencing of plant small RNAs. Plant Mol Biol. 2012; 80(1):75–84. https://doi.org/10.1007/s11103-012-9885-2.
Article CAS Google Scholar
De Spiegelaere W, Dern-Wieloch J, Weigel R, Schumacher V, Schorle H, Nettersheim D, Bergmann M, Brehm R, Kliesch S, Vandekerckhove L, Fink C. Reference gene validation for RT-qPCR, a note on different available software packages. PLoS ONE. 2015; 10(3):1–13. https://doi.org/10.1371/journal.pone.0122515.
Article CAS Google Scholar

Download references

Acknowledgments

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Quadro P6000 GPU used for this research (in Platform 1). We thank PS for access to Platform 2.

Funding

The project was funded by the First-TEAM programme of the Foundation for Polish Science titled “Predictive biomarkers of radiation toxicity (PBRTox) and funds of the Medical University of Lodz (No 503/1-090-03/503-11-001-19-00)”. The founders had no role in study design, experiments, writing manuscript and decisions concerning submission.

Author information

Szymon Grabia and Urszula Smyczynska contributed equally to this work.

Authors and Affiliations

Department of Biostatistics and Translational Medicine, Medical University of Lodz, 15 Mazowiecka St., Lodz, 92-215, Poland
Szymon Grabia, Urszula Smyczynska, Konrad Pagacz & Wojciech Fendler
Institute of Applied Computer Science, Lodz University of Technology, 18/22 Stefanowskiego St., Lodz, 90-537, Poland
Szymon Grabia
Postgraduate School of Molecular Medicine, Medical University of Warsaw, 61 Zwirki i Wigury St., Warsaw, 02-091, Poland
Konrad Pagacz
Dana-Farber Cancer Institute, Harvard Medical School, Boston, 450 Brookline Av., Boston, MA 02215, USA
Wojciech Fendler

Authors

Szymon Grabia
View author publications
You can also search for this author in PubMed Google Scholar
Urszula Smyczynska
View author publications
You can also search for this author in PubMed Google Scholar
Konrad Pagacz
View author publications
You can also search for this author in PubMed Google Scholar
Wojciech Fendler
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

SG designed and implemented software, performed benchmark and tests, and was a major contributor in writing the manuscript and visualization. US designed and implemented software, performed tests and validation experiment, and was a major contributor in writing the manuscript and visualization. KP gathered data for validation, helped design the study, evaluated the software and was involved in writing manuscript. WF designed and supervised the study, evaluated the software and acquired funding. All author(s) read and approved the final manuscript.

Corresponding author

Correspondence to Wojciech Fendler.

Ethics declarations

Ethics approval and consent to participate

Not applicable. Ethical approval for the study was not required since only data from public repositories were used.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1

Mathematical formulation of implemented algorithms. Mathematical formulation of implemented algorithms with our changes with respect to original versions.

Additional file 2

Supplementary results. This file contains all supplementary tables and figures which are refereed in text.

Additional file 3

Archive with source code. This archive contains NormiRazor source code.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Grabia, S., Smyczynska, U., Pagacz, K. et al. NormiRazor: tool applying GPU-accelerated computing for determination of internal references in microRNA transcription studies. BMC Bioinformatics 21, 425 (2020). https://doi.org/10.1186/s12859-020-03743-8

Download citation

Received: 07 April 2020
Accepted: 07 September 2020
Published: 29 September 2020
DOI: https://doi.org/10.1186/s12859-020-03743-8

NormiRazor: tool applying GPU-accelerated computing for determination of internal references in microRNA transcription studies