- Methodology Article
- Open Access
- Published:

# Statistical analysis of a Bayesian classifier based on the expression of miRNAs

*BMC Bioinformatics*
**volume 16**, Article number: 287 (2015)

## Abstract

### Background

During the last decade, many scientific works have concerned the possible use of miRNA levels as diagnostic and prognostic tools for different kinds of cancer. The development of reliable classifiers requires tackling several crucial aspects, some of which have been widely overlooked in the scientific literature: the distribution of the measured miRNA expressions and the statistical uncertainty that affects the parameters that characterize a classifier. In this paper, these topics are analysed in detail by discussing a model problem, i.e. the development of a Bayesian classifier that, on the basis of the expression of miR-205, miR-21 and snRNA U6, discriminates samples into two classes of pulmonary tumors: adenocarcinomas and squamous cell carcinomas.

### Results

We proved that the variance of miRNA expression triplicates is well described by a normal distribution and that triplicate averages also follow normal distributions. We provide a method to enhance a classifiers’ performance by exploiting the correlations between the class-discriminating miRNA and the expression of an additional normalized miRNA.

### Conclusions

By exploiting the normal behavior of triplicate variances and averages, invalid samples (outliers) can be identified by checking their variability via chi-square test or their displacement by the respective population mean via Student’s t-test. Finally, the normal behavior allows to optimally set the Bayesian classifier and to determine its performance and the related uncertainty.

## Background

MicroRNAs (miRNAs or miRs) are small non-coding single-stranded RNAs, 19–25 nucleotides in length, acting as negative regulators of gene expression at the post-transcriptional level. More than 1000 miRNAs are transcribed from miRNA genes in the human genome. A single miRNA is able to modulate hundreds of downstream genes by recognizing complementary sequences in the 3^{′} untranslated regions (UTRs) of their target messenger RNAs. It has been estimated that in humans about 30 % of messenger RNAs are under miRNA regulation. The biological functions of miRNAs are diverse and include several key cellular processes, such as differentiation, proliferation, cellular development, cell death and metabolism.

In the last decade, evidences have accumulated to indicate that miRNAs play a role in the onset and progression of several human cancers [1]. The transcription or processing of some miRNAs is altered in neoplastic tissues, with respect to their normal counterparts. miRNAs whose levels increase in tumors are referred to as oncogenic miRNAs (onco-miRs), sometimes even if there is no evidence for their causative role in tumorigenesis. On the other hand, miRNAs down-regulated in cancer are considered tumor suppressors.

In parallel to these studies, the effectiveness of miRNAs as markers for tracing the tissue of origin of cancers of unknown primary origin was demonstrated by several authors, and the utility of miRNAs levels as diagnostic and prognostic markers became clear (reviewed by [2]). The main advantage of the use of miRNAs as markers resides in the ease of their detection and in their extreme specificity. miRNAs are stable molecules well preserved in formalin fixed, paraffin embedded tissues (FFPE) as well as in fresh snap-frozen specimens, unlike larger RNA molecules as messenger RNAs [3]. The finding that miRNAs have an exceptional stability in several tissues suggested that these tiny molecules were also preserved, detectable and quantifiable in plasma and in other biofluids, such as urine, saliva, cerebro-spinal fluid and amniotic fluid [4]. Circulating miRNAs are attracting attention as markers not only for cancer but also for neurodegenerative diseases (reviewed by [5]) as they have some important features: non-invasivity, specificity, early detection, sensitivity and ease of translatability from model systems to humans.

Methods based on next generation sequencing (NGS), microarray and quantitative reverse-transcription polymerase chain reaction (qRT-PCR) are currently used for miRNA profiling and for the identification of miRNAs differently expressed in tumor samples and in matched, healthy tissue. The majority of miRNA profiling studies have been so far carried out by using microarrays. These studies have provided signatures consisting in few to several (5–30) distinct miRNAs [6]. However, the large amount of data obtained by microarray and NGS profiling needs to be transposed into clinical trials by developing an easily performed, cost-effective and serviceable assay that can analyze the cancer-specific miRNAs for cancer diagnosis and prognosis. Such an analysis has been so far relying on qRT-PCR assays, performed by measuring the levels of a restricted number of miRNAs (see review by [7]).

The realization of classifiers based on the expression of miRNAs is widely discussed in the scientific literature. Within the context of lung cancers (see, for example, [4, 8–12]) the work by [13] describes a classifier that distinguishes squamous from nonsquamous non-small-cell lung carcinomas, by using miR-205 as a specific marker and miR-21, snRNA (small nuclear RNA) U6 as normalizers. The approach followed is essentially machine learning: the classifier relies on a sample score and a threshold. A more elaborated support vector machine, which uses the combination of 5 miRNAs for lung squamous cell carcinoma diagnosis, is described in [14]. A receiver operating characteristic curve analysis to evaluate the possibility of diagnosing the histologic subtype of pulmonary neuroendocrine tumors via altered expression of miR-21, miR-155, let-7a is discussed in [15] (a similar statistical approach is described in [16]).

These classifiers are generally declared to be efficient. For example, [13] report a sensitivity of 96 % and a specificity of 90 %. However, at least two aspects are widely overlooked in the scientific literature: first, the distribution of the measured miRNA expressions; second, the statistical uncertainty that unavoidably affects the parameters that characterize a classifier and its performance. Both aspects are crucial in order to assess the reproducibility, and thus the reliability, of a classifier. The goal of the present paper is to close these gaps. Our analysis concerns a Bayesian classifier based on the expression of a single class-discriminating miRNA, with additional miRNAs that are used either as normalizers or as performance-enhancer via noise-reduction.

In the present paper the following issues are discussed: normal distribution of the triplicate variance and identification of outliers; improvement of accuracy via normalizers; class-discriminating measures and their distributions; identification of “bias” outliers; assessment of a classifier’s performance; finally, improvement of a classifier’s performance by exploiting correlations.

As a prototypical case we discuss throughout the paper the development of a classifier that assigns samples either to adenocarcinomas (ADC) or to squamous cell carcinomas (SQC). The two classes ADC, SQC are henceforth referred to as the *target* and the *versus* class, respectively. The miRNAs used are miR-205, miR-21 and snRNA U6.

## Methods

### Distribution of triplicates: normally-distributed variance and outlier identification

Given a sample stemming from a patient, a set of miRNAs is measured in triplicate by using qRT-PCR. For each miRNA, the sample mean *x* of the corresponding triplicate and the related sample standard deviation *s* are calculated. To provide an *a priori* knowledge on the samples, each one was classified via immunohistochemical analysis and gene profiling into one of different categories of lung tumors. We use data based on lung carcinoma biopsies retrieved from the archives of the Unit of Surgical Pathology of the S. Chiara Hospital in Trento, Italy. The research project had been approved by the Ethical Committee of the Trentino Public Health System (Azienda Provinciale per i Servizi Sanitari). Most of the data analyzed here were previously published by our research group [17]. All datasets are available as Supplementary material in the Additional file 1.

For each miRNA, the distribution of the variances *s*
^{2} can be described by a normal chi-square distribution with number of degrees of freedom *ν* equal to 2: \({s^{2} / \sigma ^{2} \sim \chi ^{2}_{\nu =2}}\). To prove this behavior, we fitted the cumulative distribution \({F\left (\chi ^{2}_{\nu =2};\,\nu =2\right)}\) of the chi-square with *ν*=2 to the cumulative distribution of the measured variances: knowing that *F*(*s*
^{2}/*σ*
^{2}; 2)=1− exp[−*s*
^{2}/(2*σ*
^{2})], the fit was carried out by searching the population variance *σ*
^{2} that minimizes the Kolmogorov-Smirnov (K-S) statistic *D*. We used the K-S test because it is less sensitive to outliers than other statistical tests and does not require any assumption on the distributions. On the contrary, for example, the estimation of the population variance *σ*
^{2} out of the set of sample variances is valid only if the sample distribution is already known to be normal. The *p*-values resulting from the fit (see Table 1) show that the variance distributions are indeed compatible with chi-square distributions.

As a result, an outlier can be identified by checking its variability via the chi-square test: we assume a triplicate of a given miRNA to be an outlier if its sample standard deviation *s* exceeds the critical value *σ*
_{max} corresponding to the significance level *α*=0.05. The critical value *σ*
_{max} is given by [−2 log*α*]^{1/2}
*σ*≃2.448 *σ*, where *σ* is the population standard deviation assessed for that miRNA. The *σ*
_{max} values are reported in Table 1. The outlier definition used here implies that the significance level *α* corresponds to the rate of statistical false alarms (type I errors), i.e. the rate of valid triplicates that are falsely deemed to be outliers.^{1}

According to this procedure, the triplicate sets of both miR-205 and miR-21 contained no outliers, whereas the snRNA U6 contained 4 outliers. These samples were excluded from the following analysis. Table 1 also shows the statistics of the standard deviation for the snRNA U6 data set devoid of outliers.

We note that, for snRNA U6, the two values of *σ* are very similar. This fact reflects, as mentioned above, the robustness to outliers of the K-S approach. In addition, for all data sets devoid of outliers, the population standard deviation *σ* is approximately equal to both the root-mean-square sample standard deviation 〈*s*
^{2}〉^{1/2} and the fourth root of the variance of variances [〈*s*
^{4}〉−〈*s*
^{2}〉^{2}]^{1/4}. Because *ν*=2, this behavior provides further evidence to the null hypothesis that samples are drawn from the same normal distribution.

For each sample, i.e. patient, of the set devoid of outliers, we henceforth use the following notation for the sample mean *x* of the available triplicates: *x*
_{U6} for the snRNA U6 triplicate, *x*
_{21} for the miR-21 triplicate, and *x*
_{205} for the miR-205 triplicate. In addition, *x*
_{205}, *x*
_{21}, *x*
_{U6} will be referred to as *measures*.

### Distribution of triplicates: accuracy and normalization

A main issue to cope with towards the development of a reliable classifier is accuracy. The question is whether the values of the sample means of the triplicates are constant over different experimental sessions – i.e. measurements taken at different times and/or with different set-ups – or, rather, have to be *normalized* in order to remove experimental bias.

Table 2 shows the results of a statistical analysis, in terms of sample mean \(\overline {X}\) and sample standard deviation *S*, carried out for *x*
_{21}, *x*
_{U6} and their difference *x*
_{21}−*x*
_{U6} on data gathered in two different experimental sessions. For both single-miRNA values *x*
_{21} and *x*
_{U6}, and both for the target and the versus class, the sample means significantly differ between the two sessions. However, this is not the case for the sample means of *x*
_{21}−*x*
_{U6}: the difference of the sample mean of *x*
_{21}−*x*
_{U6} between session II and session I is 0.4±0.4 for the target class, and −0.2±0.8 for the versus class; in both cases the difference is less than twice the respective sample standard deviation (*p*>0.05). Remarkably, the sample standard deviations, and consequently the uncertainties on the sample means, do not significantly differ between the two sessions.

The necessity to improve accuracy by suitably normalizing an oncomir has been extensively discussed in the scientific literature (see for example [18]). Thus, in accordance with many previous works, we use snRNA U6 to normalize *x*
_{21} and *x*
_{205}. The following notation is henceforth used: *Δ*
*x*
_{205}≡*x*
_{205}−*x*
_{U6}; *Δ*
*x*
_{21}≡*x*
_{21}−*x*
_{U6}.

### MiRNA statistics

In this section, the statistics of *Δ*
*x*
_{205}, *Δ*
*x*
_{21}, snRNA U6 is discussed.

As stated above, each sample was classified into one of the two classes *ADC* and *SQC* via immunohistochemical analysis and gene profiling. Figure 1 shows the histograms of *Δ*
*x*
_{205}, *Δ*
*x*
_{21}, *x*
_{U6} for samples belonging either to the target class *ADC* or to the versus class *SQC*.

By means of the Shapiro-Wilk test of normality, each histogram was shown to be consistent with a normal parent population. Consequently, we assume that, for each of the two classes, the measures *Δ*
*x*
_{205}, *Δ*
*x*
_{21}, *x*
_{U6} are normally distributed with a mean and a standard deviation that are respectively estimated by the sample mean \(\overline {X}\) and the sample standard deviation *S* of the *x* values (triplicates). The results are reported in Table 3. The proof that the measures of interest are compatible with normal distributions makes up a crucial step towards the optimization of the Bayesian classifier and the determination of its performance, inclusively of the related uncertainty (see below).

With regard to *Δ*
*x*
_{205}, the histograms of the target class *ADC* and the versus class *SQC* are well-separated: Student’s t-test provides *p*<10^{−7} (see Table 3). Conversely, both for *Δ*
*x*
_{21} and *x*
_{U6}, the overlapping of the histograms of the two classes is confirmed by Student’s t-test, which provides *p*=0.90 and *p*=0.14, respectively. Therefore, only the measure *Δ*
*x*
_{205} is a good candidate to classify samples into *ADC* or *SQC*.

### A Bayesian classifier

To develop a classifier, any linear combination *y* of the available measures can be used. Given a linear combination *y* of the measures *Δ*
*x*
_{205}, *Δ*
*x*
_{21}, *x*
_{U6}, we assume the following classification rule to hold:

where *χ* is a fixed threshold. For example, in the works by [13, 17] the linear combination

was used and a threshold *χ*=2.5 was set.

The discriminator approach described in Eq. (1) requires tackling three main issues: finding a suitable linear combination *y*; finding a suitable value for *χ*; analyzing the performance of the classifier.

Given a linear combination *y* (that may also coincide with one of the three measures), the threshold *χ* can be determined by calculating the value that, according to Eq. (1), maximizes the accuracy (or rate of correct responses) *p*
_{
c
}:

In this equation, *p*
_{
T
} and *p*
_{
V
} correspond to the prior presentation probabilities of the target class and the versus class, respectively, whereas *H* and *C* respectively correspond to the sensitivity and the specificity of the classifier, provided that the *condition positive* is taken to correspond to the target class [19]. Under the assumption that the values of *y* are normally distributed, sensitivity and specificity are given by the following expressions:

where *Φ*(*x*) is the standard normal cumulative distribution. The optimal position of *χ* is given by one of the roots (the most appropriate one!) of the following second-degree equation:

where

The uncertainties on the three coefficients *η*, *β*, *γ*, and thus on the threshold *χ*, are computed by means of standard error propagation. We remind that *μ*
_{
T
}, *σ*
_{
T
}, *μ*
_{
V
}, *σ*
_{
V
} are evaluated as sample means and sample standard deviations, and are therefore uncertainty-affected: for example, the errors on *μ*
_{
T
}, *σ*
_{
T
} are \({\sigma _{T}/\sqrt {N_{T}}}\), \({\sigma _{T}/\sqrt {2(N_{T}-1)}}\), respectively, where *N*
_{
T
} is the number of triplicates belonging to the target class.

The threshold *χ* depends on the ratio *p*
_{
T
}/*p*
_{
V
} of the prior occurrence probabilities of the two classes of tumors. These probabilities are typically inferred from epidemiological studies. If prior probabilities are balanced, the ratio *p*
_{
T
}/*p*
_{
V
} is unitary and the related term in Eq. (5) vanishes. Table 4 reports the values of *χ* and the respective uncertainties for each of the measures dealt with in this paper, in the case of balanced prior probabilities. For *x*
_{U6} the normal curve for the target class lays on the left of the normal curve for the versus class (i.e., \({\overline {X}_{\textit {ADC}} < \overline {X}_{\textit {SQC}}}\); see Fig. 1 and Table 3). So, in the case of this measure, the application of the rule of Eq. (1) requires the linear combination *y* being set to −*x*
_{U6}.

### Estimated classifier performance

A new sample, i.e. a new value of the measure *y* whose costituent triplicates were checked not to be outliers, can now be assigned to the target or the versus class according to the rule of Eq. (1). An estimate of the reliability of this assignment is provided by Bayes’ theorem: the odds that the new sample, given its *y*, belongs to the target class are given by the likelihood ratio

where *p*
_{
T
} and *p*
_{
V
} are the two prior probabilities (*p*
_{
T
}+*p*
_{
V
}=1), and *η*, *β*, *γ* are given by Eq. (6). From Eq. (5) it follows that the odds at *y*=*χ* are 50:50. By means of Eq. (7), the thresholds *χ*
_{10:90}, *χ*
_{90:10} for the odds 10:90 and 90:10 can be determined. The values of these thresholds in the case of balanced prior probabilities are reported in Table 4.

### Bias outliers

In the previous sections, we have addressed the issue of outlier identification by using the triplicate variability. The normality of the “target” and “versus” distributions of the measure of interest *y* allows for the identification of a second kind of outliers: given a value *y*, one can promptly evaluate – via Student’s t distribution – the one-tailed probability of obtaining a more extreme value, i.e. a value more displaced by the population mean than the value *y*. If such *p*-value is less than a given significance level (for example, 1 %), the value *y* can be deemed to be a “bias” outlier, i.e. an outlier due to a bias in the triplicate estimates.

### Improvement of a classifier’s performance

Looking at Student’s t statistic provides two possible strategies to improve the performance of a classifier. First, one can enhance the difference at the numerator of Student’s t, namely *μ*
_{
T
}−*μ*
_{
V
}; this solution requires the linear combination of the available class-discriminating measures (in the present case *Δ*
*x*
_{205}) with new, additional measures that also reliably discriminate between the two classes. Such linear combination has to be optimized by means of methods like, for example, principal component analysis or support vector machines. The discussion of these methods goes beyond the goals of the present paper. The second strategy to improve the performance of a classifier consists in reducing the denominator in the expression of Student’s t by linearly combining available measures with new ones. These new measures are not required to be class-discriminating. In the following section the second strategy is analyzed in detail.

### Analysis of correlation

Let *y*
_{
a
}, *y*
_{
b
} be two measures: *y*
_{
a
} is supposed to discriminate between two classes (according to Student’s t-test), whereas *y*
_{
b
} is not. In the present paper, we can have *y*
_{
a
}=*Δ*
*x*
_{205} and *y*
_{
b
}=*Δ*
*x*
_{21} or *y*
_{
b
}=*x*
_{U6}. Let *y* be a linear combination of *y*
_{
a
}, *y*
_{
b
} as follows:

where *c* is a coefficient to be determined. Taking the average of the last equation provides

Because measure *y*
_{
b
} does not discriminate between the two classes, we have *μ*
_{
b,T
}≈*μ*
_{
b,V
} (the means of the distributions of *y*
_{
b
} for the target and the versus class are not significantly different). Consequently, *μ*
_{
T
}−*μ*
_{
V
}≈*μ*
_{
a,T
}−*μ*
_{
a,V
}, i.e. nothing significant can be expected with regard to the numerator of Student’s t statistic when *μ*
_{
a
} is replaced by *μ*.

With regard to standard deviations, from Eq. (8) it follows:

where *r* is the linear correlation coefficient between *y*
_{
a
} and *y*
_{
b
}. If *r*=0, adding the new measure *y*
_{
b
} is definitely detrimental for the sake of discrimination because the variance is increased by a term \({c^{2} {\sigma _{b}^{2}}}\). However, in the case of a significant correlation, setting

reduces the standard deviation as follows:

Consequently, the additional measure *y*
_{
b
} can be deemed to be a sort of noise-reducer for the class-discriminating measure *y*
_{
a
}.

The previous argument still holds if, rather than regarding the whole data set, the correlation appears only on the data subset corresponding to one of the two classes. Therefore, if *c* is set according to Eq. (9), one of the two standard deviations *σ*
_{
T
}, *σ*
_{
V
} is reduced whereas the other is enhanced. The net result can be still an increase of the classifier’s performance. The optimal value of *c* can be assessed by standard analytical and numerical techniques.

## Results

### A classifier for ADC vs. SQC

For each of the three measures of interest, Fig. 2 shows the scatter plot of the respective values as well as the three thresholds *χ*
_{10:90}, *χ*, *χ*
_{90:10}. The dot color corresponds to the class the sample was assigned to via immunohistochemical analysis and gene profiling (diagnosis). The plots contains four different regions, bounded by the three thresholds and corresponding to different outcomes of the classifier: orange ⇒ versus class with odds larger than 90:10; yellow ⇒ versus class with odds between 50:50 and 90:10; light green ⇒ target class with odds between 50:50 and 90:10; green ⇒ target class with odds larger than 90:10.

The reliability of the classifiers based on each of the three available measures can be inferred by considering the *confusion matrix* reported in Table 5. The accuracy is equal to 97 % for *Δ*
*x*
_{205}, 60 % for *Δ*
*x*
_{21}, and 71 % for −*x*
_{U6}. Due to overfitting, the performance of the classifiers based on the last two measures are apparently satisfactory, though non competitive with that of the classifier based on *Δ*
*x*
_{205}. However, if only high-reliability responses are considered, namely those with odds at least 90:10, the accuracies drop to 64 %, 29 %, 9 %, respectively: while the accuracy of the classifier based on *Δ*
*x*
_{205} is still satisfactory, the other two measures do not provide reliable outcomes. This behavior is linked to the t-statistic regarding the separation of the distributions corresponding to the two classes with respect to the widths of the distributions (see Fig. 1).

In the case of the classifier based on *Δ*
*x*
_{205}, by relying on Eqs. (3, 4) a maximum accuracy of 91.4 *%* ± 3.9 *%* can be predicted.

### Improved classifier for ADC vs. SQC

Table 6 reports the correlation coefficient *r* for the pair (*x*
_{205}, *x*
_{U6}), i.e. directly between miR-205 and snRNA U6, and for the pair (*Δ*
*x*
_{205}, *Δ*
*x*
_{21}).

We first note that no correlation between miR-205 and snRNA U6 can be significantly inferred. Consequently, despite being a normalizing miRNA – i.e. a bias-reducer– for miR-205, snRNA U6 is useless as a noise-reducer for miR-205. On the contrary, the class-discriminating measure *Δ*
*x*
_{205} has a significant correlation with *Δ*
*x*
_{21}, which can be therefore used to improve the classification performance of *Δ*
*x*
_{205}. Given an overall correlation coefficient *r* of 0.022, Eq. (9) provides a value of *c* approximately equal to –0.8 so that the optimal linear combination *y*
_{
opt
} is:

Tables 3, 4 and 5 report the statistics of the measure *y*
_{
opt
}, and the thresholds and confusion matrix of the classifier relying on this measure, respectively. For the sake of comparison, the same tables also show the data of the classifier relying on the measure *y*
_{
DV
} defined in Eq. (2), which was the topic of previous works [13, 17].

Testing the same-parent-distribution null hypothesis via Student’s t statistic provides *p*=2.6·10^{−11}, half of the value obtained by testing t on the histograms generated by using the linear combination *y*
_{
DV
}. Figure 3 shows the histograms of *y*
_{
opt
} for samples belonging either to the target class *ADC* or to the versus class *SQC*. The Shapiro-Wilk test of normality yielded *p*-values of 0.78 (target class) and 0.24 (versus class).

The reliability of the classifier based on *y*
_{
opt
} can be inferred by considering the confusion matrix of Table 5 (see also the scatter plot in Fig. 4). The accuracy is 94 %, similar to that provided by the classifier relying on *Δ*
*x*
_{205} only, and equal to that provided by the classifier relying on *y*
_{
DV
}. However, if only high-reliability responses are considered, namely those with odds at least 90:10, the accuracy of the classifier based on *y*
_{
opt
} is still 91 %, slightly better than 88 % provided by *y*
_{
DV
} and definitely larger than 64 % given by the classifier relying on *Δ*
*x*
_{205} only. The improvement with regard to a classifier based on this last measure is pointed out by the ROC curves that are also shown in Fig. 4.

In the case of the classifier based on *y*
_{
opt
}, by relying on Eqs. (3, 4) a maximum accuracy of 96.1 *%* ± 2.4 *%* can be predicted. By comparison, this last parameter is 95.6 *%* ± 2.6 *%* in the case of the classifier based on *y*
_{
DV
}.

### Test of the improved classifier on an independent data set

Figure 5 shows the results of the application on a set of 9 additional samples of the classifier based on *y*
_{
opt
} and using the population mean, population standard deviation, and thresholds expressed in Tables 3 and 4. With the exception of one single case, all values of the triplicate standard deviations comply with the respective *σ*
_{max} requirements explained above. The single outlier is a miR-21 triplicate whose standard deviation of 0.52 slightly exceeds the maximum value of 0.46 (see Table 1) given by the significance level *α*=0.05. Of the remaining 8 samples, the classification provided by the classifier of Eq. (1) coincides with the immunohistochemical diagnosis for 7 samples; in all these cases, the odds are at least 90:10 (the same would happen for the sample containing the miR-21 outlier).

The miRNA classifier of Eq. (1) provides a different diagnosis than the immunohistochemical analysis only for a single sample. Although this sample does not contain any outlier according to the variability of its triplicates, its value of *y*
_{
opt
} appears to be extremely low: according to the statistics of *y*
_{
opt
} (see Fig. 3), the probability of getting a more extreme value is *p*<5·10^{−4}. This hints, rather than to a misclassification of the sample, to a case of “bias” outlier, i.e. to a possibly wrong assessment of the triplicates, as discussed above. For the sake of comparison, for all other 8 samples, as well as for all 33 samples considered in the previous sections, *p*>0.02

## Discussion and conclusion

We introduced a Bayesian classifier that, on the basis of the expression of miR-205, miR-21 and snRNA U6, discriminates samples into two different classes of pulmonary tumors, normally classified by immuno-histochemical approaches: adenocarcinomas and squamous cell carcinomas. The advantage to use miRNAs is due to the ease of their detection and quantification by qRT-PCR, as well as in their extreme specificity. miRNAs are stable molecules well preserved in formalin fixed, paraffin embedded tissues (FFPE) as well as in fresh snap-frozen specimens, unlike larger RNA molecules as messenger RNAs [3].

Our approach is based on a method that employs the quantification of snRNA U6 as a normalizer, miR-21 as a performance enhancer via noise reduction, and miR-205 as a class discriminator. First, we determined that the variance of miRNA quantification triplicates follows a normal chi-square distribution. Thereupon, we designed a procedure to recognize invalid measures (outliers) and remove them from the analysis. The proof that the measures of interest are compatible with normal distributions makes up a crucial step towards the optimization of the Bayesian classifier, the determination of its performance, inclusively of the related uncertainty, and the identification of “bias” outliers. We then proceeded to optimally set our Bayesian classifier and to determine its performance as well as the related uncertainty. Results are displayed in Fig. 2: the classifier based on miR-205 and normalized on snRNA U6 has the best performance.

A main feature of the Bayesian approach described here is the possibility, also in presence of a limited size of the available data sets, of estimating the reliability of a classifier’s performance. This possibility relies on the verification of the normality of the different distributions of interest. Other powerful methods to set up a classifier, such as support vector machines (SVM) and decision trees, though quite versatile to optimize the decisional parameters on a training set, are less suited than a probabilistic approach to provide an immediate quantification of the reliability in the case of application to new sets of data. For example, in the cases discussed above, a SVM approach would likely result in an optimized classifier that also exploits the “apparent” classification capability of *Δ*
*x*
_{21} and even *x*
_{U6} (see, for example, Figs. 1 and 2). However, according to our analysis based on Student’s t, there is no evidence of such capability, so that such a SVM classifier would also possibly have a larger generalization error than a Bayesian classifier of the kind discussed in this paper.

Finally, we provided a method to enhance a classifiers’ performance by exploiting the correlation between the tumor-discriminating miRNA miR-205 and the expression of miR-21, used as a noise reduction factor. The method essentially consists in exploiting the nonzero covariance of two miRNAs, where the first one acts a classifier and the second one is used to abate the variability of the first one. Figure 4 shows the result of an improved classifier, indicating that only 2 samples lay within the uncertainty region, much less than the 12 samples in the case of the non-improved classifier shown in Fig. 2. Results obtained on an independent data set are also satisfactory.

In conclusion, the proposed method introduces a robust tool for determining the cases in which miRNA quantification can be applied in discriminating inter- and intra-tumoral heterogeneity.

## Endnote

^{1} Because \({f_{\alpha,\,\nu,\,\infty } = \chi ^{2}_{\alpha,\,\nu }/\nu }\), an alternative outlier definition relying on the F-test would produce the same result as that discussed in this section.

## References

Farazi TA, Spitzer JI, Morozov P, Tuschl T. MiRNAs in human cancer. J Pathol. 2011; 223:102–15. doi:10.1002/path.2806.

Del Vescovo V, Grasso M, Barbareschi M, Denti MA. MicroRNAs as lung cancer biomarkers. World J Clin Oncol. 2014; 5:604–20. doi:10.5306/wjco.v5.i4.604.

Xi Y, Nakajima G, Gavin E, Morris CG, Kudo K, Hayashi K, et al. Systematic analysis of microRNA expression of RNA extracted from fresh frozen and formalin-fixed paraffin-embedded samples. RNA. 2007; 13:1668–1674. doi:10.1261/rna.642907.

Gilad S, Meiri E, Yogev Y, Benjamin S, Lebanony D, Yerushalmi N, et al. Serum MicroRNAs are promising novel biomarkers. PLoS One. 2008; 3:3148. doi:10.1371/journal.pone.0003148.

Grasso M, Piscopo P, Confaloni A, Denti MA. Circulating miRNAs as biomakers for neurodegenerative diseases. Molecules. 2014; 19:6891–910. doi:10.3390/molecules19056891.

Shen J, Stass SA, Jiang F. MicroRNAs as potential biomarkers in human solid tumors. Cancer Lett. 2013; 329:125–36. doi:10.1016/j.canlet.2012.11.001.

Benes V, Castoldi M. Expression profiling of microRNA using real-time quantitative PCR, how to use it and what is available. Methods. 2010; 50:244–9. doi:10.1016/j.ymeth.2010.01.026.

Gallardo E, Navarro A, Vinolas N, Marrades RM, Diaz T, Gel B, et al. miR-34a as a prognostic marker of relapse in surgically resected non-small-cell lung cancer. Carcinogenesis. 2012; 30:1903–1909. doi:10.1093/carcin/bgp219.

Landi MT, Zhao Y, Rotunno M, Koshiol J, Liu H, Bergen AW, et al. MicroRNA expression differentiates histology and predicts survival of lung cancer. Clin Cancer Res. 2010; 16:430–41. doi:10.1158/1078-0432.CCR-09-1736.

Patnaik SK, Kannisto E, Knudsen S, Yendamuri S. Evaluation of microRNA expression profiles that may predict recurrence of localized stage I non-small cell lung cancer after surgical resection. Cancer Res. 2010; 70:36–45. doi:10.1158/0008-5472.CAN-09-3153.

Navarro A, Diaz T, Gallardo E, Vinolas N, Marrades RM, Gel B, et al. Prognostic implications of miR-16 expression levels in resected non-small-cell lung cancer. J Surg Oncol. 2011; 103:411–5. doi:http://dx.doi.org/10.1002/jso.21847.

Gilad S, Lithwick-Yanai G, Barshack I, Benjamin S, Krivitsky I, Edmonston TB, et al. Classification of the four main types of lung cancer using a microRNA-based diagnostic assay. J Mol Diagn. 2012; 14:510–7. doi:10.1016/j.jmoldx.2012.03.004.

Lebanony D, Benjamin H, Gilad S, Ezagouri M, Dov A, Ashkenazi K, et al. Diagnostic assay based on hsa-miR-205 expression distinguishes squamous from nonsquamous non-small-cell lung carcinoma. J Clin Oncol. 2009; 27(12):2030–037. doi:10.1200/JCO.2008.19.4134.

Tan X, Qin W, Zhang L, Hang J, Li B, Zhang C, et al. A 5-microRNA signature for lung squamous cell carcinoma diagnosis and hsa-miR-31 for prognosis. Clin Cancer Res. 2011; 17:6802–811. doi:10.1158/1078-0432.CCR-11-0419.

Lee HW, Lee EH, Ha SY, Lee CH, Chang HK, Chang S, et al. Altered expression of microRNA miR-21, miR-155, and let-7a and their roles in pulmonary neuroendocrine tumors. Pathol Int. 2012; 62:583–91. doi:10.1111/j.1440-1827.2012.02845.x.

Huang W, Hu J, Yang DW, Fan XT, Jin Y, Hou YY, et al. Two microRNA panels to discriminate three subtypes of lung carcinoma in bronchial brushing specimens. Am J Respir Crit Care Med. 2012; 186:1160–1167. doi:10.1164/rccm.201203-0534OC.

Del Vescovo V, Cantaloni C, Cucino A, Girlando S, Silvestri M, Bragantini E, et al. miR-205 expression levels in nonsmall cell lung cancer do not always distinguish adenocarcinomas from squamous cell carcinomas. Am J Surg Pathol. 2011; 35(2):268–75. doi:10.1097/PAS.0b013e3182068171.

Peltier HJ, Latham GJ. Normalization of microRNA expression levels in quantitative RT-PCR assays: identification of suitable reference RNA targets in normal and cancerous human solid tissues. RNA. 2008; 14:844–52. doi:10.1261/rna.939908.

Vilardi A, Tabarelli D, Ricci L. Tailoring a psychophysical discrimination experiment upon assessment of the psychometric function: Predictions and results. AIP Adv. 2015; 5:027121. doi:10.1063/1.4908271.

Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinforma. 2011; 12:77. doi:10.1186/1471-2105-12-77.

## Acknowledgements

The authors are grateful to Marco Vardaro for the help in classifying the existing discrimination methods. This project is funded in part by startup funds from the Centre for Integrative Biology (University of Trento) to MAD and by the Outgoing Marie-Curie/Autonomous-Province-of-Trento (PAT) co-fund grant to VDV.

## Author information

### Authors and Affiliations

### Corresponding author

## Additional information

### Competing interests

The authors declare that they have no competing interests.

### Authors’ contributions

LR: Developed methods, conducted analysis and served as primary author. VDV provided data, conducted analysis and assisted in authoring of manuscript. CC provided data and conducted analysis. MG provided data. MB provided data and provided feedback on biological concepts. MAD provided inspiration and assisted in authoring of manuscript by providing feedback on biological concepts. All authors read and approved the final manuscript.

## Additional file

### Additional file 1

**The datasets are available as Supplementary Material.** (TXT 12.8 KB)

## Rights and permissions

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## About this article

### Cite this article

Ricci, L., Vescovo, V.D., Cantaloni, C. *et al.* Statistical analysis of a Bayesian classifier based on the expression of miRNAs.
*BMC Bioinformatics* **16**, 287 (2015). https://doi.org/10.1186/s12859-015-0715-9

Received:

Accepted:

Published:

DOI: https://doi.org/10.1186/s12859-015-0715-9

### Keywords

- microRNA
- Bayesian classifiers
- Lung cancer
- qRT-PCR gene expression measurement