Data selection
Data were drawn from a set of publicly available gene expression microarray databases stored in the GEO repository (http://www.ncbi.nlm.nih.gov/gds/). Selection criteria were: a) inclusion in the GEO data bank from January 2010 to December 2014; b) presence of at least two classes potentially useful for cancer diagnosis, each including at least 20 samples; c) availability of a scientific paper in English, available on PubMed, fully describing the experiment and the related study design.
An initial selection was made using the following search string on the GEO website: cancer AND human [Organism] AND 40:10000[Number of Samples] AND 2010/01:2014/12[Publication Date] AND GDS[ETYP] AND “gds PubMed”[Filter].
The retrieved databases were carefully inspected to assess their full compliance with the selection criteria. Moreover, studies based on a matched design were excluded, because all the applied methods of analysis rely on the assumption of independent sampling.
Logic learning machine (LLM)
LLM generates classifiers described by a set of intelligible rules of the type:
$$ \mathbf{if}< premise>\mathbf{then}< consequence> $$
where <premise> is a logical product (AND) of conditions and <consequence> provides a class assignment for the output [5, 6, 15].
LLM produces rules through a three-step process: latticization (binarization), monotone Boolean function reconstruction, and rule generation (Fig. 2). In the first phase (latticization), each variable is transformed into a string of binary data using the inverse only-one coding [15]; the resulting coded strings are then concatenated into a single long sequence of bits. In the second phase (monotone Boolean function reconstruction), a set of binary vectors, called implicants, is selected; these allow the identification of clusters associated with a specific class. In the third phase, each generated implicant is transformed into a rule, whose <premise> part includes a collection of simple threshold conditions. Algorithms for the efficient generation of implicants, starting from any dataset, have been illustrated in detail elsewhere [15].
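The latticization step can be illustrated with a minimal sketch. It assumes that each continuous variable has already been split into intervals by a sorted list of cutoffs (the cutoffs and the function names are hypothetical, not taken from Rulex), and that the inverse only-one code of the k-th interval is a string of ones with a single zero in position k:

```python
from bisect import bisect_right


def inverse_only_one(index, n_values):
    """Return a bit string of length n_values with a single 0 at `index`."""
    if not 0 <= index < n_values:
        raise ValueError("index out of range")
    return "".join("0" if pos == index else "1" for pos in range(n_values))


def latticize(sample, cutoffs):
    """Concatenate the inverse only-one codes of all variables of one sample.

    cutoffs[j] is a sorted list of hypothetical thresholds for variable j;
    it splits the range of that variable into len(cutoffs[j]) + 1 intervals.
    """
    bits = []
    for value, cuts in zip(sample, cutoffs):
        interval = bisect_right(cuts, value)  # index of the interval holding value
        bits.append(inverse_only_one(interval, len(cuts) + 1))
    return "".join(bits)
```

For example, with one cutoff for the first variable and two for the second, `latticize([0.2, 5.0], [[0.5], [1.0, 3.0]])` yields the five-bit string `01110`.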
A set of quality measures has been defined for any rule r generated by LLM [2, 15], such as the proportion of correct classifications C(r), called the covering, or the false positive fraction E(r). In a binary classification task, depending on the class identified by the rule r, C(r) will correspond to either the sensitivity or the specificity. Let r’ represent the rule obtained from r by removing the condition c from its premise part. A simple measure of the relevance R(c) of that condition is then provided by:
$$ R(c)=\Delta E(c)C(r) $$
where
$$ \Delta E(c)=E\left(r^{\prime}\right)-E(r) $$
Finally, a measure of relevance Rv(xj) for each variable xj can be obtained by applying the following equation:
$$ {R}_v\left({x}_j\right)=1-\prod \limits_k\left(1-R\left({c}_{kl}\right)\right) $$
where k ranges over the indices of the rules rk that include a condition ckl on the variable xj.
As a rule of thumb, the inequality Rv(xj) ≤ 10% is used to identify a predictor xj providing a marginal contribution to the accuracy of LLM classifiers, while a rule with C(r)(1 – E(r)) ≤ 10% often covers subjects with anomalous values (possible outliers).
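The two relevance measures can be sketched in a few lines. The function names are illustrative, and the covering and error values are assumed to have been computed beforehand for each rule:

```python
def condition_relevance(covering, error_with, error_without):
    """R(c) = (E(r') - E(r)) * C(r) for one condition c of a rule r.

    error_with is E(r); error_without is E(r'), the error of the rule
    obtained by removing c from the premise of r.
    """
    return (error_without - error_with) * covering


def variable_relevance(condition_relevances):
    """Rv(xj) = 1 - prod_k (1 - R(c_kl)) over the rules that test xj."""
    product = 1.0
    for relevance in condition_relevances:
        product *= 1.0 - relevance
    return 1.0 - product
```

As an illustration, a rule with C(r) = 0.6 whose error rises from 0.02 to 0.12 when the condition is dropped gives R(c) = 0.06. A variable appearing in two rules with condition relevances 0.06 and 0.10 gets Rv = 1 − 0.94 × 0.90 = 0.154, above the 10% threshold.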
Accuracy assessment
Measures of quality for a single analysis
The performance of LLM was compared with that of four competing supervised learning methods (DT, ANN, SVM, and kNN) in leave-one-out cross-validation (LOOCV). Standard measures of quality were obtained for each analysis, and proper comparison techniques were adopted to evaluate the overall performance of each classification method over the whole set of analyses. A parameter tuning procedure was adopted to enhance the performance of each method, and only the models with the highest accuracy were retained. For instance, for LLM a set of values for the E(r) parameter was selected, ranging from 2.5% to 7.5% (step 0.5%). Parameter tuning for the competing methods is described in the dedicated paragraphs.
Consider a two-class classification problem, where the output can assume two different values identified as positive and negative. Each analysis with any supervised learning method is characterized by four values:
The number TP of positive samples correctly classified by the resulting model,
The number FN of positive samples wrongly classified by the resulting model,
The number TN of negative samples correctly classified by the resulting model,
The number FP of negative samples wrongly classified by the resulting model.
From these four values other quality measures for the analysis can be derived, among which:
$$ SE=\frac{TP}{TP+ FN}\kern0.5em SP=\frac{TN\ }{TN+ FP} $$
$$ K=\frac{2\left( TP\bullet TN- FP\bullet FN\right)}{\left( TP+ FP\right)\left( TN+ FP\right)+\left( TN+ FN\right)\left( TP+ FN\right)} $$
$$ OR=\frac{SE\bullet SP}{\left(1- SE\right)\left(1- SP\right)} \tag{1} $$
In the present investigation we have log-transformed the OR to exploit its asymptotic Normal distribution [26]. An asymptotic estimate of the variance σ² of log(OR) is readily obtained through the equation [27]:
$$ {\sigma}^2=\frac{1}{TP}+\frac{1}{FN}+\frac{1}{TN}+\frac{1}{FP} $$
where the continuity correction is adopted if any of the denominators is zero [27].
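The measures above can be computed directly from the four counts, as in the following minimal sketch. The size of the continuity correction is not stated in the text; adding 0.5 to every cell when any cell is zero is an assumption made here for illustration.

```python
import math


def binary_measures(tp, fn, tn, fp):
    """Return SE, SP, Cohen's kappa, log(OR) and the variance of log(OR)."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    kappa = 2 * (tp * tn - fp * fn) / (
        (tp + fp) * (tn + fp) + (tn + fn) * (tp + fn)
    )
    # Assumed continuity correction: add 0.5 to all cells if any cell is zero.
    if 0 in (tp, fn, tn, fp):
        tp, fn, tn, fp = (count + 0.5 for count in (tp, fn, tn, fp))
    # OR = SE*SP / ((1-SE)(1-SP)) simplifies to TP*TN / (FN*FP).
    log_or = math.log((tp * tn) / (fn * fp))
    variance = 1 / tp + 1 / fn + 1 / tn + 1 / fp
    return {"se": se, "sp": sp, "kappa": kappa,
            "log_or": log_or, "variance": variance}
```

With TP = 40, FN = 10, TN = 45 and FP = 5, this gives SE = 0.8, SP = 0.9, K = 0.7 and OR = 36.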
In the presence of m multiple outcomes, the definition of the Cohen kappa coefficient can be generalized as follows [28]:
$$ K=\frac{\sum \limits_{i=1}^m{a}_{ii}-\sum \limits_{i=1}^m{e}_{ii}}{\sum \limits_{j=1}^m\sum \limits_{i=1}^m{a}_{ij}-\sum \limits_{i=1}^m{e}_{ii}} $$
where aij represents the count of elements in the ith row and jth column of the confusion matrix and
$$ {e}_{ij}=\frac{\left(\sum \limits_{k=1}^m{a}_{ik}\right)\left(\sum \limits_{k=1}^m{a}_{kj}\right)}{\sum \limits_{k=1}^m\sum \limits_{l=1}^m{a}_{kl}} $$
represents the corresponding expected count in the case of a random distribution of the elements inside the cells of the confusion matrix.
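A minimal sketch of the generalized kappa for an m × m confusion matrix follows. The matrix is assumed to be a list of rows, and the expected diagonal counts are taken as row total times column total divided by the grand total:

```python
def cohen_kappa(confusion):
    """Cohen's kappa for a square confusion matrix given as a list of rows."""
    m = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(m)) for j in range(m)]
    observed = sum(confusion[i][i] for i in range(m))
    # Expected diagonal counts under a random allocation of the samples.
    expected = sum(row_totals[i] * col_totals[i] / total for i in range(m))
    return (observed - expected) / (total - expected)
```

For m = 2 this reduces to the two-class formula: `cohen_kappa([[45, 5], [10, 40]])` gives 0.7, the same value obtained from TP = 40, FN = 10, TN = 45, FP = 5.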
The specificity SPm for m multiple outcomes was simply obtained by selecting a category as the reference and computing the proportion of correctly classified samples inside that category.
The corresponding estimates of log(OR) and of its variance σ² were then obtained by applying the Mantel-Haenszel (MH) method [29]. Let i = 1 be the index of the reference category in the confusion matrix; the MH estimate of log(OR) is obtained as:
$$ \log (OR)\cong \log \left(\frac{\sum \limits_{i=2}^m\frac{a_{11}{a}_{ii}}{a_{11}+{a}_{1i}+{a}_{i1}+{a}_{ii}}}{\sum \limits_{i=2}^m\frac{a_{1i}{a}_{i1}}{a_{11}+{a}_{1i}+{a}_{i1}+{a}_{ii}}}\right) $$
whereas the corresponding asymptotic estimate of the variance σ² is obtained through the following equation [30]:
$$ {\sigma}^2\cong \frac{\sum \limits_{i=2}^m\frac{\left({a}_{11}+{a}_{1i}\right)\left({a}_{i1}+{a}_{ii}\right)\left({a}_{11}+{a}_{i1}\right)\left({a}_{1i}+{a}_{ii}\right)}{{\left({a}_{11}+{a}_{1i}+{a}_{i1}+{a}_{ii}-1\right)\left({a}_{11}+{a}_{1i}+{a}_{i1}+{a}_{ii}\right)}^2}}{\left(\sum \limits_{i=2}^m\frac{a_{1i}{a}_{i1}}{a_{11}+{a}_{1i}+{a}_{i1}+{a}_{ii}}\right)\left(\sum \limits_{i=2}^m\frac{a_{11}{a}_{ii}}{a_{11}+{a}_{1i}+{a}_{i1}+{a}_{ii}}\right)} $$
Finally, the sensitivity SE for multiple outcomes is obtained by exploiting the relationship between OR, SE and SP reported in eq. (1):
$$ SE=\frac{OR\left(1- SP\right)}{SP+ OR\left(1- SP\right)} $$
A “natural” reference category for multiple outcomes was adopted whenever possible, selecting either the group of subjects without any disease, if present, or the class with the (allegedly) least severe illness. Otherwise, in the case of comparison between groups of severely diseased patients (i.e., classes including only malignant tumors), the reference was arbitrarily defined as the class with the highest number of individuals.
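The multi-class procedure can be sketched as follows. The code pairs the reference category with each other class in turn to form a 2 × 2 table, then applies the MH formulas given above. It assumes that the rows of the confusion matrix hold the true classes and that at least one off-diagonal product a1i·ai1 is non-zero; the function names are illustrative.

```python
import math


def mantel_haenszel(confusion, ref=0):
    """Return the MH estimate of log(OR) and its asymptotic variance."""
    num = den = var_num = 0.0
    for i in range(len(confusion)):
        if i == ref:
            continue
        a, b = confusion[ref][ref], confusion[ref][i]
        c, d = confusion[i][ref], confusion[i][i]
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
        var_num += (a + b) * (c + d) * (a + c) * (b + d) / ((n - 1) * n ** 2)
    return math.log(num / den), var_num / (num * den)


def reference_specificity(confusion, ref=0):
    """Proportion of correctly classified samples in the reference class."""
    return confusion[ref][ref] / sum(confusion[ref])


def sensitivity_from_or(odds_ratio, sp):
    """Invert eq. (1): SE = OR(1 - SP) / (SP + OR(1 - SP))."""
    return odds_ratio * (1 - sp) / (sp + odds_ratio * (1 - sp))
```

With a single non-reference class the MH odds ratio coincides with the ordinary one: for `[[45, 5], [10, 40]]` it equals 36, the specificity is 0.9, and `sensitivity_from_or(36, 0.9)` returns 0.8.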
Common measures of quality across studies: the summary ROC curve
For each dataset, comparison between the considered supervised classification methods was based on the K index.
For each classifier a common measure of accuracy across the N studies was obtained by employing the method of the summary ROC (sROC) curves [31]. In particular, the area sAUC under the sROC curve was adopted to evaluate the quality of any classification technique. A proper model was considered, which is described by the following equation:
$$ sROC(x)=\frac{x\bullet sOR}{x\bullet sOR+1-x} $$
where sOR is the summary odds ratio given by:
$$ sOR=\exp \left(\frac{\sum \limits_{i=1}^N\frac{\log \left({OR}_i\right)}{\sigma_i^2}}{\sum \limits_{i=1}^N\frac{1}{\sigma_i^2}}\right) $$
where ORi is the odds ratio of the ith study and σi² is the variance of log(ORi).
An estimate of the standard error for log(sOR) can also be obtained as
$$ StdErr\left[\log (sOR)\right]=\sqrt{\frac{1}{\sum \limits_{i=1}^N\frac{1}{\sigma_i^2}}} $$
whereas, under the log-Normal assumption for the distribution of sOR, the related 95% confidence interval (95% CI) of this estimate is obtained as follows:
$$ 95\% CI=\exp \left( \log (sOR)\pm 1.96\bullet \sqrt{\frac{1}{\sum \limits_{i=1}^N\frac{1}{\sigma_i^2}}}\right) $$
The value of sAUC represents a summary measure of pure accuracy [31] and is easily obtained from sOR through the following equation:
$$ sAUC=\frac{sOR}{sOR-1}-\frac{sOR\bullet \log (sOR)}{{\left( sOR-1\right)}^2} $$
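The summary measures can be sketched in a few lines. The pooling step assumes the usual fixed-effect, inverse-variance scheme, in which the weighted sum of the log(ORi) is divided by the sum of the weights 1/σi²; this normalization is an assumption of the sketch, chosen for consistency with the standard error given above. Function names are illustrative.

```python
import math


def summary_or(log_ors, variances):
    """Return sOR, StdErr[log(sOR)] and the 95% CI of sOR."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * l for w, l in zip(weights, log_ors)) / sum(weights)
    std_err = math.sqrt(1.0 / sum(weights))
    ci = (math.exp(pooled - 1.96 * std_err),
          math.exp(pooled + 1.96 * std_err))
    return math.exp(pooled), std_err, ci


def sroc(x, s_or):
    """Summary ROC curve evaluated at x in [0, 1]."""
    return x * s_or / (x * s_or + 1 - x)


def sauc(s_or):
    """Area under the sROC curve; the limit for sOR -> 1 is 0.5."""
    if abs(s_or - 1.0) < 1e-12:
        return 0.5
    return s_or / (s_or - 1) - s_or * math.log(s_or) / (s_or - 1) ** 2
```

Two studies with the same odds ratio of 4 and variance 0.5 pool to sOR = 4 with a standard error of 0.5; the closed-form `sauc` can be cross-checked by numerically integrating `sroc` over [0, 1].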
In the present study we have performed an sROC analysis for each of the five classification methods thus resulting in five sROC curves.
All supervised analyses were carried out using Rulex, a software suite developed and commercialized by Rulex Inc. (http://www.rulex.ai). Summary ROC analysis was performed with the Stata for Windows statistical software (release 12.1, Stata Corporation, College Station, TX).
Brief description of competing methods of supervised data mining
Decision tree (DT)
A DT is a graph where each node is associated with a condition based on an attribute of the input vector x and each leaf corresponds to an assignment for a specified output class. Following the path from the root to a leaf, a simple intelligible rule can be identified [32].
A DT is obtained by a “divide-and-conquer” approach that provides disjoint rules. At each iteration, a new node is added to the DT by choosing the condition that subdivides the training set S according to a specific measure of goodness. Parameter tuning was performed by comparing the performance of three different pruning approaches (namely: pessimistic, no pruning and cost-complexity). Furthermore, the maximum impurity per node was allowed to vary between 0.0 and 0.1 (step 0.01).
Artificial neural network (ANN)
ANN is a connectionist model formed by the interconnection of simple units (neurons), arranged in layers. Each neuron computes a weighted sum of its inputs and applies a suitable activation function, which provides the output value that is propagated to the following layer. The input vector x is sent to the first layer. Each remaining layer receives its input from the previous one, and the last layer produces the output class to be assigned to x. Weights for each neuron are estimated by suitable optimization techniques and form the set of parameters for the ANN. The Levenberg-Marquardt version of the back propagation algorithm was applied to train the ANN [32]. Parameter tuning was performed by comparing the performance of ANNs with different numbers of hidden layers (0 or 1) and neurons (2 to 6). Moreover, the learning rate was allowed to vary between 0.25 and 0.75 (step 0.05).
k-nearest neighbor classifier (kNN)
Let n be the number of pairs (xj,yj) in the training set S, where xj is the input vector and yj the output class for the jth sample. When a new subject described by the input vector x is to be classified, the nearest k samples in S, according to a suitable distance measure, are determined and the class y associated with the majority of the k nearest samples is assigned to x [32].
In the present investigation the standard Euclidean distance was employed, after normalizing the components of the input vector x to reduce the effect of biases possibly caused by unbalanced domain intervals for different input variables. The tuning procedure was applied to the number of nearest samples, letting the k parameter vary from 1 to 10.
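The kNN procedure with LOOCV-based tuning of k can be sketched in plain Python. The text does not specify the type of normalization, so min-max scaling is an assumption here. For brevity the scaling is computed once on the whole dataset rather than inside each cross-validation fold. This is not the Rulex implementation.

```python
import math
from collections import Counter


def min_max_normalize(samples):
    """Rescale every input variable to the [0, 1] interval."""
    columns = list(zip(*samples))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)]
            for row in samples]


def knn_predict(train_x, train_y, x, k):
    """Assign to x the majority class among its k nearest training samples."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: math.dist(pair[0], x))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def loocv_accuracy(xs, ys, k):
    """Leave-one-out accuracy of kNN for a given k."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        hits += knn_predict(train_x, train_y, xs[i], k) == ys[i]
    return hits / len(xs)


def tune_k(xs, ys, k_values=range(1, 11)):
    """Return the smallest k in k_values with the highest LOOCV accuracy."""
    return max(k_values, key=lambda k: loocv_accuracy(xs, ys, k))
```

Ties in the vote are resolved in favor of the class of the closest sample among those tied, which is one of several reasonable conventions.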
Support vector machine (SVM)
SVM is a non-probabilistic binary linear classifier based on the identification of an optimal hyperplane of separation between two classes [32]. Given a training set S, the classifier selects a subset of l input vectors xj, called support vectors, together with their corresponding outputs yj ∈ {−1, 1}. The class y for any input vector x is then given by:
$$ y=\operatorname{sgn}\left(\sum \limits_{j=1}^l{y}_j{\alpha}_jK\left({\boldsymbol{x}}_j,\boldsymbol{x}\right)+b\right) $$
where the coefficients αj and the offset b are evaluated through a proper training algorithm.
K(·,·) is a kernel function used to perform a non-linear classification by constructing an optimal hyperplane in a high-dimensional projected space. Training was performed using the LIBSVM library, which is included in the Rulex Analytics software. The performance of SVM with linear and RBF kernels was tested on each dataset. The tuning procedure also included the degree of the kernel function, which was allowed to range from 1 to 10.
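The decision rule of a trained SVM can be illustrated with a minimal sketch. It only evaluates the classifier: the support vectors, the coefficients αj and the offset b are assumed to come from a training algorithm such as the one in LIBSVM. The `gamma` parameter of the RBF kernel is a hypothetical name introduced for the example.

```python
import math


def linear_kernel(u, v):
    """K(u, v) = <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(u, v, gamma=1.0):
    """K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def svm_predict(x, support_vectors, labels, alphas, b, kernel=linear_kernel):
    """Return sgn(sum_j y_j * alpha_j * K(x_j, x) + b) as +1 or -1."""
    score = b + sum(
        y * alpha * kernel(sv, x)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )
    return 1 if score >= 0 else -1
```

With support vectors (1, 1) and (−1, −1), labels +1 and −1, both coefficients equal to 0.25 and b = 0, the point (2, 2) is assigned to class +1 and the point (−3, 0) to class −1 under the linear kernel.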