Analyzing gene expression data for pediatric and adult cancer diagnosis using logic learning machine and standard supervised methods

Background Logic Learning Machine (LLM) is an innovative method of supervised analysis capable of constructing models based on simple and intelligible rules. In this investigation the performance of LLM in classifying patients with cancer was evaluated using a set of eight publicly available gene expression databases for cancer diagnosis. LLM accuracy was assessed by summary ROC curve (sROC) analysis and estimated by the area under an sROC curve (sAUC). Its performance was compared in cross validation with that of standard supervised methods, namely: decision tree, artificial neural network, support vector machine (SVM) and k-nearest neighbor classifier. Results LLM showed an excellent accuracy (sAUC = 0.99, 95%CI: 0.98–1.0) and outperformed any other method except SVM. Conclusions LLM is a new powerful tool for the analysis of gene expression data for cancer diagnosis. Simple rules generated by LLM could contribute to a better understanding of cancer biology, potentially addressing therapeutic approaches.


Background
Logic Learning Machine (LLM) is an innovative method of supervised data mining based on an efficient implementation of the Switching Neural Network model. In recent years LLM has been applied to a variety of biomedical settings [1][2][3][4][5][6]. The advantage of LLM with respect to most traditional methods of supervised data analysis is the capability of identifying simple intelligible rules with potential diagnostic and prognostic applications.
In particular, LLM was applied to extract few highly discriminant rules from a signature of 62 genes related to hypoxic condition for the prognosis of neuroblastoma, a highly fatal childhood cancer [2]. In such analysis LLM outperformed standard methods of machine learning, including: Decision Tree (DT), Artificial Neural Network (ANN) and k-Nearest Neighbor classifier (kNN). Furthermore, the capability of LLM to exploit the complex correlation structure of highly dimensional gene expression data for feature selection tasks and to combine information from clinical features and gene expression for classification purposes was reported in the analysis of both simulated and real data sets [1,6]. These results indicate that LLM could be a new powerful and flexible tool for the analysis of gene expression data in Oncology setting.
However, its accuracy as classifier when applied to a set of gene expression databases for cancer diagnosis remains to be assessed.
The present study is aimed at evaluating the performance of LLM through an extensive analysis of microarray gene expression data of cancer patients in diagnostic settings. A comparison with standard methods of supervised analysis is also provided.

Data set identification
Fifty-two datasets were retrieved from the GEO web site. After a careful examination of their content and the related documentation, 44 were excluded because they did not fully comply with the selection criteria, thus leaving eight data sets available for the analyses. The complete list of excluded datasets and the reason for their exclusion are shown in Table 1. Briefly, nine data sets were excluded due to non-independent sampling (five based on time series analysis and four including matched tissues), 16 for insufficient sample size, one due to the lack of class separation, and one because data were collected from a cancer transplantation in mice. Finally, 17 data sets included samples from only one class of malignancy, thus leaving eight data sets with at least two classes suitable for diagnostic comparisons (7 adult cancers and 1 childhood malignancy). Table 2 shows the list of the eight databases included in the analyses, ordered by their appearance in the GEO data bank. Three data sets (GDS4968, GDS4296 and GDS3952) included multiple-class comparisons (three classes for the first, five for the second and four for the last one), while the remaining five databases (GDS4887, GDS4794, GDS4762, GDS4471, and GDS3945) included two classes.
In more details, the GDS4968 data set included 99 samples and 33,297 gene features of pre-malignant (monoclonal gammopathy) and malignant (multiple myeloma) bone marrow plasma cells [7]. Three-class comparison was made between a group of gammopathy of undetermined significance (n = 20), multiple myeloma (n = 41) and smoldering (i.e., asymptomatic) multiple myeloma (n = 33); data related to five healthy controls were excluded from the analyses due to the insufficient sample size.
GDS4887 included 40 samples and 54,675 features of hepatocellular carcinoma diagnosed in patients with chronic hepatitis C [8]. Expression profile of 20 carcinoma tumor was compared to that of 20 non tumor tissues.
GDS4794 included 23 samples of small cell lung cancer and 42 normal tissues [9]. Gene expression features were 54,675. GDS4762 included 37 breast cancer cells and 43 fibroblasts [10], and gene expression was measured in 33,297 features.
GDS4471 included 76 samples of medulloblastoma (a malignant childhood cancer of the brain) from patients aged 3 to 16 years [11]. The analysis was performed comparing the expression of 54,675 features in 51 samples of classic medulloblastoma (MB_CL) with that of a pool of 25 other types (namely: 6 desmoplastic nodular, 17 anaplastic and 2 medullomyoblastoma).
GDS4296 included 54,675 features and 174 samples from 9 different cancer tissues [12]. Among them the following malignancies were included in the analysis: 23 renal cancers, 21 colon cancers, 26 malignant melanomas, 26 non-small cell lung cancers and 21 ovarian cancers, whereas other tumor types were excluded for insufficient sample size (namely: 15 breast, 18 central nervous system and 6 prostate cancers, and 18 leukemia cases).
GDS3952 included 162 samples from breast cancer patients and a variety of other malignancies and normal tissues [13]. Comparison between the expression of 54, 675 features in the following disease groups was performed: benign breast cancer (n = 37), ectopic cancers (n = 22), healthy controls (n = 31) and malignant breast cancer (n = 51); other smaller groups were not considered in the present analysis.
Finally, GDS3945 included 33,297 features in 42 samples of renal clear cell carcinoma [14]. Comparison was made between 21 samples of renal clear carcinoma cells stimulated by T3 thyronine and 21 samples from healthy controls. Data also included information about different genotypes related to thyronine receptor variants, but data were too sparse to allow for further analyses. Table 3 shows the results of the analysis of gene expression profiles in the eight selected data sets by LLM and by the four competing methods. In general, performance was very high for each algorithm in each comparison except for ANN in the GDS3945 data set. In two cases, including GDS4794 and GDS4762, all machine learning methods performed equally well with the maximum attainable accuracy (100%), except for ANN in the later data set. With regard to the other six comparisons, performance of SVM (measured by the Cohen's kappa index) was the highest in each analysis, except for GDS3952, where kNN showed the best accuracy, and GDS3945, where LLM performed better. However, in the GDS4887 data set ANN and kNN had the same performance of SVM, and in the analysis of GDS4296 also kNN performed equally high. Figure 1 shows the corresponding summary ROC curves, obtained under a proper model assumption, while the corresponding areas under the curves (sAUC) and the related diagnostic Odds Ratios are reported in Table 4. SVM and LLM outperformed any other method (sAUC = 0.996 and 0.995, respectively). kNN and DT also showed an excellent accuracy (sAUC = 0.991 and 0.964), respectively, while ANN had a quite lower accuracy (sAUC = 0.904). Classification rules generated by the logic learning machine Table 5 illustrates the classification rules extracted by the LLM from each data set. In four analyses, corresponding to GDS4887, GDS4794, GDS4762 and GDS3945 data sets, the LLM classifier included very few simple rules, based on only one condition and two features (one for each class at comparison). Conversely (and not surprisingly), more complex rules were needed to classify samples from the three datasets that included multiple classes (GDS4968, GDS4296, and GDS3952). Three conditions based on a small set of features were sufficient to reach a satisfactory classification accuracy, except for the analysis of the GDS3952 data set (the corresponding very complex rules had a low covering and were not included in Table 5). Furthermore, the second rule for the classification of monoclonal gammopathies in the GDS4968 data set included a feature generated by artificial DNA used as an internal control (named "Control_3389"), thus indicating that the classification was partly based on an artifact; covering of the corresponding rule was rather high, i.e., 60%. In the absence of available reliable measures of the concentration of the corresponding mRNA (e.g., by quantitative PCR) it is impossible to establish if such an artifact can be attributable either to a defective extraction of the LLM algorithm or to noising data related to the intrinsic variability of the microarray technique. Finally, in the GDS4471 data set four classification rules, including a maximum of two conditions each and 6 different features, were identified to separate 41 samples of the classic form of a childhood brain cancer (medulloblastoma) for a pool of 25 heterogeneous samples of rare variants of the same malignancy.

Discussion
LLM is an innovative method of supervised analysis that can identify simple and intelligible rules for classification tasks. In previous investigations its accuracy was comparable to that of most common supervised methods based on black-box algorithms and also outperformed DT, which shares with LLM the capability to generate intelligible rules [1-3, 15, 16].
In the present investigation LLM showed an excellent performance in the analysis of dataset for cancer diagnosis. Comparisons included different types of malignancies or different classes of cancer and non-cancer cells. Carcinogenesis is characterized by the disruption of several biochemical paths; then, many to several genes are expected to be strongly differently expressed between different tumor types and, even more so, between neoplastic and non-neoplastic tissues [17]. Contrarily to other rule-based methods, including DT, LLM classification is based on a set of partially overlying rules. Such rules are able to identify small subgroups of interest that tend to escape classification methods based either on univariate analysis or on a divide-and-conquer approach.
An advantage of LLM and DT compared to black-box classifiers is that classification rules can provide useful insights about cancer biology. In some instances, the potential biological role of those genes identified by the simple one-feature LLM rules can be confirmed by evidence from recent studies on independent cohorts of cancer patients. For example, in our study LLM revealed a downregulation of the Aqp7 gene in a set of 20 hepatocellular carcinoma samples (Table 5, GDS4887 data set). Aqp7 encodes for a transmembrane channel protein belonging to the aquagliceroporins family, whose function is the regulation of transcellular movements of water and glycerol in many mammalian tissues [18]. A significative reduction of AQP7 was reported in hepatocellular cancer cells by a recent investigation on 68 patients. The study also found an upregulation of AQP9 and a downregulation of AQP3 (two molecules belonging to the same family of AQP7), indicating that the dysregulation of the aquaporine activity could play a fundamental role in the liver cancer development and progression [19].  In the GDS4794 data set LLM found an overexpression of Cbx3 in 23 small cell lung cancer samples. CBX3 is a member of the heterochromatin protein 1 family that plays a fundamental role in multiple cellular functions, including DNA repair, transcriptional regulation, and telomere function [20]. CBX3 has been suspected to be responsible of the silencing of tumor suppressor genes and it was demonstrated to be upregulated in many tumors. Interestingly, in a recent investigation, Chang et al. [21] found a high prevalence of samples positive for the CBX3 in a set of 88 lung cancer patients. Most subjects were affected by lung adenocarcinoma, while samples in the GDS4794 database belonged to another subtype of lung cancer (namely: the small cell one). Our finding, if confirmed by further independent investigations, could provide some insights to the knowledge of the carcinogenesis of the different subtypes of lung cancer.
Composite rules generated by combining different features are more difficult to interpret, but they could shed some light on the complex interaction at the basis of carcinogenesis and tumor progression. In the GDS4471 data set, LLM found an overexpression of the Efhd2 gene that, combined to a downregulation of Loc100132891, was able to separate most samples of classic medulloblastoma (88.3%) from the other subtypes of the same tumor. Medulloblastoma is a childhood cancer, whose different subtypes can be associated with different origins and patient survival [11]. Difference in gene expression between different tumor subtypes could help both to provide some insights on the carcinogenesis process and to identify potential targets for new therapeutic approaches. In this example, the overexpression of Efhd2 (also commonly known as swiprosin-1) is not a surprising finding, in that it encodes for a calcium ion binding protein involved in several cellular functions, including neurodegeneration and cancer [22]. Conversely, Loc100132821 is a long noncoding RNA whose function is, at least to our knowledge, still unknown (http://www.igenebio.com/gene/hs-loc100132891.html). In the two last decades many long non-coding transcripts have been found to play a crucial role in carcinogenesis [23], then the possibility that a reduced transcription of Loc100132821 in cells with a high expression of swiprosin-1 could be involved in the carcinogenesis of classic medulloblastoma might be a new finding that deserves further investigations.
The present study represents the first analysis comparing LLM with different supervised methods for cancer diagnosis using a set of microarray data including different cancer types. LLM showed an excellent accuracy, but our results should be interpreted at the light of some limitations of this study. First of all, this should be considered as a pilot study comparing classification methods under similar simplified conditions and the proposed method should be applied to larger sets of microarray data, including more difficult clinical context (e.g., evaluation of   patients' prognosis, response to drug treatments, vaccination trials, etc.). In particular, the tuning methods have produced a very high classification performance that, in the absence of an independent validation cohort, should be interpreted as an upper limit of the actual expected classification accuracy.
In spite of the extensive search in the GEO repository, only eight data sets with a sufficiently high dimension have been identified, then the excellent performance found for LLM should be confirmed by further independent studies. Baseline conditions that can influence the LLM performance are still to be elucidated, and further analyses based on large data sets including both real and ad hoc simulated data are needed, in particular to assess the effect of different sample size, unbalanced vs. balanced design, noisy data, and poorly defined gold standards.
Finally, in this study, similarly to some other previous investigations [1,2,16], LLM outperformed DT, which shares with our method the capability to extract simple rules for classification purposes. At least at our knowledge, DT is the most largely employed method of machine learning based on simple threshold rules. However, in the last decades some other innovative methods have been proposed and applied for classification purposes in many biomedical fields [24,25], then in further investigations the accuracy of LLM should be also compared to that of such methods.

Conclusions
LLM is an innovative method of supervised analysis that has shown a very high accuracy in classifying patients with different oncological conditions. The LLM classifier is typically based on a small set of simple and intelligible rules that could contribute to a better understanding of carcinogenesis process, hopefully supporting the adoption of new therapeutic approaches.

Data selection
Data were drawn from a set of publicly available databases of gene expression microarrays, stored in the GEO repository bank (http://www.ncbi.nlm.nih.gov/gds/). Selection criteria were: a) inclusion in the GEO data bank from January 2010 to December 2014; b) presence of at least two classes potentially useful for cancer diagnosis, including at least 20 samples each; c) availability of a scientific paper in English language, published on PubMed, and fully describing the experiment and the related study design.
An early selection was made using the following key words in the GEO website: cancer AND human The retrieved databases were carefully investigated to assess their full compliance to the selection criteria. Moreover, studies based on a matched design were excluded, because all the applied methods of analysis rely on the assumption of independent sampling.
Logic learning machine (LLM) LLM generates classifiers described by a set of intelligible rules of the type: where <premise> is a logical product (AND) of conditions and < consequence> provides a class assignment for the output [5,6,15].
LLM produces rules through a three-step process, namely: latticization (binarization), monotone Boolean function reconstruction and rule generation (Fig. 2). In the first phase (latticization) each variable is transformed into a string of binary data, using the inverse only-one coding [15]; then, resulting coded strings are concatenated in one unique large sequence of bits. In the second phase (monotone Boolean function reconstruction) a set of binary vectors, called implicants, is selected and allows the identification of clusters associated with a specific class. During the third phase all the generated implicants are transformed into as many rules, each one including a collection of simple threshold conditions in its <premise> part. Algorithms for the efficient generations of implicants, starting from any dataset, have been illustrated in detail elsewhere [15].
A set of quality measures has been defined for any rule r generated by LLM [2,15], such as the proportion of correct classifications C(r), called the covering, or the false positive fraction E(r). In a binary classification task, depending on the class identified by the rule r, C(r) will correspond to either the sensitivity or the specificity. Let r' represent the rule obtained from r by removing the condition c from its premise part. A simple measure of the relevance R(c) of that condition is then provided by: Finally, a measure of relevance R v (x j ) for each variable x j can be obtained by applying the following equation: where k varies on the indices of rules r k that includes a condition c kl on the variable x j . As a rule of thumb, the inequality R v (x j ) ≤ 10% is used to identify a predictor x j providing a marginal contribution to the accuracy of LLM classifiers, while a rule with C(r)(1 -E(r)) ≤ 10% often covers subjects with anomalous values (possible outliers).

Measures of quality for a single analysis
Performance of LLM was compared with that of four selected competing methods of supervised learning (DT, ANN, SVM, and kNN) in leave-one-out cross-validation (LOOCV). Standard measures of quality were obtained for each analysis and proper comparison techniques were adopted to evaluate the overall performance of each classification method in the whole set of analysis. A parameter tuning procedure was adopted to enhance the performance of each selected method and only the models with the highest accuracy were retained. For instance, for LLM a set of values for the E(r) parameter were selected, ranging from 2.5 to 7.5% (step 0.5%). Parameter tuning for the competing methods will be described in the dedicated paragraphs.  the sensitivity SE and the specificity SP: the Cohen kappa coefficient K: the odds ratio OR: In the present investigation we have log-transformed the OR to exploit its asymptotic Normal distribution [26]. An asymptotic estimate of the variance σ 2 of the Fig. 2 Schematic representation of the Logic Learning Machine algorithm. In the first phase (Latticization) each variable is transformed into a string of binary data, using the inverse only-one code binarization and all strings are eventually concatenated in one unique large string per each subject. In the second phase (Shadow Clustering) a set of binary vectors (the "implicants") is generated, each of which identifies a cluster in the input space associated with a specific output class. Finally, all the implicants are transformed into simple conditions and combined in a set of intelligible rules logarithm log(OR) of the odds ratio is readily obtained through the equation [27]: where the continuity correction is adopted if one of the terms at the denominator is null [27].
In the presence of m multiple outcomes, the definition of the Cohen kappa coefficient can be generalized as follows [28]: where a ij represents the counting of elements in the i row and j column of the confusion matrix and represents the corresponding expected counting in the case of a random distribution of the elements inside the cells of the confusion matrix.
The specificity SP m for m multiple outcomes was simply obtained by selecting a category as the reference and computing the proportion of correctly classified samples inside that category.
The corresponding estimate of log(OR) and of its related variance σ 2 were then retrieved by applying the Mantel Haenszel (MH) method [29]. Let i = 1 be the index of the reference category in the confusion matrix; the MH estimate of log(OR) is obtained as: whereas the corresponding asymptotic estimate of the variance σ 2 is obtained through the following equation [30]: a 11 a ii a 11 þ a 1i þ a i1 þ a ii ! Finally, the sensitivity SE for multiple outcomes is obtained by exploiting the relationship between OR, SE and SP reported in eq. (1): A "natural" reference category for multiple outcomes was adopted, whenever possible, selecting either the group of subjects without any disease, if any, or the class with the (allegedly) less severe illness. Otherwise, in the case of comparison between groups of severely diseased patients (i.e., classes including only malignant tumors) the reference was arbitrarily defined as the class with the highest number of individuals.

Common measures of quality across studies: the summary ROC curve
For each dataset, comparison between the considered supervised classification methods was based on the K index.
For each classifier a common measure of accuracy across the N studies was obtained by employing the method of the summary ROC (sROC) curves [31]. In particular, the area sAUC under the sROC curve was adopted to evaluate the quality of any classification technique. A proper model was considered, which is described by the following equation: where sOR is the summary odds ratio given by: being OR i the odds ratio of the ith study and σ i 2 the variance of log(OR i ).
An estimate of the standard error for log(sOR) can also be obtained as whereas, under the log-Normal assumption for the distribution of sOR, the related 95% confidence intervals (95%CI) of this estimate are obtained as follows: The value of sAUC represents a summary measure of pure accuracy [31] and is easily obtained from sOR through the following equation: In the present study we have performed an sROC analysis for each of the five classification methods thus resulting in five sROC curves.
All supervised analyses were carried out by using Rulex, a software suite developed and commercialized by Rulex Inc. (http://www.rulex.ai). Summary ROC analysis was performed by Stata for Windows statistical software (release 12.1, Stata Corporation, College Station, TX).

Brief description of competing methods of supervised data mining Decision tree (DT)
A DT is a graph where each node is associated with a condition based on an attribute of the input vector x and each leaf corresponds to an assignment for a specified output class. Moving from a leaf to a root, a simple intelligible rule can be identified [32].
DT is obtained by a "divide-and-conquer" approach that provides disjoint rules. At each iteration, a new node is added to the DT by choosing the condition that subdivides the training set S according to a specific measure of goodness. Parameter tuning was performed comparing the performance of three different pruning approaches (namely: pessimistic, no pruning and costcomplexity). Furthermore, the highest impurity by node was let to vary between 0.0 and 0.1 (step 0.01).

Artificial neural network (ANN)
ANN is a connectionist model formed by the interconnection of simple units (neurons), arranged in layers. Each neuron computes a weighted sum of the inputs applying a proper activation function, which provides the output value that will be propagated to the following layer. The input vector x is sent to the first layer. The remaining layer receives input from the previous one and the last layer produces the output class to be assigned to x. Weights for each neuron are estimated by suitable optimization techniques and form the set of parameters for the ANN. The Levenberg-Marquardt version of the back propagation algorithm was applied to train the ANN [32]. Parameter tuning was performed comparing the performance of ANN with a different number of both hidden layers (from 0 to 1) and neurons (2 to 6). Moreover, the learning rate was let to vary between 0.25 and 0.75 (step 0.05).

K-nearest neighbor classifier (kNN)
Let n be the number of pairs (x j ,y j ) in the training set S, where x j is the input vector and y j the output class for the jth sample. When a new subject described by the input vector x is to be classified, the nearest k samples in S, according to a suitable distance measure, are determined and the class y associated with the majority of the k nearest samples is assigned to x [32].
In the present investigation the standard Euclidean distance was employed, after having normalized the components of the input vector x to reduce the effect of biases possibly caused by unbalanced domain intervals for different input variables. Tuning procedure was applied to the number of nearest samples letting the k parameter vary between 1 to 10.

Support vector machine (SVM)
SVM is a non-probabilistic binary linear classifier based on the identification of an optimal hyperplane of separation between two classes [32]. Given a training set, the classifier selects a subset l of input vectors x j in the training set S, called support vectors, and their corresponding outputs y j ∈ {− 1,1}. The class y for any input vector x is then given by: where the coefficients α j and the offset b are evaluated through a proper training algorithm. K(·,·) is a kernel function used to perform a non-linear classification by constructing an optimal hyperplane in a high dimensional projected space. A linear kernel function was tested on each dataset. The training algorithm was performed by using the LIBSVM library, which is featured by the Rulex Analytics software. The performance of SVM with linear and RBF kernels was tested. Tuning procedure also included the degree of the kernel function that was let range from 1 to 10.