Patients
One hundred and eighty-two neuroblastoma patients were enrolled on the basis of the availability of Affymetrix GeneChip HGU133plus2.0 gene expression profiles. Eighty-eight patients were collected by the Academic Medical Center (AMC; Amsterdam, Netherlands) [1, 62]; 21 patients were collected by the University Children's Hospital, Essen, Germany, and were treated according to the German Neuroblastoma trials NB97 or NB2004; 51 patients were collected at Hiroshima University Hospital or affiliated hospitals and were treated according to the Japanese neuroblastoma protocols [63]; 22 patients were collected at the Gaslini Institute (Genoa, Italy) and were treated according to the Italian AIEOP protocols. The data are stored in the R2 microarray analysis and visualization platform (AMC and Essen patients) or in the BIT-neuroblastoma Biobank of the Gaslini Institute. The investigators who deposited data in the R2 repository agreed to their use in this work. In addition, we utilized the data for the Hiroshima patients available in the public Gene Expression Omnibus database under accession number GSE16237 [63]. Informed consent was obtained in accordance with the institutional policies in use in each country. In every dataset, the median follow-up was longer than 5 years; tumor stage was defined as stage 1, 2, 3, 4, or 4s according to the International Neuroblastoma Staging System (INSS); MYCN status was classified as normal or amplified; and two age groups were considered, patients younger than 12 months at diagnosis and those aged 12 months or older. Good and poor outcome were defined as the patient's status, alive or dead, 5 years after diagnosis. The characteristics of the patients are shown in Table 1.
Batch effect measure and removal
The PVCA approach [41] was used to estimate the variability attributable to experimental effects, including batch. The pvca package implemented in R was utilized to perform the analysis, setting a predefined threshold of 60%. The analysis included the age at diagnosis, MYCN amplification, INSS stage, outcome, and institute variables. The estimation of the experimental effects was performed before and after batch effect removal.
The frozen surrogate variable analysis (FSVA) implemented in the sva package [64] was utilized to remove the batch effect from the training and test sets. The parametric prior method was used, with the institute as the batch variable.
Gene expression analysis
Gene expression profiles for the 182 tumors were obtained by microarray experiments using the Affymetrix GeneChip HGU133plus2.0, and the data were processed with the MAS5.0 software according to Affymetrix guidelines.
Preprocessing step
To describe the procedure adopted to discretize the values assumed by the probe sets, some basic notation must be introduced. In a classification problem, d-dimensional examples x \in X \subset {\Re}^{d} are to be assigned to one of q possible classes, labeled by the values of a categorical output y. Starting from a training set S including n pairs (x_{i}, y_{i}), i = 1, ..., n, derived from previous observations, techniques for solving classification problems aim at generating a model g(x), called a classifier, that provides the correct answer y = g(x) for most input patterns x. Concerning the components x_{j}, two different situations can occur:

1. ordered variables: x_{j} varies within an interval [a,b] of the real axis and an ordering relationship exists among its values;

2. nominal (categorical) variables: x_{j} can assume only the values contained in a finite set and there is no ordering relationship among them.
A discretization algorithm has the aim of deriving, for each ordered variable x_{j}, a (possibly empty) set of cutoffs γ_{jk}, with k = 1, ..., t_{j}, such that for every pair x_{u}, x_{v} of input vectors in the training set belonging to different classes (y_{u} ≠ y_{v}), their discretized counterparts z_{u}, z_{v} have at least one different component.
Denote with ρ_{j} the vector which contains the α_{j} distinct values of the input variable x_{j} in the training set, arranged in ascending order, i.e. ρ_{jl} < ρ_{j,l+1} for each l = 1, ..., α_{j}−1. Then, we can consider a set of binary values τ_{jl}, with j = 1, ..., d and l = 1, ..., α_{j}−1, asserting whether a separation must be set for the jth variable between its lth and (l+1)th values:
{\tau}_{jl}=\left\{\begin{array}{cc}1,\hfill & \text{if } {\gamma}_{j} \text{ contains } {\rho}_{jl}\hfill \\ 0,\hfill & \text{otherwise}\hfill \end{array}\right.
Of course, the total number of cutoffs introduced is given by
{\displaystyle \sum _{j=1}^{d}}{\displaystyle \sum _{l=1}^{{\alpha}_{j}-1}}{\tau}_{jl}
which must be minimized under the constraint that examples x_{u} and x_{v} belonging to different classes have to be separated by at least one cutoff. To this aim, let X_{juv} be the set of indexes l such that ρ_{jl} lies between x_{uj} and x_{vj}:
{X}_{juv}=\left\{l \mid {x}_{uj}<{\rho}_{jl}<{x}_{vj}\right\}
Then, the discretization problem can be stated as:
\underset{\tau}{\min}{\displaystyle \sum _{j=1}^{d}}{\displaystyle \sum _{l=1}^{{\alpha}_{j}-1}}{\tau}_{jl}
\text{subj. to}\phantom{\rule{1em}{0ex}}{\displaystyle \sum _{j=1}^{d}}{\displaystyle \sum _{l\in {X}_{juv}}}{\tau}_{jl}\ge 1\phantom{\rule{1em}{0ex}}\text{for each } u, v \text{ s.t. } {y}_{u}\ne {y}_{v}
(1)
To improve the separation ability of the resulting set of cutoffs the constraint in (1) can be reinforced by imposing that
{\displaystyle \sum _{j=1}^{d}}{\displaystyle \sum _{l\in {X}_{juv}}}{\tau}_{jl}\ge s
for some s ≥ 1. Intensive trials on real-world datasets have shown that a good value for s is given by s = 0.2d; this choice has been adopted in all the analyses performed in the present paper.
Since the solution of the programming problem in (1) can require an excessive computational cost, a near-optimal greedy approach is adopted by the Attribute Driven Incremental Discretization (ADID) procedure [39]. It follows an iterative algorithm that, at each step, adds the cutoff achieving the highest value of a quality measure based on the capability of separating patterns belonging to different classes. Smart updating procedures enable ADID to efficiently attain a (sub)optimal discretization.
After the set of candidate cutoffs is produced, a subsequent refinement phase adjusts their positions. This updating task significantly increases the robustness of the final discretization.
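As an illustration of this greedy strategy, cutoffs can be selected one at a time, each chosen to separate the largest number of not-yet-separated pairs of examples belonging to different classes. The sketch below is our own minimal re-implementation of this set-covering idea, not the actual ADID code; the function name and the toy data are illustrative assumptions:

```python
from itertools import combinations

def greedy_cutoffs(X, y):
    """Greedily pick cutoffs (j, threshold) until every pair of examples
    with different classes is separated by at least one cutoff."""
    d = len(X[0])
    # Candidate cutoffs: midpoints between consecutive distinct values of each variable.
    candidates = []
    for j in range(d):
        vals = sorted({x[j] for x in X})
        candidates += [(j, (a + b) / 2) for a, b in zip(vals, vals[1:])]
    # Pairs of examples belonging to different classes that still need separation.
    pairs = {(u, v) for u, v in combinations(range(len(X)), 2) if y[u] != y[v]}
    chosen = []
    while pairs:
        # Quality measure: number of not-yet-separated pairs a cutoff separates.
        def separated(c):
            j, t = c
            return {(u, v) for u, v in pairs if (X[u][j] - t) * (X[v][j] - t) < 0}
        best = max(candidates, key=lambda c: len(separated(c)))
        sep = separated(best)
        if not sep:  # remaining pairs cannot be separated (identical patterns)
            break
        chosen.append(best)
        pairs -= sep
    return chosen

# Toy data: variable 0 is informative, variable 1 is noisy.
X = [(1.0, 5.0), (2.0, 1.0), (8.0, 5.0), (9.0, 2.0)]
y = [0, 0, 1, 1]
print(greedy_cutoffs(X, y))  # → [(0, 5.0)]: one cutoff on variable 0 suffices
```

A single cutoff on the informative variable separates all cross-class pairs here; the real ADID procedure additionally refines cutoff positions and uses smart incremental updates of the quality measure.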
Classification by ADID and LLM implemented in Rulex 2.0
A classification model was built on the expression values of the 62 probe sets constituting the NB-hypo signature [1]. Model generation and performance assessment were carried out by splitting the dataset into a training set, comprising 60% of the whole patient cohort, and a test set comprising the remaining 40%. To build the classifier, a Rulex 2.0 process was designed, combining a discretizer component that adopts the Attribute Driven Incremental Discretization (ADID) procedure [39] with a classification component that adopts a rule generation method called Logic Learning Machine (LLM). Entropy-based (EntMDL [42]), modified Chi square [43], ROC-based (highest Youden index [44]), and equal-frequency (i.e. median expression for each feature) components were executed as alternative discretization methods. To design the most accurate classifier, one important parameter of the LLM component was evaluated: the maximum error allowed on the training set, which defines the maximum percentage of examples covered by a rule whose class differs from the class of the rule. The parameter values evaluated were 0%, 5%, 10%, 15%, 20%, 25%, and 30%. For each parameter value, a 10-times repeated 10-fold cross-validation analysis was performed and the classification performances were collected. The parameter value that obtained the best mean classification accuracy was selected to train the final gene expression-based classifier on the whole training set utilizing the aforementioned Rulex components. The Rulex software suite is commercialized by Impara srl [40].
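The repeated cross-validation loop used for this parameter selection can be sketched as follows; `fit` and `score` stand for a generic train/evaluate pair and are placeholders of our own, not the Rulex API:

```python
import random

def kfold_indices(n, k, seed):
    """Split indices 0..n-1 into k shuffled folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def repeated_cv_accuracy(fit, score, X, y, k=10, repeats=10):
    """Mean score over `repeats` runs of k-fold cross-validation."""
    accs = []
    for rep in range(repeats):
        for fold in kfold_indices(len(X), k, seed=rep):
            test = set(fold)
            train = [i for i in range(len(X)) if i not in test]
            model = fit([X[i] for i in train], [y[i] for i in train])
            accs.append(score(model, [X[i] for i in fold], [y[i] for i in fold]))
    return sum(accs) / len(accs)

# Hypothetical sweep over the maximum-error parameter of the rule learner:
# best = max((repeated_cv_accuracy(make_fit(e), score, X, y), e)
#            for e in (0.0, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30))
```

The parameter value with the best mean cross-validated accuracy is then used to retrain the final model on the whole training set.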
Decision Tree [45], Support Vector Machines (SVM) [46], and Prediction Analysis of Microarrays (PAM) [47] were run on the same training and test sets for reference.
Performance evaluation in predicting patients' outcome
To evaluate the prediction performance of the classifiers we used the following metrics: accuracy, recall, precision, specificity, and negative predictive value (NPV), considering good outcome patients as positive instances and poor outcome patients as negative instances. Accuracy is the proportion of correctly predicted examples in the overall number of instances. Recall is the proportion of correctly predicted positive examples among all the positive instances of the dataset. Precision is the proportion of correctly classified positive examples among all the predicted positive instances. Specificity is the proportion of correctly predicted negative examples among all the negative instances of the dataset. NPV is the proportion of correctly classified negative examples among all the predicted negative instances.
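These definitions can be made concrete as follows, where positives are good-outcome patients (the counts in the usage line are invented for illustration):

```python
def outcome_metrics(tp, fp, fn, tn):
    """Performance metrics from confusion counts, with good-outcome
    patients as positives and poor-outcome patients as negatives."""
    return {
        "accuracy":    (tp + tn) / (tp + fp + fn + tn),
        "recall":      tp / (tp + fn),   # sensitivity on good outcome
        "precision":   tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "npv":         tn / (tn + fn),
    }

# Illustrative counts, not values from the paper:
print(outcome_metrics(tp=50, fp=5, fn=10, tn=35))
```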
Rule quality measures
Rule generation methods constitute a subset of classification techniques that generate explicit models g(x) described by a set of m rules r_{k}, k = 1, ..., m, in the if-then form:
\mathbf{\text{if}}\phantom{\rule{0.3em}{0ex}}<premise>\phantom{\rule{0.3em}{0ex}}\mathbf{\text{then}}\phantom{\rule{0.3em}{0ex}}<consequence>
where <premise> is the logical product (and) of m_{k} conditions c_{kl}, with l = 1, ..., m_{k}, on the components x_{j}, whereas <consequence> gives a class assignment y = \mathit{\u1ef9} for the output. In general, a condition c_{kl} in the premise involving an ordered variable x_{j} has one of the following forms: x_{j} > λ, x_{j} ≤ μ, or λ < x_{j} ≤ μ, λ and μ being two real values, whereas a nominal variable x_{k} leads to membership conditions {x}_{k}\in \left\{\alpha ,\phantom{\rule{2.77695pt}{0ex}}\delta ,\phantom{\rule{2.77695pt}{0ex}}\sigma \right\}, α, δ, σ being admissible values for the kth component of x.
For instance, if x_{1} is an ordered variable in the domain {1, ..., 100} and x_{2} is a nominal component assuming values in the set {red, green, blue}, a possible rule r_{1} is
if\phantom{\rule{0.3em}{0ex}}{x}_{1}>40\phantom{\rule{0.3em}{0ex}}and\phantom{\rule{0.3em}{0ex}}{x}_{2}\in \left\{red,blue\right\}\phantom{\rule{0.3em}{0ex}}then\phantom{\rule{0.3em}{0ex}}y=0
where 0 denotes one of the q possible assignments (classes).
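A rule of this kind can be represented programmatically; the sketch below uses a minimal representation of our own (a list of index/predicate pairs for the premise plus a predicted class), not Rulex's internal format:

```python
# A condition is (variable index, test); a rule is (conditions, predicted class).
# Example rule r1: if x1 > 40 and x2 in {red, blue} then y = 0.
rule = (
    [(0, lambda v: v > 40),                  # x1 > 40   (ordered variable)
     (1, lambda v: v in {"red", "blue"})],   # x2 in {red, blue}  (nominal)
    0,                                       # then y = 0
)

def satisfies(rule, x):
    """True iff instance x satisfies every condition in the rule's premise."""
    conditions, _ = rule
    return all(test(x[j]) for j, test in conditions)

print(satisfies(rule, (55, "red")))    # premise satisfied
print(satisfies(rule, (55, "green")))  # nominal condition fails
```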
According to the output value included in their consequence part, the m rules r_{k} describing a given model g(x) can be subdivided into q groups G_{1}, G_{2}, ..., G_{q}. Considering the training set S, any rule r ∈ G_{l} is characterized by four quantities: the numbers TP(r) and FP(r) of examples (x_{i}, y_{i}) with y_{i} = y_{l} and y_{i} ≠ y_{l}, respectively, that satisfy all the conditions in the premise of r, and the numbers FN(r) and TN(r) of examples (x_{i}, y_{i}) with y_{i} = y_{l} and y_{i} ≠ y_{l}, respectively, that do not satisfy at least one of the conditions in the premise of r.
The quality of a rule was measured utilizing the following quantities. Given a rule r, we define the covering C(r), the error E(r), and the precision P(r) according to the following formulas:
C\left(r\right)=\frac{TP\left(r\right)}{TP\left(r\right)+FN\left(r\right)}\phantom{\rule{1em}{0ex}},\phantom{\rule{1em}{0ex}}E\left(r\right)=\frac{FP\left(r\right)}{FP\left(r\right)+TN\left(r\right)}\phantom{\rule{1em}{0ex}},\phantom{\rule{1em}{0ex}}P\left(r\right)=\frac{TP\left(r\right)}{TP\left(r\right)+FP\left(r\right)}
The covering of a rule is the fraction of training examples of the target class that satisfy the rule. The error of a rule is the fraction of training examples not belonging to the target class that nevertheless satisfy the rule. The precision of a rule is the fraction of examples satisfying the rule that belong to the target class. The greater the covering and the precision, the higher the generality and the correctness of the corresponding rule.
To test the statistical significance of the rules we used Fisher's exact test (FET) implemented in the R software package, considering significant any rule with P < 0.05.
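A minimal sketch of how the four counts and the derived covering, error, and precision of a single rule can be computed on a training set; the representation of the premise as a Python predicate and the toy data are our own:

```python
def rule_quality(premise, target, X, y):
    """Covering C(r), error E(r) and precision P(r) of the rule
    `if premise(x) then y = target`, evaluated on the training set (X, y)."""
    tp = sum(1 for x, yi in zip(X, y) if premise(x) and yi == target)
    fp = sum(1 for x, yi in zip(X, y) if premise(x) and yi != target)
    fn = sum(1 for x, yi in zip(X, y) if not premise(x) and yi == target)
    tn = sum(1 for x, yi in zip(X, y) if not premise(x) and yi != target)
    return {
        "covering":  tp / (tp + fn),   # fraction of target-class examples covered
        "error":     fp / (fp + tn),   # fraction of other-class examples covered
        "precision": tp / (tp + fp),   # fraction of covered examples in target class
    }

# Toy training set: a single expression value per patient.
X = [(30,), (50,), (60,), (70,), (45,)]
y = ["poor", "good", "good", "good", "poor"]
print(rule_quality(lambda x: x[0] > 40, "good", X, y))
```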
Relevance measure and ranking of the probe sets
To obtain a measure of the importance of the features included in the rules, and to rank the features according to this value, we utilized a measure called the relevance R(c) of a condition c. Consider the rule r′ obtained by removing the condition c from r. Since the premise part of r′ is less stringent, we have E(r′) ≥ E(r), so that the quantity R(c) = (E(r′) − E(r)) C(r) can be used as a measure of relevance for the condition c of interest.
Since each condition c refers to a specific component of x, we define the relevance R_{v}(x_{j}) for every input variable x_{j} as follows:
{R}_{v}\left({x}_{j}\right)=1-{\displaystyle \prod _{k}}\left(1-R\left({c}_{kl}\right)\right)
where the product is computed on the rules r_{k} that include a condition c_{kl} on the variable x_{j}.
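This combination of condition relevances into a variable relevance can be sketched directly from the formula above (the function name is our own):

```python
def variable_relevance(condition_relevances):
    """Combine the relevances R(c_kl) of all conditions on a variable x_j
    into R_v(x_j) = 1 - prod_k (1 - R(c_kl))."""
    prod = 1.0
    for r in condition_relevances:
        prod *= 1.0 - r
    return 1.0 - prod

print(variable_relevance([0.3, 0.5]))  # → 1 - 0.7 * 0.5 = 0.65
```

Note that a variable appearing in no condition gets relevance 0, while any single highly relevant condition dominates the product.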
Denote with V_{kl} the attribute involved in the condition c_{kl} of the rule r_{k}, and with S_{kl} the subset of values of V_{kl} for which the condition c_{kl} is verified. If V_{kl} is an ordered attribute and the condition c_{kl} is V_{kl} ≤ a for some value a ∈ S_{kl}, then the contribution to R_{v}(x_{j}) is negative. Hence, by adding the superscript − (resp. +) to denote the attributes V_{kl} with negative (resp. positive) contribution, we can write R_{v}(x_{j}) for an ordered input variable x_{j} in the following way:
{R}_{v}\left({x}_{j}\right)={\displaystyle \prod _{{V}_{kl}^{-}}}\left(1-R\left({c}_{kl}\right)\right)-{\displaystyle \prod _{{V}_{kl}^{+}}}\left(1-R\left({c}_{kl}\right)\right)
where the first (resp. second) product is computed on the rules r_{k} that include a condition c_{kl} leading to a negative (resp. positive) contribution for the variable x_{j}.
Output assignment for a new instance
When the model g(x) described by the set of m rules r_{k}, k = 1, ..., m, is employed to classify a new instance x, the <premise> part of each rule is examined to verify if the components of x satisfy the conditions included in it. Denote with Q the subset of rules whose <premise> part is satisfied by x; then, the following three different situations can occur:

1. The set Q includes only rules having the same output value ỹ in their <consequence> part; in this case the class ỹ is assigned to the instance x.

2. The set Q contains rules having different output values in their <consequence> part; it follows that Q can be partitioned into q disjoint subsets Q_{i} (some of which can be empty), including the rules r pertaining to the ith class. In this case, to every attribute x_{j} can be assigned a measure of consistency t_{ij}, given by the maximum of the relevance R(c) over the conditions c involving the attribute x_{j} and included in the <premise> part of the rules in Q_{i}. Then, to the instance x is assigned the class \mathit{\u1ef9} associated with the following maximum:
\mathit{\u1ef9}={\displaystyle \underset{i=1,\cdots ,q}{\text{argmax}}}{\displaystyle \sum _{j=1}^{d}}{t}_{ij}

3. The set Q is empty, i.e. no rule is satisfied by the instance x; in this case the set Q_{−1}, containing the subset of rules whose <premise> part is satisfied by x except for one condition, is considered, and points 1 and 2 are again tested with Q = Q_{−1}. If Q is again empty, the set Q_{−2}, containing the subset of rules whose <premise> part is satisfied by x except for two conditions, is considered, and so on.
The conflicting case 2 can be controlled in Rulex 2.0 by assigning a set of weights w_{i} to the output classes; in this way the previous equation can be written as
\mathit{\u1ef9}=\underset{i=1,\cdots ,q}{\text{argmax}}{\displaystyle \sum _{j=1}^{d}}{t}_{ij}{w}_{i}
and we can speak of weighted classification.
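The conflict-resolution step (case 2) and its weighted variant can be sketched as follows; the data structures are our own illustration, not Rulex's internals: each satisfied rule is given as its class index together with the relevances of its conditions, keyed by attribute index:

```python
def assign_class(Q, q, d, weights=None):
    """Resolve conflicts among satisfied rules.
    Q: list of (class index i, {attribute j: relevance of its condition})
    for each satisfied rule. t_ij is the maximum relevance, over the rules
    of class i, of the conditions involving attribute j; the winning class
    maximizes sum_j t_ij * w_i."""
    weights = weights or [1.0] * q
    t = [[0.0] * d for _ in range(q)]
    for i, conds in Q:
        for j, rel in conds.items():
            t[i][j] = max(t[i][j], rel)
    return max(range(q), key=lambda i: weights[i] * sum(t[i]))

# Two weak rules for class 0 and one strong rule for class 1 are satisfied:
Q = [(0, {0: 0.2}), (0, {1: 0.3}), (1, {0: 0.9})]
print(assign_class(Q, q=2, d=2))                      # → 1 (0.9 > 0.5)
print(assign_class(Q, q=2, d=2, weights=[2.0, 1.0]))  # → 0 (weighting favors class 0)
```

With unit weights this reduces to the unweighted argmax; non-uniform weights implement the weighted classification described above.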