Patients
One hundred and eighty-two neuroblastoma patients were enrolled on the basis of the availability of Affymetrix GeneChip HGU133plus2.0 gene expression profiles. Eighty-eight patients were collected by the Academic Medical Center (AMC; Amsterdam, Netherlands) [1, 62]; 21 patients were collected by the University Children's Hospital, Essen, Germany, and were treated according to the German Neuroblastoma trials NB97 or NB2004; 51 patients were collected at Hiroshima University Hospital or affiliated hospitals and were treated according to the Japanese neuroblastoma protocols [63]; 22 patients were collected at the Gaslini Institute (Genoa, Italy) and were treated according to the Italian AIEOP protocols. The data are stored in the R2 microarray analysis and visualization platform (AMC and Essen patients) or in the BIT-neuroblastoma Biobank of the Gaslini Institute. The investigators who deposited data in the R2 repository agreed to their use in this work. In addition, we utilized the data for the Hiroshima patients available in the public Gene Expression Omnibus database under accession number GSE16237 [63]. Informed consent was obtained in accordance with the institutional policies in use in each country. In every dataset, the median follow-up was longer than 5 years; tumor stage was defined as stage 1, 2, 3, 4, or 4s according to the International Neuroblastoma Staging System (INSS); MYCN status was classified as normal or amplified; and two age groups were considered, patients younger than 12 months at diagnosis and those aged 12 months or older. Good and poor outcome were defined as the patient's status, alive or dead, 5 years after diagnosis. The characteristics of the patients are shown in Table 1.
Batch effect measure and removal
The PVCA approach [41] was used to estimate the variability attributable to experimental effects, including batch. The pvca package implemented in R was utilized to perform the analysis, setting a predefined threshold of 60%. The analysis included the age at diagnosis, MYCN amplification, INSS stage, outcome, and institute variables. The estimation of the experimental effects was performed before and after batch effect removal.
The frozen surrogate variable analysis (FSVA) implemented in the sva package [64] was utilized to remove the batch effect from the training and test sets. The parametric prior method was used, with the institute as the batch variable.
Gene expression analysis
Gene expression profiles for the 182 tumors were obtained by microarray experiments using the Affymetrix GeneChip HGU133plus2.0, and the data were processed with the MAS5.0 software according to Affymetrix guidelines.
Preprocessing step
To describe the procedure adopted to discretize the values assumed by the probe sets, some basic notation must be introduced. In a classification problem, d-dimensional examples x \in X \subset {\Re}^{d} are to be assigned to one of q possible classes, labeled by the values of a categorical output y. Starting from a training set S including n pairs (x_{i}, y_{i}), i = 1, ..., n, derived from previous observations, techniques for solving classification problems aim at generating a model g(x), called a classifier, that provides the correct answer y = g(x) for most input patterns x. Concerning the components x_{j}, two different situations can occur:

1. ordered variables: x_{j} varies within an interval [a,b] of the real axis and an ordering relationship exists among its values;

2. nominal (categorical) variables: x_{j} can assume only the values contained in a finite set and there is no ordering relationship among them.
A discretization algorithm has the aim of deriving, for each ordered variable x_{j}, a (possibly empty) set of cutoffs γ_{jk}, with k = 1, ..., t_{j}, such that for every pair x_{u}, x_{v} of input vectors in the training set belonging to different classes (y_{u} ≠ y_{v}), their discretized counterparts z_{u}, z_{v} have at least one different component.
Denote with ρ_{j} the vector which contains the α_{j} distinct values of the input variable x_{j} in the training set, arranged in ascending order, i.e. ρ_{jl} < ρ_{j,l+1} for each l = 1, ..., α_{j}−1. Then, we can consider a set of binary values τ_{jl}, with j = 1, ..., d and l = 1, ..., α_{j}−1, asserting whether a separation must be set for the jth variable between its lth and (l+1)th values:
{\tau}_{jl}=\left\{\begin{array}{cc}1,\hfill & \text{if } {\gamma}_{j} \text{ contains } {\rho}_{jl}\hfill \\ 0,\hfill & \text{otherwise}\hfill \end{array}\right.
Of course, the total number of cutoffs introduced is given by
{\displaystyle \sum _{j=1}^{d}}{\displaystyle \sum _{l=1}^{{\alpha}_{j}-1}}{\tau}_{jl}
which must be minimized under the constraint that examples x_{u} and x_{v} belonging to different classes have to be separated by at least one cutoff. To this aim, let X_{juv} be the set of indexes l such that ρ_{jl} lies between x_{uj} and x_{vj}:
{X}_{juv}=\left\{l \mid {x}_{uj}<{\rho}_{jl}<{x}_{vj}\right\}
Then, the discretization problem can be stated as:
\underset{\tau}{\min}{\displaystyle \sum _{j=1}^{d}}{\displaystyle \sum _{l=1}^{{\alpha}_{j}-1}}{\tau}_{jl}
\text{subj. to}\phantom{\rule{1em}{0ex}}{\displaystyle \sum _{j=1}^{d}}{\displaystyle \sum _{l\in {X}_{juv}}}{\tau}_{jl}\ge 1\phantom{\rule{1em}{0ex}}\text{for each } u, v \text{ s.t. } {y}_{u}\ne {y}_{v}
(1)
To improve the separation ability of the resulting set of cutoffs the constraint in (1) can be reinforced by imposing that
{\displaystyle \sum _{j=1}^{d}}{\displaystyle \sum _{l\in {X}_{juv}}}{\tau}_{jl}\ge s
for some s ≥ 1. Intensive trials on real-world datasets have shown that a good value for s is given by s = 0.2d; this choice has been adopted in all the analyses performed in the present paper.
Since the solution of the programming problem in (1) can require an excessive computational cost, a near-optimal greedy approach is adopted by the Attribute Driven Incremental Discretization (ADID) procedure [39]. It follows an iterative algorithm that, at each step, adds the cutoff achieving the highest value of a quality measure based on the capability of separating patterns belonging to different classes. Smart updating procedures enable ADID to efficiently attain a (sub)optimal discretization.
After the set of candidate cutoffs is produced, a subsequent refinement phase adjusts their positions. This updating task significantly increases the robustness of the final discretization.
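As an illustration of this greedy strategy, cutoffs can be selected one at a time, each chosen to separate the largest number of not-yet-separated pairs of examples belonging to different classes. The sketch below is our own minimal re-implementation of this set-covering idea, not the actual ADID code; the function name and the toy data are illustrative assumptions:

```python
from itertools import combinations

def greedy_cutoffs(X, y):
    """Greedily pick cutoffs (j, threshold) until every pair of examples
    with different classes is separated by at least one cutoff."""
    d = len(X[0])
    # Candidate cutoffs: midpoints between consecutive distinct values of each variable.
    candidates = []
    for j in range(d):
        vals = sorted({x[j] for x in X})
        candidates += [(j, (a + b) / 2) for a, b in zip(vals, vals[1:])]
    # Pairs of examples belonging to different classes that still need separation.
    pairs = {(u, v) for u, v in combinations(range(len(X)), 2) if y[u] != y[v]}
    chosen = []
    while pairs:
        # Quality measure: number of not-yet-separated pairs a cutoff separates.
        def separated(c):
            j, t = c
            return {(u, v) for u, v in pairs if (X[u][j] - t) * (X[v][j] - t) < 0}
        best = max(candidates, key=lambda c: len(separated(c)))
        sep = separated(best)
        if not sep:  # remaining pairs cannot be separated (identical patterns)
            break
        chosen.append(best)
        pairs -= sep
    return chosen

# Toy data: variable 0 is informative, variable 1 is noisy.
X = [(1.0, 5.0), (2.0, 1.0), (8.0, 5.0), (9.0, 2.0)]
y = [0, 0, 1, 1]
print(greedy_cutoffs(X, y))  # → [(0, 5.0)]: one cutoff on variable 0 suffices
```

A single cutoff on the informative variable separates all cross-class pairs here; the real ADID procedure additionally refines cutoff positions and uses smart incremental updates of the quality measure.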
Classification by ADID and LLM implemented in Rulex 2.0
A classification model was built on the expression values of the 62 probe sets constituting the NB-hypo signature [1]. Model generation and performance assessment were carried out by splitting the dataset into a training set, comprising 60% of the whole patient cohort, and a test set comprising the remaining 40%. To build the classifier, a Rulex 2.0 process was designed, combining a discretizer component that adopts the Attribute Driven Incremental Discretization (ADID) procedure [39] with a classification component that adopts a rule generation method called Logic Learning Machine (LLM). Entropy-based (EntMDL [42]), modified Chi square [43], ROC-based (highest Youden index [44]), and equal-frequency (i.e. median expression for each feature) components were executed as alternative discretization methods. To design the most accurate classifier, one important parameter of the LLM component was evaluated: the maximum error allowed on the training set, which defines the maximum percentage of examples covered by a rule whose class differs from the class of the rule. The parameter values evaluated were 0%, 5%, 10%, 15%, 20%, 25%, and 30%. For each parameter value, a 10-times repeated 10-fold cross-validation analysis was performed and the classification performances were collected. The parameter value that obtained the best mean classification accuracy was selected to train the final gene expression-based classifier on the whole training set utilizing the aforementioned Rulex components. The Rulex software suite is commercialized by Impara srl [40].
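The repeated cross-validation loop used for this parameter selection can be sketched as follows; `fit` and `score` stand for a generic train/evaluate pair and are placeholders of our own, not the Rulex API:

```python
import random

def kfold_indices(n, k, seed):
    """Split indices 0..n-1 into k shuffled folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def repeated_cv_accuracy(fit, score, X, y, k=10, repeats=10):
    """Mean score over `repeats` runs of k-fold cross-validation."""
    accs = []
    for rep in range(repeats):
        for fold in kfold_indices(len(X), k, seed=rep):
            test = set(fold)
            train = [i for i in range(len(X)) if i not in test]
            model = fit([X[i] for i in train], [y[i] for i in train])
            accs.append(score(model, [X[i] for i in fold], [y[i] for i in fold]))
    return sum(accs) / len(accs)

# Hypothetical sweep over the maximum-error parameter of the rule learner:
# best = max((repeated_cv_accuracy(make_fit(e), score, X, y), e)
#            for e in (0.0, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30))
```

The parameter value with the best mean cross-validated accuracy is then used to retrain the final model on the whole training set.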
Decision Tree [45], Support Vector Machines (SVM) [46], and Prediction Analysis of Microarrays (PAM) [47] were run on the same training and test sets for reference.
Performance evaluation in predicting patients' outcome
To evaluate the prediction performance of the classifiers we used the following metrics: accuracy, recall, precision, specificity, and negative predictive value (NPV), considering good outcome patients as positive instances and poor outcome patients as negative instances. Accuracy is the proportion of correctly predicted examples in the overall number of instances. Recall is the proportion of correctly predicted positive examples among all the positive instances of the dataset. Precision is the proportion of correctly classified positive examples among all the predicted positive instances. Specificity is the proportion of correctly predicted negative examples among all the negative instances of the dataset. NPV is the proportion of correctly classified negative examples among all the predicted negative instances.
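These definitions can be made concrete as follows, where positives are good-outcome patients (the counts in the usage line are invented for illustration):

```python
def outcome_metrics(tp, fp, fn, tn):
    """Performance metrics from confusion counts, with good-outcome
    patients as positives and poor-outcome patients as negatives."""
    return {
        "accuracy":    (tp + tn) / (tp + fp + fn + tn),
        "recall":      tp / (tp + fn),   # sensitivity on good outcome
        "precision":   tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "npv":         tn / (tn + fn),
    }

# Illustrative counts, not values from the paper:
print(outcome_metrics(tp=50, fp=5, fn=10, tn=35))
```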
Rule quality measures
Rule generation methods constitute a subset of classification techniques that generate explicit models g(x) described by a set of m rules r_{k}, k = 1, ..., m, in the if-then form:
\mathbf{\text{if}}\phantom{\rule{0.3em}{0ex}}<premise>\phantom{\rule{0.3em}{0ex}}\mathbf{\text{then}}\phantom{\rule{0.3em}{0ex}}<consequence>
where <premise> is the logical product (and) of m_{k} conditions c_{kl}, with l = 1, ..., m_{k}, on the components x_{j}, whereas <consequence> gives a class assignment y = \mathit{\u1ef9} for the output. In general, a condition c_{kl} in the premise involving an ordered variable x_{j} has one of the following forms: x_{j} > λ, x_{j} ≤ μ, or λ < x_{j} ≤ μ, λ and μ being two real values, whereas a nominal variable x_{k} leads to membership conditions {x}_{k}\in \left\{\alpha ,\phantom{\rule{2.77695pt}{0ex}}\delta ,\phantom{\rule{2.77695pt}{0ex}}\sigma \right\}, α, δ, σ being admissible values for the kth component of x.
For instance, if x_{1} is an ordered variable in the domain {1, ..., 100} and x_{2} is a nominal component assuming values in the set {red, green, blue}, a possible rule r_{1} is
if\phantom{\rule{0.3em}{0ex}}{x}_{1}>40\phantom{\rule{0.3em}{0ex}}and\phantom{\rule{0.3em}{0ex}}{x}_{2}\in \left\{red,blue\right\}\phantom{\rule{0.3em}{0ex}}then\phantom{\rule{0.3em}{0ex}}y=0
where 0 denotes one of the q possible assignments (classes).
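A rule of this kind can be represented programmatically; the sketch below uses a minimal representation of our own (a list of index/predicate pairs for the premise plus a predicted class), not Rulex's internal format:

```python
# A condition is (variable index, test); a rule is (conditions, predicted class).
# Example rule r1: if x1 > 40 and x2 in {red, blue} then y = 0.
rule = (
    [(0, lambda v: v > 40),                  # x1 > 40   (ordered variable)
     (1, lambda v: v in {"red", "blue"})],   # x2 in {red, blue}  (nominal)
    0,                                       # then y = 0
)

def satisfies(rule, x):
    """True iff instance x satisfies every condition in the rule's premise."""
    conditions, _ = rule
    return all(test(x[j]) for j, test in conditions)

print(satisfies(rule, (55, "red")))    # premise satisfied
print(satisfies(rule, (55, "green")))  # nominal condition fails
```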
According to the output value included in their consequence part, the m rules r_{k} describing a given model g(x) can be subdivided into q groups G_{1}, G_{2}, ..., G_{q}. Considering the training set S, any rule r ∈ G_{l} is characterized by four quantities: the numbers TP(r) and FP(r) of examples (x_{i}, y_{i}) with y_{i} = y_{l} and y_{i} ≠ y_{l}, respectively, that satisfy all the conditions in the premise of r, and the numbers FN(r) and TN(r) of examples (x_{i}, y_{i}) with y_{i} = y_{l} and y_{i} ≠ y_{l}, respectively, that do not satisfy at least one of the conditions in the premise of r.
The quality of a rule was measured utilizing the following quantities. Given a rule r, we define the covering C(r), the error E(r), and the precision P(r) according to the following formulas:
C\left(r\right)=\frac{TP\left(r\right)}{TP\left(r\right)+FN\left(r\right)}\phantom{\rule{1em}{0ex}},\phantom{\rule{1em}{0ex}}E\left(r\right)=\frac{FP\left(r\right)}{FP\left(r\right)+TN\left(r\right)}\phantom{\rule{1em}{0ex}},\phantom{\rule{1em}{0ex}}P\left(r\right)=\frac{TP\left(r\right)}{TP\left(r\right)+FP\left(r\right)}
The covering of a rule is the fraction of training examples of the target class that satisfy the rule. The error of a rule is the fraction of training examples not belonging to the target class that nevertheless satisfy the rule. The precision of a rule is the fraction of examples satisfying the rule that belong to the target class. The greater the covering and the precision, the higher the generality and the correctness of the corresponding rule.
To test the statistical significance of the rules we used Fisher's exact test (FET) implemented in the R software package, considering significant any rule with P < 0.05.
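A minimal sketch of how the four counts and the derived covering, error, and precision of a single rule can be computed on a training set; the representation of the premise as a Python predicate and the toy data are our own:

```python
def rule_quality(premise, target, X, y):
    """Covering C(r), error E(r) and precision P(r) of the rule
    `if premise(x) then y = target`, evaluated on the training set (X, y)."""
    tp = sum(1 for x, yi in zip(X, y) if premise(x) and yi == target)
    fp = sum(1 for x, yi in zip(X, y) if premise(x) and yi != target)
    fn = sum(1 for x, yi in zip(X, y) if not premise(x) and yi == target)
    tn = sum(1 for x, yi in zip(X, y) if not premise(x) and yi != target)
    return {
        "covering":  tp / (tp + fn),   # fraction of target-class examples covered
        "error":     fp / (fp + tn),   # fraction of other-class examples covered
        "precision": tp / (tp + fp),   # fraction of covered examples in target class
    }

# Toy training set: a single expression value per patient.
X = [(30,), (50,), (60,), (70,), (45,)]
y = ["poor", "good", "good", "good", "poor"]
print(rule_quality(lambda x: x[0] > 40, "good", X, y))
```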
Relevance measure and ranking of the probe sets
To obtain a measure of the importance of the features included in the rules, and to rank the features according to this value, we utilized a measure called the relevance R(c) of a condition c. Consider the rule r′ obtained by removing the condition c from r. Since the premise part of r′ is less stringent, we have E(r′) ≥ E(r), so that the quantity R(c) = (E(r′) − E(r)) C(r) can be used as a measure of relevance for the condition c of interest.
Since each condition c refers to a specific component of x, we define the relevance R_{v}(x_{j}) for every input variable x_{j} as follows:
{R}_{v}\left({x}_{j}\right)=1-{\displaystyle \prod _{k}}\left(1-R\left({c}_{kl}\right)\right)
where the product is computed on the rules r_{k} that include a condition c_{kl} on the variable x_{j}.
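This combination of condition relevances into a variable relevance can be sketched directly from the formula above (the function name is our own):

```python
def variable_relevance(condition_relevances):
    """Combine the relevances R(c_kl) of all conditions on a variable x_j
    into R_v(x_j) = 1 - prod_k (1 - R(c_kl))."""
    prod = 1.0
    for r in condition_relevances:
        prod *= 1.0 - r
    return 1.0 - prod

print(variable_relevance([0.3, 0.5]))  # → 1 - 0.7 * 0.5 = 0.65
```

Note that a variable appearing in no condition gets relevance 0, while any single highly relevant condition dominates the product.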
Denote with V_{kl} the attribute involved in the condition c_{kl} of the rule r_{k}, and with S_{kl} the subset of values of V_{kl} for which the condition c_{kl} is verified. If V_{kl} is an ordered attribute and the condition c_{kl} is V_{kl} ≤ a for some value a ∈ S_{kl}, then the contribution to R_{v}(x_{j}) is negative. Hence, by adding the superscript − (resp. +) to denote the attributes V_{kl} with negative (resp. positive) contribution, we can write R_{v}(x_{j}) for an ordered input variable x_{j} in the following way:
{R}_{v}\left({x}_{j}\right)={\displaystyle \prod _{{V}_{kl}^{-}}}\left(1-R\left({c}_{kl}\right)\right)-{\displaystyle \prod _{{V}_{kl}^{+}}}\left(1-R\left({c}_{kl}\right)\right)
where the first (resp. second) product is computed on the rules r_{k} that include a condition c_{kl} leading to a negative (resp. positive) contribution for the variable x_{j}.
Output assignment for a new instance
When the model g(x) described by the set of m rules r_{k}, k = 1, ..., m, is employed to classify a new instance x, the <premise> part of each rule is examined to verify if the components of x satisfy the conditions included in it. Denote with Q the subset of rules whose <premise> part is satisfied by x; then, the following three different situations can occur:

1. The set Q includes only rules having the same output value ỹ in their <consequence> part; in this case the class ỹ is assigned to the instance x.

2. The set Q contains rules having different output values in their <consequence> part; it follows that Q can be partitioned into q disjoint subsets Q_{i} (some of which can be empty), including the rules r pertaining to the ith class. In this case, to every attribute x_{j} can be assigned a measure of consistency t_{ij}, given by the maximum of the relevance R(c) over the conditions c involving the attribute x_{j} and included in the <premise> part of the rules in Q_{i}. Then, to the instance x is assigned the class \mathit{\u1ef9} associated with the following maximum:
\mathit{\u1ef9}={\displaystyle \underset{i=1,\cdots ,q}{\text{argmax}}}{\displaystyle \sum _{j=1}^{d}}{t}_{ij}

3. The set Q is empty, i.e. no rule is satisfied by the instance x; in this case the set Q_{−1}, containing the subset of rules whose <premise> part is satisfied by x except for one condition, is considered, and points 1 and 2 are again tested with Q = Q_{−1}. If Q is again empty, the set Q_{−2}, containing the subset of rules whose <premise> part is satisfied by x except for two conditions, is considered, and so on.
The conflicting case 2 can be controlled in Rulex 2.0 by assigning a set of weights w_{i} to the output classes; in this way the previous equation can be written as
\mathit{\u1ef9}=\underset{i=1,\cdots ,q}{\text{argmax}}{\displaystyle \sum _{j=1}^{d}}{t}_{ij}{w}_{i}
and we can speak of weighted classification.
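The conflict-resolution step (case 2) and its weighted variant can be sketched as follows; the data structures are our own illustration, not Rulex's internals: each satisfied rule is given as its class index together with the relevances of its conditions, keyed by attribute index:

```python
def assign_class(Q, q, d, weights=None):
    """Resolve conflicts among satisfied rules.
    Q: list of (class index i, {attribute j: relevance of its condition})
    for each satisfied rule. t_ij is the maximum relevance, over the rules
    of class i, of the conditions involving attribute j; the winning class
    maximizes sum_j t_ij * w_i."""
    weights = weights or [1.0] * q
    t = [[0.0] * d for _ in range(q)]
    for i, conds in Q:
        for j, rel in conds.items():
            t[i][j] = max(t[i][j], rel)
    return max(range(q), key=lambda i: weights[i] * sum(t[i]))

# Two weak rules for class 0 and one strong rule for class 1 are satisfied:
Q = [(0, {0: 0.2}), (0, {1: 0.3}), (1, {0: 0.9})]
print(assign_class(Q, q=2, d=2))                      # → 1 (0.9 > 0.5)
print(assign_class(Q, q=2, d=2, weights=[2.0, 1.0]))  # → 0 (weighting favors class 0)
```

With unit weights this reduces to the unweighted argmax; non-uniform weights implement the weighted classification described above.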