Infection status outcome, machine learning method and virus type interact to affect the optimised prediction of hepatitis virus immunoassay results from routine pathology laboratory assays in unbalanced data
© Richardson and Lidbury; licensee BioMed Central Ltd. 2013
Received: 11 September 2012
Accepted: 18 June 2013
Published: 25 June 2013
Advanced data mining techniques such as decision trees have been successfully used to predict a variety of outcomes in complex medical environments. Furthermore, previous research has shown that combining the results of a set of individually trained trees into an ensemble-based classifier can improve overall classification accuracy. This paper investigates the effect of data pre-processing, the use of ensembles constructed by bagging, and a simple majority vote to combine classification predictions from routine pathology laboratory data, particularly to overcome a large imbalance of negative Hepatitis B virus (HBV) and Hepatitis C virus (HCV) cases versus HBV or HCV immunoassay positive cases. These methods were illustrated using a never before analysed data set from ACT Pathology (Canberra, Australia) relating to HBV and HCV patients.
It was easier to predict immunoassay positive cases than negative cases of HBV or HCV. While applying an ensemble-based approach rather than a single classifier had a small positive effect on the accuracy rate, this effect varied depending on the virus under analysis. Finally, scaling the data before prediction also had a small positive effect on the accuracy rate for this dataset. A graphical analysis of the distribution of accuracy rates across ensembles supports these findings.
Laboratories looking to include machine learning as part of their decision support processes need to be aware that the infection outcome, the machine learning method used and the virus type interact to affect the enhanced laboratory diagnosis of hepatitis virus infection, as determined by primary immunoassay data in concert with multiple routine pathology laboratory variables. This awareness will lead to the informed use of existing machine learning methods, thus improving the quality of laboratory diagnosis via informatics analyses.
Data mining approaches have found applications in many knowledge discovery domains, including biological research and clinical medicine [1-7]. Within data mining developments over the past twenty years, decision tree (recursive partitioning) learning models have received considerable attention. Decision trees are popular for several reasons, for example their capacity to model complex relationships with logical rules. As Negnevitsky (2002) points out, they are also simple, easy to understand, and can be constructed relatively quickly.
In general, learning models are multi-stage decision processes that start with an initial collection of datasets, each consisting of observations or cases for which a known class label has been assigned. In each dataset, segmentation algorithms look at known facts stored in a knowledge base and perform a series of tests in a specific order. At each stage of this process a decision is made and some records are separated into subsets with greater purity in terms of class membership. This process usually continues until no more rules can be found or some stopping criterion is fulfilled. A decision tree model is a specific example of a learning model, and is represented by a tree structure. The tree structure consists of leaf nodes, non-leaf nodes and branches. The non-leaf nodes represent tests on attributes, and the leaf nodes represent the values of the attribute to be classified.
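The tree structure just described can be illustrated with a minimal hand-written sketch. The attributes and thresholds below are hypothetical and purely illustrative; they are not rules learned in this study.

```python
# Illustrative only: a tiny hand-written decision tree over two hypothetical
# attributes ("alt" and "age"). Non-leaf nodes test attributes; leaf nodes
# carry the predicted class label.
def classify(case):
    # Non-leaf node: test the ALT attribute (hypothetical threshold).
    if case["alt"] > 55:
        # Non-leaf node: test the age attribute (hypothetical threshold).
        if case["age"] > 40:
            return "positive"   # leaf node
        return "negative"       # leaf node
    return "negative"           # leaf node
```

A learned tree differs only in that its tests and thresholds are chosen by a segmentation algorithm rather than by hand.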
Such learning models were applied to abundant diagnostic pathology laboratory data resulting from the testing of patients suspected of infection by either hepatitis B virus (HBV) or hepatitis C virus (HCV). Such data has not been extensively mined for patterns to advance predictions for laboratory diagnoses. Pathology data presents special challenges for investigators, including data imbalance for particular responses or predictors, and high individual patient data variation that makes both pattern recognition and rule detection difficult. Pathology data is similar worldwide, and therefore efficient analysis of such data is of wide interest to the clinical professions for enhanced laboratory diagnoses.
Table 1. Description of response and explanatory variables subjected to decision tree analyses

Description & definition:
- Hepatitis B surface antigen (HBSA; marker of HBV infection): positive (1) or negative
- Patient antibody to HCV (HepC), indicating contact with the virus (both HBSA and HepC detected by immunoassay)
- Patient (case) age
- Gender: M or F (coded 1 = F, 2 = M)
- Alanine aminotransferase (ALT; an intracellular enzyme released after liver and other tissue cell damage)
- Gamma-glutamyl transpeptidase (an intracellular enzyme also relevant to liver damage)
- Haematocrit (formerly known as "packed cell volume")
- Mean corpuscular haemoglobin
- Mean corpuscular haemoglobin concentration
- Mean corpuscular volume
- Platelets (blood clotting)
- White cell count
- Red cell count
- Red cell distribution width
- Neutrophils: white blood cells elevated by bacterial infection and early viral infection
- Lymphocytes: white blood cells elevated by viral infection and some cancers
- Monocytes: white blood cells elevated by infection, inflammation and some cancers
- Eosinophils: white blood cells elevated by allergy and parasite infection
- Basophils: white blood cells elevated in hypersensitivity reactions
In this study we describe an empirical investigation of immunoassay results (HBV or HCV) and associated routine pathology data (Table 1), which featured significantly more negative than positive HBV or HCV cases, by constructing single decision trees and ensembles [11-13], and using different data pre-processing techniques on the aggregated pathology data. The aim of the study was to use the resulting trees for the enhanced laboratory diagnosis of hepatitis virus infection, by exploiting the range of multi-variable pathology laboratory data associated with direct virus immunoassay testing. To achieve this aim we interrogated a data set of 18625 records from 1997 - 2007 made available by ACT Pathology at The Canberra Hospital, ACT Australia.
Single decision tree
Specificity and sensitivity (%) of HBV and HCV immunoassay outcome prediction after single decision tree analysis
Specificity and sensitivity (%) of HBV and HCV immunoassay outcome prediction after decision tree ensemble analyses
Majority multiple pre-processing (Table 3b) showed that prediction of a negative virus infection result was superior for HBSA compared to HepC. For negative HBSA and HepC, scale and/or log transformation did not improve the performance of this prediction model, and again the basic single and bootstrap single decision tree methods were superior (Table 2). For the prediction of positive HBSA or HepC results, prior scale and/or log transformation did not improve percent accuracy beyond the results for raw, non-preprocessed data; however, this method did improve on the results of the basic single and bootstrap single decision tree methods (Table 2). For positive HepC prediction, ensemble trees produced from raw (non-preprocessed), scale, log and scale-log pre-processed data produced prediction accuracy rates similar to those of the single tree, matched single method.
Finally, the clear negative method produced the best prediction accuracy for positive HBSA (70.2%) and HepC (80.84%) among the decision tree ensembles (Table 3c). These results were also superior to matched single pre-processing for single decision trees (Table 2). For HBSA prediction from routine pathology data by a single decision tree, log or scale-log transformation were the best methods, while for positive HepC data, scaling produced the best results, although only marginally higher than the raw (non-transformed) data (Table 3c). For negative HepC predictions by this method, with or without prior data transformation, predictions were poor at 35 - 37%. Likewise, negative HBSA prediction was also poor, with accuracy rates of 54.5% for the raw or scale methods and 45.7% for the log or scale-log methods for decision tree ensembles. The basic single decision tree (Table 2) was clearly the best method for negative HBSA and HepC prediction.
Analysis of variance of mean accuracy rates for a four-factor experiment
Analysis of variance
The location of each mean accuracy rate, shown at the centre of each boxplot, is confirmed by the figures in the "scale" column of Table 3. The spread of the accuracy rates is slightly larger for positive outcomes than for negative outcomes, consistent with the significant effect of outcome on accuracy rate found in the analysis of variance above. There is little difference between the accuracy rate distributions for the two viruses, which likewise supports the non-significant effect of virus on accuracy rate found in the analysis of variance (Table 4).
Hepatitis B virus (HBV) and hepatitis C virus (HCV) are significant agents of acute and chronic hepatitis worldwide, and leading causes of liver cancer and cirrhosis. Prevalence rates vary widely between countries; for example, HBV carrier prevalence within Europe ranges from 0.1 to 8.0% and HCV from 0.1 to 6.0%. The worldwide health impact of HBV is substantial, with 2 billion cases of infection, 360 million cases of chronic infection and 600,000 deaths each year associated with liver carcinoma or other HBV-induced liver disease. Based on WHO estimates from 1999, worldwide HCV prevalence was around 3.0%, with approximately 170 million people affected by HCV disease. Due to the prolonged disease latency following HCV infection, prevalence rates are difficult to calculate, so the quoted rates may be underestimates.
Primary diagnosis of HBV or HCV, and subsequent monitoring of infection, relies significantly on immunoassay techniques available via pathology departments to detect hepatitis B virus surface antigen (HBSA), or patient anti-HCV antibodies (HepC) associated with previous infection. Within the suite of immunoassay markers available for HBV detection, HBSA was chosen since it is a common HBV screening test and is elevated relatively soon after infection (Table 1). For all analyses, HBSA or HepC were used as the respective response variables in the single and ensemble decision tree methods. The explanatory variables used for all analyses comprised a range of other routine pathology tests run simultaneously with the HBV or HCV immunoassay on the same serum samples. These additional tests reflect a number of physiological functions that are potentially perturbed by infection and illness, including liver function, kidney function, presence of anaemia and infection or allergy (Table 1). Judgments based on the linear reference ranges decided by the laboratory for individual assay results assist in both primary diagnosis and subsequent monitoring of disease or infection, if present.
Given the extensive range of biochemical, cellular and physiological data collected alongside HBV or HCV immunoassays through simultaneous routine pathology laboratory testing, a pattern recognition approach is well suited to revealing data patterns that reflect the presence of HBV/HCV infection, as well as infection persistence and severity (via follow-up data). To address this opportunity, single decision trees and tree ensembles were employed. A tree ensemble consists of several individually trained trees that are jointly used to solve a problem. Given the over-representation of negative cases in both the HBV and HCV data, tree ensembles can give a significant improvement in prediction accuracy over a single classifier. Generally, constructing ensembles consists of two phases, a training phase and a combining phase [11, 12].
In the training phase, several techniques to cope with the imbalanced nature of the data were explored. One popular method for balancing a training set is bootstrapping . This technique generates a training set using random drawing (with replacement) from the original training set. Consequently, in every new training set there are data points that appear more than once while others do not appear at all. Bootstrapping is an effective technique for improving a classifier with poor performance, especially where a classifier has been presented with a small training sample set or training set with misleading data points. A second method involves downsizing the large class either at random or at “focused” random [14, 15]. Training sets were produced using a subset of the negative individuals, as there are many more negatives than positives in both the HBV and HCV data sets.
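The bootstrap balancing step described above can be sketched in a few lines. This is a hypothetical helper under the stated assumptions (minority positives resampled with replacement to match the negatives), not the authors' implementation.

```python
import random

def bootstrap_balance(positives, negatives, seed=0):
    """Resample the minority (positive) class with replacement until it
    matches the majority (negative) class size, so the returned training
    set is balanced. Some positives appear more than once, others not at
    all, as described for bootstrapping above."""
    rng = random.Random(seed)
    resampled_pos = [rng.choice(positives) for _ in range(len(negatives))]
    return resampled_pos + list(negatives)
```

The downsizing alternative simply goes the other way, drawing a random subset of negatives instead of inflating the positives.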
In the combining phase, we have chosen to use a majority voting strategy to combine predictions of the component classifiers. In majority voting each component classifier votes for a category, and the category with the majority of votes defines the ensemble category. The best approach for negative HBSA and HepC data accuracy was the “basic single” method (see Table 2) due to the size of these datasets.
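The majority voting rule can be stated very compactly. The sketch below takes the plurality category; with two classes and an odd number of component trees this is identical to the ">50% of votes" rule used here.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine component-classifier votes: each classifier votes for a
    category, and the category with the most votes defines the ensemble
    prediction (ties broken by first occurrence)."""
    return Counter(predictions).most_common(1)[0][0]
```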
For smaller datasets, as found for both HBSA and HepC positive cohorts, other methods were required to achieve high predictive accuracy based on associated routine pathology data (Table 1). Furthermore, the “clear negative” method, which used other pathology data (i.e. ALT liver enzyme) to give the most certain true negative cohort, was very effective. For this method, patient data with HBSA < 0.01 and ALT < 55 U/L were considered to be “clear negative” for HBV. We also considered patient data with HepC ≤ 0.03 as “clear negative” for HCV. Such combining of diverse pathology data to increase the probability of a correct true negative or true positive detection is particularly crucial in the context of blood transfusion, where the accidental transmission of infectious agents must be avoided .
This study examined the effect of data characteristics on decision trees used to predict HBV or HCV infection status, as detected by specific immunoassay. Improved understanding of the behaviour of such techniques will lead to better definition of patient groups that display different data patterns associated with HBV or HCV infection, and hence demonstrate a different physiological response as defined by biochemical and cellular responses to infection, determined by routine pathology blood tests. Once rules are determined via data mining, patient profiles can be designed that will guide molecular genetic studies on the biological basis of disease resistance or susceptibility, with the shorter-term benefit of enhancing the laboratory diagnosis and monitoring of hepatitis virus infection through combined data rules, particularly for data sets with few positive cases. This study focused on the interactions between aspects of the data and its pre-processing that allow decision trees to generate effective rules modelling hepatitis virus infection from routine blood test data, which assesses liver and kidney function as well as a range of markers of red and white blood cell function.
Prediction accuracy was assessed via sensitivity and specificity:

Sensitivity (%) = TP / (TP + FN) × 100
Specificity (%) = TN / (TN + FP) × 100

where the true positive (TP) count is the number of correctly diagnosed HBSA or HepC positive cases; false negative (FN) is the number of HBSA or HepC positive cases that the model fails to diagnose; true negative (TN) is the number of HBSA or HepC negative cases correctly diagnosed by the model; and false positive (FP) is the number of HBSA or HepC negative cases that the model incorrectly classifies as positive.
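These standard accuracy measures translate directly into code:

```python
def sensitivity(tp, fn):
    """Percentage of true positive cases the model correctly diagnoses."""
    return 100.0 * tp / (tp + fn)

def specificity(tn, fp):
    """Percentage of true negative cases the model correctly diagnoses."""
    return 100.0 * tn / (tn + fp)
```

For example, a model that finds 8 of 10 positives has 80% sensitivity regardless of how it handles the negatives.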
The data set employed in this study originally comprised 18625 individual cases (1 individual patient case per row) of hepatitis virus testing over a decade from 1997 - 2007. Data was provided by ACT Pathology, The Canberra Hospital (TCH), Australia. Patient identifiers were removed by TCH staff prior to data access, with only laboratory ID numbers provided for the study. After data cleaning that included the removal of rows with missing values, 10378 rows of complete data were compiled for HBSA, with 8801 complete data rows available for HepC. Only cleaned and complete data sets were used in the experiments described herein. Of the final data set, 212 rows were HBSA positive, and 641 rows positive for HepC. Therefore, the majority of the data were negative for either HBV or HCV, stimulating the analyses described here to derive methods to increase prediction accuracy for an unbalanced data set. HBSA was classified as positive at ≥ 1.6 immunoassay units (IU), and HepC at ≥ 0.6 IU. All HBSA and HepC results below these assay cut-offs were classified as negative (M. de Souza, ACT Pathology, pers. comm.).
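The class-assignment step above amounts to a simple thresholding rule, sketched here with the cut-offs quoted in the text:

```python
# Laboratory immunoassay cut-offs quoted above (immunoassay units, IU).
HBSA_CUTOFF = 1.6
HEPC_CUTOFF = 0.6

def label(value, cutoff):
    """Assign the binary infection-status class from a raw immunoassay value:
    positive at or above the cut-off, negative below it."""
    return "positive" if value >= cutoff else "negative"
```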
The study was divided into two phases to assess the impact of pre-processing efficacy. The first phase compared three pre-processing techniques before testing accuracy for single decision trees. Phase two comprised ensembles of 36 or 72 decision trees with pre-analysis scaling of the data before an assessment of prediction accuracy.
For access to de-identified patient data, this study had human ethics approval granted by the Human Research Ethics Committee at The University of Canberra (protocol 07/24), The Australian National University Human Ethics Committee (2012/349) and the ACT Health Human Research Ethics Committee (ETHLR.11.016).
Phase 1 - single decision tree analysis
Prior to running the single decision trees and assessing prediction accuracy, four common data pre-processing techniques were employed. The four pre-processing techniques used were: no pre-processing (Raw), scaling 1 - 100 (Scale), a natural logarithm scale (Log) and scale-logging (Scale-Log), a combination of the previous two methods. Scaling sets the range of each explanatory variable to a common range of 0 - 100. Logging uses the natural logarithm (ln) transformation. Scale-logging uses a common range of 0 - 100 then takes the natural logarithm. Note also that assignment of positive or negative to data (based on HBSA or HepC value) occurs before scaling.
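The three transforming techniques can be sketched as below. The text gives both 1 - 100 and 0 - 100 as the scaling range; this sketch assumes a 1 - 100 range so that the subsequent logarithm in Scale-Log stays defined at the minimum value, and the handling of constant columns is likewise an assumption.

```python
import math

def scale(values, lo=1.0, hi=100.0):
    """Rescale a variable to a common range (the Scale step). A 1-100 range
    is assumed here so a later logarithm stays defined."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [lo] * len(values)  # constant column: map everything to the lower bound
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

def log_transform(values):
    """Natural-logarithm (ln) transform (the Log step); assumes positive values."""
    return [math.log(v) for v in values]

def scale_log(values):
    """Scale to a common range, then take the natural logarithm (the Scale-Log step)."""
    return log_transform(scale(values))
```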
After data pre-processing, three data set selection methods were used, as follows:
Basic Single: For both HBSA (n = 10378) and HepC (n = 8801), two-thirds of the data were randomly selected for training, with the other third reserved for testing. The single tree obtained from the training set was applied to the testing set, and the accuracy rate computed.
Bootstrap Single: Pre-processing was identical to the basic single approach, but the bootstrap technique was additionally used to increase the number of positive cases in the training data to match the number of negative cases (in the training phase). The bootstrapped training data were then used to construct a tree classification for the response variables HBSA and HepC, and accuracy was assessed against the one-third testing data.
Matched Single: As an alternative to bootstrapping, the same number of negative cases as available positive cases was used to train the data, with the negative cases selected at random from the whole data set. This training data was then used to construct a tree classification for the response variables HBSA and HepC, as summarized above.
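The matched single balancing step can be sketched as follows; this is an illustrative helper under the stated assumptions, not the authors' code.

```python
import random

def matched_training_set(positives, negatives, seed=0):
    """Draw, at random and without replacement, as many negatives as there
    are positives (the Matched Single balancing step), and return the
    combined balanced training set."""
    rng = random.Random(seed)
    sampled_neg = rng.sample(list(negatives), len(positives))
    return list(positives) + sampled_neg
```

Unlike bootstrapping, this shrinks the majority class rather than inflating the minority class, so no training case is duplicated.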
Phase 2 - decision tree ensembles
As well as the single decision tree methods above, it is also possible to divide the abundant negative cases into multiple sets and thereby produce multiple decision trees. Three methods for carrying out this division were studied; a description of each follows.
Basic Multiple: Positive HBSA data was randomly divided into two parts, comprising 2/3 training data (141 cases) and 1/3 testing data (71 cases). The cases with negative HBSA (10167 cases) were divided into 72 random subsets (i.e. 10167/141). The 141 positive HBSA cases were combined with each of the 72 negative subsets, each combined subset was used in turn to construct a classification tree for the response variables, each of the resulting 72 trees was applied to the remaining data not used in its construction, and the accuracy rate was computed for each tree in the ensemble.
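The division of the abundant negatives into random subsets can be sketched as below (an illustrative helper; the subset sizes differ by at most one case when the total does not divide evenly).

```python
import random

def partition_negatives(negatives, n_subsets, seed=0):
    """Shuffle the abundant negative cases and split them into n_subsets
    random, non-overlapping subsets, one per ensemble member."""
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    # Striding over the shuffled list yields roughly equal random subsets.
    return [shuffled[i::n_subsets] for i in range(n_subsets)]
```

Each subset is then paired with the full set of training positives to build one tree of the ensemble.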
Majority Multiple: This method is very similar to the basic multiple method, except that 36 training subsets were created, each containing 282 (i.e., 141 × 2) cases, half with negative and half with positive HBSA. The accuracy rate was then computed on a common test dataset using majority voting across all trees, where the result of a majority vote is the decision supported by more than 50% of the trees in the ensemble.
Clear Negative: For this method we first select the cases that are "clearly negative" as judged by pathology data reference ranges for the HBSA or HepC immunoassays and ALT (Table 1). "Clear negativity" is defined as HBSA < 0.01 IU, HepC < 0.03 IU and ALT < 55 U/L. These cases were then combined with 2/3 of the positive cases (71 for HBSA or 214 for HepC) to construct the training set, with the remaining data used for testing. In other words, there is only one training set, unlike the other two methods, which had multiple training sets.
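The clear-negative selection rule reduces to simple threshold tests on the reference ranges quoted above. Note the text varies between < 0.03 and ≤ 0.03 for HepC; strict inequalities are assumed in this sketch.

```python
def clear_negative_hbv(hbsa, alt):
    """'Clear negative' for HBV per the rule above: HBSA < 0.01 IU and ALT < 55 U/L."""
    return hbsa < 0.01 and alt < 55

def clear_negative_hcv(hepc):
    """'Clear negative' for HCV per the rule above: HepC < 0.03 IU."""
    return hepc < 0.03
```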
Phase 3 - analysis of variance
The final phase of the study uses an analysis of variance to identify the amount of variation in mean accuracy rate attributable to four factors: method, data pre-processing, outcome and virus type (HBV or HCV). There are three methods (basic multiple, majority multiple and clear negative), four pre-processing techniques (none, scale, log and scale-log), two outcomes (predicted positive and predicted negative) and two viruses, hepatitis B virus (HBV, measured by the immunoassay marker HBSA) and hepatitis C virus (HCV, measured by the immunoassay marker HepC; Table 1). The interaction between pairs of these factors was also modelled, to see whether settings of one factor caused the accuracy rates to behave differently depending on the setting of another factor.
The authors wish to thank Dr Fariba Shadabi for her assistance with the decision tree modelling, and Mr Gus Koerbin, Mr Michael de Souza and the staff at ACT Pathology (The Canberra Hospital) for their support of this project. The project was funded by The Medical Advances Without Animals Trust (MAWA).
- Quinlan JR: Induction of decision trees. Mach Learn. 1986, 1: 81-106.
- Busic V, Zelenikow J: Knowledge discovery and data mining in biological databases. Knowl Eng Rev. 1999, 14: 257-277. 10.1017/S0269888999003069.
- Negnevitsky M: Artificial Intelligence: A Guide to Intelligent Systems. 2002, New York: Addison Wesley.
- Murthy SK: Automatic construction of decision trees from data: a multi-disciplinary survey. Data Mining and Knowledge Discovery. 1998, 2: 345-389. 10.1023/A:1009744630224.
- Woods KS, Doss CC, Vowyer KW, Solka JL, Prieve CE, Kegelmeyer WPJ: Comparative evaluation of pattern recognition techniques for detection of microcalcifications in mammography. Pattern Recognition and Artificial Intelligence. 1993, 7: 1417-1436. 10.1142/S0218001493000698.
- Wilks PAD, English MJ: Accurate segmentation of respiration waveforms from infants enabling identification and classification of irregular breathing patterns. Medical Engineering and Physics. 1994, 16: 19-23. 10.1016/1350-4533(94)90005-1.
- File PE, Dugard PI, Houston AS: Evaluation of the use of induction in the development of a medical expert system. Computational and Biomedical Research. 1994, 27: 383-395. 10.1006/cbmr.1994.1029.
- Rantala M, van de Laar MJ: Surveillance and epidemiology of hepatitis B and C in Europe - a review. Euro Surveill. 2008, 13: 1-8.
- Shepard CW, Simard EP, Finelli L, Fiore AE, Bell BP: Hepatitis B virus infection: epidemiology and vaccination. Epidemiology Review. 2006, 28: 112-125. 10.1093/epirev/mxj009.
- Sy T, Jamal MM: Epidemiology of hepatitis C virus (HCV) infection. International Journal of Medical Science. 2006, 3: 41-46.
- Zhou Z, Tang W: Selective ensemble of decision trees. Lecture Notes in Artificial Intelligence. 2003, 2639: 476-483.
- Zhou Z, Wu J, Tang W: Ensembling neural networks: many could be better than all. Artif Intell. 2002, 137: 239-263. 10.1016/S0004-3702(02)00190-X.
- Breiman L: Bagging predictors. Machine Learning. 1996, 24: 123-140.
- Japkowicz N, Stephen S: The class imbalance problem: a systematic study. Intelligent Data Analysis. 2002, 6: 429-449.
- Drummond C, Holte RC: C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. Workshop on Learning from Imbalanced Datasets II. 2003.
- Shang G, Seed CR, Wang F, Nie D, Farrugia A: Residual risk of transfusion-transmitted viral infections in Shenzhen, China, 2001 through 2004. Transfusion. 2007, 47: 529-539. 10.1111/j.1537-2995.2006.01146.x.
- R Development Core Team: R: A language and environment for statistical computing. 2011, Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0. http://www.R-project.org/
- Han L, Wang Y, Bryant SH: Developing and validating predictive decision tree models from mining chemical structural fingerprints and high-throughput screening data in PubChem. BMC Bioinformatics. 2008, 9: 401. 10.1186/1471-2105-9-401.
- Williams G: Data Mining with Rattle and R. 2011, New York: Springer.
- Kerr MK, Martin M, Churchill GA: Analysis of variance for gene expression microarray data. J Comput Biol. 2000, 7: 819-837. 10.1089/10665270050514954.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.