Early classification of multivariate temporal observations by extraction of interpretable shapelets

Background: Early classification of time series is beneficial for biomedical informatics problems including, but not limited to, disease change detection. Early classification can be of tremendous help by identifying the onset of a disease before it has time to fully take hold. In addition, extracting patterns from the original time series helps domain experts to gain insights into the classification results. This problem has been studied recently using time series segments called shapelets. In this paper, we present a method, which we call Multivariate Shapelets Detection (MSD), that allows for early and patient-specific classification of multivariate time series. The method extracts time series patterns, called multivariate shapelets, from all dimensions of the time series that distinctly manifest the target class locally. Time series are then classified by searching for the earliest closest patterns.

Results: The proposed early classification method for multivariate time series has been evaluated on eight gene expression datasets from viral infection and drug response studies in humans. In our experiments, the MSD method outperformed the baseline methods, achieving highly accurate classification by using as little as 40%-64% of the time series. The obtained results provide evidence that using conventional classification methods on short time series is not as accurate as using the proposed methods specialized for early classification.

Conclusion: For the early classification task, we proposed a method called Multivariate Shapelets Detection (MSD), which extracts patterns from all dimensions of the time series. We showed that the MSD method can classify the time series early by using as little as 40%-64% of the time series' length.

First, the entropy of the dataset $D$ is computed as

$$E(D) = -\sum_{c} \frac{m_c}{M} \log \frac{m_c}{M},$$

where $m_c$ is the number of time series of class $c$ and $M$ is the number of all time series. To compute the distance threshold, the method takes two parameters: the shapelet and the distances RowDist between the shapelet and all time series in the dataset $D$. It sorts the distances (line 2) and takes the mid-point between each pair of consecutive distances as a candidate for the threshold (line 4). The dataset is then divided into two parts, to the left and to the right of the threshold. The left part $D_L$ (lines 6-10) contains all time series whose distance to the shapelet is less than or equal to the threshold. The right part $D_R$ (lines 11-15) contains the remainder of the time series. Then, the entropies of the left and right parts are computed (lines 16 and 17, respectively). By comparing the entropy before and after the split, we obtain a measure of information gain (line 18) as

$$IG = E(D) - \frac{M_L}{M} E(D_L) - \frac{M_R}{M} E(D_R),$$

where $M_L$ and $M_R$ are the number of time series in $D_L$ and $D_R$, respectively. We choose the distance threshold that maximizes the information gain for the shapelet (line 21).
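The threshold search described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `entropy` and `best_distance_threshold` are hypothetical, the logarithm is taken to base 2 (the base does not change which threshold wins), and ties between equal distances are skipped because they cannot produce a distinct split.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a list of class labels: -sum_c (m_c / M) * log2(m_c / M)."""
    M = len(labels)
    if M == 0:
        return 0.0
    return -sum((m_c / M) * math.log2(m_c / M) for m_c in Counter(labels).values())


def best_distance_threshold(row_dist, labels):
    """Return (threshold, information_gain) for one shapelet.

    row_dist[i] is the distance between the shapelet and time series i,
    labels[i] is the class of time series i (both names are assumptions).
    """
    pairs = sorted(zip(row_dist, labels))  # sort time series by distance
    M = len(pairs)
    base = entropy(labels)                 # entropy before the split
    best_threshold, best_gain = None, -1.0
    for i in range(M - 1):
        d_lo, d_hi = pairs[i][0], pairs[i + 1][0]
        if d_lo == d_hi:
            continue                       # equal distances give no new split
        threshold = (d_lo + d_hi) / 2      # mid-point candidate
        left = [c for d, c in pairs if d <= threshold]
        right = [c for d, c in pairs if d > threshold]
        gain = (base
                - len(left) / M * entropy(left)
                - len(right) / M * entropy(right))
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain


# Toy usage: two classes that separate cleanly around a distance of 0.55.
threshold, gain = best_distance_threshold([0.1, 0.2, 0.9, 1.1], ["a", "a", "b", "b"])
```

In this toy case the mid-point between 0.2 and 0.9 separates the two classes perfectly, so the split removes all of the original entropy.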

Genes used in our experiments
The list of the genes used in our experiments for both viral infection and drug response datasets is provided in

Evaluation of MSD method on the viral infection and drug response datasets using all genes
The MSD method was evaluated on the viral infection and drug response datasets. The accuracy and the list of parameters used are provided in Table S.2.

Evaluation of MSD method on the viral infection and drug response datasets using a subset of genes
The MSD method was evaluated on the viral infection and drug response datasets using a subset of genes.
For the viral infection dataset, the subset was chosen as the top genes from the ranked list (obtained from the literature) that give the highest accuracy. For the drug response datasets, the subset was chosen by enumerating all combinations of genes and selecting a subset that gives the highest accuracy. For computational reasons, we could not enumerate all combinations of genes used in the Baranzini12 and Costa17 datasets. The accuracy and the list of parameters used are provided in Table S.3. In some cases, such as Baranzini3A, the highest accuracy is obtained with a single gene. The percentage of variables required to satisfy Equation 6 then has no effect, so we report the percentage in the table as 0.1-1. Similarly, when using two genes, as in Baranzini3B and Lin9, all percentages less than or equal to 0.5 have the same effect as one another, as do all percentages greater than 0.5.

Comparison of distance threshold methods
The MSD method was evaluated on the viral infection and drug response datasets using two different distance threshold methods. Namely, we compared the proposed information gain threshold method with the Chebyshev's inequality method. We note that using Chebyshev's inequality gives better relative accuracy, but it also gives worse coverage, which reduces the overall accuracy. We applied a paired t-test of the null hypothesis that the differences between the 1000 bootstrap accuracies of the two methods are a random sample from a normal distribution with mean 0 and unknown variance, against the alternative that the mean is not 0. We applied the t-test at the 1% significance level (99% confidence). The results are shown in Table S.4. Using Chebyshev's inequality as the threshold method outperformed the information gain in only two datasets (Lin9 and Costa17), and in Costa17 the difference is not significant.
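The paired t-test on bootstrap accuracies can be illustrated with a short sketch. The function name `paired_t_test` and the toy accuracy values are hypothetical, and the p-value uses a normal approximation to the t distribution, an assumption that is reasonable for 1000 paired samples but is not necessarily the procedure used for the reported results.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_t_test(acc_a, acc_b):
    """Two-sided paired t-test on two equally long lists of accuracies.

    Returns (t_statistic, approximate_p_value). The p-value comes from the
    standard normal distribution, which approximates the t distribution
    with n - 1 degrees of freedom when n is large (assumption).
    Requires at least two pairs and a non-zero spread of differences.
    """
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    t_stat = mean(diffs) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value


# Toy usage with made-up accuracies for 1000 bootstrap replicates.
acc_info_gain = [0.80, 0.82] * 500
acc_chebyshev = [0.70, 0.71] * 500
t_stat, p_value = paired_t_test(acc_info_gain, acc_chebyshev)
significant = p_value < 0.01  # 1% significance level
```

A library routine such as `scipy.stats.ttest_rel` would give the exact t-distribution p-value; the standard-library version here only keeps the sketch dependency-free.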

Comparison of utility score methods
The MSD method was evaluated on the viral infection and drug response datasets using two utility score methods. Namely, we compared the proposed weighted information gain method with the weighted F1 method. We applied a t-test to measure the significance of the difference at the 1% significance level (99% confidence). The results are shown in Table S.5. With the weighted F1 score as the utility score, the method outperformed the weighted information gain in only one dataset (Baranzini6).