Selection of entropy-measure parameters for knowledge discovery in heart rate variability data

Background Heart rate variability is the variation of the time interval between consecutive heartbeats. Entropy is a commonly used tool to describe the regularity of data sets. Entropy functions are defined using multiple parameters, the selection of which is controversial and depends on the intended purpose. This study describes the results of tests conducted to support parameter selection, towards the goal of enabling further biomarker discovery. Methods This study deals with approximate, sample, fuzzy, and fuzzy measure entropies. All data were obtained from PhysioNet, a free-access, on-line archive of physiological signals, and represent various medical conditions. Five tests were defined and conducted to examine the influence of: varying the threshold value r (as multiples of the sample standard deviation σ, or the entropy-maximizing rChon), the data length N, the weighting factors n for fuzzy and fuzzy measure entropies, and the thresholds rF and rL for fuzzy measure entropy. The results were tested for normality using Lilliefors' composite goodness-of-fit test. Consequently, the p-value was calculated with either a two sample t-test or a Wilcoxon rank sum test. Results The first test shows a cross-over of entropy values with regard to a change of r. Thus, a clear statement that a higher entropy corresponds to a high irregularity is not possible, but is rather an indicator of differences in regularity. N should be at least 200 data points for r = 0.2 σ and should even exceed a length of 1000 for r = rChon. The results for the weighting parameters n for the fuzzy membership function show different behavior when coupled with different r values, therefore the weighting parameters have been chosen independently for the different threshold values. The tests concerning rF and rL showed that there is no optimal choice, but r = rF = rL is reasonable with r = rChon or r = 0.2σ. Conclusions Some of the tests showed a dependency of the test significance on the data at hand. Nevertheless, as the medical conditions are unknown beforehand, compromises had to be made. Optimal parameter combinations are suggested for the methods considered. Yet, due to the high number of potential parameter combinations, further investigations of entropy for heart rate variability data will be necessary.


Background
Heart rate variability (HRV) is the variation of the time interval between consecutive heartbeats. It highly depends on the extrinsic regulation of the heart rate (HR) and reflects the balance between the sympathetic and the parasympathetic nervous system [1]. Batchinsky et al. [2] develop a collection of methods for using HRV to describe regular periodic oscillations in the heart rate, attributed to the vagal and/or sympathetic branches of the autonomic nervous system.
Research on HRV has attracted considerable attention in the fields of psychology and behavioral medicine. It has its origin in the search for non-invasive correlates of injury severity which can be extracted from available signals in order to discover new cardiac biomarkers [2]. These signals are usually ones that are routinely measured, and include sources like a photoplethysmogram or the electrocardiogram (ECG) [3]. Electrocardiography is an interpretation of the electrical activity of the heart over some period of time. It is a non-invasive procedure, using electrodes attached to the surface of the skin, and is commonly used to measure the heart rate, the regularity of the beats, and characterize properties or injuries in the heart chamber. An R-to-R interval (RRI) describes the latency between two consecutive R peaks in the ECG. The RRI time series are used as input for the determination of HRV parameters.
In studies of HRV, both time-and frequency-domain measures are typically used by practitioners and researchers [1,4]. Additionally, further knowledge about the subject's status can be discovered by the evaluation of certain patterns and shifts in an "apparent ensemble amount of randomness" of a stochastic process [5]. This randomness, as well as the predictability of this process, can be measured by entropy [6]. Thus, it is a commonly used tool to describe the regularity of large biomedical data sets. The more regulatory inputs a system has, the higher its irregularity is, due to interference of those regulatory systems. This assumption is also true for many biomedical systems such as HRV [7]. Therefore, it is a reasonable hypothesis that a more regular heart rate variability is connected to a defect in a regulatory system. To measure this irregularity, Pincus et al. proposed approximate entropy (ApEn) [8]. Further types of entropy were developed based on this method to improve it [9]. Because of its ability to measure regularity, entropy is widely used as a diagnostic tool in medicine to derive and discover biomarkers in large biomedical data. Its applications range from sudden infant death syndrome [7], to complexity analysis of intracranial pressure dynamics during periods of severe intracranial hypertension [10], to quantification of amplitude variations in mechanomyographic signals [11], to analysis of short gait data sets [12], to automatic detection of normal, pre-ictal, and ictal conditions from recorded electroencephalography signals [13], to the postural sway in stroke patients [14].
The approach to quantify the structural complexity (or, inversely, regularity) of the HRV is called heart rate complexity (HRC) and utilizes methods derived from nonlinear dynamics. Note that complexity and variability are not necessarily the same [15]. A periodical signal, such as a sine wave, is variable but not complex. This property allows complexity measures to ignore the complicated periodic oscillations to at least some extent [16].
To date, numerous entropy types are widely accepted as measures of the HRC. However, since their function is controlled by three, four or six parameters, there are many possible combinations to choose from. The selection of criteria for these parameters is controversial and heavily depends on the intended purpose and the data at hand [17]. The variation of only one parameter often results in highly non-linear behaviour [12]. Therefore, results of calculations with different parameters cannot be easily extrapolated from existing data, but have to be computed individually. Hence, the variation of several parameters in order to optimize the reliability of the results is a very time consuming process. To be used in daily routine, reasonable parameters have to be selected prior to the evaluation of HRC.
The determination of appropriate parameters for entropy in general [11,12,[18][19][20] and for HRV applications in particular [16,17,21,22] is the subject of ongoing research. A summary of the current knowledge regarding parameter selection is given in the subsection "Parameter selection". In order to verify the results of previous publications and to extend them by the usage of more entropy measures, this work focuses on the parameter selection for approximate, sample, fuzzy and fuzzy measure entropy and their implications for HRV data. The objective of this paper is to provide a reference for choosing parameters for HRV applications based on their influence on the different entropies' ability to distinguish significantly between pathological and non-pathological recordings. The main research questions addressed in this paper are: (1) what does the choice of the threshold value(s) mean for the entropies to be a direct measure of regularity, (2) how does the data length influence significance for different data sets, (3) how should the weighting factor(s) be chosen for fuzzy and fuzzy measure entropy, and finally, (4) what are the constraints for choosing the threshold values?.

Methods
In the following sections approximate (ApEn), sample (SampEn), fuzzy (FuzzyEn) and fuzzy measure entropy (FuzzyMEn) are described in detail. They are built on each other and ordered by increasing complexity. The number of parameters increases from three for ApEn to six for FuzzyMEn, with the basic parameters staying the same and being extended by new ones in each step. Afterwards, the challenge of parameter selection and the tests conducted for the parameters to the different entropy types are described. Finally, the data used for all performed tests are briefly described.

Approximate entropy
The main idea behind approximate entropy is that a sequence is regular if a subsequence and an expansion of the subsequence are similar. It was developed by Pincus [8] and is calculated the following way.
Given a sample sequence {u 1 , . . . , u N }, a template length m and a threshold value r, the sequence is first split into overlapping sequences Next, define C m i as the number of j = 1, . . . , N − m + 1, is the Chebyshev distance, i.e., the maximum distance between two elements of X m i and X m j . Constructing the same sequences, but with template length m + 1, yields C m+1 i . Then, φ m and φ m+1 are defined as: The approximate entropy is then defined as

Sample entropy
Richman and Moorman showed in [23] that approximate entropy is biased towards regularity. Thus, they modified it to sample entropy. The main difference between the two is that sample entropy does not count self-matches, and only compares the first N − m subsequences instead of all N − m + 1, so the same amount of subsequences are used in j m and j m+1 [23]. It is calculated in the following way.
Given a sample sequence {u 1 , . . . , u N }, a template length m, and a threshold value r, first the overlapping sequences, {X m 1 , . . . , X m N−m } are constructed as for ApEn. As opposed to ApEn, C m i is now defined as the number of j = 1, . . . , is again the Chebyshev distance. Applying the same for template length m + 1 results in:

Fuzzy entropy
ApEn and SampEn are very sensitive with respect to the threshold parameter r. They show, on one hand, a very abrupt behavior, and on the other hand less significance for small r, as was discussed in [9] and can be seen in figure 1. To soften these effects, Chen et al. developed fuzzy entropy, which uses a fuzzy membership function instead of the Heaviside function [9]. FuzzyEn is calculated as follows.
Let m and r again be the template length and the threshold value and n a weight for the fuzzy membership function. Sequences {X m 1 , . . . , X m N−m+1 } are defined as for ApEn and SampEn from an input sequence {u 1 Next, using a fuzzy membership function x µ(x, n, r) , n, r) . According to Chen et al. [9], fuzzy membership functions must be continuous and convex. The first property guarantees only slow similarity changes, and by the second condition self-similarity is a maximum of the function. For all tests conducted in this study, x → e −(x/r) n was chosen as the fuzzy membership function as in [9]. For n ∞ this fuzzy membership function converges to the Heaviside function. These steps are repeated for template length m + 1 to get D m+1 . Consequently, φ m and φ m+1 are calculated the following way: Fuzzy entropy is defined as FuzzyEn(m, r, n) := lim and can be estimated by FuzzyEn(m, r, n, N)

Fuzzy measure entropy
Liu et al. proposed adding a function for global similarity to the fuzzy entropy and called the combination fuzzy measure entropy [24]. It can be calculated as follows. Given a data sequence {u 1 , . . . , u N }, a template length m, two threshold values r L and r F , and two weighting parameters n L and n F (r L and n L correspond to the local term and r F and n F to the global one), a sequence {X m 1 , . . . , X m N−m+1 } is constructed like before. In the next step, it is transformed into a local sequence (5) and XF m i := X m i − u mean , and u mean is the mean value of the complete data sequence {u 1 , . . . , u N }.
Using the local and the global parameters for the fuzzy membership functions, the matrices DL m and DF M are defined as DL m Afterwards, all these steps are repeated for template length m + 1 to get DL m+1 and are defined as: Fuzzy measure entropy is then defined as FuzzyMEn (m, r L , r F , n L , n F ) := lim

Parameter selection
As one can see in the description of each entropy, the various entropy types have three, four or six parameters. Thus, there are many possible combinations to choose from. The parameters which are varied in the test cases, their ranges and values mentioned in the literature, and the choice of certain fixed parameters are described here. The most common choice for the template length is m = 2, as it was recommended by Pincus and Goldberger for ApEn [7], by Yentes et. al for SampEn [12], and confirmed by other studies, e.g. [20]. A false nearest neighbor method is also sometimes used, but according to Chon et al. the standard choice leads to the statistically best solutions for ApEn for human heart rate variability data [25]. In [9,24], the template length is set to m = 2 for other entropy types as well. For comparability, this value was used for all tests.
Some publications describe a sensitivity of the entropies to the data length N [1, 6,8]. Therefore, the significance of the entropies calculated of parts of the heart rate variability data has been tested with increasing data set size.
For the threshold parameter r, Pincus suggested in [8] to choose a value between 0.1 s and 0.25 s, where s is the sample standard deviation of the data sequence. This is also the standard range in most publications, with the most common choice of r = 0.2 s [17,21,23,24,26]. To examine threshold parameters in the standard range, they were tested with heart rate variability data.
In [19,27], the so called flip-flop effect is described, where for some values of the threshold value r a signal has a higher entropy compared to another and for other choices of r a lower one. This is also shown in [21] for heart rate variability data. To test for this effect, various entropies of signals from one database were calculated with parameters inside the standard range. Lu et al. also showed this effect in [26]. Therefore, they proposed choosing r ∈ [0.1 · s, 1.0 · s] for ApEn in such a way as to maximize the entropy value. Since finding a maximum is computationally very expensive, Chon et al. created in [25] an empirical formula to calculate an r, hereinafter called r Chon , which approximates a maximizing r. For m = 2, it can be formulated as: where s 1 is the standard deviation of the distances in the data sequence, i.e., the standard deviation of {(u 1 − u 2 ), . . . , (u N−1 − u N )}, and s 2 the standard deviation of the complete data sequence. This formula was derived from nonphysiological data and Liu et al. showed in [17] that it is not always a good approximation of the maximizing r, but actually leads to more significant results than the maximizing r, when applied to heart rate variability data.
Regarding fuzzy function parameters, Chen et al. [9] used n = 2 and r = 0.2 s for test signals. They described in [9] that for a larger n, the closer data points are weighted more strongly. Liu et al. [24] used the weighting factors n L = 3, n F = 2 and r L = r F = 0.2s for heart rate variability analysis. Their choices for n L and n F were given without any motivation.

Test cases
To get further knowledge of the parameters, heart rate variability data were used to compare different choices of r L , r F and n, n L and n F with respect to the resulting significance of the statistical tests comparing pathological and normal cases.
The following test cases using the data described in the Data section have been conducted to answer the research questions stated in the Background section and to support an "optimal" parameter selection for all entropy types for heart rate variability data. For each research question, one test case has been designed (except for the third, which is covered in two test cases: one for each entropy type under consideration). All tests have been conducted consecutively. Fixed parameters have been taken from literature or based on the outcomes of preceding tests.

Test case 1
Variation of the threshold values r = r L = r F within the standard interval [0.1 s, 0.25 s] [8] to show its influence on the entropy values and the aforementioned flip-flop effect [21,27]. Fixed parameters were m = 2, n = n L = 3 and n F = 2 [7,24]. For data length N, the data length of the shortest RR interval sequence of the available data N = 1126 sets was used.

Test case 2
Variation of the data length N = x · 110 with x ∈ [1, 10] to show its influence on the significance. Maximum N was 1100 due to the length of the available test data. Fixed parameters were m = 2, n = n L = 3, n F = 2 and r = r Chon or r = 0.2 s [7,17,21,23-26].

Test case 3
Variation of the weighting factor n for FuzzyEn in the interval [1,6] to show its influence on the significance. Fixed parameters were m = 2 and r = r Chon or r = 0.2 s [7,17,21,[23][24][25][26]. N = 1000 was chosen based on the results of test case 2.

Test case 4
Variation of the weighting factors n L and n F for Fuzzy-MEn in the interval [1,6] to show their influence on the significance. Fixed parameters were m = 2 and r L = r F = r Chon or r L = r F = 0.2 s [7,17,21,[23][24][25][26]. N = 1000 was chosen based on the results of test case 2.

Test case 5
Variation of the threshold values r L and r F for Fuzzy-MEn in an interval of [0.25 · r Chon , 6 · r Chon ] and in an interval of [0.1 s, 0.25 s] [8] to show their influence on the significance. The parameter m = 2 was fixed [7]. N = 1000, n L and n F were chosen based on the results of the previous tests.
In order to test for their statistical significance, the calculated entropies were first tested for normality using Lilliefors' composite goodness-of-fit test [28]. If this test was positive for all results within a subset of a test case (i.e., certain database and/or choice for r), the p-value was calculated with a two sample t-test, and otherwise a Wilcoxon rank sum test was performed. To ensure comparability, the same statistical test was used for each subset of a test case.
Since statistical tests used in this work are based on the same null hypothesis, the same subject groups, the same endpoints and only slight variations of the analysis method, interaction of the observed results is not only possible, but highly probable. On the other hand, p-value adjustments such as the commonly used Bonferroni correction assume uncorrelated endpoints and are therefore considered inappropriate for the tasks in this work [29]. Besides, the aim of this work is not to test whether there is a difference between groups, but to investigate the ability of varied methods to detect those differences.
Due to the high computational complexity of the tests conducted (e.g., the calculation of the variation of the weighting factors n L and n F for FuzzyMEn takes several hours for only one r and one comparison of databases), the authors had to refrain from more robust randomized testing strategies (i.e., the permutation of the original data), since the computation time would multiply by at least a thousand times.

Data
All data used for the described tests have been taken from http://Physionet.org [30], a free-access, on-line archive of physiological signals. They are described in detail in this section.
To create a control group all databases described as non-pathological were combined into one database, afterwards called np. This database contains the Normal Sinus Rhythm RR Interval Database, which consists of the beat annotations of 54 long-term ECG recordings, digitized at 128 samples per second, of subjects with normal sinus rhythm (30 men, aged 28.5 to 76, and 24 women, aged 58 to 73). Furthermore, it includes the MIT-BIH Normal Sinus Rhythm Database, which consists of 18 long-term recordings (5 men, aged 26 to 45, and 13 women, aged 20 to 50) digitized at 128 samples per second. The Fantasia Database of 120-minute recordings of twenty young (10 men and 10 women; 21 -34 years old) and twenty elderly (10 men and 10 women; 68 -85 years old) healthy subjects with ECG digitized at 250 Hz was also used [31]. The record "fantasia/f1o09" had to be excluded due to its high number of supraventricular premature beats. This results in a total database size of 111 recordings. RR intervals greater than 2.5 seconds were excluded to ignore artifacts.
Two databases were used to search for pathological effects. One was the Congestive Heart Failure RR Interval Database, afterwards referred to as chf2, which includes 29 long-term ECG recordings, with a sampling frequency of 128 Hz, of subjects aged 34 to 79 (8 men and 2 women; gender not known for the remaining subjects) with congestive heart failure (NYHA classes I, II, and III) [32].
The second one was the MIT-BIH Arrhythmia Database, afterwards called mit, which contains 48 half-hour recordings, sampled with a frequency of 360 Hz, from 47 subjects (25 men aged 32 to 89 years and 22 women aged 23 to 89 years) [33]. It contains a set of randomly chosen signals and 25 signals especially chosen to include examples of uncommon but clinically important arrhythmias recorded at the BIH Arrhythmia Laboratory [33].
The two databases chf2 and mit were always evaluated separately to keep the results as homogeneous as possible and to avoid the mutual neutralization of ab-normalities. To ensure comparability, the data length N was equal for all recordings. Longer recordings were cropped at the beginning and the end in equal shares.
In a way, this work can be considered as a pilot study, as only previously recorded data are used. Furthermore, the findings of this work will be incorporated in followup studies.

Results
The following sections show the results of our tests, which are described in the Test cases section above. An overview of these results is given in Table 1.
The results were tested for normality using Lilliefors' composite goodness-of-fit test. Throughout the whole test case 2, which combines results of two different ways to determine r (see figure 2), data were either normally or not normally distributed. Hence, entropy values were compared using the Wilcoxon rank sum test. Entropies calculated with r as multiple of s were distributed normally in our test scenarios and could therefore be compared using the two sample t-test (as in figures 3 (B, D), 4 (B, D) and 5 (B, D)). Entropy values derived using r as multiple of r Chon , on the other hand, were not normally distributed and hence compared using Wilcoxon's rank sum test (as in figures 3 (A, C), 4 (A, C) and 5 (A, C)). Only findings of the same statistical test are combined in the following sections. Figure 1 corresponds to test case 1 and shows the effect of different threshold values on the entropy values. One of the defining characteristics of FuzzyEn and FuzzyMEn, their relative insensitivity to changes in r, is clearly visible. The flip-flop effect can be observed for ApEn. For the chf2 database, entropy values are higher for pathological data where r <0.15 s, whereas they are lower for r >0.15 s. This behavior is reversed for the mit database, with lower entropy values for pathological data for r <0.18 s and higher entropy values for pathological data for r >0.18 s.
The results for test case 2, analyzing the sensitivity of the entropy types to the data length are represented in figure 2. The more data points evaluated, the higher the Variation of the weighting factors n L and n F for FuzzyMEn in the interval [1,6] Best results with n L = 2, n F = 1 Best results with n L = 1, n F = 3 separation between pathological and non-pathological data. When using the threshold value r = 0.2 s, significance is already reached with N ≥ 200 data points, whereas with r = r Chon more data (N ≥ 1000) are needed before significance is reached. Comparing the different methods when using r = 0.2 s, one can see that they only differ when N is very small. With higher N , their behavior converges. Figure 3 presents the results of test case 3, showing that the results become insignificant for n >3 with r = r Chon , and for n >1 with r = 0.2 s in the mit database (C, D). The increasing p-values are below the significance level (p <0.05) for the chf2 database within the observed range of n (A, B). , calculated with n L = 1 and n F = 3. In (A), the best results are achieved with r L < ∼ 1 · r Chon or r F < ∼ 1 · r Chon , whereas they have to be greater than or equal to 1 · r Chon in (C) for good performance. In (B), r L <0.12 s or r L > ∼ 0.2σ yield significant results, however rF does not matter that much. In contrast, the best performance in (D) is achieved with r F > ∼ 0.18σ and r L ∈ [0.14 s, 0.22 s].

Discussion
As a summary of the following discussion of the previously presented results, the combinations of parameters which yielded the best results are listed in Table 2. Some of the tests concerning the parameter selection showed no clear results. A couple of evaluations revealed contradicting results depending on the database (chf2 vs. mit ) or on the chosen parameters (e.g., r = r L = r F = r Chon vs. r = r L = r F = 0.2s). The latter is easy to overcome by choosing a different set of parameters based on the choice of the threshold value r. However, a compromise must be found in order to find parameters suitable for both databases.
One of the hardest difficulties lies in the choice of the threshold value r due to the flip-flop effect, i.e., for some parameters one data set has a higher entropy compared to another, but this order is reversed for different parameter choices [17,27]. This can occur for simple signals, but also when analyzing heart rate variability data, as in figure 1. This leads to difficulties with the interpretation of the entropy, i.e., the direct assignment of entropy values to pathological or non-pathological data without a given r. In our tests, the effect occurred for ApEn, but it is also reported to occur for SampEn and FuzzyEn as well by Boskovic et al. [27]. The flip-flop effect does not allow us to make a clear statement, as in [7][8][9]17], that a higher entropy corresponds to a higher irregularity. Since this effect can happen for all different entropy values [27], they should not be seen as a direct measure of regularity, but rather as an indicator of differences in regularity with regard to certain time periods.
Compared to the threshold value, the data length N seems to have a smaller effect on the ability of the entropy measures to differentiate pathological from non-pathological data sets. Generally, a larger N leads to a higher probability of significance. If possible, the data length N should be longer than 200 data points when using r = 0.2 s. This finding is consistent with previous studies, e.g. [12]. Surprisingly, it should even exceed a length of 1000 data points when using r = r Chon , assuming a continuing trend in figure 2. Due to the length of the available test data, the range had to be restricted for this test. In case of HRV data, N is the number of recorded heartbeats and therefore proportional to the duration of the recording. Thus, to increase N, longer measurements are necessary. The Task Force of The European Society of Cardiology and The North American Society of Pacing and Electrophysiology [4] recommends a duration of five minutes for short time recordings, which would result in 300 data points at an average heart rate of 60 beats per second. This would be sufficient when using r = 0.2 s, but not for r = r Chon . These considerations should be kept in mind when dealing with HRV data.
Due to the different behavior when varying n, n L and n F given different threshold parameters r = r L = r F = r Chon and r = r L = r F = 0.2 s, the parameters n, n F and n L have been chosen independently for the different threshold values. This is no constraint to the method, since the choice of r is known beforehand anyway (and is not based on the potentially unknown medical condition of the subject). The values n L = 2 and n F = 1 for r L = r F = r Chon , and n L = 1 and n F = 3 for r L = r F = 0.2 s showed better results than n L = 3, n F = 2 as proposed by Liu et al. in [24]. Unsurprisingly, similar values were found for n, as n for FuzzyEn equals n L for FuzzyMEn. For consistency, n = n L = 2 for r = r Chon and n = n L = 1 for r = 0.2 s are recommended.
The tests concerning r F and r L showed that there is no optimal choice, since the results in figure 5 (A) and (C) as well as (B) and (D) contradict each other. Nevertheless, both r F = r L = r Chon and r F = r L = 0.2 s, as described in the literature [9,24,25], are the most reasonable compromise between figure 5 (A) and (C), and (B) and (D), respectively.
Finally, a number of important limitations of this study need to be considered. First, this study is limited to previously recorded signals of only two different cardiac diseases due to the availability of data. As already mentioned in the Data section, a separate evaluation is warranted, to avoid the mutual neutralization of abnormalities. Furthermore, the template length was fixed to m = 2 for all calculations, as the number of possible variations of parameters would get too high otherwise and m = 2 seems to be a reasonable choice in all investigated literature, e.g., [8]. In [34] it was reported that spikes due to recording errors in heart rate variability data can disturb ApEn and SampEn. No tests were done to examine this behavior for the included entropies. Finally, the study did not evaluate the dependency of the entropies on age and gender as reported in literature [35,36].

Conclusions
The results of the presented study clearly stress the need for further investigations of signal entropy for heart rate variability data. Given the wide range of different medical conditions of subjects, the assortment of available methods to calculate entropy, and their customizability with up to six degrees of freedom, it is almost impossible to cover all combinations in a single study. Future work will therefore be focused on overcoming the limitations of the presented work, i.e., extending the evaluations to other cardiac diseases, the variation of the template length m, investigating the robustness with respect to recording errors, and the relation of the parameter choice to age and gender.