 Methodology article
 Open Access
Finding an appropriate equation to measure similarity between binary vectors: case studies on Indonesian and Japanese herbal medicines
BMC Bioinformatics volume 17, Article number: 520 (2016)
Abstract
Background
Binary similarity and dissimilarity measures play a critical role in processing data that consist of binary vectors in various fields, including bioinformatics and chemometrics. These metrics express the similarity or dissimilarity between two binary vectors in terms of positive matches, absence mismatches, and negative matches. To our knowledge, no published work presents a systematic way of finding an appropriate binary similarity equation that performs well for a given data type or application. A proper method for selecting a suitable binary similarity or dissimilarity measure is needed to obtain better classification results.
Results
In this study, we propose a novel approach for selecting binary similarity and dissimilarity measures. We collected 79 binary similarity and dissimilarity equations through an extensive literature search and implemented them as an R package called bmeasures. We applied these metrics to quantify the similarity and dissimilarity between herbal medicine formulas, treating the Indonesian Jamu and Japanese Kampo datasets separately. Using Receiver Operating Characteristic (ROC) curve analysis, we assessed the capability of the binary equations to classify herbal medicine pairs into match and mismatch efficacy classes based on their similarity or dissimilarity coefficients. According to the area under the ROC curve, the Indonesian Jamu and Japanese Kampo datasets yielded different rankings of the binary similarity and dissimilarity measures. Of all the equations, the Forbes2 similarity and the Variant of Correlation similarity are recommended for studying the relationships between Jamu formulas and between Kampo formulas, respectively.
Conclusions
The selection of binary similarity and dissimilarity measures for multivariate analysis is data dependent. The proposed method can be used to find the most suitable binary similarity or dissimilarity equation for a particular dataset. Our findings suggest that all four types of matching quantities in the Operational Taxonomic Unit (OTU) table are important for calculating the similarity and dissimilarity coefficients between herbal medicine formulas. In addition, the binary similarity and dissimilarity measures that include the negative match quantity d separate herbal medicine pairs better than equations that exclude d.
Background
Binary features have been commonly used to represent a great variety of data [1–3], expressing the binary status of samples as presence/absence, yes/no, or true/false. They have many applications in the bioinformatics, chemometrics, and medical fields [4–19], as well as in pattern recognition, information retrieval, statistical analysis, and data mining [20, 21]. Choosing an appropriate coefficient of similarity or dissimilarity is necessary when evaluating multivariate data represented by binary feature vectors, because different similarity measures may yield conflicting results [22]. Choi et al. [23] collected binary similarity and dissimilarity measures used over the last century and revealed their correlations through hierarchical clustering. They also classified the equations into two groups based on the inclusion or exclusion of negative matches. Consonni & Todeschini [1] proposed five new similarity coefficients and compared them with some well-known similarity coefficients. Three of the five are less correlated with the other common similarity coefficients and need further investigation to understand their potential. Meanwhile, Todeschini et al. [24] reported an analysis of 44 different similarity coefficients for computing the similarities between binary fingerprints by using simple descriptive statistics, correlation analysis, multidimensional scaling, Hasse diagrams, and their proposed ‘atemporal target diffusion model’.
Nowadays, the use of herbal medicines, e.g. Indonesian Jamu, Japanese Kampo, and traditional Chinese medicine (TCM) [25], is becoming popular for disease treatment and maintaining good health. In the case of Indonesian Jamu, each Jamu medicine is prepared from a single plant or a mixture of several plants as its ingredients. The National Agency of Drug and Food Control (NADFC) of Indonesia supervises the production of Jamu medicines before their release for public use. Up to 2014, there were 1247 Jamu factories in Indonesia [26]. They have concocted many Jamu formulas with various efficacies. Consequently, the study of Jamu formulas has become an interesting research topic in the last few years. Such studies may address the philosophy of Jamu, the systematization of Jamu, or phytochemistry. In Jamu studies, the relationships between plants, Jamu, and efficacies have been used to determine the important plants for every disease class using global and local approaches [4, 5, 27]. Kampo formulas, in turn, are traditional medicines from Japan. They are generally prepared by combining crude drugs. In total, 294 Kampo formulas are listed in the Japanese Pharmacopoeia of 2012, and they can be used for self-medication [28]. Currently, many researchers study Kampo to unveil the complex systems of Kampo medication and to reveal the scientific basis of its relevance to modern healthcare. In Jamu and Kampo studies, the relations between herbal medicine formulas and plants/crude drugs are represented as binary feature vectors, denoting whether or not a particular plant is used as an ingredient.
The relationships between Jamu formulas, as well as Kampo formulas and other herbal medicines, are reflected not only by efficacy similarity but also by ingredient similarity. One Jamu formula can be suggested as an alternative to another if they have relatively similar ingredients. For mathematical analysis, each Jamu formula is represented as a binary vector using 1 to indicate the presence of a plant and 0 otherwise. However, each Jamu formula usually uses only a few plants. Thus, most of the Jamu vectors contain a few 1s and many 0s. Consequently, the number of plants that are used simultaneously in a Jamu pair is much smaller than the number of plants that are absent from both formulas. Therefore, when searching for relatively similar Jamu formulas, the high number of negative matches might influence the calculation of binary similarity or dissimilarity between Jamu pairs. On the other hand, there is no guarantee that negative co-occurrence between two entities is identical [29]. Hence, it is necessary to examine the binary similarity and dissimilarity coefficients of Jamu formulas to determine the appropriate measurement for finding a suitable mixing alternative of a target crude drug.
Currently, there are several methods to measure the quality of classifiers [30, 31], such as Receiver Operating Characteristic (ROC) curves [32, 33], Precision-Recall (PR) curves [33, 34], and Cohen’s Kappa scores [35, 36]. An ROC curve is a very powerful tool for measuring classifier performance in many fields, especially in machine learning and binary-class problems [37]. The purpose of ROC analysis is similar to that of Cohen’s Kappa, which is mainly used for ranking classifiers. The ROC curve conveys more information than Cohen’s Kappa in the sense that it also visualizes the performance of a classifier as a curve instead of generating just a scalar value. In this study, we propose a method to select the most suitable similarity measures in the context of classification based on False Positive Rates (FPRs) and True Positive Rates (TPRs) by using ROC curve analysis. We discuss the step-by-step development of this method by applying it to assess the similarity of herbal medicines in the context of their efficacies. Initially, we gathered 79 binary similarity and dissimilarity equations. Some identical equations were eliminated in a preliminary step. Subsequently, the capability of the binary measures to separate herbal medicine pairs into match and mismatch efficacy groups was assessed by using ROC analysis.
Methods
The proposed method leads to the selection of a suitable equation such that when two herbal medicine formulas belong to the same efficacy group, their ingredient similarity measured by the equation becomes higher in the global context of a large set of formulas. Figure 1 illustrates data representation and also the procedure of our experiment.
Datasets
We used 3131 Jamu formulas collected from the NADFC of Indonesia [4, 5, 27], which comprise 465 plants. The Jamu vs. plant relations were organized as a 3131 × 465 matrix (Fig. 1a). Jamu formulas were represented by binary vectors that express the status of each plant as an ingredient: 1 (presence) or 0 (absence). Each Jamu formula consists of 1 to 26 plants, with an average of 4.904 and a standard deviation of 2.969, and the set union of all formulas consists of 465 plants. Each Jamu formula corresponds to one or more efficacy/disease classes. In total, 14 disease classes are used in this Jamu study, of which 12 are from the National Center for Biotechnology Information (NCBI) [38]. The disease classes are as follows: blood and lymph diseases (E1), cancers (E2), the digestive system (E3), female-specific diseases (E4), the heart and blood vessels (E5), diseases of the immune system (E6), male-specific diseases (E7), muscle and bone (E8), the nervous system (E9), nutritional and metabolic diseases (E10), respiratory diseases (E11), skin and connective tissue (E12), the urinary system (E13), and mental and behavioral disorders (E14). The 3131 Jamu formulas yield (3131 × 3130)/2 = 4,900,015 Jamu pairs.
For the purpose of comparison, we created four random matrices of the same size as the Jamu-plant relation matrix by randomly inserting 1s and 0s. In three of the random datasets, the numbers of 1s are 1, 5 and 10% of the 465 plants (called random 1%, random 5%, and random 10%). In the fourth dataset, we randomly inserted into every row the same number of 1s as in the corresponding original Jamu formula (called random Jamu). We also applied our proposed method to the Kampo dataset [28]. This dataset is presented as a two-dimensional binary matrix with rows and columns representing Kampo formulas and crude drug ingredients, respectively. The Kampo dataset is composed of 274 Kampo formulas. Each formula consists of 3 to 19 crude drugs, with an average of 8.923 and a standard deviation of 3.885, and the set union of all formulas consists of 227 crude drugs. Each Kampo formula is classified into the deficiency or excess class, according to the Kampo-specific diagnosis of the patient’s constitution.
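The construction of the random control datasets can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that the stated percentage of 1s applies to each row, and the function names `random_density_matrix` and `shuffled_rows` are hypothetical.

```python
import random

def random_density_matrix(n_rows, n_cols, density, rng):
    """Build a binary matrix in which every row holds round(density * n_cols)
    ones at randomly chosen columns (assumption: the percentage is per row)."""
    k = round(density * n_cols)
    rows = []
    for _ in range(n_rows):
        ones = set(rng.sample(range(n_cols), k))
        rows.append([1 if j in ones else 0 for j in range(n_cols)])
    return rows

def shuffled_rows(matrix, rng):
    """'Random Jamu'-style control: keep each row's number of ones,
    but move them to random column positions."""
    out = []
    for row in matrix:
        shuffled = list(row)
        rng.shuffle(shuffled)
        out.append(shuffled)
    return out

rng = random.Random(42)                      # fixed seed for reproducibility
control = random_density_matrix(5, 465, 0.05, rng)
original = [[1, 0, 0, 1, 0], [0, 0, 1, 0, 0]]  # toy stand-in for the real matrix
random_like_original = shuffled_rows(original, rng)
```

The row-shuffling variant preserves the sparsity profile of the real data, so any difference in ROC behaviour between the real and the shuffled matrix cannot be attributed to the number of ingredients per formula alone.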
Flow of the experiment
The binary similarity (S) and dissimilarity (D) measures between a herbal medicine pair are expressed by the Operational Taxonomic Units (OTUs, as shown in Fig. 1a) [39, 40]. Concretely, let two Jamu formulas be described by two row vectors \( J_i \) and \( J_{i'} \), each comprised of M variables with value 1 (presence) or 0 (absence). The four quantities a, b, c, d in the OTUs table are defined as follows: a is the number of features where the values for both \( J_i \) and \( J_{i'} \) are 1 (positive matches), b and c are the numbers of features where the value for \( J_i \) is 0 and \( J_{i'} \) is 1 and vice versa, respectively (absence mismatches), and d is the number of features where the values for both \( J_i \) and \( J_{i'} \) are 0 (negative matches). The sum of a and d represents the total number of matches between \( J_i \) and \( J_{i'} \), and the sum of b and c represents the total number of mismatches between \( J_i \) and \( J_{i'} \). The total sum of the quantities in the OTUs table, a + b + c + d, is equal to M.
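As an illustration of the OTU quantities defined above, the following sketch counts a, b, c and d for two binary vectors and evaluates two widely known coefficients, Jaccard (which ignores d) and simple matching (which includes d). The function names are hypothetical, and the two formulas are the standard textbook forms rather than a transcription of Table 1.

```python
def otu_counts(x, y):
    """Return (a, b, c, d) for two equal-length 0/1 vectors.
    a: both 1; b: x is 0 and y is 1; c: x is 1 and y is 0; d: both 0."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    c = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d

def jaccard(a, b, c, d):
    """Jaccard similarity: positive matches over all non-negative-match features."""
    return a / (a + b + c)

def simple_matching(a, b, c, d):
    """Simple matching similarity: all matches (a + d) over all M features."""
    return (a + d) / (a + b + c + d)

x = [1, 1, 0, 0, 0, 1]
y = [1, 0, 1, 0, 0, 1]
a, b, c, d = otu_counts(x, y)      # (2, 1, 1, 2), and a + b + c + d == 6
print(jaccard(a, b, c, d))         # 0.5
print(simple_matching(a, b, c, d)) # 4/6, about 0.667
```

For sparse vectors such as Jamu formulas, d dominates the other three counts, which is why coefficients that include d can behave very differently from those that do not.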
We collected equations to measure similarity or dissimilarity between binary vectors from the literature [1, 3, 20, 21, 23, 24, 29, 40–62], listed as Eqs. 1–79 in Table 1. The binary similarity and dissimilarity equations are expressed in terms of the four quantities a, b, c and d. We also implemented these 79 equations as an R package called bmeasures. The bmeasures package is available on GitHub and can be installed by invoking these commands: install.packages("devtools"), library("devtools"), install_github("shwijaya/bmeasures"), library("bmeasures"). The installation of the bmeasures package was tested on R release 3.2.4 with devtools ver. 1.11.0. Initially, we measure the similarity and dissimilarity coefficients between herbal medicine pairs by using the 79 equations. The resulting similarity/dissimilarity coefficients are then used for further analysis. Our experimental procedure can be divided into two major steps, which we discuss in the following segments:
Step 1. Reducing the candidate equations
The binary similarity and dissimilarity equations were evaluated to eliminate duplications. When two or more equations can be transformed into the same form by algebraic manipulation, only one of them is kept for further analysis. We also removed from our analysis any equations that produce infinite/NaN values or indeterminate forms when applied to measure similarity and dissimilarity in any of the datasets.
Hierarchical clustering of the remaining equations was then performed to further narrow down the number of candidate equations and to evaluate the closeness between equations. After we obtained the similarity/dissimilarity coefficients between herbal medicine pairs for each equation, we clustered the equations based on their similarity/dissimilarity coefficients using agglomerative hierarchical clustering with centroid linkage (Fig. 1b) [50, 63–65]. The Euclidean distance (Eq. 80) was used to measure the distance between two equations, k and l, that is:
\( d_{k,l}=\sqrt{\sum_{m=1}^{N-1}\sum_{n=m+1}^{N}{\left(s_{mn}(k)-s_{mn}(l)\right)}^2} \)  (80)
where \( s_{mn}(k) \) and \( s_{mn}(l) \) are the similarity/dissimilarity values of the herbal medicine pair (m, n) using equations k and l respectively, N is the total number of herbal medicine formulas, and \( d_{k,l} \) is the distance between equations k and l. The cluster centroid is the average of the variables over the observations (in the present case, equations) in that cluster. Let \( {\overline{X}}_G,{\overline{X}}_H \) denote group averages for clusters G and H. Then, the distance between cluster centroids is calculated using Eq. 81.
\( d_{GH}={\left\Vert {\overline{X}}_G-{\overline{X}}_H\right\Vert}_2 \)  (81)
where \( {\overline{X}}_G \) is the centroid of G by arithmetic mean \( {\overline{X}}_G=\frac{1}{n_G}{\displaystyle {\sum}_{i=1}^{n_G}}{X}_{Gi} \) [2, 65, 66]. We implemented the clustering process using the hclust function in R. At each step, the cluster centroid was calculated to represent the group of equations in each cluster. The two equations or clusters whose centroids are closest are then merged, and this is repeated until all equations are merged into one cluster.
We performed the hierarchical clustering process twice: first to reduce the candidate equations for which the distance between equations measured by Eq. 80 is zero or nearly zero, and second to evaluate the combined characteristics of groups of equations. Mean centering and unit variance scaling were applied to the similarity/dissimilarity coefficients before the clustering process.
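A small sketch may help to see why the scaled Euclidean distance flags redundant equations. It assumes that each equation is represented by its vector of coefficients over the same herbal medicine pairs, and it uses the population standard deviation for unit variance scaling (R's `scale` uses the sample standard deviation; the conclusion is the same). The coefficient values are invented for illustration.

```python
import math
from statistics import mean, pstdev

def standardize(values):
    """Mean-center and scale to unit variance (population standard deviation).
    Raises ZeroDivisionError if all values are identical."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def euclidean(u, v):
    """Euclidean distance between two equal-length coefficient vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))

# Coefficients of four hypothetical pairs under a hypothetical equation k ...
s_k = [0.10, 0.40, 0.25, 0.80]
# ... and under an equation l that is a linear transform of k (here 2*s - 1).
s_l = [2 * s - 1 for s in s_k]

raw_distance = euclidean(s_k, s_l)                               # clearly non-zero
scaled_distance = euclidean(standardize(s_k), standardize(s_l))  # zero up to rounding
```

After standardization, any equation of the form p·s + q with p > 0 coincides with s, so such pairs end up at zero or nearly zero distance in the dendrogram.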
Step 2. ROC Analysis of selected equations
The capability of the selected equations to measure similarity/dissimilarity was evaluated by means of the ROC curve (Fig. 1c) [67, 68]. For ROC analysis, we divided all the herbal medicine pairs into match and mismatch efficacy classes and used the corresponding distributions with respect to similarity scores to calculate FPRs and TPRs. The ROC curve was created by selecting a series of thresholds to generate FPR and TPR values. FPR is the proportion of false positive predictions out of all actual negatives, and TPR is the proportion of true positive predictions out of all actual positives, defined by Eq. 82 [67–69]:
\( \mathrm{FPR}=\frac{FP}{FP+TN},\quad \mathrm{TPR}=\frac{TP}{TP+FN} \)  (82)
where true positive (TP) is the number of herbal medicine pairs correctly classified as positive, true negative (TN) is the number of pairs correctly classified as negative, false positive (FP) is the number of pairs incorrectly classified as positive, and false negative (FN) is the number of pairs incorrectly classified as negative. We compared the performance of the equations by using the minimum distance of the ROC curve to the theoretical optimum point and by using the Area Under the ROC Curve (AUC) [70]. The minimum distance between the ROC curve and the optimum point was measured as the Euclidean distance. The minimum distance can also be computed from the TP, TN, FP, and FN values corresponding to the selected similarity thresholds i using the following formulation:
\( \min_i\sqrt{{\left(\frac{FP_i}{FP_i+TN_i}\right)}^2+{\left(1-\frac{TP_i}{TP_i+FN_i}\right)}^2} \)  (83)
Results and discussion
Preliminary verification of the equations
In the preliminary step, we removed 12 equations, denoted by ‘*’ in Table 1, because each of them can be recognized as identical to one or more other equations by algebraic manipulation alone, such as a linear transformation. From the seven groups of redundant equations shown in Table 2, we included S_{Jaccard}, S_{Dice1/Czekanowski}, S_{Sokal&Sneath2}, D_{Hamming}, D_{Lance&Williams}, S_{Cosine} and S_{Sokal&Sneath5} in our analysis, which left us with 67 equations at this stage. Next, we clustered the 67 equations using the Jamu and Kampo datasets to reduce the number of equations. During the clustering process, we eliminated 11 equations, indicated by ‘**’ in Table 1, that produced infinite/NaN values or indeterminate forms when applied to the datasets. Such conditions arise when the denominator of an equation becomes equal to 0. For example, the values of b and c in the Mountford and Peirce similarities (Eq. 37 and Eq. 73) are 0 if two formulas use exactly the same ingredients.
The clustering of the 56 remaining equations in the context of Jamu data is shown in Fig. 2. The distances among equations belonging to the individual clusters indicated as 1 to 7 in Fig. 2 are equal or nearly equal to 0. In other words, those equations have similar characteristics when generating binary similarity/dissimilarity coefficients for Jamu data. Using the clustering result, we removed 11 equations, denoted by ‘***’ in Table 1, because they were related to other equations in the same cluster, e.g. we eliminated S_{BaroniUrbani&Buser2} (Eq. 72) because it is similar to S_{BaroniUrbani&Buser1} (Eq. 71). A careful observation of the equations belonging to the same cluster in the group IDs 1 to 7 in Fig. 2 shows that one equation can be transformed into another just by adding or multiplying by constants (Table 3). For example, we can represent S_{BaroniUrbani&Buser2} as [(2 × S_{BaroniUrbani&Buser1}) − 1]. The equations excluded based on the clustering process are as follows: S_{Dice1/Czekanowski} (Eq. 3), S_{Innerproduct} (Eq. 13), S_{Russell&Rao} (Eq. 14), D_{MeanManhattan} (Eq. 20), D_{Vari} (Eq. 23), D_{Chord} (Eq. 30), S_{Kulczynski2} (Eq. 41), S_{Driver&Kroeber} (Eq. 42), S_{Johnson} (Eq. 43), S_{Hamann} (Eq. 67), and S_{BaroniUrbani&Buser2} (Eq. 72). In the case of the Kampo dataset, the clustering results placed the same equations in the same clusters with zero or nearly zero distance. Therefore, the same equations, indicated by ‘***’ in Table 1, were eliminated for both datasets, leaving the same number of selected equations (45 binary similarity and dissimilarity measures) for further analysis. Hence, among the 79 binary similarity and dissimilarity measures used over the last century, there are only 45 unique equations that produce different coefficients by capturing different information. Additionally, these binary measures satisfy the symmetry property [71], i.e. d(x, y) = d(y, x) or S(x, y) = S(y, x).
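The linear relation between the two Baroni-Urbani & Buser coefficients mentioned above can be checked numerically. The sketch below assumes the forms in which these coefficients are commonly stated in the literature (e.g. in surveys such as [23]); Table 1 is not reproduced here, so the correspondence with Eqs. 71 and 72 is an assumption.

```python
import math

def baroni_urbani_buser1(a, b, c, d):
    """Assumed form: (sqrt(a*d) + a) / (sqrt(a*d) + a + b + c)."""
    root = math.sqrt(a * d)
    return (root + a) / (root + a + b + c)

def baroni_urbani_buser2(a, b, c, d):
    """Assumed form: (sqrt(a*d) + a - (b + c)) / (sqrt(a*d) + a + b + c)."""
    root = math.sqrt(a * d)
    return (root + a - (b + c)) / (root + a + b + c)

# A few illustrative OTU tables (a, b, c, d) with non-zero denominators.
tables = [(2, 1, 1, 461), (5, 0, 3, 457), (1, 4, 2, 220), (3, 3, 3, 3)]
for a, b, c, d in tables:
    s1 = baroni_urbani_buser1(a, b, c, d)
    s2 = baroni_urbani_buser2(a, b, c, d)
    # The second coefficient is the first one rescaled from [0, 1] to [-1, 1].
    assert abs(s2 - (2 * s1 - 1)) < 1e-12
```

Because the second coefficient is only a rescaling of the first, both rank every herbal medicine pair identically and would produce the same ROC curve, which is why keeping one of them is sufficient.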
We applied hierarchical clustering again to these 45 equations to give a better understanding of the relationships between the selected equations. In general, the Jamu and Kampo data generated more or less the same heatmap. The resulting dendrogram together with the heatmap of the Jamu data is shown in Fig. 3. We can roughly identify four main clusters (I, II, III, and IV). The hierarchical clustering clearly separated the equations according to whether they measure similarity or dissimilarity. Although similarity and dissimilarity measures may produce the same coefficient range, they work in opposite ways. The higher the similarity between two herbal medicine formulas, the higher the similarity coefficient and the lower the dissimilarity coefficient. Therefore, agglomerative clustering with centroid linkage performs well in separating similarity and dissimilarity equations. All the equations belonging to clusters I and II measure dissimilarity, whereas the equations belonging to clusters III and IV measure similarity. In contrast, the equations that include the negative match quantity d are spread throughout all the clusters. This result indicates that the equations cannot be grouped based on the presence of the negative match quantity d.
ROC analysis of selected equations
The ROC curves were created for each binary similarity/dissimilarity equation to compare their performance. Initially, we normalized the similarity and dissimilarity coefficients, such that their minimum becomes 0 and their maximum becomes 1, before using them to create the ROC curves. In the case of equations that measure dissimilarity, we transformed the normalized dissimilarity coefficient D to a similarity coefficient S for the sake of comparison by using the equation \( S = 1 - D^2 \) [40, 41].
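The normalization and the conversion from dissimilarity to similarity described above can be sketched as follows. The function names are hypothetical, and the input values are invented for illustration.

```python
def min_max_normalize(values):
    """Rescale coefficients so that the minimum becomes 0 and the maximum 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all coefficients are identical; cannot normalize")
    return [(v - lo) / (hi - lo) for v in values]

def dissimilarity_to_similarity(normalized_d):
    """Convert normalized dissimilarities D in [0, 1] to similarities S = 1 - D^2."""
    return [1 - d ** 2 for d in normalized_d]

raw_d = [0.0, 2.0, 4.0]                       # hypothetical dissimilarity coefficients
d_norm = min_max_normalize(raw_d)             # [0.0, 0.5, 1.0]
s = dissimilarity_to_similarity(d_norm)       # [1.0, 0.75, 0.0]
```

The transformation is monotonically decreasing on [0, 1], so the ordering of pairs is simply reversed and the ROC curve of the converted coefficient reflects the ranking induced by the original dissimilarity.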
In the context of Jamu data, we started the ROC analysis of the selected equations by classifying the Jamu pairs into match and mismatch classes based on their efficacies. A Jamu pair belongs to the match class if both Jamu formulas of the pair have the same efficacy. On the other hand, a Jamu pair belongs to the mismatch class if the efficacies of the two formulas are different. The numbers of Jamu pairs in the match and mismatch classes are 646,728 and 4,253,287, respectively. The mismatch class is thus much larger than the match class. This imbalance is a challenge when assessing the capability of the equations to separate Jamu pairs into match and mismatch classes. To handle it, we created 20 mismatch classes, each equal in size to the match class, by random sampling of the mismatch class Jamu pairs according to the bootstrap method [67]. Every equation was then iteratively evaluated by using those datasets as mismatch class data.
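The balancing step can be sketched as below. This is an illustration under assumptions: the source does not state whether the draws are made with or without replacement, so the sketch samples with replacement (the usual bootstrap convention); `random.sample` would be the without-replacement alternative. The function name and the scores are hypothetical.

```python
import random

def balanced_mismatch_samples(n_match, mismatch_scores, n_samples=20, seed=0):
    """Draw n_samples random subsets of the mismatch class, each with as many
    elements as the match class (sampling with replacement: an assumption)."""
    rng = random.Random(seed)
    return [rng.choices(mismatch_scores, k=n_match) for _ in range(n_samples)]

match_scores = [0.9, 0.7, 0.8]                       # hypothetical match-class scores
mismatch_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.15, 0.35]
samples = balanced_mismatch_samples(len(match_scores), mismatch_scores)
# Each of the 20 samples is paired with the full match class for one ROC curve,
# and the 20 resulting AUC values are averaged per equation.
```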
Our objective is to assess the capability of the equations to separate the Jamu pairs into match and mismatch efficacy classes based on their similarity coefficients using ROC analysis. In order to create an ROC curve corresponding to an equation, we need the distributions of match class and mismatch class Jamu pairs with respect to their similarity values calculated by the equation. We divided the range of the similarity coefficient into 100 equal intervals, and the lower limit of each interval was considered as a threshold. For every threshold, TP and FN were determined from the distribution of the match class, and FP and TN were determined from the distribution of the mismatch class. In our case, TP and FP are the numbers of Jamu pairs with a similarity value larger than or equal to the threshold, and FN and TN are the numbers of Jamu pairs with a similarity value smaller than the threshold. FPR and TPR were then calculated for every threshold using Eq. 82. We produced the ROC curve by plotting the resulting FPR on the x-axis and TPR on the y-axis. In perfect or ideal classification, the ROC curve follows the vertical line from (0,0) to (0,1) and then the horizontal line up to (1,1). In the case of random data, the ROC curve follows the diagonal line from (0,0) to (1,1). In the case of real data, the ROC curve usually lies above the diagonal. The point (0,1) is the optimum classification point, where FPR is zero and TPR is one, and hence it will be referred to as the ‘optimum point’. The performance of a classifier was assessed either by measuring the minimum distance from the optimum point to the curve or by measuring the AUC. The lower the minimum distance, the better the performance of the classifier. Conversely, the larger the AUC value, the better the performance of the classifier.
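The threshold sweep, the minimum distance to the optimum point, and the AUC can be sketched in a few lines. This is a simplified illustration that assumes scores already normalized to [0, 1]; the AUC is computed here with a plain trapezoidal rule rather than with the ROCR package used in the study, and all function names and scores are hypothetical.

```python
import math

def roc_points(match_scores, mismatch_scores, n_intervals=100):
    """Return (FPR, TPR) for thresholds at the lower limits of n equal intervals."""
    points = []
    for i in range(n_intervals):
        t = i / n_intervals
        tp = sum(1 for s in match_scores if s >= t)      # match pairs kept
        fn = len(match_scores) - tp                      # match pairs rejected
        fp = sum(1 for s in mismatch_scores if s >= t)   # mismatch pairs kept
        tn = len(mismatch_scores) - fp                   # mismatch pairs rejected
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

def min_distance_to_optimum(points):
    """Smallest Euclidean distance from any (FPR, TPR) point to (0, 1)."""
    return min(math.hypot(fpr, 1 - tpr) for fpr, tpr in points)

def auc_trapezoid(points):
    """Area under the curve through the ROC points, closed at (0,0) and (1,1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

match = [0.9, 0.8, 0.75, 0.4]        # hypothetical match-class similarities
mismatch = [0.1, 0.2, 0.5, 0.3]      # hypothetical mismatch-class similarities
curve = roc_points(match, mismatch)
print(min_distance_to_optimum(curve), auc_trapezoid(curve))
```

The minimum distance summarizes the single best threshold, whereas the AUC aggregates over all thresholds; the two criteria can therefore rank equations differently.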
In order to assess the effectiveness of an equation using the minimum distance, the ROC curve was generated by using all of the Jamu pairs from the match and mismatch efficacy classes. The Euclidean distance was used to measure the distance from the (0, 1) point to the (FPR, TPR) points for all 45 selected equations. In addition, we created 20 ROC curves for each equation, considering in each case the match class Jamu pairs and one of the 20 different mismatch class samples. Thus, we obtained 20 AUCs for each equation and averaged those values to determine the overall AUC corresponding to an equation. The ROCR package [72] was used to calculate the AUC values. Table 4 shows the results of the ROC analysis and the Kappa scores for Jamu data. The scatter plot of minimum distances and mean AUCs corresponding to the 45 equations for both datasets is shown in Fig. 4. Based on the scatter plot generated using Jamu data in Fig. 4a, the 45 equations are empirically divided into 4 groups (C1, C2, C3, and C4). The equations that perform well under both criteria were obtained in C1, which consists of Eqs. 48, 49, 54, 68, and 79. The Michael similarity (Eq. 68) produces the lowest minimum distance, and the highest AUC is obtained by the Forbes2 similarity (Eq. 48). The ROC curves generated using the Michael and Forbes2 similarities for all datasets are shown in Fig. 5. As expected, the ROC curves corresponding to all random datasets follow the diagonal line, and the curve corresponding to Jamu data lies above the diagonal. Most equations with the highest AUC values are similarity-measuring equations, and these equations belong to cluster III in Fig. 3. Among the dissimilarity-measuring equations, the Lance & Williams distance (Eq. 27) produces the highest AUC value.
We repeated our experiments for Kampo data following the same procedures. The results of the ROC analysis and Cohen’s Kappa using Kampo data are shown in Table 5. In addition, the plot of minimum distances against mean AUCs for Kampo data is shown in Fig. 4b. The remaining equations are clustered into 3 groups (C1, C2 and C3). The most suitable binary equations for classifying Kampo data were found in cluster C1, with the Tarwid similarity (Eq. 40) and the Variant of Correlation similarity (Eq. 79) producing the lowest minimum distance and the highest mean AUC, respectively. These differ from the top-ranking equations in the case of Jamu data. Only 5 of the top-10 well-performing equations for Jamu data match those for Kampo data, and in a different order. These results indicate that different datasets produce different rankings of equations and that there is no superior equation that performs well for all datasets [73]. Each binary similarity and dissimilarity equation has its own characteristics and fits a specific problem. Therefore, our proposed method can be used to choose the appropriate equations depending on the characteristics of the data to be analyzed.
In the case of Jamu and Kampo pairs, the negative match quantity d is much higher than the positive match quantity a and the absence mismatches b and c. One of our objectives is to understand the effect of d in calculating similarity/dissimilarity coefficients between herbal medicines. Among the equations that do not include d, the Simpson similarity (Eq. 45) and the Forbes1 similarity (Eq. 34) produce the lowest minimum distance in Jamu and Kampo data, respectively. Furthermore, the Derived Jaccard similarity (Eq. 78) and the McConnaughey equation (Eq. 39) produce the highest AUC in Jamu data and Kampo data, respectively. Out of the 79 equations in Table 1, 46 use d in their expressions. Interestingly, the equations that include d perform better in measuring similarity/dissimilarity in both datasets. The best performing equations corresponding to minimum distance and mean AUC for Jamu data are Eqs. 68 and 48, which include the negative match quantity d. Likewise, the best equations for the Kampo data (Eqs. 79 and 40) also include d. Moreover, the top-5 well-performing equations for both datasets include d. If we also consider another metric to rank classifier performance, i.e. Cohen’s Kappa, we find a consistent result: the top-5 equations with the largest Kappa scores also include d (Tables 4 and 5). This implies that the similarities between Jamu pairs and between Kampo pairs are influenced by the negative matches. This result supports the finding of Zhang et al. [20] that all possible matches, \( S_{ij} \) where \( i, j \in \{0,1\} \), should be considered for better classification results. Moreover, measuring the performance of binary similarity/dissimilarity equations using the AUC of the ROC curve is preferable to using the minimum distance, because the AUC considers all (FPR, TPR) points, not only the single point with minimum distance to the optimum point.
For further insight into the matter, we examined the performance of the equations for every disease class in Jamu data separately using the same approach. We created match and mismatch datasets for every disease class using all Jamu pairs. The match class consists of Jamu pairs with the same efficacy class, and the mismatch class consists of Jamu pairs with different efficacy classes in which one of the Jamu formulas has the same efficacy class as the match class. To measure the AUC of the ROC curve, we created 20 mismatch classes, each equal in size to the match class, by using the bootstrap method. Thus, we obtained 20 AUCs for each disease class and each equation, and we averaged those 20 values to determine the overall AUC corresponding to a disease class and an equation (Additional file 1: Table S1). Figure 6 shows the ROC curves for every disease class using Forbes2 similarity coefficients. The immune system disease class (E6) produces the highest AUC score and the highest average of AUCs (over all 45 equations). Moreover, the best classification is obtained for the immune system class, indicated by an arrow in Fig. 6, with an average recognition rate of 0.805. The relatively high recognition rate of the E6 class corresponds to our knowledge that diseases of the immune system are very specific and that the use of crude drugs for them is restricted compared to other disease classes. The minimum distance of an ROC curve from the optimum point (expressed by Eq. 83) indicates the difficulty of classification, i.e. the higher the minimum distance, the more difficult it is to achieve a successful classification. Therefore, a minimum distance close to zero implies that good classification of the data is possible.
In the classification of Jamu formulas concerning individual diseases, relatively low minimum distances were obtained for specific disease classes such as E6 and the urinary system (E13), which indicates that very specific types of medicinal plants are used to make such Jamu formulas. On the other hand, disease classes such as the digestive system (E3) and nutritional and metabolic diseases (E10) are caused by diverse factors, and therefore the corresponding Jamu formulas are made using diverse types of plants, resulting in relatively high minimum distances for these disease classes (Fig. 6).
Conclusions
Different binary similarity and dissimilarity measures yield different similarity/dissimilarity coefficients, which in turn cause differences in downstream analyses such as clustering. Hence, determining appropriate binary similarity and dissimilarity coefficients is an essential aspect of big data analysis in diverse areas of scientific research, including chemometrics and bioinformatics. In this study, we presented an organized way to select a suitable equation for studying the relationship between herbal medicine formulas in Indonesian Jamu and Japanese Kampo. We started by collecting 79 binary similarity and dissimilarity equations from the literature. In the early stages, we removed algebraically redundant equations and equations that produced invalid values or relatively similar coefficients when applied to our datasets. In addition, we eliminated some equations based on agglomerative hierarchical clustering because they were very closely related to other equations in the same cluster. Finally, we selected 45 unique equations that produced different coefficients for our analysis. ROC curve analysis was then performed to assess the capability of these equations to separate herbal medicine pairs having the same efficacy from pairs having different efficacies. The experimental results show that the binary similarity and dissimilarity measures that include the negative match quantity d in their expressions separate herbal medicine pairs better than the equations that exclude d. Moreover, we obtained different rankings of the binary equations for the different datasets, i.e. the Jamu and Kampo data. This result indicates that the selection of binary similarity and dissimilarity measures is data dependent, and that the measure should be chosen according to the data to be processed. For the Jamu data, the largest AUC value is obtained by the Forbes2 similarity. 
For the Kampo data, the Variant of Correlation similarity is recommended for classifying pairs into match and mismatch classes. The procedure followed in this work can also be used to find suitable binary similarity and dissimilarity measures under similar situations in other applications.
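To make the role of the negative match quantity d concrete, the sketch below counts the four OTU quantities for two binary vectors and compares one measure that excludes d (Jaccard) with one that includes it (simple matching). These two textbook coefficients are chosen only for illustration and are not the measures recommended above. The function names are hypothetical, and returning NaN for an undefined Jaccard value is our own convention.

```python
import math


def otu_counts(x, y):
    """Return (a, b, c, d) for two equal-length binary vectors:
    a = positive matches (1,1), b = (1,0) mismatches,
    c = (0,1) mismatches, d = negative matches (0,0)."""
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)
    b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)
    c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)
    d = sum(1 for i, j in zip(x, y) if i == 0 and j == 0)
    return a, b, c, d


def jaccard(x, y):
    """Jaccard similarity a / (a + b + c); ignores negative matches d."""
    a, b, c, _ = otu_counts(x, y)
    if a + b + c == 0:
        return math.nan  # undefined when neither vector contains a 1
    return a / (a + b + c)


def simple_matching(x, y):
    """Simple matching similarity (a + d) / (a + b + c + d); includes d."""
    a, b, c, d = otu_counts(x, y)
    return (a + d) / (a + b + c + d)


# Two hypothetical formulas described over six plants (1 = plant is used).
x = [1, 1, 0, 0, 0, 0]
y = [1, 0, 1, 0, 0, 0]
print(otu_counts(x, y))       # (1, 1, 1, 3)
print(jaccard(x, y))          # 1/3
print(simple_matching(x, y))  # 4/6
```

For sparse vectors such as plant-usage profiles, d is typically much larger than a, b and c, so measures that include d can rank the same pairs quite differently from measures that ignore it.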
Abbreviations
AUC: The Area Under the ROC Curve
D: Dissimilarity
FN: False Negative
FP: False Positive
FPR: False Positive Rate
NADFC: The National Agency of Drug and Food Control
NCBI: The National Center for Biotechnology Information
OTU: The Operational Taxonomic Unit
PR: Precision-Recall
ROC: The Receiver Operating Characteristic
S: Similarity
TCM: Traditional Chinese Medicine
TN: True Negative
TP: True Positive
TPR: True Positive Rate
Acknowledgements
Not applicable.
Funding
This work was supported by the National Bioscience Database Center in Japan; the Ministry of Education, Culture, Sports, Science, and Technology of Japan; the US National Science Foundation and Japan Science and Technology Agency [Strategic International Collaborative Research Program ‘Metabolomics for a Low Carbon Society’]; and the NAIST Big Data Project.
Availability of data and materials
The simulated dataset(s) supporting the conclusions of this article are available in KNApSAcK Family Databases (http://kanaya.naist.jp/KNApSAcK_Family/).
Authors’ contributions
SW conducted the primary investigation, carried out the experiments, developed the bmeasures package, and drafted the manuscript; SW, MA and SK designed the proposed method; FA provided Jamu-Species relations; MA and IB aided in the manuscript development; LD and SK supervised the study and contributed to the manuscript. All authors read and approved the manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Additional file
Additional file 1: Table S1.
The mean of AUCs between equations and disease classes in Jamu data. (XLSX 50 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Wijaya, S.H., Afendi, F.M., Batubara, I. et al. Finding an appropriate equation to measure similarity between binary vectors: case studies on Indonesian and Japanese herbal medicines. BMC Bioinformatics 17, 520 (2016). https://doi.org/10.1186/s12859-016-1392-z
DOI: https://doi.org/10.1186/s12859-016-1392-z
Keywords
 Binary data
 Similarity measures
 Distance metric
 Jamu
 Kampo
 ROC curve
 Hierarchical clustering