An unsupervised partition method based on association delineated revised mutual information
BMC Bioinformatics volume 10, Article number: S63 (2009)
The syndrome is the basic pathological unit and the key concept in traditional Chinese medicine (TCM), and herbal remedies are prescribed according to the syndrome a patient presents. Nevertheless, few studies have investigated how many syndromes there are and what they are. A correlative measure based on mutual information can capture arbitrary statistical dependences between variables, whether discrete or continuous.
We present a revised version of mutual information that discriminates between positive and negative association. The entropy partition method discovers effective patterns in patient data and rat data in a self-organized way. The super-additivity of the mutual-information-based cluster measure is proved, and the concept of N-class association is introduced into our model to reduce computational complexity. The algorithm is validated using the patient data and the corresponding diagnostic data. The partition results for the patient data indicate that the algorithm achieves a high sensitivity of 96.48% and that each classified pattern is of clinical significance. The partition results for the rat data show the inherent relationship between vascular endothelial function related parameters and neuro-endocrine-immune (NEI) network related parameters.
We therefore conclude that the algorithm provides an effective solution for the patient and rat data problems in the context of traditional Chinese medicine.
Traditional Chinese medicine (TCM) is used by most people in China as a complementary therapy, since herbal remedies have the advantage over Western medicine of fewer side effects and lower cost. TCM has always been regarded as a key component of the 5000-year history of Chinese civilization. In ancient times, before modern medicine was born, people all over the world relied mainly on three traditional medicines, of which only TCM is still practiced today; Chaldaic and ancient Hindu medicine survive only in extremely rare documents attesting that they ever existed. TCM, whose core concept is the syndrome, is on the way to modernization and aims to be accepted, like Western medicine, as a science [1–3].
The syndrome is the basic pathological unit and the key concept in TCM theory, since herbal remedies are prescribed according to the syndrome or syndromes a patient presents. The identification and determination of syndromes is therefore of great importance to TCM physicians. Nevertheless, few publications are dedicated to this issue.
In information theory, entropy is a metric that measures the uncertainty of random variables. The mutual information (MI) of two random variables measures the mutual dependence of the two variables. It has been applied in many fields, in which researchers treat it as a divergence or distance between two distributions [5–7]. The advantage of mutual information over correlation methods has been discussed in the literature. In this paper, we propose a novel unsupervised data mining model in which we treat mutual information as an association measure of two variables. We try to discover syndromes in chronic renal failure (CRF) data without supervision and verify these syndromes clinically to test the performance of our model. Based on revised mutual information, we propose an unsupervised pattern discovery algorithm that allocates significantly associated symptoms to patterns in a self-organized way. Using diagnostic patient data, each pattern is verified to have clinical meaning. Using rat data, we also apply this method to find the inherent relationship between vascular endothelial function related parameters and NEI network related parameters.
Correlative measure based on mutual information
Correlative measure for discrete variables
Mutual information between two discrete variables is formally defined as:
MI(X, Y) = H(X) + H(Y) - H(X ∪ Y) (1)
where H(X) denotes the Shannon entropy of variable X and H(X ∪ Y) represents the joint entropy of variables X and Y. Formally, suppose that X and Y are both categorical variables; then H(X) and H(X ∪ Y) are given by:
H(X) = -\sum_{i=1}^{m} \frac{n_i}{N} \log \frac{n_i}{N} (2)
where n_i denotes the number of occurrences of the i-th category of X, which has m categories, and N is the total number of samples of X, and
H(X \cup Y) = -\sum_{i=1}^{m} \sum_{j=1}^{l} \frac{n_{ij}}{N} \log \frac{n_{ij}}{N} (3)
where n_ij represents the number of simultaneous occurrences of the i-th category of X, which has m categories, and the j-th category of Y, which has l categories.
Mutual information is widely used to measure the similarity between the distributions of two variables and is taken here as an association measure of two variables. MI as defined above is a measure on a set consisting of two variables; if the set is composed of more than two variables, the definition of the MI measure is rewritten as follows:
MI(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i) - H(X_1 \cup X_2 \cup \cdots \cup X_n) (4)
Mutual information has an interesting property: super-additivity. We introduce the concept of super-additivity and give a mathematical proof of it. Note, however, that the super-additivity of mutual information contributes only marginally to the validation of the algorithm here.
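The discrete correlative measure defined above can be estimated directly from category counts. The following is a minimal sketch, not the authors' implementation: it assumes natural logarithms (the base is not fixed by the text), and the function names and sample data are hypothetical.

```python
import math
from collections import Counter


def entropy(*columns):
    """Shannon entropy (in nats) of one or more categorical columns taken jointly."""
    rows = list(zip(*columns))   # each row is one joint category
    total = len(rows)
    counts = Counter(rows)       # n_i for one column, n_ij for two columns
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def mutual_information(*columns):
    """Sum of the single-variable entropies minus the joint entropy.

    For two columns this is equation (1); for more columns it is the
    multi-variable extension of the same definition.
    """
    return sum(entropy(c) for c in columns) - entropy(*columns)


# Hypothetical symptom grades (0 = none, 1 = light, 2 = middle, 3 = severe).
x = [0, 0, 1, 1, 2, 2, 3, 3]
y = [0, 0, 1, 1, 2, 2, 3, 3]   # identical to x
z = [0, 1, 0, 1, 0, 1, 0, 1]   # independent of x in this sample

print(mutual_information(x, y))  # equals H(x), since y duplicates x
print(mutual_information(x, z))  # zero up to rounding
```

With counts this small the estimate is only illustrative; on real survey data the entropies are computed from the observed frequencies of the four grades in exactly the same way.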
Super-additivity of correlative measure
Let us consider a nonempty finite set X and a set-family E(X) consisting of its subsets. Let P be a set-function defined on E(X) with the properties:
(i) P(A) ≥ 0, ∀ A ∈ E(X)
(ii) P(∅) = 0
If, for arbitrary nonempty finite sets S_i ∈ E(X), S_j ∈ E(X), i ≠ j, with S_i ∩ S_j = ∅, we have
P(S_i ∪ S_j) ≥ P(S_i) + P(S_j) (5)
This set-function P is called super-additive.
One of the important properties of the correlative measure is its super-additivity. In other words, the correlative measure of a finite set is no less than the sum of the correlative measures of the disjoint subsets into which the set is partitioned.
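Theorem. The correlative measure MI(s_1, s_2, ⋯, s_m) is finitely super-additive and unique.
Proof. The definition of MI ensures the uniqueness of the measure. We now prove its super-additivity. Suppose that the set X is partitioned into m subsets s_1, s_2, ⋯, s_m satisfying, for arbitrary i, j (i ≠ j), s_i ≠ ∅, s_j ≠ ∅, s_i ∩ s_j = ∅ and s_1 ∪ s_2 ∪ ⋯ ∪ s_m = X. For a set s of variables, let H(s) denote the joint entropy of the variables in s and MI(s) their correlative measure.
We only need to prove
MI(X) \geq \sum_{i=1}^{m} MI(s_i) (6)
By the definition of MI, we have
\sum_{i=1}^{m} MI(s_i) = \sum_{i=1}^{m} \left[ \sum_{x \in s_i} H(x) - H(s_i) \right] = \sum_{x \in X} H(x) - \sum_{i=1}^{m} H(s_i) (7)
MI(X) = \sum_{x \in X} H(x) - H(X) (8)
Subtracting (7) from (8), we have
MI(X) - \sum_{i=1}^{m} MI(s_i) = \sum_{i=1}^{m} H(s_i) - H(X) \geq 0 (9)
where the inequality holds because the joint entropy of a set of variables never exceeds the sum of the joint entropies of the subsets that partition it.
The proof is complete.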
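The super-additivity property can also be checked numerically. The sketch below is only an illustration under stated assumptions: the data are synthetic (five hypothetical four-level variables generated as noisy copies of a common base variable), natural logarithms are used, and the helper names are not from the paper.

```python
import math
import random
from collections import Counter


def joint_entropy(columns):
    """Joint Shannon entropy (in nats) of a list of categorical columns."""
    rows = list(zip(*columns))
    total = len(rows)
    return -sum((n / total) * math.log(n / total) for n in Counter(rows).values())


def correlative_measure(columns):
    """MI of a set of variables: sum of single-variable entropies minus joint entropy."""
    return sum(joint_entropy([c]) for c in columns) - joint_entropy(columns)


random.seed(0)  # fixed seed so the sketch is reproducible
base = [random.randint(0, 3) for _ in range(200)]
# Five hypothetical 4-level variables, each a noisy copy of `base`.
data = [[b if random.random() < 0.7 else random.randint(0, 3) for b in base]
        for _ in range(5)]

whole = correlative_measure(data)                                       # MI of the full set
parts = correlative_measure(data[:2]) + correlative_measure(data[2:])   # two-block partition
print(whole, parts, whole >= parts)
```

The comparison holds for any partition of the variables, because the difference between the two sides is the sum of the block entropies minus the joint entropy of the whole set, which cannot be negative.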
A revised version of correlative measure
Although many merits of MI have been reported, MI also suffers from some defects when dealing with such data. First, MI-based association between two variables is symmetric, whereas the relation between two symptoms is usually asymmetric; indeed, symmetry is a special case of asymmetry. Second, the MI of two variables is non-negative but has no fixed upper bound, which makes it difficult to judge the strength of a relation from an isolated association value. An ameliorated version of MI can fill the gap. We use the normalized form of association between two variables, μ, defined as:
By this definition, the relation between two variables is asymmetric because the Shannon entropies of the two variables usually differ. Additionally, according to information theory, MI(X, Y) is non-negative and its upper bound is the minimum of H(X) and H(Y); therefore, the new association μ(X, Y) takes values between 0 and 1, which is to some extent similar to correlation in statistical theory.
Furthermore, by information theory, the form of MI can be recast as:
MI(X, Y) = H(X) - H(X|Y) (11)
where H(X|Y) denotes the conditional entropy, which measures the remaining uncertainty of X once Y is known. That is to say, MI(X, Y) represents the reduction in uncertainty about X obtained by knowing Y. Consequently, the association is large both for two symptoms that nearly always occur together and for two symptoms that are nearly always opposite, so the association defined by MI mixes positive and negative association. We present an ameliorated version of MI to distinguish positive association from negative association.
The frequency with which X and Y both take nonzero categories is denoted Pofr(X, Y); it is this positive frequency of X and Y that separates positive association from negative association. We redefine the form of MI as:
where θ is a pre-assigned non-negative quantity, called the threshold in this paper. When θ = 0, the ameliorated version of MI reduces to the traditional form of MI, so the ameliorated MI is an extended version of traditional MI. b is a real number greater than 1 and can be seen as a penalty coefficient. A proper setting of the two parameters keeps the association of positively associated symptoms invariant, while the association of negatively associated symptoms is lessened, even to zero.
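The role of the positive frequency, the threshold θ and the penalty coefficient b can be illustrated with a short sketch. Only Pofr(X, Y) follows directly from the text. The piecewise form used in `revised_mi` (plain MI when Pofr ≥ θ, otherwise the joint entropy weighted by b and the result floored at zero) is an assumption chosen to match the verbal description, not a reproduction of the paper's formula, and all names are hypothetical.

```python
import math
from collections import Counter


def entropy(*columns):
    """Shannon entropy (in nats) of one or more categorical columns taken jointly."""
    rows = list(zip(*columns))
    total = len(rows)
    return -sum((n / total) * math.log(n / total) for n in Counter(rows).values())


def positive_frequency(x, y):
    """Pofr(X, Y): fraction of samples in which both variables are nonzero."""
    return sum(1 for a, b in zip(x, y) if a != 0 and b != 0) / len(x)


def revised_mi(x, y, theta, b=2.0):
    """ASSUMED piecewise form of the revised MI (see the lead-in).

    Pofr >= theta : traditional MI, left unchanged.
    Pofr <  theta : joint entropy weighted by b > 1, result floored at 0.
    """
    hx, hy, hxy = entropy(x), entropy(y), entropy(x, y)
    if positive_frequency(x, y) >= theta:
        return hx + hy - hxy
    return max(0.0, hx + hy - b * hxy)


# Hypothetical binary symptoms: `same` co-occurs with `a`, `opposite` never does.
a = [1, 1, 1, 0, 0, 0, 0, 0]
same = [1, 1, 1, 0, 0, 0, 0, 0]
opposite = [0, 0, 0, 1, 1, 1, 1, 1]

print(revised_mi(a, same, theta=0.1))      # positive association is kept
print(revised_mi(a, opposite, theta=0.1))  # negative association is suppressed
print(revised_mi(a, opposite, theta=0.0))  # theta = 0 gives back traditional MI
```

In this toy sample the traditional MI of `a` with `same` and with `opposite` is identical, which is exactly the ambiguity the threshold on the positive frequency is meant to remove.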
Correlative measure for continuous variables
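Let us consider two continuous variables. Based on the above definitions, we now derive the form of the correlative measure for two continuous variables that follow a normal distribution.
Let two continuous variables X and Y follow normal distributions; their PDFs are
f(x) = \frac{1}{\sqrt{2\pi}\,\sigma_x} \exp\left( -\frac{(x - E_x)^2}{2\sigma_x^2} \right), \quad f(y) = \frac{1}{\sqrt{2\pi}\,\sigma_y} \exp\left( -\frac{(y - E_y)^2}{2\sigma_y^2} \right)
where -∞ < x < ∞, -∞ < y < ∞, E_x and E_y are the mathematical expectations of X and Y, and σ_x and σ_y are the standard deviations of X and Y.
The joint probability density function of X and Y is expressed as
f(x, y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}} \exp\left( -\frac{1}{2(1 - \rho^2)} \left[ \frac{(x - E_x)^2}{\sigma_x^2} - \frac{2\rho (x - E_x)(y - E_y)}{\sigma_x \sigma_y} + \frac{(y - E_y)^2}{\sigma_y^2} \right] \right)
where ρ is the correlation coefficient of X and Y.
The entropy of X is
H(X) = -\int_{-\infty}^{\infty} f(x) \ln f(x) \, dx = \ln\left( \sqrt{2\pi e}\, \sigma_x \right)
In a similar way, the entropy of Y is
H(Y) = \ln\left( \sqrt{2\pi e}\, \sigma_y \right)
The joint entropy of X and Y is
H(X \cup Y) = -\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y) \ln f(x, y) \, dx \, dy
Let (x - E_x)/σ_x = u and (y - E_y)/σ_y = v; then
H(X \cup Y) = \ln\left( 2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2} \right) + \frac{E[u^2] - 2\rho E[uv] + E[v^2]}{2(1 - \rho^2)} = \ln\left( 2\pi e\, \sigma_x \sigma_y \sqrt{1 - \rho^2} \right)
since E[u^2] = E[v^2] = 1 and E[uv] = ρ.
The correlative measure between X and Y is therefore
MI(X, Y) = H(X) + H(Y) - H(X \cup Y) = -\frac{1}{2} \ln\left( 1 - \rho^2 \right)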
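For jointly normal variables the entropies have standard closed forms, and the correlative measure reduces to -½ ln(1 - ρ²), which depends only on the correlation coefficient. The sketch below checks this identity numerically and shows how the measure could be estimated from samples; the function names and the sample values are illustrative assumptions, and plugging in the sample correlation is only one possible estimator.

```python
import math


def gaussian_entropy(sigma):
    """Differential entropy (nats) of a normal variable with standard deviation sigma."""
    return math.log(sigma * math.sqrt(2 * math.pi * math.e))


def gaussian_joint_entropy(sigma_x, sigma_y, rho):
    """Differential entropy (nats) of a bivariate normal with correlation rho."""
    return math.log(2 * math.pi * math.e * sigma_x * sigma_y * math.sqrt(1 - rho ** 2))


def gaussian_mi(rho):
    """Correlative measure of two jointly normal variables: -0.5 * ln(1 - rho^2)."""
    return -0.5 * math.log(1 - rho ** 2)


def sample_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


# The closed form agrees with H(X) + H(Y) - H(X, Y) for any standard deviations.
lhs = gaussian_entropy(2.0) + gaussian_entropy(3.0) - gaussian_joint_entropy(2.0, 3.0, 0.5)
print(lhs, gaussian_mi(0.5))

# Hypothetical measurements of two parameters on the same animals.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.4, 3.8, 5.1]
print(gaussian_mi(sample_correlation(xs, ys)))
```

Because the measure grows without bound as |ρ| approaches 1, perfectly collinear samples have to be handled separately in practice.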
Entropy partition method
Once association for each pair (every two variables) is acquired, we propose a self-organized algorithm to automatically discover the patterns. The algorithm can not only cluster, but also make some variables appear in some different patterns. In this section, we use three subsections to introduce the algorithm. In the first subsection, we introduce the concept of "Relative" set. Based on this, the pattern discovery algorithm is proposed in the second subsection. The last subsection is devoted to presenting an n-class association concept to back up the idea of the algorithm.
For a specific variable X, we attach a set R(X) consisting of the N variables whose associations with X are the largest. Each variable in the set can be regarded as a "Relative" of X, while variables that do not belong to the set are considered irrelevant to X; we therefore call R(X) the "Relative" set of X. The "Relative" sets of all k variables can be represented by a k × N matrix, on which the pattern discovery algorithm is based.
A pair of variables X and Y is defined to be significantly associated if and only if X belongs to the "Relative" set of Y (X ∈ R(Y)) and vice versa (Y ∈ R(X)). It is convenient to extend this definition to a set with multiple variables: a set is significantly associated if and only if each pair of its variables is significantly associated. A pattern is defined as a significantly associated set with a maximal number of variables. All such sets constitute the hidden patterns in the data. A pattern therefore has to meet three criteria: (1) the number of variables in the set is no less than 2; (2) each pair of variables belonging to the set is significantly associated; and (3) no variable outside the set can be added while keeping the set significantly associated, i.e. the number of variables within the set is maximal.
To discover all patterns hidden in the data, we propose an unsupervised algorithm that can be implemented in three steps.
Step 1. Based on the k × N matrix, all significantly associated pairs are collected and stored in an M_2 × 2 matrix, where M_2 is the number of significantly associated pairs.
Step 2. Based on the M_2 × 2 matrix, all significantly associated sets of three variables are collected and stored in an M_3 × 3 matrix, where M_3 is the number of such sets. Similarly, if there exist significantly associated sets of m variables, they are stored in an M_m × m matrix, where M_m is the number of such sets and m is the number of variables in each. Obviously m ≤ N, where N is the number of "Relatives". Since N is bounded, the algorithm converges.
Step 3. Find the maximal m. The M_m × m matrix holds M_m patterns with m variables. A set of m - 1 variables that is contained in one of these patterns is certainly not a pattern, since it does not fulfil the third criterion of a pattern. All such sets are removed from the M_{m-1} × (m - 1) matrix; the remaining sets are patterns with m - 1 variables. Proceeding in the same way, all patterns can be discovered.
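The three steps above can be sketched as follows. This is an illustrative reading of the procedure rather than the authors' code: the association matrix is hypothetical, R(X) is assumed to exclude X itself, and ties in association are broken by variable index.

```python
from itertools import combinations


def relative_sets(assoc, n_relatives):
    """R(X) for every variable: the n_relatives variables most associated with X.

    assoc[x][y] is the (possibly asymmetric) association of y with respect to x.
    """
    k = len(assoc)
    rel = []
    for x in range(k):
        others = sorted((y for y in range(k) if y != x),
                        key=lambda y: assoc[x][y], reverse=True)
        rel.append(set(others[:n_relatives]))
    return rel


def discover_patterns(assoc, n_relatives):
    """Return all maximal significantly associated sets as sorted index lists."""
    rel = relative_sets(assoc, n_relatives)
    k = len(assoc)

    def significant(x, y):
        # Mutual membership in each other's "Relative" set.
        return y in rel[x] and x in rel[y]

    # Step 1: all significantly associated pairs.
    level = {frozenset(p) for p in combinations(range(k), 2) if significant(*p)}
    levels = []
    # Step 2: grow sets by one variable while every pair stays significant.
    while level:
        levels.append(level)
        level = {s | {v} for s in level for v in range(k)
                 if v not in s and all(significant(v, u) for u in s)}
    # Step 3: keep only maximal sets (drop any set contained in a larger one).
    found = [s for lv in levels for s in lv]
    return sorted(sorted(s) for s in found if not any(s < t for t in found))


# Hypothetical association matrix for five variables.
assoc = [
    [0.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.0, 0.8],
    [0.1, 0.1, 0.1, 0.8, 0.0],
]
print(discover_patterns(assoc, n_relatives=2))  # [[0, 1, 2], [3, 4]]
```

Because only maximal sets are kept, a variable may still appear in several patterns whenever it belongs to more than one maximal set, which matches the overlapping behaviour described in the text.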
From the pattern discovery method introduced above, we know that the association has to be computed C(n, 2) times for every two symptoms out of n symptoms, and that this number rises to C(n, 3) for every three symptoms, where C(n, k) denotes the number of ways to choose k symptoms out of n. In TCM theory there are many cases in which several (for example 5 or 6) symptoms combine to describe a syndrome. Given the unsupervised character of the data, the number of symptoms to be examined together should therefore be 2 to 3 times the actual 5 to 6, namely 10 to 18. For the TCM data of this paper, the number of computations reaches C(72, 18), which is of the order of 10^16 and exceeds the capacity of any general-purpose computer at present. For these reasons, we introduce the concept of n-class association. Formally, for n variables, if any n - 1 of the n variables are closely associated, we say that these n variables are of n-class association. In particular, when n = 2, this is just the association between two variables.
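Counting the subsets directly shows how quickly the exhaustive approach grows. The sketch uses the 72 symptoms of the CRF data; taking 18, the upper end of the 10 to 18 range, as the subset size for the largest count is an assumption made for illustration.

```python
import math

n_symptoms = 72                      # number of CRF symptoms in this paper's data
pairs = math.comb(n_symptoms, 2)     # subsets of two symptoms
triples = math.comb(n_symptoms, 3)   # subsets of three symptoms
largest = math.comb(n_symptoms, 18)  # subsets of 18 symptoms (assumed upper end)

print(pairs)             # 2556
print(triples)           # 59640
print(f"{largest:.1e}")  # on the order of 10**16
```

The pairwise count stays in the thousands, which is why reducing every decision to pairwise associations makes the search tractable.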
Based on the concept of n-class association, to judge whether n variables are associated we need only judge whether any n - 1 of them are associated. Theoretically, this means that we only need to compute the association between two variables, which significantly decreases the computational complexity so that the pattern discovery algorithm can be applied to large-scale data. In fact, the n-class association concept can be justified by mathematical induction.
For n = 2, the proposition is obvious. Suppose that n variables are closely associated; our purpose is to prove that when an (n + 1)-th variable is closely associated with these n variables, all n + 1 variables are closely associated.
We know that
MI(X_1, X_2, \ldots, X_{n+1}) = \sum_{i=1}^{n+1} H(X_i) - H(X_1 \cup X_2 \cup \cdots \cup X_{n+1})
then
MI(X_1, \ldots, X_{n+1}) = \left[ \sum_{i=1}^{n} H(X_i) - H(X_1 \cup \cdots \cup X_n) \right] + \left[ H(X_1 \cup \cdots \cup X_n) + H(X_{n+1}) - H(X_1 \cup \cdots \cup X_{n+1}) \right] = MI(X_1, \ldots, X_n) + MI(\{X_1, \ldots, X_n\}, X_{n+1})
This tells us that the correlative measure among X_1, X_2, ⋯, X_n, X_{n+1} is composed of the association measure among X_1, X_2, ⋯, X_n and the measure between the subset {X_1, X_2, ⋯, X_n} and X_{n+1}. It means that if an (n + 1)-th variable is closely associated with the other n variables, these n + 1 variables are closely associated. The n-class association concept thus greatly decreases the computational complexity of the pattern discovery algorithm.
To validate the algorithm and to explain the parameter choices, we must take into account the objective (diagnostic) data that correspond to the unsupervised data.
In supervised learning, an algorithm is validated by estimating three measures of the classification results: sensitivity, specificity and accuracy. In the unsupervised setting considered here, validation must be done in a slightly different way. We summarize it in the following three steps.
Step 1. Each pattern S is traced back to the unsupervised data: if all variables of the pattern appear simultaneously on a patient (their values are all non-zero), the serial number of that patient is recorded. We collect all these serial numbers and record their total number as L_S. The serial numbers are stored in a vector with L_S dimensions.
Step 2. Tracing this vector to the syndrome data, we obtain L_S vectors with 9 dimensions, each dimension encoding one syndrome. The L_S vectors are added one by one to generate a new vector (ws_i, i = 1, 2, ..., 9), where i indexes the i-th syndrome and ws_i is the number of patients, among the L_S patients, who are diagnosed with that syndrome. Obviously, ws_i ≤ L_S. It is easy to find the maximal entry of this vector, denoted ws_{i_max}; we record it together with the corresponding syndrome i_max.
Step 3. We define the sensitivity T_S of the pattern S as T_S = ws_{i_max} / L_S. The sensitivity of the algorithm, denoted T, is calculated by summing the sensitivities of all patterns and averaging, i.e. T = (1/P) Σ_S T_S, where P denotes the number of patterns.
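The validation steps can be sketched in a few lines. The sketch assumes that a pattern's sensitivity is the largest syndrome count among the matched patients divided by the number of matched patients, and that the algorithm's sensitivity is the mean over patterns. The data layout, the function names, the toy data and the handling of a pattern that matches no patient are all assumptions of this illustration.

```python
def pattern_sensitivity(pattern, symptoms, syndromes):
    """Return (T_S, i_max, L_S) for one pattern.

    pattern   : list of symptom column indices
    symptoms  : per-patient lists of symptom grades (0 means absent)
    syndromes : per-patient 0/1 lists, one entry per syndrome
    """
    # Step 1: patients on whom every symptom of the pattern is present.
    matched = [p for p, row in enumerate(symptoms)
               if all(row[s] != 0 for s in pattern)]
    l_s = len(matched)
    if l_s == 0:
        return 0.0, None, 0  # assumed convention for a pattern matching nobody
    # Step 2: add up the syndrome vectors of the matched patients.
    n_syndromes = len(syndromes[0])
    ws = [sum(syndromes[p][i] for p in matched) for i in range(n_syndromes)]
    i_max = max(range(n_syndromes), key=lambda i: ws[i])
    # Step 3: share of matched patients diagnosed with the dominant syndrome.
    return ws[i_max] / l_s, i_max, l_s


def algorithm_sensitivity(patterns, symptoms, syndromes):
    """T: mean of the pattern sensitivities."""
    values = [pattern_sensitivity(p, symptoms, syndromes)[0] for p in patterns]
    return sum(values) / len(values)


# Hypothetical toy data: 4 patients, 3 symptoms, 2 syndromes.
symptoms = [[1, 2, 0], [1, 1, 0], [2, 3, 1], [0, 0, 2]]
syndromes = [[1, 0], [1, 0], [0, 1], [0, 1]]

print(pattern_sensitivity([0, 1], symptoms, syndromes))   # three patients match, two share syndrome 0
print(algorithm_sensitivity([[0, 1], [1, 2]], symptoms, syndromes))
```

Because a patient may carry several syndromes at once, the syndrome counts of a pattern can sum to more than the number of matched patients, but each individual count is still bounded by it.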
Relation between sensitivity of algorithm and threshold
For a given clinical data set, the number of patterns generated by the algorithm, denoted P above, is determined only by the number of "Relatives" N and the threshold θ, i.e., P = f(N, θ). We now reconsider the form of the sensitivity of the algorithm T:
T = \frac{1}{P} \sum_{S} \frac{ws_{i_{max}}}{L_S}
where ws_{i_max}, the largest syndrome count of pattern S, and L_S are fixed by the data once the patterns are given, so T is determined by N and θ. In clinical practice, a pattern with 3 or 4 variables is the most suitable for being diagnosed as a syndrome, so N is chosen to be no less than 4.
Results and discussion
Discrete data example
A syndrome is diagnosed according to combinations of symptoms. As shown in Table 1, we chose 72 symptoms that are closely related to CRF. Pulse information was not included because of its poor consistency during the survey. The data were collected at six clinical centers located in six provinces of the same demographic area over the same period, from October 2005 to March 2006; a total of 601 patients suffering from CRF were surveyed.
A case had to meet four conditions strictly to be included in the data: (1) a diagnosis of CRF according to the diagnostic criterion, with the state of illness classified as stage 3, 4 or 5; (2) no dialysis therapy for a month before the survey; (3) age between 18 and 65 years; and (4) agreement to sign the informed consent. In addition, the survey applied three exclusion criteria.
Patients who, besides chronic kidney function failure, also suffer from intercurrent diseases such as serious respiratory, cardiovascular, cerebrovascular, digestive or blood system diseases.
Women who are pregnant or lactating.
Patients with symptoms produced by drug therapy.
Every case comprises 72 symptoms together with the basic information of the subject. The frequencies of the 72 symptoms are listed in Table 1. Each variable (symptom) has four categories: none, light, middle and severe, represented by 0, 1, 2 and 3, respectively. The latter three categories mean that the symptom has appeared and has been graded as light, middle or severe by clinical doctors, who were strictly and uniformly trained to reach high consistency.
The CRF patients recruited here were clinically diagnosed by TCM physicians in order to receive herbal treatment. We collected these diagnostic data (also called syndrome data) to validate the unsupervised pattern discovery algorithm. The data comprise nine syndromes. The names and frequencies of the syndromes are shown in Table 2 in descending order of frequency. The data are represented by a matrix in which each row represents an observation and each column a syndrome. If a patient is diagnosed with one of the nine syndromes, the corresponding column of the matrix is set to 1; otherwise it is set to 0. If a CRF patient is diagnosed with two or more syndromes, all the corresponding columns are set to 1.
The algorithm has three parameters to be adjusted. First, three or four symptoms usually constitute a syndrome, whereas in clinical application too many symptoms (for example eight) may confuse TCM physicians and lead to overly complex results; we therefore set the number of "Relatives" of each variable, denoted N, to 5. Second, we set the threshold to θ = 40/601 ≈ 6.66%; this choice is based on the validation steps of the algorithm. Third, the penalty coefficient b is set to 2, which separates positive and negative associations in our model.
Partition result and discussion
As shown in Part A of Table 3, one pattern has four symptoms and the other 15 patterns comprise three symptoms each. The 35 patterns with two symptoms are not listed because of their minor clinical contribution: it is very hard to diagnose a syndrome from two symptoms in clinical practice. Here, a pattern including more than two symptoms is called a clinically effective pattern.
Here, we investigate the relation between the sensitivity of the algorithm T and the threshold θ. As shown in Figure 1, the sensitivity of the algorithm varies with the threshold. The optimal threshold is 40/601 and the corresponding largest sensitivity is 0.9648, which is higher than any value reported in the literature so far. When the threshold is 0, the revised MI reduces to the traditional MI. Figure 1 shows that the ameliorated version of MI performs better than traditional MI, since the sensitivities obtained with θ > 0 are larger than that obtained with θ = 0. The larger the sensitivity of the algorithm, the better the result accords with clinical practice. The number in brackets next to each point is the number of effective patterns discovered at the corresponding threshold setting. Figure 2 shows the distribution of each pattern in the patient data and its corresponding sensitivity. The sensitivity of a pattern is defined in Step 3 of the validation steps.
Continuous data example
For continuous variables, we use the neuro-endocrine-immune data collected from 400 Wistar rats. In 1977 Besedovsky proposed the "immune-neuro-endocrine network" theory, which has since been broadly studied and has rapidly become an important topic in medicine and biology. More and more evidence indicates that there is not only one large loop among these three systems but also direct, bidirectional interaction between each pair of them [14–16].
The NEI data of the 400 Wistar rats were collected as follows:
This research focuses on the high-risk factors and pathogenesis of vascular disease and takes vascular endothelial function as its point of entry.
The 400 Wistar rats were divided into 6 groups by a randomized single-blind method: normal group; basic model group; composite model group; ginseng intervention group; double-ginseng intervention group; and compound powder intervention group.
The rats of the basic model group were fed a fixed ration of hyperhomocysteinemic feed to induce endothelial dysfunction.
Based on the basic model group, the rats of the composite model group were additionally forced to swim with a fixed load at fixed times.
Based on the composite model group, the rats of the ginseng intervention group were treated with a fixed ration of ginseng.
Based on the composite model group, the rats of the double-ginseng intervention group were treated with a fixed ration of double-ginseng.
Based on the composite model group, the rats of the compound powder intervention group were treated with a fixed ration of compound powder.
This research measures 5 vascular endothelial function related parameters and 21 NEI network related parameters. Only a few parameters (usually 4 at most) can be measured on each rat because of its limited blood volume. All the parameters are shown in Tables 4 and 5.
Partition result and discussion
This research focuses on finding whether there are rules relating vascular endothelial function related parameters and NEI network related parameters. Following the entropy partition algorithm, we take the NEI network related parameters as the initial set X and the vascular endothelial function related parameters as the object set Y. Both X and Y are assumed to be normally distributed. The corresponding parameters σ and μ can be estimated with a Bayesian method. After determining the density functions of X and Y, we can calculate their correlative measure according to (12) and then apply the entropy partition to the data. According to N-class association, we obtain each parameter's "Relative" set and thus the final output set S.
The partition result is shown in Table 6. For each vascular endothelial function related parameter, we can find the most relevant NEI network related parameters. The relevance of the parameters decreases from top to bottom.
We take pattern A in Table 6 as an example. Figure 3 describes the rules relating the two above-mentioned parameter sets. These figures show the NEI related parameters most relevant to each vascular endothelial function related parameter. In this way, we can find some interesting phenomena and rules.
Abnormality and disorder in the NEI network will inevitably lead to corresponding endothelial dysfunction. Taking the composite model group as an example, there are subsystems consisting of some endothelial function related parameters and some NEI network related parameters, and these subsystems follow certain rules over the remaining time.
When the rats are treated with ginseng, double-ginseng or compound powder, the rules in these subsystems change to varying degrees. After intervention with compound powder, the rules almost disappear and the trend appears consistent with that of the normal group.
This shows that the hyperhomocysteinemic feed in the basic model group and the load swimming in the composite model group are the particular causes of the above-mentioned rules.
In this paper, we presented an unsupervised partition method based on association delineated by revised mutual information. A revised version of mutual information was developed to discriminate between positive and negative association. Based on our model, an unsupervised pattern discovery algorithm was proposed to allocate significantly associated symptoms to several patterns. The algorithm can not only cluster but also let some symptoms appear in different patterns, which is consistent with TCM diagnosis. Using the syndrome data, the unsupervised algorithm was validated, and the sensitivity of the algorithm was defined as a performance measure to evaluate the discovered patterns. The algorithm reaches a maximal sensitivity of 96.48%, which also indicates that the CRF data are of good quality. Furthermore, the results show that, under proper parameter settings, the algorithm discovered 16 patterns in CRF patients, each of which can be automatically diagnosed as a syndrome in accordance with the corresponding diagnoses made by TCM physicians. We also applied this method to NEI data collected from 400 Wistar rats, and the result shows rules relating vascular endothelial function related parameters and NEI network related parameters. The study provides an improved solution for syndrome classification in patient and rat data, and its results can contribute to TCM practice.
TCM: Traditional Chinese Medicine
CRF: Chronic Renal Failure
1. Normile D: The new face of traditional Chinese medicine. Science. 2003, 299: 188-190. 10.1126/science.299.5604.188.
2. Xue T, Roy R: Studying traditional Chinese medicine. Science. 2003, 300: 740-741. 10.1126/science.300.5620.740.
3. Zhou X, Wu Z: Ontology development for unified traditional Chinese medical language system. Artificial Intelligence in Medicine. 2004, 32: 15-27. 10.1016/j.artmed.2004.01.014.
4. Li S, Zhang X, Li Y, Wang Y: Understanding Zheng in traditional Chinese medicine in the context of neuro-endocrine immune network. IET Systems Biology. 2007, 1: 51-60. 10.1049/iet-syb:20060032.
5. Chow T, Huang D: Estimating optimal feature subsets using efficient estimation of high-dimensional mutual information. IEEE Transactions on Neural Networks. 2005, 16 (1): 213-224. 10.1109/TNN.2004.841414.
6. Awate S, Tasdizen T, Foster N, Whitaker R: Adaptive Markov modeling for mutual-information-based unsupervised MRI brain-tissue classification. Medical Image Analysis. 2006, 10: 726-739. 10.1016/j.media.2006.07.002.
7. Sun Z, Xi G, Yi J, Zhao D: Select informative symptom combination of diagnosing syndrome. Journal of Biological Systems. 2007, 15: 27-37. 10.1142/S0218339007002088.
8. Li W: Mutual information functions versus correlation functions. Journal of Statistical Physics. 1990, 60 (5–6): 823-837.
9. Kwak N, Choi CH: Input feature selection by mutual information based on Parzen window. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2002, 24: 1667-1671. 10.1109/TPAMI.2002.1114861.
10. Sun Z, Xi G, Li H, Yi J, Wang J: Correlation analysis between TCM syndromes and physicochemical parameters. Chinese Journal of Biomedical Engineering. 2006, 3: 93-102.
11. Delen D, Walker G, Kadam A: Predicting breast cancer survivability: a comparison of three data mining methods. Artificial Intelligence in Medicine. 2005, 34: 113-127. 10.1016/j.artmed.2004.07.002.
12. Chen H: Practice of Internal Medicine. 2005, Beijing: Peoples' Medical Publishing House, 12
13. Besedovsky H, Sorkin E: Network of immune-neuroendocrine interactions. Clin Exp Immunol. 1977, 27 (1): 1-12.
14. Zhu C: Immune-neuro-endocrine network. Journal of Acta Anatomica Sinica. 1993, 24 (2): 216-221.
15. Jackson IMD: Significance and function of neuropeptides in cerebrospinal fluid. Neurobiology of Cerebrospinal Fluid. Edited by: Wood JH. 1980, New York: Plenum Press, 1: 623-650.
16. Nathanson JA, Chun LLY: Immunological function of the blood-cerebrospinal fluid barrier. Proc Natl Acad Sci U S A. 1989, 86: 1684-1688. 10.1073/pnas.86.5.1684.
We gratefully acknowledge China Academy of Chinese Medical Sciences and Hebei Medical University for providing data support. The work was supported by the National Basic Research Program of China (973 Program) under grant No. 2003CB517106 and NSFC Projects under grant No. 60621001, China.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1
The authors declare that they have no competing interests.
JC developed the method and performed the results validation. GCX conceived of the study, and participated in its design and coordination and helped to draft the manuscript. All authors read and approved the final manuscript.