An unsupervised partition method based on association delineated revised mutual information

Chen, Jing; Xi, Guangcheng

doi:10.1186/1471-2105-10-S1-S63

Volume 10 Supplement 1

Selected papers from the Seventh Asia-Pacific Bioinformatics Conference (APBC 2009)

Research
Open access
Published: 30 January 2009


BMC Bioinformatics volume 10, Article number: S63 (2009)


Abstract

Background

The syndrome is the basic pathological unit and the key concept in traditional Chinese medicine (TCM), and herbal remedies are prescribed according to the syndrome a patient presents. Nevertheless, few studies have investigated how many syndromes there are and what they are. A correlative measure based on mutual information can capture arbitrary statistical dependence between discrete or continuous variables.

Results

We present a revised version of mutual information that discriminates positive from negative association. The entropy partition method discovers the effective patterns in patient data and rat data in a self-organized way. The super-additivity of the mutual-information-based cluster measure is proved, and the concept of N-class association is introduced to reduce computational complexity. The algorithm is validated using the patient data and the corresponding diagnostic data. The partition results for the patient data indicate that the algorithm achieves a sensitivity of 96.48% and that each discovered pattern is clinically meaningful. The partition results for the rat data show the inherent relationship between vascular endothelial function related parameters and neuro-endocrine-immune (NEI) network related parameters.

Conclusion

We conclude that the algorithm is an effective approach to pattern discovery in the patient and rat data in the context of traditional Chinese medicine.

Background

Traditional Chinese medicine (TCM) is used by most people in China as a complementary therapeutic alternative, since herbal remedies have fewer side effects and are less costly than Western medicine. TCM has always been regarded as a key component of the 5000-year history of Chinese civilization. In ancient times, before modern medicine was born, people all over the world mainly benefited from three traditional medicines, of which only TCM is still alive today; Chaldaic and ancient Hindu medicine survive only in extremely rare documents as evidence that they ever existed. TCM, whose core is the syndrome, is on the way to modernization and aims to be accepted, like Western medicine, as a science [1–3].

The syndrome is the basic pathological unit and the key concept in TCM theory, since herbal remedies are prescribed according to the syndrome or syndromes a patient presents [4]. Identifying and determining syndromes is therefore of great importance to TCM physicians. Nevertheless, few publications are dedicated to this issue.

In information theory, entropy measures the uncertainty of a random variable. The mutual information (MI) of two random variables measures their mutual dependence. It has been applied in many fields, in which researchers treat it as a divergence or distance between two distributions [5–7]. The advantage of mutual information over correlation methods is discussed in [8]. In this paper, we propose a novel unsupervised data mining model in which mutual information is treated as a measure of association between two variables. We try to discover syndromes in chronic renal failure (CRF) data without supervision and verify these syndromes clinically to test the performance of our model. Based on a revised mutual information, we propose an unsupervised pattern discovery algorithm that allocates significantly associated symptoms to patterns in a self-organized way. Using the patients' diagnostic data, each pattern is verified to have clinical meaning. Using rat data, we also apply the method to find the inherent relationship between vascular endothelial function related parameters and NEI network related parameters.

Methods

Correlative measure based on mutual information

Correlative measure for discrete variables

Mutual information between two discrete variables is formally defined as:

MI(X, Y) = H(X) + H(Y) - H(X ∪ Y) (1)

where H(X) denotes the Shannon entropy of variable X and H(X ∪ Y) the joint entropy of X and Y. Suppose that X and Y are both categorical variables; then H(X) and H(X ∪ Y) are given by:

H (X) = - \sum_{i = 1}^{m} \frac{n_{i}}{N} \ln \frac{n_{i}}{N}

(2)

where n_i denotes the number of occurrences of the i-th category of X, which has m categories, and N is the total number of samples of X; and

H (X \cup Y) = - \sum_{i = 1}^{m} \sum_{j = 1}^{l} \frac{n_{i j}}{N} \ln \frac{n_{i j}}{N}

(3)

where n_ij represents the number of simultaneous occurrences of the i-th category of X (of m categories) and the j-th category of Y (of l categories).
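Equations (1) to (3) can be evaluated directly from category counts. The following is a minimal sketch, not the authors' implementation; the function names and the toy symptom grades are assumptions introduced only for illustration.

```python
from collections import Counter
from math import log


def entropy(values):
    """Shannon entropy (natural log) of a sequence of categories, as in Eq. (2)."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def joint_entropy(xs, ys):
    """Joint entropy H(X ∪ Y) from the co-occurrence counts n_ij, as in Eq. (3)."""
    return entropy(list(zip(xs, ys)))


def mutual_information(xs, ys):
    """MI(X, Y) = H(X) + H(Y) - H(X ∪ Y), as in Eq. (1)."""
    return entropy(xs) + entropy(ys) - joint_entropy(xs, ys)


# Hypothetical symptom grades (0 = none ... 3 = severe) for eight patients.
x = [0, 0, 1, 1, 2, 2, 3, 3]
y = [0, 0, 1, 1, 2, 2, 3, 3]  # identical to x, so MI(x, y) equals H(x)
z = [0, 1, 0, 1, 0, 1, 0, 1]  # every (x, z) combination is equally frequent

print(mutual_information(x, y))
print(mutual_information(x, z))
```

In this toy sample the first value equals H(X) = ln 4, because Y determines X completely, and the second is zero up to rounding, because X and Z are empirically independent.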

Mutual information is widely used to measure the similarity between the distributions of two variables and is taken here as a measure of association between two variables. As defined above, MI is a measure on a set of two variables; for a set of more than two variables, the definition is extended as follows:

M I (X_{1}, X_{2}, \dots X_{n}) ≜ \sum_{i = 1}^{n} H (X_{i}) - H (\cup_{i = 1}^{n} X_{i})

(4)

Mutual information has an interesting property, super-additivity. We introduce the concept and give a mathematical proof of it below. Note, however, that super-additivity contributes only marginally to the validation of the algorithm here.

Super-additivity of correlative measure

Consider a nonempty finite set X and a set family E(X) consisting of its subsets. Let P be a set function defined on E(X) with the properties:

(i) P(A) ≥ 0, ∀ A ∈ E(X)

(ii) P(∅) = 0

If, for arbitrary nonempty finite sets S_i ∈ E(X), S_j ∈ E(X), i ≠ j, with S_i ∩ S_j = ∅, we have

P(S_i ∪ S_j) ≥ P(S_i) + P(S_j) (5)

then the set function P is called super-additive.

One important property of the correlative measure is its super-additivity: the correlative measure of a finite set is no less than the sum of the correlative measures of the disjoint subsets into which it is partitioned.

Theorem. Correlative measure MI(s₁, s₂, ⋯, s_m) is finitely super-additive, and unique.

Proof. The definition of MI ensures the uniqueness of the measure. We now prove its super-additivity. Suppose that the set X is partitioned into m subsets s₁, s₂, ⋯, s_m satisfying, for arbitrary i, j (i ≠ j), s_i ≠ ∅, s_j ≠ ∅, s_i ∩ s_j = ∅, and

X = \cup_{i = 1}^{m} s_{i} = \cup_{s_{i} \in X} s_{i}

We only need to prove

M I (\cup_{i = 1}^{m} s_{i}) \geq \sum_{i = 1}^{m} M I (s_{i})

(6)

By the definition of MI, we have

\begin{matrix} M I (X) = M I (\cup_{i = 1}^{m} s_{i}) = M I (s_{1}, s_{2}, ... s_{m}) \\ = M I (X_{1}, X_{2}, ... X_{n}) \\ = \sum_{i = 1}^{n} H (X_{i}) - H (\cup_{i = 1}^{n} X_{i}) \end{matrix}

(7)

\begin{matrix} \sum_{s_{i} \in X} M I (s_{i}) = \sum_{s_{i} \in X} (\sum_{X_{j} \in s_{i}} H (X_{j}) - H (\cup_{X_{j} \in s_{i}} X_{j})) \\ = \sum_{i = 1}^{n} H (X_{i}) - \sum_{s_{i} \in X} H (s_{i}) \\ = \sum_{i = 1}^{n} H (X_{i}) - \sum_{i = 1}^{m} H (s_{i}) \end{matrix}

(8)

Subtracting (8) from (7), we have

\begin{matrix} M I (X) - \sum_{s_{i} \in X} M I (s_{i}) = \sum_{i = 1}^{m} H (s_{i}) - H (\cup_{i = 1}^{n} X_{i}) \\ = \sum_{i = 1}^{m} H (s_{i}) - H (\cup_{i = 1}^{m} s_{i}) \geq 0 \end{matrix}

(9)

The inequality holds because joint entropy is subadditive: the joint entropy of all the subsets together never exceeds the sum of the entropies of the individual subsets. The proof is complete.
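The inequality (6) can also be checked numerically on sample data. The sketch below assumes four hypothetical binary variables generated at random and uses the multi-variable definition (4); it illustrates the statement and is no substitute for the proof.

```python
from collections import Counter
from math import log
import random


def entropy(rows):
    """Shannon entropy (natural log) of a sequence of hashable observations."""
    n = len(rows)
    return -sum((c / n) * log(c / n) for c in Counter(rows).values())


def multi_mi(data, cols):
    """Eq. (4): sum of marginal entropies minus the joint entropy of columns `cols`."""
    marginals = sum(entropy([row[c] for row in data]) for c in cols)
    joint = entropy([tuple(row[c] for c in cols) for row in data])
    return marginals - joint


random.seed(0)
# Hypothetical records of four binary variables: variable 1 mostly copies
# variable 0, and variable 3 mostly copies variable 2.
data = []
for _ in range(200):
    a = random.randint(0, 1)
    b = a if random.random() < 0.8 else 1 - a
    c = random.randint(0, 1)
    d = c if random.random() < 0.7 else 1 - c
    data.append((a, b, c, d))

whole = multi_mi(data, [0, 1, 2, 3])                      # MI of the full set
parts = multi_mi(data, [0, 1]) + multi_mi(data, [2, 3])   # sum over a partition
print(whole - parts)  # non-negative up to rounding, as Eq. (6) requires
```

The same check passes for any other partition of the four columns, for example `[0, 2]` and `[1, 3]`, since the proof does not depend on how the variables are grouped.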

A revised version of correlative measure

Although many merits of MI have been reported [9], MI also has some shortcomings when applied to such data. First, MI-based association between two variables is symmetric, whereas the relation between two symptoms is usually asymmetric; symmetry is a special case of asymmetry. Second, the MI of two variables is non-negative but has no fixed upper bound, which makes a single association value hard to interpret in isolation. An ameliorated version of MI can fill the gap. We use the following normalized form of the association μ between two variables:

μ (X, Y) = \frac{M I (X, Y)}{H (Y)}

(10)

By this definition, the relation between two variables is asymmetric because the Shannon entropies of the two variables usually differ. Additionally, according to information theory, MI(X, Y) is non-negative and bounded above by the minimum of H(X) and H(Y); therefore the new association μ(X, Y) takes values between 0 and 1, which is to some extent similar to correlation in statistics.

Furthermore, by information theory, the form of MI can be recast as:

MI(X, Y) = H(X) - H(X|Y) (11)

where H(X|Y) denotes the conditional entropy, which measures the uncertainty remaining in X once Y is known. In other words, MI(X, Y) is the amount of information about X gained by knowing Y. Consequently, the association of two closely co-occurring symptoms and that of two completely opposite symptoms are both very large, so the association defined by MI mixes positive and negative association. We present an ameliorated version of MI that distinguishes the two.

The frequency with which X and Y both take nonzero categories is denoted pofr(X, Y); it is this positive frequency of X and Y that separates positive from negative association. We redefine MI as:

M I (X, Y) = \begin{cases} \frac{H (X) + H (Y) - H (X \cup Y)}{H (Y)}, & \mathrm{pofr} (X, Y) \geq θ \\ \frac{H (X) + H (Y) - b \cdot H (X \cup Y)}{H (Y)}, & \mathrm{pofr} (X, Y) < θ \end{cases}

where θ is a pre-assigned non-negative quantity, called the threshold in this paper. When θ = 0 the first case always applies and the ameliorated MI reduces to the traditional MI in its normalized form (10), so the ameliorated MI is an extended version of the traditional MI. b is a real number greater than 1 and acts as a penalty coefficient. With a proper setting of the two parameters, positively associated symptoms keep their association unchanged, while the association of negatively associated symptoms is reduced, possibly to zero.
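To make the two branches of the revised measure concrete, here is a small sketch. The function names, the toy data and the handling of H(Y) = 0 are assumptions for illustration; the source does not say whether negative values are clipped, so none are clipped here.

```python
from collections import Counter
from math import log


def entropy(values):
    """Shannon entropy (natural log) of a sequence of categories."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def positive_frequency(xs, ys):
    """pofr(X, Y): fraction of records in which both variables are nonzero."""
    return sum(1 for a, b in zip(xs, ys) if a != 0 and b != 0) / len(xs)


def revised_mi(xs, ys, theta, b):
    """Revised, normalized MI with threshold `theta` and penalty coefficient `b`."""
    h_x, h_y = entropy(xs), entropy(ys)
    h_xy = entropy(list(zip(xs, ys)))
    if h_y == 0:
        return 0.0  # assumption: a constant Y carries no association
    # The joint entropy is penalized only when the two symptoms rarely co-occur.
    penalty = 1.0 if positive_frequency(xs, ys) >= theta else b
    return (h_x + h_y - penalty * h_xy) / h_y


# Hypothetical present/absent symptoms for eight patients.
x = [1, 1, 1, 1, 0, 0, 0, 0]
same = [1, 1, 1, 1, 0, 0, 0, 0]      # always co-occurs with x
opposite = [0, 0, 0, 0, 1, 1, 1, 1]  # never co-occurs with x

print(revised_mi(x, same, theta=0.1, b=2))      # positive association is kept
print(revised_mi(x, opposite, theta=0.1, b=2))  # negative association is penalized
print(revised_mi(x, opposite, theta=0.0, b=2))  # theta = 0: plain normalized MI
```

With θ = 0 the opposite pair receives the same value as the identical pair, which is exactly the ambiguity the threshold and penalty are meant to remove.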

Correlative measure for continuous variables

Let us now consider two continuous variables. Based on the above definitions, we derive the form of the correlative measure for two continuous variables that follow a normal distribution [10].

Let two continuous variables X, Y follow normal distributions; their PDFs are

\begin{matrix} f (x) = 1 / (\sqrt{2 π} σ_{x}) \cdot \exp (- {(x - E_{x})}^{2} / (2 σ_{x}^{2})) \\ f (y) = 1 / (\sqrt{2 π} σ_{y}) \cdot \exp (- {(y - E_{y})}^{2} / (2 σ_{y}^{2})) \end{matrix}

where -∞ < x < ∞, -∞ < y < ∞, E_x, E_y are the mathematical expectations of X, Y, and σ_x, σ_y are the standard deviations of X, Y.

The joint probability density function of X, Y is

f (x, y) = \frac{1}{2 π σ_{x} σ_{y} \sqrt{1 - ρ^{2}}} \cdot \exp (- \frac{1}{2 (1 - ρ^{2})} \cdot [\frac{{(x - E_{x})}^{2}}{σ_{x}^{2}} - \frac{2 ρ (x - E_{x}) (y - E_{y})}{σ_{x} σ_{y}} + \frac{{(y - E_{y})}^{2}}{σ_{y}^{2}}])

where ρ is the correlation coefficient of X, Y.

Entropy of X is

\begin{matrix} H (X) = - \int_{- \infty}^{\infty} f (x) \log (f (x)) d x \\ = - \int_{- \infty}^{\infty} \frac{1}{\sqrt{2 π} σ_{x}} \cdot \exp (- \frac{{(x - E_{x})}^{2}}{2 σ_{x}^{2}}) \log (\frac{1}{\sqrt{2 π} σ_{x}} \cdot \exp (- \frac{{(x - E_{x})}^{2}}{2 σ_{x}^{2}})) d x \\ = - \log (\frac{1}{\sqrt{2 π} σ_{x}}) + 1 / 2 \end{matrix}

In a similar way, entropy of Y is

H (Y) = - \log (\frac{1}{\sqrt{2 π} σ_{y}}) + \frac{1}{2}

And then joint entropy of X, Y is

H (X, Y) = - \int_{- \infty}^{\infty} \int_{- \infty}^{\infty} f (x, y) \log (f (x, y)) d x dy

Let (x - E_x)/σ_x = u and (y - E_y)/σ_y = v; then

\begin{array}{l} H (X, Y) \\ = - [\frac{1}{2 π \sqrt{1 - ρ^{2}}}] \cdot [\log (\frac{1}{2 π σ_{x} σ_{y} \sqrt{1 - ρ^{2}}})] \cdot \int_{- \infty}^{\infty} \int_{- \infty}^{\infty} [\exp (\frac{- 1}{2 (1 - ρ^{2})} \cdot (u^{2} - 2 ρ u v + v^{2}))] d u d v + \\ [\frac{1}{2 π \sqrt{1 - ρ^{2}}}] \cdot \int_{- \infty}^{\infty} \int_{- \infty}^{\infty} [\exp (\frac{- 1}{2 (1 - ρ^{2})} \cdot (u^{2} - 2 ρ u v + v^{2}))] \cdot [\frac{u^{2} - 2 ρ u v + v^{2}}{2 (1 - ρ^{2})}] d u d v \\ = - \log (\frac{1}{2 π σ_{x} σ_{y} \sqrt{1 - ρ^{2}}}) + 1 \end{array}

The correlative measure between X and Y is therefore

M I (X, Y) = H (X) + H (Y) - H (X, Y) = - \frac{\log (1 - ρ^{2})}{2}

(12)
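The closed form (12) can be verified numerically from the three entropy expressions derived above. The following sketch uses arbitrary, hypothetical values for σ_x, σ_y and ρ; the function names are illustrative.

```python
from math import log, pi, sqrt


def gaussian_entropy(sigma):
    """H(X) = -log(1 / (sqrt(2π) σ)) + 1/2 for a normal variable."""
    return log(sqrt(2 * pi) * sigma) + 0.5


def bivariate_gaussian_entropy(sigma_x, sigma_y, rho):
    """H(X, Y) = -log(1 / (2π σ_x σ_y sqrt(1 - ρ²))) + 1 for a bivariate normal."""
    return log(2 * pi * sigma_x * sigma_y * sqrt(1 - rho ** 2)) + 1.0


def gaussian_mi(rho):
    """Eq. (12): MI(X, Y) = -log(1 - ρ²) / 2; requires |ρ| < 1."""
    return -0.5 * log(1 - rho ** 2)


# Hypothetical parameter values chosen only for the check.
sigma_x, sigma_y, rho = 1.5, 0.7, 0.6
from_entropies = (gaussian_entropy(sigma_x) + gaussian_entropy(sigma_y)
                  - bivariate_gaussian_entropy(sigma_x, sigma_y, rho))
print(from_entropies, gaussian_mi(rho))  # the two values agree up to rounding
```

The standard deviations cancel, so the measure depends on ρ only, and because ρ enters squared it does not distinguish positive from negative correlation.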

Entropy partition method

Once the association for each pair of variables has been computed, we propose a self-organized algorithm to discover the patterns automatically. The algorithm not only clusters variables but also allows a variable to appear in several patterns. This section has three subsections. The first introduces the concept of the "Relative" set. Based on this, the pattern discovery algorithm is proposed in the second subsection. The last subsection presents the n-class association concept that backs up the idea of the algorithm.

"Relative" set

For a specific variable X, we gather the N variables whose associations with X are largest into a set attached to X, denoted R(X). Each variable in the set is regarded as a "Relative" of X, while variables outside the set are considered irrelative to X; we therefore call R(X) the "Relative" set of X. The "Relative" sets of all k variables can be written as a k × N matrix, on which the pattern discovery algorithm is based.
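The construction of the "Relative" sets amounts to a top-N selection per variable. The sketch below assumes the pairwise associations are already available as a k × k list of lists, where `assoc[i][j]` is the association of variable j with regard to variable i; this layout and the numbers are hypothetical.

```python
def relative_sets(assoc, n_relatives):
    """Return, for each variable i, the indices of its n_relatives closest variables."""
    k = len(assoc)
    result = []
    for i in range(k):
        others = [j for j in range(k) if j != i]
        # Sort the other variables by decreasing association with variable i.
        others.sort(key=lambda j: assoc[i][j], reverse=True)
        result.append(others[:n_relatives])
    return result


# Hypothetical asymmetric association matrix for four variables.
assoc = [
    [0.0, 0.9, 0.8, 0.1],
    [0.7, 0.0, 0.6, 0.2],
    [0.5, 0.4, 0.0, 0.3],
    [0.1, 0.2, 0.9, 0.0],
]

print(relative_sets(assoc, 2))  # one row per variable: the k × N matrix
```

Because the association is asymmetric, variable 2 is a "Relative" of variable 3 here while variable 3 is not a "Relative" of variable 2.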

Algorithm steps

A pair of variables X and Y is defined to be significantly associated if and only if X belongs to the "Relative" set of Y (X ∈ R(Y)) and vice versa (Y ∈ R(X)). The definition extends naturally to a set of several variables: the set is significantly associated if and only if each pair of its variables is significantly associated. A pattern is defined as a significantly associated set with a maximal number of variables, and all such sets constitute the hidden patterns in the data. A pattern therefore satisfies three criteria: (1) the set contains no fewer than 2 variables; (2) each pair of variables in the set is significantly associated; and (3) no variable outside the set can be added while keeping the set significantly associated, which means that the number of variables in the set has reached its maximum.

To discover all patterns hidden in the data, we propose the unsupervised algorithm, which can be implemented by three steps.

Step 1

Based on the k × N matrix of "Relative" sets, all significantly associated pairs are collected into an M₂ × 2 matrix, where M₂ is the number of significantly associated pairs.

Step 2

Based on the M₂ × 2 matrix, all significantly associated triples are collected into an M₃ × 3 matrix, where M₃ is the number of significantly associated triples. Similarly, significantly associated sets of m variables are collected into an M_m × m matrix, where M_m is the number of such sets and m is the number of variables in each. Obviously m ≤ N, where N represents the number of "Relative" variables. Since N is bounded, the algorithm terminates.

Step 3

Find the maximal m. The M_m × m matrix contains M_m patterns with m variables. A set of m - 1 variables that is contained in one of these patterns is certainly not a pattern, since it does not fulfill the third criterion. All such sets are removed from the M_{m-1} × (m - 1) matrix; the remaining rows are patterns with m - 1 variables. Proceeding in the same way for smaller sizes, all patterns are discovered.
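The three steps can be sketched as a level-wise search. This is one possible reading of the procedure, not the authors' code: it assumes the "Relative" sets are given as lists of variable indices (a hypothetical input format) and stores each level as a set of frozensets rather than as a matrix.

```python
from itertools import combinations


def discover_patterns(relatives):
    """Return all maximal significantly associated sets, largest first."""
    k = len(relatives)
    rel = [set(r) for r in relatives]

    def linked(a, b):
        # Significant association: each variable is a "Relative" of the other.
        return a in rel[b] and b in rel[a]

    # Step 1: collect all significantly associated pairs.
    level = {frozenset(p) for p in combinations(range(k), 2) if linked(*p)}

    # Step 2: grow sets one variable at a time while every pair stays linked.
    levels = []
    while level:
        levels.append(level)
        grown = set()
        for s in level:
            for v in range(k):
                if v not in s and all(linked(v, u) for u in s):
                    grown.add(s | {v})
        level = grown

    # Step 3: drop every set contained in a significantly associated set
    # that is one variable larger; the remaining sets are the patterns.
    patterns = []
    for idx, sets in enumerate(levels):
        bigger = levels[idx + 1] if idx + 1 < len(levels) else set()
        for s in sets:
            if not any(s < t for t in bigger):
                patterns.append(sorted(s))
    return sorted(patterns, key=lambda p: (-len(p), p))


# Hypothetical "Relative" sets (N = 2) for five variables.
relatives = [[1, 2], [0, 2], [0, 1], [4, 2], [3, 0]]
print(discover_patterns(relatives))
```

In this toy input, variables 0, 1 and 2 are mutually "Relative" and form one pattern, and variables 3 and 4 form a second, smaller pattern; the pairs inside the first pattern are discarded in Step 3.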

N-class association

From the pattern discovery method introduced above, we know that the association must be computed $C_{n}^{2}$ times if every pair among n symptoms is considered, and this number rises to $C_{n}^{3}$ if every triple is considered. In TCM theory, a syndrome is often described by several (such as 5 or 6) symptoms in combination. Given the unsupervised character of the data, the number of symptoms to be considered together should therefore be 2~3 times the actual 5~6, namely 10~18. For the TCM data of this paper, the number of computations would then reach $C_{72}^{10} ~ C_{72}^{18}$, up to the order of 10¹⁶, which exceeds the capacity of any general-purpose computer at present. For this reason, we introduce the concept of n-class association. Formally, for n variables, if every n - 1 of the n variables are closely associated, then we say that these n variables are of n-class association. In particular, when n = 2, this is just the association between two variables.
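The counts quoted above are binomial coefficients and can be reproduced with the standard library. This short sketch only illustrates the orders of magnitude for the 72 symptoms of this study; the variable names are illustrative.

```python
from math import comb

n_symptoms = 72

pairs = comb(n_symptoms, 2)    # associations needed if only pairs are evaluated
triples = comb(n_symptoms, 3)  # evaluations needed if every triple is considered
low = comb(n_symptoms, 10)     # subsets of 10 symptoms
high = comb(n_symptoms, 18)    # subsets of 18 symptoms

print(pairs, triples)
print(f"{low:.2e} to {high:.2e}")
```

Exhaustive evaluation of 18-symptom subsets is thus of the order of 10¹⁶ evaluations, whereas a pairwise approach needs only a few thousand.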

Based on the concept of n-class association, to judge whether n variables are associated we need only judge whether every n - 1 of them are associated. Applied recursively, this means that in theory we only need to compute the association between pairs of variables, which significantly decreases the computational complexity, so that the pattern discovery algorithm can be applied to large-scale data. The n-class association concept can be justified by mathematical induction.

For n = 2 the proposition is obvious. Suppose that n variables are closely associated; we want to show that when an (n + 1)-th variable is closely associated with these n variables, all n + 1 variables are closely associated.

We know that $M I (X_{1}, X_{2}, \dots, X_{n}) = \sum_{i = 1}^{n} H (X_{i}) - H (\cup_{i = 1}^{n} X_{i})$ , then

\begin{array}{l} M I (X_{1}, X_{2}, \dots, X_{n}, X_{n + 1}) \\ = \sum_{i = 1}^{n} H (X_{i}) + H (X_{n + 1}) - H (\cup_{i = 1}^{n + 1} X_{i}) \\ = \sum_{i = 1}^{n} H (X_{i}) + H (X_{n + 1}) - H (\cup_{i = 1}^{n} X_{i}, X_{n + 1}) \\ = \sum_{i = 1}^{n} H (X_{i}) + H (X_{n + 1}) - [H (\cup_{i = 1}^{n} X_{i}) + H (X_{n + 1}) - M I (\cup_{i = 1}^{n} X_{i}, X_{n + 1})] \\ = [\sum_{i = 1}^{n} H (X_{i}) - H (\cup_{i = 1}^{n} X_{i})] + M I (\cup_{i = 1}^{n} X_{i}, X_{n + 1}) \\ = M I (X_{1}, X_{2}, \dots, X_{n}) + M I (\cup_{i = 1}^{n} X_{i}, X_{n + 1}) \end{array}

This shows that the correlative measure among X₁, X₂, ⋯, X_n, X_{n+1} is composed of the association measure among X₁, X₂, ⋯, X_n and the measure between the subset {X₁, X₂, ⋯, X_n} and X_{n+1}. Hence, if an (n + 1)-th variable is closely associated with the other n variables, these n + 1 variables are considered closely associated. The n-class association concept greatly decreases the computational complexity of the pattern discovery algorithm.

Validation method

Algorithm steps

To validate the algorithm and to explain the choice of parameters described above, we must take into account the objective (diagnostic) data that correspond to the unsupervised data.

In supervised learning, an algorithm is validated by estimating three measures of the classification results: sensitivity, specificity and accuracy [11]. In the unsupervised setting here, however, validation must be done in a slightly different way. We summarize it in the following three steps.

Step 1

For each pattern S, we return to the unsupervised data: if all variables of the pattern appear simultaneously (their values are nonzero) in a patient, the serial number of that patient is recorded. We collect all these serial numbers and record their total count as L_S. The serial numbers are stored in a vector with L_S dimensions, denoted ${\vec{V}}_{S}$ .

Step 2

Tracing the vector ${\vec{V}}_{S}$ to the syndrome data, we get L_S vectors with 9 dimensions, each dimension encoding one syndrome. The L_S vectors are added together to generate a new vector $\vec{W_{S}} = (w_{i}^{S}, i = 1, 2, ..., 8, 9)$ , where i indexes the i-th syndrome and $w_{i}^{S}$ is the number of patients, among the L_S patients, who are diagnosed with that syndrome. Obviously, $w_{i}^{S} \leq L_{S}$ . The maximal entry of $\vec{W_{S}}$ , denoted $w_{i_{\max}}^{S}$ , is easy to find. We record $w_{i_{\max}}^{S}$ and the corresponding syndrome i_max.

Step 3

We define the sensitivity T_S of the pattern S as $T_{S} = \frac{w_{i_{\max}}^{S}}{L_{S}}$ . The sensitivity of the algorithm, denoted T, is the average of the sensitivities of all patterns, i.e. $T = \frac{1}{P} \sum_{S = 1}^{P} T_{S}$ , where P denotes the number of patterns.
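The three validation steps translate into a few lines of code. The sketch below assumes the symptom data and the syndrome data are row-aligned lists of lists; the toy records, the function names and the choice to skip patterns that match no patient (L_S = 0) are assumptions, since the source does not discuss that case.

```python
def pattern_sensitivity(pattern, symptom_rows, syndrome_rows):
    """T_S = w_max / L_S for one pattern, or None if no patient shows the pattern."""
    # Step 1: patients in whom every symptom of the pattern is present (nonzero).
    matched = [idx for idx, row in enumerate(symptom_rows)
               if all(row[v] != 0 for v in pattern)]
    l_s = len(matched)
    if l_s == 0:
        return None
    # Step 2: add up the 0/1 syndrome vectors of the matched patients.
    n_syndromes = len(syndrome_rows[0])
    w = [sum(syndrome_rows[idx][i] for idx in matched) for i in range(n_syndromes)]
    # Step 3: share of matched patients carrying the most frequent syndrome.
    return max(w) / l_s


def algorithm_sensitivity(patterns, symptom_rows, syndrome_rows):
    """T: mean of the pattern sensitivities over all patterns that occur."""
    values = [pattern_sensitivity(p, symptom_rows, syndrome_rows) for p in patterns]
    values = [v for v in values if v is not None]
    return sum(values) / len(values)


# Hypothetical data: four patients, three graded symptoms, two syndromes.
symptoms = [[1, 2, 0], [2, 1, 0], [1, 1, 3], [0, 0, 1]]
syndromes = [[1, 0], [1, 0], [0, 1], [0, 1]]
patterns = [[0, 1], [1, 2]]

print(pattern_sensitivity([0, 1], symptoms, syndromes))
print(algorithm_sensitivity(patterns, symptoms, syndromes))
```

Pattern `[0, 1]` occurs in three patients, two of whom share the first syndrome, so its sensitivity is 2/3; pattern `[1, 2]` occurs in one patient and has sensitivity 1.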

Relation between sensitivity of algorithm and threshold

Given a clinical data set, the number of patterns P generated by the algorithm is determined only by the number of "Relatives" N and the threshold θ, i.e., P = f(N, θ). We now reconsider the form of the sensitivity of the algorithm T:

\begin{array}{l} T = \frac{1}{P} \sum_{S = 1}^{P} T_{S} \\ = \frac{1}{f (N, θ)} \sum_{S = 1}^{f (N, θ)} T_{S} \\ = \frac{1}{f (N, θ)} \sum_{S = 1}^{f (N, θ)} \frac{w_{i_{\max}}^{S}}{L_{S}} \end{array}

(13)

where $w_{i_{\max}}^{S}$ and L_S are constant for a given data set, so T is determined by N and θ. Clinically, a pattern with 3 or 4 variables is best suited to being diagnosed as a syndrome, so N is chosen to be not less than 4.

Results and discussion

Discrete data example

Data collection

Syndromes are diagnosed according to symptom combinations. As shown in Table 1, we chose 72 symptoms that are closely related to CRF. Pulse information was not included because of its poor consistency during the survey. The data set was collected at six clinical centers located in six provinces within the same demographic area over the same period, from October 2005 to March 2006; a total of 601 patients suffering from CRF were surveyed.

Table 1 The name of each variable and its frequency. The most frequent is Hypodynamia, the least frequent is Anuria. The total number of patients is 601.


A case had to meet four conditions strictly to be included in the data: (1) a diagnosis of CRF according to the diagnostic criteria, with the state of illness classified as stage 3, 4 or 5 [12]; (2) no dialysis therapy for a month before the survey; (3) age between 18 and 65 years; and (4) signed informed consent. Additionally, the survey applied three exclusion criteria:

(1) Besides chronic kidney function failure, the patient also suffers from intercurrent diseases such as serious respiratory, cardiovascular, cerebrovascular, digestive or blood system diseases.
(2) The patient is a woman in gestation or lactation.
(3) The patient has symptoms produced by drug therapy.

Every case records 72 symptoms together with the basic information of the subject. The frequencies of the 72 symptoms are listed in Table 1. Each variable (symptom) has four categories, i.e. none, light, middle, severe, represented by 0, 1, 2, 3, respectively. The latter three categories mean that the symptom is present and has been graded as light, middle or severe by clinical doctors, who were strictly and uniformly trained to reach high consistency.

Diagnostic data

The CRF patients recruited here were clinically diagnosed by TCM physicians in order to receive herbal treatment. We collected this diagnostic data (also called syndrome data) to validate the unsupervised pattern discovery algorithm. The data comprise nine syndromes, whose names and frequencies are shown in Table 2 in descending order of frequency. The data are represented by a matrix in which each row represents an observation and each column a syndrome. If a patient is diagnosed with one of the nine syndromes, the corresponding entry is 1; otherwise it is 0. If a CRF patient is diagnosed with two or more syndromes, all the corresponding entries are 1.

Table 2 The basic information of syndrome data. Each syndrome is assigned a Greek symbol. The total number of patients is 601.


Parameter setting

The algorithm has three parameters to be set. First, three or four symptoms usually constitute a syndrome, while in clinical application too many symptoms (say eight) may confuse TCM physicians and lead to complex results; in our model we therefore set the number of "Relatives" of each variable, denoted N, to 5. Second, we set the threshold to θ = 40/601 ≈ 6.66%. This choice is justified by the validation procedure described in the Validation method section. Third, the penalty coefficient b is set to 2, which separates positive and negative associations in our model.

Partition result and discussion

As shown in Part A of Table 3, one pattern has four symptoms and the other 15 patterns each comprise three symptoms. The 35 patterns with two symptoms are not listed because of their minor clinical contribution: it is very hard to diagnose a syndrome from two symptoms. Here, a pattern with more than two symptoms is called a clinically effective pattern.

Table 3 Patterns discovered automatically by the algorithm


Here, we investigate the relation between the sensitivity T of the algorithm and the threshold θ. As shown in Figure 1, the sensitivity of the algorithm varies with the threshold. The optimal threshold is 40/601 and the corresponding largest sensitivity is 0.9648, which, to our knowledge, is higher than any value reported in the literature so far. When the threshold is 0, the MI reduces to the traditional form. Figure 1 shows that the ameliorated MI performs better than the traditional MI, since the sensitivities obtained with θ > 0 are larger than that obtained with θ = 0. The larger the sensitivity of the algorithm, the better the result accords with clinical practice. The number in brackets beside each point is the number of effective patterns discovered at the corresponding threshold. Figure 2 shows each pattern's distribution in the patient data and its corresponding sensitivity. The sensitivity of a pattern is defined in Step 3 of the algorithm steps of the validation method.

Continuous data example

Data collection

For continuous variables, we use neuro-endocrine-immune data collected from 400 Wistar rats. In 1977 Besedovsky proposed the "immune-neuro-endocrine network" theory [13], which has been broadly studied and has rapidly become an important research focus in medicine and biology. More and more evidence indicates that there is not only one big loop among these three systems but also direct, bidirectional interaction between each pair of them [14–16].

The NEI data of the 400 Wistar rats were collected as follows:

This research focuses on the high risk factors and pathogenesis of vascular disease and takes vascular endothelial function as the point of entry.
The 400 Wistar rats are divided into 6 groups by a randomized single-blind method: normal group; basic model group; composite model group; ginseng intervention group; double-ginseng intervention group; compound powder intervention group.
The rats of the basic model group are fed a fixed ration of hyperhomocysteinemic feed to induce endothelial dysfunction.
Starting from the basic model group, the rats of the composite model group are forced to swim with a fixed load at fixed times.
Starting from the composite model group, the rats of the ginseng intervention group are treated with a fixed ration of ginseng.
Starting from the composite model group, the rats of the double-ginseng intervention group are treated with a fixed ration of double-ginseng.
Starting from the composite model group, the rats of the compound powder intervention group are treated with a fixed ration of compound powder.
This research measures 5 vascular endothelial function related parameters and 21 NEI network related parameters. Only a few parameters (usually 4 at most) can be measured on each rat because of its limited blood volume. All the parameters are shown in Tables 4 and 5.

Table 4 Vascular endothelial function related parameters


Table 5 NEI network related parameters


Partition result and discussion

This research focuses on finding whether there are rules relating vascular endothelial function related parameters and NEI network related parameters. Following the entropy partition algorithm, we take the NEI network related parameters as the initial set X and the vascular endothelial function related parameters as the object set Y. Both X and Y are assumed to follow normal distributions. The corresponding parameters σ, μ can be estimated with a Bayesian method [10]. After determining the density functions of X and Y, we calculate their correlative measure according to (12) and then apply entropy partition to the data. Using N-class association, we obtain each parameter's "Relative" set and thus the final output set S.
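The ranking of NEI parameters against one endothelial parameter can be illustrated as follows. This sketch swaps in the plain sample Pearson correlation for the Bayesian parameter estimation of [10], which is an assumption made to keep the example short; the parameter names and measurements are hypothetical.

```python
from math import log, sqrt


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / sqrt(var_x * var_y)


def gaussian_mi(rho):
    """Eq. (12): MI = -log(1 - ρ²) / 2 under the normality assumption, |ρ| < 1."""
    return -0.5 * log(1 - rho ** 2)


def rank_by_mi(target, candidates, top_n):
    """Names of the top_n candidates with the largest Gaussian MI to the target."""
    scores = {name: gaussian_mi(pearson(target, values))
              for name, values in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical measurements of one endothelial parameter and three NEI parameters.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "nei_a": [1.5, 1.5, 3.5, 3.5, 5.0],  # strongly, positively correlated
    "nei_b": [5.0, 4.2, 2.9, 2.1, 0.8],  # strongly, negatively correlated
    "nei_c": [2.0, 1.0, 3.0, 1.5, 2.5],  # weakly correlated
}

print(rank_by_mi(target, candidates, top_n=2))
```

Since (12) depends on ρ², the negatively correlated parameter ranks first in this toy example; the measure captures strength of dependence, not its direction.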

The partition result is shown in Table 6. For each vascular endothelial function related parameter, we can find the most relevant NEI network related parameters. The parameters are listed in descending order of relevance from top to bottom.

Table 6 Entropy partition result of rats data.


We take pattern A in Table 6 as an example. Figure 3 describes the rules relating the two parameter sets mentioned above. These figures show the NEI related parameters most relevant to each vascular endothelial function related parameter. In this way, we can find some interesting phenomena and rules.

Abnormality and disorder in the NEI network will inevitably lead to corresponding endothelial dysfunction. Taking the composite model group as an example, there are subsystems consisting of some endothelial function related parameters and some NEI network related parameters, and these subsystems follow certain rules during the remaining time.

After intervention with ginseng, double-ginseng or compound powder, the rules in these subsystems change to varying degrees. After intervention with compound powder, the rules almost disappear and the trend appears consistent with that of the normal group.

This shows that the hyperhomocysteinemic feed in the basic model group and the loaded swimming in the composite model group are the particular causes of the above-mentioned rules.

Conclusion

In this paper, we presented an unsupervised partition method based on association delineated by revised mutual information. A revised version of mutual information was developed to discriminate positive from negative association. Based on our model, an unsupervised pattern discovery algorithm was proposed to allocate significantly associated symptoms to patterns. The algorithm not only clusters symptoms but also allows a symptom to appear in different patterns, which is consistent with TCM diagnosis. The unsupervised algorithm was validated using the syndrome data, and a sensitivity measure of algorithm performance was defined to evaluate the discovered patterns. The algorithm reaches a maximal sensitivity of 96.48%, which means that the CRF data are of good quality. Furthermore, the results show that, under a proper parameter setting, the algorithm discovered 16 patterns in CRF patients, and each pattern can be automatically diagnosed as a syndrome in accordance with the corresponding diagnoses made by TCM physicians. We also applied the method to NEI data collected from 400 Wistar rats, and the result reveals rules relating vascular endothelial function related parameters and NEI network related parameters. The study provides an improved solution for syndrome classification in patient and rat data, and its results can contribute significantly to TCM practice.

Abbreviations

TCM: Traditional Chinese Medicine
NEI: Neuro-Endocrine-Immune
MI: Mutual Information
CRF: Chronic Renal Failure

References

Normile D: The new face of traditional Chinese medicine. Science. 2003, 299: 188-190. 10.1126/science.299.5604.188.
Xue T, Roy R: Studying traditional Chinese medicine. Science. 2003, 300: 740-741. 10.1126/science.300.5620.740.
Zhou X, Wu Z: Ontology development for unified traditional Chinese medical language system. Artificial Intelligence in Medicine. 2004, 32: 15-27. 10.1016/j.artmed.2004.01.014.
Li S, Zhang X, Li Y, Wang Y: Understanding Zheng in traditional Chinese medicine in the context of neuro-endocrine immune network. IET Systems Biology. 2007, 1: 51-60. 10.1049/iet-syb:20060032.
Chow T, Huang D: Estimating optimal feature subsets using efficient estimation of high-dimensional mutual information. IEEE Transactions on Neural Networks. 2005, 16 (1): 213-224. 10.1109/TNN.2004.841414.
Awate S, Tasdizen T, Foster N, Whitaker R: Adaptive Markov modeling for mutual-information-based unsupervised MRI brain-tissue classification. Medical Image Analysis. 2006, 10: 726-739. 10.1016/j.media.2006.07.002.
Sun Z, Xi G, Yi J, Zhao D: Select informative symptom combination of diagnosing syndrome. Journal of Biological Systems. 2007, 15: 27-37. 10.1142/S0218339007002088.
Li W: Mutual information functions versus correlation functions. Journal of Statistical Physics. 1990, 60 (5–6): 823-837.
Kwak N, Choi C-H: Input feature selection by mutual information based on Parzen window. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2002, 24: 1667-1671. 10.1109/TPAMI.2002.1114861.
Sun Z, Xi G, Li H, Yi J, Wang J: Correlation analysis between TCM syndromes and physicochemical parameters. Chinese Journal of Biomedical Engineering. 2006, 3: 93-102.
Delen D, Walker G, Kadam A: Predicting breast cancer survivability: a comparison of three data mining methods. Artificial Intelligence in Medicine. 2005, 34: 113-127. 10.1016/j.artmed.2004.07.002.
Chen H: Practice of Internal Medicine. 2005, Beijing: People's Medical Publishing House, 12
Besedovsky H, Sorkin E: Network of immune-neuroendocrine interactions. Clin Exp Immunol. 1977, 27 (1): 1-12.
Zhu C: Immune-neuro-endocrine network. Journal of Acta Anatomica Sinica. 1993, 24 (2): 216-221.
Jackson IMD: Significance and function of neuropeptides in cerebrospinal fluid. Neurobiology of Cerebrospinal Fluid. Edited by: Wood JH. 1980, New York: Plenum Press, 1: 623-650.
Nathanson JA, Chun LLY: Immunological function of the blood-cerebrospinal fluid barrier. Proc Natl Acad Sci U S A. 1989, 86: 1684-1688. 10.1073/pnas.86.5.1684.

Acknowledgements

We gratefully acknowledge the China Academy of Chinese Medical Sciences and Hebei Medical University for providing the data. The work was supported by the National Basic Research Program of China (973 Program) under grant No. 2003CB517106 and by NSFC Projects under grant No. 60621001, China.

This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1

Author information

Authors and Affiliations

Key Laboratory of Complex Systems and Intelligence Science, Institute of Automation, Chinese Academy of Sciences, Beijing, 086-100190, PR China
Jing Chen & Guangcheng Xi

Authors

Jing Chen
Guangcheng Xi

Corresponding author

Correspondence to Guangcheng Xi.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JC developed the method and validated the results. GCX conceived of the study, participated in its design and coordination, and helped to draft the manuscript. All authors read and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Chen, J., Xi, G. An unsupervised partition method based on association delineated revised mutual information. BMC Bioinformatics 10 (Suppl 1), S63 (2009). https://doi.org/10.1186/1471-2105-10-S1-S63

Published: 30 January 2009
DOI: https://doi.org/10.1186/1471-2105-10-S1-S63
