Literal Similarity Metrics
TCM doctors name symptoms without the support of an authoritative editorial board, so a single symptom may be described by several different names, depending on the experience and background of the doctor. Even so, names describing the same symptom usually have literal similarity owing to the ideographic nature of Chinese. For example, two names that both mean head share the ideographic character for 'Head'. Likewise, two names that both mean that a person sweats in the upper limbs share the ideographic characters for 'Upper Limb' and 'Perspiration'. Literal similarity metrics are therefore used to measure the similarity between symptom names.
Despite the differing experience and background of TCM doctors, symptoms are generally named according to loose conventions that have been inherited historically and are followed by most TCM doctors. In general, a TCM symptom name contains, in sequence, expressions of the affected body part, the disease property and the disease degree. For example, in the symptom name meaning 'Severe Headache', the affected body part is 'Head', the disease property is 'Ache' and the disease degree is 'Severe'. In the symptom name meaning 'Throat Tickle', 'Throat' is the affected body part and 'Tickle' is the disease property. Some components may be missing from a symptom name: in 'Throat Tickle', for instance, the disease degree is absent. However, the affected body part appears in most symptom names: in our experimental data, 66.97% (363/542) of the standard symptom names and 70.10% (3130/4465) of the clinical symptom names contain it. Moreover, when it appears it is usually the prefix: 66.61% (361/542) of the standard symptom names and 55.83% (2493/4465) of the clinical symptom names start with the affected body part. The prefix of a symptom name is therefore treated as an enhancing factor in determining literal similarity.
Based on these observations, four literal similarity metrics are used here to validate the feasibility of the approach; one of them, Jaro-Winkler Distance, additionally demonstrates the effect of the symptom name prefix.
Jaro Distance Metric
Jaro Distance (JD) [4] is one of the most popular and basic literal similarity metrics. Here the JD score is defined as follows:

$$\mathrm{JD}(s, s') = \frac{1}{3}\left(\frac{m}{|s|} + \frac{m}{|s'|} + \frac{m - t}{m}\right)$$

where m is the number of matching characters between a standard symptom name s and a clinical symptom name s'; t is the number of transpositions, i.e. the count of matching characters that appear in a different order in s and s' [5]; and |s| and |s'| are the numbers of characters in s and s' respectively.
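The JD computation can be illustrated with a minimal Python sketch. It follows the common formulation of Jaro similarity, and two details in it are assumptions rather than statements from the text: characters are only matched within a window of half the longer string's length minus one, and t is taken as half the number of matched characters that are out of order. The function name `jaro` and the test strings are illustrative only.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]; a sketch of the common formulation."""
    if not s or not t:
        return 0.0
    # Assumed convention: characters match only within this distance.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    m = 0  # number of matching characters
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Matched characters of each string, in their original order.
    s_seq = [c for c, flag in zip(s, s_matched) if flag]
    t_seq = [c for c, flag in zip(t, t_matched) if flag]
    # Assumed convention: t is half the count of out-of-order matches.
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) / 2
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3


if __name__ == "__main__":
    print(round(jaro("MARTHA", "MARHTA"), 4))
```

Because Python strings are sequences of Unicode code points, the same function applies unchanged to Chinese symptom names, one ideographic character per position.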
Jaro-Winkler Distance Metric
Jaro-Winkler Distance (JWD) [4] is extended from JD and adjusts the score of JD upwards for the symptom name pairs having common prefixes. JWD is introduced as follows:
where JD(s, s') is the JD score of a standard symptom name s and a clinical symptom name s', prefixLength is the length of their common prefix, and PREFIXSCALE is a constant scaling factor that controls how far the score is adjusted upwards for a symptom name pair with a common prefix (here PREFIXSCALE is set to three).
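As an illustration of the prefix adjustment, the sketch below applies Winkler's usual form, JD + prefixLength × PREFIXSCALE × (1 − JD), to a JD score that has already been computed. The defaults are assumptions taken from the common formulation (a scale of 0.1 and a prefix capped at four characters, which keeps the result at or below 1); they are not the setting reported in the text, whose scaling convention may differ. The function names are hypothetical.

```python
def common_prefix_length(s: str, t: str, cap: int = 4) -> int:
    """Length of the shared prefix of s and t, capped at `cap` characters."""
    n = 0
    for a, b in zip(s, t):
        if a != b or n == cap:
            break
        n += 1
    return n


def jaro_winkler(jd: float, s: str, t: str,
                 prefix_scale: float = 0.1, cap: int = 4) -> float:
    """Raise a precomputed JD score according to the common prefix.

    prefix_scale and cap are assumed defaults, not values from the text.
    """
    prefix = common_prefix_length(s, t, cap)
    return jd + prefix * prefix_scale * (1.0 - jd)


if __name__ == "__main__":
    # JD of this pair is 17/18; the names share the three-character prefix "MAR".
    print(round(jaro_winkler(17 / 18, "MARTHA", "MARHTA"), 4))
```

Since the affected body part is usually the prefix of a symptom name, this adjustment rewards pairs that agree on the body part.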
Smith-Waterman Distance Metric
Smith-Waterman Distance (SWD) [6] is a dynamic programming algorithm that is guaranteed to find the optimal local alignment of a symptom name pair with respect to a gap-scoring scheme and a scoring system that includes a substitution matrix. The substitution matrix M for comparing a symptom name pair is constructed as follows.

$$M(i, 0) = M(0, j) = 0, \qquad 0 \le i \le m,\ 0 \le j \le n$$

$$M(i, j) = \max \begin{cases} 0 \\ M(i-1, j-1) + \omega(sc_i, sc'_j) \\ M(i-1, j) + \omega(sc_i, -) \\ M(i, j-1) + \omega(-, sc'_j) \end{cases} \qquad 1 \le i \le m,\ 1 \le j \le n$$

where $sc_i$ is the i-th character in a standard symptom name s and $sc'_j$ is the j-th character in a clinical symptom name s', m is the length of s and n is the length of s', M(i, j) is the similarity score between the substring $sc_1 sc_2 \ldots sc_i$ of s and the substring $sc'_1 sc'_2 \ldots sc'_j$ of s', and $\omega(sc_i, sc'_j)$, $\omega(sc_i, -)$ and $\omega(-, sc'_j)$ are the gap-scoring schemes described in detail in [6].
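The dynamic-programming fill of M can be sketched in Python as follows. The scoring values are assumptions chosen for illustration (a fixed reward for identical characters, a fixed penalty for different characters and a fixed penalty per gap position); the scheme in [6] may differ. The function returns the best local alignment score, i.e. the largest entry of M.

```python
def smith_waterman(s: str, t: str, match: float = 1.0,
                   mismatch: float = -1.0, gap: float = -1.0) -> float:
    """Best local alignment score of s and t with a linear gap penalty."""
    m, n = len(s), len(t)
    # M[i][j]: best score of a local alignment ending at s[i-1] and t[j-1].
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    best = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(
                0.0,                    # start a new local alignment
                M[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                M[i - 1][j] + gap,      # s[i-1] against a gap
                M[i][j - 1] + gap,      # t[j-1] against a gap
            )
            best = max(best, M[i][j])
    return best


if __name__ == "__main__":
    # The shared run "bcd" gives the best local alignment.
    print(smith_waterman("abcde", "xbcdy"))
```

The floor at zero is what makes the alignment local: a poorly matching region resets the score instead of penalising a well matching region that follows it.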
Smith-Waterman-Gotoh Distance Metric
Smith-Waterman-Gotoh Distance (SWGD) [7] is an improved version of SWD. It allows gaps of multiple sizes and reduces the running time from the $O(M^2N)$ of SWD to $O(MN)$, where M and N are the lengths of a standard and a clinical symptom name respectively.
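A minimal sketch of the Gotoh-style computation is given below, assuming an affine gap cost in which the first position of a gap scores `gap_open` and each further position scores `gap_extend`. These parameter names and values are illustrative assumptions, not settings from the text. Two auxiliary matrices track alignments that end in a gap, so each cell is filled in constant time and the whole table in O(MN).

```python
NEG_INF = float("-inf")


def smith_waterman_gotoh(s: str, t: str, match: float = 2.0,
                         mismatch: float = -1.0, gap_open: float = -2.0,
                         gap_extend: float = -0.5) -> float:
    """Best local alignment score of s and t with affine gap costs."""
    m, n = len(s), len(t)
    H = [[0.0] * (n + 1) for _ in range(m + 1)]      # best score ending at (i, j)
    E = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with t[j-1] against a gap
    F = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with s[i-1] against a gap
    best = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Either extend an existing gap or open a new one.
            E[i][j] = max(E[i][j - 1] + gap_extend, H[i][j - 1] + gap_open)
            F[i][j] = max(F[i - 1][j] + gap_extend, H[i - 1][j] + gap_open)
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0.0, H[i - 1][j - 1] + sub, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    # "abc" and "def" match on both sides of a two-character gap.
    print(smith_waterman_gotoh("abcxxdef", "abcdef"))
```

With these values a two-character gap costs 2.5 in total, so the alignment that bridges the gap scores higher than either matching half on its own.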
Remedy-Based Similarity Metrics
According to TCM theory, the same or similar symptoms are always treated by the same or similar groups of remedies (i.e. the corresponding remedies of the symptoms). For example, two similar symptom names that both represent throat pain in TCM are both treated by the common remedies 'Honeysuckle', 'Chrysanthemum' and 'Fructus Arctii'. The corresponding remedies of a standard and a clinical symptom name are therefore used to determine whether the two names express the same symptom. Three remedy-based similarity metrics are proposed below to measure the similarity between a standard and a clinical symptom name using their corresponding remedies.
Set-Based Similarity Metric
The Set-Based similarity metric adopts the Jaccard coefficient to measure the similarity between a standard and a clinical symptom name using their corresponding remedy sets. It is given by the following formula.

$$\mathrm{Sim}(s, s') = \frac{|R \cap R'|}{|R \cup R'|}$$

where s and s' are a standard and a clinical symptom name respectively, R and R' are their corresponding remedy sets, |R ∪ R'| is the number of elements in the union of R and R', and |R ∩ R'| is the number of elements in the intersection of R and R'.
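The Jaccard coefficient on remedy sets can be sketched in a few lines of Python. The two remedy sets in the usage example are hypothetical and only reuse remedy names mentioned in this section; returning 0 when both sets are empty is an assumption, since the text does not cover that case.

```python
from typing import Set


def set_based_similarity(remedies_a: Set[str], remedies_b: Set[str]) -> float:
    """Jaccard coefficient of two remedy sets."""
    union = remedies_a | remedies_b
    if not union:
        # Assumed convention: two empty remedy sets give similarity 0.
        return 0.0
    return len(remedies_a & remedies_b) / len(union)


if __name__ == "__main__":
    # Hypothetical remedy sets for a standard and a clinical symptom name.
    standard = {"Honeysuckle", "Chrysanthemum", "Fructus Arctii"}
    clinical = {"Honeysuckle", "Chrysanthemum", "Hawthorn"}
    print(set_based_similarity(standard, clinical))  # 2 shared / 4 in total = 0.5
```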
Vector-Space-Model-Based Similarity Metric
In TCM, remedies are not equally potent against different symptoms. Some remedies are often used to treat one symptom and seldom used to treat others, so the appearance of such a remedy is important evidence for distinguishing that symptom from the others. However, the Set-Based similarity metric neither measures nor uses the importance of a remedy to a particular symptom; it presupposes that all remedies are equally important for all symptoms. To estimate the importance of a remedy to a particular symptom, the TF-IDF weighting scheme is applied as follows.
Let $s_i$ be a symptom name, $R_i$ be its corresponding remedy bag containing all the occurrences of remedies in the prescriptions with the symptom name $s_i$, and R be the set of all remedies in TCM. For any $r_j \in R$, its weight $w_{i,j}$ for $s_i$ is defined as follows:

$$w_{i,j} = f_{i,j} \times \log\frac{|R|}{df_j}$$

where $f_{i,j}$ is the frequency of occurrence of $r_j$ in $R_i$, |R| is the number of remedies in R, and $df_j$ is the number of symptom names whose corresponding remedy bags contain $r_j$.
Thus every symptom name is naturally described by a vector of weighted remedies in a multi-dimensional space. For a standard symptom name $s_m$ and a clinical symptom name $s_n$ whose corresponding remedy bags are $R_m$ and $R_n$, the following vectors are used to describe $R_m$ and $R_n$.

$$\vec{R}_m = (w_{m,1}, w_{m,2}, \ldots, w_{m,|R|}), \qquad \vec{R}_n = (w_{n,1}, w_{n,2}, \ldots, w_{n,|R|})$$
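Then the similarity between $s_m$ and $s_n$ can be measured by the cosine metric defined below.

$$\mathrm{Sim}(s_m, s_n) = \frac{\sum_{j=1}^{|R|} w_{m,j}\, w_{n,j}}{\sqrt{\sum_{j=1}^{|R|} w_{m,j}^2}\ \sqrt{\sum_{j=1}^{|R|} w_{n,j}^2}}$$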
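The weighting and the cosine comparison can be sketched together in Python. The sketch assumes the weight is the frequency $f_{i,j}$ multiplied by the logarithm of |R| over $df_j$, using the symbols defined above, and it takes R to be the set of remedies observed in the supplied bags; both choices are assumptions, as are the function names and the toy data.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(bags: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Map each symptom name to a sparse vector {remedy: weight}."""
    all_remedies = {r for bag in bags.values() for r in bag}
    # df[r]: number of symptom names whose remedy bag contains r.
    df = Counter(r for bag in bags.values() for r in set(bag))
    vectors = {}
    for name, bag in bags.items():
        freq = Counter(bag)  # f_{i,j}: occurrences of each remedy in the bag
        vectors[name] = {
            r: f * math.log(len(all_remedies) / df[r]) for r, f in freq.items()
        }
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors; 0 if either is zero."""
    dot = sum(w * v.get(r, 0.0) for r, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical remedy bags; repeated entries are repeated prescriptions.
    bags = {
        "standard name": ["A", "A", "B"],
        "clinical name": ["A", "B", "B"],
        "unrelated name": ["C"],
    }
    vecs = tfidf_vectors(bags)
    print(cosine(vecs["standard name"], vecs["clinical name"]))
    print(cosine(vecs["standard name"], vecs["unrelated name"]))
```

Storing each vector as a dictionary keeps only the remedies that actually occur in a bag, which suits data in which most symptom names use a small part of the remedy vocabulary.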
SimRank-Based Similarity Metric
The Set-Based and Vector-Space-Model-Based similarity metrics presuppose that the corresponding remedies are independent. However, this hypothesis may be violated because some remedies are interchangeable, i.e. they have the same or similar effects. For example, the remedies 'Hawthorn' and 'Endothelium Corneum Gigeriae Galli' have the same effect, and both can be used to treat the symptom 'Anorexia'. From the intuition that "two objects are similar if they are related to similar objects" [8], the following observation is derived: two symptom names may be the same or similar if they have the same or similar corresponding remedies, and two remedies are similar (or have similar curative effects) if they are used to treat the same or similar symptoms. Following this observation and based on the SimRank algorithm [8], the mutually recursive computation of SimS (the similarity of two symptom names) and SimR (the similarity of two remedy names) is described as follows.
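(1) Initialize SimS and SimR as follows.

$$\mathrm{SimS}_0(s, s') = \begin{cases} 1, & s = s' \\ 0, & s \ne s' \end{cases} \qquad \mathrm{SimR}_0(r, r') = \begin{cases} 1, & r = r' \\ 0, & r \ne r' \end{cases}$$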
(2) Iteratively update SimS and SimR using the formulas below until the termination condition is met.

$$\mathrm{SimS}_k(s, s') = \frac{C}{|R|\,|R'|} \sum_{i=1}^{|R|} \sum_{j=1}^{|R'|} \mathrm{SimR}_{k-1}(r_i, r'_j), \qquad s \ne s'$$

$$\mathrm{SimR}_k(r, r') = \frac{C}{|S|\,|S'|} \sum_{i=1}^{|S|} \sum_{j=1}^{|S'|} \mathrm{SimS}_{k-1}(s_i, s'_j), \qquad r \ne r'$$

with $\mathrm{SimS}_k(s, s) = 1$ and $\mathrm{SimR}_k(r, r) = 1$. Here k denotes the k-th iteration (k ≥ 1); R and R' are the corresponding remedy sets of the symptom names s and s' respectively; |R| and |R'| are the sizes of R and R'; and $r_i$ and $r'_j$ are the i-th and the j-th remedies in R and R'. Similarly, S and S' are the corresponding symptom name sets of r and r' (S and S' both contain standard symptom names as well as clinical symptom names); |S| and |S'| are the sizes of S and S'; and $s_i$ and $s'_j$ are the i-th and the j-th symptom names in S and S'. C, called the 'confidence level' or 'decay factor', is a constant between 0 and 1 (see [8] for its meaning and justification). SimRank is described in detail in [8]. In this paper, the iterative procedure terminates when k equals 4.
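Steps (1) and (2) can be sketched in Python as a bipartite SimRank over symptom names and remedies. The sketch assumes the usual SimRank update from [8], in which each similarity at iteration k is C times the average similarity of the neighbours at iteration k − 1, and an object is always fully similar to itself. The decay factor of 0.8 is an assumed value, because the text does not state C; the four iterations follow the text. The function names and the toy data are hypothetical.

```python
from itertools import product
from typing import Dict, Set, Tuple

Pairs = Dict[Tuple[str, str], float]


def _identity(items) -> Pairs:
    """Step (1): similarity 1 for identical objects and 0 otherwise."""
    return {(a, b): 1.0 if a == b else 0.0 for a, b in product(items, repeat=2)}


def _update(neighbours: Dict[str, Set[str]], other: Pairs, C: float) -> Pairs:
    """Step (2) for one side, using the other side's previous similarities."""
    new: Pairs = {}
    for a, b in product(neighbours, repeat=2):
        na, nb = neighbours[a], neighbours[b]
        if a == b:
            new[(a, b)] = 1.0
        elif not na or not nb:
            new[(a, b)] = 0.0  # assumed: no neighbours means no evidence
        else:
            total = sum(other[(x, y)] for x in na for y in nb)
            new[(a, b)] = C * total / (len(na) * len(nb))
    return new


def simrank(symptom_remedies: Dict[str, Set[str]], C: float = 0.8,
            iterations: int = 4) -> Tuple[Pairs, Pairs]:
    """Return (SimS, SimR) after a fixed number of iterations."""
    remedy_symptoms: Dict[str, Set[str]] = {}
    for symptom, remedies in symptom_remedies.items():
        for remedy in remedies:
            remedy_symptoms.setdefault(remedy, set()).add(symptom)
    sim_s = _identity(symptom_remedies)
    sim_r = _identity(remedy_symptoms)
    for _ in range(iterations):
        # Both updates read the values of the previous iteration.
        sim_s, sim_r = (_update(symptom_remedies, sim_r, C),
                        _update(remedy_symptoms, sim_s, C))
    return sim_s, sim_r


if __name__ == "__main__":
    # Hypothetical data: two names with different remedies and a third name
    # that is treated by both remedies.
    data = {
        "name A": {"Hawthorn"},
        "name B": {"Endothelium Corneum Gigeriae Galli"},
        "Anorexia": {"Hawthorn", "Endothelium Corneum Gigeriae Galli"},
    }
    sim_s, sim_r = simrank(data)
    print(sim_s[("name A", "name B")])
    print(sim_r[("Hawthorn", "Endothelium Corneum Gigeriae Galli")])
```

In this toy example "name A" and "name B" share no remedy, so the Set-Based metric scores them 0, whereas the recursion gives them a positive similarity because their remedies treat a common symptom.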
Hybrid Similarity Metrics
Literal and remedy-based similarity metrics each have advantages, but each also has disadvantages. Literal similarity metrics cannot distinguish symptom names that are literally similar but have different or even opposite meanings. Remedy-based similarity metrics can find similar symptom names that are cured by similar remedies, but they ignore the literal characteristics of symptom names.
Therefore, a hybrid strategy that integrates literal similarity and remedy-based similarity is investigated so that each compensates for the disadvantages of the other. The strategy is drawn from the following observation.
Observation: Two Symptom names expressing the same symptom have similar corresponding Remedies; at the same time, the Symptom names should be literally Similar (named SRSS).
According to the observation, the hybrid strategy (i.e. SRSS) is constructed as follows.

$$\mathrm{SRSS}(s, s') = \alpha \, \mathrm{Sim}_{L}(s, s') + \beta \, \mathrm{Sim}_{RB}(s, s')$$

where s and s' are a standard and a clinical symptom name respectively; α and β are the weights of $\mathrm{Sim}_{L}(s, s')$ and $\mathrm{Sim}_{RB}(s, s')$; $\mathrm{Sim}_{L}(s, s')$ denotes literal similarity, which can be computed with any of the literal similarity metrics discussed above; and $\mathrm{Sim}_{RB}(s, s')$ denotes remedy-based similarity, which can be defined by any of the remedy-based similarity metrics. Instantiating $\mathrm{Sim}_{L}(s, s')$, $\mathrm{Sim}_{RB}(s, s')$ and their weights yields a particular hybrid similarity metric.
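A hybrid metric of this kind can be sketched as a weighted sum of one literal score and one remedy-based score. The linear form, the equal default weights and the function name `srss` are assumptions for illustration; the text leaves the weights to be chosen when the metric is instantiated.

```python
def srss(sim_literal: float, sim_remedy: float,
         alpha: float = 0.5, beta: float = 0.5) -> float:
    """Weighted combination of a literal and a remedy-based similarity.

    alpha and beta are assumed default weights, not values from the text.
    """
    return alpha * sim_literal + beta * sim_remedy


if __name__ == "__main__":
    # Hypothetical scores: two names that look alike (0.9) but are treated
    # by different remedies (0.1) receive a moderate hybrid score.
    print(srss(0.9, 0.1))
```

Any of the literal metrics (JD, JWD, SWD, SWGD) can supply the first argument and any of the remedy-based metrics the second, provided both are scaled to the same range before they are combined.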