To address the problems of the prior structural kernel, we first examined the effectiveness of each main feature of the walk kernel, which showed the best performance in our previous work, and then modified the dependency kernel so that it can accept the features of the walk kernel and partial path matches.

In the modified version, we treat each type of substructure with different importance.

For this, we classify the types of substructures into several categories and enhance the learning performance by allowing different weights or counts according to the types of common dependency substructures that two relation instances share. Next, we treat the shortest path strings as strings and introduce string kernels such as the spectrum kernel, the subsequence kernel, and the gap-weighted kernel. Finally, we propose the walk-weighted subsequence kernel, which addresses not only the previous problems but also non-contiguous structures and structural importance not covered by the previous kernels.

### Walk types

We start the kernel modification by reconsidering the properties of *walks*. In the walk kernel, structural information is encoded with *walks* of graphs. Given *v* ∈ *V* and *e* ∈ *E*, a walk can be defined as an alternating sequence of vertices and edges, *v*_{i}, *e*_{i,i+1}, *v*_{i+1}, *e*_{i+1,i+2}, ..., *v*_{i+n-1}, which begins and ends with a vertex, where *V* and *E* are the sets of vertices (nodes) and edges (relations), respectively. Among all possible subsets of walks on the shortest path between a pair of NEs, we took into consideration walks of length 3, *v*_{i}, *e*_{i,i+1}, *v*_{i+1}, which we call *v*-walks. Likewise, we defined the *e*-walk, which starts and ends with an edge: *e*_{i,i+1}, *v*_{i+1}, *e*_{i+1,i+2}. An *e*-walk is not actually a walk as defined in graph theory, but we use it to capture contextual syntactic structures as well. We utilized both lexical walks and syntactic walks for each of the *v*-walks and *e*-walks: a lexical walk consists of lexical words and their dependency relations on a lexical dependency path, as in Figure 2c, while a syntactic walk consists of POS tags and their dependency relations on a syntactic dependency path. With this walk information, we can capture structural context information. This path-based walk representation makes it easy to incorporate structural information into the learning scheme, because a path reflects the dependency relation map between the words on it.
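As an illustration, the splitting of an alternating shortest-path sequence into *v*-walks and *e*-walks can be sketched as follows; the `extract_walks` helper and the example path are ours, not part of the original system:

```python
def extract_walks(path):
    """Split an alternating vertex/edge shortest-path sequence into
    length-3 v-walks (vertex, edge, vertex) and e-walks (edge, vertex, edge).
    Vertices are assumed to sit at even positions, edges at odd positions."""
    v_walks, e_walks = [], []
    for i in range(len(path) - 2):
        triple = tuple(path[i:i + 3])
        if i % 2 == 0:          # starts on a vertex -> v-walk
            v_walks.append(triple)
        else:                   # starts on an edge -> e-walk
            e_walks.append(triple)
    return v_walks, e_walks

# Hypothetical lexical path in the spirit of Figure 2c:
path = ["entity1", "sub(UP)", "control", "comp_by(DN)", "entity2"]
v, e = extract_walks(path)
# v -> [("entity1","sub(UP)","control"), ("control","comp_by(DN)","entity2")]
# e -> [("sub(UP)","control","comp_by(DN)")]
```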

### Different properties of two walks

In this work, we focus on the different structural properties of the *v*-walk and the *e*-walk. The *v*-walk shows a labeled relationship from a head to its modifier; thus, it captures a direct dependency relationship between two words or POS tags. The *e*-walk, on the other hand, describes the immediate dependency structure around a node. If the node is a predicate, the *e*-walk is closely connected to subcategorization information, which is important in the semantic role labeling task for discovering the predicate-argument structure of a given sentence.

In Figure 2c, the *e*-walk "*sub(UP)-control-comp_by(DN)*" shows the argument structure of the predicate verb "*control*". In this case, one entity fills the "*subject*" argument of "*control*" and the other entity directly or indirectly fills the "*comp_by*" role. If an instance holds such a dependency structure with respect to the predicate "*control*", it is very likely that the two NEs in the structure have a genic relation. The semantic relations among predicates and their modifiers are clearly helpful for relation extraction. According to [28], the F-score was improved by 15% when semantic role information was incorporated into the information extraction system.

Thus, we evaluated each walk type's contribution to interaction extraction. For this, we conducted an experiment restricting the walk kernel to operate with a single walk type. As shown in Table 3, we achieved a quite competitive result with *e*-walk information alone. This result demonstrates that the *e*-walk contributes more than the *v*-walk to the overall similarity for relation learning, since it is related to semantic role information. However, *e*-walk style structural information is excluded from the previous dependency kernel, which is one of the reasons for its low performance. Therefore, such information should be treated as prior knowledge, and *e*-walks should be regarded as more significant structures among the subpaths.

### Modified dependency kernel

The dependency kernel directly computed the structural similarity between two graphs by counting common subgraphs. However, our previous dependency kernel focused strictly on *v*-walks, that is, on the direct dependencies between pairs of nodes, so *e*-walk style structural information was excluded. Two nodes matched only when the nodes were the same and their direct child nodes, together with the dependency types from the nodes to those children, also matched. Thus, we extend the kernel by allowing partial matches in addition to *e*-walks, with an extra factor ensuring that partial matches receive lower weights than complete path matches.

In the extended dependency kernel, partial matches such as single word/POS matches and node-edge or edge-node matches are counted, as well as *v*-walks. Moreover, the matches are all weighted differently. Before we explain the matching function, we introduce some notation. For each node *x*, *word*(*x*) is the word at the node and *POS*(*x*) is the POS of the node. *children*_{w}(*n*) denotes the word dependency list of word *n*, and *children*_{p}(*p*) refers to the POS dependency list of POS *p*. *children*_{w}(*n*) is the set of (*relation*, *word*) pairs that are direct modifiers of *n*; in a similar way, *children*_{p}(*p*) is the set of (*relation*, *pos*) pairs that are direct modifiers of POS *p*. In addition, *sc*_{w}(*n*_{1}, *n*_{2}) and *sc*_{p}(*p*_{1}, *p*_{2}) denote the sets of common dependencies between the two subgraphs rooted at *n*_{1} and *n*_{2}, and at POS *p*_{1} and *p*_{2}, respectively. We can define the sets of common dependencies between two graphs as follows:

That is, (*x*, *y*) can be an element of the set *sc*_{w}(*n*_{1}, *n*_{2}) only when the direct child nodes *x* and *y* of the two parent nodes are the same word and also have the same dependency relation with their parents *n*_{1} and *n*_{2}. For subcategorization information, *subcat*_{w}(*x*) is used to refer to the subcategorization pair of a word *x*, composed of its left and right edges; this is the same information as an *e*-walk. The matching function *C*_{w}(*n*_{1}, *n*_{2}) is the number of common subgraphs rooted at *n*_{1} and *n*_{2}.

The similarity is recursively computed over the common dependent child word pairs in the set *sc*_{w}(*n*_{1}, *n*_{2}), starting from the root nodes. As a result, we can calculate *C*_{w} as follows:

The matching function was devised to count common subgraphs while considering their structural importance. In the definition, if the set of common dependent child word pairs is empty but the two nodes have the same subcategorization value, the matching function returns 3.0. If *n*_{1} or *n*_{2} has no child but the two nodes are the same word, *C*_{w}(*n*_{1}, *n*_{2}) returns 1.0. If *n*_{1} or *n*_{2} has no child and the two nodes are different words, *C*_{w}(*n*_{1}, *n*_{2}) returns 0. The last two cases recursively call *C*_{w} over the common dependent word pairs in the set *sc*_{w}(*n*_{1}, *n*_{2}), with *C*_{w} weighted by a larger value when the two nodes share the same subcategorization information.
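A minimal sketch of this recursive matching, under a simplified node representation of our own: a node is a `(word, subcat, children)` tuple, where `subcat` stands in for *subcat*_{w}(*x*) (the edge pair around the node, i.e. its *e*-walk) and `children` is a list of (relation, child node) pairs. The recursive weights 3.0/2.0 are illustrative choices, not the paper's exact constants:

```python
def sc_w(n1, n2):
    """Common dependent child pairs: same relation and same child word."""
    _, _, kids1 = n1
    _, _, kids2 = n2
    return [(x, y) for r1, x in kids1 for r2, y in kids2
            if r1 == r2 and x[0] == y[0]]

def C_w(n1, n2):
    """Weighted count of common subgraphs rooted at n1 and n2."""
    w1, sub1, _ = n1
    w2, sub2, _ = n2
    common = sc_w(n1, n2)
    if not common:
        if sub1 is not None and sub1 == sub2:
            return 3.0                    # bare subcat (e-walk) match
        return 1.0 if w1 == w2 else 0.0   # bare word match / no match
    # Recursive cases: weight the sum more when the subcats also agree.
    weight = 3.0 if (sub1 is not None and sub1 == sub2) else 2.0
    return weight * sum(C_w(x, y) for x, y in common)

leaf = lambda w: (w, None, [])
t1 = ("control", ("sub(UP)", "comp_by(DN)"), [("sub", leaf("sigma"))])
t2 = ("control", ("sub(UP)", "comp_by(DN)"), [("sub", leaf("sigma"))])
print(C_w(t1, t2))   # 3.0: one common child pair, subcat-weighted
```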

Such isomorphism between two graphs is identified in terms of common POS dependencies in addition to common word dependencies. In a similar way, *C*_{p}(*p*_{1}, *p*_{2}) is applied to common POS dependency subgraphs rooted at POS *p*_{1} and *p*_{2}. In the case of the syntactic dependency path, however, the subcategorization information is excluded, as follows:

Since *C*_{w} and *C*_{p} have different properties, *C*_{w} relating to lexical subgraphs and *C*_{p} to morpho-syntactic subgraphs, *C*_{w} is weighted more heavily than *C*_{p}. Finally, the dependency kernel evaluates the similarity of two graphs as the composition of syntactic dependencies and lexical dependencies as follows:

Formula (6) enumerates all matching nodes of the two graphs, *d*_{1} and *d*_{2}. It is a summation of the common word dependency subgraphs and the common POS dependency subgraphs between the two graphs.

As a result, the F-score improved from 60.4 to 69.4 on the LLL dataset (Table 3), compared with the previous dependency kernel. The use of partial path matches and subcategorization information was helpful, but the result is still worse than that of the walk kernel. In order to maintain direct dependency structures, this kernel excluded the non-contiguous subpaths on the shortest path, which can be important in relation learning. Thus, we introduce string kernels to handle such non-contiguous subpaths.

### String kernels

In this section, we examine string kernels from various structural perspectives. First, we briefly introduce concepts and notation for string kernels. The string kernel was first addressed in the text classification task by [30]. The basic idea is to compare text documents by means of the substrings they contain: the more substrings they have in common, the more similar they are. A string is defined as any finite sequence of symbols drawn from a finite alphabet, and string kernels concern occurrences of subsequences or substrings in strings. In general, for a given string *s*_{1}*s*_{2}...*s*_{n}, a *substring* denotes a string *s*_{i}*s*_{i+1}...*s*_{j-1}*s*_{j} that occurs contiguously within the string, while a *subsequence* indicates an arbitrary string *s*_{i}*s*_{j}...*s*_{k} whose characters occur contiguously or non-contiguously.
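The distinction can be made concrete with a small example over ordinary characters; `substrings` and `subsequences` are illustrative helpers of ours, not notation from the kernel literature:

```python
from itertools import combinations

def substrings(s, p):
    """All contiguous p-length substrings of s."""
    return {s[i:i + p] for i in range(len(s) - p + 1)}

def subsequences(s, p):
    """All p-length subsequences of s (gaps allowed, order preserved)."""
    return {"".join(s[i] for i in idx)
            for idx in combinations(range(len(s)), p)}

print(sorted(substrings("abcd", 2)))    # ['ab', 'bc', 'cd']
print(sorted(subsequences("abcd", 2)))  # adds the gapped 'ac', 'ad', 'bd'
```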

So far, we have re-represented the shortest path strings with meaningful substructures such as walks. In this work, we also treat a shortest path string, like that in Figure 2c, as a string itself and compare the strings directly. In our data representation, the nodes and edges of a shortest path string correspond to the alphabet of a string; that is, a finite alphabet set A consists of the word or POS symbols and the dependency relation symbols of the shortest path strings, and the string kernels operate on the shortest path strings. The kernels consider both the lexical shortest path string and the syntactic shortest path string. We gradually enlarge the kernels, from the spectrum kernel to the weighted subsequence kernel, to perform a more comprehensive comparison between two shortest path strings.

### Spectrum kernel

First, we performed string comparisons with a simple string kernel. One way to compare two strings is to count how many *p*-length contiguous substrings they have in common. This is called the spectrum kernel of order *p*, or *p*-spectrum kernel. We borrowed the notation from [27]. Such a *bag-of-characters* representation is the most widely used in natural language processing. However, its major shortcoming is its structural simplicity: all features represent only local information. In Equation (7), *K*_{s}(*s*_{1}, *s*_{2}) denotes the number of common *p*-substrings between two shortest path strings, *s*_{1} and *s*_{2}.

The string *s*(*i* : *i* + *p*) means the *p*-length substring *s*_{i}...*s*_{i+p-1} of *s*. In this work, we fixed the order of the spectrum to 3 and summed *K*_{s} of the lexical dependency path string and *K*_{s} of the syntactic dependency path string for the common substring counting. With this kernel, we can consider substructures such as the one shown in Figure 1f. As a result, we achieved an F-score of 70.5 on the LLL data (Table 3). In spite of its structural simplicity, the result was quite promising: it was better than the performance of the extended dependency kernel. We could obtain a reasonable performance with only the contiguous dependencies on the shortest path string.
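A minimal sketch of the 3-spectrum count over path symbols; `spectrum_kernel` and the example shortest-path strings are ours, chosen only for illustration:

```python
from collections import Counter

def spectrum_kernel(s, t, p=3):
    """K_s(s, t): number of common p-length contiguous substrings, counted
    with multiplicity. Symbols here are path elements (words/POS and
    dependency relations), not individual characters."""
    cs = Counter(tuple(s[i:i + p]) for i in range(len(s) - p + 1))
    ct = Counter(tuple(t[i:i + p]) for i in range(len(t) - p + 1))
    return sum(cs[u] * ct[u] for u in cs if u in ct)

s = ["stimulate", "obj(DN)", "transcription", "comp_from(DN)", "promoter"]
t = ["stimulate", "obj(DN)", "transcription", "mod(DN)", "gene"]
print(spectrum_kernel(s, t))   # 1: only (stimulate, obj(DN), transcription)
```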

### Fixed-length subsequence kernel

In the spectrum kernel, substructures such as "*stimulate-obj(DN)~comp_from(DN)*", which have gaps in them, are excluded from the structural comparison. In order to cover such substructures, we tested the subsequence kernel, whose feature mapping is defined by all contiguous or non-contiguous subsequences of a string. Unlike the spectrum kernel, the subsequence kernel allows gaps between characters; that is, some characters can intervene within a matching subsequence. Thus, this kernel can cover substructures like the one in Figure 1g. The substructure "*stimulate-obj(DN)~comp_from(DN)*" can match phrases such as "*stimulate-obj(DN)-any other noun-comp_from(DN)*", which use other nouns instead of "*transcription*". The advantage of the kernel is that we can exploit long-range dependencies existing on strings. Like the spectrum kernel, we reduce the dimension of the feature space by considering only fixed-length subsequences. This kernel is defined via the feature map from the space of all finite sequences drawn from A to the vector space indexed by the set of *p*-length subsequences derived from A. We define A^{p} as the set of all subsequences of length *p*. We denote the length of a string *s* = *s*_{1}*s*_{2}...*s*_{|s|} by |*s*|. A string *u* is a subsequence of *s* if there exists an index sequence **i** = (*i*_{1}, ..., *i*_{|u|}) with 1 ≤ *i*_{1} < ... < *i*_{|u|} ≤ |*s*| such that *u*_{j} = *s*_{i_j} for *j* = 1, ⋯, |*u*|. We use a boldface letter **i** to indicate an index sequence *i*_{1}...*i*_{|u|} for a string, and the subsequence *u* of a string *s* is denoted by *u* = *s*[**i**] for short. That is, *u* is a subsequence of *s* in the positions indexed by **i** and equals *s*_{i_1}...*s*_{i_{|u|}}.

Then, the feature coordinate function *ϕ*_{u}(*s*) is used to denote the count of how many times the substring *u* occurs, contiguously or non-contiguously, in the input string *s*. *I*_{p} is the set of index sequences of length *p*, and *ϕ*_{u}^{p}(*s*) indicates the count of how many times a substring *u* of length *p* occurs. Consider two strings *s* and *t* to be compared, where the feature space is generated by all subsequences of length *p* derived from the shortest path strings to be classified. Then, the overall inner product between them can be expressed as follows:

We choose 3 as the length parameter *p*. Despite the positive aspect of the subsequence kernel, that it considers non-contiguous subsequences as well as contiguous substrings, the performance was not satisfactory. It improved over the spectrum kernel to some extent, but it only matched the walk kernel, with an F-score of 77.5 on LLL, which was the best result in our previous study.
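A naive (non-dynamic-programming) sketch of the fixed-length subsequence count on hypothetical paths, showing the gapped matches that the spectrum kernel misses; the function name and example strings are ours:

```python
from collections import Counter
from itertools import combinations

def subsequence_kernel(s, t, p=3):
    """Fixed-length subsequence kernel, naive version: counts common
    p-length subsequences (gaps allowed), with multiplicity."""
    def phi(x):
        return Counter(tuple(x[i] for i in idx)
                       for idx in combinations(range(len(x)), p))
    ps, pt = phi(s), phi(t)
    return sum(ps[u] * pt[u] for u in ps if u in pt)

# The two paths share no contiguous triple (transcription vs expression),
# but share four gapped triples over {stimulate, obj, comp_from, promoter}.
s = ["stimulate", "obj(DN)", "transcription", "comp_from(DN)", "promoter"]
t = ["stimulate", "obj(DN)", "expression", "comp_from(DN)", "promoter"]
print(subsequence_kernel(s, t))   # 4
```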

### Gap-weighted fixed-length subsequence kernel

In the subsequence kernel, all substructures are counted equally: it does not matter whether a subsequence is contiguous or non-contiguous, or how spread out its occurrences are. Thus, the gap-weighted subsequence kernel was tested so as to reflect the degree of contiguity of a subsequence. In this kernel, the two substructures "*stimulate-obj(DN)-transcription*" and "*stimulate~comp_from(DN)-promoter*" have different weights. For the purpose of tuning the weight to reflect how many gaps there are between characters, a decay factor *λ* (0 < *λ* ≤ 1) is introduced. It penalizes non-contiguous substring matches; that is, the further apart the beginning and the end of a substring are, the more it is penalized. Contiguous substring matches are assumed to be coherent and to affect the overall meaning of the shortest path string more. The feature coordinate function is changed into a weighted count of subsequence occurrences as follows:

The count is down-weighted by the total length of gaps. *l*(**i**) denotes the span length of the indices **i**, *i*_{|u|} - *i*_{1} + 1. The similarity value between two subsequences is decreased by the factors of *l*(**i**) and *l*(**j**), reflecting how spread out the subsequences are. The inner product between two strings *s* and *t* over A^{p} is a sum over all common fixed-length subsequences, weighted according to their frequency of occurrence and their lengths.

That is, this kernel function computes all matched subsequences of *p* symbols between two strings, and each occurrence is weighted according to its span. In general, a direct computation of all subsequences is inefficient even for a small value of *p*; for an efficient computation, the dynamic programming algorithm of [30] was used. In this paper, we do not explain the details of the efficient recursive kernel computation. We set *λ* to 0.5 and fixed the index set as *U* = A^{3} (subsequences of three node or edge symbols on the shortest path string). If we choose *λ* = 1, the weights of all occurrences are 1 regardless of *l*; in that case, the kernel is equivalent to the fixed-length subsequence kernel, which counts every common subsequence identically as 1. As a result, the F-score (70.2) was lower than that of the subsequence kernel, even though this kernel offers a more comprehensive weighting scheme depending on the dependency distance of each subsequence. The inclusion of gap weighting for substrings was not very effective.
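A naive sketch of the gap weighting (the actual system uses the dynamic programming algorithm of [30]; this direct enumeration and the example paths are only for illustration):

```python
from itertools import combinations

def gap_weighted_kernel(s, t, p=3, lam=0.5):
    """Gap-weighted fixed-length subsequence kernel, naive version: each
    occurrence of a p-length subsequence at indices i is weighted
    lam ** l(i), where l(i) = i_p - i_1 + 1 is the span length.
    With lam = 1 it reduces to the plain fixed-length subsequence kernel."""
    def phi(x):
        feats = {}
        for idx in combinations(range(len(x)), p):
            u = tuple(x[i] for i in idx)
            span = idx[-1] - idx[0] + 1            # l(i)
            feats[u] = feats.get(u, 0.0) + lam ** span
        return feats
    ps, pt = phi(s), phi(t)
    return sum(ps[u] * pt[u] for u in ps if u in pt)

s = ["stimulate", "obj(DN)", "transcription", "comp_from(DN)", "promoter"]
t = ["stimulate", "obj(DN)", "expression", "comp_from(DN)", "promoter"]
print(gap_weighted_kernel(s, t, lam=1.0))  # 4.0, the plain subsequence count
print(gap_weighted_kernel(s, t, lam=0.5))  # far smaller: gapped matches decay
```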

### Walk-weighted fixed-length subsequence kernel

In order to improve on the gap-weighted subsequence kernel, we devise the walk-weighted subsequence kernel, which can treat structural properties differently in addition to considering contiguous and non-contiguous substrings. Like the gap-weighted subsequence kernel, this kernel assigns a different weight to each subsequence. However, it assigns the weights not by the lengths of gaps, but by subsequence type. We give more weight to contiguous subsequences than to non-contiguous ones, since they are coherent and affect the overall meaning of the shortest path string more. Also, *e*-walks get more weight than *v*-walks, so as to highlight semantic role information. As before, this kernel considers subsequences of length 3.

Formula (11) means that the kernel assigns 3.0 to common contiguous *e*-walk substrings and 2.0 to common contiguous *v*-walk substrings. Non-contiguous subsequences can be penalized by gap weights, but the performance was best when we set *λ* to 1.0; thus, in our experiments, 1.0 was allocated to non-contiguous subsequences regardless of their gaps. These significance values take the types of substructures into account, and we set them experimentally for the best F-score.
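One plausible reading of this weighting can be sketched as follows, assuming vertices sit at even and edges at odd positions of the alternating path string, and that each matched pair of occurrences contributes its type weight once; this is our interpretation for illustration, not an exact reproduction of formula (11):

```python
from collections import Counter
from itertools import combinations

# Type weights: contiguous e-walks 3.0, contiguous v-walks 2.0,
# non-contiguous (gapped) subsequences 1.0, as described in the text.
WEIGHTS = {"e-walk": 3.0, "v-walk": 2.0, "gapped": 1.0}

def walk_weighted_kernel(s, t, p=3):
    def phi(x):
        feats = Counter()
        for idx in combinations(range(len(x)), p):
            u = tuple(x[i] for i in idx)
            if idx[-1] - idx[0] == p - 1:            # contiguous triple
                kind = "e-walk" if idx[0] % 2 == 1 else "v-walk"
            else:
                kind = "gapped"
            feats[(u, kind)] += 1
        return feats
    ps, pt = phi(s), phi(t)
    return sum(WEIGHTS[k] * ps[(u, k)] * pt[(u, k)]
               for (u, k) in ps if (u, k) in pt)

s = ["stimulate", "obj(DN)", "transcription", "comp_from(DN)", "promoter"]
print(walk_weighted_kernel(s, s))   # 14.0 = 1*3.0 + 2*2.0 + 7*1.0
```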

As a result, this kernel showed the best performance (F-score 82.1) for the extraction of genic relations on the LLL data. This result demonstrates that the use of carefully designed string kernels, weighted in terms of the types of common subsequences, is very effective for learning a structured representation.