Walk-weighted subsequence kernels for protein-protein interaction extraction
© Kim et al. 2010
Received: 22 April 2009
Accepted: 25 February 2010
Published: 25 February 2010
The construction of interaction networks between proteins is central to understanding the underlying biological processes. However, since many useful relations are missing from databases and remain hidden in raw text, the study of automatic interaction extraction from text is important in the bioinformatics field.
Here, we suggest two kinds of kernel methods for genic interaction extraction, considering the structural aspects of sentences. First, we improve our prior dependency kernel by modifying the kernel function so that it can involve various substructures in terms of (1) e-walks, (2) partial match, (3) non-contiguous paths, and (4) different significance of substructures. Second, we propose the walk-weighted subsequence kernel to parameterize non-contiguous syntactic structures as well as semantic roles and lexical features, which makes learning structural aspects from a small amount of training data effective. Furthermore, we distinguish the significances of parameters such as syntactic locality, semantic roles, and lexical features by varying their weights.
We addressed the genic interaction problem with various dependency kernels and suggested various structural kernel scenarios based on the directed shortest dependency path connecting two entities. Consequently, we obtained promising results over genic interaction data sets with the walk-weighted subsequence kernel. The results are compared using automatically parsed third party protein-protein interaction (PPI) data as well as perfectly syntactic labeled PPI data.
In recent years, biomedical research has been accelerated by technological advances in biomedical science and a surge of genomic data from the Human Genome Project, and a huge amount of new information has been coming from the related research. In order to manage the proliferation of biomedical data, databases such as SWISS-PROT, BIND, MINT, and UniProt have been developed. However, since a majority of databases still rely on human curators, substantial manual effort is required to cope with the enormous collections of biomedical research and publications. Thus, the development of high-quality information extraction tools that allow scientists and curators to quickly access new discoveries is an important issue in bioinformatics.
One major approach to this issue is to adopt specific types of matching rules or patterns as the core relation discovery operation. The patterns are mainly represented in the form of sequences of words, parts-of-speech (POS), or syntactic constituents [5, 6]. Such pattern-based relation extraction can provide an intuitively easy methodology with high precision, but pattern forms are too rigid to capture semantic/syntactic paraphrases or long-range relations, which leads to low recall rates. Thus, some works have suggested more generalized pattern learning methods to align relevant sentences [7–9].
As an alternative, various kernel methods have been applied to the relation extraction problem. Such methods have, in particular, provided appealing solutions for learning rich structural data such as syntactic parse trees and dependency structures, which cannot be easily expressed via flat features. Kernels are normally designed to capture structural similarities between instances based on the common substructures they share. In most cases, the similarities between structures can be computed efficiently in a recursive manner without explicitly enumerating feature vectors, which enables us to avoid complex feature construction and selection processes [9–14].
This work expands on our previous kernel approaches. Previously, we had addressed problems of genic and PPI extraction between biomedical entities with four kernel methods: predicate kernel and walk kernel (feature-based), dependency kernel (structure-based kernel), and hybrid kernel (composite kernel of structure-based and feature-based kernels). Each kernel captured structural information in a different way to find the relationships between genes/proteins in a sentence. The kernels are based on the shortest path connecting two entities on the syntactic parse trees (dependency graph). We explored the interaction learning problem in the following aspects: (1) efficient data representation, (2) which semantic/syntactic features or substructures on the shortest path linking two interacting entities are useful for relation learning, and (3) how those structures can be incorporated into kernels. The results revealed that the walk kernel, one of the feature-based kernels, showed a very competitive performance on Learning Language in Logic (LLL) data, with an F-score of 77.5.
In the feature-based kernels, the syntactic substructures of relation instances were mapped to flat features. For example, the walk kernel learned interactions through v-walk and e-walk features on the shortest dependency paths. In that kernel, a v-walk feature consisted of (word_1, relation, word_2) and (POS_1, relation, POS_2), and an e-walk feature was composed of (relation_1, word, relation_2) and (relation_1, POS, relation_2), where a word/POS is a node and a syntactic dependency relation between two nodes is an edge on the dependency graph, as shown in Figure 1b and Figure 1c. The dependency relation between a head and its dependent was roughly represented with seven main functions of relations: appos (apposition), comp_prep (prepositional complement), mod (modifier), mod_att (attributive modifier), neg (negation), obj (object) and subj (subject).
On the contrary, the structure-based kernel was represented by structural similarities between relation instances (See Equation (3), (4)), instead of explicit feature enumerations. However, contrary to our expectation, the kernel based on structural isomorphism between the shortest path graphs showed lower performances than the feature-based kernel.
This study starts from the drawbacks of our previous dependency kernel. It showed difficulties in handling the following aspects: (1) e-walks, (2) partial match, (3) non-contiguous paths, and (4) different significance of substructures. First, in the kernel, the structural similarity between two relation instances was recursively captured by comparisons of v-walks (node-edge-node) on their sub-graphs. In other words, v-walks (Figure 1b) were compared, but e-walks (Figure 1c), one of the features in the walk kernel, were not. However, according to our experiments, the e-walk feature practically plays a more important role in determining the relation than the v-walk. Second, fragments such as "subj (UP) stimulate" or "subj (UP) stimulate obj(DN)" shown in Figure 1e were excluded in the substructure comparisons. Thus, two subgraphs matched only when two root nodes and their direct child nodes were the same and the dependency relationships between them were the same. Such a match is referred to as a complete path match. It consists of a connected sub-graph with at least two words. However, the cases where a series of node-edge-nodes between two graphs are identical can be sparse. Thus, we additionally consider fragment structures in the learning framework, which are incomplete graphs like those in Figure 1e. This is referred to as a partial path match. Third, the kernel considered only internally contiguous dependency substructures on the shortest path. However, non-contiguous substructures can be important in genic interaction. For example, a non-contiguous relation between two entities such as "stimulate~comp_from(DN) promoter" in Figure 1g may have an effect on genic interaction. This substructure is called a non-contiguous path. Finally, the kernel counted all common subgraphs equally regardless of their importance, even though some subgraphs have more useful properties for learning than others. The kernel made no distinction between the significances of structures.
We tackle the 4 issues with the new kernels.
The first three substructures mentioned above can be covered by general graph kernels [14, 16]. However, none of the related studies treated the fourth issue from a syntactic structure perspective. The main idea is that each dependency structure has a potentially different property and significance for relation learning according to its type. Thus, we classify the substructures that the directed shortest path encloses and incorporate them into kernels differently according to the types of dependency substructures.
This point differs from other recent kernel approaches that address PPI extraction [9, 11, 14, 16, 17]. Moreover, our system can handle directed interactions in which the roles of entities are separated as agent and target, while many PPI systems assume interactions to be undirected [9, 11, 14, 17–19].
Here, we first evaluate the effectiveness of different dependency subpaths and revise the dependency kernel so as to overcome the problems of complete subgraphs and equal counting for all subgraphs. In order to treat non-contiguous path substructures, we next introduce string kernels that compare two instances (paths) in terms of substrings they contain. Finally, we propose the walk-weighted subsequence kernel which assigns different weights according to the types of common substrings between two shortest path strings. That is, lexical subgraphs and morpho-syntactic subgraphs, e-walk and v-walk, and contiguous and non-contiguous dependencies are all differently handled by this kernel. In the experiments, we evaluated our kernels on the 5 PPI corpora by . In addition, we compared the performances on human-annotated data and automatically parsed data.
In general, a deeper linguistic representation is known to support information extraction well if its accuracy is guaranteed. Thus, many studies on relation extraction have used shallow or full syntactic analysis. In particular, words between two entities are considered to carry important information regarding relationships between the entities. Furthermore, structural information of a sentence affects the relation learning. However, the use of a whole sentential structure can generate noise in learning, since not all constituents in a sentence concern an interaction between two entities. Thus, we need to restrict the structure for interaction learning to the parts that are directly or indirectly relevant to the two entities. One way to do this is to use the shortest path between two entities on a syntactic graph.
As an alternative to the shortest path approach,  suggested the all-dependency-paths kernel to identify protein/gene interactions. In order to extract correct interactions, they represented a parse tree of a sentence with a dependency graph and considered dependencies outside the shortest path connecting two entities as well as dependencies on the shortest path. They assigned a weight of 0.9 to the edges (dependencies) on the shortest path and a weight of 0.3 to other edges. Consequently, the weighting scheme helps emphasize the dependencies on the shortest path without excluding dependencies other than the shortest path. Thus, potentially relevant words outside of the shortest path can be included in the kernel.
However,  reported that subtrees enclosed by the shortest path between two entities still describe their relation better than other subtrees, even though the representation can miss important words outside the shortest path in some cases, as pointed by .
As a feature-based approach,  used various syntactic path features, which are encoded with SVM. They used the predicate argument structures obtained by a head-driven phrase structure grammar (HPSG) parser and a dependency parser, and word context features related to words before, between, and after two interacting NEs.
On the other hand, some works considered only shallow linguistic information concerning word context features without using structural information by parsing.  expressed a relation between two entities by using only words that appear in fore-between, between, and between-after the entities. They utilized neighboring words and their word class sequences to discover the presence of a relation between entities.  extended 's work.
 proposed the composite kernel, which combines previously suggested kernels: the all-paths-dependency kernel of , the bag-of-words kernel of , and the subset tree kernel of . They used multiple parser inputs as well as multiple kernels. The system is the current state-of-the-art PPI extraction system on various PPI corpora. They also boosted system performance by adopting the corpus weighting concept (SVM-CW) .
Recently, the BioNLP 2009 shared task considered more detailed behaviours of bio-molecules as compared to previous PPI research. The main task required the recognition of bio-molecular events, event types, and primary arguments concerning the given proteins. The 8 event categories such as gene expression, transcription, protein catabolism, phosphorylation, localization, binding, and regulation were considered and the best result in the task was 51.59% (F-score).
In order to represent an interaction instance with the shortest traversal path between two entities on the parse graph, we used Dijkstra's algorithm . We first transformed the dependency graph to an undirected graph which allows the edges to be traversed in any direction because every syntactic relation is toward the syntactic head. However, to preserve the original directions of the relations on the graph, we assign a dependent-head edge with "UP" and conversely, "DN" for a head-to-dependent edge. Furthermore, the shortest path string is defined as a sequence of words or parts-of-speech (POS) connected by directed dependency relations, as shown in Figure 2b. The presence of the "PRED" label for a word on the path indicates that the directions of the left or right edges connected to the word are changed. This often occurs in predicative words. Since a key component of semantic relation determination is to identify the semantic arguments filling the roles of predicates, such predicate markers can be an informative feature on predicate argument structures for a given sentence. Figure 2b visualizes the shortest dependency path linking "ywhE" and "sigF". It is reused as its path string form (Figure 2c) for string kernels and as a dependency list form (Figure 2d) for the dependency kernel. The lexicalized dependency path string consists of words and their dependency relations. We also consider the syntactic dependency path string, which consists of POS and their dependency relations, incorporating direction and predicate information. Likewise, a POS dependency list contains pairs of POS and the syntactic relations between the POS and their direct child nodes' POS. For each node on a graph, the word dependency list contains pairs of nodes and the syntactic relations between the nodes and their direct child nodes. The labels of all NE pair instances are represented in word order with "TA" (target-agent), "AT" (agent-target) and "O" (no interaction). 
Figure 2e shows some instances derived from the sentence Figure 2a.
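The path construction described above can be illustrated with a minimal sketch. It assumes that dependencies are given as (head, dependent, relation) triples and that every edge has unit cost; the function name `shortest_dependency_path` and the toy sentence are hypothetical, and the "PRED" marking is not covered here.

```python
import heapq


def shortest_dependency_path(edges, source, target):
    """Return the path as [node, 'relation(DIR)', node, ...] or None.

    `edges` is an iterable of (head, dependent, relation) triples
    (an assumed input format, not the format of the corpora).
    """
    adjacency = {}
    for head, dependent, relation in edges:
        # Dependent-to-head traversal is labelled UP, head-to-dependent DN.
        adjacency.setdefault(dependent, []).append((head, f"{relation}(UP)"))
        adjacency.setdefault(head, []).append((dependent, f"{relation}(DN)"))

    # Dijkstra's algorithm with unit edge costs on the undirected graph.
    best = {source: 0}
    previous = {}
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, label in adjacency.get(node, []):
            new_cost = cost + 1
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                previous[neighbour] = (node, label)
                heapq.heappush(queue, (new_cost, neighbour))

    if target not in best:
        return None

    # Walk back from the target, collecting nodes and edge labels.
    path = [target]
    node = target
    while node != source:
        node, label = previous[node]
        path.extend([label, node])
    path.reverse()
    return path


# Toy dependency graph (hypothetical, not taken from the LLL data).
toy_edges = [
    ("stimulates", "GerE", "subj"),
    ("stimulates", "cotB", "obj"),
    ("cotB", "promoter", "mod"),
]
toy_path = shortest_dependency_path(toy_edges, "GerE", "cotB")
```

Joining the returned list with spaces gives a lexicalized path string; replacing the words with their POS tags would give the syntactic counterpart.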
For the relation learning, we adopt a kernel approach. A kernel is a similarity function that maps a pair of instances to their similarity score. That is, the kernel K over an object (feature) space X can be represented with a function K: X × X → [0, ∞). Thus, the objects handled by kernel methods are expressed by a similarity matrix. In general, a kernel can be easily computed using inner products between objects without explicit feature handling. Thus, it works well on richly structured representations with a high dimensional feature space, such as graphs or trees.
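As a small illustration of the similarity-matrix view, the sketch below builds a Gram matrix from a simple kernel. The token-count kernel used here is an assumption chosen for brevity and is not one of the kernels proposed in this work; the function names are hypothetical.

```python
from collections import Counter


def token_count_kernel(x, y):
    """Inner product of token-count vectors, computed without
    materializing the full feature vectors."""
    counts_x, counts_y = Counter(x), Counter(y)
    return sum(counts_x[t] * counts_y[t] for t in counts_x if t in counts_y)


def gram_matrix(instances, kernel):
    """Similarity matrix with entry [i][j] = kernel(instances[i], instances[j])."""
    return [[kernel(a, b) for b in instances] for a in instances]


# Three toy instances (hypothetical token sequences).
toy_instances = [["a", "b"], ["b", "c"], ["a", "b"]]
toy_gram = gram_matrix(toy_instances, token_count_kernel)
```

A learner such as an SVM only needs this matrix, which is why the instances themselves can remain structured objects such as paths or graphs.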
In this work, a genic relation pair for kernel functions is represented by the shortest dependency path between two NEs on the syntactic graph. Thus, the proposed kernels compute structural similarities in various ways according to the substructures that two dependency paths contain. Each kernel defines meaningful substructures differently.
We performed experiments on extracting gene and protein interactions from two different data sets, automatically parsed and perfectly parsed data sets. First of all, we used the LLL 05 shared task data for individual evaluation of the kernels that we proposed in this work. The dataset provides dependency syntactic information and directed interactions annotated with agent and target roles. The LLL basically used the Link Grammar Parser  for syntactic analysis and the parsed results were manually corrected. Thus, it is a clean data set. The dependency analysis produced by the parser was simplified to 27 grammatical relations. The typed dependencies are shown in Figure 1. The task provides a separate test set and external evaluations through the web server http://genome.jouy.inra.fr/texte/LLLchallenge/scoringService.php. The task information is available at http://genome.jouy.inra.fr/texte/LLLchallenge/ and further details about the data set are described in .
Example size of the 5 converted PPI corpora by 
Both the perfectly parsed and third party auto-parsed data sets are analyzed into typed dependency forms. However, the two syntactic analyses for relative clauses are remarkably different from each other. In addition, the converted corpora have many more dependency types than LLL, and some typed dependencies in the corpora are redundant. For example, there is no clear-cut distinction between the prenominal relation types "amod" (adjectival modifier), "nn" (nominal modifier) and "dep" (dependent).
To distinguish the perfectly parsed LLL dataset from the auto-parsed LLL dataset, we will refer to the former as "LLL" and the latter by  as "converted LLL". "Converted corpora" is short for the dependency-parsed 5 corpora by .
Using the converted corpora, it is quite difficult to directly retrieve syntactic dependencies for the following reasons: (1) multiple syntactic relations and (2) self cycle relations between two nodes exist, and (3) NE tag information is not reflected in the parsed results, as shown in Figure 3. Self cycle here means that an edge (x, y) and its inverted edge (y, x) coexist between two nodes, x and y.
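The first two difficulties can be detected mechanically. The following sketch assumes dependencies are available as (head, dependent, relation) triples; the function name `find_ambiguous_edges` and the toy edges are hypothetical, and this is not the disambiguation procedure used in the experiments.

```python
from collections import defaultdict


def find_ambiguous_edges(edges):
    """Return (multiple, cycles).

    multiple: {(head, dependent): sorted relations} for node pairs that
              carry more than one relation.
    cycles:   sorted list of node pairs (x, y) for which both the edge
              (x, y) and its inverted edge (y, x) exist.
    """
    relations = defaultdict(set)
    for head, dependent, relation in edges:
        relations[(head, dependent)].add(relation)

    multiple = {
        pair: sorted(rels) for pair, rels in relations.items() if len(rels) > 1
    }
    cycles = sorted(
        {
            tuple(sorted(pair))
            for pair in relations
            if pair[0] != pair[1] and (pair[1], pair[0]) in relations
        }
    )
    return multiple, cycles


# Toy edges (hypothetical) with one multiple relation and one self cycle.
toy_edges = [
    ("binds", "A", "nsubj"),
    ("binds", "A", "dep"),
    ("A", "B", "conj_and"),
    ("B", "A", "conj_and"),
]
toy_multiple, toy_cycles = find_ambiguous_edges(toy_edges)
```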
Extraction performances are evaluated by the F-score. As mentioned before, we can check the performance over the LLL test data through an external evaluation. In contrast, separate test sets and external evaluation schemes are not provided for the other 4 corpora. Thus, the proposed kernel is evaluated by the 10-fold document-level cross-validation used in many recent works [9, 11, 14, 17]. We adopt the same data splitting and evaluation strategy as the study. Also, if the same protein name occurs multiple times in a sentence, the interactions are identified over each occurrence. However, self interactions, in which a single protein interacts with itself, are regarded as negative, whereas in many other works self interactions are not considered as candidates and are removed prior to evaluation.
In this paper, we proposed 5 different kernels. We first investigated structural significances depending on substructure types with our previous walk kernel. Then, for accurate assessments, each kernel's performance was evaluated with the clean dataset, LLL. As a result, the walk-weighted subsequence kernel, which yielded the best performance, was selected for further experiments. We ran the walk-weighted subsequence kernel over the converted LLL data to determine how the use of automatically parsed data affects the performance of the kernel in comparison with the clean dataset. Finally, the walk-weighted subsequence kernel was evaluated over the 5 converted PPI corpora.
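For readers unfamiliar with the setup, the sketch below shows the F-score computation and one simple way to keep all candidate pairs of a document in the same fold. The round-robin fold assignment and the function names are assumptions for illustration; they do not reproduce the exact splits of the cited study.

```python
def f_score(true_positives, false_positives, false_negatives):
    """Balanced F-score: harmonic mean of precision and recall."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def document_level_folds(pair_doc_ids, n_folds=10):
    """Assign each candidate pair to a fold via its document id, so that
    pairs from one document never appear in both training and test data."""
    documents = sorted(set(pair_doc_ids))
    fold_of_doc = {doc: i % n_folds for i, doc in enumerate(documents)}
    return [fold_of_doc[doc] for doc in pair_doc_ids]


# Toy usage (hypothetical document ids, 2 folds for readability).
toy_folds = document_level_folds(["d1", "d1", "d2", "d3", "d2"], n_folds=2)
```

Splitting at the pair level instead would let pairs from the same sentence fall on both sides of the split, which tends to inflate scores.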
Performances of our kernels (LLL)
walk kernel 
dependency kernel 
extended dependency kernel
fixed-length subsequence kernel
gap-weighted subsequence kernel
walk-weighted subsequence kernel
As indicated in Table 3, even the spectrum kernel, which is the simplest string kernel, showed a better result than the dependency kernel. The fixed-length subsequence kernel performed well, but the gap-weighted subsequence kernel, whose subsequences were penalized by their spread extent on strings, was unsatisfactory. The discounting by distance gap was not effective. Similarly, gap weighting for non-contiguous subsequences was not good as with the walk-weighted subsequence kernel. The best result was obtained with the walk-weighted subsequence kernel, in which substructures were weighted differently according to their significance levels for learning. In the kernel, we classified string types into contiguous e-walk, contiguous v-walk, and non-contiguous subsequences and assigned different weights to each of them. The system performance was improved by 5% with the walk-weighted subsequence kernel over the original LLL data, as compared to the previous walk-based kernel (Table 3). According to the results from the LLL evaluation server, action-type interactions and negative interactions were recognized better by the walk-weighted subsequence kernel than by the previous walk-based kernel. We finally chose the walk-weighted subsequence kernel for further experiments.
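The idea of weighting subsequences by type can be sketched as follows. This is a simplified illustration and not the kernel definition used in the experiments: it assumes a path string that alternates node and edge tokens, considers only subsequences of length 3, and takes the inner product of weighted feature maps. The weight values and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def weighted_features(path, weights, length=3):
    """Map each length-3 subsequence of `path` to the summed weight of its
    occurrences, where the weight depends on the subsequence type."""
    features = defaultdict(float)
    for indices in combinations(range(len(path)), length):
        contiguous = indices[-1] - indices[0] == length - 1
        if not contiguous:
            kind = "gap"      # non-contiguous subsequence
        elif indices[0] % 2 == 0:
            kind = "v_walk"   # node-edge-node
        else:
            kind = "e_walk"   # edge-node-edge
        features[tuple(path[i] for i in indices)] += weights[kind]
    return features


def walk_weighted_kernel(path_a, path_b, weights):
    """Inner product of the two weighted feature maps."""
    features_a = weighted_features(path_a, weights)
    features_b = weighted_features(path_b, weights)
    return sum(v * features_b[k] for k, v in features_a.items() if k in features_b)


# Toy paths and weights (hypothetical values).
toy_weights = {"e_walk": 3.0, "v_walk": 2.0, "gap": 1.0}
path_a = ["NE", "subj(UP)", "control", "comp_by(DN)", "NE"]
path_b = ["NE", "subj(UP)", "control", "obj(DN)", "NE"]
similarity = walk_weighted_kernel(path_a, path_b, toy_weights)
```

Because the score is an inner product of explicit feature maps, the function is symmetric and yields a valid kernel; enumerating all index triples is cubic in the path length, which is acceptable for short dependency paths but would call for dynamic programming on longer strings.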
Results of our kernel on LLL and the converted LLL by 
Results on the 5 converted PPI corpora by 
Consequently, the proposed kernel showed somewhat lower recall than precision on the automatically parsed datasets. In particular, recall rates were much lower on AIMed and BioInfer. One of the main causes of the low recall might be inaccurate dependency representations in the converted corpora. As mentioned before, there were various difficulties in including the converted data in the experiments. Although we performed a rough disambiguation pre-processing to remove (1) multiple dependency categories, (2) unnecessary syntactic relations according to grouping named entity words, and (3) cycle relations from the converted corpora, some dependency relations were still ambiguous, which is a critical factor for training. The difficulties are summarized in Table 2. Thus, the performance could be enhanced by using another syntactic parser that is well adapted to this domain.
Performance comparisons among systems on the 4 benchmark PPI corpora should be more reliable than those on the LLL dataset, since these corpora are much larger and can better show the advance and robustness of an approach. Nevertheless, further discussion of the evaluation set and experimental settings is necessary. Without a benchmark test dataset and external evaluation, it is difficult to directly compare performances between approaches, since there are substantial differences according to the data set used for training and testing, whether self interactions are included, the preprocessing, and the data splitting strategy used in cross-validation. According to , the F-score on AIMed could increase up to 18% with a random splitting from a pool of all generated NE pairs.
Performance comparison with other systems (5 PPI corpora)
In conclusion, with respect to both the clean and converted LLL, we found increases in performance as compared to other systems. On the other hand, for the rest of the fully automatically parsed corpora, we could not show significant improvements in F-score over other systems. As in many other studies, the extraction performance on AIMed and BioInfer was worse than that on the other 3 corpora. In particular, improving the recall rate requires further research to compensate for the imperfect syntactic analyses derived from automated systems in the real world and for the insufficient information of the shortest path representation.
However, the precision of the proposed kernel was quite competitive on both the clean dataset and the automatically parsed third-party datasets. As shown in the experiments, our approach worked better than other systems under the same syntactic information environment or when accurate syntactic features were provided. Thus, the performance of the kernel is expected to be enhanced by a syntactic parser adapted for this task.
First, many interactions regarding nested entities were filtered out. In fact, nested named entities are commonly encountered in biomedical text. For example, they account for 16.7% of all named entities of GENIA corpus. Currently, we allow all entities on nested entities. For instance, "IFN-gamma" and "IFN-gamma SCI" in Figure 5a are both considered. However, only one of them generally participates in an actual interaction. As mentioned earlier, since nested entities have the same structural and contextual information but prediction results need to be different, they prevent a proper learning. This can be one reason for the low performance. Thus, we need to restrict either the longest NE or the shortest NE for performance improvement.
Second, interactions related to single dependency shortest paths were filtered out. A single dependency shortest path is one that is composed of two entities and a direct dependency between them, such as "NE_conj_and_NE" or "NE_conj_or_NE". In this case, we need further information other than the paths to retrieve correct interactions. To ensure sufficient information for relation extraction, we expanded the single dependency shortest paths. For instance, the shortest path between "BMP-2" and "BMPR-1A" in Figure 5b, represented as "NE/NE_prep_for(DN)_NE/NE", requires contextual information outside the path such as "binding" to identify their interaction. In particular, the interactions of coordinated entities were often undetected because they commonly involve parsing errors. For the pairs of coordinated entities such as "BMP-2" and "BMPR-II" (4, 12) and "BMP-2" and "ActR-II" (4, 14) in Figure 5b, distant useful words such as "binding" were still excluded from the shortest path, although the paths were extended to consider the surrounding contexts. A more elaborate strategy for path extension with respect to single dependency paths and coordination handling is therefore required as future work. The path links that have no predicate should be reconsidered.
Third, some interactions are undetected due to parsing errors. In the biomedical domain, complex sentences containing coordination, apposition, acronym, clause dependency, or long dependency structures are very common. For instance, the first pair, "FGF-2" and "FGFR1" (18, 22) in the sentence of Figure 6a was not found due to incorrect dependency analysis in the form of "nn(FGFR4-32, FGFR1-27)". It implies that other supplementary information is required in addition to the shortest path representation to compensate for parsing errors often caused by the complicated coordination, apposition structures in biomedical texts. One way to reduce these can be efficient sentence splitting.
Our system still failed to identify some interactive pairs that occur with negation expressions. In Figure 6b, it correctly handled the negative pairs such as "MMP-1" and "TIMP-1" by negation processing but missed out interaction pairs such as "gelatinase A" and "TIMP-1" (17, 25), "gelatinase B" and "TIMP-1" (19, 25). Further studies should be performed in this negation processing. Finally, some cases need information other than a sentence, as shown in Figure 6c. In the sentence, the system cannot recognize the pair of "cyclinA" and "cdk2" (2, 4).
Figure 7 shows some types of false positive interactions that remain to be filtered out. In Figure 7a, the pair of "phosphatidylinositol (PI) 3-kinase" and "CD5" actually has no relation but our system recognized it as an interaction pair because of the substructure of "interacts/VBZ_PRED prep_with(DN)". It needs a much broader context including "we investigated whether~" for the correct extraction. Figure 7b shows a pair exemplifying a negation expression. In the sentence, the system detected all pairs as interactive pairs, but some pairs should have been filtered out. Our negation processing method could not cover the context of "but not with".
Figure 7c shows the interactions that are not detected with information on the shortest path alone. The shortest path in the sentence did not represent the context of "is not necessary". That is, words representing the interactions did not exist on the shortest path. Likewise, the interaction in Figure 7d should be filtered out. It requires a broad context such as "interaction with 14-3-3 proteins". It is one of the difficult cases to be filtered out.
A more detailed investigation of other dependency parsers applied in PPI studies [11, 16, 19] should be performed in our future research. Although, according to the work of , 1% absolute improvement in parsing leads to 0.25% improvement in PPI extraction accuracy, it is quite important to obtain reliable syntactic information in the systems that fully depend on syntactic information without considering the bag-of-words context information. The critical point in biomedical text parsing is how well a parser handles coordination, apposition, and relative clauses which often cause erroneous results in PPI learning. In addition to a further improvement in parsing accuracy, the strategy for the shortest path extension should be improved to supplement incorrect syntactic analyses. Likewise, the method for the pairs, including nested entities and negation expressions, should be enhanced.
So far, we restricted the research to focus on structural kernels by using dependency information on the shortest path. However, the combination with the bag-of-words kernel can be a backup to compensate for the imperfect syntactic analyses derived from automated systems in the real world and insufficient information of the shortest path representation by including the neighboring words. The bag-of-context words kernel can improve recall rates. Additionally, studies on string kernels are possible because there can be a wide variety of string kernels depending on how the subsequences are defined.
We presented kernel methods defined on the shortest dependency path for genic relation extraction. The dependency path between two NEs, which consists of the connections between words, is highly lexicalized. In this study, we started off with four drawbacks of our previous work in terms of e-walks, partial path matches, different significance levels of structures and non-contiguous paths and presented the revisions for the dependency kernel, variants of string kernels, and the walk-weighted subsequence kernel to effectively handle the drawbacks. The proposed kernels were experimented on the LLL shared task data and 5 PPI corpora. We achieved good performances only with substructures represented on the shortest path between entities.
To handle the problems of the prior structural kernel, we first examined the effectiveness of each main feature for the walk kernel, which showed the best performance in our previous work, and then modified the dependency kernel so that it can accept the features of the walk kernel and partial path matches.
In the modified version, we treat each type of substructures with different importance.
For this, we classify the types of substructures into several categories and enhance the learning performance by allowing different weights or counts according to the types of common dependency substructures that two relation instances share. Next, we treat the shortest path strings as strings and introduce some string kernels such as the spectrum kernel, subsequence kernel and gap weighted kernel. Finally, we suggest the walk weighted subsequence kernel, which can model not only the previous problems, but also non-contiguous structures and structural importance not covered by the previous kernels.
We start the kernel modification with a reconsideration of the properties of walks. In the walk kernel, the structural information is encoded with walks of graphs. Let V and E be the sets of vertices (nodes) and edges (relations), respectively. Given v ∈ V and e ∈ E, a walk can be defined as an alternating sequence of vertices and edges, v_i, e_{i,i+1}, v_{i+1}, e_{i+1,i+2}, ..., v_{i+n-1}, that begins with a vertex and ends with a vertex. We took into consideration walks of length 3, v_i, e_{i,i+1}, v_{i+1}, among all possible subsets of walks on the shortest path between a pair of NEs. We called such a walk a v-walk. Likewise, we defined an e-walk, which starts and ends with an edge: e_{i,i+1}, v_{i+1}, e_{i+1,i+2}. It is not actually a walk as defined in graph theory, but we take the e-walk to capture contextual syntactic structures as well. We utilized both lexical walks and syntactic walks for each of the v-walks and the e-walks. The lexical walk consists of lexical words and their dependency relations on a lexical dependency path like that in Figure 2c, and the syntactic walk consists of POS and their dependency relations on a syntactic dependency path. With this walk information, we can capture structural context information. This path-based walk representation makes it easy to incorporate structural information into the learning scheme, because a path reflects the dependency relation map between the words on it.
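The two walk types can be read directly off a path string. The sketch below assumes the path is a list that alternates node and edge tokens, starting and ending with a node; the function name `extract_walks` and the toy path are hypothetical.

```python
def extract_walks(path):
    """Return (v_walks, e_walks) for an alternating node/edge path.

    v-walk: (node, edge, node), starting at even positions.
    e-walk: (edge, node, edge), starting at odd positions.
    """
    v_walks = [tuple(path[i:i + 3]) for i in range(0, len(path) - 2, 2)]
    e_walks = [tuple(path[i:i + 3]) for i in range(1, len(path) - 2, 2)]
    return v_walks, e_walks


# Toy lexical path (hypothetical entity placeholders NE1 and NE2).
toy_path = ["NE1", "subj(UP)", "control", "comp_by(DN)", "NE2"]
toy_v_walks, toy_e_walks = extract_walks(toy_path)
```

Applying the same function to the POS version of the path gives the syntactic walks.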
In this work, we focus on the different structural properties of the v-walk and the e-walk. The v-walk shows a labeled relationship from a head to its modifier. Thus, it is related to a direct dependency relationship between two words or POS tags. On the other hand, the e-walk describes the immediate dependency structure around a node. If the node is a predicate, the e-walk is closely connected with sub-categorization information, which is important in the semantic role labeling task of discovering the predicate-argument structure of a given sentence.
In Figure 2c, the e-walk of "sub(UP)-control-comp_by (DN)" shows the argument structure of the predicate verb, "control". In this case, one entity fills the "subject" argument of "control" and the other entity directly or indirectly fills the "comp_by" role. If an instance holds such a dependency structure with respect to the predicate "control", it is very likely that the two NEs in the structure have a genic relation. The semantic relations among predicates and their modifiers are clearly helpful for relation extraction. According to , the F-score was improved by 15% when incorporating semantic role information into the information extraction system.
Thus, we evaluated each walk type's contribution to the interaction extraction. For this, we conducted an experiment in which the walk kernel was restricted to a single walk type. As shown in Table 3, we could achieve a quite competitive result with e-walk information alone. This result indicates that the e-walk contributes more to the overall similarity for relation learning than the v-walk, since it is related to semantic role information. However, e-walk style structural information is excluded from the previous dependency kernel, which is one of the reasons for its low performance. Therefore, such information should be considered as prior knowledge and be regarded as a more significant structure among the subpaths.
The dependency kernel directly computed the structural similarity between two graphs by counting common subgraphs. However, our previous dependency kernel focused strictly on v-walks, that is, on the direct dependencies between pairs of nodes, so e-walk style structural information was excluded. Two nodes matched only when they were the same and their direct child nodes and the dependency types from the nodes to those child nodes also matched. Thus, we extend the kernel to use e-walks and to allow partial matches, with an extra factor ensuring that partial matches receive lower weights than complete path matches.
That is, (x, y) can be an element of the set sc_w(n_1, n_2) only when x and y, the direct child nodes of the two parent nodes n_1 and n_2, are the same word and also have the same dependency relation with their parents. For sub-categorization information, subcat_w(x) refers to the sub-categorization pair of a word x, which is composed of its left and right edges. This is the same information as an e-walk. The matching function C_w(n_1, n_2) is the number of common subgraphs rooted at n_1 and n_2.
The matching function was devised to count common subgraphs while taking their structural importance into account. In the definition, if the set of common dependency child word pairs is empty but the two nodes have the same sub-categorization value, the matching function returns 3.0. If n_1 or n_2 has no child but the two nodes are the same word, C_w(n_1, n_2) returns 1.0. If n_1 or n_2 has no child and the two nodes are different words, C_w(n_1, n_2) returns 0. The last two definitions recursively call C_w on the common dependency word pairs in the set sc_w(n_1, n_2), and C_w is weighted with a larger value if the two nodes share the same sub-categorization information.
Formula (6) enumerates all matching nodes of two graphs, d_1 and d_2. It is a summation of the common word dependency subgraphs and the common POS dependency subgraphs between the two graphs.
As a result, the F-score improved from 60.4 to 69.4 on the LLL dataset (Table 3), compared with the previous dependency kernel. The use of partial path matches and sub-categorization information was helpful, but the result is still worse than that of the walk kernel. In order to maintain direct dependency structures, this kernel excluded the non-contiguous subpaths on the shortest path, which can be important in relation learning. Thus, we introduce string kernels to handle such non-contiguous subpaths.
In this section, we look at string kernels from various structural perspectives. First of all, we briefly introduce the concepts and notation for string kernels. The string kernel was first addressed in the text classification task by . The basic idea is to compare text documents by means of the substrings they contain: the more substrings in common, the more similar they are. A string is defined as any finite sequence of symbols drawn from a finite alphabet, and string kernels concern occurrences of subsequences or substrings in strings. In general, for a given string s_1 s_2 ... s_n, a substring denotes a string s_i s_{i+1} ... s_{j-1} s_j that occurs contiguously within the string, while a subsequence indicates an arbitrary string s_i s_j ... s_k whose characters occur in the same order within the string, either contiguously or non-contiguously.
So far, we have re-represented the shortest path strings with meaningful substructures such as walks. In this work, we also treat a shortest path string like the one in Figure 2c as a plain string and compare such strings directly. In our data representation, the nodes and edges of a shortest path string correspond to the symbols of a string. That is, the finite alphabet consists of the word or POS symbols and the dependency relation symbols of the shortest path strings, and the string kernels operate on the shortest path strings. The kernels consider both the lexical shortest path string and the syntactic shortest path string. We gradually extend the kernels to perform a more comprehensive comparison between two shortest path strings, from the spectrum kernel to the weighted subsequence kernel.
The string s(i : i + p) denotes the p-length substring s_i ... s_{i+p-1} of s. In this work, we fixed the order of the spectrum at 3 and summed K_s over the lexical dependency path string and K_s over the syntactic dependency path string to count common substrings. With this kernel, we can consider substructures such as the one shown in Figure 1f. As a result, we achieved an F-score of 70.5 on the LLL data (Table 3). Despite the kernel's structural simplicity, the result was quite promising: it was better than the performance of the extended dependency kernel, so a reasonable performance can be obtained with only the contiguous dependencies on the shortest path string.
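A minimal Python sketch of a spectrum kernel of order 3 over path strings follows. It assumes that each path is a list of node and edge labels and that the kernel value is the sum, over shared contiguous triples, of the product of their counts, which is the usual spectrum kernel form; the kernel formula itself is not reproduced in this text, so this form, the function names, and the two example paths are assumptions rather than the exact formulation used in the experiments.

```python
from collections import Counter


def spectrum_features(tokens, p=3):
    """Count every contiguous p-token substring of the sequence."""
    return Counter(tuple(tokens[i:i + p]) for i in range(len(tokens) - p + 1))


def spectrum_kernel(s, t, p=3):
    """Sum, over shared substrings, of the product of their counts."""
    fs, ft = spectrum_features(s, p), spectrum_features(t, p)
    return sum(count * ft[u] for u, count in fs.items())


# Hypothetical lexical paths that differ only in the second edge label.
s = ["PROT1", "sub(UP)", "control", "comp_by(DN)", "PROT2"]
t = ["PROT1", "sub(UP)", "control", "obj(DN)", "PROT2"]

print(spectrum_kernel(s, t))  # one shared triple: PROT1, sub(UP), control
print(spectrum_kernel(s, s))
```

The combined score described above would be the sum of two such calls, one on the lexical path strings and one on the syntactic (POS) path strings.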
In the spectrum kernel, substructures with gaps, such as "stimulate-obj(DN)~comp_from(DN)", are excluded from the structural comparison. To cover such substructures, we tested the subsequence kernel, whose feature mapping is defined by all contiguous or non-contiguous subsequences of a string. Unlike the spectrum kernel, the subsequence kernel allows gaps between characters; that is, other characters can intervene between the matching characters of a subsequence. Thus, this kernel can account for substructures like the one in Figure 1g. The substructure "stimulate-obj(DN)~comp_from(DN)" can match phrases such as "stimulate-obj(DN)-any other noun-comp_from(DN)", which use other nouns instead of "transcription". The advantage of this kernel is that we can exploit long-range dependencies within strings. As with the spectrum kernel, we reduce the dimension of the feature space by considering only fixed-length subsequences. The kernel is defined via the feature map from the space of all finite sequences drawn from the alphabet A to the vector space indexed by the set of p-length subsequences derived from A. We define A^p as the set of all subsequences of length p and denote the length of a string s = s_1 s_2 ... s_{|s|} by |s|. A string u is a subsequence of s if there exists an index sequence i = (i_1, ..., i_{|u|}) with 1 ≤ i_1 < ... < i_{|u|} ≤ |s| such that u_j = s_{i_j} for j = 1, ..., |u|. We use the boldface letter i to indicate such an index sequence, and the subsequence u of a string s is denoted by u = s[i] for short. That is, u is the subsequence of s at the positions indexed by i and equals s_{i_1} s_{i_2} ... s_{i_{|u|}}.
We chose 3 as the length parameter p. Although the subsequence kernel has the advantage of considering non-contiguous subsequences as well as contiguous substrings, its performance was not satisfactory. It improved over the spectrum kernel to some extent, but its F-score of 77.5 on LLL only equalled that of the walk kernel, which showed the best result in our previous study.
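The fixed-length subsequence kernel with p = 3 can be sketched in Python as follows. The sketch enumerates index sequences by brute force, which is affordable only because shortest paths are short. The function names and the example paths are assumptions; in particular, "expression" is a hypothetical noun standing in for "any other noun", and "PROT2" is a placeholder entity name.

```python
from collections import Counter
from itertools import combinations


def subsequence_features(tokens, p=3):
    """Count occurrences of every length-p subsequence (gaps allowed)."""
    return Counter(
        tuple(tokens[i] for i in idx)
        for idx in combinations(range(len(tokens)), p)
    )


def subsequence_kernel(s, t, p=3):
    """Sum, over shared subsequences, of the product of their counts."""
    fs, ft = subsequence_features(s, p), subsequence_features(t, p)
    return sum(count * ft[u] for u, count in fs.items())


# Hypothetical paths that differ only in the noun between the two edges.
s = ["stimulate", "obj(DN)", "transcription", "comp_from(DN)", "PROT2"]
t = ["stimulate", "obj(DN)", "expression", "comp_from(DN)", "PROT2"]

print(subsequence_kernel(s, t))  # subsequences that skip the differing noun
```

In this example every contiguous triple contains the differing noun, so a comparison restricted to contiguous substrings finds nothing in common, whereas the four subsequences that skip the noun, including "stimulate, obj(DN), comp_from(DN)", still match.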
That is, this kernel function computes all matched subsequences of p symbols between two strings, and each occurrence is weighted according to its span. In general, a direct computation over all subsequences becomes inefficient even for a small value of p. For an efficient computation, the dynamic programming algorithm by  was used. We do not explain the details of the efficient recursive kernel computation in this paper. We set λ to 0.5 and fixed the index set as U = A^3 (three-node-or-edge phrases on the shortest path string). If we choose λ as 1, the weights of all occurrences will be 1 regardless of l. In that case, the kernel is equivalent to the fixed-length subsequence kernel, which counts every common subsequence as 1. As a result, the F-score (70.2) was lower than that of the subsequence kernel, even though this kernel offers a more comprehensive weighting scheme that depends on the dependency distance of each subsequence. Adding gap weighting to the subsequences was not very effective.
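The effect of λ can be seen in a brute-force Python sketch of a gap-weighted subsequence kernel. Because the kernel formula is not reproduced in this text, the sketch assumes the standard form in which each occurrence of a subsequence contributes λ raised to its span (last index minus first index plus one), and the kernel sums the products of these contributions over shared subsequences. It enumerates index sequences directly instead of using the dynamic programming recursion, and the function names and example paths are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def gap_weighted_features(tokens, p=3, lam=0.5):
    """Map each length-p subsequence to the sum of lam**span over its occurrences."""
    features = defaultdict(float)
    for idx in combinations(range(len(tokens)), p):
        span = idx[-1] - idx[0] + 1  # distance covered, gaps included
        features[tuple(tokens[i] for i in idx)] += lam ** span
    return features


def gap_weighted_kernel(s, t, p=3, lam=0.5):
    """Inner product of the gap-weighted feature vectors of s and t."""
    fs, ft = gap_weighted_features(s, p, lam), gap_weighted_features(t, p, lam)
    return sum(w * ft[u] for u, w in fs.items() if u in ft)


# Hypothetical paths that differ only in the middle noun.
s = ["stimulate", "obj(DN)", "transcription", "comp_from(DN)", "PROT2"]
t = ["stimulate", "obj(DN)", "expression", "comp_from(DN)", "PROT2"]

print(gap_weighted_kernel(s, t, lam=0.5))
print(gap_weighted_kernel(s, t, lam=1.0))  # reduces to plain counting
```

With λ = 1.0 every occurrence has weight 1, so the value equals the number of matching subsequence pairs; with λ = 0.5 the same matches are discounted sharply because all of them span four or five positions.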
Formula (11) means that the kernel assigns 3.0 to common contiguous e-walk substrings and 2.0 to common contiguous v-walk substrings. Non-contiguous subsequences can be penalized by gap weights, but the performance was best when we set λ to 1.0. Thus, in our experiments, 1.0 was also assigned to non-contiguous subsequences regardless of their gaps. The significance values account for the types of substructures, and we set them experimentally to obtain the best F-score.
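The weighting scheme can be sketched in Python under explicit assumptions, since formula (11) is not reproduced in this text. The sketch assumes that a path is a list alternating node and edge labels (nodes at even indices), that a weight is assigned to each pair of matching occurrences, and that the 3.0 or 2.0 significance applies only when both occurrences are contiguous and of the same walk type; all other matches receive 1.0, corresponding to λ = 1.0. These choices, the function names, and the example paths are illustrative and may differ from the exact definition.

```python
from itertools import combinations


def occurrences(tokens, p=3):
    """Yield (subsequence, start index, is_contiguous) for every index p-tuple."""
    for idx in combinations(range(len(tokens)), p):
        contiguous = idx[-1] - idx[0] == p - 1
        yield tuple(tokens[i] for i in idx), idx[0], contiguous


def pair_weight(start_s, contig_s, start_t, contig_t):
    """Assumed significance of one pair of matching occurrences."""
    if contig_s and contig_t:
        if start_s % 2 == 1 and start_t % 2 == 1:
            return 3.0  # contiguous e-walk: edge, node, edge
        if start_s % 2 == 0 and start_t % 2 == 0:
            return 2.0  # contiguous v-walk: node, edge, node
    return 1.0  # any other match, including non-contiguous ones


def walk_weighted_kernel(s, t, p=3):
    """Sum the assumed significance over all pairs of matching occurrences."""
    occ_t = list(occurrences(t, p))
    total = 0.0
    for u, start_s, contig_s in occurrences(s, p):
        for v, start_t, contig_t in occ_t:
            if u == v:
                total += pair_weight(start_s, contig_s, start_t, contig_t)
    return total


# Hypothetical paths that share everything except the final node.
s = ["PROT1", "sub(UP)", "control", "comp_by(DN)", "PROT2"]
t = ["PROT1", "sub(UP)", "control", "comp_by(DN)", "promoter"]

print(walk_weighted_kernel(s, t))
```

For these two paths the shared e-walk around "control" contributes 3.0, the shared v-walk contributes 2.0, and the two shared non-contiguous subsequences contribute 1.0 each.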
As a result, this kernel showed the best performance (F-score 82.1) for the extraction of genic relations on the LLL data. This result demonstrates that string kernels whose weights are carefully designed according to the types of common subsequences are very effective for learning from a structured representation.
This work is supported by the BK21 program of the Ministry of Education and Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2009-0070211).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.