Supporting the search for cross-context links by outlier detection methods
© Sluban and Lavrač; licensee BioMed Central Ltd. 2010
Published: 06 October 2010
Background and relation to previous work
Outliers in data can either represent noise, which harms knowledge discovery and is therefore best eliminated, or be correct data instances that belong to a specific subconcept of the main domain concept and can potentially carry new, interesting insights. Several outlier detection methods have been developed in data and text mining, mainly for noise filtering and error detection. Except for the work of Petrič et al., outlier detection in text mining has not yet been used for exploratory purposes. Our work focuses on using noise/outlier detection methods for the novel task of cross-context link discovery.
Outlier detection through class noise filtering methods
This work uses a class noise detection approach to find outlier documents that contain bridging terms linking different contexts or domains. Petrič et al. have shown that detecting interesting outliers in the literature on a given phenomenon can help the expert find implicit relationships among concepts of different domains. In our approach, we searched for a set of outlier documents using the class noise filtering approach of Brodley and Friedl, implemented with three different learning algorithms: Naïve Bayes (abbreviated Bayes), Support Vector Machine (SVM) and Random Forest (RF). These outlier detection methods work in a 10-fold cross-validation manner: in each round, nine folds are used to train the classifier, and the instances in the remaining fold that the classifier misclassifies are marked as noise/outliers of the domain they belong to.
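The cross-validated filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a small hand-written multinomial Naïve Bayes with Laplace smoothing in place of the Bayes, SVM and RF learners, assigns folds by index modulo the number of folds (no shuffling or stratification), and uses invented toy documents with the hypothetical domain labels "A" and "C". All function names are illustrative.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model (counts only) on tokenized docs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the most probable class label, using Laplace smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for tok in tokens:
            if tok in vocab:  # words unseen in training are ignored
                score += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def class_noise_filter(docs, labels, n_folds=10):
    """Return indices of documents misclassified while held out."""
    outliers = []
    for fold in range(n_folds):
        # Assumption: fold membership is simply index modulo n_folds.
        test_idx = [i for i in range(len(docs)) if i % n_folds == fold]
        train_idx = [i for i in range(len(docs)) if i % n_folds != fold]
        if not test_idx:
            continue
        model = train_nb([docs[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        for i in test_idx:
            if predict_nb(model, docs[i]) != labels[i]:
                outliers.append(i)  # flagged as noise/outlier of its domain
    return sorted(outliers)


# Toy data (invented): ten typical documents per domain, interleaved,
# plus one document labelled "A" whose vocabulary resembles domain "C".
docs, labels = [], []
for _ in range(10):
    docs.append(["headache", "pain", "aura"])
    labels.append("A")
    docs.append(["mineral", "diet", "salt"])
    labels.append("C")
docs.append(["mineral", "diet", "pain"])
labels.append("A")

outlier_indices = class_noise_filter(docs, labels)
print(outlier_indices)
```

On this toy input only the last document (index 20) is flagged, because it is labelled with one domain but shares most of its terms with the other. Documents of this kind are the ones the approach inspects for bridging terms.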
Testing of the methods
This work shows that outlier detection methods can shorten the time needed for searching for bridging terms in cross-context link discovery, since the bridging terms are more frequent in sets of outlier documents.
This work was partially supported by the national project Knowledge Technologies and by the EU project FP7-211898 BISON.
References
- Petrič I, Urbančič T, Cestnik B, Macedoni-Lukšič M: Literature mining method RaJoLink for uncovering relations between biomedical concepts. J. Biomed. Inform. 2009, 42(2):219–227. doi:10.1016/j.jbi.2008.08.004
- Brodley CE, Friedl MA: Identifying mislabeled training data. Journal of Artificial Intelligence Research 1999, 11:131–167.
- Swanson DR: Migraine and magnesium: eleven neglected connections. Perspectives in Biology and Medicine 1988, 31(4):526–557.