Table 3 The knowledge base in a nutshell (I)

From: A method for automatically extracting infectious disease-related primers and probes from the literature

Rules
R1 Σi length(si) < Lmin → discard(s)
  IF the sum of the lengths (i.e. number of characters) of all tokens si belonging to s is smaller than Lmin, THEN discard s.
R2 i | i = affix_in_sequence_tails(s) → discard(s) add(s')
s' = {s1,...,si-1}
  IF an affix from the list of problem affixes is matched at the tail of s, THEN discard s AND add to the facts base a modified copy of s that does not include the tokens corresponding to the matched affix.
R3 i | i = affix_in_sequence_head(s) → discard(s) add(s')
s' = {si+1,...,sn}
  IF an affix from the list of problem affixes is matched at the head of s, THEN discard s AND add to the facts base a modified copy of s that does not include the tokens corresponding to the matched affix.
R4 ∀i in_dictionary(si) → discard(s)
  IF all tokens si from s belong to the custom dictionary of English words, THEN discard s.
R5 (i, j) | (i, j) = affix_within_sequence(s) → discard(s) add(s') add(s'')
s' = {s1,...,si-1},
s'' = {si+j,...,sn}
  IF an affix from the list of problem affixes is matched within s, THEN discard s AND add to the facts base the sub-lists of tokens s' and s'' that include all the tokens in s that occur before and after the matched affix respectively.
R6 in_dictionary(s1) ∧ length(s1) ≥ 3 → discard(s) add(s')
s' = {s2,...,sn}
  IF the first token of s belongs to the custom dictionary of English words AND its length (i.e. number of characters) is greater than or equal to 3, THEN discard s AND add to the facts base the sub-list of tokens s', which includes all but the first token of s.
R7 in_dictionary(sn) ∧ length(sn) ≥ 3 → discard(s) add(s')
s' = {s1,...,sn-1}
  IF the last token of s belongs to the custom dictionary of English words AND its length (i.e. number of characters) is greater than or equal to 3, THEN discard s AND add to the facts base the sub-list of tokens s', which includes all but the last token of s.
R8 size(s) ≥ 2 → merge(s)
  IF s contains 2 or more tokens, THEN convert s into a singleton by concatenating all its tokens.
  1. This table shows the complete knowledge base for refining DNA sequences. R1 is designed to discard short sequences. R2, R3, R6 and R7 are designed to refine noisy sequences, whereas R5 deals with incorrectly merged sequences. R4, by contrast, removes concatenations of dictionary words recognized by the detectors as valid sequences. Finally, R8 converts a list of tokens containing two or more elements into a singleton whose only element represents the refined sequence. The symbol s denotes a list of tokens s = {s1, s2,..., sn} of size n. See Table 4 for details on the functions, actions and symbols used by the different rules.
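As a minimal illustration of how the rules above compose, the following Python sketch applies R1–R8 to a token list until only refined singletons remain. The affix list, dictionary contents, and Lmin value are placeholder assumptions (not taken from the paper), and R5 is simplified to single-token affixes rather than the general (i, j) spans:

```python
# Hypothetical sketch of rules R1-R8; constants below are assumed examples.
L_MIN = 10                                     # assumed minimum total length (Lmin)
PROBLEM_AFFIXES = {"f-", "r-", "probe"}        # assumed list of problem affixes
DICTIONARY = {"forward", "reverse", "primer"}  # assumed custom English dictionary

def refine(s):
    """Apply R1-R8 to a token list s; return the list of refined sequences."""
    facts = [list(s)]   # the facts base, seeded with the input sequence
    refined = []
    while facts:
        seq = facts.pop()
        # R1: discard sequences whose total character count is below L_MIN
        if sum(len(t) for t in seq) < L_MIN:
            continue
        # R4: discard sequences made up entirely of dictionary words
        if all(t.lower() in DICTIONARY for t in seq):
            continue
        # R2 / R3: strip a problem affix matched at the tail or head
        if seq[-1].lower() in PROBLEM_AFFIXES:
            facts.append(seq[:-1]); continue
        if seq[0].lower() in PROBLEM_AFFIXES:
            facts.append(seq[1:]); continue
        # R5: split on a problem affix matched within the sequence
        inner = next((i for i, t in enumerate(seq[1:-1], 1)
                      if t.lower() in PROBLEM_AFFIXES), None)
        if inner is not None:
            facts.append(seq[:inner]); facts.append(seq[inner + 1:]); continue
        # R6 / R7: drop a leading/trailing dictionary word of length >= 3
        if seq[0].lower() in DICTIONARY and len(seq[0]) >= 3:
            facts.append(seq[1:]); continue
        if seq[-1].lower() in DICTIONARY and len(seq[-1]) >= 3:
            facts.append(seq[:-1]); continue
        # R8: merge the surviving tokens into a singleton
        refined.append("".join(seq))
    return refined
```

Note that the fixed if/continue ordering here imposes one possible rule priority; the actual firing order in a production-rule engine would be governed by the knowledge base itself.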