A method for automatically extracting infectious disease-related primers and probes from the literature

García-Remesal, Miguel; Cuevas, Alejandro; López-Alonso, Victoria; López-Campos, Guillermo; de la Calle, Guillermo; de la Iglesia, Diana; Pérez-Rey, David; Crespo, José; Martín-Sánchez, Fernando; Maojo, Víctor

doi:10.1186/1471-2105-11-410

BMC Bioinformatics

Table 4 The knowledge base in a nutshell (II)

Functions, Actions and Symbols
S = {s₁, s₂,...,s_n} denotes a sequence (list of tokens) as recognized by the detectors during the second phase.
Length(t): function that returns the size (i.e., the number of symbols) of a token t ∈ Σ⁺.
L_min: minimum size (i.e., the number of symbols from Σ) that a sequence of tokens s must have in order not to be discarded. We set this parameter to 7 to enable the BLAST tool to produce results when queried using s.
discard(s): action that removes the sequence s from the facts base.
add(s): action that adds the sequence s to the facts base.
L_A: list of problem affix expressions.
affix_in_sequence_tail(s): function that attempts to match all elements from L_A to the tail of sequence s. If one or more matches occur, then the function returns the position of the first token belonging to the longest matched element. If there are two or more matches of the same length, then the one occurring first in the sequence is selected. For example, when processing the sequence of tokens {"TTACTCATGCCATACATAAATGGATA", "TAMRA", "T"}, the function would match the subsequence {"TAMRA", "T"} as the longest suffix belonging to L_A occurring at the tail of s. Thus, the function would return the value 2, corresponding to the position of the token "TAMRA".
affix_in_sequence_head(s): function that attempts to match all elements from L_A to the head of sequence s. If one or more matches occur, then the function returns the position of the last token of the longest matched element. If there are two or more matches of the same length, then the one occurring first in the sequence is selected. For instance, when processing the sequence of tokens {"and", "CTAGTTT", "ACGTAGGGT"}, the function would return the value 1, corresponding to the position of the token "and".
affix_within_sequence(s): function that attempts to match all elements from L_A to a subsequence of s, excluding the head and the tail. If one or more matches occur, then the function returns a tuple <p, k>, where p denotes the position of the first token of the matched element and k is the size (i.e., the number of tokens) of the match. If there are two or more matches of the same length, then the one occurring first in the sequence is selected. For example, when processing the list of tokens {"ACGTTTACGT", "TAMRA", "and", "CGATGGGA"}, the function would return the tuple <2, 1>, corresponding to the subsequence {"TAMRA"}.
in_dictionary(t): function that searches for the token t ∈ Σ⁺ in a custom dictionary created by the authors and composed of words belonging to Σ⁺. The function returns true if t is found in the dictionary and false otherwise.
size(s): function that returns the number of elements in the list of tokens s.
merge(s): function that returns the concatenation (preserving the original order) of all the elements in the list of tokens s. For instance, if s = {"AAC", "TCG", "A"}, then merge(s) = {"AACTCGA"}.
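
The affix-matching functions and merge(s) defined in the table can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that each element of L_A is a fixed tuple of tokens compared by exact equality (the actual affix expressions may be richer), that the contents of L_A shown here are hypothetical, that positions are 1-based as in the examples above, and that None is returned when nothing matches (the table does not specify a no-match value). For affix_within_sequence, preferring the longest match before applying the "first in the sequence" rule is also an assumption.

```python
from typing import List, Optional, Sequence, Tuple

Affix = Tuple[str, ...]

# Hypothetical contents of L_A, chosen only to reproduce the table's examples.
L_A: List[Affix] = [("TAMRA", "T"), ("TAMRA",), ("and",)]


def merge(s: Sequence[str]) -> List[str]:
    """Concatenate all tokens of s, preserving their order."""
    return ["".join(s)]


def affix_in_sequence_tail(s: Sequence[str],
                           affixes: Sequence[Affix] = L_A) -> Optional[int]:
    """1-based position of the first token of the longest affix ending s."""
    best = 0  # length, in tokens, of the longest affix matched so far
    for a in affixes:
        n = len(a)
        if best < n <= len(s) and tuple(s[-n:]) == a:
            best = n
    return len(s) - best + 1 if best else None


def affix_in_sequence_head(s: Sequence[str],
                           affixes: Sequence[Affix] = L_A) -> Optional[int]:
    """1-based position of the last token of the longest affix starting s."""
    best = 0
    for a in affixes:
        n = len(a)
        if best < n <= len(s) and tuple(s[:n]) == a:
            best = n
    return best or None


def affix_within_sequence(s: Sequence[str],
                          affixes: Sequence[Affix] = L_A
                          ) -> Optional[Tuple[int, int]]:
    """(p, k) for an affix strictly inside s: not touching first or last token."""
    best: Optional[Tuple[int, int]] = None
    for a in affixes:
        k = len(a)
        if k == 0:
            continue
        # 0-based start i must satisfy 1 <= i and i + k <= len(s) - 1.
        for i in range(1, len(s) - k):
            if tuple(s[i:i + k]) == a:
                p = i + 1  # convert to a 1-based position
                if (best is None or k > best[1]
                        or (k == best[1] and p < best[0])):
                    best = (p, k)
                break  # only the earliest occurrence of this affix matters
    return best


if __name__ == "__main__":
    print(affix_in_sequence_tail(["TTACTCATGCCATACATAAATGGATA", "TAMRA", "T"]))
    print(affix_in_sequence_head(["and", "CTAGTTT", "ACGTAGGGT"]))
    print(affix_within_sequence(["ACGTTTACGT", "TAMRA", "and", "CGATGGGA"]))
    print(merge(["AAC", "TCG", "A"]))
```

Run on the example inputs from the table, the sketch yields the values stated there: 2 for the tail example, 1 for the head example, the pair (2, 1) for the inner example, and the single-element list containing "AACTCGA" for merge.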

ISSN: 1471-2105
