In this section, we first provide formal definitions of neighborhoods (i.e. feature data) and target set pertinence. Then, we introduce an algorithm for the identification and the comparison of pertinent target sets in a DAG when all the set compositions are directly available (as in figure 1b, where the sets corresponding to DAG nodes are explicitly stored). Next, we propose a generic compact representation of neighborhoods and detail the adaptation of the previous algorithm to this context. The rest of this section focuses on specific representations and algorithms that rely on properties of the DAG to achieve further time and space optimizations.
Definitions
Uniform representation of data: DAGs defining sets partially ordered by the inclusion relation
We will denote by S the set of objects considered in the remainder of this paper. For example, S can be the set of proteins of an organism. We consider that a neighborhood is a set N (of sets of elements of S) partially ordered by the inclusion relation ≺. Partially ordered sets (posets) are generally represented by Hasse diagrams, in which there is an edge from y to x if and only if y covers x, that is, x ≺ y and there is no other element z such that x ≺ z ≺ y (see [8] for more details).
In the following, we consider that a neighborhood is a Hasse diagram (the DAG of figure 1b) that defines a poset (N, ≺).
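For concreteness, such a Hasse diagram can be encoded with plain dictionaries recording, for each node, the set it represents and the nodes it covers. The following minimal sketch uses hypothetical node names and elements; the algorithm sketches later in this section assume the same layout (sets, children, parents).

```python
# A toy neighborhood stored as a Hasse diagram: an edge from y to x is kept
# only when y covers x.  Node names, elements and set contents are hypothetical.
sets = {
    "T_root": {"a", "b", "c", "d"},                  # covers T1 and T2
    "T1": {"a", "b"},
    "T2": {"b", "c", "d"},
    "a": {"a"}, "b": {"b"}, "c": {"c"}, "d": {"d"},  # leaves (singletons)
}
children = {
    "T_root": ["T1", "T2"],
    "T1": ["a", "b"],
    "T2": ["b", "c", "d"],
    "a": [], "b": [], "c": [], "d": [],
}
# Parent lists derived from the cover edges above.
parents = {v: [u for u in children if v in children[u]] for v in children}
```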
Target set pertinence
Several methods and tools use a similarity or dissimilarity index to compare sets; amongst others, we can cite FunSpec [9], BlastSets [7], GOStat [10], EASE [11], PANDORA [12], aBandApart [13], and goCluster [14] (see [1] for a review). Even though these methods use various similarity indices to compare the query and target sets (hypergeometric, binomial, χ2, Fisher's exact test, or percentages), they all have in common that they consider only counts of elements, such as the sizes of the query and target sets, or the numbers of common and differing elements. Thus, when comparing two sets, the larger the number of common elements and the smaller the number of differing elements, the more similar they are considered.
Formally, this corresponds to any similarity index F between a query set Q and a target set T such that F(Q, T) increases with |T ∩ Q| and decreases with |T - Q|. Given such a similarity index, it is not necessary to compare the query set with all the target sets of the neighborhood. We introduce the notion of pertinence of a target set for its comparison to a given query set, which allows us to consider target sets that are likely to have good similarity values (elements in common with the query) and to ignore target sets that will give redundant results (not different enough from other target sets because of the set composition dependencies in the Hasse diagram representing the neighborhood).
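To make this monotonicity requirement concrete, the toy index below (a simple 'percentage' index, equal to |T ∩ Q| / |T|) satisfies it; this is only an illustrative sketch, not the index of any particular tool cited above.

```python
def similarity(query: set, target: set) -> float:
    """Toy similarity index with the required monotonicity: it increases
    with |T ∩ Q| and decreases with |T - Q| (here it equals |T ∩ Q| / |T|)."""
    common = len(target & query)      # |T ∩ Q|
    differing = len(target - query)   # |T - Q|
    total = common + differing        # = |T|
    return common / total if total else 0.0
```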
Our main observation is that two target sets are redundant when one includes the other and they have either the same common elements or the same differing elements with respect to the query. For example, in figure 3, the target sets T1 and T3 are both redundant with T2 for the query set Q. T1 is redundant because it includes T2 and has the same common elements (which implies that it has more differing elements): it is less similar and brings no more information than T2 alone. Similarly, T3 is redundant because it is included in T2 and has the same differing elements (i.e. fewer common elements). The fact that one target set includes the other is important for the biological meaning of the results. Let us consider T2 and T5 of figure 3. In this case, one might be tempted to decide that only T2 is pertinent (same common elements and fewer differing elements). Actually, T5 is also pertinent because its differing elements do not include those of T2, and thus T5 may be associated with a pertinent, non-redundant biological meaning.
Mathematically, the observation above is written simply as follows:
Definition. A target set T in a neighborhood N is pertinent for its comparison to a given query set Q if and only if:
T ∩ Q ≠ ∅ (1)
∄T' ∈ N such that T' ⊂ T and T' ∩ Q = T ∩ Q (2)
∄T' ∈ N such that T ⊂ T' and T' - Q = T - Q (3)
This mathematical definition suggests that one must test all possible T' to decide whether a target set T is pertinent. However, it is easy to see that only the parent and child nodes of T in the Hasse diagram representing the neighborhood N need to be checked. Indeed, suppose that such a T' exists for (2) (resp. (3)). Then, due to the inclusion relation between T and T', all the sets of N on a path linking T and T' also satisfy (2) (resp. (3)), and in particular a child node (resp. a parent node) of T.
As only the parent and child nodes of T need to be considered, the test of pertinence can be performed on the numbers of common and differing elements: because of the inclusion relation, if these numbers are equal then the corresponding sets of elements are identical.
As a result, the mathematical definition can be simplified into the following 3 rules (illustrated in figure 3) that are more suitable for the design of an efficient algorithm:
Rule 1: |T ∩ Q| ≠ 0
Rule 2: ∄T' such that T' ≺ T and |T ∩ Q| = |T' ∩ Q|
Rule 3: ∄T' such that T ≺ T' and |T - Q| = |T' - Q|
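Assuming the explicit dictionary encoding sketched above (the sets of a node, of its child nodes and of its parent nodes are available), the three rules translate directly into the following test; this is a hedged sketch, not the exact implementation of the pertinent(Q, T) routine used in Algorithm 1.

```python
def pertinent(query, target, child_sets, parent_sets):
    """Test Rules 1-3 for a target set, given the explicit sets of its
    child and parent nodes in the Hasse diagram."""
    n_common = len(target & query)
    n_diff = len(target - query)
    if n_common == 0:                                          # Rule 1
        return False
    if any(len(c & query) == n_common for c in child_sets):    # Rule 2
        return False
    if any(len(p - query) == n_diff for p in parent_sets):     # Rule 3
        return False
    return True
```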
Structures and algorithms
Algorithm for the identification and the comparison of pertinent target sets in the explicit representation
The pertinence rules allow us to define an algorithm (given in Algorithm 1 of figure 4) for the identification of target sets that are pertinent for their comparison to a given query set. Its principle is to search the DAG of the neighborhood, starting from the leaves corresponding to query elements (Rule 1) and exploring their ancestors to identify nodes satisfying the pertinence definition. This corresponds to a multiple-source breadth-first search in which the queue is initialized with the nodes corresponding to the query elements. Each time a node is processed, it is tested for pertinence. The search can stop at nodes including the query (Rule 2) or when the target set size is too big to give a significant similarity value. In the latter case, a test on the target set size is performed if an upper bound max_target_size can be computed theoretically (which is the case for most of the similarity models). In this algorithm, we suppose that the sets corresponding to the nodes of the DAG are available (which is not always feasible).

The worst-case time complexity of a breadth-first search is O(V + E), where V is the number of vertices of the DAG and E the number of edges. To test the pertinence of a node, we need (i) to compute the number of common and differing elements of the sets corresponding to the nodes, and (ii) to compare these values with those of the parent and child nodes to check that neither Rule 2 nor Rule 3 is violated. The computation of the number of common and differing elements for a node can be done in O(|S|), the maximum size of a set. The test of pertinence done in pertinent(Q, T) requires access to all the parent and child nodes, which adds up to O(2E) supplementary tests. Thus, the worst-case time complexity of Algorithm 1 is O(|S|V + 3E) = O(|S|V + E). The worst case occurs when all the nodes except the root include some but not all of the query elements.

Let us consider the average case, in which we expect the query set to be small compared to the total number of elements |S|. The target sets sharing elements with the query represent only a subgraph of the DAG (figures 5b and 5c), and the pertinent target sets should have sizes that are commensurate with the query size, which implies that they are deep in the DAG (figure 5c). The number of nodes processed is then typically very small compared to V. Moreover, the average target set size is small compared to |S|. Thus, the |S| factor added to the complexity may be considered as a constant and be negligible in the average case.
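The sketch below summarizes this search under the explicit representation (dictionaries sets, parents and children as introduced earlier, and the pertinent() helper given after the rules). It is a simplified reading of Algorithm 1, not a verbatim transcription of figure 4.

```python
from collections import deque

def algorithm1(query, sets, parents, children, max_target_size=None):
    """Multiple-source breadth-first search started from the leaves holding
    query elements; each processed node is tested for pertinence."""
    leaf_of = {next(iter(sets[v])): v for v in sets if len(sets[v]) == 1}
    queue = deque(leaf_of[e] for e in query if e in leaf_of)   # Rule 1 seeds
    enqueued = set(queue)
    result = []
    while queue:
        v = queue.popleft()
        too_big = max_target_size is not None and len(sets[v]) > max_target_size
        if not too_big and pertinent(query, sets[v],
                                     [sets[c] for c in children[v]],
                                     [sets[p] for p in parents[v]]):
            result.append(v)       # the comparison with Q would be performed here
        if too_big or query <= sets[v]:
            continue               # ancestors are too big or cannot pass Rule 2
        for p in parents[v]:
            if p not in enqueued:
                enqueued.add(p)
                queue.append(p)
    return result
```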
Algorithm 1 assumes that the set compositions are available. In the following, we introduce compact representations of neighborhoods and algorithms efficiently working on such representations.
Generic compact representation of neighborhoods
It is generally inefficient to explicitly store the composition of all the sets of a neighborhood; a compact representation is needed. Such a representation should allow one both to identify and to generate pertinent target sets efficiently, avoiding the generation of all the sets and the traversal of the entire graph. The uniform representation of neighborhoods by DAGs is well suited to compactness: we can store only the DAG defining the poset and reconstruct the sets corresponding to nodes on the fly. In this compact representation, leaf nodes (nodes without successors) correspond to singleton sets (one for each element of S) and all the other nodes correspond to sets that can be built by collecting the labels of the reachable leaf nodes. For efficiency reasons, internal nodes are labeled with the size of the sets they represent, as we will explain later. Figure 6 illustrates the compact representation corresponding to the DAG of figure 1b.
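A minimal sketch of this compact layout is given below (hypothetical class and field names): only the DAG is stored, a leaf carries its single element, an internal node only carries the size of its set, and the composition of a node is rebuilt from the reachable leaves when needed.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    size: int                       # size of the set represented by this node
    element: Optional[str] = None   # label, only for leaves (singleton sets)
    children: List["Node"] = field(default_factory=list)

def rebuild(node: Node) -> set:
    """Reconstruct the set of a node on the fly from the reachable leaves."""
    if node.element is not None:
        return {node.element}
    return set().union(*(rebuild(c) for c in node.children))
```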
Algorithm for the identification of pertinent target sets in the generic compact representation
As in Algorithm 1, the principle is to start from the leaves representing query elements and to traverse the DAG in search of pertinent target sets among the sets sharing elements with the query. The difficulty to solve is that the set compositions are not available. We thus (i) store the size of the set corresponding to each node, as illustrated in figure 6, (ii) order the nodes in the queue by their size, and (iii) propagate the common elements during the search.
In order to test Rule 2, we need the numbers of common elements of the node processed and of its child nodes. With a plain breadth-first order, the node 'mRNA metabolic process' of size 6 in figure 6 would be processed before the node 'RNA splicing' of size 4 (shorter path from the leaves), and thus the common element b would not have been propagated yet. The solution is to keep the queue ordered by set size, which ensures that the sets smaller than the node processed (including all its descendants) have already been processed. This way, all the common elements have been propagated and Rule 2 can be tested at the level of the node processed.
In order to test Rule 3, we need the numbers of differing elements of the node processed and of its parent nodes. Unfortunately, this number is not yet available for the parent nodes because all the common elements may not have been propagated to them. For example, the node 'RNA processing' (size 5) in figure 6 is processed before the element g has been propagated to the node 'RNA metabolic process' of size 7. The solution is to consider the node processed as the parent node and to test whether Rule 3 is violated for its child nodes, that is, to test the pertinence of its child nodes.
As a result, the pertinence decision is divided in two steps:
Step 1: Rule 2 is tested at the level of the node processed.
Step 2: Rule 3 is tested at the level of the child nodes of the node processed.
The efficiency of the resulting algorithm (given in Algorithm 2 in figure 7), compared to Algorithm 1, is only affected by the extraction of the next element of the queue, which must be kept ordered by set size. As there can be at most |S| different set sizes, the worst-case time complexity increases by a factor of log |S| when an adequate data structure is used. As for the previous algorithm, the average target set size is expected to be small compared to |S|, so the log |S| factor may be considered as a constant and be negligible.
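The sketch below gives a simplified version of this search (same dictionary layout as before, plus a size dictionary and a leaf_of mapping from elements to leaf nodes). It is not a verbatim transcription of Algorithm 2: the comparison step and the max_target_size early stop are omitted, and Step 2 is implemented by provisionally accepting a node and discarding it later if one of its parents turns out to have the same number of differing elements.

```python
import heapq
from collections import defaultdict

def algorithm2(query, size, parents, children, leaf_of):
    """Search of the compact DAG: common elements are propagated upwards and
    the queue is ordered by node size, so every descendant of a node has been
    processed (and has propagated its common elements) before the node itself."""
    common = defaultdict(set)        # node -> elements shared with the query
    heap, enqueued = [], set()
    for e in query:
        if e in leaf_of:
            v = leaf_of[e]
            common[v].add(e)
            heapq.heappush(heap, (size[v], v))
            enqueued.add(v)
    potentially = set()              # nodes that passed Rules 1-2 (Step 1)
    pertinent_nodes = set()          # provisional result, pruned by Step 2
    while heap:
        _, v = heapq.heappop(heap)
        n_common = len(common[v])
        n_diff = size[v] - n_common
        # Step 1 -- Rule 2 at the level of v: the common counts of its
        # children are final because all smaller sets were processed first.
        if n_common > 0 and all(len(common[c]) < n_common for c in children[v]):
            potentially.add(v)
            pertinent_nodes.add(v)
        # Step 2 -- Rule 3 at the level of the children of v: a potentially
        # pertinent child with the same number of differing elements as v
        # is redundant with v.
        for c in children[v]:
            if c in potentially and size[c] - len(common[c]) == n_diff:
                pertinent_nodes.discard(c)
        # Propagate the common elements to the parents and enqueue them.
        for p in parents[v]:
            common[p] |= common[v]
            if p not in enqueued:
                enqueued.add(p)
                heapq.heappush(heap, (size[p], p))
    return pertinent_nodes
```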
Specific representations of neighborhoods
The previous compact representation is general and can be used for any neighborhood. Nonetheless, we identified cases where further time and space optimizations can be envisaged. This arises when:
- the DAG is actually a tree (each node has only one ancestor). In this case, the tree can be stored as a parenthesized expression, without the need to store the size of the sets for each node. Typical examples of this situation are hierarchically clustered gene expression profiles, or the IUBMB Enzyme Nomenclature [15], that is, sets of genes/proteins annotated with the same EC number.
- the DAG obtained after building the neighborhood is implicit and thus does not need to be stored. This corresponds, for example, to a correspondence analysis or a principal component analysis of the codon usage (see [16]), or to the sets of genes that are adjacent on the chromosome. In the latter case, we only need to know the order of the genes: any pair of genes defines an interval which, in turn, defines a set of adjacent genes.
In the following, we present efficient algorithms for the identification of pertinent target sets in these specific representations.
Algorithm for the identification and the comparison of pertinent target sets in the tree compact representation
The main advantage of searching for pertinent target sets in a tree is that the child nodes of a given node define a non-overlapping partition of the set it represents; thus, only the numbers of common and differing elements need to be propagated.
The principle is to recursively compute a triplet of values (number of common elements, number of differing elements, a tag indicating potential pertinence) for each node, using a stack of stacks to parse the parenthesized expression. The idea is to push an empty stack when an opening parenthesis is encountered, or a triplet of values when an element is encountered. When a closing parenthesis is read, the computation takes place and consists in the following:
(i) Compute the number of common and differing elements corresponding to this node by summing up the values of the triplets contained in the top stack.
(ii) Test Rule 2 for the current node: if its number of common elements is strictly greater than that of every triplet contained in the top stack, then the tag is set to potentially pertinent (not pertinent otherwise).
(iii) Test Rule 3 for the child nodes: if child nodes tagged potentially pertinent have fewer differing elements than the current node, then they are pertinent and the comparison is performed.
(iv) Replace the top stack by the triplet of computed values.
(v) Stop if all the query elements are included or if the target set size exceeds max_target_size.
Compared to the previous algorithm, this one avoids (i) the merging of the common elements for each node and (ii) the extraction of the next element of the queue. The tree is composed of |S| leaves and of at most |S| - 1 internal nodes; thus, the worst-case time complexity of this algorithm is O(|S|).
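A sketch of this parsing is given below, under simple assumptions: the tree is provided as a tokenized parenthesized expression whose leaves are element labels, singleton leaves are not reported as target sets, and the early stop of step (v) as well as the comparison itself are omitted.

```python
def tree_algorithm(tokens, query):
    """Parse a parenthesized tree expression, e.g. "( ( a b ) ( c d e ) )".split(),
    with a stack of stacks of triplets (common, differing, potentially_pertinent)."""
    stack = []             # stack of stacks, one per currently open node
    pertinent_nodes = []   # (common, differing) of the pertinent target sets
    for tok in tokens:
        if tok == "(":                          # opening parenthesis: new node
            stack.append([])
        elif tok != ")":                        # element (leaf of the tree)
            in_q = 1 if tok in query else 0
            stack[-1].append((in_q, 1 - in_q, False))
        else:                                   # closing parenthesis: compute
            triplets = stack.pop()
            common = sum(c for c, _, _ in triplets)           # step (i)
            differing = sum(d for _, d, _ in triplets)
            # Step (ii), Rule 2: the node has strictly more common elements
            # than each of its children (and at least one, Rule 1).
            tag = common > 0 and all(c < common for c, _, _ in triplets)
            # Step (iii), Rule 3 for the children: a potentially pertinent
            # child with fewer differing elements than this node is pertinent.
            for c, d, t in triplets:
                if t and d < differing:
                    pertinent_nodes.append((c, d))
            if stack:
                stack[-1].append((common, differing, tag))    # step (iv)
            elif tag:
                pertinent_nodes.append((common, differing))   # root has no parent
    return pertinent_nodes
```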
Algorithm for the identification and the comparison of pertinent target sets in implicit compact representations
An implicit representation requires us to provide a specific algorithm for each implied DAG. However, this loss of genericity allows considerable time savings in the search for pertinent target sets and considerable space savings for storing the neighborhoods. We have chosen to present the sets of adjacent genes on the chromosome because this case can be described very simply and briefly, leads to a straightforward algorithm, and was often encountered in our experiments.
For our illustration, we only need to store the genes in the order they appear on the chromosome. Thus, the space requirement is Θ(|S|), instead of the Θ(|S|²) needed for the DAG representation.
To identify pertinent target sets, we only need to know the position on the chromosome of each of the query elements. Then, each pair of positions defines the lower and upper bounds of an interval that, in turn, defines a set. For such a set to be pertinent, the positions just before the lower bound and just after the upper bound must not correspond to query elements, since this would violate Rule 3. Rule 2 holds because the bounds correspond to query elements. The worst-case time complexity of the resulting algorithm is O(|Q|²), Q being the query set. Compared to Algorithm 2 working on the generic compact representation, this algorithm spares (i) the merging of common elements and (ii) the extraction of the next element of the queue.
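Under these assumptions, the whole procedure fits in a few lines; the function and variable names below are hypothetical, and the similarity computation as well as the max_target_size test are omitted.

```python
def adjacent_gene_targets(gene_order, query):
    """Enumerate pertinent intervals of adjacent genes: candidate bounds are
    query positions (so Rule 2 holds), and an interval is kept only if the
    genes just before its lower bound and just after its upper bound are not
    query elements (Rule 3).  Runs in O(|Q|^2)."""
    position = {g: i for i, g in enumerate(gene_order)}
    q_pos = sorted(position[g] for g in query if g in position)
    q_set = set(q_pos)
    pertinent_nodes = []
    for i, lo in enumerate(q_pos):
        if lo - 1 in q_set:
            continue        # extending to the left keeps the same differing set
        for hi in q_pos[i:]:
            if hi + 1 in q_set:
                continue    # extending to the right keeps the same differing set
            pertinent_nodes.append(gene_order[lo:hi + 1])   # one target set
    return pertinent_nodes
```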