Identifying a convenient data structure to solve a given problem is one of the main factors in achieving a high-performance final product. As Skiena explains in [13], picking the wrong data structure for the job can be disastrous in terms of performance, but identifying the very best data structure is usually not as critical, because several choices may perform similarly.
Following the divide-and-conquer principle, the first step is to break the analysis problem down into sub-problems simple enough to be solved in polynomial time by identifying a convenient data structure for each. Here, the analysis algorithm can be split into four parts: (1) checking whether the user’s protein/chemical identifiers are present in Reactome, (2) for those present, finding whether they are parts of complexes and/or sets, as well as the species projection, (3) aggregating the found identifiers in the pathways (and super-pathways) where they are present and finally (4) performing the statistical testing to calculate the likelihood that the association between the sample identifiers and the found pathways is due to random chance.
In the remainder of this section each part is discussed in detail to determine its peculiarities, to present the chosen data structure and the mechanisms adopted for its improvement, and to show how each step connects to the following one to arrive at the final improved analysis algorithm. Another point of emphasis for optimisation is the memory usage of each step, so that the filled data structures can be kept in memory to improve the performance of the data-traversing algorithms implemented on top of them.
User sample identifiers search in Reactome
Annotated physical entities (PE) in Reactome can be either single entities or complexes. Single entities include proteins, small molecules, RNA, DNA, carbohydrates, or lipids, whilst complexes consist of a combination of any of the single entities, or polymers synthesized from the single entities. However, apart from these two main categories, curators in Reactome can group related entities into sets. PEs are the building blocks that later on will be used as inputs, outputs, catalysts or regulators in reactions.
Identifiers or accession numbers are used to refer unambiguously to a single entity, but PEs have different slots to hold the main identifier, secondary identifier, cross-references, synonyms and other identifiers. The main identifier slot is always manually annotated by the experts who curate data in Reactome (curators), and the other slots can either be manually filled during curation or automatically populated during the release process. This strategy allows storing identifiers for a wide range of resources: UniProt, ChEBI, Ensembl, miRBase, GenBank/EMBL/DDBJ, RefPep, RefSeq, EntrezGene, OMIM, InterPro, Affymetrix, Agilent, KEGG Compound, Illumina, etc.
Therefore, in the first part of the analysis, the main requirement is to improve the process of finding out whether each identifier in the user’s sample corresponds to one or many PEs in Reactome. An identifier corresponds to a PE if it matches any of the identifiers stored in the different slots mentioned above. The best way to solve this problem is to follow the reverse approach: creating a lookup table with all the corresponding PEs for each identifier cross-referenced in Reactome. As a consequence, another important requirement is to minimise memory usage so the data can be kept in memory to improve query time.
The selection of a good data structure is then determined by requirements both to implement a fast lookup table and to keep memory usage low. A Trie is an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings [14]. A radix tree is a space-optimized Trie data structure where each node with only one child is merged with its parent [15].
On the one hand, a radix tree keeps the lookup table’s memory usage relatively low because common prefixes are shared, avoiding data duplication (Fig. 1). On the other hand, the cost of comparing a search key for equality with a key stored in the data structure can be dominant and cannot be neglected. The radix tree string lookup algorithm fits the analysis algorithm’s purpose because, when iterating over tree nodes, the time spent seeking an identifier is bounded by that identifier’s length and by how far it matches the Reactome target set. Consequently, when the searched identifier is not contained in the data structure, there is no need to read it in full, unlike in hashing methods, where the hash value of the string must always be computed by reading it entirely.
In summary, once a tree node is reached by following the radix tree lookup algorithm for a given identifier, the presence or absence of references to PEs indicates whether that identifier is present in the database. These “references to PEs” are in fact pointers to nodes in the data structure chosen for the next part of the analysis.
Reactome uses unique primary identifiers for the PEs it references, in particular UniProt for proteins and ChEBI for chemical entities. Thus, if users submit datasets using these reference systems, the mapping to PEs is straightforward. However, following frequent user requests, we also accept input data with non-unique identifiers, in particular gene names. These are then potentially mapped to multiple PEs. Thus, each target node in the tree could contain more than one pointer to the next data structure.
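The lookup strategy described above can be sketched as follows. This is a minimal illustration in Python using a plain (un-merged) trie for brevity rather than a full radix tree; the class names and the identifiers inserted are hypothetical, and each terminal node holds a list of pointers so that non-unique identifiers such as gene names can map to several PEs:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> child TrieNode
        self.entities = []   # pointers to PE graph nodes; empty means "not a key"

class IdentifierTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, identifier, entity):
        node = self.root
        for ch in identifier:
            node = node.children.setdefault(ch, TrieNode())
        node.entities.append(entity)   # non-unique ids accumulate several PEs

    def lookup(self, identifier):
        node = self.root
        for ch in identifier:
            node = node.children.get(ch)
            if node is None:           # early exit: no need to read the whole key
                return []
        return node.entities

# Hypothetical entries: a UniProt accession and a gene name
trie = IdentifierTrie()
trie.insert("P12345", "PE-node-1")
trie.insert("BRAF", "PE-node-1")   # gene name mapping to one PE...
trie.insert("BRAF", "PE-node-2")   # ...and to a second PE
```

Note how `lookup` aborts as soon as a character has no matching child, which is the early-exit property contrasted with hashing in the text.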
Traversing complexes/sets composition and species projection
Reaching the associated single entity for a given identifier is the beginning of the second step of the analysis. When these single entities are part of a complex, that complex is also a target in this step of the analysis. Besides single entities and complexes, there is a third type of PE, called a set, which, along with complexes, also has to be considered. A set is an abstract representation of a group of two or more entities which do not interact with each other but are functionally equivalent in the situation where the set is used, for example multiple members of a family of enzymes that could each potentially catalyse a reaction. Furthermore, complexes and sets can themselves contain other complexes and sets in order to represent much more elaborate structures, which increases the intricacy of the problem.
Another specific requirement is the possibility of performing species projection to collect the results for Homo sapiens independently of the species for which the identifiers are provided, to benefit from the more complete Reactome annotation for Human. To do so, the species orthologs annotated in Reactome have to be taken into account. Orthologs are entities in different species that evolved from a common ancestor by speciation.
The last requirement in this step is to keep track of the mapping between the submitted identifiers and those used in Reactome to curate the single entities: UniProt accessions for proteins, Ensembl identifiers for genes, ChEBI identifiers for small molecules and miRBase identifiers for microRNAs. Although an important part of this mapping was already set up by including the known cross-references as identifiers in the radix tree in the previous step, the mapping itself has to be implemented in this step.
Summarising the requirements exposed for this step of the analysis, the chosen data structure has to model the entity composition problem, the species orthologs projection and the entity mapping. A directed graph is a graph, or set of nodes connected by edges, where each edge has a direction associated with it. For a given graph G with several nodes (a, b, c and d), if G has an arrow from a to b and another arrow from b to c, then the composed graph G² has an arrow from a to c. If G has an arrow from a to b, another arrow from b to c and yet another from c to d, then the composed graph G³ has an arrow from a to d.
Building one graph per species (Fig. 2a) and interconnecting them by linking all the ortholog nodes (Fig. 2b) creates a bigger graph that satisfies the projection requirement. Because each node is unique in the final graph, a node that is part of one or more structured entities contains as many edges pointing to other graph nodes as there are structures containing it, so structured entities are easily modelled. Finally, if each node of the graph holds its associated entity’s main identifier (Fig. 2c), then whenever it is reached from a radix tree node representing an identifier other than the main one, this association is stored and offered, once the analysis is finished, as the required mapping in the result.
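The ortholog interconnection can be reduced to a very small sketch: each non-human node may carry a link to its Homo sapiens counterpart, followed before the rest of the traversal when projection is requested. The identifiers below are invented for illustration only:

```python
# Hypothetical ortholog links (cf. Fig. 2b): non-human graph nodes point
# at their Homo sapiens counterpart; "MMU-" prefixes are made up here.
ortholog_of = {
    "MMU-P1": "P1",
    "MMU-P2": "P2",
}

def project(entity):
    """Species projection: switch to the human ortholog when one is annotated,
    otherwise keep the entity unchanged."""
    return ortholog_of.get(entity, entity)
```

With projection enabled, every matched node is first passed through `project`, so the subsequent composition traversal runs entirely on the human sub-graph.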
The graph in Fig. 2a shows three proteins (P1, P2 and P3), two complexes (C1 and C2), and two sets (S1 and S2). By following the edge from node to node, S2 could be either P2 or P3, formally represented as [P2,P3]. C1 is a complex which, due to its edge from S2, is then potentially two complexes: {P1,P2} or {P1,P3}, represented as [{P1,P2},{P1,P3}]. Following this deconstruction, S1 is then [P1, {P1,P2}, {P1,P3}] and finally C2 is [{P1,P2}, {{P1,P2},P2}, {{P1,P3},P2}].
For instance, when an identifier matching P3 is processed and its corresponding node in the graph is reached from the radix tree, traversing the graph to reach the nodes S2, C1, S1 and C2 takes minuscule processing time. Likewise, if the target protein is P1, the reachable nodes following the graph edges are C1, S1 and C2. In both examples each target protein is part of the complexes and sets represented by the traversed nodes.
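The two traversals above can be reproduced with a short sketch. The adjacency map below encodes the hypothetical containment edges of Fig. 2a (an edge x → y meaning “x is a component or member of y”); the traversal collects every complex or set reachable from a matched entity:

```python
# Containment edges from the worked example: S2 = [P2,P3], C1 = {P1,S2},
# S1 = [P1,C1], C2 = {S1,P2}.
containers = {
    "P1": ["C1", "S1"],
    "P2": ["S2", "C2"],
    "P3": ["S2"],
    "S2": ["C1"],
    "C1": ["S1"],
    "S1": ["C2"],
}

def reachable(entity):
    """All complexes/sets that directly or indirectly contain `entity`."""
    seen, stack = set(), [entity]
    while stack:
        for nxt in containers.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen
```

Because each edge is visited at most once, the work done per matched identifier is bounded by the number of structures that actually contain it, which is the cost argument made in the text.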
Employing a graph reduces the cost of the analysis algorithm and, importantly for an in-memory analysis, keeps memory usage low because there is no data duplication: the node for a given main identifier is held in memory only once. In addition, the final number of node iterations is bounded by the entities related to a given identifier, avoiding queries against a large amount of data and the merging of intermediate results required in the database-based approach.
As with the radix tree described above, the graph also requires a strategy to allow the algorithm to move on to the next analysis step. In this case, each graph node representing an entity directly associated with one or several pathways contains as many links to the following data structure as different locations where it is present. Although the current analysis step finds each entity associated with the target identifier, for the final result and the statistics calculation there is still one more data structure to be used, as explained in the following sub-section.
Results aggregation into the pathways organisation
Every PE that was directly or indirectly hit in the previous step is associated with one or more pathways. To calculate the significance of each pathway for a given user sample, it is essential to determine the number of entities found per pathway. Due to the parent-child organisation of the Reactome pathways in an ontology-like hierarchy, when an entity is present in a certain pathway it is also present in its super-pathways, recursively, until a top-level pathway is reached (e.g. if a protein is present in “Metabolism of carbohydrates”, it is also present in “Metabolism”).
Taking into account the requirements previously discussed, a good data structure to model this step is a double-linked tree, where each node represents a pathway and contains links to its parent and children (Fig. 3). When a node in the tree is hit, the action can be recursively propagated all the way up to the root. To reduce the memory footprint, only identifiers, names and placeholders for the results calculation are kept in each node.
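The upward propagation can be sketched in a few lines. This is a simplified illustration, not Reactome's actual implementation: each node keeps a link to its parent and a placeholder set of found entities, and a hit walks the parent links up to the root:

```python
class PathwayNode:
    """A node in the double-linked pathway tree (cf. Fig. 3)."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.found = set()          # placeholder for results: entities hit here
        if parent is not None:
            parent.children.append(self)

    def hit(self, entity):
        """Record a hit and propagate it to every super-pathway up to the root."""
        node = self
        while node is not None:
            node.found.add(entity)
            node = node.parent

# The example from the text: a hit in "Metabolism of carbohydrates"
# is also counted for its super-pathway "Metabolism".
metabolism = PathwayNode("Metabolism")
carbohydrates = PathwayNode("Metabolism of carbohydrates", metabolism)
carbohydrates.hit("P1")
```

Using a set rather than a counter means an entity reached through several complexes or sets is still counted once per pathway.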
Apart from being a convenient data structure to speed up the collection of results and a good holder for the statistics, this data structure can also be serialised to a file once the analysis is finished, in order to persist the result. In addition, associating the file with a token provides an easy way to create finer-grained methods that filter the result on the server side, helping to speed up light-weight clients. In this scenario, clients can keep the token once the initial analysis is finished and, depending on the user’s needs, perform several requests to the server referencing the associated token.
Analysis result statistics calculation
The basic hypothesis in an over-representation analysis is that relevant pathways can be detected if the proportion of differentially expressed genes within a given pathway exceeds the proportion of genes that could be randomly expected [1]. Consequently, the fourth and last step in the analysis method involves the statistics calculation. This step does not require any extra data structure, because the double-linked tree fits the purpose perfectly.
The p-value shows the statistical significance of each hit pathway for a given sample and the background for which the analysis has been performed. In Reactome the method used to calculate the statistical significance is the binomial test. Together with the p-value, the False Discovery Rate (FDR) helps estimate the false positives and is calculated using the Benjamini-Hochberg approach [16]. As mentioned above, we have focussed on optimising the performance of the Reactome pathway analysis, while maintaining the basic algorithm as previously published [17].
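As a sketch of the two calculations (not Reactome's actual implementation), the one-sided binomial test for a pathway with k sample hits out of n submitted entities, where p is the background probability of hitting that pathway, together with the Benjamini-Hochberg adjustment over all tested pathways, can be written as:

```python
from math import comb

def binomial_pvalue(k, n, p):
    """One-sided binomial test: probability of observing at least k hits
    among n sample entities if each hit had background probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg FDR: adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted, prev = [0.0] * m, 1.0
    for rank in range(m, 0, -1):      # walk from the largest p-value down
        i = order[rank - 1]
        prev = min(prev, pvalues[i] * m / rank)
        adjusted[i] = prev
    return adjusted
```

The FDR step enforces monotonicity by carrying the running minimum down from the largest p-value, which is the standard step-up formulation of the Benjamini-Hochberg procedure.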