Explorative search of distributed bio-data to answer complex biomedical questions
© Masseroli et al.; licensee BioMed Central Ltd. 2014
Published: 10 January 2014
The huge and growing amount of biomedical-molecular data being produced provides scientists with potentially valuable information. Yet, such a quantity of data makes it difficult to find and extract those data that are most reliable and most related to the biomedical questions to be answered, which are increasingly complex and often involve many different biomedical-molecular aspects. Such questions can be addressed only by comprehensively searching and exploring different types of data, which are frequently ordered and provided by different data sources. Search Computing has been proposed for the management and integration of ranked results from heterogeneous search services. Here, we present its novel application to the explorative search of distributed biomedical-molecular data and the integration of the search results to answer complex biomedical questions.
A set of available bioinformatics search services has been modelled and registered in the Search Computing framework, and a Bioinformatics Search Computing application (Bio-SeCo) using such services has been created and made publicly available at http://www.bioinformatics.deib.polimi.it/bio-seco/seco/. It offers an integrated environment which eases search, exploration and ranking-aware combination of heterogeneous data provided by the available registered services, and supplies global results that can support answering complex multi-topic biomedical questions.
By using Bio-SeCo, scientists can explore the very large and very heterogeneous biomedical-molecular data available. They can easily make different explorative search attempts, inspect obtained results, select the most appropriate, expand or refine them and move forward and backward in the construction of a global complex biomedical query on multiple distributed sources that could eventually find the most relevant results. Thus, it provides an extremely useful automated support for exploratory integrated bio search, which is fundamental for Life Science data driven knowledge discovery.
The data deluge of the post-genomic era is providing scientists with potentially valuable information, but makes it difficult to find and extract from the available data those that are most reliable and most related to the biomedical questions to be answered. Moreover, such questions are increasingly complex and often simultaneously regard many heterogeneous aspects of an organism and its biomolecular entities. Several of these questions can be addressed only by searching, extracting, integrating and comprehensively querying different types of data, which are distributed in several data sources and often inherently ordered or associated with ranked confidence values. Usually, scientists manually explore these data using the individual search services available and struggle to combine intermediate results in order to find the most adequate answers to their global questions.
Several data integration platforms and workflow systems have been created to query and combine available data and services from heterogeneous sources in order to explore existing information and extract new knowledge. Proposed data integration approaches can be grouped with respect to the adopted integration techniques or interaction paradigms. The former include information linkage, data warehousing, mediator based systems and service integration methods. Information linkage implementations, like SRS or NCBI Entrez, enable users to interrogate several sources through a single Web site and provide results with links to the data sources; yet, they do not integrate the retrieved data. Fully materialized systems, like EnsMart or BioWarehouse, integrate data within a warehouse according to a local schema. This approach makes it easy to perform complex computations on the integrated data, but requires frequent updates of the data warehouse, which is generally a complex task. Mediator based systems, like TAMBIS or BioMart, are designed to query remotely distributed sources through a virtual mediated schema; the query on the mediated schema is transformed into queries over the schemata of the diverse sources and the retrieved data are processed locally. In mediated approaches data remain in the original sources without being materialized locally; thus, mediator based approaches provide up-to-date data, but complex computations on the data are challenging. Service integration approaches require registering the services in order to describe them according to an integration model. Among others, Mork et al. proposed an entity-based model to integrate data from diverse services; they suggested registering services through a DSL (Domain Specific Language), based on an eXtensible Markup Language (XML) file, and mapping them onto the entities described in the model.
Among interaction paradigms, the path-based approach is similar to the exploratory one used in our work; it is founded on a semantic graph, built according to the links available between sources, which enables users to compose queries by selecting entities from the graph. Biozon, GenoQuery and the BioGuide (http://www.bioguide-project.net/) tool family (e.g. BioGuideSRS) are examples of implementations of this approach. Several other types of query interfaces have also been proposed. Recently, Latendresse and Karp presented their Structured Advanced Query Page as an original interface to query a unique integrated database containing multiple data types.
Notable examples of workflow systems supporting service and data integration include Taverna, Wings/Pegasus [14, 15], Galaxy, Triana and Kepler. Yet, Taverna, the best known and most widely used in bioinformatics, and the other available workflow systems do not rely on a general model of the services to be integrated. Furthermore, available data integration platforms and workflow systems do not take into account, in the integration process, the partial rankings that are often available for the data to be integrated. Thus, they cannot provide support for ranking-aware multi-topic searches. Both these limitations are addressed and overcome by Search Computing (http://www.search-computing.org/). It has been proposed as a new software framework that provides the abstractions, foundations, methods, and tools required to answer complex multi-topic queries over multiple data sources, including ranked ones. It reaches this goal by interacting with a collection of cooperating search services and using ranking and joining of results as the dominant factors for service composition. The diverse services are described, according to a general and flexible service model, at three different levels of abstraction, i.e. at conceptual, logical and physical level [19, 20]; then, they are wrapped, registered in the system and mapped onto the virtual mediated schema, which is built based on the semantic relationships between services described at service registration. These aspects differentiate Search Computing from previous proposals for service registration and integration of data from diverse services, such as the one from Mork et al.
Here, we illustrate and discuss our novel work to support explorative integrated bio search and ranking-aware combination of distributed biomedical-molecular data, aimed at answering multi-topic complex biomedical questions. This work complements a previous study of the envisaged relevance of Search Computing to the Life Sciences, in particular to information integration and support for Life Sciences ordered data. The foundation of the extension of Search Computing in support of explorative searches in complex biomedical-molecular scenarios was briefly introduced in earlier reports; here, this extension is thoroughly illustrated and discussed, focusing on a paradigmatic bioinformatics use case. By supporting interactive explorative multi-topic data searches, the work presented here significantly extends a previous approach focused only on the efficient execution of predefined single global multi-topic queries over multiple ranked search services. The demonstrator prototype initially developed to implement that previous approach is significantly extended and enhanced by the original Web application presented here and made publicly available. Besides allowing users to query diverse services and integrate the data they provide on-the-fly, it additionally supports exploration (inspection and selection) of intermediate partial results, as well as their expansion and refinement through search query modification and extension. Furthermore, it enables users to attribute different weights to results from diverse sources.
By performing the exploratory search steps of the use case example described above, the scientist can explore the biomedical-molecular Semantic Resource Framework defined by the bioinformatics services registered in Bio-SeCo (Figure 1). In so doing, he/she can compose and submit a global query that might find the answer to his/her original complex multi-topic question: "Which genes encode proteins in different organisms with high sequence similarity to a given protein X, are significantly over co-expressed in the same given biological tissue or condition Y and are involved in the biological process Z?" The possibility to easily construct such complex biomedical queries in an explorative way and run them efficiently across multiple distributed sources allows global evaluations of available bio-data that can unveil unexpected results and lead to new biomedical knowledge discoveries. On December 18th 2013, we ran the above example global query by using equal service relative weights and setting input parameter values with the human Paired box protein Pax-6 isoform a protein [UniProt:P26367] ID as amino acid sequence X, tumor as pathological biological condition Y, and regulation of apoptotic process as biological process Z.
The created Bio-SeCo application implements a novel exploratory search interaction paradigm and supports the user in performing a progressive step-by-step construction of the search query by exploring the data provided by the available services registered in Bio-SeCo. This aspect of expanding an initial query, according to the liquid query paradigm, after evaluating its provided results, in order to refine or extend them, innovatively differentiates our exploration approach from the path-based one.
Conversely, both approaches use a graph of sources to express the queries; thus, Figure 1 could also be obtained in path-based systems [9–11]. Usually scientists perform such supervised exploration of data manually by using the individual tools available, saving single search results somewhere (e.g. within a spreadsheet) and manually combining and comparing them in order to identify common patterns and try to find answers to their global questions. Bio-SeCo offers an integrated environment in which to perform such data exploration, which automatically saves intermediate results, combines them taking into account their partial order and supplies ordered global results. Furthermore, Bio-SeCo offers multiple alternative and interchangeable types of result visualization, i.e. table, atom and scatter plot views, and also makes it easy to integrate new advanced visualizations.
The order of the provided results is induced by their global scores, computed on the basis of Fagin's method and according to a score function defined as a combination of partial scores of intermediate ranked results, as described in the Methods section. This choice seems to be the most appropriate for Bio-SeCo, which aims at quickly giving global ordered answer sets to complex user searches on multiple combined search services that provide individual rankings, possibly incomplete and with ties. It was positively evaluated by the users who provided feedback about the relevance of the system and its ranking strategy. An alternative to Fagin's method could be the very promising BioConsert method, recently presented by Cohen-Boulakia et al. They proposed to rank answer sets, retrieved for a user query, according to a median-based consensus ranking generated on the basis of the results of a set of ranking methods and reflecting their common points. Since finding a median of rankings with ties is an NP-hard problem, they proposed an interesting heuristic to generate such a consensus ranking. It performs well with the datasets considered in their work; yet, being a greedy heuristic, it is not guaranteed to always perform as well for all data sets.
The future development of Bio-SeCo will focus on further extending its Semantic Resource Framework by registering additional bioinformatics services in Bio-SeCo, thus supporting a wider variety of biomedical questions, including even more complex ones. It will also include guiding the user's exploration of available resources towards the ones that provide more appropriate data according to the user's preferences and strategies. In this regard, path-based systems like Biozon and BioGuideSRS are important references for systems aimed at assisting scientists in searching for relevant data within external sources while taking their predilections and policies into account.
By using available services to search biomedical-molecular data and taking advantage of the ranking attributes that they define, the Bioinformatics Search Computing application described here allows efficient exploration of available bio-data and search for globally ranked answers to complex multi-topic biomedical questions. In so doing, it offers valuable and powerful automated support for the exploratory integrated bio searches at the basis of Life Science data-driven knowledge discovery.
In order to create our Bio-SeCo application, we first selected a set of typical biomedical-molecular topics (i.e. Protein, Gene, Gene Expression, Biological Function and Genetic Disorder) to be included in Bio-SeCo. According to the service mart modelling approach, we modelled the service marts (i.e. the generalized and normalized conceptual descriptions) of bioinformatics services that provide data about such topics. We did so by identifying their main and common attributes and normalizing their names. We also defined the semantic connection patterns, i.e. the pair-wise coupling, between service marts of services that provide data about different topics. This was done by identifying pairs of normalized attributes of the connected service marts and defining their comparison predicates, as conjunctive Boolean expressions, that allow joining their values semantically.
Then, using available Search Computing tools, we registered in Bio-SeCo some bioinformatics search services that provide data about the selected biomedical-molecular topics and their semantic associations. They include two BLAST sequence alignment and search services available at Washington University (WU) and National Center for Biotechnology Information (NCBI), respectively, the search engine over the Array Express repository of gene expression data, and five query services over our Genomic and Proteomic Data Warehouse (GPDW) publicly available at http://www.bioinformatics.deib.polimi.it/GPKB/. The latter provide access to Gene, Protein and their Genetic Disorder and Biological Function Feature (i.e. Gene Ontology Biological Process, Molecular Function and Cellular Component) annotation data.
For each service, the service registration consists in first creating a wrapper, i.e. an adapter that matches the service attributes to their normalized version defined in a modelled service mart, and associating the wrapper with that service mart. Since each type of service is modelled by a single service mart, several registered services can share the same service mart, such as the two registered BLAST services. Then, one or more access patterns and a service interface are defined for each service. The service interface maps an access pattern to the wrapper of the end point of the service data source, which is used to call the service. The access patterns, which can be shared by several services associated with the same service mart, are specific signatures of a service mart, with the characterization of each attribute as input (I) or output (O), depending on the role that the attribute plays in the service call. Furthermore, an output attribute can be characterized as ranked (R), if the service produces its results in an order that depends on the values of that attribute. Based on the semantic type of the access pattern input and output attributes of two registered services, specific connection patterns between individual services are then automatically derived from the connection patterns defined at conceptual level between the service marts associated with the registered services. All these tasks can be done quite easily by following the documentation provided by the Search Computing project. As an example, the access patterns that we created to model the NCBI Blast sequence alignment search by Protein ID and GPDW Biological Function Feature by Protein ID services, together with their pair-wise coupling connection pattern, are reported as follows.
NCBI-BLAST(SearchedDB^I, QueryUniprotProteinID^I, TopAlignment^I, SubstitutionMatrix^I, ExpectationUpper^I, SearchFilter^I, GapOpenCost^I, GapExtensionCost^I, FoundSequenceID^O, FoundSequenceIDName^O, FoundSequenceSymbol^O, FoundSequenceDescription^O, FoundSequenceLength^O, BestAlignmentExpectation^R)
GPDW_BiologicalFunctionFeature(ProteinID^I, ProteinIDName^I, BiologicalFunctionFeatureName^I, BiologicalFunctionFeatureID^O, BiologicalFunctionFeatureIDName^O, BiologicalFunctionFeatureName^O, BiologicalFunctionFeatureDefinition^O)
ExistsProteinBiologicalFunctionFeature(NCBI-BLAST, GPDW_BiologicalFunctionFeature): [(NCBI-BLAST.FoundSequenceID = GPDW_BiologicalFunctionFeature.ProteinID) AND (NCBI-BLAST.FoundSequenceIDName = GPDW_BiologicalFunctionFeature.ProteinIDName)]
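The access patterns and connection pattern above can be mirrored in a small data structure. The following Python sketch is only an illustration under our own assumptions: the class names (Role, AccessPattern, ConnectionPattern) and the check method are hypothetical and are not part of the Search Computing framework, and only a subset of the attributes listed above is reproduced. It shows how a connection pattern can be validated by requiring that each pair couples an output (or ranked) attribute of the first service with an input attribute of the second.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    INPUT = "I"
    OUTPUT = "O"
    RANKED = "R"  # an output attribute whose values induce the result order


@dataclass
class AccessPattern:
    """Hypothetical model of an access pattern: a service signature."""
    service: str
    attributes: list  # (attribute name, Role) pairs, in signature order

    def names(self, *roles):
        return [name for name, role in self.attributes if role in roles]


@dataclass
class ConnectionPattern:
    """Hypothetical model of a pair-wise coupling between two services."""
    name: str
    source: AccessPattern
    target: AccessPattern
    pairs: list  # (source attribute, target attribute), all joined with "=" and AND

    def check(self):
        produced = self.source.names(Role.OUTPUT, Role.RANKED)
        consumed = self.target.names(Role.INPUT)
        for src, dst in self.pairs:
            if src not in produced:
                raise ValueError(f"{src} is not produced by {self.source.service}")
            if dst not in consumed:
                raise ValueError(f"{dst} is not an input of {self.target.service}")
        return True


# Subset of the attributes listed in the text, for brevity.
blast = AccessPattern("NCBI-BLAST", [
    ("QueryUniprotProteinID", Role.INPUT),
    ("FoundSequenceID", Role.OUTPUT),
    ("FoundSequenceIDName", Role.OUTPUT),
    ("BestAlignmentExpectation", Role.RANKED),
])
gpdw = AccessPattern("GPDW_BiologicalFunctionFeature", [
    ("ProteinID", Role.INPUT),
    ("ProteinIDName", Role.INPUT),
    ("BiologicalFunctionFeatureName", Role.INPUT),
    ("BiologicalFunctionFeatureID", Role.OUTPUT),
])
link = ConnectionPattern(
    "ExistsProteinBiologicalFunctionFeature", blast, gpdw,
    [("FoundSequenceID", "ProteinID"), ("FoundSequenceIDName", "ProteinIDName")],
)
link.check()
```

Attributes are kept as an ordered list of pairs rather than a dictionary because, as in the GPDW access pattern above, the same attribute name may appear with more than one role.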
By doing all the described service registration steps, we created the Semantic Resource Framework depicted in Figure 1. It constitutes the reference used by Bio-SeCo to enable the query, exploration and integration of the data provided by the services registered in the framework.
A query on a single search service registered in the framework is expressed based on the user inputs and the selected service access pattern. Expansion of a search service query on another search service is performed, according to the liquid query paradigm, by composing single search service sub-queries based on the chosen connection pattern. The latter specifies the output values of the first service to be used as input values to the second service, as well as their conjunctive logical conditions to be implemented in the query execution plan. In this way, an exploratory expanded query, expressed on the biomedical-molecular semantic resource network created at service registration time, can be formalized in concrete sub-queries posed to the search services associated with the network nodes and related to each other as defined by the network arcs. For example, according to the above defined NCBI-BLAST and GPDW_BiologicalFunctionFeature access patterns and their coupling connection pattern, the expansion on the network Biological Function node (i.e. GPDW_BiologicalFunctionFeature service) of an initial query for Protein similarity (i.e. using NCBI-BLAST service) is expressed through the two following sub-queries:
NCBI-BLAST(SearchedDB, QueryUniprotProteinID, TopAlignment, SubstitutionMatrix, ExpectationUpper, SearchFilter, GapOpenCost, GapExtensionCost)
GPDW_BiologicalFunctionFeature(NCBI-BLAST.FoundSequenceID, NCBI-BLAST.FoundSequenceIDName, BiologicalFunctionFeatureName)
Their execution plan provides as expanded results only those items from the first and the second sub-query that together satisfy the conjunctive logical conditions defined in the used connection pattern. Notice that join conditions used in an expanded query are clearly shown in the Bio-SeCo user interface (Figure 6). In the considered example, the expanded results include only those user selected proteins that, according to the NCBI-BLAST service, are similar in sequence to a user specified protein and have the user specified biological function(s), according to the GPDW_BiologicalFunctionFeature service. Thus, multi-service expanded results always include only the items in common in the partial results from each of the sub-queries composed, i.e. from each combined search service.
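The expansion just described behaves like a join in which the outputs of the first sub-query are piped into the inputs of the second. The Python sketch below illustrates this under stated assumptions: the function pipe_join, the mock service mock_function_service, and all protein identifiers and annotations are invented for illustration; they do not reproduce the Bio-SeCo execution engine or real service responses.

```python
# Attribute pairs taken from the connection pattern in the text.
PAIRS = [
    ("FoundSequenceID", "ProteinID"),
    ("FoundSequenceIDName", "ProteinIDName"),
]

# Invented annotation table standing in for the second service.
ANNOTATIONS = {
    ("P001", "UniProt"): ["regulation of apoptotic process", "eye development"],
    ("P002", "UniProt"): ["eye development"],
}


def mock_function_service(ProteinID, ProteinIDName, BiologicalFunctionFeatureName):
    """Hypothetical stand-in for the GPDW_BiologicalFunctionFeature service."""
    names = ANNOTATIONS.get((ProteinID, ProteinIDName), [])
    return [
        {"BiologicalFunctionFeatureName": name}
        for name in names
        if name == BiologicalFunctionFeatureName
    ]


def pipe_join(first_results, second_service, pairs, extra_inputs):
    """Feed each row of the first sub-query into the second service.

    Only rows for which the second service returns at least one hit are
    kept, so the result contains the items common to both sub-queries.
    """
    combined = []
    for row in first_results:
        inputs = {dst: row[src] for src, dst in pairs}
        inputs.update(extra_inputs)
        for hit in second_service(**inputs):
            combined.append({**row, **hit})
    return combined


# Invented rows standing in for the results of the first sub-query.
blast_rows = [
    {"FoundSequenceID": "P001", "FoundSequenceIDName": "UniProt"},
    {"FoundSequenceID": "P002", "FoundSequenceIDName": "UniProt"},
    {"FoundSequenceID": "P003", "FoundSequenceIDName": "UniProt"},
]

expanded = pipe_join(
    blast_rows,
    mock_function_service,
    PAIRS,
    {"BiologicalFunctionFeatureName": "regulation of apoptotic process"},
)
```

In this toy run only the row for the invented protein P001 survives, since it is the only one annotated with the requested biological function.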
To compose individual search results of a multi-topic query, taking into account their partial rankings, and provide a global score, Bio-SeCo uses a highly efficient algorithm for rank aggregation [35–37]. It takes into account the following four major aspects of the Bio-SeCo application scenario. First, individual search results are provided by single search services that are individually called and composed within Bio-SeCo; the time and completeness of their answers are not guaranteed. Second, ordered search results are usually partially ranked, i.e. they can include ties. Third, depending on the user chosen parameters, individual search services may provide only top k ordered results. Fourth, as specified in the previous Methods subsection, global ranking is defined for subsets of equal number of common partial results from each sub-query (i.e. from each single search service). Thus, consensus ranking methods, which usually exploit the fact that the same data item is found in several rankings to construct the consensus, can be straightforwardly applied to get a global ranking for the global results on the basis of their partial rankings. Based on a consensus method previously proposed by Fagin et al., the ranking algorithm implemented in Bio-SeCo can efficiently compute the elements of a near-optimal aggregation of multiple partial rankings induced by a global score. This score is computed according to a scoring function defined as the weighted summation of multiple partial scores of intermediate ranked results. The scores of the individual search results, i.e. the inputs of the scoring function, are provided by the ranked attribute of every search service called in the multi-topic (i.e. multi-service) query, where the ranked attribute of each service is identified by the specific access pattern used in the query for that service.
The weights of the scoring function are defined, for each registered service, as the product of a service specific and a service relative weight. The former are set according to the values of the ranked attribute of the specific service to which each of them refers, in order to normalize the partial rankings of each individual search to be composed in the global search. The latter ensure that the composed global score is in the [0.0 - 1.0] range, with 1.0 as the best score. Through the Bio-SeCo interface, and within the constraint of this global score range, the user can interactively change the default equal values of the single service relative weights (Figure 6) to attribute more or less weight, in the global ranking, to results from some of the composed search services.
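The weighted summation described above can be illustrated with a short Python sketch. It rests on assumptions of ours that the text does not specify: each partial score is taken to be already expressed so that higher is better, the service specific weight is taken to be the reciprocal of an assumed maximum score for that service, and the service relative weights are required to be non-negative and sum to 1 so that the global score stays in the [0.0 - 1.0] range. The function name, service names and all numeric values are hypothetical.

```python
def global_score(partial_scores, max_scores, relative_weights):
    """Weighted sum of partial scores, one per composed service.

    partial_scores:   service name -> score of the partial result (higher is better)
    max_scores:       service name -> assumed maximum score, used for normalization
    relative_weights: service name -> user-adjustable weight; weights must sum to 1
    """
    if any(w < 0 for w in relative_weights.values()):
        raise ValueError("relative weights must be non-negative")
    if abs(sum(relative_weights.values()) - 1.0) > 1e-9:
        raise ValueError("relative weights must sum to 1")
    total = 0.0
    for service, score in partial_scores.items():
        specific_weight = 1.0 / max_scores[service]  # normalizes to [0, 1]
        total += specific_weight * relative_weights[service] * score
    return total


# Invented example: two composed services with equal relative weights.
max_scores = {"similarity": 100.0, "expression": 1.0}
equal_weights = {"similarity": 0.5, "expression": 0.5}

combinations = {
    "combination A": {"similarity": 80.0, "expression": 0.5},
    "combination B": {"similarity": 60.0, "expression": 0.9},
}

# Order the combined results by decreasing global score.
ranked = sorted(
    combinations,
    key=lambda name: global_score(combinations[name], max_scores, equal_weights),
    reverse=True,
)
```

With equal weights, combination B (global score 0.75) ranks above combination A (0.65); shifting relative weight towards the similarity service would reverse that order, which is the effect the interactive weight adjustment is meant to give the user.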
DSL: Domain Specific Language
Bio-SeCo: Bioinformatics Search Computing application
BLAST: Basic Local Alignment Search Tool
GPDW: Genomic and Proteomic Data Warehouse
NCBI: National Center for Biotechnology Information
SRS: Sequence Retrieval System
XML: eXtensible Markup Language.
The authors would like to thank all the Search Computing development team, in particular Chiara Pasini, for the valuable support provided regarding the Search Computing framework and Davide Chicco for registering in Bio-SeCo some query services over the GPDW and their semantic connections.
This research is part of the "Search Computing" project (2008-2013), funded by the European Research Council (ERC), under the 2008 call for "IDEAS Advanced Grants".
This work has been partially supported by the "Search Computing" project (2008-2013), funded by the European Research Council (ERC), under the 2008 call for "IDEAS Advanced Grants".
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 1, 2014: Integrated Bio-Search: Selected Works from the 12th International Workshop on Network Tools and Applications in Biology (NETTAB 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S1.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.