We have presented a data integration system that uses and indexes XML-based data representation formats. The basic unit of data stored within the DIPSBC platform is therefore the XML document. This unit is very generic and can range from single genes and pathways to whole-genome microarray experiment results, implying very high variability in data granularity. We use XML as the central data format in order to capture this granularity and to make heterogeneous data compatible, a prerequisite for the coordinated integration of the various data sets.
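To illustrate this range of granularity, the following sketch shows two hypothetical documents, one fine-grained and one coarse-grained; all element names and identifiers here are illustrative assumptions, not the actual DIPSBC schemas:

```xml
<!-- A fine-grained document: a single gene record -->
<gene id="ENSG00000012048">
  <symbol>BRCA1</symbol>
  <organism>Homo sapiens</organism>
</gene>

<!-- A coarse-grained document: a whole microarray experiment -->
<microarray_experiment id="exp-001">
  <platform>Affymetrix HG-U133</platform>
  <measurement probe="201746_at" gene="TP53" value="8.42"/>
  <measurement probe="204531_s_at" gene="BRCA1" value="5.17"/>
</microarray_experiment>
```

Both kinds of document can be indexed and queried uniformly, which is what makes the heterogeneous data sets compatible.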
As a result, the document management of our system is highly flexible, compliant with community standards and well suited for data collaborations. On the one hand, the adoption of community standards enables cross-referencing proprietary data with publicly available data sets and with applets for data visualization such as genome browsers, as demonstrated in the use cases. On the other hand, for data types that are not yet standardized or that are too heterogeneous to be standardized, for example very specific data analysis results, the system offers full format flexibility and imposes essentially no restrictions, as demonstrated by the introduction of a custom standard for data analysis results (Figure 1).
Currently, adding new data to the system involves two steps: first, the consortium member who generated the data set (e.g. from a microarray experiment) transfers it to the administrator; second, the administrator checks the data for integrity by XSD schema validation and then adds the normalized XML to the index. Although this procedure ensures data integrity through manual curation, it would still be preferable to automate the XML transformation, validation, normalization and indexing steps, for example by implementing custom Perl plugins. These plugins could provide data upload interfaces, enabling members of a given collaboration to add their experimental data to the system directly. A corresponding interface is currently under development and will be provided in a future version of DIPSBC.
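A minimal sketch of such an automated ingest check, with purely illustrative document content: a production version would perform full XSD schema validation (e.g. with a library such as lxml), whereas this sketch only verifies well-formedness and normalizes whitespace using the Python standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical submitted document; content is illustrative only.
raw = """<experiment id="exp-001">
    <gene> BRCA1 </gene>
    <gene>TP53</gene>
</experiment>"""

def validate_and_normalize(xml_text):
    """Parse the XML (raising ParseError if malformed), strip stray
    whitespace and re-serialize it for indexing."""
    root = ET.fromstring(xml_text)  # fails loudly on malformed input
    for elem in root.iter():
        if elem.text:
            elem.text = elem.text.strip()
        if elem.tail:
            elem.tail = elem.tail.strip()
    return ET.tostring(root, encoding="unicode")

normalized = validate_and_normalize(raw)
print(normalized)
```

The normalized string is what would then be handed to the indexer; a malformed submission is rejected before it can reach the index.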
In the age of 'omics' data, researchers are faced with ever-growing data set sizes. While the proposed XML structure is feasible for most functional genomic data types, it cannot be applied to high-throughput sequencing experiments. Using XML to represent such data would be counterproductive, because XML is a human-readable format that adds considerable redundant text to the actual data. Therefore, in practice we do not transform such data sets to XML, but rather create metadata XML files for the search index that describe the processed data. The raw files (e.g. BAM files in the case of next-generation sequencing or CEL files in the case of microarrays) are stored in the file system and are only referenced by the indexed metadata XML file.
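The construction of such a metadata record can be sketched as follows; the element names and file path are illustrative assumptions, not the actual DIPSBC metadata schema.

```python
import xml.etree.ElementTree as ET

# Build a small metadata document describing a sequencing run.
meta = ET.Element("sequencing_run", id="run-042")
ET.SubElement(meta, "description").text = "RNA-seq, liver sample, replicate 2"
ET.SubElement(meta, "organism").text = "Homo sapiens"
# Only this path goes into the search index; the large binary
# BAM file itself stays on the file system.
ET.SubElement(meta, "raw_file").text = "/data/ngs/run-042/aligned.bam"

record = ET.tostring(meta, encoding="unicode")
print(record)
```

Only the lightweight record is indexed, so the index stays compact regardless of how large the referenced raw files become.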
One important issue within collaborative research groups is data security. Experimentalists need to remain in control of their raw data, and study results need to be treated confidentially before they are published in a research journal. This can best be accomplished by securing the system with password protection and possibly also IP range restrictions in the web server configuration.
More fine-grained user management can also be realized using the Foswiki user group functionality. Certain pages of the web site can then be restricted to particular users or groups. Additionally, this concept could easily be extended to the central Solr index search, so that particular search results would be restricted to specific users. For this purpose, the Solr-Search-Plugin would need to read the current user ID via the respective Foswiki variable and then filter the index results according to the logged-in user. An overview of corresponding current and planned developments can be found on the DIPSBC homepage under the section 'Roadmap'.
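One way to realize such per-user filtering is a Solr filter query (fq) appended to every search; the following sketch assumes a hypothetical index field (allowed_users), core name and host, none of which are part of the published DIPSBC configuration.

```python
from urllib.parse import urlencode

def build_solr_query(terms, user_id):
    """Restrict results to documents whose (hypothetical) allowed_users
    field matches the currently logged-in user."""
    params = {
        "q": terms,
        # fq is a standard Solr filter query; it narrows results
        # without affecting relevance scoring.
        "fq": f"allowed_users:{user_id}",
        "wt": "json",
    }
    return "http://localhost:8983/solr/dipsbc/select?" + urlencode(params)

url = build_solr_query("BRCA1", "jdoe")
print(url)
```

The plugin would obtain the user ID from the Foswiki session and inject it here, so that access control is enforced at query time rather than after retrieval.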
Another advantage of the Foswiki collaboration platform worth mentioning is its intuitive data exchange function. On each page, users can upload files by clicking the 'Attach' button, and other users can then download the respective files. This has two important advantages over data sharing via e-mail: first, files that are too large for e-mail transmission can be shared; second, the reference file is stored only once at a central location, and if the file is changed, the current version can be downloaded again from the same location.
An important part of the proposed data integration system is the incorporation of data analysis results that add value to the raw experimental data and aid in their interpretation. Currently, data analyses that lie beyond the capabilities of the Java applets need to be generated outside the platform (see the use case 'Integration of experimental results from proteomic and transcriptomic data' above). However, for future development it might be worth considering the integration of an R interface that could enable direct statistical processing of experimental data.
Our data integration system has already been applied within several research projects, typically involving between 5 and 15 collaboration partners located at different sites. Such small to medium-sized projects likely represent the majority of research collaborations. However, the system may also be suited for larger collaborations, because the web server and the Foswiki collaboration platform can handle far more simultaneous accesses than would be generated by tens or even hundreds of participating users. This is supported by the fact that many companies use Foswiki as their intranet system, sometimes with thousands of web pages and high access rates.
Regarding the scalability of the index machine, its search and indexing performance naturally decreases as the number of stored documents grows. Nevertheless, the Solr/Lucene software library is optimized for very fast text queries on large amounts of data. For example, the current index of our data integration system comprises almost 35 million documents, or 22.1 GB of physical storage, with PubMed and UniProt records representing the major part. While smaller indices can typically be queried within fractions of a second, queries against this rather large index take less than one second in the general case and up to a few seconds for very complex queries. The system can therefore conveniently handle quite large document collections. However, if even larger indices are needed, as might be the case with metadata of next-generation sequencing experiments, Solr/Lucene offers native support for distributed searches: a large index is split into several smaller indices on different machines, so that fast response times can be maintained.
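A distributed search of this kind can be sketched via Solr's standard shards request parameter, which fans a query out to several smaller indices and merges the results; the host names and core name below are illustrative assumptions.

```python
from urllib.parse import urlencode

# Hypothetical shard locations; each holds a slice of the full index.
shards = [
    "solr1.example.org:8983/solr/dipsbc",
    "solr2.example.org:8983/solr/dipsbc",
]

params = {
    "q": "kinase",
    # Solr's shards parameter: the receiving node queries every
    # listed shard and merges the partial result lists.
    "shards": ",".join(shards),
}
query = "http://solr1.example.org:8983/solr/dipsbc/select?" + urlencode(params)
print(query)
```

From the client's perspective the query is unchanged; the splitting and merging happen entirely on the Solr side, which is what keeps response times low as the index grows.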
All parts of the introduced system can be implemented straightforwardly. The basic system setup with the Foswiki user interface and the Solr backend can be completed in less than one day by an experienced programmer. An important advantage of the system is that its components are open source; it can therefore be modified and adjusted for specific needs.
Because of its flexibility, the system can easily incorporate additional or new data types such as patient data, high-throughput sequencing data, or any other data types that emerge as experimental techniques develop. Helper applications that make use of the underlying XML files can be developed or adapted efficiently to support the analysis of such new data. The combination of a fast indexing machine with a web-based collaboration platform thus makes the system highly flexible, evolvable, scalable and easy to use for research collaborations.