NGS-Logistics: data infrastructure for efficient analysis of NGS sequence variants across multiple centers

Ardeshirdavani, Amin; Souche, Erika; Dehaspe, Luc; Van Houdt, Jeroen; Vermeesch, Joris Robert; Moreau, Yves

doi:10.1186/1471-2105-16-S2-A10

Volume 16 Supplement 2

Highlights from the Tenth International Society for Computational Biology (ISCB) Student Council Symposium 2014

Meeting abstract
Open access
Published: 28 January 2015

Amin Ardeshirdavani^1,2,4,
Erika Souche^3,4,
Luc Dehaspe^3,4,
Jeroen Van Houdt^3,4,
Joris Robert Vermeesch^3,4 &
Yves Moreau^1,2,4

BMC Bioinformatics volume 16, Article number: A10 (2015)

Background

Next-Generation Sequencing (NGS) is a key tool in genomics, in particular in research and diagnostics of human Mendelian, oligogenic, and complex disorders [1]. Multiple projects, such as the 1,000 Genomes Project and the UK 100k Genome Project, now aim at mapping human genetic variation on a large scale. Meanwhile, with the dramatic decrease in the price and turnaround time of sequencing, large amounts of human sequencing data have been generated over the past decade [2]. As of January 2014, about 2,555 sequencers were spread over 920 centers across the world [3]. As a result, about 100,000 human exomes have been sequenced so far [4]. Crucially, the speed at which NGS data is produced greatly surpasses Moore's law [5] and challenges our ability to conveniently store, exchange, and analyze this data.

Data pre-processing is needed to extract reliable information from sequencing data, and it can be divided into two major steps: primary analysis (image analysis and base calling) and secondary analysis. When looking for variation in the human genome, secondary analysis consists of aligning (mapping) the reads against the reference genome and scanning the alignment for variation. Both raw data and mapped reads are large files that occupy significant disk space. The collection of files resulting from the analysis of a single whole-genome study can take up to 50 GB of disk space. This raises significant issues in terms of computing, data storage, and data transfer, with off-site data transfer currently being a key bottleneck.

Moreover, the analysis of NGS data raises the major challenge of how to reconcile federated analysis of personal genomic data with the confidentiality needed to protect privacy. In many situations, the analysis of data from a single study alone will be much less powerful than if it can be correlated with other studies. In particular, when investigating a mutation of interest, it is extremely useful to obtain data about other patients or controls sharing similar mutations. However, personal genome data (whole genome, exome, transcriptome data, etc.) is sensitive personal data. Confidentiality of this data must be guaranteed at all times, and only duly authorized researchers should access such personal data.

Methods

To address the challenges described above, we developed NGS-Logistics, a data infrastructure that can process data from multiple sources inclusively and comprehensively while guaranteeing privacy and security. NGS-Logistics is a web-based application for analyzing NGS data in a distributed way. The data can be located in any data center, anywhere in the world, and researchers do not need to worry about its physical location (Figure 1). Queries are sent to each remote server in accordance with the user's access rights. Each remote host processes the request and returns the results to the main server, where all privacy restrictions on the data are enforced. Once the results are ready, the end user can view them. Depending on the type of query, the results are divided into two parts. The first part covers the samples to which the user has authorized access, for which the user can see all details. The second part covers the whole population, for which the user can access only aggregate statistics without details. An example of such a query would be to review the mutations present at a single genomic position in each patient from a set of patients to which the user has authorized access (first part) and to contrast these results with the background frequency of the mutation in the reference populations (second part) (Figure 2).
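The two-part result described above can be illustrated with a minimal Python sketch. It is not the NGS-Logistics implementation: the names `VariantCall`, `RemoteCenter`, and `federated_query`, the in-memory data, and the sample identifiers are all hypothetical, and the sketch assumes that each center stores only non-reference calls and that the privacy filter runs on the main server after all centers have answered.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantCall:
    """One non-reference call for one sample (hypothetical record layout)."""
    sample_id: str
    chrom: str
    pos: int
    genotype: str  # e.g. "0/1" or "1/1"


class RemoteCenter:
    """Hypothetical stand-in for one remote data center holding local calls."""

    def __init__(self, name, calls):
        self.name = name
        self._calls = list(calls)

    def query(self, chrom, pos):
        # The host processes the request locally and returns matching calls.
        return [c for c in self._calls if c.chrom == chrom and c.pos == pos]


def federated_query(centers, authorized_samples, chrom, pos):
    """Collect hits from every center, then split them on the main server."""
    hits = [call for center in centers for call in center.query(chrom, pos)]

    # First part: full details, only for samples the user may access.
    detailed = [c for c in hits if c.sample_id in authorized_samples]

    # Second part: aggregate statistics over the whole population, no details.
    aggregate = {
        "carriers": len({c.sample_id for c in hits}),
        "genotype_counts": dict(Counter(c.genotype for c in hits)),
    }
    return detailed, aggregate


# Made-up example data: two centers, one position of interest.
center_a = RemoteCenter("center_a", [
    VariantCall("S1", "chr1", 1000, "0/1"),
    VariantCall("S2", "chr1", 1000, "1/1"),
])
center_b = RemoteCenter("center_b", [
    VariantCall("S3", "chr1", 1000, "0/1"),
    VariantCall("S4", "chr2", 2000, "0/1"),
])

detailed, aggregate = federated_query(
    [center_a, center_b], authorized_samples={"S1"}, chrom="chr1", pos=1000
)
```

In this sketch the user sees the full record for sample S1 only, while the other carriers at the same position contribute solely to the counts in `aggregate`.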

Results

The pilot version of NGS-Logistics has been installed and is currently being beta-tested by users at the Center for Human Genetics of the University of Leuven. We currently have two installations of the system: the first at the Leuven University Hospitals and the second at the Flemish Supercomputing Center (VSC). NGS-Logistics has significantly reduced the effort and time needed to evaluate the significance of mutations from whole-genome and exome sequencing, in a safe and confidential environment. The platform also gives users who want to broaden their queries and carry out further analysis more opportunities to do so.

References

1. Voelkerding KV, Dames SA, Durtschi JD: Next-generation sequencing: from basic research to diagnostics. Clin Chem. 2009, 55 (4): 641-658. 10.1373/clinchem.2008.112789.
2. National Human Genome Research Institute: DNA Sequencing Costs. 2013.
3. Next Generation Genomics: World Map of High-throughput Sequencers. [http://omicsmaps.com/]
4. Human genome: Genomes by the thousand. Nature. 2010, 467 (7319): 1026-1027.
5. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). [http://www.genome.gov/sequencingcosts/]

Author information

Authors and Affiliations

1. KU Leuven, Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, Leuven, Belgium
Amin Ardeshirdavani & Yves Moreau
2. iMinds Medical IT Department, Kasteelpark Arenberg 10, Box 2446, 3001, Leuven, Belgium
Amin Ardeshirdavani & Yves Moreau
3. KU Leuven, Center of Human Genetics Gasthuisberg, O&N I Herestraat 49, box 602, 3000, Leuven, Belgium
Erika Souche, Luc Dehaspe, Jeroen Van Houdt & Joris Robert Vermeesch
4. KU Leuven, Department of Human Genetics Gasthuisberg, O&N I Herestraat 49, box 602, 3000, Leuven, Belgium
Amin Ardeshirdavani, Erika Souche, Luc Dehaspe, Jeroen Van Houdt, Joris Robert Vermeesch & Yves Moreau

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Ardeshirdavani, A., Souche, E., Dehaspe, L. et al. NGS-Logistics: data infrastructure for efficient analysis of NGS sequence variants across multiple centers. BMC Bioinformatics 16 (Suppl 2), A10 (2015). https://doi.org/10.1186/1471-2105-16-S2-A10
