Convex-hull voting method on a large data set

Background

Genes work in concert as a system, not as independent entities, to mediate disease states. There has been considerable interest in understanding variations in molecular signatures between normal and disease states. The selective-voting convex-hull ensemble procedure accommodates molecular heterogeneity within and between groups, and allows retrieval of sample-specific sets and investigation of variations in individual networks relevant to personalized medicine [1]. This work describes the application of the convex-hull voting method to a large data set. We predict that, using parallelization techniques, the convex-hull voting algorithm can be executed on the University of Kentucky cluster (DLX) with a data set far too large to process in a feasible time on a single machine.

Materials and methods

Normalized RNA-seq data for 208 samples (104 matched normal/tumor pairs) from the TCGA breast carcinoma data set were downloaded and analyzed with the edgeR package, which identified 2,882 differentially expressed genes with at least a 2-fold difference between tumor and normal samples at a 1% false discovery rate. The convex-hull voting method [1] was applied to data from the differentially expressed genes. An overview of the algorithm, including its levels of parallelism, is given in Figure 1.
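The selection criteria above (at least a 2-fold difference at a 1% false discovery rate) amount to a simple filter over a table of differential-expression results. The following is a minimal Python sketch of that filter, not the edgeR analysis itself. The field names log2_fc and fdr, the gene labels, and all values are invented for illustration, and treating the 1% threshold as inclusive is an assumption.

```python
import math

MIN_FOLD_CHANGE = 2.0  # "at least a 2-fold difference"
MAX_FDR = 0.01         # "1% false discovery rate" (inclusive bound is an assumption)


def is_differentially_expressed(log2_fc, fdr):
    """Return True if a gene passes both the fold-change and FDR thresholds."""
    # A 2-fold change in either direction corresponds to |log2 fold change| >= 1.
    return abs(log2_fc) >= math.log2(MIN_FOLD_CHANGE) and fdr <= MAX_FDR


# Hypothetical result rows; these are not values from the TCGA data.
results = [
    {"gene": "GENE_A", "log2_fc": 2.3, "fdr": 0.0004},   # passes both thresholds
    {"gene": "GENE_B", "log2_fc": -1.0, "fdr": 0.009},   # exactly 2-fold down, passes
    {"gene": "GENE_C", "log2_fc": 0.6, "fdr": 0.0001},   # significant but below 2-fold
    {"gene": "GENE_D", "log2_fc": -3.1, "fdr": 0.2},     # large change, not significant
]

selected = [
    row["gene"]
    for row in results
    if is_differentially_expressed(row["log2_fc"], row["fdr"])
]
print(selected)
```

Both conditions must hold: a gene with a large fold change but a high false discovery rate is excluded, as is a highly significant gene whose change is smaller than 2-fold.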

Figure 1

Ensemble convex-hull voting algorithm and levels of parallelization

A parallel-for loop within the R code allows multiple processors within a node to concurrently perform the voting calculations for different sample pairs within one iteration. Multiple jobs are then submitted to perform the randomized iterations. This turns a computationally intensive problem into a data-intensive one, since each iteration produces just over 6 GB of data.
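The inner level of parallelism described above, a parallel-for over sample pairs within one iteration, can be sketched as follows. This is an illustrative Python analogue, not the authors' R code. The function vote_for_pair is a hypothetical placeholder that returns a pseudo-random vote and does not implement the convex-hull calculation; the names run_iteration, seed, and n_workers are likewise assumptions. A thread pool is used only to keep the sketch portable; CPU-bound Python work would normally use a process pool instead.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def vote_for_pair(pair_id, seed):
    """Hypothetical stand-in for the voting calculation of one sample pair."""
    # Derive a per-pair random stream so results do not depend on scheduling order.
    rng = random.Random(seed * 100_003 + pair_id)
    return pair_id, rng.randint(0, 1)  # placeholder vote: 0 or 1


def run_iteration(seed, n_pairs, n_workers=4):
    """Run one randomized iteration, distributing sample pairs across workers."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # The parallel-for: each pair is handled independently of the others.
        results = list(pool.map(lambda pair: vote_for_pair(pair, seed), range(n_pairs)))
    return dict(results)


# One iteration over 104 pairs, matching the number of matched pairs in the data set.
votes = run_iteration(seed=1, n_pairs=104)
print(len(votes))
```

The outer level would correspond to submitting run_iteration with a different seed as a separate cluster job for each randomized iteration, so that iterations never need to communicate with each other.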

Results

One iteration on the large data set took just under 34 hours, and up to 32 iterations can run concurrently. The entire run of 100 iterations on this data set took less than a week.
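The reported figures are mutually consistent, as a short back-of-the-envelope calculation shows. The Python sketch below assumes ideal scheduling (all 32 slots available at once, no queue wait) and uses 34 hours as an upper bound per iteration and 6 GB as a lower bound on per-iteration output; the variable names are illustrative.

```python
import math

HOURS_PER_ITERATION = 34   # upper bound: "just under 34 hours"
TOTAL_ITERATIONS = 100
MAX_CONCURRENT = 32
GB_PER_ITERATION = 6       # lower bound: "just over 6 GB"

# Iterations run in batches ("waves") of up to 32 at a time.
waves = math.ceil(TOTAL_ITERATIONS / MAX_CONCURRENT)

# Wall-clock time if every wave takes the full 34 hours.
wall_clock_hours = waves * HOURS_PER_ITERATION
wall_clock_days = wall_clock_hours / 24

# For comparison: running the 100 iterations one after another.
serial_days = TOTAL_ITERATIONS * HOURS_PER_ITERATION / 24

# Total output across all iterations.
total_output_gb = TOTAL_ITERATIONS * GB_PER_ITERATION

print(waves, wall_clock_hours, round(wall_clock_days, 2))
print(round(serial_days, 1), total_output_gb)
```

Under these assumptions the run needs 4 waves, or at most 136 hours (about 5.7 days), which agrees with the reported total of less than a week. Run back to back, the same 100 iterations would take roughly 142 days, and together they produce at least 600 GB of output.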

Conclusions

Future work will involve parallelizing all of the computationally intensive and data-intensive steps in a way that reduces the complexity of job submission and improves the scalability of the entire job. Computing paradigms such as Hadoop are being explored for this task.

References

  1. Nagarajan R, Kodell RL: A Selective Voting Convex-Hull Ensemble Procedure for Personalized Medicine. AMIA Summits on Translational Science Proceedings. 2012, 2012: 87-94.


Acknowledgements

This research was supported by the Cancer Research Informatics and the Biostatistics and Bioinformatics Shared Resource Facilities of the University of Kentucky Markey Cancer Center (P30CA177558) and the University of Kentucky Center for Computational Sciences.

Author information

Corresponding author

Correspondence to Sally R Ellingson.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Ellingson, S.R., Wang, C. & Nagarajan, R. Convex-hull voting method on a large data set. BMC Bioinformatics 16 (Suppl 15), P2 (2015). https://doi.org/10.1186/1471-2105-16-S15-P2

  • DOI: https://doi.org/10.1186/1471-2105-16-S15-P2

Keywords

  • Breast Carcinoma
  • Single Machine
  • Computing Paradigm
  • Multiple Processor
  • Vote Method