 Software
 Open Access
 Published:
DynaVenn: web-based computation of the most significant overlap between ordered sets
BMC Bioinformatics volume 20, Article number: 743 (2019)
Abstract
Background
In many research disciplines, ordered lists are compared. One example is the comparison of the significant genes or proteins in a primary study with those in a replication study. Often, the tops of the lists are compared using Venn diagrams, or more precisely Euler diagrams (set diagrams showing logical relations between a finite collection of different sets). If different cohort sizes, techniques or evaluation algorithms were applied, however, a direct comparison of significant genes at a fixed threshold can be misleading, and approaches that compare the lists themselves are more appropriate.
Results
We developed DynaVenn, a web-based tool that incrementally creates all possible subsets from two or three ordered lists and computes a p-value for the overlap of each combination. The corresponding dynamic Venn diagrams are generated as graphical representations. In addition, an animation shows how the most significant overlap is reached by backtracking. We demonstrate the improved performance of DynaVenn over an arbitrary cutoff approach on an Alzheimer’s Disease biomarker set.
Conclusion
DynaVenn combines the calculation of the most significant overlap of different cohorts with an intuitive visualization of the results. It is freely available as a web service at http://www.ccb.unisaarland.de/dynavenn.
Background
A frequent task for life science and biomedical researchers is to compare ordered lists. A prominent example is the comparison of genes, proteins or other biomarkers from a primary study with those from a replication study. Often, the most significant markers are selected and the respective sets from the two studies are compared. Venn diagrams, or more generally Euler diagrams, lend themselves to visualizing the overlap between two or more such sets. Beyond the natural sciences, life sciences and medicine, they are also applied in other research disciplines such as the social sciences or economics [1–4]. A variety of online tools, standalone software programs and packages for environments such as R have been implemented. Given two or more input sets, these tools compute either standard Venn diagrams or area-proportional Venn diagrams, in which the area of each intersection corresponds to the size of the overlap between the respective sets. Frequently used tools include Venny (http://bioinfogp.cnb.csic.es/tools/venny/), BioVenn [5], the R package VennDiagram and jvenn [6]. All these tools excel in their ease of use and their convenient representation of the results as planar Venn diagrams.
To compute the significance of the overlap of two or more sets, the hypergeometric distribution can be applied. As mentioned, the sets to be compared are often the top part of an ordered list, e.g. the most significant genes for one disease that are compared between two studies. Here, fixed cutoff values are frequently applied, e.g. all genes with a p-value below 0.05 or all genes that are at least twofold upregulated are selected. In some cases, e.g. for bimodal distributions, the threshold can be derived from the distribution of the input data. Often, however, selecting the most appropriate cutoff values is challenging. First, the choice of cutoff is itself debatable. Second, the cutoffs for the two studies to be compared may have to be chosen differently. The p-value for a gene, e.g. computed by a t-test or another hypothesis test, also varies with the cohort size, the experimental technique used and other factors. As an alternative, a fixed number of genes independent of the p-value is often considered, e.g. the top 20 dysregulated genes in the two studies to be compared. If two ordered lists are available, metrics for the comparison of such lists represent a valid alternative. Examples include Kendall’s Tau and Spearman’s Footrule, which can be applied to compare the complete lists. All these strategies have their own strengths and weaknesses.
As a flexible tool that does not rely on fixed thresholds and cutoffs, we developed DynaVenn, a solution that computes the optimal overlap between two or three sets from ordered lists. Notably, a similar approach is followed by the R package GeneOverlap [7], which performs Fisher’s exact test. This tool, however, uses a fixed threshold by taking the entire input lists and does not produce an interactive visualization. Its genomic background is set to all genes measured in Chromatin ImmunoPrecipitation DNA-Sequencing (ChIP-Seq) or RNA-Seq, while DynaVenn uses the genes measured in all the input sets. The web-based solution is freely available at http://www.ccb.unisaarland.de/dynavenn.
Methods & Implementation
Basic concept
DynaVenn makes use of the general concept and intuitive graphical representation of area-proportional Venn diagrams. For two or three ordered lists of items (A, B, C), all possible Venn diagrams are calculated by considering the top l, m and, in the case of three lists, n items as sets. In the case of two lists, the result is a matrix of Venn diagrams and corresponding p-values, which are calculated from the hypergeometric distribution. In the case of three lists, a cube is computed accordingly. Then a backtracking step is carried out, similar to the backtracking in the dynamic programming solution of sequence alignment. For each matrix position (i,j), which contains the p-value for the first i elements of the first list and the first j elements of the second list, the matrix positions (i−1,j) and (i,j−1) are considered. In each step, it is thereby decided whether an element of the first or the second list has to be added in order to arrive at the optimal p-value. In the case of three lists, the same applies to (i,j,k). In each case, the result of the algorithm is the sequence in which items of the two or three sets have to be added in order to reach the Venn diagram with the most significant overlap. The user can choose to compute nominal p-values without adjustment or to apply either the Bonferroni or the Benjamini-Hochberg adjustment. By default, DynaVenn uses the Benjamini-Hochberg approach to control the false discovery rate (FDR). The formula used to compute the p-values (2 sets) is presented below:
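As a minimal sketch, assuming the standard upper-tail hypergeometric test (the exact convention used by DynaVenn is an assumption here), the p-value for the top i items of A and the top j items of B with an observed overlap of x items in a universe of N items can be written as \(p(i,j)=\sum _{k=x}^{\min (i,j)}\binom {i}{k}\binom {N-i}{j-k}/\binom {N}{j}\). The Python code below fills the p-value matrix with this tail probability and backtracks from the most significant cell to the empty diagram. Taking N as the union of both lists, breaking ties towards the first list, and all function names are assumptions made for illustration, not details confirmed by the DynaVenn implementation. Multiple-testing adjustment is omitted.

```python
from bisect import bisect_left
from math import comb


def overlap_pvalue(universe, size_a, size_b, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap)."""
    total = comb(universe, size_b)
    tail = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, min(size_a, size_b) + 1)
    )
    return tail / total


def pvalue_matrix(list_a, list_b):
    """p[i][j] is the p-value for the top i items of A versus the top j of B."""
    universe = len(set(list_a) | set(list_b))  # assumption: union of inputs
    rank_b = {item: pos for pos, item in enumerate(list_b)}
    p = [[1.0] * (len(list_b) + 1) for _ in range(len(list_a) + 1)]
    for i in range(1, len(list_a) + 1):
        # Positions in B of the top i items of A, sorted for fast counting.
        ranks = sorted(rank_b[x] for x in list_a[:i] if x in rank_b)
        for j in range(1, len(list_b) + 1):
            overlap = bisect_left(ranks, j)  # shared items within the top j of B
            p[i][j] = overlap_pvalue(universe, i, j, overlap)
    return p


def backtrack(p, start):
    """Walk from `start` back to (0, 0), always moving to the predecessor
    (i-1, j) or (i, j-1) with the smaller p-value (ties favour the first list)."""
    i, j = start
    path = [(i, j)]
    while i > 0 or j > 0:
        if j == 0 or (i > 0 and p[i - 1][j] <= p[i][j - 1]):
            i -= 1
        else:
            j -= 1
        path.append((i, j))
    return path[::-1]  # order in which items are added


def most_significant_overlap(list_a, list_b):
    p = pvalue_matrix(list_a, list_b)
    cells = ((i, j) for i in range(len(p)) for j in range(len(p[0])))
    best = min(cells, key=lambda cell: p[cell[0]][cell[1]])
    return best, p[best[0]][best[1]], backtrack(p, best)


best, pval, path = most_significant_overlap(list("abcdefghij"), list("abcdefghij"))
```

For two identical lists of ten items, the sketch selects the top five items of each list, in line with the behaviour described for identical lists in the Results section.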
Representation of results
The results of the computation are visualized as an area-proportional Venn diagram in the central part of the results panel on the website (Fig. 1). Next to the diagram, the input lists are presented together with the cutoff values selected by our algorithm. The elements that are contained in both lists are highlighted in blue. The user can run an animation: starting with an empty Venn diagram, the elements from both lists are added stepwise in the order selected by the backtracking algorithm until both lists have been completely added. Below the Venn diagram, the course of the p-value is presented. The best (lowest) p-value determines the Venn diagram with the most significant overlap. The matrix containing the computed p-values can be visualized as a heat map.
Implementation
The main algorithm of DynaVenn is implemented in Python and uses Celery and Redis. The hypergeometric tests are calculated using SciPy, and the adjustment for multiple testing is done with the statsmodels module. The visualizations are created using venn.js and plotly.js. The results can be downloaded as JSON files for further offline analysis.
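To illustrate the two adjustment options, the following plain-Python sketch implements the standard Bonferroni and Benjamini-Hochberg procedures. DynaVenn itself relies on statsmodels for this step; the function names below, and the idea that all matrix cells are adjusted together as one family of tests, are assumptions for illustration only.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda idx: pvalues[idx])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the adjusted
    # values stay monotone in the rank of the raw p-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw)
fwer = bonferroni(raw)
```

In a two-list run, `raw` would hold the p-values of all matrix cells, so the smallest adjusted value identifies the same cell as the smallest raw value while reporting a more conservative significance.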
Run time consideration
To understand the effect of the input size on the run time, we increased random input sets in a stepwise manner and repeated the computations 10 times. The reported run times are averaged and the standard deviation is computed. The run time as a function of the set size, for input sets of 10, 20, 50, 100, 150 and up to 1,000 items, is presented in Fig. 2. As expected from the implementation, the run time increases quadratically with the set size, i.e. for lists containing n items the asymptotic run time is \(\mathcal {O}(n^{2})\). For typical input sets of several hundred items, the execution time is still moderate and in the range of seconds: for 200 items in each list, for example, DynaVenn requires 8 seconds to compute all 40,000 Venn diagrams, the p-values and the backtracking. Since sets in the range of thousands of items will often be explored as well, an approximate time to result is interpolated and provided to the user. If a third list is used as input, the run time becomes cubic, i.e. the asymptotic run time is \(\mathcal {O}(n^{3})\). In this case, it is not reasonable to upload lists with several thousand items.
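The scaling argument can be turned into a rough estimate. The sketch below extrapolates from the single reference point given above (8 seconds for two lists of 200 items, i.e. about 0.2 ms per matrix cell) and assumes that the cost per cell stays constant, including for three lists. How DynaVenn actually interpolates its time estimate is not described, so the function and its parameters are hypothetical.

```python
def estimate_runtime(n_items, n_lists=2, ref_items=200, ref_seconds=8.0):
    """Extrapolate the run time in seconds, assuming a constant cost per cell
    and n_items ** n_lists cells in the matrix (two lists) or cube (three)."""
    seconds_per_cell = ref_seconds / ref_items ** 2  # 8 s for 40,000 cells
    return seconds_per_cell * n_items ** n_lists


two_lists_1000 = estimate_runtime(1000)
three_lists_200 = estimate_runtime(200, n_lists=3)
```

Under these assumptions, two lists of 1,000 items correspond to one million cells and roughly 200 seconds, whereas three lists of only 200 items already correspond to eight million cells.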
Results
Selected extreme scenarios
To check the performance of the tool and to verify properties of the hypergeometric test and their effect on interpretability, we created extreme cases. First, we used identical lists as input. Second, we reversed one list, i.e. the second list was the reverse of the first list. Third, we flipped the complete list. In the last example, we flipped the first 20 percent and the last 80 percent of the list separately. These four scenarios are presented in Fig. 2a–d. For the identical sets (Fig. 2a), the most significant overlap (for 500 items, the p-value is 1.5·10^{−145}) is found in the middle of both lists, after selecting 250 items from the first and 250 items from the second list, which is intuitive because of the symmetric nature of the hypergeometric distribution. For completely swapped lists as well as for reversed lists (Fig. 2b, c), no overlap has a p-value below 1. In the last case, shown in Fig. 2d, where the top 20% and the bottom 80% of the second list are swapped compared to the original list, the most significant overlap is reached after selecting 200 markers (again with an input set size of 500 markers), 100 from the first set and 100 from the second set. In this case, the p-value is 1.2·10^{−102}.
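These scenarios can be reproduced at a smaller scale. The sketch below uses 20 items instead of 500 and unadjusted upper-tail hypergeometric p-values, so the absolute values differ from those reported above. It reads the last scenario as reversing the top 20% and the bottom 80% of the second list separately, which is an interpretation of the description; the "completely swapped" scenario is left out because the text does not specify how it differs from a reversed list. All names are hypothetical.

```python
from math import comb


def tail_pvalue(universe, size_a, size_b, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap)."""
    total = comb(universe, size_b)
    hits = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, min(size_a, size_b) + 1)
    )
    return hits / total


def best_cell(list_a, list_b):
    """Return the (i, j) cut-offs with the smallest p-value and that p-value.
    Both lists are assumed to be permutations of the same items."""
    n = len(list_a)
    best, best_p = (0, 0), 1.0
    for i in range(1, n + 1):
        top_a = set(list_a[:i])
        overlap = 0
        for j in range(1, n + 1):
            overlap += list_b[j - 1] in top_a  # running overlap count
            p = tail_pvalue(n, i, j, overlap)
            if p < best_p:
                best, best_p = (i, j), p
    return best, best_p


n = 20
original = list(range(n))
scenarios = {
    "identical": original[:],
    "reversed": original[::-1],
    "blocks flipped": original[: n // 5][::-1] + original[n // 5:][::-1],
}
results = {name: best_cell(original, second) for name, second in scenarios.items()}
```

With 20 items, the identical lists peak in the middle at ten items per list, the reversed list never yields a p-value below 1, and the block-wise flipped list peaks at the top four items (20%) of each list, mirroring the pattern described for 500 items.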
Application Alzheimer’s Disease
As a biomedical use case, we selected blood-borne microRNA (miRNA) biomarkers for Alzheimer’s Disease (AD). In a first study, we evaluated a cohort of American patients using Illumina Next Generation Sequencing (NGS) [8]. In a second study, we followed a similar study protocol in Germany and applied essentially the same experimental technique [9]. In the latter manuscript we state “In the initial cohort, we observed 203, in the validation cohort, 146 dysregulated miRNAs at a significance level of 0.05. With 68 miRNAs, the overlap was significant (P =.0003)”. This means that we selected all 203 nominally significant miRNAs from the first study performed in the US and all 146 nominally significant miRNAs from the validation study in Germany, and found 68 overlapping miRNAs, which were shown as a Venn diagram. Using the hypergeometric distribution, we computed a p-value of 0.0003 for this overlap. The two miRNA lists, i.e. all miRNAs sorted by p-value in ascending order, were used as input for DynaVenn. DynaVenn selected the top 116 miRNAs from the primary study and the top 56 from the second study, yielding an overlap of 32 miRNAs between both sets (Fig. 1). The FDR-adjusted p-value for this overlap was 1.66·10^{−6}, more than two orders of magnitude below the original p-value (raw p-value: 7.7·10^{−11}). In this case, DynaVenn helped to find a more significant overlap between the original and the validation set. The decreased p-value is mostly due to a substantially lower number of miRNAs selected from the original data set.
Application to more than three lists
As described above, the run time of DynaVenn increases exponentially with the number of ordered lists to be compared. For m lists with n items, the run time is \(\mathcal {O}(n^{m})\). For comparing larger numbers of ordered lists, heuristic solutions can be applied. The most straightforward way is to pick the most significant overlap between two ordered lists and use it as input for further iterations. More sophisticated strategies, such as computing and comparing all pairwise combinations of two lists and stepwise joining those with the highest overlap, resulted in very long run times of DynaVenn. We therefore consider the maximum of three lists to be one of the currently most relevant limitations of DynaVenn.
Conclusions
Researchers have many options to compare sets or lists in terms of similarity. For lists, for example, different metrics are available. One approach that is frequently applied by researchers is to use Venn diagrams, a class of set diagrams. Here, it is assumed that the order of the items does not play a role. A typical case in biomedical applications is to compare the items with p-values below a certain threshold in two experiments in order to understand whether the significant items in both lists overlap. With DynaVenn, we developed a tool that does not rely on a fixed cutoff value for two or three input lists but computes the optimal threshold, and thus the most significant Venn diagram, from the input lists. We demonstrate the improved performance on an Alzheimer’s Disease biomarker set and also consider extreme cases. When the lists are equal, a metric for the similarity of lists may be more appropriate, since DynaVenn reports the middle of the lists as the most significant overlap, which is correct given that it treats the lists as sets. In the last extreme case, where the top 20% and the bottom 80% of the list are flipped, DynaVenn computes the perfect Venn diagram within a few seconds.
Our tool produces easily interpretable graphics and, moreover, presents the results as an interactive animation. A challenge is the run time. In the case of two lists, the result is computed in seconds to minutes even for hundreds to thousands of items. For three lists, the run time increases in a cubic manner, restricting the input to several hundred rather than thousands of items. There are several ways to improve this. On the implementation side, parallelization may lead to a speed-up of three orders of magnitude. Additionally, a pruning strategy may help to identify areas in the matrix or cube where the significance values will exceed a certain threshold and thus need not be considered. Finally, the extension to larger numbers of lists can be addressed in future versions of DynaVenn.
In sum, we provide an easy-to-use online tool that makes static Venn diagrams dynamic and serves many different applications in biomedical research and beyond.
Availability and Requirements
Project name: DynaVenn
Project home page: http://www.ccb.unisaarland.de/dynavenn
Operating system(s): Platform independent
Programming language: Python, JavaScript
Other requirements: Tested with Safari, Firefox and Chrome browsers
License: Proprietary
Any restrictions to use by non-academics: The source code of the project can be made available for non-commercial use upon request.
Abbreviations
AD: Alzheimer’s disease
ChIP-Seq: Chromatin ImmunoPrecipitation DNA-Sequencing
FDR: False discovery rate
miRNA: microRNA
NGS: Next Generation Sequencing
References
1. Dou C, Cao Z, Yang B, Ding N, Hou T, Luo F, Kang F, Li J, Yang X, Jiang H, Xiang J, Quan H, Xu J, Dong S. Changing expression profiles of lncRNAs, mRNAs, circRNAs and miRNAs during osteoclastogenesis. Sci Rep. 2016; 6(1):21499. https://doi.org/10.1038/srep21499.
2. Lu Y, Li C, Chen H, Zhong W. Identification of hub genes and analysis of prognostic values in pancreatic ductal adenocarcinoma by integrated bioinformatics methods. Mol Biol Rep. 2018; 45(6):1799–807. https://doi.org/10.1007/s1103301843252.
3. Chen G, Wang Q, Yang Q, Li Z, Du Z, Ren M, Zhao H, Song Y, Zhang G. Circular RNAs hsa_circ_0032462, hsa_circ_0028173, hsa_circ_0005909 are predicted to promote CADM1 expression by functioning as miRNAs sponge in human osteosarcoma. PLOS ONE. 2018; 13(8):0202896. https://doi.org/10.1371/journal.pone.0202896.
4. Yuan LL, Zhang M, Yan X, Bian YW, Zhen SM, Yan YM. Dynamic Phosphoproteome Analysis of Seedling Leaves in Brachypodium distachyon L. Reveals Central Phosphorylated Proteins Involved in the Drought Stress Response. Sci Rep. 2016; 6(1):35280. https://doi.org/10.1038/srep35280.
5. Hulsen T, de Vlieg J, Alkema W. BioVenn – a web application for the comparison and visualization of biological lists using area-proportional Venn diagrams. BMC Genomics. 2008; 9(1):488. https://doi.org/10.1186/147121649488.
6. Bardou P, Mariette J, Escudié F, Djemiel C, Klopp C. jvenn: an interactive Venn diagram viewer. BMC Bioinformatics. 2014; 15(1):293. https://doi.org/10.1186/1471210515293.
7. Shen L, Sinai M. GeneOverlap: Test and visualize gene overlaps. R package version. 2013; 1(0).
8. Leidinger P, Backes C, Deutscher S, Schmitt K, Mueller SC, Frese K, Haas J, Ruprecht K, Paul F, Stähler C, Lang CJ, Meder B, Bartfai T, Meese E, Keller A. A blood based 12-miRNA signature of Alzheimer disease patients. Genome Biol. 2013; 14(7):78. https://doi.org/10.1186/gb2013147r78.
9. Keller A, Backes C, Haas J, Leidinger P, Maetzler W, Deuschle C, Berg D, Ruschil C, Galata V, Ruprecht K, Stähler C, Würstle M, Sickert D, Gogol M, Meder B, Meese E. Validating Alzheimer’s disease micro RNAs using next-generation sequencing. Alzheimer’s Dementia. 2016; 12(5):565–76. https://doi.org/10.1016/J.JALZ.2015.12.012.
10. Ludwig N, Fehlmann T, Kern F, Gogol M, Maetzler W, Deutscher S, Gurlit S, Schulte C, von Thaler AK, Deuschle C, Metzger F, Berg D, Suenkel U, Keller V, Backes C, Lenhof HP, Meese E, Keller A. Machine Learning to Detect Alzheimer’s Disease from Circulating Non-coding RNAs. Genomics Proteom Bioinforma. 2019. Dec 4. pii: S16720229(19)301573. https://doi.org/10.1016/j.gpb.2019.09.004. [Epub ahead of print].
Acknowledgements
Not applicable.
Funding
Not applicable.
Author information
Contributions
JA implemented the web application and wrote the paper; TF and CB supported the implementation; AK wrote the manuscript and conceived the project. All authors have read and approved the final manuscript.
Corresponding author
Correspondence to Andreas Keller.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Amand, J., Fehlmann, T., Backes, C. et al. DynaVenn: webbased computation of the most significant overlap between ordered sets. BMC Bioinformatics 20, 743 (2019). https://doi.org/10.1186/s1285901933205
Keywords
 Venn diagrams
 Web server
 Hypergeometric test
 List overlap