 Research
 Open Access
 Published:
Jaccard/Tanimoto similarity test and estimation methods for biological presenceabsence data
BMC Bioinformatics volume 20, Article number: 644 (2019)
Abstract
Background
A survey of presences and absences of specific species across multiple biogeographic units (or bioregions) are used in a broad area of biological studies from ecology to microbiology. Using binary presenceabsence data, we evaluate species cooccurrences that help elucidate relationships among organisms and environments. To summarize similarity between occurrences of species, we routinely use the Jaccard/Tanimoto coefficient, which is the ratio of their intersection to their union. It is natural, then, to identify statistically significant Jaccard/Tanimoto coefficients, which suggest nonrandom cooccurrences of species. However, statistical hypothesis testing using this similarity coefficient has been seldom used or studied.
Results
We introduce a hypothesis test for similarity for biological presenceabsence data, using the Jaccard/Tanimoto coefficient. Several key improvements are presented including unbiased estimation of expectation and centered Jaccard/Tanimoto coefficients, that account for occurrence probabilities. The exact and asymptotic solutions are derived. To overcome a computational burden due to highdimensionality, we propose the bootstrap and measurement concentration algorithms to efficiently estimate statistical significance of binary similarity. Comprehensive simulation studies demonstrate that our proposed methods produce accurate pvalues and false discovery rates. The proposed estimation methods are orders of magnitude faster than the exact solution, particularly with an increasing dimensionality. We showcase their applications in evaluating cooccurrences of bird species in 28 islands of Vanuatu and fish species in 3347 freshwater habitats in France. The proposed methods are implemented in an open source R package called jaccard (https://cran.rproject.org/package=jaccard).
Conclusion
We introduce a suite of statistical methods for the Jaccard/Tanimoto similarity coefficient for binary data, that enable straightforward incorporation of probabilistic measures in analysis for species cooccurrences. Due to their generality, the proposed methods and implementations are applicable to a wide range of binary data arising from genomics, biochemistry, and other areas of science.
Background
Analysis of species cooccurrences helps us understand ecological and biological relationships among species. Essentially, the presence (1) and absence (0) of species are surveyed in multiple biogeographic units (or bioregions) using fieldwork, imaging, sequencing, and other techniques. Then, the Jaccard/Tanimoto coefficient is one of the most fundamental and popular similarity measures to compare such biological presenceabsence data. Given two presenceabsence vectors y_{i} and y_{j} of length m that represent two different species, the Jaccard/Tanimoto similarity coefficient is the ratio of their intersection to their union, T(y_{i},y_{j})=y_{i}∩y_{j}/y_{i}∪y_{j} [1, 2]. This quantification of overlaps allows us to quantify coexistence of species [3–6]. However, the Jaccard/Tanimoto coefficient lacks probabilistic interpretations or statistical error controls. Surprisingly, its statistical properties, hypothesis testing, and estimation methods for pvalues have been inadequately studied. Here, we present a rigorous statistical test evaluating the similarity in presenceabsence data, derive exact and asymptotic solutions, and introduce efficient estimation methods for significance of the Jaccard/Tanimoto similarity coefficient.
Generally, analysis of cooccurrences enables us to distinguish generalist species that survive in a broad range of environments from specialists that only thrive in a few localities [7, 8]. Alternatively, similarity between two localities – how two biogeographic units share an overlapping set of species – sheds light on the beta diversity that may arise from ecological processes over time [9–11]. There has been a long standing discussion on how to conduct association analysis for occurrences of species, including appropriate null models and evaluation techniques [12–17]. There are also specialized probabilistic approaches, including metrics related to the Jaccard/Tanimoto coefficient [18–21]. Yet, these studies rarely utilized statistical significance. Therefore, we investigated a hypothesis test using the Jaccard/Tanimoto coefficient that underlies or accompanies most of such association analyses.
The Jaccard/Tanimoto coefficient measuring similarity between two species has long been used to evaluate cooccurrences between species or between biogeographic units [3–5, 22–24]. Pioneering early works on probabilistic treatment of the Jaccard/Tanimoto coefficient assume that the probability of species occurrences is 0.5 [5, 22, 23]. These can be seen as special cases of our methods where both probabilities of y_{i} and y_{j} are set to 0.5. Recently, [24] and [25] proposed estimating pvalues with combinatorics and hypergeometric distributions, respectively. We found that they may lead to inaccurate estimates. To provide a comprehensive statistical treatment, we have developed a suite of methods and estimation techniques for rigorously testing similarity between presenceabsence data.
We derive a hypothesis test from the first principles using the Jaccard/Tanimoto coefficient. In the process, we propose an unbiased estimation of expectation and a centered Jaccard/Tanimoto coefficient that accounts for different probabilities of species occurrences. The negative and positive values of the centered Jaccard/Tanimoto coefficient naturally correspond to negative and positive association. We introduce an exact distribution of Jaccard/Tanimoto similarity coefficients under independence that is shown to provide accurate pvalues. Because the exact solution for a large m is computationally expensive, we have developed two efficient and accurate estimation algorithms. We demonstrate their remarkable accuracy and computational efficiency in comprehensive simulation studies, where pvalues and false discovery rates (FDRs) are evaluated. As applications, we evaluated cooccurrences of bird species from m=28 islands of Vanuatu and of fish species from m=3347 freshwater habitats in France.
All proposed methods are implemented in a statistical programming language R [26], available on the Comprehensive R Archive Network (https://cran.rproject.org/package=jaccard). We additionally provide an interactive web app (https://nnnn.shinyapps.io/jaccard). The implementations are efficient and general, such that the jaccard package can rigorously test similarity between binary data arising from genomics, biochemistry, and others.
Methods
Statistical model and test
Quantitative comparison of presenceabsence data in ecology and biology plays a crucial role in evaluating species coexistences, biodiversities, and ecosystems. In particular, one may be interested in comparing how species are cooccurring in biogeographic units or how biogeographic units are occupied by certain species. Note that species are used generally to indicate groups of organisms under investigations, such as operational taxonomic units (OTUs); similarly, biogeographic units or bioregions could be distinct survey areas, islands, or habitats. We are interested in statistically testing similarity between a pair of presenceabsence data.
Given two presenceabsence vectors y_{i} and y_{j} of length m, we are interested in inferring whether they are significantly related. Consider presence (1) and absence (0) of two species are recorded at m biogeographic units. We measure their similarity by the ratio of their intersection to their union, T(y_{i},y_{j})=y_{i}∩y_{j}/y_{i}∪y_{j}. This is well known as the Jaccard/Tanimoto index or similarity coefficient [1, 2]. In order to utilize the Jaccard/Tanimoto similarity coefficient in a statistically rigorous manner, we propose a family of methods and algorithms (Fig. 1).
Under the null model of independence, y_{i} and y_{j} are assumed to be independent and identically distributed (i.i.d.). They are modeled by a Bernoulli distribution, with corresponding occurrence (i.e., success) probabilities p_{i} and p_{j}∈[0,1]. Specifically, for k=1,…,m,y_{i,k}∼_{i.i.d.}Bernoulli(p_{i}) and y_{j,k}∼_{i.i.d.}Bernoulli(p_{j}). Because this conventional definition is undefined if both binary vectors contain only zeros such that y_{i}∪y_{j}=0, we refine the definition of Jaccard/Tanimoto coefficient
Following the definition of Jaccard/Tanimoto similarity coefficient in Eq. (1), we derive its expected value \(\mathbb E[T(\boldsymbol {y}_{i},\boldsymbol {y}_{j})] =\frac {p_{i} p_{j}}{p_{i} + p_{j}  p_{i} p_{j}}\). Substantial deviation from the expected value signifies similarity. Note that the Jaccard/Tanimoto coefficient can also be defined in terms of a multinomial distribution with four categories and m trials (for example, representing m biogeographic units). Four categories arising from presenceabsence data are N_{1}=y_{i}∩y_{j},N_{2}=y_{i}∩(1−y_{j}),N_{3}=(1−y_{i})∩y_{j} and N_{4}=m−N_{1}−N_{2}−N_{3}. From p_{i} and p_{j}, probabilities of those four categories are p_{i}p_{j},p_{i}(1−p_{j}),(1−p_{i})p_{j} and (1−p_{i})(1−p_{j}), respectively. Putting them together, N=(N_{1},N_{2},N_{3},N_{4}) is distributed according to a multinomial distribution, Multi(m,p_{i}p_{j},p_{i}(1−p_{j}),(1−p_{i})p_{j},(1−p_{i})(1−p_{j})).
Proposition 1
If y_{i} and y_{j} are independent, then
Proof 1
First, we compute conditional expectation given N_{1}+N_{2}+N_{3}. We observe that N_{1}N_{1}+N_{2}+N_{3} follows \(\text {Bernoulli}(N_{1}+N_{2}+N_{3},\frac {p_{i}p_{j}}{p_{i}+p_{j}p_{i}p_{j}})\). Hence, on set N_{1}+N_{2}+N_{3}>0, we have
and on set N_{1}+N_{2}+N_{3}=0, we have
Therefore,
This allows us to define the centered Jaccard/Tanimoto coefficient as
This accounts for expected values, naturally distinguishing negative and positive associations. Generally, we would like to measure the deviation of an observed coefficient from an expected value, instead of simply looking at a magnitude of an observed statistics. Furthermore, this centered coefficient may be scaled by variance in order to span a predefined range.
To evaluate whether y_{i} and y_{j} are independent, a following statistical hypothesis testing is performed:
The null hypothesis H_{0} is that the centered Jaccard/Tanimoto coefficient equals zero. Note that this is equivalent to that the conventional (uncentered) Jaccard/Tanimoto coefficient equals an expected value under independence. Therefore, although we propose and use the centered coefficient, this hypothesis testing is attributed to both uncentered and centered versions. Then, a pvalue indicates a probability of observing a coefficient equal to or more extreme than an observed coefficient under the null hypothesis.
Distribution of the jaccard/Tanimoto coefficient
To obtain its pvalue, we derive the distribution of Jaccard/Tanimoto coefficient under the null hypothesis. In terms of N=(N_{1},N_{2},N_{3},N_{4}), the Jaccard/Tanimoto coefficient can be expressed as
When p_{i} and p_{j} are known, the pvalue is given by \(\mathbb {P}(K_{T^{c}})\) where
However, in practice, probabilities p_{i} and p_{j} are usually unknown. Therefore, we define the centered Jaccard/Tanimoto coefficient by \(\hat T^{c} = T \frac {\hat p_{i}\hat p_{j}}{\hat p_{i} +\hat p_{j} \hat p_{i}\hat p_{j}}\), where \(\hat p_{i} =\frac {\sum {\boldsymbol {y}_{i}}}{m}, \hat p_{j} =\frac {\sum {\boldsymbol {y}_{j}}}{m}\) are standard estimators of p_{i} and p_{j} respectively. Plugin estimates of \(\mathbb E[T(y_{i},y_{j})]\) into Eq. (4) will result in conservative behaviors, since we estimate the probabilities on the same sample that we want to perform the test. Then, the estimates of expectation are biased toward the observed value of Jaccard/Tanimoto coefficient. To overcome this bias, we estimate probabilities p_{i} and p_{j} for each configuration (N_{1},N_{2},N_{3},N_{4}) separately.
So in this case, the critical region is defined as follows
where \(\tilde p_{i} = \frac {N_{1}+N_{2}}{m}\) and \(\tilde p_{j} = \frac {N_{1}+N_{3}}{m}\).
Because the exact distribution is computationally expensive (see Results for comparison), we introduce an asymptotic approximation when m→∞. It may be useful when dealing with very large binary data, where computational power is a bottleneck. Denote by q_{1}=p_{i}p_{j} the probability that both y_{i} and y_{j} have ones, and by q_{2}=p_{i}+p_{j}−2p_{i}p_{j} the probability that only one of two vectors has one. Similarly, \(\hat q_{1}\) and \(\hat q_{2}\) are defined with the plugin estimators. As m→∞, we can estimate the variance:
Proposition 2
If y_{i} and y_{j} are independent then
as m→∞, where
Proof 2
Theorem 14.6 of [27] states that
where
Then, we define function \(g(x_{1},x_{2})=\frac {x_{1}}{x_{1}+x_{2}}\) and apply the delta method. So, we get
The gradient of g is
Finally, after simplification, we obtain
In practice, probabilities p_{i} and p_{j} are unknown and need to be estimated. Recall that \(\hat p_{i}=\frac {\# \{y_{ik}=1\}}{m}\) and \(\hat p_{j}=\frac {\# \{y_{jk}=1\}}{m}\). We define \(\hat q_{1}\) and \(\hat q_{2}\) by replacing in definition of q_{1} and q_{2} true probabilities p_{i} and p_{j} by its estimators. So based on Proposition 2 we are able to approximate pvalues as follow:
where \(\phi = \frac {1}{\sqrt {2\pi }} \int _{ \infty }^{x} e^{x^{2}/2}dx\) is a standard Gaussian cumulative distribution function (CDF).
Measure concentration algorithm
The distribution of the centered Jaccard/Tanimoto coefficient can be expressed in terms of the multinomial distribution. However, evaluating a significance test based on this representation requires exhaustive computations. It needs summation over all possible states of the multinomial distribution. For the centered Jaccard/Tanimoto coefficient between y_{i} and y_{j}, we need to compute probability of event \( K_{\hat T^{c}}\) defined by Eq. (5).
This can be quickly and accurately estimated by the measure concentration algorithm (MCA) with a known error bound [28]. For every ε>0, we will construct I_{ε}, a set of (N_{1},N_{2},N_{3},N_{4}) with N_{1}+N_{2}+N_{3}+N_{4}=m, such that \(\mathbb P (N_{1},N_{2},N_{3},N_{4})\in I_{\varepsilon } \geq 1\varepsilon \). Given the set I_{ε}, we have following bounds
In addition, \(p^{U}_{\varepsilon }(\hat T^{c})p^{L}_{\varepsilon }(\hat T^{c})=\varepsilon \).
The idea behind the algorithm is that a multinomial distribution concentrates around its mode. Two possible states N=(N_{1},N_{2},N_{3},N_{4}) and \(\boldsymbol N^{\prime }=(N_{1}^{\prime },N_{2}^{\prime },N_{3}^{\prime },N_{4}^{\prime })\) are neighbors, N∼N^{′}, if \(\sum _{i=1}^{4}N_{i}N^{\prime }_{i}=2\). This means that N^{′} can be obtained from N by moving one element to a different class. We construct the set I_{ε} as follows.
At the onset, I_{ε} contains only the mode of multinomial distribution. We find the mode by a simple hill climbing algorithm, which starts with a state close to the mean of the multinomial distribution and follows the direction of increasing probability until the maximum is reached. Because of unimodality, it is indeed a global maximum. In the next steps, we add the neighbors of states which were previously visited. The procedure is repeated until the total probability of set I_{ε} reaches the desired value 1−ε. The details of the above method can be found in [28]. We construct the set I_{ε} and we estimate the pvalue by
Bootstrap procedure
The bootstrap procedure has gained mainstream popularity for its wide applicability and statistical treatments [29]. Creating an empirical distribution of null statistics allows for a flexible and robust estimation of pvalues and related statistics. We show how to use the resampling with replacement to obtain statistical significance of T^{c}(y_{i},y_{j}). Particularly, resampling with replacement y_{i} and y_{j}, separately, breaks any potential dependency. This allows us to calculate an empirical distribution of Jaccard/Tanimoto coefficients under the null hypothesis:
The expectation of Jaccard/Tanimoto coefficients is estimated directly from resampled vectors \(\boldsymbol {y}^{*}_{i}\) and \(\boldsymbol {y}^{*}_{j}\), that are effectively independent. Therefore, each iteration provides randomness, which helps avoid a bias related to using an estimated expectation based only on observation. Previously, there are early works in Monte Carlo procedures [14, 30] and published statistical tables for assessing randomness in species cooccurances [22, 23]. However, earlier works have assumed that a probability of occurrences is 0.5 regardless of species or biogeographic units. Permutation methods based on conventional uncentered coefficients are available in R packages, whose operating characteristics are not described in details [31, 32].
The resolution of the empirical null distribution depends on B, where the larger B will result in more precise estimation of pvalues. Although the choice of B would likely be dictated by n and m, as well as available computational power, we recommend setting B to at least 510 times of m. In our simulation studies, the total bootstrap iterations is set to B=5×m, which are shown to be both accurate and fast. When comparing a very large set of species or OTUs, it may be helpful to pool null statistics to increase the pvalue resolution and speed up the computation.
Results and discussion
Simulation studies
We have developed statistical methods and algorithms to obtain statistical significance of Jaccard/Tanimoto similarity coefficients for biological presenceabsence data. Beyond deriving the exact solution, we introduce the measurement concentration algorithm (MCA) and bootstrap method. We characterize their operating characteristics by comprehensive simulation studies where a wide range of parameters for presenceabsence datasets are considered. Our goal is to maintain theoretically correct behaviors of pvalues. Null pvalues corresponding to H_{0} are evaluated against a Uniform(0,1) distribution. False discovery rates (FDRs) are directly estimated from pvalues produced by our methods to demonstrate an overall error control.
First, we conducted 5 simulation scenarios using different underlying occurrence probabilities p=0.1,0.3,0.5,0.7,0.9 to generate independent presenceabsence datasets. In essence, they are two species of length m=100 that exhibit unrelated cooccurrence patterns, where a proportion of presence (1’s) ranges from 10% to 90%. For each of simulation scenarios, a total of 2000 comparisons were made using a length m=100. Without any information about simulation parameters, our proposed methods are applied on an identically simulated dataset (Fig. 2). Theoretically correct pvalues under the null hypothesis (null pvalues) should form a Uniform distribution between 0 and 1, which are denoted by dashed diagonal lines in QQ plots. An upward deviation from diagonals shows an anticonservative bias, as shown among some asymptotic pvalues. In all scenarios, pvalues from the exact solution, bootstrap (B=500), and measure concentration (accuracy =1×10^{−5}) algorithms follow a theoretically correct Uniform(0,1) distribution (Fig. 2). Asymptotic approximation is inconsistent; its behavior is anticonservative with p=0.3,0.5 and slightly conservative with p=0.7,0.9. Asymptotic approximation should only be used when computational time is a critical bottleneck.
Second, we generated a mixture of independent and dependent datasets out of n=2000 presenceabsence vectors (of n=2000 species observed in m=200 biogeographic units) to evaluate false discovery rates. In three separate scenarios, we simulated 25%, 50%, and 75% of n=2000 species to be independent, resulting in null proportions of π_{0}=.25,.50,.75 respectively. For example, a scenario with π_{0}=.75 produces 500 out of n=2000 presenceabsence variables that are truly associated with the query variable. Then, our proposed asymptotic approximation, bootstrap method, and MCA are used to automatically compute pvalues. To account for variation in simulation, we repeated each scenario 20 times. FDRs and π_{0} are estimated by the qvalue methodology [33]. Qvalues are evaluated against FDR thresholds, so that we can evaluate accuracy of observed FDRs (Fig. 3). Twenty simulation replications are shown in semitransparent shades, whereas their group averages for 3 methods are shown as solid lines. An upward deviation as shown by asymptotic approximation indicates an overall anticonservative behavior, likely due to m↛∞. The bootstrap and MCA maintain the overall error rates, where the bootstrap exhibits slightly conservative characteristics (Fig. 3).
Third, we compared the computational efficiency of our proposed methods using our jaccard package on RStudio Cloud (Intel Xeon 2.90GHz and 1GB RAM), with R 3.5.0. We measured the runtime for a range of lengths m=50,…,500. For each m, we applied the proposed methods 10 times, with the bootstrap iteration B=5×m and MCA accuracy of 1×10^{−5}. The average runtimes are shown Fig. 4. Our proposed computational methods show drastic improvement over the exact solution as m increases. The asymptotic approximation is mostly instantaneous. When the similarity between two presenceabsence vectors of length m=500 were tested using the jaccard package, the exact solution was prohibitively slow, taking 41.5s on average. The bootstrap method was 449.8 times (0.09s) faster, whereas MCA was 92.5 times (0.45s) faster than the exact solution. Furthermore, we compared the runtimes of estimation methods for m=1000,…,10000 (Additional file 1: Figure S1). The gain in computational efficiency is more pronounced as the dimension (i.e., a length of presenceabsence vectors) grows in size.
Last, a simulation study with p=0.5 and m=200 was used to evaluate two recent methods of species cooccurrences analysis. We generated independent presenceabsence data where two species are truly unrelated. Then, methods of combinatorics [24] and hypergeometric distributions [25] are applied to obtain pvalues. We followed the recommendations given in each paper, displaying four possible pvalues from [24] (Additional file 2: Figure S2) and two onesided pvalues from [25] (Additional file 3: Figure S3). We observe these pvalues under the null hypothesis to substantially deviate from theoretically correct Uniform(0,1) distributions.
Applications in species cooccurrences
To show applications in statistically testing biological presenceabsence data, the proposed methods are applied to species cooccurrence data. We investigated bird species on 28 islands in the Republic of Vanuatu, that are available in [6] and analyzed in several pioneering studies in nonrandom cooccurrences of species [12, 14–16]. The data is consisted of presence and absence of bird species in 28 islands of Vanuatu, which used to be known as the New Hebrides. Three generalist species that existed in all 28 islands were removed from our analysis. We are interested in identifying what pairs of species exhibit statistically significantly cooccurrences.
For n=53 bird species in m=28 islands, we obtained 1378 pairwise Jaccard/Tanimoto similarity coefficients. The conventional Jaccard/Tanimoto coefficients depends strongly on their expected values under independence (Fig. 5). Similarly, the conventional Jaccard/Tanimoto coefficients are substantially correlated with the proportion of occurrences, with a Pearson correlation of 0.43 (pvalue <2.2×10^{−16}). Relying only on similarity coefficients would miss nonrandom cooccurrences among bird species that live in a few islands (Additional file 4: Figure S4). Our proposed methods account for cooccurrences that would be expected under independence. Histograms of the uncentered and centered Jaccard/Tanimoto coefficients are compared in Additional file 5: Figure S5.
We computed statistical significance by applying the bootstrap method with B=5000 and MCA with accuracy of 1×10^{−5}. Our two computational approaches estimated pvalues that are almost identical with their mean squared deviation of 1.15×10^{−4} (Additional file 6: Figure S6). Significant results that are substantially deviating from random samples indicate nonrandom cooccurrences of species (Fig. 6). Out of 1378 pairs of species that were tested, the proportion of independent specie pairs was estimated to be 24% using qvalue methodology [33]. Then, we calculated FDRs from 1378 pairwise pvalues. We discovered that 374 (27%) pairs are deemed significant at a qvalue threshold of 0.10.
Additionally, we applied the Jaccard/Tanimoto similarity tests among fish species in French freshwater streams, surveyed over a long period of time [34]. Briefly, the presence and absence data of the n=32 most common fish species in m=3347 sites across French rivers are obtained during 1980  1991 [34]. Our analysis estimates that about 84.3% of 496 pairs are estimated to be nonrandomly cooccurring. As surveyed for over a decade across Fresh rivers and surrounding habitats, it is reasonable that many fish species are interacting or influenced by related climate conditions. There are 21 pairs of species with qvalues >0.1 (corresponding pvalues ranging from 0.637 and 0.969). For example, the centered statistics between Pungitius pungitius and Cyprinus carpio is 3.31×10^{−4}, whereas that between Pungitius pungitius and Lota lota is −4.40×10^{−4}. P. pungitius is a small fish species typically riding in thick submerged vegetation with the breeding season falling in April  July. C. carpio and L. lota are much bigger species and generally prefers a large body of water.
Conclusion
From biogeography to microbiology, evaluating similarity among species and biogeographic units is fundamental to assessing coexistence and biodiversity. Having observed occurrences of species in multiple biogeographic units, one of the primary goals in analyzing presenceabsence data is to identify nonrandom cooccurrences. Even if two species would be present independently of each other, they may occur together by chance. For the last 30 years, the Jaccard/Tanimoto coefficient has been shown to be highly useful for quantitative analysis of cooccurrences that help inform systematic relationship among species [3–5]. We have developed a rigorous statistical framework and methods to efficiently calculate statistical significance of such similarity and to identify nonrandom cooccurrences.
For testing cooccurrences using the Jaccard/Tanimoto coefficient, we introduce exact and asymptotic solutions, as well as bootstrap and measure concentration algorithm. The proposed suite of statistical methods can provide a rigorous guideline to identify related species. Through comprehensive simulation studies, we characterized their operating characteristics using pvalues and FDRs. The proposed bootstrap and measure concentration algorithms are highly accurate and efficient, providing orders of magnitude improvement in a computational speed. We have implemented the proposed methods in an open source R package and a Shiny web app. A user can upload a dataset to be analyzed, and create histograms and heat maps automatically. This will facilitate adaptation of pvalues, FDRs, and related quantities in analyzing species cooccurrences.
Beyond species cooccurrences, the Jaccard/Tanimoto coefficient is used in diverse areas of biological science where binary data are observed and compared. When molecules and reactions are represented as hashed fingerprints, it is used for quantitative comparisons and classifications [35–37]. Similarity between biochemical reactions can be tested by applying our methods on their corresponding fingerprints. In genomics, the standard tools such as BEDTools [38] evaluate genomic intervals using the Jaccard/Tanimoto coefficients. Given genomic intervals from two samples or groups, one could test whether their overlap is statistically significant, providing evidences for shared genomic variations. Due to the popularity of Jaccard/Tanimoto coefficients, the proposed suite of methods would be useful in a broad range of scientific applications.
Availability of data and materials
The jaccard package is available on the Comprehensive R Archive Network (CRAN) https://CRAN.Rproject.org/package=jaccard, whereas the development version on Github https://github.com/ncchung/jaccard. The Shiny app is available at https://nnnn.shinyapps.io/jaccard/.
Abbreviations
 FDR:

False discovery rate
 i.i.d:

Independent and identically distributed
 MCA:

Measure concentration algorithm
 OTU:

Operational taxonomic unit
References
Jaccard P. The distribution of the flora in the alpine zone. New Phytologist. 1912; 11(2):37–50. https://doi.org/10.1111/j.14698137.1912.tb05611.x.
Tanimoto T. An elementary mathematical theory of classification and prediction. Technical report. 1958.
Birks HJB. Recent methodological developments in quantitative descriptive biogeography. Ann Zool Fenn. 1987; 24:165–78.
Jackson DA, Somers KM, Harvey HH. Null models and fish communities: Evidence of nonrandom patterns. Am Natural. 1992; 139(5):930–51.
Real R, Vargas JM. The probabilistic basis of jaccard’s index of similarity. Syst Biol. 1996; 45(3):380–5. https://doi.org/10.1093/sysbio/45.3.380.
Manly BFJ. Randomization, Bootstrap and Monte Carlo Methods in Biology. Boca Raton, FL: Chapman & Hall / CRC Press; 2006.
Davies NB, Krebs JR. An Introduction to Behavioural Ecology. USA: WileyBlackwell; 1993.
Townsend CR, Begon M, Harper JL. Essentials of Ecology. USA: WileyBlackwell; 2002.
Whittaker RH. Vegetation of the siskiyou mountains, oregon and california. Ecol Monogr. 1960; 30(3):279–338. https://doi.org/10.2307/1943563.
Harrison S, Ross SJ, Lawton JH. Beta diversity on geographic gradients in britain. J Animal Ecol. 1992; 61(1):151. https://doi.org/10.2307/5518.
Koleff P, Gaston KJ, Lennon JJ. Measuring beta diversity for presenceabsence data. J Animal Ecol. 2003; 72(3):367–82. https://doi.org/10.1046/j.13652656.2003.00710.x.
Connor EF, Simberloff D. The assembly of species communities: Chance or competition?Ecology. 1979; 60(6):1132. https://doi.org/10.2307/1936961.
Diamond JM, Gilpin ME. Examination of the “null” model of connor and simberloff for species cooccurrence on islands. Oecologia. 1982; 52:64–74. https://doi.org/10.1007/BF00349013.
Gilpin ME, Diamond JM. Factors contributing to nonrandomness in species cooccurrences on islands. Oecologia. 1982; 52:75–84. https://doi.org/10.1007/BF00349014.
Wilson JB. Methods for detecting nonrandomness in species cooccurrences: a contribution. Oecologia. 1987; 73(4):579–82. https://doi.org/10.1007/BF00379419.
Manly BFJ. A note on the analysis of species cooccurrences. Ecology. 1995; 76(4):1109–15. https://doi.org/10.2307/1940919.
Sanderson J, Moulton M, Selfridge R. Null matrices and the analysis of species cooccurrencessanderson. Oecologia. 1998; 116(1–2):275–83. https://doi.org/10.1007/s004420050.
Ellwood MDF, Manica A, Foster WA. Stochastic and deterministic processes jointly structure tropical arthropod communities. Ecol Lett. 2009; 12(4):277–84. https://doi.org/10.1111/j.14610248.2009.01284.x.
Chase JM, Myers JA. Disentangling the importance of ecological niches from stochastic processes across scales. Philosoph Trans Royal Soc B: Biol Sci. 2011; 366(1576):2351–63. https://doi.org/10.1098/rstb.2011.0063.
Fridley JD, Vandermast DB, Kuppinger DM, Manthey M, Peet RK. Cooccurrence based assessment of habitat generalists and specialists: A new approach for the measurement of niche width. J Ecol. 2007; 95(4):707–22. https://doi.org/10.1111/j.13652745.2007.01236.x.
Araújo MB, Rozenfeld A. The geographic scaling of biotic interactions. Ecography. 2013. https://doi.org/10.1111/j.16000587.2013.00643.x.
BaroniUrbani C, Buser MW. Similarity of binary data. Syst Zool. 1976; 25(3):251. https://doi.org/10.2307/2412493.
BaroniUrbani C. A statistical table for the degree of coexistence between two species. Oecologia. 1979; 44(3):287–9. https://doi.org/10.1007/bf00545229.
Veech JA. A probabilistic model for analysing species cooccurrence. Global Ecol Biogeogr. 2013; 22:252–60. https://doi.org/10.1111/j.14668238.2012.00789.x.
Griffith DM, Veech JA, Marsh CJ. cooccur: Probabilistic species cooccurrence analysis inr. J Stat Softw. 2016; 69. https://doi.org/10.18637/jss.v069.c02.
R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2017. https://www.Rproject.org.
Wasserman L. All of Statistics: A Concise Course in Statistical Inference. New York: Springer; 2010.
Łącki MK, Startek M, Valkenborg D, Gambin A. IsoSpec: Hyperfast fine structure calculator. Analyt Chem. 2017; 89(6):3272–7. https://doi.org/10.1021/acs.analchem.6b01459.
Efron B, Tibshirani R. An Introduction to the Bootstrap. Boca Raton, Florida: Chapman & Hall / CRC Press; 1994.
Connor EF, Simberloff D. Species number and compositional similarity of the galapagos flora and avifauna. Ecol Monogr. 1978; 48:219–48. https://doi.org/10.2307/2937300.
Gotelli NJ, Hart EM, Ellison AM. EcoSimR: Null Model Analysis for Ecological Data. R package version 0.1.0. 2015. http://github.com/gotellilab/EcoSimR.
Oksanen J, Blanchet FG, Friendly M, Kindt R, Legendre P, McGlinn D, Minchin PR, O’Hara RB, Simpson GL, Solymos P, Stevens MHH, Szoecs E, Wagner H. Vegan: Community Ecology Package. R package version 2.45. 2017. https://CRAN.Rproject.org/package=vegan. Accessed 14 Jun 2018.
Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Nat Acad Sci. 2003; 100(16):9440–5. https://doi.org/10.1073/pnas.1530509100.
Comte L, Hugueny B, Grenouillet G. Climate interacts with anthropogenic drivers to determine extirpation dynamics. Ecography. 2016; 39(10):1008–16. https://doi.org/10.1111/ecog.01871.
Todeschini R, Consonni V, Xiang H, Holliday J, Buscema M, Willett P. Similarity coefficients for binary chemoinformatics data: Overview and extended comparison using simulated and real data sets. J Chem Inf Model. 2012; 52(11):2884–901. https://doi.org/10.1021/ci300261r.
Rahman SA, Cuesta SM, Furnham N, Holliday GL, Thornton JM. ECBLAST: a tool to automatically search and compare enzyme reactions. Nature Methods. 2014; 11(2):171–4. https://doi.org/10.1038/nmeth.2803.
Bajusz D, Rácz A, Héberger K. Why is tanimoto index an appropriate choice for fingerprintbased similarity calculations?J Chem Inform. 2015; 7(1). https://doi.org/10.1186/s1332101500693.
Quinlan AR. Bedtools: the swissarmy tool for genome feature analysis. Current Protocols in Bioinformatics. 2014:11–12. https://doi.org/10.1002/0471250953.bi1112s47.
Acknowledgements
Not applicable.
About this supplement
This article has been published as part of BMC Bioinformatics, Volume 20 Supplement 15, 2019: Selected articles from the 14th International Symposium on Bioinformatics Research and Applications (ISBRA18): bioinformatics. The full contents of the supplement are available at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume20supplement15
Funding
This work was supported by the Polish National Science Centre (NCN) grants 2014/12/W/ST5/00592 and 2016/23/D/ST6/03613. Publication costs are funded by 2016/23/D/ST6/03613. The funding body played no role in the study.
Author information
Authors and Affiliations
Contributions
NCC and AG conceived and designed the study. NCC and BM developed and implemented the methods and algorithms, with contributions from AG and MS. NCC analyzed the data and led the writing with substantial helps from BM and AG. All authors approved publication. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
Computational runtimes when testing similarity between presenceabsence data upto m=10000. We ran the proposed 4 methods to compute pvalues for a wide range of dimension m. For each m, 100 independent simulations are conducted. Note that for m≥1000, the exact solution did not compute in a reasonable time. The bootstrap and measure concentration algorithm (MCA) are orders of magnitude faster than the exact solution. The asymptotic solution is instantaneous regardless of m.
Additional file 2
Combinatoric pvalues of similarity among independent presenceabsence vectors of m=200 with p=.5. In each scenario, 2000 independent variables are simulated and tested using a combinatorics [24]. [24] recommends p_{lt}+p_{et} and p_{gt}+p_{et} as pvalues. The dashed red lines indicate theoretically correct Uniform distributions.
Additional file 3
Hypergeometric pvalues of similarity among independent presenceabsence vectors of m=200 with p=.5. We used a hypergeometric distribution [25] to obtain pvalues of similarity between independent species. The original authors suggested that p_{gt} and p_{lt} can be “interpreted and reported as pvalues”. The dashed red lines indicate theoretically correct Uniform distributions.
Additional file 4
Scatterplot of marginal occurrences of 53 bird species and Jaccard/Tanimoto coefficients. As expected, we observe high correlation (Pearson correlation =0.43) between marginal occurrences and Jaccard/Tanimoto coefficients.
Additional file 5
Histograms of conventional and centered Jaccard/Tanimoto similarity coefficients. The conventional (uncentered) Jaccard/Tanimoto coefficients are centered by their expected values under the independence assumption.
Additional file 6
Comparison of pvalues from the bootstrap and measure concentration algorithm (MCA). Both algorithms were applied on 1378 cooccurrences of bird species. The difference between estimated pvalues from two methods is minimal with a mean squared deviation of 1.15×10^{−4}. The diagonal red line indicates the identity.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Chung, N., Miasojedow, B., Startek, M. et al. Jaccard/Tanimoto similarity test and estimation methods for biological presenceabsence data. BMC Bioinformatics 20 (Suppl 15), 644 (2019). https://doi.org/10.1186/s1285901931185
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901931185
Keywords
 Jaccard
 Tanimoto
 Binary similarity
 Presenceabsence
 Cooccurrences
 Pvalue
AMS Subject Classification
 Primary 62F03; secondary 6207