A novel scan statistics approach for clustering identification and comparison in binary genomic data
 Danilo Pellin^{1, 2}Email author and
 Clelia Di Serio^{1}
https://doi.org/10.1186/s1285901611738
© The Author(s) 2016
Published: 22 September 2016
Abstract
Background
In biomedical research a relevant issue is to identify time intervals or portions of a ndimensional support where a particular event of interest is more likely to occur than expected. Algorithms that require to specify apriori number/dimension/length of clusters assumed for the data suffer from a high degree of arbitrariness whenever no precise information are available, and this may strongly affect final estimation on parameters. Within this framework, spatial scanstatistics have been proposed in the literature, representing a valid nonparametric alternative.
Results
We adapt the so called Bernoullimodel scan statistic to the genomic field and we propose a multivariate extension, named Relative Scan Statistics, for the comparison of two series of Bernoulli r.v. defined over a common support, with the final goal of highlighting unshared event rate variations. Using a probabilistic approach based on success probability estimates and comparison (likelihood based), we can exploit an hypothesis testing procedure to identify clusters and relative clusters. Both the univariate and the novel multivariate extension of the scan statistic confirm previously published findings.
Conclusion
The method described in the paper represents a challenging application of scan statistics framework to problem related to genomic data. From a biological perspective, these tools offer the possibility to clinicians and researcher to improve their knowledge on viral vectors integrations process, allowing to focus their attention to restricted overtargeted portion of the genome.
Keywords
Background
In many different research areas it is of interest to identify time intervals or portions of a ndimensional support where a particular event is more likely to occur than expected. These regions, which in biology are commonly called clusters or hotspots, are presumable characterized by an increased probability of success and their identification may throw light on a better understanding of the underlying eventsgenerating process. Different perspectives can be adopted according to both classical and Bayesian frameworks, and within parametric and nonparametric approaches. Applications include also the fields of epidemiology, public health, astronomy and neuroscience, ranging from one to ndimensional spaces [1–6].
Many algorithms require to specify apriori the number of clusters assumed for the data and/or their expected dimension and/or length. These settings may strongly affect the final estimation results and requires a high degree of arbitrariness on the parameters whenever no precise informations are available. Spatial scan has been proposed with wide success in the literature [5] becoming one of the main epidemiological statistics tools in disease surveillance to test the null hypothesis that geographical data are randomly distributed against a localized cluster alternative. This method and its natural extensions are of particular interest since no prior information on parameters or clusters characteristics are required. Indeed, the scan statistic is able to address any of the following interrelated purposes: a) to test if event aggregation occurs (overall clustering), b) cluster localization (detection of cluster) c) to test event distribution on a specific region (focused test).
In a multivariate setting, a challenging goal for researchers may be the identification of regions where two spatial processes  defined over a common support  show different behaviours. More in detail, the processes are allowed to share fluctuations in probability (or rate) of success. To address this type of problem, a few alternatives have been currently proposed. Most of them rely on nonparametric estimation of relative risk function by means of kernel method, as proposed in [7, 8] for environmental epidemiology data analysis.
Scan statistics methodologies have been proposed for the analysis of Poisson and Gaussian distributed random variables, categorical and many other type data. In this paper we are interested in modeling spatial distribution of a particular type of genomic data, such as viral IS retrieved by using Next Generation Sequencing (NGS) platforms [9, 10]. From a statistical point of view, the genome is interpreted as a set of 2×3×10^{9} independent Bernoulli random variables B _{ chr,postion,strand }, where 1 means that a viral integration has been observed mapping to that particular genomic coordinates and 0 otherwise.
In genomics a few alternatives have been proposed to identify clusters of ISs, termed Common Integration Sites (CIS) or hotspots. The most popular in the biological literature is a gene integration frequency based method, involving Grubbs test [11] for outlier identification [12]. This approach suffers from an important limitation since ISs located outside genes and their neighborhoods are excluded from the analysis, thus leading to miss possible important intergenic CISs potentially very informative. To overcome this problem, an alternative method based on DBSCAN [13] algorithm has been proposed in [10]. The main drawback of this algorithm is the strong dependence of results on tuning parameters settings, difficult to calibrate for different sized data sets involving viral vectors with different clustering behaviours. To solve this issue, in [10] authors proposed a framework based on resampling insilico generated ISs to select an optimal distance parameter, by controlling the probability of smaller clusters (3 events) identification. However, the impact of this procedure on bigger clusters investigation is unclear.
Insertional mutagenesis [14] provide a good setting in clinical genomics to understand the importance of comparing two integration patterns. This phenomenon is caused by virus integration trajectory within particular dangerous genomic regions, such as oncogenic regions. Since many studies revealed different patterns in site selection process among available viral vectors, a statistical procedure that allows to identify differently targeted regions represents a fundamental tool in limiting insertional mutagenesis risk. Another framework where tools for detecting genomic clustering might be extremely helpful for biological research is the investigation of active regulatory element involved in differentiation process. This can be performed by exploiting the capability of particular viral vectors, such as the Murine Leukemia Virus (MLV) derived vectors, in marking transcription start site of active genes [15, 16].
Some approaches have been proposed in the literature [17] based on kernel methods where two separate nonparametric kernel densities are estimated by means of Gaussian kernels. Comparative clusters of integrations (hotspots) can be selected in those genomic areas where no overlapping among confidence intervals for densities were detected. However, the arbitrary choice of smoothing parameters (bandwidth) strongly affects the detecting procedure.
In this paper we propose to overcome several problematic issues in the existing procedures, by extending the Bernoulli model proposed in [5] to the genomic field. We first study more in depth the preliminary results presented in [18] for clusters identification in univariate setting. We also propose a novel multivariate alternative, that we call Relative Scan Statistics for comparing two integration patterns by the identification of comparative or relative clusters. Multivariate extensions of scan statistics have already been proposed in the literature [19],to detect disease outbreaks by means of simultaneous analysis of different data sets. To our knowledge, there are no paper focusing on detecting differences among data sets using scan statistics. Finally, the proposed methods are compared to the existing ones, like the DBSCAN algorithm and the comparative hotspot [17] procedure.
The paper is organized as it follows. In Section Methods we introduce the Kulldorff scan statistics for Bernoulli data, we illustrate how the method can be used to compare two genomic data sets and the algorithm implementation is presented. In Section Results and discussion real data sets are descibed and results obtained for the univariate and multivariate analysis are discussed. Final consideration and conclusion are provided in Section Conclusions.
Methods
Kulldorff spatial scan statistics for Bernoulli model
The method proposed by [5] can be adopted to face clusters identification as a general problem. In this work, we focus on Bernoulli model, since we consider a particular type of genomic data – derived by viral vector integration in gene therapy – that reveal presence or absence of a genomic event (namely the integration). A brief description of the underlying idea and the specification of the method for the univariate data analysis previously proposed in [18], is next introduced. Let define the whole study area under investigation as G, \(\mathcal {Z}\) the collection of zones Z⊂G obtained by scanning the support by means of a window of variable size.
Multivariate extension to novel relative scan statistics for Bernoulli model
Let now introduce a novel multivariate extension of the described method for identifying the most highly significant relative cluster. The method is described as referred to a bivariate case, in order to ensure clarity of the underlying idea, but can be easily extended for the comparison of more than two processes.
We define a relative cluster as an area \(Z \in \mathcal {Z}\) where two Bernoulli processes show different behaviour, in terms of success probability variation with respect to Z ^{ C }=G∖Z. Conditioning on a particular area \(Z \in \mathcal {Z}\) let define p _{ Z1} and p _{ Z2} as the probability of being an event within Z respectively for Process _{1} and Process _{2} and q _{ Z1} and q _{ Z2} be referred to Z ^{ C }. Bernoulli trials location, assumed as known over G, can differ between the two processes. All the analyses are conditioned on the total count of observed events X _{1} and X _{2}. The aim is here to highlight regions where the difference between probability of success in the two series is maximum and statistically significant, accounting for possible different data sets size and nonconstant but shared underlying probability variations.
To measure and compare within each process the behaviour observed within/outside Z, we propose the success probability ratio \(\frac {p_{Zi}}{q_{Zi}}\). The ratio takes values in R ^{+} and more specifically \( 0 \leq \frac {p_{Zi}}{q_{Zi}} < 1\) if the probability of success is lower within Z than outside and \(1 < \frac {p_{Zi}}{q_{Zi}} < \infty \) otherwise.

N _{1} and N _{2} be the total count of Bernoulli trials for each process.

X _{1} and X _{2} be the total count of success

n _{1Z } and n _{2Z } be the size, in terms of trials, of the Z with respect of each series

x _{1Z } and x _{2Z } be the success amount within Z with respect of each series
Since a closed analytical formula for \(\hat {{q_{Z1}}}\), \(\hat {{q_{Z2}}},\hat {k_{Z}}\) is computationally difficult to derive, we search for a numerical solution to calculate likelihood value \(L_{0Z_{j}}\) and parameters estimates. We remark that differently from the univariate case, the likelihood under the null depend on Z and is not constant over the whole study area G.
Thus, the upper bound for the total amount of element in \(\mathcal {Z}\) is [ (X _{1}+X _{2})∗(X _{1}+X _{2}−1)/2]. Whenever is possible to define a minimum/maximum length threshold for the relative cluster, a further reduction of complexity and computational efforts holds.
The interpretation of pvalue associated to relative scan statistic S must take into account the dimension of set \(\mathcal {Z}\), corresponding to the total amount of performed tests. Since dependence between tests varies in strength and can be both positive or negative (it depends on the respective location of the zones associated to tests considered), we adopted the HolmBonferroni [22] method for family wise error rate (FWER) control. If S results significant, it is possible to scan the study area to identify eventual secondary significant relative cluster \(\hat {Z^{*}}\) disjoint with \(\hat {Z}\). For this purpose, we implement a sequential approach, thus ensuring I type error rate control and higher power [23]. The method consists in removing from G zone(s) previously detected as significant, redefining a new the set \(\mathcal {Z^{*}}\) and values for \(N_{1}^{*}\), \(N_{2}^{*}\), \(X_{1}^{*}\), \(X_{2}^{*}\), \(n_{1Z}^{*}\), \(n_{2Z}^{*}\), \(x_{1Z}^{*}\) and \(x_{2Z}^{*}\) and sequentially performing maximizationFWER control steps.
Algorithm
We next describe the procedure for identifying relative clusters. We designed the script for genomic binary data (e.g. viral integration data). When referring in particular to gene therapy settings, the input information needed are data sets (one data sets in univariate analysis and two data sets for multivariate comparison) relative to IS coordinates (chromosome, position and strand), blind regions locations if available, maximum length for candidate interesting regions, L _{ max }, and a minimum event counts, EC _{ min }. These two input parameters play a crucial role in the definition of the final output and have a strong impact on the computational effort. Their setting must be chosen carefully, according to the data sets size and computational resources available. We suggest, to avoid to exceed half of the support G for L _{ max } (clusters greater than this threshold are not very informative) and to set EC _{ min } to a small value (EC _{ min }≥3) in order to preserve the capability to detect possible smaller interesting regions.
 1.
Using IS data sets and blind regions annotation file, calculate effective genome size X _{1}, X _{2} and N
 2.
Chromosome based definition of the full set of zones, \(\mathcal {Z}\).
 3.
Filter zones with length(Z)≥L _{ max } and EventCount(Z)≤EC _{ min }.
 4.
Using IS data sets and blind regions annotation file, calculate effective zones size x _{1Z }, x _{2Z } and n _{ Z }.
 5.
For each zone Z, calculate L _{0Z } (Eq. 3) and L _{ Z } (Eq. 4) and corresponding λ _{ Z } (Eq. 5).
 6.
Using \({\chi ^{2}_{1}}\) distribution, assign to each λ _{ Z } a pvalue (Eq. 6).
 7.
Apply multiple testing procedure.
 8.
If adjusted pvalue associated to \(\lambda _{\hat {Z}}\) is significant, define \(G^{*}=G \setminus \hat {Z}\).
 9.
Calculate new \(X_{1}^{*}\), \(X_{2}^{*}\) and N ^{∗} and restart from step 2.
The algorithm is implemented with a R script available upon request to the corresponding author.
Results and discussion
Datasets
Our application considers data sets that are comparable, for size and type of data, to those used in the literature [10] where alternative methods have been implemented to analyze and compare the profile of MLV and HIV integrations in human hematopoietic stem cells CD34+ in order to study their behaviour within the same cell type. To reduce possible technical bias the same laboratory protocol and sequencing platform was adopted.
For a detailed description of the biotechnological protocols adopted in the laboratories and subsequent bioinformatics processing steps performed, we refer to [10] and its supplementary materials. The final ISs data sets size were respectively 32631 for MLV (X _{1}) and 28,382 for HIV (X _{2}).
Due to various reasons related to sequencing technique (e.g. restriction enzymes) and mappability issue of the human genome (e.g. repeated sequences), the whole genome is not technically investigable. Blind regions are defined in the literature [17] as unobserved genomic portions which are strictly dependent on different laboratory settings and their distribution, position and total amount may change a lot across studies. However, using sophisticated and computationally intensive algorithm, it is possible to calculate and predict them quite precisely.
Regarding the univariate setting, taking into account for mappability condition allow to reduce possible systematic/technical bias and to compare clustering behaviour among experiments performed under different setting. Incorporating blind regions information in the multivariate scan statistics makes our approach more straightforward as compared to density estimations procedure, and their asymmetry with respect to strand does not necessary require to split analysis into two strand specific tasks. In this paper we adopt results in the literature [17] for selecting predicted blind regions thus reducing the genome representation to a set of N=4398094578 (about 2.20 ×10^{9} each strand) independent Bernoulli random variable.
A filtering procedure was applied to \(\mathcal {Z}\) generated, consisting in eliminating zones longer than 2.5×10^{7} bps (considering simple difference between ISs position) and containing less then 3 ISs. This is performed in order to reduce maximization space and to focus on more biologically meaningful regions without loss of arbitrariness. The size of each zone n _{ Z } is determined subtracting to the theoretical size (2 x ISs distance) the total amount (considering both strand separately) of blind regions contained.
Univariate analysis results
We run single IS series analysis with scan statistics approach and we compare the results with hotspots reported in the literature [10], obtained using DBSCAN algorithm [13] (see Supplementary Material and Method in [10] for DBSCAN setting used). Some preliminary results for this analysis has been previously published in [18], without taking into account blind regions bias and focusing only on most significant findings. In HIV data set, DBSCAN identify 2446 clusters, containing 50.6 % (14,369 IS) of the total amount of IS. Clusters’ length is on average 19220 bps, but varies from a minimum of 100 to a maximum of 200500 bps. The majority (90 %) of HIV clusters are composed by 3–10 ISs.
List of first 10 clusters identified in HIV data by scan statistics
S  Chr  Start  End  IS count  \(\frac {\hat {p_{HIV_{Z}}}}{\hat {q_{HIV_{Z}}}}\)  Raw pvalue  Adj pvalue 

2463.2  chr11  63175583  68111375  651  17.2  <2e16  <2e16 
1795.1  chr16  95090  3640598  444  19.6  <2e16  <2e16 
1390.0  chr17  70634094  73732441  386  15.5  <2e16  <2e16 
1189.8  chr17  75720251  78604915  323  16.2  <2e16  <2e16 
1063.8  chr3  46999507  52978572  424  8.5  <2e16  <2e16 
1046.8  chr6  30563526  33532447  325  12.6  <2e16  <2e16 
1041.8  chr9  138245676  139772487  224  26.9  <2e16  <2e16 
732.0  chr8  144469820  146194757  188  18.1  <2e16  <2e16 
721.1  chr19  572963  3118599  209  14.3  <2e16  <2e16 
629.1  chr17  1483915  4578114  238  9.2  <2e16  <2e16 
List of first 10 clusters identified in MLV data by scan statistics
S  Chr  Start  End  IS count  \(\frac {\hat {p_{MLV_{Z}}}}{\hat {q_{MLV_{Z}}}}\)  Raw pvalue  Adj pvalue 

386.5  chr20  51646845  51991770  89  22.8  <2E16  <2E16 
326.4  chr20  10362242  10450134  55  51.8  <2E16  <2E16 
318.4  chr17  26646082  26672265  41  131.1  <2E16  <2E16 
302.6  chr17  76325116  76460372  56  39.5  <2E16  <2E16 
285.6  chr19  59566413  59591310  37  127.9  <2E16  <2E16 
284.6  chr21  38671040  39311896  90  12.2  <2E16  <2E16 
279.2  chr17  51718847  53782415  142  6.2  <2E16  <2E16 
278.7  chr1  25046795  28847012  183  4.7  <2E16  <2E16 
267.7  chr18  72291047  72971441  87  11.6  <2E16  <2E16 
264.4  chr12  6084417  10441567  197  4.2  <2E16  <2E16 
We next investigate how methods agree in identifying locations of most significant regions. DBSCAN clusters are sorted in terms of size, i.e. the amount of IS falling within cluster limits, to allow for possible the comparison with scan statistics results.
The list of the first 10 Most Significant Clusters (MSCs) coordinates discovered by Scan Statistics in HIV data set are showed in Table 1, together with some related measures. The complete list is available in Additional file 1. The most significant cluster is located at chromosome 11, interval 63,175,583;68,111,375 and within the same region DBSCAN identifies 40 out of 2446 distinct clusters, including the top 2 for ISs content (interval 65,586,752;65,736,062, 110 ISs and interval 6665150366776194, 96 ISs). The second most significant cluster, named MSC _{2} is located on chromosome 16, interval 71,294,851;77,821,445 and is composed by 610 IS. Within this genomic region, DBSCAN reported 38 clusters, including the third in terms of ISs.
Univariate analysis results for MLV data set are tabulated in Table 2 and Additional file 2. Region on chromosome 20, interval 51,646,845;51,991,770 contain 89 ISs and is suggested to be the most evident hotspot region for MLV vector. Within the same interval, DBSCAN identify 8 distinct clusters, but not among the top in ranking. The second, MSC _{2},is on chromosome 20, interval 10,362,242;10,450,134 and is composed by 55 ISs. It overlaps with the 50th hotspost retrieved using DBSCAN. A perfect correspondence is observed between MSC _{3} and the 4th cluster derived from DBSCAN, both located on chromosome 17, interval 26,659,383;26,672,265. Conversely, the first cluster calculated using DBSCAN is on chromosome 22 27,525,356;27,545,150, its size is 42 ISs and corresponds to 85th MLV scan statistics derived cluster.
In simple terms we reveal that the most important part of the difference in identifying the total amount of clusters can be attributed to a fragmentation of scan statistics cluster in more DBSCAN clusters. Despite that, an overall clear correspondence in terms of localization was observed, while agreement in ranking is more dependent on clustering behaviour.
Multivariate analysis results
We remark that the a big advantage of the proposed methods is the ability to detect both long and short regions. Long relative cluster can be usually easily visualized by using density estimate superposition. Short relative clusters or closed opposite relative cluster are much more difficult to detect, due to the smoothness of kernel estimator. This is in our opinion a crucial feature of our proposal, and it may be of particular utility for data analysis and for vector safety assessment. We now compare our list with the suggested 100 regions (51 for MLV and 49 for HIV) proposed in the literature [17].
Although the total amount of interesting regions might vary considerably, it is not clear which one performs better since true differently targeted regions are not known. In our opinion, since the underlying biological mechanism and target site selection process are deeply different (MLV belongs to the gammaretroviral genus and HIV to the lentiviral), a longer list of candidate regions can be considered more realistic.
List of relative clusters identified by relative scan statistics
S  Chr  Start  End  HIV IS  MLV IS  \(\log \left (\frac {\frac {\hat {p_{HIV_{Z}}}}{\hat {q_{HIV_{Z}}}}}{\frac {\hat {p_{MLV_{Z}}}}{\hat {q_{MLV_{Z}}}}}\right)\)  Type  Adj pvalue 

474.1  chr11  63153734  68347426  659  129  1.91  hiv  <2E16 
450.9  chr6  30095760  33488528  332  7  4.49  hiv  <2E16 
434.2  chr16  95090  3561021  430  41  2.74  hiv  <2E16 
260.9  chr17  70835415  73732441  372  75  1.86  hiv  <2E16 
227.0  chr3  47041751  52978572  422  119  1.47  hiv  <2E16 
219.4  chr9  134493480  139818935  307  60  1.89  hiv  <2E16 
213.5  chr17  77047796  77746204  172  7  3.70  hiv  <2E16 
191.9  chr8  144548769  146194757  182  15  2.89  hiv  <2E16 
122.0  chr19  1027304  6006371  292  104  1.20  hiv  <2E16 
115.4  chr22  48983597  49573459  115  11  2.71  hiv  <2E16 
105.6  chr21  37559632  39311896  9  126  3.02  mlv  <2E16 
102.1  chr19  54074745  55048471  122  18  2.21  hiv  <2E16 
99.3  chr17  1069411  4213267  229  79  1.23  hiv  <2E16 
96.4  chr1  153550587  154168170  90  7  2.94  hiv  <2E16 
91.8  chr18  70832211  73059134  6  103  3.26  mlv  <2E16 
91.5  chr17  4573721  7723628  194  62  1.32  hiv  <2E16 
86.5  chr20  49745347  52129713  7  102  3.07  mlv  <2E16 
86.0  chr12  11729500  14430150  8  105  2.95  mlv  <2E16 
83.3  chr20  60901158  62379063  109  19  2.02  hiv  <2E16 
81.3  chr6  6536008  13289623  22  141  2.13  mlv  <2E16 
Graphs for remaining chromosomes are available in Additional file 4.
Conclusions
In this paper we present two methods for clustering identification of genomic events based on scan statistics approach. Results retrieved from both methods are consistent with the biological literature and findings thus revealing deep biological differences between integration process and target sites selection characterizing different viral vectors. Speculating on cluster dimensions and length, our analysis confirms the well known preferences of MLV in integrating more likely in regulatory elements or in general over small genomic interval, whereas HIV integrates over wider regions corresponding to active coding elements. Independently from the total amount of identified interesting regions, a substantial spatial overlap between results was observed in HIV data set, as regarding both localization and significance. For MLV data set, a good agreement is showed in terms of localization but for significance ranking. The intrinsic behaviour of HIV probably helps this results correspondence, since aggregation is less strong than MLV but affects wider regions, leading to cluster formed by many IS rewarded by DBSCAN ranking scheme based on dimension. For MLV instead, generally the aggregation tendency is characterized by higher event density but limited to narrow genomic intervals and less ISs.
Relative Scan Statistics seems to be able to identify regions characterized by unshared variation of events rate, potentially allowing for focusing downstream analysis only on differently targeted regions. This may help clinicians/researcher in improve viral vectors safety. The results obtained agree with previous published literature and avoid the necessity to split analysis according to strands.
In conclusion, starting from a probabilistic approach based on estimation and comparison of probability of success, we recommended scan statistics as a fundamental inferential tool able to exploit an hypothesis testing procedure to sort candidate regions in terms of significance instead of size or additional testing procedure.
Declarations
Acknowledgements
The authors thank all members of the CUSSB for helpful suggestions.
Declaration
Publication charges for this article were funded by the grant FFC 27/2014 of the Fondazione per la Ricerca sulla Fibrosi Cistica.
This article has been published as part of BMC Bioinformatics Volume 17 Supplement 11, 2016. Selected articles from the 11th International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB 2014). The full contents of the supplement are available online https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume17supplement11.
Availability of data and materials
The raw sequence reads for HIV and MLV IS are available the GenBank Short Read Archive under the accession number SRA024251.1.
Authors’ contributions
Contributed statistical methodology: DP CDS. Wrote the paper: DP CDS. Implemented the methods: DP. Both authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Naus J. The distribution of the size of maximum cluster of points on the line. J Am Stat Assoc. 1965; 60:532–8.View ArticleGoogle Scholar
 Naus J. Clustering of random points in two dimensions. Biometrika. 1965; 52:263–7.View ArticleGoogle Scholar
 Loader CR. Largedeviation approximation to the distribution of scan statistics. Ann Appl Probab. 1991; 23:751–71.View ArticleGoogle Scholar
 Chen J, Glaz J. Two dimensional discrete scan statistics. Stat Probab Lett. 1996; 31:59–68.View ArticleGoogle Scholar
 Kulldorff M. A spatial scan statistic. Commun Stat:Theory Methods. 1997; 26:1481–1496.View ArticleGoogle Scholar
 Kulldorff M, Athas W, Feuer E, Miller B, Key C. Evaluating cluster alarms: A spacetime scan statistics and brain cancer in Los Alamos. Am J Public Health. 1998; 88:1377–1380.View ArticlePubMedPubMed CentralGoogle Scholar
 Kelsall JE, Diggle PJ. Kernel estimation of relative risk. Bernoulli. 1995; 1:3–16.View ArticleGoogle Scholar
 Kelsall JE, Diggle PJ. Non parametric estimation of spatial variation in relative risk. Stat Med. 1995; 14:2335–343.View ArticlePubMedGoogle Scholar
 Bushman F, Lewinski M, Ciuffi A, Barr S, Leipzig J, Hannenhalli S, Hoffmann C. Genomewide analysis of retroviral DNA integration. Nat Microbiol. 2005; 3:848–58.View ArticleGoogle Scholar
 Cattoglio C, Pellin D, Rizzi E, Maruggi G, Corti G, Miselli F, Sartori D, Guffanti A, Di Serio C, Ambrosi A, De Bellis G, Mavilio F. Highdefinition mapping of retroviral integration sites identifies active regulatory elements in human multipotent hematopoietic progenitors. Blood. 2010; 116:5507–517.View ArticlePubMedGoogle Scholar
 Grubbs FE. Sample criteria for testing outlying observations. Ann Math Stat. 1950; 21:27–58.View ArticleGoogle Scholar
 Biffi A, Bartolomae CC, Cesana D, Cartier N, Aubourg P, Ranzani M, Cesani M, Benedicenti F, Plati T, Rubagotti E, et al. Lentiviral vector common integration sites in preclinical models and a clinical trial reflect a benign integration bias and not oncogenic selection. Blood. 2011; 117(20):5332–5339.View ArticlePubMedGoogle Scholar
 Ester M, Kriegel HP, Sander J, Xu X, et al. A densitybased algorithm for discovering clusters in large spatial databases with noise. Kdd. 1996; 96(34):226–231.Google Scholar
 HaceinBeyAbina S, Garrigue A, Wang GP, Soulier A, Lim J, et al.Insertional oncogenesis in 4 patients after retrovirusmediated gene therapy of SCIDX1. J Clin Investig. 2008; 118:3132–142.View ArticlePubMedPubMed CentralGoogle Scholar
 Cattoglio C, Facchini G, Sartori D, A A, Antonelli A, Miccio A, Cassani B, Schmidt M, von Kalle C, Howe S, Thrasher AJ, Aiuti A, Ferrari G, Recchia A, Mavilio F. Hot spots of retroviral integration in human CD34+ hematopoietic cells. Blood. 2007; 110:1770–1778.View ArticlePubMedGoogle Scholar
 Ambrosi A, Di Serio C. Vectors and integration in gene therapy: Statistical considerations. J Comput Sci Syst Biol. 2009; 2:117–23.View ArticleGoogle Scholar
 Ambrosi A, Glad I, Pellin D, Cattoglio C, Mavilio F, Di Serio C, Frigessi A. Estimated comparative integration hotspots identify different behaviors of retroviral gene transfer vectors. PLoS Comput Biol. 2011; 7:12.View ArticleGoogle Scholar
 Pellin D, Di Serio C. Clusters identification in binary genomic data: The alternative offered by scan statistics approach. Comput Intell Methods for Bioinforma Biostat. 2014; 1:149–58.Google Scholar
 Kulldorff M, Mostashari F, Duczmal L, Yih K, Kleinman K, Platt R. Multivariate spatial scan statistics for disease surveillance. Stat Med. 2007; 26:1824–1833.View ArticlePubMedGoogle Scholar
 Aiuti A, Biasco L, Scaramuzza S, Ferrua F, Cicalese MP, Baricordi C, Dionisio F, Calabria A, Giannelli S, Castiello MC, et al. Lentiviral hematopoietic stem cell gene therapy in patients with wiskottaldrich syndrome. Science. 2013; 341(6148):1233151.View ArticlePubMedPubMed CentralGoogle Scholar
 Wilks SS. The largesample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics. 1938; 9(1):60–62.View ArticleGoogle Scholar
 Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat. 1979; 6:65–70.Google Scholar
 Zhang Z, Assunção R, Kulldorff M. Spatial scan statistics adjusted for multiple clusters. Journal of Probability and Statistics. 2010; 2010.Google Scholar