Evaluation of statistical approaches for association testing in noisy drug screening data

Background Identifying associations among biological variables is a major challenge in modern quantitative biological research, particularly given the systemic and statistical noise endemic to biological systems. Drug sensitivity data has proven to be a particularly challenging field for identifying associations to inform patient treatment. Results To address this, we introduce two semi-parametric variations on the commonly used concordance index: the robust concordance index and the kernelized concordance index (rCI, kCI), which incorporate measurements about the noise distribution from the data. We demonstrate that common statistical tests applied to the concordance index and its variations fail to control for false positives, and introduce efficient implementations to compute p-values using adaptive permutation testing. We then evaluate the statistical power of these coefficients under simulation and compare with Pearson and Spearman correlation coefficients. Finally, we evaluate the various statistics in matching drugs across pharmacogenomic datasets. Conclusions We observe that the rCI and kCI are better powered than the concordance index in simulation and show some improvement on real data. Surprisingly, we observe that the Pearson correlation was the most robust to measurement noise among the different metrics.

statistical inference and machine learning have become standard tools of a scientist trying to make sense of massive biological data [1][2][3].
Major challenges in the statistical analyses of biological data include the non-standard distributions of observed values and the noisiness of the measurements, especially for high throughput phenotypic measurements. The direct measurements often need to be transformed and/or normalized before the values can be interpreted and compared with results from other experiments, even within the same study. These transformations result in values that are often relative, bounded, highly skewed and/or distributed with heavy tails. Furthermore, the complexity of the experimental design and assays used, as well as the necessary compromises to enable high throughput collection of data (for example, working with small cell numbers and liquid volumes in plates with thousands of wells), make measurement noise unavoidable within these data [4][5][6]. Fortunately, by taking replicate measurements of the same conditions, it is possible to directly measure the degree of noise introduced by a given experimental protocol.
A common approach in identifying associations from biological data is computing similarities between vectors of biological measurements. Correlation coefficients are very widely used measures of similarity, of which the most popular is Pearson's Product Moment Correlation. Pearson's correlation assesses the degree of linear association between variables. The standard statistical test of significance for Pearson's correlation makes parametric assumptions, including normality and homoscedasticity of the inputs, and testing may be sensitive to outliers. Non-parametric correlation measures, such as the Spearman rank correlation and its associated tests, can detect nonlinear, monotonic associations but are commonly accepted as less powered to detect significant linear associations [7][8][9][10]. The concordance index (CI), a linear transformation of Kendall's Tau, is a non-parametric association measure with the advantage of being highly adaptable to missing and censored data, where the exact value of a data point is bounded but not precisely known. These non-parametric statistics also come with commonly associated tests for significance, including exact tests (often requiring no ties in the data), and asymptotically correct tests. The asymptotic tests make weak assumptions on the distribution of the data; however, they converge only in the limit of large sample sizes, and rates of convergence are difficult to characterize in practice. Beyond correlation coefficients, other metrics are also used, for instance cosine similarity from the field of information retrieval (which reduces to Pearson for centered data) [11][12][13]. In addition, any measure of distance can be transformed into a measure of similarity through usage of an appropriate kernel. A brief introductory level overview of computing similarities can be found in [14].
In our work, we examine the behaviour of the common correlation coefficients when applied to regimes motivated by high-throughput screens of cancer cell lines measuring viability after treatment with compounds. We show that for realistic sample sizes, common statistical tests applied to the concordance index fail to control for false positives, reinforcing the need for testing against a non-parametric, permutation-derived null distribution. We carefully examine the power of permutation tests of standard correlation measures in simulated settings, including the addition of simulated noise of realistic magnitudes. We also introduce two novel modifications of the concordance index, the robust CI (rCI) and kernelized CI (kCI), which take into account the exact quantification of noise possible through comparing replicates into the assessment of similarity between two vectors of measurements. We characterize these novel statistics through comparison with existing correlation coefficients, and provide efficient algorithms and implementations to compute them. We found that the standard CI is more powerful under a permutation test in detecting non-zero correlations for bounded and skewed distributions. We also surprisingly observed that the Pearson correlation was more robust to measurement noise than the other correlation coefficients investigated, including the proposed rCI and kCI statistics.

Related work
Assessing how correlation coefficients behave when applied to non-normal data and in the presence of noise has been previously investigated. Published studies have examined the power of correlation based permutation tests, as well as compared permutation tests to those based on asymptotic limiting behaviours of the coefficients [15][16][17]. Work by Bishara and Hitter [16] (with followup work by Puth et al. [17]) not only shows a comprehensive comparison of Type 1 error control and power between several parametric and resampling-based tests for the Pearson and Spearman correlations, but also contains a review of early simulation work on permutation testing of the Pearson correlation. These previous studies tested a wide range of unbounded distributions, but did not investigate simultaneously skewed and bounded distributions, a situation which arises frequently in analysis of high throughput phenotypic screens. Previous studies (excepting Chok's work [15]) have also left out the Concordance Index/Kendall's Tau correlation or any other pairwise ranking based methods from their comparisons. Previous studies have also investigated the effect of assay/measurement noise [18] or non-normal distributed data [19] on the bias of the sample Pearson correlation coefficient, but not on the statistical power to detect a significant effect. The robustness of other correlation coefficients to the presence of such measurement noise likewise remains unstudied. Finally, modifications of CI have been proposed previously in literature, both in the context of biological data [5], as well as in the computer science literature for assessing information retrieval system performance [20]. However, these modified CI statistics were used primarily as a performance metric for concordance between predictors and observations, and not as a measure of correlation between two sets of measurements.
Our study differs from existing studies in several aspects. First, we investigate larger sample sizes and lower p-value thresholds for significance ( α levels), more reflective of the high-throughput assays and the necessary corrections for the multitude of correlations assessed when analyzing modern biological data. Unlike previous studies, our intent is not an exhaustive characterization of different tests for significance of correlation coefficients across a variety of distributions. Rather, we advocate permutation testing as a general method applicable across correlation coefficients, and ask which coefficient is most powerful at detecting a significant (monotonic) effect. We investigate the statistical power only in the normally distributed case and a single bounded and skewed distribution, focusing instead on comparing different correlations and assessing their performance under noise. Our robust and kernelized Concordance Index measures are also unique in using replicate measurements to directly address the effect of noise in the analyzed data, and we evaluate their performance in detecting significant correlations.

Motivating data: preclinical drug sensitivity screening
While the questions asked in this paper are general and not necessarily tied to the analysis of any particular type of biological assay or experiment, our analyses were motivated by applications to preclinical drug sensitivity screening. These screens usually test the ability of compounds to inhibit cell growth and/or induce cell death across a panel of cancer cell line models grown in vitro. Each cell line is treated at several concentrations of compound, and the cell line growth at a particular time-point is compared to a matched untreated control to derive a % viability value for each concentration level.
In the analysis of these data, it is common to fit a Hill Curve to the multi-dose viability measurements of a cancer cell line in response to treatment by a particular drug, as shown in Additional file 1: Figure S1 [21,22]. From these dose-response curves, summary metrics are derived that try to capture the sensitivity of a particular cell line to treatment with the compound in a single number. The most commonly used summary metrics are the IC50, the concentration at which the curve crosses 50%, and the Area Above or Under the Curve (AAC or AUC). The IC50 is a measure of compound potency, while the AAC/AUC averages the signal between potency and maximum efficacy of a compound. The advantage of the IC50 is that it is in micromolar units and can be compared directly between experiments, however, it is not guaranteed to exist in all cases. The AUC/AAC metrics can always be calculated, but they are dependent on the exact concentration ranges chosen for a particular experiment. In practice, many researchers prefer the IC50 measure because of its natural interpretation as a concentration. For examples within our study, we will be using the AAC, as it has been shown to be more consistent between published datasets and does not suffer from missing or truncated values [6,23]. As is standard practice [24], we calculate the area above the curve in log concentration space, and normalize the AAC by the total possible area for each experiment determined by the range of concentration measured, leading to values lying in the bounded range between 0 and 1. Note that this is similar to a calculation of mean viability over the concentrations range as a fraction.

Coefficients considered
The general problem is to identify monotonic associations between pairs of variables drawn from a range of potential distributions, and to that end, we considered several parametric and non-parametric association statistics. The Concordance Index (CI) is defined as the fraction of pairs of observations that are ordered the same way by two variables, and it is a linear transformation of Kendall's Tau. Formally, for two variables x and y, CI is defined as: where N the number of observations of x and y. CI has a range of [0, 1], is 0 when the two variables are perfectly anti-correlated, 1 if they are perfectly correlated, and 0.5 in expectation over the space of all possible orderings, i.e., if there is no association between the variables.
Even though the Concordance Index is a useful non-parametric statistic, we hypothesize that incorporating parametric information about two observations are can increase the robustness of the Concordance Index to noise. For instance, in physical and biological measurements, measurement error often has a characteristic scale smaller than the range of possible measurements. If measurement error is localized, increasing magnitude of the difference between the value of two observations increases the confidence that the two observations are ordered correctly by the measurements.
The Robust Concordance Index (rCI) is a modification of CI that only considers pairs of points that are sufficiently dissimilar in both variables. We define two thresholds δ x ≥ 0 and δ y ≥ and define rCI as: The Kernelized Concordance Index (kCI) is a generalization of rCI, where instead of only considering valid pairs, every pair is assigned a weight according to a kernel. The rCI is a special case of kCI with a heavyside kernel, where valid pairs are weighted with 1 and invalid pairs are weighted with 0. We choose the kernel so as the difference in measurements for either x or y approach 0, the weight for the pair of observations should tend to 0, and as the difference grows large, the weight should tend to 1. A sigmoid of the form 1 1+e kx−c is one such kernel with these properties, and we outline a procedure for fitting the kernel to problem-specific empirical data in the supplemental methods. The definition of the kCI formally is: Where x = |x i − x j | and y = |y i − y j | . It is not necessary to define the kernel purely on the difference in the observations, though we chose to do so for simplicity. For purposes of comparison, in subsequent analyses, we compared CI, rCI, and kCI with Pearson and Spearman correlation, which are also commonly used as test statistics for association.
Both rCI and kCI have parameters that must be tuned for the intended data. The thresholds in rCI and the kernel shape in kCI can be interpreted as a characteristic range within which differences cannot be discriminated from statistical noise or measurement error. In the Methods, we define a method for choosing these parameters using problem-specific empirical data and replicate measurements. For our simulations below, we fit the parameters to real data from pharmacogenomic studies.

Software
To efficiently compute CI, rCI, and kCI, we developed an open-source R software package, wCI (weighted Concordance Index) that implements all three statistics and is freely available from https:// github. com/ bhklab/ wCI under the GPLv3.
The CI can be computed in O(N log N ) time because counting the number of inversions in a permutation is computationally equivalent to sorting [25]. Together with our work defining the rCI statistic, we present an algorithm for computing rCI also in O(N log N ) time, described in a reference implementation included in the aforementioned repository. While the algorithm looks quite complex in implementation, intuitively it exploits three insights: The first is that the concordance index in can be computed in O(N log N ) time by sorting both vectors by the ordering permutation for one, and then counting the number of "hops" each element in the other makes during a merge sort. The second is using "virtual" elements with values at x − δ x and x + δ x , as well as y − δ y and y + δ y , and computing the difference between the number of "real" elements hopped over during the merge sort for the +δ and −δ elements, with the difference between these two capturing the number of inverted pairs falling within the 2 * δ margin (deemed invalid pairs). The third is to repeat the algorithm twice, once starting with X sorted and once with Y sorted, while comparing the computed invalid pair counts and applying the inclusion/exclusion principle to compute the pairs deemed invalid during a single sort, or during both sorts. To keep the time complexity to O(N log N ) , we use a data structure to record the location of corresponding real and virtual items in X and Y and track their location when they are moved during the sort.
In its most general case, kCI can depend on a unique weight for each pair of elements.
As such, it needs to read from memory n 2 values, and therefore cannot be computed faster than O(N 2 ) . For certain restricted classes of kernels, faster computation may be possible, as evidenced by interpreting rCI as kCI with a Heavyside kernel, but an exploration of these special cases is outside the scope of our work.

Inflation of P-values for association testing
Assessing statistical significance requires comparing an observed value of the test statistic with a proper null distribution. The most commonly tested null hypothesis is one of no association, usually formalized as coefficient of association equaling zero for the population -this is the null hypothesis considered throughout this work. There exist analytical formulations for the distribution under the null hypothesis for all the coefficients above, except for kCI. For the Pearson correlation, the statistic t = r n−2 1−r 2 is known to asymptotically follow a t-distribution with n-2 degrees of freedom, and is most commonly used in significance testing. For the CI, a widely applied analytical distribution of the concordance index, due to its general applicability to data including censoring and ties, was introduced by Noether [26], and advocated for application to the CI by Pencina and D' Agostino [27]. The derived distribution is also only asymptotically correct for non-Gaussian data. For Spearman's rho, statistical testing is commonly done using an Edgeworth series expansion introduced by David et al. [28].
To evaluate the accuracy of these analytical formulations, we took independent samples from relevant distributions, computed the similarity according to each coefficient, and assessed the statistical significance using the analytical formulas. Under the null hypothesis, the p-values calculated from the analytical formula should be drawn from a uniform [0, 1] distribution. We summarize these findings with a Q-Q plot for both normally distributed data and beta-distributed data ( Fig. 1a and c, Additional file 1: figure S2). In the context of testing association with a larger number K of features -for instance, 10 4 genes -multiple hypothesis testing requires the significance threshold be at least as small as the reciprocal of the number of features, α < 1 K . To illustrate this, we show the fraction of samples with analytical p-value less than several values of α ( Fig. 1b  and d).
These results show that for the rCI and CI statistics, the analytical p-values are unreliable in the regime under 1000 observations, and that the problem is exacerbated for samples from non-normal distributions. This is also true to a smaller extent for the Pearson correlation test, in the Beta(1.2,4.5) case. The consequence of using the analytical p-values when the number of observations is small is an inflated type 1 error rate and false positive associations.

Characterization of the null distribution of CI
The null hypothesis for the CI is that the two variables have no association, and thus each permutation of the variables is equally likely. A permutation π on n elements has   Fig. 1 The asymptotic approximation of the CI null distribution produces an excess of small p-values. We took independent samples from a normal and beta distribution, computed their similarity using the coefficients above, and calculated asymptotic p-values using the approximations from the text. Because the samples are independent, their p-value distribution should be uniform. The Q-Q plots for normal (a) and beta (c) distributions for samples of length N = 100 sampled 200,000 times shows an excess of small p-values for CI and rCI. In the case of the normal distribution, p-values of 10 −4 occur over twenty times more often than would be expected, and for the beta distribution nearly one hundred times more often for rCI. (b, Normal) and (d, beta) summarize the frequency of p < 10 −3 for different sample sizes. As the number of samples grows large, the asymptotic approximation becomes more correct, but even in the regime of hundreds of samples, extreme p-values occur several times more often than they should under the null an inversion if there exist two elements i, j that are out of order, i.e. π(i) > π(j) but i < j . The null distribution on CI is then the distribution on the number of inversions over the space of permutations of n elements. We have devised an exact null that can correctly compute the number of inversions on permutations of sets, in the absence of ties, of up to N = 170 elements. In cases of more than 170 elements, calculating the exact distribution breaks numerical precision of our environment, as the number of permutations goes as N! and the range on the number of inversions is C (N, 2). The null distribution is also exact for pairs of variables which have a tie structure that can be represented as a multiset, i.e., observations can be partitioned into equivalence classes based on ties. To demonstrate that the exact formula correctly calculates the null for CI with no ties, we computed the number of inversions from K = 1e6 permutations of N elements and compared this with our exact null. Comparing the exact distribution of the concordance index (1 -number of inversions divided by number of pairs) to the empirically observed distribution for the simulated permutations ( Fig. 2a) we see very good consistency between simulation and our implementation of the exact calculation. We also computed the p-value according to our exact distribution for each simulation sample, and saw a very good concordance of the distribution of p-values with the uniform distribution, as expected under the null (Fig. 2b).
In general, if both variables have ties, the analytical null is unknown. This occurs with CI when both inputs have ties, with rCI in almost all circumstances, and for CI with right-censored data as in survival analysis contexts, all of which have a tie structure that cannot be represented by a multiset. Similarly, the exact analytical null for kCI is unknown. The only unbiased solution for computing statistical significance in cases where the limiting null approximation breaks down is a permutation null. Unfortunately, this is computationally costly.

Power analysis
We set out to investigate the power of the newly defined rCI coefficient in comparison to the other commonly used correlation coefficients discussed above. As the theoretical log10 Empirical Quantile log10 Theoretical Quantile Exact Analytical Null Asymptotic Null Fig. 2 The analytical null accurately computes exact p values for CI. a The analytical distribution matches a permutation null of K = 1e6 samples of length 100 from a standard normal distribution. As CI is entirely non-parametric, the choice of distribution is irrelevant. b The Q-Q plot shows the -log 10 empirical rank of the CI on the x-axis and the -log 10 theoretical quantile from the analytical null (red) and asymptotic null (blue). The analytical p-values are both monotonic and correctly approximate the uniform distribution (grey) distribution of the rCI statistic even under assumptions of normally distributed data is unknown, we analyze the power of these statistics in simulation. We sampled from a simulated bivariate standard normal distribution with a known correlation between the two variables r > 0 . For evaluating the power, we treat the population correlation as the effect size. We evaluated the statistical power of the Pearson, Spearman, CI and rCI coefficients for detecting an association at a fixed significance level α , using a permutationbased test. In these simulations, we investigated the effects of effect size, sample size, and the rCI δ parameter on the power of these respective statistics. Lacking a principled way to choose a kernel for the kCI applied to normally distributed data in simulation, we did not include the kCI statistic in our analysis.
We began by evaluating the effect of the new δ parameter on the power of the rCI statistic, seeking to answer the question of what choice of δ is optimal for this new statistic. We took 10, 000 samples of length N = 100 from bivariate normal distributions with expected correlations of 0.2, 0.3 and 0.4, representing low ( < 10% ), medium ( ∼ 40% ) and high ( > 80% ) power situations for the Pearson correlation with this sample size. Examining the medium power case ( r = 0.3 ), we see that the Spearman and CI statistics are similarly powered under this permutation test, whereas Pearson, as expected in this bivariate normal case, is significantly better powered than the most widely used non-parametric correlation coefficients (Fig. 3a). The rCI shows a trend to increasing power with increasing δ until approximately δ = 1.2 , after which power decreases and at around δ = 1.5 the power of the rCI becomes worse than the CI or Spearman statistics. Note that in this case, the marginal standard deviation of the data is 1, so the δ parameter can be directly interpreted as a multiple of the standard deviation of the population. To compare the behaviour of the rCI statistic over the three different effect sizes investigated, we normalized the power at each δ to a percentage of the max power observed in simulation across all choices of δ (Fig. 3b). While the exact location of the maximum varies slightly with the effect size, the power is fairly stable in a range around δ = 0.75 and δ = 1.25 for all three investigated regimes, and close to the maximum achieved. Motivated by this observation, we chose δ = 1 for our subsequent simulations.
We then explored a range of expected population Pearson correlations between 0 and 0.5, looking at 1000 samples of N = 100 long vectors for each effect size and computing the empirical power observed in simulation (Fig. 3c). Unsurprisingly, the Pearson statistic is most powerful across the range, followed by rCI (at δ = 1 ), and then Spearman and CI practically indistinguishable. These simulations reinforce the benefit of the rCI modification over existing fully non-parametric statistics across the full range from weakly powered to well powered situations.
Finally, we wished to investigate how the sample size on which the correlation coefficients are calculate affects their power in a permutation test. For this, we decided to calculate a level-set of constant power as decreasing effect size is traded off for increasing sample size, fixing the expected Pearson power to be at 50%, and investigating sample sizes from N = 50 to N = 300 , in steps of 50. For this simulation, we did 10, 000 samples from the distribution to calculate the power. We then normalized the power observed at each sample size as percentage of observed Pearson power to adjust for slight errors in calculation of the level set (Fig. 3d). We see a clear trend among rCI and the non-parametric statistics of increased power compared to the Pearson correlation as sample size increases, however, while rCI achieves > 90% of the Pearson power already at N = 50 and > 95% by N = 300 , the non-parametric statistics remain at less than 90% power even at N = 300.

Power simulation in non-normal data
Our ultimate motivation for investigating the performance of existing correlation coefficients and proposing modifications of the concordance index was their application to noisy biological data. Prior to examining real data, we sought to evaluate the power of these permutation tests in cases where the marginal distributions of the observed variables is non-normal. Motivated by our intended application of drug screening data, we chose to investigate data drawn from bivariate beta distributions. In particular, we simulated data drawn from distributions with both marginals fixed as Beta(1.2, 4.5) distributions (Additional file 1: figure S3a), and varying Pearson correlations between the two sampled variables. Further details on the simulation are described in the methods section.
Unlike the normally distributed data, we did not optimize the selection of delta parameters for rCI through power simulations. Rather, we derived these parameters from real pharmacogenomics data, by choosing a threshold on differences between AAC  measurements that maximizes our classification performance of replicate vs non-replicate measurements (see Methods). The optimal threshold was very close to 0.1, so for the simulations, a delta of 0.1 was used. We also included the kCI statistic in these simulations, deriving kernel parameters from the same data (see Methods).
We first examined the empirical power of these statistics at a modest sample size of 100 samples, varying the correlation between the two variables. We again explored a range of expected Pearson correlations between 0 and 0.5, looking at 1000 samples of N = 100 long vectors at each effect size. While the Pearson correlation was more powerful at weaker effect sizes, at larger effect sizes (where all statistics achieve > 50% power in simulation), the CI and kCI statistic become slightly more powerful in detecting significant effects at this sample size (Additional file 1: Figure S3b). The rCI statistic performs similarly to Spearman over the range of effect sizes investigated, both falling short of the Pearson correlation until all statistics converge to 100% power.
We then investigated how sample size affects the power of these statistics. Similar to our previous simulations with normally distributed data, we used a theoretical calculation of a level set of 50% power for the Pearson correlation test. With the data distribution deviating significantly from normal, this was even more approximate than before, so we once again normalized everything relative to the power we observed for the Pearson correlation in simulation. We again took 10,000 samples each for vectors of length 100-300, in step sizes of 50. We observe that as sample size increases larger than 100, the power of the CI statistics in a permutation test continues to increase over the Pearson correlation (Additional file 1: Figure S3c). The kCI statistic behaves similarly to the Pearson correlation in these simulations, while the rCI and Spearman, both gaining power relative to the Pearson correlation, but do not equalize in power even in our largest sample size tested, 300. We repeated the same simulations, but adjusting effect size to hold a theoretical power of 0.25 (Additional file 1: Figure S3d). Consistent with our previous results, at a sample size of 100 the Pearson correlation was the most powerful statistic, but as the sample size grew the CI overtook the Pearson correlation in power, and it was the most powerful statistic at sample sizes of 300. The rest of the statistics increased in power similarly as before, but starting off relatively less powerful compared to Pearson.

Power simulation in non-normal data with additive noise
Finally, we repeated the simulations at fixed sample size and varying expected correlations with Beta distributed data above, with the addition of noise sampled from a Laplace distribution, mimicking noise seen between replicate experiments in AAC measurements, as detailed in the methods section. The additive noise was significantly smaller in magnitude than the range of the data, and as the rCI δ and kCI kernels were picked using the replicate measurements, we expected these statistics to be advantaged in terms of power to detect effects "corrupted" by noise.
All statistics were less powerful after the addition of noise (Additional file 1: Figure S4b). This is unsurprising, as from the point of view of a sample statistic, the data generation process is opaque: a particular sample generated with added noise is indistinguishable from data generated with a weaker between-variable correlation. In these simulations, the rCI and kCI exhibit a similar drop in power to non-parametric statistics. Interestingly, the Pearson correlation was actually less affected by the addition of noise than all the other statistics, leading to the Pearson correlation outperforming all other statistics in observed power throughout the range of effect sizes simulated. It is possible that these results are observed because the addition of additive noise increases slightly the variance of the data, and may decrease the skewness of the distribution, reducing the deviation from normally distributed data.

Querying pharmacogenomic datasets
With some theoretical properties explored, we sought to evaluate the practical performance of these coefficients on real pharmacogenomic data. Large-scale pharmacogenomic studies have treated panels of cancer cell lines with drugs at a range of doses. The sensitivity of a cell line to a drug can be summarized with the area above the dose response curve (AAC), and assaying a panel of cell lines gives a vector of drug sensitivities for a particular drug. The studies considered are described in the Methods. For each drug measured in two studies across a common set of cell lines, we computed the similarity coefficients between that drug's response vector in one study against all other drugs' response vectors in the other study. We then found the rank of the matched drug (Fig. 4a). This has been shown to be a challenging problem in that drug sensitivities do not necessarily replicate well across pharmacogenomic studies [6,23]. The statistic with best resilience to the noise and artifacts in the pharmacogenomic studies will tend to rank matched drugs better. Because some of the drugs have been tested on very few common cell lines across datasets, we also considered only those comparisons made with more than 50 cell lines (Fig. 4b). In addition to examining the distribution of the matched-drug ranks, we summarized the results by calculating the area under the empirical Cumulative Distribution Function (CDF) ( Table 1).
The five metrics had similar performance, though Pearson had the consistently best recall. The kCI showed improved recall across datasets compared to CI and Spearman correlation, suggesting that incorporating information characterizing the noise in the datasets can improve identification of association. While this benchmark is informative, it is ultimately inconclusive about the practical capabilities of the different coefficients to identify biomarkers of drug sensitivity.

Discussion
Evaluation and significance testing of coefficients of association is a ubiquitous step in modern biological data analyses. However, the behaviour of common correlation coefficients applied to noisy data originating from distributions seen in modern biological data is not well characterized. Our study addressed this question for high-throughput cell line viability screening. We proposed new measures of association for noisy data, investigated the correctness of significance testing for common correlation coefficients, introduced efficient and accurate implementations for calculating significance for the concordance index, and investigated the power of permutation testing coefficients of association in distributions mimicking our area of application. Finally, we demonstrated that incorporating information about the measurement noise into calculating associations between vectors of observations, through our proposed kCI approach, demonstrates improved performance over the standard concordance index in a cross-dataset replicability task. Researchers in biology will often compute and test thousands of correlations, requiring multiple testing correction to keep the expected number of false positives under control, often done through a False Discovery Rate or Family Wise Error Rate correction procedure. Hypothesis testing in these situations require methods that can compute p-values accurately in the range of 1e-4 -1e-10, depending on the number of hypotheses investigated, meaning that correct tail end behaviour dominates the accuracy of inference. Pencina and D' Agostino [27] investigated the use of Noether's method to calculate 95% confidence intervals in the presence of complicated tie structures due to right censoring of one vector, and showed the method to have moderately precise coverage. Unfortunately, our data suggests that the extension of this approach to calculating p values for hypothesis testing does not control false positives at lower alphas than 0.05, at moderate to large sample sizes. This is the case not only for the rCI statistic which introduces tied pairs in the data, but also for the concordance index applied to data with no ties. While this may be unsurprising given the asymptotic arguments for the derivation of Noether's method, what caught us off guard in our results was the slow rate at which the control of false positives converged, even for the case without ties and sampling from a bivariate normal distribution. As much as 500 samples were not enough to restore proper false positive control at an alpha of 0.001, and given the behaviour observed in the Q-Q plots for smaller sample sizes, we expect the sample size required for more stringent alpha's will be substantially larger. Permutation tests are a versatile tool for assessing significance without making strong parametric assumptions about the generating distribution for the data. For permutation testing to be theoretically justified in making inferences about a population parameter, the data has to be independently and identically sampled, and each observation has to satisfy exchangeability between the two variables measured [29]. These assumptions are all shared by both exact tests for non-parametric correlations, and asymptotic analytical tests for the Pearson and CI correlations. Therefore, from a theoretical basis, the permutation test can only increase the range of valid situations for its application. Two concerns often raised by practitioners regarding applying permutation testing is a perceived loss of power, and the computational cost of simulating permutation null distributions. It is important to note that an inflation of false positives will lead in practice to lower p values and the illusion of more power originating from analytical tests. Previous studies suggest however, that the power of the permutation test for the Pearson correlation is very nearly identical to the Pearson t test in all cases where the t test retains good type 1 error control [16,29], with the t test providing significantly more power only in cases where data is nearly normally distributed, and at low ( ∼ 5 ) sample sizes [16]. Concerns about the computational cost of permutation testing are valid, especially in the context of multiple testing correction. However, approaches such as QUICKSTOP [30], for cases where the required alpha after correction is known, as well as other approaches focused on ensuring FDR control [31], can reduce by orders of magnitude the number of permutations needed to perform inference. Combined with modern computer hardware, permutation testing, even at "high-throughput data scale", becomes feasible. Therefore, based on our results assessing Type I error rates for asymptotic approximations of the CI distribution, and prior work mentioned above, we advocate for permutation testing to be adopted as the default testing strategy for tests of association, and require justification for reverting to analytical formulas for calculating p values for correlation coefficients.
Once it has been established that permutation testing is necessary to obtain correct p value estimates for sample sizes of interest, we compared the power of the different statistics under the permutation test. Pearson's correlation coefficient remained the most powerful statistic to detect linear associations between normally distributed variables when using a permutation test, but with only a marginal advantage. The power difference between the Pearson, CI, rCI and Spearman are small, and probably of little significance in practice at our investigated sample sizes and α . The newly defined rCI statistic exhibited most power with the delta parameter in the range of approximately 1 standard deviation of the data, and the value of the optimal delta was fairly stable over different effect sizes. While less powerful than the Pearson correlation in this simulation, the semi-parametric rCI was more powerful than non-parametric statistics, suggesting that in cases where the Pearson is not a suitable statistic (for example, looking for non-linear relationships), the rCI could be the tool of choice.
Beyond the Gaussian case, none of Pearson, Spearman or CI are consistently most powerful over the range of effect sizes investigated in bivariate beta distributed data at a moderate sample size of 100. The performance of our newly defined statistics, rCI and kCI, was similar to the others. Interestingly, we noticed that the Concordance Index seemed to outperform the Pearson correlation in the well powered ( > 50% ) regime. Strikingly, holding the power of the Pearson test at 50% while increasing sample size, the relative power gain of the Concordance Index increases. This suggests that for standard regimes of biological data analysis, where data is moderately large (100s of samples), and non-normally distributed, a permutation test based on the Concordance Index may be better powered to detect significant effects than the widely used Pearson correlation. We must caution however that precise rankings of these correlations in distributions very different from the skewed and bounded data we investigated cannot be extrapolated from our results. As a demonstration of this fact, when noise was added to the simulated data, Pearson was found to once again be more powerful than the Concordance Index.
There are, of course, limitations to this work. First, we only investigated the task of detecting linear associations from data. Undeniably, non-linear effects play a major role in biology, however linear analysis is often the first step in understanding any biological system. A major focus of our work has been on the concordance index (and by extension the Kendall Tau) measure of correlation. We have however omitted any investigation of the CI (and our derivatives) to censored data, where it is most commonly applied. Finally, due to computational constraints, we could not exhaustively search the space of distributions, sample sizes, hyper-parameters, and effect strengths.

Conclusion
The experiments carried out in our work can be summarized into three categories: investigating the Type I error control of common tests applied to correlation statistics in simulation; investigating the power of permutation tests based on common and novel (rCI, kCI) correlation statistics in simulation; and benchmarking the performance of novel and existing correlation statistics on real high throughput drug screening data (Additional file 1: Figure S5).
From the first set of simulations, we conclude that applying asymptotically derived tests to the regime of moderate (N≤500) sample sizes can lead to a significant inflation in false positives when looking at α small enough to correct for multiple testing. While we present an approach to calculate exact p-values for the Concordance Index in a subset of cases (dependent on the presence and structure of ties in the data), we advocate for permutation testing as a default approach for assessing significance for correlation tests.
We then introduce two modifications to the Concordance Index, the rCI and kCI statistics, motivated by increasing robustness of the Concordance Index to noisy data. In simulations, we demonstrate that CI and kCI are best powered of the tested correlations when the data deviates significantly from normality and sample sizes are large enough. However, the permutation test based on Pearson correlation showed better robustness to noise than all other statistics. Finally, we benchmarked the new correlation coefficients against Pearson, Spearman and CI in a cross-study replicate similarity task. In this benchmark, Pearson correlation and the kCI demonstrated the best performance.
Ultimately, we conclude that researchers working with high-throughput drug screening data will be best served by using the Pearson correlation as a measure of correlation, with significance estimates derived from permutation tests. This result is surprising, giving that the distribution of both the data and the noise in the data deviate significantly from normality. Nevertheless, both existing non-parametric correlations and novel similarity metrics designed specifically with the task in mind failed to consistently outperform the humble Pearson correlation.

Cell line screening datasets considered
As mentioned previously, our investigations were motivated by the task of finding correlations between high throughput compound screens measuring viability in cancer cell lines. For this study, we considered data originating from six different screening datasets: the Genomics of Drug Sensitivity in Cancer (versions 1 and 2) [32][33][34], the Cancer Therapeutics Response Portal v2 [35][36][37], the Genentech Cell Screening Initiative [6], the Finnish Institute of Medicine cell line screening dataset [38] and the Oregon Health and Science University Breast Cancer Screen [39]. All the data used in this study was previously published, except for the data from the Genentech Cell Screening Initiative, where an updated version of the original dataset was used for our analyses.
All datasets were downloaded using the PharmacoGx R package [40], from orcestra.ca [41]. For the analyses in this manuscript, all dose-response curves were summarized as the Area Above the Curve, using a 3 parameter Hill Curve model fit using the Pharma-coGx package, as previously described [42].

Characterization of the null distributions
There are four situations to consider: 1. Concordance Index with no ties -the two vectors are composed of unique elements, i.e. ∃i, j such that x i = x j or y i = y j 2. Concordance Index with ties -∃i, j such that x i = x j or y i = y j . This problem breaks down further into two cases: (A) the tie structure is transitive and can be represented by one multiset -if i ∼ = j and j ∼ = k , then i ∼ = k , or (B) the tie structure cannot be represented by a multiset, so ∃i, j, k|i ∼ = j, j ∼ = k, i ≇ k , where ∼ = denotes a tie in at least one of the vectors. 3. Robust Concordance Index -we define two thresholds t x and t y such that if |x i − xj| < t x or |y i − y j | < t y , the pair is considered invalid and ignored in the CI calculation. The denominator in the count of inversions is the number of valid pairs, where both x and y exceed their respective thresholds. As with non-transitive tied CI, rCI's tie structure does not necessarily form an equivalence relation and is not transitive. 4. Kernelized Concordance Index assigns a weight w to all pairs that is a function of one or both of the differences between the values. Larger differences in x and y can be considered a pair with greater confidence. The choice of kernel is typically monotone increasing with delta, for instance a sigmoid. rCI can be thought of as a special case of kCI where the kernel is a Heavyside step function.
The exact analytical nulls for 1 and 2A are known and presented here. For cases 2B, 3, and 4, the exact null is unknown, so statistical significance must be assessed with permutation testing. Smirnov et al. BMC Bioinformatics (2022) 23:188 Case 1 In the cases that there are no ties, calculating the concordance index can be reduced to counting inversions in a permutation. Given two lists of numbers without ties, x and y, to calculate CI(x, y), it is sufficient to find the ordering permutations σ x and σ y for x and y respectively, and then count the number of inversions in σ x · σ −1 y . Computationally, this is equivalent to ordering both lists such that x is sorted, and then counting inversions in y.
The simplest way to generate the probability distribution on the number of inversions from a set is to use a generating polynomial or an ordinary generating function. The distribution on the permutations which result in a particular inversion number is represented as coefficients in a polynomial on x; the number (or probability, if normalized) of permutations yielding k inversions is the coefficient of x k .
In the absence of ties, from [43], the following result holds: let I n (k) denote the number of permutations of S with k inversions. Then: The right hand side of (4) admits exact recursive and iterative methods for adding elements by multiplying polynomials. This is equivalent to convolutions on discrete probability distributions. We note that this formula is widely used in practice, for example, the R Statistical language [44] implements this exact formula for calculating the significance of Kendall's Tau correlation for small sample sizes (by default 50 at the time of writing) in the absence of ties, testing against the null hypothesis of no association.

Case 2a
The best way to think about permuting elements with ties between them is as follows: Each equivalence class formed by the ties in the original vectors x and y is assigned to a specific element e j . Let the set of distinct elements to be permuted be denoted E = {e 1 , ..., e n } and let a j ∈ Z >0 denote the multiplicity of element e j , i.e. how many elements are in this equivalence class. Thus there are α := n j=1 α j elements in total. We denote by M = {e α 1 1 , ..., e α n n } the multiset containing all elements (with ties). It turns out that an analogous result to (4) can be obtained for multisets. The original reference is from 1915-see [45]-but it's easier to read in modern notation, such as presented in [46]. Let inv (σ ) denote the number of inversions of a permutation of the multiset (set with ties) M. The distribution of inv can be shown to be with the q-factorial being defined by (The expression on the right-hand side of (5) is also called the q-multinomial coefficient.) Observe that, by splitting the sum over S M according to the number of inversions, where I M (k) denotes the number of permutations of the multiset M with k inversions. Thus, (5) is the exact multiset analogue of (4).
The right hand side of (4), gives a recursive formula for expressing I n (k) in terms of the I n−1 (j) : in terms of the generating function this reads The proof if this formula (presented in [43]) proceeds by looking at permutations of the first n − 1 elements and then inserting the last element at all possible position. By keeping track of how many extra inversions this insertion introduces, we arrive at (8).
This argument extends rather well to the case with ties: let M be the multiset as described in the introduction and denote by M − the set obtained from M by removing one occurrence of e n . That is, if M = e α 1 1 , .., e α n n , then and in particular if |M| = n then |M − | = n − 1 . We can give the following analogue of (8): with m! x defined in (6). We note that this case, where the Concordance Index can be reduced to counting inversions on multisets is of practical interest. Specifically, it will arise whenever only one of x or y contains exact ties. However, it is limited to exact ties, as representing the vectors as a multiset requires transitivity of ties, and therefore cannot be applied to censored data.

Notes on efficient implementation:
Calculating the distribution of inversions over random permutations requires polynomial multiplication, or equivalently, convolution of discrete distributions. This can be accomplished in O(n log(n)) time using the Fast Fourier Transform; once transformed, polynomial multiplication is elementwise instead of convolutional. Numerical precision limitations introduce very slightly complex coefficients after the inverse transform introduces, which can be dealt with by rounding to the nearest integer. The naive application of FFT for polynomial multiplication would be to take the two factors, transform them, multiply, and inverse transform them. We implemented a method by which the sequence of polynomials to multiply is generated, transformed and multiplied simultaneously, and inverse transformed once to further optimize the process.
Computing the distribution on inversions for Case 2a requires dividing polynomials. It turns out that in all cases encountered when applying the recursive formula (10) -with the D n ratios of n+m k=n z k to m+1 k=1 z k , for each polynomial in the denominator, there always exists a polynomial in the numerator which is a polynomial multiple of the denominator such that the coefficients are integral and the remainder is 0. In our experience, doing any polynomial division -including in FFT space -was prohibitively inaccurate, so we implemented a method to simplify the D n s to just the product of the quotients, eliminating the division entirely.
Efficient implementations of these distributions in R are shared with the wCI R package.

Cases 2b -4
We note that the exact distribution under the assumption of no association between the two vectors x and y for Cases 2b -4 are not known. For Cases 2b, as well as 3, the multiset representation breaks down. It is likely that a completely different approach will be needed to efficiently (in polynomial time) compute these general cases. Note that the CI applied to data with censoring falls into this regime.
The continuous nature of the kCI kernel likewise suggests that the null distribution can not be computed combinatorically. If solved however, the distribution for rCI will likely be a limiting case as the kernel approaches the discontinuous Heavyside function.

Power analysis
All power analysis was conducted through simulations in the R programming language. A link to the code is provided at the end of the manuscript.
Each power analysis simulation followed the same steps: 1. Two vectors of data of chosen length N were sampled from a bivariate distribution with a known expected Pearson correlation R as the effect size measure. 2. A permutation test was conducted using each of the Pearson, Spearman, CI, rCI and kCI (for Beta distributed data) as a test statistic, using the adaptive permutation testing algorithm described below. Tests resulting in P values under the chosen α of 0.001 were considered to have rejected the Null hypothesis, and therefore successfully detected the dependence between the two vectors. 3. Steps 1-2 were repeated 1000 or 10,000 times (as described in the results section), recording for each iteration and each statistic the whether the Null was rejected. The power at each effect size was calculated as the percentage of cases in which the Null was rejected by each statistic. 4. Steps 1-3 were repeated for effect sizes ranging from R = 0 , to R = 0.5 , in 0.1 step increments.
Bivariate Normal Data was generated using the function mvrnorm from the MASS R package.

Simulating data distributed as a bivariate beta
We construct a bivariate Beta distribution with a fixed expected Pearson correlation from the family of distributions described in [47]. Briefly, the distributions are constructed by sampling (U 1 , U 2 , U 3 ) from a trivariate Dirichlet distribution, and transforming the variables as follows: This construction is a 4 parameter family, with 3 parameters fixed by the shapes of the marginal distributions of X and Y. The fourth parameter is then optimized to yield the desired correlation between X and Y, using the Brent method as implemented by the R optim function.

Adaptive permutation testing
Adaptive stopping during permutation testing was used for all power analyses. Briefly, adaptive stopping allows the permutation test to complete once there are enough random resamples to determine whether a particular sample falls above or below a predefined significance threshold α . For these analyses, the QUICK-STOP algorithm presented by [30] was implemented in the R programming language. The parameters used for our simulation were as follows: the indifference parameter d was set to 0.001 * α , the Type I and II error probability was set to e −10 , and a stopping criteria was implemented to halt after 100 * 1/α permutations.

Choice of rCI thresholds and kCI Kernel
The Robust and Kernelized Concordance indices have parameters that can be optimized for the particular distribution of interest: the rCI threshold and the kCI kernel. Because both coefficients are premised on downweighting pairs of points with differences that cannot be distinguished from statistical noise, replicate measurements are needed to characterize the noise distribution from the data. Conceptually, the goal is to identify pairs of points with true measurements that are different given the random variable of the delta between the measurements. The approach is similar to Storey's FDR [48]. Let x be a quantity for a set of N points, and x be the measurement of x. For a pair of points i and j, consider ≡ |x i − x j | . The null hypothesis h 0 is that the quantity is the same for the pair, i.e. x i = x j . The alternate hypothesis is that the two values are different: x i = x j . Let S be a set of empirical deltas where the true difference is unknown. For repeated measurements of the same point, the null hypothesis is necessarily true, which gives the distribution of δ under they null hypothesis. Let S 0 be a set of empirical deltas from replicates. The problem is then to deconvolve S into a mixture of null and alternate hypotheses.
The main result uses the empirical CDF (eCDF) on S and S 0 . From Bayes' Theorem, we have the probability of a pair of points having equal values given a measured delta exceeding some value t as: Here, the numerator is the eCDF of for replicates multiplied by the estimate of the fraction of observed points for which the null hypothesis is true, and the denominator is the eCDF of for the population of measurements. The inequality emerges because (11) P(h 0 |� > t) = P(� > t|h 0 )P(h 0 ) P(� > t) ≤ P(� > t|h 0 ) P(� > t) P(h 0 ) ≤ 1 . Similarly, the probability of the alternate hypothesis can be computed from above by the law of total probability, that P(h 0 ) + P(h 1 ) = 1.
We consider deltas exceeding the threshold t as positive cases, and those with deltas less than t as negative cases. From this, a confusion matrix can be generated as a function of the threshold:

True Value
Positive Negative rCI decision boundary Positive P(h 1 |� > t) P(h 0 |� > t) Negative The threshold for rCI is given by choosing t = τ that maximizes the Matthew's Correlation Coefficient (MCC) from the confusion matrix above. While a contingency table cannot be perfectly summarized by any one value, MCC has been shown to be a balanced measure of classification even when the class sizes are different. For the kCI kernel, it is desirable to have a monotonic kernel that is 0 at = 0 and without loss of generality tends to 1 as gets large. The P(h 1 |� > t) fits these criteria well, with the additional stipulation at 0. For portability, we fit a sigmoid function to P(h 1 |� > t) , and this is the kernel: the weight assigned to a pair of points is the estimate of the probability that they have different true values.
The thresholds or kernels for both rCI and kCI depend only on and not on the specific values of the measurements. This implicitly assumes homoscedasticity of over the range of the measurements. While it might be useful to explore this further, this choice was motivated by practicality, as there often is insufficient data to characterize the heteroscedasticity of .

Simulation of artificial noise
To simulate noise representative of real pharmacological screening data, we first examined the distribution of pairwise differences in AAC values for experimental replicates done by the same study across all of the datasets described above. Upon observing this data, it was clear that it was not normally distributed. Using Maximum Likelihood Estimation, we fit a Gaussian [ a exp ( −(x−b) 2 2c 2 ) ] and Laplace distribution [ 1 2b exp (− |x−µ| b ) ] to the data, and found the Laplace fit to be qualitatively good, and adequate for use in a simulation context (Additional file 1: Figure S4a).
For power simulations sampling from the Beta distribution with artificially added noise, two vectors of correlated data with a chosen expected correlation were sampled as above. A vector of the same length was then sampled from the MLE Laplace noise distribution, and added to one of the two vectors of simulated data. Finally, any values under 0 or over 1 after the addition of noise were truncated to keep the data bounded in the correct range.

Querying pharmacogenomic datasets
To gauge the performance of the coefficients on practical pharmacogenomic data, we considered the four pharmacogenomic datasets mentioned above obtained from PharmacoGx. We compared two datasets at a time, and compared the matrix of drug