Systematic error detection in experimental highthroughput screening
 Plamen Dragiev^{1},
 Robert Nadon^{2, 3} and
 Vladimir Makarenkov^{1}Email author
https://doi.org/10.1186/147121051225
© Dragiev et al; licensee BioMed Central Ltd. 2011
Received: 30 June 2010
Accepted: 19 January 2011
Published: 19 January 2011
Abstract
Background
Highthroughput screening (HTS) is a key part of the drug discovery process during which thousands of chemical compounds are screened and their activity levels measured in order to identify potential drug candidates (i.e., hits). Many technical, procedural or environmental factors can cause systematic measurement error or inequalities in the conditions in which the measurements are taken. Such systematic error has the potential to critically affect the hit selection process. Several error correction methods and software have been developed to address this issue in the context of experimental HTS [1–7]. Despite their power to reduce the impact of systematic error when applied to error perturbed datasets, those methods also have one disadvantage  they introduce a bias when applied to data not containing any systematic error [6]. Hence, we need first to assess the presence of systematic error in a given HTS assay and then carry out systematic error correction method if and only if the presence of systematic error has been confirmed by statistical tests.
Results
We tested three statistical procedures to assess the presence of systematic error in experimental HTS data, including the χ^{2} goodnessoffit test, Student's ttest and KolmogorovSmirnov test [8] preceded by the Discrete Fourier Transform (DFT) method [9]. We applied these procedures to raw HTS measurements, first, and to estimated hit distribution surfaces, second. The three competing tests were applied to analyse simulated datasets containing different types of systematic error, and to a real HTS dataset. Their accuracy was compared under various error conditions.
Conclusions
A successful assessment of the presence of systematic error in experimental HTS assays is possible when the appropriate statistical methodology is used. Namely, the ttest should be carried out by researchers to determine whether systematic error is present in their HTS data prior to applying any error correction method. This important step can significantly improve the quality of selected hits.
Background
Highthroughput screening (HTS) is a modern technology used by drug researchers to identify pharmacologically active compounds [10]. HTS is a highly automated earlystage mass screening process. Contemporary HTS equipment allows for testing more than 100,000 compounds a day. HTS serves as a starting point for rapid identification of primary hits that are then further screened and evaluated to determine their activity, specificity, and physiological and toxicological properties [2]. As a highly sensitive test system, HTS requires both precise measurement tools and dependable quality control. The absence of standardized data validation and quality assurance procedures is recognised as one of the major hurdles in modern experimental HTS [11–13]. Acknowledging the importance of automatic quality assessment and data correction systems, many researchers have offered methods for eliminating experimental systematic artefacts which, if left uncorrected, can obscure important biological or chemical properties of screened compounds (false negatives) and can seemingly indicate biological activity when there is none (false positives) [1–7, 10–16].
Systematic error may be caused by various factors, including robotic failures and reader effects, pipette malfunction or other liquid handling anomalies, unintended differences in compound concentrations due to agent evaporation or variation in the incubation time and temperature differences, and lighting or air flow present over the course of the entire screen [2, 6]. Unlike random error that produces measurement noise and usually has minimal impact on the whole process, systematic error produces measurements that are systematically over or underestimated. Systematic error may be time dependent, introducing biases in individual plates or subsets of consecutive plates, but it may also affect an entire HTS assay (i.e., all screened plates). In practice, systematic error is almost always location related. The under or overestimation affects compounds located in the same row or column or in the same well location across the screened plates. The row and column effects may be persistent across the assay affecting repeatedly the same rows and columns on different plates or may vary from plate to plate, perturbing some rows and columns within a particular plate only [6]. Plate controls are used in HTS to ensure the accuracy of the activity measurements being taken. Controls are substances with stable wellknown activity levels. They might be positive (i.e., a strong activity effect is observed) or negative (i.e., no any activity effect is observed). Controls help to detect platetoplate variability and determine the level of background noise.
The following normalisation and preprocessing methods have been widely used in experimental HTS to remove platetoplate variation and make plate measurements comparable across plates [6, 13]:

Percent of control  the following formula is used:

Control normalization (known also as normalized percent inhibition transformation) is based on the following formula:

Zscore normalization is carried out as follows:

Bscore (i.e., Best score normalization [3]) is carried out as follows:
First, a twoway median polish procedure [17] is performed to account for row and column effects of the plate. The resulting residuals within each plate are then divided by their median absolute deviation, MAD. It is worth noting that there is an additional smoothing step that could be applied across plates (see the original article [3] for a description of the smoothing). This optional smoothing step was not applied however in [[5, 6] and [17]].
The residual is defined as the difference between the observed result (x_{ ijp } ) and the fitted value ${\widehat{x}}_{ijp}$, defined as the estimated average of the plate (${\widehat{\mu}}_{p}$) + estimated systematic measurement offset $\left({\widehat{R}}_{ip}\right)$ for row i of plate p + estimated systematic measurement column offset $\left({\widehat{C}}_{jp}\right)$ for column j of plate p. For each plate p, the adjusted median absolute deviation (MAD_{ p } ) is then obtained from the r_{ ijp } 's.
Median absolute deviation (MAD) a robust estimate of spread of the r_{ ijp } 's values is computed as follows: $median\left\{\phantom{\rule{0.5em}{0ex}}\left\phantom{\rule{0.5em}{0ex}}{r}_{ijp}median\left({r}_{ijp}\right)\phantom{\rule{0.5em}{0ex}}\right\phantom{\rule{0.5em}{0ex}}\right\}$.
The Bscore normalization was introduced by a team of Merck Frosst researchers [3] as a systematic error correction method.
 1.
Leastsquares approximation of the data carried out separately for each well location of the assay;
 2.
Zscore normalization of the data within each well location of the assay (i.e., the Zscore normalization is performed across all the plates of the assay).
In the HTS workflow, the normalization/data correction phase is usually followed by the hit selection process. During this process the most active compounds are identified as hits and selected for additional screens. A predefined threshold is usually established to select hits [13]. Depending on the specifics of the research study, one may be looking for compounds whose activity level is greater than the defined threshold (i.e., activation assay) or interest may lie in the compounds whose measurements are below the defined threshold (i.e., inhibition assay). In this study, we always assume the latter case where the hits are the compounds with the smallest measurement values. The threshold for defining hits is usually expressed using the mean value and standard deviation of the considered measurements. The most widely used threshold is μ 3σ, where μ is the mean value and σ is the standard deviation of the considered measurements. Hits can be selected globally, over the whole assay, when the mean and standard deviation of all assay compounds are calculated, or on a platebyplate basis, when the mean and standard deviation of the compounds of each single plate are considered [6, 13].
Methods
Data description
In this study we consider an experimental assay provided by the HTS laboratory of McMaster University. This assay was called Test assay and used as a benchmark in McMaster Data Mining and Docking Competition [18]. McMaster Test assay consists of 50,000 different chemical compounds whose potential to inhibit the E. coli DHFR was tested. Each of the 50,000 considered compounds was screened in duplicate; two copies of each of the 625 plates were run through the HTS equipment; 1250 plates in total, with wells arranged in 8 rows and 12 columns, were screened; columns 1 and 12 of each plate were used for positive and negative controls and were, therefore, not considered in our study. Thus, every plate comprised 80 different compounds. The exact experimental conditions of Test assay are reported in [18]. The competition organizers defined as primary hits the compounds that reduced the DHFR of E. coli to 75% of the average residual activity of the high controls. Two lists of hits were published (for more details, the reader is referred to: http://www.info2.uqam.ca/~makarenv/experimental_actives.pdf). The first list, called a consensus hits list, contained all compounds that were classified as hits in both of their replicate measurements (i.e., both measurement values were lower than or equal to 75% of the reference controls). Only 42 of all the 50,000 tested compounds were declared consensus hits. The second list, called an average hits list, contained 96 compounds classified as hits when the average value of the two HTS measurements was lower than or equal to 75% of the reference controls. Obviously, all consensus hits were also average hits. A secondary screening of the 96 average hits was also performed in order to determine their activity in different concentrations. As result of the secondary screening, 12 of the average hits were identified as DR hits (i.e., hits having wellbehaved doseresponse curves).
Generating systematic error
We simulated data in order to evaluate the performances of the systematic error detection tests. First, we generated errorfree datasets consisting of random normally distributed data. The basic data format adopted here was that of the McMaster dataset  1250 plates, each containing 96 wells arranged in 8 rows and 12 columns. In addition, we also generated two other basic datasets which were 4 and 16 times bigger. They also included 1250 plates, each of them comprising 384 (16 × 24) and 1536 (32 × 48) wells, respectively. It is worth noting that 96, 384 and 1536well plates are the most typical plate formats used in the modern HTS.
An assay was defined as an ordered set of plates PL_{ p } , where p (1 ≤ p ≤ 1250) is the plate number. Each plate, PL_{ p } , can be viewed as a matrix of experimental HTS measurements x_{ ijp } , where i (1 ≤ i ≤ N_{ R } ) is the row number, j (1 ≤ j ≤ N_{ C } ) is the column number, and N_{ R } and N_{ C } are, respectively, the number of rows and columns in PL_{ P } . The generated values X_{ ijp } 's followed the standard normal distribution ~N(0, 1).
Then, the hits were added to the datasets. Several hit percentages, h, were tested in our simulations: h = 0.5, 1, 2, 3, 4 and 5%. The locations and values of hits were chosen randomly. The probability of each well in each plate to contain a hit was h%. The values of hits followed a normal distribution with the parameters ~N(μ  5SD, SD), where μ and SD are the mean value and standard deviation of the errorfree dataset.
Five types of HTS datasets containing different kinds of systematic and/or random error generated and tested in this study
Error type  Generation of erroraffected measurements 

A. Datasets with both column and row systematic errors which are constant across all assay plates.  ${{x}^{\prime}}_{ijp}={x}_{ijp}+{r}_{i}+{c}_{j}+Ran{d}_{ijp}$, 1 ≤ i ≤ 8, 1 ≤ j ≤ 12, 1 ≤ p ≤ 1250. 
B. Datasets with the column systematic error only which is constant across all plates.  ${{x}^{\prime}}_{ijp}={x}_{ijp}+{c}_{j}+Ran{d}_{ijp}$, 1 ≤ i ≤ 8, 1 ≤ j ≤ 12, 1 ≤ p ≤ 1250. 
C. Datasets with the well systematic error which is constant across all plates.  ${{x}^{\prime}}_{ijp}={x}_{ijp}+{w}_{ij}+Ran{d}_{ijp}$, 1 ≤ i ≤ 8, 1 ≤ j ≤ 12, 1 ≤ p ≤ 1250. 
D. Datasets with the variable column and row systematic error which are different for each plate.  ${{x}^{\prime}}_{ijp}={x}_{ijp}+{r}_{ip}+{c}_{jp}+Ran{d}_{ijp}$, 1 ≤ i ≤ 8, 1 ≤ j ≤ 12, 1 ≤ p ≤ 1250. 
E. Datasets with the random error only (i.e., systematic error was absent).  ${{x}^{\prime}}_{ijp}={x}_{ijp}+Ran{d}_{ijp}$, 1 ≤ i ≤ 8, 1 ≤ j ≤ 12, 1 ≤ p ≤ 1250. 
In order to render our simulation study more realistic, we limited the number of rows, columns and wells affected by systematic error. Typically, in real HTS assays only some of the error parameters (i.e., r_{ i } , c_{ j } , w_{ ij } ,r_{ ip } and c_{ jp } , see Table 1) are non null and only a few columns and rows are biased by systematic error. In datasets of types A and B, the number of rows and columns affected by systematic error as well as their locations were chosen randomly. These parameters were identical for all the plates of the assay. In datasets of type D, the number of rows and columns affected by systematic error as well as their locations were also randomly selected, but these parameters were different for different plates of the assay. In datasets of type C, the number of biased wells and their locations were randomly selected and were the same for all assay plates. The datasets used in our simulations were subject to the following constraints. For the 96well plates, at most 2 rows and 2 columns (cases A, B and D), and not more than 10% of the wells (case C) were affected by systematic error. For the 384well plates, the limits were 4 rows, 4 columns and 10% of the wells, whereas for the 1536well plates, systematic error affected at most 8 rows, 8 columns and 10% of wells.
Systematic error detection tests
Three systematic error detection methods, including the ttest, the χ^{ 2 } goodnessoffit test and Discrete Fournier Transform procedure followed by the KolmogorovSmirnov test, were examined in this study in the context of experimental HTS.
ttest
where μ_{ 1 } is the mean of the sample S_{ 1 } and μ_{ 2 } is the mean of the sample S_{ 2 } . The calculated tstatistic was then compared to the corresponding critical value for the chosen statistical significance level α (the α values equal to 0.01 and 0.1 were used in our simulations) in order to decide whether or not H_{ 0 } should be rejected. While assuming homogeneity of variance in the construction of the ttest, the computation can be optimized using the equivalent contrasts in the context of an analysis of variance.
χ ^{ 2 } goodnessoffit test
The second tested method was the χ^{ 2 } goodnessoffit test. This test was performed in Simulation 2 only in order to assess the presence of systematic error in the hit distribution surfaces. It was first recommended in [6] in order to identify systematic error in HTS data. The null hypothesis H_{ 0 } , here, is that no systematic error is present in the data. If H_{ 0 } is true, then the hits are evenly distributed across the well locations and the observed counts of hits x_{ ij } in each row i and each column j of the hit distribution surface is not significantly different from the expected value calculated as the total counts across the entire surface divided by the number of wells. The rejection region of H_{ 0 } is P(χ^{ 2 } > C_{ α })>α, where C_{ α } is the χ^{ 2 } distribution critical value corresponding to the selected α parameter (the α values equal to 0.01 and 0.1 were tested here) and to the number of degrees of freedom of the model.
where E is the total hits count of the whole hit distribution surface divided by the number of wells (N_{ R } × N_{ C } ) with the number of degrees of freedom equal to N_{ R }  1.
where E is the total hits count of the whole hit distribution surface divided by the number of wells (N_{ R } × N_{ C } ) with the number of degrees of freedom equal to N_{ C }  1.
where E is the total hits count of the whole hit distribution surface divided by the number of wells (N_{ R } × N_{ C } ) with the number of degrees of freedom equal to N_{ R } × N_{ C }  1.
Discrete Fourier Transform and KolmogorovSmirnov test
The third tested method consisted of the Discrete Fourier Transform (DFT) procedure [9] followed by the KolmogorovSmirnov goodnessoffit test [8]. DFT has been widely used in the frequency analysis of signals and, in particular, for building the signal's density spectrum. The power density spectrum shows the energy contained in each frequency component existing in the signal. In order to apply DFT to HTS data we need first to unroll a plate measurement matrix into a linear sequence of measurements. There are two natural ways to do so: (a) to build the sequence starting by the first row of the plate, followed by the second row, then third one, and so on, and (b) to start by the first column of the plate, followed by the second column, third one, and so on. The analysis of sequences (a) and (b) would allow us to detect column and row effects, respectively. DFT detects frequencies of signals that repeat every two, three, four, and so on, positions in the sequence. DFT calculates the amplitudes of every possible frequency component. Let ${y}_{k}^{p}$ (1 ≤ k ≤ N) be the power density spectrum generated by the DFT analysis for the plate p with N wells.
where $F\left({y}_{k}^{p}\right)$ is defined as the number of values in the density spectrum that are lower than or equal to ${y}_{k}^{p}$, i.e., $F\left({y}_{k}^{p}\right)\equiv \Vert \phantom{\rule{0.5em}{0ex}}\left\{{y}_{l}^{p},\phantom{\rule{0.5em}{0ex}}1\le l\le N,\phantom{\rule{0.5em}{0ex}}{y}_{l}^{p}<{y}_{k}^{p}\right\}\phantom{\rule{0.5em}{0ex}}\Vert $. Big values of D lead to the rejection of the null hypothesis (i.e., x_{ ijp } 's have been drawn from random normally distributed data). The method consisting of the DFT analysis followed by the KolmogorovSmirnov test was included in some commercial software focusing on the detecting systematic error in experimental data (e.g., in Array Validator described in [20]).
Results and discussion
Simulation 1: Detecting systematic error in individual plates
Simulation 1 consisted of the detection of systematic error on a platebyplate basis. Artificial HTS data for three different plate sizes: 96 wells  8 rows and 12 columns, 384 wells  16 rows and 24 columns, and 1536 wells  32 rows and 48 columns were first generated. We started by creating basic errorfree datasets for which the well measurements followed a standard normal distribution ~N(0,1). For all datasets the number of plates was set to 1250  the same as in McMaster Test assay[18]. Then, we added 1% of hits to each of the generated basic datasets. The hits were added in such a way that the probability that a given well contained a hit was 1%. All the hit values followed a normal distribution with the parameters ~N(μ  5SD, SD), where μ and SD are the mean value and standard deviation of the basic dataset (without hits).
Using these errorfree datasets, we generated datasets comprising different types of systematic error, labelled A to E, as reported in Table 1. Systematic error was added only to some of the assay rows (columns, wells). The number of rows (columns, wells) affected by systematic error as well as the indexes of the affected rows, columns and wells were determined randomly for each considered dataset. Six types of erroraffected sets were produced for each errorfree dataset by varying the standard deviation of systematic error. The following values of the systematic error standard deviation were used: 0, 0.6SD, 1.2SD, 1.8SD, 2.4SD and 3.0SD, where SD is the standard deviation of the basic dataset. The ttest and KS test were then applied to erroraffected data. Both tests produced a binary result for each row and column of each plate: Systematic error was detected or not detected in this row or column. The output was then compared to the information from the data generation phase to determine whether the result of the test was correct.
where Pr(a) is the relative observed agreement among raters (i.e., statistical tests in our study) and Pr(e) is the hypothetical probability of chance agreement. If the raters are in complete agreement, then κ = 1. If there is no agreement among the raters, other than what would be expected by chance, then κ ≤ 0.
In our HTS context, Pr(a) and Pr(e) were calculated as follows: $\mathrm{Pr}(a)=\frac{TP+TN}{P\times ({N}_{R}+{N}_{C})}$ and $\mathrm{Pr}(e)=\frac{(TP+FN)\times (TP+FP)+(TN+FN)\times (TN+FP)}{{(P\times ({N}_{R}+{N}_{C}))}^{2}}$, where P is the number of plates in the assay, N_{ R } and N_{ C } , are, respectively, the number of rows and columns per plate, TP (true positives) is the sum of the numbers of rows and columns where systematic error was added during the data generation and then detected by the test, FP (false positives) is the sum of the numbers of rows and columns where systematic error was not added but detected by the test, TN (true negatives) is the sum of the numbers of rows and columns where systematic error was not added and not detected by the test, and FN (false negatives) is the sum of the numbers of rows and columns where systematic error was added but not detected by the test.
Since datasets of types C and E did not contain row or column systematic error, the sensitivity and Cohen's kappa coefficient of both competing statistical tests for these data were undefined (i.e., TP = FN = 0 for these data types).
The kappa coefficient curves in Figures 2, 3 and 4 show that the ttest clearly outperforms DFT followed by the KS test for all selected sizes of systematic error, confidence levels and plate sizes. The accuracy of the ttest grows as the size of systematic error increases. It also grows slightly as the plate size increases. The accuracy of the KS test remains very low and usually varies between 0.0 and 0.1, thus suggesting a very poor systematic error recovery by this test. Figures 13SM, 14SM and 15SM indicate that the success rate of the ttest is largely independent of the systematic error variance and remains very steady for all tested types of systematic error and plate sizes. In contrast, the success rate of the KS test decreases as the standard deviation of systematic error increases. The performance of the KS test is also affected by the size of the plate (Figures 2, 3 and 4). The KS test success rate decreases significantly, and often falls below 50%, for larger plates (Figure 15SM). The chosen confidence level α affects the accuracy of both statistical tests. For instance, the use of α = 0.1 generally causes a decrease in the kappa coefficient (the decrease of 0.2 on average, see Figures 2, 3 and 4) and in the success rate (the decrease of 10% on average, see Figures 13SM, 14SM and 15SM) of the ttest, when compared to α = 0.01. The sensitivity charts (Figures 1SM, 2SM and 3SM) show that the increase in the variance of systematic error leads to the increase in sensitivity of both tests. In terms of sensitivity, the ttest outperforms the KS test for all data types and all sizes of systematic error, the only exception being large plates tested with the confidence level α = 0.1 (Figure 3SM).
Similarly to real HTS assays, our artificially generated datasets had systematic error in only a few rows and/or columns. They contained many negative and only a few positive samples. Such an imbalance between positive and negative samples implies that the overall accuracy of the tests will depend much more on the test specificity than on its sensitivity. Figures 4SM, 5SM and 6SM confirm this observation  most of the specificity charts resemble the corresponding success rate charts (see Figures 13SM, 14SM and 15SM).
Simulation 2: Detecting systematic error on hit distribution surfaces
The second simulation, Simulation 2, consisted of the detection of systematic error on the hit distribution surfaces. The recommendation to use statistical tests to examine hit distribution surfaces of experimental HTS assays was first formulated in [6], in the case of the χ^{ 2 } test. In Simulation 2, we also considered artificially generated assays with plates of three different sizes (i.e., 96, 384 and 1536well plates as well as 1250plate assays) with the measurements following the standard normal distribution. From every basic dataset we generated 6 errorfree datasets comprising 0.5%, 1%, 2%, 3%, 4% and 5% of hits. All the hit values followed a normal distribution with the parameters ~N(μ  5SD, SD). Using the errorfree datasets, we generated assays containing different types of systematic error (i.e., from A to E). Systematic error, added to some of the assay rows (columns, wells) only, followed the normal distribution with the mean value of 0 and the standard deviation of 1.2SD. For each such an assay, we calculated its hit distribution surface for the hit selection threshold of μ3σ. Then we applied, in turn, the ttest, and the KS and χ^{ 2 } goodnessoffit tests to detect the presence of systematic error.
The kappa coefficient curves presented in Figures 5, 6 and 7 illustrate that the ttest clearly outperforms the χ^{ 2 } goodnessoffit test as well as the combination of DFT and the KS test for all selected sizes of systematic error, confidence levels and plate sizes. The accuracy of the ttest generally grows as the size of systematic error increases, but this trend is not as steady as in Simulation 1: The curve's minimum is not always associated with the lowest systematic noise (e.g., see cases c and d in Figure 5). The kappa values for the χ^{ 2 } and KS tests usually varies between 0.0 and 0.25, thus suggesting a poor systematic error recovery provided by both of them. As in Simulation 1, the success rate of the ttest is largely independent of the systematic error variance (Figures 16SM, 17SM and 18SM). Moreover, the success rate of the ttest varies between 90 and 100% in the most of simulated experiments. At the same time, the accuracy of the KS test is extremely low in almost all of the considered situations. The success rate analysis of the χ^{ 2 } goodnessoffit test suggests that this test follows different patterns for different types of data. For datasets of types D and E, whose hit distribution surfaces did not contain systematic error, the accuracy of the χ^{ 2 } test is very close to that of the ttest (Figures 16SM, 17SM and 18SM, cases d, e, i and j). However, for the datasets that contained row and/or column systematic error and well systematic error, the success rate of the χ^{ 2 } goodnessoffit test is significantly lower than that of the ttest (Figures 16SM, 17SM and 18SM, cases a to c and f to h) and shows a tendency to deteriorate when the percentage of hits in the data increases. The sensitivity patterns shown in Figures 7SM, 8SM and 9SM demonstrate that the sensitivity of the three statistical tests grows as the percentage of hits contained in the data increases. Similarly to Simulation 1, choosing a bigger value of α led to a decrease in the accuracy of all tests.
Application to the McMaster data
As a final step in our study we applied the three discussed systematic error detection tests on real HTS data. We examined the impact that the presented methodology would have on the hit selection process in McMaster Data Mining and Docking Competition Test assay[18]. Similarly to Simulations 1 and 2 carried out with artificial data, we performed two types of analysis. First, we studied the raw HTS measurements, and then calculated and analyzed the hit distribution surfaces of Test assay.
Number of rows, columns and plates (where at least one row or column contains systematic error) of McMaster Test assay in which the ttest reported the presence of systematic error, depending on the α parameter
α  Plates  Rows  Rows %  Columns  Columns % 

0.01  159  76  0.76%  94  0.75% 
0.05  814  575  5.76%  606  4.86% 
0.1  1121  1148  11.50%  1296  10.38% 
0.2  1241  2242  22.46%  2583  20.70% 
The obtained results suggest that the number of positives for the row and column effects is almost exactly what we would expect by chance (e.g., approximately 1% when we used α = 0.01, 5 % when we used α = 0.05, etc.). This means that there is no statistical evidence of bias for columns and rows in McMaster Test assay.
Number of hits selected in McMaster Test assay for the μ 3SD threshold after the application of the Bscore correction, depending on the α parameter
α  Original hits  Obtained hits  Preserved hits  Added hits  Removed hits 

0.01  96  123  57  66  39 
0.05  96  125  55  70  41 
0.1  96  126  52  74  44 
0.2  96  130  55  75  41 
Number of hits selected in McMaster Test assay for the μ 2.29SD threshold (i.e., threshold used by the McMaster competition organizers to select the 96 original average hits) after the application of the Bscore correction, depending on the α parameter
α  Original hits  Obtained hits  Preserved hits  Added hits  Removed hits 

0.01  96  357  79  278  17 
0.05  96  419  79  340  17 
0.1  96  411  79  332  17 
0.2  96  417  76  341  20 
In our second experiment, we computed and analyzed the hit distribution surfaces of McMaster Test assay for the hit selection thresholds: μ 3SD and μ 2SD. We assessed the presence of systematic error in the assay by applying the three discussed systematic error detection tests: ttest, KS test and χ^{ 2 } goodnessoffit test. All three tests detected the presence of systematic error in both surfaces for both considered confidence levels α = 0.01 and 0.1. While the hit distribution surface is useful for detecting the presence of overall bias, it does not capture the variability of the bias on a platebyplate basis.
Number of hits selected in McMaster Test assay for the μ 3SD and μ 2.29SD thresholds after the application of the Well Correction method
Threshold  Original hits  Obtained hits  Preserved hits  Added hits  Removed hits 

μ 3SD  96  26  26  0  70 
μ 2.29SD  96  102  72  30  24 
Conclusions
In this article we discussed and tested three methods for detecting the presence of systematic error in experimental HTS assays. We conducted a comprehensive simulation study with artificially generated HTS data, constructed to model a variety of reallife situations. The variants of each dataset, comprising different hit percentages and various types and levels of systematic error, were examined. The experimental results show that the method performances depend on the assay parameters  plate size, hit percentage, and type and variance of systematic error. We found that the simplest, and computationally fastest method, the ttest, outperformed the KolmogorovSmirnov (KS) and χ^{ 2 } goodnessoffit tests in most of the practical situations. The ttest demonstrated a high robustness when applied on a variety of artificial datasets. The success rate of the ttest was, in most situations, well above 90%, regardless the plate size, noise level and type of systematic error, while the values of Cohen's kappa coefficient computed for this test suggested its superior performance especially in the case of large plates and high level of systematic noise. We can thus recommend the ttest as a method of choice in experimental HTS. On the contrary, advocated in some works [20, 21] Discrete Fourier Transform followed by the KS test yielded very disappointing results. Moreover, the latter technique required a lot of computational power but provided the worst overall performance among the three competing statistical procedures. The KS test can still be used to examine HTS data located in small plates (i.e., 96well plates), but we strongly recommend not using it for the analysis or large plates (i.e., 384 and 1536well plates) and hit distribution surfaces. The main reason for such a disappointing performance of the KS test is it that was applied, as recommended in [20], on the data already transformed by the Discrete Fourier method. Figure 19SM presents an example of data from one of the simulated 96well plates before and after the application of Discrete Fourier Transform. The raw data followed a normal distribution and contained random error only (i.e., systematic error was not added). The raw data did not deviate from the normal distribution, as shown both graphically (Figure 19SMa) and by the KS test (KS = 0.03, p = 0.5). However, after the application of Discrete Fourier Transform, the data deviate from normality as shown in the graph (Figure 19SMa) and by the KS test (KS = 0.06, p = 0.0018). The third method, the χ^{ 2 } goodnessoffit test suggested in [6], can be employed to assess hit distribution surfaces for the presence of systematic error. In general, its performances were lower than those of the ttest and were very sensitive to the type of systematic error as well as to its variance. The χ^{ 2 } goodnessoffit test could be recommended, especially to analyze HTS assays with small plate sizes, but we suggest carrying out the ttest as well to confirm its results.
In addition to the experiments with simulated data, we applied the three discussed systematic error detection tests to real HTS data. Our goal was to evaluate the impact of systematic error on the hit selection process in experimental HTS. The obtained results (see Tables 25 and Figure 8) confirm the following fact: If raw HTS data are not treated properly for eliminating the effect of systematic error, then many (e.g., about 30% of hits in the case of McMaster Test assay, as reported in Table 5) of the selected hits may be due to the presence of systematic error and, at the same time, many promising compounds may be missed during hit selection. A special attention should be paid to control the results of aggressive data normalization procedures, such as Bscore, that could easily do more damage by introducing biases in raw HTS data and, therefore, lead to the selection of many false positive hits even in the situations when the data don't contain any kind of systematic error.
Our general conclusion is that a successful assessment of the presence of systematic error in experimental HTS assays is achievable when the appropriate statistical methodology is used. Namely, the ttest should be carried out by HTS researchers to preprocess raw HTS data. This test should help improve the "quality" of selected hits by discarding many potential false positives and suggesting new, and eventually real, active compounds. The ttest should be used in conjunction with data correction techniques such as: Well correction [5, 6], when row or column systematic error (detected by the test) repeats across all plates of the assay, and Bscore [3] or trimmedmean polish score [7], when systematic error varies across plates. Thus, we recommend adding an extra preliminary systematic error detection and correction step in all HTS processing software and using consensus hits in order to improve the overall accuracy of HTS analysis.
Declarations
Acknowledgements
The authors thank the Natural Sciences and Engineering Research Council of Canada (NSERC) and Nature and Technologies Research Funds of Quebec (FQRNT) for supporting this research. We also thank Professor JeanJacques Daudin and two anonymous referees for their helpful comments. All authors read and approved the final manuscript.
Authors’ Affiliations
References
 Heyse S: Comprehensive analysis of highthroughput screening data. Proc of SPIE 2002, 4626: 535–547.View ArticleGoogle Scholar
 Heuer C, Haenel T, Prause B: A novel approach for quality control and correction of HTS data based on artificial intelligence. In Pharmaceutical discovery & development report. PharmaVentures; 2005.Google Scholar
 Brideau C, Gunter B, Pikounis W, Pajni N, Liaw A: Improved statistical methods for hit selection in HTS. J Biomol Screen 2003, 8: 634–647. 10.1177/1087057103258285View ArticlePubMedGoogle Scholar
 Kevorkov D, Makarenkov V: Statistical analysis of systematic errors in HTS. J Biomol Screen 2005, 10: 557–567. 10.1177/1087057105276989View ArticlePubMedGoogle Scholar
 Makarenkov V, Kevorkov D, Zentilli P, Gagarin A, Malo N, Nadon R: HTSCorrector: new application for statistical analysis and correction of experimental data. Bioinformatics 2006, 22: 1408–1409. 10.1093/bioinformatics/btl126View ArticlePubMedGoogle Scholar
 Makarenkov V, Zentilli P, Kevorkov D, Gagarin A, Malo N, Nadon R: An efficient method for the detection and elimination of systematic error in highthroughput screening. Bioinformatics 2007, 23: 1648–1657. 10.1093/bioinformatics/btm145View ArticlePubMedGoogle Scholar
 Malo N, Hanley J, Carlile G, Jing L, Pelletier J, Thomas D, Nadon R: Experimental design and statistical methods for improved hit detection in highthroughput screening. J Biomol Screen 2010, 15: 990–1000. 10.1177/1087057110377497View ArticlePubMedGoogle Scholar
 D'Agostino R, Stephens M: GoodnessofFit Techniques. Marcel Dekker, Inc; 1986.Google Scholar
 Cooley JW, Tukey JW: An algorithm for the machine calculation of complex Fourier series. Math Comput 1965, 19: 297–301. 10.1090/S00255718196501785861View ArticleGoogle Scholar
 Dove A: Screening for content  the evolution of high throughput. Nat Biotechnol 2003, 21: 859–864. 10.1038/nbt0803859View ArticlePubMedGoogle Scholar
 Kaul A: The impact of sophisticated data analysis on the drug discovery process. Business briefing: future drug discovery 2005.Google Scholar
 Gunter B, Brideau C, Pikounis B, Pajni N, Liaw A: Statistical and graphical methods for quality control determination of HTS data. J Biomol Screen 2003, 8: 624–633. 10.1177/1087057103258284View ArticlePubMedGoogle Scholar
 Malo N, Hanley JA, Cerquozzi S, Pelletier J, Nadon R: Statistical practice in highthroughput screening data. Nat Biotechnol 2006, 24: 167–175. 10.1038/nbt1186View ArticlePubMedGoogle Scholar
 Gagarin A, Makarenkov V, Zentilli P: Clustering techniques to improve the hit selection in HTS. J Biomol Screen 2006, 11: 903–914. 10.1177/1087057106293590View ArticlePubMedGoogle Scholar
 Zhang JH, Chung TDY, Oldenburg KR: A simple statistic parameter for use in evaluation and validation of HTS assays. J Biomol Screen 1999, 4: 67–73. 10.1177/108705719900400206View ArticlePubMedGoogle Scholar
 Zhang JH, Chung TDY, Oldenburg KR: Confirmation of primary active substances from HTS of chemical and biological populations: a statistical approach and practical considerations. J Comb Chem 2000, 2: 258–265. 10.1021/cc9900706View ArticlePubMedGoogle Scholar
 Tukey JW: Reading Massachusetts: AddisonWesley. Exploratory Data Analysis 1977.Google Scholar
 Elowe NH, Blanchard JE, Cechetto JD, Brown ED: Experimental Screening of Dihydrofolate Reductase Yields a "Test Set" of 50,000 Small Molecules for a Computational DataMining and Docking Competition. J Biomol Screen 2005, 10: 653–657. 10.1177/1087057105281173View ArticlePubMedGoogle Scholar
 Helm JS, Hu Y, Chen L, Gross B, Walker S: Identification of activesite inhibitors of MurG using a generalizable, highthroughput glycosyltransferase screen. J Am Chem Soc 2003, 125: 11168–11169. 10.1021/ja036494sView ArticlePubMedGoogle Scholar
 Kelley BP: Automated Detection of Systematic Errors in Array Experiments. Software Report. Datect Inc., Santa Fe; 2005.Google Scholar
 Root DE, Kelley BP, Stockwell BR: Detecting spatial patterns in biological array experiments. J Biomol Screen 2003, 8: 393–398. 10.1177/1087057103254282PubMed CentralView ArticlePubMedGoogle Scholar
 Cohen J: A coefficient of agreement for nominal scales. Educ Psychol Meas 1960, 20: 37–46. 10.1177/001316446002000104View ArticleGoogle Scholar
 Fleiss JL: Statistical methods for rates and proportions. 2nd edition. John Wiley & Sons, Inc. New York; 1981.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.