Systematic error detection in experimental high-throughput screening
© Dragiev et al; licensee BioMed Central Ltd. 2011
Received: 30 June 2010
Accepted: 19 January 2011
Published: 19 January 2011
High-throughput screening (HTS) is a key part of the drug discovery process during which thousands of chemical compounds are screened and their activity levels measured in order to identify potential drug candidates (i.e., hits). Many technical, procedural or environmental factors can cause systematic measurement error or inequalities in the conditions in which the measurements are taken. Such systematic error has the potential to critically affect the hit selection process. Several error correction methods and software have been developed to address this issue in the context of experimental HTS [1–7]. Despite their power to reduce the impact of systematic error when applied to error-perturbed datasets, those methods have one disadvantage: they introduce a bias when applied to data that contain no systematic error. Hence, we first need to assess the presence of systematic error in a given HTS assay and then apply a systematic error correction method if and only if the presence of systematic error has been confirmed by statistical tests.
We tested three statistical procedures to assess the presence of systematic error in experimental HTS data: the χ² goodness-of-fit test, Student's t-test, and the Kolmogorov-Smirnov test preceded by the Discrete Fourier Transform (DFT) method. We applied these procedures first to raw HTS measurements and then to estimated hit distribution surfaces. The three competing tests were applied to simulated datasets containing different types of systematic error and to a real HTS dataset. Their accuracy was compared under various error conditions.
A successful assessment of the presence of systematic error in experimental HTS assays is possible when the appropriate statistical methodology is used. Namely, the t-test should be carried out by researchers to determine whether systematic error is present in their HTS data prior to applying any error correction method. This important step can significantly improve the quality of selected hits.
High-throughput screening (HTS) is a modern technology used by drug researchers to identify pharmacologically active compounds. HTS is a highly automated early-stage mass screening process. Contemporary HTS equipment allows for testing more than 100,000 compounds a day. HTS serves as a starting point for rapid identification of primary hits that are then further screened and evaluated to determine their activity, specificity, and physiological and toxicological properties. As a highly sensitive test system, HTS requires both precise measurement tools and dependable quality control. The absence of standardized data validation and quality assurance procedures is recognised as one of the major hurdles in modern experimental HTS [11–13]. Acknowledging the importance of automatic quality assessment and data correction systems, many researchers have offered methods for eliminating experimental systematic artefacts which, if left uncorrected, can obscure important biological or chemical properties of screened compounds (false negatives) and can seemingly indicate biological activity when there is none (false positives) [1–7, 10–16].
Systematic error may be caused by various factors, including robotic failures and reader effects, pipette malfunction or other liquid handling anomalies, unintended differences in compound concentrations due to agent evaporation or variation in the incubation time and temperature differences, and lighting or air flow present over the course of the entire screen [2, 6]. Unlike random error, which produces measurement noise and usually has minimal impact on the whole process, systematic error produces measurements that are systematically over- or underestimated. Systematic error may be time dependent, introducing biases in individual plates or subsets of consecutive plates, but it may also affect an entire HTS assay (i.e., all screened plates). In practice, systematic error is almost always location related. The under- or overestimation affects compounds located in the same row or column or in the same well location across the screened plates. The row and column effects may be persistent across the assay, repeatedly affecting the same rows and columns on different plates, or may vary from plate to plate, perturbing some rows and columns within a particular plate only. Plate controls are used in HTS to ensure the accuracy of the activity measurements being taken. Controls are substances with stable, well-known activity levels. They might be positive (i.e., a strong activity effect is observed) or negative (i.e., no activity effect is observed). Controls help to detect plate-to-plate variability and determine the level of background noise.
Percent of control - the following formula is used:
$$\hat{x}_{ij} = \frac{x_{ij}}{\mu_{pos}} \times 100\%,$$
where $x_{ij}$ is the raw measurement of the compound in well (i, j), $\hat{x}_{ij}$ is the normalized value of $x_{ij}$, and $\mu_{pos}$ is the mean of positive controls.
Control normalization (also known as the normalized percent inhibition transformation) is based on the following formula:
$$\hat{x}_{ij} = \frac{x_{ij} - \mu_{neg}}{\mu_{pos} - \mu_{neg}},$$
where $x_{ij}$ is the raw measurement of the compound in well (i, j), $\hat{x}_{ij}$ is the normalized value of $x_{ij}$, $\mu_{pos}$ is the mean of positive controls, and $\mu_{neg}$ is the mean of negative controls.
Z-score normalization is carried out as follows:
$$\hat{x}_{ij} = \frac{x_{ij} - \mu}{\sigma},$$
where $x_{ij}$ is the raw measurement of the compound in well (i, j), $\hat{x}_{ij}$ is the normalized value of $x_{ij}$, $\mu$ is the mean of all the measurements of the given plate, and $\sigma$ is the standard deviation of all the measurements of the given plate.
B-score (i.e., Best score normalization) is carried out as follows:
First, a two-way median polish procedure is performed to account for row and column effects of the plate. The resulting residuals within each plate are then divided by their median absolute deviation, MAD. It is worth noting that there is an additional smoothing step that could be applied across plates (see the original article for a description of the smoothing). This optional smoothing step was not applied, however, in [5, 6].
The residual $r_{ijp}$ of the measurement in row i and column j of plate p is defined as the difference between the observed result $x_{ijp}$ and the fitted value $\hat{x}_{ijp}$:
$$r_{ijp} = x_{ijp} - \hat{x}_{ijp} = x_{ijp} - (\hat{\mu}_p + \hat{R}_{ip} + \hat{C}_{jp}),$$
where $\hat{\mu}_p$ is the estimated average of the plate, $\hat{R}_{ip}$ is the estimated systematic measurement offset for row i of plate p, and $\hat{C}_{jp}$ is the estimated systematic measurement offset for column j of plate p. For each plate p, the adjusted median absolute deviation ($MAD_p$) is then obtained from the $r_{ijp}$'s, and the B-score is the ratio $r_{ijp} / MAD_p$.
Median absolute deviation (MAD) - a robust estimate of the spread of the $r_{ijp}$ values, computed as follows: $MAD_p = \mathrm{median}\{ |r_{ijp} - \mathrm{median}(r_{ijp})| \}$.
The B-score normalization was introduced by a team of Merck Frosst researchers as a systematic error correction method.
Well correction [5, 6] is carried out in two steps:
Least-squares approximation of the data carried out separately for each well location of the assay;
Z-score normalization of the data within each well location of the assay (i.e., the Z-score normalization is performed across all the plates of the assay).
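The normalization formulas listed above can be sketched in Python as follows. This is only an illustrative sketch: it assumes NumPy, plates stored as 2-D arrays (rows × columns) and assays as 3-D arrays (plates × rows × columns). The function names, the fixed number of median polish sweeps, the use of the sample standard deviation, and the unscaled MAD are all assumptions made here, not details taken from the study. The least-squares step of Well correction is omitted.

```python
import numpy as np

def percent_of_control(plate, mu_pos):
    # x_hat = x / mu_pos * 100
    return plate / mu_pos * 100.0

def z_score(plate):
    # Plate-wise Z-score; ddof=1 (sample SD) is an assumption of this sketch.
    return (plate - plate.mean()) / plate.std(ddof=1)

def median_polish(plate, n_iter=10):
    # Two-way median polish: alternately remove row and column medians.
    # Returns the residuals r_ij of the plate.
    resid = np.asarray(plate, dtype=float).copy()
    for _ in range(n_iter):
        resid -= np.median(resid, axis=1, keepdims=True)  # row effects
        resid -= np.median(resid, axis=0, keepdims=True)  # column effects
    return resid

def b_score(plate, n_iter=10):
    # Residuals divided by their median absolute deviation (unscaled here;
    # some implementations multiply the MAD by 1.4826).
    resid = median_polish(plate, n_iter)
    mad = np.median(np.abs(resid - np.median(resid)))
    return resid / mad

def well_z_score(assay):
    # Z-score within each well location, i.e. across the plate axis (axis 0).
    return (assay - assay.mean(axis=0)) / assay.std(axis=0, ddof=1)
```

A plate whose values are a pure sum of a row effect and a column effect is reduced to zero residuals by `median_polish`, which is why the B-score removes additive row and column biases.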
In the HTS workflow, the normalization/data correction phase is usually followed by the hit selection process. During this process the most active compounds are identified as hits and selected for additional screens. A predefined threshold is usually established to select hits. Depending on the specifics of the research study, one may be looking for compounds whose activity level is greater than the defined threshold (i.e., activation assay) or interest may lie in the compounds whose measurements are below the defined threshold (i.e., inhibition assay). In this study, we always assume the latter case where the hits are the compounds with the smallest measurement values. The threshold for defining hits is usually expressed using the mean value and standard deviation of the considered measurements. The most widely used threshold is μ − 3σ, where μ is the mean value and σ is the standard deviation of the considered measurements. Hits can be selected globally, over the whole assay, when the mean and standard deviation of all assay compounds are calculated, or on a plate-by-plate basis, when the mean and standard deviation of the compounds of each single plate are considered [6, 13].
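The two hit selection modes described above (global and plate-by-plate) can be illustrated with a short sketch for an inhibition assay. It assumes NumPy and an assay stored as a 3-D array (plates × rows × columns). The function names and the parameter `k` (the number of standard deviations below the mean) are hypothetical.

```python
import numpy as np

def select_hits_global(assay, k=3.0):
    # One threshold mu - k*sigma computed over the whole assay.
    threshold = assay.mean() - k * assay.std(ddof=1)
    return assay < threshold

def select_hits_per_plate(assay, k=3.0):
    # A separate threshold mu_p - k*sigma_p for every plate p.
    mu = assay.mean(axis=(1, 2), keepdims=True)
    sd = assay.std(axis=(1, 2), ddof=1, keepdims=True)
    return assay < mu - k * sd
```

Both functions return a boolean mask with the shape of the assay, so the selected compounds can be compared well by well between the two modes.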
In this study we consider an experimental assay provided by the HTS laboratory of McMaster University. This assay was called Test assay and used as a benchmark in the McMaster Data Mining and Docking Competition. McMaster Test assay consists of 50,000 different chemical compounds whose potential to inhibit the E. coli DHFR was tested. Each of the 50,000 considered compounds was screened in duplicate: two copies of each of the 625 plates were run through the HTS equipment, so that 1250 plates in total, with wells arranged in 8 rows and 12 columns, were screened. Columns 1 and 12 of each plate were used for positive and negative controls and were, therefore, not considered in our study. Thus, every plate comprised 80 different compounds. The competition organizers defined as primary hits the compounds that reduced the activity of E. coli DHFR to 75% or less of the average residual activity of the high controls. Two lists of hits were published (for more details, the reader is referred to: http://www.info2.uqam.ca/~makarenv/experimental_actives.pdf). The first list, called a consensus hits list, contained all compounds that were classified as hits in both of their replicate measurements (i.e., both measurement values were lower than or equal to 75% of the reference controls). Only 42 of the 50,000 tested compounds were declared consensus hits. The second list, called an average hits list, contained 96 compounds classified as hits when the average value of the two HTS measurements was lower than or equal to 75% of the reference controls. By construction, all consensus hits were also average hits. A secondary screening of the 96 average hits was also performed in order to determine their activity at different concentrations. As a result of the secondary screening, 12 of the average hits were identified as D-R hits (i.e., hits having well-behaved dose-response curves).
We simulated data in order to evaluate the performance of the systematic error detection tests. First, we generated error-free datasets consisting of random normally distributed data. The basic data format adopted here was that of the McMaster dataset - 1250 plates, each containing 96 wells arranged in 8 rows and 12 columns. In addition, we also generated two other basic datasets which were 4 and 16 times bigger. They also included 1250 plates, each of them comprising 384 (16 × 24) and 1536 (32 × 48) wells, respectively. It is worth noting that 96-, 384- and 1536-well plates are the most typical plate formats used in modern HTS.
An assay was defined as an ordered set of plates $PL_p$, where p (1 ≤ p ≤ 1250) is the plate number. Each plate, $PL_p$, can be viewed as a matrix of experimental HTS measurements $x_{ijp}$, where i (1 ≤ i ≤ $N_R$) is the row number, j (1 ≤ j ≤ $N_C$) is the column number, and $N_R$ and $N_C$ are, respectively, the number of rows and columns in $PL_p$. The generated values $x_{ijp}$ followed the standard normal distribution N(0, 1).
Then, the hits were added to the datasets. Several hit percentages, h, were tested in our simulations: h = 0.5, 1, 2, 3, 4 and 5%. The locations and values of hits were chosen randomly. The probability of each well in each plate to contain a hit was h %. The values of hits followed a normal distribution with the parameters ~N(μ - 5 SD, SD), where μ and SD are the mean value and standard deviation of the error-free dataset.
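The generation of an error-free assay with randomly placed hits, as described above, might look like the following sketch. It assumes NumPy. The function name, its parameters and the fixed seed are hypothetical choices made for reproducibility, not details taken from the study.

```python
import numpy as np

def simulate_assay(n_plates=1250, n_rows=8, n_cols=12, hit_pct=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Error-free measurements x_ijp ~ N(0, 1).
    assay = rng.standard_normal((n_plates, n_rows, n_cols))
    mu, sd = assay.mean(), assay.std(ddof=1)
    # Each well independently contains a hit with probability h %.
    is_hit = rng.random(assay.shape) < hit_pct / 100.0
    # Hit values follow N(mu - 5 SD, SD).
    assay[is_hit] = rng.normal(mu - 5.0 * sd, sd, size=int(is_hit.sum()))
    return assay, is_hit
```

The returned mask `is_hit` records where hits were planted, which is the ground truth needed later to judge the detection tests.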
Table 1. Five types of HTS datasets containing different kinds of systematic and/or random error generated and tested in this study. Generation of error-affected measurements: in each case, $x'_{ijp}$ denotes the error-affected value obtained from the error-free measurement $x_{ijp}$, $rand_{ijp}$ denotes the random error, and the index ranges shown are those of the 96-well format (1 ≤ i ≤ 8, 1 ≤ j ≤ 12, 1 ≤ p ≤ 1250).
A. Datasets with both column and row systematic errors which are constant across all assay plates: $x'_{ijp} = x_{ijp} + r_i + c_j + rand_{ijp}$.
B. Datasets with the column systematic error only, which is constant across all plates: $x'_{ijp} = x_{ijp} + c_j + rand_{ijp}$.
C. Datasets with the well systematic error, which is constant across all plates: $x'_{ijp} = x_{ijp} + w_{ij} + rand_{ijp}$.
D. Datasets with the variable column and row systematic errors, which are different for each plate: $x'_{ijp} = x_{ijp} + r_{ip} + c_{jp} + rand_{ijp}$.
E. Datasets with the random error only (i.e., systematic error was absent): $x'_{ijp} = x_{ijp} + rand_{ijp}$.
In order to render our simulation study more realistic, we limited the number of rows, columns and wells affected by systematic error. Typically, in real HTS assays only some of the error parameters (i.e., r i , c j , w ij , r ip and c jp , see Table 1) are non null and only a few columns and rows are biased by systematic error. In datasets of types A and B, the number of rows and columns affected by systematic error as well as their locations were chosen randomly. These parameters were identical for all the plates of the assay. In datasets of type D, the number of rows and columns affected by systematic error as well as their locations were also randomly selected, but these parameters were different for different plates of the assay. In datasets of type C, the number of biased wells and their locations were randomly selected and were the same for all assay plates. The datasets used in our simulations were subject to the following constraints. For the 96-well plates, at most 2 rows and 2 columns (cases A, B and D), and not more than 10% of the wells (case C) were affected by systematic error. For the 384-well plates, the limits were 4 rows, 4 columns and 10% of the wells, whereas for the 1536-well plates, systematic error affected at most 8 rows, 8 columns and 10% of wells.
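The five dataset types and the constraints described above can be sketched as follows. This is a hedged illustration assuming NumPy. The function name, the choice of at least one affected row, column or well, the zero-mean normal distribution of the bias terms, and the magnitude `noise_sd` of the random error are all assumptions of this sketch.

```python
import numpy as np

def add_systematic_error(assay, kind, err_sd=1.2, max_rows=2, max_cols=2,
                         max_well_frac=0.10, noise_sd=0.1, seed=0):
    """Return (perturbed assay, systematic bias) for dataset types 'A'..'E'."""
    rng = np.random.default_rng(seed)
    n_plates, n_rows, n_cols = assay.shape
    bias = np.zeros(assay.shape)

    def pick(n_items, max_items):
        # Random number (1..max_items) of distinct affected indices.
        count = int(rng.integers(1, max_items + 1))
        return rng.choice(n_items, size=count, replace=False)

    if kind in ("A", "B"):                      # constant across plates
        cols = pick(n_cols, max_cols)
        bias[:, :, cols] += rng.normal(0.0, err_sd, size=cols.size)
        if kind == "A":
            rows = pick(n_rows, max_rows)
            bias[:, rows, :] += rng.normal(0.0, err_sd, size=rows.size)[:, None]
    elif kind == "C":                           # per well, constant across plates
        wells = pick(n_rows * n_cols, int(max_well_frac * n_rows * n_cols))
        w = np.zeros(n_rows * n_cols)
        w[wells] = rng.normal(0.0, err_sd, size=wells.size)
        bias += w.reshape(n_rows, n_cols)
    elif kind == "D":                           # redrawn for every plate
        for p in range(n_plates):
            plate = bias[p]                     # view into bias
            rows, cols = pick(n_rows, max_rows), pick(n_cols, max_cols)
            plate[rows, :] += rng.normal(0.0, err_sd, size=rows.size)[:, None]
            plate[:, cols] += rng.normal(0.0, err_sd, size=cols.size)
    elif kind != "E":
        raise ValueError("kind must be one of 'A', 'B', 'C', 'D', 'E'")

    noise = rng.normal(0.0, noise_sd, size=assay.shape)  # random error term
    return assay + bias + noise, bias
```

Returning the bias array alongside the perturbed data keeps the ground truth (which rows, columns or wells were affected) available for scoring a detection test.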
Three systematic error detection methods, namely the t-test, the χ² goodness-of-fit test and the Discrete Fourier Transform procedure followed by the Kolmogorov-Smirnov test, were examined in this study in the context of experimental HTS.
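The t-statistic for two samples $S_1$ and $S_2$ was computed in its pooled-variance form:
$$t = \frac{\mu_1 - \mu_2}{\sqrt{\dfrac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{N_1 + N_2 - 2}\left(\dfrac{1}{N_1} + \dfrac{1}{N_2}\right)}},$$
where $\mu_1$ is the mean of the sample $S_1$ and $\mu_2$ is the mean of the sample $S_2$, $s_1^2$ and $s_2^2$ are the sample variances, and $N_1$ and $N_2$ are the sample sizes. The calculated t-statistic was then compared to the corresponding critical value for the chosen statistical significance level α (the α values equal to 0.01 and 0.1 were used in our simulations) in order to decide whether or not $H_0$ should be rejected. Because the t-test is constructed under the assumption of homogeneity of variance, the computation can be optimized by using the equivalent contrasts in the context of an analysis of variance.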
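A row- and column-wise application of the two-sample t-test can be sketched as follows. The sketch assumes NumPy and assumes that, for each row (or column), $S_1$ holds the measurements of that row (or column) and $S_2$ the remaining measurements of the same plate. This choice of samples is our reading and is not stated in the text above. The function names are hypothetical, and the critical value `t_crit` is left as a parameter to be taken from a t table for $N_1 + N_2 - 2$ degrees of freedom.

```python
import numpy as np

def pooled_t(s1, s2):
    # Two-sample t-statistic with pooled variance; returns (t, degrees of freedom).
    s1, s2 = np.asarray(s1, dtype=float), np.asarray(s2, dtype=float)
    n1, n2 = s1.size, s2.size
    pooled_var = ((n1 - 1) * s1.var(ddof=1) + (n2 - 1) * s2.var(ddof=1)) / (n1 + n2 - 2)
    t = (s1.mean() - s2.mean()) / np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return float(t), n1 + n2 - 2

def flag_rows_and_columns(plate, t_crit):
    # Flag a row/column when |t| exceeds the supplied two-sided critical value.
    n_rows, n_cols = plate.shape
    row_flags = np.zeros(n_rows, dtype=bool)
    col_flags = np.zeros(n_cols, dtype=bool)
    for i in range(n_rows):
        rest = np.delete(plate, i, axis=0).ravel()
        row_flags[i] = abs(pooled_t(plate[i, :], rest)[0]) > t_crit
    for j in range(n_cols):
        rest = np.delete(plate, j, axis=1).ravel()
        col_flags[j] = abs(pooled_t(plate[:, j], rest)[0]) > t_crit
    return row_flags, col_flags
```

For a 96-well plate there are 94 degrees of freedom, for which the two-sided critical value at α = 0.01 is approximately 2.63.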
The second tested method was the χ² goodness-of-fit test. This test was performed in Simulation 2 only, in order to assess the presence of systematic error in the hit distribution surfaces. It had previously been recommended as a way to identify systematic error in HTS data. The null hypothesis $H_0$, here, is that no systematic error is present in the data. If $H_0$ is true, then the hits are evenly distributed across the well locations, and the observed hit count $x_{ij}$ at row i and column j of the hit distribution surface is not significantly different from the expected value, calculated as the total count across the entire surface divided by the number of wells. $H_0$ is rejected when χ² > $C_α$, where $C_α$ is the critical value of the χ² distribution such that P(χ² > $C_α$) = α under $H_0$, for the selected α parameter (the α values equal to 0.01 and 0.1 were tested here) and the number of degrees of freedom of the model.
Three variants of the statistic were computed. In all of them, E is the total hit count of the whole hit distribution surface divided by the number of wells ($N_R \times N_C$).
Over the $N_R$ wells of a given column j:
$$\chi^2 = \sum_{i=1}^{N_R} \frac{(x_{ij} - E)^2}{E},$$
with the number of degrees of freedom equal to $N_R - 1$.
Over the $N_C$ wells of a given row i:
$$\chi^2 = \sum_{j=1}^{N_C} \frac{(x_{ij} - E)^2}{E},$$
with the number of degrees of freedom equal to $N_C - 1$.
Over the whole hit distribution surface:
$$\chi^2 = \sum_{i=1}^{N_R} \sum_{j=1}^{N_C} \frac{(x_{ij} - E)^2}{E},$$
with the number of degrees of freedom equal to $N_R \times N_C - 1$.
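The χ² statistics on a hit distribution surface can be computed with a few lines of NumPy. In this sketch E is the per-well expected count (total hits divided by the number of wells). Grouping the $N_R - 1$ degrees of freedom with a sum over the wells of one column, and the $N_C - 1$ degrees of freedom with a sum over the wells of one row, is an assumption based on the number of summed terms. The function name is hypothetical.

```python
import numpy as np

def chi2_statistics(surface):
    surface = np.asarray(surface, dtype=float)
    n_rows, n_cols = surface.shape
    expected = surface.sum() / (n_rows * n_cols)     # E
    contrib = (surface - expected) ** 2 / expected   # per-well contribution
    return {
        "columns": contrib.sum(axis=0),  # one statistic per column, df = N_R - 1
        "rows": contrib.sum(axis=1),     # one statistic per row, df = N_C - 1
        "surface": contrib.sum(),        # whole surface, df = N_R * N_C - 1
    }
```

Each statistic would then be compared with the χ² critical value for its degrees of freedom and the chosen α, taken from a statistical table or library.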
The third tested method consisted of the Discrete Fourier Transform (DFT) procedure followed by the Kolmogorov-Smirnov goodness-of-fit test. DFT has been widely used in the frequency analysis of signals and, in particular, for building the signal's power density spectrum. The power density spectrum shows the energy contained in each frequency component existing in the signal. In order to apply DFT to HTS data, we first need to unroll a plate measurement matrix into a linear sequence of measurements. There are two natural ways to do so: (a) to build the sequence starting with the first row of the plate, followed by the second row, then the third one, and so on, and (b) to start with the first column of the plate, followed by the second column, the third one, and so on. The analysis of sequences (a) and (b) would allow us to detect column and row effects, respectively. DFT detects frequencies of signals that repeat every two, three, four, and so on, positions in the sequence. DFT calculates the amplitudes of every possible frequency component. Let $S_k^p$ (1 ≤ k ≤ N) be the power density spectrum generated by the DFT analysis for the plate p with N wells.
The Kolmogorov-Smirnov statistic is then
$$D = \max_{1 \le k \le N} \left| F(S_k^p) - \frac{G(S_k^p)}{N} \right|,$$
where F is the cumulative distribution function expected under the null hypothesis and $G(S_k^p)$ is defined as the number of values in the density spectrum that are lower than or equal to $S_k^p$, i.e., $G(S_k^p) = |\{ S_l^p : S_l^p \le S_k^p \}|$. Large values of D lead to the rejection of the null hypothesis (i.e., the $x_{ijp}$'s have been drawn from random normally distributed data). The method consisting of the DFT analysis followed by the Kolmogorov-Smirnov test was included in some commercial software focusing on detecting systematic error in experimental data (e.g., in Array Validator).
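The DFT step and the Kolmogorov-Smirnov statistic can be sketched as follows, assuming NumPy and the standard library `statistics.NormalDist`. Several choices here are assumptions rather than details taken from the study: the mean is removed before the transform, only the non-redundant half of the spectrum is kept (`numpy.fft.rfft`), and the reference distribution F is a normal distribution fitted to the tested values.

```python
import numpy as np
from statistics import NormalDist

def power_spectrum(plate, by="rows"):
    # by="rows": unroll row after row (sensitive to column effects);
    # by="columns": unroll column after column (sensitive to row effects).
    order = "C" if by == "rows" else "F"
    seq = np.asarray(plate, dtype=float).ravel(order=order)
    coeffs = np.fft.rfft(seq - seq.mean())
    return np.abs(coeffs) ** 2 / seq.size

def ks_statistic(values):
    # D = max_k |F(v_k) - G(v_k)/N|, with G(v_k) = number of values <= v_k.
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    ref = NormalDist(float(x.mean()), float(x.std(ddof=1)))
    f = np.array([ref.cdf(float(v)) for v in x])
    g = np.arange(1, n + 1) / n
    return float(np.max(np.abs(f - g)))
```

For a 96-well plate unrolled by rows, a bias confined to one column repeats every 12 positions, so its energy appears at spectrum indices that are multiples of 96 / 12 = 8.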
Simulation 1 consisted of the detection of systematic error on a plate-by-plate basis. Artificial HTS data for three different plate sizes (96 wells - 8 rows and 12 columns, 384 wells - 16 rows and 24 columns, and 1536 wells - 32 rows and 48 columns) were first generated. We started by creating basic error-free datasets for which the well measurements followed a standard normal distribution N(0, 1). For all datasets the number of plates was set to 1250 - the same as in McMaster Test assay. Then, we added 1% of hits to each of the generated basic datasets. The hits were added in such a way that the probability that a given well contained a hit was 1%. All the hit values followed a normal distribution with the parameters N(μ − 5 SD, SD), where μ and SD are the mean value and standard deviation of the basic dataset (without hits).
Using these error-free datasets, we generated datasets comprising different types of systematic error, labelled A to E, as reported in Table 1. Systematic error was added only to some of the assay rows (columns, wells). The number of rows (columns, wells) affected by systematic error as well as the indexes of the affected rows, columns and wells were determined randomly for each considered dataset. Six types of error-affected sets were produced for each error-free dataset by varying the standard deviation of systematic error. The following values of the systematic error standard deviation were used: 0, 0.6 SD, 1.2 SD, 1.8 SD, 2.4 SD and 3.0 SD, where SD is the standard deviation of the basic dataset. The t-test and K-S test were then applied to error-affected data. Both tests produced a binary result for each row and column of each plate: Systematic error was detected or not detected in this row or column. The output was then compared to the information from the data generation phase to determine whether the result of the test was correct.
Cohen's kappa coefficient was computed as
$$\kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)},$$
where Pr(a) is the relative observed agreement among raters (i.e., statistical tests in our study) and Pr(e) is the hypothetical probability of chance agreement. If the raters are in complete agreement, then κ = 1. If there is no agreement among the raters, other than what would be expected by chance, then κ ≤ 0.
In our HTS context, Pr(a) and Pr(e) were calculated as follows:
$$\Pr(a) = \frac{TP + TN}{P (N_R + N_C)} \quad \text{and} \quad \Pr(e) = \frac{(TP + FP)(TP + FN) + (TN + FP)(TN + FN)}{\left(P (N_R + N_C)\right)^2},$$
where P is the number of plates in the assay, $N_R$ and $N_C$ are, respectively, the number of rows and columns per plate, TP (true positives) is the sum of the numbers of rows and columns where systematic error was added during the data generation and then detected by the test, FP (false positives) is the sum of the numbers of rows and columns where systematic error was not added but detected by the test, TN (true negatives) is the sum of the numbers of rows and columns where systematic error was not added and not detected by the test, and FN (false negatives) is the sum of the numbers of rows and columns where systematic error was added but not detected by the test.
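As a small worked illustration, Cohen's kappa can be computed from the four counts with the standard two-rater formula. The function name is hypothetical. The total TP + FP + TN + FN plays the role of P × (N_R + N_C), since every row and column of every plate is classified exactly once.

```python
def cohen_kappa(tp, fp, tn, fn):
    total = tp + fp + tn + fn
    pr_a = (tp + tn) / total                      # observed agreement
    pr_e = ((tp + fp) * (tp + fn) +               # chance agreement on "error"
            (tn + fn) * (tn + fp)) / total ** 2   # chance agreement on "no error"
    if pr_e == 1.0:
        return float("nan")                       # kappa is undefined
    return (pr_a - pr_e) / (1.0 - pr_e)
```

For example, with TP = 20, FP = 5, FN = 10 and TN = 15, the observed agreement is 0.7, the chance agreement is 0.5, and κ = 0.4.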
Since datasets of types C and E did not contain row or column systematic error, the sensitivity and Cohen's kappa coefficient of both competing statistical tests for these data were undefined (i.e., TP = FN = 0 for these data types).
The kappa coefficient curves in Figures 2, 3 and 4 show that the t-test clearly outperforms DFT followed by the K-S test for all selected sizes of systematic error, significance levels and plate sizes. The accuracy of the t-test grows as the size of systematic error increases. It also grows slightly as the plate size increases. The kappa coefficient of the K-S test remains very low and usually varies between 0.0 and 0.1, thus suggesting a very poor systematic error recovery by this test. Figures 13SM, 14SM and 15SM indicate that the success rate of the t-test is largely independent of the systematic error variance and remains very steady for all tested types of systematic error and plate sizes. In contrast, the success rate of the K-S test decreases as the standard deviation of systematic error increases. The performance of the K-S test is also affected by the size of the plate (Figures 2, 3 and 4). The K-S test success rate decreases significantly, and often falls below 50%, for larger plates (Figure 15SM). The chosen significance level α affects the accuracy of both statistical tests. For instance, the use of α = 0.1 generally causes a decrease in the kappa coefficient (a decrease of 0.2 on average, see Figures 2, 3 and 4) and in the success rate (a decrease of 10% on average, see Figures 13SM, 14SM and 15SM) of the t-test, when compared to α = 0.01. The sensitivity charts (Figures 1SM, 2SM and 3SM) show that an increase in the variance of systematic error leads to an increase in the sensitivity of both tests. In terms of sensitivity, the t-test outperforms the K-S test for all data types and all sizes of systematic error, the only exception being large plates tested with the significance level α = 0.1 (Figure 3SM).
Similarly to real HTS assays, our artificially generated datasets had systematic error in only a few rows and/or columns. They contained many negative and only a few positive samples. Such an imbalance between positive and negative samples implies that the overall accuracy of the tests will depend much more on the test specificity than on its sensitivity. Figures 4SM, 5SM and 6SM confirm this observation - most of the specificity charts resemble the corresponding success rate charts (see Figures 13SM, 14SM and 15SM).
The second simulation, Simulation 2, consisted of the detection of systematic error on the hit distribution surfaces. The recommendation to use statistical tests to examine hit distribution surfaces of experimental HTS assays was first formulated for the χ² test. In Simulation 2, we also considered artificially generated 1250-plate assays with plates of three different sizes (i.e., 96-, 384- and 1536-well plates), with the measurements following the standard normal distribution. From every basic dataset we generated 6 error-free datasets comprising 0.5%, 1%, 2%, 3%, 4% and 5% of hits. All the hit values followed a normal distribution with the parameters N(μ − 5 SD, SD). Using the error-free datasets, we generated assays containing different types of systematic error (i.e., from A to E). Systematic error, added to some of the assay rows (columns, wells) only, followed the normal distribution with a mean value of 0 and a standard deviation of 1.2 SD. For each such assay, we calculated its hit distribution surface for the hit selection threshold of μ − 3σ. Then we applied, in turn, the t-test, the K-S test and the χ² goodness-of-fit test to detect the presence of systematic error.
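A hit distribution surface counts, for every well location, how many plates of the assay have a hit in that location. A minimal NumPy sketch follows. It assumes a global μ − kσ threshold computed over the whole assay, which is an assumption since the text does not say whether the threshold was global or plate-wise. The function name and the parameter `k` are hypothetical.

```python
import numpy as np

def hit_distribution_surface(assay, k=3.0):
    # assay has shape (plates, rows, columns); inhibition assay, so hits are low values.
    threshold = assay.mean() - k * assay.std(ddof=1)
    return (assay < threshold).sum(axis=0)   # hit count per well location
```

In the absence of systematic error the counts should be roughly equal across the surface, which is what the tests applied to the surface examine.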
The kappa coefficient curves presented in Figures 5, 6 and 7 illustrate that the t-test clearly outperforms the χ² goodness-of-fit test as well as the combination of DFT and the K-S test for all selected sizes of systematic error, significance levels and plate sizes. The accuracy of the t-test generally grows as the size of systematic error increases, but this trend is not as steady as in Simulation 1: the curve's minimum is not always associated with the lowest systematic noise (e.g., see cases c and d in Figure 5). The kappa values for the χ² and K-S tests usually vary between 0.0 and 0.25, thus suggesting a poor systematic error recovery by both of them. As in Simulation 1, the success rate of the t-test is largely independent of the systematic error variance (Figures 16SM, 17SM and 18SM). Moreover, the success rate of the t-test varies between 90 and 100% in most of the simulated experiments. At the same time, the accuracy of the K-S test is extremely low in almost all of the considered situations. The success rate analysis of the χ² goodness-of-fit test suggests that this test follows different patterns for different types of data. For datasets of types D and E, whose hit distribution surfaces did not contain systematic error, the accuracy of the χ² test is very close to that of the t-test (Figures 16SM, 17SM and 18SM, cases d, e, i and j). However, for the datasets that contained row and/or column systematic error and well systematic error, the success rate of the χ² goodness-of-fit test is significantly lower than that of the t-test (Figures 16SM, 17SM and 18SM, cases a to c and f to h) and shows a tendency to deteriorate when the percentage of hits in the data increases. The sensitivity patterns shown in Figures 7SM, 8SM and 9SM demonstrate that the sensitivity of the three statistical tests grows as the percentage of hits contained in the data increases.
Similarly to Simulation 1, choosing a bigger value of α led to a decrease in the accuracy of all tests.
As a final step in our study we applied the three discussed systematic error detection tests to real HTS data. We examined the impact that the presented methodology would have on the hit selection process in the McMaster Data Mining and Docking Competition Test assay. Similarly to Simulations 1 and 2 carried out with artificial data, we performed two types of analysis. First, we studied the raw HTS measurements, and then calculated and analyzed the hit distribution surfaces of Test assay.
Number of rows, columns and plates (where at least one row or column contains systematic error) of McMaster Test assay in which the t-test reported the presence of systematic error, depending on the α parameter
The obtained results suggest that the number of positives for the row and column effects is almost exactly what we would expect by chance (e.g., approximately 1% when we used α = 0.01, 5 % when we used α = 0.05, etc.). This means that there is no statistical evidence of bias for columns and rows in McMaster Test assay.
Number of hits selected in McMaster Test assay for the μ- 3 SD threshold after the application of the B-score correction, depending on the α parameter
Number of hits selected in McMaster Test assay for the μ- 2.29 SD threshold (i.e., threshold used by the McMaster competition organizers to select the 96 original average hits) after the application of the B-score correction, depending on the α parameter
In our second experiment, we computed and analyzed the hit distribution surfaces of McMaster Test assay for the hit selection thresholds μ − 3 SD and μ − 2 SD. We assessed the presence of systematic error in the assay by applying the three discussed systematic error detection tests: the t-test, the K-S test and the χ² goodness-of-fit test. All three tests detected the presence of systematic error in both surfaces for both considered significance levels, α = 0.01 and 0.1. While the hit distribution surface is useful for detecting the presence of overall bias, it does not capture the variability of the bias on a plate-by-plate basis.
Number of hits selected in McMaster Test assay for the μ- 3 SD and μ- 2.29 SD thresholds after the application of the Well Correction method
In this article we discussed and tested three methods for detecting the presence of systematic error in experimental HTS assays. We conducted a comprehensive simulation study with artificially generated HTS data, constructed to model a variety of real-life situations. The variants of each dataset, comprising different hit percentages and various types and levels of systematic error, were examined. The experimental results show that the performance of the methods depends on the assay parameters - plate size, hit percentage, and type and variance of systematic error. We found that the simplest and computationally fastest method, the t-test, outperformed the Kolmogorov-Smirnov (K-S) and χ² goodness-of-fit tests in most practical situations. The t-test demonstrated high robustness when applied to a variety of artificial datasets. The success rate of the t-test was, in most situations, well above 90%, regardless of the plate size, noise level and type of systematic error, while the values of Cohen's kappa coefficient computed for this test suggested its superior performance, especially in the case of large plates and a high level of systematic noise. We can thus recommend the t-test as a method of choice in experimental HTS. In contrast, the Discrete Fourier Transform followed by the K-S test, advocated in some works [20, 21], yielded very disappointing results. Moreover, the latter technique required a lot of computational power but provided the worst overall performance among the three competing statistical procedures. The K-S test can still be used to examine HTS data located in small plates (i.e., 96-well plates), but we strongly recommend not using it for the analysis of large plates (i.e., 384- and 1536-well plates) and hit distribution surfaces. The main reason for such a disappointing performance of the K-S test is that it was applied, as recommended, to data already transformed by the Discrete Fourier method.
Figure 19SM presents an example of data from one of the simulated 96-well plates before and after the application of the Discrete Fourier Transform. The raw data followed a normal distribution and contained random error only (i.e., systematic error was not added). The raw data did not deviate from the normal distribution, as shown both graphically (Figure 19SMa) and by the K-S test (KS = 0.03, p = 0.5). However, after the application of the Discrete Fourier Transform, the data deviate from normality, as shown in the graph (Figure 19SMa) and by the K-S test (KS = 0.06, p = 0.0018). The third method, the χ² goodness-of-fit test, can be employed to assess hit distribution surfaces for the presence of systematic error. In general, its performance was lower than that of the t-test and was very sensitive to the type of systematic error as well as to its variance. The χ² goodness-of-fit test could be recommended, especially to analyze HTS assays with small plate sizes, but we suggest carrying out the t-test as well to confirm its results.
In addition to the experiments with simulated data, we applied the three discussed systematic error detection tests to real HTS data. Our goal was to evaluate the impact of systematic error on the hit selection process in experimental HTS. The obtained results (see Tables 2-5 and Figure 8) confirm the following fact: if raw HTS data are not treated properly to eliminate the effect of systematic error, then many of the selected hits (e.g., about 30% of hits in the case of McMaster Test assay, as reported in Table 5) may be due to the presence of systematic error and, at the same time, many promising compounds may be missed during hit selection. Special attention should be paid to checking the results of aggressive data normalization procedures, such as B-score, which could easily do more damage by introducing biases into raw HTS data and, therefore, lead to the selection of many false positive hits even in situations when the data do not contain any kind of systematic error.
Our general conclusion is that a successful assessment of the presence of systematic error in experimental HTS assays is achievable when the appropriate statistical methodology is used. Namely, the t-test should be carried out by HTS researchers to pre-process raw HTS data. This test should help improve the "quality" of selected hits by discarding many potential false positives and suggesting new, and potentially real, active compounds. The t-test should be used in conjunction with data correction techniques such as Well correction [5, 6], when row or column systematic error (detected by the test) repeats across all plates of the assay, and B-score or trimmed-mean polish score, when systematic error varies across plates. Thus, we recommend adding an extra preliminary systematic error detection and correction step in all HTS processing software and using consensus hits in order to improve the overall accuracy of HTS analysis.
The authors thank the Natural Sciences and Engineering Research Council of Canada (NSERC) and Nature and Technologies Research Funds of Quebec (FQRNT) for supporting this research. We also thank Professor Jean-Jacques Daudin and two anonymous referees for their helpful comments. All authors read and approved the final manuscript.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.