### Data set

To develop experimental and computational methods, we produced a benchmark data set from cultured *Drosophila melanogaster* cells (S2 cells). The phenotypic readout, after five days of co-RNAi incubation of cells in 384-well plates, was total ATP content, which served as a measure of overall cell viability [25].

All pairwise interactions between 16 different genes were assayed: 8 genes with a previously characterised role in cell-cycle regulation and 8 genes selected randomly from the transcriptome by use of a computer random number generator. The 8 random genes happened to contain a few well-known multifunctional genes, including the cell cycle regulator Rbf. Gene ontology (GO) annotation terms for the 16 genes are provided in Additional File 1: Table S**1**.

The cells were incubated with all 16 _{*} 15/2 = 120 pairwise combinations of dsRNAs targeting these genes. The experiment was performed with two biological replicates, using different passages of the cells; each of these contained 10 technical replicates. Hence, the data set consists of 20 measurements for each dsRNA combination and 2,400 measurements in total.

We adapted a *criss-cross* design [26]. To this end, two different 384-well stock plates were prepared: one *row plate*, where each of the single dsRNAs occupied a full row of wells, and one *column plate*, where each of the dsRNAs was placed into a full column. By combining reagents from the row plate with those from the column plate, each pairwise combination of dsRNAs was obtained twice. For each of the two biological replicates, five plates were incubated and analysed.

### Adjustment for plate effects and quality assessment

We applied plate normalisation [27]

{y}_{pi}={y}_{pi}^{pre}-{\widehat{\mu}}_{p},

(3)

where {y}_{pi}^{pre} was the logarithm-transformed (base 2) luminescence intensity of the *i*-th well in plate *p* and {\widehat{\mu}}_{p} was a plate-specific correction coefficient. Methods for computing {\widehat{\mu}}_{p} from the data need to be adapted to the experiment [28]. Here, we used the midpoint of the shorth of all values from plate *p* for wells with co-RNAi, but not from control (Additional File 2: Figure S1). The shorth of a distribution is the shortest interval that contains half of the data; its midpoint can serve as an estimator of the mode of the distribution, and the estimate is generally less affected than, e.g., mean or median by skewness or outliers in the data. Diagnostic plots for quality assessment showed no significant spatial artifacts (Additional File 3: Figure S2). Reproducibility was assessed by scatter plots between replicates (Additional File 4: Figure S3). In Section **Choice of scale and neutrality function**, we provide reasons for the choice of the logarithm transformation for the analysis of this data and discuss alternative approaches.

### Estimating main effects and interactions

Next, we obtained estimates of interaction effects *w*_{
ij
} from the co-RNAi data. First, suppose that the baseline value *y*_{0}, the double knock down phenotype *y*_{
ij
} and the single-gene effects *m*_{
i
} are known. Then, for *i* ≠ *j*, the interaction term *w*_{
ij
} can simply be obtained from Equation (2) by solving *w*_{
ij
} = *y*_{
ij
} - *y*_{0} - *m*_{
i
} - *m*_{
j
}. Note that third and higher order terms are not present in a pairwise co-RNAi experiment. Now let *y*_{
ijk
} be the *k*-th replicate measurement for the combinatorial knock down of genes *i* and *j*, and suppose that we have estimated {\u0177}_{0k} as the baseline phenotype in replicate *k* and {\widehat{m}}_{ik} as the single-gene effect of gene *i* in replicate *k*. We used the data version of the above relationship to motivate the estimator

{\u0175}_{ij}=\frac{1}{K}\sum _{k=1}^{K}{\epsilon}_{ijk},\phantom{\rule{1em}{0ex}}where

(4)

{\epsilon}_{ijk}={y}_{ijk}-{\u0177}_{0k}-{\widehat{m}}_{ik}-{\widehat{m}}_{jk}.

(5)

To obtain the main effect estimates {\widehat{m}}_{ik} and the baseline estimates {\u0177}_{0k}, we minimised the sum of squares,

\left({\u0177}_{01},\phantom{\rule{2.77695pt}{0ex}}\dots ,{\u0177}_{0K},{\widehat{m}}_{11},\phantom{\rule{2.77695pt}{0ex}}\dots ,{\widehat{m}}_{NK}\right)=arg\phantom{\rule{0.3em}{0ex}}min\sum _{i,j=1}^{N}{\u0175}_{ij}^{2}.

(6)

In the rest of this section, we provide the motivations for this approach, contrast it with plausible alternative choices, discuss implementation and describe variations that may be useful for other applications.

#### Identifiability

In order to make the solution of criterion (6) unique, a further condition is necessary, for instance {\sum}_{i}{\widehat{m}}_{ik}=0 for each *k*. The choice of this condition does not affect the estimated interactions {\u0175}_{ij} and serves mainly for computational convenience.

#### Parameterisation

In general, equations (5) and (6) allow different values for the baseline {\u0177}_{0k} and the single gene effects {\widehat{m}}_{ik} for different replicates *k*. Here, we allowed for different {\u0177}_{0k} and {\widehat{m}}_{ik} between the two biological replicates, but set them equal within the 10 technical replicates each. Hence, 32 parameters were fitted, for 16 dsRNAs in 2 biological replicates. We have found it useful to allow different values for the two biological replicates in order to adjust for slight, but detectable variations in experimental parameters such as number of cells seeded, incubation time, dsRNA reagent concentration and transfection efficiency. Allowing the parameters {\u0177}_{0} and *m*_{
i
} to pick up some of this - unintended, but hardly avoidable - variability in the data, we expect to have arrived at better estimates of the interaction effects *w*_{
ij
}. Such an approach is analogous to using different, matched normalisation controls in different parts of an experiment. If we had been confident that across replicates these values were actually the same, then we could have set them to be equal in (5) and (6), thus reducing the number of parameters to 16 and slightly increasing the precision of the estimates. On the other hand, if model fit diagnostics had indicated that allowing these parameters to even be different for each technical replicate would fit the data substantially better, a more highly parameterized model with 320 parameters could have been fit, with a loss in estimation precision.

#### Use of single-gene and non-treated controls

Criterion (6) only uses the data from the co-RNAi wells for the estimation of the baseline and the single gene effects. It does not require, or use, measurements from single gene RNAi treatments or untreated control wells. Our rationale for doing so was as follows. Ideally, the values obtained through minimisation of (6) and from direct measurements in controls should be the same; however, more data points were available for evaluating criterion (6) than there are control measurements, hence the former provided better precision and was more robust against individual outlier data points. Furthermore, if there were a difference between baseline and single gene effect estimates obtained in the two different ways, then we would prefer the estimates from criterion (6), since -by construction- they lead to more conservative estimates of interaction effects. In fact, criterion (6) can be motivated by the assumption that interactions should be rare, and by the aim of explaining as much possible of the observed variation in *y*_{
ijk
} through baseline and single gene effects, and only considering what remains from that as interactions. If available, we propose using the data from single gene and untreated control wells for quality control: compare them to the estimates {\u0177}_{0k} and {\widehat{m}}_{ik} obtained from (6), and deviations would indicate an experimental problem that needs to be attended to before further interpretation of the data.

#### Robustness

Instead of minimising the sum of squares (6), a robust variant could be chosen such as minimising the {\mathcal{L}}_{1}-norm or a trimmed sum of squares (LTS regression; [29]), or using M-estimation [30]. For our data, exploration of these methods did not lead to substantially different results. In other situations, however, such variants may be appropriate, for example, when the proportion of interactions is not small, or when some of them are large, and we advise researchers to check their data for such effects.

### Assessing statistical significance

For each pair of genes (*i*, *j*) we applied the ordinary *t*-statistic to the residuals *ε*_{
ijk
} across the 20 measurements (*k* = 1,..., 20) to weigh the evidence for the interaction to be non-zero. However, the large number of replicates is a peculiarity of our benchmark dataset, and cannot be expected in general. With few replicates, the ordinary *t*-test has two problems: first, due to the small number of degrees of freedom, the test will have little power, leading to a loss of discoveries. Second, it becomes likely that, by chance, small estimates of the variance (the denominator of the *t*-statistic) are obtained even if the true variance is larger, leading to a fraction of discoveries with small effect sizes that would not be replicated if the experiment were repeated. To address both of these problems, in the context of microarray analysis the moderated *t*-test has been proposed [31, 32]; we investigated the same approach here.

We wanted to assess the relative performance of three different ranking methods: the *p*-value from the ordinary *t*-test, the *p*-value from the moderated *t*-test, and the average effect {\u0175}_{ij} (that is, simply the numerator of the *t*-statistic). For this, we set up a benchmark set of true interactions (positives) and non-interactions (negatives). We considered those pairs true positives that had, on the full data set with 20 replicates, a nominal *p*-value of less than 0.001 in the ordinary *t*-test; all other pairs were considered true negatives.

To simulate an application setting with few replicates, we applied the ranking methods to the data from a single plate, hosting two technical replicate measurements per gene pair. We applied a set of thresholds, with decreasing stringency, to the three ranking methods and obtained the corresponding hit list. We then, for each hit list, computed the true positive rate (TPR) as the ratio between the number of true (as defined by the reference) hits and the total number of positives, and the false positive rate (FPR) as the ratio between the number of false (as defined by the reference) hits and the total number of negatives. This resulted in one ROC curve per plate.

Figure 1 indicates clear benefits from using the moderated *t*-test. The ordinary *t*-test generally performs poorly in such situations with limited amount of replication. When only two technical replicates from the same plate were used (panel a), the variance of replicates was constant or close to constant across the different interaction pairs. Hence, the variance estimates used in the moderated *t*-test were the same or almost the same for all gene pairs, making the resulting moderated *t*-statistic proportional -and therefore equivalent- to the numerator of the *t*-statistic (the average effect size {\u0175}_{ij}), and no significant difference in performance was seen between the two methods in this setting. However, when variation is higher, as is the case when also biological replicates were considered (Panel b), the moderated *t*-test was able to pick up some of the gene-pair dependence of the variance and outperformed the average effect size {\u0175}_{ij}.

The reference that we used for this benchmark, as described above, is not a *ground truth* in a strict sense, and the TPR and FPR values may be biased because of errors in the reference. However, for the purpose of method comparison, the relative positions of the ROC curves of the three ranking approaches are still informative as long as the reference set is enriched for ground truth compared to a random set; in effect, the biases cancel out each other. This pragmatic use of the ROC has also been called *pseudo-ROC* analysis [33].

### Resulting interactions

The detected interactions are summarised in Figure 2. The figure uses three different visualisation tools: the heatmap representation of the interaction matrix {\u0175}_{ij}, the threshold graph representation of the same matrix, and the threshold graph representation of the correlation matrix

{c}_{ij}=cor\left({\u0175}_{i}.,{\u0175}_{j}.\right),

(7)

where {\u0175}_{i} denotes the *i*-th row of the matrix {\u0175}_{ij}, i.e. the interaction profile of gene *i* across all other tested genes. Most interactions are seen within the cell-cycle set of genes; there are few interactions between the cell-cycle related genes and the randomly selected set of genes. Among the randomly selected set, Rbf is interacting strongly with Rho1. Rbf is in fact a known cell cycle regulator [34]. The DroID database [35] reported seven interactions between the cell cycle related genes. Of those, three were found significant in our data: CSN4-CSN5 [36], CSN3-CSN4 (predicted) and CSN3-CSN5 (predicted). Among the novel interactions, Rho1 showed negative interactions with CSN3, CSN4 and CSN5 (CSN3-5). Consistent interaction patterns with CSN3-5 are expected as CSN3-5 are three subunits of the COP9 complex.

The directions of the observed interactions are informative: A knock-down of any of the three COP9 subunits resulted in reduced viability. The interactions within CSN3-5 are positive (the negative viability effect of the double knock-down is less severe than the expected effect from the two single knock-downs). One can speculate that knocking down one of the subunits is enough to disrupt the complex and cause a viability effect, and that once the complex is dysfunctional, knocking down another subunit has little additional effect.

The correlation network provides additional insights. It makes evident that the three COP9 subunits have similar interaction profiles and therefore cluster together. An interpretation is that it is the disruption of the COP9 complex, through any of the three subunits, that interacts with the rest of the gene set.

### Scalability - Larger experiments

Future experiments are likely to test a considerably larger number of genes than the experiment discussed here. We analysed a scalability testing (ST) data set, in which a similar experimental design as described above was used to assay cell viability in response to all pairwise interactions of a set of 84 dsRNA reagents. The data preprocessing is shown in Additional File 5: Figure S4 and Additional File 6: Figure S5.

#### Fit diagnostics

Fit diagnostics check how well a data analytical model fits the data by plotting the residual distributions against various explanatory variables. Depending on the viewpoint, they can be used to check the adequacy of a model and to assess the quality of the data. Figure 3 shows the distribution of the residuals *ε*_{
ijk
} against the predicted value {\u0177}_{0k}+{\widehat{m}}_{ik}+{\widehat{m}}_{jk} for the ST data. Trends in this plot would indicate a model misspecification problem. No significant trends were apparent.

#### False discovery rate

Schweder and Spjøtvoll [37] suggested a diagnostic plot of the observed *p*-values that permits estimation of the fraction of true null hypotheses. For a series of *m* hypothesis tests with *p*-values *p*_{
i
}, they suggested plotting

\left(1-{p}_{i},N\left({p}_{i}\right)\right)\phantom{\rule{1em}{0ex}}for\phantom{\rule{2.77695pt}{0ex}}i\in 1,\phantom{\rule{2.77695pt}{0ex}}\dots ,\phantom{\rule{2.77695pt}{0ex}}m,

(8)

where *N*(*p*) is the number of *p*-values greater than *p*. An application of this diagnostic plot to the *p*-values from the moderated *t*-test for the ST data set is shown in Figure 4. If all null hypotheses were true, i.e., no interactions were present, the *p*-values would each be uniformly distributed in [0,1], and the cumulative distribution function of (*p*_{1},..., *p*_{
m
}) would fall close to the line *f*(*x*) = *x*/*m*. In fact, the curve is an approximately straight line with smaller slope within between *x* = 0 and about *x* = 0.5, and then bends upward, indicating that some null hypotheses are not true. The intersection of the dashed line with the vertical axis at *x* = 1 indicates evidence for a number of false null hypotheses around 400 to 500.

### Choice of scale and neutrality function

An analytic expansion like in Equation (2) is always possible, its usefulness, however, and that of the above definition of interactions, depends on the choice of scale of the phenotype variable *y* [23]. Consider, for example, cell number in a cell culture in exponential growth during an incubation time *t*, and suppose that two different genes may be perturbed by RNAi. In the absence of interactions between the two perturbations, we might expect

n\left(t\right)={n}_{0}{e}^{\left(1+{m}_{1}{x}_{1}+{m}_{2}{x}_{2}\right)kt},

(9)

where *n* is the cell number at time *t*, *n*_{0} is the initial cell number, *k* is the growth rate of the unperturbed cells, the indicator variables *x*_{1}, *x*_{2} ∈ {0, 1} indicate whether or not the gene was perturbed by RNAi, and *m*_{1}, *m*_{2} are the two perturbations' individual effects on growth [38]. If we expand *n* - *n*_{0} directly as in Equation (2), we will get a second-order term between *x*_{1} and *x*_{2} simply due to the presence of the exponential function in Equation (9). However, if we first transform the cell number measurements to a logarithmic scale, for instance, *y* = log(*n*/*n*_{wt}), where *n*_{wt} = *n*_{0} exp(*kt*) is the cell number at time *t* without perturbation, then *y* is exactly described by the linear expression (*m*_{1}*x*_{1} + *m*_{2}*x*_{2}) *kt*, which better reflects the fact that the perturbations do not interact. This is the approach we have taken above. In the following, we discuss some points regarding the choice of scale that may need to be considered in different experimental settings.

The exponential growth model (9) may not be appropriate for the whole duration of the experiment. Initially, for small *t*, cells might need some time to recover from an experimental treatment that they were subjected to (such as transfection, seeding) before they go into their optimal growth rate. For larger *t*, cell density might become so large that saturation effects play a role, again decreasing the growth rate.

Furthermore, depending on the measurement setup, in particular for fluorescence or luminescence based measurements, background signal might contribute to the observed data. In the worst case, *k* = *k*(*t*) is a complex time-dependent function, possibly different for different perturbations, and if *n*(*t*) is only observed at one end point *t*, such effects can make it impossible to infer biologically relevant interactions. In some instances, it might be possible to fit a more general growth model (that includes, for instance, an initial lag phase and a background signal).

In some experiments reported in the literature, *biomass production n*(*t*) was monitored over time. This allows measuring the growth rate *n*^{-1}(*t*) *dn*(*t*)/*dt* as a function of time. Typically, this function reaches a maximal value at some time between the start and end time of the observation, and this value is used to quantify the growth phenotype. Relative growth rate can be defined as the dimensionless ratio [39]

\rho =\frac{ma{x}_{t}\frac{dn\left(t\right)}{n\left(t\right)dt}}{ma{x}_{{t}^{\prime}}\frac{d{n}_{wt}\left({t}^{\prime}\right)}{{n}_{wt}\left({t}^{\prime}\right)d{t}^{\prime}}}.

(10)

If the perturbation does not affect growth, then *ρ* = 1. If the perturbation promotes growth, *ρ* will be larger than 1, if it is slowing down growth, *ρ* will be less than 1.

Several authors have considered two perturbations to be interacting if the product of their individual relative growth rates, *ρ*_{1} and *ρ*_{2}, is different from that of the combinatorial perturbation, i. e. if not *ρ*_{12} = *ρ*_{1} *ρ*_{2} [5, 39, 40]. This definition of interaction is somewhat different from that implied by (9). In particular, if exponential growth (9) holds, then (10) simplifies to *ρ*_{1} = 1 + *m*_{1} and *ρ*_{2} = 1 + *m*_{2}, and according to (9), the two perturbation are considered non-interacting if *ρ*_{12} = 1 + *m*_{1} + *m*_{2} = *ρ*_{1} + *ρ*_{2} - 1. For small perturbations, the two definitions are approximately equivalent, since (1 + *m*_{1})(1 + *m*_{2}) = 1 + *m*_{1} + *m*_{2} + *m*_{1}*m*_{2} and *m*_{1}*m*_{2} is negligible if *m*_{1} and *m*_{2} are small compared to 1; for instance, for *m*_{1} = *m*_{2} = 0.1, *m*_{1}*m*_{2} = 0.01.

For strong perturbations, however, they lead to different results. For the present work, we chose (9) because it allows a more coherent treatment of data in which *dn*(*t*)/*dt* is positive for some of the cases (typically, for the wild type) and negative for others (say, for RNAi that knocks down an apoptosis inhibitor). In those cases, *ρ* can become negative, and interpretation of the multiplicative neutrality function *ρ*_{12} = *ρ*_{1} *ρ*_{2} would be difficult.