An equivalence approach to the integrative analysis of feature lists

Background Although a few comparison methods based on the biological meaning of gene lists have been developed, the goProfiles approach is one of the few that are being used for that purpose. It consists of projecting lists of genes into predefined levels of the Gene Ontology, in such a way that a multinomial model can be used for estimation and testing. Of particular interest is the fact that it may be used for proving equivalence (in the sense of “enough similarity”) between two lists, instead of proving differences between them, which seems conceptually better suited to the end goal of establishing similarity among gene lists. An equivalence method has been derived that uses a distance–based approach and the confidence interval inclusion principle. Equivalence is declared if the upper limit of a one-sided confidence interval for the distance between two profiles is below a pre-established equivalence limit. Results In this work, this method is extended to establish the equivalence of any number of gene lists. Additionally, an algorithm to obtain the smallest equivalence limit that would allow equivalence between two or more lists to be declared is presented. This algorithm is at the base of an iterative method of graphic visualization to represent the most to least equivalent gene lists. These methods deal adequately with the problem of adjusting for multiple testing. The applicability of these techniques is illustrated in two typical situations: (i) a collection of cancer-related gene lists, suggesting which of them are more reasonable to combine –as claimed by the authors– and (ii) a collection of pathogenesis–based transcript sets, showing which of these are more closely related. The methods developed are available in the goProfiles Bioconductor package. Conclusions The method provides a simple yet powerful and statistically well-grounded way to classify a set of genes or other feature lists by establishing their equivalence at a given equivalence threshold. The classification results can be viewed using standard visualization methods. This may be applied to a variety of problems, from deciding whether a series of datasets generating the lists can be combined to the simplification of groups of lists. Electronic supplementary material The online version of this article (10.1186/s12859-019-3008-x) contains supplementary material, which is available to authorized users.

between two or more classes, or to predict the class to which a new individual belongs.
In a simplified way, one can usually consider "feature lists" to represent a kind of summary of what is being analyzed. It is important, however not to forget that this is not exempt from criticism.
• First of all, these lists are often formed by multiple elements associated with each main "feature", i.e., several transcripts of a gene, several peptides of a protein or multiple methylation sites of a gene. The way in which multiple features collapse into a single one in order to be summarized (the average, the most variable, etc.) is not exempt from arbitrariness.
• Secondly, and even more importantly, the lists are usually obtained by applying a cut-off value with a certain statistical basis (for example, an adjusted p-value ≤ 0.05), which means that this list may include (or exclude) genes that a reasonable change in the selection criteria might exclude (or include).
Although much has been discussed about these issues, and alternative approaches have been sought, the use of a list as a summary of an experiment is still a very common approach. This is not without foundation from a statistical point of view, where it is generally assumed that a summary may contain less information than all data.

Analysis of individual feature lists
The analysis of gene lists has a long history, probably as long as the analysis of genomic data. Draghici [1] introduced enrichment analysis or over-representation analysis, which selects annotations that appear with a surprisingly (unexpected) high frequency if we take into account how they are distributed among all genes. Mootha [2,3] introduced the gene set enrichment analysis method as an alternative to the analysis of cutoff-based lists. This method analyzes all data (instead of the list) looking for annotations that tend to appear in extreme cases (between genes up-or downregulated) without requiring the list to be cut by a point. Shojaie [4] went one step further and performed the analysis of the lists based on the regulatory network implicitly associated with them. These three methods are nothing more than the first of literally dozens of variants of the same ones that have been developed in the last decade. Khatri [5] is an excellent review of this process. Essentially all these methods share one characteristic i.e. they are focused on the analysis of single feature lists. Despite their relevance, which is why we are mentioning them, their purpose is different from what is discussed in this work: They do not seek to compare lists but rather extract the biological information and therefore will not be discussed here.

Comparison between two or more feature lists
Comparison between lists has a shorter history because, curiously, it is a topic that has attracted less attention than the analysis of individual lists. This is probably for the same reasons that there has been a tendency to perform individual studies rather than to compare or group them: the cost of studies, especially in their initial stage and often their low reproducibility. In addition, many methods or tools for comparing gene lists are based on a well-defined statistical model, which also suggests that this has been a relatively marginal issue. In general, we can differentiate between: (i) methods that compare the composition of the lists, either simply by the identifiers that form them or by the ranges of those in the list; and (ii) methods that project the elements in some other space such as the Gene Ontology or provide other representations of the list such as co-expression networks.
Among the first approaches, we find programs such as GeneVenn [6], BioVenn [7] and VennPainters [8] that perform a visual comparison based on more or less flexible forms of Venn diagrams and are therefore limited in terms of the number of lists that may be compared. There are also applications such as CORal [9], Rank-Rank Hypergeometric Overlap [10] and OrderedLists [11] that are based on determining the degree of overlap of two or more ordered lists. One characteristic of all these methods is that if they perform the comparison visually or by using a statistical model, they do not refer to the biological meaning of the list elements.
In this work, we are interested in methods that, in some way, rely on the biological information in the lists. This is usually done by basing the comparison on the annotations of the genes in biological knowledge databases, such as the Gene Ontology [12,13]. There are several distinct approaches to doing this. Some programs start by conducting an overrepresentation (or gene enrichment) analysis of each list and then the set of enriched GO terms obtained from each list is compared. This is the case, for instance, with the PANTHER web tool [14]. The clusterProfiler Bioconductor package [15] can also be used in a very flexible manner, enabling comparison of two or more gene lists based on the corresponding GO or KEGG enriched terms. A different approach is to rely on semantic similarity measures [16], which are used to compare GO terms or entities annotated with GO terms (in this case gene lists), by leveraging on the ontology structure and properties. The GOSemSim Bioconductor package [17] enables computation of a variety of such measures, which may be used to compare gene lists. Last, the method extended in this paper was introduced in [18,19] with the aim of providing an inferential basis for comparing two gene lists. It is based on the annotations of the lists at a fixed level of the Gene Ontology. The method is implemented in the goProfiles package [20] also available from Bioconductor.

Comparison between multiple lists
As omics studies have become more complex, we are faced with the need and the opportunity to work with several, or even many, gene lists. We may have distinct scenarios for this. For example: • Sometimes researchers want to obtain "as much as possible" from their expensive omics experiments and compare everything vs everything even if some comparisons are not relevant. This may result in dozens of gene lists, which may contain redundant information.
• Sometimes researchers collect gene lists from different studies because they consider these to be about the same biological problem. This is the case with the examples that will be discussed later in the paper: -Cancer-related gene lists (http://www. bushmanlab.org/links/genelists, [21]) -Pathogenesis-based transcript sets (https://www.ualberta.ca/medicine/institutescentresgroups/atagc/research/gene-lists, [22]) • The Molecular Signature Database (MsigDB, [23]) contains thousands of gene lists that might benefit from some type of dimension reduction.
One may want to work on these lists for different purposes which we may classify simplistically as dimension reduction or dimension augmentation.
(i) Dimension reduction here would mean trying to reduce the number of lists: for instance, referring to the previous example, having a smaller number of relevant signatures, or simplifying the exploration of lists that result from some omics experiments. (ii) Dimension augmentation, on the other hand, may mean the possibility of combining datasets associated with gene lists derived from them. In other words, if one can establish that a few lists are equivalent, one may assume that the data that have generated them can also be considered equivalent, or, which is the same thing, can be combined into a single bigger dataset.
In recent years there have appeared several programs intended to allow the comparison of more than two lists. listcompare is a web tool that checks overlap of multiple gene lists (http://www.molbiotools.com/ listcompare.html) without making any use of biological information contained in the lists. Other tools, such as clusterProfiler [15] or ToppCluster [24], perform a comparison based on doing an enrichment analysis of each gene list and then relying on the enriched categories to compare the lists. This comparison is made either descriptively (clusterProfiler) or interactively building a network with the enriched terms (ToppCluster). The former tools, especially clusterProfiler, have the merit that their comparison provides hints on the biological difference between the lists because they are based on enriched GO categories. A drawback, however, is the fact that these comparisons are visual only, with no inferential basis behind them. The method presented in this paper does not directly highlight the categories explaining the differences between the lists but does provide an inferential basis for the comparison, somehow complementing the others.

Difference vs equivalence hypotheses tests
A common misconception among practitioners of statistics is to take up the fact of not rejecting a null hypothesis as a proof of its veracity. The phrase "to accept the null hypothesis", though very common, is a statistical nonsense. If anything (to some extent) is proven in an hypotheses test, it is the alternative hypothesis when the null hypothesis is rejected, but no inferences can be drawn from not rejecting it. Posterior power arguments, say that to try to infer the veracity of a nonrejected null hypothesis from arguments of its (high) power computed from the parameter estimates may lead to paradoxes [25]. Thus, if the objective of an study is to prove similarity, e.g. to decide if the biological information provided by two gene lists is similar (but not necessarily exactly equal), from an hypotheses testing perspective the right approach would be to contrast a null hypothesis of relevant dissimilarity against an alternative of irrelevant dissimilarity (not necessarily null dissimilarity, i.e., exact equality, which may lead to an undemonstrable statement). In practice, this may be implemented by choosing a measure of dissimilarity and establishing a threshold of "acceptable" dissimilarity. Then, the null hypothesis should specify that the true measure of dissimilarity is not less than while, conversely, the alternative should specify that the dissimilarity is less than .

Objectives
In this paper we present a statistical approach that allows for: (i) given a predetermined equivalence threshold , to check the biological equivalence of the set of lists; and (ii) for a set of given lists, to determine the minimum equivalence threshold that allows the equivalence of the set to be declared. With this approach, the efficiency of the statistical tests is evaluated in a simulation study and a graphic representation is provided to facilitate the interpretation of the results. The application is illustrated using two publicly available sets of gene lists (Kidney Gene Lists and Cancer Gene Lists). The goProfiles Bioconductor package has been extended to include the new capabilities of the method.

Previous work: statistical inference for functional profiles
The methods proposed in this work rely on biological knowledge to compare two or more gene lists. In practice, this means that biological annotations are used to characterize each list in such a way that a method for comparing these characterizations can be used. The method, known as goProfiles, has been introduced elsewhere ( [18,19]) and is reviewed briefly below.
Given a list of n features annotated in the Gene Ontology (GO), a reasonable way to characterize this list is to count how many features are annotated in each category (see Figure 2 in [18]). This yields a frequency table that we call functional profile describing how these n features are distributed between C 1 , C 2 , . . . , C s GO categories. However, given that a feature can be annotated in more than one category, the frequencies obtained may add up to more than n when counts are considered, or to more than 1 if we rely on proportions. This complicates the analysis because, for example, a chi-squared approach cannot be used to model or compare these frequency tabulations. This was solved in [18] by introducing the ideas of "expanded" and "contracted" profiles. It is very common for a given feature to be annotated in several categories. An expanded profile allows multiple annotations to be transformed into simple ones so that each feature is annotated in one and only one category. This is possible by defining this profile on the Cartesian product partition C, C × C, . . . , C × · · · × C s excluding symmetric products. With this formulation, the expanded profile is the vector of probabilities where each p i 1 ,i 2 ,...,i k , k ≤ s, describes the probability of simultaneous annotation in a possible combination of categories so that a feature will always be annotated in one and only one category of such expanded profiles. Alternatively, the contracted profile, or simply profile for short, is the vector of probabilities defined on the original categories.
where p i. describes the probability of annotation in category C i , i = 1, ..., s. Expanded and contracted profiles are related by a simple linear transformation (a "contraction") that turns an expanded into a contracted profile. Given two lists of features of size n and m respectively, the dissimilarity between their associated functional profiles P, Q can be evaluated by the squared Euclidean distance (which in fact is not a true metric distance, just a dissimilarity, as it does not verify the triangular inequality) between them: All quantities can be naturally estimated by their relative frequencies. Salicrú et al. [19] obtained the asymptotic distribution of the estimated distance between two profiles and relied on this result to derive hypothesis tests for comparing two feature lists.
where the "hat" notation stands for the sample profiles (the unknown probabilities substituted by the corresponding relative frequencies) and the general expression of σ PQ is given in [19].
Besides classical comparison tests (to establish possible "difference" against a null hypothesis of complete equality), equivalence tests (e.g. [26]) appear to be the natural approach to testing when the goal is establishing (near) equality. In accordance with the global distance-based approach of the present paper, the problem can be stated as follows: Given two population profiles P and Q, instead of testing one aims to test: where "near equality" is stated in terms of a "practical equivalence value", > 0. Choosingthethreshold isadifficultquestion in equivalence testing -and a subject commonly but wrongly ignored in "differencetesting" (4), asa "statisticallysignificant" difference (from zero) does not mean "biologically (clinically, etc.) interesting" difference. In general, the chosen value should be an expert choice. But as can be seen below, the method finally proposed in the present paper does not depend on a given choice.
The so-called "Interval Inclusion Rule" (e.g. as stated in [27] under a distance-based approach) provides a general way of solving equivalence testing problems. It can be stated as follows (again adapted to the distance-based approach): The above criterion defines a test with significance level α. A consequence of (3) is that is an asymptotically valid upper confidence level, wherê sed = σPQ 1 n + 1 m stands for the estimated standard error of d P ,Q and z 1−α for the 1 − α quantile of a standard Gaussian distribution, N(0, 1).
For convenience and in line with further developments in this paper, the criterion to declare equivalence may be restated in terms of p-values p( ) as: Declare equivalence if: where stands for the standard normal cumulative distribution function and The above notation tries to highlight that both the p-value and test statistic TPQ( ) depend on the chosen equivalence limit, a crucial matter in the following sections.

Equivalence test based on functional profiles for h ≥ 2 comparisons
Let L 1 , ..., L s be s distinct lists, e.g. coming from s studies on a similar subject. We wish to do a certain number, h ≤ s × (s − 1)/2, of previously specified comparisons (equivalence tests) between these lists. For a given equivalence threshold , this can be done by: • first performing every selected comparison, • and then doing a multiple testing adjustment in order to deal with testing multiplicity.
There are many approaches to accounting for multiplicity. If the number of comparisons h is not big (e.g. at most some tens, note that we are dealing with feature lists comparisons, not with comparing the individual features coming from a given study), a reasonable approach is to control the "family wise error rate" (FWER), for instance, using the Holm-Bonferroni criterion: • Given an equivalence limit , compute the p-values • Sort the p-values in ascending order: • The null hypothesis of non-equivalence (i.e. existence of a "relevant" functional profile dissimilarity between lists) is rejected for all those comparisons l = 1, ..., k − 1 such that p l ( ) < p k ( ) where k is the smallest value satisfying that p (k) ( ) > α/(h + 1 − k).
In the case of a great number h of comparisons, possibly other criteria like the false discovery rate (FDR) for multiple testing corrections would be the option to choose, but the general idea is still the same.

Algorithm
As has been stated before, choosing the equivalence limit may be a problematic task. But here we take what would be a complementary approach: instead of previously fixing , we will let it vary in order to give a numerical value aiming to measure what would be a statistically significant equivalence between lists, prone to graphical representation and possible interpretation.
The first step is to build a dissimilarity matrix between lists based on the threshold that makes them equivalent: The resulting dissimilarity matrix may be the input for an adequate representation method.

Visualization
Any data representation method accepting dissimilarity matrices as a starting point can be applied using the ij matrix. For example, it may be used to construct a dendrogram showing the equivalence levels at which sets of lists may be considered significantly equivalent. Although many clustering methods may be used to build the dendrogram, as a first approach, the "maximum distance" or "complete linkage" method seems to be a reasonable and useful choice. The maximum distance method defines the distance between two groups as the distance between their two farthest members. For a given set of lists the construction of dendrograms using this method is in accordance with the declaration of equivalence of the most extreme lists, and consequently with the declaration of equivalence of all pairs of lists.

Simulation study
An extensive simulation study was performed to investigate the properties of the equivalence test. Several simulation scenarios were created to cover a variety of situations found in practice. These scenarios were considered: (i) the number of common (n 0 ) and distinct (n 1 = n − n 0 , m 1 = m − n 0 ) features in each list (ii) the number of GO nodes on which the profiles are based and (iii) the distribution of annotations along the nodes in both profiles. The simulated scenarios were generated crossing all levels of the factors described above: Each simulation consisted in repeatedly iterating the following steps: 1. Independently generate three expanded profiles P 1 , Q 1 and R, from a multinomial distribution of sizes n 1 , m 1 and n 0 and respective probability parameters: Here, n 1 , m 1 and n 0 stand for the total number of annotated genes in each generated expanded profile and k stands for the allowed maximum number of simultaneous annotation, k ≤ s, of the s annotated GO items 1 . Remember that n 1 stands for the genes exclusively in the first list being compared, m 1 for the genes only in the second list and n 0 for the genes common to both lists, if any. 2. Build the finally compared profiles: P = P 1 + R and Q = Q 1 + R, with n = n 1 + n 0 and m = m 1 + n 0 . 3. "Contract" P and Q to obtain the basic profilesP andQ. 4. Perform the equivalence test between these profiles and collect all desired statistics (e.g., if equivalence was declared or not in this single simulation iteration).
For simplicity, in this section we will not continue using the above notation for probability vectors like P 1 (which on the other hand is useful to highlight simultaneous annotation of GO items) and we will simply designate them as p 1 , . . . , p g , where g corresponds to the vector length. In the simulations presented here, each one of its components p i was obtained from a geometric model dependent only on a single parameter 0 < θ < 1: Obviously, making the simulated profiles dependent on a single parameter in this way greatly restricts the possible scenarios to be simulated. On the other hand, it allows for an important simplification in order to trace the probability of declaring equivalence as a single function of the simulated true squared Euclidean distance: given a value of θ, P 1 was obtained from θ, Q 1 was obtained from 1 − θ 1 For large values of s, e.g., s = 50 or more, it is advisable to take k < s to avoid managing extremely large profile vectors (say, q i = (1 − θ)θ i−1 /(1 − θ g )) and R from a fixed θ 0 value, independently of θ.
For fixed n, m, n 0 , θ 0 , s and k values, the squared Euclidean distance is a function of θ, d = D(θ). Given a set of desired d values (some less than , i.e. in a scenario of false null hypothesis in the equivalence test; some greater than , i.e. true null hypothesis; and one with d = , just on the limit of equivalence), numerically solving the equation D(θ) = d we can obtain the required θ values and thus a set of profiles to simulate these "population" distances. Then simulations may be performed in order to obtain the probability of rejecting the null hypothesis of non-equivalence, i.e. to obtain the power curve of the test, as a function of the d parameter.
A well-behaved (unbiased) equivalence test should reject the null hypothesis of nonequivalence with a probability greater than α for parameter values d < , with a probability smaller than α for d > and, ideally, with a probability of α when d = . Figures 1, 2 and 3 display the probability of rejecting the null hypothesis of nonequivalence (i.e., the probability of declaring equivalence) as a function of the true simulated squared Euclidean distance. They correspond to three sample sizes and the threshold of equivalence scenarios. They show that the profiles equivalence test is generally valid. As a consequence of the Bonferroni-Holm method validity (as a way to protect FWER), the equivalence test for more than two lists is also generally valid. For very small equivalence limits (near zero), there is some type I error inflation, with probabilities slightly over the nominal significance level, e.g. values around 0.06 for significance levels of 0.05. As supplementary material (see Additional file 5) we provide three bar plots representing the probabilities of false negatives and false positives corresponding to Figs. 1, 2 and 3, respectively. When the true simulated distance is d < , not rejecting the null hypothesis (not declaring equivalence) corresponds to a false negative. When d ≥ , declaring equivalence is a false positive.

Software
The analysis of functional profiles, that is, the computation of profiles and paired tests of difference or equivalence has been implemented in the R package goProfiles available in Bioconductor [20].
The current version of the package (1.44 or higher) implements the capabilities described in this paper. That is, given a set of gene lists -provided as Entrez identifiers-one can: (i) compute a dissimilarity matrix between the corresponding profiles at a given level of any ontology; (ii) apply the algorithm described in the previous section to determine the equivalence level at which any pair of lists can be considered equivalent; and (iii) visualize the associated dendrogram with the chosen method. Fig. 1 Power curve of the equivalence test, as a function of the true squared Euclidean distance. Balanced case of two gene lists of size 200 with 20 genes in common. Equivalence limit at = 0.25. The null hypothesis of the equivalence test states that the true squared Euclidean distance, d, is greater than or equal to , that is to say, that both lists are sufficiently dissimilar according to the limit criterion. Thus, rejecting this hypothesis corresponds to declaring equivalence. When the true simulated distance is d < , not rejecting the null hypothesis not declaring equivalence ) corresponds to a false negative. When d ≥ , declaring equivalence is a false positive The package is available in github (https://github.com/ alexsanchezpla/goProfiles) and in Bioconductor (http:// bioconductor.org/packages/goProfiles/).

Examples
We have selected two prototypical situations where we believe that using the approach described in the paper can be useful for simplifying the data the researchers are working with, or even for shedding new light on their meaning.

Equivalence analysis of kidney gene lists
Organ rejection diagnosis is mainly based on the study of tissue biopsies (e.g. renal, lung, heart or liver) but, unfortunately, the lesions observed using conventional histology are often not specific for the underlying mechanism since histological lesions (e.g., interstitial inflammation in renal biopsies) maybe driven by different processes. The molecular mechanisms operating in human organ transplant rejection are best inferred from the mRNAs expressed in biopsies because the corresponding proteins often have low expression and short half-lives, while small noncoding RNAs lack specificity. The study of associations should be characterized in a population that rigorously identifies the different mechanism participating in organ rejection, i.e. T cell-mediated and antibodymediated rejection (TCMR and ABMR). Associations can be universal (both types of rejection), TCMR-selective, or ABMR-selective. It has been proposed that top universal transcripts are gamma-interferon-inducible and transcripts shared by effector T cells and NK cells. TCMRselective transcripts are expressed in activated effector T cells or gamma-interferon-induced macrophages while ABMR-selective transcripts are expressed in NK cells and endothelial cells. Transcript associations are highly reproducible between biopsy sets when the same rejection definitions, algorithm, and technology are applied, but exact ranks will vary. Although rejection-associated transcripts are never completely rejection-specific because they are shared with the stereotyped response-toinjury and innate immunity, transcriptomic analysis using pathogenesis-based transcripts contributes to a better characterization of mechanisms leading to organ dysfunction.
This example uses a set of gene lists generically described as "PBTs" (pathogenesis-based transcript sets) studied by [22] and available at https://www. ualberta.ca/medicine/institutes-centres-groups/atagc/ Fig. 2 Power curve of the equivalence test, as a function of the true squared Euclidean distance. Balanced case of two gene lists of size 1,000 with 100 genes in common. Equivalence limit at = 0.25. The null hypothesis of the equivalence test states that the true squared Euclidean distance, d, is greater than or equal to , that is to say, that both lists are sufficiently dissimilar according to the limit criterion. Thus, rejecting this hypothesis corresponds to declaring equivalence. When the true simulated distance is d < , not rejecting the null hypothesis (not declaring equivalence) corresponds to a false negative. When d ≥ , declaring equivalence is a false positive research/gene-lists and as supplementary material 2 (See Additional file 1).
Each list consists of a series of probeset identifiers from hgu133plus2 Affymetrix expression microarrays that have been selected in distinct studies referenced in [22]. For this example the probesets have been preprocessed as follows: • Affymetrix identifiers have been converted into Entrez identifiers using Biomart. • When several probesets had the same identifier, this appeared only once in the list.
This preprocessing yields five gene lists described in Table 1 where, for each list, we provide its PBT abbreviation, the number of unique Entrez identifiers and a short description.
Equivalence analysis of these gene lists can be easily performed using functions in the goProfiles package (see the detailed analysis example in Additional file 3). A "standard" analysis has been performed that consists of computing the dissimilarity matrix of equivalence thresholds and building a dendrogram from this for the three ontologies at levels from 2 to 8. These dendrograms can be viewed in Fig. 4 (made at level 3 of the BP ontology) and in Additional file 3 (levels 2 to 8 of all three ontologies).
As can be seen in the plots ( Fig. 4 and supplementary figures in Additional file 3) the grouping produced has the same structure for all lists: kidney transcripts on one side and endothelial and injury transcripts on the other, the latter being more similar to each other than to endothelial transcripts. These groupings are not surprising because each type of gene is involved in different biological processes but they suggest that groupings observed in other settings, where the relation between the lists is not obvious, can also be considered as reasonable (see next example).

Equivalence analysis of cancer gene lists
As a second example, we consider a series of lists that have been obtained from Bushman lab (http:// www.bushmanlab.org/links/genelists). The lists contain Entrez identifiers for each gene so the only preprocessing consisted of removing one list that contained less than 100 genes. Table 2 contains, for each list, the name, the number of genes, the species and a short description. Fig. 3 Power curve of the equivalence test, as a function of the true squared Euclidean distance. Balanced case of two gene lists of size 200 with 20 genes in common. Equivalence limit at = 0.025. The null hypothesis of the equivalence test states that the true squared Euclidean distance, d, is greater than or equal to , that is to say, that both lists are sufficiently dissimilar according to the limit criterion. Thus, rejecting this hypothesis corresponds to declaring equivalence. When the true simulated distance is d < , not rejecting the null hypothesis (not declaring equivalence) corresponds to a false negative. When d ≥ , declaring equivalence is a false positive Citing the researcher's description of the lists they are collections of cancer-related genes that were used to generate a comprehensive list (allOnco) that is comprised of the union of all lists. In this case, we do not have any a priori expectations of which lists should be equivalent to which but, instead, we can rely on equivalence analysis to help answer the question "up to what point can these lists be considered equivalent so that they can be merged into a single list?" Figure 5 and supplementary figures in Additional file 4 show the results of equivalence analysis. Interestingly the lists tend to group consistently within the ontologies -groupings at distinct levels of the ontologies are almost identical-but these groupings can change from one ontology to another, which is not strange because they refer to different concepts. Depending on what the goal of merging the gene lists is the different groupings of each ontology can be used as a guide to decide whether a given dataset should be included or not in a common list. For instance depending on whether what one wishes to obtain is a heterogeneous or a homogeneous list one could decide to include groups that are separated by a higher threshold or, instead, that are near each other in the dendrogram.

Discussion and limitations
In this paper, a method for dealing with the problem of simultaneously comparing multiple feature lists has been introduced. The method is based on comparing feature lists by means of their projections at fixed Gene Ontology levels. These projections are called "functional profiles". One innovative characteristic is that the comparison is done by means of equivalence tests, which are aimed at rejecting a null hypothesis of nonequivalence, which means that it can be stated (when this null hypothesis is  rejected) that two lists are "equivalent at a certain threshold". This is a better statement than simply saying that "there is no evidence of difference or dependency" (which does not mean that they are equal or independent), as would be the case if a "standard" difference or dependency test was performed. The fact that it relies upon equivalence makes it particularly interesting for data integration problems, such as for the cancer gene lists presented in the examples. Following a reviewer suggestion we compared the equivalence test with a standard test of positive dependency. Although appealing, this test is not adequate to solve the proposed problem. Its limitations are shown in the supplementary material (see Additional file 6).
The examples have shown that the method behaves consistently with expected similarities. That is, it tends to consider equivalent at lower threshold feature lists thateven if they have few elements in common -are expected to be easily declared equivalent. One can interpret that this happens because they are associated with the same or similar biological processes, such as with injury-related PBTs in the kidney gene lists examples. This, of course, suggests that when two lists, whose relationship is not known, show up as equivalent they can be considered similar enough.
The method is not free from limitations. For instance an obvious concern may be the fact that comparisons are made separately at each level of each ontology, which means, for instance, that if one considers three levels (e.g. 2, 3 and 4) of the three ontologies (CC, BP and MF) one ends up with nine comparisons that one may want to combine a posteriori. In spite of its apparent inconvenience, this may be seen as useful -firstly because in general, a good consistency and reproducibility are observed between different levels of the same ontology; that is, they yield generally the same classification, and small differences observed may be mostly attributable to list size. In some cases, there may be differences between ontologies, but this is not a serious drawback either, because the distinct ontologies reflect distinct biological concepts, so differences can be considered reasonable. If all the comparisons were compacted into a single one this variability might be lost, which could hamper the interpretation of the results.
Another issue to be accounted for is computational efficiency. This depends on many factors, such as the computer where the programs are run, the number of lists to compare, the size (number of features) of these lists and the number of GO nodes in the profiles being compared. A small simulation study has been performed to provide information about execution times in a realistic scenario: using a basic bioinformatic station (i7-4790 processor with 8GB of RAM running R 3.4 on 64-bit Windows 7 Enterprise) the process of determining the equivalence of a certain number of random gene lists has been executed repeatedly. This has been done using the equivClust function of the current version (1.44.0) of the goProfiles package. Times were measured by means of the R package microbenchmark, which provides summary statistics for the running time. The number of lists compared in the simulations was 5, 10, 25 and 50. For simplicity pairs of lists being compared were set to have the same size. The sizes considered were 100, 200, 1000, 2000 and 5000. The comparison was made at levels 2 and 3 of the "Biological Process" (BP) ontology. Table 1 and Fig. 1 in Additional file 7 show the summaries of the execution times in five replicates (if the number of genes was 5000 only one replicate was used). It can be seen that, at level 3, the required time to build an equivalence dendrogram for 50 gene lists and 5000 genes is more than 32 hours (115858.16/3600) which is clearly not assumable for ordinary calculations.

Conclusions
The method introduced in this work provides a way for classifying sets of genes or other features based on equivalence testing on their corresponding functional profiles. It can be viewed as an extension of the goProfiles methodology [19,20] introduced previously and it is statistically well grounded. Standard visualizations, such as dendrograms, can be used to depict the classification results. The method has a wide applicability and has been used in a variety of problems such as deciding whether a series of datasets generating the lists can be combined, or in regard to classifying the lists in a collection of signatures from most to least similar, in the sense of equivalence.