Applicability of metamorphic testing in bioinformatics
MT is a general technique to alleviate the oracle problem
The programs we used in our case studies belong to two types of bioinformatics programs that are traditionally very hard to test due to the lack of a tangible oracle. For instance, the current approach to testing a network simulator involves visual inspection of the simulated values as well as comparison among multiple implementations of a simulator [7]. That approach therefore tries to tackle the oracle problem by using multiple implementations. Such multiple implementations are often hard to acquire in practice, and the results of such testing may be hard to interpret when different implementations give different results. Our approach alleviates the oracle problem by verifying relationships among the outputs of multiple test cases. As demonstrated by both of our case studies, as well as many previous studies [5, 6, 11, 22], MT is effective for testing programs that are traditionally difficult to test due to the oracle problem. This makes MT applicable to testing various bioinformatics programs in which the oracle problem exists.
MT can test a program against its intended behaviour
For any program, MRs can be derived from the intended program behaviour or the program specification. In our case studies, all MRs are based on the intended program behaviour (that is, on the domain knowledge of network dynamics and approximate string matching), and they do not make use of the details of the implementations (for example, the underlying algorithm and data structure). Using the ten MRs derived from the intended behaviour of a GRN simulator, we found a fault in GNLab. As explained in the Results section, this fault is due to a mis-specification of the algorithm rather than a bug in the implementation. This means that if test cases had been derived from the specification alone, this fault might not have been detected. We believe this ability to test a program against its intended behaviour is very important in bioinformatics, as it allows us to focus our testing effort on assessing whether the underlying biological questions are being tackled correctly. Of course, we can easily derive MRs from the program specification as well.
MT can be combined with special test cases
It should be emphasized that MT is a general testing technique for situations where there is no tangible oracle. It can be used to generate a large number of test cases from an existing set of test cases. For example, we can easily construct some artificial short sequence reads with predefined mismatch patterns for the testing of SeqMap. Such special test cases are useful and should be used as far as possible. However, special test cases only cover a small portion of all possible inputs, and we still need more test cases whose testing results are not easy to verify. A straightforward method is to combine the MT technique with special test cases. This can be done by using each special test case as a source test case to generate follow-up test cases based on some MRs. Such an approach has been shown to be very effective in detecting non-trivial faults [5].
MT is simple and automatable
As demonstrated through the two case studies, the process of MT is straightforward. Different subsets of behaviours of the target program can be tested by employing different MRs. Once the MRs are identified, test cases can easily be generated automatically and their outputs can be verified using simple scripts. A single program can have a great number of MRs, and various follow-up test cases can be defined based on one single source test case. Moreover, the simplicity of MT allows us to perform systematic automated testing using a simple test script. The use of a simple test script is important to minimize the chance of introducing bugs into the test script itself, which can subsequently confound the interpretation of the testing result.
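The loop described above (generate a follow-up case from each source case, run both, check the relation) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the test scripts used in the case studies: `count_mismatches`, its seeded-fault variant `count_mismatches_buggy`, and the MR "reversing both the read and the reference window leaves the mismatch count unchanged" are all hypothetical stand-ins chosen only to keep the example self-contained.

```python
import random


def run_metamorphic_test(program, make_follow_up, relation_holds, source_inputs):
    """Return the source inputs for which the metamorphic relation is violated."""
    violations = []
    for source in source_inputs:
        follow_up = make_follow_up(source)
        if not relation_holds(program(source), program(follow_up)):
            violations.append(source)
    return violations


# Hypothetical program under test: mismatches between a read and a reference window.
def count_mismatches(pair):
    read, window = pair
    return sum(a != b for a, b in zip(read, window))


# Hypothetical seeded fault: the last position is never compared.
def count_mismatches_buggy(pair):
    read, window = pair
    return sum(a != b for a, b in zip(read[:-1], window[:-1]))


# Assumed MR: reversing both strings must not change the mismatch count.
def reverse_both(pair):
    read, window = pair
    return read[::-1], window[::-1]


def same_output(source_output, follow_up_output):
    return source_output == follow_up_output


rng = random.Random(0)  # fixed seed so the run is repeatable


def random_read(length=20):
    return "".join(rng.choice("ACGT") for _ in range(length))


# Randomly generated source cases plus one hand-made special case.
sources = [(random_read(), random_read()) for _ in range(100)]
sources.append(("AAAA", "AAAT"))

violations_correct = run_metamorphic_test(count_mismatches, reverse_both, same_output, sources)
violations_buggy = run_metamorphic_test(count_mismatches_buggy, reverse_both, same_output, sources)
print(len(violations_correct), len(violations_buggy))
```

No expected output is ever written down for any input: the script only compares two outputs of the same program, which is why it stays short and is unlikely to contain faults of its own.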
MT allows the use of real inputs as test cases
One implication of the ability to automatically generate more test cases is that we can now use real-life program inputs as test inputs. In the MT framework, it is easy to treat a real-life input as a source test case, and generate many follow-up test cases using a set of MRs. For programs that lack a tangible oracle, test cases are usually restricted to those that can easily be constructed and verified. Such test cases may not have the same size and characteristics as the real-life program inputs. For instance, testing with real-life inputs is often not possible for most network simulators since we have no objective means to verify the large amount of simulation results. Using MT, this difficulty is alleviated by testing the outputs against a set of MRs instead of the oracle. In our case study of GNLab, we could construct test cases based on two real-life GRNs which are much larger and more complicated than the randomly generated ones. Many bioinformatics programs deal with high-throughput data; the ability to test whether they can correctly handle such real-life inputs is therefore important.
MT is suitable for bioinformatics programmers
Compared to many other testing techniques, MT is much easier to implement in practice because it relies mainly on user domain knowledge rather than software testing knowledge. Many bioinformatics programs are developed by the end-user – the researcher or research group who uses this program. Chen et al. [13] have demonstrated that MT is particularly suited to test end-user programmers' own programs because (1) "end-user programmers have the domain knowledge to identify MRs" and (2) "end-user programmers can distinguish good MRs based on program structures" [13]. In the GNLab example, we only identified and used some MRs related to the structure of the input networks because our domain knowledge points out that changing the network structure should induce the most changes in the execution of the simulator. Other properties related to the execution of the simulation, such as length of simulation and output interval, are not covered by the MRs. This feature of MT allows the tester to focus most of the testing effort on the subset of functionalities that are more important, or more frequently used by its intended users.
MT is useful for testing diverse types of programs
Although both programs used in our case studies implement deterministic procedures, some initial results show that MT can also be used for testing other types of procedure that are traditionally difficult to test, such as heuristic methods, machine learning methods, stochastic methods and so on [23, 24]. Since many bioinformatics programs implement such procedures, we expect MT to be applicable to them.
Limitations
It should be noted that passing all test cases derived from a set of MRs does not guarantee the correctness of the program under test. MRs are necessary properties of a correct program, so satisfying all of them is not sufficient to establish correctness. This problem is, in fact, a limitation of all software testing methods. Nonetheless, the ability to systematically produce a large number of test cases should increase our chance of detecting a fault in the target program, and hence improve its quality.
As this paper focuses on introducing the application of MT in bioinformatics, there are other issues related to MT that are not explicitly discussed here. First, the success of MT greatly depends on defining a "good" set of MRs. From the testing results of GNLab and SeqMap, we observe that some MRs are less effective in detecting faults than others. In particular, we note that the MRs based on adding and removing nodes from a network are not effective in detecting faults in our fault-seeded mutants of GNLab. So far in this paper, we have not explicitly addressed the issue of selecting effective MRs. Some initial results suggest that those MRs which trigger different execution paths for the source and follow-up cases are more likely to reveal faults [25]. This means that, although deriving MRs is usually straightforward, selecting the most effective MRs requires a good understanding of the problem domain. More specific guidelines for choosing MRs are being actively investigated. Second, the MT technique itself does not specify how source test cases should be selected given a set of MRs. We have used randomly generated inputs and real-life inputs as source test cases in our study. However, as shown in our case studies, the performance of MT also depends on the number and variety of source test cases. We expect that MT can be combined with other established test case selection techniques to improve its fault-revealing ability.
Further examples in bioinformatics
Besides programs for network simulation and short sequence mapping, we notice that many other bioinformatics programs can benefit from MT. Here we briefly discuss how the testing of programs from several important bioinformatics domains suffers from the oracle problem, and how the MT technique can be used in each case. The list of applications presented here is by no means exhaustive. Only very simple MRs are pointed out here, as we are not discussing any particular detailed problem description. In general, more complex, and potentially more fault-revealing, MRs can be formulated based on a more thorough understanding of the problem domain or program specification [25].
Phylogenetics
One major endeavor in phylogenetics is to infer the phylogeny (phylogenetic tree) of some species based on their aligned nucleotide or amino acid sequences [26]. There are three main approaches to phylogenetic inference: (1) parsimony methods, (2) distance based methods, (3) model based methods. Broadly speaking, all methods aim to group these species into a binary tree according to different measures of sequence relatedness. In practice, a large number of long bio-sequences are commonly analyzed. Also, many of these methods involve calculating a distance matrix, or computing maximum-likelihood estimates, which are difficult to verify except for trivial inputs. Therefore the testing of phylogenetic inference programs suffers from the oracle problem.
Let us denote a program that performs phylogenetic inference as P. The input of P is a set of n aligned sequences S = {S_1, S_2, ..., S_n}, and the output is a binary tree T(S). One possible MR is that adding a sequence, S_{n+1}, would not change the relative structure of the rest of the tree. That is, the trees generated by the source case T(S) and the follow-up case T(S ∪ {S_{n+1}}) only differ by one additional leaf node representing S_{n+1}. For a P that treats each letter of the alphabet independently and equally, we can define another MR: replacing, or permuting, the letters of the sequence alphabet with one another (for example, A ↦ C, T ↦ A, G ↦ T, C ↦ G) does not change the final structure of the tree. That is, T(S) = T(Permute(S)) where Permute( ) is an alphabet permutation function.
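The alphabet-permutation MR, T(S) = T(Permute(S)), can be checked mechanically. The sketch below is illustrative only: `infer_tree` is a hypothetical stand-in for P (a toy average-linkage clustering on Hamming distances, which treats all letters equally), the four sequences are made-up data, and the tree is represented as nested frozensets of sequence indices so that two topologies can be compared with `==`. A real inference program would be invoked in place of `infer_tree`, and whether the MR applies to it depends on its substitution model.

```python
from itertools import combinations


def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def infer_tree(seqs):
    """Toy stand-in for P: average-linkage clustering on Hamming distances.

    Returns the topology as nested frozensets of sequence indices.
    """
    # Each cluster is (topology, list of member indices).
    clusters = [(i, [i]) for i in range(len(seqs))]
    while len(clusters) > 1:

        def average_distance(pair):
            members_1 = clusters[pair[0]][1]
            members_2 = clusters[pair[1]][1]
            total = sum(hamming(seqs[i], seqs[j]) for i in members_1 for j in members_2)
            return total / (len(members_1) * len(members_2))

        # Merge the closest pair; ties are broken by position, not by letters.
        i, j = min(combinations(range(len(clusters)), 2), key=average_distance)
        merged = (
            frozenset([clusters[i][0], clusters[j][0]]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


def permute(seqs):
    """Permute(S) from the text: A -> C, T -> A, G -> T, C -> G."""
    table = str.maketrans("ATGC", "CATG")
    return [s.translate(table) for s in seqs]


# Made-up aligned sequences used as the source test case.
source = ["ACGTACGT", "ACGTACGA", "TCGTTCGA", "TTGTTCAA"]
follow_up = permute(source)

tree_source = infer_tree(source)
tree_follow_up = infer_tree(follow_up)
print(tree_source == tree_follow_up)
```

Comparing topologies rather than printed trees avoids false alarms caused by a different but equivalent ordering of sibling branches.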
Microarray analysis
Microarray analysis has become an indispensable tool in modern biological and medical research. Many types of analyses are available for analyzing microarray data. They include differential expression (DE) [27], differential variability (DV) [28], hierarchical clustering [29], gene set enrichment analysis [30] and Bayesian network analysis [31]. Due to the difficulty in analyzing the high dimensional input (microarray expression profiles), and often also the high dimensional output (ranked gene list, binary tree, and Bayesian network), the correctness of the implementation is often difficult to verify. In this case, the MT technique can be useful. Let us take the identification of DE genes between two sample classes as an example. One simple approach is to use the t-statistic to obtain a P value for each gene based on a two-sided hypothesis, and call the genes with P less than a pre-specified threshold significant DE genes. Since the t-statistic is invariant to shifts, we can define an MR that adding a constant value to all values in the input microarray profile does not alter the resulting list of P values. The second MR is that switching the class labels of the samples also does not alter the resulting P values, as the t-distribution is symmetrical.
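Both MRs can be illustrated for a single gene with the standard library alone. The sketch below makes several assumptions that are not in the text: it uses the unequal-variance (Welch) form of the two-sample t-statistic, the expression values are made up, and it checks the t-statistic itself rather than the P value. The two-sided P value depends on the data only through |t| and the degrees of freedom, and the degrees of freedom are built from the same group sizes and variances, so an unchanged |t| under these transformations implies an unchanged P value.

```python
import math
import statistics


def t_statistic(group_a, group_b):
    """Two-sample t-statistic (Welch form, an assumption of this sketch)."""
    mean_a = statistics.fmean(group_a)
    mean_b = statistics.fmean(group_b)
    var_a = statistics.variance(group_a)  # sample variance
    var_b = statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / standard_error


# Made-up expression values of one gene in two sample classes.
class_1 = [5.1, 4.8, 5.6, 5.0]
class_2 = [6.2, 6.8, 6.1, 6.5]
t_source = t_statistic(class_1, class_2)

# MR 1: adding a constant to every value leaves the t-statistic unchanged.
shift = 10.0
t_shifted = t_statistic([x + shift for x in class_1], [x + shift for x in class_2])
mr1_holds = math.isclose(t_source, t_shifted, rel_tol=1e-9)

# MR 2: switching the class labels flips the sign of t but not its magnitude,
# so a two-sided P value is unaffected.
t_swapped = t_statistic(class_2, class_1)
mr2_holds = math.isclose(abs(t_source), abs(t_swapped), rel_tol=1e-9)

print(mr1_holds, mr2_holds)
```

The comparison uses a relative tolerance rather than exact equality because shifting the inputs changes floating-point rounding slightly even when the program is correct.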
Biological database retrieval
Many biomolecular databases are available, and most of them are built to support fast data retrieval and database mining [18, 32–34]. One major challenge is to ensure that we can accurately and efficiently retrieve the desired data item from the database. This is particularly important as we begin to construct large scale gene regulatory networks and metabolic networks using these databases. Invalid retrieval results may lead to false positive or false negative edges in a reconstructed network. Due to the large size of the database, it is generally difficult to test whether a search engine correctly retrieves all data that exactly match a query. A potentially suitable MR is that the query A ∩ B ∩ C should not return more results than the query A ∩ B. Another MR is that executing the query ¬(A ∪ B) should return the same results as executing the query (¬A) ∩ (¬B). Many more MRs along this line are possible.
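The two query MRs can be expressed as set checks on the returned record identifiers. In the sketch below, `evaluate` is a hypothetical in-memory search function standing in for a real database search engine, the query syntax (nested tuples with "and", "or" and "not") is invented for the example, and the records are made-up annotations. Instead of only comparing result counts, the first check uses the slightly stronger subset form of the MR.

```python
# Made-up records: identifier -> set of annotation terms.
RECORDS = {
    1: {"kinase", "human", "membrane"},
    2: {"kinase", "human"},
    3: {"kinase", "mouse", "membrane"},
    4: {"transporter", "human", "membrane"},
    5: {"transporter", "mouse"},
}


def evaluate(query, records):
    """Hypothetical search engine.

    A query is either a term (str) or a tuple: ("and", q1, q2, ...),
    ("or", q1, q2, ...) or ("not", q). Returns the matching identifiers.
    """
    if isinstance(query, str):
        return {rid for rid, terms in records.items() if query in terms}
    operator, *arguments = query
    results = [evaluate(argument, records) for argument in arguments]
    if operator == "and":
        return set.intersection(*results)
    if operator == "or":
        return set.union(*results)
    if operator == "not":
        return set(records) - results[0]
    raise ValueError(f"unknown operator: {operator}")


a, b, c = "kinase", "human", "membrane"

# MR 1: A and B and C must return a subset of what A and B returns.
narrow = evaluate(("and", a, b, c), RECORDS)
broad = evaluate(("and", a, b), RECORDS)
mr1_holds = narrow <= broad

# MR 2 (De Morgan): not (A or B) must equal (not A) and (not B).
left = evaluate(("not", ("or", a, b)), RECORDS)
right = evaluate(("and", ("not", a), ("not", b)), RECORDS)
mr2_holds = left == right

print(mr1_holds, mr2_holds)
```

Against a real database, the same checks would compare the identifier sets returned by the actual search interface, so no manually verified result list is required.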