In 2009 work began on a real-world tool to help tackle the issues of inconsistency and incompleteness in in situ gene expression data for the developmental mouse; this work is undertaken as the Argudas project. Argudas is an evolution of the work described previously, and in [14–16]. In that earlier work two prototypes were developed, see Figure 1. During Argudas, the system has been refined further through the development of two additional implementations, each with a similar, yet distinct, user interface.
This section will explore the knowledge gained from the previous work, and the ways in which Argudas is informed by those examples. The discussion starts with an examination of the issue of subjectivity, then progresses to investigate the breadth of the resources included in Argudas, and concludes with a review of the issues highlighted by the first Argudas prototype.
Tackling subjectivity
One of the problems with the previous implementations is the notion of subjectivity. During the evaluation it was clear that each subject had their own approach to interpreting the information contained within EMAGE and GXD. Accordingly, users disagreed with the system in three key areas: the argumentation schemes, the confidence values assigned to those schemes, and the final conclusion the system presented. Each area of conflict will now be addressed in turn.
Conclusions
At the top of a results page, e.g. Figure 1 part B, the second prototype presents a conclusion as to whether or not the gene is expressed, e.g. "The arguments appear to suggest the gene is expressed". Each time a conclusion is presented some users agree with it, yet others disagree. A plausible reason for this is discussed by Jeffreys et al. [10]:
Different researchers interpret data in different ways, and even the same researcher may make inconsistent interpretations, adding an unreliable and non-uniform element to data processing.
The phenomenon of subjectivity is explored, in relation to argumentation, in the philosophical writings of Perelman and Olbrechts-Tyteca [20], who introduce the notion of an audience to capture the idea that each person has their own reasoning process, and thus each member of an audience will judge the same argument differently. This means that there is little point in the system trying to decide whether or not the gene is expressed. Instead the system must generate arguments for and against the gene being expressed, and allow the user to evaluate these arguments in order to reach their own decision. In effect, the system should aggregate and evaluate data, presenting the relevant data to inform the user's decision making process.
For this reason, unlike earlier work, Argudas does not decide whether or not a gene is expressed. Instead it aggregates and interprets information before summarising the material for the user.
Argumentation schemes
The argumentation schemes are derived from expert knowledge of how to interpret and evaluate the information in the EMAGE and GXD resources. The expert used for this task was responsible for curating the information in EMAGE, and thus interpreted and evaluated the information in these resources on a daily basis.
Unfortunately, there is little published research on the creation of argumentation schemes by a domain expert with which to compare the current work. The available literature on schemes often focuses on the dialogue and natural language aspects. The work of Silva et al. [21] is interesting because it starts with already documented "reasoning templates" (effectively diagrammatic argumentation schemes) and asks a domain expert to use those to explain his reasoning, customising them if necessary, during case-based reasoning. The process of customising the schemes provides a mechanism to help the expert describe his knowledge. However, no such pre-existing schemes exist for the current domain.
Shipman and Marshall [22] discuss the problems of working with a number of knowledge formalisms, including argumentation. Although their work does not focus directly on the application of schemes, a number of their ideas do transfer over, for example:
Tacit knowledge is knowledge users employ without being conscious of its use [23]. Tacit knowledge poses a particularly challenging problem for adding formal structure and content to any system since, by its very nature, people do not explicitly acknowledge tacit knowledge.
The notion that experts are not aware of all their own knowledge presents a massive impediment for all knowledge-based approaches. Similar ideas can be found in the work of Bliss [24], which suggests that experts develop mental models of concepts and processes that can be very hard for them to access. Such knowledge, referred to as deep knowledge, is unarticulated, and knowledge that has never been articulated can be extremely difficult for the expert to recall [24–26]. The biological expert used in this work was being asked to provide tacit knowledge, and struggled to do so.
Degrees of confidence for argumentation schemes
In order for the argumentation engine to argue, it requires not only a series of logic rules (derived from the schemes) but additionally that each rule has an associated confidence score. These scores, so-called degrees of belief, should record the expert's confidence in each rule in such a way that rules with a high value produce better arguments than rules with a low value. The argumentation engine uses the scores to settle conflict, decreeing that when two arguments oppose one another, the argument with the higher value wins.
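This conflict-resolution rule can be sketched in a few lines. The sketch below is illustrative only: it is not the ASPIC engine, and the class, rule names, and confidence values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Argument:
    claim: str          # e.g. "expressed" or "not expressed"
    rule: str           # the scheme/rule that generated the argument
    confidence: float   # expert-assigned degree of belief in that rule

def resolve(a, b):
    """When two arguments oppose one another, the argument whose rule
    carries the higher degree of belief wins."""
    return a if a.confidence >= b.confidence else b

pro = Argument("expressed", "direct in situ annotation", 0.8)
con = Argument("not expressed", "propagated annotation", 0.5)
print(resolve(pro, con).claim)  # expressed
```

The whole approach therefore stands or falls on how well the numeric degrees of belief reflect the expert's actual reasoning, which is exactly the difficulty discussed next.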
During the evaluation of the second prototype the biological expert disagreed with his own assignments, which raises questions over the reliability of their capture. However, this is not the only plausible explanation: Bliss [24] suggests that mental models naturally evolve as the person gains experience and knowledge, or as a result of the person consciously thinking about their processes and knowledge. Consequently, the act of asking an expert to document their knowledge can change that knowledge, and thus render the assigned confidence values redundant.
A further issue to be aware of when dealing with experts is the reliability of the individuals themselves. As Walton [27] notes, the users of expert opinion often are not capable of judging the quality of that opinion and thus simply apply it:
It is quite common for presumptions to be based on expert opinion where the person who acts on the presumption - not being an expert - is not in a position to verify the proposition by basing it on hard evidence within the field of expertise in question.
Walton [27] goes on to caution that such experts, or authorities, may not prove to be reliable:
But even when they are nonfallacious, as used in a dialogue, appeals to authority are generally weak, tentative, presumptive, subjective, and testimony-based arguments. They are inherently subject to critical questioning, or even rebuttal, on various grounds - especially on grounds relating to the reliability of the source cited.
Walton's opinion is backed by a number of scientific studies, as Hansson [28] reports:
Experimental studies indicate that there are only a few types of predictions that experts perform in a well-calibrated manner. Thus, professional weather forecasters and horse-race bookmakers make well-calibrated probability estimates in their respective fields of expertise [29, 30]. In contrast, most other types of prediction that have been studied are subject to substantial overconfidence. Physicians assign too high probability values to the correctness of their own diagnoses [31]. Geotechnical engineers were overconfident in their estimates of the strength of a clay foundation [32].
These quotes combine to illustrate the difficulty in using expert opinion - often it is not correct, and the user has no way of knowing what (s)he can trust. As such, it is conceivable that the expert did not assign the correct confidence values. Furthermore, the authors were incapable of verifying the quality of the assignments.
Argudas' solution
The previous work demonstrates a need for an improved set of schemes and confidence values. Two significant barriers appear to exist: the difficulty the expert has in articulating his own knowledge, and the reliability of an expert's opinion. Intuitively, the solution to both of these issues is the inclusion of further experts in the process. Having a second expert allows ideas to be communicated and evaluated by people in different ways, thus helping to reduce both the subjectivity and the workload on any individual expert, making it easier to generate the required output.
Regrettably, Argudas did not have the resources to restart the full scheme generation process. Consequently, as described later, Argudas employed two experts to review the schemes and assign new confidence values. These amended schemes and values are used in Argudas' first prototype (third overall).
Clearly having two experts means the possibility of two different points of view. When they both agree, the probability of the degree of confidence being accurate increases. Disagreement is beneficial too, because through it new insights are discovered. It seems likely that if the schemes had been produced by multiple experts the range and diversity of the schemes would have been broader. Furthermore, if two experts have to agree on the natural language used to document the schemes, the number of ambiguous phrases should be reduced.
Yet working with multiple experts can cause a number of difficulties. Expert biologists are often geographically dispersed. This, in conjunction with their workload, means it may be difficult to bring the experts together. Furthermore, there is an obvious requirement for a formal resolution process to help dissect and settle differences of opinion. Finally, it must be acknowledged that not all disagreements can be rectified, and that a mechanism for incorporating differences of opinion must exist. These issues point to the requirement for a framework that enables biologists to work together in order to generate the schemes. Lindgren [33] is developing such a framework for the use case of dementia care; however, it is still at an early stage, and cannot be employed here.
Extending Argudas for richer argumentation
Argudas aims to improve on previous work with the integration of further resources - more resources means extra information and potentially richer arguments. Initially the microarray data contained in the ArrayExpress [34] resource was targeted. This highlights a number of integration issues that have yet to be resolved.
Firstly, the ArrayExpress resource does not use the EMAP anatomy ontology. Secondly, accessing the data held by ArrayExpress was difficult as it did not provide direct programmatic access to its database. Instead access was via a RESTful web service, which provided limited functionality and did not allow access to the data required by this work. Initially it was impossible to ask for all the genes expressed in a healthy mouse's pancreas at stage 24, because ArrayExpress did not compute multi-factor statistics: it computed which genes were expressed in the pancreas and which genes were expressed at stage 24 separately, and there was no way of obtaining the intersection at that time. The team behind the resource were working on improving this interface and claimed that such functionality would become available in the future; however, the delay was problematic given the time constraints associated with Argudas. Finally, ArrayExpress had less data for the developmental mouse than expected: only three stages were covered. Comparing the costs and benefits, it was decided not to pursue this integration further.
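The missing multi-factor query amounts to a set intersection that would have had to be computed client-side from two single-factor queries. The gene identifiers below are illustrative only:

```python
# Results of two hypothetical single-factor ArrayExpress queries
genes_in_pancreas = {"bmp4", "pdx1", "ins1"}   # expressed in the pancreas
genes_at_stage_24 = {"bmp4", "pdx1", "shh"}    # expressed at stage 24

# The multi-factor question: expressed in the pancreas AND at stage 24
both = genes_in_pancreas & genes_at_stage_24
print(sorted(both))  # ['bmp4', 'pdx1']
```

Trivial as the intersection is, it is only statistically meaningful if the resource computes the multi-factor statistics itself, which ArrayExpress did not at the time.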
As work on ArrayExpress stopped, an investigation of the Allen Brain Atlas (ABA) [35] and Gene Expression Nervous System Atlas (GENSAT) [36] began. Both of these resources are databases of in situ experiments focusing predominantly on the adult mouse's nervous system, i.e. the brain. GENSAT makes available a full database dump, while ABA supplies an extensive range of RESTful interfaces that provide access to the desired information.
However, bringing the data from these two new resources into Argudas is problematic. Neither resource uses the EMAP anatomy: as these resources focus on the brain, they have a far finer granularity for brain structures than EMAP. Hence it is necessary to attempt some form of mapping from their respective anatomies to EMAP. Secondly, these resources use their own measures to describe the level of expression (GENSAT uses natural language terms, ABA floating point numbers), which must also be mapped across to the corresponding EMAGE/GXD terminology. EMAGE and GXD use very similar labels; for example, when a gene is not expressed in a particular structure GXD describes the gene as absent whereas EMAGE uses the term not detected.
Mapping between the different anatomy ontologies employed by the resources is based on a series of alignments produced by Jiménez-Lozano et al. [37]. As both GENSAT and ABA have a finer granularity than EMAP, mapping from those resources to EMAGE/GXD commonly results in a loss of precision.
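The loss of precision can be illustrated with a tiny, hypothetical fragment of such an alignment, in which several fine-grained ABA brain structures collapse onto a single, coarser EMAP term (the structure names and alignment entries below are illustrative, not taken from the published alignments):

```python
# Illustrative many-to-one alignment from fine ABA structures to EMAP
ABA_TO_EMAP = {
    "Field CA1": "hippocampus",
    "Field CA2": "hippocampus",
    "Field CA3": "hippocampus",
    "Dentate gyrus": "hippocampus",
}

def to_emap(aba_structure):
    """Map an ABA structure to its EMAP term; None if unaligned."""
    return ABA_TO_EMAP.get(aba_structure)

# Four distinct ABA structures become indistinguishable after mapping:
print({to_emap(s) for s in ABA_TO_EMAP})  # {'hippocampus'}
```

Once mapped, an annotation on Field CA1 and one on the dentate gyrus can no longer be told apart, which is precisely the precision loss described above.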
The second task is straightforward for GENSAT as its choice of labels is similar to EMAGE's, and in turn GXD's: whereas EMAGE has not detected, detected, weak, moderate, and strong, GENSAT has not done, undetectable, weak signal, and moderate to strong signal.
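The label correspondence implied by the two vocabularies might be encoded as a simple lookup table. The exact correspondences chosen by Argudas are an assumption here; in particular, one GENSAT term spans two EMAGE levels:

```python
# Illustrative GENSAT -> EMAGE vocabulary mapping (assumed, not published)
GENSAT_TO_EMAGE = {
    "not done": None,                  # no experiment performed: nothing to map
    "undetectable": "not detected",
    "weak signal": "weak",
    "moderate to strong signal": "moderate",  # could equally map to "strong":
                                              # this GENSAT term spans two
                                              # EMAGE levels
}

def gensat_to_emage(label):
    return GENSAT_TO_EMAGE.get(label)

print(gensat_to_emage("undetectable"))  # not detected
```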
Mapping EMAGE/GXD expression levels to ABA is a more complex task. There are three different measures of expression level published by ABA. Firstly there is the raw experimental information, secondly there is the average information (across all the experiments for a particular gene and structure), and finally there is a mathematical aggregation of the expression level and expression density. For current purposes, the first class of information is most suitable. Subsequently, the ABA generated expression level mappings must be applied. These mappings are a series of cut-offs that determine whether the expression level is not expressed, weak, moderate or strong. There are different limits for different parts of the brain. In order for the limits to be applied to the structures lower down in the anatomy hierarchy, the limits need to be propagated through the brain in a similar manner to the gene expression information.
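The cut-off step might be sketched as follows. The numeric limits and the anatomy fragment below are placeholders, not the published ABA values; the sketch only shows the shape of the computation: per-structure thresholds, inherited down the hierarchy, applied to a raw value.

```python
# Placeholder per-structure cut-offs (weak, moderate, strong thresholds)
ABA_CUTOFFS = {
    "cortex": (0.5, 2.0, 8.0),
    "cerebellum": (0.3, 1.5, 6.0),
}

# Hypothetical fragment of the anatomy hierarchy: child -> parent
PARENT = {"somatosensory area": "cortex"}

def limits_for(structure):
    """Propagate limits down the hierarchy: a structure without its own
    cut-offs inherits those of its nearest ancestor that has them."""
    while structure not in ABA_CUTOFFS:
        structure = PARENT[structure]
    return ABA_CUTOFFS[structure]

def expression_level(structure, raw_value):
    """Bin a raw ABA expression value into the EMAGE-style levels."""
    weak, moderate, strong = limits_for(structure)
    if raw_value < weak:
        return "not detected"
    if raw_value < moderate:
        return "weak"
    if raw_value < strong:
        return "moderate"
    return "strong"

print(expression_level("somatosensory area", 3.1))  # moderate
```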
Once this work has been done it is necessary to determine what level of integration is appropriate for these resources. At the most basic level it would be possible to merely report the results contained in ABA and GENSAT. If either of these resources agreed with an annotation from EMAGE/GXD, it would increase the confidence in that annotation. Fully integrating ABA and GENSAT would require the generation of schemes for these resources, which is substantially more work and would necessitate involvement from a resource expert. In the case of ABA such an approach may not be fruitful: ABA does not publish all the data it collects, so some of the attributes provided by an expert may be hidden from the public, and thereby from Argudas. The restricted resources of Argudas meant that only the former option was realistic.
Although an interested biologist may raise a number of concerns regarding the anatomy and expression level mappings described above, currently there is no better way of aggregating data between the four resources of interest.
Further issues identified during the development of Argudas
The first Argudas prototype tests the revised schemes and confidence values. For the reasons discussed above, it removes the final conclusion and instead presents a list of textual arguments. When this initial system was demonstrated to a group of end users, they raised concerns over the number of arguments shown and the style of their presentation. These topics will be explored in greater depth before a possible solution is presented.
Too many arguments
During Argudas' development it became evident that the number of arguments generated varies enormously. For some queries there are no annotations and therefore no arguments. With other queries over ten annotations are retrieved from EMAGE and GXD, and accordingly a large number of arguments are produced: arguing for bmp4 - future brain in stage 15 generates two hundred and fifteen arguments. Clearly, no biologist will read all the arguments, hence there can be no guarantee that (s)he will read all the important information. This realisation led to the conclusion that the potential number of arguments is too high, and steps were taken to reduce it.
Although all of the arguments are unique in terms of their content (wording and word order), semantically several arguments seem to duplicate one another. Identifying semantically equivalent arguments is not a minor task: the definition of equivalence seems to depend on the individual using the system and the biological task they wish to perform.
There are a number of common interpretations and actions that are not appropriate for certain biological tasks, and which individual biologists may, in general, reject. For example, the EMAP anatomy ontology is defined using part-of relationships. Consequently, positive levels of expression are routinely propagated up the ontology to higher level structures; for instance, if bmp4 is weakly expressed in the telencephalon, it is normally correct to say that bmp4 is weakly expressed in the future brain. Nevertheless, many biologists prefer direct annotations over propagated ones, thus if a second annotation suggests bmp4 is not detected in the future brain, the second annotation would take precedence.
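The propagation rule, and the precedence of direct over propagated annotations, can be sketched as follows. The tiny part-of fragment and the annotation data are illustrative only; this is a reading of the rule described above, not Argudas' actual implementation.

```python
# Hypothetical fragment of the EMAP part-of hierarchy: child -> parent
PART_OF = {"telencephalon": "future brain", "future brain": None}

def annotation_for(structure, direct):
    """direct: {structure: level} of direct annotations for one gene.
    Returns (level, provenance). A direct annotation takes precedence;
    otherwise positive expression propagates up from substructures."""
    if structure in direct:
        return direct[structure], "direct"
    for sub, parent in PART_OF.items():
        level = direct.get(sub)
        if parent == structure and level not in (None, "not detected"):
            return level, "propagated"
    return None, None

direct = {"telencephalon": "weak"}
print(annotation_for("future brain", direct))   # ('weak', 'propagated')

# A direct annotation on the parent overrides the propagated one:
direct["future brain"] = "not detected"
print(annotation_for("future brain", direct))   # ('not detected', 'direct')
```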
Likewise, there is a similar problem with the granularity of information desired. Finding two distinct annotations with the same conclusion is a powerful argument for trusting the conclusion. However, the granularity of information desired affects the decision as to whether or not the annotations are in agreement. Imagine there are two annotations: one annotation suggests bmp4 is strongly expressed in the future brain, and a second annotation demonstrates bmp4 is weakly expressed in the future brain. If the biologist is attempting to determine if the gene is expressed or not expressed, then these annotations may be taken to agree. Yet, if the aim is to determine the level of expression, these annotations are conflicting.
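This granularity dependence can be made concrete with a small sketch: whether two annotations "agree" depends on the question being asked. The function and question labels below are hypothetical.

```python
def agree(level_a, level_b, question):
    """Do two expression-level annotations agree, given the user's question?
    question == "presence": expressed vs not expressed only.
    question == "level":    the exact level of expression matters."""
    if question == "presence":
        positive = {"detected", "weak", "moderate", "strong"}
        return (level_a in positive) == (level_b in positive)
    return level_a == level_b

# Strong and weak expression agree on presence, but conflict on level:
print(agree("strong", "weak", "presence"))  # True
print(agree("strong", "weak", "level"))     # False
```

Any attempt to collapse "duplicate" arguments therefore has to know which question the biologist is asking, which the system cannot determine on its own.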
The goal of reducing the number of possible arguments is further hindered by a request for more positive aspects to be highlighted. For instance, although an argument is created when an experiment's probe information (the probe is used to detect the presence of an expressed gene) is absent, no argument is created when it is present.
In summary, Argudas' users appear to wish for a broader range of prospective arguments, and yet a smaller number of realised arguments. Reconciling these competing aims seemed improbable, until someone remarked that the problem was not the volume of the arguments but the amount of text to be read. It transpired that a significant number of potential users wanted to scan information rather than read it.
The notion of argument reconsidered
Previous work, and the initial version of Argudas, use the ASPIC argumentation engine to generate and evaluate arguments inside a virtual debate. These arguments are presented to the user as a natural language paragraph; this display mechanism was chosen as it was the preference of the original expert. However, feedback suggests this choice is subjective [15]. Furthermore, the potential for a large number of arguments to be generated appears to imply that the original approach is sub-optimal. There is a clear need to find an alternative method for displaying arguments.
During internal discussions it was proposed that the argumentation mechanism should be reconsidered. This approach was based on the belief that users wanted quick access to certain key attributes of the annotation. Theoretically there is no need to employ the argumentation engine to create and evaluate arguments. Instead, the most important schemes (as identified by the scheme revision process), should be the basis for a range of key attributes that describe the annotation. The schemes indicate whether or not the information stored in EMAGE/GXD for a particular annotation should increase or decrease a user's confidence in that annotation. As such, Argudas should extract information from the resources and present it, with associated key highlights, to the user. It then becomes the duty of the user to evaluate the information.
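The attribute-based idea amounts to turning each important scheme into a boolean check on an annotation, rendered as a tick or cross. The attribute names and annotation fields below are hypothetical, intended only to show the shape of this approach.

```python
# Each important scheme becomes a boolean "key attribute" of an annotation
ATTRIBUTES = {
    "multiple annotations agree": lambda ann: ann["agreeing_annotations"] > 1,
    "probe information available": lambda ann: ann["probe"] is not None,
    "direct (not propagated) annotation": lambda ann: not ann["propagated"],
}

def key_attributes(ann):
    """Evaluate every attribute for one annotation: True -> tick, False -> cross."""
    return {name: check(ann) for name, check in ATTRIBUTES.items()}

ann = {"agreeing_annotations": 2, "probe": "MGI:1234", "propagated": False}
for name, ok in key_attributes(ann).items():
    print(("tick" if ok else "cross"), "-", name)
```

No argumentation engine is involved: the system simply evaluates and displays the checks, and the user weighs them.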
In order to test this hypothesis two mock interfaces were created and evaluated. There are three steps to each interface. The first two steps are the same: select a gene and/or structure of interest; report on the available annotations and allow the user to ask for more information if desired. Figure 2 shows both of these: initially the query is bmp4 - future brain in all stages; the query causes all combinations of the gene and structure to be displayed in a table. The table presents all relevant annotations, summarises what each annotation shows, and provides a link to the resource's web page for that annotation.
In some situations the table in Figure 2 will be enough to resolve a biologist's question; i.e. it is clear that bmp4 is expressed in the future brain in stage 14. On the occasions when the table is not helpful, or does not provide enough information, clicking the argue button provides a range of arguments.
In the first mock interface a number of textual arguments are displayed - in a similar manner to Figure 1 part B. The second interface can be seen in Figure 3 - the 'arguments' are now a list of key attributes such as multiple annotations agree. Whether or not an attribute should strengthen a user's confidence in the annotation is indicated with a tick or cross. The attributes are divided into two layers - firstly by expression level, and then by annotation. For each level of expression there are three attributes that indicate how likely that level of expression is. Asking for more information causes the second layer of attributes to appear. This allows the user to evaluate the annotations individually, and collectively as a group that promotes a specific expression level.
Argument presentation reconsidered
The mock interfaces, shown in Figures 2 and 3, were explored with the help of two expert users. During this small informal exercise (described later in the Methods section), the test users were kept apart; however, they reached identical conclusions:
1. the revised interface, using key attributes rather than textual arguments, is an improvement;

2. a further improvement could be made by placing the content of Figure 3 into a table.
Independently, the experts both arrived at the idea shown in Figure 4. Effectively, the important schemes become columns, with individual annotations represented as rows. Ticks indicate positive aspects that suggest the annotation is more likely to be correct, with crosses having the opposite semantics. Blue dashes inform the user that a piece of information is unavailable.
Enacting these recommendations has a major side effect: there is no need to perform computational argumentation. As such, there is no need for the argumentation engine. Instead the user now performs the argumentation themselves using the aggregated and curated information presented by Argudas.
This change helps address the issue of subjectivity, because the user is now able to apply his/her own confidence values and criteria to the decision making process. Furthermore, because the user is able to build his/her own arguments, Argudas becomes a more flexible tool. For example, it is now possible to use Argudas to decide where, and at what level, a gene is expressed.
Implementing these changes results in the second, and current, Argudas system (the user interface can be seen in Figure 4). One addition to the advice offered by the expert users is the inclusion of mouse-overs to present extra information. Discreet dots underneath a tick/cross, or column title, indicate that more information is available if the user hovers his/her mouse over the dots. For example, hovering the mouse over the dots in the cell for 'strong - multiple annotations agree' (Figure 4, top table) reveals the experiment(s) that suggest the gene is strongly expressed.
The evaluation of this system, described later, demonstrates that the changes are both appropriate and effective because they allow users to quickly scan the important information.
Future work
This work set out to model expert knowledge and use it to reason with information sources available through the Internet. In the current use case, this appears to go beyond what a typical end user wants. However, the current use case is relatively constrained, and thus contained: it focuses on one kind of gene expression information for one model organism. Extending the use case to include different types of biological information, for example gene regulatory networks, makes it considerably more complex. As the intricacy of the biological investigation increases, the need for user support likewise increases. Argumentation is one possible support mechanism.
Another avenue for future work relates to CUBIST, Combining and Uniting Business Intelligence with Semantic Technologies, an EU FP7 project that aims to combine the essential features of Semantic Technologies, Business Intelligence, and Visual Analytics. Data from both unstructured and structured sources will be federated within a Business Intelligence enabled triple store, before visual analysis techniques such as Formal Concept Analysis [38] are applied. One of the project's three use cases involves the gene expression data described in this paper. Although it is early in the life of the CUBIST project, a semantic Extract Transform Load [39] process that includes computational argumentation may be envisioned. The inclusion of argumentation may provide an intelligent transformation of data, and a user-friendly explanation of the transformation.