Statistical analyses of quantitative data can be constructed and performed in a generic fashion using the Taverna workflow system. This was demonstrated by the t-test identification of differentially-expressed genes from microarray data using the workflow shown in Figure 1. Such generic analyses were made possible by a number of new features present in Taverna and in particular the development of the RShell processor which acts as a client to R when deployed as a service using the RServe library. Standalone web services can offer specific types of statistical analysis such as clustering for data analysis. However, the development of the R processor means that any type of statistical analysis can be performed on data when implemented as a script in the R programming language within a Taverna workflow . This is of particular benefit to the transcriptomics domain since this processor provides Taverna with access to further tools based on R for analysing microarray data such as Bioconductor  which was used by Fisher and colleagues for the identification of candidate genes associated with Trypanosomiasis in workflows . A number of tools such as oneChannelGUI  and GenePublisher  already provide users with access to R and Bioconductor from a graphical user interface. However, the advantage of coupling the R tool with an application like Taverna is that the workflow system provides an interoperability platform for data to be fetched from distributed services for subsequent analysis using R in a workflow. Conversely, the results of an R analysis can be sent for further processing by services downstream in the workflow. The use of distributed data and analysis services in conjunction with R were demonstrated in the example workflow (Fig. 1). Data analysed by R were provided by the maxdLoad2 microarray database using its web service interface provided by maxdBrowse and the results of this were further analysed by the GOTermFinder tool.
The current work on analysing quantitative data extends that by Stevens and coworkers who investigated the implementation of gene annotation pipelines as workflows in a previous version of Taverna . Since then, ad hoc functionality can be incorporated into Taverna workflows through the use of a processor which executes scripts written in the Beanshell language Java . Such scripts provide a method for implementing shim services  which, in the example workflow, were used to reconcile data mismatches between services (Fig. 1). Beanshell scripts were also used to implement a basic form of user interaction using Java Swing classes to steer workflows during their enactment. In the example workflow, this technique was used to select different pair-wise combinations of classes for t-test analysis. This type of interaction is used by the workflow user to direct the workflow to analyse specific subsets of data which are not known at runtime.
Scripts written in the Beanshell and R languages provide generic functionality which can be used as tools to analyse specific types of quantitative data from different scientific domains. In addition, domain-specific functionality can be provided for use in scientific workflows through the plugin mechanism now present in Taverna's software architecture. The ability to incorporate plugins makes it easier for the scientific community to contribute and share functionality in Taverna without the need for their source code to be tightly coupled with the core. In the example workflow (Fig. 1), the plugin mechanism was demonstrated by the implementation of renderers for browsing PDF documents (Fig. 5) and the display of text files containing quantitative data in CSV format as tables. In the context of the current work, an opportunity for developing other plugins for analysing quantitative data might include the ability to invoke analyses in other applications such as MATLAB  and Mathematica  for those users who prefer these tools to perform their calculations.
An initial investment in time and effort is required to construct data analysis workflows. This was true in the case of the example microarray data analysis workflow since a working knowledge of the R programming language is required to devise the t-test analyses, as well as experience of Java programming for writing Beanshell scripts to implement user interaction and shim services. A lack of experience in these two languages can therefore prohibit the construction of complex data analysis workflows by entry-level users of Taverna. However, the fact that workflows can be saved as XML files in the Scufl language allows these analyses to be shared with colleagues and with the wider life sciences community. This is especially pertinent to multi-disciplinary research groups whose expert data analysts use R and other similar tools to develop data analysis protocols, which are then employed by colleagues who are laboratory scientists. This laboratory group of scientists understand the conceptual basis of the analysis performed by the R script but may not have the inclination (or the need) to learn the R programming language in order to understand how such scripts have implemented their analyses. Since it is often the case that there are many more producers of data than there are experts in data analysis within a research group, it is desirable for the laboratory scientists themselves to perform analyses of their own data by re-using R scripts, that may have been incorporated into workflows, written by the data analysis experts. These workflows can therefore be regarded as standard operating procedures providing a best-practice solution for analysing specific types of data. The automation of analyses afforded by Taverna and other workflow software such as Kepler  enables laboratory scientists to quickly test hypotheses on their data by guiding them through complex statistical analyses, especially when users can steer and interact with the enactment of a workflow. This enables a user to run multiple analyses with different parameterizations which was required for the analysis of the microarray data in this study in order to extract information about what was biologically significant. This benefit provided by a workflow approach to users has been implemented by providers of commercial microarray data analysis software such as GeneSpring .
There are still a number of issues highlighted by the current microarray data analysis study associated with the use of web services and workflows that still need to be addressed. Firstly, there is the problem of securing analysis services such as that provided by an R server used in the example workflow. Such services may have been deployed on powerful compute clusters and so their use by unauthorised users within workflows is undesirable. Using the RServe library, the R server can be configured to allow access for those users providing the appropriate username and password combination. However, this information is embedded within the Scufl file and so it is not suitable to share this file with anyone other than colleagues. Future work on the development of Taverna 2, involving the use of distributed agents to handle the security of services during workflow enactment will address this problem. The sharing of workflows within the life sciences community is also an issue that is currently being addressed by the myExperiment project http://www.myexperiment.org/. It introduces the concept of social networking web sites for workflows and aims to provide a collaborative environment for scientists to safely publish their workflows with supporting documentation, and share them with a wider group of people as well as addressing concerns relating to their attribution, credit and intellectual property on the web.