geneCommittee is implemented as an AJAX-enabled web application written in Java (J2SE 1.6) and designed to run on a standard Tomcat 6 web application server. Its source code is publicly available in a GitHub repository [26] and is distributed under the GPL free software license. The ZK development framework [27] was used to construct a rich web user interface with many features of a desktop application.
Architecture
geneCommittee is based on a layered design in which the system is structured in four main tiers: (i) web interface, (ii) controllers, (iii) execution engine, and (iv) data management. Figure 1 shows a block diagram representing the geneCommittee architecture and its relationship with external entities.
As previously mentioned, the web interface uses the ZK framework, which eases the development of desktop-like web applications by providing a set of rich widgets together with an environment that uses intensive AJAX communications between the client and the server. In geneCommittee we have taken advantage of these features to provide a very intuitive user interface to guide the user through the main system workflow.
In the web interface, the application logic is managed by a set of specific controllers. The design of this layer is strongly based on the Model-View-ViewModel (MVVM) architectural pattern supported by the ZK framework, which reduces the coupling between the user interface and the controller classes. Depending on their primary role, geneCommittee defines three different types of controllers: (i) committee controllers, which manage the committee creation workflow, (ii) the diagnostic controller, which is responsible for the classification of new patients, and (iii) the data management controller, which is in charge of data set handling. Additionally, although not represented in Figure 1, other minor controllers handle secondary features such as personal data modification, feedback reporting and help support. As shown in Figure 1, committee and diagnostic controllers use Weka [28] classifiers and feature selection algorithms to build the committees. Moreover, when training a new committee, a special controller uses the web service interface of the GeneBrowser system to enrich a list of previously selected genes.
The execution engine layer is probably the most important piece of geneCommittee, as it contains the core of the application. The execution engine runs almost every task asynchronously. This design principle provides two major advantages. On the one hand, because all tasks are run by one single class, their execution can be scheduled to avoid system overloads. Specifically, the geneCommittee application allows the system administrator to select the maximum number of tasks that can be executed simultaneously as a way to control both memory and processor consumption. Moreover, the execution engine automatically interleaves the execution of tasks from different users in order to avoid long waits. On the other hand, the execution engine is completely isolated from the user interface and remains alive to complete the execution of pending tasks even when the user closes the session. In such a situation, users can request to receive an email when their tasks are finished.
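The scheduling behaviour described above can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the Java used by geneCommittee, and it is not the actual engine: the names `interleave`, `run_all` and `max_concurrent_tasks` are hypothetical, and round-robin ordering is only one plausible way of interleaving user tasks.

```python
from collections import OrderedDict, deque
from concurrent.futures import ThreadPoolExecutor


def interleave(tasks_by_user):
    """Return (user, task) pairs in round-robin order across users, so that
    no user waits behind another user's whole backlog (assumed policy)."""
    queues = OrderedDict((user, deque(tasks)) for user, tasks in tasks_by_user.items())
    ordered = []
    while queues:
        for user in list(queues):  # copy the keys so entries can be deleted
            ordered.append((user, queues[user].popleft()))
            if not queues[user]:
                del queues[user]
    return ordered


def run_all(tasks_by_user, max_concurrent_tasks=2):
    """Run every task on a pool capped at max_concurrent_tasks workers.

    The cap plays the role of the administrator-defined limit on
    simultaneously executed tasks."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_concurrent_tasks) as pool:
        futures = [(user, pool.submit(task)) for user, task in interleave(tasks_by_user)]
        for user, future in futures:
            results.setdefault(user, []).append(future.result())
    return results
```

Because the pool lives independently of any caller-facing interface, submitted tasks keep running until they finish, which mirrors the isolation between the execution engine and the user session.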
At the bottom of Figure 1, the data management layer is responsible for guaranteeing data persistence. geneCommittee stores training and test data sets in separate files, while the remaining information (e.g. data set metadata, user data and committees) is stored in a relational database. All data are stored on the server as soon as they are uploaded or generated, which prevents data loss and spares the user unnecessary save actions.
Sources of available knowledge
In order to assess the biological discrimination power of the available classifiers, the gene sets need to be annotated with biomedical knowledge obtained from public repositories and databases. To address this challenge, geneCommittee applies a well-established workflow supported by the GeneBrowser application. All implemented services are based on a biomedical warehouse that has a generic database schema and supports an unlimited number of biological databases. Currently it includes data from public data sources such as UniProt [29], Entrez Gene [30], Gene Ontology, KEGG [31] and PubMed [32]. Overall, it integrates 1000 species, representing over 7 million gene products with 70 million alternative gene/protein identifiers and 140 million associations with biological entities. Detailed information regarding the schema and the integrated databases is available in [33].
The programmatic access supported by GeneBrowser is used by geneCommittee to automatically annotate the list of genes with terms such as Gene Ontology terms, KEGG pathways and OMIM disease associations. This information is then used by the specialist to select meaningful features for training biologically interpretable classifiers. In addition, these enriched groups can be further explored as datasets in GeneBrowser.
System workflow
With the goal of providing specific support to diagnostic analyses and related clinical management decisions, the workflow implemented in geneCommittee includes four distinct phases (see Figure 2): Access control, Data management, Committee training, and Diagnostic. This structure was planned with two different types of users in mind, one with a more statistical background and another mainly interested in patient diagnosis. Nevertheless, users with any intermediate profile can easily take advantage of all the implemented functionalities.
In order to gain access to our tool, the first step is to create an account, which allows each user to keep their own independent workspace. A demo account can also be used to get a quick overview of the system functionalities. After login, two different paths can be followed independently. The first deals with the management of training datasets and the construction of classification models; the second is related to the exploitation of the developed models to classify patients in new datasets.
Data management
In this area the user can upload and manage raw datasets that will later be used for training a specific committee. geneCommittee accepts comma-separated value (CSV) files where samples are represented in columns and genes are placed in rows. Accordingly, each sample (column) contains a Sample ID and a Class, and each gene (row) contains an ID and a name. Each cell in this sample/gene matrix specifies the expression value of a given gene for a specific sample. More detailed information about the format can be found in the user manual (see Additional file 1). In addition to data import and export options, this area also allows the user to search and inspect each uploaded dataset.
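As an illustration of the genes-in-rows, samples-in-columns layout, the following Python sketch parses such a matrix. The exact cell positions are an assumption made here for the example (two leading header rows holding Sample IDs and Classes, two leading columns holding gene ID and name); the authoritative format is the one given in the user manual. The function name `parse_expression_csv` and the sample values are hypothetical.

```python
import csv
import io


def parse_expression_csv(text):
    """Parse a genes-in-rows, samples-in-columns expression matrix.

    ASSUMED layout (hypothetical, not taken from the user manual):
      row 1:   two blank cells, then one Sample ID per column
      row 2:   two blank cells, then one Class per column
      rows 3+: gene ID, gene name, then one expression value per sample
    """
    rows = list(csv.reader(io.StringIO(text)))
    sample_ids, classes = rows[0][2:], rows[1][2:]
    samples = [{"id": s, "class": c, "expression": {}}
               for s, c in zip(sample_ids, classes)]
    genes = []
    for row in rows[2:]:
        gene_id, gene_name, values = row[0], row[1], row[2:]
        if len(values) != len(samples):
            raise ValueError("gene %s has %d values for %d samples"
                             % (gene_id, len(values), len(samples)))
        genes.append({"id": gene_id, "name": gene_name})
        # Each cell is the expression value of this gene in one sample.
        for sample, value in zip(samples, values):
            sample["expression"][gene_id] = float(value)
    return genes, samples


# Illustrative input only; IDs, classes and values are made up.
example = ",,S1,S2\n,,tumour,normal\nG1,TP53,7.5,3.25\nG2,BRCA1,1.0,2.0\n"
genes, samples = parse_expression_csv(example)
```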
Datasets uploaded to geneCommittee should be previously normalized, as no pre-processing utilities are included. We recommend using the Robust Multichip Average (RMA) normalization technique [34].
Committee training
The Committee training area is a key piece in the overall architecture of our geneCommittee server. It implements an easy-to-use and straightforward six-step wizard that supports both (i) the initial enrichment of raw datasets and (ii) the later construction, training and validation of interpretable classifiers to finally build an accurate committee of experts. Figure 3 shows all the steps comprising the whole wizard together with the input and output entities stored by geneCommittee.
In order to better understand the specific functionality of each block depicted in Figure 3, we briefly introduce every step comprising the Committee training wizard.
S1. Data set
The first action consists of selecting the desired dataset from those previously uploaded using the data management area. For each dataset, general information is shown regarding conditions and samples. Since the selected dataset is used throughout the remaining pipeline, any change in this step implies that the training process already performed with the current dataset will be lost unless the whole workflow is completed and saved.
S2. Gene set
In this step the user can select those genes showing higher discriminative potential by using several well-established ranking methods: the chi-squared statistic, information gain, gain ratio and the ReliefF feature filtering algorithm. Additionally, numeric attributes can be converted to binary values and/or missing values merged in order to better adapt raw data to the preferred filtering algorithm.
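To make the idea of ranking genes by discriminative potential concrete, the sketch below computes information gain after binarizing each gene at a fixed threshold. It is a simplified illustration, not the Weka implementation used by geneCommittee: the single shared threshold and the function names `entropy`, `information_gain` and `rank_genes` are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting the samples at `threshold`."""
    low = [c for v, c in zip(values, labels) if v <= threshold]
    high = [c for v, c in zip(values, labels) if v > threshold]
    # Weighted entropy of the two partitions (empty partitions are skipped).
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (low, high) if part)
    return entropy(labels) - remainder


def rank_genes(expression, labels, threshold):
    """Return gene IDs sorted from most to least informative.

    `expression` maps a gene ID to its expression values, one per sample,
    in the same order as `labels`."""
    scores = {gene: information_gain(values, labels, threshold)
              for gene, values in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

A gene whose binarized values separate the classes perfectly obtains the full class entropy as its gain, whereas a gene whose values are independent of the class obtains a gain of zero.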
S3. Enrichment
In this stage of the workflow, previously selected genes are automatically enriched using GeneBrowser web services, which provide valuable biological knowledge regarding related enzymes, homologies, ontologies, proteins, pathways, diseases and drugs [25]. This wealth of data gathered from multiple sources can be used to add new relevant features to be taken into consideration during the rest of the training process. As soon as the on-line gene enrichment process is finished, a new table is generated containing the name of the retrieved items, their sources, the associated p-value, the number of genes involved and a complementary link to the GeneBrowser application. This link allows the user to obtain exhaustive information for entries of interest without leaving the geneCommittee workspace.
S4. Classifiers
After the gene set enrichment process, it is necessary to specify and configure (i) the classifiers that will be used and (ii) the desired global evaluation strategy. The global evaluation system implemented in geneCommittee is very flexible, allowing the use of either a single classification method or a specific combination of multiple classifiers. Since each classification algorithm has its own advantages and drawbacks, and taking into consideration that its performance directly depends on the data, our proposal of having a committee of experts allows choosing the best combination of methods for a specific dataset. At the moment, our geneCommittee server supports five different types of standard classifiers: (i) k-nearest neighbours, (ii) decision trees, (iii) support vector machines, (iv) naïve Bayes, and (v) random forest. According to their needs, users can select any number of classifiers (of the same type or not), each of which can be configured individually.
S5. Evaluation
Once the list of candidate classifiers is finally defined and the evaluation strategy is set, the evaluation step provides an interactive live visualization of the experiment execution. As soon as the job is finished, the user can choose the desired experts (combination of individual classifiers and gene features) to build a new committee (see Figure 4).
In order to evaluate the performance of each expert on the initially selected dataset, the statistical analysis carried out by geneCommittee can be conveniently adjusted. Implemented measures include Cohen’s Kappa, accuracy, precision, recall, specificity and F-measure, which give a better perception of the significance of the results. Another interesting feature is the possibility of visualizing the results for a specific class or, in the case shown in Figure 4, a condition. After selecting the classifiers that comprise the newly generated committee, the user has to save it for later use in evaluating new samples (in Diagnostic mode).
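For reference, the measures listed above can be derived from the confusion matrix of a two-class problem using their standard definitions, as in the following sketch. The function name `binary_metrics` and the convention of returning 0.0 for an undefined ratio are choices made for this example, not a description of how geneCommittee computes or reports these values.

```python
def _ratio(numerator, denominator):
    """Safe division: return 0.0 when the ratio is undefined."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, fn, tn):
    """Standard evaluation measures from a 2x2 confusion matrix.

    tp/fp/fn/tn are the counts of true positives, false positives,
    false negatives and true negatives for the class of interest."""
    total = tp + fp + fn + tn
    accuracy = _ratio(tp + tn, total)
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    f_measure = _ratio(2 * precision * recall, precision + recall)
    # Cohen's kappa: observed agreement corrected for the agreement
    # expected by chance given the row and column totals.
    expected = _ratio((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn), total ** 2)
    kappa = _ratio(accuracy - expected, 1 - expected)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_measure": f_measure,
        "kappa": kappa,
    }
```

For instance, with 40 true positives, 10 false positives, 5 false negatives and 45 true negatives, the accuracy is 0.85 and the chance agreement is 0.5, giving a kappa of 0.7.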
S6. Summary
Finally, the last step included in the Committee training wizard is in charge of showing general information about the input dataset used for training, summarizing the criteria used for performing the gene selection process and introducing some details about the new available committee.
Diagnostic mode
In this area it is possible to directly apply previously trained committees to evaluate new samples. As soon as the user uploads a new dataset corresponding to unseen patient data, the selected committee starts working on the diagnosis of all the samples to identify their corresponding classes (conditions). The whole process carried out by geneCommittee working in Diagnostic mode is shown in Figure 5.
Once the selected committee concludes the processing of the new dataset, geneCommittee presents the result of the classification process as shown in Figure 6.
The information displayed in Figure 6 is structured using a table where each column (except for the first) represents one different patient. In this table, rows are grouped in four main sections:
- Committee: each row contains the diagnostics of one member (expert) of the committee, that is to say, a classifier trained using only the biological information of its associated gene set. Committee members select one single condition for each patient.
- By Gene Set: this section summarizes the committee members’ diagnostics by grouping the outputs of those experts that share the same gene set. Only the conditions with the highest number of votes are shown.
- By Classifier: in the same way as the previous section, this section groups the committee members’ diagnostics according to the type of classifier.
- Voting: this section summarizes the whole diagnostic process by showing the votes that each condition has received, along with a final row highlighting the condition or conditions with the highest number of votes.
The diagnostic view shown in Figure 6 also provides a helpful toolbar with several options for conveniently managing the information displayed.
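The four sections of the diagnostic table can be thought of as different aggregations of the same set of expert predictions. The following Python sketch shows one way to compute them for a single patient, assuming simple majority voting in which ties are reported as several conditions. The function names `winners` and `summarize`, and the representation of an expert as a (gene set, classifier, condition) tuple, are hypothetical.

```python
from collections import Counter, defaultdict


def winners(votes):
    """Return every condition tied for the highest vote count, sorted."""
    counts = Counter(votes)
    top = max(counts.values())
    return sorted(cond for cond, n in counts.items() if n == top)


def summarize(predictions):
    """Aggregate expert predictions for one patient.

    `predictions` is a list of (gene_set, classifier, condition) tuples,
    one per committee member (the "Committee" section of the table)."""
    by_gene_set, by_classifier = defaultdict(list), defaultdict(list)
    for gene_set, classifier, condition in predictions:
        by_gene_set[gene_set].append(condition)
        by_classifier[classifier].append(condition)
    all_votes = [condition for _, _, condition in predictions]
    return {
        "by_gene_set": {k: winners(v) for k, v in by_gene_set.items()},
        "by_classifier": {k: winners(v) for k, v in by_classifier.items()},
        "votes": dict(Counter(all_votes)),  # votes received per condition
        "final": winners(all_votes),        # condition(s) with most votes
    }
```

With four experts built from two gene sets and two classifier types, three of which predict the same condition, `summarize` reports that condition as the final result while still exposing the disagreement inside the affected gene set and classifier groups.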
RNA-Seq support
The workflow implemented in our geneCommittee server is based on previous works [20, 22] in which we defined successful knowledge-based classification protocols for analysing gene expression data. Although these approaches were initially intended to specifically handle DNA microarrays, the underlying workflows are generic and able to process other types of gene expression data sources. In this context, mainly motivated by the increasing importance and popularity of RNA-Seq techniques for measuring gene expression levels, in this work we also evaluate the suitability of our geneCommittee server for dealing with compatible RNA-Seq datasets.