Highlights from the 11th ISCB Student Council Symposium 2015

Table of contents A1 Highlights from the eleventh ISCB Student Council Symposium 2015 Katie Wilkins, Mehedi Hassan, Margherita Francescatto, Jakob Jespersen, R. Gonzalo Parra, Bart Cuypers, Dan DeBlasio, Alexander Junge, Anupama Jigisha, Farzana Rahman O1 Prioritizing a drug’s targets using both gene expression and structural similarity Griet Laenen, Sander Willems, Lieven Thorrez, Yves Moreau O2 Organism specific protein-RNA recognition: A computational analysis of protein-RNA complex structures from different organisms Nagarajan Raju, Sonia Pankaj Chothani, C. Ramakrishnan, Masakazu Sekijima; M. Michael Gromiha O3 Detection of Heterogeneity in Single Particle Tracking Trajectories Paddy J Slator, Nigel J Burroughs O4 3D-NOME: 3D NucleOme Multiscale Engine for data-driven modeling of three-dimensional genome architecture Przemysław Szałaj, Zhonghui Tang, Paul Michalski, Oskar Luo, Xingwang Li, Yijun Ruan, Dariusz Plewczynski O5 A novel feature selection method to extract multiple adjacent solutions for viral genomic sequences classification Giulia Fiscon, Emanuel Weitschek, Massimo Ciccozzi, Paola Bertolazzi, Giovanni Felici O6 A Systems Biology Compendium for Leishmania donovani Bart Cuypers, Pieter Meysman, Manu Vanaerschot, Maya Berg, Hideo Imamura, Jean-Claude Dujardin, Kris Laukens O7 Unravelling signal coordination from large scale phosphorylation kinetic data Westa Domanova, James R. Krycer, Rima Chaudhuri, Pengyi Yang, Fatemeh Vafaee, Daniel J. Fazakerley, Sean J. Humphrey, David E. James, Zdenka Kuncic

partners, and this goes on until the allotted time runs out. The traditional scientific component of the meeting consisted of two keynote presentations by senior researchers, twelve student presentations, and a poster session. At SCS 2015, Dr. Ruth Nussinov (National Cancer Institute, USA and Tel Aviv University, Israel) and Dr. Des Higgins (Conway Institute, University College Dublin, Ireland) generously agreed to deliver the keynote addresses. The symposium also included two short presentations on open science. The first, about publishing in the digital era, was given by Dr. Michael Markie, a faculty member at our institutional partner F1000 (UK). The second, about data sharing, was given by Dr. Robert Davey, a group leader with our institutional partner The Genome Analysis Center (UK). Students submitted 100 abstracts to SCS 2015, which were peerreviewed by 25 independent reviewers. Approximately 75 abstracts were accepted for poster presentations and 12 of these were also accepted for oral presentation. Extended abstracts of oral presentations are included in this report. All abstracts are available online in the SCS 2015 booklet (http://scs2015.iscbsc.org/scs2015-booklet).

Keynotes
The day opened with Dr. Ruth Nussinov's keynote, in which she explained that Ras GTPase proteins involved in signal transduction include oncogenes such as KRas4B. While little is currently know about the mechanism by which the different mutations of KRas4B lead to cancer, Dr. Nussinov revealed that different mutations are differentially associated with different types of cancer. She believes that "structural biology, computations and experiment, are uniquely able to tackle" this question. In the afternoon, Dr. Des Higgins began his keynote presentation with a history of multiple alignment algorithms. He then presented his newest alignment program, Clustal Omega, designed to align large numbers of sequences quickly and accurately.

Student presentations
The student presentations were begun by Griet Laenen, who shared a new method for identifying drug targets [8]. Each putative target is ranked based on the transcriptional response of functionally related genes and based on the structural similarity of the drug to proteins know to interact with the putative target. On a ChEMBL-derived test set, AUC values of up to 90 % were achieved. Nagarajan Raju reported the identification of different modes of protein-RNA interaction in different organisms, based on structural analysis and molecular dynamics simulation [9]. The long reads available from third generation sequencing methods are a valuable resource, but can complicate mapping because of their higher error rate. Hybrid methods use more accurate short reads to correct these errors but current methods for doing so are either too slow or less accurate on larger genomes. Giles Miclotte presented the Jabba hybrid error correction method which uses corrected de Bruijin graphs to achieve comparable performance on small genomes while still performing well for larger genomes [10]. Drawing robust biological conclusions from stochastic single particle tracking trajectories is particularly difficult when particle movement is heterogeneous, such as in the plasma membrane. Paddy Slator described a method for dealing with heterogeneity by analyzing multiple models of heterogeneity and calculating model selection statistics to identify the most likely of these models, accounting for noisy data. This method was tested on several real data sets [11]. Although we often think of DNA as a two-dimensional sequence, the development of the high-throughput ChIA-PET method for identifying all physical contacts between distant loci is enabling modeling of DNA as a three-dimensional structure. Przemysław Szałaj presented 3D-NOME, a new computational method for modelling the threedimensional structure of the genome [12]. Based on both ChIA-PET data and known interactions of CTCF and RNAPII, 3D-NOME builds an initial model of the nucleosome in a bottom-up fashion and then uses Monte Carlo simulations to reconstruct each level of structure in a top-down fashion. Epigenome-wide association studies allow the high-throughput identification of epigenetic markers that contribute to human disease. Charles Edmund Breeze presented eFORGE, a tool for identifying the cell type specificity of differentially methylated positions significantly associated with disease. eFORGE identifies cell types of interest based on enrichment of overlap between significant differentially methylated positions and DNase I hypersensitive sites in each cell type, compared to matched, randomly generated background sets. Jonas Ibn-Salem used Hi-C chromatin conformation to show that paralogous genes share more enhancer elements with one another and is more likely to occur in the same topological association domain then matched, randomly selected gene pairs. This suggests that paralogs share common regulation because and cluster within the threedimensional chromatin architecture. As the number of genome sequences available increases, it becomes more challenging for biologists to characterize the data and use it to classify sequenced organisms. Giulia Fiscon presented a new feature extraction method for solving this problem, which she integrated into a classifier used to assign virus specimens to viral species [13]. RIG-I-like receptors (RLRs) are an important component of the immune response to viruses, detecting viral RNA and triggering the production of type I interferons. Robin van der Lee identified 10 features common to RLR pathway components and used these features to predict a role in the RLR pathway for 187 novel genes. RNAi knock-down screens showed that almost half of these genes had an impact on production of type I interferon, IFNβ [14]. Only four drugs for treating the lethal disease visceral leishmaniasis are available and the resistance is emerging rapidly in the causal parasite, Leishmania donovani. Bart Cuypers shared a new database, integrating the genomic, epigenomic, transcriptomic, proteomic, metabolomics, and phenotype data available for L. donovani. This database is coupled with many novel data-analysis tools, such as one used to find patterns in different omics data across different conditions [15]. In the final oral presentation of the day, Westa Domanova showed that temporal data could be used to distinguish the activity of different kinases, even those sharing the same consensus motif [16].

Award winners
Thanks to the generous contribution of the Swiss Institute of Bioinformatics, two travel fellowships were awarded to Alexander Monzon and Marek Cmero. Thanks to the generous contributions of BioMed Central, an additional two travel fellowships were awarded to Westa Domanova and Güngör Budak. Finally, the generous contributions of F1000 enabled the award of a fifth travel fellowship to Giulia Fiscon. The ISCB Student Council also awarded an Inspiring Youth travel fellowship to high school student Prathik Naidu. Based on the votes of the SCS delegates, a judging committee awarded three speakers with one best oral and two best poster presentation awards. The Best Presentation Award went to Griet Laenen for her work "Prioritizing a drug's targets using both gene expression and structural similarity". The first place poster award went to Robin van der Lee for his poster, "Systematic integration of molecular signatures identifies novel components of the antiviral RIG-I-like receptor pathway". The second place poster award went to Jonas Ibn-Salem for his poster "Co-regulation of human paralog genes in the three-dimensional chromatin architecture". These awards were enabled by the generous support of Oxford University Press. F1000 awards were also made possible by the generosity of that journal and these were awarded to Bart Cuypers, Giles Miclotte, and Charles Breeze for having posters and presentations that were rated highly by SCS delegates. Student Council activities at ISMB In addition to the one day Student Council symposium, the Student Council organizes several activities available to all students and young researchers participating in ISMB/ECCB. Interactive job postings and student council lounge To facilitate the interaction between job seekers and job advertisers, the Student Council organizes an interactive job postings board during ISMB/ECCB. Job offers can be attached to the posting board available at the SC booth in the exhibitors' hall. The SC booth also served as a lounge area where students and young researchers from ISMB/ECCB could network with one another and meet members and leaders of the Student Council.

Career Central workshop
The goal of Student Council Career Central is to expose students to the experiences and success stories of senior researchers. To this end, the Student Council collaborated with the Junior PI group and COBE COSI to organize an Applied Knowledge Exchange Sessions workshop dedicated to career development. The Student Council invited the first speaker of the session, Dr. Mick Watson, Director of ARK-Genomics, The Roslin Institute, University of Edinburgh, UK. He focused his advice on how to give an effective elevator pitch and gave attendees to practice his advice by giving their elevator pitches to one another.

Conclusions
This year's number of submissions and participants increased by about a third and a half respectively over last year's symposium. It seems likely that this is due in part to greater availability of funding and easier acquisition of visas in Europe for many participants. The increased number of participants, the participants' enthusiastic response to the networking opportunities and talks on open access science, and the high quality keynotes, student presentations, and posters, made this year's Student Council Symposium a great success. The pharmaceutical industry is facing unprecedented pressure to increase its productivity. Attrition rates in the later stages of development have risen sharply, with toxicity and lack of efficacy being the main bottlenecks [1]. To address both these safety-and efficacyrelated issues, a better understanding of the complex biological response to drug treatment is vital. Although many drugs exert their therapeutic activities through the modulation of multiple targets [2], these targets are often unknown and identification among the thousands of gene products remains difficult. We propose a computational method to support the identification of putative targets of a drug by means of a dual approach combining network diffusion of gene expression with chemical structure similarity.

Methods
The first component of our method prioritizes proteins as potential targets by integrating experimental gene expression data with prior knowledge on protein interactions [3,4]. More specifically, genes are ranked based on the transcriptional response of functionally related genes by diffusing differential expression signals following treatment over a protein interaction network. In addition, drug-protein interactions can also be predicted from structural information. Building on the similar property principle, the second component of our method prioritizes proteins as drug targets based on the interaction with compounds structurally similar to the drug of interest. To this end compound-compound similarity scores are combined with compoundprotein interaction scores. Both this structure-based and expressionbased approach produce a genome-wide ranking of potential targets that can eventually be fused to obtain a single ranking.

Results/Conclusion
Our method has been evaluated on a test set of small molecule drugs for which the known targets were derived from ChEMBL [5]. AUC values of up to 90 % were obtained. These results indicate the predictive power of combining gene expression data and structural information for a drug of interest with known protein-protein and protein-compound interaction information respectively, to identify the targets of that drug. As such this dual method can aid in gaining a better knowledge of a candidate drug's mode of action and its offtarget effects and thus be of value in the drug development process.

Motivation
Protein-RNA interactions play essential roles in many cellular processes. It is unclear whether same RNA binding proteins from different organisms show unique patterns or variations to recognize RNA. To address this issue, we have constructed 18 sets of same protein-RNA complexes belonging to different organisms and analyzed the interactions and interacting patterns using various sequence and structure based features [1].

Results
We have investigated the recognizing elements by grouping the protein chains into five organisms such as E. coli, H. sapiens, S. cerevisiae, archaea and thermophiles [2]. We observed that positively charged residues are highly preferred in E. coli whereas aromatic residues and polar residues show preference in S. cerevisiae and thermophiles, respectively. In case of RNA, adenine and uracil are highly preferred in H. sapiens and S. cerevisiae, respectively. The neighboring residues around the binding sites are unique in different organisms (Eg. Cys-His in H. sapiens, Ala-leu in E. coli, Ser-Arg in S. cerevisiae, Gly-Arg in archaea and Asn-Lys in thermophiles). Further, molecular dynamics simulations of aspartyl tRNA synthetase complexes from E. coli, T. thermophilus and S. cerevisiae revealed the similarities and differences in structurally equivalent binding site residues to understand the recognition mechanism. Conclusion Sequence and structural analysis along with MD simulation showed the variations in the interactions between protein and RNA to understand the recognition mechanism of protein-RNA complexes. Background Single particle tracking trajectories are fundamentally stochastic, which makes the extraction of robust biological conclusions difficult. This is especially the case when trying to detect heterogeneous movement of molecules in the plasma membrane. This heterogeneity could be due to a number of biophysical processes such as: receptor clustering [1], traversing lipid microdomains [2,3] or cytoskeletal barriers [4].

Results
Working in a Bayesian framework, we developed multiple models for heterogeneity, such as confinement in a harmonic potential well, switching between diffusion coefficients and diffusion in a fenced environment (or 'hop' diffusion). We implement these models using a Markov chain Monte Carlo (MCMC) methodology, developing algorithms that infer model parameters and hidden states from single trajectories. We also calculate model selection statistics, to determine the most likely model given the trajectory. Our methodology also accounts for measurement noise. For LFA-1 receptors diffusing on T cells we previously showed that 12-26 % of trajectories display clear switching between diffusive states, depending on treatment (example trajectory in Fig. 1) [5].
Analysis of the motion of GM1 lipids bound to the cholera toxin B subunit in model membranes confirmed transient trapping in harmonic potential wells. We developed an algorithm which detects hopping diffusion, and validated on simulated data (Fig. 2). We have also demonstrated that allowing for measurement noise is essential, as otherwise false detection of heterogeneity may be observed.

Conclusions
We have used Bayesian methodology to analyze single particle tracking trajectories. Rather than methods which rely on generic properties of Brownian motions, our approach allows us to test which biophysical model best fits a trajectory. With the continuing improvement in spatial and temporal resolution of trajectories, these methods will be important for biological interpretation of single particle tracking experiments. Fig. 1 (abstract O3). MCMC fit of two-state diffusion model with measurement noise to an LFA-1 trajectory, from [5]. Background Human genome is folded into three-dimensional structures. The 3D organization of the genome is thought to facilitate compartmentalization, chromatin organization and spatial interaction of genes and their regulatory elements. Recently developed high-throughput ChIA-PET method allows us to capture the genome-wide map of physical contacts between distal genomic loci.

Methods
We present the 3D-NOME, a multi-scale computational engine we developed to model the 3D organization of the genome. First, we use a bottom-up approach to create a hierarchical model representing the nucleus based on the underlying data features. Then we use Monte Carlo simulations to sequentially reconstruct all the levels in a topdown manner, i.e. we reconstruct more general levels first and we use them to guide the simulation on the following levels. This approach allows us to efficiently model the chromatin folding on a level of whole chromosomes as well as single topological domains. In our modeling we consider CTCF (which is long known to be responsible for chromatin weaving) and RNAPII (which activates genes transcription) interactions. Taken together these two protein factors provide a comprehensive map of human genome interactions. We also consider CTCF-motif orientations in order to obtain more reliable structures. The specificity of the ChIA-PET data allows us to model the shape of individual chromatin loops and their mutual interactions within topological domains boundaries. We demonstrate the effectiveness of 3D-NOME in building 3D genome models at the kilobase resolution using ChIA-PET data of human B-lymphocytes (GM12878 cell line) [1].

Conclusion
In this work we describe both the model construction and simulation steps of our algorithm. We do also highlight main advantages of our approach compared to existing methods for genome architecture modeling. We hope that further refinement of 3D-NOME and application to additional ChIA-PET and other types of 3D genome mapping data will help to advance our understanding of genome structural organization and functioning. Background Leveraging improvements of next generation technologies, genome sequencing of several samples in different conditions led to an exponential growth of biological sequences. However, these collections are not easily treatable by biologists to obtain a thorough data characterization and require a high cost-time investment. Therefore, computing strategies and specifically automatic knowledge extraction methods that optimize the analysis focusing on what data are meaningful and should be sequenced are essential [1].

Methods
Here, we present a new feature-selection algorithm based on mixed integer programming methods [2] able to extract multiple and adjacent solutions for supervised learning problems applied to biological data. We focus on those problems where the relative position of a feature (i.e., nucleotide locus) is relevant. In particular, we aim to find sets of distinctive features, which are as close as possible to each other and which appear with the same required characteristics. Our algorithm adopts a fast and effective method to evaluate the quality of the extracted sets of features and it has been successfully integrated in a rule-based classification framework [3].

Results
Our algorithm has been applied to three viral datasets (i.e., Rhino-, Influenza-, Polyomaviruses [4][5][6]) and enables to extract all the alternative solutions of virus specimen to species assignments, by identifying portions of sequence that are discriminant, compact, and as shorter as possible.
To conclude, we succeeded in extracting a wide set of equivalent classification rules, focusing on short regions of sequences with high reliability and low computational time, in order to provide the biologists with short and highly informative genome parts to be sequenced, as well as a powerful instrument both scientifically and diagnostically, e.g., for automatic virus detection.  Jean-Claude Dujardin and Kris Laukens contributed equally to this work.

Background
The protozoan parasite Leishmania donovani is the cause of visceral leishmaniasis in the Indian subcontinent and poses a threat to public health due to increasing drug resistance. Only little is known about its peculiar molecular biology and the 'omics integration efforts conducted so far are very limited. Here we present an integratory database that contains all genomics, transcriptomics, proteomics and metabolomics experiments that are currently publicly available for Leishmania donovani. In addition, the user interface contains new analysis tools that uses powerful pattern mining strategies like frequent itemset mining to allow the linking of results from different 'omics layers in new datasets.

Methods
We developed a user friendly tool to crosslink all existing L. donovaniomics experiments. Genomics, transcriptomics, proteomics, metabolomics and phenotypic data were collected and added to a MySQL database compendium, further complemented with publicly available data. Relations between different 'omics layers were explicitly defined and provided with a level of confidence. Python scripts were developed to preprocess, analyse and import the data. To allow comparability between different experiments the principles of the COLOMBOS bacterial expression compendium were adapted [1]. Next to this vast data resource, a set of integrative data-analysis tools was developed based on data mining strategies. For example: One tool uses frequent itemset mining algorithms to detect which proteins and metabolites frequently exhibit the same behaviour under different conditions. Another tool converts several -omics layers to a network format that can be opened in Cytoscape [2] thus be the basis for network analysis. Django and Twitter Bootstrap frameworks were used to create a web portal to make the tools accessible to any Leishmania researcher.

Results
Excellent public gene, protein and metabolite annotation databases are already available for Leishmania (e.g. TriTrypDB [3] and GeneDB [4]). However, the added value of our tool is that it links these annotation data to 'omics experiments that are either provided by the user, or publicly available. New experiments can quickly be preprocessed, analysed and integrated in the database via its Python back end. Using the compendium and its tools, we characterized the development and drug-resistance of Leishmania donovani in a system biology context. The genomes of more than 200 strains were examined for associations with phenotypical features and a subset was linked to transcriptomics, proteomics and metabolomics results. The compendium and its scripts were designed to be generic and can therefore be used for other organisms with only minor adaptions to the original setup.
Westa Domanova and James R. Krycer contributed equally to this work. Background A growing body of evidence is emerging that the temporal behavior of cellular signaling molecules controls biological responses. Phosphorylation, one of the most prevalent signaling modifications, occurs rapidly in response to environmental changes. Our understanding of the signaling network is incomplete because the kinase for the majority of phosphorylation substrates remains unknown. To elucidate the underlying topology of signaling cascades from high-throughput data, we need to be able to predict kinase substrate relationships. Currently, prediction algorithms ignore the crucial biological context of kinase substrate relationships.

Methods
To address this we consider temporal behaviours: given that phosphorylation events occur in a coordinated way, with some kinases being active before others, we predict site-specific kinase substrate relationships from large scale in vivo experiments. First, we used available phosphorylation databases to construct a publicly-accessible repository of experimentally-predicted kinase substrate relationships. Next, based on the substrates reported for each kinase in this database, we identify how the kinase activities change over time in temporal datasets.

Results
Applying this to an insulin-stimulated phosphorylation screen we were able to distinguish between the substrates of AKT and RPS6KB1, two kinases with the same consensus motif, and identified IRS-1-S270 as a novel putative AKT site. We subsequently used our ssKSR-LIVE algorithm to predict novel substrates for the kinases driving insulin signaling, shedding light on their role in driving insulin-stimulated biological processes. This algorithm can be applied to other high-throughput screens of signal transduction, and thus can be used to improve our understanding of complex diseases caused by dysregulated signalling, including cancer and type 2 diabetes.
• We accept pre-submission inquiries • Our selector tool helps you to find the most relevant journal Submit your next manuscript to BioMed Central and we will help you at every step: