Since its launch in April 2016, ELIXIR-IT HPC@CINECA has provided access to HPC resources to 63 research projects, allocating a total of approximately 3,250,000 CPU core hours. The 63 projects are distributed among 28 research institutions (universities and research centres). Figure 3a reports the distribution of proposed projects over several biological macro-areas, showing that the initiative engaged researchers with different backgrounds and different needs in terms of computational requirements and software. As shown in Fig. 3b, the number of submitted projects has grown steadily since the first opening of the call. The growth can be ascribed both to dissemination activities at several national conferences [15, 16] and to the good feedback obtained from the first participants, as demonstrated by the publication of the results of HPC@CINECA research projects in peer-reviewed scientific journals.
Use cases
Here we provide a brief summary of some research projects that were successfully completed thanks to the HPC resources provided by the ELIXIR-IT HPC@CINECA call.
Genome-wide mapping of 8-oxo-7,8-dihydro-2′-deoxyguanosine across human genome
8-Oxo-7,8-dihydro-2′-deoxyguanosine (8-oxodG) is one of the major DNA modifications that occur when DNA is exposed to reactive oxygen species (ROS) generated by endogenous metabolism. 8-oxodG is a potent premutagenic lesion because of its ability to pair with both cytosine and adenine residues, thus causing G:C to T:A transversions during DNA replication [17, 18].
Several thousand residues of 8-oxodG are constitutively produced in the genome of mammalian cells and a new method has been developed to identify their genomic distribution.
Recently, by using OxiDIP-Seq, Amente et al. [19] reported the genome-wide distribution of 8-oxodG in proliferating DDR-proficient mammary cells (MCF10A and MEFs). Analysis of the OxiDIP-Seq data revealed that endogenous 8-oxodG is regioselectively distributed across the mammalian genome. Moreover, an integrated data analysis starting from OxiDIP-Seq, ChIP-Seq anti-γH2AX, ChIP-Seq anti-POLII, GRO-Seq and RNA-Seq led to the identification of an accumulation of endogenous DNA damage within the gene body of long genes with poor-to-moderate transcription levels. In terms of computational resources, we used 2 TB of permanent storage and 200 k core hours to analyze about 500 GB of input data. The data were analyzed on HPC@CINECA computing resources through both the command-line environment and the automated bioinformatics pipelines [refs. To RAP and CAST] developed and provided by the CINECA-ELIXIR IT team. This computational effort led to further insights into the molecular mechanisms underlying the heterogeneity of the local mutation rate and into why certain regions seem to be more prone to oxidation than others. A full description of this work can be found in [19].
New HPC-optimized algorithm for prediction of RNA-editing events from RNA-Seq data
RNA editing is a relevant epitranscriptome modification occurring in a wide range of organisms. In humans, it affects nuclear and cytoplasmic transcripts mainly by the deamination of adenosine (A) to inosine (I) through ADAR enzymes acting on double-stranded RNA [LiChurch13]. RNA editing has a plethora of biological effects and its deregulation has been linked to a variety of human diseases including psychiatric, neurological and neurodegenerative disorders, and cancer [20]. Several bioinformatics tools to investigate RNA editing events in NGS data have been released [21]. However, the computational identification of these events is a highly time-consuming process, requiring the traversal of very large alignment files in BAM format, position by position. Employing ELIXIR-IT HPC@CINECA resources, the A-to-I calling process of the original REDItools package [22], one of the most accurate tools to call RNA editing events in RNA-Seq experiments [21], has been sped up by optimizing its implementation for HPC infrastructures:
- a first optimization in the new version of the code, REDItools2.0, consisted in loading the sequences from disk by reading each sequence only once and keeping it in memory only as long as needed. This implementation was on average 8–10 times faster than the original version running on a single core;
- another improvement of the algorithm consisted in optimizing the splitting of the genome into genomic intervals. The initial release of REDItools treated all chromosomal regions equally, dividing the whole genome into chunks of equal size and assigning each chunk to a thread. Since expression data usually do not exhibit constant coverage, the number of reads per genomic unit (density of mapped reads) is quite variable, and the original version of REDItools spent a lot of computational time in high-density genomic regions. We therefore implemented an optimal interval division in order to guarantee an approximately uniform per-thread workload;
- a parallel version of REDItools2.0 has also been implemented by writing an ad-hoc MPI Python script based on the mpi4py library [23]. This library provides bindings of the Message Passing Interface (MPI) standard for the Python programming language. In this way it is possible to exploit multiple computing nodes by means of collective communication MPI primitives. Finally, a simple master/slave template has been implemented for coordinating the overall computation.
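The idea behind the coverage-aware interval division can be illustrated with a minimal sketch. It is not the REDItools2.0 implementation: the function name `balanced_intervals`, the per-bin coverage list and the greedy cut rule are all assumptions made for illustration. The sketch cuts the genome wherever the cumulative read count crosses the next multiple of the ideal per-chunk load, so that dense regions end up in shorter intervals.

```python
from typing import List, Tuple


def balanced_intervals(bin_coverage: List[int], n_chunks: int) -> List[Tuple[int, int]]:
    """Split bins [0, len(bin_coverage)) into at most n_chunks contiguous,
    half-open intervals with roughly equal summed coverage (hypothetical helper)."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    if not bin_coverage:
        return []

    target = sum(bin_coverage) / n_chunks  # ideal load per chunk
    intervals: List[Tuple[int, int]] = []
    start = 0
    cumulative = 0
    for pos, depth in enumerate(bin_coverage):
        cumulative += depth
        # Close the current interval once the cumulative load reaches the
        # next target boundary; the last interval is closed after the loop.
        if len(intervals) < n_chunks - 1 and cumulative >= target * (len(intervals) + 1):
            intervals.append((start, pos + 1))
            start = pos + 1
    if start < len(bin_coverage):
        intervals.append((start, len(bin_coverage)))
    return intervals


# Toy coverage profile (made-up numbers): two dense bins among sparse ones.
toy_coverage = [1, 1, 1, 1, 100, 100, 1, 1]
chunks = balanced_intervals(toy_coverage, 2)           # [(0, 5), (5, 8)]
loads = [sum(toy_coverage[a:b]) for a, b in chunks]    # [104, 102]
```

On this toy profile an equal-size split into two halves would give loads of 4 and 202, whereas the coverage-aware split gives 104 and 102, which is the kind of per-thread balance the optimization aims for.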
Runs of the optimized algorithm on real RNA-Seq experiments have shown that the novel REDItools2.0 is on average ten times faster than the previous implementation and that the speedup scales adequately with the number of cores involved in the analysis (Fig. 4), thus representing the first HPC resource specifically devoted to RNA-editing detection.
Thanks to the algorithmic optimization described above, the novel REDItools2.0 package has then been used to investigate RNA editing in very large cohorts of RNA-Seq experiments, like those produced in the GTEx or TCGA projects, after the award of additional resources through a competitive PRACE (Partnership for Advanced Computing in Europe) project (ProjectID: 2016163924 GREaT - Genome wide identification of RNA editing sites in very large cohorts of human whole transcriptome data). A full description of this work is available in the PRACE White Paper (Footnote 8).
Creation of a comprehensive database for genomics data in peach (P. persica L. Batsch)
Peach is an economically important fruit tree species of temperate regions. Integrating novel genomics tools is a fundamental goal for increasing the efficiency of breeding activities and for leveraging basic knowledge in this species. After the release of the first peach genome draft, the remarkable advances in high-throughput molecular tools have led to the generation of a multitude of genomics data from several whole-genome re-sequencing projects.
In this project, whole-genome sequencing data of 125 peach (P. persica L. Batsch) accessions and 21 wild relatives of the Amygdalus subgenus were downloaded from the NCBI SRA [24], for a total of 146 publicly available accessions (input data size about 10 TB). Variant discovery was achieved by applying an imputation-free joint variant-calling procedure to the 146 accessions, which improves variant discovery by leveraging population-wide information from a cohort of multiple samples [25]. A total of 200 k core hours were used on the Pico cluster to analyse all the samples and create the compendium dataset of peach variants. The identified peach variants, both SNPs and InDels, are available at the PeachVar-DB portal (Footnote 9), which provides easy access to the information mined from peach Whole Genome Re-Sequencing (WGRS) data. A full description of this work can be found in [26].
High-quality genome assembly for the European barn swallow (Hirundo rustica rustica)
The barn swallow is a passerine bird with at least eight recognized subspecies in Europe, Asia, and North America. Due to its synanthropic habits and its cultural value, the barn swallow is also a flagship species in conservation biology [27]. The availability of high-quality genomic resources, including a reference genome, is thus pivotal to further boost the study and conservation of this species.
To facilitate further population genetics and genomic studies, as a part of the Genome10K effort on generating high-quality vertebrate genomes (Vertebrate Genomes Project) [28], Formenti et al. [29] have assembled a highly contiguous genome for the European subspecies (Hirundo rustica rustica) using single molecule real-time (SMRT) DNA sequencing and Bionano optical map technologies. The assembly of the genome, which was performed entirely on the Marconi CINECA HPC supercomputer, required 3840 central processing unit (CPU) hours and 2.2 TB of random access memory (RAM) for read correction, 768 CPU hours and 1.1 TB of RAM for the trimming steps, and 3280 CPU hours and 2.2 TB of RAM for the assembly phase. The entire process was completed in less than 5 days on the CINECA HPC platform, while re-analysis of the same data on a local server running at full computational capacity required more than 80 days (Matteo Chiara, personal communication).
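As a quick check of the figures quoted above, the per-step CPU hours can be summed and the wall-clock ratio between the local server and the HPC platform can be bounded. The snippet below is only arithmetic on the numbers reported in the text; the variable names are illustrative, and because the text gives "less than 5 days" and "more than 80 days", the ratio is a lower bound rather than a measured speedup.

```python
# CPU hours reported in the text for each assembly step.
cpu_hours = {
    "read correction": 3840,
    "trimming": 768,
    "assembly": 3280,
}
total_cpu_hours = sum(cpu_hours.values())  # 7888

# Wall-clock bounds reported in the text.
hpc_days_upper_bound = 5      # "less than 5 days" on the HPC platform
local_days_lower_bound = 80   # "more than 80 days" on the local server

# Lower bound on the wall-clock speedup: more than 16-fold.
min_speedup = local_days_lower_bound / hpc_days_upper_bound  # 16.0
```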
After removal of haplotigs, the final assembly was approximately 1.21 Gbp in size, with a scaffold N50 of more than 25.95 Mbp, representing a considerable improvement over the previously reported assembly [30]. Systematic comparisons of this high-quality draft genome assembly of H. rustica with a collection of closely and distantly related bird genomes provide phylogenomic profiles of structural rearrangements and gene losses/gene duplications. The approach used for the assembly of the barn swallow genome, while attesting to the effectiveness of SMRT sequencing combined with DLS optical mapping for the assembly of vertebrate genomes, provides an invaluable asset for population genetics and genomics in the barn swallow and for comparative genomics in birds. A full description of this work can be found in [29].
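For readers unfamiliar with the N50 statistic quoted above, the following minimal sketch shows the standard definition: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. The function name `n50` and the toy lengths are illustrative and are not taken from the barn swallow assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first scaffold that brings the running sum to
        # at least half of the total assembly size.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy example: total length 54, and 10 + 9 + 8 = 27 reaches half of it.
toy_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])  # 8
```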
Massive NGS data analysis reveals hundreds of potential novel gene fusions in human cell lines
In addition to single-nucleotide mutations, gene fusions deriving from chromosome rearrangements are among the genetic alterations linked to cancer development. The availability of sequence data from NGS techniques has made possible the discovery of a large number of such alterations. However, current algorithms for fusion detection either have high false-positive rates or miss some real events. Hence, it is very important to be able to run several algorithms with different discovery properties and compare their results.
Gioiosa et al. [31] have carried out an extensive analysis of 935 paired-end RNA-sequencing experiments downloaded from the Cancer Cell Line Encyclopedia repository (CCLE) (Footnote 10), for a total of 32 TB of raw input data. The aim was to identify novel putative cell-line-specific gene fusion events in human malignancies. Four gene fusion detection algorithms were run on the CCLE samples, for a total of 500 k core hours. Furthermore, a prioritization analysis was performed by running a Bayesian classifier that adds an in silico validation of the detected events. The collection of fusion events supported by all of the predictive algorithms provides a robust dataset of ∼1700 novel in silico candidate gene fusion events. These data have been collected, stored and integrated with other external resources within the LiGeA portal (cancer cell LInes Gene fusion portAl) (Footnote 11), where they are browsable and freely downloadable. A full description of this work can be found in [31].
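The consensus step, keeping only the events supported by all of the detection algorithms, amounts to a set intersection. The sketch below is a minimal illustration under stated assumptions: the tool names (`tool_A` to `tool_D`), the gene names and the representation of a fusion as an ordered (5′ partner, 3′ partner) pair are all hypothetical placeholders, not the actual callers or data used in [31].

```python
from typing import Dict, Set, Tuple

Fusion = Tuple[str, str]  # (5' partner gene, 3' partner gene), order matters


def consensus_fusions(calls_by_tool: Dict[str, Set[Fusion]]) -> Set[Fusion]:
    """Return the fusion events reported by every tool in calls_by_tool."""
    call_sets = list(calls_by_tool.values())
    if not call_sets:
        return set()
    return set(call_sets[0]).intersection(*call_sets[1:])


# Toy calls for one sample (placeholder gene and tool names).
toy_calls = {
    "tool_A": {("GENE1", "GENE2"), ("GENE3", "GENE4")},
    "tool_B": {("GENE1", "GENE2"), ("GENE5", "GENE6")},
    "tool_C": {("GENE1", "GENE2"), ("GENE3", "GENE4")},
    "tool_D": {("GENE1", "GENE2")},
}
robust = consensus_fusions(toy_calls)  # {("GENE1", "GENE2")}
```

Requiring agreement among all callers trades sensitivity for specificity, which is consistent with the stated goal of obtaining a robust candidate set despite the high false-positive rates of individual algorithms.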