- Research
- Open access
HPC-T-Annotator: an HPC tool for de novo transcriptome assembly annotation
BMC Bioinformatics volume 25, Article number: 272 (2024)
Abstract
Background
The availability of transcriptomic data for species without a reference genome enables the construction of de novo transcriptome assemblies as alternative reference resources from RNA-Seq data. A transcriptome provides direct information about a species’ protein-coding genes under specific experimental conditions. The de novo assembly process produces a unigenes file in FASTA format, which is then the target of annotation. Homology-based annotation, a method to infer the function of sequences by estimating similarity with other sequences in a reference database, is a computationally demanding procedure.
Results
To mitigate the computational burden, we introduce HPC-T-Annotator, a tool for de novo transcriptome homology annotation on high performance computing (HPC) infrastructures, designed for straightforward configuration via a Web interface. Once the configuration data are given, the entire parallel computing software for annotation is automatically generated and can be launched on a supercomputer using a simple command line. The output data can then be easily viewed using post-processing utilities in the form of Python notebooks integrated in the proposed software.
Conclusions
HPC-T-Annotator expedites homology-based annotation in de novo transcriptome assemblies. Its efficient parallelization strategy on HPC infrastructures significantly reduces computational load and execution times, enabling large-scale transcriptome analysis and comparison projects, while its intuitive graphical interface extends accessibility to users without IT skills.
Background
Ribonucleic acids (RNAs), a class of biomolecules found in cells and organisms, are the products of genome transcription; the complete set of transcripts is called the transcriptome. RNAs play a crucial part in gene expression, the process connecting the genotype to the phenotype of an organism: RNAs act like couriers carrying the genetic instructions stored in DNA (genotype) to guide the creation of proteins and other functional molecules defining an organism’s physical traits (phenotype) [1, 2].
Next-Generation Sequencing (NGS) technologies, including RNA sequencing (RNA-Seq), have given researchers access to the intricate architecture and dynamic activity of the transcriptome. From simple organisms like yeast to more complex ones like humans, RNA-Seq offers unparalleled levels of sensitivity and precision in mapping the transcriptome’s dynamism and complexity [3].
De novo transcriptome assembly is the process of reconstructing a transcriptome from short, fragmented RNA sequences obtained through massive sequencing technologies [4,5,6,7,8,9,10,11,12]. This is an important issue in transcriptomics as sequencing technologies only produce RNA sequences of limited length, requiring the reconstruction of the whole transcriptome from these short fragments [13,14,15,16,17,18,19]. Several assembly algorithms (e.g., Trinity [20], rnaSPAdes [21], Oases [22]) can be used to reconstruct the complete transcriptome. Once the assembly is completed, transcriptome annotation is an important computational phase for understanding the functions of the genes and other elements in a transcriptome, and how they contribute to the overall biology of an organism [5, 23]. Transcriptome annotation allows us to understand which genes are being expressed in a particular cell or tissue at a given time, and how the expression of these genes contributes to the overall biology of the cell or tissue.
Two main types of annotation are used:
- Homology-based annotation identifies genes or elements similar to known ones in other organisms.
- Functional annotation predicts a gene’s or element’s function based on its sequence and other features, without knowledge of genes or elements in other organisms.
Both types are important for understanding the biology of a transcriptome and identifying genes or elements of potential biological relevance, for example for the conservation of genomic biodiversity or for biomedical applications. Homology-based annotation maps nucleotide or protein sequences against large protein databases, typically NR [24], Swiss-Prot [25], or TrEMBL [26]. These databases contain protein sequences from different organisms, both model and non-model, and can be used to identify homologous genes or elements in the transcriptome being annotated.
The homology-based annotation process is computationally demanding, especially for very large transcriptomes. Sequence annotations are usually computed using alignment algorithms like BLAST [27] or DIAMOND [28], aligning protein or nucleotide sequences against protein databases. These tools are engineered to handle large-scale sequencing data efficiently. Once the sequence alignments are obtained from the analysis, homologous proteins are identified. Matches with a high level of similarity allow researchers to attribute a function to query sequences, based on knowledge about homologous proteins in other organisms. Although DIAMOND can be up to 10,000 times faster than the gold standard, BLAST, annotating a medium-to-large transcriptome can still require several days of computation on a single multi-core computing node.
In recent decades, many bioinformatics domains dealing with big data analysis have turned to High Performance Computing (HPC) tools: from HPC services for bioinformatics [29, 30], to genomics [31,32,33], transcriptomics [34, 35], metagenomics [36, 37], as well as in structural bioinformatics [38,39,40,41,42]. In particular, concerning sequence alignment, several optimisation experiments for BLAST and DIAMOND were conducted. The optimised BLAST implementations were designed for either GPU [43, 44] or CPU clusters [45, 46]. An optimised version of DIAMOND is available for HPC clusters [47], while another version improved the original algorithm by using efficient compression of the indices [48]. All these versions, optimised for HPC clusters, can be challenging to use (installation, configuration, and launch) for users without HPC expertise.
To overcome the primary computational bottleneck in de novo transcriptome annotation (the alignment of large sets of sequences against a protein reference database) and to provide user-friendly alignment software for the HPC environment, we introduce HPC-T-Annotator, a software suite that uses HPC resources efficiently, adopting a novel load-balancing technique to derive homology-based annotation of de novo transcriptomes with BLAST and DIAMOND. The tool parallelizes the alignment process, running multiple tasks simultaneously across different processors or computing nodes, thus completing the alignment in less time than would be possible using a single processor or computing node.
While its primary purpose is to map de novo transcriptomes against specific protein databases, HPC-T-Annotator can also be used to map predicted Open Reading Frames (ORFs), in terms of coding sequences or proteins (peptide sequences), as well as differentially expressed genes, against the aforementioned databases.
The rest of the paper proceeds as follows. In Sect. Implementation, we are concerned with implementation aspects, describing the algorithm and its operational mode, and focusing attention on the parallel execution process and the advantages it brings. In Sect. Results, an in-depth analysis of HPC-T-Annotator’s performance is conducted, comparing its operational mode with currently employed software with similar characteristics. Finally, in Sect. Discussion and conclusions, we provide general comments and discuss future developments.
The supplementary material gives details on the GUI and the integrated notebooks.
Implementation
The main usage schema of HPC-T-Annotator is shown in Fig. 1. The user starts by filling out a Web form with the project information. In response, the Web service generates a batch of customised scripts and compresses them in a .tar file. This archive must then be transferred to the HPC machine, where the user can launch the parallel annotation software from the command line. Experienced users may also choose to run HPC-T-Annotator without using the Web interface (Footnote 1).
The Web interface (outlined in the supplementary material) is a critical component of the software, designed to enable straightforward usage of the application on an HPC cluster, even for users without advanced technical skills. This design aims to broaden the software’s appeal and increase the number of potential users.
Once the generated scripts are transferred to the HPC cluster and launched from the command line with start.sh, a complex set of operations, described in detail in Sect. HPC-T-Annotator parallel workflow, is performed without any need for further user intervention. Specifically, the scripts perform an optimised split of the sequence file to be annotated. Each partial file, referred to as a chunk file, is then aligned against the database through a master/worker (Footnote 2) model for parallel computing. Each worker process aligns the small fraction of the total sequences included in its chunk file. The results are then combined by the master process into a unified output in TSV format (Footnote 3). Once the final TSV file has been generated on the HPC cluster, the user downloads it locally to post-process the alignment results with the Python notebooks provided by the HPC-T-Annotator suite (Footnote 4). In particular, they include:
- AnnoDegsReport: a multi-database annotation summary that consolidates the annotation results of a specific transcriptome obtained by running alignment software (DIAMOND and BLAST) against various sequence databases.
- AnnoRate: compares database performance in terms of hit percentages for each sequence file and each reference database.
- AnnoViz: takes as input the output file generated by the annotation software (BLAST or DIAMOND) and creates graphs that help interpret the annotation and alignment results.
- MultiVenn: compares annotation results across different databases.
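As a rough illustration of the kind of summary a notebook such as AnnoRate might report, the sketch below computes the percentage of query sequences with at least one hit in a tabular alignment file. It is not the notebooks' actual code: the function names and the toy in-memory inputs are hypothetical, and it assumes that the query ID is the first tab-separated column of the TSV output.

```python
import io

def fasta_ids(handle):
    """Return the set of sequence IDs (first token after '>') in a FASTA stream."""
    return {line[1:].split()[0] for line in handle if line.startswith(">")}

def hit_ids(handle):
    """Return the set of query IDs (first column) found in a tabular alignment file."""
    return {line.split("\t", 1)[0] for line in handle if line.strip()}

def hit_rate(fasta_handle, tsv_handle):
    """Percentage of query sequences that have at least one alignment."""
    queries = fasta_ids(fasta_handle)
    if not queries:
        return 0.0
    return 100.0 * len(queries & hit_ids(tsv_handle)) / len(queries)

# Toy data (hypothetical IDs and subject names), standing in for real files.
fasta = io.StringIO(
    ">seq1 len=9\nATGAAATAG\n>seq2\nATGCCCTAG\n>seq3\nATGGGGTAG\n>seq4\nATGTTTTAG\n"
)
tsv = io.StringIO(
    "seq1\tSUBJECT_A\t98.5\nseq1\tSUBJECT_B\t91.0\nseq3\tSUBJECT_C\t77.2\n"
)
rate = hit_rate(fasta, tsv)
print(f"{rate:.1f}% of queries have at least one hit")
```

Counting distinct query IDs, rather than lines, avoids inflating the rate when a query has several hits.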
HPC-T-Annotator parallel workflow
The HPC-T-Annotator software workflow (see Fig. 2) can be described as follows:
1. preparation of input files and directives for worker processes;
2. activation of the workers to operate concurrently, and monitoring of their status;
3. aggregation of the partial outputs from individual workers into a single output file.
The master process remains active for the whole duration of the parallel annotation process, ensuring that each task is run successfully to completion. At that point, it merges the partial alignment files, yielding a single comprehensive alignment file.
The input file targeted for annotation is a de novo transcriptome, formatted in multi-FASTA, and ordered from the longest to the shortest sequence. A naïve splitting of the input file, which retains the same order of sequences, could result in partial files of greatly varying sizes (i.e., the first files would be larger, with subsequent ones diminishing in size). To achieve a balanced input workload, we use a cyclic algorithm for the assignment of sequence IDs. As an initial step, the master process reads the user-supplied input file and records the IDs of the n sequences into k separate files (header_1.txt, \(\dots\), header_k.txt), so that each header_i.txt file contains the headers of a portion of the sequences in the original file.
In this way, each process is assigned a specific set of sequences to align. The scattering algorithm is a cyclic distribution: sequences in the input file are assigned to processes in round-robin order. Thus, with k the number of processes, process 0 is assigned sequence 0, sequence k, sequence 2k, etc. Similarly, process 1 deals with sequence 1, sequence \(k+1\), sequence \(2k+1\), etc.
Algorithm 1 illustrates, in pseudocode, how the algorithm assigns to each process its pool of sequences from the original file. Each partial chunk (file) is aligned using BLAST or DIAMOND, producing a partial result in TSV format.
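The cyclic distribution described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's generated code: the function names and the toy data are hypothetical, and total sequence length is used only as a rough proxy for alignment workload.

```python
def cyclic_split(items, k):
    """Assign items[i] to chunk i % k (cyclic, i.e. round-robin, distribution)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return [items[i::k] for i in range(k)]

def contiguous_split(items, k):
    """Naive split into k consecutive blocks that keeps the original order."""
    size = -(-len(items) // k)  # ceiling division
    return [items[i * size:(i + 1) * size] for i in range(k)]

def chunk_load(chunk):
    """Total sequence length in a chunk (rough proxy for alignment work)."""
    return sum(length for _, length in chunk)

# Toy transcriptome: (ID, length) pairs sorted from longest to shortest.
seqs = [(f"seq{i}", length) for i, length in enumerate(range(1200, 0, -100))]
k = 3

cyclic_loads = [chunk_load(c) for c in cyclic_split(seqs, k)]
naive_loads = [chunk_load(c) for c in contiguous_split(seqs, k)]
print("cyclic:", cyclic_loads)
print("naive: ", naive_loads)
```

On this toy input the cyclic chunks carry much more similar loads than the contiguous ones, which is the imbalance the length-sorted input would otherwise produce.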
Control process and merging phase
The control process (a subprocess of the master) acts as an orchestrator and manages the entire second phase. Its primary role is to oversee the status of the processes executing the alignment phase on the partial sequence files. The control process regularly checks the status of each individual process to verify whether it has successfully completed its assigned task or is still running. Once the control process confirms that all processes have completed their tasks, the master initiates the merging operation, consolidating the alignment results from the partial files into a single comprehensive TSV file. If a worker process encounters an error during its execution, the error is reported in the respective process error file.
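The polling and merging logic described above can be sketched as follows. This is only an illustration: threads stand in for the cluster jobs, `fake_align` replaces a real BLAST or DIAMOND run, and all file names (`partial_*.tsv`, `worker_*.err`, `final.tsv`) are hypothetical rather than those used by HPC-T-Annotator.

```python
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def fake_align(chunk_id, ids, workdir):
    """Stand-in for an aligner run: writes one tabular line per query ID."""
    out = workdir / f"partial_{chunk_id}.tsv"
    out.write_text("".join(f"{seq_id}\tHYPOTHETICAL_HIT\n" for seq_id in ids))
    return out

def run_and_merge(chunks, workdir, poll_seconds=0.01):
    """Launch one worker per chunk, poll until all finish, then merge outputs."""
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        futures = {i: pool.submit(fake_align, i, ids, workdir)
                   for i, ids in enumerate(chunks)}
        # Control loop: check the status of every worker at regular intervals.
        while not all(f.done() for f in futures.values()):
            time.sleep(poll_seconds)
    partials = []
    for i, fut in futures.items():
        err = fut.exception()
        if err is not None:
            # Report a failed worker in its own error file.
            (workdir / f"worker_{i}.err").write_text(f"{err!r}\n")
        else:
            partials.append(fut.result())
    merged = workdir / "final.tsv"
    with merged.open("w") as dst:
        for part in partials:  # concatenate partial results in chunk order
            dst.write(part.read_text())
    return merged

with tempfile.TemporaryDirectory() as tmp:
    merged_path = run_and_merge([["seq0", "seq2"], ["seq1", "seq3"]], Path(tmp))
    merged_lines = merged_path.read_text().splitlines()
print(merged_lines)
```

Merging by plain concatenation works here because each partial file is an independent set of tabular records with no header.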
Results
We report here an assessment of HPC-T-Annotator by presenting performance data on some benchmarks and a comparison against other parallel alignment software.
Benchmarks
We conducted a benchmark study to assess how much a parallel application of homology annotation software, specifically DIAMOND and BLAST, would optimise performance. The study focused on the annotation of transcripts for four species (described below), assessing efficiency and speed improvements achieved through parallel execution compared to traditional serial execution on transcriptomes with different sizes, in terms of number and length of sequences (contigs). Table 1 provides a reference to the input datasets for the de novo transcriptomes of the selected species.
In particular, we used data pertaining to one bacterium of the genus Altererythrobacter (Altererythrobacter sp.) and to three amphibians (Bombina pachypus, Salamandra salamandra, Hyla sarda). The bacterium’s transcriptome is an extremely small one (number of contigs: 220), whereas the amphibians’ transcriptomes are of medium (Bombina pachypus: 190,619 contigs) and large (Salamandra salamandra: 1,146,571 contigs, Hyla sarda: 1,295,741 contigs) size.
The temporal metric used in the following benchmarks is the elapsed time, which represents the execution time of HPC-T-Annotator from when the user starts the software to when they receive the result. The elapsed time includes the time for merging the partial files into the final output file. As for execution on the HPC cluster, HPC-T-Annotator is designed to assign each worker process and the master process to a compute node of the cluster. Table 2 presents the execution times for the species’ transcripts against the Swiss-Prot database using the DIAMOND alignment software, with 8 threads per worker process (even for serial execution) and DIAMOND’s ultra-sensitive mode. Each species underwent two executions: one in serial mode using traditional alignment software and another in parallel mode using the alignment software with HPC-T-Annotator as parallelisation tool.
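For orientation, a per-chunk DIAMOND invocation with the settings mentioned above (8 threads, ultra-sensitive mode, tabular output) could be assembled as in the sketch below. The helper name and file names are hypothetical, and the choice of the `blastx` command (translated nucleotide queries against a protein database) is an assumption; the command actually generated by HPC-T-Annotator may differ.

```python
def diamond_command(query, database, output, threads=8):
    """Build an argument list for a DIAMOND search with tabular (format 6) output."""
    return [
        "diamond", "blastx",       # assumed: nucleotide queries vs protein database
        "--query", str(query),     # chunk file with the sequences to align
        "--db", str(database),     # DIAMOND-formatted protein database
        "--out", str(output),      # partial result file for this chunk
        "--outfmt", "6",           # tabular output
        "--threads", str(threads),
        "--ultra-sensitive",
    ]

# Hypothetical file names for one chunk.
cmd = diamond_command("chunk_0.fasta", "swissprot.dmnd", "partial_0.tsv")
print(" ".join(cmd))
```

Building the command as a list, rather than a single string, lets it be passed directly to `subprocess.run` without shell quoting issues.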
The data in Table 2 are rendered graphically in Fig. 3, which shows the variation in serial execution time across species and the markedly reduced execution time achieved by parallel execution. The parallel approach yielded substantial reductions in execution time compared to traditional serial execution, demonstrating the potential of parallel execution to accelerate the homology annotation process.
Also, we used the transcriptome of Hyla sarda (Tyrrhenian tree frog) as a reference and performed the alignment using the DIAMOND tool [28] against the Swiss-Prot database. We selected 1, 10, 100, 200, and 300 processes, respectively, with time and speed-up results shown in Table 3 and in the associated graph (Fig. 4).
One can notice that the overall execution time continues to decrease as the number of processes increases; this result is not surprising, as we expect such a trend up to the point where the number of processes is equal to the number of sequences (each process is assigned only one sequence). However, this is not always possible since the number of sequences may be very high, and it would not be feasible to have such a high number of nodes on physical machines.
Speed-up is here taken to represent the performance improvement achieved by executing code on a parallel computing system compared to a sequential execution on a single processor, thus quantifying the reduction in execution time when multiple processors or cores are utilised concurrently.
Parallel algorithm analysis studies the efficiency and effectiveness of algorithms designed for parallel computing architectures. Speed-up is a crucial aspect of parallel algorithm analysis because it provides a quantitative measure of the benefits obtained from parallelization. By comparing the execution times of parallel and sequential implementations, speed-up allows us to assess the efficiency of parallel algorithms and evaluate the scalability of the parallel computing system.
With \(p\) the number of processes used and \(T(p)\) the execution time for \(p\) processes, the formula for speed-up is given by: \(S(p) = \frac{T(1)}{T(p)}\), where the ideal speed-up follows the linear trend \(S(p) = p\). Figure 4 presents the data from Table 3 in graphical form, on which several notable observations can be made.
In particular, a boost in performance resulted from leveraging the parallel execution capabilities of HPC-T-Annotator alongside DIAMOND, thus enabling faster and more efficient annotation of homologous sequences. The parallel application made effective use of computational resources through balanced workload distribution. These findings are relevant for homology annotation, offering a promising solution for improving the speed of large-scale sequence analysis.
- Single Process Execution (1 process): The elapsed time with the serial version of the alignment software, with ultra-sensitive mode and 8 threads, is 101.71 min. The speed-up is 1 because there is no performance improvement.
- Parallel Execution (10 processes): With 10 processes, elapsed time drops sharply from 101.71 to 10.42 min. The speed-up is 9.76, close to the ideal value of 10 that perfect parallelization would give.
- Further Parallelisation (100, 200, and 300 processes): As the number of processes increases, the elapsed time decreases, showcasing the benefit of parallel processing. The speed-up value also increases, which is promising. However, it is interesting to note that the increase in speed-up is not linear with the increase in the number of processes. This could be due to factors such as diminishing returns, general overhead, or resource limitations.
The determining factor for the slow increase in speed-up remains the execution of HPC-T-Annotator on clusters with high job competition and overload.
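The speed-up quoted above for 10 processes can be reproduced directly from the two elapsed times given in the text. The sketch also reports parallel efficiency, \(E(p) = S(p)/p\), a standard companion metric that is not used in the original analysis and is added here only for illustration.

```python
def speedup(t_serial, t_parallel):
    """S(p) = T(1) / T(p)."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """E(p) = S(p) / p; 1.0 corresponds to ideal linear speed-up."""
    return speedup(t_serial, t_parallel) / p

# Elapsed times in minutes, as reported in the text for 1 and 10 processes.
t1, t10 = 101.71, 10.42
s10 = speedup(t1, t10)
e10 = efficiency(t1, t10, 10)
print(f"S(10) = {s10:.2f}, E(10) = {e10:.3f}")
```

An efficiency close to 1 at 10 processes is consistent with the near-ideal speed-up noted in the list above.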
Regarding the accuracy analysis of HPC-T-Annotator, we performed some tests by running both BLAST and DIAMOND, in parallel and serial mode, on the same input query and database. The results showed no significant differences among the various result files, thus confirming that the parallel version has the same accuracy.
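One simple way to run such a check is to compare the serial and parallel outputs as unordered collections of records, since the merged file may list alignments in a different order. The sketch below does this on toy data; it is an assumed procedure, not the one used by the authors, and exact record equality may be a stricter criterion than the "no significant differences" reported above.

```python
from collections import Counter

def alignment_records(lines):
    """Multiset of non-empty records, with trailing newlines stripped."""
    return Counter(line.rstrip("\n") for line in lines if line.strip())

def same_alignments(serial_lines, parallel_lines):
    """True if both outputs contain the same records, ignoring their order."""
    return alignment_records(serial_lines) == alignment_records(parallel_lines)

# Toy outputs (hypothetical IDs): same records, different order.
serial_out = ["seq0\tSUBJECT_A\t99.0\n", "seq1\tSUBJECT_B\t87.5\n"]
parallel_out = ["seq1\tSUBJECT_B\t87.5\n", "seq0\tSUBJECT_A\t99.0\n"]
identical = same_alignments(serial_out, parallel_out)
print("outputs equivalent:", identical)
```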
Comparison with other parallel alignment software
HPC-T-Annotator’s software is composed of three main parts, as described in Fig. 1, and in Fig. 2 for the details of parallel analysis: (1) the Web tool for configuration and generation of parallel software; (2) the (generated) parallel analysis software; and (3) the notebooks for post-processing analysis. To date, the scientific literature does not describe any software for sequence alignment on HPC as complete as ours (Web interface for configuring and generating the custom parallel software, easy launching of parallel analysis on HPC platforms, and easy interpretation, with notebooks, of the alignment results). We compare available software with ours where there are similarities in at least one of the three main components.
In Table 4, we have reported the software architecture data and the usage of RAM, memory and I/O transfer, as well as the execution times of BLAST and DIAMOND when used in multithreading mode or launched on a single node. Additionally, we compared parallel versions of BLAST and DIAMOND with runs of HPC-T-Annotator using the BLAST and DIAMOND functions, respectively.
In addition to these software applications, others exist which are equipped with an interface, but are not designed for HPC computation. Among the most interesting and recent ones is BlastGUI [50]. BlastGUI is a cross-platform, Python-based application that features a main interface allowing users to create databases, perform sequence filtering, and conduct sequence alignment through a graphical user interface. BlastGUI’s capabilities for operation visualization, automatic sequence filtering, and cross-platform use can significantly facilitate the analysis of biological data; however, BlastGUI does not have tools for analyzing data results and therefore does not help interpret the biological alignment data obtained.
Discussion and conclusions
In the landscape of transcriptomic data analysis, where handling large data sets is crucial for understanding the molecular mechanisms underlying complex biological phenomena, computation on High-Performance Computing (HPC) platforms can be essential. Optimized CPU usage and efficient workload distribution across cluster nodes are fundamental to enable thorough annotation within reasonable time frames. Swiftly annotating a de novo transcriptome, used as a reference resource, allows researchers to identify patterns of differential gene expression, regulatory elements, and evolutionary patterns, thus providing valuable insights for a wide range of biological studies, from bio-medicine to ecology and evolutionary biology.
In this context, HPC-T-Annotator emerges as an essential resource for big data analysis in transcriptomics, reducing execution times in large-scale de novo transcriptome annotation. Indeed, its ability to handle large data volumes and accelerate annotation tasks can significantly facilitate the advancement of biological research.
The capacity of HPC-T-Annotator to manage big data is implemented in all three main phases of the software (as shown in Fig. 1 in Sect. Implementation), namely: generation of parallel computation software; deployment on the HPC cluster; and parsing of fundamental results using Python notebooks. Highlighting the tool’s ease of use even for researchers not specialized in IT, we report the following user-friendly usage features: (1) Researchers, even those not skilled in parallel computation, can automatically generate and configure the HPC software they need by simply filling out a form. (2) The launch of the generated software modules only requires the execution of a simple command line (Footnote 5). (3) The generated TSV file, usually extremely large in size, can be quickly analyzed using the Python notebooks available online.
Although HPC-T-Annotator was primarily designed for transcriptome annotation, its parallelization framework and adaptable architecture have the potential for broader use in bioinformatics. The tool’s scalability and efficiency make it an ideal candidate for parallelising various computational tasks, such as metagenomic data annotation.
Further developments of HPC-T-Annotator are underway to improve the efficiency and speed of the annotation tasks. Currently, from the Web interface, it is only possible to configure the launch of the annotation of a single transcriptome against a single database. We plan to increase the potential for big data analysis by providing the ability to configure the annotation launch on different databases and several input transcriptomes simultaneously, simply by entering data in a single form.
Finally, the implementation of additional computational adaptations, such as extending software compatibility to clusters with schedulers other than SLURM and expanding the suite of Python/Jupyter Notebooks for post-processing operations, will further amplify the accessibility and efficiency of the software, providing biologists and researchers with advanced tools for big data analysis.
Availability and requirements
- Project name: HPC-T-Annotator.
- Project home page: http://raganella.deb.unitus.it/HPC-T-Annotator/index.html.
- Operating system(s): Linux.
- Programming language: Bash, Python.
- Other requirements: Linux (Ubuntu 20.04 LTS, CentOS 7 or higher), SLURM 22.* or higher (optional).
- License: MIT.
- Any restrictions to use by non-academics: License needed.
Availability of data and materials
All used data are mentioned in Table 1.
Notes
A process that performs a portion of a calculation is called a worker process, while a process that controls and manages the worker processes is called a master process. In this context, one computational node is assigned to the master process and one computational node is assigned to each worker process.
The output produced by HPC-T-Annotator is compatible with option -f 6 provided by both BLAST and DIAMOND, i.e., the tabular format.
For details, check http://raganella.deb.unitus.it/HPC-T-Annotator/notebooks.html.
For details, consult the online help at http://raganella.deb.unitus.it/HPC-T-Annotator/help.html.
References
Buccitelli C, Selbach M. mRNAs, proteins and the emerging principles of gene expression control. Nat Rev Genet. 2020;21(10):630–44. https://doi.org/10.1038/s41576-020-0258-4.
Nachtigall PG, Kashiwabara AY, Durham AM. CodAn: predictive models for precise identification of coding regions in eukaryotic transcripts. Brief Bioinform. 2021;22(3):bbaa045. https://doi.org/10.1093/bib/bbaa045.
Muers M. Transcriptome to proteome and back to genome. Nat Rev Genet. 2011;12(8):518–518. https://doi.org/10.1038/nrg3037.
Joudaki F, Ismaili A, Sohrabi SS, Hosseini SZ, Kahrizi D, Ahmadi H. Transcriptome analysis of gall oak (Quercus infectoria): De novo assembly, functional annotation and metabolic pathways analysis. Genomics. 2023;115(2):110588. https://doi.org/10.1016/j.ygeno.2023.110588.
Raghavan V, Kraft L, Mesny F, Rigerte L. A simple guide to de novo transcriptome assembly and annotation. Brief Bioinform. 2022;23(2):bbab563. https://doi.org/10.1093/bib/bbab563.
Fallon TR, Čalounová T, Mokrejš M, Weng J-K, Pluskal T. transXpress: a Snakemake pipeline for streamlined de novo transcriptome assembly and annotation. BMC Bioinform. 2023;24(1):133. https://doi.org/10.1186/s12859-023-05254-8.
Jackson DJ, Cerveau N, Posnien N. De novo assembly of transcriptomes and differential gene expression analysis using short-read data from emerging model organisms—a brief guide. Front Zool. 2024;21(1):17. https://doi.org/10.1186/s12983-024-00538-y.
Zhu B, Luo X, Gao Z, Hu X, Weng Q. De novo transcriptome assembly and development of EST-SSR markers of the endangered Dendrebium nobile (Orchidaceae). Pak J Bot. 2022;54(2):483–9. https://doi.org/10.30848/PJB2022-2(40).
Sato M, Seki M, Suzuki Y, Ueki S. The dataset of de novo assembly and inferred functional annotation of the transcriptome of Heterosigma akashiwo, a bloom-forming, cosmopolitan raphidophyte. Data Brief. 2023. https://doi.org/10.1016/j.dib.2023.109071.
Ivanov M, Sandelin A, Marquardt S. TranscriptomeReconstructoR: data-driven annotation of complex transcriptomes. BMC Bioinform. 2021;22(1):1–15. https://doi.org/10.1186/s12859-021-04208-2.
Alvarez RV, Mariño-Ramírez L, Landsman D. Transcriptome annotation in the cloud: complexity, best practices, and cost. GigaScience. 2021;10(2):giaa163. https://doi.org/10.1093/gigascience/giaa163.
Harshan P, Sandhya S, Gopalakrishnan A. De novo transcriptome for Chiloscyllium griseum, a long-tail carpet shark of the Indian waters. Sci Data. 2024;11:285. https://doi.org/10.1038/s41597-024-03093-7.
Palomba M, Libro P, Martino JD, Rughetti A, Santoro M, Mattiucci S, Castrignanò T. De novo transcriptome assembly and annotation of the third stage larvae of the zoonotic parasite anisakis pegreffii. BMC Res Notes. 2022;15(1):223. https://doi.org/10.1186/s13104-022-06099-9.
Palomba M, Libro P, Martino JD, Roca-Geronès X, Macali A, Castrignanò T, Canestrelli D, Mattiucci S. De novo transcriptome assembly of an antarctic nematode for the study of thermal adaptation in marine parasites. Sci Data. 2023;10(1):720. https://doi.org/10.1038/s41597-023-02591-4.
Levy-Booth DJ, Hashimi A, Roccor R, Liu LY, Renneckar S, Eltis LD, Mohn WW. Genomics and metatranscriptomics of biogeochemical cycling and degradation of lignin-derived aromatic compounds in thermal swamp sediment. ISME J. 2021;15(3):879–93. https://doi.org/10.1038/s41396-020-00820-x.
Chiocchio A, Libro P, Martino G, Bisconti R, Castrignanò T, Canestrelli D. Brain de novo transcriptome assembly of a toad species showing polymorphic anti-predatory behavior. Sci Data. 2022;9(1):619. https://doi.org/10.1038/s41597-022-01724-5.
Libro P, Chiocchio A, Rysky ED, Martino JD, Bisconti R, Castrignanò T, Canestrelli D. De novo transcriptome assembly and annotation for gene discovery in Salamandra salamandra at the larval stage. Sci Data. 2023;10(1):330. https://doi.org/10.1038/s41597-023-02217-9.
Libro P, Bisconti R, Chiocchio A, Spadavecchia G, Castrignanò T, Canestrelli D. First brain de novo transcriptome of the Tyrrhenian tree frog, Hyla sarda, for the study of dispersal behavior. Front Ecol Evol. 2022. https://doi.org/10.3389/fevo.2022.947186.
Mastrantonio V, Libro P, Martino JD, Matera M, Bellini R, Castrignanò T, Urbanelli S, Porretta D. Integrated de novo transcriptome of Culex pipiens mosquito larvae as a resource for genetic control strategies. Sci Data. 2024;11:471. https://doi.org/10.1038/s41597-024-03285-1.
Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, Couger MB, Eccles D, Li B, Lieber M, MacManes MD, Ott M, Orvis J, Pochet N, Strozzi F, Weeks N, Westerman R, William T, Dewey CN, Henschel R, LeDuc RD, Friedman N, Regev A. De novo transcript sequence reconstruction from RNA-seq using the trinity platform for reference generation and analysis. Nat Protoc. 2013;8(8):1494–512. https://doi.org/10.1038/nprot.2013.084.
Bushmanova E, Antipov D, Lapidus A, Prjibelski AD. rnaSPAdes: a de novo transcriptome assembler and its application to RNA-seq data. GigaScience. 2019;8(9):giz100. https://doi.org/10.1093/gigascience/giz100.
Schulz MH, Zerbino DR, Vingron M, Birney E. Oases: robust de novo rna-seq assembly across the dynamic range of expression levels. Bioinformatics. 2012;28(8):1086–92. https://doi.org/10.1093/bioinformatics/bts094.
Hart AJ, Ginzburg S, Xu M, Fisher CR, Rahmatpour N, Mitton JB, Paul R, Wegrzyn JL. Entap: bringing faster and smarter functional annotation to non-model eukaryotic transcriptomes. Mol Ecol Resour. 2020;20(2):591–604. https://doi.org/10.1111/1755-0998.13106.
Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007;35:61–5. https://doi.org/10.1093/nar/gkl842.
Boeckmann B, Blatter M-C, Famiglietti L, Hinz U, Lane L, Roechert B, Bairoch A. Protein variety and functional diversity: Swiss-prot annotation in its biological context. Comptes Rendus Biol. 2005;328(10–11):882–99. https://doi.org/10.1016/j.crvi.2005.06.001.
Bairoch A, Apweiler R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000;28(1):45–8. https://doi.org/10.1093/nar/28.1.45.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10. https://doi.org/10.1016/s0022-2836(05)80360-2.
Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2014;12(1):59–60. https://doi.org/10.1038/nmeth.3176.
Castrignanò T, Gioiosa S, Flati T, Cestari M, Picardi E, Chiara M, Fratelli M, Amente S, Cirilli M, Tangaro MA, Chillemi G, Pesole G, Zambelli F. ELIXIR-IT HPC@CINECA: high performance computing resources for the bioinformatics community. BMC Bioinform. 2020. https://doi.org/10.1186/s12859-020-03565-8.
Yeh C-W, Huang C-W, Yang C-L, Wang Y-T. A high performance computing platform for big biological data analysis. 2023:68–70. https://doi.org/10.1109/ICASI57738.2023.10179527.
Chiara M, Gioiosa S, Chillemi G, D’Antonio M, Flati T, Picardi E, Zambelli F, Horner DS, Pesole G, Castrignanò T. CoVaCS: a consensus variant calling system. BMC Genom. 2018. https://doi.org/10.1186/s12864-018-4508-1.
Bolis M, Garattini E, Paroni G, Zanetti A, Kurosaki M, Castrignanò T, Garattini SK, Biancardi F, Barzago MM, Gianni’ M, Terao M, Pattini L, Fratelli M. Network-guided modeling allows tumor-type independent prediction of sensitivity to all-trans-retinoic acid. Ann Oncol. 2017;28(3):611–21. https://doi.org/10.1093/annonc/mdw660.
Chetruengchai W, Jirapatrasilp P, Srichomthong C, Assawapitaksakul A, Pholyotha A, Tongkerd P, Shotelersuk V, Panha S. De novo genome assembly and transcriptome sequencing in foot and mantle tissues of Megaustenia siamensis reveals components of adhesive substances. Sci Rep. 2024;14(1):13756. https://doi.org/10.1038/s41598-024-64425-6.
Pinna V, Di Martino J, Liberati F, Bottoni P, Castrignanò T. IGUANER-differential gene expression and functional analyzer. In: BDA 2023. LNCS, vol. 14516, pp. 78–93. Springer, Berlin. 2024. https://doi.org/10.1007/978-3-031-58502-9_5.
Picardi E, D’Antonio M, Carrabino D, Castrignanò T, Pesole G. ExpEdit: a webserver to explore human RNA editing in RNA-Seq experiments. Bioinformatics. 2011;27(9):1311–2. https://doi.org/10.1093/bioinformatics/btr117.
Tremblay J, Schreiber L, Greer CW. High-resolution shotgun metagenomics: the more data, the better? Brief Bioinform. 2022;23(6):443. https://doi.org/10.1093/bib/bbac443.
Cervi GH, Flores CD, Thompson CE. Metagenomic analysis: a pathway toward efficiency using high-performance computing. In: ICICT 2021. Lecture notes in networks and systems, vol. 236, pp. 555–565. Springer, Singapore; 2022. https://doi.org/10.1007/978-981-16-2380-6_49.
Martino JD, Castrignanò T, Arcieri M, Madeddu F, Pieroni M, Carotenuto G, Bottoni P, Botta L, Gabellone S, Saladino R. Molecular dynamics investigations of human DNA-topoisomerase I interacting with novel Dewar valence photo-adducts: insights into inhibitory activity. Int J Mol Sci. 2023. https://doi.org/10.3390/ijms25010234.
Castrignanò T, Meo PDD, Carrabino D, Orsini M, Floris M, Tramontano A. The MEPS server for identifying protein conformational epitopes. BMC Bioinform. 2007;8(S1):1–5. https://doi.org/10.1186/1471-2105-8-s1-s6.
Castrignanò T, Chillemi G, Varani G, Desideri A. Molecular dynamics simulation of the RNA complex of a double-stranded RNA-binding domain reveals dynamic features of the intermolecular interface and its hydration. Biophys J. 2002;83(6):3542–52. https://doi.org/10.1016/S0006-3495(02)75354-X.
Castrignanò T, Chillemi G, Desideri A. Structure and hydration of BamHI DNA recognition site: a molecular dynamics investigation. Biophys J. 2000;79(3):1263–72. https://doi.org/10.1016/S0006-3495(00)76380-6.
Pieroni M, Madeddu F, Di Martino J, Arcieri M, Parisi V, Bottoni P, Castrignanò T. MD-ligand-receptor: a high-performance computing tool for characterizing ligand-receptor binding interactions in molecular dynamics trajectories. Int J Mol Sci. 2023;24(14):11671. https://doi.org/10.3390/ijms241411671.
Vouzis PD, Sahinidis NV. GPU-BLAST: using graphics processors to accelerate protein sequence alignment. Bioinformatics. 2011;27(2):182–8. https://doi.org/10.1093/bioinformatics/btq644.
Zhang J, Wang H, Feng W-C. CuBLASTP: fine-grained parallelization of protein sequence search on CPU+GPU. IEEE/ACM Trans Comput Biol Bioinf. 2017;14(4):830–43. https://doi.org/10.1109/TCBB.2015.2489662.
Mikailov M, Luo F-J, Barkley S, Valleru L, Whitney S, Liu Z, Thakkar S, Tong W, Petrick N. Scaling bioinformatics applications on HPC. BMC Bioinform. 2017. https://doi.org/10.1186/s12859-017-1902-7.
Yim WC, Cushman JC. Divide and conquer (DC) BLAST: fast and easy BLAST execution within HPC environments. PeerJ. 2017. https://doi.org/10.7717/peerj.3486.
Buchfink B, Reuter K, Drost H-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods. 2021;18(4):366–8. https://doi.org/10.1038/s41592-021-01101-x.
Mai H, Zhang Y, Li D, Leung HC-M, Luo R, Wong C-K, Ting H-F, Lam T-W. AC-DIAMOND v1: accelerating large-scale DNA-protein alignment. Bioinformatics. 2018;34(21):3744–6. https://doi.org/10.1093/bioinformatics/bty391.
Yu J, Blom J, Sczyrba A, Goesmann A. Rapid protein alignment in the cloud: HAMOND combines fast DIAMOND alignments with Hadoop parallelism. J Biotechnol. 2017;257:58–60. https://doi.org/10.1016/j.jbiotec.2017.02.020.
Du Z, Wu Q, Wang T, Chen D, Huang X, Yang W, Luo W. BlastGUI: a python-based cross-platform local BLAST visualization software. Mol Inf. 2019. https://doi.org/10.1002/minf.201900120.
Acknowledgements
We acknowledge CINECA and the ELIXIR-ITA HPC@CINECA initiative for providing HPC resources to our project ELIX5_castrign (P.I. Tiziana Castrignanò).
Funding
P. Bottoni was partly supported by the Italian Ministry of University and Research (MUR) under PRIN grant B87G22000450001 (PINPOINT). T. Castrignanò and F. Liberati were partly supported by the Italian Ministry of University and Research (MUR) under PRIN grant J53D23006500006 (MYSPEC).
Author information
Contributions
Not applicable.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Arcioni, L., Arcieri, M., Martino, J.D. et al. HPC-T-Annotator: an HPC tool for de novo transcriptome assembly annotation. BMC Bioinformatics 25, 272 (2024). https://doi.org/10.1186/s12859-024-05887-3