Workflow
CoRMAP is implemented using several analysis tools for transcriptomic data together with customized shell and R scripts. This protocol provides a framework for conducting reference-independent comparative transcriptomic analysis across multiple species. CoRMAP assembles a transcriptome for each species from raw transcriptome data, identifies orthologous relationships of coding genes across species or at a higher taxonomic level, and plots expression patterns of orthologous gene groups (OGGs) across species. Accordingly, there are three main data processing steps in CoRMAP: (1) de novo assembly, (2) ortholog searching, and (3) analysis of OGG expression patterns across species (or a higher taxonomic level). The complete workflow is illustrated in Fig. 1. Note that these analyses can be run together or separately, depending on the user’s needs. Because the pipeline combines individual modules, checkpoints for preliminary data analysis exist between modules. Below we provide a detailed summary of the individual steps within CoRMAP and instructions on how to implement the pipeline.
Input
Installation of CoRMAP
To install CoRMAP, clone the repository from GitHub in a Linux terminal (git clone https://github.com/rubysheng/CoRMAP.git). CoRMAP requires the following dependencies: ascp (https://www.ibm.com/docs/en/aci/3.9.2?topic=macos-ascp-transferring-from-command-line-ascp), FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc [26, 27]), MultiQC [26], Trim Galore! (v0.6.4), Trinity (v2.8.6), TransDecoder (v5.5.0) [27], Trinotate (v3.2.1) (http://trinotate.github.io [28]), and OrthoMCL [28].
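For example, a minimal installation might look like the following (installation of the individual dependencies is system specific and is therefore omitted here):

# clone the CoRMAP repository and enter the project directory
git clone https://github.com/rubysheng/CoRMAP.git
cd CoRMAP
# list the available shell and R scripts
ls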
Preparation of datasets
CoRMAP includes a utility to batch download RNA-Seq raw datasets containing multiple runs from the Sequence Read Archive (SRA) [29, 30] database using the software ascp. Each project in the SRA database has a unique accession number, which we recommend using as the directory name for that dataset when setting up the folder structure. Using the SRA accession numbers, FTP links are retrieved from the European Nucleotide Archive [29] and used as input to download the FASTQ files; each FTP link points to a compressed FASTQ file for one run. Each single-end sample corresponds to one FASTQ file, whereas each paired-end sample corresponds to two. As is standard practice, users are required to conduct quality checks on files used in this pipeline, including but not limited to checks of experimental design, replication and biological processing of samples.
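As an illustration, the folder structure and the download of one run might be set up as follows (the accession numbers and the FTP URL are placeholders; the actual links are retrieved from ENA, and ascp can be substituted for wget where Aspera is configured):

# create a directory named after the SRA project accession (placeholder SRP000001)
mkdir -p SRP000001
cd SRP000001
# download one run from an ENA FTP link (placeholder URL); paired-end runs have _1 and _2 files
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR000/SRR000001/SRR000001_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR000/SRR000001/SRR000001_2.fastq.gz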
Computational requirements and alternative options
This pipeline requires a large-memory server. In particular, the de novo assembly step requires approximately 1 GB of RAM per 1 million reads to be assembled. Alternatively, some steps of the pipeline can be run on Galaxy (http://usegalaxy.org), an open-source, web-based bioinformatics platform.
For ortholog searching, typical hardware requirements are at least 4 GB of memory and about 100 GB of free disk space. There is an option to split the calculation into multiple steps, depending on the number of datasets and the available workstation hardware. All parameters used in this pipeline can be found in Additional file 1.
Trimming or quality control
RNA-Seq data quality control, including trimming of low-quality base calls, automatic adapter detection and trimming, and filtering of short reads, is performed by Trim Galore! using the default parameter settings. After this filtering, all retained reads are at least 20 bp long.
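A minimal sketch of such a call for one paired-end sample (file names are placeholders; CoRMAP's wrapper scripts apply the equivalent defaults) is:

# quality/adapter trimming of a paired-end sample with default settings
# (Phred quality cut-off 20, auto-detected adapters, minimum read length 20 bp)
trim_galore --paired --fastqc SRR000001_1.fastq.gz SRR000001_2.fastq.gz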
Data processing
De novo assembly
The general processing of RNA-Seq raw data without a reference genome consists of three steps: (i) normalization, (ii) gene assembly, and (iii) quantification and generation of the associated expression matrix. To reduce the number of input files, computational complexity and processing time, read normalization and de novo assembly by Trinity are run as two separate steps. The targeted maximum read coverage is 50, while the k-mer size and the maximum percentage of the mean allowed for the standard deviation of k-mer coverage across reads are kept at their default values of 25 and 200, respectively. Assemblies are assessed with the basic contig N50 statistic, the length of the shortest contig such that contigs of at least that length account for 50% of the total assembly length. The pipeline also gives users the option to compare assemblies with reference genomes and to assess the de novo assembly with QUAST [31].
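As a rough sketch, assuming paired-end reads that have already been trimmed and that the Trinity utility scripts are on the PATH (file names, memory and CPU settings are placeholders to be adapted to the available hardware), the two steps and the N50 assessment could be run as:

# step 1: in silico read normalization to a maximum coverage of 50 (Trinity default k-mer size 25)
insilico_read_normalization.pl --seqType fq --JM 50G --max_cov 50 \
  --left reads_1.fq.gz --right reads_2.fq.gz --pairs_together --CPU 8

# step 2: de novo assembly of the normalized reads (placeholder names for the normalized files)
Trinity --seqType fq --max_memory 50G --CPU 8 --no_normalize_reads \
  --left left.norm.fq --right right.norm.fq --output trinity_out_dir

# basic contig N50 statistics for the resulting assembly
TrinityStats.pl trinity_out_dir/Trinity.fasta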
Quantification
Reads are mapped back to the assembly and quantified within Trinity using the plugin package RNA-Seq by Expectation-Maximization (RSEM) [32, 33]. This plugin estimates the abundance of contigs with an alignment-based algorithm that corrects for transcript lengths. In brief, RSEM assumes that each sequence read and its length are observed (one end for single-end reads and both ends for paired-end reads) and are generated according to a directed acyclic graph that includes parent sequence, length, start position and orientation. A Bayesian version of the expectation-maximization algorithm is used to obtain maximum likelihood estimates of the model parameters and transcript fractions, as well as posterior mean estimates of abundances. The resulting gene expression matrices are then normalized to transcripts per million (TPM) to make them comparable across samples. The original TPM matrix generated by the Trinity plugin includes all gene names in each dataset. For each dataset, this expression matrix is then transformed into a long format with gene names, sample names and TPM values.
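Within Trinity, this step is exposed through utility scripts; a hedged sketch for one sample and for building the combined matrices (sample and file names are placeholders) is:

# align reads back to the assembly and estimate abundance with RSEM (bowtie2 alignments)
align_and_estimate_abundance.pl --transcripts trinity_out_dir/Trinity.fasta \
  --seqType fq --left sample1_1.fq.gz --right sample1_2.fq.gz \
  --est_method RSEM --aln_method bowtie2 --trinity_mode \
  --prep_reference --output_dir sample1_rsem

# combine per-sample results into gene/isoform expression matrices (counts, TPM and TMM)
abundance_estimates_to_matrix.pl --est_method RSEM \
  --gene_trans_map trinity_out_dir/Trinity.fasta.gene_trans_map \
  --out_prefix RSEM --name_sample_by_basedir \
  sample1_rsem/RSEM.isoforms.results sample2_rsem/RSEM.isoforms.results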
We also include the option to calculate trimmed mean of M-values (TMM) normalized expression values if desired. We recommend conducting standard descriptive statistics on the TMM and/or TPM values, including but not limited to outlier detection via PCA plots with replicates, inspection of extreme expression levels within replicates, and analysis of internal controls, if available. Subgroup structure in the data can also be revealed with a multidimensional scaling (MDS) plot using the R package DESeq2 (v1.28.1) [34].
Orthologous gene searching
Genes that have an orthologous relationship are grouped using an automated pipeline for OrthoMCL [28, 35–37], called Orthomcl-pipeline (https://github.com/apetkau/orthomcl-pipeline), which requires amino acid sequences as input. Before running OrthoMCL for OGG identification, TransDecoder (https://github.com/TransDecoder/TransDecoder/wiki) is used to predict coding regions from the assembled transcripts, thereby providing the amino acid sequences. To generate an efficient indexing system for the transcripts, we modify transcript headers by incorporating the original project accession number and the species code. If users intend to compare expression data not only at the species level but also at a higher taxonomic level (e.g., families, genera or phyla), the taxon code must be added to the transcript headers together with the species code.
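A hedged sketch of the ORF prediction and header renaming for one dataset (the accession number, species code and sed pattern are illustrative only; CoRMAP's own scripts implement its exact header convention) is:

# predict long ORFs and the most likely coding regions, keeping one ORF per transcript
TransDecoder.LongOrfs -t trinity_out_dir/Trinity.fasta
TransDecoder.Predict -t trinity_out_dir/Trinity.fasta --single_best_only

# prepend a project accession and species code to each FASTA header (illustrative pattern)
sed 's/^>/>SRP000001_HSA_/' Trinity.fasta.transdecoder.pep > HSA.faa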
An all-versus-all BLAST search of the predicted open reading frames is used to derive scores for pairwise sequence similarities. The E-value cut-off needs to be set by the user and depends on the taxonomic distance between the species or groups. To identify ortholog or in-paralog pairs, each pair of matched sequences is filtered with a 50% "percent match length" cut-off. The percent match length score is obtained by counting the number of amino acids in the shorter sequence that participate in any high-scoring segment pair (HSP), dividing that count by the length of the shorter sequence, and multiplying by 100. The filtered pairs are then linked across or within the units of comparison (species or classes). To group orthologous genes, the Markov clustering (MCL) algorithm, which is included in the Orthomcl-pipeline, is applied. The output of the Orthomcl-pipeline includes a list of groups containing the orthologous protein sequence headers, which are used to extract the corresponding protein sequences, transcript sequences, quantification values and gene mapping information for further analysis.
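Within the Orthomcl-pipeline this search is run automatically; run standalone with NCBI BLAST+ it would look roughly like the following (the combined FASTA file name and the E-value of 1e-5 are placeholders):

# build a protein database from the combined amino acid sequences of all species
makeblastdb -in all_species.faa -dbtype prot -out all_species_db

# all-versus-all BLASTP search in tabular format; the E-value cut-off is user defined
blastp -query all_species.faa -db all_species_db \
  -outfmt 6 -evalue 1e-5 -num_threads 8 -out all_vs_all.blastp.tsv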
When searching for orthologs among several distantly related species, datasets can be split into groups based on their taxonomic relationships to reduce computational requirements. Orthologous gene searching (clustering) is then performed at two levels, intraclass and interclass. The general process is: (1) in-parallel ortholog searching within each class; (2) filtering clustered genes from the OGGs that contain orthologous genes from all species in a class; (3) integrating the filtered genes from all classes for interclass ortholog searching; and (4) filtering clustered genes from the OGGs that contain orthologous genes from all classes. A sketch of the filtering in step (2) is given below.
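As an illustration of step (2), assuming OrthoMCL-style group output (one group per line, with genes prefixed by a species code and a "|" separator) and a hypothetical file species.txt listing one species code per line, groups containing at least one gene from every species in the class could be retained as follows:

# keep only groups (OGGs) covering every species listed in species.txt
awk 'NR == FNR { sp[$1]; n++; next }            # first file: collect species codes
     {
       delete seen
       for (i = 2; i <= NF; i++) {              # field 1 is the group ID, e.g. "group1:"
         split($i, a, "|"); seen[a[1]]
       }
       c = 0; for (s in sp) if (s in seen) c++
       if (c == n) print                        # group covers all species: keep it
     }' species.txt groups.txt > complete_groups.txt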
If the input datasets require interclass ortholog searching followed by filtering, the output consists of cross-class OGGs (cOGGs). Because of the high annotation quality of the human and some other mammalian genomes, only clusters from these genomes are extracted for annotation. Annotation of these mammalian proteins (using BLASTP) and transcripts (using BLASTX) is obtained with Trinotate and, by default, the UniProt database. BLASTP returns only the best match, while BLASTX returns the top five hits. E-value cut-offs can be modified by the user depending on the evolutionary distance between the organisms compared in the study. Other parameters in Trinotate and UniProt are set to their defaults, except for the number of parallel threads, which is set to the maximum number of available cores. Protein sequences are also searched against a protein profile HMM database, and Trinotate is used to generate the final report.
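A hedged outline of this annotation step, assuming the SwissProt BLAST database and a Trinotate SQLite boilerplate database have already been prepared as described in the Trinotate documentation (file names and the E-value cut-offs are placeholders), is:

# homology searches against UniProt/SwissProt: best hit for proteins, top five hits for transcripts
blastp -query Trinity.fasta.transdecoder.pep -db uniprot_sprot.pep \
  -max_target_seqs 1 -outfmt 6 -evalue 1e-5 -num_threads 8 > blastp.outfmt6
blastx -query Trinity.fasta -db uniprot_sprot.pep \
  -max_target_seqs 5 -outfmt 6 -evalue 1e-5 -num_threads 8 > blastx.outfmt6

# load the results into the Trinotate database and write the final annotation report
Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map \
  --transcript_fasta Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder.pep
Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6
Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6
Trinotate Trinotate.sqlite report > trinotate_annotation_report.xls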
Analysis of expression patterns of orthologous gene groups
Based on the header names, CoRMAP extracts the expression values of orthologous genes from the total set of transcripts in each dataset. In our pipeline, transcript variants refer to different versions of a transcript from the same gene; these, along with isoforms, are averaged within the same OGG to represent the OGG expression value. For the final report, CoRMAP provides a meta-table with the OGG number, gene name, annotation, and source information including species name, together with additional metadata such as tissue type (e.g., brain region), study focus (e.g., memory type) and study design. From this meta-table, users can analyze and compare expression levels from different perspectives, such as taxonomic group, project design and tissue type, along with a variety of other experimental conditions. As mentioned above, the pipeline offers several checkpoints before calculating expression levels and before conducting comparisons between species.
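As a simple illustration of how transcript-level values can be averaged per OGG, assuming a hypothetical long-format table ogg_long_table.tsv with a header row and tab-separated columns OGG_ID, gene_ID, sample and TPM:

# average the TPM of all transcripts/isoforms belonging to the same OGG within each sample
awk -F'\t' 'NR > 1 { key = $1 "\t" $3; sum[key] += $4; n[key]++ }
            END { for (k in sum) print k "\t" sum[k] / n[k] }' \
    ogg_long_table.tsv > ogg_mean_expression.tsv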