Prediction of novel long non-coding RNAs based on RNA-Seq data of mouse Klf1 knockout study

Background Study on long non-coding RNAs (lncRNAs) has been promoted by high-throughput RNA sequencing (RNA-Seq). However, it is still not trivial to identify lncRNAs from the RNA-Seq data and it remains a challenge to uncover their functions. Results We present a computational pipeline for detecting novel lncRNAs from the RNA-Seq data. First, the genome-guided transcriptome reconstruction is used to generate initially assembled transcripts. The possible partial transcripts and artefacts are filtered according to the quantified expression level. After that, novel lncRNAs are detected by further filtering known transcripts and those with high protein coding potential, using a newly developed program called lncRScan. We applied our pipeline to a mouse Klf1 knockout dataset, and discussed the plausible functions of the novel lncRNAs we detected by differential expression analysis. We identified 308 novel lncRNA candidates, which have shorter transcript length, fewer exons, shorter putative open reading frame, compared with known protein-coding transcripts. Of the lncRNAs, 52 large intergenic ncRNAs (lincRNAs) show lower expression level than the protein-coding ones and 13 lncRNAs represent significant differential expression between the wild-type and Klf1 knockout conditions. Conclusions Our method can predict a set of novel lncRNAs from the RNA-Seq data. Some of the lncRNAs are showed differentially expressed between the wild-type and Klf1 knockout strains, suggested that those novel lncRNAs can be given high priority in further functional studies.


Background
The category of long non-coding RNAs (lncRNAs) is composed of non-coding RNAs (ncRNAs) with long transcript length (> 200 nucleotides) [1]. The lncRNAs may carry out a variety of functions, e.g. scaffolding multiple proteins to form a complex, and regulating gene expression [2][3][4][5][6][7][8][9][10][11], however, most lncRNAs' functions remain to be specified. During the past decade, a growing number of newly detected lncRNAs have been reported thanks to the development of relevant biotechnology and computational methods [4,[12][13][14][15][16]. Early tiling microarrays were used to detect the lncRNAs in the mammalian transcriptome [4,5], however, they could not detect precise gene http://www.biomedcentral.com/1471-2105/13/331 transcripts [20], which may be mistakenly identified as lncRNAs. The similar mistakes could also be introduced by low sequencing depth for lowly expressed transcripts. Moreover, the incomplete and erroneous assemblies can affect downstream analysis [21,22]. Second, transcriptome reconstruction [23] based on RNA-Seq reads may produce a variety of transcripts, e.g. completely assembled transcripts, intronic RNAs [24] and antisense transcripts [16], which are classified by comparing to the known gene annotations. Thus it is not trivial to identify lncRNAs from such complex assemblies. Third, it is still difficult to distinguish the lncRNAs from the proteincoding mRNAs [1] or short peptides. A protein-coding mRNA can be defined by open reading frame (ORF) greater than 100 amino acids (aa) or 300 nucleotides (nt) [25], but this is arbitrary and incorrect [26]. Here we present a computational pipeline to address these problems.
Although thousands of lncRNAs have been identified [13,14,16], only a handful of them were functionally characterized. Given the difficulty to experimentally characterize the biological functions of the lncRNAs [7], and given the growing body of genomics and epigenomics data becoming available relevant to lncRNAs' biological functions, it is interesting to predict lncRNAs' functions computationally. We applied our computational method to an RNA-Seq dataset derived from a Klf1 gene knockout study on mouse fetal liver tissue [27]. Previous studies based on the Klf1 knockout study manifested that Klf1 is the founding member of a family of 17 transcription factors in mammals [28]. Klf1 knockout mice die from anemia by embryonic day 15 (E15), with severe defects in differentiation, hemoglobinization, enucleation, and membranecytoskeleton organization of red blood cells [29]. However, very little is known of the lncRNAs regulated by Klf1 or that participate in the development of erythroid cells. Here, we recruit the differential expression analysis to explore the lncRNAs that may function in the erythropoiesis.

Pipeline for predicting novel lncRNAs
There are two parts in our pipeline for predicting novel lncRNAs from the RNA-Seq data ( Figure 1).

Initial assembly
Initial assembly (Figure 1-a) represents a genome-guided strategy for transcriptome reconstruction [23]. The raw RNA-Seq reads were first mapped onto the mm9 genome by Tophat 2.0.3 [19]. After that, the un-mapped reads were trimmed to 50 nt before re-mapping. The final mapped reads of each replicate include two parts, namely 'Mapped reads 1' and 'Mapped reads 2' . Moreover, the '-G' option of Tophat together with the Gene Transfer Format (GTF) file of the Ensembl gene annotation was used for read mapping. With the read alignments, we calculated the overlap ratio (OR) between the replicates of each condition (Additional file 1). To increase the read coverage, we merged the read alignments of all six replicates into one Binary version of Sequence Alingment/Map (BAM) using Samtools 0.1.18 [37]. Then the mapped reads were assembled by Cufflinks 2.0.2 [21]. In the transcriptome assembly, we performed Reference Annotation Based Transcript (RABT) assembly [38] with the RefSeq gene annotation to compensate incompletely assembled transcripts caused by read coverage gaps in the regions of RefSeq genes.

Novel lncRNAs detection
Novel lncRNAs detection (Figure 1-b) is aimed at detecting novel lncRNAs from the initial assemblies. Specifically, the initial assemblies were first compared to a set of combined gene annotations (See below) using cuffcompare [22]. As a result, not only the assemblies that completely match the annotations will be detected, but also the novel transcripts can be categorized into different categories according to their locations compared with the reference genes. Notably, only multi-exon transcripts were retained for the comparison and downstream processing. Then low-quality assemblies were filtered according to the optimum Fragments Per Kilobases of exon per Million fragments mapped (FPKM) [21] threshold (2.12, see below). After that, we used a newly-developed program called lncRScan (See below) to detect novel lncRNAs.

Combined gene annotations of RefSeq, Ensembl and UCSC mouse known genes
The cuffcompare program [22] was used to merge the RefSeq, Ensembl and UCSC mouse known genes into one set of gene annotation for comparing with the assembled transcripts. http://www.biomedcentral.com/1471-2105/13/331 Figure 1 Pipeline for predicting novel lncRNAs. (a) Initial assembly. Raw reads are first mapped onto the reference mouse genome. The un-mapped reads are trimmed before re-mapping. Merging the read alignments of all 6 replicates is to increase the read coverage. At the assembly stage, RABT generates synthetic reads from the RefSeq gene annotation to compensate the read coverage gaps over transcripts; (b) Novel lncRNAs detection. The initial assemblies are categorized by cuffcompare, compared with the combined gene annotations. The low-quality transcripts are then filtered according to the optimum FPKM (2.12). The lncRScan program is performed to detect the novel lncRNAs from the remaining high-quality assemblies according to multiple criteria.

FPKM threshold for classifying complete and partial transcripts
Based on the merged read alignments, we conducted an experiment to evaluate the performance of FPKM in classifying complete and partial transcripts. Specifically, we first ran cufflinks on the merged read alignments with default options. Then the output assemblies with FPKM values estimated were categorized using cuffcompare, compared with the combined gene annotations. With the results, we evaluated the performance of different FPKM thresholds in classifying the complete and partial transcripts by Receiver Operating Characteristic (ROC) [39].

Calculating optimum FPKM threshold
The optimum FPKM threshold for classifying the complete and partial transcripts were calculated by training the FPKM values estimated from the experiment above. The index of the optimum FPKM threshold can be obtained by optimizing the sensitivity and specificity in classifying the complete and partial transcripts with formula 1.
where i * represents the index of the optimum FPKM threshold. On the right of formula 1, sensitivities[ i] and specificities[ i] respectively denote the ith sensitivities and specificities, given an index i. The i is enumerated in I, ranging from 1 to the size of a FPKM threshold set. Then we can get the optimum FPKM threshold using formula 2.
where t * denotes the optimum FPKM threshold. The FPKM threshold set T were generated by pROC [39], given the FPKM values of the complete and partial transcripts.

lncRScan
To detect novel lncRNAs from a set of high-quality assemblies, a five-step program named long non-coding RNA Scan (lncRScan) was designed ( Figure 2). Step 1 'extract category' is used to extract five candidate categories of transcripts, including 'i' , 'j' , 'o' , 'u' and 'x' , which may contain novel lncRNAs. Specifically, the 'i' category may contain the lncRNAs falling entirely within an intron of known genes. And the 'j' category may include alternative long non-coding isoforms of known genes as they share at least one spliced site with reference transcripts. The 'u' category may involve the intergenic lncRNAs (lincRNAs). The 'o' category may contain the lncRNAs having generic exonic overlap with a known transcript while the 'x' transcripts also have exonic overlap http://www.biomedcentral.com/1471-2105/13/331 with reference but on the opposite strand. Therefore, the five categories defined here may include novel lncRNAs potentially. On the other hand, all categories of transcripts extracted have not been annotated by either of RefSeq, Ensembl and UCSC known genes, so the predicted lncRNAs can be 'novel' .
Step 2 'extract length' is used to extract the transcripts having long exonic length (> 200 nt) according to the lncRNA's definition.
Step 3 'extract ORF' is set to exclude the assemblies that have long (≥ 300 nt) putative ORF. Then steps 4 and 5 are used to exclude the transcripts of protein-coding potential. In Step 4 'extract PhyloCSF' , Phylogenetic Codon Substitution Frequency (PhyloCSF) [40] is recruited to filter the transcripts of protein-coding potential from an evolutionary view. Briefly, PhyloCSF conducts a comparative genomics method for classifying protein-coding and non-coding sequences [40]. Since the sequence alignments are required for running PhyloCSF, we used Galaxy [41][42][43] to 'stitch' 29 mammalian alignments according to the input transcripts. In Step 5 'extract Pfam' , the amino acid sequences of the remaining transcripts are searched in Pfam [44] (both Pfam-A and Pfam-B) for comparing to known proteins or protein domains, and the transcripts with significant domain hits are excluded.
To evaluate the performance of lncRScan in identifying lncRNAs or filtering mRNAs, we ran the steps 3-5 of lncRScan on four datasets respectively. The first dataset (D-1) contains 1615 multi-exon RefSeq ncRNAs with length > 200nt and the second one (D-2) records 1615 mRNAs randomly sampled from 26368 RefSeq mRNAs. The other two datasets (D-3 and D-4) include 3230 and 4845 mRNAs sampled from the RefSeq mRNAs respectively. The numbers of the retained and filtered transcripts through the steps 3-5 of lncRScan are summarized in Table 1. We can see that 771 (47.74%) lncRNAs of D-1 were retained after the steps 3-5. In contrast, most (99.6%-99.7%) of the mRNAs (D-2, D-3 and D-4) were filtered by the steps 3-5. The result indicates that the filters of lncRScan can dramatically reduce the number of mRNAs. Notably, the step 3 adopting the ORF threshold can filter a large proportion of mRNAs thereby alleviating the overload of PhyloCSF and Pfam calculation. However, some true lncRNAs were filtered through the pipeline, which made the final lncRNAs prediction much stringent.
In addition, lncRScan is available to the scientific community and it can be obtained by svn checkout http:// lncrscan.googlecode.com/svn/trunk/lncrscan-read-only. Other details about lncRScan can be found on http:// code.google.com/p/lncrscan/.

Differential expression analysis
The cuffdiff [22] program was performed to conduct differential expression (DE) tests between the wild-type (WT) and Klf1 knockout (Klf1 KO) samples ( Figure 3). The fold changes were calculated via log 2 FPKM WT FPKM Klf 1KO . A transcript will be reported DE significant if the test gives that the FDR-adjusted p-value after Benjamini-Hochberg correction [45] for multiple-testing represent statistical significant (q-value < 0.05) [46].

Comparisons of transcript length, exon number, ORF length and expression level
The novel lncRNAs we detected were compared to 26368 RefSeq protein-coding transcripts ('NM' prefix) and 2843 RefSeq non-coding transcripts ('NR' prefix) in terms of transcript length, exon number and ORF length. Since a real ncRNA does not have an ORF, a putative ORF of the ncRNA candidate is defined by the longest consecutive codon chain of the ncRNA candidate for comparing with the protein-coding genes. Moreover, for both of the WT and Klf1 KO conditions, we compared the quantified expression levels (FPKM) of the novel lncRNAs to that of the known protein-coding transcripts, which were extracted from the RefSeq and Ensembl gene annotations. The novel lncRNAs and protein-coding transcripts used for FPKM comparison all have enough expression levels (FPKM ≥ 2.12).
With the merged alignments of six replicates, 34053 multi-exon transcripts (26212 annotated, 701 contained by annotations and 7140 novel potentially) were assembled in total, compared with 88434 transcripts of the combined gene annotations. Then we obtained the categories of the initial assemblies by comparing to the combined gene annotations (Table 3). It is notable that the initial assemblies include several categories of transcripts, e.g. transcripts that have complete match intron chain compared with known genes ('=' classcode) and those contained by known genes ('c' classcode). Of the initial assemblies, 26212 (76.97%) transcripts have been annotated by either of RefSeq, Ensembl and UCSC known genes.
Filtering low-quality assemblies with optimum FPKM threshold FPKM can unbiasedly represent quantified expression level of an assembled transcript, and it can be estimated by maximum likelihood estimation (MLE) under a statistical model of cufflinks [21], which also corrects sequencing biases [18] in the estimation. Figure 4 shows the FPKM distributions [47] of the complete ('=' classcode) and partial ('c' classcode) transcripts assembled from the experiment of FPKM threshold (See Methods) while Figure 5 shows the corresponding ROC curve. Notably, the complete transcripts represent much larger FPKM than the partial ones on average (∼29.67 vs ∼4.86, P < 2.2×10 −16 , Figure 3 Differential expression tests. The cuffdiff program performs differential expression tests between the WT and Klf1 KO samples based on the read alignments (BAM) of the six replicates and high-quality assemblies (GTF).
Welch Two Sample t-test). According to the significant difference of FPKM distributions of complete and partial assemblies, we calculated the optimum FPKM threshold (2.12) based on our data (See Methods). We assumed that the artificial transcripts represent either similar FPKM distribution to the partial transcripts or lower FPKM than the partial ones, thus the optimum threshold can be used to filter both of the partial assemblies and artefacts from the 7140 novel assemblies.

Identification of high-quality assemblies
We pooled a set of high-quality assemblies (Additional file 2) for downstream analysis. The high-quality assemblies consist of two categories. One category contains the 26212 initial assemblies that completely match the combined gene annotations ('=' classcode). The other category refers to the 3288 transcripts extracted from the 7140 novel assemblies (!{'=' , 'c'}), which satisfy the expression criterion (FPKM ≥ 2.12).

Novel mouse embryonic lncRNAs
We applied our newly developed lncRNAs detector lncRScan to the high-quality assemblies and detected 308 novel mouse embryonic lncRNAs (Additional file 3). The novel lncRNAs were further classified into 5 categories by comparing with the known gene annotations (Table 4). Specifically, 52 lncRNAs were assigned the 'u' classcode since they were located in the intergenic regions. And 26 lncRNAs with the 'i' classcode fall entirely within the intron of known genes. The other lncRNAs all have exon overlap with known genes. Specifically, 44 lncRNAs with the 'o' classcode have generic exonic overlap with  known genes and 6 'x' lncRNAs also have exonic overlap with known genes but on the opposite strand. The 180 lncRNAs with 'j' can be long non-coding isoforms of known genes. In addition, the 308 novel lncRNAs we predicted were compared with 36991 ones annotated by NONCODE 3.0 [48]. Of the 308 novel lncRNAs, 5 (1.62%) ones have the same structure as NONCODE lncRNAs (Additional file 1) and another 75 (24.35%) ones partially overlap the NONCODE lncRNAs ( Figure 6). By excluding the 80 lncRNAs that overlap the NONCODE annotation, we can get a more stringent set of novel lncRNAs.

Novel lncRNAs have shorter transcript length, fewer exons and shorter putative ORF than protein-coding transcripts
Previous studies in mammals have shown that lncR-NAs are shorter in length and fewer in exon number than are protein-coding transcripts [13,14,16]. To determine whether the embryonic lncRNAs we detected have the same features, we compared the 308 novel lncRNAs to not only 26368 protein-coding transcripts, but also 2843 known non-coding ones, annotated by RefSeq (See Methods). As shown in Figure 7, the novel lncRNAs represent much shorter transcript length on average than either RefSeq protein-coding (∼1.2kb vs ∼3.1kb, P < 2.2 × 10 −16 , Welch Two Sample t-test) or non-coding transcripts (∼1.2kb vs ∼1.9kb, P = 6.027 × 10 −14 ) while the lncRNAs also show fewer exons than either of the Ref-Seq protein-coding (∼2.8 vs ∼10.0, P < 2.2 × 10 −16 ) and non-coding transcripts (∼2.8 vs ∼3.3, P = 5.096 × 10 −8 ), agreed with a previous report [13]. In addition, we also compared the putative ORF lengths of the lncRNAs to that of the RefSeq genes (both protein-coding and noncoding). As a result, the novel lncRNAs represent shorter putative ORF length than either RefSeq protein-coding RNAs (∼0.17 kb vs ∼1.6 kb, P < 2. An assembled transcript will be classified into the category of complete assemblies ('=' classcode) if its FPKM is larger than a given threshold, otherwise it will be put into the partial category ('c' classcode). The blue ROC curve [39] represents the performance of FPKM in classifying the complete and partial transcripts. The corresponding Area Under Curve (AOC) is 0.7825.
consistent with a previous report on zebrafish embryonic lncRNAs [16]. Although the novel lncRNAs candidates are to be ncRNAs, they can differ from the RefSeq ncRNAs used for comparison in some features due to several reasons as follows. First, the RefSeq ncRNAs do not only include lncRNAs, but also other categories of ncRNAs, e.g. microRNAs and small nucleolar RNAs. Second, the lncRNAs can be further classified according to their biological functions, thus the features of different categories of lncRNAs may differ from each other. The lncRNAs we detected may not come from the same category as that annotated by RefSeq. Third, the unbalanced population sizes can affect the comparison between the two categories of ncRNAs. Last, the putative ORF length of the lncRNAs we predicted were limited (< 300 nt), which can affect the ORF comparison. Therefore it is reasonable to see that the two categories of ncRNAs repre-sent slight statistical difference, which is far less than that between the mRNAs and ncRNAs.

Novel lincRNAs have lower expression level than protein-coding transcripts
Previous studies also showed that lncRNAs are expressed at significantly lower levels than are protein-coding transcripts [13,14,16]. To determine whether the embryonic lncRNAs we detected have the same expression feature, we compared the quantified expression levels (FPKM) of the 308 novel lncRNAs to that of the known proteincoding transcripts (Figure 8). In the WT condition (Figure 8-a), the protein-coding transcripts represents slightly higher expression than the novel lncRNAs on average (∼50.92 vs ∼44.54, P = 0.554, Welch Two Sample t-test). Similarly, in the Klf1 KO condition (Figure 8-b), the protein-coding transcripts also show slightly higher expression than the lncRNAs on average (∼37.63 vs ∼34.06, P = 0.6986). The comparison result indicates that the total novel lncRNAs do not show significant lower expression than the protein-coding ones. Moreover, we extracted the 52 lincRNAs ('u' classcode) from the 308 lncRNAs for the expression comparison. The result manifests that the lincRNAs we predicted represents significant lower expression than the protein-coding ones in either WT or Klf1 KO condition (∼11.29 vs ∼50.93, P < 2.2 × 10 −16 , and ∼9.38 vs ∼37.63, P < 2.2 × 10 −16 , respectively).

Differentially expressed lncRNAs
Using cuffdiff, we conducted the differential expression (DE) tests between the WT and Klf1 KO samples for analysing the function of the novel lncRNAs. At the gene level ( Figure 9-a), Klf1 represents like an activator since more assembled genes are significantly repressed (334) after Klf1 is knocked out than the activated ones (250). At the transcript level ( Figure 9-b), Klf1 also behaves like an activator since more transcripts are significantly repressed (262) after Klf1 is knocked out than the activated ones (147). Moreover, we detected 13 (Additional file 4) novel lncRNAs with DE significant. Notably, Klf1 still functions like an activator for the 13 lncRNAs (10 repressed vs 3 activated after Klf1 is knocked out, Figure 9-c). Thus it  is obvious that Klf1 can function as an activator globally, regulating the expression of a number of genes or transcripts including the lncRNAs we detected. The detailed categories of the 13 lncRNAs of DE significant can be seen from Table 5.
However, cuffdiff does a length correction that has a tendency to inflate the FPKM counts for small transcripts, which can interfere the differential expression analysis. To alleviate this problem, we re-ran the DE tests with the "no-effective-length-correction"parameter. As a result, we protein−coding non−coding novel lncRNAs obtained the same results as that without the parameter, which represent the robustness of our predictions.   On the other hand, our pipeline followed a similar strategy for predicting human lincRNAs [13], but we differ in three aspects. First, we used FPKM as a feature for filtering low quality assemblies instead of the read coverage [13] due to the fact that FPKM can unbiasedly represent the expression level of a transcript and the read coverage does not show better performance than FPKM in classifying the complete and partial transcripts assembled from our data (AUCs are equal). Second, we excluded the transcripts having long putative ORF length (≥ 300 nt), which was previously used by the FANTOM consortium [50]. This arbitrary cutoff makes our predictions more stringent, but it must omit the lncRNAs having long putative ORF (≥ 300 nt). Last, we detected several DE significant lncRNAs, which composed a subset of the total lncRNAs we detected and they are more worth being investigated by loss and gain of function studies than the other novel lncRNAs in our scenario. Consequently, our computational methods can effectively alleviate further experimental work for studying the lncRNAs that may participate in the development of erythroid cells.

RNA-
Although our method presented its ability in detecting novel lncRNA candidates, its prediction accuracy can be improved from several aspects, such as using more reliable reads generated by high-quality deep sequencing, paired-end sequencing and strand-specific sequencing. And recent single-molecule sequencing technologies can provide more unbiased ways to capture the transcriptome [51]. The sensitivity of transcriptome reconstruction can also be improved by using various strategies, such as integrating assembly results from Scripture [14]. In addition, the novel lncRNAs predicted from our computational pipeline should be validated by biological experiments, such as cloning and PCR-based techniques [22] as several ones have been tested in the original study by Tallack et al. [27]. Furthermore, additional genetic and/or epigenetic data sources, e.g. Chromatin Immunoprecipitation-Sequencing (ChIP-Seq) on chromatin signatures, can be valuable sources providing useful information for characterizing functions of the novel lncRNAs. And the loss and gain of function studies can be conducted for exploring regulatory mechanisms of the lncRNAs.