Prediction of novel long non-coding RNAs based on RNA-Seq data of mouse Klf1 knockout study

Sun, Lei; Zhang, Zhihua; Bailey, Timothy L; Perkins, Andrew C; Tallack, Michael R; Xu, Zhao; Liu, Hui

doi:10.1186/1471-2105-13-331

Research article
Open access
Published: 13 December 2012

Prediction of novel long non-coding RNAs based on RNA-Seq data of mouse Klf1 knockout study

Lei Sun^1,2,3,
Zhihua Zhang²,
Timothy L Bailey³,
Andrew C Perkins⁴,
Michael R Tallack⁴,
Zhao Xu¹ &
…
Hui Liu¹

BMC Bioinformatics volume 13, Article number: 331 (2012) Cite this article

24k Accesses
106 Citations
10 Altmetric
Metrics details

Abstract

Background

Study on long non-coding RNAs (lncRNAs) has been promoted by high-throughput RNA sequencing (RNA-Seq). However, it is still not trivial to identify lncRNAs from the RNA-Seq data and it remains a challenge to uncover their functions.

Results

We present a computational pipeline for detecting novel lncRNAs from the RNA-Seq data. First, the genome-guided transcriptome reconstruction is used to generate initially assembled transcripts. The possible partial transcripts and artefacts are filtered according to the quantified expression level. After that, novel lncRNAs are detected by further filtering known transcripts and those with high protein coding potential, using a newly developed program called lncRScan. We applied our pipeline to a mouse Klf1 knockout dataset, and discussed the plausible functions of the novel lncRNAs we detected by differential expression analysis. We identified 308 novel lncRNA candidates, which have shorter transcript length, fewer exons, shorter putative open reading frame, compared with known protein-coding transcripts. Of the lncRNAs, 52 large intergenic ncRNAs (lincRNAs) show lower expression level than the protein-coding ones and 13 lncRNAs represent significant differential expression between the wild-type and Klf1 knockout conditions.

Conclusions

Our method can predict a set of novel lncRNAs from the RNA-Seq data. Some of the lncRNAs are showed differentially expressed between the wild-type and Klf1 knockout strains, suggested that those novel lncRNAs can be given high priority in further functional studies.

Background

The category of long non-coding RNAs (lncRNAs) is composed of non-coding RNAs (ncRNAs) with long transcript length (> 200 nucleotides)[1]. The lncRNAs may carry out a variety of functions, e.g. scaffolding multiple proteins to form a complex, and regulating gene expression[2–11], however, most lncRNAs’ functions remain to be specified. During the past decade, a growing number of newly detected lncRNAs have been reported thanks to the development of relevant biotechnology and computational methods[4, 12–16]. Early tiling microarrays were used to detect the lncRNAs in the mammalian transcriptome[4, 5], however, they could not detect precise gene structures and exon linkages of the lncRNAs[14]. Subsequently, this problem was tackled by high-throughput RNA sequencing (RNA-Seq), which presented its advantage of revealing the whole transcriptome[17], including detailed gene structures and expression levels. So far, the RNA-Seq has been the major biotechnology for lncRNA study[13]. For example, by using RNA-Seq, Guttman et al.[14] obtained detailed information of over a thousand large intergenic ncRNAs (lincRNAs) in three mouse cell types[14].

However, studying lncRNAs based on RNA-Seq encounters several technical problems. First, the assembled transcriptome may include partial transcripts and artefacts caused by RNA-Seq problems, such as low sequencing depth, sequencing biases[18] and short read alignment errors[19]. For lowly expressed transcripts, the sequencing biases may introduce undesired gaps in the assembly, resulting in partially assembled transcripts[20], which may be mistakenly identified as lncRNAs. The similar mistakes could also be introduced by low sequencing depth for lowly expressed transcripts. Moreover, the incomplete and erroneous assemblies can affect downstream analysis[21, 22]. Second, transcriptome reconstruction[23] based on RNA-Seq reads may produce a variety of transcripts, e.g. completely assembled transcripts, intronic RNAs[24] and antisense transcripts[16], which are classified by comparing to the known gene annotations. Thus it is not trivial to identify lncRNAs from such complex assemblies. Third, it is still difficult to distinguish the lncRNAs from the protein-coding mRNAs[1] or short peptides. A protein-coding mRNA can be defined by open reading frame (ORF) greater than 100 amino acids (aa) or 300 nucleotides (nt)[25], but this is arbitrary and incorrect[26]. Here we present a computational pipeline to address these problems.

Although thousands of lncRNAs have been identified[13, 14, 16], only a handful of them were functionally characterized. Given the difficulty to experimentally characterize the biological functions of the lncRNAs[7], and given the growing body of genomics and epigenomics data becoming available relevant to lncRNAs’ biological functions, it is interesting to predict lncRNAs’ functions computationally. We applied our computational method to an RNA-Seq dataset derived from a Klf1 gene knockout study on mouse fetal liver tissue[27]. Previous studies based on the Klf1 knockout study manifested that Klf1 is the founding member of a family of 17 transcription factors in mammals[28]. Klf1 knockout mice die from anemia by embryonic day 15 (E15), with severe defects in differentiation, hemoglobinization, enucleation, and membranecytoskeleton organization of red blood cells[29]. However, very little is known of the lncRNAs regulated by Klf1 or that participate in the development of erythroid cells. Here, we recruit the differential expression analysis to explore the lncRNAs that may function in the erythropoiesis.

Methods

Datasets

The RNA-Seq dataset for the Klf1 knockout experiment on mouse embryonic day 14.5 (E14.5) fetal liver tissue can be obtained from NCBI Gene Expression Omnibus (GEO)[30] database with accession number GSE33979[27], and it includes 6 replicates (3 for wild-type and 3 for Klf1 knockout) totalling 160 million 76-base single-end reads generated by Illumina GAIIx sequencing on polyadenylated selected (Poly-A⁺) RNAs. Bowtie[31] index of Mus musculus genome (mm9), Ensembl[32] and NCBI reference sequences (RefSeq) mouse gene annotations[33] are all available on Cufflinks’ website[34]. University of California Santa Cruz (UCSC) mouse known gene annotations[35] can be downloaded from the UCSC genome browser[36].

Pipeline for predicting novel lncRNAs

There are two parts in our pipeline for predicting novel lncRNAs from the RNA-Seq data (Figure1).

Initial assembly

Initial assembly (Figure1-a) represents a genome-guided strategy for transcriptome reconstruction[23]. The raw RNA-Seq reads were first mapped onto the mm9 genome by Tophat 2.0.3[19]. After that, the un-mapped reads were trimmed to 50 nt before re-mapping. The final mapped reads of each replicate include two parts, namely ‘Mapped reads 1’ and ‘Mapped reads 2’. Moreover, the ‘-G’ option of Tophat together with the G ene T ransfer F ormat (GTF) file of the Ensembl gene annotation was used for read mapping. With the read alignments, we calculated the overlap ratio (OR) between the replicates of each condition (Additional file1). To increase the read coverage, we merged the read alignments of all six replicates into one B inary version of Sequence A lingment/M ap (BAM) using Samtools 0.1.18[37]. Then the mapped reads were assembled by Cufflinks 2.0.2[21]. In the transcriptome assembly, we performed R eference A nnotation B ased T ranscript (RABT) assembly[38] with the RefSeq gene annotation to compensate incompletely assembled transcripts caused by read coverage gaps in the regions of RefSeq genes.

Novel lncRNAs detection

Novel lncRNAs detection (Figure1-b) is aimed at detecting novel lncRNAs from the initial assemblies. Specifically, the initial assemblies were first compared to a set of combined gene annotations (See below) using cuffcompare[22]. As a result, not only the assemblies that completely match the annotations will be detected, but also the novel transcripts can be categorized into different categories according to their locations compared with the reference genes. Notably, only multi-exon transcripts were retained for the comparison and downstream processing. Then low-quality assemblies were filtered according to the optimum F ragments P er K ilobases of exon per M illion fragments mapped (FPKM)[21] threshold (2.12, see below). After that, we used a newly-developed program called lncRScan (See below) to detect novel lncRNAs.

Combined gene annotations of RefSeq, Ensembl and UCSC mouse known genes

The cuffcompare program[22] was used to merge the RefSeq, Ensembl and UCSC mouse known genes into one set of gene annotation for comparing with the assembled transcripts.

FPKM threshold for classifying complete and partial transcripts

Based on the merged read alignments, we conducted an experiment to evaluate the performance of FPKM in classifying complete and partial transcripts. Specifically, we first ran cufflinks on the merged read alignments with default options. Then the output assemblies with FPKM values estimated were categorized using cuffcompare, compared with the combined gene annotations. With the results, we evaluated the performance of different FPKM thresholds in classifying the complete and partial transcripts by R eceiver O perating C haracteristic (ROC)[39].

Calculating optimum FPKM threshold

The optimum FPKM threshold for classifying the complete and partial transcripts were calculated by training the FPKM values estimated from the experiment above. The index of the optimum FPKM threshold can be obtained by optimizing the sensitivity and specificity in classifying the complete and partial transcripts with formula 1.

i^{*} = \underset{i \in I}{arg min} \{\sqrt{{(1 - sensitivities [i])}^{2} + {(1 - specificities [i])}^{2}}\}

(1)

where i^∗represents the index of the optimum FPKM threshold. On the right of formula 1, sensitivities i and specificities i respectively denote the i th sensitivities and specificities, given an index i. The i is enumerated in I, ranging from 1 to the size of a FPKM threshold set. Then we can get the optimum FPKM threshold using formula 2.

t^{*} = T [i^{*}]

(2)

where t^∗ denotes the optimum FPKM threshold. The FPKM threshold set T were generated by pROC[39], given the FPKM values of the complete and partial transcripts.

lncRScan

To detect novel lncRNAs from a set of high-quality assemblies, a five-step program named l ong n on-c oding R NA S can (lncRScan) was designed (Figure2). Step 1 ‘extract_category’ is used to extract five candidate categories of transcripts, including ‘i’, ‘j’, ‘o’, ‘u’ and ‘x’, which may contain novel lncRNAs. Specifically, the ‘i’ category may contain the lncRNAs falling entirely within an intron of known genes. And the ‘j’ category may include alternative long non-coding isoforms of known genes as they share at least one spliced site with reference transcripts. The ‘u’ category may involve the intergenic lncRNAs (lincRNAs). The ‘o’ category may contain the lncRNAs having generic exonic overlap with a known transcript while the ‘x’ transcripts also have exonic overlap with reference but on the opposite strand. Therefore, the five categories defined here may include novel lncRNAs potentially. On the other hand, all categories of transcripts extracted have not been annotated by either of RefSeq, Ensembl and UCSC known genes, so the predicted lncRNAs can be ‘novel’. Step 2 ‘extract_length’ is used to extract the transcripts having long exonic length (> 200 nt) according to the lncRNA’s definition. Step 3 ‘extract_ORF’ is set to exclude the assemblies that have long (≥ 300 nt) putative ORF. Then steps 4 and 5 are used to exclude the transcripts of protein-coding potential. In Step 4 ‘extract_PhyloCSF’, Phylo genetic C odon S ubstitution F requency (PhyloCSF)[40] is recruited to filter the transcripts of protein-coding potential from an evolutionary view. Briefly, PhyloCSF conducts a comparative genomics method for classifying protein-coding and non-coding sequences[40]. Since the sequence alignments are required for running PhyloCSF, we used Galaxy[41–43] to ‘stitch’ 29 mammalian alignments according to the input transcripts. In Step 5 ‘extract_Pfam’, the amino acid sequences of the remaining transcripts are searched in Pfam[44] (both Pfam-A and Pfam-B) for comparing to known proteins or protein domains, and the transcripts with significant domain hits are excluded.

To evaluate the performance of lncRScan in identifying lncRNAs or filtering mRNAs, we ran the steps 3-5 of lncRScan on four datasets respectively. The first dataset (D-1) contains 1615 multi-exon RefSeq ncRNAs with length > 200nt and the second one (D-2) records 1615 mRNAs randomly sampled from 26368 RefSeq mRNAs. The other two datasets (D-3 and D-4) include 3230 and 4845 mRNAs sampled from the RefSeq mRNAs respectively. The numbers of the retained and filtered transcripts through the steps 3-5 of lncRScan are summarized in Table1. We can see that 771 (47.74%) lncRNAs of D-1 were retained after the steps 3-5. In contrast, most (99.6%-99.7%) of the mRNAs (D-2, D-3 and D-4) were filtered by the steps 3-5. The result indicates that the filters of lncRScan can dramatically reduce the number of mRNAs. Notably, the step 3 adopting the ORF threshold can filter a large proportion of mRNAs thereby alleviating the overload of PhyloCSF and Pfam calculation. However, some true lncRNAs were filtered through the pipeline, which made the final lncRNAs prediction much stringent.

Table 1 Numbers of retained and filtered transcripts through steps 3-5 of lncRScan

Full size table

In addition, lncRScan is available to the scientific community and it can be obtained by svn checkouthttp://lncrscan.googlecode.com/svn/trunk/lncrscan-read-only. Other details about lncRScan can be found onhttp://code.google.com/p/lncrscan/.

Differential expression analysis

The cuffdiff[22] program was performed to conduct differential expression (DE) tests between the wild-type (WT) and Klf1 knockout (Klf1 KO) samples (Figure3). The fold changes were calculated via $lo g_{2} \frac{FPKM WT}{FPKM Klf 1 KO}$ . A transcript will be reported DE significant if the test gives that the FDR-adjusted p-value after Benjamini-Hochberg correction[45] for multiple-testing represent statistical significant (q-value < 0.05)[46].

Comparisons of transcript length, exon number, ORF length and expression level

The novel lncRNAs we detected were compared to 26368 RefSeq protein-coding transcripts (‘NM’ prefix) and 2843 RefSeq non-coding transcripts (‘NR’ prefix) in terms of transcript length, exon number and ORF length. Since a real ncRNA does not have an ORF, a putative ORF of the ncRNA candidate is defined by the longest consecutive codon chain of the ncRNA candidate for comparing with the protein-coding genes. Moreover, for both of the WT and Klf1 KO conditions, we compared the quantified expression levels (FPKM) of the novel lncRNAs to that of the known protein-coding transcripts, which were extracted from the RefSeq and Ensembl gene annotations. The novel lncRNAs and protein-coding transcripts used for FPKM comparison all have enough expression levels (FPKM ≥ 2. 12).

Results

Initially assembled transcripts

We started our analysis with short read mapping (Figure1-a), and approximately 138 million reads were successfully mapped onto the mm9 genome (Table2). With the merged alignments of six replicates, 34053 multi-exon transcripts (26212 annotated, 701 contained by annotations and 7140 novel potentially) were assembled in total, compared with 88434 transcripts of the combined gene annotations. Then we obtained the categories of the initial assemblies by comparing to the combined gene annotations (Table3). It is notable that the initial assemblies include several categories of transcripts, e.g. transcripts that have complete match intron chain compared with known genes (‘=’ classcode) and those contained by known genes (‘c’ classcode). Of the initial assemblies, 26212 (76.97%) transcripts have been annotated by either of RefSeq, Ensembl and UCSC known genes.

Table 2 Read mapping summary

Full size table

Table 3 Categories of initial assemblies

Full size table

Filtering low-quality assemblies with optimum FPKM threshold

FPKM can unbiasedly represent quantified expression level of an assembled transcript, and it can be estimated by maximum likelihood estimation (MLE) under a statistical model of cufflinks[21], which also corrects sequencing biases[18] in the estimation. Figure4 shows the FPKM distributions[47] of the complete (‘=’ classcode) and partial (‘c’ classcode) transcripts assembled from the experiment of FPKM threshold (See Methods) while Figure5 shows the corresponding ROC curve. Notably, the complete transcripts represent much larger FPKM than the partial ones on average (∼29.67 vs ∼4.86, P < 2. 2 × 10⁻¹⁶, Welch Two Sample t-test). According to the significant difference of FPKM distributions of complete and partial assemblies, we calculated the optimum FPKM threshold (2.12) based on our data (See Methods). We assumed that the artificial transcripts represent either similar FPKM distribution to the partial transcripts or lower FPKM than the partial ones, thus the optimum threshold can be used to filter both of the partial assemblies and artefacts from the 7140 novel assemblies.

Identification of high-quality assemblies

We pooled a set of high-quality assemblies (Additional file2) for downstream analysis. The high-quality assemblies consist of two categories. One category contains the 26212 initial assemblies that completely match the combined gene annotations (‘=’ classcode). The other category refers to the 3288 transcripts extracted from the 7140 novel assemblies (!{‘=’,‘c’}), which satisfy the expression criterion (FPKM ≥ 2.12).

Novel mouse embryonic lncRNAs

We applied our newly developed lncRNAs detector lncRScan to the high-quality assemblies and detected 308 novel mouse embryonic lncRNAs (Additional file3). The novel lncRNAs were further classified into 5 categories by comparing with the known gene annotations (Table4). Specifically, 52 lncRNAs were assigned the ‘u’ classcode since they were located in the intergenic regions. And 26 lncRNAs with the ‘i’ classcode fall entirely within the intron of known genes. The other lncRNAs all have exon overlap with known genes. Specifically, 44 lncRNAs with the ‘o’ classcode have generic exonic overlap with known genes and 6 ‘x’ lncRNAs also have exonic overlap with known genes but on the opposite strand. The 180 lncRNAs with ‘j’ can be long non-coding isoforms of known genes. In addition, the 308 novel lncRNAs we predicted were compared with 36991 ones annotated by NONCODE 3.0[48]. Of the 308 novel lncRNAs, 5 (1.62%) ones have the same structure as NONCODE lncRNAs (Additional file1) and another 75 (24.35%) ones partially overlap the NONCODE lncRNAs (Figure6). By excluding the 80 lncRNAs that overlap the NONCODE annotation, we can get a more stringent set of novel lncRNAs.

Table 4 Categories of novel lncRNAs

Full size table

Novel lncRNAs have shorter transcript length, fewer exons and shorter putative ORF than protein-coding transcripts

Previous studies in mammals have shown that lncRNAs are shorter in length and fewer in exon number than are protein-coding transcripts[13, 14, 16]. To determine whether the embryonic lncRNAs we detected have the same features, we compared the 308 novel lncRNAs to not only 26368 protein-coding transcripts, but also 2843 known non-coding ones, annotated by RefSeq (See Methods). As shown in Figure7, the novel lncRNAs represent much shorter transcript length on average than either RefSeq protein-coding (∼1.2kb vs ∼3.1kb, P < 2. 2×10⁻¹⁶, Welch Two Sample t-test) or non-coding transcripts (∼1.2kb vs ∼1.9kb, P = 6. 027 × 10⁻¹⁴) while the lncRNAs also show fewer exons than either of the RefSeq protein-coding (∼2.8 vs ∼10.0, P < 2. 2 × 10⁻¹⁶) and non-coding transcripts (∼2.8 vs ∼3.3, P = 5. 096 × 10⁻⁸), agreed with a previous report[13]. In addition, we also compared the putative ORF lengths of the lncRNAs to that of the RefSeq genes (both protein-coding and non-coding). As a result, the novel lncRNAs represent shorter putative ORF length than either RefSeq protein-coding RNAs (∼0.17 kb vs ∼1.6 kb, P < 2. 2 × 10⁻¹⁶) or ncRNAs (∼0.17 kb vs ∼0.30 kb, P < 2. 2 × 10⁻¹⁶), consistent with a previous report on zebrafish embryonic lncRNAs[16]. Although the novel lncRNAs candidates are to be ncRNAs, they can differ from the RefSeq ncRNAs used for comparison in some features due to several reasons as follows. First, the RefSeq ncRNAs do not only include lncRNAs, but also other categories of ncRNAs, e.g. microRNAs and small nucleolar RNAs. Second, the lncRNAs can be further classified according to their biological functions, thus the features of different categories of lncRNAs may differ from each other. The lncRNAs we detected may not come from the same category as that annotated by RefSeq. Third, the unbalanced population sizes can affect the comparison between the two categories of ncRNAs. Last, the putative ORF length of the lncRNAs we predicted were limited (< 300 nt), which can affect the ORF comparison. Therefore it is reasonable to see that the two categories of ncRNAs represent slight statistical difference, which is far less than that between the mRNAs and ncRNAs.

Novel lincRNAs have lower expression level than protein-coding transcripts

Previous studies also showed that lncRNAs are expressed at significantly lower levels than are protein-coding transcripts[13, 14, 16]. To determine whether the embryonic lncRNAs we detected have the same expression feature, we compared the quantified expression levels (FPKM) of the 308 novel lncRNAs to that of the known protein-coding transcripts (Figure8). In the WT condition (Figure8-a), the protein-coding transcripts represents slightly higher expression than the novel lncRNAs on average (∼50.92 vs ∼44.54, P = 0. 554, Welch Two Sample t-test). Similarly, in the Klf1 KO condition (Figure8-b), the protein-coding transcripts also show slightly higher expression than the lncRNAs on average (∼37.63 vs ∼34.06, P = 0. 6986). The comparison result indicates that the total novel lncRNAs do not show significant lower expression than the protein-coding ones. Moreover, we extracted the 52 lincRNAs (‘u’ classcode) from the 308 lncRNAs for the expression comparison. The result manifests that the lincRNAs we predicted represents significant lower expression than the protein-coding ones in either WT or Klf1 KO condition (∼11.29 vs ∼50.93, P < 2. 2 × 10⁻¹⁶, and ∼9.38 vs ∼37.63, P < 2. 2 × 10⁻¹⁶, respectively).

Differentially expressed lncRNAs

Using cuffdiff, we conducted the differential expression (DE) tests between the WT and Klf1 KO samples for analysing the function of the novel lncRNAs. At the gene level (Figure9-a), Klf1 represents like an activator since more assembled genes are significantly repressed (334) after Klf1 is knocked out than the activated ones (250). At the transcript level (Figure9-b), Klf1 also behaves like an activator since more transcripts are significantly repressed (262) after Klf1 is knocked out than the activated ones (147). Moreover, we detected 13 (Additional file4) novel lncRNAs with DE significant. Notably, Klf1 still functions like an activator for the 13 lncRNAs (10 repressed vs 3 activated after Klf1 is knocked out, Figure9-c). Thus it is obvious that Klf1 can function as an activator globally, regulating the expression of a number of genes or transcripts including the lncRNAs we detected. The detailed categories of the 13 lncRNAs of DE significant can be seen from Table5.

Table 5 Categories of novel lncRNAs of differential expression significant

Full size table

However, cuffdiff does a length correction that has a tendency to inflate the FPKM counts for small transcripts, which can interfere the differential expression analysis. To alleviate this problem, we re-ran the DE tests with the “–no-effective-length-correction”parameter. As a result, we obtained the same results as that without the parameter, which represent the robustness of our predictions.

Discussion

RNA-Seq has been revolutionizing the transcriptome study as it can effectively capture the whole transcriptome of various cell types under different conditions. Here we predicted 308 novel mouse embryonic lncRNAs from the RNA-Seq data of WT and Klf1 KO samples using a computational pipeline. The novel lncRNAs we detected represent shorter transcript length, fewer exons and shorter putative ORF length, and the 52 lincRNAs of the lncRNAs show lower expression level, compared with known protein-coding transcripts. Moreover, we identified 13 differentially expressed novel lncRNAs, which may be regulated by Klf1 and play functional roles in the development of erythroid cells potentially. Notably, two lncRNAs (IDs: 2_00016377 and 2_00016378) we predicted represent almost the same structures as another two lncRNAs predicted by Tallack et al.[27] based on the same dataset. Specifically, most exons of 2_00016377 and 2_00016378 match that of their ‘lincred1-giant’ and ‘lincred1-dwarf’ lncRNAs respectively. The slight difference may be caused by both of the strategies of transcriptome reconstruction and program versions used. Despite of that, the differential expression of the two lncRNAs we detected can be explained by Tallack et al.’s validation using R eal-t ime Q uantitative PCR (qRT-PCR)[49] on their ‘lincred1’ lncRNAs.

On the other hand, our pipeline followed a similar strategy for predicting human lincRNAs[13], but we differ in three aspects. First, we used FPKM as a feature for filtering low quality assemblies instead of the read coverage[13] due to the fact that FPKM can unbiasedly represent the expression level of a transcript and the read coverage does not show better performance than FPKM in classifying the complete and partial transcripts assembled from our data (AUCs are equal). Second, we excluded the transcripts having long putative ORF length (≥300 nt), which was previously used by the FANTOM consortium[50]. This arbitrary cutoff makes our predictions more stringent, but it must omit the lncRNAs having long putative ORF (≥300 nt). Last, we detected several DE significant lncRNAs, which composed a subset of the total lncRNAs we detected and they are more worth being investigated by loss and gain of function studies than the other novel lncRNAs in our scenario. Consequently, our computational methods can effectively alleviate further experimental work for studying the lncRNAs that may participate in the development of erythroid cells.

Although our method presented its ability in detecting novel lncRNA candidates, its prediction accuracy can be improved from several aspects, such as using more reliable reads generated by high-quality deep sequencing, paired-end sequencing and strand-specific sequencing. And recent single-molecule sequencing technologies can provide more unbiased ways to capture the transcriptome[51]. The sensitivity of transcriptome reconstruction can also be improved by using various strategies, such as integrating assembly results from Scripture[14]. In addition, the novel lncRNAs predicted from our computational pipeline should be validated by biological experiments, such as cloning and PCR-based techniques[22] as several ones have been tested in the original study by Tallack et al.[27]. Furthermore, additional genetic and/or epigenetic data sources, e.g. C hromatin I mmunoprecipitation-Seq uencing (ChIP-Seq) on chromatin signatures, can be valuable sources providing useful information for characterizing functions of the novel lncRNAs. And the loss and gain of function studies can be conducted for exploring regulatory mechanisms of the lncRNAs.

Conclusions

We predicted a set of novel lncRNAs using our computational pipeline from the RNA-Seq data of Klf1 knockout study, and the DE significant lncRNAs are worth being further studied with regard to their biological functions.

References

Mercer TR, Dinger ME, Mattick JS: Long non-coding RNAs: insights into functions. Nat Rev Genet 2009, 10(3):155–159. [10.1038/nrg2521] [10.1038/nrg2521] 10.1038/nrg2521
Article CAS PubMed Google Scholar
Amaral PP, Dinger ME, Mercer TR, Mattick JS: The Eukaryotic Genome as an RNA Machine. Science 2008, 319(5871):1787–1789. 10.1126/science.1155472
Article CAS PubMed Google Scholar
Baker M: Long noncoding RNAs: the search for function. Nat Meth 2011, 8(5):379–383. [10.1038/nmeth0511–379] [10.1038/nmeth0511-379] 10.1038/nmeth0511-379
Article CAS Google Scholar
Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, Stadler PF, Hertel J, Hackermüller J, Hofacker IL, Bell I, Cheung E, Drenkow J, Dumais E, Patel S, Helt G, Ganesh M, Ghosh S, Piccolboni A, Sementchenko V, Tammana H, Gingeras TR: RNA Maps Reveal New RNA Classes and a Possible Function for Pervasive Transcription. Science 2007, 316(5830):1484–1488. 10.1126/science.1138341
Article CAS PubMed Google Scholar
Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL, Tongprasit W, Samanta M, Weissman S, Gerstein M, Snyder M: Global Identification of Human Transcribed Sequences with Genome Tiling Arrays. Science 2004, 306(5705):2242–2246. 10.1126/science.1103388
Article CAS PubMed Google Scholar
Rinn JL, Kertesz M, Wang JK, Squazzo SL, Xu X, Brugmann SA, Goodnough LH, Helms JA, Farnham PJ, Segal E, Chang HY: Functional Demarcation of Active and Silent Chromatin Domains in Human HOX Loci by Noncoding RNAs. Cell 2007, 129(7):1311–1323. 10.1016/j.cell.2007.05.022
Article PubMed Central CAS PubMed Google Scholar
Guttman M, Donaghey J, Carey BW, Garber M, Grenier JK, Munson G, Young G, Lucas AB, Ach R, Bruhn L, Yang X, Amit I, Meissner A, Regev A, Rinn JL, Root DE, Lander ES: lincRNAs act in the circuitry controlling pluripotency and differentiation. Nature 2011, 477(7364):295–300. [10.1038/nature10398] [10.1038/nature10398] 10.1038/nature10398
Article PubMed Central CAS PubMed Google Scholar
Ng SY, Johnson R, Stanton LW: Human long non-coding RNAs promote pluripotency and neuronal differentiation by association with chromatin modifiers and transcription factors. EMBO J 2012, 31(3):522–533. [10.1038/emboj.2011.459] [10.1038/emboj.2011.459]
Article PubMed Central CAS PubMed Google Scholar
Rinn JL, Chang HY: Genome Regulation by Long Noncoding RNAs. Annu Rev Biochem 2012, 81: 145–166. 10.1146/annurev-biochem-051410-092902
Article CAS PubMed Google Scholar
Mitra SA, Mitra AP, Triche TJ: A Central Role for Long Non-coding RNA in Cancer. Frontiers in Genet 2012., 3(17):
Guttman M, Rinn JL: Modular regulatory principles of large non-coding RNAs. Nature 2012, 482(7385):339–346. [10.1038/nature10887] [10.1038/nature10887] 10.1038/nature10887
Article PubMed Central CAS PubMed Google Scholar
Bernstein E, Allis CD: RNA meets chromatin. Genes & Dev 2005, 19(14):1635–1655. 10.1101/gad.1324305
Article CAS Google Scholar
Cabili MN, Trapnell C, Goff L, Koziol M, Tazon-Vega B, Regev A, Rinn JL: Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes & Dev 2011, 25(18):1915–1927. 10.1101/gad.17446611
Article CAS Google Scholar
Guttman M, Garber M, Levin JZ, Donaghey J, Robinson J, Adiconis X, Fan L, Koziol MJ, Gnirke A, Nusbaum C, Rinn JL, Lander ES, Regev A: Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotech 2010, 28(5):503–510. [10.1038/nbt.1633] [10.1038/nbt.1633] 10.1038/nbt.1633
Article CAS Google Scholar
Nagano T, Fraser P: No-Nonsense Functions for Long Noncoding RNAs. Cell 2011, 145(2):178–181. 10.1016/j.cell.2011.03.014
Article CAS PubMed Google Scholar
Pauli A, Valen E, Lin MF, Garber M, Vastenhouw NL, Levin JZ, Fan L, Sandelin A, Rinn JL, Regev A, Schier AF: Systematic identification of long noncoding RNAs expressed during zebrafish embryogenesis. Genome Res 2012, 22(3):577–591. 10.1101/gr.133009.111
Article PubMed Central CAS PubMed Google Scholar
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat Methods 2008, 5: 621–628. [10.1038/nmeth.1226] [10.1038/nmeth.1226] 10.1038/nmeth.1226
Article CAS PubMed Google Scholar
Roberts A, Trapnell C, Donaghey J, Rinn J, Pachter L: Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol 2011, 12(3):R22. 10.1186/gb-2011-12-3-r22
Article PubMed Central CAS PubMed Google Scholar
Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 2009, 25(9):1105–1111. 10.1093/bioinformatics/btp120
Article PubMed Central CAS PubMed Google Scholar
Kozarewa I, Ning Z, Quail M, Sanders M, Berriman M, Turner D: Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nat Methods 2009, 6(4):291–295. 10.1038/nmeth.1311
Article PubMed Central CAS PubMed Google Scholar
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotech 2010, 28(5):511–515. [10.1038/nbt.1621] [10.1038/nbt.1621] 10.1038/nbt.1621
Article CAS Google Scholar
Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 2012, 7(3):562–578. [10.1038/nprot.2012.016] [10.1038/nprot.2012.016] 10.1038/nprot.2012.016
Article PubMed Central CAS PubMed Google Scholar
Garber M, Grabherr MG, Guttman M, Trapnell C: Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Meth 2011, 8(6):469–477. [10.1038/nmeth.1613] [10.1038/nmeth.1613] 10.1038/nmeth.1613
Article CAS Google Scholar
Nakaya H, Amaral P, Louro R, Lopes A, Fachel A, Moreira Y, El-Jundi T, da Silva A, Reis E, Verjovski-Almeida S: Genome mapping and expression analyses of human intronic noncoding RNAs reveal tissue-specific patterns and enrichment in genes related to regulation of transcription. Genome Biol 2007, 8(3):R43. 10.1186/gb-2007-8-3-r43
Article PubMed Central PubMed Google Scholar
Okazaki Y, Furuno M, Kasukawa T, Adachi J, Bono H: Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 2002, 420(6915):563–573. [10.1038/nature01266] [10.1038/nature01266] 10.1038/nature01266
Article PubMed Google Scholar
Dinger ME, Pang KC, Mercer TR, Mattick JS: Differentiating Protein-Coding and Noncoding RNA: Challenges and Ambiguities. PLoS Comput Biol 2008, 4(11):e1000176. 10.1371/journal.pcbi.1000176
Article PubMed Central PubMed Google Scholar
Tallack MR, Magor GW, Dartigues B, Sun L, Huang S, Fittock JM, Fry SV, Glazov EA, Bailey TL, Perkins AC: Novel roles for KLF1 in erythropoiesis revealed by mRNA-seq. Genome Research 2012.
Google Scholar
Miller IJ, Bieker JJ: A novel, erythroid cell-specific murine transcription factor that binds to the CACCC element and is related to the Krüppel family of nuclear proteins. Mol Cell Biol 1993, 13(5):2776–2786.
Article PubMed Central CAS PubMed Google Scholar
Perkins AC, Sharpe AH, Orkin SH: Lethal [beta]-thalassaemia in mice lacking the erythroid CACCC-transcription factor EKLF. Nature 1995, 375(6529):318–322. [10.1038/375318a0] [10.1038/375318a0] 10.1038/375318a0
Article CAS PubMed Google Scholar
Gene Expression Omnibus (GEO) [http://www.ncbi.nlm.nih.gov/geo/] []
Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009, 10(3):R25. 10.1186/gb-2009-10-3-r25
Article PubMed Central PubMed Google Scholar
Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Huminiecki L, Kasprzyk A, Lehvaslaiho H, Lijnzaad P, Melsopp C, Mongin E, Pettett R, Pocock M, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A, et al.: The Ensembl genome database project. Nucleic Acids Research 2002, 30: 38–41. 10.1093/nar/30.1.38
Article PubMed Central CAS PubMed Google Scholar
Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2007, 35(suppl 1):D61-D65.
Article PubMed Central CAS PubMed Google Scholar
Illumina iGenomes [http://cufflinks.cbcb.umd.edu/manual.html] []
Hsu F, Kent WJ, Clawson H, Kuhn RM, Diekhans M, Haussler D: The UCSC Known Genes. Bioinformatics 2006, 22(9):1036–1046. 10.1093/bioinformatics/btl048
Article CAS PubMed Google Scholar
UCSC table browser [http://genome.ucsc.edu/cgi-bin/hgTables?command=start] []
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Subgroup GPDP: The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009, 25(16):2078–2079. 10.1093/bioinformatics/btp352
Article PubMed Central PubMed Google Scholar
Roberts A, Pimentel H, Trapnell C, Pachter L: Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics 2011, 27(17):2325–2329. 10.1093/bioinformatics/btr355
Article CAS PubMed Google Scholar
Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, Müller M: pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 2011, 12: 77. 10.1186/1471-2105-12-77
Article PubMed Central PubMed Google Scholar
Lin MF, Jungreis I, Kellis M: PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics 2011, 27(13):i275-i282. 10.1093/bioinformatics/btr209
Article PubMed Central CAS PubMed Google Scholar
Goecks J, Nekrutenko A, Taylor J, Team TG: Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biology 2010, 11(8):R86. 10.1186/gb-2010-11-8-r86
Article PubMed Central PubMed Google Scholar
Blankenberg D, Taylor J, Nekrutenko A, Team TG: Making whole genome multiple alignments usable for biologists. Bioinformatics 2011, 27(17):2426–2428. 10.1093/bioinformatics/btr398
Article PubMed Central CAS PubMed Google Scholar
Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A: Galaxy: A platform for interactive large-scale genome analysis. Genome Res 2005, 15(10):1451–1455. 10.1101/gr.4086505
Article PubMed Central CAS PubMed Google Scholar
Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer ELL, Bateman A: The Pfam protein families database. Nucleic Acids Res 2008, 36(suppl 1):D281-D288.
PubMed Central CAS PubMed Google Scholar
Benjamini Y: Controlling the false discovery rate : a practical and powerful approach to multiple testing. J Royal Stat Soc 1995, 57: 289–300.
Google Scholar
Cufflinks manual [http://cufflinks.cbcb.umd.edu/igenomes.html] []
Wickham H: ggplot2: elegant graphics for data analysis. New York: Springer; 2009. [http://had.co.nz/ggplot2/book] []
Book Google Scholar
Bu D, Yu K, Sun S, Xie C, Skogerbø G, Miao R, Xiao H, Liao Q, Luo H, Zhao G, Zhao H, Liu Z, Liu C, Chen R, Zhao Y: NONCODE v3.0: integrative annotation of long noncoding RNAs. Nucleic Acids Res 2012, 40(D1):D210-D215. 10.1093/nar/gkr1175
Article PubMed Central CAS PubMed Google Scholar
Livak KJ, Schmittgen TD: Analysis of Relative Gene Expression Data Using Real-Time Quantitative PCR and the 2−ΔΔCT Method. 2001, 25(4):402–408.
Google Scholar
Consortium TF, Carninci P, Kasukawa T, Katayama S, Gough J, Frith MC, Maeda N, Oyama R, Ravasi T, Lenhard B, Wells C, Kodzius R, Shimokawa K, Bajic VB, Brenner SE, Batalov S, Forrest ARR, Zavolan M, Davis MJ, Wilming LG, Aidinis V, Allen JE, Ambesi-Impiombato A, Apweiler R, Aturaliya RN, Bailey TL, Bansal M, Baxter L, Beisel KW, Bersano T, et al.: The Transcriptional Landscape of the Mammalian Genome. Science 2005, 309(5740):1559–1563. [(Genome Network Project Core Group)] [(Genome Network Project Core Group)]
Article Google Scholar
Sam LT, Lipson D, Raz T, Cao X, Thompson J, Milos PM, Robinson D, Chinnaiyan AM, Kumar-Sinha C, Maher CA: A Comparison of Single Molecule and Amplification Based Sequencing of Cancer Transcriptomes. PLoS ONE 2011, 6(3):e17305. 10.1371/journal.pone.0017305
Article PubMed Central CAS PubMed Google Scholar

Download references

Acknowledgements

The authors would like to acknowledge Fabian A. Buske and other members of the Bailey’s group at IMB of the University of Queensland for helpful discussions. The authors would like to thank the reviewers giving helpful suggestions on the paper. This work was supported by China Postdoctoral Science Foundation 2012M511335 and 2012M511336, The Central Special Fund for Operating Expenses of College Basic Research 2010QNA47 and 2010QNA50 and Fok Ying-Tung Education Foundation for Young Teachers (121066). LS was partially supported by China Scholarship Council (CSC) scholarship.

Author information

Authors and Affiliations

School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou, 221008, JiangSu, PR China
Lei Sun, Zhao Xu & Hui Liu
Center for Computational Biology, and Laboratory of Disease Genomics and Personalized Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, No.7 Beitucheng West Road, Chaoyang District, Beijing, 100029, PR China
Lei Sun & Zhihua Zhang
Institute for Molecular Bioscience, The University of Queensland, Brisbane, 4072, Queensland, Australia
Lei Sun & Timothy L Bailey
Mater Medical Research Institute, Mater Hospital, Brisbane, 4101, Queensland, Australia
Andrew C Perkins & Michael R Tallack

Authors

Lei Sun
View author publications
You can also search for this author in PubMed Google Scholar
Zhihua Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Timothy L Bailey
View author publications
You can also search for this author in PubMed Google Scholar
Andrew C Perkins
View author publications
You can also search for this author in PubMed Google Scholar
Michael R Tallack
View author publications
You can also search for this author in PubMed Google Scholar
Zhao Xu
View author publications
You can also search for this author in PubMed Google Scholar
Hui Liu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Hui Liu.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

LS designed and implemented the method. ACP and MRT provided the raw datasets. LS carried out the experiments and analysis. LS, ZZ, TLB, ZX and HL wrote the paper. All authors read and approved the final manuscript.

Electronic supplementary material

Additional file 1:Supplementary materials. It contains supplementary materials supporting the main text. (ZIP 332 KB)

12859_2012_5681_MOESM2_ESM.ZIP

Additional file 2:GTF of high-quality assembled transcripts. It is a GTF file recording the 28963 high-quality assemblies, which were used in the differential expression tests. And their structures can be visualized by UCSC genome browser. (ZIP 4 MB)

12859_2012_5681_MOESM3_ESM.ZIP

Additional file 3:GTF of novel lncRNAs. It is a GTF file recording the 308 novel lncRNAts detected by lncRScan. The transcript structures can be visualized by UCSC genome browser. (ZIP 19 KB)

12859_2012_5681_MOESM4_ESM.ZIP

Additional file 4:GTF of differentially expressed lncRNAs. It is a GTF file recording the 13 differentially expressed lncRNAs between the WT and Klf1 KO samples. The transcript structures can be visualized by UCSC genome browser. (ZIP 1 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Authors’ original file for figure 7

Authors’ original file for figure 8

Authors’ original file for figure 9

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Sun, L., Zhang, Z., Bailey, T.L. et al. Prediction of novel long non-coding RNAs based on RNA-Seq data of mouse Klf1 knockout study. BMC Bioinformatics 13, 331 (2012). https://doi.org/10.1186/1471-2105-13-331

Download citation

Received: 10 July 2012
Accepted: 04 December 2012
Published: 13 December 2012
DOI: https://doi.org/10.1186/1471-2105-13-331

Prediction of novel long non-coding RNAs based on RNA-Seq data of mouse Klf1 knockout study

Abstract

Background

Results

Conclusions

Background

Methods

Datasets

Pipeline for predicting novel lncRNAs

Initial assembly

Novel lncRNAs detection

Combined gene annotations of RefSeq, Ensembl and UCSC mouse known genes

FPKM threshold for classifying complete and partial transcripts

Calculating optimum FPKM threshold

lncRScan

Differential expression analysis

Comparisons of transcript length, exon number, ORF length and expression level

Results

Initially assembled transcripts

Filtering low-quality assemblies with optimum FPKM threshold

Identification of high-quality assemblies

Novel mouse embryonic lncRNAs

Novel lncRNAs have shorter transcript length, fewer exons and shorter putative ORF than protein-coding transcripts

Novel lincRNAs have lower expression level than protein-coding transcripts

Differentially expressed lncRNAs

Discussion

Conclusions

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ contributions

Electronic supplementary material

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us