In the following, we first compare the methods in terms of their capability in isoform detection, and then in terms of their accuracy in isoform estimation. We stress that the goal of the comparison is not to produce a ranking of the considered methods, but to underline global positive aspects, common weaknesses and open problems that might lead to over-optimistic conclusions about the performance of current methodology.
Isoform detection
Here, we illustrate the results in terms of recall, precision and F-measure, considering the following effects: type of alignment, mode of action, type of annotation, type of library, abundance level, read length and sequencing depth. In order to investigate such effects, the same figures have to be inspected several times, evaluating a different aspect each time. To facilitate such comparisons, we first describe the general structure of the figures, and then focus the attention on specific comparisons.
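As a reminder of how the three indexes relate to the TP/FP/FN counts used throughout this section, the following minimal Python sketch computes them; the toy counts are invented purely for illustration and do not come from our results.

```python
def detection_metrics(tp, fp, fn):
    """Compute precision, recall and F-measure from raw counts.

    tp: isoforms correctly detected (true positives)
    fp: isoforms reported but not truly expressed (false positives)
    fn: truly expressed isoforms that were missed (false negatives)
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) > 0 else 0.0)
    return precision, recall, f_measure

# Toy example: 800 isoforms detected correctly, 500 spurious calls, 200 missed.
print(detection_metrics(tp=800, fp=500, fn=200))  # (0.615..., 0.8, 0.695...)
```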
Figures 2, 3, 4 and 5 illustrate the results for precision and recall obtained in Set-up 1 for the 100 bp-PE, 75 bp-PE, 50 bp-PE and 100 bp-SE libraries, respectively. For each of these cases, recall is further broken down with respect to the abundance level of the true isoforms, and the results are reported in Figures 6, 7, 8 and 9 in the same order. Finally, the F-measure is illustrated in Additional file 1: Figure S1, Additional file 2: Figure S2, Additional file 3: Figure S3 and Additional file 4: Figure S4, in the same order for each of the four cases.
In Figures 2, 3, 4 and 5 results are depicted in four panels: A (upper left), B (upper right), C (bottom left) and D (bottom right). Panels A and B refer to the annotation driven alignment; panels C and D to the data driven alignment. Precision is depicted in panels A and C, recall in panels B and D. Within each panel, the plot is divided into six sub-blocks according to the mode of action and the type of annotation. In particular, the left blocks (Mode 1) contain RSEM, Cufflinks with the -G option turned on, CEM with the -forceref option turned on, and SLIDE with -mode estimation. The central blocks (Mode 2) contain iReckon, Cufflinks with the -g option turned on, CEM with the -forceref option turned off, and SLIDE with -mode discovery. Finally, the right blocks (Mode 3) contain only Cufflinks and CEM with all default options. Results obtained using CA/IA are depicted in the first/second row of each panel, respectively. In Set-up 1, different bars for the same method and mode of action correspond to the different sequencing depths (0.25M, 0.5M, 1M, 5M, 10M and 20M, respectively). Dashed horizontal lines are added to facilitate comparisons among different cases. Since RSEM does not depend on the alignment strategy, for comparative purposes panels A and C report the same precision, and panels B and D the same recall, for the same type of annotation. Analogously, within each panel, methods in Mode 3 show the same precision and recall in the upper and bottom rows, since they do not depend on the provided annotation when the alignment is data driven.
Figures 6, 7, 8 and 9 have a similar organisation. In particular, CA is used in panels A and C, IA in panels B and D. Within each panel, the plot is divided into nine sub-blocks. Along the horizontal direction the organisation is analogous to that of Figures 2, 3, 4 and 5; along the vertical direction, recall is reported for the low, medium and high abundance classes, respectively.
Additional file 1: Figure S1, Additional file 2: Figure S2, Additional file 3: Figure S3 and Additional file 4: Figure S4 are also organized in four panels and illustrate the F-measure with respect to the sequencing depth. Panels A and B refer to the annotation driven alignment; panels C and D to the data driven alignment. CA is used in panels A and C, IA in panels B and D. To better distinguish the different modes of action, results in Mode 1 are shown as continuous lines, those in Mode 2 as dashed lines, and those in Mode 3 as dotted lines.
For the sake of completeness, we also provide Additional file 5: Figure S5, Additional file 6: Figure S6, Additional file 7: Figure S7 and Additional file 8: Figure S8, showing the number of TP and FP obtained under Set-up 1 for the most extreme conditions, PE vs SE and 20M vs 0.25M. The figures are again organized in four panels, where panels A and B refer to the annotation driven alignment and panels C and D to the data driven alignment. CA is used in panels A and C, IA in panels B and D. Within each panel, the right bars (depicted in coral) show the number of FP and the left ones (depicted in aquamarine) the number of TP. A horizontal dashed line representing the number of truly expressed isoforms is added. Therefore, the difference between this line and the number of TP represents the number of FN.
To provide better insight into the capability of methods in Mode 2 (with IA) to identify those isoforms that are not provided in the annotation, Figures 10 and 11 show the number of TP (with the annotation driven alignment and with the data driven alignment, respectively). In these figures, the number of TP is divided into those already present in IA (denoted IA) and those not contained in IA (denoted No IA). The latter are further divided according to the true expression level. Finally, Figure 12 illustrates the performance of Cufflinks and CEM (Mode 1) when a suitable threshold is applied to set to zero isoforms with very low estimated FPKM. The figure compares precision, recall and F-measure with the corresponding indexes observed for the same methods in Mode 2.
Additional file 9: Figure S9, Additional file 10: Figure S10, Additional file 11: Figure S11, Additional file 12: Figure S12, Additional file 13: Figure S13 and Additional file 14: Figure S14 show results for Set-up 2. In particular, Additional file 9: Figure S9, Additional file 10: Figure S10, Additional file 11: Figure S11 and Additional file 12: Figure S12 are devoted to precision and recall. Additional file 13: Figure S13 is analogous to Figure 12, and Additional file 14: Figure S14 is analogous to Additional file 5: Figure S5, Additional file 6: Figure S6, Additional file 7: Figure S7 and Additional file 8: Figure S8.
All methods have been evaluated under the two set-ups, except SLIDE, which due to its high computational cost has been evaluated only under Set-up 1, and iReckon, which was evaluated only for PE reads, since it does not support SE reads. In Figures 1, 2, 3, 4, 5, 6, 7, 8 and 9, Additional file 1: Figure S1, Additional file 2: Figure S2, Additional file 3: Figure S3 and Additional file 4: Figure S4, and Additional file 9: Figure S9, Additional file 10: Figure S10, Additional file 11: Figure S11 and Additional file 12: Figure S12, Cufflinks is coloured in red, CEM in blue, SLIDE in orange (when present), RSEM in green and iReckon in brown (when present).
RSEM shows relatively good performance when CA is provided, but its inference is limited to annotated transcripts (or to a list of potential transcripts that the user has to provide). On the other hand, SLIDE shows worse performance than the others (in particular in Mode 2), suggesting that it could be better suited for genomes with lower complexity in terms of isoform structures.
Overall, Set-ups 1 and 2 show similar qualitative results. The best F-measure was about 0.78 in Set-up 1 and 0.75 in Set-up 2. In both cases, the best performance was achieved by RSEM in the CA case.
Looking at Figures 2, 3, 4 and 5, precision rarely exceeds 0.6–0.7 and is often below such values, meaning that all methods produce a rather large number of FP isoforms (remarkably, even when the true CA is provided and the methods are forced to work in Mode 1). This drawback is also confirmed by inspection of Additional file 5: Figure S5, Additional file 6: Figure S6, Additional file 7: Figure S7 and Additional file 8: Figure S8, where we often observe a high number of FP (compared to TP) for the same set-up. These poor results can be due to an insufficiently strong penalty term (or an insufficiently strong post-filtering step) that should set non-significant isoforms to zero.
The best precision is achieved by Cufflinks in Mode 2 with CA (at the price of lower power). Surprisingly, Cufflinks (Mode 1) shows worse precision. However, we noticed that most of the isoforms detected as present by Cufflinks (Mode 1) were estimated with extremely low expression. Therefore, if we filter out these isoforms, precision in Mode 1 and Mode 2 becomes comparable (see Figure 12, left panel), as expected. For example, by setting to zero all isoforms whose estimated expression falls below the chosen threshold, the Cufflinks (Mode 1) precision increases to 0.59, without significantly reducing the recall (analogous behaviour with 10^-1 as threshold). Indeed, the F-measure of Cufflinks (Mode 1) with a threshold on the expressed isoforms becomes better than that of Cufflinks (Mode 2), as expected. The tendency to produce a high number of FP with very low expression level was observed also for CEM (Mode 1), but with a much less pronounced effect (see Figure 12, right panel). The same effect occurs for the other conditions of depth, read length and library type (data not shown).
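As an illustration of this kind of post-filtering step (a minimal sketch, not the exact procedure of our pipeline), the code below sets to zero all isoforms whose estimated FPKM falls below a user-chosen threshold and recomputes precision and recall; the threshold value, file name and column names are placeholders modelled on a Cufflinks-style isoform table.

```python
import csv

def filter_low_fpkm(estimates, threshold=1e-1):
    """Treat isoforms with estimated FPKM below `threshold` as not detected.

    `estimates` maps transcript_id -> estimated FPKM.
    Returns the set of transcript ids that survive the filter.
    """
    return {tx for tx, fpkm in estimates.items() if fpkm >= threshold}

def precision_recall(detected, truth):
    """`detected` and `truth` are sets of transcript ids."""
    tp = len(detected & truth)
    fp = len(detected - truth)
    fn = len(truth - detected)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical usage with a tab-separated isoform table
# (columns assumed: tracking_id, FPKM; path is a placeholder).
with open("isoforms.fpkm_tracking") as handle:
    reader = csv.DictReader(handle, delimiter="\t")
    estimates = {row["tracking_id"]: float(row["FPKM"]) for row in reader}

truth = set()  # to be filled with the ids of the truly expressed isoforms
detected = filter_low_fpkm(estimates, threshold=1e-1)
print(precision_recall(detected, truth))
```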
Overall recall is above 0.80 when methods are used in Mode 1 with CA (and above 0.6 with IA), while recall does not exceed 0.3 in Mode 3. Recalls in Mode 1 are mostly satisfactory (except for SLIDE). The actual observed recall values for Mode 2 and Mode 3 depend on the stringency of the match between newly identified isoforms and existing ones. However, the low recall values in Mode 3 show that the performance of methods in recovering the precise isoform structure is still not satisfactory.
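The stringency of the match matters because a predicted transcript can be counted as a TP under different rules. One common, fairly strict rule (used, for instance, by Cuffcompare-style comparisons) is to require that the intron chain of the prediction coincides exactly with that of an annotated transcript; the sketch below illustrates this rule with our own representation and helper names, which need not coincide with the matching procedure actually adopted in our evaluation.

```python
def intron_chain(exons):
    """Exons are (start, end) pairs; the intron chain is the ordered tuple of
    internal boundaries, which ignores the outer transcript ends."""
    exons = sorted(exons)
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))

def match_by_intron_chain(predicted, reference):
    """Count a predicted multi-exon transcript as TP if its intron chain
    equals the intron chain of some reference transcript on the same strand.

    `predicted` and `reference` map transcript_id -> (chrom, strand, [exons]).
    """
    ref_chains = {(c, s, intron_chain(e)) for c, s, e in reference.values()}
    return sum(
        1
        for c, s, e in predicted.values()
        if len(e) > 1 and (c, s, intron_chain(e)) in ref_chains
    )

# Toy example: the prediction has different transcript ends but the same introns.
reference = {"ENST_A": ("chr1", "+", [(100, 200), (300, 400), (500, 600)])}
predicted = {"TCONS_1": ("chr1", "+", [(90, 200), (300, 400), (500, 650)])}
print(match_by_intron_chain(predicted, reference))  # 1 (same intron chain)
```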
Figures 6, 7 and 8 for PE and Figure 9 for SE datasets illustrate how well methods are able to detect highly expressed isoforms (remarkably, recall is mostly higher than 0.50 even for methods in Mode 3). At the same time, they underline that the major problem arises in identifying lowly expressed isoforms (recall rarely exceeds 0.2 in either Mode 2 or Mode 3). In particular, we observe that lowly expressed isoforms are identified mainly in Mode 1 with CA. Methods in Mode 2 with IA are able to identify lowly expressed isoforms mainly if they were already present in IA. On the contrary, for moderately and highly expressed isoforms they are capable of detecting isoforms not provided in IA, see Figures 10 and 11. The figures show good results for Cufflinks and iReckon for the medium and high expression classes, at high depth and regardless of the alignment type.
The performance clearly drops when the depth decreases. In that case, it becomes almost impossible to recover isoforms that were not provided in the annotation at the beginning.
As mentioned above, Set-up 2 brings us to analogous overall considerations, see Additional file 9: Figure S9, Additional file 10: Figure S10, Additional file 11: Figure S11, Additional file 12: Figure S12, Additional file 13: Figure S13 and Additional file 14: Figure S14.
Effect of alignment
In order to evaluate the effect of the alignment, we compared results obtained by the same method and mode of action across the different alignment strategies, i.e., we investigated its effect on precision by comparing (in Figures 2, 3, 4 and 5) panel A with panel C, and on recall by comparing panel B with panel D, for Set-up 1. To evaluate the global effect on the F-measure, it is sufficient to carry out the analogous comparison for Additional file 1: Figure S1, Additional file 2: Figure S2, Additional file 3: Figure S3 and Additional file 4: Figure S4. Clearly, RSEM is not affected by the alignment, since it maps the reads directly to the transcriptome. Therefore, its results differ only with respect to CA and IA within panels A and B.
Overall, the comparisons show no appreciable difference in precision for Mode 1, a slight difference in Mode 2, where precision increases with the quality of the mapping, and more remarkable improvements in Mode 3. Analogous results can be observed for recall. Moreover, the differences in both precision and recall decrease when the sequencing depth increases. As a global measure, we observe that the F-measure for 20M 100 bp-PE is about 0.35 and 0.27, respectively, for Cufflinks and CEM in Mode 3 with reads aligned without any annotation; it becomes about 0.40 and 0.34, respectively, when the reads are aligned providing either IA or CA, see Additional file 1: Figure S1. To explain this effect, we inspected the alignment files and observed that the main differences are in the number of mapped junctions. As an example, for the 20M 100 bp-PE dataset 14500 junctions were identified when CA was provided and 14479 junctions without annotation, a negligible loss due to the alignment. By contrast, with 0.25M 100 bp-PE the number of mapped junctions was 9173 with CA and dropped to 8047 without annotation. Analogously, for 20M 50 bp-PE, 14290 junctions were detected when CA was provided and only 12643 without annotation; with 0.25M 50 bp-PE, the number of mapped junctions was 8241 with CA and dropped to 5990 without annotation. The improved performance in Mode 3 can be explained by the fact that data driven methods benefit substantially from the presence of informative junctions, even though the current performance cannot be considered satisfactory overall.
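Junction counts of this kind can be reproduced, in spirit, by extracting the distinct splice junctions observed in the aligned reads; the sketch below uses pysam on a BAM file produced by a spliced aligner (paths are placeholders, and the aligner's own junction report may apply different counting rules).

```python
import pysam

def spliced_junctions(bam_path):
    """Collect the distinct splice junctions observed in a BAM file.

    A junction is taken as (reference_name, donor_end, acceptor_start),
    derived from the 'N' (reference-skip) operations in the CIGAR string.
    """
    junctions = set()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_unmapped or read.cigartuples is None:
                continue
            pos = read.reference_start
            for op, length in read.cigartuples:
                if op == 3:  # 'N': skipped region on the reference (intron)
                    junctions.add((read.reference_name, pos, pos + length))
                if op in (0, 2, 3, 7, 8):  # operations that consume the reference
                    pos += length
    return junctions

# Hypothetical usage: compare the junction counts of two alignments.
# print(len(spliced_junctions("aligned_CA.bam")))
# print(len(spliced_junctions("aligned_noannotation.bam")))
```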
The same conclusions apply to Set-up 2, comparing Additional file 9: Figure S9 and Additional file 10: Figure S10. In particular, in Mode 3 the F-measure increases from 0.18 for both Cufflinks and CEM in the case of data driven alignment up to 0.24 and 0.31, respectively, in the case of CA-based alignment (data not shown).
Effect of modes of action
In order to evaluate the effect of the modes of action, results have to be compared across the horizontal blocks of each panel of each figure, see Figures 2, 3, 4, 5, 6, 7, 8 and 9 for Set-up 1 and Additional file 9: Figure S9, Additional file 10: Figure S10, Additional file 11: Figure S11 and Additional file 12: Figure S12 for Set-up 2. Moreover, we have to compare the continuous lines with the dashed and dotted lines in Additional file 1: Figure S1, Additional file 2: Figure S2, Additional file 3: Figure S3 and Additional file 4: Figure S4.
As expected, all methods in Mode 1 have better global performance than those in Mode 2 and perform significantly better than the methods in Mode 3. The latter remain a major challenge. Moreover, the performance of methods in Mode 1 is much less affected by the depth and the read length, while still benefiting from PE reads when the coverage is not sufficiently high.
As previously mentioned, Figures 2, 3, 4 and 5 show that precision cannot be considered fully satisfactory and the number of FP is high across all modes of action. In particular, Cufflinks (Mode 1) still shows poor precision. However, in this case we already noted that most of the FP isoforms were estimated with a very low expression value, and filtering out the isoforms with very low estimated expression restores the performance, see Figure 12. From Figures 2, 3, 4 and 5, we observe that all methods in Mode 1 using CA have a very good recall, independently of the alignment option. Methods in Mode 2 reach good/sufficient recall, except for SLIDE, which shows the worst behaviour. In particular, it seems that SLIDE (Mode 2) does not take advantage of the annotation and behaves similarly to methods in Mode 3.
To fully understand the effect of the mode of action, it is important to inspect Figures 6, 7, 8 and 9. Here, one can immediately observe that highly expressed isoforms are detectable in all modes of action (in particular, at high sequencing depth), while the scenario changes dramatically for moderately and lowly expressed ones. The latter are recovered only in Mode 1 with CA and with sufficient sequencing depth.
The same conclusions apply to Set-up 2, comparing Additional file 9: Figure S9, Additional file 10: Figure S10, Additional file 11: Figure S11 and Additional file 12: Figure S12.
It follows that, while it is intriguing to promise to reconstruct the whole transcriptome using only RNA-seq data without any annotation information, researchers have to be very careful before trusting the results obtained in Mode 3 (using current experimental protocols and computational methodologies analogous to those considered in the present work). As a consequence, for complex genomes, they should consider complementing the results with other sources of either experimental or computational evidence.
Effect of the annotation
In order to evaluate the effect of the annotation, in Figures 2, 3, 4 and 5 precision and recall have to be compared between the two horizontal blocks in each of panels A, B, C and D of the same figure (where the upper row refers to CA and the lower row refers to IA), limiting the attention to Mode 1 and Mode 2. In Figures 6, 7, 8 and 9 and in Additional file 1: Figure S1, Additional file 2: Figure S2, Additional file 3: Figure S3 and Additional file 4: Figure S4, panels A have to be compared with the corresponding panels B, and panels C with the corresponding panels D.
This effect is very important, considering that CA is usually unknown in a real experiment. Therefore, results obtained with CA represent a sort of optimal performance. Results obtained under IA can be seen as a measure of what one can reach with the current biological knowledge.
As expected, moving from CA to IA, regardless of the type of alignment, the performance degrades, with a more evident loss in Mode 1 than in Mode 2. The observed difference between the two modes of action allows us to illustrate the discovery power of methods in Mode 2. To this purpose, Figures 10 and 11 show the number of transcripts that were not present in IA but are still recovered by methods in Mode 2 driven by IA. As mentioned before, the “recovering” effect is mainly concentrated on isoforms with high or moderate expression.
In general, the best F-measure observed in Mode 1 with CA is about 0.8 at the highest depth for 100 bp-PE and drops to about 0.6 when IA is used, while in Mode 2 the best F-measure is above 0.6 with CA and drops to about 0.6 when IA is used, compare panels A and B of Additional file 1: Figure S1. Obviously, this difference depends on the closeness between CA and IA.
Inspecting Figures 2, 3, 4 and 5, it is possible to observe that the loss of performance occurs both in terms of precision and recall. While the loss in the capability of detecting true isoforms is expected with IA, the loss in precision is more subtle. It can be explained by the fact that the methods tend to explain all the reads by assigning them to an isoform. Therefore, additional isoforms are used to accommodate the lack of fit, and isoforms with few reads (and low expression) are often introduced in the absence of a sufficiently strong penalty term or of a post-filtering procedure.
Analogous issues apply to Set-up 2, comparing Additional file 9: Figure S9, Additional file 10: Figure S10, Additional file 11: Figure S11 and Additional file 12: Figure S12, where IA was randomly generated from CA. Therefore, the main conclusion does not depend strongly on the precise content of IA. In conclusion, we want to stress that the main limit of methods in Mode 2 is their inadequacy in recovering lowly expressed isoforms, unless these are already annotated in IA. On the contrary, such methods are able to detect both moderately and highly expressed isoforms even in the absence of their annotation.
Effect of library type
In order to evaluate the effect of the library type, we have to compare PE results with SE results. Such a comparison was carried out in Set-up 1, for the read length 100 bp and for all depths. Therefore, for precision and recall we have to compare the corresponding panels of Figures 2 and 6 with Figures 5 and 9, respectively; for F-measure we have to compare Additional file 1: Figure S1 with Additional file 4: Figure S4; for the number of TP and FP we have to compare Additional file 5: Figure S5 with Additional file 6: Figure S6 for the highest depth (20M) and Additional file 7: Figure S7 with Additional file 8: Figure S8 for the lowest depth (0.25M).
We observe that PE reads show better indexes than SE reads at the same depth (at the price of a higher experimental cost, not evaluated here). However, the differences are almost negligible in Mode 1 (in particular at high coverage) and relatively small for methods in Mode 2 when CA is provided. The gap increases in Mode 2 when IA is provided. In this case, Figures 10 and 11 show that PE reads allow more isoforms that were not present in IA to be correctly detected. The gain is small at high depth and becomes more evident at low depth. The advantage of using PE reads is also evident in Mode 3. To better evaluate the differences, we observe that 20M 100 bp-SE allows 13918 junctions to be mapped when CA is provided and 13564 junctions in the absence of annotation; 0.25M 100 bp-SE allows 8086 junctions to be mapped when CA is provided and 6851 in the absence of annotation.
In conclusion, we underline that the main advantage of using PE over SE reads is the better capability to recover novel isoforms when they are not provided in the annotation. On the other hand, we also observe that 100 bp-SE reads are quite long; with shorter SE reads the advantages of PE are more pronounced.
Effect of isoform abundance
In order to evaluate the effect of isoform abundance, we have to inspect Figures 6, 7, 8 and 9, which provide deeper insight into the recall illustrated in Figures 2, 3, 4 and 5. The index is now expanded into three rows depending on the expression level of the corresponding true isoforms. From the figures, we can see that the capability of isoform detection strongly depends on the expression levels. In general, highly expressed isoforms are easily detected by methods in Mode 1, while methods in Mode 2 and Mode 3 show a lower (but still acceptable) detection capability. Moderately and lowly expressed isoforms are detected well, or at least at an acceptable rate, in Mode 1. However, they are not well identified in Mode 2, and often completely lost in Mode 3. Integrating these observations with Figures 10 and 11, we observe that lowly expressed isoforms are mostly detected only if they are provided in the annotation.
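As a concrete illustration of how the recall is stratified in Figures 6, 7, 8 and 9, the short sketch below splits the truly expressed isoforms into low/medium/high abundance classes and computes one recall value per class; the class boundaries used here are placeholders, not the thresholds adopted in the paper.

```python
def recall_by_abundance(detected, truth_fpkm,
                        low_cutoff=1.0, high_cutoff=10.0):
    """Compute recall separately for low, medium and high abundance classes.

    `detected` is the set of predicted transcript ids;
    `truth_fpkm` maps truly expressed transcript ids -> true FPKM.
    The cutoffs delimiting the classes are illustrative placeholders.
    """
    classes = {"low": set(), "medium": set(), "high": set()}
    for tx, fpkm in truth_fpkm.items():
        if fpkm < low_cutoff:
            classes["low"].add(tx)
        elif fpkm < high_cutoff:
            classes["medium"].add(tx)
        else:
            classes["high"].add(tx)
    return {
        name: (len(detected & members) / len(members) if members else None)
        for name, members in classes.items()
    }

# Toy example: the lowly expressed isoform t1 is missed.
truth = {"t1": 0.2, "t2": 5.0, "t3": 50.0, "t4": 80.0}
print(recall_by_abundance({"t2", "t3", "t4"}, truth))
# {'low': 0.0, 'medium': 1.0, 'high': 1.0}
```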
Additional file 11: Figure S11 and Additional file 12: Figure S12 illustrate the recall with respect to the isoform abundance class in Set-up 2, providing analogous conclusions.
Effect of sequencing depth
In order to evaluate the effect of sequencing depth, we have to compare the different bars of the same colour within each block of each panel of Figures 2, 3, 4, 5, 6, 7, 8 and 9, and the behaviour of each coloured line in the F-measure plots reported in Additional file 1: Figure S1, Additional file 2: Figure S2, Additional file 3: Figure S3 and Additional file 4: Figure S4.
In most cases, the performance gets worse as the sequencing depth decreases, but the loss is less evident than one might expect. In particular, it is almost negligible for methods in Mode 1 with CA and more evident for methods in Mode 2 or 3. The gap increases for data driven alignment and in the absence of CA. Indeed, when the depth increases we observed lower precision and, simultaneously, higher recall. The loss in precision can be explained by the large number of FP isoforms, often with low expression values. More generally, we noticed that once a minimum level of depth is reached (in Set-up 1, such a level is estimated at about 1M for PE reads), further increases of the depth only trade off the observed precision against recall, without impacting the overall global performance. Conversely, below this saturation level the global performance drops.
In a similar way, comparing Additional file 5: Figure S5 and Additional file 6: Figure S6 with Additional file 7: Figure S7 and Additional file 8: Figure S8, it is possible to see the benefit in the total number of correctly identified isoforms when increasing the depth from the extreme condition of 0.25M to 20M.
Effect of read length
In order to evaluate the effect of read length, we have to compare precision and recall in Figure 2 (100 bp-PE) with Figures 3 and 4 (75 bp-PE and 50 bp-PE, respectively), and Additional file 1: Figure S1 with Additional file 2: Figure S2 and Additional file 3: Figure S3 in terms of F-measure. We found, as expected, that long reads are preferable to short ones. In particular, we observed an overall loss of performance both in terms of recall and precision. For methods in Mode 1, we quantified the loss in terms of F-measure at about 5% (the best performance, achieved by RSEM with CA, is about 0.78 for 100 bp-PE and becomes 0.75 for 75 bp-PE and 0.73 for 50 bp-PE). A more significant loss was observed for methods in Modes 2 and 3, especially at low depth.
We also observe that in our experimental design short reads were obtained by trimming the long ones. Therefore, short reads generated in this way have a slightly better quality than reads of the same length generated following the error profile. As a consequence, we expect that the real difference between short and long reads could be slightly larger than the one we have reported.
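For completeness, a minimal sketch of such a trimming step is shown below, operating on a plain 4-line-per-record FASTQ file; the file names and the target length are placeholders, and this is only an illustration of the idea, not the exact script used to build our datasets.

```python
def trim_fastq(in_path, out_path, length=75):
    """Trim every read (and its quality string) in a FASTQ file to `length` bp.

    Assumes the plain 4-line-per-record FASTQ layout:
    @id / sequence / + / qualities.
    """
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            header = fin.readline()
            if not header:
                break  # end of file
            seq = fin.readline().rstrip("\n")
            plus = fin.readline()
            qual = fin.readline().rstrip("\n")
            fout.write(header)
            fout.write(seq[:length] + "\n")
            fout.write(plus)
            fout.write(qual[:length] + "\n")

# Hypothetical usage:
# trim_fastq("reads_100bp_1.fastq", "reads_75bp_1.fastq", length=75)
```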
Additional file 9: Figure S9 and Additional file 10: Figure S10 provide analogous insights for Set-up 2.
Isoform estimation
Here, we illustrate the results for the estimation error indexes E1, E2 and E3. Additional file 15: Table S1, Additional file 16: Table S2, Additional file 17: Table S3, Additional file 18: Table S4, Additional file 19: Table S5, Additional file 20: Table S6, Additional file 21: Table S7, Additional file 22: Table S8, Additional file 23: Table S9, Additional file 24: Table S10, Additional file 25: Table S11, Additional file 26: Table S12, Additional file 27: Table S13, Additional file 28: Table S14, Additional file 29: Table S15 and Additional file 30: Table S16 show error statistics (median, 3rd quartile and maximum value) for these indexes in Set-up 1. In particular, the group of tables in Additional file 15: Table S1, Additional file 16: Table S2, Additional file 17: Table S3, Additional file 18: Table S4, Additional file 19: Table S5, Additional file 20: Table S6, Additional file 21: Table S7 and Additional file 22: Table S8 illustrates the results for PE, and the group in Additional file 23: Table S9, Additional file 24: Table S10, Additional file 25: Table S11, Additional file 26: Table S12, Additional file 27: Table S13, Additional file 28: Table S14, Additional file 29: Table S15 and Additional file 30: Table S16 the results for SE. Within each group, the two most extreme depths are shown (i.e., 20M and 0.25M). Furthermore, the tables provide results both for annotation driven alignment (CA and IA) and for data driven alignment. Additional file 31: Table S17, Additional file 32: Table S18, Additional file 33: Table S19 and Additional file 34: Table S20 refer to Set-up 2. In particular, Additional file 31: Table S17 and Additional file 32: Table S18 describe the results for 60M 75 bp-PE (annotation driven alignment). The analogous cases for data driven alignment are shown in Additional file 33: Table S19 and Additional file 34: Table S20. In order to investigate the performance of the methods in correctly estimating the isoform abundances, the attention has to be focused mainly on the qualitative aspects of the error distributions rather than on the actual values shown in the tables.
Overall, both set-ups show a similar behaviour for all error types. In fact, all methods produce errors that have a strongly skewed distribution with a long right tail. This means that they fail to estimate a (significant) fraction of isoforms (see the results for the 3rd quartile and the maximum observed value). Moreover, within the same table, the same index can differ (sometimes by orders of magnitude) across methods and modes of action. Differences are observed not only with respect to the most extreme value, but also with respect to the median and the 3rd quartile. This means that the methods may provide very different results on a large fraction of isoforms. In brief, all tables indicate that the problem of obtaining reliable estimates is still open. The loss of performance is not completely surprising, since estimation is carried out after isoform detection, without explicitly considering the uncertainty of the identification step. The latter statement becomes clear when we compare the same method under different modes of action. In such cases, the method uses an analogous statistical procedure for estimating the abundances; however, the inference is carried out on different sets of isoforms and can produce different estimates. To mitigate such drawbacks, confidence intervals for the estimates could be more reliable than point estimates.
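Assuming the per-isoform error values for a given index (E1, E2 or E3, as defined earlier in the paper) are available as an array, the summary statistics reported in the tables can be obtained as in the brief numpy sketch below; the sample values are invented only to mimic a strongly right-skewed distribution.

```python
import numpy as np

def error_summary(errors):
    """Median, 3rd quartile and maximum of a per-isoform error distribution.

    A long right tail shows up as a maximum (and, to a lesser extent, a 3rd
    quartile) much larger than the median.
    """
    errors = np.asarray(errors, dtype=float)
    return {
        "median": float(np.median(errors)),
        "q3": float(np.percentile(errors, 75)),
        "max": float(np.max(errors)),
    }

# Invented example of a strongly right-skewed error distribution.
rng = np.random.default_rng(0)
toy_errors = rng.lognormal(mean=-2.0, sigma=1.5, size=1000)
print(error_summary(toy_errors))
```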
In more detail, looking at the 3rd quartile in Additional file 15: Table S1 (i.e., 20M 100 bp-PE, CA driven alignment), the best result for E1 ranges between 0.27 and 0.29 (CEM and Cufflinks in Mode 3, RSEM), the best result for E2 is about 0.005 (Cufflinks, Mode 1), and the best result for E3 is between 0.22 and 0.25 (RSEM, CEM in Mode 1).
The low 3rd quartiles observed for E2 and E3 confirm that problems arise mainly with lowly expressed isoforms, which are either not detected or not set to zero by filtering or penalization procedures. On the contrary, the large extreme values observed for E2 and E3 indicate that the corresponding methods can completely miss some estimates (by several orders of magnitude). However, such failures are limited to a few isoforms when the corresponding 3rd quartile is low, and become more numerous when the right tail of the distribution is heavier.
The (relative) error E1 requires more attention: its 3rd quartile is at least about 0.27–0.29 in Additional file 15: Table S1 (with a median value of about 0.07–0.15). This means that, even when we are able to correctly detect the presence of an isoform, in about half of the cases the estimation error is at least roughly 10% of the true value. Larger values of the median or the 3rd quartile indicate a worse performance.
We stress that Additional file 15: Table S1 illustrates (in principle) the most favourable condition. Comparing it with Additional file 17: Table S3 (i.e., 0.25M 100 bp-PE, CA driven alignment), it is possible to evaluate the effect of the depth when the annotation is CA. Whereas precision and recall for methods in Mode 1 were not significantly influenced, the error indexes were. In particular, the largest differences are observed for E2 and E3. In general, at low depth the quality of the estimates is quite poor. Analogous considerations hold for SE instead of PE, see Additional file 23: Table S9 and Additional file 25: Table S11 for the cases 20M SE and 0.25M SE, respectively.
The influence of the annotation can be deduced by comparing Additional file 15: Table S1 and Additional file 16: Table S2 (obtained using IA). An analogous comparison can be carried out for all pairs of corresponding tables (CA vs IA). In such cases the differences are larger, due to the fact that methods are not able to identify all isoforms (in particular those not provided in IA).
Additional file 31: Table S17, Additional file 32: Table S18, Additional file 33: Table S19 and Additional file 34: Table S20 lead to similar considerations for Set-up 2, with more skewed error distributions.
Considerations about the computational cost
Since the methods were implemented in different languages and ran in different environments, we found that a technical comparison of execution times would not be fair. Therefore, we briefly report only qualitative considerations. RSEM and Cufflinks (in all modes of action) are sufficiently fast, and CEM (in all modes of action) is the fastest. On the contrary, iReckon is quite slow with respect to the others and also produces very large temporary files. The current implementation of SLIDE is too slow to complete Set-up 2 in a reasonable amount of time, and therefore it was considered only in Set-up 1.