Methods for estimating human endogenous retrovirus activities from EST databases

Background: Human endogenous retroviruses (HERVs) are surviving traces of ancient retrovirus infections and now reside within the human DNA. Recently HERV expression has been detected in both normal tissues and diseased patients. However, the activities (expression levels) of individual HERV sequences are mostly unknown.
Results: We introduce a generative mixture model, based on Hidden Markov Models, for estimating the activities of the individual HERV sequences from EST (expressed sequence tag) databases. We use the model to estimate the relative activities of 181 HERVs. We also empirically justify a faster heuristic method for HERV activity estimation and use it to estimate the activities of 2450 HERVs. The majority of the HERV activities were previously unknown.
Conclusion: (i) Our methods estimate activity accurately based on experiments on simulated data. (ii) Our estimate on real data shows that 7% of the HERVs are active. The active ones are spread unevenly into HERV groups and relatively uniformly in terms of estimated age. HERVs with the retroviral env gene are more often active than HERVs without env. Few of the active HERVs have open reading frames for retroviral proteins.

Including several alternative match areas could allow more freedom for the sub-HMM in the generation of the EST, which could yield a larger likelihood for the observed EST being generated from the sub-HMM, and hence a larger activity estimate for the corresponding HERV.
2. In the simple BLAST approach, each EST is counted in favor of its best-matching HERV; it does not matter how many matches to that HERV there are. Thus, the BLAST approach for activity estimation is not affected by how many matches per EST-HERV pair are kept.
3. Having several EST-HERV matches would have affected the estimate of the active areas of the HERV sequence (the activity would likely be spread more evenly).
Note that having more than one match for an EST-HERV pair would not increase the 'weight' of the EST in the activity estimation of the HERVs, since each EST is biologically generated only once. Rather, several matches indicate more uncertainty about where in the HERV the EST was generated from.
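The counting rule of the simple BLAST approach can be illustrated with a minimal sketch. The function name `blast_activity`, the `(est_id, herv_id, score)` tuple layout, and the normalization of counts into fractions are assumptions made for illustration only; they are not taken from the implementation used in this work.

```python
from collections import Counter


def blast_activity(matches):
    """Count each EST once, in favor of its best-matching HERV.

    matches: iterable of (est_id, herv_id, score) tuples (hypothetical layout).
    Several matches for the same EST-HERV pair are allowed; only the single
    best-scoring match of each EST influences the result.
    Returns a dict mapping herv_id to the fraction of ESTs assigned to it.
    """
    best = {}  # est_id -> (score, herv_id) of the best match seen so far
    for est_id, herv_id, score in matches:
        if est_id not in best or score > best[est_id][0]:
            best[est_id] = (score, herv_id)

    counts = Counter(herv_id for _, herv_id in best.values())
    total = sum(counts.values())
    # Normalizing to fractions is an illustrative choice, not the paper's definition.
    return {herv_id: n / total for herv_id, n in counts.items()}


matches = [
    ("e1", "H1", 50.0),
    ("e1", "H1", 40.0),  # second match for the same EST-HERV pair
    ("e1", "H2", 30.0),
    ("e2", "H2", 60.0),
    ("e3", "H1", 45.0),
]
activity = blast_activity(matches)
```

In this sketch, dropping the second `("e1", "H1", ...)` match leaves the result unchanged, which mirrors the point that the number of matches kept per EST-HERV pair does not affect the BLAST estimate.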
We briefly discuss three prototypical cases of multiple matches for the same EST-HERV pair:
1. If almost consecutive areas of the EST match almost consecutive areas of the HERV, the set of matches is almost the same as a single long match. In this case it is nearly equivalent to take the best one of the matches and use it in HMM training, because the training restrictions derived from the different matches are nearly equivalent.
2. If the same area of the EST matches several far-off areas of the HERV, only one match is true. If one of the matches is clearly best, the others are likely false; then keeping the best match only may reduce noise. If several matches are nearly equally good, keeping only the best may cause a small amount of error for activity estimation.
3. If consecutive areas of the EST match far-off areas of the HERV, this could be due to, for instance, alternative splicing. Neither the HMM method nor the BLAST approach currently takes alternative splicing into account.
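Keeping only the best match per EST-HERV pair, as discussed in the cases above, could be sketched as follows. The dictionary keys, the function name `best_match_per_pair`, and the `tie_ratio` threshold used to flag nearly equally good matches (case 2) are hypothetical; the sketch also assumes positive match scores.

```python
def best_match_per_pair(matches, tie_ratio=0.95):
    """Keep the best-scoring match for every EST-HERV pair.

    matches: list of dicts with keys "est", "herv", "score" (hypothetical layout;
    extra keys such as match positions are carried along unchanged).
    tie_ratio: assumed threshold; a pair is flagged as ambiguous when its
    second-best score is at least tie_ratio times the best score.
    Returns (kept, ambiguous): a dict pair -> best match, and a set of pairs.
    """
    groups = {}
    for m in matches:
        groups.setdefault((m["est"], m["herv"]), []).append(m)

    kept, ambiguous = {}, set()
    for pair, pair_matches in groups.items():
        ranked = sorted(pair_matches, key=lambda m: m["score"], reverse=True)
        kept[pair] = ranked[0]
        if len(ranked) > 1 and ranked[1]["score"] >= tie_ratio * ranked[0]["score"]:
            ambiguous.add(pair)
    return kept, ambiguous


example = [
    {"est": "e1", "herv": "H1", "score": 100.0, "herv_start": 10},
    {"est": "e1", "herv": "H1", "score": 40.0, "herv_start": 5000},   # likely false
    {"est": "e2", "herv": "H1", "score": 80.0, "herv_start": 200},
    {"est": "e2", "herv": "H1", "score": 79.0, "herv_start": 6000},   # near tie
]
kept, ambiguous = best_match_per_pair(example)
```

Pairs reported in `ambiguous` correspond to the situation where keeping only the best match may introduce a small error; the sketch only flags them and does not attempt to resolve them.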

Details about leaving out HERVs with suspected non-retroviral sequence portions
Below we first discuss one reason for non-retroviral sequence portions; then we describe how our HERV removal procedure (that we used to try to remove effects of non-retroviral content) affects the HERV activity results.

Non-retroviral integrations in HERV sequences
Some HERV sequences in our collection could contain non-retroviral transposon integrations. To avoid such integrations, a rather stringent removal of specific transposons (ALUs and LINEs) was done before running RetroTector, but it is possible other non-retroviral integrations remain.
An even more stringent removal could be done by removing all transposons not indicated as retroviral in RepBase [1] and RepeatMasker [2]. However, this assumes these sources contain complete knowledge of which transposons are retroviral; in reality, HERVs not named as such in these sources could inadvertently be removed.
For the above reasons, a small number of non-retroviral integrations in HERV sequences is in practice unavoidable.

Activity comparison with and without HERV removal
In this paper, as described in the section Removing HERVs with suspected non-retroviral content, we have tried to remove the effect of suspected non-retroviral content on HERV activity by leaving out from our HERV set HERVs with EST hits mostly in un-annotated portions of the sequence.
We compared the group-summarized activity reported for the set of 2450 HERVs used in this work (see Supplementary Fig. 6) to that computed from all 3164 HERVs, i.e. including also HERVs where the activity is not within a viral gene or LTR. The activity profile changes considerably, because HML-5, ERV-3, and MER-41 then also have highly active elements. HERV-H has several highly active elements and is the most active group in the complete set of 3164 HERVs; the second most active group is the unclassified sequences.
We think the activity distribution for the data set where HERVs with EST hits mostly in un-annotated portions of the sequence are removed is more relevant to analyzing real retroviral activity than the distribution for the data set of all HERVs. However, there can be some retroviral expression included in the set of HERVs that was left out: their expression might be in an area that is originally retroviral but has been left un-annotated because of mutations and frame-shifts.
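The idea of leaving out HERVs whose EST hits fall mostly in un-annotated portions can be illustrated with a rough sketch. Everything specific here is an assumption for illustration: the function names, the interval representation, assigning a hit to a region by its midpoint, interpreting "mostly" as a 0.5 fraction, and keeping HERVs that have no hits at all. The actual procedure is the one described in the section Removing HERVs with suspected non-retroviral content.

```python
def annotated_fraction(hits, annotated):
    """Fraction of EST hits whose midpoint lies in an annotated region.

    hits: list of (start, end) hit positions on one HERV.
    annotated: list of (start, end) regions annotated as viral genes or LTRs.
    """
    if not hits:
        return 0.0
    inside = 0
    for start, end in hits:
        mid = (start + end) / 2
        if any(a <= mid < b for a, b in annotated):
            inside += 1
    return inside / len(hits)


def keep_hervs(hits_by_herv, annotation_by_herv, min_fraction=0.5):
    """Return the HERV ids that are kept under the assumed threshold."""
    kept = []
    for herv, hits in hits_by_herv.items():
        annotated = annotation_by_herv.get(herv, [])
        # HERVs without any hits are kept here; this is an assumption of the sketch.
        if not hits or annotated_fraction(hits, annotated) >= min_fraction:
            kept.append(herv)
    return kept


hits_by_herv = {
    "A": [(0, 100), (50, 150)],
    "B": [(900, 1000), (950, 1050), (10, 60)],
    "C": [],
}
annotation_by_herv = {"A": [(0, 500)], "B": [(0, 100)], "C": [(0, 100)]}
kept_hervs = keep_hervs(hits_by_herv, annotation_by_herv)
```

In this toy input, HERV "B" has two of its three hits outside the annotated region and is therefore left out, while "A" and "C" are kept.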

Supplementary Figures
Supplementary figure 1 -Amount of cross-talk in HERV data
The number of simulated ESTs generated from each HERV (row) that match the other HERVs (column) shown in logarithmic scale. The blocks in the diagonal correspond to the three groups in the data set. The generating HERVs are sorted block-wise by underlying true activity.
The HML2 group shows the most cross-talk, which makes it more difficult than the other groups from the point of view of the EST matching problem.

Supplementary figure 3 -Activity of HERVs with or without the env-gene
Proportion of active HERVs with different activity thresholds, for HERVs with and without the env-gene. The colored blocks below the curve represent the HERV structure (genes, LTRs) and the curve presents EST hit intensity along the HERV structure. See Table 1 for more information on these HERVs.