Bioinformatics analysis of SARS coronavirus genome polymorphism
© Pavlović-Lažetić et al; licensee BioMed Central Ltd. 2004
Received: 24 December 2003
Accepted: 25 May 2004
Published: 25 May 2004
We have compared 38 isolates of the SARS-CoV complete genome. The main goal was twofold: first, to analyze and compare nucleotide sequences and to identify positions of single nucleotide polymorphism (SNP), insertions and deletions, and second, to group them according to sequence similarity, eventually pointing to phylogeny of SARS-CoV isolates. The comparison is based on genome polymorphism such as insertions or deletions and the number and positions of SNPs.
The nucleotide structure of all 38 isolates is presented. Based on insertions and deletions and dissimilarity due to SNPs, the dataset of all the isolates has been qualitatively classified into three groups each having their own subgroups. These are the A-group with "regular" isolates (no insertions / deletions except for 5' and 3' ends), the B-group of isolates with "long insertions", and the C-group of isolates with "many individual" insertions and deletions. The isolate with the smallest average number of SNPs, compared to other isolates, has been identified (TWH). The density distribution of SNPs, insertions and deletions for each group or subgroup, as well as cumulatively for all the isolates is also presented, along with the gene map for TWH.
Since individual SNPs may have occurred at random, positions corresponding to multiple SNPs (occurring in two or more isolates) are identified and presented. This result revises some previous results of a similar type. Amino acid changes caused by multiple SNPs are also identified (for the annotated sequences, as well as presupposed amino acid changes for non-annotated ones). Exact SNP positions for the isolates in each group or subgroup are presented. Finally, a phylogenetic tree for the SARS-CoV isolates has been produced using the CLUSTALW program, showing high compatibility with former qualitative classification.
The comparative study of SARS-CoV isolates provides essential information on genome polymorphism and indicates strain differences and the evolution of variants. It may help in the development of effective treatment.
Severe Acute Respiratory Syndrome (SARS) is a new infectious disease, first reported in the autumn of 2002 and diagnosed for the first time in March 2003. It is still a serious threat to human health, and SARS coronavirus (SARS-CoV) has been associated with the pathogenesis of SARS according to Koch's postulates.
Significant research effort has gone into investigating the SARS-CoV genome sequence, with the aim of establishing its origin and evolution and eventually helping to prevent or cure the disease it causes. Although the task is a hard one, it opens up the opportunity, amongst others, for comparative investigation of different SARS-CoV isolates aimed at identifying genome regions that exhibit different levels of sequence polymorphism [3–8].
The genome of SARS-CoV is a single positive-sense RNA strand approximately 30 kb in length, containing about 10 open reading frames (ORFs) and about 10 intergenic regions (IGRs). The first two overlapping ORFs at the 5' end encompass two-thirds of the genome, while the rest of the ORFs at the 3' end account for the remaining third.
List of the SARS-CoV complete genome isolates investigated. Included are isolates' labels, IDs, accession numbers, length in nucleotides, dates of revisions considered and countries and sources of isolates.
Taiwan: patient #01
Taiwan: Hoping Hospital
Taiwan: Hoping Hospital
Taiwan: patient #06
Taiwan: patient #04
Taiwan: patient #02
Taiwan: patient #043
Taiwan, first fatal case
Germany: patient from Frankfurt
China: Hong Kong
Canada: Toronto, patient #2
Canada: Toronto, patient #2
China: Hong Kong
China: Hong Kong
China: Hong Kong
China: Hong Kong
China: Hong Kong
According to the length of the isolates (insertions and deletions) and the presence of SNPs, we classified them into three main groups with subgroups: "regular" isolates with no insertions or deletions (with different numbers of SNPs), isolates with "long insertions", and isolates with "many individual" insertions and deletions (with different positions of SNPs). This classification is close to the results of the phylogenetic analysis.
Results and discussion
I) Some of the isolates are nucleotide-identical or almost identical. There are two pairs of nucleotide-identical sequences: TWH and TWC2, and the two Tor2 entries (accession numbers AY274119 and NC_004718). Therefore, instead of 38, we consider the dataset to contain 36 isolates. Further, the isolate TWC3 differs from TWH in just one position (see the table in additional file 1), which is about what would be expected at random. Isolates Frankfurt 1 and FRA are identical except for the poly-"a" of length 13 present at the 3' end of FRA (Figure 1).
II) Similarity analysis showed that a significant number of isolates have the same length (29727 bases) and the same beginning and ending subsequences (that seem to be the exact start and end of the complete SARS-CoV genome up to the poly-"a" at the 3' end), thus forming a kind of referent group; these are the isolates TWH, TWC3, TWK, TWS, TWY, Urbani and Frankfurt 1 (Figure 1). The fully sequenced isolate TWH has then been chosen as the referent isolate for sequence comparisons, since its average number of SNPs compared to other isolates is the smallest. For example, TWH and Urbani differ from all the isolates by 15.7 and 17.6 SNPs on average, respectively, and from the isolates of the referent group by 5.7 and 10.5 SNPs on average, respectively. For SNPs see the tables in additional files 1 and 2.
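The choice of referent isolate described above can be illustrated with a minimal sketch. It assumes the sequences have already been brought to the same length (structurally identical parts only); the function names and the toy sequences are hypothetical and are not taken from the original analysis.

```python
from itertools import combinations


def snp_count(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def pick_referent(isolates: dict[str, str]) -> tuple[str, dict[str, float]]:
    """Return the isolate with the smallest mean SNP count to all other isolates."""
    totals = {name: 0 for name in isolates}
    for (n1, s1), (n2, s2) in combinations(isolates.items(), 2):
        d = snp_count(s1, s2)
        totals[n1] += d
        totals[n2] += d
    others = len(isolates) - 1  # each isolate is compared with all the others
    means = {name: total / others for name, total in totals.items()}
    return min(means, key=means.get), means


# Toy data for illustration only; these are not SARS-CoV sequences.
toy = {"iso_a": "ACGTACGT", "iso_b": "ACGTACGA", "iso_c": "ACCTACGA"}
referent, means = pick_referent(toy)
print(referent, means)
```

In this toy set, iso_b is selected because it differs least, on average, from the other two sequences.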
III) Most isolates, compared to TWH, are shorter at the 5' end (e.g., Sin2500, Sin2679, Sin2774, Sin2677, Sin2748, AS), have poly-"a" strings of various lengths at the 3' end (e.g., Tor2, HSR1, FRA, BJ02, TW1, HKU-39849, WHU), or both (BJ01, BJ03, BJ04, CUHK-W1, CUHK-Su10). Three of the isolates, Taiwan TC1, Taiwan TC2 and Taiwan TC3, have both starting and ending deletions (69 nucleotides at the 5' end and 85 nucleotides at the 3' end). Several isolates (e.g., TWJ, TWC, Sin2677, Sin2748) have some short deletions inside the sequence (Figure 1).
IV) There is a group of isolates that have significant length insertions (29 nucleotides) inside the sequence. These are the isolates GD01, SZ3, SZ16. A significant number of individual insertions have been identified in ZJ01 and ZMY 1 isolates (Figure 1, additional files 3,4,5).
1. with fewer than 15 SNPs (TWC3, TWK, TWS, TWY, Urbani, TWJ, TWC, TW1, Tor2, HSR1, CUHK-Su10, AS, Sin2500, Sin2679, Sin2774, Sin2677, Sin2748, Taiwan TC1, Taiwan TC2, Taiwan TC3, Frankfurt 1, FRA, HKU-39849, CUHK-W1),
2. between 15 and 30 (WHU, GZ50, BJ01-BJ04, ZJ01),
3. with equal to or greater than 30 SNPs (GD01, SZ3, SZ16, ZMY 1).
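The three SNP-count classes listed above amount to a simple threshold rule. The sketch below assumes the counts are taken relative to the referent isolate TWH and that a count of exactly 15 falls into the second class; the function name is hypothetical.

```python
def snp_class(n_snps: int) -> int:
    """Map an SNP count (relative to the referent isolate) to class 1, 2 or 3."""
    if n_snps < 15:
        return 1  # fewer than 15 SNPs
    if n_snps < 30:
        return 2  # between 15 and 30 SNPs (assumption: 15 itself belongs here)
    return 3      # 30 SNPs or more


# The two extreme counts reported in the text: TWC3 (1 SNP) and ZMY 1 (80 SNPs).
print(snp_class(1), snp_class(80))
```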
Additional files 1, 2, 3, 4 and 5 present SNPs for all the isolates in all five groups, indicating whether they occur in ORFs or IGRs (for annotated isolates), as well as the number of SNPs in ORFs and in IGRs per isolate. The total number of SNPs is 312 (only 2 in IGRs: TWH position 27812 for the isolate Taiwan TC3 and position 27827 for the isolates BJ01 and CUHK-W1). The average number of SNPs per isolate is 15.7; TWC3 (just 1 SNP) and ZMY 1 (as many as 80) differ significantly from this average.
Grouping of isolates
The isolates from the dataset considered may be classified according to their sequence polymorphism and SNP contents properties just described. At first, properties (III, IV) may result in three different groups (Figure 2):
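A. "regular" isolates, with no insertions or deletions except at the 5' and 3' ends,
B. isolates with "long insertions": GD01, SZ3 and SZ16 (Figure 6b) and
C. isolates with "many individual" insertions and deletions: ZJ01 and ZMY 1.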
Further, the SNP properties (1–3) may divide the A group into the A1 and A2 subgroups, and the C group into the C1 and C2 subgroups:
A1. TWH, TWC3, TWK, TWS, TWY, Urbani, TWJ, TWC, TW1, Tor2, HSR1, CUHK-Su10, AS, Sin2500, Sin2679, Sin2774, Sin2677, Sin2748, Taiwan TC1, Taiwan TC2, Taiwan TC3, Frankfurt1, FRA, HKU and CUHK-W1 (Figure 5)
A2. WHU, BJ01-BJ04 and GZ50 (Figure 6a)
C1: ZJ01 (Figure 7a)
C2: ZMY 1 (Figure 7b)
Finally, the positions of SNPs move CUHK-W1 from the A1 into the A2 group (more than 50% of SNP positions in common) and WHU from the A2 into the A1 group (less than 30% of SNP positions in common), giving the final grouping of isolates, presented as a structural tree (Figure 2):
A1. TWH, TWC3, TWK, TWS, TWY, Urbani, TWJ, TWC, TW1, Tor2, HSR1, CUHK-Su10, AS, Sin2500, Sin2679, Sin2774, Sin2677, Sin2748, Taiwan TC1, TC2, TC3, Frankfurt1, FRA, HKU and WHU (Figure 5 and the additional file 1)
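The reassignment of CUHK-W1 and WHU rests on the share of SNP positions an isolate has in common with a group. The text does not state the exact denominator, so the sketch below assumes it is the number of SNP positions of the isolate itself; the function name and all positions are hypothetical.

```python
def shared_fraction(isolate_snps: set[int], group_snps: set[int]) -> float:
    """Fraction of an isolate's SNP positions that also occur in a group of isolates."""
    if not isolate_snps:
        return 0.0
    return len(isolate_snps & group_snps) / len(isolate_snps)


# Hypothetical SNP positions relative to the referent isolate, for illustration only.
group_positions = {100, 250, 400, 900}
candidate = {100, 250, 400, 1200}   # shares 3 of its 4 positions with the group
outlier = {100, 5000, 6000, 7000}   # shares 1 of its 4 positions with the group

print(shared_fraction(candidate, group_positions))
print(shared_fraction(outlier, group_positions))
```

Under this assumption, the first isolate (above 50%) would be placed with the group, while the second (below 30%) would not.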
Although qualitative in nature, the structural tree turns out to be close to the quantitative grouping which is a basis for (computational) phylogenetic classification.
Changes in amino acids
We analyzed amino acid changes caused by multiple SNPs in all the isolates: in the proteins of the annotated isolates (19 out of 36) and in the presumed proteins of the non-annotated ones. The results of the analysis are presented in Figures 3 and 4. Figure 3 shows that silent mutations occurred in the envelope (E) protein, while nucleotide changes resulted in amino acid changes in the spike (S), membrane (M) and nucleocapsid (N) proteins. All three SNPs in the spike protein are situated in the outer membrane region and not within the potential epitope region (amino acid positions 469–882) proposed by Ren et al. Amino acid changes occurred at two multiple SNPs in the M protein, one multiple SNP in the N protein and 7 (out of 13) multiple SNPs in polyprotein 1ab, as well as at one multiple SNP in a hypothetical protein, while silent mutations occurred in three hypothetical proteins. Figure 3 also shows the properties of the amino acids resulting from the SNPs. The only significant changes in amino acid properties are Gly→Asp in the S protein (A2 and B groups, i.e., in the CUHK-W1, GZ50, BJ01-BJ03, GD01, SZ3 and SZ16 isolates) and Cys→Arg in a hypothetical protein (the same isolates, plus BJ04). The only additions from non-annotated sequences are a silent change in the hypothetical protein following the S protein in TWH, and a Gly→Glu change in the non-annotated BJ02 and BJ03, corresponding to the hypothetical protein. A similar analysis can be done for amino acid changes corresponding to SNPs at positions specific to B group isolates (Figure 4). Considering GD01, the only annotated isolate in that group, there are five amino acid changes in polyprotein 1ab, two changes of amino acid properties in the S protein (Ser→Leu and Tyr→Asp, the second being within the epitope region), one amino acid change in the M protein and one change of amino acid properties (Cys→Arg) in BGI-PUP.
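Whether an SNP is silent or changes an amino acid can be checked by translating the affected codon before and after the substitution. The sketch below uses the standard genetic code in DNA notation; the function name and the example codons are hypothetical and were chosen only to reproduce the kinds of change mentioned above (Gly→Asp, Cys→Arg), not the actual codons of the isolates.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_effect(codon: str, offset: int, new_base: str) -> tuple[str, str, bool]:
    """Return (old amino acid, new amino acid, is_silent) for one substitution.

    `offset` is the 0-based position of the substituted base within the codon.
    """
    mutated = codon[:offset] + new_base + codon[offset + 1:]
    old_aa, new_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
    return old_aa, new_aa, old_aa == new_aa


print(codon_effect("GGT", 1, "A"))  # Gly -> Asp, an amino acid change
print(codon_effect("TGT", 0, "C"))  # Cys -> Arg, an amino acid change
print(codon_effect("GGT", 2, "C"))  # Gly -> Gly, a silent mutation
```

Locating the codon for a given genome position additionally requires the start of the corresponding ORF, which for non-annotated isolates has to be presumed, as noted above.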
The SARS-CoV isolates have been multialigned using the CLUSTALW program as the very first step in obtaining a phylogenetic tree. The aligned sequences have then been submitted to CLUSTALW for bootstrapping and phylogenetic tree production. Enlarging the sequence set refined the phylogenetic tree, as compared to previous results such as those of Ruan et al. and Zhang and Zheng, obtained for 14 and 16 isolates, respectively. The phylogenetic tree obtained, drawn using the PhyloDraw program, is shown in Figure 9. It is similar to our structural tree based on qualitative analysis of the isolates (Figure 2).
The results of the analysis of dissimilarities, described in previous paragraphs, are in accordance with the alignment obtained by CLUSTALW, but regrouped and formatted in a way that facilitates further interpretation and application.
All of the SARS-CoV isolates are highly homologous (more than 99% pairwise). Most of them have similar nucleotide structure, with the same 5' and 3' ends and poly-"a" at the 3' end of different length (0–24), some of them with a single short deletion close to the 3' end of the sequence; out of 312 SNPs in total, only two are in IGRs.
Three of the 38 isolates have long insertions within the sequence;
Two of the isolates have a large number of individual insertions / deletions, exhibiting different SNP positions;
All the isolates may be grouped according to sequence polymorphism into three groups (with up to two subgroups), reflecting their similarities / dissimilarities. Since the isolate sequences have a high degree of homology, the different properties of the groups are represented more transparently in the classification tree obtained by such a qualitative analysis than in a bootstrapped phylogenetic tree obtained from multialigned sequences using the CLUSTALW program.
The total number of amino acid changes caused by multiple SNPs is 15 (in isolates of A, C groups) and 34 in isolates of B group. The total number of silent mutations is 10 (for A, C groups) and 7 (for B group).
Since the S protein is of special interest regarding its receptor affinity and antigenicity, it is interesting to note that all changes of amino acid properties are located in its outer membrane region, one for the A and C groups and two for the B group.
The results obtained may be useful in further investigation aiming at identification of SARS-CoV genome regions responsible for its infectious nature.
The coverage included all the isolates published by October 31st 2003 (with updated revisions). The identifiers, accession numbers, genomic size (in nucleotides), revision dates and country or source of the isolates considered are included in the table, together with the labels by which they are referred to in this paper. The fully sequenced isolate TWH has been chosen as the referent isolate, since its average number of SNPs was the lowest as compared to all other isolates.
Methods for similarity analysis
The similarity analysis was carried out in two steps:
1. identification of structurally identical parts of isolates, i.e., of insertion and deletion sites;
2. identification of SNPs in structurally identical parts.
Step 1 has been carried out by a function performing similarity analysis of subsequences of a given length (e.g., 100 bp), and identifying significantly non-matching strings as being inserted in the corresponding sequence (i.e., deleted from the other). Since a significant number of isolates have the same length (29727 bases) and the same starting and ending subsequences (that seem to be the exact start and end of the complete SARS-CoV genome up to the poly-"a" at the 3' end), they may be considered as forming a representative group. The nucleotide structure of all other isolates was analyzed with respect to this representative group. For each pair of isolates (x,y) (x from the representative group), a file InsDel x-y has been produced containing the positions and lengths of each of the insertions or deletions in the isolate y.
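The window-based function used for step 1 is not specified in detail, so the following sketch does not reproduce it. As a stand-in, it uses `difflib.SequenceMatcher` from the Python standard library to locate insertions and deletions in an isolate y relative to an isolate x; the function name and the toy sequences are hypothetical.

```python
from difflib import SequenceMatcher


def find_indels(ref: str, other: str) -> list[tuple[str, int, int]]:
    """Return (kind, 0-based position in ref, length) for each indel in `other`."""
    # autojunk=False: with a four-letter alphabet every base is "popular",
    # and the default junk heuristic would distort matching on long sequences.
    matcher = SequenceMatcher(None, ref, other, autojunk=False)
    events = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ref_len, other_len = i2 - i1, j2 - j1
        if tag == "insert":
            events.append(("insertion", i1, other_len))
        elif tag == "delete":
            events.append(("deletion", i1, ref_len))
        elif tag == "replace" and ref_len != other_len:
            # Unequal replacement: report only the net change in length.
            kind = "insertion" if other_len > ref_len else "deletion"
            events.append((kind, i1, abs(other_len - ref_len)))
    return events


# Toy sequences for illustration only.
ref = "ACGTTGCAAGCTTAGG"
with_insertion = ref[:8] + "CCCCC" + ref[8:]
with_deletion = ref[:4] + ref[9:]

print(find_indels(ref, with_insertion))
print(find_indels(ref, with_deletion))
```

The returned positions and lengths correspond to the kind of information stored in the InsDel x-y files. On complete genomes of about 30,000 bases this generic matcher is considerably slower than a dedicated window comparison.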
Step 2 has been carried out by comparing structurally identical parts (of the same length) of pairs of isolates. The starting and ending positions of those parts have been taken from the file InsDel x-y (for comparison of x and y), produced in step 1. The procedure returns results in a file with SNPs in the two sequences (files Mism x-y).
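Step 2 can be sketched as a position-by-position comparison restricted to the structurally identical parts. The segment format (start in x, start in y, length) is an assumption about what could be derived from the InsDel x-y files, and the function name and toy sequences are hypothetical.

```python
def find_snps(
    ref: str, other: str, segments: list[tuple[int, int, int]]
) -> list[tuple[int, str, str]]:
    """Return (1-based position in ref, base in ref, base in other) for each SNP.

    `segments` lists the structurally identical parts as
    (0-based start in ref, 0-based start in other, length).
    """
    snps = []
    for ref_start, other_start, length in segments:
        for k in range(length):
            r, o = ref[ref_start + k], other[other_start + k]
            if r != o:
                snps.append((ref_start + k + 1, r, o))
    return snps


# Toy sequences for illustration only: `other` carries a 5-base insertion
# after the 8th base and one substitution (A -> C) further downstream.
ref = "ACGTTGCAAGCTTAGG"
other = "ACGTTGCA" + "CCCCC" + "AGCTTCGG"
segments = [(0, 0, 8), (8, 13, 8)]

print(find_snps(ref, other, segments))
```

Positions are reported relative to the first sequence, which matches the convention of giving SNP positions in TWH coordinates.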
We also used the CLUSTALW program for multialignment as a control process, as well as for phylogenetic investigations.
Methods for phylogenetic investigation
In order to use similarity analysis results for drawing any phylogenetic conclusions about the SARS-CoV genome dataset, a CLUSTALW multialigned output has been generated, and a bootstrapped phylogenetic tree has been produced and drawn using the PhyloDraw program.
The work presented has been financially supported by the Ministry of Science and Technology, Republic of Serbia, Project No. 1858.
- Maskalyk J, Hoey J: SARS update. CMAJ 2003, 168(10):1294–1295.
- Fouchier RA, Kuiken T, Schutten M, vanAmerongen G, vanDoornum GJ, vandenHoogen BG, Peiris M, Lim W, Stohr K, Osterhaus ADM: Aetiology: Koch's postulates fulfilled for SARS virus. Nature 2003, 423(6937):240. 10.1038/423240a
- Rota PA, Oberste MS, Monroe SS, Nix WA, Campagnoli R, Icenogle JP, Peñaranda S, Bankamp B, Maher K, Chen MH, Tong S, Tamin A, Lowe L, Frace M, DeRisi JL, Chen Q, Wang D, Erdman DD, Peret TCT, Burns C, Ksiazek TG, Rollin PE, Sanchez A, Liffick S, Holloway B, Limor J, McCaustland K, Olsen-Rasmussen M, Fouchier R, Günther S, Osterhaus ADME, Drosten C, Pallansch MA, Anderson LJ, Bellini WJ: Characterization of a Novel Coronavirus Associated with Severe Acute Respiratory Syndrome. Science 2003, 300(5624):1394–1399. 10.1126/science.1085952
- Marra MA, Jones SJM, Astell CR, Holt RA, Brooks-Wilson A, Butterfield YSN, Khattra J, Asano JK, Barber SA, Chan SY, Cloutier A, Coughlin SM, Freeman D, Girn N, Griffith OL, Leach SR, Mayo M, McDonald H, Montgomery SB, Pandoh PK, Petrescu AS, Robertson AG, Schein JE, Siddiqui A, Smailus DE, Stott JM, Yang GS, Plummer F, Andonov A, Artsob H, Bastien N, Bernard K, Booth TF, Bowness D, Czub M, Drebot M, Fernando L, Flick R, Garbutt M, Gray M, Grolla A, Jones S, Feldmann H, Meyers A, Kabani A, Li Y, Normand S, Stroher U, Tipples GA, Tyler S, Vogrig R, Ward D, Watson B, Brunham RC, Krajden M, Petric M, Skowronski DM, Upton C, Roper RL: The Genome Sequence of the SARS-Associated Coronavirus. Science 2003, 300(5624):1399–1404. 10.1126/science.1085953
- Thiel V, Ivanov KA, Putics A, Hertzig T, Schelle B, Bayer S, Weißbrich B, Snijder EJ, Rabenau H, Doerr HW, Gorbalenya AE, Ziebuhr J: Mechanisms and enzymes involved in SARS coronavirus genome expression. J Gen Virol 2003, 84(9):2305–2315. 10.1099/vir.0.19424-0
- Qin E, Zhu Q, Yu M, Fan B, Chang G, Si B, Yang B, Peng W, Jiang T, Liu B, Deng Y, Liu H, Zhang Y, Wang C, Li Y, Gan Y, Li X, Lu F, Tan G, Cao W, Yang R, Wang J, Li W, Xu Z, Li Y, Wu Q, Lin W, Cheng W, Tang L, Deng Y, Han Y, Li C, Lei M, Li G, Li W, Lu H, Shi J, Tong Z, Zhang F, Li S, Liu B, Liu S, Dong W, Wang J, Gane KSW, Yu J, Yang H: A complete sequence and comparative analysis of a SARS-associated virus (Isolate BJ01). Chin Sci Bull 2003, 48(10):941–948. 10.1360/03wc0186
- Zeng FY, Chan CW, Chan MN, Chen JD, Chow KY, Hon CC, Hui Li J, Li VY, Wang CY, Wang PY, Guan Y, Zheng B, Poon LL, Cha KH, Yuen KY, Peiris JS, Leung FC: The complete genome sequence of severe acute respiratory syndrome coronavirus strain HKU-39849 (HK-39). Exp Biol Med (Maywood) 2003, 228(7):866–873.
- Ruan YJ, Wei CL, Ee LA, Vega VB, Thoreau H, Yun STS, Chia JM, Ng P, Chiu KP, Lim L, Tao Z, Peng CK, Ean LOL, Lee NM, Sin LY, Ng LFP, Chee RE, Stanton LW, Long PM, Liu ET: Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection. The Lancet 2003, 361: 1779–1785. 10.1016/S0140-6736(03)13414-9
- PubMed NCBI Entrez [http://www.ncbi.nlm.nih.gov/entrez]
- Wood L: Questions about comparative genomics of SARS coronavirus isolates. Lancet 2003, 362: 578. 10.1016/S0140-6736(03)14130-X
- Hsueh PR, Hsiao CH, Yeh SH, Wang WK, Chen SH, Wang JT, Chang SC, Kao CL, Yang PC: Microbiologic characteristics, serologic responses, and clinical manifestations in Severe Acute Respiratory Syndrome, Taiwan. Emerging Infectious Diseases 2003, 9(9):1163–1167.
- Ren Y, Zhou Z, Liu J, Lin L, Li S, Wang H, Xia J, Zhao Z, Wn J, Zhou C, Wang J, Yin J, Xu N, Liu S: A strategy for searching antigenic regions in the SARS-CoV spike protein. Geno, Prot & Bioinfo 2003, 1(3):207–215.
- Zhang Y, Zheng N: Genomic phylogeny of SARS coronavirus suggested that Guangdong province is the origin area (personal communication).
- PhyloDraw V0.82 [http://pearl.cs.pusan.ac.kr/phylodraw]
- Russell RB, Betts MJ, Barnes MR: Amino acid properties. [http://www.russell.embl.de/aas/]
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.