Coronavirus 3CLpro proteinase cleavage sites: possible relevance to SARS virus pathology.

BACKGROUND
Despite the passing of more than a year since the first outbreak of Severe Acute Respiratory Syndrome (SARS), efficient counter-measures are still few and many believe that reappearance of SARS, or a similar disease caused by a coronavirus, is not unlikely. For other virus families like the picornaviruses it is known that pathology is related to proteolytic cleavage of host proteins by viral proteinases. Furthermore, several studies indicate that virus proliferation can be arrested using specific proteinase inhibitors supporting the belief that proteinases are indeed important during infection. Prompted by this, we set out to analyse and predict cleavage by the coronavirus main proteinase using computational methods.


RESULTS
We retrieved sequence data on seven fully sequenced coronaviruses and identified the main 3CL proteinase cleavage sites in polyproteins using alignments. A neural network was trained to recognise the cleavage sites in the genomes obtaining a sensitivity of 87.0% and a specificity of 99.0%. Several proteins known to be cleaved by other viruses were submitted to prediction as well as proteins suspected relevant in coronavirus pathology. Cleavage sites were predicted in proteins such as the cystic fibrosis transmembrane conductance regulator (CFTR), transcription factors CREB-RP and OCT-1, and components of the ubiquitin pathway.


CONCLUSIONS
Our prediction method NetCorona predicts coronavirus cleavage sites with high specificity and several potential cleavage candidates were identified which might be important to elucidate coronavirus pathology. Furthermore, the method might assist in design of proteinase inhibitors for treatment of SARS and possible future diseases caused by coronaviruses. It is made available for public use at our website: http://www.cbs.dtu.dk/services/NetCorona/.


Background
In the spring of 2003, the Severe Acute Respiratory Syndrome (SARS) caused numerous fatalities particularly in Southeast Asia and gravely affected the global economy. The causative agent was shown to be a human coronavirus [1], a virus type which normally causes mild cold symptoms in humans. The abrupt appearance raises concern of another break-out of an epidemic of SARS virus or similar strains in the future.
Coronaviruses are found in different species ranging from chicken to cattle and humans. Currently, seven coronavirus genomes, including SARS coronavirus (CoV), have been fully sequenced and cluster into four main groups, of which SARS-CoV occupies its own [2,3]. Polyproteins encoded by the coronavirus RNA are processed by viral proteinases yielding mature proteins. The main proteinase 3CL pro performs at least eleven proteolytic cleavages within a single viral polyprotein [4,5]. Viral polyprotein processing is a common theme in viral molecular biology, e.g. as seen in picornaviruses and retroviruses like HIV. Therefore, essential viral proteinases have been suggested as potential targets for specific therapeutic approaches, e.g. by development of specific proteinase inhibitors [6][7][8].
In the case of picornaviruses, virus-encoded proteinases are able to cleave specific cellular targets and thereby severely inhibit the cellular translational machinery (the "host cell shut-off" response) while still allowing for high translational activity of viral mRNA [9]. Earlier, we developed a computational approach for predicting potential cleavage sites of picornavirus proteinases 2A and 3C [10]. Badorff et al. successfully used this cleavage predictor to identify the cellular target dystrophin, which they experimentally showed to be cleaved both in vitro and in vivo [11]. However, preliminary studies revealed that this model is not compatible with coronavirus cleavage sites. The general approach is still valid though, and we decided to apply this method to the problem of predicting the 3CL pro proteinase cleavage sites and identifying potential host cell target proteins. We propose that a deeper understanding of coronavirus proteinase function and substrate specificity may benefit further research by: i) increasing the understanding of substrate specificity determinants which may direct studies focusing on the development of specific proteinase inhibitors and ii) providing a method for screening cellular target proteins for potential coronavirus proteinase cleavage sites.
In this paper, we describe the development of a computational prediction method using artificial neural networks for predicting coronavirus 3CL pro proteinase cleavage sites. The method is based on known cleavage sites in seven members of the coronavirus family as the cleavage sites are believed to be sufficiently conserved among family members. This notion is supported by the fact that the SARS 3CL pro proteinase has recently been shown capable of catalysing the cleavage of peptide fragments from other coronaviruses at the expected cleavage sites [12].
We discuss potential targets of 3CL pro proteinase, e.g. the cystic fibrosis transmembrane conductance regulator (CFTR) and translational and transcriptional factors, which may be involved in the molecular pathology of coronaviruses in general and SARS virus in particular.

Analysis of the proteinase cleavage site
The 77 annotated coronavirus polyprotein main proteinase cleavage sites were aligned without gaps by constraining the P1 position. Every site had a glutamine (Q) in position P1 (the position just before the cleavage site; the positions are named as suggested by Berger and Schechter [13] with P1, P2, ... etc., N-terminal to the cleavage site and P1', P2', ... etc., C-terminal to the cleavage site). From the sequence logo ( Figure 1) a very strong consensus is evident around the cleavage site. As discussed by others [14,15], the coronavirus 3C-like proteinase shares many traits with its picornavirus 3C proteinase counterpart, hence the name. This is reflected in the cleavage site logo although differences between the two are also apparent. Positions P1', P1, and P4 have similar amino acid distribution in the 3C and 3CL proteinase cleavage sites. On the other hand, the coronavirus proteinase has a strong preference for leucine at position P2 while this position is relatively non-conserved among picornavirus proteinase cleavage sites [10]. A recently published study of the crystal structure of 3CL pro from the 229E strain of human coronaviruses indicates that residues at positions P5 to P3 form an anti-parallel β sheet with part of the proteinase, signifying their importance in cleavage site recognition [7].
It is clear from the above that a simple, position specific consensus sequence is difficult to define. With the present data set from seven different coronaviruses it is possible to classify correctly 60 (78%) of the 77 cleavage sites by matching an 'LQ' consensus pattern. However, an additional 196 sites in the viral polyproteins are incorrectly classified as cleavage sites, being random occurrences of this pair of amino acids. Classification is improved by using the consensus pattern 'LQ [S/A]', meaning Leu-Gln-(Ser OR Ala), but it is still far from being a useful classifier. The false positive rate is now down to 36 wrong sites, but at the same time only 48 (62%) of the correct cleavage sites are detected. As the pattern becomes more sophisticated, specificity increases (reducing the number of false positives) but at the same time sensitivity drops dramatically (i.e. fewer of the true sites are detected).

Neural network training and performance
To overcome the limitations of simple consensus patterns, we trained an artificial neural network to identify the cleavage sites. The best model was obtained using a threelayered neural network with two hidden neurons and a sequence window encompassing nine amino acids centered on the P1 position, thus encompassing P5-P4'. The network evaluates and assigns a score between 0 and 1 to every glutamine to which it is presented, where a score above 0.5 is considered a positive answer (i.e. a cleavage site is predicted). This model was able to classify correctly 67 of 77 known cleavage sites (87.0%) and 1,358 of 1,372 (99.0%) sites assumed not to be cleaved by the proteinase when testing on independent sites not included when training. The neural network method could thus identify many more of the positive sites with fewer false positives than simple consensus-type methods thereby increasing the classification performance. To evaluate the predictive power of the neural network, we performed a basic bayesian analysis of the data set test results. The scoring range from 0 to 1 was divided into ten bins and the posterior probability of a positive prediction (a prediction indicating a cleavage) being true was calculated and plotted ( Figure 3). The posterior probability in the range 0.5 to 0.8 cannot be determined accurately since relatively few examples score in this interval -only 3% of the test set (both positive and negative examples) scores between 0.4 and 0.8. However, results indicate that prediction scores can be classified into three categories, those that fall below 0.5 are most likely not cleaved, those that fall between 0.5 and 0.8 are possibly cleaved and those above 0.8 are most likely cleaved if available to the proteinase.

Analysis of selected human proteins
As mentioned above, there are several experimentally verified examples of host cell protein cleavage by virus proteinases. Thus, both these and other non-coronavirus proteins from Swiss-Prot [16] 41.0 were examined for potential cleavage sites. In total three groups of proteins were examined: i) proteins known to be cleaved by other viruses, ii) proteins which could be targets when considering the pathology of coronaviruses iii) proteins related to the expected immune response to a viral infection. Eukaryotic translation initiation factor 4 gamma (IF4G_HUMAN) has a potential cleavage site after Gln838 (0.822), but also at two other positions although with lower cleavage scores. Cleavage of this protein may lead to host cell shut-off in a similar way to what has been described for picornavirus 2A proteinase [17].
Two subunits of the RNA polymerase III are predicted targets of the coronavirus proteinase 3CL pro . RNA polymerase (RPC1_HUMAN) has a predicted cleavage site after Glnl95 with a score (0.765) well above the 0.5 cut-off. The protein is the second largest subunit of the RNA polymerase III complex and if this protein is indeed a cellular proteinase target it might cause disruption of the RNA polymerase III complex upon infection with a coronavirus. A similar disruption would be expected in case of a cleavage of the largest subunit of the complex (RPA1_HUMAN) which also has a predicted cleavage site (at position 329, score 0.704). It agrees with findings that poliovirus disrupts RNA polymerase III function, although this occurs through cleavage of transcription factor IIIC and not the polymerase subunits themselves [18][19][20]. Several well-known transcription factors contain potential cleavage sites. The highest scoring is CREB-RP (AT6B_HUMAN) with a predicted cleavage site at Gln358 (0.916) close to the DNA binding leucine zipper motif. This is in agreement with findings from picornavirus 3C pro proteinase although at a different position in the sequence [21]. OCT-1 (PO21_HUMAN) is also predicted to be cleaved by the 3CL pro proteinase with high confidence (0.874) following Gln62 again corresponding to experimental evidence from picornavirus [22]. Several subunits of the transcription initiation factor TFIID, which is a verified target in poliovirus infections [23], have predicted cleavage sites; the 250 kDa subunit (T2D1_HUMAN), the 135 kDa subunit (T2D3_HUMAN), and the 105 kDa subunit (T2DT_HUMAN).
The tumor-suppressor protein P53 is known to be cleaved by picornavirus 3C pro proteinase [24] but this protein is not predicted to contain any coronavirus 3C pro proteinase cleavage sites. However, P53-binding protein 1 (P531_HUMAN) and P53-binding protein 2 (P532_HUMAN), which stimulate p53-mediated transcriptional activation [25], have several potential cleavage sites.
Another known target for viral infections is the microtubule-associated protein 4 (MAP-4) which is cleavable in HeLa cells by the poliovirus 3C pro proteinase [26,27]. MAP-4 (MAP4_HUMAN) might also be cleavable by 3CL pro albeit with a low score (after Gln 1005 with a score of 0.519) and furthermore microtubule-associated protein RP/EB member 1 and 3 (MAE1_HUMAN and  MAE3_HUMAN) have sites which obtain scores above 0.5. The position of the possible cleavage site in MAP-4 is different from that observed with poliovirus 3C pro reflecting the different specificity of this proteinase.
Lung related proteins were examined as early symptoms of SARS could indicate a relation. The cystic fibrosis transmembrane conductance regulator (CFTR_HUMAN) is an ATP-dependent chloride channel. It has a predicted cleavage site with a high score (0.842) following Gln762 in the human sequence. This part of the membrane protein is cytoplasmic and contains several phosphorylation sites (residues 660 -813) indicating an accessible region.
The epithelial sodium channels play an important role in lung liquid homeostasis [28] and the amiloride-sensitive sodium channel δ-subunit (SCAD_HUMAN) has a predicted cleavage site in the cytoplasmic C-terminus (after

LQ[S/A] [T/S/A]X[L/F]Q[S/A/G] NN
Pattern % more of these proteins may lead to reduced presentation of viral peptides to cytotoxic T lymphocytes thereby inhibiting the cellular immune response. IRAK-1 (IRA1_HUMAN) which is involved in IL-1 induced activation of cells has a predicted cleavage site after Gln457 scoring 0.859.
Interferon-induced protein 6-16 precursor (INI2_HUMAN) is a membrane protein and was predicted to possess a cleavage site following Gln97 (0.890) which is located in the cytoplasmic part of the mature protein. Protein 6-16 has been shown to enhance interferonα antiviral efficacy [29]. Interferon-α, -β, and -γ are known to be involved in antiviral defence and have been employed for treatment of SARS [30], but the interferons themselves do not seem to possess cleavage sites.
We have listed the human proteins analysed in this work in a table (Table 1).

Discussion
We have developed a neural network capable of identifying the cleavage site of the coronavirus proteinase 3CL pro and use this model to predict potential cleavage sites in host cell proteins. The predictor is highly specific which means that few false positives are expected, in fact on independent test sets we observed a false positive rate around 1%. The optimal network window size of nine residues agrees well with available structural information about the proteinase from human coronavirus 229E which indicates that the active site makes contact with at least four residues N-terminal to the glutamine [7]. The ten sites known to be cleaved but failed to be recognised by the neural network are not dramatically different from the remainder of the sites ( Table 2). We therefore do not suspect these to be sites of a different hitherto unknown proteinase, but it would be interesting to see if the lower prediction score reflects a lower cleavage efficiency in vivo. Of the fourteen negative examples wrongly predicted as cleavable (Table 3)  Scoring range Posterior probability / Fraction some resemblance to real cleavage sites but also some resemblance to negative examples which are not predicted as cleavable. They may represent sites in-between which are cleavable to a certain extent but are shielded from cleavage due to conformational issues.
Predicted sites even with high scores which are inaccessible to the proteinase (like extracellular domains, transmembrane domains, or buried domains in globular proteins) should be disregarded, as accessibility information is not available to the neural network. Cleavage sites probably exist that are not cleaved because they are not exposed to the solvent sufficiently for the proteinase to work.
Others have attempted recognising the cleavage sites of the 3CL proteinase as a component of a coronavirus gene prediction server using different methods [31]. As the goal Table 1: Selected potential cleavage sites in human proteins from the Swiss-Prot database examined in this work. Columns represent Swiss-Prot identifier, predicted cleavage site position of P1 in the target protein, cleavage site score, and cellular localisation of target protein (Cyt -cytoplasmic, Nuc -nuclear, Mem -membrane associated). The last column lists the cleavage site in the sequencecleavage is predicted between the central glutamine residue (Q) and the following amino acid residue. Sorted by prediction score.    NC_001451  3928  AIBV  KSSVQSVAG  NC_001846  3923  MHV  VSQIQSRLT  NC_001846  5984  MHV  NPRLQCTTN  NC_002306  5527  TGV  KIGLQAKPE  NC_003045  5900  BCoV  ETRVQCSTN  NC_003436  3299  PEDV  GVNLQGGYV  NC_003436  6141  PEDV  SNNLQGLEN  NC_004718  3546  SARS  GVTFQGKFK  NC_004718  4369  SARS  EPLMQSADA  NC_004718  5902 SARS VATLQAENV was different, that predictor is not publicly available and no performance values have been published.

Conclusions
Our method can be employed by researchers suspecting a possible viral proteinase cleavage but may also prove useful for researchers working with coronavirus function. Finally, the method might facilitate proteinase blocking based drug discovery by providing hints about proteinase affinity to various non-cleavable peptide ligands, which is a possible strategy for drug development [7,32].

Data Set Preparation
Seven full-length coronavirus genomes were retrieved from the GenBank database [33] with the following acces-

Sequence logos
Amino acid conservation in multiple sequence alignments may be visualised using sequence logos. The height of the amino acid one-letter abbreviations reflect the Shannon information content [34] in units of bits at that specific position in the multiple sequence alignment [35]. The basic idea behind the visualisation technique is that the height of each letter in a given position reflects its probability p k (i). The total height of the column reflects the total information content (D(i)) at that specific position in the alignment given by (for proteins): Very conserved positions will then get tall columns with the height of individual residue symbols reflecting the amino acid distribution.

Training the neural networks
The artificial neural networks used in this work were of the standard feed-forward type. Sparse encoding was used for translating the amino acids to data input for the networks as has been described previously [36,10,37].
Training was done with three-fold cross-validation and Matthews correlation coefficients [38] were calculated by summing up true positives, false positives, true negatives, and false negatives in all combinations of training and test sets. Using an architecture with two hidden neurons and a symmetric window of nine amino acids centered on the glutamine in the P1 position it was possible to obtain a correlation coefficient of 0.84 on cross-validated test sets. Care was taken to ensure that all cleavage sites were equally distributed in every cross-validated set.

Bayesian statistics
The validity of the statistics depends on the expected fraction of cleavage sites in a given data set, which we only know in the data set at hand. Statistics was thus done on the data set test results in order to create a histogram of prediction probabilities. Statistics was done using Bayes' Theorem: The prediction outcome (0-1) was divided into 10 bins (X l ) with increments of 0.1. The posterior probability P(C pos |X l ) gives the probability of a positive prediction (that is, a cleavage) being true given the bin. This can be calculated from the prior probability P(C pos ), which is the fraction of positive examples in the data set, and the classconditional probability P(X l |C pos ) for positive examples, which is the fraction of positive examples in the bin X l . P(X l ) is the fraction of prediction outcomes in bin X l .

Searching for potential cleavage sites
An averaged sum of the score of all three networks arising from the three-fold cross-validation was used for prediction. Each network outputs a score in the range [0.000-1.000], where scores below 0.5 indicate non-cleavage and scores above 0.5 indicate potential cleavage. This method is also employed by the prediction web server mentioned below. The Swiss-Prot database [16] release 41.0 (February 2003) was downloaded and proteins from this database were used as targets for the neural network predictions.

Availability
Our neural network based prediction method, NetCorona, for prediction of potential cleavage sites of the SARS-3CL pro proteinase is publicly available by following the link 'CBS prediction servers' from http://www.cbs.dtu.dk or at this specific URL: http://www.cbs.dtu.dk/services/ NetCorona/