Volume 7 Supplement 5

## APBioNet – Fifth International Conference on Bioinformatics (InCoB2006)

# Selecting effective siRNA sequences by using radial basis function network and decision tree learning

- Shigeru Takasaki^{1} (corresponding author)
- Yoshihiro Kawamura^{1}
- Akihiko Konagaya^{1}

*BMC Bioinformatics* 2006, **7(Suppl 5)**:S22

**DOI:** 10.1186/1471-2105-7-S5-S22

© Takasaki et al; licensee BioMed Central Ltd 2006

**Published:** 18 December 2006

## Abstract

### Background

Although short interfering RNA (siRNA) has been widely used for studying gene functions in mammalian cells, its gene silencing efficacy varies markedly and there are only a few consistencies among the recently reported design rules/guidelines for selecting siRNA sequences effective for mammalian genes. Another shortcoming of the previously reported methods is that they cannot estimate the probability that a candidate sequence will silence the target gene.

### Results

We propose two prediction methods for selecting effective siRNA target sequences from many possible candidate sequences, one based on the supervised learning of a radial basis function (RBF) network and the other based on decision tree learning. They are quite different from the previous score-based siRNA design techniques and can predict the probability that a candidate siRNA sequence will be effective. The proposed methods were evaluated by applying them to recently reported effective and ineffective siRNA sequences for various genes (15 genes, 196 siRNA sequences). We also propose a combined prediction method that uses both the RBF network and decision tree learning. As the average prediction probabilities of gene silencing for the effective and ineffective siRNA sequences of the reported genes by the three proposed methods were respectively 65% and 32%, 56.6% and 38.1%, and 68.5% and 28.1%, the methods imply high estimation accuracy for selecting candidate siRNA sequences.

### Conclusion

New prediction methods were presented for selecting effective siRNA sequences. As the proposed methods indicated high estimation accuracy for selecting candidate siRNA sequences, they would be useful for many other genes.

## Background

Although RNA interference (RNAi) has been successfully used for studying gene functions in both plants and invertebrates, many practical obstacles need to be overcome before it becomes an established tool for use in mammalian systems [1–6]. One of the important problems is designing effective siRNA sequences for target genes. The short interfering RNA (siRNA) responsible for RNA interference varies markedly in its gene silencing efficacy in mammalian genes, where the gene silencing effectiveness depends very much on the target sequence positions (sites) selected from the target gene [7, 8]. Since different siRNAs synthesized for various positions induce different levels of gene silencing, the selection of the target sequence is critical to the effectiveness of the siRNA. We therefore need useful criteria for gene silencing efficacy when we are designing siRNA sequences [9, 10].

Zamore et al. and Jayasena et al. showed that the 5' end of the antisense strand that was incorporated into the RNA-induced silencing complex (RISC) more efficiently was less tightly paired to its complement and began with an A-T pair, whereas the strand incorporated less efficiently had a G-C terminus [11, 12]. Other factors reported to be related to gene silencing efficacy are GC content, sequence features, specific motif sequences and secondary structures of mRNA. Several siRNA design rules/guidelines using efficacy-related factors have been reported [13–17].

Although sequence characteristics for siRNA designs seem to be the most important factor determining effective siRNA sequences, there are few consistencies among the reported rules/guidelines [18–22]. This implies that these rules/guidelines might result in the generation of many candidates and thus make it difficult to extract a few for synthesizing siRNAs. Furthermore, there is in RNAi a risk of off-target regulation: a possibility that the siRNA will silence other genes whose sequences are similar to that of the target gene. When we use gene silencing for studying gene functions, we have to first somehow select high-potential siRNA candidate sequences and then eliminate possible off-target ones [23].

Here we therefore focus on identifying high-potential siRNA sequences from many possible candidates and propose prediction methods for selecting effective siRNA target sequences by applying the radial basis function (RBF) technique and decision tree learning to a large number of known effective and ineffective siRNAs [24–26]. We also propose a combined prediction method that uses both the RBF network and decision tree learning. The effectiveness of the proposed methods was confirmed by using them to evaluate siRNA sequences recently reported to effectively or ineffectively suppress the expression of various genes (see Methods). As the average prediction probabilities of gene silencing for the effective and ineffective siRNA sequences of the reported genes by the three proposed methods were respectively 65% and 32%, 56.6% and 38.1%, and 68.5% and 28.1%, the methods imply high estimation accuracy for selecting candidate siRNA sequences. Although the proposed methods are different from the previous scoring methods and are therefore difficult to compare with them, the evaluation results indicate that the proposed methods would be useful for many other genes and thus for selecting siRNA sequences for mammalian genes.

## Results and Discussion

We propose two prediction methods for selecting effective siRNA sequences from many possible candidate sequences, one based on the supervised learning of an RBF network and the other based on decision tree learning.

### Learning based on the RBF network and the decision tree

A radial basis function (RBF) network is a type of artificial neural network for application to problems of supervised learning, such as regression, classification and time series prediction. As RBF networks are nonparametric models, no *a priori* knowledge is assumed about the function that is to be used to fit the training set [24, 25]. RBF networks are supervised learning models with a single middle layer of units. They are similar to back propagation neural networks but usually faster to train because the RBFs used in the units mean that fewer weight adjustments are needed. Also, RBF networks tend to be more resistant to noisy data than back propagation networks. Decision tree learning is one of the most widely used and practical methods for inductive inference. A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a classification or decision [26].

The proposed RBF network and decision tree learning algorithms for selecting effective siRNA sequences are described in Methods.

### Verification of the proposed methods

Clusters generated by the RBF network.

| Cluster ID | No. of sequences | Percentage of effective siRNAs (%) |
|---|---|---|
| C1 | 134 | 94 |
| C2 | 150 | 70.7 |
| C3 | 125 | 70.4 |
| C4 | 147 | 61.9 |
| C5 | 141 | 43.3 |
| C6 | 158 | 32.3 |
| C7 | 143 | 27.1 |
| C8 | 148 | 8.1 |

### Prediction analysis by the RBF network

The average prediction probability of gene silencing for the MG1 effective siRNA sequences was 66.3% (standard deviation: 23.2%), whereas the average probability for the ineffective siRNA sequences was 33.6% (standard deviation: 17.2%). As there is a clear difference between the prediction probabilities of the effective and ineffective siRNA sequences, the predicted probabilities serve as an indication of effectiveness. The average prediction probabilities of effective siRNA sequences for MG2, MG3, MG4 and MG5 were respectively 66% (standard deviation: 17.4%), 57.4% (21.9%), 78.3% (16.7%) and 57.9% (16.7%), whereas the average prediction probabilities of the corresponding ineffective siRNA sequences were 25.5% (19.7%), 40.7% (21.4%), 20.7% (6.2%) and 30.1% (15.4%). As there are also clear differences between the averages of the effective and ineffective siRNA sequences for these genes, the individual predicted probabilities indicate the effectiveness of the proposed method.

### Prediction analysis by the decision tree learning

### Combined method of the RBF network and decision tree learning

### Comparison with other reported methods

The proposed methods use the supervised learning techniques by the RBF network and decision tree for selecting effective siRNA candidates, whereas most of the previous methods use scoring techniques [27]. Although the proposed methods can estimate the probability of gene silencing in the range from 0 to 1, the scoring methods cannot indicate this probability. The scoring method basically sets score values for candidate siRNA sequences according to the designated design rules. Consequently if an siRNA candidate for a specific gene completely satisfies the required design rules, it is expected to get a high score. Even though a high-potential siRNA would be obtained, however, it is difficult to estimate the probability that this siRNA would actually accomplish the expected gene degradation. In addition, as the previous scoring methods are dependent on their designated rules, the obtained scores vary depending on the individual rules. It is therefore quite difficult to compare these different scoring methods with the proposed methods.

The proposed method, on the other hand, by estimating the gene silencing probability of the siRNA candidates can, as shown in Figure 6, clearly indicate differences between effective and ineffective siRNAs. This therefore implies that the proposed method can easily be used for selecting high-potential siRNA sequences.

## Conclusion

We proposed two prediction methods for selecting effective siRNA target sequences from many possible candidate sequences by using a radial basis function (RBF) network and decision tree learning. They are quite different from the previous score-based siRNA design techniques and can predict the probability that a candidate siRNA sequence will be effective. The proposed methods were evaluated by applying them to recently reported effective and ineffective siRNA sequences for various genes. In addition, we proposed a combined method that uses both the RBF network and decision tree learning. As the average prediction probabilities of gene silencing for the effective and ineffective siRNA sequences of the recently reported genes by the three proposed methods were respectively 65% and 32%, 56.6% and 38.1%, and 68.5% and 28.1%, the methods imply high estimation accuracy for selecting candidate siRNA sequences. The evaluation results indicated that the proposed methods would be useful for many other genes. They should therefore be useful for selecting siRNA sequences for mammalian genes.

## Methods

### Supervised learning for effective siRNA classifications by using the RBF network

#### Preparation

#### Training phase

The training of the RBF network proceeds in two steps. First the hidden layer parameters are determined as a function of the input data (vectors) and then the weights between the hidden and output layers are determined by comparing the target data and the output of the hidden layer. The hidden layer parameters to be determined are the parameters of hyperellipsoids that partition the input data (vectors) into discrete clusters or regions. The parameters locate the center (i.e., the mean) of each ellipsoid's (region or cluster) basis function and describe the extent or spread of the region (i.e., the variance or standard deviation).

The centers of individual clusters are determined as follows:

(1) Randomly choose *m* vectors from the input data set to be the centers of *m* basis functions.

(2) For each vector *i* in the input dataset compute the Euclidean distance *D*_{i,m} to each of the *m* basis function centers:

$$D_{i,m} = \lVert X_i - M_m \rVert = \sqrt{\sum_{j=1}^{19} \left(x_{i,j} - \mu_{m,j}\right)^2} \tag{1}$$

where *i* is the input vector number, *i* = 1, 2, ..., *TN* (the maximum number of vectors in the set of training data), *X*_{i} is the *i*-th input vector, *X*_{i} = (*x*_{i,1}, *x*_{i,2}, ..., *x*_{i,19}), and *M*_{m} is the location vector or center of the basis function for hidden node *m*, *M*_{m} = (μ_{m,1}, μ_{m,2}, ..., μ_{m,19}).

(3) Determine for each input data vector the closest basis function center:

$$C_{bf,i} = \min\{D_{i,1}, D_{i,2}, \ldots, D_{i,m}\} \quad \text{for } i = 1, 2, \ldots, TN, \tag{2}$$

where C_{bf,i} is the closest basis function for the input vector *i*.

(4) For all the input vectors grouped around the basis functions, compute the mean C_{m}:

$$C_m = \frac{1}{N_m} \sum_{i=1}^{N_m} X_i^{m} \tag{3}$$

where *X*_{i}^{m} is the input vector *i* of the closest basis function *m* and *N*_{m} is the number of input vectors grouped around the basis function *m*.

(5) Use these grouped means as the new mean values for the *m* basis functions.

(6) Repeat this process until there is no further significant change to the basis function centers.
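Steps (1) to (6) amount to a k-means-style iteration. The following is a minimal sketch in plain Python, not the authors' implementation: the function names, the tolerance `tol`, the iteration cap `max_iter` and the handling of an empty group are all assumptions made for illustration.

```python
import math
import random


def distance(x, center):
    """Euclidean distance between an input vector and a basis function center."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, center)))


def mean_vector(group):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(group) for column in zip(*group)]


def find_centers(vectors, m, tol=1e-9, max_iter=100, seed=0):
    """Return m centers found by repeating steps (2)-(5) until they stop moving."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, m)]      # step (1)
    for _ in range(max_iter):
        groups = [[] for _ in range(m)]
        for x in vectors:                                    # steps (2) and (3)
            nearest = min(range(m), key=lambda k: distance(x, centers[k]))
            groups[nearest].append(x)
        # Step (4): group means; an empty group keeps its previous center
        # (an assumption, the text does not cover this case).
        updated = [mean_vector(g) if g else centers[k] for k, g in enumerate(groups)]
        shift = max(distance(a, b) for a, b in zip(centers, updated))
        centers = updated                                    # step (5)
        if shift < tol:                                      # step (6)
            break
    return centers
```

With 19-dimensional input vectors the same code applies unchanged, since nothing in it depends on the vector length.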

The number *m* of basis functions starts as a small value – e.g., *m* = 4 – and increases as the validation data is being evaluated. The variances of the individual basis functions (σ_{1}, σ_{2}, ..., σ_{l}) are computed after the individual basis functions are determined.

The radial basis function *GR*(*i, m*) for the hidden unit *m* output of the input vector *i* is defined as a Gaussian function in the following way:

where σ_{m}^{2} is a measure of the size of the cluster *m* (i.e., the variance or the square of the standard deviation).
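As a hedged illustration of a Gaussian hidden-unit output, the sketch below assumes the common form exp(-*D*²/(2σ²)), where *D* is the Euclidean distance from the input vector to the cluster center. The factor of 2 in the denominator and the way the variance is estimated (mean squared distance of the cluster members from their center) are assumptions, as are all function names.

```python
import math


def squared_distance(x, center):
    """Squared Euclidean distance between an input vector and a center."""
    return sum((a - b) ** 2 for a, b in zip(x, center))


def cluster_variance(members, center):
    """Assumed estimate of sigma_m squared: mean squared distance to the center."""
    return sum(squared_distance(x, center) for x in members) / len(members)


def gaussian_rbf(x, center, variance):
    """Hidden-unit output; the 2 * variance normalisation is an assumption."""
    return math.exp(-squared_distance(x, center) / (2.0 * variance))
```

The output is 1 when the input vector coincides with the center and decays toward 0 as the distance grows relative to the cluster size.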

Then all that remains is to find the linear combination of weights that produces the desired output (target) values for each input vector. Since this is a linear problem, convergence is guaranteed and computation proceeds rapidly. This task can be accomplished with an iterative technique based on the perceptron training rule or with various other numerical techniques. Technically, the problem is a matrix inversion problem:

*T* = *BW*, (5)

where *T* is the target vector, *W* is the to-be-determined weighting vector and *B* is the matrix of output values from each hidden unit in response to the input data (calculated from the basis functions, e.g., equation (4)). The matrix is usually not square, so a pseudo inverse may be used to give a minimum least-squares solution.
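The least-squares solution of *T* = *BW* can be sketched without any external library by solving the normal equations (*B*ᵀ*B*)*W* = *B*ᵀ*T*, which gives the same result as the pseudo inverse when the columns of *B* are linearly independent. This is an illustrative sketch under that assumption; the function names are hypothetical and a numerical library would normally be preferred for large or ill-conditioned matrices.

```python
def solve_linear(a, b):
    """Solve the square system a w = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):          # back substitution
        s = sum(aug[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (aug[r][n] - s) / aug[r][r]
    return w


def least_squares_weights(B, T):
    """Weights minimising the squared error of T = B W via the normal equations."""
    m = len(B[0])
    btb = [[sum(row[j] * row[k] for row in B) for k in range(m)] for j in range(m)]
    btt = [sum(row[j] * t for row, t in zip(B, T)) for j in range(m)]
    return solve_linear(btb, btt)


def predict(hidden_outputs, weights):
    """Network output for one input: weighted sum of its hidden-unit outputs."""
    return sum(h * w for h, w in zip(hidden_outputs, weights))
```

Here each row of `B` holds the hidden-unit outputs for one training vector and `T` holds the corresponding targets (for example 1 for effective and 0 for ineffective, which is an assumed encoding).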

In the case of supervised learning, the gene silencing results (targets) have already been obtained for all input vectors, i.e., for *i* = 1, 2, ..., *TN*.

Therefore, *w*_{1}, *w*_{2}, ..., *w*_{m} are determined by solving the above linear equations.

After determining the weighting variables, we can compute the percentages of effective and ineffective siRNAs in the individual clusters.

#### Validation phase

To evaluate whether the RBF network carried out appropriate (not overtrained) classifications, we verified individual clusters in the classifications by using the validation data. The differences between the percentages of effective and ineffective siRNAs for the training and validation data are computed for individual clusters. If the percentages of effective and ineffective siRNAs for the training and validation data differ only slightly in some classification, we can infer that the classification was carried out appropriately. If, on the other hand, there are large differences between them, we must conclude that the classification was not appropriate. The differences therefore indicate the effectiveness of individual classifications by the RBF network. The summation of the differences – the entire error of this partition (cluster) number *m* – is used to compare the error of this partition with the errors of other partitions (clusters).

#### Determination of the number *m* of clusters

The number *m* of basis functions corresponds to the number of partitions (clusters) and is determined on the basis of the minimum error of the individual clusters by using the validation data. That is, after carrying out several classifications while changing the number *m* of clusters, the errors of individual clusters are checked and the number of clusters yielding the minimum error is the desired number, i.e., the optimal classification.
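The validation check and the choice of *m* can be sketched as follows. This is an illustration only: it assumes labels of 1 (effective) and 0 (ineffective), measures the per-cluster difference as an absolute difference in the percentage of effective siRNAs (the percentage of ineffective ones is its complement, so the difference is the same), and uses hypothetical function names.

```python
def effective_percentages(assignments, labels, m):
    """Percentage of effective siRNAs (label 1) in each of the m clusters."""
    counts = [0] * m
    effective = [0] * m
    for cluster, label in zip(assignments, labels):
        counts[cluster] += 1
        effective[cluster] += label
    return [100.0 * e / c if c else 0.0 for e, c in zip(effective, counts)]


def partition_error(train_pct, valid_pct):
    """Sum over clusters of the training/validation percentage differences."""
    return sum(abs(t - v) for t, v in zip(train_pct, valid_pct))


def best_cluster_count(error_by_m):
    """Number of clusters m whose partition has the smallest summed error."""
    return min(error_by_m, key=error_by_m.get)
```

In use, `error_by_m` would map each tried value of *m* to the `partition_error` obtained after retraining with that many basis functions.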

### Decision tree learning

#### Preparation

#### Attributes or features

Size: 19 nucleotide sequence

Nucleotides: A, G, C and T

#### Training instances

Effective siRNAs: 860 sequences

Ineffective siRNAs: 860 sequences

To carry out the supervised learning for effective siRNA classifications by using decision tree learning, we partitioned the training instances into two sets, one for the growth of the decision tree (training data) and the other for the decision tree pruning (validation data). The classification is carried out in two phases: the growth and the pruning of the decision tree.

#### The growth of the decision tree

- (1) If all the instances belong to a single class, there is nothing to do (except create a leaf node labeled with the name of that class).

- (2) Otherwise, for each attribute that has not already been used, calculate the information gain that would be obtained by using that attribute on the particular set of instances classified to this branch node.

The information gain can be computed in the following way (28).

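$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n} \tag{7}$$

where *p* is the number of effective siRNA sequences for this attribute and *n* is the number of ineffective siRNA sequences.

The entropy *E*(*L*) associated with the position *L* is:

$$E(L) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i) \tag{8}$$

where *v* is the number of nucleotide kinds (*i* = 1 for A, 2 for G, 3 for C and 4 for T), *p*_{i} and *n*_{i} are the numbers of effective and ineffective sequences that have nucleotide *i* at position *L*, and *L* is the sequence position, i.e., *L* = 1, 2, ..., 19.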

The information gain is therefore obtained as follows:

$$\mathit{gain}(L) = I(p, n) - E(L) \tag{9}$$

- (3) Use the attribute (position) with the greatest information gain as a branch node.

- (4) If the information gain becomes less than the specified criterion, stop the growth of the decision tree and create leaf nodes.
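The selection of the branching position in steps (2) and (3) can be sketched as below. The sketch assumes the standard ID3 definitions of information and entropy from Quinlan [26] together with equation (9); positions are 0-based in the code (position *L* = 1 in the text is index 0), and all function names are hypothetical.

```python
import math

NUCLEOTIDES = "AGCT"


def info(p, n):
    """Information of a set with p effective and n ineffective sequences (bits)."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:                       # treat 0 * log2(0) as 0
            frac = count / total
            result -= frac * math.log2(frac)
    return result


def entropy_at(position, effective, ineffective):
    """Weighted information remaining after splitting on one sequence position."""
    p, n = len(effective), len(ineffective)
    e = 0.0
    for nt in NUCLEOTIDES:
        p_i = sum(1 for s in effective if s[position] == nt)
        n_i = sum(1 for s in ineffective if s[position] == nt)
        if p_i + n_i:
            e += (p_i + n_i) / (p + n) * info(p_i, n_i)
    return e


def gain(position, effective, ineffective):
    """Information gain of a position, following equation (9)."""
    return info(len(effective), len(ineffective)) - entropy_at(position, effective, ineffective)


def best_position(effective, ineffective, used=()):
    """Unused position with the greatest information gain (step (3))."""
    candidates = [L for L in range(len(effective[0])) if L not in used]
    return max(candidates, key=lambda L: gain(L, effective, ineffective))
```

For the toy input `effective = ["AG", "AC"]` and `ineffective = ["TG", "TC"]`, the first position separates the two classes completely and is therefore chosen.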

#### Decision tree pruning

Working backwards from the bottom of the tree, the subtree starting at each nonterminal node is examined. If the error (misclassification) rate on the validation data improves by pruning it, the subtree is removed. The process continues until no improvement can be made by pruning a subtree.
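A bottom-up pruning pass of this kind can be sketched as follows. The tree representation is an assumption made for illustration: a leaf is a class label (1 = effective, 0 = ineffective) and a branch node is a dictionary with a 0-based `position`, a `children` mapping from nucleotide to subtree, and a `majority` label that would replace the subtree if it were pruned.

```python
def classify(tree, seq):
    """Follow branch nodes until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["children"].get(seq[tree["position"]], tree["majority"])
    return tree


def error_count(tree, data):
    """Number of (sequence, label) pairs that the tree misclassifies."""
    return sum(1 for seq, label in data if classify(tree, seq) != label)


def prune(tree, validation):
    """Prune children first, then replace this subtree by its majority leaf
    only if that strictly reduces the validation error."""
    if not isinstance(tree, dict):
        return tree
    pos = tree["position"]
    for nt, child in list(tree["children"].items()):
        reaching = [(s, y) for s, y in validation if s[pos] == nt]
        tree["children"][nt] = prune(child, reaching)
    if error_count(tree["majority"], validation) < error_count(tree, validation):
        return tree["majority"]
    return tree
```

Each subtree is judged only on the validation instances that reach it, because pruning it cannot change the classification of any other instance.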

### Training, validation and evaluation data of the proposed methods

#### Training and validation data

As effective data, we collected 860 effective siRNA sequences (more than 80% gene silencing at the protein level) from 503 different cDNAs reported in references in the PubMed database. As ineffective data, we randomly generated 860 siRNA sequences, because randomly generated siRNA sequences are known empirically to be less effective in gene silencing. These effective and ineffective siRNAs were used as the training and validation data. We tried partitioning the entire data set into various ratios of training data to validation data (2 to 1, 3 to 1 and 10 to 1) and used 2 to 1.
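The data preparation can be sketched as below, assuming uniformly random 19-nucleotide sequences over A, G, C and T and a single shuffled 2-to-1 split of the pooled data. Whether the authors sampled nucleotides uniformly or split each class separately is not stated, so both choices, like the function names, are assumptions.

```python
import random


def random_sirna(rng, length=19):
    """One random sequence of the given length over A, G, C and T."""
    return "".join(rng.choice("AGCT") for _ in range(length))


def split_two_to_one(items, rng):
    """Shuffle a copy of the data and split it 2 to 1 (training, validation)."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = 2 * len(shuffled) // 3
    return shuffled[:cut], shuffled[cut:]


rng = random.Random(0)
ineffective = [(random_sirna(rng), 0) for _ in range(860)]
```

With 1,720 pooled sequences this split yields 1,146 training sequences, which is consistent with the total number of sequences in the RBF cluster table.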

#### Data used to evaluate the proposed methods

The proposed methods were evaluated by using recently reported effective and ineffective siRNAs. These siRNAs were not included among the 860 effective siRNAs described above.

Reynolds et al. recently analyzed 90 siRNAs systematically, targeting every other position of 197-base regions of human *cyclophilin B* mRNA (GenBank accession no. M60875) [21]. For simplicity, human *cyclophilin B* is symbolized throughout the present paper as MG1. From the 90 analyzed siRNA sequences we selected as effective ones a set of 25 sequences for which MG1 target gene expression is less than 10% and selected as ineffective ones a set of 25 sequences for which MG1 target gene expression is greater than 48%.

Ui-Tei et al. reported 38 effective and 24 ineffective sequences for six genes: *firefly luciferase (PRL-TK), vimentin, Oct 4, EGFP, ECFP and DsRed* [22]. For simplicity, in the rest of this paper all six of these genes are symbolized as MG2.

Amarzguioui et al. reported 21 effective and 25 ineffective siRNA sequences for four genes: *hTF* (accession no. M16553), *mTF* (accession no. M26071), *PSK* (accession no. J272212) and *CSK* (accession no. NM_004383) [18]. For simplicity, in the rest of this paper these four genes are symbolized as MG3.

Takasaki et al. reported 7 effective and 7 ineffective siRNA sequences for the *Homo sapiens* *cyclin B1* gene (accession no. NM_031966) [28]. For simplicity, in the rest of this paper this gene is symbolized as MG4.

Huesken et al. reported 37 siRNAs for TC10 (accession no. BD135193), UBE2I (accession no. NM_003345) and CDC34 (accession no. NM_004359). We selected the top-ranked 12 effective and the worst-ranked 12 ineffective siRNA sequences for these genes. For simplicity, in the rest of this paper these genes are symbolized as MG5 [29].

These test data sets (MG1 to MG5) are available in Additional File 1.

## Declarations

### Acknowledgements

This article has been published as part of *BMC Bioinformatics* Volume 7, Supplement 5, 2006: APBioNet – Fifth International Conference on Bioinformatics (InCoB2006). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/7?issue=S5.

## Authors’ Affiliations

## References

1. Dykxhoorn DM, Novina CD, Sharp PA: **Killing the messenger: Short RNAs that silence gene expression.** *Nat Rev Mol Cell Biol* 2003, **4:**457–467. 10.1038/nrm1129
2. Elbashir SM, Harborth J, Lendeckel W, Yalcin A, Weber K, *et al*.: **Duplexes of 21-nucleotide RNAs mediate RNA interference in mammalian cell culture.** *Nature* 2001, **411:**494–498. 10.1038/35078107
3. Elbashir SM, Lendeckel W, Tuschl T: **RNA interference is mediated by 21- and 22-nucleotide RNAs.** *Genes Dev* 2001, **15:**188–200. 10.1101/gad.862301
4. Fire A, Xu S, Montgomery MK, Kostas SA, Driver SE, *et al*.: **Potent and specific genetic interference by double-stranded RNA in *Caenorhabditis elegans*.** *Nature* 1998, **391:**806–811. 10.1038/35888
5. Hannon GJ: **RNA interference.** *Nature* 2002, **418:**244–251. 10.1038/418244a
6. Sharp PA: **RNA interference-2001.** *Genes Dev* 2001, **15:**485–490. 10.1101/gad.880001
7. Elbashir SM, Martinez J, Patkaniowska A, Lendeckel W, Tuschl T: **Functional anatomy of siRNAs for mediating efficient RNAi in *Drosophila melanogaster* embryo lysates.** *EMBO J* 2001, **20:**6877–6888. 10.1093/emboj/20.23.6877
8. Holen T, Amarzguioui M, Wiiger MT, Babaie E, Prydz H: **Positional effects of short interfering RNAs targeting the human coagulation trigger Tissue Factor.** *Nucleic Acids Res* 2002, **30:**1757–1766. 10.1093/nar/30.8.1757
9. Kumar R, Conklin DS, Mittal V: **High-throughput selection of effective RNAi probes for gene silencing.** *Genome Res* 2003, **13:**2333–2340. 10.1101/gr.1575003
10. Mittal V: **Improving the efficiency of RNA interference in mammals.** *Nat Rev Genet* 2004, **5:**355–365. 10.1038/nrg1323
11. Khvorova A, Reynolds A, Jayasena SD: **Functional siRNAs and miRNAs exhibit strand bias.** *Cell* 2003, **115:**209–216. 10.1016/S0092-8674(03)00801-8
12. Schwarz DS, Hutvagner G, Du T, Xu Z, Aronin N, *et al*.: **Asymmetry in the assembly of the RNAi enzyme complex.** *Cell* 2003, **115:**199–208. 10.1016/S0092-8674(03)00759-1
13. Chalk AM, Wahlestedt C, Sonnhammer ELL: **Improved and automated prediction of effective siRNA.** *Biochem Biophys Res Commun* 2004, **319:**264–274. 10.1016/j.bbrc.2004.04.181
14. Naito Y, Yamada T, Ui-Tei K, Morishita S, Saigo K: **siDirect: highly effective, target-specific siRNA design software for mammalian RNA interference.** *Nucleic Acids Res* 2004, **32:**W124-W129. 10.1093/nar/gnh123
15. Santoyo J, Vaquerizas JM, Dopazo J: **Highly specific and accurate selection of siRNAs for high-throughput functional assays.** *Bioinformatics* 2004, **21:**1376–1382. 10.1093/bioinformatics/bti196
16. Teramoto R, Aoki M, Kimura T, Kanaoka M: **Prediction of siRNA functionality using generalized string kernel and support vector machine.** *FEBS Letters* 2005, **579:**2878–2882. 10.1016/j.febslet.2005.04.045
17. Truss M, Swat M, Kielbasa SM, Schafer R, Herzel H, *et al*.: **HuSiDa – the human siRNA database: an open-access database for published functional siRNA sequences and technical details of efficient transfer into recipient cells.** *Nucleic Acids Res* 2005, **33:**D108-D111. 10.1093/nar/gki131
18. Amarzguioui M, Prydz H: **An algorithm for selection of functional siRNA sequences.** *Biochem Biophys Res Commun* 2004, **316:**1050–1058. 10.1016/j.bbrc.2004.02.157
19. Hsieh AC, Bo R, Monola J, Vazquez F, Bare O, *et al*.: **A library of siRNA duplexes targeting the phosphoinositide 3-kinase pathway: determinants of gene silencing for use in cell-based screens.** *Nucleic Acids Res* 2004, **32:**893–901. 10.1093/nar/gkh238
20. Jagla B, Aulner N, Kelly PD, *et al*.: **Sequence characteristics of functional siRNAs.** *RNA* 2005, **11:**864–872. 10.1261/rna.7275905
21. Reynolds A, Leake D, Boese Q, Scaringe S, Marshall WS, Khvorova A: **Rational siRNA design for RNA interference.** *Nat Biotechnol* 2004, **22:**326–330. 10.1038/nbt936
22. Ui-Tei K, Naito Y, Takahashi F, Haraguchi T, Ohki-Hamazaki H, *et al*.: **Guidelines for the selection of highly effective siRNA sequences for mammalian and chick RNA interference.** *Nucleic Acids Res* 2004, **32:**936–948. 10.1093/nar/gkh247
23. Snove O Jr, Nedland M, Fjeldstad SH, Humberset H, Birkeland OR, Grunfeld T, Saetrom P: **Designing effective siRNAs with off-target control.** *Biochem Biophys Res Commun* 2004, **325:**769–773. 10.1016/j.bbrc.2004.10.097
24. Poggio T, Girosi F: **Networks for approximation and learning.** *Proc of IEEE* 1990, **78:**1481–1497. 10.1109/5.58326
25. Wu CH, McLarty JW: **Neural Networks and Genome Informatics.** *Elsevier Science Ltd* 2000.
26. Quinlan JR: **Induction of decision trees.** *Machine Learning* 1986, **1:**81–106.
27. Saetrom P, Snove O Jr: **A comparison of siRNA efficacy predictors.** *Biochem Biophys Res Commun* 2004, **321:**247–253. 10.1016/j.bbrc.2004.06.116
28. Takasaki S, Kotani S, Konagaya A: **An effective method for selecting siRNA target sequences in mammalian cells.** *Cell Cycle* 2004, **3:**790–795.
29. Huesken D, Lange J, Mikanin C, Weiler J, Asselbergs F, Warner J, Meloon B, Engel S, Rosenberg A, Cohen D, Labow M, Reinhardt M, Natt F, Hall J: **Design of a genome-wide siRNA library using an artificial neural network.** *Nat Biotechnol* 2005, **23:**995–1001. 10.1038/nbt1118

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.