sIR: siRNA Information Resource, a web-based tool for siRNA sequence design and analysis and an open access siRNA database
© Shah et al; licensee BioMed Central Ltd. 2007
Received: 27 July 2006
Accepted: 31 May 2007
Published: 31 May 2007
RNA interference has revolutionized our ability to study the effects of altering the expression of single genes in mammalian (and other) cells through targeted knockdown of gene expression. In this report we describe a web-based computational tool, siRNA Information Resource (sIR), which consists of a new open source database that contains validation information about published siRNA sequences and also provides a user-friendly interface to design and analyze siRNA sequences against a chosen target sequence.
The siRNA design tool described in this paper employs empirically determined rules derived from a meta-analysis of the published data; it uses a weighted scoring system that determines the optimal sequence within a target mRNA and thus aids in the rational selection of siRNA sequences. This scoring system shows a non-linear correlation with the knockdown efficiency of siRNAs. sIR provides a fast, customized BLAST output for all selected siRNA sequences against a variety of databases so that the user can verify the uniqueness of the design. We have pre-designed siRNAs for all the known human genes (24,502) in the Refseq database. These siRNAs were pre-BLASTed against the human Unigene database to estimate the target specificity and all results are available online.
Although most of the rules for this scoring system were influenced by previously published rules, the weighted scoring system provides better flexibility in designing an appropriate siRNA than the un-weighted scoring system. sIR is not only a comprehensive tool for designing siRNA sequences and looking up pre-designed siRNAs, but also a platform where researchers can share information on siRNA design and use.
Nobel laureates Andrew Fire and Craig Mello discovered that the injection of double-stranded RNA (dsRNA) into the nematode C. elegans initiated a potent sequence-specific response that robustly interfered with the expression of the gene containing the same sequence as the dsRNA. RNAi is mediated through dsRNA, in a process similar to post-transcriptional gene silencing (PTGS) in plants and quelling in fungi. PTGS is a gene regulatory process in which the steady-state level of a specific mRNA is reduced through sequence-specific degradation of the transcribed mRNA. It is thought that this process evolved as a defense mechanism against RNA viruses. In organisms capable of RNAi, long dsRNA that enters the cytoplasm is cleaved by the RNase III-like enzyme Dicer into small interfering RNAs (siRNAs) about 21–23 nucleotides in length. These siRNAs assemble into multiprotein RNA-induced silencing complexes, which then bind to the target mRNA using the antisense siRNA strand as a guide and cleave the target mRNA. Higher metazoans have evolved different defense mechanisms against RNA viruses, and initiate the interferon response when dsRNA longer than 30 bp is detected in the cytoplasm. However, synthetic oligonucleotides 21–23 bp in length do mediate RNAi in these cells without an interferon response.
RNAi technology has proven its usefulness in many fields, including cancer, gene therapeutics and functional genomics [3–5]. It is currently the most widely used gene-silencing technique in functional genomics. Because of its wide range of applications and popularity, we sought to create a tool that can help design efficient siRNAs quickly. siRNA Information Resource ('sIR') is a web-based computational tool that aids in selecting the target sequence for siRNA within a specified target RNA. sIR provides an analysis platform which includes a weighted scoring system to predict siRNA efficacy as well as an open-source database that contains effectiveness information about siRNAs that have been published or tested in our laboratories. Considering the importance of siRNA technology, we have pre-computed siRNAs for all the known human genes. This pre-computed database provides a list of siRNAs with the highest possible score (greatest knockdown) and the minimum number of BLAST hits against the Human Unigene database (greatest specificity).
sIR was built on a Linux-based server. The databases involved were implemented locally, using a PostgreSQL database, to speed up the design process. PHP, Perl (CGI) and HTML scripts were written to communicate with the databases through a web-based user interface. Perl and BioPerl were also used to implement the design algorithms and to parse BLAST output. MATLAB statistical software was used to analyze the siRNA data and to refine the weighted scoring algorithm. sIR uses a variety of information from genetic databases such as RefSeq, SOURCE and NCBI, and similarity searching software such as BLAST, to confirm the uniqueness of siRNA designs. Pre-computation analysis was performed on a Linux-based cluster with 29 nodes. The siRNAs were computed in parallel using Sun Grid Engine 5.3 software for all the known human genes in the RefSeq database.
Results and discussion
Target design algorithm
The target design algorithm uses various default and user-specified parameters to screen the mRNA sequence and to identify and rank potential siRNA target sequences. If the Open Reading Frame (ORF) parameter is selected, the program pre-processes the input sequence by selecting an ORF 50 nucleotides before and after the 5' and 3' ends of the mRNA respectively. It then searches for target sequences matching the user-specified pattern. If this pattern is not found, it looks for the following patterns in hierarchical order: AAN(19)TT, AAN(21), NAN(19)TT, NAN(21), where 'N' means any of the four bases and the number in parentheses after a nucleotide is the number of times that nucleotide repeats in the sequence. For all the patterns found, the program assigns a score to the target sequence and displays the list of sequences in descending order of score. The user can choose one or more of these sequences to perform a customized similarity search using BLAST against a variety of databases such as Unigene (Human or Mouse), RefSeq (Human or Mouse), Human Genomic or Ensembl cDNA transcripts. The user can choose a BLAST method, NCBI BLAST or WU BLAST, and display results above a given similarity, e.g., similarity > 75% (16 bp/21 bp). NCBI BLAST uses a word size of 7 and is much faster than the more sensitive WU BLAST, which uses a word size of 1.
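The hierarchical pattern search described above can be sketched in a few lines. This is a minimal illustration rather than the sIR implementation (which the text says was written in Perl): the function names are hypothetical, the input is assumed to use the DNA alphabet, and the fallback behaviour (stop at the first pattern that yields any hit) is one reading of "hierarchical order".

```python
import re

# Fallback patterns in the order given in the text.
# N(k) means "any base, k times"; sequences use the DNA alphabet (T, not U).
DEFAULT_PATTERNS = ["AAN(19)TT", "AAN(21)", "NAN(19)TT", "NAN(21)"]


def pattern_to_regex(pattern):
    """Translate shorthand such as 'AAN(19)TT' into a regular expression."""
    body = re.sub(r"N\((\d+)\)", r"[ACGT]{\1}", pattern)
    # Any remaining bare N stands for a single arbitrary base.
    return body.replace("N", "[ACGT]")


def find_targets(mrna, user_pattern=None):
    """Return (pattern, hits) for the first pattern that matches.

    hits is a list of (0-based start, matched sequence) pairs. A lookahead
    is used so that overlapping candidate sites are all reported.
    Assumption: the search stops at the first pattern with any hit.
    """
    seq = mrna.upper().replace("U", "T")
    patterns = ([user_pattern] if user_pattern else []) + DEFAULT_PATTERNS
    for pattern in patterns:
        regex = re.compile("(?=(" + pattern_to_regex(pattern) + "))")
        hits = [(m.start(), m.group(1)) for m in regex.finditer(seq)]
        if hits:
            return pattern, hits
    return None, []
```

For example, `find_targets("GG" + "AA" + "C" * 19 + "TT" + "GG")` reports a single AAN(19)TT site starting at offset 2, while a sequence with no TT overhang falls through to AAN(21).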
Recent studies have shown that an overall identity search may not detect all the off-target genes. According to Birmingham et al. (2006), off-targeting is associated with the presence of one or more 3' untranslated region (UTR) matches complementary to the hexamer or heptamer seed (positions 2–7 or 2–8) of the antisense strand of the siRNA. In this tool, users can further refine their alignment search by looking for exact matches of the seed region (positions 12–18 of the 19 nt sense strand, complementary to positions 2–8 of the antisense strand) in the 3' UTR mRNA regions from the human RefSeq database. The accession numbers with at least four seed matches in the 3' UTR mRNA are then listed in descending order of the number of seed matches found.
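The seed-match report described above can be sketched as follows. This is an illustrative sketch only: the function names and the dictionary of 3' UTR sequences keyed by accession are assumptions, and the sequences are assumed to use the DNA alphabet. The seed site is taken as positions 12–18 (1-based) of the 19 nt sense strand, as stated in the text.

```python
def seed_site(sense):
    """Return positions 12-18 (1-based) of a 19 nt sense strand."""
    if len(sense) != 19:
        raise ValueError("expected a 19 nt sense strand")
    return sense.upper().replace("U", "T")[11:18]


def count_overlapping(text, word):
    """Count occurrences of word in text, allowing overlaps."""
    count, start = 0, text.find(word)
    while start != -1:
        count += 1
        start = text.find(word, start + 1)
    return count


def seed_match_report(sense, utrs, min_matches=4):
    """List (accession, count) pairs with at least min_matches exact
    seed matches, in descending order of count.

    utrs is a hypothetical mapping of accession -> 3' UTR sequence.
    """
    site = seed_site(sense)
    hits = []
    for accession, utr in utrs.items():
        n = count_overlapping(utr.upper().replace("U", "T"), site)
        if n >= min_matches:
            hits.append((accession, n))
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

The default threshold of four matches mirrors the listing rule in the paragraph above; a caller could lower it to inspect weaker candidates.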
A database of ~1000 experimentally measured siRNA target sequences for ~400 different transcripts was populated as a training set for the system. These siRNA sequences were collected from various laboratories at the University of Texas Southwestern Medical Center, open source published databases and from published papers.
Scoring system algorithm
A composite score is computed using rules we derived from various sources to rationalize the design process and to improve siRNA design success. The rules were compiled from various research papers [8, 10–13]. For example, for the optimum design, it was found that the penultimate nucleotide of the antisense siRNA, which is complementary to position 2 of the 23 nt target sequence, should always be complementary to the targeted sequence. Primarily for simplification of chemical synthesis, TT is used. Hence the chance of having an efficient siRNA is increased if position 2 of the target sequence is 'A'. In addition, moderately low GC content (30–55%) contributes to efficiency. Some studies have shown that elevated GC content at the 5' end of the siRNA target sequence improves siRNA efficiency, whereas others have found no correlation [11, 12]. Analyses based on the individual positions of the siRNA target sequence have shown that 'G/C' at position 1, 'A/U' at positions 15–19, 'A' at positions 3 and 6, and 'U' at positions 10 and 13 of the sense strand positively affect siRNA efficiency, whereas 'G/C' at position 19, 'G' at position 13 and 'A/U' at position 1 negatively affect siRNA efficiency [10–13].
Earlier studies as well as many recent ones have tried to find a correlation between the secondary structure of sequences and the potency of siRNA activity. Holen et al. (2002) reported that there was no correlation between Mfold-predicted secondary structures and siRNA efficiency. However, some recent studies have shown that such a correlation may exist. Overhoff et al. (2005) have shown that siRNA efficacy is related to the local RNA target structure, and Poliseno et al. (2004) have reported that the energy profiling of siRNAs can be used to predict siRNA activity [15, 16]. An attempt was made to test whether the siRNA potency data correlated with the minimum free energy of the secondary structure of the siRNA. The Mfold server (version 3.2) was used to predict the minimum free energy values, dG, for all the 19-nucleotide siRNA sequences. No correlation was found between the two. The relationship between secondary structure and siRNA potency remains debatable, and the recent studies cannot be ignored. However, for simplicity, we have not considered this factor in our design of siRNA sequences.
In order to calculate the final score composed of the above mentioned parameters, it was important to introduce weights to rank the parameters in the order of their significance. The following fifteen criteria were tested and weights were calculated using the training dataset for each of them: I) A at position 2 of target siRNA sequence (21 nucleotides) II) A/U at position 1 of sense strand (19 nucleotides) III) G/C at position 1 of sense strand IV) A at position 6 of sense strand V) G at position 13 of sense strand VI)-X) A/U at positions 15–19 of sense strand XI) G/C at position 19 of sense strand XII) Moderate GC content XIII) A at position 3 of sense strand XIV) U at position 10 of sense strand XV) U at position 13 of sense strand [8, 10–13].
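The fifteen criteria listed above lend themselves to a simple feature check followed by a weighted sum. The sketch below is illustrative only: the function names are hypothetical, the GC window of 30–55% is taken from the rule description earlier in this section and is assumed to apply to the 19 nt sense strand, and no weights are hard-coded because the fitted values are not reproduced here. The caller supplies a weight for each criterion name.

```python
def gc_percent(seq):
    """Percentage of G and C bases in seq."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def criteria_met(target, sense):
    """Evaluate the fifteen criteria for one candidate.

    target is the full target sequence (position 2 is checked for 'A');
    sense is the 19 nt sense strand. U and T are treated as equivalent.
    """
    t = target.upper().replace("U", "T")
    s = sense.upper().replace("U", "T")
    if len(s) != 19:
        raise ValueError("expected a 19 nt sense strand")
    met = {
        "A at target position 2": t[1] == "A",
        "A/U at sense position 1": s[0] in "AT",
        "G/C at sense position 1": s[0] in "GC",
        "A at sense position 3": s[2] == "A",
        "A at sense position 6": s[5] == "A",
        "U at sense position 10": s[9] == "T",
        "G at sense position 13": s[12] == "G",
        "U at sense position 13": s[12] == "T",
        "G/C at sense position 19": s[18] in "GC",
        # Assumption: "moderate" means 30-55% GC over the sense strand.
        "Moderate GC content": 30.0 <= gc_percent(s) <= 55.0,
    }
    for pos in range(15, 20):
        met[f"A/U at sense position {pos}"] = s[pos - 1] in "AT"
    return met


def raw_score(met, weights):
    """Sum the caller-supplied weights of every satisfied criterion."""
    return sum(weights.get(name, 0.0) for name, ok in met.items() if ok)
```

Criteria that hurt efficiency would simply carry negative weights in the `weights` mapping, so the same sum handles both rewarding and penalizing rules.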
Table 1: Weights of the scoring system
Criteria for the scoring system
A/U at position 1 of sense strand
G/C at position 1 of sense strand
A at position 6 of sense strand
U at position 10 of sense strand
G at position 13 of sense strand
U/T at position 13 of sense strand
A/U at position 16 of sense strand
A/U at position 17 of sense strand
A/U at position 18 of sense strand
A/U at position 19 of sense strand
G/C at position 19 of sense strand
Where Score_min (= -4.08, the sum of all the negative values in Table 1) and Score_max (= 7.04, the sum of all the positive values in Table 1) are the minimum and maximum possible score values, respectively, and FinalScore_siRNA is the score obtained by the siRNA sequence after normalization of the raw score.
Summary of test set scores
Least potent avg score
Most potent avg score
Correlation between efficiency and Inhibitory activity
Low efficiency avg score
Medium efficiency avg score
High efficiency avg score
Very high efficiency avg score
One-way ANOVA test p-value
Although all the algorithms performed fairly well with the two large data sets, the sIR algorithm was consistently slightly better, attaining the highest correlation and the lowest p-values. Most of the rules for this scoring system were influenced by the 'Tuschl rules' and 'Rational design', but the weighted scoring system provides better flexibility in designing an appropriate siRNA when compared to the un-weighted scoring systems.
The database or resource mode allows the researcher to search for existing and previously tested designs. The database includes functional as well as non-functional siRNA sequences. There are several other repositories of siRNA sequences, including popular databases such as siRecords and HuSiDa [19, 20]. HuSiDa currently consists of > 1100 siRNA sequence records and siRecords consists of > 4000 siRNA sequence records. HuSiDa was mainly designed to store functional human siRNA sequences with efficiency > 50%. However, it is of great research value to store information on functional as well as non-functional siRNAs, as it will help other investigators learn more about the nature of siRNA sequences. Both the sIR resource and siRecords can store siRNA sequences from different species along with their variable efficacies. The sIR resource also provides additional annotations, such as a miRNA seed region search to predict off-target activity, along with the sIR score. Some of the siRNA sequences are hyperlinked to images depicting their efficiency by Western blot and other methods. The database can be updated using a password-protected input form that accepts the data and images for new siRNA designs and uploads them to the database. Currently the sIR database consists of only ~1200 siRNA sequences. It should also be noted that the data submitted to the sIR resource database are user-driven, and hence there may be a user-related bias regarding the exact effectiveness of siRNA sequences. Users are therefore encouraged to submit supporting research data, such as images of Western blot analyses and plots, to be viewed by others. The sIR resource is envisioned to be a free, central repository where investigators can share their siRNA designs and results.
sIR pre-computed siRNA database
Parameters used for the precomputation of siRNAs.
Percent GC content range: Moderate range, 30% to 50%
Multiple nucleotide run (runs of 4 or more nucleotides)
Open reading frame: The region between 50 nucleotides downstream of the 5' end and 50 nucleotides upstream of the 3' end of the mRNA was considered for the design.
Score > 50: siRNAs with scores > 50 were retained.
BLAST hit cutoff: Number of BLAST hits <= 2. siRNAs with BLAST hits <= 2 and percent homology > 80% were retained.
In order to retrieve these pre-designed siRNAs the user can query the pre-computed siRNA mode with an accession number. The program returns the top 10 possible designs of the siRNAs for that particular accession number after sorting them with respect to minimum number of BLAST hits and then score. In an effort to avoid off-target effects, the user can choose to filter out siRNA sequences which have greater than three seed region matches within the 3' UTR mRNA sequences from the RefSeq database.
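The filtering and ranking of pre-computed designs can be sketched as below. This is an illustration under stated assumptions, not the sIR code: the `Candidate` record and function names are hypothetical, `blast_hits` is assumed to already count only hits above 80% homology, and runs of four or more identical nucleotides are assumed to be excluded (the parameter list names the run length but not the action taken).

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    sequence: str    # siRNA target sequence (DNA or RNA alphabet)
    score: float     # normalized sIR score
    blast_hits: int  # assumed: hits above 80% homology


def gc_percent(seq):
    """Percentage of G and C bases in seq."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def passes_filters(c):
    """Apply the precomputation parameters to one candidate."""
    # Assumption: a run of 4+ identical bases disqualifies the design.
    has_run = re.search(r"(.)\1{3}", c.sequence.upper()) is not None
    return (
        30.0 <= gc_percent(c.sequence) <= 50.0
        and not has_run
        and c.score > 50
        and c.blast_hits <= 2
    )


def top_designs(candidates, limit=10):
    """Return up to `limit` designs, fewest BLAST hits first, then
    highest score, matching the sort order described in the text."""
    kept = [c for c in candidates if passes_filters(c)]
    kept.sort(key=lambda c: (c.blast_hits, -c.score))
    return kept[:limit]
```

Sorting on the tuple `(blast_hits, -score)` keeps specificity as the primary key and uses the score only to break ties among equally specific designs.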
sIR is not only a comprehensive tool for designing siRNA sequences and looking up pre-designed siRNAs, but also a platform where researchers can share information on siRNA design and use. It is difficult to find information about siRNA sequences which failed or had poor knockdown. That information is nevertheless important, as it helps researchers avoid repeating unsuccessful designs and enables computations like those herein. As of March 2007, the resource database consists of approximately 1200 entries comprising information on functional as well as non-functional siRNAs, which can be very important for future discoveries. This web-based online tool, along with the pre-computed siRNA database, saves the investigator considerable time.
Studies have shown that there is a relationship between siRNA sequence and the RNAi effect and that the presence of certain bases in a particular position contributes more to the efficiency of knockdown. The weighted scoring system of sIR was able to assign weights to different parameters which affect the siRNA potency. This validated system includes a suite of tools and databases that will allow researchers to rapidly and efficiently select siRNA designs with a priori specificity and efficacy estimates.
The authors wish to thank Noriaki Sunaga, George Demartino, Cezary Wozchik, Jerry Shay and Adi Gazdar for their help with building the training set for this software by providing siRNA target sequence information from their laboratories. We also wish to thank Trey Fondon, David Trusty, Alexander Pertsemlidis, Richard Scheuermann and Zhongxue Chen for their valuable advice and guidance with the computational and statistical aspects of the analysis and software. We also appreciate the valuable advice provided by the reviewers of this paper. This work was supported by NIH/NCI SPORE grant 50CA70907, Western Regional Center of Excellence in Biodefense and Emerging Infectious Disease Research NIH/NIAID U54 AI057156 and NIH/NCI grant IR01CA096901.
- Fire A, Xu S, Montgomery MK, Kostas SA, Driver SE, Mello CC: Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature 1998, 391: 806–811. 10.1038/35888
- Tuschl T: RNA interference and small interfering RNAs. ChemBioChem 2001, 2: 239–245. 10.1002/1439-7633(20010401)2:4<239::AID-CBIC239>3.0.CO;2-R
- Dykxhoorn DM, Novina CD, Sharp P: Killing the Messenger: Short RNAS that silence gene expression. Nature Reviews Molecular Cell Biology 2003, 4: 457–467. 10.1038/nrm1129
- Schütze N: siRNA technology. Molecular and Cellular Endocrinology 2004, 213: 115–119. 10.1016/j.mce.2003.10.078
- Kalota A, Shetzline S, Gewirtz A: Progress in the development of nucleic acids therapeutics for cancer. Cancer Biol Ther 2004, 3(1):4–12.
- Dorsett Y, Tuschl T: siRNAs: Applications in functional genomics and potential as therapeutics. Nature Reviews Drug Discovery 2004, 3: 318–329. 10.1038/nrd1345
- Diehn M, Sherlock G, Binkley G, Jin H, Matese JC, Hernandez-Boussard T, Rees CA, Cherry JM, Botstein D, Brown PO, Alizadeh AA: SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data. Nucleic Acids Res 2003, 31(1):219–223. 10.1093/nar/gkg014
- The siRNA user guide [http://www.rockefeller.edu/labheads/tuschl/sirna.html] (revised May, 2004)
- Birmingham A, Anderson EM, Reynolds A, Ilsley-Tyree D, Leake D, Fedorov Y, Baskerville S, Maksimova E, Robinson K, Karpilow J, Marshall WS, Khvorova A: 3' UTR seed matches, but not overall identity, are associated with RNAi off-targets. Nature Methods 2006, 3: 199–204. 10.1038/nmeth854
- Reynolds A, Leake D, Boese Q, Scaringe S, Marshall WS, Khvorova A: Rational siRNA design for RNA interference. Nature Biotechnology 2004, 22: 326–330. 10.1038/nbt936
- Elbashir SM, Martinez J, Patkaniowska A, Lendeckel W, Tuschl T: Functional anatomy of siRNAs for mediating efficient RNAi in Drosophila melanogaster embryo lysate. EMBO J 2001, 20(23):6877–6888. 10.1093/emboj/20.23.6877
- Ui-Tei K, Naito Y, Takahashi F, Haraguchi T, Ohki-Hamazaki H, Juni A, Ueda R, Saigo K: Guidelines for the selection of highly effective siRNA sequences for mammalian and chick RNA interference. Nucleic Acids Res 2004, 32(3):936–948. 10.1093/nar/gkh247
- Amarzguioui M, Prydz H: An algorithm for selection of functional siRNA sequences. Biochemical and Biophysical Research Communications 2004, 316(4):1050–1058. 10.1016/j.bbrc.2004.02.157
- Holen T, Amarzguioui M, Wiiger MT, Babaie E, Prydz H: Positional effects of short interfering RNAs targeting the human coagulation trigger tissue factor. Nucleic Acids Res 2002, 30: 1757–1766. 10.1093/nar/30.8.1757
- Overhoff M, Alken M, Kretschmer-Kazemi Far R, Lemaitre M, Lebleu B, Sczakiel G, Robbins I: Local RNA target structure influences siRNA efficacy: A systematic global analysis. J Mol Biol 2005, 348: 871–881. 10.1016/j.jmb.2005.03.012
- Poliseno L, Evangelista M, Mercatanti A, Mariani L, Citti L, Rainaldi G: The energy profiling of short interfering RNAs is highly predictive of their activity. Oligonucleotides 2004, 14: 227–232. 10.1089/oli.2004.14.227
- Zuker M: Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 2003, 31: 3406–3415. 10.1093/nar/gkg595
- Huesken D, Lange J, Mickanin C, Weiler J, Asselbergs F, Warner J, Meloon B, Engel S, Rosenberg A, Cohen D, Labow M, Reinhardt M, Natt F, Hall J: Design of a genome-wide siRNA library using an artificial neural network. Nature Biotechnology 2005, 23: 995–1001. 10.1038/nbt1118
- Ren Y, Gong W, Xu Q, Zheng X, Lin D, Wang Y, Li T: siRecords: an extensive database of mammalian siRNAs with efficacy ratings. Bioinformatics 2006, 22(8):1027–1028. 10.1093/bioinformatics/btl026
- Truss M, Swat M, Kielbasa SM, Schäfer R, Herzel H, Hagemeier C: HuSiDa – the human siRNA database: an open-access database for published functional siRNA sequences and technical details of efficient transfer into recipient cells. Nucleic Acids Res 2005, 33: D108-D111. 10.1093/nar/gki131
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.