 Research Article
 Open Access
 Published:
A systematic evaluation of nucleotide properties for CRISPR sgRNA design
BMC Bioinformatics volume 18, Article number: 297 (2017)
Abstract
Background
CRISPR is a versatile gene editing tool which has revolutionized genetic research in the past few years. Optimizing sgRNA design to improve the efficiency of target/DNA cleavage is critical to ensure the success of CRISPR screens.
Results
By borrowing knowledge from oligonucleotide design and nucleosome occupancy models, we systematically evaluated candidate features computed from a number of nucleic acid, thermodynamic and secondary structure models on real CRISPR datasets. Our results showed that taking into account positiondependent dinucleotide features improved the design of effective sgRNAs with area under the receiver operating characteristic curve (AUC) >0.8, and the inclusion of additional features offered marginal improvement (∼2% increase in AUC).
Conclusion
Using a machinelearning approach, we proposed an accurate prediction model for sgRNA design efficiency. An R package predictSGRNA implementing the predictive model is available at http://www.ams.sunysb.edu/~pfkuan/softwares.html#predictsgrna.
Background
Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)/Cas system is a heritable and adaptive prokaryotic immune system that protects cells by destroying foreign genetic elements [1]. Over the past few years, CRISPR has emerged as a powerful gene editing technology [2, 3]. CRISPR consists of a single guide RNA (sgRNA) and an enzyme called Cas9. The sgRNA is composed of a short synthetic RNA (approximately 20 base pairs (bp), known as spacer target) located within a Nbp scaffold. The spacer target is designed to bind to a specific sequence in the genome, whereas the Cas9 protein acts as a biomolecular scissor. This system has proven to be a powerful tool for studying individual gene function and for genome engineering.
The design of sgRNA is an important aspect to ensure the success of CRISPRCas9 screens. It is desirable to design sgRNA libraries which have maximum ontarget and minimum offtarget effects. The binding specificity of the sgRNA is determined by the 20 bp spacer target and a protospacer adjacent motif (PAM) sequence (generally NGG or NAG) on the genome. Once the sgRNA binds to the target sequence, the Cas9 nuclease cuts 3bp upstream of the PAM sequence. Different groups have studied the sequence features of spacer target sites that predict sgRNA ontarget efficiency [4–7]. In particular, [5] investigated the positiondependent sequence on sgRNA efficiency and whether these features could reproducibly predict sgRNA efficiency in several publicly available CRISPR datasets. They proposed a predictive model using the positiondependent mononucleotide composition across a 40 bp sequence encompassing 5’ flanking, spacer target and 3’ flanking region; and further demonstrated that their model performed better than the model of [4]. On the other hand, [6, 7] proposed a predictive model based on gradientboosted regression trees using positiondependent and independent sequence properties, location of the sgRNA within the protein and melting temperatures.
Aspects of sgRNA design share similarities to oligonucleotide designs used for microarrays. In both cases, optimal oligonucleotide design aims to increase binding sensitivity and specificity while minimizing off target hybridization. A position dependent sequence bias has been observed in the design of oligonucleotides in Affymetrix microarrays [8], whereas in our earlier work [9] we showed that the thermodynamic and secondary features of the oligonucleotides affect the hybridization intensities in Nimblegen arrays. In addition, [6, 7] investigated position dependent and independent features, position of the guide within the genes, interaction with the PAM sequence and melting temperatures, and showed that these features improved the prediction model in CRISPR/Cas9 screens; whereas microhomology features did not improve the prediction. In this paper, we computed a comprehensive list of features of the target sequence from a number of nucleic acid, thermodynamic, and secondary structure models by adopting some ideas of microarray designs. In a similar manner as [6, 7], we systematically characterized the effect of these features on the efficiency of sgRNA design, and seek to understand if the inclusion of these features improves the design of effective sgRNAs in CRISPR/Cas9 knockout screens.
Methods
We used the sets of efficient and inefficient sgRNAs from the CRISPR/Cas9 screens of [10] and [11] compiled by [5]. The first dataset consists of 731 efficient and 438 inefficient sgRNAs targeting ribosomal genes [10], the second dataset consists of 671 efficient and 237 inefficient sgRNAs targeting nonribosomal genes [10] and the third dataset consists of 830 efficient and 234 inefficient sgRNAs targeting essential genes in mouse embryonic stem cell (mESC) line, JM8 [11]. The procedures for identifying efficient and inefficient sgRNAs were used exactly as described in [5]. Spacer lengths in the reported studies were 20 bp [10] and 19 bp [11]. Using these sets of sgRNAs, we computed primary sequence, thermodynamic, and secondary structures as candidate features. Further details are provided below.
DNA sequence candidate features
Positiondependent nucleotide composition
Similar to [5], we created vectors of positiondependent mononucleotide composition (PD Mono) for the 40 bp long sequences comprised of the spacer targets, and 5’ and 3’ flanking regions. In addition, we extracted positiondependent dinucleotide composition (PD Dinuc) for these 40 bp sequences and computed the single and dinucleotide frequencies (Freq) for the spacer target. Since positions 32 and 33 were part of the PAM sequence (GG), they were excluded from the analysis.
Thermodynamics and secondary structure properties of [9] (Thermo)
Motivated by our earlier work which studied the relationship between oligonucleotide properties and hybridization signal intensities in microarray design [9], we computed the thermodynamic properties: melting temperature (T _{ m }), GC content, entropy change (ΔS), enthalpy change (ΔH), free energy change (ΔG); and secondary structures: longest polyN, repetitive sequence (repeat), length of a potential stemloop (LSL) and minimum energy folding (MEF). T _{ m } was computed according the formula
where [Na ^{+}] was assumed to be 0.2M [12]. ΔG, ΔH and ΔG were calculated by summing the respective entropy, enthalpy and free energy parameters of each dinucleotide, including the initiation parameters and penalty for self complementary duplexes according to the positiondependent nearest neighbor approach as described [13]. These parameters were provided in Tables 1 and 2 of [13]. MEF was computed using the hybridssmin program in OligoArrayAux package, whereas LSL was computed using the palindrome function in the EMBOSS package. Longest polyN and repeat were calculated as previously described [9]. These properties were computed for the spacer target sequence.
DNA secondary structures based on dinucleotide and tetra nucleotide properties of [14] and [15] (Packer)
Following a previously described approach [16], we computed the minimum, maximum and average values of both the tetranucleotide energy and flexibility scores as described [15]. These scores were given in Tables 3 and 4 of [15]. In addition, we computed the minimum, maximum and average values of the dinucleotide roll, twist, slide and shift scores as described [14]. The dinucleotide values of these properties were given in Tables 1, 2 and 3 of [14]. These scores were representations of the threedimensional DNA structure and anisotropic flexibility [14]. Similar to above, we computed these properties for the spacer target sequence.
Physiochemical properties of [17] (PhyChem)
We adapted the approach described by [17] which was developed for predicting nucleosome occupancy and computed the 12 physiochemical properties (Apillicity, basestacking, BDNA twist, bendability, DNA bending stiffness, DNA denaturation, duplex disrupt energy, duplex free energy, propeller twist, protein deformation, proteinDNA twist and ZDNA). For each property, we computed the minimum, maximum and average dinucleotide scores for the spacer target sequence. The dinucleotide values of the 12 physicochemical properties were given in Table 1 of [17].
Pseudo ktuple nucleotide composition of [18] (PseKNC)
The PseKNC model was also originally developed for predicting nucleosome occupancy by taking into account global sequenceorder effects. PseKNC represents the DNA sequence as vectors \(\left [\frac {f_{1}}{d},\hdots,\frac {f_{4^{k}}}{d},\frac {w\theta _{1}}{d},\hdots,\frac {w\theta _{\lambda }}{d}\right ]^{T}\) where \(d={\sum \nolimits }_{j=1}^{4^{k}}f_{j}+w{\sum \nolimits }_{j=1}^{\lambda } \theta _{j}, f_{j}\)’s are the ktuple nucleotide frequencies and
m is the number of local DNA properties considered, P _{ t }(r _{ s } r _{ s+1}) and P _{ t }(r _{ s+j } r _{ s+j+1}) are the score of the tth DNA local structural property for dinucleotide r _{ s } r _{ s+1} and r _{ s+j } r _{ s+j+1} at position s and s+j, respectively. λ is the order of correlations along the DNA sequence and w is the weight factor. Our candidate k, λ and w took values of k=2,3,…,6, λ=1,2,…,15, and w=0,0.1,0.2,…,1. We considered the following strategy to choose the optimal parameters for the PseKNC model. A three way cross validation was performed on each dataset using elastic net [19]. The parameters corresponding to the PseKNC model with the largest average area under the receiver operating characteristic curve (AUC) were selected for subsequent analysis. Based on this criterion, we set k=2, λ=1 and w=0.5. Similar to [18], we considered m=6 DNA local structural properties which were divided into local translational (rise, slide and shift) and angular (twist, roll and tilt).
Optimal pairwise alignment (Align)
We computed the optimal global pairwise alignment scores between the seed region and scaffold using the NeedlemanWunsch algorithm [20] which served as a measure of the potential of the k PAMproximal seed region of the spacer target to interact with the scaffold sequence. The seed region was defined as the immediate k nucleotides next to the PAM sequence. We considered k=5,…,L, where L is the length of spacer target.
Results and discussion
For each dataset, we computed a score for every feature as a measure of strength of association with sgRNA efficiency. If the feature was a binary variable, a log odds ratio between efficient and inefficient sgRNAs was computed. If the feature was a continuous variable, twosample tstatistic was computed. We divided the features into 8 classes (1) positiondependent mononucleotide (PD Mono), (2) positiondependent dinucleotide (PD Dinuc), (3) frequencies of mono and dinucleotides (Freq) (4) optimal pairwise alignment between spacer target and scaffold (Align) (5) thermodynamics and secondary structures of [9] (Thermo) (6) secondary structures of [14, 15] (Packer) (7) physiochemical properties (PhyChem) of [17] and (8) pseudo ktuple nucleotide composition of [18] (PseKNC). We found that most of the features were consistently associated with sgRNA efficiency across datasets (Figs. 1 and 2).
Candidate feature ranking
To rank the contribution of each feature to the efficiency of sgRNA design, we fitted a logistic regression model within each dataset using the binary sgRNA efficiency indicator as the response and the features as predictors. The Bayesian Information Criterion (BIC) for the fitted model was computed. The features were ranked by the BIC scores and the top 10 most important features were shown in Additional file 1: Figure S1. The top ranked feature based on average BIC scores across the three datasets was the 16th feature from PseKNC model. This feature is a function of TT dinucleotide frequency. In addition, we computed the area under receiver operating characteristic curves (AUCs) for continuous features. The top 10 features ranked by AUC were shown in Fig. 3, in which the 16th feature from the PseKNC model was also ranked number one. The third measure we considered for feature ranking was the permutation based variable importance score from the random forest prediction algorithm. Random forest [21] is a nonparametric ensemble approach based on a large number of classification trees trained on bootstrap samples. The permutation based variable importance score of a feature is defined as the difference in prediction accuracy before and after permuting this feature, averaging over all trees. We used the unscaled version of variable importance score as recommended by [22, 23] to avoid bias due to number of trees grown. The top 10 features ranked by variable importance are shown in Additional file 1: Figure S2. Based on these results, the frequencies of T and TT had the strongest association with sgRNA efficiency, in which higher frequencies of T and TT were associated with decreased efficiency.
Predictive modeling
To assess the contribution of the 8 different feature classes in prediction sgRNA efficiency, we formed all possible combinations of feature classes (\({\sum \nolimits }_{i=1}^{8}{8\choose i}=255\) combinations). We adapted the strategy in [5] in constructing and evaluating the predictive model for sgRNA efficiency:

1.
To evaluate intraplatform consistency within the same class of genes, we performed 3way cross validation within dataset 1 (sgRNA targeting ribosomal genes) from [10]. We randomly split dataset 1 into 3 parts of equal sample size, trained the model on two parts (training set) and evaluated the performance of the resulting predictive model on the remaining part (test set). This process was repeated 3 times by leaving out a different test set, and results were averaged over 10 iterations of random sampling.

2.
To evaluate intraplatform consistency across different classes of genes, the predictive algorithm was trained on dataset 1 (ribosomal genes) and tested on dataset 2 (nonribosomal genes).

3.
To evaluate interplatform consistency, the predictive algorithm was trained on datasets 1 and 2 (ribosomal + nonribosomal genes) from [10] and tested on dataset 3 (mESC essential genes) from [11].
The elastic net algorithm [19] was used in constructing the predictive model on the training set based on 10 fold crossvalidation. Since the features we considered in this paper were functions of the nucleotide composition, they were correlated and the elastic net algorithm automatically selected nonredundant informative features. The objective function of elastic net consists of a loss function + penalty:
where \(\mathbf {\beta }_{1}={\sum \nolimits }_{j=1}^{p}\beta _{j}\) and \(\mathbf {\beta }^{2}={\sum \nolimits }_{j=1}^{p}\beta _{j}^{2}\).
We evaluated the performance on the test set in terms of AUC. The optimal cutpoints were determined by maximizing the Youden index(J) =Se+Sp −1, where Sensitivity(Se)\(=\frac {TP}{TP+FN}\) and Specificity(Sp)\(=\frac {TN}{TN+FP}\). The results were shown in Tables 1, 2 and 3. For each test set, we reported these performance measures for the predictive models constructed using each of the 8 feature classes, as well as the combinations of feature classes with the maximum AUC (Comb Feature). Across all comparisons, integrating multiple feature classes showed improvements in terms of AUC compared to positiondependent mononucleotide models (PD Mono) in [5]. Among the 8 individual feature classes, positiondependent dinucleotide models (PD Dinuc) consistently outperformed other feature classes in predicting sgRNA efficiency and were close to results from the combination of feature classes models in all 3 scenarios. A similar pattern was also observed in [6, 7], in which they showed that position dependent dinucleotide features yielded the largest average Gini importance among the set of features considered in their dataset [4, 7].
We also compared the results using the random forest and boosted regression to construct the predictive model. Random forest [21] was implemented in the R package randomForest, whereas the boosted regression based on extensions to AdaBoost [24] and gradient boosted machine [25] was implemented in the R package gbm. The results were shown in Additional file 1: Tables S1, S2 and S3 (randomforest) and Additional file 1: Tables S4, S5 and S6 (gbm). These results were comparable to the results from elastic net.
Related work for predicting CRISPR/Cas9 guide efficiency based on nucleotide properties and melting temperatures includes azimuth [4, 6, 7], which constructed a predictive model based on gradientboosted regression trees as described earlier. This method was recommended by [26] for invivo (U6) transcribed guides. In contrast, the sgRNA scorer of [27] was a predictive model based on the support vector machine (SVM) algorithm using position dependent mononucleotide on 5’ flanking (5 bp), spacer target and 3’ flanking (NGG + 5 bp) region. We included these two methods for comparison in Table 3 and Fig. 4. In this comparison, each method was trained on different datasets, but the performance was evaluated on the same test dataset generated by an independent research group, i.e., [11] dataset. The statistical significance for pairwise AUC comparisons was based on DeLong’s test [28]. Our proposed predictive algorithm achieved higher AUC compared to both azimuth and sgRNA scorer (p<0.001 in both cases). On the other hand, azimuth had better performance than sgRNA scorer (p<0.001). We have also implemented azimuth (based on continuous outcome gbm model) and sgRNA scorer (based on binary outcome SVM model) using the sequence features identified by [6, 7] and [27], respectively on the same training data (i.e., [10] ribosomal and nonribosomal genes) (Table 3). As expected, the performance of sgRNA scorer was comparable to the model using position dependent mononucleotide (Table 3), whereas the performance of azimuth was comparable to the gbm results in Additional file 1: Table S15. Our proposed predictive algorithm achieved higher AUC compared to the refitted sgRNA scorer (p=0.048) and comparable performance to the refitted azimuth (p>0.1).
We also included comparison using a regression model based on (1) the average log2 fold change (12 cell doublings vs initial seeding states) of HL60 and KBM7 cell lines for [10] data and (2) the average log2 fold change (mESC vs plasmid control) of replicate 1 and replicate 2 of mouse ESC JM8 cell lines for [11] data. We compared the performance of the sequence properties in prediction in terms of AUC, Pearson correlation coefficient, Spearman rank correlation coefficient and mean squared error on the test data. The results were presented in Additional file 1: Tables S7, S8 and S9. In addition, similar to the binary outcome model as described above; positiondependent dinucleotide models (PD Dinuc) consistently outperformed other feature classes in predicting sgRNA efficiency and were comparable to results from the combination of feature classes models in all 3 scenarios. Fusi et al. [6] and Doench et al. [7] showed that the regression model outperformed classification model using their dataset [4, 7]. However, we observed that the regression model and the classification model yielded comparable performance in both [10] and [11] datasets. The combination feature prediction model from the regression model (Comb Feature) exhibited larger AUC than both azimuth and sgRNA scorer (p<0.001 for all pairwise AUC comparisons using DeLong’s test [28]), but no difference using Spearman rank correlation coefficient for Comb Feature versus azimuth (p=0.88 from Fisher’s Ztransformation test [29, 30]) as shown in Additional file 1: Table S9. The results from random forest and boosted regression were presented in Additional file 1: Tables S10, S11 and S12 (randomforest) and Additional file 1: Tables S13, S14 and S15 (gbm). These results were comparable to the results from elastic net.
Following [6, 7], we also included the results from leaveonegene out prediction framework to obtain a generalization of our prediction model to new genes in Additional file 1 (Section 5 and Tables S19 and S20). The conclusion remained the same, i.e., Comb Feature yielded the largest AUC and PD Dinuc followed closely. Additional results including performance evaluation using 30 bp sequence [6, 7] instead of 40 bp sequence were presented in Additional file 1: Tables S16, S17 and S18. The results indicated that the performance of the prediction models were comparable regardless whether a 40 bp or 30 bp sequence was used.
We created an R package predictSGRNA implementing the proposed predictive algorithm based on positiondependent dinucleotide model, available at http://www.ams.sunysb.edu/~pfkuan/softwares.html#predictsgrna.
Conclusions
In this paper, we explored various aspects of nucleotide compositions including position dependent models, secondary structure and thermodynamics to gain better understanding of the nucleotide properties on CRISPR sgRNA design efficiency in a similar way as [6, 7]. Candidate feature ranking in terms of association with sgRNA effiency identified features which characterize the flexibility of the underlying DNA structure. Specifically, we found that the frequency of T and TT dinucleotide exhibited the strongest negative association with sgRNA efficiency. Packer et al. [14] illustrated that TT dinucleotide has the most rigid step and least flexible in terms of the ability to slide and shift, which could explain the decreased efficiency of sgRNA with higher abundance of TT dinucleotides. The results from the different predictive algorithms showed that across datasets, the position dependent mononucleotide model [5] achieved good operating characteristics while the prediction algorithm trained on position dependent dinucleotide model offered additional improvement in terms on AUC. The advantage of position dependent dinucleotide model in predicting sgRNA efficiency was also observed in [6, 7].
One factor that may guide improvement of future predictive algorithms is chromatin structure. Chromatin accessibility (packed vs unpacked) has been shown to be the major determinant of genomewide binding of dCas9sgRNA in [16]. Examples of epigenetic marks which are implicated in chromatin remodeling and accessibility include DNase I hypersensitive sites, transcription factor binding, DNA methylation and histone modification. Future work will include integrating both the nucleotide composition features and chromatin structures as features in the predictive model to characterize the binding efficiency of sgRNA.
In this study, we used datasets of size 3141 and achieved AUC of > 0.8. Prior efforts to improve the efficiency of RNAi design utilized highthroughput functional testing of the efficacy of different RNAi sequences to generate large (2182) [31] and very large datasets (∼250000) [32]. These large datasets in turn were used to develop improved prediction algorithms using machinelearning approaches similar to those used here [33, 34]. It is generally accepted that the first large test set (2182) was very useful for improving RNAi design, there is still uncertainty regarding the utility of examining very large datasets [34]. Part of the unresolved issues are the degree to which different prediction algorithms are dependent upon the vector used for shRNA expression [35] as well as the sequence context in the genome outside of the immediate target [36]. Therefore, as more CRISPR/Cas9 screens datasets are becoming available, we anticipate that the specificity of sgRNA efficacy prediction can be further improved by considering the vectordependent level of expression of the sgRNA.
Abbreviations
 AUC:

Area under the receiver operating characteristic curve
 Align:

Optimal pairwise alignment between spacer target and scaffold
 BIC:

Bayesian information criterion
 CRISPR:

Clustered regularly interspaced short palindromic repeats
 Freq:

Frequencies of mono and dinucleotides
 LSL:

Length of a potential stemloop
 MEF:

Minimum energy folding
 mESC:

Mouse embryonic stem cell
 PAM:

Protospacer adjacent motif
 PD Dinuc:

Positiondependent dinucleotide
 PD Mono:

Positiondependent mononucleotide
 PhyChem:

Physiochemical properties of [17]
 PseKNC:

Pseudo ktuple nucleotide composition of [18]
 Packer:
 sgRNA:

Single guide RNA
 Thermo:

Thermodynamics and secondary structures of [9]
References
 1
Barrangou R, Fremaux C, Deveau H, Richards M, Moineau P, Romero D, Horvath P. CRISPR provides acquired resistance against viruses in prokaryotesy. Science. 2007; 315(5819):1709–12.
 2
Jinek M, Chylinski K, Fonfara I, Hauer M, Doudna J, Charpentier E. A programmable dualrnaguided dna endonuclease in adaptive bacterial immunity. Science. 2012; 337:816–21.
 3
Hsu P, Lander E, Zhang F. Development and applications of CRISPRCas9 for genome engineering. Cell. 2014; 157:1262–78.
 4
Doench J, Hartenian E, Graham D, Tothova Z, Hegde M, Smith I, Sullender M, Ebert B, Xavier R, Root D. Rational design of highly active sgrnas for CRISPRCas9–mediated gene inactivation. Nat Biotechnol. 2014; 32(12):1262–67.
 5
Xu H, Xiao T, Chen C, Li W, Meyer C, Wu Q, Wu D, Cong L, Zhang F, Liu J, Brown M, Liu S. Sequence determinants of improved CRISPR sgRNA design. Genome Res. 2015; 25:1147–57.
 6
Fusi N, Smith I, Doench J, Listgarten J. In silico predictive modeling of CRISPR/Cas9 guide efficiency. bioRxiv. 2015; 1:021568.
 7
Doench J, Fusi N, Sullender M, Hegde M, Vaimberg E, Donovan K, Smith I, Tothova Z, Wilen C, Orchard R, Virgin H, Listgarten J, Root D. Optimized sgrna design to maximize activity and minimize offtarget effects of CRISPRCas9. Nat Biotechnol. 2016; 34(2):184–91.
 8
Wu Z, Irizarry R, Gentleman R, MartinezMurillo F, Spencer F. A modelbased background adjustment for oligonucleotide expression arrays. J Am Stat Assoc. 2004; 99(468):909–17.
 9
Wei H, Kuan P, Tian S, Yang C, Nie J, Sengupta S, Ruotti V, Jonsdottir G, Keles S, Thomson J, Stewart R. A study of the relationships between oligonucleotide properties and hybridization signal intensities from nimblegen microarray datasets. Nucleic Acids Res. 2008; 36(9):2926–38.
 10
Wang T, Wei J, Sabatini D, Lander E. Genetic screens in human cells using the CRISPRCas9 system. Nature. 2014; 343:80–4.
 11
KoikeYusa H, Li Y, Tan E, Mdel CVH, Yusa K. Genomewide recessive genetic screening in mammalian cells with a lentiviral CRISPRguide rna library. Nat Biotechnol. 2014; 32(3):267–73.
 12
Sambrook J, Fritsch EF, Maniatis T. Molecular Cloning: a laboratory manual.Cold Spring Harbor Laboratory Press; 1989. https://www.cabdirect.org/cabdirect/abstract/19901616061.
 13
SantaLucia J. A unified view of polymer, dumbbell, and oligonucleotide DNA nearestneighbor thermodynamics. Proc Natl Acad Sci. 1998; 95(4):1460–5.
 14
Packer M, Dauncey M, Hunter C. Sequencedependent dna structure: Dinucleotide conformational maps. J Mol Biol. 2000; 295:71–83.
 15
Packer M, Dauncey M, Hunter C. Sequencedependent dna structure: Tetranucleotide conformational maps. J Mol Biol. 2000; 295:85–103.
 16
Wu X, Scott D, Kriz A, Chiu A, Hsu P, Dadon D, Cheng A, Trevino A, Konermann S, Chen S, Jaenisch R, Zhang F, Sharp P. Genomewide binding of the CRISPR endonuclease Cas9 in mammalian cells. Nat Biotechnol. 2014; 32(7):670–5.
 17
Chen W, Lin H, Feng P, Ding C, Zuo Y, Chou K. iNucPhysChem: a sequencebased predictor for identifying nucleosomes via physiochemical properties. PLoS One. 2012; 7(10):47843.
 18
Guo S, Deng E, Xu L, Ding H, Lin H, Chen W, Chou K. iNucPseKNC: a sequencebased predictor for predicting nucleosome positioning in genomes with pseudo ktuple nucleotide composition. Bioinformatics. 2014; 30(11):1522–9.
 19
Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B. 2005; 67:301–20.
 20
Needleman S, Wunsch C. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970; 48(3):443–53.
 21
Breiman L. Random forests. J Mach Learn. 2001; 45(1):5–32.
 22
DiazUriarte R, de Andres SA. Gene selection and classification of microarray data using random forest. BMC Bioinforma. 2006; 7(3):10–11861471210573.
 23
Nicodemus K, Malley J, Strobl C, Ziegler A. The behaviour of random forest permutationbased variable importance measures under predictor correlation. BMC Bioinforma. 2010; 11(110):10–11861471210511110.
 24
Freund Y, Schapire R. A short introduction to boosting. JJpn Soc Artif Intell. 1999; 14(771–780):1612.
 25
Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001; 1:1189–232.
 26
Haeussler M, Schonig K, Eckert H, Eschstruch A, Mianne J, Renaud J, SchneiderMaunoury S, Shkumatava A, Teboul L, Kent J, Joly J, Concordet J. Evaluation of offtarget and ontarget scoring algorithms and integration into the guide rna selection tool crispor. Genome Biol. 2016; 17(1):148.
 27
Chari R, Mali P, Moosburner M, Church G. Unraveling crisprcas9 genome engineering parameters via a libraryonlibrary approach. Nat Methods. 2015; 12(9):823–6.
 28
DeLong ER, DeLong DM, ClarkePearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988; 1:837–45.
 29
Fisher RA. On the probable error of a coefficient of correlation deduced from a small sample. Metron. 1921; 1:3–2.
 30
Myers L, Sirois MJ. Spearman correlation coefficients, differences between. Wiley StatsRef: Statistics Reference Online. 2006.
 31
Huesken D, Lange J, Mickanin C, Weiler J, Asselbergs F, Warner J, Meloon B, Engel S, Rosenberg A, Cohen D, Labow M, Reinhardt M, Natt F, Hall J. Design of a genomewide sirna library using an artificial neural network. Nat Biotechnol. 2005; 23(8):995–1001.
 32
Fellmann C, Zuber J, McJunkin K, Chang K, Malone C, Dickins R, Xu Q, Hengartner M, Elledge S, Hannon G, Lowe S. Functional identification of optimized rnai triggers using a massively parallel sensor assay. Mol Cel. 2005; 41(6):733–46.
 33
Vert J, Foveau N, Lajaunie C, Vandenbrouck Y. An accurate and interpretable model for sirna efficacy prediction. BMC Bioinforma. 2006; 7(520):10–1186147121057520.
 34
Knott S, Maceli A, Erard N, Chang K, Marran K, Zhou X, Gordon A, Demerdash OE, Wagenblast E, Kim S, Fellmann C, Hannon G. A computational algorithm to predict shrna potency. Mol Cel. 2014; 56(6):796–807.
 35
Watanabe C, Cuellar T, Haley B. Quantitative evaluation of first, second, and third generation hairpin systems reveals the limit of mammalian vectorbased rnai. RNA Biol. 2016; 13(1):25–33.
 36
Liu L, Li Q, Lin H, Zuo Y. The effect of regions flanking target site on sirna potency. Genomics. 2013; 102(4):215–22.
Acknowledgements
Not applicable.
Funding
This work was supported in part by NIH grant U01CA168409 to S.P. The funding body had no role in the design, collection, analysis or interpretation of this study.
Availability of data and materials
The datasets used to perform the present analysis are publicly available on [5] http://genome.cshlp.org/content/25/8/1147/suppl/DC1, [10] www.sciencemag.org/content/343/6166/80/suppl/DC1 and [11] http://www.nature.com/nbt/journal/v32/n3/full/nbt.2800.html#supplementaryinformation. The R package predictSGRNA implementing the proposed predictive algorithm based on positiondependent dinucleotide model is available at http://www.ams.sunysb.edu/~pfkuan/softwares.html#predictsgrna.
Authors’ contributions
PK conceived and designed the study. PK, SH and KL carried out analyses and wrote the software. PK, SP, SH, KL, XZ and BH wrote the paper. PK, SP, XZ and BH critically read the manuscript and contributed to the discussion of the whole work. All authors read and approved of the final version of the manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Author information
Affiliations
Corresponding author
Additional file
Additional file 1
Supplementary Information. The pdf document that contains all supplementary notes, figures and tables. Figures S1S2 plot the top 10 most informative features ranked by BIC and variable importance scores, respectively. Tables S1S3 contain the results from randomforest in binary outcome model. Tables S4S6 contain the results from gbm in binary outcome model. Tables S7S9 contain the results from elastic net in continuous outcome model. Tables S10S12 contain the results from randomforest in continuous outcome model. Tables S13S15 contain the results from gbm in continuous outcome model. Tables S16S18 contain the results comparing 30bp and 40bp sequences. Tables S19S20 contain the results from leaveonegene out prediction. (PDF 151 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Kuan, P.F., Powers, S., He, S. et al. A systematic evaluation of nucleotide properties for CRISPR sgRNA design. BMC Bioinformatics 18, 297 (2017). https://doi.org/10.1186/s1285901716976
Received:
Accepted:
Published:
Keywords
 CRISPR
 Machine learning
 Predictive modeling
 Thermodynamics