Accurate prediction of nuclear receptors with conjoint triad feature

Background Nuclear receptors (NRs) form a large family of ligand-inducible transcription factors that regulate the expression of genes involved in numerous physiological phenomena, such as embryogenesis, homeostasis, cell growth and death. These nuclear receptor-related pathways are important targets of marketed drugs. Therefore, the design of a reliable computational model for predicting NRs from amino acid sequence has become a significant biomedical problem. Results Conjoint triad feature (CTF) mainly considers neighbor relationships in protein sequences by encoding each protein sequence using the triad (three consecutive amino acids) frequency distribution extracted from a 7-letter reduced alphabet. In addition, chaos game representation (CGR) can investigate the patterns hidden in protein sequences and visually reveal previously unknown structure. In this paper, three methods, CTF, CGR and amino acid composition (AAC), are applied to formulate the protein samples. By considering different combinations of the three methods, we study seven groups of features, and each group is evaluated by the 10-fold cross-validation test. Meanwhile, a new non-redundant dataset containing 474 NR sequences and 500 non-NR sequences is built based on the latest NucleaRDB database. Comparing the results of numerical experiments, the group of combined features with CTF and AAC gets the best result, with an accuracy of 96.30 % for identifying NRs from non-NRs. Moreover, if a protein is classified as an NR, it is passed to the second level, which classifies it into one of the eight main subfamilies. At the second level, the group of combined features with CTF and AAC also gets the best accuracy of 94.73 %. 
Subsequently, the proposed predictor is compared with two existing methods, and the comparisons show that the accuracies of the two levels significantly increase to 98.79 % (NR-2L: 92.56 %; iNR-PhysChem: 98.18 %; the first level) and 93.71 % (NR-2L: 88.68 %; iNR-PhysChem: 92.45 %; the second level) with the introduction of our CTF-based method. Finally, each component of the CTF features is analyzed via a statistical significance test, and a simplified model with only the resulting top-50 significant features achieves an accuracy of 95.28 %. Conclusions The experimental results demonstrate that our CTF-based method is an effective way of predicting nuclear receptor proteins. Furthermore, based on the analysis of relative importance, the top-50 significant features obtained from the statistical significance test are considered to be the “intrinsic features” in predicting NRs. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0828-1) contains supplementary material, which is available to authorized users.


Background
Nuclear receptors (NRs) are members of a large family of ligand-inducible transcription factors that regulate the expression of genes involved in numerous physiological phenomena. These phenomena cover many aspects of the lives of multicellular organisms, including embryogenesis, homeostasis, cell growth and death [1]. Unlike cell surface receptors, which have strong affinities for water-soluble peptide hormones and growth factors, NRs mostly bind lipophilic hormone ligands such as steroids, retinoids, thyroid hormones and vitamin D3. These fat-soluble ligands can enter the cytoplasm through the lipid bilayer of the cell membrane and bind to NRs. The resulting allosteric ligand-protein complexes then enter the cell nucleus and regulate the expression of target genes [1].
All NRs are modular proteins that share a common structural organization. They mostly have six (or five) functional protein domains, including the N-terminal A/B domain, the DNA-binding domain (DBD, C domain), the D domain, the ligand-binding domain (LBD, E domain) and the F domain at the C-terminal end [2]. The N-terminal A/B domain contains at least one activation function 1 region (AF-1), which can operate autonomously, and several varied autonomous transactivation domains (AD). No crystal structure of the A/B domain has been reported so far; the domain is possibly involved in posttranslational modification [3]. The most conserved domain is the DBD, which plays the central role in binding to specific DNA sequences. Several crystal structures of DBDs have been reported, and they usually contain two typical cysteine-rich zinc finger motifs [4,5]. The P box in the first zinc finger determines the DNA-binding sequence specificity through a short AGGTCA motif. In addition, the D domain contains the nuclear localization signal (NLS) and serves as a hinge between the DBD and the LBD, permitting the two to adopt different conformations under hormone activation. The largest domain is the LBD, whose 3D structure is moderately conserved and comprises 12 α-helices and a β-turn [6]. In general, the LBD contains at least one ligand-binding pocket, located behind helix 3 and in front of helices 7 and 10, which enables the binding of ligands. Ligand binding induces a conformational change in the LBD, and agonists and antagonists lead to distinct structural alterations of nuclear receptor LBDs [7]. NRs may or may not contain the F domain, whose structure and function remain unknown [2].
Based on the aforementioned six (or five) domains, NRs perform their function through the typical features of each domain. They bind ligands at the LBD, which leads to an allosteric change of their 3D structure. These conformational changes strengthen the affinity for chromatin and allow NRs to bind to DNA through the DBD. An agonist, acting as an activating ligand, enhances the expression of the target gene, whereas an antagonist, serving as a repressing ligand, silences gene expression. Because NRs regulate gene expression in this way and are related to major human diseases such as breast cancer, diabetes and osteoporosis, they are promising pharmacological targets [2]. NRs are the largest family of hormone receptors, comprising 49 genes in the human genome [8]. About 13 % of marketed drugs target NRs, which places them among the most frequent targets of therapeutic drugs [9].
Conventional methods for characterizing non-annotated proteins are experimental, such as X-ray crystallography and NMR spectroscopy. These effective techniques provide a detailed 3D structure of a protein, which helps in understanding its function [4][5][6]. When experiments are not feasible, researchers may run a standard basic local alignment search tool (BLAST) [10] search to identify a protein as an NR based on the conserved motifs comprising the two zinc fingers of the DNA-binding domain [1]. However, NRs are divided into eight classes according to their ligand binding, DNA binding, and dimerization properties [1,8]. Search tools such as BLAST cannot identify the subfamilies of NRs [11] because different classes of NRs share low sequence similarity. Therefore, it is essential to develop novel methods to recognize NRs and their subfamilies.
An alternative way to identify NRs is to develop computational methods. With the rapid development of large-scale genome and proteome sequencing projects, huge amounts of biological data have accumulated. In the area of NRs, the NucleaRDB is a molecular class-specific information system that collects, combines, validates and disseminates large amounts of heterogeneous data on nuclear hormone receptors [8,12]. The collection of all these data makes it possible to develop computational methods for predicting the function of NR proteins from their primary sequences. According to the latest release of NucleaRDB (July 01, 2011, Version 11.7.1), the data are grouped into eight families or classes based on the ligand binding, DNA binding, and dimerization properties of NRs [8]. The eight families are (1) Thyroid hormone like, (2) HNF4-like, (3) Estrogen like, (4) Nerve Growth factor IB-like, (5) Fushi tarazu-F1 like, (6) Germ cell nuclear factor like, (7) Knirps like, and (8) DAX like. These NR families and their structural features are closely correlated with their function [11], and it would be valuable to develop a powerful computational method to classify NRs into particular families for the purpose of understanding their biological function and their potential as future drug targets.
In 2004, an early attempt at predicting NRs and their subfamilies was made by Bhasin and Raghava based on amino acid composition (AAC) and dipeptide composition (DC) features [11]. Gao et al. [13] developed a feature selection approach to identify relevant features, and a reduced feature subset containing 30 features (18 AACs and 12 DCs) resulted in an improved overall accuracy. In the same year, Gao et al. employed pseudo amino acid composition (PseAA) for predicting and recognizing NRs using support vector machines (SVM) [14]. In 2011, Wang et al. [15] integrated various types of features, such as AAC, DC, complexity factor (CF) and Fourier spectrum components (FSC), to represent protein sequences as 881-dimensional vectors. These sequence-derived features were fed into a fuzzy K nearest neighbor (FKNN) classifier to identify NRs and their families. Subsequently, Xiao et al. [16] constructed a prediction model based on a physical-chemical matrix via a series of auto-covariance and cross-covariance transformations, and the resulting predictor achieved higher recognition accuracy on the same dataset [15]. Recently, a proteome-scale two-level prediction method, named "NRfamPred", was developed based on dipeptide composition [17].
Here, we develop an integrated model that employs conjoint triad feature (CTF) and chaos game representation (CGR) to give an appropriate numerical representation of nuclear receptor protein sequences. CTF was originally used to represent protein sequences for the prediction of protein-protein interaction (PPI), where it achieved excellent performance [18]. Subsequently, CTF was extended to represent protein sequences for identifying RNA-protein interaction (RPI) [19,20] and became a popular representation of protein sequences [21][22][23][24]. On the other hand, in 1990, Jeffrey [25] proposed the chaos game representation (CGR) of DNA sequences, which can uncover hidden patterns in sequences. The CGR method for DNA sequences was later extended to protein sequences by Basu et al. [26], who used the CGR algorithm to plot protein sequences within a 12-sided regular polygon. Each vertex of the polygon represents a group of amino acid residues defined by conservative substitutions. The authors claimed that CGR had the potential to reveal evolutionary and functional relationships even between proteins with no significant sequence homology. To date, the CGR method has found many applications and attracted increasing attention in bioinformatics [27][28][29][30].
At present, it is widely believed that the features used as the input vector of a support vector machine (SVM) directly determine the efficiency of the prediction model. So far, no report has been published on using CTF and CGR together with AAC as features to predict NRs. In this paper, we present a CTF-based method, proposed to improve the accuracy of NR classification.

Dataset
There are several well-known datasets for identifying NRs and their subfamilies in the literature, such as D282 [11,13,14] and D159 [15,16]. According to the latest information on the NucleaRDB website (http://www.receptors.org/nucleardb) and a recent publication [8], NucleaRDB updated its contents on July 01, 2011. The updated database added some recently published sequences and structures of NRs, many of which are not included in D282 and D159 (Table 2). To take this additional information into account, a new dataset was built from the latest version of NucleaRDB in this report. The newly updated NucleaRDB classifies all the NRs into eight main families: (1) NR1: thyroid hormone like, (2) NR2: HNF4-like, (3) NR3: estrogen like, (4) NR4: nerve growth factor IB-like, (5) NR5: fushi tarazu-F1 like, (6) NR6: germ cell nuclear factor like, (7) NR7: knirps like, and (8) NR8: DAX like. All the protein sequences of the eight subfamilies were downloaded (detailed information can be found in Table 2).
To reduce the homology bias of prediction, a redundancy reduction procedure was performed on this dataset with the CD-HIT program [31]: a cutoff threshold of 60 % was imposed to exclude from the benchmark dataset those proteins that have 60 % or greater sequence identity to any other protein in the same subset. A cutoff threshold of 25 % is usually recommended [32][33][34]. However, under such a stringent criterion the number of remaining proteins would be too small to have statistical significance, so the cutoff threshold of 60 % is adopted in this study. As a result, the new dataset contains 474 NR sequences in total. To estimate the ability of the present method to discriminate NRs from non-NRs, a negative dataset containing 500 non-NR sequences was collected from D159 [15]. Our final training set (denoted by D474) contains 474 NR sequences and 500 non-NR sequences (Tables 1, 2) and can be downloaded as Additional file 1.

Sample representation
In our computational approach, each protein is represented as a numerical vector so that it can be fed into an SVM for classification. A number of methods have been used to extract information from protein sequences. For example, amino acid composition (AAC) was used to transform NR sequences into 20-dimensional numerical vectors [11]. To capture sequence-order information, dipeptide composition (DC) was proposed to represent NR sequences by 400-dimensional vectors, which capture local-order information and have been reported to improve classification [11]. In addition, Gao et al. [14] used the concept of Chou's pseudo amino acid composition to represent each protein sequence by numerical features that reflect a protein's overall sequence pattern. Recently, a web server called Pse-in-One [35] was established, which can generate various protein features for constructing predictors. Building on the work mentioned above, three feature extraction methods, AAC, CTF and CGR, are employed here to capture pivotal information of NR sequences.

Amino acid composition
Amino acid composition (AAC) is the most popular and also the simplest way to represent protein sequences, and it is regarded as a fundamental feature for protein prediction problems. More precisely, a protein sequence P with L amino acid residues can be expressed as:

$$P = R_1 R_2 R_3 \cdots R_L,$$

where $R_j$ denotes the residue at position $j$. The AAC of a protein is defined as the normalized frequency of each amino acid in that protein; i.e.,

$$\mathrm{AAC}(P) = (f_1, f_2, \ldots, f_{20}),$$

where $f_i = n_i / L$, and $n_i$ is the number of occurrences of the i-th amino acid (i = 1, ⋯, 20).
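The frequency definition above can be illustrated with a minimal Python sketch. The function name `aac`, the alphabetical ordering of the 20 residues, and the handling of non-standard letters (they count toward L but not toward any f_i) are assumptions made for illustration and are not specified in the text.

```python
from collections import Counter

# Assumed ordering: the 20 standard residues, alphabetical by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(sequence):
    """Return the 20-dimensional composition vector with f_i = n_i / L."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must contain at least one residue")
    counts = Counter(seq)          # n_i for every letter in the sequence
    length = len(seq)              # L
    return [counts[a] / length for a in AMINO_ACIDS]
```

For example, `aac("AAC")` puts 2/3 on alanine and 1/3 on cysteine, and every other component is zero.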

Conjoint triad feature
Conjoint triad feature (CTF) was originally used to transform protein sequences into 343-dimensional numerical vectors for successfully predicting PPI [18], and was later extended to predict RPI [19,20], enzyme function [21] and functionally related proteins [23]. CTF clusters the 20 amino acids into seven classes ({AGV}, {ILFP}, {YMTS}, {HNQW}, {RK}, {DE}, {C}) according to the dipoles and volumes of their side chains [18]. Any three consecutive amino acids are then regarded as a unit (a triad). Triads are categorized according to the classes of their amino acids, i.e., triads whose residues belong to the same classes in the same positions are treated identically. Finally, CTF counts the frequency of each triad type. In this way, each protein sequence is represented by a 343 (7 × 7 × 7) dimensional vector.
More precisely, a protein sequence P with L amino acid residues can be expressed as:

$$P = R_1 R_2 R_3 \cdots R_L.$$

Then we successively consider sliding windows of three consecutive residues:

$$R_1 R_2 R_3,\; R_2 R_3 R_4,\; \ldots,\; R_{L-2} R_{L-1} R_L.$$

The CTF of a protein is defined as the normalized frequency of the corresponding 3-mer in that protein; i.e.,

$$\mathrm{CTF}(P) = (f_1, f_2, \ldots, f_{343}),$$

where $f_i = n_i / (L-2)$, and $n_i$ is the number of occurrences of the i-th triad type among all windows of three consecutive residues (i = 1, ⋯, 343). A more detailed description of the CTF can be found in [18,23].
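A minimal Python sketch of the triad counting described above follows. The seven classes are taken from the text; the function name `ctf`, the mapping from a class triple (a, b, c) to the index 49a + 7b + c, and the rejection of non-standard residues are assumptions for illustration, since the text does not fix the ordering of the 343 components.

```python
# Seven residue classes from the text (side-chain dipole and volume).
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: k for k, group in enumerate(CLASSES) for aa in group}

def ctf(sequence):
    """Return the 343-dimensional vector with f_i = n_i / (L - 2)."""
    seq = sequence.upper()
    if len(seq) < 3:
        raise ValueError("need at least three residues to form a triad")
    unknown = set(seq) - set(CLASS_OF)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    codes = [CLASS_OF[aa] for aa in seq]
    counts = [0] * 343
    # Slide a window of three consecutive residues along the sequence.
    for a, b, c in zip(codes, codes[1:], codes[2:]):
        counts[49 * a + 7 * b + c] += 1   # assumed index convention
    n_windows = len(codes) - 2            # L - 2
    return [n / n_windows for n in counts]
```

With this convention, `ctf("AGVC")` has two non-zero components of 0.5 each: index 0 for the class triad (0, 0, 0) and index 6 for (0, 0, 6).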

Chaos game representation
The chaos game representation (CGR) algorithm for proteins was first proposed by Basu et al. [26]. The algorithm for drawing a CGR picture is as follows:
Step 1. Draw a 12-sided regular polygon in which each vertex represents one group of amino acids (Fig. 1);
Step 2. Take the center of the polygon, $P_0$, as the initial point;
Step 3. Given a protein sequence of length N, draw N points in the polygon as follows: read the residues of the sequence in turn; since each residue belongs to one group of amino acids, it determines a vertex of the polygon, and we draw the midpoint between the current initial point and the chosen vertex. After drawing a point, we set it to be the new initial point; iterating in this way yields N points.
More precisely, if we denote $P_0 = (0, 0)$ as the center of the polygon and $V_1 = (1, 0)$ as the first vertex of the polygon, we can easily get the coordinates of the other eleven vertices, which lie on the unit circle at angular steps of $2\pi/12$ (numbered counterclockwise here):

$$V_k = \left(\cos\frac{2\pi (k-1)}{12},\; \sin\frac{2\pi (k-1)}{12}\right), \quad k = 1, 2, \ldots, 12.$$

Then we compute the coordinates of each CGR point as follows:

$$\mathrm{CGR}_0 = P_0, \qquad \mathrm{CGR}_i = \frac{\mathrm{CGR}_{i-1} + V_i}{2}, \quad i = 1, 2, \ldots, N,$$

where $\mathrm{CGR}_i(x, y)$ refers to the coordinates of the i-th point drawn in the CGR picture, and in this formula $V_i(x, y)$ represents the coordinates of the vertex chosen by the i-th read (each read determines a certain vertex of the polygon), not the i-th vertex of the polygon. The CGR algorithm generates an image that contains fractal structure and visually reveals previously unknown structural information for each amino acid sequence. Furthermore, for the purpose of mathematical classification, a numerical characterization of the CGR picture is needed. We extract the frequency information of each segment by dividing the 12-sided polygon into 24 segments (grids), which are labeled serially with the numbers 1-24, as shown in Fig. 1.
For each segment $S_k$, k = 1, 2, ⋯, 24, we denote by $L_k$ the number of points which fall into $S_k$. A point falling on the boundary between adjacent segments is counted in exactly one of the neighboring segments. Then set

$$D_k = \frac{L_k}{N}, \quad k = 1, 2, \ldots, 24,$$

where N is the length of the amino acid sequence. From the above CGR and segment-counting algorithm, each amino acid sequence induces a 24-dimensional vector $(D_1, \cdots, D_{24})$.
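The midpoint iteration and segment counting can be sketched in Python as below. Several pieces are assumptions rather than the procedure of the paper: the residue-to-vertex mapping `vertex_of` must be supplied by the caller (the actual grouping is given in Fig. 1 and in Basu et al. [26]), the vertices are numbered counterclockwise from (1, 0), and `angular_segment` is a purely hypothetical partition into 24 equal angular sectors that stands in for the segmentation drawn in Fig. 1.

```python
import math

def polygon_vertices(n_sides=12):
    """Vertices of a regular polygon on the unit circle, starting at (1, 0)."""
    step = 2 * math.pi / n_sides
    return [(math.cos(k * step), math.sin(k * step)) for k in range(n_sides)]

def cgr_points(sequence, vertex_of):
    """Iterate midpoints: each new point is halfway to the chosen vertex.

    vertex_of maps a residue letter to a vertex index 0..11 (caller-supplied).
    """
    vertices = polygon_vertices()
    x, y = 0.0, 0.0                       # P0, the center of the polygon
    points = []
    for aa in sequence.upper():
        vx, vy = vertices[vertex_of[aa]]
        x, y = (x + vx) / 2, (y + vy) / 2
        points.append((x, y))
    return points

def angular_segment(point, n_segments=24):
    """Hypothetical partition: equal angular sectors about the center."""
    angle = math.atan2(point[1], point[0]) % (2 * math.pi)
    return min(int(angle / (2 * math.pi / n_segments)), n_segments - 1)

def segment_frequencies(points, segment_of=angular_segment, n_segments=24):
    """Return (D_1, ..., D_24) with D_k = L_k / N."""
    if not points:
        raise ValueError("no CGR points to count")
    counts = [0] * n_segments
    for p in points:
        counts[segment_of(p)] += 1
    return [c / len(points) for c in counts]
```

Passing a different `segment_of` function is the intended way to plug in the real 24-segment layout once it is read off Fig. 1.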

Support vector machines
A support vector machine (SVM) performs a nonlinear mapping of the input vector x from the input space, the n-dimensional Euclidean space $\mathbb{R}^n$ (n a positive integer), into a higher dimensional Hilbert space H, where the mapping is determined by the kernel function. It finds the optimal separating hyperplane (OSH) in the space H, which corresponds to a nonlinear boundary in the input space. For a given data set, only the kernel function and the regularization parameter C must be selected. A complete description of the usage of SVMs for pattern recognition can be found in [36]. In this study, the RBF kernel function (with a parameter γ) is adopted, and the implementation of SVM is based on LibSVM 3.17, an open-source package that can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html.
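For reference, the RBF kernel with parameter γ has the standard form K(x, y) = exp(−γ‖x − y‖²). The short Python sketch below evaluates it for two feature vectors; the function name `rbf_kernel` is illustrative and is not part of LibSVM.

```python
import math

def rbf_kernel(x, y, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as the vectors move apart, faster for larger γ.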

Evaluation of the prediction performance
In statistical prediction, the following three tests are often used to examine the effectiveness of a predictor in practical application: the self-consistency test (re-substitution test), the subsampling (K-fold cross-validation) test and the jackknife test [37]. Of these, the jackknife test is deemed the most rigorous, because it can exclude memory effects during the entire testing process and always yields a unique result for a given dataset, as elucidated in [38] and demonstrated by [32]. In this paper, when comparing with other methods, we adopt the jackknife test, following the test method used in the original studies.
On the other hand, to test the performance of our hybrid method, we choose 10-fold cross-validation because the new dataset is larger. The performance of the prediction method is measured by sensitivity (Sens), specificity (Spec), accuracy (Acc) and Matthews correlation coefficient (MCC), calculated as:

$$\mathrm{Sens} = \frac{TP}{TP + FN}, \qquad \mathrm{Spec} = \frac{TN}{TN + FP},$$

$$\mathrm{Acc} = \frac{TP + TN}{TP + FP + TN + FN},$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where TP is the number of true positives (NRs predicted as NRs) in one experiment, FN is the number of false negatives (NRs predicted as non-NRs), TN is the number of true negatives (non-NRs predicted as non-NRs), and FP is the number of false positives (non-NRs predicted as NRs). Additionally, to test the balance between the true positive rate and the false positive rate, we also draw the receiver operating characteristic (ROC) curves and compute the corresponding area under the curve (AUC) values (the AUC is 1 for a perfect classifier and 0.5 for a random classifier). Moreover, for the second level, which is a multiclass classification problem, we follow the evaluation criteria described in [39] to compute the prediction performance for each class. Firstly, four indexes of each subfamily are computed based on Equation 6:

$$TP(i) = N^{+}(i) - N^{+}_{-}(i), \quad FN(i) = N^{+}_{-}(i), \quad TN(i) = N^{-}(i) - N^{-}_{+}(i), \quad FP(i) = N^{-}_{+}(i),$$

where $N^{+}(i)$ is the total number of samples in the subset NRi, $N^{+}_{-}(i)$ is the number of samples in NRi that are incorrectly predicted as belonging to the other subsets, $N^{-}(i)$ is the total number of samples in all of the other subsets, and $N^{-}_{+}(i)$ is the number of samples that are incorrectly predicted as belonging to NRi. Subsequently, the performance of the prediction method for each subfamily is evaluated by:

$$\mathrm{Sens}(i) = \frac{TP(i)}{TP(i) + FN(i)}, \qquad \mathrm{Spec}(i) = \frac{TN(i)}{TN(i) + FP(i)},$$

$$\mathrm{Acc}(i) = \frac{TP(i) + TN(i)}{TP(i) + FP(i) + TN(i) + FN(i)},$$

$$\mathrm{MCC}(i) = \frac{TP(i) \times TN(i) - FP(i) \times FN(i)}{\sqrt{(TP(i) + FP(i))(TP(i) + FN(i))(TN(i) + FP(i))(TN(i) + FN(i))}}.$$
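The four measures can be computed directly from the confusion counts, as in the following Python sketch. The function names and the confusion-matrix layout (`confusion[r][c]` is the number of class-r samples predicted as class c) are assumptions for illustration; returning an MCC of 0 when the denominator vanishes is a common convention, not something stated in the text.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Sens, Spec, Acc and MCC from the four confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sens": tp / (tp + fn),
        "Spec": tn / (tn + fp),
        "Acc": (tp + tn) / (tp + fn + tn + fp),
        # Convention: report 0 when the MCC denominator is zero.
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

def per_class_counts(confusion, i):
    """One-vs-rest counts (tp, fn, tn, fp) for class i."""
    tp = confusion[i][i]
    fn = sum(confusion[i]) - tp                 # class i predicted as others
    fp = sum(row[i] for row in confusion) - tp  # others predicted as class i
    total = sum(sum(row) for row in confusion)
    tn = total - tp - fn - fp
    return tp, fn, tn, fp
```

Per-subfamily values are then obtained with `binary_metrics(*per_class_counts(confusion, i))`.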

Predicting NRs and their subfamilies
Firstly, this work focuses on finding the best combination of the three feature-derived methods, i.e. AAC, CTF and CGR, to predict nuclear receptors (NRs) and their subfamilies. At the first level, an un-annotated protein is predicted to be either an NR or a non-NR. If it is classified as an NR, it is passed to the second level, which classifies it into one of the eight subfamilies. The detailed flowchart can be found in Fig. 2.
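The two-level decision flow can be sketched as a small Python function. Here `is_nr` and `subfamily_of` are hypothetical callables standing in for the trained first-level and second-level classifiers; they are not functions from the paper or from LibSVM.

```python
def predict_two_level(features, is_nr, subfamily_of):
    """Level 1 decides NR vs non-NR; level 2 assigns a subfamily to NRs only."""
    if not is_nr(features):
        return "non-NR"
    return subfamily_of(features)
```

Only proteins accepted at the first level reach the second classifier, so second-level errors never affect proteins already rejected as non-NRs.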
In order to find the optimal combined features in the feature space, a series of comparative experiments is carried out via the 10-fold cross-validation test. More precisely, all the protein sequences are randomly divided into ten groups; in each of the ten folds, one group is used for testing and the other nine groups are used for training. An SVM classifier is then trained on the feature vectors and class labels (1 for NR; 0 for non-NR) extracted from the training dataset.
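The random ten-group split can be sketched in Python as follows. The function name `k_fold_indices`, the fixed seed and the round-robin assignment of shuffled indices to folds are illustrative choices, not details given in the text.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # random division into groups
    folds = [indices[i::k] for i in range(k)]   # k groups of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

For the 974 sequences of D474 (474 NRs and 500 non-NRs), this produces ten test groups of 97 or 98 sequences, each sequence being tested exactly once.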
The numerical experiments are designed on seven groups of feature sets. The detailed results, which include the average values of Sens, Spec, Acc, MCC and AUC in identifying NR proteins from non-NR proteins, are listed in Table 3. For the first level, the average Acc ranges from 0.8511 to 0.9630, the average MCC from 0.7022 to 0.9261, and the average AUC from 0.9290 to 0.9923. In particular, Feature set 5, i.e. CTF + AAC features, gives the best results, with an average Acc of 96.30 % at the optimal parameters γ = 0.1899, C = 10.1197. Additionally, ROC curves of all seven feature sets are shown in Fig. 3.
The results in identifying the eight main NR families are listed in Table 4. Comparing the prediction results of the different combinations of features, three phenomena are worth noting. Firstly, Feature set 5 (CTF + AAC features) achieves the best performance at both the first level and the second level, which means that jointly considering CTF and AAC features is highly effective. Secondly, Feature set 3 (CTF features) achieves the second best performance after Feature set 5, which implies that CTF features alone can achieve relatively good results. In particular, if we compare the prediction performance of Feature set 3 and Feature set 6 (Table 3 and Table 4), we find that the overall Acc drops from 0.9430 to 0.9409 (Table 4) or remains equal (Table 3) when CGR features are added to CTF features, suggesting that CGR features add little to the prediction of NRs and their subfamilies. Thirdly, the differences between Feature set 5 and Feature set 3 are rather small, indicating that AAC features contribute little to the predictions. These results lead us to conclude that CTF is an important feature in the prediction of NRs and their subfamilies. For feature combinations of AAC and/or CGR without CTF, the average Acc at the first and second levels is at most 93 % and 71 % respectively, whereas it rises to 96 % and 94 % respectively when CTF features are added.
At the second level, to investigate the detailed prediction performance for each subfamily with the two best feature sets (Feature set 5 and Feature set 3), we list the values of Sens, Spec, Acc and MCC for each subfamily in Table 5. Both Feature set 3 and Feature set 5 give satisfactory results: the overall Sens reaches 0.9430 and 0.9473 respectively, which means that among all the 474 NRs, 447 and 449 NRs respectively are correctly classified into their original subfamilies.
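As a quick arithmetic check on the counts quoted above (not an additional result), the overall sensitivities follow directly from the numbers of correctly classified NRs:

```latex
\[
\frac{447}{474} \approx 0.9430, \qquad \frac{449}{474} \approx 0.9473 .
\]
```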

Comparisons with other methods at the first level
Many existing methods classify NRs at a single level. To demonstrate the advantage of our hybrid methods, we run our algorithms on the same dataset (D159, 159 NRs, seven subfamilies) used by NR-2L [15] and iNR-PhysChem [16], with the same test method, the jackknife test. The detailed comparisons between our methods (Feature sets 1-7) and the existing methods (NR-2L, iNR-PhysChem) are listed in Table 6.
From Table 6, as expected, Feature set 5 again achieves the best prediction performance, with an Acc of 98.79 % and an MCC of 0.9667, higher than the 92.56 % and 0.8500 of NR-2L [15] and the 98.18 % and 0.9600 of iNR-PhysChem [16]. As before, Feature set 3 achieves the second best results, and the differences between Feature set 3 and Feature set 5 are again very small. It is also noteworthy that Feature sets 3, 5, 6 and 7 all perform better than NR-2L and iNR-PhysChem. These comparisons indicate that our method achieves a higher overall accuracy on the same benchmark dataset than these previous methods.

Comparisons with other methods at the second level
We also compare with NR-2L [15] and iNR-PhysChem [16], both developed on dataset D159 (159 NRs, seven subfamilies), at the second level. NR-2L is the first classifier for predicting NRs at two levels with seven subfamilies. We run our method on D159 at the second level via the same test method, the jackknife test. The detailed results and comparisons between our method (Feature set 3) and the existing methods (NR-2L, iNR-PhysChem) are listed in Table 7.
The results in Table 7 show that the CTF method reaches an overall Sens of 93.71 % at the second level on the D159 dataset, higher than the 88.68 % of NR-2L and the 92.45 % of iNR-PhysChem. Compared with NR-2L and iNR-PhysChem, the CTF method improves the prediction performance by about 5.0 and 1.3 percentage points respectively. These results indicate that the proposed method outperforms NR-2L and iNR-PhysChem at the second level.

NR proteins and non-NR proteins display distinct CTF-feature properties
The above results demonstrate that the CTF method is superior at both the first level and the second level, compared with existing methods and with the other feature sets. Next, for the purpose of investigating "intrinsic features" among the CTF features, we perform a statistical test (two-sided Wilcoxon rank-sum test) between the 474 NR proteins and the 500 non-NR proteins for each of the 343 CTF features. As a result, 279 of the 343 features show significant differences between NR proteins and non-NR proteins (p < 0.01; each p-value can be found in Additional file 2). Among all the features, the two most significant features are the 35th feature [...] Table 8. It is [...] NR proteins and 500 non-NR proteins. This leads us to consider these top-10 (or top-50) significant features as the "intrinsic features" in identifying NR proteins.
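For a single feature, a two-sided Wilcoxon rank-sum test can be sketched in pure Python as below. This is an illustrative version using the large-sample normal approximation, with average ranks for ties and no tie or continuity correction; the function name `rank_sum_test` is an assumption, and the statistical software actually used in the study is not stated in the text.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p). Ties receive average ranks; no tie correction is applied.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1                         # pooled[i:j] share one value
        avg_rank = (i + 1 + j) / 2         # mean of ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided tail probability
    return z, p
```

Applied feature by feature, `x` would hold the values of one CTF component over the NR proteins and `y` the values of the same component over the non-NR proteins.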

Relative importance of significant CTF features
To further verify that these top-10 (or top-50) significant features are the "intrinsic features" in identifying NR proteins, we perform a detailed analysis of the relative importance of these features. Specifically, given that these top-10 (or top-50) significant features are particularly important for NR protein prediction, we ask whether our prediction model could be simplified by using these top-10 (or top-50) features alone.
To answer this question, we adopt a two-direction strategy to demonstrate the importance of these significant features. One direction is to perform the predictions using only the top-10 (or top-50) features; the other is to perform the predictions using the remaining CTF features with the top-10 (or top-50) features removed (denoted by "CTF-10" or "CTF-50"). Remarkably, the performance of the simplified model (top-50 significant features, Acc = 0.9528) and that of the full model (343 CTF features, Acc = 0.9620) are not significantly different (Table 9), whereas the difference between the performance of the CTF-50 model (CTF features with the top-50 features removed, 293 features, Acc = 0.9035) and that of the full model (343 CTF features, Acc = 0.9620) is clearly large (Table 9). Our findings indicate that the top-50 significant features are truly "intrinsic features" in identifying NR proteins, and we surmise that these features contain substantial conserved motif information of NR proteins.
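The two-direction split can be sketched as follows, assuming a list of per-feature p-values is available. The function names `split_by_significance` and `select_columns` are hypothetical helpers introduced only for this illustration.

```python
def split_by_significance(p_values, k=50):
    """Return (top_k, remaining) feature indices, ranked by ascending p-value."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    return sorted(order[:k]), sorted(order[k:])

def select_columns(vector, indices):
    """Keep only the listed components of one feature vector."""
    return [vector[i] for i in indices]
```

With 343 p-values and k = 50, the first list gives the columns of the simplified model and the second the 293 columns of the "CTF-50" model.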

Further discussion
To support our method, we offer some further discussion. The results in Tables 6 and 7 show that our method is superior to NR-2L and iNR-PhysChem. The reason is that the CTF method plays a crucial role in predicting NRs. According to previous reports, amino acid composition (AAC) features are the simplest but effective features in predicting NRs [11,13,14]; however, AAC features alone are insufficient because they lack sequence-order information. To compensate for this deficiency, a CTF- and CGR-based method is proposed in this research. According to Tables 3 and 4, the best accuracy is achieved by the group with combined features of CTF and AAC. Moreover, the detailed comparisons between different features show an interesting phenomenon. On the one hand, CTF features are fundamental: every group lacking CTF achieves unsatisfactory accuracy in Tables 3 and 4. On the other hand, although CTF features alone do not achieve the best accuracy, their prediction performance is already better than that of the two existing methods (NR-2L and iNR-PhysChem).
Taking the above results into consideration, it is worth exploring why CTF features are important for predicting NRs. Let us recall what CTF is and how it relates to the prediction of protein-protein interactions (PPIs). CTF was originally proposed in 2007 to solve PPI prediction problems [18]. The authors took the view that PPIs were mostly dominated by electrostatic and hydrophobic interactions between amino acids from the interacting proteins, which might be reflected by the dipoles and volumes of the side chains of amino acids, respectively. Accordingly, the 20 amino acids were classified into seven classes based on the dipoles and volumes of their side chains. Amino acids belonging to the same class were considered to have similar electrostatic and hydrophobic properties. Finally, any three consecutive amino acids were considered as a unit, from which 343 numerical features were extracted based on their conjoint electrostatic and hydrophobic properties. The CTF method was naturally extended to study RNA-protein interactions [19,20], because RNA-protein interaction might similarly be influenced by electrostatic and hydrophobic interactions between amino acids (from the protein) and nucleic acids (from the RNA).
When predicting NRs, the proteins that are likely to be NRs are mostly involved in several kinds of interactions: with small molecules (in the cytoplasm, through the LBD), with other proteins, and with DNA (in the nucleus, through the DBD). All these interactions are related to electrostatic and hydrophobic interactions, which might be the reason why the CTF method performs better than the other existing methods in this study.

Conclusions
Nuclear receptors play a vitally important role in many processes of transcriptional regulation. The conjoint triad feature clusters the 20 amino acids into seven classes according to the dipoles and volumes of their side chains. Any three consecutive amino acids are regarded as a unit, from which 343 features can be extracted. The chaos game representation algorithm maps each protein sequence to a CGR picture with an iterated fractal approach. CGR pictures are divided into different segments, from which 24 quantitative features are extracted by computing the frequencies of points in each of the segments. We combine two factors (CTF, CGR) with amino acid composition as the candidate features which