Structural features based genome-wide characterization and prediction of nucleosome organization
© Gan et al; licensee BioMed Central Ltd. 2012
Received: 22 June 2011
Accepted: 26 March 2012
Published: 26 March 2012
Nucleosome distribution along chromatin dictates genomic DNA accessibility and thus profoundly influences gene expression. However, the underlying mechanism of nucleosome formation remains elusive. Here, taking a structural perspective, we systematically explored nucleosome formation potential of genomic sequences and the effect on chromatin organization and gene expression in S. cerevisiae.
We analyzed twelve structural features related to flexibility, curvature and energy of DNA sequences. The results showed that some structural features, such as DNA denaturation, DNA-bending stiffness, Stacking energy, Z-DNA, Propeller twist and free energy, were highly correlated with in vitro and in vivo nucleosome occupancy. Specifically, they can be classified into two classes, one positively and the other negatively correlated with nucleosome occupancy. These two kinds of structural features facilitated nucleosome binding in centromere regions and repressed nucleosome formation in the promoter regions of protein-coding genes to mediate transcriptional regulation. Based on these analyses, we integrated all twelve structural features in a model that predicts in vivo nucleosome occupancy more accurately than existing methods, which mainly depend on sequence compositional features. Furthermore, we developed a novel approach, named DLaNe, that locates nucleosomes by detecting peaks of structural profiles, and built a meta predictor to integrate information from different structural features. For comparison, we also constructed a hidden Markov model (HMM) to locate nucleosomes based on the profiles of these structural features. The results showed that the meta DLaNe and the HMM-based method performed better than the existing methods, demonstrating the power of these structural features in predicting nucleosome positions.
Our analysis revealed that DNA structures significantly contribute to nucleosome organization and influence chromatin structure and gene expression regulation. The results indicated that our proposed methods are effective in predicting nucleosome occupancy and positions and that these structural features are highly predictive of nucleosome organization.
The implementation of our DLaNe method based on structural features is available online.
In a eukaryotic nucleus, chromosomes are organized in condensed chromatin structures. Genomic DNA wraps around a histone octamer to form the primary repeating units of chromatin, termed nucleosomes. In many species, each nucleosome core particle contains roughly 147 base pairs of DNA, which facilitates the storage and organization of long eukaryotic chromosomes. Nucleosome distribution on genomic DNA sequences can greatly affect gene transcription, DNA replication and DNA repair by modulating the accessibility of the underlying DNA sequences to various regulatory factors. However, how nucleosome organization is established is not well understood.
Besides a multitude of factors, including chromatin remodelers [3–5] and specific DNA-binding proteins [6, 7], intrinsic DNA sequence preferences have been the focus of recent experimental and bioinformatics studies, which concern how and to what extent sequence features contribute to nucleosome organization [8–14]. In particular, AT- and GC-rich dimeric and trimeric motifs were first identified in the pioneering work of Trifonov. Subsequently, several studies delineated periodicity and sequence patterns associated with nucleosomal sequences [8, 10, 11]. Specifically, G + C content can explain ~50% of the variation of nucleosome occupancy in vitro. Computational methods based on such sequence compositional features have been proposed to predict nucleosome occupancy [8, 9, 12–14]. However, it has been demonstrated that DNA sequence preferences for certain sequence motifs are not the major determinants of nucleosome organization [16, 17], which raises a question about the role of the structural variability of DNA sequences in the formation of nucleosomes [10, 18–20].
To address this question, several studies have focused on structural properties of DNA sequences and the conformational mechanism of nucleosomes. Some physicochemical properties in nucleosomal DNA databases, such as tilt for DNA-protein complexes and helical twist, have been identified as significant for nucleosome binding. Based on the roll-and-slide model, Tolstorukov et al. found that the slide of adjacent base pairs contributes predominantly to DNA super-helical pitch, whereas the roll of neighboring base pairs accounts for DNA curvature. Miele et al. introduced dinucleotide-dependent DNA flexibility and intrinsic curvature to the analysis of nucleosome occupancy. Morozov et al. used a DNA elastic energy function to build a biophysical model of the sequence dependence of nucleosome formation. The bendability of dinucleotides in the crystal structures of DNA duplexes was also analyzed within nucleosomal DNA fragments [24, 25]. Analysis of nucleosome crystal structures showed that the behaviors of base pairs, the puckering of ribose rings and the related backbone torsion jointly represent the major structural variations of nucleosomal DNA sequences. These studies suggested that many structural features might be related to nucleosome formation. Therefore, it is imperative to systematically analyze different structural properties, to identify structural features that contribute to nucleosome formation and, more importantly, to understand to what extent nucleosome organization is inherently hardwired in these structures of genome sequences. Furthermore, it is desirable to exploit those structural features that are characteristic of nucleosome occupancy and formation to develop effective novel methods for predicting nucleosome positioning.
We systematically investigated twelve structural features related to the intrinsic flexibility, curvature and energy of DNA sequences, and analyzed their relation with nucleosome occupancy, chromatin organization and transcriptional regulation across the entire S. cerevisiae genome. By focusing on centromere and promoter regions, we further inquired into the underlying structural mechanisms of nucleosome organization and transcriptional regulation. To assess their predictive power for nucleosome organization, we combined these structural features in a linear model for predicting nucleosome occupancy. Further, we introduced a novel strategy to locate nucleosomes by detecting peaks of structural profiles, and developed a meta predictor to integrate information from different structural features, which significantly outperformed the existing sequence-based methods. We also constructed an alternative hidden Markov model (HMM) for predicting nucleosome positions using the structural features, confirming the effectiveness of these structural features in locating nucleosomes. Our results shed light on the recent debate on the role of sequence preference in nucleosome organization [9, 27, 28], indicating that DNA structures are important factors in determining nucleosome organization.
Table 1
Columns: Feature; Description; In vitro; In vivo; In vivo
Propeller twist: The angle between the two aromatic bases in a base pair.
DNA denaturation: The ability of DNA to denature.
DNA-bending stiffness: The anisotropic flexibility of DNA.
Bendability: The trinucleotide bendability.
Duplex disrupt energy: DNA duplex energy.
Stacking energy: Dinucleotide base-stacking energy scale.
Z-DNA: The ability to be converted from B- to Z-DNA.
Duplex free energy: The thermodynamic energy content.
Aphilicity: The free energy values for a transition from B- to A-DNA form.
Protein-DNA twist: The ability to be deformed by protein.
B-DNA twist: The mean twist angles in B-DNA.
Protein-induced deformation: The ability to be changed by proteins.
These structural features are classified into two classes: positively correlated features (upper part) and negatively correlated features (lower part).
First, we computed and compared the structural profiles of all 12 structural features on 1,000 well-positioned nucleosomes and 1,000 nucleosome-depleted sequences (see Methods). The results show that nucleosome-enriched sequences have different structural characteristics from those of nucleosome-depleted sequences. Based on their relationship with nucleosome occupancy, we can classify these structural features into two categories. As shown in Table 1, the first class of features is positively correlated with nucleosome occupancy. For each feature in this class, the calculated structural values along nucleosome sequences are greater than those along nucleosome-depleted sequences. In contrast, the structural features in the second class show negative correlations with nucleosome occupancy. Take DNA denaturation as an example: this feature captures the temperature at which DNA strands are half denatured, and DNA regions with a low denaturation value denature more easily than regions with a higher value [35, 44]. Therefore, this feature can measure the stability of a double DNA strand. The results reveal that nucleosome-enriched DNA sequences denature at a higher temperature than nucleosome-depleted DNA sequences. In contrast, we observe that the duplex free energy of nucleosome sequences is evidently lower than that of nucleosome-depleted sequences. It is well known that DNA sequences with a low free energy are more stable than those with a high free energy [39, 44]. That is to say, a DNA segment in a nucleosome is more stable than nucleosome-depleted sequences.
We observe in Figures 1(A) and 1(B) that the peaks and valleys of the positively related DNA denaturation and Propeller twist align well with the experimental nucleosome signals; both profiles share very similar patterns with the experimentally determined nucleosome occupancy. Figures 1(C) and 1(D) compare the profiles of negatively related structural features with the experimental nucleosome occupancy. As shown, the patterns of the actual nucleosome occupancy and the profiles of these structural features are nearly opposite. Specifically, the local valleys of the structural profiles correspond well to the peaks of experimental nucleosome signals. In support of the above finding, the plots show that nucleosome-enriched sequences indeed have different structural patterns from nucleosome-depleted sequences. In eukaryotic cells, promoter regions are normally less likely to be occupied by nucleosomes, making them more accessible to the transcription machinery [46, 47]. The structural profiles we computed agree very well with this finding. For positively related features, deep valleys are located in the promoter regions, while peaks are observed there for negatively related features. Taken together, these comparative results show that these structural patterns correlate to different degrees with the experimental nucleosome occupancy.
To quantify the power of structural features for capturing nucleosome occupancy signals, we analyzed the correlation between the structural profile of each feature and experimental nucleosome occupancy along the whole genome of S. cerevisiae. Specifically, we collected one in vitro and two in vivo [9, 45] genome-wide nucleosome occupancy datasets as references. The Pearson correlation coefficients, listed in Table 1, confirmed our classification of the structural features that we studied. The result on nucleosome formation energy agreed with previous results from different models [20, 23], showing that nucleosome energy is highly correlated with nucleosome occupancy. Furthermore, we analyzed other structural features related to DNA flexibility and intrinsic curvature in order to identify the features that contribute the most to nucleosome formation. Among the structural features we studied, Propeller twist, DNA denaturation and DNA-bending stiffness are the most positively correlated with nucleosome occupancy, and Stacking energy, Z-DNA and Duplex free energy are the most negatively correlated features. The close correlations between these structural features and nucleosome occupancy imply that these features are important factors in in vitro and in vivo nucleosome organization. Meanwhile, in vivo nucleosome occupancy data are less correlated with the structural features than in vitro data, suggesting that nucleosome organization in vivo may also be influenced by additional external factors such as DNA-binding proteins and chromatin remodelers.
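As an illustration of the correlation analysis, the following is a minimal sketch of a Pearson correlation between a structural profile and an occupancy track sampled at the same positions. The function name `pearson` and all numerical values are invented for illustration; they are not data from this study.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)

# Toy values, invented for illustration only
occupancy = [0.1, 0.4, 0.9, 1.2, 0.8, 0.3]
positive_feature = [1.0, 1.2, 1.9, 2.1, 1.6, 1.2]   # rises and falls with occupancy
negative_feature = [-v for v in positive_feature]    # mirrored profile

r_pos = pearson(positive_feature, occupancy)
r_neg = pearson(negative_feature, occupancy)
print(round(r_pos, 3), round(r_neg, 3))
```

A feature whose profile is the mirror image of another yields the same coefficient with the opposite sign, which is how the two classes of features would appear in such an analysis.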
Since these features capture different aspects of nucleosome occupancy, we thus examined to what extent these features are correlated with each other. We calculated the pairwise Pearson correlation coefficients of these 12 features. The results, presented in Additional file 1: Table S1, show that there are close correlations among DNA denaturation, DNA-bending stiffness and energy-related features. Features measuring energy (Duplex free energy, Duplex disrupt energy, Stacking energy and Stabilizing energy of Z-DNA) are highly correlated with each other. Propeller twist, Aphilicity and other features are less correlated. These results demonstrate that these twelve features capture different structural dimensions of DNA sequence and have unequal capability for capturing nucleosome occupancy.
Previous analyses have shown that the G + C content is one of the most important features, which can explain approximately 50% of the variation of in vitro nucleosome occupancy. To understand whether the effectiveness of the structural features that we studied depends on the G + C content, we studied the correlation of these structural features with the G + C content on the whole genome and in promoter and genic regions. As shown in Additional file 1: Table S2, the G + C content is correlated with some of the structural features, such as Aphilicity, Bendability, DNA-bending stiffness and the energy-related features. However, the corresponding Pearson correlation coefficients are not proportional to their performance in predicting nucleosome occupancy and positions. For example, although Bendability and Duplex disrupt energy are highly correlated with the G + C content, they are not effective in capturing nucleosome occupancy (Table 1). Meanwhile, the correlation in the nucleosome-depleted promoter regions is higher than that in the nucleosome-enriched gene regions. All these results indicate that the effectiveness of these structural features is only marginally related to the G + C content, suggesting that the G + C content may be less important than previously thought and that some of the structural features may capture information on nucleosome occupancy beyond the G + C content.
To analyze whether the intrinsic encoding of nucleosome occupancy varies across different types of chromosomal regions, we next focused on two representative kinds of local genomic regions: nucleosome-enriched centromere regions and nucleosome-depleted promoter regions. The centromere of a eukaryotic chromosome, which accommodates sites for segregation during mitosis and meiosis, is one of the essential parts of a chromosome. Previous research revealed that centromere regions have high nucleosome occupancy. A key question is what determines the nucleosome occupancy over centromere regions.
Our analysis indicated that different genomic regions have distinct structural properties, which may dictate nucleosome occupancy patterns specific to these regions. Specifically, the regions upstream of transcription start sites (TSS) have less DNA-bending stiffness and Propeller twist, which may lead to more depletion of nucleosome than the corresponding downstream regions. Several independent studies further revealed that nucleosome depletion in promoter regions was related to gene regulation [2, 19, 45]. Given the correlation between nucleosome occupancy and the structural features we studied, variability in gene expression might be inherently hardwired in structural properties of promoters. To investigate whether genes with a similar expression pattern share some chromatin structures, we categorized genes on the basis of their expression levels and calculated the average structural profiles of promoter regions of 5,015 high-confidence transcripts of S. cerevisiae reported in [45, 50].
Intrigued by the high degrees of correlation of the 12 structural features with the experimental nucleosome occupancy, we adopted the least angle regression method (abbreviated as LARS) to combine the structural features in a linear model for predicting nucleosome formation potential. The LARS algorithm determines a linear combination of the structural features by optimizing a linear model on a set of training data. In the model, the coefficients of the features specify which features are used and their relative weights in the combination, and the output is the predicted nucleosome occupancy. In this way we generated a structural feature-based nucleosome occupancy prediction model. In our implementation, we used the version of LARS available as an R package. In particular, we trained three linear models on chromosomes 1-9 using one in vitro dataset and two in vivo datasets [9, 45] of nucleosome occupancy, and applied the resulting models to predict nucleosome occupancy on chromosomes 10-16. The predicted nucleosome occupancy and the in vitro data are highly correlated, with a Pearson correlation coefficient of 0.88. For the in vivo nucleosome occupancy, the correlations are 0.75 and 0.42 on the datasets of Kaplan et al. and Lee et al., respectively. The results show that the models based on these structural features are highly predictive of in vivo and in vitro nucleosome occupancy. However, the performance of these structural features in predicting in vivo nucleosome occupancy is not as good as for in vitro nucleosome occupancy. This result indicates that in vivo nucleosome organization may also be influenced by other factors such as DNA methylation, histone variants, chromatin remodelers and DNA-binding proteins.
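The train-on-some-chromosomes, predict-on-others scheme can be illustrated with a small linear model. The sketch below substitutes ordinary least squares for LARS (LARS additionally selects features along a regularization path, which is not reproduced here), and the two features and all values are invented for illustration. The function names `fit_linear` and `predict` are hypothetical.

```python
def fit_linear(X, y):
    """Ordinary least squares with an intercept, solved via the normal equations.

    Note: this is plain OLS used as a stand-in; it is not the LARS algorithm.
    """
    rows = [[1.0] + list(r) for r in X]          # prepend the intercept column
    k = len(rows[0])
    # Augmented system (A^T A | A^T y)
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] +
         [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]  # [intercept, w1, w2, ...]

def predict(coef, X):
    """Apply fitted coefficients to new feature rows."""
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in X]

# Toy data: two hypothetical structural features per position
train_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
train_y = [2 + 3 * a - b for a, b in train_X]    # occupancy = 2 + 3*f1 - f2
coef = fit_linear(train_X, train_y)

test_X = [(3.0, 2.0), (0.5, 0.5)]                # held-out positions
pred = predict(coef, test_X)
print(coef, pred)
```

In this toy setting the sign of each fitted weight plays the role of the positive or negative correlation class of the corresponding feature.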
Table 2. Genome-wide correlation coefficients between experimental nucleosome occupancies and nucleosome occupancies predicted by different models
Columns: Model; Features used in a model; In vitro; In vivo
Our integrated model (this paper): 12 structural features in a linear model
Xi et al., 2010: Position-dependent k-mer preferences (k up to 5)
Kaplan et al., 2009: Position-dependent 5-mer preferences and periodic dinucleotide
Tillo and Hughes, 2009: A linear model combining G + C content, propeller twist, slide and several 4-mer occurrences
Yuan and Liu, 2008: Periodic dinucleotide signals of linker and nucleosomal sequence
Gabdank et al., 2010: Uses DNA bendability matrix
Miele et al., 2008: Sequence-dependent free energy of nucleosome formation
Field et al., 2008: Uses 5-mer preferences and periodic dinucleotide
Lee et al., 2007: G + C content, 4-mer occurrence, TFBSs and several structural features
So far we have observed that the profiles of the structural features we analyzed are well correlated with experimental nucleosome occupancy data. Taking the propeller twist feature as an example, most nucleosome regions have a peak in this profile and there is virtually no peak in nucleosome-depleted regions. This indicates that the structural properties are sufficiently distinct to allow efficient prediction of nucleosome positions. We thus developed a computational method, termed DLaNe, for detecting peaks and valleys of structural profiles to locate nucleosome positions. Specifically, for positively correlated features, our method detects peaks along the structural profiles to locate nucleosomes; likewise, it detects valleys for negatively related features. Meanwhile, as nucleosome positions are influenced not only by high-order chromatin structure, but also by repulsive and attractive interactions between neighboring nucleosomes, we considered the effect of steric exclusion, which prevents neighboring nucleosomes from overlapping in space and dictates relatively fixed lengths of linker DNA. In yeast, nucleosomal DNA is about 147 bp long, and linker DNA is approximately 10-20 bp long. We set the window size for nucleosome position prediction to 165 bp to account for the distances between neighboring nucleosomes. In our analysis, we experimented with different window sizes, and this particular window width performed the best. Details of our method are given in Methods.
We applied our method to the S. cerevisiae genome. To determine the predictive power of different structural features, we validated our predicted nucleosome locations against the genome-wide nucleosome position map from Lee et al., which provided 70,884 nucleosome positions at a 4 bp resolution from a tiling microarray. If a predicted nucleosome center was within L bp of a true site, we took it as a correct prediction, where L is a distance cutoff parameter. To obtain a fair evaluation, we evaluated predicted positions at different distance cutoffs. We used six cutoff values, ranging from 10 bp to 60 bp with an increment of 10 bp. As previous studies evaluated their prediction accuracy in terms of sensitivity and specificity [13, 57], we adopted the same criteria. Specifically, sensitivity (Se) represents the fraction of experimentally verified nucleosomes that are correctly predicted, and specificity (Sp) is the fraction of correctly predicted nucleosomes out of all predictions. In addition, to compare the performance of methods with different Se and Sp, a unified F-measure was used, computed as 2·Se·Sp/(Se + Sp).
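The evaluation criteria can be sketched as follows. This is a minimal illustration, assuming that "within L bp" is inclusive and that one reference site may validate several predictions (one-to-one matching is not enforced); the function name `evaluate` and the positions are invented for illustration.

```python
import bisect

def evaluate(predicted, reference, cutoff):
    """Return (Se, Sp, F) for predicted vs. reference nucleosome centers in bp."""
    pred = sorted(predicted)
    ref = sorted(reference)

    def has_neighbor(pos, sorted_sites):
        # True if any site lies within `cutoff` bp of `pos` (inclusive)
        i = bisect.bisect_left(sorted_sites, pos - cutoff)
        return i < len(sorted_sites) and sorted_sites[i] <= pos + cutoff

    found = sum(has_neighbor(r, pred) for r in ref)     # reference sites recovered
    correct = sum(has_neighbor(p, ref) for p in pred)   # predictions near a true site
    se = found / len(ref) if ref else 0.0
    sp = correct / len(pred) if pred else 0.0
    f = 2 * se * sp / (se + sp) if se + sp > 0 else 0.0
    return se, sp, f

# Toy centers (bp), invented for illustration only
reference = [100, 265, 430, 600]
predicted = [108, 290, 425, 800, 960]

for L in range(10, 61, 10):                             # the six cutoffs in the text
    se, sp, f = evaluate(predicted, reference, L)
    print(L, round(se, 2), round(sp, 2), round(f, 2))
```

Note that Sp as defined in the text is the fraction of predictions that are correct, so the F-measure here is the harmonic mean of Se and Sp.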
Table 3. Genome-wide performance comparison among the Segal method, N-score, NuPoP, the Random method, the HMM method, DLaNe based on twelve individual structural features and the meta DLaNe method combining six features with the cutoff L = 35
To determine the factors that make the meta DLaNe perform better than other methods, we also applied the HMM approach to locate nucleosomes using the structural profiles of the six most informative features (see Methods). The HMM contained 16 hidden states: 15 nucleosome states and one linker state. We trained the model on Chromosome 3 and applied it to predict nucleosome positions using the Viterbi algorithm. As shown in Table 3, the HMM performs slightly worse than the meta DLaNe, but better than the existing methods, which are mainly based on sequence features. Since this HMM method and DLaNe are both based on structural features, the results suggest that these structural features are effective in capturing nucleosome positioning information.
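To make the decoding step concrete, the sketch below runs the Viterbi algorithm on a 16-state chain with one linker state and 15 nucleosome states. Only the state count comes from the text; the left-to-right topology, the binary "high/low" observations and every probability are assumptions made for illustration, whereas the model in this study was trained on Chromosome 3.

```python
import math

N_NUC = 15                        # nucleosome states 1..15; state 0 is the linker
STATES = range(N_NUC + 1)

def log(p):
    return math.log(p) if p > 0 else float("-inf")

def transition(i, j, p_enter=0.1):
    """Assumed topology: the linker self-loops or enters state 1;
    nucleosome states advance in order and state 15 returns to the linker."""
    if i == 0:
        return p_enter if j == 1 else (1 - p_enter if j == 0 else 0.0)
    if i < N_NUC:
        return 1.0 if j == i + 1 else 0.0
    return 1.0 if j == 0 else 0.0

def emission(state, symbol, p_high_nuc=0.8, p_high_link=0.2):
    """Assumed emissions: symbol 1 = 'high' profile value, 0 = 'low'."""
    p_high = p_high_link if state == 0 else p_high_nuc
    return p_high if symbol == 1 else 1 - p_high

def viterbi(obs):
    """Most likely state path for a sequence of 0/1 observations (starts in linker)."""
    score = [log(1.0 if s == 0 else 0.0) + log(emission(s, obs[0])) for s in STATES]
    back = []
    for o in obs[1:]:
        prev, score, ptr = score, [], []
        for j in STATES:
            best = max(STATES, key=lambda i: prev[i] + log(transition(i, j)))
            score.append(prev[best] + log(transition(best, j)) + log(emission(j, o)))
            ptr.append(best)
        back.append(ptr)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):     # trace the best path backwards
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Toy observations: low, then 15 high points, then low
obs = [0] * 5 + [1] * 15 + [0] * 5
path = viterbi(obs)
print(path)
```

With profile points spaced 10 bp apart, 15 consecutive nucleosome states span roughly one nucleosome, which is one plausible reading of the state count, although the text does not state this.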
It has been hotly debated whether or not nucleosome organization is primarily determined by genomic DNA sequences [8, 27, 28]. By analyzing nucleosome occupancy in yeast, Kaplan et al. concluded that DNA sequence preferences have a dominant role in nucleosome organization [9, 27]. However, subsequent studies reached a different conclusion [16, 17, 28]. The main dispute is to what extent sequence preferences dictate nucleosome organization. In the current study, we systematically investigated 12 structural properties of DNA sequences, including flexibility, curvature and energy, as features for nucleosome occupancy. We have identified some critically important structural features, such as DNA denaturation, DNA-bending stiffness, Stacking energy, Z-DNA, Propeller twist and free energy, which are not only highly correlated with in vitro nucleosome organization, but also account for much of the in vivo nucleosome occupancy. The correlation analysis between the 12 structural features and the G + C content of DNA sequences showed that the predictive power of these structural features is only marginally related to the G + C content. Besides sequence compositional preferences, such as the G + C content, these structural features can also capture long-range interactions that are invisible in local sequences.
Our study provided some new structure-based perspectives on nucleosome organization and gene regulation. Firstly, the genome-wide profiles of these 12 structural features are highly correlated with both in vitro and in vivo nucleosome occupancy. Based on their relation with nucleosome occupancy, these features are classified into two categories, positively and negatively correlated. The peaks of structural profiles for positively correlated features correspond well to nucleosome regions and the valleys match nucleosome-depleted ones, while the opposite holds for negatively correlated features. This suggests that structural properties of DNA sequences may directly determine nucleosome occupancy. These structural features differ in their degrees of correlation with nucleosome occupancy. Secondly, the analysis over centromere regions showed that the structural features of nucleosome-enriched sequences are very different from those of the overall genomic sequence, suggesting that these structural features are involved in chromatin organization, acting as generators or repressors of nucleosome formation. Furthermore, differentially expressed genes exhibit different nucleosome occupancy patterns and chromatin structures in promoter regions. This observation indicated that these structural features play an important part in nucleosome organization and gene regulation, implying that they may bridge the gap between nucleosome organization and gene expression.
Our findings illustrated the power of these structural features in predicting nucleosome occupancy and positioning. We used the least angle regression method to integrate all 12 structural features for predicting nucleosome occupancy. Besides features such as the propeller twist and free energy, which overlap with previous computational studies, we also found that DNA denaturation, DNA-bending stiffness, Stacking energy and Z-DNA are effective in capturing nucleosome occupancy. These structural features capture in vivo nucleosome occupancy more accurately than sequence compositional features, consistent with a previous analysis which indicated that a major sequence signal in vivo is a high-energy barrier rather than favorable sequence motifs. Furthermore, we proposed a novel computational method, DLaNe, to detect peaks (valleys) of structural profiles to locate nucleosome positions. Most of these structural features perform better than the existing methods in locating nucleosomes. We developed a meta DLaNe to integrate the predictive power of the six top-performing features. Based on the profiles of these structural features, we also used an HMM to locate nucleosomes. Our meta DLaNe method and the HMM are more accurate than three recently proposed computational methods in locating nucleosomes, showing the effectiveness of secondary structures in capturing the nucleosome positioning signal. Our prediction method is a new addition to the arsenal of nucleosome positioning prediction.
We downloaded the experimental nucleosome occupancy data measured in recent studies [9, 45, 58]. In these works, MNase digestion of genomic sequences was used, exploiting the susceptibility of nucleosome-depleted sequences to MNase. Then, microarray [45, 58] or massively parallel sequencing techniques were adopted to determine nucleosome occupancy. The data of Lee et al. covered the whole S. cerevisiae genome at a higher resolution (4 bp). Kaplan et al. used parallel sequencing to determine genome-wide nucleosome occupancy. The nucleosome intensity signals from these studies were represented as the log ratio between nucleosomal DNA and genomic DNA, showing nucleosomes as peaks about 150 bp long, surrounded by lower values corresponding to nucleosome-depleted regions. From these studies, the experimental nucleosome occupancy data were collected. We identified 1,000 well-positioned nucleosome positions and 1,000 nucleosome-depleted positions and extracted the corresponding genomic sequences. For the genome-wide comparison of structural profiles and the patterns of nucleosome occupancy, we used the experimentally derived in vitro nucleosome occupancy dataset from Kaplan's study and the in vivo data from Lee's study, respectively.
The complete S. cerevisiae genome (May 2006 build) and the genome annotation were downloaded from the Saccharomyces Genome Database (SGD). To evaluate our prediction method, we compared it with three recent computational methods [8, 14, 54], whose predicted nucleosome positions were collected from their websites [8, 14] or generated by running the program. All predictions were validated against the same reference dataset, a genome-wide atlas of nucleosome positions.
We analyzed a comprehensive list of structural features related to flexibility, curvature and energy of DNA sequences, including Aphilicity, B-DNA twist, Bendability, DNA-bending stiffness, DNA denaturation, Duplex free energy, Duplex disrupt energy, Propeller twist, Protein-DNA twist, Protein deformation, Stacking energy and Z-DNA. For each feature, a corresponding structural model has been constructed by a specific experimental technique. A detailed discussion of these features can be found in [42, 44].
We calculated the structural profiles of the above 12 features on the S. cerevisiae genome. The calculation of a structural profile was divided into two steps. First, we converted each DNA sequence into a numerical sequence by replacing each dinucleotide or trinucleotide with a structural value. This transformation was based on experimentally determined structural models. Second, we used a moving average to smooth the raw structural profiles, with a step of 10 bp and a window size of 100 bp. The final structural profile is a vector of values of the structural feature at a resolution of 10 bp, which can be adjusted as needed. We tried different window sizes ranging from 5 to 200 bp. The results showed that smaller window sizes (< 75 bp) did not smooth the values sufficiently. In contrast, bigger sizes (> 150 bp) had too strong an averaging effect, smoothing out the differences among intrinsic structural patterns at different positions. Thus, to retain a sufficient smoothing effect while avoiding excessive modification of the data, we used a window size of 100 bp rather than the nucleosome size (165 bp). Meanwhile, with a step size of 10 bp for the sliding window, we obtained the structural values at a resolution of 10 bp. This smoothing constraint may slightly affect the results of the subsequent nucleosome localization. For example, if the predicted peaks of structural profiles are located within ± 35 bp of true nucleosomes, the predictions have a resolution of ± 40 bp.
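The two-step profile calculation can be sketched as follows for a dinucleotide-based feature. The lookup table below is a placeholder (the number of G or C bases in each dinucleotide), not one of the experimentally determined scales, and the function name `structural_profile` is hypothetical. The sketch also assumes that the window covers 100 consecutive dinucleotide values and that the sequence contains only A, C, G and T.

```python
def structural_profile(seq, table, window=100, step=10):
    """Map each dinucleotide to a structural value, then smooth by moving average."""
    seq = seq.upper()
    # Step 1: one raw value per dinucleotide start position
    raw = [table[seq[i:i + 2]] for i in range(len(seq) - 1)]
    # Step 2: average `window` raw values, advancing `step` positions each time
    profile = []
    for start in range(0, len(raw) - window + 1, step):
        chunk = raw[start:start + window]
        profile.append(sum(chunk) / window)
    return profile

# Placeholder scale, invented for illustration: G/C count of each dinucleotide
toy_table = {a + b: float((a in "GC") + (b in "GC"))
             for a in "ACGT" for b in "ACGT"}

# Toy sequence: an AT-rich half followed by a GC-rich half
sequence = "AT" * 100 + "GC" * 100
profile = structural_profile(sequence, toy_table)
print(len(profile), profile[0], profile[-1])
```

Each element of the returned list corresponds to one 10 bp step along the sequence, which matches the 10 bp resolution described above; a trinucleotide scale would need `seq[i:i + 3]` and a 64-entry table.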
The structure vector S for a given sequence s can be obtained by the transformation procedure described above. The structural values, stored in S, can be plotted along the sequence to represent the changing patterns of the structural values, as sketched in Figure 6(A). We introduce four variables for defining a peak, i.e., peak intensity P_i, left endpoint P_l, right endpoint P_r, and peak width P_w (Figure 6(B)). To detect significant peaks, a predefined peak significance threshold P_s needs to be determined empirically by inspecting the average P_i. In order to determine P_s for each chromosome, we tried different values in the range [0.1, 1]. The peak detection method performed best when P_s was chosen from [0.3, 0.6]. We can then locate nucleosomes along the sequence as follows:
S.1) Filtering out noise from the structural profiles.
Although an initial smoothing is applied to a structural profile, it may still contain noise. Compared with valid peaks, noise usually appears with low intensity and a narrow shape. To filter out noise while minimizing the amount of modification to the data, we adopt median filtering, i.e., for a position p, its value S_p is replaced by the median value within a predefined window. Here the window size is the same as the previous smoothing size (100 bp). Denote the median filter output of S as SM.
S.2) Determining the peak intensity threshold for each chromosome.
We then scan the noise-reduced structure vector SM with a sliding window. Since the most common distance between adjacent nucleosome centers in S. cerevisiae is approximately 165 bp (about 18 bp of linker), the window width is set to 165 bp rather than to 147 bp as done in previous work.
Peak intensity threshold = P_s·AP_i, where AP_i denotes the average peak intensity.
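As a rough illustration of this threshold, the sketch below takes the peak intensity of a window to be the maximum SM value inside it and AP_i to be the mean of these maxima over all sliding windows. Both definitions are assumptions, since the text does not spell them out, and the function names are hypothetical. The window width is given in profile positions.

```python
from typing import List


def window_intensities(sm: List[float], width: int) -> List[float]:
    """Peak intensity of every sliding window, taken here as the window maximum (assumption)."""
    if len(sm) < width:
        raise ValueError("profile is shorter than one window")
    return [max(sm[i:i + width]) for i in range(len(sm) - width + 1)]


def peak_threshold(sm: List[float], width: int, p_s: float) -> float:
    """Threshold = P_s * AP_i, with AP_i the mean window intensity (assumption)."""
    intensities = window_intensities(sm, width)
    ap_i = sum(intensities) / len(intensities)
    return p_s * ap_i


if __name__ == "__main__":
    sm = [0.0, 2.0, 0.0, 4.0, 0.0, 2.0]
    print(peak_threshold(sm, width=2, p_s=0.5))
```

Because the threshold scales with the chromosome's own average intensity, the same P_s can be reused on chromosomes whose profiles differ in overall magnitude.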
S.3) Searching for each peak's maximum position and endpoints.
To locate concrete nucleosome positions, we take both the structural profiles and steric effects into account, because detailed nucleosome positions are influenced not only by higher-order chromatin structure but also by repulsive and attractive interactions between neighboring nucleosomes. Steric exclusion prevents consecutive nucleosomes from overlapping in space and dictates relatively fixed lengths of linker DNA [60, 61], so overlaps between two nucleosomes are not allowed. A legal placement specifies positions for a set of non-overlapping 147-bp nucleosomes on the S. cerevisiae genome. Peak detection therefore follows this rule. Given a peak intensity threshold, we scan the filtered structure vector SM. If the peak intensity of the window is below the threshold, i.e., P_i < P_s·AP_i, the window contains no significant peak and the sliding window moves forward; otherwise, the window contains a peak. The position with the maximal SM value is taken as the peak center P_c. Since a well-positioned nucleosome covers about 147 bp, the endpoints of the peak are P_l = P_c - 73 and P_r = P_c + 73, where 73 is half the length of a nucleosome. If more than one peak in the current window exceeds the cutoff, the highest peak is chosen by selecting the maximal structural value in the window. The sliding window then moves forward to locate the next nucleosome, and the procedure is repeated until the end of the sequence is reached. In this way, each feature can be used to locate nucleosomes.
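The scanning rule can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation (which is in Additional file 2). The function name `call_nucleosomes` is hypothetical, and the text does not say how far the window advances: here it moves by one full window when no peak is found, and by one nucleosome length past the peak center after a call, which guarantees non-overlapping calls. Sequence ends are not handled.

```python
import math
from typing import List, Tuple


def call_nucleosomes(sm: List[float], threshold: float, step: int = 10,
                     window_bp: int = 165, nuc_bp: int = 147
                     ) -> List[Tuple[int, int, int]]:
    """Return (P_l, P_c, P_r) in bp for each significant, non-overlapping peak."""
    width = max(1, window_bp // step)      # window width in profile positions
    advance = math.ceil(nuc_bp / step)     # positions skipped after a call (assumption)
    half = nuc_bp // 2                     # 73 bp, half a nucleosome
    calls = []
    i = 0
    while i < len(sm):
        window = sm[i:i + width]
        best = max(window)
        if best < threshold:
            i += width                     # no significant peak: move forward
            continue
        centre_idx = i + window.index(best)  # highest value wins within the window
        centre = centre_idx * step
        calls.append((centre - half, centre, centre + half))
        i = centre_idx + advance           # next centre lies at least nuc_bp away
    return calls


if __name__ == "__main__":
    sm = [0.0] * 60
    sm[10], sm[12], sm[40] = 5.0, 3.0, 4.0
    print(call_nucleosomes(sm, threshold=2.0))
```

In the toy profile, the secondary peak at position 12 is discarded because it falls inside the footprint of the stronger peak at position 10, mirroring the rule that the highest peak in a window is chosen.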
S.4) Integrating the predictions of individual features.
Furthermore, we introduce a Random Forest-based meta-predictor to integrate the predictions of different structural features. A Random Forest classifier is an ensemble classifier that consists of many decision trees with variations in structure and outputs the class voted for by the majority of the individual trees. First, the predictions of each feature are collected. For each prediction, three kinds of attributes are extracted: the number of times it is predicted by different features, the distance to its closest neighboring prediction, and whether it is predicted by a certain feature. Second, using the experimental nucleosome positions of one yeast chromosome, we trained the Random Forest-based meta-predictor on these attributes. Third, the trained meta-predictor is applied to decide whether a prediction can be accepted. Finally, all accepted predictions that lie within 73 bp of each other are clustered, and the middle one in each cluster is taken as the meta prediction.
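The attribute extraction and the final clustering step can be sketched in plain Python; the Random Forest itself is omitted here and could be supplied by any standard implementation. All names are hypothetical. The matching tolerance `tol` used to decide whether a feature "also predicts" a position, the single-linkage grouping, and the choice of the upper-middle member in even-sized clusters are assumptions that the text does not specify.

```python
from typing import Dict, List


def candidate_attributes(preds: Dict[str, List[int]], tol: int = 73) -> List[dict]:
    """One row per predicted centre: support count, nearest-neighbour distance, per-feature flags."""
    names = sorted(preds)
    all_centres = sorted((c, n) for n in names for c in preds[n])
    rows = []
    for k, (centre, _name) in enumerate(all_centres):
        # A feature supports this centre if it has a prediction within `tol` bp (assumption).
        flags = {n: any(abs(centre - c) <= tol for c in preds[n]) for n in names}
        others = [abs(centre - c) for j, (c, _) in enumerate(all_centres) if j != k]
        rows.append({
            "centre": centre,
            "support": sum(flags.values()),
            "nearest": min(others) if others else None,
            "flags": flags,
        })
    return rows


def cluster_centres(accepted: List[int], max_gap: int = 73) -> List[int]:
    """Group centres whose consecutive gaps are <= max_gap; return each group's middle member."""
    meta, group = [], []
    for c in sorted(accepted):
        if group and c - group[-1] > max_gap:
            meta.append(group[len(group) // 2])
            group = []
        group.append(c)
    if group:
        meta.append(group[len(group) // 2])
    return meta


if __name__ == "__main__":
    rows = candidate_attributes({"feature_a": [100, 400], "feature_b": [110]})
    print([(r["centre"], r["support"], r["nearest"]) for r in rows])
    print(cluster_centres([100, 110, 130, 400, 420]))
```

The rows returned by `candidate_attributes` would serve as the input table for the classifier, and `cluster_centres` would be applied to the centres that the classifier accepts.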
The hidden Markov model (HMM) has been applied to infer nucleosome positions from genome-wide hybridization data [45, 58]. As the profiles of these structural features are highly correlated with nucleosome occupancy, we also developed an HMM to locate nucleosomes from the structural profiles. Our implementation was based on the HMM toolbox, which was downloaded from Murphy's website http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html. In accordance with the resolution of the transformed structural profiles, the HMM contained 16 distinct states, 15 nucleosome states and one linker state, which differs from previous models [45, 58]. To apply the HMM, the structural profiles of genomic sequences were first transformed as described above. We then trained the model on Chromosome 3 using the reference nucleosome positions of Lee et al. Using the Viterbi algorithm, we applied the trained HMM to compute the most likely state sequence, from which we located the possible nucleosome positions.
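Viterbi decoding over a 16-state model can be illustrated with the self-contained sketch below. It does not use the HMM toolbox cited above, and several choices are assumptions made only for illustration: the left-to-right topology (linker, then nucleosome states 1 to 15 in order, then back to linker), Gaussian emissions, the start in the linker state, and the hand-set parameters, which stand in for values that the paper learned from Chromosome 3.

```python
import math
from typing import List

N_NUC = 15      # nucleosome states, as in the text
LINKER = 0      # state 0 is the linker; states 1..15 are nucleosome positions


def log_gauss(x: float, mu: float, sigma: float) -> float:
    """Log density of a normal distribution (emission model is an assumption)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def build_log_trans(p_stay_linker: float = 0.5) -> List[List[float]]:
    """Hypothetical topology: linker -> N1 -> ... -> N15 -> linker."""
    n = N_NUC + 1
    t = [[float("-inf")] * n for _ in range(n)]
    t[LINKER][LINKER] = math.log(p_stay_linker)
    t[LINKER][1] = math.log(1.0 - p_stay_linker)
    for s in range(1, N_NUC):
        t[s][s + 1] = 0.0            # log(1): forced progression through the nucleosome
    t[N_NUC][LINKER] = 0.0
    return t


def viterbi(obs: List[float], log_trans: List[List[float]],
            mu: List[float], sigma: List[float]) -> List[int]:
    """Most likely state path for `obs`, starting in the linker state (assumption)."""
    n = len(log_trans)
    score = [float("-inf")] * n
    score[LINKER] = log_gauss(obs[0], mu[LINKER], sigma[LINKER])
    back = []
    for x in obs[1:]:
        new, ptr = [], []
        for s in range(n):
            prev = max(range(n), key=lambda r: score[r] + log_trans[r][s])
            new.append(score[prev] + log_trans[prev][s] + log_gauss(x, mu[s], sigma[s]))
            ptr.append(prev)
        score = new
        back.append(ptr)
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):       # trace the best path backwards
        state = ptr[state]
        path.append(state)
    return path[::-1]


def nucleosome_starts(path: List[int]) -> List[int]:
    """Profile indices at which a nucleosome begins (entry into state 1)."""
    return [t for t, s in enumerate(path) if s == 1]


if __name__ == "__main__":
    obs = [0.0] * 3 + [1.0] * 15 + [0.0] * 3
    mu = [0.0] + [1.0] * N_NUC       # placeholder parameters, not trained values
    sigma = [0.3] * (N_NUC + 1)
    print(nucleosome_starts(viterbi(obs, build_log_trans(), mu, sigma)))
```

Multiplying each returned index by the profile resolution (10 bp in the text) would convert the decoded starts into approximate genomic coordinates.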
Additional file 2 provides the implementation of our DLaNe method based on structural features.
This work was supported in part by a United States NSF grant (DBI-0743797), two United States NIH grants (RC1AR058681 and R01GM086412), a grant from the Alzheimer's Association, internal funding from Fudan University, the National Basic Research Program of China (No. 2010CB126604) and two Chinese NSFC grants (No. 60873040 and No. 61173118). YLG was also supported by a grant from the China Scholarship Council. JHG was also supported by the Fundamental Research Funds for the Central Universities and the Shuguang Scholar Program of Shanghai Education Development Foundation.