- Open Access
Cheminformatics methods for novel nanopore analysis of HIV DNA termini
© Winters-Hilt et al; licensee BioMed Central Ltd. 2006
- Published: 26 September 2006
Channel current feature extraction methods, using Hidden Markov Models (HMMs) have been designed for tracking individual-molecule conformational changes. This information is derived from observation of changes in ionic channel current blockade "signal" upon that molecule's interaction with (and occlusion of) a single nanometer-scale channel in a "nanopore detector". In effect, a nanopore detector transduces single molecule events into channel current blockades. HMM analysis tools described are used to help systematically explore DNA dinucleotide flexibility, with particular focus on HIV's highly conserved (and highly flexible/reactive) viral DNA termini. One of the most critical stages in HIV's attack is the binding between viral DNA and the retroviral integrase, which is influenced by the dynamic-coupling induced high flexibility of a CA/TG dinucleotide positioned precisely two base-pairs from the blunt terminus of the duplex viral DNA. This suggests the study of a family of such CA/TG dinucleotide molecules via nanopore measurement and cheminformatics analysis.
HMMs are used for level identification on the current blockades, HMM/EM with boosted variance emissions are used for level projection pre-processing, and time-domain FSAs are used to parse the level-projected waveform for kinetic information. The observed state kinetics of the DNA hairpins containing the CA/TG dinucleotide provides clear evidence for HIV's selection of a peculiarly flexible/interactive DNA terminus.
- Hide Markov Model
- Finite State Automaton
- Integration Host Factor
- Multiclass Support Vector Machine
- Dinucleotide Step
HIV DNA is found to have a highly conserved CA dinucleotide step precisely two base-pairs from its blunt-end terminus [1–7]. In preliminary nanopore studies the blockade level lifetimes of the wild-type 3' end sequence (-C-A-T-G-3') were found to be similar to (-C-A-A-A-3'), consistent with their similarities in DNA conformation and ΔG. This similarly motivated the present study of a small group of nine base-pair stem DNA hairpins consisting of all adenosines on the 3' side of the molecule, except for one cytosine-adenosine step (the "CA-step" set). Contrary to the differences (seemingly) indicated by nature, the calculated ΔG° of hairpin formation (using mFold) is the same for the CA-step set. It is hypothesized that the highly conserved nature of the HIV DNA terminus corresponds to some beneficial flexibility that increases reactivity with the HIV integrase prior to insertion into the host DNA. A test of the hypothesized flexibility/reactivity is sought via analysis of channel current statistics for signs of notably different blockade kinetics between the blunt-ended HIV DNA conformer and the other blunt-ended hairpins in the CA-step set.
Sequence Dependent DNA Conformation
DNA conformation is dependent upon intrinsic properties of a given sequence and upon the environment in which the molecule is studied . Intrinsic sequence-dependent properties include minor groove width [9, 10], propensity to undergo B-to-A transition [11–14], and cation localization in the major vs minor groove [15–26].
Sequence-dependent conformation influences nearly all aspects of DNA biology including enzyme-dependent functions such as replication, transcription, and recombination. Here it is important to distinguish between the two general mechanisms by which enzymes recognize DNA : 1) recognition of functional groups on specific bases in the major groove ('direct' readout); and 2) conformation-dependent enzyme recognition of DNA ('indirect' readout). An example of indirect readout is DNA binding by E. coli Integration Host Factor (IHF). This heterodimeric protein binds to DNA in a sequence-specific manner that causes a 160 degree bend. This bend is required for recombination and transcription. Importantly, IHF contacts the phosphate backbone and the minor groove only, therefore its sequence-specificity must be conformation dependent.
Traditionally, efforts to explain DNA conformation have focused on the propensity of nucleotides to adopt C2' endo vs C3' endo sugar pucker, base stacking, groove hydration, and the preferred geometries of GC vs AT pairs (e.g. propeller twist) [28–33]. An interesting (and controversial) new hypothesis holds that sequence-dependent cation position in the minor or major groove determines DNA conformation . In either case, the structural predictions used to formulate and test hypotheses have relied upon angstrom precision measurements by X-ray diffraction analysis of oligonucleotide crystals and heteronuclear NMR spectrosocopy of DNA in solution.
Structural predictions based on X-ray crystallography and NMR spectroscopy
The first X-ray crystal structure of a DNA oligomer (the 'Dickerson dodecamer') was published in 1981 (Drew and Dickerson ). It established substantial deviation among base pairs in terms of propeller twist, rise per base pair, and sugar pucker. Numerous attempts have been made to understand the structural basis for these differences. As is true for models used to predict thermodynamic stability of duplexes , models based on dinucleotide steps have been reasonably successful. For example, Hassan used structural data from sixty oligomer crystals to establish features of dinucleotide steps that correlate with DNA flexibility. Pyrimidine-purine dinucleotide steps that are known to be flexible (e. g. TA and CA) were associated with little propeller twist and a variety of slide positions whereas steps that are known to be rigid (notably AA steps) were high in propeller twist and they had a limited range of slide. But others argue that dinucleotide steps are inadequate to describe sequence dependent structure and dynamics because context can strongly influence their behavior. This is illustrated in a study by Packer and Hunter  who used a similar crystal structure database to examine the effect of neighboring base pairs on dinucleotide flexibility (as measured by slide and shift). Their results indicate that some dinucleotide steps adopt conformations that are entirely independent of neighboring base pairs (e.g. AA, AT, TA), while others are weakly context dependent (e.g. AC, AG, CA, GA), and still others are strongly context dependent (CG, GC, CC).
Although crystal structures have provided fundamental information that helps illuminate how DNA can bend and twist when bound to proteins, the approach has limitations. For instance, close packing of DNA in crystals is known to alter structure relative to solution phase, and the cryogenic temperatures used for high resolution may lead to under-representation of conformers that are common at physiological temperatures. NMR spectroscopy can overcome these limitations because the experiments are typically run at 1 mM concentration and ambient temperature. This is illustrated by a recent comprehensive study  which compared an NMR structure for the Dickerson dodecamer with a high resolution crystal structure. There were two basic conclusions: 1) The average AATT core structure was very similar for the NMR-based and crystal-based predictions, i.e. strong propeller twist and a narrow minor groove. This is not surprising because the AATT sequence is relatively inflexible , it has been extensively studied , and it is constrained by four base pairs at either end of the dodecamer; 2) by comparison, the predicted structures for the CGCG segments demonstrated a profound variability. The authors attributed this difference to averaging of C3' -endo vs C2' -endo sugar puckering in the NMR structure, particularly among cytosines. At the cryogenic temperatures used for the high resolution crystal structure, the higher energy state C3' -endo pucker would be rarely observed. It is also likely that proximity to the duplex terminus can account for some of the difference because the helix ends overlap in crystals but not in solution . Whether structural averaging by NMR or approximation by a crystallized form, particularly near the important DNA terminal regions, neither approach provides a clear picture of the conformational history of a free molecule in solution at physiological temperature, as is described in what follows.
Structure and Dynamics of Duplex Ends
The structure and dynamics of DNA duplex ends can influence numerous enzyme-dependent processes. Some of the most biologically important of these are integration of transposons and retroviral dsDNA into target chromosomes. Two well studied examples are transposition of the phage Mu genome, and integration of HIV dsDNA copies into target chromosomal DNA. In both cases, a consensus CA dinucleotide step at or near the duplex terminus is believed to confer flexibility on the viral DNA that is required for processing and strand transfer.
DNA duplex ends are significantly under-represented in NMR and crystal structure studies despite their critical importance in biology. For example, Hassan and Calladine's landmark study  was based on X-ray crystal structures for 60 oligomers. A•T pairs appeared only twice in the terminal dinucleotide step of the 120 duplex ends. This under-representation may be due to a historical bias since the Dickerson dodecamer contains only G•C pairs in the four base pair termini. But it may also be due to recognition of a built in bias in crystal structures because the helix ends are known to overlap , and interpretation of their structure is therefore ambiguous. NMR studies of DNA structure have also been biased toward the Dickerson dodecamer and its variants.
Analysis of Individual DNA Hairpin Molecules Using a Protein Pore
A New Method for Single Molecule Detection and Characterization
Channel current based nanopore cheminformatics provides an incredibly versatile method for transducing single molecule events into channel current blockade states (see Figure 1). Single biomolecules and the ends of biopolymers such as DNA have been examined in solution with nanometer-scale precision [40–45]. In work described above , it was found that complete base-pair dissociations of dsDNA to ssDNA, "melting", could be observed for sufficiently short DNA hairpins. In later work [42, 44], the nanopore detector was used to "read" the ends of dsDNA molecules, and was operated as a chemical mixture tester. In recent work [40, 41, 43], the nanopore detector has been used to observe the conformational kinetics at the termini of single DNA molecules. And in the most recent work, reported here, the nanopore is used to measure conformational kinetics of a family of DNA molecules consisting of variations of the HIV DNA consensus terminus.
The channel current cheminformatics architecture
The classification approach adopted in  is designed to scale well to multi-species classification (or a few species in a very noisy environment). The scaling is possible due to use of a decision tree architecture and an SVM approach that permits rejection on weak data. SVMs are usually implemented as binary classifiers but may be grouped in a decision tree to arrive at a Multi-class discriminator. SVMs are much less susceptible to over-training than neural nets . This allows for a much more hands-off training process and provides a more stable classifier.
A multiclass implementation for an SVM is also possible – where multiple hyperplanes are optimized simultaneously. A (single-optimization, multi-hyperplane) multiclass SVM has a much more complicated implementation, but the reward is a classifier that is much easier to tune and train, especially when considering data rejection. The (single) multiclass SVM, doesn't have as non-scalable a throughput problem (with tree depth), and even appears to offer a natural drop zone via its margin definition. therefore it is being considered in further refinements of the method (see  in this same issue for recent applications of these refinements to other channel current data).
The SVM discriminators are trained by the Sequential Minimal Optimization (SMO) procedure . A chunking [57, 58] variant of SMO also is employed to manage the large training task at each SVM node. The multi-class SVM training generally involves thousands of blockade signatures for each signal class.
Different tools are employed at each stage of the signal analysis (as shown in Figure 3) in order to realize the robust (and noise resistant) tools for knowledge discovery, information extraction, and classification . Statistical methods for signal rejection using SVMs are also employed in order to reject extremely noisy signals.
Role of DNA Conformation in HIV DNA Terminus Flexibility/Reactivity
DNA conformation plays a very important role in protein-DNA complex formation . In this process two of the crucial factors are the environment in which the complex is formed and the properties of the specific sequence interacting with the protein or other DNA molecule . Despite the multitude of crystallographic studies [25, 32, 59] conducted on DNA, it is still difficult to translate the sequence-directed curvature information obtained through these tools to actual systems found in solution. Information on the DNA molecule's variation in structure and flexibility is important, however, to understanding the dynamically enhanced (naturally selected) DNA complex formations that are found with strong affinities to other, specific, DNA and protein molecules. Crystallographic and NMR studies alone can't give a perspective about the dynamics of these molecules in environments with similar physiological conditions.
Conformational kinetics of the HIV DNA termini
An important example of DNA conformational flexibility is the HIV attack on T-cells. In the retroviral attack of HIV one of the most critical stages is the integration process of viral DNA into the host DNA . The viral DNA sequence critical to the attachment and insertion of viral DNA into the host DNA is found at the terminus of the blunt-ended viral DNA [2–5]. The integration process is influenced by the dynamic-coupling induced by the high flexibility of a CA/TG dinucleotide positioned precisely two base-pairs from the blunt terminus of the duplex viral DNA . The CA/TG dinucleotide presence is a universal characteristic of retroviral genomes. Deletion of these base pairs impedes the integration process  and it is believed that the unusual flexibility imparted by this base-pair on the terminus geometry is necessary for the binding to integrase. Once bound to integrase the viral DNA molecule is modified by removal of the two residues at the 3'-end together with subsequent insertion into the host genome. Our hypothesis is that the DNA hairpin with a CA/TG dinucleotide positioned two base-pairs from the blunt terminus will have channel current statistics differentiable from the other DNA hairpins.
In what follows kinetic feature extraction is done on two types of channel current blockade events: (i) fixed level blockades, and (ii) blockade "spikes" (anomalous deflections from a specified level). The spike detection, and thus spike frequency, algorithm is FSA-based. The blockade level lifetime analysis is primarily HMM-based, where HMM/EM with boosted variance emissions is used for level projection pre-processing, and time-domain FSAs are used to parse the level-projected waveform for kinetic information. This provides a robust kinetic feature extraction formalism with a minimal amount of FSA-level tuning. Application of the spike detection tool permits strong discrimination capability not otherwise possible between DNA molecules with and without minor radiation damage. Application of the HMM kinetic feature extraction tool permits statistical differences to be discernible between molecules in the study of HIV DNA (described in what follows). The rich set of kinetic features obtained allows for DNA terminus classification/clustering. An SVM-based clustering method has been developed and was applied to the control molecules to test this capability. A Web-interface to the various software tools used is also described.
τ-FSA Blockade Acquisition and time-domain Feature Extraction
The spike detector software is designed to count "anomalous" spikes, i.e., spike noise not attributable to the gaussian fluctuations about the mean of the dominant blockade-level. Spike count plots are generated to show increasing counts as cut-off thresholds are relaxed (to where eventually any downward deflection will be counted as a spike). The plots are automatically generated and automatically fit with extrapolations of their linear phases (at the group's CCCool-tools website). The extrapolations provide an estimate of "true" anomalous spike counts – counts associated with terminus fraying in the captured DNA hairpin (via mechanism discussed in ). For the study above, the radiated form of the molecule frayed 17.6 times a second, on average, while in the LL state. The non-radiated molecule only frayed 3.58 times a second, on average, from the LL state (see Figure 5). This result is consistent with the weakened hydrogen bonding at the terminus of the radiation-damaged molecule.
Cheminformatics analysis of DNA conformational kinetics
It was hypothesized that the highly conserved nature of the HIV DNA terminus corresponds to some beneficial flexibility and thus reactivity with HIV integrase prior to insertion into the host DNA, and that this might lead to some statistically discernable difference in their channel blockade statistics. A test of the hypothesized flexibility/reactivity was performed on the set of DNA hairpins with a single CA dinucleotide step. Analysis of channel current statistics (Fig. 7b) shows that the blunt-ended HIV DNA conformer has notably different blockade kinetics than the other blunt-ended hairpins in the CA set (see Fig. 7a).
The unoSVM and CCCool Tools interfaces
Emission Variance Amplification (EVA) Projection
It is hypothesized that emission variance amplification (EVA) in a non-uniformly increasing transition probability region leads to Viterbi path migration with each EM/EVA iteration towards the dominant levels (regions of high occupation probability), while strongly preserving the transition times of level changes. The migration of fluctuations is disrupted (and the method fails) if pre-processing is done with a low-pass filter (using an N-sample moving average, for example, with N = 8). This may provide a method for automatically tuning the low-pass filter – by narrowing the pass band until the projection method fails and tuning accordingly. This offers the prospect of fewer tuning subtleties than the emergent-structure tuning, via wavelet FSA, that is currently used.
HMM-with-duration Viterbi Implementation
HMM-with-duration directly incorporates sub-blockade duration probabilities and provides a strong link to the underlying kinetic (physical) information. It is parameterized by the internal HMM signal representation (the emission and transition probabilities, and the duration distributions on state lifetimes), and can be efficiently and safely implemented (see  in this issue for further details). By incorporating HMM-with-duration, feature extraction will be more robust on long-lifetime states.
The Machine Learning Software Interface Project
The high volume and complexity of typical, noisy bioinformatics and cheminformatics (real-world) data motivates the use of sophisticated, yet highly efficient machine learning programs. The group website at http://logos.cs.uno.edu/~nano/ provides interfaces to: (i) several binary SVM variants (with novel kernel selections and heuristics); (ii) a multiclass (internal) SVM; (iii) an SVM-based Clustering tool; (iv) an FSA-based nanopore spike detector; (v) an HMM-parameter channel current feature extraction tool; and (vi) a kinetic feature extraction tool (via channel current sub-level lifetimes). The website is designed using HTML and CGI scripts that are executed to process the data sent when a form filled in by the user is received at the web server – results are then e-mailed to the address indicated by the user.
SVM Kernel Selection
Given its geometric expression, it is not surprising that a key construct in the SVM formulation (via the choice of kernel) is the notion of "nearness" between instances or nearness to the hyperplane, where it gives a measure of confidence in the classification, i.e., instances further from the decision hyperplane are called with greater confidence (see Figure 4). Most notions of nearness explored in this context have stayed with the geometric paradigm and are known as "distance kernels." One example being the familiar Gaussian kernel which is based on the Euclidean distance: KGaussian(x,y) = exp(-DEucl.(x,y)2/2σ2), where DEucl.(x,y) = [Σk(xk-yk)2]1/2 is the usual Euclidean distance. Those kernels are used in the signal pattern recognition analysis in Figure 3 along with a new class of kernels, "divergence kernels," based on a notion of nearness appropriate when comparing probability distributions (or probability feature vectors). The main example of this is the Entropic Divergence Kernel: KEntropic = exp(-DEntropic.(x,y)2/2σ2), where DEntropic.(x,y) = D(x||y)+D(y||x) and D(..||..) is the Kullback-Leibler Divergence (or relative entropy) between x and y.
HMM kinetic feature extraction methods have been developed. Application of the channel current cheminformatics tools to a set of DNA hairpins with single CA-dinucleotide steps clearly reveals the peculiar flexibility and interactivity of the HIV DNA consensus terminus.
Each experiment is conducted using one α-hemolysin channel inserted into a diphytanoyl-phosphatidylcholine/hexadecane bilayer across a 25-micron-diameter horizontal Teflon aperture, as described previously . Seventy microliter chambers on either side of the bilayer contains 1.0 M KCl buffered at pH 8.0 (10 mM HEPES/KOH) except in the case of buffer experiments where the salt concentration, pH, or identity may be varied. Voltage is applied across the bilayer between Ag-AgCl electrodes. DNA control probes are added to the cis chamber at 10 or 20 μM final concentration. All experiments are maintained at room temperature (23 ± 0.1°C), using a Peltier device.
Control probe design
Since the five DNA hairpins studied in the prototype experiment have been carefully characterized, they are used in further experiments as highly sensitive controls. The nine base-pair hairpin molecules examined in the prototype experiment share an eight base-pair hairpin core sequence, with addition of one of the four permutations of Watson-Crick base-pairs that may exist at the blunt end terminus, i.e., 5'-G•C-3', 5'-C•G-3', 5'-T•A-3', and 5'-A•T-3'. Denoted 9GC, 9CG, 9TA, and 9AT, respectively. The full sequence for the 9CG hairpin is 5' CTTCGAACG TTTTCGTTCGAAG 3', where the base-pairing region is underlined. The eight base-pair DNA hairpin is identical to the core nine base-pair subsequence, except the terminal base-pair is 5'-G•C-3'. The prediction that each hairpin would adopt one base-paired structure was tested and confirmed using the DNA mfold server http://bioinfo.math.rpi.edu/~mfold/dna/form1.cgi, which is based in part on data from .
DNA hairpin design
Seven DNA molecules were designed to contain a CA/TG dinucleotide at different positions along the DNA stem (labeled CA_0 – CA_6). In the control molecule the stem did not contain this base-pair, ignoring the CA at the loop terminus, and based on crystallographic predictions the stem was designed to be very rigid . The DNA molecules used for the experiments were designed with the aid of the M-fold program . Single stranded DNA (ssDNA) molecules were obtained from IDTDNA as powders, resuspended in TE buffer at a 10 mM concentration and stored at 4°C. The dsDNA molecules were obtained by annealing the resuspended ssDNA molecules at the required temperatures  and then were stored at the same temperature as the ssDNA molecules for further usage. The following ssDNA molecules were used to obtain the dsDNA hairpin structures:
CA_0 5'-TTTTTTTTG TTTTCAAAAAAAA - 3'
CA_1 5'-TGTTTTTTG TTTTCAAAAAACA - 3'
CA_2 5'-TTGTTTTTG TTTTCAAAAACAA - 3'
CA_3 5'-TTTGTTTTG TTTTCAAAACAAA - 3'
CA_4 5'-TTTTGTTTG TTTTCAAACAAAA - 3'
CA_5 5'-TTTTTGTTG TTTTCAACAAAAA - 3'
CA_6 5'-TTTTTTGTG TTTTCACAAAAAA - 3'
Data is acquired and processed in two ways depending on the experimental objectives. The first method uses commercial software from Axon Instruments (Redwood City, CA) to acquire data, where current will typically be filtered at 50 kHz bandwidth using an analog low pass Bessel filter and recorded at 20 μs intervals using an Axopatch 200B amplifier (Axon Instruments, Foster City, CA) coupled to an Axon Digidata 1200 digitizer. Applied potential is 120 mV (trans side positive) unless otherwise noted. In some experiments, semi-automated analysis of transition level blockades, current, and duration are performed using Clampex (Axon Instruments, Foster City, CA). The second method uses LabView-based experimental automation. In this case, ionic current is also acquired using an Axopatch 200B patch clamp amplifier (Axon Instruments, Foster City, CA), but it is then recorded using a NI-MIO-16E-4 National Instruments data acquisition card (National Instruments, Austin TX). In the LabView format, data is low-pass filtered by the amplifier unit at 50 kHz, and recorded at 20 μs intervals. In both fixed duty cycle (i.e., not feedback controlled) data acquisition approaches, the solution sampling protocol uses periodic reversal of the applied potential to accomplish the capture and ejection of single biomolecules. The biomolecules captured are typically added to the cis chamber in 20 μM concentrations. The time-domain finite state automaton (FSA, ) used in the prototype is used to perform the generic signal identification/acquisition for the first 100 msec of blockade signal (Acquisition Stage, Figure 8). The effective duty cycle for acquiring 100 ms blockade measurements, when found to be sufficient for classification purposes, is adjusted to approximately one reading every 0.4 seconds by choice of analyte concentration. Further details on the voltage toggling protocol and the time-domain FSA are in .
Channel Current Signal Analysis & Pattern Recognition
Signal Preprocessing Details
Each 100 ms signal acquired by the time-domain FSA consists of a sequence of 5000 sub-blockade levels (with the 20 μs analog-to-digital sampling). Signal preprocessing is then used for adaptive low-pass filtering. For the data sets examined, the preprocessing is expected to permit compression on the sample sequence from 5000 to 625 samples (later HMM processing then only required construction of a dynamic programming table with 625 columns). The signal preprocessing makes use of an off-line wavelet stationarity analysis (Off-line Wavelet Stationarity Analysis, Figure 8, also see ).
HMMs and Supervised Feature Extraction Details
With completion of preprocessing, an HMM  is used to remove noise from the acquired signals, and to extract features from them (Feature Extraction Stage, Figure 8). The HMM is, initially, implemented with fifty states, corresponding to current blockades in 1% increments ranging from 20% residual current to 69% residual current. The HMM states, numbered 0 to 49, corresponded to the 50 different current blockade levels in the sequences that are processed. The state emission parameters of the HMM are initially set so that the state j, 0 <= j <= 49 corresponding to level L = j+20, can emit all possible levels, with the probability distribution over emitted levels set to a discretized Gaussian with mean L and unit variance. All transitions between states are possible, and initially are equally likely. Each blockade signature is de-noised by 5 rounds of Expectation-Maximization (EM) training on the parameters of the HMM. After the EM iterations, 150 parameters are extracted from the HMM. The 150 feature vector components are extracted from parameterized emission probabilities, a compressed representation of transition probabilities, and use of a posteriori information deriving from the Viterbi path solution (further details in ). This information elucidates the blockade levels (states) characteristic of a given molecule, and the occupation probabilities for those levels, but doesn't directly provide kinetic information. The resulting parameter vector, normalized such that vector components sum to unity, is used to represent the acquired signal during discrimination at the Support Vector Machine stages.
Kinetic Feature Extraction
Extraction of kinetic information was done in two ways (with equivalent feature extractions). The initial method applied begins with identification of the main blockade levels for the various blockade classes (off-line HMM analysis). This information is then used to scan through already labeled (classified) blockade data, with projection of the blockade levels onto the levels previously identified (for that class of molecule). A time-domain FSA performs the above scan, and uses the information obtained to tabulate the lifetimes of the various blockade levels. Once the lifetimes of the various levels are obtained, information about a variety of kinetic properties is accessible. The complication of this "brute force" approach is that the FSA needed to extract kinetic features from the noisy, level-projected, waveform requires careful tuning.
Emission Variance Amplification (EVA) Projection
In the context of an HMM implementation with a stationary set of emission and transition probabilities, emission broadening via amplification of the emission state variances is a filtering heuristic that leads to a level-projection that strongly preserves transition times between major levels. In other words, emission variance amplification (EVA) highly preserves the transition macro-structure between the significant blockade levels. This provides robust kinetic feature extraction with minimal tuning at the FSA kinetic feature extraction stage.
SWH and other New Orleans researchers, ML, MT, IA, EM, CB, and SS, would like to thank MA and Prof. David Deamer at UCSC for strong collaborative support post-Katrina. SWH would like to thank Dr. Wenonah Vercoutere at NASA-Ames for the radiation damaged DNA dataset. Funding was provided by grants from the National Institutes for Health, The National Science Foundation, The Louisiana Board of Regents, and NASA.
- Polard P, Chandler M: Bacterial transposases and retroviral integrases. Mol Microbiol 1995, 15(1):13–23.View ArticlePubMedGoogle Scholar
- Colicelli J, Goff SP: Mutants and pseudorevertants of Moloney murine leukemia virus with alterations at the integration site. Cell 1985, 42(2):573–580.View ArticlePubMedGoogle Scholar
- Roe T, Chow SA, Brown PO: 3'-end processing and kinetics of 5'-end joining during retroviral integration in vivo. J Virol 1997, 71(2):1334–13340.PubMed CentralPubMedGoogle Scholar
- Kulkosky J, Katz RA, Merkel G, Skalka AM: Activities and substrate specificity of the evolutionarily conserved central domain of retroviral integrase. Virology 1995, 206(1):448–456.View ArticlePubMedGoogle Scholar
- Craigie R, Fujiwara T, Bushman F: The IN protein of Moloney murine leukemia virus processes the viral DNA ends and accomplishes their integration in vitro . Cell 1990, 62(4):829–837.View ArticlePubMedGoogle Scholar
- Scottoline BP, Chow S, Ellison V, Brown PO: Disruption of the terminal base pairs of retroviral DNA during integration. Genes Dev 1997, 11(3):371–382.View ArticlePubMedGoogle Scholar
- Colicelli J, Goff SP: Sequence and spacing requirements of a retrovirus integration site. J Mol Biol 1988, 199(1):47–59.View ArticlePubMedGoogle Scholar
- Wu Z, Delaglioa F, Tjandrab N, Zhurkinc VB, Bax A: Overall structure and sugar dynamics of a DNA dodecamer from homo and heteronuclear dipolar couplings and 31P chemical shift anisotropy. Journal of Biomolecular NMR 2003, 26: 297–315.View ArticlePubMedGoogle Scholar
- Dickerson RE, Drew HR: Structure of a B-DNA Dodecamer. II. Influence of Base Sequence on Helix Structure. J Mol Biol 1981, 149: 761–786.View ArticlePubMedGoogle Scholar
- Shatzky-Schwartz M, Arbuckle ND, Eisenstein M, Rabinovich D, Bareket-Samish A, Haran TE, Luisi BF, Shakked Z: X-ray and solution studies of DNA oligomers and implications for the structural basis of A-tract-dependent curvature. J Mol Biol 1997, 267: 595–623.View ArticlePubMedGoogle Scholar
- Minchenkova LE, Schyokina AK, Chernov BK, Ivanov VI: CC/GG-contacts facilitate the B to A transition in solution. J Biomol Struct Dyn 1986, 4: 463–476.View ArticlePubMedGoogle Scholar
- Peticolas WL, Wang Y, Thomas GA: Some rules for predicting the base-sequence dependence of DNA conformation. Proc Natl Acad Sci USA 1988, 85(8):2579–2583.PubMed CentralView ArticlePubMedGoogle Scholar
- Basham B, Ho PS: An A-DNA triplet code: Thermodynamic rules for predicting A- and B-DNA. Proc Natl Acad Sci USA 1995, 92: 6464–6468.PubMed CentralView ArticlePubMedGoogle Scholar
- Ivanov VI, Minchenkova LE: The A-form of DNA: in search of biological role (a review). Mol Biol 1995, 28: 780–788.Google Scholar
- Frøystein NA, Davis JT, Reid BR, Sletten E: Sequence-Selective Metal Ion Binding to DNA Oligonucleotides. Acta Chem Scand 1993, 47: 649–657.View ArticlePubMedGoogle Scholar
- Sletten E, Frøystein NA: NMR Studies of Oligonucleotide – Metal Ion Interactions. In Metal Ions in Biological Systems Edited by: Sigel H, Sigel A. 1996, 32: 397–418.Google Scholar
- Young MA, Jayaram B, Beveridge DL: Intrusion of Counterions into the Spine of Hydration in the Minor Groove of B-DNA: Fractional Occupancy of Electronegative Pockets. J Am Chem Soc 1997, 119: 59–69.View ArticleGoogle Scholar
- Hud NV, Feigon J: Localization of divalent metal ions in the minor groove of DNA A-tracts. J Am Chem Soc 1997, 119: 5756–5757.View ArticleGoogle Scholar
- Shui X, McFail-Isom L, Hu GG, Williams LD: The B-DNA dodecamer at high resolution reveals a spine of water on sodium. Biochemistry 1998, 37: 8341–8355.View ArticlePubMedGoogle Scholar
- Shui XQ, Sines CC, McFail-Isom L, VanDerveer D, Williams LD: Structure of the Potassium Form of CGCGAATTCGCG: DNA Deformation by Electrostatic Collapse around Inorganic Cations. Biochemistry 1998, 37: 16877–16887.View ArticlePubMedGoogle Scholar
- Hud NV, Sklenar V, Feigon J: Localization of ammonium lon in the minor groove of DNA duplexes in solution and the origin of DNA A-tract bending. J Mol Biol 1999, 286: 651–660.View ArticlePubMedGoogle Scholar
- Tereshko V, Minasov G, Egli M: A "Hydrat-Ion Spine" in a B-DNA minor groove. J Am Chem Soc 1999, 121: 3590–3595.View ArticleGoogle Scholar
- Denisov VP, Halle B: Sequence-specific binding of counterions to B-DNA. Proc Natl Acad Sci USA 2000, 97: 629–633.PubMed CentralView ArticlePubMedGoogle Scholar
- Howerton SB, Sines CC, VanDerveer D, Williams LD: Locating monovalent cations in the grooves of B-DNA. Biochemistry 2001, 40: 10023–10031.View ArticlePubMedGoogle Scholar
- MacPherson A: Introduction to Macromolecular Crystallography. Wiley-Liss; 2002.Google Scholar
- Hud N, Feigon F: Characterization of Divalent Cation Localization in the Minor Groove of the A n T n and T n A n DNA Sequence Elements by 1 H NMR Spectroscopy and Manganese(II). Biochemistry 2002, 41: 9900–9910.View ArticlePubMedGoogle Scholar
- Lynch TW, Read EK, Mattis AN, Gardner J, Rice PA: Integration host factor: putting a twist on protein-DNA recognition. J Mol Biol 2003, 330(3):493–502.View ArticlePubMedGoogle Scholar
- Drew H, Dickerson R: Structure of a B-DNA Dodecamer. III. Geometry of Hydration. J Mol Biol 1981, 151: 535–556.View ArticlePubMedGoogle Scholar
- Calladine CR: Mechanics of sequence-dependent stacking of bases in B-DNA. J Mol Biol 1982, 161: 343–352.View ArticlePubMedGoogle Scholar
- Nelson HCM, Finch JT, Luisi BF, Klug A: The structure of an oligo(dA). Oligo(dT) Tract and its biological implications. Nature 1987, 330: 221–226.View ArticlePubMedGoogle Scholar
- Dickerson RE: DNA structure from A to Z. Methods Enzymol 1992, 211: 67–111.View ArticlePubMedGoogle Scholar
- El Hassan MA, Calladine CR: Propeller-twisting of base-pairs and the flexibility of dinucleotide steps. J Mol Biol 1996, 259: 95–103.View ArticlePubMedGoogle Scholar
- Suzuki M, Amano N, Kakinuma J, Tateno M: Use of 3D structure data for understanding sequence-dependent conformational aspects of DNA. J Mol Biol 1997, 274: 421–435.View ArticlePubMedGoogle Scholar
- Hud NV, Polak M: DNA-cation interactions: the major and minor grooves are flexible ionophores. Curr Opin Struct Biol 2001, 11: 293–301.View ArticlePubMedGoogle Scholar
- SantaLucia J: A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc Natl Acad Sci USA 1998, 95(4):1460–1465.PubMed CentralView ArticlePubMedGoogle Scholar
- Packer MJ, Dauncey MP, Hunter CA: Sequence-dependent DNA structure: tetranucleotide conformational maps. J Mol Biol 2000, 295: 85–103.View ArticlePubMedGoogle Scholar
- Packer MJ, Dauncey MP, Hunter CA: Sequence-dependent DNA structure: dinucleotide conformational maps. J Mol Biol 2000, 295: 71–83.View ArticlePubMedGoogle Scholar
- Hud NV, Plavec J: A Unified Model for the Origin of Sequence-Directed Curvature. Biopolymers 2003, 69: 144–158.View ArticlePubMedGoogle Scholar
- Song L, Hobaugh MR, Shustak C, Cheley S, Bayley H, Gouaux JE: Structure of Staphylococcal Alpha-Hemolysin, a Heptameric Transmembrane Pore. Science 1996, 274(5294):1859–1866.View ArticlePubMedGoogle Scholar
- Winters-Hilt S: Nanopore detection using channel current cheminformatics. SPIE Second International Symposium on Fluctuations and Noise, 25–28 May, 2004Google Scholar
- Winters-Hilt S, Akeson M: Nanopore cheminformatics. DNA and Cell Biology 2004, 23(10):675–83.View ArticlePubMedGoogle Scholar
- Winters-Hilt S, Vercoutere W, DeGuzman VS, Deamer DW, Akeson M, Haussler D: Highly Accurate Classification of Watson-Crick Basepairs on Termini of Single DNA Molecules. Biophys J 2003, 84: 967–976.PubMed CentralView ArticlePubMedGoogle Scholar
- Winters-Hilt S: Highly Accurate Real-Time Classification of Channel-Captured DNA Termini. Third International Conference on Unsolved Problems of Noise and Fluctuations in Physics, Biology, and High Technology 2003, 355–368.Google Scholar
- Vercoutere W, Winters-Hilt S, DeGuzman VS, Deamer D, Ridino S, Rogers JT, Olsen HE, Marziali A, Akeson M: Discrimination Among Individual Watson-Crick Base-Pairs at the Termini of Single DNA Hairpin Molecules. Nucl Acids Res 2003, 31: 1311–1318.PubMed CentralView ArticlePubMedGoogle Scholar
- Vercoutere W, Winters-Hilt S, Olsen H, Deamer DW, Haussler D, Akeson M: Rapid discrimination among individual DNA hairpin molecules at single-nucleotide resolution using an ion channel. Nat Biotechnol 2001, 19(3):248–252.View ArticlePubMedGoogle Scholar
- Senior MM, Jones RA, Breslauer KJ: Influence of loop residues on the relative stabilities of DNA hairpin structures. Proc Natl Acad Sci USA 1988, 85: 6242–6246.PubMed CentralView ArticlePubMedGoogle Scholar
- Michael D: Chem-Site 3.01. Pyramid Learning LLC, Hudson, OH; 1999.Google Scholar
- Cormen TH, Leiserson CE, Rivest RL: Introduction to Algorithms. MIT-Press, Cambridge, USA; 1989.Google Scholar
- Chung S-H, Moore JB, Xia L, Premkumar LS, Gage PW: Characterization of single channel currents using digital signal processing techniques based on Hidden Markov models. Philos Trans R Soc Lond B Biol Sci 1990, 329: 265–285.View ArticlePubMedGoogle Scholar
- Chung S-H, Gage PW: Signal processing techniques for channel current analysis based on hidden Markov models. In Methods in Enzymology; Ion channels, Part B. Edited by: Conn PM. Academic Press, Inc., San Diego; 1998:420–437.View ArticleGoogle Scholar
- Colquhoun D, Sigworth FJ: Fitting and statistical analysis of single-channel products. Single-channel recording. Second edition. Edited by: Sakmann B, Neher E. Plenum Publishing Corp, New York; 1995:483–587.View ArticleGoogle Scholar
- Durbin R: Biological sequence analysis : probabilistic models of proteins and nucleic acids. Cambridge, UK & New York: Cambridge University Press; 1998.View ArticleGoogle Scholar
- Vapnik VN: The Nature of Statistical Learning Theory. 2nd edition. Springer-Verlag, New York; 1998.Google Scholar
- Burges CJC: A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 1998, 2: 121–67.View ArticleGoogle Scholar
- Winters-Hilt S, Yelundur A, McChesney C, Landry M: Support Vector Machine Implementations for Classification & Clustering. BMC Bioinformatics 2006, 7(Suppl 2):S4.PubMed CentralView ArticlePubMedGoogle Scholar
- Platt JC: Fast Training of Support Vector Machines using Sequential Minimal Optimization. In Advances in Kernel Methods – Support Vector Learning. Volume 12. Edited by: Scholkopf B, Burges CJC, Smola AJ. MIT Press, Cambridge, USA; 1998.Google Scholar
- Osuna E, Freund R, Girosi F: An improved training algorithm for support vector machines. In Neural Networks for Signal Processing VII. Edited by: Principe J, Gile L, Morgan N, Wilson E. IEEE, New York; 1997:276–85.Google Scholar
- Joachims T: Making large-scale SVM learning practical. In Advances in Kernel Methods – Support Vector Learning. Volume 11. Edited by: Scholkopf B, Burges CJC, Smola AJ. MIT Press, Cambridge, USA; 1998.Google Scholar
- Hays FA, Teegarden A, Jones ZJR, Harms M, Raup D, Watson J, Cavaliere E: How Does Sequence Define Structure? a Crystallographic map of DNA structure and conformation. Proc Natl Acad Sci 2005, 102: 7157–7162.PubMed CentralView ArticlePubMedGoogle Scholar
- Winters-Hilt S: Hidden Markov Model Variants and their Application. BMC Bioinformatics 2006, 7(suppl 2):S14.PubMed CentralView ArticlePubMedGoogle Scholar
- Akeson M, Branton D, Kasianowicz JJ, Brandin E, Deamer DW: Microsecond Time-Scale Discrimination Among Polycytidylic Acid, Polyadenylic Acid, and Polyuridylic Acid as Homopolymers or as Segments Within Single RNA Molecules. Biophys J 1999, 77(6):3227–3233.PubMed CentralView ArticlePubMedGoogle Scholar
- Diserbo M, Masson P, Gourmelon P, Caterini R: Utility of the wavelet transform to analyze the stationarity of single ionic channel recordings. J Neurosci Methods 2000, 99(1–2):137–141.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.