Phylogenetic reconstruction in the Order Nymphaeales: ITS2 secondary structure analysis and in silico testing of maturase k (matK) as a potential marker for DNA bar coding
© Biswal et al.; licensee BioMed Central Ltd. 2012
Published: 13 December 2012
The Nymphaeales (waterlilly and relatives) lineage has diverged as the second branch of basal angiosperms and comprises of two families: Cabombaceae and Nymphaceae. The classification of Nymphaeales and phylogeny within the flowering plants are quite intriguing as several systems (Thorne system, Dahlgren system, Cronquist system, Takhtajan system and APG III system (Angiosperm Phylogeny Group III system) have attempted to redefine the Nymphaeales taxonomy. There have been also fossil records consisting especially of seeds, pollen, stems, leaves and flowers as early as the lower Cretaceous. Here we present an in silico study of the order Nymphaeales taking maturaseK (matK) and internal transcribed spacer (ITS2) as biomarkers for phylogeny reconstruction (using character-based methods and Bayesian approach) and identification of motifs for DNA barcoding.
The Maximum Likelihood (ML) and Bayesian approach yielded congruent fully resolved and well-supported trees using a concatenated (ITS2+ matK) supermatrix aligned dataset. The taxon sampling corroborates the monophyly of Cabombaceae. Nuphar emerges as a monophyletic clade in the family Nymphaeaceae while there are slight discrepancies in the monophyletic nature of the genera Nymphaea owing to Victoria-Euryale and Ondinea grouping in the same node of Nymphaeaceae. ITS2 secondary structures alignment corroborate the primary sequence analysis. Hydatellaceae emerged as a sister clade to Nymphaeaceae and had a basal lineage amongst the water lilly clades. Species from Cycas and Ginkgo were taken as outgroups and were rooted in the overall tree topology from various methods.
MatK genes are fast evolving highly variant regions of plant chloroplast DNA that can serve as potential biomarkers for DNA barcoding and also in generating primers for angiosperms with identification of unique motif regions. We have reported unique genus specific motif regions in the Order Nymphaeles from matK dataset which can be further validated for barcoding and designing of PCR primers. Our analysis using a novel approach of sequence-structure alignment and phylogenetic reconstruction using molecular morphometrics congrue with the current placement of Hydatellaceae within the early-divergent angiosperm order Nymphaeales. The results underscore the fact that more diverse genera, if not fully resolved to be monophyletic, should be represented by all major lineages.
The Basal angiosperm Order Nymphaeales is a group of water-living flowering plants. Though the group is taxonomically small, it has great significance in understanding the early evolutionary pattern of angiosperms. Classification of this Order varies from recognition of two to four families. A lot of progress has been made in recent years in understanding both the taxonomic position of Nymphaeales in the angiosperm tree and the relationship within the water lily clade [1–3].
Usually, two families, Cabombaceae and Nymphaeaceae are recognised. The Cabombaceae comprises the genera Cabomba and Brasenia and Nymphaeaceae comprise six genera: Euryale, Ondinea, Victoria, Barclaya, Nuphar and Nymphaea, the largest and most cosmopolitan in nature. Until recently, Hydatellaceae was placed among the monocots in previous systems and was placed with Poales, but a recent study with multi-marker plastid dataset found that the family belongs to Nymphaeales and includes two genera (Hydatella and Trithuria), which is restricted to Australiasia and India .
The Order Nymphaeales was considered to include the genera Nelumbo and Ceratophyllum as per earlier taxonomic treatments based on morphology [5–8]. However, in recent times with the use of modern molecular bio-markers, Nelumbo and Ceratophyllum are excluded, thereby, substantiating the monophyly of Nymphaeles [8–10]. This provided an impetus for revaluation of morphological characters that revealed the presence of certain features such as tricolpate pollen or epicuticular wax tubules in Nelumbo thereby further substantiating its exclusion from Nymphaeales [5, 11].
Hydatellaceae as it represents the single exception in an otherwise relatively harmonious congruence between the traditional and molecular circumscription of the monocot clade, the structural diversity of this remarkable family is of considerable interest. They are small and inconspicuous plants that received little attention from botanists prior to their taxonomic reassignment to the basal angiosperms. It would be really interesting to review our current knowledge on this species-poor but interesting family that has only recently been discovered in India .
Morphological and molecular data generally indicate a close association of Cabomba and Brasenia thereby affirming the monophyly of the family Cabombaceae [12, 13], whereas the monophyly of the family Nymphaeaceae is yet to get much support from the taxonomic community.
DNA barcoding has become an indispensable tool in identifying biological specimens using a short standardized region of both genomic as well as extra-chromosomal DNA very much in the way what universal product codes do for identification of consumer goods. Research community interested in DNA barcodes want to place their query sequences within the taxonomic hierarchy which is achieved by conventional sequence similarity search methods viz., Basic Local Alignment Search Tool (BLAST), Fast Alignment (FASTA) etc. that are often twitched to overcome biological mutations or sampling bias and this, in turn, poses tricky issues like successful tracking of minuscule sequence variations observed among closely related species. A step further, character based similarity relying on common ancestry is also employed in the form of phylogenetic trees or in the form of implicit hierarchic taxonomic descriptors . These methods heavily depend on multiple sequence alignments (MSA) which in fact, is a challenge as the barcoding requirements are contradictory to the very objective of MSA, i.e., looking for hyper variable regions to delineate the closely related species and yet be highly conserved for allowing design of universal PCR primers. Keeping these in mind, selecting a core barcode abiding the three important barcoding principles (standardization, minimalism and scalability) still remains a challenge for plant DNA barcoding unlike animal DNA barcoding. The standard animal Cytochrome oxidase (COI) DNA barcode being a haploid and uniparentally inherited with a single locus exhibiting high levels of discriminatory power fits well into the above barcoding criteria .
COI is a protein coding marker with high copy numbers per cell devoid of microinversions (frequent mononucleotide repeats) and drastic length variation with developed primer sets that aid in routine recovery of high quality sequence from animal clades and sequence recovery from poorly preserved samples as well . Finding a standard plant barcode analogue to COI in animals has proved difficult and COI from plant mitochondrial DNA (mtDNA) generally exhibits low nucleotide substitution rates thereby making it unsuccessful for universal plant barcoding initiatives. There are core research groups who have worked both in silico and in vitro suggesting multiple plastid markers but eventually couldn't arrive at a conclusion  and thus maturase K (matK) still holds good as a suitable substitute plant barcode that can be considered the animal barcode COI analogue .
matK is one of the most rapidly evolving coding regions in the plastid genome but unfortunately poses difficulty in PCR amplification with already existing universal primer sets especially in non-angiosperms contrary to another barcode region ribulose-bisphosphate carboxylase (rbcl) gene which is easy to amplify, sequence and align despite having modest discriminatory power . Hence, two marker plastid barcodes (rbcl+matK) are suggested as core barcodes until further works on matK universal primer development are a success. With these two joint challenges (matk primers in want of improvement and uncertainty in discriminatory powers of two plastid marker (rbcl+matk) system), continued sequencing and exploration of new possibilities in non-coding markers viz. trnH-psbA and internal transcribed spacers (ITS1 & ITS2) are harnessed to formalize the routine incorporation of other potential non-coding markers into plant barcoding design systems .
Officially rbcl+matK combination has been approved by Consortium for the Barcode of Life (CBOL) as a global DNA barcode for land plants while trnH-psbA are still under scrutiny as a backup barcoding locus. There have been studies for ferns with matK+rbcl and trnH+psbA loci with the former providing high discriminatory power, supporting their use as the official DNA barcode . Another research study has validated use of ITS2 as novel DNA barcode for medicinal plant identification as ITS2 sequences are considered potential phylogenetic markers at genus and species levels. Six parameters viz. average interspecific distance (K2P) between all species in each genus, average theta prime (θ'), where θ' is the mean pair wise distance within each genus with more than one species, smallest interspecific distance i.e., the minimum interspecific genetic distance within each genus with at least two species, average intra specific divergence (K2P difference), theta (θ) where θ is the mean pairwise distance within each species with at least two representatives and average coalescent depth (i.e., maximum intra-specific distance within each species with at least two representatives) were determined taking several plastid and ribosomal intergenic marker regions where ITS2 scored high exhibiting highest level of variation with all the parameters thereby accounting for ITS2 as a suitable marker with authentication ability .
Looking into these intriguing questions about phylogenetic relationships in Nymphaeales, we designed an in silico study using matK and ITS2 sequences available on the public domain covering all genera of Nymphaeales. Till date there are no reports on plant DNA barcoding approach where both matK and ITS2 are taken together and phylogenetic studies made. In case of water lilies, molecular identification and barcodes have been reported only for the genus Nymphaea and that too taking sequences from the rpoC1 gene and trnH-psbA spacer regions (which are still under assessment as backup loci by CBOL) and use of inter-simple sequence repeat (ISSR) for species identification and differentiation of Nymphaea cultivars and natural populations . The present study aims at using both matK and ITS2 as markers for elucidating the plant species of the order Nymphaeales using combined fusion matrix of both the markers, capturing the phylogenetic signals through molecular morphometrics for the ITS2 region and finding novel motifs that can be tested as PCR primers in design of potential barcodes at genus level for rapid and accurate plant identification across the three different families of Nymphaeaceae, Cabombaceae and Hydatellaceae without morphological characters.
ITS2 has common core secondary structures across eukaryotes that serve as a double-edged tool. The ITS2 region of the nuclear rDNA cistrons is widely used for phylogenetic analyses at the genus and species levels and also at the higher taxonomic ranks using comparisons of primary sequence. Although potential transcript secondary structure homology is often utilized to aid alignment in comparisons of ribosomal gene sequences, such consideration has rarely been applied to ITS primarily because secondary structures for its transcript were not available. Hence, the value of applying ITS2 RNA transcript secondary structure information to improve alignments, that in turn, allows comparisons at even deeper taxonomic levels harnessing the evolutionarily conserved subportions of ITS2 has become apparently necessary for positioning of the multimolecular transcript processing machinery amongst eukaryotes and thus makes ITS2 a valuable tool both for primary sequence analysis and molecular morphometrics .
Although individualistic approach for different barcodes exist in addressing several issues in plant barcode designing, there is a general need for integrating a range of analytical routine into a common work flow to provide comparable informatics support for existing molecular data on the public domain.
For the present in silico study three principal objectives have been envisaged:
1. Phylogenetic reconstruction of Nymphaeales based on matK genes and ITS2 using a combined approach of gene trees and species trees (using super matrix of concatenated loci) and testing monophyly of the genus Nymphaea.
2. Evaluation of the phylogenetic utility of matK as a potential marker for motif hunting and DNA barcoding.
3. ITS2 secondary structure prediction in the order Nymphaeales and alignment of secondary structures to produce a consensus Nymphaeale phylogeny.
Sequence analysis and phylogeny reconstructions
The sequences of ITS2 and matk were aligned separately with clustalW program  and manually edited and the resultant aligned files were concatenated using FASconCAT version 1.0  (Additional File 1). For the insertion of gaps, attention was given to both the potentially inserted sequence and its neighbouring sequences. A gap was inserted only when it prevented the inclusion of more than two substitutions among closely adjacent nucleotides in the alignment. For the placement of gaps, the recognition of sequence motifs was given priority as per alignment rules for length-variable DNA sequences . Giving priority to a motif can result in insertions that are correctly aligned as non-homologous (i.e. with different positional extensions) although sequence similarity would warrant their inaccurate placement under the same column . Individual positions in homo-nucleotide strings of different lengths (poly-As or -Ts) are considered to be of uncertain homology  and are therefore excluded. Slipped strand mispairing  is likely to have led to numerous length mutational events involving one to several nucleotides. As only nucleotides of the same kind are involved, accurate motif recognition is not possible. Entire indels of the same positional extension and of complete sequence similarity were very easily assessed as primary homologues and consequently placed in the same column(s) of the alignment. During primary homology assessment, no inference had to be made regardless of whether the length mutational event occurred in a common ancestor of all taxa sharing it or in parallel in different lineages. This is analogous to the fact that the synapomorphic status of a substitution in a particular position is not inferred in the alignment process. Recognition of a repeat motif was regarded as further evidence for correctly recognizing a length mutational event. The final concatenated supermatrix included 51 taxa with 1875 characters.
Maximum Likelihood (ML) analyses
Bayesian analyses and split networks
Molecular clock rates, dS/dN analysis
Results from a test of molecular clocks using the Maximum Likelihood method of Nymphaeales matK sequence.
The codon-based z-test was carried out by setting the model to Syn-Nonsynonymous and Nei-gojobori test. The resulting matrix displayed (dS-dN) values above the diagonal and p values below the diagonal (Additional Files 2 &3). The test was carried out for both the hypotheses of positive selection and purifying selection. When (dS-dN) value is positive it exhibits purifying selection and to test that in reality p value less than 0.05 supports significant purifying selection. Selecting the p-value of 1.0 and then looking for the corresponding (dS-dN) value exhibited positive values thus rejecting the hypothesis of positive selection as dS > dN, i.e., silent mutations or purifying selection outnumbered non-synonymous mutations. Hence, we can conclude that the evolution of matK genes has been under strong purifying selection, suggesting their role in the evolution of Nymphaeales.
Motif identification and matching
A total of 27 unique matK motifs are identified by the MEME software  and subsequently validated by the MAST tool . We have reported three motifs each for the genera Brasenia, cabomba, Barclaya, Euryale, Nuphur, Nymphaea, Ondinea, Victoria and Trithuria along with their E-value, p-value  and similarity among themselves as outlined in (Additional File 4). In the proposed motif analysis that can be further tested for designing barcodes the same sets of sequences were used both to generate databases and as query sequences for both BLAST  and MAST. BLAST queries were run without filtering. Before generating the database with MAST the sequences were run through a PERL script that added a reverse complement for each sequence in order to ensure that query sequences would match the database in either the forward or the reverse orientation.
ITS2 secondary structure and analysis: a double edged tool
Primary sequence-secondary structure alignment
Populations are relatively isolated from one another where species dispersal is poor thereby resulting in slow individual neutral mutational variants spreading throughout a species range and thus for a species to attain monophyly for a particular loci it will be comparatively slower than species whose populations are connected with a regular gene flow. Hence, species-specific barcodes are literally difficult with poorly dispersed species. Since plastid markers in water lilies are paternally inherited, and travel in pollen, they potentially cover larger distances and have a better resolution power at species level delineation exhibiting consistently greater congruence with morphological species boundaries than maternally inherited mitochondrial markers . There are also instances where multiple species are reported to share plastid DNA haplotypes yet remain distinct for nuclear markers like nrITS which is again explained by their dispersal ability i.e., plastid DNA is poorly dispersed compared to nrITS and thus a combined approach of marker selection with varied dispersal ability provide an optimal choice of augmenting plant barcodes with nuclear markers .
The markers used in our study are from both plastid and nrDNA with matK and ITS2 combination that were subjected to multiple sequence alignment and refined with Mesquite. Postulated indels were treated as missing data and prealigned marker datasets were concatenated to produce a fusion matrix and a supertree was generated. The combo approach of ITS2+matK had the combined effect of idiosyncratic behaviour of both the markers that potentially contributed to species grouping across different clades of the order Nymphaeales.
This study represents the exclusive molecular dataset for matK genes as potential markers for motif discovery till date. Due to a relatively high percentage of variable and informative characters, our dataset not only comprises a high number of informative characters for Nymphaeales but also characterised by low degrees of homoplasy and a strong phylogenetic signal. The ML method as well as the bayesian approach yielded the same results with an exactly matching topology and well supported nodes. The results confirm several earlier hypotheses on phylogenetic relationships of the Order Nymphaeales and corroborate the monophyly of Nymphaeaceae and Cabombaceae, which has been convincingly mentioned before based on integrated morphological, anatomical and molecular characters . Barclaya serves as an outgroup to the monophyletic grouping of Nymphaea, Ondinea, Victoria and Euryale. It also supports the Victoria-Euryale grouping that was long predicted based on seed morphology and presence of spines . Though Nymphaeales is a monophyletic group within the basal angiosperms, the monophyly of Nymphaeaceae is not fully convincing owing to Victoria-Euryale and Ondinea grouping. The classification of Nymphaea in India has been reported to be confusing, molecular taxonomic revision of four Indian representatives of the genus namely N. nouchali, N. pubescens, N. rubra and N. tetragona based on ITS, trnK intron and matK gene have been carried out by us earlier. Molecular evidence was in disagreement about the taxonomic identity of one specimen of N. nouchali and indicated a probable misidentification of N. tetragona. Interestingly, sequence analysis had revealed lack of or low sequence divergence between N. pubescens and N. rubra [44–46]. Further in the present study we tried to track down evolutionary relationships among the genera of the order Nymphaeales by comparing the nucleotide sequences of the plant genomic and chloroplast DNA. For the first time, we have banked upon a large dataset from publicly available matK and ITS2 markers for discussing Nymphaeale phylogeny with a molecular morphometrics approach. Several authors [48, 49] considered assigning Barclaya to a separate family, Barclayaceae as they were of the argument that the genus Barclaya is quite distinct in terms of its palinological features, the structure of the ovule and the karyotype and in the present study our secondary structure alignment data (Figure 6) indicate that the region analysed in these studies is too short to enable verification of a phylogenetic hypothesis though we have got favourable results for considering Barclaya to be placed in a different family and with more diverse dataset we can target appropriate phylogenetic signals for considering Barclaya in a separate family.
The fact that Nymphaea and Victoria are the sister genera in our study is quite expected as both are highly evolved representatives of Nymphaeales. The recent study on Hydatellaceae  that identified it as a new branch near the angiosperm basal phylogeny was also reflected in our molecular morphometric analysis. Earlier ideas on the relationships of Hydatellaceae with the monocot family Centrolepidaceae and their current placement within the early-divergent angiosperm order Nymphaeales has been of considerable interest to taxonomists. In general, the view of monocots as a well-defined monophyletic unit derived from within the paraphyletic group of basal dicots  is one of the morphology-based theories that are most readily supported by molecular data. Extensive molecular phylogenetic studies have allowed only one refinement to the classical circumscription of monocots, with a total complement of 65000 species and 3000 genera . Specifically, the family Hydatellaceae (twelve species in a single genus)  has been transferred from monocots to the early-divergent angiosperms. In the present study our data supported this finding with aid of ITS2 secondary structure alignment, Bayesian network, supermatrix trees from concatenated different loci (ITS2+matk) (Figures 1, 2, 3, 6), analysis of matK dataset and molecular clock MP trees (Figure 4). For the first time we have used an extensive molecular morphometrics phylogeny to support Hydatellaceae as a sister group to Nymphaeales.
Besides, the genera Nuphar emerges as a monophyletic group with all the Nuphar species forming a single cluster with well supported boot strap values. Nuphar takes the middle position between these two genera. However, according to the molecular data Nuphar (possessing many specialised synapomorphic features) is basal in the clade, thus making Victoria and Nymphaea closer to each other .
However, despite the high amount of characters sampled, the monophyly of the Nymphaeaceae is not convincingly supported. More strikingly, the present data set does not give support for the monophyly of the genus Nymphaea. Nymphaea alba emerges out group to the order Nymphaeales based on the molecular data as well secondary structure data (Figures 1, 2, 3, 5 and 6) and had a parallel evolution with other representatives from the genus Nuphar. In contrast to all previous phylogenetic studies and classifications, it is inferred to be paraphyletic with respect to the Victoria-Euryale clade and to Ondinea. A reason for the scarcity of informative characters at the base of Nymphaeales could be a rapid, early diversification into the three major lineages. Our results support the opinion that a high rate of evolution within this taxon can be explained by the rapid specialisation of these plants for stepwise adaptation to the aquatic environment.
The other objective in our study was to generate motifs for barcode designing. matK genes yielded unique motif regions and thus may provide more variations than other regions in the plant chloroplast genomes. The nr Plant database from European Molecular Biology laboratory (EMBL) was used to test for unique species-specific barcodes that could be used for a species level identification. For this, the sequence belonging to each species was retrieved from the database and used as a query sequence. If the query sequence returned an exact match only to itself, this was scored as a positive identification at the species level. If the query sequence returned an exact match to itself and other members of the same genus, this was scored as a negative identification at the species level, but a positive identification to the genus level. For BLAST, an additional constraint was added to positively score the identification at genus level i.e., the best match as well as the next most similar sequence had to match the genus of the query sequence. If any other genus was included in the top two hits, the result was not considered genus specific. The results are exemplary in the current scenario of plant barcoding. We have reported unique genus specific motif regions in the Order Nymphaeles from matK dataset which can be further validated for barcoding and designing of PCR primers.
The increased application of molecular data in plant systematics has led to an avalanche of sequence profiles flooding the public domain. With a judicious use of these data as phylogenetic signals, the goal of finding universal primer pairs for studying plant genomes won't be troublesome anymore. The unique motifs reported may further be validated for designing barcodes. With Nymphaeales as a case study, it is quite surprising to observe how stepwise adaptation to an aquatic life style has had an impact on water lilies evolution, with the generation of morphological complexity. For the first time we have reported an ITS2 secondary structure alignment and a phylogeny based on the molecular morphometrics that strongly congrued with the current placement of the family Hydatellaceae within the early-divergent angiosperm order Nymphaeales. Though we are far off from completely understanding the selective forces behind these transformations, nevertheless, the phylogenetic signals belied in the comparatively small marker datasets imbibes a source of inspiration to broaden our views on water lily origin and evolution in time.
Taxon sampling and sequence analysis
The dataset used in the present study comprises 64 matK and 67 ITS2 sequences from species representing the three families Cabombaceae, Nymphaeaceae and Hydatellaceae of the order Nymphaeales retrieved from GenBank  (via Ebot http://www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/ebot/ebot.cgi, an open source interactive tool that generates a Perl script implementing an E-utility pipeline for retrieving large datasets from National Centre for Biotechnology Information (NCBI) with key words and boolean operators. Information on all the species along with GenBank accessions, sequence length and AT, GC content of both the markers are summarized in (Additional Files 5). The sequences were subjected to alignment and manual editing by clustalW  and were concatenated for generating a supermatrix using FAsconcat . Subsequently the concatenated files were subjected to mesquite for various file format conversions to be readable by ML and Bayesian methods.
The supermatrix dataset (matK + ITS2) covering the Order Nymphaeales were first analysed separately through ML and Bayesian inference . MP analyses were conducted with PhyMl 3.0 . Node support was substantiated through aLRT and bootstrapping.
For Bayesian inference  the best models of molecular evolution were determined with aid of MrModeltest version 2.2 . Hence, a Bayesian analysis using MrBayes  was carried out for tree construction using a general time reversible substitution model (GTR) with substitution rates estimated by MrBayes. Metropolis-Coupled Markov Chain Monte Carlo (MCMCMC) sampling was performed with two incrementally heated chains that were combinatorially run for 20,000 generations. The convergence of MCMCMC was then monitored by examining the value of the marginal likelihood through generations. Coalescence of substitution rate and rate model parameters were also examined. Average standard deviation of split frequencies was checked and the generations were kept on adding until the standard deviation value was below 0.01. For analysis we ran 10,000,000 generations with a sample frequency of 1000. The values slightly differed because of stochastic effects. The sample of substitution model parameters and samples of trees and branch lengths were summarized by the "sump burnin" and "sumt burnin" commands, respectively. The values in the following commands were adjusted as per the 25% of our samples. A cladogram with the posterior probabilities for each split and a phylogram with mean branch lengths were generated and subsequently read by Mesquite . An alternative method using network analysis was performed using SplitsTree4  with the variable positions in the aligned supermatrix dataset. The alignment file was converted to nexus with READSEQ  at Eurpean Bioinformatics Institute (EBI) server readable by SplitsTree4 .
Estimation of molecular clock rates
The molecular clock test was performed by comparing the ML value for the given topology with and without the molecular clock constraints under Jukes-Cantor model . The null hypothesis of equal evolutionary rate throughout the tree was rejected at a 5% significance level (P < 1.20575200719741E-58). The analysis involved 64 matK sequences and was computed using MEGA5 .
Analysis of synonymous and non-synonymous substitution rates
Non-synonymous mutations to a DNA sequence cause a change in the translated amino acid sequence, whereas synonymous mutations do not. The comparison between the number of non-synonymous mutations (dn or Ka), and the number of synonymous mutations (ds or Ks), can suggest whether, at the molecular level, natural selection is acting to promote the fixation of advantageous mutations (positive selection) or to remove deleterious mutations (purifying selection). In general, when positive selection dominates, the Ka/Ks ratio is greater than 1; in this case, diversity at the amino acid level is favoured, likely due to the fitness advantage provided by the mutations. Conversely, when negative selection dominates, the Ka/Ks ratio is less than 1; in this case, most amino acid changes are deleterious and, therefore, are selected against. When the positive and negative selection forces balance each other, the Ka/Ks ratio is close to 1. The dS/dN ratio was computed on matK sequences only in MEGA5  for testing positive and purifying selection hypothesis.
Motif identification and testing
The matK sequence motifs were identified from aligned sequences using the PRATT software . Besides, the dataset in fasta format were fed to MEME  for determining highly significant motifs without any gaps and patterns with variable length gaps if any, were split by MEME into one or more separate motifs. The motif sites were listed in order of increasing statistical significance (p-value) . The p-value of a site is computed from the match score of the site with Position Specific Scoring Matrix (PSSM) for the motif. Further individual datasets for Nymphaea and Nuphur were subjected to MEME for analyzing the best motifs. The MEME output is subsequently analyzed by MAST  for depicting the best scoring matches and similarity to other motifs. The match score are computed if the match completely fits within the sequence and are reported in terms of P-value of the match. MAST takes into account four types of events for calculating the P-value namely the position P-value, sequence P-value, combined P-value and the E-value .
ITS2 secondary structure prediction and analysis
RNA secondary structure prediction for ITS2 sequences were carried out in MATLAB 2012a rnafold  and rnaplot  functions that uses the nearest-neighbor model and minimizes the total free energy associated with an RNA structure. The minimum free energy was estimated by summing individual energy contributions from base pair stacking, hairpins, bulges, internal loops and multi-branch loops. The energy contributions of these elements are sequence- and length-dependent and have been experimentally determined. The rnafold function uses the nearest-neighbor thermodynamic model to predict the minimum free-energy secondary structure of an RNA sequence. More specifically, the algorithm implemented in rnafold was used for dynamic programming to compute the energy contributions of all possible elementary substructures and then the secondary structures were predicted by considering the combination of elementary substructures whose total free energy were minimum. In this computation, the contribution of coaxially stacked helices is not accounted for, and the formation of pseudoknots (non-nested structural elements) is forbidden. Rnaplot (RNA2ndStruct) was used for drawing RNA secondary structures with specified format values 'Mountain' for ITS2. The secondary structures were computed in form of mountain graphs in MATLAB R2012a environment.
Besides, consensus structures of ITS2 regions were predicted using the RNAz server and LocARNA from Freiburg RNA tools server that outputs a multiple alignment together with a consensus structure. For the folding a very realistic energy model for RNAs was used that features RIBOSUM-like similarity scoring and realistic gap cost. The high performance of LocARNA  was mainly achieved by employing base pair probabilities during the alignment procedure. Results of the various species were compared to unravel the folding pattern common to them all for establishing the conserved structural models across several genera of Nymphaeales using 4SALE  and subsequently incorporated in ProfDistS  for generating molecular morphometrics phylogeny. The ProfDistS  output was read by NjPlot .
The work is supported by the Department of Biotechnology, Government of India sponsored Bioinformatics Centre at North-Eastern Hill University, Shillong, Meghalaya, India.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 17, 2012: Eleventh International Conference on Bioinformatics (InCoB2012): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S17.
- Borsch T, Hilu KW, Wiersema JH, Löhne C, Barthlott W, Wilde V: Phylogeny of Nymphaea (Nymphaeaceae): evidence from substitutions and microstructural changes in the chloroplast trnT-trnF region. Int J Pl Sci. 2007, 168: 639-671. 10.1086/513476.View ArticleGoogle Scholar
- Qiu YL, Dombrovska O, Lee J, Li L, Whitlock BA, Bernasconi-Quadroni F, Rest JS, Davis CC, Borsch T, Hilu KW, Renner SS, Soltis DE, Soltis PE, Zanis MJ, Cannone JJ, Powell M, Savolainen V, Chatrou LW, Chase MW: Phylogenetic analyses of basal angiosperms based on nine plastid, mitochondrial, and nuclear genes. Int J Pl Sci. 2005, 166 (5): 815-842. 10.1086/431800.View ArticleGoogle Scholar
- Qiu YL, Lee J, Bernasconi-Quadroni F, Soltis DE, Soltis PS, Zanis M, Zimmer EA, Chen Z, Savolainen V, Chase MW: The earliest angiosperms: evidence from mitochondrial, plastid and nuclear genomes. Nat. 1999, 402 (6760): 404-407. 10.1038/46536.View ArticleGoogle Scholar
- Saarela JM, Rai HS, Doyle JA, Endress PK, Mathews S, Marchant AD, Briggs BG, Graham SW: Hydatellaceae identified as a new branch near the base of the angiosperm phylogenetic tree. Nat. 2007, 446 (7133): 312-325. 10.1038/nature05612.View ArticleGoogle Scholar
- Löhne C, Borsch T, John H, Wiersema : Phylogenetic analysis of Nymphaeales using fast-evolving and noncoding chloroplast markers. Bot J Linn Soc. 2007, 154 (2): 141-163. 10.1111/j.1095-8339.2007.00659.x.View ArticleGoogle Scholar
- Les DH, Schneider EL, Padgett DJ, Soltis PS, Soltis DE, Zanis M: Phylogeny, Classification and Floral Evolution of Water Lilies (Nymphaeaceae; Nymphaeales): A Synthesis of Non-molecular, rbcL, matK, and 18S rDNA Data. Syst Bot. 1999, 24 (1): 28-46. 10.2307/2419384.View ArticleGoogle Scholar
- Les DH, Garvin Dk, Wimpee CF: Molecular evolutionary history of ancient aquatic angiosperms. Natl Acad Sci USA. 1991, 88: 10119-10123. 10.1073/pnas.88.22.10119.View ArticleGoogle Scholar
- Ito M: Phylogenetic systematics of the Nymphaeales. Botanical Magazine (Tokyo). 1987, 100: 17-35. 10.1007/BF02488417.View ArticleGoogle Scholar
- Cronquist A: The evolution and classification of flowering plants. 1988, Bronx, NY: The New York Botanical GardenGoogle Scholar
- Chase MW, Soltis D, Olmstead RG, Morgan D, Les DH, Mishler BD, Duvall MR, Price R, Hills HG, Qiu Y-L, Kron KA, Rettig JH, Conti E, Palmer JD, Manhart JR, Sytsma KJ, Michaels HJ, Kress JW, Karol KG, Clark WD, Hédren M, Gaut BS, Jansen RK, Kim K, Wimpee CF, Smith JF, Furnier GR, Strauss SH, Xiang Q-Y, Plunkett GM, Soltis PS, Swensen SM, Williams SE, Gadek PA, Quinn CJ, Eguiarte LE, Golenberg EM, Learn GH, Graham SW, Barrett SCH, Dayanandan S, Albert VA: Phylogenetics of seed plants: an analysis of nucleotide sequences from the plastid gene Rbcl. Annals of the Missouri Botanical Garden. 1993, 80: 528-580. 10.2307/2399846.View ArticleGoogle Scholar
- Savolainen V, Chase MW, Morton CM, Soltis DE, Bayer C, Fay MF, De Bruijn A, Sullivan S, Qiu Y-L: Phylogenetics of flowering plants based upon a combined analysis of plastid atpB and rbcL gene sequences. Syst Biol. 2000, 49: 306-362.View ArticlePubMedGoogle Scholar
- Nandi OI, Chase MW, Endress PK: A combined cladistics analysis of angiosperms using rbcL and non-molecular data sets. Annals of the Missouri Botanical Garden. 1998, 85: 137-212. 10.2307/2992003.View ArticleGoogle Scholar
- Williamson PS, Schneider EL: Cabombaceae. The families and genera of vascular plants II. Edited by: Kubitzki K, Rohwer JG, Bittrich V. 1993, Berlin, Springer, 157-161.Google Scholar
- Little DP: DNA barcode sequence identification incorporating taxonomic hierarchy and within taxon variability. PLoS One. 2011, 6 (8): e20552-10.1371/journal.pone.0020552.PubMed CentralView ArticlePubMedGoogle Scholar
- Hollingsworth PM, Graham SW, Little DP: Choosing and using a plant DNA barcode. PLoS One. 2011, 6 (5): e19254-10.1371/journal.pone.0019254.PubMed CentralView ArticlePubMedGoogle Scholar
- Lahaye R, van der Bank M, Bogarin D, Warner J, Pupulin F: DNA barcoding the floras of biodiversity hotspots. Proc Natl Acad Sci. 2008, 105: 2923-2928. 10.1073/pnas.0709936105.PubMed CentralView ArticlePubMedGoogle Scholar
- Li FW, Kuo LY, Rothfels CJ, Ebihara A, Chiou WL, Windham MD, Pryer KM: rbcL and matK earn two thumbs up as the core DNA barcode for ferns. 2011, 6 (10): e26597-Google Scholar
- Chen S, Yao H, Han J, Liu C, Song J, Shi L, Zhu Y, Ma X, Gao T, Pang X, Luo K, Li Y, Li X, Jia X, Lin Y, Leon C: Validation of the ITS2 region as a novel DNA barcode for identifying medicinal plant species. PLoS One. 2010, 5 (1): e8613-10.1371/journal.pone.0008613.PubMed CentralView ArticlePubMedGoogle Scholar
- Chaveerach A, Tanee T, Sudmoon R: Molecular identification and barcodes for the genus Nymphaea. Acta Biol Hung. 2011, 62: 328-340. 10.1556/ABiol.62.2011.3.11.View ArticlePubMedGoogle Scholar
- Coleman AW: ITS2 is a double-edged tool for eukaryote evolutionary comparisons. Trends Genet. 2003, 7: 370-375.View ArticleGoogle Scholar
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-80. 10.1093/nar/22.22.4673.PubMed CentralView ArticlePubMedGoogle Scholar
- Kück P, Meusemann K: FASconCAT: Convenient handling of data matrices. Mol Phylogenet Evol. 2010, 56 (3): 1115-1158. 10.1016/j.ympev.2010.04.024.View ArticlePubMedGoogle Scholar
- Gatesy J, De Salle R, Wheeler W: Alignment- ambiguous nucleotide sites and the exclusion of systematic data. Mol Phylogenetics Evol. 1993, 2: 152-157. 10.1006/mpev.1993.1015.View ArticleGoogle Scholar
- Takezaki N, Rzhetsky A, Nei M: Phylogenetic test of the molecular clock and linearized trees. Mol Biol and Evol. 2004, 12: 823-833.Google Scholar
- Hoot SB, Douglas AW: Phylogeny of the Proteaceae based on atpB and atpB-rbcL intergenic spacer region sequences. Aust J Syst Bot. 1998, 11: 301-320. 10.1071/SB98027.View ArticleGoogle Scholar
- Levinson G, Gutman G: Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol Biol Evol. 1987, 4: 203-221.PubMedGoogle Scholar
- Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O: New algorithms and methods to estimate Maximum-Likelihood phylogenies: assessing the performance of PhyML 3.0. Sys Biol. 2010, 59 (3): 307-321. 10.1093/sysbio/syq010.View ArticleGoogle Scholar
- Shimodaira H, Hasegawa M: Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol Biol Evol. 1999, 16 (8): 1114-10.1093/oxfordjournals.molbev.a026201.View ArticleGoogle Scholar
- Ronquist F, Huelsenbeck JP: MRBAYES 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 2003, 19: 1572-1574. 10.1093/bioinformatics/btg180.View ArticlePubMedGoogle Scholar
- Maddison WP, Maddison DR: Mesquite: a modular system for evolutionary analysis. Version 2.75. [http://mesquiteproject.org]
- Huson DH, Bryant D: Application of phylogenetic networks in evolutionary studies. Mol Biol Evol. 2006, 23 (2): 254-267.View ArticlePubMedGoogle Scholar
- Jukes TH, Cantor CR: Evolution of protein molecules. Mammalian Protein Metabolism. Edited by: Munro HN. 1969, Academic Press, New York, 21-132.View ArticleGoogle Scholar
- Bailey TL, Williams N, Misleh C, Li WW: MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res. 2006, 34: W369-W373. 10.1093/nar/gkl198.PubMed CentralView ArticlePubMedGoogle Scholar
- Timothy BL, Bodén M, Buske FA, Frith M, Grant EC, Clementi L, Ren J, Li WW, Noble William S: MEME SUITE: tools for motif discovery and searching. Nuc Acids Res. 2009, 37: W202-W208. 10.1093/nar/gkp335.View ArticleGoogle Scholar
- Bailey TL, Gribskov M: Combining evidence using p-values: application to sequence homology searches. Bioinformatics. 1998, 14: 48-54. 10.1093/bioinformatics/14.1.48.View ArticlePubMedGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 3: 403-10.View ArticleGoogle Scholar
- Gruber AR, Neuböck R, Hofacker IL, Washietl S: The RNAz web server: prediction of thermodynamically stable and evolutionarily conserved RNA structures. Nucleic Acids Res. 2007, 35: 335-8. 10.1093/nar/gkm222.View ArticleGoogle Scholar
- Smith C, Heyne S, Richter SA, Will S, Backofen R: Freiburg RNA Tools: a web server integrating IntaRNA, ExpaRNA and LocARNA. Nucleic Acids Res. 2010, 38: 373-7. 10.1093/nar/gkq316.View ArticleGoogle Scholar
- Seibel PN, Müller T, Dandekar T, Wolf M: Synchronous visual analysis and editing of RNA sequence and secondary structure alignments using 4SALE. BMC ResNotes. 2008, 1: 91-Google Scholar
- Wolf M, Ruderisch B, Dandekar T, Müller T: ProfdistS: (Profile-) Distance based phylogeny on sequence-structure alignments. Bioinformatics. 2008, 24: 2401-2402. 10.1093/bioinformatics/btn453.View ArticlePubMedGoogle Scholar
- Perrière G, Gouy M: WWW-Query: An on-line retrieval system for biological sequence banks. Biochimie. 1996, 78: 364-369. 10.1016/0300-9084(96)84768-7.View ArticlePubMedGoogle Scholar
- Waddell PJ, Steel MA: General time-reversible distances with unequal rates across sites: mixing gamma and inverse Gaussian distributions with invariant sites. Mol Phylogenet Evol. 1997, 8 (3): 398-414. 10.1006/mpev.1997.0452.View ArticlePubMedGoogle Scholar
- Thorne R: Classification and geography of the flowering plants. Bot Rev. 1992, 58: 225-248. 10.1007/BF02858611.View ArticleGoogle Scholar
- Dkhar J, Kumaria S, Rao SR, Tandon P: Molecular phylogenetics and taxonomic reassessment of four Indian representatives of the genus Nymphaea. Aquatic Botany. 2010, 93: 135-139. 10.1016/j.aquabot.2010.03.010.View ArticleGoogle Scholar
- Dkhar J, Kumaria S, Tandon P: Nymphaea alba var. rubra is a hybrid of N.alba and N.odorata as evidenced by molecular analysis. Ann Bot Fennici. 2011, 48: 317-324. 10.5735/085.048.0403.View ArticleGoogle Scholar
- Dkhar J, Kumaria S, Tandon P: Molecular adaptation of the chloroplast matK gene in Nymphaea tetragona, a critically rare and endangered plant of India. Plant Genetic Resources. 2011, 9: 193-196. 10.1017/S1479262111000396.View ArticleGoogle Scholar
- Takhtajan AL: The System of Magnolyophyta. Nauka, Leningrad. 1987Google Scholar
- Cronquist A: An integrated system of classification of flowering plants, Columbia University Press. New York. 1981, 1262-Google Scholar
- Les DH, Schneider EL, Padgett DJ, Soltis PS, Soltis DE, Zanis M: Phylogeny, classification and floral evolution of water lilies (Nymphaeaceae; Nymphaeales): a synthesis of non-molecular, rbcL, matK, and rDNA Data. Syst Bot. 1999, 24: 2428-2446.View ArticleGoogle Scholar
- Takhtajan A: Systema Magnoliophytorum. 1987, Nauka, LeningradGoogle Scholar
- Takhtajan A: Flowering Plants. 2009, Springer, New YorkView ArticleGoogle Scholar
- Sokoloff DD, Remizowa MV, Macfarlane TD, Rudall PJ: Classification of the early-divergent angiosperm family Hydatellaceae: one genus instead of two, four new species and sexual dimorphism in dioecious taxa. Taxon. 2008, 57: 179-200.Google Scholar
- Les DH, Schneider EL, Padgett DJ, Soltis PS, Soltis DE, Zanis M: Phylogeny, classification and floral evolution of water lilies (Nymphaeaceae; Nymphaeales): a synthesis of non-molecular, rbcL, matK, and 18S rDNA data. Syst Bot. 1999, 24: 28-46. 10.2307/2419384.View ArticleGoogle Scholar
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW: GenBank. Nucleic Acids Res. 2011, 39: D32-7. 10.1093/nar/gkq1079.PubMed CentralView ArticlePubMedGoogle Scholar
- Posada D, Buckley TR: Model selection and model averaging in phylogenetics: advantages of the AIC and Bayesian approaches over likelihood ratio tests. Syst Biol. 2004, 53: 793-808. 10.1080/10635150490522304.View ArticlePubMedGoogle Scholar
- Nylander JA: MrModeltest v2. Program distributed by the author. 2004, Uppsala University: Evolutionary Biology CentreGoogle Scholar
- Gilbert D: Sequence file format conversion with command-line readseq. Curr Protoc Bioinformatics. 2003, Appendix 1:Appendix 1EGoogle Scholar
- Tamura K, Dudley J, Nei M, Kumar S: MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol Biol Evol. 2007, 24: 1596-1599. 10.1093/molbev/msm092.View ArticlePubMedGoogle Scholar
- Jonassen I: Efficient discovery of conserved patterns using a pattern graph. Comput Appl Biosci. 1997, 13 (5): 509-522.PubMedGoogle Scholar
- Hofacker IL, Fontana W, Stadler PF, Bonhoeffer S, Tacker M, Schuster P: Fast folding and comparison of RNA secondary structures. Monatshefte f Chemie. 1994, 125: 167-188. 10.1007/BF00818163.View ArticleGoogle Scholar
- Lorenz R, Bernhart SH, Hoener zu Siederdissen C, Tafer H, Flamm C, Stadler PF, Hofacker IL: ViennaRNA Package 2.0. Algorithms Mol Biol. 2011, 6: 26-10.1186/1748-7188-6-26.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.