StralSV: assessment of sequence variability within similar 3D structures and application to polio RNA-dependent RNA polymerase

Background Most of the currently used methods for protein function prediction rely on sequence-based comparisons between a query protein and those for which a functional annotation is provided. A serious limitation of sequence similarity-based approaches for identifying residue conservation among proteins is the low confidence in assigning residue-residue correspondences among proteins when the level of sequence identity between the compared proteins is poor. Multiple sequence alignment methods are more satisfactory--still, they cannot provide reliable results at low levels of sequence identity. Our goal in the current work was to develop an algorithm that could help overcome these difficulties by facilitating the identification of structurally (and possibly functionally) relevant residue-residue correspondences between compared protein structures. Results Here we present StralSV (structure-alignment sequence variability), a new algorithm for detecting closely related structure fragments and quantifying residue frequency from tight local structure alignments. We apply StralSV in a study of the RNA-dependent RNA polymerase of poliovirus, and we demonstrate that the algorithm can be used to determine regions of the protein that are relatively unique, or that share structural similarity with proteins that would be considered distantly related. By quantifying residue frequencies among many residue-residue pairs extracted from local structural alignments, one can infer potential structural or functional importance of specific residues that are determined to be highly conserved or that deviate from a consensus. We further demonstrate that considerable detailed structural and phylogenetic information can be derived from StralSV analyses. Conclusions StralSV is a new structure-based algorithm for identifying and aligning structure fragments that have similarity to a reference protein. StralSV analysis can be used to quantify residue-residue correspondences and identify residues that may be of particular structural or functional importance, as well as unusual or unexpected residues at a given sequence position. StralSV is provided as a web service at http://proteinmodel.org/AS2TS/STRALSV/.

Results: Here we present StralSV (structure-alignment sequence variability), a new algorithm for detecting closely related structure fragments and quantifying residue frequency from tight local structure alignments. We apply StralSV in a study of the RNA-dependent RNA polymerase of poliovirus, and we demonstrate that the algorithm can be used to determine regions of the protein that are relatively unique, or that share structural similarity with proteins that would be considered distantly related. By quantifying residue frequencies among many residueresidue pairs extracted from local structural alignments, one can infer potential structural or functional importance of specific residues that are determined to be highly conserved or that deviate from a consensus. We further demonstrate that considerable detailed structural and phylogenetic information can be derived from StralSV analyses.
Conclusions: StralSV is a new structure-based algorithm for identifying and aligning structure fragments that have similarity to a reference protein. StralSV analysis can be used to quantify residue-residue correspondences and identify residues that may be of particular structural or functional importance, as well as unusual or unexpected residues at a given sequence position. StralSV is provided as a web service at http://proteinmodel.org/AS2TS/ STRALSV/.

Background
Accurate sequence alignments between related proteins are important for many bioinformatics applications that involve comparative analysis. Derived from calculated alignments, residue-residue correspondences allow construction of sequence motifs and profiles important in building homology models or in predicting protein functions. Most of the currently used methods for protein function prediction rely on sequence-based comparisons between a query protein and those for which a functional annotation is provided. A serious limitation of sequence similarity-based approaches for identifying residue conservation among proteins is the lack of, or very low, confidence in assigning residue-residue correspondences among proteins when the level of sequence identity between the compared proteins is poor. Indeed, it was shown by Rost [1] that more than 95% of all pair-wise alignments occurring in the so-called twilight zone (20-35% sequence identity) may be incorrect [2]. Multiple sequence alignment methods are more satisfactory-still, they cannot provide reliable results at low levels of sequence identity, especially if the number of available closely related proteins is small (i.e., when the protein family has rather few members, or the list of related proteins that has been identified is short).
Having 3D structural information for a given protein can be especially useful in deriving functional annotation [3]. Structure comparison algorithms provide much higher confidence in assignment of residue-residue correspondences than do sequence-based algorithms. Nevertheless even calculated structural alignments may be inaccurate: for some compared proteins, or regions therein, more than one possible superposition can reasonably be reported, and it may be difficult to decide which alignment is most satisfactory [4,5]. Rigid body structural superpositions on the chain level have limitations when comparing multi-domain proteins with different conformations between domains. Comparisons on the domain level may yield better results, but splitting of structures into domains can be problematic, and there is no reliable method that can do this automatically. Even within compared structural domains, significant deviations can be observed in some loop regions, or due to large insertions or different conformations of structural motifs, all of which can significantly affect detection of structural residue-residue correspondences when rigid body approaches are used for alignment calculations. Several algorithms have been proposed to facilitate flexible protein structure alignment calculations [6][7][8], but the complexity of such calculations remains a challenging development goal. Another difficulty in identifying similar regions in compared protein structures lies in the possibility that analogous regions in structurally related proteins may display differences in sequential ordering of the motifs due to circular permutations or convergent evolution [9,10]. The majority of the existing flexible protein structure alignment algorithms report only sequential alignments, and there are very few (with varying levels of success) that can detect and align structures between which there are differences in the ordering of their structure motifs [11][12][13].
The accuracy of calculated structural alignments can also depend on the nature of compared structural models. The atomic coordinates obtained from experimentally solved structures (x-ray crystallography or nuclear magnetic resonance spectroscopy) are always associated with some degree of uncertainty resulting from experimental errors from the intrinsic flexibility of the proteins or from atom vibrations [14]. Such structural deviations may sometimes significantly affect the calculated alignments and lead to incorrect conclusions about sequence motifs, profiles, or possible residue substitutions within analyzed functional regions in proteins. The accuracy of calculated residue-residue correspondences can be improved by refinement methods that evaluate results produced by different structure-based alignment programs or explore sequence-based alignments using, for example, the Conserved Domain Database (CDD) as a set of reference alignments [15].
Our goal in the current work was to develop an algorithm that could address these difficulties and facilitate the identification of structurally (and possibly functionally) relevant residue-residue correspondences between compared protein structures. Our approach is to first detect similar structural motifs, and consequently derive structure-based alignments from the calculated local superpositions of corresponding similar regions. Our StralSV algorithm detects structurally similar regions within a given pair of protein structures, and reports residue-residue correspondences only from those local regions that are contained within a larger, similar structural context. When for a given reference structure a structure-based search is performed on a set of proteins from the Protein Data Bank (PDB), StralSV identifies all structurally similar fragments from that set, evaluates the calculated structure-based alignments between the query (reference) motif (designated "segment" in this work) and the detected structure fragments, and quantifies the observed sequence variability at each residue position on the query structure. Here we describe how the StralSV algorithm works, and we apply StralSV in a study of the RNA-dependent RNA polymerase of poliovirus.

Description of the StralSV algorithm
StralSV is an algorithm that identifies protein structural fragments having a 3D structure similar to that of a query structure, performs structure-based alignments between the query and the fragments, and quantifies at each position along the query structure the sequence variability represented among the selected fragments relative to the query. StralSV takes as input a query structure of interest, a database of protein structures, and various parameters (discussed below) that control the selection of fragments from the database and the sequence variability calculations. Figure 1 illustrates the steps in the algorithm. The algorithm uses a sliding-window approach for breaking the query structure into overlapping segments, each of which is independently used to identify from the input database protein structure fragments with 3D similarity. A recommended (default) window_size parameter is set to 90 amino acids in length, although an arbitrary length can be chosen. The query structure is thus split into overlapping segments of length window_size; overlaps are by default 1/2 the length of the window_size. A final segment is taken one window_size in length extending from the C-terminus to ensure that all portions of the query structure are represented within a segment of exactly the window_size. Each so-calculated query segment is then compared to all protein structures in the database using the LGA (local-global alignment) code [16] to identify structure fragments with sufficient structure similarity to the query segment. The LGA_S score is used to evaluate structure similarity between a query segment and detected similar fragments. Calculated LGA_S scores range from 0% to 100% and reflect a percentage of residues from the query segment that are identified as structurally aligned with a given similar fragment. In the StralSV algorithm, a value of LGA_S of at least 50% is used as a cutoff to ensure that there is sufficient structural similarity between the segment and the fragment over at least half the length of the segment.
LGA's distance cutoff parameter determines the maximum allowable distance between alpha carbons (Cα) of superimposed amino acids within a calculated alignment; typically this parameter is set to 4.0 Å, and this default is used for StralSV calculations with win-dow_size values of 90 residues. Thus, fragments with sufficient structure similarity to the query segment are identified. Each fragment is then evaluated to determine the tightness of its alignment to the query segment.
The criteria for tight structure similarities in local regions (spans), described by Zemla et al. [17], are used to identify ranges within the alignment that have tight local superpositions. Each residue-residue pair from the alignment (closest superimposed residues from the query segment and database fragment) is assigned a score by calculating the local RMSD (root mean square deviation) among the surrounding residue-residue pairs. A continuous set of at least three residue-residue pairs that fulfil the RMSD cutoff of 0.5 Å comprises a span. A desired size of a calculated span (span_size; shortest acceptable tight local alignment without gaps) is used as an input parameter to StralSV, and is typically specified as 3, 5, or 7, but can be of arbitrary length. Our previous experience [17], suggests that 5 is a reasonable minimum length over which to impose a tight alignment; 3 is the minimum value for span_size that is meaningful (since any 2 Ca atoms can be perfectly aligned), and 7 imposes stringency that tends to eliminate capture of some related (desired) structure fragments. For the work reported here we selected a minimum span length of 5 (span_size = 5). From each alignment is extracted a set of spans. All alignments that contain at least one span of length no less than the specified minimum span length are deemed "qualified hits". (For an illustration of a "span", see additional file 1: StralSV-RdRp_Suppl_Figure 1.docx.) All residue-residue pairs that are contained within a span's alignment are used to calculate the sequence variability data at the corresponding position in the query structure. Note that not all residues from calculated structure alignments contribute to the variability statistics at a given position in the query segment; regions in which the local RMSD distances between corresponding residues exceed 0.5 Å induce breakpoints between spans. Also, because the algorithm uses overlapping segments, duplications are appropriately factored out in calculating the sequence variability.
A frequency matrix (Table 1) is constructed for each position in the query structure by tallying the frequency at which each amino-acid (the 20 standard amino acids plus 'X', corresponding to unusual or modified residues) is observed within the spans. From this matrix are extracted statistics describing the number of positional hits (residue-residue correspondences contributing sequence variability data) per position and a list of residues observed at each position, ordered by frequency of occurrence. These statistics are used to construct a variability profile (Table 2), which can be used to identify positions at which there is relatively high or low sequence variability in structure context. The sample profile given in Table 2 shows the observed residues for positions 229 through 269 for polio RNA-dependent RNA polymerase (RdRp) run as the query protein against the complete PDB database (released on November 24, 2009).

Selection of parameters for StralSV using benchmark structure set
To illustrate the StralSV algorithm, we selected a minimum span length of 5 and conducted analyses on poliovirus RdRp because poliovirus has been extensively studied and its polymerase is a member of a widely distributed protein family. In analyzing poliovirus RdRp, part of our effort involved determining suitable parameters for window_size and distance_cutoff. As win-dow_size is increased, the stringency with which similar structure fragments are selected is increased, due to the greater structural context that is provided by the query segment. As distance cutoff is increased, that stringency is reduced, because more laxity in the alignment is allowed. Thus, for larger window_size values, a larger distance cutoff should be applied in order for the algorithm to not eliminate related structure fragments that have local structural deviations with respect to the query segment. Likewise, as the window size is decreased, the distance cutoff should be reduced as well, to prevent capture of many small, less related structure fragments. To determine what combinations of win-dow_size and distance_cutoff values would provide com-parable results, we performed several benchmark tests involving various structure libraries (e.g., complete PDB, PDB-select 40, ASTRAL40) and input parameters (data not shown). The results from one such test involving the capture of structure fragments from a benchmark data set containing 38 polymerase structures plus more than 1000 other structures randomly selected from the PDB is presented in Figure 2. Suitable parameters input to StralSV were expected to capture only structurally related fragments (i.e., the polymerases), whereas overly stringent parameters were expected to yield result sets lacking at least some of the polymerase fragments, and low stringency parameters were expected to capture less related fragments. In this way, we examined the dependency between window_size and distance_cutoff values to determine optimal (for the current study) parameter settings, and help define default parameter settings for StralSV.
We conducted a test whereby window_size and distan-ce_cutoff parameters were varied from 20 to 150 and 1.0Å to 6.5Å, respectively ( Figure 2). We ran StralSV using each window_size/distance_cutoff combination for all query (poliovirus RdRp) segments enclosing residue G64. We selected a region in the N-terminal portion of the protein upon which to focus this exercise. The region defined by residues 1-140 tended to be structurally unique, but at the same time contained a variety of well defined secondary structure elements of different sizes ( Figure 2C). Furthermore, the selection of fragments to inspect from within this region was somewhat arbitrary; we selected those segments containing residue G64 due to its significance as a functional residue [18]. We determined how many qualified hits were obtained for each window_size/distance_cutoff combination ( Figure 2B). Parameter settings comprising window_size values ranging from 70 to 90 at distance cutoffs 3.0Å to 5.0Å gave satisfactory results; fragments from the seeded 38 polymerase structures were reliably captured within a set of window_size and distance_cutoff parameter value combinations within these ranges (dotted oval in Figure 2B). Window_size values smaller than 70 tended to yield qualified hit sets missing some of the 38 polymerases (i.e., true positives) for lesser (more stringent) values of dis-tance_cutoff and tended to capture increasing numbers of unrelated (false positive) structure fragments as the distance cutoff was increased. Also, window_size values smaller than 70 were sensitive to the distance cutoff value, yielding acceptable result sets only for narrow ranges of distance cutoff. Very large window_size values (e.g., 150) resulted in selection of all true positives for all but the very tightest (distance_cutoff < 2.5Å) alignments. The parameter settings highlighted in red type in Figure  2B were selected as default values for StralSV.

Plotting of StralSV results
StralSV produced variability matrices and sequence profiles (not shown; for excerpts see Tables 1 and 2) from which were extracted data for analyzing related structure Table 1 Excerpt of sample StralSV output "matrix" file for positions 229-269 from poliovirus RdRp analysis fragments and quantifying positional variability. Plots containing data extracted directly from the matrix and profiles files are shown in Figures 3, 4, and 5. We performed additional analyses in order to annotate the primary StralSV results: 1) Secondary structure assignments were calculated for poliovirus RdRp using DSSP [20] and were plotted along the x-axis of Figure 3. DSSP output was simplified as follows: helix, comprising alpha helix (H), pi helix (I), and 3/10 helix (G); strand, comprising extended strand (E) or residue in isolated beta bridge (B).
2) To determine whether we could discern patterns with respect to positional hit frequency and previously    Table 3   identified polymerase sequence motifs, we overlaid a plot comprising qualified hits versus sequence position with the positions of the well known motifs, extracted from the literature ( Figure 5, gray boxes with labels A-G). This was accomplished by examining sequence alignments and extracting coordinates defining the known sequence motifs A through G from papers that included poliovirus RdRp among the aligned sequences [21][22][23][24]. Because there was considerable inconsistency regarding the boundaries of the sequence motifs, we defined inclusive boundaries for each motif whereby residues were included if they were identified within a motif in any of the sequence alignments reported in the literature. 3) In order to inspect the most abundant sequence variants (i. e., the most conserved positions) in the context of the positional hit frequency along the reference protein, we extracted from the matrix files the frequencies of the most frequent residues and plotted them versus sequence position along with the positional hits for window size 80 ( Figure 5). 4) The literature was searched for evidence of functional annotation [18,[21][22][23][24][25][26][27][28][29][30][31] for the most frequently observed residue positions (see additional file 2: StralSV-RdRp_Suppl_Table 1) and positions for which functional annotation was identified were marked in Figure 5.

Annotation of representative positional hits
The qualified hits (structure fragments from PDB with detected local similarities to the query structure) for six selected sequence positions (positional hits) of polio RdRp (positions identified in Figure 3) were categorized and quantified based on SCOP (Structure Classification of Proteins database; version 1.75, June 2009 release) identifiers [32]. Note that because SCOP is a manually curated database of structure domains from the PDB, there is some delay (currently more than one year) before a new PDB entry is classified in SCOP. Therefore, we have included in Table 3 and additional files 3, 4, 5, 6, 7, and 8 (StralSV-RdRp_Suppl_Table 2, StralSV-RdRp_Suppl_Table 3, StralSV-RdRp_Suppl_Table 4, StralSV-RdRp_Suppl_Table 5, StralSV-RdRp_Suppl_Table 6, StralSV-RdRp_Suppl_Table 7) data pertaining to qualified hits and sequence variability based on StralSV analysis using a complete tally of PDB identifiers in addition to those hits and variabilities that could be categorized by SCOP classification. The hits were evaluated separately for each of windows 50, 70, 80 and 90, at six sequence positions selected to be representative of the types of frequency variation (of qualified hits) over the length of the polymerase (see specified positions in Figure 3). For each of the six selected positions, the following information was extracted from the StralSV matrix output file: all PDB templates (including chain information) that contributed to the profile at that position, the corresponding LGA_S score, the sequence identity (Seq ID), the number of (template) amino acids that matched the query, the name of the corresponding amino acid of the template, and, when available, the SCOP identifier for each classified PDB template.

Results
Effects of parameter settings on capture of database structure fragments Numbers of positional hits (residues within spans) corresponding to each position along the poliovirus query . The regions of high variability were "locked" between regions of high specificity, which exhibited little or no change with the variation of win-dow_size. As window_size was increased, the numbers of positional hits tended to decrease in those regions for which window_size-dependent variability was observed.
Comparison of positional hit variabilities to secondary structure ( Figure 3, along x-axis) revealed that the window-related variability tended to occur in regions of helical secondary structure. A polymerase "baseline", amounting to fewer than 50 positional hits (corresponding primarily to polymerase structures) was observed along the entirety of the poliovirus chain ( Figure 3 xaxis). All polymerase baseline regions, with the exception of region 410-461, were identifiable independently of window_size. The latter C-terminal region showed high specificity at window size 90. In this region, lowering the window_size value to just 70 resulted in the capture of many (> 900) structure fragments that have apparent structural similarity to poliovirus RdRp ( Figure  3, tall red peaks at right side of plot). The N-terminal region of the poliovirus protein yielded the least positional hits, indicating relative structural uniqueness in this region.

Composition of captured structure fragments at specific positions
We selected six positions (Figure 3, residues numbered at top of plot) at which to examine the diversity of structure fragments captured at each of four window_size values (Table 3,  At positions D238, D328, and L374, qualification of structure fragments (and resulting positional hits) was largely independent of window_size, and few unrelated (non-e.8) fragments were captured, indicating that the structure motifs in the surrounding regions were largely limited to the families detected (e.g., e.8.1.4 and e.8.1.6) ( Table 3, positions D238 and D328). For positions N297, H398, and H413, at which there was considerable diversity in the SCOP families detected at smaller win-dow_size values (< 90), omission of these more distantly related positional hits was observed as window_size increased. Overall it was observed that window_size 90 resulted predominantly in capture of structure fragments from the structure family to which poliovirus RdRp belongs (e. 8.1.4). Therefore, at all six positions, window_size 90 effectively filtered from among the nearly 150,000 PDB chains only those members of the polymerase/transcriptase families.

Detection of sequence variability in structure context
To determine how sequence variability was distributed among the e.8.1 families, we further categorized the amino-acid variabilities by SCOP family for position N297 at window_size 80 (Table 4). In general approximately 2/3 of the positional hits could be categorized using SCOP; as an example, 320 of the 572 total positional hits were derived from structures that had been classified in SCOP. For this position (at which the greatest sequence variability was observed among the 6 selected), the distribution of amino-acid variability contributed by e.8 superfamily members (e.8.1.1, 2, 4, and 6) was considerably more narrow (5 amino acids (A, F,  G, H, N), of which half (80) were N) than that observed overall (13 amino acids). All of the positional hits that coincided in sequence with poliovirus RdRp at N297 were members of the e.8.1.4 (RNA-dependent RNApolymerase) family, although a minority (5; all of the remaining) of the hits (all RdRp of lambda 3) from this family had H at position 297. Thus, sequence variability was limited to N (94%) and H (6%) within family e.8.1.4, in which poliovirus RdRp is categorized. (This family also includes HCV, FM, lambda 3, BVDV, Norwalk virus, rhinovirus, rabbit hemorrhagic fever virus, and IBVD.) Summarizing the amino-acid occurrences among qualified hits within fold e.8 (comprising 1 superfamily and 4 families), we observed the following amino-acid distributions at position N297: N = 50%, F = 26%, G = 20%, H = 3%, A = 0.6%, compared to a much broader distribution of amino-acid variabilities observed within templates outside of fold e.8: V = 31%, R = 31%, Q = 9%, I = 6%, H = 5%, Y = 5%, and C, E, F, K, L, M, and T each < = 3% (comprising the "tail" in a distribution of sequence variability). For completeness, amino-acid variabilities derived from positional hits at all six positions examined (see Figure 3) are summarized in additional files 3, 4, 5, 6, 7, and 8: StralSV-RdRp_Suppl_Table 2, StralSV-RdRp_Suppl_Table 3, StralSV-RdRp_Suppl_Table 4, StralSV-RdRp_Suppl_Table 5, StralSV-RdRp_Suppl_Table 6, StralSV-RdRp_Suppl_Table 7. Distributions of observed amino-acids are shown for all templates and grouped as within or outside of fold e.8. From this detailed analysis it was possible to categorize the specific amino-acid variabilities per position-thus, StralSV can be used to detect positional trends and anomalies among structures related to the protein of interest.

Effect of window_size on detection of sequence variability
To determine how the window size parameter might affect the detection of amino-acid variability "globally" along the poliovirus RdRp chain, we plotted for each position along the reference structure the detected absolute amino-acid variability versus the number of qualified hits for window_size values 70, 80, and 90 ( Figure 4). As window_size decreased from 90 to 70, there was observed an increase in the number of database structure fragments that contributed positional hits, and a corresponding increase in the amino-acid variabilities: at win-dow_size 70 there were far more (32) positions at which all amino-acids were observed among the positional hits compared to window_size 80 (1 position). At window_size 90 there was considerably less sequence variability detected among the positional hits, with the most variable positions accounting for no more than 14 distinct amino acids observed. The inset in Figure 4 displays the data points for those positions at which the dominant (most frequently observed) amino-acid occurred at a frequency of at least 80%. Circled data points are those corresponding to window_size 80. There was considerable overlap among window_size values 70, 80, and 90 with respect to positions at which the dominant residue occurred at frequency > = 80%. Window_size 70 produced 79 positions, 80 produced 74, and 90 produced 79; 62 positions occurred in all three of these data sets. This plot demonstrates that the most dominant amino-acid residues occurred within the same (narrow) positional hit frequency range (0 to < 400) regardless of window_size, suggesting that highly conserved positions display relatively little sequence variability regardless of the selected window_size.

Sequence and structure motifs associated with RdRps
Regions corresponding to the well-known sequence motifs characteristic of polymerases were mapped to the  positional hit frequency plot for window size 80 ( Figure  5). As mentioned above, regions of high positional hit frequency tended to correspond with helix secondary structure ( Figure 3), but also with defined palm-domain motifs ( Figure 5). Also plotted in Figure 5 are the frequencies of the dominant residues per position along poliovirus RdRp (lavender plot in figure). Included are residue-position labels for those positions at which the dominant residue frequency exceeded 90% and for which we were able to identify a functional annotation in the literature (see additional file 2: StralSV-RdRp_Suppl_Table 1). Although there is no clear criterion for identifying functional residues and short sequence or structure motifs based on StralSV profiles, it was evident that functionally relevant residues tended to emerge when selecting those positions displaying high degrees of conservation in structure context. Furthermore, lowering the window size to just 70 resulted in the capture of many structure fragments with common structure motifs (see last 3 maxima in Figure 3 graph and Table 3 "other" column for positions H398 and H413), implying that StralSV may enable identification of common structure motifs shared among distantly related proteins.

Discussion
StralSV differs in several respects from other sequenceand structure-based algorithms for comparing proteins. First, by applying an overlapping sliding window to define segments of a structure of interest, the algorithm avoids the pitfalls of algorithms that compare proteins at a global level: by dividing a protein into segments, StralSV enables comparison of structures by using fragments corresponding approximately in size to super-secondary elements or structure motifs. In this way, portions of a structure can be compared at a local level to like fragments in the PDB without losing the greater structural context. StralSV applies a two-step approach to filtering PDB fragments in order to select those which are most likely to be relevant for a meaningful comparison. For example, the results presented in Tables 3 and 4 and additional files 3, 4, 5, 6, 7, and 8: StralSV-RdRp_Suppl_- Table 2, StralSV-RdRp_Suppl_Table 3, StralSV-RdRp_Suppl_Table 4, StralSV-RdRp_Suppl_Table 5, StralSV-RdRp_Suppl_Table 6, StralSV-RdRp_Suppl_Table 7, illustrate how StralSV can be used to quantify and annotate sequence variability among structure fragments that form tight local alignments (determined by span and distance_cutoff parameters), yet have sufficient structure context (determined by window_size) to filter out unrelated fragments. Larger window_size values effectively increase the stringency with which structure fragments are selected, whereas smaller distance_cutoff values also increase this stringency, but more locally, by enforcing tighter local alignments. As seen in Table 3, considerably fewer fragments are selected at window_size 90 than with smaller window_size values for those positions that are sensitive to window_size, but the result set is greatly enriched for fragments that are closely related to the reference structure (i.e., polio RdRp). How much stringency the user wishes to apply in running StralSV may depend on one's research interest and the protein being studied. Smaller window_size (as well as greater distan-ce_cutoff) parameter values will result in capture of more structure fragments, many of which will be from structures that are more distantly related in terms of their SCOP classification and the taxonomy of the organisms that they are from (Table 5). It must be noted that as the window_size (or distance_cutoff) parameter is relaxed, more "noise" arises in the result set; however, at the same time there is potential for discovering structural relationships and sequence laxity that may not otherwise be discernable when comparing only closely related structures.
Effects of parameter setting on capture of database structure fragments Not unexpectedly, the frequencies of positional hits tend toward maxima in regions of helical secondary structure, and as window_size is decreased the numbers of positional hits tend to increase (Figures 3, 5). Reduction in window_size relaxes constraints on the alignment, and therefore smaller fragments may align more tightly, thereby meeting the distance cutoff. In some regions, the numbers of positional hits do not change significantly with changes in window_size (e.g., peaks up to position 240 and 360-390), indicating that in certain regions there exist conserved structure motifs that are shared only among a specific set of structures. For example, there is observed in the N-terminal regions up to approximately position 140 (Figures 3, 5) a "polymerase baseline", which appears to be structurally unique. This region is external to the catalytic tunnel of the polymerase, and may define structural and functional specificity for polio and related viruses. Window_size 90 produced a polymerase baseline in the C-terminal region as well, although here the positional hit frequency was highly sensitive to window_size. It may be that only window_size 90 (or greater) effectively filtered out non-polymerase structure fragments due to the stringency enforced by context, as it appeared that smaller segments of polio RdRp in this region resembled short segments in distantly related proteins. We observed considerable diversity among the SCOP families represented by fragments detected at different positions at window_size values less than 90 (N297, H398, H413). This demonstrated a clear filtering of these positional hits as the window_size increased. One must be aware, however, that the two parameters, window_size and distance_cutoff, have an opposing relationship with respect to qualifying a hit: increase in alignment stringency is achieved by increasing window_size, but also by decreasing distance_cutoff. As window_size is decreased, it is necessary to also reduce the distance_cutoff in order to avoid an unacceptable increase in the number of "false positive" (considerably less structurally relevant) fragments captured. We achieved a reasonable selection of window-size/distance-cutoff parameter pairings by examining the relationship between these parameters ( Figure 2). The apparent (perhaps unexpected) decrease in the numbers of non-e.8 structure fragments captured by window 50 compared to window 70 at positions N297 and H398 (Table 3 "Other" column) are explained by the high-stringency filtering achieved by the distance cutoff 2.5 Å applied at window 50. Thus, it is desirable to strike a balance between window size and distance cutoff. Small window_size values are useful for capturing shorter fragments, but to ensure that the result set is not populated by spurious hits corresponding to ubiquitous secondary structure elements (e.g., alpha helix), one must apply added stringency at the level of distance_cutoff. Smaller window_size values are appropriate when the user is interested in focusing the analysis on a relatively small structure motif, which may occur with or without the surrounding structural context in the reference structure. In capturing these smaller structure fragments, one is cautioned to enforce tighter alignments in order to assure that the resulting qualified hits are relevant to the study.

Detection of Sequence Motifs
Regions corresponding to the well-known sequence motifs A-G, characteristic of polymerases, were mapped to the positional hit frequency plot for window size 80 ( Figure 5). All categories of polymerase (RdRp, RdDp/RT, DdRp, DdDp) are recognized to contain sequence motifs A-D. Identification of motifs E, F, and G, however, has been somewhat obscured by the greater diversity among sequences assigned to these structure motifs. O'Reilley and Kao [23] reported motif E as being exclusive to RdDps and RdRps, and motifs F [22,24] and G [24] were identified in poliovirus RdRp. Motif F was identified in phi6 [33], BVDV and HCV [34], reovirus, phi6, BVDV, HCV, rhinovirus, Norwalk virus, and HIV [35], and FMDV, RHDV, and HIV1 [36]. A detailed structurebased comparison of these polymerases using StralSV may clarify the assignment of motifs A-G among the polymerase classes.

Detection of sequence variability in structure context
In examining the amino-acid variability versus qualified hit frequency (Figure 4 inset, circled data points) for 11 highly conserved positions ( Figure 5 dots) that had been functionally annotated (see additional file 2: StralSV-RdRp_Suppl_Table 1), it appeared that the numbers of positional hits and the degree of amino-acid variability observed for high-frequency positions was largely independent of window_size. This implies (at least for the positions that were examined in this study) that the functionally relevant positions are consistently detected as high-frequency-residue positions regardless of win-dow_size. Therefore, the StralSV algorithm is sufficiently sensitive to detect structurally or functionally conserved residues even when the parameters may not be perfectly tuned.
StralSV is especially useful for identifying highly conserved residues at positions that occur in regions in which there are large numbers of positional hits; dominant residue frequency cannot be considered significant in regions of structural conservation (e.g., the N-terminal region up to about position 140 in poliovirus RdRp), whereas identification of dominant residues occurring with high frequency at positions with large numbers of positional hits may help identify residues at positions that are structurally and/or functionally significant. For example, more than 250 positional hits contributed to residue counts at positions D233 and D328 (window_size 80, Figure 5), known to be critical for RdRp function. Aspartic acid occurred at close to 100% frequency at these positions. The structure fragments contributing to these data points comprised RdRps and dsRNA phage polymerases (SCOP family e.8.1.4). The combination of many positional hits and low amino acid variability may provide a means of identifying key functional residues.
Because the ability to detect possibly functional residues is of particular interest in protein functional annotation, we compared the results obtained using StralSV to those of another bioinformatics tool [37]. FireStar uses alignments to identify functional residues based on close atomic contacts in PDB structures and annotated residues in the Catalytic Site Atlas (CSA). We ran poliovirus RdRp (PDB: 1ra6) through the online FireStar server (data not shown), and found considerable overlap between the functional residue list generated by FireStar and the list of 90% dominance residues generated by StralSV (see additional file 2: StralSV-RdRp_Suppl_Table 1): K158, K167, R174, D233, D238, G289, and K359. StralSV identified the more broadly conserved residues within the functional set identified by FireStar. A main difference between the two approaches is that FireStar identifies functional residues involved in ligand bindingsome being highly conserved and others displaying considerable sequence variability. In particular we note N297, which is determined by FireStar to be associated with binding of cytidine-s-triphosphate and uridine-smonophosphate (sites 2 and 3). StralSV identified N297 as a residue that displays some degree of conservation, but also quantified the degree of conservation at this position across a wide range of structures (see additional file 4: StralSV-RdRp_Suppl_Table 3). Furthermore, StralSV is not limited to identification of residues that have been associated with a binding site, but can be used to infer structure and/or functional significance based on sequence conservation in structure context regardless of pre-existing annotation information.

Additional applications of StralSV
Results from StralSV analysis can be used to characterize residue positions in a reference protein by detection of similar locations in other proteins (sometimes from quite distant organisms and different assigned structure classifications), in which corresponding residue positions within similar structural motifs are observed. Such analyses may potentiate rapid identification of invariant (as well as unusual or unexpected) residues, which in many cases are essential to a protein's function. It also may enlighten studies of newly discovered natural or engineered mutations that have not yet been observed in the sequence databases. Results from StralSV structure similarity searches performed against large sets of structurally related proteins can facilitate refinement of constructed homology models by suggesting corrections to the query-template alignments (using an approach similar to that of [2]) or by providing a list of possible conformational variants of corresponding structural fragments for loop-building procedures. Calculated residue-residue correspondences can be used to evaluate pure sequence alignment methods and also to derive structural environment-specific substitution matrices, which have been shown to be useful for detection of remote homologs [38]. When applied to experimentally solved structures, StralSV also could facilitate identification of structural motifs (local conformations) that have not yet been observed in PDB. Such findings could aid in the discovery of previously unidentified structural motifs or suggest refinement of constructed structural models in particular regions.