DASP3 search scores segregate true positives from false positives to a greater extent than previous versions
Because DASP search scores are a critical part of our methods for clustering proteins into functionally relevant groups, it is paramount to understand how DASP3 search scores compare to DASP/DASP2 search scores. ASPs of previously identified functionally relevant groups were used to search both the PDB and GenBank databases with DASP2 and DASP3 for proteins that share active site features. Functionally relevant groups are defined here as groups identified by our Two Level Iterative clustering Process (TuLIP), which are largely equivalent to the subgroups and families annotated by SFLD curators (manuscript under review). Ideally, when each protein group is used to search the PDB, every protein in the group should be identified with a DASP search score more significant than the trusted cutoff and no other proteins should be identified with significant scores. The trusted cutoff for TuLIP is ≤1E-10 for most of the groups but can sometimes vary between ≤1E-8 and ≤1E-12.
Active site profiles from 79 functionally relevant groups identified from five SFLD superfamilies (crotonase, enolase, GST, radical SAM, and VOC) and one expertly curated superfamily (peroxiredoxin) were used to search the PDB database. Each search was performed using both DASP2 and DASP3. The search results demonstrate DASP/DASP2 and DASP3 identify all group members at search scores ≤1E-8, but the DASP3 search scores are more significant by 2.97 orders of magnitude, on average (Fig. 5a, left). Paired t-test calculations indicate group member (true positive) DASP search scores are significantly improved between DASP/DASP2 and DASP3 for each superfamily with all p-values ≤1E-4 (Additional file 1: Table S1); DASP search scores for group non-members are not significantly changed between DASP2 and DASP3 (Fig. 5a, right).
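As an illustration of how a shift such as "2.97 orders of magnitude" and a paired test statistic can be derived from matched scores, the following minimal sketch compares hypothetical DASP2 and DASP3 scores for the same proteins. The scores, the function names, and the choice to run the paired comparison on log10-transformed scores are assumptions made for this example; they are not taken from the DASP software or from the analysis reported here.

```python
import math
import statistics

def log10_shift(old_scores, new_scores):
    """Per-protein gain in significance, in orders of magnitude.

    Positive values mean the new score is more significant (smaller).
    """
    return [math.log10(old) - math.log10(new)
            for old, new in zip(old_scores, new_scores)]

def paired_t_statistic(diffs):
    """Paired t statistic for the mean of the differences being zero."""
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error

# Hypothetical scores for four group members (not real DASP output).
dasp2_scores = [1e-12, 1e-15, 1e-20, 1e-11]
dasp3_scores = [1e-15, 1e-18, 1e-22, 1e-14]

shifts = log10_shift(dasp2_scores, dasp3_scores)
mean_shift = statistics.mean(shifts)   # average improvement in orders of magnitude
t_stat = paired_t_statistic(shifts)    # compare against a t distribution with n - 1 df
print(mean_shift, t_stat)
```

The p-value would be obtained by comparing the statistic with a t distribution with n - 1 degrees of freedom, which is omitted here to keep the sketch dependency-free.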
Notably, as in previous versions, the group members and non-members are separated by at least two orders of magnitude in all 79 DASP3 searches (Fig. 6), demonstrating DASP3 can distinguish self and non-self across the isofunctional groups in the six diverse superfamilies. Furthermore, the average separation between the least significantly scoring group member and most significantly scoring non-member increases from 11 orders of magnitude in DASP/DASP2 to 13 orders of magnitude in DASP3 (Fig. 6), suggesting DASP3 separates true positives and false positives better than early versions of the software. The line of separation between group members and non-members falls in the range 1E-8 to 1E-12 for all 79 groups in DASP/DASP2; similarly, in DASP3, the line of separation is between 1E-10 and 1E-14 for all groups, as expected from the search score significance shift. The DASP search score which separates group members from group non-members is remarkably consistent, corroborating previous data suggesting significance thresholds for DASP search scores are less dependent on superfamily than other common classification methods [26].
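The separation described above can be expressed as the log10 gap between the least significant group member and the most significant non-member. The sketch below is one way to compute it; the function name and the example scores are hypothetical and serve only to make the definition concrete.

```python
import math

def separation_orders(member_scores, nonmember_scores):
    """Orders of magnitude between the worst member and the best non-member.

    A positive value means every member scores more significantly than
    every non-member, so a single cutoff can separate the two sets.
    """
    least_significant_member = max(member_scores)
    most_significant_nonmember = min(nonmember_scores)
    return math.log10(most_significant_nonmember) - math.log10(least_significant_member)

# Hypothetical search scores for one group (not real DASP output).
members = [1e-30, 1e-22, 1e-15]
nonmembers = [1e-3, 1e-2, 0.5]

gap = separation_orders(members, nonmembers)
print(round(gap, 6))
```

Any cutoff placed between the two bounding scores (here, between 1E-15 and 1E-3) would separate members from non-members for this illustrative group.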
To validate DASP3 performance in GenBank searches, 12 ASPs (from the enolase, ISII, and Prx superfamilies) corresponding to SFLD-defined functionally relevant groups were used to search GenBank with both DASP2 and DASP3 (Additional file 1: Table S1). Proteins were deemed true positives or false positives based on membership in the SFLD functional group represented by the input set, as it has been previously shown that proteins identified at significant DASP search scores are almost always annotated to the SFLD functional group of the input set [26]. Any proteins identified in GenBank searches which are not annotated in the SFLD were not used in this analysis as accurate functional group membership cannot be determined.
Similar to the PDB searches, DASP3 search scores for each subgroup are more significant by an average of 2.81 orders of magnitude compared to DASP/DASP2 (Fig. 5b, left); Wilcoxon rank test p-values are < 2E-16 for each superfamily, indicating significant improvement of DASP3 search scores (Additional file 1: Table S1).
Further, the false positive discovery rate [FP/(TP + FP)] for both DASP/DASP2 and DASP3 is < 0.5 % at a generous threshold ≤1E-8 and < 0.01 % at a trusted threshold ≤1E-12. In this analysis, false positives are defined as proteins that are members of SFLD functional groups not included in the input profile. While the false positive discovery rate is slightly higher for DASP3 (Fig. 5b, right), the difference is not statistically significant (t-test, p = 0.233). Taken together, these results demonstrate that DASP3 modifications enhance significance of the returned score and increase the score difference between true and false positives compared to previous versions of DASP.
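The false positive discovery rate defined above, FP/(TP + FP), can be evaluated at any score threshold as in the following minimal sketch. The hit list and function name are hypothetical; the sketch assumes that hits lacking an SFLD annotation have already been removed, as described for this analysis.

```python
def false_discovery_rate(hits, threshold):
    """Compute FP / (TP + FP) among hits at or below the score threshold.

    hits: list of (score, is_member) tuples, where is_member is True when
    the protein belongs to the functional group of the input profile.
    Returns 0.0 when no hit passes the threshold.
    """
    selected = [is_member for score, is_member in hits if score <= threshold]
    if not selected:
        return 0.0
    false_positives = sum(1 for is_member in selected if not is_member)
    return false_positives / len(selected)

# Hypothetical annotated hits (not real DASP output).
hits = [
    (1e-20, True),
    (1e-15, True),
    (1e-13, True),
    (1e-9, True),
    (1e-9, False),   # member of a different functional group
    (1e-5, False),
]

fdr_generous = false_discovery_rate(hits, 1e-8)   # generous threshold
fdr_trusted = false_discovery_rate(hits, 1e-12)   # trusted threshold
print(fdr_generous, fdr_trusted)
```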
DASP3 accurately identifies known functionally relevant groups of protein structures using an iterative clustering process
The Two Level Iterative clustering Process (TuLIP) was recently developed to identify functionally relevant groups of protein structures using iterative clustering and DASP PDB searches (manuscript under review). In TuLIP, a protein cluster is defined as a functionally relevant group if the DASP PDB search returns only the proteins in the cluster at significant scores with no false positives. The process has demonstrated the ability to identify known isofunctional groups in multiple superfamilies (manuscript under review). However, major changes to the DASP algorithm could profoundly affect the groups identified in the TuLIP process. To analyze the impact of DASP modifications on TuLIP clustering, TuLIP was performed using both DASP2 and DASP3 on four superfamilies.
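The acceptance criterion for a TuLIP group can be sketched as a set comparison between the cluster and the significant hits of its PDB search. The code below is only an illustration of that criterion, not the TuLIP implementation: the function name, the identifiers, and the scores are hypothetical, and the default cutoff of 1E-10 is the trusted cutoff reported above for most groups. It adopts the stricter reading in which every cluster member must also be recovered.

```python
def is_functionally_relevant(cluster, search_hits, cutoff=1e-10):
    """Return True when the significant hits are exactly the cluster members.

    cluster: iterable of protein identifiers in the candidate group.
    search_hits: dict mapping protein identifier to DASP search score.
    """
    significant = {protein for protein, score in search_hits.items()
                   if score <= cutoff}
    return significant == set(cluster)

# Hypothetical identifiers and scores (not real DASP output).
cluster = {"protA", "protB", "protC"}
clean_search = {"protA": 1e-25, "protB": 1e-18, "protC": 1e-12, "protX": 1e-4}
mixed_search = {"protA": 1e-25, "protB": 1e-18, "protC": 1e-12, "protX": 1e-11}

accepted = is_functionally_relevant(cluster, clean_search)   # no false positives
rejected = is_functionally_relevant(cluster, mixed_search)   # protX is significant
print(accepted, rejected)
```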
Prior expert analysis separated the peroxiredoxin (Prx) superfamily into six subgroups [26]. DASP was previously able to identify these subgroups distinctly in both PDB and GenBank searches using a manually curated starting set [26]. When TuLIP was used with DASP/DASP2 to cluster the Prx proteins with no a priori knowledge (Fig. 7a, left), just one of the six subgroups was identified distinctly (Prx5 as Sct3). The Tpx subgroup was combined with some of the PrxQ proteins, while the remaining two PrxQ proteins formed another group. The final three subgroups (Prx6, Prx1, and AhpE) were combined into one TuLIP group (Sct4). Conversely, when DASP3 was used to perform TuLIP, four of the six subgroups (Prx5, Tpx, PrxQ, and Prx6) were grouped according to expert subgroup annotation, while the remaining two subgroups (Prx1 and AhpE) were combined (Fig. 7a, right). In this limited test case, the TuLIP-identified groups match the known functional groups more closely using DASP3 than early versions of the software.
While DASP3 improves Prx subgroup identification over previous versions, additional SFLD superfamilies (enolase, crotonase, and GST) showed minimal differences between the two versions. When DASP/DASP2 and DASP3 are used by TuLIP to cluster these three superfamilies into functionally relevant groups, 52 and 44 %, respectively, of TuLIP-identified groups correspond one-to-one with SFLD subgroups or families, a small difference that is not statistically significant (Additional file 1: Figure S2). The subgroups and families which are combined in DASP3, such as OSBS, dipeptide epimerases, and several in the glutathione transferase superfamily, have previously been shown to be difficult to cluster [29, 34, 35].
Overall, the DASP/DASP2 and DASP3 results are consistent with regard to TuLIP-based functionally relevant clustering of the very limited number of proteins of known structure in the PDB. In some superfamilies, such as crotonase, enolase, and GST, DASP3 identifies functionally relevant groups in a similar fashion to early versions of the software. In other superfamilies, such as Prx, TuLIP is able to identify functionally relevant groups more accurately using DASP3.
DASP3 accurately identifies known Prx isofunctional groups of protein sequences with one GenBank search
To analyze whether DASP3 can identify all Prx protein sequences from a small set of known protein structures, the structures in the Prx superfamily were separated into six expertly-identified functionally relevant groups, as previously described [26]. Each of these six groups was used to search GenBank using DASP2 and DASP3. The F-measure was calculated at each DASP search score from 1E-8 to 1E-25 for both methods (Fig. 8). F-measure is the harmonic mean of precision [TP/(TP + FP)] and recall [TP/(TP + FN)]; true positives, false positives, true negatives, and false negatives were defined by inclusion in the previously expertly-identified groups [26], as explained in detail by Knutson et al. (manuscript under review). All proteins identified in these GenBank searches that were not previously identified by Nelson et al. were not included in the F-measure calculations as group membership cannot be validated. F-measure scores range from 0 to 1 where 1 indicates the search identified all true positive proteins without identifying any false positive proteins at the given DASP search score threshold.
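To make the F-measure definition concrete, the following minimal sketch computes precision, recall, and their harmonic mean at a chosen score threshold. The function name, identifiers, and scores are hypothetical; the sketch assumes that hits without a previously validated group assignment have already been excluded, as described above.

```python
def f_measure(hits, group_members, threshold):
    """Harmonic mean of precision and recall at a score threshold.

    hits: dict mapping protein identifier to DASP search score, restricted
          to proteins with a previously validated group assignment.
    group_members: set of identifiers belonging to the searched group.
    Returns 0.0 when no true positives are found.
    """
    predicted = {protein for protein, score in hits.items() if score <= threshold}
    tp = len(predicted & group_members)
    fp = len(predicted - group_members)
    fn = len(group_members - predicted)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical group and search scores (not real DASP output).
group_members = {"a", "b", "c", "d"}
hits = {"a": 1e-20, "b": 1e-12, "c": 1e-9, "x": 1e-9, "d": 1e-3}

f_at_1e8 = f_measure(hits, group_members, 1e-8)     # 3 TP, 1 FP, 1 FN
f_at_1e10 = f_measure(hits, group_members, 1e-10)   # 2 TP, 0 FP, 2 FN
print(f_at_1e8, f_at_1e10)
```

In this illustrative case, tightening the threshold removes the false positive but introduces a second false negative, so the F-measure falls, which mirrors the behavior described for more significant thresholds.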
On average, the F-measure does not significantly differ between the DASP/DASP2 and DASP3 searches (Fig. 8). However, group-by-group analysis highlights some interesting behavior. In the AhpE subgroup, the F-measure does not significantly differ at any DASP search score threshold (Fig. 8, orange). For the Prx5 subgroup, the F-measure is consistently higher in DASP/DASP2 than DASP3, though the differences are small until more significant DASP search score thresholds (Fig. 8, purple). Similarly, the Prx1 subgroup results demonstrate a higher F-measure in DASP/DASP2 than DASP3 at DASP search scores ≤ 1E-17, but similar F-measure values at less significant thresholds. For both the Prx1 and Prx5 subgroups, the lower F-measure in DASP3 is due to the emergence of false negatives at more significant DASP search scores; that is, some proteins are identified at less significant DASP search scores in DASP3 than DASP2. Interestingly, the opposite pattern in F-measure values is observed for the Tpx, PrxQ, and Prx6 subgroups (Fig. 8, green, pink, and red, respectively). In these subgroups, the F-measure is higher in DASP3 than DASP/DASP2, particularly at more significant DASP search scores. Again, the presence of false negatives in DASP/DASP2 causes the lower F-measure scores as proteins are identified at less significant scores in DASP/DASP2 searches than DASP3 searches. The enhancements made to create DASP3 result in variable F-measure improvements on a group-by-group basis, but overall no significant differences are observed after a single GenBank search (paired t-test at significance thresholds ≤1E-14 for DASP/DASP2 and ≤1E-16 for DASP3; p-value = 0.12). Notably, DASP3 identifies a large proportion of known Prx sequences in the appropriate groups; the average weighted F-measure at significance thresholds of ≤1E-8 and ≤1E-16 is 0.97 and 0.72, respectively.
DASP3 accurately and efficiently identifies known functionally relevant groups of protein sequences using an iterative sequence search process
As only structurally characterized proteins are clustered by TuLIP, GenBank searches are necessary to identify protein sequences belonging to each TuLIP group. Therefore, the Multi-level Iterative Sequence Searching Technique (MISST) was developed to iteratively identify protein sequences with active site similarity to a given functionally relevant group and, further, to determine when such groups should be subdivided based on active site similarity (manuscript under review). MISST has demonstrated the ability to identify, cluster, and subdivide the Prx superfamily and other superfamilies using DASP2. Since MISST is a key method in our software arsenal and relies on iterative searching of the sequence database, it was relevant to compare the results of the MISST process using DASP/DASP2 and DASP3. Consequently, MISST was applied to the functionally relevant groups in the Prx superfamily.
Three iterations of MISST were performed starting with the TuLIP groups identified by both DASP/DASP2 and DASP3 (Fig. 7a), as MISST is specifically designed to use TuLIP results as input. On the whole, both DASP/DASP2- and DASP3-identified MISST groups compare well with known functional groups (Additional file 1: Figure S3). After the first GenBank search (Search0), some subgroups are identified more completely by DASP/DASP2 (Prx1, Prx5, and Prx6), while some subgroups are identified more completely by DASP3 (Tpx, PrxQ, AhpE); this result supports the single-GenBank search result previously described: DASP3 does not significantly improve search results across the board after a single GenBank search. However, a greater percentage of each subgroup was identified (at DASP search scores ≤1E-8) in fewer DASP3 iterations compared to earlier versions (Fig. 7b). Notably, the PrxQ subgroup, which was difficult to identify using DASP/DASP2, was identified in full after just two iterative searches using DASP3.
Using more stringent thresholds to reduce the presence of false positives (≤1E-14 in DASP2 and ≤1E-16 in DASP3; see Fig. 5), we identified 21,632 total sequences with four iterations of DASP/DASP2 searches and 23,300 total sequences with four iterations of DASP3 searches, compared to the 3,390 sequences previously identified with a single DASP search of GenBank using a stringent threshold of ≤1E-10 [26]. Much of this increase is likely due to the five additional years of sequence addition to the database. However, some are likely newly identified sequences, given the added benefit of the modified algorithms and iterative searches. Given these and previous results, we expect the false positive rate at these score thresholds to be less than 1 %, but detailed analysis of these sequences is beyond the scope of this manuscript.
Together, these results show that, beginning with DASP3-identified TuLIP groups, iterative DASP3 GenBank searches identify the six known Prx isofunctional groups to a similar standard as expert identification. Additionally, superfamily coverage through iterative searches is obtained more quickly using DASP3 than previous versions of the software. Though the enhancements produce only incremental improvement for TuLIP clustering and single GenBank searches, these improvements combine to significantly improve the efficiency of identification and clustering across the iterative process, which is necessary for complete functionally relevant clustering of protein superfamilies.