STRIDE: a command-line HMM-based identifier and sub-classifier of Plasmodium falciparum RIFIN and STEVOR variant surface antigen families

Abstract

Background

RIFINs and STEVORs are variant surface antigens expressed by P. falciparum that play roles in severe malaria pathogenesis and immune evasion. These two highly diverse multigene families feature multiple paralogs, making their classification challenging using traditional bioinformatic methods.

Results

STRIDE (STevor and RIfin iDEntifier) is an HMM-based, command-line program that automates the identification and classification of RIFIN and STEVOR protein sequences in the malaria parasite Plasmodium falciparum. STRIDE is more sensitive in detecting RIFINs and STEVORs than available PFAM and TIGRFAM tools and reports RIFIN subtypes and the number of sequences with a FHEYDER amino acid motif, which has been associated with severe malaria pathogenesis.

Conclusions

STRIDE will be beneficial to malaria research groups analyzing genome sequences and transcripts of clinical field isolates, providing insight into parasite biology and virulence.

Background

The eukaryotic parasite Plasmodium falciparum encodes the repetitive interspersed family (RIFIN) and subtelomeric variable open reading frame (STEVOR) multigene families, variant surface antigens that are associated with severe malaria pathogenesis and immune evasion [1,2,3]. RIFINs and STEVORs share a domain architecture, although RIFINs can be further subtyped into RIFIN-As and -Bs based on a 25 amino acid indel in the semi-conserved domain and differences in subcellular localization suggestive of distinct functions (Fig. 1) [4, 5]. A subset of RIFIN-As harboring a seven amino acid FHEYDER motif in the semi-conserved domain has been shown to inhibit B- and NK-cell activation, weakening host defenses against malaria infection [6]. Both protein families are also targets of natural immunity [7].

Fig. 1

General structure of RIFINs and STEVORs. RIFINs and STEVORs are expressed on the surface of an erythrocyte infected with P. falciparum. Protein domains are illustrated as green (signal peptide), grey (variable domains), red (transmembrane domains), blue (25 amino acid insertion), and orange and purple (semi-conserved domains). There are approximately 160 rif genes in the 3D7 reference genome, separated into two subtypes, RIFIN-A and RIFIN-B, depending on sequence and subcellular localization. The FHEYDER motif (in blue) is present in the semi-conserved domain of 36 RIFIN-As in the 3D7 reference strain. STEVORs encompass ~ 30 genes per genome and are structurally similar to RIFIN-Bs

RIFINs and STEVORs pose challenges in genomic analyses due to their immense genetic diversity and numerous paralogs, which cause difficulties in reference-based assembly and identification. There are limited bioinformatic approaches to distinguish between RIFINs and STEVORs and to further classify RIFINs to the subtype level. Apart from laborious sequence alignment and phylogenetic analyses, BLAST is one of the few available tools [8]. However, BLAST requires a comprehensive reference index, lacks the sensitivity to detect highly divergent sequences, and cannot readily delineate between RIFIN subtypes. In contrast, profile Hidden Markov Models (HMMs) offer better accuracy and speed as well as greater sensitivity in detecting remote homologs [9]. Three HMM-based tools have been used to categorize RIFIN and STEVOR sequences: RSpred [4], TIGRFAM [10], and PFAM [11]; however, each is built using limited sets of reference RIFIN and/or STEVOR sequences. TIGRFAM and PFAM, the more recent tools and both part of the InterPro database [11], do not subtype RIFINs or automatically assign annotations. While RSpred addressed these concerns, it was web-based, could only evaluate ten sequences per job, and its web interface is no longer responsive.

Here, we introduce an improved HMM-based, command-line program called STRIDE (STevor and RIfin iDEntifier). STRIDE has better sensitivity than available HMM tools to detect both RIFINs and STEVORs, and also features RIFIN subtyping, automated annotations, and adjustable thresholds for sensitivity and specificity. Importantly, STRIDE allows for the determination of the number of RIFIN-A sequences with a FHEYDER motif, providing insight into mechanisms to weaken host defenses. STRIDE will have particular value for malaria genomic epidemiologists, as next-generation sequencing of clinical field isolates increases in prevalence and the contributions of RIFIN and STEVOR multigene families to severe malaria pathogenesis and the acquisition of natural immunity to malaria become clearer.

Implementation

STRIDE consists of a merged HMM generated from three different refined multiple sequence alignments of full-length publicly available RIFIN and STEVOR protein sequences (Additional file 1: Figs. S1 and S2). A total of 3536 RIFIN and STEVOR sequences were downloaded from PlasmoDB (Release 45; August 28, 2019, keyword: “RIFIN/STEVOR”). Redundant sequences were clustered with CD-HIT v4.6 (option: -c 1.0). RIFIN-A, RIFIN-B, and STEVOR proteins were first identified via BLAST. For each set of protein sequences, a multiple sequence alignment was created, and a corresponding HMM was generated with hmmbuild (default parameters) from the HMMER3 v3.2.1 package. In an iterative process (Additional file 1: Fig. S1), we used each HMM profile to search for homologous sequences in other datasets. Sequences with the highest scores were incorporated into a new seed alignment, from which a new HMM profile was built. Training concluded for each HMM profile when no additional sequences could be extracted.
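The iterative recruitment loop described above can be sketched as follows. This is a minimal illustration, not the authors' code: `build_profile` and `score_candidates` are hypothetical callables standing in for the alignment plus hmmbuild step and the HMM search step, and the `min_score` cutoff is an assumption, since the text states only that the highest-scoring sequences were incorporated.

```python
from typing import Callable, Dict, Iterable, Set, Tuple, TypeVar

P = TypeVar("P")  # whatever object represents a built profile


def train_iteratively(
    seed_ids: Iterable[str],
    candidate_ids: Iterable[str],
    build_profile: Callable[[Set[str]], P],
    score_candidates: Callable[[P, Set[str]], Dict[str, float]],
    min_score: float,
) -> Tuple[P, Set[str]]:
    """Rebuild a profile until a search round recruits no new sequences."""
    members = set(seed_ids)
    remaining = set(candidate_ids) - members
    while True:
        # Hypothetical step: align the current members and build an HMM.
        profile = build_profile(set(members))
        # Hypothetical step: score every sequence not yet in the seed set.
        scores = score_candidates(profile, set(remaining))
        recruits = {sid for sid, score in scores.items() if score >= min_score}
        if not recruits:
            # Stopping rule from the text: no additional sequences extracted.
            return profile, members
        members |= recruits
        remaining -= recruits


# Toy stand-ins so the loop can be run without HMMER: a "profile" is the
# frozen member set, and a candidate scores 10 minus its distance to the
# nearest member on a number line.
positions = {"a": 1, "b": 2, "c": 3, "d": 5}


def toy_build(members: Set[str]) -> frozenset:
    return frozenset(members)


def toy_score(profile: frozenset, remaining: Set[str]) -> Dict[str, float]:
    return {
        sid: 10 - min(abs(positions[sid] - positions[m]) for m in profile)
        for sid in remaining
    }


profile, members = train_iteratively({"a"}, positions, toy_build, toy_score, min_score=9)
```

In the toy run, "b" and then "c" are recruited in successive rounds, while "d" never reaches the cutoff, so training stops with three members.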

STRIDE uses a FASTA file as input and scores the query sequences against the HMM profile. A subprogram written in Perl v5.24 parses these scores and outputs the sequence classifications as a tab-delimited text file (Additional file 2). The main classifications are “RIFIN-A”, “RIFIN-B”, and “STEVOR.” STRIDE outputs the number of RIFIN-As with a FHEYDER amino acid motif as an exact pattern match. Truncated or highly divergent sequences are designated as “likely” RIFIN or STEVOR, and those that are unable to meet RIFIN subtyping criteria due to insufficient discriminatory characteristics (e.g. missing the protein segment containing the defining 25 amino acid indel) are called simply “RIFIN.”
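The two parsing steps in this paragraph can be illustrated with a short sketch. It assumes that scores are available in HMMER's tabular `--tblout` format as written by hmmsearch (target sequence in the first column, profile name in the third, whole sequence score in the sixth, best domain score in the ninth); the text does not state which HMMER output STRIDE's Perl subprogram reads, and the function names here are hypothetical.

```python
from typing import Dict, Tuple


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {sequence id: sequence}, joining wrapped lines."""
    records: Dict[str, list] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {sid: "".join(chunks) for sid, chunks in records.items()}


def count_fheyder(seqs: Dict[str, str], motif: str = "FHEYDER") -> int:
    """Count sequences containing the motif as an exact substring."""
    return sum(motif in seq.upper() for seq in seqs.values())


def parse_tblout(text: str) -> Dict[str, Dict[str, Tuple[float, float]]]:
    """Return {sequence: {profile: (whole sequence score, best domain score)}}.

    Assumes hmmsearch --tblout column order; comment lines start with '#'.
    """
    scores: Dict[str, Dict[str, Tuple[float, float]]] = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        target, query = fields[0], fields[2]
        whole, domain = float(fields[5]), float(fields[8])
        scores.setdefault(target, {})[query] = (whole, domain)
    return scores
```

Joining wrapped FASTA lines before matching matters here, because a motif split across two lines of a record would otherwise be missed.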

To determine sensitivities and specificities, we created a “validation” dataset that spanned a range of variant surface antigen sequence sizes, including 3888 presumed RIFINs and STEVORs from sequenced clinical isolates and publicly available assemblies (Table 1, Additional file 1: Fig. S2) [12]. In addition, we downloaded annotated protein FASTA files from several Plasmodium reference genomes: P. falciparum 3D7 (5548 sequences), P. vivax (6667 sequences), P. berghei strain ANKA (5076 sequences), P. reichenowi (5644 sequences), and P. chabaudi (5217 sequences) to test our profiles for false positives and negatives.

Table 1 Comparison of STRIDE to PFAM and TIGRFAM, using the same parameter values

Results

Generation of HMM profiles

From the 3536 RIFIN and STEVOR sequences downloaded from PlasmoDB, 967 RIFIN-A, 495 RIFIN-B, and 229 STEVOR sequences comprised the final datasets at the conclusion of HMM training (Fig. 2, Additional file 1: Fig. S2). This included representation of sequences from all sampled genomes. The Malian (ML01) and Togo (TG01) strains were polyclonal and had higher overall numbers of representative sequences. Of the 228 total RIFINs and STEVORs annotated in the 3D7 reference genome, STRIDE incorporated 122 of these sequences.

Fig. 2

Stacked bar graphs of the sequence distribution from all available P. falciparum genomes from PlasmoDB v45 at the conclusion of training each HMM profile. A total of 3536 RIFIN and STEVOR sequences were downloaded from PlasmoDB (Release 45; August 28, 2019). Redundant sequences were clustered with CD-HIT v4.6. HMM (Hidden Markov Model) profiles specific for RIFIN-A, RIFIN-B, and STEVOR proteins were created and iteratively trained against subsets of sequences that were not present in the initial seeding. 967 RIFIN-A, 495 RIFIN-B, and 229 STEVOR sequences comprised the final datasets, providing representation of sequences from all genomes. The Malian (ML01) and Togo (TG01) strains were polyclonal and had overall higher numbers of representative sequences. Of the total of 228 RIFINs and STEVORs annotated in the 3D7 reference genome, STRIDE used 122 3D7 sequences

Performance evaluation

The sensitivity and specificity of STRIDE are adjustable, although default parameters have been optimized to produce the most conservative designations (Fig. 3, Additional file 2). Datasets of 404 RIFIN-A, 476 RIFIN-B, and 40 STEVOR sequences that were randomly selected and excluded from the HMM training were used to test and define the limits of detection for each profile (Fig. 3, Additional file 1: Figs. S1 and S2). All RIFIN-A and -B sequences had low concordance to the STEVOR profile, failing to meet the STEVOR threshold score of 145. The 404 RIFIN-A sequences had whole sequence (represented in blue) and domain (represented in red) scores that exceeded the thresholds for the RIFIN-A profile. In contrast, none of the 404 RIFIN-A sequences met classification criteria for RIFIN-Bs, as their domain scores (red) were below the threshold score of 250. In the same manner, none of the 476 RIFIN-B sequences met the domain threshold score of 250 (red) required for classification as RIFIN-As. A set of positive control sequences from 3D7 demonstrated high concordance to each profile, as illustrated by their respective Circos plots (Additional file 1: Fig. S3).

Fig. 3

Depicting relationships of HMM Scores. Whole sequence scores are represented in blue and HMM domain scores are represented in red. Sets of sequences excluded from the creation of each HMM profile were used to define the limits of detection, represented by a grey line. Datasets of 404 RIFIN-A, 476 RIFIN-B, and 40 STEVOR sequences excluded from the HMM training were used to test and define the limits of detection for each profile. All RIFIN-A and -B sequences had low concordance to the STEVOR profile, failing to meet its threshold score of 145. The 404 RIFIN-A sequences had whole sequence (blue) and domain (red) scores that exceeded the thresholds for the RIFIN-A profile. In contrast, none of the 404 RIFIN-A sequences met classification criteria for RIFIN-Bs, as their domain scores (red) were below the threshold score of 250. The Y-axis represents HMM scores, and the X-axis represents the ordered numerical label for each sequence

Based on these findings, we developed an algorithm to specify the type and subtype of a queried sequence based on whole sequence and domain scores (Additional file 2). The algorithm first determines which of the three profiles (RIFIN-A, RIFIN-B, or STEVOR) registered the greatest whole sequence score. For a queried sequence to be considered a RIFIN, the whole sequence score must surpass a threshold of 200 against either the RIFIN-A or RIFIN-B profile. Queries with whole sequence scores between 100 and 200 to a RIFIN profile are considered “likely RIFINs,” and those with scores ≤ 100 are considered “unlikely RIFINs.” RIFIN subtyping requires a domain score ≥ 250 to the respective RIFIN profile; otherwise, a query receives only a “RIFIN” annotation. Similarly, for the STEVOR HMM profile, the threshold score is 145: scores between 100 and 145 are considered “likely STEVORs,” and scores ≤ 100 are considered “unlikely STEVORs.” STRIDE does not report queries that are vastly different from all of the profiles.
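The decision rules above can be written out as a small function. This is a sketch under stated assumptions rather than STRIDE's own implementation: the profile names used as dictionary keys are hypothetical, "surpass 200" is read as strictly greater than 200, "meet 145" is read as greater than or equal to 145, and returning `None` for a query with no scores is an assumed stand-in for unreported queries.

```python
from typing import Dict, Optional, Tuple

# Thresholds quoted in the text.
RIFIN_WHOLE = 200.0     # whole sequence score a RIFIN must surpass
STEVOR_WHOLE = 145.0    # whole sequence score a STEVOR must meet
LIKELY_FLOOR = 100.0    # at or below this score a query is "unlikely"
SUBTYPE_DOMAIN = 250.0  # domain score required for RIFIN-A / RIFIN-B subtyping

# profile name -> (whole sequence score, domain score)
Scores = Dict[str, Tuple[float, float]]


def classify(scores: Scores) -> Optional[str]:
    """Assign a label from per-profile whole sequence and domain scores."""
    if not scores:
        return None  # assumption: no hit to any profile means nothing is reported
    # Step 1: pick the profile with the greatest whole sequence score.
    best = max(scores, key=lambda name: scores[name][0])
    whole, domain = scores[best]
    if best == "STEVOR":
        if whole >= STEVOR_WHOLE:  # assumption: boundary value counts as STEVOR
            return "STEVOR"
        return "likely STEVOR" if whole > LIKELY_FLOOR else "unlikely STEVOR"
    # Otherwise the best profile is RIFIN-A or RIFIN-B.
    if whole > RIFIN_WHOLE:
        # Step 2: subtype only when the domain score is high enough.
        return best if domain >= SUBTYPE_DOMAIN else "RIFIN"
    return "likely RIFIN" if whole > LIKELY_FLOOR else "unlikely RIFIN"


example = classify({"RIFIN-A": (320.0, 300.0), "RIFIN-B": (150.0, 120.0), "STEVOR": (40.0, 30.0)})
```

Keeping the thresholds as named constants mirrors the adjustable sensitivity and specificity described for the tool, although how STRIDE exposes these settings is documented in Additional file 2 and is not reproduced here.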

Discussion

To compare sensitivity and specificity between tools, we adjusted the parameters of PFAM and TIGRFAM to match those of STRIDE. In the curated 3D7 reference genome, STRIDE detected STEVORs with sensitivity similar to that of PFAM and TIGRFAM. Its sensitivity for RIFINs was higher, but the difference was not statistically significant (p = 0.30; χ2 = 2.41, DF = 2, Table 2). Specificity to 3D7 sequences was equivalent across all tools. Unlike PFAM and TIGRFAM, STRIDE was not trained using the entirety of RIFINs and STEVORs from the 3D7 repertoire (Fig. 2, Additional file 1: Fig. S4).

Table 2 Sensitivity and specificity analyses of STRIDE compared to PFAM and TIGRFAM using 3D7

The “validation” dataset spanned a range of variant surface antigen sequence sizes, which included 3888 presumed RIFINs and STEVORs from sequenced clinical isolates and publicly available assemblies (Table 1). STRIDE detected a total of 3540 RIFIN and STEVOR sequences (91.0%), more than PFAM (2707, 69.6%; p < 0.00001, χ2 = 31.30, DF = 1) or TIGRFAM (3394, 87.3%; p = 0.31716, χ2 = 1.00, DF = 1). We also used other Plasmodium reference genomes to further test for specificity. STRIDE appropriately detected RIFINs and STEVORs in gorilla- and chimpanzee-infecting parasites (e.g. P. reichenowi) but did not register any hits to the genomes of P. vivax, P. berghei, or P. chabaudi, three species that lack RIFIN and STEVOR orthologs (Table 1).
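The detection percentages and p-values quoted here and in the 3D7 comparison can be checked with a few lines of arithmetic. The sketch below does not reconstruct how the χ2 statistics themselves were computed, which the text does not describe; it only converts the reported statistics to p-values using the closed-form chi-square tail probabilities for 1 and 2 degrees of freedom. The function name is hypothetical.

```python
import math


def chi2_p_value(chi2: float, df: int) -> float:
    """Upper-tail probability of a chi-square statistic for df = 1 or 2."""
    if df == 1:
        # For one degree of freedom the tail equals erfc(sqrt(x / 2)).
        return math.erfc(math.sqrt(chi2 / 2.0))
    if df == 2:
        # For two degrees of freedom the tail equals exp(-x / 2).
        return math.exp(-chi2 / 2.0)
    raise ValueError("only df = 1 or df = 2 are handled in this sketch")


# Counts reported for the validation dataset of 3888 presumed sequences.
TOTAL = 3888
detected = {"STRIDE": 3540, "PFAM": 2707, "TIGRFAM": 3394}
rates = {tool: round(100.0 * n / TOTAL, 1) for tool, n in detected.items()}

p_pfam = chi2_p_value(31.30, 1)     # STRIDE vs PFAM, validation dataset
p_tigrfam = chi2_p_value(1.00, 1)   # STRIDE vs TIGRFAM, validation dataset
p_3d7 = chi2_p_value(2.41, 2)       # three-tool comparison on 3D7
```

Because the χ2 values are reported to two decimal places, p-values recomputed this way agree with the published figures only to about three decimal places.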

Using STRIDE, we reevaluated a subset of 320 sequences from PlasmoDB that had received a broad, overlapping “RIFIN/STEVOR family, putative” designation (Additional file 3). These sequences originated from long read-based assemblies of several parasite strains [13]. Among the 312 sequences that met or exceeded identification thresholds, 176 were determined to be RIFIN-As, including 52 with FHEYDER motifs; 80 were RIFIN-Bs; and 56 were STEVORs. Eight sequences did not meet the designated limits of detection for exact classifications. These were mostly truncated copies and thus classified by STRIDE as “RIFIN” or “likely RIFIN.”

We also applied STRIDE to predict the number and classification of RIFINs and STEVORs from 15 unannotated long read-based de novo assemblies of clinical field isolates (Additional file 3) [12]. Initial classification using BLASTp led to mixed results and overlapping annotations. The number of STRIDE-predicted RIFINs and STEVORs from the NF54 de novo assembly mirrored that of 3D7, which was expected given that 3D7 is a clone of the NF54 isolate [14]. STRIDE also consistently identified comparable numbers of RIFINs, STEVORs, and FHEYDER motifs across most clinical samples from diverse geographies. Several “likely RIFIN” sequences in each assembly are encoded by short, truncated contigs and could not be precisely classified. There were proportionally greater numbers of sequences found in the Myanmar samples, which are likely polyclonal (Additional file 3).

Conclusions

We present STRIDE, an HMM-based, command-line program that automates RIFIN and STEVOR prediction, differentiates RIFIN-As from RIFIN-Bs, and identifies the number of sequences with the known pathogenic FHEYDER motif. STRIDE eliminates the need to perform multiple sequence alignments and phylogenetic analyses or to use specialized knowledge of these two protein families to sort RIFINs and STEVORs. STRIDE has better sensitivity to detect RIFINs than other available HMM-based tools and supports adjustable thresholds to customize desired levels of sensitivity and specificity. This HMM-based approach for variant surface antigen classification may be useful for other Plasmodium species and organisms with multigene families, such as Trypanosoma.

Availability of data and materials and requirements

The datasets with their respective accession numbers supporting the conclusions of this article are listed in Additional file 3. Other datasets used and/or analyzed in this current study are available from the corresponding author upon request.

Project Name: STRIDE (STevor and RIfin iDEntifier)

Project Home Page: https://github.com/albert-zhou-umb/STRIDE.git

Operating system: Platform-independent

Programming Language: Command-line application written in Bash

Other Requirements: HMMER v3.3 and Perl v5.24

License: GNU GPL

Any Restrictions to Use by Non-Academics: None.

Abbreviations

3D7: Reference strain and lab variant of P. falciparum
BLASTn/p: Basic local alignment search tool (nucleotide/protein)
FHEYDER: A seven amino acid motif of RIFIN-As that bind to B- and NK-cells
HMM(ER): Hidden Markov model
PFAM: Database of protein families to support genome annotation
RIF/RIFIN: Repetitive interspersed family of polypeptides protein
RSpred: RIFIN/STEVOR predictor
SC: Semi-conserved domain
STEV/STEVOR: Sub-telomeric variable open reading frame protein
STRIDE: STEVOR and RIFIN Identifier
TIGRFAM: Database of protein families to support genome annotation
V1/V2: Variable domain 1/variable domain 2

References

  1. Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, et al. Genome sequence of the human malaria parasite Plasmodium falciparum. Nature. 2002;419:498.

  2. Goel S, Palmkvist M, Moll K, Joannin N, Lara P, Akhouri R, et al. RIFINs are adhesins implicated in severe Plasmodium falciparum malaria. Nat Med. 2015;21:314–7.

  3. Travassos MA, Niangaly A, Bailey JA, Ouattara A, Coulibaly D, Lyke KE, et al. Children with cerebral malaria or severe malarial anaemia lack immunity to distinct variant surface antigen subsets. Sci Rep. 2018;8:6281.

  4. Joannin N, Kallberg Y, Wahlgren M, Persson B. RSpred, a set of hidden Markov models to detect and classify the RIFIN and STEVOR proteins of Plasmodium falciparum. BMC Genomics. 2011;12:119.

  5. Petter M, Haeggström M, Khattab A, Fernandez V, Klinkert M-Q, Wahlgren M. Variant proteins of the Plasmodium falciparum RIFIN family show distinct subcellular localization and developmental expression patterns. Mol Biochem Parasitol. 2007;156:51–61.

  6. Saito F, Hirayasu K, Satoh T, Wang CW, Lusingu J, Arimori T, et al. Immune evasion of Plasmodium falciparum by RIFIN via inhibitory receptors. Nature. 2017;552:101–5.

  7. Zhou AE, Berry AA, Bailey JA, Pike A, Dara A, Agrawal S, et al. Antibodies to peptides in semiconserved domains of RIFINs and STEVORs correlate with malaria exposure. mSphere. 2019;4:13.

  8. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.

  9. Eddy SR. HMMER user’s guide. 2018:221.

  10. Haft DH, Selengut JD, Richter RA, Harkins D, Basu MK, Beck E. TIGRFAMs and genome properties in 2013. Nucleic Acids Res. 2013;41(Database Issue):D387–95.

  11. Finn RD, Attwood TK, Babbitt PC, Bateman A, Bork P, Bridge AJ, et al. InterPro in 2017-beyond protein family and domain annotations. Nucleic Acids Res. 2017;45:D190–9.

  12. Moser KA, Drábek EF, Dwivedi A, Stucke EM, Crabtree J, Dara A, et al. Strains used in whole organism Plasmodium falciparum vaccine trials differ in genome structure, sequence, and immunogenic potential. Genome Med. 2020;12:6.

  13. Otto TD, Böhme U, Sanders M, Reid A, Bruske EI, Duffy CW, et al. Long read assemblies of geographically dispersed Plasmodium falciparum isolates reveal highly structured subtelomeres. Wellcome Open Res. 2018;3:52.

  14. Walliker D, Quakyi IA, Wellems TE, McCutchan TF, Szarfman A, London WT, et al. Genetic analysis of the human malaria parasite Plasmodium falciparum. Science. 1987;236:1661–6.

Acknowledgements

We would like to thank the study teams that assisted in field sample collections in Brazil, Myanmar, and Thailand—Drs. Priscila T. Rodrigues, Marcelo U. Ferreira, Michele D. Spring, Krisada Jongsakul, Chanthap Lon, Dysoley Lek, Stuart D. Tyner, David L. Saunders, Myaing M. Nyunt, and Christopher V. Plowe. We would also like to acknowledge Drs. Terrie Taylor and Don Mathanga for their assistance with collection of clinical isolates in Malawi.

Funding

This work was supported by National Institutes of Health Grants 1F30HL146095-01A1, K23AI125720, R01HL146377, R01HL130750, R01AI099628, R01AI141900, and U19AI110820. M. A. Travassos was supported by a Passano Foundation Clinician-Investigator Award. A.E. Zhou was supported by the IMSD Meyerhoff Graduate Fellowship and the University of Maryland, Baltimore Institute for Clinical and Translational Research (ICTR) and Medical Scientist Training Program (MSTP). Partial funding for open access was provided by the University of Maryland Health Sciences and Human Services Library's Open Access Fund. None of the funding bodies played any role in the design of the study; collection, analysis, and interpretation of data; or in writing the manuscript.

Author information

Contributions

AEZ, MAT, and JCS designed and analyzed the study and drafted the manuscript. AAB, DS, STH, TDO, JCS, and MAT funded and contributed to the analyses. AEZ, ZVS, KRB, and JBM contributed bioinformatics assistance and to the analyses. All authors edited, read, and approved the final version of the manuscript.

Corresponding author

Correspondence to Mark A. Travassos.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1. Word document with information on the creation of STRIDE as well as sequences used for generating, training, and validating each HMM profile.

Additional file 2. README text file containing information for the user about the STRIDE program, including instructions on its installation and execution and the interpretation of output.

Additional file 3

List of reclassified PlasmoDB sequences and annotated de novo assemblies using STRIDE. Excel document containing a list of PlasmoDB gene IDs with their original annotations and the suggested classifications based on STRIDE, in addition to the predicted number of RIFINs and STEVORs from 15 de novo assemblies with their respective accession numbers.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Zhou, A.E., Shah, Z.V., Bradwell, K.R. et al. STRIDE: a command-line HMM-based identifier and sub-classifier of Plasmodium falciparum RIFIN and STEVOR variant surface antigen families. BMC Bioinformatics 23, 15 (2022). https://doi.org/10.1186/s12859-021-04515-8

Keywords

  • Malaria
  • Plasmodium falciparum
  • RIFIN
  • STEVOR
  • Bioinformatics
  • Hidden Markov models