Greedy de novo motif discovery to construct motif repositories for bacterial proteomes

Background Bacterial surfaces are complex systems, constructed from membranes, peptidoglycan and, importantly, proteins. The proteins play crucial roles as critical regulators of how the bacterium interacts with and survive in its environment. A full catalog of the motifs in protein families and their relative conservation grade is a prerequisite to target the protein-protein interaction that bacterial surface protein makes to host proteins. Results In this paper, we propose a greedy approach to identify conserved motifs in large sequence families iteratively. Each iteration discovers a motif de novo and masks all occurrences of that motif. Remaining unmasked sequences are subjected to the next round of motif detection until no more significant motifs can be found. We demonstrate the utility of the method through the construction of a proteome-wide motif repository for Group A Streptococcus (GAS), a significant human pathogen. GAS produce numerous surface proteins that interact with over 100 human plasma proteins, helping the bacteria to evade the host immune response. We used the repository to find that proteins part of the bacterial surface has motif architectures that differ from intracellular proteins. Conclusions We elucidate that the M protein, a coiled-coil homodimer that extends over 500 A from the cell wall, has a motif architecture that differs between various GAS strains. As the M protein is known to bind a variety of different plasma proteins, the results indicate that the different motif architectures are responsible for the quantitative differences of plasma proteins that various strains bind. The speed and applicability of the method enable its application to all major human pathogens. Electronic supplementary material The online version of this article (10.1186/s12859-019-2686-8) contains supplementary material, which is available to authorized users.


Background
The rise of antibiotics resistant bacteria poses a major global health issue predicted to cause 10 million deaths per year in 2050, more than heart disease and cancer combined [1]. The increasing resistance to antibiotics necessitates the development of alternative treatment strategies. One promising alternative treatment strategy includes the disruption of protein binding interfaces between bacteria *Correspondence: lars.malmstroem@uzh.ch 1 Faculty of Science, Institute for Computational Science, University of Zurich, 429 Winterthurerstrasse, 190, CH-8057 Zurich Switzerland 2 Service and Support 430 for Science IT (S3IT), University of Zurich, Winterthurerstrasse, 190, CH-8057 431 Zurich, Switzerland Full list of author information is available at the end of the article and human proteins to disarm bacterial defense systems [2]. Such strategies require high-confident identification of sequence motifs that correspond to a structural unit that are necessary for protein folding or binding of ligands and other proteins.
Motifs are short segments of a protein sequence which shows a level of conservation throughout a protein family and beyond. Conserved motifs can be extracted from multiple sequence alignment of proteins with similar functions in different species. While finding such motifs can provide insights for prediction of functional residues, identifying and understanding them is fundamental to discovering binding interfaces in protein complexes [3]. It is generally believed that the binding interfaces forming interactions to help bacteria evade the immune system or to obtain nutrients are comparatively more conserved compared to interactions that are benefiting the host, such as surface exposed epitope. Over time, this results in segments of exposed proteins that are significantly more conserved for functional reasons.
Disrupting the protein-protein interactions by targeting the conserved segments would potentially facilitate the host immune response [4][5][6]. However, the high variability of bacterial surface proteins makes it challenging to study them with traditional sequence analysis methods. InterPro for example [7] contains motifs for the anchor and the signal peptide whereas the rest of the protein sequence remains largely unannotated. Multiplesequence alignment algorithms typically run into problems with the variable number of repeats and tends to produce highly gapped alignments. The rapid growth of known bacterial protein sequences presents an opportunity to identify protein-family specific motifs (in contrast to Interpro that attempts to find motifs common to multiple families).
Group A streptococcus (GAS) is one of the most important bacterial pathogens causing over 700 million mild infections such as tonsillitis, impetigo and erysipelas and, occasionally, severe invasive infections including sepsis, meningitis or necrotizing fasciitis with mortality rates up to 25% [8]. Surface proteins play important roles in the interaction with host proteins [9]. Several bacterial surface proteins interact with numerous of host proteins, forming complex protein-protein interaction networks.
One of the key surface proteins of S. pyogenes is the M protein, a coiled-coil homodimer that extends over 500 Å from the cell wall. The M protein is capable of binding several plasma proteins such as fibrinogen [6] and albumin [10,11]. A crystal structure of M and fibrinogen was published in 2011 demonstrates that the M and fibrinogen form a cross-like complex structure. Further, the M protein is composed of several repeats that are present a variable number of times; some of these repeats overlap with protein-protein interactions binding interfaces [12][13][14][15]. Accordingly, a comprehensive repository of the motifs in coiled-coil proteins and their relative conservation grade is a prerequisite to target the protein-protein interaction that bacterial surface protein makes to host proteins [16].
Here, we present a strategy to iteratively identify protein-family specific motifs from large genome resources, then mask all occurrences of these motifs until no more significant motifs can be found. We applied this strategy to a GAS strain as a model system. We constructed a compendium of almost 60000 motifs for GAS. Further, we demonstrate the power of the approach using the M protein and describe the motif resource in general terms.

Outline of the algorithm
The algorithm starts with a database of protein sequence families and sub-selects a user-defined number (here 100) of sequences for each family containing more than 100 sequences as shown in the pseudocode in Fig. 1. The main part of the algorithm finds the first motif in a sequence and mask all identified occurrences, then remove them from the sequence and produce one new sequence for each occurrence in the second iteration. After finishing this loop, all identified motifs are stored in the main repository and followed by architectural analysis by considering the occurrences of each motif in the entire genome and by computing the internal overlaps of that motif inside the family as well. In the last step, the results will be compared with InterPro to find common overlapping motifs to report and the new ones for further analysis. All the results as well as the main repository of all discovered motifs are stored and available in a SQLite table. SQLite is an embedded SQL database engine that implements a transactional SQL database engine. The code for SQLite is in the public domain and free for use.

Construction of protein families
We selected a representative genome from an invasive M1 S. pyogenes isolated in Ontario, Canada. This sample is available with id 293653.4 from PatricBRC, the bacterial bioinformatics resource database [17]. This genome has 1931 coding sequences (CDS). We downloaded additional 70459 genomes from PatricBRC. This number here refers to the number of genomes available at PatricBRC that had both an .ffa protein fasta file and .cds files that contains a table which links the PatricBRC sequence accession number to the FigFam ID [18]. We used this resource to build one protein fasta file per FigFam ID filtering out duplicate entries. We constructed 1564 FIGfams families containing a total of 9,041,083 protein sequences of which 3,817,065 were unique at the amino acid level. This sequence resource was used as input to the workflow outlined in Fig. 2. Figure 2 shows the general workflow of our approach, where we make use of MEME [19,20] and FIMO [21] in the core part of the system to handle motif discovery and masking the multiple occurrences of each motif on the sequence. MEME is an open-source application which has been widely used for sequence motif discovery and analysis in both DNA and proteins. It is based on GLAM2 algorithm [22] and enables covering of motifs containing gaps. While MEME finds a single occurrence of a motif in the sequence, FIMO is able to consider the MEME's output and define multiple occurrences for any individual gapped or un-gapped motifs. FIMO assign different Fig. 2 The workflow of de novo motif discovery approach. All FIGfams families were downloaded for a user-specified organism and get through the core of the processing -MEME and FIMO-where all the motifs are discovered and masked in an iterative process. A sqlite repository stores all the motifs and motif architectures with required information like organism name, FIGfams family name, the start and stop points on the sequence and etc scores for each matched sequence according to a dynamic programming approach [23] and then motif-specific qvalues are computed based on a bootstrap procedure [24]. FIMO's outputs are considered according to their pvalues, and q-values make it possible to set a user-defined thresholds to cover only specific motif occurrences.

InterPro
InterProScan [7] is a reference resource that provides a functional analysis of protein sequences by classifying them into families and predicting the presence of domains and important sites. In order to achieve a general view of the coverage of our approach, we compared the generated de novo based motif repository of GAS with all GAS-related motifs in InterProScan.

Assigning proteins to cellular compartments
All proteins were assigned to one of seven compartments by using information from mass spectrometry experiments, Fig. 3 Sub-selection test on two sample families. Two sample families are selected and analyzed by sub-selection test. The bubble graph indicates that selection of 100 sequences has a good coverage while saving computational resources more than 20-fold annotations from several databases followed by manual curation. In short, we identified exposed, cell wall associated and secreted protein using data from Karlsson et al. [9]. Transmembrane proteins were identified using TMHMM [25]. DNA associated protein and transcription factors were identified using InterPro [7] and RegPrecise [26]. All other proteins were assigned to the intracellular compartment.
Software availability MEME and FIMO 4.11.1 was used through out the project. The workflow is implemented in GC3pie [27] which makes it possible to parallelize over all available computational cores. All parts of the workflow are written in Python 2.7 and is wrapped by applicake, an opensource and free framework useful when designing workflows. The workflow is available through a singularity container [28] and the container together with the data and an ipython notebook contains instructions and examples to parse the data are provided online with this DOI: 10.5281/zenodo.1403142.

Sub-selection
We analyzed a large sequence database of all GAS proteins containing 1564 FIGfams sequence families as outlined in the Methods section. The FIGfams contain a different number of sequences. This begs the question whether a subset of them would be sufficient to cover most of the motifs. We designed a general sub-selection test to reduce the number of sequences due to computational resource reasons. The sub-selection considers two different families and select a set of 2, 10, 20, 50, 100, 250, 500, and 1000 sequences randomly and repeat the whole analysis for 10 times. In each sub-selection test, we ran the workflow to find all the motifs, and we made an average of motif-coverage between all 10 repeats. Figure 3 demon- The number of sequences, the motif coverage in percentage (which is the average of 10 repeated test) and the computational time on 1 CPU are shown Fig. 4 Two rounds (first to the left, second to the right) of the algorithm is displayed. It starts with a collection of sequences and then discovers motifs in this collection using MEME. It uses FIMO to find additional occurrences of the motif within the sequence collection. In the second round, the motifs are masked (gray bars) before MEME is applied once more. The algorithm iterates through round 3 to N until no more motifs are found, or the sequence collection is fully annotated strates that sub-selection of 100 sequences is sufficient to cover the majority number of all motifs while reducing time and computational resources more than 20-fold (see Table 1).

MEME/FIMO
The workflow starts by entering the name of the desired organism and the q-value cut-off (optional) which are the only required inputs (Fig. 2). In the second phase, all FIGfams protein families related to the input organism are downloaded and stored in a database. Then, by considering the accessibility of computational resources, de novo motif discovery on protein families starts. Figure 4 shows two sample runs of the algorithm where MEME is applied to the sequence collection, restricting the number of identified motifs to one. Motif occurrences were discovered in the sequence collection using FIMO, and only occurrences with e-values of 1e -6 or lower were Fig. 5 The auto-generated result of our approach on M1 protein. a: Binding interfaces of fibrinogen according to the reference crystal structure (PDB id 2XNX). [6]. b: M1 domains proposed in [10]. c: The output results of our approach considered. The proteins were split using the number of occurrences and remaining parts longer than ten amino acids are carried forward to create a new merged sequence collection, mixed with full-length and partial proteins.
The new sequence collection is used as the input for each iterative round of MEME, FIMO, split until no more significant motifs could be discovered, or all remaining sub-sequence were below ten amino acids. All the motif occurrences with corresponding features are stored in an SQLite database. To give further information to the user, known motifs are also integrated from [10] and the InterPro database and visualized using pViz.js [29].

Protein M1 Motif discovery
As an example of application on specific protein family, we collected a large sequence collection of M proteins from four sources: PatricBRC, genomes we have previously sequenced and assembled [30], the M database from CDC (Centers for Disease Control and prevention) and the UniProtKB/TrEMBL database. Any M protein sequence without motifs representing an anchor or a signal peptide was discarded, and the remaining sequences were reduced to a 98% sequence identity using CD-HIT [31]. In total, the algorithm ended after 18 rounds, resulting in 20 motifs from the M protein sequence collection. The SF370 M1 protein reference [32] contained motifs m01-m03, m05-m08, m11-m14, m16-m17 but not m04, m09-10, m15, and m18-20. Additional file 1 contains the logo of all discovered motifs. Figure 5 shows the general motif architecture as the output of the algorithm. Note that an architecture (motif pattern) shows the distribution of motifs over the entire protein family. By considering such representation, it is possible to show the general motif pattern that most of the proteins in the family follow. So, the architectural motif view helps to find potential protein-protein interaction binding sites as the majority member of the family desire to follow such pattern. Accordingly, we found a total of 123 motif architectures, and of these, 85% (104) are associated with a single serotype.

Analysis of conserved motif in GAS genome
We evaluated identified motifs separately based on protein families in different cellular compartments ( Table 3). The main idea is to provide a general comparison between protein families in different cellular compartments in terms of motif-based conservation grade which helps to discover the general evolutionary pressure on cellular compartments and further distinguishing potential drug targets inside and outside the cell. To do that, one should consider the fact that the number of motifs in each compartment is a function of sequence length and the average is dependent on sequence variability. Such dependency affects the comparison between different protein families led to results that are biased against sequence length. To address these issues, we represent motif architecture per sequence and most importantly per family. Accordingly, each protein or its related family can have one or several architectures based on motif variability on that family. Consequently, protein families with few architectures indicates higher sequence conservation inside the family and generally shows that the family has more conserved motifs to do special cellular functions. In this way, by comparing the number of architecture in two different protein families, it is possible to state which family is more conserved.  As shown in Table 4, most variable proteins in GAS are transmembrane and secreted proteins which are less conserved and have a more diversified interaction with host proteins. Most conserved proteins are DNA-related and transcription factors together with intracellular proteins that have special machinery roles inside the cell. Transmembrane proteins which play crucial role as the transportation system on the bacterial surface are also more evolved according to the evolutionary pressure. In general as Fig. 6 indicates, we can conclude that the evolutionary pressure is lower on intracellular proteins compared to the surface and secreted proteins.

Comparison with InterPro
To compare our results to InterPro, we analyzed and filtered motifs based on their signature from InterProScan which revealed that 11996 distinct motifs related to GAS are not recognized/discovered by InterProScan (71.15% of all discovered motifs) while there are many important motifs also in common (28.85%). Table 5 contains the list of most commonly overlapping motifs with special InterPro description which discovered by our approach.

Discussion
Conserved protein sequence domains, also referred to as motifs, play an important role in protein function, protein structure and protein-protein interactions. Motifs are the results of several evolutionary processes where, for example, a part of a protein is evolving at a different rate compared to other parts of the same protein. Identifying motifs are fundamental to understanding protein function and to discover putative binding interfaces.  Motifs can both be used to shed light on the evolutionary process underpinning the development of a protein family with respect to the protein's function over time; it can also be used to produce a simplified view on the protein as a series of conserved motifs that together specify a proteins motif architecture. Although several approaches have been developed to address motif discovery on protein sequences, most are either focused on a given motif or finding motifs, such as signal peptides, that can be found in a general population of protein sequences.
Here, we developed a de novo motif discovery approach and applied to protein families that share a common ancestral protein; this resulted in a repository of motifs over an entire organism. This approach is focused on understanding the evolutionary processes that have acted on that protein family in a comparatively short evolutionary time. We developed and designed this approach as a software package which is written in Python and distributed via singularity containers [28] making it easy to install and use. We demonstrated the approach on GAS, an important human pathogen with a mortality rate of 25% at invasive infections. We also characterized the proteome-wide motif repository by comparing it to InterPro; furthermore, we analyzed the motif architecture for these proteins and discovered that the number of sequence per architecture is different for different cellular compartments.
Given the speed and flexibility of our approach, we believe it will be useful in breaking analyzing surface protein of pathogens as these proteins are under high selective pressure and therefore cannot be analyzed using more traditional approaches such as multiple-sequence alignments (MSAs). Our attempts to use various MSA algorithms failed due to high sequence variability in regions between motifs and the varying number of motifs. Also, motif searching approaches failed and only identified a small subset of the motifs that our approach discovered.

Conclusion
In this paper, we demonstrate a proof-of-principle approach to parsing large sequence families into motifs using a denovo-based greedy approach. This simple approach can easily handle situations where parts of proteins are repeated or re-arranged, and this can be time-consuming using other approaches. While this general approach can be applied to any bacteria, we used GAS as a model system to make a comprehensive motif repository of its proteins. We further analyzed M1 protein, one of the most important virulence factor of S. pyogenes to show the motif-based architectural analysis. We observe that we over-parse some domains, but also observe that many of these large domains are only partly conserved over the sequence collection. The result indicates that many of the newly discovered motifs are not always present together with adjacent motifs, indicating that they might have different and independent functions. Interestingly, many of our newly discovered motifs are not found in any of the emm1 strains, and some of these might be responsible for binding other ligands.