ProtPlat: an efficient pre-training platform for protein classification based on FastText

Background For the past decades, benefitting from the rapid growth of protein sequence data in public databases, a lot of machine learning methods have been developed to predict physicochemical properties or functions of proteins using amino acid sequence features. However, the prediction performance often suffers from the lack of labeled data. In recent years, pre-training methods have been widely studied to address the small-sample issue in computer vision and natural language processing fields, while specific pre-training techniques for protein sequences are few. Results In this paper, we propose a pre-training platform for representing protein sequences, called ProtPlat, which uses the Pfam database to train a three-layer neural network, and then uses specific training data from downstream tasks to fine-tune the model. ProtPlat can learn good representations for amino acids, and at the same time achieve efficient classification. We conduct experiments on three protein classification tasks, including the identification of type III secreted effectors, the prediction of subcellular localization, and the recognition of signal peptides. The experimental results show that the pre-training can enhance model performance effectively and ProtPlat is competitive to the state-of-the-art predictors, especially for small datasets. We implement the ProtPlat platform as a web service (https://compbio.sjtu.edu.cn/protplat) that is accessible to the public. Conclusions To enhance the feature representation of protein amino acid sequences and improve the performance of sequence-based classification tasks, we develop ProtPlat, a general platform for the pre-training of protein sequences, which is featured by a large-scale supervised training based on Pfam database and an efficient learning model, FastText. The experimental results of three downstream classification tasks demonstrate the efficacy of ProtPlat. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04604-2.

results in function-related prediction tasks based on protein sequences, such as protein subcellular localization [22][23][24], protein structural characteristics prediction [28,29], and protein-protein interaction prediction [30,31]. Especially, with the rise of deep neural networks, traditional feature extraction methods have been largely replaced by sequence encoding schemes, like the pre-trained word embeddings techniques [5,6], which produce dense continuous vectors and obtain much better performance than discrete features [7][8][9].
Protein sequence classification has widely utilized the technologies of text categorization from natural language processing [41]. Benefitting from deep learning models and word embedding methods, text categorization has achieved great progress, which also brings opportunities for improving the performance of protein classification tasks. However, there are inherent difficulties to adapt the word embedding techniques from natural language processing to protein sequence representation. For one thing, there are no defined words in amino acid sequences, while the pre-training of embedding vectors mostly relies on language modeling, e.g., next word prediction. For another thing, protein sequences have a much smaller alphabet but are quite longer than natural language sentences, which brings new challenges to learning models.
To improve the learning performance of machine learning methods, pre-training is a very effective strategy. Pre-training was first proposed in the computer vision field and achieved good results. In recent years, it has been widely used in various tasks of natural language processing. The pre-trained models often have fast convergence speed and good generalization performance, especially for the tasks with limited training data. The existing pre-training models are mainly unsupervised models, like ELMo [10] and BERT [11], which are computation intensive. For instance, SeqVec [26] introduces the language model ELMo to represent amino acid sequences as embedding vectors to obtain the biophysical properties of proteins; ProtTrans [27] uses various transformer models taken from natural language processing to provide the pre-trained model for amino acid sequences. Both the two methods have a high demand for computation resources. Alternatively, as the protein sequence-based classification tasks share some common sequence features, the pre-training can leverage a large-scale labeled dataset and transfer the knowledge to other small-data problems [35,36].
Based on this idea, we propose a supervised pre-training platform, ProtPlat. We perform a large-scale pre-training task, protein family classification, to automatically extract the effective information from the protein sequences. Benefiting from the large-scale Pfam database [2] and the FastText library [12], ProtPlat has sufficient data for pre-training word embeddings and a highly efficient classification procedure. Furthermore, the pre-trained model can be applied to various protein classification tasks. We implement a web service, which allows users to upload their training and test data. The training data is used to fine-tune the pre-trained model, and the prediction results on the test data are provided on the website. We evaluate the performance of the platform on three downstream protein classification tasks with different data scales, namely the identification of type III secreted effectors, the prediction of protein subcellular localization, and the recognition of signal peptides. The experimental results show that the platform not only has a high response speed but also improves the accuracy of all these tasks. It can be used as a general platform to improve the task of protein sequence classification.

Materials
As a large-scale corpus is required for pre-training, we use the Pfam database [2], which is a comprehensive collection of protein families, including over 34 million protein sequences. The family label is the training target. As there are too many labels (17,929 labels in version 32.0, 2019), to avoid issues caused by an extremely imbalanced data distribution (as shown in Table 1), we remove the families whose samples are less than 500, resulting in 7249 protein families with 32,853,084 sequences in our corpus.
To assess the performance of the pre-training platform, we experiment with three downstream classification tasks, as described in the following.
Task I: Identification of type III secreted effectors The type III secretion system (TTSS) is related to the secretion of virulence factors of many Gram-negative pathogens. The effector proteins of the type III secretion system (T3SEs) are directly secreted from bacterial cells into host cells, and then play roles in disease progression and immune response suppression. Identifying the type III secreted effectors can help reveal the mechanism of TTSS. However, the prediction of T3SEs is a particularly challenging job due to the lack of conserved motif or secretion signal, and the existing methods mainly utilize statistical characteristics of amino acid sequences. Here, we adopt the same dataset as WEDeepT3 [13], including 525 training samples (241 effectors and 284 non-effectors) and 138 test samples. The sequence identity is below 40%. Data statistics are shown in Table 2.
Task II: Prediction of subcellular localization The location of a protein in a cell is closely related to its function. Only in a suitable subcellular location can a protein perform its function correctly. Computational prediction of protein localization in cells has been a hot topic in the field of bioinformatics. Most of the existing tools are based on protein sequences and machine learning methods [23-25, 42, 43]. We use a classic benchmark set, BaCeILo [14], including proteins from animals, fungi, and plants, located at four subcellular compartments, i.e., nucleus, cytoplasm, mitochondrion, and secretory pathway. Data statistics are shown in Table 3. Besides, we also use the latest dataset that is used in DeepLoc [4], including 13,858 protein sequences located at 10 subcellular compartments, i.e., nucleus, cytoplasm, extracellular, mitochondrion, membrane,  endoplasmic, plastid, Golgi, lysosome, and peroxisome. The data statistics are shown in Additional file 1: Table S1. Task III: Recognition of signal peptides Signal peptides are usually located at the N-terminals of protein sequences and generally 5-30 amino acids in length. The main function of signal peptides is to promote the secretion of proteins outside the cell or localize them to certain organelles, so the identification of signal peptides can provide clues for revealing protein functions. We consider two types of signal peptides, i.e., Sec substrates cleaved by SPase I (Sec/SPI) and others using the SignalP 5.0 dataset [15]. The proteins are from Eukarya, Archaea, Gram-positive bacteria, and Gram-negative bacteria. Data statistics are shown in Table 4.

Protein sequence segmentation
For text categorization tasks, word features are widely used in the classifiers. Similarly, features are extracted from short peptides (i.e., k-mers) for protein sequence classification. The quality of word segmentation may have a substantial impact on the accuracy of protein sequence classification. Therefore, the preprocessing step mainly focuses on the segmentation of protein sequences into k-mers. As there is neither a dictionary nor word boundaries in biological sequences, it is difficult to segment out k-mers with specific semantic meaning. Instead, the protein sequences are often simply segmented into fragments of fixed length. Two segmentation methods are described in the following.
Non-overlapping segmentation In the field of bioinformatics, biological sequences are often segmented into fixed-length k-mers for feature extraction [7,33,34,39]. Normally,  k is less than or equal to 5 and cannot be too large. The reason is that, as k increases, the k-mer space will increase exponentially, which leads to extremely high dimensionality and difficulties for the classification methods [13,[32][33][34].
The advantage of the k-mer segmentation method is that it is simple and convenient. Each k-mer contains not only the information of single amino acids but also the information of their surrounding context. However, the shortcoming of the k-mers segmentation method is also obvious, i.e., it only considers the k-mers with fixed length. To alleviate this limitation, here we consider the word space including all words whose length is less than or equal to k. Take the protein sequence "MASPAAERKS" as an example, when k is set to 3, the set of words of this sequence is {M, A, S, P, E, R, K, MA, SP, AA, ER, KS, MAS, PAA, ERK}.
Overlapping segmentation Using non-overlapping segmentation, a little shift on the starting site of segmentation may lead to very different segmented words. To avoid such uncertainty, the overlapping segmentation method has also been widely used. This method adopts a sliding window to segment out k-mers with a stride of 1. In this way, all the substrings of length k in the sequences are considered, which has a larger feature space than non-overlapping segmentation and may lead to redundant information. For the protein sequence "MASPAAERKS", when k is still set to 3, the word set becomes {M, A, S, P, E, R, K, MA, AS, SP, PA, AA, AE, ER, RK, KS, MAS, ASP, SPA, PAA, AAE, AER, ERK, RKS}. The overlapping k-mer segmentation method can preserve more sequence information than non-overlapping segmentation.
To fully utilize the sequence information, we adopt the overlapping segmentation in ProtPlat. We compare the model performance of using these two segmentation methods. Results are discussed in Sect. 4.4.1.

Processing of the Pfam database
We use a large-scale corpus, the Pfam database, to perform the supervised pre-training. The construction of Pfam database consists of the following steps (shown in Fig. 1

ProtPlat model design
To describe the working principle of the ProtPlat model, we define the following notations.
C : The size of the feature space, i.e., the number of k-mers used for classification. m : The dimension of the embedding representations. p : The dimension of hidden layer. n : The number of labels. V ∈ R p * m : Input weight matrix. U ∈ R n * p : Output weight matrix. The workflow of the ProtPlat model can be formulated in the following.
Let (x (1) , x (2) , . . . , x (C) )ǫR m be the input vectors. Pass them through a fully connected layer to get the embedding vectors, Get the averaged embedding vector, hǫR p Then pass a fully connected layer to generate a score vector, zǫR n Convert the score vector into the probability distribution of the label, yǫR n The main structure of ProtPlat is a three-layer neural network, as shown in Fig. 2. The input is a C × m-dimensional matrix, which consists of the vectors for words with length less than or equal to k in protein sequences. For instance, when k is set to 3, the input covers all amino acids, 2-mer and 3-mer features. The embedding vector of a k-mer is the average of the embedding vectors of the k amino acid vectors that it contains. An important mechanism in the classification model is the hierarchical softmax function, which uses a binary tree to represent all categories. Each leaf node in the tree is a category, which can effectively ensure the efficient classification of a large number of labels. Hierarchical softmax is built based on Huffman coding [40], and the label is coded, which can greatly reduce the number of prediction targets of the model. The embedding representation of protein sequences in the model is a hidden variable, which can be reused. This architecture is similar to the CBow model [16], except that the central word in this model is replaced by a sequence label.

The two-stage training procedure for downstream tasks
To apply ProtPlat to downstream tasks, we perform a two-stage training procedure.
(i) Pre-training As described in Sect. 3.1, using the protein sequences and family labels in Pfam, we train the ProtPlat model. The input vectors for single amino acids are randomly initialized and the vector for a k-mer is the average of the k amino acid vectors that it contains. The output is the Pfam family labels. After training, we save the vector h (i) of each amino acid as its embedding representation, which is used as the input vector (i.e., x (i) ) for the downstream tasks.

(ii) Fine-tuning
The fine-tuning stage has almost the same training process as the pre-training. The differences lie in the input and output, where the input is pre-trained embedding vectors, and the output is the labels for the downstream classification task.

Web server
We implement the ProtPlat platform as a web service (https:// compb io. sjtu. edu. cn/ protp lat) that is accessible to the public. The web server interface is shown in Fig. 3. The background model of the web server has been pre-trained via the family classification task base on the Pfam database. Users can upload their own training and test sets to the server. The system will fine-tune the pre-trained model by using the uploaded training data and yield prediction results for the test data. After waiting for a while, the prediction results will be displayed on the web page. Besides, users can also download the embedding vectors of amino acids pre-trained by the platform (in the Download Tab).
In many protein classification problems, the training set is too small to support the learning of good representations from input data. Since many protein classification problems share common features extracted from their amino acid sequences, the smalldata tasks can benefit a lot from the two-stage training strategy.

Experimental settings and evaluation metrics
For both the pre-training and fine-tuning, we randomly extract 20% of the training data to form the validation set. We select the best hyperparameters based on the model performance on the validation set. Table 5 shows the hyperparameter settings for the pretraining phase in ProtPlat. Note that the value of k is determined in the pre-training phase and remains unchanged in the downstream tasks. For each downstream task, the number of epochs and learning rate are tuned on its validation set.  To assess the performance of the model, we use four evaluation metrics in binary classification downstream tasks, including accuracy (ACC), F1 score, precision (Pre), and recall (Rec). They are formulated as follows.
where TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative, respectively. As for the multi-class problems, the F1 is defined as follows, where TP i , FP i , and F N i denote the numbers of true positive, false positive, and false negative for the i-th class, respectively.

Performance of the pre-training
Here we compare the models with and without pre-training. The model with pre-training uses pre-trained embedding vectors as the initial input, while the model without pre-training uses one-hot encoding vectors as the initial input and randomly initializes the input weights. We compare their performance on 9 datasets, including the T3SE dataset, three subcellular localization datasets (plants, fungi, and animals from BaCeILo [14] and the DeepLoc dataset [4]), and four signal peptide datasets (archaea, eukaryotes, Gram-positive, and Gram-negative). The results are shown in Fig. 4. As can be seen, pre-training can improve the prediction accuracy on all these datasets. The F1 value is increased by 0.03-0.08. Moreover, we perform a statistical significance analysis on the performance difference. For each dataset of the downstream tasks, we run the models with and without pre-training for 30 times, respectively. The p-values of the pair-wise t-test are listed in Additional file 1: Table S2 and S4. For all the 8 downstream tasks, the p-values are much less than 0.01, indicating that the pre-trained model is significantly superior to the model without pre-training.

Task I: identification of type III secreted effectors
Identification of type III secreted effectors is a binary classification problem, i.e., T3SE and non-T3SE. To evaluate the performance of ProtPlat, we compare it with the existing representative methods, including WEDeepT3 [13], BPBAac [17], Effec-tiveT3 [18], T3_MM [19], DeepT3 [20], Bastion3 [21] and BEAN 2.0 [22]. The results of ProtPlat obtained on the test set of WEDeepT3 are shown in Table 6. As can be seen, ProtPlat has achieved the best performance. Compared with the second-best model WEDeepT3, the F1 value of the pre-trained ProtPlat model has increased by 0.128, and the total accuracy has increased by 0.021, which confirms the classification performance of the pre-training platform. Since F1 score is threshold-dependent, we compute the F-max metric (shown in Additional file 1: Table S6). The F1 scores under different thresholds are shown in Additional file 1: Figure S1. The accuracy and F1 scores of the baseline methods are extracted from [13]. All the methods are evaluated on the same test set.

Task II: prediction of protein subcellular location
For protein subcellular localization, we use the dataset in BaCeILo [14] and compare with Euk-mPLoc [23], LOCTree [24], BaCeILo [14], and YLoc [25]. The prediction performance is evaluated by accuracy and F1 score. The F-max metric and F1 scores under different thresholds are shown in Additional file 1: Table S6 and Figure S1. For all the three datasets (plants, fungi, and animals), ProtPlat achieves competitive performance. Especially on the Fungi dataset, ProtPlat outperforms other models by a large margin (both F1 and accuracy are increased by over 10%) ( Table 7), indicating that small datasets may benefit more from the pre-training. Note that the training sets of the baseline models are different [25], and many of the baseline models are more general predictors, which can predict more than 4 locations, like YLoc-HighRes, YLoc + , MultiLoc2-High-Res, WoLF PSORT, Euk-mPLoc, and LOCTree. Thus, they may perform worse than the predictors specifically trained for these four locations. The accuracy and F1 scores of the baseline methods are extracted from YLoc [25].
It is worth noting that almost all the baseline methods utilize information from multiple sources as input features, including some kinds of domain knowledge, such as protein functional domain and Gene Ontology. By contrast, ProtPlat uses sequence information and the protein family labels in Pfam, which are general information for proteins and not specific to prediction tasks, while it can obtain comparable or even better results than the baseline methods.
The comparison results suggest the powerful learning ability of the pre-training platform, which would be very useful when domain knowledge is scarce.
We also compare ProtPlat with the state-of-the-art methods on the DeepLoc dataset. The accuracy of ProtPlat is much lower than DeepLoc (results shown in Additional file 1: Table S3). The reason is mainly due to the imbalanced distribution of the DeepLoc dataset. As described in Sect. 2 (Task II), there are 10 classes in this dataset, and the largest class has 4043 samples while the smallest one has only 154 samples (shown in Additional file 1: Table S1). The DeepLoc adopts a cost matrix-based method to mitigate the effect of class imbalance, while there is no specific operation in our model for dealing with this issue. The other two methods, LocTree2 [44] and YLoc [25], also perform better than ProtPlat, as both utilize some domain knowledge. Besides the sequence features, i.e., amino acid composition and pseudo composition, YLoc also uses the PROSITE motifs and GO terms from the homologs of the query protein. LocTree2's input includes sequences as well as their profiles searched via PSI-BLAST [45], and it incorporates similarities among subcellular locations in the design of SVM classifiers.

Task III: signal peptide prediction
For the recognition of signal peptides, we perform binary classification and compare ProtPlat with 16 baseline methods mentioned in SignalP 5.0 [15]. The prediction performance is evaluated by precision, recall, and F1. Results are shown in Table 8. The F-max metric and F1 scores under different thresholds are shown in Additional file 1: Table S6 and Figure S1. For the Archaea and Gram-negative datasets, ProtPlat has the highest F1 scores. In general, the performance of ProtPlat is comparable to SignalP 5.0, which uses hand-crafted features and specific architecture for recognizing signals, and higher than all the other 15 baselines. For Eukaryotes dataset, the precision and recall values of ProtPlat are closer to SignalP 5.0 and higher than those of other baselines. For Gramnegative dataset, the result of ProtPlat is close to that of SignalP 5.0, where the precision value is significantly higher than other baselines and the recall value is also higher. For the Gram-positive dataset, ProtPlat achieves comprehensively better performance, and the precision is significantly higher than other baselines.

Comparison of the two segmentation methods
We experiment with both the non-overlapping and overlapping segmentation methods. The results on the three downstream tasks are shown in Table 9. The model settings are  show that when using the overlapping segmentation, the embedding vectors pre-trained by the model lead to a better F1 score for all the tasks.

Investigation on the value of k
In ProtPlat, the value of k is set to 3, which is determined in the pre-training phase, i.e., based on the performance on the protein family classification task. To investigate whether it is a good choice for the downstream tasks, here we assess the model performance under different values of k. Figure 5 shows the F1 scores for the downstream tasks when k is set to 1, 2, 3, 4, and 5, respectively, using the overlapping segmentation for protein sequences. The results show that the best performance is achieved when k is set to 3 for all the tasks, suggesting that the protein sequence-based classification tasks share sequence features and the pre-training can transfer knowledge to other tasks. When k is set to 1, each amino acid is treated independently and contextual information (i.e., local sequence information) is not included, thus the performance is not good. When k is equal to or greater than 5, the accuracy drops because the k-mer space has extremely

Comparison with other pre-trained protein representations
We compare ProtPlat with two state-of-the-art protein pre-training models, i.e., SeqVec [26] and ProtTrans [27], on all the downstream task datasets. The results are shown in Table 10. Besides, we perform a statistical significance analysis for the performance different of three pre-training methods shown in Additional file 1: Table S5. We use the 1024-dimensional pre-trained embedding vectors taken from SeqVec and the ProtAlbert model in ProtTrans as the input of ProtPlat. As can be seen, ProtPlat achieves the best performance on two relatively small datasets, especially the Archaea dataset, which contains only 55 training samples. The reason is that the embedding vectors yielded by SeqVec and ProtTrans have a high dimensionality (1024), which may result in the overfitting issue, while our model only uses 100-dimensional embedding vectors.
Although ProtPlat seems to have little advantage in the comparison with SeqVec and ProtTrans regarding the prediction accuracy, ProtPlat is a lightweight, cost-effective, and highly efficient model. Different from SeqVec and ProtTrans, both of which are based on language modeling to perform large-scale unsupervised pretraining, ProtPlat adopts supervised learning, i.e., the protein family classification in Pfam, as the pretraining task. For ELMo-based SeqVec, it was trained for three weeks on 5 Nvidia Titan GPUs with 12 GB memory each. As mentioned in Background, ProtTrans uses various transformer models, which were trained on a supercomputer with 936 nodes (total 5616 GPUs) and one TPU Pod (V3-512 or V3-1024). By contrast, ProtPlat takes only several CPU hours for pre-training. Therefore, it could also serve as an alternative pre-training model for protein-related prediction tasks, especially when the downstream task has very limited training samples.

Discussion
This study proposes a pre-training platform to address the contradiction between a large number of protein sequences and the small scale of training data for various protein classification problems. The advantages of ProtPlat are mainly two folds. (i) Fast and lightweight. ProtPlat does not use evolutionary information in the process of training, but only uses sequence information for prediction. It is a simple model with a few training parameters and fast training speed. We can train ProtPlat for classifying half a million sentences among 312 K classes in less than a minute using a standard CPU without GPU support. (ii) Suitable for small datasets. ProtPlat model has especially good performance on small data sets. Since small data is usually insufficient for training, the pre-training procedure that provides a good initial model has a more obvious effect. Our experimental results also show that the classification of small datasets benefits more from the platform.
A limitation of this study is that we only consider the protein-level classification tasks, while a lot of prediction tasks are at the residue-level, such as secondary structure prediction and residue contact map prediction. Although the word embeddings learned in the system represent single amino acids or k-mers, the residue-level prediction has not been supported in the current version yet. One of our future works is to incorporate the residuelevel prediction function into our platform and make it more general for protein-related computation tasks.