Figure 6 | BMC Bioinformatics

Figure 6

From: Automatic discovery of cross-family sequence features associated with protein function

Function-related sequence features. The sequence_classifier subroutines of fixed-target predictors contain one or more evolved regular expressions, each of which may influence the classifier positively or negatively. As described in the Methods, the direction of this influence can be determined with an approximation method. The positively influencing regular expressions are matched against test set sequences (cuts 1 to 4 of the data individually, or pooled together, indicated with "All" in the figure). The 500 most-matched residues or sequence fragments are then analysed manually for recurrent patterns. In panel A, we summarise the sequence features that are important for predictors of the functions "nuclear", "transcription" and "DNA". As expected, sequence features containing multiple lysine and arginine residues are an important signal in nuclear proteins (the pattern [KR]{3} is found in approximately 15% of the top 500 positively influencing residues for "nuclear" predictors). Other signals thought to be involved in protein-protein interactions in the nucleus are also identified by this analysis: repeated acidic residues and polyglutamine. The polyglutamine feature, particularly polyglutamine flanked by at least one of the residues D/R/H/A/N/K/S/L/E/P/T, is a stronger signal for "transcription" predictors. In panel B, the same analysis is performed for predictors of "cytoplasmic", "biosynthesis" and "catalyzes". In this case only single-residue "features" are apparent from the data. For instance, aromatic residues are more important for predictors of "biosynthesis" and "catalyzes" than for "cytoplasmic" (green bars).
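The matching-and-tallying step described in the caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`most_matched_fragments`, `fraction_matching`), the toy sequences and the second example pattern are assumptions made for illustration only, and the approximation method that decides which expressions are positively influencing is not reproduced here.

```python
import re
from collections import Counter


def most_matched_fragments(patterns, sequences, top_n=500):
    """Tally every fragment matched by any pattern and return the top_n.

    `patterns` stands in for the positively influencing regular
    expressions; `sequences` stands in for the test set sequences.
    Both are hypothetical inputs for this sketch.
    """
    counts = Counter()
    for seq in sequences:
        for pat in patterns:
            # finditer yields non-overlapping matches, left to right
            for match in re.finditer(pat, seq):
                counts[match.group(0)] += 1
    return counts.most_common(top_n)


def fraction_matching(fragments, pattern):
    """Fraction of tallied matches whose fragment contains `pattern`."""
    total = sum(n for _, n in fragments)
    if total == 0:
        return 0.0
    hits = sum(n for frag, n in fragments if re.search(pattern, frag))
    return hits / total


# Toy data (invented for illustration, not taken from the paper)
toy_sequences = ["MKKRAQQQQDE", "PKRKAAQQQ", "MSTAVL"]
toy_patterns = [r"[KR]{3}", r"Q{3,}"]

top = most_matched_fragments(toy_patterns, toy_sequences)
basic_fraction = fraction_matching(top, r"[KR]{3}")
```

Here `basic_fraction` plays the role of the "approximately 15%" figure quoted for [KR]{3}, computed over the toy data rather than over real predictor output.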