SCMPSP: Prediction and characterization of photosynthetic proteins based on a scoring card method
- Tamara Vasylenko†1,
- Yi-Fan Liou†1,
- Hong-An Chen1,
- Phasit Charoenkwan1,
- Hui-Ling Huang1, 2 and
- Shinn-Ying Ho1, 2
© Vasylenko et al.; licensee BioMed Central Ltd. 2015
Published: 21 January 2015
Photosynthetic proteins (PSPs) greatly differ in their structure and function as they are involved in numerous subprocesses that take place inside an organelle called a chloroplast. Few studies predict PSPs from sequences due to their high variety of sequences and structures. This work aims to predict and characterize PSPs by establishing the datasets of PSP and non-PSP sequences and developing prediction methods.
A novel bioinformatics method of predicting and characterizing PSPs based on scoring card method (SCMPSP) was used. First, a dataset consisting of 649 PSPs was established by using a Gene Ontology term GO:0015979 and 649 non-PSPs from the SwissProt database with sequence identity <= 25%. Several prediction methods are presented based on support vector machine (SVM), decision tree J48, Bayes, BLAST, and SCM. The SVM method using dipeptide features performed well and yielded a test accuracy of 72.31%. The SCMPSP method uses the estimated propensity scores of 400 dipeptides as PSPs and has a test accuracy of 71.54%, which is comparable to that of the SVM method. The derived propensity scores of 20 amino acids were further used to identify informative physicochemical properties for characterizing PSPs. The analytical results reveal the following four characteristics of PSPs: 1) PSPs favour hydrophobic side chain amino acids; 2) PSPs are composed of the amino acids prone to form helices in membrane environments; 3) PSPs have low interaction with water; and 4) PSPs prefer to be composed of the amino acids of electron-reactive side chains.
The SCMPSP method not only estimates the propensity of a sequence to be a PSP but also discovers characteristics that further improve understanding of PSPs. The SCMPSP source code and the datasets used in this study are available at http://iclab.life.nctu.edu.tw/SCMPSP/.
The photosynthetic conversion of sunlight energy into chemical energy is among the most important biochemical processes on earth. Photosynthetic proteins (PSPs) from plants, algae and photosynthetic bacteria greatly differ in their structure and function as they are involved in many subprocesses, including solar energy harvesting, diffusive transport, energy conversion, electron and ion transport reactions from water to NADP+, ATP generation, and a series of enzymatic reactions in the stroma of the chloroplast . The PSPs are localized in special organelles, called chloroplasts, which have many inner compartments. Although most PSPs are embedded in thylakoid membranes, others are found in the thylakoid lumenal space and in the soluble stroma of chloroplasts. The stromal compartment mostly contains the components of the Calvin cycle, which are needed for fixation of carbon dioxide. The thylakoid membrane contains four protein complexes: photosystem (PS) I, PSII, cytochrome (Cyt) b6f, and adenosine triphosphate (ATP) synthase, which carry out the light reactions of photosynthesis [2, 3]. The Arabidopsis thaliana (A. thaliana) genome sequencing project and subsequent proteomics studies have revealed that the thylakoid membrane and thylakoid lumen still contain a number of proteins with unknown functions [4, 5]. These proteins may have roles as yet unknown subunits of the photosynthetic complexes and may also be auxiliary proteins guiding the biogenesis, maintenance, and regulated breakdown of the photosynthetic complexes.
Considerable effort is needed to identify novel PSPs using laboratory techniques. Several recent studies have comprehensively analyzed the pea, spinach and A. thaliana chloroplast proteome using various fractionation and mass spectrometry methods. Kieselbach et al.  isolated and characterized the luminal fraction of spinach thylakoids by thylakoid membrane removal, Yeda press fragmentation and centrifugation. Their studies contributed to the discovery of the extrinsic proteins PsbO, PsbP, and PsbQ that are thought to stabilize PSII. In Kleffmann et al. , tandem mass spectrometry shotgun proteomics was used to develop a comprehensive map of all metabolic and regulatory pathways in A. thaliana chloroplasts, which enabled identification of 687 PSPs. Schubert et al.  studied the chloroplast lumen of A. thaliana and used two-dimensional SDS-PAGE, mass spectrometry, and microsequencing techniques for protein separation. Peltier et al.  identified thylakoid proteome from pea and A. thaliana by using gel electrophoresis, mass spectrometry and Edman degradation sequencing. They also presented the results of a stromal proteome analysis of A. thaliana in an attempt to quantify proteins of the Calvin cycle .
Because of the complexity of chloroplasts and the wide taxonomic distribution among photosynthetic organisms, using only experimental techniques is prohibitively time-consuming and labour-intensive. Therefore, bioinformatics methods have become powerful tools for photosynthetic research. Ishikava et al. performed a pilot study that combined bioinformatic and experimental approaches to identify nuclear-encoded chloroplast proteins of endosymbiontic origin. Most proteins in chloroplasts are encoded by the nucleus and require N-terminal presequences (cTPs) to be imported into the organelle. Nakai et al. were the first to report protein cTPs in eukaryotic cells. Emanuelsson et al. [11, 12] proposed neural network-based localization predictors for discriminating cTPs (ChloroP) and for assigning a cleavage site prediction capability (TargetP) to chloroplast, mitochondrion, ER/golgi/secreted, and other localizations. However, not all plastid proteins can be predicted by the localization predictors because several known plastid proteins apparently have no obvious cTPs and because outer envelope proteins of chloroplasts do not have a cleavable cTP. Recent studies have proposed the use of a classifier based on support vector machine (SVM) to identify the four plastid types, including chloroplasts, by utilizing sequence features such as amino acid composition, dipeptide composition, and the physicochemical properties of amino acids. However, photosynthetic bacteria have no chloroplasts; their photosynthetic proteins are directly embedded in the plasma membrane. To date, Ashkenazi et al. have proposed the only method of identifying PSPs, which relies on homology matching (BLAST, PSI-BLAST, HSSP, and Pfam). They concluded that, since the false positive rate based on overall sequence similarity is rather high (~70%), short motif-based approaches can reveal functional similarities more accurately.
Therefore, an effective predictor for discriminating between PSPs and non-PSPs from sequences is needed to discover new PSPs for industrial photosynthesis.
Considering the large number of subprocesses in which they participate, PSPs clearly have widely varying applications. The PSII complex from plants, algae, and cyanobacteria, bacterial reaction centers, and bacteriorhodopsin from halobacteria have the potential to provide the core for numerous innovative devices. PSII-based biosensors are being developed to replace more complex laboratory analyses used to detect photosynthetic herbicides. Engineered bacterial reaction centers, as well as the isolated components of PSI, are used in photovoltaic cells to promote the conversion of visible light energy into electrical or chemical energy. Another potential application of industrial photosynthesis is producing biodiesel fuel from engineered cyanobacterial organisms. This work had three objectives: 1) developing an effective prediction method for identifying novel PSPs, 2) estimating propensity scores of dipeptides and amino acids to be PSPs for mutagenesis studies, and 3) characterizing PSPs that have potential applications.
Since no dataset of PSPs and non-PSPs is currently available for developing bioinformatics methods that use machine learning, this work first establishes a dataset (PSPGO) consisting of 649 PSPs extracted by using a Gene Ontology term GO:0015979 and 649 non-PSPs from the SwissProt database with sequence identity <= 25%. The proposed SCMPSP prediction method uses the estimated propensity scores of 400 dipeptides as PSPs based on a scoring card method (SCM) [18, 19]. The propensity scores of 20 amino acids derived from those of the 400 dipeptides are then used to discover informative physicochemical properties for characterizing PSPs. To investigate potential prediction methods, several typical prediction methods based on SVM, decision tree J48, and Bayes classifiers with some commonly used sequence features are also implemented. For comparisons with existing prediction methods, the BLAST method is also implemented. Comparisons of the mean prediction accuracies of all presented prediction methods suggest that the proposed PSPGO dataset provides a higher prediction accuracy than the dataset containing sequences not reviewed in UniProtKB.
To characterize the PSPs, the propensity scores of 20 amino acids were correlated with the physicochemical properties of amino acids in the AAIndex database . Physicochemical properties with high correlation coefficients (R values) can be used to study PSPs. However, the reported properties of amino acids for specific protein functions can also be used. The findings of PSP characteristics in this work are as follows: 1) PSPs favour hydrophobic side chain amino acids; 2) PSPs are composed of the amino acids prone to form helices in membrane environments; 3) PSPs have low interaction with water; and 4) PSPs prefer to be composed of the amino acids of electron-reactive side chains.
Materials and Methods
Table 1. Summary of the three datasets consisting of training and test data (including sequence identity, %).
ORI and ORIRW
Ashkenazi et al. established the PSP dataset to assess the false positive rate of commonly used homology-based function prediction methods. We adopted this dataset as the positive sets of the ORI and ORIRW datasets to ensure a reliable performance comparison. However, almost one third of the sequences from the previously reported dataset were marked as 'unreviewed' in SwissProt. Therefore, the selected PSPs of the ORI dataset included all available PSPs provided by the previous study (excluding obsolete entries), whereas the positive part of the ORIRW dataset contained only 'reviewed' PSPs. Putative non-PSPs for both the ORI and ORIRW datasets were extracted from SwissProt and comprised all proteins originating from the same organisms as the PSPs, excluding the positive sequences. The ORI dataset has 1236 positives and 1236 negatives selected randomly from 10692 putative non-PSPs with 50% sequence identity. The ORIRW dataset has 733 reviewed positives and 733 negatives selected randomly from 7048 putative non-PSPs with 25% sequence identity. Both ORI and ORIRW were divided into training (ORI-TRN, ORIRW-TRN) and test (ORI-TST, ORIRW-TST) subsets.
Because several entries had become obsolete (of the 1425 entries stated in the paper, only 1409 were found in SwissProt to date) and because the dataset collected from the previous paper contained 'unreviewed' sequences (452 of the 1409 entries), we established a new dataset, PSPGO, by using GO terms to collect unique PSP sequences, owing to the high quality of their functional annotations.
The PSPGO dataset, which was used to design the SCMPSP method, was established by collecting sequences from the SwissProt database using the GO term GO:0015979 (Photosynthesis) and its child terms that are not child terms of other processes. The Ancestor chart of the Quick GO browser was used to search relationships among the GO terms. In total, 30 GO terms were used. Sequences that were not annotated by these GO terms were considered putative non-PSPs. The sequence identity of any pair of sequences was reduced to 25% by using USEARCH. Table 1 shows how the 649 positives and 649 randomly chosen negatives were divided into training PSPGO-TRN and test PSPGO-TST subsets.
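The balanced sampling and splitting described above can be illustrated with a minimal sketch. The function name `build_balanced_split`, the placeholder identifiers, the pool of 5000 putative negatives, and the random seed are all assumptions made for illustration; only the counts of 649 sequences per class and 519 training sequences per class come from the text.

```python
import random

def build_balanced_split(positives, putative_negatives, n_train, seed=0):
    """Draw one random negative per positive, then split both classes."""
    rng = random.Random(seed)
    # Sample as many negatives as there are positives, without replacement.
    negatives = rng.sample(putative_negatives, len(positives))
    pos = positives[:]
    rng.shuffle(pos)
    # Label 1 marks a PSP, label 0 a non-PSP.
    train = [(s, 1) for s in pos[:n_train]] + [(s, 0) for s in negatives[:n_train]]
    test = [(s, 1) for s in pos[n_train:]] + [(s, 0) for s in negatives[n_train:]]
    return train, test

# Placeholder identifiers standing in for real sequences (hypothetical).
positives = [f"PSP_{i}" for i in range(649)]
putative_negatives = [f"NEG_{i}" for i in range(5000)]

train, test = build_balanced_split(positives, putative_negatives, n_train=519)
print(len(train), len(test))
```

With 519 training sequences per class, the remaining 130 sequences per class form the test subset.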
Typical classification methods
A literature review shows that few effective methods or tools for predicting PSPs from sequences have been proposed. To develop an accurate predictor, this study compared the performance of typical classification methods such as those based on SVM, decision tree J48, and Bayes classifiers with a single type of sequence features. The SVM is generally considered an accurate classifier for predicting proteins with a specific function. Generally, the predictive performance of an SVM with effective features is considered the gold standard for evaluating predictors. Radial basis SVM classifiers are implemented using the LIBSVM package. The SVM parameters are determined by using a grid search method to maximize 10-fold cross validation (10-CV) accuracy in a training dataset. Some commonly used features such as amino acid composition (AAC), dipeptide composition (DPC), and 531 PCPs in the AAindex are evaluated in the design of predictors.
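As an illustration of the AAC and DPC features mentioned above, the following sketch computes both from a raw sequence. The function names `aac` and `dpc`, the alphabetical ordering of residues, and the choice to skip non-standard residues are assumptions made here, not details taken from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def aac(seq):
    """Amino acid composition: 20 fractions that sum to 1."""
    residues = [c for c in seq.upper() if c in AMINO_ACIDS]
    n = len(residues)
    return [residues.count(a) / n if n else 0.0 for a in AMINO_ACIDS]

def dpc(seq):
    """Dipeptide composition: 400 fractions over overlapping residue pairs."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for a, b in zip(seq, seq[1:]):
        # Pairs containing a non-standard residue are skipped (assumption).
        if a + b in counts:
            counts[a + b] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]

example = "AAAC"  # toy sequence, not a real protein
print(len(aac(example)), len(dpc(example)))
```

Either vector can then be passed to a classifier such as an SVM, a decision tree, or a Naïve Bayes model.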
The J48 and Bayes classifiers are implemented using WEKA . The J48 is a decision tree classifier generated by the C4.5 algorithm developed by Quinlan . The Naïve Bayes classifier is a statistical classifier that can predict the probability of class membership under the assumption of mutually independent features . For comparison with the existing method , BLASTP is used to evaluate the performance of sequence alignment method.
Scoring card method
The SCM is a new method for predicting proteins with a particular function and for characterizing proteins according to their primary sequences. Huang et al. developed the SCM-based methods [18, 26]. Unlike complex classification mechanisms such as SVM, which are not easily understood by biologists, SCM estimates the propensities of amino acids and dipeptides for the function of interest and thus offers a simple and easily interpretable method of prediction and analysis. In terms of prediction accuracy, SCM is slightly worse than, or comparable to, SVM when they are used with dipeptide features [18, 26]. The advantages of the SCM method are threefold. First, the classification mechanism of SCM adopts a weighted sum of composition and propensity scores of dipeptides to score the protein sequence. Compared with the hyperplane of the SVM, SCM classifies proteins using a threshold value, which is easily understood and manipulated by biologists. Second, the propensity scores of dipeptides and amino acids can be used to identify the PCPs that provide information about a global property of general proteins in a further analysis of characteristics of the proteins. Third, the SCM is a general-purpose method of identifying protein sequences with a particular function. The proposed SCMPSP method is based on the SCM method using the training dataset of PSPGO-TRN. For a clear understanding, the SCM and the SCMPSP algorithm are described below.
The SCM-based predictors are designed in five main steps: 1) preparing both positive and negative sequences in a training dataset as inputs (519 PSPs and 519 non-PSPs in PSPGO-TRN); 2) using a simple statistical method to generate an initial scoring card with 400 propensity scores of dipeptides; 3) obtaining propensity scores of 20 amino acids from those of 400 dipeptides; 4) using a global optimization method to optimize the initial scoring card; and 5) generating a binary SCM classifier with a threshold value as an output of the procedure. Further details of the SCM method and its applications can be found in [18, 26]. The SCMPSP is as follows.
Step 1: Prepare a training dataset PSPGO-TRN comprising 519 PSPs and 519 non-PSPs.
Step 2: Generate an initial scoring card consisting of 400 propensity scores of dipeptides, which are obtained by subtracting the dipeptide composition of dipeptides in non-PSPs from those in PSPs. Then, normalize all dipeptide scores into the range [0, 1000].
Step 3: Calculate the propensity score of each amino acid X by averaging 40 propensity scores of dipeptides that contain X.
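Steps 1 to 3, together with the threshold-based scoring described earlier, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the global optimization of the scoring card is omitted, the toy sequences and the midpoint threshold are invented for the example, and the homodipeptide XX is assumed to be counted twice so that each amino acid averages over 40 scores.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def dipeptide_composition(seq):
    """Fraction of each of the 400 dipeptides among overlapping pairs."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for a, b in zip(seq, seq[1:]):
        if a + b in counts:
            counts[a + b] += 1
            total += 1
    return {d: (counts[d] / total if total else 0.0) for d in DIPEPTIDES}

def mean_composition(seqs):
    comps = [dipeptide_composition(s) for s in seqs]
    return {d: sum(c[d] for c in comps) / len(comps) for d in DIPEPTIDES}

def initial_scoring_card(positives, negatives):
    """Step 2: composition difference, min-max normalized to [0, 1000]."""
    pos, neg = mean_composition(positives), mean_composition(negatives)
    diff = {d: pos[d] - neg[d] for d in DIPEPTIDES}
    lo, hi = min(diff.values()), max(diff.values())
    span = (hi - lo) or 1.0
    return {d: 1000.0 * (diff[d] - lo) / span for d in DIPEPTIDES}

def amino_acid_propensity(card):
    """Step 3: average the 40 scores of dipeptides XY and YX for each X."""
    return {x: sum(card[x + y] + card[y + x] for y in AMINO_ACIDS) / 40.0
            for x in AMINO_ACIDS}

def sequence_score(seq, card):
    """Weighted sum of dipeptide composition and propensity scores."""
    comp = dipeptide_composition(seq)
    return sum(comp[d] * card[d] for d in DIPEPTIDES)

# Toy training data (hypothetical sequences, not from PSPGO-TRN).
toy_psps = ["ALFALFIL", "LLAFIYAL"]
toy_non_psps = ["KDEQRKDE", "QEKRDDEQ"]

card = initial_scoring_card(toy_psps, toy_non_psps)
propensity = amino_acid_propensity(card)

# Illustrative threshold: midpoint between the mean class scores.
pos_mean = sum(sequence_score(s, card) for s in toy_psps) / len(toy_psps)
neg_mean = sum(sequence_score(s, card) for s in toy_non_psps) / len(toy_non_psps)
threshold = (pos_mean + neg_mean) / 2.0

print(sequence_score("ALFIL", card) > threshold)
```

Because the classifier reduces to one score and one threshold, each prediction can be traced back to the dipeptides that raised or lowered the score.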
Informative physicochemical properties
Physicochemical properties (PCPs) of amino acids have been shown to have meaningful features that can be used for predicting and analysing the functions of proteins in primary sequences. The SCM-PCP method is used to identify the most informative of the 531 PCPs rearranged from the AAIndex, which currently contains 544 PCPs of amino acids, so that PSPs can be characterized. Each PCP consists of an accession number, a simple description of the index, the reference information, and the numerical values for the properties of 20 amino acids. The propensity scores of 20 amino acids can be derived from the propensity scores of 400 dipeptides.
The SCM-PCP method is performed in two main steps. First, calculate the Pearson correlation coefficient (R value) between the propensity scores and the numerical values of the 20 amino acids for each of the 531 PCPs. A property is a candidate for PSP function analysis when the absolute value of R is larger than 0.5. Second, use the known PSP function to identify informative properties from existing studies that are not included in AAIndex. The composition of amino acids in PSPs and non-PSPs can also be used to infer the properties of PSPs.
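The first step can be sketched as a Pearson correlation filter with the |R| > 0.5 cut-off. The helper names and all numbers below are hypothetical: the toy propensity scores cover only four residues, and the `TOY_*` properties are invented stand-ins rather than real AAindex entries.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def informative_properties(propensity, properties, cutoff=0.5):
    """Return (accession, R) pairs with |R| > cutoff, strongest first."""
    residues = sorted(propensity)
    xs = [propensity[a] for a in residues]
    hits = []
    for accession, values in properties.items():
        r = pearson_r(xs, [values[a] for a in residues])
        if abs(r) > cutoff:
            hits.append((accession, r))
    return sorted(hits, key=lambda t: abs(t[1]), reverse=True)

# Hypothetical propensity scores for four residues.
toy_propensity = {"A": 900.0, "F": 800.0, "K": 200.0, "D": 100.0}

# Invented property tables, not real AAindex values.
toy_properties = {
    "TOY_UP": {"A": 4.0, "F": 3.0, "K": 2.0, "D": 1.0},
    "TOY_DOWN": {"A": -4.0, "F": -3.0, "K": -2.0, "D": -1.0},
    "TOY_FLAT": {"A": 1.0, "F": -1.0, "K": -1.0, "D": 1.0},
}

for accession, r in informative_properties(toy_propensity, toy_properties):
    print(accession, round(r, 4))
```

Both strongly positive and strongly negative correlations pass the filter, which is why the absolute value of R is used.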
Results and discussion
Propensity Scores of PSPs
The PE dipeptide has a rather high score of 932. In Mori et al., multiple-sequence alignment of PSPs from plants and bacteria showed that the PE motif is strictly conserved in the transmembrane region of proteins that are translocated into the thylakoid membrane via the ΔpH-dependent pathway. The PE motif with Glu included in the α-helices of a transmembrane region apparently stabilizes the protein structures by forming a stable interaction with Lys, His or Gln. Even in proteins such as Tha4 and TatA/E, which have different functions, this motif is still conserved. The GP motif, which has a fairly high propensity score of 743, was also completely conserved between the membrane-spanning helix and amphipathic helix of ΔpH-dependent protein precursors. The twin arginine motif RR was a distinguishing feature in protein precursors translocated into the thylakoid membranes via the ΔpH-pathway. The three dipeptides PE, GP, and RR have important roles in the ΔpH-pathway.
The propensity scores and composition (%) of amino acids. Columns: composition in PSPs, A (%); composition in non-PSPs, B (%); composition difference, A-B (%).
Second, PSPs must build hydrophobic environments for binding with coenzymes and cofactors such as heme and chlorophyll. For example, most electron transfer reactions during photosynthesis result from protein binding with chlorophyll, which is a porphyrin derivative. The PSPs with chlorophyll as the cofactor form transmembrane α-helical structures, which favour Ala as the reaction center (for a review, see ). In this reaction center, the top-ranked residues in terms of composition are Leu (15%), Ala (14%), Phe (12%) and Ile (10%), all of which are hydrophobic amino acids.
The hydrophobic core of the light harvesting polypeptide.
The lowest score for these five hydrophobic cores is 518.91, which is much higher than the threshold 441.29 of the PSP classifier. In this LH polypeptide, Leu has the highest average composition (20.23%) followed by Ala (14.50%). The hydrophobic cores revealed no extreme polar residues such as Asp, Asn, Glu and Gln.
Performance comparisons of PSP predictors
Three datasets were used to design various PSP classifiers and to compare prediction performance in all classification methods with various feature types. The proposed SCMPSP method was compared with SVM, decision tree J48, and Bayes classifiers. For each classifier, we evaluated three kinds of features: 1) amino acid composition (AAC), 2) dipeptide composition (DPC), and 3) 531 PCPs in AAindex.
Performance of established datasets as compared for various E-value cut-offs by BLASTP
Comparison of the prediction accuracies (%) of PSP predictors.
The experimental results can be briefly summarized as follows. The SCMPSP, SVM-based, J48-based, and Bayes-based methods outperform BLASTP in predicting PSPs. The PSPGO datasets are more suitable for developing methods of predicting PSPs than the ORI and ORIRW datasets. For predicting PSPs, SVM-based methods are more effective than J48- and Bayes-based methods. Bayes-based methods do not perform well in predicting PSPs. The performance of the SCMPSP method is comparable to that of the SVM-DPC method, which outperforms J48-DPC and Bayes-DPC in the PSPGO-TRN and PSPGO-TST datasets.
Performance evaluation of SCMPSP
Ten independent runs of the SCMPSP on PSPGO-TRN (columns include train accuracy, %).
The SCMPSP method achieves a test accuracy of 71.54%, an MCC of 0.43, a sensitivity of 0.72, and a specificity of 0.72 in PSPGO-TST. The SVM-DPC method using SVM with DPC features achieves a test accuracy of 72.31%, an MCC of 0.45, a sensitivity of 0.75, and a specificity of 0.70. These experimental results indicate that SCMPSP is comparable to SVM-DPC in terms of predicting PSPs.
Propensity analysis using informative PCPs
The amino acid scores derived from SCMPSP and physicochemical properties selected by SCM-PCPs. Columns: BLAS910101 score (rank); WOLR810101 score (rank); PUNT030101 score (rank).
A. PSPs favour hydrophobic side chain amino acids
The BLAS910101 property, which can be described as "Scaled side chain hydrophobicity values", had the highest positive correlation (R = 0.7955). Estimation of hydrophobicity profiles has proven to be a powerful approach to protein sequence analysis. Many scales have been developed to quantify the hydrophobic properties of the standard 20 amino acids. The values of the BLAS910101 property were calculated by using a modification of the 'hydrophobic fragmental constant' approach developed by Rekker, i.e., only the side chains of the post- or cotranslationally modified amino acyl residues were considered and not the peptide backbone.
The high positive correlation indicates that PSPs favour hydrophobic amino acids. Table 7 shows that the top five amino acids are Ala, Phe, Tyr, Ile and Leu. According to the property BLAS910101, four of these residues (Phe, Leu, Ile and Tyr) possessed the highest side chain hydrophobicity values. However, the Ala residue had a rank of 10.
As mentioned in the primary analysis of the propensity scores, the overall hydrophobicity of the PSPs is high. In SCM-PCPs, however, the hydrophobic characteristics of the side chains do not include the backbone hydrophobic characteristics because BLAS910101 does not include the backbone hydrophobicity. Based on these experimental results, we postulate that side chains play an important role in reducing the folding free energy and in constructing the hydrophobic environments for cofactor binding.
B. PSPs are composed of the amino acids prone to form helices in membrane environments
The PUNT030101 property, described as "Knowledge-based membrane-propensity scale from 1D_Helix in MPtopo databases", showed the highest negative correlation (R = -0.7948). Of the many hydropathy scales currently available for protein analysis, one of the most widely used is the knowledge-based amino acid membrane-propensity scale developed by Punta et al., which exhibits strong correlation with several representative hydropathy scales and can be applied to different prediction tasks. This scale is derived using a set of transmembrane helix segments from MPtopo databases, with the requirement that each component of the set must have a free energy lower than that of a typical soluble protein sequence of the same length. Punta et al. attempted to use this index to solve two problems: predicting the soluble/membrane proteins and predicting α-helical transmembrane/signal segments. Hence, PUNT030101 has two characteristics: the propensity to form a membrane and the propensity to form an α-helical structure in the membrane.
High correlation with PUNT030101 suggested that PSPs tend to be composed of amino acids with high membrane propensity. Since negative values for PUNT030101 indicate high membrane propensity, a strong inverse correlation between SCMPSP and PUNT030101 scores implies a high membrane propensity of the photosynthetic proteins. Table 7 shows that the amino acids with the 5 highest propensity scores correspond to the bottom- and middle-scored Phe, Leu, Ile, Ala and Tyr residues at ranks 20, 19, 18, 15 and 13, respectively. In contrast, those with the lowest scores were Gln, Glu, Arg, Asp and Lys, with PUNT030101 ranks of 4, 6, 2, 1 and 3, respectively. Notably, only Lys is a hydrophobic amino acid. A high hydrophobicity is a characteristic feature of amino acids belonging to transmembrane regions, which is in agreement with the previously described BLAS910101 property correlation results . Figure 3 shows that chlorophyll-protein complexes, which catalyze the light reactions of the photosynthesis, are known to be embedded in the thylakoid membranes of chloroplasts. The photosynthetic thylakoid membrane encloses a single lumen space and differentiates into cylindrical stacked grana and interconnecting single membrane regions called the stroma lamellae. The four membrane-associated components of the photosynthetic apparatus include PSI and ATPase, which are located in the stroma lamellae of thylakoids, PSII, which resides mainly in the grana membranes, and the cytochrome b6/f complex, which is almost evenly distributed between the two membrane types . In each case, the overall organization and the number of transmembrane regions differ. Remarkably, both PSI and ATPase complexes have bulky stromal-exposed parts whereas the PSII core and cytochrome b6/f complexes protrude from the lumenal side . These results indicate the important role of the membrane in photosynthesis, so some PSPs must have functions in the membrane environment.
The high correlation with PUNT030101 also indicates that PSPs tend to be composed of amino acids that form transmembrane helices. In the photosynthetic reaction centers (RCs), the membrane-spanning helices are the main structures. For example, the RCs from purple bacteria have 11 membrane-spanning helices that can form a hydrophobic binding site for cofactor binding; in contrast, some external helices that are exposed outside the membrane connect to the membrane-spanning helices.
C. PSPs have low interaction with water
The property of WOLR810101, described as "Hydration potential", was selected by SCM-PCPs with R = 0.7597. The hydration potentials of amino acid side chains represent their free energies of transfer from the vapor phase to dilute water. The structures of macromolecules can be determined by comparing amino acid residues in terms of their strength of solvation by water. Many researchers have attempted to calculate solvation properties of the natural amino acid side chains. By using more sensitive techniques compared to earlier measurements, Wolfenden et al. developed a scale of effective free energies of transfer of amino acid side chains from the vapor phase to neutral aqueous solution buffered at pH 7. Since highly hydrophilic amino acid side chains release a large amount of energy when they are dissolved in water, amino acids with lower scores in WOLR810101 favour interaction with water molecules, and vice versa. Wolfenden et al. reported the results for a scale of hydration potential spanning a range of ~22 kcal/mol. The residues with the five lowest scores were Arg, Asp, His, Glu and Asn.
Table 7 shows that the residues with the five lowest propensity scores were hydrophilic residues Gln, Glu, Arg, Asp and Lys, with Arg being at rank 18. According to the previously reported scale, aliphatic side chains of Gly, Leu, Ile, Val and Ala can only form weak bonds with water and exhibited positive free energies of transfer from vapor phase to water . Among the SCMPSP top-5 scored residues, Ala, Ile and Leu are found together with aromatic Phe and Tyr residues. The Phe and Tyr were ranked 6 and 13, respectively, by a previously reported scale of amino acid affinities. Wolfenden et al. not only determined affinities of amino acid side chains, but also identified a statistically significant relationship between the resulting scale and the inside-outside distributions of amino acids in globular proteins. The outside residues directly interact with water while the inside ones do not. Based on earlier solvent accessibility calculations, they showed that residues with negative free energies of transfer from the vapor phase to water tend to appear on the surface rather than in the interior of globular proteins . Thus, unlike the accessible Arg, Asp, His, Glu and Asn residues, the Gly, Leu, Ile, Val and Ala amino acids tend to be "buried".
D. PSPs tend to be composed of amino acids with electron-reactive side chains
The photosynthetic machinery that collects solar energy and converts it to chemical energy is susceptible to oxidative damage resulting from excess light [40, 41]. Strong light causes oxidation and increases production of reactive oxygen species (ROS). Many studies of the effects of ROS on various proteins [43–45] indicate that ROS cause oxidative damage to proteins, which results in biological dysfunctions such as perturbed activity of enzymes, transport proteins and receptors; ROS can oxidatively modify proteins and enhance proteolysis in organisms. ROS generated by the light-driven reactions of electron transfer also decrease PSII activity and lead to irreversible oxidation of the D1 protein in photosynthesis systems. To reduce ROS damage, some ROS catalytic enzymes such as glutathione peroxidase and superoxide dismutase have been studied because they apparently have important roles in destroying ROS. In addition, the interplay between cyclic, linear and pseudocyclic electron transport pathways is required to prevent an over-oxidized state and to coordinate energy metabolism during photosynthesis. Some studies of mechanisms that overcome oxidative stress have postulated that the photosynthesis system has a prevention mechanism in which PSPs immediately capture the ROS generated from light and prevent the ROS from attacking other PSPs. However, the AAindex database does not have an ROS-related index.
The SCMPSP scores and rate constants by Davies et al.
Rate constants by Davies et al. (in table row order): 7.7 × 10⁷, 6.5 × 10⁹, 1.3 × 10¹⁰, 1.8 × 10⁹, 1.7 × 10⁹, 1.7 × 10⁷, 7.6 × 10⁸, 8.3 × 10⁹, 4.8 × 10⁸, 1.3 × 10¹⁰, 5.1 × 10⁸, 3.2 × 10⁸, 4.9 × 10⁷, 1.3 × 10¹⁰, 3.4 × 10¹⁰, 3.4 × 10⁸, 7.5 × 10⁷, 3.5 × 10⁹, 2.3 × 10⁸, 5.4 × 10⁸
Therefore, Ala, Pro and Gly were excluded from further correlation analysis. After excluding these three amino acids, the correlation dramatically increases from 0.31 to 0.50. This phenomenon indicates that PSPs tend to be composed of amino acids with high ROS reacting side chains.
In the ROS reaction constant analysis, Ala has the highest propensity score, but its side chain has a weak interaction with ROS. Some researchers have hypothesized that most PSPs function in the membrane environment and are composed of the alpha-helical structures. Arkin and Brunger  provided statistical results showing that Ala tends to be composed of transmembrane alpha-helical structures since structures that influence protein functions are more important for PSPs than for capturing ROS. The Trp with a rapidly ROS interacting ability is rank tenth, which indicates that PSPs and non-PSPs would have equal propensity to be composed of Trp. Although the experiments show that Trp has a good ability to react with ROS, Trp is a high energy amino acid that needs most energy to be produced. This explains why PSPs do not use this residue to protect against ROS.
The correlations obtained for the highly oxidizing HO* radical indicate that PSPs tend to be composed of amino acids that are depleted rapidly. The advantage of this amino acid composition is that it neutralizes the ROS that cause protein damage and prevents ROS from destroying the photosynthetic system. Studies of the cell cycle also indicate that an excess of ROS can trigger apoptosis. Although living organisms have enzymes such as glutathione peroxidase and superoxide dismutase that can catalyze the removal of ROS, rapidly neutralizing ROS may be a good strategy for preventing ROS accumulation. In the photosynthetic system, however, the balance between energy generation and oxidative stress appears to follow an energy-saving strategy: Trp, an energetically costly amino acid with a better ability to absorb ROS, is not preferred in PSPs compared with non-PSPs.
The above understanding is needed to engineer photosynthetic organisms with enhanced oxidative stress tolerance. Therefore, plants with enhanced antioxidant content should be selectively engineered to reduce oxidative stress. Recent studies have also focused on searching for or creating antioxidant peptides [41, 52]. Antioxidant peptides, most of which are food-derived, are thought to promote health and prevent disease. Although most such peptides are extracted from milk or fish, understanding how proteins protect against ROS would also help to design new proteins that react readily with oxidants.
The current work proposed a novel SCM-based method, SCMPSP, for predicting and analysing PSPs from their sequences. Several other homology-based and machine-learning approaches were also explored: BLAST, support vector machine (SVM), decision tree J48 and Bayes. The performance of the SCMPSP method was comparable to that of the SVM-based method, which in turn outperformed the J48- and Bayes-based methods when applied to the independent test set. Additionally, the propensity scores derived from SCMPSP led to the identification of informative physicochemical properties, providing insights into the nature of PSPs. Our SCM_PCPs method yielded high correlations with PCPs such as BLAS910101, PUNT030101 and WOLR810101. Additional correlation analysis was conducted to explore the nature of PSPs in their interaction with ROS. In summary, PSPs are more likely to be composed of amino acids with hydrophobic and electron-reactive side chains, as well as those that reinforce the formation of helices in membrane environments. Moreover, PSPs have low interaction with water. The SCMPSP source code and the datasets used in this study are available at http://iclab.life.nctu.edu.tw/SCMPSP/.
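As a rough illustration of how a scoring card built from dipeptide propensity scores can classify a sequence, the following Python sketch scores a sequence as the composition-weighted sum of its dipeptide scores and compares the result with a threshold. The function names, the toy score table and the threshold are hypothetical; the sketch assumes a plain weighted sum and does not reproduce the released SCMPSP code or its optimized scores.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dipeptide_composition(seq):
    """Relative frequency of each dipeptide observed in `seq`."""
    seq = seq.upper()
    counts = {}
    total = 0
    for a, b in zip(seq, seq[1:]):
        # Skip pairs that contain non-standard residue symbols.
        if a in AMINO_ACIDS and b in AMINO_ACIDS:
            counts[a + b] = counts.get(a + b, 0) + 1
            total += 1
    if total == 0:
        return {}
    return {dp: c / total for dp, c in counts.items()}


def scm_score(seq, propensity):
    """Composition-weighted sum of dipeptide propensity scores."""
    comp = dipeptide_composition(seq)
    return sum(freq * propensity.get(dp, 0.0) for dp, freq in comp.items())


def predict_psp(seq, propensity, threshold):
    """Label a sequence as PSP when its score exceeds `threshold`."""
    return scm_score(seq, propensity) > threshold


# Hypothetical scoring card: all 400 dipeptides start at a neutral value,
# then two entries are changed by hand purely for the demonstration.
toy_card = {a + b: 500.0 for a, b in product(AMINO_ACIDS, repeat=2)}
toy_card["LL"] = 1000.0
toy_card["DD"] = 0.0

print(scm_score("LLLL", toy_card), predict_psp("LLLL", toy_card, 500.0))
print(scm_score("DDDD", toy_card), predict_psp("DDDD", toy_card, 500.0))
```

A single interpretable score per dipeptide is what makes this kind of classifier easy to inspect: each entry of the card can be read directly as a preference for or against the PSP class.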
This work was funded by National Science Council of Taiwan under the contract numbers of NSC-103-2221-E-009-117- and NSC-103-2221-E-009-141-. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 1, 2015: Selected articles from the Thirteenth Asia Pacific Bioinformatics Conference (APBC 2015): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S1
- Tanaka A, Makino A: Photosynthetic research in plant science. Plant and cell physiology. 2009, 50 (4): 681-683. 10.1093/pcp/pcp040.
- Dekker JP, Boekema EJ: Supramolecular organization of thylakoid membrane proteins in green plants. Biochimica et Biophysica Acta (BBA)-Bioenergetics. 2005, 1706 (1): 12-39.
- Kieselbach T, Hagman Å, Andersson B, Schröder WP: The Thylakoid Lumen of Chloroplasts: Isolation and Characterization. Journal of Biological Chemistry. 1998, 273 (12): 6710-6716. 10.1074/jbc.273.12.6710.
- Kleffmann T, Russenberger D, von Zychlinski A, Christopher W, Sjölander K, Gruissem W, Baginsky S: The Arabidopsis thaliana Chloroplast Proteome Reveals Pathway Abundance and Novel Protein Functions. Current Biology. 2004, 14 (5): 354-362. 10.1016/j.cub.2004.02.039.
- Schubert M, Petersson UA, Haas BJ, Funk C, Schröder WP, Kieselbach T: Proteome map of the chloroplast lumen of Arabidopsis thaliana. Journal of Biological Chemistry. 2002, 277 (10): 8354-8365. 10.1074/jbc.M108575200.
- Peltier J-B, Friso G, Kalume DE, Roepstorff P, Nilsson F, Adamska I, van Wijk KJ: Proteomics of the chloroplast: systematic identification and targeting analysis of lumenal and peripheral thylakoid proteins. The Plant Cell Online. 2000, 12 (3): 319-341. 10.1105/tpc.12.3.319.
- Peltier J-B, Cai Y, Sun Q, Zabrouskov V, Giacomelli L, Rudella A, Ytterberg AJ, Rutschow H, van Wijk KJ: The oligomeric stromal proteome of Arabidopsis thaliana chloroplasts. Molecular & Cellular Proteomics. 2006, 5 (1): 114-133.
- Ishikawa M, Fujiwara M, Sonoike K, Sato N: Orthogenomics of photosynthetic organisms: bioinformatic and experimental analysis of chloroplast proteins of endosymbiont origin in Arabidopsis and their counterparts in Synechocystis. Plant and cell physiology. 2009, 50 (4): 773-788. 10.1093/pcp/pcp027.
- Leister D: Chloroplast research in the genomic age. TRENDS in Genetics. 2003, 19 (1): 47-56. 10.1016/S0168-9525(02)00003-3.
- Nakai K, Kanehisa M: A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics. 1992, 14 (4): 897-911. 10.1016/S0888-7543(05)80111-9.
- Emanuelsson O, Nielsen H, Von Heijne G: ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Science. 1999, 8 (05): 978-984. 10.1110/ps.8.5.978.
- Emanuelsson O, Nielsen H, Brunak S, von Heijne G: Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. Journal of molecular biology. 2000, 300 (4): 1005-1016. 10.1006/jmbi.2000.3903.
- Zybailov B, Rutschow H, Friso G, Rudella A, Emanuelsson O, Sun Q, van Wijk KJ: Sorting signals, N-terminal modifications and abundance of the chloroplast proteome. PloS one. 2008, 3 (4): e1994. 10.1371/journal.pone.0001994.
- Kaundal R, Sahu SS, Verma R, Weirick T: Identification and characterization of plastid-type proteins from sequence-attributed features using machine learning. BMC bioinformatics. 2013, 14 (Suppl 14): S7. 10.1186/1471-2105-14-S14-S7.
- Ashkenazi S, Snir R, Ofran Y: Assessing the relationship between conservation of function and conservation of sequence using photosynthetic proteins. Bioinformatics. 2012, 28 (24): 3203-3210. 10.1093/bioinformatics/bts608.
- Giardi MT, Pace E: Photosynthetic proteins for technological applications. TRENDS in Biotechnology. 2005, 23 (5): 257-263. 10.1016/j.tibtech.2005.03.003.
- Robertson DE, Jacobson SA, Morgan F, Berry D, Church GM, Afeyan NB: A new dawn for industrial photosynthesis. Photosynthesis research. 2011, 107 (3): 269-277. 10.1007/s11120-011-9631-7.
- Charoenkwan P, Shoombuatong W, Lee HC, Chaijaruwanich J, Huang HL, Ho SY: SCMCRYS: Predicting Protein Crystallization Using an Ensemble Scoring Card Method with Estimating Propensity Scores of P-Collocated Amino Acid Pairs. Plos One. 2013, 8 (9).
- Huang H-L: Propensity Scores for Prediction and Characterization of Bioluminescent Proteins from Sequences. PloS one. 2014, 9 (5): e97158. 10.1371/journal.pone.0097158.
- Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M: AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 2008, 36 (Database issue): D202-205.
- Edgar RC, Haas BJ, Clemente JC, Quince C, Knight R: UCHIME improves sensitivity and speed of chimera detection. Bioinformatics. 2011, 27 (16): 2194-2200. 10.1093/bioinformatics/btr381.
- Chang C, Lin C: LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011, 2 (27): 21-27.
- Frank E, Hall M, Trigg L, Holmes G, Witten IH: Data mining in bioinformatics using Weka. Bioinformatics. 2004, 20 (15): 2479-2481. 10.1093/bioinformatics/bth261.
- Salzberg SL: C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993. Machine Learning. 1994, 16 (3): 235-240.
- Han J, Kamber M: Data Mining: Concepts and Techniques, Third Edition (The Morgan Kaufmann Series in Data Management Systems). 2006, Elsevier, second.
- Huang HL, Charoenkwan P, Kao TF, Lee HC, Chang FL, Huang WL, Ho SJ, Shu LS, Chen WL, Ho SY: Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition. Bmc Bioinformatics. 2012, 13.
- Ho SY, Shu LS, Chen JH: Intelligent evolutionary algorithms for large parameter optimization problems. Ieee T Evolut Comput. 2004, 8 (6): 522-541. 10.1109/TEVC.2004.835176.
- Tamm LK, Hong H, Liang B: Folding and assembly of beta-barrel membrane proteins. Biochimica et biophysica acta. 2004, 1666 (1-2): 250-263. 10.1016/j.bbamem.2004.06.011.
- Mori H, Cline K: Post-translational protein translocation into thylakoids by the Sec and DeltapH-dependent pathways. Biochimica et biophysica acta. 2001, 1541 (1-2): 80-90. 10.1016/S0167-4889(01)00150-1.
- Von Heijne G: Sequence analysis in molecular biology: treasure trove or trivial pursuit. 2012, Elsevier.
- Spyridaki A, Psylinakis E, Ghanotakis DF: Photosystem II: Composition and Structure. Biotechnological Applications of Photosynthetic Proteins: Biochips, Biosensors and Biodevices. 2007, 11-
- Jensen PE, Bassi R, Boekema EJ, Dekker JP, Jansson S, Leister D, Robinson C, Scheller HV: Structure, function and regulation of plant photosystem I. Biochimica et Biophysica Acta (BBA)-Bioenergetics. 2007, 1767 (5): 335-352. 10.1016/j.bbabio.2007.03.004.
- Huber CG, Walcher W, Timperio AM, Troiani S, Porceddu A, Zolla L: Multidimensional proteomic analysis of photosynthetic membrane proteins by liquid extraction-ultracentrifugation-liquid chromatography-mass spectrometry. Proteomics. 2004, 4 (12): 3909-3920. 10.1002/pmic.200400823.
- Rees D, Komiya H, Yeates T, Allen J, Feher G: The bacterial photosynthetic reaction center as a model for membrane proteins. Annual review of biochemistry. 1989, 58 (1): 607-633. 10.1146/annurev.bi.58.070189.003135.
- Nagata M, Nango M, Kashiwada A, Yamada S, Ito S, Sawa N, Ogawa M, Iida K, Kurono Y, Ohtsuka T: Construction of photosynthetic antenna complex using light-harvesting polypeptide-alpha from photosynthetic bacteria, R. rubrum with zinc substituted bacteriochlorophyll alpha. Chemistry Letters. 2003, 32 (3): 216-217. 10.1246/cl.2003.216.
- Ochiai T, Nagata M, Shimoyama K, Amano M, Kondo M, Dewa T, Hashimoto H, Nango M: Immobilization of porphyrin derivatives with a defined distance and orientation onto a gold electrode using synthetic light-harvesting alpha-helix hydrophobic polypeptides. Langmuir. 2010, 26 (18): 14419-14422. 10.1021/la102869w.
- Black SD, Mould DR: Development of hydrophobicity parameters to analyze proteins which bear post-or cotranslational modifications. Analytical biochemistry. 1991, 193 (1): 72-82. 10.1016/0003-2697(91)90045-U.
- Punta M, Maritan A: A knowledge-based scale for amino acid membrane propensity. Proteins: Structure, Function, and Bioinformatics. 2003, 50 (1): 114-121.
- Wolfenden R, Andersson L, Cullis P, Southgate C: Affinities of amino acid side chains for solvent water. Biochemistry. 1981, 20 (4): 849-855. 10.1021/bi00507a030.
- Jurić S, Hazler-Pilepić K, Tomašić A, Lepeduš H, Jeličić B, Puthiyaveetil S, Bionda T, Vojta L, Allen JF, Schleiff E: Tethering of ferredoxin: NADP+ oxidoreductase to thylakoid membranes is mediated by novel chloroplast protein TROL. The Plant Journal. 2009, 60 (5): 783-794. 10.1111/j.1365-313X.2009.03999.x.
- Bougatef A, Nedjar-Arroume N, Manni Ll, Ravallec R, Barkia A, Guillochon D, Nasri M: Purification and identification of novel antioxidant peptides from enzymatic hydrolysates of sardinelle (Sardinella aurita) by-products proteins. Food chemistry. 2010.
- Nishiyama Y, Yamamoto H, Allakhverdiev SI, Inaba M, Yokota A, Murata N: Oxidative stress inhibits the repair of photodamage to the photosynthetic machinery. The EMBO journal. 2001, 20 (20): 5587-5594. 10.1093/emboj/20.20.5587.
- Salvi A, Carrupt P-A, Tillement J-P, Testa B: Structural damage to proteins caused by free radicals: asessment, protection by antioxidants, and influence of protein binding. Biochemical pharmacology. 2001, 61 (10): 1237-1242. 10.1016/S0006-2952(01)00607-4.
- Stadtman E, Levine R: Free radical-mediated oxidation of free amino acids and amino acid residues in proteins. Amino acids. 2003, 25 (3-4): 207-218. 10.1007/s00726-003-0011-2.
- Davies KJ: Protein damage and degradation by oxygen radicals. I. general aspects. Journal of Biological Chemistry. 1987, 262 (20): 9895-9901.
- Davies K, Goldberg A: Proteins damaged by oxygen radicals are rapidly degraded in extracts of red blood cells. Journal of Biological Chemistry. 1987, 262 (17): 8227-8234.
- Foyer CH, Shigeoka S: Understanding oxidative stress and antioxidant functions to enhance photosynthesis. Plant Physiology. 2011, 155 (1): 93-100. 10.1104/pp.110.166181.
- Arkin IT: Statistical analysis of predicted transmembrane α-helices. Biochimica et Biophysica Acta (BBA)-Protein Structure and Molecular Enzymology. 1998, 1429 (1): 113-128. 10.1016/S0167-4838(98)00225-8.
- Simon H-U, Haj-Yehia A, Levi-Schaffer F: Role of reactive oxygen species (ROS) in apoptosis induction. Apoptosis. 2000, 5 (5): 415-418. 10.1023/A:1009616228304.
- Demmig-Adams B, Adams WW: Antioxidants in photosynthesis and human nutrition. Science. 2002, 298 (5601): 2149-2153. 10.1126/science.1078002.
- Power O, Jakeman P, FitzGerald R: Antioxidative peptides: enzymatic production, in vitro and in vivo antioxidant activity and potential applications of milk-derived antioxidative peptides. Amino Acids. 2013, 44 (3): 797-820. 10.1007/s00726-012-1393-9.
- Davies MJ: The oxidative environment and protein damage. Biochimica et Biophysica Acta (BBA)-Proteins and Proteomics. 2005, 1703 (2): 93-109. 10.1016/j.bbapap.2004.08.007.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.