
Prediction of protein submitochondria locations by hybridizing pseudo-amino acid composition with various physicochemical features of segmented sequence

Abstract

Background

Knowing the submitochondria localization of a mitochondrial protein is an important step toward understanding its function. We develop a method, based on an extended version of pseudo-amino acid composition, to predict protein localization within mitochondria. This work goes one step further than predicting protein subcellular location. We also try to predict the membrane protein type for mitochondrial inner membrane proteins.

Results

By using leave-one-out cross validation, the prediction accuracy is 85.5% for inner membrane, 94.5% for matrix and 51.2% for outer membrane. The overall prediction accuracy for submitochondria location prediction is 85.2%. For proteins predicted to localize at inner membrane, the accuracy is 94.6% for membrane protein type prediction.

Conclusion

Our method is effective for predicting protein submitochondria location. However, even with our method or the methods available at the subcellular level, predicting protein submitochondria location remains a challenging problem. The online service SubMito is now available at: http://bioinfo.au.tsinghua.edu.cn/subMito

Background

Mitochondria are subcellular organelles that appear only in eukaryotic cells. They are surrounded by two layers of membrane, the inner membrane and the outer membrane. Proteins localized within mitochondria play important roles in energy metabolism. The inner membrane, the outer membrane and the matrix contain proteins that contribute to different steps of energy metabolism. Mitochondria have been shown to be involved in several complex biological processes, such as programmed cell death[1] and ionic homeostasis[2]. Over 100 kinds of complex diseases are related to mitochondria. Thus, it is important to understand protein function within mitochondria.

Knowing a protein's localization is an important step toward understanding its function, but identifying protein subcellular location experimentally is costly and time consuming. A host of computational systems for predicting protein subcellular location have been developed during the last two decades. Various sequence features have been used for this purpose, such as terminal signalling peptides[3, 4], amino acid composition[5–8], pseudo-amino acid composition[9, 10], dipeptide composition[11, 12], functional domain composition[13, 14] and GO information[14, 15]. A number of machine learning approaches have also been introduced, such as the Markov chain method[16], discriminant function[17, 18], SVM[9, 19–21], artificial neural network[22, 23], OET-KNN[24], fuzzy-KNN[11] and classifier fusion techniques[24–26]. Some reviews describe most of these methods in detail[27, 28]. Most of these methods assign a unique subcellular location to a protein, but others, called multiplex subcellular location predictors, can assign more than one subcellular location to a protein[29–31].

Recently, the advances of experimental technology have enabled the large-scale identification of nuclear proteins[32, 33]. A database for nuclear proteins and their subnuclear location has been constructed[34]. The prediction of protein subcellular location has been extended to a new level, the subnuclear level[35, 36], where the protein location within cell nucleus can be predicted.

To the best of our knowledge, however, there exists no computational system for predicting protein submitochondria location. In this paper, we develop a computational system called SubMito to predict the submitochondria location of a protein from its primary sequence alone. The system assigns a sequence to one of three submitochondria locations: mitochondrial inner membrane, mitochondrial outer membrane or mitochondrial matrix. Since several sophisticated methods for identifying mitochondrial proteins already exist, such as MitoPred[37], a prediction that goes one level deeper should be a good complement to these mitochondrial protein identification systems.

Membrane protein type prediction is another challenging problem. Some powerful methods[38–45] have been introduced to predict membrane protein type for a membrane protein. We try to integrate membrane protein type prediction with submitochondria location prediction. We predict the membrane protein type for a protein after we predict it to be a membrane protein. Due to the limitation of the data, we only predict membrane protein type for mitochondrial inner membrane proteins.

We hope that our work can provide a useful complement to those subcellular location predictors which are developed previously.

Results

Evaluation method

Since leave-one-out cross validation is more objective and rigorous[27] than sub-sampling methods, we adopt it in our work to obtain a more reliable estimate of the prediction accuracy and the Matthews correlation coefficient[46], both of which are widely used statistics for evaluating the performance of subcellular location predictors.

The prediction accuracy and Matthews correlation coefficient of the ith location are defined in equation 1 and equation 2 respectively.

\[ \mathrm{ACC}(i) = \frac{TP(i)}{TP(i) + FN(i)} \tag{1} \]

\[ \mathrm{MCC}(i) = \frac{TP(i)\,TN(i) - FP(i)\,FN(i)}{\sqrt{\bigl(TP(i) + FP(i)\bigr)\bigl(TP(i) + FN(i)\bigr)\bigl(TN(i) + FN(i)\bigr)\bigl(TN(i) + FP(i)\bigr)}} \tag{2} \]

The overall prediction accuracy is defined in equation 3.

\[ \mathrm{ACC}_{overall} = \frac{1}{N} \sum_{k=1}^{3} TP(k) \tag{3} \]

TP(i), TN(i), FP(i) and FN(i) are the numbers of true positives, true negatives, false positives and false negatives of the ith location. N is the total number of sequences in the training data set.
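
As an illustration of equations 1 to 3, the following minimal Python sketch computes the per-location accuracy, the per-location MCC and the overall accuracy from a confusion matrix. The function names and the `toy` counts are hypothetical and are not the values behind Table 1; returning 0 when a denominator is zero is also an assumption of this sketch.

```python
import math

def per_class_metrics(confusion, i):
    """Return (ACC(i), MCC(i)) for class i.

    confusion[r][c] counts samples of true class r predicted as class c.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[i][i]
    fn = sum(confusion[i]) - tp                        # true class i, predicted elsewhere
    fp = sum(confusion[r][i] for r in range(n)) - tp   # other classes predicted as i
    tn = total - tp - fn - fp
    acc = tp / (tp + fn) if (tp + fn) else 0.0         # equation 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fn) * (tn + fp))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # equation 2
    return acc, mcc

def overall_accuracy(confusion):
    """Equation 3: sum of the diagonal divided by the number of samples."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[k][k] for k in range(len(confusion))) / total

# Hypothetical counts for three locations (illustration only).
toy = [[8, 1, 1],
       [2, 7, 1],
       [0, 2, 8]]
acc0, mcc0 = per_class_metrics(toy, 0)
```

In this sketch ACC(i) is the fraction of sequences from location i that are recovered, so it ignores false positives, while MCC(i) uses all four counts.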

Prediction performance

The leave-one-out cross validation result is shown in Table 1.

Table 1 The leave one out cross validation result

After a sequence is predicted to localize at the inner membrane, we go on to predict its membrane protein type. Of the 112 correctly identified inner membrane proteins, 106 are assigned the correct membrane protein type and only 6 the wrong one. The method thus correctly predicts both the submitochondria location and the membrane protein type for 80.9% of the 131 inner membrane proteins. By membrane protein type, 84 of the 101 multi-pass inner membrane proteins are predicted correctly, a success rate of about 83.2%, and 22 of the 30 matrix-side membrane proteins are predicted correctly, a success rate of about 73.3%.
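
The percentages in this paragraph follow directly from the reported counts. The short Python sketch below re-derives them; the helper name `percent` is hypothetical and only the counts come from the text.

```python
def percent(numerator, denominator):
    """Return numerator/denominator as a percentage rounded to one decimal."""
    return round(100.0 * numerator / denominator, 1)

# Counts taken from the text.
type_given_location = percent(106, 112)  # type accuracy among correctly located proteins
whole_cascade = percent(106, 131)        # location and type both correct
multi_pass = percent(84, 101)
matrix_side = percent(22, 30)

# The two membrane protein types account for all inner membrane proteins.
assert 84 + 22 == 106 and 101 + 30 == 131
```

The 94.6% figure is conditional on the location already being correct, whereas the 80.9% figure measures the whole two-stage cascade.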

Prediction on complete proteome

We apply our method to the completely sequenced mitochondrial proteome of Arabidopsis thaliana to demonstrate that it can assign the proteins of a proteome to different submitochondria locations. The mitochondrial proteome of Arabidopsis thaliana is downloaded from AMPDB[47].

The prediction result is shown in Table 2. Our method predicts that about 21% of all proteins are in the inner membrane, 13% are in the outer membrane and 66% are in the matrix. This distribution indicates that the majority of the Arabidopsis thaliana mitochondrial proteome is located in the matrix. Experimental research on the yeast mitochondrial proteome[48] also shows that the majority of the mitochondrial proteome consists of soluble proteins, which is consistent with our prediction results. Because only a very small part of these proteins are annotated with a submitochondria location (18 inner membrane, 11 outer membrane and 15 matrix), we cannot provide a good estimate of the prediction accuracy on this particular mitochondrial proteome. However, our method correctly identifies most of these annotated sequences (10 out of 18 for inner membrane, 11 out of 11 for outer membrane and 15 out of 15 for matrix). This suggests that our method can serve as a tool for computationally annotating sequences that lack a submitochondria location annotation.

Table 2 Prediction result on complete mitochondria proteome of Arabidopsis thaliana

Discussion

Result

Because no other method exists for predicting protein submitochondria location, we are unable to provide a comparison with other methods. Even for the membrane protein type prediction part, we work on a different dataset, so a comparison with other methods on the same basis is not possible. Judging by the performance that most subcellular location predictors achieve, we can say that our method has high overall prediction accuracy.

Our method identifies proteins localized at the inner membrane and the matrix very well, but it performs less well on outer membrane proteins than on the other two locations. For the membrane protein type prediction part, our method correctly predicts the membrane protein type for 94.6% of the correctly predicted inner membrane proteins. The accuracy of the whole cascade prediction is more than 80%. Thus, our method is effective for predicting protein submitochondria location and the membrane protein type of mitochondrial inner membrane proteins.

We report the MCC value for each location in order to give a more comprehensive evaluation of the performance of our predictor. Since MCC considers not only the number of true positives but also the numbers of false positives, false negatives and true negatives, it is more reliable and more comprehensive than the accuracy statistic, especially when the training set is unbalanced. Showing MCC and accuracy together gives readers a clearer understanding of the performance of our method. The MCC range of 0.6 to 0.7 shows that our method has good prediction performance, and it suggests that the accuracy we report is not an artifact of the unbalanced training set.

Method

As described in the Methods section, we set the sequence identity cut-off to 40%. Some recent research[25] suggests that the sequence identity should be kept below 25% to remove homology and redundancy bias. With such a low cut-off value, however, we cannot obtain enough sequences to build a sufficiently large training set. We therefore use a higher sequence identity cut-off to balance the homology bias against the training set size.

We have tried different numbers of segments, which is the parameter c in our method. The prediction results for c = 1, 2, 3, 4 are shown in Table 3. Interestingly, the prediction accuracy of every submitochondria location seems to peak at a particular c value. Two of these peaks and the peak of the overall accuracy occur at the same value, c = 2, so we finally choose c = 2 as the optimized parameter in our method.

Table 3 Prediction accuracy for different c

Another technique that has rarely been used in previous subcellular location prediction studies is the set of 9 physicochemical properties used in our method. As described in the Methods section, Chou's pseudo-amino acid composition and amphiphilic pseudo-amino acid composition use only three and two kinds of physicochemical properties, respectively. Here we show a comparison to demonstrate the usefulness of the additional physicochemical properties. We exclude all physicochemical properties except "Hydrophilicity value" and "Consensus normalized hydrophobicity" and perform prediction with these 2 properties. The comparison result is shown in Table 4. We find that the accuracy decreases markedly after the other 7 physicochemical properties are excluded, especially at the outer membrane location. We believe that this decrease in accuracy results from losing information about long distance interactions between residues along the sequence.

Table 4 Prediction accuracy for different number of physicochemical properties

Software

The available data on submitochondria location in the Swiss-Prot database is increasing rapidly, so we designed our software with an upgradeable architecture. The model used in our software can be updated when a sufficient amount of new data becomes available. We will publish these updates on the SubMito web site.

We also need to make clear that SubMito only predicts the submitochondria location of a mitochondrial protein. Users should only submit known or predicted mitochondrial proteins to SubMito. Users who only have an amino acid sequence should first use MitoPred (which is the best mitochondrial protein predictor in our opinion) to predict whether the sequence is a mitochondrial protein. If the user submits a predicted mitochondrial protein to SubMito, the program's rate of false positives will be higher, as some of the submitted proteins will be false positives generated by the mitochondrial prediction server.

Conclusion

In this paper, we develop a computational system for predicting protein submitochondria location from the primary sequence alone. Like subnuclear location prediction, submitochondria location prediction resolves the location of a protein more finely than subcellular location prediction does. The online service and software SubMito has been developed for predicting protein submitochondria location. Judging by similar work at the subcellular level, predicting submitochondria location is still a challenging problem.

Methods

Data set

The raw data set used in this work is extracted from Swiss-Prot[49] release 48.0. To construct a high quality working dataset, we use the following steps to process all sequences extracted from the database.

  1. The sequences which have a subcellular location annotation containing the word "mitochondrion" are selected. The following steps are applied to this subset of sequences.

  2. The sequences which have a subcellular location annotation containing any of the words "Probable", "Potential", "Possible" or "By Similarity" are excluded, because their annotations lack confidence.

  3. The sequences containing ambiguous residues like "X", "B" and "Z" are excluded.

  4. The sequences which are fragments of other proteins are excluded.

  5. The sequences which localize at more than one submitochondria location are excluded.

  6. The remaining sequences are processed using the CD-HIT[50] program to remove highly homologous sequences. The identity between any 2 sequences in the processed data set is less than 40%. The identity cut-off is set to 40% in order to balance the homology bias against the size of the training set.

  7. The sequences localizing at the inner membrane without a membrane protein type annotation like "multi-pass membrane protein", "matrix side" or "peripheral membrane protein" are excluded.

  8. The submitochondria locations or membrane protein types containing fewer than 15 sequences are dropped.
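
The filtering steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the record fields (`location_note`, `sequence`, `is_fragment`, `submito_locations`, `membrane_type`) are hypothetical and do not reflect the Swiss-Prot flat-file format, step 6 (CD-HIT clustering at 40% identity) is an external program and is not reproduced, and step 8 is approximated by counting each (location, membrane type) pair.

```python
from collections import Counter

UNCERTAIN_WORDS = ("probable", "potential", "possible", "by similarity")
AMBIGUOUS_RESIDUES = set("XBZ")

def class_key(rec):
    """Hypothetical class label: (submitochondria location, membrane protein type)."""
    return (rec["submito_locations"][0], rec["membrane_type"])

def filter_records(records, min_class_size=15):
    kept = []
    for rec in records:
        note = rec["location_note"].lower()
        if "mitochondrion" not in note:                       # step 1
            continue
        if any(word in note for word in UNCERTAIN_WORDS):     # step 2
            continue
        if AMBIGUOUS_RESIDUES & set(rec["sequence"].upper()):  # step 3
            continue
        if rec["is_fragment"]:                                # step 4
            continue
        # Step 5; assumption: records with no submitochondria location are dropped too.
        if len(rec["submito_locations"]) != 1:
            continue
        # Step 6 (CD-HIT redundancy removal) would run here as an external tool.
        if rec["submito_locations"][0] == "inner membrane" and not rec["membrane_type"]:
            continue                                          # step 7
        kept.append(rec)
    counts = Counter(class_key(r) for r in kept)              # step 8
    return [r for r in kept if counts[class_key(r)] >= min_class_size]
```

In a real pipeline the size threshold of step 8 would be applied after the CD-HIT step, since redundancy removal changes the class sizes.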

After strictly following the above steps, we finally obtain 317 sequences classified into 3 submitochondria locations. Table 5 shows the distribution of the data.

Table 5 The distribution of data set

Feature vector

Proteins localized at different submitochondria locations have different N-terminal or C-terminal targeting signal peptides. Andrade et al.[5] have pointed out that, at the subcellular level, the average physicochemical properties of a protein's molecular surface are adapted to the micro environment in which the protein is localized, and that these average surface properties are correlated with the amino acid composition of the sequence. The investigation of the Markov chain method[16] and the work based on pseudo-amino acid composition[10] imply that the long distance interactions between residues are correlated with the subcellular location. We assume that these observations still hold at the submitochondria level. We therefore attempt to construct a feature vector representing the targeting signal information, the average physicochemical properties of the molecular surface and the long distance interactions between residues along the whole sequence.

The feature vector is made up of three parts. Before constructing the first two parts, the sequence is divided into c segments of equal length.

The first part of the feature vector is the amino acid composition, that is, the occurrence frequencies of the different residues. Assume that the length of the ith segment is $L_i$ and that the numbers of the different residues appearing in the ith segment are $n_1, n_2, \ldots, n_{20}$. The amino acid composition vector of the ith segment is then defined in equation 4.

\[ \vec{V}_{i1} = \frac{1}{L_i} \left[ n_1, n_2, \ldots, n_{20} \right]^T \tag{4} \]

The amino acid composition may represent the average physicochemical properties of the molecular surface according to our assumptions, but the amino acid composition vector contains no sequence order information. We use the dipeptide composition, which denotes the occurrence frequencies of two consecutive residues, as the second part of the feature vector in order to add some sequence order information to the amino acid composition. Since we divide the sequence into c segments, this part of the feature vector may represent the sequence order information of different parts of the sequence, especially the N-terminal and C-terminal targeting signal peptides. Assume that the numbers of the different dipeptides appearing in the ith segment are $n_1, n_2, \ldots, n_{400}$. The dipeptide composition is then defined in equation 5.

\[ \vec{V}_{i2} = \frac{1}{L_i - 1} \left[ n_1, n_2, \ldots, n_{400} \right]^T \tag{5} \]
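
The first two parts of the feature vector (equations 4 and 5) can be sketched in Python as below. The function names are hypothetical, and the way boundaries are chosen when the sequence length is not divisible by c is an assumption of this sketch, since the text does not specify it. Each segment is assumed to contain at least two residues.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]

def split_segments(seq, c):
    """Split seq into c contiguous segments of (nearly) equal length."""
    bounds = [j * len(seq) // c for j in range(c + 1)]
    return [seq[bounds[j]:bounds[j + 1]] for j in range(c)]

def aa_composition(segment):
    """Equation 4: residue counts divided by the segment length (20 values)."""
    return [segment.count(a) / len(segment) for a in AMINO_ACIDS]

def dipeptide_composition(segment):
    """Equation 5: dipeptide counts divided by L_i - 1 (400 values)."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for k in range(len(segment) - 1):
        pair = segment[k:k + 2]
        if pair in counts:          # skip pairs containing non-standard residues
            counts[pair] += 1
    return [counts[d] / (len(segment) - 1) for d in DIPEPTIDES]

segments = split_segments("ACACGG", 2)
first_part = [aa_composition(s) for s in segments]
second_part = [dipeptide_composition(s) for s in segments]
```

Because dipeptides are counted within each segment, pairs that straddle a segment boundary are not counted in this sketch.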

After the first two parts of the feature vector are constructed, the c segments are merged to form the complete sequence again. The third part of the feature vector uses the physicochemical properties of the residues in order to capture some information about long distance interactions between residues. Chou used three kinds of physicochemical properties in his pseudo-amino acid composition[10, 14, 15] and two kinds of properties in his amphiphilic pseudo-amino acid composition[25, 38]. We choose 9 kinds of physicochemical properties which have been used in other studies[51, 52] for our problem. We hope this will capture more information about the long distance interactions between residues along the sequence.

The first step to construct the third part of the feature vector is to replace the amino acid residues with the normalized amino acid indexes, which are numbers representing the physicochemical properties of the residue. The 9 physicochemical properties selected in our work are listed in Table 6. For the ith amino acid index extracted from AAIndex database[53], we use the normalization procedure described by equation 6 to 8 which has been used in Chou's hybridization space methods[54–56] to normalize physicochemical properties.

Table 6 The 9 physicochemical properties used in this work

\[ p_{normal\_i}^{(k)} = \frac{p_i^{(k)} - \bar{p}_i}{\sqrt{\mathrm{Var}(p_i)}} \tag{6} \]

Where

\[ \bar{p}_i = \frac{1}{20} \sum_{k=1}^{20} p_i^{(k)} \tag{7} \]

\[ \mathrm{Var}(p_i) = \frac{1}{20} \sum_{k=1}^{20} \left( p_i^{(k)} - \bar{p}_i \right)^2 \tag{8} \]
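
A minimal Python sketch of the normalization in equations 6 to 8 is given below. The function name is hypothetical and the `toy_index` values are placeholders, not values taken from the AAIndex database.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def normalize_index(raw):
    """Normalize one amino acid index over the 20 residue types.

    raw maps each of the 20 amino acid letters to its property value.
    """
    values = [raw[a] for a in AMINO_ACIDS]
    mean = sum(values) / 20                           # equation 7
    var = sum((v - mean) ** 2 for v in values) / 20   # equation 8 (population variance)
    sd = math.sqrt(var)
    return {a: (raw[a] - mean) / sd for a in AMINO_ACIDS}  # equation 6

# Placeholder property values (illustration only).
toy_index = {a: float(k) for k, a in enumerate(AMINO_ACIDS)}
normalized = normalize_index(toy_index)
```

After this step each property has zero mean and unit variance over the 20 amino acids, so properties measured on different scales contribute comparably.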

For each property, the replacement produces a series of numbers. Assume that for the ith property, the series is $p_1^{(i)}, p_2^{(i)}, \ldots, p_L^{(i)}$, where L is the length of the sequence and $p_k^{(i)}$, 1 ≤ k ≤ L, is the ith normalized amino acid index of the kth residue in the sequence. Then we calculate the value of the autocorrelation function $R_i(\tau)$, 1 ≤ τ ≤ T, using equation 9, where T is a constant.

\[ R_i(\tau) = \frac{1}{L - \tau} \sum_{k=1}^{L - \tau} p_k^{(i)} \, p_{k+\tau}^{(i)} \tag{9} \]

So for each property, we get the third part of the feature vector which may involve some information about the long distance interactions between residues along the sequence.

\[ \vec{V}_{i3} = \left[ R_i(1), R_i(2), \ldots, R_i(T) \right]^T \tag{10} \]
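
The autocorrelation part of the feature vector (equations 9 and 10) can be sketched as follows. The function name is hypothetical, and the check that T is smaller than the sequence length is an assumption added so that every lag has at least one term.

```python
def autocorrelation(series, T):
    """Return [R(1), ..., R(T)] for one property series (equations 9 and 10).

    series holds the normalized property value of each residue, in sequence order.
    """
    L = len(series)
    if T >= L:
        raise ValueError("T must be smaller than the sequence length")
    return [
        sum(series[k] * series[k + tau] for k in range(L - tau)) / (L - tau)
        for tau in range(1, T + 1)
    ]

# Toy series (illustration only).
r = autocorrelation([1.0, 2.0, 3.0, 4.0], T=2)
```

For a real sequence the series would be obtained by looking up each residue in one normalized amino acid index, and the computation would be repeated for each of the 9 properties.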

Finally, the three parts of the feature vector, namely the c amino acid composition vectors, the c dipeptide composition vectors and the 9 autocorrelation vectors, are combined to form a feature vector of dimension 420c + 9T, as in equation 11.

\[ \vec{V} = \left[ \vec{V}_{11}, \vec{V}_{21}, \ldots, \vec{V}_{c1}, \vec{V}_{12}, \vec{V}_{22}, \ldots, \vec{V}_{c2}, \vec{V}_{13}, \vec{V}_{23}, \ldots, \vec{V}_{93} \right]^T \tag{11} \]

After several tests, we found that c = 2 and T = 20 are the best parameters for the prediction.
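
As a quick check of the 420c + 9T dimension, the short Python sketch below adds up the sizes of the three parts; the function name is hypothetical.

```python
def feature_dimension(c, T, n_properties=9):
    """Size of the combined feature vector of equation 11."""
    composition = c * 20         # c amino acid composition vectors
    dipeptide = c * 400          # c dipeptide composition vectors
    autocorr = n_properties * T  # one length-T autocorrelation vector per property
    return composition + dipeptide + autocorr

dim = feature_dimension(c=2, T=20)
```

With the chosen parameters c = 2 and T = 20, this gives 840 + 180 = 1020 features per sequence.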

Prediction algorithm

SVM is a machine learning algorithm based on Statistical Learning Theory, which was introduced by Vapnik. It searches for an optimal separating hyperplane which maximizes the margin in feature space. SVM was originally introduced to solve binary classification problems. A one-versus-one framework was adopted in this work to deal with the multi-class classification problem. Altogether 4 classifiers were designed using SVM. For every two locations listed in Table 7, we construct a classifier, and the 4th classifier is constructed for the two different membrane protein types at the inner membrane.

Table 7 The classifiers parameters and accuracy

Since the RBF kernel is flexible and is the most widely used kernel function, an RBF kernel function is used in our classifiers. The RBF kernel function is defined as follows:

\[ K\left( \vec{x}_i, \vec{x}_j \right) = \exp\left( -\gamma \left\| \vec{x}_i - \vec{x}_j \right\|^2 \right) \tag{12} \]

where $\vec{x}_i$ and $\vec{x}_j$ are feature vectors, and γ is a parameter.
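
Equation 12 can be written directly in Python. This sketch uses plain lists rather than any particular SVM library, and the function name is hypothetical.

```python
import math

def rbf_kernel(x_i, x_j, gamma):
    """RBF kernel of equation 12 for two equal-length feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)

same = rbf_kernel([0.2, 0.5], [0.2, 0.5], gamma=0.1)   # identical vectors
apart = rbf_kernel([1.0, 0.0], [0.0, 0.0], gamma=1.0)  # squared distance 1
```

The kernel equals 1 for identical vectors and decays toward 0 as the vectors move apart, with γ controlling how fast it decays.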

We use a grid search approach, assisted by manual trials, to find a good combination of the parameters C and γ for each classifier, where C is the cost parameter of SVM and γ is the parameter of the RBF kernel function. The results of parameter optimization and the leave-one-out cross validation accuracy for the four classifiers are shown in Table 7.
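
A grid search over C and γ can be sketched as below. This is not the authors' implementation: the exponent ranges are an assumption (the text does not state the grid), and `toy_score` is a hypothetical stand-in for the leave-one-out accuracy that a real run would obtain by training an SVM for each parameter pair.

```python
import math
from itertools import product

def grid_search(score, c_values, gamma_values):
    """Return (best_score, best_C, best_gamma) over all parameter pairs.

    score(C, gamma) is expected to return a cross validation accuracy.
    """
    best = None
    for C, gamma in product(c_values, gamma_values):
        s = score(C, gamma)
        if best is None or s > best[0]:
            best = (s, C, gamma)
    return best

# Assumed logarithmic grids (not taken from the paper).
c_grid = [2.0 ** e for e in range(-5, 16, 2)]
gamma_grid = [2.0 ** e for e in range(-15, 4, 2)]

def toy_score(C, gamma):
    # Hypothetical score surface that peaks at C = 2^3 and gamma = 2^-5.
    return -abs(math.log2(C) - 3) - abs(math.log2(gamma) + 5)

best_score, best_C, best_gamma = grid_search(toy_score, c_grid, gamma_grid)
```

A coarse logarithmic grid like this is usually followed by a finer search, or by manual trials as described above, around the best cell.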

When predicting the submitochondria location of a test sample, the first 3 classifiers vote on the sample, so that it receives a score for each of the 3 submitochondria locations. The location with the highest score is reported as the prediction. If the three locations have the same score, the predictor reports "unknown" as the result. If the test sample is predicted to localize at the inner membrane, the fourth classifier then predicts its membrane protein type.
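
The voting and cascade logic described above can be sketched in Python as follows. The classifiers are represented by arbitrary callables; the function name, the location labels and the `(location, membrane_type)` return value are assumptions of this sketch rather than the SubMito interface.

```python
LOCATIONS = ("inner membrane", "outer membrane", "matrix")

def predict(sample, pairwise_classifiers, type_classifier):
    """One-versus-one voting followed by membrane protein type prediction.

    pairwise_classifiers: three callables, each returning one of the two
    locations it separates. type_classifier: callable returning a membrane
    protein type for an inner membrane protein.
    """
    votes = {location: 0 for location in LOCATIONS}
    for classifier in pairwise_classifiers:
        votes[classifier(sample)] += 1
    if len(set(votes.values())) == 1:      # every location has the same score
        return "unknown", None
    location = max(votes, key=votes.get)
    membrane_type = None
    if location == "inner membrane":       # second stage of the cascade
        membrane_type = type_classifier(sample)
    return location, membrane_type
```

With three pairwise classifiers the vote counts are either 2, 1, 0 or 1, 1, 1, so the only possible tie is the three-way tie that yields "unknown".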

Availability and requirements

Project name: SubMito.

Project home page: http://bioinfo.au.tsinghua.edu.cn/subMito.

Operating system: online service is web based; local version of the software is platform independent.

Programming language: Java and PHP.

Other requirements: online service needs a web browser supporting JavaScript. Local version of the software needs Java Runtime Environment version higher than 1.5.0.

License: free.

For non-academic use, please contact daulyd@tsinghua.edu.cn.

References

  1. Gottlieb RA: Programmed cell death. Drug News Perspect 2000, 13: 471–476.
  2. Jassem W, Fuggle SV, Rela M, Koo DD, Heaton ND: The role of mitochondria in ischemia/reperfusion injury. Transplantation 2002, 73: 493–499. 10.1097/00007890-200202270-00001
  3. Emanuelsson O, Nielsen H, Brunak S, von Heijne G: Predicting Subcellular Localization of Proteins Based on their N-terminal Amino Acid Sequence. J Mol Biol 2000, 300: 1005–1016. 10.1006/jmbi.2000.3903
  4. Nakai K, Horton P: PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. Trends in Biochem Sci 1999, 24: 34–35. 10.1016/S0968-0004(98)01336-X
  5. Andrade MA, O'Donoghue SI, Rost B: Adaptation of Protein Surfaces to Subcellular Location. J Mol Biol 1998, 276: 517–525. 10.1006/jmbi.1997.1498
  6. Cedano J, Aloy P, Perez-Pons JA, Querol E: Relation Between Amino Acid Composition and Cellular Location. J Mol Biol 1997, 266: 594–600. 10.1006/jmbi.1996.0804
  7. Cui Q, Jiang T, Liu B, Ma S: Esub8: A novel tool to predict protein subcellular localization in eukaryotic organisms. BMC Bioinformatics 2004, 5: 66.
  8. Zhou G-P, Doctor K: Subcellular location prediction of apoptosis proteins. PROTEINS: Structure, Function, and Genetics 2003, 50: 44–48. 10.1002/prot.10251
  9. Cai Y-D, Liu X-J, Xu X-b, Chou K-C: Support Vector Machines for Prediction of Protein Subcellular Location by Incorporating Quasi-Sequence-Order Effect. Journal of Cellular Biochemistry 2002, 84: 343–348. 10.1002/jcb.10030
  10. Chou K-C: Prediction of Protein Cellular Attributes Using Pseudo-Amino Acid Composition. PROTEINS: Structure, Function, and Genetics 2001, 43: 246–255. 10.1002/prot.1035
  11. Huang Y, Li Y: Prediction of protein subcellular locations using Fuzzy K-NN method. Bioinformatics 2004, 20: 21–28. 10.1093/bioinformatics/btg366
  12. Park K-J, Kanehisa M: Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics 2003, 19(13): 1656–1663. 10.1093/bioinformatics/btg222
  13. Guda C, Subramaniam S: pTARGET: A new method for predicting protein subcellular localization in eukaryotes. Bioinformatics 2005, 21: 3963–3969. 10.1093/bioinformatics/bti650
  14. Chou K-C, Cai Y-D: Prediction of protein subcellular locations by GO-FunD-PseAA predictor. Biochemical and Biophysical Research Communications 2004, 320: 1236–1239. 10.1016/j.bbrc.2004.06.073
  15. Chou K-C, Cai Y-D: A new hybrid approach to predict subcellular localization of proteins by incorporating gene ontology. Biochemical and Biophysical Research Communications 2003, 311: 743–747. 10.1016/j.bbrc.2003.10.062
  16. Yuan Z: Prediction of protein subcellular location using Markov chain models. FEBS Letters 1999, 451: 23–26. 10.1016/S0014-5793(99)00506-2
  17. Chou K-C, Elrod DW: Protein subcellular location prediction. Protein Engineering 1999, 12: 107–118. 10.1093/protein/12.2.107
  18. Chou K-C, Elrod DW: Using Discriminant Function for Prediction of Subcellular Location of Prokaryotic Proteins. Biochemical and Biophysical Research Communications 1998, 252: 63–68. 10.1006/bbrc.1998.9498
  19. Cai Y-D, Liu X-J, Xu X-b, Chou K-C: Support Vector Machines for Prediction of Protein Subcellular Location. Molecular Cell Biology Research Communications 2000, 4: 230–233. 10.1006/mcbr.2001.0285
  20. Hua S, Sun Z: Support vector machine approach for protein subcellular localization prediction. Bioinformatics 2001, 17: 721–728. 10.1093/bioinformatics/17.8.721
  21. Sarda D, Chua GH, Li K-B, Krishnan A: pSLIP: SVM based protein subcellular localization prediction using multiple physicochemical properties. BMC Bioinformatics 2005, 6: 152.
  22. Reinhardt A, Hubbard T: Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Research 1998, 26(9): 2230–2236. 10.1093/nar/26.9.2230
  23. Cai Y-D, Chou K-C: Using Neural Networks for Prediction of Subcellular Location of Prokaryotic and Eukaryotic Proteins. Molecular Cell Biology Research Communications 2000, 4: 172–173. 10.1006/mcbr.2001.0269
  24. Chou K-C, Shen H-B: Predicting eukaryotic protein subcellular location by fusing optimized evidence-theoretic K-nearest neighbor classifiers. Journal of Proteome Research 2006, 5: 1888–1897. 10.1021/pr060167c
  25. Chou K-C, Shen H-B: Hum-PLoc: A novel ensemble classifier for predicting human protein subcellular localization. Biochemical and Biophysical Research Communications 2006, 347: 150–157. 10.1016/j.bbrc.2006.06.059
  26. Chou K-C, Shen H-B: Predicting protein subcellular location by fusing multiple classifiers. Journal of Cellular Biochemistry 2006, 1097–4644.
  27. Feng Z-P: An overview on predicting the subcellular location of a protein. In Silico Biology 2002, 2: 291–303.
  28. Chou K-C: Review: Prediction of protein structural classes and subcellular locations. Current Protein and Peptide Science 2000, 1: 171–208. 10.2174/1389203003381379
  29. Chou K-C, Cai Y-D: Predicting protein localization in budding yeast. Bioinformatics 2004, 21: 944–950. 10.1093/bioinformatics/bti104
  30. Chou K-C, Shen H-B: Addendum to "Hum-PLoc: A novel ensemble classifier for predicting human protein subcellular localization". Biochemical and Biophysical Research Communications 2006. Available online 14 August 2006.
  31. Scott MS, Thomas DY, Hallett MT: Predicting Subcellular Localization via Protein Motif Co-Occurrence. Genome Research 2004, 14: 1957–1966. 10.1101/gr.2650004
  32. Bickmore WA, Sutherland HGE: Addressing protein localization within the nucleus. The EMBO Journal 2002, 21: 1248–1254. 10.1093/emboj/21.6.1248
  33. Sutherland HGE, Mumford GK, Newton K, Ford LV, Farrall R, Dellaire G, Caceres JF, Bickmore WA: Large-scale identification of mammalian proteins localized to nuclear sub-compartments. Human Molecular Genetics 2001, 10(18): 1995–2011. 10.1093/hmg/10.18.1995
  34. Dellaire G, Farrall R, Bickmore WA: The Nuclear Protein Database (NPD): sub-nuclear localisation and functional annotation of the nuclear proteome. Nucleic Acids Research 2003, 31(1): 328–330. 10.1093/nar/gkg018
  35. Lei Z, Dai Y: An SVM-based system for predicting protein subnuclear localizations. BMC Bioinformatics 2005, 6: 291.
  36. Shen H-B, Chou K-C: Predicting protein subnuclear location with optimized evidence-theoretic K-nearest classifier and pseudo amino acid composition. Biochemical and Biophysical Research Communications 2005, 337: 752–756.
  37. Guda C, Fahy E, Subramaniam S: MITOPRED: a genome-scale method for prediction of nucleus-encoded mitochondrial proteins. Bioinformatics 2004, 20: 1785–1794. 10.1093/bioinformatics/bth171
  38. Chou K-C, Cai Y-D: Prediction of membrane protein types by incorporating amphipathic effects. Journal of Chemical Information and Modeling 2005, 45: 407–413. 10.1021/ci049686v
  39. Chou K-C, Cai Y-D: Using GO-PseAA predictor to identify membrane proteins and their types. Biochemical and Biophysical Research Communications 2005, 327: 845–847. 10.1016/j.bbrc.2004.12.069
  40. Chou K-C, Elrod DW: Prediction of membrane protein types and subcellular locations. PROTEINS: Structure, Function, and Genetics 1999, 34: 137–153. 10.1002/(SICI)1097-0134(19990101)34:1<137::AID-PROT11>3.0.CO;2-O
  41. Liu H, Wang M, Chou K-C: Low-frequency Fourier spectrum for predicting membrane protein types. Biochemical and Biophysical Research Communications 2005, 336: 737–739. 10.1016/j.bbrc.2005.08.160
  42. Shen H-B, Chou K-C: Using optimized evidence-theoretic K-nearest neighbor classifier and pseudo amino acid composition to predict membrane protein types. Biochemical and Biophysical Research Communications 2005, 334: 288–292. 10.1016/j.bbrc.2005.06.087
  43. Wang M, Yang J, Liu G-P, Xu Z-J, Chou K-C: Weighted-support vector machines for predicting membrane protein types based on pseudo amino acid composition. Protein Engineering, Design, and Selection 2004, 17: 509–516. 10.1093/protein/gzh061
  44. Wang M, Yang J, Xu Z-J, Chou K-C: SLLE for predicting membrane protein types. Journal of Theoretical Biology 2005, 232: 7–15. 10.1016/j.jtbi.2004.07.023
  45. Wang S-Q, Yang J, Chou K-C: Using stacked generalization to predict membrane protein types based on pseudo amino acid composition. Journal of Theoretical Biology 2006, in press.
  46. Matthews B: Comparison of predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta 1975, 405: 442–451.
  47. Heazlewood JL, Tonti-Filippini JS, Gout AM, Day DA, Whelan J, Millar AH: Experimental Analysis of the Arabidopsis Mitochondrial Proteome Highlights Signaling and Regulatory Components, Provides Assessment of Targeting Prediction Programs, and Indicates Plant-Specific Mitochondrial Proteins. Plant Cell 2004, 16: 241–256. 10.1105/tpc.016055
  48. Kumar A, Agarwal S, Heyman JA, Matson S, Heidtman M, Piccirillo S, Umansky L, Drawid A, Jansen R, Liu Y, et al.: Subcellular localization of the yeast proteome. Genes & Development 2002, 16: 707–719. 10.1101/gad.970902
  49. Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, et al.: The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Research 2005, 34: 187–191. 10.1093/nar/gkj161
  50. Li W, Jaroszewski L, Godzik A: Clustering of highly homologous sequence to reduce the size of large protein database. Bioinformatics 2001, 17: 282–283. 10.1093/bioinformatics/17.3.282
  51. Gao Q-B, Wang Z-Z, Yan C, Du Y-H: Prediction of protein subcellular location using a combined feature of sequence. FEBS Letters 2005, 579: 3444–3448. 10.1016/j.febslet.2005.05.021
  52. Lio P, Vannucci M: Wavelet change-point prediction of transmembrane proteins. Bioinformatics 2000, 16: 376–382. 10.1093/bioinformatics/16.4.376
  53. Kawashima S, Ogata H, Kanehisa M: AAindex: amino acid index database. Nucleic Acids Research 2000, 28: 374. 10.1093/nar/28.1.374
  54. Chou K-C, Cai Y-D: Prediction of protease types in a hybridization space. Biochemical and Biophysical Research Communications 2006, 339: 1015–1020. 10.1016/j.bbrc.2005.10.196
  55. Chou K-C, Cai Y-D: Predicting protein-protein interactions from sequence in a hybridization space. Journal of Proteome Research 2006, 5: 316–322. 10.1021/pr050331g
  56. Chou K-C, Cai Y-D: Predicting enzyme family class in a hybridization space. Protein Science 2004, 13: 2857–2863. 10.1110/ps.04981104

Acknowledgements

Thanks to Dr. Jun Cai for helpful discussions. Thanks to Katherine Zhang for helping us with the language. This work is partially supported by NSFC projects no. 60234020 and 60572086 of China.

Author information

Corresponding author

Correspondence to Yanda Li.

Additional information

Authors' contributions

PD extracted the data from the Swiss-Prot database, implemented the algorithm, carried out the analyses and wrote the manuscript. YL guided the design of the study, analysed the data and results and wrote the manuscript. All authors read and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Du, P., Li, Y. Prediction of protein submitochondria locations by hybridizing pseudo-amino acid composition with various physicochemical features of segmented sequence. BMC Bioinformatics 7, 518 (2006). https://doi.org/10.1186/1471-2105-7-518

Keywords

  • Feature Vector
  • Dipeptide Composition
  • Subcellular Location Prediction
  • Membrane Protein Type
  • Mitochondrion Protein