Inferring modules of functionally interacting proteins using the Bond Energy Algorithm
© Watanabe et al; licensee BioMed Central Ltd. 2008
Received: 28 February 2008
Accepted: 17 June 2008
Published: 17 June 2008
Non-homology based methods such as phylogenetic profiles are effective for predicting functional relationships between proteins with no considerable sequence or structure similarity. Those methods rely heavily on traditional similarity metrics defined on pairs of phylogenetic patterns. Proteins do not exclusively interact in pairs as the final biological function of a protein in the cellular context is often hold by a group of proteins. In order to accurately infer modules of functionally interacting proteins, the consideration of not only direct but also indirect relationships is required.
In this paper, we used the Bond Energy Algorithm (BEA) to predict functionally related groups of proteins. With BEA we create clusters of phylogenetic profiles based on the associations of the surrounding elements of the analyzed data using a metric that considers linked relationships among elements in the data set.
Using phylogenetic profiles obtained from the Cluster of Orthologous Groups of Proteins (COG) database, we conducted a series of clustering experiments using BEA to predict (upper level) relationships between profiles. We evaluated our results by comparing with COG's functional categories, And even more, with the experimentally determined functional relationships between proteins provided by the DIP and ECOCYC databases. Our results demonstrate that BEA is capable of predicting meaningful modules of functionally related proteins. BEA outperforms traditionally used clustering methods, such as k-means and hierarchical clustering by predicting functional relationships between proteins with higher accuracy.
This study shows that the linked relationships of phylogenetic profiles obtained by BEA is useful for detecting functional associations between profiles and extending functional modules not found by traditional methods. BEA is capable of detecting relationship among phylogenetic patterns by linking them through a common element shared in a group. Additionally, we discuss how the proposed method may become more powerful if other criteria to classify different levels of protein functional interactions, as gene neighborhood or protein fusion information, is provided.
The development of automated methods for inferring functional association of proteins from sequence and genomic data is becoming an increasingly important area of investigation in bioinformatics and computational biology. In effect, the determination of unknown gene interactions in functional pathways and perhaps, their association with diseases, relies crucially on sound computational algorithms capable of producing meaningful predictions.
The homology-based methods are useful to assign function to proteins by establishing sequence similarity of the underlying sequences with others with previously assigned function . There is a variety of those homology-based methods that use specific aspects of similarity between protein attributes, such as similarity between complete sequences [2–4], presence of motifs and functional blocks , specific spatial positions of functional residues [6, 7] or combinations of the aforementioned .
However, when similarity of the underlined sequences is not sufficiently significant, alternative approaches have been considered. Several non homology-based methods have been developed to predict functional relationships between proteins [9, 10], using additional sources of genomic information [11–13]. These methods have been called application-based in context, and they include phylogenetic profiles, protein fusions (Rosetta Stone), gene coexpresion, and neighborhood conservation . It has been demonstrated that functional relationships, functional modules, molecular networks and genotype-phenotype relationships can be accurately predicted using these methods .
Among modern post-genomic approaches developed in recent years, those based on the correlated presence and absence of genes (i. e. phylogenetic profiles) among a collection of organisms have proven to be particularly effective [14, 15]. Theoretically, with the increasing availability of complete genome sequences from more organisms, these methods hold the promise of increasing efficacy. Particularly, phylogenetic profiles have been successfully used for assigning protein function, for localizing proteins in cells, for reconstructing metabolic pathways, and for phylogenetic studies, among other applications [11, 16–18].
Predictions obtained from phylogenetic profiles depend critically on the employed clustering method. Most clustering algorithms used to date are based on the calculation of Euclidean and Hamming distances between pair of elements , which means that the clustering is directed by the intrinsic properties of these patterns and no additional information is often considered, albeit there are few exceptions . For example, other studies have employed a variety of metrics such as intergenic distance  and kernels . It is known that proteins do not only interact by pairs . For example, in the case of metabolic pathways, a biologically meaningful function is normally performed by a group of proteins. Also, our previous work demonstrated that often proteins can have functional analogs with no sequence similarity that can displace each other . Thus, in order to predict a link, it is necessary to employ a method capable of considering not only a direct but also an indirect relationship created by the association with a third party element. The Bond Energy Algorithm (BEA) is a widely used methodology to create vertical fragmentation of distributed databases. This algorithm creates a collection of clusters based on the relationships of the surrounding elements of the analyzed data in a particular cluster using an non-trivial Attribute Affinity measure, which is a weight matrix denoting the strength of the relationship among all the elements in the data set. We would like to posit here that non obvious relationships can be predicted by this method when an apparently non related element is linked to another one by a third element [21, 22]. BEA uses a two step approach to clustering: the first step is the ordering algorithm, while the second step is the grouping algorithm.
Despite its clustering capabilities, BEA has only recently been used in bioinformatics studies . We propose a new method for clustering phylogenetic profiles, consisting in applying BEA, to predict functional associations of profiles and to detect displacements of functionally equivalent analog proteins. Here we show that BEA can detect functional associations between profiles not detected by conventional methods. A major goal of this work is to explore the extent in which the prediction of protein functional interactions can be accurately inferred from protein phylogenetic profiles. Therefore, we have focused on improving the clustering of these patterns by using implicit information not considered by most clustering methods. It is also possible to consider explicit domain knowledge in order to improve results in all clustering algorithms (e.g. an underlying metric incorporating knowledge constraints). Similarly, a consequence of our study is to find whether or not the implicit information derived from phylogenetic patterns is useful for inferring functional interactions without the resort of additional domain knowledge.
Results and Discussion
The first version of the COG database provides a collection of 3307phylogenetic profiles [23, 24]. Using this information we conducted a series of computational simulations with BEA and compared our results with those produced by other clustering methods, such as k-means, Hierarchical Clustering and Partition Around Medoids (PAM).
A predefined collection of input parameters is required by most clustering algorithms. The number of clusters, the initial position of cluster centers, and the distance metric, are typically provided as input to the clustering algorithm. Appropriate values for parameters are often determined empirically, especially, when there is little information on the underlying structure of the data set. The definition of these parameters is known to be critical for obtaining meaningful clustering results .
For instance, in the k-means algorithm we used a random positioning of 18 centers for the clusters. The determination of this number of centers was based on the number of different functional categories described in the COG database. Recall that COG phlogenetic profiles and COG functional categories are independent assessments. The selected metric was the Hamming distance due to the fact that it is a suitable measure for binary data. The accuracy of the algorithm was calculated as the average performance of 10 simulations of the algorithm based on the ten-fold validation method [26, 27]. The Matlab package  was employed for this experiment.
Results produced by k-means are typically sensitive to initial conditions. For example, a particular partition of points in which any single point is moved to a different cluster increases the total sum of distances. This problem can be approached by an appropriate choice of starting points. Specifically, we employed a version of the k-means algorithm that uses a two-phase iterative algorithm to minimize the sum of point-to-centroid distances, summed over all k clusters .
The first phase use "batch" updates, in which each iteration consists of re-assigning points to their nearest cluster centroid, all at once, followed by recalculation of cluster centroid. This phase may be viewed as providing a fast but potentially only approximate solution and as a starting point for the second phase. The second phase use "online" updates, in which points are individually reassigned in the space in order to reduce the sum of distances; cluster centroids are recomputed after each reassignment. Each iteration during this second phase consists of stepwise pass though all the points to assign the new centroids. For the Hierarchical Clustering experiments, we used the R package statistical toolkit (AGNES). AGNES is fully described in [25, 29]. Compared to other agglomerative clustering methods such as "hclus", "agnes" yields the agglomerative coefficient which measures the amount of clustering structure found.
The AGNES algorithm constructs a hierarchy of clusters. Initially each observation is a small cluster by itself. Clusters are merged until only one large cluster remains which contains all the observations. At each stage of the algorithm the two nearest clusters are combined to form a single larger cluster. We used the average method in which the distance between two clusters is the average of the dissimilarities between the points in one cluster and the points in the other. The obtained agglomerative coefficient of 0.7788952 provided the basis to cut the branch at the coefficient value. The chosen metric was again the Hamming distance. As mentioned above, we repeated 10 cycles and used the average result of them.
For the experiments with PAM we also used the R package statistical toolkit. 18 clusters was provided as input parameter to the algorithm as for "k-means". PAM is a more robust version of k-means since it additionally takes a dissimilarity matrix as input. PAM algorithm is based on the search for k representative objects or medoids among the observations of the data set. These observations should represent the structure of the data. After finding a set of k medoids, k clusters are constructed by assigning each observation to the nearest medoid. The goal is to find k representative objects which minimize the sum of the dissimilarities of the observations to their closest representative object. By default, when medoids are not specified, the algorithm first looks for a good initial set of medoids (this is called the build phase). Then, it finds a local minimum for the objective function, that is, a solution reached until there is not a single switch of an element within a medoid that decreases the objective function (this is called the swap phase). When the medoids are specified, their order is not critical, in general, the algorithms have been designed to be independ of the order of the observations. The metric was set to Manhattan, as before. We hold 10 cycles and took the average result for a better accuracy.
In contrast, for the experiments with BEA, the number of clusters was automatically created by the cutting method in conjunction with a grouping step based on the "guilty by association" principle. The Hamming was also used for the experiments with BEA and we hold 10 cycles, as before.
A more detailed description of BEA is presented in the Methods section.
Resulting classification for all methods
Information storage and processing
Validations with COG functional categories
For the first testing, we used the functional categories established by the COG database. The individual COGs are constructed by grouping putative orthologous by bidirectional best hit in completely sequenced genomes . Therefore, we think using COGs instead of individual proteins is less prone to classification errors. We tested BEA ordering algorithm by calibrating the cutting points by giving the functional category as an input. The functional category of a cluster is calculated by density (the majority of the elements in the cluster that have the same function). BEA accuracy was close to 99.90% of correct classification (3307 elements in the existing 18 categories). In effect, 3304 out of 3307 COGs were classified satisfactorily in each of the 18 existing categories. In contrast k-means classified 30.15% correctly, Hierarchical Clustering obtained 20.38% and PAM made 3.33% resulting in a notorious better accuracy for BEA.
In our previous studies we showed that COG0611 (Protein ThiL with Functional Classification H) and COG1564 (Protein Thi80 with Functional Classification H) are non homolog proteins with similar functions . In our present study we show that they are related by COG0352 (Protein Thiamine monophosphate synthase with Functional Classification H). This means that COG0611 and COG1564 were close and bonded by COG0352. This demonstrates that BEA ordering algorithm worked as expected, locating functionally related proteins close together and separating unrelated proteins. However, we expected a diminishing of the accuracy of results as we removed the functional categories as an input to BEA ordering algorithm.
Validations with DIP
For the second testing, we used DIP (Database of Interacting Proteins, Additional file 1). Particularly, 154 protein relationships were used for validations of the different methods . The DIP database describes experimentally determined physical interactions between pairs of proteins. Therefore, in this validation we consider that two proteins are related if they belong to the same cluster. If the DIP relationship is contained in the same cluster then it is presumed to be a true positive, otherwise the relationship was considered a false negative. A caveat of using DIP is that in this database proteins that do not belong to the same functional cluster could physically interact. We also verified if the relationship was close to the neighbor clusters, maximum five, in order to calculate the efficiency of the algorithm. BEA accuracy in classification was approximately 62.37% in the same cluster (97 out of 154 DIP relationships were correctly classified). In contrast k-means classified 46.10% correctly, Hierarchical Clustering obtained 22.73% and PAM made 21.43% resulting in a notorious better accuracy for BEA.
Validation 1 for DIP in the same cluster
Validation 2 for DIP in the near cluster
Validation 3 for DIP in the surrounding five clusters
Validations with ECOCYC
Validation 1 for ECOCYC in the same cluster
Validation 2 for ECOCYC in the near cluster
Validation 3 for ECOCYC in the surrounding five clusters
As can be seen in this testing, BEA has a higher accuracy in classifying relationships between phylogenetic profiles. However, the capability of BEA was not totally exploited as we discuss in the next section.
We compared the three validations and analyzed the data set of the DIP and ECOCYC to understand the grouping obtained by our method. The analysis concluded that for DIP, the 57.80% of the relationship of these proteins are in the same functional category and 42.21% in a different category. This shows us that BEA is classifying reasonably well even when the proteins belong to a different functional category, although our method was used for functional category classification. For the ECOCYC classification we found that 84.375% belong to the same functional category and 15.625% are in a different functional category and BEA found exactly the same result.
One interesting aspect to emphasize is that there should exists several useful classification criteria depending on the abstraction level in the conceptual hierarchy of the biology of organisms to be observed. For example, at the metabolic pathway level a proper classification would include proteins with different functional categories in the same cluster. Therefore, if we perform classifications based exclusively on functional categories then the metabolic pathways would be hardly revealed by this method.
We have not intended our validations to be exhaustive by using all available databases. However, the consideration of additional databases such as GO, Funcat, among others, would be useful for more comprehensive validations.
The focus of this study was to improve the prediction capabilities of phylogenetic profiles using BEA. Our results showed that BEA increases the accuracy of predictions of protein modules with respect to the traditional clustering methods, especially when the underlying phylogenetic patterns are relatively dissimilar. In effect, BEA was capable to detect relationships among proteins by relying on the presence or absence of third party proteins. This method is capable of finding relationships such as: if a protein A works with protein C and protein B is an analog of A then A and B will be related though C. So A and B will be functionally equivalent while A and C, and B and C will be functionally linked. This study shows that taking into account indirect relationships can be useful for detecting associations between proteins and reconstructing functional modules. If additional criteria is provided, as genomic context information, to classify different levels of relationships, higher level of accuracy could be achieved using BEA. Therefore, it will be useful to complement the information provided by phylogenetic profiles with additional genomic context information, such as intergenic distance and experimental data as implemented in GeConT .
This information can be included in the AA matrix to create clusters. BEA may also be improved by using different criteria for separating clusters to increase the capabilities of the algorithm for detecting genetic circuits for system biology . On the one hand, BEA has many advantages, such as low processing time and memory requirements. By using the information of the attribute affinity matrix, direct and indirect relationships are considered, so this creates a balanced and more accurate classification. Based on the results presented here, we showed that BEA holds much promise to create better classifications for protein relationships and gene function annotation. A limitation of BEA from the computational perspective is the greedy nature of the algorithm, such that results are sensitive to the order of the input data. This problem could be solved by using a Genetic Algorithms to optimize the search of the solution space [34, 35].
We employed a collection of different data sets for conducting the experiments and validations reported here.
Clusters of Orthologous Groups (COG) phylogenetic patterns 
This data set consists of a phylogenetic classification of proteins encoded in complete genomes. COGs were identified by comparison of protein sequences from 43 complete genomes, representing 30 major phylogenetic lineages.
In order to extract the maximum amount of information from the rapidly accumulating genome sequences, all conserved genes need to be classified according to their homologous relationships. Comparison of proteins encoded in seven completely sequenced genomes from five major phylogenetic lineages and elucidation of consistent patterns of sequence similarities allowed the delineation of 720 clusters of orthologous groups (COGs).
COG FUNCTIONAL CATEGORIES
Pathways and functional systems
Information storage and processing
Translation, ribosomal structure and biogenesis
DNA replication, recombination and repair
Cell division and chromosome partitioning
Posttranslational modification, protein turnover, chaperones
Cell envelope biogenesis, outer membrane
Cell motility and secretion
Inorganic ion transport and metabolism
Signal transduction mechanisms
Energy production and conversion
Carbohydrate transport and metabolism
Amino acid transport and metabolism
Nucleotide transport and metabolism
Secondary metabolites biosynthesis, transport and catabolism
General function prediction only
Database of Interacting Proteins (DIP) 
The DIP database catalogs experimentally determined interactions between proteins. It combines information from a variety of sources to create a single, consistent set of protein-protein interactions. The data stored within the DIP database were curated, both, manually by expert curators and also automatically using computational approaches that utilize the knowledge about the protein-protein interaction networks extracted from the most reliable, core subset of the DIP data [37, 38].
EcoCyc is a bioinformatics database that describes the genome and the biochemical machinery of E. coli K-12 MG1655. The long-term goal of the project is to describe the molecular catalog of the E. coli cell, as well as the functions of each of its molecular parts, to facilitate a system-level understanding of E. coli. EcoCyc is an electronic reference source for E. coli biologists, and for biologists who work with related microorganisms.
EcoCyc contains the complete genome sequence of E. coli, and describes the nucleotide position and function (if known) of every E. coli gene. A staff of five fulltime curators update the annotation of the E. coli genome on an ongoing basis using a literature-based curation strategy. Users can retrieve the nucleotide sequence of a gene, and the amino-acid sequence of a gene product.
EcoCyc describes all known metabolic pathways and signal-transduction pathways of E. coli. it describes each metabolic enzyme of E. coli, including its cofactors, activators, inhibitors, and subunit structure [40, 41].
The Bond Energy Algorithm (BEA) has been widely used for vertical fragmentation of distributed databases. This algorithm was originally proposed by McCormick and Hoffer and Severande . BEA creates clusters using a non-trivial similarity metric (attribute affinity measure) defined on the elements of the data set. In consequence, diffcult relationships in which a third element is needed to make the relationship obvious can be discovered by this method (e. g. transitive relationships).
BEA is comprised by two algorithms, the first one is used for ordering the data set to locate the most related elements close together (and to separate the unrelated elements) and the second one is used for creating the groups to determine at what point has to make a cut on the ordered data set (i.e. create a cluster).
The BEA Ordering Algorithm
It is designed specifically to determine groups of similar items and opposed to, say, a linear ordering of the items (i.e., it clusters the attributes with large affinity values together, and the ones with smaller values together).
The final groupings are insensitive to the order in which items are presented to the algorithm.
The algorithm complexity is similar to other methods but can have better results [O(n 2), where n is the number of attributes].
Secondary interrelationships between clustered attribute groups are identifiable.
whereaff(A0, A j ) = aff(A i , A0) = aff(An+1, A j ) = aff(A i , An+1) = 0
The last set of conditions takes care of the cases where an attribute is being placed in CA to the left of the leftmost attribute or to the right of the rightmost attribute during column permutations, and prior to the topmost row and following the last row during row permutations. In these cases, we take 0 to be the aff values between the attributes being considered for placement and its left or right (top or bottom) neighbors, which do not exist in CA.
where A i is an attribute of the relation, AA is the Attribute Affinity Matrix, CA is the Clustered Affinity Matrix, and AM is the Affinity Measure.
Initialization. Place and fix one of the columns of AA arbitrarily into CA. Column 1 was chosen in the algorithm.
Iteration. Pick each of the remaining n-i columns (where i is the number of columns already placed in CA) and try to place them in the remaining i + 1 positions in the CA matrix. Choose the placement that makes the greatest contribution to the global affinity measure described above. Continue this step until no more columns remain to be placed.
Row ordering. Once the column ordering is determined, the placement of the rows should also be reordered to make their relative positions match the position of the columns.
This is where BEA find indirect relationships.
Now consider the following n attributes:A1A2...Ai-1A i A j Aj+1...A n AM' = [A1A2...Ai-1] AM" = [Aj+1...A n ]
Now we consider placing a new attribute A k and A j in the clustered affinity matrix. The new global affinity measure can be similarly written as:AM new = AM' + AM" + bond(A i , A k ) + bond(A k , A i ) + bond(A k , A j ) + bond(A j , A k ) = AM' + AM" + 2bond(A i , A k ) - 2bond(A k , A j )
where AM' is the Affinity Measure before the insert position, AM" is the Affinity Measure after the insert position, bond(A i , A k ) is the Bond Energy Evaluation for the insertion of the elements, and bond(A k , A j ) is the Bond Energy Evaluation for the separation of the elements.
Thus, the next contribution to the global affinity measure of placing attribute A k between A i and A j is:cont(A i , A k , A j ) = AM new - AM old = bond(A i , A k ) + bond(A k , A j ) - bond(A i , A j )
The input data for our experiments were Phylogenetic Profiles, this means that the entries are strings of 1's and 0's that shows the presence or absence of a certain protein in some organisms, in which every column represented a gene and every row was an organism. The Hamming distance between pairs of phylogenetic profiles was used to calculate the entries of the Attribute Affinity Matrix (AA) that represents the relationship between proteins.
Then we run the algorithm as follows:
input: Phylogenetic Profiles.
output: CA: Clustered Affinity matrix and order list array.
[initialize; the AA matrix is created]
CA(•, 1) ← AA(•, 1)
CA(•, 2) ← AA(•, 2)
index ← 3
while index ≤ n do [choose the "best" location for profile AA index ]
for i from 1 to index - 1 by 1 do
calculate cont(AAi-1, AA index , AA i )
calculate cont(AAindex-1, AA index , AAindex+1[boundary condition]
loc ← placement given by maximum cont value
for j from index to loc by -1 do [shuffle the two matrices]
calculate CA(•, j) ← CA(•, j - 1)
CA(•, loc) ← AA(•, index)
index ← index + 1
order the rows according to the relative ordering of columns
Note: • means for each element in the data set.
The BEA Grouping Algorithm
Once BEA ordering algorithm was executed on the input data, the CA matrix must be grouped, for this propose we used a "guilty by association" method based on COG's functional categories. Particularly, the unknown function of a protein was transferred from the neighbor proteins with characterized function in the CA[45, 46].
if i + 1 and i - 1 category is known and category(i - 1) is equal to category(i + 1) then category(i) = category(i - 1)
where i is the position in the ordered matrix.
The cutting point is calculated when a change in the classification occurs.
We repeat the cut process until reaching the total number of elements in the CA matrix.
Our validations were made using the above mentioned data sets: the COG's functional categories for testing BEA ordering algorithm; and the DIP and ECOCYC databases were used as additional data sets for testing accuracy of BEA grouping algorithm.
We thank to Mateo Pellegrini for the interesting discussions of this work. We thank to Salvador Venegas for critical reading of the manuscript. We thank to Arturo Ocadiz and Shirley Ainsworth for technical assistance. This work was supported by the Consejo Nacional de Ciencia y Tecnología (CONACYT) under SEP-CONACYT award No. SEP-2004-C01-47434. Any opinions, findings and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the sponsoring agency.
- Schomburg I, Chang A, Ebeling C, Gremse M, Huhn HeldtGC, Schomburg D: BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Res 2004, D: D431–433. 10.1093/nar/gkh081View ArticleGoogle Scholar
- Rost B: Enzyme function less conserved than anticipated. J Mol Biol 2002, 318: 595–608. 10.1016/S0022-2836(02)00016-5View ArticlePubMedGoogle Scholar
- Tian W, Skolnick J: How well is enzyme function conserved as a function of pairwise sequence identity. Bioinformatics 2003, 333: 863–882.Google Scholar
- Wilson C, Kreychman J, Gerstein M: Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J Mol Biol 2000, 297: 233–249. 10.1006/jmbi.2000.3550View ArticlePubMedGoogle Scholar
- Henikoff J, Pietrokovski S, S H: Recent enhancements to the Blocks Databases servers. Nucleic Acid Res 1997, 25: 222–225. 10.1093/nar/25.1.222PubMed CentralView ArticlePubMedGoogle Scholar
- Fetrow J, Skolnick J: Method for prediction of protein function from sequence using the sequence-to-structure-to-function paradigm with application to glutaredoxins/thioredoxins and TI ribonucleases. J Mol Bio 1998, 281: 949–968. 10.1006/jmbi.1998.1993View ArticleGoogle Scholar
- Wallace A, Borkakoti N, Thrnton J: TESS: a geometric hasing algorithm for deriving ED coordinate templates for searching structural databases. Application to enzyme active sites. Protein Sci 1997, 6: 2308–2323.PubMed CentralView ArticlePubMedGoogle Scholar
- Tian W, Arakaki A, Skolnick J: EFICAz: a comprehensive approach for accurate genome-scale enzyme function inference. Nucleic Acid Res 2004, 32: 5225–5239.View ArticleGoogle Scholar
- Bork P, Jensen L, Mering C, Ramani A, Lee I, Marcotte E: Protein interaction networks from yeast to human. Curr Opin Struct Biol 2004, 14: 292–299. 10.1016/j.sbi.2004.05.003View ArticlePubMedGoogle Scholar
- Eisenberg D, Marcotte E, Xenarios I, Yeates T: Protein function in the post-genomic era. Nature 2000, 405: 823–826. 10.1038/35015694View ArticlePubMedGoogle Scholar
- Chen L, Vitkup D: Predicting genes for orphan metabolic activities using phylogenetic profiles. Genome Biol 2006, 7(2):R17. 10.1186/gb-2006-7-2-r17PubMed CentralView ArticlePubMedGoogle Scholar
- Vert J: A tree kernel to analyse phylogenetic profiles. Bioinformatics 2002, 18: S276–84.View ArticlePubMedGoogle Scholar
- Kim Y, Subramaniam S: Locally defined protein phylogenetic profiles reveal previously missed protein interactions and functional relationships. Proteins 2005, 62(4):1115–1124. 10.1002/prot.20830View ArticleGoogle Scholar
- Pellegrini M, Marcotte E, Thompson M, Eisenberg D, Yeates T: Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. PNAS 1999, 96(8):4285–4288. 10.1073/pnas.96.8.4285PubMed CentralView ArticlePubMedGoogle Scholar
- Morett E, Kobel J, Rajan E, Saab-Rincon G, Olvera L, Olvera M, Schmidt O, Snel B, Bork P: Systematic discovery of analogous enzymes in thiamin biosynthesis. Nature Biotechnology 2003, 21(7):790–795. 10.1038/nbt834View ArticlePubMedGoogle Scholar
- Wu J, Kasif S, DeLisi C: Identification of functional links between genes using phylogenetic profiles. Bioinformatics 2003, 19(12):1524–1530. 10.1093/bioinformatics/btg187View ArticlePubMedGoogle Scholar
- Yamanishi Y, Itoh M, Kenehisa M: Extraction of groups from phylogenetic profiles using independent component analysis. Genome Informatics 2002, 13: 61–70.PubMedGoogle Scholar
- Moreno G, Collado J: A poweful non-homology method for the prediction of operons in prokaryotes. Bioinformatics 2002, 18: s329-s336. 10.1093/bioinformatics/18.2.329View ArticleGoogle Scholar
- Glazko Galina V, Mushegian Arcady R: Detection of evolutionarily stable fragments of cellular pathways by herarchical clustering of phyletic patterns. Genome Biology 2004., 5(R32):Google Scholar
- Snoep J, Westerhoff H: Systems Biology: From isolation to integration, a systems biology approach for building the Silicon Cell. USA: Springer-Verlag; 2005.Google Scholar
- Arabie P, Hubert L: The bond energy algorithm revisted. IEEE Transaction on Systems. Man and Cybernetics 1990, 20: 268–274. 10.1109/21.47829View ArticleGoogle Scholar
- Liu Y, Navathe S, Civera J, Dasigi V, Ram A, Ciliax B, Dingledine R: Text mining biomedical literature for discovering gene-to-gene relationships: A comparative Study of algorithms. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2005, 2: 62–77. 10.1109/TCBB.2005.14View ArticlePubMedGoogle Scholar
- Tatusov R, Koonin E, Lipman D: A Genomic perspective on protein families. Science 1997, 278(5338):631–637. 10.1126/science.278.5338.631View ArticlePubMedGoogle Scholar
- Tatusov R, Natale D, Garkavtsev I, Tatusova T, Shankavaram U, Rao B, Kiryutin B, Galperin M, Fedrova N, Koonin E: The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 2001, 29: 22–28. 10.1093/nar/29.1.22PubMed CentralView ArticlePubMedGoogle Scholar
- Kaufman L, Rousseeuw P: Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley; 1990.View ArticleGoogle Scholar
- Kohavi R: A study of cross-validation and bootstrap for accuracy estimation and model selection. International Joint Conference on Artificial Intelligence (IJCAI), IJCAI 1995.Google Scholar
- Breiman L, Spector P: Submodel selection and evaluation in regression: The X-random case. International Statistical Review 1992, 60: D334-D337.Google Scholar
- The MathWorks – MATLAB and Simulink for Technical Computing[http://www.mathworks.com]
- The R Project for Statistica Toolkit[http://www.r-project.org/]
- COG clusters of orthologous groups database[http://www.ncbi.nih.gov/COG]
- Lithwick G, Margalit H: Relative predicted protein levels of functionally associated proteins are conserved across organisms. Nucleic Acids Research 2005, 33(3):1051–1057. 10.1093/nar/gki261PubMed CentralView ArticlePubMedGoogle Scholar
- Ciria R, C AG, Morett E, Merino E: GeConT: gene context analysis. Bioinformatics, Oxford University Press 2004.Google Scholar
- Bernhard OP: System Biology. New York: Cambridge University Press; 2006.Google Scholar
- Watanabe R, Vallejo E, E M: Inferring fuctional coupling of genes from phylogenetic profiles using the bond energy algorithm with genetic algorithms. Proc. of CICB 2006, CICB 2006.Google Scholar
- Watanabe R, Vallejo E, Morett E: Inferring fuctional coupling of genes from phylogenetic profiles using the bond energy algorithm. Automatic Function Prediction 2006.Google Scholar
- DIP: Home[http://dip.doe-mbi.ucla.edu/]
- Salwinski L, Miller C, Smith A, Pettit F, Bowie J, D E: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 2004, 32: D449-D451. 10.1093/nar/gkh086PubMed CentralView ArticlePubMedGoogle Scholar
- Xenarios I, Salwinski L, Duan X, Higney P, Kim S, D E: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res 2002, 30: 303–305. 10.1093/nar/30.1.303PubMed CentralView ArticlePubMedGoogle Scholar
- EcoCyc: Encyclopedia of Escherichia coli K-12 Genes and Metabolism[http://ecocyc.org]
- Krap P, Riley M, Saier M, Paulsen I, Collado-Vides J, Paley S, Pellegini-Toole A, Bonavides C, Gama-Castro S: The EcoCyc Database. Nucleic Acids Res 2002, 30: 56–58. 10.1093/nar/30.1.56View ArticleGoogle Scholar
- Keseler I, Collado-Vides J, Gama-Castro S, Ingraham J, Paley S, Paulsen I, Peralta-Gil M, Krap P: EcoCyc: a comprehensive database resource for Escherichia coli . Nucleic Acids Res 2005, 33: D334-D337. 10.1093/nar/gki108PubMed CentralView ArticlePubMedGoogle Scholar
- McCormick W, Schweitzer P, White T: Problem Decomposition and Data Reorganization by a Clustering Technique. Op Res 1972.Google Scholar
- Hoffer J: A clustering approach to the generation of subfiles for the design of a computer data base. PhD thesis. Ph.D. dissertion, Ithaca, N.Y.: Department of Operations Research, Cornell University; 1975.Google Scholar
- Navathe S, Ceri S, Wiederhold G, Dou J: Vertical partitioning of algorithms for database design. ACM Trans Databese Syst 1984.Google Scholar
- Fellenberg M, Albermann K, Zollner A, Mewes H, Hani J: Integrative analysis of protein interaction data. Proc Int Conf Intell Syst Mol Biol 2000, 8: 152–161.PubMedGoogle Scholar
- Schwikowski B, Uetz P, Fields S: A network of protein-protein interaction in yeast. Nature Biotechnology 2000, 18: 1257–1261. 10.1038/82360View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.