Connectivity vs evolutionary rate
In our analysis of several different datasets we found that a correlation between the number of protein interactions and evolutionary rate exists in some datasets and not in others (Figure 2). Where correlations were observed, they were weak (Spearman's ρ between -0.1 and -0.25) but statistically significant (P < 10⁻³). The weakness of the observed correlations could be attributed to the incomplete nature of the known yeast interactome. It has been estimated that the ~6000 proteins in yeast participate in at most 40,000 interactions [38–40]. GRID, the largest dataset, contains 4907 proteins and a total of 17,598 interactions. Considering that this total also includes a percentage of false positives, it is clear that only a fraction of yeast interactions have been measured. There is also very little overlap between the interactions returned by different experimental methods. Gavin et al. (2002) reported 3957 interactions using the TAP method, of which only 63 can be found in the dataset of Uetz et al. (2000). Although this could be due to different experimental methods favouring different types of interactions, the lack of agreement underlines how partial the picture is to date.
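The two proportions behind this argument can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted above; the variable names are ours, and the coverage figure is a naive ratio that ignores false positives and treats the 40,000 estimate as exact.

```python
# Counts quoted in the text above.
ESTIMATED_MAX_INTERACTIONS = 40_000  # upper estimate for yeast [38-40]
GRID_INTERACTIONS = 17_598           # interactions in GRID
GAVIN_INTERACTIONS = 3_957           # interactions reported by Gavin et al.
GAVIN_UETZ_SHARED = 63               # of those, also present in Uetz et al.

# Naive coverage: assumes every GRID interaction is real, which the text
# says is not the case, so the true coverage is lower than this ratio.
grid_coverage = GRID_INTERACTIONS / ESTIMATED_MAX_INTERACTIONS

# Share of the Gavin interactions that the Uetz dataset also contains.
gavin_in_uetz = GAVIN_UETZ_SHARED / GAVIN_INTERACTIONS

print(f"GRID / estimated interactome: {grid_coverage:.1%}")
print(f"Gavin interactions also in Uetz: {gavin_in_uetz:.1%}")
```

Under these assumptions GRID covers less than half of the estimated interactome, and fewer than 2 in 100 of the Gavin interactions are shared with the Uetz set.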
The large fraction of missing data on evolutionary distance is another factor that may explain the weakness of the correlations. When searching for orthologs in M. musculus we found BRH orthologs for only half of the proteins in the GRID dataset (Table 1). As a result, 50% of the nodes in the interactome are missing from our final correlation graph (Figure 1). A highly interacting node that is missing one interaction would simply shift position on the graph; if the node itself is missing, however, the graph loses a point. In this case 50% of the points are missing, and it is entirely possible that the 50% that are present have the wrong number of interactions.
To address the issue of missing orthologs, we searched for orthologous proteins in the more closely related species S. paradoxus. Far more orthologs were found (over 80%); however, the strength and general pattern of the correlations remained the same as before (Figure 2). The slight disparity between the cerevisiae-musculus and cerevisiae-paradoxus based correlations is probably due to the relatively small amount of evolutionary change that has occurred between S. cerevisiae and S. paradoxus. Because the correlation remains weak even though we located more orthologs, it seems that the primary reason for its weak magnitude may be the incomplete nature of the network.
Our error rate analysis showed that the accuracy of the datasets varied (Table 2). When assessing accuracy it is important to consider the three error rate indicators collectively. Any single indicator could be misleading, as the experimental methods used could bias it. For example, the MIPS_Genetic dataset has a high EPR index (74.9%), which on its own suggests that the dataset is highly accurate. However, its Reference index is 4.08% and its LS index is 27.52%. The high EPR index is explained by the experimental methods used to obtain the interactions in this dataset (synthetic lethality and suppression analysis), which test for functional interactions. Functionally related genes tend to be expressed in a similar manner, so a high EPR index would be expected.
In order to translate the error rates into a meaningful consensus-based representation, we calculated the average rank over the three independent measures to give an Average Accuracy Rank (Figure 3). Three datasets were omitted: the MIPS_Genetic dataset, as it is a set of functional rather than physical interactions; the BIND dataset, as we could only calculate one of the three error rates for it; and the INTACT_SMALL dataset, as its small size makes its error rates uncertain. From Table 2 and Figure 3, it is clear that the MIPS_Physical dataset shows no correlation and that its accuracy is amongst the lowest. In general, we find that datasets that demonstrate stronger correlations between connectivity and evolutionary rate are more accurate, whereas datasets that show no correlation are less accurate.
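The rank averaging can be illustrated with a short sketch. It assumes that a higher value of each of the three indices means a more accurate dataset, so the highest value receives rank 1, and that tied values share the mean of their ranks. The dataset names and percentages are hypothetical placeholders, not the values in Table 2, and the function names are ours.

```python
from collections import defaultdict

def ranks_descending(values):
    """Rank values so the largest gets rank 1; ties share their mean rank."""
    positions = defaultdict(list)
    for pos, value in enumerate(sorted(values, reverse=True), start=1):
        positions[value].append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def average_accuracy_rank(indicators):
    """indicators: {dataset: (epr, reference, ls)} -> {dataset: mean rank}."""
    names = list(indicators)
    n_measures = len(indicators[names[0]])
    totals = {name: 0.0 for name in names}
    for m in range(n_measures):
        column = [indicators[name][m] for name in names]
        for name, rank in zip(names, ranks_descending(column)):
            totals[name] += rank
    return {name: totals[name] / n_measures for name in names}

# Hypothetical placeholder percentages (EPR, Reference, LS), NOT Table 2.
example = {
    "SET_A": (60.0, 30.0, 50.0),
    "SET_B": (40.0, 35.0, 45.0),
    "SET_C": (20.0, 10.0, 45.0),
}
aar = average_accuracy_rank(example)
for name, value in sorted(aar.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.2f}")
```

A lower Average Accuracy Rank then indicates a dataset that is consistently ranked as more accurate across the three indicators, which is why no single indicator can dominate the consensus.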
The UETZ dataset has relatively high accuracy levels for all three of our error measures. Its consensus accuracy is higher than that shown by GRID and MINT (Figure 3). However, it shows no statistical correlation between connectivity and rate of change. The UETZ dataset was obtained via the Y2H method, which has previously been shown to be an inaccurate experimental process [13]. A plausible explanation for the lack of correlation is the under-representation of highly interacting proteins. The UETZ dataset contains a comparable number of proteins to the GAVIN and HO sets, yet it contains significantly fewer interactions (Table 1). It contains 1438 interactions for 1328 proteins, an average of about 1.08 interactions per protein, whereas the GAVIN and HO sets average over 2.3 interactions per protein. It is fair to say that, with just 1.08 interactions per protein, proteins with more than one interaction are highly under-represented in the UETZ dataset.
The HO dataset has a low accuracy according to our consensus measure (Figure 3), yet it returns a strong correlation. It has the lowest EPR index and LS index of all the datasets. Von Mering's analysis estimated the experimental method used in this set to have an accuracy of only 2% [13]. The strength of the correlation between connectivity and evolutionary rate in this dataset could be due to a previously discussed, artificially generated association between connectivity and expression level [9]. Specifically, the artificial correlation shown by the HO dataset could arise from the experimental method used to generate the interactions. Ho et al. used the HMS-PCI protocol, in which the bait proteins are transiently overexpressed. This overexpression may have led to the detection of a large number of false interactions for highly expressed genes. However, our partial correlation analysis does not support this conclusion: when we control for expression level, the correlation observed between connectivity and evolutionary rate still exists.
An analysis of the overlap between the accumulative datasets (sets containing data from many sources) and the single experimental method datasets (HO, GAVIN, ITO, UETZ) further corroborates the findings from the error rate analysis (Table 3). The DIP_Core dataset has very little overlap with the inaccurate ITO dataset. This gives further support to our initial assertion that the DIP_Core dataset contains a large fraction of good interactions. Interestingly, the three accumulative datasets that showed no correlation between connectivity and evolutionary rate (BIND, MIPS_Genetic and MIPS_Physical) had very little overlap with the GAVIN dataset. The GAVIN dataset was obtained by the affinity purification method, and our error rate analysis considers it to be quite accurate. This suggests that the BIND and MIPS databases are missing a very large fraction of affinity-purification data.
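An overlap count of this kind can be sketched as a set intersection. The sketch assumes that an interaction is an unordered pair of protein identifiers, so that A-B and B-A count as the same interaction, and that duplicates within a dataset are counted once. The protein labels are toy values and the function names are ours.

```python
def normalise(interactions):
    """Treat each interaction as an unordered pair and drop duplicates."""
    return {frozenset(pair) for pair in interactions}

def overlap(dataset_a, dataset_b):
    """Return (shared count, fraction of dataset_a also found in dataset_b)."""
    set_a, set_b = normalise(dataset_a), normalise(dataset_b)
    shared = len(set_a & set_b)
    fraction = shared / len(set_a) if set_a else 0.0
    return shared, fraction

# Toy identifiers for illustration only, not real interaction data.
single_method = [("A", "B"), ("B", "C"), ("D", "E")]
accumulative = [("B", "A"), ("F", "G")]

shared, fraction = overlap(single_method, accumulative)
print(f"{shared} shared interaction(s), {fraction:.1%} of the first set")
```

Note that the fraction is asymmetric: it is expressed relative to the first dataset, so the overlap of a small set with a large one can look very different from the reverse comparison.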
It was also noted that the MIPS_Genetic dataset has very little overlap with any of the single experimental method datasets. To a certain extent this is to be expected as the MIPS_Genetic set contains functional interactions as opposed to physical interactions. The lack of congruence between the MIPS_Genetic set and physical interaction datasets highlights the stark differences between functional interactions and physical interactions.
Abundance vs evolutionary rate
We used the mRNA expression levels in yeast as a measure of abundance. Table 1 shows the correlations observed in all the datasets between the three factors: evolutionary rate, abundance and interactions. Strong and significant correlations between abundance and evolutionary rate are detected for all the datasets bar the SMALL dataset. A possible explanation for the absence of a correlation in the SMALL dataset is that only 60 proteins in this set had both evolutionary rate and expression information. As noted earlier, such a small set of nodes may not contain enough information to display a significant correlation.
Datasets with high accuracy also demonstrated a relationship between abundance and interactions: a positive correlation (abundant proteins tend to possess more interactions) of a similar magnitude to that observed between interactions and evolutionary rate. This correlation was not observed in sets considered to be inaccurate. A simple explanation could be that proteins which are broadly present in the cell have a greater functional role and therefore participate in many interactions.
When the two correlations are compared, the abundance vs. evolutionary rate correlation is far stronger than the interactions vs. evolutionary rate correlation. One explanation could be that the interactome is far from complete, as discussed earlier. Expression data, on the other hand, are far more exhaustive, with expression levels known for 6172 proteins [34]. Expression data are also thought to be of better quality [15].
Previously it was suggested that affinity-purification methods were biased in that they measured more interactions for highly expressed proteins [9]. This assertion was based on the observation, only in affinity-purification sets, of a positive correlation between number of interactions and expression levels, i.e. highly expressed proteins had more interactions. It is a questionable claim, as it can equally be argued that highly expressed proteins are more abundant because of their important functional role, and that such a role may require them to interact with many proteins.
Our findings throw further doubt on the claim, as we also observed positive correlations between expression levels and number of interactions in the accumulative datasets and in our "gold standard" dataset, DIP_Core. Accumulative datasets contain interaction information from different sources: small-scale, Y2H and TAP methods. The DIP_Core dataset, our dataset of true interactions compiled and verified from different sources, shows a significant positive correlation between expression and interactions (Spearman's ρ = 0.1755).
Nevertheless, Bloom et al. went on to conclude that, because of the stronger positive correlation between expression levels and evolutionary rate, expression levels were responsible for any correlation between the number of interactions and evolutionary rate.
To judge what effect, if any, expression levels had on the relationship between connectivity and evolutionary rate, we calculated partial correlations based on Spearman's rank correlation. When controlling for expression, we found that the strength of the correlation between interactions and evolutionary rate did decrease slightly in those cases where it was observed in the first instance (Table 1). This, however, does not imply that expression levels are the reason why we observe a correlation between interactions and rate of change. Judging by the strength of the correlation between expression and evolutionary rate, we believe that expression is simply a better predictor of evolutionary rate than connectivity.
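As an illustration of this kind of calculation, the sketch below applies the standard first-order partial correlation formula, r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz²)(1 - r_yz²)), to Spearman rank correlations. It assumes that this is the form of partial correlation meant here, which the text does not state explicitly. The function names and the five-protein toy vectors are hypothetical and carry no biological meaning.

```python
from collections import defaultdict
from math import sqrt

def average_ranks(values):
    """1-based ranks in ascending order; ties share their mean rank."""
    positions = defaultdict(list)
    for pos, value in enumerate(sorted(values), start=1):
        positions[value].append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)

def partial_spearman(x, y, z):
    """Rank correlation of x and y controlling for z (first-order formula).

    Undefined (division by zero) if z is perfectly rank-correlated
    with x or with y.
    """
    r_xy, r_xz, r_yz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical toy values for five proteins.
connectivity = [1, 2, 3, 4, 5]
rate = [5, 3, 4, 1, 2]
expression = [2, 1, 4, 3, 5]

rho_raw = spearman(connectivity, rate)
rho_partial = partial_spearman(connectivity, rate, expression)
print(f"raw rho = {rho_raw:.3f}, partial rho = {rho_partial:.3f}")
```

Comparing the raw and partial coefficients for each dataset shows how much of the connectivity vs. evolutionary rate association survives once expression is held fixed. Assessing the significance of the partial coefficient would need a separate test that is not shown here.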
Interactions vs age
We found that proteins with a high ER value tend to participate in more interactions. Conducting our analysis on the DIP_Core dataset, we found a very strong correlation between ER and interactions (Figure 5). This supports the belief that hub proteins are more likely to be older than non-hub proteins and corroborates previous work [41, 42]. The ER differential can be explained by the theory of preferential attachment [43]. New proteins, once they have entered the interactome by a growth process, are more likely to form connections with proteins that are already highly interacting. As a result of this process, proteins that have been present in the interactome for a longer period will accrue more interactions, i.e. hub proteins are older. The scale-free nature of the interaction network could also be explained by such a growth process [44]. We also examined the relationship between ER and expression. This correlation was slightly weaker than the correlation between ER and interactions, yet still significant (Figure 6). It suggests that older proteins tend to be more abundant. Collectively, these results suggest that older proteins are not only highly expressed but also participate in more interactions.
In order to ensure that expression was not causing a bias in the relationship between ER and interactions, we took all the proteins from several expression bins and checked for an association between ER and interactions. This stratification analysis is an effective way of checking whether expression has any effect on the relationship between ER and connectivity (Figure 7). For the DIP_Core dataset, a correlation of strength ρ = 0.98301 was detected between ER and number of interactions. This correlation remained when we examined an expression-based bin containing a large number of proteins (> 50), where ρ = 0.91762. This indicates that the abundance levels of proteins had very little effect on the relationship between ER and interactions.
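A stratification of this kind can be sketched as follows. The sketch assumes that proteins are assigned to bins by fixed expression cut-points and that Spearman's ρ between ER and number of interactions is computed per protein within each bin; the exact binning used for Figure 7 is not described here. Only the "more than 50 proteins" threshold comes from the text. The cut-points, record values and function names are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict
from math import sqrt

def average_ranks(values):
    """1-based ranks in ascending order; ties share their mean rank."""
    positions = defaultdict(list)
    for pos, value in enumerate(sorted(values), start=1):
        positions[value].append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def spearman(x, y):
    """Spearman's rho, or None when one variable is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)

def stratified_spearman(records, bin_edges, min_size=50):
    """Correlate ER with interactions separately within each expression bin.

    records: iterable of (expression, er, interactions), one per protein.
    bin_edges: ascending expression cut-points (hypothetical, user-chosen).
    Returns {bin index: (n proteins, rho)} for bins with > min_size proteins.
    """
    bins = defaultdict(list)
    for expression, er, interactions in records:
        bins[bisect_right(bin_edges, expression)].append((er, interactions))
    results = {}
    for index, members in sorted(bins.items()):
        if len(members) > min_size:
            er_values, degrees = zip(*members)
            results[index] = (len(members), spearman(er_values, degrees))
    return results

# Hypothetical toy records; min_size is lowered so the toy bins qualify.
toy = [
    (1.0, 1, 2), (2.0, 2, 3), (3.0, 3, 5),     # low-expression bin
    (20.0, 1, 4), (30.0, 3, 9), (40.0, 2, 6),  # mid-expression bin
    (100.0, 2, 1),                             # bin too small, skipped
]
per_bin = stratified_spearman(toy, bin_edges=[10.0, 50.0], min_size=2)
for index, (n, rho) in per_bin.items():
    print(f"bin {index}: n = {n}, rho = {rho:.3f}")
```

If the within-bin coefficients stay close to the coefficient for the full dataset, expression is unlikely to be driving the ER vs. interactions relationship, which is the reading given above.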