Are scale-free networks robust to measurement errors?
- Nan Lin^{1} and
- Hongyu Zhao^{2, 3}
DOI: 10.1186/1471-2105-6-119
© Lin and Zhao; licensee BioMed Central Ltd. 2005
Received: 31 October 2004
Accepted: 16 May 2005
Published: 16 May 2005
Abstract
Background
Many complex random networks have been found to be scale-free. Existing literature on scale-free networks has rarely considered potential false positive and false negative links in the observed networks, especially in biological networks inferred from high-throughput experiments. Therefore, it is important to study the impact of these measurement errors on the topology of the observed networks.
Results
This article addresses the impact of erroneous links on network topological inference and explores possible error mechanisms for scale-free networks, with an emphasis on Saccharomyces cerevisiae protein interaction networks. We study this issue through both theoretical derivations and simulations. We show that ignoring erroneous links in network analysis may lead to biased estimates of the scale parameter, and we recommend robust estimators in such scenarios. Possible error mechanisms of yeast protein interaction networks are explored by comparing real data with simulated data.
Conclusion
Our studies show that, in the presence of erroneous links, the connectivity distribution of scale-free networks is still scale-free for the middle range of connectivities, but can be greatly distorted for low and high connectivities. It is more appropriate to use robust estimators such as the least trimmed squares (LTS) estimator to estimate the scale parameter γ under such circumstances. Moreover, we show by simulation studies that the scale-free property is robust to some error mechanisms but does not hold under others. The simulation results also suggest that different error mechanisms may be operating in the yeast protein interaction networks produced from different data sources. In the MIPS gold standard protein interaction data, there appears to be a high rate of false negative links, and the false negative and false positive rates are more or less constant across proteins with different connectivities. However, the error mechanism of yeast two-hybrid data may be very different, where the overall false negative rate is low and the false negative rates tend to be higher for links involving proteins with more interacting partners.
Background
Recent studies have found that many complex networks, ranging from the World-Wide Web [1] and the scientific collaboration network [2] to biological systems such as the yeast protein interaction network [3], are scale-free. The scale-free property states that the distribution of the connectivity k (number of links per node) in a network can be described by the power law, i.e.,
P(k) = ck^{-γ}, c > 0, γ > 0. (1)
A visual diagnosis of scale-free behavior can be made through the log-log plot of the connectivity distribution, in which a straight line with slope -γ is expected. In scale-free networks, the nodes are neither randomly nor evenly connected; instead, a few nodes ("hubs") are highly connected. The ratio of the number of "hubs" to the number of nodes in the rest of the network remains constant as the network changes in size. One attractive feature is that scale-free networks are more resistant to random failures than random networks because of the existence of a few highly connected "hubs" [4]. Remarkably, it has been observed that the scale parameter γ varies only in the narrow range of 2.1 – 4 in the aforementioned real-world networks. Existing studies on scale-free networks have assumed that the observed links represent the underlying structure of the network, and have paid little attention to the fact that the observed links often involve errors, namely, false positives and false negatives. For example, Jeong et al. [3] considered the Saccharomyces cerevisiae protein interaction network inferred from yeast two-hybrid (Y2H) experiments. It is well known that the Y2H system has many false positives as well as false negatives [5]. A natural question to ask is whether a scale-free network is still observed as scale-free in the presence of errors. If it is, what are the possible underlying error mechanisms, and how variable is the observed scale parameter γ? Answering these questions may lead to further insight into the scale-free property and to a better understanding and correct usage of the observed network data. For convenience, we will refer to networks observed with erroneous links as perturbed networks in the rest of this article.
Results
In this article, we address the above questions by both theoretical derivations and simulation studies using the yeast protein interaction network as a prototype. However, the results apply to general scale-free networks.
Connectivity distribution of scale-free networks with erroneous links under a simple model
We first study how the connectivity distribution of a scale-free network is affected when errors are present. Following previous studies on the reliability of protein interaction networks [6], we assume a simple error mechanism in which the false positive rate (r_{FP}) and false negative rate (r_{FN}) are the same for all node pairs, and false positives and false negatives are independently generated. The false positive rate of a node pair is the probability that the pair is observed as linked when it is actually unlinked, and the false negative rate is the probability that the pair is observed as unlinked when it is actually linked. Under this assumption, every truly linked pair of nodes is observed as unlinked with probability r_{FN}, and every truly unlinked pair of nodes is observed as linked with probability r_{FP}.
The above assumption is similar to the grand canonical ensembles of random networks in Chapter 4 of Dorogovtsev and Mendes [7], in which networks evolve by removing existing edges and adding new edges with certain probabilities. We can also view the perturbed network as obtained from the underlying network by removing edges (false negatives) and adding edges (false positives). The probability of adding an edge between two non-linked nodes is the false positive rate r_{FP}, and the probability of removing the edge between two linked nodes is the false negative rate r_{FN}. However, while Dorogovtsev and Mendes mostly discussed the connectivity distribution of equilibrium networks (networks obtained after infinitely many rounds of edge addition and removal), we focus on the connectivity distribution of the observed network, which is obtained by considering every existing edge for removal and every non-existing edge for addition exactly once.
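The single pass of edge removal and addition described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' simulation code: the function names `perturb_network` and `degrees`, the edge-set representation, and the `seed` argument are all assumptions made for the example.

```python
import random
from itertools import combinations

def perturb_network(n, true_edges, r_fp, r_fn, seed=None):
    """Return the observed edge set after one pass of independent errors.

    Each truly linked pair is dropped with probability r_fn (false negative);
    each truly unlinked pair is added with probability r_fp (false positive).
    """
    rng = random.Random(seed)
    true_edges = {tuple(sorted(e)) for e in true_edges}
    observed = set()
    for pair in combinations(range(n), 2):
        if pair in true_edges:
            if rng.random() >= r_fn:   # kept with probability 1 - r_fn
                observed.add(pair)
        elif rng.random() < r_fp:      # spurious link with probability r_fp
            observed.add(pair)
    return observed

def degrees(n, edges):
    """Connectivity (number of links) of every node 0..n-1."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg
```

Every node pair is visited exactly once, which mirrors the "just once" reading of the mechanism rather than the equilibrium setting of [7]. The loop is quadratic in n, which is acceptable for illustration but would need a sparser sampling scheme for networks with thousands of nodes.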
Connectivity distribution of the perturbed network
In the following, we derive the distribution of the observed connectivities for a scale-free network of size n for given values of r_{FP} and r_{FN}. Let N_{P} and N_{T} denote the observed and true connectivity of a node, respectively. Then, by the law of total probability, the probability of observing a node with k links is
P(N_{P} = k) = Σ_{j = T_{min}}^{T_{max}} P(N_{P} = k | N_{T} = j) P(N_{T} = j). (2)
The minimum and maximum connectivity of a node, T_{min} and T_{max}, are assumed to be the same for all the nodes in the network, and their values depend on the specific network. In general, we set T_{min} = 0 and T_{max} = n - 1 when expert knowledge is not available, where n denotes the size of the network, i.e., the total number of nodes in the network. The following elucidates how to calculate (2) analytically. Let N_{FP}, N_{TP}, N_{FN}, N_{TN}, and N_{N} be the numbers of false positive links (observed as linked but actually not), true positive links (observed as linked and actually linked), false negative links (observed as unlinked but actually linked), true negative links (observed as unlinked and actually unlinked) and negative links (actually unlinked) associated with the node, respectively. Since the observed links of a node consist of both false positive and true positive ones, and the true links consist of true positive and false negative ones, we have N_{P} = N_{FP} + N_{TP}, N_{T} = N_{FN} + N_{TP}, N_{N} = N_{FP} + N_{TN}, and T_{max} = N_{T} + N_{N}. Furthermore, under our assumed error mechanism, following similar derivations as shown in [7], N_{FP} and N_{FN} follow the binomial distributions Bin(T_{max} - N_{T}, r_{FP}) and Bin(N_{T}, r_{FN}), respectively, for a given value of N_{T}. This implies that r_{FP} = E(N_{FP})/(T_{max} - N_{T}) = E(N_{FP})/(N_{FP} + N_{TN}) and r_{FN} = E(N_{FN})/N_{T} = E(N_{FN})/(N_{TP} + N_{FN}), where E(X) denotes the expectation of random variable X. Given N_{T} = j, observing N_{P} = k with i false positive links requires k - i true positive links and hence j - k + i false negative links. Summing over i, the conditional probability P(N_{P} = k|N_{T} = j) in (2) can be written as follows.
P(N_{P} = k | N_{T} = j) = Σ_{i} dBin(i; r_{FP}, T_{max} - j) dBin(j - k + i; r_{FN}, j), with the sum taken over max(0, k - j) ≤ i ≤ min(k, T_{max} - j),
where dBin(k; p, n) = P(X = k) with X ~ Bin(n, p). Moreover, the power law of the scale-free network implies that P(N_{T} = j) = cj^{-γ}. Hence, the observed connectivity distribution can be calculated by
P(N_{P} = k) = Σ_{j = T_{min}}^{T_{max}} cj^{-γ} P(N_{P} = k | N_{T} = j). (3)
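As a numerical check on this derivation, the following Python sketch evaluates the observed connectivity distribution by summing the power-law weights against the binomial convolution of false positives and false negatives. It is only an illustration under stated assumptions: the function names are hypothetical, T_{min} defaults to 1 so that j^{-γ} is defined, and the constant c is chosen here to normalize the truncated power law.

```python
from math import comb

def dbin(k, p, n):
    """P(X = k) for X ~ Bin(n, p); zero outside 0..n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p)**(n - k)

def cond_prob(k, j, t_max, r_fp, r_fn):
    """P(N_P = k | N_T = j), summing over the number i of false positives."""
    total = 0.0
    for i in range(max(0, k - j), min(k, t_max - j) + 1):
        n_fn = j - (k - i)  # false negatives needed so that N_TP = k - i
        total += dbin(i, r_fp, t_max - j) * dbin(n_fn, r_fn, j)
    return total

def observed_distribution(n, gamma, r_fp, r_fn, t_min=1):
    """List of P(N_P = k) for k = 0..n-1 under the simple error mechanism."""
    t_max = n - 1
    true_conn = range(t_min, t_max + 1)
    weights = [j ** -gamma for j in true_conn]
    c = 1.0 / sum(weights)  # assumption: normalize the truncated power law
    return [
        sum(c * w * cond_prob(k, j, t_max, r_fp, r_fn)
            for j, w in zip(true_conn, weights))
        for k in range(t_max + 1)
    ]
```

With r_fp = r_fn = 0 the function returns the unperturbed power law, and for any error rates the probabilities sum to one. The triple loop grows roughly cubically in n, so sizes such as n = 7000 would call for a vectorized or approximate implementation.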
Simulations
We next explore the impact of the erroneous links on the topology of the scale-free networks. With an emphasis on the yeast protein interaction network, we compute the distribution of the observed connectivity of scale-free networks with the false positive rate (r_{ FP }) and false negative rate (r_{ FN }) similar to the yeast protein interaction network under the assumption of the aforementioned simple error mechanism. We set the scale parameter γ = 3, the size of the network n = 1000 or 7000, and vary r_{ FP }from 0.0001 to 0.0003 and r_{ FN }from 0.1 to 0.9 on 9 equally spaced values. These ranges of r_{ FP }and r_{ FN }are based on Deng et al. [8], in which the authors estimated the false positive rate and false negative rate to be less than 0.000285 and greater than 0.64, respectively, based on the Y2H data. We consider a larger range of r_{ FP }to cover other data sources, such as the MIPS complex data, where false positives are less frequent. In the calculations, we use T_{ min }= 1 and T_{ max }= n - 1.
Estimation of γ
The connectivity distribution of the perturbed network suggests that the observed link data should be used with caution, especially when estimating γ. The scale parameter γ, an important characteristic of a scale-free network, is commonly estimated by ordinary least squares (OLS) in the linear model obtained from the log transformation of (1):
log P(k) = log c - γ log k. (4)
It is well known that the OLS estimator can be very sensitive to even a small number of outliers. For example, applying the OLS estimator in Figure 1(a) will fail to capture the linear trend if the point at the far right end is included in the estimation. Therefore, robust estimators, such as the M-estimator and the least trimmed squares (LTS) estimator [9], are more appropriate choices in such situations because of their resistance to outliers. Our simulations suggest that the LTS estimator can correctly capture the linear trend without visual diagnosis of the connectivity distribution, while the OLS estimator and the M-estimator often fail to estimate the slope of the linear part correctly. Therefore, we use the LTS estimator in the simulation studies that follow.
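The sensitivity of OLS to a single distorted point can be illustrated with a small Python sketch of the regression in (4). The data here are synthetic and hypothetical: an exact power law with γ = 3 on k = 1, ..., 20, with the last probability inflated by an arbitrary factor of 50 to mimic a distorted tail. The function name `ols_loglog` is likewise an assumption of the example.

```python
from math import log

def ols_loglog(ks, ps):
    """Fit log P(k) = log c - gamma * log k by ordinary least squares."""
    xs = [log(k) for k in ks]
    ys = [log(p) for p in ps]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, -slope  # (log c, gamma)

ks = list(range(1, 21))
clean = [k ** -3.0 for k in ks]          # exact power law, gamma = 3
log_c, gamma_clean = ols_loglog(ks, clean)

distorted = clean[:]
distorted[-1] *= 50.0                    # hypothetical inflated tail point
_, gamma_distorted = ols_loglog(ks, distorted)
```

On the clean data the fit recovers γ = 3 up to rounding, while the single inflated point at the largest connectivity pulls the fitted line up at the right end and therefore lowers the estimate of γ.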
Exploring error mechanisms of yeast protein interaction networks by simulations
In the previous section, we found that the scale-free property can be conserved to a large extent under a simple error mechanism. However, the error mechanisms of the real data are often more complicated. For more complicated error mechanisms, theoretical derivations of the connectivity distribution of the perturbed networks are often intractable. But it is also important to know how the empirical connectivity distributions of real networks are affected by the erroneous links. Therefore, we conduct extensive simulation studies to investigate the finite-sample impact of the error mechanisms on the connectivity distribution. Our study focuses on the yeast protein-interaction network data.
In the following, we investigate the error mechanisms of two real yeast protein interaction network data sets used in Jeong et al. [3] and Deng et al. [6] by comparing the connectivity distribution of these two networks with that of the simulated network perturbed by different error mechanisms. We assume that the true underlying topology of the yeast protein interaction network is scale-free [3]. Then if we perturb the simulated scale-free network by the error mechanisms similar to the ones of the real yeast protein interaction networks, the resulting connectivity distribution should be similar to the ones of the real networks.
MIPS and Y2H yeast protein networks
Error mechanisms
1. constant: p_{ij} = r_{FP} and q_{ij} = r_{FN} for all (x_{i}, x_{j});
2. increasing (with connectivity);
3. decreasing (with connectivity).
Nine error mechanisms.
Error mechanism | p_{ij} (false positive rate) | q_{ij} (false negative rate) |
---|---|---|
S 1 | constant | constant |
S 2 | constant | increasing |
S 3 | constant | decreasing |
S 4 | increasing | constant |
S 5 | increasing | increasing |
S 6 | increasing | decreasing |
S 7 | decreasing | constant |
S 8 | decreasing | increasing |
S 9 | decreasing | decreasing |
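The nine mechanisms in the table are the Cartesian product of three patterns for p_{ij} and three for q_{ij}. A minimal Python sketch of that enumeration is given below; the dictionary layout and the key names `S1` to `S9` are assumptions made for illustration, and the functional forms of the increasing and decreasing patterns are deliberately left out.

```python
from itertools import product

PATTERNS = ("constant", "increasing", "decreasing")

# S1..S9 in the table's order: the p_ij pattern varies slowest,
# the q_ij pattern varies fastest.
MECHANISMS = {
    f"S{idx}": {"p_ij": p, "q_ij": q}
    for idx, (p, q) in enumerate(product(PATTERNS, repeat=2), start=1)
}
```

Looking up `MECHANISMS["S4"]` gives an increasing false positive pattern with a constant false negative pattern, matching the corresponding table row.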
Simulation studies
Parameter estimates for Net_{0}.
Parameter | OLS | M-estimation | LTS |
---|---|---|---|
log c | 1.4600 | 1.7846 | 4.008 |
γ | 2.0918 | 2.1769 | 2.803 |
Under the nine different error mechanisms, the connectivity distribution of the perturbed Net_{0} can be dramatically different. Under error mechanisms S 2, S 5, S 6 and S 9, the perturbed networks contain only a small proportion of nodes with low connectivity, which differs greatly from the observed yeast protein interaction networks (Figures 5 and 6). This finding suggests that these four mechanisms are far from the true error structure, and we do not discuss them further. We also observe that changes in r_{FP} have little impact on the connectivity distribution under all error mechanisms, whereas a higher value of r_{FN} increases the probability of nodes with small connectivity under S 1, S 3 and S 8. Mechanisms S 4 and S 7 are highly stable structures; that is, the connectivity distribution changes little in response to changes in r_{FP} or r_{FN} under these two error mechanisms. This suggests that scale-free networks with constant false negative rates can still provide very credible information about their topological structure. This finding is also confirmed by the fact that the estimates of γ vary little when r_{FN} changes (see Tables A.5 and A.6 in Additional file 1). The estimated values of γ vary only from 2.61 to 3.03 with a standard error of 0.125 under S 4 and only from 2.56 to 3.31 with a standard error of 0.161 under S 7, whereas the estimate of γ clearly decreases as r_{FN} increases under S 3 and S 8 (Tables A.4 and A.7 in Additional file 1). Under S 1, there is no clear pattern in the estimated γ as r_{FN} changes (Table A.3 in Additional file 1), but the estimates of γ vary over a much wider range (1.16 – 4.35) than those under S 3 and S 8. It is worth noting that our conclusions are restricted to the particular ranges of r_{FP} and r_{FN} we have studied; however, these ranges are believed to be reasonable for describing Y2H systems.
The simple error mechanism S 1 with a high false negative rate produces patterns (Figures 8(a) and 10(a)) similar to those of the gold standard data (Figure 6). For the Y2H yeast protein interaction network (Figure 5), S 4 gives the best approximation, but still differs slightly in the probabilities of nodes with small connectivity. This suggests that the real error structure of the Y2H analyses may be more complicated than all the simple proposals we have considered.
Conclusion
This article first investigates the impact of erroneous links on network topological inference. From our theoretical and simulation results, we find that, under a simple error mechanism, the scale-free property is preserved for moderate connectivities, but the linear pattern is distorted in both the small and large connectivity regions. Accordingly, we recommend using robust estimators (e.g. LTS), which are more resistant to the outliers at both ends of the distribution, to estimate the scale parameter γ.
Moreover, we have explored possible error mechanisms of the yeast protein interaction data by simulations considering nine different error mechanisms. The results suggest that changes in the overall false positive rates have little impact on the resulting connectivity distribution, but increasing the overall false negative rates can increase the probability of nodes with small connectivities under some error mechanisms, and hence decrease the scale parameter γ. The connectivity distribution can be very stable under several error mechanisms when the overall false positive rates and false negative rates change, which suggests that in certain situations the observed data can provide sufficient topological information on the underlying network structure even when the false negative rates are quite high.
The simple error mechanism that assumes that the false positive rate and false negative rate of each protein pair are constants agrees well with the MIPS gold standard data when the false negative rate is high. A different error mechanism is suggested for the Y2H data, where more connected protein pairs tend to have higher false positive rates and lower false negative rates. As this error mechanism provides only a reasonable approximation to the Y2H data, more sophisticated mechanisms might be needed to better capture its error structure.
Methods
Preferential attachment growth model
1. Growth: starting with a small number (m_{0}) of nodes, add a new node at every time step and connect it to m (≤ m_{0}) nodes already present in the system.
2. Preferential attachment: the new node is more likely to connect to nodes with larger connectivity. The probability Π_{i} that a new node will be connected to node i depends on its connectivity k_{i}.
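The two steps above can be sketched in Python as follows. This is a minimal illustration under explicit assumptions, not the generator used for Net_{0}: it assumes the standard linear rule in which node i is chosen with probability proportional to k_{i}, it starts from a fully connected core of m_{0} nodes so that every node has nonzero connectivity, and the function name `preferential_attachment` is hypothetical.

```python
import random

def preferential_attachment(n, m0=3, m=2, seed=None):
    """Grow a network to n nodes; each new node links to m existing nodes."""
    if not (1 <= m <= m0 <= n) or m0 < 2:
        raise ValueError("need 1 <= m <= m0 <= n and m0 >= 2")
    rng = random.Random(seed)
    # Assumption: the m0 seed nodes form a complete graph.
    edges = [(i, j) for i in range(m0) for j in range(i + 1, m0)]
    # One entry per edge endpoint, so a uniform draw from this list picks
    # node i with probability k_i / sum_j k_j (assumed linear rule).
    endpoints = [v for e in edges for v in e]
    for new in range(m0, n):
        targets = set()
        while len(targets) < m:          # redraw until m distinct targets
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.append((t, new))
            endpoints.extend((t, new))
    return edges
```

Redrawing until the m targets are distinct avoids duplicate links; it makes the selection only approximately proportional to connectivity when m > 1, which is a common simplification.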
Least Trimmed Squares (LTS)
The basic idea of LTS estimation is to minimize the sum of the h smallest squared residuals, instead of all squared residuals as in OLS, in order to achieve robustness while maintaining good efficiency. Please refer to [9] for more details of the algorithm, such as practical choices of h. In this article, the LTS estimation is performed using the lqs() function implemented in R [17].
Declarations
Acknowledgements
This work was supported in part by NSF grant DMS 0241160 and NIH grant R01 GM59507.
References
- Albert R, Barabási AL, Jeong H: Scale-free characteristics of random networks: The topology of the World Wide Web. Physica A 2000, 281: 69–77.
- Barabási AL, Jeong H, Néda Z, Ravasz E, Schubert A, Vicsek T: Evolution of the social network of scientific collaborations. Physica A 2002, 311: 590–614.
- Jeong H, Mason SP, Barabási AL, Oltvai ZN: Lethality and centrality in protein networks. Nature 2001, 411: 41–42. 10.1038/35075138
- Albert R, Jeong H, Barabási AL: Error and attack tolerance of complex networks. Nature 2000, 406: 378–382. 10.1038/35019019
- Criekinge WV, Beyaert R: Yeast two-hybrid: State of the art. Biol Proced Online 1999, 2: 1–38. 10.1251/bpo16
- Deng M, Sun F, Chen T: Assessment of the reliability of protein-protein interactions and protein function prediction. Pac Symp Biocomput 2003, 140–151.
- Dorogovtsev SN, Mendes JFF: Evolution of Networks. New York: Oxford University Press; 2003.
- Deng M, Mehta S, Sun F, Chen T: Inferring domain-domain interactions from protein-protein interactions. Genome Research 2002, 1540–1548. 10.1101/gr.153002
- Rousseeuw PJ, Leroy AM: Robust regression and outlier detection. New York: Wiley; 1987.
- Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403: 601–603. 10.1038/35001165
- Xenarios I, Rice DW, Salwinski L, Baron MK, Marcotte EM, Eisenberg D: DIP: the database of interacting proteins. Nucleic Acids Res 2000, 28: 289–291. 10.1093/nar/28.1.289
- Y2H protein interaction network data[http://www.nd.edu/~networks/database/protein/bo.dat.gz]
- MIPS gold standard protein interaction network data[http://hto-b.usc.edu/~msms/AssessInteraction/MIPSMatchYPD.txt]
- Yeast Proteome Database[http://www.proteome.com/YPDhome.html]
- Barabási AL, Albert R, Jeong H: Mean-field theory for scale-free random networks. Physica A 1999, 272: 173–187.
- Albert R, Barabási AL: Topology of evolving networks: local events and universality. Phys Rev Lett 2000, 85: 5234–5237. 10.1103/PhysRevLett.85.5234
- R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria 2004. [http://www.R-project.org]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.