Are scale-free networks robust to measurement errors?

Lin, Nan; Zhao, Hongyu

doi:10.1186/1471-2105-6-119

Research article
Open access
Published: 16 May 2005

Nan Lin¹ & Hongyu Zhao²,³

BMC Bioinformatics volume 6, Article number: 119 (2005)


Abstract

Background

Many complex random networks have been found to be scale-free. Existing literature on scale-free networks has rarely considered potential false positive and false negative links in the observed networks, especially in biological networks inferred from high-throughput experiments. Therefore, it is important to study the impact of these measurement errors on the topology of the observed networks.

Results

This article addresses the impact of erroneous links on network topological inference and explores possible error mechanisms for scale-free networks with an emphasis on Saccharomyces cerevisiae protein interaction networks. We study this issue by both theoretical derivations and simulations. We show that the ignorance of erroneous links in network analysis may lead to biased estimates of the scale parameter and recommend robust estimators in such scenarios. Possible error mechanisms of yeast protein interaction networks are explored by comparisons between real data and simulated data.

Conclusion

Our studies show that, in the presence of erroneous links, the connectivity distribution of scale-free networks is still scale-free for the middle range connectivities, but can be greatly distorted for low and high connectivities. It is more appropriate to use robust estimators such as the least trimmed squares estimator to estimate the scale parameter γ under such circumstances. Moreover, we show by simulation studies that the scale-free property is robust to some error mechanisms but not to others. The simulation results also suggest that different error mechanisms may be operating in the yeast protein interaction networks produced from different data sources. In the MIPS gold standard protein interaction data, there appears to be a high rate of false negative links, and the false negative and false positive rates are more or less constant across proteins with different connectivities. However, the error mechanism of yeast two-hybrid data may be very different, where the overall false negative rate is low and the false negative rates tend to be higher for links involving proteins with more interacting partners.

Background

Recent studies have found that many complex networks, ranging from the World-Wide Web [1] and the scientific collaboration network [2] to biological systems such as the yeast protein interaction network [3], are scale-free. The scale-free property states that the distribution of the connectivity k (number of links per node) in a network can be described by the power law, i.e.,

P(k) = ck^-γ, c > 0, γ > 0. (1)

A visual diagnosis of the scale-free behavior can be made through the log-log plot of the connectivity distribution, in which a straight line with slope -γ is expected. In scale-free networks, the nodes are neither randomly nor evenly connected; a few nodes ("hubs") are highly connected. The ratio of the number of "hubs" to that of nodes in the rest of the network remains constant as the network changes in size. One attractive feature is that scale-free networks are more resistant to random failures than random networks, owing to the existence of a few highly connected "hubs" [4]. Remarkably, it has been observed that the scale parameter γ varied only in the narrow range of 2.1–4 in the aforementioned real-world networks. All existing studies on scale-free networks assumed that the observed links represented the underlying structure of the network, but paid little attention to the fact that the observed links often involved errors, namely, false positives and false negatives. For example, Jeong et al. [3] considered the Saccharomyces cerevisiae protein interaction network inferred from yeast two-hybrid (Y2H) experiments. It is well known that the Y2H system has many false positives as well as false negatives [5]. A natural question to ask is whether a scale-free network is still observed as scale-free in the presence of errors. If it is, what are the possible underlying error mechanisms, and how variable is the observed scale parameter γ? Answering these questions may lead to further insight into the scale-free property, and to a better understanding and correct usage of the observed network data. For convenience, we will refer to networks observed with erroneous links as perturbed networks in the rest of this article.

Results

In this article, we address the above questions by both theoretical derivations and simulation studies using the yeast protein interaction network as a prototype. However, the results apply to general scale-free networks.

Connectivity distribution of scale-free networks with erroneous links under a simple model

We first study how the connectivity distribution of a scale-free network is affected when errors are present. Following previous studies on the reliability of protein interaction networks [6], we assume a simple error mechanism in which the false positive rate (r_FP) and false negative rate (r_FN) are the same for all node pairs, and false positives and false negatives are independently generated. The false positive rate of a node pair is the probability that the pair is observed as linked when it is actually unlinked, and the false negative rate is the probability that the pair is observed as unlinked when it is actually linked. Under this assumption, every truly linked pair of nodes has probability r_FN of being observed as unlinked, and every truly unlinked pair of nodes has probability r_FP of being observed as linked.

The above assumption is similar to the grand canonical ensembles of random networks in Chapter 4 of Dorogovtsev and Mendes [7], in which networks evolve by removing existing edges and adding new edges with certain probabilities. We can likewise view the perturbed network as obtained from the underlying network by removing edges (false negatives) and adding edges (false positives). The probability of adding an edge between two non-linked nodes is the false positive rate r_FP, and the probability of removing the edge between two linked nodes is the false negative rate r_FN. However, while Dorogovtsev and Mendes mostly discussed the connectivity distribution of equilibrium networks (networks obtained after infinitely many rounds of edge addition and removal), we focus on the connectivity distribution of the observed network, which is obtained by considering every existing edge for removal and every non-existing edge for addition just once.

Connectivity distribution of the perturbed network

In the following, we derive the distribution of the observed connectivities for a scale-free network of size n for given values of r_FP and r_FN. Let N_P and N_T denote the observed and true connectivity of a node, respectively. Then the probability of observing a node with k links is

P(N_P = k) = Σ_{j = T_min}^{T_max} P(N_P = k | N_T = j) P(N_T = j). (2)

The minimum and maximum connectivity of a node, T_min and T_max, are assumed to be the same for all the nodes in the network, and their values depend on the specific network. In general, we set T_min = 0 and T_max = n - 1 when expert knowledge is not available, where n denotes the size of the network, i.e., the total number of nodes in the network. The following elucidates how to calculate (2) analytically. Let N_FP, N_TP, N_FN, N_TN, and N_N be the numbers of false positive links (observed as linked but actually not), true positive links (observed as linked and actually linked), false negative links (observed as unlinked but actually linked), true negative links (observed as unlinked and actually unlinked) and negative links (actually unlinked) associated with the node, respectively. Since the observed links of a node consist of both false positive and true positive ones, and the true links consist of true positive and false negative ones, we have N_P = N_FP + N_TP, N_T = N_FN + N_TP, N_N = N_FP + N_TN, and T_max = N_T + N_N. Furthermore, under our assumed error mechanism, following similar derivations as shown in [7], N_FP and N_FN follow the binomial distributions Bin(T_max - N_T, r_FP) and Bin(N_T, r_FN), respectively, for a given value of N_T. This implies that r_FP = E(N_FP)/(T_max - N_T) = E(N_FP)/(N_FP + N_TN) and r_FN = E(N_FN)/N_T = E(N_FN)/(N_TP + N_FN), where E(X) denotes the expectation of random variable X. Because N_P = N_T - N_FN + N_FP and the two error counts are independent, the conditional probability P(N_P = k | N_T = j) in (2) can be written as follows.

P(N_P = k | N_T = j) = Σ_{i = max(0, j - k)}^{min(j, T_max - k)} P(N_FN = i | N_T = j) P(N_FP = k - j + i | N_T = j) = Σ_{i = max(0, j - k)}^{min(j, T_max - k)} dBin(i; r_FN, j) dBin(k - j + i; r_FP, T_max - j),

where dBin(k; p, n) = P(X = k) with X ~ Bin(n, p). Moreover, the power law of the scale-free network implies that P(N_T = j) = cj^-γ. Hence, the observed connectivity distribution can be calculated by

P(N_P = k) = Σ_{j = T_min}^{T_max} cj^-γ P(N_P = k | N_T = j). (3)
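The calculation described above can be sketched numerically. The following Python sketch is an illustration only, not the authors' code: the function names are hypothetical, the constant c is assumed to be chosen so that the power law sums to one over T_min, ..., T_max, and the conditional distribution is obtained by combining the two independent binomial error counts stated in the text.

```python
import math


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p); the degenerate cases p = 0 and p = 1 are handled."""
    if k < 0 or k > n:
        return 0.0
    if p == 0.0:
        return 1.0 if k == 0 else 0.0
    if p == 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)


def observed_connectivity_pmf(n, gamma, r_fp, r_fn, t_min=1):
    """Return a list whose k-th entry is P(N_P = k), for k = 0, ..., n - 1.

    Assumption: c normalises j ** -gamma over j = t_min, ..., t_max.
    """
    t_max = n - 1
    weights = [j ** (-gamma) for j in range(t_min, t_max + 1)]
    c = 1.0 / sum(weights)
    observed = [0.0] * (t_max + 1)
    for j, weight in zip(range(t_min, t_max + 1), weights):
        p_true = c * weight  # P(N_T = j)
        # N_FN ~ Bin(j, r_fn) and N_FP ~ Bin(t_max - j, r_fp), independent given N_T = j
        fn_pmf = [binom_pmf(i, j, r_fn) for i in range(j + 1)]
        fp_pmf = [binom_pmf(f, t_max - j, r_fp) for f in range(t_max - j + 1)]
        for i, p_fn in enumerate(fn_pmf):
            for f, p_fp in enumerate(fp_pmf):
                # observed connectivity = true links - false negatives + false positives
                observed[j - i + f] += p_true * p_fn * p_fp
    return observed


# Small illustrative network; the cost grows roughly as n ** 3 in this plain form.
pmf = observed_connectivity_pmf(n=60, gamma=3.0, r_fp=0.0003, r_fn=0.3)
```

The triple loop is kept deliberately plain for readability; for networks of the sizes considered in the article, a vectorised convolution would be a more practical choice.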

Simulations

We next explore the impact of the erroneous links on the topology of the scale-free networks. With an emphasis on the yeast protein interaction network, we compute the distribution of the observed connectivity of scale-free networks with the false positive rate (r_FP) and false negative rate (r_FN) similar to the yeast protein interaction network under the assumption of the aforementioned simple error mechanism. We set the scale parameter γ = 3, the size of the network n = 1000 or 7000, and vary r_FP from 0.0001 to 0.0003 and r_FN from 0.1 to 0.9 on 9 equally spaced values. These ranges of r_FP and r_FN are based on Deng et al. [8], in which the authors estimated the false positive rate and false negative rate to be less than 0.000285 and greater than 0.64, respectively, based on the Y2H data. We consider a larger range of r_FP to cover other data sources, such as the MIPS complex data, where false positives are less frequent. In the calculations, we use T_min = 1 and T_max = n - 1.

In the log-log plot (Figures 1 and 2) of the observed connectivity distribution of the perturbed networks when (r_FP = 0.0001, r_FN = 0.3) and (r_FP = 0.00015, r_FN = 0.8), it can be seen that the connectivity distribution after perturbation still maintains the scale-free property in the middle range of the connectivity, but deviates from the original linear pattern at both the small and large connectivity regions. The slope of the linear part is close to the true value -3 (see Tables A.1 and A.2 in Additional file 1). The deviation is more significant in the large connectivity region than that in the small connectivity region. This deviation pattern is consistent across networks of different sizes considered in our calculations (data not shown). Comparisons among the observed connectivity distributions (figures not shown) of perturbed networks with different values of r_FP and r_FN suggest that the deviation depends little on r_FP but largely on r_FN. As r_FN increases, the deviation of the tail probability becomes more significant. This deviation is also more obvious in a smaller network.

Estimation of γ

The connectivity distribution of the perturbed network suggests a cautious use of the observed link data, especially on estimating γ. The scaling parameter γ, an important characteristic measure of the scale-free network, is commonly estimated using the ordinary least squares (OLS) in the linear model from the log transformation of (1).

log P(k) = log c - γ log k. (4)

It is well known that the OLS estimator can be very sensitive to even a small number of outliers. For example, the OLS estimator applied to Figure 1(a) fails to capture the linear trend if the last point at the far end of the tail is included in the estimation. Therefore, robust estimators, such as the M-estimator and the least trimmed squares (LTS) estimator [9], are more appropriate choices in such situations because of their resistance to outliers. Our simulations suggest that the LTS estimator can correctly capture the linear trend without visual diagnosis of the connectivity distribution, while the OLS and M-estimator often fail to estimate the slope of the linear part correctly. Therefore, we will use the LTS estimator in our following simulation studies.
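To make the sensitivity of OLS concrete, the following Python sketch fits the log-transformed power law by ordinary least squares and shows how a single distorted tail point shifts the estimate of γ. The function name and the toy data are hypothetical and are not taken from the article's figures.

```python
import math


def ols_scale_parameter(ks, probs):
    """Fit log P(k) = log c - gamma * log k by OLS; return (gamma, c)."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(p) for p in probs]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return -slope, math.exp(intercept)


# Toy data: an exact power law with gamma = 3 and c = 0.5 (illustrative values only).
ks = list(range(1, 51))
clean = [0.5 * k ** -3.0 for k in ks]
gamma_clean, c_clean = ols_scale_parameter(ks, clean)

# Inflate only the last tail probability, mimicking a distorted high-connectivity point.
distorted = clean[:-1] + [clean[-1] * 100.0]
gamma_distorted, _ = ols_scale_parameter(ks, distorted)
print(gamma_clean, gamma_distorted)
```

With the clean data the fit recovers γ = 3 up to rounding, whereas the single inflated tail point pulls the estimate below 3, which is the kind of bias a robust estimator is meant to avoid.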

Exploring error mechanisms of yeast protein interaction networks by simulations

In the previous section, we found that the scale-free property can be conserved to a large extent under a simple error mechanism. However, the error mechanisms of the real data are often more complicated. For more complicated error mechanisms, theoretical derivations of the connectivity distribution of the perturbed networks are often intractable. But it is also important to know how the empirical connectivity distributions of real networks are affected by the erroneous links. Therefore, we conduct extensive simulation studies to investigate the finite-sample impact of the error mechanisms on the connectivity distribution. Our study focuses on the yeast protein-interaction network data.

For real network data, whether or not erroneous links are involved, the empirical connectivity distribution will not display a linear pattern as clear as the ones in Figure 1, because of sampling variation and because the empirical frequencies are only a discrete approximation to the tiny probabilities of nodes with large connectivities. For example, Figure 3 shows the connectivity distribution of a simulated scale-free network Net₀ and Figure 4 shows the connectivity distribution of Net₀ after perturbation by the simple error mechanism discussed above. In Figure 4, we observe a much larger curvature deviation from the linear trend at the small connectivity region than that in Figures 1 and 2. It is not clear why the empirical distributions of the simulated networks are so different from the theoretical calculations, but this observation demonstrates that simulation studies are necessary to complement the findings from the theoretical calculations. In addition, simulation studies can also explore possible error mechanisms by comparing the connectivity distributions of simulated perturbed scale-free networks with those of the observed networks, under the assumption that their underlying structures are indeed scale-free.
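For reference, an empirical connectivity distribution of the kind discussed above can be tabulated directly from a list of links. The Python sketch below is a minimal illustration; the function name and the assumption that nodes are labelled 0, ..., n - 1 are introduced here and are not part of the article.

```python
from collections import Counter


def empirical_connectivity_distribution(n_nodes, edges):
    """Return {k: fraction of nodes with connectivity k} for an undirected edge list.

    Assumption: nodes are labelled 0, ..., n_nodes - 1 and each link appears once.
    """
    degree = [0] * n_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree)
    return {k: counts[k] / n_nodes for k in sorted(counts)}


# Toy star-shaped network on five nodes with one isolated node (illustrative only).
toy_distribution = empirical_connectivity_distribution(5, [(0, 1), (0, 2), (0, 3)])
print(toy_distribution)
```

Nodes with connectivity 0 have to be dropped before a log-log plot is drawn, since log 0 is undefined.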

In the following, we investigate the error mechanisms of two real yeast protein interaction network data sets used in Jeong et al. [3] and Deng et al. [6] by comparing the connectivity distribution of these two networks with that of the simulated network perturbed by different error mechanisms. We assume that the true underlying topology of the yeast protein interaction network is scale-free [3]. Then if we perturb the simulated scale-free network by the error mechanisms similar to the ones of the real yeast protein interaction networks, the resulting connectivity distribution should be similar to the ones of the real networks.

MIPS and Y2H yeast protein networks

Jeong et al. derived the yeast protein network from combined, non-overlapping Y2H data [10, 11]. This network has 1,870 proteins as nodes, connected by 2,240 identified direct physical interactions [12]. The other network was obtained from the gold standard of yeast protein interactions based on the MIPS complex data [13]. This gold standard data set has 1,376 proteins and 2,876 interacting protein pairs, out of which 2,559 are also recorded in the Yeast Proteome Database (YPD) [14]. The YPD subset has 1,373 proteins. Estimates of γ from the Y2H network, the gold standard data and the YPD subset are 2.396, 2.721 and 2.870, respectively. The connectivity distributions of these two networks are shown in Figure 5 and Figure 6, respectively.

Error mechanisms

We consider different error mechanisms in terms of different types of false positive rates (p_ij = P(x_i and x_j are observed linked | x_i and x_j are actually unlinked)) and false negative rates (q_ij = P(x_i and x_j are observed unlinked | x_i and x_j are actually linked)) for node pair (x_i, x_j), i = 1,..., n, j = 1,..., n, i ≠ j. Assume that the overall false positive rate and false negative rate are r_FP and r_FN, in the sense that the expected numbers of false positive links and false negative links are E(N_FP) = r_FP N_N and E(N_FN) = r_FN N_P, where N_P and N_N here denote the total numbers of truly linked and truly unlinked node pairs in the network. We consider nine different error mechanisms by letting p_ij and q_ij be one of the following three different types:

1. constant: p_ij = r_FP and q_ij = r_FN for all (x_i, x_j);
2. increasing (with connectivity):
3. decreasing (with connectivity):

where L(x) denotes the true connectivity of node x. For Net₀, N_P = 49,007 and N_N = 24,503,521. The combinations of different structures on false positive rates and false negative rates produce the nine error mechanisms in Table 1.
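As a quick check of the quantities above, the following Python sketch reproduces N_N for Net₀ from its node and link counts and evaluates the expected numbers of erroneous links for two of the (r_FP, r_FN) settings used in the article. The variable and function names are hypothetical.

```python
from math import comb

n_nodes = 7008        # size of Net0 as reported in the article
true_links = 49007    # N_P: truly linked node pairs in Net0
unlinked_pairs = comb(n_nodes, 2) - true_links  # N_N: truly unlinked node pairs


def expected_errors(r_fp, r_fn):
    """Return (E(N_FP), E(N_FN)) = (r_FP * N_N, r_FN * N_P)."""
    return r_fp * unlinked_pairs, r_fn * true_links


for r_fp, r_fn in [(0.00025, 0.5), (0.00015, 0.8)]:
    exp_fp, exp_fn = expected_errors(r_fp, r_fn)
    print(f"r_FP={r_fp}, r_FN={r_fn}: E(N_FP)={exp_fp:.1f}, E(N_FN)={exp_fn:.1f}")
```

Even a false positive rate as small as 0.00025 corresponds to roughly 6,126 expected false positive links, because the number of unlinked pairs is about 500 times the number of true links.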

Table 1 Nine error mechanisms.


Simulation studies

We simulate a scale-free network Net₀ using the preferential attachment growth model [15, 16]. In this algorithm, we start from m₀ = 7 isolated nodes and, in each of the T = 7,000 evolving steps, add m = 7 links to the existing nodes with probability proportional to their connectivity. Net₀ has L = 49,007 links and n = 7,008 nodes. The mean-field theory [15] suggests that the theoretical value of γ for Net₀ is 3, which agrees well with the estimates in Table 2.

Table 2 Parameter estimates for Net₀.


We always assume that false positives and false negatives are independently generated. In the simulations, a link is added (false positive) between every two unlinked nodes (x_i, x_j) in Net₀ with probability p_ij, and the link is removed (false negative) between two linked nodes (x_i, x_j) in Net₀ with probability q_ij. We also consider these error mechanisms under high and low overall false positive (r_FP) and false negative rates (r_FN). The connectivity distributions of Net₀ after perturbation are shown in Figures 7, 8, 9, 10 for different values of r_FP and r_FN: (0.00025, 0.5), (0.00025, 0.8), (0.00015, 0.5), (0.00015, 0.8).
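The perturbation step can be sketched as follows for the constant mechanism, in which every pair shares the same rates. This Python sketch is illustrative only: the function name, the node labelling 0, ..., n - 1 and the seed argument are assumptions, and the connectivity-dependent mechanisms would replace the two constants with pair-specific p_ij and q_ij.

```python
import itertools
import random


def perturb_network(n_nodes, edges, r_fp, r_fn, seed=0):
    """Return the observed link set after independent false positives and negatives.

    Each truly linked pair is dropped with probability r_fn, and each truly
    unlinked pair is added with probability r_fp (constant rates for all pairs).
    """
    rng = random.Random(seed)
    true_links = {tuple(sorted(edge)) for edge in edges}
    observed = set()
    for pair in itertools.combinations(range(n_nodes), 2):
        if pair in true_links:
            if rng.random() >= r_fn:   # link survives (no false negative)
                observed.add(pair)
        elif rng.random() < r_fp:      # spurious link appears (false positive)
            observed.add(pair)
    return observed


# Toy path network on four nodes (illustrative only).
toy_links = [(0, 1), (1, 2), (2, 3)]
print(sorted(perturb_network(4, toy_links, r_fp=0.1, r_fn=0.5)))
```

The loop visits every node pair once, so for a network of about 7,000 nodes it touches roughly 24.5 million pairs; sampling the number of false positives first and then drawing that many unlinked pairs would be a faster alternative.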

Under the nine different error mechanisms, the connectivity distribution of the perturbed Net₀ can be dramatically different. Under error mechanisms S2, S5, S6 and S9, the perturbed networks contain only a small proportion of nodes with low connectivity, which differs greatly from the observed yeast protein interaction networks (Figures 5 and 6). This finding suggests that these four mechanisms are far from the true error structure, and we do not discuss them further. We also observe that changes in r_FP have little impact on the connectivity distribution under all error mechanisms, but a higher value of r_FN increases the probability of nodes with small connectivity under S1, S3 and S8. Mechanisms S4 and S7 are highly stable structures; that is, the connectivity distribution changes little in response to changes in r_FP or r_FN under these two error mechanisms. This suggests that scale-free networks with constant false negative rates can still provide very credible information about their topological structure. This finding is also confirmed by the fact that the estimates of γ vary little when r_FN changes (see Tables A.5 and A.6 in Additional file 1). The estimated values of γ vary only from 2.61 to 3.03 with a standard error of 0.125 under S4 and only from 2.56 to 3.31 with a standard error of 0.161 under S7, whereas the estimate of γ clearly decreases as r_FN increases under S3 and S8 (Tables A.4 and A.7 in Additional file 1). Under S1, there is no clear pattern in the estimated γ as r_FN changes (Table A.3 in Additional file 1), but the estimates of γ vary in a much wider range (1.16–4.35) compared with those under S3 and S8. It is worth noting that our conclusions are restricted to the particular ranges of r_FP and r_FN we have studied; however, these ranges are believed to be reasonable for describing the Y2H systems.

The simple error mechanism S1 with a high false negative rate produces patterns (Figures 8(a) and 10(a)) similar to that of the gold standard data (Figure 6). For the Y2H yeast protein interaction network (Figure 5), S4 gives the best approximation, but still differs slightly in the probabilities of nodes with small connectivity. This suggests that the real error structure of the Y2H analyses may be more complicated than all the simple proposals we have considered.

Conclusion

This article first investigates the impact of erroneous links on network topological inference. From our theoretical and simulation results, we find that, under a simple error mechanism, the scale-free property is preserved for moderate connectivities, but the linear pattern is distorted at both the small and large connectivity regions. Accordingly, we recommend using robust estimators (e.g. LTS), which are more resistant to the outliers at both ends of the distribution, to estimate the scale parameter γ.

Moreover, we have also explored possible error mechanisms of the yeast protein interaction data by simulations considering nine different error mechanisms. The results suggest that changes in the overall false positive rates have little impact on the resulting connectivity distribution, but increasing the overall false negative rates can increase the probability of nodes with small connectivities under some error mechanisms, and hence decrease the scale parameter γ. The connectivity distribution can be very stable under several error mechanisms when the overall false positive rates and false negative rates change, which suggests that in certain situations the observed data can provide sufficient topological information on the underlying network structure even when the false negative rates are quite high.

The simple error mechanism that assumes that the false positive rate and false negative rate of each protein pair are constants agrees well with the MIPS gold standard data when the false negative rate is high. A different error mechanism is suggested for the Y2H data, where more connected protein pairs tend to have higher false positive rates and lower false negative rates. As this error mechanism provides only a reasonable approximation to the Y2H data, more sophisticated mechanisms might be needed to better capture its error structure.

Methods

Preferential attachment growth model

In a series of papers [15, 16], Barabási et al. demonstrated that a scale-free network could be obtained by growing from a small number of isolated nodes by preferential attachment. The simulation scheme is defined in two steps:

1. Growth: starting with a small number (m₀) of nodes, add a new node at every time step and connect it to m (≤ m₀) nodes already present in the system.
2. Preferential attachment: the new node is more likely to connect to nodes with larger connectivity. The probability Π_i that a new node will be connected to node i depends on its connectivity k_i, such that Π_i = k_i / Σ_j k_j.
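The two steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' simulation code: the function name is hypothetical, the m targets of each new node are drawn without replacement by rejection, and, because all nodes start isolated, the first new node picks its targets uniformly at random (its connectivity-proportional probabilities are undefined at that point).

```python
import random


def preferential_attachment(m0, m, steps, seed=0):
    """Grow a network from m0 isolated nodes; return (edges, degree list)."""
    if not 1 <= m <= m0:
        raise ValueError("need 1 <= m <= m0")
    rng = random.Random(seed)
    degree = [0] * m0
    edges = []
    link_ends = []  # each node appears once per incident link, so a uniform
                    # draw from this list is proportional to connectivity
    for _ in range(steps):
        new_node = len(degree)
        targets = set()
        while len(targets) < m:
            if link_ends:
                targets.add(rng.choice(link_ends))
            else:
                # Assumption: uniform choice while every node is still isolated.
                targets.add(rng.randrange(new_node))
        degree.append(0)
        for old_node in targets:
            edges.append((old_node, new_node))
            degree[old_node] += 1
            degree[new_node] += 1
            link_ends.extend((old_node, new_node))
    return edges, degree


# A small run with the article's m0 = m = 7 but far fewer steps (illustrative only).
edges, degree = preferential_attachment(m0=7, m=7, steps=200, seed=1)
print(len(degree), len(edges), max(degree))
```

This sketch produces m₀ + steps nodes and m × steps links, so its counts need not coincide exactly with those reported for Net₀.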

Least Trimmed Squares (LTS)

The basic idea of LTS estimation is to minimize the sum of the h smallest squared residuals, instead of all squared residuals as in OLS, to achieve robustness while maintaining good efficiency. Please refer to [9] for more details of the algorithm, such as practical choices of h. In this article, the LTS estimation is performed using the lqs() function implemented in R [17].
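The idea can be illustrated with a small Python sketch for a straight-line fit. It is not the lqs() implementation used in the article: the function names are hypothetical, candidate lines are generated from every pair of points and then refined by repeatedly refitting OLS to the h points with the smallest squared residuals, and the default h = (n + 3) // 2 is an assumed high-breakdown choice for two parameters.

```python
import itertools


def ols_line(points):
    """Ordinary least squares line through (x, y) points; None if all x are equal."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    if sxx == 0.0:
        return None
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def trimmed_objective(points, slope, intercept, h):
    """Sum of the h smallest squared residuals for the given line."""
    squared = sorted((y - intercept - slope * x) ** 2 for x, y in points)
    return sum(squared[:h])


def lts_line(points, h=None, max_refits=20):
    """Approximate LTS fit; returns (slope, intercept)."""
    n = len(points)
    if h is None:
        h = (n + 3) // 2  # assumed default, not necessarily the article's setting
    best = None
    for (x1, y1), (x2, y2) in itertools.combinations(points, 2):
        if x1 == x2:
            continue
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        objective = trimmed_objective(points, slope, intercept, h)
        for _ in range(max_refits):
            # Keep the h best-fitting points and refit OLS on them.
            ranked = sorted(points, key=lambda p: (p[1] - intercept - slope * p[0]) ** 2)
            refit = ols_line(ranked[:h])
            if refit is None:
                break
            new_objective = trimmed_objective(points, refit[0], refit[1], h)
            if new_objective >= objective:
                break
            slope, intercept = refit
            objective = new_objective
        if best is None or objective < best[0]:
            best = (objective, slope, intercept)
    if best is None:
        raise ValueError("need at least two points with distinct x values")
    return best[1], best[2]


# Toy data: a line with slope -3 whose last three points are shifted upwards.
xs = [0.2 * i for i in range(20)]
points = [(x, 2.0 - 3.0 * x) for x in xs]
points[-3:] = [(x, y + 10.0) for x, y in points[-3:]]
lts_slope, lts_intercept = lts_line(points)
ols_slope, ols_intercept = ols_line(points)
print(lts_slope, ols_slope)
```

In the setting of this article, x would be log k and y would be log P(k), so the estimate of γ is the negative of the fitted slope. On the toy data the trimmed fit recovers the slope of the uncontaminated points, while the OLS slope is pulled towards the shifted tail.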

References

1. Barabási AL, Albert R, Jeong H: Scale-free characteristics of random networks: The topology of the World Wide Web. Physica A 2000, 281: 69–77.
2. Barabási AL, Jeong H, Néda Z, Ravasz E, Schubert A, Vicsek T: Evolution of the social network of scientific collaborations. Physica A 2002, 311: 590–614.
3. Jeong H, Mason SP, Barabási AL, Oltvai ZN: Lethality and centrality in protein networks. Nature 2001, 411: 41–42. 10.1038/35075138
4. Albert R, Jeong H, Barabási AL: Error and attack tolerance of complex networks. Nature 2000, 406: 378–382. 10.1038/35019019
5. Criekinge WV, Beyaert R: Yeast two-hybrid: State of the art. Biol Proced Online 1999, 2: 1–38. 10.1251/bpo16
6. Deng M, Sun F, Chen T: Assessment of the reliability of protein-protein interactions and protein function prediction. Pac Symp Biocomput 2003, 140–151.
7. Dorogovtsev SN, Mendes JFF: Evolution of Networks. New York: Oxford University Press; 2003.
8. Deng M, Mehta S, Sun F, Chen T: Inferring domain-domain interactions from protein-protein interactions. Genome Research 2002, 1540–1548. 10.1101/gr.153002
9. Rousseeuw PJ, Leroy AM: Robust regression and outlier detection. New York: Wiley; 1987.
10. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403: 601–603. 10.1038/35001165
11. Xenarios I, Rice DW, Salwinski L, Baron MK, Marcotte EM, Eisenberg D: DIP: the database of interacting proteins. Nucleic Acids Res 2000, 28: 289–291. 10.1093/nar/28.1.289
12. Y2H protein interaction network data [http://www.nd.edu/~networks/database/protein/bo.dat.gz]
13. MIPS gold standard protein interaction network data [http://hto-b.usc.edu/~msms/AssessInteraction/MIPSMatchYPD.txt]
14. Yeast Proteome Database [http://www.proteome.com/YPDhome.html]
15. Barabási AL, Albert R, Jeong H: Mean-field theory for scale-free random networks. Physica A 1999, 272: 173–187.
16. Albert R, Barabási AL: Topology of evolving networks: local events and universality. Phys Rev Lett 2000, 85: 5234–5237. 10.1103/PhysRevLett.85.5234
17. R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria 2004. [http://www.R-project.org]


Acknowledgements

This work was supported in part by NSF grant DMS 0241160 and NIH grant R01 GM59507.

Author information

Authors and Affiliations

Department of Mathematics, Washington University in St. Louis, St. Louis, MO, 63143, USA
Nan Lin
Department of Epidemiology and Public Health, Yale University, New Haven, CT, 06520, USA
Hongyu Zhao
Department of Genetics, Yale University, New Haven, CT, 06520, USA
Hongyu Zhao


Corresponding author

Correspondence to Hongyu Zhao.

Additional information

Authors' contributions

HZ had the initial idea and initiated the study. NL conducted the data analyses, and created all tables and figures, under the supervision of HZ. Both authors read and approved the final manuscript.

Electronic supplementary material

Additional File 1: Tables of the estimates of the scale parameter γ. (PDF 40 KB)


About this article

Cite this article

Lin, N., Zhao, H. Are scale-free networks robust to measurement errors?. BMC Bioinformatics 6, 119 (2005). https://doi.org/10.1186/1471-2105-6-119


Received: 31 October 2004
Accepted: 16 May 2005
Published: 16 May 2005
DOI: https://doi.org/10.1186/1471-2105-6-119
