GraphCrunch: A tool for large network analyses

Background
The recent explosion in biological and other real-world network data has created the need for improved tools for large network analyses. In addition to well-established global network properties, several new mathematical techniques for analyzing local structural properties of large networks have been developed. Small over-represented subgraphs, called network motifs, have been introduced to identify simple building blocks of complex networks. Small induced subgraphs, called graphlets, have been used to develop "network signatures" that summarize network topologies. Based on these network signatures, two new highly sensitive measures of network local structural similarities were designed: the relative graphlet frequency distance (RGF-distance) and the graphlet degree distribution agreement (GDD-agreement). Finding adequate null-models for biological networks is important in many research domains. Network properties are used to assess the fit of network models to the data. Various network models have been proposed. To date, there does not exist a software tool that measures the above-mentioned local network properties. Moreover, none of the existing tools compare real-world networks against a series of network models with respect to these local as well as a multitude of global network properties.

Results
We introduce GraphCrunch, a software tool that finds well-fitting network models by comparing large real-world networks against random graph models according to various network structural similarity measures. It has the unique capability of computing the computationally expensive RGF-distance and GDD-agreement measures. In addition, it computes several standard global network measures and thus supports the largest variety of network measures thus far. It is also the first software tool that compares real-world networks against a series of network models and that has built-in parallel computing capabilities, allowing for a user-specified list of machines on which to perform compute-intensive searches for local network properties. Furthermore, GraphCrunch is easily extendible to include additional network measures and models.

Conclusion
GraphCrunch is a software tool that implements the latest research on biological network models and properties: it compares real-world networks against a series of random graph models with respect to a multitude of local and global network properties. We present GraphCrunch as a comprehensive, parallelizable, and easily extendible software tool for analyzing and modeling large biological networks. The software is open-source and freely available at . It runs under Linux, MacOS, and Windows Cygwin. In addition, it has an easy-to-use on-line web user interface that is available from the above web page.

Since finding local structural properties of networks is compute-intensive, we have equipped GraphCrunch with parallel computing capabilities. A user can distribute GraphCrunch processing over a cluster of machines by including -r "machine1 machine2 machineN" in the above command, where machine1, machine2, and machineN are machine names.
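As a minimal sketch of how the -r option fits into an invocation, the Python snippet below appends a machine list to an existing ./crunch argument vector. The helper name with_machines, the abbreviated base command, and the machine names are hypothetical; only the -r flag and its single space-separated value are taken from the text above.

```python
import shlex


def with_machines(argv, machines):
    """Return a copy of argv with -r "<m1> <m2> ..." appended.

    Hypothetical helper: it only assembles the argument list. It does not
    run GraphCrunch or check that the machines are reachable.
    """
    if not machines:
        return list(argv)
    # All machine names are passed as one space-separated argument.
    return list(argv) + ["-r", " ".join(machines)]


# Illustrative, abbreviated base command; a real run needs the full option set.
base = ["./crunch", "-f", "Net1.gw", "-n", "3"]
distributed = with_machines(base, ["machine1", "machine2", "machineN"])

# shlex.join quotes the machine list so it stays a single shell argument.
print(shlex.join(distributed))
```

Building the command as a list rather than a string keeps the machine list as one argument regardless of how the shell would otherwise split it.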

The run-dialog interface
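The run-dialog interface is started by the command: ./run-dialog <input graph name>. The check-boxes are used to (un)select the random graph models (Figure S3 A); the subsequent dialogs set the number of random graphs per model (Figure S3 B), the parameters (Figure S3 C), the comparisons (Figure S3 D), and the name of the output file (Figure S3 E). The user can then proceed with processing using the current selections (Figure S3 F), or configure "advanced options" to change network models, parameters, comparisons, or the machines over which to distribute the processing (see the GraphCrunch webpage for details).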

GraphCrunch on-line web user interface
The GraphCrunch on-line web user interface is available from the GraphCrunch webpage. To ensure fair access for all users to our computing resources, the maximum network size is limited to 20,000 nodes and 80,000 edges, the maximum number of random graphs per network model is limited to five, and a user is allowed to process up to four real-world networks per day. (No limitations exist for the other two GraphCrunch interfaces that users run on their own machines.) From the main on-line GraphCrunch web page, users can log in to their accounts, or access the help page and see some examples. New users are given temporary "guest" accounts that are intended for short-term use. At any time, users can change their guest accounts into permanent ones that are password protected. Both types of accounts allow access to the on-line GraphCrunch functions and the users' data sets at any time.
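Before uploading, a user may want to check a network against the size limits quoted above. The following Python sketch counts nodes and edges in an edge list and compares them with those limits. It assumes a whitespace-separated "u v" pair per line, treats the limits as inclusive, and ignores duplicate edges and self-loops; how the on-line interface itself counts these is not stated here, and the function names are hypothetical.

```python
MAX_NODES = 20_000  # node limit quoted for the on-line interface
MAX_EDGES = 80_000  # edge limit quoted for the on-line interface


def network_size(edge_lines):
    """Count distinct nodes and undirected edges in edge-list lines.

    Assumption: one whitespace-separated "u v" pair per line. Lines with
    fewer than two fields are skipped; duplicate edges and self-loops are
    not counted as edges.
    """
    nodes, edges = set(), set()
    for line in edge_lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        u, v = parts[0], parts[1]
        nodes.update((u, v))
        if u != v:
            edges.add(frozenset((u, v)))
    return len(nodes), len(edges)


def fits_online_limits(edge_lines):
    """True if the network is within the quoted node and edge limits."""
    n_nodes, n_edges = network_size(edge_lines)
    return n_nodes <= MAX_NODES and n_edges <= MAX_EDGES


# Tiny illustrative network: a triangle with one repeated edge.
example = ["a b", "b c", "c a", "a b"]
print(network_size(example), fits_online_limits(example))
```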
After login, the menu of the on-line GraphCrunch is organized as follows:
• The Data Sets page (Figure S4) allows for uploading new data sets and accessing already submitted data sets. If the data sets are still being processed, users can monitor the status of their processing. For a given data set, a user can download the input file and the results. The input file is available for download in three different formats: edge list format (.txt), LEDA format (.gw), and GML format (.gml). The results are available in .xls format.
• The Results page (Figure S5) provides visualization of the results of the processed data sets. A user can choose to visualize all or some of their data sets. Spectral properties are plotted: the degree distribution, the spectrum of shortest path lengths (denoted by "distance spectrum"), the clustering spectrum, and the graphlet frequency spectrum. A "mouse over" on these plots gives the Pearson's correlation coefficients between the properties of the data and the model networks. Other parameters and comparisons are given as numerical values.
• The GDD Viewer page (Figure S6) presents plots of graphlet degree distributions (GDDs) for each of the 73 orbits of real-world and model networks. GDDs of only one model network for the corresponding random graph model are displayed.
• The My Account page enables users to manage their accounts.
• The Help page provides instructions on how to use the on-line GraphCrunch and how to interpret the results.
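The Pearson's correlation coefficients reported on the Results page can be illustrated with a small Python sketch that computes the degree distributions of a data network and a model network and correlates them. The alignment of the two distributions (degrees 1 to the largest observed degree, with missing degrees counted as zero) is an assumption made for this example and is not necessarily how GraphCrunch prepares its spectra; the function names and the toy networks are hypothetical.

```python
from collections import Counter
from math import sqrt


def degree_distribution(edges):
    """Map each degree k to the number of nodes that have degree k."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def compare_distributions(dist_a, dist_b):
    """Correlate two degree distributions over a shared degree range.

    Assumption: degrees run from 1 to the largest degree seen in either
    network, and a degree that does not occur contributes a count of 0.
    """
    ks = range(1, max(max(dist_a), max(dist_b)) + 1)
    return pearson([dist_a.get(k, 0) for k in ks],
                   [dist_b.get(k, 0) for k in ks])


# Toy networks standing in for a data network and one model network.
data_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("d", "e")]
model_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
r = compare_distributions(degree_distribution(data_edges),
                          degree_distribution(model_edges))
print(round(r, 4))
```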

Interpreting results
An example of the tabular output file is presented in Table S2: network "Net1.gw" is compared against five network models, three random networks are used per network model, and the comparisons are done with respect to all of the network properties currently supported by GraphCrunch. The command that produces these results is:

./crunch -f Net1.gw -p "diameter-avg clustcoef-avg graphlet-count" -c "degree-distrib diameter-spectrum clust-spectrum gdd-agreement:amean gdd-agreement:gmean graphlet-dist" -m "er er_dd geo sf sticky" -n 3 -o output_example.tsv

The set of intermediate files is stored in the subdirectories of the data/<input-file-name> directory (e.g., the generated model networks that correspond to the input network are saved in the subdirectory data/<input-file-name>/models).
The visualized output, i.e., the plots and all of the files necessary for their generation (e.g., .gnuplot files), is stored in the plots/ directory. Before executing the plot function of GraphCrunch (given below), the following must be satisfied: (1) the data sets must be processed with the ./crunch command (described in Section 2.1) and the corresponding intermediate files must exist in the data/ directory; and (2) the gnuplot command-line plotting utility must be available on the user's system. Visualized output files are created by running the plot.sh script (found in the contrib/ directory); the syntax for running this script is:

where multiple input networks or models may be specified, as illustrated in the example below. Examples of plots are presented in Figure S7: four network properties for three input data sets and five network models per data set are illustrated. The command used to produce Figure S7 A is:

Figure S1. All 3-node, 4-node and 5-node graphlets [1].

Figure S2. Automorphism orbits 0, 1, 2, ..., 72 for the thirty 2-, 3-, 4-, and 5-node graphlets G0, G1, ..., G29. In a graphlet Gi, i ∈ {0, 1, ..., 29}, nodes belonging to the same orbit are of the same shade [2].

Figure S3. The sequence of steps in the run-dialog GraphCrunch interface: (A) choosing network models against which the real-world network is to be compared; in the figure, all five network models (er, er_dd, geo, sf, and sticky) are selected; (B) specifying the number of random graphs to be generated per network model; in the figure, 3 graphs are to be generated per random network model; (C) choosing the parameters (as described in Section 2.1) to be computed for the data and model networks; in the figure, all three parameters (clustcoef-avg, diameter-avg, and graphlet-count) have been selected; (D) choosing the comparisons (as described in Section 2.1) to be computed between the data and model networks; in the figure, the following comparisons have been selected: clust-spectrum, degree-distrib, diameter-spectrum, gdd-agreement:amean, gdd-agreement:gmean, and graphlet-dist; (E) specifying the name of the tabular output file; in the figure, the file named "output_example.tsv" is designated as the output file; (F) proceeding with processing with the current selections (by choosing the "Done - Run crunch" option).

Figure S4. Data sets page of the GraphCrunch on-line web user interface. Users can upload new data sets (in the "Upload New Data Sets" section of the page) and access already submitted data sets (in the "Current Data Sets" section of the page). If the data sets are still being processed, users can monitor the status of the processing. For a given data set, a user can download the input file (three formats are available) and the results.

Figure S5. Results page of the GraphCrunch on-line web user interface. It provides visualization of the results of the processed data sets, with an option for filtering the data sets for which the results will be shown. In the figure, the results for the data set named "ItoCore" (denoted by "DATA" and highlighted in blue) and five model networks (denoted by "er", "er_dd", "geo3d", "sf", and "sticky") are shown. Two networks are processed per network model. Spectral properties are plotted: the degree distribution, the spectrum of shortest path lengths (denoted by "distance spectrum"), the clustering spectrum, and the graphlet frequency spectrum. A "mouse over" on these plots gives the Pearson's correlation coefficients between the properties of the data and the model networks. Other parameters (average diameter, clustering coefficient, total number of graphlets) and comparisons (RGF-distance, denoted by "Graphlet Distance", and GDD-agreement) are given as numerical values.

Figure S6. GDD Viewer page of the GraphCrunch on-line web user interface. It contains plots of graphlet degree distributions (GDDs) for each of the 73 orbits of real-world and model networks. In the figure, GDDs of the data set named "bork2455 gene short" (denoted by "DATA" and highlighted in blue) and five model networks (denoted by "er", "er_dd", "geo3d", "sf", and "sticky") are shown. Ten GDD plots are displayed in each row, and 50 is chosen as the maximum degree on the x-axis in each of the plots.

Figure S7. Examples of plots that illustrate the fit of five network models (ER, ER-DD, GEO-3D, SF-BA, and STICKY) to three data sets (Net1, Net2 and Net3) with respect to four network properties: (A) GDD-agreement, (B) RGF-distance, (C) the degree distribution, and (D) the spectrum of shortest path lengths. Points in panels represent averages of properties over model networks belonging to the same random graph model; the error bars represent one standard deviation above and below the average.

Table S2
An example of the output file resulting from processing input network Net1.gw by GraphCrunch. The five network models and all of the currently supported properties are presented. Three networks per random graph model were generated (denoted by 1, 2, and 3 in column "Random Networks/Stats"). Thus, the total number of random networks analyzed in this example is 3 × 5 = 15. The column denoted by "Data Network" contains the name of the real-world network being analyzed. The column denoted by "Model Networks" contains the names of network models against which the data is being compared ("er", "er_dd", "sf", "geo", and "sticky"). Random graphs from the same network model are denoted by a sequence of integers presented in column "Random Network/Stats"; in the same column, "AVG" and "STDDEV" denote that the fields in these rows contain the averages and standard deviations of network properties (given in columns to the right) computed over all random graphs from the given network model that were generated and analyzed by GraphCrunch. The columns denoted by "Average Diameter" and "Clustering Coefficient" contain the average diameter and the average clustering coefficient of a network, respectively. The column denoted by "Total Number of Graphlets" contains the total number of all 2- to 5-node graphlets in a network. The columns denoted by "Degree Distribution (Pearson)", "Distance Spectrum (Pearson)" and "Clustering Spectrum (Pearson)" contain the Pearson's correlation coefficients of the degree distributions, the spectra of shortest path lengths, and the clustering spectra between the real-world and model networks, respectively. The columns denoted by "GDD agreement (amean)" and "GDD agreement (gmean)" contain the arithmetic and geometric means of GDD-agreements between the real-world and model networks, respectively. Finally, the column denoted by "RGF Distance" contains the RGF-distance between the data and model networks.
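To make the "AVG" and "STDDEV" rows and the two GDD-agreement columns concrete, the Python sketch below computes these summaries from made-up numbers. Two assumptions are made and should be checked against the software: the standard deviation is taken as the sample standard deviation (dividing by n - 1), and the arithmetic and geometric means are taken over per-orbit agreement scores in (0, 1], one score per orbit. The function names and all input values are hypothetical and are not taken from Table S2.

```python
from math import exp, log
from statistics import mean, stdev


def summarize(values):
    """Return (AVG, STDDEV) of one property over a model's random graphs.

    Assumption: STDDEV is the sample standard deviation (n - 1 divisor).
    """
    return mean(values), stdev(values)


def agreement_means(orbit_agreements):
    """Return (arithmetic mean, geometric mean) of per-orbit agreements.

    Assumption: every agreement score is strictly positive, so the
    logarithm used for the geometric mean is defined.
    """
    amean = mean(orbit_agreements)
    gmean = exp(mean([log(a) for a in orbit_agreements]))
    return amean, gmean


# Made-up clustering coefficients for three random graphs of one model.
avg, sd = summarize([0.010, 0.012, 0.014])
print(avg, sd)

# Made-up per-orbit agreement scores (a real run would have 73 of them).
amean, gmean = agreement_means([1.0, 0.25])
print(amean, gmean)
```

The geometric mean never exceeds the arithmetic mean, so a single poorly matching orbit lowers the gmean column more than the amean column.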