CSN: unsupervised approach for inferring biological networks based on the genome alone

Background Most organisms cannot be cultivated, as they live in unique ecological conditions that cannot be mimicked in the lab. Understanding the functionality of those organisms’ genes and their interactions by performing large-scale measurements of transcription levels, protein-protein interactions or metabolism, is extremely difficult and, in some cases, impossible. Thus, efficient algorithms for deciphering genome functionality based only on the genomic sequences with no other experimental measurements are needed. Results In this study, we describe a novel algorithm that infers gene networks that we name Common Substring Network (CSN). The algorithm enables inferring novel regulatory relations among genes based only on the genomic sequence of a given organism and partial homolog/ortholog-based functional annotation. It can specifically infer the functional annotation of genes with unknown homology. This approach is based on the assumption that related genes, not necessarily homologs, tend to share sub-sequences, which may be related to common regulatory mechanisms, similar functionality of encoded proteins, common evolutionary history, and more. We demonstrate that CSNs, which are based on S. cerevisiae and E. coli genomes, have properties similar to ‘traditional’ biological networks inferred from experiments. Highly expressed genes tend to have higher degree nodes in the CSN, genes with similar protein functionality tend to be closer, and the CSN graph exhibits a power-law degree distribution. Also, we show how the CSN can be used for predicting gene interactions and functions. Conclusions The reported results suggest that ‘silent’ code inside the transcript can help to predict central features of biological networks and gene function. This approach can help researchers to understand the genome of novel microorganisms, analyze metagenomic data, and can help to decipher new gene functions. Availability Our MATLAB implementation of CSN is available at https://www.cs.tau.ac.il/~tamirtul/CSN-Autogen


3
The GO table we used for the annotation on the E. coli map taken from the Gene Ontology Consortium database (Ashburner et al., 2000) Data Source As a proof of concept, we chose to construct the CSN on S. cerevisiae and E. coli genomes due to a large amount of available information regarding their gene interactions, expression level data, gene functionality, and more. As reference networks, we used well-established biological networks to show that the CSN can be competitive to these networks, which are based on experimental data.
Both organisms' protein level data was downloaded from the PaxDB site (Wang et al., 2015). For S. cerevisiae, we used the " 4932-WHOLE_ORGANISM-integrated.txt" file published in 2015, and For E. coli, the "511145-WHOLE_ORGANISM-integrated.txt" file also published in 2015. From the initial set of 6008 S. cerevisiae genes, 91 genes' protein levels were missing from the database and were ignored in the relevant test. Similarly, 223 genes were missing for E. coli and hence were ignored. Predicted CDS with annotation. Another file mapped the annotation ID to its definition. The CDS prediction is performed using FragGenScan on the reads, and InterProScan annotated all predicted proteins.
The additional information for the InterPro entries was mapped by the InterPro database (Mitchell, A. et al., 2014).

Evaluating the CSN based on Coding Region and Region Upstream of it Separately
The sequence range we used for the S. cerevisiae and the E. coli CSNs in the article is the beginning of the coding region together with the coding region's upstream sequence (See Fig. 1C).
To demonstrate that the sequence range we used in order to infer the CSN, is superior to using only the coding region or using only the sequence upstream of the coding region, we repeated the analyses for each of these two regions separately and compared the performances.
As can be seen in figure S11, the combined network is more informative for PA and interaction prediction. The CSN that based only on the coding region gets a slightly better result for functionality prediction. These results suggest that in general, it is better to work with the combined score and that for function prediction, the upstream region may be omitted.

Determine Sequence Range
We chose to implement our algorithm on genomic sequences with a constant length and with a similar range around the start site of the gene's coding region. Specifically, for S. cerevisiae, we picked the end of the gene promoter, starting from 100 nucleotides upstream to transcription initiation site (in the promoter), and ending (up to) 500 nucleotides into the reading frame. These regions tend to hold many gene regulatory signals, and we expect that the gene functionality is related to them ( From 6,008 S. cerevisiae genes, we removed 36 genes (Supplemental Fig. S6) with a coding sequence shorter than 120nt. Assuming these are, most likely, hypothetical genes, based on their description.
For S. cerevisiae, the CDS sequence length was set to 500 nucleotides, based on the fact that 85% of the analyzed coding regions are longer than 500 nucleotides. The organism's average coding region length is 1485 nucleotides. Moreover, it is known that the region with various regulatory signals is usually near the start codon (figure S8) (Tuller and Zur, 2015).
Coding regions shorter than 500nt were considered with their real length. Coding regions shorter than 120nt were removed from all analyses (Supplementary Table S6).
S. cerevisiae's promoter length was set to 100 nucleotides since with this length, in 96% of genes, the defined promoter sequence will not include the previous gene. The average distance between S. cerevisiae's coding regions is 565 nucleotides (See supplementary fig S3). E. coli sequences start 50 nucleotides upstream to the start codon and end 250 nucleotides downstream to it, based on the gene's length and the distribution of the distance between genes (Fig S12C, S12D).
To make sure we are using the right length of regulatory and coding sequences, we run additional experiments that analyze CSNs with different sequence lengths. Results show that CSN, with the coding region of 500 nucleotides and regulatory region of 100 nucleotides, gets the best scores in all measured aspects ( Supplementary Fig. S1).
Data exploration of the Metagenomic sample revealed that all sequences in the sample are longer than 100 nucleotides (see Fig. S16 for lengths distribution of the different genes). For some of the sequences, the suggested amino acid sequence is only part of the nucleotide sequence but always longer than 33 amino-acid.
As we did with the other networks, we analyzed the first 99 nucleotides of all the sequences.

Data Preprocessing
To compare CSN with a known biological network created by long term laboratory work, we used several 1999) and node degrees are available for the most and least centered genes of S. cerevisiae. (Fig. S10)."

Functional Annotation Analysis
Information about genes known as functional attributes (terms) was taken from the Gene Ontology DataBase (Ashburner et al., 2000). We analyzed both the full lists of 4,373 GO terms and the ' GO slim' set, a subset of 166 GO terms. We considered the Biological Process, Molecular Function, and Cellular Component ontologies.
In most cases, we focused on the GO slim set.

The SAFE Algorithm
This annotation method aimed at finding "regions" enriched with specific functional annotation (or other characteristics) in a biological network that undergo layout in a 2D, where similar nodes tend to be closer.
The network is embedded in 2D based on force-directed layout, where nodes are placed by their edge's weights.

6
First, for each node in the graph, the algorithm defines a set of nodes, with distance (radius) smaller than a certain threshold as the node's neighborhood. From a radius range of 0-100, the default value was 0.5.
Based on a GO information (a table with the functional annotations of all nodes/genes), the algorithm calculates for each gene g and GO term n a hypergeometric p-value. Suppose that there are x genes in the neighborhood of g with the term n. A hypergeometric p-value is computed based on the probability of getting ≥ genes with the term n, in a neighborhood with the same node size as the g neighborhood, that are related to the term n under uniform probability. If enough genes are found as significantly related to the same term and are clustered close to one another in the plotted network, they will be colored with the same color on the output map (see, for example, Fig. 4A (Baryshnikova, 2018; Baryshnikova, 2016); see Equation 5.
For the metagenomic sample, because the graph is small and there are some very low weights, we ran SAFE with one modification; the neighborhood radius was set to 10.

Validating SAFE Results
We used the GO matrix and the SAFE output p-value matrix to calculate a Sensitivity Score for each term in the GO-slim set. The Sensitivity Score (Powers, 2011) is the percentage of genes that are significantly related to a term based on our prediction and also known to be related to the term based on the GO table.

Validating Interaction Prediction
After matching our computational network (CSN/ppCSN) with the reference network densities and sizes, we used the two connectivity matrices to calculate a Sensitivity Score for each of the computational networks and each interaction type. In this case, the Sensitivity Score (Powers, 2011) is the percentage of interactions that appear in our network CSN/ppCSN, which are also known as real interaction based on the reference network. We also calculated interaction prediction hypergeometric p-value, which is the probability of getting ≥ interactions\edges in CSN, from m optional interactions, that are also known as real biological interactions from type t, under uniform probability (see Supplemental Fig. S7).

ROC Diagram
Network's Co-Complex interaction prediction True Positive Rate [TPR] is the percentage of CSN/ppCSN edges that also appear in the co-complex network. The False Positive Rate [FPR] is the percentage of CSN/ppCSN edges that are not part of the co-complex network.
The ROC diagram (Fig. 3C) was generated as follow: A set of graphs with densities from 0 to 1 (a fully connected graph) was generated by adding edges to the CSN nodes. Next, we calculate the Co-Complex interaction prediction TPR and FPR, for each different density network.

Biological Networks Alignments
Biological networks are not fully aligned, even when related to the same organism and gene set.
Here we run a short experiment that compares the interactions alignment between 2 known biological networks that we used in the interaction prediction section: STRING and Combined APMS networks.  The SAFE method   Inputs: network N, GO table G, and a cutoff distance r. 1. Apply the force-directed layout algorithm on network N.
2. For every node vi in network G, define a local neighborhood -a set of nodes, such that each node, vj, in the set is with limited Euclidean distance r to the examined node on the network map.
C vi, vj = x i − x j 2 + y i − y j 2 ≤ r, (x i , y i ), (x j , y j ) are the coordinates of the nodes vi, vj respectively.
• (x i , y i ), (x j , y j ) are the coordinates of the nodes vi, vj respectively.
• By default, r equals to the 0.5th-percentile of all pair-wise node distances in the network.
• Radius (denoted here as r) -is the distance threshold that defines the size of the neighborhood. All genes that hold Euclidean distance from the examined node smaller than r on the network map will be considered in its neighborhood.

Calculate the enrichment of the node's neighborhoods for each functional attribute
For term t and node v: • Where N is the total number of nodes in the network, F is the number of network nodes that are known to be related to the functional attribute t, U is the number of nodes in the neighborhood of node v, and S is the number of nodes in the neighborhood of node v that known as related to functional attribute (term) t.
4. Determine if node v neighborhood is significantly enriched for term t if P(v,t)<=0.05.
5. Define functional attribute as region-specific, if at least 65% of its significantly enriched neighborhoods are within a distance of 2r from each other, where r is the distance threshold that defines the size of the radius of the neighborhood (see step 2).
6. Enrichment landscape is defined as the enrichment scores' vector for a specific functional attribute per all network's nodes.
7. Group functional attributes into functional domains based on the similarity (correlation) of their enrichment landscapes.
8. The functional domain is defined as the functional attributes' groups with enrichment landscapes' similarity.
9. The Algorithm assigned the same color to all functional attribute in the same domain.
Outputs: (1) composite functional attribute map of the network. (2) p-values matrix of the gene-term enrichment scores.