Fig. 1 | BMC Bioinformatics


From: Juxtapose: a gene-embedding approach for comparing co-expression networks


Methodology for generating joint gene embeddings from co-expression networks. 1 The co-expression networks are constructed from gene expression data. 2 Anchor genes (\(a_1, \ldots, a_n\)) are selected as anchor nodes; these genes have relatively stable behaviour across the co-expression networks being compared. Dangling structures of \(\gamma\) artificial nodes (shown with dashed borders and grey edges) are added to the graphs with equal edge weights across the networks being compared; in this illustration, \(\gamma = 4\). Each dangling structure is connected to one of the selected anchor nodes in the original networks. 3 These networks are used to generate a set of random walks starting from each gene in each network. 4 The paths through the nodes are treated as sentences and fed to a word2vec model, which learns an informative embedding for each gene in the networks. The model takes a gene in a network together with the genes surrounding it in a path within a defined window, and feeds them to a neural network that, after training, predicts the probability that each gene appears in the window around the focus gene. The process begins with a one-hot vector: all zeros except for a single 1 at the position of the corresponding gene in the network. A \(\Vert G \Vert \times N\) embedding matrix contains one row for every gene in the vocabulary and a number of columns equal to the embedding size N. Pairs of genes are used to train the model and generate a representative embedding for each gene. This learned embedding vector of dimension N forms the hidden layer. The input gene, selected by multiplying the embedding matrix by the input vector, is fed to the model. Multiplying the hidden layer by the word-context matrix produces the output, a prediction of the most probable output gene. The loss is then calculated between the expected gene and the predicted gene.
During backpropagation, when computing the gradient of the loss function, the network weights, including the embeddings for all genes in the vocabulary, are updated. Given a hypothetical path from a random walk \(g_1, g_3, g_4, g_9, \ldots, g_2\) and a window size of 2, g3 yields the input gene pairs (g1, g3) and (g3, g4) under the Skip-gram architecture of word2vec. 5 The pairwise similarity scores between the gene embeddings produced by the word2vec model are calculated. 6 The embeddings and the distances between genes in the embedding space are analysed and visualized
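Step 4 above turns each random walk into (focus gene, context gene) training pairs, as in word2vec's Skip-gram architecture. A minimal sketch of that pair generation is shown below; the function name, the example walk, and the one-gene-per-side window are illustrative assumptions, not Juxtapose's actual implementation (the paper's own toolchain and parameters may differ).

```python
# Hypothetical sketch: converting one random walk over genes into
# Skip-gram (focus, context) training pairs. Names and window size
# are illustrative, not taken from the Juxtapose implementation.

def skipgram_pairs(walk, window):
    """Return (focus_gene, context_gene) pairs from a single walk.

    For each position i, every gene within `window` steps on either
    side of walk[i] becomes a context gene for the focus gene walk[i].
    """
    pairs = []
    for i, focus in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((focus, walk[j]))
    return pairs

# A short walk like the hypothetical path in the caption.
walk = ["g1", "g3", "g4", "g9", "g2"]
pairs = skipgram_pairs(walk, window=1)
# With one context gene per side, the focus gene g3 contributes the
# pairs ("g3", "g1") and ("g3", "g4").
```

Each such pair is one training example: the focus gene is the one-hot input, and the model is trained to raise the predicted probability of the context gene at the output.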
