Fig. 5 | BMC Bioinformatics

Fig. 5

From: A method combining a random forest-based technique with the modeling of linkage disequilibrium through latent variables, to run multilocus genome-wide association studies

Principle of the learning algorithm of the FLTM model. Illustration for the first iteration. a Given some scalable clustering method, the observed variables are partitioned into disjoint clusters. b For each cluster C of size at least 2, a latent class model (LCM) is straightforwardly inferred. An LCM simply connects the variables in cluster C to a single new latent variable L. c The cardinality of this latent variable is computed as an affine function of the number of child nodes in the LCM, capped at a maximum cardinality. d The EM algorithm is run on the LCM and provides the LCM’s parameters (i.e. the probability distributions of the LCM’s nodes). e Now that the probability distribution is known for L, the quality of the latent variable is assessed as follows: the average mutual information between L and any child in C, normalized by the maximum of the entropies of L and any child in C, is compared to a user-specified threshold (τ); with mutual information defined as \(MI(X,Y) = \sum _{x\in Dom(X)}\ \sum _{y\in Dom(Y)}\ \mathbb {P}(x,y)\log {\frac {\mathbb {P}(x,y)}{\mathbb {P}(x)\mathbb {P}(y)}}\), and entropy defined as \(H(X) = - \sum _{x\in Dom(X)} \mathbb {P}(x)\log \mathbb {P}(x)\). f If the latent variable is validated, the FLTM model is updated: in the FLTM under construction, a new node representing L is connected to the variables in C; the former probability distribution \(\mathbb {P}(ch)\) of any child variable ch in C is replaced with \(\mathbb {P}(ch / L)\). The probability distribution \(\mathbb {P}(L)\) is stored. Finally, the variables in C are no longer referred to in the data; latent variable L is considered instead. The updated graph and data are now ready for the next iteration. This process is iterated until all remaining variables are subsumed by one latent variable or no new valid latent variable can be created.
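The validation test in step e can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the criterion is the average, over the children ch in C, of MI(L, ch) / max(H(L), H(ch)), and that a latent variable is accepted when this score is at least τ; the caption does not state the direction of the comparison. The function names and the table layout (`p_latent`, `cond_tables`) are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy H(X) = -sum_x P(x) log P(x), with 0 log 0 taken as 0."""
    return -sum(px * math.log(px) for px in p if px > 0)


def mutual_information(joint):
    """MI(X, Y) computed from a joint table joint[x][y] = P(x, y)."""
    px = [sum(row) for row in joint]          # marginal P(x)
    py = [sum(col) for col in zip(*joint)]    # marginal P(y)
    mi = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log(pxy / (px[x] * py[y]))
    return mi


def latent_quality(p_latent, cond_tables):
    """Average normalized mutual information between L and its children.

    p_latent[l]          = P(L = l)
    cond_tables[k][l][v] = P(ch_k = v | L = l)
    Assumption: each child's MI is normalized by max(H(L), H(ch_k)).
    """
    h_latent = entropy(p_latent)
    scores = []
    for cond in cond_tables:
        # Joint P(L = l, ch = v) = P(L = l) * P(ch = v | L = l)
        joint = [[p_latent[l] * pv for pv in cond[l]]
                 for l in range(len(p_latent))]
        p_child = [sum(col) for col in zip(*joint)]
        denom = max(h_latent, entropy(p_child))
        scores.append(mutual_information(joint) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def is_valid(p_latent, cond_tables, tau):
    """Assumed decision rule: accept L when the score reaches the threshold."""
    return latent_quality(p_latent, cond_tables) >= tau
```

With a uniform binary L, a child that copies L scores 1 and a child that is independent of L scores 0, so a cluster made of these two children has an average score of 0.5.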
For any latent variable L and any observation j, the value of L can be imputed by sampling from the probability distribution \(\mathbb {P}(L / C)\), given j’s values for the child variables in cluster C.
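The sampling step can be sketched as follows, under the LCM assumption that the children are conditionally independent given L, so that the posterior is proportional to P(L = l) multiplied by the product over children of P(ch = c / L = l). This is an illustrative sketch, not the authors' code; the function names, the table layout and the optional `rng` argument are hypothetical.

```python
import random


def posterior_latent(p_latent, cond_tables, child_values):
    """Posterior of L given one observation's child values in an LCM.

    p_latent[l]          = P(L = l)
    cond_tables[k][l][v] = P(ch_k = v | L = l)
    child_values[k]      = observed value (as an index) of child k
    """
    weights = []
    for l, pl in enumerate(p_latent):
        w = pl
        for cond, v in zip(cond_tables, child_values):
            w *= cond[l][v]
        weights.append(w)
    total = sum(weights)
    if total == 0:
        raise ValueError("observation has zero probability under the LCM")
    return [w / total for w in weights]


def impute_latent(p_latent, cond_tables, data, rng=None):
    """Draw one value of L per observation (row of child values) in data."""
    rng = rng or random.Random()
    states = list(range(len(p_latent)))
    imputed = []
    for row in data:
        post = posterior_latent(p_latent, cond_tables, row)
        imputed.append(rng.choices(states, weights=post)[0])
    return imputed
```

Passing a seeded generator, for example `impute_latent(p, tables, data, rng=random.Random(0))`, makes the imputed column reproducible from run to run.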
