
Table 1 Gdaphen goals and most relevant features described for each module: pre-processing, analysis and visualization

From: Gdaphen, R pipeline to identify the most important qualitative and quantitative predictor variables from phenotypic data

Gdaphen module: Pre-processing
Step: Pre-processing
Goal: Prepare the data for the analysis
Relevant features:

Anonymization of the samples/individuals when working with human datasets. Renaming of the identifiers (noted "Ind") of each animal if duplicates occur when considered per genotype/sex
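A minimal base-R sketch of the duplicate renaming, assuming a hypothetical data frame pheno with Ind, Genotype and Sex columns (not gdaphen's exact code):

    # Toy data: two animals share the identifier "m1" within the same
    # genotype/sex combination
    pheno <- data.frame(Ind      = c("m1", "m1", "m2"),
                        Genotype = c("WT", "WT", "KO"),
                        Sex      = c("M", "M", "F"))
    # Rename duplicated "Ind" values within each genotype/sex group
    pheno$Ind <- ave(pheno$Ind, pheno$Genotype, pheno$Sex, FUN = make.unique)
    pheno$Ind  # "m1" "m1.1" "m2"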

Imputation of NAs using two methods: (a) the mean, if only one value is missing; (b) aregImpute from the Hmisc package [12], to perform multiple imputation using additive regression, bootstrapping, and predictive mean matching
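A hedged sketch of method (b) with Hmisc on toy data; the column names are illustrative assumptions:

    library(Hmisc)
    set.seed(1)
    d <- data.frame(weight   = rnorm(40, 22, 2),
                    activity = rnorm(40, 50, 5),
                    glucose  = rnorm(40, 100, 10))
    d$weight[c(3, 17)] <- NA   # introduce missing values
    # Multiple imputation using additive regression, bootstrapping and
    # predictive mean matching
    imp <- aregImpute(~ weight + activity + glucose, data = d, n.impute = 5)
    # Extract one completed dataset (first set of imputed values)
    filled <- impute.transcan(imp, imputation = 1, data = d,
                              list.out = TRUE, pr = FALSE)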

Removal of quantitative features with fewer than 3 unique values

Removal of qualitative features with fewer than two factor levels

Scaling of the data to standardize the features
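The three rules above can be sketched in base R as follows, assuming hypothetical quant (quantitative) and qual (qualitative) feature tables:

    quant <- data.frame(a = rnorm(10), b = rep(1, 10), c = rnorm(10))
    qual  <- data.frame(g = factor(rep(c("WT", "KO"), 5)),
                        s = factor(rep("M", 10)))
    quant <- quant[, sapply(quant, function(v) length(unique(v)) >= 3),
                   drop = FALSE]   # keep features with >= 3 unique values
    qual  <- qual[, sapply(qual, nlevels) >= 2,
                  drop = FALSE]    # keep factors with >= 2 levels
    quant <- as.data.frame(scale(quant))   # standardize (mean 0, SD 1)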

Gdaphen module: Analysis
Step: Feature selection
Goal: Analyze the correlation among the explanatory variables or features
Relevant features:

Identification of the highly correlated variables, using a correlation threshold chosen by the experimenter/analyst

Goal: Apply methods for pre-selection of features to decrease the noise in the data by using the low-correlated features
Relevant features:

Methods for pre-selection of features based on:

(a) The removal of highly correlated features (a sketch follows below)

(b) A novel approach that selects the explanatory features contributing more than 30% to the discrimination after running the MFA
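For (a), a sketch using caret's findCorrelation on toy data; the 0.75 cutoff is illustrative only, the actual threshold is chosen by the analyst:

    library(caret)
    set.seed(1)
    quant <- as.data.frame(matrix(rnorm(200), ncol = 5))
    quant$V6 <- quant$V1 + rnorm(40, sd = 0.01)   # near-duplicate feature
    corM <- cor(quant)
    # Indices of features exceeding the analyst-chosen correlation cutoff
    drop <- findCorrelation(corM, cutoff = 0.75)
    quant_low <- if (length(drop)) quant[, -drop, drop = FALSE] else quant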

Gdaphen module: Analysis
Step: Multiple factor analysis (MFA)
Goal: A multiple factor analysis (MFA) to identify the weight of each feature/group of features in the prediction
Relevant features:

Perform a multivariate analysis of mixed data using the function MFAmix from the package PCAmixdata [21]. Some of the outputs of the function are the following (an example call is sketched after the list):

- Squared loadings
- The eigenvalue matrix (eig)
- Separate results for the qualitative and the quantitative features
- Results for the levels of each qualitative feature
- Coordinates of the groups
- Partial individual coordinates
- The coefficients of the linear combinations of features
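A minimal MFAmix call on toy data; the column names and the two-group split are assumptions, not gdaphen's defaults:

    library(PCAmixdata)
    set.seed(1)
    pheno <- data.frame(Genotype = factor(rep(c("WT", "KO"), 15)),
                        Sex      = factor(rep(c("M", "F"), each = 15)),
                        weight   = rnorm(30),
                        activity = rnorm(30),
                        glucose  = rnorm(30))
    res <- MFAmix(data = pheno,
                  groups = c(1, 1, 2, 2, 2),   # feature-to-group assignment
                  name.groups = c("covariates", "behaviour"),
                  ndim = 3, rename.level = TRUE, graph = FALSE)
    res$eig          # eigenvalues
    res$sqload       # squared loadings
    res$ind.partial  # partial individual coordinates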

  

Goal: Cosine similarity distances to identify the degree of similarity between the dependent feature and the explanatory features in the PCA dimensional space, following the method developed by Escofier and Pagès [28]
Relevant features:

Calculation of the cosine similarity distance matrices
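An illustrative cosine similarity computation between two coordinate vectors; gdaphen's own implementation may differ:

    # Cosine similarity between two feature-coordinate vectors in the
    # PCA dimensional space
    cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
    dep  <- c(0.8, 0.1, -0.2)   # dependent feature, first three dimensions
    expl <- c(0.7, 0.3, -0.1)   # one explanatory feature
    cos_sim(dep, expl)          # 1 = identical direction, 0 = orthogonal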

Gdaphen module: Analysis
Step: Classification
Goal: Several classification algorithms to identify the main discriminative explanatory variables, some specifically chosen to identify variables affected by gene dosage (GLM, multiGLM and GLM-Net), plus the supervised method random forest
Relevant features:

Generalized linear regression model, noted GLM. Computed by the function train of the package caret, setting the parameter method = "glm"

Penalized multinomial regression, also called multinomial log-linear models via neural networks, noted multiGLM, using the function multinom from the package nnet [27]

Elastic net method, noted GLM-Net: fits a generalized linear model with more than two factors by using penalized maximum likelihood, combining the ridge and lasso shrinkages and optimizing the parameters lambda and alpha. Computed using the function train of the package caret, setting the parameter method = "glmnet"; requires the packages glmnet and Matrix. The theory is described in Friedman et al. [29] and Simon et al. [30]

A random forest, noted RF, a supervised algorithm. Computed using the function train of the package caret, setting the parameter method = "rf"
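A combined sketch of the four classifiers via caret::train on toy data; X, y and the resampling settings are assumptions (method = "multinom" is caret's penalized multinomial regression backed by nnet):

    library(caret)
    set.seed(1)
    X <- as.data.frame(scale(matrix(rnorm(200), ncol = 4)))
    y <- factor(rep(c("WT", "KO"), length.out = 50))
    ctrl <- trainControl(method = "cv", number = 5)
    fit_glm   <- train(X, y, method = "glm",      trControl = ctrl)
    fit_multi <- train(X, y, method = "multinom", trControl = ctrl,
                       trace = FALSE)                  # needs nnet
    fit_net   <- train(X, y, method = "glmnet",   trControl = ctrl)
                                                       # needs glmnet, Matrix
    fit_rf    <- train(X, y, method = "rf",       trControl = ctrl)
                                                       # needs randomForest
    varImp(fit_rf, scale = TRUE)   # importance scaled to the top feature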

Gdaphen module: Analysis
Relevant features:

Calculation of the mean, SD, error, and the upper and lower confidence intervals for each feature. Computed and displayed with the function ggparcoord from the GGally package [31]
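A minimal GGally::ggparcoord example, using iris as a stand-in for a phenotypic table:

    library(GGally)
    # Parallel coordinates of the quantitative features, standardized
    # (scale = "std"), grouped by a qualitative column
    ggparcoord(iris, columns = 1:4, groupColumn = "Species", scale = "std")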

Gdaphen module: Visualization
Step: Classification results
Goal: Assess the efficiency of the classifiers
Relevant features:

We provide a table and plots where the scaled importance of each feature, relative to the most important one, is shown for all the variables

Produce plots and tables containing the scaled importance of each feature after running the glm/glmNet/multiGLM/RF classifiers for the three input datasets (non-selected and selected features)
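A sketch of extracting and plotting the scaled importances, reusing the hypothetical fit_rf from the classification sketch above:

    library(caret)
    imp <- varImp(fit_rf, scale = TRUE)  # 100 = most important feature
    imp         # table of scaled importances
    plot(imp)   # lattice plot of the scaled importance per feature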

Gdaphen module: Visualization
Step: MFA results
Goal: Nine plots that show the distribution of the individual observations in the top dimensions of the three main principal components of the PCA (quantitative data)/MCA (qualitative data), and the contribution of each feature or group of features to the main principal components (a plotting sketch follows the list)
Relevant features:

1. Variance explained by the top ten PCA components and the cumulative variance
2. Variables' contribution to/correlation with the three main principal components: grouped-features correlation partial axes plot
3. Grouped-features correlation with the three main principal component dimensions: PCA plot
4. Distribution of the observations relative to the classifier discrimination of the qualitative variables in the three top principal component dimensions, using 2D and 3D PCA plot visualizations
5. Squared loadings of the quantitative and qualitative features on the principal components
6. Features' correlation with the three main principal components: all variables' correlations
7. Qualitative features' discrimination: distribution of the qualitative features in the main dimensional spaces
8. Parallel coordinate plots, with and without scaling of the features, to visualize the differences between the factors of your dependent variable
9. Scaled importance of each variable after running the glm/glmNet/multiGLM/RF classifiers
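Several of these panels can be reproduced with PCAmixdata's plot method; a sketch reusing the hypothetical MFAmix result res from the earlier example (the mapping to the nine gdaphen plots is approximate):

    library(PCAmixdata)
    # 'choice' selects the panel type in PCAmixdata's plot method
    plot(res, choice = "ind",    axes = c(1, 2))  # individual observations
    plot(res, choice = "groups", axes = c(1, 2))  # grouped-feature contributions
    plot(res, choice = "sqload", axes = c(1, 2))  # squared loadings
    plot(res, choice = "axes",   axes = c(1, 2))  # partial axes correlations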