Gdaphen module | Step | Goal | Relevant features |
---|---|---|---|
Pre-processing | Pre-processing | Prepare the data for the analysis | Anonymization of the samples/individuals when working with human datasets; renaming of the identifiers, noted as "Ind", of each animal if duplicates exist when considered per genotype/sex; imputation of NAs using two methods: (a) the mean, if only one value is missing, or (b) aregImpute from the Hmisc package [12] to perform multiple imputation using additive regression, bootstrapping, and predictive mean matching; removal of quantitative features with fewer than three unique values; removal of qualitative features with fewer than two factor levels; scaling of the data to standardize the features |
Analysis | Feature selection | Analyze the correlation among the explanatory variables or features | Identification of the highly correlated variables, setting a correlation threshold chosen by the experimenter/analyst |
Analysis | Feature selection | Apply pre-selection methods to decrease the noise in the data by keeping the low-correlated features | Methods for the pre-selection of features based on: (a) the removal of highly correlated features; (b) a novel approach that selects the explanatory features contributing more than 30% to the discrimination after running the MFA |
Analysis | Multiple factor analysis (MFA) | A multiple factor analysis (MFA) to identify the weight of each feature/group of features in the prediction | Perform a multivariate analysis of mixed data using the function MFAmix from the package PCAmixdata [21]. Among the outputs of the function are: the squared loadings; the eig matrix with the eigenvalues; separate results for the qualitative and the quantitative features; results for the levels of each qualitative feature; the coordinates of the groups; the partial individual coordinates; and the coefficients of the linear combinations of features |
Analysis | Multiple factor analysis (MFA) | Cosine similarity distances to identify the degree of similarity between the dependent feature and the explanatory features in the PCA dimensional space; method developed by Escofier and Pagès [28] | Calculate the cosine similarity distance matrices |
Analysis | Classification | Several classifier algorithms to identify the main discriminative explanatory variables, some specifically chosen to identify variables affected by gene dosage, such as GLM, multiGLM and GLM-Net, plus a supervised method, random forest | A generalized linear regression model, noted as GLM, computed with the function train of the package caret, setting the parameter method = "glm". A penalized multinomial regression, also called multinomial log-linear models fitted via neural networks, noted as multiGLM, using the function multinom from the package nnet [27]. An elastic net method, noted as the GLM-Net model, which fits a generalized linear model with more than two factors by penalized maximum likelihood, combining the ridge and lasso shrinkages and optimizing the parameters lambda and alpha; it is computed with the function train of the package caret, setting the parameter method = "glmnet", and requires the packages glmnet and Matrix; the theory is described in Friedman et al. [29] and Simon et al. [30]. A random forest, noted as RF, a supervised algorithm computed with the function train of the package caret, setting the parameter method = "rf" |
Analysis | | Calculation of the mean, SD, standard error, and the upper and lower confidence intervals for each feature | Computed using the function ggparcoord from the GGally package [31] |
Visualization | Classification results | Assess the efficiency of the classifiers. We provide a table and plots showing the scaled importance of each feature, relative to the most important one, for all the variables | Produce plots and tables containing the scaled importance of each feature after running the glm/glmNet/multiGLM/RF classifiers for the three input data sets (features selected and not selected) |
Visualization | MFA results | Nine plots that show the distribution of the individual observations in the top dimensions of the three main principal components of the PCA (quantitative data)/MCA (qualitative data) and the contribution of each feature or group of features to the main principal components | 1. Variance explained by the top ten PCA components and the cumulative variance. 2. Variable contribution/correlation with the three main principal components: grouped-features correlation partial axes plot. 3. Grouped-features correlation with the three main principal component dimensions: PCA plot. 4. Distribution of the observations relative to the classifier discrimination of the qualitative variables in the three top dimensions of the principal components, using 2D and 3D PCA plot visualizations. 5. Squared loadings of the quantitative and qualitative features and the principal components. 6. Feature correlation with the three main principal components: all variables' correlations. 7. Qualitative feature discrimination: distribution of the qualitative features in the main dimensional spaces. 8. Parallel coordinate plots, with and without scaling of the features, to visualize the differences between the factors of your dependent variable. 9. Scaled importance of each variable after running the glm/glmNet/multiGLM/RF classifiers |
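The modules above map onto a handful of calls to the R packages named in the table. The following is a minimal sketch of that flow, assuming a hypothetical phenotypic data frame `pheno` with a qualitative outcome `genotype`; the variable names, toy data, and group assignments are illustrative and not part of Gdaphen itself:

```r
library(caret)        # train() wraps the glm / glmnet / rf classifiers
library(PCAmixdata)   # MFAmix() performs the mixed-data multiple factor analysis

# Hypothetical phenotypic table (illustrative only): a qualitative outcome
# plus mixed qualitative/quantitative predictors
set.seed(1)
pheno <- data.frame(
  genotype = factor(rep(c("wt", "het"), each = 25)),
  sex      = factor(sample(c("M", "F"), 50, replace = TRUE)),
  weight   = rnorm(50, mean = 30, sd = 3),
  speed    = rnorm(50, mean = 12, sd = 2)
)

# Pre-processing: standardize the quantitative features; NAs would be imputed
# with Hmisc::aregImpute (additive regression, bootstrapping, predictive mean
# matching), e.g. aregImpute(~ weight + speed + sex, data = pheno, n.impute = 5)
quant <- c("weight", "speed")
pheno[quant] <- scale(pheno[quant])

# MFA on mixed data: qualitative and quantitative predictors as separate groups
mfa <- MFAmix(data        = pheno[, c("sex", "weight", "speed")],
              groups      = c(1, 2, 2),
              name.groups = c("qualitative", "quantitative"),
              ndim = 3, rename.level = TRUE, graph = FALSE)
mfa$eig   # eigenvalues / variance explained per dimension

# Classification: caret::train with method = "glm", "glmnet" or "rf"
ctrl    <- trainControl(method = "cv", number = 5)
fit_glm <- train(genotype ~ ., data = pheno, method = "glm", trControl = ctrl)
fit_rf  <- train(genotype ~ ., data = pheno, method = "rf",  trControl = ctrl)

# Scaled importance of each feature, relative to the most important one
varImp(fit_rf, scale = TRUE)
```

Note that `method = "rf"` additionally requires the randomForest package, and `method = "glmnet"` the glmnet and Matrix packages, as stated in the table.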