
Table 1 Gdaphen goals and most relevant features described for each module: pre-processing, analysis and visualization

From: Gdaphen, R pipeline to identify the most important qualitative and quantitative predictor variables from phenotypic data

Gdaphen module: Pre-processing
Step: Pre-processing
Goal: Prepare the data for the analysis
Relevant features:

Anonymization of the samples/individuals when working with human datasets. Renaming of the identifiers (noted "Ind") of each animal if duplicates occur when considered per genotype/sex
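A minimal base-R sketch of the duplicate renaming, assuming a hypothetical data frame pheno with Ind, Genotype and Sex columns (not gdaphen's exact code):

    # Toy data: two animals share the identifier "m1" within the same
    # genotype/sex combination
    pheno <- data.frame(Ind      = c("m1", "m1", "m2"),
                        Genotype = c("WT", "WT", "KO"),
                        Sex      = c("M", "M", "F"))
    # Rename duplicated "Ind" values within each genotype/sex group
    pheno$Ind <- ave(pheno$Ind, pheno$Genotype, pheno$Sex, FUN = make.unique)
    pheno$Ind  # "m1" "m1.1" "m2"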

Imputation of NAs using two methods: (a) the mean, if only one value is missing; (b) aregImpute from the Hmisc package [12], to perform multiple imputation using additive regression, bootstrapping, and predictive mean matching
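A hedged sketch of method (b) with Hmisc on toy data; the column names are illustrative assumptions:

    library(Hmisc)
    set.seed(1)
    d <- data.frame(weight   = rnorm(40, 22, 2),
                    activity = rnorm(40, 50, 5),
                    glucose  = rnorm(40, 100, 10))
    d$weight[c(3, 17)] <- NA   # introduce missing values
    # Multiple imputation using additive regression, bootstrapping and
    # predictive mean matching
    imp <- aregImpute(~ weight + activity + glucose, data = d, n.impute = 5)
    # Extract one completed dataset (first set of imputed values)
    filled <- impute.transcan(imp, imputation = 1, data = d,
                              list.out = TRUE, pr = FALSE)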

Removal of quantitative features with fewer than 3 unique values

Removal of qualitative features with fewer than two factor levels

Scaling of the data to standardize the features
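The three rules above can be sketched in base R as follows, assuming hypothetical quant (quantitative) and qual (qualitative) feature tables:

    quant <- data.frame(a = rnorm(10), b = rep(1, 10), c = rnorm(10))
    qual  <- data.frame(g = factor(rep(c("WT", "KO"), 5)),
                        s = factor(rep("M", 10)))
    quant <- quant[, sapply(quant, function(v) length(unique(v)) >= 3),
                   drop = FALSE]   # keep features with >= 3 unique values
    qual  <- qual[, sapply(qual, nlevels) >= 2,
                  drop = FALSE]    # keep factors with >= 2 levels
    quant <- as.data.frame(scale(quant))   # standardize (mean 0, SD 1)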

Gdaphen module: Analysis
Step: Feature selection
Goal: Analyze the correlation among the explanatory variables or features
Relevant features:

Identification of the highly correlated variables, using a correlation threshold chosen by the experimenter/analyst

Goal: Apply methods for pre-selection of features to decrease the noise in the data by using the low-correlated features
Relevant features:

Methods for pre-selection of features based on:

(a) The removal of highly correlated features (a sketch follows below)

(b) A novel approach that selects the explanatory features contributing more than 30% to the discrimination after running the MFA
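For (a), a sketch using caret's findCorrelation on toy data; the 0.75 cutoff is illustrative only, the actual threshold is chosen by the analyst:

    library(caret)
    set.seed(1)
    quant <- as.data.frame(matrix(rnorm(200), ncol = 5))
    quant$V6 <- quant$V1 + rnorm(40, sd = 0.01)   # near-duplicate feature
    corM <- cor(quant)
    # Indices of features exceeding the analyst-chosen correlation cutoff
    drop <- findCorrelation(corM, cutoff = 0.75)
    quant_low <- if (length(drop)) quant[, -drop, drop = FALSE] else quant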

Gdaphen module: Analysis
Step: Multiple factor analysis (MFA)
Goal: A multiple factor analysis (MFA) to identify the weight of each feature/group of features in the prediction
Relevant features:

Perform a multivariate analysis of mixed data using the function MFAmix from the package PCAmixdata [21]. Some of the outputs of the function are the following (an example call is sketched after the list):

- Squared loadings
- The eigenvalue matrix (eig)
- Separate results for the qualitative and the quantitative features
- Results for the levels of each qualitative feature
- Coordinates of the groups
- Partial individual coordinates
- The coefficients of the linear combinations of features
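A minimal MFAmix call on toy data; the column names and the two-group split are assumptions, not gdaphen's defaults:

    library(PCAmixdata)
    set.seed(1)
    pheno <- data.frame(Genotype = factor(rep(c("WT", "KO"), 15)),
                        Sex      = factor(rep(c("M", "F"), each = 15)),
                        weight   = rnorm(30),
                        activity = rnorm(30),
                        glucose  = rnorm(30))
    res <- MFAmix(data = pheno,
                  groups = c(1, 1, 2, 2, 2),   # feature-to-group assignment
                  name.groups = c("covariates", "behaviour"),
                  ndim = 3, rename.level = TRUE, graph = FALSE)
    res$eig          # eigenvalues
    res$sqload       # squared loadings
    res$ind.partial  # partial individual coordinates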

  

Goal: Cosine similarity distances to identify the degree of similarity between the dependent feature and the explanatory features in the PCA dimensional space, following the method developed by Escofier and Pagès [28]
Relevant features:

Calculation of the cosine similarity distance matrices
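An illustrative cosine similarity computation between two coordinate vectors; gdaphen's own implementation may differ:

    # Cosine similarity between two feature-coordinate vectors in the
    # PCA dimensional space
    cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
    dep  <- c(0.8, 0.1, -0.2)   # dependent feature, first three dimensions
    expl <- c(0.7, 0.3, -0.1)   # one explanatory feature
    cos_sim(dep, expl)          # 1 = identical direction, 0 = orthogonal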

Gdaphen module: Analysis
Step: Classification
Goal: Several classification algorithms to identify the main discriminative explanatory variables, some specifically chosen to identify variables affected by gene dosage (GLM, multiGLM and GLM-Net), plus the supervised method random forest
Relevant features:

Generalized linear regression model, noted GLM. Computed by the function train of the package caret, setting the parameter method = "glm"

Penalized multinomial regression, also called multinomial log-linear models via neural networks, noted multiGLM, using the function multinom from the package nnet [27]

Elastic net method, noted GLM-Net: fits a generalized linear model with more than two factors by using penalized maximum likelihood, combining the ridge and lasso shrinkages and optimizing the parameters lambda and alpha. Computed using the function train of the package caret, setting the parameter method = "glmnet"; requires the packages glmnet and Matrix. The theory is described in Friedman et al. [29] and Simon et al. [30]

A random forest, noted RF, a supervised algorithm. Computed using the function train of the package caret, setting the parameter method = "rf"
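A combined sketch of the four classifiers via caret::train on toy data; X, y and the resampling settings are assumptions (method = "multinom" is caret's penalized multinomial regression backed by nnet):

    library(caret)
    set.seed(1)
    X <- as.data.frame(scale(matrix(rnorm(200), ncol = 4)))
    y <- factor(rep(c("WT", "KO"), length.out = 50))
    ctrl <- trainControl(method = "cv", number = 5)
    fit_glm   <- train(X, y, method = "glm",      trControl = ctrl)
    fit_multi <- train(X, y, method = "multinom", trControl = ctrl,
                       trace = FALSE)                  # needs nnet
    fit_net   <- train(X, y, method = "glmnet",   trControl = ctrl)
                                                       # needs glmnet, Matrix
    fit_rf    <- train(X, y, method = "rf",       trControl = ctrl)
                                                       # needs randomForest
    varImp(fit_rf, scale = TRUE)   # importance scaled to the top feature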

Gdaphen module: Analysis
Relevant features:

Calculation of the mean, SD, error, and the upper and lower confidence intervals for each feature. Computed and displayed with the function ggparcoord from the GGally package [31]
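A minimal GGally::ggparcoord example, using iris as a stand-in for a phenotypic table:

    library(GGally)
    # Parallel coordinates of the quantitative features, standardized
    # (scale = "std"), grouped by a qualitative column
    ggparcoord(iris, columns = 1:4, groupColumn = "Species", scale = "std")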

Gdaphen module: Visualization
Step: Classification results
Goal: Assess the efficiency of the classifiers
Relevant features:

We provide a table and plots where the scaled importance of each feature, relative to the most important one, is shown for all the variables

Produce plots and tables containing the scaled importance of each feature after running the glm/glmNet/multiGLM/RF classifiers for the three input datasets (non-selected and selected features)
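A sketch of extracting and plotting the scaled importances, reusing the hypothetical fit_rf from the classification sketch above:

    library(caret)
    imp <- varImp(fit_rf, scale = TRUE)  # 100 = most important feature
    imp         # table of scaled importances
    plot(imp)   # lattice plot of the scaled importance per feature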

Gdaphen module: Visualization
Step: MFA results
Goal: Nine plots that show the distribution of the individual observations in the top dimensions of the three main principal components of the PCA (quantitative data)/MCA (qualitative data), and the contribution of each feature or group of features to the main principal components (a plotting sketch follows the list)
Relevant features:

1. Variance explained by the top ten PCA components and the cumulative variance
2. Variables' contribution to/correlation with the three main principal components: grouped-features correlation partial axes plot
3. Grouped-features correlation with the three main principal component dimensions: PCA plot
4. Distribution of the observations relative to the classifier discrimination of the qualitative variables in the three top principal component dimensions, using 2D and 3D PCA plot visualizations
5. Squared loadings of the quantitative and qualitative features on the principal components
6. Features' correlation with the three main principal components: all variables' correlations
7. Qualitative features' discrimination: distribution of the qualitative features in the main dimensional spaces
8. Parallel coordinate plots, with and without scaling of the features, to visualize the differences between the factors of your dependent variable
9. Scaled importance of each variable after running the glm/glmNet/multiGLM/RF classifiers
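Several of these panels can be reproduced with PCAmixdata's plot method; a sketch reusing the hypothetical MFAmix result res from the earlier example (the mapping to the nine gdaphen plots is approximate):

    library(PCAmixdata)
    # 'choice' selects the panel type in PCAmixdata's plot method
    plot(res, choice = "ind",    axes = c(1, 2))  # individual observations
    plot(res, choice = "groups", axes = c(1, 2))  # grouped-feature contributions
    plot(res, choice = "sqload", axes = c(1, 2))  # squared loadings
    plot(res, choice = "axes",   axes = c(1, 2))  # partial axes correlations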