Recent advances in high-throughput ‘omics’ technologies enable quantitative measurement of the expression or abundance of biological molecules across a whole biological system. Popular omics platforms in systems biology include transcriptomics, proteomics, cytomics and metabolomics. Experiments on these platforms are usually designed to compare changes between conditions or groups, and are often used to identify biomarkers that characterise pathological states or response to treatment.
The decreasing cost of these high-throughput platforms now makes repeated-measures experiments on the same individuals or biological samples feasible. Such experiments yield a substantial gain in information. For instance, longitudinal designs are more powerful because they reduce the noise due to inter-individual variability, as long as the correlation between repeated observations is taken into account. There is an abundant literature on the analysis of repeated measurements of omics data [1, 2]. In this context, a common approach is to fit a univariate mixed model to each gene and then apply a multiple testing correction. However, this approach disregards the dependency between genes and, because of the high dimensionality of the data, requires a very large number of hypothesis tests.
The mixed model approach has been used for the analysis of a single data type (e.g. gene expression). However, a growing number of high-throughput data types are now generated in standard clinical trials. For example, the evaluation of HIV vaccines in phase I/II trials incorporates counts of numerous cell types, measurements of the production of intra- and extracellular cytokines, and gene expression. Integrating such multi-layer information can help unravel the complexity of a biological system, as the functional levels are hypothesized to be related to one another. However, the integration of omics data is a challenging task. Firstly, the large number of measured biological entities makes it very difficult to obtain a good overview or understanding of the system under study. Secondly, the small number of samples or patients makes statistical inference difficult and argues for using as much of the available information as possible. Thirdly, the integration of heterogeneous data represents an analytical and numerical challenge when trying to find common patterns in data from different origins.
In recent years, several multivariate approaches have been proposed to combine two omics data sets, often in an unsupervised framework. In contrast to univariate repeated measures analysis, these linear multivariate approaches take into account the dependency between genes, can handle large and noisy data sets, and do not face computational issues in the high-dimensional case because matrix inversions are avoided. Most importantly in the context of this study, they enable the integration of data coming from different platforms and provide interpretable visualisation tools. These approaches aim at selecting correlated biological entities from two [6–11] or more data sets. In particular, with sparse Partial Least Squares (sPLS) we have shown that the integrative analysis of large-scale omics data sets can generate new knowledge that is not accessible through the analysis of a single data type alone [7, 8]. The biological relevance of this approach has been illustrated in recent studies [13, 14].
The flexibility and versatility of PLS also allow a supervised framework through PLS-Discriminant Analysis (PLS-DA). A sparse variant (sPLS-DA) has recently been proposed to select the discriminative features that best separate the different conditions. sPLS-DA was shown to perform similarly to classical classification methods, such as machine learning approaches and variants of Linear Discriminant Analysis, and was recently applied in a biological study.
In this paper, we consider a two-step approach to model the correlation between repeated measurements while taking advantage of the multivariate approaches. We first propose to extract the within-sample variation [18–20] before analysing this transformed data set using sPLS-DA for a discriminant analysis or sPLS for an integrative analysis.
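As an illustration of the first step, the extraction of the within-sample variation can be sketched as follows for the one-factor case. This is a minimal sketch, assuming that the within-sample part of an observation is the observation minus the mean profile of the subject it belongs to; the function name `within_subject_deviation` and the toy data are hypothetical and are not taken from the study.

```python
import numpy as np


def within_subject_deviation(X, subject_ids):
    """Subtract each subject's mean profile from that subject's rows.

    Assumption for illustration: one-factor design, rows of X are
    observations, columns are variables (e.g. genes).
    """
    X = np.asarray(X, dtype=float)
    subject_ids = np.asarray(subject_ids)
    X_within = np.empty_like(X)
    for subject in np.unique(subject_ids):
        rows = subject_ids == subject
        # Remove the subject-specific mean of every variable.
        X_within[rows] = X[rows] - X[rows].mean(axis=0)
    return X_within


# Hypothetical toy data: 3 subjects, each measured under 2 conditions,
# with 2 variables. Baselines differ strongly between subjects, while
# the condition effect (+2 on each variable) is the same for all.
X = np.array([
    [10.0, 1.0], [12.0, 3.0],   # subject 0
    [20.0, 5.0], [22.0, 7.0],   # subject 1
    [30.0, 2.0], [32.0, 4.0],   # subject 2
])
subject_ids = np.array([0, 0, 1, 1, 2, 2])

X_within = within_subject_deviation(X, subject_ids)
```

In this toy example the between-subject differences in baseline disappear from `X_within`, and only the shift between the two conditions remains. A transformed matrix of this kind is what would then be passed to sPLS-DA or sPLS in the second step.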
Starting from the classical mixed model, we present the principle of a multilevel analysis to extract the within-sample deviation of the data, and we extend the approach to a two-factor analysis. The resulting within-sample data set is then analysed either with sPLS-DA, to select genes that discriminate between the groups of subjects on a single data set, or with sPLS, to select subsets of correlated variables from two data sets. A simulation study demonstrates the good performance of multilevel sPLS-DA compared to a classical sPLS-DA. The approach is then illustrated on an HIV vaccination study, in which the effect of a lipopeptide-based vaccine was explored by measuring various components of the immune response, including gene expression and cytokine secretion, before and after vaccination. These repeated measurements were made on Peripheral Blood Mononuclear Cells under several in vitro conditions: ‘NS’ (no stimulation), ‘GAG+’ (HIV Gag peptides included in the vaccine), ‘GAG-’ (HIV Gag peptides not included in the vaccine) and ‘LIPO5’ (all five peptides included in the vaccine).