MultiDataSet is an S4 class in R, implemented following Bioconductor guidelines [39]. Its structure is an extension of the abstract eSet class. MultiDataSet is therefore a data-storage class that holds datasets of different omic data (assay data) together with their feature data and phenotypic data. Despite its general form, MultiDataSet preserves the specific characteristics of each dataset (e.g. it keeps the matrices of calls and probabilities of a SnpSet).
Internal structure of MultiDataSet
MultiDataSet comprises five fields, each a standard R list. Their names match those of other Bioconductor classes: assayData contains the measurement values; phenoData stores the description of the samples; featureData and rowRanges hold the description of the features; and return_method allows the original dataset to be recovered, for which we have implemented a dedicated function. The relation between fields is shown in Fig. 1. Within each dataset, samples are shared between assayData and phenoData, and features are shared between assayData, featureData and rowRanges, so the class keeps the different pieces of data coordinated. A particular feature of MultiDataSet is that it can store datasets from different experiments that do not share the full set of samples.
Six accessors are available to retrieve information from the fields of a MultiDataSet: assayData, pData, fData, rowRanges, rowRangesElements and sampleNames. The first four return, respectively, the content of assayData (a list of environments), phenoData (a list of AnnotatedDataFrames), featureData (a list of AnnotatedDataFrames) and rowRanges (a list of GenomicRanges, with NA for datasets whose features have no genomic coordinates). rowRangesElements returns the names of the datasets whose genomic coordinates are stored in a GenomicRanges. The accessor sampleNames returns a named list with the sample names of each dataset.
Adding datasets to MultiDataSet
Following Bioconductor guidelines, MultiDataSet objects are created empty through their constructor. Once the object is created, datasets can be added with add_eset and add_rse. The first function adds an object of class eSet, while the second adds a SummarizedExperiment object or one of its extensions. The two functions take the same arguments: the MultiDataSet object, the dataset to be added, a tag for the type of dataset (e.g. methylation, expression…) and a name for the dataset. MultiDataSet thus allows multiple datasets of the same type to be stored under different names. For features with genomic coordinates, a GenomicRanges object is created from the dataset’s featureData. To maintain consistency across all datasets, sample names are taken from the phenotype data (a column called “id” is expected). If this column is not present, the object’s sampleNames are used.
The MultiDataSet package incorporates three specific functions to add particular types of omic datasets: ExpressionSet (Biobase package), MethylationSet (MultiDataSet package) and SnpSet (Biobase package). These specific functions call the general functions to add the data after performing extensive checks tailored to the dataset (e.g. checking the class of the set or checking the columns of fData). As a result, only datasets with well-defined features can be introduced into MultiDataSet through a specific function.
Users should always use the specific functions to ensure that the sets are properly added to MultiDataSet. The two basic functions, add_eset and add_rse, are intended only for developers who want to write new specific functions. The hierarchy between the specific and basic functions is shown in Fig. 2.
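The sample-naming rule and the storage of several datasets of the same type can be illustrated with a toy model in base R. This is only a sketch and not the package's code: resolve_sample_names and dataset_label are hypothetical helpers, and the "+" separator used to combine type and name is an illustrative assumption.

```r
# Toy model (base R only); NOT the MultiDataSet implementation.
# resolve_sample_names() is a hypothetical helper: it uses the "id" column of
# the phenotype table when present, otherwise the dataset's own sample names.
resolve_sample_names <- function(pheno, default_names) {
  if ("id" %in% colnames(pheno)) {
    as.character(pheno$id)
  } else {
    default_names
  }
}

# Hypothetical labelling scheme: the type alone, or type plus a name, so that
# several datasets of the same type can coexist (separator is an assumption).
dataset_label <- function(type, name = NULL) {
  if (is.null(name)) type else paste(type, name, sep = "+")
}

pheno_with_id <- data.frame(id = c("A", "B"), sex = c("F", "M"),
                            row.names = c("s1", "s2"),
                            stringsAsFactors = FALSE)
pheno_without_id <- data.frame(sex = c("F", "M"),
                               row.names = c("s1", "s2"),
                               stringsAsFactors = FALSE)

names_with_id <- resolve_sample_names(pheno_with_id,
                                      rownames(pheno_with_id))
names_without_id <- resolve_sample_names(pheno_without_id,
                                         rownames(pheno_without_id))

# Two expression datasets stored side by side under different labels.
store <- list()
store[[dataset_label("expression")]] <- names_with_id
store[[dataset_label("expression", "replication")]] <- names_without_id
```

In this sketch the first phenotype table yields sample names from its "id" column, while the second falls back to the row names, and the two expression datasets are kept under distinct labels.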
Subsetting MultiDataSet
We have implemented two methods to perform subsetting. The operator ‘[’ can be used to select individuals, datasets and/or features. When the tables contain different samples, subsetting is performed by considering the union of samples from the different tables. For instance, let us assume that table 1 contains individuals A and B, table 2 contains individuals A, B and C, and table 3 contains individuals A and C. Let us also assume that we are interested in getting information from tables 1, 2 and 3 for individuals A and C. Our subsetting method will return a MultiDataSet object containing individual A for table 1 and individuals A and C for tables 2 and 3. We think that this procedure is better than returning a MultiDataSet object containing only individual A (i.e. the intersection) for the three tables. Therefore, subsetting by individuals may not return complete cases. Note that the package has another function (commonSamples) that can be applied to the resulting object to obtain complete cases if necessary.
When subsetting by datasets, if only one dataset is selected, it is returned in its original class (e.g. SnpSet, MethylationSet…). A GenomicRanges object can be used as an argument to select the features present in a given genomic range. In this process, sets with no genomic coordinates (e.g. metabolomic data) are discarded. We also extended the subsetting function of R (subset) to select specific features within a dataset, such as features associated with a gene, or to filter individuals given a phenotype.
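The example with tables 1, 2 and 3 can be reproduced with a toy model in base R, in which each table is represented only by its sample names. This is a sketch of the behaviour described above, not the package's code: subset_samples and common_samples are hypothetical stand-ins for the ‘[’ operator and for commonSamples.

```r
# Toy model (base R only); each "table" is just its vector of sample names.
tables <- list(
  table1 = c("A", "B"),
  table2 = c("A", "B", "C"),
  table3 = c("A", "C")
)
requested <- c("A", "C")

# Subsetting as described in the text: each table keeps whichever of the
# requested individuals it contains, so tables may end up with different samples.
subset_samples <- function(tables, samples) {
  lapply(tables, function(ids) ids[ids %in% samples])
}

# Complete cases: keep only the individuals present in every table.
common_samples <- function(tables) {
  shared <- Reduce(intersect, tables)
  lapply(tables, function(ids) ids[ids %in% shared])
}

subsetted <- subset_samples(tables, requested)
complete <- common_samples(subsetted)
```

Here subsetted keeps individual A in table 1 and individuals A and C in tables 2 and 3, whereas complete reduces every table to individual A.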
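Selection by genomic range can likewise be sketched in base R. The model below is a simplified assumption: coordinates are held in plain data frames rather than GenomicRanges objects, NA marks a dataset without coordinates, and subset_by_range is a hypothetical helper, not a function of the package.

```r
# Toy model (base R only); NOT the MultiDataSet implementation.
# Feature coordinates per dataset; NA marks a dataset without coordinates.
feature_coords <- list(
  expression = data.frame(feature = c("g1", "g2", "g3"),
                          chr = c("chr1", "chr1", "chr2"),
                          start = c(100, 5000, 100),
                          end = c(200, 5100, 200),
                          stringsAsFactors = FALSE),
  metabolome = NA
)

# Hypothetical helper: drop datasets without coordinates, then keep the
# features that overlap the requested interval on the requested chromosome.
subset_by_range <- function(coords, chr, start, end) {
  with_coords <- Filter(is.data.frame, coords)
  lapply(with_coords, function(df) {
    overlaps <- df$chr == chr & df$start <= end & df$end >= start
    df$feature[overlaps]
  })
}

selected <- subset_by_range(feature_coords, "chr1", 1, 1000)
```

In this sketch the metabolome dataset is discarded because it has no coordinates, and only the expression feature overlapping chr1:1-1000 is retained.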