This Section illustrates the main features of the repo package and its philosophy through an application example. The example involves the creation and population of a repository, its exploration, manipulation and distribution.
Repository creation and population
In repo, all the data and annotations for a single repository reside under a single, user-specified file system location. One repository can store resources produced by different analyses. The choice between a single central repository and multiple project-specific repositories is up to the user. The following code creates a new, empty repository in a temporary directory:
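A minimal sketch of this step, assuming that the repo package is installed and that repo_open accepts a root directory and a force flag; the folder name repo_sketch is hypothetical:

```r
library(repo)

# Hypothetical location: a subfolder of the session's temporary directory.
repo_root <- file.path(tempdir(), "repo_sketch")

# force = TRUE is assumed to create the root folder without asking for
# interactive confirmation.
rp <- repo_open(repo_root, force = TRUE)

# All data and annotations for this repository will live under repo_root.
print(repo_root)
```

The returned object rp is the handle through which the functions discussed in this Section (put, attach, get and so on) would be called.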
The example code reported in this Section is contained in a file named article.Rnw. The next code block stores the source code as a repository item. The attach function stores generic files (as opposed to R objects) in the repository. An item description and a list of tags are also specified. The project command creates a special repository item containing pipeline-wise information. The options command sets the default source file and the default project to associate items with.
This example uses the “Mice Protein Expression Data Set” from the UCI repository [17]. In the following block the data is downloaded and a copy is stored in the repository, specifying the download URL. The URL field is useful to trace the provenance of the data, but can also be used to download the item contents through the pull function. The variable xls.name, which contains the name of the downloaded file, is also used to set the identifier of the newly created object in the repository.
The stored data is not in R format. The following code imports it into the variable data and permanently stores the variable in the repository through the put function. In this case two relations are annotated for the newly created item: the generating source code, set as the file article.Rnw; and a dependency on the downloaded file (xls.name variable). Note that Mice Cortex is annotated as being dependent on the appropriate repository item, which contains both necessary and sufficient data to build the newly created resource. However, the actual code loads the data from the downloaded file and uses a variable defined elsewhere (xls.name). These inconsistencies with the process will be fixed later in accordance with the data-centered paradigm (see Fig. 1).
The dataset includes missing values and non-real variables. As a preprocessing step, all incomplete samples are removed and a reduced version of the dataset is stored. The dependence of the reduced set on the full set (just stored as Mice Cortex) is also annotated.
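The filtering itself can be sketched in base R. The toy data frame below is a hypothetical stand-in for the imported data, and dropping the non-numeric column is an assumption about what the reduced version contains:

```r
# Toy stand-in for the imported data: one factor column and two numeric
# columns, with missing values in rows 2 and 4.
toy <- data.frame(
  group = factor(c("g1", "g2", "g1", "g2", "g1")),
  x1 = c(0.50, NA, 0.42, 0.61, 0.55),
  x2 = c(1.20, 1.10, 1.35, NA, 1.05)
)

# Identify the numeric variables (assumption: only these are kept).
numeric_cols <- vapply(toy, is.numeric, logical(1))

# Identify the samples (rows) that have no missing values.
complete_rows <- complete.cases(toy)

# Reduced data set: complete samples, numeric variables only.
toy_notNA <- toy[complete_rows, numeric_cols]
print(dim(toy_notNA))
```

The reduced object would then be stored with put, naming the full data set as its dependency.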
Suppose that the data preprocessing step needs to change. One may want to overwrite the current Mice Cortex notNA item while keeping the previous one as a possible alternative. repo implements a simple versioning system to accomplish this task. The following code creates a scaled version of the dataset and overwrites the previously created Mice Cortex notNA item. However, since the parameter replace is set to addversion, the old item is preserved with a new name, as shown by the print output.
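A hedged sketch of the versioning mechanism on toy data follows. It assumes that the repo package is installed, that put takes the object, a name, a description and a vector of tags in this order, and that print accepts an all argument to include hidden items; the repository location and item names are hypothetical:

```r
library(repo)

# Hypothetical scratch repository for this sketch.
rp <- repo_open(file.path(tempdir(), "repo_versions"), force = TRUE)

# Toy stand-in for the filtered data set.
filtered <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)
rp$put(filtered, "toy notNA", "Toy data without missing values",
       c("toy", "preprocessed"))

# Store a centered and scaled variant under the same name. With
# replace = "addversion" the previous item is expected to be preserved
# under a versioned name instead of being lost.
scaled <- scale(filtered)
rp$put(scaled, "toy notNA", "Toy data, centered and scaled",
       c("toy", "preprocessed"), replace = "addversion")

# List all items, including hidden ones, to inspect the outcome.
rp$print(all = TRUE)
```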
The attach function can be exploited to store visualizations in the repository and link them to the data they represent. The following code plots a 2-dimensional visualization of the Mice Cortex data to a PDF file and attaches it to the item containing the corresponding data (using the to parameter).
The accuracy of the 2D plot depends on the amount of variance explained by the first two Principal Components of the reduced dataset. The following code creates a plot of the variance explained by each Principal Component and attaches it to the previous plot.
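The variance computation can be sketched in base R as follows. The built-in iris measurements are used as a stand-in for the reduced dataset, and the output file name is hypothetical:

```r
# Stand-in numeric matrix (built-in iris measurements), centered and scaled.
x <- scale(iris[, 1:4])

# Principal Component Analysis.
pca <- prcomp(x)

# Proportion of the total variance explained by each Principal Component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Hypothetical output file for the plot.
pdf_path <- file.path(tempdir(), "pca_variance.pdf")
pdf(pdf_path)
barplot(var_explained,
        names.arg = paste0("PC", seq_along(var_explained)),
        ylab = "Proportion of variance explained")
invisible(dev.off())

# Share of the variance captured by a 2D view (first two components).
print(sum(var_explained[1:2]))
```

The resulting PDF could then be stored with attach, using the to parameter to link it to the previous plot.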
Repository exploration
repo supports a few commands to visualize information about a repository or a set of items. Global information can be visualized through the info command as follows.
It is also possible to visualize the composition of the repository in terms of memory usage through the pies function (see Fig. 2).
Other details about single items can be visualized using the print function. Some items (like attachments) are hidden by default. The code below lists all the items in the repository, including hidden ones.
Three types of relations between items are supported in repo: attached to, depends on, generated by. Such relations can be represented through a directed graph. The dependencies function creates the corresponding visualization (see Fig. 3). When items are properly annotated, such visualization defines the analysis data flow.
As a repository grows, it may contain a large number of items from multiple projects. In order to properly identify item subgroups, tags can be exploited as filters. Tags are supported by many repo functions and can be combined using different logic operators. In the next code block the plot items (associated with the tag “visualization”) are excluded from the dependency graph (see Fig. 4).
The repo package also includes a preliminary visual interface (see Fig. 5). The current version allows users to browse repository items and load them into the current workspace.
Items access
The most used command in repo is get. get loads an item from the permanent storage based on its name.
On the other hand, all the details stored for a single item are reported by the info function. The summary also reports the dimensions of the data, its creation date, the storage space used, the relative file system path to the file containing the data, and an MD5 checksum.
If the exact identifier is unknown, the find function can be used to perform string matching against all item details.
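The three access functions can be sketched together on a toy item. This assumes that the repo package is installed and that get, info and find take an item name (or a search string) as their first argument; the repository location and the item are hypothetical:

```r
library(repo)

# Hypothetical scratch repository for this sketch.
rp <- repo_open(file.path(tempdir(), "repo_access"), force = TRUE)

# Store a toy item to have something to retrieve.
toy <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))
rp$put(toy, "toy table", "Small toy data frame", c("toy", "example"))

# Load the item from permanent storage by its exact name.
loaded <- rp$get("toy table")

# Report all the details stored for the item.
rp$info("toy table")

# Search the item details when the exact name is not known.
rp$find("toy")
```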
Analysis reproducibility
While repo focuses on data, it also supports features that deal directly with processes. Such features make the tool able to reproduce resources based on the code they were annotated with. Reproducibility is also supported by the special project items, which collect information about an entire analysis, including the list of resources involved, the R version used and the necessary libraries. The info command implements a special behaviour for project items, as shown in the following:
Items in the example repository have dependencies set, which makes it possible to trace back which data were used to build each resource. This may provide significant help in reproducing an analysis or reusing produced items in other analyses. However, the exact process building each resource is not described, as a generic source file is associated with all of them. Following the data-centered approach (see Fig. 1), once the analysis is well assessed, source code can be cleaned up and single processes assigned to each item. Although the code used for this example is rather simple, the following is a refinement of the block related to the Mice Cortex resource:
Note that the xls.name variable is not used anymore, and the downloaded data set is loaded from within the repository. This code is now both necessary and sufficient to build the Mice Cortex resource if its dependencies are satisfied. The comments starting with “## chunk” will be used by repo to associate the Mice Cortex resource with the actual instructions that are necessary to build it. The following lines update the source code of the project by resetting its content and show the newly defined code chunk:
The build command runs the code associated with a resource. By default, if the resource has dependencies not already present in the repository, their associated code is run first, recursively. Otherwise their code chunks are skipped. It is also possible to set a session-wise option to determine other behaviours. For example, the following code can be used to download the latest version of the file “Data_Cortex_Nuclear.xls” and build the corresponding Mice Cortex object, without overwriting the respective previous versions. Annotation of the Data_Cortex_Nuclear.xls code chunk, as shown above for the Mice Cortex chunk, is assumed.
As previously explained, when new versions of existing items are created, the existing items are renamed by appending an incremental version number. Note that, thanks to the mechanism of code chunk annotation, repo supports reentrancy [6] at each properly defined pipeline stage.
Data exchange
The repo system stores data and metadata into subfolders of the repository root in the R standard RDS format. Internally, all references to stored files are relative to the root directory, implying that each repository is completely self-contained and can be easily cloned or moved. Dedicated support for data exchange is described in this Subsection.
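The role of relative paths can be illustrated with a base R sketch. It does not use repo itself: the folder names and the layout below are hypothetical and only mimic the idea of addressing RDS files relative to a root directory:

```r
# Hypothetical repository root and a relative path for one stored object.
root <- file.path(tempdir(), "rds_sketch")
clone_root <- file.path(tempdir(), "rds_sketch_clone")
unlink(c(root, clone_root), recursive = TRUE)  # start from a clean state

rel_path <- file.path("a", "item1.rds")
dir.create(file.path(root, dirname(rel_path)), recursive = TRUE)

# Save an object under the root, referring to it only by its relative path.
obj <- list(values = c(1.5, 2.5), label = "toy")
saveRDS(obj, file.path(root, rel_path))

# Clone the content of the root into a new location.
dir.create(clone_root)
file.copy(file.path(root, "a"), clone_root, recursive = TRUE)

# The same relative path resolves against the cloned root.
restored <- readRDS(file.path(clone_root, rel_path))
print(identical(restored, obj))
```

Because nothing in the stored reference mentions the absolute location, the clone needs no further adjustment.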
The tool can handle multiple repositories and copy items from one repository to another. For example, the code below creates a new repository and copies two items to it:
The related function returns the names of all items that are directly or indirectly linked to a given item, which makes it possible to select an independent set of items. In the following, such a set is exported using the export function: R objects are saved in the standard R data format RDS, while attachments keep their original format.
An interesting application of the URL annotation concerns the distribution of repositories. The buildURL parameter of the set function can be used to assign a base URL to all items. The code below copies the previously selected set of items to the repository rp2 and sets a base URL for all items (except Data_Cortex_Nuclear.xls).
Once the repository directory is copied to a public website, its index (i.e. the file R_repo.RDS in the repository root) can be distributed. Users can then selectively download items of interest using the repo pull function. The check command can be used to run an integrity check on all repository items.
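Since an MD5 checksum is stored for each item, an integrity check can be thought of as recomputing and comparing checksums. The base R sketch below illustrates that idea only; it is not the implementation used by check, and the file name is hypothetical:

```r
# Store a toy object and record the MD5 checksum of its file.
item_path <- file.path(tempdir(), "checked_item.rds")
saveRDS(c(1, 2, 3), item_path)
recorded_md5 <- unname(tools::md5sum(item_path))

# Later: recompute the checksum and compare it with the recorded one.
is_intact <- identical(unname(tools::md5sum(item_path)), recorded_md5)

# Overwriting the file with different content changes its checksum,
# so the same comparison now reveals the modification.
saveRDS(c(1, 2, 4), item_path)
is_intact_after_change <-
  identical(unname(tools::md5sum(item_path)), recorded_md5)

print(c(before = is_intact, after = is_intact_after_change))
```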