ReUseData: an R/Bioconductor tool for reusable and reproducible genomic data management

Background: The increasing volume and complexity of genomic data pose significant challenges for effective data management and reuse. Public genomic data often undergo similar preprocessing across projects, leading to redundant or inconsistent datasets and inefficient use of computing resources. This is especially pertinent for bioinformaticians engaged in multiple projects. Tools have been created to address the challenges of managing and accessing curated genomic datasets; however, the practical utility of such tools is most apparent for users who work with specific types of data or are technically inclined toward a particular programming language. Currently, there is a gap in the availability of an R-specific solution for efficient data management and versatile data reuse.

Results: Here we present ReUseData, an R software tool that overcomes some of the limitations of existing solutions and provides a versatile and reproducible approach to effective data management within R. ReUseData facilitates the transformation of ad hoc scripts for data preprocessing into Common Workflow Language (CWL)-based data recipes, allowing for the reproducible generation of curated data files in their generic formats. The data recipes are standardized and self-contained, making them easily portable and reproducible across various computing platforms. ReUseData also streamlines the reuse of curated data files and their integration into downstream analysis tools and workflows with different frameworks.

Conclusions: ReUseData provides a reliable and reproducible approach for genomic data management within the R environment to enhance the accessibility and reusability of genomic data. The package is available on Bioconductor (https://bioconductor.org/packages/ReUseData/) with additional information on the project website (https://rcwl.org/dataRecipes/).
Supplementary Information The online version contains supplementary material available at 10.1186/s12859-023-05626-0.

In this supplement, we demonstrate the use of the ReUseData functions in managing the data recipes and curated data resources. There are three major sections:
1) Project resources.
2) The package core functions for data recipes.
3) The package core functions for curated data management.

1 Project resources

1.1 ReUseData portal

The project website https://rcwl.org/dataRecipes/ provides a central hub for all the pre-built data recipes for data curation (downloading, unzipping, indexing, etc.) of commonly used public data resources. The website contains a search bar with autocompletion for convenient recipe searching. Each data recipe comes with a landing page containing the recipe description, the link to the recipe source code, the original data sources, annotation for the input and output parameters, and user instructions with example code chunks.
While these pre-built data recipes are ready for direct use, they also serve as templates for users to create their own recipes, say for protected data sets. Before installing ReUseData, we recommend that potential users browse this portal to get an idea of what the package is and how it works.
For those interested in the CWL workflow infrastructure of ReUseData in R, or in running CWL pipelines in R, resources are available on the main website https://rcwl.org/, such as the Rcwl tutorial e-book, more than 200 pre-built Rcwl tools and pipelines, and case studies of using RcwlPipelines to preprocess single-cell RNA-seq data.

1.2 Pre-built data recipes
The pre-built ReUseData recipe scripts are included in the package and also physically reside in a dedicated GitHub repository, which demonstrates recipe construction in different situations. The most common case is a data recipe that manages multiple data sets from the same or similar sources using input parameters (species, version, etc.). For example, the gencode_transcripts recipe downloads, unzips, and indexes the transcript fasta file from GENCODE for human or mouse with different versions. A simple data download (using wget) for a specific file can be written as a data recipe without any input parameters. For example, the data recipe gencode_genome_grch38.R downloads the human genome file GRCh38.primary_assembly.genome.fa.gz from GENCODE release 42.
If the data curation gets more complicated, say, when multiple command-line tools are involved, conda is used to install required packages, or secondary files are generated and need to be collected, the raw way of building a ReUseData recipe using Rcwl functions is recommended. It gives more flexibility and power to accommodate different situations. An example is the reference_genome recipe, which downloads, formats, and indexes reference genome data using samtools, picard, and bwa, and manages multiple secondary files besides the main fasta file for later reuse.
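As a minimal sketch of this raw Rcwl approach (the single samtools faidx step and all parameter IDs are illustrative, not the actual reference_genome recipe, which chains several tools):

```r
library(Rcwl)

## Illustrative hand-built step: index a fasta file with samtools.
## A real recipe such as reference_genome combines several such
## cwlProcess steps and collects the secondary files.
p1 <- InputParam(id = "fasta", type = "File", position = 1)
o1 <- OutputParam(id = "fai", type = "File", glob = "*.fai")
samtools_faidx <- cwlProcess(
    baseCommand = c("samtools", "faidx"),
    inputs  = InputParamList(p1),
    outputs = OutputParamList(o1)
)
```

Building a recipe this way exposes the full Rcwl interface, so conda requirements and secondary-file handling can be declared where recipeMake would be too restrictive.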

1.3 Cloud sharing of curated data
With the pre-built data recipes for curation of commonly used public data resources, we have generated some curated data sets to share on the Google Cloud (https://storage.cloud.google.com/reusedata). These data sets can be used directly on cloud computing platforms (e.g., Terra, CGC), which may benefit from the low latency of cloud-to-cloud data transfer. They can also be downloaded and added to your local data cache. The concomitant annotation files will also be downloaded automatically for subsequent data reuse.
Any recipe with cloud data available has on its landing page the example code chunk showing how to download the data (without evaluating the recipe yourself) and add them into the local data cache.
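As a hedged sketch of this workflow (the folder path and search keywords below are illustrative, not part of any specific landing page):

```r
library(ReUseData)

## Query the ReUseData Google bucket and cache the records locally
## (folder path and keywords are illustrative)
dh   <- dataUpdate(dir = "~/SharedData", cloud = TRUE)
hits <- dataSearch(c("ucsc", "refGene"))

## Download a matching data set (plus its concomitant annotation
## files) and register the downloaded files in the local cache
getCloudData(hits[1], outdir = "~/SharedData")
dataUpdate(dir = "~/SharedData")
```

Because the annotation files travel with the data, the downloaded files become searchable with dataSearch() just like locally generated ones.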

2 Core functions for data recipes

2.1 Package installation

Install the package from Bioconductor or GitHub.
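A standard Bioconductor installation looks like the following (the GitHub repository name is an assumption; check the project website for the development source):

```r
## Install the release version from Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ReUseData")

## Or the development version from GitHub (repository name assumed)
## BiocManager::install("rworkflow/ReUseData")
```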

2.2 Recipe construction and evaluation
One can construct a data recipe from scratch or convert existing shell scripts for data preprocessing into data recipes by specifying input parameters and output globbing patterns using the recipeMake function. The data recipe is then represented in R as an S4 object of class cwlProcess. Upon assigning values to the input parameters, the recipe is ready to be evaluated to generate the data of interest. Here we show two examples.
• Write a data recipe from scratch. Let's take a look at the output file, which was successfully generated in the user-specified directory and grabbed through the outputGlob argument.
• Convert an existing shell script into a data recipe. The script can be shell or another ad hoc data processing script. Here we use the shell script that downloads and indexes transcript annotation files from GENCODE for bulk or single-cell RNA-seq analysis.

shfile <- system.file("extdata", "gencode_transcripts.sh", package = "ReUseData")
readLines(shfile)

The file path to the newly generated data set can be easily retrieved. The user-created data recipes can then be deposited in a private GitHub repository, exclusively accessible by a specific workgroup, or contributed back to ReUseData for broader accessibility to benefit researchers in similar research domains. In this case, additional meta information will be required for each data recipe, such as the links to the data origin and source code, a description of valid parameter values, and demonstrative usage code, so that a landing page for the data recipe can be created on the ReUseData portal. This can be facilitated by the RcwlMeta package, by communicating with the developer team.
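A hedged sketch of both routes with recipeMake (the echo script, parameter IDs, and glob patterns are illustrative; consult the recipe landing pages for the exact pre-built definitions):

```r
library(ReUseData)

## 1) A recipe from scratch: wrap a one-line shell command
##    (script and output ID are illustrative)
rcp1 <- recipeMake(
    shscript   = "echo \"Hello World!\" > outfile.txt",
    outputID   = "outfile",
    outputGlob = "*.txt"
)

## 2) A recipe from the shell script shipped with the package
##    (parameter IDs/types and glob pattern are illustrative)
shfile <- system.file("extdata", "gencode_transcripts.sh",
                      package = "ReUseData")
rcp2 <- recipeMake(
    shscript   = shfile,
    paramID    = c("species", "version"),
    paramType  = c("string", "string"),
    outputID   = "transcripts",
    outputGlob = "*transcripts.fa*"
)
is(rcp2, "cwlProcess")
```

After assigning values to the input parameters (e.g., rcp2$species <- "human"), the recipe can be evaluated with getData() as shown in the data generation section.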

2.3 Recipe caching and updating
recipeUpdate() creates a local cache (on first use) for data recipes in a specified GitHub repository, and it syncs and updates data recipes from the GitHub repo to the local caching system, so any newly added recipes can be readily accessed and loaded into R.
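A minimal sketch of the caching and searching cycle (the keywords are illustrative):

```r
library(ReUseData)

## Sync the local recipe cache with the recipe GitHub repository;
## force = TRUE picks up changes to previously cached recipes
rh <- recipeUpdate(force = TRUE)
recipeNames(rh)

## Multi-keyword search of the cached recipes by name
recipeSearch(c("gencode", "transcripts"))
```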

NOTE:
• The cachePath argument needs to match across the recipeUpdate, recipeLoad, and recipeSearch functions.
• Use force=TRUE when any previously cached recipes have been updated.

3 Core functions for curated data

Here we introduce the core functions of ReUseData for data management and reuse: getData (or getCloudData) for reproducible data generation (or downloading from the Google bucket), dataUpdate for syncing and updating the data cache, and dataSearch for multi-keyword searching of data sets of interest.

3.1 Data generation
Once we have a data recipe, we first need to check the landing page for the recipe annotation, e.g., eligible values for each input parameter. Users can then assign values to the input parameters and evaluate the recipe (getData) to generate the data of interest. Users need to specify an output directory for all files (the desired data files, and the concomitant annotation files that are internally generated for data reuse). We encourage adding detailed notes for the data to be generated, which will be used for keyword matching in later data searches.
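A hedged sketch of one evaluation (the recipe name, parameter values, output directory, and notes are illustrative; always check the landing page for eligible values):

```r
library(ReUseData)

## Load a cached recipe, assign input values, and evaluate it
rcp <- recipeLoad("gencode_annotation")
rcp$species <- "human"
rcp$version <- "42"
outdir <- file.path(tempdir(), "SharedData")
res <- getData(rcp, outdir = outdir,
               notes = c("gencode", "annotation", "human", "42"))
res$output   # path(s) to the generated data file(s)
```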
Several annotation files are automatically generated to help track the data recipe evaluation:
• *.sh: records the original shell script.
• *.cwl: the official CWL workflow script, which was internally submitted for the data recipe evaluation.
• *.yml: part of the CWL workflow evaluation, which also records the data annotations.
• *.md5: checksum file to check and verify the integrity of the generated data files.
list.files(outdir, pattern = "GRCh38")

The *.yml file contains information about the recipe input parameters, the file path to the output file, the notes for the data set, and the date and time of data generation. A later data search using dataSearch() will refer to this file for keyword matching.

readLines(res$yml)

3.2 Know your data
Here we provide the function meta_data() to create a data frame that contains all information about the data sets in the specified file path (recursively), including the annotation file ($yml column), parameter values for the recipe ($params column), data file path ($output column), keywords for the data file ($notes column), and the date and time of data generation ($date column).

Use cleanup = TRUE to clean up any invalid, expired, or older intermediate files.
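A short sketch of the call (the folder path is illustrative):

```r
library(ReUseData)

## Collect metadata for all data sets under a folder, removing
## invalid or expired intermediate files along the way
mt <- meta_data(dir = "~/SharedData", cleanup = TRUE)
mt[, c("output", "notes", "date")]
```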

recipeUpdate returns a recipeHub object with a list of all available recipes. One can subset the list with [ and use the getter function recipeNames() to get the recipe names, which can then be passed to recipeSearch() or recipeLoad(). Cached data recipes can be searched using multiple keywords to match the recipe names; recipeSearch() returns a recipeHub object with a list of matching recipes. Recipes can be directly loaded into R using the recipeLoad function with a user-assigned name (or the original recipe name, see below for details). Once a recipe is successfully loaded, a message is returned with a link to the recipe landing page on the ReUseData portal with detailed user instructions. Make sure to check the required inputs() of the recipe and the recipe landing page for eligible input parameter values before evaluating the recipe to generate the data of interest.
#> cache path: /home/qian/.cache/R/ReUseDataRecipe
#> # recipeSearch() to query specific recipes using multiple keywords
#> # recipeUpdate() to update the local recipe cache
#> Check here: https://rcwl.org/dataRecipes/STAR_index.html
#> for user instructions (e.g., eligible input values, data source, etc.)

NOTE: Use return=FALSE if you want to keep the original recipe name, or if multiple recipes are to be loaded.

recipeLoad("STAR_index", return = FALSE)
identical(rcp, STAR_index)
#> [1] TRUE

recipeLoad(c("STAR_index", "gencode_annotation"), return = FALSE)
#> Data recipe loaded!
#> Use inputs(STAR_index) to check required input parameters before evaluation.
#> Check here: https://rcwl.org/dataRecipes/STAR_index.html
#> for user instructions (e.g., eligible input values, data source, etc.)
#> Data recipe loaded!
#> Use inputs(gencode_annotation) to check required input parameters before evaluation.
#> Check here: https://rcwl.org/dataRecipes/gencode_annotation.html
#> for user instructions (e.g., eligible input values, data source, etc.)
dataUpdate() creates (on first use), syncs, and updates the local cache for curated data sets. It finds and reads all the *.yml files recursively in the provided data folder, creates a cache record for each associated data set (including newly generated ones with getData()), and updates the local cache for later data searching and reuse.

IMPORTANT: It is recommended that users create a specified folder for data archival (e.g., file/path/to/SharedData) that other group members have access to, and use sub-folders for different kinds of data sets (e.g., those generated from the same recipe).

dataUpdate and dataSearch return a dataHub object with a list of all available or matching data sets. One can subset the list with [ and use getter functions to retrieve the annotation information about the data, e.g., data names, parameter values to the recipe, notes, tags, and the corresponding YAML file. If the argument cloud=TRUE is enabled, dataUpdate() will also cache the pre-generated data sets (from evaluation of pre-built recipes) that are available on the ReUseData Google bucket and return them in the dataHub object, fully searchable.

#> //storage.googleapis.com/reusedata/ucsc_database/refGene_m...

If the data of interest already exist on the cloud, getCloudData will directly download the data (and concomitant annotation files) to your computer. You can add them to the local caching system using dataUpdate() for later use.
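A hedged sketch tying these functions together (the shared folder path, getter name, and keywords are illustrative):

```r
library(ReUseData)

## Cache all curated data sets found under a shared folder
## (folder path is illustrative)
dh <- dataUpdate(dir = "~/SharedData")
dataNames(dh)          # getter for the cached data names

## Multi-keyword search of the cached data sets
dataSearch(c("gencode", "human"))
```

Keeping one shared folder (with per-recipe sub-folders) means a single dataUpdate() call registers everything the group has generated, so any member can locate a data set by keywords instead of re-running the recipe.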