Medical R: Building a Data Package

Peter Higgins

Getting Started on a Data Package

Most blog posts about building packages are about bundling functions that are related to each other into a single, thematic package. This is a common usage of packages, but at times you may want to share some common data sets with users in the form of a package. This is not as well-documented, but can be valuable.

A few examples of data packages include:

babynames
fueleconomy
nasaweather
nycflights13
usdanutrients

To get started, follow the guidance in the R Packages book (RP) at: https://r-pkgs.org/whole-game.html. This blog post is a quick getting-started guide compared with the full R Packages book, so use RP as a reference if you get stuck.

Setup

You want to have a new project open in RStudio, with the project named something like {datapackage} (pick a more appropriate name for your particular package). You then want to load (library commands in the Console) some critical packages including {devtools}, {usethis}, and {roxygen2}.

library(devtools)
library(usethis)
library(roxygen2)

Creating the Package

Then use the create_package() function to create your new package {datapackage}, by running this function in the Console (pick a better name than datapackage):

create_package("datapackage")

This will set up your directories and folders in the current project to help you create a package.

Versioning with Git and GitHub

If you have never used git before for version control, this would be a good time to install and initialize git on your computer, using the guidance in Chapter 6 of Happy Git with R, at: https://happygitwithr.com/install-git.html

Details on how to configure git are in chapter 7, at https://happygitwithr.com/hello-git.html

Now make a git repository to track your versions, by running this in the Console:

use_git()

Go ahead and commit files when prompted to do so.

If you have never used GitHub before for version control and collaboration, this would be a good time to set up a GitHub account at www.github.com and to add a PAT (Personal Access Token) to your computer, using the guidance in Chapter 4 of Happy Git with R, at: https://happygitwithr.com/github-acct.html

Details on PATs (you want one, trust me) are found in Appendix B, at: https://happygitwithr.com/github-pat.html

Edit the Description

Look in your Files tab, open the DESCRIPTION file, and edit:

Make yourself the author (and add collaborators!)
Add a title
Add a longer Description: who is it for, why would they want this data, and what is a use case?

Add a License

Make your life easy, and run use_mit_license() in the Console.

Add data

In the Files tab in RStudio, create a data-raw folder in your project. Add the raw files (csv, Excel, etc.) that you want to turn into your accessible datasets.
Now create a separate data folder in your project.

In the data-raw folder, create a new R script named prep_data.R. In this file, wrangle, filter, or otherwise prepare your datasets. Let’s pretend you have a data file that you have wrangled from Excel, and it is now a data frame named sportsdata.

Now save this dataset in RDS format, as sportsdata.rds in the data-raw folder. Pick the name carefully, as you will have to live with it for a long time.

Near the end of this script, do this saving step with saveRDS(sportsdata, "data-raw/sportsdata.rds")

Then put the data into its final *.rda format in the data folder with the following line in your script:

usethis::use_data(sportsdata, overwrite = TRUE)
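
The two save steps above are easy to blur together, so here is a minimal sketch of the difference using base R only. It assumes a made-up two-row sportsdata data frame and writes to a temporary folder rather than a real package project; the values and file names are purely illustrative. saveRDS() stores a single object without its name, while save() stores the object together with its name, which is the *.rda format that usethis::use_data() writes into the data folder.

```r
# Toy stand-in for the wrangled data frame (values are invented for illustration)
sportsdata <- data.frame(
  team = c("Lions", "Tigers"),
  wins = c(10L, 7L)
)

# Work in a temporary folder so nothing in a real project is touched
tmp_dir <- tempfile("datapackage_demo_")
dir.create(tmp_dir)

# saveRDS() writes ONE object without its name; readRDS() returns it,
# and you choose the name when you assign the result
rds_path <- file.path(tmp_dir, "sportsdata.rds")
saveRDS(sportsdata, rds_path)
restored <- readRDS(rds_path)

# save() writes the object together with its name (.rda / .RData format);
# load() recreates it under that same name and returns the name(s) it loaded
rda_path <- file.path(tmp_dir, "sportsdata.rda")
save(sportsdata, file = rda_path)
env <- new.env()
loaded_names <- load(rda_path, envir = env)

# Confirm both round trips worked
print(all.equal(sportsdata, restored))
print(loaded_names)
```

Inside a real package project you would not call save() yourself; the usethis::use_data() line above does that and puts the file in the right place.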

Installing the Package

Deal with Dependencies

Do you need to use other packages to run functions (most of the time not, for a data package)?

On the off chance you do, you will want to run: use_package("packagename")

As in, if you need readr to read in some data, run in the Console:

use_package('readr')

Set up for push to GitHub

Run in the Console:

use_github()

This will link your project to a GitHub repository, so that you can share your datapackage with the world.

Basic documentation

Create an R script in the R/ folder, named something like datapackage.R.

In this file, you can add roxygen2 documentation, which looks like a long list of funny-looking comments (they start with #' rather than just a hashtag).

Typically you include:

a short description, in a sentence or two.
after the @format statement, details on the format (data frame), the dimensions (how many rows, how many columns), and a short description of each variable (item), with its data type and levels or units.
after the @source statement, a description of where the data came from, often including a link.

This looks something like:

#' Lifetables
#'
#' Cohort life tables data as provided by SSA.
#'
#' @format A data frame with nine variables:
#' \describe{
#' \item{\code{x}}{age in years}
#' \item{\code{qx}}{probability of death at age \code{x}}
#' \item{\code{lx}}{number of survivors, of birth cohort of 100,000, at next integral age}
#' \item{\code{dx}}{number of deaths that would occur between integral ages}
#' \item{\code{Lx}}{Number of person-years lived between \code{x} and \code{x+1}}
#' \item{\code{Tx}}{Total number of person-years lived beyond age \code{x}}
#' \item{\code{ex}}{Average number of years of life remaining for members of cohort alive at age \code{x}}
#' \item{\code{sex}}{Sex}
#' \item{\code{year}}{Year}
#' }
#'
#' @source For further details, see \url{http://www.ssa.gov/oact/NOTES/as120/LifeTables_Body.html#wp1168594}
#'
"lifetables"

Note that this is fairly easy to do for 2-22 variables, but can get ugly to do by hand when you have over 100 variables. At that point, it can be helpful to use code to automate creating the items. The generate_variable_items_roxygen.R file in the RCode folder is very helpful for this. It glues together text to make a rough draft of all the items, with output in the Console. You can copy and paste this into your dataset.R file to get started. The NAs then need to be edited by hand to add the full variable name, units, or definitions of numeric codes.
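
As a rough illustration of that kind of automation (this is not the generate_variable_items_roxygen.R script itself, just a sketch in base R), the hypothetical helper draft_roxygen_items() below writes one \item line per column, with an NA placeholder for the description and the column's class as a hint. The helper name and the toy data frame are assumptions made up for this example.

```r
# Hypothetical helper: draft one roxygen \item line per column of a data frame.
# The description is left as an NA placeholder to be edited by hand;
# the first class of each column is added in parentheses as a hint.
draft_roxygen_items <- function(df) {
  col_types <- vapply(df, function(col) class(col)[1], character(1))
  sprintf("#' \\item{\\code{%s}}{NA (%s)}", names(df), col_types)
}

# Toy data frame standing in for a real dataset (values invented)
lifetable_demo <- data.frame(
  x   = c(0L, 1L),
  qx  = c(0.006, 0.0004),
  sex = c("M", "F")
)

items <- draft_roxygen_items(lifetable_demo)

# Print a rough draft of the \describe block to the Console for copy and paste
writeLines(c("#' \\describe{", items, "#' }"))
```

The printed lines are only a starting point: each NA still has to be replaced with a real description, units, or code definitions.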

(OPTIONAL) Note that you can use R Markdown formatting for this file (and for similar files throughout your package) if you add the following line to your package DESCRIPTION file:
Roxygen: list(markdown = TRUE)

To build your roxygen2 document into a proper help file, you need to run devtools::document(). This will use roxygen2 to convert this *.R script file to a *.Rd file in the man folder. This provides the “manual” for your data.

To build this into the package, re-build the package, with Cmd-Shift-B (Mac) or Ctrl-Shift-B (Windows), or click on the Install and Restart button under the Build tab in RStudio. This will rebuild the package, restart R, install the package, and load (library) the package.

Now you should be able to call from the console:

?dataset or help(dataset)

and see your documentation.

Then save the files, make a Git commit, and push the updated package up to your GitHub repository.

Documenting for a pkgdown website

If you have built a pkgdown website, you will want to add your new datasets to the website when you add them to the package.

build the *.R documentation in the R folder
build the *.Rd documentation in the man folder, with devtools::document()
add the new dataset to the _pkgdown.yml file as
- dataset
run pkgdown::build_site()
check the website preview
if it looks OK, commit and push the project to GitHub

Advanced documentation

Description Document You can write a long description of the provenance of the data, when and how it was collected, where it was published in a journal, etc. (go for it, provide all the details!) in an R Markdown file (named something like datapackage_desc.Rmd), then knit this to pdf. When you push your project/package to GitHub, your pdfs show up properly formatted. You can then take the link to the pdf and put it in your README file, so that anyone can read up on your datasets.

I store these (both the Rmd and the knitted HTML files) in a description_docs folder which is located within the man folder. This does not seem to cause any problems with package building or CRAN checks, as long as my .Rbuildignore file contains the line:
^man/description_docs$

Codebook Document You can create a nicely formatted table that lists each variable with a longer description, units of measurement, and levels/category codes and what they mean. Create an R Markdown file (named something like datapackage_codebook.Rmd), then knit this to pdf. When you push your project/package to GitHub, your pdfs show up properly formatted. You can then take the link to the pdf and put it in your README file, so that anyone can know what each variable in your dataset is about.

I store these (both the Rmd and the knitted HTML files) in a codebooks folder which is located within the man folder. This does not seem to cause any problems with package building or CRAN checks, as long as my .Rbuildignore file contains the line:
^man/codebooks$
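
Each line of .Rbuildignore is a regular expression that is matched against file paths relative to the package root, which is why the folder names are wrapped in ^ and $. The sketch below uses base R to show what the two patterns do and do not match; the example paths are made up for illustration. Inside a package project, usethis::use_build_ignore("man/description_docs") should add the escaped pattern for you, so you do not have to type it by hand.

```r
# The two patterns from this section
patterns <- c("^man/description_docs$", "^man/codebooks$")

# Made-up paths, relative to the package root
paths <- c(
  "man/description_docs",  # the folder itself: should match
  "man/codebooks",         # the folder itself: should match
  "man/sportsdata.Rd",     # a normal help file: should NOT match
  "man/codebooks_old"      # similar name: the $ anchor prevents a match
)

# A path is ignored if ANY pattern matches it. The matching here is
# Perl-style and case-insensitive, which is my understanding of how
# R CMD build treats .Rbuildignore lines.
is_ignored <- vapply(
  paths,
  function(p) {
    any(vapply(patterns, grepl, logical(1), x = p, perl = TRUE, ignore.case = TRUE))
  },
  logical(1)
)

print(is_ignored)
```

Matching the folder itself is enough: as far as I can tell, when a folder is ignored at build time its contents are left out along with it.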

Add a README

Just run the following in the Console:

use_readme_rmd()

Then open README.Rmd and edit this. This is a good place to put a description of the package, who it is intended for, and some likely use cases.

This is a helpful place to let users, especially new users, know how to load your package, and how to access the datasets. Show an example or two.

It is also a good place to add a table with the names of the datasets and links to the descriptions and codebooks.
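
For instance, a README example might show users how to list the available datasets and peek at one of them. The sketch below uses R's built-in datasets package as a stand-in so that it runs anywhere; in your own README you would swap in your package name and one of your datasets (datapackage and sportsdata are just the placeholder names used in this post).

```r
# Stand-in package name; replace with your own data package, e.g. "datapackage"
pkg <- "datasets"

# data(package = ...) returns a list whose `results` matrix has one row per
# dataset, including its name (Item) and a short title (Title)
available <- data(package = pkg)$results[, c("Item", "Title")]
head(available)

# Access a dataset without attaching the package, using the :: operator
head(datasets::mtcars)

# Users can then open the help page you wrote with roxygen2, for example:
# ?mtcars
```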

Check and install

check()

install()

Build

The Build tab in RStudio will let you Build Package which will also restart R.

This will give you a clean slate and re-load (library) your datapackage.

Once you are happy with the datapackage, commit and push to GitHub. Now anyone can use your datapackage.

Checklist for adding a new dataset

1. Get raw data, wrangle as needed, and save it into your data-raw folder. Save this in RDS format with saveRDS if needed: saveRDS(newdata, "data-raw/newdata.rds"). Delete the original if needed.
2. Format the data with usethis::use_data(newdata, overwrite = TRUE). This will put it into the data folder in .rda format.
3. Create a help file: an R script in the R/ folder, named something like newdata.R. Copy and SaveAs from a pre-existing file. Use roxygen2 and R Markdown formatting as above.
4. Run devtools::document() to build this into documentation in the man folder.
5. Run help(newdata) to check the help file. Edit if needed, then repeat steps 4 and 5 until the help file is satisfactory.
6. Generate a description document (similar to the help file) in R Markdown in the man/description_docs folder. Copy and SaveAs from a pre-existing file. Knit to HTML.
7. Generate a codebook document (similar to the help file) in R Markdown in the man/codebooks folder. Copy and SaveAs from a pre-existing file. Knit to HTML.
8. Edit the _pkgdown.yml file: under contents:, add a line for your newdata. Save.
9. Commit your changes and push to GitHub if you have not already.
10. Edit the README.Rmd file. Update the number of datasets in the Overview paragraph.
11. Edit the README.Rmd file. Update the List of Datasets table. Add a line for the new dataset, and add the links to the new description document and codebook (should be similar to the line for the scurvy dataset, but with a different dataset name).
12. Commit and push your changes to GitHub.
13. Rebuild the website with pkgdown::build_site(). Check the website, and edit the README if needed. Repeat steps 11-13 until the website is satisfactory.
14. Rebuild the package, with Cmd-Shift-B (Mac) or Ctrl-Shift-B (Windows).
15. On the Build tab, run a Check. Fix any errors or problems.
16. On the Build tab, run Install and Restart.
17. Commit and push all changes to GitHub.
