For building a package that is largely about making datasets readily available in R.
Most blog posts about building packages are about bundling functions that are related to each other into a single, thematic package. This is a common usage of packages, but at times you may want to share some common data sets with users in the form of a package. This is not as well-documented, but can be valuable.
A few examples of data packages include:
You want to have a new project open in RStudio, with the project named something like {datapackage} (pick a more appropriate name for your particular package). You then want to load (library commands in the Console) some critical packages including {devtools}, {usethis}, and {roxygen2}.
library(devtools)
library(usethis)
library(roxygen2)
Then use the create_package() function to create your new package {datapackage}, by running this function in the Console (pick a better name than datapackage):
create_package("datapackage")
This will set up your directories and folders in the current project to help you create a package.
If you have never used git before for version control, this would be a good time to install and initialize git on your computer, using the guidance in Chapter 6 of Happy Git with R, at: https://happygitwithr.com/install-git.html
Details on how to configure git are in chapter 7, at https://happygitwithr.com/hello-git.html
Now make a git repository to track your versions, by running this in the Console:
use_git()
Go ahead and commit files when prompted to do so.
If you have never used GitHub before for version control and collaboration, this would be a good time set up a Github account at www.github.com and to add a PAT (Personal Access Token) to your computer, using the guidance in Chapter 4 of Happy Git with R, at: https://happygitwithr.com/github-acct.html
Details on PATs (you want one, trust me) are found in Appendix B, at: https://happygitwithr.com/github-pat.html
Look in your Files tab, open the DESCRIPTION file, and edit:
Mke your life easy, and run use_mit_license()
in the Console
In the Files tab in RStudio, create a raw-data
folder in your project. Add the raw files (csv, excel, etc.) that you want to turn into your accessible datasets.
Now create a separate data
folder in your project.
In the raw-data
folder, create a new R script named prep_data.R
. In this file, wrangle, filter, or otherwise prepare your datasets. Let’s pretend you have a datafile that you have wrangled from Excel, and it is now a dataframe named sportsdata
.
Now you want to save this datasets in *.Rdata
format, as sportsdata.Rdata
in the raw-data
folder. Pick the name carefully, as you will have to live with it for a long time.
near the end of this script, do this saving step with with saveRDS(sportsdata, "data-raw/sportsdata.Rdata")
Then use the data (put the data in the proper final [*.rda
] format in the data
folder) with the following line in your script:
usethis::use_data(sportsdata, overwrite = TRUE)
Do you need to use other packages to run functions (most of the time not, for a data package)?
On the off chance you do, you will want to run: use_package("packagename")
As in, if you need readr to read in some data, run in the Console:
use_package('readr')
Run in the Console:
use_github()
This will link your project to a Github repository, so that you can share your datapackage with the world.
Create an R script in the R/
folder, named something like datapackage.R
.
In this file, you can add roxygen2 documentation, which looks like a long list of funny-looking comments (they start with #'
rather than just a hashtag).
Typically you include:
This looks something like:
#' Baby names.
#'
#' Full baby name data provided by the SSA. This includes all names with at
#' least 5 uses.
#'
#' Lifetables
#'
#' Cohort life tables data as provided by SSA.
#'
#' @format A data frame with nine variables:
#' \describe{
#' \item{\code{x}}{age in years}
#' \item{\code{qx}}{probability of death at age \code{x}}
#' \item{\code{lx}}{number of survivors, of birth cohort of 100,000, at next integral age}
#' \item{\code{dx}}{number of deaths that would occur between integral ages}
#' \item{\code{Lx}}{Number of person-years lived between \code{x} and \code{x+1}}
#' \item{\code{Tx}}{Total number of person-years lived beyond age \code{x}}
#' \item{\code{ex}}{Average number of years of life remaining for members of cohort alive at age \code{x}}
#' \item{\code{sex}}{Sex}
#' \item{\code{year}}{Year}
#' }
#'
#' For further details, see \url{http://www.ssa.gov/oact/NOTES/as120/LifeTables_Body.html#wp1168594}
#'
"lifetables"
Note that this is fairly easy to do for 2-22 variables, but can get ugly to do by hand when you have over 100 variables. At this point, it can be helpful to use a code solution to automate creating the items
. The generate_variable_items_roxygen.R
file in the RCode folder is very helpful for this. It will glue together text to make a rough draft of all the items, with output in the Console. You can copy and paste this into your dataset.R
file to get started. The NAs need to be edited to full variable name, add units, or definitions of numeric codes.
(OPTIONAL} Note that use can use Rmarkdown formatting for this file (and for similar files throughout your package) if you add the following line to your package DESCRIPTION file:
Roxygen: list(markdown = TRUE)
To build your roxygen2 document into a proper help file, you need to run devtools::document()
. This will use roxygen2 to convert this *.R
script file to a *.Rd
file in the man
folder. This provides the “manual” for your data.
To build this into the package, re-build the package, with Cmd-Shift-B (Mac) or Ctrl-Shift-B (Windows), or click on the Install and Restart
button under the Build tab in RStudio. This will rebuild the package, restart R, install the package, and load (library) the package.
Now you should be able to call from the console:
?dataset
or help(dataset)
and see your documentation.
Then save the files and make a Git commit and then push the updated package up to your Github repository.
If you have built a pkgdown website, you will want to add your new datasets to the website when you add them to the package.
devtools::document()
- dataset
pkgdown::build_site()
Description Document You can write a long description of the provenance of the data, when and how it was collected, where it was published in a journal, etc. (go for it, provide all the details!) in a Rmarkdown file (named something like datapackage_desc.Rmd), then knit this to pdf. When you push your project/package to Github, your pdfs show up properly formatted. You can then take the link to the pdf and put it in your README file, so that anyone can read up on your datasets.
I store these (both the Rmd and the knit HTML files) in a description_docs folder which is located within the man folder. This does not seem to cause any problems with package building or CRAN checks, as long as my .Rbuildingore
file contains the line:
^man/description_docs$
.
Codebook Document You can create a nice table of each variable, a longer description, units of measurement, levels/category codes and what they mean, in a nicely formatted Rmd table. Create a Rmarkdown file (named something like datapackage_codebook.Rmd), then knit this to pdf. When you push your project/package to Github, your pdfs show up properly formatted. You can then take the link to the pdf and put it in your README file, so that anyone can know what each variable in your dataset is about. I store these (both the Rmd and the knit HTML files) in a description_docs folder which is located within the man folder. This does not seem to cause any problems with package building or CRAN checks, as long as my .Rbuildingore
file contains the line:
^man/description_docs$
.
I store these (both the Rmd and the knit HTML files) in a codebooks folder which is located within the man folder. This does not seem to cause any problems with package building or CRAN checks, as long as my .Rbuildingore
file contains the line:
^man/codebooks$
.
Just run the following in the Console:
use_readme_rmd()
Then open README.Rmd and edit this. This is a good place to put a description of the package, who it is intended for, and some likely use cases.
This is a helpful place to let users, especially new users, know how to load your package, and how to access the datasets. Show an example or two.
It is also a good place to add a table with the names of the datasets and links to the descriptions and codebooks.
check()
install()
The Build tab in RStudio will let you Build Package
which will also restart R.
This will give you a clean slate and re-load (library) your datapackage.
Once you are happy with the datapackage, commit and push to Github. Now anyone can use your datapackage.
raw-data
folder. Save this to .Rdata format with saveRDS if needed. saveRDS(newdata, "data-raw/newdata.Rdata")
. Delete the original if needed.usethis::use_data(newdata, overwrite = TRUE)
. This will put it into the data
folder in .rda format.R/
folder, named something like newdata.R
. Copy and SaveAs from a pre-existing file. Use roxygen2 and rmarkdown formatting as above.devtools::document()
to build this into documentation in the man
folder.help(newdata)
to check the help file. Edit if needed, then repeat step 4 and 5 until the help file is satisfactory.man/description_docs
folder. Copy and SaveAs from a pre-existing file. Knit to HTML.man/codebooks
folder. Copy and SaveAs from a pre-existing file. Knit to HTML._pkgdown.yml
folder to add to contents:
- add a line for your newdata
. Save.Overview
paragraph.pkgdown::build_site()
. Check the website, and edit the README if needed. Repeat steps 11-13 until the website is satisfactory.
For attribution, please cite this work as
Higgins (2020, Oct. 18). Medical R: Building a Data Package. Retrieved from https://higgi13425.github.io/medical_r/posts/2020-10-18-building-a-data-package/
BibTeX citation
@misc{higgins2020building, author = {Higgins, Peter}, title = {Medical R: Building a Data Package}, url = {https://higgi13425.github.io/medical_r/posts/2020-10-18-building-a-data-package/}, year = {2020} }