An R Package Of Medical Data For Teaching • medicaldata

Overview

This is a data package with 19 medical datasets for teaching Reproducible Medical Research with R. The link to the pkgdown reference website for {medicaldata} is here and in the links at the right. This package will be useful for anyone teaching R to medical professionals, including doctors, nurses, pharmacists, trainees, and students.

These datasets range from reconstructed versions of James Lind’s scurvy dataset (1757) and the original Streptomycin for Tuberculosis trial (1948), through a 2012 RCT of indomethacin to prevent post-ERCP pancreatitis that I was involved in, to cohort data on SARS-CoV-2 testing results (2020). Many of the datasets come, with permission, from the American Statistical Association’s TSHS (Teaching Statistics in the Health Sciences) Resources Portal, maintained by Carol Bigelow at the University of Massachusetts. A growing number of datasets were generously donated by Frank Harrell from his website here. These donated datasets are currently only in the dev version of the package on github.com, which should make it to CRAN in June of 2023.

How to Install and Use {medicaldata} Datasets

Install the stable, current CRAN version with install.packages("medicaldata"). If you want to try out the in-development version (which may have new datasets and vignettes, but which may also be intermittently wonky), install with: remotes::install_github("higgi13425/medicaldata")
Then load the package with library(medicaldata)
Then you can list the datasets available with data(package = "medicaldata")
Then assign a particular dataset to a named object in your environment with:
covid <- medicaldata::covid_testing
where covid is the name of the new object, and covid_testing is the name of the dataset.
Articles (vignettes) on how to use the datasets can be found at the pkgdown website under the Articles tab.
You can click on the links below to view the description document and/or codebook for each dataset. This information is also available under the Reference tab above, or within R by using help(dataset_name).
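The steps above can be collected into one short script. This is a minimal sketch that assumes {medicaldata} is already installed from CRAN; the object name dataset_names is only illustrative and is not part of the package.

```r
# install.packages("medicaldata")  # run once if not already installed
library(medicaldata)

# data() returns an object whose $results matrix has one row per dataset;
# the "Item" column holds the dataset names.
dataset_names <- data(package = "medicaldata")$results[, "Item"]
print(dataset_names)

# Assign one dataset to a named object in your environment
covid <- medicaldata::covid_testing

# Inspect its structure, then open its help file
str(covid)
help("covid_testing", package = "medicaldata")
```

Printing dataset_names is a quick way to check which datasets your installed version actually contains, since the dev version may include more than the CRAN release.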

Please Donate Datasets

If you have access to data from a randomized, controlled clinical trial, a prospective cohort study, or even a case-control study, please consider obtaining the appropriate permissions, anonymizing the data, and donating the dataset to this package for teaching purposes. Open an issue on the GitHub page (source code link at the top right) to start the discussion of a data donation. I am happy to help with anonymization.

List of Datasets

Click on links below for more details about the dataset itself in the Description Document, and more details about the variables included in the dataset in the Codebook. Note that each dataset also has a help file that you can use within R or RStudio, by entering help("dataset_name") in the Console pane. The fourth column of the table below (scroll to the right or widen your browser window) describes the study design, as requested by Dan Sjoberg of {gtsummary} fame.

Dataset	Description document	Codebook	Design
strep_tb	strep_tb_desc	strep_tb_codebook	Randomized Controlled Trial (RCT)
scurvy	scurvy_desc	scurvy_codebook	RCT
indo_rct	indo_rct_desc	indo_rct_codebook	RCT
polyps	polyps_desc	polyps_codebook	RCT
cervical dystonia (dev)	cdystonia_desc	cdystonia_codebook	RCT
covid_testing	covid_desc	covid_codebook	Retrospective cross-sectional
blood_storage	blood_storage_desc	blood_storage_codebook	Retrospective Cohort Study
cytomegalovirus	cytomegalovirus_desc	cytomegalovirus_codebook	Retrospective Cohort Study
esoph_ca	esoph_ca_desc	esoph_ca_codebook	Case-control study
laryngoscope	laryngoscope_desc	laryngoscope_codebook	RCT
licorice_gargle	licorice_gargle_desc	licorice_gargle_codebook	RCT
opt	opt_desc	opt_codebook	RCT
cath (dev)	cath_desc	cath_codebook	Retrospective Cohort Study
smartpill	smartpill_desc	smartpill_codebook	Prospective Cohort Study
supraclavicular	supraclavicular_desc	supraclavicular_codebook	RCT
indometh	indometh_desc	indometh_codebook	Prospective Cohort Pharmacokinetic (PK) Study
theoph	theoph_desc	theoph_codebook	Prospective Cohort PK Study
diabetes (dev)	diabetes_desc	diabetes_codebook	Prospective Longitudinal Cohort Study
thiomon (dev)	thiomon_desc	thiomon_codebook	Retrospective Cohort Study, suitable for ML
abm (dev)	abm_desc	abm_codebook	Retrospective Cohort Study

Messy Datasets

I am doing a beta test of messy datasets, largely in Excel, with many annoying non-tidy and non-rectangular features that will help teach data cleaning/wrangling. These are not actually in the package itself (as they are not R files), but can be found in the GitHub repository.

You can download and open these from the GitHub repo in all of their messy Excel glory by clicking on the URL links in the table below. You can also find them here in the list on the GitHub repo, where you can click on one of the *.xlsx files, then click on the View Raw button to download it.

You can read these datasets directly into R from the URLs in the table below. The example code in the following code chunk reads in the messy_infarct dataset and assigns it to the object infarct. It may be easiest to copy the entire code chunk below by hovering over the copy icon in the top right corner, then clicking to copy.

# install.packages("openxlsx")  # run once if not already installed
library(openxlsx)
url <- "https://github.com/higgi13425/medicaldata/raw/master/data-raw/messy_data/messy_infarct.xlsx"
# replace the filename "messy_infarct.xlsx" at the end of this long url path with the filename that you want to load. 
# Or just copy the whole path from the URL column below.
infarct <- openxlsx::read.xlsx(url)
head(infarct)
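The filename substitution described in the comments of the chunk above can be wrapped in a small helper. This is only a sketch: messy_url() is a hypothetical function written for this example, not part of {medicaldata}, and it assumes the files stay at the same path in the GitHub repository.

```r
# Hypothetical helper (not part of {medicaldata}): build the raw GitHub URL
# for a messy dataset from its name, e.g. "messy_bp".
messy_url <- function(dataset) {
  base <- "https://github.com/higgi13425/medicaldata/raw/master/data-raw/messy_data/"
  paste0(base, dataset, ".xlsx")
}

# Build the URL for the messy_bp dataset
bp_url <- messy_url("messy_bp")
print(bp_url)

# To read the file (needs the openxlsx package and an internet connection):
# bp <- openxlsx::read.xlsx(bp_url)
```

Keeping the base path in one place means only the dataset name changes from one call to the next.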

Available Messy Datasets (beta)

Dataset	URL	Type of Messiness
messy_cirrhosis	"https://github.com/higgi13425/medicaldata/raw/master/data-raw/messy_data/messy_cirrhosis.xlsx"	Pivot Table
messy_infarct	"https://github.com/higgi13425/medicaldata/raw/master/data-raw/messy_data/messy_infarct.xlsx"	Pivot Table
messy_aki	"https://github.com/higgi13425/medicaldata/raw/master/data-raw/messy_data/messy_aki.xlsx"	unique ids, header and footer rows, empty rows & cols, messy varnames, no units, typos in factors, visit date in headers, dates
messy_bp	"https://github.com/higgi13425/medicaldata/raw/master/data-raw/messy_data/messy_bp.xlsx"	unite and separate, vars without units, visit num in headers, data entry errors
messy_glucose	"https://github.com/higgi13425/medicaldata/raw/master/data-raw/messy_data/messy_glucose.xlsx"	factors, vars without units, visit num in headers, header rows, empty rows/cols
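Two of the kinds of messiness listed above, empty rows and columns and messy variable names, can be handled in base R along these lines. This is a sketch under stated assumptions: the toy data frame is invented for illustration and is not the contents of any of the messy_*.xlsx files, whose actual layout may need different steps.

```r
# Toy data frame invented for illustration only; it does NOT reproduce
# the contents of any messy_*.xlsx file in the repository.
messy <- data.frame(
  `Patient ID` = c("001", NA, "002"),
  `Creatinine ` = c(1.2, NA, 2.4),
  empty = c(NA, NA, NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Keep only rows and columns that contain at least one non-missing value
keep_rows <- rowSums(!is.na(messy)) > 0
keep_cols <- colSums(!is.na(messy)) > 0
clean <- messy[keep_rows, keep_cols, drop = FALSE]

# Standardize variable names: trim whitespace, lower-case,
# and replace runs of non-alphanumeric characters with underscores
names(clean) <- gsub("[^a-z0-9]+", "_", tolower(trimws(names(clean))))

print(clean)
```

Packages such as {janitor} offer ready-made functions for these steps; base R is used here only to keep the sketch free of extra dependencies.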