class: center, middle, inverse, title-slide # How to Use
filter()
functions to Remove Rows with Missing Values ## Selecting specific rows with or without NAs ### Peter Higgins ### 2021-01-10 --- ### How to Use the _filter()_ function for NA or *not* NA values #### Format: <br> `dataset %>% filter(is.na(variable)))` Negation is done by preceding with ! , <br> so !is.na(variable) produces a logical vector which is <br> TRUE for all _non-missing_ value rows. **R**emember that filte**R** is for selecting **R**ows because it ends with an **R**. Let's walk through some **filter** Examples! --- count: false Example 1/4: Filter Rows for Missing Dates - find which Ones are Missing .panel1-filter1-auto[ ```r # how many rows when you start *nrow(covid_dates) ``` ] .panel2-filter1-auto[ ``` [1] 15524 ``` ] --- count: false Example 1/4: Filter Rows for Missing Dates - find which Ones are Missing .panel1-filter1-auto[ ```r # how many rows when you start nrow(covid_dates) *covid_dates ``` ] .panel2-filter1-auto[ ``` [1] 15524 ``` ``` # A tibble: 15,524 x 18 subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> 1 1412 jhezane westerling female 4 covid inpatient … 2 533 penny targaryen female 7 covid clinical l… 3 9134 grunt rivers male 7 covid clinical l… 4 8518 melisandre swyft female 8 covid clinical l… 5 8967 rolley karstark male 8 covid emergency … 6 11048 megga karstark female 8 covid oncology d… 7 663 ithoke targaryen male 9 covid clinical l… 8 2158 ravella frey female 9 covid emergency … 9 3794 styr tyrell male 9 covid clinical l… 10 4706 wynafryd seaworth male 9 covid clinical l… # … with 15,514 more rows, and 11 more variables: result <chr>, # demo_group <chr>, age <dbl>, drive_thru_ind <dbl>, ct_result <dbl>, # orderset <dbl>, payor_group <chr>, patient_class <chr>, col_rec_tat <dbl>, # rec_ver_tat <dbl>, fake_date <date> ``` ] --- count: false Example 1/4: Filter Rows for Missing Dates - find which Ones are Missing .panel1-filter1-auto[ ```r # how many rows when you start nrow(covid_dates) covid_dates %>% # selected 3 columns # see how date changes w/filter * select(fake_date, result, gender) ``` ] .panel2-filter1-auto[ ``` [1] 15524 ``` ``` # A tibble: 15,524 x 3 fake_date result gender <date> <chr> <chr> 1 2020-03-05 negative female 2 2020-03-08 negative female 3 2020-03-08 negative male 4 2020-03-09 negative female 5 2020-03-09 negative male 6 2020-03-09 negative female 7 NA negative male 8 2020-03-10 negative female 9 2020-03-10 negative male 10 2020-03-10 negative male # … with 15,514 more rows ``` ] --- count: false Example 1/4: Filter Rows for Missing Dates - find which Ones are Missing .panel1-filter1-auto[ ```r # how many rows when you start nrow(covid_dates) covid_dates %>% # selected 3 columns # see how date changes w/filter select(fake_date, result, gender) %>% * filter(is.na(fake_date)) ``` ] .panel2-filter1-auto[ ``` [1] 15524 ``` ``` # A tibble: 7 x 3 fake_date result gender <date> <chr> <chr> 1 NA negative male 2 NA negative male 3 NA negative female 4 NA negative female 5 NA negative male 6 NA negative male 7 NA negative female ``` ] --- count: false Example 1/4: Filter Rows for Missing Dates - find which Ones are Missing .panel1-filter1-auto[ ```r # how many rows when you start nrow(covid_dates) covid_dates %>% # selected 3 columns # see how date changes w/filter select(fake_date, result, gender) %>% filter(is.na(fake_date)) # see how many rows now # check dates - should be NA # Format: *# filter(is.na(variable)) <br> # filter(is.na(variable)) <br> ``` ] .panel2-filter1-auto[ ``` [1] 15524 ``` ``` # A tibble: 7 x 3 fake_date result gender <date> <chr> <chr> 1 NA negative male 2 NA negative male 3 NA negative female 4 NA negative female 5 NA negative male 6 NA negative male 7 NA negative female ``` ] <style> .panel1-filter1-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-filter1-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-filter1-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- count: false Example 2/4: Filter Rows for Non-Missing Dates with !is.na() .panel1-filter2-auto[ ```r # how many rows when you start *nrow(bmt_dates) ``` ] .panel2-filter2-auto[ ``` [1] 64 ``` ] --- count: false Example 2/4: Filter Rows for Non-Missing Dates with !is.na() .panel1-filter2-auto[ ```r # how many rows when you start nrow(bmt_dates) *bmt_dates ``` ] .panel2-filter2-auto[ ``` [1] 64 ``` ``` # A tibble: 64 x 29 id age sex race diagnosis diagnosis_type time_to_transpl… <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 1 61 1 0 acute my… 1 5.16 2 NA NA NA NA <NA> NA NA 3 3 63 0 1 non-Hodg… 0 35.6 4 4 33 0 1 Hodgkin … 0 33.0 5 NA NA NA NA <NA> NA NA 6 6 55 1 1 myelofib… 1 2.43 7 7 67 1 1 acute my… 1 9.59 8 8 51 1 1 acute my… 1 NA 9 9 44 0 0 multiple… 0 43.4 10 10 59 1 1 chronic … 0 92.7 # … with 54 more rows, and 22 more variables: prior_radiation <dbl>, # prior_chemo <dbl>, prior_transplant <dbl>, recipient_cmv <dbl>, # donor_cmv <dbl>, donor_sex <dbl>, tnc_dose <dbl>, cd34_dose <dbl>, # cd3_dose <dbl>, cd8_dose <dbl>, tbi_dose <dbl>, c1_c2 <dbl>, a_ki_rs <dbl>, # cmv <dbl>, time_to_cmv <dbl>, agvhd <dbl>, time_to_agvhd <dbl>, # cgvhd <dbl>, time_to_cgvhd <dbl>, fake_dx_date <date>, # fake_bmt_date <date>, fake_agvhd_date <date> ``` ] --- count: false Example 2/4: Filter Rows for Non-Missing Dates with !is.na() .panel1-filter2-auto[ ```r # how many rows when you start nrow(bmt_dates) bmt_dates %>% # selected 3 columns # see how date changes w/filter * select(fake_dx_date, race, donor_sex) ``` ] .panel2-filter2-auto[ ``` [1] 64 ``` ``` # A tibble: 64 x 3 fake_dx_date race donor_sex <date> <dbl> <dbl> 1 2014-05-07 0 0 2 NA NA NA 3 2014-05-07 1 0 4 2014-05-07 1 1 5 NA NA NA 6 2014-05-07 1 1 7 2014-05-07 1 1 8 NA 1 0 9 2014-05-07 0 1 10 2014-05-07 1 0 # … with 54 more rows ``` ] --- count: false Example 2/4: Filter Rows for Non-Missing Dates with !is.na() .panel1-filter2-auto[ ```r # how many rows when you start nrow(bmt_dates) bmt_dates %>% # selected 3 columns # see how date changes w/filter select(fake_dx_date, race, donor_sex) %>% * filter(!is.na(fake_dx_date)) ``` ] .panel2-filter2-auto[ ``` [1] 64 ``` ``` # A tibble: 59 x 3 fake_dx_date race donor_sex <date> <dbl> <dbl> 1 2014-05-07 0 0 2 2014-05-07 1 0 3 2014-05-07 1 1 4 2014-05-07 1 1 5 2014-05-07 1 1 6 2014-05-07 0 1 7 2014-05-07 1 0 8 2014-05-07 1 1 9 2014-05-07 1 0 10 2014-05-07 1 0 # … with 49 more rows ``` ] --- count: false Example 2/4: Filter Rows for Non-Missing Dates with !is.na() .panel1-filter2-auto[ ```r # how many rows when you start nrow(bmt_dates) bmt_dates %>% # selected 3 columns # see how date changes w/filter select(fake_dx_date, race, donor_sex) %>% filter(!is.na(fake_dx_date)) # see how many rows now # check dates - should be NOT NA # Format: *# filter(!is.na(variable)) <br> # filter(!is.na(variable)) <br> ``` ] .panel2-filter2-auto[ ``` [1] 64 ``` ``` # A tibble: 59 x 3 fake_dx_date race donor_sex <date> <dbl> <dbl> 1 2014-05-07 0 0 2 2014-05-07 1 0 3 2014-05-07 1 1 4 2014-05-07 1 1 5 2014-05-07 1 1 6 2014-05-07 0 1 7 2014-05-07 1 0 8 2014-05-07 1 1 9 2014-05-07 1 0 10 2014-05-07 1 0 # … with 49 more rows ``` ] <style> .panel1-filter2-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-filter2-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-filter2-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- count: false Example 3/4: Filter Rows for Non-Missing Dates - when you have (completely) Empty Rows <br>the {janitor} package can help .panel1-filter3-auto[ ```r # how many rows when you start *nrow(bmt_dates) ``` ] .panel2-filter3-auto[ ``` [1] 64 ``` ] --- count: false Example 3/4: Filter Rows for Non-Missing Dates - when you have (completely) Empty Rows <br>the {janitor} package can help .panel1-filter3-auto[ ```r # how many rows when you start nrow(bmt_dates) *bmt_dates ``` ] .panel2-filter3-auto[ ``` [1] 64 ``` ``` # A tibble: 64 x 29 id age sex race diagnosis diagnosis_type time_to_transpl… <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 1 61 1 0 acute my… 1 5.16 2 NA NA NA NA <NA> NA NA 3 3 63 0 1 non-Hodg… 0 35.6 4 4 33 0 1 Hodgkin … 0 33.0 5 NA NA NA NA <NA> NA NA 6 6 55 1 1 myelofib… 1 2.43 7 7 67 1 1 acute my… 1 9.59 8 8 51 1 1 acute my… 1 NA 9 9 44 0 0 multiple… 0 43.4 10 10 59 1 1 chronic … 0 92.7 # … with 54 more rows, and 22 more variables: prior_radiation <dbl>, # prior_chemo <dbl>, prior_transplant <dbl>, recipient_cmv <dbl>, # donor_cmv <dbl>, donor_sex <dbl>, tnc_dose <dbl>, cd34_dose <dbl>, # cd3_dose <dbl>, cd8_dose <dbl>, tbi_dose <dbl>, c1_c2 <dbl>, a_ki_rs <dbl>, # cmv <dbl>, time_to_cmv <dbl>, agvhd <dbl>, time_to_agvhd <dbl>, # cgvhd <dbl>, time_to_cgvhd <dbl>, fake_dx_date <date>, # fake_bmt_date <date>, fake_agvhd_date <date> ``` ] --- count: false Example 3/4: Filter Rows for Non-Missing Dates - when you have (completely) Empty Rows <br>the {janitor} package can help .panel1-filter3-auto[ ```r # how many rows when you start nrow(bmt_dates) bmt_dates %>% # selected 3 columns # remove the 2 empty rows with janitor * select(fake_dx_date, race, donor_sex) ``` ] .panel2-filter3-auto[ ``` [1] 64 ``` ``` # A tibble: 64 x 3 fake_dx_date race donor_sex <date> <dbl> <dbl> 1 2014-05-07 0 0 2 NA NA NA 3 2014-05-07 1 0 4 2014-05-07 1 1 5 NA NA NA 6 2014-05-07 1 1 7 2014-05-07 1 1 8 NA 1 0 9 2014-05-07 0 1 10 2014-05-07 1 0 # … with 54 more rows ``` ] --- count: false Example 3/4: Filter Rows for Non-Missing Dates - when you have (completely) Empty Rows <br>the {janitor} package can help .panel1-filter3-auto[ ```r # how many rows when you start nrow(bmt_dates) bmt_dates %>% # selected 3 columns # remove the 2 empty rows with janitor select(fake_dx_date, race, donor_sex) %>% * remove_empty("rows") ``` ] .panel2-filter3-auto[ ``` [1] 64 ``` ``` # A tibble: 62 x 3 fake_dx_date race donor_sex <date> <dbl> <dbl> 1 2014-05-07 0 0 2 2014-05-07 1 0 3 2014-05-07 1 1 4 2014-05-07 1 1 5 2014-05-07 1 1 6 NA 1 0 7 2014-05-07 0 1 8 2014-05-07 1 0 9 2014-05-07 1 1 10 2014-05-07 1 0 # … with 52 more rows ``` ] --- count: false Example 3/4: Filter Rows for Non-Missing Dates - when you have (completely) Empty Rows <br>the {janitor} package can help .panel1-filter3-auto[ ```r # how many rows when you start nrow(bmt_dates) bmt_dates %>% # selected 3 columns # remove the 2 empty rows with janitor select(fake_dx_date, race, donor_sex) %>% remove_empty("rows") # see how many rows now # check dates - should be NOT NA # Format: *# remove_empty("rows") (or "cols") <br> # remove_empty("rows") (or "cols") <br> ``` ] .panel2-filter3-auto[ ``` [1] 64 ``` ``` # A tibble: 62 x 3 fake_dx_date race donor_sex <date> <dbl> <dbl> 1 2014-05-07 0 0 2 2014-05-07 1 0 3 2014-05-07 1 1 4 2014-05-07 1 1 5 2014-05-07 1 1 6 NA 1 0 7 2014-05-07 0 1 8 2014-05-07 1 0 9 2014-05-07 1 1 10 2014-05-07 1 0 # … with 52 more rows ``` ] <style> .panel1-filter3-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-filter3-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-filter3-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- count: false Example 4/4: Filter Rows for Complete Cases - with zero NAs in that row/observation .panel1-filter4-auto[ ```r # how many rows when you start *nrow(bmt_dates) ``` ] .panel2-filter4-auto[ ``` [1] 64 ``` ] --- count: false Example 4/4: Filter Rows for Complete Cases - with zero NAs in that row/observation .panel1-filter4-auto[ ```r # how many rows when you start nrow(bmt_dates) *bmt_dates ``` ] .panel2-filter4-auto[ ``` [1] 64 ``` ``` # A tibble: 64 x 29 id age sex race diagnosis diagnosis_type time_to_transpl… <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 1 61 1 0 acute my… 1 5.16 2 NA NA NA NA <NA> NA NA 3 3 63 0 1 non-Hodg… 0 35.6 4 4 33 0 1 Hodgkin … 0 33.0 5 NA NA NA NA <NA> NA NA 6 6 55 1 1 myelofib… 1 2.43 7 7 67 1 1 acute my… 1 9.59 8 8 51 1 1 acute my… 1 NA 9 9 44 0 0 multiple… 0 43.4 10 10 59 1 1 chronic … 0 92.7 # … with 54 more rows, and 22 more variables: prior_radiation <dbl>, # prior_chemo <dbl>, prior_transplant <dbl>, recipient_cmv <dbl>, # donor_cmv <dbl>, donor_sex <dbl>, tnc_dose <dbl>, cd34_dose <dbl>, # cd3_dose <dbl>, cd8_dose <dbl>, tbi_dose <dbl>, c1_c2 <dbl>, a_ki_rs <dbl>, # cmv <dbl>, time_to_cmv <dbl>, agvhd <dbl>, time_to_agvhd <dbl>, # cgvhd <dbl>, time_to_cgvhd <dbl>, fake_dx_date <date>, # fake_bmt_date <date>, fake_agvhd_date <date> ``` ] --- count: false Example 4/4: Filter Rows for Complete Cases - with zero NAs in that row/observation .panel1-filter4-auto[ ```r # how many rows when you start nrow(bmt_dates) bmt_dates %>% # remove incomplete cases with drop_na * drop_na() ``` ] .panel2-filter4-auto[ ``` [1] 64 ``` ``` # A tibble: 44 x 29 id age sex race diagnosis diagnosis_type time_to_transpl… <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 1 61 1 0 acute my… 1 5.16 2 3 63 0 1 non-Hodg… 0 35.6 3 4 33 0 1 Hodgkin … 0 33.0 4 6 55 1 1 myelofib… 1 2.43 5 7 67 1 1 acute my… 1 9.59 6 10 59 1 1 chronic … 0 92.7 7 11 45 1 1 multiple… 0 39 8 15 35 1 1 myelodys… 1 16.3 9 18 45 0 1 myelodys… 1 8.71 10 19 62 1 0 non-Hodg… 0 76.1 # … with 34 more rows, and 22 more variables: prior_radiation <dbl>, # prior_chemo <dbl>, prior_transplant <dbl>, recipient_cmv <dbl>, # donor_cmv <dbl>, donor_sex <dbl>, tnc_dose <dbl>, cd34_dose <dbl>, # cd3_dose <dbl>, cd8_dose <dbl>, tbi_dose <dbl>, c1_c2 <dbl>, a_ki_rs <dbl>, # cmv <dbl>, time_to_cmv <dbl>, agvhd <dbl>, time_to_agvhd <dbl>, # cgvhd <dbl>, time_to_cgvhd <dbl>, fake_dx_date <date>, # fake_bmt_date <date>, fake_agvhd_date <date> ``` ] --- count: false Example 4/4: Filter Rows for Complete Cases - with zero NAs in that row/observation .panel1-filter4-auto[ ```r # how many rows when you start nrow(bmt_dates) bmt_dates %>% # remove incomplete cases with drop_na drop_na() # see how many rows now - fewer! # now all vars should be NOT NA # Format: *# drop_na() <br> # drop_na() <br> ``` ] .panel2-filter4-auto[ ``` [1] 64 ``` ``` # A tibble: 44 x 29 id age sex race diagnosis diagnosis_type time_to_transpl… <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 1 61 1 0 acute my… 1 5.16 2 3 63 0 1 non-Hodg… 0 35.6 3 4 33 0 1 Hodgkin … 0 33.0 4 6 55 1 1 myelofib… 1 2.43 5 7 67 1 1 acute my… 1 9.59 6 10 59 1 1 chronic … 0 92.7 7 11 45 1 1 multiple… 0 39 8 15 35 1 1 myelodys… 1 16.3 9 18 45 0 1 myelodys… 1 8.71 10 19 62 1 0 non-Hodg… 0 76.1 # … with 34 more rows, and 22 more variables: prior_radiation <dbl>, # prior_chemo <dbl>, prior_transplant <dbl>, recipient_cmv <dbl>, # donor_cmv <dbl>, donor_sex <dbl>, tnc_dose <dbl>, cd34_dose <dbl>, # cd3_dose <dbl>, cd8_dose <dbl>, tbi_dose <dbl>, c1_c2 <dbl>, a_ki_rs <dbl>, # cmv <dbl>, time_to_cmv <dbl>, agvhd <dbl>, time_to_agvhd <dbl>, # cgvhd <dbl>, time_to_cgvhd <dbl>, fake_dx_date <date>, # fake_bmt_date <date>, fake_agvhd_date <date> ``` ] <style> .panel1-filter4-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-filter4-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-filter4-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- class: inverse, center # End of This Flipbook ## On to The Coding Exercises!