How to Use filter() functions to Remove Rows with Missing Values

class: center, middle, inverse, title-slide

# How to Use <em>filter()</em> functions to Remove Rows with Missing Values
## Selecting specific rows with or without NAs
### Peter Higgins
### 2021-01-10

---

### How to Use the _filter()_ function for NA or *not* NA values

####

Format: <br>

`dataset %>% 
  filter(is.na(variable)))`

Negation is done by preceding with ! , <br>
so !is.na(variable) produces a logical vector which is  <br>
TRUE for all _non-missing_ value rows.

**R**emember that filte**R** is for selecting **R**ows because it ends with an **R**.

Let's walk through some **filter** Examples!

---

count: false
 
Example 1/4: Filter Rows for Missing Dates - find which Ones are Missing
.panel1-filter1-auto[

```r
# how many rows when you start
*nrow(covid_dates)
```
]
 
.panel2-filter1-auto[

```
[1] 15524
```
]

---
count: false
 
Example 1/4: Filter Rows for Missing Dates - find which Ones are Missing
.panel1-filter1-auto[

```r
# how many rows when you start
nrow(covid_dates)

*covid_dates
```
]
 
.panel2-filter1-auto[

```
[1] 15524
```

```
# A tibble: 15,524 x 18
   subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name
        <dbl> <chr>           <chr>          <chr>    <dbl> <chr>   <chr>      
 1       1412 jhezane         westerling     female       4 covid   inpatient …
 2        533 penny           targaryen      female       7 covid   clinical l…
 3       9134 grunt           rivers         male         7 covid   clinical l…
 4       8518 melisandre      swyft          female       8 covid   clinical l…
 5       8967 rolley          karstark       male         8 covid   emergency …
 6      11048 megga           karstark       female       8 covid   oncology d…
 7        663 ithoke          targaryen      male         9 covid   clinical l…
 8       2158 ravella         frey           female       9 covid   emergency …
 9       3794 styr            tyrell         male         9 covid   clinical l…
10       4706 wynafryd        seaworth       male         9 covid   clinical l…
# … with 15,514 more rows, and 11 more variables: result <chr>,
#   demo_group <chr>, age <dbl>, drive_thru_ind <dbl>, ct_result <dbl>,
#   orderset <dbl>, payor_group <chr>, patient_class <chr>, col_rec_tat <dbl>,
#   rec_ver_tat <dbl>, fake_date <date>
```
]

---
count: false
 
Example 1/4: Filter Rows for Missing Dates - find which Ones are Missing
.panel1-filter1-auto[

```r
# how many rows when you start
nrow(covid_dates)

covid_dates %>%
# selected 3 columns
# see how date changes w/filter
* select(fake_date, result, gender)
```
]
 
.panel2-filter1-auto[

```
[1] 15524
```

```
# A tibble: 15,524 x 3
   fake_date  result   gender
   <date>     <chr>    <chr> 
 1 2020-03-05 negative female
 2 2020-03-08 negative female
 3 2020-03-08 negative male  
 4 2020-03-09 negative female
 5 2020-03-09 negative male  
 6 2020-03-09 negative female
 7 NA         negative male  
 8 2020-03-10 negative female
 9 2020-03-10 negative male  
10 2020-03-10 negative male  
# … with 15,514 more rows
```
]

---
count: false
 
Example 1/4: Filter Rows for Missing Dates - find which Ones are Missing
.panel1-filter1-auto[

```r
# how many rows when you start
nrow(covid_dates)

covid_dates %>%
# selected 3 columns
# see how date changes w/filter
  select(fake_date, result, gender) %>%
* filter(is.na(fake_date))
```
]
 
.panel2-filter1-auto[

```
[1] 15524
```

```
# A tibble: 7 x 3
  fake_date  result   gender
  <date>     <chr>    <chr> 
1 NA         negative male  
2 NA         negative male  
3 NA         negative female
4 NA         negative female
5 NA         negative male  
6 NA         negative male  
7 NA         negative female
```
]

---
count: false
 
Example 1/4: Filter Rows for Missing Dates - find which Ones are Missing
.panel1-filter1-auto[

```r
# how many rows when you start
nrow(covid_dates)

covid_dates %>%
# selected 3 columns
# see how date changes w/filter
  select(fake_date, result, gender) %>%
  filter(is.na(fake_date))

# see how many rows now
# check dates - should be NA

# Format:
*#   filter(is.na(variable))                 <br>  #   filter(is.na(variable))                 <br>
```
]
 
.panel2-filter1-auto[

```
[1] 15524
```

---

count: false
 
Example 2/4: Filter Rows for Non-Missing Dates with !is.na()
.panel1-filter2-auto[

```r
# how many rows when you start
*nrow(bmt_dates)
```
]
 
.panel2-filter2-auto[

```
[1] 64
```
]

---
count: false
 
Example 2/4: Filter Rows for Non-Missing Dates with !is.na()
.panel1-filter2-auto[

```r
# how many rows when you start
nrow(bmt_dates)

*bmt_dates
```
]
 
.panel2-filter2-auto[

```
[1] 64
```

```
# A tibble: 64 x 29
      id   age   sex  race diagnosis diagnosis_type time_to_transpl…
   <dbl> <dbl> <dbl> <dbl> <chr>              <dbl>            <dbl>
 1     1    61     1     0 acute my…              1             5.16
 2    NA    NA    NA    NA <NA>                  NA            NA   
 3     3    63     0     1 non-Hodg…              0            35.6 
 4     4    33     0     1 Hodgkin …              0            33.0 
 5    NA    NA    NA    NA <NA>                  NA            NA   
 6     6    55     1     1 myelofib…              1             2.43
 7     7    67     1     1 acute my…              1             9.59
 8     8    51     1     1 acute my…              1            NA   
 9     9    44     0     0 multiple…              0            43.4 
10    10    59     1     1 chronic …              0            92.7 
# … with 54 more rows, and 22 more variables: prior_radiation <dbl>,
#   prior_chemo <dbl>, prior_transplant <dbl>, recipient_cmv <dbl>,
#   donor_cmv <dbl>, donor_sex <dbl>, tnc_dose <dbl>, cd34_dose <dbl>,
#   cd3_dose <dbl>, cd8_dose <dbl>, tbi_dose <dbl>, c1_c2 <dbl>, a_ki_rs <dbl>,
#   cmv <dbl>, time_to_cmv <dbl>, agvhd <dbl>, time_to_agvhd <dbl>,
#   cgvhd <dbl>, time_to_cgvhd <dbl>, fake_dx_date <date>,
#   fake_bmt_date <date>, fake_agvhd_date <date>
```
]

---
count: false
 
Example 2/4: Filter Rows for Non-Missing Dates with !is.na()
.panel1-filter2-auto[

```r
# how many rows when you start
nrow(bmt_dates)

bmt_dates %>%
# selected 3 columns
# see how date changes w/filter
* select(fake_dx_date, race, donor_sex)
```
]
 
.panel2-filter2-auto[

```
[1] 64
```

```
# A tibble: 64 x 3
   fake_dx_date  race donor_sex
   <date>       <dbl>     <dbl>
 1 2014-05-07       0         0
 2 NA              NA        NA
 3 2014-05-07       1         0
 4 2014-05-07       1         1
 5 NA              NA        NA
 6 2014-05-07       1         1
 7 2014-05-07       1         1
 8 NA               1         0
 9 2014-05-07       0         1
10 2014-05-07       1         0
# … with 54 more rows
```
]

---
count: false
 
Example 2/4: Filter Rows for Non-Missing Dates with !is.na()
.panel1-filter2-auto[

```r
# how many rows when you start
nrow(bmt_dates)

bmt_dates %>%
# selected 3 columns
# see how date changes w/filter
  select(fake_dx_date, race, donor_sex) %>%
* filter(!is.na(fake_dx_date))
```
]
 
.panel2-filter2-auto[

```
[1] 64
```

```
# A tibble: 59 x 3
   fake_dx_date  race donor_sex
   <date>       <dbl>     <dbl>
 1 2014-05-07       0         0
 2 2014-05-07       1         0
 3 2014-05-07       1         1
 4 2014-05-07       1         1
 5 2014-05-07       1         1
 6 2014-05-07       0         1
 7 2014-05-07       1         0
 8 2014-05-07       1         1
 9 2014-05-07       1         0
10 2014-05-07       1         0
# … with 49 more rows
```
]

---
count: false
 
Example 2/4: Filter Rows for Non-Missing Dates with !is.na()
.panel1-filter2-auto[

```r
# how many rows when you start
nrow(bmt_dates)

bmt_dates %>%
# selected 3 columns
# see how date changes w/filter
  select(fake_dx_date, race, donor_sex) %>%
  filter(!is.na(fake_dx_date))

# see how many rows now
# check dates - should be NOT NA

# Format:
*#   filter(!is.na(variable))                 <br>  #   filter(!is.na(variable))                 <br>
```
]
 
.panel2-filter2-auto[

```
[1] 64
```

---

count: false
 
Example 3/4: Filter Rows for Non-Missing Dates - when you have (completely) Empty Rows <br>the {janitor} package can help
.panel1-filter3-auto[

```r
# how many rows when you start
*nrow(bmt_dates)
```
]
 
.panel2-filter3-auto[

```
[1] 64
```
]

---
count: false
 
Example 3/4: Filter Rows for Non-Missing Dates - when you have (completely) Empty Rows <br>the {janitor} package can help
.panel1-filter3-auto[

```r
# how many rows when you start
nrow(bmt_dates)

*bmt_dates
```
]
 
.panel2-filter3-auto[

```
[1] 64
```

---
count: false
 
Example 3/4: Filter Rows for Non-Missing Dates - when you have (completely) Empty Rows <br>the {janitor} package can help
.panel1-filter3-auto[

```r
# how many rows when you start
nrow(bmt_dates)

bmt_dates %>%
# selected 3 columns
# remove the 2 empty rows with janitor
* select(fake_dx_date, race, donor_sex)
```
]
 
.panel2-filter3-auto[

```
[1] 64
```

---
count: false
 
Example 3/4: Filter Rows for Non-Missing Dates - when you have (completely) Empty Rows <br>the {janitor} package can help
.panel1-filter3-auto[

```r
# how many rows when you start
nrow(bmt_dates)

bmt_dates %>%
# selected 3 columns
# remove the 2 empty rows with janitor
  select(fake_dx_date, race, donor_sex) %>%
* remove_empty("rows")
```
]
 
.panel2-filter3-auto[

```
[1] 64
```

```
# A tibble: 62 x 3
   fake_dx_date  race donor_sex
   <date>       <dbl>     <dbl>
 1 2014-05-07       0         0
 2 2014-05-07       1         0
 3 2014-05-07       1         1
 4 2014-05-07       1         1
 5 2014-05-07       1         1
 6 NA               1         0
 7 2014-05-07       0         1
 8 2014-05-07       1         0
 9 2014-05-07       1         1
10 2014-05-07       1         0
# … with 52 more rows
```
]

---
count: false
 
Example 3/4: Filter Rows for Non-Missing Dates - when you have (completely) Empty Rows <br>the {janitor} package can help
.panel1-filter3-auto[

```r
# how many rows when you start
nrow(bmt_dates)

bmt_dates %>%
# selected 3 columns
# remove the 2 empty rows with janitor
  select(fake_dx_date, race, donor_sex) %>%
  remove_empty("rows")

# see how many rows now
# check dates - should be NOT NA

# Format:
*#   remove_empty("rows") (or "cols")                 <br>  #   remove_empty("rows") (or "cols")                 <br>
```
]
 
.panel2-filter3-auto[

```
[1] 64
```

---

count: false
 
Example 4/4: Filter Rows for Complete Cases - with zero NAs in that row/observation
.panel1-filter4-auto[

```r
# how many rows when you start
*nrow(bmt_dates)
```
]
 
.panel2-filter4-auto[

```
[1] 64
```
]

---
count: false
 
Example 4/4: Filter Rows for Complete Cases - with zero NAs in that row/observation
.panel1-filter4-auto[

```r
# how many rows when you start
nrow(bmt_dates)

*bmt_dates
```
]
 
.panel2-filter4-auto[

```
[1] 64
```

---
count: false
 
Example 4/4: Filter Rows for Complete Cases - with zero NAs in that row/observation
.panel1-filter4-auto[

```r
# how many rows when you start
nrow(bmt_dates)

bmt_dates %>%
#   remove incomplete cases with drop_na
* drop_na()
```
]
 
.panel2-filter4-auto[

```
[1] 64
```

```
# A tibble: 44 x 29
      id   age   sex  race diagnosis diagnosis_type time_to_transpl…
   <dbl> <dbl> <dbl> <dbl> <chr>              <dbl>            <dbl>
 1     1    61     1     0 acute my…              1             5.16
 2     3    63     0     1 non-Hodg…              0            35.6 
 3     4    33     0     1 Hodgkin …              0            33.0 
 4     6    55     1     1 myelofib…              1             2.43
 5     7    67     1     1 acute my…              1             9.59
 6    10    59     1     1 chronic …              0            92.7 
 7    11    45     1     1 multiple…              0            39   
 8    15    35     1     1 myelodys…              1            16.3 
 9    18    45     0     1 myelodys…              1             8.71
10    19    62     1     0 non-Hodg…              0            76.1 
# … with 34 more rows, and 22 more variables: prior_radiation <dbl>,
#   prior_chemo <dbl>, prior_transplant <dbl>, recipient_cmv <dbl>,
#   donor_cmv <dbl>, donor_sex <dbl>, tnc_dose <dbl>, cd34_dose <dbl>,
#   cd3_dose <dbl>, cd8_dose <dbl>, tbi_dose <dbl>, c1_c2 <dbl>, a_ki_rs <dbl>,
#   cmv <dbl>, time_to_cmv <dbl>, agvhd <dbl>, time_to_agvhd <dbl>,
#   cgvhd <dbl>, time_to_cgvhd <dbl>, fake_dx_date <date>,
#   fake_bmt_date <date>, fake_agvhd_date <date>
```
]

---
count: false
 
Example 4/4: Filter Rows for Complete Cases - with zero NAs in that row/observation
.panel1-filter4-auto[

```r
# how many rows when you start
nrow(bmt_dates)

bmt_dates %>%
#   remove incomplete cases with drop_na
  drop_na()

# see how many rows now - fewer!
# now all vars should be NOT NA

# Format:
*#   drop_na()                                        <br>  #   drop_na()                                        <br>
```
]
 
.panel2-filter4-auto[

```
[1] 64
```

---
class: inverse, center

# End of This Flipbook

## On to The Coding Exercises!