How to Use distinct() and get_dupes() to find distinct rows

# How to Use <em>distinct()</em> and <em>get_dupes()</em> to find distinct rows
## Selecting Distinct vs. Duplicated Rows
### Peter Higgins
### 2021-01-10

---

### How to Use the distinct() and get_dupes() functions to find duplicate or distinct rows

You can  use _distinct(variable)_ to find all the distinct values of that variable, and therefore how many are duplicates for that variable.

Imagine that you have some duplicate observations in your dataset, due to a problem with data entry, or more likely with a merging of two datasets.

It is important to have a single row for each observation. But it can be hard to find the duplicates in a large dataset.

So you (wisely) decide to check this with the _distinct()_ function. If the resulting number of rows (nrow) is less than the original nrow, you have some duplicates.

The {janitor} function _get_dupes()_ can help find these.

Let's find some duplicates!

---

count: false
 
Example 1/5: Are Some Subjects Being Tested Multiple Times?
.panel1-filter1-auto[

```r
# how many rows when you start

*nrow(covid_dates)
```
]
 
.panel2-filter1-auto[

```
[1] 15524
```
]

---
count: false
 
Example 1/5: Are Some Subjects Being Tested Multiple Times?
.panel1-filter1-auto[

```r
# how many rows when you start

nrow(covid_dates)

# check if subject_id is a unique identifier
*covid_dates
```
]
 
.panel2-filter1-auto[

```
[1] 15524
```

```
# A tibble: 15,524 x 18
   subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name
        <dbl> <chr>           <chr>          <chr>    <dbl> <chr>   <chr>      
 1       1412 jhezane         westerling     female       4 covid   inpatient …
 2        533 penny           targaryen      female       7 covid   clinical l…
 3       9134 grunt           rivers         male         7 covid   clinical l…
 4       8518 melisandre      swyft          female       8 covid   clinical l…
 5       8967 rolley          karstark       male         8 covid   emergency …
 6      11048 megga           karstark       female       8 covid   oncology d…
 7        663 ithoke          targaryen      male         9 covid   clinical l…
 8       2158 ravella         frey           female       9 covid   emergency …
 9       3794 styr            tyrell         male         9 covid   clinical l…
10       4706 wynafryd        seaworth       male         9 covid   clinical l…
# … with 15,514 more rows, and 11 more variables: result <chr>,
#   demo_group <chr>, age <dbl>, drive_thru_ind <dbl>, ct_result <dbl>,
#   orderset <dbl>, payor_group <chr>, patient_class <chr>, col_rec_tat <dbl>,
#   rec_ver_tat <dbl>, fake_date <date>
```
]

---
count: false
 
Example 1/5: Are Some Subjects Being Tested Multiple Times?
.panel1-filter1-auto[

```r
# how many rows when you start

nrow(covid_dates)

# check if subject_id is a unique identifier
covid_dates %>%
* distinct(subject_id)
```
]
 
.panel2-filter1-auto[

```
[1] 15524
```

```
# A tibble: 12,344 x 1
   subject_id
        <dbl>
 1       1412
 2        533
 3       9134
 4       8518
 5       8967
 6      11048
 7        663
 8       2158
 9       3794
10       4706
# … with 12,334 more rows
```
]

---
count: false
 
Example 1/5: Are Some Subjects Being Tested Multiple Times?
.panel1-filter1-auto[

```r
# how many rows when you start

nrow(covid_dates)

# check if subject_id is a unique identifier
covid_dates %>%
  distinct(subject_id)

# see how many rows now
# down to 12,344 - subject_id is not a unique key

# Format:
*#   distinct(variable)                    <br>  #   distinct(variable)                    <br>
```
]
 
.panel2-filter1-auto[

```
[1] 15524
```

---

```r
# how many rows when you start

*nrow(covid_dates)
```
]
 
.panel2-filter2-auto[

```
[1] 15524
```
]

---
count: false
 
Example 2/5: See if Distinct by Full Name
.panel1-filter2-auto[

```r
# how many rows when you start

nrow(covid_dates)

*covid_dates
```
]
 
.panel2-filter2-auto[

```
[1] 15524
```

---
count: false
 
Example 2/5: See if Distinct by Full Name
.panel1-filter2-auto[

```r
# how many rows when you start

nrow(covid_dates)

covid_dates %>%
# check if names are distinct
* select(fake_first_name, fake_last_name)
```
]
 
.panel2-filter2-auto[

```
[1] 15524
```

```
# A tibble: 15,524 x 2
   fake_first_name fake_last_name
   <chr>           <chr>         
 1 jhezane         westerling    
 2 penny           targaryen     
 3 grunt           rivers        
 4 melisandre      swyft         
 5 rolley          karstark      
 6 megga           karstark      
 7 ithoke          targaryen     
 8 ravella         frey          
 9 styr            tyrell        
10 wynafryd        seaworth      
# … with 15,514 more rows
```
]

---
count: false
 
Example 2/5: See if Distinct by Full Name
.panel1-filter2-auto[

```r
# how many rows when you start

nrow(covid_dates)

covid_dates %>%
# check if names are distinct
  select(fake_first_name, fake_last_name) %>%
# check if first & last name is a unique identifier = key
* distinct(fake_first_name, fake_last_name)
```
]
 
.panel2-filter2-auto[

```
[1] 15524
```

```
# A tibble: 8,286 x 2
   fake_first_name fake_last_name
   <chr>           <chr>         
 1 jhezane         westerling    
 2 penny           targaryen     
 3 grunt           rivers        
 4 melisandre      swyft         
 5 rolley          karstark      
 6 megga           karstark      
 7 ithoke          targaryen     
 8 ravella         frey          
 9 styr            tyrell        
10 wynafryd        seaworth      
# … with 8,276 more rows
```
]

---
count: false
 
Example 2/5: See if Distinct by Full Name
.panel1-filter2-auto[

```r
# how many rows when you start

nrow(covid_dates)

covid_dates %>%
# check if names are distinct
  select(fake_first_name, fake_last_name) %>%
# check if first & last name is a unique identifier = key
  distinct(fake_first_name, fake_last_name)

# see how many rows now - 8,286
# lots of full names are repeated

# Format:
*#   distinct(variable1, variable 2)                    <br>  #   distinct(variable1, variable 2)                    <br>
```
]
 
.panel2-filter2-auto[

```
[1] 15524
```

---

```r
# how many rows when you start
*nrow(covid_dates)
```
]
 
.panel2-filter3-auto[

```
[1] 15524
```
]

---
count: false
 
Example 3/5: Which Rows are Duplicates
.panel1-filter3-auto[

```r
# how many rows when you start
nrow(covid_dates)

*covid_dates
```
]
 
.panel2-filter3-auto[

```
[1] 15524
```

---
count: false
 
Example 3/5: Which Rows are Duplicates
.panel1-filter3-auto[

```r
# how many rows when you start
nrow(covid_dates)

covid_dates %>%
# check if names are distinct
* select(subject_id, fake_first_name,
*        fake_last_name)
```
]
 
.panel2-filter3-auto[

```
[1] 15524
```

```
# A tibble: 15,524 x 3
   subject_id fake_first_name fake_last_name
        <dbl> <chr>           <chr>         
 1       1412 jhezane         westerling    
 2        533 penny           targaryen     
 3       9134 grunt           rivers        
 4       8518 melisandre      swyft         
 5       8967 rolley          karstark      
 6      11048 megga           karstark      
 7        663 ithoke          targaryen     
 8       2158 ravella         frey          
 9       3794 styr            tyrell        
10       4706 wynafryd        seaworth      
# … with 15,514 more rows
```
]

---
count: false
 
Example 3/5: Which Rows are Duplicates
.panel1-filter3-auto[

```r
# how many rows when you start
nrow(covid_dates)

covid_dates %>%
# check if names are distinct
  select(subject_id, fake_first_name,
         fake_last_name) %>%
# check if first & last name is unique id
* get_dupes(fake_first_name,
*           fake_last_name)
```
]
 
.panel2-filter3-auto[

```
[1] 15524
```

```
# A tibble: 10,712 x 4
   fake_first_name fake_last_name dupe_count subject_id
   <chr>           <chr>               <int>      <dbl>
 1 addam           clegane                 3      11694
 2 addam           clegane                 3      10897
 3 addam           clegane                 3      10897
 4 addam           manderly                3       1114
 5 addam           manderly                3       1114
 6 addam           manderly                3       3158
 7 addam           seaworth                2       9210
 8 addam           seaworth                2       2628
 9 addam           tully                   2      10606
10 addam           tully                   2      12024
# … with 10,702 more rows
```
]

---
count: false
 
Example 3/5: Which Rows are Duplicates
.panel1-filter3-auto[

```r
# how many rows when you start
nrow(covid_dates)

covid_dates %>%
# check if names are distinct
  select(subject_id, fake_first_name,
         fake_last_name) %>%
# check if first & last name is unique id
  get_dupes(fake_first_name,
            fake_last_name)

# now you can see the duplicates

# Format:
*#   get_dupes(variable1, variable 2)                    <br>  #   get_dupes(variable1, variable 2)                    <br>
```
]
 
.panel2-filter3-auto[

```
[1] 15524
```

---

```r
# how many rows in bmt

*nrow(bmt)
```
]
 
.panel2-filter4-auto[

```
[1] 83
```
]

---
count: false
 
Example 4/5: Are All Rows in BMT distinct?
.panel1-filter4-auto[

```r
# how many rows in bmt

nrow(bmt)

*bmt
```
]
 
.panel2-filter4-auto[

```
[1] 83
```

```
# A tibble: 83 x 26
      id   age   sex  race diagnosis diagnosis_type time_to_transpl…
   <dbl> <dbl> <dbl> <dbl> <chr>              <dbl>            <dbl>
 1     1    61     1     0 acute my…              1             5.16
 2     2    62     1     1 non-Hodg…              0            79.0 
 3     3    63     0     1 non-Hodg…              0            35.6 
 4     4    33     0     1 Hodgkin …              0            33.0 
 5     5    54     0     1 acute ly…              0            11.4 
 6     6    55     1     1 myelofib…              1             2.43
 7     7    67     1     1 acute my…              1             9.59
 8     8    51     1     1 acute my…              1            NA   
 9     9    44     0     0 multiple…              0            43.4 
10    10    59     1     1 chronic …              0            92.7 
# … with 73 more rows, and 19 more variables: prior_radiation <dbl>,
#   prior_chemo <dbl>, prior_transplant <dbl>, recipient_cmv <dbl>,
#   donor_cmv <dbl>, donor_sex <dbl>, tnc_dose <dbl>, cd34_dose <dbl>,
#   cd3_dose <dbl>, cd8_dose <dbl>, tbi_dose <dbl>, c1_c2 <dbl>, a_ki_rs <dbl>,
#   cmv <dbl>, time_to_cmv <dbl>, agvhd <dbl>, time_to_agvhd <dbl>,
#   cgvhd <dbl>, time_to_cgvhd <dbl>
```
]

---
count: false
 
Example 4/5: Are All Rows in BMT distinct?
.panel1-filter4-auto[

```r
# how many rows in bmt

nrow(bmt)

bmt %>%
* distinct()
```
]
 
.panel2-filter4-auto[

```
[1] 83
```

```
# A tibble: 64 x 26
      id   age   sex  race diagnosis diagnosis_type time_to_transpl…
   <dbl> <dbl> <dbl> <dbl> <chr>              <dbl>            <dbl>
 1     1    61     1     0 acute my…              1             5.16
 2     2    62     1     1 non-Hodg…              0            79.0 
 3     3    63     0     1 non-Hodg…              0            35.6 
 4     4    33     0     1 Hodgkin …              0            33.0 
 5     5    54     0     1 acute ly…              0            11.4 
 6     6    55     1     1 myelofib…              1             2.43
 7     7    67     1     1 acute my…              1             9.59
 8     8    51     1     1 acute my…              1            NA   
 9     9    44     0     0 multiple…              0            43.4 
10    10    59     1     1 chronic …              0            92.7 
# … with 54 more rows, and 19 more variables: prior_radiation <dbl>,
#   prior_chemo <dbl>, prior_transplant <dbl>, recipient_cmv <dbl>,
#   donor_cmv <dbl>, donor_sex <dbl>, tnc_dose <dbl>, cd34_dose <dbl>,
#   cd3_dose <dbl>, cd8_dose <dbl>, tbi_dose <dbl>, c1_c2 <dbl>, a_ki_rs <dbl>,
#   cmv <dbl>, time_to_cmv <dbl>, agvhd <dbl>, time_to_agvhd <dbl>,
#   cgvhd <dbl>, time_to_cgvhd <dbl>
```
]

---
count: false
 
Example 4/5: Are All Rows in BMT distinct?
.panel1-filter4-auto[

```r
# how many rows in bmt

nrow(bmt)

bmt %>%
  distinct()

# 64 rows instead of 83 - some duplicates

# Format:
*#   distinct(dataset)                    <br>  #   distinct(dataset)                    <br>
```
]
 
.panel2-filter4-auto[

```
[1] 83
```

---

```r
# how many rows in bmt
*nrow(bmt)
```
]
 
.panel2-filter5-auto[

```
[1] 83
```
]

---
count: false
 
Example 5/5: Which Rows are Duplicates
.panel1-filter5-auto[

```r
# how many rows in bmt
nrow(bmt)

*bmt
```
]
 
.panel2-filter5-auto[

```
[1] 83
```

---
count: false
 
Example 5/5: Which Rows are Duplicates
.panel1-filter5-auto[

```r
# how many rows in bmt
nrow(bmt)

bmt %>%
* get_dupes()
```
]
 
.panel2-filter5-auto[

```
[1] 83
```

```
# A tibble: 38 x 27
      id   age   sex  race diagnosis diagnosis_type time_to_transpl…
   <dbl> <dbl> <dbl> <dbl> <chr>              <dbl>            <dbl>
 1    40    46     0     1 renal ce…             NA             8.48
 2    40    46     0     1 renal ce…             NA             8.48
 3    41    36     0     0 multiple…              0            11.7 
 4    41    36     0     0 multiple…              0            11.7 
 5    42    62     0     1 myelodys…              1            13.0 
 6    42    62     0     1 myelodys…              1            13.0 
 7    43    60     0     1 multiple…              0            14   
 8    43    60     0     1 multiple…              0            14   
 9    44    36     1     1 aplastic…             NA             1.84
10    44    36     1     1 aplastic…             NA             1.84
# … with 28 more rows, and 20 more variables: prior_radiation <dbl>,
#   prior_chemo <dbl>, prior_transplant <dbl>, recipient_cmv <dbl>,
#   donor_cmv <dbl>, donor_sex <dbl>, tnc_dose <dbl>, cd34_dose <dbl>,
#   cd3_dose <dbl>, cd8_dose <dbl>, tbi_dose <dbl>, c1_c2 <dbl>, a_ki_rs <dbl>,
#   cmv <dbl>, time_to_cmv <dbl>, agvhd <dbl>, time_to_agvhd <dbl>,
#   cgvhd <dbl>, time_to_cgvhd <dbl>, dupe_count <int>
```
]

---
count: false
 
Example 5/5: Which Rows are Duplicates
.panel1-filter5-auto[

```r
# how many rows in bmt
nrow(bmt)

bmt %>%
  get_dupes()

# 438 duplicate rows

# Format:
*#   get_dupes(dataset)                    <br>  #   get_dupes(dataset)                    <br>
```
]
 
.panel2-filter5-auto[

```
[1] 83
```

---
class: inverse, center

# End of This Flipbook

## On to The Coding Exercises!