class: center, middle, inverse, title-slide # How to Use
distinct()
and
get_dupes()
to find distinct rows ## Selecting Distinct vs. Duplicated Rows ### Peter Higgins ### 2021-01-10 --- ### How to Use the distinct() and get_dupes() functions to find duplicate or distinct rows You can use _distinct(variable)_ to find all the distinct values of that variable, and therefore how many are duplicates for that variable. Imagine that you have some duplicate observations in your dataset, due to a problem with data entry, or more likely with a merging of two datasets. It is important to have a single row for each observation. But it can be hard to find the duplicates in a large dataset. So you (wisely) decide to check this with the _distinct()_ function. If the resulting number of rows (nrow) is less than the original nrow, you have some duplicates. The {janitor} function _get_dupes()_ can help find these. Let's find some duplicates! --- count: false Example 1/5: Are Some Subjects Being Tested Multiple Times? .panel1-filter1-auto[ ```r # how many rows when you start *nrow(covid_dates) ``` ] .panel2-filter1-auto[ ``` [1] 15524 ``` ] --- count: false Example 1/5: Are Some Subjects Being Tested Multiple Times? .panel1-filter1-auto[ ```r # how many rows when you start nrow(covid_dates) # check if subject_id is a unique identifier *covid_dates ``` ] .panel2-filter1-auto[ ``` [1] 15524 ``` ``` # A tibble: 15,524 x 18 subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> 1 1412 jhezane westerling female 4 covid inpatient … 2 533 penny targaryen female 7 covid clinical l… 3 9134 grunt rivers male 7 covid clinical l… 4 8518 melisandre swyft female 8 covid clinical l… 5 8967 rolley karstark male 8 covid emergency … 6 11048 megga karstark female 8 covid oncology d… 7 663 ithoke targaryen male 9 covid clinical l… 8 2158 ravella frey female 9 covid emergency … 9 3794 styr tyrell male 9 covid clinical l… 10 4706 wynafryd seaworth male 9 covid clinical l… # … with 15,514 more rows, and 11 more variables: result <chr>, # demo_group <chr>, age <dbl>, drive_thru_ind <dbl>, ct_result <dbl>, # orderset <dbl>, payor_group <chr>, patient_class <chr>, col_rec_tat <dbl>, # rec_ver_tat <dbl>, fake_date <date> ``` ] --- count: false Example 1/5: Are Some Subjects Being Tested Multiple Times? .panel1-filter1-auto[ ```r # how many rows when you start nrow(covid_dates) # check if subject_id is a unique identifier covid_dates %>% * distinct(subject_id) ``` ] .panel2-filter1-auto[ ``` [1] 15524 ``` ``` # A tibble: 12,344 x 1 subject_id <dbl> 1 1412 2 533 3 9134 4 8518 5 8967 6 11048 7 663 8 2158 9 3794 10 4706 # … with 12,334 more rows ``` ] --- count: false Example 1/5: Are Some Subjects Being Tested Multiple Times? .panel1-filter1-auto[ ```r # how many rows when you start nrow(covid_dates) # check if subject_id is a unique identifier covid_dates %>% distinct(subject_id) # see how many rows now # down to 12,344 - subject_id is not a unique key # Format: *# distinct(variable) <br> # distinct(variable) <br> ``` ] .panel2-filter1-auto[ ``` [1] 15524 ``` ``` # A tibble: 12,344 x 1 subject_id <dbl> 1 1412 2 533 3 9134 4 8518 5 8967 6 11048 7 663 8 2158 9 3794 10 4706 # … with 12,334 more rows ``` ] <style> .panel1-filter1-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-filter1-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-filter1-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- count: false Example 2/5: See if Distinct by Full Name .panel1-filter2-auto[ ```r # how many rows when you start *nrow(covid_dates) ``` ] .panel2-filter2-auto[ ``` [1] 15524 ``` ] --- count: false Example 2/5: See if Distinct by Full Name .panel1-filter2-auto[ ```r # how many rows when you start nrow(covid_dates) *covid_dates ``` ] .panel2-filter2-auto[ ``` [1] 15524 ``` ``` # A tibble: 15,524 x 18 subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> 1 1412 jhezane westerling female 4 covid inpatient … 2 533 penny targaryen female 7 covid clinical l… 3 9134 grunt rivers male 7 covid clinical l… 4 8518 melisandre swyft female 8 covid clinical l… 5 8967 rolley karstark male 8 covid emergency … 6 11048 megga karstark female 8 covid oncology d… 7 663 ithoke targaryen male 9 covid clinical l… 8 2158 ravella frey female 9 covid emergency … 9 3794 styr tyrell male 9 covid clinical l… 10 4706 wynafryd seaworth male 9 covid clinical l… # … with 15,514 more rows, and 11 more variables: result <chr>, # demo_group <chr>, age <dbl>, drive_thru_ind <dbl>, ct_result <dbl>, # orderset <dbl>, payor_group <chr>, patient_class <chr>, col_rec_tat <dbl>, # rec_ver_tat <dbl>, fake_date <date> ``` ] --- count: false Example 2/5: See if Distinct by Full Name .panel1-filter2-auto[ ```r # how many rows when you start nrow(covid_dates) covid_dates %>% # check if names are distinct * select(fake_first_name, fake_last_name) ``` ] .panel2-filter2-auto[ ``` [1] 15524 ``` ``` # A tibble: 15,524 x 2 fake_first_name fake_last_name <chr> <chr> 1 jhezane westerling 2 penny targaryen 3 grunt rivers 4 melisandre swyft 5 rolley karstark 6 megga karstark 7 ithoke targaryen 8 ravella frey 9 styr tyrell 10 wynafryd seaworth # … with 15,514 more rows ``` ] --- count: false Example 2/5: See if Distinct by Full Name .panel1-filter2-auto[ ```r # how many rows when you start nrow(covid_dates) covid_dates %>% # check if names are distinct select(fake_first_name, fake_last_name) %>% # check if first & last name is a unique identifier = key * distinct(fake_first_name, fake_last_name) ``` ] .panel2-filter2-auto[ ``` [1] 15524 ``` ``` # A tibble: 8,286 x 2 fake_first_name fake_last_name <chr> <chr> 1 jhezane westerling 2 penny targaryen 3 grunt rivers 4 melisandre swyft 5 rolley karstark 6 megga karstark 7 ithoke targaryen 8 ravella frey 9 styr tyrell 10 wynafryd seaworth # … with 8,276 more rows ``` ] --- count: false Example 2/5: See if Distinct by Full Name .panel1-filter2-auto[ ```r # how many rows when you start nrow(covid_dates) covid_dates %>% # check if names are distinct select(fake_first_name, fake_last_name) %>% # check if first & last name is a unique identifier = key distinct(fake_first_name, fake_last_name) # see how many rows now - 8,286 # lots of full names are repeated # Format: *# distinct(variable1, variable 2) <br> # distinct(variable1, variable 2) <br> ``` ] .panel2-filter2-auto[ ``` [1] 15524 ``` ``` # A tibble: 8,286 x 2 fake_first_name fake_last_name <chr> <chr> 1 jhezane westerling 2 penny targaryen 3 grunt rivers 4 melisandre swyft 5 rolley karstark 6 megga karstark 7 ithoke targaryen 8 ravella frey 9 styr tyrell 10 wynafryd seaworth # … with 8,276 more rows ``` ] <style> .panel1-filter2-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-filter2-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-filter2-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- count: false Example 3/5: Which Rows are Duplicates .panel1-filter3-auto[ ```r # how many rows when you start *nrow(covid_dates) ``` ] .panel2-filter3-auto[ ``` [1] 15524 ``` ] --- count: false Example 3/5: Which Rows are Duplicates .panel1-filter3-auto[ ```r # how many rows when you start nrow(covid_dates) *covid_dates ``` ] .panel2-filter3-auto[ ``` [1] 15524 ``` ``` # A tibble: 15,524 x 18 subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> 1 1412 jhezane westerling female 4 covid inpatient … 2 533 penny targaryen female 7 covid clinical l… 3 9134 grunt rivers male 7 covid clinical l… 4 8518 melisandre swyft female 8 covid clinical l… 5 8967 rolley karstark male 8 covid emergency … 6 11048 megga karstark female 8 covid oncology d… 7 663 ithoke targaryen male 9 covid clinical l… 8 2158 ravella frey female 9 covid emergency … 9 3794 styr tyrell male 9 covid clinical l… 10 4706 wynafryd seaworth male 9 covid clinical l… # … with 15,514 more rows, and 11 more variables: result <chr>, # demo_group <chr>, age <dbl>, drive_thru_ind <dbl>, ct_result <dbl>, # orderset <dbl>, payor_group <chr>, patient_class <chr>, col_rec_tat <dbl>, # rec_ver_tat <dbl>, fake_date <date> ``` ] --- count: false Example 3/5: Which Rows are Duplicates .panel1-filter3-auto[ ```r # how many rows when you start nrow(covid_dates) covid_dates %>% # check if names are distinct * select(subject_id, fake_first_name, * fake_last_name) ``` ] .panel2-filter3-auto[ ``` [1] 15524 ``` ``` # A tibble: 15,524 x 3 subject_id fake_first_name fake_last_name <dbl> <chr> <chr> 1 1412 jhezane westerling 2 533 penny targaryen 3 9134 grunt rivers 4 8518 melisandre swyft 5 8967 rolley karstark 6 11048 megga karstark 7 663 ithoke targaryen 8 2158 ravella frey 9 3794 styr tyrell 10 4706 wynafryd seaworth # … with 15,514 more rows ``` ] --- count: false Example 3/5: Which Rows are Duplicates .panel1-filter3-auto[ ```r # how many rows when you start nrow(covid_dates) covid_dates %>% # check if names are distinct select(subject_id, fake_first_name, fake_last_name) %>% # check if first & last name is unique id * get_dupes(fake_first_name, * fake_last_name) ``` ] .panel2-filter3-auto[ ``` [1] 15524 ``` ``` # A tibble: 10,712 x 4 fake_first_name fake_last_name dupe_count subject_id <chr> <chr> <int> <dbl> 1 addam clegane 3 11694 2 addam clegane 3 10897 3 addam clegane 3 10897 4 addam manderly 3 1114 5 addam manderly 3 1114 6 addam manderly 3 3158 7 addam seaworth 2 9210 8 addam seaworth 2 2628 9 addam tully 2 10606 10 addam tully 2 12024 # … with 10,702 more rows ``` ] --- count: false Example 3/5: Which Rows are Duplicates .panel1-filter3-auto[ ```r # how many rows when you start nrow(covid_dates) covid_dates %>% # check if names are distinct select(subject_id, fake_first_name, fake_last_name) %>% # check if first & last name is unique id get_dupes(fake_first_name, fake_last_name) # now you can see the duplicates # Format: *# get_dupes(variable1, variable 2) <br> # get_dupes(variable1, variable 2) <br> ``` ] .panel2-filter3-auto[ ``` [1] 15524 ``` ``` # A tibble: 10,712 x 4 fake_first_name fake_last_name dupe_count subject_id <chr> <chr> <int> <dbl> 1 addam clegane 3 11694 2 addam clegane 3 10897 3 addam clegane 3 10897 4 addam manderly 3 1114 5 addam manderly 3 1114 6 addam manderly 3 3158 7 addam seaworth 2 9210 8 addam seaworth 2 2628 9 addam tully 2 10606 10 addam tully 2 12024 # … with 10,702 more rows ``` ] <style> .panel1-filter3-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-filter3-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-filter3-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- count: false Example 4/5: Are All Rows in BMT distinct? .panel1-filter4-auto[ ```r # how many rows in bmt *nrow(bmt) ``` ] .panel2-filter4-auto[ ``` [1] 83 ``` ] --- count: false Example 4/5: Are All Rows in BMT distinct? .panel1-filter4-auto[ ```r # how many rows in bmt nrow(bmt) *bmt ``` ] .panel2-filter4-auto[ ``` [1] 83 ``` ``` # A tibble: 83 x 26 id age sex race diagnosis diagnosis_type time_to_transpl… <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 1 61 1 0 acute my… 1 5.16 2 2 62 1 1 non-Hodg… 0 79.0 3 3 63 0 1 non-Hodg… 0 35.6 4 4 33 0 1 Hodgkin … 0 33.0 5 5 54 0 1 acute ly… 0 11.4 6 6 55 1 1 myelofib… 1 2.43 7 7 67 1 1 acute my… 1 9.59 8 8 51 1 1 acute my… 1 NA 9 9 44 0 0 multiple… 0 43.4 10 10 59 1 1 chronic … 0 92.7 # … with 73 more rows, and 19 more variables: prior_radiation <dbl>, # prior_chemo <dbl>, prior_transplant <dbl>, recipient_cmv <dbl>, # donor_cmv <dbl>, donor_sex <dbl>, tnc_dose <dbl>, cd34_dose <dbl>, # cd3_dose <dbl>, cd8_dose <dbl>, tbi_dose <dbl>, c1_c2 <dbl>, a_ki_rs <dbl>, # cmv <dbl>, time_to_cmv <dbl>, agvhd <dbl>, time_to_agvhd <dbl>, # cgvhd <dbl>, time_to_cgvhd <dbl> ``` ] --- count: false Example 4/5: Are All Rows in BMT distinct? .panel1-filter4-auto[ ```r # how many rows in bmt nrow(bmt) bmt %>% * distinct() ``` ] .panel2-filter4-auto[ ``` [1] 83 ``` ``` # A tibble: 64 x 26 id age sex race diagnosis diagnosis_type time_to_transpl… <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 1 61 1 0 acute my… 1 5.16 2 2 62 1 1 non-Hodg… 0 79.0 3 3 63 0 1 non-Hodg… 0 35.6 4 4 33 0 1 Hodgkin … 0 33.0 5 5 54 0 1 acute ly… 0 11.4 6 6 55 1 1 myelofib… 1 2.43 7 7 67 1 1 acute my… 1 9.59 8 8 51 1 1 acute my… 1 NA 9 9 44 0 0 multiple… 0 43.4 10 10 59 1 1 chronic … 0 92.7 # … with 54 more rows, and 19 more variables: prior_radiation <dbl>, # prior_chemo <dbl>, prior_transplant <dbl>, recipient_cmv <dbl>, # donor_cmv <dbl>, donor_sex <dbl>, tnc_dose <dbl>, cd34_dose <dbl>, # cd3_dose <dbl>, cd8_dose <dbl>, tbi_dose <dbl>, c1_c2 <dbl>, a_ki_rs <dbl>, # cmv <dbl>, time_to_cmv <dbl>, agvhd <dbl>, time_to_agvhd <dbl>, # cgvhd <dbl>, time_to_cgvhd <dbl> ``` ] --- count: false Example 4/5: Are All Rows in BMT distinct? .panel1-filter4-auto[ ```r # how many rows in bmt nrow(bmt) bmt %>% distinct() # 64 rows instead of 83 - some duplicates # Format: *# distinct(dataset) <br> # distinct(dataset) <br> ``` ] .panel2-filter4-auto[ ``` [1] 83 ``` ``` # A tibble: 64 x 26 id age sex race diagnosis diagnosis_type time_to_transpl… <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 1 61 1 0 acute my… 1 5.16 2 2 62 1 1 non-Hodg… 0 79.0 3 3 63 0 1 non-Hodg… 0 35.6 4 4 33 0 1 Hodgkin … 0 33.0 5 5 54 0 1 acute ly… 0 11.4 6 6 55 1 1 myelofib… 1 2.43 7 7 67 1 1 acute my… 1 9.59 8 8 51 1 1 acute my… 1 NA 9 9 44 0 0 multiple… 0 43.4 10 10 59 1 1 chronic … 0 92.7 # … with 54 more rows, and 19 more variables: prior_radiation <dbl>, # prior_chemo <dbl>, prior_transplant <dbl>, recipient_cmv <dbl>, # donor_cmv <dbl>, donor_sex <dbl>, tnc_dose <dbl>, cd34_dose <dbl>, # cd3_dose <dbl>, cd8_dose <dbl>, tbi_dose <dbl>, c1_c2 <dbl>, a_ki_rs <dbl>, # cmv <dbl>, time_to_cmv <dbl>, agvhd <dbl>, time_to_agvhd <dbl>, # cgvhd <dbl>, time_to_cgvhd <dbl> ``` ] <style> .panel1-filter4-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-filter4-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-filter4-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- count: false Example 5/5: Which Rows are Duplicates .panel1-filter5-auto[ ```r # how many rows in bmt *nrow(bmt) ``` ] .panel2-filter5-auto[ ``` [1] 83 ``` ] --- count: false Example 5/5: Which Rows are Duplicates .panel1-filter5-auto[ ```r # how many rows in bmt nrow(bmt) *bmt ``` ] .panel2-filter5-auto[ ``` [1] 83 ``` ``` # A tibble: 83 x 26 id age sex race diagnosis diagnosis_type time_to_transpl… <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 1 61 1 0 acute my… 1 5.16 2 2 62 1 1 non-Hodg… 0 79.0 3 3 63 0 1 non-Hodg… 0 35.6 4 4 33 0 1 Hodgkin … 0 33.0 5 5 54 0 1 acute ly… 0 11.4 6 6 55 1 1 myelofib… 1 2.43 7 7 67 1 1 acute my… 1 9.59 8 8 51 1 1 acute my… 1 NA 9 9 44 0 0 multiple… 0 43.4 10 10 59 1 1 chronic … 0 92.7 # … with 73 more rows, and 19 more variables: prior_radiation <dbl>, # prior_chemo <dbl>, prior_transplant <dbl>, recipient_cmv <dbl>, # donor_cmv <dbl>, donor_sex <dbl>, tnc_dose <dbl>, cd34_dose <dbl>, # cd3_dose <dbl>, cd8_dose <dbl>, tbi_dose <dbl>, c1_c2 <dbl>, a_ki_rs <dbl>, # cmv <dbl>, time_to_cmv <dbl>, agvhd <dbl>, time_to_agvhd <dbl>, # cgvhd <dbl>, time_to_cgvhd <dbl> ``` ] --- count: false Example 5/5: Which Rows are Duplicates .panel1-filter5-auto[ ```r # how many rows in bmt nrow(bmt) bmt %>% * get_dupes() ``` ] .panel2-filter5-auto[ ``` [1] 83 ``` ``` # A tibble: 38 x 27 id age sex race diagnosis diagnosis_type time_to_transpl… <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 40 46 0 1 renal ce… NA 8.48 2 40 46 0 1 renal ce… NA 8.48 3 41 36 0 0 multiple… 0 11.7 4 41 36 0 0 multiple… 0 11.7 5 42 62 0 1 myelodys… 1 13.0 6 42 62 0 1 myelodys… 1 13.0 7 43 60 0 1 multiple… 0 14 8 43 60 0 1 multiple… 0 14 9 44 36 1 1 aplastic… NA 1.84 10 44 36 1 1 aplastic… NA 1.84 # … with 28 more rows, and 20 more variables: prior_radiation <dbl>, # prior_chemo <dbl>, prior_transplant <dbl>, recipient_cmv <dbl>, # donor_cmv <dbl>, donor_sex <dbl>, tnc_dose <dbl>, cd34_dose <dbl>, # cd3_dose <dbl>, cd8_dose <dbl>, tbi_dose <dbl>, c1_c2 <dbl>, a_ki_rs <dbl>, # cmv <dbl>, time_to_cmv <dbl>, agvhd <dbl>, time_to_agvhd <dbl>, # cgvhd <dbl>, time_to_cgvhd <dbl>, dupe_count <int> ``` ] --- count: false Example 5/5: Which Rows are Duplicates .panel1-filter5-auto[ ```r # how many rows in bmt nrow(bmt) bmt %>% get_dupes() # 438 duplicate rows # Format: *# get_dupes(dataset) <br> # get_dupes(dataset) <br> ``` ] .panel2-filter5-auto[ ``` [1] 83 ``` ``` # A tibble: 38 x 27 id age sex race diagnosis diagnosis_type time_to_transpl… <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 40 46 0 1 renal ce… NA 8.48 2 40 46 0 1 renal ce… NA 8.48 3 41 36 0 0 multiple… 0 11.7 4 41 36 0 0 multiple… 0 11.7 5 42 62 0 1 myelodys… 1 13.0 6 42 62 0 1 myelodys… 1 13.0 7 43 60 0 1 multiple… 0 14 8 43 60 0 1 multiple… 0 14 9 44 36 1 1 aplastic… NA 1.84 10 44 36 1 1 aplastic… NA 1.84 # … with 28 more rows, and 20 more variables: prior_radiation <dbl>, # prior_chemo <dbl>, prior_transplant <dbl>, recipient_cmv <dbl>, # donor_cmv <dbl>, donor_sex <dbl>, tnc_dose <dbl>, cd34_dose <dbl>, # cd3_dose <dbl>, cd8_dose <dbl>, tbi_dose <dbl>, c1_c2 <dbl>, a_ki_rs <dbl>, # cmv <dbl>, time_to_cmv <dbl>, agvhd <dbl>, time_to_agvhd <dbl>, # cgvhd <dbl>, time_to_cgvhd <dbl>, dupe_count <int> ``` ] <style> .panel1-filter5-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-filter5-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-filter5-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- class: inverse, center # End of This Flipbook ## On to The Coding Exercises!