class: center, middle, inverse, title-slide

# Demo with Tabsets (Panelsets)
## Selecting Rows with _slice_sample()_
### Peter Higgins
### 2021-01-15

---

### How to Use the _slice()_ Functions to Take Slices of Rows

#### If you have a very large dataset and want to develop code on a smaller (but random) sample, _slice_sample()_ can help.

This is also helpful for drawing training and testing sets when modeling.

_slice_sample()_ takes either a number of rows (n) or a proportion of rows (prop) as its argument.

Let's see some **sampling** examples!

---

.panelset[
.panel[.panel-name[R Code]

```r
# how many rows when you start
nrow(covid_dates)
covid_dates %>%
  slice_sample(prop = 0.3)
# see how many rows now

# Format:
# slice_sample(prop = 0.nn)
```
]

.panel[.panel-name[Results]

```
[1] 15524
```

```
# A tibble: 4,657 x 18
   subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name
        <dbl> <chr>           <chr>          <chr>    <dbl> <chr>   <chr>
 1      10396 melisandre      lannister      female      72 covid   emergency …
 2       5050 gerold          westerling     male        31 covid   clinical l…
 3       3713 ardrian         tully          male        31 covid   cc care nt…
 4       7687 amory           sand           male       104 covid   nicu
 5       1245 arwyn           targaryen      female      75 covid   clinical l…
 6        293 gilly           westerling     female     105 covid   radiation …
 7      10262 elinor          targaryen      female     106 covid   clinical l…
 8       4366 donyse          westerling     female     100 covid   clinical l…
 9       2981 humfrey         swyft          male       105 covid   inpatient …
10       7610 mudge           swyft          male        86 covid   clinical l…
# … with 4,647 more rows, and 11 more variables: result <chr>,
#   demo_group <chr>, age <dbl>, drive_thru_ind <dbl>, ct_result <dbl>,
#   orderset <dbl>, payor_group <chr>, patient_class <chr>, col_rec_tat <dbl>,
#   rec_ver_tat <dbl>, fake_date <date>
```
]
]

---
count: false

Example 2/3: Take a Random 70% Sample for Training and a Complementary 30% for Testing.
.panel1-filter2-auto[
```r
# how many rows when you start
*nrow(covid_dates)
```
]

.panel2-filter2-auto[
```
[1] 15524
```
]

---
count: false

Example 2/3: Take a Random 70% Sample for Training and a Complementary 30% for Testing.
.panel1-filter2-auto[
```r
# how many rows when you start
nrow(covid_dates)
# make training set
*covid_dates
```
]

.panel2-filter2-auto[
```
[1] 15524
```

```
# A tibble: 15,524 x 18
   subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name
        <dbl> <chr>           <chr>          <chr>    <dbl> <chr>   <chr>
 1       1412 jhezane         westerling     female       4 covid   inpatient …
 2        533 penny           targaryen      female       7 covid   clinical l…
 3       9134 grunt           rivers         male         7 covid   clinical l…
 4       8518 melisandre      swyft          female       8 covid   clinical l…
 5       8967 rolley          karstark       male         8 covid   emergency …
 6      11048 megga           karstark       female       8 covid   oncology d…
 7        663 ithoke          targaryen      male         9 covid   clinical l…
 8       2158 ravella         frey           female       9 covid   emergency …
 9       3794 styr            tyrell         male         9 covid   clinical l…
10       4706 wynafryd        seaworth       male         9 covid   clinical l…
# … with 15,514 more rows, and 11 more variables: result <chr>,
#   demo_group <chr>, age <dbl>, drive_thru_ind <dbl>, ct_result <dbl>,
#   orderset <dbl>, payor_group <chr>, patient_class <chr>, col_rec_tat <dbl>,
#   rec_ver_tat <dbl>, fake_date <date>
```
]

---
count: false

Example 2/3: Take a Random 70% Sample for Training and a Complementary 30% for Testing.
.panel1-filter2-auto[
```r
# how many rows when you start
nrow(covid_dates)
# make training set
covid_dates %>%
*  slice_sample(prop = 0.7)
```
]

.panel2-filter2-auto[
```
[1] 15524
```

```
# A tibble: 10,866 x 18
   subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name
        <dbl> <chr>           <chr>          <chr>    <dbl> <chr>   <chr>
 1      11684 anguy           stark          male        95 covid   autopsy
 2       3851 ryman           karstark       male        23 covid   emergency …
 3      11585 emmon           karstark       male        37 covid   clinical l…
 4       1175 ben             ryswell        male        11 covid   clinical l…
 5       3906 gyles           snow           male        33 covid   emergency …
 6       2127 edric           targaryen      male        36 covid   clinical l…
 7      10780 marissa         seaworth       female      63 covid   emergency …
 8      11895 lysa            swyft          female      50 covid   clinical l…
 9       4750 tanda           mormont        female      58 covid   intl patie…
10      11380 hallyne         clegane        male        92 covid   clinical l…
# … with 10,856 more rows, and 11 more variables: result <chr>,
#   demo_group <chr>, age <dbl>, drive_thru_ind <dbl>, ct_result <dbl>,
#   orderset <dbl>, payor_group <chr>, patient_class <chr>, col_rec_tat <dbl>,
#   rec_ver_tat <dbl>, fake_date <date>
```
]

---
count: false

Example 2/3: Take a Random 70% Sample for Training and a Complementary 30% for Testing.
.panel1-filter2-auto[
```r
# how many rows when you start
nrow(covid_dates)
# make training set
covid_dates %>%
  slice_sample(prop = 0.7) ->
*training_covid_dates
```
]

.panel2-filter2-auto[
```
[1] 15524
```
]

---
count: false

Example 2/3: Take a Random 70% Sample for Training and a Complementary 30% for Testing.
.panel1-filter2-auto[
```r
# how many rows when you start
nrow(covid_dates)
# make training set
covid_dates %>%
  slice_sample(prop = 0.7) ->
training_covid_dates
# now make testing set
*covid_dates
```
]

.panel2-filter2-auto[
```
[1] 15524
```

```
# A tibble: 15,524 x 18
   subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name
        <dbl> <chr>           <chr>          <chr>    <dbl> <chr>   <chr>
 1       1412 jhezane         westerling     female       4 covid   inpatient …
 2        533 penny           targaryen      female       7 covid   clinical l…
 3       9134 grunt           rivers         male         7 covid   clinical l…
 4       8518 melisandre      swyft          female       8 covid   clinical l…
 5       8967 rolley          karstark       male         8 covid   emergency …
 6      11048 megga           karstark       female       8 covid   oncology d…
 7        663 ithoke          targaryen      male         9 covid   clinical l…
 8       2158 ravella         frey           female       9 covid   emergency …
 9       3794 styr            tyrell         male         9 covid   clinical l…
10       4706 wynafryd        seaworth       male         9 covid   clinical l…
# … with 15,514 more rows, and 11 more variables: result <chr>,
#   demo_group <chr>, age <dbl>, drive_thru_ind <dbl>, ct_result <dbl>,
#   orderset <dbl>, payor_group <chr>, patient_class <chr>, col_rec_tat <dbl>,
#   rec_ver_tat <dbl>, fake_date <date>
```
]

---
count: false

Example 2/3: Take a Random 70% Sample for Training and a Complementary 30% for Testing.
.panel1-filter2-auto[
```r
# how many rows when you start
nrow(covid_dates)
# make training set
covid_dates %>%
  slice_sample(prop = 0.7) ->
training_covid_dates
# now make testing set
covid_dates %>%
*  anti_join(training_covid_dates)
```
]

.panel2-filter2-auto[
```
[1] 15524
```

```
# A tibble: 4,658 x 18
   subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name
        <dbl> <chr>           <chr>          <chr>    <dbl> <chr>   <chr>
 1       8115 patrek          frey           male         9 covid   clinical l…
 2       8943 myria           rivers         female       9 covid   picu
 3       6965 arthor          lannister      male         9 covid   clinical l…
 4       2103 ollo            snow           male        10 covid   clinical l…
 5       4930 sarra           frey           female      10 covid   emergency …
 6       8138 frenya          swyft          female      10 covid   clinical l…
 7       2114 azzak           tully          male        10 covid   inpatient …
 8        227 maege           sand           female      11 covid   emergency …
 9        252 nymeria         karstark       female      11 covid   ob gyn
10       1299 alys            manderly       female      11 covid   inpatient …
# … with 4,648 more rows, and 11 more variables: result <chr>,
#   demo_group <chr>, age <dbl>, drive_thru_ind <dbl>, ct_result <dbl>,
#   orderset <dbl>, payor_group <chr>, patient_class <chr>, col_rec_tat <dbl>,
#   rec_ver_tat <dbl>, fake_date <date>
```
]

---
count: false

Example 2/3: Take a Random 70% Sample for Training and a Complementary 30% for Testing.
.panel1-filter2-auto[
```r
# how many rows when you start
nrow(covid_dates)
# make training set
covid_dates %>%
  slice_sample(prop = 0.7) ->
training_covid_dates
# now make testing set
covid_dates %>%
  anti_join(training_covid_dates) ->
*testing_covid_dates
```
]

.panel2-filter2-auto[
```
[1] 15524
```
]

---
count: false

Example 2/3: Take a Random 70% Sample for Training and a Complementary 30% for Testing.
.panel1-filter2-auto[
```r
# how many rows when you start
nrow(covid_dates)
# make training set
covid_dates %>%
  slice_sample(prop = 0.7) ->
training_covid_dates
# now make testing set
covid_dates %>%
  anti_join(training_covid_dates) ->
testing_covid_dates
# see how many rows in each
*training_covid_dates
```
]

.panel2-filter2-auto[
```
[1] 15524
```

```
# A tibble: 10,866 x 18
   subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name
        <dbl> <chr>           <chr>          <chr>    <dbl> <chr>   <chr>
 1        275 qezza           greyjoy        female      31 covid   nicu
 2       4943 hoster          targaryen      male        91 covid   emergency …
 3      12286 harra           harlaw         female      45 covid   laboratory
 4      11283 petyr           mormont        male        57 covid   clinical l…
 5       6343 mord            bolton         male        57 covid   clinical l…
 6       8979 qyburn          seaworth       male        50 covid   clinical l…
 7       1805 godry           stark          male        98 covid   clinical l…
 8       1488 alys            baratheon      female      75 covid   inpatient …
 9       5966 joffrey         martell        male        32 covid   emergency …
10       7384 harra           targaryen      female      22 covid   clinical l…
# … with 10,856 more rows, and 11 more variables: result <chr>,
#   demo_group <chr>, age <dbl>, drive_thru_ind <dbl>, ct_result <dbl>,
#   orderset <dbl>, payor_group <chr>, patient_class <chr>, col_rec_tat <dbl>,
#   rec_ver_tat <dbl>, fake_date <date>
```
]

---
count: false

Example 2/3: Take a Random 70% Sample for Training and a Complementary 30% for Testing.
.panel1-filter2-auto[
```r
# how many rows when you start
nrow(covid_dates)
# make training set
covid_dates %>%
  slice_sample(prop = 0.7) ->
training_covid_dates
# now make testing set
covid_dates %>%
  anti_join(training_covid_dates) ->
testing_covid_dates
# see how many rows in each
training_covid_dates
*testing_covid_dates
```
]

.panel2-filter2-auto[
```
[1] 15524
```

```
# A tibble: 10,866 x 18
   subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name
        <dbl> <chr>           <chr>          <chr>    <dbl> <chr>   <chr>
 1       1478 lucas           lannister      male       101 covid   clinical l…
 2      11044 alysane         rivers         female      58 covid   emergency …
 3       7414 donal           stark          male        42 covid   urgent car…
 4        393 glendon         lannister      male        97 covid   clinical l…
 5       1344 anya            seaworth       female     104 covid   clinical l…
 6       5101 nymeria         snow           female     100 covid   clinical l…
 7       4541 matrice         seaworth       female      91 covid   inpatient …
 8       7514 mathos          tyrell         male       103 covid   emergency …
 9       4310 marq            clegane        male        31 covid   clinical l…
10       1427 nolla           baelish        female      32 covid   inpatient …
# … with 10,856 more rows, and 11 more variables: result <chr>,
#   demo_group <chr>, age <dbl>, drive_thru_ind <dbl>, ct_result <dbl>,
#   orderset <dbl>, payor_group <chr>, patient_class <chr>, col_rec_tat <dbl>,
#   rec_ver_tat <dbl>, fake_date <date>
```

```
# A tibble: 4,658 x 18
   subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name
        <dbl> <chr>           <chr>          <chr>    <dbl> <chr>   <chr>
 1        663 ithoke          targaryen      male         9 covid   clinical l…
 2       3794 styr            tyrell         male         9 covid   clinical l…
 3       9309 maege           sand           female       9 covid   medical ce…
 4       8943 myria           rivers         female       9 covid   picu
 5       8031 gueren          sand           male        10 covid   clinical l…
 6      10919 woth            snow           male        10 covid   clinical l…
 7        252 nymeria         karstark       female      11 covid   ob gyn
 8       2427 daenerys        umber          female      11 covid   inpatient …
 9       2983 ronnel          snow           male        11 covid   emergency …
10       3854 husband         snow           male        11 covid   clinical l…
# … with 4,648 more rows, and 11 more variables: result <chr>,
#   demo_group <chr>, age <dbl>, drive_thru_ind <dbl>, ct_result <dbl>,
#   orderset <dbl>, payor_group <chr>, patient_class <chr>, col_rec_tat <dbl>,
#   rec_ver_tat <dbl>, fake_date <date>
```
]

---
count: false

Example 2/3: Take a Random 70% Sample for Training and a Complementary 30% for Testing.
.panel1-filter2-auto[
```r
# how many rows when you start
nrow(covid_dates)
# make training set
covid_dates %>%
  slice_sample(prop = 0.7) ->
training_covid_dates
# now make testing set
covid_dates %>%
  anti_join(training_covid_dates) ->
testing_covid_dates
# see how many rows in each
training_covid_dates
testing_covid_dates
# Format:
*# slice_sample(prop = 0.nn)
```
]

.panel2-filter2-auto[
```
[1] 15524
```

```
# A tibble: 10,866 x 18
   subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name
        <dbl> <chr>           <chr>          <chr>    <dbl> <chr>   <chr>
 1       4175 gilwood         targaryen      male        55 covid   clinical l…
 2       4756 kella           baratheon      female      39 covid   inpatient …
 3       6797 jon             umber          male        85 covid   clinical l…
 4       6966 margaery        greyjoy        female      53 covid   emergency …
 5       9502 lorcas          mormont        male        10 covid   clinical l…
 6        220 tickler         frey           male        38 covid   clinical l…
 7       2848 ghost           stark          female      96 covid   emergency …
 8       4951 beric           tarly          male        83 covid   clinical l…
 9        407 eddard          martell        male        64 covid   clinical l…
10       1140 tanda           westerling     female      30 covid   emergency …
# … with 10,856 more rows, and 11 more variables: result <chr>,
#   demo_group <chr>, age <dbl>, drive_thru_ind <dbl>, ct_result <dbl>,
#   orderset <dbl>, payor_group <chr>, patient_class <chr>, col_rec_tat <dbl>,
#   rec_ver_tat <dbl>, fake_date <date>
```

```
# A tibble: 4,658 x 18
   subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name
        <dbl> <chr>           <chr>          <chr>    <dbl> <chr>   <chr>
 1       1412 jhezane         westerling     female       4 covid   inpatient …
 2       3794 styr            tyrell         male         9 covid   clinical l…
 3       4706 wynafryd        seaworth       male         9 covid   clinical l…
 4       8943 myria           rivers         female       9 covid   picu
 5       2103 ollo            snow           male        10 covid   clinical l…
 6       2349 yezzan          royce          male        10 covid   line clini…
 7       2083 weasel          tarly          female      10 covid   emergency …
 8       8031 gueren          sand           male        10 covid   clinical l…
 9      10468 chella          mormont        female      10 covid   emergency …
10       9217 ragwyle         martell        female      10 covid   clinical l…
# … with 4,648 more rows, and 11 more variables: result <chr>,
#   demo_group <chr>, age <dbl>, drive_thru_ind <dbl>, ct_result <dbl>,
#   orderset <dbl>, payor_group <chr>, patient_class <chr>, col_rec_tat <dbl>,
#   rec_ver_tat <dbl>, fake_date <date>
```
]

---
count: false

Example 2/3: Take a Random 70% Sample for Training and a Complementary 30% for Testing.
.panel1-filter2-auto[
```r
# how many rows when you start
nrow(covid_dates)
# make training set
covid_dates %>%
  slice_sample(prop = 0.7) ->
training_covid_dates
# now make testing set
covid_dates %>%
  anti_join(training_covid_dates) ->
testing_covid_dates
# see how many rows in each
training_covid_dates
testing_covid_dates
# Format:
# slice_sample(prop = 0.nn)
*# set1 %>% anti_join(set2)
```
]

.panel2-filter2-auto[
```
[1] 15524
```

```
# A tibble: 10,866 x 18
   subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name
        <dbl> <chr>           <chr>          <chr>    <dbl> <chr>   <chr>
 1       4652 woth            martell        male        84 covid   clinical l…
 2        293 gilly           westerling     female      73 covid   radiation …
 3       3072 andar           baratheon      male        84 covid   clinical l…
 4         77 nymella         tarly          female      77 covid   radiation …
 5        968 tytos           tarly          male        89 covid   clinical l…
 6      10324 qezza           kettleblack    female      84 covid   clinical l…
 7      12030 creighton       targaryen      male       102 covid   inpatient …
 8        706 alerie          kettleblack    female      47 covid   inpatient …
 9       1625 duram           seaworth       male        65 covid   clinical l…
10       4582 falyse          bolton         female      31 covid   clinical l…
# … with 10,856 more rows, and 11 more variables: result <chr>,
#   demo_group <chr>, age <dbl>, drive_thru_ind <dbl>, ct_result <dbl>,
#   orderset <dbl>, payor_group <chr>, patient_class <chr>, col_rec_tat <dbl>,
#   rec_ver_tat <dbl>, fake_date <date>
```

```
# A tibble: 4,658 x 18
   subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name
        <dbl> <chr>           <chr>          <chr>    <dbl> <chr>   <chr>
 1       1412 jhezane         westerling     female       4 covid   inpatient …
 2       2158 ravella         frey           female       9 covid   emergency …
 3       4706 wynafryd        seaworth       male         9 covid   clinical l…
 4       8943 myria           rivers         female       9 covid   picu
 5       6965 arthor          lannister      male         9 covid   clinical l…
 6       8138 frenya          swyft          female      10 covid   clinical l…
 7      10468 chella          mormont        female      10 covid   emergency …
 8        252 nymeria         karstark       female      11 covid   ob gyn
 9        392 moon            mormont        male        11 covid   clinical l…
10       1299 alys            manderly       female      11 covid   inpatient …
# … with 4,648 more rows, and 11 more variables: result <chr>,
#   demo_group <chr>, age <dbl>, drive_thru_ind <dbl>, ct_result <dbl>,
#   orderset <dbl>, payor_group <chr>, patient_class <chr>, col_rec_tat <dbl>,
#   rec_ver_tat <dbl>, fake_date <date>
```
]

<style>
.panel1-filter2-auto {
  color: black;
  width: 38.6060606060606%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel2-filter2-auto {
  color: black;
  width: 59.3939393939394%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel3-filter2-auto {
  color: black;
  width: NA%;
  hight: 33%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
</style>

---
count: false

Example 3/3: Take a Random Sample of 50 Rows from covid_dates.
.panel1-filter3-auto[
```r
# how many rows when you start
*nrow(covid_dates)
```
]

.panel2-filter3-auto[
```
[1] 15524
```
]

---
count: false

Example 3/3: Take a Random Sample of 50 Rows from covid_dates.
.panel1-filter3-auto[
```r
# how many rows when you start
nrow(covid_dates)
*covid_dates
```
]

.panel2-filter3-auto[
```
[1] 15524
```

```
# A tibble: 15,524 x 18
   subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name
        <dbl> <chr>           <chr>          <chr>    <dbl> <chr>   <chr>
 1       1412 jhezane         westerling     female       4 covid   inpatient …
 2        533 penny           targaryen      female       7 covid   clinical l…
 3       9134 grunt           rivers         male         7 covid   clinical l…
 4       8518 melisandre      swyft          female       8 covid   clinical l…
 5       8967 rolley          karstark       male         8 covid   emergency …
 6      11048 megga           karstark       female       8 covid   oncology d…
 7        663 ithoke          targaryen      male         9 covid   clinical l…
 8       2158 ravella         frey           female       9 covid   emergency …
 9       3794 styr            tyrell         male         9 covid   clinical l…
10       4706 wynafryd        seaworth       male         9 covid   clinical l…
# … with 15,514 more rows, and 11 more variables: result <chr>,
#   demo_group <chr>, age <dbl>, drive_thru_ind <dbl>, ct_result <dbl>,
#   orderset <dbl>, payor_group <chr>, patient_class <chr>, col_rec_tat <dbl>,
#   rec_ver_tat <dbl>, fake_date <date>
```
]

---
count: false

Example 3/3: Take a Random Sample of 50 Rows from covid_dates.
.panel1-filter3-auto[
```r
# how many rows when you start
nrow(covid_dates)
covid_dates %>%
*  slice_sample(n = 50)
```
]

.panel2-filter3-auto[
```
[1] 15524
```

```
# A tibble: 50 x 18
   subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name
        <dbl> <chr>           <chr>          <chr>    <dbl> <chr>   <chr>
 1       5262 rigney          stark          male        99 covid   hem onc da…
 2       9991 del             bolton         male        69 covid   emergency …
 3       3369 brienne         umber          female     105 covid   clinical l…
 4      10956 wat             targaryen      male        97 covid   cc care nt…
 5       3811 kella           frey           female      75 covid   oncology d…
 6       6551 masha           mormont        female      30 covid   emergency …
 7       8211 jhezane         greyjoy        female      32 covid   clinical l…
 8        385 tanda           snow           female      44 covid   clinical l…
 9      10780 marissa         seaworth       female      63 covid   emergency …
10       3543 alia            karstark       female      30 covid   emergency …
# … with 40 more rows, and 11 more variables: result <chr>, demo_group <chr>,
#   age <dbl>, drive_thru_ind <dbl>, ct_result <dbl>, orderset <dbl>,
#   payor_group <chr>, patient_class <chr>, col_rec_tat <dbl>,
#   rec_ver_tat <dbl>, fake_date <date>
```
]

---
count: false

Example 3/3: Take a Random Sample of 50 Rows from covid_dates.
.panel1-filter3-auto[
```r
# how many rows when you start
nrow(covid_dates)
covid_dates %>%
  slice_sample(n = 50)
# see how many rows now

# Format:
*# slice_sample(n = NN)
```
]

.panel2-filter3-auto[
```
[1] 15524
```

```
# A tibble: 50 x 18
   subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name
        <dbl> <chr>           <chr>          <chr>    <dbl> <chr>   <chr>
 1       8272 ragwyle         rivers         female      86 covid   emergency …
 2      10480 zei             westerling     female      53 covid   clinical l…
 3        592 osney           seaworth       male        32 covid   nicu
 4       8204 kojja           baratheon      female      23 covid   emergency …
 5       9551 ricasso         snow           male        21 covid   clinical l…
 6      10767 palla           mormont        female      58 covid   clinical l…
 7       3041 tanda           stark          female      93 covid   emergency …
 8      10432 mag             karstark       male        91 covid   inpatient …
 9       2943 shagwell        rivers         male        75 covid   inpatient …
10       3719 halys           tully          male        50 covid   clinical l…
# … with 40 more rows, and 11 more variables: result <chr>, demo_group <chr>,
#   age <dbl>, drive_thru_ind <dbl>, ct_result <dbl>, orderset <dbl>,
#   payor_group <chr>, patient_class <chr>, col_rec_tat <dbl>,
#   rec_ver_tat <dbl>, fake_date <date>
```
]

<style>
.panel1-filter3-auto {
  color: black;
  width: 38.6060606060606%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel2-filter3-auto {
  color: black;
  width: 59.3939393939394%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel3-filter3-auto {
  color: black;
  width: NA%;
  hight: 33%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
</style>

---
class: inverse, center

# End of This Flipbook
## On to The Coding Exercises!
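---

### Appendix: A Repeatable Training/Testing Split

The 70/30 example draws a fresh random sample on every run, which is why the training rows differ from frame to frame. Here is a minimal sketch of one way to make the split repeatable. It rests on a few assumptions that are not in the original slides: the seed value 2021 (any fixed integer works), a small made-up tibble called toy_dates standing in for covid_dates, and a helper column named row_id. Joining on row_id is a design choice for this sketch: _anti_join()_ with no by argument matches on every shared column, so a row that is an exact duplicate of a training row would also be dropped from the testing set.

```r
library(dplyr)

# Toy stand-in for covid_dates (made-up values, not the real dataset).
# Rows 1 and 2 are exact duplicates on purpose.
toy_dates <- tibble(
  subject_id = c(1, 1, 2, 3, 3, 4, 5, 6, 7, 8),
  pan_day    = c(4, 4, 7, 7, 8, 8, 9, 9, 9, 10)
)

# Assumed seed value; any fixed integer makes the random draw repeatable
set.seed(2021)

# row_id is a helper column for this sketch: one unique value per row
toy_with_id <- toy_dates %>%
  mutate(row_id = row_number())

# make training set
training_toy <- toy_with_id %>%
  slice_sample(prop = 0.7)

# now make testing set: every row whose row_id is not in the training set
testing_toy <- toy_with_id %>%
  anti_join(training_toy, by = "row_id")

# check that the two sets together account for every row
nrow(training_toy) + nrow(testing_toy) == nrow(toy_with_id)
```

To try the same idea on covid_dates, swap toy_dates for covid_dates and keep the rest unchanged. Rerunning the whole chunk, including the set.seed() line, should give the same two sets each time.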