class: center, middle, inverse, title-slide # Wrangling ### Introduction to R
Bern R Bootcamp
### June 2020 --- layout: true <div class="my-footer"> <span style="text-align:center"> <span> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/> </span> <a href="https://therbootcamp.github.io/"> <span style="padding-left:82px"> <font color="#7E7E7E"> www.therbootcamp.com </font> </span> </a> <a href="https://therbootcamp.github.io/"> <font color="#7E7E7E"> Introduction to R | June 2020 </font> </a> </span> </div> --- # What is wrangling? <p align = "center"> <img src="image/wrangling.jpeg" height=450px><br> <font style="font-size:10px">from <a href="https://datasciencebe.com/tag/data-wrangling/">datasciencebe.com</a></font> </p> --- .pull-left45[ # This is wrangling! <div style="padding-bottom:10px"></div> #### <high>Transform</high> Change variable names Add new variables #### <high>Organise</high></font> Sort data by variables Merging data from two separate dataframes Move data between variables and rows #### <high>Aggregate and summarise</high></font> Group data and calculate and summary stats ] .pull-right5[ <br> <p align="center"> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/organise_transform_aggregate.png" height = "530px"> </p> ] --- # `dplyr` + `tidyr` .pull-left5[ To wrangle data in R, we will use the <high><mono>dplyr</mono></high> and <high><mono>tidyr</mono></high> packages, which are part of the <high><mono>tidyverse</mono></high>. | Package | Function| Function| |:-------------|:----| |<b>dplyr</b> | Transformation | `rename()`, `mutate()`, `case_when()`, `*_join()` | |<b>dplyr</b> | Organisation | `arrange()`, `slice()`, `filter()`, `select()` | |<b>tidyr</b> | Organisation | `pivot_longer()`, `pivot_wider()` | |<b>dplyr</b> | Aggregation | `group_by()`, `summarise()` | ] .pull-right4[ <p align = "center"> <img src="image/packages.png" height=320px> </p> ] --- # The Pipe! <high>`%>%`</high> .pull-left4[ `dplyr` makes extensive use of a new operator called the <high><i>Pipe</i> `%>%`</high><br> Read the <high><i>Pipe</i> `%>%`</high> as "And Then..." <br> ```r # Start with data data %>% # AND THEN... DO_SOMETHING %>% # AND THEN... DO_SOMETHING %>% # AND THEN... DO_SOMETHING %>% # AND THEN... ``` ] .pull-right55[ <p align="center"> <img src="https://upload.wikimedia.org/wikipedia/en/thumb/b/b9/MagrittePipe.jpg/300px-MagrittePipe.jpg" width = "400px"><br> <br><i>This is not a pipe. But <mono>%>%</mono> is!</i> </p> ] --- # The Pipe! <high>`%>%`</high> .pull-left4[ ```r # Vector of `scores` score <- c(8, 4, 6, 3, 7, 3) score ``` ``` [1] 8 4 6 3 7 3 ``` ```r # Mean: Base-R-way mean(x = score) ``` ``` [1] 5.167 ``` ```r # Mean: Tidyverse-style (with %>%) score %>% # AND THEN mean() ``` ``` [1] 5.167 ``` ] .pull-right55[ <p align="center"> <img src="image/pipe.png" height = "400px"> </p> ] --- # The Pipe! <high>`%>%`</high> .pull-left4[ ```r # Vector of `scores` score <- c(8, 4, 6, 3, 7, 3) score ``` ``` [1] 8 4 6 3 7 3 ``` ```r # Mean: Base-R-way round(mean(x = score), digits = 1) ``` ``` [1] 5.2 ``` ```r # Mean: Tidyverse-style (with %>%) score %>% # AND THEN mean() %>% # AND THEN round(digits = 1) ``` ``` [1] 5.2 ``` ] .pull-right55[ <p align="center"> <img src="image/pipe.png" height = "400px"> </p> ] --- .pull-left4[ # 2 dirty data sets ### Goals <b>1</b> <i>Rename</i> - Give meaningful <high>variable names</high>.<br><br> <b>2</b> <i>Recode</i> - Use appropriate <high>units</high> and <high>labels</high> for nominal variables.<br><br> <b>3</b> <i>Join</i> - <high>Combine</high> datasets.<br><br> <b>4</b> <i>Sort</i> - <high>Sort</high> tibble by age. <br><br> <b>5</b> <i>Filter</i> - Select relevant <high>cases</high>. <br><br> <b>6</b> <i>Select</i> - Select relevant <high>variables</high>.<br><br> <b>7</b> <i>Pivot</i> - Change to <high>wide format</high>.<br><br> ] .pull-right45[ <br><br2> ```r # patients tibble patients ``` ``` # A tibble: 5 x 3 ID X1 X2 <dbl> <dbl> <dbl> 1 1 37 1 2 2 65 2 3 3 57 2 4 4 34 1 5 5 45 2 ``` ```r # results tibble results ``` ``` # A tibble: 5 x 3 ID t_1 t_2 <dbl> <dbl> <dbl> 1 4 100 105 2 92 134 150 3 1 123 135 4 2 143 140 5 99 102 68 ``` ] --- # Transformation .pull-left4[ Transformation functions are used to <high>alter the content</high> of a `tibble`. <br2> | Function| Description| |:-------------|:----| | `rename()` | <high>Change names</high> of variables | | `mutate()`| <high>Create variable</high> from existing variables| | `case_when()`| <high>Recode values</high> from a vector to another| | `left_join()` | <high>Combine tibbbles</high> | ] .pull-right5[ ```r patients # Baselers data ``` ``` # A tibble: 5 x 3 ID X1 X2 <dbl> <dbl> <dbl> 1 1 37 1 2 2 65 2 3 3 57 2 4 4 34 1 5 5 45 2 ``` ] --- # `rename()` .pull-left4[ <high>Change variable names</high> with `rename()`. ```r patients %>% rename(NEW = OLD, NEW = OLD) ``` ```r patients # Original ``` ``` # A tibble: 5 x 3 ID X1 X2 <dbl> <dbl> <dbl> 1 1 37 1 2 2 65 2 3 3 57 2 4 4 34 1 5 5 45 2 ``` ] .pull-right5[ Change `X1` to `age`, and `X2` to `arm`. ```r # 0) Start with patients data patients %>% # 1) Change variable names with rename() rename(age = X1, # New = Old arm = X2) # New = Old ``` ``` # A tibble: 5 x 3 ID age arm <dbl> <dbl> <dbl> 1 1 37 1 2 2 65 2 3 3 57 2 4 4 34 1 5 5 45 2 ``` ] --- # `mutate()` .pull-left4[ Create <high>new variables</high>, or <high>change existing ones</high>, with `mutate()`. ```r tibble %>% mutate( NEW1 = DEFINITION1, NEW2 = DEFINITION2, NEW3 = DEFINITION3, ... ) ``` ] .pull-right5[ Calculate two new variables `age_months` and `age_decades`. ```r patients %>% rename(age = X1, arm = X2) %>% # AND THEN... # Create new variables with mutate() mutate(age_months = age * 12, age_decades = age / 10) ``` ``` # A tibble: 5 x 5 ID age arm age_months age_decades <dbl> <dbl> <dbl> <dbl> <dbl> 1 1 37 1 444 3.7 2 2 65 2 780 6.5 3 3 57 2 684 5.7 4 4 34 1 408 3.4 5 5 45 2 540 4.5 ``` ] --- # `case_when()` .pull-left4[ Use `case_when()` with `mutate()` to define <high>new variables based on logical conditions</high>. ```r # Using mutate(case_when()) tibble %>% mutate( NEW = case_when( COND1 ~ VAL1, COND2 ~ VAL2 )) ``` ] .pull-right55[ Create `arm_lab` that carries `'placebo'` for `arm == 1` and `'drug'` for `arm == 2`. ```r patients %>% rename(age = X1, arm = X2) %>% # Create arm_lab from arm mutate(arm_lab = case_when(arm == 1 ~ "placebo", arm == 2 ~ "drug")) ``` ``` # A tibble: 5 x 4 ID age arm arm_lab <dbl> <dbl> <dbl> <chr> 1 1 37 1 placebo 2 2 65 2 drug 3 3 57 2 drug 4 4 34 1 placebo 5 5 45 2 drug ``` ] --- # Joining data <p align="center"> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/joining_data.png" height="450px"> </p> --- # `left_join()` .pull-left35[ Use `left_join()` to <high>combine two data frames</high> based on one or more key variables. ```r # Join tibble_1 to tibble_2 # matched by KEY tibble_1 %>% left_join(tibble_2, by = c("KEY")) ``` Other `*_join()` functions: `right_join()`, `full_join()`, `inner_join()`, `anti_join()`, `semi_join()`. ] .pull-right6[ ```r # Join patients with results to create combined combined <- patients %>% rename(age = X1, arm = X2) %>% mutate(arm_lab = case_when(arm == 1 ~ "placebo", arm == 2 ~ "drug")) %>% # Join with results with left_join() left_join(results, by = "ID") ``` ```r # Show combined data set combined ``` ``` # A tibble: 5 x 6 ID age arm arm_lab t_1 t_2 <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 1 37 1 placebo 123 135 2 2 65 2 drug 143 140 3 3 57 2 drug NA NA 4 4 34 1 placebo 100 105 5 5 45 2 drug NA NA ``` ] --- .pull-left4[ # 2 dirty data sets ### Goals <b>1</b> <i>Rename</i> - Give meaningful <high>variable names</high>.<br><br> <b>2</b> <i>Recode</i> - Use appropriate <high>units</high> and <high>labels</high> for nominal variables.<br><br> <b>3</b> <i>Join</i> - <high>Combine</high> datasets.<br><br> <b>4</b> <i>Sort</i> - <high>Sort</high> tibble by age. <br><br> <b>5</b> <i>Filter</i> - Select relevant <high>cases</high>. <br><br> <b>6</b> <i>Select</i> - Select relevant <high>variables</high>.<br><br> <b>7</b> <i>Pivot</i> - Change to <high>wide format</high>.<br><br> ] .pull-right45[ <br><br2> ```r # patients tibble patients ``` ``` # A tibble: 5 x 3 ID X1 X2 <dbl> <dbl> <dbl> 1 1 37 1 2 2 65 2 3 3 57 2 4 4 34 1 5 5 45 2 ``` ```r # results tibble results ``` ``` # A tibble: 5 x 3 ID t_1 t_2 <dbl> <dbl> <dbl> 1 4 100 105 2 92 134 150 3 1 123 135 4 2 143 140 5 99 102 68 ``` ] --- # Organisation .pull-left4[ Organisation functions help you change the organisation of your data by <high>sorting rows</high> by variables, <high>filter rows</high> based on criteria, <high>select variables</high> (etc). | Function| Description| |:--------|:----|:-------------| | `arrange()` |<high>Sort rows</high> by variables| | `slice()`| <high>Select rows</high> by location| | `filter()` | <high>Select rows</high> by criteria| | `select()`| <high>Select variables</high>| ] .pull-right55[ ```r # combined tibble combined ``` ``` # A tibble: 5 x 6 ID age arm arm_lab t_1 t_2 <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 1 37 1 placebo 123 135 2 2 65 2 drug 143 140 3 3 57 2 drug NA NA 4 4 34 1 placebo 100 105 5 5 45 2 drug NA NA ``` ] --- # `arrange()` .pull-left4[ Use `arrange()` to <high>sort rows</high> in increasing or decreasing (using `desc()`) order of one or more variables. ```r tibble %>% arrange(A, B) ``` To sort in descending order, use `desc()` ```r tibble %>% arrange(desc(A), B) ``` ] .pull-right55[ Sort by `arm`. ```r combined %>% arrange(arm) # Sort by arm ``` ``` # A tibble: 5 x 6 ID age arm arm_lab t_1 t_2 <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 1 37 1 placebo 123 135 2 4 34 1 placebo 100 105 3 2 65 2 drug 143 140 4 3 57 2 drug NA NA 5 5 45 2 drug NA NA ``` ] --- # `arrange()` .pull-left4[ Use `arrange()` to <high>sort rows</high> in increasing or decreasing (using `desc()`) order of one or more variables. ```r tibble %>% arrange(A, B) ``` To sort in descending order, use `desc()` ```r tibble %>% arrange(desc(A), B) ``` ] .pull-right55[ Sort by `arm` and then `age`. ```r combined %>% arrange(arm, age) # Sort by arm then age ``` ``` # A tibble: 5 x 6 ID age arm arm_lab t_1 t_2 <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 4 34 1 placebo 100 105 2 1 37 1 placebo 123 135 3 5 45 2 drug NA NA 4 3 57 2 drug NA NA 5 2 65 2 drug 143 140 ``` ] --- # `slice()` .pull-left4[ Use `slice()` to <high>select rows</high> (or remove) by row number. Use `c()`, `a:b`, or `seq()` to create row numbers ```r # Integer vector c(2, 6, 10) ``` ``` [1] 2 6 10 ``` ```r # Integer vector of 0 to 5 0:5 ``` ``` [1] 0 1 2 3 4 5 ``` ] .pull-right55[ Select rows `3` and `5`. ```r # Rows 3 and 5 only combined %>% slice(c(3, 5)) ``` ``` # A tibble: 2 x 6 ID age arm arm_lab t_1 t_2 <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 3 57 2 drug NA NA 2 5 45 2 drug NA NA ``` ] --- # `slice()` .pull-left4[ Use `slice()` to <high>select rows</high> (or remove) by row number. Use `c()`, `a:b`, or `seq()` to create row numbers ```r # Integer vector c(2, 6, 10) ``` ``` [1] 2 6 10 ``` ```r # Integer vector of 0 to 5 0:5 ``` ``` [1] 0 1 2 3 4 5 ``` ] .pull-right55[ Select rows `1` through `4`. ```r # First 4 rows combined %>% slice(1:4) ``` ``` # A tibble: 4 x 6 ID age arm arm_lab t_1 t_2 <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 1 37 1 placebo 123 135 2 2 65 2 drug 143 140 3 3 57 2 drug NA NA 4 4 34 1 placebo 100 105 ``` ] --- # `filter()` .pull-left4[ Use `filter()` to <high>select rows</high> (or remove) based on logical statements. <high>Chain</high> logical comparison operators with `&` (AND) and `|` (OR). <high>`==`</high> - is equal to<br> <high>`<`</high>, <high>`>`</high> - smaller/greater than<br> <high>`≤`</high>, <high>`≥`</high> - smaller/greater than or equal<br> <high>`&`</high>, <high>`&&`</high> - logical AND<br> <high>`|`</high>, <high>`||`</high> - logical OR<br> ] .pull-right55[ Select <i>only</i> patients over `30`. ```r # Filter patients younger than 40 combined %>% filter(age < 40) ``` ``` # A tibble: 2 x 6 ID age arm arm_lab t_1 t_2 <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 1 37 1 placebo 123 135 2 4 34 1 placebo 100 105 ``` ] --- # `filter()` .pull-left4[ Use `filter()` to <high>select rows</high> (or remove) based on logical statements. <high>Chain</high> logical comparison operators with `&` (AND) and `|` (OR). <high>`==`</high> - is equal to<br> <high>`<`</high>, <high>`>`</high> - smaller/greater than<br> <high>`≤`</high>, <high>`≥`</high> - smaller/greater than or equal<br> <high>`&`</high>, <high>`&&`</high> - logical AND<br> <high>`|`</high>, <high>`||`</high> - logical OR<br> ] .pull-right55[ Select <i>only</i> patients over `30` in the `'drug'` arm. ```r # Filter patients older than 30 given drug combined %>% filter(age > 30 & arm_lab == 'drug') ``` ``` # A tibble: 3 x 6 ID age arm arm_lab t_1 t_2 <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 2 65 2 drug 143 140 2 3 57 2 drug NA NA 3 5 45 2 drug NA NA ``` ] --- # `select()` .pull-left4[ Use `select()` to <high>select variables</high> (and remove all others) ```r # Select variables A, B tibble %>% select(A, B) ``` <high>Remove variables</high> with `-`. ```r # Select everything BUT A tibble %>% select(-A) ``` ] .pull-right55[ Select variables `ID` and `arm`. ```r combined %>% select(ID, arm) # Select ID and arm variables ``` ``` # A tibble: 5 x 2 ID arm <dbl> <dbl> 1 1 1 2 2 2 3 3 2 4 4 1 5 5 2 ``` ] --- # `select()` .pull-left4[ Use `select()` to <high>select variables</high> (and remove all others) ```r # Select variables A, B tibble %>% select(A, B) ``` <high>Remove variables</high> with `-`. ```r # Select everything BUT A tibble %>% select(-A) ``` ] .pull-right55[ Select everything <i>except</i> t_2 ```r combined %>% select(-t_2) # Everything BUT ID ``` ``` # A tibble: 5 x 5 ID age arm arm_lab t_1 <dbl> <dbl> <dbl> <chr> <dbl> 1 1 37 1 placebo 123 2 2 65 2 drug 143 3 3 57 2 drug NA 4 4 34 1 placebo 100 5 5 45 2 drug NA ``` ] --- # Long and wide formats .pull-left4[ Some functions require data to be in a certain shape, that is to be either in a wide or a long format. Use <high>`pivot_longer()`</high> and <high>`pivot_wider()`</high> from the `tidyr`package to change a tibble between <high>wide</high> and <high>long</high> formats. <br> | Function | Result | |:----------|:-------| |`pivot_longer()`|<high>wide → long</high> format| |`pivot_wider()`|<high>long → wide</high> format| ] .pull-right55[ <p align="center"> <img src="image/longwide.png" height=420px> </p> ] --- # `pivot_longer()` .pull-left45[ ```r # Show original data (wide) combined ``` ``` # A tibble: 5 x 6 ID age arm arm_lab t_1 t_2 <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 1 37 1 placebo 123 135 2 2 65 2 drug 143 140 3 3 57 2 drug NA NA 4 4 34 1 placebo 100 105 5 5 45 2 drug NA NA ``` ] .pull-right45[ ```r # "pivot_longer" wide data to long combined %>% pivot_longer(names_to = 'time', values_to = 'value', cols = c(t_1, t_2)) ``` ``` # A tibble: 10 x 6 ID age arm arm_lab time value <dbl> <dbl> <dbl> <chr> <chr> <dbl> 1 1 37 1 placebo t_1 123 2 1 37 1 placebo t_2 135 3 2 65 2 drug t_1 143 4 2 65 2 drug t_2 140 5 3 57 2 drug t_1 NA 6 3 57 2 drug t_2 NA 7 4 34 1 placebo t_1 100 8 4 34 1 placebo t_2 105 9 5 45 2 drug t_1 NA 10 5 45 2 drug t_2 NA ``` ] --- # `pivot_wider()` .pull-left45[ ```r # Show long data long ``` ``` # A tibble: 10 x 6 ID age arm arm_lab time value <dbl> <dbl> <dbl> <chr> <chr> <dbl> 1 1 37 1 placebo t_1 123 2 1 37 1 placebo t_2 135 3 2 65 2 drug t_1 143 4 2 65 2 drug t_2 140 5 3 57 2 drug t_1 NA 6 3 57 2 drug t_2 NA 7 4 34 1 placebo t_1 100 8 4 34 1 placebo t_2 105 9 5 45 2 drug t_1 NA 10 5 45 2 drug t_2 NA ``` ] .pull-right45[ ```r # "pivot_wider" long data to wide long %>% pivot_wider(names_from = 'time', values_from = 'value') ``` ``` # A tibble: 5 x 6 ID age arm arm_lab t_1 t_2 <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 1 37 1 placebo 123 135 2 2 65 2 drug 143 140 3 3 57 2 drug NA NA 4 4 34 1 placebo 100 105 5 5 45 2 drug NA NA ``` ] --- class: middle, center <h1><a href="https://dwulff.github.io/Intro2R_Unibe/_sessions/Wrangling/Wrangling_practical.html">Practical</a></h1>