Trulli
from parkinsonsnewstoday.com

Overview

In this practical you’ll practice “data wrangling” with the dplyr and tidyr packages (part of the `tidyverse collection of packages).

By the end of this practical you will know how to:

  1. Change column names, select specific columns
  2. Create new columns
  3. Filter rows of data based on multiple criteria
  4. Combine (aka, ‘join’) multiple data sets through key columns
  5. Convert data between wide and long formats

Tasks

A - Setup

  1. Open your BernRBootcamp R project. It should already have the folders 1_Data and 2_Code. Make sure that the data files listed in the Datasets section above are in your 1_Data folder
# Done!
  1. Open a new R script. At the top of the script, using comments, write your name and the date. Save it as a new file called wrangling_practical.R in the 2_Code folder.
# Done!
  1. Using library() load the set of packages for this practical listed in the packages section above.

  2. For this practical, we’ll use the trial_act data, this is the result of a randomized clinical trial comparing the effects of different medications on adults infected with the human immunodeficiency virus (You can learn more about the trial here). Using the following template, load the data into R and store it as a new object called trial_act.

# Load trial_act.csv from the data folder in your working directory

trial_act <- read_csv(file = "XX")
trial_act <- read_csv(file = "1_Data/trial_act.csv")
  1. Using the same code structure, load the trial_act_demo_fake.csv data as a new dataframe called trial_act_demo
trial_act_demo <- read_csv(file = "1_Data/trial_act_demo_fake.csv")
  1. Take a look at the first few rows of the datasets by printing them to the console.
# Print trial_act object
trial_act
# A tibble: 2,139 x 27
   pidnum   age  wtkg  hemo  homo drugs karnof oprior   z30 zprior preanti  race
    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl>   <dbl> <dbl>
 1  10056    48  89.8     0     0     0    100      0     0      1       0     0
 2  10059    61  49.4     0     0     0     90      0     1      1     895     0
 3  10089    45  88.5     0     1     1     90      0     1      1     707     0
 4  10093    47  85.3     0     1     0    100      0     1      1    1399     0
 5  10124    43  66.7     0     1     0    100      0     1      1    1352     0
 6  10140    46  88.9     0     1     1    100      0     1      1    1181     0
 7  10165    31  73.0     0     1     0    100      0     1      1     930     0
 8  10190    41  66.2     0     1     1    100      0     1      1    1329     0
 9  10198    40  82.6     0     1     0     90      0     1      1    1074     0
10  10229    35  78.0     0     1     0    100      0     1      1     964     0
# … with 2,129 more rows, and 15 more variables: gender <dbl>, str2 <dbl>,
#   strat <dbl>, symptom <dbl>, treat <dbl>, offtrt <dbl>, cd40 <dbl>,
#   cd420 <dbl>, cd496 <dbl>, r <dbl>, cd80 <dbl>, cd820 <dbl>, cens <dbl>,
#   days <dbl>, arms <dbl>
  1. Use the the summary() function to print more details on the columns of the datasets.
summary(trial_act)
     pidnum            age            wtkg            hemo      
 Min.   : 10056   Min.   :12.0   Min.   : 31.0   Min.   :0.000  
 1st Qu.: 81446   1st Qu.:29.0   1st Qu.: 66.7   1st Qu.:0.000  
 Median :190566   Median :34.0   Median : 74.4   Median :0.000  
 Mean   :248778   Mean   :35.2   Mean   : 75.1   Mean   :0.084  
 3rd Qu.:280277   3rd Qu.:40.0   3rd Qu.: 82.6   3rd Qu.:0.000  
 Max.   :990077   Max.   :70.0   Max.   :159.9   Max.   :1.000  
                                                                
      homo           drugs           karnof          oprior           z30      
 Min.   :0.000   Min.   :0.000   Min.   : 70.0   Min.   :0.000   Min.   :0.00  
 1st Qu.:0.000   1st Qu.:0.000   1st Qu.: 90.0   1st Qu.:0.000   1st Qu.:0.00  
 Median :1.000   Median :0.000   Median :100.0   Median :0.000   Median :1.00  
 Mean   :0.661   Mean   :0.131   Mean   : 95.4   Mean   :0.022   Mean   :0.55  
 3rd Qu.:1.000   3rd Qu.:0.000   3rd Qu.:100.0   3rd Qu.:0.000   3rd Qu.:1.00  
 Max.   :1.000   Max.   :1.000   Max.   :100.0   Max.   :1.000   Max.   :1.00  
                                                                               
     zprior     preanti          race           gender           str2      
 Min.   :1   Min.   :   0   Min.   :0.000   Min.   :0.000   Min.   :0.000  
 1st Qu.:1   1st Qu.:   0   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.000  
 Median :1   Median : 142   Median :0.000   Median :1.000   Median :1.000  
 Mean   :1   Mean   : 379   Mean   :0.288   Mean   :0.828   Mean   :0.586  
 3rd Qu.:1   3rd Qu.: 740   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000  
 Max.   :1   Max.   :2851   Max.   :1.000   Max.   :1.000   Max.   :1.000  
                                                                           
     strat         symptom          treat           offtrt           cd40     
 Min.   :1.00   Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :   0  
 1st Qu.:1.00   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.000   1st Qu.: 264  
 Median :2.00   Median :0.000   Median :1.000   Median :0.000   Median : 340  
 Mean   :1.98   Mean   :0.173   Mean   :0.751   Mean   :0.363   Mean   : 351  
 3rd Qu.:3.00   3rd Qu.:0.000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.: 423  
 Max.   :3.00   Max.   :1.000   Max.   :1.000   Max.   :1.000   Max.   :1199  
                                                                              
     cd420          cd496            r              cd80          cd820     
 Min.   :  49   Min.   :   0   Min.   :0.000   Min.   :  40   Min.   : 124  
 1st Qu.: 269   1st Qu.: 209   1st Qu.:0.000   1st Qu.: 654   1st Qu.: 632  
 Median : 353   Median : 321   Median :1.000   Median : 893   Median : 865  
 Mean   : 371   Mean   : 329   Mean   :0.627   Mean   : 987   Mean   : 935  
 3rd Qu.: 460   3rd Qu.: 440   3rd Qu.:1.000   3rd Qu.:1207   3rd Qu.:1146  
 Max.   :1119   Max.   :1190   Max.   :1.000   Max.   :5011   Max.   :6035  
                NA's   :797                                                 
      cens            days           arms     
 Min.   :0.000   Min.   :  14   Min.   :0.00  
 1st Qu.:0.000   1st Qu.: 727   1st Qu.:1.00  
 Median :0.000   Median : 997   Median :2.00  
 Mean   :0.244   Mean   : 879   Mean   :1.52  
 3rd Qu.:0.000   3rd Qu.:1091   3rd Qu.:3.00  
 Max.   :1.000   Max.   :1231   Max.   :3.00  
                                              
summary(trial_act_demo)
     pidnum          exercise    education        
 Min.   : 10056   Min.   :0.0   Length:2139       
 1st Qu.: 81446   1st Qu.:1.0   Class :character  
 Median :190566   Median :1.0   Mode  :character  
 Mean   :248778   Mean   :2.2                     
 3rd Qu.:280277   3rd Qu.:3.0                     
 Max.   :990077   Max.   :7.0                     

B - Change column names with rename()

  1. Print the names of the trial_act data with names(XXX)
names(XXX)
names(trial_act)
 [1] "pidnum"  "age"     "wtkg"    "hemo"    "homo"    "drugs"   "karnof" 
 [8] "oprior"  "z30"     "zprior"  "preanti" "race"    "gender"  "str2"   
[15] "strat"   "symptom" "treat"   "offtrt"  "cd40"    "cd420"   "cd496"  
[22] "r"       "cd80"    "cd820"   "cens"    "days"    "arms"   
  1. Using rename(), change the column name wtkg in the trial_act dataframe to weight_kg. Be sure to assign the result back to trial_act to change it!
# Change the name to weight_kg from wtkg

trial_act <- trial_act %>%
  rename(XX = XX)
trial_act <- trial_act %>%
  rename(weight_kg = wtkg)
  1. Look at the names of your trial_act dataframe again, do you now see the column weight_kg?
names(trial_act)
 [1] "pidnum"    "age"       "weight_kg" "hemo"      "homo"      "drugs"    
 [7] "karnof"    "oprior"    "z30"       "zprior"    "preanti"   "race"     
[13] "gender"    "str2"      "strat"     "symptom"   "treat"     "offtrt"   
[19] "cd40"      "cd420"     "cd496"     "r"         "cd80"      "cd820"    
[25] "cens"      "days"      "arms"     
  1. Change the column name age to age_y (to specify that age is in years).
trial_act <- trial_act %>%
  rename(age_y = age)

C - Select columns with select()

  1. Using the select() function, select only the column age_y and print the result (but don’t assign it to anything). Do you see only the age_y column now?
XXX %>%
  select(XXX)
trial_act %>% 
  select(age_y)
# A tibble: 2,139 x 1
   age_y
   <dbl>
 1    48
 2    61
 3    45
 4    47
 5    43
 6    46
 7    31
 8    41
 9    40
10    35
# … with 2,129 more rows
  1. Using select() select the columns pidnum, age_y, gender, and weight_kg (but don’t assign the result to anything)

  2. Now it’s time to create a new dataframe! Select the columns pidnum, arms, cd40, cd420, and cd496 and assign the result to a new object called CD4_wide. The cd40, cd420, and cd496 columns show patient’s CD4 T cell counts at baseline, 20 weeks, and 96 weeks. After you create CD4_wide, print it to make sure it worked!

XX <- trial_act %>% 
  select(XX, XX, XX, XX, ...)
CD4_wide <- trial_act %>%
  select(`pidnum`, `arms`, `cd40`, `cd420`, `cd496`)
  1. Did you know you can easily select all columns that start with specific characters using starts_with()? Try adapting the following code to get the same result you got before.
CD4_wide <- trial_act %>% 
  select(pidnum, arms, starts_with("XXX"))
CD4_wide <- trial_act %>% 
  select(pidnum, arms, starts_with("cd"))
  1. A colleague wants an anonymised dataframe that does not contain the columns pidnum and age. Create this dataframe by selecting all columns except pidnum and age (hint: use the notation select(-XXX, -XXX)) to select everything except specified columns

D - Add new columns with mutate()

  1. Using the mutate() function, add the column age_m which shows each patient’s age in months instead of years (Hint: Just multiply age_y by 12!)
trial_act <- trial_act %>%
  mutate(XX = XX * 12)
trial_act <- trial_act %>%
  mutate(age_m = age_y * 12)
  1. Using mutate, add the following new columns to trial_act. (Try combining these into one call to the mutate() function!)
  • weight_lb: Weight in lbs instead of kilograms. You can do this by multiplying weight_kg by 2.2.
  • cd_change_20: Change in CD4 T cell count from baseline to 20 weeks. You can do this by taking cd420 - cd40
  • cd_change_960: Change in CD4 T cell count from baseline to 96 weeks. You can do this by taking cd496 - cd40
XXX <- XXX %>% 
  mutate(weight_lb = XXX,
         cd_change_20 = XXX,
         XXX = XXX)
trial_act <- trial_act %>% 
  mutate(weight_lb = weight_kg * 2.2,
         cd_change_20 = cd420 - cd40,
         cd_change_960 = cd496 - cd40)
  1. If you look at the gender column, you will see that it is numeric (0s and 1s). Using the mutate() and case_when() functions, create a new column called gender_char which shows the gender as a string, where 0 = “female” and 1 = “male”:
# Create gender_char which shows gender as a stringh
trial_act <- trial_act %>%
  mutate(
  gender_char = case_when(
    gender == XX ~ "XX",
    gender == XX ~ "XX"))
trial_act <- trial_act %>%
  mutate(
  gender_char = case_when(
    gender == 0 ~ "female",
    gender == 1 ~ "male"))
  1. The column arms is also numeric and not very meaningful. Create a new column arms_char contains the names of the arms. Here is a table of the mapping
arms arms_char
0 zidovudine
1 zidovudine and didanosine
2 zidovudine and zalcitabine
3 didanosine
trial_act <- trial_act %>%
  mutate(
  arms_char = case_when(
    arms == 0 ~ "zidovudine",
    arms == 1 ~ "zidovudine and didanosine",
    arms == 2 ~ "zidovudine and zalcitabine",
    arms == 3 ~ "didanosine"))
  1. If you haven’t already, try putting all the code for your previous questions in one call to mutate(). That is, in one block of code, create age_m, weight_lb, cd_change_20, cd_change_960, gender_char and over50 using the mutate() function only once. Here’s how your code should look:
trial_act <- trial_act %>%
  mutate(
    age_m = XXX,
    weight_lb = XXX,
    cd_change_20 = XXX,
    XXX = XXX,
    XXX = case_when(XXX),
    XXX = case_when(XXX)
  )
trial_act <- trial_act %>%
  mutate(
    agem = age_y * 12,
    weight_lb = weight_kg * 2.2,
    cd_change_20 = cd420 - cd40,
    cd_change_960 = cd496 - cd40,
    gender_char = case_when(
                      gender == 0 ~ "female",
                      gender == 1 ~ "male"),
   arms_char = case_when(
    arms == 0 ~ "zidovudine",
    arms == 1 ~ "zidovudine and didanosine",
    arms == 2 ~ "zidovudine and zalcitabine",
    arms == 3 ~ "didanosine")
  )

E - Arrange rows with arrange()

  1. Using the arrange() function, arrange the trial_act data in ascending order of age_y (from lowest to highest). After you do, print the data to make sure it worked!
trial_act <- trial_act %>% 
 arrange(XXX)
trial_act <- trial_act %>% 
 arrange(age_y)

trial_act
# A tibble: 2,139 x 34
   pidnum age_y weight_kg  hemo  homo drugs karnof oprior   z30 zprior preanti
    <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl>   <dbl>
 1 940533    12      41.4     1     0     0    100      0     0      1       0
 2 950037    12      53.1     1     0     0    100      0     1      1     902
 3 950056    12      31       1     0     0    100      0     1      1    2500
 4 910034    13      32.7     1     0     0    100      0     1      1     553
 5 940534    13      62.9     1     0     0    100      0     0      1     366
 6 960014    13      48.5     1     0     0    100      0     0      1       0
 7 310767    14      65       1     0     0    100      0     1      1     490
 8 920050    14      54.2     1     0     0    100      0     1      1     610
 9 940544    14      41.1     1     0     0    100      0     0      1       0
10 950061    14      64.3     1     0     0    100      0     1      1     387
# … with 2,129 more rows, and 23 more variables: race <dbl>, gender <dbl>,
#   str2 <dbl>, strat <dbl>, symptom <dbl>, treat <dbl>, offtrt <dbl>,
#   cd40 <dbl>, cd420 <dbl>, cd496 <dbl>, r <dbl>, cd80 <dbl>, cd820 <dbl>,
#   cens <dbl>, days <dbl>, arms <dbl>, age_m <dbl>, weight_lb <dbl>,
#   cd_change_20 <dbl>, cd_change_960 <dbl>, gender_char <chr>,
#   arms_char <chr>, agem <dbl>
  1. Now arrange the data in descending order of age_y (from highest to lowest). After, look the data to make sure it worked. To arrange data in descending order, just include desc() around the variable. E.g.; data %>% arrrange(desc(height))
trial_act <- trial_act %>% 
 arrange(desc(age_y))

trial_act
# A tibble: 2,139 x 34
   pidnum age_y weight_kg  hemo  homo drugs karnof oprior   z30 zprior preanti
    <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl>   <dbl>
 1  11438    70      73.9     0     1     0    100      0     1      1     674
 2 211360    70      63.1     0     1     0     90      0     0      1       0
 3 211284    69      81.6     0     1     0    100      0     0      1       0
 4  50580    68      90.5     0     1     1    100      0     1      1     618
 5  81127    68      70.8     0     1     0     90      0     1      1    1770
 6  10924    67      71       0     1     0    100      0     0      1       0
 7 241150    67      82.1     0     1     0     90      0     0      1       0
 8  50662    66      84.4     0     1     0    100      0     0      1       0
 9  11987    65      77.2     0     1     0     90      0     0      1       0
10 140797    65      60.5     0     1     0     90      0     0      1       0
# … with 2,129 more rows, and 23 more variables: race <dbl>, gender <dbl>,
#   str2 <dbl>, strat <dbl>, symptom <dbl>, treat <dbl>, offtrt <dbl>,
#   cd40 <dbl>, cd420 <dbl>, cd496 <dbl>, r <dbl>, cd80 <dbl>, cd820 <dbl>,
#   cens <dbl>, days <dbl>, arms <dbl>, age_m <dbl>, weight_lb <dbl>,
#   cd_change_20 <dbl>, cd_change_960 <dbl>, gender_char <chr>,
#   arms_char <chr>, agem <dbl>
  1. You can sort the rows of dataframes with multiple columns by including many arguments to arrange(). Now sort the data by arms (arms) and then age_y (age_y). Print the result to make sure it looks right!
trial_act <- trial_act %>% 
 arrange(XXX, XXX)
trial_act <- trial_act %>% 
 arrange(arms, age_y)

trial_act
# A tibble: 2,139 x 34
   pidnum age_y weight_kg  hemo  homo drugs karnof oprior   z30 zprior preanti
    <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl>   <dbl>
 1 960014    13      48.5     1     0     0    100      0     0      1       0
 2 960031    14      48.3     1     0     0    100      0     0      1       0
 3 990071    14      60       1     0     0    100      0     0      1       0
 4 980042    16      63       1     0     0    100      0     1      1     753
 5 171040    17      51.3     0     0     0     90      0     0      1       0
 6 990026    17     103.      1     0     0    100      0     1      1     417
 7 310234    18      57.3     1     0     0    100      0     1      1     344
 8 940519    18      56.8     1     0     0    100      0     1      1     805
 9 211314    19      50.8     0     0     0     90      0     0      1       0
10 340767    19      74.8     0     1     0    100      0     0      1       0
# … with 2,129 more rows, and 23 more variables: race <dbl>, gender <dbl>,
#   str2 <dbl>, strat <dbl>, symptom <dbl>, treat <dbl>, offtrt <dbl>,
#   cd40 <dbl>, cd420 <dbl>, cd496 <dbl>, r <dbl>, cd80 <dbl>, cd820 <dbl>,
#   cens <dbl>, days <dbl>, arms <dbl>, age_m <dbl>, weight_lb <dbl>,
#   cd_change_20 <dbl>, cd_change_960 <dbl>, gender_char <chr>,
#   arms_char <chr>, agem <dbl>

F - Filter specific rows with filter()

  1. Using the filter() function, filter only the rows from males (Hint: gender_char == "male")
trial_act %>%
  filter(XXX == "XXX")
trial_act %>%
  filter(gender_char == "male")
# A tibble: 1,771 x 34
   pidnum age_y weight_kg  hemo  homo drugs karnof oprior   z30 zprior preanti
    <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl>   <dbl>
 1 960014    13      48.5     1     0     0    100      0     0      1       0
 2 960031    14      48.3     1     0     0    100      0     0      1       0
 3 990071    14      60       1     0     0    100      0     0      1       0
 4 980042    16      63       1     0     0    100      0     1      1     753
 5 990026    17     103.      1     0     0    100      0     1      1     417
 6 940519    18      56.8     1     0     0    100      0     1      1     805
 7 340767    19      74.8     0     1     0    100      0     0      1       0
 8 211007    20      72.7     0     1     0     90      0     1      1      21
 9 261065    20      57.3     0     1     0    100      0     0      1       0
10 490308    20      72       0     1     0    100      0     0      1     539
# … with 1,761 more rows, and 23 more variables: race <dbl>, gender <dbl>,
#   str2 <dbl>, strat <dbl>, symptom <dbl>, treat <dbl>, offtrt <dbl>,
#   cd40 <dbl>, cd420 <dbl>, cd496 <dbl>, r <dbl>, cd80 <dbl>, cd820 <dbl>,
#   cens <dbl>, days <dbl>, arms <dbl>, age_m <dbl>, weight_lb <dbl>,
#   cd_change_20 <dbl>, cd_change_960 <dbl>, gender_char <chr>,
#   arms_char <chr>, agem <dbl>
  1. Create a new dataframe called trial_act_male that only contains rows from males (hint: just assign what you did in the previous question to a new object!). After you create trial_act_male, print it to make sure it only contains data from males!
trial_act_male <- trial_act %>%
  filter(gender_char == "male")

trial_act_male
# A tibble: 1,771 x 34
   pidnum age_y weight_kg  hemo  homo drugs karnof oprior   z30 zprior preanti
    <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl>   <dbl>
 1 960014    13      48.5     1     0     0    100      0     0      1       0
 2 960031    14      48.3     1     0     0    100      0     0      1       0
 3 990071    14      60       1     0     0    100      0     0      1       0
 4 980042    16      63       1     0     0    100      0     1      1     753
 5 990026    17     103.      1     0     0    100      0     1      1     417
 6 940519    18      56.8     1     0     0    100      0     1      1     805
 7 340767    19      74.8     0     1     0    100      0     0      1       0
 8 211007    20      72.7     0     1     0     90      0     1      1      21
 9 261065    20      57.3     0     1     0    100      0     0      1       0
10 490308    20      72       0     1     0    100      0     0      1     539
# … with 1,761 more rows, and 23 more variables: race <dbl>, gender <dbl>,
#   str2 <dbl>, strat <dbl>, symptom <dbl>, treat <dbl>, offtrt <dbl>,
#   cd40 <dbl>, cd420 <dbl>, cd496 <dbl>, r <dbl>, cd80 <dbl>, cd820 <dbl>,
#   cens <dbl>, days <dbl>, arms <dbl>, age_m <dbl>, weight_lb <dbl>,
#   cd_change_20 <dbl>, cd_change_960 <dbl>, gender_char <chr>,
#   arms_char <chr>, agem <dbl>
  1. Filter only rows for patients under the age of 60.
trial_act %>%
  filter(age_y < 60)
# A tibble: 2,110 x 34
   pidnum age_y weight_kg  hemo  homo drugs karnof oprior   z30 zprior preanti
    <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl>   <dbl>
 1 960014    13      48.5     1     0     0    100      0     0      1       0
 2 960031    14      48.3     1     0     0    100      0     0      1       0
 3 990071    14      60       1     0     0    100      0     0      1       0
 4 980042    16      63       1     0     0    100      0     1      1     753
 5 171040    17      51.3     0     0     0     90      0     0      1       0
 6 990026    17     103.      1     0     0    100      0     1      1     417
 7 310234    18      57.3     1     0     0    100      0     1      1     344
 8 940519    18      56.8     1     0     0    100      0     1      1     805
 9 211314    19      50.8     0     0     0     90      0     0      1       0
10 340767    19      74.8     0     1     0    100      0     0      1       0
# … with 2,100 more rows, and 23 more variables: race <dbl>, gender <dbl>,
#   str2 <dbl>, strat <dbl>, symptom <dbl>, treat <dbl>, offtrt <dbl>,
#   cd40 <dbl>, cd420 <dbl>, cd496 <dbl>, r <dbl>, cd80 <dbl>, cd820 <dbl>,
#   cens <dbl>, days <dbl>, arms <dbl>, age_m <dbl>, weight_lb <dbl>,
#   cd_change_20 <dbl>, cd_change_960 <dbl>, gender_char <chr>,
#   arms_char <chr>, agem <dbl>
  1. A colleague of yours named Tracy wants a datafame only containing data from females over the age of 40. Create this dataframe with filter() and call it trial_act_Tracy. After you create the dataframe, print it to make sure it looks correct!
trial_act_Tracy <- XXX %>%
  filter(XXX > XXX & XXX == XXX)
trial_act_Tracy <- trial_act %>%
  filter(age_y > 40 & gender_char == "female")

G - Combine dataframes with left_join()

The trial_act_demo.csv file contains additional (fictional) demographic data about the patients, namely the number of days of exercise they get per week, and their highest level of education. Our goal is to add the demographic information to our trial_act data.

  1. In order to combine the two dataframes, we need to find one ‘key’ column that we can use to match rows. Look at both the trial_act and trial_act_demo dataframes. Which column can we use as the ‘key’ column?

  2. Use the left_join() function to combine the trial_act and trial_act_demo datasets, set the by argument to the name of the key column that is common in both data sets. Assign the result to trial_act.

trial_act <- trial_act %>%
  left_join(XX, by = "XX")
trial_act <- trial_act %>%
  left_join(trial_act_demo, by = "pidnum")
  1. Print your new trial_act dataframe. Do you now see the demographic data?
trial_act
# A tibble: 2,139 x 36
   pidnum age_y weight_kg  hemo  homo drugs karnof oprior   z30 zprior preanti
    <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl>   <dbl>
 1 960014    13      48.5     1     0     0    100      0     0      1       0
 2 960031    14      48.3     1     0     0    100      0     0      1       0
 3 990071    14      60       1     0     0    100      0     0      1       0
 4 980042    16      63       1     0     0    100      0     1      1     753
 5 171040    17      51.3     0     0     0     90      0     0      1       0
 6 990026    17     103.      1     0     0    100      0     1      1     417
 7 310234    18      57.3     1     0     0    100      0     1      1     344
 8 940519    18      56.8     1     0     0    100      0     1      1     805
 9 211314    19      50.8     0     0     0     90      0     0      1       0
10 340767    19      74.8     0     1     0    100      0     0      1       0
# … with 2,129 more rows, and 25 more variables: race <dbl>, gender <dbl>,
#   str2 <dbl>, strat <dbl>, symptom <dbl>, treat <dbl>, offtrt <dbl>,
#   cd40 <dbl>, cd420 <dbl>, cd496 <dbl>, r <dbl>, cd80 <dbl>, cd820 <dbl>,
#   cens <dbl>, days <dbl>, arms <dbl>, age_m <dbl>, weight_lb <dbl>,
#   cd_change_20 <dbl>, cd_change_960 <dbl>, gender_char <chr>,
#   arms_char <chr>, agem <dbl>, exercise <dbl>, education <chr>

H - Reshaping with pivot_longer() and pivot_wider()

Remember the CD4_wide dataframe you created before? Currently it is in the wide format, where each row is a patient, where key data (different CD4 T cell counts) are in different columns like this:

CD4_wide <- trial_act %>%
  select(`pidnum`, `arms`, `cd40`, `cd420`, `cd496`)
# Data is in a 'wide' format
CD4_wide
# A tibble: 2,139 x 5
   pidnum  arms  cd40 cd420 cd496
    <dbl> <dbl> <dbl> <dbl> <dbl>
 1 960014     0   283   271    NA
 2 960031     0   481   428   519
 3 990071     0   166   169    28
 4 980042     0   299   214   124
 5 171040     0   549   415   436
 6 990026     0   373   218    NA
 7 310234     0   445   371   338
 8 940519     0   276   150    34
 9 211314     0   298   267    NA
10 340767     0   503   452    NA
# … with 2,129 more rows

Our goal is to convert this data to a ‘long’ format, where each row represents a single CD4 T cell count for a specific patient, like this:

# This is the same data in 'long' format
CD4_long
# A tibble: 6,417 x 4
   pidnum  arms time  value
    <dbl> <dbl> <chr> <dbl>
 1 960014     0 cd40    283
 2 960014     0 cd420   271
 3 960014     0 cd496    NA
 4 960031     0 cd40    481
 5 960031     0 cd420   428
 6 960031     0 cd496   519
 7 990071     0 cd40    166
 8 990071     0 cd420   169
 9 990071     0 cd496    28
10 980042     0 cd40    299
# … with 6,407 more rows
  1. Using the pivot_longer() function, create a new dataframe called CD4_long that shows the CD4_wide data in the ‘long’ format. To do this, use the following template. Set the grouping column to time and the new data column to value.
CD4_long <- CD4_wide %>% 
  pivot_longer(names_to = 'XXX',  # New grouping column
         values_to = 'XXX',  # New data column
         cols = c('XXX', 'XXX', 'XXX'))   # Names of columns to replicate
  1. Print your CD4_long dataframe! Do you now see that each row is a specific observation for a patient?
CD4_long
# A tibble: 6,417 x 4
   pidnum  arms time  value
    <dbl> <dbl> <chr> <dbl>
 1 960014     0 cd40    283
 2 960014     0 cd420   271
 3 960014     0 cd496    NA
 4 960031     0 cd40    481
 5 960031     0 cd420   428
 6 960031     0 cd496   519
 7 990071     0 cd40    166
 8 990071     0 cd420   169
 9 990071     0 cd496    28
10 980042     0 cd40    299
# … with 6,407 more rows
  1. Now use the pivot_wider() function to convert the long data back into the wide format! To do this, make the first argument time, and the second argument value
CD4_long %>%
   pivot_wider(names_from = 'XXX',  # Grouping column
   values_from = 'XXX')   # Value columnn
CD4_long %>%
  pivot_wider(names_from = 'time', 
              values_from = 'value')
# A tibble: 2,139 x 5
   pidnum  arms  cd40 cd420 cd496
    <dbl> <dbl> <dbl> <dbl> <dbl>
 1 960014     0   283   271    NA
 2 960031     0   481   428   519
 3 990071     0   166   169    28
 4 980042     0   299   214   124
 5 171040     0   549   415   436
 6 990026     0   373   218    NA
 7 310234     0   445   371   338
 8 940519     0   276   150    34
 9 211314     0   298   267    NA
10 340767     0   503   452    NA
# … with 2,129 more rows

Practice!

  1. Create a new dataframe called trial_A from trial_act with the following restrictions:
  • Only patients who take intraveneous drugs (drugs == 1)
  • Only patients younger than 60
  • Only patients whose baseline CD4 T cell counts cd40 were greater than 250 and less than 400.
trial_A <- trial_act %>%
  filter(drugs == 1,
         age_y < 60,
         cd40 > 250 & cd40 < 400)

trial_A
# A tibble: 126 x 36
   pidnum age_y weight_kg  hemo  homo drugs karnof oprior   z30 zprior preanti
    <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl>   <dbl>
 1 630023    26      86.2     0     0     1     90      0     0      1       0
 2  10378    27      67.3     0     0     1     90      0     1      1     875
 3 320475    27      66.6     0     0     1     90      0     0      1       0
 4  71333    29      73.9     0     1     1    100      0     0      1       0
 5 140809    29      70       0     1     1    100      0     0      1       0
 6 630030    29      88.2     0     0     1    100      0     0      1       0
 7 190395    33      68       0     0     1    100      0     1      1     842
 8 320365    34      86.0     0     0     1    100      0     0      1       0
 9 540015    34      72.6     0     0     1    100      0     0      1       0
10 180917    35      77.7     0     0     1     90      0     1      1     248
# … with 116 more rows, and 25 more variables: race <dbl>, gender <dbl>,
#   str2 <dbl>, strat <dbl>, symptom <dbl>, treat <dbl>, offtrt <dbl>,
#   cd40 <dbl>, cd420 <dbl>, cd496 <dbl>, r <dbl>, cd80 <dbl>, cd820 <dbl>,
#   cens <dbl>, days <dbl>, arms <dbl>, age_m <dbl>, weight_lb <dbl>,
#   cd_change_20 <dbl>, cd_change_960 <dbl>, gender_char <chr>,
#   arms_char <chr>, agem <dbl>, exercise <dbl>, education <chr>
  1. Add a new column to trial_act called drugs_char which is "User" when drugs == 1, and "Non User" when drugs == 0.
trial_act <- trial_act %>%
  mutate(drugs_char = case_when(
    drugs == 1 ~ "User",
    drugs == 0 ~ "Non User"
  ))

trial_act
# A tibble: 2,139 x 37
   pidnum age_y weight_kg  hemo  homo drugs karnof oprior   z30 zprior preanti
    <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl>   <dbl>
 1 960014    13      48.5     1     0     0    100      0     0      1       0
 2 960031    14      48.3     1     0     0    100      0     0      1       0
 3 990071    14      60       1     0     0    100      0     0      1       0
 4 980042    16      63       1     0     0    100      0     1      1     753
 5 171040    17      51.3     0     0     0     90      0     0      1       0
 6 990026    17     103.      1     0     0    100      0     1      1     417
 7 310234    18      57.3     1     0     0    100      0     1      1     344
 8 940519    18      56.8     1     0     0    100      0     1      1     805
 9 211314    19      50.8     0     0     0     90      0     0      1       0
10 340767    19      74.8     0     1     0    100      0     0      1       0
# … with 2,129 more rows, and 26 more variables: race <dbl>, gender <dbl>,
#   str2 <dbl>, strat <dbl>, symptom <dbl>, treat <dbl>, offtrt <dbl>,
#   cd40 <dbl>, cd420 <dbl>, cd496 <dbl>, r <dbl>, cd80 <dbl>, cd820 <dbl>,
#   cens <dbl>, days <dbl>, arms <dbl>, age_m <dbl>, weight_lb <dbl>,
#   cd_change_20 <dbl>, cd_change_960 <dbl>, gender_char <chr>,
#   arms_char <chr>, agem <dbl>, exercise <dbl>, education <chr>,
#   drugs_char <chr>
  1. Create a new dataframe called trial_B from trial_act with the following restrictions:
  • Only patients whose CD4 T cell counts increase more than 200 between baseline (cd40) and after 96 weeks (cd496)
  • Only drug users
trial_B <- trial_act %>%
  filter(
    cd496 - cd40 > 200,
    drugs == 1
  )

trial_B
# A tibble: 8 x 37
  pidnum age_y weight_kg  hemo  homo drugs karnof oprior   z30 zprior preanti
   <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl>   <dbl>
1 220489    45      70.9     0     0     1    100      0     0      1       0
2 140744    30      81.2     0     1     1    100      0     0      1       0
3  10962    41      81.9     0     0     1     90      0     0      1       0
4  50580    68      90.5     0     1     1    100      0     1      1     618
5  10881    36      81.6     0     1     1    100      0     1      1      86
6 250197    49      86.2     0     1     1     90      0     0      1       0
7  50572    28      84.4     0     1     1     90      0     0      1      15
8  50623    35      77       0     1     1    100      0     0      1       0
# … with 26 more variables: race <dbl>, gender <dbl>, str2 <dbl>, strat <dbl>,
#   symptom <dbl>, treat <dbl>, offtrt <dbl>, cd40 <dbl>, cd420 <dbl>,
#   cd496 <dbl>, r <dbl>, cd80 <dbl>, cd820 <dbl>, cens <dbl>, days <dbl>,
#   arms <dbl>, age_m <dbl>, weight_lb <dbl>, cd_change_20 <dbl>,
#   cd_change_960 <dbl>, gender_char <chr>, arms_char <chr>, agem <dbl>,
#   exercise <dbl>, education <chr>, drugs_char <chr>

Examples

# Wrangling with dplyr and tidyr ---------------------------

library(tidyverse)    # Load tidyverse for dplyr and tidyr

# Load baselers data
baselers <- read_csv("https://raw.githubusercontent.com/therbootcamp/baselers/master/inst/extdata/baselers.txt")

# Perform many dplyr operations in a row

baselers %>%
  
  # Change some names
  rename(age_y = age,
         swimming = rhine) %>%
  
  # Only include people over 30
  filter(age_y > 30) %>%
  
  # Calculate some new columns
  mutate(weight_lbs = weight * 2.22,
         height_m = height / 100,
         BMI = weight / height_m ^ 2,
         
         # Make binary version of sex
         sex_bin = case_when(
                      sex == "male" ~ 0,
                      sex == "female" ~ 1),

        # Show when height is greater than 150
        height_lt_150 = case_when(
                                height < 150 ~ 1,
                                height >= 150 ~ 0)) %>%
  
  # Sort in ascending order of sex, then
  #  descending order of age
  arrange(sex, desc(age_y)))

# TIPP -------
# In this practical you will do many operations on dataframes. Remember: when using `dplyr`, you can chain
# multiple functions together with the pipe `%>%`. When giving your answers to the questions in this practical,
# see how many operations you can chain with the pipe!

baselers <- read_csv("1_Data/baselers.csv")

# Method 1: Separate operations

baselers <- baselers %>%
              rename(Age_y = age)   # Change age to Age__y

baselers <- baselers %>%
              mutate(food_p = food / income)  # calculate food_p

baselers <- baselers %>%
              filter(sex == "m")    # Only include males

# Method 2: Chain with the pipe!

baselers <- read_csv("1_Data/baselers.csv")

baselers <- baselers %>%
              rename(Age_y = age) %>%             # Change age to Age__y
              mutate(food_p = food / income) %>%  # calculate food_p
              filter(sex == "m")                  # Only include males

Datasets

File Rows Columns
trial_act.csv 2139 27
trial_act_demo_fake.csv 2139 3

The trial_act data set contains a randomized clinical trial to compare monotherapy with zidovudine or didanosine with combination therapy with zidovudine and didanosine or zidovudine and zalcitabine in adults infected with the human immunodeficiency virus type I whose CD4 T cell counts were between 200 and 500 per cubic millimeter.

File Rows
pidnum patient’s ID number
age age in years at baseline
wtkg weight in kg at baseline
hemo hemophilia (0=no, 1=yes)
homo homosexual activity (0=no, 1=yes)
drugs history of intravenous drug use (0=no, 1=yes)
karnof Karnofsky score (on a scale of 0-100)
oprior non-zidovudine antiretroviral therapy prior to initiation of study treatment (0=no, 1=yes)
z30 zidovudine use in the 30 days prior to treatment initiation (0=no, 1=yes)
zprior zidovudine use prior to treatment initiation (0=no, 1=yes)
preanti number of days of previously received antiretroviral therapy
race race (0=white, 1=non-white)
gender gender (0=female, 1=male)
str2 antiretroviral history (0=naive, 1=experienced)
strat antiretroviral history stratification (1=’antiretroviral naive’, 2=’> 1 but ≤ 52 weeks of prior antiretroviral therapy’, 3=’> 52 weeks’)
symptom symptomatic indicator (0=asymptomatic, 1=symptomatic)
treat treatment indicator (0=zidovudine only, 1=other therapies)
offtrt indicator of off-treatment before 96±5 weeks (0=no,1=yes)
cd40 CD4 T cell count at baseline
cd420 CD4 T cell count at 20±5 weeks
cd496 CD4 T cell count at 96±5 weeks (=NA if missing)
r missing CD4 T cell count at 96±5 weeks (0=missing, 1=observed)
cd80 CD8 T cell count at baseline
cd820 CD8 T cell count at 20±5 weeks
cens indicator of observing the event in days
days number of days until the first occurrence of: (i) a decline in CD4 T cell count of at least 50 (ii) an event indicating progression to AIDS, or (iii) death.
arms treatment arm (0=zidovudine, 1=zidovudine and didanosine, 2=zidovudine and zalcitabine, 3=didanosine).

Functions

Packages

Package Installation
tidyverse install.packages("tidyverse")

Functions

Function Package Description
rename() dplyr Rename columns
select() dplyr Select columns based on name or index
filter() dplyr Select rows based on some logical criteria
arrange() dplyr Sort rows
mutate() dplyr Add new columns
case_when() dplyr Recode values of a column
group_by(), summarise() dplyr Group data and then calculate summary statistics
left_join() dplyr Combine multiple data sets using a key column
pivot_wider() tidyr Convert long data to wide format - from rows to columns
pivot_longer() tidyr Convert wide data to long format - from columns to rows

Resources

dplyr vignette

See https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html for the full dplyr vignette with lots of wrangling tips and tricks.

Cheatsheets


from R Studio