adapted from trueloveproperty.co.uk

Overview

In this practical, you’ll practice grouping and analysing data with the dplyr and tidyr packages (part of the `tidyverse collection of packages).

By the end of this practical you will know how to:

  1. Group data and calculate summary statistics
  2. Run simple statistical analyses

Tasks

A - Setup

  1. Open your BernRBootcamp R project. It should already have the folders 1_Data and 2_Code. Make sure that the data files listed in the Datasets section above are in your 1_Data folder.
# Done!
  1. Open a new R script. At the top of the script, using comments, write your name and the date. Save it as a new file called statisticsI_practical.R in the 2_Code folder.
  2. Using library() load the set of packages for this practical listed in the packages section above.
library(tidyverse)
  1. For this practical, we’ll use the kc_house.csv data. This dataset contains house sale prices for King County, Washington. It includes homes sold between May 2014 and May 2015. Using the following template, load the data into R and store it as a new object called kc_house.
kc_house <- read_csv(file = "XX")
  1. Using print(), summary(), and head(), explore the data to make sure it was loaded correctly.
kc_house
# A tibble: 21,613 x 21
   id    date                 price bedrooms bathrooms sqft_living sqft_lot
   <chr> <dttm>               <dbl>    <dbl>     <dbl>       <dbl>    <dbl>
 1 7129… 2014-10-13 00:00:00 2.22e5        3      1           1180     5650
 2 6414… 2014-12-09 00:00:00 5.38e5        3      2.25        2570     7242
 3 5631… 2015-02-25 00:00:00 1.80e5        2      1            770    10000
 4 2487… 2014-12-09 00:00:00 6.04e5        4      3           1960     5000
 5 1954… 2015-02-18 00:00:00 5.10e5        3      2           1680     8080
 6 7237… 2014-05-12 00:00:00 1.23e6        4      4.5         5420   101930
 7 1321… 2014-06-27 00:00:00 2.58e5        3      2.25        1715     6819
 8 2008… 2015-01-15 00:00:00 2.92e5        3      1.5         1060     9711
 9 2414… 2015-04-15 00:00:00 2.30e5        3      1           1780     7470
10 3793… 2015-03-12 00:00:00 3.23e5        3      2.5         1890     6560
# … with 21,603 more rows, and 14 more variables: floors <dbl>,
#   waterfront <dbl>, view <dbl>, condition <dbl>, grade <dbl>,
#   sqft_above <dbl>, sqft_basement <dbl>, yr_built <dbl>, yr_renovated <dbl>,
#   zipcode <dbl>, lat <dbl>, long <dbl>, sqft_living15 <dbl>, sqft_lot15 <dbl>
summary(kc_house)
      id                 date                         price        
 Length:21613       Min.   :2014-05-02 00:00:00   Min.   :  75000  
 Class :character   1st Qu.:2014-07-22 00:00:00   1st Qu.: 321950  
 Mode  :character   Median :2014-10-16 00:00:00   Median : 450000  
                    Mean   :2014-10-29 04:38:01   Mean   : 540088  
                    3rd Qu.:2015-02-17 00:00:00   3rd Qu.: 645000  
                    Max.   :2015-05-27 00:00:00   Max.   :7700000  
    bedrooms      bathrooms     sqft_living       sqft_lot           floors    
 Min.   : 0.0   Min.   :0.00   Min.   :  290   Min.   :    520   Min.   :1.00  
 1st Qu.: 3.0   1st Qu.:1.75   1st Qu.: 1427   1st Qu.:   5040   1st Qu.:1.00  
 Median : 3.0   Median :2.25   Median : 1910   Median :   7618   Median :1.50  
 Mean   : 3.4   Mean   :2.11   Mean   : 2080   Mean   :  15107   Mean   :1.49  
 3rd Qu.: 4.0   3rd Qu.:2.50   3rd Qu.: 2550   3rd Qu.:  10688   3rd Qu.:2.00  
 Max.   :33.0   Max.   :8.00   Max.   :13540   Max.   :1651359   Max.   :3.50  
   waterfront         view        condition        grade         sqft_above  
 Min.   :0.000   Min.   :0.00   Min.   :1.00   Min.   : 1.00   Min.   : 290  
 1st Qu.:0.000   1st Qu.:0.00   1st Qu.:3.00   1st Qu.: 7.00   1st Qu.:1190  
 Median :0.000   Median :0.00   Median :3.00   Median : 7.00   Median :1560  
 Mean   :0.008   Mean   :0.23   Mean   :3.41   Mean   : 7.66   Mean   :1788  
 3rd Qu.:0.000   3rd Qu.:0.00   3rd Qu.:4.00   3rd Qu.: 8.00   3rd Qu.:2210  
 Max.   :1.000   Max.   :4.00   Max.   :5.00   Max.   :13.00   Max.   :9410  
 sqft_basement     yr_built     yr_renovated     zipcode           lat      
 Min.   :   0   Min.   :1900   Min.   :   0   Min.   :98001   Min.   :47.2  
 1st Qu.:   0   1st Qu.:1951   1st Qu.:   0   1st Qu.:98033   1st Qu.:47.5  
 Median :   0   Median :1975   Median :   0   Median :98065   Median :47.6  
 Mean   : 292   Mean   :1971   Mean   :  84   Mean   :98078   Mean   :47.6  
 3rd Qu.: 560   3rd Qu.:1997   3rd Qu.:   0   3rd Qu.:98118   3rd Qu.:47.7  
 Max.   :4820   Max.   :2015   Max.   :2015   Max.   :98199   Max.   :47.8  
      long      sqft_living15    sqft_lot15    
 Min.   :-122   Min.   : 399   Min.   :   651  
 1st Qu.:-122   1st Qu.:1490   1st Qu.:  5100  
 Median :-122   Median :1840   Median :  7620  
 Mean   :-122   Mean   :1987   Mean   : 12768  
 3rd Qu.:-122   3rd Qu.:2360   3rd Qu.: 10083  
 Max.   :-121   Max.   :6210   Max.   :871200  
head(kc_house)
# A tibble: 6 x 21
  id    date                 price bedrooms bathrooms sqft_living sqft_lot
  <chr> <dttm>               <dbl>    <dbl>     <dbl>       <dbl>    <dbl>
1 7129… 2014-10-13 00:00:00 2.22e5        3      1           1180     5650
2 6414… 2014-12-09 00:00:00 5.38e5        3      2.25        2570     7242
3 5631… 2015-02-25 00:00:00 1.80e5        2      1            770    10000
4 2487… 2014-12-09 00:00:00 6.04e5        4      3           1960     5000
5 1954… 2015-02-18 00:00:00 5.10e5        3      2           1680     8080
6 7237… 2014-05-12 00:00:00 1.23e6        4      4.5         5420   101930
# … with 14 more variables: floors <dbl>, waterfront <dbl>, view <dbl>,
#   condition <dbl>, grade <dbl>, sqft_above <dbl>, sqft_basement <dbl>,
#   yr_built <dbl>, yr_renovated <dbl>, zipcode <dbl>, lat <dbl>, long <dbl>,
#   sqft_living15 <dbl>, sqft_lot15 <dbl>

B - Recap

  1. Print the names of the kc_house data with names().
names(kc_house)
 [1] "id"            "date"          "price"         "bedrooms"     
 [5] "bathrooms"     "sqft_living"   "sqft_lot"      "floors"       
 [9] "waterfront"    "view"          "condition"     "grade"        
[13] "sqft_above"    "sqft_basement" "yr_built"      "yr_renovated" 
[17] "zipcode"       "lat"           "long"          "sqft_living15"
[21] "sqft_lot15"   
  1. Change the following column names using rename().
New Name Old Name
living_sqft sqft_living
lot_sqft sqft_lot
above_sqft sqft_above
basement_sqft sqft_basement
built_yr yr_built
renovated_yr yr_renovated
kc_house <- kc_house %>%
  rename(NEW = OLD,
         NEW = OLD,
         NEW = OLD)
kc_house <- kc_house %>%
  rename(living_sqft = sqft_living,
         lot_sqft = sqft_lot,
         above_sqft = sqft_above,
         basement_sqft = sqft_basement,
         built_yr = yr_built,
         renovated_yr = yr_renovated)
  1. Create new column(s) living_sqm, lot_sqm, above_sqm and basement_sqm which show the respective room sizes in square meters rather than square feet (Hint: Multiply each by 0.093).
kc_house <- kc_house %>%
  mutate(living_sqm = XXX * XXX,
         lot_sqm = XXX * XXX,
         XXX = XXX,
         XXX = XXX)
kc_house <- kc_house %>%
  mutate(living_sqm = living_sqft * 0.093,
         lot_sqm = lot_sqft * 0.093,
         above_sqm = above_sqft * 0.093,
         basement_sqm  = basement_sqft * 0.093)
  1. Add a new variable to the dataframe called mansion which is “Yes” when the sum of the house’s living, above, and basement space is above 750 square meters.
kc_house <- kc_house %>%
                mutate(XXX = case_when(
                              XXX + XXX + XXX > XXX ~ "XXX",
                              XXXX + XXX + XXX <= XXX ~ "XXX"))
kc_house <- kc_house %>%
                mutate(mansion = case_when(
                              living_sqm + above_sqm +  basement_sqm > 750 ~ "Yes",
                              living_sqm + above_sqm + basement_sqm <= 750 ~ "No"))

C - Simple summaries

  1. Using the base-R df$col notation, calculate the mean price of all houses.
mean(XXX$XXX)
mean(kc_house$price)
[1] 540088
  1. Now, do the same using summarise() with the following template. Do you get the same answer? What is different about the output from summarise() versus using the dollar sign?
kc_house %>%
  summarise(
    price_mean = mean(XXX)
  )
kc_house %>%
  summarise(
    price_mean = mean(price)
  )
# A tibble: 1 x 1
  price_mean
       <dbl>
1    540088.
  1. What is the median price of all houses? Use the median() function!
kc_house %>%
  summarise(
    price_median = median(price)
  )
# A tibble: 1 x 1
  price_median
         <dbl>
1       450000
  1. What was the highest selling price? Use the max() function!
kc_house %>%
  summarise(
    price_max = max(price)
  )
# A tibble: 1 x 1
  price_max
      <dbl>
1   7700000
  1. Using the following template, sort the data frame in descending order of price. Then, print it. Do you see the house with the highest selling price at the top?
kc_house <- kc_house %>%
  arrange(desc(XXX))

kc_house
kc_house <- kc_house %>%
  arrange(desc(price))

kc_house
# A tibble: 21,613 x 26
   id    date                 price bedrooms bathrooms living_sqft lot_sqft
   <chr> <dttm>               <dbl>    <dbl>     <dbl>       <dbl>    <dbl>
 1 6762… 2014-10-13 00:00:00 7.70e6        6      8          12050    27600
 2 9808… 2014-06-11 00:00:00 7.06e6        5      4.5        10040    37325
 3 9208… 2014-09-19 00:00:00 6.88e6        6      7.75        9890    31374
 4 2470… 2014-08-04 00:00:00 5.57e6        5      5.75        9200    35069
 5 8907… 2015-04-13 00:00:00 5.35e6        5      5           8000    23985
 6 7558… 2015-04-13 00:00:00 5.30e6        6      6           7390    24829
 7 1247… 2014-10-20 00:00:00 5.11e6        5      5.25        8010    45517
 8 1924… 2014-06-17 00:00:00 4.67e6        5      6.75        9640    13068
 9 7738… 2014-08-15 00:00:00 4.50e6        5      5.5         6640    40014
10 3835… 2014-06-18 00:00:00 4.49e6        4      3           6430    27517
# … with 21,603 more rows, and 19 more variables: floors <dbl>,
#   waterfront <dbl>, view <dbl>, condition <dbl>, grade <dbl>,
#   above_sqft <dbl>, basement_sqft <dbl>, built_yr <dbl>, renovated_yr <dbl>,
#   zipcode <dbl>, lat <dbl>, long <dbl>, sqft_living15 <dbl>,
#   sqft_lot15 <dbl>, living_sqm <dbl>, lot_sqm <dbl>, above_sqm <dbl>,
#   basement_sqm <dbl>, mansion <chr>
  1. What percentage of houses sold for more than 1,000,000? Let’s answer this with summarise().
kc_house %>%
  summarise(mil_p = mean(XXX > 1000000))
kc_house %>%
  summarise(mil_p = mean(price > 1000000))
# A tibble: 1 x 1
   mil_p
   <dbl>
1 0.0678
  1. For mansions only, calculate the mean number of floors and bathrooms (hint: before summarising the data, use filter() to only select mansions!)
kc_house %>%
  filter(XXX == XXX) %>%
  summarise(
    floors_mean = XXX(XXX),
    bathrooms_mean = XXX(XXX)
  )
kc_house %>%
  filter(mansion == "Yes") %>%
  summarise(
    floors_mean = mean(floors),
    bathrooms_mean = mean(bathrooms)
  )
# A tibble: 1 x 2
  floors_mean bathrooms_mean
        <dbl>          <dbl>
1        1.92           3.68

D - Simple grouped summaries

  1. How many mansions are there? To do this, use group_by() to group the dataset by the mansions column, then use the n() function to count the number of cases.
kc_house %>%
  group_by(XXX) %>%
  summarise(XXX = n())
kc_house %>%
  group_by(mansion) %>%
  summarise(N = n())
# A tibble: 2 x 2
  mansion     N
  <chr>   <int>
1 No      20862
2 Yes       751
  1. What is the mean selling price of mansions versus non-mansions? To do this, just add another argument to your summarise() function!
kc_house %>%
  group_by(mansion) %>%
  summarise(N = n(),
            XXX = XXX(XXX))
kc_house %>%
  group_by(mansion) %>%
  summarise(N = n(),
            price_mean = mean(price))
# A tibble: 2 x 3
  mansion     N price_mean
  <chr>   <int>      <dbl>
1 No      20862    504024.
2 Yes       751   1541915.
  1. Using group_by() and summarise(), create a dataframe showing the same results as the following table.
kc_house %>%
  group_by(mansion) %>%
  summarise(N = n(),
            price_min = min(price),
            price_mean = mean(price),
            price_median = median(price),
            price_max = max(price)) %>%
  knitr::kable()
mansion N price_min price_mean price_median price_max
No 20862 75000 504024 441000 3100000
Yes 751 404000 1541915 1300000 7700000
  1. Do houses built in later years tend to have more living space? To answer this, group the data by built_yr, and then calculate the mean number of living square meters. Be sure to also include the number of houses built in each year!
kc_house %>%
  group_by(built_yr) %>%
  summarise(N = n(),
            living = mean(living_sqm))
# A tibble: 116 x 3
   built_yr     N living
      <dbl> <int>  <dbl>
 1     1900    87   161.
 2     1901    29   164.
 3     1902    27   179.
 4     1903    46   140.
 5     1904    45   149.
 6     1905    74   183.
 7     1906    92   168.
 8     1907    65   177.
 9     1908    86   158.
10     1909    94   177.
# … with 106 more rows
  1. Was that table too big? Try using the following code to get the results for each decade rather than each year!
kc_house %>%
  mutate(built_decade = floor(built_yr / 10)) %>%
  group_by(built_decade) %>%
  summarise(XX = XX,
            XX = XX(XX))
  1. A friend of yours who is getting into Seattle real estate wants to know how the number of floors a house has affects its selling price. Create a table for her showing the minimum, mean, and maximum price for houses separated by the number of floors they have.

E - Multiple groups

  1. Your friend Brumhilda is interested in statistics on houses in the following 4 zipcodes only: 98001, 98109, 98117, 98199. Create a new dataframe called brumhilda that only contains data from houses in those zipcode (hint: use filter() combined with the %in% operator as follows:
brumhilda <- kc_house %>%
  filter(XXX %in% c(XXX, XXX, XXX, XXX))
brumhilda <- kc_house %>%
  filter(zipcode %in% c(98001, 98109, 98117, 98199))
  1. For each of the zip codes, calculate the mean() and median() selling price (as well as the number of houses) in each zip code.
brumhilda %>%
  group_by(zipcode) %>%
  summarise(price_mean = mean(price),
           price_median = median(price),
           N = n())
# A tibble: 4 x 4
  zipcode price_mean price_median     N
    <dbl>      <dbl>        <dbl> <int>
1   98001    280805.       260000   362
2   98109    879624.       736000   109
3   98117    576795.       544000   553
4   98199    791821.       689800   317
  1. Now Brumhilda wants the same data separated by whether or not the house is a mansion or not. Include these results by also grouping the data by mansion (as well as zipcode), and calculating the same summary statistics as before.
brumhilda %>%
  group_by(zipcode, mansion) %>%
  summarise(price_mean = mean(price),
           price_median = median(price),
           N = n())
# A tibble: 8 x 5
# Groups:   zipcode [4]
  zipcode mansion price_mean price_median     N
    <dbl> <chr>        <dbl>        <dbl> <int>
1   98001 No         277589.       260000   359
2   98001 Yes        665667.       637000     3
3   98109 No         833528.       730500   106
4   98109 Yes       2508333.      2900000     3
5   98117 No         575626.       543000   551
6   98117 Yes        898750        898750     2
7   98199 No         753625.       675000   305
8   98199 Yes       1762618.      1425000    12
  1. Ok that was good, but now she also wants to know what the maximum and minimum number of floors were in each group. Add these summary statistics!
brumhilda %>%
  group_by(zipcode) %>%
  summarise(price_mean = mean(price),
           price_median = median(price),
           floors_min = min(floors),
           floors_max = max(floors),
           N = n())
# A tibble: 4 x 6
  zipcode price_mean price_median floors_min floors_max     N
    <dbl>      <dbl>        <dbl>      <dbl>      <dbl> <int>
1   98001    280805.       260000          1        2.5   362
2   98109    879624.       736000          1        3     109
3   98117    576795.       544000          1        3     553
4   98199    791821.       689800          1        3     317

F - Statistics

  1. Let’s see if there is a significant difference between the selling prices of houses on the waterfront versus those not on the waterfront. To do this, you’ll conduct a t-test using the t.test() function and assign the result to waterfront_htest. To prepare for the t-test, you need to create two vectors carrying the prices for the houses on the waterfront and those not on the waterfront. (Note: this is a bit of a hack that we don’t need when speciying models the proper way -> next session)
# create list separating values for waterfron yes and no
prices <- kc_house %>%
  mutate(waterfront_lab = 
           case_when(waterfront == 0 ~ 'no_waterfront',
                     waterfront == 1 ~ 'waterfront')) %>%
  with(split(price, waterfront_lab))

# show names
  1. Fill in the XXs in the code below, such that x becomes the 'no_waterfront' prices and y becomes the 'waterfront' prices.
waterfront_htest <- t.test(x = prices$XX, y = prices$XX)
waterfront_htest <- t.test(x = prices$no_waterfront, y = prices$waterfront)
  1. Print your waterfront_htest object to see a printout of the main results.
waterfront_htest

    Welch Two Sample t-test

data:  prices$no_waterfront and prices$waterfront
t = -13, df = 162, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1303662  -956963
sample estimates:
mean of x mean of y 
   531564   1661876 
  1. Look at the names of your waterfront_htest object with names().
names(waterfront_htest)
 [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
 [6] "null.value"  "stderr"      "alternative" "method"      "data.name"  
  1. Using the $, print the test statistic (statistic) from your waterfront_htest object.
waterfront_htest$statistic
    t 
-12.9 
  1. Now using $, print only the p-value (p.value) from the object.
waterfront_htest$p.value
[1] 1.38e-26
  1. Run a Wilcoxon test instead of the t-test to evalute whether the result is robust. Is the result the same?
wilcox.test(x = prices$XX, y = prices$XX)
wilcox.test(x = prices$no_waterfront, y = prices$waterfront)

    Wilcoxon rank sum test with continuity correction

data:  prices$no_waterfront and prices$waterfront
W = 4e+05, p-value <2e-16
alternative hypothesis: true location shift is not equal to 0

X - Challenges

  1. Which zipcode has the highest percentage of houses on the waterfront? (Hint: group by zipcode, calculate the percentage of houses on the waterfront using mean(), then sort the data in descending order) with arrange(), then select the first row with slice(). Once you find it, try searching for that zipcode on Google Maps and see if it’s location makes sense!
kc_house %>%
  group_by(zipcode) %>%
  summarise(waterfront_p = mean(waterfront)) %>%
  arrange(desc(waterfront_p)) %>%
  slice(1)
# A tibble: 1 x 2
  zipcode waterfront_p
    <dbl>        <dbl>
1   98070        0.203
  1. Which house had the highest price to living space ratio? To answer this, create a new variable called price_to_living that takes price / living_sqm. Then, sort the data in descending order of this variable, and select the first row with slice()! What id value do you get?
kc_house %>%
  mutate(price_to_living = price / living_sqm) %>%
  arrange(desc(price_to_living)) %>%
  slice(1)
  1. Which are the top 10 zip codes in terms of mean housing prices? To answer this, group the data by zipcode, calculate the mean price, arrange the dataset in descending order of mean price, then select the top 10 rows!
kc_house %>%
  group_by(zipcode) %>%
  summarise(price_mean = mean(price)) %>%
  arrange(desc(price_mean)) %>%
  slice(1:10)
# A tibble: 10 x 2
   zipcode price_mean
     <dbl>      <dbl>
 1   98039   2160607.
 2   98004   1355927.
 3   98040   1194230.
 4   98112   1095499.
 5   98102    901258.
 6   98109    879624.
 7   98105    862825.
 8   98006    859685.
 9   98119    849448.
10   98005    810165.
  1. Create the following dataframe exactly as it appears.
kc_house %>%
  filter(built_yr >= 1990 & built_yr < 1999) %>%
  group_by(built_yr) %>%
  summarise(N = n(),
            price_mean = mean(price),
            price_max = max(price),
            living_sqm_mean = mean(living_sqm)) %>%
  knitr::kable(digits = 0)
built_yr N price_mean price_max living_sqm_mean
1990 320 563966 3640900 234
1991 224 630441 5300000 244
1992 198 548169 2480000 223
1993 202 556612 3120000 226
1994 249 486834 2880500 209
1995 169 577771 3200000 224
1996 195 639534 3100000 240
1997 177 606058 3800000 234
1998 239 594159 1960000 241
built_yr N price_mean price_max living_sqm_mean
1990 320 563966 3640900 234
1991 224 630441 5300000 244
1992 198 548169 2480000 223
1993 202 556612 3120000 226
1994 249 486834 2880500 209
1995 169 577771 3200000 224
1996 195 639534 3100000 240
1997 177 606058 3800000 234
1998 239 594159 1960000 241
kc_house %>%
  filter(built_yr >= 1990 & built_yr < 1999) %>%
  group_by(built_yr) %>%
  summarise(N = n(),
            price_mean = mean(price),
            price_max = max(price),
            living_sqm_mean = mean(living_sqm)) %>%
  knitr::kable(digits = 0)
built_yr N price_mean price_max living_sqm_mean
1990 320 563966 3640900 234
1991 224 630441 5300000 244
1992 198 548169 2480000 223
1993 202 556612 3120000 226
1994 249 486834 2880500 209
1995 169 577771 3200000 224
1996 195 639534 3100000 240
1997 177 606058 3800000 234
1998 239 594159 1960000 241
  1. The chisq.test() function allows you to do conduct a chi square test testing the relationship between two nominal variables. Look at the help menu to see how the function works. Then, conduct a chi-square test to see if there is a relationship between whether a house is on the waterfront and the grade of the house. Do houses on the waterfront tend to have higher (or lower) grades than houses not on the waterfront?
# First look at a table
table(kc_house$waterfront, kc_house$grade)
   
       1    3    4    5    6    7    8    9   10   11   12   13
  0    1    3   29  238 2026 8958 6028 2590 1106  379   79   13
  1    0    0    0    4   12   23   40   25   28   20   11    0
# Then run the test
chisq.test(table(kc_house$waterfront, kc_house$grade))

    Pearson's Chi-squared test

data:  table(kc_house$waterfront, kc_house$grade)
X-squared = 335, df = 11, p-value <2e-16

Examples

# Wrangling with dplyr and tidyr ---------------------------

library(tidyverse)    # Load tidyverse for dplyr and tidyr

# Load baselers data
baselers <- read_csv("1_Data/baselers.txt")

# No grouping variables

bas <- baselers %>%
  summarise(
    age_mean = mean(age, na.rm = TRUE),
    income_median = median(income, na.rm = TRUE),
    N = n()
  )

# One grouping variable
bas_sex <- baselers %>%
  group_by(sex) %>%
  summarise(
    age_mean = mean(age, na.rm = TRUE),
    income_median = median(income, na.rm = TRUE),
    N = n()
  )

bas_sex

# Two grouping variables
bas_sex_ed <- baselers %>%
  group_by(sex, education) %>%
  summarise(
    age_mean = mean(age, na.rm = TRUE),
    income_median = median(income, na.rm = TRUE),
    N = n()
  )

# Advanced scoping

# Calculate mean of ALL numeric variables
baselers %>%
  group_by(sex, education) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)


# Examples of hypothesis tests on the diamonds -------------

library(tidyverse)
library(broom)
library(rsq)

# First few rows of the diamonds data

diamonds


# 2-sample t- test ---------------------------

# Q: Is there a difference in the carats of color = E and color = I diamonds?

# split data
carat <- diamonds %>%
  filter(color %in% c("E", "I")) %>%
  with(split(carat, color))

# run t-test
htest <- t.test(carat$E, carat$I,
                alternative = "two.sided")  # Two-sided test
htest  # Print result

# run wilcoxon test
htest <- wilcox.test(carat$E, carat$I,
                alternative = "two.sided")  # Two-sided test
htest  # Print result

Datasets

File Rows Columns Description
kc_house.csv 21613 21 House sale prices for King County between May 2014 and May 2015.

1

Functions

Wrangling

Function Package Description
rename() dplyr Rename columns
select() dplyr Select columns based on name or index
filter() dplyr Select rows based on some logical criteria
arrange() dplyr Sort rows
mutate() dplyr Add new columns
case_when() dplyr Recode values of a column
group_by(), summarise() dplyr Group data and then calculate summary statistics
left_join() dplyr Combine multiple data sets using a key column
pivot_wider() tidyr Convert long data to wide format - from rows to columns
pivot_longer() tidyr Convert wide data to long format - from columns to rows

Statistical Tests

Function Hypothesis Test
t.test() One and two sample t-test
wilcox.test() Wilcoxon test
cor.test() Correlation test
chisq.test Chi-square tests for frequency tables

Resources

dplyr vignette

See https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html for the full dplyr vignette with lots of wrangling tips and tricks.

Cheatsheets


from R Studio