
adapted from trueloveproperty.co.uk
In this practical, you’ll practice grouping and analysing data with the dplyr and tidyr packages (part of the `tidyverse collection of packages).
By the end of this practical you will know how to:
BernRBootcamp R project. It should already have the folders 1_Data and 2_Code. Make sure that the data files listed in the Datasets section above are in your 1_Data folder.# Done!
statisticsI_practical.R in the 2_Code folder.library() load the set of packages for this practical listed in the packages section above.library(tidyverse)
kc_house.csv data. This dataset contains house sale prices for King County, Washington. It includes homes sold between May 2014 and May 2015. Using the following template, load the data into R and store it as a new object called kc_house.kc_house <- read_csv(file = "XX")
print(), summary(), and head(), explore the data to make sure it was loaded correctly.kc_house
# A tibble: 21,613 x 21
id date price bedrooms bathrooms sqft_living sqft_lot
<chr> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
1 7129… 2014-10-13 00:00:00 2.22e5 3 1 1180 5650
2 6414… 2014-12-09 00:00:00 5.38e5 3 2.25 2570 7242
3 5631… 2015-02-25 00:00:00 1.80e5 2 1 770 10000
4 2487… 2014-12-09 00:00:00 6.04e5 4 3 1960 5000
5 1954… 2015-02-18 00:00:00 5.10e5 3 2 1680 8080
6 7237… 2014-05-12 00:00:00 1.23e6 4 4.5 5420 101930
7 1321… 2014-06-27 00:00:00 2.58e5 3 2.25 1715 6819
8 2008… 2015-01-15 00:00:00 2.92e5 3 1.5 1060 9711
9 2414… 2015-04-15 00:00:00 2.30e5 3 1 1780 7470
10 3793… 2015-03-12 00:00:00 3.23e5 3 2.5 1890 6560
# … with 21,603 more rows, and 14 more variables: floors <dbl>,
# waterfront <dbl>, view <dbl>, condition <dbl>, grade <dbl>,
# sqft_above <dbl>, sqft_basement <dbl>, yr_built <dbl>, yr_renovated <dbl>,
# zipcode <dbl>, lat <dbl>, long <dbl>, sqft_living15 <dbl>, sqft_lot15 <dbl>
summary(kc_house)
id date price
Length:21613 Min. :2014-05-02 00:00:00 Min. : 75000
Class :character 1st Qu.:2014-07-22 00:00:00 1st Qu.: 321950
Mode :character Median :2014-10-16 00:00:00 Median : 450000
Mean :2014-10-29 04:38:01 Mean : 540088
3rd Qu.:2015-02-17 00:00:00 3rd Qu.: 645000
Max. :2015-05-27 00:00:00 Max. :7700000
bedrooms bathrooms sqft_living sqft_lot floors
Min. : 0.0 Min. :0.00 Min. : 290 Min. : 520 Min. :1.00
1st Qu.: 3.0 1st Qu.:1.75 1st Qu.: 1427 1st Qu.: 5040 1st Qu.:1.00
Median : 3.0 Median :2.25 Median : 1910 Median : 7618 Median :1.50
Mean : 3.4 Mean :2.11 Mean : 2080 Mean : 15107 Mean :1.49
3rd Qu.: 4.0 3rd Qu.:2.50 3rd Qu.: 2550 3rd Qu.: 10688 3rd Qu.:2.00
Max. :33.0 Max. :8.00 Max. :13540 Max. :1651359 Max. :3.50
waterfront view condition grade sqft_above
Min. :0.000 Min. :0.00 Min. :1.00 Min. : 1.00 Min. : 290
1st Qu.:0.000 1st Qu.:0.00 1st Qu.:3.00 1st Qu.: 7.00 1st Qu.:1190
Median :0.000 Median :0.00 Median :3.00 Median : 7.00 Median :1560
Mean :0.008 Mean :0.23 Mean :3.41 Mean : 7.66 Mean :1788
3rd Qu.:0.000 3rd Qu.:0.00 3rd Qu.:4.00 3rd Qu.: 8.00 3rd Qu.:2210
Max. :1.000 Max. :4.00 Max. :5.00 Max. :13.00 Max. :9410
sqft_basement yr_built yr_renovated zipcode lat
Min. : 0 Min. :1900 Min. : 0 Min. :98001 Min. :47.2
1st Qu.: 0 1st Qu.:1951 1st Qu.: 0 1st Qu.:98033 1st Qu.:47.5
Median : 0 Median :1975 Median : 0 Median :98065 Median :47.6
Mean : 292 Mean :1971 Mean : 84 Mean :98078 Mean :47.6
3rd Qu.: 560 3rd Qu.:1997 3rd Qu.: 0 3rd Qu.:98118 3rd Qu.:47.7
Max. :4820 Max. :2015 Max. :2015 Max. :98199 Max. :47.8
long sqft_living15 sqft_lot15
Min. :-122 Min. : 399 Min. : 651
1st Qu.:-122 1st Qu.:1490 1st Qu.: 5100
Median :-122 Median :1840 Median : 7620
Mean :-122 Mean :1987 Mean : 12768
3rd Qu.:-122 3rd Qu.:2360 3rd Qu.: 10083
Max. :-121 Max. :6210 Max. :871200
head(kc_house)
# A tibble: 6 x 21
id date price bedrooms bathrooms sqft_living sqft_lot
<chr> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
1 7129… 2014-10-13 00:00:00 2.22e5 3 1 1180 5650
2 6414… 2014-12-09 00:00:00 5.38e5 3 2.25 2570 7242
3 5631… 2015-02-25 00:00:00 1.80e5 2 1 770 10000
4 2487… 2014-12-09 00:00:00 6.04e5 4 3 1960 5000
5 1954… 2015-02-18 00:00:00 5.10e5 3 2 1680 8080
6 7237… 2014-05-12 00:00:00 1.23e6 4 4.5 5420 101930
# … with 14 more variables: floors <dbl>, waterfront <dbl>, view <dbl>,
# condition <dbl>, grade <dbl>, sqft_above <dbl>, sqft_basement <dbl>,
# yr_built <dbl>, yr_renovated <dbl>, zipcode <dbl>, lat <dbl>, long <dbl>,
# sqft_living15 <dbl>, sqft_lot15 <dbl>
kc_house data with names().names(kc_house)
[1] "id" "date" "price" "bedrooms"
[5] "bathrooms" "sqft_living" "sqft_lot" "floors"
[9] "waterfront" "view" "condition" "grade"
[13] "sqft_above" "sqft_basement" "yr_built" "yr_renovated"
[17] "zipcode" "lat" "long" "sqft_living15"
[21] "sqft_lot15"
rename().| New Name | Old Name |
|---|---|
| living_sqft | sqft_living |
| lot_sqft | sqft_lot |
| above_sqft | sqft_above |
| basement_sqft | sqft_basement |
| built_yr | yr_built |
| renovated_yr | yr_renovated |
kc_house <- kc_house %>%
rename(NEW = OLD,
NEW = OLD,
NEW = OLD)
kc_house <- kc_house %>%
rename(living_sqft = sqft_living,
lot_sqft = sqft_lot,
above_sqft = sqft_above,
basement_sqft = sqft_basement,
built_yr = yr_built,
renovated_yr = yr_renovated)
living_sqm, lot_sqm, above_sqm and basement_sqm which show the respective room sizes in square meters rather than square feet (Hint: Multiply each by 0.093).kc_house <- kc_house %>%
mutate(living_sqm = XXX * XXX,
lot_sqm = XXX * XXX,
XXX = XXX,
XXX = XXX)
kc_house <- kc_house %>%
mutate(living_sqm = living_sqft * 0.093,
lot_sqm = lot_sqft * 0.093,
above_sqm = above_sqft * 0.093,
basement_sqm = basement_sqft * 0.093)
mansion which is “Yes” when the sum of the house’s living, above, and basement space is above 750 square meters.kc_house <- kc_house %>%
mutate(XXX = case_when(
XXX + XXX + XXX > XXX ~ "XXX",
XXXX + XXX + XXX <= XXX ~ "XXX"))
kc_house <- kc_house %>%
mutate(mansion = case_when(
living_sqm + above_sqm + basement_sqm > 750 ~ "Yes",
living_sqm + above_sqm + basement_sqm <= 750 ~ "No"))
df$col notation, calculate the mean price of all houses.mean(XXX$XXX)
mean(kc_house$price)
[1] 540088
summarise() with the following template. Do you get the same answer? What is different about the output from summarise() versus using the dollar sign?kc_house %>%
summarise(
price_mean = mean(XXX)
)
kc_house %>%
summarise(
price_mean = mean(price)
)
# A tibble: 1 x 1
price_mean
<dbl>
1 540088.
median() function!kc_house %>%
summarise(
price_median = median(price)
)
# A tibble: 1 x 1
price_median
<dbl>
1 450000
max() function!kc_house %>%
summarise(
price_max = max(price)
)
# A tibble: 1 x 1
price_max
<dbl>
1 7700000
kc_house <- kc_house %>%
arrange(desc(XXX))
kc_house
kc_house <- kc_house %>%
arrange(desc(price))
kc_house
# A tibble: 21,613 x 26
id date price bedrooms bathrooms living_sqft lot_sqft
<chr> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
1 6762… 2014-10-13 00:00:00 7.70e6 6 8 12050 27600
2 9808… 2014-06-11 00:00:00 7.06e6 5 4.5 10040 37325
3 9208… 2014-09-19 00:00:00 6.88e6 6 7.75 9890 31374
4 2470… 2014-08-04 00:00:00 5.57e6 5 5.75 9200 35069
5 8907… 2015-04-13 00:00:00 5.35e6 5 5 8000 23985
6 7558… 2015-04-13 00:00:00 5.30e6 6 6 7390 24829
7 1247… 2014-10-20 00:00:00 5.11e6 5 5.25 8010 45517
8 1924… 2014-06-17 00:00:00 4.67e6 5 6.75 9640 13068
9 7738… 2014-08-15 00:00:00 4.50e6 5 5.5 6640 40014
10 3835… 2014-06-18 00:00:00 4.49e6 4 3 6430 27517
# … with 21,603 more rows, and 19 more variables: floors <dbl>,
# waterfront <dbl>, view <dbl>, condition <dbl>, grade <dbl>,
# above_sqft <dbl>, basement_sqft <dbl>, built_yr <dbl>, renovated_yr <dbl>,
# zipcode <dbl>, lat <dbl>, long <dbl>, sqft_living15 <dbl>,
# sqft_lot15 <dbl>, living_sqm <dbl>, lot_sqm <dbl>, above_sqm <dbl>,
# basement_sqm <dbl>, mansion <chr>
summarise().kc_house %>%
summarise(mil_p = mean(XXX > 1000000))
kc_house %>%
summarise(mil_p = mean(price > 1000000))
# A tibble: 1 x 1
mil_p
<dbl>
1 0.0678
filter() to only select mansions!)kc_house %>%
filter(XXX == XXX) %>%
summarise(
floors_mean = XXX(XXX),
bathrooms_mean = XXX(XXX)
)
kc_house %>%
filter(mansion == "Yes") %>%
summarise(
floors_mean = mean(floors),
bathrooms_mean = mean(bathrooms)
)
# A tibble: 1 x 2
floors_mean bathrooms_mean
<dbl> <dbl>
1 1.92 3.68
group_by() to group the dataset by the mansions column, then use the n() function to count the number of cases.kc_house %>%
group_by(XXX) %>%
summarise(XXX = n())
kc_house %>%
group_by(mansion) %>%
summarise(N = n())
# A tibble: 2 x 2
mansion N
<chr> <int>
1 No 20862
2 Yes 751
summarise() function!kc_house %>%
group_by(mansion) %>%
summarise(N = n(),
XXX = XXX(XXX))
kc_house %>%
group_by(mansion) %>%
summarise(N = n(),
price_mean = mean(price))
# A tibble: 2 x 3
mansion N price_mean
<chr> <int> <dbl>
1 No 20862 504024.
2 Yes 751 1541915.
group_by() and summarise(), create a dataframe showing the same results as the following table.kc_house %>%
group_by(mansion) %>%
summarise(N = n(),
price_min = min(price),
price_mean = mean(price),
price_median = median(price),
price_max = max(price)) %>%
knitr::kable()
| mansion | N | price_min | price_mean | price_median | price_max |
|---|---|---|---|---|---|
| No | 20862 | 75000 | 504024 | 441000 | 3100000 |
| Yes | 751 | 404000 | 1541915 | 1300000 | 7700000 |
built_yr, and then calculate the mean number of living square meters. Be sure to also include the number of houses built in each year!kc_house %>%
group_by(built_yr) %>%
summarise(N = n(),
living = mean(living_sqm))
# A tibble: 116 x 3
built_yr N living
<dbl> <int> <dbl>
1 1900 87 161.
2 1901 29 164.
3 1902 27 179.
4 1903 46 140.
5 1904 45 149.
6 1905 74 183.
7 1906 92 168.
8 1907 65 177.
9 1908 86 158.
10 1909 94 177.
# … with 106 more rows
kc_house %>%
mutate(built_decade = floor(built_yr / 10)) %>%
group_by(built_decade) %>%
summarise(XX = XX,
XX = XX(XX))
brumhilda that only contains data from houses in those zipcode (hint: use filter() combined with the %in% operator as follows:brumhilda <- kc_house %>%
filter(XXX %in% c(XXX, XXX, XXX, XXX))
brumhilda <- kc_house %>%
filter(zipcode %in% c(98001, 98109, 98117, 98199))
mean() and median() selling price (as well as the number of houses) in each zip code.brumhilda %>%
group_by(zipcode) %>%
summarise(price_mean = mean(price),
price_median = median(price),
N = n())
# A tibble: 4 x 4
zipcode price_mean price_median N
<dbl> <dbl> <dbl> <int>
1 98001 280805. 260000 362
2 98109 879624. 736000 109
3 98117 576795. 544000 553
4 98199 791821. 689800 317
mansion (as well as zipcode), and calculating the same summary statistics as before.brumhilda %>%
group_by(zipcode, mansion) %>%
summarise(price_mean = mean(price),
price_median = median(price),
N = n())
# A tibble: 8 x 5
# Groups: zipcode [4]
zipcode mansion price_mean price_median N
<dbl> <chr> <dbl> <dbl> <int>
1 98001 No 277589. 260000 359
2 98001 Yes 665667. 637000 3
3 98109 No 833528. 730500 106
4 98109 Yes 2508333. 2900000 3
5 98117 No 575626. 543000 551
6 98117 Yes 898750 898750 2
7 98199 No 753625. 675000 305
8 98199 Yes 1762618. 1425000 12
brumhilda %>%
group_by(zipcode) %>%
summarise(price_mean = mean(price),
price_median = median(price),
floors_min = min(floors),
floors_max = max(floors),
N = n())
# A tibble: 4 x 6
zipcode price_mean price_median floors_min floors_max N
<dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 98001 280805. 260000 1 2.5 362
2 98109 879624. 736000 1 3 109
3 98117 576795. 544000 1 3 553
4 98199 791821. 689800 1 3 317
t.test() function and assign the result to waterfront_htest. To prepare for the t-test, you need to create two vectors carrying the prices for the houses on the waterfront and those not on the waterfront. (Note: this is a bit of a hack that we don’t need when speciying models the proper way -> next session)# create list separating values for waterfron yes and no
prices <- kc_house %>%
mutate(waterfront_lab =
case_when(waterfront == 0 ~ 'no_waterfront',
waterfront == 1 ~ 'waterfront')) %>%
with(split(price, waterfront_lab))
# show names
x becomes the 'no_waterfront' prices and y becomes the 'waterfront' prices.waterfront_htest <- t.test(x = prices$XX, y = prices$XX)
waterfront_htest <- t.test(x = prices$no_waterfront, y = prices$waterfront)
waterfront_htest object to see a printout of the main results.waterfront_htest
Welch Two Sample t-test
data: prices$no_waterfront and prices$waterfront
t = -13, df = 162, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1303662 -956963
sample estimates:
mean of x mean of y
531564 1661876
waterfront_htest object with names().names(waterfront_htest)
[1] "statistic" "parameter" "p.value" "conf.int" "estimate"
[6] "null.value" "stderr" "alternative" "method" "data.name"
$, print the test statistic (statistic) from your waterfront_htest object.waterfront_htest$statistic
t
-12.9
$, print only the p-value (p.value) from the object.waterfront_htest$p.value
[1] 1.38e-26
wilcox.test(x = prices$XX, y = prices$XX)
wilcox.test(x = prices$no_waterfront, y = prices$waterfront)
Wilcoxon rank sum test with continuity correction
data: prices$no_waterfront and prices$waterfront
W = 4e+05, p-value <2e-16
alternative hypothesis: true location shift is not equal to 0
mean(), then sort the data in descending order) with arrange(), then select the first row with slice(). Once you find it, try searching for that zipcode on Google Maps and see if it’s location makes sense!kc_house %>%
group_by(zipcode) %>%
summarise(waterfront_p = mean(waterfront)) %>%
arrange(desc(waterfront_p)) %>%
slice(1)
# A tibble: 1 x 2
zipcode waterfront_p
<dbl> <dbl>
1 98070 0.203
price_to_living that takes price / living_sqm. Then, sort the data in descending order of this variable, and select the first row with slice()! What id value do you get?kc_house %>%
mutate(price_to_living = price / living_sqm) %>%
arrange(desc(price_to_living)) %>%
slice(1)
kc_house %>%
group_by(zipcode) %>%
summarise(price_mean = mean(price)) %>%
arrange(desc(price_mean)) %>%
slice(1:10)
# A tibble: 10 x 2
zipcode price_mean
<dbl> <dbl>
1 98039 2160607.
2 98004 1355927.
3 98040 1194230.
4 98112 1095499.
5 98102 901258.
6 98109 879624.
7 98105 862825.
8 98006 859685.
9 98119 849448.
10 98005 810165.
kc_house %>%
filter(built_yr >= 1990 & built_yr < 1999) %>%
group_by(built_yr) %>%
summarise(N = n(),
price_mean = mean(price),
price_max = max(price),
living_sqm_mean = mean(living_sqm)) %>%
knitr::kable(digits = 0)
| built_yr | N | price_mean | price_max | living_sqm_mean |
|---|---|---|---|---|
| 1990 | 320 | 563966 | 3640900 | 234 |
| 1991 | 224 | 630441 | 5300000 | 244 |
| 1992 | 198 | 548169 | 2480000 | 223 |
| 1993 | 202 | 556612 | 3120000 | 226 |
| 1994 | 249 | 486834 | 2880500 | 209 |
| 1995 | 169 | 577771 | 3200000 | 224 |
| 1996 | 195 | 639534 | 3100000 | 240 |
| 1997 | 177 | 606058 | 3800000 | 234 |
| 1998 | 239 | 594159 | 1960000 | 241 |
| built_yr | N | price_mean | price_max | living_sqm_mean |
|---|---|---|---|---|
| 1990 | 320 | 563966 | 3640900 | 234 |
| 1991 | 224 | 630441 | 5300000 | 244 |
| 1992 | 198 | 548169 | 2480000 | 223 |
| 1993 | 202 | 556612 | 3120000 | 226 |
| 1994 | 249 | 486834 | 2880500 | 209 |
| 1995 | 169 | 577771 | 3200000 | 224 |
| 1996 | 195 | 639534 | 3100000 | 240 |
| 1997 | 177 | 606058 | 3800000 | 234 |
| 1998 | 239 | 594159 | 1960000 | 241 |
kc_house %>%
filter(built_yr >= 1990 & built_yr < 1999) %>%
group_by(built_yr) %>%
summarise(N = n(),
price_mean = mean(price),
price_max = max(price),
living_sqm_mean = mean(living_sqm)) %>%
knitr::kable(digits = 0)
| built_yr | N | price_mean | price_max | living_sqm_mean |
|---|---|---|---|---|
| 1990 | 320 | 563966 | 3640900 | 234 |
| 1991 | 224 | 630441 | 5300000 | 244 |
| 1992 | 198 | 548169 | 2480000 | 223 |
| 1993 | 202 | 556612 | 3120000 | 226 |
| 1994 | 249 | 486834 | 2880500 | 209 |
| 1995 | 169 | 577771 | 3200000 | 224 |
| 1996 | 195 | 639534 | 3100000 | 240 |
| 1997 | 177 | 606058 | 3800000 | 234 |
| 1998 | 239 | 594159 | 1960000 | 241 |
chisq.test() function allows you to do conduct a chi square test testing the relationship between two nominal variables. Look at the help menu to see how the function works. Then, conduct a chi-square test to see if there is a relationship between whether a house is on the waterfront and the grade of the house. Do houses on the waterfront tend to have higher (or lower) grades than houses not on the waterfront?# First look at a table
table(kc_house$waterfront, kc_house$grade)
1 3 4 5 6 7 8 9 10 11 12 13
0 1 3 29 238 2026 8958 6028 2590 1106 379 79 13
1 0 0 0 4 12 23 40 25 28 20 11 0
# Then run the test
chisq.test(table(kc_house$waterfront, kc_house$grade))
Pearson's Chi-squared test
data: table(kc_house$waterfront, kc_house$grade)
X-squared = 335, df = 11, p-value <2e-16
# Wrangling with dplyr and tidyr ---------------------------
library(tidyverse) # Load tidyverse for dplyr and tidyr
# Load baselers data
baselers <- read_csv("1_Data/baselers.txt")
# No grouping variables
bas <- baselers %>%
summarise(
age_mean = mean(age, na.rm = TRUE),
income_median = median(income, na.rm = TRUE),
N = n()
)
# One grouping variable
bas_sex <- baselers %>%
group_by(sex) %>%
summarise(
age_mean = mean(age, na.rm = TRUE),
income_median = median(income, na.rm = TRUE),
N = n()
)
bas_sex
# Two grouping variables
bas_sex_ed <- baselers %>%
group_by(sex, education) %>%
summarise(
age_mean = mean(age, na.rm = TRUE),
income_median = median(income, na.rm = TRUE),
N = n()
)
# Advanced scoping
# Calculate mean of ALL numeric variables
baselers %>%
group_by(sex, education) %>%
summarise_if(is.numeric, mean, na.rm = TRUE)
# Examples of hypothesis tests on the diamonds -------------
library(tidyverse)
library(broom)
library(rsq)
# First few rows of the diamonds data
diamonds
# 2-sample t- test ---------------------------
# Q: Is there a difference in the carats of color = E and color = I diamonds?
# split data
carat <- diamonds %>%
filter(color %in% c("E", "I")) %>%
with(split(carat, color))
# run t-test
htest <- t.test(carat$E, carat$I,
alternative = "two.sided") # Two-sided test
htest # Print result
# run wilcoxon test
htest <- wilcox.test(carat$E, carat$I,
alternative = "two.sided") # Two-sided test
htest # Print result
| File | Rows | Columns | Description |
|---|---|---|---|
| kc_house.csv | 21613 | 21 | House sale prices for King County between May 2014 and May 2015. |
1
Wrangling
| Function | Package | Description |
|---|---|---|
rename() |
dplyr |
Rename columns |
select() |
dplyr |
Select columns based on name or index |
filter() |
dplyr |
Select rows based on some logical criteria |
arrange() |
dplyr |
Sort rows |
mutate() |
dplyr |
Add new columns |
case_when() |
dplyr |
Recode values of a column |
group_by(), summarise() |
dplyr |
Group data and then calculate summary statistics |
left_join() |
dplyr |
Combine multiple data sets using a key column |
pivot_wider() |
tidyr |
Convert long data to wide format - from rows to columns |
pivot_longer() |
tidyr |
Convert wide data to long format - from columns to rows |
Statistical Tests
| Function | Hypothesis Test |
|---|---|
t.test() |
One and two sample t-test |
wilcox.test() |
Wilcoxon test |
cor.test() |
Correlation test |
chisq.test |
Chi-square tests for frequency tables |
dplyr vignetteSee https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html for the full dplyr vignette with lots of wrangling tips and tricks.
from R Studio