adapted from trueloveproperty.co.uk
In this practical, you’ll practice grouping and analysing data with the dplyr
and tidyr
packages (part of the `tidyverse collection of packages).
By the end of this practical you will know how to:
BernRBootcamp
R project. It should already have the folders 1_Data
and 2_Code
. Make sure that the data files listed in the Datasets
section above are in your 1_Data
folder.# Done!
statisticsI_practical.R
in the 2_Code
folder.library()
load the set of packages for this practical listed in the packages section above.library(tidyverse)
kc_house.csv
data. This dataset contains house sale prices for King County, Washington. It includes homes sold between May 2014 and May 2015. Using the following template, load the data into R and store it as a new object called kc_house
.kc_house <- read_csv(file = "XX")
print()
, summary()
, and head()
, explore the data to make sure it was loaded correctly.kc_house
# A tibble: 21,613 x 21
id date price bedrooms bathrooms sqft_living sqft_lot
<chr> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
1 7129… 2014-10-13 00:00:00 2.22e5 3 1 1180 5650
2 6414… 2014-12-09 00:00:00 5.38e5 3 2.25 2570 7242
3 5631… 2015-02-25 00:00:00 1.80e5 2 1 770 10000
4 2487… 2014-12-09 00:00:00 6.04e5 4 3 1960 5000
5 1954… 2015-02-18 00:00:00 5.10e5 3 2 1680 8080
6 7237… 2014-05-12 00:00:00 1.23e6 4 4.5 5420 101930
7 1321… 2014-06-27 00:00:00 2.58e5 3 2.25 1715 6819
8 2008… 2015-01-15 00:00:00 2.92e5 3 1.5 1060 9711
9 2414… 2015-04-15 00:00:00 2.30e5 3 1 1780 7470
10 3793… 2015-03-12 00:00:00 3.23e5 3 2.5 1890 6560
# … with 21,603 more rows, and 14 more variables: floors <dbl>,
# waterfront <dbl>, view <dbl>, condition <dbl>, grade <dbl>,
# sqft_above <dbl>, sqft_basement <dbl>, yr_built <dbl>, yr_renovated <dbl>,
# zipcode <dbl>, lat <dbl>, long <dbl>, sqft_living15 <dbl>, sqft_lot15 <dbl>
summary(kc_house)
id date price
Length:21613 Min. :2014-05-02 00:00:00 Min. : 75000
Class :character 1st Qu.:2014-07-22 00:00:00 1st Qu.: 321950
Mode :character Median :2014-10-16 00:00:00 Median : 450000
Mean :2014-10-29 04:38:01 Mean : 540088
3rd Qu.:2015-02-17 00:00:00 3rd Qu.: 645000
Max. :2015-05-27 00:00:00 Max. :7700000
bedrooms bathrooms sqft_living sqft_lot floors
Min. : 0.0 Min. :0.00 Min. : 290 Min. : 520 Min. :1.00
1st Qu.: 3.0 1st Qu.:1.75 1st Qu.: 1427 1st Qu.: 5040 1st Qu.:1.00
Median : 3.0 Median :2.25 Median : 1910 Median : 7618 Median :1.50
Mean : 3.4 Mean :2.11 Mean : 2080 Mean : 15107 Mean :1.49
3rd Qu.: 4.0 3rd Qu.:2.50 3rd Qu.: 2550 3rd Qu.: 10688 3rd Qu.:2.00
Max. :33.0 Max. :8.00 Max. :13540 Max. :1651359 Max. :3.50
waterfront view condition grade sqft_above
Min. :0.000 Min. :0.00 Min. :1.00 Min. : 1.00 Min. : 290
1st Qu.:0.000 1st Qu.:0.00 1st Qu.:3.00 1st Qu.: 7.00 1st Qu.:1190
Median :0.000 Median :0.00 Median :3.00 Median : 7.00 Median :1560
Mean :0.008 Mean :0.23 Mean :3.41 Mean : 7.66 Mean :1788
3rd Qu.:0.000 3rd Qu.:0.00 3rd Qu.:4.00 3rd Qu.: 8.00 3rd Qu.:2210
Max. :1.000 Max. :4.00 Max. :5.00 Max. :13.00 Max. :9410
sqft_basement yr_built yr_renovated zipcode lat
Min. : 0 Min. :1900 Min. : 0 Min. :98001 Min. :47.2
1st Qu.: 0 1st Qu.:1951 1st Qu.: 0 1st Qu.:98033 1st Qu.:47.5
Median : 0 Median :1975 Median : 0 Median :98065 Median :47.6
Mean : 292 Mean :1971 Mean : 84 Mean :98078 Mean :47.6
3rd Qu.: 560 3rd Qu.:1997 3rd Qu.: 0 3rd Qu.:98118 3rd Qu.:47.7
Max. :4820 Max. :2015 Max. :2015 Max. :98199 Max. :47.8
long sqft_living15 sqft_lot15
Min. :-122 Min. : 399 Min. : 651
1st Qu.:-122 1st Qu.:1490 1st Qu.: 5100
Median :-122 Median :1840 Median : 7620
Mean :-122 Mean :1987 Mean : 12768
3rd Qu.:-122 3rd Qu.:2360 3rd Qu.: 10083
Max. :-121 Max. :6210 Max. :871200
head(kc_house)
# A tibble: 6 x 21
id date price bedrooms bathrooms sqft_living sqft_lot
<chr> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
1 7129… 2014-10-13 00:00:00 2.22e5 3 1 1180 5650
2 6414… 2014-12-09 00:00:00 5.38e5 3 2.25 2570 7242
3 5631… 2015-02-25 00:00:00 1.80e5 2 1 770 10000
4 2487… 2014-12-09 00:00:00 6.04e5 4 3 1960 5000
5 1954… 2015-02-18 00:00:00 5.10e5 3 2 1680 8080
6 7237… 2014-05-12 00:00:00 1.23e6 4 4.5 5420 101930
# … with 14 more variables: floors <dbl>, waterfront <dbl>, view <dbl>,
# condition <dbl>, grade <dbl>, sqft_above <dbl>, sqft_basement <dbl>,
# yr_built <dbl>, yr_renovated <dbl>, zipcode <dbl>, lat <dbl>, long <dbl>,
# sqft_living15 <dbl>, sqft_lot15 <dbl>
kc_house
data with names()
.names(kc_house)
[1] "id" "date" "price" "bedrooms"
[5] "bathrooms" "sqft_living" "sqft_lot" "floors"
[9] "waterfront" "view" "condition" "grade"
[13] "sqft_above" "sqft_basement" "yr_built" "yr_renovated"
[17] "zipcode" "lat" "long" "sqft_living15"
[21] "sqft_lot15"
rename()
.New Name | Old Name |
---|---|
living_sqft | sqft_living |
lot_sqft | sqft_lot |
above_sqft | sqft_above |
basement_sqft | sqft_basement |
built_yr | yr_built |
renovated_yr | yr_renovated |
kc_house <- kc_house %>%
rename(NEW = OLD,
NEW = OLD,
NEW = OLD)
kc_house <- kc_house %>%
rename(living_sqft = sqft_living,
lot_sqft = sqft_lot,
above_sqft = sqft_above,
basement_sqft = sqft_basement,
built_yr = yr_built,
renovated_yr = yr_renovated)
living_sqm
, lot_sqm
, above_sqm
and basement_sqm
which show the respective room sizes in square meters rather than square feet (Hint: Multiply each by 0.093).kc_house <- kc_house %>%
mutate(living_sqm = XXX * XXX,
lot_sqm = XXX * XXX,
XXX = XXX,
XXX = XXX)
kc_house <- kc_house %>%
mutate(living_sqm = living_sqft * 0.093,
lot_sqm = lot_sqft * 0.093,
above_sqm = above_sqft * 0.093,
basement_sqm = basement_sqft * 0.093)
mansion
which is “Yes” when the sum of the house’s living, above, and basement space is above 750 square meters.kc_house <- kc_house %>%
mutate(XXX = case_when(
XXX + XXX + XXX > XXX ~ "XXX",
XXXX + XXX + XXX <= XXX ~ "XXX"))
kc_house <- kc_house %>%
mutate(mansion = case_when(
living_sqm + above_sqm + basement_sqm > 750 ~ "Yes",
living_sqm + above_sqm + basement_sqm <= 750 ~ "No"))
df$col
notation, calculate the mean price of all houses.mean(XXX$XXX)
mean(kc_house$price)
[1] 540088
summarise()
with the following template. Do you get the same answer? What is different about the output from summarise()
versus using the dollar sign?kc_house %>%
summarise(
price_mean = mean(XXX)
)
kc_house %>%
summarise(
price_mean = mean(price)
)
# A tibble: 1 x 1
price_mean
<dbl>
1 540088.
median()
function!kc_house %>%
summarise(
price_median = median(price)
)
# A tibble: 1 x 1
price_median
<dbl>
1 450000
max()
function!kc_house %>%
summarise(
price_max = max(price)
)
# A tibble: 1 x 1
price_max
<dbl>
1 7700000
kc_house <- kc_house %>%
arrange(desc(XXX))
kc_house
kc_house <- kc_house %>%
arrange(desc(price))
kc_house
# A tibble: 21,613 x 26
id date price bedrooms bathrooms living_sqft lot_sqft
<chr> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
1 6762… 2014-10-13 00:00:00 7.70e6 6 8 12050 27600
2 9808… 2014-06-11 00:00:00 7.06e6 5 4.5 10040 37325
3 9208… 2014-09-19 00:00:00 6.88e6 6 7.75 9890 31374
4 2470… 2014-08-04 00:00:00 5.57e6 5 5.75 9200 35069
5 8907… 2015-04-13 00:00:00 5.35e6 5 5 8000 23985
6 7558… 2015-04-13 00:00:00 5.30e6 6 6 7390 24829
7 1247… 2014-10-20 00:00:00 5.11e6 5 5.25 8010 45517
8 1924… 2014-06-17 00:00:00 4.67e6 5 6.75 9640 13068
9 7738… 2014-08-15 00:00:00 4.50e6 5 5.5 6640 40014
10 3835… 2014-06-18 00:00:00 4.49e6 4 3 6430 27517
# … with 21,603 more rows, and 19 more variables: floors <dbl>,
# waterfront <dbl>, view <dbl>, condition <dbl>, grade <dbl>,
# above_sqft <dbl>, basement_sqft <dbl>, built_yr <dbl>, renovated_yr <dbl>,
# zipcode <dbl>, lat <dbl>, long <dbl>, sqft_living15 <dbl>,
# sqft_lot15 <dbl>, living_sqm <dbl>, lot_sqm <dbl>, above_sqm <dbl>,
# basement_sqm <dbl>, mansion <chr>
summarise()
.kc_house %>%
summarise(mil_p = mean(XXX > 1000000))
kc_house %>%
summarise(mil_p = mean(price > 1000000))
# A tibble: 1 x 1
mil_p
<dbl>
1 0.0678
filter()
to only select mansions!)kc_house %>%
filter(XXX == XXX) %>%
summarise(
floors_mean = XXX(XXX),
bathrooms_mean = XXX(XXX)
)
kc_house %>%
filter(mansion == "Yes") %>%
summarise(
floors_mean = mean(floors),
bathrooms_mean = mean(bathrooms)
)
# A tibble: 1 x 2
floors_mean bathrooms_mean
<dbl> <dbl>
1 1.92 3.68
group_by()
to group the dataset by the mansions
column, then use the n()
function to count the number of cases.kc_house %>%
group_by(XXX) %>%
summarise(XXX = n())
kc_house %>%
group_by(mansion) %>%
summarise(N = n())
# A tibble: 2 x 2
mansion N
<chr> <int>
1 No 20862
2 Yes 751
summarise()
function!kc_house %>%
group_by(mansion) %>%
summarise(N = n(),
XXX = XXX(XXX))
kc_house %>%
group_by(mansion) %>%
summarise(N = n(),
price_mean = mean(price))
# A tibble: 2 x 3
mansion N price_mean
<chr> <int> <dbl>
1 No 20862 504024.
2 Yes 751 1541915.
group_by()
and summarise()
, create a dataframe showing the same results as the following table.kc_house %>%
group_by(mansion) %>%
summarise(N = n(),
price_min = min(price),
price_mean = mean(price),
price_median = median(price),
price_max = max(price)) %>%
knitr::kable()
mansion | N | price_min | price_mean | price_median | price_max |
---|---|---|---|---|---|
No | 20862 | 75000 | 504024 | 441000 | 3100000 |
Yes | 751 | 404000 | 1541915 | 1300000 | 7700000 |
built_yr
, and then calculate the mean number of living square meters. Be sure to also include the number of houses built in each year!kc_house %>%
group_by(built_yr) %>%
summarise(N = n(),
living = mean(living_sqm))
# A tibble: 116 x 3
built_yr N living
<dbl> <int> <dbl>
1 1900 87 161.
2 1901 29 164.
3 1902 27 179.
4 1903 46 140.
5 1904 45 149.
6 1905 74 183.
7 1906 92 168.
8 1907 65 177.
9 1908 86 158.
10 1909 94 177.
# … with 106 more rows
kc_house %>%
mutate(built_decade = floor(built_yr / 10)) %>%
group_by(built_decade) %>%
summarise(XX = XX,
XX = XX(XX))
brumhilda
that only contains data from houses in those zipcode (hint: use filter()
combined with the %in%
operator as follows:brumhilda <- kc_house %>%
filter(XXX %in% c(XXX, XXX, XXX, XXX))
brumhilda <- kc_house %>%
filter(zipcode %in% c(98001, 98109, 98117, 98199))
mean()
and median()
selling price (as well as the number of houses) in each zip code.brumhilda %>%
group_by(zipcode) %>%
summarise(price_mean = mean(price),
price_median = median(price),
N = n())
# A tibble: 4 x 4
zipcode price_mean price_median N
<dbl> <dbl> <dbl> <int>
1 98001 280805. 260000 362
2 98109 879624. 736000 109
3 98117 576795. 544000 553
4 98199 791821. 689800 317
mansion
(as well as zipcode
), and calculating the same summary statistics as before.brumhilda %>%
group_by(zipcode, mansion) %>%
summarise(price_mean = mean(price),
price_median = median(price),
N = n())
# A tibble: 8 x 5
# Groups: zipcode [4]
zipcode mansion price_mean price_median N
<dbl> <chr> <dbl> <dbl> <int>
1 98001 No 277589. 260000 359
2 98001 Yes 665667. 637000 3
3 98109 No 833528. 730500 106
4 98109 Yes 2508333. 2900000 3
5 98117 No 575626. 543000 551
6 98117 Yes 898750 898750 2
7 98199 No 753625. 675000 305
8 98199 Yes 1762618. 1425000 12
brumhilda %>%
group_by(zipcode) %>%
summarise(price_mean = mean(price),
price_median = median(price),
floors_min = min(floors),
floors_max = max(floors),
N = n())
# A tibble: 4 x 6
zipcode price_mean price_median floors_min floors_max N
<dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 98001 280805. 260000 1 2.5 362
2 98109 879624. 736000 1 3 109
3 98117 576795. 544000 1 3 553
4 98199 791821. 689800 1 3 317
t.test()
function and assign the result to waterfront_htest
. To prepare for the t-test, you need to create two vectors carrying the prices for the houses on the waterfront and those not on the waterfront. (Note: this is a bit of a hack that we don’t need when speciying models the proper way -> next session)# create list separating values for waterfron yes and no
prices <- kc_house %>%
mutate(waterfront_lab =
case_when(waterfront == 0 ~ 'no_waterfront',
waterfront == 1 ~ 'waterfront')) %>%
with(split(price, waterfront_lab))
# show names
x
becomes the 'no_waterfront'
prices and y
becomes the 'waterfront'
prices.waterfront_htest <- t.test(x = prices$XX, y = prices$XX)
waterfront_htest <- t.test(x = prices$no_waterfront, y = prices$waterfront)
waterfront_htest
object to see a printout of the main results.waterfront_htest
Welch Two Sample t-test
data: prices$no_waterfront and prices$waterfront
t = -13, df = 162, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1303662 -956963
sample estimates:
mean of x mean of y
531564 1661876
waterfront_htest
object with names()
.names(waterfront_htest)
[1] "statistic" "parameter" "p.value" "conf.int" "estimate"
[6] "null.value" "stderr" "alternative" "method" "data.name"
$
, print the test statistic (statistic
) from your waterfront_htest
object.waterfront_htest$statistic
t
-12.9
$
, print only the p-value (p.value
) from the object.waterfront_htest$p.value
[1] 1.38e-26
wilcox.test(x = prices$XX, y = prices$XX)
wilcox.test(x = prices$no_waterfront, y = prices$waterfront)
Wilcoxon rank sum test with continuity correction
data: prices$no_waterfront and prices$waterfront
W = 4e+05, p-value <2e-16
alternative hypothesis: true location shift is not equal to 0
mean()
, then sort the data in descending order) with arrange()
, then select the first row with slice()
. Once you find it, try searching for that zipcode on Google Maps and see if it’s location makes sense!kc_house %>%
group_by(zipcode) %>%
summarise(waterfront_p = mean(waterfront)) %>%
arrange(desc(waterfront_p)) %>%
slice(1)
# A tibble: 1 x 2
zipcode waterfront_p
<dbl> <dbl>
1 98070 0.203
price_to_living
that takes price / living_sqm
. Then, sort the data in descending order of this variable, and select the first row with slice()
! What id value do you get?kc_house %>%
mutate(price_to_living = price / living_sqm) %>%
arrange(desc(price_to_living)) %>%
slice(1)
kc_house %>%
group_by(zipcode) %>%
summarise(price_mean = mean(price)) %>%
arrange(desc(price_mean)) %>%
slice(1:10)
# A tibble: 10 x 2
zipcode price_mean
<dbl> <dbl>
1 98039 2160607.
2 98004 1355927.
3 98040 1194230.
4 98112 1095499.
5 98102 901258.
6 98109 879624.
7 98105 862825.
8 98006 859685.
9 98119 849448.
10 98005 810165.
kc_house %>%
filter(built_yr >= 1990 & built_yr < 1999) %>%
group_by(built_yr) %>%
summarise(N = n(),
price_mean = mean(price),
price_max = max(price),
living_sqm_mean = mean(living_sqm)) %>%
knitr::kable(digits = 0)
built_yr | N | price_mean | price_max | living_sqm_mean |
---|---|---|---|---|
1990 | 320 | 563966 | 3640900 | 234 |
1991 | 224 | 630441 | 5300000 | 244 |
1992 | 198 | 548169 | 2480000 | 223 |
1993 | 202 | 556612 | 3120000 | 226 |
1994 | 249 | 486834 | 2880500 | 209 |
1995 | 169 | 577771 | 3200000 | 224 |
1996 | 195 | 639534 | 3100000 | 240 |
1997 | 177 | 606058 | 3800000 | 234 |
1998 | 239 | 594159 | 1960000 | 241 |
built_yr | N | price_mean | price_max | living_sqm_mean |
---|---|---|---|---|
1990 | 320 | 563966 | 3640900 | 234 |
1991 | 224 | 630441 | 5300000 | 244 |
1992 | 198 | 548169 | 2480000 | 223 |
1993 | 202 | 556612 | 3120000 | 226 |
1994 | 249 | 486834 | 2880500 | 209 |
1995 | 169 | 577771 | 3200000 | 224 |
1996 | 195 | 639534 | 3100000 | 240 |
1997 | 177 | 606058 | 3800000 | 234 |
1998 | 239 | 594159 | 1960000 | 241 |
kc_house %>%
filter(built_yr >= 1990 & built_yr < 1999) %>%
group_by(built_yr) %>%
summarise(N = n(),
price_mean = mean(price),
price_max = max(price),
living_sqm_mean = mean(living_sqm)) %>%
knitr::kable(digits = 0)
built_yr | N | price_mean | price_max | living_sqm_mean |
---|---|---|---|---|
1990 | 320 | 563966 | 3640900 | 234 |
1991 | 224 | 630441 | 5300000 | 244 |
1992 | 198 | 548169 | 2480000 | 223 |
1993 | 202 | 556612 | 3120000 | 226 |
1994 | 249 | 486834 | 2880500 | 209 |
1995 | 169 | 577771 | 3200000 | 224 |
1996 | 195 | 639534 | 3100000 | 240 |
1997 | 177 | 606058 | 3800000 | 234 |
1998 | 239 | 594159 | 1960000 | 241 |
chisq.test()
function allows you to do conduct a chi square test testing the relationship between two nominal variables. Look at the help menu to see how the function works. Then, conduct a chi-square test to see if there is a relationship between whether a house is on the waterfront and the grade of the house. Do houses on the waterfront tend to have higher (or lower) grades than houses not on the waterfront?# First look at a table
table(kc_house$waterfront, kc_house$grade)
1 3 4 5 6 7 8 9 10 11 12 13
0 1 3 29 238 2026 8958 6028 2590 1106 379 79 13
1 0 0 0 4 12 23 40 25 28 20 11 0
# Then run the test
chisq.test(table(kc_house$waterfront, kc_house$grade))
Pearson's Chi-squared test
data: table(kc_house$waterfront, kc_house$grade)
X-squared = 335, df = 11, p-value <2e-16
# Wrangling with dplyr and tidyr ---------------------------
library(tidyverse) # Load tidyverse for dplyr and tidyr
# Load baselers data
baselers <- read_csv("1_Data/baselers.txt")
# No grouping variables
bas <- baselers %>%
summarise(
age_mean = mean(age, na.rm = TRUE),
income_median = median(income, na.rm = TRUE),
N = n()
)
# One grouping variable
bas_sex <- baselers %>%
group_by(sex) %>%
summarise(
age_mean = mean(age, na.rm = TRUE),
income_median = median(income, na.rm = TRUE),
N = n()
)
bas_sex
# Two grouping variables
bas_sex_ed <- baselers %>%
group_by(sex, education) %>%
summarise(
age_mean = mean(age, na.rm = TRUE),
income_median = median(income, na.rm = TRUE),
N = n()
)
# Advanced scoping
# Calculate mean of ALL numeric variables
baselers %>%
group_by(sex, education) %>%
summarise_if(is.numeric, mean, na.rm = TRUE)
# Examples of hypothesis tests on the diamonds -------------
library(tidyverse)
library(broom)
library(rsq)
# First few rows of the diamonds data
diamonds
# 2-sample t- test ---------------------------
# Q: Is there a difference in the carats of color = E and color = I diamonds?
# split data
carat <- diamonds %>%
filter(color %in% c("E", "I")) %>%
with(split(carat, color))
# run t-test
htest <- t.test(carat$E, carat$I,
alternative = "two.sided") # Two-sided test
htest # Print result
# run wilcoxon test
htest <- wilcox.test(carat$E, carat$I,
alternative = "two.sided") # Two-sided test
htest # Print result
File | Rows | Columns | Description |
---|---|---|---|
kc_house.csv | 21613 | 21 | House sale prices for King County between May 2014 and May 2015. |
1
Wrangling
Function | Package | Description |
---|---|---|
rename() |
dplyr |
Rename columns |
select() |
dplyr |
Select columns based on name or index |
filter() |
dplyr |
Select rows based on some logical criteria |
arrange() |
dplyr |
Sort rows |
mutate() |
dplyr |
Add new columns |
case_when() |
dplyr |
Recode values of a column |
group_by(), summarise() |
dplyr |
Group data and then calculate summary statistics |
left_join() |
dplyr |
Combine multiple data sets using a key column |
pivot_wider() |
tidyr |
Convert long data to wide format - from rows to columns |
pivot_longer() |
tidyr |
Convert wide data to long format - from columns to rows |
Statistical Tests
Function | Hypothesis Test |
---|---|
t.test() |
One and two sample t-test |
wilcox.test() |
Wilcoxon test |
cor.test() |
Correlation test |
chisq.test |
Chi-square tests for frequency tables |
dplyr
vignetteSee https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html for the full dplyr vignette with lots of wrangling tips and tricks.
from R Studio