Introduction to Data Science with R The R Bootcamp |
![]() |
from delinat.com
At the end of this practical you will be able to …
Open your TheRBootcamp
R project. Make sure that the data files listed in the Datasets
tab are in your 1_Data
folder.
Using library()
load the set of packages for this practical listed in the Functions section.
## Name
## Date
## Statistics Practical
library(XX)
library(XX)
wine.csv
. Read it in with read_csv()
from its location on a website. Copy the link of the wine.csv file into your read_csv()
function.# Read in the data
wine <- read_csv(file = "https://bit.ly/3zV98fi")
character
variables are converted to factors. This will help the statistical model to interpret categorical variables correctly.# convert character to factor
wine <- wine %>% mutate_if(is.character, factor)
Residual_Sugar
(predictor) on the perceived Quality
(criterion) of wines. Use the template below, to run a simple linear regression testing the relationship and save it in an object called wine_lm
.# Simple regression
wine_lm <- lm(formula = XX ~ XX,
data = XX)
wine_lm <- lm(formula = Quality ~ Residual_Sugar,
data = wine)
wine_lm
. Which information does it show? Can you make sense of it?wine_lm
Call:
lm(formula = Quality ~ Residual_Sugar, data = wine)
Coefficients:
(Intercept) Residual_Sugar
5.85532 -0.00679
lm
is not very informative. It only shows the code and the estimates for the mdoel’s parameters: the inetrcept and the coefficient for Residual_Sugar
. Use summary()
to obtain a more informative output.# Show results
summary(XX)
summary(wine_lm)
Call:
lm(formula = Quality ~ Residual_Sugar, data = wine)
Residuals:
Min 1Q Median 3Q Max
-2.851 -0.808 0.158 0.229 3.217
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.85532 0.01645 355.89 <2e-16 ***
Residual_Sugar -0.00679 0.00228 -2.98 0.0029 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.873 on 6495 degrees of freedom
Multiple R-squared: 0.00137, Adjusted R-squared: 0.00121
F-statistic: 8.89 on 1 and 6495 DF, p-value: 0.00287
summary()
shows a detailed output with five sections:Which sections shows you, whether Residual_Sugar
has an impact on the perceived Quality
of wine?
The section Coefficients
shows the estimates of the model parameters and whether they are significantly different from 0. What value was estimated for Residual_Sugar
? What does this value tell you?
An increase of 1
(g/ml) is associated with a change of -.0067
in the quality of wine. This means that the sweeter the wine, the lower the perceived quality of the wine. But is this relationship significant? Have a look at the corresponding p-value.
Now direct your attention to the final section. There you see a summary of the overall fit of the model to the data. Of particular interest is the Multiple R-squared
. It shows you how much the criterion’s variance is accounted by the model. How do you assess the values magnitude? Is it large or small?
The value is small. Only .1% of the variance is accounted for. It is time to include additional predictors.
Quality
(criterion) of the wine. Include, in addition to Residual_sugar
, the PH_Value
, Alcohol
, and Sulphate
. Use the template below, to run a multiple regression with these variables and save it as wine_lm
.# Multiple regression
wine_lm <- lm(formula = XX ~ XX + XX + XX + XX,
data = XX)
wine_lm <- lm(formula = Quality ~ Residual_Sugar + PH_Value + Alcohol + Sulphate,
data = wine)
wine_lm
Call:
lm(formula = Quality ~ Residual_Sugar + PH_Value + Alcohol +
Sulphate, data = wine)
Coefficients:
(Intercept) Residual_Sugar PH_Value Alcohol Sulphate
1.8422 0.0280 -0.0767 0.3669 0.4171
Alcohol
coded in per thousand (rather than percent), then its coefficient would be exactly ten times larger. You can tests this using the template below. Add the four predictors using above such that Alcohol
is placed within of I()
and run the model. Do not save it, only inspect the coefficients printed in the console.# Multiple regression
lm(formula = XX ~ XX + XX + I(XX / 10) + XX,
data = XX)
wine_lm <- lm(formula = Quality ~ Residual_Sugar + PH_Value + I(Alcohol/10) + Sulphate,
data = wine)
Regression coefficients must always be interpreted with the scaling of variables in mind. Nevertheless, the coefficients reveal some interesting results. Take a look at the coefficient of Residual_sugar
. Wasn’t the effect negative, when it was the only predictor? How come it has changed?
In multiple regressison, all predictors depend on each other. This means that one coefficient can only be interpreted in reference to the whole set of predictors. Now, take a look at the summary()
. Which effects are significant and what is their direction?
summary(wine_lm)
Call:
lm(formula = Quality ~ Residual_Sugar + PH_Value + I(Alcohol/10) +
Sulphate, data = wine)
Residuals:
Min 1Q Median 3Q Max
-3.494 -0.501 -0.035 0.492 3.099
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.84215 0.22155 8.32 < 2e-16 ***
Residual_Sugar 0.02800 0.00225 12.43 < 2e-16 ***
PH_Value -0.07671 0.06256 -1.23 0.22
I(Alcohol/10) 3.66868 0.08635 42.49 < 2e-16 ***
Sulphate 0.41713 0.06646 6.28 3.7e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.772 on 6492 degrees of freedom
Multiple R-squared: 0.219, Adjusted R-squared: 0.219
F-statistic: 456 on 4 and 6492 DF, p-value: <2e-16
Alcohol
, Sulphate
, and Residual_Sugar
all show a significant positive relationship with perceived quality of wines, whereas PH_Value
shows a non-significant negative relationship. Using the summary, you can now evaluate, which predictor is most important. Which values would you rely on to make that call?
The best value to evaluate the importance of predictors is the t-value. The larger the t-value, the stronger the signal for the corresponding predictor. Now, take a look at R-squared
. Has it increased relative to the simple regression?
The R-squared
increased substantially. About 22% of the criterion’s variance is accounted for by a linear model of the four predictors.
Color
of the wine, red or white, as a predictor for Quality
. Use the code below to generate two vectors that include quality ratings for white and red wine.# Qualitiy vectors by color
white <- wine %>% filter(Color == 'white') %>% pull(Quality)
red <- wine %>% filter(Color == 'red') %>% pull(Quality)
t.test()
template below to compare the two vectors with a t-test. You do not have to save the result.# t-test
t.test(x = XX, y = XX)
# t-test
t.test(x = white, y = red)
Welch Two Sample t-test
data: white and red
t = 10, df = 2951, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.195 0.289
sample estimates:
mean of x mean of y
5.88 5.64
What does the output of the t-tests tell you about the difference between white and red wine in perceived quality? You can find the answer in the second line of the output that starts with t=..
together with the last line.
White wine were rated 0.2419
(difference of the two means) points higher than red wine and this difference is significant. Now, try to get the same result with a regression. Predict Quality
with Color
.
# Regression
wine_lm <- lm(formula = XX ~ XX,
data = XX)
# Regression
wine_lm <- lm(formula = Quality ~ Color,
data = wine)
Print the object and inspect the regression coefficients. Do you recognize some of these numbers?
Exactly! The regression coefficient for Color
is the difference of red and white wines. What does the intercept represent? It represents the value of the category that got assigned 0 by R. For character
variables the default is assigned to the category that comes earlier in the alphabet, i.e., 'red' < 'white'
.
Now take a look at the summary()
of your model object and compare the degrees of freedom, t- and p-values with the ones from the t-test above.
# summary
summary(wine_lm)
Call:
lm(formula = Quality ~ Color, data = wine)
Residuals:
Min 1Q Median 3Q Max
-2.878 -0.878 0.122 0.364 3.122
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.6360 0.0217 259.92 <2e-16 ***
Colorwhite 0.2419 0.0250 9.69 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.867 on 6495 degrees of freedom
Multiple R-squared: 0.0142, Adjusted R-squared: 0.0141
F-statistic: 93.8 on 1 and 6495 DF, p-value: <2e-16
t.test()
function permits different variances in the groups (red and white). In contrast, regression always assumes that the variances between groups are the identical. Re-calculate the t-test using the additional argument var.equal = TRUE
.# t-test
t.test(x = XX, y = XX, var.equal = XX)
# t-test
t.test(x = white, y = red, var.equal = TRUE)
Two Sample t-test
data: white and red
t = 10, df = 6495, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.193 0.291
sample estimates:
mean of x mean of y
5.88 5.64
# Kodierungen der Color
wine <- wine %>% mutate(Color_dummy = ifelse(Color == 'red', 0, 1),
Color_effect = ifelse(Color == 'red', -1, 1))
wine_dummy
and wine_effect
.# Regression dummy
wine_dummy <- lm(formula = XX ~ XX,
data = XX)
# Regression effect
wine_effect <- lm(formula = XX ~ XX,
data = XX)
# Regression dummy
wine_dummy <- lm(formula = Quality ~ Color_dummy,
data = wine)
# Regression effect
wine_effect <- lm(formula = Quality ~ Color_effect,
data = wine)
Print these objects and compare the coeffcients. The dummy coding coefficients should be familiar. How do they compare to the effect coding? Do you see the connection?
The coefficient for color using effect coding is exactly half of the coefficient for the dummy coding. To compensate for this change, the intercept has also changed for exactly the same value. Verify the weights with the calculations done in the code chunk below.
# Dummy-coding
mean(red) # intercept
[1] 5.64
mean(white) - mean(red) # coeffcient color
[1] 0.242
# EffekKodierung
(mean(red) + mean(white))/2 # intercept
[1] 5.76
mean(white) - (mean(red) + mean(white))/2 # coeffcient color
[1] 0.121
summary()
. Where is different, what the same?# Regression dummy
summary(wine_dummy)
Call:
lm(formula = Quality ~ Color_dummy, data = wine)
Residuals:
Min 1Q Median 3Q Max
-2.878 -0.878 0.122 0.364 3.122
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.6360 0.0217 259.92 <2e-16 ***
Color_dummy 0.2419 0.0250 9.69 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.867 on 6495 degrees of freedom
Multiple R-squared: 0.0142, Adjusted R-squared: 0.0141
F-statistic: 93.8 on 1 and 6495 DF, p-value: <2e-16
# Regression effekt
summary(wine_effect)
Call:
lm(formula = Quality ~ Color_effect, data = wine)
Residuals:
Min 1Q Median 3Q Max
-2.878 -0.878 0.122 0.364 3.122
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.7570 0.0125 461.04 <2e-16 ***
Color_effect 0.1209 0.0125 9.69 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.867 on 6495 degrees of freedom
Multiple R-squared: 0.0142, Adjusted R-squared: 0.0141
F-statistic: 93.8 on 1 and 6495 DF, p-value: <2e-16
# Regression mit R
library(tidyverse)
# Model:
# Does engine displacement predict
# highway miles per gallon?
hwy_mod <- lm(formula = hwy ~ displ,
data = mpg)
# Results
summary(hwy_mod)
coef(hwy_mod)
# Fitted values
hwy_fit <- fitted(hwy_mod)
hwy_fit
# Resdiuals
hwy_resid <- residuals(hwy_mod)
hwy_resid
Datei | Zeile | Spalte |
---|---|---|
wine.csv | 6497 | 13 |
The wine.csv
file contains data of the Comissão De Viticultura Da Região Dos Vinhos Verdes, the official certification agency of Vinho Verde in Portugal for the years 2004 to 2007.
Name | Description |
---|---|
Quality | Quality rating on a scale between 1-9 |
Color | red or white wine |
Dissolved_Acid | Concentration of acids dissolved in the wine |
Free_Acid | Concentration of free acids |
Citric_Acid | Citric acid in the wine |
Residual_Sugar | Sugar concentration in the wine |
Chloride | Chlorid concentration in the |
Free_Sulfur_Dioxide | Free sulfur dioxide in the |
Total_Sulfur_Dioxide | Total amount of sulfur dioxide |
Density | Density of the wine |
PH_Value | pH-Value of the wine. The smaller the more acidic. |
Sulphate | Sulphate concentration of the wine |
Alcohol | Alcohol in the wine in % |
Package | Installation |
---|---|
tidyverse |
install.packages("tidyverse") |
Function | Package | Description |
---|---|---|
lm |
stats |
Fit a linear model |
fitted |
stats |
Extract predicted values |
residuals |
stats |
Extract residuals |