[Image from delinat.com]

Overview

At the end of this practical you will be able to …

  1. run simple and multiple regressions.
  2. use categorical variables for group comparisons.

Tasks

A - Setup

  1. Open your TheRBootcamp R project. Make sure that the data files listed in the Datasets tab are in your 1_Data folder.

  2. Using library() load the set of packages for this practical listed in the Functions section.

## Name
## Date
## Statistics Practical

library(XX)     
library(XX)
  1. In this practical, you will analyze a dataset called wine.csv. Read it in with read_csv() from its location on the web by copying the link to the wine.csv file into your read_csv() function.
# Read in the data
wine <- read_csv(file = "https://bit.ly/3zV98fi")
  1. Execute the code below to ensure that all character variables are converted to factors. This will help the statistical model to interpret categorical variables correctly.
# convert character to factor
wine <- wine %>% mutate_if(is.character, factor)

B - Simple Regression

  1. In this section, you will test the effect of Residual_Sugar (predictor) on the perceived Quality (criterion) of wines. Use the template below to run a simple linear regression testing this relationship and save the result in an object called wine_lm.
# Simple regression
wine_lm <- lm(formula = XX ~ XX,
              data = XX)
wine_lm <- lm(formula = Quality ~ Residual_Sugar,
              data = wine)
  1. Print the fitted model wine_lm. What information does it show? Can you make sense of it?
wine_lm

Call:
lm(formula = Quality ~ Residual_Sugar, data = wine)

Coefficients:
   (Intercept)  Residual_Sugar  
       5.85532        -0.00679  
  1. The default output of lm is not very informative. It only shows the call and the estimates of the model’s parameters: the intercept and the coefficient for Residual_Sugar. Use summary() to obtain a more informative output.
# Show results
summary(XX)
summary(wine_lm)

Call:
lm(formula = Quality ~ Residual_Sugar, data = wine)

Residuals:
   Min     1Q Median     3Q    Max 
-2.851 -0.808  0.158  0.229  3.217 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     5.85532    0.01645  355.89   <2e-16 ***
Residual_Sugar -0.00679    0.00228   -2.98   0.0029 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.873 on 6495 degrees of freedom
Multiple R-squared:  0.00137,   Adjusted R-squared:  0.00121 
F-statistic: 8.89 on 1 and 6495 DF,  p-value: 0.00287
  1. summary() shows a detailed output with five sections:
  • Call: The code specifying the model.
  • Residuals: Statistics of the distribution of residuals.
  • Coefficients: Estimates and tests of model parameters.
  • Signif. codes: Conventional coding of significances.
  • Final lines: Statistics of the overall model fit.

Which section shows you whether Residual_Sugar has an impact on the perceived Quality of wine?

  1. The section Coefficients shows the estimates of the model parameters and whether they are significantly different from 0. What value was estimated for Residual_Sugar? What does this value tell you?

  2. An increase of 1 (g/ml) in residual sugar is associated with a change of -0.0068 in the quality of wine. This means that the sweeter the wine, the lower its perceived quality. But is this relationship significant? Have a look at the corresponding p-value.

  3. Now direct your attention to the final section. There you see a summary of the overall fit of the model to the data. Of particular interest is the Multiple R-squared. It shows you how much of the criterion’s variance is accounted for by the model. How do you assess the value’s magnitude? Is it large or small?

  4. The value is small. Only 0.1% of the variance is accounted for. It is time to include additional predictors.
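As an aside, Multiple R-squared can be verified by hand: for a linear model with an intercept, it equals the squared correlation between fitted and observed values. A minimal sketch using the built-in mtcars data (so it runs without the wine download; the same check works for wine_lm):

```r
# R-squared equals the squared correlation of fitted and observed values
mod <- lm(formula = mpg ~ wt, data = mtcars)

r2_summary <- summary(mod)$r.squared
r2_by_hand <- cor(fitted(mod), mtcars$mpg)^2

all.equal(r2_summary, r2_by_hand)  # the two values agree
```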

C - Multiple Regression

  1. Test the effect of multiple predictors on the perceived Quality (criterion) of the wine. Include, in addition to Residual_Sugar, the predictors PH_Value, Alcohol, and Sulphate. Use the template below to run a multiple regression with these variables and save the result as wine_lm.
# Multiple regression
wine_lm <- lm(formula = XX ~ XX + XX + XX + XX,
              data = XX)
wine_lm <- lm(formula = Quality ~ Residual_Sugar + PH_Value + Alcohol + Sulphate,
              data = wine)
  1. Print the fitted model. How do you interpret the coefficients? Do they tell you which predictor is most important?
wine_lm

Call:
lm(formula = Quality ~ Residual_Sugar + PH_Value + Alcohol + 
    Sulphate, data = wine)

Coefficients:
   (Intercept)  Residual_Sugar        PH_Value         Alcohol        Sulphate  
        1.8422          0.0280         -0.0767          0.3669          0.4171  
  1. Raw coefficients tell you little about the importance of predictors, because they are also affected by the scale of the predictors. For instance, if Alcohol were divided by 10, its coefficient would be exactly ten times larger. You can test this using the template below. Add the four predictors as above, but place Alcohol inside I() so that it is divided by 10, run the model, save it as wine_lm again, and inspect the coefficients printed in the console.
# Multiple regression
lm(formula = XX ~ XX + XX + I(XX / 10) + XX,
   data = XX)
wine_lm <- lm(formula = Quality ~ Residual_Sugar + PH_Value + I(Alcohol/10) + Sulphate,
              data = wine)
  1. Regression coefficients must always be interpreted with the scaling of the variables in mind. Nevertheless, the coefficients reveal some interesting results. Take a look at the coefficient of Residual_Sugar. Wasn’t the effect negative when it was the only predictor? How come it has changed?

  2. In multiple regression, the coefficients depend on each other, which means that a coefficient can only be interpreted in reference to the whole set of predictors. Now take a look at the summary(). Which effects are significant, and what is their direction?

summary(wine_lm)

Call:
lm(formula = Quality ~ Residual_Sugar + PH_Value + I(Alcohol/10) + 
    Sulphate, data = wine)

Residuals:
   Min     1Q Median     3Q    Max 
-3.494 -0.501 -0.035  0.492  3.099 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     1.84215    0.22155    8.32  < 2e-16 ***
Residual_Sugar  0.02800    0.00225   12.43  < 2e-16 ***
PH_Value       -0.07671    0.06256   -1.23     0.22    
I(Alcohol/10)   3.66868    0.08635   42.49  < 2e-16 ***
Sulphate        0.41713    0.06646    6.28  3.7e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.772 on 6492 degrees of freedom
Multiple R-squared:  0.219, Adjusted R-squared:  0.219 
F-statistic:  456 on 4 and 6492 DF,  p-value: <2e-16
  1. Alcohol, Sulphate, and Residual_Sugar all show a significant positive relationship with perceived quality of wines, whereas PH_Value shows a non-significant negative relationship. Using the summary, you can now evaluate, which predictor is most important. Which values would you rely on to make that call?

  2. The best value to evaluate the importance of predictors is the t-value. The larger the t-value, the stronger the signal for the corresponding predictor. Now, take a look at R-squared. Has it increased relative to the simple regression?

  3. The R-squared increased substantially. About 22% of the criterion’s variance is accounted for by a linear model of the four predictors.
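One common way to make coefficients directly comparable across predictors is to standardize the predictors with scale() before fitting; the resulting beta weights are then in standard-deviation units. A hedged sketch using the built-in mtcars data rather than the wine data (the same idea applies to the wine predictors):

```r
# Standardized (beta) coefficients: each predictor rescaled to mean 0, sd 1
mod_raw <- lm(formula = mpg ~ wt + hp, data = mtcars)
mod_std <- lm(formula = mpg ~ scale(wt) + scale(hp), data = mtcars)

coef(mod_raw)
coef(mod_std)  # each slope equals the raw slope times the predictor's sd
```

With standardized predictors, the slopes can be compared in magnitude directly, which leads to the same importance ranking as the t-values here.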

D - Comparing groups: t-test versus regression

  1. In this part, you will analyze the effect of Color of the wine, red or white, as a predictor for Quality. Use the code below to generate two vectors that include quality ratings for white and red wine.
# Quality vectors by color
white <- wine %>% filter(Color == 'white') %>% pull(Quality)
red <- wine %>% filter(Color == 'red') %>% pull(Quality)
  1. Use the t.test() template below to compare the two vectors with a t-test. You do not have to save the result.
# t-test
t.test(x = XX, y = XX)
# t-test
t.test(x = white, y = red)

    Welch Two Sample t-test

data:  white and red
t = 10, df = 2951, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.195 0.289
sample estimates:
mean of x mean of y 
     5.88      5.64 
  1. What does the output of the t-test tell you about the difference in perceived quality between white and red wine? You can find the answer in the line that starts with t =, together with the last lines showing the sample means.

  2. White wines were rated 0.2419 points (the difference of the two means) higher than red wines, and this difference is significant. Now try to get the same result with a regression: predict Quality with Color.

# Regression
wine_lm <- lm(formula = XX ~ XX, 
              data = XX)
# Regression
wine_lm <- lm(formula = Quality ~ Color, 
              data = wine)
  1. Print the object and inspect the regression coefficients. Do you recognize some of these numbers?

  2. Exactly! The regression coefficient for Color is the difference between the mean quality of white and red wines. What does the intercept represent? It represents the mean of the category that R assigned 0. For character variables, the default reference (0) is the category that comes earlier in the alphabet, i.e., 'red' < 'white'.

  3. Now take a look at the summary() of your model object and compare the degrees of freedom, t- and p-values with the ones from the t-test above.

# summary
summary(wine_lm)

Call:
lm(formula = Quality ~ Color, data = wine)

Residuals:
   Min     1Q Median     3Q    Max 
-2.878 -0.878  0.122  0.364  3.122 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.6360     0.0217  259.92   <2e-16 ***
Colorwhite    0.2419     0.0250    9.69   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.867 on 6495 degrees of freedom
Multiple R-squared:  0.0142,    Adjusted R-squared:  0.0141 
F-statistic: 93.8 on 1 and 6495 DF,  p-value: <2e-16
  1. The values from the t-test do not exactly match the estimates in the regression. The reason is that the t.test() function by default permits different variances in the two groups (red and white). In contrast, regression assumes that the group variances are identical. Re-calculate the t-test using the additional argument var.equal = TRUE.
# t-test
t.test(x = XX, y = XX, var.equal = XX)
# t-test
t.test(x = white, y = red, var.equal = TRUE)

    Two Sample t-test

data:  white and red
t = 10, df = 6495, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.193 0.291
sample estimates:
mean of x mean of y 
     5.88      5.64 
  1. Order has been restored! All values in the t-test and the regression should now be identical.
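If you ever want a different category to serve as the reference (the one coded 0), you can change it with relevel(). A small self-contained sketch (the color vector here is made up for illustration, not taken from the wine data):

```r
# By default, the alphabetically first level is the reference (coded 0)
color <- factor(c("red", "white", "white", "red"))
levels(color)   # "red" "white" -> 'red' is the reference

# relevel() makes 'white' the reference instead
color2 <- relevel(color, ref = "white")
levels(color2)  # "white" "red"
```

Refitting the regression after releveling would flip the sign of the Color coefficient and move the intercept to the white-wine mean.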

E - Comparing groups: Coding

  1. The default in R is dummy coding, which assigns 0 to one level and 1 to the other. Alternatively, you can use effect coding, which assigns -1 to one level and 1 to the other. To see the difference between these codings, generate two new variables in your dataset using the code below.
# Codings of Color
wine <- wine %>% mutate(Color_dummy = ifelse(Color == 'red', 0, 1),
                        Color_effect = ifelse(Color == 'red', -1, 1))
  1. Now, calculate two regressions, one using the dummy-coded and one using the effect-coded predictor, and save them as new objects called wine_dummy and wine_effect.
# Regression dummy
wine_dummy <- lm(formula = XX ~ XX, 
                 data = XX)

# Regression effect
wine_effect <- lm(formula = XX ~ XX, 
                  data = XX)
# Regression dummy
wine_dummy <- lm(formula = Quality ~ Color_dummy, 
                 data = wine)

# Regression effect
wine_effect <- lm(formula = Quality ~ Color_effect, 
                  data = wine)
  1. Print these objects and compare the coefficients. The dummy-coding coefficients should be familiar. How do they compare to the effect-coding ones? Do you see the connection?

  2. The coefficient for color under effect coding is exactly half of the coefficient under dummy coding. To compensate, the intercept has shifted by exactly that amount. Verify the estimates with the calculations in the code chunk below.

# Dummy coding
mean(red) # intercept
[1] 5.64
mean(white) - mean(red) # coefficient color
[1] 0.242
# Effect coding
(mean(red) + mean(white))/2 # intercept
[1] 5.76
mean(white) - (mean(red) + mean(white))/2 # coefficient color
[1] 0.121
  1. Now, compare the models with summary(). What is different, and what is the same?
# Regression dummy
summary(wine_dummy)

Call:
lm(formula = Quality ~ Color_dummy, data = wine)

Residuals:
   Min     1Q Median     3Q    Max 
-2.878 -0.878  0.122  0.364  3.122 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.6360     0.0217  259.92   <2e-16 ***
Color_dummy   0.2419     0.0250    9.69   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.867 on 6495 degrees of freedom
Multiple R-squared:  0.0142,    Adjusted R-squared:  0.0141 
F-statistic: 93.8 on 1 and 6495 DF,  p-value: <2e-16
# Regression effect
summary(wine_effect)

Call:
lm(formula = Quality ~ Color_effect, data = wine)

Residuals:
   Min     1Q Median     3Q    Max 
-2.878 -0.878  0.122  0.364  3.122 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    5.7570     0.0125  461.04   <2e-16 ***
Color_effect   0.1209     0.0125    9.69   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.867 on 6495 degrees of freedom
Multiple R-squared:  0.0142,    Adjusted R-squared:  0.0141 
F-statistic: 93.8 on 1 and 6495 DF,  p-value: <2e-16
  1. The coding (dummy or effect) changes the scaling of the predictor’s coefficient and its standard error, but has no influence on the predictor’s t-value and p-value. The intercept’s test statistics, however, changed dramatically, because under effect coding the intercept lies considerably further from zero relative to its standard error.
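You can convince yourself that the two codings describe the very same model: the fitted values (and hence R-squared) are identical, and only the parameterization differs. A sketch using the built-in mtcars data, where the 0/1 variable am plays the role of the dummy-coded color:

```r
# Dummy (0/1) versus effect (-1/1) coding of a binary predictor
dat <- mtcars
dat$am_effect <- ifelse(dat$am == 0, -1, 1)

mod_dummy  <- lm(formula = mpg ~ am,        data = dat)  # am is already 0/1
mod_effect <- lm(formula = mpg ~ am_effect, data = dat)

all.equal(fitted(mod_dummy), fitted(mod_effect))  # identical fit
coef(mod_dummy)
coef(mod_effect)  # slope is exactly half of the dummy-coded slope
```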

Examples

# Regression with R

library(tidyverse)

# Model:
# Does engine displacement predict
# highway miles per gallon?
hwy_mod <- lm(formula = hwy ~ displ,
               data = mpg)

# Results 
summary(hwy_mod)
coef(hwy_mod)

# Fitted values
hwy_fit <- fitted(hwy_mod)
hwy_fit

# Residuals 
hwy_resid <- residuals(hwy_mod)
hwy_resid

Datasets

File Rows Columns
wine.csv 6497 13

wine.csv

The wine.csv file contains data from the Comissão De Viticultura Da Região Dos Vinhos Verdes, the official certification agency for Vinho Verde in Portugal, for the years 2004 to 2007.

Name Description
Quality Quality rating on a scale between 1-9
Color red or white wine
Dissolved_Acid Concentration of acids dissolved in the wine
Free_Acid Concentration of free acids
Citric_Acid Citric acid in the wine
Residual_Sugar Sugar concentration in the wine
Chloride Chloride concentration in the wine
Free_Sulfur_Dioxide Free sulfur dioxide in the wine
Total_Sulfur_Dioxide Total amount of sulfur dioxide
Density Density of the wine
PH_Value pH value of the wine; the smaller, the more acidic
Sulphate Sulphate concentration of the wine
Alcohol Alcohol in the wine in %

Functions

Packages

Package Installation
tidyverse install.packages("tidyverse")

Functions

Function Package Description
lm stats Fit a linear model
fitted stats Extract predicted values
residuals stats Extract residuals

Resources

Books