class: center, middle, inverse, title-slide

# Statistics

### Intro to Data Science with R
The R Bootcamp @ Unibe
### September 2021

---
layout: true

---

# General Linear Models

.pull-left4[

<ul>
  <li class="m1"><span>The general linear model is the Swiss Army knife of statistics.</span></li>
  <li class="m2"><span>It includes:</span></li>
  <ul class="level">
    <li><span><high>Regression</high></span></li>
    <li><span><high>t-test</high></span></li>
    <li><span><high>Analysis of variance (ANOVA)</high></span></li>
    <li><span>Mediation analysis</span></li>
    <li><span>Factor analysis</span></li>
    <li><span>Structural equation modeling</span></li>
  </ul>
</ul>

]

.pull-right5[

<p align="center">
  <img src="image/swiss_sm.png">
</p>

]

---

# Simple linear regression

.pull-left4[

<ul>
  <li class="m1"><span>How well does a <high>linear function of one predictor (x)</high> account for the criterion (y)?</span></li>
  <li class="m2"><span>Parameters:</span></li>
  <ul class="level">
    <li><span>β<sub>0</sub>: <high>Intercept</high> on the y-axis</span></li>
    <li><span>β<sub>1</sub>: <high>Slope</high></span></li>
  </ul>
</ul>

<br>

`$$\Large \hat{y} = b_0 + b_1 \cdot x$$`

]

.pull-right5[

<img src="Statistics_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

]

---

# Simple linear regression

.pull-left4[

<ul>
  <li class="m1"><span>How well does a <high>linear function of one predictor (x)</high> account for the criterion (y)?</span></li>
  <li class="m2"><span>Parameters:</span></li>
  <ul class="level">
    <li><span>β<sub>0</sub>: <high>Intercept</high> on the y-axis</span></li>
    <li><span>β<sub>1</sub>: <high>Slope</high></span></li>
  </ul>
</ul>

<br>

`$$\Large \widehat{\text{Nights}} = b_0 + b_1 \cdot \text{Equivalent income}$$`

]

.pull-right5[

<img src="Statistics_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

]

---

# Multiple linear regression

.pull-left4[

<ul>
  <li class="m1"><span>How well does a <high>linear function of multiple predictors (x<sub>1</sub>, ..., x<sub>k</sub>)</high> account for the criterion (y)?</span></li>
  <li class="m2"><span>Parameters:</span></li>
  <ul class="level">
    <li><span>β<sub>0</sub>: 
<high>Intercept</high> on the y-axis</span></li>
    <li><span>β<sub>1</sub>: <high>Slope</high> for x<sub>1</sub></span></li>
    <li><span>β<sub>2</sub>: <high>Slope</high> for x<sub>2</sub></span></li>
    <li><span>β<sub>k</sub>: <high>Slope</high> for x<sub>k</sub></span></li>
  </ul>
</ul>

<br>

`$$\Large \hat{y} = b_0 + b_1 \cdot x_1 + \dots + b_k \cdot x_k$$`

]

.pull-right5[
]

---

# Formulas

.pull-left4[

<ul>
  <li class="m1"><span>Models in R are defined using <highm>formula</highm> expressions.</span></li>
</ul>

<font style="font-size:20px"><u>Syntax</u></p>

<table style="cellspacing:0; cellpadding:0; border:none; padding-top:10px" width=100%>
  <col width="40%">
  <col width="60%">
<tr>
  <td bgcolor="white"> <b>Operator</b> </td>
  <td bgcolor="white"> <b>Description</b> </td>
</tr>
<tr>
  <td bgcolor="white"> <mono>+</mono> / <mono>-</mono> </td>
  <td bgcolor="white"> Add / remove a predictor. </td>
</tr>
<tr>
  <td bgcolor="white"> <mono>*</mono> / <mono>:</mono> </td>
  <td bgcolor="white"> Add an interaction with (<mono>*</mono>) or without (<mono>:</mono>) main effects. </td>
</tr>
<tr>
  <td bgcolor="white"> <mono>1</mono> / <mono>0</mono> </td>
  <td bgcolor="white"> Add / remove the intercept. </td>
</tr>
<tr>
  <td bgcolor="white"> <mono>.</mono> </td>
  <td bgcolor="white"> Add all other variables in the data as predictors. </td>
</tr>
</table>

]

<br>

.pull-right5[

<p align="center">
  <img src="image/formula.png">
</p>

]

---

# <mono>lm()</mono>

.pull-left35[

<font style="font-size:20px"><u>Fitting</u></p>

<table style="cellspacing:0; cellpadding:0; border:none; padding-top:10px" width=100%>
  <col width="40%">
  <col width="60%">
<tr>
  <td bgcolor="white"> <b>Function</b> </td>
  <td bgcolor="white"> <b>Description</b> </td>
</tr>
<tr>
  <td bgcolor="white"> <mono>lm(formula, data)</mono> </td>
  <td bgcolor="white"> Fit a <high>linear model</high>. </td>
</tr>
</table>

<font style="font-size:20px"><u>Evaluation</u></p>

<table style="cellspacing:0; cellpadding:0; border:none; padding-top:10px" width=100%>
  <col width="40%">
  <col width="60%">
<tr>
  <td bgcolor="white"> <b>Function</b> </td>
  <td bgcolor="white"> <b>Description</b> </td>
</tr>
<tr>
  <td bgcolor="white"> <mono>summary(mod)</mono> </td>
  <td bgcolor="white"> Show <high>result overview</high>. </td>
</tr>
<tr>
  <td bgcolor="white"> <mono>coef(mod)</mono> </td>
  <td bgcolor="white"> Extract <high>coefficients</high>. 
</td>
</tr>
<tr>
  <td bgcolor="white"> <mono>predict(mod)</mono>, <mono>resid(mod)</mono> </td>
  <td bgcolor="white"> Extract <high>fitted values</high> / <high>residuals</high>. </td>
</tr>
</table>

]

.pull-right6[

```r
# Fit model
nights_lm <- lm(
  formula = Nights ~ `Equivalent income` + Population,
  data = nights)
```

]

---

# <mono>lm()</mono>

.pull-left35[

<font style="font-size:20px"><u>Fitting</u></p>

<table style="cellspacing:0; cellpadding:0; border:none; padding-top:10px" width=100%>
  <col width="40%">
  <col width="60%">
<tr>
  <td bgcolor="white"> <b>Function</b> </td>
  <td bgcolor="white"> <b>Description</b> </td>
</tr>
<tr>
  <td bgcolor="white"> <mono>lm(formula, data)</mono> </td>
  <td bgcolor="white"> Fit a <high>linear model</high>. </td>
</tr>
</table>

<font style="font-size:20px"><u>Evaluation</u></p>

<table style="cellspacing:0; cellpadding:0; border:none; padding-top:10px" width=100%>
  <col width="40%">
  <col width="60%">
<tr>
  <td bgcolor="white"> <b>Function</b> </td>
  <td bgcolor="white"> <b>Description</b> </td>
</tr>
<tr>
  <td bgcolor="white"> <mono>summary(mod)</mono> </td>
  <td bgcolor="white"> Show <high>result overview</high>. </td>
</tr>
<tr>
  <td bgcolor="white"> <mono>coef(mod)</mono> </td>
  <td bgcolor="white"> Extract <high>coefficients</high>. </td>
</tr>
<tr>
  <td bgcolor="white"> <mono>predict(mod)</mono>, <mono>resid(mod)</mono> </td>
  <td bgcolor="white"> Extract <high>fitted values</high> / <high>residuals</high>. 
</td>
</tr>
</table>

]

.pull-right6[

```r
# Print nights_lm
nights_lm
```

```
##
## Call:
## lm(formula = Nights ~ `Equivalent income` + Population, data = nights)
##
## Coefficients:
##         (Intercept)  `Equivalent income`           Population
##           -1.99e+03             1.17e-01             8.33e-02
```

]

---

# <mono>summary()</mono>

.pull-left35[

<font style="font-size:20px"><u>Fitting</u></p>

<table style="cellspacing:0; cellpadding:0; border:none; padding-top:10px" width=100%>
  <col width="40%">
  <col width="60%">
<tr>
  <td bgcolor="white"> <b>Function</b> </td>
  <td bgcolor="white"> <b>Description</b> </td>
</tr>
<tr>
  <td bgcolor="white"> <mono>lm(formula, data)</mono> </td>
  <td bgcolor="white"> Fit a <high>linear model</high>. </td>
</tr>
</table>

<font style="font-size:20px"><u>Evaluation</u></p>

<table style="cellspacing:0; cellpadding:0; border:none; padding-top:10px" width=100%>
  <col width="40%">
  <col width="60%">
<tr>
  <td bgcolor="white"> <b>Function</b> </td>
  <td bgcolor="white"> <b>Description</b> </td>
</tr>
<tr>
  <td bgcolor="white"> <mono>summary(mod)</mono> </td>
  <td bgcolor="white"> Show <high>result overview</high>. </td>
</tr>
<tr>
  <td bgcolor="white"> <mono>coef(mod)</mono> </td>
  <td bgcolor="white"> Extract <high>coefficients</high>. </td>
</tr>
<tr>
  <td bgcolor="white"> <mono>predict(mod)</mono>, <mono>resid(mod)</mono> </td>
  <td bgcolor="white"> Extract <high>fitted values</high> / <high>residuals</high>. </td>
</tr>
</table>

]

.pull-right6[

```r
# Show results
summary(nights_lm)
```

```
##
## Call:
## lm(formula = Nights ~ `Equivalent income` + Population, data = nights)
##
## Residuals:
##    Min     1Q Median     3Q    Max
##  -5403   -795    144    672  10721
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)         -1.99e+03   1.12e+03   -1.78    0.085 .
## `Equivalent income`  1.17e-01   6.32e-02    1.86    0.073 .
## Population           8.33e-02   1.67e-02    4.99  2.2e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

]

---

# Categorical Variables

.pull-left4[

<ul>
  <li class="m1"><span>The general linear model can handle <high>categorical predictors</high>.</span></li>
  <li class="m2"><span>Besides <high>dedicated tests</high> (e.g., <mono>t.test()</mono>), such predictors can also be included in <mono>lm()</mono>.</span></li>
  <li class="m3"><span>Examples</span></li>
  <ul class="level">
    <li><span><high>Comparison of multiple groups</high></span></li>
    <li><span><high>A/B tests</high></span></li>
  </ul>
</ul>

]

.pull-right5[

<img src="Statistics_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

]

---

# `t.test()`

.pull-left4[

<ul>
  <li class="m1"><span>The t-test <high>compares two groups</high> on one continuous variable.</span></li>
  <li class="m2"><span>The null hypothesis states that the two groups have <high>identical means</high>.</span></li>
  <li class="m3"><span>Examples</span></li>
  <ul class="level">
    <li><span><high>Comparison of two groups</high></span></li>
    <li><span><high>A/B tests</high></span></li>
  </ul>
</ul>

]

.pull-right5[

```r
# t-test (Welch version by default)
t.test(tour$Nights_log[tour$Region == 'Europa'],
       tour$Nights_log[tour$Region == 'Asien'])
```

```
##
## 	Welch Two Sample t-test
##
## data:  tour$Nights_log[tour$Region == "Europa"] and tour$Nights_log[tour$Region == "Asien"]
## t = 1.4, df = 40, p-value = 0.2
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3593  2.1123
## sample estimates:
## mean of x mean of y
##     8.695     7.818
```

]

---

# `lm()`

.pull-left4[

<ul>
  <li class="m1"><span>The t-test <high>compares two groups</high> on one continuous variable.</span></li>
  <li class="m2"><span>The null hypothesis states that the two groups have <high>identical means</high>.</span></li>
  <li class="m3"><span>Examples</span></li>
  <ul class="level">
    <li><span><high>Comparison of two groups</high></span></li>
    <li><span><high>A/B tests</high></span></li>
  </ul>
</ul>

]

.pull-right5[

```r
# Same two-group comparison, fitted as a regression
lm(Nights_log ~ Region,
   tour %>% filter(Region %in% c('Europa', 'Asien')))
```

```
##
## Call:
## lm(formula = Nights_log ~ Region, data = tour %>% filter(Region %in%
##     c("Europa", "Asien")))
##
## Coefficients:
##  (Intercept)  RegionEuropa
##        7.818         0.876
```

]

---

# Coding

.pull-left4[

<ul>
  <li class="m1"><span>A categorical variable with k categories has to be recoded into <high>k-1 new variables</high>.</span></li>
  <li class="m2"><span>Two common coding schemes:</span></li>
  <ul>
    <li><span><high>Dummy coding</high> codes one category as 1 and all others as 0<br>→ <high>Intercept = mean of the reference (0) category</high></span></li><br>
    <li><span><high>Effect coding</high> codes one category as 1 and the reference category as -1<br>→ <high>Intercept = mean of the group means (ȳ for equal group sizes)</high></span></li>
  </ul>
</ul>

]

.pull-right5[

<p align="center">
  <img src="image/coding.png" height=420px>
</p>

]

---

# t-Test - three types

.pull-left4[

<ul>
  <li class="m1"><span>The t-test <high>compares two groups</high> on one continuous variable.</span></li>
  <li class="m2"><span>The null hypothesis states that the two groups have <high>identical means</high>.</span></li>
  <li class="m3"><span>Examples</span></li>
  <ul class="level">
    <li><span><high>Comparison of two groups</high></span></li>
    <li><span><high>A/B tests</high></span></li>
  </ul>
</ul>

]

.pull-right5[

```r
# Regular t-test (equal variances assumed, to match lm())
t_test <- t.test(tour$Nights_log[tour$Region == 'Europa'],
                 tour$Nights_log[tour$Region == 'Asien'],
                 var.equal = TRUE)

# Regression with dummy coding (R's default)
lm_dummy <- lm(
  Nights_log ~ Region,
  tour %>% filter(Region %in% c('Europa', 'Asien')))

# Regression with effect coding
lm_effect <- lm(
  Nights_log ~ Region,
  tour %>% filter(Region %in% c('Europa', 'Asien')),
  contrasts = list(Region = contr.sum))
```

]

---

# t-Test - three types

.pull-left4[

<ul>
  <li class="m1"><span>The t-test <high>compares two groups</high> on one continuous variable.</span></li>
  <li class="m2"><span>The null hypothesis states that the two groups have <high>identical means</high>.</span></li>
  <li 
class="m3"><span>Examples</span></li>
  <ul class="level">
    <li><span><high>Comparison of two groups</high></span></li>
    <li><span><high>A/B tests</high></span></li>
  </ul>
</ul>

]

.pull-right5[

```r
t_test[c('statistic','parameter','p.value')] %>% unlist()
```

```
## statistic.t parameter.df      p.value
##      1.4225      55.0000       0.1605
```

```r
summary(lm_dummy)$coef
```

```
##              Estimate Std. Error t value  Pr(>|t|)
## (Intercept)    7.8181     0.4964  15.749 4.907e-22
## RegionEuropa   0.8765     0.6162   1.423 1.605e-01
```

```r
summary(lm_effect)$coef
```

```
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)   8.2564     0.3081  26.800 2.998e-33
## Region1      -0.4382     0.3081  -1.423 1.605e-01
```

]

---

# Multiple categories

.pull-left4[

<ul>
  <li class="m1"><span>With more than two categories (k), <high><mono>k - 1</mono> dummy variables</high> are constructed.</span></li>
  <li class="m2"><span>Everything else stays the same.</span></li>
</ul>

]

.pull-right5[

<img src="Statistics_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

]

---

# Multiple categories

.pull-left4[

<ul>
  <li class="m1"><span>With more than two categories (k), <high><mono>k - 1</mono> dummy variables</high> are constructed.</span></li>
  <li class="m2"><span>Everything else stays the same.</span></li>
</ul>

]

.pull-right5[

<p align="center">
  <img src="image/dummy2.png" height=420px>
</p>

]

---

# `lm()`

.pull-left35[

<ul>
  <li class="m1"><span>With more than two categories (k), <high><mono>k - 1</mono> dummy variables</high> are constructed.</span></li>
  <li class="m2"><span>Everything else stays the same.</span></li>
</ul>

]

.pull-right55[

```r
# Regression with all regions
lm(Nights_log ~ Region, tour)
```

```
##
## Call:
## lm(formula = Nights_log ~ Region, data = tour)
##
## Coefficients:
##   (Intercept)  RegionAmerika    RegionAsien
##        7.8006         1.3884         0.0176
## RegionAustra.   RegionEuropa
##        0.9004         0.8941
```

]

---

# `lm()`

.pull-left35[

<ul>
  <li class="m1"><span>With more than two categories (k), <high><mono>k - 1</mono> dummy variables</high> are constructed.</span></li>
  <li class="m2"><span>Everything else stays the same.</span></li>
</ul>

]

.pull-right55[

```r
# Regression with all regions
mod <- lm(Nights_log ~ Region, tour)

# Show results
summary(mod)$coef
```

```
##               Estimate Std. Error t value  Pr(>|t|)
## (Intercept)    7.80057      1.092 7.14623 8.896e-10
## RegionAmerika  1.38837      1.337 1.03851 3.028e-01
## RegionAsien    0.01757      1.196 0.01469 9.883e-01
## RegionAustra.  0.90041      1.891 0.47624 6.355e-01
## RegionEuropa   0.89407      1.149 0.77809 4.393e-01
```

]

---

# `anova()`

.pull-left35[

<ul>
  <li class="m1"><span>Analysis of variance (ANOVA) generalizes the t-test to more than two groups and can be understood as a <high>special case of regression</high>.</span></li>
  <li class="m2"><span>The null hypothesis states that all groups have <high>identical means</high>.</span></li>
</ul>

]

.pull-right55[

```r
# Regression with all regions
mod <- lm(Nights_log ~ Region, tour)

# Show ANOVA results
anova(mod)
```

```
## Analysis of Variance Table
##
## Response: Nights_log
##           Df Sum Sq Mean Sq F value Pr(>F)
## Region     4   16.4    4.09    0.86   0.49
## Residuals 66  314.6    4.77
```

]

---
class: middle, center

<h1><a href="https://dwulff.github.io/Intro2R_Unibe_2021/_sessions/Statistics/Statistics_practical.html">Practical</a></h1>
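
---

# Appendix: checking the equivalence

The sketch below reruns the three approaches from the t-test slides on data that ship with R, so it can be tried without the course files. It assumes the built-in `mtcars` data as a stand-in for `tour`, with `mpg` as the outcome and `am` as the two-level predictor. All object names are illustrative and not part of the original analysis.

```r
# Assumption: built-in `mtcars` replaces the course's `tour` data.
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# 1. Student's t-test with pooled variance (var.equal = TRUE)
t_test <- t.test(mpg ~ am, data = cars, var.equal = TRUE)

# 2. Regression with dummy coding (R's default treatment contrasts)
lm_dummy <- lm(mpg ~ am, data = cars)

# 3. Regression with effect coding (sum-to-zero contrasts)
lm_effect <- lm(mpg ~ am, data = cars, contrasts = list(am = contr.sum))

# Group means and p-values used in the checks below
mean_auto   <- mean(cars$mpg[cars$am == "automatic"])
mean_manual <- mean(cars$mpg[cars$am == "manual"])
p_dummy     <- summary(lm_dummy)$coefficients["ammanual", "Pr(>|t|)"]
p_effect    <- summary(lm_effect)$coefficients["am1", "Pr(>|t|)"]

stopifnot(
  # All three approaches test the same hypothesis and agree on the p-value
  isTRUE(all.equal(t_test$p.value, p_dummy)),
  isTRUE(all.equal(p_dummy, p_effect)),
  # Dummy coding: intercept = mean of the reference category
  isTRUE(all.equal(coef(lm_dummy)[["(Intercept)"]], mean_auto)),
  # Dummy coding: slope = difference between the two group means
  isTRUE(all.equal(coef(lm_dummy)[["ammanual"]], mean_manual - mean_auto)),
  # Effect coding: intercept = unweighted mean of the two group means
  isTRUE(all.equal(coef(lm_effect)[["(Intercept)"]],
                   mean(c(mean_auto, mean_manual))))
)
```

The script stops with an error if any of the checks fails, so running it without a message means they all held. Under effect coding the intercept is the unweighted mean of the group means, which equals the grand mean only when the groups are the same size.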