class: center, middle, inverse, title-slide

# Statistics
### Introduction to R
Bern R Bootcamp
### June 2020

---
layout: true

<div class="my-footer">
  <span style="text-align:center">
    <span>
      <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
    </span>
    <a href="https://therbootcamp.github.io/">
      <span style="padding-left:82px">
        <font color="#7E7E7E"> www.therbootcamp.com </font>
      </span>
    </a>
    <a href="https://therbootcamp.github.io/">
      <font color="#7E7E7E"> Introduction to R | June 2020 </font>
    </a>
  </span>
</div>

---

# Statistics I

.pull-left45[

#### <high>Descriptive statistics</high> with `dplyr`

```r
# Group-summarise idiom
baselers %>%
  group_by(sex, eyecor) %>%
  summarise(
    N = n(),
    age_mean = mean(age),
    height_median = median(height),
    children_max = max(children)
  )
```

#### <high>Simple hypothesis test</high> with `stats`

```r
# Simple hypothesis test
t.test(baselers$happiness,
       baselers$fitness,
       var.equal = TRUE)
```

]

.pull-right45[

<p align = "center">
<img src="image/null_hypothesis.png" height=430px><br>
<font style="font-size:10px">from <a href="https://xkcd.com/892/">xkcd.com</a></font>
</p>

]

---

# `dplyr` + `tidyr`

.pull-left5[

To wrangle data in R, we will use the <high><mono>dplyr</mono></high> and <high><mono>tidyr</mono></high> packages, which are part of the <high><mono>tidyverse</mono></high>.

| Package | Purpose | Functions |
|:-------------|:----|:----|
|<b>dplyr</b> | Transformation | `rename()`, `mutate()`, `case_when()`, `*_join()` |
|<b>dplyr</b> | Organisation | `arrange()`, `slice()`, `filter()`, `select()` |
|<b>tidyr</b> | Organisation | `pivot_longer()`, `pivot_wider()` |
|<b>dplyr</b> | Aggregation | `group_by()`, `summarise()` |

]

.pull-right4[

<p align = "center">
<img src="image/packages.png" height=320px>
</p>

]

---

# Grouped aggregation

.pull-left3[

<high>(Conditional) descriptive statistics</high> are easily calculated using `dplyr`'s `group_by()` and `summarise()` idiom.

```r
# Start with data
data %>% # AND THEN...
  # GROUPING VARIABLE
  GROUP_BY %>%
  # DO SUMMARIES
  SUMMARISE(
    RESULT_1,
    RESULT_2,
    RESULT_3
  )
```

]

.pull-right6[

<p align="right">
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/summarsed_data_diagram.png" height="414px">
</p>

]

---

# The Pipe! <high>`%>%`</high>

.pull-left4[

`dplyr` makes extensive use of an operator called the "Pipe" <high>`%>%`</high><br>

Read the "Pipe" <high>`%>%`</high> as "And Then..."

<br>

```r
# Start with data
data %>% # AND THEN...
  DO_SOMETHING %>% # AND THEN...
  DO_SOMETHING %>% # AND THEN...
  DO_SOMETHING %>% # AND THEN...
```

]

.pull-right55[

<p align="center">
<img src="https://upload.wikimedia.org/wikipedia/en/thumb/b/b9/MagrittePipe.jpg/300px-MagrittePipe.jpg" width = "450px"><br>
This is not a pipe (but %>% is!)
</p>

]

---

# `summarise()`

.pull-left45[

Use `summarise()` to create new columns of <high>summary statistics</high>.

The result of `summarise()` is a data frame (a tibble if the input is a tibble).

Functions used in `summarise()` <high>must return a single value</high>.

```r
data %>%
  summarise(
    NAME = SUMMARY_FUN(A),
    NAME = SUMMARY_FUN(B),
    ...
  )
```

]

.pull-right5[

```r
# Calculate summary statistics
baselers %>%
  summarise(
    N = n(),
    age_mean = mean(age),
    height_median = median(height),
    children_max = max(children)
  )
```

```
# A tibble: 1 x 4
      N age_mean height_median children_max
  <int>    <dbl>         <dbl>        <dbl>
1 10000     44.6          171.            6
```

]

---

# `summarise()`

.pull-left45[

Use `summarise()` to create new columns of <high>summary statistics</high>.

The result of `summarise()` is a data frame (a tibble if the input is a tibble).

Functions used in `summarise()` <high>must return a single value</high>.

```r
data %>%
  summarise(
    NAME = SUMMARY_FUN(A),
    NAME = SUMMARY_FUN(B),
    ...
  )
```

]

.pull-right5[

| Function| Purpose | Returns |
|:-------------|:-------|:-------|
| `n()`| Count values | <font color="6ABA9A"><b>Single value</b></font> |
| `mean()`, `median()` | Central tendency | <font color="6ABA9A"><b>Single value</b></font> |
| `sd()`, `var()` | Variability | <font color="6ABA9A"><b>Single value</b></font> |
| `max()`, `min()` | Extremes | <font color="6ABA9A"><b>Single value</b></font> |
| `quantile()` | Quantiles | <font color="6ABA9A"><b>One</b></font> or <high>multiple values</high> |
| `range()` | Range | <high>Two values</high> |
| `table()` | (Cross-) tables | <high>Multiple values</high> |
| `summary()` | Overview | <high>Multiple values</high> |

]

---

# `group_by()` + `summarise()`

.pull-left45[

Use `group_by()` to <high>group data</high> according to one or more columns.

Then, use `summarise()` to <high>calculate summary statistics</high> for each group.

You can include <high>one or more</high> grouping variables.

```r
data %>%
  group_by(A, B, ...) %>%
  summarise(
    NAME = SUMMARY_FUN(A),
    NAME = SUMMARY_FUN(B),
    ...
  )
```

]

.pull-right5[

```r
# Group data by sex, and calculate many
# summary statistics
baselers %>%
  group_by(sex) %>%
  summarise(
    N = n(),
    age_mean = mean(age),
    height_median = median(height),
    children_max = max(children)
  )
```

```
# A tibble: 2 x 5
  sex        N age_mean height_median children_max
  <chr>  <int>    <dbl>         <dbl>        <dbl>
1 female  5000     45.4          164             6
2 male    5000     43.8          178.            6
```

]

---

# `group_by()` + `summarise()`

.pull-left45[

Use `group_by()` to <high>group data</high> according to one or more columns.

Then, use `summarise()` to <high>calculate summary statistics</high> for each group.

You can include <high>one or more</high> grouping variables.

```r
data %>%
  group_by(A, B, ...) %>%
  summarise(
    NAME = SUMMARY_FUN(A),
    NAME = SUMMARY_FUN(B),
    ...
  )
```

]

.pull-right5[

```r
# Group data by sex and eyecor, and calculate
# several summary statistics
baselers %>%
  group_by(sex, eyecor) %>%
  summarise(
    N = n(),
    age_mean = mean(age),
    height_median = median(height)
  )
```

```
# A tibble: 4 x 5
# Groups:   sex [2]
  sex    eyecor     N age_mean height_median
  <chr>  <chr>  <int>    <dbl>         <dbl>
1 female no      1731     45.3          164.
2 female yes     3269     45.5          164
3 male   no      1772     43.6          178.
4 male   yes     3228     43.9          178.
```

]

---

# Full pipeline

.pull-left25[

Combine <high>transformation</high>, <high>organisation</high>, and <high>aggregation</high> operations in a single pipeline!

Just use the <high>pipe %>%</high>!

]

.pull-right65[

```r
baselers %>%
  mutate(catholic = confession == "catholic") %>%
  filter(sex == "male" & children > 0) %>%
  group_by(sex, catholic) %>%
  summarise(
    N = n(),
    age_mean = mean(age),
    income_median = median(income, na.rm = TRUE)
  )
```

```
# A tibble: 3 x 5
# Groups:   sex [1]
  sex   catholic     N age_mean income_median
  <chr> <lgl>    <int>    <dbl>         <dbl>
1 male  FALSE     2452     43.7          7100
2 male  TRUE      1401     44.0          7100
3 male  NA         703     43.5          7000
```

]

---

# Inferential statistics

.pull-left6[

<u>Specific tests</u>

| Function| Purpose |
|:------|:-------|
| `t.test()` | Compare group means |
| `cor.test()` | Test correlations |
| `chisq.test()` | Compare cell frequencies |
| `wilcox.test()` | Compare groups non-parametrically (ranks) |

<u>Formula-based tests</u>

| Function| Purpose |
|:-----|:-------|
| `lm()`, `glm()`| (Generalized) linear models |
| `lmer()`, `glmer()` | (Generalized) linear mixed models |
| `regressionBF()` | Bayesian linear models |

]

.pull-right35[

<p align = "center">
<img src="image/null_hypothesis.png" height=430px><br>
<font style="font-size:10px">from <a href="https://xkcd.com/892/">xkcd.com</a></font>
</p>

]

---

# `t.test()`

.pull-left45[

The <high>t-test</high> compares one group mean versus a <high>reference</high> or versus <high>another group</high>'s mean.

Compare two means by providing <high>two numeric vectors</high> for the arguments `x` and `y`.

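
As a minimal sketch of both uses, the following example relies on R's built-in `sleep` data rather than `baselers` (an assumption made so that it runs without the course data). Passing only `x` together with `mu` tests one mean against a reference value; passing `x` and `y` compares two group means.

```r
# Built-in data: extra hours of sleep under two drugs (10 values per group)
drug_1 <- sleep$extra[sleep$group == "1"]
drug_2 <- sleep$extra[sleep$group == "2"]

# One group mean versus a reference value (here: 0 extra hours)
one_sample <- t.test(x = drug_2, mu = 0)

# One group mean versus another group mean
two_sample <- t.test(x = drug_1, y = drug_2)

# Both calls return a list-like "htest" object
one_sample$estimate
two_sample$p.value
```

Printing `one_sample` or `two_sample` shows the full test output in the layout used on these slides.
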
Additional <high>arguments allow for variations</high>, e.g., `var.equal = TRUE` to assume equal variances instead of applying the default Welch correction.

]

.pull-right5[

```r
# 2-sample t-test
t.test(baselers$happiness, baselers$fitness)
```

```
    Welch Two Sample t-test

data:  baselers$happiness and baselers$fitness
t = 83, df = 15844, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1.738 1.822
sample estimates:
mean of x mean of y
    6.905     5.125
```

]

---

# `t.test()`

.pull-left45[

The <high>t-test</high> compares one group mean versus a <high>reference</high> or versus <high>another group</high>'s mean.

Compare two means by providing <high>two numeric vectors</high> for the arguments `x` and `y`.

Additional <high>arguments allow for variations</high>, e.g., `var.equal = TRUE` to assume equal variances instead of applying the default Welch correction.

]

.pull-right5[

```r
# 2-sample t-test assuming equal variance
t.test(baselers$happiness,
       baselers$fitness,
       var.equal = TRUE)
```

```
    Two Sample t-test

data:  baselers$happiness and baselers$fitness
t = 83, df = 19998, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1.738 1.822
sample estimates:
mean of x mean of y
    6.905     5.125
```

]

---

# `cor.test()`

.pull-left45[

The <high>correlation test</high> tests whether the correlation of two variables differs from 0.

Evaluate the correlation by providing <high>two numeric vectors</high> for the arguments `x` and `y`.

Additional <high>arguments allow for variations</high>, e.g., to conduct the test using different correlation measures.

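
To illustrate the different correlation measures, here is a small sketch that assumes R's built-in `mtcars` data instead of `baselers`, so that it runs without the course files. It tests the same relationship once with Pearson's r and once with the rank-based Kendall's tau.

```r
# Built-in data: car weight (wt) and fuel efficiency (mpg)
pearson <- cor.test(x = mtcars$wt, y = mtcars$mpg)

# Same test with a rank-based measure; exact = FALSE requests the
# normal approximation, which avoids the warning about ties
kendall <- cor.test(x = mtcars$wt, y = mtcars$mpg,
                    method = "kendall", exact = FALSE)

# Extract the estimated correlations
pearson$estimate
kendall$estimate
```
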
]

.pull-right5[

```r
# correlation test
cor.test(x = baselers$age,
         y = baselers$income)
```

```
    Pearson's product-moment correlation

data:  baselers$age and baselers$income
t = 183, df = 8508, p-value <2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8882 0.8968
sample estimates:
   cor
0.8926
```

]

---

# `cor.test()`

.pull-left45[

The <high>correlation test</high> tests whether the correlation of two variables differs from 0.

Evaluate the correlation by providing <high>two numeric vectors</high> for the arguments `x` and `y`.

Additional <high>arguments allow for variations</high>, e.g., to conduct the test using different correlation measures.

]

.pull-right5[

```r
# correlation test using Spearman's rank correlation
cor.test(x = baselers$age,
         y = baselers$income,
         method = "spearman")
```

```
Warning in cor.test.default(x = baselers$age, y = baselers$income, method =
"spearman"): Cannot compute exact p-value with ties
```

```
    Spearman's rank correlation rho

data:  baselers$age and baselers$income
S = 1.3e+10, p-value <2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
   rho
0.8756
```

]

---

# `chisq.test()`

.pull-left45[

The <high>chi-square test</high> compares frequencies in (cross-)tables for equality in absolute or relative frequency.

Evaluate frequencies by providing a <high>table</high>, <high>vector</high>, or <high>matrix</high> for the argument `x`.

Additional <high>arguments allow for variations</high>, e.g., `correct = FALSE` to switch off Yates' continuity correction.

]

.pull-right5[

```r
# compute cross-table with table
tab <- baselers %>%
  mutate(tattoo = tattoos == TRUE) %>%
  select(sex, tattoo) %>%
  table()

# show table
tab
```

```
        tattoo
sex      FALSE TRUE
  female  4703  297
  male    4794  206
```

]

---

# `chisq.test()`

.pull-left45[

The <high>chi-square test</high> compares frequencies in (cross-)tables for equality in absolute or relative frequency.

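
As a minimal sketch with made-up counts (not taken from `baselers`): a vector of counts is tested for equal absolute frequencies, and a 2 x 2 cross-table is tested for equal relative frequencies across rows, i.e., independence of rows and columns. The names `counts` and `tab_demo` are chosen for illustration only.

```r
# Made-up counts for three categories: test against equal frequencies
counts <- c(a = 30, b = 20, c = 10)
gof <- chisq.test(counts)

# Made-up 2 x 2 cross-table: test independence of rows and columns
tab_demo <- matrix(c(40, 10,
                     25, 25),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(group = c("g1", "g2"),
                                   answer = c("no", "yes")))
indep <- chisq.test(tab_demo)

# Degrees of freedom: categories - 1, and (rows - 1) * (columns - 1)
gof$parameter
indep$parameter
```
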
Evaluate frequencies by providing a <high>table</high>, <high>vector</high>, or <high>matrix</high> for the argument `x`.

Additional <high>arguments allow for variations</high>, e.g., `correct = FALSE` to switch off Yates' continuity correction.

]

.pull-right5[

```r
# chi-square test
chisq.test(tab)
```

```
    Pearson's Chi-squared test with Yates' continuity correction

data:  tab
X-squared = 17, df = 1, p-value = 4e-05
```

]

---

# `wilcox.test()`

.pull-left45[

The <high>Wilcoxon test</high> compares one group's average rank <high>versus a reference</high> or <high>versus another group</high>'s average rank.

Compare average ranks of two groups by providing <high>two numeric vectors</high> for the arguments `x` and `y`.

Additional <high>arguments allow for variations</high>.

]

.pull-right5[

```r
# 2-sample Wilcoxon rank sum test
wilcox.test(baselers$happiness,
            baselers$fitness)
```

```
    Wilcoxon rank sum test with continuity correction

data:  baselers$happiness and baselers$fitness
W = 7.8e+07, p-value <2e-16
alternative hypothesis: true location shift is not equal to 0
```

]

---

class: middle, center

<h1><a href="https://dwulff.github.io/Intro2R_Unibe/_sessions/StatisticsI/StatisticsI_practical.html">Practical</a></h1>
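
---

# Try it without `baselers`

The pipelines in this deck assume that `baselers` is loaded. As a rough sketch that runs on its own, the following example applies the same `group_by()` + `summarise()` idiom to R's built-in `mtcars` data. The data set and the column names `N`, `mpg_mean`, and `hp_median` are choices made for illustration, not part of the course material.

```r
library(dplyr)

# Group cars by number of cylinders and summarise each group
cyl_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    N = n(),
    mpg_mean = mean(mpg),
    hp_median = median(hp)
  )

# One row per group, one column per summary statistic
cyl_summary
```

Each function inside `summarise()` returns a single value per group, so the result has one row for each value of `cyl`.
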