# Statistics
### Introduction to R <a href='https://therbootcamp.github.io'> Bern R Bootcamp </a>
### June 2020

---

<div class="my-footer">
 
 
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
 
 <a href="https://therbootcamp.github.io/">
 
 
 www.therbootcamp.com
 
 
 </a>
 <a href="https://therbootcamp.github.io/">
 
 Introduction to R | June 2020
 
 </a>
 
 </div> 
---

# Statistics I

#### <high>Descriptive statistics</high> with `dplyr`

```r
# Group-summarise idiom
baselers %>%
  group_by(sex, eyecor) %>%
  summarise(
    N = n(),
    age_mean = mean(age),
    height_median = median(height),
    children_max = max(children)
  )
```

#### <high>Simple hypothesis test</high> with `stats`

```r
# Simple hypothesis test
t.test(baselers$happiness,
       baselers$fitness,
       var.equal = TRUE)
```


<img src="image/null_hypothesis.png" height=430px> 
from <a href="https://xkcd.com/892/">xkcd.com</a>


---

# `dplyr` + `tidyr`

To wrangle data in R, we will use the <high><mono>dplyr</mono></high> and <high><mono>tidyr</mono></high> packages, which are part of the <high><mono>tidyverse</mono></high>.

| Package | Purpose | Functions |
|:-------------|:----|:----|
|dplyr | Transformation | `rename()`, `mutate()`, `case_when()`, `*_join()` |
|dplyr | Organisation | `arrange()`, `slice()`, `filter()`, `select()` |
|tidyr | Organisation | `pivot_longer()`, `pivot_wider()` |
|dplyr | Aggregation | `group_by()`, `summarise()` |


---

# Grouped aggregation

<high>(Conditional) descriptive statistics</high> are easily calculated using `dplyr`'s `group_by()` and `summarise()` idiom.

```r
# Start with data
data %>% # AND THEN...
  
# GROUPING VARIABLE
GROUP_BY %>% 
  
# DO SUMMARIES
SUMMARISE( 
  
  RESULT_1, 
  RESULT_2,
  RESULT_3
  
  ) 
```


---

# The Pipe! <high>`%>%`</high>

`dplyr` makes extensive use of a new operator called the "Pipe" <high>`%>%`</high>

Read the "Pipe" <high>`%>%`</high> as "And Then..."

```r
# Start with data
data %>% # AND THEN...
  
DO_SOMETHING %>% # AND THEN...
  
DO_SOMETHING %>% # AND THEN...
  
DO_SOMETHING %>% # AND THEN...
```
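
The "and then" reading can be checked on a small case. The sketch below uses a hypothetical toy vector `x` (not part of the course data) and assumes `dplyr` is installed, since loading it makes `%>%` available. It compares a nested call with the equivalent piped call.

```r
library(dplyr)  # loading dplyr makes the %>% operator available

# Hypothetical toy vector, used only for illustration
x <- c(4, 9, 16)

# Nested call: has to be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped call: read left to right, "and then..."
piped <- x %>% sqrt() %>% mean() %>% round(1)

# Both forms compute the same value
identical(nested, piped)
```

The pipe passes the left-hand side as the first argument of the function on the right, so `x %>% round(1)` is just another way to write `round(x, 1)`.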


<img src="https://upload.wikimedia.org/wikipedia/en/thumb/b/b9/MagrittePipe.jpg/300px-MagrittePipe.jpg" width = "450px"> 
 This is not a pipe (but %>% is!)


---

# `summarise()`

Use `summarise()` to create new columns of <high>summary statistics</high>.

The result of `summarise()` is always a data frame (a tibble, if the input is one).

Functions used in `summarise()` <high>must return a single value</high>.

```r
data %>%
  summarise(
    NAME = SUMMARY_FUN(A),
    NAME = SUMMARY_FUN(B),
    ...
  )
```


```r
# Calculate summary statistics
baselers %>%
  summarise(
    N = n(),
    age_mean = mean(age),
    height_median = median(height),
    children_max = max(children)
  )
```

```
# A tibble: 1 x 4
      N age_mean height_median children_max
  <int>    <dbl>         <dbl>        <dbl>
1 10000     44.6          171.            6
```
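
A common stumbling block with `summarise()` is missing data. The following is a minimal sketch on a hypothetical toy tibble (`toy`, with made-up values, not the `baselers` data) showing that a summary function returns `NA` as soon as one value is missing, unless `na.rm = TRUE` is set.

```r
library(dplyr)

# Hypothetical toy data with one missing income
toy <- tibble(
  age    = c(25, 40, 61, 33),
  income = c(5000, NA, 8200, 6100)
)

toy_summary <- toy %>%
  summarise(
    N = n(),                                   # number of rows
    age_mean = mean(age),                      # no missing values, works as is
    income_mean_raw = mean(income),            # NA, because one income is missing
    income_mean = mean(income, na.rm = TRUE)   # missing value dropped first
  )

toy_summary
```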


---

# `summarise()`

Use `summarise()` to create new columns of <high>summary statistics</high>.

The result of `summarise()` is always a data frame (a tibble, if the input is one).

Functions used in `summarise()` <high>must return a single value</high>.

```r
data %>%
  summarise(
    NAME = SUMMARY_FUN(A),
    NAME = SUMMARY_FUN(B),
    ...
  )
```


| Function| Purpose | Returns |
|:-------------|:-------|:-------| 
| `n()`| Count rows | Single value |
| `mean()`, `median()` | Central tendencies | Single value |
| `sd()`, `var()` | Dispersion | Single value |
| `max()`, `min()` | Extremes | Single value |
| `quantile()` | Quantiles | One or <high>multiple values</high> |
| `range()` | Range | <high>Two values</high> |
| `table()` | (Cross-) tables | <high>Multiple values</high> |
| `summary()` | Overview | <high>Multiple values</high> |
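
The "Returns" column can be verified directly with `length()`. This is a small sketch on a hypothetical toy vector `x`. It also shows one possible way to turn a multi-value function such as `range()` into single values that fit into `summarise()`.

```r
# Hypothetical toy vector, used only for illustration
x <- c(2, 4, 4, 7, 9)

n_mean      <- length(mean(x))            # one value
n_range     <- length(range(x))           # two values: minimum and maximum
n_quantiles <- length(quantile(x))        # five values by default (0, .25, .5, .75, 1)
n_quartile  <- length(quantile(x, 0.25))  # one value when a single quantile is requested

c(n_mean, n_range, n_quantiles, n_quartile)

# One option: replace range() by two single-value functions
c(min = min(x), max = max(x))
```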


---

# `group_by()` + `summarise()`

Use `group_by()` to <high>group data</high> according to one or more columns.

Then, use `summarise()` to <high>calculate summary statistics</high> for each group.

You can include <high>one or more</high> grouping variables.

```r
data %>%
  group_by(A, B, ...) %>%
  summarise(
    NAME = SUMMARY_FUN(A),
    NAME = SUMMARY_FUN(B),
    ...
  )
```


```r
# Group data by sex, and calculate several
#  summary statistics
baselers %>%
  group_by(sex) %>%
  summarise(
    N = n(),
    age_mean = mean(age),
    height_median = median(height),
    children_max = max(children)
  )
```

```
# A tibble: 2 x 5
  sex        N age_mean height_median children_max
  <chr>  <int>    <dbl>         <dbl>        <dbl>
1 female  5000     45.4          164             6
2 male    5000     43.8          178.            6
```


---

# `group_by()` + `summarise()`

Use `group_by()` to <high>group data</high> according to one or more columns.

Then, use `summarise()` to <high>calculate summary statistics</high> for each group.

You can include <high>one or more</high> grouping variables.

```r
data %>%
  group_by(A, B, ...) %>%
  summarise(
    NAME = SUMMARY_FUN(A),
    NAME = SUMMARY_FUN(B),
    ...
  )
```


```r
# Group data by sex and eyecor, and calculate
#  several summary statistics
baselers %>%
  group_by(sex, eyecor) %>%
  summarise(
    N = n(),
    age_mean = mean(age),
    height_median = median(height)
  )
```

```
# A tibble: 4 x 5
# Groups:   sex [2]
  sex    eyecor     N age_mean height_median
  <chr>  <chr>  <int>    <dbl>         <dbl>
1 female no      1731     45.3          164.
2 female yes     3269     45.5          164 
3 male   no      1772     43.6          178.
4 male   yes     3228     43.9          178.
```
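
The `# Groups: sex [2]` line in the output indicates that the result is still grouped by `sex`: with several grouping variables, `summarise()` drops only the last one. The sketch below uses a hypothetical toy tibble and assumes `dplyr` 1.0.0 or later, where the `.groups` argument is available, to return an ungrouped result instead.

```r
library(dplyr)

# Hypothetical toy data, not the baselers data set
toy <- tibble(
  sex    = c("female", "female", "male", "male", "male"),
  eyecor = c("yes", "no", "yes", "yes", "no"),
  age    = c(30, 50, 20, 40, 60)
)

by_two <- toy %>%
  group_by(sex, eyecor) %>%
  summarise(
    N = n(),
    age_mean = mean(age),
    .groups = "drop"   # return an ungrouped tibble (dplyr >= 1.0.0)
  )

by_two
```

Dropping the grouping is often convenient when further steps such as `mutate()` or `arrange()` should operate on the whole table rather than within each `sex`.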


---

# Full pipeline

Combine <high>transformation</high>, <high>organisation</high>, and <high>aggregation</high> operations at once!

Just use the <high>pipe %>%</high>!

```r
baselers %>%
  mutate(catholic = confession == "catholic") %>%
  filter(sex == "male" & children > 0) %>%
  group_by(sex, catholic) %>%
  summarise(
    N = n(),
    age_mean = mean(age),
    income_median = median(income, na.rm = TRUE)
  )
```

```
# A tibble: 3 x 5
# Groups:   sex [1]
  sex   catholic     N age_mean income_median
  <chr> <lgl>    <int>    <dbl>         <dbl>
1 male  FALSE     2452     43.7          7100
2 male  TRUE      1401     44.0          7100
3 male  NA         703     43.5          7000
```
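
The third row of the output, with `catholic` equal to `NA`, arises because comparing a missing value with `"catholic"` yields `NA` rather than `FALSE`. The sketch below reproduces this on a hypothetical toy tibble and shows one possible way to exclude such rows before grouping. Whether excluding them is appropriate depends on the analysis.

```r
library(dplyr)

# Hypothetical toy data with one missing confession
toy <- tibble(
  confession = c("catholic", "protestant", NA, "catholic"),
  age        = c(30, 40, 50, 60)
)

# Comparing NA with "catholic" gives NA, not FALSE
toy <- toy %>% mutate(catholic = confession == "catholic")
toy$catholic

# Optional: keep only rows with a known confession, then summarise
known <- toy %>%
  filter(!is.na(catholic)) %>%
  group_by(catholic) %>%
  summarise(N = n(), age_mean = mean(age))

known
```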


---

# Inferential statistics

Specific tests

| Function| Purpose |
|:------|:-------| 
| `t.test()` | Compare group means | 
| `cor.test()` | Test a correlation against 0 |
| `chisq.test()` | Compare cell frequencies |
| `wilcox.test()` | Compare group ranks non-parametrically |

Formula-based tests

| Function| Purpose |
|:-----|:-------| 
| `lm()`, `glm()`| (Generalized) linear models  |
| `lmer()`, `glmer()` | (Generalized) linear mixed models |
| `regressionBF()` | Bayesian linear regression |


<img src="image/null_hypothesis.png" height=430px> 
from <a href="https://xkcd.com/892/">xkcd.com</a>


---

# `t.test()`

The <high>t-test</high> compares one group mean against a <high>reference value</high> or against the mean of <high>another group</high>.

Compare two means by providing <high>two numeric vectors</high> for the arguments `x` and `y`.

Additional <high>arguments allow for variations</high>, e.g., `var.equal = TRUE` to assume equal variances instead of the default Welch test.

```r
# 2-sample t-test
t.test(baselers$happiness,
       baselers$fitness)
```

```

Welch Two Sample t-test

data: baselers$happiness and baselers$fitness
t = 83, df = 15844, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1.738 1.822
sample estimates:
mean of x mean of y 
 6.905 5.125 
```
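
The comparison against a reference value is not shown above, so here is a minimal sketch. It uses hypothetical made-up ratings (`happiness`) and the `mu` argument of `t.test()`, and it also shows how individual results can be pulled out of the returned object.

```r
# Hypothetical happiness ratings, not taken from the baselers data
happiness <- c(6, 7, 8, 5, 7, 9, 6, 8)

# One-sample t-test: does the mean differ from the reference value 5?
one_sample <- t.test(happiness, mu = 5)

# The result is a list; single elements can be accessed with $
one_sample$estimate    # sample mean
one_sample$parameter   # degrees of freedom (n - 1)
one_sample$p.value     # p-value of the test
```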


---

# `t.test()`

The <high>t-test</high> compares one group mean against a <high>reference value</high> or against the mean of <high>another group</high>.

Compare two means by providing <high>two numeric vectors</high> for the arguments `x` and `y`.

Additional <high>arguments allow for variations</high>, e.g., `var.equal = TRUE` to assume equal variances instead of the default Welch test.

```r
# 2-sample t-test assuming equal variance
t.test(baselers$happiness,
       baselers$fitness,
       var.equal = TRUE)
```

```

Two Sample t-test

data: baselers$happiness and baselers$fitness
t = 83, df = 19998, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1.738 1.822
sample estimates:
mean of x mean of y 
 6.905 5.125 
```
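
The two outputs differ mainly in their degrees of freedom (15844 for the Welch test, 19998 with `var.equal = TRUE`). The sketch below illustrates the same pattern on two hypothetical toy vectors `a` and `b`: the pooled test uses `n1 + n2 - 2` degrees of freedom, while the Welch approximation never exceeds that value.

```r
# Hypothetical toy samples of unequal size
a <- c(5.1, 6.3, 7.2, 6.8, 5.9, 7.5)
b <- c(4.2, 5.0, 3.9, 6.1, 4.8, 5.5, 4.4)

welch   <- t.test(a, b)                    # default: variances not assumed equal
student <- t.test(a, b, var.equal = TRUE)  # pooled variance

student$parameter  # df = 6 + 7 - 2 = 11
welch$parameter    # approximated df, at most 11

# Both tests estimate the same group means
welch$estimate
```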


---

# `cor.test()`

The <high>correlation test</high> tests whether the correlation between two variables differs from 0.

Evaluate the correlation by providing <high>two numeric vectors</high> for the arguments `x` and `y`.

Additional <high>arguments allow for variations</high>, e.g., `method` to conduct the test using a different correlation measure.

```r
# correlation test
cor.test(x = baselers$age,
         y = baselers$income)
```

```

Pearson's product-moment correlation

data: baselers$age and baselers$income
t = 183, df = 8508, p-value <2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8882 0.8968
sample estimates:
 cor 
0.8926 
```


---

# `cor.test()`

The <high>correlation test</high> tests whether the correlation between two variables differs from 0.

Evaluate the correlation by providing <high>two numeric vectors</high> for the arguments `x` and `y`.

Additional <high>arguments allow for variations</high>, e.g., `method` to conduct the test using a different correlation measure.

```r
# Spearman rank correlation test
cor.test(x = baselers$age,
         y = baselers$income,
         method = "spearman")
```

```
Warning in cor.test.default(x = baselers$age, y = baselers$income, method = "spearman"): Cannot compute exact
p-value with ties
```

```

Spearman's rank correlation rho

data: baselers$age and baselers$income
S = 1.3e+10, p-value <2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
 rho 
0.8756 
```
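
The warning appears because the data contain ties, so an exact p-value cannot be computed. A possible way to avoid it is to request the approximate p-value up front with `exact = FALSE`. The sketch below uses hypothetical toy vectors with ties and also checks that Spearman's rho is the Pearson correlation of the ranks.

```r
# Hypothetical toy data containing ties in both variables
age    <- c(23, 35, 35, 47, 52, 61, 61, 70)
income <- c(3000, 4200, 4000, 5600, 5600, 7100, 6900, 8000)

# exact = FALSE asks for the approximate p-value directly
rho_test <- cor.test(age, income, method = "spearman", exact = FALSE)

rho_test$estimate

# Spearman's rho is the Pearson correlation computed on ranks
rho_by_hand <- cor(rank(age), rank(income))
rho_by_hand
```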


---

# `chisq.test()`

The <high>chi-square test</high> compares frequencies in (cross-) tables for equality in absolute or relative frequency.

Evaluate frequencies by providing a <high>table</high>, <high>vector</high>, or <high>matrix</high> for the argument `x`.

Additional <high>arguments allow for variations</high>, e.g., `correct = FALSE` to turn off the continuity correction.

```r
# compute cross-table with table
tab <- baselers %>% 
 mutate(tattoo = tattoos == TRUE) %>%
 select(sex, tattoo) %>% 
 table()

# show table
tab
```

```
        tattoo
sex      FALSE TRUE
  female  4703  297
  male    4794  206
```


---

# `chisq.test()`

The <high>chi-square test</high> compares frequencies in (cross-) tables for equality in absolute or relative frequency.

Evaluate frequencies by providing a <high>table</high>, <high>vector</high>, or <high>matrix</high> for the argument `x`.

Additional <high>arguments allow for variations</high>, e.g., `correct = FALSE` to turn off the continuity correction.

```r
# chi-square test 
chisq.test(tab)
```

```

Pearson's Chi-squared test with Yates' continuity correction

data:  tab
X-squared = 17, df = 1, p-value = 4e-05
```
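
The test can be reproduced without the data set by typing the counts from the cross-table into a matrix. The sketch below does this (the object names `counts`, `with_yates`, and `without` are illustrative choices) and compares the result with and without the continuity correction.

```r
# Counts copied from the sex-by-tattoo cross-table (filled column by column)
counts <- matrix(
  c(4703, 4794, 297, 206),
  nrow = 2,
  dimnames = list(sex = c("female", "male"), tattoo = c("FALSE", "TRUE"))
)

with_yates <- chisq.test(counts)                   # default for 2 x 2 tables
without    <- chisq.test(counts, correct = FALSE)  # no continuity correction

with_yates$expected   # frequencies expected under independence
with_yates$statistic  # rounds to 17, as in the output above
without$statistic     # slightly larger without the correction
```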


---

# `wilcox.test()`

The <high>Wilcoxon test</high> compares one group's average rank <high>against a reference</high> or <high>against another group</high>'s average rank.

Compare the average ranks of two groups by providing <high>two numeric vectors</high> for the arguments `x` and `y`.

Additional <high>arguments allow for variations</high>, e.g., `paired = TRUE` for paired samples.

```r
# 2-sample Wilcoxon rank-sum test
wilcox.test(baselers$happiness,
            baselers$fitness)
```

```

Wilcoxon rank sum test with continuity correction

data: baselers$happiness and baselers$fitness
W = 7.8e+07, p-value <2e-16
alternative hypothesis: true location shift is not equal to 0
```
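
To see what the statistic `W` measures, it helps to look at a tiny case. In this sketch, with hypothetical toy vectors in which every `x` is smaller than every `y`, `W` equals the number of (`x`, `y`) pairs where the `x` value is the larger one, which here is zero.

```r
# Hypothetical toy samples without ties; all x values lie below all y values
x <- c(1.1, 2.3, 3.5, 4.2)
y <- c(5.6, 6.1, 7.3, 8.4)

rank_sum <- wilcox.test(x, y)

rank_sum$statistic  # W
rank_sum$p.value

# W counted by hand: pairs in which x exceeds y
w_by_hand <- sum(outer(x, y, ">"))
w_by_hand
```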


---

<h1><a href="https://dwulff.github.io/Intro2R_Unibe/_sessions/StatisticsI/StatisticsI_practical.html">Practical</a></h1>