
# Tokenization
### <a href='https://dwulff.github.io/NLP_2020Autumn'>Natural language processing</a> | <a href='mailto:dirk.wulff@unibas.ch'>dirk.wulff@unibas.ch</a>
### September 2020

---

<div class="my-footer">
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
 <a href="https://cdsbasel.github.io/dataanalytics/">cdsbasel.github.io/dataanalytics/</a>
 <a href="https://cdsbasel.github.io/dataanalytics/">Data Analytics for Psychology and Business | April 2019</a>
</div>

---

# Encoding

.pull-left55[
1963: ASCII (first published standard)
<img src="https://www.asciitable.com/index/asciifull.gif">
More info: [Joel on Software](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) and [Kunststube](http://kunststube.net/encoding/)

]

.pull-right4[
1991: Unicode
<a href="http://unicode.org/"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Unicode_logo.svg/512px-Unicode_logo.svg.png" width = 365></a>

]

---

| Character | Code point | Encoding | Width | Binary representation |
|:------|:--------|:--------|:--------|:--------|
| `A`| `U+0041` | ASCII | fixed, 7 bits | 1000001 |
| `A`| `U+0041` | UTF-8 | variable, 1 to 4 bytes | 01000001 |
| `A`| `U+0041` | UTF-16 | variable, 2 or 4 bytes | 00000000 01000001 |
| `A`| `U+0041` | UTF-32 | fixed, 4 bytes | 00000000 00000000 00000000 01000001 |
| `あ`| `U+3042` | ASCII | fixed, 7 bits | - |
| `あ`| `U+3042` | UTF-8 | variable, 1 to 4 bytes | 11100011 10000001 10000010 |
| `あ`| `U+3042` | UTF-16 | variable, 2 or 4 bytes | 00110000 01000010 |
| `あ`| `U+3042` | UTF-32 | fixed, 4 bytes | 00000000 00000000 00110000 01000010 |
| <img src="https://images.emojiterra.com/google/android-oreo/512px/1f600.png" height="20px"> | `U+1F600` | ASCII | fixed, 7 bits | - |
| <img src="https://images.emojiterra.com/google/android-oreo/512px/1f600.png" height="20px"> | `U+1F600` | UTF-8 | variable, 1 to 4 bytes | 11110000 10011111 10011000 10000000 |
| <img src="https://images.emojiterra.com/google/android-oreo/512px/1f600.png" height="20px"> | `U+1F600` | UTF-16 | variable, 2 or 4 bytes | 11011000 00111101 11011110 00000000 |
| <img src="https://images.emojiterra.com/google/android-oreo/512px/1f600.png" height="20px"> | `U+1F600` | UTF-32 | fixed, 4 bytes | 00000000 00000001 11110110 00000000 |
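
The byte sequences in the table can be checked directly in R. The following is a minimal sketch using base R only. The helper `to_bits()` is a hypothetical name introduced here for illustration, and the encoding names passed to `iconv()` are assumed to be available on the platform.

```r
# Hypothetical helper: format a raw vector as space-separated 8-bit groups
to_bits <- function(bytes) {
  one_byte <- function(b) paste(rev(as.integer(intToBits(b))[1:8]), collapse = "")
  paste(vapply(as.integer(bytes), one_byte, character(1)), collapse = " ")
}

hiragana_a <- "\u3042"  # the character U+3042 from the table

# Code point as an integer (0x3042 is 12354 in decimal)
utf8ToInt(hiragana_a)

# UTF-8 needs three bytes for this character
utf8_bytes <- charToRaw(enc2utf8(hiragana_a))
to_bits(utf8_bytes)

# UTF-16 and UTF-32 in big-endian byte order (encoding names are platform-dependent)
utf16_bytes <- iconv(hiragana_a, from = "UTF-8", to = "UTF-16BE", toRaw = TRUE)[[1]]
utf32_bytes <- iconv(hiragana_a, from = "UTF-8", to = "UTF-32BE", toRaw = TRUE)[[1]]
to_bits(utf16_bytes)
to_bits(utf32_bytes)
```

The same code point yields byte sequences of different length depending on the encoding, which is why a file's encoding has to be known before its text can be tokenized reliably.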

---

# Regular expressions

According to [Wikipedia](https://en.wikipedia.org/wiki/Regular_expression):
>A regular expression, <high>regex or regexp</high> (sometimes called a rational expression) is, in theoretical computer science and formal language theory, <high>a sequence of characters that define a search pattern</high>. Usually this pattern is then used by <high>string searching algorithms</high> for "find" or "find and replace" operations on strings.


`(?<=\.) {2,}(?=[A-Z])`
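
The pattern reads as: two or more spaces that are preceded by a period (`(?<=\.)`) and followed by a capital letter (`(?=[A-Z])`). A minimal sketch of how it might be used to normalize spacing between sentences, assuming `stringr` is installed; the example string and variable names are made up for illustration.

```r
library(stringr)

# In an R string every backslash must be doubled
pattern <- "(?<=\\.) {2,}(?=[A-Z])"

x <- "One sentence.   Another sentence.  and a lowercase start."

# Collapse the matched runs of spaces into a single space
cleaned <- str_replace_all(x, pattern, " ")
cleaned
```

Only the first gap is collapsed. The second gap is followed by a lowercase letter, so the lookahead fails and the two spaces stay in place.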


---

# Regular expressions

The `stringr` package provides a family of high-performance regular expression functions. It is built on the `stringi` package, which in turn uses the <a href="http://site.icu-project.org/home">`ICU`</a> C/C++ library.

```r
str_*(string, pattern, ...)
```

| Function prefix | Use |
|:---------------|:------------------------------|
|   `str_detect*`  | Test if pattern is present. |
|   `str_count*`   | Count number of pattern matches. |
|   `str_locate*`  | Find location of pattern. |
|   `str_extract*` | Extract strings matching pattern. |
|   `str_replace*` | Replace strings matching pattern with another string. |
|   `str_split*`   | Split string around pattern. |
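
As a rough illustration of the table, the sketch below calls one function from each family on the same sentence. It assumes `stringr` is installed; the variable names are chosen here for illustration. Several of these functions also have an `_all` variant that returns every match instead of only the first.

```r
library(stringr)

txt <- "Happy families are all alike; every unhappy family is unhappy in its own way."

has_unhappy <- str_detect(txt, "unhappy")                 # is the pattern present?
n_unhappy   <- str_count(txt, "unhappy")                  # number of matches
where       <- str_locate(txt, "families")                # start and end of the first match
first_word  <- str_extract(txt, "[[:alpha:]]+")           # first match only
all_words   <- str_extract_all(txt, "[[:alpha:]]+")[[1]]  # every match
replaced    <- str_replace_all(txt, "unhappy", "sad")     # replace every match
clauses     <- str_split(txt, "; ")[[1]]                  # split around the pattern

n_unhappy
where
clauses
```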


---

# Using regular expressions

```r
library(stringr)

# text
txt <- "Happy families are all alike; every unhappy family is unhappy in its own way."

# Extract every occurrence of the letter a
str_extract_all(txt, "a")
```

```
## [[1]]
## [1] "a" "a" "a" "a" "a" "a" "a" "a" "a"
```

```r
# Extract every a followed by one or more letters (matches can start mid-word)
str_extract_all(txt, "a[:alpha:]+")
```

```
## [[1]]
## [1] "appy"    "amilies" "are"     "all"     "alike"   "appy"    "amily"  
## [8] "appy"    "ay"
```
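
The pattern above matches an `a` followed by letters anywhere inside a word, which is why fragments such as `"appy"` appear in the result. To keep only whole words that begin with `a`, one option is to anchor the match with the word boundary `\b`. A minimal sketch, assuming `stringr` is installed; `words_a` is a name chosen here for illustration.

```r
library(stringr)

txt <- "Happy families are all alike; every unhappy family is unhappy in its own way."

# \\b is a word boundary, so the a must be the first character of a word
words_a <- str_extract_all(txt, "\\ba[[:alpha:]]*")[[1]]
words_a
```

With the boundary in place, only `"are"`, `"all"`, and `"alike"` remain.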

---

# Using regular expressions

```r
# text
txt <- "Happy families are all alike; every unhappy family is unhappy in its own way."

# Extract every occurrence of the letter a
str_extract_all(txt, "a")
```

```
## [[1]]
## [1] "a" "a" "a" "a" "a" "a" "a" "a" "a"
```

```r
# Extract letter-only substrings that start with a and end with e (matches can start mid-word)
str_extract_all(txt, "a[:alpha:]+e")
```

```
## [[1]]
## [1] "amilie" "are"    "alike"
```
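
Because the pattern is not anchored, it returns `"amilie"`, a fragment of `"families"`. A minimal sketch of a stricter variant that requires word boundaries on both sides, assuming `stringr` is installed; `words_a_e` is a name chosen here for illustration.

```r
library(stringr)

txt <- "Happy families are all alike; every unhappy family is unhappy in its own way."

# Word boundary, a, any letters, e, word boundary
words_a_e <- str_extract_all(txt, "\\ba[[:alpha:]]*e\\b")[[1]]
words_a_e
```

Only `"are"` and `"alike"` are whole words that start with `a` and end with `e`.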

---

# Tokenization

```r
# text
txt <- "Happy families are all alike; every unhappy family is unhappy in its own way."

# Split at each single space character
str_split(txt, " ")
```

```
## [[1]]
##  [1] "Happy"    "families" "are"      "all"      "alike;"   "every"   
##  [7] "unhappy"  "family"   "is"       "unhappy"  "in"       "its"     
## [13] "own"      "way."
```

```r
# Split at each blank character (space or tab)
str_split(txt, "[:blank:]")
```

```
## [[1]]
##  [1] "Happy"    "families" "are"      "all"      "alike;"   "every"   
##  [7] "unhappy"  "family"   "is"       "unhappy"  "in"       "its"     
## [13] "own"      "way."
```
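
Both patterns give the same result here because the sentence contains only single spaces. They differ once the text contains tabs or repeated spaces. A minimal sketch with a made-up string, assuming `stringr` is installed:

```r
library(stringr)

# Made-up example with a double space and a tab
messy <- "Happy  families\tare all alike"

# Splitting at single spaces leaves an empty token and keeps the tab inside a token
by_space <- str_split(messy, " ")[[1]]
by_space

# Splitting at runs of whitespace yields clean tokens
by_whitespace <- str_split(messy, "[[:space:]]+")[[1]]
by_whitespace
```

Splitting at runs of whitespace is usually the more robust choice for text that was not typed by hand, although punctuation still stays attached to the tokens.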

---

# Tokenization

```r
# text
txt <- "Happy families are all alike; every unhappy family is unhappy in its own way."

# Tokenize: extract runs of letters
str_extract_all(txt, "[:alpha:]+")
```

```
## [[1]]
##  [1] "Happy"    "families" "are"      "all"      "alike"    "every"   
##  [7] "unhappy"  "family"   "is"       "unhappy"  "in"       "its"     
## [13] "own"      "way"
```

```r
# Tokenize: extract runs of ASCII letters only
str_extract_all(txt, "[A-Za-z]+")
```

```
## [[1]]
##  [1] "Happy"    "families" "are"      "all"      "alike"    "every"   
##  [7] "unhappy"  "family"   "is"       "unhappy"  "in"       "its"     
## [13] "own"      "way"
```
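
The two patterns agree on this sentence because it contains only ASCII letters. They diverge on accented or non-Latin text. A minimal sketch with a made-up string, assuming `stringr` is installed; the accented characters are written as Unicode escapes so the example does not depend on the encoding of the source file.

```r
library(stringr)

# "naive cafe" with i-diaeresis (U+00EF) and e-acute (U+00E9)
accented <- "na\u00efve caf\u00e9"

# The Unicode-aware letter class keeps the words intact
tokens_alpha <- str_extract_all(accented, "[[:alpha:]]+")[[1]]
tokens_alpha

# The ASCII range breaks words at every non-ASCII letter
tokens_ascii <- str_extract_all(accented, "[A-Za-z]+")[[1]]
tokens_ascii
```

For multilingual corpora the letter class is therefore the safer default, while the explicit range can be useful when non-ASCII characters should be excluded on purpose.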

---

# Sentence segmentation

```r
# text
txt <- "Happy families are all alike; every unhappy family is unhappy in its own way."

# Segment into sentences; here ';' also counts as a sentence boundary
str_extract_all(txt, '[^[:space:]][^[.!?;]]*[.!?;]')
```

```
## [[1]]
## [1] "Happy families are all alike;"                  
## [2] "every unhappy family is unhappy in its own way."
```
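
A purely punctuation-based pattern treats every period as a sentence end, which fails on abbreviations. The sketch below illustrates this with a made-up string and a simplified form of the pattern above, assuming `stringr` is installed. It also calls `boundary("sentence")`, stringr's rule-based segmenter, whose output depends on the underlying ICU rules and is therefore not annotated here.

```r
library(stringr)

# Made-up example containing an abbreviation
abbrev <- "Dr. Smith arrived. He sat down."

# Punctuation-based segmentation: a non-space character, then everything up to . ! ? or ;
by_regex <- str_extract_all(abbrev, "[^[:space:]][^.!?;]*[.!?;]")[[1]]
by_regex

# Rule-based alternative (result depends on the ICU sentence-break rules)
by_boundary <- str_split(abbrev, boundary("sentence"))[[1]]
by_boundary
```

The regular expression returns three segments, `"Dr."`, `"Smith arrived."`, and `"He sat down."`, even though the text has only two sentences.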

---

<h1><a href=https://dwulff.github.io/NLP_2020Autumn/menu/materials.html>Materials</a></h1>