Tokenization

Overview

In this assignment you will…

download an R project.
download a book from Project Gutenberg and read it into R.
start a markdown document.
read in and tokenize the text.

Tasks

A - Download Project

Download this zipped project.
Locate the downloaded file, unzip it and move it to a suitable location on your hard drive.

B - Get book

The first step is to choose a book of your liking and load it into R. Visit the website of Project Gutenberg and select a book that you like. Tip: check out Project Gutenberg’s Top 100 list.
To download the book, open the book’s page and select Plain Text UTF-8 (labelled Textdatei UTF-8 on the German version of the site). Depending on your browser and its settings, this will either download the file directly or open the text in a browser tab. In the latter case, right-click the text and select Save as (or a comparable option) to save it as a text file on your hard drive.
Locate the downloaded text file and move it into your data folder inside your project folder.

C - Get started

Open RStudio by double-clicking NLP_2020Autumn.Rproj within your project folder.
Open a new R Markdown file with HTML output and save it as Tokenization.Rmd in your project folder.
Create a new chunk and load the packages readr, stringr, and wordcloud using library().
Use the read_file() function to read in the text file and assign the result to an object called text. The path to the file should be data/NAME_OF_FILE.txt.
The text object should be a vector of type character that has length one. Confirm this using typeof() and length().
Determine the number of characters using nchar().
Inspect the first 1,000 characters using the str_sub() function. The str_sub() function takes three arguments: the text, the starting character index (1), and the ending character index (1000).

library(readr)
library(stringr)
library(wordcloud)

# load text from the data folder inside the project folder
text <- read_file('data/pg2197.txt')

# evaluate object
typeof(text)

## [1] "character"

length(text)

## [1] 1

nchar(text)

## [1] 357761

str_sub(text, 1, 1000)

## [1] "The Project Gutenberg EBook of The Gambler, by Fyodor Dostoyevsky\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.net\r\n\r\n\r\nTitle: The Gambler\r\n\r\nAuthor: Fyodor Dostoyevsky\r\n\r\nPosting Date: March 1, 2009 [EBook #2197]\r\nRelease Date: May, 2000\r\n[Last updated: July 24, 2011]\r\n\r\nLanguage: English\r\n\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK THE GAMBLER ***\r\n\r\n\r\n\r\n\r\nProduced by Martin Adamson.  HTML version by Al Haines.\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nTHE GAMBLER\r\n\r\n\r\nBy\r\n\r\nFYODOR DOSTOYEVSKY\r\n\r\n\r\nTranslated by C. J. Hogarth\r\n\r\n\r\n\r\nI\r\n\r\nAt length I returned from two weeks leave of absence to find that my\r\npatrons had arrived three days ago in Roulettenberg. I received from\r\nthem a welcome quite different to that which I had expected. The\r\nGeneral eyed me coldly, greeted me in rather haughty fashion, and"
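The checks above can also be tried on a small file before working with a whole book. The following is a minimal sketch, assuming a hypothetical temporary file that stands in for the downloaded book; the file content and the object name path are made up for illustration.

```r
library(readr)
library(stringr)

# Hypothetical stand-in for a downloaded book: a small temporary text file
path <- tempfile(fileext = ".txt")
write_file("Title: Example\r\n\r\nFirst line of the book.\r\nSecond line.", path)

# read_file() returns the entire file as one single string
text <- read_file(path)

typeof(text)          # storage type of the object
length(text)          # number of elements in the vector
nchar(text)           # number of characters in the single string
str_sub(text, 1, 14)  # the first 14 characters
```

Because the whole file ends up in one string, length() reports the number of vector elements, while nchar() reports the size of the text itself.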

D - Tokenize text

Before you can tokenize the text, you must remove several sections that Project Gutenberg adds with information about the text and its license. These sections are separated by header lines with leading and trailing star symbols, e.g., *** START OF THIS PROJECT GUTENBERG EBOOK THE GAMBLER ***. Build a regular expression that identifies such lines using the following elements: the escaped star symbol \\*, curly braces {} to indicate the number of star repetitions, the print class [:print:] for every character between the star symbols, and a quantifier (the plus + for one or more, or the star * for zero or more) to indicate the number of print repetitions.
Then use your regular expression within the str_split() function to split the text. This returns a list of length one whose only element is a character vector containing the individual sections. Store that element (select it with [[1]]) in an object called text_split.
Use class() to confirm that text_split is a character vector.
Now identify the element in the vector that has the most characters, which will be the book’s actual text. To do this use nchar().
Select the largest element and store it as main_text.

# cut text into sections and keep the list's only element
text_split <- str_split(text, '\\*{3}[:print:]*\\*{3}')[[1]]

# count characters per section
nchar(text_split)

## [1]    502 338120      4   1299  17639

# extract main text (the largest section, here the second one)
main_text <- text_split[2]
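To see how the header pattern behaves, it can help to try it on a short string first. The sketch below assumes a made-up toy text with two star-delimited header lines; the object names toy, sections, and largest are hypothetical. It also picks the largest section with which.max() instead of a hard-coded index, which is one possible way to select the main text when the position differs between books.

```r
library(stringr)

# Made-up toy text with two star-delimited header lines
toy <- "header\n*** START OF THIS BOOK ***\nthe actual story text goes here\n*** END OF THIS BOOK ***\nlicense"

# Split at every line of the form *** ... ***
# [:print:] does not match line breaks, so a match stays within one line
sections <- str_split(toy, "\\*{3}[:print:]+\\*{3}")[[1]]

# Number of characters per section
nchar(sections)

# Index of the longest section, then the section itself
largest <- which.max(nchar(sections))
toy_main <- sections[largest]
```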

As a final step before tokenizing the text, convert the main_text object to lower case using str_to_lower().
Now, use str_extract_all() and [:alpha:]+ to tokenize main_text. This should extract all individual words from the text and return them as a vector contained in a list. Store the vector (i.e., the list’s first element) as an object called tokens.
Count the number of words in tokens using length(). This is the number of words in your book.

# convert to lower case
main_text <- str_to_lower(main_text)

# tokenize
tokens <- str_extract_all(main_text, '[:alpha:]+')[[1]]

# count words
length(tokens)

## [1] 61271
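Note that [:alpha:]+ only matches runs of letters, so apostrophes, digits, and punctuation all act as token boundaries. A minimal sketch on a made-up sentence (the object names sentence and toy_tokens are hypothetical) illustrates this:

```r
library(stringr)

# Made-up example sentence
sentence <- "Don't stop, it's 2020!"

# Lower-case first, then extract runs of letters
toy_tokens <- str_extract_all(str_to_lower(sentence), "[:alpha:]+")[[1]]

toy_tokens
## [1] "don"  "t"    "stop" "it"   "s"
```

Contractions such as don't are therefore split into two tokens, and numbers are dropped entirely, which slightly affects the word count.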

E - Wordcloud

Use the function wordcloud to create a wordcloud of your tokens. It may take a while. If it takes too long, use only a subset of the words, e.g., the first 10,000 words (tokens[1:10000]).

# create word cloud
wordcloud(tokens)
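If plotting the raw tokens is slow, one alternative is to count the words first and pass the counts to wordcloud() directly. The following is a sketch under that assumption, using a made-up token vector; the object names toy_tokens and freq as well as the chosen values for min.freq and max.words are illustrative and not part of the assignment.

```r
library(wordcloud)

# Made-up token vector standing in for the tokens of a book
toy_tokens <- c("the", "gambler", "the", "general", "the", "gambler")

# Count how often each word occurs, most frequent first
freq <- sort(table(toy_tokens), decreasing = TRUE)

# Plot from precomputed counts and limit the number of words shown
wordcloud(words = names(freq),
          freq = as.integer(freq),
          min.freq = 1,
          max.words = 100,
          random.order = FALSE)
```

Limiting max.words keeps the plot readable and reduces the time needed to place the words.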