In this assignment you will…
Start a new markdown document and begin with the working code from the previous assignment in order to tokenize the text.
At the beginning, also load the package stopwords.
library(readr)
library(stringr)
library(wordcloud)
library(stopwords)
# load text
text <- read_file('pg2197.txt')
# cut text into sections
text_split = str_split(text, '\\*{3}[:print:]*\\*{3}')
# count characters per section
nchar(text_split[[1]])
## [1] 502 338120 4 1299 17639
# extract main text
main_text = text_split[[1]][2]
# to lower
main_text = str_to_lower(main_text)
# tokenize
tokens = str_extract_all(main_text, '[:alpha:]+')[[1]]
Count the tokens using table(). The table() function will return a table object that you can treat, for the most part, like a named integer vector. Store the resulting object as token_frequencies.
# count tokens
token_frequencies = table(tokens)
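As a quick check that token_frequencies really behaves like a named integer vector, you can index it by token name. A minimal sketch (assuming the tokens "the" and "and" occur in your book):
# look up individual tokens by name
token_frequencies["the"]
token_frequencies[c("the", "and")]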
Use length() to determine the length of token_frequencies. What does this number tell you?
# count distinct tokens
length(token_frequencies)
## [1] 5611
token_frequencies has one element for every unique token. Its length therefore indicates the number of distinct tokens in the book. Print a few entries, e.g., token_frequencies[1:100], to get a feel for what those elements look like.
# print first tokens
token_frequencies[1:100]
## tokens
## a aback abandoned abandonment
## 1393 2 2 2
## abased abasement abated aberration
## 1 4 1 1
## abide ability abject able
## 2 1 1 18
## abode abominable about above
## 5 1 141 10
## abridged abroad abrupt absence
## 1 5 3 3
## absent absolument absolute absolutely
## 6 1 7 2
## absolve absolves absorbed abstraction
## 1 1 1 1
## absurd absurdity abundant abuse
## 8 1 1 3
## abused abusive abyss accent
## 2 2 3 1
## accents accept accepted accepting
## 2 4 5 2
## accident accompanied accompany accompanying
## 2 5 4 1
## accord accordance accorded according
## 1 1 1 6
## accordingly account accountant accounted
## 3 10 1 1
## accounts accumulated accursed accusations
## 2 1 9 1
## accuse accustomed ached aching
## 1 1 1 1
## acquaintance acquaintances acquaintanceship acquainted
## 8 3 1 1
## acquire acquired acquiring acquisition
## 1 3 3 1
## across act acted acting
## 3 8 4 5
## action actions actors actress
## 5 1 1 1
## acts actual actually acutely
## 2 6 16 1
## adamson add added addicted
## 1 2 34 1
## addition additional address addressed
## 4 1 3 3
## addressee addresses addressing adhered
## 1 1 1 1
## adjacent adjoining adjourn administer
## 1 3 1 1
## admit adopt adopted adorned
## 3 1 3 1
Use sort() with decreasing = TRUE to arrange the elements of token_frequencies in descending order. Store the sorted table again as token_frequencies (i.e., overwrite the object).
# sort token table
token_frequencies = sort(token_frequencies, decreasing = TRUE)
Print token_frequencies[1:100] again, which now should show the 100 most frequent words. What are the most frequent words in your book? With the exception of stopwords (see below), which are by default eliminated by wordcloud(), the same ones should have popped up in the word cloud from the previous assignment.

Next, you will create a plot with the axes "Token rank" and "Token frequency". To do this, create two vectors, one called token_rank containing the token ranks (use rank()), where the smallest rank is assigned to the token with the highest frequency, and one called token_freq containing the token frequencies as a vector (use c()).
# Create vectors
token_rank = rank(-token_frequencies)
token_freq = c(token_frequencies)
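To see why the frequencies are negated inside rank(), recall that rank() assigns rank 1 to the smallest value; negating therefore gives rank 1 to the most frequent token. A minimal sketch with a toy vector:
# toy example: rank() gives rank 1 to the smallest value
rank(c(10, 50, 2))
# negating flips this so the largest value gets rank 1
rank(-c(10, 50, 2))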
Plot token_freq against token_rank as a line and label the axes as "Token rank" and "Token frequency". Use the standard plot() function with type = "l".
# Plot Zipf
plot(token_rank, token_freq,
type = "l", xlab = "Token rank", ylab = "Token frequency")
Use lines() to add a second line determined by \(\max(f)/(r + \beta)^\alpha\) with \(\alpha = 1\), \(\beta = 2.7\), \(f\) being the token frequency, and \(r\) being the token rank. Choose a different color for this line.
# Add model
plot(token_rank, token_freq,
type = "l", xlab = "Token rank", ylab = "Token frequency")
lines(token_rank, max(token_freq) / (token_rank + 2.7), col = "salmon")  # max(f) / (r + beta)^alpha with alpha = 1, beta = 2.7
log = "xy"
as an the plot function.# Add model
plot(token_rank, token_freq,
type = "l", xlab = "Token rank", ylab = "Token frequency", log="xy")
lines(token_rank, max(token_freq) / (token_rank + 2.7), col = "salmon")
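If you want a numerical feel for how well the model fits, you can compare observed and predicted frequencies for the first few ranks. A minimal sketch (the exact numbers depend on your book; the object names observed and predicted are just illustrative):
# observed vs. predicted frequencies for ranks 1 to 5
observed = token_freq[1:5]
predicted = max(token_freq) / (1:5 + 2.7)
round(rbind(observed, predicted))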
Using names() and nchar(), create a new vector from token_frequencies containing the number of characters of each of the tokens. Call this vector token_nchar.
# Create vector of characters
token_nchar = nchar(names(token_frequencies))
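As a quick sanity check (a minimal sketch; the exact tokens depend on your book), the first entries of token_nchar should match the lengths of the corresponding token names:
# first few tokens and their lengths
head(names(token_frequencies))
head(token_nchar)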
Using plot(), plot token_freq against token_nchar as points (i.e., don't use type = "l").
# Plot Zipf pt. II
plot(token_nchar, token_freq,
xlab = "Token n character", ylab = "Token frequency")
Extremely frequent words are usually less interesting because they occur in so many different contexts that they cannot carry a lot of meaning. For many analyses focusing on meaning or content, it is common to eliminate such overly frequent words devoid of specific meaning, which are known as stopwords. Use stopwords() to create a vector called stopwords.
Take a look at stopwords by printing it. What kind of words are included in the vector?
Determine the average rank of the words in stopwords based on token_frequencies. This code snippet should be useful: names(token_frequencies) %in% stopwords. Does this confirm the high frequency of stop words?
# stopwords
stopwords = stopwords()
# average rank
mean(token_rank[names(token_frequencies) %in% stopwords], na.rm = TRUE)
## [1] 190.6585
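If you want to take this one step further, a possible follow-up is to remove the stopwords from the frequency table before any content-oriented analysis, for example before drawing a word cloud with the already loaded wordcloud package. A minimal sketch (content_frequencies is just an illustrative name):
# drop stopwords from the frequency table
content_frequencies = token_frequencies[!(names(token_frequencies) %in% stopwords)]
# word cloud of the remaining tokens
wordcloud(names(content_frequencies), c(content_frequencies), max.words = 100)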