In this assignment you will…
Start a new markdown document and begin with the working code from the previous assignment in order to tokenize the text.
At the beginning, also load the package stopwords.
library(readr)
library(stringr)
library(wordcloud)
library(stopwords)
# load text
text <- read_file('pg2197.txt')
# cut text into sections
text_split = str_split(text, '\\*{3}[:print:]*\\*{3}')
# count characters per section
nchar(text_split[[1]])
## [1] 502 338120 4 1299 17639
# extract main text
main_text = text_split[[1]][2]
# to lower
main_text = str_to_lower(main_text)
# tokenize
tokens = str_extract_all(main_text, '[:alpha:]+')[[1]]
Count the tokens using table(). The table() function will return a table object that you can treat, for the most part, like a named integer vector. Store the resulting object as token_frequencies.
# count tokens
token_frequencies = table(tokens)
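As a quick check that token_frequencies really behaves like a named integer vector, you can index it by token name. A minimal sketch (assuming the tokens "the" and "and" occur in your book):
# look up individual tokens by name
token_frequencies["the"]
token_frequencies[c("the", "and")]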
Use length() to determine the length of token_frequencies. What does this number tell you?
# count distinct tokens
length(token_frequencies)
## [1] 5611
token_frequencies has one element for every unique token. Its length therefore indicates the number of distinct tokens in the book. Print a few entries, e.g., token_frequencies[1:100], to get a feel for what those elements look like.
# print first tokens
token_frequencies[1:100]
## tokens
## a aback abandoned abandonment
## 1393 2 2 2
## abased abasement abated aberration
## 1 4 1 1
## abide ability abject able
## 2 1 1 18
## abode abominable about above
## 5 1 141 10
## abridged abroad abrupt absence
## 1 5 3 3
## absent absolument absolute absolutely
## 6 1 7 2
## absolve absolves absorbed abstraction
## 1 1 1 1
## absurd absurdity abundant abuse
## 8 1 1 3
## abused abusive abyss accent
## 2 2 3 1
## accents accept accepted accepting
## 2 4 5 2
## accident accompanied accompany accompanying
## 2 5 4 1
## accord accordance accorded according
## 1 1 1 6
## accordingly account accountant accounted
## 3 10 1 1
## accounts accumulated accursed accusations
## 2 1 9 1
## accuse accustomed ached aching
## 1 1 1 1
## acquaintance acquaintances acquaintanceship acquainted
## 8 3 1 1
## acquire acquired acquiring acquisition
## 1 3 3 1
## across act acted acting
## 3 8 4 5
## action actions actors actress
## 5 1 1 1
## acts actual actually acutely
## 2 6 16 1
## adamson add added addicted
## 1 2 34 1
## addition additional address addressed
## 4 1 3 3
## addressee addresses addressing adhered
## 1 1 1 1
## adjacent adjoining adjourn administer
## 1 3 1 1
## admit adopt adopted adorned
## 3 1 3 1
Use sort() with decreasing = TRUE to arrange the elements of token_frequencies in descending order. Store the sorted table again as token_frequencies (i.e., overwrite the object).
# sort token table
token_frequencies = sort(token_frequencies, decreasing = TRUE)
Print token_frequencies[1:100] again, which now should show the 100 most frequent words. What are the most frequent words in your book? With the exception of stopwords (see below), which are by default eliminated by wordcloud(), the same ones should have popped up in the word cloud from the previous assignment.

Next, you will create a plot with the axes "Token rank" and "Token frequency". To do this, create two vectors, one called token_rank containing the token ranks (use rank()), where the smallest rank is assigned to the token with the highest frequency, and one called token_freq containing the token frequencies as a vector (use c()).
# Create vectors
token_rank = rank(-token_frequencies)
token_freq = c(token_frequencies)
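To see why the frequencies are negated inside rank(), recall that rank() assigns rank 1 to the smallest value; negating therefore gives rank 1 to the most frequent token. A minimal sketch with a toy vector:
# toy example: rank() gives rank 1 to the smallest value
rank(c(10, 50, 2))
# negating flips this so the largest value gets rank 1
rank(-c(10, 50, 2))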
Plot token_freq against token_rank as a line and label the axes as "Token rank" and "Token frequency". Use the standard plot() function with type = "l".
# Plot Zipf
plot(token_rank, token_freq,
type = "l", xlab = "Token rank", ylab = "Token frequency")
Use lines() to add a second line determined by \(\max(f)/(r + \beta)^\alpha\) with \(\alpha = 1\), \(\beta = 2.7\), \(f\) being the token frequency, and \(r\) being the token rank. Choose a different color for this line.
# Add model
plot(token_rank, token_freq,
type = "l", xlab = "Token rank", ylab = "Token frequency")
lines(token_rank, max(token_freq) / (token_rank + 2.7), col = "salmon")  # max(f) / (r + beta)^alpha with alpha = 1, beta = 2.7
log = "xy"
as an the plot function.# Add model
plot(token_rank, token_freq,
type = "l", xlab = "Token rank", ylab = "Token frequency", log="xy")
lines(token_rank, max(token_freq) / (token_rank + 2.7), col = "salmon")
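If you want a numerical feel for how well the model fits, you can compare observed and predicted frequencies for the first few ranks. A minimal sketch (the exact numbers depend on your book; the object names observed and predicted are just illustrative):
# observed vs. predicted frequencies for ranks 1 to 5
observed = token_freq[1:5]
predicted = max(token_freq) / (1:5 + 2.7)
round(rbind(observed, predicted))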
Using names() and nchar(), create a new vector from token_frequencies containing the number of characters of each of the tokens. Call this vector token_nchar.
# Create vector of characters
token_nchar = nchar(names(token_frequencies))
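As a quick sanity check (a minimal sketch; the exact tokens depend on your book), the first entries of token_nchar should match the lengths of the corresponding token names:
# first few tokens and their lengths
head(names(token_frequencies))
head(token_nchar)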
Using plot(), plot token_freq against token_nchar as points (i.e., don't use type = "l").
# Plot Zipf pt. II
plot(token_nchar, token_freq,
xlab = "Token n character", ylab = "Token frequency")
Extremely frequent words are usually less interesting because they occur in so many different contexts that they cannot carry a lot of meaning. For many analyses focusing on meaning or content, it is common to eliminate such overly frequent words devoid of specific meaning, which are known as stopwords. Use stopwords() to create a vector called stopwords.
Take a look at stopwords by printing it. What kind of words are included in the vector?
Determine the average rank of the words in stopwords based on token_frequencies. This code snippet should be useful: names(token_frequencies) %in% stopwords. Does this confirm the high frequency of stop words?
# stopwords
stopwords = stopwords()
# average rank
mean(token_rank[names(token_frequencies) %in% stopwords], na.rm = TRUE)
## [1] 190.6585
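If you want to take this one step further, a possible follow-up is to remove the stopwords from the frequency table before any content-oriented analysis, for example before drawing a word cloud with the already loaded wordcloud package. A minimal sketch (content_frequencies is just an illustrative name):
# drop stopwords from the frequency table
content_frequencies = token_frequencies[!(names(token_frequencies) %in% stopwords)]
# word cloud of the remaining tokens
wordcloud(names(content_frequencies), c(content_frequencies), max.words = 100)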