Overview

In this assignment you will…

  • Construct a term-document matrix.
  • Calculate PPMI and cosine similarities.
  • Visualize cosine similarities.

Tasks

A - Return to previous assignment

  1. Start a new markdown document and begin with the working code from the previous assignment, up to the point where you have extracted main_text and set all characters to lower case.

  2. At the beginning, also load the stopwords package.

library(readr)
library(stringr)
library(wordcloud)
library(stopwords)

# load text
text <- read_file('pg2197.txt')

# cut text into sections
text_split = str_split(text, '\\*{3}[:print:]*\\*{3}')

# count characters per section
nchar(text_split[[1]])
## [1]    502 338120      4   1299  17639
# extract main text
main_text = text_split[[1]][2]

# to lower
main_text = str_to_lower(main_text)

B - Term-document matrix

  1. Begin by splitting your text into sentences using str_extract_all() and [^[:space:]][^[.!?;]]*[.!?;] as the regex. Store the first element of the resulting list as sentences.
# sentenize
sentences = str_extract_all(main_text, '[^[:space:]][^[.!?;]]*[.!?;]')[[1]]
  2. Now use str_extract_all() to tokenize each of the sentences. Remember, the stringr functions are vectorized, implying that the string argument can be a character vector of length greater than 1. Store the resulting list as tokens.
# tokenize
tokens = str_extract_all(sentences, '[:alpha:]+')
  3. Using table(), count the number of occurrences of each token. Store the table as token_freq. To do this you cannot use tokens itself, given that this time tokens is a list. Instead you must use unlist(tokens), which creates a single vector from tokens, where the sentences' tokens are appended one after the other.
# count tokens
token_freq = table(unlist(tokens))
  4. Using token_freq, create a vector of words including only those tokens that have a frequency of five or larger and are not included in stopwords(). To do this you will need to use names(token_freq), single brackets [], and two logical statements, > 4 and !XX %in% stopwords() (where XX is a placeholder), combined with &. The reason you do this is to constrain the analysis to words that have at least a minimal frequency and are not stopwords. Store the resulting vector as retain.
# to be retained tokens
retain = names(token_freq)[token_freq > 4 & !names(token_freq) %in% stopwords()]
  5. Run a loop iterating from 1 to the length of tokens. At each iteration, overwrite tokens[[i]] with a vector containing only those words in tokens[[i]] that exist in retain.
# retain tokens
for(i in 1:length(tokens)){
  tokens[[i]] = tokens[[i]][tokens[[i]] %in% retain]
  }
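If you prefer, the same filtering can also be written without an explicit loop; this is an optional alternative, not required by the assignment:
# alternative: filter each sentence's tokens with lapply()
tokens = lapply(tokens, function(x) x[x %in% retain])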
  6. Remove sentences (i.e., elements in tokens) that now have length 4 or smaller. To do this use single brackets [], the vectorized function lengths() (don't forget the s), and the logical comparison > 4. Overwrite the original tokens object. The reason you do this is to eliminate sentences that would contribute little to revealing the typical contexts of words.
# remove sentences with fewer than 5 tokens
tokens = tokens[lengths(tokens) > 4]
  7. Using unique(), create a vector called terms that contains the unique remaining words in tokens. First you will need to make a vector out of tokens using unlist().
# extract unique terms
terms = unique(unlist(tokens))
  8. Using matrix(), create a matrix of 0s with the number of rows equal to the length of terms and the number of columns equal to the number of sentences, i.e., the length of tokens. Store the matrix as tdm, because this will be your term-document matrix.
# create empty tdm
tdm = matrix(0, nrow = length(terms), ncol = length(tokens))
  9. Using rownames(tdm), set the row names of the matrix to terms.
# name rows
rownames(tdm) = terms
  10. Run a loop iterating from 1 to the length of tokens. At each iteration, count the tokens in tokens[[i]] using table() and store the result as tab_i. Then assign the token counts to tdm[names(tab_i), i] using c(tab_i).
# fill tdm
for(i in 1:length(tokens)){
  tab_i = table(tokens[[i]])
  tdm[names(tab_i), i] = c(tab_i)
  }
  11. Inspect the dimensionality of tdm using dim() and take a look at the first few rows and columns using, e.g., tdm[1:10, 1:10]. Does everything look in order? (Don't print the entire thing!)
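For example, a quick inspection could look like this (the 1:10 ranges are just one possible choice):
# check dimensions and peek at the first rows and columns
dim(tdm)
tdm[1:10, 1:10]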

C - PPMI

  1. The first step towards transforming the tdm to PPMI values is to turn the occurrences into probabilities. To do this, divide tdm by the sum of tdm using sum(). Store the resulting matrix as p_tdm.
# turn into probabilities
p_tdm = tdm / sum(tdm)
  2. Next, you need to determine the marginal probabilities of terms and documents by applying rowSums() and colSums() to p_tdm. Store the resulting vectors as p_terms and p_docs.
# calculate marginals
p_terms = rowSums(p_tdm)
p_docs = colSums(p_tdm)
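As a quick sanity check (not part of the assignment), both marginal distributions should sum to 1:
# both sums should be (close to) 1
sum(p_terms)
sum(p_docs)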
  3. Now use the function outer() to create a matrix from p_terms and p_docs that matches the dimensionality of p_tdm. Store the result as p_tdm_expected. Using dim(), confirm that the dimensionality of p_tdm_expected is appropriate.
# determine expected ps
p_tdm_expected = outer(p_terms, p_docs)
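The dimensionality check could, for example, look like this:
# p_tdm_expected should have the same dimensions as p_tdm
dim(p_tdm_expected)
dim(p_tdm)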
  4. Using p_tdm_expected and log2(), you can now calculate the pointwise mutual information. Specifically, divide p_tdm by p_tdm_expected and then take the log2 of the result. Store the result as pmi.
# compute pmi
pmi = log2(p_tdm / p_tdm_expected)
  5. Finally, create a new matrix ppmi from pmi and set all values smaller than 0 in ppmi to 0. Et voilà, you have computed the positive pointwise mutual information.
# compute ppmi
ppmi = pmi
ppmi[ppmi < 0] = 0
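An equivalent, more compact alternative (optional) is pmax(), which also clips the -Inf values produced by log2(0) to zero:
# alternative: clip negative values with pmax()
ppmi = pmax(pmi, 0)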
  6. Print a few rows and columns to get a feel for the values inside ppmi.
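For instance (any small index range will do):
# inspect a few ppmi values
ppmi[1:5, 1:5]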

D - Cosine similarities

  1. The first step towards computing cosine similarities from the ppmi matrix is to determine the dot product of all rows. To do this, multiply ppmi by t(ppmi), the transpose of ppmi, using matrix multiplication %*%. Store the resulting matrix as dot_prod.
# get dot product
dot_prod <- ppmi %*% t(ppmi)
  2. Verify using dim() that dot_prod is a square matrix with as many rows and columns as there are terms.
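For example, comparing the dimensions to the number of terms:
# dot_prod should be a square terms-by-terms matrix
dim(dot_prod)
length(terms)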

  3. Next, extract the diagonal of dot_prod using diag() and store the resulting vector as dot_prod_diag.

# get diagonal
dot_prod_diag = diag(dot_prod)
  4. Use sqrt(dot_prod_diag) and outer() to compute the appropriate denominator for dot_prod. Store the resulting matrix as diag_outer.
# determine denominator
diag_outer = outer(sqrt(dot_prod_diag), sqrt(dot_prod_diag))
  5. Calculate the matrix of cosine similarities by dividing dot_prod by diag_outer. Store the resulting matrix as cosines.
# calculate cosines
cosines = dot_prod/diag_outer
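As a quick optional check, the diagonal of cosines should consist of 1s, since every term has a cosine similarity of 1 with itself:
# diagonal entries should equal 1
range(diag(cosines))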
  6. Using the expression below, you can now explore the cosine similarities. Replace "casino" with one of the terms in your book and the expression will show you the ten most associated terms.
sort(cosines["casino",],decreasing = T)[1:10]
##    casino      near   entered       arm    appear      park    salons    police 
## 1.0000000 0.2884082 0.2037457 0.1777109 0.1715131 0.1357752 0.1317005 0.1250892 
##   finally    pocket 
## 0.1217858 0.1133951

E - Visualize cosines

  1. A simple but imperfect way to visualize the pattern of cosine similarities is multi-dimensional scaling. To get there, first create a new token_freq object from tokens and extract the 200 most frequent tokens. Store the result as top_200.
# count tokens and extract top 200
token_freq = table(unlist(tokens))
top_200 = names(sort(token_freq, decreasing = T)[1:200])
  2. Now use the function below to run multidimensional scaling on the cosines between the top 200 tokens.
# run MDS
mds = cmdscale(1 - cosines[top_200, top_200]**.5)
  3. Finally, use the code below to illustrate the patterns of cosine similarities in the 2D plane. On average, the closer two tokens are in the illustration, the higher their cosine similarity. Note, however, that this is not strictly true, given that a 2D representation can never perfectly represent the full higher-dimensional cosine space.
# plot mds solution
par(mar = c(1,1,1,1))
plot.new();plot.window(xlim = range(mds[,1]), ylim = range(mds[,2]))
text(mds[,1],mds[,2],labels=rownames(mds), cex=.5)

  4. Study the MDS plot. Do the displayed patterns of association match your intuition? In case too many words are clustered together in the center, try changing the **.5 to a smaller number, e.g., to **.3.
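For instance, rerunning the scaling and the plot with the smaller exponent suggested above could look like this:
# rerun MDS with a smaller exponent to spread out the center
mds = cmdscale(1 - cosines[top_200, top_200]**.3)

# plot the adjusted solution
par(mar = c(1,1,1,1))
plot.new();plot.window(xlim = range(mds[,1]), ylim = range(mds[,2]))
text(mds[,1],mds[,2],labels=rownames(mds), cex=.5)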