In this assignment you will…
Start a new markdown document, beginning with the working code from the previous assignment up to the point where you have extracted main_text and set all characters to lower case.
At the beginning, also load the package stopwords.
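# load packages (readr provides read_file(), stringr the str_*() functions)
library(readr)
library(stringr)
library(stopwords)
# load text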
text <- read_file('pg2197.txt')
# cut text into sections
text_split = str_split(text, '\\*{3}[:print:]*\\*{3}')
# count characters per section
## [1] 502 338120 4 1299 17639
# extract main text
main_text = text_split[[1]][2]
# to lower
main_text = str_to_lower(main_text)
Split main_text into sentences using str_extract_all() and [^[:space:]][^[.!?;]]*[.!?;] as the regex. Store the first element of the resulting list as sentences.
# split into sentences
sentences = str_extract_all(main_text, '[^[:space:]][^[.!?;]]*[.!?;]')[[1]]
Use str_extract_all() and [:alpha:]+ as the regex to tokenize each of the sentences. Remember, the stringr functions are vectorized, which means that the string argument can be a character vector of length greater than 1. Store the resulting list as tokens.
# tokenize
tokens = str_extract_all(sentences, '[:alpha:]+')
Using table(), count the number of occurrences of each token. Store the table as token_freq. You cannot pass tokens itself to table(), because this time tokens is a list. Instead, use unlist(tokens), which creates a single vector from tokens in which the sentences’ tokens are appended one after another.
# count tokens
token_freq = table(unlist(tokens))
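As a small illustration of why unlist() is needed before table(), the sketch below uses a made-up two-sentence list (toy_tokens is a hypothetical stand-in for tokens, not data from the book).

```r
# hypothetical toy data: two "sentences" that are already tokenized
toy_tokens = list(c("the", "cat", "sat"), c("the", "dog"))

# unlist() flattens the list into one character vector of length 5
toy_flat = unlist(toy_tokens)

# table() counts how often each distinct token occurs in that vector
toy_freq = table(toy_flat)
toy_freq
```

Here "the" is counted twice because it appears in both toy sentences, which is the behavior you want for corpus-wide frequencies.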
Using token_freq, create a vector of words that includes only those tokens that have a frequency of five or larger and are not included in stopwords(). To do this you will need names(token_freq), single brackets [], and two logical statements, > 4 and !XX %in% stopwords() (XX is a placeholder), combined with &. The reason you do this is to constrain the analysis to words that have at least a minimal frequency but are not stopwords. Store the resulting vector as retain.
# tokens to be retained
retain = names(token_freq)[token_freq > 4 & !names(token_freq) %in% stopwords()]
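The two logical statements can be inspected separately before they are combined. The sketch below assumes a tiny made-up frequency table and a hypothetical stopword vector toy_stop that stands in for stopwords(), so it runs without the package.

```r
# hypothetical toy frequencies; table() sorts names alphabetically
toy_freq = table(c(rep("the", 6), rep("casino", 5), rep("park", 2)))

# stand-in for stopwords(); an assumption for illustration only
toy_stop = c("the", "a", "of")

# condition 1: frequency of five or larger
frequent = toy_freq > 4

# condition 2: not a stopword; %in% binds tighter than !, so this
# negates the result of names(toy_freq) %in% toy_stop
not_stop = !names(toy_freq) %in% toy_stop

# keep only names that satisfy both conditions
toy_retain = names(toy_freq)[frequent & not_stop]
toy_retain  # "casino"
```

"the" is frequent but a stopword, and "park" is not a stopword but too rare, so only "casino" survives.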
Using a for loop, iterate over the elements of tokens. At each iteration, overwrite tokens[[i]] with a vector containing only those words in tokens[[i]] that exist in retain.
# retain tokens
for(i in seq_along(tokens)){
  tokens[[i]] = tokens[[i]][tokens[[i]] %in% retain]
}
Remove the sentences (i.e., the elements of tokens) that now have length 4 or smaller. To do this use single brackets [], the vectorized function lengths() (don’t forget the s), and the logical comparison > 4. Overwrite the original tokens object. The reason you do this is to eliminate sentences that would contribute little to revealing the typical contexts of words.
# remove sentences with fewer than 5 tokens
tokens = tokens[lengths(tokens) > 4]
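The difference between length() and lengths() is easy to miss, so here is a minimal sketch on a made-up list (toy_tokens is hypothetical).

```r
# hypothetical toy list with 5, 2, 0, and 6 tokens per element
toy_tokens = list(
  c("a", "b", "c", "d", "e"),
  c("a", "b"),
  character(0),
  c("a", "b", "c", "d", "e", "f")
)

length(toy_tokens)   # 4: number of list elements
lengths(toy_tokens)  # 5 2 0 6: number of tokens in each element

# keep only elements with more than 4 tokens
toy_kept = toy_tokens[lengths(toy_tokens) > 4]
length(toy_kept)     # 2
```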
Using unique(), create a vector called terms that contains the unique remaining words in tokens. First you will need to make a vector out of tokens using unlist().
# extract unique terms
terms = unique(unlist(tokens))
Using matrix(), create a matrix of 0s with the number of rows equal to the length of terms and the number of columns equal to the number of sentences, i.e., the length of tokens. Store the matrix as tdm, because this will be your term-document matrix.
# create empty tdm
tdm = matrix(0, nrow = length(terms), ncol = length(tokens))
Using rownames(), assign terms as the row names of the matrix.
# name rows
rownames(tdm) = terms
Using a for loop, iterate over the elements of tokens. At each iteration, count the tokens in tokens[[i]] using table() and store the result as tab_i. Then assign the token counts, using c(tab_i), to tdm[names(tab_i), i].
# fill tdm
for(i in seq_along(tokens)){
  tab_i = table(tokens[[i]])
  tdm[names(tab_i), i] = c(tab_i)
}
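To see what the filled matrix should look like, the same steps can be run on a made-up example small enough to check by hand. All toy_ names below are hypothetical stand-ins for the objects in the assignment.

```r
# hypothetical toy data: two tokenized "sentences"
toy_tokens = list(c("cat", "sat", "cat"), c("dog", "sat"))

# unique terms in order of first appearance: cat, sat, dog
toy_terms = unique(unlist(toy_tokens))

# empty term-document matrix: one row per term, one column per sentence
toy_tdm = matrix(0, nrow = length(toy_terms), ncol = length(toy_tokens))
rownames(toy_tdm) = toy_terms

# fill column i with the counts of sentence i, indexing rows by name
for (i in seq_along(toy_tokens)) {
  tab_i = table(toy_tokens[[i]])
  toy_tdm[names(tab_i), i] = c(tab_i)
}
toy_tdm
```

Row "cat" should read 2, 0, row "sat" 1, 1, and row "dog" 0, 1. Indexing rows by names(tab_i) is what lets each count land in the right row regardless of the order in which table() returns the tokens.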
Check the dimensions of tdm using dim() and take a look at the first few rows and columns using, e.g., tdm[1:10, 1:10]. Is everything looking in order? (Don’t print the entire thing!)
The first step in converting tdm to PPMI values is to turn the occurrences into probabilities. To do this, divide tdm by the sum of tdm using sum(). Store the resulting matrix as p_tdm.
# turn into probabilities
p_tdm = tdm / sum(tdm)
Calculate the marginal probabilities of the terms and the documents using rowSums() and colSums() on p_tdm. Store the resulting vectors as p_terms and p_docs.
# calculate marginals
p_terms = rowSums(p_tdm)
p_docs = colSums(p_tdm)
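Use outer() to create a matrix from p_terms and p_docs that matches the dimensionality of p_tdm. Each cell of this matrix is the probability you would expect if terms and documents were independent. Store the result as p_tdm_expected. Using dim(), confirm that the dimensionality of p_tdm_expected is appropriate.
# determine expected ps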
p_tdm_expected = outer(p_terms, p_docs)
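The following sketch runs the probability steps on a made-up 3 x 2 count matrix (toy_tdm is hypothetical) so the result of outer() can be checked by hand.

```r
# hypothetical toy counts: 3 terms (rows) by 2 sentences (columns)
toy_tdm = matrix(c(2, 0,
                   1, 1,
                   0, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("cat", "sat", "dog"), NULL))

# joint probabilities: every cell divided by the grand total (5)
toy_p = toy_tdm / sum(toy_tdm)

# marginal probabilities of terms (0.4, 0.4, 0.2) and documents (0.6, 0.4)
toy_p_terms = rowSums(toy_p)
toy_p_docs = colSums(toy_p)

# expected joint probabilities under independence:
# cell [i, j] is toy_p_terms[i] * toy_p_docs[j]
toy_expected = outer(toy_p_terms, toy_p_docs)

dim(toy_expected)  # 3 2, same as dim(toy_p)
sum(toy_expected)  # 1, up to floating-point error
```

For example, the expected probability for "cat" in the first sentence is 0.4 * 0.6 = 0.24, whereas the observed probability is 0.4.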
Using p_tdm, p_tdm_expected, and log2(), you can now calculate the pointwise mutual information. Specifically, divide p_tdm by p_tdm_expected and then take the log2 of the result. Store the result as pmi.
# compute pmi
pmi = log2(p_tdm / p_tdm_expected)
Create a new object ppmi from pmi and set all values smaller than 0 in ppmi to 0. Et voilà, you have computed the positive pointwise mutual information.
# compute ppmi
ppmi = pmi
ppmi[ppmi < 0] = 0
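A point worth checking is what happens to cells with a count of zero. The sketch below uses a made-up probability matrix (toy_p is hypothetical) to show that such cells become -Inf in the PMI and are then set to 0 by the same replacement step.

```r
# hypothetical toy joint probabilities (3 terms by 2 sentences, sum to 1)
toy_p = matrix(c(0.4, 0.0,
                 0.2, 0.2,
                 0.0, 0.2),
               nrow = 3, byrow = TRUE)

# expected probabilities under independence
toy_expected = outer(rowSums(toy_p), colSums(toy_p))

# PMI: log2 of observed over expected; log2(0) is -Inf
toy_pmi = log2(toy_p / toy_expected)

# PPMI: negative values, including -Inf, are replaced by 0
toy_ppmi = toy_pmi
toy_ppmi[toy_ppmi < 0] = 0
round(toy_ppmi, 3)
```

Cell [1, 2] has an observed probability of 0, so its PMI is -Inf and its PPMI is 0. Cell [3, 2] is observed 2.5 times as often as expected, giving a PPMI of log2(2.5), roughly 1.32.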
The first step in computing cosine similarities from the ppmi matrix is to determine the dot product of all pairs of rows. To do this, multiply ppmi with t(ppmi), the transpose of ppmi, using matrix multiplication %*%. Store the resulting matrix as dot_prod.
# get dot product
dot_prod <- ppmi %*% t(ppmi)
Verify using dim() that dot_prod is a square matrix with as many rows and columns as there are terms.
Next, extract the diagonal of dot_prod using diag() and store the resulting vector as dot_prod_diag.
# get diagonal
dot_prod_diag = diag(dot_prod)
Use dot_prod_diag, sqrt(), and outer() to compute the appropriate denominator for dot_prod, i.e., the product of the lengths (Euclidean norms) of each pair of rows. Store the resulting matrix as diag_outer.
# determine denominator
diag_outer = outer(sqrt(dot_prod_diag), sqrt(dot_prod_diag))
Divide dot_prod by diag_outer. Store the resulting matrix as cosines.
# calculate cosines
cosines = dot_prod/diag_outer
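The cosine steps can be verified on vectors whose angles are known. The sketch below assumes a made-up matrix toy_ppmi with three two-dimensional rows. It is an illustration only, not output from the book.

```r
# hypothetical toy matrix: rows a = (1, 0), b = (1, 1), c = (0, 2)
toy_ppmi = matrix(c(1, 0,
                    1, 1,
                    0, 2),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("a", "b", "c"), NULL))

# dot products between all pairs of rows (3 x 3 matrix)
toy_dot = toy_ppmi %*% t(toy_ppmi)

# the diagonal holds each row's squared length, so sqrt() gives the norms
toy_norm = sqrt(diag(toy_dot))

# divide each dot product by the product of the two row norms
toy_cos = toy_dot / outer(toy_norm, toy_norm)
round(toy_cos, 3)
```

Rows a and c are orthogonal, so their cosine is 0. Rows a and b are 45 degrees apart, so their cosine is 1 / sqrt(2), roughly 0.707. Every row has a cosine of 1 with itself, which is why the term you look up always appears first in the sorted output.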
Replace "casino" in the following expression with one of the terms in your book and the expression will show you the ten most associated terms.
sort(cosines["casino", ], decreasing = TRUE)[1:10]
## casino near entered arm appear park salons police
## 1.0000000 0.2884082 0.2037457 0.1777109 0.1715131 0.1357752 0.1317005 0.1250892
## finally pocket
## 0.1217858 0.1133951
Recompute the token_freq object from tokens and extract the 200 most frequent tokens. Store the result as top_200.
# count tokens and extract top 200
token_freq = table(unlist(tokens))
top_200 = names(sort(token_freq, decreasing = T)[1:200])
# run MDS
mds = cmdscale(1 - cosines[top_200, top_200]**.5)
# plot mds solution (plot.new() must be called before plot.window())
par(mar = c(1, 1, 1, 1))
plot.new()
plot.window(xlim = range(mds[, 1]), ylim = range(mds[, 2]))
text(mds[,1],mds[,2],labels=rownames(mds), cex=.5)
Try changing the exponent **.5 to a smaller number, e.g., to **.3.
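To get a feel for what the exponent does before rerunning the MDS, the sketch below applies both exponents to a few made-up cosine values (toy_cos is hypothetical). Note that in R ** is the same operator as ^ and binds more tightly than -, so 1 - x**.5 means 1 - (x^.5).

```r
# hypothetical cosine values between 0 and 1
toy_cos = c(0.01, 0.1, 0.5, 1)

# distances with the original exponent and with a smaller one
d_half = 1 - toy_cos**.5
d_third = 1 - toy_cos**.3

round(rbind(toy_cos, d_half, d_third), 3)
```

For cosines strictly between 0 and 1, the smaller exponent yields a smaller distance, while a cosine of 1 maps to a distance of 0 under both. How this changes the layout for your book is something to check in the plot itself.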