Embed and analyze text: embedR

embedR is an open-source R package for generating and analyzing text embeddings. It gives access to state-of-the-art open and paid APIs from Hugging Face, OpenAI, and Cohere to generate text embeddings, and offers methods to group, project, cluster, relabel, and visualize them. The package's functions are summarized below:

Tokens

er_set_tokens sets access tokens for the APIs of Hugging Face, OpenAI, and Cohere.

er_get_tokens shows tokens that have been set during the current session.

Embed

er_embed generates state-of-the-art text embeddings using the APIs from Hugging Face, OpenAI, and Cohere.

Process

er_group groups identical or highly similar embedding vectors to produce group-based embeddings.

er_project projects embeddings into lower-dimensional spaces using MDS, UMAP, or PaCMAP.
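
To make the idea of projection concrete, here is a minimal base R sketch of classical MDS, one of the methods named above. It does not call embedR and is not a description of how er_project works internally; the toy matrix and the names toy_embedding and toy_projection are made up for illustration.

```r
# Toy "embedding": 5 texts x 4 dimensions (random numbers, not real embeddings)
set.seed(1)
toy_embedding <- matrix(rnorm(20), nrow = 5,
                        dimnames = list(paste0("text_", 1:5), NULL))

# Classical MDS on the pairwise Euclidean distances, keeping 2 dimensions
toy_projection <- cmdscale(dist(toy_embedding), k = 2)

# One row per text, one column per projected dimension
dim(toy_projection)
```

The result is a matrix with one row per text and two columns, which is the shape a 2D scatter plot needs.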

Analyze

er_compare_vectors computes a similarity matrix containing the similarities of all pairs of embedding vectors.

er_compare_embeddings computes the representational similarity of pairs of embeddings.

er_cluster clusters the embedding vectors into larger groups using hierarchical clustering, DBSCAN, or Louvain clustering.
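
The two most common steps in this section, building a similarity matrix and clustering on it, can be sketched in base R as follows. This is an illustration under stated assumptions, not embedR's implementation: it assumes cosine similarity as the similarity measure and average-linkage hierarchical clustering, and the toy vectors and variable names are hypothetical.

```r
# Toy "embedding": 4 texts x 3 dimensions (made-up numbers)
toy_embedding <- rbind(
  a = c(1, 0, 0),
  b = c(2, 0, 0),    # same direction as a
  c = c(0, 1, 0),
  d = c(0, 3, 0.1)   # almost the same direction as c
)

# Cosine similarity: scale each row to unit length, then take dot products
unit_rows  <- toy_embedding / sqrt(rowSums(toy_embedding^2))
similarity <- unit_rows %*% t(unit_rows)

# Hierarchical clustering on cosine distance (1 - similarity), cut into 2 groups
tree     <- hclust(as.dist(1 - similarity), method = "average")
clusters <- cutree(tree, k = 2)

round(similarity, 3)
clusters
```

Here a and b end up in one cluster and c and d in the other, because clustering operates on direction (cosine) rather than vector length.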

Helper

er_frame generates a tibble from an embedding object, including any attributes it carries.

er_infer_labels uses state-of-the-art generative models from Hugging Face and OpenAI to generate category labels for groups of texts.

Visualize

plot produces a 2D scatter plot of embedding vectors (typically after projection) with options for customization.

Data

neo is a data set containing 300 items of the NEO personality questionnaire.

ai is a data set containing 2,500 free associations to artificial intelligence provided by laypeople.

Examples

if (FALSE) {
# load package
library(embedR)

# set api tokens
er_set_tokens("openai" = "TOKEN",
              "huggingface" = "TOKEN",
              "cohere" = "TOKEN")

# generate embedding
embedding = neo$text %>%

  # generate text embedding
  er_embed(api = "openai")

# analyze embedding
result = embedding %>%

  # group similar texts
  er_group(method = "fuzzy") %>%

  # generate 2D projection
  er_project(method = "umap") %>%

  # cluster projection
  er_cluster(method = "louvain") %>%

  # produce data frame
  er_frame()

# re-label text groups
result = result %>%

  # relabel groups
  er_mutate(labels = label(group_texts,
                           api = "openai"))

# visualize
result %>% plot()
}