Hi,
Once in a while I get a dataset that is already pre-tokenized (a data frame with one token per row, plus a doc_id column). Every time that happens I have to search for ages to figure out how to coerce that format into something quanteda likes.
Maybe it is already in the docs, but Google fails me when I search for it.
My solution is the following, but I am not sure whether it is the best way:
library(quanteda)
library(BTM)
# example data from the BTM package
data("brussels_reviews_anno")
# cast tokenized data to list
tmp_list <- aggregate(token ~ doc_id, data = brussels_reviews_anno, FUN = "list")
# unpack data and create named list
l <- tmp_list$token
names(l) <- tmp_list$doc_id
# transform to quanteda dfm
converted_dfm <- l |>
  quanteda::as.tokens() |>
  quanteda::dfm()
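For what it's worth, the `aggregate()` plus manual `names<-` steps can be collapsed into a single call to base R's `split()`, which already returns a list of token vectors named by doc_id, in exactly the shape `as.tokens()` accepts. A minimal sketch of the same conversion (I'm not certain this is the canonical approach either, just a shorter route to the same result):

```r
library(quanteda)
library(BTM)

# example data from the BTM package
data("brussels_reviews_anno")

# split() returns a named list of character vectors, one per doc_id,
# which as.tokens() can consume directly
toks <- split(brussels_reviews_anno$token, brussels_reviews_anno$doc_id) |>
  quanteda::as.tokens()

converted_dfm <- quanteda::dfm(toks)
```

Note that `split()` orders the list by the (sorted) doc_id values, so document order may differ from the order of first appearance in the data frame.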