Add short section on how to import pre-tokenized text

Hi,

Once in  a while I happen to get a dataset that is already pre-tokenized (dataframe with columns for tokens and with doc_id). Every time that happens I need to search forever to figure out how to coerce that format to something that quanteda likes.

Maybe it is already in the docs, but google fails me when I try to search for that.

My solution is this one, but I am not sure whether that is the best way:


```r
library(quanteda)
library(BTM)

# example data from the BTM package
data("brussels_reviews_anno")

# cast tokenized data to list
tmp_list <- aggregate(token ~ doc_id, data = brussels_reviews_anno, FUN = "list")

# unpack data and create named list
l <- tmp_list$token
names(l) <- tmp_list$doc_id

# transform to quanteda dfm
converted_corpus <- l |> quanteda::as.tokens() |> 
  quanteda::dfm()

```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add short section on how to import pre-tokenized text #106

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Add short section on how to import pre-tokenized text #106

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions