Open
Description
Taken from the MLJText.jl requirements for transformers:
Generate a vector whose elements are either tokenized documents or bags of words/ngrams. Specifically, each element would be one of the following:
- A vector of abstract strings (tokens), e.g., `["I", "like", "Sam", ".", "Sam", "is", "nice", "."]` (scitype `AbstractVector{Textual}`)
- A dictionary of counts, indexed on abstract strings, e.g., `Dict("I"=>1, "Sam"=>2, "Sam is"=>1)` (scitype `Multiset{Textual}`)
- A dictionary of counts, indexed on plain ngrams, e.g., `Dict(("I",)=>1, ("Sam",)=>2, ("I", "Sam")=>1)` (scitype `Multiset{<:NTuple{N,Textual} where N}`); here a plain ngram is a tuple of abstract strings.
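
To make the three shapes concrete, here is a minimal sketch in plain Julia that builds each of them from one document. The helper names `simple_tokenize`, `ngrams`, and `bag` are hypothetical and written only for illustration; they are not part of MLJText.jl or any other package. The tokenizer is a naive whitespace-and-punctuation splitter, and the bags are ordinary `Dict`s standing in for multisets.

```julia
# Hypothetical helpers for illustration only; not an existing package API.

# Naive tokenizer: pad common punctuation with spaces, then split on whitespace.
function simple_tokenize(doc::AbstractString)
    spaced = replace(doc, r"([.,!?;:])" => s" \1 ")
    return String.(split(spaced))
end

# All contiguous n-grams of `tokens`, each as a tuple of strings (a "plain ngram").
function ngrams(tokens::AbstractVector{<:AbstractString}, n::Integer)
    return [Tuple(tokens[i:i+n-1]) for i in 1:length(tokens)-n+1]
end

# Count occurrences of each item; the resulting Dict plays the role of a bag.
function bag(items)
    counts = Dict{eltype(items),Int}()
    for item in items
        counts[item] = get(counts, item, 0) + 1
    end
    return counts
end

# 1. Tokenized document: a vector of strings.
tokens = simple_tokenize("I like Sam. Sam is nice.")

# 2. Bag indexed on strings: unigrams plus bigrams joined with a space.
string_bag = bag(vcat(tokens, [join(g, " ") for g in ngrams(tokens, 2)]))

# 3. Bag indexed on plain ngrams: tuples of length 1 and 2.
tuple_bag = bag(vcat(ngrams(tokens, 1), ngrams(tokens, 2)))
```

In this sketch, `string_bag["Sam is"]` and `tuple_bag[("Sam", "is")]` count the same bigram under the two key conventions. The tuple form keeps token boundaries unambiguous, which matters when a token itself may contain a space.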