Add data set for text analysis #19

@ablaom

Taken from the MLJText.jl requirements for transformers:

Generate a vector whose elements are either tokenized documents or bags of words/ngrams. Specifically, each element would be one of the following:

  • A vector of abstract strings (tokens), e.g., ["I", "like", "Sam",
    ".", "Sam", "is", "nice", "."] (scitype AbstractVector{Textual})

  • A dictionary of counts, indexed on abstract strings, e.g.,
    Dict("I"=>1, "Sam"=>2, "Sam is"=>1) (scitype Multiset{Textual})

  • A dictionary of counts, indexed on plain ngrams, e.g.,
    Dict(("I",)=>1, ("Sam",)=>2, ("I", "Sam")=>1) (scitype
    Multiset{<:NTuple{N,Textual} where N}); here a plain ngram is a
    tuple of abstract strings.
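
For illustration, here is a minimal sketch of how the three element formats above could be produced from one raw document, using only Base Julia. The helper names `tokenize`, `plain_ngrams` and `bag` are hypothetical and are not part of MLJText.jl; the regex tokenizer is a naive assumption, not the tokenization any particular data set would use.

```julia
# Hypothetical helpers (Base Julia only); not part of MLJText.jl.

# Naive tokenizer: runs of word characters, or single punctuation marks.
tokenize(doc::AbstractString) =
    [String(m.match) for m in eachmatch(r"\w+|[^\w\s]", doc)]

# Plain n-grams: tuples of n consecutive tokens.
plain_ngrams(tokens, n::Integer) =
    [Tuple(tokens[i:i+n-1]) for i in 1:length(tokens)-n+1]

# Count occurrences of each item, returning a Dict of counts.
function bag(items::AbstractVector)
    counts = Dict{eltype(items),Int}()
    for item in items
        counts[item] = get(counts, item, 0) + 1
    end
    return counts
end

doc = "I like Sam. Sam is nice."

# 1. Tokenized document: a vector of strings.
tokens = tokenize(doc)

# 2. Bag of words/ngrams indexed on strings (bigrams joined by a space).
word_bag = bag(vcat(tokens, [join(g, " ") for g in plain_ngrams(tokens, 2)]))

# 3. Bag of plain ngrams indexed on tuples of strings.
ngram_bag = bag(vcat(plain_ngrams(tokens, 1), plain_ngrams(tokens, 2)))
```

A data set in the requested form would then be a vector of such elements, one per document, e.g. `[tokenize(d) for d in docs]` for some collection `docs` of raw strings.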
