Documentation on tuning the lemmatizer #114

@otherdave

Is your feature request related to a problem? Please describe.
I am using some sample code to tokenize Spanish text and extract lemmas from it. It is not working quite as I expected, so I'm looking for documentation on the tokenizer/lemmatizer or on ways to tune it.

Describe the solution you'd like
I gave it the string "Quiero llamarte Susan" and expected the lemma of "llamarte" to be "llamar", but it came back as "llamarte" (the enclitic pronoun "te" was not stripped).

I'd like to know where to go next to learn more about what is happening and what's expected.

I'm using this code to tokenize it:

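    using Catalyst;
    using Mosaik.Core;

    // Register the Spanish models before building the pipeline.
    Catalyst.Models.Spanish.Register();
    var nlp = await Pipeline.ForAsync(Language.Spanish);
    var doc = new Document(text, Language.Spanish);

    // Run the document through the pipeline.
    nlp.ProcessSingle(doc);

    var tokenList = doc.ToTokenList();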

I'm not sure whether there is any tuning or tweaking I can do to get the desired result, whether this output is expected, or whether it is a limitation of the lemmatizer.
