Allow token processing "middleware" #116

@mlucool

Hi,

Would it be possible to add an option that first computes the string distance between each token and the words in the positive/negative lists and then, if the similarity is above some threshold, treats the token as that word? That way spelling mistakes and casual writing styles are not lost.

e.g.

> sentiment('Cats are dumb');
{ score: -3,
  comparative: -1,
  tokens: [ 'cats', 'are', 'dumb' ],
  words: [ 'dumb' ],
  positive: [],
  negative: [ 'dumb' ] }
> sentiment('Cats are dumbbb');
{ score: 0,
  comparative: 0,
  tokens: [ 'cats', 'are', 'dumbbb' ],
  words: [],
  positive: [],
  negative: [] }

In this example, dumbbb is so close to dumb that it should be classified as such. A library like natural makes this easy:

require('natural').JaroWinklerDistance('dumb', 'dumbbb')
0.9333333333333333

If adding natural as a dependency is out of scope, a hook that lets callers inject their own token-processing step would work too.
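To make the idea concrete, here is a rough sketch of what such a processing step might look like. It is only an illustration: `jaro`, `jaroWinkler`, and `fuzzyTokens` are hypothetical helper names, not part of sentiment or natural, and the similarity function is a hand-rolled Jaro-Winkler so the snippet has no dependencies. The 0.9 threshold is an arbitrary assumption and would need tuning.

```javascript
// Hypothetical sketch: none of these functions exist in sentiment or natural.

// Jaro similarity between two strings, in the range [0, 1].
function jaro(a, b) {
  if (a === b) return 1;
  if (a.length === 0 || b.length === 0) return 0;
  // Characters only count as matching if they are within this distance.
  const range = Math.max(0, Math.floor(Math.max(a.length, b.length) / 2) - 1);
  const aMatched = new Array(a.length).fill(false);
  const bMatched = new Array(b.length).fill(false);
  let matches = 0;
  for (let i = 0; i < a.length; i++) {
    const lo = Math.max(0, i - range);
    const hi = Math.min(b.length - 1, i + range);
    for (let j = lo; j <= hi; j++) {
      if (!bMatched[j] && a[i] === b[j]) {
        aMatched[i] = true;
        bMatched[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;
  // Count matched characters that appear in a different order.
  let k = 0;
  let outOfOrder = 0;
  for (let i = 0; i < a.length; i++) {
    if (!aMatched[i]) continue;
    while (!bMatched[k]) k++;
    if (a[i] !== b[k]) outOfOrder++;
    k++;
  }
  const t = outOfOrder / 2;
  return (matches / a.length + matches / b.length + (matches - t) / matches) / 3;
}

// Jaro-Winkler: boosts the Jaro score for a shared prefix of up to 4 characters.
function jaroWinkler(a, b) {
  const j = jaro(a, b);
  let prefix = 0;
  while (prefix < 4 && prefix < a.length && prefix < b.length && a[prefix] === b[prefix]) {
    prefix++;
  }
  return j + prefix * 0.1 * (1 - j);
}

// Replace each token with the most similar lexicon word, if any word
// scores at or above the threshold; otherwise keep the token unchanged.
function fuzzyTokens(tokens, lexicon, threshold = 0.9) {
  return tokens.map((token) => {
    let best = token;
    let bestScore = -1;
    for (const word of lexicon) {
      const score = jaroWinkler(token, word);
      if (score >= threshold && score > bestScore) {
        best = word;
        bestScore = score;
      }
    }
    return best;
  });
}

// Example: a tiny stand-in lexicon, not the real word list.
console.log(fuzzyTokens(['cats', 'are', 'dumbbb'], ['dumb', 'good']));
```

If the library exposed a hook that received the token array before scoring, a function like `fuzzyTokens` could be passed in by the caller, which would keep natural (or any other distance library) out of the core dependencies.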

What do you think? Would this work?
