Noisy text from server #11

Text fields in this data frequently contain noise (mistakes), such as a stray white space in the middle of a word.
This can be detected and treated. We should be able to measure the extent of the problem, and also implement a naive, computationally cheap algorithm to remove these mistakes. One such approach can be summarized like this:
0. words = split(text)
1. for each pair (w, v) of consecutive words:
2. if w is not a stop word *and* v is not a word recognized by the dictionary:
3. check that w + v is a recognized word
4. if so, join w and v into a single word
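
The steps above could be sketched in Python roughly as follows. This is only a minimal sketch under some assumptions: the function name `merge_split_words`, the toy `DICTIONARY` and `STOP_WORDS` sets, and the lowercase comparison are all illustrative and not part of the original proposal. A real run would need a proper word list and some handling of punctuation.

```python
def merge_split_words(text, dictionary, stop_words):
    """Join consecutive tokens when a stray space appears to split one word.

    `dictionary` and `stop_words` are sets of lowercase words (assumed inputs).
    """
    tokens = text.split()                      # step 0
    result = []
    i = 0
    while i < len(tokens):
        w = tokens[i]
        if i + 1 < len(tokens):
            v = tokens[i + 1]                  # step 1: consecutive pair (w, v)
            joined = w + v
            if (w.lower() not in stop_words            # step 2
                    and v.lower() not in dictionary
                    and joined.lower() in dictionary):  # step 3
                result.append(joined)          # step 4: join w and v
                i += 2                         # skip v, it was consumed
                continue
        result.append(w)
        i += 1
    return " ".join(result)


# Toy word lists, purely illustrative.
DICTIONARY = {"the", "server", "returned", "an", "error", "message"}
STOP_WORDS = {"the", "an"}

cleaned = merge_split_words(
    "the ser ver retur ned an error message", DICTIONARY, STOP_WORDS
)
print(cleaned)
```

Note that a split such as "err or" would not be repaired when "or" is itself in the dictionary, since step 2 requires v to be unrecognized. That conservatism is what keeps the rule from joining legitimate word pairs.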

This is naive, but it does not seem to introduce new errors, and it can reduce the kind of error where a single white space breaks one word in two.
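
To measure the extent of the problem, one option is to count how many consecutive pairs the same rule would flag in a text field, without modifying the text. The sketch below assumes that idea; `count_split_candidates` and the toy word sets are hypothetical names chosen for illustration, and the count is only a rough estimate of this one error type.

```python
def count_split_candidates(text, dictionary, stop_words):
    """Return (number of pairs the join rule would merge, number of tokens)."""
    tokens = [t.lower() for t in text.split()]
    candidates = 0
    i = 0
    while i + 1 < len(tokens):
        w, v = tokens[i], tokens[i + 1]
        if w not in stop_words and v not in dictionary and (w + v) in dictionary:
            candidates += 1
            i += 2          # both tokens belong to the same broken word
        else:
            i += 1
    return candidates, len(tokens)


# Toy word lists, purely illustrative.
DICTIONARY = {"the", "server", "returned", "an", "error", "message"}
STOP_WORDS = {"the", "an"}

found, total = count_split_candidates(
    "the ser ver retur ned an error message", DICTIONARY, STOP_WORDS
)
print(f"{found} suspected split words among {total} tokens")
```

Summing these counts over all text fields would give a simple noise rate to compare before and after cleaning.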

