Sentences within a dataset #6
I am working on a rather noisy dataset, so it is difficult to detect sentence boundaries reliably (for example, the text contains many abbreviations ending in periods, and those periods are detected as sentence ends).
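One common workaround for spurious splits at abbreviation periods is to whitelist known abbreviations before splitting. A minimal sketch in Python (the `ABBREVIATIONS` set and `split_sentences` helper are hypothetical names, not part of this project):

```python
# Hypothetical whitelist -- extend it with abbreviations from your corpus.
ABBREVIATIONS = {"dr.", "mr.", "e.g.", "i.e.", "approx.", "fig."}

def split_sentences(text):
    """Split on sentence-ending punctuation, but never directly
    after a whitespace-delimited token in the abbreviation whitelist."""
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if tok.endswith((".", "!", "?")) and tok.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("See Fig. 2 for details. It works."))
```

This keeps "Fig." inside its sentence instead of producing a spurious two-word fragment; a trained tokenizer with an abbreviation list would handle more edge cases, but the idea is the same.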
Do you think that having many short sentences (often just 3 words) could compromise the algorithm's performance? Is it important to preserve the information about which words belong to which sentence?
PS: Furthermore, if punctuation is filtered out, the notion of a "phrase" is lost entirely, since each document becomes a bag of words. Would the algorithm still work in that case?