Bengali-Stemmer

Language is ambiguous. A single word can take many forms; for example, the word "give" can appear as "given" or "gave". Both forms share a common root, "give". In NLP, when we represent a sentence in a vector space, it helps to reduce this ambiguity if we represent each word by its root alone. This process is called stemming. The Wikipedia article at https://en.wikipedia.org/wiki/Stemming covers it in more detail.
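As a rough illustration of why stemming shrinks the vector space, the sketch below builds a bag-of-words count with and without a stemmer. The `ROOTS` lookup table and the `stem` and `bag_of_words` helpers are hypothetical names made up for this example; they are not part of this repository, and a real stemmer would learn or derive the mapping rather than hard-code it.

```python
from collections import Counter

# Hypothetical lookup table from inflected forms to a root (illustration only).
ROOTS = {"given": "give", "gave": "give", "gives": "give", "giving": "give"}


def stem(word: str) -> str:
    """Return the root of a known inflected form, else the lowercased word."""
    lowered = word.lower()
    return ROOTS.get(lowered, lowered)


def bag_of_words(sentence: str, use_stemmer: bool) -> Counter:
    """Count whitespace-separated tokens, optionally mapping each to its root."""
    tokens = sentence.split()
    if use_stemmer:
        return Counter(stem(token) for token in tokens)
    return Counter(token.lower() for token in tokens)


sentence = "She gave what he had given"
raw = bag_of_words(sentence, use_stemmer=False)      # "gave" and "given" are separate keys
stemmed = bag_of_words(sentence, use_stemmer=True)   # both collapse into "give"

print(len(raw), len(stemmed))
```

With the stemmer enabled, "gave" and "given" share one dimension, so the sentence needs one fewer feature.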

Implementing a stemmer for Bangla, however, is harder. Bengali has 49 letters in its alphabet (11 vowels and 38 consonants), plus 18 potential diacritics, or accents. As a result, there are many more graphemes, the smallest units in a written language. The added complexity results in roughly 13,000 different grapheme variations, compared to about 250 graphemic units in English.
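To make the grapheme point concrete, the sketch below shows how one visible Bengali grapheme can be built from several Unicode code points. The `grapheme_units` helper is a hypothetical, simplified rule written for this example (attach combining marks to the previous unit, and join a consonant that follows a hasanta); it is not the full Unicode segmentation algorithm and is not code from this repository.

```python
import unicodedata

VIRAMA = "\u09cd"  # Bengali hasanta, which joins consonants into a conjunct


def grapheme_units(text: str) -> list[str]:
    """Group code points into approximate grapheme units (simplified rule)."""
    units: list[str] = []
    for ch in text:
        # Mn/Mc are combining marks: vowel signs, hasanta, anusvara, etc.
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        # A character right after a hasanta continues the same conjunct.
        joins_previous = bool(units) and units[-1].endswith(VIRAMA)
        if units and (is_mark or joins_previous):
            units[-1] += ch
        else:
            units.append(ch)
    return units


# The word "স্ত্রী": sa + hasanta + ta + hasanta + ra + vowel sign ii
word = "\u09b8\u09cd\u09a4\u09cd\u09b0\u09c0"
units = grapheme_units(word)

print(len(word), len(units))
```

Here six code points collapse into a single written unit, which is why the number of distinct graphemes grows so much faster than the number of letters.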

To implement a stemmer for Bangla, I tried several approaches, including seq2seq and SVM. Please go through the "REVE TASK.ipynb" file for details.

Repository files:

.ipynb_checkpoints
README.md
REVE TASK.ipynb
STEMMER.csv
base_form.txt
decisiontree.sav
inflected_form.txt
salman_new.ipynb
support_vector_machine.sav
