Add Hindi stopwords to NLTK stopwords corpus #259
mridulchdry17 wants to merge 2 commits into nltk:gh-pages
Hi @ekaf - any feedback on this PR?
Great work on this, @mridulchdry17! Thanks for putting this together. The methodology here is really solid: deriving the list empirically from the Multilingual C4 dataset makes this mathematically grounded, and focusing on the core syntactic glue is a great approach.
I did want to raise two points for your consideration, one technical fix and one suggestion for robustness:
1. Technical Issue: File Encoding (Mojibake)
2. Suggestion: Corpus Representation and Informal Text
In real-world informal Hindi, dropped nasalizations and phonetic misspellings are incredibly high-frequency (e.g., जिंहों, इंहिं, जेसा, रवासा). For context, spaCy’s list captures about 40% more tokens precisely because it accounts for these messy variations. To ensure this list is robust for NLTK users who need to process informal or social media text, would you consider accounting for this corpus gap?
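On the encoding point, a quick sanity check along the following lines might help catch mojibake before a stopword file is committed. This is only a sketch under stated assumptions: the file is taken to be one word per line in UTF-8, every entry is expected to consist solely of characters from the Devanagari block (U+0900 to U+097F), and the function names `looks_like_mojibake` and `check_stopword_bytes` are hypothetical helpers, not part of NLTK.

```python
def looks_like_mojibake(text: str) -> bool:
    """Heuristic: Devanagari UTF-8 bytes (E0 A4 xx / E0 A5 xx) misread as
    Latin-1 or cp1252 show up as the character pairs 'à¤' or 'à¥'."""
    return "à¤" in text or "à¥" in text


def check_stopword_bytes(raw: bytes) -> list[str]:
    """Return a list of problems found in a one-word-per-line stopword file.

    Assumption: every entry should contain only Devanagari characters.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc}"]

    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        word = line.strip()
        if not word:
            continue  # ignore blank lines
        if looks_like_mojibake(word):
            problems.append(f"line {lineno}: looks like mojibake: {word!r}")
        elif not all("\u0900" <= ch <= "\u097f" for ch in word):
            problems.append(f"line {lineno}: non-Devanagari characters: {word!r}")
    return problems
```

It could be run on the raw bytes of the file, for example `check_stopword_bytes(open(path, "rb").read())`, where `path` is wherever the list lives locally. An empty result only means the heuristics found nothing; it does not prove the file is correct.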
Hi @ekaf, @stevenbird - I have substantially revised the Hindi stopword list to address the earlier concerns regarding scope, noise, and validation. Below is a concise summary of the updated methodology and results.
Updated Methodology
Corpus
Processing steps
Manual linguistic pruning to remove:
named entities
Cross-validation against existing Hindi stopword resources
This workflow aims to reflect actual usage patterns in contemporary Hindi while maintaining linguistic cleanliness.
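The cross-validation step could be sketched roughly as below. This is an illustration rather than the script used for this PR: the function name `cross_validate` is hypothetical, and the two small word sets are toy placeholders, not the actual candidate list or any real reference resource.

```python
def cross_validate(candidate: set[str], reference: set[str]) -> dict:
    """Compare a candidate stopword set with one reference set."""
    shared = candidate & reference
    return {
        "shared": sorted(shared),
        # Entries worth a manual look: present in only one of the two lists.
        "only_in_candidate": sorted(candidate - reference),
        "only_in_reference": sorted(reference - candidate),
        # Fraction of the reference list that the candidate also contains.
        "coverage_of_reference": len(shared) / len(reference) if reference else 0.0,
    }


# Toy placeholder data, for illustration only.
candidate = {"और", "का", "की", "है"}
reference = {"और", "का", "में", "है", "से"}
report = cross_validate(candidate, reference)
```

The `only_in_candidate` and `only_in_reference` entries are the ones that would feed back into manual review; the coverage figure alone says nothing about whether the differences are noise or genuine gaps.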
Final Statistics
The resulting list maintains broad grammatical coverage while remaining corpus-grounded.
Cross-Library Context
For additional context, I compared major English stopword resources (NLTK, spaCy, scikit-learn, stopwords-iso). These lists overlap by roughly 30-50 percent, depending on which pair is compared.
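A pairwise comparison of this kind can be sketched as follows. The overlap measure is an assumption here (Jaccard similarity, intersection over union), since the comment does not say which measure was used, and the three sets are toy placeholders rather than the real library lists.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_overlap(lists: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Jaccard similarity for every unordered pair of named lists."""
    return {
        (x, y): jaccard(lists[x], lists[y])
        for x, y in combinations(sorted(lists), 2)
    }


# Toy placeholder sets, not the actual stopword lists.
toy_lists = {
    "list_a": {"the", "and", "of", "to"},
    "list_b": {"the", "and", "in", "is"},
    "list_c": {"the", "of", "to", "a", "an"},
}
overlaps = pairwise_overlap(toy_lists)
```

In practice the sets would come from the libraries themselves, for example `set(nltk.corpus.stopwords.words("english"))`, `spacy.lang.en.stop_words.STOP_WORDS`, and `sklearn.feature_extraction.text.ENGLISH_STOP_WORDS`.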
Request for Further Guidance
I would be happy to refine the list further if there are:
preferred size expectations for NLTK
specific linguistic categories to exclude
additional validation you would like to see