Add Hindi stopwords to NLTK stopwords corpus#259

Open
mridulchdry17 wants to merge 2 commits into nltk:gh-pages from mridulchdry17:gh-pages

Conversation

@mridulchdry17

Hi @ekaf, @stevenbird, I have substantially revised the Hindi stopword list to address the earlier concerns regarding scope, noise, and validation. Below is a concise summary of the updated methodology and results.

Updated Methodology

  • The current list is primarily corpus-driven and was rebuilt using the following pipeline.

Corpus

  • Multilingual C4 (Hindi subset)
  • ~50k streamed documents (multi-domain web Hindi)
  • Millions of tokens processed

Processing steps

  • Unicode normalization (NFC)
  • Devanagari-only token extraction via regex
  • Frequency distribution over the corpus
  • Selection of high-frequency candidates (top-K window)
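
A minimal sketch of these steps in Python (illustrative only, not the exact script used; the document iterator and the top-K cutoff are stand-ins for the real parameters):

```python
import re
import unicodedata
from collections import Counter

# The Devanagari block is U+0900-U+097F; keep tokens made only of these characters.
DEVANAGARI_TOKEN = re.compile(r"[\u0900-\u097F]+")

def candidate_stopwords(documents, top_k=500):
    """Return the top_k most frequent Devanagari tokens across the documents."""
    freq = Counter()
    for doc in documents:
        text = unicodedata.normalize("NFC", doc)     # Unicode normalization (NFC)
        freq.update(DEVANAGARI_TOKEN.findall(text))  # Devanagari-only extraction
    return [token for token, _ in freq.most_common(top_k)]

# Toy usage; the real run streamed ~50k mC4 Hindi documents:
docs = ["यह एक परीक्षण है", "यह दूसरा परीक्षण है"]
print(candidate_stopwords(docs, top_k=3))
```

The high-frequency candidates returned by such a pipeline are then passed to the manual pruning step.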

Manual linguistic pruning and validation

  • Removal of named entities from the candidate set
  • Cross-validation against existing Hindi stopword resources

This workflow aims to reflect actual usage patterns in contemporary Hindi while maintaining linguistic cleanliness.

Final Statistics

  • Final list size: 206
  • ISO overlap: 0.604
  • spaCy overlap: 0.609
  • Common words with ISO: 136
  • Common words with spaCy: 140
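
For reference, the overlap figures can be reproduced with simple set arithmetic. The sketch below assumes overlap is measured as the share of the reference list covered (|common| / |reference|), which the summary above does not state explicitly, and the word lists are toy stand-ins:

```python
def overlap(candidate, reference):
    """Return (shared word count, fraction of the reference list covered)."""
    common = set(candidate) & set(reference)
    return len(common), len(common) / len(reference)

# Hypothetical toy lists standing in for the 206-word candidate list and an
# external reference such as stopwords-iso or spaCy's Hindi list:
candidate = ["और", "का", "की", "के", "है"]
reference = ["और", "का", "की", "में", "से"]
print(overlap(candidate, reference))  # 3 shared words, 0.6 of the reference covered
```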

The resulting list maintains broad grammatical coverage while remaining corpus-grounded.

Cross-Library Context

For additional context, I compared major English stopword resources (NLTK, spaCy, scikit-learn, stopwords-iso). Even these established lists overlap with one another by only 30-50 percent, depending on which pair is compared.

Request for Further Guidance

I would be happy to refine the list further if there are:

  • preferred size expectations for NLTK
  • specific linguistic categories to exclude
  • additional validation you would like to see

@mridulchdry17
Author

hi @ekaf - any feedback on this PR?

@ekaf
Member

ekaf commented Mar 20, 2026

Great work on this, @mridulchdry17!

Thanks for putting this together. The methodology here is really solid—deriving the list empirically from the Multilingual C4 dataset makes this mathematically grounded, and focusing on the core syntactic glue is a great approach.

I did want to raise two points for your consideration—one technical fix and one suggestion for robustness:

1. Technical Issue: File Encoding (Mojibake)
It looks like the stopword list is currently suffering from an encoding issue. The Hindi text seems to have been saved or read using a Western European encoding (such as Windows-1252 or ISO-8859-1) instead of UTF-8. For example, अंदर is appearing as à¤…à¤‚à¤¦à¤°. We should ensure the file is strictly saved and decoded as UTF-8; otherwise, string matching will fail for standard Devanagari input.
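
The round trip behind this kind of mojibake, and its repair, can be demonstrated in a few lines (a generic illustration of the UTF-8 vs. Windows-1252 confusion, not a patch to the PR's files; the file path in the comment is hypothetical):

```python
# UTF-8 bytes of Devanagari text, mis-decoded as Windows-1252, yield mojibake.
correct = "अंदर"
mojibake = correct.encode("utf-8").decode("cp1252")
print(mojibake)             # garbled Latin characters instead of Devanagari

# Repair: undo the wrong decode, then decode the raw bytes properly as UTF-8.
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repaired == correct)  # True

# The durable fix is to always pass an explicit encoding when reading the list:
# with open("stopwords/hindi", encoding="utf-8") as f:
#     hindi_stopwords = f.read().split()
```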

2. Suggestion: Corpus Representation and Informal Text
Because the C4 dataset is explicitly filtered for clean, well-formed web text, your frequency analysis yielded a highly standardized list. However, this also means the underlying corpus likely underrepresents highly informal, fast-typed text (like Twitter/X, Reddit, or product reviews).

In real-world informal Hindi, dropped nasalizations and phonetic misspellings are incredibly high-frequency (e.g., जिंहों, इंहिं, जेसा, रवासा). For context, spaCy’s list captures about 40% more tokens precisely because it accounts for these messy variations.

To ensure this list is robust for NLTK users who need to process informal or social media text, would you consider accounting for this corpus gap?
