Add Hindi stopwords to NLTK stopwords corpus#259

Open
mridulchdry17 wants to merge 2 commits into nltk:gh-pages from mridulchdry17:gh-pages

Conversation

@mridulchdry17

Hi @ekaf, @stevenbird, I have substantially revised the Hindi stopword list to address the earlier concerns regarding scope, noise, and validation. Below is a concise summary of the updated methodology and results.

Updated Methodology

  • The current list is primarily corpus-driven and was rebuilt using the following pipeline.

Corpus

  • Multilingual C4 (Hindi subset)
  • ~50k streamed documents (multi-domain web Hindi)
  • Millions of tokens processed

Processing steps

  • Unicode normalization (NFC)
  • Devanagari-only token extraction via regex
  • Frequency distribution over the corpus
  • Selection of high-frequency candidates (top-K window)
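
A minimal sketch of these steps in Python (illustrative only, not the exact script used; the document iterator and the top-K cutoff are stand-ins for the real parameters):

```python
import re
import unicodedata
from collections import Counter

# The Devanagari block is U+0900-U+097F; keep tokens made only of these characters.
DEVANAGARI_TOKEN = re.compile(r"[\u0900-\u097F]+")

def candidate_stopwords(documents, top_k=500):
    """Return the top_k most frequent Devanagari tokens across the documents."""
    freq = Counter()
    for doc in documents:
        text = unicodedata.normalize("NFC", doc)     # Unicode normalization (NFC)
        freq.update(DEVANAGARI_TOKEN.findall(text))  # Devanagari-only extraction
    return [token for token, _ in freq.most_common(top_k)]

# Toy usage; the real run streamed ~50k mC4 Hindi documents:
docs = ["यह एक परीक्षण है", "यह दूसरा परीक्षण है"]
print(candidate_stopwords(docs, top_k=3))
```

The high-frequency candidates returned by such a pipeline are then passed to the manual pruning step.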

Manual linguistic pruning and validation

  • Removal of named entities from the candidate set
  • Cross-validation against existing Hindi stopword resources

This workflow aims to reflect actual usage patterns in contemporary Hindi while maintaining linguistic cleanliness.

Final Statistics

  • Final list size: 206
  • ISO overlap: 0.604
  • spaCy overlap: 0.609
  • Common words with ISO: 136
  • Common words with spaCy: 140
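
For reference, the overlap figures can be reproduced with simple set arithmetic. The sketch below assumes overlap is measured as the share of the reference list covered (|common| / |reference|), which the summary above does not state explicitly, and the word lists are toy stand-ins:

```python
def overlap(candidate, reference):
    """Return (shared word count, fraction of the reference list covered)."""
    common = set(candidate) & set(reference)
    return len(common), len(common) / len(reference)

# Hypothetical toy lists standing in for the 206-word candidate list and an
# external reference such as stopwords-iso or spaCy's Hindi list:
candidate = ["और", "का", "की", "के", "है"]
reference = ["और", "का", "की", "में", "से"]
print(overlap(candidate, reference))  # 3 shared words, 0.6 of the reference covered
```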

The resulting list maintains broad grammatical coverage while remaining corpus-grounded.

Cross-Library Context

For additional context, I compared major English stopword resources (NLTK, spaCy, scikit-learn, stopwords-iso). Even these established lists overlap with one another by only 30-50 percent, depending on which pair is compared.

Request for Further Guidance

I would be happy to refine the list further if there are:

  • preferred size expectations for NLTK
  • specific linguistic categories to exclude
  • additional validation you would like to see

@mridulchdry17
Author

hi @ekaf - any feedback on this PR?

@ekaf
Member

ekaf commented Mar 20, 2026

Great work on this, @mridulchdry17!

Thanks for putting this together. The methodology here is really solid—deriving the list empirically from the Multilingual C4 dataset makes this mathematically grounded, and focusing on the core syntactic glue is a great approach.

I did want to raise two points for your consideration—one technical fix and one suggestion for robustness:

1. Technical Issue: File Encoding (Mojibake)
It looks like the stopword list is currently suffering from an encoding issue. The Hindi text seems to have been saved or read using a Western European encoding (such as Windows-1252 or ISO-8859-1) instead of UTF-8. For example, अंदर is appearing as à¤…à¤‚à¤¦à¤°. We should ensure the file is strictly saved and decoded as UTF-8; otherwise, string matching will fail for standard Devanagari input.
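
The round trip behind this kind of mojibake, and its repair, can be demonstrated in a few lines (a generic illustration of the UTF-8 vs. Windows-1252 confusion, not a patch to the PR's files; the file path in the comment is hypothetical):

```python
# UTF-8 bytes of Devanagari text, mis-decoded as Windows-1252, yield mojibake.
correct = "अंदर"
mojibake = correct.encode("utf-8").decode("cp1252")
print(mojibake)             # garbled Latin characters instead of Devanagari

# Repair: undo the wrong decode, then decode the raw bytes properly as UTF-8.
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repaired == correct)  # True

# The durable fix is to always pass an explicit encoding when reading the list:
# with open("stopwords/hindi", encoding="utf-8") as f:
#     hindi_stopwords = f.read().split()
```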

2. Suggestion: Corpus Representation and Informal Text
Because the C4 dataset is explicitly filtered for clean, well-formed web text, your frequency analysis yielded a highly standardized list. However, this also means the underlying corpus likely underrepresents highly informal, fast-typed text (like Twitter/X, Reddit, or product reviews).

In real-world informal Hindi, dropped nasalizations and phonetic misspellings are incredibly high-frequency (e.g., जिंहों, इंहिं, जेसा, रवासा). For context, spaCy’s list captures about 40% more tokens precisely because it accounts for these messy variations.

To ensure this list is robust for NLTK users who need to process informal or social media text, would you consider accounting for this corpus gap?
