Added Hindi stopwords to NLTK stopwords corpus #238
base: gh-pages
@tomaarsen @stevenbird
Thank you for the contribution! To help move this PR forward, could you please provide additional context, such as the source or justification for the Hindi stopwords list and any validation or tests performed? Linking to related issues or feature requests would also be helpful.
I generated the initial Hindi stopwords list using ChatGPT, then manually reviewed it and cross-checked it against trusted sources such as the Indic NLP Library to ensure quality and relevance. I’m happy to refine the list further or validate it more rigorously based on community feedback.
To strengthen the PR, you might consider comparing the proposed list with stopwords identified through methods like TF-IDF or other statistical approaches to ensure its effectiveness and completeness.
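One way such a comparison could look is sketched below. It is not what the PR author or reviewers ran: it uses plain document frequency (the quantity that IDF inverts) rather than full TF-IDF, whitespace tokenization, a four-sentence toy corpus, and a made-up `proposed` set. The function names and the `min_doc_ratio` threshold are hypothetical.

```python
from collections import Counter


def document_frequency_candidates(docs, min_doc_ratio=0.6):
    """Return words that occur in at least min_doc_ratio of the documents."""
    doc_freq = Counter()
    for doc in docs:
        # Count each word once per document (document frequency, not term frequency).
        doc_freq.update(set(doc.split()))
    threshold = min_doc_ratio * len(docs)
    return {word for word, count in doc_freq.items() if count >= threshold}


def compare_with_proposed(candidates, proposed):
    """Split the two word sets into agreement and disagreement buckets."""
    return {
        "agreed": candidates & proposed,
        "missing_from_proposed": candidates - proposed,
        "unsupported_in_proposed": proposed - candidates,
    }


# Toy corpus, for illustration only; a real check needs a large Hindi corpus.
docs = [
    "राम और श्याम स्कूल में हैं",
    "सीता और गीता बाजार में हैं",
    "किताब मेज पर है",
    "कलम और कागज बैग में है",
]
# Hypothetical proposed list, NOT the list from this PR.
proposed = {"और", "पर", "है"}

candidates = document_frequency_candidates(docs)
report = compare_with_proposed(candidates, proposed)
for bucket, words in report.items():
    print(bucket, sorted(words))
```

Words in `missing_from_proposed` are frequent in the corpus but absent from the list, and words in `unsupported_in_proposed` are listed but not frequent under this threshold. Either bucket is only a prompt for manual review, since a small or domain-specific corpus will skew both.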
Hi @mridulchdry17, Gemini reviewed your proposed Hindi stopword list and has a few questions about its scope and content, with the aim of making it as comprehensive and accurate as possible for general NLP use.
1. Justification for including certain words as stopwords:
2. Justification for the absence of otherwise common stopwords:
Understanding these choices would help align the list with general NLP best practices for stopword removal, which typically focuses on words that are frequent but carry little unique semantic information across diverse contexts. Thanks!
Hi @mridulchdry17, this PR seems stale, so it could be appropriate to close it until you learn more about the topic.
This PR adds a list of common Hindi stopwords to the corpora directory. These stopwords can be useful for preprocessing in NLP tasks involving Hindi text.
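As a rough illustration of the preprocessing step mentioned here, the sketch below filters whitespace-separated tokens against a stopword set. The three-word set is a placeholder, not the list proposed in this PR, and `remove_stopwords` is a hypothetical helper rather than an NLTK function.

```python
def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]


# Placeholder stopwords for illustration; NOT the list added by this PR.
sample_stopwords = {"और", "में", "है"}

tokens = "राम और सीता घर में है".split()
filtered = remove_stopwords(tokens, sample_stopwords)
print(filtered)
```

Using a `set` keeps each membership check constant time on average, which matters once the token stream is large.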