Description
How to reproduce the behaviour
I have wrapped LanguageDetector as below in order to identify English sentences, because I need to discard them for my project. I'm using a Spanish model. The code looks like this:
import spacy
from spacy.language import Language
from spacy_langdetect import LanguageDetector

def create_lang_detector(nlp, name):
    return LanguageDetector()

Language.factory("language_detector", func=create_lang_detector)

nlp = spacy.load('es_core_news_md', disable=['tok2vec', 'textcat', 'ner', 'lemmatizer'])
nlp.max_length = globals_variables.global_nlp_max_length  # 6000000
nlp.add_pipe('language_detector', last=True)
Then, to use it, I just do:
doc = nlp(sentence)
print(doc._.language['language'])
The problem is that, for some reason, it does not correctly identify some quite simple English sentences. For example:
sentence: I like strawberries a lot
{'language': 'af', 'score': 0.8571416745386518} (this score varies a lot between runs, from 0.71 to 0.9999...)
sentence: I like oranges a lot
{'language': 'no', 'score': 0.9999961134960113} (This is not even a language!!!!)
I have searched for this and cannot find any similar issue. Any ideas?
Thank you very much in advance.
Your Environment
- spaCy version: 3.1.3
- Platform: Windows-10-10.0.18362-SP0
- Python version: 3.7.6
- Pipelines: es_core_news_md (3.1.0)