LanguageDetector not working properly. #10233
-
How to reproduce the behaviour

I have wrapped LanguageDetector as shown below in order to identify English sentences, because I need to discard them for my project. I'm using a Spanish model. The code looks like this:
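A minimal sketch of what this kind of wrapper typically looks like, assuming spaCy 3.x and the spacy-langdetect package (the model name es_core_news_sm stands in for whichever Spanish model is actually loaded):

```python
import spacy
from spacy.language import Language
from spacy_langdetect import LanguageDetector

# Register the third-party detector as a spaCy 3.x pipeline factory
@Language.factory("language_detector")
def create_language_detector(nlp, name):
    return LanguageDetector()

nlp = spacy.load("es_core_news_sm")  # assumed Spanish model
nlp.add_pipe("language_detector", last=True)
```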
Then, to use it, I just do:
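Roughly like this (a sketch; spacy-langdetect stores its result on the doc._.language extension):

```python
doc = nlp("I like strawberries a lot")
print(doc._.language)  # a dict such as {'language': ..., 'score': ...}
```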
The problem here is that for some reason it is not properly identifying some English sentences (quite simple ones). For example:

sentence: I like strawberries a lot
sentence: I like oranges a lot
{'language': 'no', 'score': 0.9999961134960113}

(This is not even a language!!!!)

I have investigated this and I cannot find any issue like this one. Any ideas? Thank you very much in advance.

Your Environment
Replies: 2 comments
-
I assume you are using the code from here? That's a third-party module - we didn't write it and can't help you with it. You should ask the developer directly. However, it looks like the package was never updated after its release three years ago, so it may just be abandoned. Note that no is often used as a language code for Norwegian.

You should be able to integrate a more actively maintained language detector easily; I think Facebook has one. Because you want to discard sentences that are the wrong language, there's not really any reason to wrap it as a spaCy component either - you can just filter texts before passing them to spaCy.
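As an illustration, a sketch of that filter-first approach using fastText's published language ID model (lid.176.ftz from https://fasttext.cc/docs/en/language-identification.html); the helper name and threshold here are illustrative, not part of any library API:

```python
import fasttext

# Assumes the lid.176.ftz model file has been downloaded locally
lang_model = fasttext.load_model("lid.176.ftz")

def is_spanish(text: str, threshold: float = 0.5) -> bool:
    # fastText labels look like "__label__es"; predict() rejects newlines
    labels, probs = lang_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__es" and probs[0] >= threshold

texts = ["Me gustan mucho las fresas", "I like strawberries a lot"]
# Only Spanish texts ever reach the spaCy pipeline
docs = list(nlp.pipe(t for t in texts if is_spanish(t)))
```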
-
Thank you very much for your suggestion. I removed the language detector from my spaCy pipeline and I'm now using pycld2 with cld2.detect, and it's working fine :D
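For anyone finding this later, a minimal sketch of that pycld2 usage (the is_english helper is illustrative):

```python
import pycld2 as cld2

def is_english(text: str) -> bool:
    # cld2.detect returns (isReliable, textBytesFound, details); details holds
    # (languageName, languageCode, percent, score) tuples, best match first
    is_reliable, _, details = cld2.detect(text)
    return is_reliable and details[0][1] == "en"

print(is_english("I like strawberries a lot"))   # True
print(is_english("Me gustan mucho las fresas"))  # False
```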