LanguageDetector not working properly. #10233
-
How to reproduce the behaviour

I have wrapped LanguageDetector as shown below in order to identify English sentences, because I need to discard them for my project. I'm using a Spanish model. The code looks like this:
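A minimal sketch of what this kind of wrapper typically looks like, assuming spaCy 3.x and the spacy-langdetect package (the model name es_core_news_sm stands in for whichever Spanish model is actually loaded):

```python
import spacy
from spacy.language import Language
from spacy_langdetect import LanguageDetector

# Register the third-party detector as a spaCy 3.x pipeline factory
@Language.factory("language_detector")
def create_language_detector(nlp, name):
    return LanguageDetector()

nlp = spacy.load("es_core_news_sm")  # assumed Spanish model
nlp.add_pipe("language_detector", last=True)
```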
Then, to use it, I just do:
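Roughly like this (a sketch; spacy-langdetect stores its result on the doc._.language extension):

```python
doc = nlp("I like strawberries a lot")
print(doc._.language)  # a dict such as {'language': ..., 'score': ...}
```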
The problem here is that for some reason it is not properly identifying some English sentences (quite simple ones). For example:

sentence: I like strawberries a lot
sentence: I like oranges a lot
{'language': 'no', 'score': 0.9999961134960113}

(This is not even a language!!!!)

I have investigated this and I cannot find any issue like this one. Any ideas? Thank you very much in advance.

Your Environment
Replies: 2 comments
-
I assume you are using the code from here? That's a third-party module - we didn't write it and can't help you with it. You should ask the developer directly. However, it looks like the package was never updated after its release three years ago, so it may just be abandoned. Note that no is often used as a language code for Norwegian.

You should be able to integrate a more actively maintained language detector easily; I think Facebook has one. Because you want to discard sentences that are the wrong language, there's not really any reason to wrap it as a spaCy component either - you can just filter texts before passing them to spaCy.
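As an illustration, a sketch of that filter-first approach using fastText's published language ID model (lid.176.ftz from https://fasttext.cc/docs/en/language-identification.html); the helper name and threshold here are illustrative, not part of any library API:

```python
import fasttext

# Assumes the lid.176.ftz model file has been downloaded locally
lang_model = fasttext.load_model("lid.176.ftz")

def is_spanish(text: str, threshold: float = 0.5) -> bool:
    # fastText labels look like "__label__es"; predict() rejects newlines
    labels, probs = lang_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__es" and probs[0] >= threshold

texts = ["Me gustan mucho las fresas", "I like strawberries a lot"]
# Only Spanish texts ever reach the spaCy pipeline
docs = list(nlp.pipe(t for t in texts if is_spanish(t)))
```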
-
Thank you very much for your suggestion. I removed the language detector from my spaCy pipeline and I'm now using pycld2 with cld2.detect, and it's working fine :D
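For anyone finding this later, a minimal sketch of that pycld2 usage (the is_english helper is illustrative):

```python
import pycld2 as cld2

def is_english(text: str) -> bool:
    # cld2.detect returns (isReliable, textBytesFound, details); details holds
    # (languageName, languageCode, percent, score) tuples, best match first
    is_reliable, _, details = cld2.detect(text)
    return is_reliable and details[0][1] == "en"

print(is_english("I like strawberries a lot"))   # True
print(is_english("Me gustan mucho las fresas"))  # False
```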