This repo provides a clean implementation of a language detection system in TensorFlow 2, following best practices. The following languages are supported:
- Bulgarian
- Czech
- Danish
- Dutch
- English (Of course)
- Estonian
- Finnish
- French
- German
- Greek
- Hungarian
- Italian
- Latvian
- Lithuanian
- Polish
- Portuguese
- Romanian
- Slovak
- Slovenian
- Spanish
- Swedish
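At inference time the classifier outputs one probability per supported language, and the predicted language is the argmax over that list. A minimal sketch of that mapping (the ordering of `LANGUAGES` here is an assumption for illustration; the real class order depends on how labels were encoded during training):

```python
# Hypothetical sketch: map a model's softmax output to a language name.
# The ordering of LANGUAGES is an assumption, not the repo's actual class order.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "Finnish", "French", "German", "Greek", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Polish", "Portuguese", "Romanian",
    "Slovak", "Slovenian", "Spanish", "Swedish",
]

def label_from_probs(probs):
    """Return the language whose probability is highest."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return LANGUAGES[best]
```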
```shell
# TensorFlow CPU
conda activate <your-environment>
pip install -r requirements.txt
```

**NOTE:** Each model requires its respective tokenizer to work, so kindly download models along with their tokenizers.
```shell
# Model
wget https://github.com/saahiluppal/langdet/blob/master/model.h5
# Tokenizer
wget https://github.com/saahiluppal/langdet/blob/master/tokenizer.json
```

Not sure which model to use? You can find information about the models here.
```shell
# Want to detect a language? (We recommend using more than 5 words for better accuracy.)
# File dependencies will be added soon.
python detect.py
```
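Before prediction, `detect.py` presumably converts the input text to an integer sequence with the saved tokenizer and pads it to the model's fixed input length. A pure-Python illustration of those two Keras-style preprocessing steps (the toy `WORD_INDEX` and `MAXLEN` are made up; the real vocabulary comes from `tokenizer.json`):

```python
# Hypothetical illustration of Keras-style preprocessing:
# texts_to_sequences + pad_sequences, written in plain Python.
WORD_INDEX = {"the": 1, "quick": 2, "brown": 3, "fox": 4}  # toy vocabulary
MAXLEN = 6  # assumed fixed input length of the model

def texts_to_sequences(text):
    # Words missing from the vocabulary are dropped, mirroring the
    # default behaviour of keras Tokenizer.
    return [WORD_INDEX[w] for w in text.lower().split() if w in WORD_INDEX]

def pad_sequence(seq, maxlen=MAXLEN):
    seq = seq[-maxlen:]                     # 'pre' truncation: keep the last maxlen tokens
    return [0] * (maxlen - len(seq)) + seq  # 'pre' padding: zero-pad on the left

print(pad_sequence(texts_to_sequences("The quick brown fox")))
# -> [0, 0, 1, 2, 3, 4]
```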
```shell
# Train a custom model (we recommend adapting the code to better suit your needs)
python manual_tokens.py
# Jupyter notebook for the same
jupyter notebook manual_tokens.ipynb
```
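Training a custom model presumably starts by fitting a vocabulary on your corpus. A minimal sketch of building a frequency-ordered word index (how the repo actually ranks or caps the vocabulary is an assumption):

```python
from collections import Counter

def fit_word_index(texts, num_words=None):
    # Count word frequencies across the corpus, then assign indices
    # 1..N in descending-frequency order (index 0 is reserved for padding).
    counts = Counter(w for t in texts for w in t.lower().split())
    most_common = counts.most_common(num_words)
    return {word: i + 1 for i, (word, _) in enumerate(most_common)}

index = fit_word_index(["the cat sat", "the dog sat", "the cat ran"])
# "the" is the most frequent word, so it gets index 1
```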
```shell
# Want to preprocess the downloaded data for custom use?
python extraction.py
```

I used the dataset from the European Parliament Parallel Corpus, which can be found here. While the full dataset is large (1.5 GB unextracted), you might want to use the smaller preprocessed dataset, which can also be found here.
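Preprocessing presumably means turning the per-language Europarl files into (sentence, label) pairs. A sketch of that labelling step, assuming filenames like `europarl-v7.bg-en.bg` where the final suffix is the language code (the minimum-length filter is made up, not the repo's actual threshold):

```python
import os

def label_sentences(filename, lines, min_words=5):
    # Europarl files end in a two-letter language code, e.g.
    # "europarl-v7.bg-en.bg" -> "bg". Pair each sufficiently long
    # sentence with that code.
    lang = os.path.splitext(filename)[1].lstrip(".")
    return [(line.strip(), lang)
            for line in lines
            if len(line.split()) >= min_words]

pairs = label_sentences("europarl-v7.bg-en.bg",
                        ["Това е примерно изречение с пет думи.",
                         "кратко"])
# Only the first sentence survives the length filter, labelled "bg".
```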