This project demonstrates a small-scale multilingual NER pipeline for code-mixed Indian languages (Telugu/Tamil + English). Inspired by internship work at Palmtree Infotech.
- Hand-annotated sentences mixing English with Telugu/Tamil
- Entity labels: PERSON, LOCATION, ORGANIZATION, FOOD, etc.
- Format: CSV (sentences + annotation spans)
- fastText (language detection)
- Meta NLLB (translation)
- spaCy (custom NER)
- GLiNER (zero-shot NER)
- HuggingFace Transformers
- Pandas, Python
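
The dataset is described above as CSV files holding sentences plus annotation spans. The exact column layout is not given, so the sketch below assumes one hypothetical layout: one row per entity, with `sentence`, `start`, `end` (end-exclusive character offsets), and `label` columns. It uses only the standard library and checks that each span falls inside its sentence.

```python
import csv
import io

# Hypothetical layout (an assumption, not the project's confirmed schema):
# one row per entity span, with end-exclusive character offsets.
SAMPLE_CSV = '''sentence,start,end,label
"I loved the dosai at Sangeetha, Chennai!",12,17,FOOD
"I loved the dosai at Sangeetha, Chennai!",21,30,ORGANIZATION
"I loved the dosai at Sangeetha, Chennai!",32,39,LOCATION
'''


def load_annotations(csv_text):
    """Group span rows by sentence and check that every span is in bounds."""
    grouped = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        sentence = row["sentence"]
        start, end = int(row["start"]), int(row["end"])
        if not 0 <= start < end <= len(sentence):
            raise ValueError(f"span {start}:{end} out of range for {sentence!r}")
        grouped.setdefault(sentence, []).append((start, end, row["label"]))
    return grouped


annotations = load_annotations(SAMPLE_CSV)
for sentence, spans in annotations.items():
    for start, end, label in spans:
        # Print each label next to the text it covers.
        print(label, sentence[start:end])
```

If the real files use different column names or token offsets instead of character offsets, only the row-parsing lines would need to change.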
| Sentence | Entities Detected |
|---|---|
| "I loved the dosai at Sangeetha, Chennai!" | FOOD: dosai, ORG: Sangeetha, LOC: Chennai |
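
Token-classification models, such as those trained with HuggingFace Transformers, usually expect per-token tags rather than character spans. The sketch below converts the example row into BIO tags. The BIO scheme, the regex tokenizer, and the character offsets are assumptions made for illustration; the project does not state how it aligns spans to tokens.

```python
import re


def spans_to_bio(text, spans):
    """Return (token, tag) pairs; spans are (start, end, label), end-exclusive."""
    tagged = []
    # Simple tokenizer: runs of word characters, or single punctuation marks.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span only if it lies fully inside it.
            if start <= match.start() and match.end() <= end:
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{label}"
                break
        tagged.append((match.group(), tag))
    return tagged


text = "I loved the dosai at Sangeetha, Chennai!"
spans = [(12, 17, "FOOD"), (21, 30, "ORG"), (32, 39, "LOC")]
for token, tag in spans_to_bio(text, spans):
    print(f"{token}\t{tag}")
```

This tokenizer is only adequate for Latin-script text. Native Telugu or Tamil script contains combining vowel signs that `\w` may not match, so a proper tokenizer would be needed there.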
- Clone the repo
- Navigate to `notebooks/ner_experiments.ipynb`
- Install dependencies from `requirements.txt`
- Run the notebook to test entity extraction with spaCy or GLiNER
- `data/`: CSV files with sentences and annotations
- `notebooks/`: Jupyter notebooks with demo pipeline
- `scripts/`: Optional scripts for training/evaluation
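
As an illustration of what an evaluation script might compute, here is a minimal sketch of exact-match precision, recall, and F1 over entity spans. The function name `span_prf` and the sample predictions are hypothetical; the repository's actual scripts may use a different metric or matching rule.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if true_pos else 0.0
    return precision, recall, f1


gold = [(12, 17, "FOOD"), (21, 30, "ORG"), (32, 39, "LOC")]
# Made-up model output: one correct span, one wrong label, one missed entity.
pred = [(12, 17, "FOOD"), (21, 30, "LOC")]

precision, recall, f1 = span_prf(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching counts a span with the right boundaries but the wrong label as an error, which is strict. A partial-match or per-label breakdown may be more informative on a small dataset.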
This is a public reconstruction of internship work using synthetic data and open-source tools. No proprietary data or internal IP is shared here.