This project demonstrates a small-scale multilingual NER pipeline for code-mixed Indian languages (Telugu/Tamil + English). It is inspired by internship work at Palmtree Infotech.
- Hand-annotated sentences mixing English with Telugu/Tamil
- Entity labels: PERSON, LOCATION, ORGANIZATION, FOOD, etc.
- Format: CSV (sentences + annotation spans)
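
The CSV layout described above could be read along these lines. This is a minimal sketch using only the standard library. The column names `sentence` and `spans`, the JSON encoding of spans as `[start, end, label]` character offsets, and the helper `load_annotations` are all assumptions made for illustration; the actual files in `data/` may be organized differently.

```python
import csv
import io
import json

# Hypothetical layout: one row per sentence, with spans stored as a JSON list
# of [start, end, label] character offsets. Real column names may differ.
SAMPLE_CSV = '''sentence,spans
"I loved the dosai at Sangeetha, Chennai!","[[12, 17, ""FOOD""], [21, 30, ""ORG""], [32, 39, ""LOC""]]"
'''


def load_annotations(handle):
    """Return a list of (sentence, [(entity_text, label), ...]) pairs."""
    records = []
    for row in csv.DictReader(handle):
        sentence = row["sentence"]
        spans = json.loads(row["spans"])
        # Slice the sentence with each span's character offsets.
        entities = [(sentence[start:end], label) for start, end, label in spans]
        records.append((sentence, entities))
    return records


records = load_annotations(io.StringIO(SAMPLE_CSV))
for sentence, entities in records:
    print(sentence, entities)
```

Storing character offsets rather than entity strings keeps repeated words unambiguous, at the cost of needing the offsets to stay in sync with the sentence text.
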
- fastText (language detection)
- Meta NLLB (translation)
- spaCy (custom NER)
- GLiNER (zero-shot NER)
- HuggingFace Transformers
- Pandas, Python
| Sentence | Entities Detected |
|---|---|
| "I loved the dosai at Sangeetha, Chennai!" | FOOD: dosai, ORG: Sangeetha, LOC: Chennai |
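
For reference, a row like the one above could be produced from a list of extracted entities as follows. This is only a sketch: `format_row` is a hypothetical helper, and the `(text, label)` tuples stand in for whatever the spaCy or GLiNER step actually returns.

```python
def format_row(sentence, entities):
    """Render one Markdown table row from (entity_text, label) pairs."""
    detected = ", ".join(f"{label}: {text}" for text, label in entities)
    return f'| "{sentence}" | {detected} |'


# Hand-written entities, not real model output.
row = format_row(
    "I loved the dosai at Sangeetha, Chennai!",
    [("dosai", "FOOD"), ("Sangeetha", "ORG"), ("Chennai", "LOC")],
)
print(row)
```
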
- Clone the repo
- Navigate to `notebooks/ner_experiments.ipynb`
- Install dependencies from `requirements.txt`
- Run the notebook to test entity extraction with spaCy or GLiNER

- `data/`: CSV files with sentences and annotations
- `notebooks/`: Jupyter notebooks with the demo pipeline
- `scripts/`: Optional scripts for training/evaluation

This is a public reconstruction of internship work using synthetic data and open-source tools. No proprietary data or internal IP is shared here.