Amharic E-Commerce Entity Extraction is a machine learning pipeline that scrapes Amharic Telegram e-commerce posts and fine-tunes a multilingual transformer model to extract key business entities like Product, Price, and Location, helping EthioMart become the central hub for Telegram-based digital commerce in Ethiopia.
This project is part of a data annotation and modeling pipeline for Amharic Telegram e-commerce channels. It includes data scraping, preprocessing, manual annotation in CoNLL format, and visualizations.
```
├── .github/                 # GitHub actions and workflows
├── .venv/                   # Python virtual environment
├── data/
│   ├── processed/
│   │   ├── conull.csv                          # Final labeled data in CoNLL table format
│   │   ├── telegram_scraped_data_cleaned.csv   # Cleaned Telegram messages
│   │   └── top_30_messages_per_channel.csv     # Top 30 messages per channel for annotation
│   └── raw/
│       ├── images/                             # Downloaded product images
│       └── telegram_scraped_data.csv           # Raw scraped Telegram messages
│
├── models/                  # Folder for storing fine-tuned NER models
│
├── notebook/
│   ├── task-1/
│   │   ├── normalization_and_tokenization.ipynb  # Preprocessing pipeline
│   │   ├── scrapper_session.session              # Telethon session file
│   │   └── scrapping.ipynb                       # Telegram scraping script
│   └── task-2/
│       ├── coNull.ipynb                # CoNLL labeling and analysis
│       └── conll_ready_tokenized.txt   # Tokenized text for manual labeling
│
├── src/
│   ├── config.py            # Channel list, phone, and output paths
│   ├── pre_processing.py    # Amharic text cleaning and normalization
│   ├── scrapper.py          # Telegram scraping with Telethon
│   ├── coNLL.py             # Exporting CoNLL-formatted files and label analysis
│   └── visualization.py     # Word counts, channel stats, font-safe Amharic plots
│
├── requirements.txt         # Python dependencies
├── .gitignore               # Files to ignore by Git
└── README.md                # This file
```
```bash
# Clone the repo
$ git clone https://github.com/sumeyaaaa/-Amharic-E-commerce-Data-Extractor.git
$ cd -Amharic-E-commerce-Data-Extractor

# Create virtual environment
$ python -m venv .venv
$ .venv\Scripts\activate      # On Windows
$ source .venv/bin/activate   # On macOS/Linux

# Install dependencies
$ pip install -r requirements.txt
```

- Scrape Amharic e-commerce content using Telethon.
- Normalize and clean the Amharic text.
- Select the top 30 messages per channel.
- Export cleaned tokens to a `.txt` file ready for manual labeling.
- Load the manually labeled CoNLL table and compute label coverage.
- Top N most common words
- Bar chart of message counts per channel
- Custom font support for Amharic text using `Abyssinica SIL`
Use pretrained models:

- `xlm-roberta-base`
- `bert-tiny-amharic`
- `afroxlmr`

Then:

- Tokenize and align labels
- Train using Hugging Face's `Trainer` API
- Evaluate multiple models (XLM-R, DistilBERT, mBERT)
- Compare F1-score, training speed, and token-alignment issues
- Select the best model for production
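The "tokenize and align labels" step follows the standard Hugging Face token-classification recipe: subword pieces beyond the first token of a word get the ignore index so the loss skips them. The repo's exact implementation is in the notebooks; this is a tokenizer-agnostic sketch operating on the `word_ids()` list a fast tokenizer returns (the function name `align_labels` is mine):

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Map word-level labels onto subword tokens.

    word_ids:    output of tokenizer(...).word_ids() -- one entry per
                 subword token, None for special tokens like [CLS]/[SEP].
    word_labels: one integer label per original word.
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            # Special tokens carry no label
            aligned.append(ignore_index)
        elif wid != previous:
            # First subword of a word gets the word's label
            aligned.append(word_labels[wid])
        else:
            # Continuation subwords are masked out of the loss
            aligned.append(ignore_index)
        previous = wid
    return aligned
```

A common variant instead assigns the `I-` label to continuation subwords; masking with `-100` is the simpler and more frequent choice.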
- Use SHAP to analyze token-level contributions
- Use LIME for local explanations on misclassified entities
- Identify weaknesses in how the model handles ambiguous or nested entities
Compute per-vendor metrics:
- Posts per week (activity)
- Average views per post (engagement)
- Average product price (business profile)

```python
lending_score = (avg_views * 0.5) + (posts_per_week * 0.5)
```

Present results in a comparative table.
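The scoring formula above can be wrapped in a small helper for ranking, assuming per-vendor aggregates have already been computed (the `rank_vendors` helper and its input shape are illustrative, not the repo's API):

```python
def lending_score(avg_views, posts_per_week, w_views=0.5, w_posts=0.5):
    """Weighted vendor score using the README's 50/50 split."""
    return avg_views * w_views + posts_per_week * w_posts

def rank_vendors(vendors):
    """Sort vendors (dicts with 'name', 'avg_views', 'posts_per_week')
    from highest to lowest lending score."""
    return sorted(
        vendors,
        key=lambda v: lending_score(v["avg_views"], v["posts_per_week"]),
        reverse=True,
    )
```

Note that `avg_views` and `posts_per_week` live on very different scales, so in practice you may want to normalize each metric before applying the weights.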
By completing this project, you will:
- Build a full-stack NLP pipeline from raw data collection to model interpretation
- Adapt LLMs (like XLM-R) to low-resource languages (Amharic)
- Use SHAP and LIME for trustworthy model deployment
- Design analytics tools for FinTech and e-commerce decision-making
```
transformers
datasets
pandas
numpy
telethon
scikit-learn
shap
lime
matplotlib
```

Install them via:

```bash
pip install -r requirements.txt
```

References:

- Getting Started with Hugging Face NER
- SHAP Documentation
- LIME GitHub Repo
- Amharic NER Dataset
- ✅ GitHub repo with all scripts and models
- ✅ PDF report with:
  - Model results
  - Vendor scorecard
  - Interpretation summary
Entity label scheme (BIO format):

- `O`: outside any entity
- `B-PRODUCT`, `I-PRODUCT`
- `B-AUDIENCE`
- `B-BRAND`, `I-BRAND`
- `B-COMPONENT`, `I-COMPONENT`
- `B-TASK`, `I-TASK`
- `B-CONTACT_INFO`
- `B-PRICE`, `I-PRICE`
- `B-LOC`, `I-LOC`
- `B-DATE`, `I-DATE`
- `B-FEATURE`, `I-FEATURE`
- `B-ATTRIBUTE`, `I-ATTRIBUTE`
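Labeled data in this scheme is typically stored one `token<TAB>label` pair per line, with a blank line between messages. A minimal parser for that layout (function name `read_conll` is mine; the repo's loader lives in `src/coNLL.py`):

```python
def read_conll(lines):
    """Parse 'token<TAB>label' lines into sentences.

    A blank line marks a sentence boundary. Returns a list of
    sentences, each a list of (token, label) tuples.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # Blank line: close the current sentence, if any
            if current:
                sentences.append(current)
                current = []
            continue
        token, label = line.split("\t")
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences
```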
After cleaning:
- Total tokens = N
- Labeled tokens (not 'O') = M
- Coverage = (M / N) * 100%
(To be updated after each labeling batch)
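The coverage formula above is a one-liner over the label column; a sketch (function name `label_coverage` is mine):

```python
def label_coverage(labels):
    """Return (labeled, total, percent), where 'labeled' counts
    every tag that is not the 'O' (outside-entity) tag."""
    total = len(labels)
    labeled = sum(1 for lab in labels if lab != "O")
    pct = 100.0 * labeled / total if total else 0.0
    return labeled, total, pct
```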
- Complete model training and evaluation.
- Integrate prediction pipeline into EthioMartβs backend.
- Build a scoring engine to rank vendors based on posting frequency, product diversity, and customer engagement.
Configuration: phone number, channel usernames, and file paths.
- Clean Amharic text (remove punctuations, links, emojis).
- Normalize characters for consistent tokenization.
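The cleaning and normalization steps live in `src/pre_processing.py`, which is not shown here; a minimal sketch of what they might look like. The regex patterns and the homophone map are illustrative assumptions (the ሐ/ኀ → ሀ, ሠ → ሰ, ፀ → ጸ folds are a common Amharic normalization, but the repo's exact mapping may differ):

```python
import re

# Strip links and a broad range of emoji / pictographs (assumed patterns)
URL_RE = re.compile(r"https?://\S+|t\.me/\S+")
EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Illustrative subset of Amharic homophone normalizations
NORMALIZE_MAP = str.maketrans({"ሐ": "ሀ", "ኀ": "ሀ", "ሠ": "ሰ", "ፀ": "ጸ"})

def clean_amharic(text):
    """Remove links and emoji, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def normalize_amharic(text):
    """Fold homophone character variants for consistent tokenization."""
    return text.translate(NORMALIZE_MAP)
```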
- Login and fetch messages using Telethon.
- Save raw messages with metadata.
- Tokenizes messages and exports token-per-line `.txt`.
- Loads labeled data and analyzes how many tokens are labeled.
- `plot_channel_distribution()` for message counts.
- `plot_top_words()` to show frequent words in Amharic (with font).
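The token-per-line export mentioned above can be sketched in a few lines, assuming whitespace tokenization after cleaning (function name `export_tokens` is mine; the repo's version is in the notebooks and `src/coNLL.py`):

```python
def export_tokens(messages, path):
    """Write one token per line, with a blank line between messages,
    in the layout used for manual CoNLL labeling."""
    with open(path, "w", encoding="utf-8") as f:
        for msg in messages:
            for token in msg.split():
                f.write(token + "\n")
            f.write("\n")  # message boundary
```

The annotator then appends a tab-separated label to each token line.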
To render Amharic glyphs in plots:

- Download Abyssinica SIL
- Place `AbyssinicaSIL-Regular.ttf` in a `fonts/` directory
- Use the font in visualization:

```python
plot_top_words(
    df,
    text_column="text",
    top_n=20,
    title="Top Words",
    font_path="fonts/AbyssinicaSIL-Regular.ttf",
)
```

- Train a Named Entity Recognition (NER) model on labeled CoNLL data
- Expand to multi-platform Amharic datasets
- Improve normalization for OCR text
Developed by @sumeyaaaa
Note: This project is part of a 10 Academy Week 4 challenge.
- Python 3.8+
- pandas, regex, telethon
- Jupyter Notebook
```bash
jupyter notebook notebook/task-1/scrapping.ipynb
```