sumeyaaaa/-Amharic-E-commerce-Data-Extractor

-Amharic-E-commerce-Data-Extractor

Amharic E-Commerce Entity Extraction is a machine learning pipeline that scrapes Amharic Telegram e-commerce posts and fine-tunes a multilingual transformer model to extract key business entities like Product, Price, and Location, helping EthioMart become the central hub for Telegram-based digital commerce in Ethiopia.

This project is part of a data annotation and modeling pipeline for Amharic Telegram e-commerce channels. It includes data scraping, preprocessing, manual annotation in CoNLL format, and visualizations.

📁 Directory Structure of AMHARIC-E-COMMERCE-DATA-EXTRACTOR

├── .github/                             # GitHub actions and workflows
├── .venv/                               # Python virtual environment
├── data/
│   ├── processed/
│   │   ├── conull.csv                   # Final labeled data in CoNLL table format
│   │   ├── telegram_scraped_data_cleaned.csv  # Cleaned Telegram messages
│   │   └── top_30_messages_per_channel.csv    # Top 30 messages per channel for annotation
│   ├── raw/
│   │   ├── images/                      # Downloaded product images
│   │   └── telegram_scraped_data.csv    # Raw scraped Telegram messages
│
├── models/                              # Folder for storing fine-tuned NER models
│
├── notebook/
│   ├── task-1/
│   │   ├── normalization_and_tokenization.ipynb # Preprocessing pipeline
│   │   ├── scrapper_session.session              # Telethon session file
│   │   └── scrapping.ipynb                       # Telegram scraping script
│   ├── task-2/
│   │   ├── coNull.ipynb                          # CoNLL labeling and analysis
│   │   └── conll_ready_tokenized.txt             # Tokenized text for manual labeling
│
├── src/
│   ├── config.py                    # Channel list, phone, and output paths
│   ├── pre_processing.py            # Amharic text cleaning and normalization
│   ├── scrapper.py                  # Telegram scraping with Telethon
│   ├── coNLL.py                     # Exporting CoNLL formatted files and label analysis
│   └── visualization.py             # Word counts, channel stats, font-safe Amharic plots
│
├── requirements.txt                 # Python dependencies
├── .gitignore                       # Files to ignore by Git
└── README.md                        # This file

🔨 Setup & Installation

# Clone the repo
$ git clone https://github.com/sumeyaaaa/-Amharic-E-commerce-Data-Extractor.git
$ cd ./-Amharic-E-commerce-Data-Extractor   # "./" keeps the shell from reading the leading "-" as an option

# Create virtual environment
$ python -m venv .venv
$ .venv\Scripts\activate      # On Windows
$ source .venv/bin/activate   # On macOS/Linux

# Install dependencies
$ pip install -r requirements.txt

📌 Key Tasks

✅ Task 1: Scraping & Preprocessing

  • Scrape Amharic e-commerce content using Telethon.
  • Normalize and clean the Amharic text.
  • Select top 30 messages per channel.
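
The selection step can be sketched in plain Python. The source does not say how "top" is ranked, so ranking by view count is an assumption here, as are the `channel` and `views` field names and the helper name `top_messages_per_channel`.

```python
import heapq
from collections import defaultdict


def top_messages_per_channel(rows, n=30, key="views"):
    """Return the n highest-ranked rows for each channel.

    `rows` is an iterable of dicts. The "channel" and "views" field names
    are assumptions, not the project's confirmed schema.
    """
    by_channel = defaultdict(list)
    for row in rows:
        by_channel[row["channel"]].append(row)
    # heapq.nlargest returns rows sorted from highest to lowest `key`.
    return {
        channel: heapq.nlargest(n, msgs, key=lambda r: r[key])
        for channel, msgs in by_channel.items()
    }


rows = [
    {"channel": "@shop_a", "text": "msg 1", "views": 120},
    {"channel": "@shop_a", "text": "msg 2", "views": 300},
    {"channel": "@shop_b", "text": "msg 3", "views": 50},
]
top = top_messages_per_channel(rows, n=1)
```

With `n=1`, each channel keeps only its most-viewed message; the project setting corresponds to `n=30`.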

✅ Task 2: CoNLL Annotation Prep

  • Export cleaned tokens in a .txt file ready for manual labeling.
  • Load manually labeled CoNLL table and compute label coverage.
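
As a rough sketch of the export step, the function below writes one token per line with a blank line between messages. Pre-filling every token with the placeholder label `O` and splitting on whitespace are assumptions; the helper name `to_conll` is hypothetical and is not taken from `src/coNLL.py`.

```python
def to_conll(messages):
    """Render whitespace-tokenized messages as token-per-line text.

    Every token gets the placeholder label "O" so an annotator only has to
    overwrite the entity tokens. A blank line separates messages.
    """
    blocks = []
    for message in messages:
        tokens = message.split()
        if tokens:  # skip messages that are empty after cleaning
            blocks.append("\n".join(f"{tok} O" for tok in tokens))
    return "\n\n".join(blocks) + "\n"


conll_text = to_conll(["iphone 13 5000", "bole addis"])
```

Writing `conll_text` to a UTF-8 `.txt` file gives a file that can be labeled by hand and read back line by line.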

📊 Visualizations

  • Top N most common words
  • Bar chart of message counts per channel
  • Custom font support for Amharic text using Abyssinica SIL

Task 3: Fine-Tune NER Model

  • Start from a pretrained model: xlm-roberta-base, bert-tiny-amharic, or afroxlmr.
  • Tokenize and align labels.
  • Train using Hugging Face's Trainer API.
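
Label alignment is needed because a subword tokenizer can split one annotated word into several pieces. The sketch below shows one common convention: label only the first subword of each word and mask the rest with `-100`, the default ignore index of PyTorch's cross-entropy loss. The function name `align_labels` is hypothetical. The `word_ids` argument has the shape returned by a Hugging Face fast tokenizer's `word_ids()` method, but is passed as a plain list here so the example runs without any model download.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Map word-level label ids onto subword tokens.

    `word_ids` has one entry per subword: the index of the source word, or
    None for special tokens such as <s> and </s>.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)          # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(ignore_index)          # continuation subword
        previous = word_id
    return aligned


# Subword layout: <s>, word 0, word 1 (two pieces), word 2, </s>
aligned = align_labels([None, 0, 1, 1, 2, None], [0, 3, 4])
```

An alternative is to copy the word's label onto every subword; which choice the project makes is not stated in the source.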

Task 4: Model Comparison

  • Evaluate multiple models (XLM-R, DistilBERT, mBERT).
  • Compare F1-score, training speed, and token alignment issues.
  • Select the best for production.
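
For the F1 comparison, a common convention in NER is entity-level scoring, where a prediction counts only if its label and exact span both match the gold annotation. Whether the project scores this way is not stated, so the following is a sketch under that assumption. The helper names are hypothetical, and stray `I-` tags with no preceding `B-` are simply ignored.

```python
def extract_spans(tags):
    """Return the set of (label, start, end) entity spans in a BIO sequence."""
    spans = set()
    start = label = None
    # The trailing "O" is a sentinel that flushes a span ending at the last tag.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and label == tag[2:]
        if label is not None and not inside:
            spans.add((label, start, i))
            start = label = None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def span_f1(gold, pred):
    """Entity-level F1 with exact span matching."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-PRODUCT", "I-PRODUCT", "O", "B-PRICE", "I-PRICE"]
pred = ["B-PRODUCT", "I-PRODUCT", "O", "B-PRICE", "O"]
score = span_f1(gold, pred)
```

Here the product span matches exactly but the predicted price span is one token short, so precision and recall are both 0.5.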

Task 5: Interpretability with SHAP & LIME

  • Use SHAP to analyze token-level contributions.
  • Use LIME for local explanations of misclassified entities.
  • Identify weaknesses in how the model handles ambiguous or nested entities.

Task 6: FinTech Vendor Scorecard

Compute per-vendor metrics:

  • 🕒 Posts per week (activity)
  • 👁️ Average views per post (engagement)
  • 💰 Average product price (business profile)

Combine into a custom Lending Score:

lending_score = (avg_views * 0.5) + (posts_per_week * 0.5)

Present results in a comparative table.
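
A minimal sketch of the per-vendor metrics and the Lending Score formula above. The post fields (`date`, `views`, `price`), the function name `vendor_scorecard`, and the way weeks are counted (span between first and last post, floored at one week) are all assumptions; only the weighting formula comes from this README.

```python
from datetime import datetime
from statistics import mean


def vendor_scorecard(posts):
    """Compute scorecard metrics for one vendor.

    `posts` is a list of dicts with hypothetical keys "date" (datetime),
    "views" (int) and "price" (number, or None if no price was extracted).
    """
    dates = [p["date"] for p in posts]
    # Assumption: activity is measured over the span between the first and
    # last post, floored at one week to avoid dividing by zero.
    weeks = max((max(dates) - min(dates)).days / 7, 1.0)
    posts_per_week = len(posts) / weeks
    avg_views = mean(p["views"] for p in posts)
    prices = [p["price"] for p in posts if p["price"] is not None]
    avg_price = mean(prices) if prices else None
    # Weighting taken from the README formula.
    lending_score = (avg_views * 0.5) + (posts_per_week * 0.5)
    return {
        "posts_per_week": posts_per_week,
        "avg_views": avg_views,
        "avg_price": avg_price,
        "lending_score": lending_score,
    }


posts = [
    {"date": datetime(2024, 1, 1), "views": 100, "price": 500},
    {"date": datetime(2024, 1, 8), "views": 200, "price": None},
    {"date": datetime(2024, 1, 15), "views": 300, "price": 700},
]
card = vendor_scorecard(posts)
```

Average views are typically far larger than posts per week, so the raw formula is dominated by views. Scaling both metrics to a common range before weighting is one option if activity should carry more influence.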

Learning Outcomes

By completing this project, you will:

  • Build a full-stack NLP pipeline from raw data collection to model interpretation
  • Adapt LLMs (like XLM-R) to low-resource languages (Amharic)
  • Use SHAP and LIME for trustworthy model deployment
  • Design analytics tools for FinTech and e-commerce decision-making

Dependencies

transformers
datasets
pandas
numpy
telethon
scikit-learn
shap
lime
matplotlib

Install them via:

$ pip install -r requirements.txt

📎 References

  • Getting Started with Hugging Face NER
  • SHAP Documentation
  • LIME GitHub Repo
  • Amharic NER Dataset

Final Deliverables

✅ GitHub repo with all scripts and models

✅ PDF Report:

  • Methodology
  • Model results
  • Vendor scorecard
  • Interpretation summary


📌 Labels Used for Annotation

  • O: Outside any entity
  • B-PRODUCT, I-PRODUCT
  • B-AUDIENCE
  • B-BRAND, I-BRAND
  • B-COMPONENT, I-COMPONENT
  • B-TASK, I-TASK
  • B-CONTACT_INFO
  • B-PRICE, I-PRICE
  • B-LOC, I-LOC
  • B-DATE, I-DATE
  • B-FEATURE, I-FEATURE
  • B-ATTRIBUTE, I-ATTRIBUTE

📊 Labeling Stats

After cleaning:

  • Total tokens = N
  • Labeled tokens (not 'O') = M
  • Coverage = (M / N) * 100%

(To be updated after each labeling batch)
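
The coverage figure can be computed directly from the label column. This sketch follows the definitions above (N total tokens, M tokens whose label is not `O`); the function name `label_coverage` is hypothetical.

```python
def label_coverage(labels):
    """Return (total tokens N, labeled tokens M, coverage in percent)."""
    total = len(labels)
    labeled = sum(1 for label in labels if label != "O")
    coverage = (labeled / total) * 100 if total else 0.0
    return total, labeled, coverage


total, labeled, coverage = label_coverage(["B-PRODUCT", "I-PRODUCT", "O", "O"])
```

For this four-token example, two tokens carry an entity label, so coverage is 50%.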


🚀 Future Plans

  • Complete model training and evaluation.
  • Integrate prediction pipeline into EthioMart's backend.
  • Build a scoring engine to rank vendors based on posting frequency, product diversity, and customer engagement.

📦 Module Overview

src/config.py

Configuration: phone number, channel usernames, and file paths.

src/pre_processing.py

  • Clean Amharic text (remove punctuation, links, emojis).
  • Normalize characters for consistent tokenization.
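
The two steps above can be sketched with the standard library. This is not the code in `src/pre_processing.py`: the function names, the URL pattern, and the homophone table are assumptions. The table folds a few Ethiopic letter series that are pronounced alike in Amharic (for example the ሠ series onto the ሰ series), which is one common normalization choice; the project's actual rules are not stated in this README.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+|t\.me/\S+")
SPACE_RE = re.compile(r"\s+")

# (source series start, target series start) code points; each series is
# folded across its seven basic vowel orders. Illustrative assumption only.
_SERIES = [(0x1210, 0x1200), (0x1280, 0x1200), (0x1220, 0x1230),
           (0x12D0, 0x12A0), (0x1340, 0x1338)]
HOMOPHONES = {src + i: dst + i for src, dst in _SERIES for i in range(7)}


def clean_text(text):
    """Remove links, punctuation, symbols (including emoji) and control characters."""
    text = URL_RE.sub(" ", text)
    kept = []
    for ch in text:
        # P* = punctuation (includes Ethiopic marks), S* = symbols (most
        # emoji), C* = control/format characters; U+FE0F is the emoji
        # variation selector.
        if unicodedata.category(ch)[0] in "PSC" or ch == "\ufe0f":
            kept.append(" ")
        else:
            kept.append(ch)
    return SPACE_RE.sub(" ", "".join(kept)).strip()


def normalize_homophones(text):
    """Fold homophone Ethiopic letters onto one canonical series."""
    return text.translate(HOMOPHONES)


cleaned = normalize_homophones(clean_text("price: 500 https://t.me/shop"))
```

Note that this also strips `@` and `+`, so Telegram usernames and phone numbers lose their prefixes. If CONTACT_INFO entities matter, those characters would need to be exempted.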

src/scrapper.py

  • Login and fetch messages using Telethon.
  • Save raw messages with metadata.

src/coNLL.py

  • Tokenizes messages and exports token-per-line .txt.
  • Loads labeled data and analyzes how many tokens are labeled.

src/visualization.py

  • plot_channel_distribution() for message counts.
  • plot_top_words() to show frequent words in Amharic (with font).
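
The counting behind a top-words chart can be sketched as follows. How `plot_top_words()` counts internally is not shown in this README, so the helper `top_words` and its whitespace tokenization are assumptions.

```python
from collections import Counter


def top_words(texts, top_n=20):
    """Return the top_n most frequent whitespace-separated tokens."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    # most_common sorts by descending count.
    return counts.most_common(top_n)


frequent = top_words(["a b a", "b a c"], top_n=2)
```

The resulting `(word, count)` pairs can be passed to any bar-chart routine.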

📊 Amharic Font Setup for Visualization

To render Amharic glyphs in plots:

  1. Download Abyssinica SIL
  2. Place AbyssinicaSIL-Regular.ttf in a fonts/ directory
  3. Use the font in visualization:
plot_top_words(
  df,
  text_column="text",
  top_n=20,
  title="ከፍተኛ የተደገመ ቃላት",  # "Most repeated words"
  font_path="fonts/AbyssinicaSIL-Regular.ttf"
)

🧠 Future Improvements

  • Train a Named Entity Recognition (NER) model on labeled CoNLL data
  • Expand to multi-platform Amharic datasets
  • Improve normalization for OCR text

📩 Contact

Developed by @sumeyaaaa


Note: This project is part of a 10 Academy Week 4 challenge.

🧪 Setup Instructions

Requirements

  • Python 3.8+
  • pandas, regex, telethon
  • Jupyter Notebook

Run scraping:

jupyter notebook notebook/task-1/scrapping.ipynb
