A machine learning system for classifying English text according to CEFR proficiency levels (A1-C2) using BERT-based classification.
- Pure Classification Approach: No regression or contrastive learning
- Class-Weighted Loss: Handles imbalanced datasets effectively
- Mean Pooling: Better sentence representation than [CLS] token
- F1 Score Evaluation: More suitable for imbalanced classification
- REST API: Easy integration with web applications

Install the dependencies:

    pip install -r requirements.txt

Train the model:

    python train.py

This will:
- Load data from dataset/dataset.csv
- Train the BERT-based classifier
- Save the best model as cefr_bert_model.pth
- Show evaluation metrics and sample predictions

Start the API server:

    python server.py

The server will run on http://localhost:5050
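
Requests to the running server can also be sent from Python. The following is a minimal sketch using only the standard library; it assumes the `/predict` endpoint accepts a JSON body with a `sentences` list, as in the curl example in this README. The function names are hypothetical, and because the response schema is not documented here, the parsed JSON is returned unchanged.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5050"  # default address stated in this README


def build_predict_request(sentences, base_url=BASE_URL):
    """Build a POST request for /predict carrying {"sentences": [...]}."""
    payload = json.dumps({"sentences": sentences}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(sentences, base_url=BASE_URL, timeout=10):
    """Send the sentences to the server and return the decoded JSON response."""
    request = build_predict_request(sentences, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # The response format is an unknown here, so no fields are assumed.
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `predict(["I like cats."])` returns whatever JSON the endpoint produces; inspect it before relying on particular keys.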

Health check:

    curl http://localhost:5050/health

Predict CEFR levels for sentences:

    curl -X POST "http://localhost:5050/predict" \
      -H "Content-Type: application/json" \
      -d '{"sentences": ["I like cats.", "The implementation requires careful consideration."]}'

Level information:

    curl http://localhost:5050/levels

CEFR levels:
- A1 (Beginner): Basic everyday expressions and very simple phrases
- A2 (Elementary): Simple sentences on familiar topics and routine matters
- B1 (Intermediate): Clear communication on familiar matters and personal interests
- B2 (Upper Intermediate): Complex texts and abstract topics with good fluency
- C1 (Advanced): Flexible and effective language use for social, academic and professional purposes
- C2 (Proficient): Very high level with precise, nuanced expression and full command of language
✓ Pure classification (no regression)
✓ Class-weighted loss for imbalanced data
✓ Mean pooling for better sentence representation
✓ F1 score evaluation (better for imbalanced classes)
✓ Gradient clipping for training stability
✓ Based on CEFR-SP research methodology
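
To see why F1 can be more informative than accuracy on imbalanced classes, here is a small pure-Python sketch of macro-averaged F1. The README does not state which averaging the training script uses, so macro averaging is an assumption, and the labels below are made up for illustration.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    scores = []
    for cls in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# A classifier that always predicts the majority level.
y_true = ["A1", "A1", "A1", "A1", "B2"]
y_pred = ["A1", "A1", "A1", "A1", "A1"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, classes=["A1", "B2"])
print(f"accuracy={accuracy:.3f}  macro_f1={score:.3f}")
```

Here accuracy looks strong even though the minority level is never predicted, while macro F1 drops because every class counts equally. In practice, scikit-learn's `f1_score` with `average="macro"` computes the same quantity.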

Files:
- model.py: Main CEFR classifier implementation
- train.py: Training script
- server.py: FastAPI REST API server
- dataset/dataset.csv: Training data (required)

Requirements:
- Python 3.8+
- PyTorch
- Transformers
- FastAPI
- scikit-learn
- pandas
- numpy