Fine-tune a DeBERTa model for CEFR level classification using Modal cloud GPUs.

Published Model: `robg/speako-cefr-deberta`
```bash
# 1. Install Modal CLI
uv pip install modal
modal token new

# 2. Start Training
modal run ml/train_cefr_deberta.py
```

Python linting uses ruff via uvx (no installation required):

```bash
uvx ruff check ml/   # Lint
uvx ruff format ml/  # Format
```

Ruff runs automatically in the pre-commit hook.
| Property | Value |
|---|---|
| Base Model | microsoft/deberta-v3-small |
| HuggingFace ID | robg/speako-cefr-deberta |
| Parameters | ~44M |
| ONNX Size | ~90MB (quantized) |
| Labels | A1, A2, B1, B2, C1, C2 |
| Max Length | 256 tokens |
| Task | text-classification |
```mermaid
flowchart LR
    A[Text Input] --> B[DeBERTa Encoder]
    B --> C[Classification Head]
    C --> D[CEFR Level + Confidence]
```
The model uses DeBERTa-v3-small as the base encoder with a 6-class classification head for CEFR levels.
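As an illustrative sketch (not the project's actual code), the classification head's six raw logits are turned into a level and confidence score with a softmax:

```python
import math

CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

def decode_prediction(logits):
    """Softmax the six raw head outputs and return (level, confidence)."""
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return CEFR_LEVELS[best], probs[best]

# Example logits (made up for illustration)
level, conf = decode_prediction([-1.2, 0.3, 1.1, 3.0, 0.8, -0.5])
```

The same decoding happens client-side after ONNX inference, producing outputs like `B2 (87.3%)`.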
> [!IMPORTANT]
> The training uses only openly-licensed data to ensure the published model complies with all licenses.
| Data Source | License | Usage |
|---|---|---|
| UniversalCEFR | CC-BY-NC-4.0 | Training |
| S&I Corpus 2025 | Restrictive | Validation only |
The CC-BY-NC-4.0 license explicitly allows creating derivative models for non-commercial use.
UniversalCEFR is a large-scale multilingual dataset with 500k+ CEFR-labeled texts.
- Paper: arXiv:2506.01419
- License: CC-BY-NC-4.0 (allows derivative models)
- Languages: 13 (filtered to English only)
- Levels: A1–C2
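The English-only filtering and label mapping can be sketched as below; the field names `lang`, `text`, and `label` are assumptions about the dataset schema, not taken from the UniversalCEFR release:

```python
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
LABEL2ID = {level: i for i, level in enumerate(CEFR_LEVELS)}

def filter_english(records):
    """Keep English rows only and map CEFR labels to integer ids.
    Field names here ('lang', 'text', 'label') are illustrative."""
    return [
        {"text": r["text"], "label": LABEL2ID[r["label"]]}
        for r in records
        if r.get("lang") == "en" and r.get("label") in LABEL2ID
    ]

rows = [
    {"lang": "en", "text": "I like coffee.", "label": "A1"},
    {"lang": "de", "text": "Ich mag Kaffee.", "label": "A1"},
    {"lang": "en", "text": "Notwithstanding the evidence...", "label": "C1"},
]
filtered = filter_english(rows)
```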
> [!TIP]
> Long texts are automatically chunked (5–50 words) to match speech transcript lengths and prevent the model from learning "long text = high CEFR".
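A minimal sketch of that chunking step, assuming random chunk lengths within the 5–50 word range (the actual script's strategy may differ):

```python
import random

def chunk_text(text, min_words=5, max_words=50, seed=0):
    """Split a long text into random-length word chunks so that chunk
    length no longer correlates with the CEFR label."""
    rng = random.Random(seed)
    words = text.split()
    chunks, i = [], 0
    while i < len(words):
        n = rng.randint(min_words, max_words)
        piece = words[i:i + n]
        if len(piece) >= min_words:  # drop a too-short trailing fragment
            chunks.append(" ".join(piece))
        i += n
    return chunks
```

Each chunk inherits the original text's CEFR label, so one long essay yields several short, transcript-length training samples.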
The S&I Corpus is used only for evaluation, not training.
| File | Purpose | Samples |
|---|---|---|
| `eval-asr.stm` | Held-out test | ~9,200 |
> [!CAUTION]
> The S&I license prohibits releasing models derived from the corpus. Since validation data doesn't influence model weights, using it for evaluation is permitted under the non-commercial research clause.
The training script applies noise augmentation to simulate ASR transcription errors:
| Augmentation | Probability | Example |
|---|---|---|
| Character swap | 5% | speaking → spekaing |
| Character deletion | 3% | speaking → speaing |
| Word deletion (short words) | 3% | I am going → am going |
This helps the model generalize to imperfect Whisper transcriptions.
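The augmentation from the table above can be sketched as follows; the exact probabilities match the table, but the function shape is illustrative rather than copied from `cefr_utils.py`:

```python
import random

def augment(text, p_swap=0.05, p_char_del=0.03, p_word_del=0.03, rng=None):
    """Apply ASR-style noise: adjacent-character swaps, character
    deletions, and deletion of short words, each with a small probability."""
    rng = rng or random.Random()
    out = []
    for word in text.split():
        if len(word) <= 3 and rng.random() < p_word_del:
            continue  # drop a short word entirely
        chars = list(word)
        if len(chars) > 3 and rng.random() < p_swap:
            i = rng.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]  # swap neighbors
        if len(chars) > 3 and rng.random() < p_char_del:
            del chars[rng.randrange(len(chars))]  # delete one character
        out.append("".join(chars))
    return " ".join(out)
```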
| Parameter | Value |
|---|---|
| GPU | NVIDIA T4 |
| Batch Size | 64 |
| Learning Rate | 2e-5 |
| Epochs | 4 (with early stopping) |
| Max Samples | 60,000 |
| Mixed Precision | FP16 |
| Early Stopping | Patience 2 epochs |
| Metric | Weighted F1 |
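A hedged sketch of how this configuration maps onto Hugging Face `Trainer` arguments; argument names follow recent `transformers` releases (older versions use `evaluation_strategy`), and `output_dir` is a placeholder, not the path from the actual script:

```python
from transformers import EarlyStoppingCallback, TrainingArguments

args = TrainingArguments(
    output_dir="cefr-deberta",       # placeholder path
    per_device_train_batch_size=64,
    learning_rate=2e-5,
    num_train_epochs=4,
    fp16=True,                       # mixed precision on the T4
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",      # weighted F1 from compute_metrics
)
callbacks = [EarlyStoppingCallback(early_stopping_patience=2)]
```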
| File | Description |
|---|---|
| `train_cefr_deberta.py` | Main training script (Modal) |
| `cefr_utils.py` | Data parsing and augmentation utilities |
The model is automatically loaded in the browser when users start recording or run validation:
```text
Loading CEFR Classifier with WEBGPU...
CEFR Classifier loaded successfully with WEBGPU!
[MetricsCalculator] ML CEFR prediction: B2 (87.3%)
```
- WebGPU is preferred for fast inference
- WASM fallback if WebGPU unavailable or fails
- Model is cached in browser for subsequent visits
- Check browser console for errors
- Ensure internet connection (model downloads from HuggingFace)
- Verify WebGPU support: check `navigator.gpu` in the console
- Model will fall back to WASM if WebGPU fails
```bash
# Check logs
modal app logs cefr-deberta-training

# Verify S&I validation data exists (optional)
ls test-data/reference-materials/stms/
```

| Error | Solution |
|---|---|
| `ModuleNotFoundError: cefr_utils` | Ensure `ml/cefr_utils.py` exists |
| `STM file not found` | Symlink `test-data` to the S&I corpus directory (optional for validation) |
| `CUDA out of memory` | Reduce the `batch_size` parameter |
| `Failed to load UniversalCEFR` | Check internet connection (dataset downloads from HuggingFace) |
- DeBERTa Paper – Decoding-enhanced BERT with Disentangled Attention
- Modal Documentation – Serverless GPU compute
- CEFR Framework – Common European Framework of Reference
- UniversalCEFR Dataset – Training data (CC-BY-NC-4.0)
- Speak & Improve Corpus 2025 – Validation data