
CEFR Classifier Training

Fine-tune a DeBERTa model for CEFR level classification using Modal cloud GPUs.

Published Model: robg/speako-cefr-deberta

Quick Start

# 1. Install Modal CLI
uv pip install modal
modal token new

# 2. Start Training
modal run ml/train_cefr_deberta.py

Development

Python linting uses ruff via uvx (no installation required):

uvx ruff check ml/   # Lint
uvx ruff format ml/  # Format

Ruff runs automatically in the pre-commit hook.

Model Details

| Property | Value |
| --- | --- |
| Base Model | microsoft/deberta-v3-small |
| HuggingFace ID | robg/speako-cefr-deberta |
| Parameters | ~44M |
| ONNX Size | ~90MB (quantized) |
| Labels | A1, A2, B1, B2, C1, C2 |
| Max Length | 256 tokens |
| Task | text-classification |

Architecture

flowchart LR
    A[Text Input] --> B[DeBERTa Encoder]
    B --> C[Classification Head]
    C --> D[CEFR Level + Confidence]

The model uses DeBERTa-v3-small as the base encoder with a 6-class classification head for CEFR levels.
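The head's output can be turned into the "CEFR Level + Confidence" shown in the diagram by taking a softmax over the six logits. A minimal sketch, assuming the label order matches the table above (the real mapping lives in the published model's config):

```python
import math

# Assumed label order; verify against the model's id2label config.
LABELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

def predict_level(logits: list[float]) -> tuple[str, float]:
    """Map the 6 raw logits from the classification head to a
    CEFR level plus a softmax confidence score."""
    shifted = [x - max(logits) for x in logits]  # numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = probs.index(max(probs))
    return LABELS[best], probs[best]

# Illustrative logits, not real model output:
level, confidence = predict_level([-1.2, 0.3, 1.1, 2.4, 0.2, -0.8])
```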

Training Data

License Compliance

Important

The training uses only openly-licensed data to ensure the published model complies with all licenses.

| Data Source | License | Usage |
| --- | --- | --- |
| UniversalCEFR | CC-BY-NC-4.0 | Training |
| S&I Corpus 2025 | Restrictive | Validation only |

The CC-BY-NC-4.0 license explicitly allows creating derivative models for non-commercial use.


Training: UniversalCEFR

UniversalCEFR is a large-scale multilingual dataset with 500k+ CEFR-labeled texts.

  • Paper: arXiv:2506.01419
  • License: CC-BY-NC-4.0 (allows derivative models)
  • Languages: 13 (filtered to English only)
  • Levels: A1–C2

Tip

Long texts are automatically chunked (5-50 words) to match speech transcript lengths and prevent the model from learning "long text = high CEFR".
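The chunking described above can be sketched as follows. This is a simplified illustration; the actual logic in train_cefr_deberta.py may split on sentence boundaries or handle short tails differently:

```python
def chunk_text(text: str, min_words: int = 5, max_words: int = 50) -> list[str]:
    """Split a long text into word-count-bounded chunks so the model
    sees transcript-length inputs rather than full documents."""
    words = text.split()
    chunks = [words[i:i + max_words] for i in range(0, len(words), max_words)]
    # Drop fragments shorter than min_words (typically a short tail).
    return [" ".join(c) for c in chunks if len(c) >= min_words]
```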


Validation: Speak & Improve Corpus 2025

The S&I Corpus is used only for evaluation, not training.

| File | Purpose | Samples |
| --- | --- | --- |
| eval-asr.stm | Held-out test | ~9,200 |

Caution

The S&I license prohibits releasing models derived from the corpus. Since validation data doesn't influence model weights, using it for evaluation is permitted under the non-commercial research clause.

Data Augmentation

The training script applies noise augmentation to simulate ASR transcription errors:

| Augmentation | Probability | Example |
| --- | --- | --- |
| Character swap | 5% | speaking → spekaing |
| Character deletion | 3% | speaking → speaing |
| Word deletion (short words) | 3% | I am going → am going |

This helps the model generalize to imperfect Whisper transcriptions.
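A minimal sketch of this kind of noise augmentation, using the probabilities from the table above (the exact sampling logic in cefr_utils.py may differ):

```python
import random

def augment(text: str, rng: random.Random) -> str:
    """Apply ASR-style noise: character swap (5%), character
    deletion (3%), and deletion of short words (3%)."""
    out = []
    for word in text.split():
        # Word deletion: short words only, e.g. dropped fillers.
        if len(word) <= 3 and rng.random() < 0.03:
            continue
        # Character swap: transpose two adjacent characters.
        if len(word) >= 2 and rng.random() < 0.05:
            i = rng.randrange(len(word) - 1)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        # Character deletion: drop one character.
        if len(word) >= 2 and rng.random() < 0.03:
            i = rng.randrange(len(word))
            word = word[:i] + word[i + 1:]
        out.append(word)
    return " ".join(out)
```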

Training Configuration

| Parameter | Value |
| --- | --- |
| GPU | NVIDIA T4 |
| Batch Size | 64 |
| Learning Rate | 2e-5 |
| Epochs | 4 (with early stopping) |
| Max Samples | 60,000 |
| Mixed Precision | FP16 |
| Early Stopping Patience | 2 epochs |
| Metric | Weighted F1 |
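The configuration above, gathered into one place as a plain dict. Field names here are illustrative; the actual script may pass these through transformers.TrainingArguments under different names:

```python
# Hyperparameters from the table above (names are illustrative).
TRAIN_CONFIG = {
    "gpu": "T4",
    "batch_size": 64,
    "learning_rate": 2e-5,
    "num_epochs": 4,
    "max_samples": 60_000,
    "fp16": True,
    "early_stopping_patience": 2,
    "metric_for_best_model": "weighted_f1",
}
```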

Files

| File | Description |
| --- | --- |
| train_cefr_deberta.py | Main training script (Modal) |
| cefr_utils.py | Data parsing and augmentation utilities |

Browser Integration

The model is automatically loaded in the browser when users start recording or run validation:

Loading CEFR Classifier with WEBGPU...
CEFR Classifier loaded successfully with WEBGPU!
[MetricsCalculator] ML CEFR prediction: B2 (87.3%)

Loading Behavior

  1. WebGPU is preferred for fast inference
  2. WASM fallback if WebGPU unavailable or fails
  3. Model is cached in browser for subsequent visits

Troubleshooting

Model not loading in browser?

  • Check browser console for errors
  • Ensure internet connection (model downloads from HuggingFace)
  • Verify WebGPU support: navigator.gpu in console
  • Model will fall back to WASM if WebGPU fails

Training fails on Modal?

# Check logs
modal app logs cefr-deberta-training

# Verify S&I validation data exists (optional)
ls test-data/reference-materials/stms/

Common errors

| Error | Solution |
| --- | --- |
| ModuleNotFoundError: cefr_utils | Ensure ml/cefr_utils.py exists |
| STM file not found | Symlink test-data to the S&I corpus directory (optional; only needed for validation) |
| CUDA out of memory | Reduce the batch_size parameter |
| Failed to load UniversalCEFR | Check internet connection (dataset downloads from HuggingFace) |

References