
CEFR Classifier Training

Fine-tune a DeBERTa model for CEFR level classification using Modal cloud GPUs.

Published Model: robg/speako-cefr-deberta

Quick Start

# 1. Install Modal CLI
uv pip install modal
modal token new

# 2. Start Training
modal run ml/train_cefr_deberta.py

Development

Python linting uses ruff via uvx (no installation required):

uvx ruff check ml/   # Lint
uvx ruff format ml/  # Format

Ruff runs automatically in the pre-commit hook.

Model Details

| Property | Value |
| --- | --- |
| Base Model | microsoft/deberta-v3-small |
| HuggingFace ID | robg/speako-cefr-deberta |
| Parameters | ~44M |
| ONNX Size | ~90MB (quantized) |
| Labels | A1, A2, B1, B2, C1, C2 |
| Max Length | 256 tokens |
| Task | text-classification |

Architecture

flowchart LR
    A[Text Input] --> B[DeBERTa Encoder]
    B --> C[Classification Head]
    C --> D[CEFR Level + Confidence]

The model uses DeBERTa-v3-small as the base encoder with a 6-class classification head for CEFR levels.
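The head's output can be turned into the "CEFR Level + Confidence" shown in the diagram by taking a softmax over the six logits. A minimal sketch, assuming the label order matches the table above (the real mapping lives in the published model's config):

```python
import math

# Assumed label order; verify against the model's id2label config.
LABELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

def predict_level(logits: list[float]) -> tuple[str, float]:
    """Map the 6 raw logits from the classification head to a
    CEFR level plus a softmax confidence score."""
    shifted = [x - max(logits) for x in logits]  # numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = probs.index(max(probs))
    return LABELS[best], probs[best]

# Illustrative logits, not real model output:
level, confidence = predict_level([-1.2, 0.3, 1.1, 2.4, 0.2, -0.8])
```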

Training Data

License Compliance

Important

The training uses only openly-licensed data to ensure the published model complies with all licenses.

| Data Source | License | Usage |
| --- | --- | --- |
| UniversalCEFR | CC-BY-NC-4.0 | Training |
| S&I Corpus 2025 | Restrictive | Validation only |

The CC-BY-NC-4.0 license explicitly allows creating derivative models for non-commercial use.


Training: UniversalCEFR

UniversalCEFR is a large-scale multilingual dataset with 500k+ CEFR-labeled texts.

  • Paper: arXiv:2506.01419
  • License: CC-BY-NC-4.0 (allows derivative models)
  • Languages: 13 (filtered to English only)
  • Levels: A1–C2

Tip

Long texts are automatically chunked (5-50 words) to match speech transcript lengths and prevent the model from learning "long text = high CEFR".
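The chunking described above can be sketched as follows. This is a simplified illustration; the actual logic in train_cefr_deberta.py may split on sentence boundaries or handle short tails differently:

```python
def chunk_text(text: str, min_words: int = 5, max_words: int = 50) -> list[str]:
    """Split a long text into word-count-bounded chunks so the model
    sees transcript-length inputs rather than full documents."""
    words = text.split()
    chunks = [words[i:i + max_words] for i in range(0, len(words), max_words)]
    # Drop fragments shorter than min_words (typically a short tail).
    return [" ".join(c) for c in chunks if len(c) >= min_words]
```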


Validation: Speak & Improve Corpus 2025

The S&I Corpus is used only for evaluation, not training.

| File | Purpose | Samples |
| --- | --- | --- |
| eval-asr.stm | Held-out test | ~9,200 |

Caution

The S&I license prohibits releasing models derived from the corpus. Since validation data doesn't influence model weights, using it for evaluation is permitted under the non-commercial research clause.

Data Augmentation

The training script applies noise augmentation to simulate ASR transcription errors:

| Augmentation | Probability | Example |
| --- | --- | --- |
| Character swap | 5% | speaking → spekaing |
| Character deletion | 3% | speaking → speaing |
| Word deletion (short words) | 3% | I am going → am going |

This helps the model generalize to imperfect Whisper transcriptions.
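A minimal sketch of this kind of noise augmentation, using the probabilities from the table above (the exact sampling logic in cefr_utils.py may differ):

```python
import random

def augment(text: str, rng: random.Random) -> str:
    """Apply ASR-style noise: character swap (5%), character
    deletion (3%), and deletion of short words (3%)."""
    out = []
    for word in text.split():
        # Word deletion: short words only, e.g. dropped fillers.
        if len(word) <= 3 and rng.random() < 0.03:
            continue
        # Character swap: transpose two adjacent characters.
        if len(word) >= 2 and rng.random() < 0.05:
            i = rng.randrange(len(word) - 1)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        # Character deletion: drop one character.
        if len(word) >= 2 and rng.random() < 0.03:
            i = rng.randrange(len(word))
            word = word[:i] + word[i + 1:]
        out.append(word)
    return " ".join(out)
```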

Training Configuration

| Parameter | Value |
| --- | --- |
| GPU | NVIDIA T4 |
| Batch Size | 64 |
| Learning Rate | 2e-5 |
| Epochs | 4 (with early stopping) |
| Max Samples | 60,000 |
| Mixed Precision | FP16 |
| Early Stopping Patience | 2 epochs |
| Metric | Weighted F1 |
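The configuration above, gathered into one place as a plain dict. Field names here are illustrative; the actual script may pass these through transformers.TrainingArguments under different names:

```python
# Hyperparameters from the table above (names are illustrative).
TRAIN_CONFIG = {
    "gpu": "T4",
    "batch_size": 64,
    "learning_rate": 2e-5,
    "num_epochs": 4,
    "max_samples": 60_000,
    "fp16": True,
    "early_stopping_patience": 2,
    "metric_for_best_model": "weighted_f1",
}
```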

Files

| File | Description |
| --- | --- |
| train_cefr_deberta.py | Main training script (Modal) |
| cefr_utils.py | Data parsing and augmentation utilities |

Browser Integration

The model is automatically loaded in the browser when users start recording or run validation:

Loading CEFR Classifier with WEBGPU...
CEFR Classifier loaded successfully with WEBGPU!
[MetricsCalculator] ML CEFR prediction: B2 (87.3%)

Loading Behavior

  1. WebGPU is preferred for fast inference
  2. WASM fallback if WebGPU unavailable or fails
  3. Model is cached in browser for subsequent visits

Troubleshooting

Model not loading in browser?

  • Check browser console for errors
  • Ensure internet connection (model downloads from HuggingFace)
  • Verify WebGPU support: navigator.gpu in console
  • Model will fall back to WASM if WebGPU fails

Training fails on Modal?

# Check logs
modal app logs cefr-deberta-training

# Verify S&I validation data exists (optional)
ls test-data/reference-materials/stms/

Common errors

| Error | Solution |
| --- | --- |
| ModuleNotFoundError: cefr_utils | Ensure ml/cefr_utils.py exists |
| STM file not found | Symlink test-data to the S&I corpus directory (optional; only needed for validation) |
| CUDA out of memory | Reduce the batch_size parameter |
| Failed to load UniversalCEFR | Check internet connection (dataset downloads from HuggingFace) |

References