Generation quality: Test prompts defined (diverse: factual, reasoning… by Sualeh77 · Pull Request #566 · The-School-of-AI/LLM

Sualeh77 · 2026-03-08T21:43:46Z

Description

This PR introduces comprehensive generation quality monitoring to the training pipeline, addressing the four key requirements of the Generation Quality task:

Diverse Test Prompts Defined: Extended GenerationConfig to support a diverse, fixed list of prompts across four distinct categories: factual, reasoning, creative, and code. This allows for tracking qualitative improvements across domains.
Checkpoint Generation: Integrated a generation trigger after every checkpoint save (using temperature=0 / greedy decoding) to ensure reproducible and consistent qualitative evaluation at key training milestones.
Automated Sample Logging: Generation results are now persisted to a dedicated, append-only JSONL file (<output_dir>/generation_samples/run_<run_id>_samples.jsonl) for easy tracking and review without depending on the observability stack. Logs also go to the P12 logger if it is active.
Repetition and Degeneration Checks: Implemented a 4-gram repetition score to automatically detect degenerate or looping output. The system emits degeneration_warning (score > 0.3) and degeneration_critical (score > 0.7) alerts to the observability backend.
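
For reviewers who want a concrete picture of the repetition check and the sample log, here is a minimal sketch. The PR text does not spell out the exact formula, so this assumes one common definition (the fraction of 4-grams that duplicate an earlier 4-gram) over whitespace tokens. The function names `ngram_repetition_score`, `degeneration_level`, and `append_sample` are hypothetical and may not match the code in this PR; only the 0.3 and 0.7 thresholds and the alert names come from the description above.

```python
import json
from pathlib import Path


def ngram_repetition_score(tokens, n=4):
    """Fraction of n-grams that duplicate another n-gram in the sequence.

    Assumed definition: 1 - unique_ngrams / total_ngrams.
    Returns 0.0 when the sequence is too short to form any n-gram.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


def degeneration_level(score, warn=0.3, critical=0.7):
    """Map a score to the alert name from the PR description, or None."""
    if score > critical:
        return "degeneration_critical"
    if score > warn:
        return "degeneration_warning"
    return None


def append_sample(path, record):
    """Append one JSON object per line (JSONL); existing lines are never rewritten."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def build_record(step, category, prompt, completion):
    """Assemble a hypothetical log record; the field names are illustrative only."""
    score = ngram_repetition_score(completion.split())
    return {
        "step": step,
        "category": category,
        "prompt": prompt,
        "completion": completion,
        "repetition_score": score,
        "alert": degeneration_level(score),
    }
```

A caller would pass the result of `build_record(...)` to `append_sample(...)` once per prompt after each checkpoint. Opening the file in append mode is what keeps the log append-only across restarts of the same run.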

Key Implementation Details for Reviewers:

Distributed Safety: To prevent deadlocks under DeepSpeed ZeRO-3 (where parameters are sharded), the _generate method runs the forward pass synchronously across all ranks, but only Rank 0 handles the file I/O and telemetry logging.
Latency Optimization: The prompt suite is batched and left-padded into a single model.generate() call, eliminating sequential inference latency during the training loop.
Double Generation Fix: A generated_this_step flag was added to the training loop to prevent redundant sequential generations when a step interval and checkpoint interval align.
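
The interaction between the double-generation guard and the rank-0 logging rule can be sketched as a small simulation. This is an illustration under stated assumptions, not the training loop from this PR: `run_schedule`, `gen_interval`, and `ckpt_interval` are hypothetical names, and the real forward pass is replaced by a counter. Only the `generated_this_step` flag and the "all ranks generate, rank 0 logs" rule come from the description above.

```python
def run_schedule(total_steps, gen_interval, ckpt_interval, rank=0):
    """Simulate which steps trigger generation and what rank `rank` logs.

    Returns (num_generations, logged_events). Every rank counts the same
    number of generations, because all ranks must enter the forward pass
    together when parameters are sharded; only rank 0 records events.
    """
    num_generations = 0
    logged_events = []

    def generate(step, reason):
        nonlocal num_generations
        num_generations += 1          # stands in for the collective forward pass
        if rank == 0:                 # file I/O and telemetry on rank 0 only
            logged_events.append((step, reason))

    for step in range(1, total_steps + 1):
        generated_this_step = False   # reset at the start of every step

        if gen_interval and step % gen_interval == 0:
            generate(step, "interval")
            generated_this_step = True

        if ckpt_interval and step % ckpt_interval == 0:
            # A checkpoint save would happen here in a real loop.
            if not generated_this_step:
                generate(step, "checkpoint")
                generated_this_step = True

    return num_generations, logged_events
```

With `gen_interval=2` and `ckpt_interval=5`, step 10 satisfies both conditions, and the flag ensures it produces one generation rather than two. Note that the skip decision depends only on the step number, so every rank makes the same choice and no rank waits on a collective call that the others skipped.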

Related Issue: [Link to issue / Task List]

Checklist

I have added tests that prove my fix is effective or that my feature works.
I have added necessary documentation (if applicable).
My code follows the style guidelines, gitflow branching strategy, and naming conventions of this project (see Contribution Guidelines).

…, creative, code)
Generation tested at checkpoints (temperature=0 for reproducibility)
Generation samples logged/saved for review
Repetition and degeneration checked

Generation quality: Test prompts defined (diverse: factual, reasoning…

fcb9590


Sualeh77 requested review from Jayant-Guru-Shrivastava, Shwethaamrutha and yashwantram97 March 8, 2026 21:44

Sualeh77 wants to merge 1 commit into refactor/consolidation from generation_quality
