
Generation quality: Test prompts defined (diverse: factual, reasoning…#566

Open
Sualeh77 wants to merge 1 commit into refactor/consolidation from generation_quality


Sualeh77 commented Mar 8, 2026

Description

This PR introduces comprehensive generation quality monitoring to the training pipeline, addressing the four key requirements of the Generation Quality task:

  1. Diverse Test Prompts Defined: Extended GenerationConfig to hold a fixed list of prompts spanning four categories: factual, reasoning, creative, and code. Because the list is fixed, outputs can be compared across checkpoints and across domains.
  2. Checkpoint Generation: Generation now runs after every checkpoint save using greedy decoding (temperature=0), so samples are reproducible and comparable between training milestones.
  3. Automated Sample Logging: Generation results are appended to a dedicated JSONL file (<output_dir>/generation_samples/run_<run_id>_samples.jsonl), so they can be reviewed without depending on the observability stack. Results are also sent to the P12 logger when it is active.
  4. Repetition and Degeneration Checks: Implemented a 4-gram repetition score to automatically detect degenerate output such as repetition loops. The system emits degeneration_warning (score > 0.3) and degeneration_critical (score > 0.7) alerts to the observability backend.
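For reviewers who want a concrete picture of items 3 and 4, here is a minimal sketch. It is not the code in this PR. The score definition (the share of 4-grams that duplicate another 4-gram in the same sample) is an assumption, since the description does not spell out the formula. The function names and record fields are hypothetical. Only the 0.3 and 0.7 thresholds and the alert names come from the description above.

```python
import json
from collections import Counter
from pathlib import Path

WARNING_THRESHOLD = 0.3   # threshold stated in the PR description
CRITICAL_THRESHOLD = 0.7  # threshold stated in the PR description


def ngram_repetition_score(tokens, n=4):
    """Return the fraction of n-grams that are duplicates (assumed definition).

    0.0 means every n-gram is unique; values near 1.0 mean the sample
    is dominated by a repeating pattern.
    """
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0  # too short to contain a single n-gram
    counts = Counter(tuple(tokens[i:i + n]) for i in range(total))
    return 1.0 - len(counts) / total


def degeneration_level(score):
    """Map a score to the alert name from the description, or None."""
    if score > CRITICAL_THRESHOLD:
        return "degeneration_critical"
    if score > WARNING_THRESHOLD:
        return "degeneration_warning"
    return None


def append_sample(path, record):
    """Append one JSON object per line; earlier lines are never rewritten."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    completion = "the cat sat on the mat and then the cat left".split()
    score = ngram_repetition_score(completion)
    # Hypothetical record layout; the real schema is defined by the PR.
    print({"step": 1000, "category": "creative",
           "repetition_score": score, "alert": degeneration_level(score)})
```

Opening the file in append mode and writing one object per line is what keeps the log append-only: a crash mid-run leaves all earlier samples intact, and the file can be read back line by line.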

Key Implementation Details for Reviewers:

  • Distributed Safety: Under DeepSpeed ZeRO-3, parameters are sharded, so every rank must take part in the forward pass; if only one rank ran generation, it would hang waiting on the others. The _generate method therefore runs generation on all ranks in lockstep, and only Rank 0 performs file I/O and telemetry logging.
  • Latency Optimization: The prompt suite is left-padded and batched into a single model.generate() call, avoiding the latency of running prompts one at a time inside the training loop.
  • Double Generation Fix: A generated_this_step flag was added to the training loop so that generation runs only once when a step-interval trigger and a checkpoint trigger fall on the same step.
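The batching and double-generation points can be illustrated with a small framework-free sketch. It is an approximation under stated assumptions, not the PR's implementation: left_pad and generation_steps are hypothetical helpers, plain lists stand in for tensors, and the interval parameters are illustrative names. Left padding matters for decoder-only models because generation continues from the last position, so every prompt must end at the same index.

```python
def left_pad(sequences, pad_id):
    """Left-pad token-id lists to a common length and build attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids = [[pad_id] * (width - len(seq)) + list(seq) for seq in sequences]
    # 0 marks padding positions, 1 marks real prompt tokens.
    attention_mask = [[0] * (width - len(seq)) + [1] * len(seq) for seq in sequences]
    return input_ids, attention_mask


def generation_steps(total_steps, generation_interval, checkpoint_interval):
    """Return the steps on which generation would run, at most once per step."""
    fired = []
    for step in range(1, total_steps + 1):
        generated_this_step = False
        if step % generation_interval == 0:
            fired.append(step)
            generated_this_step = True
        # The checkpoint trigger is skipped if this step already generated.
        if step % checkpoint_interval == 0 and not generated_this_step:
            fired.append(step)
    return fired


if __name__ == "__main__":
    ids, mask = left_pad([[11, 12, 13], [21]], pad_id=0)
    print(ids)   # [[11, 12, 13], [0, 0, 21]]
    print(mask)  # [[1, 1, 1], [0, 0, 1]]
    print(generation_steps(12, generation_interval=4, checkpoint_interval=6))
    # [4, 6, 8, 12]
```

In the last call, step 12 satisfies both intervals but appears once, which is the behavior the generated_this_step flag is meant to guarantee.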

Related Issue: [Link to issue / Task List]

Checklist

  • I have added tests that prove my fix is effective or that my feature works.
  • I have added necessary documentation (if applicable).
  • My code follows the style guidelines, gitflow branching strategy, and naming conventions in this project's Contribution Guidelines.

…, creative, code)

Generation tested at checkpoints (temperature=0 for reproducibility)
Generation samples logged/saved for review
Repetition and degeneration checked