Commit 1d82b25

feat: Support callbacks for third-party extensions (#2063)

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

1 parent 38858ef commit 1d82b25

File tree

10 files changed: +1666 −9 lines changed
.github/workflows/cicd-main.yml

Lines changed: 1 addition & 1 deletion

@@ -378,7 +378,7 @@ jobs:
       matrix:
         include:
           - script: L2_Launch_training
-            timeout: 40
+            timeout: 50
           - script: L2_Launch_converter
           - script: L2_Launch_models_deepseek
           - script: L2_Launch_models_gemma

docs/index.md

Lines changed: 1 addition & 0 deletions

@@ -50,6 +50,7 @@ training/cpu-offloading.md
 training/peft.md
 training/packed-sequences.md
 training/distillation.md
+training/callbacks.md
 ```

 ```{toctree}

docs/training/README.md

Lines changed: 3 additions & 1 deletion

@@ -22,7 +22,7 @@ This directory contains comprehensive documentation for training and customizing
 → Explore [Performance Guide](../performance-guide.md) and [Performance Summary](../performance-summary.md)

 **🔧 Customize training**
-→ See [PEFT](peft.md), [Distillation](distillation.md), and [Entry Points](entry-points.md)
+→ See [PEFT](peft.md), [Distillation](distillation.md), [Entry Points](entry-points.md), and [Callbacks](callbacks.md)

 ## Core Training Documentation

@@ -61,6 +61,7 @@ This directory contains comprehensive documentation for training and customizing
 | **[Packed Sequences](packed-sequences.md)** | Sequence packing for efficiency | Optimizing data loading |
 | **[Distillation](distillation.md)** | Knowledge distillation techniques | Transferring knowledge between models |
 | **[Checkpointing](checkpointing.md)** | Checkpoint saving, loading, and resuming | Managing training state |
+| **[Callbacks](callbacks.md)** | Inject custom logic into training loop | Custom logging, metrics, third-party integrations |

 ## Training Workflow

@@ -113,6 +114,7 @@ A typical training workflow involves:
 1. [PEFT](peft.md) - Parameter-efficient fine-tuning
 2. [Distillation](distillation.md) - Knowledge distillation
 3. [Entry Points](entry-points.md) - Custom training workflows
+4. [Callbacks](callbacks.md) - Inject custom logic (third-party integrations)

 ---

docs/training/callbacks.md

Lines changed: 223 additions & 0 deletions

@@ -0,0 +1,223 @@

# Callbacks

Megatron Bridge provides a lightweight callback system for injecting custom logic into the training and evaluation loop without modifying framework code. This is ideal for proprietary integrations or custom logging and metrics tracking.

## Quick Start

### Class-Based Callbacks

Subclass {py:class}`bridge.training.callbacks.Callback` and override event methods:

```python
import time

from megatron.bridge.training.callbacks import Callback
from megatron.bridge.training.pretrain import pretrain

class MyCallback(Callback):
    def on_train_start(self, context):
        context.user_state['start_time'] = time.time()
        print(f"Training started at step {context.state.train_state.step}")

    def on_train_step_end(self, context):
        if context.loss_dict:
            print(f"Step {context.state.train_state.step}: loss={context.loss_dict}")

    def on_train_end(self, context):
        elapsed = time.time() - context.user_state['start_time']
        print(f"Training completed in {elapsed:.2f}s")

# Pass callbacks to pretrain
pretrain(config, forward_step_func, callbacks=[MyCallback()])
```

### Functional Callbacks

Register functions directly with {py:class}`bridge.training.callbacks.CallbackManager`:

```python
from megatron.bridge.training.callbacks import CallbackManager
from megatron.bridge.training import pretrain

def log_step(context):
    step = context.state.train_state.step
    if context.loss_dict:
        print(f"Step {step}: {context.loss_dict}")

callback_manager = CallbackManager()
callback_manager.register("on_train_step_end", log_step)

pretrain(config, forward_step_func, callbacks=callback_manager)
```

### Mixing Both Patterns

Both registration patterns can be combined:

```python
manager = CallbackManager()
manager.add(MyCallback())
manager.add([TimingCallback(), MetricsCallback()])
manager.register("on_eval_end", lambda ctx: print("Evaluation complete!"))

pretrain(config, forward_step_func, callbacks=manager)
```

## Available Events

### Training Events

| Event | When Fired | Available Context Fields |
|-------|------------|--------------------------|
| `on_train_start` | After `model.train()`, before training loop | `state`, `model`, `user_state`, `optimizer`, `scheduler` |
| `on_train_step_start` | Before each training step | `state`, `model`, `user_state`, `optimizer`, `scheduler` |
| `on_train_step_end` | After each training step | `state`, `model`, `user_state`, `optimizer`, `scheduler`, `loss_dict`, `grad_norm`, `skipped_iter` |
| `on_train_end` | After training loop completes | `state`, `model`, `user_state`, `optimizer`, `scheduler` |

### Validation Events

| Event | When Fired | Available Context Fields |
|-------|------------|--------------------------|
| `on_eval_start` | After `model.eval()`, before validation loop | `state`, `model`, `user_state` |
| `on_eval_step_start` | Before each validation step | `state`, `model`, `user_state` |
| `on_eval_step_end` | After each validation step | `state`, `model`, `user_state` |
| `on_eval_end` | After validation completes | `state`, `model`, `user_state`, `total_loss_dict` |

### Test Events

| Event | When Fired | Available Context Fields |
|-------|------------|--------------------------|
| `on_test_start` | After `model.eval()`, before test loop | `state`, `model`, `user_state` |
| `on_test_step_start` | Before each test step | `state`, `model`, `user_state` |
| `on_test_step_end` | After each test step | `state`, `model`, `user_state` |
| `on_test_end` | After test completes | `state`, `model`, `user_state`, `total_loss_dict` |
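
For instance, the end-of-evaluation event exposes the aggregated losses once validation finishes. A minimal sketch (the `"lm loss"` key is illustrative and depends on how your forward step names its losses):

```python
from megatron.bridge.training.callbacks import Callback

class EvalReportCallback(Callback):
    """Print aggregated validation losses after each evaluation pass."""

    def on_eval_end(self, context):
        # total_loss_dict holds losses aggregated over the whole validation loop
        if context.total_loss_dict:
            step = context.state.train_state.step
            lm_loss = context.total_loss_dict.get("lm loss")  # illustrative key
            print(f"[eval @ step {step}] lm loss: {lm_loss}")
```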

## CallbackContext

The {py:class}`bridge.training.callbacks.CallbackContext` provides access to framework state:

### Always Available

- **`state`**: {py:class}`bridge.training.state.GlobalState` - Contains config, train_state, timers, and loggers
- **`model`**: List of model chunks
- **`user_state`**: Mutable dict for storing data across callback invocations

### Training Events Only

- **`optimizer`**: The optimizer instance
- **`scheduler`**: Learning rate scheduler

### Event-Specific Fields

- **`loss_dict`** (`on_train_step_end`): Dictionary of reduced losses from the training step
- **`grad_norm`** (`on_train_step_end`): Gradient norm (if computed)
- **`skipped_iter`** (`on_train_step_end`): Whether the iteration was skipped
- **`total_loss_dict`** (`on_eval_end`, `on_test_end`): Aggregated evaluation/test losses
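
Since `grad_norm` is only populated on steps where the framework computes it, guard access before use. A small sketch combining the step-end fields (assuming `grad_norm` is absent or `None` on steps where it is not computed):

```python
from megatron.bridge.training.callbacks import Callback

class StepHealthCallback(Callback):
    """Watch gradient norms and count skipped iterations."""

    def on_train_start(self, context):
        context.user_state['skipped_total'] = 0

    def on_train_step_end(self, context):
        # skipped_iter flags steps the framework skipped (e.g. on overflow)
        if context.skipped_iter:
            context.user_state['skipped_total'] += 1
        # grad_norm may not be computed on every step
        if context.grad_norm is not None:
            print(f"step {context.state.train_state.step}: grad_norm={context.grad_norm}")
```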

## User State

The `CallbackManager` owns a `user_state` dictionary that persists across all callback invocations during a training run. Use it to share data between callbacks or accumulate metrics:

```python
class StepCounterCallback(Callback):
    def on_train_start(self, context):
        context.user_state['callback_step_count'] = 0

    def on_train_step_end(self, context):
        context.user_state['callback_step_count'] += 1

    def on_train_end(self, context):
        print(f"Callback saw {context.user_state['callback_step_count']} steps")
```

## Distributed Training

Callbacks fire on **all ranks** without framework-level synchronization. If your callback should only run on specific ranks, add guards:

```python
import torch.distributed as dist

class RankZeroCallback(Callback):
    def on_train_step_end(self, context):
        if dist.get_rank() == 0:
            print(f"Step {context.state.train_state.step} complete")
```
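
If the same callback may also run in a single-process job where no process group was initialized, `dist.get_rank()` will raise. A safer guard, sketched here with the standard `torch.distributed` availability checks:

```python
import torch.distributed as dist

def is_rank_zero() -> bool:
    """True on rank 0, and in single-process runs without a process group."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank() == 0
    return True
```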

## Exception Handling

Exceptions from callbacks propagate to the caller. The framework does not catch or handle callback exceptions. If your callback might fail, wrap it in a try-except:

```python
def safe_callback(context):
    try:
        # Your logic here
        external_service.log(context.loss_dict)
    except Exception as e:
        print(f"Callback failed: {e}")
        # Don't re-raise, to avoid stopping training
```

## Execution Order

Callbacks fire in registration order:

1. Callbacks added via `add()` fire in the order they were added
2. Callbacks registered via `register()` fire in the order they were registered
3. If both methods are used, callbacks fire in the overall order in which the `add()` and `register()` calls were made, as sketched below
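
A small sketch of that interleaving (the callback names are illustrative):

```python
from megatron.bridge.training.callbacks import Callback, CallbackManager

class First(Callback):
    def on_train_start(self, context):
        print("first")

def second(context):
    print("second")

class Third(Callback):
    def on_train_start(self, context):
        print("third")

manager = CallbackManager()
manager.add(First())                         # registered first
manager.register("on_train_start", second)   # registered second
manager.add(Third())                         # registered third
# When on_train_start fires, the output order is: first, second, third
```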

## Introspection

Query registered callbacks:

```python
manager = CallbackManager()
manager.register("on_train_start", my_fn)

# Check if any callbacks exist for an event
if manager.has_callbacks("on_train_start"):
    print("Callbacks registered for on_train_start")

# List all callbacks for an event
callbacks = manager.list_callbacks("on_train_start")
print(f"Found {len(callbacks)} callbacks")

# Get all valid event names
print(manager.events)  # frozenset of valid event names
```

## Design Principles

The callback system follows these principles:

1. **First-Party Isolation**: Framework code never uses callbacks for its own logic. Callbacks are strictly for third-party extensions.

2. **Zero Overhead**: When no callbacks are registered, the training loop incurs no additional cost.

3. **Safety**: Callbacks receive framework state, but modifying it is at the user's own risk. The framework makes no guarantees about the effects of modifications.

## Examples

### Proprietary Metrics

```python
import os

class ProprietaryMetricsCallback(Callback):
    """Send metrics to an internal monitoring system."""

    def __init__(self, endpoint: str):
        # InternalMetricsClient is a stand-in for your own client
        self.client = InternalMetricsClient(endpoint)

    def on_train_step_end(self, context):
        if context.loss_dict:
            self.client.send({
                "step": context.state.train_state.step,
                "loss": context.loss_dict.get("lm loss"),
                "grad_norm": context.grad_norm,
                "cluster_id": os.environ.get("CLUSTER_ID"),
            })
```
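
### Step Timing

Another common pattern pairs the step start/end events to time each iteration; a minimal sketch:

```python
import time

class StepTimerCallback(Callback):
    """Measure wall-clock time per training step."""

    def on_train_step_start(self, context):
        context.user_state['step_start'] = time.perf_counter()

    def on_train_step_end(self, context):
        elapsed = time.perf_counter() - context.user_state['step_start']
        print(f"step {context.state.train_state.step}: {elapsed * 1000:.1f} ms")
```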

## API Reference

- {py:class}`bridge.training.callbacks.Callback`
- {py:class}`bridge.training.callbacks.CallbackContext`
- {py:class}`bridge.training.callbacks.CallbackManager`
- {py:func}`bridge.training.callbacks.normalize_callbacks`
- {py:func}`bridge.training.callbacks.should_fire`
