Commit 19d9ee6

RobotSail and claude authored

Add MLflow logging support (#66)
* Add MLflow logging support

  - Add mlflow_wrapper.py for optional MLflow imports with error handling
  - Add mlflow_tracking_uri, mlflow_experiment_name, mlflow_run_name to TrainingArgs
  - Update AsyncStructuredLogger to support MLflow logging
  - Add MLflow CLI args to api_train.py and train.py
  - Initialize MLflow at start, log params, finish at end of training

* Format code with ruff

* Address PR review comments

  - Enable MLflow when any MLflow arg is provided (not just tracking_uri)
  - Only init/finish MLflow on global rank 0 to avoid multiple runs in multi-node

* Fix MLflow logging to use explicit run ID

  Store the run ID when initializing MLflow and use it explicitly when logging
  params/metrics. This fixes an issue where async logging would lose the
  thread-local run context and create a separate MLflow run.

* Fix MLflow logging - don't re-start already active run

  The previous fix incorrectly tried to start a run in log_params() when the
  run was already active from init(). Now log_params() logs directly since the
  run is already active, and log() only resumes the run if it's not currently
  active (for async contexts).

* Address PR review nitpicks

  - Guard dist.get_rank() when the process group isn't initialized in async log()
  - Add mlflow as an optional dependency in pyproject.toml

* Add explicit environment variable fallback for MLflow configuration

  Implement kwarg > env var precedence for mlflow_tracking_uri and
  mlflow_experiment_name, matching the behavior of instructlab-training.
  The configuration now follows this precedence:
  1. Explicit kwargs (highest priority)
  2. Environment variables (MLFLOW_TRACKING_URI, MLFLOW_EXPERIMENT_NAME)
  3. MLflow defaults (lowest priority)

* Add async-safe pattern to log_params for thread-local context handling

  Mirror the pattern used in log() to handle cases where the thread-local
  MLflow context is lost in async contexts.

* Fix MLflow run context handling - don't use context manager for resume

  Using `with mlflow.start_run(run_id=...)` as a context manager ends the run
  when the block exits, breaking subsequent logging calls. Changed to call
  start_run() without a context manager to keep the run active.

* Format mlflow_wrapper.py with ruff

* guard against active mlflow runs

* comment

* provide instructions when loggers are not available but user requests it

* messaging

* updates

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
1 parent f122c80 commit 19d9ee6

File tree

6 files changed: +273 −27 lines

pyproject.toml

Lines changed: 1 addition & 0 deletions

@@ -62,6 +62,7 @@ dev = [
     "tox-uv",
 ]
 wandb = ["wandb"]
+mlflow = ["mlflow>=3.0"]

 [project.urls]
 Homepage = "https://github.com/Red-Hat-AI-Innovation-Team/mini_trainer"

src/mini_trainer/api_train.py

Lines changed: 10 additions & 0 deletions

@@ -154,6 +154,16 @@ def run_training(torch_args: TorchrunArgs, train_args: TrainingArgs) -> None:
     if train_args.wandb_entity:
         command.append(f"--wandb-entity={train_args.wandb_entity}")

+    # mlflow-related arguments
+    if train_args.mlflow_tracking_uri:
+        command.append(f"--mlflow-tracking-uri={train_args.mlflow_tracking_uri}")
+    if train_args.mlflow_experiment_name:
+        command.append(
+            f"--mlflow-experiment-name={train_args.mlflow_experiment_name}"
+        )
+    if train_args.mlflow_run_name:
+        command.append(f"--mlflow-run-name={train_args.mlflow_run_name}")
+
     # validation-related arguments
     if train_args.validation_split > 0.0:
         command.append(f"--validation-split={train_args.validation_split}")
src/mini_trainer/async_structured_logger.py

Lines changed: 43 additions & 24 deletions

@@ -14,21 +14,28 @@
 from tqdm import tqdm

 # Local imports
-from mini_trainer import wandb_wrapper
+from mini_trainer import wandb_wrapper, mlflow_wrapper
 from mini_trainer.wandb_wrapper import check_wandb_available
-
+from mini_trainer.mlflow_wrapper import check_mlflow_available


 class AsyncStructuredLogger:
-    def __init__(self, file_name="training_log.jsonl", use_wandb=False):
+    def __init__(
+        self, file_name="training_log.jsonl", use_wandb=False, use_mlflow=False
+    ):
         self.file_name = file_name
-
+
         # wandb init is a special case -- if it is requested but unavailable,
         # we should error out early
         if use_wandb:
             check_wandb_available("initialize wandb")
         self.use_wandb = use_wandb

+        # mlflow init - same pattern as wandb
+        if use_mlflow:
+            check_mlflow_available("initialize mlflow")
+        self.use_mlflow = use_mlflow
+
         # Rich console for prettier output (force_terminal=True works with subprocess streaming)
         self.console = Console(force_terminal=True, force_interactive=False)

@@ -67,23 +74,35 @@ async def log(self, data):
             data["timestamp"] = datetime.now().isoformat()
             self.logs.append(data)
             await self._write_logs_to_file(data)
-
-            # log to wandb if enabled and wandb is initialized, but only log this on the MAIN rank
+
+            # log to wandb/mlflow if enabled, but only log this on the MAIN rank
+            # Guard rank checks when the process group isn't initialized (single-process runs)
+            is_rank0 = not dist.is_initialized() or dist.get_rank() == 0
+
             # wandb already handles timestamps so no need to include
-            if self.use_wandb and dist.get_rank() == 0:
+            if self.use_wandb and is_rank0:
                 wandb_data = {k: v for k, v in data.items() if k != "timestamp"}
                 wandb_wrapper.log(wandb_data)
+
+            # log to mlflow if enabled, only on MAIN rank
+            # Filter out step from data since it's passed as a separate argument
+            if self.use_mlflow and is_rank0:
+                step = data.get("step")
+                mlflow_data = {
+                    k: v for k, v in data.items() if k not in ("timestamp", "step")
+                }
+                mlflow_wrapper.log(mlflow_data, step=step)
         except Exception as e:
             print(f"\033[1;38;2;0;255;255mError logging data: {e}\033[0m")

     async def _write_logs_to_file(self, data):
         """appends to the log instead of writing the whole log each time"""
         async with aiofiles.open(self.file_name, "a") as f:
             await f.write(json.dumps(data, indent=None) + "\n")
-
+
     def log_sync(self, data: dict):
         """Runs the log coroutine non-blocking and prints metrics with tqdm-styled progress bar.
-
+
         Args:
             data: Dictionary of metrics to log. Will automatically print a tqdm-formatted
                 progress bar with ANSI colors if step and steps_per_epoch are present.

@@ -96,61 +115,61 @@ def log_sync(self, data: dict):
         should_print = not dist.is_initialized() or dist.get_rank() == 0
         if should_print:
             data_with_timestamp = {**data, "timestamp": datetime.now().isoformat()}
-
+
             # Print the JSON using Rich for syntax highlighting
             self.console.print_json(json.dumps(data_with_timestamp))
-
+
             # Print tqdm-styled progress bar after JSON (prints as new line each time)
             # This works correctly with subprocess streaming
-            if 'step' in data and 'steps_per_epoch' in data and 'epoch' in data:
+            if "step" in data and "steps_per_epoch" in data and "epoch" in data:
                 # Initialize tqdm on first call (lazy init to avoid early printing)
                 if self.train_pbar is None:
                     # Simple bar format with ANSI colors - we'll add epoch and metrics manually
                     self.train_bar_format = (
-                        '{bar} '
-                        '\033[33m{percentage:3.0f}%\033[0m │ '
-                        '\033[37m{n}/{total}\033[0m'
+                        "{bar} "
+                        "\033[33m{percentage:3.0f}%\033[0m │ "
+                        "\033[37m{n}/{total}\033[0m"
                     )
                     self.train_pbar = tqdm(
-                        total=data['steps_per_epoch'],
+                        total=data["steps_per_epoch"],
                         bar_format=self.train_bar_format,
                         ncols=None,
                         leave=False,
                         position=0,
                         file=sys.stdout,
-                        ascii='━╺─',  # custom characters matching Rich style
+                        ascii="━╺─",  # custom characters matching Rich style
                         disable=True,  # disable auto-display, we'll manually call display()
                     )

                 # Reset tqdm if we're in a new epoch
-                current_step_in_epoch = (data['step'] - 1) % data['steps_per_epoch'] + 1
+                current_step_in_epoch = (data["step"] - 1) % data["steps_per_epoch"] + 1
                 if current_step_in_epoch == 1:
-                    self.train_pbar.reset(total=data['steps_per_epoch'])
+                    self.train_pbar.reset(total=data["steps_per_epoch"])

                 # Update tqdm position
                 self.train_pbar.n = current_step_in_epoch

                 # Manually format the complete progress line with metrics using format_meter
                 bar_str = self.train_pbar.format_meter(
                     n=current_step_in_epoch,
-                    total=data['steps_per_epoch'],
+                    total=data["steps_per_epoch"],
                     elapsed=0,  # we don't track elapsed time
                     ncols=None,
                     bar_format=self.train_bar_format,
-                    ascii='━╺─',
+                    ascii="━╺─",
                 )

                 # Prepend the epoch number (1-indexed)
-                epoch_prefix = f'\033[1;34mEpoch {data["epoch"] + 1}:\033[0m '
+                epoch_prefix = f"\033[1;34mEpoch {data['epoch'] + 1}:\033[0m "
                 bar_str = epoch_prefix + bar_str
-
+
                 # Add the metrics to the bar string
                 metrics_str = (
                     f" │ \033[32mloss:\033[0m \033[37m{data['loss']:.4f}\033[0m"
                     f" │ \033[32mlr:\033[0m \033[37m{data['lr']:.2e}\033[0m"
                     f" │ \033[35m{data['tokens_per_second']:.0f}\033[0m \033[2mtok/s\033[0m"
                 )
-
+
                 # Print the complete line
                 print(bar_str + metrics_str, file=sys.stdout, flush=True)
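The payload preparation inside `log()` can be shown standalone. This is a simplified sketch: `split_mlflow_payload` is a hypothetical helper, not a function in the codebase, but the dict comprehension mirrors the one in the diff, which pulls `step` out (MLflow takes it as a separate argument) and drops `timestamp` (MLflow records its own).

```python
from typing import Any, Dict, Optional, Tuple


def split_mlflow_payload(data: Dict[str, Any]) -> Tuple[Dict[str, Any], Optional[int]]:
    """Split a log record into (metrics, step) the way log() prepares
    data before calling mlflow_wrapper.log(mlflow_data, step=step)."""
    step = data.get("step")
    # timestamp is excluded because MLflow timestamps metrics itself;
    # step is excluded because it's passed as a separate argument
    metrics = {k: v for k, v in data.items() if k not in ("timestamp", "step")}
    return metrics, step


metrics, step = split_mlflow_payload(
    {"step": 7, "loss": 0.42, "timestamp": "2025-01-01T00:00:00"}
)
print(metrics, step)  # → {'loss': 0.42} 7
```

Keeping this split in the logger (rather than in the wrapper) lets the same `data` dict feed the JSONL file, wandb, and MLflow with each backend's expectations handled at the call site.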

src/mini_trainer/mlflow_wrapper.py

Lines changed: 173 additions & 0 deletions

@@ -0,0 +1,173 @@
+# SPDX-License-Identifier: Apache-2.0
+
+"""
+Wrapper for optional mlflow imports that provides consistent error handling
+across all processes when mlflow is not installed.
+"""
+
+import logging
+import os
+from typing import Any, Dict, Optional
+
+# Try to import mlflow
+try:
+    import mlflow
+
+    MLFLOW_AVAILABLE = True
+except ImportError:
+    MLFLOW_AVAILABLE = False
+    mlflow = None
+
+logger = logging.getLogger(__name__)
+
+# Store the active run ID to ensure we can resume the run if needed
+# This is needed because async logging may lose the thread-local run context
+_active_run_id: Optional[str] = None
+
+
+class MLflowNotAvailableError(ImportError):
+    """Raised when mlflow functions are called but mlflow is not installed."""
+
+    pass
+
+
+def check_mlflow_available(operation: str) -> None:
+    """Check if mlflow is available, raise error if not."""
+    if not MLFLOW_AVAILABLE:
+        error_msg = (
+            f"Attempted to {operation} but mlflow is not installed. "
+            "Please install mlflow with: pip install mlflow"
+        )
+        logger.error(error_msg)
+        raise MLflowNotAvailableError(error_msg)
+
+
+def init(
+    tracking_uri: Optional[str] = None,
+    experiment_name: Optional[str] = None,
+    run_name: Optional[str] = None,
+    **kwargs,
+) -> Any:
+    """
+    Initialize an mlflow run. Raises MLflowNotAvailableError if mlflow is not installed.
+
+    Configuration follows a precedence hierarchy:
+    1. Explicit kwargs (highest priority)
+    2. Environment variables (MLFLOW_TRACKING_URI, MLFLOW_EXPERIMENT_NAME)
+    3. MLflow defaults (lowest priority)
+
+    Args:
+        tracking_uri: MLflow tracking server URI (e.g., "http://localhost:5000").
+            Falls back to the MLFLOW_TRACKING_URI environment variable if not provided.
+        experiment_name: Name of the experiment.
+            Falls back to the MLFLOW_EXPERIMENT_NAME environment variable if not provided.
+        run_name: Name of the run.
+        **kwargs: Additional arguments to pass to mlflow.start_run.
+
+    Returns:
+        mlflow.ActiveRun object if successful.
+
+    Raises:
+        MLflowNotAvailableError: If mlflow is not installed.
+    """
+    global _active_run_id
+    check_mlflow_available("initialize mlflow")
+
+    # Apply kwarg > env var precedence for tracking_uri
+    effective_tracking_uri = tracking_uri or os.environ.get("MLFLOW_TRACKING_URI")
+    if effective_tracking_uri:
+        mlflow.set_tracking_uri(effective_tracking_uri)
+
+    # Apply kwarg > env var precedence for experiment_name
+    effective_experiment_name = experiment_name or os.environ.get(
+        "MLFLOW_EXPERIMENT_NAME"
+    )
+    if effective_experiment_name:
+        mlflow.set_experiment(effective_experiment_name)
+
+    # Remove run_name from kwargs if present to avoid duplicate keyword argument
+    # The explicit run_name parameter takes precedence
+    kwargs.pop("run_name", None)
+
+    # Reuse existing active run if one exists, otherwise start a new one
+    active_run = mlflow.active_run()
+    if active_run is not None:
+        run = active_run
+    else:
+        run = mlflow.start_run(run_name=run_name, **kwargs)
+    _active_run_id = run.info.run_id
+    return run
+
+
+def get_active_run_id() -> Optional[str]:
+    """Get the active run ID that was started by init()."""
+    return _active_run_id
+
+
+def _ensure_run_for_logging() -> None:
+    """Ensure there's an active MLflow run for logging.
+
+    This helper handles async contexts where the thread-local run context may be lost.
+    If no active run exists but we have a stored run ID, it resumes that run.
+    """
+    active_run = mlflow.active_run()
+    if not active_run and _active_run_id:
+        # No active run in this thread but we have a stored run ID - resume it
+        # This can happen in async contexts where thread-local context is lost
+        # Note: We don't use a context manager here because it would end the run on exit
+        mlflow.start_run(run_id=_active_run_id)
+
+
+def log_params(params: Dict[str, Any]) -> None:
+    """
+    Log parameters to mlflow. Raises MLflowNotAvailableError if mlflow is not installed.
+
+    Args:
+        params: Dictionary of parameters to log.
+
+    Raises:
+        MLflowNotAvailableError: If mlflow is not installed.
+    """
+    check_mlflow_available("log params to mlflow")
+    # MLflow params must be strings
+    str_params = {k: str(v) for k, v in params.items()}
+
+    _ensure_run_for_logging()
+    mlflow.log_params(str_params)
+
+
+def log(data: Dict[str, Any], step: Optional[int] = None) -> None:
+    """
+    Log metrics to mlflow. Raises MLflowNotAvailableError if mlflow is not installed.
+
+    Args:
+        data: Dictionary of data to log (non-numeric values will be skipped).
+        step: Optional step number for the metrics.
+
+    Raises:
+        MLflowNotAvailableError: If mlflow is not installed.
+    """
+    check_mlflow_available("log to mlflow")
+    # Filter to only numeric values for metrics
+    metrics = {}
+    for k, v in data.items():
+        try:
+            metrics[k] = float(v)
+        except (ValueError, TypeError):
+            pass  # Skip non-numeric values
+    if metrics:
+        _ensure_run_for_logging()
+        mlflow.log_metrics(metrics, step=step)
+
+
+def finish() -> None:
+    """
+    End the mlflow run. Raises MLflowNotAvailableError if mlflow is not installed.
+
+    Raises:
+        MLflowNotAvailableError: If mlflow is not installed.
+    """
+    global _active_run_id
+    check_mlflow_available("finish mlflow run")
+    mlflow.end_run()
+    _active_run_id = None
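The kwarg > env var > default precedence in `init()` reduces to a one-line fallback per setting. The sketch below isolates that logic without importing mlflow; `resolve_setting` is a hypothetical helper, not part of the wrapper, and a `None` result is where MLflow's own defaults would take over.

```python
import os
from typing import Optional


def resolve_setting(kwarg_value: Optional[str], env_var: str) -> Optional[str]:
    """Explicit kwarg wins; otherwise fall back to the environment variable;
    otherwise return None so MLflow applies its own defaults."""
    return kwarg_value or os.environ.get(env_var)


os.environ["MLFLOW_TRACKING_URI"] = "http://env-server:5000"
print(resolve_setting(None, "MLFLOW_TRACKING_URI"))
# → http://env-server:5000 (env var fills in when no kwarg is given)
print(resolve_setting("http://kwarg:5000", "MLFLOW_TRACKING_URI"))
# → http://kwarg:5000 (explicit kwarg overrides the env var)
```

One subtlety of the `or` idiom: an empty-string kwarg also falls through to the environment variable, which matches the wrapper's behavior of only calling `mlflow.set_tracking_uri` / `mlflow.set_experiment` when the resolved value is truthy.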
