
Conversation

felipemello1 (Contributor) commented on Oct 9, 2025

1. Changes the logging mode config, so it's clearer:

Before:

reduce_across_ranks: bool
share_run_id: bool

After:

logging_mode: Enum[GLOBAL_REDUCE, PER_RANK_REDUCE, PER_RANK_NO_REDUCE]
per_rank_share_run: bool
2. Adds class:
class LoggingMode(Enum):
    GLOBAL_REDUCE = "global_reduce"
    PER_RANK_REDUCE = "per_rank_reduce"
    PER_RANK_NO_REDUCE = "per_rank_no_reduce"
3. Introduces the "PER_RANK_NO_REDUCE" mode, aka streaming. This means we call backend.log(metric) as soon as we get it, without any reduction.

Before, MetricCollector.push(metric) would just collect the metric. Now, it also logs:

def push(self, metric: Metric) -> None:
    # Stream immediately to backends in "PER_RANK_NO_REDUCE" mode
    for backend in self.per_rank_no_reduce_backends:
        backend.log_stream(metric=metric, global_step=self.global_step)

    # Always accumulate for reduction and state return
    key = metric.key
    if key not in self.accumulators:
        self.accumulators[key] = metric.reduction.accumulator_class(
            metric.reduction
        )
    self.accumulators[key].append(metric.value)

Notice how the x-axis is the timestamp:
[image: chart of streamed metrics with timestamp on the x-axis]

4. Main design change: logger backends now have async def log_batch and def log_stream. It's not totally clear to me whether both should be async or sync, or whether I should try to unify them.
class LoggerBackend(ABC):
    """Abstract logger_backend for metric logging, e.g. wandb, jsonl, etc."""

    def __init__(self, logger_backend_config: dict[str, Any]) -> None:
        self.logger_backend_config = logger_backend_config

    @abstractmethod
    async def init(
        self,
        role: BackendRole,
        primary_logger_metadata: dict[str, Any] | None = None,
        process_name: str | None = None,
    ) -> None:
        """Initializes backend, e.g. wandb.run.init()."""
        pass

    @abstractmethod
    async def log_batch(
        self, metrics: list[Metric], global_step: int, *args, **kwargs
    ) -> None:
        """Log batch of accumulated metrics to backend"""
        pass

    def log_stream(self, metric: Metric, global_step: int, *args, **kwargs) -> None:
        """Stream single metric to backend immediately."""
        pass

    async def finish(self) -> None:
        pass

    def get_metadata_for_secondary_ranks(self) -> dict[str, Any] | None:
        """Return sharable state after primary init (e.g., for shared modes). Called only on globals."""
        return None
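
For illustration, here is a minimal sketch of a concrete backend implementing both methods. JsonlBackend is hypothetical (not part of this PR) and assumes LoggerBackend, Metric, and BackendRole can be imported from forge.observability.metrics:

import json
import time
from typing import Any

from forge.observability.metrics import BackendRole, LoggerBackend, Metric


class JsonlBackend(LoggerBackend):
    """Hypothetical backend that appends metrics to a JSONL file."""

    async def init(
        self,
        role: BackendRole,
        primary_logger_metadata: dict[str, Any] | None = None,
        process_name: str | None = None,
    ) -> None:
        # One file per process; in per-rank modes this yields one file per rank.
        suffix = process_name or "global"
        self._path = self.logger_backend_config.get("path", f"metrics_{suffix}.jsonl")

    async def log_batch(
        self, metrics: list[Metric], global_step: int, *args, **kwargs
    ) -> None:
        # Called on flush with the (possibly reduced) accumulated metrics.
        with open(self._path, "a") as f:
            for m in metrics:
                record = {"step": global_step, "key": m.key, "value": m.value}
                f.write(json.dumps(record) + "\n")

    def log_stream(self, metric: Metric, global_step: int, *args, **kwargs) -> None:
        # Called synchronously per metric in PER_RANK_NO_REDUCE mode;
        # the wall-clock timestamp makes a natural x-axis for streaming.
        record = {
            "ts": time.time(),
            "step": global_step,
            "key": metric.key,
            "value": metric.value,
        }
        with open(self._path, "a") as f:
            f.write(json.dumps(record) + "\n")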

meta-cla bot added the "CLA Signed" label on Oct 9, 2025
felipemello1 changed the title from "Metric Logging updates 5/N" to "[draft] Metric Logging updates 5/N" on Oct 9, 2025
felipemello1 marked this pull request as ready for review on Oct 9, 2025, 20:56
felipemello1 changed the title from "[draft] Metric Logging updates 5/N" to "Metric Logging updates 5/N" on Oct 9, 2025
felipemello1 changed the title from "Metric Logging updates 5/N" to "Metric Logging updates 5/N - enable streaming" on Oct 14, 2025
codecov-commenter commented

Codecov Report

❌ Patch coverage is 42.85714% with 68 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.04%. Comparing base (4c14792) to head (69f9f8c).
⚠️ Report is 8 commits behind head on main.

Files with missing lines                   Patch %   Lines
src/forge/observability/metrics.py         32.81%    43 Missing ⚠️
src/forge/observability/metric_actors.py   28.57%    25 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #363      +/-   ##
==========================================
- Coverage   73.68%   65.04%   -8.65%     
==========================================
  Files          81       82       +1     
  Lines        7729     7901     +172     
==========================================
- Hits         5695     5139     -556     
- Misses       2034     2762     +728     

☔ View full report in Codecov by Sentry.

Comment on lines +21 to +24
logging_mode: global_reduce # global_reduce, per_rank_reduce, per_rank_no_reduce
per_rank_share_run: False
console:
reduce_across_ranks: True
logging_mode: global_reduce
Contributor:

Why do we need to duplicate logging_mode across different configs like this? Feels like clunky UX to me.

felipemello1 (Author), Oct 15, 2025:

This is per backend. You could have Scuba logging in streaming mode, console logging with global_reduce, and wandb logging per rank. If you have a single backend, you define it only once.
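
For illustration, a sketch of a hypothetical multi-backend config expressed as a Python dict (the dict name and overall shape are illustrative; logging_mode and per_rank_share_run are the actual options from this PR):

# Illustrative only: one logging_mode per backend; a single backend defines it once.
metric_logging = {
    "wandb":   {"logging_mode": "per_rank_reduce", "per_rank_share_run": True},
    "console": {"logging_mode": "global_reduce"},
    "scuba":   {"logging_mode": "per_rank_no_reduce"},  # streaming
}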

A logger is represented by a backend, i.e. wandb backend. If reduce_across_ranks=False,
the backend is instantiated per-rank, in the MetricCollector, otherwise it is instantiated once globally,
in the GlobalLoggingActor.
Supports multiple logging backends, each with different logging modes.
Contributor:

(I know this is the design from prior to this PR but) just wondering why we use a singleton across multiple logger backends. In my mind it's more idiomatic to instantiate different classes for different logger backends, and also that usage pattern shouldn't be super common anyways. (Personally I would prefer e.g. a simple console logger + generalized logging backend for wandb etc as two separate entities.) Basically I worry that we are putting extra onus on people who just wanna log to wandb with all these nested dict configs etc for something that is a bit of a niche use case

felipemello1 (Author), Oct 15, 2025:

"In my mind it's more idiomatic to instantiate different classes for different logger backends"

We do: each backend is its own class. But the logic for backend in backends: backend.log(metric) has to live somewhere. MetricCollector is this wrapper with 4 methods:

class MetricCollector:
    def init_backends(): ...
    def push(): ...
    def flush(): ...
    def shutdown(): ...

Each one of them is mostly just doing for backend in backends: backend.do_something().
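
As an illustration, a rough sketch of that fan-out (simplified, with hypothetical class and parameter names; accumulation/reduction details elided):

class MetricCollectorSketch:
    """Simplified stand-in showing only the per-backend fan-out."""

    def __init__(self, backends: list) -> None:
        self.backends = backends

    async def flush(self, reduced_metrics: list, global_step: int) -> None:
        # Hand the reduced batch to every configured backend.
        for backend in self.backends:
            await backend.log_batch(metrics=reduced_metrics, global_step=global_step)

    async def shutdown(self) -> None:
        for backend in self.backends:
            await backend.finish()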

share_run_id (bool, default False): Only used if reduce_across_ranks=False.
True -> shared run across ranks; False -> separate runs per rank.
logging_mode (LoggingMode): Determines logging behavior
per_rank_share_run (bool, default False): For per-rank modes, whether to share run ID across ranks.
Contributor:

I know we chatted about it a bit already, but personally I still don't understand this one. It seems to me like we should always use a single run; if e.g. there are name collisions of metrics within the run we should sort that out ourselves.

felipemello1 (Author), Oct 15, 2025:

"It seems to me like we should always use a single run", then you cannot have this:

[image: example of separate per-rank runs in the logging backend UI]

Imagine that multiple runs == multiple JSON files, e.g. rank_0.json, rank_1.json, etc.


Hopefully the README can clarify this: https://github.com/meta-pytorch/forge/pull/380/files

-# Per-rank modes based on share_run_id bool
-elif role == BackendRole.GLOBAL and self.share_run_id:
+# Per-rank modes based on per_rank_share_run bool
+elif role == BackendRole.GLOBAL and self.per_rank_share_run:
Contributor:

A more general question about extensibility: say I want to use my own logger not provided by forge. Do I now have to implement my own custom logic to handle these different cases? (Basically I want to ensure we are not introducing unnecessary friction for users who want to customize beyond what we've directly provided)

felipemello1 (Author), Oct 15, 2025:

If you have your own backend, you have to implement a LoggerBackend. The logging_mode defines where things get called:

If "GLOBAL_REDUCE", we call log_batch(reduced_metrics) once, from the controller.
If "PER_RANK_REDUCE", we call log_batch(per_rank_reduced_metrics) once per rank.
If "PER_RANK_NO_REDUCE", we call log_stream(metric) on every rank as soon as record_metric is called.

In init, the user can define how to initialize the backend, e.g. create a file, initialize a run, etc. (a short sketch of these call sites follows the snippet below):

class LoggerBackend(ABC):
    @abstractmethod
    async def init(...) -> None:
        pass

    @abstractmethod
    async def log_batch(
        self, metrics: list[Metric], global_step: int, *args, **kwargs
    ) -> None:
        pass
	
    @abstractmethod
    def log_stream(self, metric: Metric, global_step: int, *args, **kwargs) -> None:
        pass
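
And a rough sketch of where each mode invokes the backend (call sites simplified; function and variable names here are illustrative, not the PR's actual code):

from forge.observability.metrics import LoggingMode

def describe_call_site(mode: LoggingMode) -> str:
    if mode == LoggingMode.GLOBAL_REDUCE:
        # Controller (GlobalLoggingActor) flush: await backend.log_batch(globally_reduced, step)
        return "log_batch, once per flush, from the controller"
    if mode == LoggingMode.PER_RANK_REDUCE:
        # Each rank's MetricCollector flush: await backend.log_batch(rank_reduced, step)
        return "log_batch, once per flush, on every rank"
    # PER_RANK_NO_REDUCE: every record_metric triggers backend.log_stream(metric, step) synchronously
    return "log_stream, per metric, on every rank"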

return

# Convert metrics to WandB log format
log_data = {"step": global_step}
Contributor:

Why did we change global_step -> step here?

def log_stream(self, metric: Metric, global_step: int, *args, **kwargs) -> None:
"""Stream single metric to backend immediately.
NOTE: This method is called synchronously.
Contributor:

What's the rationale here? In my mind the difference between this and the log_batch behavior is a bit unintuitive.

felipemello1 (Author), Oct 15, 2025:

We do record_metric(key, value, reduce) -> MetricCollector.push.

If "per_rank_no_reduce", we immediately call MetricCollector.push -> backend.log_stream(single_metric).
Else, we just accumulate the metric until flush:

GlobalLoggingActor.flush -> MetricCollector.flush -> backend.log_batch(reduced_metrics)

I am not 100% convinced that they should be different methods, but I also was not comfortable merging them.

Weights & Biases logging backend for distributed training.
Weights & Biases logging backend.
For logging mode details, see `forge.observability.metrics.LoggingMode` documentation.
Contributor:

I think it would also be helpful to provide some images in the documentation demonstrating the resulting figures from each of the different modes.

felipemello1 (Author):

Documentation as in the README, or in this docstring?

if metadata_per_primary_backend:
primary_metadata = metadata_per_primary_backend.get(backend_name, {})
# Skip local instantiation. Backend will be instantiated in GlobalLoggingActor.
if mode == LoggingMode.GLOBAL_REDUCE:
Contributor:

Kinda similar to my other comment about supporting multiple logging backends... I feel like the MetricCollector class is doing a lot (imo too much). I know I wasn't around to give detailed review of the first few PRs, but I do wonder whether all these different logging modes should really all be handled in the same class. Currently the logic is a bit hard for me to understand.

felipemello1 (Author):

"I feel like the MetricCollector class is doing a lot (imo too much)"

MetricCollector is this:

class MetricCollector:
    def init_backends(): ...
    def push(): ...
    def flush(): ...
    def shutdown(): ...

Are you saying that the methods are doing too much or we have too many methods?

"Currently the logic is a bit hard for me to understand"

That's fair. Maybe I can try to make it more obvious, or maybe it's a documentation issue for a global view, which I address in https://github.com/meta-pytorch/forge/pull/380/files
