Conversation

felipemello1
Contributor

@felipemello1 felipemello1 commented Oct 3, 2025

How to review this PR:

  • I don't think you should review every file; there are too many small changes.
  • Main feedback should be on logging_mode and especially on the API for backend.log_batch and backend.log_stream
    --> I added comments for both in the PR

Summary of changes:

  1. Added timestamp logging system for metrics tracking

This means we now enable zero-reduce logging: metrics are pushed immediately to the backend with a timestamp and step, and it's up to the backend to buffer them if it wants to.

[screenshot]
  2. Changed metric logging configs (see the config sketch after this list)
    Before:
reduce_across_ranks: bool
share_run_id: bool

After:

logging_mode: Enum[GLOBAL_REDUCE, PER_RANK_REDUCE, PER_RANK_NO_REDUCE]
per_rank_share_run: bool

It should be more intuitive what the options mean, and we avoid adding a third flag for no_reduce.


  3. Added unit tests for metrics.py and metric_actors.py

  4. Better actor naming. Now we get this:
[screenshot]
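
For illustration, a minimal sketch of what the new config surface could look like (the mode values follow the PR description; the exact enum class and field layout in forge are assumptions):

from enum import Enum

class LoggingMode(Enum):
    GLOBAL_REDUCE = "global_reduce"
    PER_RANK_REDUCE = "per_rank_reduce"
    PER_RANK_NO_REDUCE = "per_rank_no_reduce"

# Before: two booleans (reduce_across_ranks, share_run_id).
# After: a single mode enum plus one per-rank flag.
logging_config = {
    "logging_mode": LoggingMode.PER_RANK_NO_REDUCE,  # stream metrics, no reduction
    "per_rank_share_run": True,  # all ranks write into one shared run
}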

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 3, 2025
@felipemello1 felipemello1 changed the title [Logging] add time stamp logging [DRAFT - DO NOT REVIEW] [Logging] add time stamp logging Oct 3, 2025
@felipemello1 felipemello1 marked this pull request as draft October 3, 2025 18:06
@felipemello1 felipemello1 changed the title [DRAFT - DO NOT REVIEW] [Logging] add time stamp logging [DRAFT - DO NOT REVIEW] [Logging] add time stamp logging + test Oct 3, 2025
@felipemello1 felipemello1 marked this pull request as ready for review October 5, 2025 13:02
@felipemello1 felipemello1 changed the title [DRAFT - DO NOT REVIEW] [Logging] add time stamp logging + test [Logging] add time stamp logging + test Oct 5, 2025
Comment on lines 482 to 508
def push(self, metric: Metric) -> None:
    """Immediately log metrics to backends marked as "no_reduce" and add metrics to accumulators
    for reduction and later logging."""
    if not self._is_initialized:
        raise ValueError(
            "MetricCollector was not initialized. This happens when you try to use `record_metric` "
            "before you have initialized any logging backends. Please call in your main file:\n"
            "`mlogger = await get_or_create_metric_logger(actor_name='Controller')`\n"
            "`await mlogger.init_backends.call_one(logging_config)`\n"
            "or, to disable metric logging globally, set env variable `FORGE_DISABLE_METRICS=True`"
        )

    # Validate metric object
    if not isinstance(metric, Metric):
        raise TypeError(f"Expected {Metric} object, got {metric}")

    # Always accumulate for deferred logging and state return
    key = metric.key
    if key not in self.accumulators:
        self.accumulators[key] = metric.reduction.accumulator_class(metric.reduction)
    self.accumulators[key].append(metric.value)

    # For PER_RANK_NO_REDUCE backends: log immediately
    for backend in self.per_rank_no_reduce_backends:
        backend.log_immediate(metric=metric, step=self.step)
Contributor Author

@felipemello1 felipemello1 Oct 5, 2025

This is the major chunk that needs review, IMO.

  1. We do an async log_batch flush, i.e.
for train_step in range(10):
    do_something()
    await mlogger.flush.call_one()

mlogger.flush calls MetricLogger.flush, which flushes all metrics accumulated during the train step.


  2. However, when logging_mode=PER_RANK_NO_REDUCE, there is no reduce, and MetricCollector flushes immediately on every push, i.e.
record_metric(key, value, reduce_type)  # reduce_type is ignored

record_metric then calls MetricCollector.push, which calls backend.log_stream(key, value). Notice that this is sync.


  3. I am not sure if this is the right API:

  • async batch flush that calls backend.log
  • sync single-metric flush that calls backend.log_stream

What happens when we have backend.log_table? Do we need .log_table_stream too?
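
For reference, a rough sketch of the two-path backend surface being debated here (method names follow the PR summary's log_batch/log_stream; the signatures are assumptions, not the final API):

from typing import List, Protocol

class LoggerBackend(Protocol):
    # Deferred path: awaited from mlogger.flush at the end of a train step,
    # receives everything accumulated (and reduced) for that step.
    async def log_batch(self, metrics: List["Metric"], step: int) -> None:
        ...

    # Streaming path: called synchronously from MetricCollector.push when
    # logging_mode=PER_RANK_NO_REDUCE, one timestamped metric at a time.
    def log_stream(self, metric: "Metric", step: int) -> None:
        ...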

logger = logging.getLogger(__name__)


def detect_actor_name_from_call_stack() -> str:
Contributor Author

@allenwang28, mind taking a look at this file?

I used inspect to get the class name from the call stack. I didn't find a good way to get the ActorName from the context(). I tried it in get_actor_name_with_rank, but it returns the local_fetcher, not the actor.
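
For context, a minimal sketch of the call-stack approach (illustrative only; the helper name is hypothetical, and the real detect_actor_name_from_call_stack may differ):

import inspect

def _actor_name_from_stack(default: str = "UnknownActor") -> str:
    # Walk outward from the caller and return the class name of the first
    # frame that belongs to a bound method (i.e. has a `self` local).
    for frame_info in inspect.stack()[1:]:
        self_obj = frame_info.frame.f_locals.get("self")
        if self_obj is not None:
            return type(self_obj).__name__
    return default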

Contributor Author

I am getting this currently:
[screenshot]

@felipemello1 felipemello1 mentioned this pull request Oct 5, 2025
),
)

# Call after services are initialized
Contributor

Would you maybe explain in the comment why init_backends should be called after services are initialized?

Contributor Author

@felipemello1 felipemello1 Oct 6, 2025

Calling it before works for every mode except when per_rank_share_run=True; then it hangs. wandb says it's experimental, and I didn't investigate more deeply to see if I need to wait for something to finish. But I agree, I will add a note! Edit: done

Contributor

Can we debug this further instead of checking in this workaround?

# Force all timing methods in forge.observability.perf_tracker.py to use the
# CPU timer if False or the GPU timer if True. If unset, defaults to the value assigned in the function.
METRIC_TIMER_USES_CUDA = "METRIC_TIMER_USES_CUDA"
METRIC_TIMER_USES_GPU = "METRIC_TIMER_USES_GPU"
Member

?

Contributor Author

Making it future-proof for when we support other backends besides CUDA.

Member

TPUs here we come

Contributor Author

OH NO

Contributor Author

"METRIC_TIMER_USES_ACCELERATOR"? Is this what torch uses? geez, we refactor it when the time comes

Comment on lines 657 to +796
async def _init_shared_local(self, primary_metadata: Dict[str, Any]):
    import wandb
    from wandb.sdk.lib.service import service_token

    shared_id = primary_metadata.get("shared_run_id")
    if shared_id is None:
        raise ValueError(
            f"Shared ID required but not provided for {self.name} backend init"
        )

    # Clear any stale service tokens that might be pointing to dead processes.
    # In multiprocessing environments, WandB service tokens can become stale and point
    # to dead service processes. This causes wandb.init() to hang indefinitely trying
    # to connect to non-existent services. Clearing forces a fresh service connection.
    service_token.clear_service_in_env()

    settings = wandb.Settings(mode="shared", x_primary=False, x_label=self.name)
    self.run = wandb.init(
        id=shared_id,
        project=self.project,
        group=self.group,
        settings=settings,
    )
Contributor Author

@felipemello1 felipemello1 Oct 7, 2025

@allenwang28 @JenniferWang This fixes the wandb hang. Now we can initialize before or after services, no difference.

tldr: it seems that spawning new processes copies environment variables used by wandb for shared mode, but they are not valid for new runs. When spawning everything AFTER services, we didn't have this issue, because every new proc would have its own fresh vars.

elif role == BackendRole.LOCAL:
    if self.per_rank_share_run:
        await self._init_shared_local(primary_logger_metadata)
Member

Wouldn't this call simply overwrite the self.run that is initialized in self._init_shared_global() and set x_primary to False?

Contributor Author

@felipemello1 felipemello1 Oct 7, 2025

I am not sure what you mean by "overwrite". Are you saying that the backend would be called once with backend.init(role=global, ...) and then again with (role=local)?

If so, then no. This should never happen. A backend is initialized only once per process.

What happens is that each process has a different instance of this class. So the controller has a wandb backend, the TrainActor has another, etc.

Global init happens here: https://github.com/felipemello1/forge/blob/ece12d72d39120392ac679dc951125772157bfa6/src/forge/observability/metric_actors.py#L262

Local init happens here: https://github.com/felipemello1/forge/blob/ece12d72d39120392ac679dc951125772157bfa6/src/forge/observability/metrics.py#L471

Let me know if that's not clear.

Member

@DNXie DNXie Oct 7, 2025

But here in GlobalLoggingActor.init_backends, it would call

await backend.init(role=BackendRole.GLOBAL)

as you pointed out. And also

fetcher.init_backends.call(
    self.metadata_per_primary_backend, self.config
)

https://github.com/felipemello1/forge/blob/ece12d72d39120392ac679dc951125772157bfa6/src/forge/observability/metric_actors.py#L276

In LocalFetcherActor.init_backends it would call MetricCollector.init_backends
https://github.com/felipemello1/forge/blob/ece12d72d39120392ac679dc951125772157bfa6/src/forge/observability/metric_actors.py#L167

Are they not the same backend instances? But with grpo/main, there is only one rank involved (rank0). Could you help me understand?

Contributor Author

@felipemello1 felipemello1 Oct 7, 2025

Sure! We have:

  1. A single GlobalLoggingActor.
  2. Then, when every process is spawned, we spawn a LocalFetcherActor with it in provisioner.py.
    So, if TrainActor has 2 replicas with 2 GPUs each, we have 4 LocalFetcherActors.
  3. Each process also has its own MetricCollector.
  4. Each MetricCollector has its own backend. So in the example above, we have 4 backends, 1 per process, if you set logging_mode="per_rank_reduce"|"per_rank_no_reduce". If you set "global_reduce", then no backend is instantiated in the ranks.
  5. When we call GlobalLoggingActor.init_backends, it calls LocalFetcherActor.init_backends, which calls MetricCollector.init_backends, which calls local_backend.init, tagged as "local".

So where is the "global" backend? This one is instantiated inside the GlobalLoggingActor, and it's the only "global" one: https://github.com/felipemello1/forge/blob/timestamp_logging/src/forge/observability/metric_actors.py#L262

So to recap:

  • Each process has its own backend. N processes == N backend instances
  • And we have one extra in the global actor

Contributor Author

@felipemello1 felipemello1 Oct 7, 2025

If it makes it easier to visualize, imagine that instead of a backend, it was just a .json file that we write to. Every process writes to its own file. Global has its own file too.

global.json
r0.json
r1.json
Etc...
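
To make the analogy concrete, a condensed sketch of that topology (the class bodies here are hypothetical and heavily simplified from the real actors):

class Backend:
    # One instance per process, plus one extra owned by the global actor.
    def __init__(self, role: str):
        self.role = role  # "global" or "local"

class MetricCollector:
    # One per process; owns that process's local backend for per-rank modes.
    def __init__(self, logging_mode: str):
        self.backends = []
        if logging_mode in ("per_rank_reduce", "per_rank_no_reduce"):
            self.backends.append(Backend(role="local"))
        # with "global_reduce", ranks only accumulate; no local backend is created

class GlobalLoggingActor:
    # Single instance; owns the one "global" backend and fans init out to the
    # LocalFetcherActor on each process.
    def __init__(self):
        self.backend = Backend(role="global")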

Member

Thanks, Felipe, for the explanation! It is much clearer now.

@felipemello1
Contributor Author

Will break it down into 4 PRs.
