[Observability 2/7] Structured system logging and gantt chart generation #2602

felipemello1 wants to merge 1 commit into gh/felipemello1/18/base from [ghstack-poisoned]

Conversation
    class MetricsProcessor:
        """Step context manager for the toy trainer.

        Mirrors the method order of components/metrics.py MetricsProcessor
        so the toy and production versions are easy to compare.
        """
This class is built incrementally over PRs 1-5 (toy examples), so it is ready for PRs 6-7 (integration with the trainer).
    _system_logger = logging.getLogger(SYSTEM_LOGGER_NAME)


    def init_observability(source: str, output_dir: str, rank: int | None = None) -> None:
This function (and maybe a couple of others) should probably not be here, since it is NOT specific to system metrics (it is also shared by experiment metrics). I can move it in a follow-up.
    self.step = step
    tokens, labels, loss_mask = next(data_iterator)
    self.train_step(tokens, labels, loss_mask)
    self.metrics_processor.set_step(step)
When should we call set_step directly, vs. calling it from MetricsProcessor?

MetricsProcessor calls set_step (and a few other things, like clear_tags); it's just a convenience. Ideally, you call set_step at the start of every step. Take a look at the toy_rl example.

It would be clearer if this functionality were pushed outside of MetricsProcessor. Otherwise, like @tianyu-l, I'm not sure when I would use set_step directly versus MetricsProcessor.set_step.
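To make the convention concrete, here is a minimal sketch (not the real class; names and fields are illustrative) of the pattern described above: the trainer calls set_step once at the top of every step, and the processor clears per-step state such as tags as a side effect.

```python
# Toy sketch of the set_step convention, NOT torchtitan's MetricsProcessor.
class MetricsProcessor:
    def __init__(self) -> None:
        self.step = 0
        self.step_tags: set[str] = set()  # hypothetical per-step tag state

    def set_step(self, step: int) -> None:
        # Convenience: advance the step AND reset per-step context (tags etc.)
        self.step = step
        self.step_tags.clear()


processor = MetricsProcessor()
for step in range(1, 4):
    processor.set_step(step)  # call once at the start of every step
    # ... train_step(...) would run here, possibly adding tags ...
```

The design question in the thread is whether this reset should live inside the processor or be the caller's responsibility; the sketch shows the former.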
    dist.all_reduce(
        loss_scalar, op=dist.ReduceOp.SUM, group=self.dp_mesh.get_group()
    )
    record_event(
Shouldn't it be using record_metric? Although I haven't internalized the difference between record_metric and record_event.

record_metric goes to metric logs (something you want to put in wandb). record_event goes to system logs (something that goes to Scuba). In your gantt, they are the arrows below. They can also act as "counters", for example to check whether all ranks have reached that spot, or anything else you would like to query later.
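A minimal sketch of the split described above, with stand-in sinks (the list names and signatures are assumptions, not torchtitan's real API): record_metric feeds a wandb-style metric backend, while record_event appends a JSON line to the per-rank system log.

```python
import json

# Stand-ins for the two destinations discussed in the thread.
metric_log: list[dict] = []   # wandb-style metric backend (toy)
system_log: list[str] = []    # per-rank system JSONL file (toy)


def record_metric(name: str, value: float, step: int) -> None:
    # Goes to metric logs: things you want to chart in wandb.
    metric_log.append({"name": name, "value": value, "step": step})


def record_event(payload: dict) -> None:
    # Goes to system logs: point-in-time diagnostics you may query later.
    system_log.append(json.dumps(payload))


record_metric("loss", 2.31, step=10)
record_event({"event": "all_ranks_synced", "rank": 0})
```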
"Shouldn't it be using record_metric?"

Added in the next PR. This PR is about record_span and record_event.

IIUC, you actually wouldn't be logging the metrics here; you're just doing it to show it's possible, but it would actually be handled by record_metric? Otherwise, the conflation of the two is a little confusing.

Yeah, I agree it's confusing. In later PRs Tianyu also questioned it, and I gave an example of how sixlib does it. I think we can iron it out a bit.
        return self.value


    class EventType(StrEnum):
How necessary is this? My worry is that we'd constantly need to add entries to this enum. In the extreme case, it's hard to check whether something I want is already in the existing enum, so I would blindly add a new one. Why don't we just stick with whatever string the user gives? If users want, they can define enums over the string values they have.

I was also debating this, and my resolution was to make it optional in record_span. So, if the user does record_span("my_string"), it works. The reason I decided to keep EventType is that if we want to compare runs or experiments, and there are N engineers working on different branches, there is no guarantee they will use the same EventType. So imagine you build some postprocessing code to check that "FWD_BWD" meets some specification; if we don't have the EventType, the algorithm is not reliable. Do you think that is a strong enough argument to keep it, or would you say to remove it entirely?

If anything, I would strip this down to the bare minimum of shared enums (fwd_bwd, step, optim). The rest could be project-specific. So when I go to set up my RL workflow, I can choose to include some common RL event types I want to track, or I can ignore them entirely. The point is the granularity is up to me.
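The compromise discussed above (shared enum for cross-run queries, free-form strings otherwise) can be sketched as follows; the signature and enum members here are illustrative, not the final API.

```python
from enum import Enum


# Bare-minimum shared enum, as suggested in the thread; projects could
# define their own enums over additional string values.
class EventType(str, Enum):
    FWD_BWD = "fwd_bwd"
    STEP = "step"
    OPTIM = "optim"


def span_name(description, event_type=None):
    # Hypothetical helper: prefer the shared enum value (reliable for
    # cross-run postprocessing), fall back to the free-form description.
    return event_type.value if event_type is not None else description


# Both call styles work:
span_name("my_string")                       # free-form, project-specific
span_name("fwd/bwd pass", EventType.FWD_BWD)  # queryable across runs
```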
    def to_structured_json(log_dict: dict[str, Any]) -> str:
        """Convert a log dict to 4-column JSON (int/normal/double/normvector)."""
Could you share a specification on the naming? It seems there's some convention going on.

Do you mean 'int/normal/double/normvector'? That's for Scuba.

"That's for Scuba"

Can you say more? Is this just a Scuba thing, or is it the standard for other visualization/logging tools as well?

Yeah, Scuba expects these 4 columns. I do NOT know whether it's a standard for other logging tools. I think we would have to understand that over time/through user requests, or have a follow-up trying to integrate DuckDB or something like that. Also, I will probably need to enable handlers using env args, e.g. torchrun ... --enable_handler_scuba, so we can have Meta-specific handlers.
    # ---------------------------------------------------------------------------


    class EventsOnlyFilter(logging.Filter):
Why do we need this? If everything is filtered, we could have required LOG_TYPE_NAME to be given when calling record_span.

It protects against user error. If someone, for some reason, does:

    logger = logging.getLogger("torchtitan.observability.system")  # logger dedicated to structured logging
    logger.info("hello")

then, because of this filter, it would correctly NOT be added to the JSONL.

If I'm a user in the above scenario, I would be very confused. You should include a warning saying that my message X will not be logged, and explaining how to properly log something to the system.

"Should include a warning saying my message X will not be logged and how to properly log something to the system."

You do NOT want to use logging.getLogger("torchtitan.observability.system") for normal logger.info calls. This is internal to record_span. This would only ever happen if a user really tried to make a mistake, and even then it would be safe: the logger would behave like a normal logger, printing to the console, and it would not damage the output JSONL. To log to the JSONL, the user has to pass the "extra" fields, which record_span does. To initialize the regular logger in titan, we call init_logger().
    rank = int(os.environ.get("RANK", 0))

    # --- System handler ---
    sys_logger = logging.getLogger(SYSTEM_LOGGER_NAME)
It would be helpful if you could also help improve logging module naming across all torchtitan components.

What do you have in mind? Do you want the logger to show the call-site info?
    Adds per-rank system and experiment JSONL handlers for structured
    logging. Does NOT set up console logging — call ``init_logger()``
    from ``torchtitan.tools.logging`` for that.
Do you think we can put both of them into torchtitan/components/observability?

Yes. I wanted to minimize the number of changes that weren't necessary, since the PRs were already so long.
    )


    class record_span(ContextDecorator):  # noqa: N801
Noob question: what's the difference between a context decorator and a context manager?
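For reference: contextlib.ContextDecorator (from the Python standard library) is a mixin that lets a context manager class also be applied as a function decorator, so the same span can wrap either a with-block or a whole function. A quick toy illustration (the counting logic is purely for demonstration):

```python
from contextlib import ContextDecorator


class record_span(ContextDecorator):  # noqa: N801  (toy version, real one records timings)
    entered = 0  # class-level counter just to observe enter calls

    def __enter__(self):
        record_span.entered += 1
        return self

    def __exit__(self, *exc):
        return False  # do not suppress exceptions


# Usable as a plain context manager...
with record_span():
    pass


# ...and, thanks to ContextDecorator, also as a decorator.
@record_span()
def train_step():
    pass


train_step()
```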
    )
    handler = StructuredLoggingHandler(filepath=sys_path)
    handler.setFormatter(StructuredJSONFormatter(rank=rank, source=source))
    sys_logger.addHandler(handler)
Oh, IIUC this is using the registration mechanism provided by Python's logging.getLogger. Sounds neat!
    message = record.getMessage()
    if message is not None:
        if len(message) <= MAX_MESSAGE_SIZE:
Why is there a MAX_MESSAGE_SIZE?
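The question above is not answered in the thread; a common reason for such a cap is to keep any single log row from bloating the JSONL file or the downstream ingestion pipeline. A hypothetical sketch of how the check in the snippet might be used (the constant value and truncation marker are assumptions):

```python
MAX_MESSAGE_SIZE = 4096  # assumed value, for illustration only


def clamp_message(message):
    # Oversized messages are truncated rather than dropped, so a runaway
    # f-string cannot blow up a log row.
    if message is not None and len(message) > MAX_MESSAGE_SIZE:
        return message[:MAX_MESSAGE_SIZE] + "...[truncated]"
    return message


short = clamp_message("hello")          # unchanged
long = clamp_message("x" * 10_000)      # capped
```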
        TEXT = "text"


    class ExtraFields(StrEnum):
Sorry, I just don't understand this one. Is there any more you can share?
    # Simulate GC on every 5th step (mirrors gc_handler.run)
    if step % 5 == 0:
        add_step_tag("gc")
Q: Why wouldn't I use record_span to get a timing on how long GC takes?

We should; we do it in PR 7. I don't know if I forgot to do it in the toy example. add_step_tag("gc") is just a tag that will be added to every other log moving forward. It's a way to say: "hey, this step may be slower than normal, so if you want, you can filter out all steps with the 'gc' tag".
    if step % EVAL_FREQ == 0:
        self.validate(tokens, labels, loss_mask)
        add_step_tag("eval")
Q: Why does "eval" get a step tag and a record_span? Do these go to different things? If I record a span, shouldn't that also tell me where I am in the step?

Yes, record_span saves the time. add_step_tag will add a tag to any subsequent log, so we can know that the log was generated in a step that also ran eval. Does this make sense?
    if rank == 0:
        print(f"Done. Output: {OUTPUT_DIR}")
        sys_logs = os.path.join(OUTPUT_DIR, "system_logs")
These would be nice as convenience methods - not sure if added later in the stack.
Stack from ghstack (oldest at bottom):
How to review
Prioritize the files under /observability. Check how they are used in the toy examples.
No need to nitpick the toy examples; the real implementation in the trainer is in PRs 6 and 7.
Summary
The system logging layer. Every process writes structured JSONL to disk via Python's logging module.
Two APIs:
- record_span(description, event_type) — context manager that writes START/END events with wall-clock duration. Used for timing training phases (forward/backward, optimizer, data loading, etc.).
- record_event({"key": value, ...}) — point-in-time scalar snapshots for per-rank diagnostics.

Example
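The two APIs above compose as in this standalone sketch (the real implementations live in torchtitan; this toy version just appends dicts to a list instead of writing JSONL):

```python
import time
from contextlib import contextmanager

logs: list[dict] = []  # stand-in for the per-rank system JSONL file


@contextmanager
def record_span(description: str):
    # Writes START/END events with wall-clock duration, as described above.
    start = time.perf_counter()
    logs.append({"event": f"{description}_start"})
    try:
        yield
    finally:
        delta_ms = (time.perf_counter() - start) * 1000
        logs.append({"event": f"{description}_end", "delta_ms": delta_ms})


def record_event(payload: dict) -> None:
    # Point-in-time scalar snapshot for per-rank diagnostics.
    logs.append(payload)


with record_span("train/fwd_bwd"):
    pass  # forward/backward would run here

record_event({"loss": 2.31, "rank": 0})
```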
Output folder (from toy_spmd on this PR)
Gantt chart
toy_rl: (gantt chart screenshot)

toy_spmd: (gantt chart screenshot)
Others
- init_observability(source, output_dir) sets up the JSONL file handlers. Each rank gets its own file: {output_dir}/system_logs/{source}_rank_{rank}_system.jsonl.
- The JSONL format uses four typed columns (int, normal, double, normvector) for easy ingestion into Grafana, DuckDB, etc.
- generate_gantt_trace(log_dir, output_path) reads all system JSONL files and produces a Chrome Trace JSON. Open it in Perfetto to see a gantt chart of every record_span across all ranks.

Also includes: the EventType enum for categorizing spans, set_step/add_step_tag for step context, and InflightEventTrackingHandler for crash forensics.

Test plan
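For context on the gantt output, the Chrome Trace ("trace event") format that Perfetto consumes is a JSON object with a traceEvents list; complete spans use ph "X" with microsecond timestamps and durations. A sketch of what a generator like generate_gantt_trace might emit (the span names and timings here are made up, and one track per rank is an assumption):

```python
import json

# Minimal Chrome Trace JSON: two complete ("X") spans on rank 0's track.
trace = {
    "traceEvents": [
        {"name": "train/fwd_bwd", "ph": "X", "ts": 0, "dur": 15_000,
         "pid": 0, "tid": 0},  # 15 ms span starting at t=0
        {"name": "train/optim", "ph": "X", "ts": 15_000, "dur": 3_000,
         "pid": 0, "tid": 0},  # 3 ms span right after it
    ]
}
trace_json = json.dumps(trace)
# Writing trace_json to a .json file and opening it in Perfetto or
# chrome://tracing renders each span as a bar on a timeline.
```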
Run toy_spmd:
python -m torch.distributed.run --nproc_per_node=4 -m torchtitan.experiments.observability.toy_spmd

Run toy_rl:
python -m torchtitan.experiments.observability.toy_rl
Console output (toy_rl)
Sample of generated logs
{"int": {"tid": 1056731, "thread_time_ns": 3325664547, "rank": 0, "pid": 1056731, "time": 1773610642, "time_ms": 1773610642167, "time_us": 1773610642167603, "seq_id": 0}, "normal": {"source": "trainer", "host_name": "devgpu011.ldc3.facebook.com", "log_type": "event", "log_type_name": "build_model_start", "event_name": "setup/model_build", "caller": "torchtitan/experiments/observability/toy_spmd.py:168:__init__", "log_file": "toy_spmd.py", "log_function": "__init__", "log_level": "INFO", "logger_name": "torchtitan.observability.system", "message": "[step N/A] setup/model_build build_model_start"}, "double": {"delta_ms": 0.00051083043217659}, "normvector": {}}
{"int": {"tid": 1056731, "thread_time_ns": 3998473694, "rank": 0, "pid": 1056731, "time": 1773610642, "time_ms": 1773610642858, "time_us": 1773610642858849, "seq_id": 1}, "normal": {"source": "trainer", "host_name": "devgpu011.ldc3.facebook.com", "log_type": "event", "log_type_name": "build_model_end", "event_name": "setup/model_build", "caller": "torchtitan/experiments/observability/toy_spmd.py:168:__init__", "log_file": "toy_spmd.py", "log_function": "__init__", "log_level": "INFO", "logger_name": "torchtitan.observability.system", "message": "[step N/A] setup/model_build build_model_end took 691.24 ms"}, "double": {"delta_ms": 691.2433146499097, "value": 691.238438244909}, "normvector": {}}