[doc] Add monitoring and observability guidelines #13
Conversation
Monitoring.md (outdated)
- Minimum uptime of 99.9%
- Maximum job queue time of 5 minutes
- Job execution time variance within ±10% of baseline
- Response time to critical alerts within 15 minutes
We have an alerting channel on Slack over at #pytorch-infra-alerts. Should we list that as the required place to report these alerts?
Yes, we can list that here. I've basically left stubs throughout the draft so that we can add specifics like these.
Monitoring.md (outdated)
Production runners must maintain:

- Minimum uptime of 99.9%
- Maximum job queue time of 5 minutes
My concern about minimum uptime / maximum job queue time: who would be responsible for ensuring these requirements are met? I think today it's best effort, since I've seen outages happen over the weekend that weren't addressed until Monday.
Do we have an idea of the current uptime and queue times, and what our expectations would be?
@seemethere @jeanschmidt @ZainRizvi what are the current expectations for CI availability today?
Force-pushed from 8276d58 to fe1f0fc (Signed-off-by: Jibin Varghese <[email protected]>)
Force-pushed from fe1f0fc to e2001a2
Thanks for all the updates; I left a few comments.
Runners must track:

- Registration/unregistration events
I couldn't find a webhook type for runner registration/unregistration on the GitHub side, so this may have to be produced by the individual runners. When ARC is used, ARC will produce this metric out of the box.
Yes. I was thinking of ARC when I wrote this.
Runners must track:

- Registration/unregistration events
- Job start/completion times
GitHub can trigger webhooks for workflow_run and workflow_job; we could use those to collect job metrics centrally, but I don't know if that will show a breakdown per runner. Again, when ARC is used, ARC will produce this metric out of the box.
Yes. I was thinking of ARC when I wrote this.
GitHub already shares this data by default. For context, we take the webhooks GitHub emits on job/workflow updates and store them directly into ClickHouse.
The thing it sort of lacks, though, is tying that job start/end time to a particular cloud or instance.
GitHub does share the runner label, which today is enough to uniquely identify a specific cloud, but I'm not sure how that will work with runner groups.
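A minimal sketch of the centralized collection idea discussed here, assuming a Flask receiver and an in-memory list as a stand-in for ClickHouse; the field names follow GitHub's documented workflow_job webhook payload, but the endpoint path and storage are illustrative only.

```python
# Sketch only: turn GitHub `workflow_job` webhook deliveries into per-runner
# job records. The endpoint path and the in-memory list are placeholders.
from flask import Flask, request

app = Flask(__name__)
job_records = []  # stand-in for a real store such as ClickHouse

@app.route("/webhooks/github", methods=["POST"])
def handle_workflow_job():
    event = request.get_json(silent=True) or {}
    job = event.get("workflow_job") or {}
    if event.get("action") == "completed":
        job_records.append({
            "job_id": job.get("id"),
            "run_id": job.get("run_id"),
            "runner_name": job.get("runner_name"),         # ties the job to a runner/instance
            "runner_group": job.get("runner_group_name"),  # relevant once runner groups are used
            "labels": job.get("labels", []),               # today this identifies the cloud
            "started_at": job.get("started_at"),
            "completed_at": job.get("completed_at"),
            "conclusion": job.get("conclusion"),
        })
    return "", 204
```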
- Registration/unregistration events
- Job start/completion times
- Queue wait times
Is this the time the job queues after a runner has been assigned, or the time a job queues waiting for a runner to be assigned to it?
The former -- job queues after a runner has been assigned
We have rough measurements available for this by default today: we take the time the job was first pushed to ClickHouse (which happens after we get the new job creation webhook from GitHub) and then take the delta against the time the job actually started running. It has a small margin of error but is generally accurate enough.
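A rough sketch of that calculation, assuming each job record carries an ingestion timestamp (first_seen, when the creation webhook landed in ClickHouse) and the started_at timestamp; the function and variable names are illustrative.

```python
# Sketch only: queue wait approximated as started_at minus the time the job
# creation webhook was first ingested; clamp at zero to absorb clock skew.
from datetime import datetime, timezone

def queue_wait_seconds(first_seen: datetime, started_at: datetime) -> float:
    return max((started_at - first_seen).total_seconds(), 0.0)

first_seen = datetime(2025, 1, 1, 12, 0, 5, tzinfo=timezone.utc)
started_at = datetime(2025, 1, 1, 12, 3, 20, tzinfo=timezone.utc)
print(queue_wait_seconds(first_seen, started_at))  # 195.0
```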
Teams providing runners to the pool must:

- Implement OpenTelemetry data source integration to HUD
- Support real-time status overview
- Support resource utilization graphs
- Alert history and status
- Runner pool capacity visualization
Does HUD integration add any requirements on top of the metrics and technical requirements above?
I think it would be good to call out any metrics required for HUD specifically, and not include HUD itself as a requirement for runner providers.
@ZainRizvi might know better. I was thinking we need to centralize the monitoring in one place, and HUD seemed like a good candidate for it, since it already exists.
HUD itself doesn't look at any infrastructure-related data today. Right now all of its data is taken directly from GitHub (so it'll automatically keep working for any new cloud that spins up), though we may add more infra data there in the future.
For providing a data source, the general criterion might be to provide that data in a queryable format. We're considering migrating our dashboards over to Grafana (which is easier to build new dashboards on than HUD). The common bit both need is an interface that can be queried. Today both HUD and Grafana are powered by our ClickHouse database.
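For illustration, a hedged sketch of what "an interface that can be queried" could look like from a client's point of view, using the clickhouse-connect Python client; the host, table, and column names are made up, not the real schema.

```python
# Sketch only: query a hypothetical job-events table for per-label p50 queue
# wait over the last day. Host, table, and column names are placeholders.
import clickhouse_connect

client = clickhouse_connect.get_client(host="clickhouse.example.org", username="readonly")
result = client.query(
    """
    SELECT runner_label, quantile(0.5)(queue_wait_seconds) AS p50_queue_wait
    FROM workflow_job_events
    WHERE started_at >= now() - INTERVAL 1 DAY
    GROUP BY runner_label
    ORDER BY p50_queue_wait DESC
    """
)
for runner_label, p50_queue_wait in result.result_rows:
    print(runner_label, p50_queue_wait)
```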
### Alternative Dashboards

Teams may implement:
What does "teams" refer to? I think it's worth clarifying. My take:
- If the pool is on a public cloud (credits, dedicated billing account, pre-provisioned resources), the team would be the PyTorch infra team.
- If the resources are on a private cloud or private resource pool, the team would be the individuals from the company sharing the resources who are responsible for their operations.
Ack.
#### PyTorch Runners

Must implement:

- Dedicated monitoring namespace
- Resource quotas and limits
- Custom metrics for PyTorch-specific workloads
- Integration with existing PyTorch monitoring infrastructure

#### Community Runners

Must implement:

- Separate monitoring namespace
- Basic resource monitoring
- Job execution metrics
- Error tracking and reporting
This section is not clear to me, perhaps we can expand on it a bit during the WG?
Ack.
Addressing a couple of comments.
A candidate runner pool must:

- Undergo stability assessment before deployment in critical CI/CD workflows
- Maintain performance metrics during test jobs
What do you have in mind when you say "performance metrics"?
### Metrics Requirements

All runners must collect and expose the following metrics on [hud.pytorch.org/metrics](https://hud.pytorch.org/metrics)
The HUD metrics are available to all CI clouds for free because the data is sourced directly from GitHub, without considering the cloud the jobs are executed on.
- Job start/completion times
- Queue wait times
- Job execution duration
- Resource utilization during jobs
Is this CPU/GPU/memory utilization? Is the idea to track this over time within a specific job? The exact shape of what's recorded will be important here.
FYI, we have per-job utilization reports available today.
Example report: https://hud.pytorch.org/utilization/15989391092/45100138715/1
To find it, go to a commit page (like this one) and click on the "utilization report" button next to a job.
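As one possible answer to the "exact shape" question, a sketch of a per-job utilization series sampled at a fixed interval using psutil; GPU sampling would need a vendor tool and is omitted, and the job id is hypothetical.

```python
# Sketch only: sample CPU and memory utilization for a job as a timestamped
# series. GPU metrics would need e.g. nvidia-smi and are left out here.
import time
import psutil

def sample_utilization(job_id: str, interval_s: float = 5.0, samples: int = 3) -> list[dict]:
    series = []
    for _ in range(samples):
        series.append({
            "job_id": job_id,
            "timestamp": time.time(),
            "cpu_percent": psutil.cpu_percent(interval=interval_s),  # blocks for interval_s
            "memory_percent": psutil.virtual_memory().percent,
        })
    return series

print(sample_utilization("hypothetical-job-123", interval_s=1.0))
```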
All monitoring implementations must:

- Expose metrics in OpenTelemetry format
It would be good to define the exact format each metric is expected to be emitted in, so that they're all in a consistent shape and can be queried easily.
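A hedged sketch of what a shared naming and labeling convention could look like with the OpenTelemetry Python SDK; the metric name, unit, and attribute keys below are placeholders to be agreed on, not an existing standard.

```python
# Sketch only: emit a queue-time histogram with a consistent name and label
# set. Names and attribute keys are illustrative, pending an agreed convention.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("ci.runner.monitoring")

queue_time = meter.create_histogram(
    name="ci.runner.job.queue_time",
    unit="s",
    description="Time a job waited before starting on an assigned runner",
)
queue_time.record(
    42.0,
    attributes={
        "runner_pool": "example-pool",   # placeholder label set
        "cloud": "example-cloud",
        "runner_label": "linux.4xlarge",
    },
)
```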
- Expose metrics in OpenTelemetry format
- Follow standardized metric naming conventions
- Use consistent labeling across all runners
- Implement proper metric aggregation and sampling
Is the idea that each cloud will host its own metrics (hence the aggregation/sampling requirements)?
Or are we expecting all metrics-related data to be emitted to a central location, letting that central service take care of aggregation/sampling?
Production runners must maintain:

- Minimum uptime of 99.9%
- Maximum job queue time of 5 minutes
I want to square this with the actual behavior we see in the Meta team, to make sure we don't hold other clouds accountable to a higher standard than what the Meta cloud is held to. Perhaps something like:
- p99 queue time of 5 minutes per hour, expecting most jobs to have very little queuing
- pMax queue time of 30 minutes per day. Exceeding this means there's an outage.
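A small sketch of how the proposed thresholds could be checked against a window of observed queue waits; the sample values are fabricated for illustration.

```python
# Sketch only: evaluate p99 and worst-case queue wait (seconds) for a window
# of jobs against the proposed 5-minute and 30-minute thresholds.
import numpy as np

queue_waits = np.array([3, 8, 12, 20, 45, 70, 110, 240, 290, 330])  # made-up sample

p99 = np.percentile(queue_waits, 99)
worst = queue_waits.max()

print(f"p99 queue wait: {p99:.0f}s (proposed hourly target: <= 300s)")
print(f"max queue wait: {worst:.0f}s (proposed daily outage threshold: 1800s)")
```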
- Minimum uptime of 99.9%
- Maximum job queue time of 5 minutes
- Job execution time variance within ±10% of baseline
Thanks to caching, some jobs tend to have high variance between runs. Maybe we can tighten this up to "P50 job execution time variance...".
- Maximum job queue time of 5 minutes
- Job execution time variance within ±10% of baseline
- Response time to critical alerts within 15 minutes
- Maximum capacity reduction of 10%
With capacity defined as: "Theoretical maximum number of jobs of a given type that can be run in this cloud in parallel"?
@codeJRV looks like there are some outstanding comments. Will you be able to address them before we merge?
Add basic monitoring guidelines