
Conversation

@codeJRV commented May 31, 2025

Add basic monitoring guidelines

Monitoring.md Outdated
- Minimum uptime of 99.9%
- Maximum job queue time of 5 minutes
- Job execution time variance within ±10% of baseline
- Response time to critical alerts within 15 minutes

So we have an alerting channel on Slack over at #pytorch-infra-alerts; should we list that as the required place to report alerts?

Author

Yes, we can list that here. I've basically left stubs throughout the draft so that we can add specifics like these.

Monitoring.md Outdated
Production runners must maintain:

- Minimum uptime of 99.9%
- Maximum job queue time of 5 minutes

My concern about minimum uptime / max job queue time is: who would be responsible for ensuring these requirements are met?

I think today it's best effort since I've seen outages happen over the weekend that weren't addressed until Monday.

Author

Do we have an idea of the current uptime and queue times, and what our expectations would be?


@seemethere @jeanschmidt @ZainRizvi what are the current expectations for CI availability today?

@afrittoli
Collaborator

@codeJRV could you add a DRAFT notice at the top and in the title, similar to #16?

@codeJRV force-pushed the jibinv-add-monitoring-reqs branch 5 times, most recently from 8276d58 to fe1f0fc on June 30, 2025 18:49
@codeJRV force-pushed the jibinv-add-monitoring-reqs branch from fe1f0fc to e2001a2 on June 30, 2025 18:53
Collaborator

@afrittoli left a comment


Thanks for all the updates. I left a few comments.


Runners must track:

- Registration/unregistration events
Collaborator

I couldn't find a webhook type for runner registration/unregistration on the GitHub side, so this may have to be produced by the various runners individually. When ARC is used, ARC will produce this metric out of the box.

Author

Yes. I was thinking of ARC when I wrote this.

Runners must track:

- Registration/unregistration events
- Job start/completion times
Collaborator

GitHub can trigger webhooks for workflow_run and workflow_job; we could use those to collect job metrics centrally, but I don't know whether that will show a breakdown per runner. Again, when ARC is used, ARC will produce this metric out of the box.

Author

Yes. I was thinking of ARC when I wrote this.


GitHub already shares this data by default. For context, we take the webhooks GitHub emits on job/workflow updates and store them directly into ClickHouse.

The thing it sort of lacks, though, is tying that job start/end time to a particular cloud or instance.

GitHub does share the runner label, which today is enough to uniquely identify a specific cloud, but I'm not sure how that'll work with runner groups.
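
For illustration, a minimal sketch (Python) of pulling job timing and the runner label out of a workflow_job webhook payload before it lands in storage; the field names follow GitHub's documented payload, while the record shape and any downstream storage are hypothetical:

```python
# Minimal sketch: extracting job timing and runner identity from a
# `workflow_job` webhook payload. Field names follow GitHub's documented
# payload; the returned record shape is hypothetical.
from datetime import datetime
from typing import Any, Optional


def parse_workflow_job_event(payload: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return a flat record for completed jobs, or None for other actions."""
    if payload.get("action") != "completed":
        return None
    job = payload["workflow_job"]
    return {
        "job_id": job["id"],
        "conclusion": job["conclusion"],
        "started_at": datetime.fromisoformat(job["started_at"].replace("Z", "+00:00")),
        "completed_at": datetime.fromisoformat(job["completed_at"].replace("Z", "+00:00")),
        # Runner labels are currently the main handle for attributing a job
        # to a specific cloud; runner groups may change this.
        "labels": job.get("labels", []),
        "runner_name": job.get("runner_name"),
        "runner_group_name": job.get("runner_group_name"),
    }
```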


- Registration/unregistration events
- Job start/completion times
- Queue wait times
Collaborator

Is this the time the job queues after a runner has been assigned, or the time a job queues waiting for a runner to be assigned to it?

Author

The former -- job queues after a runner has been assigned


We have rough measurements available for this by default today: we take the time the job was first pushed to ClickHouse (which happens after we get the new job creation webhook from GitHub) and then take the delta against the time the job actually started running. It has a small margin of error but is generally accurate enough.
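
A small sketch of the queue-time approximation described above, assuming the two timestamps are already available (the names recorded_at/started_at are placeholders):

```python
# Sketch of the queue-time approximation described above: the delta between
# when the queued job was first recorded (via the webhook) and when it
# actually started running. Field names here are placeholders.
from datetime import datetime, timezone


def queue_wait_seconds(recorded_at: datetime, started_at: datetime) -> float:
    """Approximate queue wait; webhook delivery lag makes recorded_at slightly
    late, so this can mildly understate the true wait."""
    return max((started_at - recorded_at).total_seconds(), 0.0)


# Example: a job recorded as queued at 18:49:00 that starts at 18:52:30
recorded = datetime(2025, 6, 30, 18, 49, 0, tzinfo=timezone.utc)
started = datetime(2025, 6, 30, 18, 52, 30, tzinfo=timezone.utc)
print(queue_wait_seconds(recorded, started))  # 210.0
```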

Comment on lines +95 to +101
Teams providing runners to the pool must

- Implementing OpenTelemetry data source integration to HUD
- Support real-time status overview
- Support resource utilization graphs
- Alert history and status
- Runner pool capacity visualization
Collaborator

Does HUD integration add any requirements on top of the metrics and technical requirements above?
I think it would be good to call out any metric required for HUD specifically, and not include HUD itself as a requirement for runner providers.

Author

@ZainRizvi might know better. I was thinking we need to centralize the monitoring in one place, and HUD seemed like a good candidate for it, since it already exists.


HUD itself doesn't look at any infrastructure-related data today. Right now all of its data is taken directly from GitHub (so it'll automatically keep working for any new cloud that spins up), though we may add more infra data there in the future.


For providing a data source, the general criteria might be to provide that data in a queryable format. We're considering migrating our dashboards over to Grafana (which is easier to build new dashboards on than HUD). The common bit both need is to have an interface that can be queried. Today both HUD and Grafana are powered by our ClickHouse database.
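
As one illustration of "queryable format", a hedged sketch using the clickhouse-connect Python client; the host, table, and column names are placeholders rather than the actual schema:

```python
# Hedged sketch: querying job queue times from a ClickHouse backend, the way
# HUD/Grafana-style dashboards might. Host, table, and column names are
# placeholders, not the real schema.
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="clickhouse.example.internal",  # placeholder endpoint
    username="readonly",
    password="...",
)

result = client.query(
    """
    SELECT runner_label,
           quantile(0.99)(queue_seconds) AS p99_queue_seconds
    FROM ci_jobs
    WHERE started_at >= now() - INTERVAL 1 DAY
    GROUP BY runner_label
    ORDER BY p99_queue_seconds DESC
    """
)
for runner_label, p99 in result.result_rows:
    print(f"{runner_label}: p99 queue {p99:.0f}s")
```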


### Alternative Dashboards

Teams may implement:
Collaborator

What does "teams" refer to? I think it's worth clarifying. My take:

- If the pool is on a public cloud (credits, dedicated billing account, pre-provisioned resources), the team would be the PyTorch infra team.
- If the resources are on a private cloud or in a private resource pool, the team would be the individuals from the company sharing the resources who are responsible for their operations.

Author

Ack.

Comment on lines +148 to +164
#### PyTorch Runners

Must implement:

- Dedicated monitoring namespace
- Resource quotas and limits
- Custom metrics for PyTorch-specific workloads
- Integration with existing PyTorch monitoring infrastructure

#### Community Runners

Must implement:

- Separate monitoring namespace
- Basic resource monitoring
- Job execution metrics
- Error tracking and reporting
Collaborator

This section is not clear to me; perhaps we can expand on it a bit during the WG?

Author

Ack.

Author
@codeJRV left a comment

Addressing a couple of comments.


A candidate runner pool must:

- Undergo stability assessment before deployment in critical CI/CD workflows
- Maintain performance metrics during test jobs


What do you have in mind when you say "performance metrics"?


### Metrics Requirements

All runners must collect and expose the following metrics on [hud.pytorch.org/metrics](https://hud.pytorch.org/metrics)


The HUD metrics are available to all CI clouds for free, because the data is sourced directly from GitHub without considering the cloud the jobs are executed on.

- Job start/completion times
- Queue wait times
- Job execution duration
- Resource utilization during jobs


Is this CPU/GPU/Memory utilization? Is the idea to track this over time within a specific job?

The exact shape of what's recorded will be important here.

FYI, we have per-job utilization reports available today.
Example report: https://hud.pytorch.org/utilization/15989391092/45100138715/1

To find it, you go to a commit page (like this one) and click on the "utilization report" button next to a job.
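
On the question of shape, one possibility is periodic timestamped samples collected while a job runs; a hedged sketch using psutil for CPU and memory (GPU sampling omitted, field names hypothetical):

```python
# One possible shape for "resource utilization during jobs": periodic
# timestamped samples collected while a job runs. Uses psutil for CPU/memory;
# GPU sampling (e.g. via NVML) is omitted here. Field names are hypothetical.
import time
from datetime import datetime, timezone

import psutil


def sample_utilization() -> dict:
    mem = psutil.virtual_memory()
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "cpu_percent": psutil.cpu_percent(interval=None),
        "memory_percent": mem.percent,
    }


def collect_during_job(duration_s: float, period_s: float = 5.0) -> list[dict]:
    """Collect samples for the lifetime of a job (duration known here for demo)."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append(sample_utilization())
        time.sleep(period_s)
    return samples
```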


All monitoring implementations must:

- Expose metrics in OpenTelemetry format


It would be good to define the exact format each metric is expected to be emitted in, so that they're all in a consistent shape and can be queried easily.
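
One way to pin down the exact format would be to standardize instrument names, units, and attribute keys; a hedged sketch with the OpenTelemetry Python SDK, where the metric and attribute names are illustrative rather than an agreed convention:

```python
# Hedged sketch of emitting runner metrics via the OpenTelemetry Python SDK,
# printed to the console for demonstration. Metric names, units, and attribute
# keys are illustrative, not an agreed-upon convention.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=5000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("pytorch.ci.runner")

jobs_completed = meter.create_counter(
    "ci.jobs.completed", unit="1", description="Completed CI jobs"
)
queue_time = meter.create_histogram(
    "ci.job.queue_time", unit="s", description="Seconds a job waited for a runner"
)

# Consistent labels across runner providers make cross-cloud queries possible.
attrs = {"runner.pool": "linux.4xlarge", "cloud": "example-cloud"}
jobs_completed.add(1, attrs)
queue_time.record(42.0, attrs)
```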

- Expose metrics in OpenTelemetry format
- Follow standardized metric naming conventions
- Use consistent labeling across all runners
- Implement proper metric aggregation and sampling


Is the idea that each cloud will host its own metrics (thus the aggregation/sampling requirements)?

Or are we expecting all metrics-related data to be emitted to a central location, and to let that central service take care of aggregation/sampling?

Production runners must maintain:

- Minimum uptime of 99.9%
- Maximum job queue time of 5 minutes


I want to square this with the actual behavior we see in the Meta team, to make sure we don't hold other clouds accountable to a higher standard than what the Meta cloud is held to.

Perhaps something like:

- p99 queue time of 5 minutes per hour, expecting most jobs to have very little queuing
- pMax queue time of 30 minutes per day; exceeding this means there's an outage
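
A small sketch of how the proposed thresholds could be checked against queue-time samples; the data layout is hypothetical and the limits mirror the suggestion above:

```python
# Sketch of checking the proposed thresholds against queue-time samples:
# p99 of 5 minutes per hour, and a 30-minute maximum per day.
# Thresholds mirror the suggestion above; the data layout is hypothetical.
from statistics import quantiles

P99_LIMIT_S = 5 * 60
PMAX_LIMIT_S = 30 * 60


def hourly_p99_ok(queue_times_s: list[float]) -> bool:
    """queue_times_s: queue waits (seconds) for jobs started within one hour."""
    if len(queue_times_s) < 2:
        return True
    p99 = quantiles(queue_times_s, n=100)[-1]  # 99th percentile estimate
    return p99 <= P99_LIMIT_S


def daily_pmax_ok(queue_times_s: list[float]) -> bool:
    """queue_times_s: queue waits (seconds) for jobs started within one day."""
    return max(queue_times_s, default=0.0) <= PMAX_LIMIT_S
```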


- Minimum uptime of 99.9%
- Maximum job queue time of 5 minutes
- Job execution time variance within ±10% of baseline


Thanks to caching, some jobs tend to have high variance between runs. Maybe we can tighten this up to "P50 job execution time variance..."
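
A minimal sketch of what a P50-based check could look like, assuming a recorded baseline median; the function and baseline handling are hypothetical:

```python
# Sketch of a "P50 job execution time variance within ±10% of baseline" check,
# which is less sensitive to cache-driven outliers than a per-job bound.
# Baseline handling here is hypothetical.
from statistics import median


def p50_within_baseline(durations_s: list[float], baseline_p50_s: float,
                        tolerance: float = 0.10) -> bool:
    """True if the median duration is within ±tolerance of the baseline median."""
    if not durations_s:
        return True
    p50 = median(durations_s)
    return abs(p50 - baseline_p50_s) <= tolerance * baseline_p50_s
```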

- Maximum job queue time of 5 minutes
- Job execution time variance within ±10% of baseline
- Response time to critical alerts within 15 minutes
- Maximum capacity reduction of 10%


With capacity defined as: "Theoretical maximum number of jobs of a given type that can be run in this cloud in parallel"?


@zxiiro commented Jul 22, 2025

@codeJRV looks like there are some outstanding comments. Will you be able to address them before we merge?
