[doc] Add monitoring and observability guidelines #13
Conversation
Monitoring.md (outdated)
- Minimum uptime of 99.9%
- Maximum job queue time of 5 minutes
- Job execution time variance within ±10% of baseline
- Response time to critical alerts within 15 minutes
We have an alerting channel on Slack over at #pytorch-infra-alerts. Should we list that as the required place to report these alerts?
Yes, we can list that here. I've basically left stubs throughout the draft so that we can add specifics like these.
Monitoring.md (outdated)
Production runners must maintain:

- Minimum uptime of 99.9%
- Maximum job queue time of 5 minutes
My concern about minimum uptime / maximum job queue time: who would be responsible for ensuring these requirements are met? I think today it's best effort, since I've seen outages happen over the weekend that weren't addressed until Monday.
Do we have an idea of the current uptime and queue times, and what our expectations would be?
@seemethere @jeanschmidt @ZainRizvi what are the current expectations for CI availability today?
Force-pushed from 8276d58 to fe1f0fc (Signed-off-by: Jibin Varghese <[email protected]>)
Force-pushed from fe1f0fc to e2001a2
Thanks for all the updates; I left a few comments.
Runners must track:

- Registration/unregistration events
I couldn't find a webhook type for runner registration/unregistration on the GitHub side, so this may have to be produced by the individual runners. When ARC is used, ARC will produce this metric out of the box.
Yes. I was thinking of ARC when I wrote this.
Runners must track:

- Registration/unregistration events
- Job start/completion times
GitHub can trigger webhooks for workflow_run and workflow_job; we could use those to collect job metrics centrally, but I don't know if that will show a breakdown per runner. Again, when ARC is used, ARC will produce this metric out of the box.
Yes. I was thinking of ARC when I wrote this.
GitHub already shares this data by default. For context, we take the webhooks GitHub emits on job/workflow updates and store them directly into ClickHouse.
The thing it sort of lacks, though, is tying that job start/end time to a particular cloud or instance.
GitHub does share the runner label, which today is enough to uniquely identify a specific cloud, but I'm not sure how that will work with runner groups.
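A minimal sketch of the centralized collection idea discussed here, assuming a Flask receiver and an in-memory list as a stand-in for ClickHouse; the field names follow GitHub's documented workflow_job webhook payload, but the endpoint path and storage are illustrative only.

```python
# Sketch only: turn GitHub `workflow_job` webhook deliveries into per-runner
# job records. The endpoint path and the in-memory list are placeholders.
from flask import Flask, request

app = Flask(__name__)
job_records = []  # stand-in for a real store such as ClickHouse

@app.route("/webhooks/github", methods=["POST"])
def handle_workflow_job():
    event = request.get_json(silent=True) or {}
    job = event.get("workflow_job") or {}
    if event.get("action") == "completed":
        job_records.append({
            "job_id": job.get("id"),
            "run_id": job.get("run_id"),
            "runner_name": job.get("runner_name"),         # ties the job to a runner/instance
            "runner_group": job.get("runner_group_name"),  # relevant once runner groups are used
            "labels": job.get("labels", []),               # today this identifies the cloud
            "started_at": job.get("started_at"),
            "completed_at": job.get("completed_at"),
            "conclusion": job.get("conclusion"),
        })
    return "", 204
```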
- Registration/unregistration events
- Job start/completion times
- Queue wait times
Is this the time the job queues after a runner has been assigned, or the time a job queues waiting for a runner to be assigned to it?
The former -- job queues after a runner has been assigned
We have rough measurements available for this by default today: we take the time the job was first pushed to ClickHouse (which happens after we get the new job creation webhook from GitHub) and then take the delta against the time the job actually started running. It has a small margin of error but is generally accurate enough.
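A rough sketch of that calculation, assuming each job record carries an ingestion timestamp (first_seen, when the creation webhook landed in ClickHouse) and the started_at timestamp; the function and variable names are illustrative.

```python
# Sketch only: queue wait approximated as started_at minus the time the job
# creation webhook was first ingested; clamp at zero to absorb clock skew.
from datetime import datetime, timezone

def queue_wait_seconds(first_seen: datetime, started_at: datetime) -> float:
    return max((started_at - first_seen).total_seconds(), 0.0)

first_seen = datetime(2025, 1, 1, 12, 0, 5, tzinfo=timezone.utc)
started_at = datetime(2025, 1, 1, 12, 3, 20, tzinfo=timezone.utc)
print(queue_wait_seconds(first_seen, started_at))  # 195.0
```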
Teams providing runners to the pool must:

- Implement OpenTelemetry data source integration to HUD
- Support real-time status overview
- Support resource utilization graphs
- Alert history and status
- Runner pool capacity visualization
Does HUD integration add any requirements on top of the metrics and technical requirements above?
I think it would be good to call out any metrics required for HUD specifically, and not include HUD itself as a requirement for runner providers.
@ZainRizvi might know better. I was thinking we need to centralize the monitoring in one place, and HUD seemed like a good candidate for it, since it already exists.
HUD itself doesn't look at any infrastructure-related data today. Right now all of its data is taken directly from GitHub (so it'll automatically keep working for any new cloud that spins up), though we may add more infra data there in the future.
For providing a data source, the general criterion might be to provide that data in a queryable format. We're considering migrating our dashboards over to Grafana (which is easier to build new dashboards on than HUD). The common bit both need is an interface that can be queried. Today both HUD and Grafana are powered by our ClickHouse database.
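For illustration, a hedged sketch of what "an interface that can be queried" could look like from a client's point of view, using the clickhouse-connect Python client; the host, table, and column names are made up, not the real schema.

```python
# Sketch only: query a hypothetical job-events table for per-label p50 queue
# wait over the last day. Host, table, and column names are placeholders.
import clickhouse_connect

client = clickhouse_connect.get_client(host="clickhouse.example.org", username="readonly")
result = client.query(
    """
    SELECT runner_label, quantile(0.5)(queue_wait_seconds) AS p50_queue_wait
    FROM workflow_job_events
    WHERE started_at >= now() - INTERVAL 1 DAY
    GROUP BY runner_label
    ORDER BY p50_queue_wait DESC
    """
)
for runner_label, p50_queue_wait in result.result_rows:
    print(runner_label, p50_queue_wait)
```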
### Alternative Dashboards

Teams may implement:
What does "teams" refer to? I think it's worth clarifying. My take:
- If the pool is on a public cloud (credits, dedicated billing account, pre-provisioned resources), the team would be the PyTorch infra team.
- If the resources are on a private cloud or private resource pool, the team would be the individuals from the company sharing the resources who are responsible for their operations.
Ack.
#### PyTorch Runners

Must implement:

- Dedicated monitoring namespace
- Resource quotas and limits
- Custom metrics for PyTorch-specific workloads
- Integration with existing PyTorch monitoring infrastructure

#### Community Runners

Must implement:

- Separate monitoring namespace
- Basic resource monitoring
- Job execution metrics
- Error tracking and reporting
This section is not clear to me, perhaps we can expand on it a bit during the WG?
Ack.
Addressing a couple of comments.
A candidate runner pool must:

- Undergo stability assessment before deployment in critical CI/CD workflows
- Maintain performance metrics during test jobs
What do you have in mind when you say "performance metrics"?
### Metrics Requirements

All runners must collect and expose the following metrics on [hud.pytorch.org/metrics](https://hud.pytorch.org/metrics)
The HUD metrics are available to all CI clouds for free because the data is sourced directly from GitHub, without considering the cloud the jobs are executed on.
- Job start/completion times
- Queue wait times
- Job execution duration
- Resource utilization during jobs
Is this CPU/GPU/memory utilization? Is the idea to track this over time within a specific job? The exact shape of what's recorded will be important here.
FYI, we have per-job utilization reports available today.
Example report: https://hud.pytorch.org/utilization/15989391092/45100138715/1
To find it, go to a commit page (like this one) and click on the "utilization report" button next to a job.
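As one possible answer to the "exact shape" question, a sketch of a per-job utilization series sampled at a fixed interval using psutil; GPU sampling would need a vendor tool and is omitted, and the job id is hypothetical.

```python
# Sketch only: sample CPU and memory utilization for a job as a timestamped
# series. GPU metrics would need e.g. nvidia-smi and are left out here.
import time
import psutil

def sample_utilization(job_id: str, interval_s: float = 5.0, samples: int = 3) -> list[dict]:
    series = []
    for _ in range(samples):
        series.append({
            "job_id": job_id,
            "timestamp": time.time(),
            "cpu_percent": psutil.cpu_percent(interval=interval_s),  # blocks for interval_s
            "memory_percent": psutil.virtual_memory().percent,
        })
    return series

print(sample_utilization("hypothetical-job-123", interval_s=1.0))
```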
All monitoring implementations must:

- Expose metrics in OpenTelemetry format
It would be good to define the exact format each metric is expected to be emitted in, so that they're all in a consistent shape and can be queried easily.
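A hedged sketch of what a shared naming and labeling convention could look like with the OpenTelemetry Python SDK; the metric name, unit, and attribute keys below are placeholders to be agreed on, not an existing standard.

```python
# Sketch only: emit a queue-time histogram with a consistent name and label
# set. Names and attribute keys are illustrative, pending an agreed convention.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("ci.runner.monitoring")

queue_time = meter.create_histogram(
    name="ci.runner.job.queue_time",
    unit="s",
    description="Time a job waited before starting on an assigned runner",
)
queue_time.record(
    42.0,
    attributes={
        "runner_pool": "example-pool",   # placeholder label set
        "cloud": "example-cloud",
        "runner_label": "linux.4xlarge",
    },
)
```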
- Expose metrics in OpenTelemetry format
- Follow standardized metric naming conventions
- Use consistent labeling across all runners
- Implement proper metric aggregation and sampling
Is the idea that each cloud will host its own metrics (hence the aggregation/sampling requirements)?
Or are we expecting all metrics-related data to be emitted to a central location, letting that central service take care of aggregation/sampling?
Production runners must maintain:

- Minimum uptime of 99.9%
- Maximum job queue time of 5 minutes
I want to square this with the actual behavior we see in the Meta team, to make sure we don't hold other clouds accountable to a higher standard than what the Meta cloud is held to. Perhaps something like:
- p99 queue time of 5 minutes per hour, expecting most jobs to have very little queuing
- pMax queue time of 30 minutes per day. Exceeding this means there's an outage.
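A small sketch of how the proposed thresholds could be checked against a window of observed queue waits; the sample values are fabricated for illustration.

```python
# Sketch only: evaluate p99 and worst-case queue wait (seconds) for a window
# of jobs against the proposed 5-minute and 30-minute thresholds.
import numpy as np

queue_waits = np.array([3, 8, 12, 20, 45, 70, 110, 240, 290, 330])  # made-up sample

p99 = np.percentile(queue_waits, 99)
worst = queue_waits.max()

print(f"p99 queue wait: {p99:.0f}s (proposed hourly target: <= 300s)")
print(f"max queue wait: {worst:.0f}s (proposed daily outage threshold: 1800s)")
```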
- Minimum uptime of 99.9%
- Maximum job queue time of 5 minutes
- Job execution time variance within ±10% of baseline
Thanks to caching, some jobs tend to have high variance between runs. Maybe we can tighten this up to "P50 job execution time variance...".
- Maximum job queue time of 5 minutes
- Job execution time variance within ±10% of baseline
- Response time to critical alerts within 15 minutes
- Maximum capacity reduction of 10%
With capacity defined as: "Theoretical maximum number of jobs of a given type that can be run in this cloud in parallel"?
@codeJRV looks like there are some outstanding comments. Will you be able to address them before we merge?
Add basic monitoring guidelines