Skip to content

Commit 76c3349

Browse files
Wip ab/final metrics (#996)
* Adding prometheus and grafana to the nemo curator metrics path Signed-off-by: Abhinav Garg <[email protected]> * Implement safe extraction for tar files in file_utils.py - Added functions `_is_safe_path` and `tar_safe_extract` to ensure safe extraction of tar files, preventing path traversal attacks. - Included necessary imports and updated the file structure by removing outdated files from the ray-curator module. Signed-off-by: [Your Name] [[email protected]] Signed-off-by: Abhinav Garg <[email protected]> * Refactor references from ray_curator to nemo_curator across multiple files - Updated file paths and comments in api-design.md, __init__.py, client.py, and start_prometheus_grafana.py to reflect the new nemo_curator namespace. - Changed package name in package_info.py from ray_curator to nemo_curator. Signed-off-by: [Your Name] [[email protected]] Signed-off-by: Abhinav Garg <[email protected]> * Rename function `get_ray_client` to `start_prometheus_grafana` in start_prometheus_grafana.py for clarity and consistency with the new metrics path. Update the function call in the main execution block accordingly. Signed-off-by: Abhinav Garg <[email protected]> * Adding prometheus and grafana to the nemo curator metrics path Signed-off-by: Abhinav Garg <[email protected]> * Adding README for metrics Signed-off-by: Abhinav Garg <[email protected]> --------- Signed-off-by: Abhinav Garg <[email protected]> Signed-off-by: [Your Name] [[email protected]] Co-authored-by: Sarah Yurick <[email protected]>
1 parent 008060b commit 76c3349

File tree

14 files changed

+3692
-95
lines changed

14 files changed

+3692
-95
lines changed

CONTRIBUTING.md

Lines changed: 0 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -44,8 +44,6 @@ uv sync --extra all
4444

4545
### Dev Pattern
4646

47-
- All upstream work/changes with the new API and ray backend should target the `NeMo-Curator/ray-api` branch.
48-
- When re-using code already in `NeMo-Curator/nemo_curator`, use `git mv` to move those source files into the `ray-curator/ray_curator` namespace.
4947
- Sign and signoff commits with `git commit -sS`. (May be relaxed in the future)
5048
- If project dependencies are updated a new uv lock file needs to be generated. Run `uv lock` and add the changes of the new uv.lock file.
5149

api-design.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -200,4 +200,4 @@ class RayDataExecutor(BaseExecutor):
200200

201201
## Examples
202202

203-
Please refer to the [quickstart](./ray_curator/examples/quickstart.py) for a basic example.
203+
Please refer to the [quickstart](./nemo_curator/examples/quickstart.py) for a basic example.

nemo_curator/__init__.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,6 @@
44

55
from cosmos_xenna.ray_utils.cluster import API_LIMIT
66

7-
# We set these incase a user ever starts a ray cluster with ray_curator, we need these for Xenna to work
7+
# We set these incase a user ever starts a ray cluster with nemo_curator, we need these for Xenna to work
88
os.environ["RAY_MAX_LIMIT_FROM_API_SERVER"] = str(API_LIMIT)
99
os.environ["RAY_MAX_LIMIT_FROM_DATA_SOURCE"] = str(API_LIMIT)

nemo_curator/core/client.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -89,7 +89,7 @@ def start(self) -> None:
8989
msg = (
9090
"No monitoring services are running. "
9191
"Please run the `start_prometheus_grafana.py` "
92-
"script from ray_curator/metrics folder to setup monitoring services separately."
92+
"script from nemo_curator/metrics folder to setup monitoring services separately."
9393
)
9494
logger.warning(msg)
9595

nemo_curator/core/utils.py

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -116,6 +116,10 @@ def init_cluster( # noqa: PLR0913
116116
os.environ["DASHBOARD_METRIC_PORT"] = str(get_free_port(DEFAULT_RAY_DASHBOARD_METRIC_PORT))
117117
os.environ["AUTOSCALER_METRIC_PORT"] = str(get_free_port(DEFAULT_RAY_AUTOSCALER_METRIC_PORT))
118118

119+
# We set some env vars for Xenna here. This is only used for Xenna clusters.
120+
os.environ["XENNA_RAY_METRICS_PORT"] = str(ray_metrics_port)
121+
os.environ["XENNA_RESPECT_CUDA_VISIBLE_DEVICES"] = "1"
122+
119123
proc = subprocess.Popen(ray_command, shell=False) # noqa: S603
120124
logger.info(f"Ray start command: {' '.join(ray_command)}")
121125
os.environ["RAY_ADDRESS"] = f"{ip_address}:{ray_port}"

nemo_curator/metrics/README.md

Lines changed: 62 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,62 @@
1+
# NeMo Curator Metrics
2+
3+
This module provides monitoring and visualization capabilities for NeMo Curator operations using Prometheus and Grafana.
4+
5+
## Overview
6+
7+
The metrics module enables real-time monitoring of data curation workloads by:
8+
- Setting up Prometheus for metrics collection
9+
- Configuring Grafana for visualization dashboards
10+
- Providing pre-built dashboards for Ray/Xenna workloads
11+
12+
## Quick Start
13+
14+
### Start Monitoring Services
15+
16+
```bash
17+
# Start with default ports (Prometheus: 9090, Grafana: 3000)
18+
python -m nemo_curator.metrics.start_prometheus_grafana --yes
19+
20+
# Or specify custom ports
21+
python -m nemo_curator.metrics.start_prometheus_grafana --prometheus_web_port 9091 --grafana_web_port 3001
22+
```
23+
24+
### Run a Pipeline
25+
26+
```bash
27+
python examples/quickstart.py
28+
```
29+
30+
### Access Dashboards
31+
32+
- **Grafana Dashboard**: http://localhost:3000 (admin/admin)
33+
- **Prometheus Dashboard**: http://localhost:9090
34+
35+
### For Ray/Xenna Users
36+
37+
```bash
38+
export XENNA_RAY_METRICS_PORT=8080
39+
```
40+
41+
## Components
42+
43+
- `start_prometheus_grafana.py` - Main script to launch monitoring services
44+
- `utils.py` - Helper functions for downloading and configuring services
45+
- `constants.py` - Configuration constants and templates
46+
- `xenna_grafana_dashboard.json` - Pre-configured Grafana dashboard for Ray workloads
47+
48+
## Cleanup
49+
50+
```bash
51+
# Stop services
52+
pkill -f 'prometheus .*'
53+
pkill -f 'grafana server'
54+
55+
# Remove persistent data
56+
rm -rf data/
57+
```
58+
59+
## Custom Dashboards
60+
61+
Add custom Grafana dashboards by placing JSON files in:
62+
`/tmp/nemo_curator_metrics/grafana/dashboards/`

nemo_curator/metrics/constants.py

Lines changed: 59 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,3 +13,62 @@
1313
# limitations under the License.
1414

1515
DEFAULT_NEMO_CURATOR_METRICS_PATH = "/tmp/nemo_curator_metrics" # noqa: S108
16+
GRAFANA_VERSION = "12.0.2"
17+
18+
PROMETHEUS_YAML_TEMPLATE = """
19+
global:
20+
scrape_interval: 10s # Set the scrape interval to every 10 seconds. Default is every 1 minute.
21+
evaluation_interval: 10s # Evaluate rules every 10 seconds. The default is every 1 minute.
22+
# scrape_timeout is set to the global default (10s).
23+
24+
scrape_configs:
25+
# Scrape from each Ray node as defined in the service_discovery.json provided by Ray.
26+
- job_name: 'ray'
27+
file_sd_configs:
28+
- files:
29+
- /tmp/ray/prom_metrics_service_discovery.json
30+
"""
31+
32+
GRAFANA_INI_TEMPLATE = """
33+
[security]
34+
allow_embedding = true
35+
36+
[auth.anonymous]
37+
enabled = true
38+
org_name = Main Org.
39+
org_role = Viewer
40+
41+
[paths]
42+
provisioning = {provisioning_path}
43+
44+
[server]
45+
http_port = {grafana_web_port}
46+
"""
47+
48+
GRAFANA_DASHBOARD_YAML_TEMPLATE = """
49+
50+
apiVersion: 1
51+
52+
providers:
53+
- name: Ray # Default dashboards provided by OSS Ray
54+
folder: Ray
55+
type: file
56+
options:
57+
path: {dashboards_path}
58+
"""
59+
60+
GRAFANA_DATASOURCE_YAML_TEMPLATE = """
61+
apiVersion: 1
62+
datasources:
63+
- access: proxy
64+
isDefault: true
65+
jsonData: {{}}
66+
name: Prometheus
67+
secureJsonData: {{}}
68+
type: prometheus
69+
url: {prometheus_url}
70+
"""
71+
72+
DEFAULT_PROMETHEUS_WEB_PORT = 9090
73+
74+
DEFAULT_GRAFANA_WEB_PORT = 3000
Lines changed: 116 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,116 @@
1+
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
2+
#
3+
# Licensed under the Apache License, Version 2.0 (the "License");
4+
# you may not use this file except in compliance with the License.
5+
# You may obtain a copy of the License at
6+
#
7+
# http://www.apache.org/licenses/LICENSE-2.0
8+
#
9+
# Unless required by applicable law or agreed to in writing, software
10+
# distributed under the License is distributed on an "AS IS" BASIS,
11+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12+
# See the License for the specific language governing permissions and
13+
# limitations under the License.
14+
15+
import argparse
16+
import os
17+
import sys
18+
import time
19+
20+
from loguru import logger
21+
22+
from nemo_curator.core.utils import get_free_port
23+
from nemo_curator.metrics.constants import (
24+
DEFAULT_GRAFANA_WEB_PORT,
25+
DEFAULT_NEMO_CURATOR_METRICS_PATH,
26+
DEFAULT_PROMETHEUS_WEB_PORT,
27+
)
28+
from nemo_curator.metrics.utils import (
29+
download_and_extract_prometheus,
30+
download_grafana,
31+
is_grafana_running,
32+
is_prometheus_running,
33+
launch_grafana,
34+
run_prometheus,
35+
write_grafana_configs,
36+
)
37+
38+
# ----------------------------
39+
# Helpers
40+
# ----------------------------
41+
42+
43+
def start_prometheus_grafana(
44+
prometheus_web_port: int = DEFAULT_PROMETHEUS_WEB_PORT,
45+
grafana_web_port: int = DEFAULT_GRAFANA_WEB_PORT,
46+
) -> None:
47+
# Check if the prometheus or grafana is running. If yes we assume that they were setup using this script and skip the setup.
48+
if is_prometheus_running() or is_grafana_running():
49+
logger.info("Prometheus or Grafana is already running. Skipping the setup.")
50+
return
51+
52+
# Touch our director for nemo curator incase it doesn't exist
53+
os.makedirs(DEFAULT_NEMO_CURATOR_METRICS_PATH, exist_ok=True)
54+
55+
prometheus_web_port = get_free_port(prometheus_web_port)
56+
grafana_web_port = get_free_port(grafana_web_port)
57+
58+
prometheus_dir = download_and_extract_prometheus()
59+
60+
# Run prometheus
61+
try:
62+
run_prometheus(prometheus_dir, prometheus_web_port)
63+
except Exception as error:
64+
error_msg = f"Failed to start Prometheus: {error}"
65+
logger.error(error_msg)
66+
raise
67+
68+
# -----------------------------
69+
# Grafana setup and launch
70+
# -----------------------------
71+
try:
72+
grafana_dir = download_grafana()
73+
grafana_ini_path = write_grafana_configs(grafana_web_port, prometheus_web_port)
74+
launch_grafana(grafana_dir, grafana_ini_path)
75+
76+
# Wait a bit to ensure Grafana starts
77+
time.sleep(2)
78+
except Exception as error:
79+
error_msg = f"Failed to setup or start Grafana: {error}"
80+
logger.error(error_msg)
81+
raise
82+
83+
logger.info("If you are running using Xenna, please remember to export XENNA_RAY_METRICS_PORT=8080")
84+
logger.info("You can access the grafana dashboard at http://localhost:3000, username: admin, password: admin")
85+
logger.info("You can access the prometheus dashboard at http://localhost:9090")
86+
logger.info("Currently, we only provide a xenna dashboard,")
87+
logger.info("but you can add more dashboards by adding json files")
88+
logger.info(f"in {DEFAULT_NEMO_CURATOR_METRICS_PATH}/grafana/dashboards")
89+
logger.info("from /tmp/ray/session_latest/metrics/grafana/dashboards")
90+
logger.info("To kill prometheus and grafana, run: pkill -f 'prometheus .*' ; pkill -f 'grafana server'")
91+
logger.info("Prometheus stores tha persistant data inside data, if you want to delete the data, run: rm -rf data")
92+
return
93+
94+
95+
if __name__ == "__main__":
96+
# Add an argument parser to get the prometheus and grafana ports
97+
parser = argparse.ArgumentParser()
98+
parser.add_argument(
99+
"--prometheus_web_port", type=int, default=DEFAULT_PROMETHEUS_WEB_PORT, help="The port to run prometheus on"
100+
)
101+
parser.add_argument(
102+
"--grafana_web_port", type=int, default=DEFAULT_GRAFANA_WEB_PORT, help="The port to run grafana on"
103+
)
104+
parser.add_argument("--yes", action="store_true", help="Skip the confirmation prompt")
105+
args = parser.parse_args()
106+
107+
if not args.yes:
108+
print("This will download and start prometheus and grafana on the following ports:")
109+
print(f"Prometheus: {args.prometheus_web_port}")
110+
print(f"Grafana: {args.grafana_web_port}")
111+
print("Are you sure you want to continue? (y/n)")
112+
if input() != "y":
113+
print("Exiting...")
114+
sys.exit(0)
115+
116+
start_prometheus_grafana(args.prometheus_web_port, args.grafana_web_port)

0 commit comments

Comments
 (0)