Commit a589cc8

Updating README with instructions on Prometheus setup
1 parent 32af561 commit a589cc8

2 files changed: +26 -2 lines changed

README.md

Lines changed: 25 additions & 0 deletions
@@ -139,6 +139,31 @@ export TORCH_SENDNN_LOG=CRITICAL
 export DT_DEEPRT_VERBOSE=-1
 ```
 
+### Set up the environment for reporting resource usage
+
+When running `drive_paged_programs.py`, you may want to see how much CPU and memory is being
+used. These metrics are collected through Prometheus, so if you are running in a container, set up a simple Prometheus server to collect them:
+
+1. Run `podman network create promnet`
+2. Run `podman run -d --name node-exporter --network promnet quay.io/prometheus/node-exporter:latest`
+3. Create a file called `prometheus.yml` with the following contents:
+
+```yaml
+global:
+  scrape_interval: 5s
+
+scrape_configs:
+  - job_name: "node"
+    static_configs:
+      - targets: ["node-exporter:9100"]
+```
+
+4. Run `podman run -d --name prometheus --network promnet -p 9091:9090 -v "$PWD/prometheus.yml:/etc/prometheus/prometheus.yml:Z" quay.io/prometheus/prometheus:latest --config.file=/etc/prometheus/prometheus.yml`
+5. Check the status of the server by running `curl -s "http://localhost:9091/api/v1/targets" | python3 -m json.tool | grep health` and confirming that `"health"` is `"up"`.
+6. Before running DPP, run `export PROMETHEUS_URL="http://localhost:9091"`
+
+If you are running in OpenShift, set `PROMETHEUS_URL` to an OpenShift route that has Prometheus set up. If the Prometheus instance on the cluster is protected, also set `PROMETHEUS_API_KEY` to your OpenShift OAuth token, which you can get by running `oc whoami -t`.
+
 ## How to use Foundation Model Stack (FMS) on AIU hardware
 The [scripts](https://github.com/foundation-model-stack/aiu-fms-testing-utils/tree/main/scripts) directory provides various scripts to use FMS on AIU hardware for many use cases. These scripts provide robust support for passing desired command line options for running encoder and decoder models along with other use cases. Refer to the documentation on [using different scripts](https://github.com/foundation-model-stack/aiu-fms-testing-utils/blob/main/scripts/README.md) for more details.
 
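The health check in step 5 and the OpenShift note in the README hunk above can also be sketched in Python. This is a minimal sketch, not code from the repository: it assumes the standard Prometheus HTTP API endpoint `/api/v1/targets`, reads `PROMETHEUS_URL` and `PROMETHEUS_API_KEY` as the README describes them, and assumes the API key is sent as a bearer token. The helper names `build_request`, `unhealthy_targets`, and `check_targets` are hypothetical.

```python
import json
import os
import urllib.request


def build_request(path, base_url=None, api_key=None):
    """Build a GET request for a Prometheus HTTP API path.

    Assumption: PROMETHEUS_API_KEY, when set, is sent as a bearer token.
    """
    if base_url is None:
        base_url = os.environ.get("PROMETHEUS_URL", "http://localhost:9091")
    if api_key is None:
        api_key = os.environ.get("PROMETHEUS_API_KEY")
    request = urllib.request.Request(base_url.rstrip("/") + path)
    if api_key:
        request.add_header("Authorization", f"Bearer {api_key}")
    return request


def unhealthy_targets(payload):
    """Return the scrape URLs of active targets whose health is not "up"."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [t.get("scrapeUrl", "?") for t in targets if t.get("health") != "up"]


def check_targets():
    """Fetch /api/v1/targets from a live server and list unhealthy targets."""
    request = build_request("/api/v1/targets")
    with urllib.request.urlopen(request, timeout=5) as response:
        payload = json.load(response)
    return unhealthy_targets(payload)
```

With the podman setup from the steps above running, `check_targets()` would be expected to return an empty list once the `node` job is being scraped. Keeping the parsing in `unhealthy_targets` separate from the network call makes it possible to test the logic against a saved response.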

aiu_fms_testing_utils/scripts/README.md

Lines changed: 1 addition & 2 deletions
@@ -1,6 +1,6 @@
 # Scripts for using Foundation Model Stack (FMS) on AIU hardware
 
-The scripts provided here allow you to run FMS on AIU device for a variety of models.
+The scripts provided here allow you to run FMS on AIU device for a variety of models.
 
 Let's look at some of the example usage below.
 
@@ -75,4 +75,3 @@ python3 scripts/validation.py --architecture=hf_configured --model_path=/home/de
 ```
 
 To run a logits-based validation, pass `--validation_level=1` to the validation script. This will check for the logits output to match at every step of the model through cross-entropy loss. You can control the acceptable threshold with `--logits_loss_threshold`.
-
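As a rough illustration of what a per-step cross-entropy check against a threshold could look like, here is a minimal pure-Python sketch. It is not the implementation in `validation.py`: the function names, the choice of the reference logits' softmax as the target distribution, and the threshold used in any call are assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def cross_entropy(reference_logits, test_logits):
    """Cross-entropy of the test distribution against the reference distribution."""
    ref_log_probs = log_softmax(reference_logits)
    test_log_probs = log_softmax(test_logits)
    return -sum(math.exp(r) * t for r, t in zip(ref_log_probs, test_log_probs))


def failing_steps(reference_steps, test_steps, threshold):
    """Return the indices of generation steps whose loss exceeds the threshold."""
    return [
        step
        for step, (ref, test) in enumerate(zip(reference_steps, test_steps))
        if cross_entropy(ref, test) > threshold
    ]
```

Note that the cross-entropy of a distribution against itself equals its entropy rather than zero, so in this sketch a threshold has to leave room for the reference distribution's own uncertainty.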
