
Commit 0b42c0e

Merge pull request #320 from vprashar2929/update-script
fix: enhance cluster setup script for model training
2 parents 5621759 + 51358b2

File tree

2 files changed: +313, −247 lines

model_training/README.md (53 additions & 32 deletions)

# Contribute to power profiling and model training

<!--toc:start-->
- [Contribute to power profiling and model training](#contribute-to-power-profiling-and-model-training)
  - [Requirements](#requirements)
  - [Pre-step](#pre-step)
  - [Setup](#setup)
    - [Prepare cluster](#prepare-cluster)
      - [From scratch (no target kubernetes cluster)](#from-scratch-no-target-kubernetes-cluster)
      - [For managed cluster](#for-managed-cluster)
    - [Run benchmark and collect metrics](#run-benchmark-and-collect-metrics)
      - [With manual execution](#with-manual-execution)
  - [Manual Metric Collection and Training with Entrypoint](./cmd_instruction.md)
  - [Clean up](#clean-up)
<!--toc:end-->

## Requirements

- git > 2.22
- kubectl
- yq, jq
- a power meter is available
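The git version requirement can be checked with a small helper before running anything else. This is a sketch, not part of the project's scripts; it assumes GNU `sort` (for the `-V` version-sort flag) and uses only the `git --version` output format:

```shell
# version_ge A B: succeed when version A >= version B (relies on sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Check the local git against the requirement, if git is installed.
if command -v git >/dev/null 2>&1; then
  gitver="$(git --version | awk '{print $3}')"
  if version_ge "$gitver" "2.22"; then
    echo "git $gitver OK"
  else
    echo "git $gitver does not satisfy > 2.22"
  fi
else
  echo "git not found"
fi
```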

## Pre-step

- Fork and clone this repository and move to the `model_training` folder

```bash
git clone
cd model_training
```

## Setup

### Prepare cluster

#### From scratch (no target kubernetes cluster)

> Note: ports 9090 and 5101 should not be in use. They are used to port-forward Prometheus and the kind registry, respectively.

```bash
./script.sh prepare_cluster
```

The script will:

- create a kind cluster `kind-for-training` with a registry at port `5101`.
- deploy Prometheus.
- deploy Prometheus RBAC and a node port at `30090` on the kind node, which is forwarded to port `9090` on the host.
- deploy a service monitor for Kepler and reload the Prometheus server.
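Once the script finishes, you can confirm the forward is working by polling Prometheus. This is a sketch, not part of `script.sh`; `/-/ready` is the standard Prometheus readiness endpoint, and `localhost:9090` assumes the default forward described above:

```shell
# wait_ready URL TIMEOUT: poll URL once per second until it answers
# or TIMEOUT seconds elapse.
wait_ready() {
  url="$1"; timeout="${2:-30}"
  i=0
  while [ "$i" -lt "$timeout" ]; do
    if curl -fsS "$url" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Prometheus answers on /-/ready once it can serve queries.
if wait_ready "http://localhost:9090/-/ready" 5; then
  echo "Prometheus is ready"
else
  echo "Prometheus not reachable on :9090; check the port-forward" >&2
fi
```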

#### For managed cluster

Please confirm the following requirements:

- Kepler installation
- Prometheus installation
- Kepler metrics are exported to the Prometheus server
- Prometheus server is available at `http://localhost:9090`. Otherwise, set the environment variable `PROM_SERVER`.
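If Prometheus is exposed elsewhere, export `PROM_SERVER` before running the scripts. The URL below is only a placeholder; substitute your cluster's actual Prometheus endpoint:

```shell
# Placeholder URL; replace with your cluster's Prometheus endpoint.
export PROM_SERVER="http://prometheus.example.com:9090"
echo "querying Prometheus at ${PROM_SERVER}"
```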

### Run benchmark and collect metrics

There are two options to run the benchmark and collect the metrics: the [CPE-operator](https://github.com/IBM/cpe-operator) with a manual script, and the [Tekton Pipeline](https://github.com/tektoncd/pipeline).

> The adoption of the CPE operator is slated for deprecation. We are transitioning to automating the collection and training processes through the Tekton pipeline. Nevertheless, the CPE operator might still be considered for customized benchmarks that require performance values per sub-workload within the benchmark suite.

- [Tekton Pipeline Instruction](./tekton/README.md)
- [CPE Operator Instruction](./cpe_script_instruction.md)

#### With manual execution

In addition to the two automation approaches above, you can manually run your own benchmarks, then collect, train, and export the models with the entrypoint `cmd/main.py`:

- [Manual Metric Collection and Training with Entrypoint](./cmd_instruction.md)

## Clean up

For the `kind-for-training` cluster:

```bash
./script.sh cleanup
```
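Afterwards you can verify the training cluster is gone. This is a sketch, assuming `kind` is on the PATH; the cluster name matches the one created by `prepare_cluster`:

```shell
# After cleanup, the kind cluster should no longer be listed.
if command -v kind >/dev/null 2>&1; then
  if kind get clusters 2>/dev/null | grep -qx kind-for-training; then
    echo "kind-for-training still exists"
  else
    echo "kind-for-training removed"
  fi
else
  echo "kind not found; skipping check"
fi
```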
