|
1 | | -# python-batchtools |
| 1 | + python-batchtools |
2 | 2 |
|
3 | | -## Set up your development environment |
| 3 | +## Overview |
4 | 4 |
|
5 | | -### Install tools |
| 5 | +`python-batchtools` is a CLI for students and researchers to |
| 6 | +submit **GPU batch jobs** through **Kueue-managed GPU queues** on an |
| 7 | +OpenShift cluster. It provides an inexpensive and accessible way to use |
| 8 | +GPU hardware without reserving dedicated GPU nodes. |
6 | 9 |
|
7 | | -1. Start by installing `uv`. Depending on your distribution, this may be as simple as: |
| 10 | +Users submit GPU jobs with a single command: |
8 | 11 |
|
9 | | - ```sh |
10 | | - sudo dnf -y install uv |
11 | | - ``` |
| 12 | +``` sh |
| 13 | +batchtools br "./cuda_program" |
| 14 | +``` |
| 15 | + |
| 16 | +The CLI automatically: |
| 17 | +- Creates the batch job<br> |
| 18 | +- Submits it to the appropriate Kueue-managed LocalQueue<br> |
| 19 | +- Tracks job status<br> |
| 20 | +- Streams logs on completion <br> |
| 21 | + |
| 22 | + |
| 23 | +# For Users |
12 | 24 |
|
13 | | - If you would like to run the latest version, you can install the command using `pipx`. First, install `pipx`: |
| 25 | +## Installation |
14 | 26 |
|
15 | | - ``` |
16 | | - sudo dnf -y install pipx |
17 | | - ``` |
| 27 | +### Option 1: Use the provided container image (recommended) |
18 | 28 |
|
19 | | - And then use `pipx` to install `uv`: |
| 29 | +### Option 2: Install from source |
| 30 | + |
| 31 | +``` sh |
| 32 | +git clone https://github.com/memalhot/python-batchtools.git |
| 33 | +cd python-batchtools |
| 34 | +pip install -e . |
| 35 | +``` |
20 | 36 |
|
21 | | - ``` |
22 | | - pipx install uv |
23 | | - ``` |
24 | 37 |
|
25 | | -2. Next, install `pre-commit`. As with `uv`, you can install this using your system package manager: |
| 38 | +## Prerequisites |
26 | 39 |
|
27 | | - ``` |
28 | | - sudo dnf -y install pre-commit |
29 | | - ``` |
| 40 | +1. A Kueue-enabled OpenShift cluster with LocalQueues named `v100-localqueue`, `a100-localqueue`, `h100-localqueue`, and `dummy-localqueue`<br> |
| 41 | +2. An OpenShift account<br> |
| 42 | +3. The Python OpenShift client: |
| 43 | + |
| 44 | +``` sh |
| 45 | +pip install openshift-client |
| 46 | +``` |
30 | 47 |
|
31 | | - Or you can install a possibly more recent version using `pipx`: |
| 48 | +# Usage Examples |
32 | 49 |
|
33 | | - ``` |
34 | | - pipx install pre-commit |
35 | | - ``` |
| 50 | +For help on any command, run: |
| 51 | +`batchtools <command> -h` or `batchtools <command> --help` |
36 | 52 |
|
| 53 | +## **1. Submit a Batch Job --- `br`** |
| 54 | +The `br` command submits batch jobs. It sends code intended to run on GPUs to Kueue, where the job is queued and then run. Logs are stored in `RUNDIR`, and the job is deleted afterwards to conserve resources. |
37 | 55 |
|
38 | | -### Activate pre-commit |
| 56 | +Here's how to use the `br` command: |
39 | 57 |
|
40 | | -Activate `pre-commit` for your working copy of this repository by running: |
| 58 | +First, write a CUDA program and compile it :D |
| 59 | +Then submit your CUDA program to the GPU node: |
41 | 60 |
|
| 61 | +``` sh |
| 62 | +batchtools br "./cuda-code" |
42 | 63 | ``` |
43 | | -pre-commit install |
| 64 | + |
| 65 | +Submit a program with arguments: |
| 66 | + |
| 67 | +``` sh |
| 68 | +batchtools br './simulate --steps 1000' |
| 69 | +``` |
| 70 | + |
| 71 | +Specify GPU type: |
| 72 | + |
| 73 | +``` sh |
| 74 | +batchtools br --gpu v100 "./train_model" |
| 75 | +``` |
| 76 | + |
| 77 | +Run without waiting for logs (for longer runs, similar to a more traditional batch system): |
| 78 | + |
| 79 | +``` sh |
| 80 | +batchtools br --no-wait "./cuda_program" |
44 | 81 | ``` |
| 82 | +***WARNING*** |
| 83 | +If you run `br` with the `--no-wait` flag, the job will not be cleaned up for you. You must delete it yourself by running `batchtools bd <job-name>` or `oc delete job <job-name>`. |
| 84 | +But don't worry, running with `--no-wait` will give you a reminder to delete your jobs! |
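
When submitting jobs from a script, it can help to build the `batchtools br` command line in one place. The following is a minimal sketch, not part of `python-batchtools`: the helper name `build_br_command` is hypothetical, it uses only the `--gpu` and `--no-wait` flags shown above, and it assumes the two flags can be combined in a single invocation.

```python
import shlex


def build_br_command(program, gpu=None, no_wait=False):
    """Return the argv list for a `batchtools br` invocation (hypothetical helper)."""
    argv = ["batchtools", "br"]
    if gpu is not None:
        # GPU type as used in the README examples, e.g. "v100".
        argv += ["--gpu", gpu]
    if no_wait:
        # Assumption: --no-wait may be combined with --gpu.
        argv.append("--no-wait")
    # The program and its arguments are passed as one string, as in the examples above.
    argv.append(program)
    return argv


if __name__ == "__main__":
    argv = build_br_command("./simulate --steps 1000", gpu="v100", no_wait=True)
    # Print a shell-quoted command line that can be inspected before running it.
    print(shlex.join(argv))
```

The returned list can be passed to `subprocess.run` directly, which avoids shell quoting problems when the program string contains spaces.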
45 | 85 |
|
46 | | -This will configure `.git/hooks/pre-commit` to run the `pre-commit` tool every time you make a commit. Running these tests locally ensures that your code is clean and that tests are passing before you share your code with others. To manually run all the checks: |
| 86 | +And if you need help or want to see more flags: |
47 | 87 |
|
| 88 | +``` sh |
| 89 | +batchtools br --help |
48 | 90 | ``` |
49 | | -pre-commit run --all-files |
| 91 | + |
| 92 | + |
| 93 | +## **2. List Jobs --- `bj`** |
| 94 | + |
| 95 | +List all jobs: |
| 96 | + |
| 97 | +``` sh |
| 98 | +batchtools bj |
50 | 99 | ``` |
51 | 100 |
|
52 | 101 |
|
53 | | -### Install dependencies |
| 102 | +## **3. Delete Jobs --- `bd`** |
54 | 103 |
|
55 | | -To install the project dependencies, run: |
| 104 | +Delete all jobs: |
56 | 105 |
|
| 106 | +``` sh |
| 107 | +batchtools bd |
57 | 108 | ``` |
58 | | -uv sync --all-extras |
| 109 | + |
| 110 | +To delete specific jobs: |
| 111 | + |
| 112 | +``` sh |
| 113 | +batchtools bd job-a job-b |
59 | 114 | ``` |
60 | 115 |
|
61 | | -### Run tests |
62 | 116 |
|
63 | | -To run just the unit tests: |
| 117 | +## **4. List active GPU pods per node --- `bps`** |
| 118 | + |
| 119 | +``` sh |
| 120 | +batchtools bps |
| 121 | +``` |
| 122 | + |
| 123 | +Output will be empty if all nodes are free. |
| 124 | + |
| 125 | +If some nodes are busy: |
| 126 | +``` |
| 127 | +wrk-4: BUSY 3 project-1/project-stuff testing/other-stuff test/fraud-detection |
| 128 | +``` |
| 129 | + |
| 130 | +To always show output, including free nodes, run: |
64 | 131 |
|
| 132 | +``` sh |
| 133 | +batchtools --verbose bps |
65 | 134 | ``` |
| 135 | +This produces output like: |
| 136 | +``` |
| 137 | +ctl-0: FREE |
| 138 | +ctl-1: FREE |
| 139 | +ctl-2: FREE |
| 140 | +wrk-0: FREE |
| 141 | +wrk-1: FREE |
| 142 | +wrk-3: FREE |
| 143 | +wrk-4: BUSY 3 project-1/project-stuff testing/other-stuff test/fraud-detection |
| 144 | +wrk-5: FREE |
| 145 | +wrk-6: FREE |
| 146 | +wrk-7: FREE |
| 147 | +
|
| 148 | +``` |
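
If you want to use the `bps` output in a script, the lines shown above are simple to parse. This is a rough sketch based only on the sample output: the names `NodeStatus`, `parse_bps_line`, and `parse_bps_output` are hypothetical, and it assumes the number after `BUSY` is a pod count followed by `namespace/pod` entries.

```python
from dataclasses import dataclass, field


@dataclass
class NodeStatus:
    node: str
    busy: bool
    pods: list = field(default_factory=list)


def parse_bps_line(line):
    """Parse one line such as 'wrk-4: BUSY 3 ns/pod ...' or 'ctl-0: FREE'."""
    name, _, rest = line.partition(":")
    tokens = rest.split()
    if not tokens:
        raise ValueError(f"unrecognized line: {line!r}")
    if tokens[0] == "FREE":
        return NodeStatus(node=name.strip(), busy=False)
    if tokens[0] == "BUSY":
        # Assumption: tokens[1] is the pod count; the rest are namespace/pod names.
        return NodeStatus(node=name.strip(), busy=True, pods=tokens[2:])
    raise ValueError(f"unrecognized status in line: {line!r}")


def parse_bps_output(text):
    """Parse the full output, skipping blank lines."""
    return [parse_bps_line(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    sample = "ctl-0: FREE\nwrk-4: BUSY 3 project-1/project-stuff testing/other-stuff test/fraud-detection\n"
    for status in parse_bps_output(sample):
        print(status)
```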
| 149 | + |
| 150 | +## **5. Show pod logs --- `bl`** |
| 151 | + |
| 152 | +``` sh |
| 153 | +batchtools bl |
| 154 | +``` |
| 155 | + |
| 156 | +For a specific pod: |
| 157 | +``` sh |
| 158 | +batchtools bl pod-name |
| 159 | +``` |
| 160 | + |
| 161 | + |
| 162 | +## **6. Show queue status --- `bq`** |
| 163 | + |
| 164 | +``` sh |
| 165 | +batchtools bq |
| 166 | +``` |
| 167 | + |
| 168 | +Output will look like: |
| 169 | +``` |
| 170 | +a100-clusterqueue admitted: 0 pending: 0 reserved: 0 GPUs: 0 BestEffortFIFO |
| 171 | +dummy-clusterqueue admitted: 0 pending: 0 reserved: 0 GPUs: 0 BestEffortFIFO |
| 172 | +h100-clusterqueue admitted: 0 pending: 0 reserved: 0 GPUs: 0 BestEffortFIFO |
| 173 | +v100-clusterqueue admitted: 0 pending: 0 reserved: 0 GPUs: 3 BestEffortFIFO |
| 174 | +``` |
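
The `bq` lines can be turned into structured data in a similar way. The sketch below is an illustration based only on the sample output above: `parse_bq_line` is a hypothetical helper, and it assumes the first token is the queue name, the last token is the queueing strategy, and every `key: number` pair in between is an integer counter.

```python
import re

# Matches "key: number" pairs such as "admitted: 0" or "GPUs: 3".
_FIELD = re.compile(r"(\w+):\s*(\d+)")


def parse_bq_line(line):
    """Parse one line of `batchtools bq` output into a dict."""
    tokens = line.split()
    if len(tokens) < 2:
        raise ValueError(f"unrecognized line: {line!r}")
    counts = {key: int(value) for key, value in _FIELD.findall(line)}
    return {"queue": tokens[0], "strategy": tokens[-1], **counts}


if __name__ == "__main__":
    row = parse_bq_line(
        "v100-clusterqueue admitted: 0 pending: 0 reserved: 0 GPUs: 3 BestEffortFIFO"
    )
    print(row)
```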
| 175 | + |
| 176 | +# For Contributors |
| 177 | + |
| 178 | +## Tools |
| 179 | + |
| 180 | +Install uv: |
| 181 | + |
| 182 | +``` sh |
| 183 | +pipx install uv |
| 184 | +``` |
| 185 | + |
| 186 | +Install pre-commit: |
| 187 | + |
| 188 | +``` sh |
| 189 | +pipx install pre-commit |
| 190 | +``` |
| 191 | + |
| 192 | +Activate hooks: |
| 193 | + |
| 194 | +``` sh |
| 195 | +pre-commit install |
| 196 | +``` |
| 197 | + |
| 198 | +## Running Tests |
| 199 | + |
| 200 | +``` sh |
66 | 201 | uv run pytest |
67 | 202 | ``` |
68 | 203 |
|
69 | | -This will generate a test coverage report in `htmlcov/index.html`. |
| 204 | +Coverage report is generated at: |
| 205 | + |
| 206 | + htmlcov/index.html |