| 1 | +# Data and Efficiency Metrics |
| 2 | + |
| 3 | +This page documents the structure of the job data and the efficiency metrics computed in the DS4CG Unity Job Analytics project. |
| 4 | + |
| 5 | +## Data Structure |
| 6 | + |
| 7 | +The project works with job data from the Unity cluster's Slurm scheduler. After preprocessing, the data contains the following key attributes: |
| 8 | + |
| 9 | +### Job Identification |
| 10 | +- **JobID** – Unique identifier for each job. |
| 11 | +- **ArrayID** – Array job identifier (`-1` for non-array jobs). |
| 12 | +- **User** – Username of the job submitter. |
| 13 | +- **Account** – Account/group associated with the job. |
| 14 | + |
| 15 | +### Time Attributes |
| 16 | +- **StartTime** – When the job started execution (datetime). |
| 17 | +- **SubmitTime** – When the job was submitted (datetime). |
| 18 | +- **Elapsed** – Total runtime duration (timedelta). |
| 19 | +- **TimeLimit** – Maximum allowed runtime (timedelta). |
| 20 | + |
| 21 | +### Resource Allocation |
| 22 | +- **GPUs** – Number of GPUs allocated. |
| 23 | +- **GPUType** – Type of GPU allocated (e.g., `"v100"`, `"a100"`, or `NA` for CPU-only jobs). |
| 24 | +- **Nodes** – Number of nodes allocated. |
| 25 | +- **CPUs** – Number of CPU cores allocated. |
| 26 | +- **ReqMem** – Requested memory. |
| 27 | + |
| 28 | +### Job Status |
| 29 | +- **Status** – Final job status (`"COMPLETED"`, `"FAILED"`, `"CANCELLED"`, etc.). |
| 30 | +- **ExitCode** – Job exit code. |
| 31 | +- **QOS** – Quality of Service level. |
| 32 | +- **Partition** – Cluster partition used. |
| 33 | + |
| 34 | +### Resource Usage |
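| 35 | +- **CPUTime** – CPU time charged to the job; in Slurm accounting this is elapsed time × allocated CPU count. |
| 36 | +- **CPUTimeRAW** – The same quantity expressed as a raw number of CPU-seconds. |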
| 37 | + |
| 38 | +### Constraints and Configuration |
| 39 | +- **Constraints** – Hardware constraints specified. |
| 40 | +- **Interactive** – Whether the job was interactive (`"interactive"` or `"non-interactive"`). |
| 41 | + |
| 42 | +--- |
| 43 | + |
| 44 | +## Efficiency and Resource Metrics |
| 45 | + |
| 46 | +### GPU and VRAM Metrics |
| 47 | + |
| 48 | +- **GPU Count** (`gpu_count`) |
| 49 | + Number of GPUs allocated to the job. |
| 50 | + |
| 51 | +- **Job Hours** (`job_hours`) |
| 52 | + $$ |
| 53 | + \text{job\_hours} = \frac{\text{Elapsed (seconds)}}{3600} \times \text{gpu\_count} |
| 54 | + $$ |
| 55 | + |
| 56 | +- **VRAM Constraint** (`vram_constraint`) |
| 57 | + VRAM requested via constraints, in GiB. Defaults are applied if not explicitly requested. |
| 58 | + |
| 59 | +- **Partition Constraint** (`partition_constraint`) |
| 60 | + VRAM derived from selecting a GPU partition, in GiB. |
| 61 | + |
| 62 | +- **Requested VRAM** (`requested_vram`) |
| 63 | + $$ |
| 64 | + \text{requested\_vram} = |
| 65 | + \begin{cases} |
| 66 | + \text{partition\_constraint}, & \text{if available} \\ |
| 67 | + \text{vram\_constraint}, & \text{otherwise} |
| 68 | + \end{cases} |
| 69 | + $$ |
| 70 | + |
| 71 | +- **Used VRAM** (`used_vram_gib`) |
| 72 | + Sum of peak VRAM used on all allocated GPUs (GiB). |
| 73 | + |
| 74 | +- **Approximate Allocated VRAM** (`allocated_vram`) |
| 75 | + Estimated VRAM based on GPU model(s) and job node allocation. |
| 76 | + |
| 77 | +- **Total VRAM-Hours** (`vram_hours`) |
| 78 | + $$ |
| 79 | + \text{vram\_hours} = \text{allocated\_vram} \times \text{job\_hours} |
| 80 | + $$ |
| 81 | + |
| 82 | +- **Allocated VRAM Efficiency** (`alloc_vram_efficiency`) |
| 83 | + $$ |
| 84 | + \text{alloc\_vram\_efficiency} = \frac{\text{used\_vram\_gib}}{\text{allocated\_vram}} |
| 85 | + $$ |
| 86 | + |
| 87 | +- **VRAM Constraint Efficiency** (`vram_constraint_efficiency`) |
| 88 | + $$ |
| 89 | + \text{vram\_constraint\_efficiency} = |
| 90 | + \frac{\text{used\_vram\_gib}}{\text{vram\_constraint}} |
| 91 | + $$ |
| 92 | + |
| 93 | +- **Allocated VRAM Efficiency Score** (`alloc_vram_efficiency_score`) |
| 94 | + $$ |
| 95 | + \text{alloc\_vram\_efficiency\_score} = |
| 96 | + \ln(\text{alloc\_vram\_efficiency}) \times \text{vram\_hours} |
| 97 | + $$ |
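| 98 | + Penalizes long jobs with low VRAM efficiency: for an efficiency in (0, 1] the logarithm is non-positive, so the score becomes more negative as efficiency drops or as VRAM-hours grow. |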
| 99 | + |
| 100 | +- **VRAM Constraint Efficiency Score** (`vram_constraint_efficiency_score`) |
| 101 | + $$ |
| 102 | + \text{vram\_constraint\_efficiency\_score} = |
| 103 | + \ln(\text{vram\_constraint\_efficiency}) \times \text{vram\_hours} |
| 104 | + $$ |
| 105 | + |
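The per-job formulas above can be sketched in a few lines of Python. This is only an illustration, not the project's implementation: the `Job` container, its field names, and the choice to return negative infinity when no VRAM was used (where the logarithm is undefined) are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    """Hypothetical container for the per-job inputs used by the formulas."""
    elapsed_seconds: float                 # Elapsed, converted to seconds
    gpu_count: int                         # number of GPUs allocated
    allocated_vram: float                  # GiB, estimated from the GPU model(s)
    used_vram_gib: float                   # GiB, summed peak usage across GPUs
    vram_constraint: float                 # GiB, from constraints or defaults
    partition_constraint: Optional[float] = None  # GiB, if a GPU partition was chosen


def job_hours(job: Job) -> float:
    # Elapsed hours multiplied by the number of GPUs.
    return job.elapsed_seconds / 3600 * job.gpu_count


def requested_vram(job: Job) -> float:
    # The partition constraint takes precedence when it is available.
    if job.partition_constraint is not None:
        return job.partition_constraint
    return job.vram_constraint


def vram_hours(job: Job) -> float:
    return job.allocated_vram * job_hours(job)


def alloc_vram_efficiency(job: Job) -> float:
    return job.used_vram_gib / job.allocated_vram


def vram_constraint_efficiency(job: Job) -> float:
    return job.used_vram_gib / job.vram_constraint


def efficiency_score(efficiency: float, hours: float) -> float:
    # ln(efficiency) * vram_hours. Assumption of this sketch: an efficiency
    # of zero (no VRAM used) maps to -inf, since ln(0) is undefined.
    if efficiency <= 0:
        return -math.inf
    return math.log(efficiency) * hours


# Example: a 2-hour job on 2 GPUs that used 20 GiB of an estimated 80 GiB.
job = Job(elapsed_seconds=7200, gpu_count=2, allocated_vram=80.0,
          used_vram_gib=20.0, vram_constraint=40.0)
score = efficiency_score(alloc_vram_efficiency(job), vram_hours(job))
```

Because the score multiplies a non-positive logarithm by VRAM-hours, the same low efficiency costs a long, large job far more than a short one.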
| 106 | +### CPU Memory Metrics |
| 107 | +- **Used CPU Memory** (`used_cpu_mem_gib`) – Peak CPU RAM usage in GiB. |
| 108 | +- **Allocated CPU Memory** (`allocated_cpu_mem_gib`) – Requested CPU RAM in GiB. |
| 109 | +- **CPU Memory Efficiency** (`cpu_mem_efficiency`) |
| 110 | + $$ |
| 111 | + \text{cpu\_mem\_efficiency} = \frac{\text{used\_cpu\_mem\_gib}}{\text{allocated\_cpu\_mem\_gib}} |
| 112 | + $$ |
| 113 | + |
| 114 | +--- |
| 115 | + |
| 116 | +## User-Level Metrics |
| 117 | + |
| 118 | +- **Job Count** (`job_count`) – Number of jobs submitted by the user. |
| 119 | +- **Total Job Hours** (`user_job_hours`) – Sum of job hours for all jobs of the user. |
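| 120 | +- **Average Allocated VRAM Efficiency Score** (`avg_alloc_vram_efficiency_score`) – Mean of `alloc_vram_efficiency_score` over the user's jobs. |
| 121 | +- **Average VRAM Constraint Efficiency Score** (`avg_vram_constraint_efficiency_score`) – Mean of `vram_constraint_efficiency_score` over the user's jobs. |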
| 122 | + |
| 123 | +- **Weighted Average Allocated VRAM Efficiency** |
| 124 | + $$ |
| 125 | + \text{expected\_value\_alloc\_vram\_efficiency} = |
| 126 | + \frac{\sum (\text{alloc\_vram\_efficiency} \times \text{vram\_hours})} |
| 127 | + {\sum \text{vram\_hours}} |
| 128 | + $$ |
| 129 | + |
| 130 | +- **Weighted Average VRAM Constraint Efficiency** |
| 131 | + $$ |
| 132 | + \text{expected\_value\_vram\_constraint\_efficiency} = |
| 133 | + \frac{\sum (\text{vram\_constraint\_efficiency} \times \text{vram\_hours})} |
| 134 | + {\sum \text{vram\_hours}} |
| 135 | + $$ |
| 136 | + |
| 137 | +- **Weighted Average GPU Count** |
| 138 | + $$ |
| 139 | + \text{expected\_value\_gpu\_count} = |
| 140 | + \frac{\sum (\text{gpu\_count} \times \text{vram\_hours})} |
| 141 | + {\sum \text{vram\_hours}} |
| 142 | + $$ |
| 143 | + |
| 144 | +- **Total VRAM-Hours** – Sum of `vram_hours` (`allocated_vram` × `job_hours`) across all of the user's jobs. |
| 145 | + |
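The three weighted averages above share one pattern: each job's value is weighted by its `vram_hours`. A minimal sketch follows, assuming the per-job values are already available as plain lists. The helper name `weighted_average` and the choice to return NaN when the total weight is zero are assumptions of this example, not project code.

```python
import math


def weighted_average(values, weights):
    """Sum of value * weight divided by the sum of weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        # Assumption: a user with no VRAM-hours has no defined average.
        return math.nan
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Two jobs for one user: a small efficient job and a large inefficient one.
efficiencies = [0.9, 0.1]      # alloc_vram_efficiency per job
job_vram_hours = [10.0, 90.0]  # vram_hours per job

expected_value_alloc_vram_efficiency = weighted_average(efficiencies, job_vram_hours)
```

Here the weighted result is about 0.18, well below the unweighted mean of 0.5, because the inefficient job accounts for most of the VRAM-hours.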
| 146 | +--- |
| 147 | + |
| 148 | +## Group-Level Metrics |
| 149 | + |
| 150 | +For a group of users (e.g., a PI group): |
| 151 | + |
| 152 | +- **Job Count** – Total number of jobs across the group. |
| 153 | +- **PI Group Job Hours** (`pi_acc_job_hours`). |
| 154 | +- **PI Group VRAM Hours** (`pi_acc_vram_hours`). |
| 155 | +- **User Count**. |
| 156 | +- Group averages and weighted averages of the efficiency metrics, computed with the same formulas as the user-level metrics but over all jobs in the group. |
| 157 | + |
| 158 | +--- |
| 159 | + |
| 160 | +## Efficiency Categories |
| 161 | +- **High**: > 70% |
| 162 | +- **Medium**: 30–70% |
| 163 | +- **Low**: 10–30% |
| 164 | +- **Very Low**: < 10% |
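The thresholds above can be applied to an efficiency ratio with a simple lookup. This is a sketch under stated assumptions: the function name is hypothetical, and because the list gives 30% as the edge of both Medium and Low, the example assigns exactly 30% to Medium. Exactly 70% falls in Medium and exactly 10% in Low, following the strict inequalities given for High and Very Low.

```python
def efficiency_category(efficiency: float) -> str:
    """Map an efficiency ratio (0.25 means 25%) to a category label."""
    if efficiency > 0.70:
        return "High"
    if efficiency >= 0.30:   # assumption: exactly 30% counts as Medium
        return "Medium"
    if efficiency >= 0.10:
        return "Low"
    return "Very Low"


labels = [efficiency_category(e) for e in (0.85, 0.50, 0.20, 0.05)]
```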