docs/features.md: 16 additions & 22 deletions
@@ -113,56 +113,50 @@ At the moment, the following information is gathered and can be viewed by users:
 
 ### Centralized Logging Scheme
 
-Soperator implements a centralized logging system that automatically collects and categorizes Slurm workload outputs. Logs are organized by worker node to optimize filesystem performance and processed by OpenTelemetry collectors for centralized analysis.
+Soperator implements a centralized logging system that automatically collects and categorizes Slurm workload outputs. Logs are organized by type and processed by OpenTelemetry collectors for centralized analysis.
 
 #### Directory Structure
 
-Logs are separated by worker node to prevent filesystem contention on the shared jail storage:
+Logs are organized in a flat structure by log type:
 
 ```
 /opt/soperator-outputs/
-├── worker-0/
-│ ├── nccl_logs/ # NCCL debug outputs from worker-0
-│ ├── slurm_jobs/ # Slurm job outputs from worker-0
-│ └── slurm_scripts/ # Script outputs (prolog, epilog, health checks) from worker-0
-├── worker-1/
-│ ├── nccl_logs/
-│ ├── slurm_jobs/
-│ └── slurm_scripts/
-└── ...
+├── nccl_logs/ # NCCL debug outputs from all workers
+├── slurm_jobs/ # Slurm job outputs from all workers
+└── slurm_scripts/ # Script outputs (prolog, epilog, health checks) from all workers
 ```
 
 #### Logging Schema
 
-Log files follow simplified naming patterns without worker prefixes (since worker identity is determined by directory structure):
+Log files include the worker name at the beginning of the filename for easy identification:
 
 **NCCL Logs:**
 ```
-job_id.job_step_id.out
-Example: 12345.67890.out (in /opt/soperator-outputs/worker-0/nccl_logs/)
+worker_name.job_id.job_step_id.out
+Example: worker-0.12345.67890.out
 ```
 
 **Slurm Jobs:**
 ```
-job_name.job_id[.array_id].out
+worker_name.job_name.job_id[.array_id].out
 Examples:
-- benchmark.12345.out
-- training.12345.1.out (array job)
+- worker-1.benchmark.12345.out
+- worker-2.training.12345.1.out (array job)
 ```
 
 **Slurm Scripts:**
 ```
-script_name[.context].out
+worker_name.script_name.context.out
 Examples:
-- health_checker.prolog.out
-- cleanup_enroot.epilog.out
+- worker-0.health_checker.prolog.out
+- worker-3.cleanup_enroot.epilog.out
 ```
 
 #### Generated Labels
 
-The logging system automatically extracts metadata and creates the following labels:
+The logging system automatically extracts metadata from filenames and creates the following labels:
 
--`worker_name`: Worker pod identifier extracted from directory path
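
To illustrate the filename schemas this change introduces, here is a minimal Python sketch of how the worker name and job fields could be recovered from a filename. It is not the collector configuration, which this diff does not show. The function name `parse_log_filename`, the `PATTERNS` table, and the assumptions that no field contains a dot and that `job_id`, `job_step_id`, and `array_id` are numeric are all hypothetical.

```python
import re

# Hypothetical patterns, one per log directory, mirroring the documented schemas.
# Assumptions: no field contains a dot; job, step, and array IDs are numeric.
PATTERNS = {
    "nccl_logs": re.compile(
        r"^(?P<worker_name>[^.]+)\.(?P<job_id>\d+)\.(?P<job_step_id>\d+)\.out$"
    ),
    "slurm_jobs": re.compile(
        r"^(?P<worker_name>[^.]+)\.(?P<job_name>[^.]+)\.(?P<job_id>\d+)"
        r"(?:\.(?P<array_id>\d+))?\.out$"
    ),
    "slurm_scripts": re.compile(
        r"^(?P<worker_name>[^.]+)\.(?P<script_name>[^.]+)\.(?P<context>[^.]+)\.out$"
    ),
}


def parse_log_filename(log_type, filename):
    """Return the fields encoded in a log filename, or None if it does not match."""
    pattern = PATTERNS.get(log_type)
    if pattern is None:
        raise ValueError(f"unknown log type: {log_type}")
    match = pattern.match(filename)
    if match is None:
        return None
    # Drop optional groups that did not participate (e.g. array_id).
    return {key: value for key, value in match.groupdict().items() if value is not None}


if __name__ == "__main__":
    print(parse_log_filename("nccl_logs", "worker-0.12345.67890.out"))
    print(parse_log_filename("slurm_jobs", "worker-2.training.12345.1.out"))
    print(parse_log_filename("slurm_scripts", "worker-3.cleanup_enroot.epilog.out"))
```

Under these assumptions, a filename in the old unprefixed form (for example `benchmark.12345.out` under `slurm_jobs`) does not match and yields `None`, which may be useful for spotting files written before this change.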