srtctl supports two profiling backends for performance analysis: Torch Profiler and NVIDIA Nsight Systems (nsys).
- Quick Start
- Profiling Modes
- Configuration Options
- Constraints
- How It Works
- Example Configurations
- Output Files
- Troubleshooting
## Quick Start

Add a profiling section to your job YAML:
```yaml
# must set benchmark type to "manual"
benchmark:
  type: "manual"

# For disaggregated mode (prefill_nodes + decode_nodes)
profiling:
  type: "torch"  # or "nsys"
  isl: 1024
  osl: 128
  concurrency: 24
  prefill:
    start_step: 0
    stop_step: 50
  decode:
    start_step: 0
    stop_step: 50

# For aggregated mode (agg_nodes)
# profiling:
#   type: "torch"
#   isl: 1024
#   osl: 128
#   concurrency: 24
#   aggregated:
#     start_step: 0
#     stop_step: 50
```

## Profiling Modes

| Mode | Description | Output |
|---|---|---|
| `none` | Default. No profiling; uses `dynamo.sglang` for serving | - |
| `torch` | PyTorch Profiler. Good for Python-level and CUDA kernel analysis | `/logs/profiles/{mode}/` (Chrome trace format) |
| `nsys` | NVIDIA Nsight Systems. Low-overhead GPU profiling | `/logs/profiles/{mode}_{rank}.nsys-rep` |
## Configuration Options

```yaml
profiling:
  type: "torch"  # Required: "none", "torch", or "nsys"

  # Traffic generator parameters (required when profiling is enabled)
  isl: 1024        # Input sequence length
  osl: 128         # Output sequence length
  concurrency: 24  # Batch size for profiling workload

  # Disaggregated mode: must set both prefill and decode sections
  prefill:
    start_step: 0  # Step to start profiling for prefill workers
    stop_step: 50  # Step to stop profiling for prefill workers
  decode:
    start_step: 0  # Step to start profiling for decode workers
    stop_step: 50  # Step to stop profiling for decode workers

  # Aggregated mode: must set aggregated section (and must NOT set prefill/decode)
  # aggregated:
  #   start_step: 0  # Step to start profiling for aggregated workers
  #   stop_step: 50  # Step to stop profiling for aggregated workers
```

Traffic generator parameters (`isl`, `osl`, `concurrency`) are shared across all phases. Per-phase `start_step`/`stop_step` allow different profiling windows for prefill vs decode workers.
| Parameter | Description | Default |
|---|---|---|
| `isl` | Input sequence length for profiling requests | Required |
| `osl` | Output sequence length for profiling requests | Required |
| `concurrency` | Number of concurrent requests (batch size) | Required |
| `prefill.start_step` | Step number to begin prefill profiling | 0 |
| `prefill.stop_step` | Step number to end prefill profiling | 50 |
| `decode.start_step` | Step number to begin decode profiling | 0 |
| `decode.stop_step` | Step number to end decode profiling | 50 |
| `aggregated.start_step` | Step number to begin aggregated profiling | 0 |
| `aggregated.stop_step` | Step number to end aggregated profiling | 50 |
## Constraints

Profiling has specific requirements:

- **Single worker only:** Profiling requires exactly 1 prefill worker and 1 decode worker (or 1 aggregated worker):

  ```yaml
  resources:
    prefill_workers: 1  # Must be 1
    decode_workers: 1   # Must be 1
  ```

- **No benchmarking:** Profiling and benchmarking are mutually exclusive:

  ```yaml
  benchmark:
    type: "manual"  # Required when profiling
  ```

- **Automatic config dump disabled:** When profiling is enabled, `enable_config_dump` is automatically set to `false`.
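These constraints amount to a couple of assertions on the job spec. A minimal sketch for the disaggregated case, using the YAML key names above (not srtctl internals):

```python
# Hypothetical sketch of the constraints above; not srtctl's actual code.
# Shows the disaggregated case; aggregated jobs need 1 aggregated worker
# instead of a prefill/decode pair.
def check_profiling_constraints(job: dict) -> None:
    profiling = job.get("profiling", {})
    if profiling.get("type", "none") == "none":
        return  # no profiling, nothing to enforce
    res = job.get("resources", {})
    # Single worker only.
    if res.get("prefill_workers") != 1 or res.get("decode_workers") != 1:
        raise ValueError("profiling requires exactly 1 prefill and 1 decode worker")
    # Profiling and benchmarking are mutually exclusive.
    if job.get("benchmark", {}).get("type") != "manual":
        raise ValueError('benchmark.type must be "manual" when profiling is enabled')
```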
## How It Works

Without profiling (`type: "none"`):

- Uses the `dynamo.sglang` module for serving
- Standard disaggregated inference path

With profiling enabled (`torch` or `nsys`):

- Uses the `sglang.launch_server` module instead
- The `--disaggregation-mode` flag is automatically skipped (not supported by `launch_server`)
- The profiling script (`/scripts/profiling/profile.sh`) runs on leader nodes
- Requests are sent via `sglang.bench_serving` to generate the profiling workload
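The module swap and flag skipping can be pictured as follows. This is a simplified sketch of the behavior described above, not the actual srtctl launcher:

```python
# Simplified sketch of the serving-command selection described above;
# the real srtctl launcher builds a much longer argument list.
def build_serve_command(profiling_enabled: bool, role: str) -> list[str]:
    """Return the `python3 -m ...` invocation for a worker."""
    if not profiling_enabled:
        # Standard disaggregated path via dynamo.sglang.
        return ["python3", "-m", "dynamo.sglang", "--disaggregation-mode", role]
    # Profiling path: sglang.launch_server does not support
    # --disaggregation-mode, so the flag is skipped.
    return ["python3", "-m", "sglang.launch_server"]
```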
When using nsys, workers are wrapped with:

```bash
nsys profile -t cuda,nvtx --cuda-graph-trace=node \
  -c cudaProfilerApi --capture-range-end stop \
  -o /logs/profiles/{mode}_{rank} \
  python3 -m sglang.launch_server ...
```

## Example Configurations

Torch profiling, disaggregated:

```yaml
name: "profiling-torch"

model:
  path: "deepseek-r1"
  container: "latest"
  precision: "fp8"

resources:
  gpu_type: "gb200"
  prefill_nodes: 1
  decode_nodes: 1
  prefill_workers: 1
  decode_workers: 1
  gpus_per_node: 4

profiling:
  type: "torch"
  isl: 1024
  osl: 128
  concurrency: 24
  prefill:
    start_step: 0
    stop_step: 50
  decode:
    start_step: 0
    stop_step: 50

benchmark:
  type: "manual"

backend:
  sglang_config:
    prefill:
      kv-cache-dtype: "fp8_e4m3"
      tensor-parallel-size: 4
    decode:
      kv-cache-dtype: "fp8_e4m3"
      tensor-parallel-size: 4
```

Nsys profiling (profiling section only):

```yaml
profiling:
  type: "nsys"
  isl: 2048
  osl: 64
  concurrency: 16
  prefill:
    start_step: 10
    stop_step: 30
  decode:
    start_step: 10
    stop_step: 30
```

## Output Files

After profiling completes, find results in the job's log directory:
```
logs/{job_id}_{workers}_{timestamp}/
├── profile_all.out          # Unified profiling script output
└── profiles/
    ├── prefill/             # Torch profiler traces (if type: torch)
    │   └── *.json
    ├── decode/
    │   └── *.json
    ├── prefill_0.nsys-rep   # Nsys reports (if type: nsys)
    └── decode_0.nsys-rep
```
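A quick way to gather everything this layout produces, with glob patterns taken from the tree above (the helper itself is hypothetical, not part of srtctl):

```python
from pathlib import Path

# Hypothetical helper matching the output layout shown above.
def collect_profiles(log_dir: str) -> dict[str, list[str]]:
    """Group profiling artifacts by kind under <log_dir>/profiles/."""
    profiles = Path(log_dir) / "profiles"
    return {
        # Torch profiler traces: one directory per phase, Chrome trace JSON
        "torch": sorted(str(p) for p in profiles.glob("*/*.json")),
        # Nsight Systems reports: one file per rank
        "nsys": sorted(str(p) for p in profiles.glob("*.nsys-rep")),
    }
```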
Torch Profiler traces:

- Open in Chrome: `chrome://tracing`
- Or use TensorBoard: `tensorboard --logdir=logs/.../profiles/`

Nsight Systems reports:

- Open with the NVIDIA Nsight Systems GUI
- Or via CLI: `nsys stats logs/.../profiles/decode_0.nsys-rep`
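For a quick look without a GUI, a Chrome-format trace (a JSON file carrying a `traceEvents` list) can also be summarized directly, e.g. ranking ops by total duration. A hypothetical helper; field names (`name`, `dur`) follow the Chrome trace event format:

```python
import json
from collections import Counter

# Hypothetical helper: totals event durations in a Chrome-format trace,
# where events in "traceEvents" carry "name" and "dur" (microseconds).
def top_ops(trace_path: str, n: int = 10) -> list[tuple[str, int]]:
    with open(trace_path) as f:
        events = json.load(f).get("traceEvents", [])
    totals = Counter()
    for ev in events:
        if "dur" in ev:  # complete ("X") events carry a duration
            totals[ev.get("name", "?")] += ev["dur"]
    return totals.most_common(n)
```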
## Troubleshooting

Reduce your worker counts to 1:

```yaml
resources:
  prefill_workers: 1
  decode_workers: 1
```

Set benchmark to manual:

```yaml
benchmark:
  type: "manual"
```

Ensure `isl`, `osl`, and `concurrency` are set; they are required for the profiling workload.

Adjust `start_step` and `stop_step` to capture the desired range. A typical profiling run uses 30-100 steps.