| sidebar-title | SGLang Image Generation |
|---|
This guide shows how to benchmark image generation APIs using a Docker-based server and AIPerf. You'll learn how to:
- Set up the server
- Create an input file and run the benchmark
- View the results and extract the generated images
Log in to Hugging Face and accept the terms of use for the following model: FLUX.1-dev.
Export your Hugging Face token as an environment variable:

```bash
export HF_TOKEN=<your-huggingface-token>
```

Start the Docker container:
```bash
docker run --gpus all \
    --shm-size 32g \
    -it \
    --rm \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    lmsysorg/sglang:dev
```

Set the server arguments:
```bash
SERVER_ARGS=(
    --model-path black-forest-labs/FLUX.1-dev
    --text-encoder-cpu-offload
    --pin-cpu-memory
    --num-gpus 1
    --port 30000
    --host 0.0.0.0
)
```

These arguments set up the server to run the FLUX.1-dev model on a single GPU, listening on port 30000. You can modify them to use a different model, a different number of GPUs, a different port, and so on. See the SGLang Image Generation CLI for more details.
Start the server:

```bash
sglang serve "${SERVER_ARGS[@]}"
```

Wait until the server is ready (watch the logs for the following message):
```
Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
```

Create an input file:
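Before doing so, you can optionally confirm the server is ready programmatically instead of tailing the logs. A minimal sketch that polls the server URL until it answers (the exact route is an assumption here; any HTTP response, even a 404, means the port is up):

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Poll `url` until it returns any HTTP response, or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            urllib.request.urlopen(url, timeout=5)
            return True  # got a 2xx/3xx response: server is up
        except urllib.error.HTTPError:
            return True  # e.g. 404: the port is answering, so the server is up
        except (urllib.error.URLError, OSError):
            time.sleep(interval)  # not accepting connections yet; retry
    return False
```

Call `wait_for_server("http://localhost:30000")` before kicking off a benchmark run.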
```bash
cat > image_prompts.jsonl << 'EOF'
{"text": "A serene mountain landscape at sunset"}
{"text": "A futuristic city with flying cars"}
{"text": "A cute robot playing with a kitten"}
EOF
```

Run the benchmark:
```bash
aiperf profile \
    --model black-forest-labs/FLUX.1-dev \
    --tokenizer gpt2 \
    --url http://localhost:30000 \
    --endpoint-type image_generation \
    --input-file image_prompts.jsonl \
    --custom-dataset-type single_turn \
    --extra-inputs size:512x512 \
    --extra-inputs quality:standard \
    --concurrency 1 \
    --request-count 3
```

Done! This sends 3 requests to http://localhost:30000/v1/images/generations.
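Each request body follows the OpenAI-compatible images API. A hedged sketch of roughly what gets posted for one prompt (field names are assumed from that API shape; the exact `response_format` AIPerf sends may differ):

```python
import json

# Hypothetical request body for POST /v1/images/generations,
# mirroring the flags passed to aiperf above.
payload = {
    "model": "black-forest-labs/FLUX.1-dev",
    "prompt": "A serene mountain landscape at sunset",
    "size": "512x512",              # from --extra-inputs size:512x512
    "quality": "standard",          # from --extra-inputs quality:standard
    "response_format": "b64_json",  # assumed; the raw export contains b64_json fields
}
print(json.dumps(payload, indent=2))
```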
Sample Output (Successful Run):
```
NVIDIA AIPerf | Image Generation Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┓
┃ Metric                            ┃       avg ┃       min ┃       max ┃       p99 ┃       p90 ┃       p50 ┃    std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━┩
│ Request Latency (ms)              │ 12,617.58 │ 12,251.41 │ 12,954.04 │ 12,947.91 │ 12,892.69 │ 12,647.29 │ 287.62 │
│ Input Sequence Length (tokens)    │      6.67 │      6.00 │      7.00 │      7.00 │      7.00 │      7.00 │   0.47 │
│ Request Throughput (requests/sec) │      0.08 │         - │         - │         - │         - │         - │      - │
│ Request Count (requests)          │      3.00 │         - │         - │         - │         - │         - │      - │
└───────────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┴────────┘
```
You can also benchmark with synthetic prompts instead of an input file:

```bash
aiperf profile \
    --model black-forest-labs/FLUX.1-dev \
    --tokenizer gpt2 \
    --url http://localhost:30000 \
    --endpoint-type image_generation \
    --extra-inputs size:512x512 \
    --extra-inputs quality:standard \
    --synthetic-input-tokens-mean 150 \
    --synthetic-input-tokens-stddev 30 \
    --concurrency 1 \
    --request-count 3
```

Done! This sends 3 requests to http://localhost:30000/v1/images/generations.
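The `--synthetic-input-tokens-*` flags draw each prompt's token count from a distribution centered on the mean with the given spread. A rough illustration of that idea (a sketch, not AIPerf's actual sampling code):

```python
import random


def sample_prompt_lengths(n: int, mean: float = 150, stddev: float = 30, seed: int = 0):
    """Draw n prompt lengths from a normal distribution, clamped to at least 1 token."""
    rng = random.Random(seed)
    return [max(1, round(rng.gauss(mean, stddev))) for _ in range(n)]


print(sample_prompt_lengths(3))
```

This is why the reported Input Sequence Length varies around 150 rather than matching it exactly.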
Sample Output (Successful Run):
```
NVIDIA AIPerf | Image Generation Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┓
┃ Metric                            ┃       avg ┃       min ┃       max ┃       p99 ┃       p90 ┃       p50 ┃    std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━┩
│ Request Latency (ms)              │ 12,173.18 │ 11,918.37 │ 12,503.38 │ 12,495.27 │ 12,422.26 │ 12,097.79 │ 244.71 │
│ Input Sequence Length (tokens)    │    137.00 │    107.00 │    153.00 │    152.96 │    152.60 │    151.00 │  21.23 │
│ Request Throughput (requests/sec) │      0.08 │         - │         - │         - │         - │         - │      - │
│ Request Count (requests)          │      3.00 │         - │         - │         - │         - │         - │      - │
└───────────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┴────────┘
```
Image generation endpoints report a focused set of metrics. Unlike LLM text endpoints, there are no token-level streaming metrics (TTFT, ITL) since the image is returned as a single response.
| Metric | Description |
|---|---|
| Request Latency (ms) | End-to-end image generation time — from sending the request to receiving the complete image. This is the primary measure of image generation speed. |
| Input Sequence Length (tokens) | Token count of the text prompt used to generate the image. |
| Request Throughput (requests/sec) | Number of images generated per second across all concurrent workers. |
| Request Count (requests) | Total number of completed image generation requests. |
Tip: To increase throughput, raise `--concurrency`. Each concurrent worker sends requests independently, allowing multiple images to be generated in parallel.
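The tip above follows from simple arithmetic: with requests issued back-to-back, throughput is roughly concurrency divided by average request latency. Using the first run's numbers:

```python
avg_latency_s = 12_617.58 / 1000  # avg Request Latency from the table, in seconds
concurrency = 1

# Each worker completes about one request per avg_latency_s seconds.
throughput = concurrency / avg_latency_s
print(f"{throughput:.2f} requests/sec")  # ≈ 0.08, matching the report
```

Raising `--concurrency` to 4 would put the ceiling near 0.32 requests/sec, assuming latency stays flat (in practice it grows as the GPU saturates).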
Next, rerun the benchmark with raw export enabled so that the generated images can be extracted. Create an input file:
```bash
cat > image_prompts.jsonl << 'EOF'
{"text": "A serene mountain landscape at sunset"}
{"text": "A futuristic city with flying cars"}
{"text": "A cute robot playing with a kitten"}
EOF
```

Run the benchmark:
Use `--export-level raw` to get the raw input/output payloads.

```bash
aiperf profile \
    --model black-forest-labs/FLUX.1-dev \
    --tokenizer gpt2 \
    --url http://localhost:30000 \
    --endpoint-type image_generation \
    --input-file image_prompts.jsonl \
    --custom-dataset-type single_turn \
    --extra-inputs size:512x512 \
    --extra-inputs quality:standard \
    --concurrency 1 \
    --request-count 3 \
    --export-level raw
```

Extract the generated images:
Copy the following code into a file called extract_images.py:

```python
#!/usr/bin/env python3
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""Extract base64-encoded images from AIPerf JSONL output file."""
import base64
import json
import os
import sys
from pathlib import Path

# Read input file path and output directory, with defaults
input_file = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(
    'artifacts/black-forest-labs_FLUX.1-dev-openai-image_generation-concurrency1/profile_export_raw.jsonl'
)
output_dir = Path(sys.argv[2]) if len(sys.argv) > 2 else Path('extracted_images')

# Create output directory
os.makedirs(output_dir, exist_ok=True)

# Process each line in the JSONL file
with open(input_file, 'r') as f:
    for line_num, line in enumerate(f, 1):
        record = json.loads(line)
        # Extract images from responses
        for response in record.get('responses', []):
            response_data = json.loads(response.get('text', '{}'))
            for data_idx, item in enumerate(response_data.get('data', [])):
                if b64_image := item.get('b64_json'):
                    # Decode and save image
                    image_data = base64.b64decode(b64_image)
                    filename = output_dir / f"image_{line_num:04d}_{data_idx:02d}.jpg"
                    with open(filename, 'wb') as img_file:
                        img_file.write(image_data)
                    print(f"Extracted: {filename.resolve()}")
```

Run the script:
The script is set up to use the default directory and file names for the input and output files, but you can pass different paths as arguments:

```bash
# Usage: python extract_images.py <input_file> <output_dir>
python extract_images.py
```

Output:
```
Extracted: /path/to/extracted_images/image_0001_00.jpg
Extracted: /path/to/extracted_images/image_0001_01.jpg
Extracted: /path/to/extracted_images/image_0001_02.jpg
```
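To see what extract_images.py is parsing: each line of the raw export is a JSON record whose responses carry the images payload as a JSON string. A hand-built sketch of one such record (structure inferred from the script above, not an exact AIPerf schema):

```python
import base64
import json

# A fabricated three-byte payload standing in for a real generated image.
fake_image = base64.b64encode(b"\xff\xd8\xff").decode()  # not a real JPEG

record = {
    "responses": [
        {"text": json.dumps({"data": [{"b64_json": fake_image}]})}
    ]
}

# The same walk the extraction script performs:
for response in record["responses"]:
    body = json.loads(response["text"])
    for item in body["data"]:
        img_bytes = base64.b64decode(item["b64_json"])
        print(len(img_bytes), "bytes decoded")
```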
View the generated images:
Prompt:
{"text": "A serene mountain landscape at sunset"}
Prompt:
{"text": "A futuristic city with flying cars"}
Prompt:
{"text": "A cute robot playing with a kitten"}
You've successfully set up an image generation server, run your first benchmarks, and learned how to extract and view the generated images. You can now experiment with different models, prompts, and concurrency settings to optimize your image generation workloads.
Now go forth and generate!


