| sidebar-title | Template Endpoint |
|---|---|
The template endpoint provides a flexible way to benchmark custom APIs that don't match standard OpenAI formats. You define request payloads using Jinja2 templates and optionally specify how to extract responses using JMESPath queries.
Use the template endpoint when:
- Your API has a custom request/response format
- Standard endpoints (chat, completions, embeddings, rankings) don't fit your use case
Benchmark an API that accepts text in a custom format:

```bash
aiperf profile \
    --model your-model \
    --url http://localhost:8000/custom-endpoint \
    --endpoint-type template \
    --extra-inputs payload_template:'
{
  "text": {{ text|tojson }}
}' \
    --synthetic-input-tokens-mean 100 \
    --output-tokens-mean 50 \
    --concurrency 4 \
    --request-count 20
```

Sample Output (Successful Run):
```text
INFO Starting AIPerf System
INFO Using template endpoint with custom payload
INFO AIPerf System is PROFILING
Profiling: 20/20 |████████████████████████| 100% [00:28<00:00]
INFO Benchmark completed successfully
INFO Results saved to: artifacts/your-model-template-concurrency4/

NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Metric                      ┃    avg ┃    min ┃    max ┃    p99 ┃    p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Request Latency (ms)        │ 456.78 │ 389.23 │ 567.45 │ 554.32 │ 452.34 │
│ Time to First Token (ms)    │  89.34 │  67.45 │ 112.34 │ 109.23 │  87.56 │
│ Inter Token Latency (ms)    │  11.23 │   9.45 │  14.56 │  14.12 │  11.01 │
│ Output Token Count (tokens) │  50.00 │  48.00 │  52.00 │  51.89 │  50.00 │
│ Request Throughput (req/s)  │   8.78 │      - │      - │      - │      - │
└─────────────────────────────┴────────┴────────┴────────┴────────┴────────┘

JSON Export: artifacts/your-model-template-concurrency4/profile_export_aiperf.json
```
Configure the template endpoint using `--extra-inputs`:

- `payload_template`: Jinja2 template defining the request payload format. Accepts:
  - A named template: `nv-embedqa`
  - A file path: `/path/to/template.json`
  - An inline string: `'{"text": {{ text|tojson }}}'`
- `response_field`: JMESPath query to extract data from responses
  - Auto-detection is used if not provided
  - Example: `data[0].embedding`
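A JMESPath query such as `data[0].embedding` walks the response JSON by key and index. The stdlib-only sketch below illustrates the idea with a hypothetical `extract` helper; it handles only dotted keys and list indices, whereas AIPerf evaluates full JMESPath (projections like `data[].embedding`, filters, slices):

```python
import json
import re

def extract(path: str, doc):
    # Toy resolver for dotted-key/index paths like "data[0].embedding".
    # Illustration only: a real JMESPath engine supports far more syntax.
    for part in re.findall(r"[^.\[\]]+|\[\d+\]", path):
        doc = doc[int(part[1:-1])] if part.startswith("[") else doc[part]
    return doc

response = json.loads('{"data": [{"embedding": [0.1, 0.2, 0.3]}]}')
print(extract("data[0].embedding", response))  # [0.1, 0.2, 0.3]
```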
Any other `--extra-inputs` fields are merged into every request payload:

```bash
--extra-inputs temperature:0.7 top_p:0.9
```

The following variables are available inside payload templates:

- `text`: First text content (or `None`)
- `texts`: List of all text contents
- `image`, `audio`, `video`: First media content (or `None`)
- `images`, `audios`, `videos`: Lists of all media contents
- `query`: First query text
- `queries`: All query texts
- `passage`: First passage text
- `passages`: All passage texts
- `texts_by_name`: Dict mapping content names to text lists
- `images_by_name`, `audios_by_name`, `videos_by_name`: Dicts for media
- `model`: Model name
- `max_tokens`: Output token limit
- `stream`: Whether streaming is enabled
- `role`: Message role
- `turn`: Current turn object
- `turns`: List of all turns
- `request_info`: Full request context
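To see how these variables flow into a payload, here is a rough sketch of the rendering step, using `json.dumps` as a stand-in for Jinja2's `tojson` filter (the variable values are made up for illustration):

```python
import json

# Illustrative values for a few of the variables listed above.
variables = {"text": 'He said "hi"\nand left', "model": "your-model",
             "max_tokens": 50, "stream": False}

# A template like {"prompt": {{ text|tojson }}, ...} JSON-escapes each
# value; json.dumps plays the role of tojson here (and also renders
# Python booleans as lowercase true/false, like the |lower filter).
payload = '{"prompt": %s, "model": %s, "max_new_tokens": %s, "stream": %s}' % (
    json.dumps(variables["text"]),
    json.dumps(variables["model"]),
    json.dumps(variables["max_tokens"]),
    json.dumps(variables["stream"]),
)

# Escaping made the quotes and the newline safe, so the payload parses.
assert json.loads(payload)["prompt"] == variables["text"]
```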
Auto-detection tries to extract in this order: embeddings, rankings, then text.

- Embeddings
  - OpenAI: `data[].embedding`
  - Simple: `embeddings`, `embedding`
- Rankings
  - Lists: `rankings`, `results`
- Text
  - Fields: `text`, `content`, `response`, `output`, `result`
  - OpenAI: `choices[0].text`, `choices[0].message.content`
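The fallback order above can be sketched roughly as follows. This is an approximation for illustration, not AIPerf's actual code:

```python
def autodetect(resp: dict):
    # Rough sketch of the fallback order: embeddings, rankings, then text.
    data = resp.get("data")
    if (isinstance(data, list) and data and isinstance(data[0], dict)
            and "embedding" in data[0]):
        return [item["embedding"] for item in data]   # OpenAI: data[].embedding
    for key in ("embeddings", "embedding"):           # simple embeddings
        if key in resp:
            return resp[key]
    for key in ("rankings", "results"):               # ranking lists
        if key in resp:
            return resp[key]
    for key in ("text", "content", "response", "output", "result"):
        if key in resp:                               # plain text fields
            return resp[key]
    if resp.get("choices"):                           # OpenAI text shapes
        choice = resp["choices"][0]
        return choice.get("text") or choice["message"]["content"]
    return None

print(autodetect({"choices": [{"message": {"content": "hello"}}]}))  # hello
```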
Specify a JMESPath query to extract specific fields:

```bash
--extra-inputs response_field:'data[0].vector'
```

For example, benchmark a custom embeddings API:

```bash
aiperf profile \
    --model embedding-model \
    --url http://localhost:8000/embed \
    --endpoint-type template \
    --extra-inputs payload_template:'
{
  "input": {{ texts|tojson }},
  "model": {{ model|tojson }}
}' \
    --extra-inputs response_field:'embeddings' \
    --synthetic-input-tokens-mean 50 \
    --concurrency 8 \
    --request-count 100
```

Using the built-in `nv-embedqa` template:
```bash
aiperf profile \
    --model nv-embed-v2 \
    --url http://localhost:8000/embeddings \
    --endpoint-type template \
    --extra-inputs payload_template:nv-embedqa \
    --synthetic-input-tokens-mean 100 \
    --concurrency 4 \
    --request-count 50
```

Note: The `nv-embedqa` template expands to `{"text": {{ texts|tojson }}}`.
Create `chat_template.json`:

```json
{
  "model": {{ model|tojson }},
  "prompt": {{ text|tojson }},
  "max_new_tokens": {{ max_tokens|tojson }},
  "stream": {{ stream|lower }}
}
```

Use it:

```bash
aiperf profile \
    --model custom-llm \
    --url http://localhost:8000/generate \
    --endpoint-type template \
    --extra-inputs payload_template:./chat_template.json \
    --extra-inputs response_field:'generated_text' \
    --streaming \
    --synthetic-input-tokens-mean 200 \
    --output-tokens-mean 100 \
    --concurrency 10
```

Benchmark a multimodal API with a dataset file:

```bash
aiperf profile \
    --model vision-model \
    --url http://localhost:8000/analyze \
    --endpoint-type template \
    --extra-inputs payload_template:'
{
  "text": {{ text|tojson }},
  "image": {{ image|tojson }}
}' \
    --input-file ./multimodal_dataset.jsonl \
    --concurrency 2
```

Tips:

- Always use `|tojson` for string/list values to properly escape JSON
- Use `-v` or `-vv` to see debug logs with formatted payloads
- Check `artifacts/<run-name>/inputs.json` to see all formatted request payloads
- Let auto-detection work first before specifying `response_field`
**Template didn't render valid JSON**

- Use the `|tojson` filter for string or nullable values
**Response not parsed correctly**

- Use `-vv` to see raw responses in logs
- Specify `response_field` with a JMESPath query

**Variables not available**

- Verify your input dataset includes the required fields
- Use `request_info` and `turn` objects for nested data