Skip to content

Latest commit

 

History

History
115 lines (90 loc) · 5.14 KB

File metadata and controls

115 lines (90 loc) · 5.14 KB
sidebar-title Multi-URL Load Balancing

Multi-URL Load Balancing

AIPerf supports distributing requests across multiple inference server instances for horizontal scaling. This is useful for:

  • Multi-GPU scaling: Run multiple inference containers on a single node, each serving a different GPU
  • Distributed inference: Load balance across multiple inference servers
  • High-throughput benchmarking: Aggregate throughput from multiple instances

Usage

Specify multiple --url options to enable load balancing:

# Round-robin across two servers
aiperf profile --model llama \
    --url http://server1:8000 \
    --url http://server2:8000 \
    --request-rate 20 \
    --request-count 100

Sample Output (Successful Run):

INFO     Starting AIPerf System
INFO     Load balancing enabled: 2 URLs with round_robin strategy
INFO     Using Request_Rate strategy (20.0 req/s)
INFO     AIPerf System is PROFILING

Profiling: 100/100 |████████████████████████| 100% [00:05<00:00]

INFO     Benchmark completed successfully
INFO     Results saved to: artifacts/llama-chat-rate20/

            NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃                     Metric ┃    avg ┃    min ┃    max ┃    p99 ┃    p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│       Request Latency (ms) │ 234.56 │ 189.34 │ 312.45 │ 298.67 │ 231.23 │
│   Time to First Token (ms) │  56.78 │  45.12 │  78.90 │  75.34 │  55.67 │
│ Request Throughput (req/s) │  19.45 │      - │      - │      - │      - │
└────────────────────────────┴────────┴────────┴────────┴────────┴────────┘

JSON Export: artifacts/llama-chat-rate20/profile_export_aiperf.json
# Multi-GPU scaling on a single node
aiperf profile --model llama \
    --url http://localhost:8000 \
    --url http://localhost:8001 \
    --url http://localhost:8002 \
    --url http://localhost:8003 \
    --concurrency 32 \
    --benchmark-duration 60

Sample Output (Successful Run):

INFO     Starting AIPerf System
INFO     Load balancing enabled: 4 URLs with round_robin strategy
INFO     AIPerf System is PROFILING

Profiling: [01:00] - Running for 60 seconds...

INFO     Benchmark completed successfully
INFO     Results saved to: artifacts/llama-chat-concurrency32/

            NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃                     Metric ┃    avg ┃    min ┃    max ┃    p99 ┃    p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│       Request Latency (ms) │ 198.34 │ 145.67 │ 289.12 │ 267.45 │ 194.23 │
│   Time to First Token (ms) │  48.90 │  37.23 │  69.45 │  65.78 │  47.89 │
│ Request Throughput (req/s) │  78.90 │      - │      - │      - │      - │
└────────────────────────────┴────────┴────────┴────────┴────────┴────────┘

JSON Export: artifacts/llama-chat-concurrency32/profile_export_aiperf.json

URL Selection Strategy

Currently supported strategies:

Strategy Description
round_robin (default) Distributes requests evenly across URLs in sequential order

You can explicitly set the strategy with --url-strategy:

aiperf profile --model llama \
    --url http://server1:8000 \
    --url http://server2:8000 \
    --url-strategy round_robin \
    --request-count 100

CLI Options

Option Type Default Description
--url list localhost:8000 One or more endpoint URLs; multiple URLs enable load balancing
--url-strategy enum round_robin Strategy for distributing requests across multiple URLs

Behavior Notes

  • Server metrics: Metrics are collected from all configured URLs
  • Backward compatibility: Single URL usage remains unchanged
  • Per-request assignment: Each request is assigned a URL at credit issuance time
  • Connection reuse: The --connection-reuse-strategy applies per-URL