feat: add --max-concurrency to limit global concurrency search space#505

Closed
Arsene12358 wants to merge 3 commits into ai-dynamo:main from Arsene12358:feat/max-concurrency

Conversation

@Arsene12358
Contributor

Summary

  • Add --max-concurrency CLI flag, Python API parameter, and YAML experiment key to let users cap the global concurrency (total concurrent requests across all DP ranks / workers) considered during AIConfigurator's Pareto sweep.
  • Agg mode: caps the per-engine batch size so that batch_size * pp_size * attention_dp_size <= max_concurrency, avoiding unnecessary sweep iterations. Parallel configs where even batch_size=1 would exceed the limit are skipped entirely.
  • Disagg mode: filters out replica compositions where per_worker_concurrency * num_decode_workers > max_concurrency during the rate-matching step.
  • Semantics are consistent with aiperf's --concurrency flag: the value represents the total number of in-flight requests for the entire deployment, not per DP rank.
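
The two capping rules in the summary can be sketched as follows. This is a minimal illustration, not the PR's code: the function names `effective_max_batch_size` and `disagg_composition_allowed`, and the `sweep_max_batch_size` parameter, are hypothetical. Only the inequalities themselves come from the description above.

```python
from typing import Optional


def effective_max_batch_size(
    sweep_max_batch_size: int,
    pp_size: int,
    attention_dp_size: int,
    max_concurrency: Optional[int],
) -> Optional[int]:
    """Hypothetical helper: cap the per-engine batch size for one parallel config.

    Returns None when even batch_size=1 would exceed the limit, meaning the
    parallel config should be skipped entirely.
    """
    if max_concurrency is None:
        return sweep_max_batch_size  # no cap requested
    # Agg rule: batch_size * pp_size * attention_dp_size <= max_concurrency
    cap = max_concurrency // (pp_size * attention_dp_size)
    if cap < 1:
        return None
    return min(sweep_max_batch_size, cap)


def disagg_composition_allowed(
    per_worker_concurrency: int,
    num_decode_workers: int,
    max_concurrency: Optional[int],
) -> bool:
    """Hypothetical helper: keep a replica composition only if it fits the cap."""
    if max_concurrency is None:
        return True
    # Disagg rule: reject per_worker_concurrency * num_decode_workers > max_concurrency
    return per_worker_concurrency * num_decode_workers <= max_concurrency
```

For example, with `max_concurrency=64`, `pp_size=2`, and `attention_dp_size=4`, the agg sweep would stop at a per-engine batch size of 8, and a config with `pp_size * attention_dp_size > 64` would be skipped.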

Changes

| File | What changed |
| --- | --- |
| sdk/task.py | TaskContext and TaskConfig gain max_concurrency; TaskRunner forwards it to agg_pareto / disagg_pareto |
| sdk/pareto_analysis.py | agg_pareto() computes an effective max_batch_size per parallel config; disagg_pareto() threads through to the session |
| sdk/inference_session.py | find_best_disagg_result_under_constraints() filters compositions exceeding the limit |
| cli/main.py | --max-concurrency arg added to default mode; recognized in YAML experiment configs |
| cli/api.py | cli_default() accepts max_concurrency kwarg |
| cli/example.yaml | Documents the new key |
| tests/unit/sdk/task/test_task.py | 5 new tests: storage, default, agg forwarding, disagg forwarding, None default |
| tests/unit/cli/test_argument_parsing.py | 2 new tests: default None, integer parsing |

@copy-pr-bot

copy-pr-bot bot commented Mar 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the feat label Mar 3, 2026
Allow users to cap the global concurrency (total concurrent requests
across all DP ranks / workers) considered during the Pareto sweep.

Agg mode: the per-engine batch size sweep is capped so that
  batch_size * pp_size * attention_dp_size <= max_concurrency.
Disagg mode: replica compositions whose
  per_worker_concurrency * num_decode_workers > max_concurrency
  are filtered out during rate matching.

Exposed via:
  - CLI: --max-concurrency <int>
  - Python API: cli_default(..., max_concurrency=N)
  - YAML experiment config: max_concurrency: N

Signed-off-by: Yimingl <yimingl@nvidia.com>
…tests

- to_yaml() now includes max_concurrency when set (was silently dropped)
- Autoscale path (pick_autoscale) now filters by max_concurrency
- Validate max_concurrency >= 1 in TaskConfig (raises ValueError)
- Add INFO-level log when max_concurrency constraint is active
- Update find_best_disagg_result_under_constraints docstring
- Add tests: validation rejects 0/-5, to_yaml round-trip with/without

Signed-off-by: Yimingl <yimingl@nvidia.com>
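
The validation and serialization behavior listed in this commit could look roughly like the sketch below. It assumes a dataclass-style config; `TaskConfigSketch` and `to_yaml_dict` are hypothetical stand-ins and do not reproduce the real `TaskConfig` or its `to_yaml()` method.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class TaskConfigSketch:
    """Hypothetical stand-in for TaskConfig, reduced to the one field discussed here."""

    max_concurrency: Optional[int] = None

    def __post_init__(self) -> None:
        # Reject 0 and negative values; None means "no cap".
        if self.max_concurrency is not None and self.max_concurrency < 1:
            raise ValueError(
                f"max_concurrency must be >= 1, got {self.max_concurrency}"
            )

    def to_yaml_dict(self) -> Dict[str, Any]:
        # Emit the key only when it is set, so an unset value round-trips as absent.
        data: Dict[str, Any] = {}
        if self.max_concurrency is not None:
            data["max_concurrency"] = self.max_concurrency
        return data
```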
@Arsene12358 Arsene12358 force-pushed the feat/max-concurrency branch from 360f9c9 to dcf8615 Compare March 3, 2026 13:22
@Arsene12358
Contributor Author

closing, will design a new version
