feat: add --max-concurrency to limit global concurrency search space #505
Closed
Arsene12358 wants to merge 3 commits into ai-dynamo:main
Allow users to cap the global concurrency (total concurrent requests across all DP ranks / workers) considered during the Pareto sweep.

Agg mode: the per-engine batch size sweep is capped so that batch_size * pp_size * attention_dp_size <= max_concurrency.

Disagg mode: replica compositions whose per_worker_concurrency * num_decode_workers > max_concurrency are filtered out during rate matching.

Exposed via:
- CLI: --max-concurrency <int>
- Python API: cli_default(..., max_concurrency=N)
- YAML experiment config: max_concurrency: N

Signed-off-by: Yimingl <yimingl@nvidia.com>
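The agg-mode cap described in this commit can be sketched as follows. This is a minimal illustration, not the code from the PR: the helper name `effective_max_batch_size` and its signature are assumptions, and only the inequality `batch_size * pp_size * attention_dp_size <= max_concurrency` comes from the commit message.

```python
from typing import Optional


def effective_max_batch_size(
    max_batch_size: int,
    pp_size: int,
    attention_dp_size: int,
    max_concurrency: Optional[int] = None,
) -> Optional[int]:
    """Hypothetical helper: largest batch size worth sweeping for one parallel config.

    Returns None when the config should be skipped entirely.
    """
    if max_concurrency is None:
        return max_batch_size  # no cap requested, sweep the full range
    # Global concurrency of a config is batch_size * pp_size * attention_dp_size,
    # so the largest admissible batch size is the integer quotient below.
    cap = max_concurrency // (pp_size * attention_dp_size)
    if cap < 1:
        return None  # even batch_size=1 would exceed the limit
    return min(max_batch_size, cap)
```

Computing the cap once per parallel config, rather than filtering results afterwards, is what lets the sweep stop early instead of evaluating batch sizes that can never satisfy the limit.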
…tests

- to_yaml() now includes max_concurrency when set (was silently dropped)
- Autoscale path (pick_autoscale) now filters by max_concurrency
- Validate max_concurrency >= 1 in TaskConfig (raises ValueError)
- Add INFO-level log when max_concurrency constraint is active
- Update find_best_disagg_result_under_constraints docstring
- Add tests: validation rejects 0/-5, to_yaml round-trip with/without

Signed-off-by: Yimingl <yimingl@nvidia.com>
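The validation and serialization behaviour listed in this commit might look roughly like the sketch below. `ConfigSketch` is a hypothetical stand-in for `TaskConfig`, and `to_dict()` stands in for `to_yaml()` so the example needs no YAML library; neither reflects the actual class layout in the repository.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for TaskConfig; only the new field is modelled."""

    max_concurrency: Optional[int] = None

    def __post_init__(self) -> None:
        # Reject non-positive limits up front (the commit's tests use 0 and -5).
        if self.max_concurrency is not None and self.max_concurrency < 1:
            raise ValueError(
                f"max_concurrency must be >= 1, got {self.max_concurrency}"
            )

    def to_dict(self) -> Dict[str, Any]:
        # Emit the key only when it is set, so configs without a cap are unchanged.
        out: Dict[str, Any] = {}
        if self.max_concurrency is not None:
            out["max_concurrency"] = self.max_concurrency
        return out
```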
360f9c9 to dcf8615
Signed-off-by: Yimingl <yimingl@nvidia.com>
Closing; will design a new version.
Summary
- Adds a `--max-concurrency` CLI flag, Python API parameter, and YAML experiment key to let users cap the global concurrency (total concurrent requests across all DP ranks / workers) considered during AIConfigurator's Pareto sweep.
- Agg mode: the per-engine batch size sweep is capped so that `batch_size * pp_size * attention_dp_size <= max_concurrency`, avoiding unnecessary sweep iterations. Parallel configs where even `batch_size=1` would exceed the limit are skipped entirely.
- Disagg mode: replica compositions with `per_worker_concurrency * num_decode_workers > max_concurrency` are filtered out during the rate-matching step.
- … `--concurrency` flag: the value represents the total number of in-flight requests for the entire deployment, not per DP rank.

Changes
| File | Change |
| --- | --- |
| `sdk/task.py` | `TaskContext` and `TaskConfig` gain `max_concurrency`; `TaskRunner` forwards it to `agg_pareto` / `disagg_pareto` |
| `sdk/pareto_analysis.py` | `agg_pareto()` computes an effective `max_batch_size` per parallel config; `disagg_pareto()` threads through to the session |
| `sdk/inference_session.py` | `find_best_disagg_result_under_constraints()` filters compositions exceeding the limit |
| `cli/main.py` | `--max-concurrency` arg added to default mode; recognized in YAML experiment configs |
| `cli/api.py` | `cli_default()` accepts `max_concurrency` kwarg |
| `cli/example.yaml` | |
| `tests/unit/sdk/task/test_task.py` | |
| `tests/unit/cli/test_argument_parsing.py` | |
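For the disagg path, the filtering attributed to `find_best_disagg_result_under_constraints()` can be illustrated with a small sketch. The `Composition` type and `filter_by_max_concurrency` function are hypothetical names introduced here; only the condition `per_worker_concurrency * num_decode_workers > max_concurrency` is taken from the PR description.

```python
from typing import Iterable, List, NamedTuple, Optional


class Composition(NamedTuple):
    """Hypothetical replica composition; field names follow the PR description."""

    per_worker_concurrency: int
    num_decode_workers: int


def filter_by_max_concurrency(
    compositions: Iterable[Composition],
    max_concurrency: Optional[int],
) -> List[Composition]:
    """Drop compositions whose global concurrency exceeds the limit."""
    if max_concurrency is None:
        return list(compositions)  # no cap requested, keep everything
    return [
        c
        for c in compositions
        if c.per_worker_concurrency * c.num_decode_workers <= max_concurrency
    ]


candidates = [Composition(16, 2), Composition(32, 4), Composition(64, 1)]
kept = filter_by_max_concurrency(candidates, max_concurrency=64)
# Composition(32, 4) implies 128 in-flight requests and is dropped.
```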