Make Loadgen Async #255
nv-alicheng wants to merge 10 commits into feat/alicheng-pubsub-integration from
Conversation
…nd stale completion filtering (Phase 2)
…intClient.create() factory (Phase 3)
…tHandler) and update references
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Code Review
This pull request implements a new asynchronous load generator architecture, replacing the legacy threaded system with a phase-based orchestration model. It introduces the BenchmarkSession for sequential phase management, specialized LoadStrategy implementations for various load patterns, and async factory methods for HTTPEndpointClient to avoid deadlocks. The review feedback highlights several critical reliability and efficiency improvements: exception handling must be added to the TimedIssueStrategy and BurstStrategy callbacks to prevent potential hangs, and ConcurrencyStrategy needs a try...finally block to prevent semaphore leaks. Furthermore, the polling mechanism in _drain_inflight should be replaced with an event-driven approach to reduce latency and overhead.
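The semaphore-leak fix called out for ConcurrencyStrategy can be sketched as follows. This is a minimal illustration, not the PR's actual class: the names `ConcurrencyStrategy`, `_issue_one`, and the `issue_fn` callable are assumptions.

```python
import asyncio


class ConcurrencyStrategy:
    """Sketch: cap in-flight requests with a semaphore, releasing in finally."""

    def __init__(self, max_concurrency: int, issue_fn):
        self._sem = asyncio.Semaphore(max_concurrency)
        self._issue_fn = issue_fn  # hypothetical async issue callable

    async def _issue_one(self, sample_index: int) -> None:
        await self._sem.acquire()
        try:
            await self._issue_fn(sample_index)
        finally:
            # Release even when issue raises, so the slot is never leaked
            # and the strategy cannot starve itself of concurrency permits.
            self._sem.release()
```

Without the `try...finally`, one raising `issue()` call permanently consumes a permit, and after enough failures the phase deadlocks waiting on the semaphore.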
nv-alicheng
left a comment
Review Council — Multi-AI Code Review
Reviewed by: Claude + Codex | Depth: thorough
Found 13 issues across 6 files.
Must Fix (critical/high)
Issues that will cause incorrect behavior, data loss, or hangs.
| # | File | Line | Category | Reviewer(s) | Summary |
|---|---|---|---|---|---|
| 1 | scoring.py | 115 | data-integrity | Codex | The async rewrite removed the sample_idx_map.json write path, but Scorer.get_outputs() still reads it. `TestMode.ACC |
| 2 | execute.py | 426 | data-integrity | Codex | No event logger service is launched, yet Scorer.get_outputs() reads events.jsonl for accuracy scoring. The old path |
| 3 | strategy.py | 130 | concurrency | Codex | fire() callbacks in TimedIssueStrategy and BurstStrategy call phase_issuer.issue() without exception handling. I |
| 4 | session.py | 413 | bug | Claude | on_sample_complete is never called for StreamChunk(is_complete=True). When streaming endpoints use terminal StreamCh |
| 5 | session.py | 379 | bug | Codex | Failed requests (QueryResult.error is not None) are published as normal COMPLETE events. The MetricsAggregator only |
Should Fix (medium)
Real issues under specific conditions or design flaws.
| # | File | Line | Category | Reviewer(s) | Summary |
|---|---|---|---|---|---|
| 6 | report.py | 178 | data-integrity | Both | n_samples_failed reads total_samples_failed (all phases) while n_samples_issued/completed read tracked_* (perf |
| 7 | execute.py | 474 | data-integrity | Codex | Report is built from KVStore immediately after session.run() returns, before the MetricsAggregator subprocess processe |
| 8 | execute.py | 345 | data-integrity | Both | Accuracy phases reuse the same random.Random instances as the perf phase by reference. Accuracy sample ordering depend |
| 9 | session.py | 334 | performance | Claude | _drain_inflight busy-polls with asyncio.sleep(0.01) adding up to 10ms latency. Use an asyncio.Event set by `_handl |
| 10 | execute.py | 470 | error-handling | Claude | A second Ctrl+C during loop.run_until_complete raises KeyboardInterrupt that bypasses the finally block in `_run_b |
| 11 | execute.py | 551 | bug | Claude | If stopped before any perf phase, total_issued=0 and n_errors=len(collector.errors) produces `successful = 0 - n_err |
Consider (low)
Valid improvements for follow-ups.
| # | File | Line | Category | Reviewer(s) | Summary |
|---|---|---|---|---|---|
| 12 | execute.py | 411 | error-handling | Claude | assert zmq_ctx.socket_dir is not None is disabled with python -O. Use a proper if ... raise RuntimeError instead. |
| 13 | test_benchmark_command.py | 81 | testing | Codex | Integration tests only exercise TestMode.PERF. No coverage for ACC/BOTH modes, which is how the missing `sample_id |
Generated with Claude Code
```python
n_samples_from_dataset=acc_ds.num_samples(),
n_samples_to_issue=acc_ds.num_samples() * acc_ds.repeats,
min_sample_count=acc_ds.num_samples() * acc_ds.repeats,
rng_sched=ctx.rt_settings.rng_sched,
```
[Both] medium (data-integrity): Accuracy phases reuse the same random.Random instances as the perf phase by reference. Accuracy sample ordering depends on how many RNG draws the perf phase consumed, breaking reproducibility.
```python
rng_sched=ctx.rt_settings.rng_sched,  # shared reference
```
Create fresh Random instances with derived seeds for each accuracy phase.
[Claude] Design limitation, added documentation — shared RNG is known issue for multi-phase reproducibility.
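One way to implement the derived-seed suggestion is sketched below; the helper name and seed-derivation scheme are assumptions, but `random.Random` seeded with a string is deterministic across runs (the string path does not depend on `PYTHONHASHSEED`).

```python
import random


def derive_phase_rngs(base_seed: int, phase_names: list[str]) -> dict[str, random.Random]:
    """Sketch: give each phase its own Random derived from a base seed, so
    accuracy ordering no longer depends on how many draws the perf phase made."""
    return {
        # String seeds are hashed deterministically by random.Random.seed()
        name: random.Random(f"{base_seed}:{name}")
        for name in phase_names
    }
```

Each phase then draws from its own stream, so reproducibility of the accuracy phases holds even if the performance phase changes its sample count.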
```python
finally:
    # Always restore original handler
    signal.signal(signal.SIGINT, old_handler)

loop.add_signal_handler(signal.SIGINT, session.stop)
```
[Claude] medium (error-handling): A second Ctrl+C during loop.run_until_complete raises KeyboardInterrupt that bypasses the finally block in _run_benchmark_async, leaking ZMQ context, publisher, and HTTP client.
[Claude] Design limitation, added documentation — double SIGINT inherently hard to handle with loop.run_until_complete
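A Unix-only mitigation sketch: routing SIGINT through `loop.add_signal_handler` means Ctrl+C sets an event inside the loop instead of raising `KeyboardInterrupt` out of `run_until_complete`, so `finally` cleanup always runs. The wrapper name and the `main(stop)` shape are assumptions for illustration.

```python
import asyncio
import signal


async def run_with_graceful_sigint(main) -> None:
    """Sketch: convert SIGINT into a stop event so cleanup always executes."""
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    # Repeated Ctrl+C just re-sets the same event; no KeyboardInterrupt
    # escapes the loop, so finally blocks are never bypassed.
    loop.add_signal_handler(signal.SIGINT, stop.set)
    try:
        await main(stop)
    finally:
        loop.remove_signal_handler(signal.SIGINT)
        # ZMQ context / publisher / HTTP client cleanup would go here
```

This does not fully solve a hard kill (SIGKILL), but it removes the double-Ctrl+C leak path described above.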
nv-alicheng
left a comment
Review Council — Multi-AI Code Review (Round 2)
Reviewed by: Claude + Codex | Depth: thorough
Found 7 new issues across 2 files (previous issues excluded).
Must Fix (high)
| # | File | Line | Category | Reviewer(s) | Summary |
|---|---|---|---|---|---|
| 1 | session.py | 436 | bug | Both | on_sample_complete(resp) is called with a StreamChunk for terminal streaming completions, but `ResponseCollector.on_ |
| 2 | execute.py | 413 | bug | Claude | Socket name collision: pub_socket_name = f"ev_pub_{session_id[:8]}" slices to "ev_pub_cli_benc" — the static prefix, |
| 3 | session.py | 179 | data-integrity | Codex | PhaseIssuer.issue() calls dataset.load_sample() BEFORE recording the ISSUED timestamp. Dataset reads inflate TTFT/la |
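A sketch of a collision-free socket name for issue 2. The helper name is hypothetical; the idea is simply to use a random token instead of `session_id[:8]`, whose first characters may be a constant prefix shared by every CLI session.

```python
import uuid


def make_pub_socket_name(prefix: str = "ev_pub") -> str:
    """Sketch: random 8-hex-char suffix instead of slicing session_id."""
    return f"{prefix}_{uuid.uuid4().hex[:8]}"
```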
Should Fix (medium)
| # | File | Line | Category | Reviewer(s) | Summary |
|---|---|---|---|---|---|
| 4 | session.py | 241 | concurrency | Codex | stop() sets _stop_requested and cancels the strategy task, but does NOT set _drain_event. If stop is called during |
| 5 | execute.py | 349 | data-integrity | Both | Accuracy phase name from DATASET_ID/class name may differ from eval_cfg.dataset_name (from config). Scorer looks u |
| 6 | execute.py | 528 | performance | Claude | launcher.wait_for_exit(timeout=10.0) is a blocking OS call inside an async def. Blocks the event loop for up to 10s |
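The fix for issue 6 is to off-load the blocking wait with `asyncio.to_thread`, as sketched below. The `wait_for_exit` stand-in is hypothetical; it models the blocking OS call without the real launcher.

```python
import asyncio
import time


def wait_for_exit(timeout: float) -> bool:
    """Stand-in for the blocking launcher.wait_for_exit OS call."""
    time.sleep(timeout)
    return True


async def shutdown() -> bool:
    # Run the blocking wait in a worker thread; the event loop keeps
    # servicing other tasks (heartbeats, progress bars) in the meantime.
    return await asyncio.to_thread(wait_for_exit, 0.01)
```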
Consider (low)
| # | File | Line | Category | Reviewer(s) | Summary |
|---|---|---|---|---|---|
| 7 | execute.py | 430 | bug | Claude | Tmpfs directories under /dev/shm are only cleaned in _write_scoring_artifacts. If benchmark crashes or is interrupte |
Generated with Claude Code
> phases, or (b) per-phase metric namespacing (e.g., prefix keys with phase name),
> or (c) the report builder computes deltas by snapshotting before and after each
> phase. This will be addressed in a future change to the `MetricsAggregator`.
> Option (b) is the most-likely planned change as it is the most robust.
with zero GIL contention and low response latency (0.6–1.4ms). No thread pool overhead.
Degrades above 100k+ QPS where the callback queue saturates.

`run_in_executor(busy_wait)` is available as an opt-in for workloads requiring sub-100μs

For `ConcurrencyStrategy`, `_handle_response` calls `strategy.on_query_complete()`
which releases the semaphore. Since `recv()` returns as soon as the fd is readable
and `eager_task_factory` executes the woken semaphore waiter synchronously, there
is no added latency compared to a poll-based approach.
arekay-nv
left a comment
Made a pass over the code + doc. Will make one on the tests later tonight.
One general comment - we should pick between sample and query - my preference is "request" since it captures the user aspect, but moving away from query might be useful since it draws too much similarity to databases.
```
+-- [perf phase 1] START_TRACKING → strategy.execute() → drain → STOP_TRACKING → snapshot report
+-- [saturation] strategy.execute() → drain
+-- [perf phase 2] START_TRACKING → strategy.execute() → drain → STOP_TRACKING → snapshot report
```
START_PERFORMANCE_TRACKING not START_TRACKING
- ISSUED: `monotonic_ns()` taken immediately before `issuer.issue()`. The ZMQ push is
  sync and non-blocking, so this honestly represents when the query entered the transport.
- COMPLETE: `QueryResult.completed_at` is set via `force_setattr(monotonic_ns())` in
  `__post_init__`, regenerated on deserialization. Both ISSUED and COMPLETE timestamps
I think TTFT is still sensitive to the ZMQ overhead, but TPOT would not be: RECV_FIRST and COMPLETE are both stamped on the main-process side after worker-to-main transit, avoiding cross-process clock skew. TPOT is therefore relatively insensitive to that transport bias, while TTFT still includes the end-to-end handoff-to-first-token path.
```python
tokenizer = _load_tokenizer(config.model_params.name)
# Tokenizer check (light API call, no download)
model_name = config.model_params.name
tokenizer_name = model_name if _check_tokenizer_exists(model_name) else None
```
Would this work for non-standard tokenizers? Also, we might want to report an error/warning here, since a typo in the model name (or a non-standard model name) would leave the tokenizer as None but only surface as an error when we do the TPOT calculations.
```python
phases.append(
    PhaseConfig(
        "performance", ctx.rt_settings, ctx.dataloader, PhaseType.PERFORMANCE
    )
)
```
Can we add a comment here indicating that this will change to support multiple performance datasets.
If it isn't too much work, you can probably plug in multiple perf datasets since the configs should support it.
| "total_samples_issued", | ||
| "total_samples_completed", | ||
| "total_samples_failed", | ||
| "tracked_samples_issued", | ||
| "tracked_samples_completed", | ||
| "tracked_duration_ns", | ||
| "total_duration_ns", |
These can be collected in a separate enum or something more type controlled - for instance a key, type (counter/series), data-type, streaming/non-streaming tuple. That will make extending to new metrics much easier.
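The typed-registry idea above could look something like this sketch. The enum names mirror the key strings in the diff; the `MetricKind` split and `dtype` field are assumptions about what "type controlled" might mean here.

```python
from enum import Enum


class MetricKind(Enum):
    COUNTER = "counter"
    SERIES = "series"


class MetricKey(Enum):
    """Sketch: each member carries its string key, kind, and value type,
    so adding a new metric touches exactly one place."""

    TOTAL_SAMPLES_ISSUED = ("total_samples_issued", MetricKind.COUNTER, int)
    TOTAL_SAMPLES_COMPLETED = ("total_samples_completed", MetricKind.COUNTER, int)
    TOTAL_SAMPLES_FAILED = ("total_samples_failed", MetricKind.COUNTER, int)
    TRACKED_DURATION_NS = ("tracked_duration_ns", MetricKind.COUNTER, int)

    def __init__(self, key: str, kind: MetricKind, dtype: type) -> None:
        self.key = key
        self.kind = kind
        self.dtype = dtype
```

Consumers then iterate `MetricKey` instead of maintaining parallel string lists.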
```python
# Report metrics: prefer Report from KVStore, fall back to SessionResult
if report is not None and report.duration_ns is not None:
    perf_elapsed = report.duration_ns / 1e9
    total_issued = report.n_samples_issued
```
We should rename n_samples_issued to perf_samples_issued or tracked_samples_issued to be consistent.
| "successful": success, | ||
| "failed": report.n_samples_failed, | ||
| "elapsed_time": elapsed, | ||
| "successful": max(0, total_issued - n_errors), |
This shouldn't be necessary. Is this because we do not distinguish accuracy/saturation/performance errors?
```python
class PhaseType(str, Enum):
    """Phase types control tracking and reporting behavior."""

    PERFORMANCE = "performance"
    ACCURACY = "accuracy"
    SATURATION = "saturation"
```
We might want to rethink this - the only difference between perf/accuracy is whether we track metrics/results - and if the overhead is low enough, we can probably do it for both. The dataset characteristics would define whether it is for performance (uniform, small-ish OSL) or accuracy.
Similarly, saturation requires neither; it only lacks a barrier before the next phase, whereas performance datasets always have a barrier after them.
We can expose those knobs to the user, and the system shouldn't care what the intent of the phase is.
Just a thought though.
```python
if self._stop_requested:
    return True
if (
    self._current_phase_issuer
    and self._current_phase_issuer.issued_count >= total_samples
):
    return True
if (
    max_duration_ns > 0
    and (time.monotonic_ns() - phase_start_ns) >= max_duration_ns
):
    return True
```
When returning true, can we log the reason so it is clear that all samples were issued, or max-duration was reached etc.
```python
class PhaseIssuerProtocol(Protocol):
    """Minimal interface that strategies see for issuing samples."""

    def issue(self, sample_index: int) -> str | None: ...
```
Please document when None is expected. If I understand correctly, str is the id of the sample issued.
```python
    self.pbar.set_postfix(refresh=True, errors=len(self.errors))
elif self.collect_responses:
    self.responses[result.id] = result.get_response_output_string()
    # StreamChunk(is_complete=True) — no response text to collect, just count it
```
heads up: this can never happen btw (worker will never send streamchunk with is-complete set True)
```python
tokenizer = AutoTokenizer.from_pretrained(model_name)
logger.info("Tokenizer loaded successfully")
return tokenizer

from huggingface_hub import model_info
```
nit: move imports to top?
```python
phases: list[PhaseConfig] = []

# Performance phase
phases.append(
```
Does this mean a benchmark always needs to have a performance phase?
```python
logger.warning(f"Client cleanup error: {e}")
publisher.close()
await asyncio.to_thread(launcher.wait_for_exit, 10.0)
zmq_ctx.cleanup()
```
nit: can use zmq ctx scope we already support
| """Create a KVStoreReader pre-registered with all metric keys.""" | ||
| reader = BasicKVStoreReader(metrics_dir) | ||
| # Counter keys (from MetricCounterKey enum) | ||
| for key in [ |
Thinking if it's possible to auto-populate the enum names automatically, so we don't have to maintain a list in setup_kv_reader?
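Since Enum classes are iterable in definition order, the list can indeed be derived from the enum itself. A sketch, using an illustrative subset of the counter keys (the real `MetricCounterKey` may differ):

```python
from enum import Enum


class MetricCounterKey(str, Enum):
    # illustrative subset of the counter keys shown earlier
    TOTAL_SAMPLES_ISSUED = "total_samples_issued"
    TOTAL_SAMPLES_COMPLETED = "total_samples_completed"
    TOTAL_SAMPLES_FAILED = "total_samples_failed"


def counter_key_names() -> list[str]:
    """Sketch: derive the registration list from the enum, so
    setup_kv_reader can never drift out of sync with it."""
    return [key.value for key in MetricCounterKey]
```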
```python
and phase_issuer is not None
and resp.id in phase_issuer.uuid_to_index
):
    phase_issuer.inflight -= 1
```
repeated block can be made into a named function?
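The extracted helper could look like this sketch; the function name and the True/False return are assumptions about how the call sites would use it.

```python
def finish_inflight(phase_issuer, resp_id: str) -> bool:
    """Sketch of the suggested helper: decrement the in-flight count when a
    tracked response completes. Returns True if the id was tracked."""
    if phase_issuer is None or resp_id not in phase_issuer.uuid_to_index:
        return False
    phase_issuer.inflight -= 1
    return True
```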
viraatc
left a comment
looks great thanks so much!
```python
# Verify sample_idx_map has both phases
with (report_dir / "sample_idx_map.json").open("rb") as f:
    import msgspec.json
```
```python
complete_events = [
    e for e in events if e.get("event_type") == "sample.complete"
]
# Should have both perf (3) and accuracy (5) completions
assert len(complete_events) >= 5
```
This test can be replicated by using the generated config file from the test to re-run though the dataset generation would be a bit challenging. Possibly as a followup feature.
```python
):
    dummy_dataloader = Dataset.load_from_file(
        dataset_path,
        ds_pickle_dataset_path,
```
Did I miss removing the pickle in the filename? If yes, please update here.
What does this PR do?
Overhauls the old multi-threaded design of the load generator and converts it to use async event loops for better compatibility with the PubSub system and the HTTP client.
Type of change
Related issues
Testing
Checklist