
[WIP] Refactor loaders v2#689

Draft
ajcasagrande wants to merge 5 commits into main from ajc/refactor-loaders-v2

Conversation

@ajcasagrande
Contributor

@ajcasagrande ajcasagrande commented Feb 18, 2026

Summary by CodeRabbit

  • Documentation

    • Added comprehensive dataset loaders reference guide and custom loader creation documentation
    • Updated tutorial examples across all benchmarking scenarios
  • New Features

    • Added model selection strategies (round-robin, random, shuffle) for configurable model rotation during benchmarks
    • Expanded synthetic data generation capabilities for multimodal and ranking scenarios
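The three rotation strategies can be sketched as stand-alone generators. This is an illustrative version only, not the actual aiperf API; the function names are invented for the example.

```python
import random
from itertools import cycle
from typing import Iterator


def round_robin(models: list[str]) -> Iterator[str]:
    """Cycle through models in fixed order, repeating forever."""
    return cycle(models)


def random_selection(models: list[str], rng: random.Random) -> Iterator[str]:
    """Draw an independent uniform-random model for every request."""
    while True:
        yield rng.choice(models)


def shuffle_selection(models: list[str], rng: random.Random) -> Iterator[str]:
    """Visit each model exactly once per pass, reshuffling between passes."""
    while True:
        order = models[:]
        rng.shuffle(order)
        yield from order
```

Shuffle differs from random in that every model is guaranteed to appear once per pass, which keeps per-model request counts balanced.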

…ture

Eliminate the composer abstraction layer, consolidating dataset composition
logic directly into DatasetManager and the loaders themselves. This reduces
indirection and makes the data flow easier to follow.

- Remove composer/ directory (base, custom, synthetic, synthetic_rankings)
- Replace base_loader.py and base_public_dataset.py with simplified base.py
- Add file/ and synthetic/ loader subdirectories
- Extract model_selection_strategies, output_tokens_sampler, public_datasets
- Update plugin registry with new loader categories and schema
- Update docs and tests to reflect new architecture

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Loaders now receive a DatasetBackingStoreProtocol and write conversations
as they're built, eliminating the ~2x peak memory from materializing the
full dataset in-memory before writing to mmap.

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
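A minimal sketch of the push-based shape described above, assuming a hypothetical write method on the store protocol (the real DatasetBackingStoreProtocol differs):

```python
from typing import Protocol


class BackingStore(Protocol):
    """Stand-in for DatasetBackingStoreProtocol (method name assumed)."""

    def write(self, conversation: dict) -> None: ...


class ListStore:
    """Toy in-memory store that records conversations as they arrive."""

    def __init__(self) -> None:
        self.rows: list[dict] = []

    def write(self, conversation: dict) -> None:
        self.rows.append(conversation)


def load(lines: list[str], store: BackingStore) -> int:
    """Write each conversation as it is built, instead of materializing
    the full dataset first and writing it in a second pass."""
    written = 0
    for i, line in enumerate(lines):
        store.write({"conversation_id": str(i), "text": line})
        written += 1
    return written
```

Because each conversation is handed off as soon as it is built, peak memory is one conversation plus the store, not the whole dataset twice.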
…tions

Loaders are now pure producers — load() yields finalized Conversation
objects instead of pushing them into a backing store. DatasetManager
drives storage and metadata collection via async for.

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
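The pull-based shape can be sketched with a plain async generator; the names here are illustrative, not the real loader interface:

```python
import asyncio
from typing import AsyncIterator


async def load(lines: list[str]) -> AsyncIterator[dict]:
    """A pure producer: yields finalized conversations, touches no store."""
    for i, line in enumerate(lines):
        yield {"conversation_id": str(i), "text": line}


async def run_manager(lines: list[str]) -> list[dict]:
    """The manager drives iteration and owns storage and metadata."""
    stored: list[dict] = []
    async for conversation in load(lines):
        stored.append(conversation)  # storage + metadata collection here
    return stored
```

Inverting control this way keeps loaders free of storage concerns and lets the manager decide how and where conversations are persisted.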
…dependencies

Move session_id_generator, prompt_generator, model_selector,
output_tokens_sampler, and seq_distribution into LoaderContext. Loaders
now access shared state via self.ctx instead of holding individual
references, reducing constructor complexity and coupling.

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
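The shape of that change, sketched with hypothetical names (the real LoaderContext carries more fields than this):

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Iterator


@dataclass
class LoaderContext:
    """Bundles shared dependencies so loaders take one handle, not five."""

    model_names: list[str]
    _session_ids: Iterator[int] = field(default_factory=count)

    def next_session_id(self) -> str:
        return f"session-{next(self._session_ids)}"


class ExampleLoader:
    def __init__(self, ctx: LoaderContext) -> None:
        self.ctx = ctx  # single reference to shared state

    def new_session(self) -> str:
        return self.ctx.next_session_id()
```

Since every loader holds the same ctx, shared generators stay consistent across loaders without each constructor threading them through individually.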
Replace the hardcoded PublicDataset dataclass registry with a proper
public_dataset plugin category. PublicDatasetLoader now delegates to
file loaders specified in plugin metadata, unifying public datasets
with the rest of the plugin architecture. Also fixes sampling strategy
checks to consistently use model_fields_set.

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
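The delegation pattern, reduced to a toy registry. The keys, URL, and loader names are invented for illustration; the real plugin metadata schema differs.

```python
# Toy registry mapping a public dataset name to download metadata plus
# the file loader that should parse the downloaded artifact.
PUBLIC_DATASETS = {
    "sharegpt": {
        "url": "https://example.com/sharegpt.json",  # placeholder URL
        "file_loader": "single_turn",
    },
}

FILE_LOADERS = {
    "single_turn": lambda path: f"parsed {path} with single_turn",
}


def load_public_dataset(name: str) -> str:
    """Look up plugin metadata, then delegate parsing to the named file
    loader (the download step is elided in this sketch)."""
    meta = PUBLIC_DATASETS[name]
    return FILE_LOADERS[meta["file_loader"]](meta["url"])
```

With the registry driven by plugin metadata rather than a hardcoded dataclass, adding a public dataset becomes a metadata entry instead of a code change.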
@github-actions

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@e00b7c6410bb3eb6d442d88177d816fb0bd36198

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@e00b7c6410bb3eb6d442d88177d816fb0bd36198

Last updated for commit: e00b7c6

@ajcasagrande ajcasagrande marked this pull request as draft February 18, 2026 23:00
@coderabbitai

coderabbitai bot commented Feb 18, 2026

Walkthrough

This pull request refactors the dataset loading architecture from a composer-based system to a loader-based system. It renames the CLI flag --custom-dataset-type to --dataset-type, removes composer classes, introduces new loader abstractions with a context system, adds synthetic loaders, and updates comprehensive documentation covering dataset loaders, model selection strategies, and plugin system changes.

Changes

Cohort / File(s) Summary
Documentation: CLI and API Reference Updates
docs/api/synthesis.md, docs/cli_options.md, docs/benchmark_modes/*, docs/tutorials/*, docs/plugins/plugin-system.md, docs/genai-perf-feature-comparison.md
Renamed --custom-dataset-type flag to --dataset-type and updated option descriptions to include new synthetic loaders (synthetic_multimodal, synthetic_rankings) and sharegpt. Added model_selection_strategy options (shuffle) and clarified dataset auto-detection behavior.
Documentation: New Comprehensive Guides
docs/dataset-loaders.md, docs/dev/creating-dataset-loaders.md, mkdocs.yml
Added extensive documentation for dataset loaders system, including schema definitions, examples, auto-detection, troubleshooting, and developer guide for creating custom loaders. Updated navigation structure.
Documentation: Content Migration and Updates
docs/benchmark_datasets.md, docs/plugins/creating-your-first-plugin.md
Migrated content from benchmark_datasets to dataset-loaders, updated copyright year to 2026, and changed plugin documentation references from dataset_composer to dataset_loader.
Core Configuration and Enums
src/aiperf/common/config/config_defaults.py, src/aiperf/common/config/input_config.py, src/aiperf/common/config/user_config.py, src/aiperf/common/enums/enums.py, src/aiperf/common/enums/__init__.py, src/aiperf/common/tokenizer_validator.py
Renamed CUSTOM_DATASET_TYPE to DATASET_TYPE, replaced CustomDatasetType enum with DatasetLoaderType, moved PublicDatasetType to plugin enums, added SHUFFLE strategy to ModelSelectionStrategy, and updated all validation logic to reference new dataset_type field.
Dataset Composer Removal
src/aiperf/dataset/composer/__init__.py, src/aiperf/dataset/composer/base.py, src/aiperf/dataset/composer/custom.py, src/aiperf/dataset/composer/synthetic.py, src/aiperf/dataset/composer/synthetic_rankings.py
Completely removed dataset composer system including BaseDatasetComposer and all concrete implementations (CustomDatasetComposer, SyntheticDatasetComposer, SyntheticRankingsDatasetComposer).
New Loader Base Classes
src/aiperf/dataset/loader/base.py, src/aiperf/dataset/loader/context.py, src/aiperf/dataset/loader/file/base.py, src/aiperf/dataset/loader/file/__init__.py
Introduced BaseDatasetLoader abstract class, LoaderContext to centralize shared dependencies (config, tokenizer, generators), and BaseFileLoader for file-based loaders with two-stage pipeline (parse_and_validate → convert_to_conversations).
File-based Loaders Refactoring
src/aiperf/dataset/loader/single_turn.py, src/aiperf/dataset/loader/multi_turn.py, src/aiperf/dataset/loader/random_pool.py, src/aiperf/dataset/loader/mooncake_trace.py, src/aiperf/dataset/loader/sharegpt.py
Migrated all file-based loaders from old interface to new BaseFileLoader; changed constructors to accept LoaderContext, renamed load_dataset to parse_and_validate, updated can_load to can_load_file with stricter signatures, and replaced direct config access with ctx-based access.
New Synthetic Loaders
src/aiperf/dataset/loader/synthetic/__init__.py, src/aiperf/dataset/loader/synthetic/base.py, src/aiperf/dataset/loader/synthetic/multimodal.py, src/aiperf/dataset/loader/synthetic/rankings.py
Introduced new synthetic dataset loaders: BaseSyntheticLoader, SyntheticMultiModalLoader (with text/image/audio/video generation), and SyntheticRankingsLoader (for ranking datasets).
Loader Support Infrastructure
src/aiperf/dataset/loader/__init__.py, src/aiperf/dataset/loader/base_loader.py, src/aiperf/dataset/loader/base_public_dataset.py, src/aiperf/dataset/loader/models.py, src/aiperf/dataset/loader/protocols.py
Removed old base_loader.py and base_public_dataset.py; updated loader imports/exports; replaced CustomDatasetLoaderProtocol with DatasetLoaderProtocol (async load instead of two-stage); updated dataset model type hints from CustomDatasetType to DatasetLoaderType.
Public and Dataset Management
src/aiperf/dataset/public_datasets.py, src/aiperf/dataset/model_selection_strategies.py, src/aiperf/dataset/output_tokens_sampler.py, src/aiperf/dataset/dataset_manager.py
Added PublicDatasetLoader for downloading and delegating to file loaders, introduced model selection strategies (RoundRobin, Random, Shuffle), created OutputTokensSampler for max_tokens sampling, and completely refactored DatasetManager to use loader-based architecture with streaming and pluggable loaders.
Plugin System Updates
src/aiperf/plugin/categories.yaml, src/aiperf/plugin/enums.py, src/aiperf/plugin/plugins.py, src/aiperf/plugin/plugins.yaml, src/aiperf/plugin/schema/plugins.schema.json, src/aiperf/plugin/schema/schemas.py
Replaced dataset_composer/custom_dataset_loader with dataset_loader/public_dataset categories; introduced model_selection_strategy and dataset_sampler; updated enums from CustomDatasetType/ComposerType to DatasetLoaderType/PublicDatasetType/ModelSelectionStrategyType; added PublicDatasetMetadata schema; updated all plugin definitions and documentation.
Test Updates: Config and Enums
tests/unit/common/config/test_input_config.py, tests/unit/common/config/test_user_config_mooncake_trace.py, tests/unit/timing/test_timing_integration.py
Replaced CustomDatasetType with DatasetLoaderType, updated InputConfig field from custom_dataset_type to dataset_type, adjusted assertions and error messages to match new API.
Test Updates: Composer Removal
tests/unit/dataset/composer/__init__.py, tests/unit/dataset/composer/conftest.py, tests/unit/dataset/composer/test_base_composer.py, tests/unit/dataset/composer/test_custom_composer.py, tests/unit/dataset/composer/test_synthetic_composer.py, tests/unit/dataset/composer/test_synthetic_rankings_composer.py
Removed all composer-related tests and fixtures since composer system was deleted.
Test Updates: Loader API Changes
tests/unit/dataset/loader/conftest.py, tests/unit/dataset/loader/test_can_load.py, tests/unit/dataset/loader/test_single_turn.py, tests/unit/dataset/loader/test_multi_turn.py, tests/unit/dataset/loader/test_random_pool.py, tests/unit/dataset/loader/test_trace.py, tests/unit/dataset/loader/test_sharegpt.py
Updated loader tests to use new LoaderContext-based constructors, parse_and_validate method, can_load_file API, and DatasetLoaderType enum. Introduced LoaderContext fixture and helper functions for type inference.
Test Updates: DatasetManager
tests/unit/dataset/conftest.py, tests/unit/dataset/test_dataset_manager.py, tests/unit/dataset/test_dataset_manager_inputs_json.py, tests/unit/common/models/test_sequence_distribution.py
Updated fixtures to use new async loader-based manager initialization, replaced in-memory dataset storage with dataset_metadata/backing store patterns, adjusted mocks for new loader interface, and refactored sequence distribution tests from composer to loader model.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hop hop hooray! The composers take a bow,
Loaders leap to center stage—refactored now!
From --custom-type to --dataset-type so bright,
Synthetic pathways shimmer with delight! 🌟
A grand restructuring, done just right! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

- Title check: ❓ Inconclusive. The title '[WIP] Refactor loaders v2' is vague and generic, using non-descriptive terms that don't convey meaningful information about the actual changes. Resolution: replace it with a more specific title, such as 'Refactor dataset loaders architecture' or 'Replace composer pattern with unified loader hierarchy'.
✅ Passed checks (2 passed)
- Description Check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.
- Docstring Coverage: ✅ Passed. Docstring coverage is 85.76%, which meets the required 80.00% threshold.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/aiperf/dataset/loader/random_pool.py (1)

88-198: ⚠️ Potential issue | 🟠 Major

Use async I/O in parse_and_validate() to avoid blocking the event loop.

The parse_and_validate() method is called synchronously within the async load() method (BaseFileLoader line 44), blocking the event loop while reading files. Both _validate_path() and _load_dataset_from_file() use open() for synchronous file I/O.

Convert to async by either:

  1. Making parse_and_validate() async and using aiofiles for file operations, or
  2. Using asyncio.to_thread() to execute blocking I/O off the event loop

Per coding guidelines: "Use async/await for ALL I/O operations - no blocking calls."
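Option 2 can be sketched as follows, with a stand-in for the blocking parse (function names are hypothetical):

```python
import asyncio


def parse_and_validate_sync(lines: list[str]) -> list[str]:
    """Stands in for the blocking open()/readline loop."""
    return [line.strip() for line in lines if line.strip()]


async def load(lines: list[str]) -> list[str]:
    # The event loop stays responsive while parsing runs in a worker thread.
    return await asyncio.to_thread(parse_and_validate_sync, lines)
```

asyncio.to_thread is the smaller change since the existing synchronous helpers can be reused unmodified; aiofiles requires rewriting the read path as async.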

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/dataset/loader/random_pool.py` around lines 88 - 198, Change
parse_and_validate to be asynchronous and ensure all blocking file operations
are moved off the event loop: make parse_and_validate an async def and either
(preferred minimal-change) call the existing blocking helpers (_validate_path
and _load_dataset_from_file) via asyncio.to_thread, or refactor file reads in
_validate_path and _load_dataset_from_file to use aiofiles and await them.
Update can_load_directory and any callers that expect parse_and_validate to be
sync (e.g., BaseFileLoader.load) to await the new async parse_and_validate, and
ensure ValidationError behavior is preserved; reference the methods
RandomPoolDatasetLoader.parse_and_validate,
RandomPoolDatasetLoader._validate_path, and
RandomPoolDatasetLoader._load_dataset_from_file when making the changes.
src/aiperf/dataset/loader/mooncake_trace.py (1)

96-101: ⚠️ Potential issue | 🟠 Major

Use async file I/O to prevent blocking the event loop in parse_and_validate.

parse_and_validate() is called directly from the async load() method (BaseFileLoader line 44) without awaiting, blocking the event loop with synchronous open(). Replace with aiofiles.open() for non-blocking file reads, or use asyncio.to_thread() to offload the operation if refactoring to async is not feasible. The codebase uses both patterns (see aiofiles in post_processors and asyncio.to_thread in dataset_manager, record_processor_service).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/dataset/loader/mooncake_trace.py` around lines 96 - 101,
parse_and_validate currently uses synchronous open(self.filename) which blocks
the event loop because load() in BaseFileLoader calls it from async code; change
the file read to non-blocking I/O by using aiofiles.open in parse_and_validate
(or wrap the existing blocking loop in asyncio.to_thread) so each line is read
asynchronously, then call MooncakeTrace.model_validate_json for each non-empty
stripped line as before; ensure you import aiofiles and update
parse_and_validate to be async (or if using asyncio.to_thread, run the whole
parsing lambda in to_thread) and keep references to self.filename and
MooncakeTrace.model_validate_json unchanged.
🧹 Nitpick comments (19)
tests/unit/common/models/test_sequence_distribution.py (1)

687-697: Extract shared MockLoader to a class-level fixture to eliminate duplication

MockLoader is defined identically in both test methods. Factor it out to a class-level @pytest.fixture (or a simple nested class) to honor DRY and reduce maintenance surface.

♻️ Proposed refactor
 class TestSequenceCaching:
     """Test turn-level sequence length caching for ISL/OSL consistency."""
+
+    @pytest.fixture
+    def MockLoader(self):
+        from aiperf.dataset.loader.base import BaseDatasetLoader
+        from aiperf.common.models import Conversation
+        from aiperf.plugin.enums import DatasetSamplingStrategy
+
+        class _MockLoader(BaseDatasetLoader):
+            def load(self) -> list[Conversation]:
+                return []
+
+            @classmethod
+            def can_load(cls, _config) -> bool:
+                return False
+
+            @classmethod
+            def get_preferred_sampling_strategy(cls) -> DatasetSamplingStrategy:
+                return DatasetSamplingStrategy.SEQUENTIAL
+
+        return _MockLoader
 
-    def test_turn_sequence_caching(self):
+    def test_turn_sequence_caching(self, MockLoader):
         ...
-        # Create mock loader
-        class MockLoader(BaseDatasetLoader):
-            ...
-
-    def test_different_turns_get_different_cache_entries(self):
+    def test_different_turns_get_different_cache_entries(self, MockLoader):
         ...
-        # Create mock loader
-        class MockLoader(BaseDatasetLoader):
-            ...

Also applies to: 739-749

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/common/models/test_sequence_distribution.py` around lines 687 -
697, MockLoader is duplicated in two tests; extract it out as a shared
class-level fixture or a nested class to follow DRY. Create a single reusable
MockLoader (subclassing BaseDatasetLoader) that implements load, can_load, and
get_preferred_sampling_strategy, and then reference that fixture/class from both
test methods instead of redefining it; ensure the symbol names remain
MockLoader, load, can_load, and get_preferred_sampling_strategy so existing
tests pick it up.
tests/unit/dataset/conftest.py (1)

75-87: empty_dataset_manager doesn't need to be async.

This fixture has no await calls. While pytest-asyncio handles async fixtures fine, marking it async is misleading. Consider making it a regular def for clarity.

Suggested change
 @pytest.fixture
-async def empty_dataset_manager(user_config: UserConfig) -> DatasetManager:
+def empty_dataset_manager(user_config: UserConfig) -> DatasetManager:
     """Create a DatasetManager instance with empty dataset (no backing store/client)."""
     manager = DatasetManager(
         service_config=ServiceConfig(),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/dataset/conftest.py` around lines 75 - 87, The fixture
empty_dataset_manager is declared async but contains no awaits; change its
definition from async def empty_dataset_manager(...) to a regular def so it's a
synchronous pytest fixture. Update the signature while keeping the return type
DatasetManager and ensure the body that constructs DatasetManager, sets
manager.dataset_metadata using DatasetMetadata and
DatasetSamplingStrategy.SEQUENTIAL, and returns manager remains unchanged.
src/aiperf/dataset/loader/context.py (1)

23-24: Empty TYPE_CHECKING block is dead code.

The if TYPE_CHECKING: pass block serves no purpose. Remove it or populate it with the conditional imports it was intended for.

Suggested removal
-if TYPE_CHECKING:
-    pass
-
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/dataset/loader/context.py` around lines 23 - 24, Remove the dead
`if TYPE_CHECKING: pass` block in the module and either delete it entirely or
replace it with the intended conditional imports; locate the `if TYPE_CHECKING`
block in the context loader module (symbol: TYPE_CHECKING) and remove the no-op,
or populate it with any forward-only imports required by type hints (e.g., the
specific classes/types used only for typing), ensuring no runtime imports are
introduced.
src/aiperf/dataset/dataset_manager.py (1)

144-192: _generate_input_payloads fetches every conversation individually — potential N+1 concern.

Line 161 calls await self._dataset_client.get_conversation(conv_meta.conversation_id) inside a loop over all conversations. If the dataset client involves any I/O per call (e.g., mmap seek + deserialization), this could be slow for large datasets. Consider whether a batch retrieval API would be beneficial.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/dataset/dataset_manager.py` around lines 144 - 192,
_generate_input_payloads currently performs an N+1 fetch by calling
self._dataset_client.get_conversation(conv_meta.conversation_id) inside the loop
over self.dataset_metadata.conversations; change this to either (a) add and use
a batch retrieval method on the dataset client (e.g.,
get_conversations(conversation_ids: list[str])) and fetch all conversations in
one call before iterating, or (b) if batching is not supported, collect all
conversation IDs then concurrently fetch them with asyncio.gather to avoid
serial I/O; update usage sites in _generate_input_payloads (replace per-loop
get_conversation calls) and keep the rest of the logic (session_payloads_map,
RequestInfo creation, endpoint.format_payload) unchanged.
src/aiperf/dataset/loader/synthetic/__init__.py (1)

4-6: Consider also exporting the concrete synthetic loaders from this subpackage.

SyntheticMultiModalLoader and SyntheticRankingsLoader live in sibling modules under synthetic/ but are not re-exported here. Consumers who navigate directly to aiperf.dataset.loader.synthetic would not find them. The parent loader/__init__.py re-exports them, but making the subpackage self-contained is idiomatic.

♻️ Proposed addition
 from aiperf.dataset.loader.synthetic.base import BaseSyntheticLoader
+from aiperf.dataset.loader.synthetic.multimodal import SyntheticMultiModalLoader
+from aiperf.dataset.loader.synthetic.rankings import SyntheticRankingsLoader

-__all__ = ["BaseSyntheticLoader"]
+__all__ = ["BaseSyntheticLoader", "SyntheticMultiModalLoader", "SyntheticRankingsLoader"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/dataset/loader/synthetic/__init__.py` around lines 4 - 6, The
synthetic subpackage __init__ currently only exports BaseSyntheticLoader, which
hides the concrete loaders; update
src/aiperf/dataset/loader/synthetic/__init__.py to also import and include
SyntheticMultiModalLoader and SyntheticRankingsLoader in __all__ so that
consumers can access those classes directly from aiperf.dataset.loader.synthetic
(reference the symbols SyntheticMultiModalLoader, SyntheticRankingsLoader and
BaseSyntheticLoader when modifying the module exports).
tests/unit/common/config/test_user_config_mooncake_trace.py (1)

43-108: Consider @pytest.mark.parametrize for the repeated _count_dataset_entries scenarios

Five tests in TestMooncakeTraceRequestCount (lines 43–145) share identical scaffolding — same @patch decorators, same UserConfig/InputConfig construction, only mock_file_content and the asserted count differ. Per coding guidelines this pattern should use @pytest.mark.parametrize.

♻️ Sketch of a parametrized replacement
-    @patch("pathlib.Path.exists", return_value=True)
-    @patch("pathlib.Path.is_file", return_value=True)
-    def test_mooncake_trace_uses_dataset_size(self, mock_is_file, mock_exists):
-        ...
-    @patch("pathlib.Path.exists", return_value=True)
-    @patch("pathlib.Path.is_file", return_value=True)
-    def test_mooncake_trace_skips_empty_lines(self, mock_is_file, mock_exists):
-        ...
-    # (and three more near-identical tests)
+    @pytest.mark.parametrize(
+        "file_content,expected_count",
+        [
+            (
+                '{"input_length": 100, "hash_ids": [1], "timestamp": 1000}\n'
+                '{"input_length": 200, "hash_ids": [2], "timestamp": 2000}\n'
+                '{"input_length": 300, "hash_ids": [3], "timestamp": 3000}\n',
+                3,
+            ),
+            (  # empty lines skipped
+                '{"input_length": 100}\n\n   \n{"input_length": 200}\n{"input_length": 300}\n',
+                3,
+            ),
+            ("", 0),  # empty file
+        ],
+    )
+    @patch("pathlib.Path.exists", return_value=True)
+    @patch("pathlib.Path.is_file", return_value=True)
+    def test_count_dataset_entries_mooncake_trace_returns_expected_count(
+        self, mock_is_file, mock_exists, file_content, expected_count
+    ):
+        config = UserConfig(
+            endpoint=EndpointConfig(model_names=["test-model"]),
+            loadgen=LoadGeneratorConfig(request_count=999),
+            input=InputConfig(
+                file="/fake/path/test.jsonl",
+                dataset_type=DatasetLoaderType.MOONCAKE_TRACE,
+            ),
+        )
+        with patch("builtins.open", mock_open(read_data=file_content)):
+            assert config._count_dataset_entries() == expected_count

As per coding guidelines: "Use @pytest.mark.parametrize for data-driven tests."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/common/config/test_user_config_mooncake_trace.py` around lines 43
- 108, Multiple tests (test_mooncake_trace_uses_dataset_size,
test_mooncake_trace_skips_empty_lines,
test_mooncake_trace_empty_file_returns_zero) duplicate the same setup and only
vary mock_file_content and expected count; refactor them into a single
parametrized test using pytest.mark.parametrize that supplies
(mock_file_content, expected_count) and reuses the existing patches for
pathlib.Path.exists and is_file and the same UserConfig/InputConfig
construction, then call config._count_dataset_entries() and assert equality for
each parameter set.
src/aiperf/plugin/schema/schemas.py (1)

314-334: PublicDatasetMetadata (and the whole file) uses pydantic.BaseModel instead of AIPerfBaseModel

All eight classes in this file, including the new PublicDatasetMetadata, inherit directly from pydantic.BaseModel. Per coding guidelines, src/**/*.py data models should use AIPerfBaseModel.

If this file is intentionally exempt (e.g., to avoid circular imports with the plugin bootstrap path), consider adding a module-level comment to document the rationale so future contributors don't inadvertently add AIPerfBaseModel-based classes here.

As per coding guidelines: "Use AIPerfBaseModel for data models and BaseConfig for configuration."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/plugin/schema/schemas.py` around lines 314 - 334,
PublicDatasetMetadata and the other model classes in this module inherit from
pydantic.BaseModel instead of the project's AIPerfBaseModel; change the base
class for PublicDatasetMetadata (and the other seven classes) from BaseModel to
AIPerfBaseModel so they follow the coding guideline for data models (refer to
class PublicDatasetMetadata and other model class definitions), and if you
intentionally must avoid AIPerfBaseModel (e.g., to prevent import cycles with
the plugin bootstrap), add a clear module-level comment explaining the exemption
and rationale so future contributors understand why BaseModel was used.
src/aiperf/dataset/loader/models.py (2)

254-256: Consider renaming CustomDatasetT to align with the new naming convention.

The TypeVar is still named CustomDatasetT while the enum it relates to has been renamed from CustomDatasetType to DatasetLoaderType. A name like DatasetModelT would be more consistent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/dataset/loader/models.py` around lines 254 - 256, Rename the
TypeVar CustomDatasetT to follow the updated naming convention (e.g.,
DatasetModelT) so it aligns with DatasetLoaderType; update the declaration that
currently reads CustomDatasetT = TypeVar("CustomDatasetT", bound=SingleTurn |
MultiTurn | RandomPool | MooncakeTrace) to use the new identifier
(DatasetModelT) and adjust any references/usages of CustomDatasetT across the
module to the new name to ensure consistency with DatasetLoaderType and avoid
unresolved names.

24-24: type fields lack Field(description=...) per coding guidelines.

The type discriminator fields on lines 24, 103, 130, and 206 are modified but don't use Field(description=...). While these are structural discriminators, the guideline applies to every Pydantic field.

Example fix for line 24
-    type: Literal[DatasetLoaderType.SINGLE_TURN] = DatasetLoaderType.SINGLE_TURN
+    type: Literal[DatasetLoaderType.SINGLE_TURN] = Field(DatasetLoaderType.SINGLE_TURN, description="Discriminator for single-turn dataset type.")

As per coding guidelines: src/**/*.py: Add Field(description="...") on EVERY Pydantic field.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/dataset/loader/models.py` at line 24, The Pydantic `type`
discriminator fields (e.g., the field declared as `type:
Literal[DatasetLoaderType.SINGLE_TURN] = DatasetLoaderType.SINGLE_TURN`) must
include a Field(...) with a description per guidelines; update each `type`
declaration (the ones at the locations using `DatasetLoaderType.SINGLE_TURN` and
the other discriminator literals around lines 103, 130, and 206) to use
Field(description="...") giving a short description of the discriminator, and
ensure `Field` is imported from `pydantic` if not already present.
docs/dataset-loaders.md (1)

428-430: Minor: Missing horizontal rule before "Synthetic Loaders" section.

All other top-level loader sections (single_turn, multi_turn, random_pool, mooncake_trace, sharegpt) end with a --- separator. The sharegpt section at line 428 is missing one before the "## Synthetic Loaders" heading at line 430.

Proposed fix
 **Default sampling strategy:** Sequential
 
+---
+
 ## Synthetic Loaders
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/dataset-loaders.md` around lines 428 - 430, Add the missing horizontal
rule separator before the "## Synthetic Loaders" heading: insert a line with
`---` immediately after the "Default sampling strategy: Sequential" (or at the
end of the sharegpt section) so the "## Synthetic Loaders" heading is preceded
by the same `---` top-level separator used elsewhere.
tests/unit/dataset/test_dataset_manager.py (1)

91-98: Suppress Ruff ARG001 for mock helper parameters.

Ruff flags args and kwargs as unused (ARG001). These are intentionally ignored since this is a patch replacement for BaseFileLoader.load which receives self when called as an instance method. Consider adding an inline suppression or using _ prefixes.

Suggested fix
-async def _mock_load_dummy_conversation(*args, **kwargs):
+async def _mock_load_dummy_conversation(*_args, **_kwargs):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/dataset/test_dataset_manager.py` around lines 91 - 98, The helper
function _mock_load_dummy_conversation is intentionally written to accept the
instance args passed when patching BaseFileLoader.load but Ruff flags unused
args/kwargs (ARG001); update the signature to accept `_ *args` and `**_kwargs`
(or rename to `_args, **_kwargs`) or add an inline noqa comment (e.g., `# noqa:
ARG001`) to the function definition so Ruff no longer complains while preserving
its behavior when used as a replacement for BaseFileLoader.load.
src/aiperf/dataset/loader/multi_turn.py (1)

116-138: Blocking file I/O in parse_and_validate runs on the event loop.

open(self.filename) at line 127 is synchronous blocking I/O, called from the async def load() pipeline in BaseFileLoader. For large JSONL files this could block the event loop. This is a pre-existing pattern carried from the old load_dataset and is an architectural constraint of the BaseFileLoader design. Consider wrapping with asyncio.to_thread or using aiofiles if this becomes a bottleneck.

As per coding guidelines, "Use async/await for ALL I/O operations - no time.sleep, no blocking calls."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/dataset/loader/multi_turn.py` around lines 116 - 138,
parse_and_validate is performing synchronous file I/O by calling
open(self.filename) inside an async pipeline (see async def load in
BaseFileLoader); change the file reading to run off the event loop (either wrap
the blocking work in asyncio.to_thread or switch to an async file reader like
aiofiles) so parsing remains non-blocking, i.e., move the loop that reads lines
and calls MultiTurn.model_validate_json into a to_thread call or replace
open/read with aiofiles and await the reads while keeping session_id generation
via self.ctx.session_id_generator.next() intact.
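A minimal sketch of the `asyncio.to_thread` approach suggested above. The helper and file names here are illustrative, not the real loader API; the point is that the blocking read runs on a worker thread while the event loop stays free:

```python
import asyncio
import os
import tempfile

def _parse_lines_blocking(filename: str) -> list[str]:
    # Blocking read; intended to run on a worker thread, never on the loop.
    with open(filename) as f:
        return [line.strip() for line in f if line.strip()]

async def parse_and_validate(filename: str) -> list[str]:
    # Offload the blocking file read so the event loop is not stalled.
    return await asyncio.to_thread(_parse_lines_blocking, filename)

# Demo with a throwaway JSONL file.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"text": "hello"}\n\n{"text": "world"}\n')
    _path = f.name
lines = asyncio.run(parse_and_validate(_path))
os.unlink(_path)
```

The same pattern works for the per-line `model_validate_json` calls: move the whole parse loop into the threaded function and return the validated objects.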
src/aiperf/dataset/loader/synthetic/multimodal.py (1)

28-37: Video modality excluded from the "at least one payload" validation.

The constructor validates that at least one of prompt, image, or audio is enabled, but include_video is not checked. A user enabling only video generation would hit this error. The warning at line 102–106 has the same omission. If video-only usage is not a supported configuration, this is fine — but consider documenting that explicitly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/dataset/loader/synthetic/multimodal.py` around lines 28 - 37, The
constructor's "at least one payload" validation and the subsequent warning block
omit include_video, causing video-only configs to be rejected; update the
validation in the class constructor to include self.include_video in the boolean
check (so the ValueError only fires when prompt, image, audio, and video are all
False), and update the corresponding warning message block to include
include_video in its condition and text so video-only generation is accepted and
the error/warning message accurately lists video as a valid enabled modality.
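A sketch of the corrected validation, using a hypothetical stand-in class (the real loader's constructor takes a config object, not bare flags); the key change is that `include_video` participates in the "at least one payload" check:

```python
class MultiModalPayloadConfig:
    """Hypothetical stand-in for the loader's modality flags."""

    def __init__(
        self,
        include_prompt: bool = False,
        include_image: bool = False,
        include_audio: bool = False,
        include_video: bool = False,
    ) -> None:
        self.include_prompt = include_prompt
        self.include_image = include_image
        self.include_audio = include_audio
        self.include_video = include_video
        # Video is part of the check, so a video-only config is accepted.
        if not any((include_prompt, include_image, include_audio, include_video)):
            raise ValueError(
                "At least one of prompt, image, audio, or video must be enabled."
            )
```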
src/aiperf/dataset/loader/synthetic/base.py (3)

37-38: Stale "composer" naming in RNG derivation keys.

The seed strings "composer.conversation.turn_count" and "composer.conversation.turn_delay" are leftovers from the old composer architecture. They still work functionally but are inconsistent with the loader-based refactoring. Note: changing them would alter RNG sequences for users relying on seeded reproducibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/dataset/loader/synthetic/base.py` around lines 37 - 38, The RNG
derivation keys use stale "composer" prefixes; update the seed strings passed to
rng.derive for _turn_sampler_rng and _delay_sampler_rng from
"composer.conversation.turn_count" and "composer.conversation.turn_delay" to the
new loader-consistent keys (e.g. "loader.conversation.turn_count" and
"loader.conversation.turn_delay") by changing the arguments in the calls that
create self._turn_sampler_rng and self._delay_sampler_rng; be mindful this will
change RNG sequences for seeded runs.
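To illustrate why the rename is not free, here is a toy key-based derivation (the real `rng.derive` implementation may hash differently; this is only a sketch of the principle):

```python
import hashlib
import random

def derive_rng(base_seed: int, key: str) -> random.Random:
    # Hypothetical derivation: mix the seed and key through a hash so each
    # key gets an independent, reproducible stream.
    digest = hashlib.sha256(f"{base_seed}:{key}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

# Same seed, different key -> different stream. Renaming the derivation key
# therefore changes sampled values for users relying on seeded reproducibility.
old_stream = derive_rng(42, "composer.conversation.turn_count").random()
new_stream = derive_rng(42, "loader.conversation.turn_count").random()
```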

107-118: Inconsistent null-guard on video vs. image/audio payload generation.

_generate_video_payloads (line 116) guards with if data: before appending, but _generate_image_payloads and _generate_audio_payloads append unconditionally. If a generator can return a falsy value, the same guard should be applied consistently; otherwise, the guard is dead code.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/dataset/loader/synthetic/base.py` around lines 107 - 118, The
video loader currently checks for falsy returns before appending (in
_generate_video_payloads -> video_generator.generate + video.contents.append)
while _generate_image_payloads and _generate_audio_payloads append
unconditionally; make these consistent: either remove the `if data:` guard from
_generate_video_payloads or add the same `if data:` guard to
_generate_image_payloads and _generate_audio_payloads around their generator
calls (image_generator.generate / audio_generator.generate) and their
contents.append calls, depending on whether generators can legitimately return
falsy values—update all three methods so they use the same null-guard behavior.

138-143: Consider standardizing modality check logic for consistency.

include_image and include_audio both check .mean > 0, but include_video uses bool(width and height). While this works correctly (video width/height are int | None, not config objects), the inconsistent checking pattern makes the logic harder to follow. Either document why video differs from other modalities, or refactor to use a consistent pattern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/dataset/loader/synthetic/base.py` around lines 138 - 143, The
modality checks are inconsistent: include_image/include_audio use ".mean > 0"
while include_video checks width/height truthiness; update include_video to
follow the same pattern or explicitly document the difference. Modify the
include_video property to first try the same mean-based check (e.g., video_cfg =
self.ctx.config.input.video; return bool(getattr(video_cfg, "mean", 0) > 0) or
fallback to bool(video_cfg.width and video_cfg.height)), keeping references to
include_video and self.ctx.config.input.video.width/height so behavior is
consistent and clear across modalities.
tests/unit/dataset/loader/test_can_load.py (2)

19-31: Missing error wrapping compared to production _infer_type.

The production DatasetManager._infer_type() (from the relevant snippet at dataset_manager.py lines 369-422) wraps the explicit type resolution in try/except (ValueError, KeyError) to produce a user-friendly error message for invalid type values. This test helper on line 24 does not — DatasetLoaderType(data["type"]) will raise a raw ValueError for unknown type strings. This is fine for test inputs, but worth noting the behavioral divergence.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/dataset/loader/test_can_load.py` around lines 19 - 31, The test
helper _infer_type differs from production by not wrapping explicit type
resolution; adjust _infer_type to catch ValueError and KeyError raised by
DatasetLoaderType(data["type"]) or plugins.get_class(...) and re-raise a
ValueError with a user-friendly message (mirroring DatasetManager._infer_type)
that includes the invalid type string, then proceed to call
LoaderClass.can_load(data, filename) and raise the existing error if it cannot
handle the data; reference _infer_type, DatasetLoaderType, plugins.get_class,
and LoaderClass.can_load when locating the change.
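A sketch of the error-wrapping pattern the production code uses, with an illustrative enum subset (the real `DatasetLoaderType` lives in the aiperf codebase):

```python
from enum import Enum

class DatasetLoaderType(str, Enum):
    # Illustrative subset of the real enum.
    SINGLE_TURN = "single_turn"
    MULTI_TURN = "multi_turn"

def resolve_loader_type(raw: str) -> DatasetLoaderType:
    # Wrap the raw ValueError so users see the valid choices, mirroring the
    # try/except in DatasetManager._infer_type described above.
    try:
        return DatasetLoaderType(raw)
    except (ValueError, KeyError) as e:
        valid = ", ".join(t.value for t in DatasetLoaderType)
        raise ValueError(
            f"Invalid dataset type {raw!r}. Valid types are: {valid}"
        ) from e
```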

19-46: Test helper _infer_type duplicates production logic from DatasetManager._infer_type().

This standalone helper mirrors the production inference logic for testability without instantiating DatasetManager. This is a reasonable pattern, but the duplicate can drift from the production implementation over time. Consider adding a comment noting the source of truth, or in a future iteration, extracting the inference logic into a shared utility that both DatasetManager and tests can use.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/dataset/loader/test_can_load.py` around lines 19 - 46, The test
helper _infer_type duplicates production logic from DatasetManager._infer_type;
to prevent drift, add a clear comment in the test above _infer_type that points
to DatasetManager._infer_type as the source of truth (mention DatasetLoaderType,
plugins.iter_all and PluginType.DATASET_LOADER), and optionally note that the
logic should be extracted to a shared utility in the future; if you prefer
immediate consolidation, move the inference code into a small internal utility
function (used by both DatasetManager._infer_type and the test) and update
references in the test to call that utility instead of duplicating logic.
src/aiperf/dataset/loader/single_turn.py (1)

90-110: Blocking file I/O in parse_and_validate.

Line 101 uses synchronous open(self.filename) which violates the coding guideline requiring async I/O for all operations. However, the parent BaseFileLoader.load() calls this synchronously within an async generator, so this appears to be an intentional architectural decision for file-based loaders (file parsing is CPU-bound and typically fast).

If large files are expected, consider wrapping in asyncio.to_thread or using aiofiles in a future iteration.

As per coding guidelines, src/**/*.py: "Use async/await for ALL I/O operations - no time.sleep, no blocking calls".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/dataset/loader/single_turn.py` around lines 90 - 110,
parse_and_validate is doing blocking file I/O with open(self.filename) which
violates the async I/O guideline; make it async and perform file reads without
blocking the event loop. Change parse_and_validate to async def
parse_and_validate(...) and either (preferred) use aiofiles.open(self.filename)
and iterate lines with async for (preserving the SingleTurn.model_validate_json
and session id generation), or wrap the synchronous read in await
asyncio.to_thread(read_and_parse_fn, self.filename) if you prefer a minimal
change; also update the caller (e.g., BaseFileLoader.load) to await
parse_and_validate so the async flow remains correct and ensure you preserve
usage of self.ctx.session_id_generator.next() and returned dict[str,
list[SingleTurn]].
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@mkdocs.yml`:
- Around line 28-33: The navigation has a naming collision: two "Development"
entries (one linking to Development.md under Reference and one as a top-level
section containing dev/creating-dataset-loaders.md). Rename the top-level
section to a unique label (e.g., "Plugin Development", "Extending AIPerf", or
"Contributing") or move dev/creating-dataset-loaders.md under the existing
"Development" entry in mkdocs.yml, updating the nav key for the section that
contains dev/creating-dataset-loaders.md and ensuring the other link to
Development.md remains unchanged.

In `@src/aiperf/dataset/dataset_manager.py`:
- Around line 334-368: The _infer_dataset_type function performs blocking file
I/O via open(file_path) which can block the event loop when called from async
paths (_configure_dataset → _create_loader → _create_file_loader); change the
implementation to perform the file read on a thread (use asyncio.to_thread) or
convert _infer_dataset_type/_create_file_loader to async and use aiofiles so the
loop isn't blocked; specifically, move the logic that opens and iterates lines
(and calls load_json_str) into a callable passed to asyncio.to_thread (or an
aiofiles-based async read) and keep the rest of _infer_type(filename=...,
data=...) unchanged so callers like _create_loader/_configure_dataset remain
non-blocking.

In `@src/aiperf/dataset/loader/context.py`:
- Around line 121-124: The property seq_distribution is missing a return type
hint; change its signature to return SequenceLengthDistribution | None and add
SequenceLengthDistribution to the import list from aiperf.common.models so the
property (def seq_distribution(self) -> SequenceLengthDistribution | None) and
its use of self._seq_distribution are correctly typed.

In `@src/aiperf/dataset/loader/sharegpt.py`:
- Around line 59-61: The synchronous file read and JSON parsing in
parse_and_validate/load_json_str are blocking the async event loop called from
BaseFileLoader.load; change the implementation so the blocking work runs off the
event loop — either make parse_and_validate async using aiofiles or (preferred
to match repo patterns) call the blocking parts (open/read and load_json_str)
inside asyncio.to_thread from the loader's async entrypoint (e.g., within
ShareGPT loader's parse_and_validate or from BaseFileLoader.load) so that the
file open/read and JSON parsing happen on a threadpool instead of blocking the
event loop.

In `@src/aiperf/dataset/model_selection_strategies.py`:
- Around line 32-40: RoundRobinModelSelectionStrategy.__init__ (and likewise
RandomModelSelectionStrategy.__init__ and
ShuffleModelSelectionStrategy.__init__) must defensively validate the incoming
model_names list to reject empty input; add a guard that checks that model_names
is a non-empty list and raises a clear ValueError (or TypeError if it is not a
list) with a helpful message when invalid, so select() cannot raise
ZeroDivisionError/IndexError at runtime; update the constructors to validate
model_names before assigning to self._model_names and initializing self._counter
or shuffle state.
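A minimal sketch of the defensive constructor for the round-robin case (class and method names follow the review comment, but the exact production signatures may differ):

```python
import itertools

class RoundRobinModelSelectionStrategy:
    """Sketch: validate model_names before storing any selection state."""

    def __init__(self, model_names: list[str]) -> None:
        if not isinstance(model_names, list):
            raise TypeError(
                f"model_names must be a list, got {type(model_names).__name__}"
            )
        if not model_names:
            raise ValueError("model_names must contain at least one model")
        self._model_names = model_names
        self._counter = itertools.count()

    def select(self) -> str:
        # Safe: _model_names is guaranteed non-empty, so the modulo can
        # never divide by zero and the index can never be out of range.
        return self._model_names[next(self._counter) % len(self._model_names)]
```

The random and shuffle strategies would apply the same guard before initializing their RNG or shuffle state.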

In `@src/aiperf/dataset/public_datasets.py`:
- Around line 78-99: The async function _download uses blocking filesystem calls
on cache_filepath (Path.exists(), cache_filepath.parent.mkdir, and
open(...).write) which can block the event loop; change these to async
equivalents by either using aiofiles for file write (open/write) and using
asyncio.to_thread for Path.exists() and mkdir (or use asyncio.to_thread for all
three) so all filesystem I/O in _download is non-blocking; ensure you reference
AIPERF_DATASET_CACHE_DIR and cache_filepath in the changes and preserve the
existing error handling that raises DatasetLoaderError on save failures.
- Around line 64-68: Remove the unused, fragile classmethod
get_preferred_sampling_strategy from the dataset loader base in
public_datasets.py; callers already use the instance-level delegation path
(get_delegate_preferred_sampling_strategy) or call the method on concrete loader
instances, so deleting the classmethod eliminates the risky class-level contract
without changing runtime behavior—remove the def
get_preferred_sampling_strategy(cls) -> DatasetSamplingStrategy: ... block and
keep existing instance methods like get_delegate_preferred_sampling_strategy and
concrete implementations intact.
- Around line 39-49: The _download() helper currently performs blocking
filesystem calls (Path.mkdir and built-in open/write) from async code called by
_get_delegate(); change _download() to create directories with
asyncio.to_thread(Path.mkdir, parents=True, exist_ok=True) and perform file
writes using aiofiles (await aiofiles.open(...) and await file.write(...)) so
all I/O is non-blocking, and keep awaiting the async operations before returning
the local path used by _get_delegate().
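A sketch of the non-blocking download helper, assuming the `asyncio.to_thread` route (an `aiofiles` write would work equally well for the final step); the function name and byte-payload parameter are illustrative, not the real `_download` signature:

```python
import asyncio
import tempfile
from pathlib import Path

async def download_to_cache(data: bytes, cache_filepath: Path) -> Path:
    # All three filesystem calls (exists, mkdir, write) run on worker
    # threads so the event loop is never blocked.
    if await asyncio.to_thread(cache_filepath.exists):
        return cache_filepath
    await asyncio.to_thread(cache_filepath.parent.mkdir, parents=True, exist_ok=True)
    await asyncio.to_thread(cache_filepath.write_bytes, data)
    return cache_filepath

# Demo against a temporary cache directory.
with tempfile.TemporaryDirectory() as _tmp:
    _target = Path(_tmp) / "datasets" / "sharegpt.json"
    _result = asyncio.run(download_to_cache(b"{}", _target))
    saved = _result.read_bytes()
```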

In `@tests/unit/common/config/test_input_config.py`:
- Around line 138-148: The test docstring in
test_dataset_type_without_file_raises_error still references the old validator
name `validate_custom_dataset_file`; update the docstring to use the new
validator name (e.g., the current validator on InputConfig) so it reflects the
refactor—locate the test function `test_dataset_type_without_file_raises_error`
and replace the stale reference with the new validator identifier while keeping
the rest of the docstring intact.

In `@tests/unit/common/models/test_sequence_distribution.py`:
- Line 682: Remove the unused Conversation import from the test module: in the
import line that currently reads "from aiperf.common.models import Conversation,
Turn" used by test_turn_sequence_caching and
test_different_turns_get_different_cache_entries, drop Conversation so only Turn
is imported; update the import statement accordingly to silence the linter and
leave test logic (Turn usage) unchanged.
- Line 679: The test method names do not follow the required
test_<function>_<scenario>_<expected> convention; rename the methods to be
explicit: change test_turn_sequence_caching to
test_get_turn_sequence_lengths_same_turn_id_returns_cached_value and change
test_different_turns_get_different_cache_entries to
test_get_turn_sequence_lengths_distinct_turn_ids_cached_independently so the
function under test (get_turn_sequence_lengths) and the scenario/expected
outcome are encoded in the name; update any references/imports or test markers
that use those names accordingly (e.g., in
tests/unit/common/models/test_sequence_distribution.py where
test_turn_sequence_caching and test_different_turns_get_different_cache_entries
are defined).
- Around line 699-705: The test is calling nonexistent loader methods
`_get_turn_sequence_lengths` and `_clear_turn_cache` on MockLoader; fix by using
the correct public API on the context or implement the methods on the loader:
either change the test to call the LoaderContext public method
`ctx.get_turn_sequence_lengths(turn_id)` (and use the context to clear any cache
instead of calling `_clear_turn_cache`) and ensure `_seq_distribution` and
`_turn_sequence_cache` are set on the `LoaderContext`, or add explicit
implementations of `_get_turn_sequence_lengths` and `_clear_turn_cache` on
`BaseDatasetLoader` that delegate to the loader's `LoaderContext` storage
(`_seq_distribution`, `_turn_sequence_cache`) so the existing test calls
succeed; update references to `MockLoader`, `BaseDatasetLoader`,
`LoaderContext`, `ctx.get_turn_sequence_lengths`, `_seq_distribution`, and
`_turn_sequence_cache` accordingly.

---

Outside diff comments:
In `@src/aiperf/dataset/loader/mooncake_trace.py`:
- Around line 96-101: parse_and_validate currently uses synchronous
open(self.filename) which blocks the event loop because load() in BaseFileLoader
calls it from async code; change the file read to non-blocking I/O by using
aiofiles.open in parse_and_validate (or wrap the existing blocking loop in
asyncio.to_thread) so each line is read asynchronously, then call
MooncakeTrace.model_validate_json for each non-empty stripped line as before;
ensure you import aiofiles and update parse_and_validate to be async (or if
using asyncio.to_thread, run the whole parsing lambda in to_thread) and keep
references to self.filename and MooncakeTrace.model_validate_json unchanged.

In `@src/aiperf/dataset/loader/random_pool.py`:
- Around line 88-198: Change parse_and_validate to be asynchronous and ensure
all blocking file operations are moved off the event loop: make
parse_and_validate an async def and either (preferred minimal-change) call the
existing blocking helpers (_validate_path and _load_dataset_from_file) via
asyncio.to_thread, or refactor file reads in _validate_path and
_load_dataset_from_file to use aiofiles and await them. Update
can_load_directory and any callers that expect parse_and_validate to be sync
(e.g., BaseFileLoader.load) to await the new async parse_and_validate, and
ensure ValidationError behavior is preserved; reference the methods
RandomPoolDatasetLoader.parse_and_validate,
RandomPoolDatasetLoader._validate_path, and
RandomPoolDatasetLoader._load_dataset_from_file when making the changes.

---

Nitpick comments:
In `@docs/dataset-loaders.md`:
- Around line 428-430: Add the missing horizontal rule separator before the "##
Synthetic Loaders" heading: insert a line with `---` immediately after the
"Default sampling strategy: Sequential" (or at the end of the sharegpt section)
so the "## Synthetic Loaders" heading is preceded by the same `---` top-level
separator used elsewhere.

In `@src/aiperf/dataset/dataset_manager.py`:
- Around line 144-192: _generate_input_payloads currently performs an N+1 fetch
by calling self._dataset_client.get_conversation(conv_meta.conversation_id)
inside the loop over self.dataset_metadata.conversations; change this to either
(a) add and use a batch retrieval method on the dataset client (e.g.,
get_conversations(conversation_ids: list[str])) and fetch all conversations in
one call before iterating, or (b) if batching is not supported, collect all
conversation IDs then concurrently fetch them with asyncio.gather to avoid
serial I/O; update usage sites in _generate_input_payloads (replace per-loop
get_conversation calls) and keep the rest of the logic (session_payloads_map,
RequestInfo creation, endpoint.format_payload) unchanged.
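Option (b) above can be sketched with `asyncio.gather`; the client call here is a hypothetical stand-in for `dataset_client.get_conversation`, and the point is simply that the fetches overlap instead of running serially:

```python
import asyncio

async def get_conversation(conversation_id: str) -> dict:
    # Hypothetical stand-in for the dataset client's per-conversation fetch.
    await asyncio.sleep(0)  # placeholder for real network I/O
    return {"id": conversation_id}

async def fetch_all(conversation_ids: list[str]) -> list[dict]:
    # Launch every fetch concurrently; gather preserves input order.
    return await asyncio.gather(
        *(get_conversation(cid) for cid in conversation_ids)
    )

conversations = asyncio.run(fetch_all(["c1", "c2", "c3"]))
```

If the volume of conversations is large, a semaphore-bounded variant would cap concurrent requests; a true batch API on the client (option (a)) remains the cheaper choice when available.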

In `@src/aiperf/dataset/loader/context.py`:
- Around line 23-24: Remove the dead `if TYPE_CHECKING: pass` block in the
module and either delete it entirely or replace it with the intended conditional
imports; locate the `if TYPE_CHECKING` block in the context loader module
(symbol: TYPE_CHECKING) and remove the no-op, or populate it with any
forward-only imports required by type hints (e.g., the specific classes/types
used only for typing), ensuring no runtime imports are introduced.

In `@src/aiperf/dataset/loader/models.py`:
- Around line 254-256: Rename the TypeVar CustomDatasetT to follow the updated
naming convention (e.g., DatasetModelT) so it aligns with DatasetLoaderType;
update the declaration that currently reads CustomDatasetT =
TypeVar("CustomDatasetT", bound=SingleTurn | MultiTurn | RandomPool |
MooncakeTrace) to use the new identifier (DatasetModelT) and adjust any
references/usages of CustomDatasetT across the module to the new name to ensure
consistency with DatasetLoaderType and avoid unresolved names.
- Line 24: The Pydantic `type` discriminator fields (e.g., the field declared as
`type: Literal[DatasetLoaderType.SINGLE_TURN] = DatasetLoaderType.SINGLE_TURN`)
must include a Field(...) with a description per guidelines; update each `type`
declaration (the ones at the locations using `DatasetLoaderType.SINGLE_TURN` and
the other discriminator literals around lines 103, 130, and 206) to use
Field(description="...") giving a short description of the discriminator, and
ensure `Field` is imported from `pydantic` if not already present.

In `@src/aiperf/dataset/loader/synthetic/__init__.py`:
- Around line 4-6: The synthetic subpackage __init__ currently only exports
BaseSyntheticLoader, which hides the concrete loaders; update
src/aiperf/dataset/loader/synthetic/__init__.py to also import and include
SyntheticMultiModalLoader and SyntheticRankingsLoader in __all__ so that
consumers can access those classes directly from aiperf.dataset.loader.synthetic
(reference the symbols SyntheticMultiModalLoader, SyntheticRankingsLoader and
BaseSyntheticLoader when modifying the module exports).

In `@src/aiperf/plugin/schema/schemas.py`:
- Around line 314-334: PublicDatasetMetadata and the other model classes in this
module inherit from pydantic.BaseModel instead of the project's AIPerfBaseModel;
change the base class for PublicDatasetMetadata (and the other seven classes)
from BaseModel to AIPerfBaseModel so they follow the coding guideline for data
models (refer to class PublicDatasetMetadata and other model class definitions),
and if you intentionally must avoid AIPerfBaseModel (e.g., to prevent import
cycles with the plugin bootstrap), add a clear module-level comment explaining
the exemption and rationale so future contributors understand why BaseModel was
used.

In `@tests/unit/common/config/test_user_config_mooncake_trace.py`:
- Around line 43-108: Multiple tests (test_mooncake_trace_uses_dataset_size,
test_mooncake_trace_skips_empty_lines,
test_mooncake_trace_empty_file_returns_zero) duplicate the same setup and only
vary mock_file_content and expected count; refactor them into a single
parametrized test using pytest.mark.parametrize that supplies
(mock_file_content, expected_count) and reuses the existing patches for
pathlib.Path.exists and is_file and the same UserConfig/InputConfig
construction, then call config._count_dataset_entries() and assert equality for
each parameter set.
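A sketch of the parametrized shape; `count_dataset_entries` is a hypothetical stand-in for `config._count_dataset_entries` (which in the real tests would be exercised through the patched `UserConfig`/`InputConfig` setup rather than a bare function):

```python
import pytest

def count_dataset_entries(content: str) -> int:
    # Stand-in for the config helper: count non-empty lines.
    return sum(1 for line in content.splitlines() if line.strip())

@pytest.mark.parametrize(
    ("mock_file_content", "expected_count"),
    [
        ('{"a": 1}\n{"b": 2}\n', 2),       # normal file
        ('{"a": 1}\n\n\n{"b": 2}\n', 2),   # empty lines skipped
        ("", 0),                            # empty file returns zero
    ],
)
def test_count_dataset_entries(mock_file_content, expected_count):
    assert count_dataset_entries(mock_file_content) == expected_count
```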

In `@tests/unit/common/models/test_sequence_distribution.py`:
- Around line 687-697: MockLoader is duplicated in two tests; extract it out as
a shared class-level fixture or a nested class to follow DRY. Create a single
reusable MockLoader (subclassing BaseDatasetLoader) that implements load,
can_load, and get_preferred_sampling_strategy, and then reference that
fixture/class from both test methods instead of redefining it; ensure the symbol
names remain MockLoader, load, can_load, and get_preferred_sampling_strategy so
existing tests pick it up.

In `@tests/unit/dataset/conftest.py`:
- Around line 75-87: The fixture empty_dataset_manager is declared async but
contains no awaits; change its definition from async def
empty_dataset_manager(...) to a regular def so it's a synchronous pytest
fixture. Update the signature while keeping the return type DatasetManager and
ensure the body that constructs DatasetManager, sets manager.dataset_metadata
using DatasetMetadata and DatasetSamplingStrategy.SEQUENTIAL, and returns
manager remains unchanged.

Comment on lines +28 to +33
      - Dataset Loaders: dataset-loaders.md
      - Benchmark Datasets: benchmark_datasets.md
      - Feature Comparison: genai-perf-feature-comparison.md
      - Development: Development.md
  - Development:
      - Creating Dataset Loaders: dev/creating-dataset-loaders.md

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

echo "=== Reading mkdocs.yml to verify navigation structure ==="
cat -n mkdocs.yml

echo ""
echo "=== Checking for dataset-loaders.md ==="
find docs -name "dataset-loaders.md" -o -name "dataset-loaders.md" 2>/dev/null || echo "Not found"

echo ""
echo "=== Checking for creating-dataset-loaders.md ==="
find docs -name "creating-dataset-loaders.md" 2>/dev/null || echo "Not found"

echo ""
echo "=== Checking full 'dev/' directory contents ==="
find docs/dev -type f -name "*.md" 2>/dev/null || echo "dev directory not found or empty"

echo ""
echo "=== Checking all doc files with 'loader' in name ==="
find docs -type f -name "*loader*" 2>/dev/null || echo "No files with 'loader' found"

Repository: ai-dynamo/aiperf

Length of output: 2569


Naming collision in navigation structure.

Lines 31 and 32-33 create two navigation entries named "Development" at different hierarchy levels: one under "Reference" pointing to Development.md, and one as a top-level section containing Creating Dataset Loaders. This results in duplicate "Development" items in the navigation menu, creating confusion for users. Consider renaming the new top-level section (e.g., Plugin Development:, Extending AIPerf:, or Contributing:) or merging its content under an existing section.

Both referenced documentation files (dataset-loaders.md and dev/creating-dataset-loaders.md) exist and will not cause build failures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mkdocs.yml` around lines 28 - 33, The navigation has a naming collision: two
"Development" entries (one linking to Development.md under Reference and one as
a top-level section containing dev/creating-dataset-loaders.md). Rename the
top-level section to a unique label (e.g., "Plugin Development", "Extending
AIPerf", or "Contributing") or move dev/creating-dataset-loaders.md under the
existing "Development" entry in mkdocs.yml, updating the nav key for the section
that contains dev/creating-dataset-loaders.md and ensuring the other link to
Development.md remains unchanged.

Comment on lines +334 to +368
def _infer_dataset_type(self, file_path: str) -> DatasetLoaderType:
"""Infer the dataset type from the input file.

Queries all registered loaders to check if they can handle the data format.

Args:
file_path: Path to the JSONL file or directory

Returns:
DatasetLoaderType if successfully inferred

Raises:
ValueError: If no loader can handle the data format
"""
path = Path(file_path)

# If it's a directory, use path-based detection only
if path.is_dir():
return self._infer_type(data=None, filename=file_path)

# For files, read first non-empty line and use both content and path detection
try:
with open(file_path) as f:
for line in f:
if not (line := line.strip()):
continue
data = load_json_str(line)
return self._infer_type(data=data, filename=file_path)
except ValueError as e:
self.exception(
f"Error inferring dataset type from file: {file_path}: {e!r}"
)
raise

raise ValueError(f"Empty file: {file_path}. Cannot infer dataset type.")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate the file and check its existence
find . -name "dataset_manager.py" -type f

Repository: ai-dynamo/aiperf

Length of output: 99


🏁 Script executed:

# Read the relevant section to verify the code
sed -n '334,368p' src/aiperf/dataset/dataset_manager.py

Repository: ai-dynamo/aiperf

Length of output: 1326


🏁 Script executed:

# Check the _create_file_loader method to see if it calls _infer_dataset_type
rg -n '_create_file_loader|_infer_dataset_type' src/aiperf/dataset/dataset_manager.py -A 10 -B 2

Repository: ai-dynamo/aiperf

Length of output: 1668


🏁 Script executed:

# Check _create_loader to see if it's async and calls _create_file_loader
rg -n '_create_loader' src/aiperf/dataset/dataset_manager.py -A 15 -B 2

Repository: ai-dynamo/aiperf

Length of output: 1590


🏁 Script executed:

# Check _configure_dataset to see if it calls _create_loader
rg -n '_configure_dataset' src/aiperf/dataset/dataset_manager.py -A 30 -B 2 | head -80

Repository: ai-dynamo/aiperf

Length of output: 3548


🏁 Script executed:

# Verify the method signatures (async vs sync)
rg -n 'async def.*_configure_dataset|async def.*_create_loader|def _create_file_loader|def _infer_dataset_type' src/aiperf/dataset/dataset_manager.py

Repository: ai-dynamo/aiperf

Length of output: 387


🏁 Script executed:

# Check if asyncio.to_thread is already used in the file
rg -n 'asyncio\.to_thread|aiofiles' src/aiperf/dataset/dataset_manager.py

Repository: ai-dynamo/aiperf

Length of output: 113


Wrap blocking file I/O in asyncio.to_thread for event loop compatibility.

The open(file_path) at line 356 performs synchronous file I/O. This method is called from _configure_dataset (async) → _create_loader (async) → _create_file_loader (sync) → _infer_dataset_type (sync), running on the event loop thread and blocking it during file reading.

The file already uses asyncio.to_thread at line 136 for the tokenizer (a similar blocking operation). Apply the same pattern here: either wrap the file read in asyncio.to_thread or make _infer_dataset_type and _create_file_loader async and use aiofiles or asyncio.to_thread.

Per coding guidelines: "Use async/await for ALL I/O operations - no blocking calls."

🧰 Tools
🪛 Ruff (0.15.1)

[warning] 368-368: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/dataset/dataset_manager.py` around lines 334 - 368, The
_infer_dataset_type function performs blocking file I/O via open(file_path)
which can block the event loop when called from async paths (_configure_dataset
→ _create_loader → _create_file_loader); change the implementation to perform
the file read on a thread (use asyncio.to_thread) or convert
_infer_dataset_type/_create_file_loader to async and use aiofiles so the loop
isn't blocked; specifically, move the logic that opens and iterates lines (and
calls load_json_str) into a callable passed to asyncio.to_thread (or an
aiofiles-based async read) and keep the rest of _infer_type(filename=...,
data=...) unchanged so callers like _create_loader/_configure_dataset remain
non-blocking.

Comment on lines +121 to +124
@property
def seq_distribution(self):
"""The sequence distribution for ISL/OSL pairing, or None."""
return self._seq_distribution

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Read the specific lines in the file
sed -n '110,140p' src/aiperf/dataset/loader/context.py | cat -n

Repository: ai-dynamo/aiperf

Length of output: 1340


🏁 Script executed:

#!/bin/bash
# Search for get_sequence_distribution definition and its return type
rg "def get_sequence_distribution" --type py -A 5

Repository: ai-dynamo/aiperf

Length of output: 643


🏁 Script executed:

#!/bin/bash
# Find where _seq_distribution is assigned in the context.py file
rg "_seq_distribution" src/aiperf/dataset/loader/context.py -B 2 -A 2

Repository: ai-dynamo/aiperf

Length of output: 984


🏁 Script executed:

#!/bin/bash
# Find DistributionParser class and its parse method
rg "class DistributionParser" --type py -A 20

Repository: ai-dynamo/aiperf

Length of output: 2103


🏁 Script executed:

#!/bin/bash
# Also search for parse method definition within DistributionParser
rg "def parse" src/aiperf/common/models/sequence_distribution.py -B 2 -A 10

Repository: ai-dynamo/aiperf

Length of output: 641


🏁 Script executed:

#!/bin/bash
# Check imports at the top of context.py
head -50 src/aiperf/dataset/loader/context.py | cat -n

Repository: ai-dynamo/aiperf

Length of output: 2130


🏁 Script executed:

#!/bin/bash
# Search for SequenceLengthDistribution imports in the codebase
rg "from.*SequenceLengthDistribution|import.*SequenceLengthDistribution" --type py

Repository: ai-dynamo/aiperf

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Check how SequenceLengthDistribution is used in other files
rg "SequenceLengthDistribution" --type py -B 2 | head -40

Repository: ai-dynamo/aiperf

Length of output: 3177


🏁 Script executed:

#!/bin/bash
# Check if SequenceLengthDistribution is exported from aiperf.common.models
grep -n "SequenceLengthDistribution" src/aiperf/common/models/__init__.py

Repository: ai-dynamo/aiperf

Length of output: 132


🏁 Script executed:

#!/bin/bash
# Check the structure of aiperf.common.models
cat src/aiperf/common/models/__init__.py

Repository: ai-dynamo/aiperf

Length of output: 5912


Add return type annotation to seq_distribution property.

Per coding guidelines, all functions require type hints on parameters and return types. The correct return type is SequenceLengthDistribution | None (imported from aiperf.common.models), not the generic object | None.

Suggested fix
 @property
-def seq_distribution(self):
+def seq_distribution(self) -> SequenceLengthDistribution | None:
     """The sequence distribution for ISL/OSL pairing, or None."""
     return self._seq_distribution

Add SequenceLengthDistribution to the existing import from aiperf.common.models on line 11.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/dataset/loader/context.py` around lines 121 - 124, The property
seq_distribution is missing a return type hint; change its signature to return
SequenceLengthDistribution | None and add SequenceLengthDistribution to the
import list from aiperf.common.models so the property (def
seq_distribution(self) -> SequenceLengthDistribution | None) and its use of
self._seq_distribution are correctly typed.

Comment on lines +59 to +61
with open(self.filename) as f:
content = f.read()
dataset = load_json_str(content)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find the parse_and_validate method in sharegpt.py
rg -n "def parse_and_validate" src/aiperf/dataset/loader/sharegpt.py -A 10

Repository: ai-dynamo/aiperf

Length of output: 455


🏁 Script executed:

# Check the class structure and async context
rg -n "class.*Loader\|async def\|def parse_and_validate" src/aiperf/dataset/loader/sharegpt.py -B 2 -A 5

Repository: ai-dynamo/aiperf

Length of output: 42


🏁 Script executed:

# Find existing async filesystem patterns in the repo
rg -n "aiofiles|to_thread|run_in_executor|async.*open|read_text" src/aiperf -t py | head -30

Repository: ai-dynamo/aiperf

Length of output: 3107


🏁 Script executed:

# Check how the loader is instantiated and called
rg -n "ShareGPT|parse_and_validate" src/aiperf --type py -B 2 -A 2 | head -50

Repository: ai-dynamo/aiperf

Length of output: 3527


🏁 Script executed:

# Check the base loader class to understand the architecture
find src/aiperf/dataset/loader -name "*.py" -exec basename {} \; | sort

Repository: ai-dynamo/aiperf

Length of output: 250


🏁 Script executed:

# Examine BaseFileLoader class to understand the architecture
rg -n "class BaseFileLoader" src/aiperf/dataset/loader -A 50

Repository: ai-dynamo/aiperf

Length of output: 4073


🏁 Script executed:

# Find where parse_and_validate is called
rg -n "parse_and_validate" src/aiperf --type py -B 3 -A 3

Repository: ai-dynamo/aiperf

Length of output: 6804


🏁 Script executed:

# Check how loaders are invoked - look for load/initialize patterns
rg -n "def load|async def load|\.parse_and_validate" src/aiperf/dataset/loader/base.py -A 5

Repository: ai-dynamo/aiperf

Length of output: 300


Blocking file read will block the async event loop in the loader pipeline.

The open().read() at lines 59-60 performs synchronous blocking I/O, and parse_and_validate() is called directly (without await or wrapping) from the async load() method in BaseFileLoader (line 44 in base.py). This will block the event loop.

Recommended fix: either wrap the parse_and_validate() call with asyncio.to_thread() in the base loader's load() method, or convert parse_and_validate() to async using aiofiles. The repo already uses asyncio.to_thread() in similar contexts (e.g., record_processor_service.py, dataset_manager.py).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/dataset/loader/sharegpt.py` around lines 59 - 61, The synchronous
file read and JSON parsing in parse_and_validate/load_json_str are blocking the
async event loop called from BaseFileLoader.load; change the implementation so
the blocking work runs off the event loop — either make parse_and_validate async
using aiofiles or (preferred to match repo patterns) call the blocking parts
(open/read and load_json_str) inside asyncio.to_thread from the loader's async
entrypoint (e.g., within ShareGPT loader's parse_and_validate or from
BaseFileLoader.load) so that the file open/read and JSON parsing happen on a
threadpool instead of blocking the event loop.

Comment on lines +32 to +40
def __init__(self, model_names: list[str]) -> None:
self._model_names = model_names
self._counter = 0

def select(self, turn: Turn) -> str:
"""Select the next model in round-robin order."""
model_name = self._model_names[self._counter % len(self._model_names)]
self._counter += 1
return model_name

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and examine the model_selection_strategies.py file
find . -name "model_selection_strategies.py" -type f

Repository: ai-dynamo/aiperf

Length of output: 110


🏁 Script executed:

# Once we find the file, read its full content to understand all three strategies
cat -n src/aiperf/dataset/model_selection_strategies.py

Repository: ai-dynamo/aiperf

Length of output: 3042


🏁 Script executed:

# Search for where ModelSelectionStrategy is instantiated
rg -n "ModelSelectionStrategy\(" --type=py -B 2 -A 2

Repository: ai-dynamo/aiperf

Length of output: 368


🏁 Script executed:

# Check _create_model_selector or similar factory methods
rg -n "_create_model_selector" --type=py -A 15

Repository: ai-dynamo/aiperf

Length of output: 1929


🏁 Script executed:

# Search for validation of model_names in config or dataset context
rg -n "model_names" --type=py -B 3 -A 3 | head -100

Repository: ai-dynamo/aiperf

Length of output: 7813


🏁 Script executed:

# Check EndpointConfig definition and validation
rg -n "class EndpointConfig" --type=py -A 30

Repository: ai-dynamo/aiperf

Length of output: 2557


🏁 Script executed:

# Search for model_names field definition in EndpointConfig
rg -n "model_names" --type=py src/aiperf/common/config | head -50

Repository: ai-dynamo/aiperf

Length of output: 324


🏁 Script executed:

# Check if there's any validation in EndpointConfig
fd -e py -x grep -l "EndpointConfig" {} \; | head -5 | xargs -I {} grep -l "model_names" {}

Repository: ai-dynamo/aiperf

Length of output: 292


🏁 Script executed:

# Let's find the actual EndpointConfig definition
find . -name "*.py" -type f -exec grep -l "class EndpointConfig" {} \;

Repository: ai-dynamo/aiperf

Length of output: 105


🏁 Script executed:

# Read the full EndpointConfig to check for validation
cat -n src/aiperf/common/config/endpoint_config.py

Repository: ai-dynamo/aiperf

Length of output: 11093


🏁 Script executed:

# Check if there's validation for empty model_names in config_validators
grep -n "parse_str_or_list" src/aiperf/common/config/config_validators.py -A 10

Repository: ai-dynamo/aiperf

Length of output: 1828


🏁 Script executed:

# Check if there are any tests for empty model_names in strategies
rg -n "model_selection_strategies" --type=py tests/ -l

Repository: ai-dynamo/aiperf

Length of output: 42


🏁 Script executed:

# Check for any tests that might use empty model_names
rg -n "model_names\s*=\s*\[\s*\]" --type=py

Repository: ai-dynamo/aiperf

Length of output: 42


🏁 Script executed:

# Double-check parse_str_or_list to see if it validates non-empty
cat -n src/aiperf/common/config/config_validators.py | head -50

Repository: ai-dynamo/aiperf

Length of output: 2254


All three strategies crash on empty model_names, with no upstream validation to prevent it.

RoundRobinModelSelectionStrategy raises ZeroDivisionError (line 38), RandomModelSelectionStrategy raises IndexError (line 52), and ShuffleModelSelectionStrategy raises IndexError after a no-op reshuffle (line 78). While EndpointConfig.model_names is a required field, there is no min_length constraint (unlike urls which uses min_length=1), and the validator parse_str_or_list does not enforce non-empty lists. Since these are public plugin-registered strategies, adding defensive validation in each constructor would prevent cryptic runtime errors.

🛡️ Proposed fix: validate in constructors

For example, in RoundRobinModelSelectionStrategy:

 def __init__(self, model_names: list[str]) -> None:
+    if not model_names:
+        raise ValueError("model_names must not be empty")
     self._model_names = model_names
     self._counter = 0

Apply the same guard to RandomModelSelectionStrategy.__init__ and ShuffleModelSelectionStrategy.__init__.
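
A self-contained sketch of the guarded strategy (simplified: the `turn` parameter and plugin registration are omitted, so this is not the exact AIPerf class):

```python
class RoundRobinModelSelectionStrategy:
    """Cycles through model names in order; rejects an empty list up front."""

    def __init__(self, model_names: list[str]) -> None:
        if not model_names:
            # Fail fast with a clear message instead of ZeroDivisionError in select()
            raise ValueError("model_names must not be empty")
        self._model_names = model_names
        self._counter = 0

    def select(self) -> str:
        """Return the next model name in round-robin order."""
        name = self._model_names[self._counter % len(self._model_names)]
        self._counter += 1
        return name
```

Validating in the constructor means a misconfigured benchmark fails at setup time rather than mid-run on the first `select()` call.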

🧰 Tools
🪛 Ruff (0.15.1)

[warning] 36-36: Unused method argument: turn

(ARG002)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/dataset/model_selection_strategies.py` around lines 32 - 40,
RoundRobinModelSelectionStrategy.__init__ (and likewise
RandomModelSelectionStrategy.__init__ and
ShuffleModelSelectionStrategy.__init__) must defensively validate the incoming
model_names list to prevent empty lists; add a guard that checks if model_names
is truthy and has length > 0 and raise a clear ValueError (or TypeError if not a
list) with a helpful message when invalid, so select() cannot raise
ZeroDivisionError/IndexError at runtime; update the constructors to validate
model_names before assigning to self._model_names and initializing self._counter
or shuffle state.

Comment on lines +78 to +99
async def _download(name: str, url: str, remote_filename: str) -> Path:
"""Download a public dataset (or return cached path)."""
cache_filepath = AIPERF_DATASET_CACHE_DIR / remote_filename

if cache_filepath.exists():
return cache_filepath

http_client = AioHttpClient(timeout=Environment.DATASET.PUBLIC_DATASET_TIMEOUT)
try:
record: RequestRecord = await http_client.get_request(
url, headers={"Accept": "application/json"}
)
content = record.responses[0].text
finally:
await http_client.close()

try:
cache_filepath.parent.mkdir(parents=True, exist_ok=True)
with open(cache_filepath, "w") as f:
f.write(content)
except Exception as e:
raise DatasetLoaderError(f"Error saving dataset to local cache: {e}") from e

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check the actual file and surrounding context
cat -n src/aiperf/dataset/public_datasets.py

Repository: ai-dynamo/aiperf

Length of output: 4537


🏁 Script executed:

# Search for async filesystem patterns in the codebase
rg -n "aiofiles|to_thread|run_in_executor|asyncio\.to_thread|write_text|read_text" src/aiperf --type py

Repository: ai-dynamo/aiperf

Length of output: 3424


🏁 Script executed:

# Find where _download is called from to understand the context
rg -n "_download\(|from.*public_datasets import" src/aiperf --type py -A 2

Repository: ai-dynamo/aiperf

Length of output: 649


🏁 Script executed:

# Check for BaseFileLoader or similar async patterns in the dataset module
rg -n "async def.*open|async.*write|async.*mkdir|class.*FileLoader" src/aiperf/dataset --type py

Repository: ai-dynamo/aiperf

Length of output: 655


Use async file operations for filesystem I/O inside async function.

The _download function at lines 78–99 is async but uses blocking filesystem calls (Path.exists(), mkdir(), open().write()), which stall the event loop during dataset downloads. The codebase has established patterns: use aiofiles (as seen in raw_record_writer_processor.py, buffered_jsonl_writer_mixin.py, memory_map_utils.py) or asyncio.to_thread() (as used throughout for blocking operations like inference_result_parser.py, dataset_manager.py).

Rewrite the file operations (lines 82, 95–97) to use aiofiles or wrap with asyncio.to_thread().
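
A sketch of the `asyncio.to_thread` variant for the cache-write portion (`save_to_cache` is a hypothetical helper, not the repo's function; error handling is reduced to the essentials):

```python
import asyncio
from pathlib import Path


async def save_to_cache(cache_filepath: Path, content: str) -> Path:
    """Write downloaded content to the local cache without blocking the event loop."""
    # Path.exists() hits the filesystem, so it runs on a worker thread too
    if await asyncio.to_thread(cache_filepath.exists):
        return cache_filepath
    await asyncio.to_thread(cache_filepath.parent.mkdir, parents=True, exist_ok=True)
    await asyncio.to_thread(cache_filepath.write_text, content)
    return cache_filepath
```

The same `try/except` that raises `DatasetLoaderError` can wrap the two `to_thread` calls unchanged, since exceptions from the worker thread propagate through the `await`.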

🧰 Tools
🪛 Ruff (0.15.1)

[warning] 78-78: Unused function argument: name

(ARG001)


[warning] 99-99: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/dataset/public_datasets.py` around lines 78 - 99, The async
function _download uses blocking filesystem calls on cache_filepath
(Path.exists(), cache_filepath.parent.mkdir, and open(...).write) which can
block the event loop; change these to async equivalents by either using aiofiles
for file write (open/write) and using asyncio.to_thread for Path.exists() and
mkdir (or use asyncio.to_thread for all three) so all filesystem I/O in
_download is non-blocking; ensure you reference AIPERF_DATASET_CACHE_DIR and
cache_filepath in the changes and preserve the existing error handling that
raises DatasetLoaderError on save failures.

Comment on lines +138 to 148
 def test_dataset_type_without_file_raises_error():
     """
-    Test that setting custom_dataset_type without a file raises ValidationError.
+    Test that setting dataset_type without a file raises ValidationError.

     This validates the validate_custom_dataset_file model validator.
     """
     with pytest.raises(ValidationError) as exc:
-        InputConfig(custom_dataset_type=CustomDatasetType.SINGLE_TURN, file=None)
+        InputConfig(dataset_type=DatasetLoaderType.SINGLE_TURN, file=None)

-    assert "Custom dataset type requires --input-file to be provided" in str(exc.value)
+    assert "Dataset type requires --input-file to be provided" in str(exc.value)


⚠️ Potential issue | 🟡 Minor

Stale docstring reference to old validator name.

Line 142 still references validate_custom_dataset_file, which was renamed in this refactor. Update to match the new validator name.

📝 Suggested fix
 def test_dataset_type_without_file_raises_error():
     """
     Test that setting dataset_type without a file raises ValidationError.
 
-    This validates the validate_custom_dataset_file model validator.
+    This validates the validate_dataset_type_file model validator.
     """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/common/config/test_input_config.py` around lines 138 - 148, The
test docstring in test_dataset_type_without_file_raises_error still references
the old validator name `validate_custom_dataset_file`; update the docstring to
use the new validator name (e.g., the current validator on InputConfig) so it
reflects the refactor—locate the test function
`test_dataset_type_without_file_raises_error` and replace the stale reference
with the new validator identifier while keeping the rest of the docstring
intact.

@@ -679,30 +679,39 @@ class TestSequenceCaching:
def test_turn_sequence_caching(self):

⚠️ Potential issue | 🟡 Minor

Test names don't follow the required test_<function>_<scenario>_<expected> convention

Both test method names omit the scenario and expected-outcome components mandated by the project conventions.

Suggested renames:

  • test_turn_sequence_caching → test_get_turn_sequence_lengths_same_turn_id_returns_cached_value
  • test_different_turns_get_different_cache_entries → test_get_turn_sequence_lengths_distinct_turn_ids_cached_independently

As per coding guidelines: "Name tests as test_<function>_<scenario>_<expected> (e.g., test_parse_config_missing_field_raises_error)".

Also applies to: 731-731

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/common/models/test_sequence_distribution.py` at line 679, The test
method names do not follow the required test_<function>_<scenario>_<expected>
convention; rename the methods to be explicit: change test_turn_sequence_caching
to test_get_turn_sequence_lengths_same_turn_id_returns_cached_value and change
test_different_turns_get_different_cache_entries to
test_get_turn_sequence_lengths_distinct_turn_ids_cached_independently so the
function under test (get_turn_sequence_lengths) and the scenario/expected
outcome are encoded in the name; update any references/imports or test markers
that use those names accordingly (e.g., in
tests/unit/common/models/test_sequence_distribution.py where
test_turn_sequence_caching and test_different_turns_get_different_cache_entries
are defined).


-from aiperf.common.models import Turn
-from aiperf.dataset.composer.base import BaseDatasetComposer
+from aiperf.common.models import Conversation, Turn

⚠️ Potential issue | 🟡 Minor

Conversation is imported but never used in either test method

Both test_turn_sequence_caching (line 682) and test_different_turns_get_different_cache_entries (line 734) import Conversation from aiperf.common.models, but neither method instantiates it — only Turn is used. Remove the unused import to silence the linter.

🛠️ Proposed fix
- from aiperf.common.models import Conversation, Turn
+ from aiperf.common.models import Turn

(apply to both test methods)

Also applies to: 734-734

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/common/models/test_sequence_distribution.py` at line 682, Remove
the unused Conversation import from the test module: in the import line that
currently reads "from aiperf.common.models import Conversation, Turn" used by
test_turn_sequence_caching and test_different_turns_get_different_cache_entries,
drop Conversation so only Turn is imported; update the import statement
accordingly to silence the linter and leave test logic (Turn usage) unchanged.

Comment on lines +699 to +705
 # Set up loader with distribution
 loader = MockLoader.__new__(MockLoader)
 dist = DistributionParser.parse("128,64:50;256,128:50")
-composer._seq_distribution = dist
-composer._turn_sequence_cache = {}
+loader._seq_distribution = dist
+loader._turn_sequence_cache = {}
 # Use the global RNG instead of _seq_rng
-composer._rng = rng.derive("test_composer")
+loader._rng = rng.derive("test_loader")

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, find the file structure and locate BaseDatasetLoader
find . -type f -name "*.py" | grep -E "(loader|dataset)" | head -20

Repository: ai-dynamo/aiperf

Length of output: 874


🏁 Script executed:

# Search for BaseDatasetLoader class definition and __init__
rg -n "class BaseDatasetLoader" --type py -A 50

Repository: ai-dynamo/aiperf

Length of output: 3467


🏁 Script executed:

# Search for the two methods being tested
rg -n "def _get_turn_sequence_lengths" --type py -A 20
rg -n "def _clear_turn_cache" --type py -A 15

Repository: ai-dynamo/aiperf

Length of output: 42


🏁 Script executed:

# Also check what MockLoader looks like and where it's defined
rg -n "class MockLoader" --type py -B 2 -A 30

Repository: ai-dynamo/aiperf

Length of output: 6277


🏁 Script executed:

# Search for _get_turn_sequence_lengths method
rg -n "def _get_turn_sequence_lengths" --type py -A 20 -B 2

Repository: ai-dynamo/aiperf

Length of output: 42


🏁 Script executed:

# Search for _clear_turn_cache method
rg -n "def _clear_turn_cache" --type py -A 15 -B 2

Repository: ai-dynamo/aiperf

Length of output: 42


🏁 Script executed:

# Also check if these methods are defined in BaseDatasetLoader
rg -n "_get_turn_sequence_lengths\|_clear_turn_cache" --type py -B 5 -A 3

Repository: ai-dynamo/aiperf

Length of output: 42


🏁 Script executed:

# Read the test file lines 680-730 to see the full test setup and execution
sed -n '680,730p' tests/unit/common/models/test_sequence_distribution.py

Repository: ai-dynamo/aiperf

Length of output: 1966


🏁 Script executed:

# Read the second test around lines 740-780
sed -n '740,780p' tests/unit/common/models/test_sequence_distribution.py

Repository: ai-dynamo/aiperf

Length of output: 1523


🏁 Script executed:

# Search everywhere for these method names
rg -n "get_turn_sequence_lengths" --type py
rg -n "clear_turn_cache" --type py

Repository: ai-dynamo/aiperf

Length of output: 1327


🏁 Script executed:

# Check the full BaseDatasetLoader class to see all its methods
cat -n src/aiperf/dataset/loader/base.py | head -100

Repository: ai-dynamo/aiperf

Length of output: 2618


🏁 Script executed:

# Check if there's a mixin that might provide these methods
rg -n "class.*Mixin" src/aiperf/dataset/loader/ --type py -A 5

Repository: ai-dynamo/aiperf

Length of output: 3364


🏁 Script executed:

# Check LoaderContext.get_turn_sequence_lengths
cat -n src/aiperf/dataset/loader/context.py | head -150

Repository: ai-dynamo/aiperf

Length of output: 6587


🏁 Script executed:

# Search for _seq_distribution usage in BaseDatasetLoader or context
rg -n "_seq_distribution" --type py -B 2 -A 2

Repository: ai-dynamo/aiperf

Length of output: 3046


Test calls non-existent methods _get_turn_sequence_lengths and _clear_turn_cache on MockLoader

These methods do not exist in BaseDatasetLoader or any parent class. The test instantiates MockLoader via __new__ and manually assigns _seq_distribution, _turn_sequence_cache, and _rng, then immediately calls loader._get_turn_sequence_lengths(turn_id) at lines 712, 713, 714, 728, 768, 769 and loader._clear_turn_cache(turn_id) at lines 725 and 780, which will fail with AttributeError.

Note that _seq_distribution and _turn_sequence_cache belong to LoaderContext, not to loaders. The public method is ctx.get_turn_sequence_lengths(), not a loader method. If these private methods are intended as new functionality on BaseDatasetLoader, they must be implemented first.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/common/models/test_sequence_distribution.py` around lines 699 -
705, The test is calling nonexistent loader methods `_get_turn_sequence_lengths`
and `_clear_turn_cache` on MockLoader; fix by using the correct public API on
the context or implement the methods on the loader: either change the test to
call the LoaderContext public method `ctx.get_turn_sequence_lengths(turn_id)`
(and use the context to clear any cache instead of calling `_clear_turn_cache`)
and ensure `_seq_distribution` and `_turn_sequence_cache` are set on the
`LoaderContext`, or add explicit implementations of `_get_turn_sequence_lengths`
and `_clear_turn_cache` on `BaseDatasetLoader` that delegate to the loader's
`LoaderContext` storage (`_seq_distribution`, `_turn_sequence_cache`) so the
existing test calls succeed; update references to `MockLoader`,
`BaseDatasetLoader`, `LoaderContext`, `ctx.get_turn_sequence_lengths`,
`_seq_distribution`, and `_turn_sequence_cache` accordingly.
