feat: 2-3x faster trace dataset loading via HashIdRandomGenerator #724
New file (77 lines added):

```python
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""Hash-ID-based random generator for parallel processing with reproducibility.

Enables parallel processing of traces with hash_ids while maintaining
reproducibility. Each (trace_id, hash_id) pair produces a deterministic random
sequence regardless of worker count or processing order.

Architecture:
    Global Seed -> Base RNG -> (trace_id, hash_id) -> Deterministic tokens

The trace_id (typically a content hash of the trace file) ensures that different
trace files with overlapping hash_id values produce different content, while the
same trace file always produces identical results.
"""

import hashlib

from aiperf.common.random_generator import RandomGenerator

__all__ = ["HashIdRandomGenerator"]


class _DisabledNumpyRNG:
    """Raises on any attribute access to prevent NumPy RNG usage."""

    def __getattr__(self, name):
        raise RuntimeError(
            "HashIdRandomGenerator does not support NumPy RNG operations. "
            "Use Python RNG methods (randrange, choice, etc.) instead."
        )


class HashIdRandomGenerator(RandomGenerator):
    """RandomGenerator that re-seeds deterministically per (trace_id, hash_id).

    Designed for parallel processing where multiple workers need to generate
    identical content for the same hash_id within a trace file.

    Thread Safety:
        NOT thread-safe. Each worker process must have its own instance.
    """

    @classmethod
    def from_base_rng(cls, base_rng: RandomGenerator) -> "HashIdRandomGenerator":
        """Create from a base RandomGenerator (typically from rng.derive())."""
        base_seed = base_rng.seed or base_rng.randrange(0, 2**64)
        return cls(base_seed, _internal=True)

    def __init__(self, base_seed: int, *, _internal: bool = False):
        super().__init__(base_seed, _internal=_internal)
        self._numpy_rng = _DisabledNumpyRNG()
        self._trace_id: str = ""

    def set_trace_id(self, trace_id: str) -> None:
        """Set trace identifier to scope hash_ids to a specific trace file.

        Args:
            trace_id: Content hash or unique identifier for the trace file.
                Different trace files must use different trace_ids.
        """
        self._trace_id = trace_id

    def reseed_for_hash_id(self, hash_id: int) -> None:
        """Re-seed RNG deterministically for a specific hash_id.

        After calling, all random operations use the derived seed until
        the next reseed_for_hash_id call.

        Args:
            hash_id: KV block hash ID from trace data.
        """
        seed_bytes = hashlib.sha256(
            f"{self.seed}:{self._trace_id}:{hash_id}".encode()
        ).digest()
        self._python_rng.seed(int.from_bytes(seed_bytes[:8], "big"))
```
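The derivation scheme above can be exercised without any aiperf dependencies. The sketch below mirrors the seed derivation (SHA-256 over `"seed:trace_id:hash_id"`, truncated to the first 8 bytes, big-endian) using only the standard library; `derive_seed` is an illustrative stand-in, not part of the PR:

```python
import hashlib
import random


def derive_seed(base_seed: int, trace_id: str, hash_id: int) -> int:
    """Mirror the module's scheme: SHA-256 over "seed:trace_id:hash_id",
    truncated to the first 8 bytes (big-endian)."""
    digest = hashlib.sha256(f"{base_seed}:{trace_id}:{hash_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


# Two independent "workers" hitting the same (trace_id, hash_id) pair
# produce identical sequences, regardless of worker count or order.
rng_a = random.Random(derive_seed(42, "trace-abc", hash_id=7))
rng_b = random.Random(derive_seed(42, "trace-abc", hash_id=7))
assert [rng_a.randrange(100) for _ in range(5)] == [rng_b.randrange(100) for _ in range(5)]

# The same hash_id in a different trace file yields a different seed,
# so overlapping hash_id values across files do not collide.
assert derive_seed(42, "trace-abc", 7) != derive_seed(42, "trace-xyz", 7)
```

This is why parallel loading stays reproducible: the seed is a pure function of (global seed, trace file, hash_id), with no dependence on which worker processes which block.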
@@ -4,6 +4,7 @@ | |
| """HuggingFace tokenizer wrapper with sensible defaults.""" | ||
|
|
||
| import contextlib | ||
| import inspect | ||
| import io | ||
| import logging | ||
| import os | ||
|
|
@@ -47,6 +48,14 @@ def __init__(self, name: str, suggestions: list[tuple[str, int]]) -> None: | |
| ) | ||
|
|
||
|
|
||
| def _supports_kwarg(obj: object, method_name: str, kwarg: str) -> bool: | ||
| """Check if a method on an object accepts a specific keyword argument.""" | ||
| method = getattr(obj, method_name, None) | ||
| if method is None: | ||
| return False | ||
| return kwarg in inspect.signature(method).parameters | ||
|
Review comment on lines +51 to +56:

Add error handling in `_supports_kwarg`. Line 56 assumes `inspect.signature(method)` always succeeds, but it can raise `TypeError` (if the object is not callable) or `ValueError` (if no signature can be determined, as with some C-implemented callables). Proposed fix:

```diff
 def _supports_kwarg(obj: object, method_name: str, kwarg: str) -> bool:
     """Check if a method on an object accepts a specific keyword argument."""
     method = getattr(obj, method_name, None)
     if method is None:
         return False
-    return kwarg in inspect.signature(method).parameters
+    try:
+        return kwarg in inspect.signature(method).parameters
+    except (TypeError, ValueError):
+        return False
```
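The failure mode is easy to reproduce outside the codebase. Below is a self-contained sketch of the guarded helper from the review suggestion; the `KimiLike` and `Weird` classes are hypothetical illustrations, not real tokenizers:

```python
import inspect


def supports_kwarg(obj: object, method_name: str, kwarg: str) -> bool:
    """Guarded variant: returns False instead of raising when the attribute
    is missing, non-callable, or has no introspectable signature."""
    method = getattr(obj, method_name, None)
    if method is None:
        return False
    try:
        return kwarg in inspect.signature(method).parameters
    except (TypeError, ValueError):
        return False


class KimiLike:
    def encode(self, text, allow_special_tokens=True):
        return []


class Weird:
    encode = 42  # attribute exists but is not callable


assert supports_kwarg(KimiLike(), "encode", "allow_special_tokens")
assert not supports_kwarg(KimiLike(), "encode", "skip_special_tokens")
assert not supports_kwarg(KimiLike(), "decode", "anything")  # no such method
# Without the try/except, this raises TypeError inside inspect.signature:
assert not supports_kwarg(Weird(), "encode", "x")
```

The `(TypeError, ValueError)` pair matches what `inspect.signature` documents: `TypeError` for non-callables and `ValueError` when no signature can be provided.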
```diff
 
 
 def _is_offline_mode() -> bool:
     """Check if HuggingFace offline mode is enabled via environment variables."""
     return bool(os.environ.get("HF_HUB_OFFLINE", "")) or bool(
@@ -147,6 +156,16 @@ def _require_init(self) -> None:
         if self._tokenizer is None:
             raise NotInitializedError("Tokenizer is not initialized.")
 
+    def _apply_kwarg_overrides(self) -> None:
+        """Override default args for tokenizers that use non-standard kwargs (e.g. Kimi)."""
+        if self._tokenizer is None:
+            return
+        if _supports_kwarg(self._tokenizer, "encode", "allow_special_tokens"):
+            self._call_args = {"allow_special_tokens": False}
+            self._encode_args = {"allow_special_tokens": False}
+        if not _supports_kwarg(self._tokenizer, "decode", "skip_special_tokens"):
+            self._decode_args = {}
+
     @staticmethod
     def resolve_alias(name: str) -> AliasResolutionResult:
         """Resolve a tokenizer name alias to its canonical repository ID."""
@@ -208,6 +227,7 @@ def from_pretrained(
                 revision=revision,
             )
             tokenizer_cls._resolved_name = resolved_name
+            tokenizer_cls._apply_kwarg_overrides()
         except AmbiguousTokenizerNameError:
             raise
         except Exception as e:
@@ -285,6 +305,7 @@ class _OfflineModelInfo:
                 revision=revision,
                 local_files_only=True,
             )
+            tokenizer_cls._apply_kwarg_overrides()
             return tokenizer_cls
         finally:
             huggingface_hub.model_info = _original_model_info
```
Review comment:

Fix truthiness check to preserve explicit seed 0.

Line 48 uses `or` to evaluate the seed, which treats `0` as falsy despite `0` being a valid seed documented in the constructor. This breaks reproducibility when the seed is intentionally set to 0.

Proposed fix (the concrete patch was not captured in this excerpt; an explicit `None` check along these lines resolves it):

```diff
-        base_seed = base_rng.seed or base_rng.randrange(0, 2**64)
+        base_seed = (
+            base_rng.seed if base_rng.seed is not None else base_rng.randrange(0, 2**64)
+        )
```
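The pitfall is reproducible with a minimal stand-in for the base RNG (`BaseRNG` here is a hypothetical sketch, not aiperf's `RandomGenerator`):

```python
import random


class BaseRNG:
    """Hypothetical stand-in for the base RandomGenerator in this PR."""

    def __init__(self, seed):
        self.seed = seed
        self._rng = random.Random(seed)

    def randrange(self, start, stop):
        return self._rng.randrange(start, stop)


rng = BaseRNG(seed=0)

# Buggy: `0 or ...` is falsy, so an explicit seed of 0 is silently replaced
# by a random draw, and reproducibility is lost.
buggy = rng.seed or rng.randrange(0, 2**64)
assert buggy == BaseRNG(0).randrange(0, 2**64)  # fell through to the first RNG draw

# Fixed: only a missing (None) seed triggers the random fallback.
rng2 = BaseRNG(seed=0)
fixed = rng2.seed if rng2.seed is not None else rng2.randrange(0, 2**64)
assert fixed == 0
```

The same `x or default` trap applies to any integer config value where 0 is meaningful; `is None` is the idiomatic guard.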