[WIP]feat(BA-3720) boost up cpu count when it starts #7790
base: main
Conversation
Pull request overview
This WIP pull request implements a CPU boost feature (BA-3720) that temporarily allocates additional CPU resources to containers during their initial startup phase, with automatic restoration to the originally requested amount after a configurable duration.
Key Changes
- Adds three new configuration options, `cpu-boost-enabled`, `cpu-boost-factor`, and `cpu-boost-duration`, to control the CPU boost behavior
- Implements the CPU boost logic at container creation time by modifying the `Cpus` field in the Docker container configuration
- Implements a background-task-based restoration mechanism that reduces the CPU allocation back to the original value after the boost period expires
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/ai/backend/agent/config/unified.py | Adds configuration fields for CPU boost feature with validation constraints (factor: 1.0-10.0, duration: ≥1.0s) |
| src/ai/backend/agent/docker/intrinsic.py | Implements apply_cpu_boost method and modifies generate_docker_args to apply boost at container creation |
| src/ai/backend/agent/docker/agent.py | Implements CPU restoration logic with _schedule_cpu_boost_restoration and _restore_cpu_from_boost methods |
Comments suppressed due to low confidence (1)
src/ai/backend/agent/docker/intrinsic.py:455
There is a potential mismatch between the CPU limit set at container creation and the restoration mechanism. At creation time the code sets "Cpus" (line 451), which should be the number of CPUs, but during restoration (agent.py:1081) it uses "CpuQuota" and "CpuPeriod".
However, the code does not set "CpuQuota" or "CpuPeriod" at container creation time, which means the initial boost may not be applied correctly if Docker treats "Cpus" differently from the CpuQuota/CpuPeriod combination. Consider setting both "Cpus" and the corresponding CpuQuota/CpuPeriod values at creation time for consistency, or verify that "Cpus" alone is sufficient for the initial boost.
```python
return {
    "HostConfig": {
        "Cpus": boosted_cores,
        "CpusetCpus": ",".join(sorted_core_ids),
        # 'CpusetMems': f'{resource_spec.numa_node}',
    },
}
```
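A minimal sketch of the consistent variant suggested in the suppressed comment above, reusing `boosted_cores` and `sorted_core_ids` from the diff and the 100 ms period that the restoration path already assumes:

```python
CPU_PERIOD_USEC = 100_000  # 100 ms, the same period used by the restoration path

return {
    "HostConfig": {
        "Cpus": boosted_cores,  # kept for readability; may be ignored by the Engine API
        "CpuQuota": boosted_cores * CPU_PERIOD_USEC,
        "CpuPeriod": CPU_PERIOD_USEC,
        "CpusetCpus": ",".join(sorted_core_ids),
    },
}
```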
```python
boost_enabled = self.local_config["container"].get("cpu-boost-enabled", False)
boost_factor = self.local_config["container"].get("cpu-boost-factor", 2.0)
```
Copilot AI · Jan 5, 2026
Type inconsistency in config access. The plugin's local_config is typed as Mapping[str, Any] (from AbstractPlugin), but this code accesses it with dictionary-style indexing and .get(). While this works at runtime, it is inconsistent with how local_config is accessed elsewhere in agent.py (e.g., lines 1014 and 1408), which uses attribute access such as self.local_config.container.cpu_boost_duration.
For consistency with the rest of the codebase, either guard the lookup with self.local_config.get("container", {}).get("cpu-boost-enabled", False) or, better yet, parse it into a structured config object and use attribute access.
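A hedged sketch of the structured-config approach; the model and helper names here are hypothetical, and the duration default is illustrative (the real fields live in src/ai/backend/agent/config/unified.py with kebab-case aliases):

```python
from typing import Any, Mapping

from pydantic import BaseModel, Field


# Hypothetical helper model mirroring the three new config keys from this PR.
class CpuBoostConfig(BaseModel):
    enabled: bool = Field(default=False, alias="cpu-boost-enabled")
    factor: float = Field(default=2.0, ge=1.0, le=10.0, alias="cpu-boost-factor")
    duration: float = Field(default=60.0, ge=1.0, alias="cpu-boost-duration")  # default is illustrative


def parse_boost_config(local_config: Mapping[str, Any]) -> CpuBoostConfig:
    # Parse the plugin's untyped mapping once instead of sprinkling .get() calls.
    return CpuBoostConfig.model_validate(dict(local_config.get("container", {})))
```

Whether a separate model or a reuse of the existing unified config is preferable depends on how the agent passes local_config into the plugin.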
```python
def apply_cpu_boost(
    self,
    num_cores: int,
    boost_enabled: bool,
    boost_factor: float,
) -> int:
    """
    Apply CPU boost factor to the number of cores.

    Args:
        num_cores: Original number of CPU cores requested
        boost_enabled: Whether CPU boost is enabled
        boost_factor: Multiplication factor for CPU boost

    Returns:
        Boosted number of CPU cores (or original if boost is disabled).
        When boost is enabled, returns at least num_cores + 1.
    """
    if not boost_enabled:
        return num_cores

    boosted_cores = int(num_cores * boost_factor)
    # Ensure at least +1 core boost when enabled
    min_boosted_cores = num_cores + 1
    boosted_cores = max(boosted_cores, min_boosted_cores)

    log.info(
        "Applying CPU boost: {} cores -> {} cores (factor: {})",
        num_cores,
        boosted_cores,
        boost_factor,
    )
    return boosted_cores
```
Copilot AI · Jan 5, 2026
The new apply_cpu_boost method lacks test coverage. Given that there are existing tests for CPUPlugin in tests/unit/agent/test_resources.py, new tests should be added to verify the boost logic, including edge cases such as when boost_factor would result in less than num_cores + 1 (handled by the min_boosted_cores logic).
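As a starting point, a parametrized test along these lines could cover those edge cases; the `cpu_plugin` fixture is hypothetical and should be replaced by whatever fixture tests/unit/agent/test_resources.py already constructs for CPUPlugin:

```python
import pytest

# Sketch only: `cpu_plugin` is a hypothetical fixture providing a CPUPlugin instance.

@pytest.mark.parametrize(
    "num_cores, enabled, factor, expected",
    [
        (2, False, 2.0, 2),  # boost disabled: unchanged
        (2, True, 2.0, 4),   # plain multiplication
        (2, True, 1.0, 3),   # factor 1.0: the +1 minimum still applies
        (4, True, 1.1, 5),   # int() truncates 4.4 to 4; the +1 floor wins
    ],
)
def test_apply_cpu_boost(cpu_plugin, num_cores, enabled, factor, expected):
    assert cpu_plugin.apply_cpu_boost(num_cores, enabled, factor) == expected
```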
```python
def _schedule_cpu_boost_restoration(
    self,
    container_id: str,
    resource_spec: KernelResourceSpec,
) -> None:
    """
    Schedule a background task to restore CPU from boost after the configured duration.

    This method extracts the original CPU allocation from the resource spec and
    creates an asyncio task that will restore the container's CPU quota after
    the boost duration expires.

    Args:
        container_id: ID of the container to schedule restoration for
        resource_spec: Kernel resource specification containing CPU allocations
    """
    cpu_alloc = resource_spec.allocations.get(DeviceName("cpu"), {})
    if not cpu_alloc:
        log.debug(
            "No CPU allocation found for container {}, skipping boost restoration",
            container_id[:12],
        )
        return

    # cpu_alloc is Mapping[SlotName, Mapping[DeviceId, Decimal]]
    # Count the number of allocated CPU cores
    original_cpus = len(cpu_alloc.get(SlotName("cpu"), {}))
    boost_duration = self.local_config.container.cpu_boost_duration

    # Create a background task to restore CPU after boost_duration
    asyncio.create_task(
        self._restore_cpu_from_boost(
            container_id,
            original_cpus,
            boost_duration,
        )
    )
    log.debug(
        "Scheduled CPU boost restoration for container {} in {} seconds",
        container_id[:12],
        boost_duration,
    )

async def _restore_cpu_from_boost(
    self,
    container_id: str,
    original_cpus: int,
    boost_duration: float,
) -> None:
    """
    Restore container CPU allocation from boosted value to original value after a delay.

    Args:
        container_id: ID of the container to restore
        original_cpus: Original number of CPU cores requested
        boost_duration: Duration in seconds to wait before restoring
    """
    try:
        await asyncio.sleep(boost_duration)
        log.info(
            "Restoring CPU from boost for container {}: reducing to {} cores",
            container_id[:12],
            original_cpus,
        )
        async with closing_async(Docker()) as docker:
            container = docker.containers.container(container_id)

            # Check if container still exists and is running
            try:
                container_info = await container.show()
            except DockerError as e:
                if e.status == 404:
                    log.info(
                        "Container {} no longer exists, skipping CPU boost restoration",
                        container_id[:12],
                    )
                    return
                raise

            # Only restore CPU if container is still running
            container_state = container_info.get("State", {})
            if not container_state.get("Running", False):
                log.info(
                    "Container {} is not running (status: {}), skipping CPU boost restoration",
                    container_id[:12],
                    container_state.get("Status", "unknown"),
                )
                return

            # Restore CPU to original allocation
            # Docker update API doesn't support "Cpus" field, so we use CpuQuota
            # CpuQuota = number_of_cpus * CpuPeriod (100_000 microseconds = 100ms)
            # aiodocker doesn't have update method, so we call the API directly
            update_config = {
                "CpuQuota": original_cpus * 100_000,
                "CpuPeriod": 100_000,
            }
            await docker._query_json(
                f"containers/{container_id}/update",
                method="POST",
                data=update_config,
            )

            log.info("CPU boost restoration completed for container {}", container_id[:12])
    except asyncio.CancelledError:
        log.debug("CPU boost restoration task cancelled for container {}", container_id[:12])
        raise
    except DockerError as e:
        log.warning(
            "Docker error while restoring CPU from boost for container {}: {} (status: {})",
            container_id[:12],
            e.message,
            e.status,
        )
    except Exception as e:
        log.warning(
            "Failed to restore CPU from boost for container {}: {!r}",
            container_id[:12],
            e,
        )
```
Copilot AI · Jan 5, 2026
The CPU boost restoration logic in _schedule_cpu_boost_restoration and _restore_cpu_from_boost lacks test coverage. Tests should verify: 1) successful restoration after the boost duration, 2) handling of non-existent containers, 3) handling of stopped containers, 4) proper error handling for Docker API failures, and 5) task cancellation behavior.
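A sketch of one of these cases (the missing-container path), assuming pytest-asyncio is available; `docker_agent` is a hypothetical fixture for a minimally constructed agent, and the aiodocker client is replaced where agent.py imports it:

```python
from unittest.mock import AsyncMock, MagicMock, patch

import pytest
from aiodocker.exceptions import DockerError


@pytest.mark.asyncio
async def test_restore_skips_missing_container(docker_agent):
    fake_container = MagicMock()
    fake_container.show = AsyncMock(
        side_effect=DockerError(404, {"message": "no such container"})
    )
    fake_docker = MagicMock()
    fake_docker.close = AsyncMock()        # awaited by closing_async()
    fake_docker._query_json = AsyncMock()
    fake_docker.containers.container.return_value = fake_container

    with patch("ai.backend.agent.docker.agent.Docker", return_value=fake_docker):
        # Zero duration keeps the test fast; the 404 must be swallowed, not raised.
        await docker_agent._restore_cpu_from_boost("0123456789ab", 2, 0.0)

    fake_docker._query_json.assert_not_awaited()  # no update for a missing container
```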
```python
boosted_cores = int(num_cores * boost_factor)
# Ensure at least +1 core boost when enabled
min_boosted_cores = num_cores + 1
boosted_cores = max(boosted_cores, min_boosted_cores)
```
Copilot AI · Jan 5, 2026
The minimum boost guarantee (num_cores + 1) may cause issues when system resources are constrained. If a container requests a large number of cores (e.g., 15 cores out of a 16-core system), boosting it to 16 cores may succeed, but if multiple containers are created simultaneously with boost enabled, the system could run out of CPU resources. Consider adding validation to ensure boosted_cores doesn't exceed available system resources, or at least document that this feature assumes sufficient overcommit headroom.
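One possible shape for such a guard, as a sketch that assumes the cores visible to the agent process (os.sched_getaffinity, Linux-only) are an acceptable upper bound and ignores cores reserved for the agent itself or other containers:

```python
import os

# Cap the boost at the number of cores visible to the agent process.
available_cores = len(os.sched_getaffinity(0))
boosted_cores = max(int(num_cores * boost_factor), num_cores + 1)
# Clamp to the host, and fall back to the original request if there is no headroom.
boosted_cores = max(min(boosted_cores, available_cores), num_cores)
```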
| "Cpus": boosted_cores, | ||
| "CpusetCpus": ",".join(sorted_core_ids), |
Copilot AI · Jan 5, 2026
CpusetCpus constrains the actual cores available but doesn't match the boosted CPU count. The code sets "Cpus" to boosted_cores (e.g., 4) but "CpusetCpus" is still set to the original cores from device_alloc (e.g., cores 0,1 if 2 cores were allocated). This creates a conflict: the container is told it has 4 CPUs available but is pinned to only 2 specific cores.
Docker's behavior with this configuration may be undefined or the "Cpus" field may be ignored. To properly boost CPU, you either need to: 1) Not use CpusetCpus and rely on CpuQuota/CpuPeriod instead, or 2) Actually allocate additional cores from the device allocator and include them in CpusetCpus.
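A rough sketch of the second option; `allocate_extra_cores` is a hypothetical helper standing in for a real request to the CPU allocator, and the extra cores would also have to be returned to the allocator when the boost window ends:

```python
# Hypothetical: ask the allocator for the additional cores needed by the boost.
extra_core_ids = allocate_extra_cores(boosted_cores - len(sorted_core_ids))

return {
    "HostConfig": {
        "Cpus": boosted_cores,
        # CpusetCpus now matches the boosted CPU count instead of contradicting it.
        "CpusetCpus": ",".join([*sorted_core_ids, *extra_core_ids]),
    },
}
```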
```python
await docker._query_json(
    f"containers/{container_id}/update",
    method="POST",
    data=update_config,
)
```
Copilot AI · Jan 5, 2026
Accessing the private method docker._query_json is fragile and may break if the aiodocker library changes its internal API. While this pattern is used elsewhere in the codebase (e.g., stats.py:773, utils.py:214), consider requesting an official update method from the aiodocker library or wrapping this in a helper function that can be easily updated if the internal API changes.
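A small wrapper along the lines the comment suggests; a sketch only, and it still calls the private method internally, but it confines that dependency to one place:

```python
from typing import Any, Mapping

from aiodocker.docker import Docker


async def update_container_resources(
    docker: Docker,
    container_id: str,
    update_config: Mapping[str, Any],
) -> Any:
    """Call `POST /containers/{id}/update` through the aiodocker client.

    aiodocker has no public update() helper, so this wraps the private
    _query_json call; if that internal API changes, only this function
    needs to be touched.
    """
    return await docker._query_json(
        f"containers/{container_id}/update",
        method="POST",
        data=update_config,
    )
```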
```python
cpu_boost_factor: float = Field(
    default=2.0,
    ge=1.0,
    le=10.0,
    description=textwrap.dedent("""
        CPU boost multiplication factor.
        The requested CPU will be multiplied by this factor during the boost period.
        For example, with a factor of 2.0, a container requesting 2 CPUs will
        initially receive 4 CPUs.
    """),
    examples=[1.5, 2.0, 3.0],
    validation_alias=AliasChoices("cpu-boost-factor", "cpu_boost_factor"),
    serialization_alias="cpu-boost-factor",
)
```
Copilot AI · Jan 5, 2026
When cpu_boost_factor is set to exactly 1.0 (the minimum allowed value), the boosted_cores calculation on line 422 will result in the same number as num_cores. However, line 424-425 ensures at least +1 core boost when enabled, which means even with a factor of 1.0, containers will get +1 core. This behavior may be unexpected for users who set factor to 1.0 thinking it means "no boost". Consider either: 1) Setting the minimum value to greater than 1.0 (e.g., 1.1), or 2) Documenting this behavior clearly in the cpu_boost_factor description.
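To make the edge case concrete (values are illustrative):

```python
num_cores, boost_factor = 2, 1.0
boosted = max(int(num_cores * boost_factor), num_cores + 1)
assert boosted == 3  # factor 1.0 still yields one extra core while boost is enabled
```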
Co-authored-by: Copilot <[email protected]>
this is only for test. do not merge this.