Skip to content

Conversation

@daemyung-lablup
Copy link
Contributor

this is only for test. do not merge this.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This WIP pull request implements a CPU boost feature (BA-3720) that temporarily allocates additional CPU resources to containers during their initial startup phase, with automatic restoration to the originally requested amount after a configurable duration.

Key Changes

  • Adds three new configuration options: cpu-boost-enabled, cpu-boost-factor, and cpu-boost-duration to control the CPU boost behavior
  • Implements CPU boost logic at container creation time by modifying the Cpus field in Docker container configuration
  • Implements background task-based restoration mechanism to reduce CPU allocation back to original values after the boost period expires

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

File Description
src/ai/backend/agent/config/unified.py Adds configuration fields for CPU boost feature with validation constraints (factor: 1.0-10.0, duration: ≥1.0s)
src/ai/backend/agent/docker/intrinsic.py Implements apply_cpu_boost method and modifies generate_docker_args to apply boost at container creation
src/ai/backend/agent/docker/agent.py Implements CPU restoration logic with _schedule_cpu_boost_restoration and _restore_cpu_from_boost methods
Comments suppressed due to low confidence (1)

src/ai/backend/agent/docker/intrinsic.py:455

  • There's a potential mismatch between the CPU limit set at container creation and the restoration mechanism. At creation time, the code sets "Cpus" (line 451) which should be the number of CPUs, but during restoration (agent.py:1081), it uses "CpuQuota" and "CpuPeriod".

However, the code doesn't set "CpuQuota" or "CpuPeriod" at container creation time, which means the initial boost may not be properly applied if Docker treats "Cpus" differently than the CpuQuota/CpuPeriod combination. Consider setting both "Cpus" and the corresponding CpuQuota/CpuPeriod values at creation time for consistency, or verify that "Cpus" alone is sufficient for the initial boost.

        return {
            "HostConfig": {
                "Cpus": boosted_cores,
                "CpusetCpus": ",".join(sorted_core_ids),
                # 'CpusetMems': f'{resource_spec.numa_node}',
            },
        }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 445 to 446
boost_enabled = self.local_config["container"].get("cpu-boost-enabled", False)
boost_factor = self.local_config["container"].get("cpu-boost-factor", 2.0)
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type inconsistency in config access. The plugin's local_config is typed as Mapping[str, Any] (from AbstractPlugin), but this code accesses it using dictionary-style indexing with .get() method. While this works at runtime, it's inconsistent with how local_config is accessed elsewhere in the agent.py file (e.g., line 1014, 1408) which uses attribute access like self.local_config.container.cpu_boost_duration.

For consistency with the rest of the codebase, this should use attribute access like: self.local_config.get("container", {}).get("cpu-boost-enabled", False) or better yet, parse it into a structured config object.

Copilot uses AI. Check for mistakes.
Comment on lines +401 to +433
def apply_cpu_boost(
self,
num_cores: int,
boost_enabled: bool,
boost_factor: float,
) -> int:
"""
Apply CPU boost factor to the number of cores.
Args:
num_cores: Original number of CPU cores requested
boost_enabled: Whether CPU boost is enabled
boost_factor: Multiplication factor for CPU boost
Returns:
Boosted number of CPU cores (or original if boost is disabled).
When boost is enabled, returns at least num_cores + 1.
"""
if not boost_enabled:
return num_cores

boosted_cores = int(num_cores * boost_factor)
# Ensure at least +1 core boost when enabled
min_boosted_cores = num_cores + 1
boosted_cores = max(boosted_cores, min_boosted_cores)

log.info(
"Applying CPU boost: {} cores -> {} cores (factor: {})",
num_cores,
boosted_cores,
boost_factor,
)
return boosted_cores
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new apply_cpu_boost method lacks test coverage. Given that there are existing tests for CPUPlugin in tests/unit/agent/test_resources.py, new tests should be added to verify the boost logic, including edge cases such as when boost_factor would result in less than num_cores + 1 (handled by the min_boosted_cores logic).

Copilot uses AI. Check for mistakes.
Comment on lines 987 to 1106
def _schedule_cpu_boost_restoration(
self,
container_id: str,
resource_spec: KernelResourceSpec,
) -> None:
"""
Schedule a background task to restore CPU from boost after the configured duration.
This method extracts the original CPU allocation from the resource spec and
creates an asyncio task that will restore the container's CPU quota after
the boost duration expires.
Args:
container_id: ID of the container to schedule restoration for
resource_spec: Kernel resource specification containing CPU allocations
"""
cpu_alloc = resource_spec.allocations.get(DeviceName("cpu"), {})
if not cpu_alloc:
log.debug(
"No CPU allocation found for container {}, skipping boost restoration",
container_id[:12],
)
return

# cpu_alloc is Mapping[SlotName, Mapping[DeviceId, Decimal]]
# Count the number of allocated CPU cores
original_cpus = len(cpu_alloc.get(SlotName("cpu"), {}))
boost_duration = self.local_config.container.cpu_boost_duration

# Create a background task to restore CPU after boost_duration
asyncio.create_task(
self._restore_cpu_from_boost(
container_id,
original_cpus,
boost_duration,
)
)
log.debug(
"Scheduled CPU boost restoration for container {} in {} seconds",
container_id[:12],
boost_duration,
)

async def _restore_cpu_from_boost(
self,
container_id: str,
original_cpus: int,
boost_duration: float,
) -> None:
"""
Restore container CPU allocation from boosted value to original value after a delay.
Args:
container_id: ID of the container to restore
original_cpus: Original number of CPU cores requested
boost_duration: Duration in seconds to wait before restoring
"""
try:
await asyncio.sleep(boost_duration)
log.info(
"Restoring CPU from boost for container {}: reducing to {} cores",
container_id[:12],
original_cpus,
)
async with closing_async(Docker()) as docker:
container = docker.containers.container(container_id)

# Check if container still exists and is running
try:
container_info = await container.show()
except DockerError as e:
if e.status == 404:
log.info(
"Container {} no longer exists, skipping CPU boost restoration",
container_id[:12],
)
return
raise

# Only restore CPU if container is still running
container_state = container_info.get("State", {})
if not container_state.get("Running", False):
log.info(
"Container {} is not running (status: {}), skipping CPU boost restoration",
container_id[:12],
container_state.get("Status", "unknown"),
)
return

# Restore CPU to original allocation
# Docker update API doesn't support "Cpus" field, so we use CpuQuota
# CpuQuota = number_of_cpus * CpuPeriod (100_000 microseconds = 100ms)
# aiodocker doesn't have update method, so we call the API directly
update_config = {
"CpuQuota": original_cpus * 100_000,
"CpuPeriod": 100_000,
}
await docker._query_json(
f"containers/{container_id}/update",
method="POST",
data=update_config,
)

log.info("CPU boost restoration completed for container {}", container_id[:12])
except asyncio.CancelledError:
log.debug("CPU boost restoration task cancelled for container {}", container_id[:12])
raise
except DockerError as e:
log.warning(
"Docker error while restoring CPU from boost for container {}: {} (status: {})",
container_id[:12],
e.message,
e.status,
)
except Exception as e:
log.warning(
"Failed to restore CPU from boost for container {}: {!r}",
container_id[:12],
e,
)
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CPU boost restoration logic in _schedule_cpu_boost_restoration and _restore_cpu_from_boost lacks test coverage. Tests should verify: 1) successful restoration after the boost duration, 2) handling of non-existent containers, 3) handling of stopped containers, 4) proper error handling for Docker API failures, and 5) task cancellation behavior.

Copilot uses AI. Check for mistakes.
Comment on lines +422 to +425
boosted_cores = int(num_cores * boost_factor)
# Ensure at least +1 core boost when enabled
min_boosted_cores = num_cores + 1
boosted_cores = max(boosted_cores, min_boosted_cores)
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The minimum boost guarantee (num_cores + 1) may cause issues when system resources are constrained. If a container requests a large number of cores (e.g., 15 cores out of a 16-core system), boosting it to 16 cores may succeed, but if multiple containers are created simultaneously with boost enabled, the system could run out of CPU resources. Consider adding validation to ensure boosted_cores doesn't exceed available system resources, or at least document that this feature assumes sufficient overcommit headroom.

Copilot uses AI. Check for mistakes.
Comment on lines 451 to 452
"Cpus": boosted_cores,
"CpusetCpus": ",".join(sorted_core_ids),
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CpusetCpus constrains the actual cores available but doesn't match the boosted CPU count. The code sets "Cpus" to boosted_cores (e.g., 4) but "CpusetCpus" is still set to the original cores from device_alloc (e.g., cores 0,1 if 2 cores were allocated). This creates a conflict: the container is told it has 4 CPUs available but is pinned to only 2 specific cores.

Docker's behavior with this configuration may be undefined or the "Cpus" field may be ignored. To properly boost CPU, you either need to: 1) Not use CpusetCpus and rely on CpuQuota/CpuPeriod instead, or 2) Actually allocate additional cores from the device allocator and include them in CpusetCpus.

Copilot uses AI. Check for mistakes.
Comment on lines +1084 to +1088
await docker._query_json(
f"containers/{container_id}/update",
method="POST",
data=update_config,
)
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accessing the private method docker._query_json is fragile and may break if the aiodocker library changes its internal API. While this pattern is used elsewhere in the codebase (e.g., stats.py:773, utils.py:214), consider requesting an official update method from the aiodocker library or wrapping this in a helper function that can be easily updated if the internal API changes.

Copilot uses AI. Check for mistakes.
Comment on lines +836 to +849
cpu_boost_factor: float = Field(
default=2.0,
ge=1.0,
le=10.0,
description=textwrap.dedent("""
CPU boost multiplication factor.
The requested CPU will be multiplied by this factor during the boost period.
For example, with a factor of 2.0, a container requesting 2 CPUs will
initially receive 4 CPUs.
"""),
examples=[1.5, 2.0, 3.0],
validation_alias=AliasChoices("cpu-boost-factor", "cpu_boost_factor"),
serialization_alias="cpu-boost-factor",
)
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When cpu_boost_factor is set to exactly 1.0 (the minimum allowed value), the boosted_cores calculation on line 422 will result in the same number as num_cores. However, line 424-425 ensures at least +1 core boost when enabled, which means even with a factor of 1.0, containers will get +1 core. This behavior may be unexpected for users who set factor to 1.0 thinking it means "no boost". Consider either: 1) Setting the minimum value to greater than 1.0 (e.g., 1.1), or 2) Documenting this behavior clearly in the cpu_boost_factor description.

Copilot uses AI. Check for mistakes.
@github-actions github-actions bot added size:XL 500~ LoC and removed size:L 100~500 LoC labels Jan 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp:agent Related to Agent component size:XL 500~ LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants