[WIP]feat(BA-3720) boost up cpu count when it starts #7790
base: main
Conversation
Pull request overview
This WIP pull request implements a CPU boost feature (BA-3720) that temporarily allocates additional CPU resources to containers during their initial startup phase, with automatic restoration to the originally requested amount after a configurable duration.
Key Changes
- Adds three new configuration options, `cpu-boost-enabled`, `cpu-boost-factor`, and `cpu-boost-duration`, to control the CPU boost behavior
- Implements the CPU boost logic at container creation time by modifying the `Cpus` field in the Docker container configuration
- Implements a background-task-based restoration mechanism that reduces the CPU allocation back to the original value after the boost period expires
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/ai/backend/agent/config/unified.py | Adds configuration fields for CPU boost feature with validation constraints (factor: 1.0-10.0, duration: ≥1.0s) |
| src/ai/backend/agent/docker/intrinsic.py | Implements apply_cpu_boost method and modifies generate_docker_args to apply boost at container creation |
| src/ai/backend/agent/docker/agent.py | Implements CPU restoration logic with _schedule_cpu_boost_restoration and _restore_cpu_from_boost methods |
Comments suppressed due to low confidence (1)
src/ai/backend/agent/docker/intrinsic.py:455
There is a potential mismatch between the CPU limit set at container creation and the restoration mechanism. At creation time the code sets "Cpus" (line 451), which should be the number of CPUs, but during restoration (agent.py:1081) it uses "CpuQuota" and "CpuPeriod".
However, the code does not set "CpuQuota" or "CpuPeriod" at container creation time, which means the initial boost may not be applied correctly if Docker treats "Cpus" differently from the CpuQuota/CpuPeriod combination. Consider setting both "Cpus" and the corresponding CpuQuota/CpuPeriod values at creation time for consistency, or verify that "Cpus" alone is sufficient for the initial boost.
```python
return {
    "HostConfig": {
        "Cpus": boosted_cores,
        "CpusetCpus": ",".join(sorted_core_ids),
        # 'CpusetMems': f'{resource_spec.numa_node}',
    },
}
```
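A minimal sketch of the consistent variant suggested in the suppressed comment above, reusing `boosted_cores` and `sorted_core_ids` from the diff and the 100 ms period that the restoration path already assumes:

```python
CPU_PERIOD_USEC = 100_000  # 100 ms, the same period used by the restoration path

return {
    "HostConfig": {
        "Cpus": boosted_cores,  # kept for readability; may be ignored by the Engine API
        "CpuQuota": boosted_cores * CPU_PERIOD_USEC,
        "CpuPeriod": CPU_PERIOD_USEC,
        "CpusetCpus": ",".join(sorted_core_ids),
    },
}
```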
```python
boost_enabled = self.local_config["container"].get("cpu-boost-enabled", False)
boost_factor = self.local_config["container"].get("cpu-boost-factor", 2.0)
```
Copilot AI · Jan 5, 2026
Type inconsistency in config access. The plugin's local_config is typed as Mapping[str, Any] (from AbstractPlugin), but this code accesses it with dictionary-style indexing and .get(). While this works at runtime, it is inconsistent with how local_config is accessed elsewhere in agent.py (e.g., lines 1014 and 1408), which uses attribute access such as self.local_config.container.cpu_boost_duration.
For consistency with the rest of the codebase, either guard the lookup with self.local_config.get("container", {}).get("cpu-boost-enabled", False) or, better yet, parse it into a structured config object and use attribute access.
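A hedged sketch of the structured-config approach; the model and helper names here are hypothetical, and the duration default is illustrative (the real fields live in src/ai/backend/agent/config/unified.py with kebab-case aliases):

```python
from typing import Any, Mapping

from pydantic import BaseModel, Field


# Hypothetical helper model mirroring the three new config keys from this PR.
class CpuBoostConfig(BaseModel):
    enabled: bool = Field(default=False, alias="cpu-boost-enabled")
    factor: float = Field(default=2.0, ge=1.0, le=10.0, alias="cpu-boost-factor")
    duration: float = Field(default=60.0, ge=1.0, alias="cpu-boost-duration")  # default is illustrative


def parse_boost_config(local_config: Mapping[str, Any]) -> CpuBoostConfig:
    # Parse the plugin's untyped mapping once instead of sprinkling .get() calls.
    return CpuBoostConfig.model_validate(dict(local_config.get("container", {})))
```

Whether a separate model or a reuse of the existing unified config is preferable depends on how the agent passes local_config into the plugin.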
```python
def apply_cpu_boost(
    self,
    num_cores: int,
    boost_enabled: bool,
    boost_factor: float,
) -> int:
    """
    Apply CPU boost factor to the number of cores.

    Args:
        num_cores: Original number of CPU cores requested
        boost_enabled: Whether CPU boost is enabled
        boost_factor: Multiplication factor for CPU boost

    Returns:
        Boosted number of CPU cores (or original if boost is disabled).
        When boost is enabled, returns at least num_cores + 1.
    """
    if not boost_enabled:
        return num_cores

    boosted_cores = int(num_cores * boost_factor)
    # Ensure at least +1 core boost when enabled
    min_boosted_cores = num_cores + 1
    boosted_cores = max(boosted_cores, min_boosted_cores)

    log.info(
        "Applying CPU boost: {} cores -> {} cores (factor: {})",
        num_cores,
        boosted_cores,
        boost_factor,
    )
    return boosted_cores
```
Copilot AI · Jan 5, 2026
The new apply_cpu_boost method lacks test coverage. Given that there are existing tests for CPUPlugin in tests/unit/agent/test_resources.py, new tests should be added to verify the boost logic, including edge cases such as when boost_factor would result in less than num_cores + 1 (handled by the min_boosted_cores logic).
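As a starting point, a parametrized test along these lines could cover those edge cases; the `cpu_plugin` fixture is hypothetical and should be replaced by whatever fixture tests/unit/agent/test_resources.py already constructs for CPUPlugin:

```python
import pytest

# Sketch only: `cpu_plugin` is a hypothetical fixture providing a CPUPlugin instance.

@pytest.mark.parametrize(
    "num_cores, enabled, factor, expected",
    [
        (2, False, 2.0, 2),  # boost disabled: unchanged
        (2, True, 2.0, 4),   # plain multiplication
        (2, True, 1.0, 3),   # factor 1.0: the +1 minimum still applies
        (4, True, 1.1, 5),   # int() truncates 4.4 to 4; the +1 floor wins
    ],
)
def test_apply_cpu_boost(cpu_plugin, num_cores, enabled, factor, expected):
    assert cpu_plugin.apply_cpu_boost(num_cores, enabled, factor) == expected
```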
```python
def _schedule_cpu_boost_restoration(
    self,
    container_id: str,
    resource_spec: KernelResourceSpec,
) -> None:
    """
    Schedule a background task to restore CPU from boost after the configured duration.

    This method extracts the original CPU allocation from the resource spec and
    creates an asyncio task that will restore the container's CPU quota after
    the boost duration expires.

    Args:
        container_id: ID of the container to schedule restoration for
        resource_spec: Kernel resource specification containing CPU allocations
    """
    cpu_alloc = resource_spec.allocations.get(DeviceName("cpu"), {})
    if not cpu_alloc:
        log.debug(
            "No CPU allocation found for container {}, skipping boost restoration",
            container_id[:12],
        )
        return

    # cpu_alloc is Mapping[SlotName, Mapping[DeviceId, Decimal]]
    # Count the number of allocated CPU cores
    original_cpus = len(cpu_alloc.get(SlotName("cpu"), {}))
    boost_duration = self.local_config.container.cpu_boost_duration

    # Create a background task to restore CPU after boost_duration
    asyncio.create_task(
        self._restore_cpu_from_boost(
            container_id,
            original_cpus,
            boost_duration,
        )
    )
    log.debug(
        "Scheduled CPU boost restoration for container {} in {} seconds",
        container_id[:12],
        boost_duration,
    )

async def _restore_cpu_from_boost(
    self,
    container_id: str,
    original_cpus: int,
    boost_duration: float,
) -> None:
    """
    Restore container CPU allocation from boosted value to original value after a delay.

    Args:
        container_id: ID of the container to restore
        original_cpus: Original number of CPU cores requested
        boost_duration: Duration in seconds to wait before restoring
    """
    try:
        await asyncio.sleep(boost_duration)
        log.info(
            "Restoring CPU from boost for container {}: reducing to {} cores",
            container_id[:12],
            original_cpus,
        )
        async with closing_async(Docker()) as docker:
            container = docker.containers.container(container_id)

            # Check if container still exists and is running
            try:
                container_info = await container.show()
            except DockerError as e:
                if e.status == 404:
                    log.info(
                        "Container {} no longer exists, skipping CPU boost restoration",
                        container_id[:12],
                    )
                    return
                raise

            # Only restore CPU if container is still running
            container_state = container_info.get("State", {})
            if not container_state.get("Running", False):
                log.info(
                    "Container {} is not running (status: {}), skipping CPU boost restoration",
                    container_id[:12],
                    container_state.get("Status", "unknown"),
                )
                return

            # Restore CPU to original allocation
            # Docker update API doesn't support "Cpus" field, so we use CpuQuota
            # CpuQuota = number_of_cpus * CpuPeriod (100_000 microseconds = 100ms)
            # aiodocker doesn't have update method, so we call the API directly
            update_config = {
                "CpuQuota": original_cpus * 100_000,
                "CpuPeriod": 100_000,
            }
            await docker._query_json(
                f"containers/{container_id}/update",
                method="POST",
                data=update_config,
            )

            log.info("CPU boost restoration completed for container {}", container_id[:12])
    except asyncio.CancelledError:
        log.debug("CPU boost restoration task cancelled for container {}", container_id[:12])
        raise
    except DockerError as e:
        log.warning(
            "Docker error while restoring CPU from boost for container {}: {} (status: {})",
            container_id[:12],
            e.message,
            e.status,
        )
    except Exception as e:
        log.warning(
            "Failed to restore CPU from boost for container {}: {!r}",
            container_id[:12],
            e,
        )
```
Copilot AI · Jan 5, 2026
The CPU boost restoration logic in _schedule_cpu_boost_restoration and _restore_cpu_from_boost lacks test coverage. Tests should verify: 1) successful restoration after the boost duration, 2) handling of non-existent containers, 3) handling of stopped containers, 4) proper error handling for Docker API failures, and 5) task cancellation behavior.
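A sketch of one of these cases (the missing-container path), assuming pytest-asyncio is available; `docker_agent` is a hypothetical fixture for a minimally constructed agent, and the aiodocker client is replaced where agent.py imports it:

```python
from unittest.mock import AsyncMock, MagicMock, patch

import pytest
from aiodocker.exceptions import DockerError


@pytest.mark.asyncio
async def test_restore_skips_missing_container(docker_agent):
    fake_container = MagicMock()
    fake_container.show = AsyncMock(
        side_effect=DockerError(404, {"message": "no such container"})
    )
    fake_docker = MagicMock()
    fake_docker.close = AsyncMock()        # awaited by closing_async()
    fake_docker._query_json = AsyncMock()
    fake_docker.containers.container.return_value = fake_container

    with patch("ai.backend.agent.docker.agent.Docker", return_value=fake_docker):
        # Zero duration keeps the test fast; the 404 must be swallowed, not raised.
        await docker_agent._restore_cpu_from_boost("0123456789ab", 2, 0.0)

    fake_docker._query_json.assert_not_awaited()  # no update for a missing container
```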
```python
boosted_cores = int(num_cores * boost_factor)
# Ensure at least +1 core boost when enabled
min_boosted_cores = num_cores + 1
boosted_cores = max(boosted_cores, min_boosted_cores)
```
Copilot AI · Jan 5, 2026
The minimum boost guarantee (num_cores + 1) may cause issues when system resources are constrained. If a container requests a large number of cores (e.g., 15 cores out of a 16-core system), boosting it to 16 cores may succeed, but if multiple containers are created simultaneously with boost enabled, the system could run out of CPU resources. Consider adding validation to ensure boosted_cores doesn't exceed available system resources, or at least document that this feature assumes sufficient overcommit headroom.
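One possible shape for such a guard, as a sketch that assumes the cores visible to the agent process (os.sched_getaffinity, Linux-only) are an acceptable upper bound and ignores cores reserved for the agent itself or other containers:

```python
import os

# Cap the boost at the number of cores visible to the agent process.
available_cores = len(os.sched_getaffinity(0))
boosted_cores = max(int(num_cores * boost_factor), num_cores + 1)
# Clamp to the host, and fall back to the original request if there is no headroom.
boosted_cores = max(min(boosted_cores, available_cores), num_cores)
```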
| "Cpus": boosted_cores, | ||
| "CpusetCpus": ",".join(sorted_core_ids), |
Copilot AI · Jan 5, 2026
CpusetCpus constrains the actual cores available but doesn't match the boosted CPU count. The code sets "Cpus" to boosted_cores (e.g., 4) but "CpusetCpus" is still set to the original cores from device_alloc (e.g., cores 0,1 if 2 cores were allocated). This creates a conflict: the container is told it has 4 CPUs available but is pinned to only 2 specific cores.
Docker's behavior with this configuration may be undefined or the "Cpus" field may be ignored. To properly boost CPU, you either need to: 1) Not use CpusetCpus and rely on CpuQuota/CpuPeriod instead, or 2) Actually allocate additional cores from the device allocator and include them in CpusetCpus.
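A rough sketch of the second option; `allocate_extra_cores` is a hypothetical helper standing in for a real request to the CPU allocator, and the extra cores would also have to be returned to the allocator when the boost window ends:

```python
# Hypothetical: ask the allocator for the additional cores needed by the boost.
extra_core_ids = allocate_extra_cores(boosted_cores - len(sorted_core_ids))

return {
    "HostConfig": {
        "Cpus": boosted_cores,
        # CpusetCpus now matches the boosted CPU count instead of contradicting it.
        "CpusetCpus": ",".join([*sorted_core_ids, *extra_core_ids]),
    },
}
```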
```python
await docker._query_json(
    f"containers/{container_id}/update",
    method="POST",
    data=update_config,
)
```
Copilot AI · Jan 5, 2026
Accessing the private method docker._query_json is fragile and may break if the aiodocker library changes its internal API. While this pattern is used elsewhere in the codebase (e.g., stats.py:773, utils.py:214), consider requesting an official update method from the aiodocker library or wrapping this in a helper function that can be easily updated if the internal API changes.
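A small wrapper along the lines the comment suggests; a sketch only, and it still calls the private method internally, but it confines that dependency to one place:

```python
from typing import Any, Mapping

from aiodocker.docker import Docker


async def update_container_resources(
    docker: Docker,
    container_id: str,
    update_config: Mapping[str, Any],
) -> Any:
    """Call `POST /containers/{id}/update` through the aiodocker client.

    aiodocker has no public update() helper, so this wraps the private
    _query_json call; if that internal API changes, only this function
    needs to be touched.
    """
    return await docker._query_json(
        f"containers/{container_id}/update",
        method="POST",
        data=update_config,
    )
```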
```python
cpu_boost_factor: float = Field(
    default=2.0,
    ge=1.0,
    le=10.0,
    description=textwrap.dedent("""
        CPU boost multiplication factor.
        The requested CPU will be multiplied by this factor during the boost period.
        For example, with a factor of 2.0, a container requesting 2 CPUs will
        initially receive 4 CPUs.
    """),
    examples=[1.5, 2.0, 3.0],
    validation_alias=AliasChoices("cpu-boost-factor", "cpu_boost_factor"),
    serialization_alias="cpu-boost-factor",
)
```
Copilot AI · Jan 5, 2026
When cpu_boost_factor is set to exactly 1.0 (the minimum allowed value), the boosted_cores calculation on line 422 will result in the same number as num_cores. However, line 424-425 ensures at least +1 core boost when enabled, which means even with a factor of 1.0, containers will get +1 core. This behavior may be unexpected for users who set factor to 1.0 thinking it means "no boost". Consider either: 1) Setting the minimum value to greater than 1.0 (e.g., 1.1), or 2) Documenting this behavior clearly in the cpu_boost_factor description.
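To make the edge case concrete (values are illustrative):

```python
num_cores, boost_factor = 2, 1.0
boosted = max(int(num_cores * boost_factor), num_cores + 1)
assert boosted == 3  # factor 1.0 still yields one extra core while boost is enabled
```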
Co-authored-by: Copilot <[email protected]>
this is only for test. do not merge this.