
Build: use only one setting for build time limit #12369


Merged (8 commits, Aug 12, 2025)
24 changes: 10 additions & 14 deletions docs/dev/settings.rst
@@ -1,23 +1,19 @@
Interesting settings
====================

-DOCKER_LIMITS
Member:
This is backwards incompatible for local installs. We might want to bump a major version, or build a translation layer from the old setting.

Member Author:
I'm fine bumping a major version. I don't think people really change this value, though. The default should be enough for most development use cases.

--------------
+BUILD_MEMORY_LIMIT
+------------------

-A dictionary of limits to virtual machines. These limits include:
+The maximum memory allocated to the virtual machine.
+If this limit is hit, build processes will be automatically killed.
+Examples: '200m' for 200MB of total memory, or '2g' for 2GB of total memory.

-time
-    An integer representing the total allowed time limit (in
-    seconds) of build processes. This time limit affects the parent
-    process to the virtual machine and will force a virtual machine
-    to die if a build is still running after the allotted time
-    expires.
+BUILD_TIME_LIMIT
+----------------

-memory
-    The maximum memory allocated to the virtual machine. If this
-    limit is hit, build processes will be automatically killed.
-    Examples: '200m' for 200MB of total memory, or '2g' for 2GB of
-    total memory.
+An integer representing the total allowed time limit (in seconds) of build processes.
+This time limit affects the parent process to the virtual machine and will force a virtual machine
+to die if a build is still running after the allotted time expires.

PRODUCTION_DOMAIN
------------------
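For local installs migrating off ``DOCKER_LIMITS``, a minimal sketch of the equivalent override in a custom settings class (illustrative only, mirroring the docker-compose defaults; not part of this PR):

from readthedocs.settings.base import CommunityBaseSettings


class CustomSettings(CommunityBaseSettings):
    # Replaces the old DOCKER_LIMITS = {"memory": "2g", "time": 900}
    BUILD_MEMORY_LIMIT = "2g"
    BUILD_TIME_LIMIT = 900  # seconds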
5 changes: 4 additions & 1 deletion readthedocs/api/v2/models.py
@@ -1,5 +1,6 @@
from datetime import timedelta

+from django.conf import settings
from django.db import models
from django.utils import timezone
from django.utils.translation import gettext_lazy as _
@@ -18,7 +19,9 @@ def create_key(self, project):
Build API keys are valid for 3 hours,
and can be revoked at any time by hitting the /api/v2/revoke/ endpoint.
"""
-expiry_date = timezone.now() + timedelta(hours=3)
+# Use the project or default build time limit + 25% for the API token
+delta = (project.container_time_limit or settings.BUILD_TIME_LIMIT) * 1.25
+expiry_date = timezone.now() + timedelta(seconds=delta)
name_max_length = self.model._meta.get_field("name").max_length
return super().create_key(
    # Name is required, so we use the project slug for it.
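A quick worked example of the new token lifetime (a sketch assuming the 900-second default and no per-project override):

from datetime import timedelta

BUILD_TIME_LIMIT = 900  # settings default introduced by this PR
container_time_limit = None  # no per-project override

delta = (container_time_limit or BUILD_TIME_LIMIT) * 1.25
print(timedelta(seconds=delta))  # 0:18:45, down from a fixed 3 hours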
13 changes: 1 addition & 12 deletions readthedocs/core/utils/__init__.py
@@ -86,18 +86,7 @@ def prepare_build(
options["queue"] = project.build_queue

# Set per-task time limit
-# TODO remove the use of Docker limits or replace the logic here. This
-# was pulling the Docker limits that were set on each stack, but we moved
-# to dynamic setting of the Docker limits. This sets a failsafe higher
-# limit, but if no builds hit this limit, it should be safe to remove and
-# rely on Docker to terminate things on time.
-# time_limit = DOCKER_LIMITS['time']
-time_limit = 7200
Comment on lines -89 to -95
Member Author:
Due to this, we were allowing 2h builds on .org and .com. However, our docs say we allow 15m and 30m by default, respectively.

Changing this will have an impact on projects that were abusing our platform, but also on some valid projects.

I checked this and we have 227 projects with >15m successful builds that didn't ask support for this extra time:

In [22]: len(set(Build.objects.filter(date__gt=timezone.now() - timezone.timedelta(days=90), length__gt=60*15, success=True, project__container_time_limit__isnull=True).values_list("project__slug", flat=True)))
Out[22]: 227

We need to decide what to do with them.

Member Author:
We have had an average of 40 build instances in the last year. This may explain why this number has been that high...

Member Author:
This change was introduced in 2020 by this commit 2e2c6f4

Member:
That is definitely going to be a pretty big breaking change. We should have a plan here, since otherwise we will just get destroyed with support messages from these projects.

Member Author:
I updated the time limit for these projects using a script: https://gist.github.com/humitos/d247e99f93ebdd8624d06d663096edbc
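A hedged sketch of what such a one-off script might do (the real one is in the gist above; the query is the one quoted earlier, and the 1800-second value here is purely illustrative):

from django.utils import timezone

from readthedocs.builds.models import Build
from readthedocs.projects.models import Project

# Projects with recent successful builds longer than the new 15m default
# and no explicit per-project limit yet.
slugs = set(
    Build.objects.filter(
        date__gt=timezone.now() - timezone.timedelta(days=90),
        length__gt=60 * 15,
        success=True,
        project__container_time_limit__isnull=True,
    ).values_list("project__slug", flat=True)
)

# Grandfather them in with an explicit, higher limit.
Project.objects.filter(slug__in=slugs).update(container_time_limit=1800)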

-try:
-    if project.container_time_limit:
-        time_limit = int(project.container_time_limit)
-except ValueError:
-    log.warning("Invalid time_limit for project.")
+time_limit = project.container_time_limit or settings.BUILD_TIME_LIMIT
Member:
Do we know why the old logic was so defensive here? Seems like it was probably there for a reason. We aren't even casting it to an integer anymore.

Member Author:
Yeah, I checked this. The field is an IntegerField so it cannot be anything other than an integer.

Member Author:

In [22]: set(Project.objects.filter(container_time_limit__isnull=False).values_list("container_time_limit", flat=True))
Out[22]: 
{60,
 950,
 1000,
 1200,
 1500,
 1600,
 1800,
 2000,
 2400,
 2500,
 2700,
 3000,
 3200,
 3300,
 3600,
 4200,
 5400,
 6300,
 7200,
 10800,
 31200,
 108000}


# Add 20% overhead to task, to ensure the build can timeout and the task
# will cleanly finish.
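The limits that fall out of this for the Celery task, matching the updated test further down (a sketch with the 900-second default):

BUILD_TIME_LIMIT = 900  # settings default
container_time_limit = None  # optional per-project override

time_limit = container_time_limit or BUILD_TIME_LIMIT
options = {
    "soft_time_limit": time_limit,   # 900s: the build is asked to stop
    "time_limit": time_limit * 1.2,  # 1080s: hard kill, 20% later
}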
1 change: 0 additions & 1 deletion readthedocs/doc_builder/constants.py
@@ -14,7 +14,6 @@
DOCKER_SOCKET = settings.DOCKER_SOCKET
DOCKER_VERSION = settings.DOCKER_VERSION
DOCKER_IMAGE = settings.DOCKER_IMAGE
-DOCKER_LIMITS = settings.DOCKER_LIMITS
DOCKER_TIMEOUT_EXIT_CODE = 42
DOCKER_OOM_EXIT_CODE = 137

9 changes: 2 additions & 7 deletions readthedocs/doc_builder/environments.py
Expand Up @@ -25,7 +25,6 @@

from .constants import DOCKER_HOSTNAME_MAX_LEN
from .constants import DOCKER_IMAGE
-from .constants import DOCKER_LIMITS
from .constants import DOCKER_OOM_EXIT_CODE
from .constants import DOCKER_SOCKET
from .constants import DOCKER_TIMEOUT_EXIT_CODE
@@ -581,8 +580,6 @@ class DockerBuildEnvironment(BaseBuildEnvironment):

command_class = DockerBuildCommand
container_image = DOCKER_IMAGE
-container_mem_limit = DOCKER_LIMITS.get("memory")
-container_time_limit = DOCKER_LIMITS.get("time")

def __init__(self, *args, **kwargs):
    container_image = kwargs.pop("container_image", None)
@@ -609,10 +606,8 @@ def __init__(self, *args, **kwargs):
if container_image:
    self.container_image = container_image

-if self.project.container_mem_limit:
-    self.container_mem_limit = self.project.container_mem_limit
-if self.project.container_time_limit:
-    self.container_time_limit = self.project.container_time_limit
+self.container_mem_limit = self.project.container_mem_limit or settings.BUILD_MEMORY_LIMIT
+self.container_time_limit = self.project.container_time_limit or settings.BUILD_TIME_LIMIT

structlog.contextvars.bind_contextvars(
project_slug=self.project.slug,
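For context, a rough sketch of how the two values typically reach Docker via docker-py (this is not the PR's code; the image tag and surrounding flow are illustrative): the memory limit becomes a hard cap on the container, while the time limit is enforced by the build environment killing the container after the deadline.

import docker

client = docker.APIClient()

# BUILD_MEMORY_LIMIT maps to a hard memory cap on the container.
host_config = client.create_host_config(mem_limit="7g")
container = client.create_container(
    image="readthedocs/build:latest",  # illustrative tag
    command="sleep infinity",
    host_config=host_config,
)
client.start(container=container.get("Id"))
# BUILD_TIME_LIMIT is enforced client-side: the environment kills the
# container once the allotted time expires.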
10 changes: 3 additions & 7 deletions readthedocs/projects/tasks/utils.py
@@ -162,17 +162,13 @@ def finish_inactive_builds():

A build is considered inactive if it's not in a final state and it has been
"running" for more time than the allowed one (``Project.container_time_limit``
-or ``DOCKER_LIMITS['time']`` plus a 20% of it).
+or ``BUILD_TIME_LIMIT`` plus 20% of it).

These inactive builds will be marked as ``success`` and ``CANCELLED`` with an
``error`` to be communicated to the user.
"""
-# TODO similar to the celery task time limit, we can't infer this from
-# Docker settings anymore, because Docker settings are determined on the
-# build servers dynamically.
-# time_limit = int(DOCKER_LIMITS['time'] * 1.2)
-# Set time as maximum celery task time limit + 5m
-time_limit = 7200 + 300
+# TODO: delete this task once we are fully migrated to ``BUILD_HEALTHCHECK``
+time_limit = settings.BUILD_TIME_LIMIT * 1.2
delta = datetime.timedelta(seconds=time_limit)
query = (
    ~Q(state__in=BUILD_FINAL_STATES)
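The detection threshold itself, as a sketch (900-second default; the remainder of the query is truncated above, so the date filter here is an assumption):

import datetime

from django.utils import timezone

BUILD_TIME_LIMIT = 900
delta = datetime.timedelta(seconds=BUILD_TIME_LIMIT * 1.2)  # 1080 seconds

# Builds not in a final state that started before this cutoff are treated
# as stuck and marked CANCELLED.
cutoff = timezone.now() - delta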
5 changes: 3 additions & 2 deletions readthedocs/rtd_tests/tests/test_core_utils.py
@@ -2,6 +2,7 @@
from unittest import mock

import pytest
+from django.conf import settings
from django.test import TestCase, override_settings
from django_dynamic_fixture import get

@@ -189,8 +190,8 @@ def test_trigger_max_concurrency_reached(self, update_docs, app):
trigger_build(project=self.project, version=self.version)
kwargs = {"build_commit": None, "build_api_key": mock.ANY}
options = {
-    "time_limit": int(7200 * 1.2),
-    "soft_time_limit": 7200,
+    "time_limit": settings.BUILD_TIME_LIMIT * 1.2,
+    "soft_time_limit": settings.BUILD_TIME_LIMIT,
    "countdown": 5 * 60,
    "max_retries": 25,
}
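``override_settings`` is already imported in this module, so tests that depend on a specific limit can pin it explicitly (a sketch, not part of this PR):

from django.test import TestCase, override_settings


@override_settings(BUILD_TIME_LIMIT=600)
class BuildTimeLimitTests(TestCase):
    def test_time_limit_is_pinned(self):
        from django.conf import settings

        assert settings.BUILD_TIME_LIMIT == 600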
78 changes: 35 additions & 43 deletions readthedocs/settings/base.py
@@ -589,6 +589,33 @@ def TEMPLATES(self):
USE_I18N = True
USE_L10N = True

+BUILD_TIME_LIMIT = 900  # seconds

+@property
+def BUILD_MEMORY_LIMIT(self):
+    """
+    Set build memory limit dynamically, if in production, based on system memory.
+
+    We do this to avoid having separate build images. This assumes 1 build
+    process per server, which will be allowed to consume all available
+    memory.
+    """
+    # Our normal default
+    default_memory_limit = "7g"
+    memory_limit = None  # ensure the name is bound outside production
+
+    # Only run on our servers
+    if self.RTD_IS_PRODUCTION:
+        total_memory, memory_limit = self._get_build_memory_limit()
+
+    memory_limit = memory_limit or default_memory_limit
+    log.info(
+        "Using dynamic build limits.",
+        hostname=socket.gethostname(),
+        memory=memory_limit,
+    )
+    return memory_limit


# Celery
CELERY_APP_NAME = "readthedocs"
CELERY_ALWAYS_EAGER = True
@@ -605,7 +632,7 @@ def TEMPLATES(self):
# https://github.com/readthedocs/readthedocs.org/issues/12317#issuecomment-3070950434
# https://docs.celeryq.dev/en/stable/getting-started/backends-and-brokers/redis.html#visibility-timeout
BROKER_TRANSPORT_OPTIONS = {
-    'visibility_timeout': 18000,  # 5 hours
+    'visibility_timeout': BUILD_TIME_LIMIT * 1.15,  # 15% more than the build time limit
Member:
We're using buffers of 15%, 20%, and 25% in different places. Should they be the same?

Member Author:
No. These are different on purpose.

The task limit should be the lowest, then the visibility timeout, and then the build API token. We still need to communicate with the API after the task gets killed, so we can update it.

}

CELERY_DEFAULT_QUEUE = "celery"
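For reference, the three buffers side by side under the default 900-second limit (a worked comparison, not code from the PR):

BUILD_TIME_LIMIT = 900

soft_time_limit = BUILD_TIME_LIMIT            # 900s: build is asked to stop
visibility_timeout = BUILD_TIME_LIMIT * 1.15  # 1035s: Redis redelivery window
time_limit = BUILD_TIME_LIMIT * 1.2           # 1080s: Celery hard kill
api_key_expiry = BUILD_TIME_LIMIT * 1.25      # 1125s: build API token lifetime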
@@ -721,7 +748,13 @@ def TEMPLATES(self):
# since we can't read their config file image choice before cloning
RTD_DOCKER_CLONE_IMAGE = RTD_DOCKER_BUILD_SETTINGS["os"]["ubuntu-22.04"]

-def _get_docker_memory_limit(self):
+def _get_build_memory_limit(self):
+    """
+    Return the build memory limit based on available system memory.
+
+    We subtract ~1000Mb for overhead of processes and base system, and set
+    the build time as proportional to the memory limit.
+    """
    try:
        total_memory = int(
            subprocess.check_output(
@@ -735,47 +768,6 @@
        # int and raise a ValueError
        log.exception("Failed to get memory size, using defaults Docker limits.")

-# Coefficient used to determine build time limit, as a percentage of total
-# memory. Historical values here were 0.225 to 0.3.
-DOCKER_TIME_LIMIT_COEFF = 0.25
-
-@property
-def DOCKER_LIMITS(self):
-    """
-    Set docker limits dynamically, if in production, based on system memory.
-
-    We do this to avoid having separate build images. This assumes 1 build
-    process per server, which will be allowed to consume all available
-    memory.
-
-    We subtract 750MiB for overhead of processes and base system, and set
-    the build time as proportional to the memory limit.
-    """
-    # Our normal default
-    limits = {
-        "memory": "2g",
-        "time": 900,
-    }
-
-    # Only run on our servers
-    if self.RTD_IS_PRODUCTION:
-        total_memory, memory_limit = self._get_docker_memory_limit()
-        if memory_limit:
-            limits = {
-                "memory": f"{memory_limit}m",
-                "time": max(
-                    limits["time"],
-                    round(total_memory * self.DOCKER_TIME_LIMIT_COEFF, -2),
-                ),
-            }
-    log.info(
-        "Using dynamic docker limits.",
-        hostname=socket.gethostname(),
-        memory=limits["memory"],
-        time=limits["time"],
-    )
-    return limits

# Allauth
ACCOUNT_ADAPTER = "readthedocs.core.adapters.AccountAdapter"
SOCIALACCOUNT_ADAPTER = 'readthedocs.core.adapters.SocialAccountAdapter'
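The probe body is truncated above; a hedged sketch of a ``free``-based version (the exact command and the rounding are assumptions modeled on the pre-existing helper):

import subprocess

# Total system memory in MB, as reported by `free`.
total_memory = int(
    subprocess.check_output(
        "free -m | awk '/^Mem:/{print $2}'",
        shell=True,
    )
)

# Keep ~1000MB of headroom for the base system and round to hundreds.
memory_limit = round(total_memory - 1000, -2)
print(f"{memory_limit}m")  # e.g. "15000m" on a 16GB build server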
2 changes: 1 addition & 1 deletion readthedocs/settings/docker_compose.py
@@ -15,7 +15,7 @@ class DockerBaseSettings(CommunityBaseSettings):
RTD_DOCKER_COMPOSE_NETWORK = "community_readthedocs"
RTD_DOCKER_COMPOSE_VOLUME = "community_build-user-builds"
RTD_DOCKER_USER = f"{os.geteuid()}:{os.getegid()}"
DOCKER_LIMITS = {"memory": "2g", "time": 900}
BUILD_MEMORY_LIMIT = "2g"

PRODUCTION_DOMAIN = os.environ.get("RTD_PRODUCTION_DOMAIN", "devthedocs.org")
PUBLIC_DOMAIN = os.environ.get("RTD_PUBLIC_DOMAIN", "devthedocs.org")
3 changes: 2 additions & 1 deletion readthedocs/settings/test.py
@@ -26,7 +26,8 @@ class CommunityTestSettings(CommunityBaseSettings):
CELERY_ALWAYS_EAGER = True

# Skip automatic detection of Docker limits for testing
DOCKER_LIMITS = {"memory": "200m", "time": 600}
BUILD_TIME_LIMIT = 600
BUILD_MEMORY_LIMIT = "200m"

CACHES = {
"default": {