Build: use only one setting for build time limit #12369
Changes from all commits: 5908201, 9a5b58b, 29f31de, 4c55898, 4d36ec9, 20b2aa9, 722a612, de4aabb
```diff
@@ -86,18 +86,7 @@ def prepare_build(
     options["queue"] = project.build_queue

-    # Set per-task time limit
-    # TODO remove the use of Docker limits or replace the logic here. This
-    # was pulling the Docker limits that were set on each stack, but we moved
-    # to dynamic setting of the Docker limits. This sets a failsafe higher
-    # limit, but if no builds hit this limit, it should be safe to remove and
-    # rely on Docker to terminate things on time.
-    # time_limit = DOCKER_LIMITS['time']
-    time_limit = 7200
```
Comment on lines 89 to 95:

Due to this we were allowing 2h builds on .org and .com. However, our docs say we allow 15m and 30m by default, respectively. Changing this will have an impact on projects that were abusing our platform, but also on some valid projects. I checked this, and we have 227 projects with >15m successful builds that didn't ask support for this extra time:

```python
In [22]: len(set(Build.objects.filter(date__gt=timezone.now() - timezone.timedelta(days=90), length__gt=60*15, success=True, project__container_time_limit__isnull=True).values_list("project__slug", flat=True)))
Out[22]: 227
```

We need to decide what to do with them.

Reply: We have an average of 40 build instances in the last year. This may explain why this number has been that high...

Reply: This change was introduced in 2020 by commit 2e2c6f4.

Reply: That is definitely going to be a pretty big breaking change. We should have a plan here, since otherwise we will just get destroyed with support messages from these projects.

Reply: I updated the time limit for these projects using a script: https://gist.github.com/humitos/d247e99f93ebdd8624d06d663096edbc
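The gist itself isn't reproduced here; as a rough illustration, a one-off script along these lines could do it (the model import paths and the 30-minute value are assumptions, not taken from the gist):

```python
# Hypothetical one-off script; the real one lives in the gist linked above.
from django.utils import timezone

from readthedocs.builds.models import Build
from readthedocs.projects.models import Project

# Same query as in the review comment: projects with >15m successful builds
# in the last 90 days and no explicit per-project time limit.
slugs = set(
    Build.objects.filter(
        date__gt=timezone.now() - timezone.timedelta(days=90),
        length__gt=60 * 15,
        success=True,
        project__container_time_limit__isnull=True,
    ).values_list("project__slug", flat=True)
)

# Grandfather these projects in with an explicit per-project limit
# (30 minutes here is an illustrative value).
Project.objects.filter(slug__in=slugs).update(container_time_limit=30 * 60)
```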
```diff
-    try:
-        if project.container_time_limit:
-            time_limit = int(project.container_time_limit)
-    except ValueError:
-        log.warning("Invalid time_limit for project.")
+    time_limit = project.container_time_limit or settings.BUILD_TIME_LIMIT

     # Add 20% overhead to task, to ensure the build can timeout and the task
     # will cleanly finish.
```
Comment on the added line:

Do we know why the old logic was so defensive here? Seems like it was probably there for a reason. We aren't even casting it to an integer anymore.

Reply: Yeah, I checked this. The field is an `IntegerField`, so the value is already an integer by the time it reaches this code.
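If that's right, the removed `int()` cast and `ValueError` handling were effectively dead code. A minimal sketch of the field shape this relies on (the exact field options are an assumption):

```python
# Assumed shape of Project.container_time_limit; an IntegerField is validated
# and stored as an int by the ORM, so no cast is needed at read time.
from django.db import models


class Project(models.Model):
    container_time_limit = models.IntegerField(null=True, blank=True)
```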
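Together with the trailing comment in the hunk, the new flow reduces to something like this sketch (the `soft_time_limit`/`time_limit` option keys are assumptions about the elided surrounding code in `prepare_build`):

```python
from django.conf import settings


def build_task_options(project, options):
    """Sketch of the new time-limit flow, under the assumptions above."""
    # One setting, optionally overridden per project.
    time_limit = project.container_time_limit or settings.BUILD_TIME_LIMIT

    # 20% overhead so the build can time out and the task still finish cleanly.
    options["soft_time_limit"] = time_limit
    options["time_limit"] = int(time_limit * 1.2)
    return options
```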
```diff
@@ -589,6 +589,33 @@ def TEMPLATES(self):
     USE_I18N = True
     USE_L10N = True

+    BUILD_TIME_LIMIT = 900  # seconds
+
+    @property
+    def BUILD_MEMORY_LIMIT(self):
+        """
+        Set build memory limit dynamically, if in production, based on system memory.
+
+        We do this to avoid having separate build images. This assumes 1 build
+        process per server, which will be allowed to consume all available
+        memory.
+        """
+        # Our normal default
+        default_memory_limit = "7g"
+
+        # Only run on our servers
+        if self.RTD_IS_PRODUCTION:
+            total_memory, memory_limit = self._get_build_memory_limit()
+
+        memory_limit = memory_limit or default_memory_limit
+        log.info(
+            "Using dynamic build limits.",
+            hostname=socket.gethostname(),
+            memory=memory_limit,
+        )
+        return memory_limit
+
     # Celery
     CELERY_APP_NAME = "readthedocs"
     CELERY_ALWAYS_EAGER = True
```
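For context, a sketch of how a build container could consume `BUILD_MEMORY_LIMIT` via the Docker SDK; the `containers.run` wiring here is illustrative, not code from this PR:

```python
# Illustrative only: pass the memory limit through to docker-py.
import docker

client = docker.from_env()
container = client.containers.run(
    "ubuntu:22.04",    # image name is a placeholder
    command="sleep 10",
    mem_limit="7g",    # BUILD_MEMORY_LIMIT default from the diff above
    detach=True,
)
```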
```diff
@@ -605,7 +632,7 @@ def TEMPLATES(self):
     # https://github.com/readthedocs/readthedocs.org/issues/12317#issuecomment-3070950434
     # https://docs.celeryq.dev/en/stable/getting-started/backends-and-brokers/redis.html#visibility-timeout
     BROKER_TRANSPORT_OPTIONS = {
-        'visibility_timeout': 18000,  # 5 hours
+        'visibility_timeout': BUILD_TIME_LIMIT * 1.15,  # 15% more than the build time limit
     }

     CELERY_DEFAULT_QUEUE = "celery"
```

Comment on the changed line:

We're using buffers of 15%, 20%, and 25% in different places. Should they be the same?

Reply: No. These are different on purpose. The task should be the lowest, then the visibility timeout, and then the build API token. We still need to communicate with the API after the task gets killed, so we can update it.
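To make the reply concrete, the three buffers line up roughly like this (the 20% task overhead and the 15% visibility timeout appear in this PR's diffs; tying the 25% figure to the build API token is an inference from the reply):

```python
# All three buffers are multiples of the same base limit.
BUILD_TIME_LIMIT = 900  # seconds

visibility_timeout = BUILD_TIME_LIMIT * 1.15  # Redis broker redelivery window
task_time_limit = BUILD_TIME_LIMIT * 1.20     # Celery task hard limit
api_token_expiry = BUILD_TIME_LIMIT * 1.25    # build API token lifetime (assumed)
```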
```diff
@@ -721,7 +748,13 @@ def TEMPLATES(self):
     # since we can't read their config file image choice before cloning
     RTD_DOCKER_CLONE_IMAGE = RTD_DOCKER_BUILD_SETTINGS["os"]["ubuntu-22.04"]

-    def _get_docker_memory_limit(self):
+    def _get_build_memory_limit(self):
+        """
+        Return the build memory limit based on available system memory.
+
+        We subtract ~1000Mb for overhead of processes and base system, and set
+        the build time as proportional to the memory limit.
+        """
         try:
             total_memory = int(
                 subprocess.check_output(
```
```diff
@@ -735,47 +768,6 @@ def _get_docker_memory_limit(self):
             # int and raise a ValueError
             log.exception("Failed to get memory size, using defaults Docker limits.")

-    # Coefficient used to determine build time limit, as a percentage of total
-    # memory. Historical values here were 0.225 to 0.3.
-    DOCKER_TIME_LIMIT_COEFF = 0.25
-
-    @property
-    def DOCKER_LIMITS(self):
-        """
-        Set docker limits dynamically, if in production, based on system memory.
-
-        We do this to avoid having separate build images. This assumes 1 build
-        process per server, which will be allowed to consume all available
-        memory.
-
-        We subtract 750MiB for overhead of processes and base system, and set
-        the build time as proportional to the memory limit.
-        """
-        # Our normal default
-        limits = {
-            "memory": "2g",
-            "time": 900,
-        }
-
-        # Only run on our servers
-        if self.RTD_IS_PRODUCTION:
-            total_memory, memory_limit = self._get_docker_memory_limit()
-            if memory_limit:
-                limits = {
-                    "memory": f"{memory_limit}m",
-                    "time": max(
-                        limits["time"],
-                        round(total_memory * self.DOCKER_TIME_LIMIT_COEFF, -2),
-                    ),
-                }
-        log.info(
-            "Using dynamic docker limits.",
-            hostname=socket.gethostname(),
-            memory=limits["memory"],
-            time=limits["time"],
-        )
-        return limits

     # Allauth
     ACCOUNT_ADAPTER = "readthedocs.core.adapters.AccountAdapter"
     SOCIALACCOUNT_ADAPTER = 'readthedocs.core.adapters.SocialAccountAdapter'
```
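The body of `_get_build_memory_limit` is elided between the hunks above. A sketch of what it plausibly does, based on its docstring and the visible context lines (the `free` invocation and the exact arithmetic are assumptions):

```python
import subprocess


def _get_build_memory_limit_sketch():
    """Hypothetical reconstruction of the elided helper body."""
    try:
        # Total system memory in megabytes.
        total_memory = int(
            subprocess.check_output(
                "free -m | awk '/^Mem:/{print $2}'",
                shell=True,
            )
        )
    except (subprocess.CalledProcessError, ValueError):
        # Mirrors the visible context: fall back to defaults if detection fails.
        return None, None

    # Reserve ~1000 MB for the base system and other processes; the caller
    # expects a Docker-style memory string (it defaults to "7g").
    return total_memory, f"{total_memory - 1000}m"
```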
Comment on the settings change:

This is backwards incompatible for local installs. We might want to bump a major version, or build a translation layer from the old setting.

Reply: I'm fine bumping a major version. I don't think people really change this value, though. The default should be enough for most development use cases.
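A translation layer along the suggested lines could stay small; a sketch, assuming the old `DOCKER_LIMITS` dict shape from the removed code (not part of this PR):

```python
def translate_docker_limits(old_limits, default_time_limit=900):
    """Hypothetical shim: map an old-style DOCKER_LIMITS override, e.g.
    {"memory": "2g", "time": 900}, onto the new single setting."""
    if old_limits and "time" in old_limits:
        return int(old_limits["time"])
    return default_time_limit


# Example: a local install that overrode the old setting keeps its limit.
BUILD_TIME_LIMIT = translate_docker_limits({"memory": "2g", "time": 1800})  # 1800
```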