Copilot AI commented Dec 9, 2025

The Build._config field currently stores configuration data in two ways: the full YAML config, or a reference to another build's config ({"__config": <build_id>}). This dual-purpose design complicates querying (e.g., "which projects use ubuntu-22.04?") and only deduplicates within a project's consecutive builds.
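To make the querying problem concrete, here is a minimal pure-Python sketch. It assumes made-up build IDs and config values, and the helpers `uses_os` and `resolve` are hypothetical names, not code from this PR. It shows how a filter on the raw stored value misses builds that only hold a reference:

```python
# Hypothetical raw _config values for three builds of one project.
raw_configs = {
    101: {"version": 2, "build": {"os": "ubuntu-22.04"}},
    102: {"__config": 101},  # identical to build 101, stored as a reference
    103: {"__config": 101},
}


def uses_os(config, os_name):
    """Check the build.os key of a config dict."""
    return config.get("build", {}).get("os") == os_name


def resolve(raw):
    """Follow a single reference hop (assumption: references point at a full config)."""
    return raw_configs[raw["__config"]] if "__config" in raw else raw


# Filtering on the stored value only sees the build holding the full config.
naive_matches = [pk for pk, raw in raw_configs.items() if uses_os(raw, "ubuntu-22.04")]

# Resolving the reference first finds every build that really used that OS.
resolved_matches = [
    pk for pk, raw in raw_configs.items() if uses_os(resolve(raw), "ubuntu-22.04")
]

print(naive_matches, resolved_matches)
```

The naive filter reports one build while the resolved one reports all three, which is the gap the new foreign key is meant to close.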

Changes

  • New BuildConfig model with data JSONField and unique=True constraint

    • Automatically deduplicates identical configs across all projects via database constraint
    • Uses TimeStampedModel for audit trail
  • Added Build.readthedocs_yaml_data ForeignKey

    • Points to BuildConfig with SET_NULL on delete
    • Enables direct ORM queries: BuildConfig.objects.filter(data__build__os="ubuntu-22.04")
    • Always populated, even when _config uses reference style
  • Updated Build.save() to populate both fields

    • Uses get_or_create() to leverage unique constraint
    • Always populates readthedocs_yaml_data by following reference chains when needed
    • When previous build has BuildConfig, reuses it directly for efficiency
    • When previous build lacks BuildConfig, follows the reference to get actual config
  • Migration 0066_add_buildconfig_model.py marked Safe.before_deploy()

  • Focused integration tests validating Build and BuildConfig interaction
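The save-time behaviour listed above can be sketched without Django. This is only an illustration under stated assumptions: in-memory dictionaries stand in for the `Build` and `BuildConfig` tables, and `resolve_config` and `BuildConfigStore` are hypothetical names rather than the code in this PR.

```python
import json

CONFIG_REF_KEY = "__config"  # key used by reference-style _config values


def resolve_config(raw_configs, build_id):
    """Follow {"__config": <build_id>} references until a full config is found."""
    seen = set()
    while build_id not in seen:
        seen.add(build_id)
        raw = raw_configs[build_id]
        if set(raw) != {CONFIG_REF_KEY}:
            return raw  # full config stored inline
        build_id = raw[CONFIG_REF_KEY]
    raise ValueError("circular _config reference")


class BuildConfigStore:
    """Stand-in for get_or_create() on a table with a unique `data` column."""

    def __init__(self):
        self._ids = {}  # canonical JSON string -> row id

    def get_or_create(self, data):
        key = json.dumps(data, sort_keys=True)  # canonical form for comparison
        created = key not in self._ids
        if created:
            self._ids[key] = len(self._ids) + 1
        return self._ids[key], created


raw_configs = {
    1: {"version": 2, "build": {"os": "ubuntu-22.04"}},
    2: {CONFIG_REF_KEY: 1},  # unchanged config, stored as a reference
    3: {CONFIG_REF_KEY: 2},  # reference to a reference
    4: {"version": 2, "build": {"os": "ubuntu-24.04"}},
}

store = BuildConfigStore()
config_ids = {
    build_id: store.get_or_create(resolve_config(raw_configs, build_id))[0]
    for build_id in raw_configs
}
print(config_ids)
```

Builds 1, 2 and 3 end up pointing at the same stored config, while build 4 gets its own row. In the real model, the unique constraint on `data` is what enforces this at the database level.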

Example

# Before: Complex logic to resolve config references
build = Build.objects.get(pk=123)
config = build.config  # Might query another build internally

# After: Direct access via foreign key (always populated)
build = Build.objects.select_related('readthedocs_yaml_data').get(pk=123)
config = build.readthedocs_yaml_data.data

# Query all projects using a specific OS
projects_with_ubuntu = Project.objects.filter(
    builds__readthedocs_yaml_data__data__build__os="ubuntu-22.04"
).distinct()
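
For readers less familiar with JSONField lookups: the double-underscore path `data__build__os` in the query above walks nested keys of the stored JSON. A rough pure-Python equivalent, using a hypothetical `lookup` helper and made-up sample rows, might look like this:

```python
def lookup(data, path, default=None):
    """Walk nested dict keys, e.g. path "build__os" reads data["build"]["os"]."""
    current = data
    for key in path.split("__"):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical sample rows: (project slug, config attached to one of its builds).
rows = [
    ("docs", {"version": 2, "build": {"os": "ubuntu-22.04"}}),
    ("docs", {"version": 2, "build": {"os": "ubuntu-22.04"}}),
    ("api", {"version": 2, "build": {"os": "ubuntu-24.04"}}),
    ("legacy", {"version": 2}),  # no build.os key at all
]

# Equivalent in spirit to the filter(...).distinct() query: each project once.
projects_with_ubuntu = sorted(
    {slug for slug, data in rows if lookup(data, "build__os") == "ubuntu-22.04"}
)
print(projects_with_ubuntu)
```

The set comprehension plays the role of `.distinct()`: a project with many matching builds is still listed only once.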

Rollout

This implements Step 1 of the multi-step migration. The _config field remains functional for backward compatibility. Future steps will:

  1. Migrate existing _config data to BuildConfig instances
  2. Switch application code to use readthedocs_yaml_data
  3. Remove _config field and helper methods
Original prompt

This section details the original issue you should resolve

<issue_title>Build: simplify _config attribute data modelling</issue_title>
<issue_description>We are currently using a models.JSONField for the Build._config attribute. This field can be used in two different ways:

  1. store the whole YAML configuration file
  2. save a build ID where the full/whole YAML configuration file is stored

This is basically to avoid duplicating data in our database. However, it only avoids duplicated configs within the same project, and only for consecutive builds.

I found this approach pretty confusing when exploring Build objects because we have to do extra operations to find out the real YAML used for that build (see for example the config @property at

@property
def config(self):
    """
    Proxy to the configuration of the build.

    :returns: The configuration used in the last successful build.
    :rtype: dict
    """
    last_build = (
        self.builds(manager=INTERNAL).filter(
            state=BUILD_STATE_FINISHED,
            success=True,
        ).order_by('-date')
        .only('_config')
        .first()
    )
    if last_build:
        return last_build.config
    return None
)

Also, it makes it hard to answer questions like "What are the projects using build.os: ubuntu-22.04?" in an easy way, as well as many other similar questions that require querying the config file.

New approach proposal

Due to these limitations and complications, I want to propose creating a new model called BuildConfig that's used as a foreign key from the Build model:

class Build(models.Model):
    # ...
    readthedocs_yaml_data = models.ForeignKey(
        "BuildConfig",
        # Keep the build row if its config is deleted (nullable FK).
        on_delete=models.SET_NULL,
        null=True,
        blank=True,
    )
    # ...

Then the new model would be:

class BuildConfig(TimeStampedModel):
    # ...
    data = models.JSONField(unique=True)
    # ...

Benefits

  • deduplication is handled automatically by the database at the table level across all projects, instead of per project (e.g. BuildConfig.objects.get_or_create(data=...))
  • the previous point will reduce the size of the table considerably, since there won't be any duplicated config
  • getting the YAML for a particular build is going to be pretty fast since we have direct access to it
  • we can reduce the number of queries performed by using .select_related("readthedocs_yaml_data")
  • answer interesting/complex questions pretty easily using the Django ORM (e.g. Project.objects.filter(builds__readthedocs_yaml_data__data__build__os="ubuntu-22.04").distinct().count() to answer my previous example question)
  • remove the requirement of a helper @property to get the YAML for a particular build
  • remove the requirement for a @config.setter helper
  • remove the requirement of having extra logic at Build.save
  • remove the need of using self._config_changed
  • remove having two different meanings for the same field
  • allows us to quickly show badges on "Build details page" as Anthony suggested for ext-theme

In summary, this approach keeps the same features but simplifies the application and the data modelling, and makes it easier to analyze platform usage and make better decisions based on it.

Rollout proposal

  1. add a Build.readthedocs_yaml_data field and create the new BuildConfig model
  2. start saving both fields Build.readthedocs_yaml_data and Build._config while doing the migration
  3. make a data migration to convert Build._config into BuildConfig and link Build.readthedocs_yaml_data to them
  4. make the application use the new Build.readthedocs_yaml_data field
  5. remove the old code that uses Build._config field</issue_description>

Comments on the Issue (you are @copilot in this section)

@humitos Another important benefit that I'm seeing here about making querying our database easier: while working on the deprecation of config file and `build.image` I had to make multiple queries and iterate over all the builds returned to get the _real_ value of `Build._config` https://github.com/readthedocs/readthedocs.org/blob/1b74999add86d6a45f4c6f6149dcb21d5ef2322b/readthedocs/projects/tasks/utils.py#L235-L251

Instead, we would be able to do all that in just one query:

Project.objects.filter(
    builds__version__slug=F("default_version"),
    builds__date__gt=timezone.now() - timezone.timedelta(days=365),
    builds__success=True,
    builds___config__build__image="readthedocs/build:latest",
).distinct()

That query will give me "projects building with build.image: readthedocs/build:latest on their default version with a successful build in the last year", which is a lot easier to read, maintain and communicate to the rest of the team.

Note that with the current structure we won't get accurate data, because builds whose config file is identical to their previous build's will have _config defined as {"__config": 12345678}</comment_new>
<comment_new>@ericholscher
Yeah, it seems like our implementation is just a hacky way of doing a m2m, without any of the nice benefits. This seems like an obvious thing to move forward...


Copilot AI changed the title from "[WIP] Simplify _config attribute data modelling" to "Build: Add BuildConfig model for deduplicated config storage" Dec 9, 2025
Copilot AI requested a review from humitos December 9, 2025 17:51

@humitos humitos left a comment

This is a good first step towards the end goal. Removing the _config and using this new field will reduce the complexity a lot and will give us a lot more power to perform simple queries over config files 🎉

@copilot please, take a look at the feedback I gave you and fix it.

null=True,
blank=True,
)
readthedocs_yaml_data = models.ForeignKey(

Are we OK with this name? Other alternatives are:

  • readthedocs_yaml (no _data at the end)
  • yaml_config
  • config_yaml
  • config_file

cc @ericholscher what do you think?

Copilot AI requested a review from humitos December 10, 2025 10:13

humitos commented Dec 10, 2025

This is a good starting point to create the field and start populating it. We can move forward here and then work on part two of this work (use the new field everywhere) and then part three (remove the old field). @ericholscher let me know if this plan sounds good to you.

@humitos humitos marked this pull request as ready for review December 10, 2025 11:56
@humitos humitos requested a review from a team as a code owner December 10, 2025 11:56

Development

Successfully merging this pull request may close these issues.

Build: simplify _config attribute data modelling
