nammn
Collaborator

@nammn nammn commented Aug 21, 2025

Summary

This pull request introduces a new branch and cache scoping system for Docker BuildKit remote cache in CI and local development environments. It adds robust branch detection and cache scope generation logic, integrates these into the image build process, and provides comprehensive unit tests to ensure correctness. The main goal is to enable per-branch and per-patch caching for Docker builds, improving build performance and cache isolation.

Branch and cache scoping infrastructure:

  • Added scripts/release/branch_detection.py with functions to detect the current git branch and generate sanitized cache scope strings for use in BuildKit caching, supporting both local development and Evergreen CI environments.
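
A minimal sketch of what branch detection and scope sanitization could look like. This is not the code from scripts/release/branch_detection.py: the function names, the CI_BRANCH_NAME environment variable, the "master" fallback, and the lowercase-and-hyphenate rule are all assumptions made for illustration. The one constraint it relies on is that a Docker tag may contain letters, digits, underscores, periods, and hyphens, may not start with a period or hyphen, and is at most 128 characters long.

```python
import os
import re
import subprocess


def detect_branch(default: str = "master") -> str:
    """Return the current branch name, falling back to `default`."""
    # Hypothetical CI override; the variable name is an assumption.
    ci_branch = os.environ.get("CI_BRANCH_NAME")
    if ci_branch:
        return ci_branch
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is missing or we are not inside a repository.
        return default
    branch = result.stdout.strip()
    # A detached HEAD reports the literal string "HEAD".
    return branch if branch and branch != "HEAD" else default


def sanitize_cache_scope(branch: str, max_length: int = 128) -> str:
    """Map a branch name to a string that is usable as a Docker tag."""
    # Replace every run of disallowed characters with a single hyphen.
    scope = re.sub(r"[^a-z0-9_.-]+", "-", branch.lower())
    # Tags may not start with "." or "-"; trailing ones are trimmed for tidiness.
    scope = scope.strip("-.")
    return scope[:max_length] or "master"
```

With these assumptions, a branch such as feature/Add_Cache#42 becomes feature-add_cache-42, which can be used directly as a cache tag.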

Integration with image build process:

  • Refactored scripts/release/build/image_build_process.py to use the new branch and cache scope utilities for configuring BuildKit cache, including logic to set up per-branch/per-patch cache repositories and read/write precedence (branch → master).
  • Ensured that ECR cache repositories are created as needed before builds, and added detailed logging for cache configuration during builds.

Proof of Work

  • green ci
  • example output
[2025/08/21 17:03:36.058] #31 exporting cache to registry
[2025/08/21 17:03:36.058] #31 preparing build cache for export
[2025/08/21 17:03:56.826] #31 writing layer sha256:104f1be5715fef257f90b38fddaeee7ef6ae9f1da7e6b03051fdd5ba50f18026
[2025/08/21 17:03:57.846] #31 writing layer sha256:104f1be5715fef257f90b38fddaeee7ef6ae9f1da7e6b03051fdd5ba50f18026 1.0s done
[2025/08/21 17:03:57.846] #31 writing layer sha256:11df871e90940bf39eb8c800c8966b01a29a704e1dc0cdbcd85107e2bbf98948
[2025/08/21 17:03:58.294] #31 writing layer sha256:11df871e90940bf39eb8c800c8966b01a29a704e1dc0cdbcd85107e2bbf98948 0.4s done
[2025/08/21 17:03:58.294] #31 writing layer sha256:14b1babc8f267ede2a5e969d069d4be8f28ada9ff5d950645d8dbed508b8e91d
  • unit tests

Checklist

  • Have you linked a Jira ticket and/or is the ticket in the title?
  • Have you checked whether your Jira ticket required DOCSP changes?
  • Have you added a changelog file?

github-actions bot commented Aug 21, 2025

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.3.0 Release Notes

New Features

Multi-Architecture Support

We've added comprehensive multi-architecture support for the Kubernetes operator. This enhancement enables deployment on IBM Power (ppc64le) and IBM Z (s390x) architectures alongside existing x86_64 support. Core images (operator, agent, init containers, database, readiness probe) now support multiple architectures. We do not add IBM or ARM support for Ops Manager and the init-ops-manager image.

  • MongoDB Agent images have been migrated to new container repository: quay.io/mongodb/mongodb-agent.
    • The agents in the new repository will support the x86-64, ARM64, s390x, and ppc64le architectures. More can be read in the public docs.
    • Operators running MCK >= 1.3.0 with static containers cannot use the agent images from the old container repository quay.io/mongodb/mongodb-agent-ubi.
  • quay.io/mongodb/mongodb-agent-ubi should no longer be used; it remains only for backwards compatibility.

Bug Fixes

  • This change fixes the complex and difficult-to-maintain architecture for stateful set containers, which relied on an "agent matrix" to map operator and agent versions and led to a very large number of images.
  • We solve this by shifting to a 3-container setup. This new design eliminates the need for the operator-version/agent-version matrix by adding one additional container containing all required binaries. This architecture maps to what we already do with the mongodb-database container.
  • Fixed an issue where the readiness probe reported the node as ready even when its authentication mechanism was not in sync with the other nodes, potentially causing premature restarts.

Other Changes

  • Optional permissions for PersistentVolumeClaim moved to a separate role. When managing the operator with Helm, it is possible to disable permissions for PersistentVolumeClaim resources by setting the operator.enablePVCResize value to false (true by default). Previously, when enabled, these permissions were part of the primary operator role. With this change, they have a separate role.
  • subresourceEnabled Helm value was removed. This setting used to be true by default and made it possible to exclude subresource permissions from the operator role by specifying false as the value. We are removing this configuration option, making the operator roles always have subresource permissions. This setting was introduced as a temporary solution for this OpenShift issue. The issue has since been resolved and the setting is no longer needed.
  • We have deliberately not published the container images for OpsManager versions 7.0.16, 8.0.8, 8.0.9 and 8.0.10 due to a bug in OpsManager which prevents MCK customers from upgrading their OpsManager deployments to those versions.

@nammn nammn added the skip-changelog label (Use this label in Pull Request to not require new changelog entry file) Aug 21, 2025
"ref": f"{cache_registry}:cache",
"mode": "max",
"oci-mediatypes": "true",
"image-manifest": "true"
Collaborator Author

"""
By default, the OCI media type generates an image index for the cache image. Some OCI registries, such as Amazon ECR, don't support the image index media type: application/vnd.oci.image.index.v1+json. If you export cache images to ECR, or any other registry that doesn't support image indices, set the image-manifest parameter to true to generate a single image manifest instead of an image index for the cache image:
"""

"type": "registry",
"ref": f"{cache_registry}:cache",
"mode": "max",
"oci-mediatypes": "true",
Collaborator Author

oci is the future

cache_to = {
"type": "registry",
"ref": f"{cache_registry}:cache",
"mode": "max",
Collaborator Author

max means we are caching all layers, including those of intermediate build stages (min only caches the layers of the final image)
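
The three review comments above concern the options passed to the registry cache exporter (mode, oci-mediatypes, image-manifest). As a rough illustration of how such a mapping corresponds to the comma-separated key=value form accepted by the docker buildx build --cache-to flag, here is a small sketch. The helper name render_cache_option and the registry URL are hypothetical and not taken from the PR.

```python
def render_cache_option(options: dict[str, str]) -> str:
    """Render a cache option mapping as a comma-separated key=value string."""
    return ",".join(f"{key}={value}" for key, value in options.items())


# Hypothetical ECR repository used only for this example.
cache_registry = "123456789012.dkr.ecr.us-east-1.amazonaws.com/cache/example"

cache_to = {
    "type": "registry",
    "ref": f"{cache_registry}:cache",
    "mode": "max",  # export layers of all build stages, not only the final image
    "oci-mediatypes": "true",  # use OCI media types for the cache image
    "image-manifest": "true",  # single manifest instead of an image index (needed for ECR)
}

# The resulting string is what would follow --cache-to on the command line.
print(render_cache_option(cache_to))
```

Keeping the options in a dictionary until the last moment makes it easy to log the configuration or change a single key per environment.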

@nammn nammn changed the title from "add docker caching" to "CLOUDP-339878 - add docker caching" Aug 22, 2025
@nammn nammn marked this pull request as ready for review August 22, 2025 16:16
@nammn nammn requested a review from a team as a code owner August 22, 2025 16:16
return cache_from_refs, cache_to_refs


def ensure_all_cache_repositories(cache_image_names: List[str], region: str = "us-east-1"):
Contributor

Is this called anywhere?

Collaborator Author

no - leftover - removed!

@nammn nammn requested a review from lucian-tosa September 1, 2025 12:17

DEFAULT_BUILDER_NAME = "multiarch" # Default buildx builder name


def ensure_ecr_cache_repository(repository_name: str, region: str = "us-east-1"):
Collaborator Author

if there is none - we should create the repo!
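
A sketch of the create-if-missing pattern this comment describes. It is not the PR's ensure_ecr_cache_repository: the name ensure_repository_exists and the idea of passing the client in as a parameter are assumptions, chosen so the logic can be exercised without AWS credentials. The calls describe_repositories, create_repository, and the RepositoryNotFoundException class are the ones exposed by a boto3 ECR client.

```python
from typing import Any


def ensure_repository_exists(ecr_client: Any, repository_name: str) -> bool:
    """Create `repository_name` if it is missing.

    Returns True when the repository was created, False when it already existed.
    `ecr_client` is expected to behave like boto3.client("ecr").
    """
    try:
        ecr_client.describe_repositories(repositoryNames=[repository_name])
        return False
    except ecr_client.exceptions.RepositoryNotFoundException:
        # No repository yet, so create it before the build pushes cache layers.
        ecr_client.create_repository(repositoryName=repository_name)
        return True
```

In real use the client would come from something like boto3.client("ecr", region_name="us-east-1"). Two builds racing to create the same repository could still collide, so a production version would probably also tolerate an "already exists" error from the create call.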

raise


def build_cache_configuration(base_registry: str) -> tuple[list[Any], dict[str, str]]:
Collaborator Author

Some gotchas:

  • It's one cache per image.
  • Every cache tag has one manifest, which gets overwritten on each push.
    • Thus, I've added a dedicated cache tag per branch, but we always read from master and push to master on master merges (more info here and here and a really good blog post).
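
To make the read/write precedence concrete, here is a hedged sketch of how the cache references could be assembled under the rules listed above. It is not the PR's build_cache_configuration: the function name, its parameters, and the example repository are hypothetical. It assumes the branch name has already been sanitized into a valid tag.

```python
from typing import Any


def sketch_cache_configuration(
    cache_repository: str, branch_scope: str, main_branch: str = "master"
) -> tuple[list[Any], dict[str, str]]:
    """Return (cache_from, cache_to) for a single image's cache repository."""

    def ref(tag: str) -> dict[str, str]:
        return {"type": "registry", "ref": f"{cache_repository}:{tag}"}

    # Read order: the branch's own cache first, then the main branch as fallback.
    cache_from = [ref(branch_scope)]
    if branch_scope != main_branch:
        cache_from.append(ref(main_branch))

    # Write only to the current branch's tag, so a feature branch never
    # overwrites the main cache. Builds on the main branch write to it.
    cache_to = {
        **ref(branch_scope),
        "mode": "max",
        "oci-mediatypes": "true",
        "image-manifest": "true",
    }
    return cache_from, cache_to


cache_from, cache_to = sketch_cache_configuration("registry.example/cache/operator", "feature-x")
print([entry["ref"] for entry in cache_from])
print(cache_to["ref"])
```

Because each tag holds a single manifest that is replaced on every push, keeping one tag per branch avoids branches evicting each other's layers while still letting a fresh branch start from the main branch cache.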
