
cluster-api-provider-aws-build-docker is frequently OOM killed #5576

@mdbooth

/kind bug

Examples from the last few days:

In all cases the build log stops during goreleaser, which is what you'd expect. For example:

hack/tools/bin/goreleaser build --config .goreleaser.yaml --snapshot --clean
  • starting build...
  • loading                                          path=.goreleaser.yaml
  • skipping validate...
  • loading environment variables
  • getting and validating git state
    • ignoring errors because this is a snapshot     error=couldn't get remote URL: fatal: No remote configured to list refs from.
    • git state                                      commit=none branch=none current_tag=v0.0.0 previous_tag=<unknown> dirty=false
    • pipe skipped                                   reason=disabled during snapshot mode
  • parsing tag
  • setting defaults
  • snapshotting
    • building snapshot...                           version=0.0.0-SNAPSHOT-none
  • checking distribution directory
  • loading go mod information
  • build prerequisites
  • writing effective config file
    • writing                                        config=dist/config.yaml
  • building binaries
    • building                                       binary=dist/clusterctl-aws_windows_arm64/bin/clusterctl-aws.exe
    • building                                       binary=dist/clusterctl-aws_darwin_arm64/bin/clusterctl-aws
    • building                                       binary=dist/clusterctl-aws_darwin_amd64_v1/bin/clusterctl-aws
    • building                                       binary=dist/clusterctl-aws_windows_amd64_v1/bin/clusterctl-aws.exe
    • building                                       binary=dist/clusterctl-aws_linux_amd64_v1/bin/clusterctl-aws
    • building                                       binary=dist/clusterctl-aws_linux_arm64/bin/clusterctl-aws

We can see the resource usage of these jobs in Grafana here:

https://monitoring-eks.prow.k8s.io/d/96Q8oOOZk/builds?var-org=kubernetes-sigs&var-repo=cluster-api-provider-aws&var-job=pull-cluster-api-provider-aws-build-docker&orgId=1&from=now-24h&to=now

We can see that the jobs are coming perilously close to their 12G limit, which is already very high. The memory limit is defined here:

https://github.com/kubernetes/test-infra/blob/42c25d98a67da245e2bdf8612766f4c85103fe8c/config/jobs/kubernetes-sigs/cluster-api-provider-aws/cluster-api-provider-aws-presubmits.yaml#L91-L97
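
For reference, prow sets a job's memory ceiling through the container resources stanza in that presubmit definition. A minimal sketch of its shape, assuming the usual prow job conventions (the 12Gi figure is the limit discussed in this issue; the CPU value is illustrative, not copied from the linked file):

  resources:
    requests:
      cpu: "2"
      memory: "12Gi"
    limits:
      memory: "12Gi"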

Rather than increasing the memory limit still further, I propose restricting the parallelism of these build jobs to bring the memory usage down.
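
The log above shows where the memory goes: goreleaser compiles the six clusterctl-aws target platforms concurrently, by default running as many build tasks in parallel as there are CPUs, and each cross-compilation is a full Go build with its own working set. goreleaser's --parallelism flag caps the number of concurrent tasks, so a minimal sketch of the proposed change, assuming the Makefile target invokes goreleaser exactly as shown in the log (the value 2 is illustrative and would need tuning against the Grafana data):

  hack/tools/bin/goreleaser build --config .goreleaser.yaml --snapshot --clean --parallelism 2

If the same target is also used for real release builds, the value could be threaded through a Makefile variable so that only the CI job gets the lower setting.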

Labels: kind/bug, needs-priority, needs-triage
