[CI] Upstream premerge terraform configuration #117397
New file (1 line), an ignore pattern for local Terraform state and lock files:

```
.terraform*
```
New file (31 lines), the infrastructure README:
# Premerge Infrastructure

This folder contains the Terraform configuration files that define the GCP
resources used to run the premerge checks. Currently, only Google employees
with access to the GCP project where these checks are hosted are able to apply
changes. Pull requests from anyone are still welcome.

## Setup

- Install Terraform (https://developer.hashicorp.com/terraform/install?product_intent=terraform).
- Get the GCP tokens: `gcloud auth application-default login`
- Initialize Terraform: `terraform init`

To apply any changes to the cluster:
- Set up the cluster: `terraform apply`
- Terraform will list the proposed changes.
- Enter 'yes' when prompted.

## Setting the cluster up for the first time

```
terraform apply -target google_container_node_pool.llvm_premerge_linux_service
terraform apply -target google_container_node_pool.llvm_premerge_linux
terraform apply -target google_container_node_pool.llvm_premerge_windows
terraform apply
```

Setting the cluster up for the first time is more involved, as there are
certain resources for which Terraform is unable to handle the explicit
dependencies. This means we have to set up the GKE cluster before we set up
any of the Kubernetes resources, as otherwise the Terraform Kubernetes
provider will error out.
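The ordering constraint described above typically arises because the Terraform Kubernetes provider is configured from attributes of the not-yet-created GKE cluster. A minimal sketch of that pattern (the resource and attribute names here are illustrative, not taken from this configuration):

```
provider "kubernetes" {
  # These values only exist once the GKE cluster has been created, so the
  # provider cannot be configured in the same first apply that creates the
  # cluster -- hence the targeted applies above.
  host                   = "https://${google_container_cluster.cluster.endpoint}"
  cluster_ca_certificate = base64decode(google_container_cluster.cluster.master_auth[0].cluster_ca_certificate)
  token                  = data.google_client_config.default.access_token
}
```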
New file (5 lines), configuring the Terraform remote state backend on GCS:
```
terraform {
  backend "gcs" {
    bucket = "3772b2f502380a18-terraform-remote-backend"
  }
}
```
New file (41 lines), Helm values enabling metrics, logs, and traces collection for cluster monitoring:
```
metrics:
  enabled: true
  alloy:
    metricsTuning:
      useIntegrationAllowList: true
  cost:
    enabled: true
  kepler:
    enabled: true
  node-exporter:
    enabled: true
logs:
  enabled: true
  pod_logs:
    enabled: true
  cluster_events:
    enabled: true
traces:
  enabled: true
receivers:
  grpc:
    enabled: true
  http:
    enabled: true
  zipkin:
    enabled: true
  grafanaCloudMetrics:
    enabled: false
opencost:
  enabled: true
kube-state-metrics:
  enabled: true
prometheus-node-exporter:
  enabled: true
prometheus-operator-crds:
  enabled: true
kepler:
  enabled: true
alloy: {}
alloy-events: {}
alloy-logs: {}
```
New file (26 lines), the pod template for Linux job containers:
```
spec:
  tolerations:
    - key: "premerge-platform"
      operator: "Equal"
      value: "linux"
      effect: "NoSchedule"
  nodeSelector:
    premerge-platform: linux
  containers:
    - name: $job
      resources:
        # The container is always scheduled on the same node as the runner.
        # Since we use the runner's requests.cpu for scheduling/autoscaling,
        # the request here should be set to something small.
        #
        # The limit, however, should be the number of cores of the node. Any
        # limit lower than the number of cores could slow down the job.
        #
        # For memory, the request/limits shall be accurate: they are not used
        # for scheduling, but Kubernetes uses them for OOM kills.
        requests:
          cpu: "100m"
          memory: "50Gi"
        limits:
          cpu: 56
          memory: "100Gi"
```
New file (74 lines), Helm values for the Linux runner scale set:
```
githubConfigUrl: "https://github.com/llvm"
githubConfigSecret: "github-token"

minRunners: 0
maxRunners: 4

containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "standard-rwo"
    resources:
      requests:
        storage: "100Gi"
  kubernetesModeServiceAccount:
    annotations:

template:
  spec:
    tolerations:
      - key: "premerge-platform"
        operator: "Equal"
        value: "linux"
        effect: "NoSchedule"
    nodeSelector:
      premerge-platform: linux
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        resources:
          # The job container will be scheduled on the same node as this
          # runner. This means that if we don't set the CPU request high
          # enough here, two runners will be scheduled on the same node,
          # meaning two jobs.
          #
          # This number should be:
          #  - greater than number_of_cores / 2: a lower value could allow
          #    the scheduler to put two runners on the same node, meaning
          #    two jobs sharing the resources.
          #  - lower than number_of_cores: each node runs some basic
          #    services (metrics, for example) that already require some
          #    amount of CPU (~0.5). This means we don't have exactly N
          #    cores to allocate, but N - epsilon.
          #
          # Memory, however, shall be handled at the container level. The
          # runner itself doesn't need much; we just request enough not to
          # get OOM killed.
          requests:
            cpu: 50
            memory: "2Gi"
          limits:
            cpu: 56
            memory: "2Gi"
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOKS
            value: /home/runner/k8s/index.js
          - name: ACTIONS_RUNNER_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "true"
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: "/home/runner/pod-config/linux-container-pod-template.yaml"
        volumeMounts:
          - name: container-pod-config
            mountPath: /home/runner/pod-config
    securityContext:
      fsGroup: 123
    volumes:
      - name: container-pod-config
        configMap:
          name: linux-container-pod-template
```
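The CPU bounds argued in the comments above can be sanity-checked with a short sketch (a hypothetical helper, not part of this configuration; the numbers match the 56-core nodes and ~0.5-core system overhead assumed above):

```python
def max_runners_per_node(node_cores: float, system_reserved: float,
                         runner_request: float) -> int:
    """How many runner pods fit on one node, by CPU request alone."""
    allocatable = node_cores - system_reserved
    return int(allocatable // runner_request)

# A request of 50 on a 56-core node (as in the values above) fits exactly
# one runner, so two jobs never share a node.
assert max_runners_per_node(56, 0.5, 50) == 1

# A request below half the cores would let the scheduler co-locate two
# runners, i.e. two jobs sharing the node's resources.
assert max_runners_per_node(56, 0.5, 27) == 2
```

This is why the request must sit strictly between `number_of_cores / 2` and `number_of_cores`.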
The `.ci/infrastructure` folder is a bit of a "general" name for this: should we nest it in `.ci/infrastructure/terraform/` instead, to leave room for other things here?

Also, so far all of that kind of infrastructure has been in Zorg (like here, for example); I'm not sure how we think about Zorg vs. this right now?
---

We can rename it to something like `.ci/terraform`. Nesting it into something like `.ci/infrastructure/terraform` seems a bit premature to me. I don't think there are other major infrastructure pieces that we need.

Good point on bringing up Zorg. I think it makes more sense to keep all of the precommit configurations in the monorepo. The infra is somewhat coupled to the workflows/associated scripting, which have to be in the monorepo. Given that, I think keeping the infra definitions here makes sense too.
---

Is this more coupled than the way Zorg interacts with the monorepo already? At some point Zorg also seems to be defining some infrastructure that refers to commands/scripts/options defined in the monorepo, so from afar I can see a similarity here.
---

The main difference is that all of the associated scripts/build configurations/setup is defined in the monorepo. The premerge CI scripts are already here, and all the GHA workflows will have to be in the monorepo. So yes, I think the premerge infra is significantly more coupled to the monorepo than the postcommit infra in Zorg.
---

It isn't obvious to me why a Terraform script that somewhere calls a premerge script in the monorepo is more coupled than any infra setup in Zorg that calls a similar script in the monorepo (or even calls cmake+ninja, which isn't really different from a script entry point).
---

Zorg (from when I've hacked on it) mostly contains Python scripts that define testing workflows for the various builders. The equivalent testing workflows for premerge are defined in the monorepo (currently shell scripts in `.ci`, eventually GHA workflow definitions in `.github`), which leads me to the conclusion that premerge is more tightly coupled to the monorepo.

Putting premerge infra in Zorg seems a bit out of place to me. Currently, it solely contains postcommit configurations/infra. We already have a spot where all of the in-tree premerge scripts and whatnot live, and that's `.ci/` in the monorepo.

---

I've found it very useful to be able to make changes to premerge configuration and scripts, and test them right away as a regular PR to the monorepo. The workflow for Zorg changes seems less straightforward. Since this PR is concerned with premerge stuff, I prefer to keep the changes in the monorepo.
---

I probably misunderstand what this PR is doing then, because it seemed to me that nothing here is about something that you would be able to test with a PR; instead it defines the underlying infra (Kubernetes, machines, runners) that needs to be set up offline before you can rely on it for testing.

Sorry: I don't get it. I read this as mixing up unrelated things here, but I may be misunderstanding what you're doing. It seems that the two things are 1) the config of the infra (machines and runners) and 2) "what is executed by the runner based on some event".

In the monorepo we have so far:

In Zorg there is:

So it is true that 2 and 4 are at odds (but for good reason, with properties such as what @Endilll mentioned above).

But I'm not following why this implies anything about the rest of the config, which is really more "infra" than anything. The question of coupling is really about this: how is the actual Kubernetes/Docker config more coupled to the execution script (the 2./4. bullet points above, which basically just wrap cmake/ninja) in one case than the other?
---

I agree with the points made so far. It is true that most of what's proposed here is mostly internals for infra management, so this could indeed live in llvm-zorg.

And since a Googler has to run a `terraform apply` for any change anyway, having them in this repo or the other doesn't bring any sync/atomicity benefit IMO.