1 change: 1 addition & 0 deletions .ci/infrastructure/.gitignore
@@ -0,0 +1 @@
.terraform*
31 changes: 31 additions & 0 deletions .ci/infrastructure/README.md
@@ -0,0 +1,31 @@
# Premerge Infrastructure

This folder contains the terraform configuration files that define the GCP
Collaborator

.ci/infrastructure folder is a bit of a "general" name for this: should we nest it in .ci/infrastructure/terraform/ instead to leave room for other things here?

Also, so far all of that kind of infrastructure has been in Zorg (like here, for example); I'm not sure how we think about Zorg vs. this right now?

Contributor Author

We can rename it to something like .ci/terraform. Nesting it into something like .ci/infrastructure/terraform seems a bit premature to me. I don't think there are other major infrastructure pieces that we need.

Good point on bringing up Zorg. I think it makes more sense to keep all of the precommit configurations in the monorepo. The infra is somewhat coupled to the workflows/associated scripting, which have to be in the monorepo. Given that, I think keeping the infra definitions here makes sense too.

Collaborator

Is this more coupled than the way Zorg interacts with the monorepo already? Zorg also seems to define some infrastructure that refers to commands/scripts/options defined in the monorepo, so from a distance I see a similarity here.

Contributor Author

The main difference is that all of the associated scripts/build configurations/setup is defined in the monorepo. The premerge CI scripts are already here, and all the GHA workflows will have to be in the monorepo. So yes, I think the premerge infra is significantly more coupled to the monorepo than the postcommit infra in zorg.

Collaborator

It isn't obvious to me why a terraform script that somewhere calls a premerge script in the monorepo is more coupled than any infra setup in Zorg that calls a similar script in the monorepo (or even calls cmake+ninja, which isn't really different from a script entry point).

Contributor Author

Zorg (from when I've hacked on it) mostly contains Python scripts that define testing workflows for the various builders. The equivalent testing workflows for premerge are defined in the monorepo (currently shell scripts in .ci, eventually GHA workflow definitions in .github), which leads me to the conclusion that premerge is more tightly coupled to the monorepo.

Putting premerge infra in Zorg seems a bit out of place to me. Currently, it solely contains postcommit configurations/infra. We already have a spot where all of the in-tree premerge scripts and whatnot live, and that's .ci/ in the monorepo.

Contributor

I've found it very useful to be able to make changes to pre-merge configuration and scripts, and test them right away as a regular PR to the monorepo. It seems that the workflow for Zorg changes is less straightforward. Since this PR is concerned with pre-merge stuff, I prefer to keep the changes in the monorepo.

Collaborator

> I've found it very useful to be able to make changes to pre-merge configuration and scripts, and test them right away as a regular PR to the monorepo.

I probably misunderstand what this PR is doing then, because it seemed to me that nothing here is something you would be able to test with a PR; instead it defines the underlying infra (kube, machines, runners) that needs to be set up offline before you can rely on it for testing.

> [...] which leads me to the conclusion that premerge is more tightly coupled to the monorepo.

Sorry, I don't get it: I read this as mixing up unrelated things, but I may be misunderstanding what you're doing here. It seems that the two things are 1) the config of the infra (machines and runners) and 2) "what is executed by the runner based on some event".

In the monorepo we have so far:

  1. the cmake config
  2. the premerge scripts that define how to invoke cmake & ninja & ... on premerge events.

In Zorg there is:

  1. buildbot configs for the machines
  2. terraform config for some clusters
  3. Docker images for the runners
  4. the post-merge scripts which define what cmake/ninja/... to invoke on a post-merge event.

So it is true that 2 and 4 are at odds (but for good reason, with properties such as what @Endilll mentioned above).
But I'm not following why this implies anything about the rest of the config, which is really more "infra" than anything. The question of coupling is really about this: how is the actual kube/docker config more coupled to the execution script (the 2./4. bullet points above, which basically just wrap cmake/ninja) in one case than in the other?

Contributor

I agree with the points made so far. It is true that most of what's proposed here is internals for infra management, so this could indeed live in llvm-zorg.

And since a Googler has to run `terraform apply` for any change anyway, having these files in this repo or the other doesn't bring any sync/atomicity benefit IMO.

resources used to run the premerge checks. Currently, only Google employees
with access to the GCP project where these checks are hosted are able to apply
changes. Pull requests from anyone are still welcome.

## Setup

- install terraform (https://developer.hashicorp.com/terraform/install?product_intent=terraform)
- get the GCP tokens: `gcloud auth application-default login`
- initialize terraform: `terraform init`

To apply any changes to the cluster (the full command sequence is sketched below):
- set up the cluster: `terraform apply`
- terraform will list the proposed changes.
- enter 'yes' when prompted.
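
For reference, a routine change usually boils down to the sequence below (a sketch; `terraform plan` is optional, but lets you review the diff before applying):

```
terraform init   # once per checkout
terraform plan   # review the proposed changes
terraform apply  # apply them, answering 'yes' when prompted
```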

## Setting the cluster up for the first time

```
terraform apply -target google_container_node_pool.llvm_premerge_linux_service
terraform apply -target google_container_node_pool.llvm_premerge_linux
terraform apply -target google_container_node_pool.llvm_premerge_windows
terraform apply
```

Setting the cluster up for the first time is more involved, as there are certain
resources for which Terraform is unable to handle explicit dependencies. This means
that we have to set up the GKE cluster before we set up any of the Kubernetes
resources, as otherwise the Terraform Kubernetes provider will error out.
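
For background, the ordering constraint typically comes from how the Kubernetes provider is wired up: it is configured from attributes of the GKE cluster resource, so nothing Kubernetes-side can even be planned until the cluster exists. A minimal sketch of that wiring, assuming a cluster resource named `google_container_cluster.llvm_premerge` (illustrative, not necessarily the name used here):

```
# Sketch only: the cluster resource name is an assumption.
data "google_client_config" "current" {}

provider "kubernetes" {
  # These attributes only exist once the GKE cluster has been created,
  # hence the targeted applies above.
  host                   = "https://${google_container_cluster.llvm_premerge.endpoint}"
  token                  = data.google_client_config.current.access_token
  cluster_ca_certificate = base64decode(
    google_container_cluster.llvm_premerge.master_auth[0].cluster_ca_certificate
  )
}
```
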
5 changes: 5 additions & 0 deletions .ci/infrastructure/backend.tf
@@ -0,0 +1,5 @@
terraform {
  backend "gcs" {
    bucket = "3772b2f502380a18-terraform-remote-backend"
  }
}
41 changes: 41 additions & 0 deletions .ci/infrastructure/grafana_values.yaml
@@ -0,0 +1,41 @@
metrics:
  enabled: true
  alloy:
    metricsTuning:
      useIntegrationAllowList: true
  cost:
    enabled: true
  kepler:
    enabled: true
  node-exporter:
    enabled: true
logs:
  enabled: true
  pod_logs:
    enabled: true
  cluster_events:
    enabled: true
traces:
  enabled: true
receivers:
  grpc:
    enabled: true
  http:
    enabled: true
  zipkin:
    enabled: true
  grafanaCloudMetrics:
    enabled: false
opencost:
  enabled: true
kube-state-metrics:
  enabled: true
prometheus-node-exporter:
  enabled: true
prometheus-operator-crds:
  enabled: true
kepler:
  enabled: true
alloy: {}
alloy-events: {}
alloy-logs: {}
26 changes: 26 additions & 0 deletions .ci/infrastructure/linux_container_pod_template.yaml
@@ -0,0 +1,26 @@
spec:
  tolerations:
    - key: "premerge-platform"
      operator: "Equal"
      value: "linux"
      effect: "NoSchedule"
  nodeSelector:
    premerge-platform: linux
  containers:
    - name: $job
      resources:
        # The container is always scheduled on the same pod as the runner.
        # Since we use the runner requests.cpu for scheduling/autoscaling,
        # the request here should be set to something small.
        #
        # The limit however should be the number of cores of the node. Any
        # limit lower than the number of cores could slow down the job.
        #
        # For memory, however, the requests/limits should be accurate.
        # It's not used for scheduling, but can be used by k8s for OOM kills.
        requests:
          cpu: "100m"
          memory: "50Gi"
        limits:
          cpu: 56
          memory: "100Gi"
74 changes: 74 additions & 0 deletions .ci/infrastructure/linux_runners_values.yaml
@@ -0,0 +1,74 @@
githubConfigUrl: "https://github.com/llvm"
githubConfigSecret: "github-token"

minRunners: 0
maxRunners: 4

containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "standard-rwo"
    resources:
      requests:
        storage: "100Gi"
  kubernetesModeServiceAccount:
    annotations:

template:
  spec:
    tolerations:
      - key: "premerge-platform"
        operator: "Equal"
        value: "linux"
        effect: "NoSchedule"
    nodeSelector:
      premerge-platform: linux
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        resources:
          # The container will be scheduled on the same node as this runner.
          # This means if we don't set the CPU request high enough here, 2
          # containers will be scheduled on the same pod, meaning 2 jobs.
          #
          # This number should be:
          # - greater than number_of_cores / 2:
          #   A value lower than that could allow the scheduler to put 2
          #   runners in the same pod, meaning 2 containers in the same pod,
          #   meaning 2 jobs sharing the resources.
          # - lower than number_of_cores:
          #   Each pod has some basic services running (metrics for example).
          #   Those already require some amount of CPU (~0.5). This means we
          #   don't exactly have N cores to allocate, but N - epsilon.
          #
          # Memory, however, shall be handled at the container level. The
          # runner itself doesn't need much, just enough not to get OOM
          # killed.
          requests:
            cpu: 50
            memory: "2Gi"
          limits:
            cpu: 56
            memory: "2Gi"
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOKS
            value: /home/runner/k8s/index.js
          - name: ACTIONS_RUNNER_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "true"
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: "/home/runner/pod-config/linux-container-pod-template.yaml"
        volumeMounts:
          - name: container-pod-config
            mountPath: /home/runner/pod-config
    securityContext:
      fsGroup: 123
    volumes:
      - name: container-pod-config
        configMap:
          name: linux-container-pod-template