diff --git a/docs/add-disks-to-vms.md b/docs/add-disks-to-vms.md
new file mode 100644
index 0000000..52a71fe
--- /dev/null
+++ b/docs/add-disks-to-vms.md
@@ -0,0 +1,226 @@
Maintenance objective:
Add a 100GB disk to each of the worker nodes.

 Departments involved: RPC support, RPC Managed Kubernetes SMEs
 Owning department: RPC support, RPC Managed Kubernetes SMEs
 Amount of time estimated for maintenance: 4 hours

--------------------------------------------------------------------------------
 Maintenance Steps:
--------------------------------------------------------------------------------

## Get things ready

1 - Set up the environment to work with the cluster

  * Credentials are stored in passwordsafe

```bash
cd /etc/openCenter/infrastructure/clusters/prosys_test
source venv/bin/activate
export BIN=${PWD}/.bin
export PATH=${BIN}:${PATH}
export ANSIBLE_INVENTORY=${PWD}/inventory/inventory.yaml
export KUBECONFIG=${PWD}/kubeconfig.yaml
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
export TF_VAR_os_application_credential_id=''
export TF_VAR_os_application_credential_secret=""
```

2 - Verify the cluster state.

  * If the plan shows changes, review them first and work through any potential blockers before starting the maintenance.

```bash
# terraform plan
...
module.kubespray-cluster.null_resource.run_kubespray[0]: Refreshing state... [id=2694448614732380735]
module.kubespray-cluster.null_resource.copy_and_update_kubeconfig: Refreshing state... [id=812292398106547937]

No changes. Your infrastructure matches the configuration.

# kubectl get nodes
NAME              STATUS   ROLES           AGE   VERSION
prosys-prod-cp0   Ready    control-plane   31d   v1.32.8
prosys-prod-cp1   Ready    control-plane   31d   v1.32.8
prosys-prod-cp2   Ready    control-plane   31d   v1.32.8
prosys-prod-wn0   Ready    <none>          31d   v1.32.8
prosys-prod-wn1   Ready    <none>          31d   v1.32.8
prosys-prod-wn2   Ready    <none>          31d   v1.32.8
prosys-prod-wn3   Ready    <none>          31d   v1.32.8
prosys-prod-wn4   Ready    <none>          30d   v1.32.8
```

3 - Update main.tf with the desired additional block devices

* One section is added to the locals block and another to the openstack-nova module.

**locals section**
```hcl
  additional_block_devices_worker = [
    {
      source_type           = "blank"
      volume_size           = 20
      volume_type           = "Performance"
      boot_index            = -1
      destination_type      = "volume"
      delete_on_termination = true
      mountpoint            = "/var/lib/longhorn"
      filesystem            = "ext4"
      label                 = "longhorn-vol"
    },
  ]
```

**module "openstack-nova"**

```hcl
module "openstack-nova" {
  source                          = "github.com/rackerlabs/openCenter-gitops-base.git//iac/cloud/openstack/openstack-nova?ref=multi-disk1"
  availability_zone               = local.availability_zone
  additional_block_devices_worker = local.additional_block_devices_worker # <---- Add this line
  # ... remaining module arguments unchanged
}
```

4 - Verify the plan

* Running `terraform plan` lets us review the configuration updates before making them. Applying the changes as-is would replace all of the worker nodes at the same time, which would break the cluster. Instead, we will use targeted applies to replace one node at a time and add each node back to the cluster before moving on to the next.
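For example, a targeted plan against a single worker previews exactly what the per-node replacement will do. This uses the same resource address as the targeted apply in the next section:

```bash
# Preview the replacement of a single worker node before the maintenance window
terraform plan -target 'module.openstack-nova.module.node_worker.openstack_compute_instance_v2.node[4]'
```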
-----------------------

## Apply the changes

```bash
# kubectl get nodes
NAME              STATUS   ROLES           AGE   VERSION
prosys-prod-cp0   Ready    control-plane   31d   v1.32.8
prosys-prod-cp1   Ready    control-plane   31d   v1.32.8
prosys-prod-cp2   Ready    control-plane   31d   v1.32.8
prosys-prod-wn0   Ready    <none>          31d   v1.32.8
prosys-prod-wn1   Ready    <none>          31d   v1.32.8
prosys-prod-wn2   Ready    <none>          31d   v1.32.8
prosys-prod-wn3   Ready    <none>          31d   v1.32.8
prosys-prod-wn4   Ready    <none>          30d   v1.32.8
```

We will target one worker node at a time, starting with the last one.

1 - Remove the node from the cluster

```bash
cd kubespray
ansible-playbook playbooks/facts.yml -b
ansible-playbook -b playbooks/remove_node.yml -e node=prosys-prod-wn4
```

2 - Apply the change, targeting only the single node

```bash
terraform apply -target 'module.openstack-nova.module.node_worker.openstack_compute_instance_v2.node[4]'

Apply complete! Resources: 3 added, 0 changed, 1 destroyed.
```

3 - Get the node ready

Apply hardening:
```bash
# ansible-playbook ./inventory/os_hardening_playbook.yml -b --become-user=root --limit=prosys-prod-wn4
```

4 - Find the resulting disk configuration

```bash
# ansible prosys-prod-wn4 -m shell -a 'fdisk -l' -b

prosys-prod-wn4 | CHANGED | rc=0 >>
Disk /dev/vda: 100 GiB, 107372085248 bytes, 209711104 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: A49C7288-9CD0-4F90-8F06-1C713F260C8C

Device        Start       End   Sectors  Size Type
/dev/vda1   2099200 209711070 207611871   99G Linux filesystem
/dev/vda14     2048     10239      8192    4M BIOS boot
/dev/vda15    10240    227327    217088  106M EFI System
/dev/vda16   227328   2097152   1869825  913M Linux extended boot

Partition table entries are not in disk order.


Disk /dev/vdb: 64 GiB, 68719476736 bytes, 134217728 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/vdc: 4 GiB, 4294967296 bytes, 8388608 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/vdd: 20 GiB, 21472739328 bytes, 41938944 sectors <----- NEW DRIVE
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
```
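If several unformatted disks have similar sizes and the `fdisk` output alone is ambiguous, the virtio serial can be matched against the OpenStack volume ID to confirm the device name (a hedged check; on most virtio-backed instances the serial is the first 20 characters of the Cinder volume UUID):

```bash
# Show device serials; the new volume's serial should match the start of its
# OpenStack volume UUID (compare against `openstack volume list`).
ansible prosys-prod-wn4 -m shell -a 'lsblk -o NAME,SIZE,SERIAL' -b
```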
5 - Update the disk configuration in the group_vars with the same information as above

```yaml
# cat inventory/group_vars/oc_worker_nodes.yaml
disk_config:
  - device: "/dev/vdd"   # <----- set the device name accordingly
    label: "longhorn-vol"
    mountpoint: "/var/lib/longhorn"
    filesystem: "ext4"
    boot_index: -1
    volume_size: 20
```

6 - Run the playbook to configure the disks

```bash
ansible-playbook -b configure-disks.yaml --limit=prosys-prod-wn4
```

7 - Add the node to the cluster

```bash
cd kubespray
ansible-playbook playbooks/facts.yml -b
ansible-playbook -b playbooks/scale.yml -e "@../inventory/k8s_hardening.yml" --limit=prosys-prod-wn4
```

## Verify things are working

1 - Verify the node is in Ready state

```bash
# kubectl get nodes
NAME              STATUS   ROLES           AGE   VERSION
prosys-prod-cp0   Ready    control-plane   31d   v1.32.8
prosys-prod-cp1   Ready    control-plane   31d   v1.32.8
prosys-prod-cp2   Ready    control-plane   31d   v1.32.8
prosys-prod-wn0   Ready    <none>          31d   v1.32.8
prosys-prod-wn1   Ready    <none>          31d   v1.32.8
prosys-prod-wn2   Ready    <none>          31d   v1.32.8
prosys-prod-wn3   Ready    <none>          31d   v1.32.8
prosys-prod-wn4   Ready    <none>          20m   v1.32.8 <---- New Node Ready
```
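The same remove / targeted apply / harden / configure / rejoin sequence is then repeated for each remaining worker. A minimal sketch of that loop, assuming the per-node commands above are run unchanged and that the new volume enumerates with the same device name on every node (re-check the `fdisk -l` output per node before trusting that):

```bash
# Cycle the remaining workers one at a time; wait for each node to report
# Ready before starting on the next so the cluster never loses more than
# one worker of capacity.
for i in 3 2 1 0; do
  node="prosys-prod-wn${i}"
  (cd kubespray &&
    ansible-playbook playbooks/facts.yml -b &&
    ansible-playbook -b playbooks/remove_node.yml -e node="${node}")
  terraform apply -target "module.openstack-nova.module.node_worker.openstack_compute_instance_v2.node[${i}]"
  ansible-playbook ./inventory/os_hardening_playbook.yml -b --become-user=root --limit="${node}"
  ansible-playbook -b configure-disks.yaml --limit="${node}"
  (cd kubespray &&
    ansible-playbook playbooks/facts.yml -b &&
    ansible-playbook -b playbooks/scale.yml -e "@../inventory/k8s_hardening.yml" --limit="${node}")
  kubectl wait --for=condition=Ready "node/${node}" --timeout=10m
done
```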
"" : block_device.value.volume_type + boot_index = block_device.value.boot_index + destination_type = block_device.value.destination_type + delete_on_termination = block_device.value.delete_on_termination + } + } network { port = openstack_networking_port_v2.node[count.index].id diff --git a/iac/cloud/openstack/lib/openstack-compute/variables.tf b/iac/cloud/openstack/lib/openstack-compute/variables.tf index 28f9144..d69ac60 100644 --- a/iac/cloud/openstack/lib/openstack-compute/variables.tf +++ b/iac/cloud/openstack/lib/openstack-compute/variables.tf @@ -1,3 +1,19 @@ +variable "additional_block_devices" { + description = "List of additional block devices to attach to instances" + type = list(object({ + source_type = string # "blank", "image", "volume", "snapshot" + volume_size = number + volume_type = optional(string, "") + boot_index = number # Must be > 0 for non-boot devices + destination_type = optional(string, "volume") + delete_on_termination = optional(bool, true) + mountpoint = string + filesystem = optional(string, "ext4") + label = string + })) + default = [] +} + variable "allowed_addresses" { type = list(string) default = [] diff --git a/iac/cloud/openstack/openstack-nova/main.tf b/iac/cloud/openstack/openstack-nova/main.tf index 1016045..5828be8 100644 --- a/iac/cloud/openstack/openstack-nova/main.tf +++ b/iac/cloud/openstack/openstack-nova/main.tf @@ -64,6 +64,7 @@ module "node_master" { source = "../lib/openstack-compute" depends_on = [module.bastion, module.ssh-keypair, module.secgroup] + additional_block_devices = var.additional_block_devices_master availability_zone = var.availability_zone allowed_addresses = [var.vrrp_ip ,var.subnet_nodes, var.subnet_pods, var.subnet_services] flavor_name = var.size_master.flavor @@ -92,6 +93,7 @@ module "node_worker" { source = "../lib/openstack-compute" depends_on = [module.bastion, module.ssh-keypair, module.secgroup] + additional_block_devices = var.additional_block_devices_worker availability_zone = var.availability_zone allowed_addresses = [var.subnet_nodes, var.subnet_pods, var.subnet_services] flavor_name = var.size_worker.flavor diff --git a/iac/cloud/openstack/openstack-nova/variables.tf b/iac/cloud/openstack/openstack-nova/variables.tf index e443bc6..12739d9 100644 --- a/iac/cloud/openstack/openstack-nova/variables.tf +++ b/iac/cloud/openstack/openstack-nova/variables.tf @@ -1,3 +1,35 @@ +variable "additional_block_devices_worker" { + description = "List of additional block devices to attach to worker instances" + type = list(object({ + source_type = string # "blank", "image", "volume", "snapshot" + volume_size = number + volume_type = optional(string, "") + boot_index = number # Must be > 0 for non-boot devices + destination_type = optional(string, "volume") + delete_on_termination = optional(bool, true) + mountpoint = string + filesystem = optional(string, "ext4") + label = string + })) + default = [] +} + +variable "additional_block_devices_master" { + description = "List of additional block devices to attach to master instances" + type = list(object({ + source_type = string # "blank", "image", "volume", "snapshot" + volume_size = number + volume_type = optional(string, "") + boot_index = number # Must be > 0 for non-boot devices + destination_type = optional(string, "volume") + delete_on_termination = optional(bool, true) + mountpoint = string + filesystem = optional(string, "ext4") + label = string + })) + default = [] +} + variable "additional_ports_master" { description = "List of additional ports to create security 
diff --git a/playbooks/configure-disks.yaml b/playbooks/configure-disks.yaml
new file mode 100644
index 0000000..27bd909
--- /dev/null
+++ b/playbooks/configure-disks.yaml
@@ -0,0 +1,146 @@
---
# configure-disks.yaml
# Production-grade disk configuration with explicit device paths.
# Uses group_vars to define the disk_config variable.
# Example group_vars/oc_worker_nodes.yaml:
# disk_config:
#   - device: "/dev/vdd"
#     label: "longhorn-vol"
#     mountpoint: "/var/lib/longhorn"
#     filesystem: "ext4"
#     boot_index: -1
#     volume_size: 20


- name: Configure and mount block devices
  hosts: all
  become: true
  gather_facts: true

  tasks:
    - name: Skip hosts without disk configuration
      ansible.builtin.meta: end_host
      when: disk_config is not defined or disk_config | length == 0

    - name: Verify all devices are specified
      ansible.builtin.assert:
        that:
          - item.device is defined
          - item.device is not none
          - item.device != ""
        fail_msg: "Device path missing for {{ item.label }}. Check group_vars configuration."
      loop: "{{ disk_config }}"
      loop_control:
        label: "{{ item.label }}"

    - name: Install filesystem tools
      ansible.builtin.package:
        name:
          - xfsprogs
          - e2fsprogs
        state: present

    - name: Display configuration
      ansible.builtin.debug:
        msg: "{{ item.device }} -> {{ item.mountpoint }} ({{ item.label }}, {{ item.filesystem }})"
      loop: "{{ disk_config }}"

    - name: Verify devices exist
      ansible.builtin.stat:
        path: "{{ item.device }}"
      register: device_checks
      failed_when: not device_checks.stat.exists or not device_checks.stat.isblk
      loop: "{{ disk_config }}"
      loop_control:
        label: "{{ item.device }}"

    - name: Check existing labels
      ansible.builtin.command: blkid -s LABEL -o value {{ item.device }}
      register: existing_labels
      changed_when: false
      failed_when: false
      loop: "{{ disk_config }}"
      loop_control:
        label: "{{ item.device }}"

    - name: Format filesystems
      community.general.filesystem:
        fstype: "{{ item.item.filesystem }}"
        dev: "{{ item.item.device }}"
        opts: "-L {{ item.item.label }}"
        force: false
      loop: "{{ existing_labels.results }}"
      loop_control:
        label: "{{ item.item.device }} -> {{ item.item.label }}"
      when: item.rc != 0 or item.stdout != item.item.label

    - name: Get UUIDs
      ansible.builtin.command: blkid -s UUID -o value {{ item.device }}
      register: uuids
      changed_when: false
      retries: 3
      delay: 1
      until: uuids.rc == 0
      loop: "{{ disk_config }}"
      loop_control:
        label: "{{ item.device }}"

    - name: Create mount points
      ansible.builtin.file:
        path: "{{ item.item.mountpoint }}"
        state: directory
        mode: '0755'
      loop: "{{ uuids.results }}"
      loop_control:
        label: "{{ item.item.mountpoint }}"

    - name: Mount filesystems
      ansible.posix.mount:
        path: "{{ item.item.mountpoint }}"
        src: "UUID={{ item.stdout }}"
        fstype: "{{ item.item.filesystem }}"
        opts: defaults,nofail
        state: mounted
        dump: '0'
        passno: '2'
      loop: "{{ uuids.results }}"
      loop_control:
        label: "{{ item.item.mountpoint }}"

    - name: Verify fstab entries
      ansible.builtin.shell: grep "{{ item.mountpoint }}" /etc/fstab || echo "MISSING"
      register: fstab_check
      changed_when: false
      loop: "{{ disk_config }}"
      loop_control:
        label: "{{ item.mountpoint }}"

    - name: Display fstab status
      ansible.builtin.debug:
        msg: "{{ item.item.mountpoint }}: {{ 'Present' if 'UUID' in item.stdout else 'MISSING - Something went wrong!' }}"
}}" + loop: "{{ fstab_check.results }}" + loop_control: + label: "{{ item.item.mountpoint }}" + + - name: Verify mounts + ansible.builtin.command: findmnt {{ item.mountpoint }} + changed_when: false + loop: "{{ disk_config }}" + loop_control: + label: "{{ item.mountpoint }}" + + - name: Show results + ansible.builtin.shell: | + lsblk -f {{ disk_config | map(attribute='device') | join(' ') }} + echo "" + df -h {{ disk_config | map(attribute='mountpoint') | join(' ') }} + register: result + changed_when: false + + - name: Display + ansible.builtin.debug: + msg: "{{ result.stdout_lines }}" + + - name: Summary + ansible.builtin.debug: + msg: "✓ Successfully configured {{ disk_config | length }} disk(s) on {{ inventory_hostname }}"