Restructure Terraform infra with services monorepo, CI/CD, and FP support#312
harshavemula-ua wants to merge 45 commits into `main`.
Conversation
- Root module now deploys only shared orchestration (state machine, lambdas, IAM, SG)
- CFE_NOM schedules moved to `services/nrds-cfe-nom/` with its own state file
- Schedule service reads orchestration outputs via `terraform_remote_state`
- Enables independent deploy/rollback per model service
- Future services (routing-only, etc.) can be added as new service directories
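The remote-state read described above might be wired roughly like this (the bucket, key, region, and output names are illustrative assumptions, not taken from the PR):

```hcl
# Hypothetical sketch: a service reads shared orchestration outputs
# from the root module's remote state. All names are illustrative.
data "terraform_remote_state" "orchestration" {
  backend = "s3"
  config = {
    bucket = "ciroh-terraform-state"           # assumed state bucket
    key    = "orchestration/terraform.tfstate" # assumed state key
    region = "us-east-1"                       # assumed region
  }
}

module "schedules" {
  source            = "./modules/schedules" # assumed module path
  state_machine_arn = data.terraform_remote_state.orchestration.outputs.state_machine_arn
}
```

Note that a later commit in this PR moves each service to including the orchestration module directly instead of this remote-state read.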
Adds routing-only schedules as a standalone service matching the nrds-cfe-nom pattern: separate state, terraform_remote_state for orchestration outputs, templatefile() for execution JSONs.
Each service (nrds-cfe-nom, nrds-routing-only) now includes its own orchestration module directly instead of reading from terraform_remote_state. Removed backend "s3" block for local testing. Added orchestration vars to test.tfvars for both services.
- `terraform-pr-validate.yml`: PR validation with dynamic matrix (fmt, init, validate, plan with PR comments, tfsec + checkov security scans, plan artifact uploads)
- `terraform-deploy.yml`: Deploy on merge with sequential matrix (max-parallel: 1, fail-fast, concurrency with cancel-in-progress: false, lock-timeout: 5m)
- `terraform-health-check.yml`: Post-deploy verification (checks Step Functions state machines are ACTIVE, runs every 6h and after deploys)
- `terraform-drift-detection.yml`: Daily drift detection with `plan -detailed-exitcode`; auto-creates GitHub Issues on drift

All workflows use a matrix-driven SERVICE_REGISTRY pattern. Adding a new service requires only two lines per workflow (one path filter + one JSON entry).
- Add S3 backend to `nrds-routing-only/main.tf` using native S3 locking (`use_lockfile = true`, no DynamoDB needed)
- Bucket: `ciroh-nrds-terraform-state`, key: `nrds-routing-only/test/terraform.tfstate`
- Bump `required_version` to `>= 1.10` (native S3 locking requires 1.10+)
- Bump `TF_VERSION` to 1.10.0 across all CI workflows
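A sketch of the resulting backend block, using the bucket and key named above (the region is an assumption; it is not stated in the PR):

```hcl
# Requires Terraform >= 1.10: use_lockfile enables native S3 state
# locking, replacing the traditional DynamoDB lock table.
terraform {
  required_version = ">= 1.10"

  backend "s3" {
    bucket       = "ciroh-nrds-terraform-state"
    key          = "nrds-routing-only/test/terraform.tfstate"
    region       = "us-east-1" # assumption
    use_lockfile = true        # native S3 locking, no DynamoDB needed
  }
}
```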
- Create `envs/test.backend.hcl` with partial backend config for the routing-only service (`ciroh-terraform-state` bucket, native S3 locking)
- Update `main.tf` to use an empty `backend "s3" {}` block loaded via the `-backend-config` flag
- Add a `backend_config` field to `SERVICE_REGISTRY` in all three workflows (pr-validate, deploy, drift-detection)
- Conditional `terraform init` uses `-backend-config` when configured
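Assuming the layout described, the partial config file might look like this (the key name and region are assumptions; only the bucket and locking mode are stated above):

```hcl
# envs/test.backend.hcl: partial backend settings, supplied at init time via
#   terraform init -backend-config=envs/test.backend.hcl
# while main.tf keeps only an empty `backend "s3" {}` placeholder.
bucket       = "ciroh-terraform-state"
key          = "nrds-routing-only/test/terraform.tfstate" # assumed key
region       = "us-east-1"                                # assumed region
use_lockfile = true                                       # native S3 locking
```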
- Create `envs/test.backend.hcl` with state key `cfe-nom-test-datastream` in the `ciroh-terraform-state` bucket with native S3 locking
- Update `main.tf` to use an empty `backend "s3" {}` block (`>= 1.10`)
- Update `backend_config` for cfe-nom in all three workflows
Reflect the service-based directory layout, S3 backend with partial .hcl configs, updated terraform init/plan/apply/destroy commands, and CI/CD workflow summary table.
- Add `workflow_dispatch` trigger to the PR validate and deploy workflows for manual runs
- Add `-lock-timeout=5m` to the PR validation plan step to prevent failures when state is temporarily locked
Triggers only when the workflow file itself changes.
Replace AWS_ACCESS_KEY_ID/SECRET_ACCESS_KEY with role-to-assume using GitHub's OIDC provider. Eliminates long-lived credentials. Also quotes backend-config and var-file paths in all workflows.
Aligns the orchestration module version constraint with the service-level constraint, ensuring S3 native state locking support is available.
Use bash line continuation in execution templates for routing-only and cfe-nom services.
Replace hardcoded ciroh-community-ngen-datastream with a configurable s3_bucket variable threaded through all layers.
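A minimal sketch of that variable threading, assuming the schedules module accepts an `s3_bucket` input (the module path is an assumption; the default preserves the previously hardcoded value):

```hcl
# Sketch: replace the hardcoded bucket name with a variable threaded
# through each layer down to the execution templates.
variable "s3_bucket" {
  description = "S3 bucket for datastream inputs/outputs"
  type        = string
  default     = "ciroh-community-ngen-datastream"
}

# The root passes it down; the schedules module forwards it into
# templatefile() when rendering the execution JSONs.
module "schedules" {
  source    = "./modules/schedules" # assumed path
  s3_bucket = var.s3_bucket
}
```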
… gates, restrict to main branch
Terraform Plan:

```hcl
resource "aws_iam_policy" "scheduler_policy" {
  name        = var.scheduler_policy_name
  description = "Policy with permissions for statemachine execution"
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        "Effect" : "Allow",
        Action = [
          "states:StartExecution",
          "events:PutTargets",
          "events:PutRule",
          "events:PutPermission"
        ],
        "Resource" : ["*"]
      }
    ]
  })
}
```

Check failure (Code scanning / Checkov): Ensure no IAM policies documents allow "*" as a statement's resource for restrictable actions

Check failure (Code scanning / Checkov): Ensure IAM policies does not allow write access without constraints
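A hypothetical remediation for both findings would scope `Resource` to specific ARNs instead of `"*"` (the EventBridge ARN pattern and the `aws_caller_identity` data source here are assumptions, not part of the PR):

```hcl
# Sketch: split the statement so each action set gets the narrowest
# resource scope available. ARNs below are illustrative.
data "aws_caller_identity" "current" {}

resource "aws_iam_policy" "scheduler_policy" {
  name        = var.scheduler_policy_name
  description = "Policy with permissions for statemachine execution"
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Effect   = "Allow",
        Action   = ["states:StartExecution"],
        Resource = [var.state_machine_arn] # only the one state machine
      },
      {
        Effect = "Allow",
        Action = ["events:PutTargets", "events:PutRule", "events:PutPermission"],
        # Narrower than "*": rules in this account only (pattern assumed)
        Resource = ["arn:aws:events:*:${data.aws_caller_identity.current.account_id}:rule/*"]
      }
    ]
  })
}
```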
```hcl
resource "aws_scheduler_schedule" "datastream_schedule_short_range_routing_only" {
  for_each = local.short_range_routing_only_config

  name       = "short_range_fcst${each.value.init}_vpu${each.value.vpu}_schedule_routing_only_${var.environment_suffix}"
  group_name = var.schedule_group_name

  flexible_time_window {
    mode = "OFF"
  }

  schedule_expression          = "cron(0 ${local.short_range_times_routing_only[each.key]} * * ? *)"
  schedule_expression_timezone = var.schedule_timezone

  target {
    arn      = "arn:aws:scheduler:::aws-sdk:sfn:startExecution"
    role_arn = aws_iam_role.scheduler_role.arn
    input    = <<-EOT
      {
        "StateMachineArn": "${var.state_machine_arn}",
        "Name": "routing_only_short_range_vpu${each.value.vpu}_init${each.value.init}_<aws.scheduler.execution-id>",
        "Input": ${jsonencode(templatefile(local.routing_only_template_path, {
          vpu                = each.value.vpu
          init               = each.value.init
          run_type_l         = each.value.run_type_l
          run_type_h         = each.value.run_type_h
          nprocs             = each.value.nprocs
          ami_id             = local.routing_only_ami_id
          instance_type      = each.value.instance_type
          security_group_ids = local.routing_only_security_groups
          instance_profile   = local.routing_only_instance_profile
          volume_size        = each.value.volume_size
          environment_suffix = var.environment_suffix
          s3_bucket          = var.s3_bucket
        }))}
      }
    EOT
  }
}
```

Check failure (Code scanning / Checkov): Ensure EventBridge Scheduler Schedule uses Customer Managed Key (CMK)
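A hedged sketch of satisfying the CMK check: attach a customer managed key via `kms_key_arn` (the key resource and the simplified schedule shown are illustrative, not the PR's actual resources):

```hcl
# Illustrative only: a CMK plus the one attribute Checkov wants set.
resource "aws_kms_key" "scheduler" {
  description         = "CMK for EventBridge Scheduler payload encryption"
  enable_key_rotation = true
}

resource "aws_scheduler_schedule" "example" {
  name        = "example_schedule"            # hypothetical name
  group_name  = var.schedule_group_name
  kms_key_arn = aws_kms_key.scheduler.arn     # addresses the CMK finding

  flexible_time_window {
    mode = "OFF"
  }

  schedule_expression = "cron(0 12 * * ? *)"  # placeholder expression

  target {
    arn      = "arn:aws:scheduler:::aws-sdk:sfn:startExecution"
    role_arn = aws_iam_role.scheduler_role.arn
  }
}
```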
EC2 instances will use default VPC security group instead of a Terraform-managed one, avoiding destroy conflicts when instances are still running.
Add fp template and EventBridge schedules for short_range (24 init cycles), medium_range (4 init cycles), and analysis_assim_extend (1 init cycle). FP processes NWM forcings for all VPUs at once using docker containers.
infra/aws/terraform/services/nrds-cfe-nom/modules/schedules/schedules.tf (alerts fixed)
Move fp from separate schedule resources into the vpus list with conditionals for template, AMI, timing, and member_suffix. Single place to control what gets deployed - remove fp from list to disable.
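A hedged sketch of that single-list pattern (the local and variable names are assumptions):

```hcl
# Illustrative sketch: fp is just another entry in var.vpus, and
# per-entry conditionals pick the template, AMI, and member suffix.
# Removing "fp" from var.vpus disables the fp schedules entirely.
locals {
  schedule_config = {
    for v in var.vpus : v => {
      template_path = v == "fp" ? local.fp_template_path : local.vpu_template_path
      ami_id        = v == "fp" ? local.fp_ami_id : local.vpu_ami_id
      member_suffix = v == "fp" ? "_fp" : ""
    }
  }
}
```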
infra_deploy_val.yaml and research_datastream_terraform.yaml used hardcoded AWS keys and outdated patterns. They are replaced by the terraform-deploy, terraform-pr-validate, terraform-destroy, terraform-drift-detection, and terraform-health-check workflows with OIDC auth.
JordanLaserGit
left a comment
@harshavemula-ua Thanks for all this work!
In separating the datastreams into services, the orchestration and scheduler infrastructure is replicated for each datastream. While this shouldn't cost anything extra, it isn't necessary from my standpoint and may be a pain point. Is there a benefit to doing this? Ideally the orchestration infrastructure isn't dependent on the particular datastream (i.e., naming the infra and backend based on the datastream name).
I think of the nrds as a service that we are adding on to by adding "datastreams," in the rough form of the combination of schedules.tf and the execution templates/config. I don't see why not just use a single Terraform state file to reference when applying changes. That should take care of both deploy and destroy, and remove the need to edit and matrix the workflows for each added datastream.
Said another way, I think all of the variables defined in services/../envs/prod.tfvars should be the same from datastream to datastream, with the exception of environment_suffix if we want to implement the idea of a dev datastream vs a prod datastream. The AMI could be datastream-specific, but ideally isn't. I'd think we want to keep the AMIs up to date and uniform, so we aren't using different versions of NGIAB for different datastreams.
I don't think this is a huge point of contention, and this all looks great and is potentially ready to merge in its current form, though perhaps with the edit of building a single set of production infrastructure from a single state file to make maintenance easier.
```yaml
jobs:
  # ---------------------------------------------------------------------------
  # Detect which services changed and build a dynamic matrix.
```
I think Terraform does this automatically with a terraform apply. I don't believe we need to also check whether anything changed beforehand.
To clarify - "services" in this context refers to the ngen-datastream pipeline services (e.g., nrds-cfe-nom, nrds-routing-only), not AWS services. Each service directory under services/ encapsulates a complete pipeline configuration — its own orchestration module, schedules, execution templates, S3 backend, and environment configs.
The schedule name changes (e.g., _cfe_nom → _cfe_nom_prod) will be handled by terraform apply — it will destroy the old schedules and create the new ones as part of the state migration.
```shell
# ORDER MATTERS: services deploy sequentially (max-parallel: 1).
# Put lower-risk / smaller services first as a canary.
# =====================================================================
SERVICE_REGISTRY='[
```
We could just manage a single production backend. This way we don't have to update these workflows when we add additional datastreams.
Per-service backends are better for production, at least in my view. The isolation is worth it: a bad terraform apply on one service won't blow up the others, and state locking won't block unrelated services either.
Blast-radius isolation is the main benefit here; a bad apply on cfe-nom won't affect routing-only.
I'm fine either way, though.
```diff
@@ -0,0 +1,72 @@
+name: Terraform Destroy
```
Do we need a workflow for destroying the resources? It seems potentially dangerous, and the terraform apply in the terraform-deploy workflow should destroy resources if we remove them from the deployment.
Makes sense — I originally included it thinking about end-to-end infrastructure management from workflows, but you're right that terraform apply already handles resource removal safely through the normal PR review process. I'll remove it in a follow-up.
Summary
Restructures Terraform infrastructure into a services-based monorepo with full CI/CD automation, replacing the previous flat layout and manual deployment process.
Infrastructure Changes
- `nrds-cfe-nom` and `nrds-routing-only` services under `infra/aws/terraform/services/`
- Each service includes its own `modules/orchestration/`
- `templatefile()` for dynamic schedule generation — no pre-generated JSON files
- Conditional template selection (`fp_template` vs `VPU_template`) based on the vpus list — remove `fp` from the list to disable fp schedules entirely

CI/CD Workflows (5 new, OIDC auth)
- Drift detection auto-creates GitHub Issues with the `terraform-drift` label
- Removes legacy workflows (`infra_deploy_val.yaml`, `research_datastream_terraform.yaml`) that used hardcoded AWS keys

Bug Fixes
- Fixed `analysis_assim_extend` fcst value to `tm27_tm00` to match production
- Removed `test/` from the cfe_nom S3 prefix to match the production output path

Adding a New Service
- Create `infra/aws/terraform/services/<service-name>/` with `main.tf`, `variables.tf`, `envs/`
- Register the service in each workflow's `SERVICE_REGISTRY` (plus a path filter)

Test Plan
- `terraform plan` shows expected resources for both services
- Verified deployed schedules (`aws scheduler get-schedule`)
- Terraform Drift Detection workflow successful execution: https://github.com/CIROH-UA/ngen-datastream/actions/runs/22078756096
- Terraform Deploy workflow successful execution: https://github.com/CIROH-UA/ngen-datastream/actions/runs/22190595089
- Terraform Plan successful execution: https://github.com/CIROH-UA/ngen-datastream/actions/runs/22362787346?pr=312