
Restructure Terraform infra with services monorepo, CI/CD, and FP support#312

Open
harshavemula-ua wants to merge 45 commits into main from templatize_cfe_nom_routing_only

Conversation

Collaborator

@harshavemula-ua harshavemula-ua commented Feb 19, 2026

Summary

Restructures Terraform infrastructure into a services-based monorepo with full CI/CD automation, replacing the previous flat layout and manual deployment process.

Infrastructure Changes

  • Services-based monorepo: Separate nrds-cfe-nom and nrds-routing-only services under infra/aws/terraform/services/
  • Shared orchestration module: Step Functions state machine, IAM roles, and EC2 policies in modules/orchestration/
  • S3 backend with native state locking for both services (no DynamoDB needed)
  • Templatized execution JSONs: templatefile() for dynamic schedule generation — no pre-generated JSON files
  • Forcing processor (fp) schedules: Conditional template routing (fp_template vs VPU_template) based on vpus list — remove fp from the list to disable fp schedules entirely
  • Secrets Manager access in EC2 policy for Docker Hub login credentials
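
A rough sketch of the conditional template routing described above (the local names, file paths, and variable set here are illustrative assumptions, not the exact PR code):

```hcl
# Pick the fp template for the "fp" entry and the VPU template otherwise,
# then render the execution JSON with templatefile().
locals {
  execution_inputs = {
    for key, cfg in local.short_range_config : # hypothetical config map
    key => templatefile(
      cfg.vpu == "fp"
        ? "${path.module}/templates/fp_template.json.tpl"
        : "${path.module}/templates/VPU_template.json.tpl",
      {
        vpu  = cfg.vpu
        init = cfg.init
      }
    )
  }
}
```

Because the config map is derived from the vpus list, removing "fp" from the list generates no entry for it, so no fp schedules are planned.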

CI/CD Workflows (5 new, OIDC auth)

  • terraform-deploy: Auto-deploys on push to main with path-based change detection; the plan step requires manual approval via a GitHub Environment gate before apply
  • terraform-destroy: Manual-only with double confirmation (type service name twice)
  • terraform-pr-validate: Format check, validate, plan with PR comment, tfsec + checkov security scans
  • terraform-drift-detection: Daily drift check, auto-creates GitHub issue with terraform-drift label
  • terraform-health-check: Verifies Step Functions state machines are ACTIVE after deploy + every 6 hours
  • Removed 2 legacy workflows (infra_deploy_val.yaml, research_datastream_terraform.yaml) that used hardcoded AWS keys

Bug Fixes

  • Fix execution templates to use single-line commands so the JSON escapes correctly for EventBridge Scheduler
  • Fix analysis_assim_extend fcst value to tm27_tm00 to match production
  • Remove test/ from cfe_nom S3 prefix to match production output path
  • Remove security group from Terraform-managed resources

Adding a New Service

  1. Create infra/aws/terraform/services/<service-name>/ with main.tf, variables.tf, envs/
  2. Add filter + registry entry in all 5 workflows (search SERVICE_REGISTRY)
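
Step 1 might produce a main.tf along these lines (a sketch only; the exact module inputs depend on the interface of modules/orchestration/):

```hcl
# infra/aws/terraform/services/<service-name>/main.tf (skeleton)
terraform {
  required_version = ">= 1.10" # native S3 state locking needs 1.10+
  backend "s3" {}              # partial config, filled via -backend-config
}

provider "aws" {
  region = var.aws_region # hypothetical variable
}

module "orchestration" {
  source             = "../../modules/orchestration"
  environment_suffix = var.environment_suffix
}
```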

Test Plan

  • terraform plan shows expected resources for both services
  • PR validation workflow posts plan comment on PR
  • Drift detection runs without errors
  • Health check confirms state machines are ACTIVE
  • Verify fp schedules match production (aws scheduler get-schedule)

Terraform Drift Detection workflow successful execution: https://github.com/CIROH-UA/ngen-datastream/actions/runs/22078756096
Terraform Deploy workflow successful execution: https://github.com/CIROH-UA/ngen-datastream/actions/runs/22190595089
Terraform Plan successful execution: https://github.com/CIROH-UA/ngen-datastream/actions/runs/22362787346?pr=312

[Screenshot attached: 2026-02-24, 1:53 PM]

- Root module now deploys only shared orchestration (state machine, lambdas, IAM, SG)
- CFE_NOM schedules moved to services/nrds-cfe-nom/ with its own state file
- Schedule service reads orchestration outputs via terraform_remote_state
- Enables independent deploy/rollback per model service
- Future services (routing-only, etc.) can be added as new service directories
Adds routing-only schedules as a standalone service matching the
nrds-cfe-nom pattern: separate state, terraform_remote_state for
orchestration outputs, templatefile() for execution JSONs.
Each service (nrds-cfe-nom, nrds-routing-only) now includes its own
orchestration module directly instead of reading from terraform_remote_state.
Removed backend "s3" block for local testing. Added orchestration vars
to test.tfvars for both services.
- terraform-pr-validate.yml: PR validation with dynamic matrix (fmt, init,
  validate, plan with PR comments, tfsec + checkov security scans, plan
  artifact uploads)
- terraform-deploy.yml: Deploy on merge with sequential matrix (max-parallel:1,
  fail-fast, concurrency with cancel-in-progress:false, lock-timeout:5m)
- terraform-health-check.yml: Post-deploy verification (checks Step Functions
  state machines are ACTIVE, runs every 6h + after deploys)
- terraform-drift-detection.yml: Daily drift detection with plan
  -detailed-exitcode, auto-creates GitHub Issues on drift

All workflows use a matrix-driven SERVICE_REGISTRY pattern. Adding a new
service requires only 2 lines per workflow (1 path filter + 1 JSON entry).
- Add S3 backend to nrds-routing-only/main.tf using native S3 locking
  (use_lockfile=true, no DynamoDB needed)
- Bucket: ciroh-nrds-terraform-state, key: nrds-routing-only/test/terraform.tfstate
- Bump required_version to >= 1.10 (native S3 locking requires 1.10+)
- Bump TF_VERSION to 1.10.0 across all CI workflows
- Create envs/test.backend.hcl with partial backend config for
  routing-only service (ciroh-terraform-state bucket, native S3 locking)
- Update main.tf to use empty backend "s3" {} block loaded via
  -backend-config flag
- Add backend_config field to SERVICE_REGISTRY in all 3 workflows
  (pr-validate, deploy, drift-detection)
- Conditional terraform init uses -backend-config when configured
- Create envs/test.backend.hcl with state key cfe-nom-test-datastream
  in ciroh-terraform-state bucket with native S3 locking
- Update main.tf to use empty backend "s3" {} block (>= 1.10)
- Update backend_config for cfe-nom in all 3 workflows
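
Taken together, these backend commits amount to a partial backend config file plus an empty backend block, roughly as follows (bucket, key, and use_lockfile come from the commit messages; the region value is an assumption):

```hcl
# envs/test.backend.hcl -- partial backend config for the cfe-nom service
bucket       = "ciroh-terraform-state"
key          = "cfe-nom-test-datastream"
region       = "us-east-1" # assumed; not stated in the commits
use_lockfile = true        # native S3 locking, requires Terraform >= 1.10
```

This is loaded at init time with `terraform init -backend-config=envs/test.backend.hcl` against the empty `backend "s3" {}` block in main.tf.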
Reflect the service-based directory layout, S3 backend with
partial .hcl configs, updated terraform init/plan/apply/destroy
commands, and CI/CD workflow summary table.
- Add workflow_dispatch trigger to PR validate and deploy workflows
  for manual runs
- Add -lock-timeout=5m to PR validation plan step to prevent failures
  when state is temporarily locked
Triggers only when the workflow file itself changes.
Replace AWS_ACCESS_KEY_ID/SECRET_ACCESS_KEY with role-to-assume
using GitHub's OIDC provider. Eliminates long-lived credentials.
Also quotes backend-config and var-file paths in all workflows.
Aligns the orchestration module version constraint with the service-level
constraint, ensuring S3 native state locking support is available.
Use bash line continuation in execution templates for
routing-only and cfe-nom services.
Replace hardcoded ciroh-community-ngen-datastream with a
configurable s3_bucket variable threaded through all layers.
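
The bucket change can be sketched as follows (the module path and wiring are illustrative; only the variable name and the old hardcoded value come from the commit message):

```hcl
variable "s3_bucket" {
  description = "S3 bucket for datastream output"
  type        = string
  default     = "ciroh-community-ngen-datastream" # previous hardcoded value
}

# Threaded through each layer down to the execution templates.
module "schedules" {
  source    = "./modules/schedules" # hypothetical path
  s3_bucket = var.s3_bucket
}
```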

github-actions bot commented Feb 19, 2026

Terraform Plan: nrds-routing-only

Plan: 46 to add, 0 to change, 0 to destroy.

Show Plan Details
        }
    }

  # module.schedules.aws_scheduler_schedule.datastream_schedule_short_range_routing_only["16_03W"] will be created
  + resource "aws_scheduler_schedule" "datastream_schedule_short_range_routing_only" {
      + arn                          = (known after apply)
      + group_name                   = "default"
      + id                           = (known after apply)
      + name                         = "short_range_fcst16_vpu03W_schedule_routing_only_prod"
      + name_prefix                  = (known after apply)
      + schedule_expression          = "cron(0 17 * * ? *)"
      + schedule_expression_timezone = "America/New_York"
      + state                        = "ENABLED"

      + flexible_time_window {
          + mode = "OFF"
        }

      + target {
          + arn      = "arn:aws:scheduler:::aws-sdk:sfn:startExecution"
          + input    = (known after apply)
          + role_arn = (known after apply)
        }
    }

  # module.schedules.aws_scheduler_schedule.datastream_schedule_short_range_routing_only["17_03W"] will be created
  + resource "aws_scheduler_schedule" "datastream_schedule_short_range_routing_only" {
      + arn                          = (known after apply)
      + group_name                   = "default"
      + id                           = (known after apply)
      + name                         = "short_range_fcst17_vpu03W_schedule_routing_only_prod"
      + name_prefix                  = (known after apply)
      + schedule_expression          = "cron(0 18 * * ? *)"
      + schedule_expression_timezone = "America/New_York"
      + state                        = "ENABLED"

      + flexible_time_window {
          + mode = "OFF"
        }

      + target {
          + arn      = "arn:aws:scheduler:::aws-sdk:sfn:startExecution"
          + input    = (known after apply)
          + role_arn = (known after apply)
        }
    }

  # module.schedules.aws_scheduler_schedule.datastream_schedule_short_range_routing_only["18_03W"] will be created
  + resource "aws_scheduler_schedule" "datastream_schedule_short_range_routing_only" {
      + arn                          = (known after apply)
      + group_name                   = "default"
      + id                           = (known after apply)
      + name                         = "short_range_fcst18_vpu03W_schedule_routing_only_prod"
      + name_prefix                  = (known after apply)
      + schedule_expression          = "cron(0 19 * * ? *)"
      + schedule_expression_timezone = "America/New_York"
      + state                        = "ENABLED"

      + flexible_time_window {
          + mode = "OFF"
        }

      + target {
          + arn      = "arn:aws:scheduler:::aws-sdk:sfn:startExecution"
          + input    = (known after apply)
          + role_arn = (known after apply)
        }
    }

  # module.schedules.aws_scheduler_schedule.datastream_schedule_short_range_routing_only["19_03W"] will be created
  + resource "aws_scheduler_schedule" "datastream_schedule_short_range_routing_only" {
      + arn                          = (known after apply)
      + group_name                   = "default"
      + id                           = (known after apply)
      + name                         = "short_range_fcst19_vpu03W_schedule_routing_only_prod"
      + name_prefix                  = (known after apply)
      + schedule_expression          = "cron(0 20 * * ? *)"
      + schedule_expression_timezone = "America/New_York"
      + state                        = "ENABLED"

      + flexible_time_window {
          + mode = "OFF"
        }

      + target {
          + arn      = "arn:aws:scheduler:::aws-sdk:sfn:startExecution"
          + input    = (known after apply)
          + role_arn = (known after apply)
        }
    }

  # module.schedules.aws_scheduler_schedule.datastream_schedule_short_range_routing_only["20_03W"] will be created
  + resource "aws_scheduler_schedule" "datastream_schedule_short_range_routing_only" {
      + arn                          = (known after apply)
      + group_name                   = "default"
      + id                           = (known after apply)
      + name                         = "short_range_fcst20_vpu03W_schedule_routing_only_prod"
      + name_prefix                  = (known after apply)
      + schedule_expression          = "cron(0 21 * * ? *)"
      + schedule_expression_timezone = "America/New_York"
      + state                        = "ENABLED"

      + flexible_time_window {
          + mode = "OFF"
        }

      + target {
          + arn      = "arn:aws:scheduler:::aws-sdk:sfn:startExecution"
          + input    = (known after apply)
          + role_arn = (known after apply)
        }
    }

  # module.schedules.aws_scheduler_schedule.datastream_schedule_short_range_routing_only["21_03W"] will be created
  + resource "aws_scheduler_schedule" "datastream_schedule_short_range_routing_only" {
      + arn                          = (known after apply)
      + group_name                   = "default"
      + id                           = (known after apply)
      + name                         = "short_range_fcst21_vpu03W_schedule_routing_only_prod"
      + name_prefix                  = (known after apply)
      + schedule_expression          = "cron(0 22 * * ? *)"
      + schedule_expression_timezone = "America/New_York"
      + state                        = "ENABLED"

      + flexible_time_window {
          + mode = "OFF"
        }

      + target {
          + arn      = "arn:aws:scheduler:::aws-sdk:sfn:startExecution"
          + input    = (known after apply)
          + role_arn = (known after apply)
        }
    }

  # module.schedules.aws_scheduler_schedule.datastream_schedule_short_range_routing_only["22_03W"] will be created
  + resource "aws_scheduler_schedule" "datastream_schedule_short_range_routing_only" {
      + arn                          = (known after apply)
      + group_name                   = "default"
      + id                           = (known after apply)
      + name                         = "short_range_fcst22_vpu03W_schedule_routing_only_prod"
      + name_prefix                  = (known after apply)
      + schedule_expression          = "cron(0 23 * * ? *)"
      + schedule_expression_timezone = "America/New_York"
      + state                        = "ENABLED"

      + flexible_time_window {
          + mode = "OFF"
        }

      + target {
          + arn      = "arn:aws:scheduler:::aws-sdk:sfn:startExecution"
          + input    = (known after apply)
          + role_arn = (known after apply)
        }
    }

  # module.schedules.aws_scheduler_schedule.datastream_schedule_short_range_routing_only["23_03W"] will be created
  + resource "aws_scheduler_schedule" "datastream_schedule_short_range_routing_only" {
      + arn                          = (known after apply)
      + group_name                   = "default"
      + id                           = (known after apply)
      + name                         = "short_range_fcst23_vpu03W_schedule_routing_only_prod"
      + name_prefix                  = (known after apply)
      + schedule_expression          = "cron(0 0 * * ? *)"
      + schedule_expression_timezone = "America/New_York"
      + state                        = "ENABLED"

      + flexible_time_window {
          + mode = "OFF"
        }

      + target {
          + arn      = "arn:aws:scheduler:::aws-sdk:sfn:startExecution"
          + input    = (known after apply)
          + role_arn = (known after apply)
        }
    }

Plan: 46 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + datastream_arn             = (known after apply)
  + lambda_arns                = [
      + (known after apply),
      + (known after apply),
      + (known after apply),
      + (known after apply),
      + (known after apply),
    ]
  + lambda_function_names      = [
      + "nrds_routing_only_prod_start_ec2",
      + "nrds_routing_only_prod_ec2_commander",
      + "nrds_routing_only_prod_ec2_command_poller",
      + "nrds_routing_only_prod_s3_object_checker",
      + "nrds_routing_only_prod_ec2_stopper",
    ]
  + lambda_role_arn            = (known after apply)
  + short_range_schedule_count = 24

Pusher: @harshavemula-ua, Action: pull_request, Workflow: Terraform PR Validation

Comment on lines +3 to +21
resource "aws_iam_policy" "scheduler_policy" {
  name        = var.scheduler_policy_name
  description = "Policy with permissions for statemachine execution"
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        "Effect" : "Allow",
        Action = [
          "states:StartExecution",
          "events:PutTargets",
          "events:PutRule",
          "events:PutPermission"
        ],
        "Resource" : ["*"]
      }
    ]
  })
}

Check failure

Code scanning / Checkov

Ensure no IAM policies documents allow "*" as a statement's resource for restrictable actions

Comment on lines +3 to +21 (same scheduler_policy block as quoted above)

Check failure

Code scanning / Checkov

Ensure IAM policies does not allow write access without constraints

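
One way to satisfy both Checkov findings would be to scope the statement to the specific state machine rather than "*". A sketch, assuming a state_machine_arn variable is in scope here:

```hcl
resource "aws_iam_policy" "scheduler_policy" {
  name        = var.scheduler_policy_name
  description = "Policy with permissions for statemachine execution"
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Effect = "Allow",
        Action = ["states:StartExecution"],
        # Scoped to the one state machine instead of "*"
        Resource = [var.state_machine_arn]
      }
    ]
  })
}
```

If the events:PutTargets/PutRule/PutPermission actions are actually used, they would need their own, similarly scoped statement.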
Comment on lines 50 to 87
resource "aws_scheduler_schedule" "datastream_schedule_short_range_routing_only" {
  for_each = local.short_range_routing_only_config

  name       = "short_range_fcst${each.value.init}_vpu${each.value.vpu}_schedule_routing_only_${var.environment_suffix}"
  group_name = var.schedule_group_name

  flexible_time_window {
    mode = "OFF"
  }

  schedule_expression          = "cron(0 ${local.short_range_times_routing_only[each.key]} * * ? *)"
  schedule_expression_timezone = var.schedule_timezone

  target {
    arn      = "arn:aws:scheduler:::aws-sdk:sfn:startExecution"
    role_arn = aws_iam_role.scheduler_role.arn
    input    = <<-EOT
      {
        "StateMachineArn": "${var.state_machine_arn}",
        "Name": "routing_only_short_range_vpu${each.value.vpu}_init${each.value.init}_<aws.scheduler.execution-id>",
        "Input": ${jsonencode(templatefile(local.routing_only_template_path, {
          vpu                = each.value.vpu
          init               = each.value.init
          run_type_l         = each.value.run_type_l
          run_type_h         = each.value.run_type_h
          nprocs             = each.value.nprocs
          ami_id             = local.routing_only_ami_id
          instance_type      = each.value.instance_type
          security_group_ids = local.routing_only_security_groups
          instance_profile   = local.routing_only_instance_profile
          volume_size        = each.value.volume_size
          environment_suffix = var.environment_suffix
          s3_bucket          = var.s3_bucket
        }))}
      }
    EOT
  }
}

Check failure

Code scanning / Checkov

Ensure EventBridge Scheduler Schedule uses Customer Managed Key (CMK)

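
The CMK finding could be addressed by setting kms_key_arn on the schedule. A minimal sketch (the aws_kms_key.scheduler resource and the trimmed-down schedule are hypothetical, not the PR's actual resources):

```hcl
resource "aws_kms_key" "scheduler" {
  description = "CMK for EventBridge Scheduler payload encryption"
}

resource "aws_scheduler_schedule" "example" {
  name        = "example_schedule"
  kms_key_arn = aws_kms_key.scheduler.arn # customer managed key

  flexible_time_window {
    mode = "OFF"
  }

  schedule_expression = "cron(0 0 * * ? *)"

  target {
    arn      = "arn:aws:scheduler:::aws-sdk:sfn:startExecution"
    role_arn = aws_iam_role.scheduler_role.arn
    input    = "{}"
  }
}
```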

github-actions bot commented Feb 19, 2026

Terraform Plan: nrds-cfe-nom

Plan: 660 to add, 0 to change, 0 to destroy.

Show Plan Details
  # module.schedules.aws_scheduler_schedule.datastream_schedule_short_range_cfe_nom["23_12"] will be created
  + resource "aws_scheduler_schedule" "datastream_schedule_short_range_cfe_nom" {
      + arn                          = (known after apply)
      + group_name                   = "default"
      + id                           = (known after apply)
      + name                         = "short_range_fcst23_vpu12_schedule_cfe_nom_prod"
      + name_prefix                  = (known after apply)
      + schedule_expression          = "cron(0 0 * * ? *)"
      + schedule_expression_timezone = "America/New_York"
      + state                        = "ENABLED"

      + flexible_time_window {
          + mode = "OFF"
        }

      + target {
          + arn      = "arn:aws:scheduler:::aws-sdk:sfn:startExecution"
          + input    = (known after apply)
          + role_arn = (known after apply)
        }
    }

  # module.schedules.aws_scheduler_schedule.datastream_schedule_short_range_cfe_nom["23_13"] will be created
  + resource "aws_scheduler_schedule" "datastream_schedule_short_range_cfe_nom" {
      + arn                          = (known after apply)
      + group_name                   = "default"
      + id                           = (known after apply)
      + name                         = "short_range_fcst23_vpu13_schedule_cfe_nom_prod"
      + name_prefix                  = (known after apply)
      + schedule_expression          = "cron(0 0 * * ? *)"
      + schedule_expression_timezone = "America/New_York"
      + state                        = "ENABLED"

      + flexible_time_window {
          + mode = "OFF"
        }

      + target {
          + arn      = "arn:aws:scheduler:::aws-sdk:sfn:startExecution"
          + input    = (known after apply)
          + role_arn = (known after apply)
        }
    }

  # module.schedules.aws_scheduler_schedule.datastream_schedule_short_range_cfe_nom["23_14"] will be created
  + resource "aws_scheduler_schedule" "datastream_schedule_short_range_cfe_nom" {
      + arn                          = (known after apply)
      + group_name                   = "default"
      + id                           = (known after apply)
      + name                         = "short_range_fcst23_vpu14_schedule_cfe_nom_prod"
      + name_prefix                  = (known after apply)
      + schedule_expression          = "cron(0 0 * * ? *)"
      + schedule_expression_timezone = "America/New_York"
      + state                        = "ENABLED"

      + flexible_time_window {
          + mode = "OFF"
        }

      + target {
          + arn      = "arn:aws:scheduler:::aws-sdk:sfn:startExecution"
          + input    = (known after apply)
          + role_arn = (known after apply)
        }
    }

  # module.schedules.aws_scheduler_schedule.datastream_schedule_short_range_cfe_nom["23_15"] will be created
  + resource "aws_scheduler_schedule" "datastream_schedule_short_range_cfe_nom" {
      + arn                          = (known after apply)
      + group_name                   = "default"
      + id                           = (known after apply)
      + name                         = "short_range_fcst23_vpu15_schedule_cfe_nom_prod"
      + name_prefix                  = (known after apply)
      + schedule_expression          = "cron(0 0 * * ? *)"
      + schedule_expression_timezone = "America/New_York"
      + state                        = "ENABLED"

      + flexible_time_window {
          + mode = "OFF"
        }

      + target {
          + arn      = "arn:aws:scheduler:::aws-sdk:sfn:startExecution"
          + input    = (known after apply)
          + role_arn = (known after apply)
        }
    }

  # module.schedules.aws_scheduler_schedule.datastream_schedule_short_range_cfe_nom["23_16"] will be created
  + resource "aws_scheduler_schedule" "datastream_schedule_short_range_cfe_nom" {
      + arn                          = (known after apply)
      + group_name                   = "default"
      + id                           = (known after apply)
      + name                         = "short_range_fcst23_vpu16_schedule_cfe_nom_prod"
      + name_prefix                  = (known after apply)
      + schedule_expression          = "cron(0 0 * * ? *)"
      + schedule_expression_timezone = "America/New_York"
      + state                        = "ENABLED"

      + flexible_time_window {
          + mode = "OFF"
        }

      + target {
          + arn      = "arn:aws:scheduler:::aws-sdk:sfn:startExecution"
          + input    = (known after apply)
          + role_arn = (known after apply)
        }
    }

  # module.schedules.aws_scheduler_schedule.datastream_schedule_short_range_cfe_nom["23_17"] will be created
  + resource "aws_scheduler_schedule" "datastream_schedule_short_range_cfe_nom" {
      + arn                          = (known after apply)
      + group_name                   = "default"
      + id                           = (known after apply)
      + name                         = "short_range_fcst23_vpu17_schedule_cfe_nom_prod"
      + name_prefix                  = (known after apply)
      + schedule_expression          = "cron(0 0 * * ? *)"
      + schedule_expression_timezone = "America/New_York"
      + state                        = "ENABLED"

      + flexible_time_window {
          + mode = "OFF"
        }

      + target {
          + arn      = "arn:aws:scheduler:::aws-sdk:sfn:startExecution"
          + input    = (known after apply)
          + role_arn = (known after apply)
        }
    }

  # module.schedules.aws_scheduler_schedule.datastream_schedule_short_range_cfe_nom["23_18"] will be created
  + resource "aws_scheduler_schedule" "datastream_schedule_short_range_cfe_nom" {
      + arn                          = (known after apply)
      + group_name                   = "default"
      + id                           = (known after apply)
      + name                         = "short_range_fcst23_vpu18_schedule_cfe_nom_prod"
      + name_prefix                  = (known after apply)
      + schedule_expression          = "cron(0 0 * * ? *)"
      + schedule_expression_timezone = "America/New_York"
      + state                        = "ENABLED"

      + flexible_time_window {
          + mode = "OFF"
        }

      + target {
          + arn      = "arn:aws:scheduler:::aws-sdk:sfn:startExecution"
          + input    = (known after apply)
          + role_arn = (known after apply)
        }
    }

  # module.schedules.aws_scheduler_schedule.datastream_schedule_short_range_cfe_nom["23_fp"] will be created
  + resource "aws_scheduler_schedule" "datastream_schedule_short_range_cfe_nom" {
      + arn                          = (known after apply)
      + group_name                   = "default"
      + id                           = (known after apply)
      + name                         = "short_range_fcst23_vpufp_schedule_cfe_nom_prod"
      + name_prefix                  = (known after apply)
      + schedule_expression          = "cron(0 23 * * ? *)"
      + schedule_expression_timezone = "America/New_York"
      + state                        = "ENABLED"

      + flexible_time_window {
          + mode = "OFF"
        }

      + target {
          + arn      = "arn:aws:scheduler:::aws-sdk:sfn:startExecution"
          + input    = (known after apply)
          + role_arn = (known after apply)
        }
    }

Plan: 660 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + analysis_assim_schedule_count = 22
  + datastream_arn                = (known after apply)
  + lambda_arns                   = [
      + (known after apply),
      + (known after apply),
      + (known after apply),
      + (known after apply),
      + (known after apply),
    ]
  + lambda_function_names         = [
      + "nrds_cfe_nom_prod_start_ec2",
      + "nrds_cfe_nom_prod_ec2_commander",
      + "nrds_cfe_nom_prod_ec2_command_poller",
      + "nrds_cfe_nom_prod_s3_object_checker",
      + "nrds_cfe_nom_prod_ec2_stopper",
    ]
  + lambda_role_arn               = (known after apply)
  + medium_range_schedule_count   = 88
  + short_range_schedule_count    = 528

Pusher: @harshavemula-ua, Action: pull_request, Workflow: Terraform PR Validation

EC2 instances will use default VPC security group instead of a
Terraform-managed one, avoiding destroy conflicts when instances
are still running.
harshavemula-ua force-pushed the templatize_cfe_nom_routing_only branch from dab7d36 to 11c500d on February 19, 2026 at 20:51
Add fp template and EventBridge schedules for short_range (24 init cycles),
medium_range (4 init cycles), and analysis_assim_extend (1 init cycle).
FP processes NWM forcings for all VPUs at once using Docker containers.
Move fp from separate schedule resources into the vpus list with
conditionals for template, AMI, timing, and member_suffix. Single
place to control what gets deployed - remove fp from list to disable.
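
The single-list control described here might look roughly like this (only the vpus variable name comes from the PR; the other names are assumptions):

```hcl
variable "vpus" {
  type    = list(string)
  default = ["01", "02", "03W", "fp"] # drop "fp" to disable fp schedules
}

locals {
  # Per-entry conditionals select the fp-specific AMI (and, analogously,
  # template, timing, and member_suffix).
  ami_by_vpu = {
    for v in var.vpus :
    v => (v == "fp" ? var.fp_ami_id : var.vpu_ami_id) # hypothetical vars
  }
}
```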
infra_deploy_val.yaml and research_datastream_terraform.yaml use
hardcoded AWS keys and outdated patterns. Replaced by terraform-deploy,
terraform-pr-validate, terraform-destroy, terraform-drift-detection,
and terraform-health-check workflows with OIDC auth.
@harshavemula-ua harshavemula-ua changed the title Templatize CFE_NOM and routing-only Terraform services Add services-based Terraform monorepo with CI/CD and FP schedules Feb 24, 2026
@harshavemula-ua harshavemula-ua changed the title Add services-based Terraform monorepo with CI/CD and FP schedules Restructure Terraform infra with services monorepo, CI/CD, and FP support Feb 24, 2026
Collaborator

@JordanLaserGit JordanLaserGit left a comment


@harshavemula-ua Thanks for all this work!

In separating the datastreams into services, the orchestration and scheduler infrastructure is replicated for each datastream. While this shouldn't cost anything extra, it isn't necessary from my standpoint and may become a pain point. Is there a benefit to doing this? Ideally the orchestration infrastructure isn't dependent on the particular datastream (i.e., naming the infra and backend based on the datastream name).

I think of the nrds as a service that we are adding on to by adding "datastreams", in the rough form of the combination of schedules.tf and the execution templates/config. I don't see why we wouldn't just use a single Terraform state file to reference when applying changes. That should take care of both deploy and destroy, and remove the need to edit and matrix the workflows for each added datastream.

Said another way, I think all of the variables defined in services/../envs/prod.tfvars should be the same from datastream to datastream, with the exception of environment_suffix if we want to implement the idea of a dev datastream vs a prod datastream. The AMI could be datastream-specific, but ideally isn't. I'd think we want to keep the AMIs up to date and uniform, so we aren't using different versions of NGIAB for different datastreams.

I don't think this is a huge point of contention, and this all looks great and is potentially ready to merge in its current form, though perhaps with the edit of building a single set of production infrastructure from a single state file to make maintenance easier.


jobs:
# ---------------------------------------------------------------------------
# Detect which services changed and build a dynamic matrix.
Collaborator


I think Terraform does this automatically with a terraform apply. I don't believe we also need to check whether anything changed beforehand.

Collaborator Author


To clarify - "services" in this context refers to the ngen-datastream pipeline services (e.g., nrds-cfe-nom, nrds-routing-only), not AWS services. Each service directory under services/ encapsulates a complete pipeline configuration — its own orchestration module, schedules, execution templates, S3 backend, and environment configs.

The schedule name changes (e.g., _cfe_nom → _cfe_nom_prod) will be handled by terraform apply — it will destroy the old schedules and create the new ones as part of the state migration.

# ORDER MATTERS: services deploy sequentially (max-parallel: 1).
# Put lower-risk / smaller services first as a canary.
# =====================================================================
SERVICE_REGISTRY='[
Collaborator


We could just manage a single production backend. This way we don't have to update these workflows when we add additional datastreams.

Collaborator Author


Per-service backends are better for production in my view. The isolation is worth it: a bad terraform apply on one service won't blow up the other, and state locking won't block unrelated services either.

Blast-radius isolation is the main benefit here, in that a bad apply on cfe-nom won't affect routing-only.

I'm fine either way.

@@ -0,0 +1,72 @@
name: Terraform Destroy
Collaborator


Do we need a workflow for destroying the resources? It seems potentially dangerous, and the terraform apply in the terraform-deploy workflow should destroy resources if we were to remove them from the deployment.

Collaborator Author


Makes sense — I originally included it thinking about end-to-end infrastructure management from workflows, but you're right that terraform apply already handles resource removal safely through the normal PR review process. I'll remove it in a follow-up.
