Restructure Terraform infra with services monorepo, CI/CD, and FP support#312
harshavemula-ua wants to merge 45 commits into `main`.
Conversation
- Root module now deploys only shared orchestration (state machine, lambdas, IAM, SG)
- CFE_NOM schedules moved to `services/nrds-cfe-nom/` with its own state file
- Schedule service reads orchestration outputs via `terraform_remote_state`
- Enables independent deploy/rollback per model service
- Future services (routing-only, etc.) can be added as new service directories
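The remote-state read described above might be wired roughly like this (the bucket, key, region, and output names are illustrative assumptions, not taken from the PR):

```hcl
# Hypothetical sketch: a service reads shared orchestration outputs
# from the root module's remote state. All names are illustrative.
data "terraform_remote_state" "orchestration" {
  backend = "s3"
  config = {
    bucket = "ciroh-terraform-state"           # assumed state bucket
    key    = "orchestration/terraform.tfstate" # assumed state key
    region = "us-east-1"                       # assumed region
  }
}

module "schedules" {
  source            = "./modules/schedules" # assumed module path
  state_machine_arn = data.terraform_remote_state.orchestration.outputs.state_machine_arn
}
```

Note that a later commit in this PR moves each service to including the orchestration module directly instead of this remote-state read.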
Adds routing-only schedules as a standalone service matching the nrds-cfe-nom pattern: separate state, terraform_remote_state for orchestration outputs, templatefile() for execution JSONs.
Each service (nrds-cfe-nom, nrds-routing-only) now includes its own orchestration module directly instead of reading from terraform_remote_state. Removed backend "s3" block for local testing. Added orchestration vars to test.tfvars for both services.
- `terraform-pr-validate.yml`: PR validation with dynamic matrix (fmt, init, validate, plan with PR comments, tfsec + checkov security scans, plan artifact uploads)
- `terraform-deploy.yml`: Deploy on merge with sequential matrix (max-parallel: 1, fail-fast, concurrency with cancel-in-progress: false, lock-timeout: 5m)
- `terraform-health-check.yml`: Post-deploy verification (checks Step Functions state machines are ACTIVE, runs every 6h and after deploys)
- `terraform-drift-detection.yml`: Daily drift detection with `plan -detailed-exitcode`; auto-creates GitHub Issues on drift

All workflows use a matrix-driven SERVICE_REGISTRY pattern. Adding a new service requires only two lines per workflow (one path filter + one JSON entry).
- Add S3 backend to `nrds-routing-only/main.tf` using native S3 locking (`use_lockfile = true`, no DynamoDB needed)
- Bucket: `ciroh-nrds-terraform-state`, key: `nrds-routing-only/test/terraform.tfstate`
- Bump `required_version` to `>= 1.10` (native S3 locking requires 1.10+)
- Bump `TF_VERSION` to 1.10.0 across all CI workflows
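A sketch of the resulting backend block, using the bucket and key named above (the region is an assumption; it is not stated in the PR):

```hcl
# Requires Terraform >= 1.10: use_lockfile enables native S3 state
# locking, replacing the traditional DynamoDB lock table.
terraform {
  required_version = ">= 1.10"

  backend "s3" {
    bucket       = "ciroh-nrds-terraform-state"
    key          = "nrds-routing-only/test/terraform.tfstate"
    region       = "us-east-1" # assumption
    use_lockfile = true        # native S3 locking, no DynamoDB needed
  }
}
```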
- Create `envs/test.backend.hcl` with partial backend config for the routing-only service (`ciroh-terraform-state` bucket, native S3 locking)
- Update `main.tf` to use an empty `backend "s3" {}` block loaded via the `-backend-config` flag
- Add a `backend_config` field to `SERVICE_REGISTRY` in all three workflows (pr-validate, deploy, drift-detection)
- Conditional `terraform init` uses `-backend-config` when configured
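Assuming the layout described, the partial config file might look like this (the key name and region are assumptions; only the bucket and locking mode are stated above):

```hcl
# envs/test.backend.hcl: partial backend settings, supplied at init time via
#   terraform init -backend-config=envs/test.backend.hcl
# while main.tf keeps only an empty `backend "s3" {}` placeholder.
bucket       = "ciroh-terraform-state"
key          = "nrds-routing-only/test/terraform.tfstate" # assumed key
region       = "us-east-1"                                # assumed region
use_lockfile = true                                       # native S3 locking
```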
- Create `envs/test.backend.hcl` with state key `cfe-nom-test-datastream` in the `ciroh-terraform-state` bucket with native S3 locking
- Update `main.tf` to use an empty `backend "s3" {}` block (`>= 1.10`)
- Update `backend_config` for cfe-nom in all three workflows
Reflect the service-based directory layout, S3 backend with partial .hcl configs, updated terraform init/plan/apply/destroy commands, and CI/CD workflow summary table.
- Add `workflow_dispatch` trigger to the PR validate and deploy workflows for manual runs
- Add `-lock-timeout=5m` to the PR validation plan step to prevent failures when state is temporarily locked
Triggers only when the workflow file itself changes.
Replace AWS_ACCESS_KEY_ID/SECRET_ACCESS_KEY with role-to-assume using GitHub's OIDC provider. Eliminates long-lived credentials. Also quotes backend-config and var-file paths in all workflows.
Aligns the orchestration module version constraint with the service-level constraint, ensuring S3 native state locking support is available.
Use bash line continuation in execution templates for routing-only and cfe-nom services.
Replace hardcoded ciroh-community-ngen-datastream with a configurable s3_bucket variable threaded through all layers.
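A minimal sketch of that variable threading, assuming the schedules module accepts an `s3_bucket` input (the module path is an assumption; the default preserves the previously hardcoded value):

```hcl
# Sketch: replace the hardcoded bucket name with a variable threaded
# through each layer down to the execution templates.
variable "s3_bucket" {
  description = "S3 bucket for datastream inputs/outputs"
  type        = string
  default     = "ciroh-community-ngen-datastream"
}

# The root passes it down; the schedules module forwards it into
# templatefile() when rendering the execution JSONs.
module "schedules" {
  source    = "./modules/schedules" # assumed path
  s3_bucket = var.s3_bucket
}
```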
… gates, restrict to main branch
Terraform Plan:

```hcl
resource "aws_iam_policy" "scheduler_policy" {
  name        = var.scheduler_policy_name
  description = "Policy with permissions for statemachine execution"
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        "Effect" : "Allow",
        Action = [
          "states:StartExecution",
          "events:PutTargets",
          "events:PutRule",
          "events:PutPermission"
        ],
        "Resource" : ["*"]
      }
    ]
  })
}
```

Check failure (Code scanning / Checkov): Ensure no IAM policies documents allow "*" as a statement's resource for restrictable actions

Check failure (Code scanning / Checkov): Ensure IAM policies does not allow write access without constraints
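A hypothetical remediation for both findings would scope `Resource` to specific ARNs instead of `"*"` (the EventBridge ARN pattern and the `aws_caller_identity` data source here are assumptions, not part of the PR):

```hcl
# Sketch: split the statement so each action set gets the narrowest
# resource scope available. ARNs below are illustrative.
data "aws_caller_identity" "current" {}

resource "aws_iam_policy" "scheduler_policy" {
  name        = var.scheduler_policy_name
  description = "Policy with permissions for statemachine execution"
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Effect   = "Allow",
        Action   = ["states:StartExecution"],
        Resource = [var.state_machine_arn] # only the one state machine
      },
      {
        Effect = "Allow",
        Action = ["events:PutTargets", "events:PutRule", "events:PutPermission"],
        # Narrower than "*": rules in this account only (pattern assumed)
        Resource = ["arn:aws:events:*:${data.aws_caller_identity.current.account_id}:rule/*"]
      }
    ]
  })
}
```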
```hcl
resource "aws_scheduler_schedule" "datastream_schedule_short_range_routing_only" {
  for_each = local.short_range_routing_only_config

  name       = "short_range_fcst${each.value.init}_vpu${each.value.vpu}_schedule_routing_only_${var.environment_suffix}"
  group_name = var.schedule_group_name

  flexible_time_window {
    mode = "OFF"
  }

  schedule_expression          = "cron(0 ${local.short_range_times_routing_only[each.key]} * * ? *)"
  schedule_expression_timezone = var.schedule_timezone

  target {
    arn      = "arn:aws:scheduler:::aws-sdk:sfn:startExecution"
    role_arn = aws_iam_role.scheduler_role.arn
    input    = <<-EOT
      {
        "StateMachineArn": "${var.state_machine_arn}",
        "Name": "routing_only_short_range_vpu${each.value.vpu}_init${each.value.init}_<aws.scheduler.execution-id>",
        "Input": ${jsonencode(templatefile(local.routing_only_template_path, {
          vpu                = each.value.vpu
          init               = each.value.init
          run_type_l         = each.value.run_type_l
          run_type_h         = each.value.run_type_h
          nprocs             = each.value.nprocs
          ami_id             = local.routing_only_ami_id
          instance_type      = each.value.instance_type
          security_group_ids = local.routing_only_security_groups
          instance_profile   = local.routing_only_instance_profile
          volume_size        = each.value.volume_size
          environment_suffix = var.environment_suffix
          s3_bucket          = var.s3_bucket
        }))}
      }
    EOT
  }
}
```

Check failure (Code scanning / Checkov): Ensure EventBridge Scheduler Schedule uses Customer Managed Key (CMK)
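A hedged sketch of satisfying the CMK check: attach a customer managed key via `kms_key_arn` (the key resource and the simplified schedule shown are illustrative, not the PR's actual resources):

```hcl
# Illustrative only: a CMK plus the one attribute Checkov wants set.
resource "aws_kms_key" "scheduler" {
  description         = "CMK for EventBridge Scheduler payload encryption"
  enable_key_rotation = true
}

resource "aws_scheduler_schedule" "example" {
  name        = "example_schedule"            # hypothetical name
  group_name  = var.schedule_group_name
  kms_key_arn = aws_kms_key.scheduler.arn     # addresses the CMK finding

  flexible_time_window {
    mode = "OFF"
  }

  schedule_expression = "cron(0 12 * * ? *)"  # placeholder expression

  target {
    arn      = "arn:aws:scheduler:::aws-sdk:sfn:startExecution"
    role_arn = aws_iam_role.scheduler_role.arn
  }
}
```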
EC2 instances will use default VPC security group instead of a Terraform-managed one, avoiding destroy conflicts when instances are still running.
Add fp template and EventBridge schedules for short_range (24 init cycles), medium_range (4 init cycles), and analysis_assim_extend (1 init cycle). FP processes NWM forcings for all VPUs at once using docker containers.
infra/aws/terraform/services/nrds-cfe-nom/modules/schedules/schedules.tf (alerts fixed)
Move fp from separate schedule resources into the vpus list with conditionals for template, AMI, timing, and member_suffix. Single place to control what gets deployed - remove fp from list to disable.
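A hedged sketch of that single-list pattern (the local and variable names are assumptions):

```hcl
# Illustrative sketch: fp is just another entry in var.vpus, and
# per-entry conditionals pick the template, AMI, and member suffix.
# Removing "fp" from var.vpus disables the fp schedules entirely.
locals {
  schedule_config = {
    for v in var.vpus : v => {
      template_path = v == "fp" ? local.fp_template_path : local.vpu_template_path
      ami_id        = v == "fp" ? local.fp_ami_id : local.vpu_ami_id
      member_suffix = v == "fp" ? "_fp" : ""
    }
  }
}
```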
infra_deploy_val.yaml and research_datastream_terraform.yaml used hardcoded AWS keys and outdated patterns. They are replaced by the terraform-deploy, terraform-pr-validate, terraform-destroy, terraform-drift-detection, and terraform-health-check workflows with OIDC auth.
JordanLaserGit
left a comment
@harshavemula-ua Thanks for all this work!
In separating the datastreams into services, the orchestration and scheduler infrastructure is replicated for each datastream. While this shouldn't cost anything extra, it isn't necessary from my standpoint and may be a pain point. Is there a benefit to doing this? Ideally the orchestration infrastructure isn't dependent on the particular datastream (i.e., naming the infra and backend based on the datastream name).
I think of the nrds as a service that we are adding on to by adding "datastreams," in the rough form of the combination of schedules.tf and the execution templates/config. I don't see why not just use a single Terraform state file to reference when applying changes. That should take care of both deploy and destroy, and remove the need to edit and matrix the workflows for each added datastream.
Said another way, I think all of the variables defined in services/../envs/prod.tfvars should be the same from datastream to datastream, with the exception of environment_suffix if we want to implement the idea of a dev datastream vs a prod datastream. The AMI could be datastream-specific, but ideally isn't. I'd think we want to keep the AMIs up to date and uniform, so we aren't using different versions of NGIAB for different datastreams.
I don't think this is a huge point of contention, and this all looks great and is potentially ready to merge in its current form, though perhaps with the edit of building a single set of production infrastructure from a single state file to make maintenance easier.
```yaml
jobs:
  # ---------------------------------------------------------------------------
  # Detect which services changed and build a dynamic matrix.
```
I think Terraform does this automatically with a terraform apply. I don't believe we need to also check whether anything changed beforehand.
To clarify - "services" in this context refers to the ngen-datastream pipeline services (e.g., nrds-cfe-nom, nrds-routing-only), not AWS services. Each service directory under services/ encapsulates a complete pipeline configuration — its own orchestration module, schedules, execution templates, S3 backend, and environment configs.
The schedule name changes (e.g., _cfe_nom → _cfe_nom_prod) will be handled by terraform apply — it will destroy the old schedules and create the new ones as part of the state migration.
```shell
# ORDER MATTERS: services deploy sequentially (max-parallel: 1).
# Put lower-risk / smaller services first as a canary.
# =====================================================================
SERVICE_REGISTRY='[
```
We could just manage a single production backend. This way we don't have to update these workflows when we add additional datastreams.
Per-service backends are better for production, at least in my view. The isolation is worth it: a bad terraform apply on one service won't blow up the others, and state locking won't block unrelated services either.
Blast-radius isolation is the main benefit here; a bad apply on cfe-nom won't affect routing-only.
I'm fine either way, though.
```diff
@@ -0,0 +1,72 @@
+name: Terraform Destroy
```
Do we need a workflow for destroying the resources? It seems potentially dangerous, and the terraform apply in the terraform-deploy workflow should destroy resources if we remove them from the deployment.
Makes sense — I originally included it thinking about end-to-end infrastructure management from workflows, but you're right that terraform apply already handles resource removal safely through the normal PR review process. I'll remove it in a follow-up.
Summary
Restructures Terraform infrastructure into a services-based monorepo with full CI/CD automation, replacing the previous flat layout and manual deployment process.
Infrastructure Changes
- `nrds-cfe-nom` and `nrds-routing-only` services under `infra/aws/terraform/services/`
- Each service includes its own `modules/orchestration/`
- `templatefile()` for dynamic schedule generation — no pre-generated JSON files
- Conditional template selection (`fp_template` vs `VPU_template`) based on the vpus list — remove `fp` from the list to disable fp schedules entirely

CI/CD Workflows (5 new, OIDC auth)
- Drift detection auto-creates GitHub Issues with the `terraform-drift` label
- Removes legacy workflows (`infra_deploy_val.yaml`, `research_datastream_terraform.yaml`) that used hardcoded AWS keys

Bug Fixes
- Fixed `analysis_assim_extend` fcst value to `tm27_tm00` to match production
- Removed `test/` from the cfe_nom S3 prefix to match the production output path

Adding a New Service
- Create `infra/aws/terraform/services/<service-name>/` with `main.tf`, `variables.tf`, `envs/`
- Register the service in each workflow's `SERVICE_REGISTRY` (plus a path filter)

Test Plan
- `terraform plan` shows expected resources for both services
- Verified deployed schedules (`aws scheduler get-schedule`)
- Terraform Drift Detection workflow successful execution: https://github.com/CIROH-UA/ngen-datastream/actions/runs/22078756096
- Terraform Deploy workflow successful execution: https://github.com/CIROH-UA/ngen-datastream/actions/runs/22190595089
- Terraform Plan successful execution: https://github.com/CIROH-UA/ngen-datastream/actions/runs/22362787346?pr=312