
Commit 00d6d1f

docs: Update documentation for deployment
1 parent 6fb5dc6 commit 00d6d1f

5 files changed (+223, -113 lines)

README.md (105 additions, 30 deletions)
```diff
@@ -1,31 +1,55 @@
 =======
 # Datafold Google module
 
-This repository provisions resources on Google, preparing them for a deployment of the
-application on a GKE cluster.
+This repository provisions infrastructure resources on Google Cloud for deploying Datafold using the datafold-operator.
 
 ## About this module
 
+**⚠️ Important**: This module is now **optional**. If you already have GKE infrastructure in place, you can configure the required resources independently. This module is primarily intended for customers who need to set up the complete infrastructure stack for a GKE deployment.
+
+The module provisions the Google Cloud infrastructure resources required for a Datafold deployment. Application configuration is now managed through the `datafoldapplication` custom resource on the cluster using the datafold-operator, rather than through Terraform application directories.
+
+## Breaking Changes
+
+### Load Balancer Deployment (Default Changed)
+
+**Breaking Change**: The load balancer is **no longer deployed by default**; the default has changed to `deploy_lb = false`.
+
+- **Previous behavior**: the load balancer was deployed by default
+- **New behavior**: load balancer deployment is disabled by default
+- **Action required**: if you need a load balancer, you must explicitly set `deploy_lb = true` in your configuration so that it is not destroyed on the next apply. (If it is destroyed, redeploy it and then update your DNS to the new load balancer IP.)
+
+### Application Directory Removal
+
+- The "application" directory is no longer part of this repository
+- Application configuration is now managed through the `datafoldapplication` custom resource on the cluster
+
 ## Prerequisites
 
-* A Google cloud account, preferably a new isolated one.
+* A Google Cloud account, preferably a new isolated one.
 * Terraform >= 1.4.6
 * A customer contract with Datafold
   * The application does not work without credentials supplied by sales
 * Access to our public helm-charts repository
+* The datafold-operator installed on your GKE cluster
+  * Application configuration is managed through the `datafoldapplication` custom resource
 
-This deployment will create the following resources:
+The full deployment will create the following resources:
 
 * Google VPC
-* Google subnet
+* Google subnets
 * Google GCS bucket for clickhouse backups
-* Google external application load balancer
-* Google HTTPS certificate, unless preregistered and provided
+* Google Cloud Load Balancer (optional, disabled by default)
+* Google-managed SSL certificate (if the load balancer is enabled)
 * Three persistent disk volumes for local data storage
+* Cloud SQL PostgreSQL database
 * A GKE cluster
 * Service accounts for the GKE cluster to perform actions outside of its cluster boundary:
   * Provisioning persistent disk volumes
   * Updating Network Endpoint Group to route traffic to pods directly
+  * Managing GCS bucket access for ClickHouse backups
+
+**Infrastructure Dependencies**: For a complete list of required infrastructure resources and detailed deployment guidance, see the [Datafold Dedicated Cloud GCP Deployment Documentation](https://docs.datafold.com/datafold-deployment/dedicated-cloud/gcp).
 
 ## Negative scope
 
```
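To preserve an existing Terraform-managed load balancer across this change, the flag can be set explicitly where the module is instantiated. A minimal sketch, assuming a typical module block (the source path and other inputs are placeholders; only `deploy_lb` is named in the README above):

```hcl
module "gcp" {
  source = "../.."  # placeholder path to this module

  # Keep the Terraform-managed load balancer; the default is now false.
  deploy_lb = true

  # ... other required module inputs ...
}
```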
```diff
@@ -34,41 +58,92 @@ This deployment will create the following resources:
 ## How to use this module
 
 * See the example for a potential setup, which has dependencies on our helm-charts
-* Create secret files with our variables
 
-## Examples
+The example directory contains a single deployment example for infrastructure setup.
+
+Setting up the infrastructure:
+
+* It is easiest if you have full admin access in the target project.
+* Pre-create a symmetric encryption key that is used to encrypt/decrypt the secrets of this deployment.
+  * Use the alias instead of the `mrk` link. Put that into `locals.tf`.
+* **Certificate Requirements** (depend on the load balancer deployment method):
+  * **If deploying the load balancer from this Terraform module** (`deploy_lb = true`): pre-create and validate the SSL certificate in your DNS, then refer to that certificate in main.tf by its domain name (replace "datafold.example.com").
+  * **If deploying the load balancer from within Kubernetes**: the certificate is created automatically, but you must wait for it to become available and then validate it in your DNS after the deployment is complete.
+* Change the settings in locals.tf:
+  * provider_region = the region you want to deploy in.
+  * project_id = the GCP project ID where you want to deploy.
+  * kms_profile = the profile you want to use to issue the deployments. Targets the deployment account.
+  * kms_key = a pre-created symmetric KMS key. Its only purpose is the encryption/decryption of deployment secrets.
+  * deployment_name = the name of the deployment (used in the Kubernetes namespace, container naming, and the Datadog "deployment" Unified Tag).
+* Run `terraform init` in the `infra` directory.
+* Run `terraform apply` in the `infra` directory. This should complete without errors.
+* Check in the console that you see the GKE cluster, Cloud SQL database, etc.
+  * If you enabled load balancer deployment, check for the load balancer as well.
+
+**Application Deployment**: After the infrastructure is ready, deploy the application using the datafold-operator. See the [Datafold Helm Charts repository](https://github.com/datafold/helm-charts) for detailed application deployment instructions.
+
```
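The locals.tf settings listed above might look like the following sketch. All values are placeholders, and the exact local names should be checked against the example in this repository:

```hcl
locals {
  # Region and project to deploy into (placeholder values)
  provider_region = "europe-west4"
  project_id      = "my-datafold-project"

  # Pre-created symmetric KMS key, used only to encrypt/decrypt deployment secrets
  kms_profile = "deployment-admin"
  kms_key     = "alias/datafold-deployment"

  # Used in the Kubernetes namespace, container naming, and the Datadog "deployment" tag
  deployment_name = "datafold"
}
```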
```diff
+## Infrastructure Dependencies
+
+This module is designed to provide the complete infrastructure stack for a Datafold deployment. However, if you already have GKE infrastructure in place, you can choose to configure the required resources independently.
+
+**Required Infrastructure Components**:
+
+- GKE cluster with appropriate node pools
+- Cloud SQL PostgreSQL database
+- GCS bucket for ClickHouse backups
+- Persistent disks for persistent storage (ClickHouse data, ClickHouse logs, Redis data)
+- IAM roles and service accounts for cluster operations
+- Load balancer (optional, can be managed by the Google Cloud Load Balancer Controller)
+- VPC and networking components
+- SSL certificate (validation timing depends on the deployment method):
+  - **Terraform-managed LB**: the certificate must be pre-created and validated
+  - **Kubernetes-managed LB**: the certificate is created automatically and validated post-deployment
+
+**Alternative Approaches**:
+
+- **Use this module**: provides a complete infrastructure setup for new deployments
+- **Use existing infrastructure**: configure the required resources manually or through other means
+- **Hybrid approach**: use this module for some components and existing infrastructure for others
+
+For detailed specifications of each required component, see the [Datafold Dedicated Cloud GCP Deployment Documentation](https://docs.datafold.com/datafold-deployment/dedicated-cloud/gcp). For application deployment instructions, see the [Datafold Helm Charts repository](https://github.com/datafold/helm-charts).
+
```
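When bringing your own infrastructure, existing components can be referenced from Terraform rather than created by this module. A minimal sketch of the pattern, with placeholder cluster name, location, and project:

```hcl
# Reference an existing GKE cluster instead of creating one (hypothetical names)
data "google_container_cluster" "existing" {
  name     = "my-existing-cluster"
  location = "europe-west4"
  project  = "my-datafold-project"
}

# The cluster endpoint can then feed provider or operator configuration
output "existing_cluster_endpoint" {
  value = data.google_container_cluster.existing.endpoint
}
```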
```diff
+## Detailed Infrastructure Components
+
+Based on the [Datafold GCP Deployment Documentation](https://docs.datafold.com/datafold-deployment/dedicated-cloud/gcp), this module provisions the following detailed infrastructure components:
 
-* Implement the example in this repository
-* Change the settings
-* Run `terraform init`
-* Run `terraform apply`
+### Persistent Disks
+The Datafold application requires three persistent disks for storage, each deployed as an encrypted Google Compute Engine persistent disk in the primary availability zone:
 
-### Initializing the application
+- **ClickHouse data disk**: serves as the analytical database storage for Datafold. ClickHouse is a columnar database that excels at analytical queries. The default 40GB allocation usually provides sufficient space for typical deployments, but it can be scaled up based on data volume requirements.
+- **ClickHouse logs disk**: stores ClickHouse's internal logs and temporary data. A separate logs disk prevents log data from consuming the IOPS and I/O bandwidth needed for actual data storage.
+- **Redis data disk**: provides persistent storage for Redis, which handles task distribution and distributed locks in the Datafold application. Redis is memory-first but benefits from persistence for data durability across restarts.
 
-The deployment is created and the initjob should have created the databases and done the
-initialization of the site settings.
+All persistent disks are encrypted by default using Google-managed encryption keys, ensuring data security at rest.
 
```
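Each of the three disks described above corresponds roughly to an encrypted `google_compute_disk`. A hedged sketch of the pattern; the disk name, type, zone, and size are placeholders, not the module's actual resource definitions:

```hcl
# Hypothetical sketch of one of the persistent disks; the module's actual
# resource names and parameters may differ.
resource "google_compute_disk" "clickhouse_data" {
  name = "datafold-clickhouse-data"
  type = "pd-ssd"
  zone = "europe-west4-a" # primary availability zone (placeholder)
  size = 40               # GB; scale up based on data volume

  # No disk_encryption_key block: Google-managed encryption keys apply by default.
}
```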
```diff
-If that didn't complete successfully, try to restart the job.
+### Load Balancer
+The load balancer serves as the primary entry point for all external traffic to the Datafold application. The module offers two deployment strategies:
 
-Once the deployment is complete and the initjob succeeded, we can set the install to that for false in config.yaml:
+- **External Load Balancer Deployment** (enabled with `deploy_lb = true`): creates a Google Cloud Load Balancer through Terraform
+- **Kubernetes-Managed Load Balancer**: relies on the Google Cloud Load Balancer Controller running within the GKE cluster, deployed by the Datafold application resource. In this case, Kubernetes creates the load balancer for you.
 
```
````diff
-```
-initjob:
-  install: false
-```
+### GKE Cluster
+The Google Kubernetes Engine (GKE) cluster forms the compute foundation for the Datafold application:
 
-Alternatively, here are the manual steps to achieve the same:
+- **Network Architecture**: the entire cluster is deployed into private subnets with Cloud NAT for egress traffic
+- **Security Features**: Workload Identity, Shielded nodes, Binary Authorization, network policy, and private nodes
+- **Node Management**: supports up to three managed node pools with automatic scaling
 
````
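Node pool autoscaling of the kind described above is typically expressed like this. A sketch with placeholder names and machine type, not the module's exact configuration:

```hcl
# Hypothetical managed node pool with autoscaling; all names are placeholders.
resource "google_container_node_pool" "primary" {
  name    = "datafold-pool"
  cluster = "datafold-cluster"

  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }

  node_config {
    machine_type = "e2-standard-4"

    # Shielded nodes, as mentioned in the security features above
    shielded_instance_config {
      enable_secure_boot = true
    }
  }
}
```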
```diff
-Establish a shell into the `<deployment>-dfshell` container.
-It is likely that the scheduler and server containers are crashing in a loop.
+### IAM Roles and Permissions
+The IAM architecture follows the principle of least privilege:
 
-All we need to is to run these commands:
+- **GKE service account**: basic permissions for logging, monitoring, and storage access
+- **ClickHouse backup service account**: a custom role that lets ClickHouse make backups and store them in Cloud Storage
+- **Datafold service accounts**: pre-defined roles for the different application components
 
```
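With Workload Identity, a Kubernetes service account is mapped to a Google service account so pods can act with its permissions. A hedged sketch of that pattern; the project, namespace, and account names are hypothetical:

```hcl
# Hypothetical Workload Identity binding; project, namespace, and names are placeholders.
resource "google_service_account" "clickhouse_backup" {
  account_id   = "datafold-clickhouse-backup"
  display_name = "ClickHouse backup service account"
}

# Allow a Kubernetes service account to impersonate the Google service account
resource "google_service_account_iam_member" "workload_identity" {
  service_account_id = google_service_account.clickhouse_backup.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:my-datafold-project.svc.id.goog[datafold/clickhouse]"
}
```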
```diff
-1. `./manage.py clickhouse create-tables`
-2. `./manage.py database create-or-upgrade`
-3. `./manage.py installation set-new-deployment-params`
+### Cloud SQL Database
+The PostgreSQL Cloud SQL instance serves as the primary relational database:
 
-Now all containers should be up and running.
+- **Storage configuration**: starts with a 20GB initial allocation that can automatically scale up to 100GB
+- **High availability**: intentionally disabled by default to reduce costs and complexity
+- **Security and encryption**: data at rest is always encrypted using Google-managed encryption keys
 
 <!-- BEGIN_TF_DOCS -->
 
```
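The storage behavior described above maps onto Cloud SQL instance settings roughly as follows. This is a sketch; the instance name, tier, and PostgreSQL version are placeholders, and the module's actual configuration may differ:

```hcl
# Hypothetical Cloud SQL instance sketch; names and versions are placeholders.
resource "google_sql_database_instance" "datafold" {
  name             = "datafold-postgres"
  database_version = "POSTGRES_15"
  region           = "europe-west4"

  settings {
    tier                  = "db-custom-2-8192"
    disk_size             = 20  # GB, initial allocation
    disk_autoresize       = true
    disk_autoresize_limit = 100 # GB, upper bound for automatic growth

    availability_type = "ZONAL" # high availability (REGIONAL) disabled by default
  }
}
```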
Lines changed: 53 additions & 65 deletions
```diff
@@ -1,76 +1,64 @@
-resource "local_file" "infra_config" {
-  filename = "${path.module}/../application/infra.dec.yaml"
-  content = templatefile(
-    "${path.module}/../templates/datafold/infra_settings.tpl",
+# Output the infrastructure configuration to the console
+output "infra_config" {
+  description = "Infrastructure configuration for Datafold deployment"
+  value = templatefile(
+    "${path.module}/../templates/infra_settings.tpl",
     {
       aws_target_group_arn = "",
-      cluster_scaler_role_arn = "",
-      clickhouse_s3_backup_role = "",
-      clickhouse_data_size = module.gcp[0].clickhouse_data_size,
-      clickhouse_data_volume_id = module.gcp[0].clickhouse_data_volume_id,
-      clickhouse_gcs_bucket = module.gcp[0].clickhouse_gcs_bucket,
-      gcp_backup_account = module.gcp[0].clickhouse_backup_sa,
-      clickhouse_logs_size = module.gcp[0].clickhouse_logs_size,
-      clickhouse_log_volume_id = module.gcp[0].clickhouse_logs_volume_id,
+      gcp_backup_account = module.gcp.clickhouse_backup_sa,
+      clickhouse_data_size = module.gcp.clickhouse_data_size,
+      clickhouse_data_volume_id = module.gcp.clickhouse_data_volume_id,
+      clickhouse_gcs_bucket = module.gcp.clickhouse_gcs_bucket,
+      clickhouse_logs_size = module.gcp.clickhouse_logs_size,
+      clickhouse_log_volume_id = module.gcp.clickhouse_logs_volume_id,
       clickhouse_s3_bucket = "",
       clickhouse_s3_region = "",
-      clickhouse_azblob_account_name = "",
+      clickhouse_s3_backup_role = "",
+      clickhouse_azblob_client_id = "",
       clickhouse_azblob_container = "",
-      clickhouse_azblob_account_key = "",
-      cloud_provider = module.gcp[0].cloud_provider,
-      cluster_name = module.gcp[0].cluster_name,
-      gcp_neg_name = module.gcp[0].neg_name,
-      load_balancer_ips = jsondecode(module.gcp[0].lb_external_ip),
+      clickhouse_azblob_account_name = "",
+      cloud_provider = module.gcp.cloud_provider,
+      cluster_name = module.gcp.cluster_name,
+      gcp_neg_name = module.gcp.neg_name,
+      load_balancer_ips = jsondecode(module.gcp.lb_external_ip),
       load_balancer_controller_arn = "",
-      postgres_database = module.gcp[0].postgres_database_name,
-      postgres_password = module.gcp[0].postgres_password,
-      postgres_port = module.gcp[0].postgres_port,
-      postgres_server = module.gcp[0].postgres_host,
-      postgres_user = module.gcp[0].postgres_username,
-      redis_password = module.gcp[0].redis_password,
-      redis_data_size = module.gcp[0].redis_data_size,
-      redis_data_volume_id = module.gcp[0].redis_data_volume_id,
-      server_name = module.gcp[0].domain_name,
-      vpc_cidr = module.gcp[0].vpc_cidr,
+      cluster_scaler_role_arn = "",
+      postgres_database = local.database_name,
+      postgres_password = module.gcp.postgres_password,
+      postgres_port = module.gcp.postgres_port,
+      postgres_server = module.gcp.postgres_host,
+      postgres_user = module.gcp.postgres_username,
+      redis_data_size = module.gcp.redis_data_size,
+      redis_data_volume_id = module.gcp.redis_data_volume_id,
+      server_name = module.gcp.domain_name,
+      vpc_cidr = module.gcp.vpc_cidr,
 
       # service accounts vars
-      dfshell_role_arn = try(module.gcp[0].dfshell_role_arn, "")
-      dfshell_service_account_name = try(module.gcp[0].dfshell_service_account_name, "datafold-dfshell")
-      worker_portal_role_arn = try(module.gcp[0].worker_portal_role_arn, "")
-      worker_portal_service_account_name = try(module.gcp[0].worker_portal_service_account_name, "datafold-worker-portal")
-      operator_role_arn = try(module.gcp[0].operator_role_arn, "")
-      operator_service_account_name = try(module.gcp[0].operator_service_account_name, "datafold-operator")
-      server_role_arn = try(module.gcp[0].server_role_arn, "")
-      server_service_account_name = try(module.gcp[0].server_service_account_name, "datafold-server")
-      scheduler_role_arn = try(module.gcp[0].scheduler_role_arn, "")
-      scheduler_service_account_name = try(module.gcp[0].scheduler_service_account_name, "datafold-scheduler")
-      worker_role_arn = try(module.gcp[0].worker_role_arn, "")
-      worker_service_account_name = try(module.gcp[0].worker_service_account_name, "datafold-worker")
-      worker_catalog_role_arn = try(module.gcp[0].worker_catalog_role_arn, "")
-      worker_catalog_service_account_name = try(module.gcp[0].worker_catalog_service_account_name, "datafold-worker-catalog")
-      worker_interactive_role_arn = try(module.gcp[0].worker_interactive_role_arn, "")
-      worker_interactive_service_account_name = try(module.gcp[0].worker_interactive_service_account_name, "datafold-worker-interactive")
-      worker_singletons_role_arn = try(module.gcp[0].worker_singletons_role_arn, "")
-      worker_singletons_service_account_name = try(module.gcp[0].worker_singletons_service_account_name, "datafold-worker-singletons")
-      worker_lineage_role_arn = try(module.gcp[0].worker_lineage_role_arn, "")
-      worker_lineage_service_account_name = try(module.gcp[0].worker_lineage_service_account_name, "datafold-worker-lineage")
-      worker_monitor_role_arn = try(module.gcp[0].worker_monitor_role_arn, "")
-      worker_monitor_service_account_name = try(module.gcp[0].worker_monitor_service_account_name, "datafold-worker-monitor")
-      storage_worker_role_arn = try(module.gcp[0].storage_worker_role_arn, "")
-      storage_worker_service_account_name = try(module.gcp[0].storage_worker_service_account_name, "datafold-storage-worker")
-
+      dfshell_role_arn = module.gcp.dfshell_role_arn,
+      dfshell_service_account_name = module.gcp.dfshell_service_account_name,
+      worker_portal_role_arn = module.gcp.worker_portal_role_arn,
+      worker_portal_service_account_name = module.gcp.worker_portal_service_account_name,
+      operator_role_arn = module.gcp.operator_role_arn,
+      operator_service_account_name = module.gcp.operator_service_account_name,
+      server_role_arn = module.gcp.server_role_arn,
+      server_service_account_name = module.gcp.server_service_account_name,
+      scheduler_role_arn = module.gcp.scheduler_role_arn,
+      scheduler_service_account_name = module.gcp.scheduler_service_account_name,
+      worker_role_arn = module.gcp.worker_role_arn,
+      worker_service_account_name = module.gcp.worker_service_account_name,
+      worker_catalog_role_arn = module.gcp.worker_catalog_role_arn,
+      worker_catalog_service_account_name = module.gcp.worker_catalog_service_account_name,
+      worker_interactive_role_arn = module.gcp.worker_interactive_role_arn,
+      worker_interactive_service_account_name = module.gcp.worker_interactive_service_account_name,
+      worker_singletons_role_arn = module.gcp.worker_singletons_role_arn,
+      worker_singletons_service_account_name = module.gcp.worker_singletons_service_account_name,
+      worker_lineage_role_arn = module.gcp.worker_lineage_role_arn,
+      worker_lineage_service_account_name = module.gcp.worker_lineage_service_account_name,
+      worker_monitor_role_arn = module.gcp.worker_monitor_role_arn,
+      worker_monitor_service_account_name = module.gcp.worker_monitor_service_account_name,
+      storage_worker_role_arn = module.gcp.storage_worker_role_arn,
+      storage_worker_service_account_name = module.gcp.storage_worker_service_account_name,
     }
   )
-
-  provisioner "local-exec" {
-    environment = {
-      "AWS_PROFILE" : "${local.kms_profile}",
-      "SOPS_KMS_ARN" : "${local.kms_key}"
-    }
-    command = "sops --aws-profile ${local.kms_profile} --output '${path.module}/../application/infra.yaml' -e '${path.module}/../application/infra.dec.yaml'"
-  }
-
-  depends_on = [
-    module.gcp
-  ]
+  sensitive = false
 }
```
