
Commit 00d6d1f

docs: Update documentation for deployment
1 parent 6fb5dc6 commit 00d6d1f

5 files changed (+223, -113 lines)

README.md (105 additions, 30 deletions)
```diff
@@ -1,31 +1,55 @@
 =======
 # Datafold Google module
 
-This repository provisions resources on Google, preparing them for a deployment of the
-application on a GKE cluster.
+This repository provisions infrastructure resources on Google Cloud for deploying Datafold using the datafold-operator.
 
 ## About this module
 
+**⚠️ Important**: This module is now **optional**. If you already have GKE infrastructure in place, you can configure the required resources independently. This module is primarily intended for customers who need to set up the complete infrastructure stack for a GKE deployment.
+
+The module provisions the Google Cloud infrastructure resources required for a Datafold deployment. Application configuration is now managed through the `datafoldapplication` custom resource on the cluster using the datafold-operator, rather than through Terraform application directories.
+
+## Breaking Changes
+
+### Load Balancer Deployment (Default Changed)
+
+**Breaking Change**: The load balancer is **no longer deployed by default**; the default has changed to `deploy_lb = false`.
+
+- **Previous behavior**: the load balancer was deployed by default
+- **New behavior**: load balancer deployment is disabled by default
+- **Action required**: if you need a load balancer, you must explicitly set `deploy_lb = true` in your configuration so that it is not destroyed on the next apply. (If it is destroyed, redeploy it and then update your DNS to the new load balancer IP.)
+
+### Application Directory Removal
+
+- The "application" directory is no longer part of this repository
+- Application configuration is now managed through the `datafoldapplication` custom resource on the cluster
+
 ## Prerequisites
 
-* A Google cloud account, preferably a new isolated one.
+* A Google Cloud account, preferably a new isolated one.
 * Terraform >= 1.4.6
 * A customer contract with Datafold
   * The application does not work without credentials supplied by sales
 * Access to our public helm-charts repository
+* The datafold-operator installed on your GKE cluster
+  * Application configuration is managed through the `datafoldapplication` custom resource
 
-This deployment will create the following resources:
+The full deployment will create the following resources:
 
 * Google VPC
-* Google subnet
+* Google subnets
 * Google GCS bucket for clickhouse backups
-* Google external application load balancer
-* Google HTTPS certificate, unless preregistered and provided
+* Google Cloud Load Balancer (optional, disabled by default)
+* Google-managed SSL certificate (if the load balancer is enabled)
 * Three persistent disk volumes for local data storage
+* Cloud SQL PostgreSQL database
 * A GKE cluster
 * Service accounts for the GKE cluster to perform actions outside of its cluster boundary:
   * Provisioning persistent disk volumes
   * Updating Network Endpoint Group to route traffic to pods directly
+  * Managing GCS bucket access for ClickHouse backups
+
+**Infrastructure Dependencies**: For a complete list of required infrastructure resources and detailed deployment guidance, see the [Datafold Dedicated Cloud GCP Deployment Documentation](https://docs.datafold.com/datafold-deployment/dedicated-cloud/gcp).
 
 ## Negative scope
 
```
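To preserve an existing Terraform-managed load balancer across this change, the flag can be set explicitly where the module is instantiated. A minimal sketch, assuming a typical module block (the source path and other inputs are placeholders; only `deploy_lb` is named in the README above):

```hcl
module "gcp" {
  source = "../.."  # placeholder path to this module

  # Keep the Terraform-managed load balancer; the default is now false.
  deploy_lb = true

  # ... other required module inputs ...
}
```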
```diff
@@ -34,41 +58,92 @@ This deployment will create the following resources:
 ## How to use this module
 
 * See the example for a potential setup, which has dependencies on our helm-charts
-* Create secret files with our variables
 
-## Examples
+The example directory contains a single deployment example for infrastructure setup.
+
+Setting up the infrastructure:
+
+* It is easiest if you have full admin access in the target project.
+* Pre-create a symmetric encryption key that is used to encrypt/decrypt the secrets of this deployment.
+  * Use the alias instead of the `mrk` link. Put that into `locals.tf`.
+* **Certificate Requirements** (depend on the load balancer deployment method):
+  * **If deploying the load balancer from this Terraform module** (`deploy_lb = true`): pre-create and validate the SSL certificate in your DNS, then refer to that certificate in main.tf by its domain name (replace "datafold.example.com").
+  * **If deploying the load balancer from within Kubernetes**: the certificate is created automatically, but you must wait for it to become available and then validate it in your DNS after the deployment is complete.
+* Change the settings in locals.tf:
+  * provider_region = the region you want to deploy in.
+  * project_id = the GCP project ID where you want to deploy.
+  * kms_profile = the profile you want to use to issue the deployments. Targets the deployment account.
+  * kms_key = a pre-created symmetric KMS key. Its only purpose is the encryption/decryption of deployment secrets.
+  * deployment_name = the name of the deployment (used in the Kubernetes namespace, container naming, and the Datadog "deployment" Unified Tag).
+* Run `terraform init` in the `infra` directory.
+* Run `terraform apply` in the `infra` directory. This should complete without errors.
+* Check in the console that you see the GKE cluster, Cloud SQL database, etc.
+  * If you enabled load balancer deployment, check for the load balancer as well.
+
+**Application Deployment**: After the infrastructure is ready, deploy the application using the datafold-operator. See the [Datafold Helm Charts repository](https://github.com/datafold/helm-charts) for detailed application deployment instructions.
+
```
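The locals.tf settings listed above might look like the following sketch. All values are placeholders, and the exact local names should be checked against the example in this repository:

```hcl
locals {
  # Region and project to deploy into (placeholder values)
  provider_region = "europe-west4"
  project_id      = "my-datafold-project"

  # Pre-created symmetric KMS key, used only to encrypt/decrypt deployment secrets
  kms_profile = "deployment-admin"
  kms_key     = "alias/datafold-deployment"

  # Used in the Kubernetes namespace, container naming, and the Datadog "deployment" tag
  deployment_name = "datafold"
}
```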
```diff
+## Infrastructure Dependencies
+
+This module is designed to provide the complete infrastructure stack for a Datafold deployment. However, if you already have GKE infrastructure in place, you can choose to configure the required resources independently.
+
+**Required Infrastructure Components**:
+
+- GKE cluster with appropriate node pools
+- Cloud SQL PostgreSQL database
+- GCS bucket for ClickHouse backups
+- Persistent disks for persistent storage (ClickHouse data, ClickHouse logs, Redis data)
+- IAM roles and service accounts for cluster operations
+- Load balancer (optional, can be managed by the Google Cloud Load Balancer Controller)
+- VPC and networking components
+- SSL certificate (validation timing depends on the deployment method):
+  - **Terraform-managed LB**: the certificate must be pre-created and validated
+  - **Kubernetes-managed LB**: the certificate is created automatically and validated post-deployment
+
+**Alternative Approaches**:
+
+- **Use this module**: provides a complete infrastructure setup for new deployments
+- **Use existing infrastructure**: configure the required resources manually or through other means
+- **Hybrid approach**: use this module for some components and existing infrastructure for others
+
+For detailed specifications of each required component, see the [Datafold Dedicated Cloud GCP Deployment Documentation](https://docs.datafold.com/datafold-deployment/dedicated-cloud/gcp). For application deployment instructions, see the [Datafold Helm Charts repository](https://github.com/datafold/helm-charts).
+
```
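When bringing your own infrastructure, existing components can be referenced from Terraform rather than created by this module. A minimal sketch of the pattern, with placeholder cluster name, location, and project:

```hcl
# Reference an existing GKE cluster instead of creating one (hypothetical names)
data "google_container_cluster" "existing" {
  name     = "my-existing-cluster"
  location = "europe-west4"
  project  = "my-datafold-project"
}

# The cluster endpoint can then feed provider or operator configuration
output "existing_cluster_endpoint" {
  value = data.google_container_cluster.existing.endpoint
}
```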
```diff
+## Detailed Infrastructure Components
+
+Based on the [Datafold GCP Deployment Documentation](https://docs.datafold.com/datafold-deployment/dedicated-cloud/gcp), this module provisions the following detailed infrastructure components:
 
-* Implement the example in this repository
-* Change the settings
-* Run `terraform init`
-* Run `terraform apply`
+### Persistent Disks
+The Datafold application requires three persistent disks for storage, each deployed as an encrypted Google Compute Engine persistent disk in the primary availability zone:
 
-### Initializing the application
+- **ClickHouse data disk**: serves as the analytical database storage for Datafold. ClickHouse is a columnar database that excels at analytical queries. The default 40GB allocation usually provides sufficient space for typical deployments, but it can be scaled up based on data volume requirements.
+- **ClickHouse logs disk**: stores ClickHouse's internal logs and temporary data. A separate logs disk prevents log data from consuming the IOPS and I/O bandwidth needed for actual data storage.
+- **Redis data disk**: provides persistent storage for Redis, which handles task distribution and distributed locks in the Datafold application. Redis is memory-first but benefits from persistence for data durability across restarts.
 
-The deployment is created and the initjob should have created the databases and done the
-initialization of the site settings.
+All persistent disks are encrypted by default using Google-managed encryption keys, ensuring data security at rest.
 
```
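Each of the three disks described above corresponds roughly to an encrypted `google_compute_disk`. A hedged sketch of the pattern; the disk name, type, zone, and size are placeholders, not the module's actual resource definitions:

```hcl
# Hypothetical sketch of one of the persistent disks; the module's actual
# resource names and parameters may differ.
resource "google_compute_disk" "clickhouse_data" {
  name = "datafold-clickhouse-data"
  type = "pd-ssd"
  zone = "europe-west4-a" # primary availability zone (placeholder)
  size = 40               # GB; scale up based on data volume

  # No disk_encryption_key block: Google-managed encryption keys apply by default.
}
```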
```diff
-If that didn't complete successfully, try to restart the job.
+### Load Balancer
+The load balancer serves as the primary entry point for all external traffic to the Datafold application. The module offers two deployment strategies:
 
-Once the deployment is complete and the initjob succeeded, we can set the install to that for false in config.yaml:
+- **External Load Balancer Deployment** (enabled with `deploy_lb = true`): creates a Google Cloud Load Balancer through Terraform
+- **Kubernetes-Managed Load Balancer**: relies on the Google Cloud Load Balancer Controller running within the GKE cluster, deployed by the Datafold application resource. In this case, Kubernetes creates the load balancer for you.
 
```
````diff
-```
-initjob:
-  install: false
-```
+### GKE Cluster
+The Google Kubernetes Engine (GKE) cluster forms the compute foundation for the Datafold application:
 
-Alternatively, here are the manual steps to achieve the same:
+- **Network Architecture**: the entire cluster is deployed into private subnets with Cloud NAT for egress traffic
+- **Security Features**: Workload Identity, Shielded nodes, Binary Authorization, network policy, and private nodes
+- **Node Management**: supports up to three managed node pools with automatic scaling
 
````
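Node pool autoscaling of the kind described above is typically expressed like this. A sketch with placeholder names and machine type, not the module's exact configuration:

```hcl
# Hypothetical managed node pool with autoscaling; all names are placeholders.
resource "google_container_node_pool" "primary" {
  name    = "datafold-pool"
  cluster = "datafold-cluster"

  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }

  node_config {
    machine_type = "e2-standard-4"

    # Shielded nodes, as mentioned in the security features above
    shielded_instance_config {
      enable_secure_boot = true
    }
  }
}
```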
```diff
-Establish a shell into the `<deployment>-dfshell` container.
-It is likely that the scheduler and server containers are crashing in a loop.
+### IAM Roles and Permissions
+The IAM architecture follows the principle of least privilege:
 
-All we need to is to run these commands:
+- **GKE service account**: basic permissions for logging, monitoring, and storage access
+- **ClickHouse backup service account**: a custom role that lets ClickHouse make backups and store them in Cloud Storage
+- **Datafold service accounts**: pre-defined roles for the different application components
 
```
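With Workload Identity, a Kubernetes service account is mapped to a Google service account so pods can act with its permissions. A hedged sketch of that pattern; the project, namespace, and account names are hypothetical:

```hcl
# Hypothetical Workload Identity binding; project, namespace, and names are placeholders.
resource "google_service_account" "clickhouse_backup" {
  account_id   = "datafold-clickhouse-backup"
  display_name = "ClickHouse backup service account"
}

# Allow a Kubernetes service account to impersonate the Google service account
resource "google_service_account_iam_member" "workload_identity" {
  service_account_id = google_service_account.clickhouse_backup.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:my-datafold-project.svc.id.goog[datafold/clickhouse]"
}
```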
```diff
-1. `./manage.py clickhouse create-tables`
-2. `./manage.py database create-or-upgrade`
-3. `./manage.py installation set-new-deployment-params`
+### Cloud SQL Database
+The PostgreSQL Cloud SQL instance serves as the primary relational database:
 
-Now all containers should be up and running.
+- **Storage configuration**: starts with a 20GB initial allocation that can automatically scale up to 100GB
+- **High availability**: intentionally disabled by default to reduce costs and complexity
+- **Security and encryption**: data at rest is always encrypted using Google-managed encryption keys
 
 <!-- BEGIN_TF_DOCS -->
 
```
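The storage behavior described above maps onto Cloud SQL instance settings roughly as follows. This is a sketch; the instance name, tier, and PostgreSQL version are placeholders, and the module's actual configuration may differ:

```hcl
# Hypothetical Cloud SQL instance sketch; names and versions are placeholders.
resource "google_sql_database_instance" "datafold" {
  name             = "datafold-postgres"
  database_version = "POSTGRES_15"
  region           = "europe-west4"

  settings {
    tier                  = "db-custom-2-8192"
    disk_size             = 20  # GB, initial allocation
    disk_autoresize       = true
    disk_autoresize_limit = 100 # GB, upper bound for automatic growth

    availability_type = "ZONAL" # high availability (REGIONAL) disabled by default
  }
}
```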
Lines changed: 53 additions & 65 deletions
```diff
@@ -1,76 +1,64 @@
-resource "local_file" "infra_config" {
-  filename = "${path.module}/../application/infra.dec.yaml"
-  content = templatefile(
-    "${path.module}/../templates/datafold/infra_settings.tpl",
+# Output the infrastructure configuration to the console
+output "infra_config" {
+  description = "Infrastructure configuration for Datafold deployment"
+  value = templatefile(
+    "${path.module}/../templates/infra_settings.tpl",
     {
       aws_target_group_arn = "",
-      cluster_scaler_role_arn = "",
-      clickhouse_s3_backup_role = "",
-      clickhouse_data_size = module.gcp[0].clickhouse_data_size,
-      clickhouse_data_volume_id = module.gcp[0].clickhouse_data_volume_id,
-      clickhouse_gcs_bucket = module.gcp[0].clickhouse_gcs_bucket,
-      gcp_backup_account = module.gcp[0].clickhouse_backup_sa,
-      clickhouse_logs_size = module.gcp[0].clickhouse_logs_size,
-      clickhouse_log_volume_id = module.gcp[0].clickhouse_logs_volume_id,
+      gcp_backup_account = module.gcp.clickhouse_backup_sa,
+      clickhouse_data_size = module.gcp.clickhouse_data_size,
+      clickhouse_data_volume_id = module.gcp.clickhouse_data_volume_id,
+      clickhouse_gcs_bucket = module.gcp.clickhouse_gcs_bucket,
+      clickhouse_logs_size = module.gcp.clickhouse_logs_size,
+      clickhouse_log_volume_id = module.gcp.clickhouse_logs_volume_id,
       clickhouse_s3_bucket = "",
       clickhouse_s3_region = "",
-      clickhouse_azblob_account_name = "",
+      clickhouse_s3_backup_role = "",
+      clickhouse_azblob_client_id = "",
       clickhouse_azblob_container = "",
-      clickhouse_azblob_account_key = "",
-      cloud_provider = module.gcp[0].cloud_provider,
-      cluster_name = module.gcp[0].cluster_name,
-      gcp_neg_name = module.gcp[0].neg_name,
-      load_balancer_ips = jsondecode(module.gcp[0].lb_external_ip),
+      clickhouse_azblob_account_name = "",
+      cloud_provider = module.gcp.cloud_provider,
+      cluster_name = module.gcp.cluster_name,
+      gcp_neg_name = module.gcp.neg_name,
+      load_balancer_ips = jsondecode(module.gcp.lb_external_ip),
       load_balancer_controller_arn = "",
-      postgres_database = module.gcp[0].postgres_database_name,
-      postgres_password = module.gcp[0].postgres_password,
-      postgres_port = module.gcp[0].postgres_port,
-      postgres_server = module.gcp[0].postgres_host,
-      postgres_user = module.gcp[0].postgres_username,
-      redis_password = module.gcp[0].redis_password,
-      redis_data_size = module.gcp[0].redis_data_size,
-      redis_data_volume_id = module.gcp[0].redis_data_volume_id,
-      server_name = module.gcp[0].domain_name,
-      vpc_cidr = module.gcp[0].vpc_cidr,
+      cluster_scaler_role_arn = "",
+      postgres_database = local.database_name,
+      postgres_password = module.gcp.postgres_password,
+      postgres_port = module.gcp.postgres_port,
+      postgres_server = module.gcp.postgres_host,
+      postgres_user = module.gcp.postgres_username,
+      redis_data_size = module.gcp.redis_data_size,
+      redis_data_volume_id = module.gcp.redis_data_volume_id,
+      server_name = module.gcp.domain_name,
+      vpc_cidr = module.gcp.vpc_cidr,
 
       # service accounts vars
-      dfshell_role_arn = try(module.gcp[0].dfshell_role_arn, "")
-      dfshell_service_account_name = try(module.gcp[0].dfshell_service_account_name, "datafold-dfshell")
-      worker_portal_role_arn = try(module.gcp[0].worker_portal_role_arn, "")
-      worker_portal_service_account_name = try(module.gcp[0].worker_portal_service_account_name, "datafold-worker-portal")
-      operator_role_arn = try(module.gcp[0].operator_role_arn, "")
-      operator_service_account_name = try(module.gcp[0].operator_service_account_name, "datafold-operator")
-      server_role_arn = try(module.gcp[0].server_role_arn, "")
-      server_service_account_name = try(module.gcp[0].server_service_account_name, "datafold-server")
-      scheduler_role_arn = try(module.gcp[0].scheduler_role_arn, "")
-      scheduler_service_account_name = try(module.gcp[0].scheduler_service_account_name, "datafold-scheduler")
-      worker_role_arn = try(module.gcp[0].worker_role_arn, "")
-      worker_service_account_name = try(module.gcp[0].worker_service_account_name, "datafold-worker")
-      worker_catalog_role_arn = try(module.gcp[0].worker_catalog_role_arn, "")
-      worker_catalog_service_account_name = try(module.gcp[0].worker_catalog_service_account_name, "datafold-worker-catalog")
-      worker_interactive_role_arn = try(module.gcp[0].worker_interactive_role_arn, "")
-      worker_interactive_service_account_name = try(module.gcp[0].worker_interactive_service_account_name, "datafold-worker-interactive")
-      worker_singletons_role_arn = try(module.gcp[0].worker_singletons_role_arn, "")
-      worker_singletons_service_account_name = try(module.gcp[0].worker_singletons_service_account_name, "datafold-worker-singletons")
-      worker_lineage_role_arn = try(module.gcp[0].worker_lineage_role_arn, "")
-      worker_lineage_service_account_name = try(module.gcp[0].worker_lineage_service_account_name, "datafold-worker-lineage")
-      worker_monitor_role_arn = try(module.gcp[0].worker_monitor_role_arn, "")
-      worker_monitor_service_account_name = try(module.gcp[0].worker_monitor_service_account_name, "datafold-worker-monitor")
-      storage_worker_role_arn = try(module.gcp[0].storage_worker_role_arn, "")
-      storage_worker_service_account_name = try(module.gcp[0].storage_worker_service_account_name, "datafold-storage-worker")
-
+      dfshell_role_arn = module.gcp.dfshell_role_arn,
+      dfshell_service_account_name = module.gcp.dfshell_service_account_name,
+      worker_portal_role_arn = module.gcp.worker_portal_role_arn,
+      worker_portal_service_account_name = module.gcp.worker_portal_service_account_name,
+      operator_role_arn = module.gcp.operator_role_arn,
+      operator_service_account_name = module.gcp.operator_service_account_name,
+      server_role_arn = module.gcp.server_role_arn,
+      server_service_account_name = module.gcp.server_service_account_name,
+      scheduler_role_arn = module.gcp.scheduler_role_arn,
+      scheduler_service_account_name = module.gcp.scheduler_service_account_name,
+      worker_role_arn = module.gcp.worker_role_arn,
+      worker_service_account_name = module.gcp.worker_service_account_name,
+      worker_catalog_role_arn = module.gcp.worker_catalog_role_arn,
+      worker_catalog_service_account_name = module.gcp.worker_catalog_service_account_name,
+      worker_interactive_role_arn = module.gcp.worker_interactive_role_arn,
+      worker_interactive_service_account_name = module.gcp.worker_interactive_service_account_name,
+      worker_singletons_role_arn = module.gcp.worker_singletons_role_arn,
+      worker_singletons_service_account_name = module.gcp.worker_singletons_service_account_name,
+      worker_lineage_role_arn = module.gcp.worker_lineage_role_arn,
+      worker_lineage_service_account_name = module.gcp.worker_lineage_service_account_name,
+      worker_monitor_role_arn = module.gcp.worker_monitor_role_arn,
+      worker_monitor_service_account_name = module.gcp.worker_monitor_service_account_name,
+      storage_worker_role_arn = module.gcp.storage_worker_role_arn,
+      storage_worker_service_account_name = module.gcp.storage_worker_service_account_name,
     }
   )
-
-  provisioner "local-exec" {
-    environment = {
-      "AWS_PROFILE" : "${local.kms_profile}",
-      "SOPS_KMS_ARN" : "${local.kms_key}"
-    }
-    command = "sops --aws-profile ${local.kms_profile} --output '${path.module}/../application/infra.yaml' -e '${path.module}/../application/infra.dec.yaml'"
-  }
-
-  depends_on = [
-    module.gcp
-  ]
+  sensitive = false
 }
```
