ClickHouse Cloud Deployment

Complete guide for deploying your cloud cost analytics to ClickHouse Cloud with optional data anonymization for public dashboards.

Table of Contents

  1. Overview
  2. Quick Start
  3. Detailed Setup
  4. Switching Between Local and Cloud
  5. Data Anonymization
  6. GitHub Actions Automation
  7. Troubleshooting
  8. Cost Optimization
  9. Advanced: Normalization (Optional)
  10. Next Steps
  11. Resources

Overview

Two Deployment Modes

Local Development (default):

  • Data: Parquet files in viz_rill/data/
  • Visualization: Local Rill server
  • Command: make run-all

Cloud Production:

  • Data: ClickHouse Cloud database
  • Visualization: Rill Cloud or local Rill pointing to ClickHouse
  • Command: make run-all-cloud

Architecture

Local Mode:
  Cloud APIs → dlt → Parquet files → Rill (DuckDB) → localhost:9009

Cloud Mode:
  Cloud APIs → dlt → ClickHouse Cloud → Rill Cloud → Public dashboard

Key Feature: Same codebase works for both! Just switch environment variables.

Quick Start

Prerequisites

  • ✅ Working local setup (make run-all works)
  • ✅ ClickHouse Cloud account (sign up free)

5-Minute Setup

# 1. Add ClickHouse credentials to .dlt/secrets.toml
[destination.clickhouse.credentials]
host = "xxxxx.europe-west4.gcp.clickhouse.cloud"
port = 8443
username = "default"
password = "your-password"
secure = 1

# 2. Initialize database
make init-clickhouse

# 3. Load data to ClickHouse
make run-etl-clickhouse

# 4. (Optional) Anonymize for public dashboards
make anonymize-clickhouse

# 5. Configure Rill to use ClickHouse
cd viz_rill
echo 'RILL_CONNECTOR=clickhouse' >> .env
echo 'CONNECTOR_CLICKHOUSE_DSN=clickhouse://user:pass@host:8443/default?secure=true' >> .env

# 6. View dashboards
cd .. && make serve

Done! Your dashboards now show ClickHouse data.

Detailed Setup

Step 1: Create ClickHouse Service

  1. Go to ClickHouse Cloud Console
  2. Click "Create new service"
  3. Choose:
    • Provider: GCP, AWS, or Azure
    • Region: Choose closest to your data sources
    • Tier: Development (free tier) or Production
  4. Click "Create service"
  5. Save credentials (you won't see the password again):
    Host: xxxxx.region.gcp.clickhouse.cloud
    Port: 8443
    Username: default
    Password: [copy this!]
    

Step 2: Configure Local Credentials

Edit .dlt/secrets.toml:

[destination.clickhouse.credentials]
host = "xxxxx.europe-west4.gcp.clickhouse.cloud"
port = 8443
username = "default"
password = "your-secure-password"
secure = 1

See .dlt/secrets.toml.example for the full template.

Step 3: Initialize Database

make init-clickhouse

This creates the necessary database and permissions for dlt.
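
If you want to confirm the step succeeded, a couple of read-only checks in the ClickHouse Cloud SQL console (or via clickhouse-client) are enough:

-- Sanity checks after initialization
SHOW DATABASES;   -- the database dlt loads into should be listed
SHOW GRANTS;      -- the current user should have the permissions dlt needs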

Step 4: Test Data Ingestion

Start with one pipeline:

make run-aws-clickhouse

Verify data arrived:

# Using clickhouse-client
clickhouse-client --host xxxxx.clickhouse.cloud --secure \
  --password 'your-password' \
  --query "SELECT count(*) FROM aws_costs___cur_export_test_00001"

Or use the ClickHouse Cloud SQL console in the web UI.
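
If you use the SQL console, the equivalent check is:

-- Run in the ClickHouse Cloud SQL console
SHOW TABLES;
SELECT count(*) FROM aws_costs___cur_export_test_00001;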

Step 5: Run All Pipelines

make run-etl-clickhouse

This runs the AWS, GCP, and Stripe pipelines and loads their output into ClickHouse.

Switching Between Local and Cloud

Mode Switching

The project uses environment variables to switch between local and cloud:

ETL destination (set automatically by Makefile):

  • DLT_DESTINATION=filesystem → Parquet files (local)
  • DLT_DESTINATION=clickhouse → ClickHouse Cloud

Rill connector (set in viz_rill/.env):

  • RILL_CONNECTOR="" or not set → DuckDB reads parquet (local)
  • RILL_CONNECTOR=clickhouse → Rill queries ClickHouse

Local Mode (Default)

# ETL to parquet
make run-all

# Rill automatically uses parquet files
make serve

No configuration needed! This is the default.

Cloud Mode

Option A: ETL + View locally

# 1. Load to ClickHouse
make run-etl-clickhouse

# 2. Configure Rill
cd viz_rill
cat > .env << 'EOF'
RILL_CONNECTOR=clickhouse
CONNECTOR_CLICKHOUSE_DSN=clickhouse://user:pass@host:8443/default?secure=true
EOF

# 3. View dashboards
cd .. && make serve

Option B: Full Cloud Pipeline (ETL + Anonymize)

# Loads data and anonymizes in one command
make run-all-cloud

# Then configure Rill as above

Switching Back to Local

cd viz_rill
# Set RILL_CONNECTOR to empty (or remove it) to switch back to parquet
echo 'RILL_CONNECTOR=""' > .env
cd .. && make serve

Rill now reads parquet files again.

Data Structure

ClickHouse tables created:

aws_costs___cur_export_test_00001      # AWS raw data
gcp_costs___bigquery_billing_table     # GCP raw data
stripe_costs___balance_transaction     # Stripe raw data

aws_costs                              # Rill model (materialized)
gcp_costs                              # Rill model (materialized)
stripe_revenue                         # Rill model (materialized)
unified_cost_model                     # Rill model (materialized)
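
To see which of these tables exist and roughly how large they are, query ClickHouse's system catalog:

-- List tables in the current database with approximate row counts
SELECT name, total_rows
FROM system.tables
WHERE database = currentDatabase()
ORDER BY name;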

Data Anonymization

For public dashboards, anonymize your cost data:

What Gets Anonymized

  • Cost values: Multiplied by a random 2-8x factor
  • Row duplication: 3x more rows by default
  • IDs hashed: Account and project IDs are anonymized
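
Conceptually, the transform looks roughly like the query below. This is an illustrative sketch only, not the actual anonymization script; the cost column name (line_item_unblended_cost) is an assumption.

-- Illustrative sketch of the anonymization transform (not the real script)
SELECT
  *,
  -- rand() returns a UInt32, so dividing by 4294967295 yields a value in [0, 1]
  line_item_unblended_cost * (2 + (rand() / 4294967295) * 6) AS anonymized_cost  -- random 2-8x factor
FROM aws_costs___cur_export_test_00001;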

Usage

# After loading data to ClickHouse
make anonymize-clickhouse

# Customize with environment variables
COST_MULTIPLIER_MIN=5.0 COST_MULTIPLIER_MAX=10.0 DUPLICATE_ROWS=5 make anonymize-clickhouse

Clearing Data

# Interactive (asks for confirmation)
make clear-clickhouse

# Force mode (for scripts)
make clear-clickhouse-force

# Dry run (see what would be deleted)
uv run python scripts/clear_clickhouse.py --dry-run

This drops all dlt tables AND Rill model tables.
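
The manual equivalent, per table, is a plain DROP (the make targets handle every dlt and model table for you):

-- Manual equivalent for individual tables
DROP TABLE IF EXISTS aws_costs___cur_export_test_00001;
DROP TABLE IF EXISTS aws_costs;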

Environment Variables

Variable             Default  Description
COST_MULTIPLIER_MIN  2.0      Minimum cost multiplier
COST_MULTIPLIER_MAX  8.0      Maximum cost multiplier
DUPLICATE_ROWS       3        Row duplication factor

See ANONYMIZATION.md for details.

GitHub Actions Automation

Automate daily data loads to ClickHouse.

Setup

  1. Add GitHub Secrets (Settings → Secrets → Actions):

    • CLICKHOUSE_HOST - Your ClickHouse host
    • CLICKHOUSE_USERNAME - Usually default
    • CLICKHOUSE_PASSWORD - Your ClickHouse password
    • AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
    • GCP_PROJECT_ID, GCP_PRIVATE_KEY, GCP_CLIENT_EMAIL
    • STRIPE_SECRET_KEY
  2. Workflow already exists at .github/workflows/etl-pipeline.yml

  3. Test manual trigger:

    • Go to Actions tab → Cloud Cost ETL Pipeline
    • Click "Run workflow"
    • Monitor progress

Schedule

  • Runs daily at 2 AM UTC automatically
  • Loads only new data (incremental)
  • ~5-10 minutes per run

Monitoring

  • Check Actions tab for status
  • Green checkmark = success
  • Logs available in each workflow run

Troubleshooting

Connection Issues

"Cannot connect to ClickHouse"

Solutions:

  • Verify the ClickHouse service is running (check the cloud console)
  • Check that the host includes the full domain: xxxxx.region.gcp.clickhouse.cloud
  • Ensure secure = 1 is set in secrets.toml
  • Test connection:
    clickhouse-client --host xxx --secure --password 'xxx' --query "SELECT 1"

Authentication Errors

"Authentication failed"

Solutions:

  • Verify credentials in .dlt/secrets.toml
  • No trailing spaces in values
  • Try regenerating password in ClickHouse console
  • Update both local config and GitHub Secrets

Rill Connection Issues

"Model execution is disabled" or "read_parquet not found"

Solutions:

  • Check RILL_CONNECTOR=clickhouse is set in viz_rill/.env
  • Verify ClickHouse DSN is correct
  • Make sure rill.yaml has olap_connector: clickhouse
  • Restart Rill after changing config

Data Issues

"No data in dashboards"

Solutions:

  • Verify data loaded: SELECT count(*) FROM aws_costs___cur_export_test_00001
  • Check Rill is using correct connector
  • Look for errors in Rill startup logs
  • Verify models reconciled successfully

"Dashboards show errors on some visualizations"

Solutions:

  • Some visualizations may need column adjustments
  • Check metrics files reference correct columns
  • Models select all columns with SELECT *, so schema should match

Performance Issues

"Queries are slow"

Solutions:

  • Add indexes in ClickHouse for common query patterns
  • Use materialized views for aggregations
  • Optimize metrics to reduce data scanned
  • Consider table partitioning for large datasets
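
For example, a pre-aggregated daily table can cut the data scanned by dashboards. The sketch below is illustrative only; the CUR column names (line_item_usage_start_date, line_item_unblended_cost) are assumptions and may differ in your export:

-- Illustrative daily aggregation table (column names are assumptions)
CREATE TABLE aws_costs_daily
ENGINE = SummingMergeTree
ORDER BY (usage_date, service)
AS
SELECT
  toDate(line_item_usage_start_date) AS usage_date,
  product_product_name AS service,
  sum(line_item_unblended_cost) AS total_cost
FROM aws_costs
GROUP BY usage_date, service;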

Clearing and Reloading

Start fresh:

# 1. Clear ClickHouse
make clear-clickhouse-force

# 2. Reload data
make run-etl-clickhouse

# 3. Anonymize if needed
make anonymize-clickhouse

Cost Optimization

ClickHouse Cloud Costs

  • Free tier: Development tier available
  • Idle scaling: Auto-scales to zero when not in use
  • Optimization: Run ETL once daily (sufficient for most use cases)

GitHub Actions Costs

  • Free tier: 2,000 minutes/month for private repos (GitHub-hosted runners are free for public repos)
  • Usage: ~5-10 minutes per run × 30 days ≈ 150-300 minutes/month
  • Well under the limit for a daily schedule

Advanced: Normalization (Optional)

Are normalization scripts still needed?

Short answer: No, normalization is optional for both local and cloud modes.

The normalization scripts (normalize.py, normalize_gcp.py) are now optional because:

  • Models use SELECT * to get all columns from raw data
  • ClickHouse and DuckDB can query nested structures directly
  • Current dashboards work perfectly with raw, unnormalized data
  • Models add derived columns (like product_product_name) where needed
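
As a rough illustration (not the project's actual model), a model keeps all raw columns and adds derived ones on top; the column name used below is an assumption about the raw schema:

-- Hypothetical sketch of a model: keep all raw columns, add derived ones
SELECT
  *,
  date_trunc('month', line_item_usage_start_date) AS usage_month  -- illustrative derived column
FROM aws_costs___cur_export_test_00001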

When to use normalization

Only use normalization if you need:

  • CUR Wizard dynamic dashboards (canvas generation)
  • Pre-flattened tag/label columns for external tools
  • Compatibility with tools that can't query nested structures

Current approach (no normalization):

  1. Pipelines load raw data to parquet/ClickHouse
  2. Models query raw data and add derived columns
  3. Dashboards use models

If you still want normalization

Important: Dynamic dashboard generation (make aws-dashboards / make gcp-dashboards) only works in local mode because it requires parquet files to analyze. ClickHouse mode doesn't create parquet files.

For local mode only:

# These commands work ONLY with local parquet files
make aws-dashboards   # Requires: viz_rill/data/aws_costs/*.parquet
make gcp-dashboards   # Requires: viz_rill/data/gcp_costs/*.parquet

For ClickHouse (alternative approach):

-- Create a materialized view instead of running Python normalization
-- Note: ClickHouse materialized views need an ENGINE (or a TO target table);
-- POPULATE backfills rows that already exist in the source table
CREATE MATERIALIZED VIEW aws_costs_normalized
ENGINE = MergeTree ORDER BY tuple()
POPULATE
AS
SELECT
  *,
  resource_tags['Environment'] AS tag_environment,  -- assumes resource_tags is a Map(String, String) column
  resource_tags['Team'] AS tag_team
FROM aws_costs___cur_export_test_00001;

ClickHouse materialized views are:

  • ✅ Faster than Python scripts
  • ✅ No intermediate parquet files needed
  • ✅ Updated automatically as new data arrives
  • ✅ Native ClickHouse performance

Next Steps

After successful ClickHouse deployment:

  1. Configure Rill Cloud (optional):

    • Deploy dashboards: cd viz_rill && rill deploy
    • Set ClickHouse credentials in Rill Cloud settings
    • Share dashboards with team
  2. Set up monitoring (optional):

    • Add Slack/email notifications to workflow
    • Monitor data freshness
    • Set up alerts for failed runs
  3. Optimize performance (optional):

    • Add ClickHouse indexes for common queries
    • Create aggregation tables
    • Adjust workflow frequency based on needs

Resources