Commit f669ead: "update readme with ClickHouse cloud ready version" (parent: 4a7e293)

1 file changed: README.md (144 additions, 56 deletions)
@@ -1,6 +1,17 @@
 # Cloud Cost Analyzer Project
 
-Multi-cloud cost analytics platform combining AWS Cost and Usage Reports (CUR), GCP billing data, and Stripe revenue metrics. Built with dlt for data ingestion, DuckDB for storage, and Rill for visualization.
+Multi-cloud cost analytics platform combining AWS Cost and Usage Reports (CUR), GCP billing data, and Stripe revenue metrics. Built with dlt for data ingestion, DuckDB/ClickHouse for storage, and Rill for visualization.
+
+> **NEW: Cloud-Ready Version**
+>
+> This version now supports **both local and cloud deployment**:
+> - **Local Mode**: Parquet files + DuckDB + local Rill (perfect for development)
+> - **Cloud Mode**: ClickHouse Cloud + Rill Cloud + GitHub Actions automation (production-ready)
+>
+> Switch between modes with a single command. The same codebase works everywhere!
+>
+> See [CLICKHOUSE.md](CLICKHOUSE.md) for the cloud deployment guide.
+> Looking for the original local-only version? Check out [branch `v1`](https://github.com/ssp-data/cloud-cost-analyzer/tree/v1).
 
 ![](img/tech-stack.png)
 
@@ -9,8 +20,11 @@ Multi-cloud cost analytics platform combining AWS Cost and Usage Reports (CUR),
 
 - **Multi-Cloud Cost Tracking** - AWS, GCP, and future cloud providers
 - **Revenue Integration** - Stripe payment data for margin analysis
+- **Dual Deployment Modes** - Run locally with Parquet/DuckDB or in the cloud with ClickHouse/Rill Cloud
 - **Incremental Loading** - Efficient append-only data pipeline with dlt
 - **Advanced Analytics** - RI/SP utilization, unit economics, effective cost tracking (adapted from [aws-cur-wizard](https://github.com/Twing-Data/aws-cur-wizard))
+- **GitHub Actions Automation** - [Daily data updates](.github/workflows/etl-pipeline.yml) with automated ETL pipelines
+- **Data Anonymization** - Built-in anonymization for public dashboards (see [ANONYMIZATION.md](ANONYMIZATION.md))
 - **Dynamic Dashboards** - Powered by Rill visualizations
 
 ## Quick Start with Demo Data
@@ -29,24 +43,34 @@ Opens at http://localhost:9009 with sample data.
 
 ## How it works
 
-Once setup, we can run these seperate commands to run:
-```sh
-# View static dashboards (always available)
-make serve
+### Two Deployment Modes
+
+This project supports both local development and cloud production:
 
-# Generate dynamic dashboards (optional)
-make aws-dashboards
+**Local Mode** (default):
+```sh
+make run-all   # ETL → Parquet files → Local Rill dashboard
+make serve     # View dashboards at localhost:9009
+```
+Perfect for: Development, testing, small datasets
 
-# Complete workflow (local)
-make run-all
+**Cloud Mode** (production-ready):
+```sh
+make run-all-cloud   # ETL → ClickHouse Cloud → Anonymize → Rill Cloud/Local
+```
+Perfect for: Production, team collaboration, large datasets, public dashboards
 
-# Cloud deployment with anonymized data
-make run-all-cloud
+**Additional commands:**
+```sh
+make aws-dashboards     # Generate dynamic AWS dashboards (local mode only)
+make serve-clickhouse   # View dashboards connected to ClickHouse
 ```
 
-**Cloud deployment guides:**
-- [CLICKHOUSE.md](CLICKHOUSE.md) - Complete ClickHouse Cloud setup, deployment, and switching guide
-- [ANONYMIZATION.md](ANONYMIZATION.md) - Data anonymization for public dashboards
+### Deployment Guides
+
+- **[CLICKHOUSE.md](CLICKHOUSE.md)** - Complete ClickHouse Cloud setup, deployment, mode switching, and GitHub Actions automation
+- **[ANONYMIZATION.md](ANONYMIZATION.md)** - Data anonymization for public dashboards
+- **[CLAUDE.md](CLAUDE.md)** - Architecture details and development guide
 
 
 ## Setup
@@ -217,30 +241,43 @@ The `initial_start_date` parameter controls how far back to load historical data
 **Note about AWS table_name and Rill dashboards:**
 If you change the AWS `table_name` from the default `cur_export_test_00001`, you'll also need to update the parquet path in `viz_rill/models/aws_costs.sql` (file has comments showing where).
 
-#### Cloud Deployment (ClickHouse)
-This repo supports both local (default) and cloud deployment:
+#### Cloud Deployment with ClickHouse & Rill Cloud
+
+**New in this version**: Deploy to production with ClickHouse Cloud and automate with GitHub Actions!
 
-- **Local mode** (default): Parquet files + Rill local server
-- **Cloud mode**: ClickHouse Cloud + Rill Cloud or local Rill
+**Quick Cloud Setup:**
 
-To deploy to ClickHouse Cloud, see [CLICKHOUSE.md](CLICKHOUSE.md) for complete setup instructions including:
-- ClickHouse Cloud account setup
-- Switching between local and cloud modes
-- Data anonymization for public dashboards
-- GitHub Actions automation
-- Troubleshooting
+1. **Create ClickHouse Cloud service** ([sign up free](https://clickhouse.cloud))
 
-Short setup version:
+2. **Add credentials to `.dlt/secrets.toml`:**
+```toml
+[destination.clickhouse.credentials]
+host = "xxxxx.europe-west4.gcp.clickhouse.cloud"
+port = 8443
+username = "default"
+password = "your-password"
+secure = 1
+```
 
-1. In rill.yaml change `olap_connector: clickhouse` to clickhouse
-2. set `RILL_CONNECTOR="clickhouse"` in .env in `viz_rill/.env` and add DNS a valid path for ClickHouse`CONNECTOR_CLICKHOUSE_DSN="https://<HOST>.europe-west4.gcp.clickhouse.cloud:8443?username=default&password=<PASSWORD>&secure=true&skip_verify=true"`
-3. use ENV `DLT_DESTINATION=clickhouse`, but it will be set automatically inside Makefile
-1. in rill.yaml change `olap_connector: clickhouse` to clickhouse
+3. **Configure Rill for ClickHouse** in `viz_rill/.env`:
+```bash
+RILL_CONNECTOR="clickhouse"
+CONNECTOR_CLICKHOUSE_DSN="clickhouse://default:password@host:8443/default?secure=true"
+```
 
+4. **Run cloud pipeline:**
+```bash
+make init-clickhouse   # Initialize database (one-time)
+make run-all-cloud     # ETL + anonymize + serve
+```
 
-After running the clickhouse pipeline with `make run-all-cloud`, it will load all data into clickhouse and serve rill from ClickHouse.
+**What you get:**
+- Data stored in ClickHouse Cloud (scalable, fast)
+- [Automated daily updates via GitHub Actions](.github/workflows/etl-pipeline.yml) (runs at 2 AM UTC)
+- Optional data anonymization for public dashboards
+- Works with both Rill Cloud and local Rill
 
-Long version with details in [ClickHouse Setup](CLICKHOUSE.md).
+**Complete guide**: See [CLICKHOUSE.md](CLICKHOUSE.md) for detailed setup, GitHub Actions configuration, mode switching, and troubleshooting.

 ### 4. Run the Pipeline
 
@@ -252,6 +289,7 @@ make serve # Opens Rill dashboards
 
 ## How the Data Pipeline Works
 
+This pipeline supports both **local mode** (Parquet + DuckDB) and **cloud mode** (ClickHouse Cloud). The architecture stays the same; only the storage layer changes.
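The storage-layer switch can be sketched in a few lines. This is illustrative only; in the repo, the actual selection happens inside the Makefile via the `DLT_DESTINATION` environment variable:

```python
import os

def pick_destination(env=None) -> str:
    # Mirrors the documented behavior: "duckdb" (default) keeps storage
    # local, "clickhouse" routes the same pipeline to ClickHouse Cloud.
    env = os.environ if env is None else env
    dest = env.get("DLT_DESTINATION", "duckdb")
    if dest not in {"duckdb", "clickhouse"}:
        raise ValueError(f"unsupported DLT_DESTINATION: {dest!r}")
    return dest

print(pick_destination({}))                                 # local mode default
print(pick_destination({"DLT_DESTINATION": "clickhouse"}))  # cloud mode
```

Because only this one variable changes, every pipeline script stays identical across modes.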

 ## The Data Flow
 ```mermaid
@@ -280,13 +318,12 @@ subgraph "2: NORMALIZE (Python + DuckDB)"
 P3 --> R1
 end
 
-subgraph "3: RAW STORAGE (Parquet)"
-R1[data/aws_costs/<br/>cur_export_test_00001/<br/>*.parquet]
-R2[data/gcp_costs/<br/>normalized.parquet]
-R3[data/stripe_costs/<br/>balance_transaction.parquet]
+subgraph "3: STORAGE (Dual Mode)"
+R1[LOCAL: Parquet files<br/>or<br/>CLOUD: ClickHouse tables]
+R2[Switch with DLT_DESTINATION env var]
 
 N1 -.-> R1
-N2 --> R2
+N2 --> R1
 P1 --> R1
 end
 
@@ -297,8 +334,8 @@ subgraph "4: MODEL (SQL - Star Schema)"
 M4[unified_cost_model.sql<br/>🌟 UNION ALL + Currency Conversion]
 
 R1 --> M1
-R2 --> M2
-R3 --> M3
+R1 --> M2
+R1 --> M3
 
 M1 --> M4
 M2 --> M4
@@ -328,6 +365,8 @@ style P2 fill:#4A90E2,stroke:#2E5C8A,color:#fff
 style P3 fill:#4A90E2,stroke:#2E5C8A,color:#fff
 style N1 fill:#9B59B6,stroke:#7D3C98,color:#fff
 style N2 fill:#9B59B6,stroke:#7D3C98,color:#fff
+style R1 fill:#FF6B6B,stroke:#C92A2A,color:#fff
+style R2 fill:#FF6B6B,stroke:#C92A2A,color:#fff
 style M4 fill:#E74C3C,stroke:#C0392B,color:#fff
 style MV3 fill:#27AE60,stroke:#1E8449,color:#fff
 style D3 fill:#F39C12,stroke:#D68910,color:#fff
@@ -344,22 +383,38 @@ style D3 fill:#F39C12,stroke:#D68910,color:#fff
 
 ### Incremental Loading
 
-Uses `write_disposition="append"` - cost data is append-only (no updates/merges needed).
+Uses `write_disposition="append"` for most sources - cost data is append-only (no updates needed). AWS uses `write_disposition="merge"` for hard deduplication.
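The difference between the two write dispositions can be modeled in a few lines. This is a toy sketch of the semantics only (dlt implements this internally), and the `line_item_id` key is a placeholder, not a column name from the repo:

```python
def load(existing, incoming, write_disposition="append", primary_key=()):
    """Toy model of dlt write dispositions: "append" keeps every row,
    "merge" replaces rows sharing the same primary key (deduplication)."""
    if write_disposition == "append":
        return existing + incoming
    assert write_disposition == "merge" and primary_key
    rows = {tuple(r[k] for k in primary_key): r for r in existing}
    for r in incoming:
        rows[tuple(r[k] for k in primary_key)] = r
    return list(rows.values())

old = [{"line_item_id": "a", "cost": 1.0}]
new = [{"line_item_id": "a", "cost": 1.5}, {"line_item_id": "b", "cost": 2.0}]
print(load(old, new, "merge", primary_key=("line_item_id",)))
```

With "merge", a re-delivered AWS CUR row updates the existing record instead of duplicating it; with "append", every incoming row is simply added.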

-### Data Flow
+### Data Flow by Mode
+
+**Local Mode:**
+```
+Cloud Providers      dlt Pipelines         Storage            Visualization
+AWS S3 (CUR)      →  aws_pipeline.py    →  Parquet files   →  Rill (DuckDB)
+GCP BigQuery      →  google_bq_*.py     →  viz_rill/data/  →  localhost:9009
+Stripe API        →  stripe_pipeline.py →                  →
+```
 
+**Cloud Mode:**
 ```
-Cloud Providers dlt Pipelines Storage Visualization
-AWS S3 (CUR) →→ aws_pipeline.py →→ Parquet files →→ Rill Dashboards
-GCP BigQuery →→ google_bq_*.py →→ viz_rill/data/ →→ localhost:9009
-Stripe API →→ stripe_pipeline.py →→ →→
+Cloud Providers      dlt Pipelines         Storage              Visualization
+AWS S3 (CUR)      →  aws_pipeline.py    →  ClickHouse Cloud  →  Rill Cloud/Local
+GCP BigQuery      →  google_bq_*.py     →  (via dlt)         →  (connects to CH)
+Stripe API        →  stripe_pipeline.py →                    →
+
+GitHub Actions (automated daily at 2 AM UTC)
 ```
+See [workflow configuration](.github/workflows/etl-pipeline.yml) for details.
 
 ### Output
 
-Data is stored in both formats:
+**Local Mode:**
+- **Parquet files**: `viz_rill/data/` (primary storage, used by Rill via DuckDB)
 - **DuckDB**: `cloud_cost_analytics.duckdb` (legacy, optional)
-- **Parquet**: `viz_rill/data/` (used by Rill dashboards)
+
+**Cloud Mode:**
+- **ClickHouse tables**: Scalable cloud database
+- **Automated updates**: Via [GitHub Actions](.github/workflows/etl-pipeline.yml) (daily at 2 AM UTC)
 
 ## Troubleshooting
 
@@ -414,9 +469,9 @@ The normalization scripts (`normalize.py`, `normalize_gcp.py`) flatten nested da
 ### Do You Need It?
 
 It works without it, too. The core dashboards work without normalization:
-- Static dashboards (`viz_rill/dashboards/*.yaml`) query raw data via models
-- Models use `SELECT *` to read all columns from raw parquet/ClickHouse
-- Everything works for both local and cloud deployment
+- Static dashboards (`viz_rill/dashboards/*.yaml`) query raw data via models
+- Models use `SELECT *` to read all columns from raw parquet/ClickHouse
+- Everything works for both local and cloud deployment
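For intuition, the flattening the normalization scripts perform looks roughly like this. It is a simplified sketch assuming dlt's `__` separator convention; the nested `service.description` record is a made-up example, not actual GCP billing schema:

```python
def flatten(record, parent="", sep="__"):
    # Recursively joins nested keys with "__" (dlt-style), so downstream
    # SQL models can address every field as a flat column.
    out = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            out.update(flatten(value, name, sep))
        else:
            out[name] = value
    return out

row = {"service": {"description": "Compute Engine"}, "cost": 0.42}
print(flatten(row))  # → {'service__description': 'Compute Engine', 'cost': 0.42}
```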

 But it ships useful dashboards (already pre-committed in this repo); if you have different data, I'd suggest running the normalization. It provides:
 - Auto-generated dimension-specific canvases (e.g., per-tag breakdowns)
@@ -460,21 +515,54 @@ See [CLICKHOUSE.md](CLICKHOUSE.md#advanced-normalization-optional) for more deta
 
 ## Complete Workflow
 
+### Local Development Workflow
+
 ```bash
-# Full workflow: ETL + dashboards
+# Full local workflow: ETL + dashboards
 make run-all
 
 # Or step-by-step:
-make run-etl            # 1. Load AWS/GCP/Stripe data
+make run-etl            # 1. Load AWS/GCP/Stripe data → Parquet files
 make aws-dashboards     # 2. (Optional) Generate dynamic dashboards
-make serve              # 3. View in browser
+make serve              # 3. View in browser at localhost:9009
+```
+
+### Cloud Production Workflow
+
+```bash
+# Full cloud workflow: ETL + anonymize + serve
+make run-all-cloud
+
+# Or step-by-step:
+make init-clickhouse        # 1. Initialize ClickHouse (one-time)
+make run-etl-clickhouse     # 2. Load data → ClickHouse Cloud
+make anonymize-clickhouse   # 3. (Optional) Anonymize for public dashboards
+make serve-clickhouse       # 4. View dashboards (local Rill → ClickHouse)
+```
+
+**Switching modes:**
+```bash
+make set-connector-duckdb       # Switch to local Parquet/DuckDB
+make set-connector-clickhouse   # Switch to ClickHouse Cloud
 ```
 
 ## Documentation
 
-- [CLICKHOUSE.md](CLICKHOUSE.md) - ClickHouse Cloud deployment, mode switching, and troubleshooting
-- [ANONYMIZATION.md](ANONYMIZATION.md) - Data anonymization for public dashboards
-- `viz_rill/README.md` - Dashboard structure and how the visualization layer works
-- `ATTRIBUTION.md` - Third-party components (aws-cur-wizard) used in this project
+### Cloud Deployment
+- **[CLICKHOUSE.md](CLICKHOUSE.md)** - ClickHouse Cloud setup, GitHub Actions automation, mode switching, and troubleshooting
+- **[ANONYMIZATION.md](ANONYMIZATION.md)** - Data anonymization for public dashboards
+- **GitHub Actions Workflows**:
+  - [Daily ETL Pipeline](.github/workflows/etl-pipeline.yml) - Automated data ingestion (runs at 2 AM UTC)
+  - [Clear ClickHouse Data](.github/workflows/clear-clickhouse.yml) - Manual workflow to drop all tables
+
+### Development & Architecture
+- **[CLAUDE.md](CLAUDE.md)** - Architecture details, key design patterns, and development guide
+- **[viz_rill/README.md](viz_rill/README.md)** - Dashboard structure and visualization layer
+- **[ATTRIBUTION.md](ATTRIBUTION.md)** - Third-party components (aws-cur-wizard)
+
+### Related Resources
+- **Blog Post**: [Multi-Cloud Cost Analytics: From Cost-Export to Parquet to Rill](https://www.ssp.sh/blog/cost-analyzer-aws-gcp/) - Detailed write-up of this project
+- **Blog Post Part 2**: [Multi-Cloud Cost Analytics with dlt, ClickHouse & Rill](https://www.ssp.sh/posts/) - Follow-up covering the cloud-ready version
+- **Original Local Version**: [Branch `v1`](https://github.com/ssp-data/cloud-cost-analyzer/tree/v1) - Pre-ClickHouse version
 
 