Commit cd359a1

docs: improve project documentation and README structure

- Streamline main README with clearer navigation and architecture overview
- Add comprehensive workloads README with component descriptions
- Enhance ingest workload documentation with usage examples and configuration
- Add new Makefile targets for install and run-help commands
- Improve project structure visibility and quick start instructions

1 parent 5f63443

4 files changed: +179 −206 lines changed


README.md

Lines changed: 44 additions & 171 deletions

````diff
@@ -8,207 +8,84 @@
 
 > A Kubernetes-native platform for building modern, declarative data pipelines with clear boundaries between ingestion and transformation.
 
-## 📋 Quick Navigation
+## 🚀 Quick Navigation
 
-- [Overview](#-overview)
-- [Key Features](#-key-features)
-- [Quick Start](#-quick-start)
-- [Architecture](#️-architecture)
-- [Documentation](#-documentation)
-- [Contributing](#-contributing)
+- **[🎛️ Kubernetes Operator](operator/README.md)** - Go-based CRD management and pipeline orchestration
+- **[📥 Ingest Workload](workloads/ingest/README.md)** - Type-safe data ingestion (Python)
+- **[🔄 Transform Workload](workloads/transform/README.md)** - dbt-based data transformation
+- **[⚡ Trigger Workload](workloads/trigger/README.md)** - Event-driven pipeline activation (Go)
+- **[📋 Examples](docs/examples.md)** - Comprehensive YAML examples and use cases
 
 ---
 
-## 🚀 Overview
-
-Pipeline Forge is a complete solution for orchestrating data pipelines in Kubernetes environments. It combines a powerful Kubernetes operator with specialized workloads to provide a declarative, event-driven approach to data pipeline management.
-
-### 🎯 The Problem
-
-Modern data teams face challenges with:
-
-- **Complex Orchestration**: Managing dependencies between data ingestion and transformation
-- **Event-Driven Requirements**: Responding to file drops, database changes, and streaming events
-- **Infrastructure Complexity**: Deploying and scaling data processing workloads
-- **Observability Gaps**: Tracking pipeline health and data lineage
-- **Team Coordination**: Coordinating between data engineering and platform teams
-- **Resilience and Lifecycle Management**: Ensuring each pipeline step is connected in a clear lifecycle—if one step fails, others don't attempt to run, preventing cascading errors and maintaining robust execution
-
-### 💡 The Solution
-
-Pipeline Forge provides a Kubernetes-native platform that:
-
-- **Declaratively Orchestrates** data pipelines using Custom Resource Definitions (CRDs)
-- **Flexible Ingestion** supports both event-driven (e.g., GCS file drops, Pub/Sub messages, BigQuery updates) and scheduled (CronJob-based) pipeline execution
-- **Clear Separation** between data ingestion and transformation phases
-- **Built-in Observability** with comprehensive status tracking and monitoring
-- **GitOps Ready** configuration that fits modern deployment practices
-
-## ✨ Key Features
-
-- **Unified Pipeline Lifecycle**: Connect ingestion with staging models in a single application lifecycle - if ingestion fails, the entire staging fails, preventing orphaned transformations
-- **Native Kubernetes Resources**: Each step runs on 100% native K8s resources (Transform → Job, Ingest → CronJob/Job/Trigger)
-- **Event-Driven Orchestration**: React to file drops, Pub/Sub messages, and BigQuery updates with intelligent retry policies
-- **Built-in Observability**: Comprehensive status tracking with detailed execution history and failure analysis
-- **Flexible Ingestion**: Reference existing CronJobs during Ingestion or create new ones as needed with full type safety and managed by the operator
-- **Custom Image Support**: Use your own image for each step, or use pre-built Docker images from the Pipeline Forge repository
-
-### ⚡ Event-Driven Orchestration
-
-- **GCS Triggers**: Monitor bucket changes and trigger pipelines
-- **Pub/Sub Triggers**: React to real-time messages with optional filtering
-- **BigQuery Triggers**: Watch for table updates and data freshness
-- **Retry & Cooldown**: Configurable retry policies with intelligent intervals
-
-### ☸️ Kubernetes-Native
-
-- **CRD-Based**: Native Kubernetes resources for pipeline definition
-- **RBAC Integration**: Fine-grained access control for teams
-- **Resource Management**: CPU, memory, and storage allocation
-- **Independent Scaling**: Each step scales independently as native K8s resources
-
-### 📊 Built-in Observability
-
-- **Rich Status Tracking**: Comprehensive pipeline health monitoring with detailed execution history
-- **Lifecycle Management**: Real-time phase tracking (Pending, Running, Completed, Failed)
-- **Execution Insights**: Track attempt counts, success/failure rates, and timing metrics
-- **Failure Analysis**: Detailed error messages and retry attempt tracking
-
-### 🔄 Example
-
-```yaml
-apiVersion: core.pipeline-forge.io/v1alpha1
-kind: Staging
-metadata:
-  name: user-events-pipeline
-  namespace: staging-events
-spec:
-  ingest:
-    mode: reference
-    type: trigger
-    name: user-events-trigger
-  transform:
-    name: user-events-transform
-    project: analytics
-    target: prod
-    image: gcr.io/org/dbt-core:latest
-    models:
-      - stg_user_events
-```
-
-📖 **[View comprehensive examples →](docs/examples.md)**
-
-## 🚀 Quick Start
-
-### Prerequisites
-
-- Kubernetes cluster (v1.11.3+)
-- kubectl configured for your cluster
-- Docker or container runtime
-- Access to container registry
+## 🎯 What is Pipeline Forge?
 
-### Installation
+A complete solution for orchestrating data pipelines in Kubernetes environments. Combines a powerful Kubernetes operator with specialized workloads to provide a declarative, event-driven approach to data pipeline management.
 
-1. **Clone and Deploy**
+### Key Benefits
 
-```bash
-# Clone the repository
-git clone https://github.com/your-org/pipeline-forge.git
-cd pipeline-forge/operator
+- **Unified Pipeline Lifecycle** - Connect ingestion with transformation in a single application lifecycle
+- **Native Kubernetes Resources** - Each step runs on 100% native K8s resources
+- **Event-Driven Orchestration** - React to file drops, Pub/Sub messages, and BigQuery updates
+- **Built-in Observability** - Comprehensive status tracking and monitoring
 
-# Deploy the operator
-make deploy IMG=your-registry/pipeline-forge-operator:latest
-```
+## 🏗️ Architecture Overview
 
-2. **Deploy Sample Pipelines**
+Pipeline Forge consists of two main components:
 
-```bash
-# Apply example configurations
-kubectl apply -k operator/config/samples/
-```
+### 🎛️ [Kubernetes Operator](operator/README.md)
 
-3. **Monitor Pipeline Status**
-```bash
-# Check pipeline health
-kubectl get staging
-kubectl describe staging user-events-staging
-```
+**Go-based CRD management and pipeline orchestration**
 
-## 🏗️ Architecture
+- Custom Resource Definitions (CRDs) for pipeline definition
+- Automatic reconciliation and lifecycle management
+- RBAC integration and resource management
+- Event-driven trigger management
 
-Pipeline Forge consists of two main components that work together seamlessly:
+### 🔧 [Specialized Workloads](workloads/README.md)
 
-### 🎛️ Kubernetes Operator
+**Production-ready data processing components**
 
-The operator manages the lifecycle of data pipeline stages through:
+- **[Ingest](workloads/ingest/README.md)** - Type-safe data ingestion from MySQL, PostgreSQL to BigQuery
+- **[Transform](workloads/transform/README.md)** - dbt-based data transformation with version control
+- **[Trigger](workloads/trigger/README.md)** - Event processing for GCS, Pub/Sub, and BigQuery
 
-- **Staging Resources**: Complete pipeline steps that coordinate ingestion and transformation
-- **Trigger Resources**: Event-driven activation for pipelines
-- **Ingestion Management**: Supports ingestion via both event-driven triggers and CronJobs
-- **Automatic Reconciliation**: Ensures pipeline state matches desired configuration
-
-### 🔧 Specialized Workloads
-
-Pre-built, production-ready data processing components:
+## 🛠️ Technology Stack
 
-- **Ingest Workloads**: Type-safe data ingestion from MySQL, PostgreSQL, and more
-- **Transform Workloads**: dbt-based data transformation with version control
-- **Trigger Workloads**: Event processing for GCS, Pub/Sub, and BigQuery
+| Component     | Technology                    | Purpose                                   |
+| ------------- | ----------------------------- | ----------------------------------------- |
+| **Operator**  | Go, Kubernetes, Kubebuilder   | Pipeline orchestration and CRD management |
+| **Ingest**    | Python 3.13+, Pydantic, Typer | Type-safe data ingestion with validation  |
+| **Transform** | dbt Core, BigQuery            | Data transformation and analytics         |
+| **Triggers**  | Go, Google Cloud APIs         | Event-driven pipeline activation          |
 
-## 📚 Documentation
+## 🚀 Quick Start
 
-- **[📋 Examples](docs/examples.md)** - Comprehensive YAML examples and use cases
-- **[🎛️ Operator Guide](operator/README.md)** - Detailed operator documentation
-- **[🔧 Workloads](workloads/README.md)** - Data processing components
+```bash
+git clone https://github.com/DanielBlei/pipeline-forge.git
 
-## 🛠️ Technology Stack
+# Run the operator (safety check ensures you're on kind/minikube cluster)
+make run-operator
 
-| Component          | Technology                    | Purpose                                              |
-| ------------------ | ----------------------------- | ---------------------------------------------------- |
-| **Operator**       | Go, Kubernetes, Kubebuilder   | Pipeline orchestration and CRD management            |
-| **Ingest**         | Python 3.13+, Pydantic, Typer | Type-safe data ingestion with validation             |
-| **Transform**      | dbt Core, BigQuery            | Data transformation and analytics                    |
-| **Triggers**       | GCS, Pub/Sub, BigQuery APIs   | Event-driven pipeline activation with retry policies |
-| **Infrastructure** | Kubernetes, Docker            | Container orchestration and deployment               |
+# Deploy k8s samples resources
+make apply-samples
+```
 
 ## 📁 Project Structure
 
 ```
 pipeline-forge/
 ├── operator/         # Kubernetes operator (Go)
-│   ├── api/          # CRD definitions
-│   ├── controllers/  # Reconciliation logic
-│   └── config/       # Deployment manifests
-├── workloads/        # Data processing components
+├── workloads/        # Data processing components
 │   ├── ingest/       # Type-safe ingestion (Python)
 │   ├── transform/    # dbt transformations
-│   └── trigger/      # Event processing
+│   └── trigger/      # Event processing (Go)
 └── docs/             # Documentation
-    └── examples.md   # Comprehensive examples
 ```
 
 ## 🤝 Contributing
 
-We welcome contributions to Pipeline Forge! Whether you're interested in:
-
-- **Operator Development**: Kubernetes controller logic and CRDs
-- **Workload Development**: Data processing components
-- **Documentation**: User guides and examples
-- **Testing**: End-to-end pipeline validation
-
-Please see our contributing guidelines and development setup instructions in the respective component directories.
-
-### 🛠️ Development Setup
-
-```bash
-# Set up development environment
-cd operator
-make manifests generate
-make test
-
-# Run locally
-make run
-```
+We welcome contributions! See individual component READMEs for development setup and guidelines.
 
 ## 📄 License
 
@@ -225,7 +102,3 @@ distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-
-```
-
-```
````
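The README diff above keeps the "Retry & Cooldown: Configurable retry policies with intelligent intervals" feature. As a minimal sketch of that behavior (the real policy lives in the operator's Go trigger controllers; the function and parameter names here are hypothetical):

```python
import time


def run_with_retry(task, max_attempts=3, cooldown_seconds=0.0):
    """Call task() up to max_attempts times, sleeping between failures.

    Sketch only: a real controller would narrow the exception types and
    record attempt counts in the resource status.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as err:
            last_error = err
            if attempt < max_attempts:
                time.sleep(cooldown_seconds)
    raise last_error


# Usage: a flaky task that succeeds on the third attempt.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(run_with_retry(flaky, max_attempts=3))  # prints "ok"
```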

workloads/README.md

Lines changed: 20 additions & 2 deletions

```diff
@@ -1,3 +1,21 @@
-# Workload Placeholder
+# Pipeline Forge - Workloads
 
-This directory will contain the code and Dockerfile for Pipeline Forge workloads.
+This directory contains the data processing components for Pipeline Forge. All workloads are packaged as Docker images for easy deployment and scaling.
+
+## Available Workloads
+
+- **[Ingest](./ingest/README.md)** - Data ingestion from databases to BigQuery (Python)
+- **[Transform](./transform/README.md)** - dbt-based data transformation
+- **[Trigger](./trigger/README.md)** - Event-driven pipeline activation (Go)
+
+## Architecture Overview
+
+All workloads follow consistent patterns:
+
+- **Containerized Deployment** - Packaged as Docker images for easy deployment
+- **Standalone Operation** - Each workload runs independently with its own configuration
+- **Type-safe Configuration** - Runtime validation and environment-specific configs
+- **Comprehensive Testing** - Unit, integration, and end-to-end test coverage
+- **Modern Development Practices** - Linting, formatting, and CI/CD ready
+
+_For detailed information about each workload, see their individual README files._
```
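The "Type-safe Configuration - Runtime validation" pattern the new workloads README describes can be sketched with a stdlib dataclass. This is an illustration only: the actual ingest workload uses Pydantic, and every field and class name below is hypothetical.

```python
from dataclasses import dataclass

# Hypothetical whitelist; the README names MySQL and PostgreSQL as sources.
SUPPORTED_SOURCES = {"mysql", "postgresql"}


@dataclass(frozen=True)
class IngestConfig:
    """Sketch of a validated ingest configuration (not the real schema)."""
    source_type: str
    source_table: str
    destination_dataset: str

    def __post_init__(self):
        # Runtime validation: reject bad configs at construction time.
        if self.source_type not in SUPPORTED_SOURCES:
            raise ValueError(f"unsupported source_type: {self.source_type!r}")
        if not self.source_table or not self.destination_dataset:
            raise ValueError("source_table and destination_dataset are required")


# Usage: a valid config constructs; an invalid one fails fast.
cfg = IngestConfig("postgresql", "public.users", "analytics_raw")
print(cfg.destination_dataset)  # prints "analytics_raw"
```

Failing at construction time, rather than midway through a pipeline run, is the point of the pattern: a misconfigured workload never reaches the data-moving step.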

workloads/ingest/Makefile

Lines changed: 9 additions & 0 deletions

```diff
@@ -18,6 +18,15 @@ fix: ## Attempt to fix linting errors (Ruff)
 	@uv run ruff check . --fix
 	@uv run ruff format .
 
+.PHONY: install
+install: ## Install the project
+	@uv sync
+	@uv pip install -e .
+
+.PHONY: run-help
+run-help: ## Run the project and show the help
+	@uv run ingest --help
+
 
 ##@ Build
 
```