This guide outlines best practices for deploying open data stacks on Kubernetes using GitOps principles, focusing on efficient patterns that work well with modern DevOps workflows.
It also serves as a guideline and set of best practices for transitioning from traditional deployment methods to a DevOps workflow.
The examples shown and the accompanying folder structure in this repo are not meant to work out of the box; rather, they serve as a guideline for what such a deployment can and should involve, depending on how deep you want to go. If you'd like a working copy with GitOps infrastructure using Flux CD, Kestra workflows, and Liquibase migrations with a complete CI/CD pipeline implementation, check out the related repo gitops-flux-pipeline-showcase.
This repo covers key principles such as:
- Separation of Concerns: Clear boundaries between infrastructure, platform services (if you want to go that deep), and applications such as the business logic in data pipelines
- GitOps-Driven: Everything defined as code, with Git as the single source of truth
- Hierarchical Configuration: Shared configurations at higher levels, specific overrides at lower levels
- Secret Management: Secure handling of sensitive information using SOPS or sealed secrets
- Automation: CI/CD pipelines for validation and deployment
- Observability: Built-in monitoring and alerting for all components
- Multi-tenancy: Isolation between different tenants/domains with clear access boundaries
- Release Management: Consistent, repeatable deployment processes with clear versioning
There are several GitOps tools; the most notable are ArgoCD, Flux, and Terraform. This example uses Flux because I used it at my previous job, but ArgoCD is definitely the better choice if you prefer a visual web interface; Flux is more bare-bones.
- GitOps Controller: Flux for continuous delivery on Azure/AWS
- This means changes are made through git commits to a deployment repository, and Flux (or an equivalent tool) will build a new release and deploy it to dev/test/prod (depending on where you made your changes)
- Package Management: Helm Charts for application deployment
- Most OSS tools provide Helm Charts that can be used to deploy on Kubernetes (or elsewhere), so these can be integrated directly into the deployment scripts; see the HelmRelease sketch after this list
- Secret Management: SOPS with GitOps integration
- Database Migration: Liquibase can handle database schema changes programmatically. You define the changes in a specific form (called changesets), and instead of manually writing the DDL statements to get from one release or major version to the next, Liquibase handles them for you. See example [[#Database Migration]]
- CI/CD: GitHub Actions or similar for validation and deployment
- On Azure, the equivalent of GitHub Actions is Azure Pipelines (part of Azure DevOps).
- Infrastructure Management: Platform-specific tools (Azure ARM, AWS CloudFormation) for resources that must be created before Kubernetes
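To illustrate how a vendor-provided Helm chart plugs into Flux, here is a hypothetical HelmRelease for Kestra; the repository URL, chart version, and values are assumptions, so check the chart's own documentation:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: kestra
  namespace: platform
spec:
  interval: 1h
  url: https://helm.kestra.io/   # assumed chart repository URL, verify in the Kestra docs
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kestra
  namespace: platform
spec:
  interval: 10m
  chart:
    spec:
      chart: kestra
      version: "0.x"             # semver range or pinned version, per environment
      sourceRef:
        kind: HelmRepository
        name: kestra
  values:
    # keep environment-specific overrides in the overlays, not in the base
    replicaCount: 1
```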
This is an extensive but high-level example of how such a folder structure might look.
What's notable is that there is a folder for each environment (dev/test/prod), and you override configurations or versions in those environment-specific configs. For example, prod might run a lower version than dev, where you test the latest version.
/data-platform-gitops/
├── clusters/ # One directory per cluster
│ ├── dev/ # Dev environment
│ │ ├── flux-system/ # Flux bootstrap configuration
│ │ └── cluster-config.yaml # Cluster-specific config pointing to apps
│ └── prod/ # Production environment
│ ├── flux-system/
│ └── cluster-config.yaml
├── infrastructure/ # Infrastructure components
│ ├── base/ # Common infrastructure components
│ │ ├── observability/ # Monitoring, logging, tracing
│ │ ├── networking/ # Ingress controllers, cert-manager, etc.
│ │ └── storage/ # Storage classes, persistent volumes
│ └── overlays/ # Environment-specific overlays
│ ├── dev/ # Dev-specific infrastructure
│ └── prod/ # Production-specific infrastructure
├── platform/ # Platform services
│ ├── base/ # Base definitions for data platform components
│ │ ├── kestra/ # Kestra infra-configuration
│ │ ├── superset/ # Superset configuration
│ │ ├── dlt/ # dlt configuration
│ │ ├── liquibase/ # Database migration configuration
│ │ └── common/ # Shared configurations
│ └── overlays/ # Environment-specific overlays
│ ├── dev/ # Dev environment configuration
│ └── prod/ # Production environment configuration
├── domains/ # Business/Data domain configurations
│ ├── finance/ # Finance domain
│ │ ├── base/ # Base configuration
│ │ └── overlays/ # Environment-specific overrides
│ │ ├── dev/
│ │ └── prod/
│ └── marketing/ # another domain like finance
└── .github/ # CI/CD workflow definitions
└── workflows/ # GitHub Actions workflow files
├── validate.yaml # Validation workflow
├── release.yaml # Release workflow
└── deploy.yaml # Deployment workflow
[!note] Tenants
Depending on whether you have multiple tenants, you might either have one DevOps repo like this per tenant, or integrate the tenants into the folder structure.
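To make the wiring concrete, the cluster-config.yaml in clusters/dev/ could define Flux Kustomizations that point at the environment overlays, with dependsOn enforcing the deployment order (names and paths below are illustrative):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/overlays/dev
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: platform
  namespace: flux-system
spec:
  interval: 10m
  path: ./platform/overlays/dev
  prune: true
  dependsOn:
    - name: infrastructure   # platform services reconcile only after infrastructure is ready
  sourceRef:
    kind: GitRepository
    name: flux-system
```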
Remember, the above is the infrastructure part, e.g. the CPUs that Kestra or dlt need, the web endpoint where you can reach the web UI, etc.
The actual pipeline code is usually separated into a dedicated workspace:
├── pipelines/ # Data pipeline definitions
│ ├── <pipeline-name>.yml # Kestra workflow definition files
│ ├── dlt/ # Data integration scripts with dlt
│ │ └── <source-name>-<dest>.py # Source-specific dlt scripts
│ └── README.md # Pipeline documentation
In this example we use:
- Data Integration: dlt for reliable data loading
- Workflow Orchestration: Kestra for pipeline orchestration
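A minimal sketch of what a <pipeline-name>.yml Kestra flow in the pipelines/ folder could look like, running a dlt script on a schedule; the flow id, namespace, script path, and cron are assumptions, and plugin type names can vary between Kestra versions:

```yaml
id: finance-daily-load          # hypothetical flow name
namespace: finance              # hypothetical tenant/domain namespace
tasks:
  - id: run-dlt
    type: io.kestra.plugin.scripts.python.Commands
    commands:
      - python dlt/postgres-warehouse.py   # hypothetical dlt script from the dlt/ folder
triggers:
  - id: daily
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 2 * * *"           # run every night at 02:00
```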
The structure above is a real-life example of how to integrate workspaces, which different teams such as the ML team, product analysts, or data engineers might use for their pipelines or domain-specific implementations.
Workspaces are separate from infrastructure, and each typically uses a different set of libraries: data scientists experiment with Seaborn and other data-science libraries, whereas data engineers use pandas, numpy, etc. for tabular data manipulation.
An example of this structure is HelloDATA-BE, where we have different tenants with one portal and orchestration, but different data domains to segregate the data (Postgres) independently, and each data domain has its own workspaces.
See more on:
- Documentation for Workspaces
- GitHub starter workspace repo
- kanton-bern/hellodata-be: The Open-Source Enterprise Data Platform in a single Portal
Other technologies that make it straightforward to implement workspaces are Visual Studio Code Dev Containers, which let you spin up a full environment with all required libraries from a single devcontainer.json, as well as Devpod, GitPod, or GitHub Codespaces. With these, we can create workspaces that are decoupled from the infrastructure code.
A typical deployment workflow includes:
- Cluster Provisioning: Set up a Kubernetes cluster on Azure/AWS (can be done manually or through automation)
- GitOps Bootstrap: Install and configure Flux/ArgoCD/etc. on the cluster
- Platform Deployment: Deploy infrastructure components followed by platform services
- Tenant Provisioning: Create tenant isolation boundaries
- Domain Configuration: Apply business/data domain configurations within tenant boundaries
- Workspace/pipeline Deployment: Deploy data pipelines within their domains
Here are some best practices used for deployment and release management:
- Use managed Kubernetes services (AKS on Azure, EKS on AWS) if you don't have a dedicated DevOps or operations team
- Configure node pools for different workload types
- Implement proper network policies
- Use private clusters where possible
- Regularly update Kubernetes versions
- Follow the "base/overlay" pattern with Kustomize
- Use only one string substitution mechanism, either Kustomize or Flux substitutions
- Flux supports Kustomize substitution; see the docs and post-build substitution
- Use and integrate existing HelmReleases for application deployment (e.g. Airflow, Kestra, or Prometheus usually provide these in their open-source repos; sometimes you need to ask around and search Slack and community resources)
- Implement hierarchical configuration (cluster → environment → domain)
- Keep environment-specific values in overlay directories
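To illustrate the base/overlay pattern, a hypothetical platform/overlays/dev/kustomization.yaml could pull in the base components and override only what differs in dev (paths and patch values are assumptions):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: platform-dev
resources:
  - ../../base/kestra
  - ../../base/superset
patches:
  - target:
      kind: HelmRelease
      name: kestra
    patch: |
      # dev runs a newer chart version than prod
      - op: replace
        path: /spec/chart/spec/version
        value: "0.2.x"
```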
- Encrypt secrets using SOPS with age or PGP keys
- Store encrypted secrets in Git
- Configure Flux to decrypt secrets automatically
- Organize secrets hierarchically by environment and domain
- Use external secret stores (Azure Key Vault, AWS Secrets Manager, HashiCorp Vault) for production
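A minimal sketch of the SOPS wiring, assuming an age key: a .sops.yaml rule that encrypts only Secret data fields, plus the decryption block on the Flux Kustomization (the age recipient, paths, and secret name are placeholders):

```yaml
# .sops.yaml — encrypt only the data/stringData fields of Secret manifests
creation_rules:
  - path_regex: .*/overlays/.*/secrets/.*\.yaml
    encrypted_regex: ^(data|stringData)$
    age: age1examplepublickeyxxxxxxxxxxxxxxxxxxxx   # placeholder age recipient
```

```yaml
# Flux Kustomization that applies the secrets and decrypts them with SOPS
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: platform
  namespace: flux-system
spec:
  interval: 10m
  path: ./platform/overlays/dev
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  decryption:
    provider: sops
    secretRef:
      name: sops-age   # cluster secret containing the age private key
```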
- Implement namespace isolation between tenants
- Use Kubernetes RBAC to control access rights
- Apply network policies to restrict cross-tenant communication
- Implement resource quotas for fair resource sharing
- Consider tenant-specific persistent volumes for data isolation
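A minimal sketch of tenant isolation for a hypothetical finance namespace, combining a ResourceQuota for fair sharing with a NetworkPolicy that only allows ingress from pods in the same namespace:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: finance-quota
  namespace: finance
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    persistentvolumeclaims: "10"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-cross-tenant
  namespace: finance
spec:
  podSelector: {}            # applies to all pods in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}    # allow traffic only from pods within the same namespace
```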
- Use the aforementioned Liquibase for database schema management
- Store migration scripts in version control
- Implement CI/CD pipelines for database validation
- Run migrations as Kubernetes Jobs before application deployment
- Track migration history in a dedicated schema
A concrete example can be found in domains/finance/overlays/dev/migrations/changelogs/.
Here are some links and examples:
- General Liquibase.com - Liquibase
- Workflow: Introduction to Liquibase
- Example of renaming a column: renameColumn
- More examples on Liquibase Open Source Workflows
- Data change - loadUpdateData
- This function can even run and load data from source and MERGE if data exists; see example in above link
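As a sketch of the changeset format, here is a hypothetical Liquibase changelog in YAML that renames a column (author, table, and column names are made up):

```yaml
databaseChangeLog:
  - changeSet:
      id: rename-amount-to-gross-amount
      author: data-platform-team        # hypothetical author
      changes:
        - renameColumn:
            tableName: invoices
            oldColumnName: amount
            newColumnName: gross_amount
            columnDataType: numeric(12,2)
      rollback:
        - renameColumn:
            tableName: invoices
            oldColumnName: gross_amount
            newColumnName: amount
            columnDataType: numeric(12,2)
```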
- Implement Semantic Versioning for all components
- Use tags to mark stable releases
- Implement GitOps-based promotion between environments
- Consider blue/green or canary deployment strategies
- Separate the release process from deployment process
Semantic versioning is one of the most common methods of numbering software versions. This method uses three numbers for each software version, in Major.Minor.Patch sequence. When a major, minor, or patch update is made, the corresponding number is increased.
Versioning like: X.Y.Z-A.B
- X: Major
- Y: Minor
- Z: Patch
- A: Development stage (alpha, beta, release candidate (RC), release, post-release)
- B: Sequence number within A
For example, 2.3.1-rc.2 would be the second release candidate of version 2.3.1.
In software engineering, CI/CD (or CICD) refers to the combined practices of continuous integration and continuous delivery or, less often, continuous deployment. They are sometimes referred to collectively as continuous development or continuous software development.
This is important for data projects as it's hard and expensive to debug errors in production. Therefore, the more bugs we can detect ahead of runtime with an extensive CI/CD pipeline, the better. Usually this is also strongly dependent on a balanced test data set. It's good to take a production data set, shrink it to a sensible size (< 1 GB), and anonymize sensitive data to have a good test set.
These are typical automations that CD would contain for data projects:
- Build and push Docker images to internal or public registry
- Update Helm chart values with new image versions
- Create Git tag for the release
- Generate release notes
- Deploy to development environment
- Run automated integration tests with the above-mentioned test data sets. Best to run end-to-end integration tests that perform an extract based on a test data set, transform the data using ETL logic, and execute test SQL statements to count rows and run some aggregation queries.
- Promote to higher environments when testing on DEV has succeeded and been approved
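A heavily simplified sketch of such a CD workflow in GitHub Actions; the registry, file paths, and the sed-based version bump are assumptions, and a real pipeline would add the integration tests and promotion steps listed above:

```yaml
name: release
on:
  push:
    branches: [main]
jobs:
  build-and-release:
    runs-on: ubuntu-latest
    permissions:
      contents: write
      packages: write
    steps:
      - uses: actions/checkout@v4
      - name: Log in to the container registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push the pipeline image
        uses: docker/build-push-action@v6
        with:
          push: true
          tags: ghcr.io/${{ github.repository }}/pipelines:${{ github.sha }}
      - name: Update the image tag in the dev overlay (picked up by Flux)
        run: |
          # hypothetical values file; point this at your actual Helm values
          sed -i "s/tag: .*/tag: ${GITHUB_SHA}/" platform/overlays/dev/values.yaml
          git config user.name "release-bot"
          git config user.email "release-bot@users.noreply.github.com"
          git commit -am "release: ${GITHUB_SHA}" && git push
```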
The following checks are less critical for data projects and are more advanced features you could add to your CI:
- Validate Kubernetes manifests
- Lint YAML files
- Test Helm chart rendering
- Validate Kustomize overlays
- Run static analysis on Dockerfiles
- Validate database migrations
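A hedged sketch of a validation job covering a few of these checks; the tool choices (yamllint, kubeconform) and the overlay paths are assumptions:

```yaml
name: validate
on: [pull_request]
jobs:
  validate-manifests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint YAML files
        run: |
          pip install yamllint
          yamllint .
      - name: Render and validate Kustomize overlays
        run: |
          curl -sL https://github.com/yannh/kubeconform/releases/latest/download/kubeconform-linux-amd64.tar.gz | tar xz
          for overlay in infrastructure/overlays/* platform/overlays/*; do
            kubectl kustomize "$overlay" | ./kubeconform -strict -ignore-missing-schemas -
          done
```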
- Use Flux (or others) for automated deployments
- Define dependencies between Kustomizations
- Implement proper source reconciliation
- Use GitOps for promoting between environments
- Implement notification controllers for alerts
- Use image automation for automatic updates
- Leverage Flux's support for Helm, Kustomize, and raw manifests
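For the notification controllers, a sketch of forwarding reconciliation failures to Slack; the channel and secret name are placeholders, and the apiVersion may differ depending on your Flux release:

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: slack
  namespace: flux-system
spec:
  type: slack
  channel: data-platform-alerts
  secretRef:
    name: slack-webhook-url   # secret containing the webhook address
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: deployment-failures
  namespace: flux-system
spec:
  providerRef:
    name: slack
  eventSeverity: error
  eventSources:
    - kind: Kustomization
      name: "*"
    - kind: HelmRelease
      name: "*"
```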
- Deploy observability stack (Prometheus, Grafana, Loki)
- Define SLOs/SLIs for critical services
- Implement alerts for SLO violations
- Create dashboards for key metrics
- Track costs and resource utilization
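As a sketch of an SLO-style alert, assuming kube-prometheus-stack is installed, a PrometheusRule could look like this; the metric names and thresholds are made up and depend on what your orchestrator actually exposes:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pipeline-slo
  namespace: observability
spec:
  groups:
    - name: data-pipeline-slo
      rules:
        - alert: PipelineFailureRateHigh
          expr: |
            sum(rate(pipeline_executions_failed_total[1h]))
              / sum(rate(pipeline_executions_total[1h])) > 0.05
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "More than 5% of pipeline executions failed over the last hour"
```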
Adding a new data pipeline typically involves:
- Create dlt Integration Script: Develop a Python script using dlt to extract and load data
- Define Destination Schema: Configure how data should be mapped to the destination
- Create Kestra Workflow: Define a workflow to orchestrate the data pipeline execution
- Configure Secrets: Set up secure access to source and destination systems
- Deploy and Monitor: Deploy the pipeline and monitor its execution
When deploying on Azure, consider these additional best practices:
- Azure Container Registry (ACR):
- Use ACR for storing Docker images
- Integrate ACR with AKS for streamlined image pulling
- Implement image scanning for security
- Azure Key Vault:
- Store sensitive credentials in Azure Key Vault
- Use the CSI Secret Store Driver to mount secrets
- Rotate credentials regularly
- Azure Monitor:
- Configure Log Analytics for centralized logging
- Set up Application Insights for application monitoring
- Create custom dashboards for visibility
- Azure Networking:
- Use Private Link for secure service connections
- Implement proper network security groups
- Consider Azure Firewall for enhanced security
- Azure Identity:
- Use Azure AD for authentication
- Implement proper RBAC for all resources
- Use Managed Identities for service authentication
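A sketch of the Key Vault CSI approach, assuming the Azure Key Vault provider for the Secrets Store CSI Driver is enabled on AKS; the vault name, tenant, identity, and secret object are placeholders:

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: platform-keyvault
  namespace: platform
spec:
  provider: azure
  parameters:
    useVMManagedIdentity: "true"
    userAssignedIdentityID: "<managed-identity-client-id>"   # placeholder
    keyvaultName: "my-data-platform-kv"                      # placeholder vault name
    tenantId: "<azure-tenant-id>"                            # placeholder
    objects: |
      array:
        - |
          objectName: warehouse-db-password
          objectType: secret
```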
For organizations transitioning from traditional deployment methods to GitOps:
- Start with Infrastructure as Code: Define all infrastructure in code
- Containerize Applications: Move applications to containers
- Implement CI/CD: Set up automated build and test pipelines
- Adopt Kubernetes: Deploy containerized applications to Kubernetes
- Implement GitOps: Use Flux (or ArgoCD, etc.) to manage deployments
For smaller teams or simpler projects, consider this streamlined approach:
- Flatten Directory Structure: Reduce nesting levels
- Simplify Multi-tenancy: Use namespaces without a tenant operator
- Consolidate Applications: Package related services together
- Use Managed Services: Leverage cloud provider managed services where possible
- Start with Core Components: Begin with essential services and expand gradually
These are advanced use cases that take a lot of time, but they are becoming increasingly important if you use many different libraries. Some default vulnerability checks are already available out of the box on GitHub and other platforms. For completeness, here are some important pointers for security:
- Cluster Security:
- Enable private API server endpoints
- Use network policies to restrict pod communication
- Implement pod security policies/standards
- Application Security:
- Scan container images for vulnerabilities
- Use non-root users in containers
- Implement least privilege principles
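A minimal sketch of the non-root and least-privilege points at the pod level; the image and names are illustrative only:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dlt-runner
  namespace: finance
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: dlt
      image: ghcr.io/example/dlt-runner:1.0.0   # placeholder image
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      resources:
        requests:
          cpu: 250m
          memory: 512Mi
        limits:
          memory: 1Gi
```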