A structured learning path from first pipeline to platform engineering.
DevOps is not a team name. It is not Jenkins. It is not Kubernetes. DevOps is a cultural philosophy that destroys the wall between the people who write software and the people who run it. The acronym that still frames the discipline is CALMS:
| Pillar | Meaning | What It Looks Like in Practice |
|---|---|---|
| Culture | Shared responsibility for outcomes | Devs carry pagers; ops review PRs |
| Automation | Eliminate toil through code | Everything that runs more than twice is automated |
| Lean | Small batches, fast feedback, eliminate waste | Trunk-based development, feature flags |
| Measurement | Data-driven decisions at every layer | DORA metrics baked into dashboards |
| Sharing | Open knowledge, cross-functional collaboration | Runbooks in Git, postmortems published company-wide |
A "DevOps team" that simply renames the sysadmin group without changing how work flows is theater. Real DevOps changes the feedback loop: developers see production behavior in minutes, not months; operators influence architecture before code is written. When this loop is tight, incidents drop, lead time shrinks, and teams ship with confidence.
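To make "incidents drop, lead time shrinks" measurable, the following is a minimal sketch of how three DORA-style numbers could be derived from deployment records. The `Deployment` record, its field names, and the sample data are assumptions made for illustration; real pipelines would pull these values from their CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    committed_at: datetime   # when the change was committed
    deployed_at: datetime    # when it reached production
    caused_incident: bool    # whether it triggered a production incident


def dora_summary(deployments: list[Deployment], window_days: int) -> dict:
    """Summarize three DORA-style metrics over a reporting window."""
    if not deployments:
        return {"deploys_per_day": 0.0,
                "median_lead_time_hours": None,
                "change_failure_rate": None}
    lead_times = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600
        for d in deployments
    ]
    failures = sum(1 for d in deployments if d.caused_incident)
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": failures / len(deployments),
    }


# Hypothetical one-week history: four deployments, one of which caused an incident
base = datetime(2026, 1, 5, 9, 0)
history = [
    Deployment(base, base + timedelta(hours=2), False),
    Deployment(base + timedelta(days=1), base + timedelta(days=1, hours=4), True),
    Deployment(base + timedelta(days=2), base + timedelta(days=2, hours=1), False),
    Deployment(base + timedelta(days=3), base + timedelta(days=3, hours=3), False),
]
summary = dora_summary(history, window_days=7)
print(summary)
```

Tracking the trend of these numbers over time is usually more informative than any single reading.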
By 2026, the industry has recognized that asking every developer to also be an infrastructure expert does not scale. Platform Engineering has emerged as the natural evolution of DevOps: a dedicated team builds an Internal Developer Platform (IDP) that provides golden paths -- opinionated, self-service templates for common tasks like creating a service, provisioning a database, or setting up monitoring. DevOps provided the cultural foundation; platform engineering provides the product layer on top of it.
Key insight: If DevOps is the question "how do we work together?", platform engineering is the answer "here is the paved road."
Every manual step in your delivery process is a risk. A missed configuration, a forgotten script, a copy-paste error -- these are the seeds of outages. The DevOps mental model is simple: if a human does it more than once, automate it.
The pipeline below represents the full journey from a developer's keyboard to a running production service:
flowchart LR
A[Code Commit] --> B[Lint & Static Analysis]
B --> C[Unit Tests]
C --> D[Build Artifact]
D --> E[Integration Tests]
E --> F[Security Scan]
F --> G[Container Image Build]
G --> H[Push to Registry]
H --> I[Deploy to Staging]
I --> J[Smoke / E2E Tests]
J --> K[Manual Approval Gate]
K --> L[Deploy to Production]
L --> M[Canary / Rolling Update]
M --> N[Observability Alerting]
N -->|Incident Detected| O[Automated Rollback]
O --> A
Each box is a gate: if it fails, the pipeline stops, and the developer gets fast feedback. The goal is to make the path from "code works on my machine" to "code works in production" as short and as safe as possible.
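The gate behavior described above can be sketched in a few lines. This is an illustrative model rather than how any particular CI system is implemented; the stage names and the `run_pipeline` helper are hypothetical.

```python
from typing import Callable

# A stage is a name plus a check that returns True on success
Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> tuple[bool, list[str]]:
    """Run stages in order and stop at the first failing gate."""
    executed = []
    for name, check in stages:
        executed.append(name)
        if not check():
            return False, executed  # later stages never run
    return True, executed


stages = [
    ("lint", lambda: True),
    ("unit-tests", lambda: False),      # simulated failure
    ("build", lambda: True),
    ("integration-tests", lambda: True),
]
ok, executed = run_pipeline(stages)
print(ok, executed)
```

Because the run stops at the failing stage, the slower stages behind it cost nothing, which is why ordering stages from cheap to expensive shortens feedback time.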
- Speed matters. A pipeline that takes an hour teaches developers to push less often. Target under 10 minutes for the critical path.
- Fail fast. Put the cheapest checks first (lint, unit tests) and the expensive ones later (E2E, security scans).
- Artifact immutability. Build once, deploy everywhere. The same container image that passes staging is the one that goes to production.
- Observability throughout. Every stage emits metrics. Pipeline duration, flaky-test rates, and deployment frequency are first-class signals.
- Rollback is a deployment. Automated rollback based on error-rate thresholds is not optional. It is the safety net that enables aggressive deployment cadence.
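As an illustration of the last principle, here is a minimal sketch of a rollback decision driven by an error-rate threshold. The function name, the 5% threshold, and the minimum-traffic guard are assumptions chosen for the example, not recommended values.

```python
def should_roll_back(requests: int, errors: int,
                     max_error_rate: float = 0.05,
                     min_requests: int = 100) -> bool:
    """Return True when the observed error rate breaches the threshold.

    min_requests avoids reacting to noise from a handful of requests.
    """
    if requests < min_requests:
        return False
    return errors / requests > max_error_rate


print(should_roll_back(requests=50, errors=10))     # too little traffic to judge
print(should_roll_back(requests=1000, errors=20))   # 2% error rate, under the threshold
print(should_roll_back(requests=1000, errors=80))   # 8% error rate, over the threshold
```

In practice the inputs would come from the observability stack, and the decision would trigger a redeploy of the previous artifact.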
Everything in DevOps starts with version control. Not because Git is flashy, but because if it is not in version control, it does not exist. Infrastructure code, pipeline definitions, runbooks, configuration -- all of it belongs in a repository.
| Workflow | Branch Model | Best For | Complexity |
|---|---|---|---|
| Trunk-Based | Short-lived branches off main | Continuous deployment, small teams | Low |
| GitHub Flow | Feature branches + PR to main | Open source, most teams | Medium |
| GitFlow | develop, release, hotfix branches | Scheduled releases, regulated industries | High |
Recommendation for 2026: Trunk-based development with feature flags. Branches live hours, not days. Long-lived branches are the enemy of integration.
Every shared repository should enforce:
- Require PR reviews -- at least one approval from a domain owner.
- Require status checks -- CI must pass before merge.
- Require signed commits -- GPG or SSH signature verification for audit trail.
- Restrict force pushes -- history is immutable on protected branches.
feat(auth): add OAuth2 PKCE flow for mobile clients
fix(payments): correct decimal rounding for EUR transactions
docs(api): update OpenAPI spec for v3 endpoints
ci(docker): pin base image digest for reproducible builds
Conventional commits enable automated changelogs and semantic versioning:
MAJOR.MINOR.PATCH
  |     |     |
  |     |     bug fixes (fix:)
  |     new features (feat:)
  breaking changes (feat!: or a BREAKING CHANGE: footer)
Tools like commitizen, semantic-release, and standard-version (now deprecated in favor of alternatives such as release-please) turn this convention into automated release notes, npm/Docker package publishing, and GitHub release creation.
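To show how the commit-type-to-version mapping can be automated, here is a simplified sketch. It is not how any of the tools named above is implemented; the `bump_level` and `next_version` helpers are hypothetical, and the parser handles only the common header form plus a breaking-change footer.

```python
import re

# type, optional (scope), optional "!" marker, then ": description"
HEADER = re.compile(r"^(?P<type>\w+)(\([^)]*\))?(?P<bang>!)?: .+")


def bump_level(message: str) -> str:
    """Classify one commit message as 'major', 'minor', 'patch', or 'none'."""
    lines = message.splitlines()
    match = HEADER.match(lines[0]) if lines else None
    if match is None:
        return "none"
    breaking_footer = any(
        line.startswith(("BREAKING CHANGE:", "BREAKING-CHANGE:"))
        for line in lines[1:]
    )
    if match.group("bang") or breaking_footer:
        return "major"
    if match.group("type") == "feat":
        return "minor"
    if match.group("type") == "fix":
        return "patch"
    return "none"


def next_version(current: str, messages: list[str]) -> str:
    """Apply the highest bump found in the messages to a MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(part) for part in current.split("."))
    levels = {bump_level(m) for m in messages}
    if "major" in levels:
        return f"{major + 1}.0.0"
    if "minor" in levels:
        return f"{major}.{minor + 1}.0"
    if "patch" in levels:
        return f"{major}.{minor}.{patch + 1}"
    return current


commits = [
    "feat(auth): add OAuth2 PKCE flow for mobile clients",
    "fix(payments): correct decimal rounding for EUR transactions",
    "docs(api): update OpenAPI spec for v3 endpoints",
    "ci(docker): pin base image digest for reproducible builds",
]
print(next_version("1.4.2", commits))
```

The highest-ranked change wins: a single `feat` among many `fix` commits still produces a minor bump, and `docs` or `ci` commits alone leave the version untouched.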
Containers solve one fundamental problem: "it works on my machine" is not a deployment strategy. A container is a lightweight, immutable artifact that packages your application, its dependencies, and its runtime configuration into a single, portable unit.
| Benefit | Explanation |
|---|---|
| Reproducibility | Same image runs identically on a laptop, in CI, and in production |
| Density | Containers share the host kernel; you can run hundreds per node |
| Portability | Images run on any Linux host with a container runtime |
| Isolation | Process-level boundaries prevent dependency conflicts |
| Speed | Containers typically start in seconds or less; VMs can take minutes |
flowchart LR
A[Application Source] --> B[Dockerfile]
B --> C[Build Image]
C --> D[Run Unit Tests in Container]
D --> E[Security Scan -- Trivy/Grype]
E --> F[Tag + Push to Registry]
F --> G[Deploy to Runtime]
style E fill:#f66,stroke:#333,color:#fff
# Stage 1: Build
FROM node:22-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
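# npm ci already installs strictly from package-lock.json (--frozen-lockfile is a Yarn/pnpm flag)
RUN npm ci
COPY . .
RUN npm run build
# Drop devDependencies so only runtime packages are copied into the final stage
RUN npm prune --omit=dev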
# Stage 2: Production
FROM node:22-alpine AS production
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s \
CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1
CMD ["node", "dist/server.js"]
Key principles embedded in this Dockerfile:
- Multi-stage builds separate the build environment (compilers, dev dependencies) from the runtime image. The final image is often much smaller.
- Non-root user -- appuser runs the process. If the container is compromised, the attacker has minimal privileges.
- Layer caching -- package.json and package-lock.json are copied before source code. Dependency installation only re-runs when those files change, not on every code edit.
- Health checks -- the runtime orchestrator knows when the application is healthy and can act on failures automatically.
- Pinned base images -- node:22-alpine uses a specific major version. In production, pin the digest: node:22-alpine@sha256:abc123....
| Technique | Impact |
|---|---|
| Multi-stage builds | 10-50x smaller final image |
| .dockerignore file | Prevents secrets, .git from entering build context |
| Alpine or distroless base | Fewer packages = smaller attack surface |
| COPY --chown | Avoid RUN chown layer bloat |
| Merge layers with --squash or buildkit | Fewer layers, smaller transfer |
| Pin dependency versions | Reproducible builds across time |
Every image should be scanned before it reaches the registry. Integrate Trivy or Grype into your CI pipeline:
# GitHub Actions snippet
- name: Scan image for vulnerabilities
  # In real pipelines, pin this action to a release tag or commit SHA instead of master
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: "myregistry.azurecr.io/app:${{ github.sha }}"
    severity: "CRITICAL,HIGH"
    exit-code: "1"
A CI pipeline that allows images with known critical CVEs to reach production is negligent. Security scanning is not optional; it is a gate.
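For teams that want custom gate logic on top of the scanner's own exit code, a JSON report can be inspected directly. The sketch below assumes a report shaped like Trivy's JSON output (a top-level `Results` list whose entries may carry a `Vulnerabilities` list); the sample report and its CVE identifiers are hand-written placeholders, not real findings.

```python
import json

BLOCKING = {"CRITICAL", "HIGH"}


def blocking_findings(report: dict) -> list[str]:
    """Collect IDs of findings whose severity should fail the build."""
    found = []
    for result in report.get("Results") or []:
        # Targets with no findings may omit the Vulnerabilities key entirely
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in BLOCKING:
                found.append(vuln.get("VulnerabilityID", "unknown"))
    return found


# Hypothetical, hand-written report fragment for illustration only
sample = json.loads("""
{"Results": [
  {"Target": "app (alpine)", "Vulnerabilities": [
    {"VulnerabilityID": "CVE-0000-0001", "Severity": "HIGH"},
    {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"}
  ]},
  {"Target": "package-lock.json"}
]}
""")
findings = blocking_findings(sample)
exit_code = 1 if findings else 0   # a non-zero exit code stops the pipeline
print(findings, exit_code)
```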
CI/CD is the automation backbone of DevOps. Continuous Integration ensures every change is validated automatically. Continuous Deployment ensures validated changes reach users safely and quickly.
| Principle | Practice |
|---|---|
| Build once | One artifact promoted through environments |
| Fail fast | Lint and unit tests before integration tests |
| Parallel where safe | Run independent test suites concurrently |
| Immutable artifacts | Container images or binaries, never "rebuild in prod" |
| Environment parity | Staging mirrors production infrastructure |
| Idempotent deployments | Running deploy twice produces the same result |
GitHub Actions is one of the most widely adopted CI/CD platforms for open-source and enterprise teams in 2026. Below is a representative pipeline that applies the principles above:
name: CI/CD Pipeline
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
permissions:
  contents: read
  packages: write
  id-token: write
env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}
jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: npm
      # npm ci already installs strictly from package-lock.json
      - run: npm ci
      - run: npm run lint
      - run: npm run test:unit -- --coverage
      - uses: codecov/codecov-action@v4
  build-and-push:
    needs: lint-and-test
    # Only build and publish on pushes to main, not on pull requests
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: |
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max
  deploy-staging:
    needs: build-and-push
    runs-on: ubuntu-latest
    environment: staging
    steps:
      # Assumes kubectl on the runner is already authenticated against the cluster
      - run: |
          echo "Deploying ${{ github.sha }} to staging"
          kubectl set image deployment/app \
            app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            --namespace staging
  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production
    steps:
      - run: |
          echo "Deploying ${{ github.sha }} to production"
          kubectl set image deployment/app \
            app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            --namespace production
| Strategy | How It Works | Risk Level | Rollback Speed |
|---|---|---|---|
| Rolling | New pods replace old pods incrementally | Medium | Moderate |
| Blue/Green | Two identical environments; traffic switched | Low | Instant |
| Canary | Small percentage of traffic routed to new version | Lowest | Fast |
| Feature flags | Code deployed but behavior toggled per user | Lowest | Instant |
Feature flags are the most powerful release strategy because they decouple deployment from release. Code lands in production behind a flag. The product team toggles it on for 1% of users, monitors, ramps to 100%, and eventually removes the flag. This workflow requires infrastructure (LaunchDarkly, Unleash, or a homegrown solution) but pays dividends in safety.
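The percentage ramp described above is commonly built on deterministic bucketing, so that a given user keeps the same experience as the rollout grows. The following is a minimal sketch of that idea, not the algorithm used by LaunchDarkly or Unleash; the `is_enabled` helper and the flag name are hypothetical.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets for a flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100          # stable value in 0..99
    return bucket < rollout_percent


users = [f"user-{n}" for n in range(10_000)]
enabled_at_1 = {u for u in users if is_enabled("new-checkout", u, 1)}
enabled_at_25 = {u for u in users if is_enabled("new-checkout", u, 25)}

# Everyone enabled at 1% stays enabled when the rollout ramps to 25%
print(enabled_at_1 <= enabled_at_25)
```

Hashing the flag name together with the user ID keeps buckets independent across flags, so the same users are not always the first to receive every new feature.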
Kubernetes (K8s) is the industry standard for container orchestration. It provides automated deployment, scaling, networking, and self-healing for containerized applications. But it is also one of the most complex infrastructure platforms ever built. Do not adopt it prematurely.
| Scenario | Recommended Tool |
|---|---|
| Single host, few services | Docker Compose |
| Managed database + few services | Docker Compose or ECS |
| Serverless workloads, event-driven | AWS Lambda / Cloudflare Workers |
| Dozens of services, multiple teams | Kubernetes (managed) |
| Multi-cloud, portable workloads | Kubernetes + Helm/Kustomize |
If you cannot articulate why you need K8s, you do not need it. A managed container service (ECS, Cloud Run, App Runner) will serve you better with a fraction of the operational overhead.
graph TB
subgraph Control Plane
API[API Server]
ETCD[etcd -- State Store]
SCHED[Scheduler]
CTRL[Controller Manager]
API --- ETCD
API --- SCHED
API --- CTRL
end
subgraph Node 1
K1[kubelet]
P1[Pod]
P2[Pod]
K1 --- P1
K1 --- P2
end
subgraph Node 2
K2[kubelet]
P3[Pod]
P4[Pod]
K2 --- P3
K2 --- P4
end
API -->|Watch/Push| K1
API -->|Watch/Push| K2
ING[Ingress Controller] -->|Route Traffic| P1
ING -->|Route Traffic| P3
SVC[Service -- ClusterIP/LoadBalancer] --> P1
SVC --> P3
Deployment -- declares the desired state for your pods:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  labels:
    app: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: ghcr.io/org/web-app:abc123def
          ports:
            - containerPort: 3000
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000
            initialDelaySeconds: 3
            periodSeconds: 5
Service -- stable network endpoint for your pods:
apiVersion: v1
kind: Service
metadata:
  name: web-app-service
spec:
  selector:
    app: web-app
  ports:
    - port: 80
      targetPort: 3000
  type: ClusterIP
Ingress -- external traffic routing:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-app-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
    - hosts:
        - app.example.com
      secretName: web-app-tls
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-app-service
                port:
                  number: 80
Raw YAML does not scale across environments. Helm templates parameterize your manifests:
charts/web-app/
  Chart.yaml            # Name, version, dependencies
  values.yaml           # Default configuration
  values-staging.yaml   # Staging overrides
  values-prod.yaml      # Production overrides
  templates/
    deployment.yaml     # Templated deployment
    service.yaml        # Templated service
    ingress.yaml        # Templated ingress
helm install web-app ./charts/web-app -f ./charts/web-app/values-prod.yaml -n production
Helm enables environment promotion: the same chart, different values. What changed between staging and production is explicit, auditable, and version-controlled.
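Conceptually, an override file is layered on top of the chart defaults key by key. The sketch below models that layering as a recursive dictionary merge; it is a simplification that ignores Helm-specific rules such as null handling, and the value names are hypothetical examples of what a chart might expose.

```python
def merge_values(base: dict, override: dict) -> dict:
    """Recursively merge override into base; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = value   # scalars and lists are replaced wholesale
    return merged


# Stand-ins for values.yaml and values-prod.yaml
defaults = {
    "replicaCount": 1,
    "image": {"repository": "ghcr.io/org/web-app", "tag": "latest"},
    "resources": {"limits": {"memory": "256Mi"}},
}
prod = {"replicaCount": 3, "image": {"tag": "abc123def"}}

effective = merge_values(defaults, prod)
print(effective)
```

Only the keys that differ need to appear in the override file, which is what keeps the staging-to-production diff small and reviewable.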
You cannot operate what you cannot see. Observability is the ability to understand the internal state of a system by examining its external outputs. It rests on three pillars:
| Pillar | Tool (2026 Recommended) | What It Answers |
|---|---|---|
| Metrics | Prometheus + Grafana | "Is it slow? Is it broken?" |
| Logs | Grafana Loki or ELK Stack | "What happened when it broke?" |
| Traces | OpenTelemetry + Jaeger | "Where exactly is the latency?" |
OpenTelemetry has become the universal standard for instrumenting applications. It is vendor-neutral, language-agnostic, and supported by every major observability platform. If you are starting a new service today, instrument with OTel from day one.
Prometheus scrapes metrics endpoints. Grafana visualizes them. Together they form the de facto standard for Kubernetes monitoring.
# Prometheus scrape configuration
scrape_configs:
  - job_name: web-app
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
Application code exposes a /metrics endpoint in the Prometheus exposition format:
from prometheus_client import Counter, Histogram, generate_latest
REQUEST_COUNT = Counter(
"http_requests_total",
"Total HTTP requests",
["method", "endpoint", "status"]
)
REQUEST_LATENCY = Histogram(
"http_request_duration_seconds",
"Request latency",
["method", "endpoint"]
)
# In your request handler:
REQUEST_COUNT.labels(method="GET", endpoint="/api/users", status=200).inc()
REQUEST_LATENCY.labels(method="GET", endpoint="/api/users").observe(0.042)
GitOps applies the DevOps principle of version control to infrastructure management. The desired state of your entire system is declared in Git. An automated agent reconciles the actual state with the declared state.
ArgoCD is the leading GitOps operator for Kubernetes:
# ArgoCD Application manifest
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/infra-manifests.git
    targetRevision: main
    path: apps/web-app/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
With this manifest in place:
- A developer updates the image tag in Git.
- ArgoCD detects the change on its next poll of the repository (every few minutes by default), or within seconds if a Git webhook is configured.
- ArgoCD applies the new manifest to the cluster.
- If a manual kubectl edit drifts the state, ArgoCD self-heals back to the Git state.
The repository is the single source of truth. kubectl apply is replaced by git push. Audit trail, rollback, and access control all leverage Git's native capabilities.
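The reconciliation idea behind GitOps can be illustrated with a toy loop that compares declared and actual state. This is a conceptual sketch only, not ArgoCD's implementation; the `reconcile` function and the resource dictionaries are hypothetical stand-ins for manifests in Git and objects in a cluster.

```python
def reconcile(desired: dict, actual: dict, prune: bool = True) -> list[str]:
    """Mutate `actual` toward `desired` and return the actions taken."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actual[name] = dict(spec)
            actions.append(f"create {name}")
        elif actual[name] != spec:
            actual[name] = dict(spec)       # overwrite drift with the declared state
            actions.append(f"update {name}")
    if prune:
        # Remove anything running that is no longer declared
        for name in sorted(set(actual) - set(desired)):
            del actual[name]
            actions.append(f"prune {name}")
    return actions


desired = {"deployment/web-app": {"image": "ghcr.io/org/web-app:abc123def", "replicas": 3}}
cluster = {
    "deployment/web-app": {"image": "ghcr.io/org/web-app:abc123def", "replicas": 5},  # manual drift
    "configmap/old": {"key": "value"},  # no longer declared in Git
}
first = reconcile(desired, cluster)
second = reconcile(desired, cluster)   # nothing left to do once states match
print(first, second)
```

The second call returning no actions is the property that matters: reconciliation is idempotent, so it can run continuously without side effects.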
Platform engineering is the discipline of designing and building toolchains and workflows that enable software engineering organizations to be self-serving. The platform team treats the developer experience as a product.
An IDP provides:
| Capability | Example Implementation |
|---|---|
| Service scaffolding | Backstage software templates |
| Infrastructure provisioning | Terraform modules + self-service UI |
| CI/CD pipeline generation | Pre-configured GitHub Actions workflows |
| Observability onboarding | Auto-instrumented dashboards and alerts |
| Documentation portal | Backstage TechDocs (Markdown in Git) |
Backstage (originally open-sourced by Spotify) is the most widely adopted IDP framework in 2026. It provides:
- Software Catalog -- a registry of every service, website, and data pipeline in the organization, with ownership metadata.
- Software Templates -- golden paths that scaffold a new service with CI/CD, monitoring, and documentation pre-configured.
- TechDocs -- documentation that lives alongside code, rendered automatically.
- Plugin Ecosystem -- integrations with CI/CD, cloud providers, incident management, and cost tools.
A golden path is an opinionated, supported, default workflow for a common task. It is not the only way, but it is the easiest and safest way.
Example golden path for "create a new microservice":
- Developer selects "Go microservice" template in Backstage.
- Template generates a repository with: Dockerfile, GitHub Actions workflow, Helm chart, OTel instrumentation, and a TechDoc stub.
- Developer writes business logic. The platform handles the rest.
- On push, CI builds, scans, and deploys to a preview environment.
- On merge to main, ArgoCD promotes to staging, then production.
The golden path encodes organizational best practices. Deviation is allowed, but the default is secure, observable, and deployable.
Cloud cost awareness is no longer a finance-team-only concern. In 2026, FinOps practices are embedded into the CI/CD pipeline:
- Cost estimation on PRs -- tools like Infracost comment on pull requests with estimated cost changes for infrastructure modifications.
- Resource right-sizing -- CI jobs that compare actual resource utilization to requested resources and suggest adjustments.
- Spend alerts per team -- Grafana dashboards that break down cloud spend by service and team, refreshed daily.
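The right-sizing check in the list above can be sketched as a small comparison between requested and observed resources. The function name, the 20% headroom, and the 10% minimum-change tolerance are assumptions picked for illustration; real values depend on the workload.

```python
from typing import Optional


def suggest_cpu_request(requested_millicores: int,
                        observed_p95_millicores: int,
                        headroom: float = 0.2,
                        min_change: float = 0.1) -> Optional[int]:
    """Suggest a CPU request of p95 usage plus headroom, or None if close enough."""
    target = round(observed_p95_millicores * (1 + headroom))
    relative_change = abs(target - requested_millicores) / requested_millicores
    if relative_change < min_change:
        return None   # not worth a pull request
    return target


print(suggest_cpu_request(500, 100))   # heavily over-provisioned service
print(suggest_cpu_request(100, 80))    # request already close to the target
```

A CI job could run a check like this against utilization data and open a suggestion only when the gap is large enough to matter.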
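# Infracost PR comment integration
- name: Set up Infracost
  uses: infracost/actions/setup@v3
  with:
    api-key: ${{ secrets.INFRACOST_API_KEY }}
- name: Infracost breakdown
  run: infracost breakdown --path . --format json --out-file /tmp/infracost.json
- name: Post cost comment
  run: |
    infracost comment github --path /tmp/infracost.json \
      --repo $GITHUB_REPOSITORY \
      --pull-request ${{ github.event.pull_request.number }} \
      --github-token ${{ github.token }} \
      --behavior update
This guide covered the foundation. Here is where to go deeper: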
| Level | Next Steps |
|---|---|
| Beginner | Complete the Version Control and Containerization labs. Deploy your first GitHub Actions pipeline. |
| Intermediate | Build a multi-stage Docker pipeline with security scanning. Write Kubernetes manifests and Helm charts. |
| Advanced | Set up ArgoCD with Prometheus + Grafana. Evaluate Backstage for your organization. Implement FinOps in your CI/CD. |
Git Fundamentals --> Docker & Containers --> CI/CD with GitHub Actions
       |                      |                          |
       v                      v                          v
Branch Protection    Multi-stage Builds         Release Strategies
       |                      |                          |
       v                      v                          v
Conventional Commits  Security Scanning            Feature Flags
                                                         |
                                                         v
                                           Kubernetes (when you need it)
                                                         |
                                                         v
                                      Observability (Prometheus/Grafana/OTel)
                                                         |
                                                         v
                                      GitOps (ArgoCD) + Platform Engineering
| Resource | Focus Area |
|---|---|
| Accelerate -- Forsgren, Humble, Kim | DORA metrics, DevOps research |
| The Phoenix Project -- Kim, Behr, Spafford | DevOps novel, culture |
| Site Reliability Engineering -- Google | SRE practices, SLIs/SLOs |
| Team Topologies -- Skelton, Pais | Team structures, platform teams |
| Platform Engineering on Kubernetes -- Salatino | Backstage, IDP design |
DevOps is not a destination. It is a continuous practice of shortening feedback loops, automating toil, and building systems that are safe to change. The tools will change. The principles will not.