DevOps Engineer Fundamentals

A structured learning path from first pipeline to platform engineering.

1. DevOps in 2026: Culture, Not Just Tools

DevOps is not a team name. It is not Jenkins. It is not Kubernetes. DevOps is a cultural philosophy that destroys the wall between the people who write software and the people who run it. The acronym that still frames the discipline is CALMS:

Pillar	Meaning	What It Looks Like in Practice
Culture	Shared responsibility for outcomes	Devs carry pagers; ops review PRs
Automation	Eliminate toil through code	Everything that runs more than twice is automated
Lean	Small batches, fast feedback, eliminate waste	Trunk-based development, feature flags
Measurement	Data-driven decisions at every layer	DORA metrics baked into dashboards
Sharing	Open knowledge, cross-functional collaboration	Runbooks in Git, postmortems published company-wide

Why DevOps Is a Mindset

A "DevOps team" that simply renames the sysadmin group without changing how work flows is theater. Real DevOps changes the feedback loop: developers see production behavior in minutes, not months; operators influence architecture before code is written. When this loop is tight, incidents drop, lead time shrinks, and teams ship with confidence.

The Shift to Platform Engineering

By 2026, the industry has recognized that asking every developer to also be an infrastructure expert does not scale. Platform Engineering has emerged as the natural evolution of DevOps: a dedicated team builds an Internal Developer Platform (IDP) that provides golden paths -- opinionated, self-service templates for common tasks like creating a service, provisioning a database, or setting up monitoring. DevOps provided the cultural foundation; platform engineering provides the product layer on top of it.

Key insight: If DevOps is the question "how do we work together?", platform engineering is the answer "here is the paved road."

2. The DevOps Mental Model: Automation Pipeline

Every manual step in your delivery process is a risk. A missed configuration, a forgotten script, a copy-paste error -- these are the seeds of outages. The DevOps mental model is simple: if a human does it more than once, automate it.

The pipeline below represents the full journey from a developer's keyboard to a running production service:

flowchart LR
    A[Code Commit] --> B[Lint & Static Analysis]
    B --> C[Unit Tests]
    C --> D[Build Artifact]
    D --> E[Integration Tests]
    E --> F[Security Scan]
    F --> G[Container Image Build]
    G --> H[Push to Registry]
    H --> I[Deploy to Staging]
    I --> J[Smoke / E2E Tests]
    J --> K[Manual Approval Gate]
    K --> L[Deploy to Production]
    L --> M[Canary / Rolling Update]
    M --> N[Observability Alerting]
    N -->|Incident Detected| O[Automated Rollback]
    O --> A

Each box is a gate: if it fails, the pipeline stops, and the developer gets fast feedback. The goal is to make the path from "code works on my machine" to "code works in production" as short and as safe as possible.
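
The gate behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular CI system is implemented; `run_pipeline` and the stage names are hypothetical, and real stages would shell out to linters, test runners, and build tools.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a check that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failing gate.

    Returns (succeeded, names_of_stages_that_ran).
    """
    executed: List[str] = []
    for name, check in stages:
        executed.append(name)
        if not check():
            # Fail fast: later (more expensive) stages never run.
            return False, executed
    return True, executed


if __name__ == "__main__":
    # Hypothetical checks standing in for real tools.
    demo_stages = [
        ("lint", lambda: True),
        ("unit-tests", lambda: False),  # simulate a failing test suite
        ("build", lambda: True),
    ]
    print(run_pipeline(demo_stages))
```

Because the failing stage returns immediately, the build stage is never started, which is the fast feedback the diagram is meant to convey.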

The Five Pipeline Principles

Speed matters. A pipeline that takes an hour teaches developers to push less often. Target under 10 minutes for the critical path.
Fail fast. Put the cheapest checks first (lint, unit tests) and the expensive ones later (E2E, security scans).
Artifact immutability. Build once, deploy everywhere. The same container image that passes staging is the one that goes to production.
Observability throughout. Every stage emits metrics. Pipeline duration, flaky-test rates, and deployment frequency are first-class signals.
Rollback is a deployment. Automated rollback based on error-rate thresholds is not optional. It is the safety net that enables aggressive deployment cadence.
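
As a rough sketch of the last principle, a rollback decision based on an error-rate threshold might look like the following. The 5% threshold, the minimum sample size, and the names `WindowStats` and `should_roll_back` are assumptions chosen for illustration; real values depend on your service's SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Request counts observed for the new version over one evaluation window."""
    total_requests: int
    failed_requests: int


def should_roll_back(stats: WindowStats,
                     max_error_rate: float = 0.05,
                     min_requests: int = 100) -> bool:
    """Return True when the observed error rate exceeds the threshold.

    Windows with fewer than min_requests are ignored so that a single
    failed request on a quiet service does not trigger a rollback.
    """
    if stats.total_requests < min_requests:
        return False
    error_rate = stats.failed_requests / stats.total_requests
    return error_rate > max_error_rate
```

In practice the counts would come from your metrics backend, and a `True` result would trigger the same deployment mechanism used for a normal release, pointed at the previous artifact.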

3. Version Control and Collaboration

Everything in DevOps starts with version control. Not because Git is flashy, but because if it is not in version control, it does not exist. Infrastructure code, pipeline definitions, runbooks, configuration -- all of it belongs in a repository.

Git Workflows Compared

Workflow	Branch Model	Best For	Complexity
Trunk-Based	Short-lived branches off `main`	Continuous deployment, small teams	Low
GitHub Flow	Feature branches + PR to `main`	Open source, most teams	Medium
GitFlow	`develop`, `release`, `hotfix` branches	Scheduled releases, regulated industries	High

Recommendation for 2026: Trunk-based development with feature flags. Branches live hours, not days. Long-lived branches are the enemy of integration.

Branch Protection Rules

Every shared repository should enforce:

Require PR reviews -- at least one approval from a domain owner.
Require status checks -- CI must pass before merge.
Require signed commits -- GPG or SSH signature verification for audit trail.
Restrict force pushes -- history is immutable on protected branches.

Conventional Commits and Semantic Versioning

feat(auth): add OAuth2 PKCE flow for mobile clients
fix(payments): correct decimal rounding for EUR transactions
docs(api): update OpenAPI spec for v3 endpoints
ci(docker): pin base image digest for reproducible builds

Conventional commits enable automated changelogs and semantic versioning:

MAJOR.MINOR.PATCH
  |     |     |
  |     |     bug fixes (fix:)
  |     new features (feat:)
  breaking changes (! after the type, or a BREAKING CHANGE: footer)

Tools like commitizen, semantic-release, and standard-version (now deprecated) turn this convention into automated release notes, npm/Docker package publishing, and GitHub release creation.
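
To make the mapping from commit messages to version bumps concrete, here is a simplified sketch. It is not the algorithm used by semantic-release or any other tool; `bump_kind` and `next_version` are hypothetical helpers that handle only `feat`, `fix`, the `!` marker, and a `BREAKING CHANGE:` footer.

```python
import re
from typing import Iterable

# Header shape: type, optional (scope), optional "!", then ": description"
HEADER = re.compile(r"^(?P<type>\w+)(\([^)]*\))?(?P<bang>!)?: .+")


def bump_kind(message: str) -> str:
    """Classify one commit message as 'major', 'minor', 'patch', or 'none'."""
    lines = message.strip().splitlines()
    if not lines:
        return "none"
    match = HEADER.match(lines[0])
    if match is None:
        return "none"
    breaking_footer = any(
        line.startswith(("BREAKING CHANGE:", "BREAKING-CHANGE:"))
        for line in lines[1:]
    )
    if match["bang"] or breaking_footer:
        return "major"
    if match["type"] == "feat":
        return "minor"
    if match["type"] == "fix":
        return "patch"
    return "none"


def next_version(current: str, messages: Iterable[str]) -> str:
    """Apply the largest bump implied by the commits since the last release."""
    major, minor, patch = (int(part) for part in current.split("."))
    kinds = {bump_kind(message) for message in messages}
    if "major" in kinds:
        return f"{major + 1}.0.0"
    if "minor" in kinds:
        return f"{major}.{minor + 1}.0"
    if "patch" in kinds:
        return f"{major}.{minor}.{patch + 1}"
    return current
```

For example, starting from `1.4.2`, a release containing only the `fix(payments)` commit above yields `1.4.3`, while adding the `feat(auth)` commit yields `1.5.0`.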

4. Containerization

Containers solve one fundamental problem: "it works on my machine" is not a deployment strategy. A container is a lightweight, immutable artifact that packages your application, its dependencies, and its runtime configuration into a single, portable unit.

Why Containers Matter

Benefit	Explanation
Reproducibility	Same image runs identically on a laptop, in CI, and in production
Density	Containers share the host kernel; you can run hundreds per node
Portability	Images run on any host with a compatible container runtime and CPU architecture
Isolation	Process-level boundaries prevent dependency conflicts
Speed	Containers typically start in milliseconds to seconds; VMs take tens of seconds to minutes

Container Build Pipeline

flowchart LR
    A[Application Source] --> B[Dockerfile]
    B --> C[Build Image]
    C --> D[Run Unit Tests in Container]
    D --> E[Security Scan -- Trivy/Grype]
    E --> F[Tag + Push to Registry]
    F --> G[Deploy to Runtime]
    style E fill:#f66,stroke:#333,color:#fff

Dockerfile Best Practices (2026)

# Stage 1: Build
FROM node:22-alpine AS builder
WORKDIR /app
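COPY package.json package-lock.json ./
# npm ci installs exactly what package-lock.json specifies
RUN npm ci
COPY . .
# Build, then drop devDependencies so only runtime packages reach the final stage
RUN npm run build && npm prune --omit=dev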

# Stage 2: Production
FROM node:22-alpine AS production
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1
CMD ["node", "dist/server.js"]

Key principles embedded in this Dockerfile:

Multi-stage builds separate the build environment (compilers, dev dependencies) from the runtime image. The final image is often substantially smaller, and dramatically so for compiled languages where the runtime stage needs only a binary.
Non-root user -- appuser runs the process. If the container is compromised, the attacker has minimal privileges.
Layer caching -- package.json is copied before source code. Dependency installation only re-runs when lockfiles change, not on every code edit.
Health checks -- the runtime orchestrator knows when the application is healthy and can act on failures automatically.
Pinned base images -- node:22-alpine pins only the major version, and the tag itself is mutable. In production, pin the digest as well: node:22-alpine@sha256:abc123....

Image Optimization Checklist

Technique	Impact
Multi-stage builds	Smaller final image (often several-fold; more for compiled languages)
`.dockerignore` file	Prevents secrets, `.git` from entering build context
Alpine or distroless base	Fewer packages = smaller attack surface
`COPY --chown`	Avoid `RUN chown` layer bloat
Combine related `RUN` commands; treat `--squash` as experimental	Fewer layers, smaller transfer
Pin dependency versions	Reproducible builds across time

Security Scanning

Every image should be scanned before it reaches the registry. Integrate Trivy or Grype into your CI pipeline:

# GitHub Actions snippet
- name: Scan image for vulnerabilities
  uses: aquasecurity/trivy-action@master  # pin to a release tag or commit SHA outside of demos
  with:
    image-ref: "myregistry.azurecr.io/app:${{ github.sha }}"
    severity: "CRITICAL,HIGH"
    exit-code: "1"

A CI pipeline that allows images with known critical CVEs to reach production is negligent. Security scanning is not optional; it is a gate.
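
If you want a custom gate on top of the scanner's own exit code, the report can be checked in a few lines. The sketch below assumes the JSON layout Trivy produces with `--format json` (a top-level `Results` list whose entries may carry a `Vulnerabilities` list); verify the field names against your Trivy version. `blocking_findings` is a hypothetical helper and the CVE IDs are placeholders.

```python
import json
from typing import Dict, List, Set


def blocking_findings(report: Dict, blocked: Set[str]) -> List[str]:
    """Return IDs of vulnerabilities whose severity is in `blocked`."""
    found: List[str] = []
    # Targets with no findings may omit the key or set it to null.
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in blocked:
                found.append(vuln.get("VulnerabilityID", "unknown"))
    return found


# Placeholder report; the IDs are made up for illustration.
SAMPLE_REPORT = json.loads("""
{"Results": [
  {"Target": "app (alpine)", "Vulnerabilities": [
    {"VulnerabilityID": "CVE-0000-0001", "Severity": "CRITICAL"},
    {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"}
  ]},
  {"Target": "package-lock.json"}
]}
""")

if __name__ == "__main__":
    findings = blocking_findings(SAMPLE_REPORT, {"CRITICAL", "HIGH"})
    print(findings)
    # A CI step would exit non-zero here when findings is non-empty.
```

Keeping the severity set as a parameter makes it easy to start strict on CRITICAL only and tighten the gate over time.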

5. CI/CD Fundamentals

CI/CD is the automation backbone of DevOps. Continuous Integration ensures every change is validated automatically. Continuous Deployment ensures validated changes reach users safely and quickly.

Pipeline Design Principles

Principle	Practice
Build once	One artifact promoted through environments
Fail fast	Lint and unit tests before integration tests
Parallel where safe	Run independent test suites concurrently
Immutable artifacts	Container images or binaries, never "rebuild in prod"
Environment parity	Staging mirrors production infrastructure
Idempotent deployments	Running deploy twice produces the same result

GitHub Actions: Reference Implementation

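GitHub Actions is one of the most widely adopted CI/CD platforms for open-source and enterprise teams in 2026. Below is a reference pipeline. Treat it as a starting point rather than a drop-in: the deploy jobs assume the runner already has kubectl configured with credentials for the target cluster.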

name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

permissions:
  contents: read
  packages: write
  id-token: write

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: npm
      - run: npm ci
      - run: npm run lint
      - run: npm run test:unit -- --coverage
      - uses: codecov/codecov-action@v4

  build-and-push:
    needs: lint-and-test
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: |
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy-staging:
    needs: build-and-push
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - run: |
          echo "Deploying ${{ github.sha }} to staging"
          kubectl set image deployment/app \
            app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            --namespace staging

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production
    steps:
      - run: |
          echo "Deploying ${{ github.sha }} to production"
          kubectl set image deployment/app \
            app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            --namespace production

Release Strategies

Strategy	How It Works	Risk Level	Rollback Speed
Rolling	New pods replace old pods incrementally	Medium	Moderate
Blue/Green	Two identical environments; traffic switched	Low	Instant
Canary	Small percentage of traffic routed to new version	Lowest	Fast
Feature flags	Code deployed but behavior toggled per user	Lowest	Instant

Feature flags are the most powerful release strategy because they decouple deployment from release. Code lands in production behind a flag. The product team toggles it on for 1% of users, monitors, ramps to 100%, and eventually removes the flag. This workflow requires infrastructure (LaunchDarkly, Unleash, or a homegrown solution) but pays dividends in safety.
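
The percentage ramp described above is usually implemented with deterministic hashing, so a given user gets the same answer on every request. The following is a minimal sketch of that idea, not the algorithm used by LaunchDarkly or Unleash; `bucket` and `is_enabled` are hypothetical names.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..9999."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def is_enabled(flag: str, user_id: str, rollout_percent: float) -> bool:
    """Return True if the user falls inside the rollout percentage.

    Buckets are hundredths of a percent, so 1.0 enables buckets 0..99.
    """
    return bucket(flag, user_id) < rollout_percent * 100
```

Two properties matter here. Including the flag name in the hash means different flags select different user cohorts, and comparing against a threshold means that raising the percentage only ever adds users: anyone enabled at 1% stays enabled at 10%.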

6. Kubernetes: When You Need It

Kubernetes (K8s) is the industry standard for container orchestration. It provides automated deployment, scaling, networking, and self-healing for containerized applications. But it is also one of the most complex infrastructure platforms ever built. Do not adopt it prematurely.

When to Use What

Scenario	Recommended Tool
Single host, few services	Docker Compose
Managed database + few services	Docker Compose or ECS
Serverless workloads, event-driven	AWS Lambda / Cloudflare Workers
Dozens of services, multiple teams	Kubernetes (managed)
Multi-cloud, portable workloads	Kubernetes + Helm/Kustomize

If you cannot articulate why you need K8s, you do not need it. A managed container service (ECS, Cloud Run, App Runner) will serve you better with a fraction of the operational overhead.

Kubernetes Architecture

graph TB
    subgraph Control Plane
        API[API Server]
        ETCD[etcd -- State Store]
        SCHED[Scheduler]
        CTRL[Controller Manager]
        API --- ETCD
        API --- SCHED
        API --- CTRL
    end

    subgraph Node 1
        K1[kubelet]
        P1[Pod]
        P2[Pod]
        K1 --- P1
        K1 --- P2
    end

    subgraph Node 2
        K2[kubelet]
        P3[Pod]
        P4[Pod]
        K2 --- P3
        K2 --- P4
    end

    K1 -->|Watch| API
    K2 -->|Watch| API

    ING[Ingress Controller] -->|Route Traffic| P1
    ING -->|Route Traffic| P3
    SVC[Service -- ClusterIP/LoadBalancer] --> P1
    SVC --> P3

Core K8s Resources (Minimal Working Example)

Deployment -- declares the desired state for your pods:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  labels:
    app: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: ghcr.io/org/web-app:abc123def
          ports:
            - containerPort: 3000
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000
            initialDelaySeconds: 3
            periodSeconds: 5
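
The probes in this Deployment expect the application to answer on /health and /ready. As a hedged sketch of what that could look like using only the Python standard library: the response bodies, the `ready` flag, and the helper names below are assumptions for illustration, and a real service would normally add these routes to its existing web framework.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Tuple

# Set once startup work (config load, DB connection, cache warm-up) is done.
ready = threading.Event()


def probe_response(path: str, is_ready: bool) -> Tuple[int, dict]:
    """Decide the status code and body for a probe request."""
    if path == "/health":
        # Liveness: the process is up and able to answer.
        return 200, {"status": "alive"}
    if path == "/ready":
        # Readiness: only accept traffic once startup has finished.
        if is_ready:
            return 200, {"status": "ready"}
        return 503, {"status": "starting"}
    return 404, {"error": "not found"}


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        code, payload = probe_response(self.path, ready.is_set())
        body = json.dumps(payload).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        pass  # keep probe traffic out of the application logs


def make_server(port: int = 3000) -> ThreadingHTTPServer:
    """Build a server bound to all interfaces on the container port."""
    return ThreadingHTTPServer(("0.0.0.0", port), ProbeHandler)
```

To run it, call `ready.set()` when initialization completes and then `make_server().serve_forever()`. Keeping liveness and readiness separate lets Kubernetes hold traffic back from a slow-starting pod without restarting it.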

Service -- stable network endpoint for your pods:

apiVersion: v1
kind: Service
metadata:
  name: web-app-service
spec:
  selector:
    app: web-app
  ports:
    - port: 80
      targetPort: 3000
  type: ClusterIP

Ingress -- external traffic routing:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-app-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
    - hosts:
        - app.example.com
      secretName: web-app-tls
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-app-service
                port:
                  number: 80

Helm: Package Management for K8s

Raw YAML does not scale across environments. Helm templates parameterize your manifests:

charts/web-app/
  Chart.yaml          # Name, version, dependencies
  values.yaml         # Default configuration
  values-staging.yaml # Staging overrides
  values-prod.yaml    # Production overrides
  templates/
    deployment.yaml   # Templated deployment
    service.yaml      # Templated service
    ingress.yaml      # Templated ingress

helm upgrade --install web-app ./charts/web-app -f ./charts/web-app/values-prod.yaml -n production

Helm enables environment promotion: the same chart, different values. What changed between staging and production is explicit, auditable, and version-controlled.
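
Conceptually, an override file is layered over the defaults: nested maps are merged key by key, while scalars and lists are replaced. The sketch below models that behaviour in Python. It is a simplification (for instance, it does not model Helm's use of null to delete a key), and the keys `replicaCount` and `image.repository` are hypothetical chart values.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_values(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return base with override layered on top.

    Nested maps merge recursively; any other value is replaced.
    Neither input is modified.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Stand-ins for values.yaml and values-prod.yaml (hypothetical keys).
defaults = {
    "replicaCount": 1,
    "image": {"repository": "ghcr.io/org/web-app", "tag": "latest"},
    "resources": {"limits": {"memory": "256Mi"}},
}
prod_overrides = {
    "replicaCount": 3,
    "image": {"tag": "abc123def"},
}

if __name__ == "__main__":
    print(merge_values(defaults, prod_overrides))
```

The production file only has to state what differs, which is why the diff between environments stays small and reviewable.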

7. Observability and GitOps

You cannot operate what you cannot see. Observability is the ability to understand the internal state of a system by examining its external outputs. It rests on three pillars:

The Three Pillars

Pillar	Tool (2026 Recommended)	What It Answers
Metrics	Prometheus + Grafana	"Is it slow? Is it broken?"
Logs	Grafana Loki or ELK Stack	"What happened when it broke?"
Traces	OpenTelemetry + Jaeger	"Where exactly is the latency?"

OpenTelemetry has become the de facto standard for instrumenting applications: it is vendor-neutral, language-agnostic, and supported by every major observability platform. If you are starting a new service today, instrument with OTel from day one.

Prometheus + Grafana Quick Start

Prometheus scrapes metrics endpoints. Grafana visualizes them. Together they form the de facto standard for Kubernetes monitoring.

# Prometheus scrape configuration
scrape_configs:
  - job_name: web-app
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

Application code exposes a /metrics endpoint in the Prometheus exposition format:

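from prometheus_client import Counter, Histogram, start_http_server

# Counter: a monotonically increasing total, labelled per method/endpoint/status
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]
)

# Histogram: buckets request durations so quantiles can be estimated in PromQL
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency",
    ["method", "endpoint"]
)

# Serve the metrics endpoint on port 8000 (the port is an arbitrary choice here)
start_http_server(8000)

# In your request handler:
REQUEST_COUNT.labels(method="GET", endpoint="/api/users", status="200").inc()
REQUEST_LATENCY.labels(method="GET", endpoint="/api/users").observe(0.042)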

GitOps: Declarative Infrastructure at Scale

GitOps applies the DevOps principle of version control to infrastructure management. The desired state of your entire system is declared in Git. An automated agent reconciles the actual state with the declared state.

ArgoCD is the leading GitOps operator for Kubernetes:

# ArgoCD Application manifest
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/infra-manifests.git
    targetRevision: main
    path: apps/web-app/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

With this manifest in place:

A developer updates the image tag in Git.
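ArgoCD detects the change on its next poll of the repository (every few minutes by default), or almost immediately when a Git webhook is configured.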
ArgoCD applies the new manifest to the cluster.
If a manual kubectl edit drifts the state, ArgoCD self-heals back to the Git state.

The repository is the single source of truth. kubectl apply is replaced by git push. Audit trail, rollback, and access control all leverage Git's native capabilities.
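
The reconciliation idea behind this workflow can be sketched independently of any tool. The code below is a toy model, not ArgoCD's implementation: state is reduced to a mapping from resource name to a version string, and `plan` and `reconcile` are hypothetical function names.

```python
from typing import Dict, List, Tuple

# Resource name -> manifest hash or image tag (a deliberately tiny model).
State = Dict[str, str]


def plan(desired: State, actual: State, prune: bool = True) -> List[Tuple[str, str]]:
    """Return the ordered (action, resource) steps needed to reach desired."""
    actions: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != desired[name]:
            actions.append(("update", name))
    if prune:
        # Resources that exist in the cluster but not in Git are removed.
        for name in sorted(actual):
            if name not in desired:
                actions.append(("delete", name))
    return actions


def reconcile(desired: State, actual: State, prune: bool = True) -> State:
    """Apply the plan to a copy of actual and return the resulting state."""
    new_state = dict(actual)
    for action, name in plan(desired, actual, prune):
        if action == "delete":
            del new_state[name]
        else:
            new_state[name] = desired[name]
    return new_state
```

Running `reconcile` in a loop is what gives self-healing: a manual edit changes `actual`, the next pass produces an `update` step, and the state converges back to what Git declares. An empty plan means the cluster is in sync.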

8. Platform Engineering Trend

Platform engineering is the discipline of designing and building toolchains and workflows that enable software engineering organizations to be self-serving. The platform team treats the developer experience as a product.

The Internal Developer Platform

An IDP provides:

Capability	Example Implementation
Service scaffolding	Backstage software templates
Infrastructure provisioning	Terraform modules + self-service UI
CI/CD pipeline generation	Pre-configured GitHub Actions workflows
Observability onboarding	Auto-instrumented dashboards and alerts
Documentation portal	Backstage TechDocs (Markdown in Git)

Backstage: The Reference IDP

Backstage (created and open-sourced by Spotify, now a CNCF project) is one of the most widely adopted IDP frameworks in 2026. It provides:

Software Catalog -- a registry of every service, website, and data pipeline in the organization, with ownership metadata.
Software Templates -- golden paths that scaffold a new service with CI/CD, monitoring, and documentation pre-configured.
TechDocs -- documentation that lives alongside code, rendered automatically.
Plugin Ecosystem -- integrations with CI/CD, cloud providers, incident management, and cost tools.

Golden Paths

A golden path is an opinionated, supported, default workflow for a common task. It is not the only way, but it is the easiest and safest way.

Example golden path for "create a new microservice":

Developer selects "Go microservice" template in Backstage.
Template generates a repository with: Dockerfile, GitHub Actions workflow, Helm chart, OTel instrumentation, and a TechDoc stub.
Developer writes business logic. The platform handles the rest.
On push, CI builds, scans, and deploys to a preview environment.
On merge to main, ArgoCD promotes to staging, then production.

The golden path encodes organizational best practices. Deviation is allowed, but the default is secure, observable, and deployable.

FinOps in CI/CD

Cloud cost awareness is no longer a finance-team-only concern. In 2026, FinOps practices are embedded into the CI/CD pipeline:

Cost estimation on PRs -- tools like Infracost comment on pull requests with estimated cost changes for infrastructure modifications.
Resource right-sizing -- CI jobs that compare actual resource utilization to requested resources and suggest adjustments.
Spend alerts per team -- Grafana dashboards that break down cloud spend by service and team, refreshed daily.

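# Infracost PR comment integration (assumes Terraform code at the repository root)
- name: Set up Infracost
  uses: infracost/actions/setup@v3
  with:
    api-key: ${{ secrets.INFRACOST_API_KEY }}

- name: Generate cost estimate
  run: infracost breakdown --path . --format json --out-file /tmp/infracost.json

- name: Post cost comment
  run: |
    infracost comment github --path /tmp/infracost.json \
      --repo "$GITHUB_REPOSITORY" \
      --pull-request "${{ github.event.pull_request.number }}" \
      --github-token "${{ github.token }}" \
      --behavior update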
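
The right-sizing practice listed above comes down to comparing what a workload requests with what it actually uses. Here is a minimal sketch of that comparison for CPU. The 30% headroom, the 20% tolerance, and the name `suggest_cpu_request` are assumptions for illustration; the usage figure would come from your metrics backend.

```python
from typing import Optional


def suggest_cpu_request(requested_millicores: int,
                        p95_usage_millicores: int,
                        headroom: float = 1.3,
                        tolerance: float = 0.2) -> Optional[int]:
    """Suggest a new CPU request, or None if the current one is close enough.

    The target is p95 usage times a headroom factor. A suggestion is made
    only when the current request differs from the target by more than
    `tolerance`, expressed as a fraction of the current request.
    """
    target = round(p95_usage_millicores * headroom)
    if requested_millicores == 0:
        return target
    drift = abs(requested_millicores - target) / requested_millicores
    return target if drift > tolerance else None
```

A service requesting 500m while using 100m at p95 would get a suggestion of 130m, whereas one requesting 150m falls within tolerance and is left alone. The tolerance band keeps the job from proposing churn for negligible savings.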

9. What's Next

This guide covered the foundation. Here is where to go deeper:

Level	Next Steps
Beginner	Complete the Version Control and Containerization labs. Deploy your first GitHub Actions pipeline.
Intermediate	Build a multi-stage Docker pipeline with security scanning. Write Kubernetes manifests and Helm charts.
Advanced	Set up ArgoCD with Prometheus + Grafana. Evaluate Backstage for your organization. Implement FinOps in your CI/CD.

Recommended Learning Path

Git Fundamentals --> Docker & Containers --> CI/CD with GitHub Actions
        |                    |                       |
        v                    v                       v
   Branch Protection   Multi-stage Builds    Release Strategies
        |                    |                       |
        v                    v                       v
   Conventional Commits  Security Scanning    Feature Flags
                                                    |
                                                    v
                                    Kubernetes (when you need it)
                                                    |
                                                    v
                                    Observability (Prometheus/Grafana/OTel)
                                                    |
                                                    v
                                    GitOps (ArgoCD) + Platform Engineering

Books and Resources

Resource	Focus Area
Accelerate -- Forsgren, Humble, Kim	DORA metrics, DevOps research
The Phoenix Project -- Kim, Behr, Spafford	DevOps novel, culture
Site Reliability Engineering -- Google	SRE practices, SLIs/SLOs
Team Topologies -- Skelton, Pais	Team structures, platform teams
Platform Engineering on Kubernetes -- Salatino	Backstage, IDP design

DevOps is not a destination. It is a continuous practice of shortening feedback loops, automating toil, and building systems that are safe to change. The tools will change. The principles will not.
