DevOps Engineer Fundamentals

A structured learning path from first pipeline to platform engineering.


1. DevOps in 2026: Culture, Not Just Tools

DevOps is not a team name. It is not Jenkins. It is not Kubernetes. DevOps is a cultural philosophy that destroys the wall between the people who write software and the people who run it. The acronym that still frames the discipline is CALMS:

| Pillar | Meaning | What It Looks Like in Practice |
| --- | --- | --- |
| Culture | Shared responsibility for outcomes | Devs carry pagers; ops review PRs |
| Automation | Eliminate toil through code | Everything that runs more than twice is automated |
| Lean | Small batches, fast feedback, eliminate waste | Trunk-based development, feature flags |
| Measurement | Data-driven decisions at every layer | DORA metrics baked into dashboards |
| Sharing | Open knowledge, cross-functional collaboration | Runbooks in Git, postmortems published company-wide |

Why DevOps Is a Mindset

A "DevOps team" that simply renames the sysadmin group without changing how work flows is theater. Real DevOps changes the feedback loop: developers see production behavior in minutes, not months; operators influence architecture before code is written. When this loop is tight, incidents drop, lead time shrinks, and teams ship with confidence.

The Shift to Platform Engineering

By 2026, the industry has recognized that asking every developer to also be an infrastructure expert does not scale. Platform Engineering has emerged as the natural evolution of DevOps: a dedicated team builds an Internal Developer Platform (IDP) that provides golden paths -- opinionated, self-service templates for common tasks like creating a service, provisioning a database, or setting up monitoring. DevOps provided the cultural foundation; platform engineering provides the product layer on top of it.

Key insight: If DevOps is the question "how do we work together?", platform engineering is the answer "here is the paved road."


2. The DevOps Mental Model: Automation Pipeline

Every manual step in your delivery process is a risk. A missed configuration, a forgotten script, a copy-paste error -- these are the seeds of outages. The DevOps mental model is simple: if a human does it more than twice, automate it.

The pipeline below represents the full journey from a developer's keyboard to a running production service:

flowchart LR
    A[Code Commit] --> B[Lint & Static Analysis]
    B --> C[Unit Tests]
    C --> D[Build Artifact]
    D --> E[Integration Tests]
    E --> F[Security Scan]
    F --> G[Container Image Build]
    G --> H[Push to Registry]
    H --> I[Deploy to Staging]
    I --> J[Smoke / E2E Tests]
    J --> K[Manual Approval Gate]
    K --> L[Deploy to Production]
    L --> M[Canary / Rolling Update]
    M --> N[Observability Alerting]
    N -->|Incident Detected| O[Automated Rollback]
    O --> A

Each box is a gate: if it fails, the pipeline stops, and the developer gets fast feedback. The goal is to make the path from "code works on my machine" to "code works in production" as short and as safe as possible.

The Five Pipeline Principles

  1. Speed matters. A pipeline that takes an hour teaches developers to push less often. Target under 10 minutes for the critical path.
  2. Fail fast. Put the cheapest checks first (lint, unit tests) and the expensive ones later (E2E, security scans).
  3. Artifact immutability. Build once, deploy everywhere. The same container image that passes staging is the one that goes to production.
  4. Observability throughout. Every stage emits metrics. Pipeline duration, flaky-test rates, and deployment frequency are first-class signals.
  5. Rollback is a deployment. Automated rollback based on error-rate thresholds is not optional. It is the safety net that enables aggressive deployment cadence.
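
Principle 5 can be sketched as a small decision function. This is a minimal illustration, not a production policy: the names `WindowStats` and `should_rollback`, the 5% threshold, and the 100-request minimum are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Request counts observed over one evaluation window."""
    requests: int
    errors: int


def error_rate(stats: WindowStats) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    if stats.requests == 0:
        return 0.0
    return stats.errors / stats.requests


def should_rollback(baseline: WindowStats, candidate: WindowStats,
                    max_error_rate: float = 0.05,
                    min_requests: int = 100) -> bool:
    """Decide whether the newly deployed version should be rolled back.

    Hypothetical policy: wait for enough traffic to judge, then roll back
    only if the new version exceeds the absolute threshold AND is worse
    than the version it replaced.
    """
    if candidate.requests < min_requests:
        return False  # not enough data yet; keep observing
    rate = error_rate(candidate)
    return rate > max_error_rate and rate > error_rate(baseline)
```

Comparing against the baseline as well as an absolute threshold avoids rolling back a release for errors that were already present before it shipped.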

3. Version Control and Collaboration

Everything in DevOps starts with version control. Not because Git is flashy, but because if it is not in version control, it does not exist. Infrastructure code, pipeline definitions, runbooks, configuration -- all of it belongs in a repository.

Git Workflows Compared

| Workflow | Branch Model | Best For | Complexity |
| --- | --- | --- | --- |
| Trunk-Based | Short-lived branches off main | Continuous deployment, small teams | Low |
| GitHub Flow | Feature branches + PR to main | Open source, most teams | Medium |
| GitFlow | develop, release, hotfix branches | Scheduled releases, regulated industries | High |

Recommendation for 2026: Trunk-based development with feature flags. Branches live hours, not days. Long-lived branches are the enemy of integration.

Branch Protection Rules

Every shared repository should enforce:

  • Require PR reviews -- at least one approval from a domain owner.
  • Require status checks -- CI must pass before merge.
  • Require signed commits -- GPG or SSH signature verification for audit trail.
  • Restrict force pushes -- history is immutable on protected branches.

Conventional Commits and Semantic Versioning

feat(auth): add OAuth2 PKCE flow for mobile clients
fix(payments): correct decimal rounding for EUR transactions
docs(api): update OpenAPI spec for v3 endpoints
ci(docker): pin base image digest for reproducible builds

Conventional commits enable automated changelogs and semantic versioning:

MAJOR.MINOR.PATCH
  |     |     |
  |     |     bug fixes (fix:)
  |     new features (feat:)
  breaking changes (feat!: or a BREAKING CHANGE: footer)

Tools like commitizen and semantic-release turn this convention into automated release notes, npm/Docker package publishing, and GitHub release creation. (standard-version did the same but is now deprecated.)
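
To make the mapping from commit messages to version bumps concrete, here is a minimal sketch. It is not how any of the tools above is implemented; `bump_for` and `next_version` are hypothetical helpers, and the parser only handles the header and the breaking-change footer.

```python
import re

# Conventional Commits header: type, optional (scope), optional "!", then ": description"
HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?(?P<bang>!)?: .+")

BUMP_ORDER = ["none", "patch", "minor", "major"]


def bump_for(message: str) -> str:
    """Return 'major', 'minor', 'patch', or 'none' for one commit message."""
    lines = message.splitlines()
    match = HEADER.match(lines[0]) if lines else None
    if match is None:
        return "none"  # not a conventional commit; ignore it
    breaking_footer = any(
        line.startswith(("BREAKING CHANGE:", "BREAKING-CHANGE:"))
        for line in lines[1:]
    )
    if match["bang"] or breaking_footer:
        return "major"
    if match["type"] == "feat":
        return "minor"
    if match["type"] == "fix":
        return "patch"
    return "none"  # docs, ci, chore, ... do not trigger a release


def next_version(current: str, messages: list[str]) -> str:
    """Apply the largest bump found in `messages` to a MAJOR.MINOR.PATCH string."""
    bump = max((bump_for(m) for m in messages), key=BUMP_ORDER.index, default="none")
    major, minor, patch = (int(part) for part in current.split("."))
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return current
```

Given the four example commits above and a current version of 1.4.2, the highest bump is the `feat` commit, so the next version would be 1.5.0.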


4. Containerization

Containers solve one fundamental problem: "it works on my machine" is not a deployment strategy. A container image is a lightweight, immutable artifact that packages your application, its dependencies, and its runtime configuration into a single, portable unit; a container is a running instance of that image.

Why Containers Matter

| Benefit | Explanation |
| --- | --- |
| Reproducibility | Same image runs identically on a laptop, in CI, and in production |
| Density | Containers share the host kernel; you can run hundreds per node |
| Portability | Images run on any Linux host with a container runtime |
| Isolation | Process-level boundaries prevent dependency conflicts |
| Speed | Containers typically start in seconds or less; VMs take tens of seconds to minutes |

Container Build Pipeline

flowchart LR
    A[Application Source] --> B[Dockerfile]
    B --> C[Build Image]
    C --> D[Run Unit Tests in Container]
    D --> E[Security Scan -- Trivy/Grype]
    E --> F[Tag + Push to Registry]
    F --> G[Deploy to Runtime]
    style E fill:#f66,stroke:#333,color:#fff

Dockerfile Best Practices (2026)

# Stage 1: Build
FROM node:22-alpine AS builder
WORKDIR /app
# Copy manifests first so the dependency layer is cached until they change
COPY package.json package-lock.json ./
# npm ci installs exactly what package-lock.json specifies
RUN npm ci
COPY . .
RUN npm run build
# Drop devDependencies so only runtime packages are copied to the final image
RUN npm prune --omit=dev

# Stage 2: Production
FROM node:22-alpine AS production
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1
CMD ["node", "dist/server.js"]

Key principles embedded in this Dockerfile:

  1. Multi-stage builds separate the build environment (compilers, dev dependencies) from the runtime image. The final image is orders of magnitude smaller.
  2. Non-root user -- appuser runs the process. If the container is compromised, the attacker has minimal privileges.
  3. Layer caching -- package.json is copied before source code. Dependency installation only re-runs when lockfiles change, not on every code edit.
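  4. Health checks -- Docker and Docker Compose use HEALTHCHECK to know when the application is healthy and can act on failures automatically. Kubernetes ignores HEALTHCHECK and relies on its own liveness and readiness probes (see section 6).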
  5. Pinned base images -- node:22-alpine uses a specific major version. In production, pin the digest: node:22-alpine@sha256:abc123....
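
Principle 5 is easy to enforce in CI. The following is a rough sketch of such a check, assuming a simple Dockerfile without build-argument substitution in `FROM` lines; `unpinned_base_images` is a hypothetical helper name.

```python
def unpinned_base_images(dockerfile_text: str) -> list[str]:
    """Return FROM image references that are not pinned by digest."""
    stage_names: set[str] = set()
    unpinned: list[str] = []
    for raw_line in dockerfile_text.splitlines():
        parts = raw_line.split()
        if len(parts) < 2 or parts[0].upper() != "FROM":
            continue
        # Skip flags such as --platform=linux/amd64
        args = [part for part in parts[1:] if not part.startswith("--")]
        if not args:
            continue
        image = args[0]
        # "FROM builder" refers to an earlier stage, and "scratch" has no digest.
        is_stage_or_scratch = image in stage_names or image == "scratch"
        if not is_stage_or_scratch and "@sha256:" not in image:
            unpinned.append(image)
        # Remember "AS <name>" so later stages can build on this one.
        if len(args) >= 3 and args[1].upper() == "AS":
            stage_names.add(args[2])
    return unpinned
```

A CI step could read the Dockerfile, call this function, and fail the build if the returned list is non-empty.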

Image Optimization Checklist

| Technique | Impact |
| --- | --- |
| Multi-stage builds | Often 10-50x smaller final image |
| .dockerignore file | Prevents secrets, .git from entering build context |
| Alpine or distroless base | Fewer packages = smaller attack surface |
| COPY --chown | Avoid RUN chown layer bloat |
| Combine related RUN steps (docker build --squash is experimental and not supported by BuildKit) | Fewer layers, smaller transfer |
| Pin dependency versions | Reproducible builds across time |

Security Scanning

Every image should be scanned before it reaches the registry. Integrate Trivy or Grype into your CI pipeline:

# GitHub Actions snippet
- name: Scan image for vulnerabilities
  # @master is a moving target; pin a release tag or commit SHA in real pipelines
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: "myregistry.azurecr.io/app:${{ github.sha }}"
    severity: "CRITICAL,HIGH"
    exit-code: "1"

A CI pipeline that allows images with known critical CVEs to reach production is negligent. Security scanning is not optional; it is a gate.
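
If you run the scanner yourself instead of using the action, the gate is a few lines of code. The sketch below assumes a Trivy JSON report (as produced with `--format json`) with a top-level `Results` list whose entries may carry a `Vulnerabilities` list; `blocking_findings` is a hypothetical helper, and the report shape should be checked against the Trivy version you run.

```python
import json


def blocking_findings(report_json: str,
                      blocked: tuple[str, ...] = ("CRITICAL", "HIGH")) -> list[str]:
    """Return one summary line per vulnerability whose severity is blocked."""
    report = json.loads(report_json)
    findings: list[str] = []
    # "Results" and "Vulnerabilities" can be missing or null for clean targets.
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in blocked:
                findings.append(
                    f'{vuln.get("VulnerabilityID")} ({vuln.get("Severity")}) '
                    f'in {vuln.get("PkgName")}'
                )
    return findings
```

In a pipeline, print the findings and exit non-zero when the list is not empty, which has the same effect as `exit-code: "1"` in the action configuration above.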


5. CI/CD Fundamentals

CI/CD is the automation backbone of DevOps. Continuous Integration ensures every change is validated automatically. Continuous Deployment ensures validated changes reach users safely and quickly.

Pipeline Design Principles

| Principle | Practice |
| --- | --- |
| Build once | One artifact promoted through environments |
| Fail fast | Lint and unit tests before integration tests |
| Parallel where safe | Run independent test suites concurrently |
| Immutable artifacts | Container images or binaries, never "rebuild in prod" |
| Environment parity | Staging mirrors production infrastructure |
| Idempotent deployments | Running deploy twice produces the same result |
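
Idempotency is the least intuitive row in this table, so here is a toy sketch of what it means. The `cluster` dictionary stands in for real infrastructure state, and `deploy` is a hypothetical function, not a real deployment API.

```python
def deploy(cluster: dict[str, str], service: str, image: str) -> bool:
    """Converge `cluster` toward the desired image; return True if anything changed.

    The function describes the desired end state rather than a sequence of
    steps, so running it again with the same arguments is a no-op.
    """
    if cluster.get(service) == image:
        return False  # already at the desired state
    cluster[service] = image
    return True
```

Declarative tools such as `kubectl apply` and Helm follow the same idea: you submit the target state, and re-submitting it changes nothing.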

GitHub Actions: Reference Implementation

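GitHub Actions is one of the most widely adopted CI/CD platforms for open-source and enterprise teams. Below is a reference pipeline. The deploy jobs assume the runner already has kubectl installed and configured with credentials for the target cluster, which this snippet does not show: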

name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

permissions:
  contents: read
  packages: write
  id-token: write

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: npm
      - run: npm ci
      - run: npm run lint
      - run: npm run test:unit -- --coverage
      - uses: codecov/codecov-action@v4

  build-and-push:
    needs: lint-and-test
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: |
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy-staging:
    needs: build-and-push
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - run: |
          echo "Deploying ${{ github.sha }} to staging"
          kubectl set image deployment/app \
            app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            --namespace staging

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production
    steps:
      - run: |
          echo "Deploying ${{ github.sha }} to production"
          kubectl set image deployment/app \
            app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            --namespace production

Release Strategies

| Strategy | How It Works | Risk Level | Rollback Speed |
| --- | --- | --- | --- |
| Rolling | New pods replace old pods incrementally | Medium | Moderate |
| Blue/Green | Two identical environments; traffic switched | Low | Instant |
| Canary | Small percentage of traffic routed to new version | Lowest | Fast |
| Feature flags | Code deployed but behavior toggled per user | Lowest | Instant |

Feature flags are the most powerful release strategy because they decouple deployment from release. Code lands in production behind a flag. The product team toggles it on for 1% of users, monitors, ramps to 100%, and eventually removes the flag. This workflow requires infrastructure (LaunchDarkly, Unleash, or a homegrown solution) but pays dividends in safety.
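
The percentage ramp described above is usually implemented with deterministic bucketing, so a given user gets the same answer on every request. This is a minimal sketch of that idea, not the algorithm of any particular flag service; `bucket`, `is_enabled`, and the flag name in the usage note are made up for illustration.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """True when the user's bucket falls inside the current rollout percentage."""
    return bucket(flag, user_id) < rollout_percent
```

Because the bucket depends only on the flag name and user ID, raising `rollout_percent` from 1 to 10 to 100 only ever adds users; nobody who already saw the feature (say, under a flag named `new-checkout`) loses it mid-ramp. Including the flag name in the hash keeps different flags from always selecting the same users.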


6. Kubernetes: When You Need It

Kubernetes (K8s) is the industry standard for container orchestration. It provides automated deployment, scaling, networking, and self-healing for containerized applications. But it is also one of the most complex infrastructure platforms ever built. Do not adopt it prematurely.

When to Use What

| Scenario | Recommended Tool |
| --- | --- |
| Single host, few services | Docker Compose |
| Managed database + few services | Docker Compose or ECS |
| Serverless workloads, event-driven | AWS Lambda / Cloudflare Workers |
| Dozens of services, multiple teams | Kubernetes (managed) |
| Multi-cloud, portable workloads | Kubernetes + Helm/Kustomize |

If you cannot articulate why you need K8s, you do not need it. A managed container service (ECS, Cloud Run, App Runner) will serve you better with a fraction of the operational overhead.

Kubernetes Architecture

graph TB
    subgraph Control Plane
        API[API Server]
        ETCD[etcd -- State Store]
        SCHED[Scheduler]
        CTRL[Controller Manager]
        API --- ETCD
        API --- SCHED
        API --- CTRL
    end

    subgraph Node 1
        K1[kubelet]
        P1[Pod]
        P2[Pod]
        K1 --- P1
        K1 --- P2
    end

    subgraph Node 2
        K2[kubelet]
        P3[Pod]
        P4[Pod]
        K2 --- P3
        K2 --- P4
    end

    API -->|Watch/Push| K1
    API -->|Watch/Push| K2

    ING[Ingress Controller] -->|Route Traffic| P1
    ING -->|Route Traffic| P3
    SVC[Service -- ClusterIP/LoadBalancer] --> P1
    SVC --> P3

Core K8s Resources (Minimal Working Example)

Deployment -- declares the desired state for your pods:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  labels:
    app: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: ghcr.io/org/web-app:abc123def
          ports:
            - containerPort: 3000
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000
            initialDelaySeconds: 3
            periodSeconds: 5
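
The two probes in this Deployment answer different questions: liveness asks "should this container be restarted?", readiness asks "should this pod receive traffic?". The sketch below shows one way an application could serve both endpoints. It uses Python's standard library as a stand-in for the real service; `ProbeHandler`, `make_server`, and the `ready` event are hypothetical names.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set once startup work (config load, DB connection, cache warm-up) has finished.
ready = threading.Event()


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # Liveness: the process is up and able to answer requests.
            self._reply(200, b"ok")
        elif self.path == "/ready":
            # Readiness: only accept traffic after startup has completed.
            if ready.is_set():
                self._reply(200, b"ready")
            else:
                self._reply(503, b"starting")
        else:
            self._reply(404, b"not found")

    def _reply(self, status: int, body: bytes) -> None:
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the application logs


def make_server(port: int = 3000, host: str = "0.0.0.0") -> ThreadingHTTPServer:
    """Create (but do not start) the probe server; call serve_forever() to run it."""
    return ThreadingHTTPServer((host, port), ProbeHandler)
```

Keeping `/health` cheap and dependency-free matters: if liveness checks a database and the database is down, Kubernetes will restart every pod without fixing anything. Dependency checks belong in `/ready`.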

Service -- stable network endpoint for your pods:

apiVersion: v1
kind: Service
metadata:
  name: web-app-service
spec:
  selector:
    app: web-app
  ports:
    - port: 80
      targetPort: 3000
  type: ClusterIP

Ingress -- external traffic routing:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-app-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
    - hosts:
        - app.example.com
      secretName: web-app-tls
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-app-service
                port:
                  number: 80

Helm: Package Management for K8s

Raw YAML does not scale across environments. Helm templates parameterize your manifests:

charts/web-app/
  Chart.yaml          # Name, version, dependencies
  values.yaml         # Default configuration
  values-staging.yaml # Staging overrides
  values-prod.yaml    # Production overrides
  templates/
    deployment.yaml   # Templated deployment
    service.yaml      # Templated service
    ingress.yaml      # Templated ingress

helm install web-app ./charts/web-app -f ./charts/web-app/values-prod.yaml -n production

Helm enables environment promotion: the same chart, different values. What changed between staging and production is explicit, auditable, and version-controlled.
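
The layering of `values.yaml` and an environment override can be pictured as a recursive dictionary merge. The sketch below is an illustration of that idea, not Helm's actual implementation; `merge_values` is a hypothetical helper, and the keys in the usage note are examples.

```python
from copy import deepcopy


def merge_values(base: dict, override: dict) -> dict:
    """Return a new dict where `override` is layered on top of `base`.

    Nested mappings are merged key by key; scalars and lists in the
    override replace the base value wholesale.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged
```

With defaults of `{"replicaCount": 1, "image": {"repository": "ghcr.io/org/web-app", "tag": "latest"}}` and a production override of `{"replicaCount": 3, "image": {"tag": "abc123def"}}`, the result keeps the repository from the defaults and takes the replica count and tag from production. The override file therefore stays small and shows only what differs.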


7. Observability and GitOps

You cannot operate what you cannot see. Observability is the ability to understand the internal state of a system by examining its external outputs. It rests on three pillars:

The Three Pillars

| Pillar | Tool (2026 Recommended) | What It Answers |
| --- | --- | --- |
| Metrics | Prometheus + Grafana | "Is it slow? Is it broken?" |
| Logs | Grafana Loki or ELK Stack | "What happened when it broke?" |
| Traces | OpenTelemetry + Jaeger | "Where exactly is the latency?" |

OpenTelemetry (OTel) has become the de facto standard for instrumenting applications: it is vendor-neutral, available for most mainstream languages, and supported by the major observability platforms. If you are starting a new service today, instrument with OTel from day one.

Prometheus + Grafana Quick Start

Prometheus scrapes metrics endpoints. Grafana visualizes them. Together they form the de facto standard for Kubernetes monitoring.

# Prometheus scrape configuration
scrape_configs:
  - job_name: web-app
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

Application code exposes a /metrics endpoint in the Prometheus exposition format:

from prometheus_client import Counter, Histogram, start_http_server

REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]
)

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency in seconds",
    ["method", "endpoint"]
)

# Serve /metrics for Prometheus to scrape (port 8000 is an example value).
start_http_server(8000)

# In your request handler:
REQUEST_COUNT.labels(method="GET", endpoint="/api/users", status="200").inc()
REQUEST_LATENCY.labels(method="GET", endpoint="/api/users").observe(0.042)  # seconds

GitOps: Declarative Infrastructure at Scale

GitOps applies the DevOps principle of version control to infrastructure management. The desired state of your entire system is declared in Git. An automated agent reconciles the actual state with the declared state.

ArgoCD is the leading GitOps operator for Kubernetes:

# ArgoCD Application manifest
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/infra-manifests.git
    targetRevision: main
    path: apps/web-app/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

With this manifest in place:

  1. A developer updates the image tag in Git.
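  2. ArgoCD detects the change on its next poll of the repository (every three minutes by default), or almost immediately if a Git webhook is configured.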
  3. ArgoCD applies the new manifest to the cluster.
  4. If a manual kubectl edit drifts the state, ArgoCD self-heals back to the Git state.

The repository is the single source of truth. kubectl apply is replaced by git push. Audit trail, rollback, and access control all leverage Git's native capabilities.
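
The reconcile step at the heart of this workflow can be sketched in a few lines. This is a toy model, not Argo CD's implementation: resources are reduced to name-to-spec strings, and `plan_sync` and `apply_plan` are hypothetical helpers. The `prune` parameter mirrors the `prune: true` setting in the manifest above.

```python
def plan_sync(desired: dict[str, str], actual: dict[str, str],
              prune: bool = True) -> list[tuple[str, str]]:
    """Compare the state declared in Git with the live state and list actions."""
    actions: list[tuple[str, str]] = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))  # includes reverting manual drift
    if prune:
        # Anything live that is no longer declared in Git gets removed.
        for name in sorted(actual.keys() - desired.keys()):
            actions.append(("delete", name))
    return actions


def apply_plan(desired: dict[str, str], actual: dict[str, str],
               actions: list[tuple[str, str]]) -> None:
    """Mutate `actual` so that it matches `desired` for every planned action."""
    for action, name in actions:
        if action == "delete":
            actual.pop(name, None)
        else:
            actual[name] = desired[name]
```

Running `plan_sync` again after `apply_plan` returns an empty list, which is the "in sync" state. Self-healing is the same loop triggered by a change in the cluster instead of a change in Git.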


8. Platform Engineering Trend

Platform engineering is the discipline of designing and building toolchains and workflows that give software engineering organizations self-service capabilities. The platform team treats the developer experience as a product.

The Internal Developer Platform

An IDP provides:

| Capability | Example Implementation |
| --- | --- |
| Service scaffolding | Backstage software templates |
| Infrastructure provisioning | Terraform modules + self-service UI |
| CI/CD pipeline generation | Pre-configured GitHub Actions workflows |
| Observability onboarding | Auto-instrumented dashboards and alerts |
| Documentation portal | Backstage TechDocs (Markdown in Git) |

Backstage: The Reference IDP

Backstage (originally open-sourced by Spotify) is the most widely adopted IDP framework in 2026. It provides:

  • Software Catalog -- a registry of every service, website, and data pipeline in the organization, with ownership metadata.
  • Software Templates -- golden paths that scaffold a new service with CI/CD, monitoring, and documentation pre-configured.
  • TechDocs -- documentation that lives alongside code, rendered automatically.
  • Plugin Ecosystem -- integrations with CI/CD, cloud providers, incident management, and cost tools.

Golden Paths

A golden path is an opinionated, supported, default workflow for a common task. It is not the only way, but it is the easiest and safest way.

Example golden path for "create a new microservice":

  1. Developer selects "Go microservice" template in Backstage.
  2. Template generates a repository with: Dockerfile, GitHub Actions workflow, Helm chart, OTel instrumentation, and a TechDocs stub.
  3. Developer writes business logic. The platform handles the rest.
  4. On push, CI builds, scans, and deploys to a preview environment.
  5. On merge to main, ArgoCD promotes to staging, then production.

The golden path encodes organizational best practices. Deviation is allowed, but the default is secure, observable, and deployable.

FinOps in CI/CD

Cloud cost awareness is no longer a finance-team-only concern. In 2026, FinOps practices are embedded into the CI/CD pipeline:

  • Cost estimation on PRs -- tools like Infracost comment on pull requests with estimated cost changes for infrastructure modifications.
  • Resource right-sizing -- CI jobs that compare actual resource utilization to requested resources and suggest adjustments.
  • Spend alerts per team -- Grafana dashboards that break down cloud spend by service and team, refreshed daily.
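
# Infracost PR comment integration
- name: Set up Infracost
  uses: infracost/actions/setup@v3
  with:
    api-key: ${{ secrets.INFRACOST_API_KEY }}

- name: Generate cost estimate
  run: infracost breakdown --path . --format json --out-file /tmp/infracost.json

- name: Post cost comment
  run: |
    infracost comment github --path /tmp/infracost.json \
      --repo ${{ github.repository }} \
      --pull-request ${{ github.event.pull_request.number }} \
      --github-token ${{ github.token }} \
      --behavior update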
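
The right-sizing bullet above can be sketched as a small calculation. The function name `suggest_cpu_request`, the 30% headroom, and the 20% tolerance are assumptions chosen for illustration; real values depend on your workload and on where the usage data comes from.

```python
from typing import Optional


def suggest_cpu_request(requested_millicores: int,
                        p95_usage_millicores: int,
                        headroom: float = 0.3,
                        tolerance: float = 0.2) -> Optional[int]:
    """Suggest a new CPU request, or None if the current one is close enough.

    The target is observed p95 usage plus headroom. Small differences are
    ignored so the job does not nag about every minor fluctuation.
    """
    target = int(round(p95_usage_millicores * (1 + headroom)))
    if abs(target - requested_millicores) <= tolerance * requested_millicores:
        return None
    return target
```

For the Deployment in section 6, which requests 100m CPU, a measured p95 of 80m yields no suggestion, while a p95 of 200m yields a suggested request of 260m. A CI job could post that suggestion as a PR comment next to the cost estimate.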

9. What's Next

This guide covered the foundation. Here is where to go deeper:

| Level | Next Steps |
| --- | --- |
| Beginner | Complete the Version Control and Containerization labs. Deploy your first GitHub Actions pipeline. |
| Intermediate | Build a multi-stage Docker pipeline with security scanning. Write Kubernetes manifests and Helm charts. |
| Advanced | Set up ArgoCD with Prometheus + Grafana. Evaluate Backstage for your organization. Implement FinOps in your CI/CD. |

Recommended Learning Path

Git Fundamentals --> Docker & Containers --> CI/CD with GitHub Actions
        |                    |                       |
        v                    v                       v
   Branch Protection   Multi-stage Builds    Release Strategies
        |                    |                       |
        v                    v                       v
   Conventional Commits  Security Scanning    Feature Flags
                                                    |
                                                    v
                                    Kubernetes (when you need it)
                                                    |
                                                    v
                                    Observability (Prometheus/Grafana/OTel)
                                                    |
                                                    v
                                    GitOps (ArgoCD) + Platform Engineering

Books and Resources

| Resource | Focus Area |
| --- | --- |
| Accelerate -- Forsgren, Humble, Kim | DORA metrics, DevOps research |
| The Phoenix Project -- Kim, Behr, Spafford | DevOps novel, culture |
| Site Reliability Engineering -- Google | SRE practices, SLIs/SLOs |
| Team Topologies -- Skelton, Pais | Team structures, platform teams |
| Platform Engineering on Kubernetes -- Salatino | Building platforms on Kubernetes, IDP design |

DevOps is not a destination. It is a continuous practice of shortening feedback loops, automating toil, and building systems that are safe to change. The tools will change. The principles will not.
