I'm a Site Reliability Engineer focused on building, operating, and scaling highly available, fault-tolerant infrastructure that supports mission-critical services.
I work at the intersection of infrastructure, networking, automation, and reliability engineering, ensuring systems remain stable, observable, and resilient under real production load.
- Run & maintain high-availability infrastructure supporting thousands of users across multi-region environments
- Manage 100+ production servers & VMs using OpenStack, with continuous VM health checks & recovery workflows
- Operate Kubernetes services: pod health monitoring, rollout validation & deep log analysis
- Manage Docker workloads: container lifecycle, image updates & container log debugging
- Build & operate observability systems with Prometheus & Grafana for proactive incident detection
- Lead incident response, RCA, automation & service restoration for mission-critical systems


