20+ years building resilient systems that don't break when things go wrong. I help engineering teams move faster while keeping production stable, secure, and actually reliable.
Currently leading Chaos Engineering for a major financial services firm, teaching systems to fail gracefully before customers notice.
Want to know more about my career journey? Chat with my AI Alter Ego - it responds exactly like I would, trained on my complete professional background and available 24/7.
Chaos Engineering
Breaking things on purpose so they don't break by accident. Reduced high-severity incidents by 30% and medium- and low-severity incidents by 70% through Game Days and controlled failure injection.
Site Reliability Engineering
Keeping production running when everyone else is asleep. Built monitoring, automation, and incident response systems that cut MTTD by 37% and MTTR by 28%.
Cloud & Infrastructure
15 years deep in AWS, managing everything from a few servers to massive distributed systems. Saved companies $9M+ through smart automation and resource optimization.
DevOps & Platform Engineering
Built CI/CD pipelines that took deployment time from days to minutes. Made developers 3x more productive by removing friction and automating the boring stuff.
AI in Operations
Applying Generative AI, LLMs, and Agentic AI to make incident response smarter and faster. Building chatbots that actually help instead of just looking cool.
Leadership
Mentored teams from 4 to 50+ engineers across the US, India, China, and Australia. I believe in teaching people to solve problems, not just follow runbooks.
- Uptime: Maintained five-nines availability (99.999%) for financial services applications handling billions in transactions
- Speed: Cut release cycle time by 50-80% through Kubernetes, Terraform, and smart automation
- Cost: Delivered $10.83M in savings across Expedia and Arcesium through intelligent tooling and cloud optimization
- Resilience: Achieved 100% compliance with RTO/RPO targets in disaster recovery drills quarter after quarter
- Security: Zero security incidents across multiple organizations by baking security into developer workflows (SAST, DAST, SCA)
- Incidents: Reduced production downtime by 20-50% through proactive monitoring and chaos engineering
Chaos Engineering: Gremlin (certified), LitmusChaos, Chaos Monkey, custom Python tools
Cloud: AWS (Solutions Architect certified), 15 years of production experience
Containers: Kubernetes, Docker, EKS, ECS, Fargate, Helm, Istio
Infrastructure as Code: Terraform, AWS CloudFormation
CI/CD: Jenkins, ArgoCD, GitOps workflows, MLOps
Monitoring: Datadog, Prometheus, Grafana, CloudWatch, New Relic, Splunk, ELK
Programming: Python (automation, ML, chatbots), Bash
AI/ML: AWS Bedrock, SageMaker, building LLM-powered tools for SRE, AIOps
I share what I learn on Medium, focusing on Chaos Engineering, SRE practices, Generative AI applications in reliability engineering, and making systems more resilient:
- Chaos Engineering Requirement Analysis (CERA) - Why requirement analysis matters more than the chaos test itself
- Your AI Model Can Be Poisoned With 250 Documents (And You'd Never Know) - Security risks in Generative AI systems and how to detect them
- We Passed Every Chaos Test in Staging. Production Still Melted Down. - Why staging success doesn't guarantee production resilience
- When Infra Is Green and the Funnel Is Red - Solving the mystery when all systems look healthy but customers can't complete actions
- Taming the Token Limit: Smart Ways to Manage Conversation History in LLMs and Agentic AI - Practical techniques for handling context windows in AI applications
Morgan Stanley (via Capgemini) - Senior Manager, Chaos Engineering
Leading a global team driving resilience across wealth management platforms
RingCentral - Director, DevOps & SRE
Built the foundation for their India market launch, won Best Debut and Best Team awards
Arcesium - Associate Director, Platform SRE
Implemented chaos engineering, disaster recovery, and error budget frameworks for hedge fund systems
Expedia Group - Senior Manager
Led global SRE teams, built automation that won hackathons and saved millions
Plus earlier roles at Guavus, HCL, and startups where I learned how to keep things running with duct tape and determination.
I love working with teams that care about both speed and reliability. If you're building systems that need to stay up when it matters, let's talk.
Open to discussing:
- Chaos Engineering leadership roles
- SRE/Platform Engineering positions (Staff+ level)
- DevOps leadership with a focus on reliability
- Advisory or consulting on resilience engineering
- LinkedIn: linkedin.com/in/a-datta
- Medium: @AbhishekDatta22
- Email: contactabhishekdatta@gmail.com
If you're fighting fires in production or want to make your systems more resilient, I'm happy to chat. I've probably broken and fixed the same thing you're dealing with.