📧 [email protected] | 📱 +44 7776796147
I am a Senior Principal Software Engineer and technical leader with 13+ years of experience architecting and scaling mission-critical systems in the internet industry. I specialize in transforming complex technical challenges into strategic business outcomes through innovative platform engineering and data infrastructure solutions.
As a hands-on technical leader, I've successfully guided cross-functional teams through large-scale cloud migrations, built production data platforms from the ground up, and established DevOps practices that enable hundreds of engineers to deliver thousands of code changes daily. My expertise spans the full spectrum of modern infrastructure, from designing cloud-native architectures to implementing real-time analytics platforms that process millions of metrics.
- Linux, Docker, Kubernetes, ECS, EKS
- AWS, Kubernetes
- Jenkins, ArgoCD, Argo Workflows, Github Actions
- Grafana, Datadog, Scalyr
- Python, Django, Go
- Terraform, Git
- Scalable Deployment Strategy: Deploying diverse applications across multiple data centers with precision.
- Real-Time Monitoring Systems: Building systems that measure millions of metrics & logs in near real-time.
- Process Automation: Crafting enterprise-grade applications to automate manual processes.
- CI/CD Administration: Managing continuous integration systems using Jira, Git, Gerrit, Jenkins, Nexus, and Sonar.
- Cloud Infrastructure: Leveraging AWS and Docker for robust containerization and cloud solutions.
- Deployment Automation: Developing strategies based on project tech stacks to support zero-downtime deployments.
- Jenkins Pipelines: Creating efficient pipelines to handle over 1K code changes daily by hundreds of engineers.
- Automation & Reliability: Innovating to reduce Time to Detection (TTD) and Time to Resolution (TTR), minimizing manual effort to maximize uptime.
- Cloud Migration: Adopting cloud methodologies to migrate applications from data centers to AWS.
- Monitoring Infrastructure: Designing systems that record millions of metrics in near real-time for centralized dashboards and alerts.
- Directly managed a team of 10 DevOps Engineers, ensuring weekly progress and delivering on organizational and team goals.
- Promoted a well-documented approach to projects and incidents, building a comprehensive knowledge base for the team.
I was honored to receive the "Impact Innovator of the Year 2017-18" award at the MMT Town Hall Meet, presented by the CEO as part of the Ring of Honour recognition.
Master of Technology - MTech Data Science & Engineering
April 2021 - March 2023
June 2007 - June 2011
Position | Company | Location | Tenure |
---|---|---|---|
Platform Engineer | Fresha | 🇬🇧 London | March 2023 - present |
Principal Consultant | Wipro UK | 🇬🇧 London | October 2022 - March 2023 [6 months] |
DevOps Manager | Klevu | 🇬🇧 London | May 2022 - October 2022 [6 months] |
Senior Principal Software Engineer | MakeMyTrip & GoIbibo | 🇮🇳 Gurgaon | April 2021 - May 2022 [1 year, 2 months] |
Principal Software Engineer | MakeMyTrip & GoIbibo | 🇮🇳 Gurgaon | October 2019 - April 2021 [1 year, 7 months] |
Lead Systems Engineer | MakeMyTrip | 🇮🇳 Gurgaon | April 2018 - October 2019 [1 year, 7 months] |
Senior Software Engineer II | MakeMyTrip | 🇮🇳 Gurgaon | February 2016 - April 2018 [2 years, 3 months] |
Developer | Wize Commerce | 🇮🇳 Gurgaon | July 2013 - February 2016 [2 years, 8 months] |
Software Developer | Czentrix | 🇮🇳 Gurgaon | August 2011 - June 2013 [1 year, 11 months] |
Built a comprehensive data lakehouse platform from the ground up, establishing Fresha as one of the first UK companies to run StarRocks in production. This project involved creating a robust, scalable infrastructure to support modern data analytics and real-time processing capabilities.
Infrastructure Foundation:
- AWS Landing Zone: Designed and implemented a secure, multi-account AWS landing zone using Terraform, establishing governance, security, and networking foundations for the data platform.
- EKS Cluster Provisioning: Created production-ready Amazon EKS clusters with Terraform, implementing best practices for security, networking, and resource management.
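As a small, hedged illustration of how the Terraform-provisioned clusters could be verified after apply, the sketch below uses boto3 to check that an EKS cluster is ACTIVE and to read the OIDC issuer used for IRSA; the cluster name and region are placeholders, not the actual configuration.

```python
# Hypothetical post-provisioning check for a Terraform-created EKS cluster.
import boto3

def describe_eks_cluster(name: str, region: str = "eu-west-2") -> dict:
    eks = boto3.client("eks", region_name=region)
    cluster = eks.describe_cluster(name=name)["cluster"]
    return {
        "status": cluster["status"],        # expect "ACTIVE" once provisioning finishes
        "version": cluster["version"],      # Kubernetes version
        "endpoint": cluster["endpoint"],    # API server endpoint
        "oidc_issuer": cluster["identity"]["oidc"]["issuer"],  # used when creating IRSA roles
    }

if __name__ == "__main__":
    print(describe_eks_cluster("data-platform-eks"))  # placeholder cluster name
```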
Platform Operations:
- GitOps Deployment: Leveraged ArgoCD for continuous deployment of foundational applications including:
  - EBS CSI Driver for persistent storage management
  - External Secrets Operator for secure secrets management
  - Datadog for comprehensive observability and monitoring
  - Cluster Autoscaler for dynamic resource scaling
  - External DNS for automated DNS management
- AWS MSK Integration: Provisioned and configured Amazon Managed Streaming for Apache Kafka (MSK) clusters for real-time data streaming.
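A minimal sketch (cluster name and region are placeholders) of how downstream components such as Kafka Connect, Schema Registry or Flink jobs can discover the provisioned MSK cluster's bootstrap brokers via boto3:

```python
# Illustrative MSK broker lookup; names are placeholders.
import boto3

def msk_bootstrap_brokers(cluster_name: str, region: str = "eu-west-2") -> str:
    kafka = boto3.client("kafka", region_name=region)
    clusters = kafka.list_clusters(ClusterNameFilter=cluster_name)["ClusterInfoList"]
    if not clusters:
        raise RuntimeError(f"No MSK cluster matching {cluster_name!r}")
    arn = clusters[0]["ClusterArn"]
    brokers = kafka.get_bootstrap_brokers(ClusterArn=arn)
    # Prefer IAM-auth brokers if enabled, otherwise fall back to the TLS listener.
    return brokers.get("BootstrapBrokerStringSaslIam") or brokers["BootstrapBrokerStringTls"]

if __name__ == "__main__":
    print(msk_bootstrap_brokers("data-platform-msk"))  # placeholder cluster name
```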
Data Platform Components:
- Kafka Ecosystem: Deployed and managed Kafka Connect clusters and Schema Registry for data ingestion and schema evolution.
- Lake Management: Implemented Lakekeeper for data lake governance and metadata management.
- Processing Engines: Provisioned Apache Spark and Apache Flink for batch and stream processing workloads.
- Analytics Engine: Successfully deployed StarRocks as the primary OLAP database, becoming one of the first UK companies to run StarRocks in production.
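Because StarRocks speaks the MySQL wire protocol, a plain MySQL client is enough for a connectivity smoke test against the frontend (FE) node. The sketch below is illustrative only, with placeholder host, credentials and the default FE MySQL-protocol port:

```python
# Connectivity smoke test against a StarRocks FE node (placeholder values).
import pymysql

def starrocks_smoke_test() -> None:
    conn = pymysql.connect(
        host="starrocks-fe.data.internal",  # FE endpoint (placeholder)
        port=9030,                          # default FE MySQL-protocol port
        user="analytics_ro",
        password="change-me",
    )
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW DATABASES")
            print([row[0] for row in cur.fetchall()])
    finally:
        conn.close()

if __name__ == "__main__":
    starrocks_smoke_test()
```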
Impact & Innovation:
- Enabled real-time analytics capabilities across the organization
- Established a foundation for modern data architecture supporting both batch and streaming workloads
- Pioneered StarRocks adoption in the UK market, contributing to the broader data community through knowledge sharing
- Delivered a production-ready platform that scales with business needs while maintaining cost efficiency
This project demonstrated expertise in cloud-native data architecture, infrastructure as code, and emerging analytics technologies while positioning the organization at the forefront of modern data platform innovation.
GitHub Actions is a powerful automation platform that allows you to create custom CI/CD pipelines directly within your GitHub repositories.
I played a pivotal role in the design and setup of GitHub Actions for our projects. My primary responsibilities included:
- Provisioning Self-Hosted Runners: Set up and managed self-hosted runners on AWS EKS to ensure scalable and efficient execution of our CI/CD pipelines. This involved configuring the runners to handle various workloads and integrating them seamlessly with our existing infrastructure, which proved more cost-effective than GitHub-hosted runners.
- CI/CD Pipeline Creation: Designed and implemented comprehensive CI/CD pipelines using GitHub Actions and workflows. This included defining workflows for building, testing, and deploying applications, ensuring that all processes were automated and streamlined.
- Observability: Integrated Datadog CI Visibility into our GitHub Actions workflows, ensuring that any issues could be quickly identified and resolved (a small reporting sketch follows this section).
The successful implementation of GitHub Actions modernized our CI/CD infrastructure and significantly improved the Developer Experience and maintainability of our CI/CD Pipelines.
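As a hedged illustration of the kind of pipeline-health reporting that complements this setup, the sketch below pulls recent workflow runs from the GitHub REST API and computes a success rate; the owner, repository and token handling are placeholders rather than the actual configuration:

```python
# Summarize recent GitHub Actions runs for a repository (placeholder names).
import os
import requests

def workflow_run_summary(owner: str, repo: str, per_page: int = 50) -> dict:
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/actions/runs",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        params={"per_page": per_page},
        timeout=30,
    )
    resp.raise_for_status()
    runs = resp.json()["workflow_runs"]
    completed = [r for r in runs if r["status"] == "completed"]
    successes = sum(1 for r in completed if r["conclusion"] == "success")
    return {"completed": len(completed), "success_rate": successes / max(len(completed), 1)}

if __name__ == "__main__":
    print(workflow_run_summary("example-org", "example-repo"))  # placeholder repo
```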
---
#### Argo Workflows
Argo Workflows is a powerful, open-source container-native workflow engine designed to orchestrate parallel jobs within Kubernetes environments.
I played a pivotal role in the design and setup of Argo Workflows on AWS EKS. My primary responsibilities included:
- Architectural Design: Crafted the architecture for integrating Argo Workflows into our existing infrastructure, ensuring seamless compatibility with AWS EKS and alignment with our development practices.
- Pipeline Migration: Led the comprehensive migration of legacy CI/CD pipelines from CircleCI to Argo Workflows. This involved mapping out existing workflows, reconfiguring build and deployment processes, and translating all components into the Argo environment.
- Performance Optimization: Focused on optimizing the performance of the new workflows post-migration. Addressed any bottlenecks and fine-tuned configurations to enhance overall efficiency and reliability.
- Collaboration & Training: Worked closely with development and operations teams to ensure a smooth transition. Conducted training sessions and created documentation to facilitate adoption and understanding of the new workflows.
- Observability: Created custom plugins to enable Datadog CI Visibility for Argo Workflows using Datadog APIs (a simplified sketch follows below).
The successful migration modernized our CI/CD infrastructure and significantly improved the scalability and maintainability of our deployment processes.
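The actual plugins targeted Datadog's CI Visibility intake, whose exact payload schema is defined in Datadog's documentation. As a simplified, hedged stand-in, the sketch below reports a workflow duration as a custom metric through Datadog's v1 metrics API, with illustrative metric names and tags:

```python
# Submit a workflow duration to Datadog as a custom gauge metric (illustrative).
import os
import time
import requests

def report_workflow_duration(workflow: str, duration_s: float, status: str) -> None:
    payload = {
        "series": [
            {
                "metric": "argo.workflow.duration_seconds",
                "points": [[int(time.time()), duration_s]],
                "type": "gauge",
                "tags": [f"workflow:{workflow}", f"status:{status}"],
            }
        ]
    }
    resp = requests.post(
        "https://api.datadoghq.com/api/v1/series",
        headers={"DD-API-KEY": os.environ["DD_API_KEY"]},
        json=payload,
        timeout=10,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    report_workflow_duration("build-and-deploy", 312.5, "Succeeded")  # example values
```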
To capitalize on the benefits of Docker in the production environment, we began migrating services to Docker using AWS ECS as the core infrastructure platform. Key highlights of this project:
- Added support for generating Docker images by adding Dockerfiles to the source code repositories and extending the CI system to build images and push them to AWS ECR.
- Deployment orchestration: integrated a blue-green deployment approach for ECS services (a minimal boto3 sketch follows this list), including:
  - Automation for creating ECS Task Definitions
  - Automation for creating ECS Services
  - Canary metric comparison between the blue and green pools
  - State management and handling of failure scenarios
- Log management for services running on the ECS platform, using Filebeat, Kafka, Logstash, Elasticsearch, Kibana and S3.
- ECS cluster management to ensure maximum resource utilization and provide a cost-effective solution.
- Taking advantage of EC2 Spot Instances, with early detection and handling of Spot interruption notices.
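A minimal boto3 sketch of the blue-green flow described above, with placeholder cluster, service and image names; traffic switching and the canary comparison are handled by the surrounding orchestration rather than shown here:

```python
# Register a new task definition revision and roll the "green" ECS service onto it.
import boto3

ecs = boto3.client("ecs", region_name="ap-south-1")

def deploy_green(cluster: str, green_service: str, family: str, image: str) -> str:
    task_def = ecs.register_task_definition(
        family=family,
        containerDefinitions=[
            {
                "name": family,
                "image": image,             # new image tag produced by CI
                "memoryReservation": 512,   # soft memory limit (MiB)
                "essential": True,
            }
        ],
    )["taskDefinition"]["taskDefinitionArn"]

    # Point the green service at the new revision; the blue pool keeps serving
    # live traffic until the canary comparison approves the switch.
    ecs.update_service(cluster=cluster, service=green_service, taskDefinition=task_def)
    return task_def

if __name__ == "__main__":
    print(deploy_green("prod-ecs", "search-api-green", "search-api",
                       "registry.example.com/search-api:abc123"))  # placeholders
```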
Planned and executed the migration of applications and microservices hosted in data centers to the AWS Cloud.
- Designed the strategy for application provisioning using automation.
- Designed the approach for achieving blue-green deployments and wrote code to enhance the existing deployment automation tool, Edge.
- The migration involved porting more than 400 applications.
- Added advanced features like Canary deployments along with the Blue-Green approach.
- Created an event-oriented state management system for AWS resources.
Laid out the design for monitoring applications in the AWS Cloud using Logstash, Kafka, Elasticsearch, Apache Storm and OpenTSDB. The architecture enabled a hybrid solution that can monitor applications hosted either in the data center or in the AWS Cloud.
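As a simplified illustration of how a data point can land in OpenTSDB, the sketch below posts directly to its HTTP API; the real pipeline fed OpenTSDB through Kafka and Storm, and the host and metric names here are placeholders:

```python
# Push a single data point to OpenTSDB over its HTTP /api/put endpoint.
import time
import requests

def push_metric(metric: str, value: float, tags: dict) -> None:
    datapoint = {
        "metric": metric,
        "timestamp": int(time.time()),
        "value": value,
        "tags": tags,  # OpenTSDB requires at least one tag per data point
    }
    resp = requests.post("http://opentsdb.internal:4242/api/put", json=datapoint, timeout=5)
    resp.raise_for_status()

if __name__ == "__main__":
    push_metric("app.response_time_ms", 42.0, {"host": "web-01", "env": "prod"})
```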
Continuous Integration System using Jenkins Pipeline + Docker + AWS autoscaling
The Jenkins CI system is implemented to support projects built in Java, React, Go and Node.js, extensively utilizing Jenkins Pipeline concepts to build a robust, sustainable solution that enables hundreds of engineers to collaborate on more than 1K code changes per day.
Dockerized Jenkins agents with all dependencies built in are used, making the system fault tolerant and keeping infrastructure requirements as code.
Jenkins worker (agent) nodes autoscale using the EC2 plugin, which optimizes EC2 instance costs: instances are launched only when required and terminated once there are no jobs left to run.
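A hedged sketch of the scaling signal behind this behaviour: queue depth and executor utilization read over Jenkins' JSON API. The Jenkins URL and credentials are placeholders:

```python
# Read Jenkins queue depth and executor utilization as an autoscaling signal.
import os
import requests

JENKINS = "https://jenkins.internal"  # placeholder URL
AUTH = ("automation", os.environ.get("JENKINS_API_TOKEN", ""))

def scaling_signal() -> dict:
    queue = requests.get(f"{JENKINS}/queue/api/json", auth=AUTH, timeout=10).json()
    nodes = requests.get(f"{JENKINS}/computer/api/json", auth=AUTH, timeout=10).json()
    return {
        "queued_builds": len(queue["items"]),
        "busy_executors": nodes["busyExecutors"],
        "total_executors": nodes["totalExecutors"],
    }

if __name__ == "__main__":
    print(scaling_signal())
```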
Reliable, Robust, Rapid and Scalable approach to deployment automation
Salient features ensuring reliability and speed:
- Zero downtime staggered deployments
- Canary checks for metric comparison
- Auto roll-forward or roll-back based on the canary decision (a minimal sketch follows this list)
- Parallel deployments across data centers
- Application health checks
- Robust and readily available reporting
- Scheduled rollouts to production.
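A minimal sketch of the canary decision logic, assuming error-rate samples collected from the blue and green pools; the threshold and metric choice are illustrative, and the real checks compared several metrics pulled from the monitoring stack:

```python
# Decide roll-forward or roll-back by comparing green against the blue baseline.
from statistics import mean

def canary_decision(blue_error_rates: list[float], green_error_rates: list[float],
                    max_relative_regression: float = 0.10) -> str:
    baseline = mean(blue_error_rates)
    candidate = mean(green_error_rates)
    # Allow the green pool a bounded regression relative to the blue baseline.
    if candidate <= baseline * (1 + max_relative_regression):
        return "roll-forward"
    return "roll-back"

if __name__ == "__main__":
    print(canary_decision([0.010, 0.012, 0.011], [0.011, 0.012, 0.010]))  # roll-forward
    print(canary_decision([0.010, 0.012, 0.011], [0.030, 0.028, 0.033]))  # roll-back
```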
Designed a framework which can be used to gather facts about the quality of code using technologies like Jenkins, Maven, Jacoco & Sonar.
Jenkins, as the job automation framework, was used to create a pipeline of jobs that:
- Get the project source code, compile it, and publish the unit-test coverage numbers generated by running unit tests through the Maven build cycle.
- Deploy the project deliverable to a server or a Docker container.
- Execute integration tests against the server where the application is deployed.
- Generate JaCoCo reports and then run Sonar analysis on them.
- Raise alarms or break the build if coverage falls below specified thresholds (a minimal threshold-gate sketch follows below).
The system is used to benchmark project quality with each iteration before new changes are rolled out to the production environment.
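A minimal sketch of the coverage gate, assuming a JaCoCo XML report on disk; the report path and threshold are illustrative, and the real gate also fed results into Sonar:

```python
# Fail the build (non-zero exit) if JaCoCo line coverage is below a threshold.
import sys
import xml.etree.ElementTree as ET

def line_coverage(report_path: str) -> float:
    root = ET.parse(report_path).getroot()
    # Report-level counters are direct children of the <report> element.
    for counter in root.findall("counter"):
        if counter.get("type") == "LINE":
            missed = int(counter.get("missed"))
            covered = int(counter.get("covered"))
            return covered / (missed + covered)
    raise ValueError("No LINE counter found in JaCoCo report")

if __name__ == "__main__":
    coverage = line_coverage("target/site/jacoco/jacoco.xml")  # typical Maven path
    threshold = 0.80                                           # illustrative threshold
    print(f"Line coverage: {coverage:.1%} (threshold {threshold:.0%})")
    sys.exit(0 if coverage >= threshold else 1)  # non-zero exit breaks the build
```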
LB Manager is a one-stop application for managing all live traffic.
LB Manager is capable of hooking into one or more F5 load balancers and presenting the information in a single view, making it easier and quicker to take action.
It presents users with the following functionality:
- LTM Pools, Nodes and VIPs.
- GTM Pools.
- DC switch: to move traffic from one DC to another.
- Route53's entries.
The challenge of this project was gathering information from multiple sources, where such operations are often high latency. Hence, the design makes multiple parallel, controlled requests to all of the sources, with caching implemented to serve data (a minimal sketch of this pattern follows below).
Integrated this tool with AWS Route 53. The use case was to fetch DNS entries from multiple Route 53 zones and provide a way to change these entries as and when required.
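A minimal sketch of the fan-out-with-caching pattern, assuming placeholder source names and a stubbed fetcher standing in for the actual F5 and Route 53 API calls:

```python
# Query every load-balancer / DNS source in parallel with a bounded thread pool,
# caching responses briefly to hide the latency of the underlying APIs.
import time
from concurrent.futures import ThreadPoolExecutor

CACHE_TTL_SECONDS = 30
_cache: dict[str, tuple[float, dict]] = {}

def fetch_source(name: str) -> dict:
    # Placeholder for an F5 iControl or Route 53 API call.
    time.sleep(0.5)
    return {"source": name, "pools": []}

def cached_fetch(name: str) -> dict:
    cached = _cache.get(name)
    if cached and time.time() - cached[0] < CACHE_TTL_SECONDS:
        return cached[1]
    data = fetch_source(name)
    _cache[name] = (time.time(), data)
    return data

def snapshot(sources: list[str]) -> list[dict]:
    # Bounded parallelism keeps the load on the load balancers controlled.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(cached_fetch, sources))

if __name__ == "__main__":
    print(snapshot(["f5-dc1", "f5-dc2", "route53-prod"]))  # placeholder source names
```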
Took the alerting system to the next level, from static thresholds to dynamic thresholds: tracking anomalies with unsupervised machine learning over past data points and predicting whether the current data point matches the forecast.
- Data Engineering: developed a pipeline using Apache Kafka, Apache Storm and Cassandra for collecting metrics and their respective data points.
- Machine Learning: built a system around Facebook Prophet, VAR and Isolation Forest, creating models at runtime to predict values and check data points for anomalies.
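A simplified sketch of the dynamic-threshold idea using Prophet on synthetic data (older installs import Prophet from fbprophet rather than prophet); it flags the latest observation when it falls outside the forecast interval:

```python
# Flag a metric value as anomalous if it lies outside Prophet's forecast interval.
import pandas as pd
from prophet import Prophet

def is_anomalous(history: pd.DataFrame, latest_value: float) -> bool:
    """history must have columns ds (timestamp) and y (metric value)."""
    model = Prophet(interval_width=0.95)
    model.fit(history)
    future = model.make_future_dataframe(periods=1, freq="min")
    forecast = model.predict(future).iloc[-1]
    return not (forecast["yhat_lower"] <= latest_value <= forecast["yhat_upper"])

if __name__ == "__main__":
    ds = pd.date_range("2021-01-01", periods=240, freq="min")
    history = pd.DataFrame({"ds": ds, "y": [100 + (i % 10) for i in range(240)]})
    print(is_anomalous(history, latest_value=180.0))  # far above the pattern
```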
Set up, developed and integrated a new organization-wide monitoring framework.
- Set up and integrated Graphite and StatsD.
- Deployed and reused the StatsD daemon agent in application code for better performance monitoring.
- Integrated, and extended as needed, the Graphite and StatsD clients (a minimal instrumentation sketch follows this list).
- Deployed real-time dashboards using cubism.js, d3.js and backbone.js, integrated with HTML and backed by Graphite.
- For NRT monitoring, deployed a Team Dashboard built with Ruby on Rails.
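A small illustration of the kind of in-code instrumentation this enabled, using the statsd Python client; the metric names, host and port are placeholders:

```python
# Emit counters and timers to StatsD, which aggregates them into Graphite.
import statsd

stats = statsd.StatsClient("statsd.internal", 8125, prefix="checkout")  # placeholders

def record_payment(duration_ms: float, success: bool) -> None:
    stats.incr("payments.attempted")                # counter
    stats.timing("payments.duration", duration_ms)  # timer, rendered in Graphite
    if not success:
        stats.incr("payments.failed")

if __name__ == "__main__":
    record_payment(123.4, success=True)
```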
A utility to send Docker events to Kafka.
Docker Envoy aims to make customized processing of Docker events feasible. The design publishes events to Apache Kafka in near real time (NRT). The project can run as an agent on each Docker host and publish messages to a single Kafka cluster.
Since the project can process each event beforehand and publish more meaningful messages to Kafka, Apache Storm topologies can consume these messages and perform the required processing, similar to Docker-Serf. A minimal sketch of the event-forwarding loop is shown below.
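A minimal sketch of that event-forwarding loop using the Docker SDK for Python and kafka-python; the broker address and topic are placeholders:

```python
# Read events from the local Docker daemon and forward them to Kafka.
import json
import docker                     # Docker SDK for Python
from kafka import KafkaProducer   # kafka-python

def run(brokers: str = "kafka.internal:9092", topic: str = "docker-events") -> None:
    client = docker.from_env()
    producer = KafkaProducer(
        bootstrap_servers=brokers,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    host_name = client.info()["Name"]  # which Docker host produced the events
    # client.events() blocks and yields daemon events (start, stop, die, ...) as dicts.
    for event in client.events(decode=True):
        producer.send(topic, {
            "host": host_name,
            "type": event.get("Type"),
            "action": event.get("Action"),
            "event": event,
        })

if __name__ == "__main__":
    run()
```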
- Developed and integrated a change tracking system, and worked on an incident management system using J2EE.
- Developed a timer framework / batch job scheduler using the Quartz scheduler.
- Basic system-level automation and code integration with Python.
- The application tracks every change made to the live site within a date range, helping ensure uptime of the website or web application.
- Connected to different data sources using different frameworks.
Engineered deployment automation used for deploying different applications on around 2,000 servers across 4 data centres.
- Set up LinkedIn Glu agents on all the servers.
- Programming in Groovy for deployment automation scripts.
- Creating custom states required to accommodate new deployment phases.
- Automating other manual efforts using Python, Jenkins and Glu REST APIs.