Site Reliability Engineer
- Primary SRE for multiple Tier-1 applications, with end-to-end reliability ownership across AKS clusters, Azure Web Apps, and ASE-based services.
- Implemented Datadog observability across multiple AKS clusters and automated onboarding through an Azure CI pipeline.
- Led AzureRM provider v3-to-v4 migration in Terraform Enterprise for critical Tier-1 codebases and removed fragile circular-dependency logic.
- Implemented DR in East US 2 with Traffic Manager, Front Door, Web Apps failover, and Terraform-driven infrastructure.
- Owned incident management in a follow-the-sun model, covering AKS incidents, DNS/network failures, app crashes, certificate/auth failures, and pipeline disruptions.
- Drove FinOps cost optimization by auditing orphaned resources, unused quotas, and obsolete components.
- Owned AKS Overlay networking and Cilium PoC: setup, configuration, testing, and documentation of findings.
- Built AKS Fleet Manager PoC for centralized Kubernetes patching and upgrades across 17 clusters.
- Contributed to Azure Managed Redis PoC implementation/testing and Terraform module development.
- Designed Event Grid and Event Hub integration for Storage Account event streaming into Azure Data Explorer.
- Owned Terraform implementation for cross-region migration of 80+ TB of data via Azure Object Replication with zero data loss.
- Mentored two junior engineers and provided technical guidance across Azure, networking, Terraform, Docker, and Kubernetes.
- Executed 100+ production changes/releases with zero downtime using CI/CD rollout and risk mitigation strategies.
- Migrated large certificate estates to a new certificate authority using Terraform automation.
- Delivered Datadog enablement training across multiple engineering teams.
- Achievements: Quarterly Block Star Award and multiple recognitions for critical troubleshooting and root-cause resolution.