Automated CI/CD pipelines with GitHub Actions, improving system reliability and accelerating model training, testing, and deployment.
Designed and deployed scalable, modular infrastructure with Terraform to support machine learning environments, including setting up NVIDIA L4 GPUs on GKE for AI/ML workloads.
Migrated applications to container-native Wolfi OS images, reducing image size and vulnerability counts, and integrated security scanning tools such as Prisma Cloud, Twistlock, and Veracode.
Deployed MLflow on GKE for experiment tracking and model lifecycle management, enabling collaboration across teams and improving resource isolation and cost efficiency across multi-tenant clusters.
Developed a native alerting automation system using Cloud Functions and Cloud Scheduler, backed by a custom SMTP server on GKE, improving operational responsiveness; re-engineered SLO/SLI configurations to meet business SLA requirements.