Ashis Sharma

DevOps & AI Infrastructure Engineer
Hyderabad, India

About

DevOps and AI infrastructure engineer with over 5 years of experience scaling GPU-based LLM training clusters (100+ GPUs) and automating multi-cloud MLOps pipelines across GCP and Azure. Reduced deployment times by 60% and improved GPU utilization using Terraform, Helm, and Kubeflow. Focused on building scalable AI infrastructure that bridges model research and production reliability.

Work

Jukshio

DevOps Engineer

Hyderabad, Telangana, India

Summary

Led the design, implementation, and optimization of DevOps and AI infrastructure, keeping large-scale LLM operations performant and scalable.

Highlights

Orchestrated and optimized Slurm-based GPU clusters of 100+ GPUs for large-scale LLM fine-tuning.

Deployed scalable Kubernetes inference services using SGLang and vLLM, reducing latency and increasing throughput for customer workloads.

Automated infrastructure provisioning across multiple cloud environments using Terraform, Helm, and Kustomize, cutting setup time by 40%.

Designed and implemented multi-stage CI/CD pipelines in GitLab CI, reducing release cycles by 60% through integrated build, test, security scan, and deployment automation.

Strengthened compliance and security posture by implementing DevSecOps practices with SonarQube, Trivy, and a WAF using OWASP rules across all development stages.

Developed and automated a Kubeflow-based MLOps pipeline supporting 20+ model training jobs across multiple GPU clusters.

Managed and optimized the Prometheus and Grafana observability stack, reducing time-to-detect for incidents by 40%.

Migrated critical workloads from Azure to GCP, improving system reliability and reducing overall cloud costs by 25%.

Wipro

Test Engineer

Chennai, Tamil Nadu, India

Summary

Supported quality assurance by automating regression testing and developing data validation utilities that shortened QA cycles.

Highlights

Automated regression testing using Selenium WebDriver (Java), improving testing efficiency and accuracy.

Developed data validation utilities that accelerated QA cycles and verified data integrity across multiple applications.

Education

SRM University Sikkim
Gangtok, Sikkim, India

Bachelor of Science

Information Technology

Skills

Cloud & Infrastructure

Google Cloud Platform (GCP), Microsoft Azure, Terraform.

Containers & Orchestration

Kubernetes, Docker, Helm, Kustomize, K3s, Slurm.

DevOps & Automation

GitLab CI/CD, GitHub Actions, Multi-stage Pipelines.

Monitoring & Security

Prometheus, Grafana, Loki, EFK Stack, Trivy, SonarQube, WAF, OWASP Rules.

Programming

Go, Python, JavaScript, Bash.

AI & HPC

CUDA, TensorRT, vLLM, SGLang, Kubeflow.

Projects

GPU Cluster Orchestration for LLM Training

Summary

Configured and automated large-scale Slurm-based GPU clusters to support fine-tuning and inference of open-source LLMs.

End-to-End MLOps Pipeline with Kubeflow

Summary

Built and maintained a Kubeflow-based pipeline enabling reproducible model training, validation, and deployment across GPU clusters.

Edge IoT Cluster Management System

Summary

Configured and managed K3s clusters for SmartCart IoT devices deployed in remote locations.

CI/CD and DevSecOps Automation Framework

Summary

Designed a multi-stage CI/CD pipeline (build, test, scan, deploy) using GitLab CI and Terraform.