Samyak Jain

Data Engineer | Cloud Data Pipelines | Real-time Streaming
Gurugram, India

About

Data Engineer with 3 years of experience designing and optimizing scalable ETL/ELT pipelines and real-time streaming systems on AWS and GCP. Proven record of delivering high-impact data products, processing over 20M records daily and 5K events/sec with Python, PySpark, and Kafka, with a strong focus on automation, performance, and delivering ML-ready data through clean, modular architectures. Seeking to apply expertise in cloud data platforms and real-time analytics to strengthen data infrastructure and drive data-driven initiatives.

Work

Google (via Vaco Binary Semantics)

Software Engineer

Gurugram, Haryana, India

Summary

Spearheaded the development and optimization of high-performance data pipelines and real-time streaming solutions for critical business and blockchain data across Google Cloud Platform.

Highlights

Optimized a modular batch pipeline for Amazon SP API reports, processing over 10K reports daily and reducing end-to-end latency by 70% through parallelized Pub/Sub and autoscaling Dataflow jobs.

Converted complex transformation logic into reusable Dataflow Flex Templates with config-driven schema mapping and parameterization, enabling rapid onboarding of new sellers and marketplaces without code changes.

Designed and implemented a low-latency real-time pipeline to ingest, transform, and load 10M+ Ethereum transactions daily into Google's proprietary graph database, leveraging Kafka and PySpark Structured Streaming.

Reduced data propagation latency by 65% in blockchain streaming through optimized micro-batch intervals, checkpoint tuning, partitioning, and parallelized graph API ingestion.

Enhanced a global AQI data pipeline, increasing daily data coverage by 60% to over 20M records across 120+ countries, by integrating enrichment, geospatial tagging, and deduplication components.

Improved pipeline uptime to 99.8% and reduced ingestion failures by 90% in the AQI pipeline by implementing robust validation rules and schema drift handling.

Wipro

Project Engineer

Noida, Uttar Pradesh, India

Summary

Developed and automated scalable ETL pipelines and real-time streaming solutions for high-volume business data, contributing to operational analytics and efficiency improvements for the Marelli Project.

Highlights

Developed scalable ETL pipelines to ingest high-volume business data into centralized data lakes utilizing AWS Glue, Redshift, and S3.

Designed real-time data streaming flows with Kinesis and Python, enabling sub-minute operational analytics for critical business insights.

Automated reporting and reconciliation workflows using SQL and AWS QuickSight, reducing manual effort by 70%.

Education

Bharati Vidyapeeth's College of Engineering
Pune, Maharashtra, India

B.Tech

Electronics and Communication

Grade: 8.3 CGPA

Awards

Best New Talent & Lead Recommendation Award

Awarded By

Vaco

Recognized as 'Best New Talent' for rapid production delivery and exceptional performance within the first three months of tenure.

Star of the Month (x2)

Awarded By

Wipro

Recognized twice for outstanding contributions to ETL and reporting innovation, demonstrating exceptional performance and impact.

Certificates

Microsoft Certified: Azure Fundamentals

Issued By

Microsoft

Skills

Programming Languages

Python, SQL, Java.

Data Engineering

PySpark, Apache Kafka, Airflow, Hadoop, ETL/ELT Pipelines, Real-time Streaming, Data Modeling, Dataflow Flex Templates, Structured Streaming, Watermarking, Schema Drift Handling, Data Reliability.

Cloud Platforms

AWS (Redshift, S3, Lambda, Glue, Kinesis), GCP (BigQuery, Composer, Pub/Sub, Looker, Dataflow).

Databases

PostgreSQL, MySQL, Graph Databases.

Tools & Methodologies

Git, REST APIs, Shell Scripting, CI/CD, Agile, Production Support, Monitoring, AWS QuickSight.