Rajat Gupta
AWS, GCP & Databricks Certified Lead Data Engineer
New Delhi, IN
About
Highly accomplished Lead Data Engineer with 11 years of experience driving data initiatives for Fortune 500 banking, finance, and telecom clients. AWS, GCP, and Databricks certified, I specialize in designing and implementing robust big data solutions, cloud architectures, and scalable ETL pipelines. My expertise spans data governance, real-time processing, and performance optimization, consistently delivering high-impact results in complex, multi-cloud environments.
Work
Dubai, Dubai, UAE
Summary
Led data architecture and engineering for Zahid Group, focusing on enhancing data discovery, lineage, and governance through advanced schema strategies and Databricks-driven transformations.
Highlights
Developed comprehensive schema strategies and metadata layers, including a Data Catalog, to enhance data discovery, lineage, and governance for Zahid Group.
Utilized Databricks and Apache Spark to cleanse and transform complex datasets, implementing a medallion architecture for structured and semi-structured data.
Streamlined data operations and automation by building reusable utilities, reducing manual effort and improving efficiency.
Gurugram, Haryana, India
Summary
Architected and implemented robust data ingestion and transformation pipelines on GCP for American Express, ensuring efficient data flow and business logic processing.
Highlights
Created scalable data ingestion pipelines from diverse RDBMS databases to a centralized storage layer, orchestrated by Cloud Composer DAGs.
Implemented complex business-logic transformations in Databricks Spark, orchestrated through Cloud Composer jobs to support critical financial operations.
Remote, Maharashtra, India
Summary
Led the development of a new data lake for financial data processing, designing serverless architectures and optimizing CI/CD pipelines to support debt and recovery workloads.
Highlights
Built an event-based framework utilizing Cloud Functions for near real-time processing of financial data, leveraging Databricks workflows for efficiency.
Designed and implemented a serverless architecture with reusable components, adopted by multiple product teams to process financial data related to debt and recovery.
Contributed to Python Spark notebooks, SQL queries, and workflows for daily ETL data loads, ensuring high data quality and availability.
Automated CI/CD pipelines for code deployment and migration across Dev, QA, and Production environments using Terraform and Jenkins Groovy scripts, reducing deployment time and errors.
Collaborated with business stakeholders to gather requirements and translate them into technical user stories, facilitating seamless data onboarding from multiple source teams.
Remote, Scotland, United Kingdom
Summary
Developed a new data lake on AWS, creating efficient data ingestion and ETL pipelines to support critical data processing for the bank's operations.
Highlights
Created data ingestion pipelines from various RDBMS databases to AWS S3, leveraging AWS DMS for efficient data transfer and integration.
Developed robust ETL data pipelines using PySpark, Glue Jobs, Lambda, DynamoDB, Athena, and S3 with Parquet files, optimizing data processing for the new data lake.
Built new infrastructure in AWS using Terraform and a CI/CD Pipeline with Bitbucket and TeamCity, enhancing deployment efficiency and reliability.
Gurugram, Haryana, India
Summary
Led the design and implementation of ETL pipelines on AWS cloud for ZS Associates (Amgen), focusing on data transformation, warehousing, and performance optimization.
Highlights
Led requirements gathering, design, and implementation for ETL pipelines on AWS cloud, ensuring alignment with business needs for data integration.
Developed and designed reusable libraries, frameworks, and utilities for ETL data pipelines, data transformations, warehousing, and validations.
Optimized performance for PySpark data pipelines, significantly improving data load times and overall system efficiency for daily reports on AWS Databricks.
Greater Noida, Uttar Pradesh, India
Summary
Managed real-time data ingestion via Kafka and spearheaded data transformation and migration projects for major clients like Citigroup and Vodafone.
Highlights
Handled real-time data ingestion via Kafka with Spark Streaming for large-scale data processing, ensuring timely and efficient data availability.
Translated complex functional and technical requirements into detailed high and low-level designs, facilitating effective project execution.
Integrated HSM API with Kafka to provide hardware and software-level 256-bit encryption for secure debit/credit card transactions, enhancing data security.
Played a key role in data transformation using Spark scripts for structured and semi-structured data, improving data quality and usability.
Performed data analysis by implementing various machine learning algorithms via Spark ML, tuning hyperparameters with GridSearchCV for enhanced model performance.
Gurugram, Haryana, India
Summary
Developed PySpark scripts for tariff plan regression and managed data ingestion for new data lakes, supporting critical KPIs and improving end-user satisfaction.
Highlights
Developed PySpark scripts to calculate traditional and ad-hoc KPIs from structured and semi-structured data, providing critical insights for tariff plan regression.
Managed data ingestion with Sqoop and performed data cleaning and manipulation with Spark scripts, ensuring data quality and readiness for analysis.
Performed ETL from the ENIQ Oracle database to HDFS using Sqoop and processed data with Hive scripts for the new LTE network data lake, supporting critical KPIs.
Gurugram, Haryana, India
Summary
Developed a new dashboard to accommodate all KPIs for 3G and 4G networks, migrating from a legacy PHP system to Java.
Highlights
Migrated a legacy dashboard written in PHP to a new Java-based dashboard, enhancing performance and scalability for 3G and 4G network KPIs.
New Delhi, Delhi, India
Summary
Developed data entity beans and established relationships for the Auto Giant Organization portal, enhancing data management and application functionality.
Highlights
Developed Data Entity Beans (POJOs) and established relationships between them, improving data integrity and application architecture for the portal.
Education
Awards
Intel® Edge AI Scholarship
Awarded By
Intel & Udacity
Awarded for excellence in Edge AI.
80% Scholarship for PGD in Data Science
Awarded By
Swades Foundation (NGO)
Awarded a significant scholarship for Post Graduate Diploma in Data Science.
Publications
Skills
Big Data Tools
HDFS, MapReduce, Sqoop, Hive, Pig, Impala, Oozie, ZooKeeper, Spark, Kafka, Airflow, dbt.
Cloud Platforms
AWS (EC2, S3, RDS, Redshift, EMR, SNS, DynamoDB, SageMaker, Glue, DMS, Lambda, SQS, Kinesis), Azure (ADLS, ADF), GCP (BigQuery, Bigtable, Cloud SQL, Dataproc, Cloud Composer, Kubernetes, Dataform).
Databases
Oracle, MySQL, Teradata, SQL Server, PostgreSQL.
DevOps & CI/CD
Docker, Kubernetes, Terraform, Jenkins.
Data Platforms & Distributions
Cloudera, Databricks, Confluent Kafka, AWS EMR.
IDEs & Developer Tools
Eclipse, IBM RAD, PyCharm, IntelliJ IDEA, Jupyter Notebook, IBM Rational Team Concert, Autosys.
Programming Languages
Core Java, Python, MATLAB, C++, PHP.
Build Tools, Web Services & Collaboration
Maven, Gradle, sbt, REST API, Bitbucket, Git, Jira.
Python Packages
scikit-learn, NumPy, Pandas, SciPy, Plotly, Pyplot, Beautiful Soup, Matplotlib, Math, Random, Seaborn, statsmodels.
Machine Learning
Linear/Logistic Regression, SVM, Random Forests, PCA, K-Means, Hierarchical Clustering, Time Series, ANN.