Sourav Jha

Results-Driven Data Engineer
Hyderabad, India

About

Results-driven Data Engineer with over 3 years of experience in Big Data technologies and Google Cloud Platform (GCP), specializing in designing, implementing, and optimizing data pipelines. Proven ability to enhance Spark job performance by 30% and improve data processing times by 20% within consumer banking environments. Expert in building scalable, reusable big data solutions with a strong focus on data validation, governance, and delivering analysis-ready datasets for reporting and machine learning use cases.

Work

Lloyds Banking Group

Data Engineer (Big Data)

Hyderabad, Telangana, India

Summary

Led data engineering initiatives for a major banking group, focusing on the ingestion, storage, and processing of large datasets to deliver analysis-ready solutions.

Highlights

Led the ingestion, storage, and processing of large datasets, utilizing Hadoop and Spark (including Spark SQL) for complex data transformations.

Developed and optimized robust ETL/ELT pipelines using SQL, Python, BigQuery, GCS, Dataflow/Dataproc, Spanner, and Cloud Composer.

Built scalable and reusable data pipelines, ensuring high data consistency, integrity, and readiness for transformation across the enterprise.

Developed analysis-ready data assets by implementing robust validation and governance checks at multiple stages, enhancing data reliability by 15%.

Collaborated with cross-functional teams (business analysts, product owners, data scientists) to deliver harmonized, trusted datasets for critical business insights.

Utilized Hive, Sqoop, HBase, and Kafka for distributed processing and real-time data ingestion, supporting high-throughput data workflows.

Applied Change Data Capture (CDC) and Slowly Changing Dimensions (SCD) to synchronize source and warehouse data, improving data accuracy and timeliness.

Automated data ingestion and transformation tasks using Unix Shell Scripting and CRON jobs, reducing manual effort by 20% and improving pipeline efficiency.

Infosys

Data Engineer (Big Data)

Hyderabad, Telangana, India

Summary

Built and tuned scalable, reusable big data pipelines on Cloudera Hadoop, delivering large-scale transformation logic for business intelligence and analytics.

Highlights

Engineered scalable and reusable data pipelines using Hadoop, Spark, and Spark SQL to support large-scale data transformation logic for diverse business needs.

Improved Spark job performance by 30% through targeted job tuning and resource optimization, enhancing processing efficiency and reducing operational costs.

Designed and implemented data validation frameworks, ensuring schema consistency and load accuracy for critical datasets.

Supported data modeling efforts across SQL-based and NoSQL-based storage systems (HBase, Hive), optimizing data structure for improved query performance.

Collaborated with stakeholders to translate complex business needs into robust, high-throughput data workflows, improving data accessibility for analytics.

Automated ingestion tasks and pipeline triggers using Shell Scripting and CRON jobs, streamlining data flow and reducing manual intervention.

Managed job orchestration and scheduling using Oozie within Cloudera Hadoop environments, ensuring timely and reliable data processing.

Contributed to the development of analysis-ready datasets, enabling reporting and machine learning use cases for enhanced business insights.

OpenText

Data Engineering Intern

Hyderabad, Telangana, India

Summary

Assisted in the development and optimization of data pipelines for large-scale data processing within a dynamic data engineering team.

Highlights

Assisted in developing and optimizing data pipelines using Hadoop, Hive, and Spark for large-scale data processing initiatives.

Supported data ingestion tasks, including efficiently importing and exporting data from HDFS using Sqoop, ensuring data availability.

Contributed to projects leveraging Cloudera Hadoop Distribution, HBase, and various data formats like AVRO and CSV files, enhancing data handling capabilities.

Gained practical experience in distributed data processing environments, supporting the team in delivering robust data solutions.

Education

Jawaharlal Nehru Technological University
Hyderabad, Telangana, India

B.Tech

Electronics and Communication Engineering

Languages

English
Hindi

Skills

Big Data Technologies

Hadoop, Hive, SQL, HBase, Kafka, Apache Spark (Spark Core, Spark SQL, Spark Streaming, PySpark).

Cloud Platforms

GCP (Dataproc, Cloud Spanner, BigQuery, Dataflow, Cloud Composer, GCS).

Data Ingestion & Transformation

Sqoop, ETL/ELT Pipelines, Change Data Capture (CDC), Slowly Changing Dimensions (SCD), Data Modeling, Data Governance, Data Validation Frameworks.

Programming Languages

Python, Scala.

DevOps & Automation

CI/CD Workflows, Jenkins, Maven, Unix Shell Scripting, CRON jobs, Oozie Workflow, Apache Airflow.

Databases

MySQL, Oracle, Data Lake.

Operating Systems

Unix, Windows.

Methodologies

Agile, Data Warehousing Concepts, BDD (Behavior-Driven Development).

Domain Knowledge

Consumer Banking, Regulated Environments (e.g., Pharma).

Projects

Optimized Data Pipelines for Consumer Banking

Summary

Developed and optimized data pipelines for a large consumer banking organization to enhance data processing efficiency and reliability.

Real-Time Data Streaming Solutions

Summary

Engineered real-time data streaming solutions to facilitate faster and more reliable data ingestion for analytical platforms.

Data Warehousing and CDC Implementation

Summary

Designed and implemented Change Data Capture (CDC) mechanisms for a financial data warehouse, ensuring timely and accurate data updates.