Sourav Jha

Results-Driven Data Engineer
Hyderabad, India

About

Results-driven Data Engineer with over 3 years of experience in Big Data technologies and Google Cloud Platform (GCP), specializing in designing, implementing, and optimizing data pipelines. Proven ability to enhance Spark job performance by 30% and improve data processing times by 20% within consumer banking environments. Expert in building scalable, reusable big data solutions with a strong focus on data validation, governance, and delivering analysis-ready datasets for reporting and machine learning use cases.

Work

Lloyds Banking Group

Data Engineer (Big Data)

Hyderabad, Telangana, India

Summary

Led data engineering initiatives for a major banking group, focusing on the ingestion, storage, and processing of large datasets to deliver analysis-ready solutions.

Highlights

Led the ingestion, storage, and processing of large datasets, utilizing Hadoop and Spark (including Spark SQL) for complex data transformations.

Developed and optimized robust ETL/ELT pipelines using SQL, Python, BigQuery, GCS, Dataflow/Dataproc, Spanner, and Cloud Composer.

Built scalable and reusable data pipelines, ensuring high data consistency, integrity, and readiness for transformation across the enterprise.

Developed analysis-ready data assets by implementing robust validation and governance checks at multiple stages, enhancing data reliability by 15%.

Collaborated with cross-functional teams (business analysts, product owners, data scientists) to deliver harmonized, trusted datasets for critical business insights.

Utilized Hive, Sqoop, HBase, and Kafka for distributed processing and real-time data ingestion, supporting high-throughput data workflows.

Applied Change Data Capture (CDC) and Slowly Changing Dimensions (SCD) to synchronize source and warehouse data, improving data accuracy and timeliness.

Automated data ingestion and transformation tasks using Unix Shell Scripting and CRON jobs, reducing manual effort by 20% and improving pipeline efficiency.

Infosys

Data Engineer (Big Data)

Hyderabad, Telangana, India

Summary

Built and tuned scalable, reusable big data pipelines on Cloudera Hadoop, delivering large-scale transformation logic for business intelligence and analytics.

Highlights

Engineered scalable and reusable data pipelines using Hadoop, Spark, and Spark SQL to support large-scale data transformation logic for diverse business needs.

Improved Spark job performance by 30% through targeted job tuning and resource optimization, enhancing processing efficiency and reducing operational costs.

Designed and implemented data validation frameworks, ensuring schema consistency and load accuracy for critical datasets.

Supported data modeling efforts across SQL-based and NoSQL-based storage systems (HBase, Hive), optimizing data structure for improved query performance.

Collaborated with stakeholders to translate complex business needs into robust, high-throughput data workflows, improving data accessibility for analytics.

Automated ingestion tasks and pipeline triggers using Shell Scripting and CRON jobs, streamlining data flow and reducing manual intervention.

Managed job orchestration and scheduling using Oozie within Cloudera Hadoop environments, ensuring timely and reliable data processing.

Contributed to the development of analysis-ready datasets, enabling reporting and machine learning use cases for enhanced business insights.

OpenText

Data Engineering Intern

Hyderabad, Telangana, India

Summary

Assisted in the development and optimization of data pipelines for large-scale data processing within a dynamic data engineering team.

Highlights

Assisted in developing and optimizing data pipelines using Hadoop, Hive, and Spark for large-scale data processing initiatives.

Supported data ingestion tasks, including efficiently importing and exporting data from HDFS using Sqoop, ensuring data availability.

Contributed to projects leveraging Cloudera Hadoop Distribution, HBase, and various data formats like AVRO and CSV files, enhancing data handling capabilities.

Gained practical experience in distributed data processing environments, supporting the team in delivering robust data solutions.

Education

Jawaharlal Nehru Technological University
Hyderabad, Telangana, India

B.Tech

Electronics and Communication Engineering

Languages

English
Hindi

Skills

Big Data Technologies

Hadoop, Hive, SQL, HBase, Kafka, Apache Spark (Spark Core, Spark SQL, Spark Streaming, PySpark).

Cloud Platforms

GCP (Dataproc, Cloud Spanner, BigQuery, Dataflow, Cloud Composer, GCS).

Data Ingestion & Transformation

Sqoop, ETL/ELT Pipelines, Change Data Capture (CDC), Slowly Changing Dimensions (SCD), Data Modeling, Data Governance, Data Validation Frameworks.

Programming Languages

Python, Scala.

DevOps & Automation

CI/CD Workflows, Jenkins, Maven, Unix Shell Scripting, CRON jobs, Oozie Workflow, Apache Airflow.

Databases

MySQL, Oracle, Data Lake.

Operating Systems

Unix, Windows.

Methodologies

Agile, Data Warehousing Concepts, BDD (Behavior-Driven Development).

Domain Knowledge

Consumer Banking, Regulated Environments (e.g., Pharma).

Projects

Optimized Data Pipelines for Consumer Banking

Summary

Developed and optimized data pipelines for a large consumer banking organization to enhance data processing efficiency and reliability.

Real-Time Data Streaming Solutions

Summary

Engineered real-time data streaming solutions to facilitate faster and more reliable data ingestion for analytical platforms.

Data Warehousing and CDC Implementation

Summary

Designed and implemented Change Data Capture (CDC) mechanisms for a financial data warehouse, ensuring timely and accurate data updates.