SRI HARSHA KATTULA

Cloud Data Engineer | Big Data Specialist

Hyderabad, IN.

About

Cloud Data Engineer with 4 years of experience building scalable data pipelines in Big Data ecosystems, including Spark and Hadoop. Experienced in migrating on-premises data infrastructure to Google Cloud Platform (GCP) and optimizing data processing for efficiency and performance. Uses Scala, Python, and Spark to deliver robust ETL solutions that support critical business functions.

Work

INVENTZO SYSTEMS (INDIA) PVT LTD. (Client: HSBC)

Data Engineer (GCP)

Hyderabad, Telangana, India

Feb 2024 → Present

Summary

As a Data Engineer for the Climate Aligned Finance project at HSBC, I am responsible for migrating and optimizing large-scale data pipelines within the Google Cloud Platform ecosystem, ensuring high performance and data quality for critical financial data.

Highlights

Migrated large-scale data pipelines from on-premises Hadoop to Google Cloud Platform (GCP) using GCS and Dataproc, reducing processing overhead and improving scalability.

Developed and refactored Spark jobs in Scala for efficient ingestion and transformation of diverse file formats (CSV, JSON, Avro) into Parquet on GCS, significantly improving storage efficiency and query performance.

Converted PySpark pipelines to Scala Spark for the Climate Aligned Finance (CAF) project and tuned Spark configurations (executor memory, shuffle partitions, dynamic partitioning, caching) to keep the transition seamless and improve performance.

Implemented data quality validation frameworks, built Jenkins CI/CD pipelines to automate the build, test, and deployment of Spark jobs, and orchestrated ETL workflows with Apache Airflow, improving reliability and operational readiness.

INVENTZO SYSTEMS (INDIA) PVT LTD. (Client: HP)

Big Data Developer

Hyderabad, Telangana, India

Dec 2022 → Jan 2024

Summary

As a Big Data Developer for a key client project at HP, I analyzed extensive datasets, developed validation frameworks, and optimized ingestion and transformation processes to enhance data reliability and performance.

Highlights

Analyzed extensive datasets using Hive queries and Pig scripts, and managed Hive tables (partitioning, bucketing, loading) to integrate and analyze new data effectively.

Developed a robust Spark-Scala framework to validate incoming source files based on structure, count, and null values, improving data quality and reliability.

Extracted data from Oracle RDBMS via Sqoop for processing in Spark, optimized existing ingestion and transformation frameworks, and deployed the changes to the production cluster to improve performance.

Monitored and troubleshot production jobs to ensure continuous operation and stability of critical data pipelines.

INVENTZO SYSTEMS (INDIA) PVT LTD. (Client: Zeotics)

Big Data Developer

Hyderabad, Telangana, India

Aug 2021 → Dec 2022

Summary

As a Big Data Developer for the Zeotics project, I engineered a Spark Scala ingestion framework that moved data from SQL sources into a data lake and calculated policy claim amounts, and I automated job execution via Oozie workflows.

Highlights

Engineered a Spark Scala ingestion framework that transferred data from SQL sources to a data lake, stored the transformed data in Hive, and calculated the policy claim amount for each policy ID.

Scheduled complex data processing jobs with Oozie workflows for automated execution, improving operational efficiency and reliability.

Imported data from diverse sources (AWS S3, local file system) into Spark RDDs and rewrote Hive/SQL queries as optimized Spark transformations in Scala, improving processing speed.

Built Spark jobs in Scala using the DataFrame and Spark SQL APIs for faster processing of large datasets, and optimized production cluster jobs for performance and stability.

Education

SRKR Engineering College

Bhimavaram, Andhra Pradesh, India

Aug 2017

→

Jun 2021

B.Tech

Engineering

Languages

English

Hindi

Telugu

Skills

Big Data Ecosystems

Spark Core, Spark SQL, Hive, HDFS, Hadoop, Sqoop, Oozie.

Programming & Scripting

Scala, Python, PySpark, Unix Shell Scripting.

Cloud Technologies

Google Cloud Platform (GCP), BigQuery, Dataproc, Google Cloud Storage (GCS), Pub/Sub, Kafka, Cloud Composer, Airflow, AWS S3.

Databases & Data Processing

RDBMS, Spark RDD, DataFrames, Spark SQL API, HiveQL, Parquet, CSV, JSON, Avro.

Development & Operations

Software Development Life Cycle (SDLC), Agile Methodologies, CI/CD, Jenkins, Git, Jira.

Operating Systems

Unix/Linux, Windows.

IDEs

Eclipse, IntelliJ.