SRI HARSHA KATTULA
Cloud Data Engineer | Big Data Specialist
Hyderabad, IN
About
Cloud Data Engineer with 4 years of experience building scalable data pipelines in Big Data ecosystems, including Spark and Hadoop. Experienced in migrating on-premise data infrastructure to Google Cloud Platform (GCP) and optimizing data processing for efficiency and performance. Uses Scala, Python, and Spark to deliver robust ETL solutions for critical business functions.
Work
Hyderabad, Telangana, India
Summary
As a Data Engineer for the Climate Aligned Finance project at HSBC, I am responsible for migrating and optimizing large-scale data pipelines within the Google Cloud Platform ecosystem, ensuring high performance and data quality for critical financial data.
Highlights
Migrated large-scale data pipelines from on-premise Hadoop to Google Cloud Platform (GCP) using GCS and Dataproc, reducing processing overhead and enhancing scalability.
Developed and refactored Spark jobs in Scala for efficient ingestion and transformation of diverse file formats (CSV, JSON, Avro) into Parquet on GCS, significantly improving storage efficiency and query performance.
Converted PySpark pipelines to Scala Spark for the Climate Aligned Finance (CAF) project and tuned Spark configurations (executor memory, shuffle partitions, dynamic partitioning, caching) to improve performance.
Implemented data quality validation frameworks, deployed CI/CD pipelines via Jenkins to automate the build, test, and deployment of Spark jobs, and orchestrated ETL workflows with Apache Airflow, improving reliability and operational readiness.
Hyderabad, Telangana, India
Summary
As a Big Data Developer for a key client project at HP, I analyzed extensive datasets, developed validation frameworks, and optimized ingestion and transformation processes to enhance data reliability and performance.
Highlights
Analyzed extensive datasets using Hive queries and Pig scripts, and managed Hive tables (partitioning, bucketing, loading) to integrate and analyze new data effectively.
Developed a robust Spark-Scala framework to validate incoming source files based on structure, count, and null values, improving data quality and reliability.
Imported data from Oracle RDBMS via Sqoop for processing in Spark, and optimized existing ingestion/transformation frameworks, deploying changes to the production cluster to enhance performance.
Monitored and troubleshot production jobs to ensure continuous operation and stability of critical data pipelines.
Hyderabad, Telangana, India
Summary
As a Big Data Developer for the Zeotics project, I engineered a Spark Scala ingestion framework that moved data from SQL sources into a data lake and calculated policy claim amounts, with job execution automated via Oozie workflows.
Highlights
Engineered a Spark Scala ingestion framework to transfer data from SQL sources to a data lake, storing transformed data in Hive, and accurately calculated policy claim amounts for each policy ID.
Scheduled complex data processing jobs using Oozie workflows for automated execution, enhancing operational efficiency and reliability.
Imported data from diverse sources (AWS S3, LFS) into Spark RDDs and converted Hive/SQL queries into optimized Spark transformations using Spark RDDs and Scala, improving processing speed.
Implemented Spark jobs in Scala using DataFrames and the Spark SQL API for significantly faster processing of large datasets, and optimized production cluster jobs to ensure high performance and stability.
Languages
English
Hindi
Telugu
Skills
Big Data Ecosystems
Spark Core, Spark SQL, Hive, HDFS, Hadoop, Sqoop, Oozie.
Programming & Scripting
Scala, Python, PySpark, Unix Shell Scripting.
Cloud Technologies
Google Cloud Platform (GCP), BigQuery, Dataproc, Google Cloud Storage (GCS), Pub/Sub, Kafka, Cloud Composer, Airflow, AWS S3.
Databases & Data Processing
RDBMS, Spark RDD, DataFrames, Spark SQL API, HiveQL, Parquet, CSV, JSON, Avro.
Development & Operations
Software Development Life Cycle (SDLC), Agile Methodologies, CI/CD, Jenkins, Git, Jira.
Operating Systems
Unix/Linux, Windows.
IDEs
Eclipse, IntelliJ.