About
Results-driven Data Engineer with over 3 years of experience in Big Data technologies and Google Cloud Platform (GCP), specializing in designing, implementing, and optimizing data pipelines. Proven ability to enhance Spark job performance by 30% and improve data processing times by 20% within consumer banking environments. Expert in building scalable, reusable big data solutions with a strong focus on data validation, governance, and delivering analysis-ready datasets for reporting and machine learning use cases.
Work
Hyderabad, Telangana, India
Summary
Led data engineering initiatives for a major banking group, focusing on the ingestion, storage, and processing of large datasets to deliver analysis-ready solutions.
Highlights
Led the ingestion, storage, and processing of large datasets, utilizing Hadoop and Spark (including Spark-SQL) for complex data transformations.
Developed and optimized robust ETL/ELT pipelines using SQL, Python, BigQuery, GCS, Dataflow/Dataproc, Cloud Spanner, and Cloud Composer.
Built scalable and reusable data pipelines, ensuring high data consistency, integrity, and readiness for transformation across the enterprise.
Developed analysis-ready data assets by implementing robust validation and governance checks at multiple stages, enhancing data reliability by 15%.
Collaborated with cross-functional teams (business analysts, product owners, data scientists) to deliver harmonized, trusted datasets for critical business insights.
Utilized Hive, Sqoop, HBase, and Kafka for distributed processing and real-time data ingestion, supporting high-throughput data workflows.
Applied Change Data Capture (CDC) and Slowly Changing Dimensions (SCD) to synchronize source and warehouse data, improving data accuracy and timeliness.
Automated data ingestion and transformation tasks using Unix shell scripting and cron jobs, reducing manual effort by 20% and improving pipeline efficiency.
Hyderabad, Telangana, India
Summary
Built and tuned scalable, reusable Hadoop/Spark data pipelines, improving Spark job performance by 30% and delivering large-scale transformation logic for business intelligence.
Highlights
Engineered scalable and reusable data pipelines using Hadoop, Spark, and Spark-SQL to support large-scale data transformation logic for diverse business needs.
Improved Spark job performance by 30% through targeted job tuning and resource optimization, enhancing processing efficiency and reducing operational costs.
Designed and implemented data validation frameworks, ensuring schema consistency and load accuracy for critical datasets.
Supported data modeling efforts across SQL-based and NoSQL-based storage systems (HBase, Hive), optimizing data structure for improved query performance.
Collaborated with stakeholders to translate complex business needs into robust, high-throughput data workflows, improving data accessibility for analytics.
Automated ingestion tasks and pipeline triggers using shell scripting and cron jobs, streamlining data flow and reducing manual intervention.
Managed job orchestration and scheduling using Oozie within Cloudera Hadoop environments, ensuring timely and reliable data processing.
Contributed to the development of analysis-ready datasets, enabling reporting and machine learning use cases for enhanced business insights.
Hyderabad, Telangana, India
Summary
Supported a dynamic data engineering team in building and tuning pipelines for large-scale data processing.
Highlights
Assisted in developing and optimizing data pipelines using Hadoop, Hive, and Spark for large-scale data processing initiatives.
Supported data ingestion tasks, efficiently importing data into and exporting it from HDFS using Sqoop to ensure data availability.
Contributed to projects built on the Cloudera Hadoop Distribution and HBase, handling multiple data formats including Avro and CSV.
Gained practical experience in distributed data processing environments, supporting the team in delivering robust data solutions.
Languages
English
Hindi
Skills
Big Data Technologies
Hadoop, Hive, SQL, HBase, Kafka, Apache Spark (Spark Core, Spark SQL, Spark Streaming, PySpark).
Cloud Platforms
GCP (Dataproc, Cloud Spanner, BigQuery, Dataflow, Cloud Composer, GCS).
Data Ingestion & Transformation
Sqoop, ETL/ELT Pipelines, Change Data Capture (CDC), Slowly Changing Dimensions (SCD), Data Modeling, Data Governance, Data Validation Frameworks.
Programming Languages
Python, Scala.
DevOps & Automation
CI/CD workflows, Jenkins, Maven, Unix shell scripting, cron jobs, Apache Oozie, Apache Airflow.
Databases
MySQL, Oracle, Data Lake.
Operating Systems
Unix, Windows.
Methodologies
Agile, Data Warehousing Concepts, BDD (Behavior-Driven Development).
Domain Knowledge
Consumer Banking, Regulated Environments (e.g., Pharma).