→ Architecting and Building a Data Lake with 1M+ Companies
• Designed a data lake with a Lambda architecture to extract business insights from the data of 1M+ companies.
• Designed data governance (lineage, catalog, and audit) with Apache Atlas, and data security management with Kerberos, Sentry, Hive roles, and ACL policies.
• Generated data in Avro and Parquet formats with Snappy compression (see the sketch after this list).
• Managed the physical servers (configuration, temperature monitoring, adding/exchanging parts), the OS (network, users), and the applications using YARN, Docker, and Docker Compose.
• Improved query performance by 48x compared to the transactional environment.
• Tech Stack: Oracle Exadata, Cloudera, Hadoop
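A minimal PySpark sketch of the dual-format output step mentioned above; the paths, input source, and spark-avro package version are assumptions for illustration, not the production job:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("company-ingestion")
    # spark-avro ships as an external package; version is assumed here
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.3.0")
    .getOrCreate()
)

companies = spark.read.json("/landing/companies/")  # hypothetical landing zone

# Parquet + Snappy for the columnar serving/query layer
(companies.write
    .mode("overwrite")
    .option("compression", "snappy")
    .parquet("/lake/curated/companies_parquet"))

# Avro + Snappy for the row-oriented raw layer
(companies.write
    .mode("overwrite")
    .format("avro")
    .option("compression", "snappy")
    .save("/lake/raw/companies_avro"))
```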
→ Creating a Multi-Tenant Environment for Data Scientists
• Designed and built a multi-tenant data analysis and modeling environment where users could run analyses directly on the cluster without having to install and configure libraries or Spark jobs themselves (a configuration sketch follows the tech stack).
• Tech Stack: JupyterHub, Spark, YARN, Docker, Kerberos, Ansible
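A minimal sketch of the kind of JupyterHub configuration this implies, assuming DockerSpawner and a pre-built notebook image with the Spark and Kerberos clients; the image name and credential-cache mount layout are hypothetical:

```python
# jupyterhub_config.py
c = get_config()  # noqa: F821 -- injected by JupyterHub at load time

# One isolated container per user, so nobody installs libraries on the cluster
c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
c.DockerSpawner.image = "registry.local/datalab-notebook:latest"  # hypothetical

# Point notebook kernels at the cluster's YARN resource manager
c.DockerSpawner.environment = {
    "SPARK_MASTER": "yarn",
    "HADOOP_CONF_DIR": "/etc/hadoop/conf",
}

# Mount per-user Kerberos credential caches read-only (assumed host layout)
c.DockerSpawner.volumes = {
    "/var/krb5/user/{username}": {"bind": "/tmp/krb5cc", "mode": "ro"},
}
```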
→ Architecting and Building a Data Warehouse with 300 TB+
• Built multiple pipelines for daily data ingestion and processing.
• Automated the creation of databases, tables, and partitions, as well as statistics updates (see the sketch after this list).
• Managed 300 TB+ of data on Hive/Impala.
• Enabled accounting analyses over a full month of data to run in a few minutes, which was not possible before.
• Tech Stack: HDFS, Hive, Impala, YARN, HUE, PySpark, Airflow
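A minimal sketch of the daily DDL-and-statistics automation, using impyla against Impala; the database, table, column, and host names are hypothetical:

```python
from datetime import date
from impala.dbapi import connect

DB, TABLE = "dw_accounting", "ledger_entries"  # hypothetical names
partition = date.today().strftime("%Y%m%d")

conn = connect(host="impala-coordinator.local", port=21050)  # hypothetical host
cur = conn.cursor()

# Idempotent setup: database and partitioned Parquet table
cur.execute(f"CREATE DATABASE IF NOT EXISTS {DB}")
cur.execute(f"""
    CREATE TABLE IF NOT EXISTS {DB}.{TABLE} (
        account_id BIGINT,
        amount     DECIMAL(18,2)
    )
    PARTITIONED BY (ds STRING)
    STORED AS PARQUET
""")

# Register today's partition, then refresh planner statistics for it
cur.execute(f"ALTER TABLE {DB}.{TABLE} ADD IF NOT EXISTS PARTITION (ds='{partition}')")
cur.execute(f"COMPUTE INCREMENTAL STATS {DB}.{TABLE} PARTITION (ds='{partition}')")
```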
→ Optimizing a Data Pipeline from 30-Day Latency to < 4 Hours
• Developed a Python framework, using object-oriented programming and the Template Method design pattern, to ingest and process more than 300k records/day (see the sketch after this list).
• Architected data movement between tasks using Redis.
• Reduced the framework's end-to-end latency from 30 days to under 4 hours.
• Tech Stack: Postgres, Oracle 19c, Airflow, Pytest, PySpark, Celery, Arrow, Hive, Impala, HDFS, YARN, Docker, Azure DevOps, Ansible.
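A minimal sketch of the Template Method structure with a Redis handoff between tasks; the class names, queue name, and in-memory example source are hypothetical stand-ins for the real connectors:

```python
import json
from abc import ABC, abstractmethod

import redis


class Pipeline(ABC):
    """Template Method: run() fixes the algorithm; subclasses fill in the steps."""

    def __init__(self, queue: str):
        self.queue = queue
        self.redis = redis.Redis(host="localhost", port=6379)  # assumed broker

    def run(self) -> None:
        # Invariant skeleton shared by every source-specific pipeline
        records = self.extract()
        transformed = [self.transform(r) for r in records]
        self.load(transformed)

    @abstractmethod
    def extract(self) -> list[dict]: ...

    @abstractmethod
    def transform(self, record: dict) -> dict: ...

    def load(self, records: list[dict]) -> None:
        # Hand records to the next task through a Redis list
        for record in records:
            self.redis.rpush(self.queue, json.dumps(record))


class OraclePipeline(Pipeline):
    def extract(self) -> list[dict]:
        return [{"id": 1, "name": " ACME "}]  # stand-in for an Oracle query

    def transform(self, record: dict) -> dict:
        return {**record, "name": record["name"].strip()}


if __name__ == "__main__":
    OraclePipeline(queue="ingest:oracle").run()
```

The design choice here is that adding a new source only means subclassing and overriding extract()/transform(), while orchestration and the Redis handoff stay in the base class.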