Designing an ETL Pipeline for a Data Warehouse
This project was my final submission for a database course taught by Dr. Gheibi. It was divided into three main phases:
Database Design: The first phase required designing a database for a library, ensuring it adhered to the fifth normal form (5NF).
ETL Pipeline: The second phase involved creating an ETL (Extract, Transform, Load) pipeline to synchronize a data warehouse with the operational database.
Time Machine: The final phase was to develop a "time machine" feature that could restore the database to a specific point in the past.

For the ETL pipeline, a common approach is to replicate each database operation individually, but this can create significant overhead. Instead, I modeled the database as a Directed Acyclic Graph (DAG), where tables and their relationships were represented as vertices and edges. I then used a topological sort of this DAG to determine the optimal order for applying bulk insert, delete, and update operations.
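
To make the ordering idea concrete, here is a minimal sketch in Python. It is an illustration, not the project's actual code. The table names (authors, members, books, loans) and the helpers load_order and plan_sync are hypothetical, and I am assuming that an edge means "this table references that one". The sort itself uses graphlib.TopologicalSorter from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical library schema (an assumption for illustration):
# each table maps to the set of tables it references.
DEPENDENCIES = {
    "authors": set(),
    "members": set(),
    "books": {"authors"},
    "loans": {"books", "members"},
}


def load_order(dependencies):
    """Return tables ordered so every referenced table precedes its dependents."""
    # TopologicalSorter treats the mapped values as predecessors,
    # so static_order() yields referenced tables first.
    # It raises graphlib.CycleError if the graph is not a DAG.
    return list(TopologicalSorter(dependencies).static_order())


def plan_sync(dependencies):
    """Build a per-operation table order for one bulk synchronization pass."""
    order = load_order(dependencies)
    return {
        # Referenced rows must exist before the rows that point to them.
        "insert": order,
        "update": order,
        # Dependent rows must be removed before the rows they point to.
        "delete": order[::-1],
    }


if __name__ == "__main__":
    for operation, tables in plan_sync(DEPENDENCIES).items():
        print(operation, "->", tables)
```

In this sketch, bulk inserts and updates run in topological order, while bulk deletes run in the reverse order so that no row is removed while another row still references it. Tables with no path between them (authors and members here) can be processed in either order.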