Designing an ETL Pipeline for a Data Warehouse

Koorosh Moslemi · November 13, 2021

ETL PostgreSQL Python

This was my final project for a database course, which was instructed by Dr.Gheibi. This project had three phases:

First, one had to design a database for a library in the fifth normal form.
The second phase was about creating an ETL pipeline to synchronize a data warehouse with changes in the database.
Last but not least, one had to develop a time machine to restore the database to some time in the past.

One straightforward solution to the second part was to imitate changes operation by operation. This approach bypasses a lot of complexities, but it makes lots of overhead.

My solution was to model the database as a DAG, meaning tables and relations were mapped to vertices and edges. Then, I used a topological order of this DAG to figure out the order I should apply insert, delete and update operations in bulk.

In order to implement a versioning functionality in the last phase, I created one table and two views for each table in the original database. Each table in the warehouse had a column with tstzrange datatype. This column was meant to store valid time ranges for each row in the original table.

For more details and complete source code, please check out this repository.