The pipeline consists of various modules, including the GoodReads Python Wrapper and the ETL jobs. The ETL flow works as follows:
- Once the data is moved to the working zone, a Spark job is triggered which reads the data from the working zone and applies the transformations. The dataset is then repartitioned and moved to the Processed Zone (see the PySpark sketch after this list).
- The warehouse module of the ETL job picks up data from the Processed Zone and stages it into the Redshift staging tables. Using the staging tables, an UPSERT operation is performed on the data warehouse tables to bring the dataset up to date (a sketch of this pattern follows the list). The ETL job execution is complete once the data warehouse is updated.
- The Airflow DAG runs a data quality check on all warehouse tables once the ETL job execution is completed.
- The Airflow DAG also has analytics queries configured in a custom-designed operator. These queries are run, and another data quality check is performed on selected analytics tables (a simple check is sketched below together with the UPSERT step). The DAG execution completes after these data quality checks.
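To make the transformation step concrete, here is a minimal PySpark sketch. The bucket names, zone prefixes, and column names are hypothetical placeholders, not the project's actual values:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical S3 locations; the real working/processed zone paths differ.
WORKING_ZONE = "s3://example-goodreads-landing/working/reviews/"
PROCESSED_ZONE = "s3://example-goodreads-landing/processed/reviews/"

spark = SparkSession.builder.appName("goodreads-transform").getOrCreate()

# Read the raw CSV data from the working zone.
reviews = spark.read.csv(WORKING_ZONE, header=True, inferSchema=True)

# Example transformation: drop duplicate reviews and normalise a timestamp column.
transformed = (
    reviews.dropDuplicates(["review_id"])
           .withColumn("review_added_date", F.to_date("date_added"))
)

# Repartition the dataset and write it to the processed zone.
transformed.repartition(10).write.mode("overwrite").csv(PROCESSED_ZONE, header=True)
```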
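The staging/UPSERT step and a simple data quality check can be sketched with psycopg2 as below. The table names, join key, and connection details are hypothetical; since Redshift has no native UPSERT statement, the usual delete-then-insert pattern from a staging table is shown:

```python
import psycopg2

# Hypothetical connection details and table names.
conn = psycopg2.connect(
    host="example-cluster.redshift.amazonaws.com",
    port=5439, dbname="goodreads", user="etl_user", password="***",
)

UPSERT_SQL = """
-- Remove warehouse rows that are about to be replaced by fresher staged data.
DELETE FROM warehouse.reviews
USING staging.reviews
WHERE warehouse.reviews.review_id = staging.reviews.review_id;
-- Insert the staged rows into the warehouse table.
INSERT INTO warehouse.reviews SELECT * FROM staging.reviews;
"""

QUALITY_CHECK_SQL = "SELECT COUNT(*) FROM warehouse.reviews;"

with conn, conn.cursor() as cur:
    cur.execute(UPSERT_SQL)
    cur.execute(QUALITY_CHECK_SQL)
    row_count = cur.fetchone()[0]
    # A simple data quality rule: the table must not be empty after the load.
    if row_count == 0:
        raise ValueError("Data quality check failed: warehouse.reviews is empty")
```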
EMR: I used a 3-node cluster with the instance types below:
Redshift: For Redshift I used a 2-node cluster with the instance types below:
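Both clusters can also be created programmatically. A minimal boto3 sketch is shown below; the cluster names, IAM roles, region, and instance types are placeholders, not necessarily what the project uses:

```python
import boto3

# Hypothetical EMR cluster: 1 master + 2 core nodes running Spark.
emr = boto3.client("emr", region_name="us-west-2")
emr.run_job_flow(
    Name="goodreads-etl-cluster",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

# Hypothetical 2-node Redshift cluster for the warehouse.
redshift = boto3.client("redshift", region_name="us-west-2")
redshift.create_cluster(
    ClusterIdentifier="goodreads-warehouse",
    NodeType="dc2.large",
    ClusterType="multi-node",
    NumberOfNodes=2,
    MasterUsername="etl_user",
    MasterUserPassword="ExamplePassw0rd",
    DBName="goodreads",
)
```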
I have written detailed instructions on how to set up Airflow using an AWS CloudFormation script. Check out: Airflow using AWS CloudFormation.
The project uses sshtunnel to submit Spark jobs over an SSH connection from the EC2 instance. This setup does not install sshtunnel for Apache Airflow automatically; you can install it by running a command like the one shown below. The ETL jobs in the project use psycopg2 to connect to the Redshift cluster and run the staging and warehouse queries, so psycopg2 needs to be installed on EMR as well. A script to create the cluster automatically is also provided.
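A minimal sketch of the install commands, assuming a standard pip-based setup (the exact commands used in the original setup may differ):

```bash
# On the Airflow machine: install sshtunnel so the DAG can open SSH connections.
pip install sshtunnel

# On the EMR master node (and wherever the warehouse queries run):
# install the PostgreSQL/Redshift driver.
pip install psycopg2-binary
```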
How to run

Make sure the Airflow webserver and scheduler are running. Open the Airflow UI at http://<airflow-host>:<port>.
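If the services are not already running, a typical way to start them (assuming a standard Airflow installation) is:

```bash
# Start the Airflow webserver on the default port (adjust -p as needed).
airflow webserver -p 8080

# In a separate shell, start the scheduler so the DAG is picked up and run.
airflow scheduler
```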
GoodReads Pipeline DAG
DAG View:
DAG Tree View:
DAG Gantt View: