The pipeline consists of various modules, including the GoodReads Python Wrapper and the ETL jobs. The ETL flow works as follows:
- Once the data is moved to the working zone, a Spark job is triggered which reads the data from the working zone and applies the transformations. The dataset is then repartitioned and moved to the Processed Zone (see the PySpark sketch after this list).
- The warehouse module of the ETL job picks up data from the Processed Zone and stages it into the Redshift staging tables. Using the staging tables, an UPSERT operation is performed on the data warehouse tables to bring the dataset up to date (a sketch of this pattern follows the list). The ETL job execution is complete once the data warehouse is updated.
- The Airflow DAG runs a data quality check on all warehouse tables once the ETL job execution is completed.
- The Airflow DAG also has analytics queries configured in a custom-designed operator. These queries are run, and another data quality check is performed on selected analytics tables (a simple check is sketched below together with the UPSERT step). The DAG execution completes after these data quality checks.
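To make the transformation step concrete, here is a minimal PySpark sketch. The bucket names, zone prefixes, and column names are hypothetical placeholders, not the project's actual values:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical S3 locations; the real working/processed zone paths differ.
WORKING_ZONE = "s3://example-goodreads-landing/working/reviews/"
PROCESSED_ZONE = "s3://example-goodreads-landing/processed/reviews/"

spark = SparkSession.builder.appName("goodreads-transform").getOrCreate()

# Read the raw CSV data from the working zone.
reviews = spark.read.csv(WORKING_ZONE, header=True, inferSchema=True)

# Example transformation: drop duplicate reviews and normalise a timestamp column.
transformed = (
    reviews.dropDuplicates(["review_id"])
           .withColumn("review_added_date", F.to_date("date_added"))
)

# Repartition the dataset and write it to the processed zone.
transformed.repartition(10).write.mode("overwrite").csv(PROCESSED_ZONE, header=True)
```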
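The staging/UPSERT step and a simple data quality check can be sketched with psycopg2 as below. The table names, join key, and connection details are hypothetical; since Redshift has no native UPSERT statement, the usual delete-then-insert pattern from a staging table is shown:

```python
import psycopg2

# Hypothetical connection details and table names.
conn = psycopg2.connect(
    host="example-cluster.redshift.amazonaws.com",
    port=5439, dbname="goodreads", user="etl_user", password="***",
)

UPSERT_SQL = """
-- Remove warehouse rows that are about to be replaced by fresher staged data.
DELETE FROM warehouse.reviews
USING staging.reviews
WHERE warehouse.reviews.review_id = staging.reviews.review_id;
-- Insert the staged rows into the warehouse table.
INSERT INTO warehouse.reviews SELECT * FROM staging.reviews;
"""

QUALITY_CHECK_SQL = "SELECT COUNT(*) FROM warehouse.reviews;"

with conn, conn.cursor() as cur:
    cur.execute(UPSERT_SQL)
    cur.execute(QUALITY_CHECK_SQL)
    row_count = cur.fetchone()[0]
    # A simple data quality rule: the table must not be empty after the load.
    if row_count == 0:
        raise ValueError("Data quality check failed: warehouse.reviews is empty")
```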
EMR: I used a 3-node cluster with the instance types below:
Redshift: For Redshift I used a 2-node cluster with the instance types below:
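Both clusters can also be created programmatically. A minimal boto3 sketch is shown below; the cluster names, IAM roles, region, and instance types are placeholders, not necessarily what the project uses:

```python
import boto3

# Hypothetical EMR cluster: 1 master + 2 core nodes running Spark.
emr = boto3.client("emr", region_name="us-west-2")
emr.run_job_flow(
    Name="goodreads-etl-cluster",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

# Hypothetical 2-node Redshift cluster for the warehouse.
redshift = boto3.client("redshift", region_name="us-west-2")
redshift.create_cluster(
    ClusterIdentifier="goodreads-warehouse",
    NodeType="dc2.large",
    ClusterType="multi-node",
    NumberOfNodes=2,
    MasterUsername="etl_user",
    MasterUserPassword="ExamplePassw0rd",
    DBName="goodreads",
)
```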
I have written detailed instructions on how to set up Airflow using an AWS CloudFormation script. Check out: Airflow using AWS CloudFormation.
The project uses sshtunnel to submit Spark jobs over an SSH connection from the EC2 instance. This setup does not install sshtunnel for Apache Airflow automatically; you can install it by running a command like the one shown below. The ETL jobs in the project use psycopg2 to connect to the Redshift cluster and run the staging and warehouse queries, so psycopg2 needs to be installed on EMR as well. A script to create the cluster automatically is also provided.
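A minimal sketch of the install commands, assuming a standard pip-based setup (the exact commands used in the original setup may differ):

```bash
# On the Airflow machine: install sshtunnel so the DAG can open SSH connections.
pip install sshtunnel

# On the EMR master node (and wherever the warehouse queries run):
# install the PostgreSQL/Redshift driver.
pip install psycopg2-binary
```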
How to run

Make sure the Airflow webserver and scheduler are running. Open the Airflow UI at http://<airflow-host>:<port>.
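If the services are not already running, a typical way to start them (assuming a standard Airflow installation) is:

```bash
# Start the Airflow webserver on the default port (adjust -p as needed).
airflow webserver -p 8080

# In a separate shell, start the scheduler so the DAG is picked up and run.
airflow scheduler
```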
GoodReads Pipeline DAG
DAG View:
DAG Tree View:
DAG Gantt View: