
san089 / goodreads_etl_pipeline, Hacker News

The pipeline consists of various modules:

  • GoodReads Python Wrapper

  • ETL Jobs

  • Redshift Warehouse Module

    Pipeline Architecture

    Data is captured in real time from the goodreads API using the GoodReads Python Wrapper (Fetch Data Module). The data collected from the API is stored on local disk and periodically moved to the Landing Bucket on AWS S3. ETL jobs are written in Spark and scheduled in Airflow to run every 22 minutes.
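    As a rough illustration of that first hop from local disk into S3, here is a minimal sketch using boto3; the bucket name, local directory, and key prefix are placeholders, not the project's actual configuration:

```python
import os

import boto3

# Hypothetical local path and bucket name, for illustration only; the real
# pipeline's locations live in its own configuration.
LOCAL_DIR = "/tmp/goodreads_capture"
LANDING_BUCKET = "goodreads-landing-bucket"

s3 = boto3.client("s3")

def move_capture_to_landing_zone():
    """Upload files captured from the goodreads API, then clear the local copies."""
    for name in os.listdir(LOCAL_DIR):
        path = os.path.join(LOCAL_DIR, name)
        s3.upload_file(path, LANDING_BUCKET, f"landing/{name}")
        os.remove(path)  # local disk is only a temporary staging area

if __name__ == "__main__":
    move_capture_to_landing_zone()
```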

  • Setting Up Airflow

    The project uses sshtunnel to submit Spark jobs over an SSH connection from the EC2 instance running Airflow. This setup does not install sshtunnel for Apache Airflow automatically, so it has to be installed separately (for example via pip) before the DAGs can submit jobs, as sketched below.
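    One common way to drive this from Airflow is the SSH operator (the component whose extras pull in sshtunnel); the sketch below is an assumption about the wiring rather than the repository's actual DAG code, and the connection id, schedule, and script path are made up:

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.ssh_operator import SSHOperator  # Airflow 1.10-style import

# Hypothetical DAG and connection names, for illustration only.
dag = DAG(
    dag_id="goodreads_spark_submit_example",
    start_date=datetime(2020, 1, 1),
    schedule_interval="*/22 * * * *",  # matches the 22-minute cadence described above
    catchup=False,
)

# Runs spark-submit on the EMR master node over SSH from the Airflow EC2 instance.
submit_etl = SSHOperator(
    task_id="submit_spark_etl",
    ssh_conn_id="emr_master_ssh",  # Airflow connection pointing at the EMR master
    command="spark-submit --master yarn /home/hadoop/goodreads_etl.py",
    dag=dag,
)
```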

    ETL jobs in the project use psycopg2 to connect to the Redshift cluster and run the staging and warehouse queries, so psycopg2 also needs to be installed on the EMR cluster.
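    For context, this is a minimal sketch of what such a staging query could look like through psycopg2; the endpoint, credentials, table, and IAM role below are placeholders rather than the project's real values:

```python
import psycopg2

# Placeholder connection details; real values come from the cluster configuration.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-west-2.redshift.amazonaws.com",
    port=5439,  # Redshift's default port
    dbname="goodreads",
    user="awsuser",
    password="example-password",
)

with conn, conn.cursor() as cur:
    # Example of a staging-style load: copy files from S3 into a staging table.
    cur.execute("""
        COPY staging.user_reviews
        FROM 's3://example-working-bucket/processed/reviews/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'
        FORMAT AS PARQUET;
    """)

conn.close()
```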

    The ETL jobs also use boto3 to move files between S3 buckets, so boto3 has to be installed as well.
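    A short sketch of an S3 'move' with boto3 (copy the object to the destination bucket, then delete the source object); the bucket and key names are invented for illustration:

```python
import boto3

s3 = boto3.resource("s3")

# Hypothetical bucket and key names; the pipeline's landing and working buckets
# are defined in its own configuration.
SOURCE_BUCKET = "goodreads-landing-bucket"
DEST_BUCKET = "goodreads-working-bucket"

def move_object(key: str) -> None:
    """Copy an object to the working bucket, then delete the original."""
    s3.Object(DEST_BUCKET, key).copy_from(
        CopySource={"Bucket": SOURCE_BUCKET, "Key": key}
    )
    s3.Object(SOURCE_BUCKET, key).delete()

move_object("landing/reviews_2020-02-27.csv")
```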

    You can follow the AWS Guide to run a Redshift cluster, or alternatively use the Redshift_Cluster_IaC.py script to create the cluster automatically.
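    In case the infrastructure-as-code route is unfamiliar, creating a cluster through boto3 looks roughly like the sketch below; the identifiers, credentials, and IAM role are placeholder values, not what the project's script actually uses:

```python
import boto3

redshift = boto3.client("redshift", region_name="us-west-2")

# Illustrative parameters only; Redshift_Cluster_IaC.py supplies its own values.
response = redshift.create_cluster(
    ClusterIdentifier="goodreads-cluster",
    NodeType="dc2.large",
    ClusterType="multi-node",
    NumberOfNodes=2,
    DBName="goodreads",
    MasterUsername="awsuser",
    MasterUserPassword="ChangeMe1234",
    IamRoles=["arn:aws:iam::123456789012:role/redshift-s3-read"],
)

print(response["Cluster"]["ClusterStatus"])
```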

    How to run

    Make sure the Airflow webserver and scheduler are running, then open the Airflow UI at http://<airflow-host>:<port>.

    GoodReads Pipeline DAG

    Screenshots in the original post show the pipeline's DAG graph view, DAG tree view, and DAG Gantt view.

    To test the pipeline, I used goodreadsfaker to generate fake GoodReads-style records and ran them through the same pipeline.
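    Purely as an illustration of that idea (the real goodreadsfaker module and its schema may look quite different), generating fake review records with the Faker library might look like this:

```python
import csv

from faker import Faker

fake = Faker()

# Field names are guesses at a goodreads-review-like record, for illustration
# only; they are not the schema used by the project's goodreadsfaker module.
def fake_review(review_id: int) -> dict:
    return {
        "review_id": review_id,
        "user_id": fake.random_int(min=1, max=1_000_000),
        "book_title": fake.sentence(nb_words=4),
        "rating": fake.random_int(min=1, max=5),
        "review_text": fake.text(max_nb_chars=200),
        "review_date": fake.date_time_this_decade().isoformat(),
    }

with open("/tmp/fake_reviews.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fake_review(0).keys())
    writer.writeheader()
    for i in range(10_000):
        writer.writerow(fake_review(i))
```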
