This project runs an ETL in AWS Glue, orchestrated with Apache Airflow. Docker Compose is used to bring up the supporting services: Airflow, Redis, and PostgreSQL. PostgreSQL stores Airflow's metadata.


Data Pipeline with Airflow, AWS Glue, Athena, S3


In this project, an architecture that integrates Redis, Apache Airflow and AWS Glue has been implemented for the orchestration of ETL (Extract, Transform and Load) processes. The configuration allows Airflow to manage and execute DAGs (Directed Acyclic Graphs) containing ETL scripts hosted on AWS Glue.

Redis is used as a messaging system to optimize communication between Airflow components, improving efficiency in task execution. In turn, Airflow orchestrates the Glue jobs, which are responsible for reading data from an S3 bucket, performing transformations such as changing date formatting and renaming columns in dataframes, and finally, saving the transformed results back to S3.
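
For illustration, here is a minimal sketch of what such a Glue job script might look like. The bucket paths and column names are placeholder assumptions; the actual scripts live in the aws_glue files of this repository.

```python
# Sketch of a Glue ETL job: read raw data from S3, normalize a date
# column, rename a column, and write the result back to S3.
# Bucket paths and column names are placeholder assumptions.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import date_format, to_date

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw CSV that was staged in S3 (placeholder path).
df = glue_context.spark_session.read.csv(
    "s3://my-raw-bucket/covid_cases.csv", header=True
)

# Example transformations: reformat a date column and rename a column.
df = df.withColumn(
    "fecha_reporte_web",
    date_format(to_date("fecha_reporte_web", "d/M/yyyy"), "yyyy-MM-dd"),
).withColumnRenamed("ciudad_municipio_nom", "city")

# Save the transformed result back to S3 (placeholder path).
df.write.mode("overwrite").parquet("s3://my-processed-bucket/covid_cases/")

job.commit()
```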

This integration enables seamless and scalable management of ETL processes, ensuring that data is processed efficiently and stored appropriately for further analysis. In addition, the use of Airflow provides visibility and control over the workflow, facilitating scheduling and monitoring of tasks.
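
On the orchestration side, a minimal DAG sketch using the Amazon provider's GlueJobOperator is shown below. The DAG id aws_glue_etl_dag matches the one in this repository, but the Glue job name, IAM role, and region are placeholder assumptions.

```python
# Minimal DAG sketch: trigger an AWS Glue job from Airflow.
# The job name, IAM role, and region are placeholder assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

with DAG(
    dag_id="aws_glue_etl_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # trigger manually from the Airflow UI
    catchup=False,
) as dag:
    run_glue_etl = GlueJobOperator(
        task_id="run_glue_etl",
        job_name="covid-etl-job",         # the job created in AWS Glue
        region_name="us-east-1",
        iam_role_name="GlueServiceRole",  # role with S3 and Glue permissions
        wait_for_completion=True,         # block until the Glue job finishes
    )
```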

The data for this project was obtained from Colombia's open data portal, from the dataset of COVID-19 positive cases. Link: https://www.datos.gov.co/Salud-y-Protecci-n-Social/Casos-positivos-de-COVID-19-en-Colombia-/gt2j-8ykr/about_data


Table of Contents

  • Overview
  • Architecture
  • Prerequisites
  • System Setup

Overview

  1. Extract data from: datos.gov.co/Salud-y-Protecci-n-Social/Casos-positivos-de-COVID-19-en-Colombia-/gt2j-8ykr/about_data
  2. Store the raw data into an S3 bucket from Airflow (a rough sketch of steps 1 and 2 follows this list).
  3. Transform the data using AWS Glue and Amazon Athena.
  4. Load the transformed data into Amazon Redshift for analytics and querying.
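
A rough sketch of steps 1 and 2, assuming the requests and boto3 libraries: datos.gov.co is a Socrata portal, so the dataset is also exposed as a JSON resource endpoint (the resource id gt2j-8ykr comes from the dataset URL above). The bucket name and object key are placeholders.

```python
# Sketch of the extract/store steps: pull open-data records and stage
# them in S3. The bucket name and key are placeholder assumptions.
import json

import boto3
import requests

# Socrata resource endpoint for the dataset (id taken from its URL);
# limited to 1000 rows for this sketch.
url = "https://www.datos.gov.co/resource/gt2j-8ykr.json"
records = requests.get(url, params={"$limit": 1000}, timeout=60).json()

# Stage the raw data in S3 so the Glue job can pick it up.
s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-raw-bucket",
    Key="raw/covid_cases.json",
    Body=json.dumps(records),
)
```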

Architecture

[Architecture diagram]

  • Datos.gov.co: Source of the data.
  • Apache Airflow: Orchestrates the ETL process and manages task distribution.
  • PostgreSQL: Temporary storage and metadata management.
  • Amazon S3: Raw data storage.
  • AWS Glue: Data cataloging and ETL jobs.
  • Amazon Athena: SQL-based data transformation.
  • Amazon Redshift: Data warehousing and analytics.

Prerequisites

  • AWS Account with appropriate permissions for S3, Glue, Athena, and Redshift.
  • Docker Installation
  • Python 3.9 or higher

System Setup

  1. Clone the repository: git clone https://github.com/wlopezm-unal/Project-airflow-AWSGlue.git
  2. Create a virtual environment: python -m venv airflow-env
  3. Activate the virtual environment: source ./airflow-env/Scripts/activate (on Linux/macOS: source ./airflow-env/bin/activate)
  4. Launch Docker Compose: docker compose up -d
  5. Create an S3 bucket.
  6. Load the dataset into S3.
  7. Define your s3:// paths for the aws_glue files.
  8. Configure an AWS Glue Crawler to read the data that is in S3.
  9. Create the ETL jobs in AWS Glue and copy in the code found in the aws_glue files of this repository (a boto3 sketch of steps 8 and 9 follows this list).
  10. Create an Airflow connection and enter your AWS credentials in Airflow.
  11. Run the DAG: aws_glue_etl_dag
  12. Configure your AWS Glue Crawler so you can view and work with the processed data using AWS Athena.
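
For reference, steps 8 and 9 can also be scripted rather than done in the console. A rough boto3 sketch, where the crawler, database, job, and bucket names as well as the IAM role are placeholder assumptions:

```python
# Rough sketch of steps 8 and 9 with boto3: create a crawler over the
# raw data and register the ETL job. All names, paths, and the IAM
# role are placeholder assumptions.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Step 8: a crawler that catalogs the raw data sitting in S3.
glue.create_crawler(
    Name="covid-raw-crawler",
    Role="GlueServiceRole",
    DatabaseName="covid_db",
    Targets={"S3Targets": [{"Path": "s3://my-raw-bucket/raw/"}]},
)
glue.start_crawler(Name="covid-raw-crawler")

# Step 9: the ETL job pointing at the script copied from the
# aws_glue files of this repository.
glue.create_job(
    Name="covid-etl-job",
    Role="GlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-scripts-bucket/etl_script.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
)
```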
