In this project, an architecture that integrates Redis, Apache Airflow, and AWS Glue has been implemented to orchestrate ETL (Extract, Transform, and Load) processes. The configuration allows Airflow to manage and execute DAGs (Directed Acyclic Graphs) that trigger ETL scripts hosted on AWS Glue.
Redis is used as a message broker to optimize communication between Airflow components, improving efficiency in task execution. Airflow, in turn, orchestrates the Glue jobs, which are responsible for reading data from an S3 bucket, applying transformations such as reformatting dates and renaming DataFrame columns, and finally saving the transformed results back to S3.
This integration enables seamless and scalable management of ETL processes, ensuring that data is processed efficiently and stored appropriately for further analysis. In addition, the use of Airflow provides visibility and control over the workflow, facilitating scheduling and monitoring of tasks.
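To make the transformation step concrete, below is a minimal sketch of what such a Glue job script might look like. The bucket paths, column names, and date format are placeholders rather than the exact values used in this repository's `aws_glue` scripts.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Standard Glue job boilerplate
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw CSV from S3 (placeholder bucket and prefix)
df = spark.read.option("header", "true").csv("s3://my-raw-bucket/covid19/")

# Example transformations: reformat a date column and rename columns
# (column names here are placeholders, not the real dataset schema)
df = (
    df.withColumn("report_date", F.to_date(F.col("fecha_reporte"), "dd/MM/yyyy"))
      .withColumnRenamed("id_de_caso", "case_id")
      .withColumnRenamed("ciudad_municipio_nom", "city")
)

# Write the transformed data back to S3 as Parquet (placeholder path)
df.write.mode("overwrite").parquet("s3://my-processed-bucket/covid19/")

job.commit()
```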
The data for this project was obtained from Colombia's open data portal, from the dataset of positive COVID-19 cases. Link: https://www.datos.gov.co/Salud-y-Protecci-n-Social/Casos-positivos-de-COVID-19-en-Colombia-/gt2j-8ykr/about_data
- Overview
- Architecture
- Prerequisites
- System Setup
## Overview

- Extract data from datos.gov.co/Salud-y-Protecci-n-Social/Casos-positivos-de-COVID-19-en-Colombia-/gt2j-8ykr/about_data
- Store the raw data into an S3 bucket from Airflow (a sketch of this extract-and-store step appears after this list).
- Transform the data using AWS Glue and Amazon Athena.
- Load the transformed data into Amazon Redshift for analytics and querying.
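A minimal sketch of the extract-and-store step is shown below, assuming the Socrata CSV export endpoint for the gt2j-8ykr dataset and a placeholder bucket name; in this project, the call would typically run inside an Airflow task.

```python
import boto3
import requests

# Socrata CSV export endpoint for the gt2j-8ykr dataset (assumed; adjust if needed)
DATASET_URL = "https://www.datos.gov.co/resource/gt2j-8ykr.csv?$limit=50000"
RAW_BUCKET = "my-raw-bucket"                      # placeholder bucket name
RAW_KEY = "covid19/raw/casos_positivos.csv"       # placeholder object key


def extract_and_store():
    """Download the raw CSV and upload it to S3 unchanged."""
    response = requests.get(DATASET_URL, timeout=120)
    response.raise_for_status()

    s3 = boto3.client("s3")
    s3.put_object(Bucket=RAW_BUCKET, Key=RAW_KEY, Body=response.content)


if __name__ == "__main__":
    extract_and_store()
```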
## Architecture

- Datos.gov.co: Source of the data.
- Apache Airflow: Orchestrates the ETL process and manages task distribution.
- PostgreSQL: Temporary storage and metadata management.
- Amazon S3: Raw data storage.
- AWS Glue: Data cataloging and ETL jobs.
- Amazon Athena: SQL-based data transformation.
- Amazon Redshift: Data warehousing and analytics (a sketch of the load into Redshift appears after this list).
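A minimal sketch of the final load into Redshift, assuming the transformed data lands in S3 as Parquet and that the target table, cluster, and IAM role already exist; all identifiers below are placeholders, not the ones used in this repository.

```python
import psycopg2

# Placeholder connection string and IAM role ARN
REDSHIFT_DSN = (
    "host=my-cluster.abc123.us-east-1.redshift.amazonaws.com "
    "port=5439 dbname=analytics user=awsuser password=changeme"
)
IAM_ROLE_ARN = "arn:aws:iam::123456789012:role/my-redshift-s3-role"

# COPY the transformed Parquet files from S3 into a Redshift table
COPY_SQL = f"""
COPY covid19_cases
FROM 's3://my-processed-bucket/covid19/'
IAM_ROLE '{IAM_ROLE_ARN}'
FORMAT AS PARQUET;
"""


def load_to_redshift():
    """Run the COPY command so Redshift ingests the transformed data."""
    with psycopg2.connect(REDSHIFT_DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(COPY_SQL)


if __name__ == "__main__":
    load_to_redshift()
```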
## Prerequisites

- AWS Account with appropriate permissions for S3, Glue, Athena, and Redshift.
- Docker Installation
- Python 3.9 or higher
## System Setup

- Clone the repository: `git clone https://github.com/wlopezm-unal/Project-airflow-AWSGlue.git`
- Create a virtual environment: `python -m venv airflow-env`
- Activate the virtual environment: `source ./airflow-env/Scripts/activate`
- Launch Docker Compose: `docker compose up -d`
- Create an S3 bucket.
- Load the dataset into S3.
- Define your `s3://` paths for the AWS Glue files.
- Configure an AWS Glue crawler to read the raw data stored in S3 (a crawler sketch appears after this list).
- Create the ETL jobs in AWS Glue and copy the code that you can find in the aws_glue files of this repository.
- Create an Airflow connection and insert your AWS credentials in Airflow.
- Run the DAG: `aws_glue_etl_dag` (a minimal DAG sketch appears after this list).
- Configure a crawler in AWS Glue over the processed data so you can explore and work with it using AWS Athena (an example query appears after this list).
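A minimal sketch of creating and starting a Glue crawler with boto3; the database, IAM role, and S3 path are placeholders, and the same pattern applies to both the raw-data and processed-data crawlers.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # adjust the region as needed

# Placeholder names: database, IAM role, and S3 path are assumptions
glue.create_crawler(
    Name="covid19-raw-crawler",
    Role="arn:aws:iam::123456789012:role/my-glue-role",
    DatabaseName="covid19_db",
    Targets={"S3Targets": [{"Path": "s3://my-raw-bucket/covid19/"}]},
)

# Run the crawler so the table schema is registered in the Glue Data Catalog
glue.start_crawler(Name="covid19-raw-crawler")
```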
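The DAG that triggers the Glue job can be as simple as the sketch below; the exact DAG in this repository may differ, and the connection ID, Glue job name, and region are placeholders (the `apache-airflow-providers-amazon` package is assumed to be installed).

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

with DAG(
    dag_id="aws_glue_etl_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,   # trigger manually from the Airflow UI
    catchup=False,
) as dag:
    run_glue_job = GlueJobOperator(
        task_id="run_covid19_transform",
        job_name="covid19-transform-job",   # name of the Glue job created in the console
        aws_conn_id="aws_default",          # Airflow connection holding the AWS credentials
        region_name="us-east-1",
        wait_for_completion=True,
    )
```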
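Once the processed-data crawler has run, the resulting table can be queried from Athena, for example with boto3 as sketched below; the database, table, and results location are placeholders.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Placeholder Glue Data Catalog database/table and an S3 path for query results
response = athena.start_query_execution(
    QueryString="SELECT city, COUNT(*) AS cases FROM covid19_processed GROUP BY city LIMIT 10;",
    QueryExecutionContext={"Database": "covid19_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print("Query execution id:", response["QueryExecutionId"])
```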